Refactor and add benchmarks #330


Merged: 3 commits into pytorch:main on Nov 4, 2024

Conversation

@scotts scotts commented Nov 2, 2024

This PR adds new benchmarks, changes existing benchmarks, and refactors the benchmark code itself. More specifically:

  1. Refactors the benchmark code and, most obviously, renames many of the decoders. The names in the code now better match what we call them in conversation, which I think makes both the code and the output more readable.
  2. Adds an experiment using random timestamps. The three experiments are now: random timestamps, uniform timestamps, and decode the first 100 frames.
  3. Adds a new decoder kind in the experiments: TorchCodecPublic. All of the pre-existing TorchCodec decoder kinds now start with TorchCodecCore. The naming is, I hope, obvious: one kind uses the public API, the other directly uses the core API.

Running the benchmarks with the command:

python benchmarks/decoders/benchmark_decoders.py

On my dev machine yields:

[--------- video=/home/scottas/github/torchcodec/benchmarks/decoders/../../test/resources/nasa_13013.mp4 h264 480x270, 13.013s 29.97002997002997fps ---------]
                                         |  uniform 10 seek()+next()  |  random 10 seek()+next()  |  1 next()  |  10 next()  |  100 next()  |  create()+next()
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------
      DecordNonBatchDecoderAccurateSeek  |            55.1            |           119.1           |    12.7    |     18.4    |     62.0     |                 
      TorchVision[backend=video_reader]  |           349.0            |           327.3           |    12.7    |     16.0    |     44.2     |                 
      TorchAudioDecoder                  |           467.3            |           513.7           |     8.0    |     15.3    |     73.1     |                 
      TorchCodecCore:                    |            48.9            |           113.7           |     9.6    |     12.8    |     44.8     |        9.6      
      TorchCodecCore:num_threads=1       |           108.9            |           259.7           |     6.9    |     11.9    |     70.0     |                 
      TorchCodecCoreCompiled             |            47.8            |           110.1           |     9.7    |     13.3    |     49.3     |                 
      TorchCodecCoreBatch                |            49.3            |            43.3           |    11.5    |     14.5    |     61.8     |                 
      TorchCodecPublic                   |            50.7            |            45.2           |    11.8    |     15.0    |     48.1     |                 

I think we have an opportunity to improve performance here.

The current implementation of iteration for VideoDecoder uses the indexing API (__getitem__()). That is, we don't implement our own iterator; Python's legacy iteration protocol just calls __getitem__() with increasing indices for us, which means every frame access is a seek. If we implement our own iterator, we could do the same thing TorchCodecCore does and get better performance from the public API.
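To make the fallback concrete, here's a minimal sketch (the class names are hypothetical, not TorchCodec's) showing that iterating an object with only __getitem__ goes through one indexed access per frame, while defining __iter__ bypasses that path entirely:

```python
class GetitemDecoder:
    """Relies on Python's legacy iteration protocol: iter() repeatedly
    calls __getitem__(0), __getitem__(1), ... until IndexError.
    In the video case, each indexed access is an independent seek."""

    def __init__(self, frames):
        self._frames = frames
        self.getitem_calls = 0

    def __getitem__(self, i):
        self.getitem_calls += 1
        return self._frames[i]  # raises IndexError past the end


class IterDecoder(GetitemDecoder):
    """Defines __iter__ explicitly, so iteration walks sequentially and
    never touches __getitem__ (no per-frame seek)."""

    def __iter__(self):
        return iter(self._frames)


seeking = GetitemDecoder([1, 2, 3])
list(seeking)
print(seeking.getitem_calls)  # 4: three frames plus the IndexError probe

sequential = IterDecoder([1, 2, 3])
list(sequential)
print(sequential.getitem_calls)  # 0: __getitem__ was never consulted
```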

Also note that the README data and charts now use the public API, not the core. I think a principle we should stick to is that when we talk externally about performance, we should always talk about the performance of the public API.

Finally, #329 should merge before this PR. Then this PR should rebase on main and I need to rerun the script to generate the graph.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Nov 2, 2024
@scotts scotts marked this pull request as ready for review November 2, 2024 03:53
@scotts scotts requested a review from ahmadsharif1 November 2, 2024 03:53
print(
    f"video={video_file_path}, decoder={decoder_name}, pts_list={pts_list}"
)

random_pts_list = (torch.rand(num_samples) * duration).tolist()
@ahmadsharif1 ahmadsharif1 (Contributor) Nov 4, 2024:

What's the random seed here? Is this the source of the benchmark run time variation?

Is it the same for all decoders at least?

scotts (Contributor Author) replied:

Oh, good point. We're generating a different random list for each decoder. We should fix that. I can also set the seed to make it deterministic. Maybe in the future that should be a benchmark parameter.
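The fix being discussed could look like the sketch below (variable names follow the snippet above; the commented-out run_benchmark call is a placeholder, not the actual benchmark code): seed once, build the pts list once, and reuse it for every decoder so they are all measured on identical timestamps.

```python
import torch

num_samples = 10
duration = 13.013  # seconds, matching the benchmark video

# Seed once so runs are reproducible, and build the list once so every
# decoder sees the same timestamps.
torch.manual_seed(0)
random_pts_list = (torch.rand(num_samples) * duration).tolist()

# for name, decoder in decoders.items():
#     run_benchmark(decoder, pts_list=random_pts_list)  # hypothetical
```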

df_item["fps_p25"] = 1.0 * num_samples / results[-1]._p25
df_data.append(df_item)

for num_consecutive_nexts in [100]:
Contributor:

I think we do need 1 here, or at least the ability to pass in 1. (Maybe we don't need 1 for the README, but this lib should be able to accept 1 here somehow.)

Time to first frame is a useful metric for devs to track

Contributor Author:

Good point. I think we'll want to make that configurable, at least as input to run_benchmark(). I'll do that.
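One way to make this configurable (the function and parameter names below are assumptions, not the actual benchmark code) is to accept the list of consecutive-next counts as an argument, with 1 included by default so time-to-first-frame is always available:

```python
import itertools
import time


def time_consecutive_nexts(decoder_factory, n):
    """Time decoding the first n frames via repeated next() calls."""
    frames = decoder_factory()
    start = time.perf_counter()
    for _ in range(n):
        next(frames)
    return time.perf_counter() - start


def run_benchmark(decoder_factories, consecutive_next_counts=(1, 10, 100)):
    """Keeping 1 in the defaults makes time-to-first-frame measurable."""
    return {
        (name, n): time_consecutive_nexts(factory, n)
        for name, factory in decoder_factories.items()
        for n in consecutive_next_counts
    }


# Usage with a stand-in "decoder" (an infinite iterator of fake frames):
results = run_benchmark({"fake": lambda: itertools.count()})
```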

elif decoder == "torchvision":
-    decoder_dict["TVNewAPIDecoderWithBackendVideoReader"] = (
+    decoder_dict["TorchVision[backend=video_reader]"] = (
Contributor:

@NicolasHug can tell you about torchvision's APIs and what the name should be for this. This is a specific API -- that I have heard him say it's the new API

Member:

The naming in torchvision is awful. Basically, the VideoReader class and the "video_reader" backend are two orthogonal concepts with the same name.

There are 2 main Python decoding APIs.

  • read_video() in io/video.py.
  • VideoReader(). It is considered the more "fine-grained" and "new" API (although it's not that new at this point, just more recent).

Both APIs support the pyav and "video_reader" backends, where the "video_reader" backend just means the torchvision CPU backend. VideoReader() also supposedly supports the "cuda" backend.

Contributor Author:

In the benchmark, we're calling the VideoReader class and providing it with the "video_reader" backend. I think we should name the TorchVision decoder after the config option that makes the most difference for performance, and I think the backend matters more here.

Member:

yes, I agree.

@ahmadsharif1 ahmadsharif1 (Contributor):

Can you add the CLI you used to generate this data too?

Also, it's weird to see the core being 2x slower than the public API for the random seek -- maybe it's because the random pts list is different for every decoder? Or we aren't running the benchmarks for long enough?


scotts commented Nov 4, 2024

Benchmarks were run with:

python benchmarks/decoders/benchmark_decoders.py

Note that I changed the defaults of what benchmarks run when there are no options provided. Let me fix the call to random, and then we can look at the numbers again. I did increase the runtime for the README chart, but not for benchmark_decoders.

@ahmadsharif1 ahmadsharif1 (Contributor):

> Benchmarks were run with:
>
> python benchmarks/decoders/benchmark_decoders.py

Can you put this in the PR description just before the table so git log shows it?


scotts commented Nov 4, 2024

I updated the description with how I ran the benchmarks and the top result from this batch. These are the results of calling the benchmarks four consecutive times:

[--------- video=/home/scottas/github/torchcodec/benchmarks/decoders/../../test/resources/nasa_13013.mp4 h264 480x270, 13.013s 29.97002997002997fps ---------]
                                         |  uniform 10 seek()+next()  |  random 10 seek()+next()  |  1 next()  |  10 next()  |  100 next()  |  create()+next()
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------
      DecordNonBatchDecoderAccurateSeek  |            55.1            |           119.1           |    12.7    |     18.4    |     62.0     |                 
      TorchVision[backend=video_reader]  |           349.0            |           327.3           |    12.7    |     16.0    |     44.2     |                 
      TorchAudioDecoder                  |           467.3            |           513.7           |     8.0    |     15.3    |     73.1     |                 
      TorchCodecCore:                    |            48.9            |           113.7           |     9.6    |     12.8    |     44.8     |        9.6      
      TorchCodecCore:num_threads=1       |           108.9            |           259.7           |     6.9    |     11.9    |     70.0     |                 
      TorchCodecCoreCompiled             |            47.8            |           110.1           |     9.7    |     13.3    |     49.3     |                 
      TorchCodecCoreBatch                |            49.3            |            43.3           |    11.5    |     14.5    |     61.8     |                 
      TorchCodecPublic                   |            50.7            |            45.2           |    11.8    |     15.0    |     48.1     |                 


[--------- video=/home/scottas/github/torchcodec/benchmarks/decoders/../../test/resources/nasa_13013.mp4 h264 480x270, 13.013s 29.97002997002997fps ---------]
                                         |  uniform 10 seek()+next()  |  random 10 seek()+next()  |  1 next()  |  10 next()  |  100 next()  |  create()+next()
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------
      DecordNonBatchDecoderAccurateSeek  |            56.0            |           122.6           |    12.9    |     18.6    |     62.0     |                 
      TorchVision[backend=video_reader]  |           343.7            |           345.0           |    12.6    |     15.7    |     50.2     |                 
      TorchAudioDecoder                  |           498.7            |           560.0           |     8.0    |     14.0    |     83.9     |                 
      TorchCodecCore:                    |            48.9            |           111.9           |     9.5    |     12.8    |     59.1     |        9.6      
      TorchCodecCore:num_threads=1       |           111.2            |           263.4           |     6.9    |     12.0    |     72.6     |                 
      TorchCodecCoreCompiled             |            48.9            |           114.0           |     9.7    |     13.5    |     69.5     |                 
      TorchCodecCoreBatch                |            50.1            |            45.0           |    11.5    |     14.4    |     62.7     |                 
      TorchCodecPublic                   |            50.4            |            46.1           |    11.6    |     15.1    |     60.9     |                 

[--------- video=/home/scottas/github/torchcodec/benchmarks/decoders/../../test/resources/nasa_13013.mp4 h264 480x270, 13.013s 29.97002997002997fps ---------]
                                         |  uniform 10 seek()+next()  |  random 10 seek()+next()  |  1 next()  |  10 next()  |  100 next()  |  create()+next()
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------
      DecordNonBatchDecoderAccurateSeek  |            52.8            |           117.8           |    12.9    |     16.3    |     68.8     |                 
      TorchVision[backend=video_reader]  |           361.6            |           337.1           |    12.8    |     16.0    |     51.5     |                 
      TorchAudioDecoder                  |           472.7            |           518.9           |     8.1    |     15.6    |     84.5     |                 
      TorchCodecCore:                    |            48.1            |           109.3           |     9.6    |     12.8    |     59.8     |        9.6      
      TorchCodecCore:num_threads=1       |           113.6            |           260.9           |     6.9    |     11.9    |     70.7     |                 
      TorchCodecCoreCompiled             |            48.4            |           109.2           |     9.7    |     13.4    |     64.8     |                 
      TorchCodecCoreBatch                |            49.1            |            43.7           |    11.5    |     14.6    |     61.9     |                 
      TorchCodecPublic                   |            50.1            |            45.0           |    11.6    |     14.9    |     59.4     |                 

[--------- video=/home/scottas/github/torchcodec/benchmarks/decoders/../../test/resources/nasa_13013.mp4 h264 480x270, 13.013s 29.97002997002997fps ---------]
                                         |  uniform 10 seek()+next()  |  random 10 seek()+next()  |  1 next()  |  10 next()  |  100 next()  |  create()+next()
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------
      DecordNonBatchDecoderAccurateSeek  |            54.9            |           121.5           |    12.8    |     18.1    |     69.5     |                 
      TorchVision[backend=video_reader]  |           351.3            |           325.7           |    12.7    |     15.8    |     46.6     |                 
      TorchAudioDecoder                  |           463.7            |           521.3           |     8.0    |     14.0    |     84.8     |                 
      TorchCodecCore:                    |            47.4            |           110.4           |     9.6    |     12.7    |     58.5     |        9.5      
      TorchCodecCore:num_threads=1       |           112.1            |           271.0           |     6.9    |     11.9    |     69.0     |                 
      TorchCodecCoreCompiled             |            48.2            |           112.2           |     9.8    |     13.3    |     63.7     |                 
      TorchCodecCoreBatch                |            49.2            |            44.3           |    11.6    |     14.7    |     63.4     |                 
      TorchCodecPublic                   |            49.4            |            43.8           |    11.6    |     14.8    |     59.6     |                 

The only major anomaly I see is how variable across runs TorchCodecCore and TorchCodecPublic are for 100 next. Curiously, they're both behaving about the same within a run.

@ahmadsharif1 ahmadsharif1 (Contributor) left a comment:

You can double-check whether we are seeking or not by dumping decoderStats in C++.

I don't think we should be seeking, especially after scanning the file.

Without scanning, we could accidentally seek sometimes.

@scotts scotts merged commit 6028420 into pytorch:main Nov 4, 2024
37 of 40 checks passed
@scotts scotts deleted the exp_refactor branch November 4, 2024 20:54