Feature request for `transformers` use-cases #673

zucchini-nlp · 2025-05-09T10:42:03Z

🚀 The feature

Hi 👋

First of all, huge thanks to you and the team, the latest torchcodec release with audio support is fantastic! It's a long-awaited feature

I'm the maintainer of multimodal models in transformers and I'm thinking of using torchcodec to load multimodal data for MLLMs. Looking forward for a stable version to be released. For now, I’ve been testing the latest release and noticed a few points that might be useful to consider for future support.

1. Mono channel audio support: Some audio models (like Whisper from Hugging Face) only support mono-channel input. It would be helpful if audio loading allowed channel selection or optionally converted stereo to mono.
2. Fallback for video files with no audio: Loading audio from a video file that has no audio stream currently raises an error. A more flexible behavior would be to return None, similar to how moviepy handles it, so that it can be checked with `if clip.audio is not None`.
3. Loading from URL: Loading audio/video from URLs seems to work for some of the URLs I have tested, though I couldn’t find in the docs whether URL input is officially supported. I hope it will be officially supported in the stable release.
4. Video decoder issues with avi format: When loading avi files, the decoder fails to infer the duration and related metadata, which prevents sampling frames by seconds. Loading the same video saved as mp4 resolves the issue. You can try this video as an example.

Let me know if you'd like me to file any of these separately or provide reproducible examples. Thanks again for the awesome work!

Motivation, pitch

No response


NicolasHug · 2025-05-12T12:51:23Z

Hi @zucchini-nlp

Thanks a lot for the great feedback!

1. I've opened "Allow user to choose num_channels in AudioDecoder" (#675) to keep track of this feature. I think we should be able to implement it for the next release.
2. As you noted, TorchCodec raises an error instead of returning None. We're going by the principle that TorchCodec should make it very obvious when something goes wrong. Instead of using `if audio is not None`, users can catch errors within a try/except statement, which is probably just a matter of taste.
3. We have now updated the docstrings of the AudioDecoder and VideoDecoder, and both should indicate that URLs are supported. Let us know if you encounter issues with some specific URLs!
4. Thank you for sharing the video! I've opened 2 follow-up issues on this: "Approximate mode fails on video" (#677) and "ZeroDivisionError when accessing metadata" (#676). To unblock you, if what you need is just to decode some frames, I think you should be able to decode them by using `seek_mode="exact"`. If you need to access metadata, use `seek_mode="approximate"`. There is definitely something wrong in the way TorchCodec is handling this video. It's probably related to the metadata, so we'll investigate a bit more. I think the reason it works when you re-encode into mp4 is that the encoding is able to "fix" the metadata problem, i.e. the issue is more related to the metadata of these specific videos than to the format itself.
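
For point 2, the try/except pattern can be wrapped in a small helper that restores a `None`-style check on the caller's side. This is only a minimal sketch, not part of TorchCodec: `load_audio_or_none` and its `decoder_factory` argument are hypothetical names, and because the thread does not say which exception type is raised for a missing audio stream, catching `RuntimeError` and `ValueError` is an assumption.

```python
from typing import Any, Callable, Optional


def load_audio_or_none(
    source: str,
    decoder_factory: Optional[Callable[[str], Any]] = None,
) -> Optional[Any]:
    """Return decoded audio samples, or None if `source` cannot be decoded as audio."""
    if decoder_factory is None:
        # Imported lazily so the helper can be exercised without torchcodec installed.
        from torchcodec.decoders import AudioDecoder

        decoder_factory = AudioDecoder

    try:
        decoder = decoder_factory(source)
        # With the default arguments this decodes the whole audio stream.
        return decoder.get_samples_played_in_range()
    except (RuntimeError, ValueError):
        # Assumption: a file without an audio stream surfaces as one of these.
        return None
```

A caller could then write `audio = load_audio_or_none("clip.mp4")` followed by `if audio is not None: ...`. The trade-off is that the helper also hides other decoding failures of the same exception types, which is exactly what the raise-by-default behavior is meant to avoid.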

zucchini-nlp · 2025-05-12T13:02:59Z

Thanks a lot @NicolasHug ! Looking forward to future releases 🤗

NicolasHug · 2025-05-14T12:42:48Z

Quick update @zucchini-nlp , I'm working on mono/stereo conversion in #678.

Regarding point 4 and the problematic video: I can confirm that the video itself isn't correctly encoded: all frames and packets are specified with a pts of -9223372036854775808, which is INT64_MIN. That's why things start working when you re-encode the video: the second encoder is able to set the pts to valid values. I suspect that it would still work if you were to re-encode into avi, instead of mp4.

I'll try to see what TorchCodec can do to smoothly handle such videos.

In the meantime, I'm curious how blocking this is for you right now. Do you have a lot of such poorly encoded videos? Do you absolutely need them to be decoded as-is, or is re-encoding an option?

Thanks!

zucchini-nlp · 2025-05-14T13:15:11Z

Great, thanks!

> I'll try to see what TorchCodec can do to smoothly handle such videos.

This sounds good. It would be nice to get an informative error, or perhaps have the duration set to 0 so that the other metadata can still be obtained. For me it is currently not a blocker, since we're still using two different libraries to decode audio (librosa) and video (pyav). This video is from one of the many video LLM repos and was listed on their demo page, so we were just lucky to find a poorly encoded example 😄

> Do you absolutely need them to be decoded as-is, or is re-encoding an option?

There is no real need if decoding is not possible and the video is corrupted. An error we can catch, or something similar, will be enough.

NicolasHug · 2025-05-15T14:00:32Z

I've got some good news @zucchini-nlp , we found a way to properly decode the video you linked to. The PTS info is missing, but we can fall back to DTS values (which, in that video, were correctly set). I hope it will address the other videos you had issues with (if not, let us know!)
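
The fallback idea can be illustrated in a few lines. This is a sketch of the general technique only, not TorchCodec's actual implementation: `effective_timestamp` is a hypothetical helper, and the only fact taken from outside the thread is that FFmpeg's "no timestamp" sentinel, `AV_NOPTS_VALUE`, equals INT64_MIN, the value reported for every frame of the problematic video.

```python
from typing import Optional

# FFmpeg's AV_NOPTS_VALUE sentinel is INT64_MIN (-9223372036854775808).
AV_NOPTS_VALUE = -(2**63)


def effective_timestamp(pts: int, dts: int) -> Optional[int]:
    """Prefer PTS; fall back to DTS when PTS is unset; None when both are unset."""
    if pts != AV_NOPTS_VALUE:
        return pts
    if dts != AV_NOPTS_VALUE:
        return dts
    return None
```

Note that DTS and PTS can legitimately differ in streams with reordered frames, so a fallback like this gives decode-order timestamps and is only a reasonable substitute when PTS is genuinely absent.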

BTW, about this:

> Looking forward for a stable version to be released

We used to have some notice indicating that some APIs could change, but we consider the public APIs to be very stable now. It's extremely unlikely that we'll be changing public stuff (other than for major bug-fixes), so please feel free to rely on the public APIs, they're stable.

I'll be pushing a new release (TorchCodec 0.4) in the coming days with:

- The fix for the video
- A new num_channels parameter to AudioDecoder.

We're very excited that you consider TorchCodec for transformers, let us know if there's anything else we can help with!
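
As a rough illustration of how the new parameter could serve mono-only models such as Whisper, here is a minimal sketch. `load_mono_audio` and its `decoder_cls` argument are hypothetical names introduced for this example; the `num_channels` keyword is the one announced above, and the sketch assumes that `get_samples_played_in_range()` called with its defaults decodes the full stream.

```python
from typing import Any, Optional


def load_mono_audio(source: str, decoder_cls: Optional[Any] = None) -> Any:
    """Decode all audio from `source`, asking the decoder for a single channel."""
    if decoder_cls is None:
        # Imported lazily so the helper can be exercised without torchcodec installed.
        from torchcodec.decoders import AudioDecoder

        decoder_cls = AudioDecoder

    # num_channels=1 requests mono output regardless of the source layout.
    decoder = decoder_cls(source, num_channels=1)
    return decoder.get_samples_played_in_range()
```

The `decoder_cls` hook exists only so the helper can be tested with a stand-in class; in normal use it would be called as `load_mono_audio("speech.wav")`.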

zucchini-nlp · 2025-05-15T14:13:18Z

Great news @NicolasHug ! Then I will try to integrate TorchCodec for the next release, which we'll have in about 1 month. Thanks a lot for the fix, I will let you know if we need any more features/fixes 😄

NicolasHug · 2025-05-16T09:28:51Z

I just published 0.4 with the improvements mentioned above: https://github.com/pytorch/torchcodec/releases

I'll close the issue, thank you for your feedback and keep us updated on the transformers integration if there are any issues!

NicolasHug mentioned this issue May 12, 2025

Allow user to choose num_channels in AudioDecoder #675

Closed

NicolasHug mentioned this issue May 15, 2025

Fallback to DTS if PTS info doesn't exist #683

Merged

NicolasHug closed this as completed May 16, 2025
