
AudioDecoder: specify desired num_channels #678


Merged: 6 commits, May 15, 2025
10 changes: 6 additions & 4 deletions src/torchcodec/_core/Encoder.cpp
@@ -293,18 +293,20 @@ void AudioEncoder::encodeInnerLoop(
   if (mustConvert) {
     if (!swrContext_) {
       swrContext_.reset(createSwrContext(
-          avCodecContext_,
           AV_SAMPLE_FMT_FLTP,
           avCodecContext_->sample_fmt,
           srcAVFrame->sample_rate, // No sample rate conversion
-          srcAVFrame->sample_rate));
+          srcAVFrame->sample_rate,
+          srcAVFrame,
+          getNumChannels(srcAVFrame) // No num_channel conversion
+          ));
     }
-    convertedAVFrame = convertAudioAVFrameSampleFormatAndSampleRate(
+    convertedAVFrame = convertAudioAVFrameSamples(
@NicolasHug (Member, Author) commented on May 14, 2025:

I renamed the function because it now converts too many things to list them all in the name. @scotts I also remember that you wanted to align all the sourceStuff to srcStuff, which I'll do as a quick follow-up!

         swrContext_,
         srcAVFrame,
         avCodecContext_->sample_fmt,
         srcAVFrame->sample_rate, // No sample rate conversion
-        srcAVFrame->sample_rate);
+        getNumChannels(srcAVFrame)); // No num_channel conversion
     TORCH_CHECK(
         convertedAVFrame->nb_samples == srcAVFrame->nb_samples,
         "convertedAVFrame->nb_samples=",
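The `swrContext_` created above now also carries a channel-count conversion. As a rough illustration of what such a conversion does, here is a sketch of the two simplest cases only; libswresample uses proper mixing matrices derived from the channel layouts, and all names below are hypothetical, not torchcodec or FFmpeg APIs:

```python
# Illustrative sketch only: what a num_channels conversion means for audio
# samples. Real conversions are done by libswresample (swr_convert).

def convert_num_channels(samples, desired_num_channels):
    """samples: list of per-channel sample lists, all the same length."""
    src_num_channels = len(samples)
    if src_num_channels == desired_num_channels:
        # No conversion needed; mirrors the mustConvert check in the C++ code.
        return samples
    if src_num_channels == 2 and desired_num_channels == 1:
        # Stereo -> mono: average left and right.
        return [[(l + r) / 2 for l, r in zip(samples[0], samples[1])]]
    if src_num_channels == 1 and desired_num_channels == 2:
        # Mono -> stereo: duplicate the single channel.
        return [list(samples[0]), list(samples[0])]
    raise ValueError("this sketch only covers mono<->stereo")
```

For anything beyond mono/stereo, FFmpeg's default mixing matrices weight channels by layout (front, rear, LFE, etc.), which is why the real code works with channel layouts rather than bare counts.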
82 changes: 66 additions & 16 deletions src/torchcodec/_core/FFMPEGCommon.cpp
@@ -81,7 +81,6 @@ void setDefaultChannelLayout(
   AVChannelLayout channel_layout;
   av_channel_layout_default(&channel_layout, numChannels);
   avCodecContext->ch_layout = channel_layout;
-
 #else
   uint64_t channel_layout = av_get_default_channel_layout(numChannels);
   avCodecContext->channel_layout = channel_layout;
@@ -106,32 +105,79 @@ void setChannelLayout(
 #endif
 }

+namespace {
+#if LIBAVFILTER_VERSION_MAJOR > 7 // FFmpeg > 4
+
+// Returns:
+// - the srcAVFrame's channel layout if srcAVFrame has desiredNumChannels
+// - the default channel layout with desiredNumChannels otherwise.
+AVChannelLayout getDesiredChannelLayout(
+    int desiredNumChannels,
+    const UniqueAVFrame& srcAVFrame) {
+  AVChannelLayout desiredLayout;
+  if (desiredNumChannels == getNumChannels(srcAVFrame)) {
+    desiredLayout = srcAVFrame->ch_layout;
+  } else {
+    av_channel_layout_default(&desiredLayout, desiredNumChannels);
+  }
+  return desiredLayout;
+}
+
+#else
+
+// Same as above
+int64_t getDesiredChannelLayout(
+    int desiredNumChannels,
+    const UniqueAVFrame& srcAVFrame) {
+  int64_t desiredLayout;
+  if (desiredNumChannels == getNumChannels(srcAVFrame)) {
+    desiredLayout = srcAVFrame->channel_layout;
+  } else {
+    desiredLayout = av_get_default_channel_layout(desiredNumChannels);
+  }
+  return desiredLayout;
+}
+#endif
+} // namespace
+// Sets dstAVFrame's channel layout to getDesiredChannelLayout(): see doc above
 void setChannelLayout(
     UniqueAVFrame& dstAVFrame,
-    const UniqueAVFrame& srcAVFrame) {
+    const UniqueAVFrame& srcAVFrame,
+    int desiredNumChannels) {
 #if LIBAVFILTER_VERSION_MAJOR > 7 // FFmpeg > 4
-  dstAVFrame->ch_layout = srcAVFrame->ch_layout;
+  AVChannelLayout desiredLayout =
+      getDesiredChannelLayout(desiredNumChannels, srcAVFrame);
+  auto status = av_channel_layout_copy(&dstAVFrame->ch_layout, &desiredLayout);
+  TORCH_CHECK(
+      status == AVSUCCESS,
+      "Couldn't copy channel layout to avFrame: ",
+      getFFMPEGErrorStringFromErrorCode(status));
 #else
-  dstAVFrame->channel_layout = srcAVFrame->channel_layout;
+  dstAVFrame->channel_layout =
+      getDesiredChannelLayout(desiredNumChannels, srcAVFrame);
+  dstAVFrame->channels = desiredNumChannels;
 #endif
 }

 SwrContext* createSwrContext(
-    UniqueAVCodecContext& avCodecContext,
     AVSampleFormat sourceSampleFormat,
     AVSampleFormat desiredSampleFormat,
     int sourceSampleRate,
-    int desiredSampleRate) {
+    int desiredSampleRate,
+    const UniqueAVFrame& srcAVFrame,
+    int desiredNumChannels) {
@NicolasHug (Member, Author) commented:

I slightly changed the signature to take an AVFrame instead of the AVCodecContext: this is where we get the source num_channels and the source channel layout. It's more accurate to get them from the AVFrame, although both should always be the same.

A contributor replied:


Nit: I prefer for the primary FFmpeg data structures to appear first in a function. I think of these functions as if they were methods on those structures, and that gets reinforced by making them the first parameter. I think I understand the current order, where srcAVFrame is placed next to desiredNumChannels because you're getting the source number of channels from srcAVFrame. But given what we're doing, I think of this function as "Create a swrContext from this AVFrame with these desired parameters."

@NicolasHug (Member, Author) replied:


I agree that for functions that look like methods, we want the FFmpeg struct to be the first. The reason I removed the avCodecContext from the top and now use an AVFrame is precisely because this function shouldn't read as a method.

> I think I understand the current order, where srcAVFrame is placed next to desiredNumChannels because you're getting the source number of channels from srcAVFrame

Yes, you're correct. Ideally, we shouldn't need to pass the AVFrame at all; we should just pass its AVChannelLayout as sourceChannelLayout. But unfortunately, AVChannelLayout only exists from FFmpeg 5. That's the main reason we're passing an AVFrame: it lets us call createSwrContext() uniformly across FFmpeg versions and contain the preprocessor #ifdef directives within FFMPEGCommon.cpp.

Does that make sense?

Basically: we're passing an AVFrame because of an FFmpeg version constraint, but really this should just be a "source channel layout" parameter, and thus this isn't a method on an AVFrame at all.

   SwrContext* swrContext = nullptr;
   int status = AVSUCCESS;
 #if LIBAVFILTER_VERSION_MAJOR > 7 // FFmpeg > 4
-  AVChannelLayout layout = avCodecContext->ch_layout;
+  AVChannelLayout desiredLayout =
+      getDesiredChannelLayout(desiredNumChannels, srcAVFrame);
   status = swr_alloc_set_opts2(
       &swrContext,
-      &layout,
+      &desiredLayout,
       desiredSampleFormat,
       desiredSampleRate,
-      &layout,
+      &srcAVFrame->ch_layout,
       sourceSampleFormat,
       sourceSampleRate,
       0,
@@ -142,13 +188,14 @@ SwrContext* createSwrContext(
       "Couldn't create SwrContext: ",
       getFFMPEGErrorStringFromErrorCode(status));
 #else
-  int64_t layout = static_cast<int64_t>(avCodecContext->channel_layout);
+  int64_t desiredLayout =
+      getDesiredChannelLayout(desiredNumChannels, srcAVFrame);
   swrContext = swr_alloc_set_opts(
       nullptr,
-      layout,
+      desiredLayout,
       desiredSampleFormat,
       desiredSampleRate,
-      layout,
+      srcAVFrame->channel_layout,
       sourceSampleFormat,
       sourceSampleRate,
       0,
@@ -167,20 +214,21 @@
   return swrContext;
 }

-UniqueAVFrame convertAudioAVFrameSampleFormatAndSampleRate(
+UniqueAVFrame convertAudioAVFrameSamples(
     const UniqueSwrContext& swrContext,
     const UniqueAVFrame& srcAVFrame,
     AVSampleFormat desiredSampleFormat,
-    int sourceSampleRate,
-    int desiredSampleRate) {
+    int desiredSampleRate,
+    int desiredNumChannels) {
   UniqueAVFrame convertedAVFrame(av_frame_alloc());
   TORCH_CHECK(
       convertedAVFrame,
       "Could not allocate frame for sample format conversion.");

-  setChannelLayout(convertedAVFrame, srcAVFrame);
   convertedAVFrame->format = static_cast<int>(desiredSampleFormat);
-
   convertedAVFrame->sample_rate = desiredSampleRate;
+  int sourceSampleRate = srcAVFrame->sample_rate;
   if (sourceSampleRate != desiredSampleRate) {
     // Note that this is an upper bound on the number of output samples.
     // `swr_convert()` will likely not fill convertedAVFrame with that many
@@ -200,6 +248,8 @@ UniqueAVFrame convertAudioAVFrameSampleFormatAndSampleRate(
     convertedAVFrame->nb_samples = srcAVFrame->nb_samples;
   }

+  setChannelLayout(convertedAVFrame, srcAVFrame, desiredNumChannels);
+
   auto status = av_frame_get_buffer(convertedAVFrame.get(), 0);
   TORCH_CHECK(
       status == AVSUCCESS,
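The decision rule of the new `getDesiredChannelLayout()` helper above can be sketched in a few lines. This is a hedged illustration only: the real code returns an `AVChannelLayout` struct or a `uint64` layout mask depending on the FFmpeg version, while layouts are plain strings here, and `default_channel_layout` stands in for `av_channel_layout_default()` / `av_get_default_channel_layout()`:

```python
# Sketch of the layout-selection rule: keep the source frame's layout when
# the channel count already matches, otherwise fall back to FFmpeg's
# default layout for the requested count.

def default_channel_layout(num_channels):
    # Stand-in for FFmpeg's default-layout lookup (illustrative values).
    return {1: "mono", 2: "stereo"}.get(num_channels, f"{num_channels} channels")

def get_desired_channel_layout(desired_num_channels, src_num_channels, src_layout):
    if desired_num_channels == src_num_channels:
        return src_layout
    return default_channel_layout(desired_num_channels)
```

Preserving the source layout when counts match avoids clobbering a non-default layout (e.g. a custom channel order) with the default one for that count.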
22 changes: 15 additions & 7 deletions src/torchcodec/_core/FFMPEGCommon.h
@@ -157,20 +157,28 @@ void setChannelLayout(

 void setChannelLayout(
     UniqueAVFrame& dstAVFrame,
-    const UniqueAVFrame& srcAVFrame);
+    const UniqueAVFrame& srcAVFrame,
+    int desiredNumChannels);

 SwrContext* createSwrContext(
-    UniqueAVCodecContext& avCodecContext,
     AVSampleFormat sourceSampleFormat,
     AVSampleFormat desiredSampleFormat,
     int sourceSampleRate,
-    int desiredSampleRate);
-
-UniqueAVFrame convertAudioAVFrameSampleFormatAndSampleRate(
+    int desiredSampleRate,
+    const UniqueAVFrame& srcAVFrame,
+    int desiredNumChannels);
+
+// Converts, if needed:
+// - sample format
+// - sample rate
+// - number of channels.
+// createSwrContext must have been previously called with matching parameters.
+UniqueAVFrame convertAudioAVFrameSamples(
     const UniqueSwrContext& swrContext,
     const UniqueAVFrame& srcAVFrame,
     AVSampleFormat desiredSampleFormat,
-    int sourceSampleRate,
-    int desiredSampleRate);
+    int desiredSampleRate,
+    int desiredNumChannels);

 // Returns true if sws_scale can handle unaligned data.
 bool canSwsScaleHandleUnalignedData();
39 changes: 31 additions & 8 deletions src/torchcodec/_core/SingleStreamDecoder.cpp
@@ -453,6 +453,13 @@ void SingleStreamDecoder::addAudioStream(
   TORCH_CHECK(
       seekMode_ == SeekMode::approximate,
       "seek_mode must be 'approximate' for audio streams.");
+  if (audioStreamOptions.numChannels.has_value()) {
+    TORCH_CHECK(
+        *audioStreamOptions.numChannels > 0 &&
+            *audioStreamOptions.numChannels <= AV_NUM_DATA_POINTERS,
+        "num_channels must be > 0 and <= AV_NUM_DATA_POINTERS (usually 8). Got: ",
+        *audioStreamOptions.numChannels);
+  }

   addStream(streamIndex, AVMEDIA_TYPE_AUDIO);

@@ -1171,27 +1178,33 @@ void SingleStreamDecoder::convertAudioAVFrameToFrameOutputOnCPU(
   int desiredSampleRate =
       streamInfo.audioStreamOptions.sampleRate.value_or(sourceSampleRate);

+  int sourceNumChannels = getNumChannels(srcAVFrame);
+  int desiredNumChannels =
+      streamInfo.audioStreamOptions.numChannels.value_or(sourceNumChannels);
+
   bool mustConvert =
       (sourceSampleFormat != desiredSampleFormat ||
-       sourceSampleRate != desiredSampleRate);
+       sourceSampleRate != desiredSampleRate ||
+       sourceNumChannels != desiredNumChannels);

   UniqueAVFrame convertedAVFrame;
   if (mustConvert) {
     if (!streamInfo.swrContext) {
       streamInfo.swrContext.reset(createSwrContext(
-          streamInfo.codecContext,
           sourceSampleFormat,
           desiredSampleFormat,
           sourceSampleRate,
-          desiredSampleRate));
+          desiredSampleRate,
+          srcAVFrame,
+          desiredNumChannels));
     }

-    convertedAVFrame = convertAudioAVFrameSampleFormatAndSampleRate(
+    convertedAVFrame = convertAudioAVFrameSamples(
         streamInfo.swrContext,
         srcAVFrame,
         desiredSampleFormat,
-        sourceSampleRate,
-        desiredSampleRate);
+        desiredSampleRate,
+        desiredNumChannels);
   }
   const UniqueAVFrame& avFrame = mustConvert ? convertedAVFrame : srcAVFrame;

@@ -1204,8 +1217,17 @@
       "source format = ",
       av_get_sample_fmt_name(format));

+  int numChannels = getNumChannels(avFrame);
+  TORCH_CHECK(
+      numChannels == desiredNumChannels,
+      "Something went wrong, the frame didn't get converted to the desired ",
+      "number of channels = ",
+      desiredNumChannels,
+      ". Got ",
+      numChannels,
+      " instead.");
+
   auto numSamples = avFrame->nb_samples; // per channel
-  auto numChannels = getNumChannels(avFrame);

   frameOutput.data = torch::empty({numChannels, numSamples}, torch::kFloat32);

@@ -1240,7 +1262,8 @@ std::optional<torch::Tensor> SingleStreamDecoder::maybeFlushSwrBuffers() {
     return std::nullopt;
   }

-  auto numChannels = getNumChannels(streamInfo.codecContext);
+  int numChannels = streamInfo.audioStreamOptions.numChannels.value_or(
+      getNumChannels(streamInfo.codecContext));
   torch::Tensor lastSamples =
       torch::empty({numChannels, numRemainingSamples}, torch::kFloat32);
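The `maybeFlushSwrBuffers()` change above fixes the shape of the tensor allocated for leftover resampler samples: it must use the output channel count (the user-requested `num_channels` when set), not the codec's. A minimal sketch of that shape logic, with hypothetical names standing in for the C++ code:

```python
# Sketch: choose the channel count for the flushed-samples tensor.
# Mirrors `numChannels.value_or(getNumChannels(codecContext))`.

def flushed_samples_shape(num_remaining_samples, codec_num_channels,
                          requested_num_channels=None):
    num_channels = (
        requested_num_channels
        if requested_num_channels is not None
        else codec_num_channels
    )
    # Tensor shape is (channels, samples), matching torch::empty({...}).
    return (num_channels, num_remaining_samples)
```

Without this fix, a decoder opened with a downmixed `num_channels` would flush a tensor shaped for the codec's original channel count, mismatching the frames decoded earlier.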
1 change: 1 addition & 0 deletions src/torchcodec/_core/StreamOptions.h
@@ -44,6 +44,7 @@ struct AudioStreamOptions {
   AudioStreamOptions() {}

   std::optional<int> sampleRate;
+  std::optional<int> numChannels;
 };

 } // namespace facebook::torchcodec
6 changes: 4 additions & 2 deletions src/torchcodec/_core/custom_ops.cpp
@@ -40,7 +40,7 @@ TORCH_LIBRARY(torchcodec_ns, m) {
   m.def(
       "add_video_stream(Tensor(a!) decoder, *, int? width=None, int? height=None, int? num_threads=None, str? dimension_order=None, int? stream_index=None, str? device=None) -> ()");
   m.def(
-      "add_audio_stream(Tensor(a!) decoder, *, int? stream_index=None, int? sample_rate=None) -> ()");
+      "add_audio_stream(Tensor(a!) decoder, *, int? stream_index=None, int? sample_rate=None, int? num_channels=None) -> ()");
   m.def("seek_to_pts(Tensor(a!) decoder, float seconds) -> ()");
   m.def("get_next_frame(Tensor(a!) decoder) -> (Tensor, Tensor, Tensor)");
   m.def(
@@ -280,9 +280,11 @@ void add_video_stream(
 void add_audio_stream(
     at::Tensor& decoder,
     std::optional<int64_t> stream_index = std::nullopt,
-    std::optional<int64_t> sample_rate = std::nullopt) {
+    std::optional<int64_t> sample_rate = std::nullopt,
+    std::optional<int64_t> num_channels = std::nullopt) {
   AudioStreamOptions audioStreamOptions;
   audioStreamOptions.sampleRate = sample_rate;
+  audioStreamOptions.numChannels = num_channels;
@NicolasHug (Member, Author) commented:


This implicitly converts an optional int64 to an optional int. The same implicit conversion already happens for sample_rate, and for the VideoStreamOptions' fields like height, width, and ffmpegThreadCount.

I'll open an issue to fix / check these all at once.


   auto videoDecoder = unwrapTensorToGetDecoder(decoder);
   videoDecoder->addAudioStream(stream_index.value_or(-1), audioStreamOptions);
2 changes: 2 additions & 0 deletions src/torchcodec/_core/ops.py
@@ -221,6 +221,8 @@ def add_audio_stream_abstract(
     decoder: torch.Tensor,
     *,
     stream_index: Optional[int] = None,
+    sample_rate: Optional[int] = None,
+    num_channels: Optional[int] = None,
 ) -> None:
     return
8 changes: 7 additions & 1 deletion src/torchcodec/decoders/_audio_decoder.py
@@ -40,6 +40,8 @@ class AudioDecoder:
             the :term:`best stream` is used.
         sample_rate (int, optional): The desired output sample rate of the decoded samples.
             By default, the samples are returned in their original sample rate.
+        num_channels (int, optional): The desired number of channels of the decoded samples.
+            By default, the original number of channels is used.

     Attributes:
         metadata (AudioStreamMetadata): Metadata of the audio stream.
@@ -54,11 +56,15 @@ def __init__(
         *,
         stream_index: Optional[int] = None,
         sample_rate: Optional[int] = None,
+        num_channels: Optional[int] = None,
     ):
         self._decoder = create_decoder(source=source, seek_mode="approximate")

         core.add_audio_stream(
-            self._decoder, stream_index=stream_index, sample_rate=sample_rate
+            self._decoder,
+            stream_index=stream_index,
+            sample_rate=sample_rate,
+            num_channels=num_channels,
         )

         container_metadata = core.get_container_metadata(self._decoder)
30 changes: 30 additions & 0 deletions test/test_decoders.py
@@ -1305,3 +1305,33 @@ def test_samples_duration(self, asset, sample_rate):
         decoder = AudioDecoder(asset.path, sample_rate=sample_rate)
         samples = decoder.get_samples_played_in_range(start_seconds=1, stop_seconds=2)
         assert samples.duration_seconds == 1
+
+    @pytest.mark.parametrize("asset", (SINE_MONO_S32, NASA_AUDIO_MP3))
+    # Note that we parametrize over sample_rate as well, so that we can ensure
+    # that the extra tensor allocation that happens within
+    # maybeFlushSwrBuffers() is correct.
+    @pytest.mark.parametrize("sample_rate", (None, 16_000))
+    # FFmpeg can handle up to AV_NUM_DATA_POINTERS=8 channels
+    @pytest.mark.parametrize("num_channels", (1, 2, 8, None))
+    def test_num_channels(self, asset, sample_rate, num_channels):
+        decoder = AudioDecoder(
+            asset.path, sample_rate=sample_rate, num_channels=num_channels
+        )
+        samples = decoder.get_all_samples()
+
+        if num_channels is None:
+            num_channels = asset.num_channels
+
+        assert samples.data.shape[0] == num_channels
+
+    @pytest.mark.parametrize("asset", (SINE_MONO_S32, NASA_AUDIO_MP3))
+    def test_num_channels_errors(self, asset):
+        with pytest.raises(
+            RuntimeError, match="num_channels must be > 0 and <= AV_NUM_DATA_POINTERS"
+        ):
+            AudioDecoder(asset.path, num_channels=0)
+        with pytest.raises(
+            RuntimeError, match="num_channels must be > 0 and <= AV_NUM_DATA_POINTERS"
+        ):
+            # FFmpeg can handle up to AV_NUM_DATA_POINTERS=8 channels
+            AudioDecoder(asset.path, num_channels=9)
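The bounds check these tests exercise lives in `addAudioStream()` as a `TORCH_CHECK`. A pure-Python mirror of its behavior, for illustration only (the function name is made up; `AV_NUM_DATA_POINTERS` is an FFmpeg compile-time constant, 8 in current releases):

```python
AV_NUM_DATA_POINTERS = 8  # FFmpeg constant; upper bound on output channels

def check_num_channels(num_channels):
    # None means "keep the source's channel count": no validation needed.
    if num_channels is None:
        return
    if not (0 < num_channels <= AV_NUM_DATA_POINTERS):
        raise RuntimeError(
            "num_channels must be > 0 and <= AV_NUM_DATA_POINTERS "
            f"(usually 8). Got: {num_channels}"
        )
```

The limit exists because an AVFrame carries at most `AV_NUM_DATA_POINTERS` primary data planes, so planar audio beyond 8 channels cannot be stored directly.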