Allow user to change the channel axis for BatchNorm function and the likes #1664
Comments
Thanks! We already support multiple dims in the norm layers for different data types. The real question is whether it is worth letting the channel axis differ in one set of layers compared to all the others in Flux; it might cause confusion and increase the complexity of the package more generally.
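For reference, a quick illustration of that existing behaviour (nothing new here, just how BatchNorm currently picks dimension N-1 of an N-dimensional input as the channel dimension):

using Flux

bn = BatchNorm(8)
bn(randn(Float32, 8, 16))          # 2-d input: channels along dim 1
bn(randn(Float32, 20, 8, 16))      # 3-d input: channels along dim 2 (= N-1)
bn(randn(Float32, 12, 12, 8, 16))  # 4-d input: channels along dim 3 (= N-1)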
Sounds like a reasonable (if somewhat uncommon) proposal to me if it is extended to GroupNorm and InstanceNorm as well. The main consideration I can think of would be how GPU-friendly this is. Edit: do you have publications or other concrete examples of where this flexibility would be helpful? I had a look through other libraries and none of them appear to support this.
Thanks for your answers. I agree, this option should also be implemented for GroupNorm and InstanceNorm.
If you have code, a PR is always better for discussing a possible change. This is not meant as pushback, but is using
Well, something like this would need to be consistent through Conv*, Dense* etc. too, not just *Norm, for it to be used consistently in a model. I agree a
Well, this might require two
Yeah if we did do this, I would prefer to do it across the board rather than have some layers with fixed
Yeah this is what I was thinking w.r.t. "cumbersome." It might make sense to have a layer that reshapes/permutes, applies a wrapped layer, then reshapes/permutes back. Just an alternate idea.
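For concreteness, a minimal sketch of such a wrapper, assuming a permutation-based implementation (the PermuteWrap name and its fields are illustrative, not an existing Flux layer):

using Flux

struct PermuteWrap{L}
    layer::L
    perm::Vector{Int}   # permutation that moves the desired channel dim into the N-1 slot
end

Flux.@functor PermuteWrap (layer,)   # only the wrapped layer carries parameters

(w::PermuteWrap)(x) = permutedims(w.layer(permutedims(x, w.perm)), invperm(w.perm))

# Usage: a 3-d array with channels along dim 1; [2, 1, 3] moves them to dim N-1 = 2.
bn = PermuteWrap(BatchNorm(8), [2, 1, 3])
bn(randn(Float32, 8, 50, 16))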
I have proposed a PR (#1666).
Correct me if I am wrong, but using
That's true.
Here's a proposal, what if we did something like:
function f(...; dims)
    Chain(PermutedDimsArray(...; dims),
          BN,
          PermutedDimsArray(...; inv(dims)))
end
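Reading inv(dims) as the inverse permutation, the two permutes do cancel out; a tiny sanity check of that identity (assuming invperm is what is meant here):

perm = [3, 1, 2]
x = rand(4, 5, 6)
permutedims(permutedims(x, perm), invperm(perm)) == x   # true: the round trip restores x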
How GPU-friendly is this double permute? Would the second one restore the original dense array for downstream operations? |
Some layers may copy, so no, it wouldn't restore the original (hence it shouldn't be in Flux, because it can't be generic). Note that the dimensionality of the output would match that of the input (the channels dim would be N - 1). It should be GPU-friendly, as PermutedDimsArray is handled in CUDA (via Adapt), which I think is fine.
@DhairyaLGandhi your proposal is appealing. Nonetheless, I am not sure how to interpret your piece of code. I still gave a double permute a try with a GPU array, and it throws a scalar indexing error:
using CUDA; CUDA.allowscalar(false)
import Base.PermutedDimsArray
x = CUDA.randn(10, 5, 2);
x = PermutedDimsArray(x, [3, 2, 1]);
x = PermutedDimsArray(x, invperm([3, 2, 1]));
x * 2
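For comparison, the eager permutedims (which copies into a plain CuArray rather than returning a lazy wrapper) should not hit this fallback, at the cost of materialising each permutation; a minimal check, assuming a functional CUDA setup:

using CUDA; CUDA.allowscalar(false)

x = CUDA.randn(10, 5, 2)
y = permutedims(x, [3, 2, 1])            # eager copy: y is a plain CuArray, not a wrapper
y = permutedims(y, invperm([3, 2, 1]))   # copy back to the original layout
y * 2                                    # broadcasts on the device, no scalar indexing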
Could you check whether PermutedDimsArray has an appropriate rule in Adapt.jl?
@DhairyaLGandhi Yes it looks like there is something |
You are not using Adapt.jl here, because
|
Sure, this is what it throws
|
Okay, that's because broadcasting with a
Cc @maleadt |
Broadcast generally only works with plain CuArrays.
Any updates here? Channel dimension is naturally |
Hi all,
Currently, the BatchNorm function selects by default the N-1 dimension of an N-dimensional array as the channel dimension. I think it would be useful to let the user decide which dimension corresponds to the channel dimension in his/her particular setting. I already have a proposal, which I am happy to release as a pull request.
Example
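A hypothetical sketch of the requested interface (the dims keyword shown below does not exist in Flux; it is only meant to illustrate the idea):

using Flux

x = randn(Float32, 8, 50, 16)   # suppose channels live along dim 1, not the default N-1 = 2

bn = BatchNorm(8)
# bn(x) would treat dim 2 (size 50) as the channel dimension, which is not what we want here.
# The request: something like BatchNorm(8; dims = 1), where dims (hypothetical) selects the channel axis.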