Improved Scan #855

Open

devshgraphicsprogramming wants to merge 22 commits into master from improve_scan

Member

devshgraphicsprogramming commented Mar 19, 2025

Description

Testing

TODO list:


          intial changes

d4e3738

devshgraphicsprogramming commented

include/nbl/builtin/hlsl/subgroup/arithmetic_portability.hlsl (outdated, resolved)

keptsecret added 4 commits

March 27, 2025 15:04


          subgroup2 implementations

10d9c39


          some fixes, example

f2a281c


          changed template parameters

4622f1f


          working subgroup2 template and funcs

abfaf67

devshgraphicsprogramming commented

include/nbl/builtin/hlsl/subgroup2/arithmetic_portability.hlsl (outdated, resolved)

keptsecret added 6 commits

March 31, 2025 16:40


          fix reduction bug

f2d6d8a


          minor fix

eeec20a


          latest example

53ffc60


          merge master, fix conflicts

0efeb8d


          new example number


          partial spec for items per invoc =1

e88f51a

devshgraphicsprogramming commented

include/nbl/builtin/hlsl/subgroup/ballot.hlsl (outdated, resolved)

devshgraphicsprogramming commented

include/nbl/builtin/hlsl/subgroup2/arithmetic_portability.hlsl (outdated, resolved)

devshgraphicsprogramming commented

include/nbl/builtin/hlsl/subgroup2/arithmetic_portability_impl.hlsl (outdated, resolved)

devshgraphicsprogramming commented

include/nbl/builtin/hlsl/subgroup2/arithmetic_portability.hlsl (outdated, resolved)

devshgraphicsprogramming commented

include/nbl/builtin/hlsl/subgroup2/arithmetic_portability_impl.hlsl Outdated

Comment on lines 102 to 129

+                  using op_t = subgroup::impl::inclusive_scan<binop_t, native>;
+                  // assert T == scalar type, binop::type == T
+                  T operator()(NBL_CONST_REF_ARG(T) value)
+                  {
+                      op_t op;
+                      return op(value);
+                  }
+              };
+              template<class Binop, typename T, bool native>
+              struct exclusive_scan<Binop, T, 1, native>
+              {
+                  using binop_t = Binop;
+                  using op_t = subgroup::impl::exclusive_scan<binop_t, native>;
+                  T operator()(NBL_CONST_REF_ARG(T) value)
+                  {
+                      op_t op;
+                      return op(value);
+                  }
+              };
+              template<class Binop, typename T, bool native>
+              struct reduction<Binop, T, 1, native>
+              {
+                  using binop_t = Binop;
+                  using op_t = subgroup::impl::reduction<binop_t, native>;

Member Author

devshgraphicsprogramming Apr 9, 2025

benchmark is invalid if you do stuff in terms of the subgroup functions, because you are supposed to use Params::Configuration::SizeLog2 to make sure your loops unroll. The subgroup v1 loops can't unroll because the loop bound depends on gl_SubgroupSize, which is a uniform and not a compile-time constant (you can only hope that the IHV compiler is not dumb and actually uses the subgroup size you provide in the pipeline creation parameters when lowering SPIR-V to ISA)

TL;DR there can be no dependency between subgroup2 and subgroup namespace, copy the code over
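
To illustrate the unrolling argument, here is a CPU-side C++ sketch. It is not code from this PR: `Configuration` is a hypothetical stand-in for `Params::Configuration`, `std::array` slots stand in for subgroup invocations, and the copy of the array stands in for a shuffle-up. The point is only that the outer loop's trip count is `Config::SizeLog2`, a compile-time constant, so a compiler is free to unroll it fully. A loop bounded by a runtime subgroup size would not give that guarantee.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for Params::Configuration: the subgroup size is a
// compile-time constant, so every loop bound below is known to the compiler.
template<uint32_t SizeLog2_>
struct Configuration
{
    static constexpr uint32_t SizeLog2 = SizeLog2_;
    static constexpr uint32_t Size = 1u << SizeLog2_;
};

// Hillis-Steele inclusive scan over emulated invocations. The outer loop runs
// exactly Config::SizeLog2 times, which is what makes full unrolling possible.
template<class Config, class Binop, typename T>
std::array<T, Config::Size> inclusive_scan(std::array<T, Config::Size> lanes, Binop binop)
{
    for (uint32_t step = 0; step < Config::SizeLog2; step++)
    {
        const uint32_t stride = 1u << step;
        // Snapshot of the previous step, emulating a shuffle-up by `stride`.
        const std::array<T, Config::Size> prev = lanes;
        for (uint32_t lane = stride; lane < Config::Size; lane++)
            lanes[lane] = binop(prev[lane - stride], prev[lane]);
    }
    return lanes;
}

// Example binary operation (assumed for illustration).
struct Plus
{
    uint32_t operator()(uint32_t a, uint32_t b) const { return a + b; }
};
```

For example, `inclusive_scan<Configuration<3>>(input, Plus{})` scans 8 emulated invocations in exactly 3 steps.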

devshgraphicsprogramming commented

include/nbl/builtin/hlsl/subgroup2/arithmetic_portability_impl.hlsl (outdated, resolved)

devshgraphicsprogramming commented

include/nbl/builtin/hlsl/subgroup2/arithmetic_portability_impl.hlsl Outdated

Comment on lines 35 to 43

+                      for (uint32_t i = 1; i < ItemsPerInvocation; i++)
+                          retval[i] = binop(retval[i-1], value[i]);
+                      exclusive_scan_op_t op;
+                      scalar_t exclusive = op(retval[ItemsPerInvocation-1]);
+                      //[unroll(ItemsPerInvocation)]
+                      for (uint32_t i = 0; i < ItemsPerInvocation; i++)
+                          retval[i] = binop(retval[i], exclusive);

Member Author

devshgraphicsprogramming Apr 9, 2025

this only works if the subgroup invocations are not coalesced
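
A CPU-side C++ sketch of why the layout matters. It is not code from this PR: the sizes, `consecutiveIndex`, `coalescedIndex`, `load`, and `scanLocalThenSubgroup` are hypothetical names, and `+` is assumed as the binop. The scheme in the quoted snippet (local scan, exclusive scan of per-invocation totals, add the prefix) yields a correct scan when invocation `lane` holds elements `lane*ItemsPerInvocation + i`. With a coalesced load, where component `i` of invocation `lane` holds element `i*SubgroupSize + lane`, the same scheme combines elements that are not adjacent in the input.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr uint32_t SubgroupSize = 4;        // assumed for illustration
constexpr uint32_t ItemsPerInvocation = 2;  // assumed for illustration
using vec_t = std::array<uint32_t, ItemsPerInvocation>;
using subgroup_t = std::array<vec_t, SubgroupSize>;

// Flat element index held in component `i` of invocation `lane`.
constexpr uint32_t consecutiveIndex(uint32_t lane, uint32_t i) { return lane * ItemsPerInvocation + i; }
constexpr uint32_t coalescedIndex(uint32_t lane, uint32_t i)   { return i * SubgroupSize + lane; }

// Fills the emulated registers so that the element at flat index k has value k + 1.
template<class IndexFn>
subgroup_t load(IndexFn indexOf)
{
    subgroup_t v{};
    for (uint32_t lane = 0; lane < SubgroupSize; lane++)
        for (uint32_t i = 0; i < ItemsPerInvocation; i++)
            v[lane][i] = indexOf(lane, i) + 1;
    return v;
}

// The scheme from the quoted snippet with + as the binop: local inclusive scan,
// exclusive scan of the per-invocation totals, then add that prefix everywhere.
subgroup_t scanLocalThenSubgroup(subgroup_t v)
{
    for (auto& lane : v)
        for (uint32_t i = 1; i < ItemsPerInvocation; i++)
            lane[i] += lane[i - 1];
    uint32_t exclusive = 0; // running total of all lower-numbered invocations
    for (auto& lane : v)
    {
        const uint32_t total = lane[ItemsPerInvocation - 1];
        for (auto& x : lane)
            x += exclusive;
        exclusive += total;
    }
    return v;
}
```

With the consecutive load the last component of the last invocation ends up holding the sum of all eight inputs. With the coalesced load, component 1 of invocation 0 holds `1 + 5` rather than the prefix `1 + 2 + 3 + 4 + 5`.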

devshgraphicsprogramming commented

include/nbl/builtin/hlsl/subgroup2/arithmetic_portability_impl.hlsl Outdated

+                      inclusive_scan_op_t op;
+                      value = op(value);
+                      type_t left = glsl::subgroupShuffleUp<type_t>(value,1);

Member Author

devshgraphicsprogramming Apr 9, 2025 (edited)

yeah, if each invocation holds consecutive input and output elements, this shift becomes a mess (see that loop you have at the end)

also there was never a need to shuffle the entire vector, because you only ever used the last component

Member Author

devshgraphicsprogramming Apr 9, 2025

if you do coalesced, then a plain subgroup shuffle of the vector followed by a conditional set of the first element (a literal vectorized version of the old code) will achieve what you want

const uint32_t invocationID = glsl::gl_SubgroupInvocationID();
// cyclic/modulo shuffle instead of relative needed
const type_t left = ItemsPerInvocation ? glsl::subgroupShuffle<type_t>(value,(invocationID-1)&SubgroupMask):glsl::subgroupShuffleUp<type_t>(value,1);
type_t newFirst; newFirst[0] = binop_t::identity;
[unroll]
for (uint32_t i=1; i<ItemsPerInvocation; i++)
   newFirst[i] = left[i-1];
return mix(newFirst,left,bool(glsl::gl_SubgroupInvocationID()));

P.S. also use mix(T,T,bool) instead of the ?: ternary, because HLSL short-circuits it and turns ternaries into branches.
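
A CPU-side C++ sketch that checks the suggestion above on a small case. It is not code from this PR: the sizes, `Identity`, and `inclusiveToExclusive` are hypothetical, `+` is assumed as the binop, array indexing stands in for `subgroupShuffle`, and the local `mix` only emulates the select semantics of `mix(T,T,bool)`. It assumes the coalesced layout, where component `i` of invocation `id` holds element `i*SubgroupSize + id`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr uint32_t SubgroupSize = 4;                 // assumed for illustration
constexpr uint32_t SubgroupMask = SubgroupSize - 1;
constexpr uint32_t ItemsPerInvocation = 2;           // assumed for illustration
constexpr uint32_t Identity = 0;                     // identity element of +
using vec_t = std::array<uint32_t, ItemsPerInvocation>;
using subgroup_t = std::array<vec_t, SubgroupSize>;

// Emulates mix(T,T,bool): returns `b` when `cond` is true, otherwise `a`.
vec_t mix(const vec_t& a, const vec_t& b, bool cond)
{
    vec_t r{};
    for (uint32_t i = 0; i < ItemsPerInvocation; i++)
        r[i] = cond ? b[i] : a[i];
    return r;
}

// Turns a coalesced inclusive scan into the matching exclusive scan.
subgroup_t inclusiveToExclusive(const subgroup_t& inclusive)
{
    subgroup_t out{};
    for (uint32_t invocationID = 0; invocationID < SubgroupSize; invocationID++)
    {
        // Cyclic shuffle: invocation 0 wraps around and reads the last invocation.
        const vec_t left = inclusive[(invocationID - 1) & SubgroupMask];
        // Only invocation 0 needs this: its component i continues from the last
        // invocation's component i-1, and its component 0 starts at the identity.
        vec_t newFirst{};
        newFirst[0] = Identity;
        for (uint32_t i = 1; i < ItemsPerInvocation; i++)
            newFirst[i] = left[i - 1];
        out[invocationID] = mix(newFirst, left, invocationID != 0);
    }
    return out;
}
```

Every invocation other than 0 simply takes its left neighbour's vector, which is why a relative shuffle-up is not enough here: invocation 0 needs the wrapped-around value from the last invocation.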

Member Author

devshgraphicsprogramming Apr 10, 2025

btw the subgroupShuffle with a modulo SubgroupSize can be replaced with the new intrinsic from SPV_KHR_subgroup_rotate if you extend the device_limits.json and so on (so that device_capability_traits gets it)
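
A CPU-side C++ sketch of the equivalence being suggested, not code from this PR. It assumes the rotate operation makes each invocation read from invocation `(id + delta) mod SubgroupSize`, which is my reading of the SPV_KHR_subgroup_rotate specification; under that assumption, the cyclic "read from the left neighbour" shuffle is a rotate by `SubgroupSize - 1`. The function names and the subgroup size are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr uint32_t SubgroupSize = 8;                 // assumed for illustration
constexpr uint32_t SubgroupMask = SubgroupSize - 1;
using lanes_t = std::array<uint32_t, SubgroupSize>;

// Emulated shuffle with a modulo index: every invocation reads its left
// neighbour, and invocation 0 wraps around to the last invocation.
lanes_t shuffleFromLeftCyclic(const lanes_t& v)
{
    lanes_t out{};
    for (uint32_t id = 0; id < SubgroupSize; id++)
        out[id] = v[(id - 1) & SubgroupMask];
    return out;
}

// Emulated rotate, ASSUMING the source invocation is (id + delta) mod SubgroupSize.
lanes_t rotateLanes(const lanes_t& v, uint32_t delta)
{
    lanes_t out{};
    for (uint32_t id = 0; id < SubgroupSize; id++)
        out[id] = v[(id + delta) & SubgroupMask];
    return out;
}
```

Under the stated assumption, `shuffleFromLeftCyclic(v)` and `rotateLanes(v, SubgroupSize - 1)` produce the same result for any input.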


          changes to Params, Config handling types

a8e02a3

devshgraphicsprogramming commented

include/nbl/builtin/hlsl/subgroup2/arithmetic_portability.hlsl (outdated, resolved)


          rework specializations for native, emulated funcs

237ac09

devshgraphicsprogramming commented

include/nbl/builtin/hlsl/subgroup2/arithmetic_portability.hlsl (outdated, resolved)

devshgraphicsprogramming commented

include/nbl/builtin/hlsl/subgroup2/arithmetic_portability_impl.hlsl (outdated, resolved)

keptsecret added 4 commits

April 11, 2025 10:25


          added OpSelect intrinsic for mix, fix mix behavior with bool

859c313


          use mix instead of ternary op

c5a3223


          fixes to subgroup2 funcs

87bca2b


          changes to handle coalesced data loads

49fd605

keptsecret added 5 commits

April 21, 2025 14:53


          merge master, fix example conflicts

4ae51a1


          fixes to inclusive_scan for coalesced

609ad85


          removed redundant code

6b692f4


          enabled handling vectors in spirv group ops with templates and enable_if

d0acb31


          added impl component wise inclusive scan for inclusive scan

fc92538

Labels

None yet