-
Notifications
You must be signed in to change notification settings - Fork 65
Improved Scan #855
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Improved Scan #855
Conversation
include/nbl/builtin/hlsl/subgroup2/arithmetic_portability_impl.hlsl
Outdated
Show resolved
Hide resolved
using op_t = subgroup::impl::inclusive_scan<binop_t, native>; | ||
// assert T == scalar type, binop::type == T | ||
|
||
T operator()(NBL_CONST_REF_ARG(T) value) | ||
{ | ||
op_t op; | ||
return op(value); | ||
} | ||
}; | ||
|
||
template<class Binop, typename T, bool native> | ||
struct exclusive_scan<Binop, T, 1, native> | ||
{ | ||
using binop_t = Binop; | ||
using op_t = subgroup::impl::exclusive_scan<binop_t, native>; | ||
|
||
T operator()(NBL_CONST_REF_ARG(T) value) | ||
{ | ||
op_t op; | ||
return op(value); | ||
} | ||
}; | ||
|
||
template<class Binop, typename T, bool native> | ||
struct reduction<Binop, T, 1, native> | ||
{ | ||
using binop_t = Binop; | ||
using op_t = subgroup::impl::reduction<binop_t, native>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
benchmark is invalid if you do stuff in terms of subgroup
functions, because you are are supposed to use the Params::Configuration::SizeLog2
to make sure your loops unroll, as opposed to the subgroup
v1 loops which can't unroll because the loop invariant depends on gl_SubgroupSize
which is a uniform and not a compile time constant (you can only hope that the IHV compiler is not dump and actually uses the subgroup size you provide in pipeline creation parameters when lowering SPIR-V to ISA)
TL;DR there can be no dependency between subgroup2
and subgroup
namespace, copy the code over
include/nbl/builtin/hlsl/subgroup2/arithmetic_portability_impl.hlsl
Outdated
Show resolved
Hide resolved
for (uint32_t i = 1; i < ItemsPerInvocation; i++) | ||
retval[i] = binop(retval[i-1], value[i]); | ||
|
||
exclusive_scan_op_t op; | ||
scalar_t exclusive = op(retval[ItemsPerInvocation-1]); | ||
|
||
//[unroll(ItemsPerInvocation)] | ||
for (uint32_t i = 0; i < ItemsPerInvocation; i++) | ||
retval[i] = binop(retval[i], exclusive); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this only works if the subgroup invocations are not coalesced
inclusive_scan_op_t op; | ||
value = op(value); | ||
|
||
type_t left = glsl::subgroupShuffleUp<type_t>(value,1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, if each invocation holds consecutive input and output elements, this shift becomes a mess (see that loop you have at the end)
also there was never a need to shuffle the entire vector, because you only ever used the last component
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you do coalesced, then a plain subgroup shuffle on the vector and then conditional set of first element (literal vectorized version of old code) will achieve what you want
const uint32_t invocationID = glsl::gl_SubgroupInvocationID();
// cyclic/modulo shuffle instead of relative needed
const type_t left = ItemsPerInvocation ? glsl::subgroupShuffle<type_t>(value,(invocationID-1)&SubgroupMask):glsl::subgroupShuffleUp<type_t>(value,1);
type_t newFirst; newFirst[0] = binop_t::identity;
[unroll]
for (uint32_t i=1; i<ItemsPerInvocation; i++)
newFirst[i] = left[i-1];
return mix(newFirst,left,bool(glsl::gl_SubgroupInvocationID()));
P.S. also use mix(T,T,bool)
instead of ?
bevcause of HLSL short circuiting and turning ternaries into branches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw the subgroupShuffle
with a modulo SubgroupSize can be replaced with new intrinsic from SPV_KHR_subgroup_rotate
if you extend the device_limits.json
and so on (so that device_capability_traits
gets it)
Description
Testing
TODO list: