Skip to content

Improved Scan #855

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 22 commits into
base: master
Choose a base branch
from
Open

Improved Scan #855

wants to merge 22 commits into from

Conversation

devshgraphicsprogramming
Copy link
Member

Description

Testing

TODO list:

Comment on lines 102 to 129
using op_t = subgroup::impl::inclusive_scan<binop_t, native>;
// assert T == scalar type, binop::type == T

T operator()(NBL_CONST_REF_ARG(T) value)
{
op_t op;
return op(value);
}
};

template<class Binop, typename T, bool native>
struct exclusive_scan<Binop, T, 1, native>
{
using binop_t = Binop;
using op_t = subgroup::impl::exclusive_scan<binop_t, native>;

T operator()(NBL_CONST_REF_ARG(T) value)
{
op_t op;
return op(value);
}
};

template<class Binop, typename T, bool native>
struct reduction<Binop, T, 1, native>
{
using binop_t = Binop;
using op_t = subgroup::impl::reduction<binop_t, native>;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

benchmark is invalid if you do stuff in terms of subgroup functions, because you are are supposed to use the Params::Configuration::SizeLog2 to make sure your loops unroll, as opposed to the subgroup v1 loops which can't unroll because the loop invariant depends on gl_SubgroupSize which is a uniform and not a compile time constant (you can only hope that the IHV compiler is not dump and actually uses the subgroup size you provide in pipeline creation parameters when lowering SPIR-V to ISA)

TL;DR there can be no dependency between subgroup2 and subgroup namespace, copy the code over

Comment on lines 35 to 43
for (uint32_t i = 1; i < ItemsPerInvocation; i++)
retval[i] = binop(retval[i-1], value[i]);

exclusive_scan_op_t op;
scalar_t exclusive = op(retval[ItemsPerInvocation-1]);

//[unroll(ItemsPerInvocation)]
for (uint32_t i = 0; i < ItemsPerInvocation; i++)
retval[i] = binop(retval[i], exclusive);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this only works if the subgroup invocations are not coalesced

inclusive_scan_op_t op;
value = op(value);

type_t left = glsl::subgroupShuffleUp<type_t>(value,1);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, if each invocation holds consecutive input and output elements, this shift becomes a mess (see that loop you have at the end)

also there was never a need to shuffle the entire vector, because you only ever used the last component

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you do coalesced, then a plain subgroup shuffle on the vector and then conditional set of first element (literal vectorized version of old code) will achieve what you want

const uint32_t invocationID = glsl::gl_SubgroupInvocationID();
// cyclic/modulo shuffle instead of relative needed
const type_t left = ItemsPerInvocation ? glsl::subgroupShuffle<type_t>(value,(invocationID-1)&SubgroupMask):glsl::subgroupShuffleUp<type_t>(value,1);
type_t newFirst; newFirst[0] = binop_t::identity;
[unroll]
for (uint32_t i=1; i<ItemsPerInvocation; i++)
   newFirst[i] = left[i-1];
return mix(newFirst,left,bool(glsl::gl_SubgroupInvocationID()));

P.S. also use mix(T,T,bool) instead of ? bevcause of HLSL short circuiting and turning ternaries into branches.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw the subgroupShuffle with a modulo SubgroupSize can be replaced with new intrinsic from SPV_KHR_subgroup_rotate if you extend the device_limits.json and so on (so that device_capability_traits gets it)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants