Fast Path StringCoding.countPositives and hasNegative for Power #21597
Still a draft PR, since I need to figure out a good way to deal with shorter arrays.
… Power
Fast path the StringCoding methods countPositives and hasNegative on Power. Since their logic is similar, they can be implemented by a single intrinsic.
Signed-off-by: Luke Li <luke.li@ibm.com>
f1a5eaf to 35cdf26
For reference, this is what we are dealing with:
On jdk21+:
Before jdk21:
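As a rough reference for what is being fast-pathed, here is a minimal sketch of the scalar semantics of the two methods. It is an illustration only, not a copy of the JDK source for either release; the class name `StringCodingSketch` is hypothetical. It also shows why one intrinsic can plausibly serve both methods: a negative byte exists exactly when the count of leading non-negative bytes is shorter than the range.

```java
// Hypothetical illustration class; not the JDK's java.lang.StringCoding.
final class StringCodingSketch {

    // Number of leading non-negative bytes in ba[off .. off+len).
    static int countPositives(byte[] ba, int off, int len) {
        int limit = off + len;
        for (int i = off; i < limit; i++) {
            if (ba[i] < 0) {
                return i - off; // index of the first negative byte, relative to off
            }
        }
        return len; // no negative byte found
    }

    // True if any byte in ba[off .. off+len) is negative.
    static boolean hasNegatives(byte[] ba, int off, int len) {
        return countPositives(ba, off, len) != len;
    }
}
```

For example, on the bytes `{1, 2, -1, 4}` with `off = 0` and `len = 4`, `countPositives` returns 2 and `hasNegatives` returns true.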
Performance: as outlined before, it does not do well for arrays shorter than 16 bytes.
After some experimentation, I ended up with two totally different implementations:
The first one modifies
The second one is the one on the PR branch. It uses unaligned vector loads, with the residue going into a serial loop. The performance trade-offs are as outlined earlier.
There seem to be some inescapable performance trade-offs here.
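The shape of the second approach (wide unaligned loads, then a serial loop for the residue) can be sketched in plain Java. This is not the Power intrinsic from the PR branch: it swaps in 8-byte `long` loads through a `VarHandle` as a stand-in for the vector loads, and the class name `WideLoadSketch` is hypothetical. It also hints at why short arrays are awkward: anything shorter than one wide load only ever runs the serial loop.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical illustration class; 8-byte loads stand in for vector loads.
final class WideLoadSketch {
    // Reads a long from a byte[] at an arbitrary (possibly unaligned) index.
    private static final VarHandle LONGS =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    // The sign bit of each of the eight bytes in a long.
    private static final long SIGN_BITS = 0x8080808080808080L;

    static int countPositives(byte[] ba, int off, int len) {
        int i = off;
        int limit = off + len;

        // Wide phase: test eight bytes at a time for any set sign bit.
        while (i <= limit - Long.BYTES) {
            long word = (long) LONGS.get(ba, i);
            if ((word & SIGN_BITS) != 0) {
                break; // a negative byte is somewhere in this chunk
            }
            i += Long.BYTES;
        }

        // Serial phase: handles the residue and pinpoints the first negative byte.
        for (; i < limit; i++) {
            if (ba[i] < 0) {
                return i - off;
            }
        }
        return len;
    }
}
```

The serial phase doubles as the locator for the exact index once the wide phase has found a chunk containing a negative byte, which keeps the sketch short at the cost of rescanning up to eight bytes.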
The reason the default build could be so fast at times was that, in those runs, the offset value was fixed. I now have two benchmarks: one randomises the starting offset, while the other does not. I am not sure which one presents a more realistic scenario:
Randomised offset:
Offset fixed to 0:
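The difference between the two benchmark setups comes down to how the starting offset passed to the method is chosen. A minimal sketch of that choice follows; it is not the benchmark used for the numbers in this PR, and the class, method, and parameter names are all hypothetical.

```java
import java.util.Random;

// Hypothetical helper; not the benchmark harness used in this PR.
final class OffsetSketch {

    // Returns one starting offset per invocation: all zeros when the offset
    // is fixed, or uniformly random values in [0, maxOffset] when randomised.
    static int[] offsets(int count, int maxOffset, boolean randomised, long seed) {
        int[] result = new int[count]; // zero-initialised: the fixed-offset case
        if (randomised) {
            Random rng = new Random(seed); // seeded so runs are repeatable
            for (int i = 0; i < count; i++) {
                result[i] = rng.nextInt(maxOffset + 1);
            }
        }
        return result;
    }
}
```

A harness would then call the method under test with `off = offsets[i]` on each iteration, keeping the input array large enough that `off + len` stays in bounds.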
8a88de9 to 39583ed
A day of testing only showed what didn't work: Making first-byte-mismatch the fall-through branch made a negligible performance difference while making the code more convoluted. Using the counter register for the unroll loop resulted in a 50% reduction in throughput while saving only 1 register.
6eaebec to 165cfdd
7b96532 to bc97a8f
New data with the updated code and randomised offset:
I don't really understand why the P9+ version is slightly slower on arrays shorter than 3, given that the new instructions should not affect them at all.
8a58536 to aaea552
aaea552 to ca84576
I made a broken version of the intrinsic that simply does nothing, and it could only reach a throughput of 800M, compared to the jitted code's 1000M...
Fast path the StringCoding methods countPositives and hasNegative on Power. Since their logic is similar, they can be implemented by a single intrinsic.