Skip to content

Fast Path StringCoding.countPostives and hasNegative for Power #21597

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

luke-li-2003
Copy link
Contributor

Fast path the StringCoding methods countPositives and hasNegative on Power, since their logics are similar, they can be implemented by a single instrinsic.

@luke-li-2003
Copy link
Contributor Author

Still a draft PR since I need to figure out a good way to deal with shorter arrays.

@luke-li-2003
Copy link
Contributor Author

… Power

Fast path the StringCoding methods countPositives and hasNegative on
Power, since their logics are similar, they can be implemented by a single
instrinsic.

Signed-off-by: Luke Li <luke.li@ibm.com>
@luke-li-2003
Copy link
Contributor Author

luke-li-2003 commented Apr 8, 2025

For reference, this is what we are dealing with:

On jdk21+:

    public static boolean hasNegatives(byte[] ba, int off, int len) {
        return countPositives(ba, off, len) != len;
    }

    @IntrinsicCandidate
    public static int countPositives(byte[] ba, int off, int len) {
        int limit = off + len;
        for (int i = off; i < limit; i++) {
            if (ba[i] < 0) {
                return i - off;
            }
        }
        return len;
    }

Before jdk21:

    public static boolean hasNegatives(byte[] ba, int off, int len) {
        for (int i = off; i < off + len; i++) {
            if (ba[i] < 0) {
                return true;
            }
        }
        return false;
    }

@luke-li-2003
Copy link
Contributor Author

Performance: as outlined before it is not doing well for arrays shorter than byte[16]

build byte[1] 4 8 16 30 50 100
nightly 1100 325 217 107 34 19 14
fast path 478 175 97 40 41 144 77

@luke-li-2003
Copy link
Contributor Author

After some experimentations, I ended up with two totally different implementations:

The first one modifies stringIndexOf, which uses aligned loads and masking mechanisms to ensure a serial loop is not necessary.

The second one is the one on the PR branch, it uses unaligned vector loads, with the residue going into a serial loop.

The performance tradeoffs are outlined before

build byte[1] 2 3 4 15 17 128
default 306 195 150 176 78 83 20
aligned 307 165 160 160 147 145 86
unaligned 304 250 190 155 93 80 58

There seems to be some inescapable performance tradeoffs here.

@luke-li-2003
Copy link
Contributor Author

The reason why the default build could be so fast some of the times, was because in those times the offset value was fixed.

I now have two benchmarks, one randomises the starting offset, while the other does not. I am not sure which one presents a more realistic scenario:

Randomised offset:

build 1 2 3 15 30
default 300 291 191 100 64
fast-pathed 571 362 302 125 124

Offset fixed to 0:

build 1 2 3 15 30
default 1000 418 360 89 60
fast-pathed 575 369 290 127 125

@luke-li-2003
Copy link
Contributor Author

A day of testing only showed what didn't work:

Making first-byte-mismatch the fall-through branch only had negligible performance difference while making the code more convoluted.

Using the counter register for the unroll loop resulted in a 50% reduction in throughput while only saving 1 register.

@luke-li-2003 luke-li-2003 force-pushed the CountPostivesP branch 2 times, most recently from 7b96532 to bc97a8f Compare April 29, 2025 19:44
@luke-li-2003
Copy link
Contributor Author

luke-li-2003 commented May 1, 2025

New data with the updated code and randomised offset:

build 1 2 3 15 30 33
default 429 279 229 97 63 64
fast-path 617 393 320 129 128 110
P9+ 615 383 313 136 141 328

I don't really understand why the P9+ version is slightly slower on arrays shorter than 3, given the new instructions should not affect them at all.

@luke-li-2003
Copy link
Contributor Author

I made a broken version of the intrinsic that simply does nothing, and it could reach a throughput of 800M, compared to the jitted code's 1000M...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant