11622 : OutlierDetection should use Ticker, not TimeProvider #12110

vimanikag · 2025-05-29T06:07:42Z

Description :

TimeProvider provides wall time. That can move forward and backward as time is adjusted. OutlierDetection is measuring durations, so it should use a monotonic clock. We use Ticker (either Guava's or Deadline's, but generally Guava's) for that. FakeClock supports both for testing.

Doing a quick audit, this seems to be the only present incorrect use of TimeProvider.

I feel like I noticed this before, but I don't know why I didn't change it at the time. Maybe I had wanted to change the weird Long usage as well. (E.g., EPOCH = currentTimeNanos() - 1 and use EPOCH instead of null.)

Fixes : #11622

ejona86

This is insufficient. You need to audit all usages of the time and make sure subtraction is used when determining the times. See System.nanoTime's javadoc for explanation.

vimanikag · 2025-06-03T12:36:46Z

This is insufficient. You need to audit all usages of the time and make sure subtraction is used when determining the times. See System.nanoTime's javadoc for explanation.

I have fallowed the System.nanoTime's javadoc for the subtraction cases and avoided the direct usage of ticker.read() finding the time difference.

During my audit , I have noticed below classes also using ticker.read() directly in finding the time difference and fallowed the same suggestions for below two classes :

https://github.com/grpc/grpc-java/blob/master/binder/src/main/java/io/grpc/binder/internal/PingTracker.java
https://github.com/grpc/grpc-java/blob/master/netty/src/main/java/io/grpc/netty/AbstractNettyHandler.java

ejona86 · 2025-06-03T14:09:10Z

During my audit , I have noticed below classes also using ticker.read() directly in finding the time difference and fallowed the same suggestions for below two classes :

I don't understand. All you did is make a variable for the ticker.read() call. It was 100% fine before your change, and you didn't change any behavior. Revert those changes.

I'm confident OutlierDetectionLoadBalancer is still broken.

vimanikag · 2025-06-04T04:49:46Z

During my audit , I have noticed below classes also using ticker.read() directly in finding the time difference and fallowed the same suggestions for below two classes :

I don't understand. All you did is make a variable for the ticker.read() call. It was 100% fine before your change, and you didn't change any behavior. Revert those changes.

I'm confident OutlierDetectionLoadBalancer is still broken.

I am hoping the changes below will fix some edge cases like handling inconsistent time with ticker.read() and avoid -ve elapsedTimeNanos , please let me know Your thoughts on above comments "I'm confident OutlierDetectionLoadBalancer is still broken."

long elapsedTimeNanos = Math.max(0L,currentTickerTimeNanos - detectionTimerStartNanos);

I have added the similar changes in AbstractNettyHandler as elapsed calculation has involved and reverted the PingTracker changes as we are not using a ticker to find elapsed time

I am really not getting Your thoughts on ticker.read() usage and concerns , as per my understanding ticker.read() is used to get System.nanoTime() to find the time difference between the current action using monotonic clock time instead of using wall time.

it will really help to take it forward in this case if You share some more thoughts on "why You feel the OutlierDetectionLoadBalancer is still broken" after using ticker.read().

shivaspeaks · 2025-06-06T07:38:44Z

util/src/main/java/io/grpc/util/OutlierDetectionLoadBalancer.java

        // If a timer has started earlier we cancel it and use the difference between the start
        // time and now as the interval.
        initialDelayNanos = Math.max(0L,
-            config.intervalNanos - (timeProvider.currentTimeNanos() - detectionTimerStartNanos));
+            config.intervalNanos - elapsedTimeNanos);


You can keep everything in-line here and above to make it less confusing. Storing in a variable doesn't change anything here, but sure you need to do Math.max(0L, ...) to not have any negative value there.

Sure , I have addressed the review comments and added negative values check and moved everything to single line but's build has failed as it's reaching to 100 characters in line hence I am planning split it two lines

The max is unnecessary. For it to overflow would require the process to be running for 292 years.

Addressed the review comments

shivaspeaks · 2025-06-06T07:45:49Z

netty/src/main/java/io/grpc/netty/AbstractNettyHandler.java

@@ -199,7 +199,8 @@ public void updateWindow() throws Http2Exception {
      pingReturn++;
      setPinging(false);

-      long elapsedTime = (ticker.read() - lastPingTime);


This was fine I believe, this doesn't need any change.

Addressed the review comments

shivaspeaks · 2025-06-06T07:59:41Z

util/src/main/java/io/grpc/util/OutlierDetectionLoadBalancer.java

@@ -201,7 +203,7 @@ class DetectionTimer implements Runnable {

    @Override
    public void run() {
-      detectionTimerStartNanos = timeProvider.currentTimeNanos();
+      detectionTimerStartNanos = ticker.read();


So you need to audit fully, if somewhere in inner methods it is being used in unwanted way - like it is mentioned in javadoc. The comparison should always be done doing subtraction.

I have taken the report using grep commands and verified with the IDE search with ticker.read() but I did not see the scenario where the ticker.read() is directly used in finding the difference without doing the subtraction and We are frequently using the this ticker.read() in junit's but not seen it's used in unwanted way

I can see the ticker.read() in a few below implementation classes but not seen it's used in an unwanted way , please find the attached Audit_TR.txt for Your reference.

Aduit_ticker_read.txt

AbstractNettyHandler.java
NettyServerHandler.java
CachingRlsLbClient.java
LinkedHashLruCache.java
AdaptiveThrottler.java
PingTracker.java
OutlierDetectionLoadBalancer.java

No need to audit all the files. This single file has such a bug. I'd outright tell you, but then I have to audit it for any other missed cases, because clearly you aren't finding them, and that is literally the only interesting part of this PR.

Thanks for the review and the guidance , I have done a careful audit on the OutlierDetectionLoadBalancer method calls and noticed ticker.read() has invoked two api's

acceptResolvedAddresses(ResolvedAddresses resolvedAddresses)

We have already had discussion on the below API and we are good with existing code as it's already been using subtraction to find the time difference instead of using > or < directly in finding the time difference and we are intended to the System.nanoTime() documentation

initialDelayNanos = Math.max(0L,
config.intervalNanos - (ticker.read() - detectionTimerStartNanos));

maxEjectionTimeElapsed(long currentTimeNanos)

I'm hoping our long discussions on finding the bug in this API and I have observed the maybeUnejectOutliers method has invoked it in run() with detectionTimerStartNanos and which is assigned with ticker.read() and using the currentTimeNanos > maxEjectionTimeNanos expression while returning the boolean value if the currentTimeNanos is after the maxEjectionTimeNanos on maxEjectionTimeElapsed

endpointTrackerMap.maybeUnejectOutliers(detectionTimerStartNanos);

grpc-java/util/src/main/java/io/grpc/util/OutlierDetectionLoadBalancer.java

Line 212 in b8c0e0c

endpointTrackerMap.maybeUnejectOutliers(detectionTimerStartNanos);

I have addressed this issue in the latest commit , please review it and let me know Your thoughts if I have missed to find any other bugs in the OutlierDetectionLoadBalancer

shivaspeaks · 2025-06-06T08:55:40Z

Also, please add the description and link the issue it is trying to resolve.

ejona86 · 2025-06-06T15:21:35Z

The link in the description is broken. We tend to say "Fixes #X" so that github knows they are associated and will close the issue when this PR is merged. https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/using-keywords-in-issues-and-pull-requests . See #12080 for an example. See also our CONTRIBUTING.md

shivaspeaks · 2025-06-09T16:02:31Z

util/src/main/java/io/grpc/util/OutlierDetectionLoadBalancer.java

@@ -638,7 +638,7 @@ public boolean maxEjectionTimeElapsed(long currentTimeNanos) {
              config.baseEjectionTimeNanos * ejectionTimeMultiplier,
              maxEjectionDurationSecs);

-      return currentTimeNanos > maxEjectionTimeNanos;
+      return currentTimeNanos - maxEjectionTimeNanos > 0;


No. It is still not solving the complete problem.

As it was pointed before, you just need follow the variable in which you're storing ticker.read() and you will find where it is being used in an incorrect way. Oh I see, you reached to this point, it's just 2 lines above where you need to see. ejectionTimeNanos , and we are adding something into it! That seems to be a problem right? Probably we can get rid of that maxEjectionTimeNanos and you just need to do algebraic manipulation here.

Probably there could be some more places, idk you need to check.

I've not looked really looked at all the changes yet, but this does have the correct shape and is the case I had noticed.

@shivaspeaks, what's happening here is the subtraction may overflow/underflow in order to produce the correct result, as the nanotime itself is allowed to overflow during the execution of the process. But the difference will still be accurate. Doing the subtraction essentially "removes" nanotime from the result ((nanotime + A) - (nanoTime + B) = A - B), independent of all overflow/underflow, as long as A-B itself isn't too large/small to fit in a long.

Yes, but that subtraction needs to be done in one equation. Storing in a variable itself can also overflow. That's what I was referring to by saying algebraic manipulation. It's just we are trying to avoid any addition and storing in long (which potentially can overflow/underflow).

We can do something like this:
return currentTimeNanos - ejectionTimeNanos + Math.min(config.baseEjectionTimeNanos * ejectionTimeMultiplier, maxEjectionDurationSecs) > 0;
Or
return currentTimeNanos > ejectionTimeNanos - Math.min(config.baseEjectionTimeNanos * ejectionTimeMultiplier, maxEjectionDurationSecs);

Both works fine. The first one will also work since it's in one equation.

Oh the second one will not work, we should avoid that. We are doing same mistake as original if we go by the second way.

Okay, so let me put it this way. There are two problems here overlapped.

The one Eric was talking about. Comparison should be done in a - b > 0 way and not a > b.

Adding something to long potentially may overflow.
long maxEjectionTimeNanos = ejectionTimeNanos + Math.min( config.baseEjectionTimeNanos * ejectionTimeMultiplier, maxEjectionDurationSecs);

I over looked 1st one. I had it first but this time while reviewing I totally over looked what javadoc was saying. So yes, Eric is right about it. This change does solve that. But should we also try to solve 2nd concern here? @ejona86

Storing it in a variable has no impact to the result, as long as the variable is not smaller than the original operands. Floating point sometimes can get messy with the order of operations, but integer math is very safe for addition/subtraction (in Java).

Try it out:

jshell> Integer.MAX_VALUE + 100 - Integer.MAX_VALUE; $1 ==> 100 jshell> int overflow = Integer.MAX_VALUE + 100; overflow ==> -2147483549 jshell> overflow - Integer.MAX_VALUE; $3 ==> 100

Signed addition/subtraction is identical in twos complement as unsigned addition/subtraction. So there's actually not any data lost in the above; just some of the result is encoded in the sign bit. But the same actually holds when the high bits are truncated. If you reverse the operation, information is not lost.

jshell> (byte) 200; $1 ==> -56 jshell> byte b = (byte)(Integer.MAX_VALUE + 200); b ==> -57 jshell> (byte)(((int) b) - Integer.MAX_VALUE) $3 ==> -56 jshell> Byte.toUnsignedInt(b) $4 ==> 199 jshell> (byte)(Byte.toUnsignedInt(b) - Integer.MAX_VALUE) $5 ==> -56 jshell> b & 0xFF $6 ==> 199 jshell> (byte)((b & 0xFF) - Integer.MAX_VALUE) $7 ==> -56

The (int) b cast does a sign extension, so the high bits will be 1s there. The Byte.toUnsignedInt(b) does not do sign extension, which is the same as b & 0xFF (the sign extension happens, and then those bits are thrown away; I assume the JIT recognizes the pattern and optimizes the sign extension away). As you can see, the high bits don't actually impact the result (as long as you truncate back down to byte).

(Note that C++ allows unsigned integer overflow/underflow but considers signed integer overflow/underflow undefined behavior. So in other languages, you might have to be careful. The process is fine with all this, but there can be per-language details.)

You may be confused some by Java auto-promoting types to integer (or long, or double). Those rules are confusing, but it isn't relevant in this case because the intermediate results are long.

jshell> byte b = (byte) 100 + (byte) 100 | Error: | incompatible types: possible lossy conversion from int to byte | byte b = (byte) 100 + (byte) 100; | ^---------------------^ jshell> byte b = (byte) 100 + 100L | Error: | incompatible types: possible lossy conversion from long to byte | byte b = (byte) 100 + 100L; | ^---------------^

The first shows + promotes to int. The second shows how a long operand makes the result long. I'm very thankful that Golang chose not to retain this from C. Auto-promotion/coercion makes things harder to understand.

vimanikag added 2 commits May 29, 2025 06:06

11622 : OutlierDetection should use Ticker, not TimeProvider

ec44974

11622 : OutlierDetection should use Ticker, not TimeProvider

8f513b7

vimanikag marked this pull request as ready for review May 29, 2025 06:17

ejona86 reviewed May 29, 2025

View reviewed changes

ejona86 added the kokoro:run Add this label to a PR to tell Kokoro the code is safe and tests can be run label May 29, 2025

grpc-kokoro removed the kokoro:run Add this label to a PR to tell Kokoro the code is safe and tests can be run label May 29, 2025

vimanikag added 2 commits June 3, 2025 10:07

11243 : RLS cleanups

debbd7a

11622 : OutlierDetection should use Ticker, not TimeProvider

3735a6c

11622 : OutlierDetection should use Ticker, not TimeProvider

c831f1e

shivaspeaks reviewed Jun 6, 2025

View reviewed changes

11622 : OutlierDetection should use Ticker, not TimeProvider

b8c0e0c

11622 : OutlierDetection should use Ticker, not TimeProvider

aaf5016

shivaspeaks reviewed Jun 9, 2025

View reviewed changes

11622 : OutlierDetection should use Ticker, not TimeProvider #12110

Are you sure you want to change the base?

11622 : OutlierDetection should use Ticker, not TimeProvider #12110

Conversation

vimanikag commented May 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ejona86 left a comment

Choose a reason for hiding this comment

Uh oh!

vimanikag commented Jun 3, 2025

Uh oh!

ejona86 commented Jun 3, 2025

Uh oh!

vimanikag commented Jun 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

shivaspeaks commented Jun 6, 2025

Uh oh!

ejona86 commented Jun 6, 2025

Uh oh!

shivaspeaks Jun 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

shivaspeaks Jun 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

shivaspeaks Jun 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

vimanikag commented May 29, 2025 •

edited

Loading

vimanikag commented Jun 4, 2025 •

edited

Loading

shivaspeaks Jun 9, 2025 •

edited

Loading

shivaspeaks Jun 9, 2025 •

edited

Loading

shivaspeaks Jun 9, 2025 •

edited

Loading