
[fix][broker]Fix deadlock when compaction and topic deletion execute concurrently #24366


Merged
merged 5 commits into apache:master on Jun 5, 2025

Conversation

poorbarcode
Contributor

@poorbarcode poorbarcode commented May 30, 2025

Motivation

Issue 1: deadlock when compaction and topic deletion execute concurrently

Reproduction steps of the deadlock (see also the new test testConcurrentCompactionAndTopicDelete):

  1. thread-compaction: create a raw reader.
  2. thread-topic-deleting: mark the topic as deleted and disconnect clients, including the compaction reader.
  3. thread-compaction: the raw reader attempts to reconnect due to the disconnection in step 2.
  4. thread-topic-deleting: unsubscribe all subscriptions that contain "__compaction".
  5. thread-topic-deleting: deleting "__compaction" cursor will wait for the in-progress compaction task.
  6. thread-compaction: the raw reader cannot connect successfully anymore because the topic was marked as "fenced && isDeleting".
  7. deadlock: "thread-topic-deleting" waits for the compaction task to finish, while "thread-compaction" continuously reconnects.

Note:

  • If thread-compaction is stuck at step 6 while the first phase of compaction is calling reader.readNextAsync(), the deadlock resolves itself after brokerServiceCompactionPhaseOneLoopTimeInSeconds seconds.
  • But in other scenarios, such as reader.getLastMessageIdAsync..., the deadlock persists forever.

You can reproduce the issue with the new test testConcurrentCompactionAndTopicDelete.
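
For illustration, a minimal sketch of how the race can be driven; the topic name and the admin/pulsarClient fields are assumptions of a typical broker test setup, and the real coverage lives in testConcurrentCompactionAndTopicDelete:

// Sketch only: trigger compaction and topic deletion concurrently.
// Assumes a broker test base providing `admin` (PulsarAdmin) and `pulsarClient` (PulsarClient).
String topic = "persistent://public/default/compaction-delete-race";
admin.topics().createNonPartitionedTopic(topic);
try (Producer<byte[]> producer = pulsarClient.newProducer().topic(topic).create()) {
    for (int i = 0; i < 100; i++) {
        producer.newMessage().key("k-" + (i % 10)).value(("v-" + i).getBytes()).send();
    }
}
// Start compaction, then delete the topic while the compaction task is still running.
CompletableFuture<Void> compaction = admin.topics().triggerCompactionAsync(topic);
CompletableFuture<Void> deletion = admin.topics().deleteAsync(topic, true /* force */);
// Before this fix, the deletion could block forever on the in-progress compaction
// while the compaction's raw reader kept reconnecting to the fenced topic.
// (The compaction future is expected to complete exceptionally once the topic is gone.)
deletion.get(30, TimeUnit.SECONDS);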


Issue 2-1: The CompletableFuture returned by Consumer.receiveAsync() never completes if the topic is deleted

Background:

setState(State.Failed);
closeConsumerTasks();
client.cleanupConsumer(this);
// Issue: it should call "failPendingReceive()" also

You can reproduce the issue with the new test testReceiveWillDoneAfterTopicDeleted.
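
For illustration, a minimal sketch of the symptom; the topic/subscription names and the 30-second wait are assumptions, and the real coverage lives in testReceiveWillDoneAfterTopicDeleted:

// Sketch only: receiveAsync() should complete (exceptionally) once the topic is deleted.
Consumer<byte[]> consumer = pulsarClient.newConsumer()
        .topic(topic)
        .subscriptionName("sub")
        .subscribe();
CompletableFuture<Message<byte[]>> pendingReceive = consumer.receiveAsync();
admin.topics().delete(topic, true /* force */);
// Before the fix, this future hung forever because the consumer set State.Failed and
// cleaned itself up without failing the pending receives.
try {
    pendingReceive.get(30, TimeUnit.SECONDS);
} catch (ExecutionException expected) {
    // expected: the consumer gave up reconnecting and failed the pending receive
}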

Issue 2-2: a resource leak involving the dead letter producer and the retry letter producer.

In the same path as issue 2-1, the client also forgets to release the related dead letter producer and retry letter producer, which causes a resource leak. Since other issues prevent the new test from passing, I will open a separate PR that adds a test covering this case and contains another fix.


Issue 3: even after receiving a TopicNotFound error, the client keeps retrying.

Root cause: the Pulsar client stops retrying when it receives a TopicNotFound error from the Subscribe command, but TopicNotFound can also be received while getting the broker connection for the target topic, and that path keeps retrying.

You can reproduce the issue with the new test testReceiveWillDoneAfterTopicDeleted.
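
For illustration, a rough sketch of the intended behaviour; the method and helper names below are hypothetical and not the actual ConnectionHandler/ConsumerImpl code:

// Sketch of the idea only: a TopicNotFound error received while obtaining the broker
// connection should be treated like TopicNotFound from the Subscribe command,
// i.e. stop retrying and tear the consumer down.
void handleConnectionError(Throwable t) {
    Throwable cause = FutureUtil.unwrapCompletionException(t);
    if (cause instanceof PulsarClientException.TopicDoesNotExistException) {
        // Unrecoverable: the topic is gone, so reconnecting can never succeed.
        closeConsumerAndFailPendingCalls(cause);   // hypothetical helper
        return;
    }
    scheduleReconnectWithBackoff();                // hypothetical helper
}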


Modifications

  • Regarding issue 1 (the deadlock): close the RawReader if the compaction task receives a ServiceNotReady error, which happens when a topic/subscription is fenced or the topic/namespace is loading/unloading. Since the RawReader is an internal API, this change is safe. A rough sketch of the idea follows this list.
  • Regarding issue 2 (the consumer forgets to complete pending receive requests when it stops reconnecting): call consumer.close() instead of performing only part of the state cleanup inline, which makes the cleanup complete.
  • Regarding issue 3 (the client keeps reconnecting even after receiving a TopicNotFound error): stop reconnecting after receiving an unrecoverable error while grabbing a new connection, and close the affected consumers/producers.
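
Regarding the first bullet, a rough sketch of the idea; the future and helper names are illustrative only, and the actual change lives in Compactor/RawReaderImpl:

// Sketch of the idea only: if the compaction reader hits a ServiceNotReady-style error
// (topic/subscription fenced, topic or namespace loading/unloading), close the reader and
// fail the compaction task instead of retrying against a topic that is being deleted.
reader.readNextAsync().exceptionally(t -> {
    Throwable cause = FutureUtil.unwrapCompletionException(t);
    if (isServiceNotReady(cause)) {                      // hypothetical predicate for the error class
        reader.closeAsync();                             // give up: reconnecting cannot succeed
        phaseOneFuture.completeExceptionally(cause);     // hypothetical future tracking phase one
    } else {
        scheduleNextRead();                              // hypothetical: continue the normal read loop
    }
    return null;
});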

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

@poorbarcode poorbarcode added this to the 4.1.0 milestone May 30, 2025
@poorbarcode poorbarcode self-assigned this May 30, 2025
@poorbarcode poorbarcode added type/bug The PR fixed a bug or issue reported a bug release/3.0.13 release/4.0.6 release/3.3.8 labels May 30, 2025
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label May 30, 2025
@poorbarcode
Contributor Author

/pulsarbot rerun-failure-checks

@codecov-commenter

codecov-commenter commented May 30, 2025

Codecov Report

Attention: Patch coverage is 81.53846% with 12 lines in your changes missing coverage. Please review.

Project coverage is 74.26%. Comparing base (bbc6224) to head (fa2f120).
Report is 1128 commits behind head on master.

Files with missing lines Patch % Lines
...va/org/apache/pulsar/client/impl/ConsumerImpl.java 76.19% 3 Missing and 2 partials ⚠️
...va/org/apache/pulsar/client/impl/ProducerImpl.java 83.33% 1 Missing and 2 partials ⚠️
...a/org/apache/pulsar/client/impl/RawReaderImpl.java 88.88% 0 Missing and 1 partial ⚠️
...g/apache/pulsar/client/impl/ConnectionHandler.java 85.71% 1 Missing ⚠️
...rg/apache/pulsar/client/impl/TopicListWatcher.java 50.00% 1 Missing ⚠️
...ulsar/client/impl/TransactionMetaStoreHandler.java 50.00% 1 Missing ⚠️
Additional details and impacted files


@@             Coverage Diff              @@
##             master   #24366      +/-   ##
============================================
+ Coverage     73.57%   74.26%   +0.68%     
- Complexity    32624    32692      +68     
============================================
  Files          1877     1867      -10     
  Lines        139502   145260    +5758     
  Branches      15299    16610    +1311     
============================================
+ Hits         102638   107876    +5238     
+ Misses        28908    28853      -55     
- Partials       7956     8531     +575     
Flag Coverage Δ
inttests 26.64% <44.61%> (+2.05%) ⬆️
systests 23.27% <9.23%> (-1.06%) ⬇️
unittests 73.76% <81.53%> (+0.91%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...rg/apache/pulsar/broker/service/AbstractTopic.java 88.04% <ø> (+0.06%) ⬆️
...sar/broker/service/persistent/PersistentTopic.java 80.24% <100.00%> (+1.78%) ⬆️
...n/java/org/apache/pulsar/client/api/RawReader.java 100.00% <100.00%> (ø)
...n/java/org/apache/pulsar/compaction/Compactor.java 92.30% <100.00%> (+11.53%) ⬆️
...pache/pulsar/client/api/PulsarClientException.java 64.40% <100.00%> (+1.35%) ⬆️
.../java/org/apache/pulsar/client/impl/ClientCnx.java 70.12% <100.00%> (-1.66%) ⬇️
...a/org/apache/pulsar/client/impl/RawReaderImpl.java 84.07% <88.88%> (-0.70%) ⬇️
...g/apache/pulsar/client/impl/ConnectionHandler.java 86.17% <85.71%> (-0.60%) ⬇️
...rg/apache/pulsar/client/impl/TopicListWatcher.java 65.69% <50.00%> (-2.17%) ⬇️
...ulsar/client/impl/TransactionMetaStoreHandler.java 67.66% <50.00%> (-0.24%) ⬇️
... and 2 more

... and 1076 files with indirect coverage changes


@lhotari
Member

lhotari commented May 30, 2025

@poorbarcode Does this resolve #24148?

@lhotari
Member

lhotari commented May 30, 2025

Why does the current code (before this PR) handle the if (isClosingOrDeleting) check inside the write lock?

lock.writeLock().lock();
try {
    if (isClosingOrDeleting) {
        log.warn("[{}] Topic is already being closed or deleted", topic);
        return FutureUtil.failedFuture(new TopicFencedException("Topic is already fenced"));
    }

@codelipenghui
Contributor

  1. deadlock: "thread-topic-deleting" waits for the compaction task to finish, while "thread-compaction" continuously reconnects.

Actually, it is safe to give up the reconnect from the compaction reader if the topic is being deleted.

Right now, the solution is to unfence the topic if there is an ongoing compaction task. But unfencing will also make producers/consumers from the user side able to connect to this topic, which is not expected.

So, I think the correct fix should be to find a way to expose the "Topic is deleting" exception to the raw reader so that the reader can optionally give up the retry.

@poorbarcode
Contributor Author

@codelipenghui

Actually, it is safe to give up the reconnect from the compaction reader if the topic is being deleted.
Right now, the solution is to unfence the topic if there is an ongoing compaction task. But unfencing will also make producers/consumers from the user side able to connect to this topic, which is not expected.
So, I think the correct fix should be to find a way to expose the "Topic is deleting" exception to the raw reader so that the reader can optionally give up the retry.

The solution you mentioned is exactly PIP-387 (Modify interface TopicCompactionService to support cancelling an in-progress compaction task), which has already been canceled 😂

@gaoran10
Contributor

gaoran10 commented Jun 3, 2025

We'd better make the compaction task aware of the topic deletion operation and cancel the task.

@poorbarcode poorbarcode requested a review from lhotari June 3, 2025 14:27
@poorbarcode
Contributor Author

@codelipenghui @gaoran10 @lhotari I have switched to a new solution: cancel the pending compaction when deleting the topic. Please review again. A rough sketch of the idea is below.
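
A rough sketch of the new approach; the field and helper names are illustrative, not the exact PersistentTopic change:

// Sketch of the idea only: when the topic enters deletion, fail the in-progress compaction
// instead of waiting for it, so deleting the "__compaction" cursor no longer blocks forever.
CompletableFuture<Void> deleteTopic() {
    markTopicAsDeleting();                                   // fence + set the deleting flag (assumed helper)
    CompletableFuture<Long> inProgress = currentCompaction;  // assumed field tracking the running task
    if (inProgress != null && !inProgress.isDone()) {
        inProgress.completeExceptionally(
                new TopicFencedException("Topic is being deleted, cancel the pending compaction"));
    }
    return unsubscribeAllAndDeleteLedgers();                 // continue the normal deletion flow (assumed helper)
}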

@poorbarcode poorbarcode requested a review from lhotari June 4, 2025 15:09
@poorbarcode poorbarcode requested a review from codelipenghui June 4, 2025 17:17
@poorbarcode poorbarcode merged commit 37e160f into apache:master Jun 5, 2025
53 checks passed
Labels
doc-not-needed Your PR changes do not impact docs ready-to-test release/3.0.13 release/3.3.8 release/4.0.6 type/bug The PR fixed a bug or issue reported a bug
5 participants