-
Notifications
You must be signed in to change notification settings - Fork 3.6k
[fix][broker]Fix deadlock when compaction and topic deletion execute concurrently #24366
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fix][broker]Fix deadlock when compaction and topic deletion execute concurrently #24366
Conversation
/pulsarbot rerun-failure-checks |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #24366 +/- ##
============================================
+ Coverage 73.57% 74.26% +0.68%
- Complexity 32624 32692 +68
============================================
Files 1877 1867 -10
Lines 139502 145260 +5758
Branches 15299 16610 +1311
============================================
+ Hits 102638 107876 +5238
+ Misses 28908 28853 -55
- Partials 7956 8531 +575
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
@poorbarcode Does this resolve #24148? |
pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentTopic.java
Outdated
Show resolved
Hide resolved
why does the current code (before this PR) handle the pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentTopic.java Lines 1478 to 1483 in 81c94c8
|
pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentTopic.java
Outdated
Show resolved
Hide resolved
Actually, we are safe to give up the reconnect from the compaction reader if the topic is deleting. Now, the solution is to unfence the topic if there is ongoing compaction task. But it unfencing will also make producers/consumers from user side available to connect to this topic which should not be expected. So, I think the correct fix should be find a way to expose the "Topic is deleting Exception" to Raw reader and the reader can optionally give up the retry. |
The solution you mentioned is what PIP-387: Modify interface TopicCompactionService to support cancelling an in-progress compaction task, which has been canceled 😂 |
We'd better make the compaction task aware of the topic deletion operation and cancel the task. |
@codelipenghui @gaoran10 @lhotari I have used a new solution to solve the issue: cancel the pending compaction when deleting the topic, please review again |
Motivation
Issue 1: deadlock when compaction and topic deletion execute concurrently
Reproduction steps of the deadlock( see also the new test
testConcurrentCompactionAndTopicDelete
)thread-compaction
: create a raw reader.thread-topic-deleting
: mark the topic as deleted and disconnect clients, including the compaction reader.thread-compaction
: The raw reader attempts to reconnect due to the disconnection in step "2-1".thread-topic-deleting
: unsubscribe all subscriptions that contain "__compaction".thread-topic-deleting
: deleting "__compaction" cursor will wait for the in-progress compaction task.thread-compaction
: the raw read can not connect successfully anymore because the topic was marked as "fenced && isDeleting".deadlock
: "thread-topic-deleting" waiting for compaction task being finished, thread-compaction continuously reconnects.Note:
thread-compaction
is stuck when the first phase of compaction callingreader.readNextAsync()
at step 6, the deadlock can be solved after{brokerServiceCompactionPhaseOneLoopTimeInSeconds)
secondsreader.getLastMessageIdAsync
... the deadlock will persist forever.You can reproduce the issue by the new test
testConcurrentCompactionAndTopicDelete
Issue 2-1: The responded compostable future of
Consumer.receiveAsync()
never complete if the topic is deletedBackground:
topic auto creation
: Client triggers a new topic creation if users delete a topic with-f
.topic auto creation
: Client stops retrying reconnection and sets the consumer's state toFailed
.Failed
, but it does not trigger a completion for the pending receive requests, leading to the pending receive future never completing. See https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConsumerImpl.java#L988-L999You can reproduce the issue by the new test
testReceiveWillDoneAfterTopicDeleted
Issue 2-2: resource leak that contains a dead letter producer and a retry letter producer.
Regarding
issue 2-1
, the client also forgot to release the related dead letter producer and the retry letter producer, which also caused a resource leak. Since there are other issues that prevent the new test from passing, I will write a separate PR to add the test to cover this case, which also contains another fix.Issue 3: Even though the client received a
TopicNotFound
error, it continuously retries.Root cause: Pulsar client prevents retrying when it receives a
TopicNotFound
error from theSubscribe
command, but theTopicNotFound
can also be received from the processget broker connections of the target topic
.You can reproduce the issue by the new test
testReceiveWillDoneAfterTopicDeleted
Modifications
consumer.close()
instead of calling a part of the stats modifying, which is more complete.Documentation
doc
doc-required
doc-not-needed
doc-complete
Matching PR in forked repository
PR in forked repository: x