Skip to content

Correct TermOrdValComparator competitive iterator intoBitSet implementation #14523

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Apr 19, 2025

Conversation

benwtrent
Copy link
Member

@benwtrent benwtrent commented Apr 18, 2025

Similar to @ChrisHegarty 's change for the count fix, this will move up the max check to before the into bit set. It seems like we could be calling intobitset erroneously on some edge cases.

Additionally, TermOrdValComparator's competitive iterator will ensure that the docsWithField iterator (used when not all docs have a given field), is at least the baseline of the scoring window when doing intoBitSet.

closes: #14517

@benwtrent benwtrent changed the title intobitset term ord comp Shore up some intoBitSet impls and add paranoid protections Apr 18, 2025
Copy link
Contributor

@ChrisHegarty ChrisHegarty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -434,6 +434,9 @@ public void intoBitSet(int upTo, FixedBitSet bitSet, int offset) throws IOExcept
if (competitiveIterator.docID() < doc) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is strange. All other intoBitSet impls that I can find call the competitiveIterator to advance to at least offset, not to some internal doc. I suppose NumericComparator is unique?

@benwtrent benwtrent changed the title Shore up some intoBitSet impls and add paranoid protections Correct TermOrdValComparator competitive iterator intoBitSet implementation Apr 19, 2025
Copy link
Contributor

@gf2121 gf2121 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you

LeafCollector leafCollector = collector.getLeafCollector(context);
BulkScorer bulkScorer = weight.bulkScorer(context);
// We need to split on this specific doc ID so that the current doc of the competitive
// iterator
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: maybe format this single-line word better?

@benwtrent benwtrent merged commit ca6879d into apache:main Apr 19, 2025
7 checks passed
@benwtrent benwtrent deleted the bugfix/intobitset-term-ord-comp branch April 19, 2025 16:08
benwtrent added a commit that referenced this pull request Apr 19, 2025
…tation (#14523)

Similar to @ChrisHegarty 's change for the count fix, this will move up the max check to before the into bit set. It seems like we could be calling intobitset erroneously on some edge cases.

Additionally, TermOrdValComparator's competitive iterator will ensure that the docsWithField iterator (used when not all docs have a given field), is at least the baseline of the scoring window when doing intoBitSet.

closes: #14517
benwtrent added a commit that referenced this pull request Apr 19, 2025
…tation (#14523)

Similar to @ChrisHegarty 's change for the count fix, this will move up the max check to before the into bit set. It seems like we could be calling intobitset erroneously on some edge cases.

Additionally, TermOrdValComparator's competitive iterator will ensure that the docsWithField iterator (used when not all docs have a given field), is at least the baseline of the scoring window when doing intoBitSet.

closes: #14517
if (upTo <= doc) {
return;
}
// Optimize the case when intersecting the competitive iterator is expensive, which is when it
// hasn't nailed down a disjunction of competitive terms yet.
if (disjunction == null) {
if (docsWithField != null) {
// we need to be absolutely sure that the iterator is at least at offset
if (docsWithField.docID() < offset) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The contract from DocIdSetIterator#intoBitSet requires starting from the current doc rather than the first doc on or after offset, so it would be more correct to advance to doc here. In practice, it wouldn't make a difference today since all call sites first advance iterators to offset before calling intoBitSet, but it may not always be the case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The contract from DocIdSetIterator#intoBitSet requires starting from the current doc rather than the first doc on or after offset, so it would be more correct to advance to doc here.

So things might got broken when current doc is after the first doc like

CompetitiveIterator competitiveIterator = CompetitiveIterator.of(1, 2, 3, 4);
competitiveIterator.advance(2); // advance current doc to 2
competitiveIterator.update(0, 2000); // update to doc value
final int offset = 1;
if (competitiveIterator.docID() < offset) {
  competitiveIterator.advance(offset);
}
competitiveIterator.intoBitSet(upTo, bitset, offset); // we should start from 2 instead of 1.

? Nice catch!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly.

jpountz pushed a commit to jpountz/lucene that referenced this pull request Apr 24, 2025
…tation (apache#14523)

Similar to @ChrisHegarty 's change for the count fix, this will move up the max check to before the into bit set. It seems like we could be calling intobitset erroneously on some edge cases.

Additionally, TermOrdValComparator's competitive iterator will ensure that the docsWithField iterator (used when not all docs have a given field), is at least the baseline of the scoring window when doing intoBitSet.

closes: apache#14517
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Strange stack traces for new bitset focused doc iterators
4 participants