mirror of
https://github.com/facebook/rocksdb
synced 2026-05-24 09:29:21 +00:00
Summary:
Add a read-path optimization that converts contiguous point tombstones into range tombstones during forward/reverse iteration. When a configurable threshold of consecutive point deletions (kTypeDeletion, kTypeDeletionWithTimestamp, kTypeSingleDeletion — with no live keys between them) is detected, a range tombstone covering `[first_tombstone_key, next_live_key)` is inserted into the active mutable memtable. This benefits future iterators by enabling efficient skipping via range tombstone fragmentation.
If there is a memtable switch during the read iteration, then the range deletion entry is discarded.
The inserted range tombstones are logically redundant (they don't delete anything that isn't already deleted by point tombstones), skip WAL (they're a derived optimization regenerated by future reads on crash), and use the max tombstone sequence number so they don't interfere with newer writes.
## Key changes
- **New option `min_tombstones_for_range_conversion`** (`AdvancedColumnFamilyOptions`): Threshold of contiguous point tombstones before converting to a range tombstone. Default 0 (disabled). Dynamically changeable via `SetOptions()`.
- **`DBIter` tracking logic** (`db/db_iter.cc`): Tracks contiguous tombstones during `FindNextUserEntryInternal()` (forward) and `PrevInternal()` (reverse). When a live key terminates a run that meets the threshold, `MaybeInsertRangeTombstone()` inserts `[first_tombstone, live_key)` into the active memtable.
- **`FindValueForCurrentKey` `found_visible` output** (`db/db_iter.cc`): Distinguishes "key deleted at this snapshot" from "no visible entries" so reverse tracking doesn't treat post-snapshot keys as tombstones.
- **`IterKey::Swap()`** (`db/dbformat.h`): Efficiently tracks reverse tombstone run end keys without extra allocations.
- **`MemTable::AddLogicallyRedundantRangeTombstone()`** (`db/memtable.cc`): Concurrent-safe range tombstone insertion into the active memtable. Range tombstone skiplist always uses concurrent inserts.
- **`ConstructFragmentedRangeTombstones` race fix** (`db/db_impl/db_impl_write.cc`): Moved after `MarkImmutable()` to prevent lost entries.
- **`MarkImmutable` ordering fix** (`db/memtable_list.cc`): Called before `current_->Add()` to close a race window.
- **Prefix filter awareness** (`db/db_iter.cc`): Tombstone tracking scoped to the seek prefix when prefix filtering is active. See dedicated section below.
- **Transaction awareness** (`db/db_iter.cc`): Tombstones with `seq > snapshot` excluded from tracking. `min_uncommitted` guard uses `insert_seq` (which may be bumped to `earliest_seq`) instead of `range_tomb_max_seq_`. See dedicated section below.
- **Duplicate range check**: Skips insertion if the memtable already covers `[start, end)`.
- **New statistics**: `READ_PATH_RANGE_TOMBSTONES_INSERTED` and `READ_PATH_RANGE_TOMBSTONES_DISCARDED`.
- **Memtable MultiGet batch lookup** (`memtable/inlineskiplist.h`, `db/memtable.cc`): `InlineSkipList::MultiGet()` with cached search path ("finger") for sorted key lookups.
- **New option `memtable_batch_lookup_optimization`** (`AdvancedColumnFamilyOptions`): Enables batch lookup for memtable MultiGet. Default false. Immutable.
## Deciding Range Tombstone Seqno
- The range tombstone is inserted with `insert_seq = max(range_tomb_max_seq_, earliest_seq)` where `range_tomb_max_seq_` is the maximum sequence number across all point tombstones in the contiguous run, and `earliest_seq` is the memtable's earliest sequence number. This preserves the memtable's `earliest_seqno_` invariant.
- If the iterator's snapshot sequence (`sequence_`) predates the memtable's `earliest_seq`, insertion is skipped entirely to avoid unintentionally covering entries between `sequence_` and `earliest_seq`.
## ConstructFragmentedRangeTombstones Race Fix
- `MarkImmutable()` and `ConstructFragmentedRangeTombstones()` are now called before `mutex_.Lock()` in `SwitchMemtable`, keeping this work outside the DB mutex. `MarkImmutable()` blocks concurrent `AddLogicallyRedundantRangeTombstone()` calls via `immutable_mutex_`, ensuring no range tombstones are inserted after the fragmented list is built. `MarkImmutable()` is idempotent, so `MemTableList::Add()` calling it again inside the mutex is harmless.
## Prefix Filter Safety
- When prefix filtering is active, the BBTI bloom filter may reject SST files outside the seek prefix, but the memtable (no bloom filter) returns keys across prefix boundaries. Tombstone tracking is scoped to the seek prefix so that converted range tombstones cannot cover live keys hidden in filtered files.
- `total_order_seek=true` disables prefix filtering — all files are visible, so tombstones safely span prefix boundaries.
- **Behavior change**: Seeking to an out-of-domain key with `total_order_seek=false` now treats it as total-order (prefix_ not set). When `prefix_same_as_start=true`, iterating past an out-of-domain key cleanly invalidates the iterator instead of calling `Transform()` on it (which was UB in release builds with `FixedPrefixTransform`). This is a requirement because an incorrect iterator scan could lead to a range tombstone covering a live key.
## Transaction Support
- Tombstones written by the transaction's own uncommitted writes (sequence > snapshot) are now excluded from contiguous tombstone tracking entirely at the tracking site in `FindNextUserEntryInternal()` and `PrevInternal()`. Previously, tracking relied on a `min_uncommitted` check at insertion time, but this was insufficient — a transaction's own Delete with `seq > snapshot` could extend a run of committed tombstones, and the resulting range tombstone would cover data visible to other snapshots.
- The fix skips any tombstone with `ikey_.sequence > sequence_` during tracking. If a transaction-owned tombstone appears mid-run, it flushes the accumulated committed run first, then resets tracking. This ensures only tombstones visible to the current snapshot are ever converted.
- Both WritePrepared and WriteUnprepared transactions are supported with dedicated test coverage:
- **WritePrepared**: When tombstones are committed before `Prepare()`, their seqnos are below `min_uncommitted` and insertion proceeds safely. When `Prepare()` happens first, tombstone seqnos exceed `min_uncommitted` and insertion is blocked.
- **WriteUnprepared**: Multiple unprepared batches with different seqno ranges are handled correctly. Own transaction Deletes that extend a committed tombstone run block insertion of the entire run. After rollback, data correctness is verified.
## UDT support
- When user-defined timestamps (UDT) are enabled, keys include an 8-byte timestamp suffix. The comparator, Put/Delete APIs, and ReadOptions all require timestamps.
- Forward exhaustion with UDT: `iterate_upper_bound_` is a plain user key without a timestamp suffix. It is padded with min timestamp via `AppendKeyWithMinTimestamp()` so it sorts after all entries with this user key, preserving the exclusive bound semantics.
- Reverse exhaustion with UDT: The end key comes from either the previous live key (which already has a proper timestamp suffix) or the seek target set by `SetSavedKeyToSeekForPrevTarget()` (which appends a timestamp via `SetInternalKey(..., timestamp_ub_)` and `UpdateInternalKey(..., ts)`). In both cases, the end key is already properly timestamped, so no additional padding is needed unlike forward exhaustion.
- Contiguous tombstone detection works correctly with UDT because the underlying `kTypeDeletionWithTimestamp` entries are tracked the same way as `kTypeDeletion`.
## Concurrent Iterators
- Should concurrent iterators happen to read the range at the same time, both will produce the same range and seqno entry. Only one will be accepted by the skip list and the others will be rejected. Future iterators will read the range and not even attempt to insert the range.
- There is nothing preventing similar ranges from being inserted however. Two iterators can produce overlapping ranges, but this protection would be complicated to implement and there is no evidence that it is a likely scenario yet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14448
Test Plan:
- **Unit tests** (`db/db_iterator_test.cc`): `ReadPathRangeTombstoneTest` parameterized by forward/reverse with cases for basic insertion, non-contiguous (below threshold), memtable switch, exhausted iterator with/without bounds, direction change, mixed Delete/SingleDelete, single-delete-only runs, snapshot predating memtable, block cache tier incomplete, skip when covered by existing range, UDT basic scan, UDT exhaustion, prefix filter cross-prefix scan (`PrefixFilterCrossPrefixScanCoversLiveKey` with default/total_order_seek/prefix_same_as_start variants), stale ikey from forward-then-reverse scan (`StaleIkeyFromForwardThenReverse`), and reseek stale ikey (`ReseekStaleIkey`).
- **Concurrency test** (`db/db_test2.cc`): `DBTestConcurrentRangeTombstoneConversions` parameterized by `(allow_concurrent_memtable_write, min_tombstones_for_range_conversion)` with mixed writers, deleters, range deleters, and concurrent forward/reverse readers.
- **Transaction tests** (`utilities/transactions/write_prepared_transaction_test.cc`, `write_unprepared_transaction_test.cc`): Tests for WritePrepared (insertion allowed when tombstones committed before prepare, blocked when after; seqno bump shadowing prepared writes `RangeTombstoneSeqnoBumpShadowsPreparedWrite`) and WriteUnprepared (multiple batches, extended visibility with CalcMaxVisibleSeq, own deletions with rollback).
- **IterKey::Swap tests** (`db/dbformat_test.cc`): `IterKeySwapTest` parameterized over `(key_len, copy, use_secondary)` × 2 covering all inline/heap/pinned/secondary combinations.
- **InlineSkipList MultiGet tests** (`memtable/inlineskiplist_test.cc`): Basic, exact matches, empty, single key, randomized validation against `std::set::lower_bound`, duplicate keys with callback walk, and concurrent MultiGet with read-after-write consistency.
- **Memtable MultiGet tests** (`db/db_basic_test.cc`): Batch lookup, overwrite, flush, merge, disabled by default, paranoid checks, and snapshot tests.
- **Stress test coverage**: `min_tombstones_for_range_conversion` and `memtable_batch_lookup_optimization` options added to `db_crashtest.py` and `db_stress` flags.
- `make check` passes all tests.
### Benchmark results
Tombstones scattered randomly in clusters via `seek_nexts_to_delete` for realistic workloads.
**DB Setup A (scattered deletes, 100 per seek)**:
```
# Step 1: Create and compact
./db_bench --benchmarks=fillseq,compact --seed=1 --compression_type=none --num=1000000 --db=<DB>
# Step 2: Scatter tombstones (5000 seeks × 100 deletes ≈ 500k tombstones)
./db_bench --benchmarks=seekrandom,flush --seed=1 --compression_type=none --num=2000 \
--seek_nexts=0 --seek_nexts_to_delete=100 --use_existing_db=1 --threads=1 --db=<DB>
```
**DB Setup A2 (scattered deletes, 8 per seek)**:
```
# Step 1: Create and compact
./db_bench --benchmarks=fillseq,compact --seed=1 --compression_type=none --num=1000000 --db=<DB>
# Step 2: Scatter tombstones (5000 seeks × 8 deletes ≈ 40k tombstones)
./db_bench --benchmarks=seekrandom,flush --seed=1 --compression_type=none --num=2000 \
--seek_nexts=0 --seek_nexts_to_delete=8 --use_existing_db=1 --threads=1 --db=<DB>
```
**DB Setup B (no deletes)**:
```
./db_bench --benchmarks=fillseq,compact --seed=1 --compression_type=none --num=1000000 \
[--key_size=100] --db=<DB>
```
**Read workload** (same for all):
```
./db_bench --benchmarks=seekrandom --seek_nexts=100 --threads=8 \
--reverse_iterator={true,false} --seed=1 --use_existing_db=1 \
--compression_type=none --num=1000000 --duration=10 \
--disable_auto_compactions \
[--key_size=100] [--min_tombstones_for_range_conversion=X] --db=<DB_COPY>
```
Each workload averaged over 3 runs.
**Table 1: seekrandom forward, scattered deletes (2000 seeks × 100 deletes/seek)**
| Variant | avg ops/s | % vs main |
|---------|-----------|-----------|
| main | 2,895 | - |
| threshold=0 | 2,869 | -0.9% |
| threshold=8 | 287,334 | +9,824% |
**Table 2: seekrandom reverse, scattered deletes (2000 seeks × 100 deletes/seek)**
| Variant | avg ops/s | % vs main |
|---------|-----------|-----------|
| main | 544 | - |
| threshold=0 | 548 | +0.7% |
| threshold=8 | 206,491 | +37,860% |
**Table 3: seekrandom forward, scattered deletes (2000 seeks × 8 deletes/seek)**
| Variant | avg ops/s | % vs main |
|---------|-----------|-----------|
| main | 194,049 | - |
| threshold=0 | 195,703 | +0.9% |
| threshold=8 | 310,740 | +60.1% |
**Table 4: seekrandom reverse, scattered deletes (2000 seeks × 8 deletes/seek)**
| Variant | avg ops/s | % vs main |
|---------|-----------|-----------|
| main | 63,854 | - |
| threshold=0 | 69,266 | +8.5% |
| threshold=8 | 218,101 | +241.6% |
**Table 5: seekrandom forward, no deletes (regression check)**
| Variant | key=16B avg ops/s | % vs main | key=100B avg ops/s | % vs main |
|---------|-------------------|-----------|---------------------|-----------|
| main | 330,901 | - | 236,048 | - |
| threshold=0 | 328,398 | -0.8% | 238,055 | +0.9% |
| threshold=8 | 332,539 | +0.5% | 233,776 | -1.0% |
**Table 6: seekrandom reverse, no deletes (regression check)**
| Variant | key=16B avg ops/s | % vs main | key=100B avg ops/s | % vs main |
|---------|-------------------|-----------|---------------------|-----------|
| main | 261,445 | - | 192,177 | - |
| threshold=0 | 265,020 | +1.4% | 191,616 | -0.3% |
| threshold=8 | 250,881 | -4.0% | 189,239 | -1.5% |
Reviewed By: xingbowang
Differential Revision: D96203950
Pulled By: joshkang97
fbshipit-source-id: 06ba66ebde3c355f04671d1e681f1b1586e8751d
|
||
|---|---|---|
| .. | ||
| benchmark/src/main/java/org/rocksdb/benchmark | ||
| crossbuild | ||
| jmh | ||
| rocksjni | ||
| samples/src/main/java | ||
| src | ||
| CMakeLists.txt | ||
| GetPutBenchmarks.md | ||
| HISTORY-JAVA.md | ||
| jdb_bench.sh | ||
| Makefile | ||
| pmd-rules.xml | ||
| pom.xml.template | ||
| RELEASE.md | ||
| spotbugs-exclude.xml | ||
| understanding_options.md | ||