Commit graph

467 commits

Author SHA1 Message Date
Xingbo Wang
21723bbbef Add async WAL precreation
Some checks are pending
facebook/rocksdb/pr-jobs / build-linux-cmake-with-benchmark-no-thread-status (2) (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-linux-cmake-with-benchmark-no-thread-status (3) (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-linux-encrypted_env-no_compression (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-macos-cmake (1) (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-linux-release (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-linux-clang-13-no_test_run (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-linux-clang-21-no_test_run (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-linux-gcc-14-no_test_run (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-linux-unity-and-headers (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-macos-cmake (2) (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-linux-mini-crashtest (240, blackbox_crash_test) (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-linux-mini-crashtest (480, blackbox_crash_test_with_atomic_flush) (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-linux-clang21-asan-ubsan (0) (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-linux-clang21-asan-ubsan (1) (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-linux-clang21-asan-ubsan (2) (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-linux-clang21-mini-tsan (0) (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-linux-clang21-mini-tsan (1) (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-linux-clang21-mini-tsan (2) (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-linux-static_lib-alt_namespace-status_checked (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-macos (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-windows-vs2022 (false, db_test, db_test) (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-windows-vs2022 (true, , java) (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-linux-java (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-linux-java-static (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-macos-java (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-linux-java-pmd (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-windows-vs2022 (false, arena_test,db_basic_test,db_test2,db_merge_operand_test,bloom_test,c_test,coding_test,crc32c_test,dynamic_bloom_test,env_basic_test,env_test,hash_test,random_test, other) (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-macos-java-static (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-macos-java-static-universal (push) Blocked by required conditions
facebook/rocksdb/pr-jobs / build-linux-arm (push) Blocked by required conditions
Summary:
- Add experimental immutable `DBOptions::async_wal_precreate` to reserve and open one future WAL on a background HIGH-priority task, with sanitization that disables the optimization when WAL recycling is configured.
- Split WAL creation into open/preallocate and start phases so `SwitchMemtable()` can consume a prepared WAL after writing normal WAL metadata, wait for in-flight precreation, fall back to synchronous creation, and delete an unstarted prepared WAL on start failure.
- Keep WAL numbering, close, recovery, and read-only open safe for empty future WAL files left by async precreation; `error_if_wal_file_exists=true` now rejects non-empty WALs while tolerating empty WALs.
- Add public option plumbing for the C API, options parsing/stringification, random option testing, `db_bench`, `db_stress`, and crash-test configuration.
- Add WAL precreate statistics counters plus Java `TickerType`/JNI mappings, and update C++, C, and Java read-only-open documentation for the empty-WAL behavior.
- Add focused WAL/option/C/Java tests for async precreate ready/wait/failure/recovery paths, read-only WAL detection, option sanitization, and API plumbing, plus write-flow docs and unreleased history entries for the new feature and behavior change.

PR https://github.com/facebook/rocksdb/pull/14738

Reviewed By: pdillinger

Differential Revision: D105020559

fbshipit-source-id: 5059b424702e021abb8de65ceeb6d3b975280ffc
2026-05-15 10:59:47 -07:00
Peter Dillinger
87c554b492 Persist compacted manifest size for auto-tuning across DB::Open (#14725)
Some checks failed
facebook/rocksdb/pr-jobs / build-linux-cmake-with-folly-coroutines (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-linux-cmake-with-benchmark-no-thread-status (0) (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-linux-cmake-with-benchmark-no-thread-status (1) (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-linux-cmake-with-benchmark-no-thread-status (2) (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-linux-cmake-with-benchmark-no-thread-status (3) (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-linux-encrypted_env-no_compression (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-linux-release (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-linux-clang-13-no_test_run (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-linux-clang-21-no_test_run (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-linux-clang21-asan-ubsan (0) (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-linux-clang21-asan-ubsan (1) (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-linux-clang21-asan-ubsan (2) (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-linux-clang21-mini-tsan (0) (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-linux-clang21-mini-tsan (1) (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-macos-cmake (0) (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-linux-clang21-mini-tsan (2) (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-linux-static_lib-alt_namespace-status_checked (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-macos (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-macos-cmake (1) (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-macos-cmake (2) (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-windows-vs2022 (false, arena_test,db_basic_test,db_test2,db_merge_operand_test,bloom_test,c_test,coding_test,crc32c_test,dynamic_bloom_test,env_basic_test,env_test,hash_test,random_test, other) (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-windows-vs2022 (false, db_test, db_test) (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-windows-vs2022 (true, , java) (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-linux-java (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-linux-java-static (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-macos-java (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-macos-java-static (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-macos-java-static-universal (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-linux-java-pmd (push) Has been cancelled
facebook/rocksdb/pr-jobs / build-linux-arm (push) Has been cancelled
Summary:
last_compacted_manifest_file_size_ drives TuneMaxManifestFileSize() to compute the manifest rotation threshold, but it started at 0 on every DB::Open and was only populated after the first manifest rotation. This is really only a problem with reuse_manifest_on_open, because no fresh manifest is created on open.

Add a new forward-compatible (safe-to-ignore) MANIFEST tag kLastCompactedManifestFileSize that records the approximate compacted manifest size at the end of WriteCurrentStateToManifest. During recovery, the value is loaded and used to immediately tune the rotation threshold.

The record includes a rough estimate of its own overhead (~15 bytes) and must be the last record written by WriteCurrentStateToManifest for accurate estimation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14725

Test Plan:
Extended AutoTuneManifestSize in db_etc3_test to close and reopen with reuse_manifest_on_open after establishing a known auto-tuning state. Verifies that the manifest file number is preserved (no spurious rotation) and that subsequent CF additions don't trigger rotation -- proving the persisted compacted size keeps the tuned threshold correct. Verified the test fails when the recovery loading is disabled.

Relax a fragile Java test that was dependent on the exact size of the manifest file.

SHORT_TEST=1 ./tools/check_format_compatible.sh

Reviewed By: anand1976

Differential Revision: D104464522

Pulled By: pdillinger

fbshipit-source-id: 4f5d22d2e149bd40a523ee11780e5e3344803c19
2026-05-13 18:31:49 -07:00
Danny Chen
2e7cf42cda Add MANIFEST_VALIDATION_FAILURE_COUNT statistic (#14657)
Summary:
CONTEXT: The manifest validation on close feature (verify_manifest_content_on_close) detects corruption but does not increment any statistics counter, making it harder to monitor in production.

WHAT: Add a new ticker MANIFEST_VALIDATION_FAILURE_COUNT that is incremented each time content validation detects manifest corruption during DB::Close(). The counter fires per corruption detection, so it can increment up to 2 times per close (once on initial check, once after rewrite attempt). Updated all existing manifest validation tests to verify the counter value.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14657

Test Plan:
- All 7 manifest validation tests pass with new stat assertions
- 5x repeat with COERCE_CONTEXT_SWITCH=1 shows no flakiness
- Full version_set_test suite (212 tests) passes

Reviewed By: anand1976

Differential Revision: D102404260

Pulled By: dannyhchen

fbshipit-source-id: 21a0aa1ad8de12a935caf5642e41ccf2a47b46d9
2026-04-24 15:22:15 -07:00
Josh Kang
3f51c0a185 Convert sequential single deletes into range tombstones (#14448)
Summary:
Add a read-path optimization that converts contiguous point tombstones into range tombstones during forward/reverse iteration. When a configurable threshold of consecutive point deletions (kTypeDeletion, kTypeDeletionWithTimestamp, kTypeSingleDeletion — with no live keys between them) is detected, a range tombstone covering `[first_tombstone_key, next_live_key)` is inserted into the active mutable memtable. This benefits future iterators by enabling efficient skipping via range tombstone fragmentation.

If there is a memtable switch during the read iteration, then the range deletion entry is discarded.

The inserted range tombstones are logically redundant (they don't delete anything that isn't already deleted by point tombstones), skip WAL (they're a derived optimization regenerated by future reads on crash), and use the max tombstone sequence number so they don't interfere with newer writes.

## Key changes
- **New option `min_tombstones_for_range_conversion`** (`AdvancedColumnFamilyOptions`): Threshold of contiguous point tombstones before converting to a range tombstone. Default 0 (disabled). Dynamically changeable via `SetOptions()`.
- **`DBIter` tracking logic** (`db/db_iter.cc`): Tracks contiguous tombstones during `FindNextUserEntryInternal()` (forward) and `PrevInternal()` (reverse). When a live key terminates a run that meets the threshold, `MaybeInsertRangeTombstone()` inserts `[first_tombstone, live_key)` into the active memtable.
- **`FindValueForCurrentKey` `found_visible` output** (`db/db_iter.cc`): Distinguishes "key deleted at this snapshot" from "no visible entries" so reverse tracking doesn't treat post-snapshot keys as tombstones.
- **`IterKey::Swap()`** (`db/dbformat.h`): Efficiently tracks reverse tombstone run end keys without extra allocations.
- **`MemTable::AddLogicallyRedundantRangeTombstone()`** (`db/memtable.cc`): Concurrent-safe range tombstone insertion into the active memtable. Range tombstone skiplist always uses concurrent inserts.
- **`ConstructFragmentedRangeTombstones` race fix** (`db/db_impl/db_impl_write.cc`): Moved after `MarkImmutable()` to prevent lost entries.
- **`MarkImmutable` ordering fix** (`db/memtable_list.cc`): Called before `current_->Add()` to close a race window.
- **Prefix filter awareness** (`db/db_iter.cc`): Tombstone tracking scoped to the seek prefix when prefix filtering is active. See dedicated section below.
- **Transaction awareness** (`db/db_iter.cc`): Tombstones with `seq > snapshot` excluded from tracking. `min_uncommitted` guard uses `insert_seq` (which may be bumped to `earliest_seq`) instead of `range_tomb_max_seq_`. See dedicated section below.
- **Duplicate range check**: Skips insertion if the memtable already covers `[start, end)`.
- **New statistics**: `READ_PATH_RANGE_TOMBSTONES_INSERTED` and `READ_PATH_RANGE_TOMBSTONES_DISCARDED`.
- **Memtable MultiGet batch lookup** (`memtable/inlineskiplist.h`, `db/memtable.cc`): `InlineSkipList::MultiGet()` with cached search path ("finger") for sorted key lookups.
- **New option `memtable_batch_lookup_optimization`** (`AdvancedColumnFamilyOptions`): Enables batch lookup for memtable MultiGet. Default false. Immutable.

## Deciding Range Tombstone Seqno
- The range tombstone is inserted with `insert_seq = max(range_tomb_max_seq_, earliest_seq)` where `range_tomb_max_seq_` is the maximum sequence number across all point tombstones in the contiguous run, and `earliest_seq` is the memtable's earliest sequence number. This preserves the memtable's `earliest_seqno_` invariant.
- If the iterator's snapshot sequence (`sequence_`) predates the memtable's `earliest_seq`, insertion is skipped entirely to avoid unintentionally covering entries between `sequence_` and `earliest_seq`.

## ConstructFragmentedRangeTombstones Race Fix
- `MarkImmutable()` and `ConstructFragmentedRangeTombstones()` are now called before `mutex_.Lock()` in `SwitchMemtable`, keeping this work outside the DB mutex. `MarkImmutable()` blocks concurrent `AddLogicallyRedundantRangeTombstone()` calls via `immutable_mutex_`, ensuring no range tombstones are inserted after the fragmented list is built. `MarkImmutable()` is idempotent, so `MemTableList::Add()` calling it again inside the mutex is harmless.

## Prefix Filter Safety
- When prefix filtering is active, the BBTI bloom filter may reject SST files outside the seek prefix, but the memtable (no bloom filter) returns keys across prefix boundaries. Tombstone tracking is scoped to the seek prefix so that converted range tombstones cannot cover live keys hidden in filtered files.
- `total_order_seek=true` disables prefix filtering — all files are visible, so tombstones safely span prefix boundaries.

- **Behavior change**: Seeking to an out-of-domain key with `total_order_seek=false` now treats it as total-order (prefix_ not set). When `prefix_same_as_start=true`, iterating past an out-of-domain key cleanly invalidates the iterator instead of calling `Transform()` on it (which was UB in release builds with `FixedPrefixTransform`). This is a requirement because an incorrect iterator scan could lead to a range tombstone covering a live key.

## Transaction Support
- Tombstones written by the transaction's own uncommitted writes (sequence > snapshot) are now excluded from contiguous tombstone tracking entirely at the tracking site in `FindNextUserEntryInternal()` and `PrevInternal()`. Previously, tracking relied on a `min_uncommitted` check at insertion time, but this was insufficient — a transaction's own Delete with `seq > snapshot` could extend a run of committed tombstones, and the resulting range tombstone would cover data visible to other snapshots.
- The fix skips any tombstone with `ikey_.sequence > sequence_` during tracking. If a transaction-owned tombstone appears mid-run, it flushes the accumulated committed run first, then resets tracking. This ensures only tombstones visible to the current snapshot are ever converted.
- Both WritePrepared and WriteUnprepared transactions are supported with dedicated test coverage:
  - **WritePrepared**: When tombstones are committed before `Prepare()`, their seqnos are below `min_uncommitted` and insertion proceeds safely. When `Prepare()` happens first, tombstone seqnos exceed `min_uncommitted` and insertion is blocked.
  - **WriteUnprepared**: Multiple unprepared batches with different seqno ranges are handled correctly. Own transaction Deletes that extend a committed tombstone run block insertion of the entire run. After rollback, data correctness is verified.

## UDT support
- When user-defined timestamps (UDT) are enabled, keys include an 8-byte timestamp suffix. The comparator, Put/Delete APIs, and ReadOptions all require timestamps.
- Forward exhaustion with UDT: `iterate_upper_bound_` is a plain user key without a timestamp suffix. It is padded with min timestamp via `AppendKeyWithMinTimestamp()` so it sorts after all entries with this user key, preserving the exclusive bound semantics.
- Reverse exhaustion with UDT: The end key comes from either the previous live key (which already has a proper timestamp suffix) or the seek target set by `SetSavedKeyToSeekForPrevTarget()` (which appends a timestamp via `SetInternalKey(..., timestamp_ub_)` and `UpdateInternalKey(..., ts)`). In both cases, the end key is already properly timestamped, so no additional padding is needed unlike forward exhaustion.
- Contiguous tombstone detection works correctly with UDT because the underlying `kTypeDeletionWithTimestamp` entries are tracked the same way as `kTypeDeletion`.

## Concurrent Iterators
- Should concurrent iterators happen to read the range at the same time, both will produce the same range and seqno entry. Only one will be accepted by the skip list and the others will be rejected. Future iterators will read the range and not even attempt to insert the range.
- There is nothing preventing similar ranges from being inserted however. Two iterators can produce overlapping ranges, but this protection would be complicated to implement and there is no evidence that it is a likely scenario yet.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14448

Test Plan:
- **Unit tests** (`db/db_iterator_test.cc`): `ReadPathRangeTombstoneTest` parameterized by forward/reverse with cases for basic insertion, non-contiguous (below threshold), memtable switch, exhausted iterator with/without bounds, direction change, mixed Delete/SingleDelete, single-delete-only runs, snapshot predating memtable, block cache tier incomplete, skip when covered by existing range, UDT basic scan, UDT exhaustion, prefix filter cross-prefix scan (`PrefixFilterCrossPrefixScanCoversLiveKey` with default/total_order_seek/prefix_same_as_start variants), stale ikey from forward-then-reverse scan (`StaleIkeyFromForwardThenReverse`), and reseek stale ikey (`ReseekStaleIkey`).
- **Concurrency test** (`db/db_test2.cc`): `DBTestConcurrentRangeTombstoneConversions` parameterized by `(allow_concurrent_memtable_write, min_tombstones_for_range_conversion)` with mixed writers, deleters, range deleters, and concurrent forward/reverse readers.
- **Transaction tests** (`utilities/transactions/write_prepared_transaction_test.cc`, `write_unprepared_transaction_test.cc`): Tests for WritePrepared (insertion allowed when tombstones committed before prepare, blocked when after; seqno bump shadowing prepared writes `RangeTombstoneSeqnoBumpShadowsPreparedWrite`) and WriteUnprepared (multiple batches, extended visibility with CalcMaxVisibleSeq, own deletions with rollback).
- **IterKey::Swap tests** (`db/dbformat_test.cc`): `IterKeySwapTest` parameterized over `(key_len, copy, use_secondary)` × 2 covering all inline/heap/pinned/secondary combinations.
- **InlineSkipList MultiGet tests** (`memtable/inlineskiplist_test.cc`): Basic, exact matches, empty, single key, randomized validation against `std::set::lower_bound`, duplicate keys with callback walk, and concurrent MultiGet with read-after-write consistency.
- **Memtable MultiGet tests** (`db/db_basic_test.cc`): Batch lookup, overwrite, flush, merge, disabled by default, paranoid checks, and snapshot tests.
- **Stress test coverage**: `min_tombstones_for_range_conversion` and `memtable_batch_lookup_optimization` options added to `db_crashtest.py` and `db_stress` flags.
- `make check` passes all tests.

### Benchmark results

Tombstones scattered randomly in clusters via `seek_nexts_to_delete` for realistic workloads.

**DB Setup A (scattered deletes, 100 per seek)**:
```
# Step 1: Create and compact
./db_bench --benchmarks=fillseq,compact --seed=1 --compression_type=none --num=1000000 --db=<DB>
# Step 2: Scatter tombstones (5000 seeks × 100 deletes ≈ 500k tombstones)
./db_bench --benchmarks=seekrandom,flush --seed=1 --compression_type=none --num=2000 \
  --seek_nexts=0 --seek_nexts_to_delete=100 --use_existing_db=1 --threads=1 --db=<DB>
```

**DB Setup A2 (scattered deletes, 8 per seek)**:
```
# Step 1: Create and compact
./db_bench --benchmarks=fillseq,compact --seed=1 --compression_type=none --num=1000000 --db=<DB>
# Step 2: Scatter tombstones (5000 seeks × 8 deletes ≈ 40k tombstones)
./db_bench --benchmarks=seekrandom,flush --seed=1 --compression_type=none --num=2000 \
  --seek_nexts=0 --seek_nexts_to_delete=8 --use_existing_db=1 --threads=1 --db=<DB>
```

**DB Setup B (no deletes)**:
```
./db_bench --benchmarks=fillseq,compact --seed=1 --compression_type=none --num=1000000 \
  [--key_size=100] --db=<DB>
```

**Read workload** (same for all):
```
./db_bench --benchmarks=seekrandom --seek_nexts=100 --threads=8 \
  --reverse_iterator={true,false} --seed=1 --use_existing_db=1 \
  --compression_type=none --num=1000000 --duration=10 \
  --disable_auto_compactions \
  [--key_size=100] [--min_tombstones_for_range_conversion=X] --db=<DB_COPY>
```

Each workload averaged over 3 runs.

**Table 1: seekrandom forward, scattered deletes (2000 seeks × 100 deletes/seek)**

| Variant | avg ops/s | % vs main |
|---------|-----------|-----------|
| main | 2,895 | - |
| threshold=0 | 2,869 | -0.9% |
| threshold=8 | 287,334 | +9,824% |

**Table 2: seekrandom reverse, scattered deletes (2000 seeks × 100 deletes/seek)**

| Variant | avg ops/s | % vs main |
|---------|-----------|-----------|
| main | 544 | - |
| threshold=0 | 548 | +0.7% |
| threshold=8 | 206,491 | +37,860% |

**Table 3: seekrandom forward, scattered deletes (2000 seeks × 8 deletes/seek)**

| Variant | avg ops/s | % vs main |
|---------|-----------|-----------|
| main | 194,049 | - |
| threshold=0 | 195,703 | +0.9% |
| threshold=8 | 310,740 | +60.1% |

**Table 4: seekrandom reverse, scattered deletes (2000 seeks × 8 deletes/seek)**

| Variant | avg ops/s | % vs main |
|---------|-----------|-----------|
| main | 63,854 | - |
| threshold=0 | 69,266 | +8.5% |
| threshold=8 | 218,101 | +241.6% |

**Table 5: seekrandom forward, no deletes (regression check)**

| Variant | key=16B avg ops/s | % vs main | key=100B avg ops/s | % vs main |
|---------|-------------------|-----------|---------------------|-----------|
| main | 330,901 | - | 236,048 | - |
| threshold=0 | 328,398 | -0.8% | 238,055 | +0.9% |
| threshold=8 | 332,539 | +0.5% | 233,776 | -1.0% |

**Table 6: seekrandom reverse, no deletes (regression check)**

| Variant | key=16B avg ops/s | % vs main | key=100B avg ops/s | % vs main |
|---------|-------------------|-----------|---------------------|-----------|
| main | 261,445 | - | 192,177 | - |
| threshold=0 | 265,020 | +1.4% | 191,616 | -0.3% |
| threshold=8 | 250,881 | -4.0% | 189,239 | -1.5% |

Reviewed By: xingbowang

Differential Revision: D96203950

Pulled By: joshkang97

fbshipit-source-id: 06ba66ebde3c355f04671d1e681f1b1586e8751d
2026-04-03 22:47:51 -07:00
Josh Kang
5db0603613 Read-triggered compactions (#14426)
Summary:
Add read-triggered compaction, a new feature that reduces read amplification by compacting SST files that receive high read traffic. When an SST file's read frequency (`num_reads_sampled / file_size`) exceeds a configurable threshold, it is marked for compaction to a lower level.

The feature introduces two new options: a CF option `read_triggered_compaction_threshold` (default 0, disabled) and a DB option `max_periodic_compaction_trigger_seconds` (default 43200s) that controls how often the background thread re-evaluates compaction scores on quiet databases. Both options are dynamically changeable.

Lowering `max_periodic_compaction_trigger_seconds` does add some overhead, but generally is minimal, so running this every couple of minutes in a production environment seems fairly reasonable.

## Key changes

- **New CF option `read_triggered_compaction_threshold`** (`advanced_options.h`): When positive, files with `reads_per_byte > threshold` are marked for compaction. Files at the last non-empty level are skipped (bottommost compaction handles those separately). Marked files are sorted by hotness (reads_per_byte descending).
- **New DB option `max_periodic_compaction_trigger_seconds`** (`options.h`): Replaces the hardcoded 12-hour ceiling in `ComputeTriggerCompactionPeriod()`. Essential for read-triggered compaction on quiet DBs since there are no writes to trigger score re-evaluation.
- **Leveled compaction picker** (`compaction_picker_level.cc`): Adds read-triggered as the lowest-priority compaction reason in `SetupInitialFiles()`, using the existing `PickFileToCompact` helper.
- **Universal compaction picker** (`compaction_picker_universal.cc`): Adds `PickReadTriggeredCompaction` as lowest priority. Refactors shared "find output level + compute overlapping inputs + create Compaction" logic from both `PickDeleteTriggeredCompaction` and `PickReadTriggeredCompaction` into `BuildCompactionToNextLevel`, handling both single-level and multi-level universal cases.
- **Periodic trigger integration** (`db_impl.cc`): `TriggerPeriodicCompaction` now also fires for CFs with `read_triggered_compaction_threshold > 0`, even without time-based compaction configured.
- **Stress test & db_bench support**: Both `db_stress` and `db_bench` support the new options. `db_crashtest.py` randomly enables read-triggered compaction and sets a short periodic trigger interval when enabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14426

Test Plan:
**Unit tests**:
- `compaction_picker_test` — 7 new tests: `ReadTriggeredCompactionDisabled`, `ReadTriggeredCompactionBelowThreshold`, `ReadTriggeredCompactionAboveThreshold`, `NeedsCompactionReadTriggered`, `ReadTriggeredPicksFile`, `UniversalReadTriggeredCompaction`, `ReadTriggeredSkipsLastLevel`, `UniversalReadTriggeredNoPickWhenNotMarked`
- `db_compaction_test` — `ReadTriggeredCompaction` integration test verifying end-to-end behavior with sync points
- Stress test coverage

**Stress test**:
```
make V=1 -j "CRASH_TEST_EXT_ARGS=--duration=600 --max_key=2500000 --max_compaction_trigger_wakeup_seconds=10
  --read_triggered_compaction_threshold=0.0001 --interval=600" blackbox_crash_test
```
- confirmed read triggered compactions from LOGS

**Benchmark** (`db_bench`):

Setup: 5M keys (100B values, 16B keys), leveled compaction, 5 levels, 4MB target file size. DB fully compacted, then 2M overlapping keys written without compaction to create L0/L1 overlap (82 files, ~294MB).

LSM shape change during readrandom with read-triggered compaction:
```
BEFORE: L0=9 files (15MB), L1=4 (16MB), L2=20 (69MB), L3=49 (194MB) — 82 files, 294MB
AFTER:  L3=66 files (223MB)
```

| Benchmark | Config | avg ops/s | % change |
|-----------|--------|-----------|----------|
| readrandom (8 threads, 5M reads) | baseline (threshold=0) | 1,086,965 | — |
| readrandom (8 threads, 5M reads) | threshold=0.000001, trigger=5s | 1,453,697 | **+33.7%** |

Reviewed By: xingbowang

Differential Revision: D97838716

Pulled By: joshkang97

fbshipit-source-id: a21fcb270c7fadd4f78d98b9c821982f220dd3f0
2026-03-27 14:43:52 -07:00
Xingbo Wang
1a4b1e42cc Add include_blob_files option to GetApproximateSizes (#14501)
Summary:
**Summary:**
Add a new boolean flag `include_blob_files` (default: `false`) to `SizeApproximationOptions` and a corresponding `INCLUDE_BLOB_FILES` enum value to `SizeApproximationFlags`. When set to `true`, the returned size includes an approximation of blob file data in the queried key range.

**Algorithm:**
The blob file size contribution is prorated using the SST size ratio:
```
blob_size_in_range ≈ total_blob_size * (sst_size_in_range / total_sst_size)
```
The blob-to-SST ratio (`total_blob_size / total_sst_size`) is computed once before the per-range loop, so iterating levels and blob files only happens once per `GetApproximateSizes` call regardless of how many ranges are queried. The per-range SST size (`ApproximateSize`) is computed once and shared between `include_files` and `include_blob_files`.

**Limitations:**
- Assumes blob data is distributed proportionally to SST data across the key space. May be inaccurate if blob value sizes vary significantly across different key ranges (e.g., one range has large blobs while another has small ones).
- If there are no SST files (all data in memtables), the blob size contribution will be 0 even if blob files exist on disk.

**Changes:**
- `include/rocksdb/options.h`: New `include_blob_files` field in `SizeApproximationOptions`; updated doc comments for `include_memtables`/`include_files`
- `include/rocksdb/db.h`: New `INCLUDE_BLOB_FILES` in `SizeApproximationFlags` enum, updated flags-to-options mapping
- `include/rocksdb/c.h`: New `rocksdb_size_approximation_flags_include_blob_files` C API enum value
- `java/`: Added `INCLUDE_BLOB_FILES` to `SizeApproximationFlag.java` and JNI flag mapping in `rocksjni.cc`
- `db/db_impl/db_impl.cc`: Blob-to-SST ratio computed once before loop, SST range size computed once per range and shared
- `db_stress_tool/db_stress_test_base.cc`: Randomized `include_blob_files` in stress test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14501

Test Plan:
- New `DBBlobBasicTest.GetApproximateSizesIncludingBlobFiles` — verifies:
  - Size with blobs > without (full range)
  - Non-overlapping range returns 0
  - Partial range returns proportionally less than full range
  - `SizeApproximationFlags` API works
  - Multi-range query: two sub-ranges sum approximately to the full-range result
- Stress test now exercises the new option randomly

Reviewed By: hx235

Differential Revision: D97984211

Pulled By: xingbowang

fbshipit-source-id: e9127eac3308687fd4f0b17a771fd61fba6a8380
2026-03-27 13:53:21 -07:00
Josh Kang
3ad23b2d94 Support automated interpolation search (#14383)
Summary:
Add automatic per-block interpolation search selection (`kAuto` mode) for index blocks. During SST construction, each index block's key distribution is analyzed using the coefficient of variation (CV) of gaps between restart-point keys. Blocks with uniformly distributed keys are flagged via a new bit in the data block footer, and at read time, `kAuto` resolves to interpolation search for uniform blocks and binary search otherwise.

## Key changes

- **New `BlockSearchType::kAuto` enum value**: Resolves per-block at read time to either `kInterpolation` or `kBinary` based on the block's uniformity flag. Falls back to `kBinary` on older versions that don't recognize it.
- **Write-path uniformity analysis**: `BlockBuilder::ScanForUniformity()` uses Welford's online algorithm to incrementally compute the CV of key gaps at restart points. The result is stored in a new bit (bit 30) of the data block footer's packed restart count.
- **New table option `uniform_cv_threshold`** (default: -1 `disabled`): Controls how strict the uniformity check is. Set to negative to disable. Exposed in C++, Java (JNI), and `db_bench`.
- **Code reorganization**: Block entry decode helpers (`DecodeEntry`, `DecodeKey`, `DecodeKeyV4`, `ReadBe64FromKey`) moved from `block.cc` to a new shared header `block_util.h` so they can be reused by `BlockBuilder` on the write path.
- **New histogram `BLOCK_KEY_DISTRIBUTION_CV`**: Records the CV (scaled by 10000) of each index block's key distribution for observability.
- **Java bindings**: `IndexSearchType.kAuto`, `uniformCvThreshold` getter/setter, JNI portal constructor signature updated, and `HistogramType.BLOCK_KEY_DISTRIBUTION_CV` added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14383

Test Plan:
- `IndexBlockTest.IndexValueEncodingTest` parameterized to include `kAuto` search type alongside `kBinary` and `kInterpolation`, verifying correct seek/iteration behavior across all combinations of key distributions, restart intervals, and key lengths.
- Uniformity detection validated: blocks with uniform key distribution correctly set `is_uniform = true`, blocks with clustered/non-uniform keys set `is_uniform = false`.
- Stress test coverage
- Updated check_format_compatible to also include a "uniform" dataset. By default using uniform_cv_threshold=-1 does not result in an incompatibility issues. When manually changing the threshold (e.g. `uniform_cv_threshold=1000`), I see `bad block contents`, which is expected

## Benchmark

readrandom with `fillrandom,compact -seed=1 --statistics`:

| Benchmark | Branch | Params | avg ops/s | % change vs main | CV P50 |
|-----------|--------|--------|-----------|------------------|--------|
| readrandom | main | `binary_search, shortening=1` | 335,791 | baseline | N/A |
| readrandom | feature | `binary_search, shortening=1` (default) | 335,749 | -0.0% | 1,500 |
| readrandom | feature | `auto_search, shortening=1` (kAuto) | 366,832 | **+9.2%** | 1,500 |
| readrandom | feature | `interpolation_search, shortening=1` | 366,598 | **+9.2%** | 1,500 |
| readrandom | feature | `auto_search, shortening=2` (kAuto) | 344,631 | **+2.6%** | 1,030,000 |
| readrandom | feature | `interpolation_search, shortening=2` | 201,178 | **-40.1%** | 1,030,000 |

As seen with shortening=2, a non-uniform distribution produces a high CV, which does not use interpolation search.

## Write benchmark

There is a write overhead which scans each restart entry for a block upon Finish. In practice this is very low because currently it is only applied to index blocks.

See cpu profile (https://fburl.com/strobelight/io5hwj9h) here of `-benchmarks=fillseq,compact -compression_type=none -disable_wal=1`. Only 0.08% attributed to `ScanForUniformity`.

Reviewed By: pdillinger

Differential Revision: D94738890

Pulled By: joshkang97

fbshipit-source-id: 9661ac593c5fef89d49f3a8a027f1338a0c96766
2026-03-06 10:13:51 -08:00
Josh Kang
f25fb41da6 Add option to validate sst files in the background on DB open (#14322)
Summary:
Add `open_files_async` option for faster DB startup. When enabled, SST file opening and validation is deferred to a background thread after `DB::Open` returns, reducing startup latency for databases with many SST files. WAL recovery remains synchronous.

To support this, `FindTable` is extended with a pinning mechanism that stores the cache handle directly on `FileMetaData` via a new `PinnedTableReader` class, and sets the table reader atomically so subsequent reads skip cache lookups. `FileDescriptor::table_reader` is replaced with `PinnedTableReader pinned_reader` which wraps a `std::atomic<TableReader*>` with acquire/release ordering to safely handle concurrent access between the background opener and read threads.

Should validations fail, the background opener sets a `kAsyncFileOpen` background error. Future read requests will look up the table reader again via the cache, and if any validations fail there it will get propagated to the user (existing behavior when `max_open_files > 0`).

This feature is most useful when `max_open_files=-1`, because otherwise file opening is already capped at 16 files and DB open should be fast.

## Restrictions
- This feature also is incompatible with fifo compaction because fifo compaction requires reading table properties under DB mutex. When table reader is unpinned, this may cause a DB hang.
- This feature is also incompatible with `skip_stats_update_on_db_open=false` because it will result in even longer DB open

## Key changes

- New `open_files_async` DB option with C, Java, and `db_bench` bindings
- `BGWorkAsyncFileOpen` background worker that opens all SST files post-`DB::Open`, with shutdown awareness via `shutting_down_` flag
- New `PinnedTableReader` class in `version_edit.h` — thread-safe wrapper holding `std::atomic<TableReader*>` and `Cache::Handle*` with proper acquire/release ordering. Replaces the old `FileDescriptor::table_reader` raw pointer and `FileMetaData::table_reader_handle`
- Extract `LoadTableHandlersHelper` into `db/version_util.cc` — shared between `VersionBuilder::LoadTableHandlers` (for version edits during recovery) and `BGWorkAsyncFileOpen` (for base storage post-open)
- `FindTable` extended with `pin_table_handle` and `out_table_reader` params — when pinning is enabled, the table reader is stored on `FileMetaData` so Get/MultiGet/Iterator skip redundant cache lookups. `FindTable` now performs the pinned-reader fast-path check internally instead of requiring callers to check `fd.table_reader` beforehand
  - Note: pinning is explicit (not default) because some callers create temporary `FileMetaData`s that would need to properly clean up table handles
- `CompactedDBImpl` updated to use `FindTable` + pinning instead of raw `fd.table_reader` access for Get/MultiGet
- New `kAsyncFileOpen` background error reason in `listener.h` and `error_handler.cc`
- Add a check in ~DBImpl to ensure async file open task has not been forgotten to be scheduled in (future) subclasses of DBImpl. Certain subclasses that never use it will need to explicitly mark it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14322

Test Plan:
- `OpenFilesAsyncTest` parameterized over `num_flushes` (1, 20), `ReadType` (Get, MultiGet, Iterator), `max_open_files` (-1, 10), and `read_only` (true, false)
  - **ConcurrentFileAccess**: concurrent reads and compactions race with async opener
  - **AfterRead**: reads happen before async opener, verifying lazy open and that the opener sees already-pinned readers
  - **BeforeRead**: async opener completes first, verifying reads use pre-loaded table readers
  - **Shutdown**: DB closes before async opener starts, verifying clean cancellation with 0 file opens
  - **Error**: corrupted SST files, verifying `kAsyncFileOpen` background error is set and reads return corruption
  - **DropColumnFamily**: CF dropped before async opener runs, verifying the opener gracefully skips dropped CFs
- Added to crash test

### Benchmark

To simulate a high-latency remote filesystem, I set up a virtual filesystem with dm-delay using 10ms reads, 0 ms writes.

```
# Generate a DB with many L0 files

TEST_TMPDIR=/data/users/jkangs/dm-delay-test/mnt ./db_bench -benchmarks=fillseq -disable_auto_compactions=true -write_buffer_size=1000 -num=1000000
```

```
./db_bench -use_existing_db=true -db=/data/users/jkangs/dm-delay-test/mnt/dbbench -benchmarks=readrandom -reads=1 -report_open_timing=true -open_files_async=true -use_direct_reads -file_opening_threads=1 -skip_stats_update_on_db_open

OpenDb:     25.1419 milliseconds
```

```
./db_bench -use_existing_db=true -db=/data/users/jkangs/dm-delay-test/mnt/dbbench -benchmarks=readrandom -reads=1 -report_open_timing=true -open_files_async=false -use_direct_reads -file_opening_threads=1 -skip_stats_update_on_db_open

OpenDb:     23109.4 milliseconds
```

### No read regressions

On main branch
```
./db_bench -use_existing_db=true -db=/dev/shm/dbbench -benchmarks=readrandom -seed=1 -threads=8 -duration=30

readrandom   :       4.827 micros/op 1657100 ops/sec 30.005 seconds 49720992 operations;  183.3 MB/s (6198999 of 6198999 found)
```

On this branch
```
./db_bench -use_existing_db=true -db=/dev/shm/dbbench -benchmarks=readrandom -seed=1 -threads=8 -duration=30

readrandom   :       4.863 micros/op 1644808 ops/sec 30.007 seconds 49354992 operations;  182.0 MB/s (6099999 of 6099999 found)

./db_bench -use_existing_db=true -db=/dev/shm/dbbench -benchmarks=readrandom -seed=1 -threads=8 -duration=30 -open_files_async=true

readrandom   :       4.803 micros/op 1665392 ops/sec 30.004 seconds 49968992 operations;  184.2 MB/s (6222999 of 6222999 found)
```

Reviewed By: pdillinger, xingbowang

Differential Revision: D93538033

Pulled By: joshkang97

fbshipit-source-id: 32ac70c112cd733b7c1e1c1e2e7ce6422318a5ae
2026-03-02 16:18:14 -08:00
Josh Kang
901c88e37b Separate keys and values in data blocks (#14287)
Summary:
Introduce new table option with separated key-value storage in data blocks.

This PR implements a new SST block format where keys and values are stored in separate sections within data blocks, rather than interleaved. Keys are stored first, followed by all values in a contiguous section. The motivation is better cpu cache hit rate during seeks and potentially better compression.

The additional storage cost is a varint per restart point, and 4 bytes additional block footer. For a data block with a restart interval of 16, it is approximately 1 bit of overhead per entry. But compression actually performs better, resulting in ~3% storage savings from benchmark.

For now I've opted to not separate kvs in non-data blocks since restart interval for those blocks is typically 1, and values are typically small and probably better inlined.

### New block layout

```
+------------------+
| Keys Section     |  <- Key entries with delta encoding
+------------------+
| Values Section   |  <- (new) Values stored contiguously
+------------------+
| Restart Array    |  <- Fixed32 offsets to restart points
+------------------+
| Values Offset    |  <- (new) 4 bytes: offset to values section
| Footer           |  <- 4 bytes: packed index_type + num_restarts
+------------------+
```

### Entry Format

**At restart points**
```
+--------------+------------------+----------------+-----------------+-----------+
| shared (v32) | non_shared (v32) | value_sz (v32) | value_off (v32) | key_delta |
+--------------+------------------+----------------+-----------------+-----------+
```

**At non-restart points**
```
+--------------+------------------+----------------+-----------+
| shared (v32) | non_shared (v32) | value_sz (v32) | key_delta |
+--------------+------------------+----------------+-----------+
```

- `value_offset` is only stored at restart points to save space
- For non-restart entries, value offset is computed as: `prev_value_offset + prev_value_size`

### Forward Compatibility
- We make use of reserved block footer bits to mark if a block has separated kv format. Should an older version read this, it will assume a very large block restart interval and result in a corruption error.

### Key Changes
- **BlockBuilder**: Accumulates values in a separate buffer; value offsets are stored only at restart points (other entries derive offset from previous value's position). There is an additional memcpy cost to place the value data after the key data.
- **Block iteration**: Iteration now needs to know if we are at a restart point. This will rely on `cur_entry_idx_`, which was previously only used for per-kv checksum purposes. In this new format, we also need to know the block_restart_interval, which was previously also only calculated for per-kv checksums.
- **Table properties**: Store `data_block_restart_interval`, `index_block_restart_interval`, and `separate_key_value_in_data_block` in table properties

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14287

Test Plan:
- Extended block_test, table_test, compaction_test to contain new separated_kv param
- Added new parameter to crash test

 ---

## Benchmark

### Varying Value Size

**Write:** `db_bench --num=1000000 --separate_key_value_in_data_block=<bool> --value_size=<X> --benchmarks=fillrandom,compact`
**Read:** `db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq`

| Value Size | FillRandom (ops/s) | | Compact (s) | | ReadRandom (ops/s) | | ReadSeq (ops/s) | | SST Size (MB) | |
|----------:|-------------------:|-----:|----------:|-----:|-------------------:|-----:|----------------:|-----:|-------------:|-----:|
| | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv |
| 1 | 253,280 | 264,977 | 0.516 | 0.497 (**-3.7%**) | 596,328 | 586,625 (**-1.6%**) | 5,904,681 | 6,069,581 (**+2.8%**) | 5.1 | 4.2 (**-18.6%**) |
| 16 | 235,757 | 242,367 | 0.572 | 0.533 (**-6.8%**) | 570,751 | 572,815 (**+0.4%**) | 5,371,138 | 5,604,908 (**+4.4%**) | 14.7 | 13.4 (**-8.5%**) |
| 100 | 264,299 | 265,427 | 0.461 | 0.454 (**-1.5%**) | 323,696 | 332,790 (**+2.8%**) | 4,239,725 | 4,232,416 (**-0.2%**) | 38.8 | 37.6 (**-3.2%**) |
| 1,000 | 238,992 | 242,764 | 2.349 | 2.329 (**-0.9%**) | 244,608 | 261,403 (**+6.9%**) | 1,285,394 | 1,265,868 (**-1.5%**) | 342.1 | 342.0 (**-0.0%**) |

### Varying Block Restart Interval

**Write:** `db_bench --num=1000000 --separate_key_value_in_data_block=<bool> --block_restart_interval=<X> --benchmarks=fillrandom,compact`
**Read:** `db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq`

| BRI | FillRandom (ops/s) | | Compact (s) | | ReadRandom (ops/s) | | ReadSeq (ops/s) | | SST Size (MB) | |
|----:|-------------------:|-----:|----------:|-----:|-------------------:|-----:|----------------:|-----:|-------------:|-----:|
| | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv |
| 1 | 251,654 | 263,707 | 0.453 | 0.485 (**+7.1%**) | 334,653 | 328,708 (**-1.8%**) | 4,194,342 | 3,954,291 (**-5.7%**) | 40.6 | 42.4 (**+4.5%**) |
| 4 | 253,797 | 252,394 | 0.476 | 0.476 (**+0.0%**) | 332,719 | 341,676 (**+2.7%**) | 4,135,691 | 4,051,151 (**-2.0%**) | 39.2 | 39.3 (**+0.3%**) |
| 8 | 260,143 | 262,273 | 0.496 | 0.460 (**-7.3%**) | 330,859 | 337,567 (**+2.0%**) | 4,144,081 | 4,187,389 (**+1.0%**) | 38.9 | 38.1 (**-2.1%**) |
| 16 | 252,875 | 263,176 | 0.464 | 0.455 (**-1.9%**) | 323,783 | 335,418 (**+3.6%**) | 4,127,310 | 4,217,028 (**+2.2%**) | 38.8 | 37.6 (**-3.2%**) |
| 32 | 260,224 | 269,422 | 0.464 | 0.451 (**-2.8%**) | 304,001 | 314,989 (**+3.6%**) | 4,310,162 | 4,247,248 (**-1.5%**) | 38.8 | 37.3 (**-3.8%**) |

### Varying Compression

**Write:** `db_bench --num=1000000 --separate_key_value_in_data_block=<bool> --compression_type=<X> [--compression_level=<N>] --benchmarks=fillrandom,compact`
**Read:** `db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq`

| Compression | FillRandom (ops/s) | | Compact (s) | | ReadRandom (ops/s) | | ReadSeq (ops/s) | | SST Size (MB) | |
|------------|-------------------:|-----:|----------:|-----:|-------------------:|-----:|----------------:|-----:|-------------:|-----:|
| | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv |
| None | 252,494 | 260,552 | 0.413 | 0.419 (**+1.5%**) | 356,290 | 371,535 (**+4.3%**) | 4,479,261 | 4,507,133 (**+0.6%**) | 73.5 | 73.6 (**+0.2%**) |
| LZ4 | 246,010 | 256,360 | 0.477 | 0.455 (**-4.6%**) | 342,497 | 345,882 (**+1.0%**) | 4,400,570 | 4,268,102 (**-3.0%**) | 38.3 | 37.6 (**-2.0%**) |
| ZSTD (L3) | 254,748 | 258,556 | 1.067 | 1.055 (**-1.1%**) | 176,724 | 177,566 (**+0.5%**) | 2,736,841 | 2,717,739 (**-0.7%**) | 32.9 | 31.3 (**-4.7%**) |
| ZSTD (L6) | 256,459 | 259,388 | 1.556 | 1.462 (**-6.0%**) | 177,390 | 176,691 (**-0.4%**) | 2,754,336 | 2,688,682 (**-2.4%**) | 32.8 | 31.1 (**-5.1%**) |

### Varying Block Size

**Write:** `db_bench --num=1000000 --separate_key_value_in_data_block=<bool> --block_size=<X> --benchmarks=fillrandom,compact`
**Read:** `db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq`

| Block Size | FillRandom (ops/s) | | Compact (s) | | ReadRandom (ops/s) | | ReadSeq (ops/s) | | SST Size (MB) | |
|-----------:|-------------------:|-----:|----------:|-----:|-------------------:|-----:|----------------:|-----:|-------------:|-----:|
| | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv |
| 4 KB | 263,203 | 260,362 | 0.469 | 0.461 (**-1.7%**) | 324,178 | 332,249 (**+2.5%**) | 4,231,537 | 4,217,763 (**-0.3%**) | 38.8 | 37.6 (**-3.1%**) |
| 16 KB | 252,742 | 263,161 | 0.426 | 0.428 (**+0.5%**) | 227,805 | 222,873 (**-2.2%**) | 5,146,997 | 5,081,080 (**-1.3%**) | 38.1 | 36.7 (**-3.6%**) |
| 64 KB | 257,490 | 260,225 | 0.423 | 0.414 (**-2.1%**) | 86,807 | 91,586 (**+5.5%**) | 5,380,403 | 5,372,372 (**-0.1%**) | 36.3 | 35.0 (**-3.5%**) |

### Varying Key Size

**Write:** `db_bench --num=1000000 --separate_key_value_in_data_block=<bool> --min_key_size=10 --max_key_size=100 --benchmarks=fillrandom,compact`
**Read:** `db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq`

| Key Size | FillRandom (ops/s) | | Compact (s) | | ReadRandom (ops/s) | | ReadSeq (ops/s) | | SST Size (MB) | |
|---------:|-------------------:|-----:|----------:|-----:|-------------------:|-----:|----------------:|-----:|-------------:|-----:|
| | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv |
| 10–100 | 243,740 | 255,183 | 0.618 | 0.622 (**+0.6%**) | 284,304 | 307,569 (**+8.2%**) | 3,738,921 | 3,686,676 (**-1.4%**) | 41.5 | 41.2 (**-0.8%**) |

### CPU Profile Notes
- No compression: DataBlock::SeekForGet uses less cpu (13.2% vs 13.9%)
  - https://fburl.com/strobelight/6mwwebft with separated KV
  - https://fburl.com/strobelight/m9m798ka without
- ZSTD compression: rocksdb::DecompressSerializedBlock uses more CPU (45.8% vs 44.9%), while DataBlock::SeekForGet uses less cpu (5.09% vs 6.52%)
  - https://fburl.com/strobelight/3x5nw1k4 with separated KV
  - https://fburl.com/strobelight/e7809046 without

 ---

Reviewed By: xingbowang, pdillinger

Differential Revision: D92103024

Pulled By: joshkang97

fbshipit-source-id: 47cfeb656ff3c20d34975f0b6c4c0462935a83dc
2026-02-23 12:42:05 -08:00
Hui Xiao
29819f37e1 Remove deprecated ReadOptions::managed, `ColumnFamilyOptions::snap_refresh_nanos (#14350)
Summary:
**Context/Summary:**
Remove deprecated, unused APIs and options:
- ReadOptions::managed: This option was not used anymore. The functionality it controlled has been removed long ago.
- ColumnFamilyOptions::snap_refresh_nanos: Deprecated and unused option.

Corresponding C API (rocksdb_readoptions_set_managed) and Java API (ReadOptions.managed/setManaged) are also removed. All related checks an db_impl and db_impl_secondary iterators are cleaned up.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14350

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D93812438

Pulled By: hx235

fbshipit-source-id: e4a9d21c65f83294b6d0878286ba14024f049bac
2026-02-20 14:00:41 -08:00
xingbowang
3556c22059 Remove deprecated option skip_checking_sst_file_sizes_on_db_open (#14346)
Summary:
Remove deprecated option skip_checking_sst_file_sizes_on_db_open

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14346

Test Plan: Unit test

Reviewed By: hx235

Differential Revision: D93602683

Pulled By: xingbowang

fbshipit-source-id: f576825cb107bb0aeb14f4ff29fef0df269b8728
2026-02-19 14:12:38 -08:00
Xingbo Wang
b040ab83e1 Add a new picking algorithm in fifo compaction (#14326)
Summary:
Add a new kv ratio based compaction picking algorithm in fifo compaction

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14326

Test Plan: Unit test

Reviewed By: pdillinger

Differential Revision: D93257941

Pulled By: xingbowang

fbshipit-source-id: fd2d0e1356c7b54682a1197475a1bd26cb45c9d4
2026-02-15 10:04:58 -08:00
Josh Kang
9f47518676 Add interpolation search as an alternative to binary search (#14247)
Summary:
Interpolation search is an alternative algorithm to binary search, which performs better on uniformly distributed keys. Instead of binary search always computing the mid point of the left and right boundaries, interpolation search "interpolates" the mid point based on the distance to the target. Fortunately, we can re-use existing block format to support interpolation search.

For a given block, we compute the shared_prefix length of the first and last key. Interpolation search is usually done with numerical target values, so for a variable binary length key, we calculate the "value" as the first 8 non-shared bytes. This also means interpolation search would only really be effective for bytewise comparator (guarded via options validations).

#### Fallback to binary search
- if the the val(left_key) == val(right_key) then we fallback to classic binary search (to avoid divide by 0)
- interpolation search is significantly more computationally expensive than binary search, so when the search distance is small, we also fallback to binary search.
- if interpolation search does not make significant progress (i.e. reduces search space by more than half each iteration), we can assume data is non-uniform and fallback.

Interpolation search also performs best when there is minimal shortening, especially shortening of the last block, as it can heavily skew the distribution of the actual keys.

Note that each search algorithm is guaranteed to make progress because at each iteration the search space is guaranteed to be reduce by at least 1.

For now this change only applies to index block seeks, as data block seeks and other blocks do not have as many entries and would not require significant number of search rounds, but it could be easily extended to include that support.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14247

Test Plan:
Updated unit tests and crash test with new search option

### Benchmark
The default benchmark sets up keys in generally uniform distribution, so it was a good way to test performance improvements.

Setup: `./db_bench -benchmarks=fillseq,compact -index_shortening_mode=1`

#### Before this change
```
./db_bench -use_existing_db=true -benchmarks=readrandom -seed=1

readrandom   :       2.899 micros/op 344973 ops/sec 2.899 seconds 1000000 operations;   38.2 MB/s (1000000 of 1000000 found)
```

#### After this change

Notice how key comparison counts are the same between the two.
```
./db_bench -use_existing_db=true -benchmarks=readrandom -seed=1 -index_search_type=binary_search

readrandom   :       2.881 micros/op 347128 ops/sec 2.881 seconds 1000000 operations;   38.4 MB/s (1000000 of 1000000 found)
```

```
./db_bench -use_existing_db=true -benchmarks=readrandom -seed=1 -index_search_type=interpolation_search

readrandom   :       2.609 micros/op 383209 ops/sec 2.610 seconds 1000000 operations;   42.4 MB/s (1000000 of 1000000 found)
```

With a non-uniform distribution, `i.e. index_shortening_mode=2`

```
./db_bench -use_existing_db=true -benchmarks=readrandom -seed=1 -index_search_type=binary_search

readrandom   :       2.958 micros/op 338075 ops/sec 2.958 seconds 1000000 operations;   37.4 MB/s (1000000 of 1000000 found)
```

```
./db_bench -use_existing_db=true -benchmarks=readrandom -seed=1 -index_search_type=interpolation_search

readrandom   :       5.502 micros/op 181750 ops/sec 5.502 seconds 1000000 operations;   20.1 MB/s (1000000 of 1000000 found)
```

Reviewed By: pdillinger

Differential Revision: D91063163

Pulled By: joshkang97

fbshipit-source-id: 151d6aa76f8713740b714de6e406aff40d28ccbc
2026-02-13 17:15:10 -08:00
Xingbo Wang
656b734a5f Support abort background compaction jobs. (#14227)
Summary:
This adds a new public API to allow applications to abort all running compactions and prevent new ones from starting. Unlike DisableManualCompaction() which only pauses manual compactions and waits for them to finish naturally, AbortAllCompactions() actively signals running compactions (both automatic and manual) to terminate early and waits for them to complete before returning.

The abort signal is checked periodically during compaction (every 100 keys), so ongoing compactions abort quickly. Any output files from aborted compactions are automatically cleaned up to prevent partial results from being installed.

This is useful for scenarios where applications need to quickly stop all compaction activity, such as during graceful shutdown or when performing maintenance operations.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14227

Test Plan:
- Unit tests in db_compaction_abort_test.cc cover various abort scenarios including: abort before/during compaction, abort with multiple subcompactions, nested abort/resume calls, abort with CompactFiles API, abort across multiple column families, and timing guarantees
- Updated compaction_job_test.cc to include the new parameter

Reviewed By: anand1976

Differential Revision: D91480994

Pulled By: xingbowang

fbshipit-source-id: 36837971d8a540cd34d3ec28a78bc94b582625b0
2026-01-30 05:53:04 -08:00
Maciej Szeszko
6b5ccbbec6 Remove inline values support (#14270)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14270

Legacy BlobDB's inline values feature (storing small values directly in the LSM tree via `min_blob_size` threshold) is unused in production - all
deployments use `min_blob_size = 0`. This removes the functionality entirely.

Changes:
- Remove `min_blob_size` from `BlobDBOptions`
- Remove `IsInlined()` check from compaction filter (dead code path)
- Remove inline-related statistics (`BLOB_DB_WRITE_INLINED*`)
- Remove `InlineSmallValues` test
- Update stale comments referencing inlined data

Reviewed By: xingbowang

Differential Revision: D91088985

fbshipit-source-id: ec67848ece1a7dc071ca8e8a17faebb435394733
2026-01-28 16:08:10 -08:00
Anand Ananthabhotla
b89d290c20 Add MultiScan statistics (#14248)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14248

### Overview

This diff introduces the addition of multi-scan statistics to RocksDB, enhancing the database's ability to monitor and analyze performance during multi-scan operations.

### Key Changes

#### Implemented Multi-Scan Statistics

The following statistics were implemented to provide deeper insights into multi-scan operations:

- **MULTISCAN_PREPARE_MICROS**: Measures the time (in microseconds) spent preparing for multi-scan operations.
- **MULTISCAN_BLOCKS_PER_PREPARE**: Tracks the number of blocks processed per multi-scan prepare operation.
- **Wasted Prefetch Blocks Count**: Counts the number of prefetched blocks that were not used (i.e., wasted) if the iterator is abandoned before accessing them.
- **MULTISCAN_TOTAL_BLOCKS_SCANNED**: Tracks the total number of blocks scanned during all multi-scan operations.
- **MULTISCAN_TOTAL_KEYS_SCANNED**: Measures the total number of keys scanned across all multi-scan operations.
- **MULTISCAN_TOTAL_MICROS**: Captures the total time (in microseconds) spent in multi-scan operations.
- **MULTISCAN_PREFETCHED_BLOCKS**: Counts the number of blocks that were prefetched during multi-scan operations.
- **MULTISCAN_USED_PREFETCH_BLOCKS**: Tracks the number of prefetched blocks that were actually used during multi-scan operations.

### Impact

This diff provides more fine-grained statistics for multi-scan operations, allowing developers and users to better understand and optimize the performance of their RocksDB instances.

Reviewed By: krhancoc

Differential Revision: D91053297

fbshipit-source-id: 7158741b9f026c0b5ce8ba1264dbd137e7fe985d
2026-01-21 23:23:38 -08:00
Peter Dillinger
a6af317476 Use format_version=7 by default, fix perf bug (#14239)
Summary:
Since it's been > 6 months and we have production uses, migrate to fv=7 by default. One unit test needed an update for the change to table properties with fv=7.

On making this change, PresetCompressionDictTest tests detected extra memory usage by decompressing LZ4 with dictionary compression. This turned out to be a bug in `std::find` usage that led to using the ZSTD-optimized decompressor (with digested dictionary usage) in cases where it is not needed. I've fixed the bug and improved the unit tests that found the bug.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14239

Test Plan: existing tests, including format compatible CI job (updated, and run locally with SHORT_TEST=1)

Reviewed By: hx235

Differential Revision: D90728697

Pulled By: pdillinger

fbshipit-source-id: 8f1a0e9ca59a88c18eaa4cdfdea00309175ce30a
2026-01-21 09:28:06 -08:00
Hui Xiao
8edb99f904 Statistics for successfully resumed compaction output bytes (#14054)
Summary:
Context/Summary: as titled

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14054

Test Plan: new UT, manually checking

Reviewed By: jaykorean

Differential Revision: D84828431

Pulled By: hx235

fbshipit-source-id: 56e1a9159f7597a10d6c549657d8b22788aa0599
2025-10-17 11:38:20 -07:00
Xingbo Wang
742741b175 Support Super Block Alignment (#13909)
Summary:
Pad block based table based on super block alignment

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13909

Test Plan:
Unit Test

No impact on perf observed due to change in the inner loop of flush.

upstream/main branch 202.15 MB/s
```
for i in `seq 1 10`; do ./db_bench --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 >> /tmp/x1 2>&1; grep fillseq /tmp/x1 | grep -Po "\d+\.\d+ MB/s" | grep -Po "\d+\.\d+" | awk '{sum+=$1} END {print sum/NR}'
```

After the change without super block alignment 203.44 MB/s
```
for i in `seq 1 10`; do ./db_bench --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 >> /tmp/x1 2>&1
```

After the change with super block alignment 204.47 MB/s
```
for i in `seq 1 10`; do ./db_bench --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 --super_block_alignment_size=131072 --super_block_alignment_max_padding_size=4096 >> /tmp/x1 2>&1;
```

Reviewed By: pdillinger

Differential Revision: D83068913

Pulled By: xingbowang

fbshipit-source-id: eecd65088ab3e9dbc7902aab8c2580f1bc8575df
2025-10-01 18:20:35 -07:00
Peter Dillinger
f5fb597bac Resolve missing/inconsistent tickers in Java (#14012)
Summary:
Pretty self-explanatory from the changes, including re-arranging the "COOL" entries for easier tracking of which values are used.

I'm not touching the TICKER_ENUM_MAX issue because IIRC we've gotten in trouble in the past for changing any Java ticker values.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14012

Test Plan: CI, sufficient prompts to get AI to discover the known issues relayed by hx235, to help ensure we found any other outstanding issues.

Reviewed By: hx235

Differential Revision: D83497503

Pulled By: pdillinger

fbshipit-source-id: ec0bd7e28188e0430fb03fc5bd79c2ed7b28f3ad
2025-09-29 14:21:00 -07:00
Peter Dillinger
1c8a012727 Add kCool Temperature (#14000)
Summary:
also requested by internal user, like kIce in https://github.com/facebook/rocksdb/issues/13927

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14000

Test Plan: unit tests updated

Reviewed By: archang19

Differential Revision: D83200479

Pulled By: pdillinger

fbshipit-source-id: 31f2842d87bcad40227aeee9687ff5772393689c
2025-09-25 11:27:00 -07:00
Peter Dillinger
6127a42f98 Use/endorse (Auto)HyperClockCache by default over LRUCache (#13964)
Summary:
After seeing more people hit issues with thrashing small LRUCache shards and AutoHCC running fully in production for a while on a very large service, here I make these updates:

* In the public API, mark the case of `estimated_entry_charge = 0` (which is how you select AutoHCC) as production-ready and generally preferred. That means devoting a lot less space to how to tune FixedHCC (`estimated_entry_charge > 0`) because it is not generally recommended anymore even though in theory it is the fastest (conditional on a fragile configuration).
* In the public API, add more detail about potential problems with LRUCache and explicitly endorse HCC.
* When a default block cache is created, use AutoHCC instead of LRUCache. It's still a 32MB cache but that's just one cache shard for AutoHCC so the risk of issues with small cache shards is dramatically reduced. And a single AutoHCC shard is still essentially wait-free.
* Improve the handling of the hypothetical scenario of a failed anonymous mmap. This is hardly a concern for 64-bit Linux and likely most other OSes. It would in theory be possible to fall back on LRUCache in that case but the code structure makes that annoying/challenging. Instead we crash with an appropriate message.
* Cleaned up some includes
* Fixed some previously unreported leaks (better assertions on HCC perhaps, some subtle behavior changes)
* Added a new mode to cache_bench (detailed below)
* Avoid a particularly costly sanity check in `~AutoHyperClockTable()` even in debug builds so that unit testing, etc., isn't bogged down, except keep it in ASAN build.

Planned follow-up:
* Update HCC implementation to use my new "bit field atomics" API introduced in https://github.com/facebook/rocksdb/issues/13910 to make it easier to read and maintain

Possible follow-up:
* Re-engineer table cache to use AutoHCC also, instead of LRUCache and a single mutex to ensure no duplication across threads. (a) Pad table cache key to 128 bits for AutoHCC. (b) Stripe/shard the no-duplication mutex. (HCC's consistency model is too weak for concurrent threads to use its API to agree on a winner, even if entries could be inserted in an "open in progress" state.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13964

Test Plan:
existing tests. ClockCacheTest.ClockEvictionEffortCapTest caught a regression during my development, and the crash test has a history of finding subtle HCC bugs.

## Performance

Although we've validated AutoHCC performance under high load, etc., before we haven't really considered whether there will be unacceptable overheads for small DBs and CFs, e.g. in unit tests. For this, I have added a new mode to cache_bench: with the -stress_cache_instances=n parameter, it will create and destroy n empty cache instances several times. In the debug build, this found that a particular check in `~AutoHyperClockTable()` was extremely costly for short-lived caches (fixed). Beyond that, we can answer the question of whether it is feasible for a single process to host 1000 DBs each with 1000 CFs with default block cache instances, after moving LRUCache -> AutoHCC, for example:

```
/usr/bin/time ./cache_bench -stress_
cache_instances=1000000 -cache_type=auto_hyper_clock_cache -cache_size=33554432
```

Release build:
Average 9.8 us per 32MB LRUCache creation, 2.9 us per destruction, 24.6GB max RSS (~25KB each)
->
Average 4.3 us per  32MB AutoHCC creation, 4.9 us per destruction, 4.8GB max RSS (~5KB each)

Debug build:
Average 10.9 us per 32MB LRUCache creation, 3.5 us per destruction, 28.7GB max RSS (~29KB each)
->
Average 4.5 us per 32MB AutoHCC creation, 4.9 us per destruction, 4.7GB max RSS (~5KB each)

Despite the anonymous mmaps, it's apparently more efficient for default/small/empty structures. This is likely due to the dramatically low number of cache shards at this size. If we switch to `-stress_cache_instances=10000  -cache_size=1073741824`:

Release build:
Average 10.6 us per 1GB LRUCache, 2.8 us per destruction, 2.3 GB max RSS (~230KB each)
->
Average 130 us per 1GB AutoHCC creation, 153 us per destruction, 1.5 GB max RSS (~150KB each)

Debug build:
Average 11.2 us per 1GB LRUCache, 3.6 us per destruction, 2.4 GB max RSS (~240KB each)
->
Average 130 us per 1GB AutoHCC creation, 150 us per destruction, 1.6 GB max RSS (~160KB each)

Here it's clear that we are paying a price in time for setting up all those mmaps for the good number of cache shards and potential table growth, even though the RSS is well under control. However, I am not concerned about this at all, as it's unlikely to slow down anything notably such as unit tests. Before and after full testsuite runs confirm:

3327.73user 5188.71system 3:38.88elapsed -> 3312.07user 5704.77system 3:41.61elapsed

There is increased kernel time but acceptable. With ASAN+UBSAN:

11618.70user 15671.30system 5:54.68elapsed -> 12595.81user 16159.67system 6:32.77elapsed

Acceptable given that our ASAN+UBSAN builds are not the slowest in CI

Reviewed By: hx235

Differential Revision: D82661067

Pulled By: pdillinger

fbshipit-source-id: ab25c766ca70f2b8664849c2a838b9e1b4e72d3b
2025-09-18 13:27:51 -07:00
Peter Dillinger
67af5bdc38 Add Temperature::kIce (#13927)
Summary:
... and associated statistics, etc. Someone needs it, so here it is.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13927

Test Plan: Updated / extended / added some unit tests

Reviewed By: cbi42

Differential Revision: D81981469

Pulled By: pdillinger

fbshipit-source-id: 52558c08741890b781310906acbc18d9eb479363
2025-09-10 10:29:49 -07:00
Alan Paxton
805ac7c887 Update compression libraries to latest releases (#13609)
Summary:
See `Makefile` for actual changes:

* ZLIB remains the same
* BZIP2 remains the same
* SNAPPY is a minor update
* LZ4 is a significant update with multithreaded/multicore compression https://github.com/lz4/lz4/releases/tag/v1.10.0
* ZSTD is a significant update RocksDB is called out as benefiting in particular from the performance improvements herein https://github.com/facebook/zstd/releases/tag/v1.5.7

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13609

Reviewed By: archang19

Differential Revision: D77877295

Pulled By: mszeszko-meta

fbshipit-source-id: bf9a257e8f68dec3d02743b339aa2df65df4ab2c
2025-07-07 13:28:49 -07:00
Miroslav Kovar
fc2cf7ead2 Expose optimized TransactionBaseImpl::MultiGet through JNI (#13589)
Summary:
Addresses https://github.com/facebook/rocksdb/issues/13587.

This PR exposes the optimized implementation of batched reads through a `Transaction` object to Java clients.

The latency improvement of transactional multiget on production workload achieved by switching the implementation is roughly:
```
quantile=0.2: 21%
quantile=0.5: 28%
quantile=0.8: 46%
quantile=1.0: 239%
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13589

Reviewed By: jaykorean

Differential Revision: D74660169

Pulled By: cbi42

fbshipit-source-id: d01780173e0500c96e5e431ff6645008cbf6e8b5
2025-05-14 13:19:06 -07:00
Hui Xiao
29c6610617 Add compaction explicit prefetch stats (#13520)
Summary:
**Context/Summary:**
This PR adds new stats to measure compaction readahead size for rocksdb managed prefetching (not FS prefetching). It can be used to verify compaction read-ahead is doing what's configured. This PR also excludes compaction readahead stats from user scan readahead stats measured in existing stats so there is a cleaner separating between these two.

Bonus: this PR also included some typo fixing about "io activities"

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13520

Test Plan: Modified existing test to verify stats

Reviewed By: archang19

Differential Revision: D72892850

Pulled By: hx235

fbshipit-source-id: 1a73182061baa044c9c9193a2b0fd967ffe75c4a
2025-04-14 12:08:38 -07:00
anand76
f7764cb6b2 Remove fail_if_options_file_error DB option (#13504)
Summary:
The fail_if_options_file_error has been deprecated for more than a year. This PR removes it from the code base. https://github.com/facebook/rocksdb/issues/12056 fixed a bug that was blocking the option from removal. https://github.com/facebook/rocksdb/issues/12249 marked it as deprecated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13504

Reviewed By: hx235

Differential Revision: D72194063

Pulled By: anand1976

fbshipit-source-id: 0aa7cf56e60c48c7e7654743d3e64922ce65225d
2025-04-09 14:18:33 -07:00
Yu Zhang
5e10baa412 Delete max_write_buffer_number_to_maintain (#13491)
Summary:
As titled. This option has been marked deprecated since introduction of a better option `max_write_buffer_size_to_maintain` and acts as its fallback since RocksDB 6.5.0 The internal user we know these options were created for migrated to `max_write_buffer_size_to_maintain` for a long time too.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13491

Test Plan: existing tests

Reviewed By: cbi42

Differential Revision: D71984601

Pulled By: jowlyzhang

fbshipit-source-id: c264d4809e311f60fdbad817ebfade256db549b6
2025-04-07 21:44:36 -07:00
Hui Xiao
48eb646787 Mark MaxMemCompactionLevel() deprecated (#13503)
Summary:
**Context/Summary:**

MaxMemCompactionLevel() developed 10 years ago simply returns the level a memtable flushed to, which has historically been L0 and have no plan to change to something different for future. It is also not used in test or internally.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13503

Test Plan: CI + fake release

Reviewed By: cbi42

Differential Revision: D72066092

Pulled By: hx235

fbshipit-source-id: 5ff5b16a6664ef3efabd3a6fbd8a2d0529b62460
2025-03-31 19:29:40 -07:00
Changyu Bi
325dcdf2e5 Deprecate ReadOptions::ignore_range_deletions and experimental::PromoteL0() (#13500)
Summary:
based on the option comment, `ignore_range_deletions` was added due to the overhead of range deletions in read path when a DB does not use DeleteRange(). The current implementation should not have a noticeable performance difference in this case.

`experimental::PromoteL0()` can be replaced by doing a manual compaction with proper CompactRangeOptions.

There are some internal use of these option and API so we will remove them later after the usages are updated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13500

Test Plan:
comment change only.
Performance: benchmark the performance difference with `ignore_range_deletions` and without (borrowed flag `universal_incremental` for this purpose), ran at the same time on the same machine.

- random point get:
    - ignore_range_deletions=false: 343078 ops/sec
    - ignore_range_deletions=true: 340219 ops/sec (0.8% slower)
```
(for I in $(seq 1 1); do TEST_TMPDIR=/dev/shm/t1 /data/users/changyubi/vscode-root/rocksdb/db_bench --benchmarks=fillseq,waitforcompaction,readrandom --write_buffer_size=67108864 --writes=1000000 --num=2000000 --reads=1000000  --seed=1723056275 --universal_incremental=false 2>&1 | grep "readrandom"; done;) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }';
```

- sequential scan:
  - ignore_range_deletions=false: 5378104 ops/sec
  - ignore_range_deletions=true: 5393809 ops/sec (0.3% faster)
```
(for I in $(seq 1 10); do TEST_TMPDIR=/dev/shm/t1 /data/users/changyubi/vscode-root/rocksdb/db_bench --benchmarks=fillseq,waitforcompaction,readseq[-X10] --write_buffer_size=67108864 --writes=1000000 --num=2000000  --universal_incremental=true --seed=1723056275 2>1 | grep "\[AVG 10 runs\]"; done;) | awk '{ t += $6; c++; print; } END { printf "%.0f\n", 1.0 * t / c }';
```

The difference in ops/sec for the two benchmarks is likely noise.

Reviewed By: hx235

Differential Revision: D72069223

Pulled By: cbi42

fbshipit-source-id: ad82a051aa4682790d2178cd4fb2d1467397fbb5
2025-03-28 14:49:28 -07:00
Maciej Szeszko
591f5b1266 Remove deprecated DB::DeleteFile API references (#13322)
Summary:
Cleanup post https://github.com/facebook/rocksdb/pull/13284.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13322

Test Plan:
1. We did not find any evidence of breakage in internal pre-release integration pipeline runs after renaming the deprecated API in `9.10`.
2. _To the extent possible_, we manually validated partner use cases of file deletion and confirmed deprecated API is no longer in use.

Reviewed By: jaykorean

Differential Revision: D68476852

Pulled By: mszeszko-meta

fbshipit-source-id: fbe1f873e16ae7c60d7706a3c44ecc695ab86a4b
2025-01-24 22:28:41 -08:00
Roman Puchkovskiy
b98c21b281 Support flush reasons above 12 in Java integration (#13246)
Summary:
FlushReason enum in C++ has members up to 15, but in Java, the mirroring FlushReason only supports reason codes up to 12. This causes exceptions when adding a flush listener.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13246

Reviewed By: pdillinger

Differential Revision: D68241620

Pulled By: jaykorean

fbshipit-source-id: 1e2856dad28dff0cbb1772f5a8ea03cc1e224088
2025-01-17 10:04:10 -08:00
Maciej Szeszko
6e97a813dc Deprecate db delete file public API (#13284)
Summary:
We added a removal warning for public `DB::DeleteFile` API ~4 years ago in https://github.com/facebook/rocksdb/pull/7337. This API seems to sit at wrong layer of abstraction, where instead of exposing a clear interface to delete specific range of keys, callers rely on their own discovery / interpretation of where their data / log possibly resides 'as-of-now'. For example, in case of data, the physical location of the keys might very well change after user obtained their mapping from key(s) to specific SST file. This will lead to `InvalidArgument` response, which if repeated, would put a user in a race condition spinning wheel - the behavior that's inefficient, fairly indeterministic and therefore one that should be strongly discouraged. We're employing a graceful approach to prefixing the public API with `DEPRECATED_` first for better discoverability and ease of self service for product teams should they still use that legacy API. If everything goes smoothly, we intend to remove all the deprecated API references in the next release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13284

Reviewed By: pdillinger

Differential Revision: D67981502

Pulled By: mszeszko-meta

fbshipit-source-id: adc7fe5cf4e2180bcfd21878b8f78f3fb6ead355
2025-01-10 19:07:33 -08:00
Maciej Szeszko
541761eaaa Deprecate random access max buffer size references - take #2 (#13288)
Summary:
This time properly marking db option as `kDeprecated` in `db_options.cc`. Original PR: https://github.com/facebook/rocksdb/pull/13278.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13288

Reviewed By: pdillinger

Differential Revision: D68024379

Pulled By: mszeszko-meta

fbshipit-source-id: 8e1f08b048ccf5971d899811edaf0b0ef16581ef
2025-01-10 15:32:38 -08:00
Maciej Szeszko
6c6defe3b8 Revert "Deprecate random access max buffer size references (#13278)" (#13285)
Summary:
This reverts commit d4bd67fb09. There are total of 4 call sites as referenced [here](https://www.internalfb.com/code/search?q=repo%3Aall%20-filepath%3Afbcode%2Frocksdb%2F%7Cthird-party%2Frocksdb%2F%7Clibrocksdb%2F%7Cfb_mysql%2F.*%2Frocksdb%7Cinternal_repo_rocksdb%20regex%3Aon%20random_access_max_buffer_size&lang_filter=cpp). None of them have a strict reliance of this setting, which should make followup cleanup fairly easy.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13285

Reviewed By: pdillinger

Differential Revision: D67984325

Pulled By: mszeszko-meta

fbshipit-source-id: 12c0b6281c2af6c32261fdf6092856b0566d389e
2025-01-09 13:20:30 -08:00
Maciej Szeszko
d4bd67fb09 Deprecate random access max buffer size references (#13278)
Summary:
This option has been officially deprecated in 5.4.0. We're removing all the references to `random_access_max_buffer_size`, related rules and all the clients wrappers. As a part of this refactoring, we're also getting rid of the `options-1-false` (and consequently its' `multiple-conds-all-false` corresponding rule), as condition would not make much sense anymore without the bounding RA max buffer size limit. Motivated by ongoing tech debt reduction effort.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13278

Test Plan: Validated that internal users do not rely on this long-gone option in their workflows.

Reviewed By: jaykorean

Differential Revision: D67909674

Pulled By: mszeszko-meta

fbshipit-source-id: 8f4b59a4a92b0b32b8b91b71ac318aafc17f1da2
2025-01-08 09:59:18 -08:00
Alan Paxton
9b1d0c02e9 Add [set]DailyOffpeakTimeUTC option to Java API (#13148)
Summary:
Reflect RocksDB DailyOffpeakTimeUTC  option in Java API. As is standard for options, there are a number of different places where this option needs to be added: it is an option, a DB option, and it is mutable (can be changed while running).

The new option is a string value. This requires an extension to the internal MutableDBOptions parse code, which received the entire options string from C++ and parses it on the Java side.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13148

Reviewed By: cbi42

Differential Revision: D67870402

Pulled By: jaykorean

fbshipit-source-id: 975af69773206da936d230cbadb5f69a002d92a3
2025-01-07 09:39:01 -08:00
Levi Tamasi
54ace7f340 Change the semantics of blob_garbage_collection_force_threshold to provide better control over space amp (#13022)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13022

Currently, `blob_garbage_collection_force_threshold` applies to the oldest batch of blob files, which is typically only a small subset of the blob files currently eligible for garbage collection. This can result in a form of head-of-line blocking: no GC-triggered compactions will be scheduled if the oldest batch does not currently exceed the threshold, even if a lot of higher-numbered blob files do. This can in turn lead to high space amplification that exceeds the soft bound implicit in the force threshold (e.g. 50% would suggest a space amp of <2 and 75% would imply a space amp of <4). The patch changes the semantics of this configuration threshold to apply to the entire set of blob files that are eligible for garbage collection based on `blob_garbage_collection_age_cutoff`. This provides more intuitive semantics for the option and can provide a better write amp/space amp trade-off. (Note that GC-triggered compactions still pick the same SST files as before, so triggered GC still targets the oldest the blob files.)

Reviewed By: jowlyzhang

Differential Revision: D62977860

fbshipit-source-id: a999f31fe9cdda313de513f0e7a6fc707424d4a3
2024-09-19 15:47:13 -07:00
Peter Dillinger
98c33cb8e3 Steps toward making IDENTITY file obsolete (#13019)
Summary:
* Set write_dbid_to_manifest=true by default
* Add new option write_identity_file (default true) that allows us to opt-in to future behavior without identity file
* Refactor related DB open code to minimize code duplication

_Recommend hiding whitespace changes for review_

Intended follow-up: add support to ldb for reading and even replacing the DB identity in the manifest. Could be a variant of `update_manifest` command or based on it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13019

Test Plan: unit tests and stress test updated for new functionality

Reviewed By: anand1976

Differential Revision: D62898229

Pulled By: pdillinger

fbshipit-source-id: c08b25cf790610b034e51a9de0dc78b921abbcf0
2024-09-19 14:05:21 -07:00
anand76
c21fe1a47f Add ticker stats for read corruption retries (#12923)
Summary:
Add a couple of ticker stats for corruption retry count and successful retries. This PR also eliminates an extra read attempt when there's a checksum mismatch in a block read from the prefetch buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12923

Test Plan: Update existing tests

Reviewed By: jowlyzhang

Differential Revision: D61024687

Pulled By: anand1976

fbshipit-source-id: 3a08403580ab244000e0d480b7ee0f5a03d76b06
2024-08-12 15:32:07 -07:00
Zixuan Tan
a97a1f3247 Fix incorrect refillPeriodMicros unit in the document (#12832)
Summary:
The default value for `refillPeriodMicros` is `100 * 1000`, which means 100ms (or 100,000us).

The document comments say 100,000ms (equivalent to 100 seconds), which is incorrect and misleading. This PR fixes this typo.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12832

Reviewed By: cbi42

Differential Revision: D59492336

Pulled By: ajkr

fbshipit-source-id: c2f55a8b996fe078a1510fcbebaea92ec0075929
2024-07-08 18:08:53 -07:00
Peter Dillinger
45c105104b Set optimize_filters_for_memory by default (#12377)
Summary:
This feature has been around for a couple of years and users haven't reported any problems with it.

Not quite related: fixed a technical ODR violation in public header for info_log_level in case DEBUG build status changes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12377

Test Plan: unit tests updated, already in crash test. Some unit tests are expecting specific behaviors of optimize_filters_for_memory=false and we now need to bake that in.

Reviewed By: jowlyzhang

Differential Revision: D54129517

Pulled By: pdillinger

fbshipit-source-id: a64b614840eadd18b892624187b3e122bab6719c
2024-04-30 08:33:31 -07:00
Andrew Kryczka
b4c520cadc change default CompactionOptions::compression while deprecating it (#12587)
Summary:
I had a TODO to complete `CompactionOptions`'s compression API but never did it: d610e14f93/db/compaction/compaction_picker.cc (L371-L373)

Without solving that TODO, the API remains incomplete and unsafe. Now, however, I don't think it's worthwhile to complete it. I think we should instead delete the API entirely. This PR deprecates it in preparation for deletion in a future major release. The `ColumnFamilyOptions` settings for compression should be good enough for `CompactFiles()` since they are apparently good enough for every other compaction, including `CompactRange()`.

In the meantime, I also changed the default `CompressionType`. Having callers of `CompactFiles()` use Snappy compression by default does not make sense when the default could be to simply use the same compression type that is used for every other compaction. As a bonus, this change makes the default `CompressionType` consistent with the `CompressionOptions` that will be used.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12587

Reviewed By: hx235

Differential Revision: D56619273

Pulled By: ajkr

fbshipit-source-id: 1477de49f14b06c72d6f0045616a8ce91d97e66e
2024-04-26 13:03:21 -07:00
Radek Hubner
a8035ebc0b Fix exception on RocksDB.getColumnFamilyMetaData() (#12474)
Summary:
https://github.com/facebook/rocksdb/issues/12466 reported a bug when `RocksDB.getColumnFamilyMetaData()` is called on an existing database(With files stored on disk). As neilramaswamy mentioned, this was caused by https://github.com/facebook/rocksdb/issues/11770 where the signature of `SstFileMetaData` constructor was changed, but JNI code wasn't updated.

This PR fix JNI code, and also properly populate `fileChecksum` on `SstFileMetaData`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12474

Reviewed By: jowlyzhang

Differential Revision: D55811808

Pulled By: ajkr

fbshipit-source-id: 2ab156f41eaf4a4f30c49e6df421b61e8451230e
2024-04-05 13:55:18 -07:00
Alexey Vinogradov
c4df598b8e Implement PerfContex#toString for the Java API (#12473)
Summary:
I've implemented `PerfContext#toString` for the Java API.
See: https://groups.google.com/g/rocksdb/c/qbY_gNhbyAg

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12473

Reviewed By: jowlyzhang

Differential Revision: D55660871

Pulled By: cbi42

fbshipit-source-id: f0528fba31ac06e16495e4f49b0bafe0dbc1bc61
2024-04-03 14:33:31 -07:00
Radek Hubner
db9eb10b5b Enable all Java test via CMake (#12446)
Summary:
This PR follows the work done in  https://github.com/facebook/rocksdb/issues/11756 and enable all Java test to be run via CMake/Ctest.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12446

Reviewed By: jowlyzhang

Differential Revision: D55661635

Pulled By: cbi42

fbshipit-source-id: 3ea49a121a3ba72089632ff43ee7fe4419b08a96
2024-04-03 11:03:11 -07:00
anand76
4868c10b44 Retry block reads on checksum mismatch (#12427)
Summary:
On file systems that support storage level data checksum and reconstruction, retry SST block reads for point lookups, scans, and flush and compaction if there's a checksum mismatch on the initial read. A file system can indicate its support by setting the `FSSupportedOps::kVerifyAndReconstructRead` bit in `SupportedOps`.

Tests:
Add new unit tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12427

Reviewed By: ajkr

Differential Revision: D55025941

Pulled By: anand1976

fbshipit-source-id: dbd990cb75e03f756c8a66d42956f645c0b6d55e
2024-03-18 16:16:05 -07:00
Alan Paxton
d9a441113e JNI get_helper code sharing / multiGet() use efficient batch C++ support (#12344)
Summary:
Implement RAII-based helpers for JNIGet() and multiGet()

Replace JNI C++ helpers `rocksdb_get_helper, rocksdb_get_helper_direct`, `multi_get_helper`, `multi_get_helper_direct`, `multi_get_helper_release_keys`, `txn_get_helper`, and `txn_multi_get_helper`.

The model is to entirely do away with a single helper, instead a number of utility methods allow each separate
JNI `Get()` and `MultiGet()` method to organise their parameters efficiently, then call the underlying C++ `db->Get()`,
`db->MultiGet()`, `txn->Get()`, or `txn->MultiGet()` method itself, and use further utilities to retrieve results.

Roughly speaking:

* get keys into C++ form
* Call C++ Get()
* get results and status into Java form

We achieve a useful performance gain as part of this work; by using the updated C++ multiGet we immediately pick up its performance gains (batch improvements to multiGet C++ were previously implemented, but not until now used by Java/JNI). multiGetBB already uses the batched C++ multiGet(), and all other benchmarks show consistent improvement after the changes:

## Before:
```
Benchmark (columnFamilyTestType) (keyCount) (keySize) (multiGetSize) (valueSize) Mode Cnt Score Error Units
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 256 thrpt 25 5315.459 ± 20.465 ops/s
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 1024 thrpt 25 5673.115 ± 78.299 ops/s
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 4096 thrpt 25 2616.860 ± 46.994 ops/s
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 16384 thrpt 25 1700.058 ± 24.034 ops/s
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 65536 thrpt 25 791.171 ± 13.955 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 256 thrpt 25 6129.929 ± 94.200 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 1024 thrpt 25 7012.405 ± 97.886 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 4096 thrpt 25 2799.014 ± 39.352 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 16384 thrpt 25 1417.205 ± 22.272 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 65536 thrpt 25 655.594 ± 13.050 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 256 thrpt 25 6147.247 ± 82.711 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 1024 thrpt 25 7004.213 ± 79.251 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 4096 thrpt 25 2715.154 ± 110.017 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 16384 thrpt 25 1408.070 ± 31.714 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 65536 thrpt 25 623.829 ± 57.374 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 256 thrpt 25 6119.243 ± 116.313 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 1024 thrpt 25 6931.873 ± 128.094 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 4096 thrpt 25 2678.253 ± 39.113 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 16384 thrpt 25 1337.384 ± 19.500 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 65536 thrpt 25 625.596 ± 14.525 ops/s
```

## After:
```
Benchmark                                    (columnFamilyTestType)  (keyCount)  (keySize)  (multiGetSize)  (valueSize)   Mode  Cnt     Score     Error  Units
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100          256  thrpt   25  5191.074 ±  78.250  ops/s
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100         1024  thrpt   25  5378.692 ± 260.682  ops/s
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100         4096  thrpt   25  2590.183 ±  34.844  ops/s
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100        16384  thrpt   25  1634.793 ±  34.022  ops/s
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100        65536  thrpt   25   786.455 ±   8.462  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100          256  thrpt   25  5285.055 ±  11.676  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100         1024  thrpt   25  5586.758 ± 213.008  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100         4096  thrpt   25  2527.172 ±  17.106  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100        16384  thrpt   25  1819.547 ±  12.958  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100        65536  thrpt   25   803.861 ±   9.963  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100          256  thrpt   25  5253.793 ±  28.020  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100         1024  thrpt   25  5705.591 ±  20.556  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100         4096  thrpt   25  2523.377 ±  15.415  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100        16384  thrpt   25  1815.344 ±  11.309  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100        65536  thrpt   25   820.792 ±   3.192  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100          256  thrpt   25  5262.184 ±  20.477  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100         1024  thrpt   25  5706.959 ±  23.123  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100         4096  thrpt   25  2520.362 ±   9.170  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100        16384  thrpt   25  1789.185 ±  14.239  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100        65536  thrpt   25   818.401 ±  12.132  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100          256  thrpt   25  6978.310 ±  14.084  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100         1024  thrpt   25  7664.242 ±  22.304  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100         4096  thrpt   25  2881.778 ±  81.054  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100        16384  thrpt   25  1599.826 ±   7.190  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100        65536  thrpt   25   737.520 ±   6.809  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100          256  thrpt   25  6974.376 ±  10.716  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100         1024  thrpt   25  7637.440 ±  45.877  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100         4096  thrpt   25  2820.472 ±  42.231  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100        16384  thrpt   25  1716.663 ±   8.527  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100        65536  thrpt   25   755.848 ±   7.514  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100          256  thrpt   25  6943.651 ±  20.040  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100         1024  thrpt   25  7679.415 ±   9.114  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100         4096  thrpt   25  2844.564 ±  13.388  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100        16384  thrpt   25  1729.545 ±   5.983  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100        65536  thrpt   25   783.218 ±   1.530  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100          256  thrpt   25  6944.276 ±  29.995  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100         1024  thrpt   25  7670.301 ±   8.986  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100         4096  thrpt   25  2839.828 ±  12.421  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100        16384  thrpt   25  1730.005 ±   9.209  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100        65536  thrpt   25   787.096 ±   1.977  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100          256  thrpt   25  6896.944 ±  21.530  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100         1024  thrpt   25  7622.407 ±  12.824  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100         4096  thrpt   25  2927.538 ±  19.792  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100        16384  thrpt   25  1598.041 ±   4.312  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100        65536  thrpt   25   744.564 ±   9.236  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100          256  thrpt   25  6853.760 ±  78.041  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100         1024  thrpt   25  7360.917 ± 355.365  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100         4096  thrpt   25  2848.774 ±  13.409  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100        16384  thrpt   25  1727.688 ±   3.329  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100        65536  thrpt   25   776.088 ±   7.517  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100          256  thrpt   25  6910.339 ±  14.366  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100         1024  thrpt   25  7633.660 ±  10.830  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100         4096  thrpt   25  2787.799 ±  81.775  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100        16384  thrpt   25  1726.517 ±   6.830  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100        65536  thrpt   25   787.597 ±   3.362  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100          256  thrpt   25  6922.445 ±  10.493  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100         1024  thrpt   25  7604.710 ±  48.043  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100         4096  thrpt   25  2848.788 ±  15.783  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100        16384  thrpt   25  1730.837 ±   6.497  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100        65536  thrpt   25   794.557 ±   1.869  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100          256  thrpt   25  6918.716 ±  15.766  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100         1024  thrpt   25  7626.692 ±   9.394  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100         4096  thrpt   25  2871.382 ±  72.155  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100        16384  thrpt   25  1598.786 ±   4.819  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100        65536  thrpt   25   748.469 ±   7.234  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100          256  thrpt   25  6922.666 ±  17.131  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100         1024  thrpt   25  7623.890 ±   8.805  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100         4096  thrpt   25  2850.698 ±  18.004  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100        16384  thrpt   25  1727.623 ±   4.868  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100        65536  thrpt   25   774.534 ±  10.025  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100          256  thrpt   25  5486.251 ±  13.582  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100         1024  thrpt   25  4920.656 ±  44.557  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100         4096  thrpt   25  3922.913 ±  25.686  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100        16384  thrpt   25  2873.106 ±   4.336  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100        65536  thrpt   25   802.404 ±   8.967  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100          256  thrpt   25  4817.996 ±  18.042  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100         1024  thrpt   25  4243.922 ±  13.929  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100         4096  thrpt   25  3175.998 ±   7.773  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100        16384  thrpt   25  2321.990 ±  12.501  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100        65536  thrpt   25  1753.028 ±   7.130  ops/s
```

Closes https://github.com/facebook/rocksdb/issues/11518

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12344

Reviewed By: cbi42

Differential Revision: D54809714

Pulled By: pdillinger

fbshipit-source-id: bee3b949720abac073bce043b59ce976a11e99eb
2024-03-12 12:42:08 -07:00
Alan Paxton
c4d37da826 Java API - Fix handling of CF handles in DB subclasses (#12417)
Summary:
The most general `open()` method for each of RocksDB, TtlDB, OptimisticTransactionDB and TransactionDB should
- ensure the default CF is supplied in the list of descriptors
- cache the default CF handle
- store open CF handles for automatic close on DB close
The `close()` method in each of these DB subclasses should `close()` all the owned CF handles.

I can’t find a cleaner way to build some generalised open/close that does this for all DB subclasses, so it exists as cut and paste with variations in the 4 different DB subclasses.

Added some slightly paranoid testing that CF handles explicitly referred to as default in a list of CF handles in the general open methods, and the simple open that doesn’t supply a CF, end up reading and writing to the same CF. Prompted by the fact that this code is a bit opaque; the first returned handle is the DB.

As part of this, fix the bug where the Java side of `OptimisticsTransactionDB` was not setting up default column family; this was visible when setting up an iterator; add a test to validate that the iterator is OK. A single Java reference to the default column family was not being created in the OptimisticsTransactionDB RocksDB subclass; it should be created in all subclasses. The same problem had previously been fixed for TtlDB.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12417

Reviewed By: ajkr

Differential Revision: D54807643

Pulled By: pdillinger

fbshipit-source-id: 66f34e56a822a009a8f2018d401cf8940d91aa35
2024-03-12 10:33:27 -07:00
Yu Zhang
2940acac00 Persist table options use_delta_encoding in options file (#11987)
Summary:
This option is used for encoding keys in block based table files. It has been having a default true value since its introduction.

Users may not notice this option is not persisted in options file unless they are explicitly setting it to false. If the users expect `Iterator::GetProperty("rocksdb.iterator.is-key-pinned")` to return 1 when setting `ReadOptions.pin_data = true`, they should have noticed loading options file won't work and have work around for this by always explicitly set this option to false for opening DB. This change won't impact those users except that now they can remove their work around. If the users are not relying on key pinning behavior at all and as a result didn't notice the option is not persisted, this change shouldn't have any visible behavior impact either.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11987

Reviewed By: hx235

Differential Revision: D54093238

Pulled By: jowlyzhang

fbshipit-source-id: 256a3348c44cf91349034d1f6e242c437b32b9a5
2024-02-23 14:13:28 -08:00