rocksdb

mirror of https://github.com/facebook/rocksdb synced 2026-05-24 09:29:21 +00:00

Author	SHA1	Message	Date
Xingbo Wang	21723bbbef	Add async WAL precreation Summary: - Add experimental immutable `DBOptions::async_wal_precreate` to reserve and open one future WAL on a background HIGH-priority task, with sanitization that disables the optimization when WAL recycling is configured. - Split WAL creation into open/preallocate and start phases so `SwitchMemtable()` can consume a prepared WAL after writing normal WAL metadata, wait for in-flight precreation, fall back to synchronous creation, and delete an unstarted prepared WAL on start failure.
- Keep WAL numbering, close, recovery, and read-only open safe for empty future WAL files left by async precreation; `error_if_wal_file_exists=true` now rejects non-empty WALs while tolerating empty WALs. - Add public option plumbing for the C API, options parsing/stringification, random option testing, `db_bench`, `db_stress`, and crash-test configuration. - Add WAL precreate statistics counters plus Java `TickerType`/JNI mappings, and update C++, C, and Java read-only-open documentation for the empty-WAL behavior. - Add focused WAL/option/C/Java tests for async precreate ready/wait/failure/recovery paths, read-only WAL detection, option sanitization, and API plumbing, plus write-flow docs and unreleased history entries for the new feature and behavior change. PR https://github.com/facebook/rocksdb/pull/14738 Reviewed By: pdillinger Differential Revision: D105020559 fbshipit-source-id: 5059b424702e021abb8de65ceeb6d3b975280ffc	2026-05-15 10:59:47 -07:00
Peter Dillinger	87c554b492	Persist compacted manifest size for auto-tuning across DB::Open (#14725) Summary: last_compacted_manifest_file_size_ drives TuneMaxManifestFileSize() to compute the manifest rotation threshold, but it started at 0 on every DB::Open and was only populated after the first manifest rotation. This is really only a problem with reuse_manifest_on_open, because no fresh manifest is created on open. Add a new forward-compatible (safe-to-ignore) MANIFEST tag kLastCompactedManifestFileSize that records the approximate compacted manifest size at the end of WriteCurrentStateToManifest. During recovery, the value is loaded and used to immediately tune the rotation threshold. The record includes a rough estimate of its own overhead (~15 bytes) and must be the last record written by WriteCurrentStateToManifest for accurate estimation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14725 Test Plan: Extended AutoTuneManifestSize in db_etc3_test to close and reopen with reuse_manifest_on_open after establishing a known auto-tuning state. Verifies that the manifest file number is preserved (no spurious rotation) and that subsequent CF additions don't trigger rotation -- proving the persisted compacted size keeps the tuned threshold correct. Verified the test fails when the recovery loading is disabled. Relax a fragile Java test that was dependent on the exact size of the manifest file. SHORT_TEST=1 ./tools/check_format_compatible.sh Reviewed By: anand1976 Differential Revision: D104464522 Pulled By: pdillinger fbshipit-source-id: 4f5d22d2e149bd40a523ee11780e5e3344803c19	2026-05-13 18:31:49 -07:00
Danny Chen	2e7cf42cda	Add MANIFEST_VALIDATION_FAILURE_COUNT statistic (#14657 ) Summary: CONTEXT: The manifest validation on close feature (verify_manifest_content_on_close) detects corruption but does not increment any statistics counter, making it harder to monitor in production. WHAT: Add a new ticker MANIFEST_VALIDATION_FAILURE_COUNT that is incremented each time content validation detects manifest corruption during DB::Close(). The counter fires per corruption detection, so it can increment up to 2 times per close (once on initial check, once after rewrite attempt). Updated all existing manifest validation tests to verify the counter value. Pull Request resolved: https://github.com/facebook/rocksdb/pull/14657 Test Plan: - All 7 manifest validation tests pass with new stat assertions - 5x repeat with COERCE_CONTEXT_SWITCH=1 shows no flakiness - Full version_set_test suite (212 tests) passes Reviewed By: anand1976 Differential Revision: D102404260 Pulled By: dannyhchen fbshipit-source-id: 21a0aa1ad8de12a935caf5642e41ccf2a47b46d9	2026-04-24 15:22:15 -07:00
Josh Kang	3f51c0a185	Convert sequential single deletes into range tombstones (#14448 ) Summary: Add a read-path optimization that converts contiguous point tombstones into range tombstones during forward/reverse iteration. When a configurable threshold of consecutive point deletions (kTypeDeletion, kTypeDeletionWithTimestamp, kTypeSingleDeletion — with no live keys between them) is detected, a range tombstone covering `[first_tombstone_key, next_live_key)` is inserted into the active mutable memtable. This benefits future iterators by enabling efficient skipping via range tombstone fragmentation. If there is a memtable switch during the read iteration, then the range deletion entry is discarded. The inserted range tombstones are logically redundant (they don't delete anything that isn't already deleted by point tombstones), skip WAL (they're a derived optimization regenerated by future reads on crash), and use the max tombstone sequence number so they don't interfere with newer writes. ## Key changes - New option `min_tombstones_for_range_conversion` (`AdvancedColumnFamilyOptions`): Threshold of contiguous point tombstones before converting to a range tombstone. Default 0 (disabled). Dynamically changeable via `SetOptions()`. - `DBIter` tracking logic (`db/db_iter.cc`): Tracks contiguous tombstones during `FindNextUserEntryInternal()` (forward) and `PrevInternal()` (reverse). When a live key terminates a run that meets the threshold, `MaybeInsertRangeTombstone()` inserts `[first_tombstone, live_key)` into the active memtable. - `FindValueForCurrentKey` `found_visible` output (`db/db_iter.cc`): Distinguishes "key deleted at this snapshot" from "no visible entries" so reverse tracking doesn't treat post-snapshot keys as tombstones. - `IterKey::Swap()` (`db/dbformat.h`): Efficiently tracks reverse tombstone run end keys without extra allocations. 
- `MemTable::AddLogicallyRedundantRangeTombstone()` (`db/memtable.cc`): Concurrent-safe range tombstone insertion into the active memtable. Range tombstone skiplist always uses concurrent inserts. - `ConstructFragmentedRangeTombstones` race fix (`db/db_impl/db_impl_write.cc`): Moved after `MarkImmutable()` to prevent lost entries. - `MarkImmutable` ordering fix (`db/memtable_list.cc`): Called before `current_->Add()` to close a race window. - Prefix filter awareness (`db/db_iter.cc`): Tombstone tracking scoped to the seek prefix when prefix filtering is active. See dedicated section below. - Transaction awareness (`db/db_iter.cc`): Tombstones with `seq > snapshot` excluded from tracking. `min_uncommitted` guard uses `insert_seq` (which may be bumped to `earliest_seq`) instead of `range_tomb_max_seq_`. See dedicated section below. - Duplicate range check: Skips insertion if the memtable already covers `[start, end)`. - New statistics: `READ_PATH_RANGE_TOMBSTONES_INSERTED` and `READ_PATH_RANGE_TOMBSTONES_DISCARDED`. - Memtable MultiGet batch lookup (`memtable/inlineskiplist.h`, `db/memtable.cc`): `InlineSkipList::MultiGet()` with cached search path ("finger") for sorted key lookups. - New option `memtable_batch_lookup_optimization` (`AdvancedColumnFamilyOptions`): Enables batch lookup for memtable MultiGet. Default false. Immutable. ## Deciding Range Tombstone Seqno - The range tombstone is inserted with `insert_seq = max(range_tomb_max_seq_, earliest_seq)` where `range_tomb_max_seq_` is the maximum sequence number across all point tombstones in the contiguous run, and `earliest_seq` is the memtable's earliest sequence number. This preserves the memtable's `earliest_seqno_` invariant. - If the iterator's snapshot sequence (`sequence_`) predates the memtable's `earliest_seq`, insertion is skipped entirely to avoid unintentionally covering entries between `sequence_` and `earliest_seq`. 
## ConstructFragmentedRangeTombstones Race Fix - `MarkImmutable()` and `ConstructFragmentedRangeTombstones()` are now called before `mutex_.Lock()` in `SwitchMemtable`, keeping this work outside the DB mutex. `MarkImmutable()` blocks concurrent `AddLogicallyRedundantRangeTombstone()` calls via `immutable_mutex_`, ensuring no range tombstones are inserted after the fragmented list is built. `MarkImmutable()` is idempotent, so `MemTableList::Add()` calling it again inside the mutex is harmless. ## Prefix Filter Safety - When prefix filtering is active, the BBTI bloom filter may reject SST files outside the seek prefix, but the memtable (no bloom filter) returns keys across prefix boundaries. Tombstone tracking is scoped to the seek prefix so that converted range tombstones cannot cover live keys hidden in filtered files. - `total_order_seek=true` disables prefix filtering — all files are visible, so tombstones safely span prefix boundaries. - Behavior change: Seeking to an out-of-domain key with `total_order_seek=false` now treats it as total-order (prefix_ not set). When `prefix_same_as_start=true`, iterating past an out-of-domain key cleanly invalidates the iterator instead of calling `Transform()` on it (which was UB in release builds with `FixedPrefixTransform`). This is a requirement because an incorrect iterator scan could lead to a range tombstone covering a live key. ## Transaction Support - Tombstones written by the transaction's own uncommitted writes (sequence > snapshot) are now excluded from contiguous tombstone tracking entirely at the tracking site in `FindNextUserEntryInternal()` and `PrevInternal()`. Previously, tracking relied on a `min_uncommitted` check at insertion time, but this was insufficient — a transaction's own Delete with `seq > snapshot` could extend a run of committed tombstones, and the resulting range tombstone would cover data visible to other snapshots. - The fix skips any tombstone with `ikey_.sequence > sequence_` during tracking. 
If a transaction-owned tombstone appears mid-run, it flushes the accumulated committed run first, then resets tracking. This ensures only tombstones visible to the current snapshot are ever converted. - Both WritePrepared and WriteUnprepared transactions are supported with dedicated test coverage: - WritePrepared: When tombstones are committed before `Prepare()`, their seqnos are below `min_uncommitted` and insertion proceeds safely. When `Prepare()` happens first, tombstone seqnos exceed `min_uncommitted` and insertion is blocked. - WriteUnprepared: Multiple unprepared batches with different seqno ranges are handled correctly. Own transaction Deletes that extend a committed tombstone run block insertion of the entire run. After rollback, data correctness is verified. ## UDT support - When user-defined timestamps (UDT) are enabled, keys include an 8-byte timestamp suffix. The comparator, Put/Delete APIs, and ReadOptions all require timestamps. - Forward exhaustion with UDT: `iterate_upper_bound_` is a plain user key without a timestamp suffix. It is padded with min timestamp via `AppendKeyWithMinTimestamp()` so it sorts after all entries with this user key, preserving the exclusive bound semantics. - Reverse exhaustion with UDT: The end key comes from either the previous live key (which already has a proper timestamp suffix) or the seek target set by `SetSavedKeyToSeekForPrevTarget()` (which appends a timestamp via `SetInternalKey(..., timestamp_ub_)` and `UpdateInternalKey(..., ts)`). In both cases, the end key is already properly timestamped, so no additional padding is needed unlike forward exhaustion. - Contiguous tombstone detection works correctly with UDT because the underlying `kTypeDeletionWithTimestamp` entries are tracked the same way as `kTypeDeletion`. ## Concurrent Iterators - Should concurrent iterators happen to read the range at the same time, both will produce the same range and seqno entry. 
Only one will be accepted by the skip list and the others will be rejected. Future iterators will read the range and not even attempt to insert the range. - There is nothing preventing similar ranges from being inserted however. Two iterators can produce overlapping ranges, but this protection would be complicated to implement and there is no evidence that it is a likely scenario yet. Pull Request resolved: https://github.com/facebook/rocksdb/pull/14448 Test Plan: - Unit tests (`db/db_iterator_test.cc`): `ReadPathRangeTombstoneTest` parameterized by forward/reverse with cases for basic insertion, non-contiguous (below threshold), memtable switch, exhausted iterator with/without bounds, direction change, mixed Delete/SingleDelete, single-delete-only runs, snapshot predating memtable, block cache tier incomplete, skip when covered by existing range, UDT basic scan, UDT exhaustion, prefix filter cross-prefix scan (`PrefixFilterCrossPrefixScanCoversLiveKey` with default/total_order_seek/prefix_same_as_start variants), stale ikey from forward-then-reverse scan (`StaleIkeyFromForwardThenReverse`), and reseek stale ikey (`ReseekStaleIkey`). - Concurrency test (`db/db_test2.cc`): `DBTestConcurrentRangeTombstoneConversions` parameterized by `(allow_concurrent_memtable_write, min_tombstones_for_range_conversion)` with mixed writers, deleters, range deleters, and concurrent forward/reverse readers. - Transaction tests (`utilities/transactions/write_prepared_transaction_test.cc`, `write_unprepared_transaction_test.cc`): Tests for WritePrepared (insertion allowed when tombstones committed before prepare, blocked when after; seqno bump shadowing prepared writes `RangeTombstoneSeqnoBumpShadowsPreparedWrite`) and WriteUnprepared (multiple batches, extended visibility with CalcMaxVisibleSeq, own deletions with rollback). 
- IterKey::Swap tests (`db/dbformat_test.cc`): `IterKeySwapTest` parameterized over `(key_len, copy, use_secondary)` × 2 covering all inline/heap/pinned/secondary combinations. - InlineSkipList MultiGet tests (`memtable/inlineskiplist_test.cc`): Basic, exact matches, empty, single key, randomized validation against `std::set::lower_bound`, duplicate keys with callback walk, and concurrent MultiGet with read-after-write consistency. - Memtable MultiGet tests (`db/db_basic_test.cc`): Batch lookup, overwrite, flush, merge, disabled by default, paranoid checks, and snapshot tests. - Stress test coverage: `min_tombstones_for_range_conversion` and `memtable_batch_lookup_optimization` options added to `db_crashtest.py` and `db_stress` flags. - `make check` passes all tests. ### Benchmark results Tombstones scattered randomly in clusters via `seek_nexts_to_delete` for realistic workloads. DB Setup A (scattered deletes, 100 per seek): ``` # Step 1: Create and compact ./db_bench --benchmarks=fillseq,compact --seed=1 --compression_type=none --num=1000000 --db=<DB> # Step 2: Scatter tombstones (5000 seeks × 100 deletes ≈ 500k tombstones) ./db_bench --benchmarks=seekrandom,flush --seed=1 --compression_type=none --num=2000 \ --seek_nexts=0 --seek_nexts_to_delete=100 --use_existing_db=1 --threads=1 --db=<DB> ``` DB Setup A2 (scattered deletes, 8 per seek): ``` # Step 1: Create and compact ./db_bench --benchmarks=fillseq,compact --seed=1 --compression_type=none --num=1000000 --db=<DB> # Step 2: Scatter tombstones (5000 seeks × 8 deletes ≈ 40k tombstones) ./db_bench --benchmarks=seekrandom,flush --seed=1 --compression_type=none --num=2000 \ --seek_nexts=0 --seek_nexts_to_delete=8 --use_existing_db=1 --threads=1 --db=<DB> ``` DB Setup B (no deletes): ``` ./db_bench --benchmarks=fillseq,compact --seed=1 --compression_type=none --num=1000000 \ [--key_size=100] --db=<DB> ``` Read workload (same for all): ``` ./db_bench --benchmarks=seekrandom --seek_nexts=100 --threads=8 \ 
--reverse_iterator={true,false} --seed=1 --use_existing_db=1 \
--compression_type=none --num=1000000 --duration=10 \
--disable_auto_compactions \
[--key_size=100] [--min_tombstones_for_range_conversion=X] --db=<DB_COPY>
```
Each workload averaged over 3 runs.

Table 1: seekrandom forward, scattered deletes (2000 seeks × 100 deletes/seek)

| Variant | avg ops/s | % vs main |
|---------|-----------|-----------|
| main | 2,895 | - |
| threshold=0 | 2,869 | -0.9% |
| threshold=8 | 287,334 | +9,824% |

Table 2: seekrandom reverse, scattered deletes (2000 seeks × 100 deletes/seek)

| Variant | avg ops/s | % vs main |
|---------|-----------|-----------|
| main | 544 | - |
| threshold=0 | 548 | +0.7% |
| threshold=8 | 206,491 | +37,860% |

Table 3: seekrandom forward, scattered deletes (2000 seeks × 8 deletes/seek)

| Variant | avg ops/s | % vs main |
|---------|-----------|-----------|
| main | 194,049 | - |
| threshold=0 | 195,703 | +0.9% |
| threshold=8 | 310,740 | +60.1% |

Table 4: seekrandom reverse, scattered deletes (2000 seeks × 8 deletes/seek)

| Variant | avg ops/s | % vs main |
|---------|-----------|-----------|
| main | 63,854 | - |
| threshold=0 | 69,266 | +8.5% |
| threshold=8 | 218,101 | +241.6% |

Table 5: seekrandom forward, no deletes (regression check)

| Variant | key=16B avg ops/s | % vs main | key=100B avg ops/s | % vs main |
|---------|-------------------|-----------|--------------------|-----------|
| main | 330,901 | - | 236,048 | - |
| threshold=0 | 328,398 | -0.8% | 238,055 | +0.9% |
| threshold=8 | 332,539 | +0.5% | 233,776 | -1.0% |

Table 6: seekrandom reverse, no deletes (regression check)

| Variant | key=16B avg ops/s | % vs main | key=100B avg ops/s | % vs main |
|---------|-------------------|-----------|--------------------|-----------|
| main | 261,445 | - | 192,177 | - |
| threshold=0 | 265,020 | +1.4% | 191,616 | -0.3% |
| threshold=8 | 250,881 | -4.0% | 189,239 | -1.5% |

Reviewed By: xingbowang Differential Revision: D96203950 Pulled By: joshkang97 fbshipit-source-id: 06ba66ebde3c355f04671d1e681f1b1586e8751d	2026-04-03 22:47:51 -07:00
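The core rule of commit 3f51c0a185 (a run of at least `min_tombstones_for_range_conversion` contiguous point tombstones, terminated by a live key, becomes `[first_tombstone_key, next_live_key)` with `insert_seq = max(range_tomb_max_seq_, earliest_seq)`) can be sketched in isolation. This is a simplified illustration, not the `DBIter` code: `Entry`, `RangeTombstone`, and `ConvertTombstoneRuns` are hypothetical names, and the sketch ignores snapshots, transactions, prefix filtering, iterator exhaustion, and reverse iteration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified view of one entry as seen by a forward iterator.
struct Entry {
  std::string user_key;
  bool is_tombstone;  // any kind of point deletion
  uint64_t seq;       // sequence number of the entry
};

// Hypothetical result type: deletes keys in [start, end) at sequence seq.
struct RangeTombstone {
  std::string start;
  std::string end;
  uint64_t seq;
};

// Scans entries in key order and emits one range tombstone for every run of
// at least min_tombstones contiguous point tombstones that is terminated by
// a live key. A threshold of 0 disables the conversion.
std::vector<RangeTombstone> ConvertTombstoneRuns(
    const std::vector<Entry>& entries, std::size_t min_tombstones,
    uint64_t earliest_seq) {
  // Precondition of this sketch: entries arrive sorted by user key.
  assert(std::is_sorted(entries.begin(), entries.end(),
                        [](const Entry& a, const Entry& b) {
                          return a.user_key < b.user_key;
                        }));
  std::vector<RangeTombstone> out;
  if (min_tombstones == 0) return out;

  std::size_t run_len = 0;
  uint64_t run_max_seq = 0;
  std::string run_start;
  for (const Entry& e : entries) {
    if (e.is_tombstone) {
      if (run_len == 0) run_start = e.user_key;
      ++run_len;
      run_max_seq = std::max(run_max_seq, e.seq);
      continue;
    }
    // A live key ends the current run; convert it if it is long enough.
    if (run_len >= min_tombstones) {
      out.push_back(
          {run_start, e.user_key, std::max(run_max_seq, earliest_seq)});
    }
    run_len = 0;
    run_max_seq = 0;
  }
  // A trailing run with no terminating live key is dropped in this sketch.
  return out;
}
```

Because the end key is the next live key and is exclusive, the emitted range covers only keys that the point tombstones already delete, which is why the commit describes the inserted tombstones as logically redundant.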
Josh Kang	5db0603613	Read-triggered compactions (#14426 ) Summary: Add read-triggered compaction, a new feature that reduces read amplification by compacting SST files that receive high read traffic. When an SST file's read frequency (`num_reads_sampled / file_size`) exceeds a configurable threshold, it is marked for compaction to a lower level. The feature introduces two new options: a CF option `read_triggered_compaction_threshold` (default 0, disabled) and a DB option `max_periodic_compaction_trigger_seconds` (default 43200s) that controls how often the background thread re-evaluates compaction scores on quiet databases. Both options are dynamically changeable. Lowering `max_periodic_compaction_trigger_seconds` does add some overhead, but generally is minimal, so running this every couple of minutes in a production environment seems fairly reasonable. ## Key changes - New CF option `read_triggered_compaction_threshold` (`advanced_options.h`): When positive, files with `reads_per_byte > threshold` are marked for compaction. Files at the last non-empty level are skipped (bottommost compaction handles those separately). Marked files are sorted by hotness (reads_per_byte descending). - New DB option `max_periodic_compaction_trigger_seconds` (`options.h`): Replaces the hardcoded 12-hour ceiling in `ComputeTriggerCompactionPeriod()`. Essential for read-triggered compaction on quiet DBs since there are no writes to trigger score re-evaluation. - Leveled compaction picker (`compaction_picker_level.cc`): Adds read-triggered as the lowest-priority compaction reason in `SetupInitialFiles()`, using the existing `PickFileToCompact` helper. - Universal compaction picker (`compaction_picker_universal.cc`): Adds `PickReadTriggeredCompaction` as lowest priority. 
Refactors shared "find output level + compute overlapping inputs + create Compaction" logic from both `PickDeleteTriggeredCompaction` and `PickReadTriggeredCompaction` into `BuildCompactionToNextLevel`, handling both single-level and multi-level universal cases. - Periodic trigger integration (`db_impl.cc`): `TriggerPeriodicCompaction` now also fires for CFs with `read_triggered_compaction_threshold > 0`, even without time-based compaction configured. - Stress test & db_bench support: Both `db_stress` and `db_bench` support the new options. `db_crashtest.py` randomly enables read-triggered compaction and sets a short periodic trigger interval when enabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/14426 Test Plan: Unit tests: - `compaction_picker_test` (8 new tests): `ReadTriggeredCompactionDisabled`, `ReadTriggeredCompactionBelowThreshold`, `ReadTriggeredCompactionAboveThreshold`, `NeedsCompactionReadTriggered`, `ReadTriggeredPicksFile`, `UniversalReadTriggeredCompaction`, `ReadTriggeredSkipsLastLevel`, `UniversalReadTriggeredNoPickWhenNotMarked` - `db_compaction_test`: `ReadTriggeredCompaction` integration test verifying end-to-end behavior with sync points - Stress test coverage Stress test: ``` make V=1 -j "CRASH_TEST_EXT_ARGS=--duration=600 --max_key=2500000 --max_compaction_trigger_wakeup_seconds=10 --read_triggered_compaction_threshold=0.0001 --interval=600" blackbox_crash_test ``` - confirmed read-triggered compactions from LOGS Benchmark (`db_bench`): Setup: 5M keys (100B values, 16B keys), leveled compaction, 5 levels, 4MB target file size. DB fully compacted, then 2M overlapping keys written without compaction to create L0/L1 overlap (82 files, ~294MB).
LSM shape change during readrandom with read-triggered compaction:
```
BEFORE: L0=9 files (15MB), L1=4 (16MB), L2=20 (69MB), L3=49 (194MB), total 82 files, 294MB
AFTER: L3=66 files (223MB)
```

| Benchmark | Config | avg ops/s | % change |
|-----------|--------|-----------|----------|
| readrandom (8 threads, 5M reads) | baseline (threshold=0) | 1,086,965 | - |
| readrandom (8 threads, 5M reads) | threshold=0.000001, trigger=5s | 1,453,697 | +33.7% |

Reviewed By: xingbowang Differential Revision: D97838716 Pulled By: joshkang97 fbshipit-source-id: a21fcb270c7fadd4f78d98b9c821982f220dd3f0	2026-03-27 14:43:52 -07:00
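The marking rule described in commit 5db0603613 (files with `num_reads_sampled / file_size` above `read_triggered_compaction_threshold` are marked, files at the last non-empty level are skipped, and marked files are ordered by hotness) can be illustrated with a small standalone sketch. `FileStats` and `MarkHotFiles` are hypothetical names for illustration and do not mirror the actual compaction picker code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical per-file statistics, loosely modelled on SST file metadata.
struct FileStats {
  uint64_t file_number;
  int level;
  uint64_t num_reads_sampled;
  uint64_t file_size;  // bytes
};

// Returns the numbers of files whose reads-per-byte exceeds the threshold,
// hottest first. A threshold <= 0 disables marking, and files at the last
// non-empty level are never marked.
std::vector<uint64_t> MarkHotFiles(const std::vector<FileStats>& files,
                                   double threshold,
                                   int last_non_empty_level) {
  assert(last_non_empty_level >= 0);
  std::vector<std::pair<double, uint64_t>> hot;  // (reads_per_byte, number)
  if (threshold <= 0.0) return {};
  for (const FileStats& f : files) {
    if (f.level == last_non_empty_level || f.file_size == 0) continue;
    const double reads_per_byte =
        static_cast<double>(f.num_reads_sampled) /
        static_cast<double>(f.file_size);
    if (reads_per_byte > threshold) {
      hot.emplace_back(reads_per_byte, f.file_number);
    }
  }
  // Sort by reads_per_byte, descending, so the hottest file comes first.
  std::sort(hot.begin(), hot.end(),
            [](const std::pair<double, uint64_t>& a,
               const std::pair<double, uint64_t>& b) {
              return a.first > b.first;
            });
  std::vector<uint64_t> marked;
  for (const auto& entry : hot) marked.push_back(entry.second);
  return marked;
}
```

Normalizing by file size keeps a large file from being marked merely because it holds more keys and therefore absorbs more reads in absolute terms.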
Xingbo Wang	1a4b1e42cc	Add include_blob_files option to GetApproximateSizes (#14501) Summary: Add a new boolean flag `include_blob_files` (default: `false`) to `SizeApproximationOptions` and a corresponding `INCLUDE_BLOB_FILES` enum value to `SizeApproximationFlags`. When set to `true`, the returned size includes an approximation of blob file data in the queried key range. Algorithm: The blob file size contribution is prorated using the SST size ratio: ``` blob_size_in_range ≈ total_blob_size * (sst_size_in_range / total_sst_size) ``` The blob-to-SST ratio (`total_blob_size / total_sst_size`) is computed once before the per-range loop, so iterating levels and blob files only happens once per `GetApproximateSizes` call regardless of how many ranges are queried. The per-range SST size (`ApproximateSize`) is computed once and shared between `include_files` and `include_blob_files`. Limitations: - Assumes blob data is distributed proportionally to SST data across the key space. May be inaccurate if blob value sizes vary significantly across different key ranges (e.g., one range has large blobs while another has small ones). - If there are no SST files (all data in memtables), the blob size contribution will be 0 even if blob files exist on disk.
Changes: - `include/rocksdb/options.h`: New `include_blob_files` field in `SizeApproximationOptions`; updated doc comments for `include_memtables`/`include_files` - `include/rocksdb/db.h`: New `INCLUDE_BLOB_FILES` in `SizeApproximationFlags` enum, updated flags-to-options mapping - `include/rocksdb/c.h`: New `rocksdb_size_approximation_flags_include_blob_files` C API enum value - `java/`: Added `INCLUDE_BLOB_FILES` to `SizeApproximationFlag.java` and JNI flag mapping in `rocksjni.cc` - `db/db_impl/db_impl.cc`: Blob-to-SST ratio computed once before loop, SST range size computed once per range and shared - `db_stress_tool/db_stress_test_base.cc`: Randomized `include_blob_files` in stress test Pull Request resolved: https://github.com/facebook/rocksdb/pull/14501 Test Plan: - New `DBBlobBasicTest.GetApproximateSizesIncludingBlobFiles` — verifies: - Size with blobs > without (full range) - Non-overlapping range returns 0 - Partial range returns proportionally less than full range - `SizeApproximationFlags` API works - Multi-range query: two sub-ranges sum approximately to the full-range result - Stress test now exercises the new option randomly Reviewed By: hx235 Differential Revision: D97984211 Pulled By: xingbowang fbshipit-source-id: e9127eac3308687fd4f0b17a771fd61fba6a8380	2026-03-27 13:53:21 -07:00
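The proration formula from commit 1a4b1e42cc can be written out as a tiny standalone function. This is a sketch of the arithmetic only, under the assumption that the three sizes are already known; `ApproximateBlobSizeInRange` is a hypothetical name and not the function used in `db_impl.cc`.

```cpp
#include <cassert>
#include <cstdint>

// Prorates the total blob file size by the fraction of SST data that falls
// in the queried range:
//   blob_size_in_range ~= total_blob_size * (sst_size_in_range / total_sst_size)
// Returns 0 when there are no SST files, matching the limitation noted in
// the commit message.
uint64_t ApproximateBlobSizeInRange(uint64_t total_blob_size,
                                    uint64_t total_sst_size,
                                    uint64_t sst_size_in_range) {
  if (total_sst_size == 0) return 0;
  // Assumption of this sketch: a range never holds more SST bytes than exist.
  assert(sst_size_in_range <= total_sst_size);
  // The ratio depends only on the totals, so a caller handling many ranges
  // could compute it once and reuse it.
  const double blob_to_sst_ratio = static_cast<double>(total_blob_size) /
                                   static_cast<double>(total_sst_size);
  return static_cast<uint64_t>(blob_to_sst_ratio *
                               static_cast<double>(sst_size_in_range));
}
```

For example, with 1000 bytes of blob data and 200 bytes of SST data in total, a range covering 50 bytes of SST data is credited with 250 bytes of blob data.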
Josh Kang	3ad23b2d94	Support automated interpolation search (#14383 ) Summary: Add automatic per-block interpolation search selection (`kAuto` mode) for index blocks. During SST construction, each index block's key distribution is analyzed using the coefficient of variation (CV) of gaps between restart-point keys. Blocks with uniformly distributed keys are flagged via a new bit in the data block footer, and at read time, `kAuto` resolves to interpolation search for uniform blocks and binary search otherwise. ## Key changes - New `BlockSearchType::kAuto` enum value: Resolves per-block at read time to either `kInterpolation` or `kBinary` based on the block's uniformity flag. Falls back to `kBinary` on older versions that don't recognize it. - Write-path uniformity analysis: `BlockBuilder::ScanForUniformity()` uses Welford's online algorithm to incrementally compute the CV of key gaps at restart points. The result is stored in a new bit (bit 30) of the data block footer's packed restart count. - New table option `uniform_cv_threshold` (default: -1 `disabled`): Controls how strict the uniformity check is. Set to negative to disable. Exposed in C++, Java (JNI), and `db_bench`. - Code reorganization: Block entry decode helpers (`DecodeEntry`, `DecodeKey`, `DecodeKeyV4`, `ReadBe64FromKey`) moved from `block.cc` to a new shared header `block_util.h` so they can be reused by `BlockBuilder` on the write path. - New histogram `BLOCK_KEY_DISTRIBUTION_CV`: Records the CV (scaled by 10000) of each index block's key distribution for observability. - Java bindings: `IndexSearchType.kAuto`, `uniformCvThreshold` getter/setter, JNI portal constructor signature updated, and `HistogramType.BLOCK_KEY_DISTRIBUTION_CV` added. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14383

Test Plan:
- `IndexBlockTest.IndexValueEncodingTest` parameterized to include `kAuto` search type alongside `kBinary` and `kInterpolation`, verifying correct seek/iteration behavior across all combinations of key distributions, restart intervals, and key lengths.
- Uniformity detection validated: blocks with uniform key distribution correctly set `is_uniform = true`, blocks with clustered/non-uniform keys set `is_uniform = false`.
- Stress test coverage
- Updated check_format_compatible to also include a "uniform" dataset. With the default `uniform_cv_threshold=-1`, this does not cause any incompatibility issues. When manually changing the threshold (e.g. `uniform_cv_threshold=1000`), I see `bad block contents`, which is expected.

## Benchmark
readrandom with `fillrandom,compact -seed=1 --statistics`:

| Benchmark | Branch | Params | avg ops/s | % change vs main | CV P50 |
|-----------|--------|--------|-----------|------------------|--------|
| readrandom | main | `binary_search, shortening=1` | 335,791 | baseline | N/A |
| readrandom | feature | `binary_search, shortening=1` (default) | 335,749 | -0.0% | 1,500 |
| readrandom | feature | `auto_search, shortening=1` (kAuto) | 366,832 | +9.2% | 1,500 |
| readrandom | feature | `interpolation_search, shortening=1` | 366,598 | +9.2% | 1,500 |
| readrandom | feature | `auto_search, shortening=2` (kAuto) | 344,631 | +2.6% | 1,030,000 |
| readrandom | feature | `interpolation_search, shortening=2` | 201,178 | -40.1% | 1,030,000 |

As seen with shortening=2, a non-uniform distribution produces a high CV, so `kAuto` does not select interpolation search for those blocks.

## Write benchmark
There is a write overhead from scanning each restart entry of a block upon Finish. In practice this is very low because currently it is only applied to index blocks.
See the CPU profile (https://fburl.com/strobelight/io5hwj9h) of `-benchmarks=fillseq,compact -compression_type=none -disable_wal=1`: only 0.08% is attributed to `ScanForUniformity`. Reviewed By: pdillinger Differential Revision: D94738890 Pulled By: joshkang97 fbshipit-source-id: 9661ac593c5fef89d49f3a8a027f1338a0c96766	2026-03-06 10:13:51 -08:00
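Commit 3ad23b2d94 describes computing the coefficient of variation (CV) of key gaps with Welford's online algorithm. The sketch below shows that technique on plain integers. It is only an illustration: the function names are hypothetical, keys are assumed to be already reduced to numeric values, and the threshold is treated as an unscaled CV, which may not match how `uniform_cv_threshold` is interpreted in RocksDB.

```python
import math


def gap_cv(key_values):
    """Coefficient of variation (stddev / mean) of consecutive key gaps.

    Uses Welford's online update, so the gaps are visited once and never
    stored. Returns None when the CV is undefined (no gaps or zero mean).
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for prev, cur in zip(key_values, key_values[1:]):
        gap = cur - prev
        count += 1
        delta = gap - mean
        mean += delta / count
        m2 += delta * (gap - mean)
    if count == 0 or mean == 0:
        return None
    variance = m2 / count  # population variance
    return math.sqrt(variance) / mean


def is_uniform(key_values, cv_threshold):
    """Hypothetical uniformity check: a negative threshold disables it."""
    if cv_threshold < 0:
        return False
    cv = gap_cv(key_values)
    return cv is not None and cv <= cv_threshold
```

Evenly spaced keys give a CV of 0, while a block with one large jump gives a CV well above 1, which is the kind of block the benchmark above shows should stay on binary search.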
Josh Kang	f25fb41da6	Add option to validate sst files in the background on DB open (#14322 ) Summary: Add `open_files_async` option for faster DB startup. When enabled, SST file opening and validation is deferred to a background thread after `DB::Open` returns, reducing startup latency for databases with many SST files. WAL recovery remains synchronous. To support this, `FindTable` is extended with a pinning mechanism that stores the cache handle directly on `FileMetaData` via a new `PinnedTableReader` class, and sets the table reader atomically so subsequent reads skip cache lookups. `FileDescriptor::table_reader` is replaced with `PinnedTableReader pinned_reader` which wraps a `std::atomic<TableReader>` with acquire/release ordering to safely handle concurrent access between the background opener and read threads. Should validations fail, the background opener sets a `kAsyncFileOpen` background error. Future read requests will look up the table reader again via the cache, and if any validations fail there it will get propagated to the user (existing behavior when `max_open_files > 0`). This feature is most useful when `max_open_files=-1`, because otherwise file opening is already capped at 16 files and DB open should be fast. ## Restrictions - This feature also is incompatible with fifo compaction because fifo compaction requires reading table properties under DB mutex. When table reader is unpinned, this may cause a DB hang. - This feature is also incompatible with `skip_stats_update_on_db_open=false` because it will result in even longer DB open ## Key changes - New `open_files_async` DB option with C, Java, and `db_bench` bindings - `BGWorkAsyncFileOpen` background worker that opens all SST files post-`DB::Open`, with shutdown awareness via `shutting_down_` flag - New `PinnedTableReader` class in `version_edit.h` — thread-safe wrapper holding `std::atomic<TableReader>` and `Cache::Handle` with proper acquire/release ordering. 
Replaces the old `FileDescriptor::table_reader` raw pointer and `FileMetaData::table_reader_handle` - Extract `LoadTableHandlersHelper` into `db/version_util.cc` — shared between `VersionBuilder::LoadTableHandlers` (for version edits during recovery) and `BGWorkAsyncFileOpen` (for base storage post-open) - `FindTable` extended with `pin_table_handle` and `out_table_reader` params — when pinning is enabled, the table reader is stored on `FileMetaData` so Get/MultiGet/Iterator skip redundant cache lookups. `FindTable` now performs the pinned-reader fast-path check internally instead of requiring callers to check `fd.table_reader` beforehand - Note: pinning is explicit (not default) because some callers create temporary `FileMetaData`s that would need to properly clean up table handles - `CompactedDBImpl` updated to use `FindTable` + pinning instead of raw `fd.table_reader` access for Get/MultiGet - New `kAsyncFileOpen` background error reason in `listener.h` and `error_handler.cc` - Add a check in ~DBImpl to ensure async file open task has not been forgotten to be scheduled in (future) subclasses of DBImpl. Certain subclasses that never use it will need to explicitly mark it. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14322 Test Plan: - `OpenFilesAsyncTest` parameterized over `num_flushes` (1, 20), `ReadType` (Get, MultiGet, Iterator), `max_open_files` (-1, 10), and `read_only` (true, false) - ConcurrentFileAccess: concurrent reads and compactions race with async opener - AfterRead: reads happen before async opener, verifying lazy open and that the opener sees already-pinned readers - BeforeRead: async opener completes first, verifying reads use pre-loaded table readers - Shutdown: DB closes before async opener starts, verifying clean cancellation with 0 file opens - Error: corrupted SST files, verifying `kAsyncFileOpen` background error is set and reads return corruption - DropColumnFamily*: CF dropped before async opener runs, verifying the opener gracefully skips dropped CFs - Added to crash test ### Benchmark To simulate a high-latency remote filesystem, I set up a virtual filesystem with dm-delay using 10ms reads, 0 ms writes. ``` # Generate a DB with many L0 files TEST_TMPDIR=/data/users/jkangs/dm-delay-test/mnt ./db_bench -benchmarks=fillseq -disable_auto_compactions=true -write_buffer_size=1000 -num=1000000 ``` ``` ./db_bench -use_existing_db=true -db=/data/users/jkangs/dm-delay-test/mnt/dbbench -benchmarks=readrandom -reads=1 -report_open_timing=true -open_files_async=true -use_direct_reads -file_opening_threads=1 -skip_stats_update_on_db_open OpenDb: 25.1419 milliseconds ``` ``` ./db_bench -use_existing_db=true -db=/data/users/jkangs/dm-delay-test/mnt/dbbench -benchmarks=readrandom -reads=1 -report_open_timing=true -open_files_async=false -use_direct_reads -file_opening_threads=1 -skip_stats_update_on_db_open OpenDb: 23109.4 milliseconds ``` ### No read regressions On main branch ``` ./db_bench -use_existing_db=true -db=/dev/shm/dbbench -benchmarks=readrandom -seed=1 -threads=8 -duration=30 readrandom : 4.827 micros/op 1657100 ops/sec 30.005 seconds 49720992 operations; 183.3 MB/s (6198999 of 6198999 
found) ``` On this branch ``` ./db_bench -use_existing_db=true -db=/dev/shm/dbbench -benchmarks=readrandom -seed=1 -threads=8 -duration=30 readrandom : 4.863 micros/op 1644808 ops/sec 30.007 seconds 49354992 operations; 182.0 MB/s (6099999 of 6099999 found) ./db_bench -use_existing_db=true -db=/dev/shm/dbbench -benchmarks=readrandom -seed=1 -threads=8 -duration=30 -open_files_async=true readrandom : 4.803 micros/op 1665392 ops/sec 30.004 seconds 49968992 operations; 184.2 MB/s (6222999 of 6222999 found) ``` Reviewed By: pdillinger, xingbowang Differential Revision: D93538033 Pulled By: joshkang97 fbshipit-source-id: 32ac70c112cd733b7c1e1c1e2e7ce6422318a5ae	2026-03-02 16:18:14 -08:00
Josh Kang	901c88e37b	Separate keys and values in data blocks (#14287) Summary: Introduce new table option with separated key-value storage in data blocks.

This PR implements a new SST block format where keys and values are stored in separate sections within data blocks, rather than interleaved. Keys are stored first, followed by all values in a contiguous section. The motivation is better cpu cache hit rate during seeks and potentially better compression. The additional storage cost is a varint per restart point, and 4 bytes additional block footer. For a data block with a restart interval of 16, it is approximately 1 bit of overhead per entry. But compression actually performs better, resulting in ~3% storage savings from benchmark.

For now I've opted to not separate kvs in non-data blocks since restart interval for those blocks is typically 1, and values are typically small and probably better inlined.

### New block layout
```
+------------------+
| Keys Section     |  <- Key entries with delta encoding
+------------------+
| Values Section   |  <- (new) Values stored contiguously
+------------------+
| Restart Array    |  <- Fixed32 offsets to restart points
+------------------+
| Values Offset    |  <- (new) 4 bytes: offset to values section
| Footer           |  <- 4 bytes: packed index_type + num_restarts
+------------------+
```

### Entry Format
At restart points
```
+--------------+------------------+----------------+-----------------+-----------+
| shared (v32) | non_shared (v32) | value_sz (v32) | value_off (v32) | key_delta |
+--------------+------------------+----------------+-----------------+-----------+
```
At non-restart points
```
+--------------+------------------+----------------+-----------+
| shared (v32) | non_shared (v32) | value_sz (v32) | key_delta |
+--------------+------------------+----------------+-----------+
```
- `value_offset` is only stored at restart points to save space
- For non-restart entries, value offset is computed as:
`prev_value_offset + prev_value_size`

### Forward Compatibility
- We make use of reserved block footer bits to mark if a block has separated kv format. Should an older version read this, it will assume a very large block restart interval and result in a corruption error.

### Key Changes
- BlockBuilder: Accumulates values in a separate buffer; value offsets are stored only at restart points (other entries derive offset from previous value's position). There is an additional memcpy cost to place the value data after the key data.
- Block iteration: Iteration now needs to know if we are at a restart point. This will rely on `cur_entry_idx_`, which was previously only used for per-kv checksum purposes. In this new format, we also need to know the block_restart_interval, which was previously also only calculated for per-kv checksums.
- Table properties: Store `data_block_restart_interval`, `index_block_restart_interval`, and `separate_key_value_in_data_block` in table properties

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14287

Test Plan:
- Extended block_test, table_test, compaction_test to contain new separated_kv param
- Added new parameter to crash test

---

## Benchmark

### Varying Value Size
Write: `db_bench --num=1000000 --separate_key_value_in_data_block=<bool> --value_size=<X> --benchmarks=fillrandom,compact`
Read: `db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq`

| Value Size | FillRandom (ops/s) | | Compact (s) | | ReadRandom (ops/s) | | ReadSeq (ops/s) | | SST Size (MB) | |
|----------:|-------------------:|-----:|----------:|-----:|-------------------:|-----:|----------------:|-----:|-------------:|-----:|
| | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv |
| 1 | 253,280 | 264,977 | 0.516 | 0.497 (-3.7%) | 596,328 | 586,625 (-1.6%) | 5,904,681 | 6,069,581 (+2.8%) | 5.1 | 4.2 (-18.6%) |
| 16 | 235,757 | 242,367 | 0.572 | 0.533 (-6.8%) | 570,751 | 572,815 (+0.4%) | 5,371,138 | 5,604,908 (+4.4%) | 14.7 | 13.4 (-8.5%) |
| 100 | 264,299 | 265,427 | 0.461 | 0.454 (-1.5%) | 323,696 | 332,790 (+2.8%) | 4,239,725 | 4,232,416 (-0.2%) | 38.8 | 37.6 (-3.2%) |
| 1,000 | 238,992 | 242,764 | 2.349 | 2.329 (-0.9%) | 244,608 | 261,403 (+6.9%) | 1,285,394 | 1,265,868 (-1.5%) | 342.1 | 342.0 (-0.0%) |

### Varying Block Restart Interval
Write: `db_bench --num=1000000 --separate_key_value_in_data_block=<bool> --block_restart_interval=<X> --benchmarks=fillrandom,compact`
Read: `db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq`

| BRI | FillRandom (ops/s) | | Compact (s) | | ReadRandom (ops/s) | | ReadSeq (ops/s) | | SST Size (MB) | |
|----:|-------------------:|-----:|----------:|-----:|-------------------:|-----:|----------------:|-----:|-------------:|-----:|
| | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv |
| 1 | 251,654 | 263,707 | 0.453 | 0.485 (+7.1%) | 334,653 | 328,708 (-1.8%) | 4,194,342 | 3,954,291 (-5.7%) | 40.6 | 42.4 (+4.5%) |
| 4 | 253,797 | 252,394 | 0.476 | 0.476 (+0.0%) | 332,719 | 341,676 (+2.7%) | 4,135,691 | 4,051,151 (-2.0%) | 39.2 | 39.3 (+0.3%) |
| 8 | 260,143 | 262,273 | 0.496 | 0.460 (-7.3%) | 330,859 | 337,567 (+2.0%) | 4,144,081 | 4,187,389 (+1.0%) | 38.9 | 38.1 (-2.1%) |
| 16 | 252,875 | 263,176 | 0.464 | 0.455 (-1.9%) | 323,783 | 335,418 (+3.6%) | 4,127,310 | 4,217,028 (+2.2%) | 38.8 | 37.6 (-3.2%) |
| 32 | 260,224 | 269,422 | 0.464 | 0.451 (-2.8%) | 304,001 | 314,989 (+3.6%) | 4,310,162 | 4,247,248 (-1.5%) | 38.8 | 37.3 (-3.8%) |

### Varying Compression
Write: `db_bench --num=1000000 --separate_key_value_in_data_block=<bool> --compression_type=<X> [--compression_level=<N>] --benchmarks=fillrandom,compact`
Read: `db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq`

| Compression | FillRandom (ops/s) | | Compact (s) | | ReadRandom (ops/s) | | ReadSeq (ops/s) | | SST Size (MB) | |
|------------|-------------------:|-----:|----------:|-----:|-------------------:|-----:|----------------:|-----:|-------------:|-----:|
| | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv |
| None | 252,494 | 260,552 | 0.413 | 0.419 (+1.5%) | 356,290 | 371,535 (+4.3%) | 4,479,261 | 4,507,133 (+0.6%) | 73.5 | 73.6 (+0.2%) |
| LZ4 | 246,010 | 256,360 | 0.477 | 0.455 (-4.6%) | 342,497 | 345,882 (+1.0%) | 4,400,570 | 4,268,102 (-3.0%) | 38.3 | 37.6 (-2.0%) |
| ZSTD (L3) | 254,748 | 258,556 | 1.067 | 1.055 (-1.1%) | 176,724 | 177,566 (+0.5%) | 2,736,841 | 2,717,739 (-0.7%) | 32.9 | 31.3 (-4.7%) |
| ZSTD (L6) | 256,459 | 259,388 | 1.556 | 1.462 (-6.0%) | 177,390 | 176,691 (-0.4%) | 2,754,336 | 2,688,682 (-2.4%) | 32.8 | 31.1 (-5.1%) |

### Varying Block Size
Write: `db_bench --num=1000000 --separate_key_value_in_data_block=<bool> --block_size=<X> --benchmarks=fillrandom,compact`
Read: `db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq`

| Block Size | FillRandom (ops/s) | | Compact (s) | | ReadRandom (ops/s) | | ReadSeq (ops/s) | | SST Size (MB) | |
|-----------:|-------------------:|-----:|----------:|-----:|-------------------:|-----:|----------------:|-----:|-------------:|-----:|
| | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv |
| 4 KB | 263,203 | 260,362 | 0.469 | 0.461 (-1.7%) | 324,178 | 332,249 (+2.5%) | 4,231,537 | 4,217,763 (-0.3%) | 38.8 | 37.6 (-3.1%) |
| 16 KB | 252,742 | 263,161 | 0.426 | 0.428 (+0.5%) | 227,805 | 222,873 (-2.2%) | 5,146,997 | 5,081,080 (-1.3%) | 38.1 | 36.7 (-3.6%) |
| 64 KB | 257,490 | 260,225 | 0.423 | 0.414 (-2.1%) | 86,807 | 91,586 (+5.5%) | 5,380,403 | 5,372,372 (-0.1%) | 36.3 | 35.0 (-3.5%) |

### Varying Key Size
Write: `db_bench --num=1000000 --separate_key_value_in_data_block=<bool> --min_key_size=10 --max_key_size=100 --benchmarks=fillrandom,compact`
Read: `db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq`

| Key Size | FillRandom (ops/s) | | Compact (s) | | ReadRandom (ops/s) | | ReadSeq (ops/s) | | SST Size (MB) | |
|---------:|-------------------:|-----:|----------:|-----:|-------------------:|-----:|----------------:|-----:|-------------:|-----:|
| | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv |
| 10–100 | 243,740 | 255,183 | 0.618 | 0.622 (+0.6%) | 284,304 | 307,569 (+8.2%) | 3,738,921 | 3,686,676 (-1.4%) | 41.5 | 41.2 (-0.8%) |

### CPU Profile Notes
- No compression: DataBlock::SeekForGet uses less cpu (13.2% vs 13.9%)
  - https://fburl.com/strobelight/6mwwebft with separated KV
  - https://fburl.com/strobelight/m9m798ka without
- ZSTD compression: rocksdb::DecompressSerializedBlock uses more CPU (45.8% vs 44.9%), while DataBlock::SeekForGet uses less cpu (5.09% vs 6.52%)
  - https://fburl.com/strobelight/3x5nw1k4 with separated KV
  - https://fburl.com/strobelight/e7809046 without

---

Reviewed By: xingbowang, pdillinger Differential Revision: D92103024 Pulled By: joshkang97 fbshipit-source-id: 47cfeb656ff3c20d34975f0b6c4c0462935a83dc	2026-02-23 12:42:05 -08:00
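The value-offset rule in commit 901c88e37b (an explicit `value_off` at restart points, `prev_value_offset + prev_value_size` elsewhere) can be sketched as follows. This models only the offset bookkeeping, not the on-disk encoding. The function names and the list-based representation are hypothetical.

```python
def build_restart_offsets(value_sizes, restart_interval):
    """Writer side: record the value offset only at restart points."""
    restart_offsets = []
    position = 0  # running offset into the contiguous values section
    for index, size in enumerate(value_sizes):
        if index % restart_interval == 0:
            restart_offsets.append(position)
        position += size
    return restart_offsets


def resolve_value_offsets(value_sizes, restart_interval, restart_offsets):
    """Reader side: recover every entry's value offset.

    Restart entries read the stored offset; all other entries derive it
    as prev_value_offset + prev_value_size.
    """
    offsets = []
    for index in range(len(value_sizes)):
        if index % restart_interval == 0:
            offset = restart_offsets[index // restart_interval]
        else:
            offset = offsets[-1] + value_sizes[index - 1]
        offsets.append(offset)
    return offsets
```

This also shows why iteration has to know both the entry index and the restart interval in the new format: without them the reader cannot tell whether an entry carries a stored offset.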
Hui Xiao	29819f37e1	Remove deprecated `ReadOptions::managed`, `ColumnFamilyOptions::snap_refresh_nanos` (#14350) Summary: Context/Summary: Remove deprecated, unused APIs and options: - ReadOptions::managed: This option is no longer used; the functionality it controlled was removed long ago. - ColumnFamilyOptions::snap_refresh_nanos: Deprecated and unused option. The corresponding C API (rocksdb_readoptions_set_managed) and Java API (ReadOptions.managed/setManaged) are also removed. All related checks in db_impl and db_impl_secondary iterators are cleaned up. Pull Request resolved: https://github.com/facebook/rocksdb/pull/14350 Test Plan: make check Reviewed By: pdillinger Differential Revision: D93812438 Pulled By: hx235 fbshipit-source-id: e4a9d21c65f83294b6d0878286ba14024f049bac	2026-02-20 14:00:41 -08:00
xingbowang	3556c22059	Remove deprecated option skip_checking_sst_file_sizes_on_db_open (#14346 ) Summary: Remove deprecated option skip_checking_sst_file_sizes_on_db_open Pull Request resolved: https://github.com/facebook/rocksdb/pull/14346 Test Plan: Unit test Reviewed By: hx235 Differential Revision: D93602683 Pulled By: xingbowang fbshipit-source-id: f576825cb107bb0aeb14f4ff29fef0df269b8728	2026-02-19 14:12:38 -08:00
Xingbo Wang	b040ab83e1	Add a new picking algorithm in fifo compaction (#14326 ) Summary: Add a new kv ratio based compaction picking algorithm in fifo compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/14326 Test Plan: Unit test Reviewed By: pdillinger Differential Revision: D93257941 Pulled By: xingbowang fbshipit-source-id: fd2d0e1356c7b54682a1197475a1bd26cb45c9d4	2026-02-15 10:04:58 -08:00
Josh Kang	9f47518676	Add interpolation search as an alternative to binary search (#14247) Summary: Interpolation search is an alternative algorithm to binary search, which performs better on uniformly distributed keys. Instead of always computing the midpoint of the left and right boundaries as binary search does, interpolation search "interpolates" the probe position based on the distance to the target.

Fortunately, we can re-use the existing block format to support interpolation search. For a given block, we compute the shared_prefix length of the first and last key. Interpolation search is usually done with numerical target values, so for a variable-length binary key, we calculate the "value" as the first 8 non-shared bytes. This also means interpolation search would only really be effective for the bytewise comparator (guarded via options validations).

#### Fallback to binary search
- If val(left_key) == val(right_key), we fall back to classic binary search (to avoid dividing by 0).
- Interpolation search is significantly more computationally expensive than binary search, so when the search distance is small, we also fall back to binary search.
- If interpolation search does not make significant progress (i.e. it fails to reduce the search space by more than half in an iteration), we can assume the data is non-uniform and fall back.

Interpolation search also performs best when there is minimal shortening, especially shortening of the last block, as it can heavily skew the distribution of the actual keys.

Note that each search algorithm is guaranteed to make progress because at each iteration the search space is guaranteed to be reduced by at least 1.

For now this change only applies to index block seeks, as data block seeks and other blocks do not have as many entries and would not require a significant number of search rounds, but it could easily be extended to include that support.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14247

Test Plan: Updated unit tests and crash test with new search option

### Benchmark
The default benchmark sets up keys in a generally uniform distribution, so it was a good way to test performance improvements.

Setup: `./db_bench -benchmarks=fillseq,compact -index_shortening_mode=1`

#### Before this change
```
./db_bench -use_existing_db=true -benchmarks=readrandom -seed=1
readrandom : 2.899 micros/op 344973 ops/sec 2.899 seconds 1000000 operations; 38.2 MB/s (1000000 of 1000000 found)
```

#### After this change
Notice how key comparison counts are the same between the two.
```
./db_bench -use_existing_db=true -benchmarks=readrandom -seed=1 -index_search_type=binary_search
readrandom : 2.881 micros/op 347128 ops/sec 2.881 seconds 1000000 operations; 38.4 MB/s (1000000 of 1000000 found)
```
```
./db_bench -use_existing_db=true -benchmarks=readrandom -seed=1 -index_search_type=interpolation_search
readrandom : 2.609 micros/op 383209 ops/sec 2.610 seconds 1000000 operations; 42.4 MB/s (1000000 of 1000000 found)
```
With a non-uniform distribution, i.e. `index_shortening_mode=2`
```
./db_bench -use_existing_db=true -benchmarks=readrandom -seed=1 -index_search_type=binary_search
readrandom : 2.958 micros/op 338075 ops/sec 2.958 seconds 1000000 operations; 37.4 MB/s (1000000 of 1000000 found)
```
```
./db_bench -use_existing_db=true -benchmarks=readrandom -seed=1 -index_search_type=interpolation_search
readrandom : 5.502 micros/op 181750 ops/sec 5.502 seconds 1000000 operations; 20.1 MB/s (1000000 of 1000000 found)
```

Reviewed By: pdillinger Differential Revision: D91063163 Pulled By: joshkang97 fbshipit-source-id: 151d6aa76f8713740b714de6e406aff40d28ccbc	2026-02-13 17:15:10 -08:00
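The search strategy described in commit 9f47518676 (interpolate on the first 8 non-shared bytes, with three fallbacks to binary search) can be sketched as below. This is an illustration over a sorted list of `bytes` keys, not the RocksDB block iterator. The `SMALL_RANGE` cutoff and all function names are assumptions chosen for the example.

```python
import bisect

SMALL_RANGE = 8  # hypothetical cutoff below which binary search takes over


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the prefix shared by two byte strings."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def key_value(key: bytes, shared: int) -> int:
    """Numeric value of a key: its first 8 bytes after the shared prefix."""
    chunk = key[shared:shared + 8]
    return int.from_bytes(chunk.ljust(8, b"\x00"), "big")


def interpolation_lower_bound(keys, target):
    """Index of the first key >= target in a bytewise-sorted list."""
    if not keys:
        return 0
    shared = common_prefix_len(keys[0], keys[-1])
    target_v = key_value(target, shared)
    lo, hi = 0, len(keys)  # keys[:lo] < target <= keys[hi:]
    while hi - lo > SMALL_RANGE:  # fallback: small search distance
        left_v = key_value(keys[lo], shared)
        right_v = key_value(keys[hi - 1], shared)
        if left_v == right_v:  # fallback: avoid dividing by zero
            break
        if target_v <= left_v:
            mid = lo
        elif target_v >= right_v:
            mid = hi - 1
        else:
            mid = lo + (target_v - left_v) * (hi - 1 - lo) // (right_v - left_v)
        size_before = hi - lo
        # Decisions use real key comparisons, so a poor guess costs time
        # but never correctness; each step shrinks the range by at least 1.
        if keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
        if hi - lo > size_before // 2:  # fallback: insufficient progress
            break
    return bisect.bisect_left(keys, target, lo, hi)
```

Because the final step is always a binary search over the remaining range, the function returns the same index as a plain `bisect.bisect_left` on any sorted input. Interpolation changes only how many probes are needed to get there.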
Xingbo Wang	656b734a5f	Support abort background compaction jobs. (#14227 ) Summary: This adds a new public API to allow applications to abort all running compactions and prevent new ones from starting. Unlike DisableManualCompaction() which only pauses manual compactions and waits for them to finish naturally, AbortAllCompactions() actively signals running compactions (both automatic and manual) to terminate early and waits for them to complete before returning. The abort signal is checked periodically during compaction (every 100 keys), so ongoing compactions abort quickly. Any output files from aborted compactions are automatically cleaned up to prevent partial results from being installed. This is useful for scenarios where applications need to quickly stop all compaction activity, such as during graceful shutdown or when performing maintenance operations. Pull Request resolved: https://github.com/facebook/rocksdb/pull/14227 Test Plan: - Unit tests in db_compaction_abort_test.cc cover various abort scenarios including: abort before/during compaction, abort with multiple subcompactions, nested abort/resume calls, abort with CompactFiles API, abort across multiple column families, and timing guarantees - Updated compaction_job_test.cc to include the new parameter Reviewed By: anand1976 Differential Revision: D91480994 Pulled By: xingbowang fbshipit-source-id: 36837971d8a540cd34d3ec28a78bc94b582625b0	2026-01-30 05:53:04 -08:00
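Commit 656b734a5f says the abort signal is checked every 100 keys and that output from aborted compactions is cleaned up. A minimal sketch of that polling pattern is below, using a `threading.Event` as the signal. The function name, the in-memory "output", and the return shape are hypothetical and stand in for the real compaction job.

```python
import threading

ABORT_CHECK_INTERVAL = 100  # per the commit message: checked every 100 keys


def run_compaction(keys, abort_event, check_interval=ABORT_CHECK_INTERVAL):
    """Process keys, polling the abort signal once per check_interval keys.

    Returns (completed, output). On abort the partial output is discarded
    so that nothing half-written is handed back to the caller.
    """
    output = []
    for index, key in enumerate(keys):
        if index % check_interval == 0 and abort_event.is_set():
            output.clear()  # clean up partial results
            return False, output
        output.append(key)  # stand-in for writing the key to an output file
    return True, output
```

Polling at a fixed key interval keeps the per-key cost negligible while bounding how long a running job can continue after the signal is raised.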
Maciej Szeszko	6b5ccbbec6	Remove inline values support (#14270 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/14270 Legacy BlobDB's inline values feature (storing small values directly in the LSM tree via `min_blob_size` threshold) is unused in production - all deployments use `min_blob_size = 0`. This removes the functionality entirely. Changes: - Remove `min_blob_size` from `BlobDBOptions` - Remove `IsInlined()` check from compaction filter (dead code path) - Remove inline-related statistics (`BLOB_DB_WRITE_INLINED*`) - Remove `InlineSmallValues` test - Update stale comments referencing inlined data Reviewed By: xingbowang Differential Revision: D91088985 fbshipit-source-id: ec67848ece1a7dc071ca8e8a17faebb435394733	2026-01-28 16:08:10 -08:00
Anand Ananthabhotla	b89d290c20	Add MultiScan statistics (#14248 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/14248 ### Overview This diff introduces the addition of multi-scan statistics to RocksDB, enhancing the database's ability to monitor and analyze performance during multi-scan operations. ### Key Changes #### Implemented Multi-Scan Statistics The following statistics were implemented to provide deeper insights into multi-scan operations: - MULTISCAN_PREPARE_MICROS: Measures the time (in microseconds) spent preparing for multi-scan operations. - MULTISCAN_BLOCKS_PER_PREPARE: Tracks the number of blocks processed per multi-scan prepare operation. - Wasted Prefetch Blocks Count: Counts the number of prefetched blocks that were not used (i.e., wasted) if the iterator is abandoned before accessing them. - MULTISCAN_TOTAL_BLOCKS_SCANNED: Tracks the total number of blocks scanned during all multi-scan operations. - MULTISCAN_TOTAL_KEYS_SCANNED: Measures the total number of keys scanned across all multi-scan operations. - MULTISCAN_TOTAL_MICROS: Captures the total time (in microseconds) spent in multi-scan operations. - MULTISCAN_PREFETCHED_BLOCKS: Counts the number of blocks that were prefetched during multi-scan operations. - MULTISCAN_USED_PREFETCH_BLOCKS: Tracks the number of prefetched blocks that were actually used during multi-scan operations. ### Impact This diff provides more fine-grained statistics for multi-scan operations, allowing developers and users to better understand and optimize the performance of their RocksDB instances. Reviewed By: krhancoc Differential Revision: D91053297 fbshipit-source-id: 7158741b9f026c0b5ce8ba1264dbd137e7fe985d	2026-01-21 23:23:38 -08:00
Peter Dillinger	a6af317476	Use format_version=7 by default, fix perf bug (#14239 ) Summary: Since it's been > 6 months and we have production uses, migrate to fv=7 by default. One unit test needed an update for the change to table properties with fv=7. On making this change, PresetCompressionDictTest tests detected extra memory usage by decompressing LZ4 with dictionary compression. This turned out to be a bug in `std::find` usage that led to using the ZSTD-optimized decompressor (with digested dictionary usage) in cases where it is not needed. I've fixed the bug and improved the unit tests that found the bug. Pull Request resolved: https://github.com/facebook/rocksdb/pull/14239 Test Plan: existing tests, including format compatible CI job (updated, and run locally with SHORT_TEST=1) Reviewed By: hx235 Differential Revision: D90728697 Pulled By: pdillinger fbshipit-source-id: 8f1a0e9ca59a88c18eaa4cdfdea00309175ce30a	2026-01-21 09:28:06 -08:00
Hui Xiao	8edb99f904	Statistics for successfully resumed compaction output bytes (#14054 ) Summary: Context/Summary: as titled Pull Request resolved: https://github.com/facebook/rocksdb/pull/14054 Test Plan: new UT, manually checking Reviewed By: jaykorean Differential Revision: D84828431 Pulled By: hx235 fbshipit-source-id: 56e1a9159f7597a10d6c549657d8b22788aa0599	2025-10-17 11:38:20 -07:00
Xingbo Wang	742741b175	Support Super Block Alignment (#13909) Summary: Pad block based table based on super block alignment Pull Request resolved: https://github.com/facebook/rocksdb/pull/13909

Test Plan: Unit Test

No impact on perf observed due to change in the inner loop of flush.

upstream/main branch: 202.15 MB/s
```
for i in `seq 1 10`; do ./db_bench --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 >> /tmp/x1 2>&1; done
grep fillseq /tmp/x1 | grep -Po "\d+\.\d+ MB/s" | grep -Po "\d+\.\d+" | awk '{sum+=$1} END {print sum/NR}'
```
After the change without super block alignment: 203.44 MB/s
```
for i in `seq 1 10`; do ./db_bench --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 >> /tmp/x1 2>&1; done
```
After the change with super block alignment: 204.47 MB/s
```
for i in `seq 1 10`; do ./db_bench --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 --super_block_alignment_size=131072 --super_block_alignment_max_padding_size=4096 >> /tmp/x1 2>&1; done
```

Reviewed By: pdillinger Differential Revision: D83068913 Pulled By: xingbowang fbshipit-source-id: eecd65088ab3e9dbc7902aab8c2580f1bc8575df	2025-10-01 18:20:35 -07:00
Peter Dillinger	f5fb597bac	Resolve missing/inconsistent tickers in Java (#14012 ) Summary: Pretty self-explanatory from the changes, including re-arranging the "COOL" entries for easier tracking of which values are used. I'm not touching the TICKER_ENUM_MAX issue because IIRC we've gotten in trouble in the past for changing any Java ticker values. Pull Request resolved: https://github.com/facebook/rocksdb/pull/14012 Test Plan: CI, sufficient prompts to get AI to discover the known issues relayed by hx235, to help ensure we found any other outstanding issues. Reviewed By: hx235 Differential Revision: D83497503 Pulled By: pdillinger fbshipit-source-id: ec0bd7e28188e0430fb03fc5bd79c2ed7b28f3ad	2025-09-29 14:21:00 -07:00
Peter Dillinger	1c8a012727	Add kCool Temperature (#14000 ) Summary: also requested by internal user, like kIce in https://github.com/facebook/rocksdb/issues/13927 Pull Request resolved: https://github.com/facebook/rocksdb/pull/14000 Test Plan: unit tests updated Reviewed By: archang19 Differential Revision: D83200479 Pulled By: pdillinger fbshipit-source-id: 31f2842d87bcad40227aeee9687ff5772393689c	2025-09-25 11:27:00 -07:00
Peter Dillinger	6127a42f98	Use/endorse (Auto)HyperClockCache by default over LRUCache (#13964 ) Summary: After seeing more people hit issues with thrashing small LRUCache shards and AutoHCC running fully in production for a while on a very large service, here I make these updates: * In the public API, mark the case of `estimated_entry_charge = 0` (which is how you select AutoHCC) as production-ready and generally preferred. That means devoting a lot less space to how to tune FixedHCC (`estimated_entry_charge > 0`) because it is not generally recommended anymore even though in theory it is the fastest (conditional on a fragile configuration). * In the public API, add more detail about potential problems with LRUCache and explicitly endorse HCC. * When a default block cache is created, use AutoHCC instead of LRUCache. It's still a 32MB cache but that's just one cache shard for AutoHCC so the risk of issues with small cache shards is dramatically reduced. And a single AutoHCC shard is still essentially wait-free. * Improve the handling of the hypothetical scenario of a failed anonymous mmap. This is hardly a concern for 64-bit Linux and likely most other OSes. It would in theory be possible to fall back on LRUCache in that case but the code structure makes that annoying/challenging. Instead we crash with an appropriate message. * Cleaned up some includes * Fixed some previously unreported leaks (better assertions on HCC perhaps, some subtle behavior changes) * Added a new mode to cache_bench (detailed below) * Avoid a particularly costly sanity check in `~AutoHyperClockTable()` even in debug builds so that unit testing, etc., isn't bogged down, except keep it in ASAN build. 
Planned follow-up: * Update HCC implementation to use my new "bit field atomics" API introduced in https://github.com/facebook/rocksdb/issues/13910 to make it easier to read and maintain Possible follow-up: * Re-engineer table cache to use AutoHCC also, instead of LRUCache and a single mutex to ensure no duplication across threads. (a) Pad table cache key to 128 bits for AutoHCC. (b) Stripe/shard the no-duplication mutex. (HCC's consistency model is too weak for concurrent threads to use its API to agree on a winner, even if entries could be inserted in an "open in progress" state.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/13964 Test Plan: existing tests. ClockCacheTest.ClockEvictionEffortCapTest caught a regression during my development, and the crash test has a history of finding subtle HCC bugs. ## Performance Although we've validated AutoHCC performance under high load, etc., before we haven't really considered whether there will be unacceptable overheads for small DBs and CFs, e.g. in unit tests. For this, I have added a new mode to cache_bench: with the -stress_cache_instances=n parameter, it will create and destroy n empty cache instances several times. In the debug build, this found that a particular check in `~AutoHyperClockTable()` was extremely costly for short-lived caches (fixed). 
Beyond that, we can answer the question of whether it is feasible for a single process to host 1000 DBs each with 1000 CFs with default block cache instances, after moving LRUCache -> AutoHCC, for example: ``` /usr/bin/time ./cache_bench -stress_cache_instances=1000000 -cache_type=auto_hyper_clock_cache -cache_size=33554432 ``` Release build: Average 9.8 us per 32MB LRUCache creation, 2.9 us per destruction, 24.6GB max RSS (~25KB each) -> Average 4.3 us per 32MB AutoHCC creation, 4.9 us per destruction, 4.8GB max RSS (~5KB each) Debug build: Average 10.9 us per 32MB LRUCache creation, 3.5 us per destruction, 28.7GB max RSS (~29KB each) -> Average 4.5 us per 32MB AutoHCC creation, 4.9 us per destruction, 4.7GB max RSS (~5KB each) Despite the anonymous mmaps, it's apparently more efficient for default/small/empty structures. This is likely due to the dramatically low number of cache shards at this size. If we switch to `-stress_cache_instances=10000 -cache_size=1073741824`: Release build: Average 10.6 us per 1GB LRUCache, 2.8 us per destruction, 2.3 GB max RSS (~230KB each) -> Average 130 us per 1GB AutoHCC creation, 153 us per destruction, 1.5 GB max RSS (~150KB each) Debug build: Average 11.2 us per 1GB LRUCache, 3.6 us per destruction, 2.4 GB max RSS (~240KB each) -> Average 130 us per 1GB AutoHCC creation, 150 us per destruction, 1.6 GB max RSS (~160KB each) Here it's clear that we are paying a price in time for setting up all those mmaps for the good number of cache shards and potential table growth, even though the RSS is well under control. However, I am not concerned about this at all, as it's unlikely to slow down anything notably such as unit tests. Before and after full testsuite runs confirm: 3327.73user 5188.71system 3:38.88elapsed -> 3312.07user 5704.77system 3:41.61elapsed There is increased kernel time but acceptable. 
With ASAN+UBSAN: 11618.70user 15671.30system 5:54.68elapsed -> 12595.81user 16159.67system 6:32.77elapsed Acceptable given that our ASAN+UBSAN builds are not the slowest in CI Reviewed By: hx235 Differential Revision: D82661067 Pulled By: pdillinger fbshipit-source-id: ab25c766ca70f2b8664849c2a838b9e1b4e72d3b	2025-09-18 13:27:51 -07:00
Peter Dillinger	67af5bdc38	Add Temperature::kIce (#13927 ) Summary: ... and associated statistics, etc. Someone needs it, so here it is. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13927 Test Plan: Updated / extended / added some unit tests Reviewed By: cbi42 Differential Revision: D81981469 Pulled By: pdillinger fbshipit-source-id: 52558c08741890b781310906acbc18d9eb479363	2025-09-10 10:29:49 -07:00
Alan Paxton	805ac7c887	Update compression libraries to latest releases (#13609 ) Summary: See `Makefile` for actual changes: * ZLIB remains the same * BZIP2 remains the same * SNAPPY is a minor update * LZ4 is a significant update with multithreaded/multicore compression https://github.com/lz4/lz4/releases/tag/v1.10.0 * ZSTD is a significant update RocksDB is called out as benefiting in particular from the performance improvements herein https://github.com/facebook/zstd/releases/tag/v1.5.7 Pull Request resolved: https://github.com/facebook/rocksdb/pull/13609 Reviewed By: archang19 Differential Revision: D77877295 Pulled By: mszeszko-meta fbshipit-source-id: bf9a257e8f68dec3d02743b339aa2df65df4ab2c	2025-07-07 13:28:49 -07:00
Miroslav Kovar	fc2cf7ead2	Expose optimized `TransactionBaseImpl::MultiGet` through JNI (#13589 ) Summary: Addresses https://github.com/facebook/rocksdb/issues/13587. This PR exposes the optimized implementation of batched reads through a `Transaction` object to Java clients. The latency improvement of transactional multiget on production workload achieved by switching the implementation is roughly: ``` quantile=0.2: 21% quantile=0.5: 28% quantile=0.8: 46% quantile=1.0: 239% ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/13589 Reviewed By: jaykorean Differential Revision: D74660169 Pulled By: cbi42 fbshipit-source-id: d01780173e0500c96e5e431ff6645008cbf6e8b5	2025-05-14 13:19:06 -07:00
Hui Xiao	29c6610617	Add compaction explicit prefetch stats (#13520 ) Summary: Context/Summary: This PR adds new stats to measure compaction readahead size for RocksDB-managed prefetching (not FS prefetching). They can be used to verify that compaction readahead is doing what's configured. This PR also excludes compaction readahead stats from the user scan readahead stats measured in existing stats, so there is a cleaner separation between the two. Bonus: this PR also includes some typo fixes about "io activities". Pull Request resolved: https://github.com/facebook/rocksdb/pull/13520 Test Plan: Modified existing test to verify stats Reviewed By: archang19 Differential Revision: D72892850 Pulled By: hx235 fbshipit-source-id: 1a73182061baa044c9c9193a2b0fd967ffe75c4a	2025-04-14 12:08:38 -07:00
anand76	f7764cb6b2	Remove fail_if_options_file_error DB option (#13504 ) Summary: The fail_if_options_file_error option has been deprecated for more than a year. This PR removes it from the code base. https://github.com/facebook/rocksdb/issues/12056 fixed a bug that was blocking the option from removal. https://github.com/facebook/rocksdb/issues/12249 marked it as deprecated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13504 Reviewed By: hx235 Differential Revision: D72194063 Pulled By: anand1976 fbshipit-source-id: 0aa7cf56e60c48c7e7654743d3e64922ce65225d	2025-04-09 14:18:33 -07:00
Yu Zhang	5e10baa412	Delete max_write_buffer_number_to_maintain (#13491 ) Summary: As titled. This option has been marked deprecated since the introduction of a better option, `max_write_buffer_size_to_maintain`, and has acted as its fallback since RocksDB 6.5.0. The internal user we know these options were created for also migrated to `max_write_buffer_size_to_maintain` a long time ago. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13491 Test Plan: existing tests Reviewed By: cbi42 Differential Revision: D71984601 Pulled By: jowlyzhang fbshipit-source-id: c264d4809e311f60fdbad817ebfade256db549b6	2025-04-07 21:44:36 -07:00
Hui Xiao	48eb646787	Mark MaxMemCompactionLevel() deprecated (#13503 ) Summary: Context/Summary: MaxMemCompactionLevel(), developed 10 years ago, simply returns the level a memtable is flushed to, which has historically been L0, and there is no plan to change that in the future. It is also not used in tests or internally. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13503 Test Plan: CI + fake release Reviewed By: cbi42 Differential Revision: D72066092 Pulled By: hx235 fbshipit-source-id: 5ff5b16a6664ef3efabd3a6fbd8a2d0529b62460	2025-03-31 19:29:40 -07:00
Changyu Bi	325dcdf2e5	Deprecate `ReadOptions::ignore_range_deletions` and `experimental::PromoteL0()` (#13500 ) Summary: based on the option comment, `ignore_range_deletions` was added due to the overhead of range deletions in the read path when a DB does not use DeleteRange(). The current implementation should not have a noticeable performance difference in this case. `experimental::PromoteL0()` can be replaced by doing a manual compaction with proper CompactRangeOptions. There are some internal uses of this option and API, so we will remove them later after the usages are updated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13500 Test Plan: comment change only. Performance: benchmark the performance difference with `ignore_range_deletions` and without (borrowed flag `universal_incremental` for this purpose), ran at the same time on the same machine. - random point get: - ignore_range_deletions=false: 343078 ops/sec - ignore_range_deletions=true: 340219 ops/sec (0.8% slower) ``` (for I in $(seq 1 1); do TEST_TMPDIR=/dev/shm/t1 /data/users/changyubi/vscode-root/rocksdb/db_bench --benchmarks=fillseq,waitforcompaction,readrandom --write_buffer_size=67108864 --writes=1000000 --num=2000000 --reads=1000000 --seed=1723056275 --universal_incremental=false 2>&1 | grep "readrandom"; done;) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; ``` - sequential scan: - ignore_range_deletions=false: 5378104 ops/sec - ignore_range_deletions=true: 5393809 ops/sec (0.3% faster) ``` (for I in $(seq 1 10); do TEST_TMPDIR=/dev/shm/t1 /data/users/changyubi/vscode-root/rocksdb/db_bench --benchmarks=fillseq,waitforcompaction,readseq[-X10] --write_buffer_size=67108864 --writes=1000000 --num=2000000 --universal_incremental=true --seed=1723056275 2>1 | grep "\[AVG 10 runs\]"; done;) | awk '{ t += $6; c++; print; } END { printf "%.0f\n", 1.0 * t / c }'; ``` The difference in ops/sec for the two benchmarks is likely noise. 
Reviewed By: hx235 Differential Revision: D72069223 Pulled By: cbi42 fbshipit-source-id: ad82a051aa4682790d2178cd4fb2d1467397fbb5	2025-03-28 14:49:28 -07:00
Maciej Szeszko	591f5b1266	Remove deprecated DB::DeleteFile API references (#13322 ) Summary: Cleanup post https://github.com/facebook/rocksdb/pull/13284. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13322 Test Plan: 1. We did not find any evidence of breakage in internal pre-release integration pipeline runs after renaming the deprecated API in `9.10`. 2. _To the extent possible_, we manually validated partner use cases of file deletion and confirmed deprecated API is no longer in use. Reviewed By: jaykorean Differential Revision: D68476852 Pulled By: mszeszko-meta fbshipit-source-id: fbe1f873e16ae7c60d7706a3c44ecc695ab86a4b	2025-01-24 22:28:41 -08:00
Roman Puchkovskiy	b98c21b281	Support flush reasons above 12 in Java integration (#13246 ) Summary: FlushReason enum in C++ has members up to 15, but in Java, the mirroring FlushReason only supports reason codes up to 12. This causes exceptions when adding a flush listener. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13246 Reviewed By: pdillinger Differential Revision: D68241620 Pulled By: jaykorean fbshipit-source-id: 1e2856dad28dff0cbb1772f5a8ea03cc1e224088	2025-01-17 10:04:10 -08:00
Maciej Szeszko	6e97a813dc	Deprecate db delete file public API (#13284 ) Summary: We added a removal warning for public `DB::DeleteFile` API ~4 years ago in https://github.com/facebook/rocksdb/pull/7337. This API seems to sit at wrong layer of abstraction, where instead of exposing a clear interface to delete specific range of keys, callers rely on their own discovery / interpretation of where their data / log possibly resides 'as-of-now'. For example, in case of data, the physical location of the keys might very well change after user obtained their mapping from key(s) to specific SST file. This will lead to `InvalidArgument` response, which if repeated, would put a user in a race condition spinning wheel - the behavior that's inefficient, fairly indeterministic and therefore one that should be strongly discouraged. We're employing a graceful approach to prefixing the public API with `DEPRECATED_` first for better discoverability and ease of self service for product teams should they still use that legacy API. If everything goes smoothly, we intend to remove all the deprecated API references in the next release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13284 Reviewed By: pdillinger Differential Revision: D67981502 Pulled By: mszeszko-meta fbshipit-source-id: adc7fe5cf4e2180bcfd21878b8f78f3fb6ead355	2025-01-10 19:07:33 -08:00
Maciej Szeszko	541761eaaa	Deprecate random access max buffer size references - take #2 (#13288 ) Summary: This time properly marking db option as `kDeprecated` in `db_options.cc`. Original PR: https://github.com/facebook/rocksdb/pull/13278. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13288 Reviewed By: pdillinger Differential Revision: D68024379 Pulled By: mszeszko-meta fbshipit-source-id: 8e1f08b048ccf5971d899811edaf0b0ef16581ef	2025-01-10 15:32:38 -08:00
Maciej Szeszko	6c6defe3b8	Revert "Deprecate random access max buffer size references (#13278 )" (#13285 ) Summary: This reverts commit `d4bd67fb09`. There are a total of 4 call sites as referenced [here](https://www.internalfb.com/code/search?q=repo%3Aall%20-filepath%3Afbcode%2Frocksdb%2F%7Cthird-party%2Frocksdb%2F%7Clibrocksdb%2F%7Cfb_mysql%2F.*%2Frocksdb%7Cinternal_repo_rocksdb%20regex%3Aon%20random_access_max_buffer_size&lang_filter=cpp). None of them has a strict reliance on this setting, which should make follow-up cleanup fairly easy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13285 Reviewed By: pdillinger Differential Revision: D67984325 Pulled By: mszeszko-meta fbshipit-source-id: 12c0b6281c2af6c32261fdf6092856b0566d389e	2025-01-09 13:20:30 -08:00
Maciej Szeszko	d4bd67fb09	Deprecate random access max buffer size references (#13278 ) Summary: This option has been officially deprecated since 5.4.0. We're removing all the references to `random_access_max_buffer_size`, related rules and all the client wrappers. As a part of this refactoring, we're also getting rid of the `options-1-false` condition (and consequently its corresponding `multiple-conds-all-false` rule), as the condition would not make much sense anymore without the bounding RA max buffer size limit. Motivated by the ongoing tech debt reduction effort. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13278 Test Plan: Validated that internal users do not rely on this long-gone option in their workflows. Reviewed By: jaykorean Differential Revision: D67909674 Pulled By: mszeszko-meta fbshipit-source-id: 8f4b59a4a92b0b32b8b91b71ac318aafc17f1da2	2025-01-08 09:59:18 -08:00
Alan Paxton	9b1d0c02e9	Add [set]DailyOffpeakTimeUTC option to Java API (#13148 ) Summary: Reflect RocksDB DailyOffpeakTimeUTC option in Java API. As is standard for options, there are a number of different places where this option needs to be added: it is an option, a DB option, and it is mutable (can be changed while running). The new option is a string value. This requires an extension to the internal MutableDBOptions parse code, which received the entire options string from C++ and parses it on the Java side. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13148 Reviewed By: cbi42 Differential Revision: D67870402 Pulled By: jaykorean fbshipit-source-id: 975af69773206da936d230cbadb5f69a002d92a3	2025-01-07 09:39:01 -08:00
Levi Tamasi	54ace7f340	Change the semantics of blob_garbage_collection_force_threshold to provide better control over space amp (#13022 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13022 Currently, `blob_garbage_collection_force_threshold` applies to the oldest batch of blob files, which is typically only a small subset of the blob files currently eligible for garbage collection. This can result in a form of head-of-line blocking: no GC-triggered compactions will be scheduled if the oldest batch does not currently exceed the threshold, even if a lot of higher-numbered blob files do. This can in turn lead to high space amplification that exceeds the soft bound implicit in the force threshold (e.g. 50% would suggest a space amp of <2 and 75% would imply a space amp of <4). The patch changes the semantics of this configuration threshold to apply to the entire set of blob files that are eligible for garbage collection based on `blob_garbage_collection_age_cutoff`. This provides more intuitive semantics for the option and can provide a better write amp/space amp trade-off. (Note that GC-triggered compactions still pick the same SST files as before, so triggered GC still targets the oldest blob files.) Reviewed By: jowlyzhang Differential Revision: D62977860 fbshipit-source-id: a999f31fe9cdda313de513f0e7a6fc707424d4a3	2024-09-19 15:47:13 -07:00
Peter Dillinger	98c33cb8e3	Steps toward making IDENTITY file obsolete (#13019 ) Summary: * Set write_dbid_to_manifest=true by default * Add new option write_identity_file (default true) that allows us to opt-in to future behavior without identity file * Refactor related DB open code to minimize code duplication _Recommend hiding whitespace changes for review_ Intended follow-up: add support to ldb for reading and even replacing the DB identity in the manifest. Could be a variant of `update_manifest` command or based on it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13019 Test Plan: unit tests and stress test updated for new functionality Reviewed By: anand1976 Differential Revision: D62898229 Pulled By: pdillinger fbshipit-source-id: c08b25cf790610b034e51a9de0dc78b921abbcf0	2024-09-19 14:05:21 -07:00
anand76	c21fe1a47f	Add ticker stats for read corruption retries (#12923 ) Summary: Add a couple of ticker stats for corruption retry count and successful retries. This PR also eliminates an extra read attempt when there's a checksum mismatch in a block read from the prefetch buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12923 Test Plan: Update existing tests Reviewed By: jowlyzhang Differential Revision: D61024687 Pulled By: anand1976 fbshipit-source-id: 3a08403580ab244000e0d480b7ee0f5a03d76b06	2024-08-12 15:32:07 -07:00
Zixuan Tan	a97a1f3247	Fix incorrect refillPeriodMicros unit in the document (#12832 ) Summary: The default value for `refillPeriodMicros` is `100 * 1000`, which means 100ms (or 100,000us). The document comments say 100,000ms (equivalent to 100 seconds), which is incorrect and misleading. This PR fixes this typo. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12832 Reviewed By: cbi42 Differential Revision: D59492336 Pulled By: ajkr fbshipit-source-id: c2f55a8b996fe078a1510fcbebaea92ec0075929	2024-07-08 18:08:53 -07:00
Peter Dillinger	45c105104b	Set optimize_filters_for_memory by default (#12377 ) Summary: This feature has been around for a couple of years and users haven't reported any problems with it. Not quite related: fixed a technical ODR violation in public header for info_log_level in case DEBUG build status changes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12377 Test Plan: unit tests updated, already in crash test. Some unit tests are expecting specific behaviors of optimize_filters_for_memory=false and we now need to bake that in. Reviewed By: jowlyzhang Differential Revision: D54129517 Pulled By: pdillinger fbshipit-source-id: a64b614840eadd18b892624187b3e122bab6719c	2024-04-30 08:33:31 -07:00
Andrew Kryczka	b4c520cadc	change default `CompactionOptions::compression` while deprecating it (#12587 ) Summary: I had a TODO to complete `CompactionOptions`'s compression API but never did it: `d610e14f93/db/compaction/compaction_picker.cc (L371-L373)` Without solving that TODO, the API remains incomplete and unsafe. Now, however, I don't think it's worthwhile to complete it. I think we should instead delete the API entirely. This PR deprecates it in preparation for deletion in a future major release. The `ColumnFamilyOptions` settings for compression should be good enough for `CompactFiles()` since they are apparently good enough for every other compaction, including `CompactRange()`. In the meantime, I also changed the default `CompressionType`. Having callers of `CompactFiles()` use Snappy compression by default does not make sense when the default could be to simply use the same compression type that is used for every other compaction. As a bonus, this change makes the default `CompressionType` consistent with the `CompressionOptions` that will be used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12587 Reviewed By: hx235 Differential Revision: D56619273 Pulled By: ajkr fbshipit-source-id: 1477de49f14b06c72d6f0045616a8ce91d97e66e	2024-04-26 13:03:21 -07:00
Radek Hubner	a8035ebc0b	Fix exception on RocksDB.getColumnFamilyMetaData() (#12474 ) Summary: https://github.com/facebook/rocksdb/issues/12466 reported a bug when `RocksDB.getColumnFamilyMetaData()` is called on an existing database (with files stored on disk). As neilramaswamy mentioned, this was caused by https://github.com/facebook/rocksdb/issues/11770, where the signature of the `SstFileMetaData` constructor was changed but the JNI code wasn't updated. This PR fixes the JNI code and also properly populates `fileChecksum` on `SstFileMetaData`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12474 Reviewed By: jowlyzhang Differential Revision: D55811808 Pulled By: ajkr fbshipit-source-id: 2ab156f41eaf4a4f30c49e6df421b61e8451230e	2024-04-05 13:55:18 -07:00
Alexey Vinogradov	c4df598b8e	Implement `PerfContext#toString` for the Java API (#12473 ) Summary: I've implemented `PerfContext#toString` for the Java API. See: https://groups.google.com/g/rocksdb/c/qbY_gNhbyAg Pull Request resolved: https://github.com/facebook/rocksdb/pull/12473 Reviewed By: jowlyzhang Differential Revision: D55660871 Pulled By: cbi42 fbshipit-source-id: f0528fba31ac06e16495e4f49b0bafe0dbc1bc61	2024-04-03 14:33:31 -07:00
Radek Hubner	db9eb10b5b	Enable all Java test via CMake (#12446 ) Summary: This PR follows the work done in https://github.com/facebook/rocksdb/issues/11756 and enables all Java tests to be run via CMake/CTest. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12446 Reviewed By: jowlyzhang Differential Revision: D55661635 Pulled By: cbi42 fbshipit-source-id: 3ea49a121a3ba72089632ff43ee7fe4419b08a96	2024-04-03 11:03:11 -07:00
anand76	4868c10b44	Retry block reads on checksum mismatch (#12427 ) Summary: On file systems that support storage level data checksum and reconstruction, retry SST block reads for point lookups, scans, and flush and compaction if there's a checksum mismatch on the initial read. A file system can indicate its support by setting the `FSSupportedOps::kVerifyAndReconstructRead` bit in `SupportedOps`. Tests: Add new unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/12427 Reviewed By: ajkr Differential Revision: D55025941 Pulled By: anand1976 fbshipit-source-id: dbd990cb75e03f756c8a66d42956f645c0b6d55e	2024-03-18 16:16:05 -07:00
Alan Paxton	d9a441113e	JNI get_helper code sharing / multiGet() use efficient batch C++ support (#12344 ) Summary: Implement RAII-based helpers for JNIGet() and multiGet() Replace JNI C++ helpers `rocksdb_get_helper, rocksdb_get_helper_direct`, `multi_get_helper`, `multi_get_helper_direct`, `multi_get_helper_release_keys`, `txn_get_helper`, and `txn_multi_get_helper`. The model is to entirely do away with a single helper, instead a number of utility methods allow each separate JNI `Get()` and `MultiGet()` method to organise their parameters efficiently, then call the underlying C++ `db->Get()`, `db->MultiGet()`, `txn->Get()`, or `txn->MultiGet()` method itself, and use further utilities to retrieve results. Roughly speaking: * get keys into C++ form * Call C++ Get() * get results and status into Java form We achieve a useful performance gain as part of this work; by using the updated C++ multiGet we immediately pick up its performance gains (batch improvements to multiGet C++ were previously implemented, but not until now used by Java/JNI). 
multiGetBB already uses the batched C++ multiGet(), and all other benchmarks show consistent improvement after the changes: ## Before: ``` Benchmark (columnFamilyTestType) (keyCount) (keySize) (multiGetSize) (valueSize) Mode Cnt Score Error Units MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 256 thrpt 25 5315.459 ± 20.465 ops/s MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 1024 thrpt 25 5673.115 ± 78.299 ops/s MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 4096 thrpt 25 2616.860 ± 46.994 ops/s MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 16384 thrpt 25 1700.058 ± 24.034 ops/s MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 65536 thrpt 25 791.171 ± 13.955 ops/s MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 256 thrpt 25 6129.929 ± 94.200 ops/s MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 1024 thrpt 25 7012.405 ± 97.886 ops/s MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 4096 thrpt 25 2799.014 ± 39.352 ops/s MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 16384 thrpt 25 1417.205 ± 22.272 ops/s MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 65536 thrpt 25 655.594 ± 13.050 ops/s MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 256 thrpt 25 6147.247 ± 82.711 ops/s MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 1024 thrpt 25 7004.213 ± 79.251 ops/s MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 4096 thrpt 25 2715.154 ± 110.017 ops/s MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 16384 thrpt 25 1408.070 ± 31.714 ops/s MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 65536 thrpt 25 623.829 ± 57.374 ops/s MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 256 thrpt 25 6119.243 ± 
116.313 ops/s MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 1024 thrpt 25 6931.873 ± 128.094 ops/s MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 4096 thrpt 25 2678.253 ± 39.113 ops/s MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 16384 thrpt 25 1337.384 ± 19.500 ops/s MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 65536 thrpt 25 625.596 ± 14.525 ops/s ``` ## After: ``` Benchmark (columnFamilyTestType) (keyCount) (keySize) (multiGetSize) (valueSize) Mode Cnt Score Error Units MultiGetBenchmarks.multiGetBB200 no_column_family 10000 1024 100 256 thrpt 25 5191.074 ± 78.250 ops/s MultiGetBenchmarks.multiGetBB200 no_column_family 10000 1024 100 1024 thrpt 25 5378.692 ± 260.682 ops/s MultiGetBenchmarks.multiGetBB200 no_column_family 10000 1024 100 4096 thrpt 25 2590.183 ± 34.844 ops/s MultiGetBenchmarks.multiGetBB200 no_column_family 10000 1024 100 16384 thrpt 25 1634.793 ± 34.022 ops/s MultiGetBenchmarks.multiGetBB200 no_column_family 10000 1024 100 65536 thrpt 25 786.455 ± 8.462 ops/s MultiGetBenchmarks.multiGetBB200 1_column_family 10000 1024 100 256 thrpt 25 5285.055 ± 11.676 ops/s MultiGetBenchmarks.multiGetBB200 1_column_family 10000 1024 100 1024 thrpt 25 5586.758 ± 213.008 ops/s MultiGetBenchmarks.multiGetBB200 1_column_family 10000 1024 100 4096 thrpt 25 2527.172 ± 17.106 ops/s MultiGetBenchmarks.multiGetBB200 1_column_family 10000 1024 100 16384 thrpt 25 1819.547 ± 12.958 ops/s MultiGetBenchmarks.multiGetBB200 1_column_family 10000 1024 100 65536 thrpt 25 803.861 ± 9.963 ops/s MultiGetBenchmarks.multiGetBB200 20_column_families 10000 1024 100 256 thrpt 25 5253.793 ± 28.020 ops/s MultiGetBenchmarks.multiGetBB200 20_column_families 10000 1024 100 1024 thrpt 25 5705.591 ± 20.556 ops/s MultiGetBenchmarks.multiGetBB200 20_column_families 10000 1024 100 4096 thrpt 25 2523.377 ± 15.415 ops/s MultiGetBenchmarks.multiGetBB200 20_column_families 10000 
1024 100 16384 thrpt 25 1815.344 ± 11.309 ops/s MultiGetBenchmarks.multiGetBB200 20_column_families 10000 1024 100 65536 thrpt 25 820.792 ± 3.192 ops/s MultiGetBenchmarks.multiGetBB200 100_column_families 10000 1024 100 256 thrpt 25 5262.184 ± 20.477 ops/s MultiGetBenchmarks.multiGetBB200 100_column_families 10000 1024 100 1024 thrpt 25 5706.959 ± 23.123 ops/s MultiGetBenchmarks.multiGetBB200 100_column_families 10000 1024 100 4096 thrpt 25 2520.362 ± 9.170 ops/s MultiGetBenchmarks.multiGetBB200 100_column_families 10000 1024 100 16384 thrpt 25 1789.185 ± 14.239 ops/s MultiGetBenchmarks.multiGetBB200 100_column_families 10000 1024 100 65536 thrpt 25 818.401 ± 12.132 ops/s MultiGetBenchmarks.multiGetList10 no_column_family 10000 1024 100 256 thrpt 25 6978.310 ± 14.084 ops/s MultiGetBenchmarks.multiGetList10 no_column_family 10000 1024 100 1024 thrpt 25 7664.242 ± 22.304 ops/s MultiGetBenchmarks.multiGetList10 no_column_family 10000 1024 100 4096 thrpt 25 2881.778 ± 81.054 ops/s MultiGetBenchmarks.multiGetList10 no_column_family 10000 1024 100 16384 thrpt 25 1599.826 ± 7.190 ops/s MultiGetBenchmarks.multiGetList10 no_column_family 10000 1024 100 65536 thrpt 25 737.520 ± 6.809 ops/s MultiGetBenchmarks.multiGetList10 1_column_family 10000 1024 100 256 thrpt 25 6974.376 ± 10.716 ops/s MultiGetBenchmarks.multiGetList10 1_column_family 10000 1024 100 1024 thrpt 25 7637.440 ± 45.877 ops/s MultiGetBenchmarks.multiGetList10 1_column_family 10000 1024 100 4096 thrpt 25 2820.472 ± 42.231 ops/s MultiGetBenchmarks.multiGetList10 1_column_family 10000 1024 100 16384 thrpt 25 1716.663 ± 8.527 ops/s MultiGetBenchmarks.multiGetList10 1_column_family 10000 1024 100 65536 thrpt 25 755.848 ± 7.514 ops/s MultiGetBenchmarks.multiGetList10 20_column_families 10000 1024 100 256 thrpt 25 6943.651 ± 20.040 ops/s MultiGetBenchmarks.multiGetList10 20_column_families 10000 1024 100 1024 thrpt 25 7679.415 ± 9.114 ops/s MultiGetBenchmarks.multiGetList10 20_column_families 10000 1024 100 4096 
thrpt 25 2844.564 ± 13.388 ops/s
MultiGetBenchmarks.multiGetList10 20_column_families 10000 1024 100 16384 thrpt 25 1729.545 ± 5.983 ops/s
MultiGetBenchmarks.multiGetList10 20_column_families 10000 1024 100 65536 thrpt 25 783.218 ± 1.530 ops/s
MultiGetBenchmarks.multiGetList10 100_column_families 10000 1024 100 256 thrpt 25 6944.276 ± 29.995 ops/s
MultiGetBenchmarks.multiGetList10 100_column_families 10000 1024 100 1024 thrpt 25 7670.301 ± 8.986 ops/s
MultiGetBenchmarks.multiGetList10 100_column_families 10000 1024 100 4096 thrpt 25 2839.828 ± 12.421 ops/s
MultiGetBenchmarks.multiGetList10 100_column_families 10000 1024 100 16384 thrpt 25 1730.005 ± 9.209 ops/s
MultiGetBenchmarks.multiGetList10 100_column_families 10000 1024 100 65536 thrpt 25 787.096 ± 1.977 ops/s
MultiGetBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 256 thrpt 25 6896.944 ± 21.530 ops/s
MultiGetBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 1024 thrpt 25 7622.407 ± 12.824 ops/s
MultiGetBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 4096 thrpt 25 2927.538 ± 19.792 ops/s
MultiGetBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 16384 thrpt 25 1598.041 ± 4.312 ops/s
MultiGetBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 65536 thrpt 25 744.564 ± 9.236 ops/s
MultiGetBenchmarks.multiGetListExplicitCF20 1_column_family 10000 1024 100 256 thrpt 25 6853.760 ± 78.041 ops/s
MultiGetBenchmarks.multiGetListExplicitCF20 1_column_family 10000 1024 100 1024 thrpt 25 7360.917 ± 355.365 ops/s
MultiGetBenchmarks.multiGetListExplicitCF20 1_column_family 10000 1024 100 4096 thrpt 25 2848.774 ± 13.409 ops/s
MultiGetBenchmarks.multiGetListExplicitCF20 1_column_family 10000 1024 100 16384 thrpt 25 1727.688 ± 3.329 ops/s
MultiGetBenchmarks.multiGetListExplicitCF20 1_column_family 10000 1024 100 65536 thrpt 25 776.088 ± 7.517 ops/s
MultiGetBenchmarks.multiGetListExplicitCF20 20_column_families 10000 1024 100 256 thrpt 25 6910.339 ± 14.366 ops/s
MultiGetBenchmarks.multiGetListExplicitCF20 20_column_families 10000 1024 100 1024 thrpt 25 7633.660 ± 10.830 ops/s
MultiGetBenchmarks.multiGetListExplicitCF20 20_column_families 10000 1024 100 4096 thrpt 25 2787.799 ± 81.775 ops/s
MultiGetBenchmarks.multiGetListExplicitCF20 20_column_families 10000 1024 100 16384 thrpt 25 1726.517 ± 6.830 ops/s
MultiGetBenchmarks.multiGetListExplicitCF20 20_column_families 10000 1024 100 65536 thrpt 25 787.597 ± 3.362 ops/s
MultiGetBenchmarks.multiGetListExplicitCF20 100_column_families 10000 1024 100 256 thrpt 25 6922.445 ± 10.493 ops/s
MultiGetBenchmarks.multiGetListExplicitCF20 100_column_families 10000 1024 100 1024 thrpt 25 7604.710 ± 48.043 ops/s
MultiGetBenchmarks.multiGetListExplicitCF20 100_column_families 10000 1024 100 4096 thrpt 25 2848.788 ± 15.783 ops/s
MultiGetBenchmarks.multiGetListExplicitCF20 100_column_families 10000 1024 100 16384 thrpt 25 1730.837 ± 6.497 ops/s
MultiGetBenchmarks.multiGetListExplicitCF20 100_column_families 10000 1024 100 65536 thrpt 25 794.557 ± 1.869 ops/s
MultiGetBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 256 thrpt 25 6918.716 ± 15.766 ops/s
MultiGetBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 1024 thrpt 25 7626.692 ± 9.394 ops/s
MultiGetBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 4096 thrpt 25 2871.382 ± 72.155 ops/s
MultiGetBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 16384 thrpt 25 1598.786 ± 4.819 ops/s
MultiGetBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 65536 thrpt 25 748.469 ± 7.234 ops/s
MultiGetBenchmarks.multiGetListRandomCF30 1_column_family 10000 1024 100 256 thrpt 25 6922.666 ± 17.131 ops/s
MultiGetBenchmarks.multiGetListRandomCF30 1_column_family 10000 1024 100 1024 thrpt 25 7623.890 ± 8.805 ops/s
MultiGetBenchmarks.multiGetListRandomCF30 1_column_family 10000 1024 100 4096 thrpt 25 2850.698 ± 18.004 ops/s
MultiGetBenchmarks.multiGetListRandomCF30 1_column_family 10000 1024 100 16384 thrpt 25 1727.623 ± 4.868 ops/s
MultiGetBenchmarks.multiGetListRandomCF30 1_column_family 10000 1024 100 65536 thrpt 25 774.534 ± 10.025 ops/s
MultiGetBenchmarks.multiGetListRandomCF30 20_column_families 10000 1024 100 256 thrpt 25 5486.251 ± 13.582 ops/s
MultiGetBenchmarks.multiGetListRandomCF30 20_column_families 10000 1024 100 1024 thrpt 25 4920.656 ± 44.557 ops/s
MultiGetBenchmarks.multiGetListRandomCF30 20_column_families 10000 1024 100 4096 thrpt 25 3922.913 ± 25.686 ops/s
MultiGetBenchmarks.multiGetListRandomCF30 20_column_families 10000 1024 100 16384 thrpt 25 2873.106 ± 4.336 ops/s
MultiGetBenchmarks.multiGetListRandomCF30 20_column_families 10000 1024 100 65536 thrpt 25 802.404 ± 8.967 ops/s
MultiGetBenchmarks.multiGetListRandomCF30 100_column_families 10000 1024 100 256 thrpt 25 4817.996 ± 18.042 ops/s
MultiGetBenchmarks.multiGetListRandomCF30 100_column_families 10000 1024 100 1024 thrpt 25 4243.922 ± 13.929 ops/s
MultiGetBenchmarks.multiGetListRandomCF30 100_column_families 10000 1024 100 4096 thrpt 25 3175.998 ± 7.773 ops/s
MultiGetBenchmarks.multiGetListRandomCF30 100_column_families 10000 1024 100 16384 thrpt 25 2321.990 ± 12.501 ops/s
MultiGetBenchmarks.multiGetListRandomCF30 100_column_families 10000 1024 100 65536 thrpt 25 1753.028 ± 7.130 ops/s
```
Closes https://github.com/facebook/rocksdb/issues/11518
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12344
Reviewed By: cbi42
Differential Revision: D54809714
Pulled By: pdillinger
fbshipit-source-id: bee3b949720abac073bce043b59ce976a11e99eb	2024-03-12 12:42:08 -07:00
Alan Paxton	c4d37da826	Java API - Fix handling of CF handles in DB subclasses (#12417) Summary: The most general `open()` method for each of RocksDB, TtlDB, OptimisticTransactionDB and TransactionDB should: ensure the default CF is supplied in the list of descriptors; cache the default CF handle; and store open CF handles for automatic close on DB close. The `close()` method in each of these DB subclasses should `close()` all the owned CF handles. I can’t find a cleaner way to build a generalised open/close that does this for all DB subclasses, so it exists as cut and paste with variations in the 4 different DB subclasses. Added some slightly paranoid testing that CF handles explicitly referred to as default in a list of CF handles in the general open methods, and the simple open that doesn’t supply a CF, end up reading and writing to the same CF. Prompted by the fact that this code is a bit opaque; the first returned handle is the DB. As part of this, fix the bug where the Java side of `OptimisticTransactionDB` was not setting up the default column family; this was visible when setting up an iterator. Add a test to validate that the iterator is OK. A single Java reference to the default column family was not being created in the `OptimisticTransactionDB` RocksDB subclass; it should be created in all subclasses. The same problem had previously been fixed for TtlDB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12417 Reviewed By: ajkr Differential Revision: D54807643 Pulled By: pdillinger fbshipit-source-id: 66f34e56a822a009a8f2018d401cf8940d91aa35	2024-03-12 10:33:27 -07:00
Yu Zhang	2940acac00	Persist table options use_delta_encoding in options file (#11987) Summary: This option is used for encoding keys in block-based table files. It has defaulted to true since its introduction. Users may not notice that this option is not persisted in the options file unless they explicitly set it to false. Users who expect `Iterator::GetProperty("rocksdb.iterator.is-key-pinned")` to return 1 when setting `ReadOptions.pin_data = true` should already have noticed that loading the options file does not work, and will have worked around it by always explicitly setting this option to false when opening the DB. This change does not affect those users, except that they can now remove their workaround. Users who do not rely on key pinning at all, and so did not notice that the option was not persisted, should not see any visible behavior change either. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11987 Reviewed By: hx235 Differential Revision: D54093238 Pulled By: jowlyzhang fbshipit-source-id: 256a3348c44cf91349034d1f6e242c437b32b9a5	2024-02-23 14:13:28 -08:00

467 commits