Commit graph

1005 commits

Author SHA1 Message Date
Peter Dillinger
1c8a012727 Add kCool Temperature (#14000)
Summary:
also requested by internal user, like kIce in https://github.com/facebook/rocksdb/issues/13927

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14000

Test Plan: unit tests updated

Reviewed By: archang19

Differential Revision: D83200479

Pulled By: pdillinger

fbshipit-source-id: 31f2842d87bcad40227aeee9687ff5772393689c
2025-09-25 11:27:00 -07:00
Peter Dillinger
6127a42f98 Use/endorse (Auto)HyperClockCache by default over LRUCache (#13964)
Summary:
After seeing more people hit issues with thrashing small LRUCache shards and AutoHCC running fully in production for a while on a very large service, here I make these updates:

* In the public API, mark the case of `estimated_entry_charge = 0` (which is how you select AutoHCC) as production-ready and generally preferred. That means devoting a lot less space to how to tune FixedHCC (`estimated_entry_charge > 0`) because it is not generally recommended anymore even though in theory it is the fastest (conditional on a fragile configuration).
* In the public API, add more detail about potential problems with LRUCache and explicitly endorse HCC.
* When a default block cache is created, use AutoHCC instead of LRUCache. It's still a 32MB cache but that's just one cache shard for AutoHCC so the risk of issues with small cache shards is dramatically reduced. And a single AutoHCC shard is still essentially wait-free.
* Improve the handling of the hypothetical scenario of a failed anonymous mmap. This is hardly a concern for 64-bit Linux and likely most other OSes. It would in theory be possible to fall back on LRUCache in that case but the code structure makes that annoying/challenging. Instead we crash with an appropriate message.
* Cleaned up some includes
* Fixed some previously unreported leaks (better assertions on HCC perhaps, some subtle behavior changes)
* Added a new mode to cache_bench (detailed below)
* Avoid a particularly costly sanity check in `~AutoHyperClockTable()` even in debug builds so that unit testing, etc., isn't bogged down, except keep it in ASAN build.

Planned follow-up:
* Update HCC implementation to use my new "bit field atomics" API introduced in https://github.com/facebook/rocksdb/issues/13910 to make it easier to read and maintain

Possible follow-up:
* Re-engineer table cache to use AutoHCC also, instead of LRUCache and a single mutex to ensure no duplication across threads. (a) Pad table cache key to 128 bits for AutoHCC. (b) Stripe/shard the no-duplication mutex. (HCC's consistency model is too weak for concurrent threads to use its API to agree on a winner, even if entries could be inserted in an "open in progress" state.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13964

Test Plan:
existing tests. ClockCacheTest.ClockEvictionEffortCapTest caught a regression during my development, and the crash test has a history of finding subtle HCC bugs.

## Performance

Although we've validated AutoHCC performance under high load, etc., before we haven't really considered whether there will be unacceptable overheads for small DBs and CFs, e.g. in unit tests. For this, I have added a new mode to cache_bench: with the -stress_cache_instances=n parameter, it will create and destroy n empty cache instances several times. In the debug build, this found that a particular check in `~AutoHyperClockTable()` was extremely costly for short-lived caches (fixed). Beyond that, we can answer the question of whether it is feasible for a single process to host 1000 DBs each with 1000 CFs with default block cache instances, after moving LRUCache -> AutoHCC, for example:

```
/usr/bin/time ./cache_bench -stress_
cache_instances=1000000 -cache_type=auto_hyper_clock_cache -cache_size=33554432
```

Release build:
Average 9.8 us per 32MB LRUCache creation, 2.9 us per destruction, 24.6GB max RSS (~25KB each)
->
Average 4.3 us per  32MB AutoHCC creation, 4.9 us per destruction, 4.8GB max RSS (~5KB each)

Debug build:
Average 10.9 us per 32MB LRUCache creation, 3.5 us per destruction, 28.7GB max RSS (~29KB each)
->
Average 4.5 us per 32MB AutoHCC creation, 4.9 us per destruction, 4.7GB max RSS (~5KB each)

Despite the anonymous mmaps, it's apparently more efficient for default/small/empty structures. This is likely due to the dramatically low number of cache shards at this size. If we switch to `-stress_cache_instances=10000  -cache_size=1073741824`:

Release build:
Average 10.6 us per 1GB LRUCache, 2.8 us per destruction, 2.3 GB max RSS (~230KB each)
->
Average 130 us per 1GB AutoHCC creation, 153 us per destruction, 1.5 GB max RSS (~150KB each)

Debug build:
Average 11.2 us per 1GB LRUCache, 3.6 us per destruction, 2.4 GB max RSS (~240KB each)
->
Average 130 us per 1GB AutoHCC creation, 150 us per destruction, 1.6 GB max RSS (~160KB each)

Here it's clear that we are paying a price in time for setting up all those mmaps for the good number of cache shards and potential table growth, even though the RSS is well under control. However, I am not concerned about this at all, as it's unlikely to slow down anything notably such as unit tests. Before and after full testsuite runs confirm:

3327.73user 5188.71system 3:38.88elapsed -> 3312.07user 5704.77system 3:41.61elapsed

There is increased kernel time but acceptable. With ASAN+UBSAN:

11618.70user 15671.30system 5:54.68elapsed -> 12595.81user 16159.67system 6:32.77elapsed

Acceptable given that our ASAN+UBSAN builds are not the slowest in CI

Reviewed By: hx235

Differential Revision: D82661067

Pulled By: pdillinger

fbshipit-source-id: ab25c766ca70f2b8664849c2a838b9e1b4e72d3b
2025-09-18 13:27:51 -07:00
Peter Dillinger
67af5bdc38 Add Temperature::kIce (#13927)
Summary:
... and associated statistics, etc. Someone needs it, so here it is.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13927

Test Plan: Updated / extended / added some unit tests

Reviewed By: cbi42

Differential Revision: D81981469

Pulled By: pdillinger

fbshipit-source-id: 52558c08741890b781310906acbc18d9eb479363
2025-09-10 10:29:49 -07:00
Alan Paxton
805ac7c887 Update compression libraries to latest releases (#13609)
Summary:
See `Makefile` for actual changes:

* ZLIB remains the same
* BZIP2 remains the same
* SNAPPY is a minor update
* LZ4 is a significant update with multithreaded/multicore compression https://github.com/lz4/lz4/releases/tag/v1.10.0
* ZSTD is a significant update RocksDB is called out as benefiting in particular from the performance improvements herein https://github.com/facebook/zstd/releases/tag/v1.5.7

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13609

Reviewed By: archang19

Differential Revision: D77877295

Pulled By: mszeszko-meta

fbshipit-source-id: bf9a257e8f68dec3d02743b339aa2df65df4ab2c
2025-07-07 13:28:49 -07:00
Miroslav Kovar
fc2cf7ead2 Expose optimized TransactionBaseImpl::MultiGet through JNI (#13589)
Summary:
Addresses https://github.com/facebook/rocksdb/issues/13587.

This PR exposes the optimized implementation of batched reads through a `Transaction` object to Java clients.

The latency improvement of transactional multiget on production workload achieved by switching the implementation is roughly:
```
quantile=0.2: 21%
quantile=0.5: 28%
quantile=0.8: 46%
quantile=1.0: 239%
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13589

Reviewed By: jaykorean

Differential Revision: D74660169

Pulled By: cbi42

fbshipit-source-id: d01780173e0500c96e5e431ff6645008cbf6e8b5
2025-05-14 13:19:06 -07:00
Michael C Huang
13d865f6f1 Add trivial copy support when FIFO compaction reason is kChangeTemperature (#13562)
Summary:
Prior to this PR, for FIFO kChangeTemperature compaction was done by iterating and reading thru the input sst and generate the output sst. This was wasteful since for FIFO we could apply the "trivial" move by copying the input sst to the out sst without need decompress/compress and reading thru the input sst content at all. This PR added "allow_trivial_copy_when_change_temperature" to the CompactionOptionsFIFO.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13562

Reviewed By: cbi42

Differential Revision: D73295404

Pulled By: mikechuangmeta

fbshipit-source-id: 02241c7389797730ecd4a3b636837cb5f912b424
2025-05-08 15:51:37 -07:00
Hui Xiao
29c6610617 Add compaction explicit prefetch stats (#13520)
Summary:
**Context/Summary:**
This PR adds new stats to measure compaction readahead size for rocksdb managed prefetching (not FS prefetching). It can be used to verify compaction read-ahead is doing what's configured. This PR also excludes compaction readahead stats from user scan readahead stats measured in existing stats so there is a cleaner separating between these two.

Bonus: this PR also included some typo fixing about "io activities"

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13520

Test Plan: Modified existing test to verify stats

Reviewed By: archang19

Differential Revision: D72892850

Pulled By: hx235

fbshipit-source-id: 1a73182061baa044c9c9193a2b0fd967ffe75c4a
2025-04-14 12:08:38 -07:00
anand76
f7764cb6b2 Remove fail_if_options_file_error DB option (#13504)
Summary:
The fail_if_options_file_error has been deprecated for more than a year. This PR removes it from the code base. https://github.com/facebook/rocksdb/issues/12056 fixed a bug that was blocking the option from removal. https://github.com/facebook/rocksdb/issues/12249 marked it as deprecated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13504

Reviewed By: hx235

Differential Revision: D72194063

Pulled By: anand1976

fbshipit-source-id: 0aa7cf56e60c48c7e7654743d3e64922ce65225d
2025-04-09 14:18:33 -07:00
Yu Zhang
5e10baa412 Delete max_write_buffer_number_to_maintain (#13491)
Summary:
As titled. This option has been marked deprecated since introduction of a better option `max_write_buffer_size_to_maintain` and acts as its fallback since RocksDB 6.5.0 The internal user we know these options were created for migrated to `max_write_buffer_size_to_maintain` for a long time too.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13491

Test Plan: existing tests

Reviewed By: cbi42

Differential Revision: D71984601

Pulled By: jowlyzhang

fbshipit-source-id: c264d4809e311f60fdbad817ebfade256db549b6
2025-04-07 21:44:36 -07:00
Hui Xiao
5735ff4e03 Update window build cmake to download newer snappy version and Java cmake_minimum_required (#13514)
Summary:
**Context/Summary:**

- This is an attempt to fix our [build-window-vs2022 failure](https://github.com/facebook/rocksdb/actions/runs/14215681026/job/39831770554?fbclid=IwZXh0bgNhZW0CMTAAAR2BQLjp8kC1u1yyvN1_S5qwmrHEZOfzxJdcbj2vq7mvwwq83n1cbkmiBCA_aem_ygYxQA5EUmxh2y4EjMlTfg) below. snappy-1.1.8's cmake_minimum_required  being less than 3.5 seems to trigger the complaint. Hopefully downloading the 1.2.2 which is the [first version starting to use higher cmake_minimum_required version](https://github.com/google/snappy/releases/tag/1.2.2) solves the failure.

```
    Directory: D:\a\rocksdb\rocksdb\thirdparty\snappy-1.1.8

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d----            4/2/2025  9:02 AM                build
CMake Error at CMakeLists.txt:29 (cmake_minimum_required):
  Compatibility with CMake < 3.5 has been removed from CMake.

  Update the VERSION argument <min> value.  Or, use the <min>...<max> syntax
  to tell CMake that the project requires at least <min> but has been updated
  to work with policies introduced by <max> or earlier.

  Or, add -DCMAKE_POLICY_VERSION_MINIMUM=3.5 to try configuring anyway.
 ```
- The downloaded snappy do not include the content under nested repos Google Test and Google Benchmark. But snappy cmake by default will attempt to build them. Since we don't change snappy, we don't need building such development suit. This PR also disabled snappy cmake's attempt to build them.

- By running above changes, the same build [complained](https://github.com/facebook/rocksdb/actions/runs/14228883966/job/39874927730?pr=13514) about java cmakelists requiring too low cmake_minimum_required as well.  So this PR also upgraded its cmake_minimum_required to be 3.11 aligning with its warning message
```
if(${CMAKE_VERSION} VERSION_LESS "3.11.4")
    message("Please consider switching to CMake 3.11.4 or newer")
endif()
```

**Test plan**
Monitor build-window-vs2022 for this PR

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13514

Reviewed By: pdillinger

Differential Revision: D72333581

Pulled By: hx235

fbshipit-source-id: 1a9096738d39c8b1d270fe17fbd78c1ea4c4c45e
2025-04-02 15:46:02 -07:00
Hui Xiao
48eb646787 Mark MaxMemCompactionLevel() deprecated (#13503)
Summary:
**Context/Summary:**

MaxMemCompactionLevel() developed 10 years ago simply returns the level a memtable flushed to, which has historically been L0 and have no plan to change to something different for future. It is also not used in test or internally.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13503

Test Plan: CI + fake release

Reviewed By: cbi42

Differential Revision: D72066092

Pulled By: hx235

fbshipit-source-id: 5ff5b16a6664ef3efabd3a6fbd8a2d0529b62460
2025-03-31 19:29:40 -07:00
Changyu Bi
325dcdf2e5 Deprecate ReadOptions::ignore_range_deletions and experimental::PromoteL0() (#13500)
Summary:
based on the option comment, `ignore_range_deletions` was added due to the overhead of range deletions in read path when a DB does not use DeleteRange(). The current implementation should not have a noticeable performance difference in this case.

`experimental::PromoteL0()` can be replaced by doing a manual compaction with proper CompactRangeOptions.

There are some internal use of these option and API so we will remove them later after the usages are updated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13500

Test Plan:
comment change only.
Performance: benchmark the performance difference with `ignore_range_deletions` and without (borrowed flag `universal_incremental` for this purpose), ran at the same time on the same machine.

- random point get:
    - ignore_range_deletions=false: 343078 ops/sec
    - ignore_range_deletions=true: 340219 ops/sec (0.8% slower)
```
(for I in $(seq 1 1); do TEST_TMPDIR=/dev/shm/t1 /data/users/changyubi/vscode-root/rocksdb/db_bench --benchmarks=fillseq,waitforcompaction,readrandom --write_buffer_size=67108864 --writes=1000000 --num=2000000 --reads=1000000  --seed=1723056275 --universal_incremental=false 2>&1 | grep "readrandom"; done;) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }';
```

- sequential scan:
  - ignore_range_deletions=false: 5378104 ops/sec
  - ignore_range_deletions=true: 5393809 ops/sec (0.3% faster)
```
(for I in $(seq 1 10); do TEST_TMPDIR=/dev/shm/t1 /data/users/changyubi/vscode-root/rocksdb/db_bench --benchmarks=fillseq,waitforcompaction,readseq[-X10] --write_buffer_size=67108864 --writes=1000000 --num=2000000  --universal_incremental=true --seed=1723056275 2>1 | grep "\[AVG 10 runs\]"; done;) | awk '{ t += $6; c++; print; } END { printf "%.0f\n", 1.0 * t / c }';
```

The difference in ops/sec for the two benchmarks is likely noise.

Reviewed By: hx235

Differential Revision: D72069223

Pulled By: cbi42

fbshipit-source-id: ad82a051aa4682790d2178cd4fb2d1467397fbb5
2025-03-28 14:49:28 -07:00
Peter Dillinger
82794e0a4f Deprecate RangePtr, favor new RangeOpt and OptSlice (#13481)
Summary:
The new API in https://github.com/facebook/rocksdb/issues/13453 is awkward and precarious because of using RangePtr, which encodes optional keys using raw pointers to Slice. We could use `std::optional<Slice>` instead but that is unsatisfyingly a larger object with an inefficient size (typically 17 bytes).

Here I introduce a custom optional Slice type, `OptSlice`, that is the same size as a Slice, and use it in a number of places to clean up code and make some public APIs easier to work with. This includes

* `atomic_replace_range` (not yet released, OK to change)
* `GetAllKeyVersions()` which gets a behavior change because of its unusual handling of empty keys.
* `DeleteFilesInRanges()`
* TODO in follow-up: `CompactRange()`

Most of the diff is associated updates and refactorings. Also

* Move some relevant things out of db.h to keep it as tidy as possible.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13481

Test Plan: tests updated

Reviewed By: hx235

Differential Revision: D71747774

Pulled By: pdillinger

fbshipit-source-id: b4c8519608d119b8bceca9bb0fd778608f62a141
2025-03-24 17:08:17 -07:00
Maciej Szeszko
591f5b1266 Remove deprecated DB::DeleteFile API references (#13322)
Summary:
Cleanup post https://github.com/facebook/rocksdb/pull/13284.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13322

Test Plan:
1. We did not find any evidence of breakage in internal pre-release integration pipeline runs after renaming the deprecated API in `9.10`.
2. _To the extent possible_, we manually validated partner use cases of file deletion and confirmed deprecated API is no longer in use.

Reviewed By: jaykorean

Differential Revision: D68476852

Pulled By: mszeszko-meta

fbshipit-source-id: fbe1f873e16ae7c60d7706a3c44ecc695ab86a4b
2025-01-24 22:28:41 -08:00
Roman Puchkovskiy
b98c21b281 Support flush reasons above 12 in Java integration (#13246)
Summary:
FlushReason enum in C++ has members up to 15, but in Java, the mirroring FlushReason only supports reason codes up to 12. This causes exceptions when adding a flush listener.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13246

Reviewed By: pdillinger

Differential Revision: D68241620

Pulled By: jaykorean

fbshipit-source-id: 1e2856dad28dff0cbb1772f5a8ea03cc1e224088
2025-01-17 10:04:10 -08:00
Maciej Szeszko
6e97a813dc Deprecate db delete file public API (#13284)
Summary:
We added a removal warning for public `DB::DeleteFile` API ~4 years ago in https://github.com/facebook/rocksdb/pull/7337. This API seems to sit at wrong layer of abstraction, where instead of exposing a clear interface to delete specific range of keys, callers rely on their own discovery / interpretation of where their data / log possibly resides 'as-of-now'. For example, in case of data, the physical location of the keys might very well change after user obtained their mapping from key(s) to specific SST file. This will lead to `InvalidArgument` response, which if repeated, would put a user in a race condition spinning wheel - the behavior that's inefficient, fairly indeterministic and therefore one that should be strongly discouraged. We're employing a graceful approach to prefixing the public API with `DEPRECATED_` first for better discoverability and ease of self service for product teams should they still use that legacy API. If everything goes smoothly, we intend to remove all the deprecated API references in the next release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13284

Reviewed By: pdillinger

Differential Revision: D67981502

Pulled By: mszeszko-meta

fbshipit-source-id: adc7fe5cf4e2180bcfd21878b8f78f3fb6ead355
2025-01-10 19:07:33 -08:00
Maciej Szeszko
541761eaaa Deprecate random access max buffer size references - take #2 (#13288)
Summary:
This time properly marking db option as `kDeprecated` in `db_options.cc`. Original PR: https://github.com/facebook/rocksdb/pull/13278.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13288

Reviewed By: pdillinger

Differential Revision: D68024379

Pulled By: mszeszko-meta

fbshipit-source-id: 8e1f08b048ccf5971d899811edaf0b0ef16581ef
2025-01-10 15:32:38 -08:00
Maciej Szeszko
6c6defe3b8 Revert "Deprecate random access max buffer size references (#13278)" (#13285)
Summary:
This reverts commit d4bd67fb09. There are total of 4 call sites as referenced [here](https://www.internalfb.com/code/search?q=repo%3Aall%20-filepath%3Afbcode%2Frocksdb%2F%7Cthird-party%2Frocksdb%2F%7Clibrocksdb%2F%7Cfb_mysql%2F.*%2Frocksdb%7Cinternal_repo_rocksdb%20regex%3Aon%20random_access_max_buffer_size&lang_filter=cpp). None of them have a strict reliance of this setting, which should make followup cleanup fairly easy.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13285

Reviewed By: pdillinger

Differential Revision: D67984325

Pulled By: mszeszko-meta

fbshipit-source-id: 12c0b6281c2af6c32261fdf6092856b0566d389e
2025-01-09 13:20:30 -08:00
Maciej Szeszko
d4bd67fb09 Deprecate random access max buffer size references (#13278)
Summary:
This option has been officially deprecated in 5.4.0. We're removing all the references to `random_access_max_buffer_size`, related rules and all the clients wrappers. As a part of this refactoring, we're also getting rid of the `options-1-false` (and consequently its' `multiple-conds-all-false` corresponding rule), as condition would not make much sense anymore without the bounding RA max buffer size limit. Motivated by ongoing tech debt reduction effort.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13278

Test Plan: Validated that internal users do not rely on this long-gone option in their workflows.

Reviewed By: jaykorean

Differential Revision: D67909674

Pulled By: mszeszko-meta

fbshipit-source-id: 8f4b59a4a92b0b32b8b91b71ac318aafc17f1da2
2025-01-08 09:59:18 -08:00
Alan Paxton
9b1d0c02e9 Add [set]DailyOffpeakTimeUTC option to Java API (#13148)
Summary:
Reflect RocksDB DailyOffpeakTimeUTC  option in Java API. As is standard for options, there are a number of different places where this option needs to be added: it is an option, a DB option, and it is mutable (can be changed while running).

The new option is a string value. This requires an extension to the internal MutableDBOptions parse code, which received the entire options string from C++ and parses it on the Java side.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13148

Reviewed By: cbi42

Differential Revision: D67870402

Pulled By: jaykorean

fbshipit-source-id: 975af69773206da936d230cbadb5f69a002d92a3
2025-01-07 09:39:01 -08:00
Changyu Bi
d5345a8ff7 Introduce a transaction option to skip memtable write during commit (#13144)
Summary:
add a new transaction option `TransactionOptions::commit_bypass_memtable` that will ingest the transaction into a DB as an immutable memtables, skipping memtable writes during transaction commit. This helps to reduce the blocking time of committing a large transaction, which is mostly spent on memtable writes. The ingestion is done by creating WBWIMemTable using transaction's underlying WBWI, and ingest it as the latest immutable memtable. The feature will be experimental.

Major changes are:
1. write path change to ingest the transaction, mostly in WriteImpl() and IngestWBWI() in db_impl_write.cc.
2. WBWI changes to track some per CF stats like entry count and overwritten single deletion count, and track which keys have overwritten single deletions (see 3.). Per CF stat is used to precompute the number of entries in each WBWIMemTable.
3. WBWIMemTable Iterator changes to emit overwritten single deletions. The motivation is explained in the comment above class WBWIMemTable definition. The rest of the changes in WBWIMemTable are moving the iterator definition around.

Some intended follow ups:
1. support for merge operations
2. stats/logging around this option
3. tests improvement, including stress test support for the more comprehensive no_batched_op_stress.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13144

Test Plan:
* added new unit tests
* enabled in multi_ops_txns_stress test
* Benchmark: applying the change in 8222c0cafc4c6eb3a0d05807f7014b44998acb7a, I tested txn size of 10k and check perf context for write_memtable_time, write_wal_time and key_lock_wait_time(repurposed for transaction unlock time). Though the benchmark result number can be flaky, this shows memtable write time improved a lot (more than 100 times). The benchmark also shows that the remaining commit latency is from transaction unlock.
```
./db_bench --benchmarks=fillrandom --seed=1727376962 --threads=1 --disable_auto_compactions=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=100000 --batch_size=10000 --transaction_db=1 --perf_level=4 --enable_pipelined_write=false --commit_bypass_memtable=1

commit_bypass_memtable = false
fillrandom   :       3.982 micros/op 251119 ops/sec 0.398 seconds 100000 operations;   27.8 MB/s PERF_CONTEXT:
write_memtable_time = 116950422
write_wal_time =      8535565
txn unlock time =     32979883

commit_bypass_memtable = true
fillrandom   :       2.627 micros/op 380559 ops/sec 0.263 seconds 100000 operations;   42.1 MB/s PERF_CONTEXT:
write_memtable_time = 740784
write_wal_time =      11993119
txn unlock time =     21735685
```

Reviewed By: jowlyzhang

Differential Revision: D66307632

Pulled By: cbi42

fbshipit-source-id: 6619af58c4c537aed1f76c4a7e869fb3f5098999
2024-12-05 15:00:17 -08:00
Peter Dillinger
a1a102ffce Steps toward deprecating implicit prefix seek, related fixes (#13026)
Summary:
With some new use cases onboarding to prefix extractors/seek/filters, one of the risks is existing iterator code, e.g. for maintenance tasks, being unintentionally subject to prefix seek semantics. This is a longstanding known design flaw with prefix seek, and `prefix_same_as_start` and `auto_prefix_mode` were steps in the direction of making that obsolete. However, we can't just immediately set `total_order_seek` to true by default, because that would impact so much code instantly.

Here we add a new DB option, `prefix_seek_opt_in_only` that basically allows users to transition to the future behavior when they are ready. When set to true, all iterators will be treated as if `total_order_seek=true` and then the only ways to get prefix seek semantics are with `prefix_same_as_start` or `auto_prefix_mode`.

Related fixes / changes:
* Make sure that `prefix_same_as_start` and `auto_prefix_mode` are compatible with (or override) `total_order_seek` (depending on your interpretation).
* Fix a bug in which a new iterator after dynamically changing the prefix extractor might mix different prefix semantics between memtable and SSTs. Both should use the latest extractor semantics, which means iterators ignoring memtable prefix filters with an old extractor. And that means passing the latest prefix extractor to new memtable iterators that might use prefix seek. (Without the fix, the test added for this fails in many ways.)

Suggested follow-up:
* Investigate a FIXME where a MergeIteratorBuilder is created in db_impl.cc. No unit test detects a change in value that should impact correctness.
* Make memtable prefix bloom compatible with `auto_prefix_mode`, which might require involving the memtablereps because we don't know at iterator creation time (only seek time) whether an auto_prefix_mode seek will be a prefix seek.
* Add `prefix_same_as_start` testing to db_stress

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13026

Test Plan:
tests updated, added. Add combination of `total_order_seek=true` and `auto_prefix_mode=true` to stress test. Ran `make blackbox_crash_test` for a long while.

Manually ran tests with `prefix_seek_opt_in_only=true` as default, looking for unexpected issues. I inspected most of the results and migrated many tests to be ready for such a change (but not all).

Reviewed By: ltamasi

Differential Revision: D63147378

Pulled By: pdillinger

fbshipit-source-id: 1f4477b730683d43b4be7e933338583702d3c25e
2024-09-20 15:54:19 -07:00
Levi Tamasi
54ace7f340 Change the semantics of blob_garbage_collection_force_threshold to provide better control over space amp (#13022)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13022

Currently, `blob_garbage_collection_force_threshold` applies to the oldest batch of blob files, which is typically only a small subset of the blob files currently eligible for garbage collection. This can result in a form of head-of-line blocking: no GC-triggered compactions will be scheduled if the oldest batch does not currently exceed the threshold, even if a lot of higher-numbered blob files do. This can in turn lead to high space amplification that exceeds the soft bound implicit in the force threshold (e.g. 50% would suggest a space amp of <2 and 75% would imply a space amp of <4). The patch changes the semantics of this configuration threshold to apply to the entire set of blob files that are eligible for garbage collection based on `blob_garbage_collection_age_cutoff`. This provides more intuitive semantics for the option and can provide a better write amp/space amp trade-off. (Note that GC-triggered compactions still pick the same SST files as before, so triggered GC still targets the oldest the blob files.)

Reviewed By: jowlyzhang

Differential Revision: D62977860

fbshipit-source-id: a999f31fe9cdda313de513f0e7a6fc707424d4a3
2024-09-19 15:47:13 -07:00
Peter Dillinger
98c33cb8e3 Steps toward making IDENTITY file obsolete (#13019)
Summary:
* Set write_dbid_to_manifest=true by default
* Add new option write_identity_file (default true) that allows us to opt-in to future behavior without identity file
* Refactor related DB open code to minimize code duplication

_Recommend hiding whitespace changes for review_

Intended follow-up: add support to ldb for reading and even replacing the DB identity in the manifest. Could be a variant of `update_manifest` command or based on it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13019

Test Plan: unit tests and stress test updated for new functionality

Reviewed By: anand1976

Differential Revision: D62898229

Pulled By: pdillinger

fbshipit-source-id: c08b25cf790610b034e51a9de0dc78b921abbcf0
2024-09-19 14:05:21 -07:00
anand76
c21fe1a47f Add ticker stats for read corruption retries (#12923)
Summary:
Add a couple of ticker stats for corruption retry count and successful retries. This PR also eliminates an extra read attempt when there's a checksum mismatch in a block read from the prefetch buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12923

Test Plan: Update existing tests

Reviewed By: jowlyzhang

Differential Revision: D61024687

Pulled By: anand1976

fbshipit-source-id: 3a08403580ab244000e0d480b7ee0f5a03d76b06
2024-08-12 15:32:07 -07:00
Zixuan Tan
a97a1f3247 Fix incorrect refillPeriodMicros unit in the document (#12832)
Summary:
The default value for `refillPeriodMicros` is `100 * 1000`, which means 100ms (or 100,000us).

The document comments say 100,000ms (equivalent to 100 seconds), which is incorrect and misleading. This PR fixes this typo.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12832

Reviewed By: cbi42

Differential Revision: D59492336

Pulled By: ajkr

fbshipit-source-id: c2f55a8b996fe078a1510fcbebaea92ec0075929
2024-07-08 18:08:53 -07:00
Richard Barnes
a8dd15ad41 Fix deprecated dynamic exception in internal_repo_rocksdb/repo/java/rocksjni/kv_helper.h +1
Summary:
LLVM has detected a violation of `-Wdeprecated-dynamic-exception-spec`. Dynamic exceptions were removed in C++17. This diff fixes the deprecated instance(s).

See [Dynamic exception specification](https://en.cppreference.com/w/cpp/language/except_spec) and [noexcept specifier](https://en.cppreference.com/w/cpp/language/noexcept_spec).

Reviewed By: palmje

Differential Revision: D58528375

fbshipit-source-id: 130fecd3aa556e4cdb955feea53c442bd9fbc864
2024-06-13 12:41:13 -07:00
Davide Angelocola
cee32c5cce use nullptr instead of NULL / 0 in rocksdbjni (#12575)
Summary:
While I was trying to understand issue https://github.com/facebook/rocksdb/issues/12503, I found this minor problem. Please have a look adamretter rhubner

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12575

Reviewed By: ajkr

Differential Revision: D57596055

Pulled By: cbi42

fbshipit-source-id: ee0860bdfbee9364cd30c23957b72a04da6acd45
2024-05-21 12:56:07 -07:00
Peter Dillinger
45c105104b Set optimize_filters_for_memory by default (#12377)
Summary:
This feature has been around for a couple of years and users haven't reported any problems with it.

Not quite related: fixed a technical ODR violation in public header for info_log_level in case DEBUG build status changes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12377

Test Plan: unit tests updated, already in crash test. Some unit tests are expecting specific behaviors of optimize_filters_for_memory=false and we now need to bake that in.

Reviewed By: jowlyzhang

Differential Revision: D54129517

Pulled By: pdillinger

fbshipit-source-id: a64b614840eadd18b892624187b3e122bab6719c
2024-04-30 08:33:31 -07:00
Andrew Kryczka
b4c520cadc change default CompactionOptions::compression while deprecating it (#12587)
Summary:
I had a TODO to complete `CompactionOptions`'s compression API but never did it: d610e14f93/db/compaction/compaction_picker.cc (L371-L373)

Without solving that TODO, the API remains incomplete and unsafe. Now, however, I don't think it's worthwhile to complete it. I think we should instead delete the API entirely. This PR deprecates it in preparation for deletion in a future major release. The `ColumnFamilyOptions` settings for compression should be good enough for `CompactFiles()` since they are apparently good enough for every other compaction, including `CompactRange()`.

In the meantime, I also changed the default `CompressionType`. Having callers of `CompactFiles()` use Snappy compression by default does not make sense when the default could be to simply use the same compression type that is used for every other compaction. As a bonus, this change makes the default `CompressionType` consistent with the `CompressionOptions` that will be used.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12587

Reviewed By: hx235

Differential Revision: D56619273

Pulled By: ajkr

fbshipit-source-id: 1477de49f14b06c72d6f0045616a8ce91d97e66e
2024-04-26 13:03:21 -07:00
Radek Hubner
a8035ebc0b Fix exception on RocksDB.getColumnFamilyMetaData() (#12474)
Summary:
https://github.com/facebook/rocksdb/issues/12466 reported a bug when `RocksDB.getColumnFamilyMetaData()` is called on an existing database(With files stored on disk). As neilramaswamy mentioned, this was caused by https://github.com/facebook/rocksdb/issues/11770 where the signature of `SstFileMetaData` constructor was changed, but JNI code wasn't updated.

This PR fix JNI code, and also properly populate `fileChecksum` on `SstFileMetaData`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12474

Reviewed By: jowlyzhang

Differential Revision: D55811808

Pulled By: ajkr

fbshipit-source-id: 2ab156f41eaf4a4f30c49e6df421b61e8451230e
2024-04-05 13:55:18 -07:00
Alexey Vinogradov
c4df598b8e Implement PerfContex#toString for the Java API (#12473)
Summary:
I've implemented `PerfContext#toString` for the Java API.
See: https://groups.google.com/g/rocksdb/c/qbY_gNhbyAg

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12473

Reviewed By: jowlyzhang

Differential Revision: D55660871

Pulled By: cbi42

fbshipit-source-id: f0528fba31ac06e16495e4f49b0bafe0dbc1bc61
2024-04-03 14:33:31 -07:00
Radek Hubner
db9eb10b5b Enable all Java test via CMake (#12446)
Summary:
This PR follows the work done in  https://github.com/facebook/rocksdb/issues/11756 and enable all Java test to be run via CMake/Ctest.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12446

Reviewed By: jowlyzhang

Differential Revision: D55661635

Pulled By: cbi42

fbshipit-source-id: 3ea49a121a3ba72089632ff43ee7fe4419b08a96
2024-04-03 11:03:11 -07:00
Peter Dillinger
b515a5db3f Replace ScopedArenaIterator with ScopedArenaPtr<InternalIterator> (#12470)
Summary:
ScopedArenaIterator is not an iterator. It is a pointer wrapper. And we don't need a custom implemented pointer wrapper when std::unique_ptr can be instantiated with what we want.

So this adds ScopedArenaPtr<T> to replace those uses.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12470

Test Plan: CI (including ASAN/UBSAN)

Reviewed By: jowlyzhang

Differential Revision: D55254362

Pulled By: pdillinger

fbshipit-source-id: cc96a0b9840df99aa807f417725e120802c0ae18
2024-03-22 13:40:42 -07:00
anand76
98d8a85624 New PerfContext counters for block cache bytes read (#12459)
Summary:
Add PerfContext counters for measuring the cumulative size of blocks found in the block cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12459

Reviewed By: ajkr

Differential Revision: D55170694

Pulled By: anand1976

fbshipit-source-id: 8cbba76eece116cefce7f00e2fc9d74757661d25
2024-03-21 10:46:46 -07:00
anand76
4868c10b44 Retry block reads on checksum mismatch (#12427)
Summary:
On file systems that support storage level data checksum and reconstruction, retry SST block reads for point lookups, scans, and flush and compaction if there's a checksum mismatch on the initial read. A file system can indicate its support by setting the `FSSupportedOps::kVerifyAndReconstructRead` bit in `SupportedOps`.

Tests:
Add new unit tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12427

Reviewed By: ajkr

Differential Revision: D55025941

Pulled By: anand1976

fbshipit-source-id: dbd990cb75e03f756c8a66d42956f645c0b6d55e
2024-03-18 16:16:05 -07:00
Yu Zhang
f2546b6623 Support returning write unix time in iterator property (#12428)
Summary:
This PR adds support to return data's approximate unix write time in the iterator property API. The general implementation is:
1) If the entry comes from a SST file, the sequence number to time mapping recorded in that file's table properties will be used to deduce the entry's write time from its sequence number. If no such recording is available, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown except if the entry's sequence number is zero, in which case, 0 is returned. This also means that even if `preclude_last_level_data_seconds` and `preserve_internal_time_seconds` can be toggled off between DB reopens, as long as the SST file's table property has the mapping available, the entry's write time can be deduced and returned.

2) If the entry comes from memtable, we will use the DB's sequence number to write time mapping to do similar things. A copy of the DB's seqno to write time mapping is kept in SuperVersion to allow iterators to have lock free access. This also means a new `SuperVersion` is installed each time DB's seqno to time mapping updates, which is originally proposed by Peter in  https://github.com/facebook/rocksdb/issues/11928 . Similarly, if the feature is not enabled, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown.

Needed follow up:
1) The write time for `kTypeValuePreferredSeqno` should be special cased, where it's already specified by the user, so we can directly return it.

2) Flush job can be updated to use DB's seqno to time mapping copy in the SuperVersion.

3) Handle the case when `TimedPut` is called with a write time that is `std::numeric_limits<uint64_t>::max()`. We can make it a regular `Put`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12428

Test Plan: Added unit test

Reviewed By: pdillinger

Differential Revision: D54967067

Pulled By: jowlyzhang

fbshipit-source-id: c795b1b7ec142e09e53f2ed3461cf719833cb37a
2024-03-15 15:37:37 -07:00
Alan Paxton
d9a441113e JNI get_helper code sharing / multiGet() use efficient batch C++ support (#12344)
Summary:
Implement RAII-based helpers for JNIGet() and multiGet()

Replace JNI C++ helpers `rocksdb_get_helper, rocksdb_get_helper_direct`, `multi_get_helper`, `multi_get_helper_direct`, `multi_get_helper_release_keys`, `txn_get_helper`, and `txn_multi_get_helper`.

The model is to entirely do away with a single helper, instead a number of utility methods allow each separate
JNI `Get()` and `MultiGet()` method to organise their parameters efficiently, then call the underlying C++ `db->Get()`,
`db->MultiGet()`, `txn->Get()`, or `txn->MultiGet()` method itself, and use further utilities to retrieve results.

Roughly speaking:

* get keys into C++ form
* Call C++ Get()
* get results and status into Java form

We achieve a useful performance gain as part of this work; by using the updated C++ multiGet we immediately pick up its performance gains (batch improvements to multiGet C++ were previously implemented, but not until now used by Java/JNI). multiGetBB already uses the batched C++ multiGet(), and all other benchmarks show consistent improvement after the changes:

## Before:
```
Benchmark (columnFamilyTestType) (keyCount) (keySize) (multiGetSize) (valueSize) Mode Cnt Score Error Units
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 256 thrpt 25 5315.459 ± 20.465 ops/s
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 1024 thrpt 25 5673.115 ± 78.299 ops/s
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 4096 thrpt 25 2616.860 ± 46.994 ops/s
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 16384 thrpt 25 1700.058 ± 24.034 ops/s
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 65536 thrpt 25 791.171 ± 13.955 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 256 thrpt 25 6129.929 ± 94.200 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 1024 thrpt 25 7012.405 ± 97.886 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 4096 thrpt 25 2799.014 ± 39.352 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 16384 thrpt 25 1417.205 ± 22.272 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 65536 thrpt 25 655.594 ± 13.050 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 256 thrpt 25 6147.247 ± 82.711 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 1024 thrpt 25 7004.213 ± 79.251 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 4096 thrpt 25 2715.154 ± 110.017 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 16384 thrpt 25 1408.070 ± 31.714 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 65536 thrpt 25 623.829 ± 57.374 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 256 thrpt 25 6119.243 ± 116.313 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 1024 thrpt 25 6931.873 ± 128.094 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 4096 thrpt 25 2678.253 ± 39.113 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 16384 thrpt 25 1337.384 ± 19.500 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 65536 thrpt 25 625.596 ± 14.525 ops/s
```

## After:
```
Benchmark                                    (columnFamilyTestType)  (keyCount)  (keySize)  (multiGetSize)  (valueSize)   Mode  Cnt     Score     Error  Units
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100          256  thrpt   25  5191.074 ±  78.250  ops/s
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100         1024  thrpt   25  5378.692 ± 260.682  ops/s
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100         4096  thrpt   25  2590.183 ±  34.844  ops/s
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100        16384  thrpt   25  1634.793 ±  34.022  ops/s
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100        65536  thrpt   25   786.455 ±   8.462  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100          256  thrpt   25  5285.055 ±  11.676  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100         1024  thrpt   25  5586.758 ± 213.008  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100         4096  thrpt   25  2527.172 ±  17.106  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100        16384  thrpt   25  1819.547 ±  12.958  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100        65536  thrpt   25   803.861 ±   9.963  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100          256  thrpt   25  5253.793 ±  28.020  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100         1024  thrpt   25  5705.591 ±  20.556  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100         4096  thrpt   25  2523.377 ±  15.415  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100        16384  thrpt   25  1815.344 ±  11.309  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100        65536  thrpt   25   820.792 ±   3.192  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100          256  thrpt   25  5262.184 ±  20.477  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100         1024  thrpt   25  5706.959 ±  23.123  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100         4096  thrpt   25  2520.362 ±   9.170  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100        16384  thrpt   25  1789.185 ±  14.239  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100        65536  thrpt   25   818.401 ±  12.132  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100          256  thrpt   25  6978.310 ±  14.084  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100         1024  thrpt   25  7664.242 ±  22.304  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100         4096  thrpt   25  2881.778 ±  81.054  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100        16384  thrpt   25  1599.826 ±   7.190  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100        65536  thrpt   25   737.520 ±   6.809  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100          256  thrpt   25  6974.376 ±  10.716  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100         1024  thrpt   25  7637.440 ±  45.877  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100         4096  thrpt   25  2820.472 ±  42.231  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100        16384  thrpt   25  1716.663 ±   8.527  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100        65536  thrpt   25   755.848 ±   7.514  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100          256  thrpt   25  6943.651 ±  20.040  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100         1024  thrpt   25  7679.415 ±   9.114  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100         4096  thrpt   25  2844.564 ±  13.388  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100        16384  thrpt   25  1729.545 ±   5.983  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100        65536  thrpt   25   783.218 ±   1.530  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100          256  thrpt   25  6944.276 ±  29.995  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100         1024  thrpt   25  7670.301 ±   8.986  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100         4096  thrpt   25  2839.828 ±  12.421  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100        16384  thrpt   25  1730.005 ±   9.209  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100        65536  thrpt   25   787.096 ±   1.977  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100          256  thrpt   25  6896.944 ±  21.530  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100         1024  thrpt   25  7622.407 ±  12.824  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100         4096  thrpt   25  2927.538 ±  19.792  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100        16384  thrpt   25  1598.041 ±   4.312  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100        65536  thrpt   25   744.564 ±   9.236  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100          256  thrpt   25  6853.760 ±  78.041  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100         1024  thrpt   25  7360.917 ± 355.365  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100         4096  thrpt   25  2848.774 ±  13.409  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100        16384  thrpt   25  1727.688 ±   3.329  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100        65536  thrpt   25   776.088 ±   7.517  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100          256  thrpt   25  6910.339 ±  14.366  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100         1024  thrpt   25  7633.660 ±  10.830  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100         4096  thrpt   25  2787.799 ±  81.775  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100        16384  thrpt   25  1726.517 ±   6.830  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100        65536  thrpt   25   787.597 ±   3.362  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100          256  thrpt   25  6922.445 ±  10.493  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100         1024  thrpt   25  7604.710 ±  48.043  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100         4096  thrpt   25  2848.788 ±  15.783  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100        16384  thrpt   25  1730.837 ±   6.497  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100        65536  thrpt   25   794.557 ±   1.869  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100          256  thrpt   25  6918.716 ±  15.766  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100         1024  thrpt   25  7626.692 ±   9.394  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100         4096  thrpt   25  2871.382 ±  72.155  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100        16384  thrpt   25  1598.786 ±   4.819  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100        65536  thrpt   25   748.469 ±   7.234  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100          256  thrpt   25  6922.666 ±  17.131  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100         1024  thrpt   25  7623.890 ±   8.805  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100         4096  thrpt   25  2850.698 ±  18.004  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100        16384  thrpt   25  1727.623 ±   4.868  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100        65536  thrpt   25   774.534 ±  10.025  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100          256  thrpt   25  5486.251 ±  13.582  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100         1024  thrpt   25  4920.656 ±  44.557  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100         4096  thrpt   25  3922.913 ±  25.686  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100        16384  thrpt   25  2873.106 ±   4.336  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100        65536  thrpt   25   802.404 ±   8.967  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100          256  thrpt   25  4817.996 ±  18.042  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100         1024  thrpt   25  4243.922 ±  13.929  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100         4096  thrpt   25  3175.998 ±   7.773  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100        16384  thrpt   25  2321.990 ±  12.501  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100        65536  thrpt   25  1753.028 ±   7.130  ops/s
```

Closes https://github.com/facebook/rocksdb/issues/11518

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12344

Reviewed By: cbi42

Differential Revision: D54809714

Pulled By: pdillinger

fbshipit-source-id: bee3b949720abac073bce043b59ce976a11e99eb
2024-03-12 12:42:08 -07:00
Alan Paxton
c4d37da826 Java API - Fix handling of CF handles in DB subclasses (#12417)
Summary:
The most general `open()` method for each of RocksDB, TtlDB, OptimisticTransactionDB and TransactionDB should
- ensure the default CF is supplied in the list of descriptors
- cache the default CF handle
- store open CF handles for automatic close on DB close
The `close()` method in each of these DB subclasses should `close()` all the owned CF handles.

I can’t find a cleaner way to build some generalised open/close that does this for all DB subclasses, so it exists as cut and paste with variations in the 4 different DB subclasses.

Added some slightly paranoid testing that CF handles explicitly referred to as default in a list of CF handles in the general open methods, and the simple open that doesn’t supply a CF, end up reading and writing to the same CF. Prompted by the fact that this code is a bit opaque; the first returned handle is the DB.

As part of this, fix the bug where the Java side of `OptimisticsTransactionDB` was not setting up default column family; this was visible when setting up an iterator; add a test to validate that the iterator is OK. A single Java reference to the default column family was not being created in the OptimisticsTransactionDB RocksDB subclass; it should be created in all subclasses. The same problem had previously been fixed for TtlDB.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12417

Reviewed By: ajkr

Differential Revision: D54807643

Pulled By: pdillinger

fbshipit-source-id: 66f34e56a822a009a8f2018d401cf8940d91aa35
2024-03-12 10:33:27 -07:00
Radek Hubner
583fded565 Fix regression for Javadoc jar build (#12404)
Summary:
https://github.com/facebook/rocksdb/issues/12371 Introduced regression not defining dependency between `create_javadoc`  and `rocksdb_javadocs_jar` build targets.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12404

Reviewed By: pdillinger

Differential Revision: D54516862

Pulled By: ajkr

fbshipit-source-id: 785a99b2caf979395ae0de60e40e7d1b93059adb
2024-03-06 10:33:17 -08:00
Adam Retter
5458eda5f0 Pass build parallelism flag to Docker builds (#12392)
Summary:
Passed the `-j` flag through to builds happening inside Docker containers.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12392

Reviewed By: cbi42

Differential Revision: D54311937

Pulled By: ajkr

fbshipit-source-id: 5cf1bfe4b9059cc2d078fb5331812f32cf9e89ab
2024-02-28 12:51:00 -08:00
Adam Retter
99cc36be9b Correct CMake Javadoc and source jar builds (#12371)
Summary:
Fix some issues introduced in https://github.com/facebook/rocksdb/pull/12199 (CC rhubner)
1. Previous `jar -v -c -f` was not valid command syntax.
2. Javadoc and source Jar files were prefixed `rocksdb-`, now corrected to `rocksdbjni-`

pdillinger This needs to be merged to `main` and also `8.11.fb` (to fix the Windows build for the RocksJava release of 8.11.2) please.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12371

Reviewed By: pdillinger, jowlyzhang

Differential Revision: D54136834

Pulled By: hx235

fbshipit-source-id: f356f2401042af359ada607e5f0be627418ccd6c
2024-02-27 15:46:12 -08:00
Yu Zhang
2940acac00 Persist table options use_delta_encoding in options file (#11987)
Summary:
This option is used for encoding keys in block based table files. It has been having a default true value since its introduction.

Users may not notice this option is not persisted in options file unless they are explicitly setting it to false. If the users expect `Iterator::GetProperty("rocksdb.iterator.is-key-pinned")` to return 1 when setting `ReadOptions.pin_data = true`, they should have noticed loading options file won't work and have work around for this by always explicitly set this option to false for opening DB. This change won't impact those users except that now they can remove their work around. If the users are not relying on key pinning behavior at all and as a result didn't notice the option is not persisted, this change shouldn't have any visible behavior impact either.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11987

Reviewed By: hx235

Differential Revision: D54093238

Pulled By: jowlyzhang

fbshipit-source-id: 256a3348c44cf91349034d1f6e242c437b32b9a5
2024-02-23 14:13:28 -08:00
leedonggyu
ca99a8f153 Add function to check if the RocksDB instance is closed or not (#11337)
Summary:
In RocksDb jni threre is no method to know if the instance is closed or not.
so when using a closed instance it makes jvm crash.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11337

Reviewed By: jaykorean

Differential Revision: D53941387

Pulled By: ajkr

fbshipit-source-id: e3e4e6fe48409fa70a312810e467ec0c4ce356ef
2024-02-20 11:36:28 -08:00
anand76
d227276147 Deprecate some variants of Get and MultiGet (#12327)
Summary:
A lot of variants of Get and MultiGet have been added to `include/rocksdb/db.h` over the years. Try to consolidate them by marking variants that don't return timestamps as deprecated. The underlying DB implementation will check and return Status::NotSupported() if it doesn't support returning timestamps and the caller asks for it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12327

Reviewed By: pdillinger

Differential Revision: D53828151

Pulled By: anand1976

fbshipit-source-id: e0b5ca42d32daa2739d5f439a729815a2d4ff050
2024-02-16 09:21:06 -08:00
anand76
28c1c15c29 Sync tickers and histograms across C++ and Java (#12355)
Summary:
The RocksDB ticker and histogram statistics were out of sync between the C++ and Java code, with a number of newer stats missing in TickerType.java and HistogramType.java. Also, there were gaps in numbering in portal.h, which could soon become an issue due to the number of tickers and the fact that we're limited to 1 byte in Java. This PR adds the missing stats, and re-numbers all of them. It also moves some stats around to try to group related stats together. Since this will go into a major release, compatibility shouldn't be an issue.

This should be automated at some point, since the current process is somewhat error prone.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12355

Reviewed By: jaykorean

Differential Revision: D53825324

Pulled By: anand1976

fbshipit-source-id: 298c180872f4b9f1ee54b8bb22f4e280458e7e09
2024-02-15 17:22:03 -08:00
Peter Dillinger
bfd00bba9c Use format_version=6 by default (#12352)
Summary:
It's in production for a large storage service, and it was initially released 6 months ago (8.6.0). IMHO that's enough room for "easy downgrade" to most any user's previously integrated version, even if they only update a few times a year.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12352

Test Plan:
tests updated, including format capatibility test

table_test: ApproximateOffsetOfCompressed is affected because adding index block to metaindex adds about 13 bytes
to SST files in format_version 6. This test has historically been problematic and one reason is that, apparently, not only
could it pass/fail depending on snappy compression version, but also how long your host name is, because of db_host_id.
I've cleared that out for the test, which takes care of format_version=6 and hopefully improves long-term reliability.

Suggested follow-up: FinishImpl in table_test.cc takes a table_options that is ignored in some cases and might not match
the ioptions.table_factory configuration unless the caller is very careful. This should be cleaned up somehow.

Reviewed By: anand1976

Differential Revision: D53786884

Pulled By: pdillinger

fbshipit-source-id: 1964cbd40d3ab0a821fdc01c458031df716fcf51
2024-02-15 11:23:48 -08:00
Yu Zhang
4bea83aa44 Remove the force mode for EnableFileDeletions API (#12337)
Summary:
There is no strong reason for user to need this mode while on the other hand, its behavior is destructive.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12337

Reviewed By: hx235

Differential Revision: D53630393

Pulled By: jowlyzhang

fbshipit-source-id: ce94b537258102cd98f89aa4090025663664dd78
2024-02-13 18:36:25 -08:00
马越
45668a05f5 add unit test for compactRangeWithNullBoundaries java api (#12333)
Summary:
The purpose of this PR is to supplement a set of unit tests for https://github.com/facebook/rocksdb/pull/12328

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12333

Reviewed By: ltamasi

Differential Revision: D53553830

Pulled By: cbi42

fbshipit-source-id: d21490f7ce7b30f42807ee37eda455ca6abdd072
2024-02-13 10:48:31 -08:00
Hui Xiao
1a885fe730 Remove deprecated Options::access_hint_on_compaction_start (#11654)
Summary:
**Context:**
`Options::access_hint_on_compaction_start ` is marked deprecated and now ready to be removed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11654

Test Plan:
Multiple db_stress runs with pre-PR and post-PR binary randomly to ensure forward/backward compatibility on options 36a5686ec0
`python3 tools/db_crashtest.py --simple blackbox --interval=30`

Reviewed By: cbi42

Differential Revision: D47892459

Pulled By: hx235

fbshipit-source-id: a62f46a0377fe143be7638e218978d5431c15c56
2024-02-05 13:35:19 -08:00