rocksdb

mirror of https://github.com/facebook/rocksdb synced 2026-05-24 09:29:21 +00:00

Author	SHA1	Message	Date
Peter Dillinger	1c8a012727	Add kCool Temperature (#14000 ) Summary: also requested by internal user, like kIce in https://github.com/facebook/rocksdb/issues/13927 Pull Request resolved: https://github.com/facebook/rocksdb/pull/14000 Test Plan: unit tests updated Reviewed By: archang19 Differential Revision: D83200479 Pulled By: pdillinger fbshipit-source-id: 31f2842d87bcad40227aeee9687ff5772393689c	2025-09-25 11:27:00 -07:00
Peter Dillinger	6127a42f98	Use/endorse (Auto)HyperClockCache by default over LRUCache (#13964 ) Summary: After seeing more people hit issues with thrashing small LRUCache shards and AutoHCC running fully in production for a while on a very large service, here I make these updates: * In the public API, mark the case of `estimated_entry_charge = 0` (which is how you select AutoHCC) as production-ready and generally preferred. That means devoting a lot less space to how to tune FixedHCC (`estimated_entry_charge > 0`) because it is not generally recommended anymore even though in theory it is the fastest (conditional on a fragile configuration). * In the public API, add more detail about potential problems with LRUCache and explicitly endorse HCC. * When a default block cache is created, use AutoHCC instead of LRUCache. It's still a 32MB cache but that's just one cache shard for AutoHCC so the risk of issues with small cache shards is dramatically reduced. And a single AutoHCC shard is still essentially wait-free. * Improve the handling of the hypothetical scenario of a failed anonymous mmap. This is hardly a concern for 64-bit Linux and likely most other OSes. It would in theory be possible to fall back on LRUCache in that case but the code structure makes that annoying/challenging. Instead we crash with an appropriate message. * Cleaned up some includes * Fixed some previously unreported leaks (better assertions on HCC perhaps, some subtle behavior changes) * Added a new mode to cache_bench (detailed below) * Avoid a particularly costly sanity check in `~AutoHyperClockTable()` even in debug builds so that unit testing, etc., isn't bogged down, except keep it in ASAN build. Planned follow-up: * Update HCC implementation to use my new "bit field atomics" API introduced in https://github.com/facebook/rocksdb/issues/13910 to make it easier to read and maintain Possible follow-up: * Re-engineer table cache to use AutoHCC also, instead of LRUCache and a single mutex to ensure no duplication across threads. (a) Pad table cache key to 128 bits for AutoHCC. (b) Stripe/shard the no-duplication mutex. (HCC's consistency model is too weak for concurrent threads to use its API to agree on a winner, even if entries could be inserted in an "open in progress" state.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/13964 Test Plan: existing tests. ClockCacheTest.ClockEvictionEffortCapTest caught a regression during my development, and the crash test has a history of finding subtle HCC bugs. ## Performance Although we've validated AutoHCC performance under high load, etc., before we haven't really considered whether there will be unacceptable overheads for small DBs and CFs, e.g. in unit tests. For this, I have added a new mode to cache_bench: with the -stress_cache_instances=n parameter, it will create and destroy n empty cache instances several times. In the debug build, this found that a particular check in `~AutoHyperClockTable()` was extremely costly for short-lived caches (fixed). Beyond that, we can answer the question of whether it is feasible for a single process to host 1000 DBs each with 1000 CFs with default block cache instances, after moving LRUCache -> AutoHCC, for example: ``` /usr/bin/time ./cache_bench -stress_ cache_instances=1000000 -cache_type=auto_hyper_clock_cache -cache_size=33554432 ``` Release build: Average 9.8 us per 32MB LRUCache creation, 2.9 us per destruction, 24.6GB max RSS (~25KB each) -> Average 4.3 us per 32MB AutoHCC creation, 4.9 us per destruction, 4.8GB max RSS (~5KB each) Debug build: Average 10.9 us per 32MB LRUCache creation, 3.5 us per destruction, 28.7GB max RSS (~29KB each) -> Average 4.5 us per 32MB AutoHCC creation, 4.9 us per destruction, 4.7GB max RSS (~5KB each) Despite the anonymous mmaps, it's apparently more efficient for default/small/empty structures. This is likely due to the dramatically low number of cache shards at this size. If we switch to `-stress_cache_instances=10000 -cache_size=1073741824`: Release build: Average 10.6 us per 1GB LRUCache, 2.8 us per destruction, 2.3 GB max RSS (~230KB each) -> Average 130 us per 1GB AutoHCC creation, 153 us per destruction, 1.5 GB max RSS (~150KB each) Debug build: Average 11.2 us per 1GB LRUCache, 3.6 us per destruction, 2.4 GB max RSS (~240KB each) -> Average 130 us per 1GB AutoHCC creation, 150 us per destruction, 1.6 GB max RSS (~160KB each) Here it's clear that we are paying a price in time for setting up all those mmaps for the good number of cache shards and potential table growth, even though the RSS is well under control. However, I am not concerned about this at all, as it's unlikely to slow down anything notably such as unit tests. Before and after full testsuite runs confirm: 3327.73user 5188.71system 3:38.88elapsed -> 3312.07user 5704.77system 3:41.61elapsed There is increased kernel time but acceptable. With ASAN+UBSAN: 11618.70user 15671.30system 5:54.68elapsed -> 12595.81user 16159.67system 6:32.77elapsed Acceptable given that our ASAN+UBSAN builds are not the slowest in CI Reviewed By: hx235 Differential Revision: D82661067 Pulled By: pdillinger fbshipit-source-id: ab25c766ca70f2b8664849c2a838b9e1b4e72d3b	2025-09-18 13:27:51 -07:00
Peter Dillinger	67af5bdc38	Add Temperature::kIce (#13927 ) Summary: ... and associated statistics, etc. Someone needs it, so here it is. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13927 Test Plan: Updated / extended / added some unit tests Reviewed By: cbi42 Differential Revision: D81981469 Pulled By: pdillinger fbshipit-source-id: 52558c08741890b781310906acbc18d9eb479363	2025-09-10 10:29:49 -07:00
Alan Paxton	805ac7c887	Update compression libraries to latest releases (#13609 ) Summary: See `Makefile` for actual changes: * ZLIB remains the same * BZIP2 remains the same * SNAPPY is a minor update * LZ4 is a significant update with multithreaded/multicore compression https://github.com/lz4/lz4/releases/tag/v1.10.0 * ZSTD is a significant update RocksDB is called out as benefiting in particular from the performance improvements herein https://github.com/facebook/zstd/releases/tag/v1.5.7 Pull Request resolved: https://github.com/facebook/rocksdb/pull/13609 Reviewed By: archang19 Differential Revision: D77877295 Pulled By: mszeszko-meta fbshipit-source-id: bf9a257e8f68dec3d02743b339aa2df65df4ab2c	2025-07-07 13:28:49 -07:00
Miroslav Kovar	fc2cf7ead2	Expose optimized `TransactionBaseImpl::MultiGet` through JNI (#13589 ) Summary: Addresses https://github.com/facebook/rocksdb/issues/13587. This PR exposes the optimized implementation of batched reads through a `Transaction` object to Java clients. The latency improvement of transactional multiget on production workload achieved by switching the implementation is roughly: ``` quantile=0.2: 21% quantile=0.5: 28% quantile=0.8: 46% quantile=1.0: 239% ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/13589 Reviewed By: jaykorean Differential Revision: D74660169 Pulled By: cbi42 fbshipit-source-id: d01780173e0500c96e5e431ff6645008cbf6e8b5	2025-05-14 13:19:06 -07:00
Michael C Huang	13d865f6f1	Add trivial copy support when FIFO compaction reason is kChangeTemperature (#13562 ) Summary: Prior to this PR, for FIFO kChangeTemperature compaction was done by iterating and reading thru the input sst and generate the output sst. This was wasteful since for FIFO we could apply the "trivial" move by copying the input sst to the out sst without need decompress/compress and reading thru the input sst content at all. This PR added "allow_trivial_copy_when_change_temperature" to the CompactionOptionsFIFO. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13562 Reviewed By: cbi42 Differential Revision: D73295404 Pulled By: mikechuangmeta fbshipit-source-id: 02241c7389797730ecd4a3b636837cb5f912b424	2025-05-08 15:51:37 -07:00
Hui Xiao	29c6610617	Add compaction explicit prefetch stats (#13520 ) Summary: Context/Summary: This PR adds new stats to measure compaction readahead size for rocksdb managed prefetching (not FS prefetching). It can be used to verify compaction read-ahead is doing what's configured. This PR also excludes compaction readahead stats from user scan readahead stats measured in existing stats so there is a cleaner separating between these two. Bonus: this PR also included some typo fixing about "io activities" Pull Request resolved: https://github.com/facebook/rocksdb/pull/13520 Test Plan: Modified existing test to verify stats Reviewed By: archang19 Differential Revision: D72892850 Pulled By: hx235 fbshipit-source-id: 1a73182061baa044c9c9193a2b0fd967ffe75c4a	2025-04-14 12:08:38 -07:00
anand76	f7764cb6b2	Remove fail_if_options_file_error DB option (#13504 ) Summary: The fail_if_options_file_error has been deprecated for more than a year. This PR removes it from the code base. https://github.com/facebook/rocksdb/issues/12056 fixed a bug that was blocking the option from removal. https://github.com/facebook/rocksdb/issues/12249 marked it as deprecated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13504 Reviewed By: hx235 Differential Revision: D72194063 Pulled By: anand1976 fbshipit-source-id: 0aa7cf56e60c48c7e7654743d3e64922ce65225d	2025-04-09 14:18:33 -07:00
Yu Zhang	5e10baa412	Delete max_write_buffer_number_to_maintain (#13491 ) Summary: As titled. This option has been marked deprecated since introduction of a better option `max_write_buffer_size_to_maintain` and acts as its fallback since RocksDB 6.5.0 The internal user we know these options were created for migrated to `max_write_buffer_size_to_maintain` for a long time too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13491 Test Plan: existing tests Reviewed By: cbi42 Differential Revision: D71984601 Pulled By: jowlyzhang fbshipit-source-id: c264d4809e311f60fdbad817ebfade256db549b6	2025-04-07 21:44:36 -07:00
Hui Xiao	5735ff4e03	Update window build cmake to download newer snappy version and Java cmake_minimum_required (#13514 ) Summary: Context/Summary: - This is an attempt to fix our [build-window-vs2022 failure](https://github.com/facebook/rocksdb/actions/runs/14215681026/job/39831770554?fbclid=IwZXh0bgNhZW0CMTAAAR2BQLjp8kC1u1yyvN1_S5qwmrHEZOfzxJdcbj2vq7mvwwq83n1cbkmiBCA_aem_ygYxQA5EUmxh2y4EjMlTfg) below. snappy-1.1.8's cmake_minimum_required being less than 3.5 seems to trigger the complaint. Hopefully downloading the 1.2.2 which is the [first version starting to use higher cmake_minimum_required version](https://github.com/google/snappy/releases/tag/1.2.2) solves the failure. ``` Directory: D:\a\rocksdb\rocksdb\thirdparty\snappy-1.1.8 Mode LastWriteTime Length Name ---- ------------- ------ ---- d---- 4/2/2025 9:02 AM build CMake Error at CMakeLists.txt:29 (cmake_minimum_required): Compatibility with CMake < 3.5 has been removed from CMake. Update the VERSION argument <min> value. Or, use the <min>...<max> syntax to tell CMake that the project requires at least <min> but has been updated to work with policies introduced by <max> or earlier. Or, add -DCMAKE_POLICY_VERSION_MINIMUM=3.5 to try configuring anyway. ``` - The downloaded snappy do not include the content under nested repos Google Test and Google Benchmark. But snappy cmake by default will attempt to build them. Since we don't change snappy, we don't need building such development suit. This PR also disabled snappy cmake's attempt to build them. - By running above changes, the same build [complained](https://github.com/facebook/rocksdb/actions/runs/14228883966/job/39874927730?pr=13514) about java cmakelists requiring too low cmake_minimum_required as well. So this PR also upgraded its cmake_minimum_required to be 3.11 aligning with its warning message ``` if(${CMAKE_VERSION} VERSION_LESS "3.11.4") message("Please consider switching to CMake 3.11.4 or newer") endif() ``` Test plan Monitor build-window-vs2022 for this PR Pull Request resolved: https://github.com/facebook/rocksdb/pull/13514 Reviewed By: pdillinger Differential Revision: D72333581 Pulled By: hx235 fbshipit-source-id: 1a9096738d39c8b1d270fe17fbd78c1ea4c4c45e	2025-04-02 15:46:02 -07:00
Hui Xiao	48eb646787	Mark MaxMemCompactionLevel() deprecated (#13503 ) Summary: Context/Summary: MaxMemCompactionLevel() developed 10 years ago simply returns the level a memtable flushed to, which has historically been L0 and have no plan to change to something different for future. It is also not used in test or internally. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13503 Test Plan: CI + fake release Reviewed By: cbi42 Differential Revision: D72066092 Pulled By: hx235 fbshipit-source-id: 5ff5b16a6664ef3efabd3a6fbd8a2d0529b62460	2025-03-31 19:29:40 -07:00
Changyu Bi	325dcdf2e5	Deprecate `ReadOptions::ignore_range_deletions` and `experimental::PromoteL0()` (#13500 ) Summary: based on the option comment, `ignore_range_deletions` was added due to the overhead of range deletions in read path when a DB does not use DeleteRange(). The current implementation should not have a noticeable performance difference in this case. `experimental::PromoteL0()` can be replaced by doing a manual compaction with proper CompactRangeOptions. There are some internal use of these option and API so we will remove them later after the usages are updated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13500 Test Plan: comment change only. Performance: benchmark the performance difference with `ignore_range_deletions` and without (borrowed flag `universal_incremental` for this purpose), ran at the same time on the same machine. - random point get: - ignore_range_deletions=false: 343078 ops/sec - ignore_range_deletions=true: 340219 ops/sec (0.8% slower) ``` (for I in $(seq 1 1); do TEST_TMPDIR=/dev/shm/t1 /data/users/changyubi/vscode-root/rocksdb/db_bench --benchmarks=fillseq,waitforcompaction,readrandom --write_buffer_size=67108864 --writes=1000000 --num=2000000 --reads=1000000 --seed=1723056275 --universal_incremental=false 2>&1 \| grep "readrandom"; done;) \| awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; ``` - sequential scan: - ignore_range_deletions=false: 5378104 ops/sec - ignore_range_deletions=true: 5393809 ops/sec (0.3% faster) ``` (for I in $(seq 1 10); do TEST_TMPDIR=/dev/shm/t1 /data/users/changyubi/vscode-root/rocksdb/db_bench --benchmarks=fillseq,waitforcompaction,readseq[-X10] --write_buffer_size=67108864 --writes=1000000 --num=2000000 --universal_incremental=true --seed=1723056275 2>1 \| grep "\[AVG 10 runs\]"; done;) \| awk '{ t += $6; c++; print; } END { printf "%.0f\n", 1.0 * t / c }'; ``` The difference in ops/sec for the two benchmarks is likely noise. Reviewed By: hx235 Differential Revision: D72069223 Pulled By: cbi42 fbshipit-source-id: ad82a051aa4682790d2178cd4fb2d1467397fbb5	2025-03-28 14:49:28 -07:00
Peter Dillinger	82794e0a4f	Deprecate RangePtr, favor new RangeOpt and OptSlice (#13481 ) Summary: The new API in https://github.com/facebook/rocksdb/issues/13453 is awkward and precarious because of using RangePtr, which encodes optional keys using raw pointers to Slice. We could use `std::optional<Slice>` instead but that is unsatisfyingly a larger object with an inefficient size (typically 17 bytes). Here I introduce a custom optional Slice type, `OptSlice`, that is the same size as a Slice, and use it in a number of places to clean up code and make some public APIs easier to work with. This includes * `atomic_replace_range` (not yet released, OK to change) * `GetAllKeyVersions()` which gets a behavior change because of its unusual handling of empty keys. * `DeleteFilesInRanges()` * TODO in follow-up: `CompactRange()` Most of the diff is associated updates and refactorings. Also * Move some relevant things out of db.h to keep it as tidy as possible. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13481 Test Plan: tests updated Reviewed By: hx235 Differential Revision: D71747774 Pulled By: pdillinger fbshipit-source-id: b4c8519608d119b8bceca9bb0fd778608f62a141	2025-03-24 17:08:17 -07:00
Maciej Szeszko	591f5b1266	Remove deprecated DB::DeleteFile API references (#13322 ) Summary: Cleanup post https://github.com/facebook/rocksdb/pull/13284. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13322 Test Plan: 1. We did not find any evidence of breakage in internal pre-release integration pipeline runs after renaming the deprecated API in `9.10`. 2. _To the extent possible_, we manually validated partner use cases of file deletion and confirmed deprecated API is no longer in use. Reviewed By: jaykorean Differential Revision: D68476852 Pulled By: mszeszko-meta fbshipit-source-id: fbe1f873e16ae7c60d7706a3c44ecc695ab86a4b	2025-01-24 22:28:41 -08:00
Roman Puchkovskiy	b98c21b281	Support flush reasons above 12 in Java integration (#13246 ) Summary: FlushReason enum in C++ has members up to 15, but in Java, the mirroring FlushReason only supports reason codes up to 12. This causes exceptions when adding a flush listener. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13246 Reviewed By: pdillinger Differential Revision: D68241620 Pulled By: jaykorean fbshipit-source-id: 1e2856dad28dff0cbb1772f5a8ea03cc1e224088	2025-01-17 10:04:10 -08:00
Maciej Szeszko	6e97a813dc	Deprecate db delete file public API (#13284 ) Summary: We added a removal warning for public `DB::DeleteFile` API ~4 years ago in https://github.com/facebook/rocksdb/pull/7337. This API seems to sit at wrong layer of abstraction, where instead of exposing a clear interface to delete specific range of keys, callers rely on their own discovery / interpretation of where their data / log possibly resides 'as-of-now'. For example, in case of data, the physical location of the keys might very well change after user obtained their mapping from key(s) to specific SST file. This will lead to `InvalidArgument` response, which if repeated, would put a user in a race condition spinning wheel - the behavior that's inefficient, fairly indeterministic and therefore one that should be strongly discouraged. We're employing a graceful approach to prefixing the public API with `DEPRECATED_` first for better discoverability and ease of self service for product teams should they still use that legacy API. If everything goes smoothly, we intend to remove all the deprecated API references in the next release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13284 Reviewed By: pdillinger Differential Revision: D67981502 Pulled By: mszeszko-meta fbshipit-source-id: adc7fe5cf4e2180bcfd21878b8f78f3fb6ead355	2025-01-10 19:07:33 -08:00
Maciej Szeszko	541761eaaa	Deprecate random access max buffer size references - take #2 (#13288 ) Summary: This time properly marking db option as `kDeprecated` in `db_options.cc`. Original PR: https://github.com/facebook/rocksdb/pull/13278. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13288 Reviewed By: pdillinger Differential Revision: D68024379 Pulled By: mszeszko-meta fbshipit-source-id: 8e1f08b048ccf5971d899811edaf0b0ef16581ef	2025-01-10 15:32:38 -08:00
Maciej Szeszko	6c6defe3b8	Revert "Deprecate random access max buffer size references (#13278 )" (#13285 ) Summary: This reverts commit `d4bd67fb09`. There are total of 4 call sites as referenced [here](https://www.internalfb.com/code/search?q=repo%3Aall%20-filepath%3Afbcode%2Frocksdb%2F%7Cthird-party%2Frocksdb%2F%7Clibrocksdb%2F%7Cfb_mysql%2F.*%2Frocksdb%7Cinternal_repo_rocksdb%20regex%3Aon%20random_access_max_buffer_size&lang_filter=cpp). None of them have a strict reliance of this setting, which should make followup cleanup fairly easy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13285 Reviewed By: pdillinger Differential Revision: D67984325 Pulled By: mszeszko-meta fbshipit-source-id: 12c0b6281c2af6c32261fdf6092856b0566d389e	2025-01-09 13:20:30 -08:00
Maciej Szeszko	d4bd67fb09	Deprecate random access max buffer size references (#13278 ) Summary: This option has been officially deprecated in 5.4.0. We're removing all the references to `random_access_max_buffer_size`, related rules and all the clients wrappers. As a part of this refactoring, we're also getting rid of the `options-1-false` (and consequently its' `multiple-conds-all-false` corresponding rule), as condition would not make much sense anymore without the bounding RA max buffer size limit. Motivated by ongoing tech debt reduction effort. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13278 Test Plan: Validated that internal users do not rely on this long-gone option in their workflows. Reviewed By: jaykorean Differential Revision: D67909674 Pulled By: mszeszko-meta fbshipit-source-id: 8f4b59a4a92b0b32b8b91b71ac318aafc17f1da2	2025-01-08 09:59:18 -08:00
Alan Paxton	9b1d0c02e9	Add [set]DailyOffpeakTimeUTC option to Java API (#13148 ) Summary: Reflect RocksDB DailyOffpeakTimeUTC option in Java API. As is standard for options, there are a number of different places where this option needs to be added: it is an option, a DB option, and it is mutable (can be changed while running). The new option is a string value. This requires an extension to the internal MutableDBOptions parse code, which received the entire options string from C++ and parses it on the Java side. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13148 Reviewed By: cbi42 Differential Revision: D67870402 Pulled By: jaykorean fbshipit-source-id: 975af69773206da936d230cbadb5f69a002d92a3	2025-01-07 09:39:01 -08:00
Changyu Bi	d5345a8ff7	Introduce a transaction option to skip memtable write during commit (#13144 ) Summary: add a new transaction option `TransactionOptions::commit_bypass_memtable` that will ingest the transaction into a DB as an immutable memtables, skipping memtable writes during transaction commit. This helps to reduce the blocking time of committing a large transaction, which is mostly spent on memtable writes. The ingestion is done by creating WBWIMemTable using transaction's underlying WBWI, and ingest it as the latest immutable memtable. The feature will be experimental. Major changes are: 1. write path change to ingest the transaction, mostly in WriteImpl() and IngestWBWI() in db_impl_write.cc. 2. WBWI changes to track some per CF stats like entry count and overwritten single deletion count, and track which keys have overwritten single deletions (see 3.). Per CF stat is used to precompute the number of entries in each WBWIMemTable. 3. WBWIMemTable Iterator changes to emit overwritten single deletions. The motivation is explained in the comment above class WBWIMemTable definition. The rest of the changes in WBWIMemTable are moving the iterator definition around. Some intended follow ups: 1. support for merge operations 2. stats/logging around this option 3. tests improvement, including stress test support for the more comprehensive no_batched_op_stress. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13144 Test Plan: * added new unit tests * enabled in multi_ops_txns_stress test * Benchmark: applying the change in 8222c0cafc4c6eb3a0d05807f7014b44998acb7a, I tested txn size of 10k and check perf context for write_memtable_time, write_wal_time and key_lock_wait_time(repurposed for transaction unlock time). Though the benchmark result number can be flaky, this shows memtable write time improved a lot (more than 100 times). The benchmark also shows that the remaining commit latency is from transaction unlock. ``` ./db_bench --benchmarks=fillrandom --seed=1727376962 --threads=1 --disable_auto_compactions=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=100000 --batch_size=10000 --transaction_db=1 --perf_level=4 --enable_pipelined_write=false --commit_bypass_memtable=1 commit_bypass_memtable = false fillrandom : 3.982 micros/op 251119 ops/sec 0.398 seconds 100000 operations; 27.8 MB/s PERF_CONTEXT: write_memtable_time = 116950422 write_wal_time = 8535565 txn unlock time = 32979883 commit_bypass_memtable = true fillrandom : 2.627 micros/op 380559 ops/sec 0.263 seconds 100000 operations; 42.1 MB/s PERF_CONTEXT: write_memtable_time = 740784 write_wal_time = 11993119 txn unlock time = 21735685 ``` Reviewed By: jowlyzhang Differential Revision: D66307632 Pulled By: cbi42 fbshipit-source-id: 6619af58c4c537aed1f76c4a7e869fb3f5098999	2024-12-05 15:00:17 -08:00
Peter Dillinger	a1a102ffce	Steps toward deprecating implicit prefix seek, related fixes (#13026 ) Summary: With some new use cases onboarding to prefix extractors/seek/filters, one of the risks is existing iterator code, e.g. for maintenance tasks, being unintentionally subject to prefix seek semantics. This is a longstanding known design flaw with prefix seek, and `prefix_same_as_start` and `auto_prefix_mode` were steps in the direction of making that obsolete. However, we can't just immediately set `total_order_seek` to true by default, because that would impact so much code instantly. Here we add a new DB option, `prefix_seek_opt_in_only` that basically allows users to transition to the future behavior when they are ready. When set to true, all iterators will be treated as if `total_order_seek=true` and then the only ways to get prefix seek semantics are with `prefix_same_as_start` or `auto_prefix_mode`. Related fixes / changes: * Make sure that `prefix_same_as_start` and `auto_prefix_mode` are compatible with (or override) `total_order_seek` (depending on your interpretation). * Fix a bug in which a new iterator after dynamically changing the prefix extractor might mix different prefix semantics between memtable and SSTs. Both should use the latest extractor semantics, which means iterators ignoring memtable prefix filters with an old extractor. And that means passing the latest prefix extractor to new memtable iterators that might use prefix seek. (Without the fix, the test added for this fails in many ways.) Suggested follow-up: * Investigate a FIXME where a MergeIteratorBuilder is created in db_impl.cc. No unit test detects a change in value that should impact correctness. * Make memtable prefix bloom compatible with `auto_prefix_mode`, which might require involving the memtablereps because we don't know at iterator creation time (only seek time) whether an auto_prefix_mode seek will be a prefix seek. * Add `prefix_same_as_start` testing to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/13026 Test Plan: tests updated, added. Add combination of `total_order_seek=true` and `auto_prefix_mode=true` to stress test. Ran `make blackbox_crash_test` for a long while. Manually ran tests with `prefix_seek_opt_in_only=true` as default, looking for unexpected issues. I inspected most of the results and migrated many tests to be ready for such a change (but not all). Reviewed By: ltamasi Differential Revision: D63147378 Pulled By: pdillinger fbshipit-source-id: 1f4477b730683d43b4be7e933338583702d3c25e	2024-09-20 15:54:19 -07:00
Levi Tamasi	54ace7f340	Change the semantics of blob_garbage_collection_force_threshold to provide better control over space amp (#13022 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13022 Currently, `blob_garbage_collection_force_threshold` applies to the oldest batch of blob files, which is typically only a small subset of the blob files currently eligible for garbage collection. This can result in a form of head-of-line blocking: no GC-triggered compactions will be scheduled if the oldest batch does not currently exceed the threshold, even if a lot of higher-numbered blob files do. This can in turn lead to high space amplification that exceeds the soft bound implicit in the force threshold (e.g. 50% would suggest a space amp of <2 and 75% would imply a space amp of <4). The patch changes the semantics of this configuration threshold to apply to the entire set of blob files that are eligible for garbage collection based on `blob_garbage_collection_age_cutoff`. This provides more intuitive semantics for the option and can provide a better write amp/space amp trade-off. (Note that GC-triggered compactions still pick the same SST files as before, so triggered GC still targets the oldest the blob files.) Reviewed By: jowlyzhang Differential Revision: D62977860 fbshipit-source-id: a999f31fe9cdda313de513f0e7a6fc707424d4a3	2024-09-19 15:47:13 -07:00
Peter Dillinger	98c33cb8e3	Steps toward making IDENTITY file obsolete (#13019 ) Summary: * Set write_dbid_to_manifest=true by default * Add new option write_identity_file (default true) that allows us to opt-in to future behavior without identity file * Refactor related DB open code to minimize code duplication _Recommend hiding whitespace changes for review_ Intended follow-up: add support to ldb for reading and even replacing the DB identity in the manifest. Could be a variant of `update_manifest` command or based on it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13019 Test Plan: unit tests and stress test updated for new functionality Reviewed By: anand1976 Differential Revision: D62898229 Pulled By: pdillinger fbshipit-source-id: c08b25cf790610b034e51a9de0dc78b921abbcf0	2024-09-19 14:05:21 -07:00
anand76	c21fe1a47f	Add ticker stats for read corruption retries (#12923 ) Summary: Add a couple of ticker stats for corruption retry count and successful retries. This PR also eliminates an extra read attempt when there's a checksum mismatch in a block read from the prefetch buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12923 Test Plan: Update existing tests Reviewed By: jowlyzhang Differential Revision: D61024687 Pulled By: anand1976 fbshipit-source-id: 3a08403580ab244000e0d480b7ee0f5a03d76b06	2024-08-12 15:32:07 -07:00
Zixuan Tan	a97a1f3247	Fix incorrect refillPeriodMicros unit in the document (#12832 ) Summary: The default value for `refillPeriodMicros` is `100 * 1000`, which means 100ms (or 100,000us). The document comments say 100,000ms (equivalent to 100 seconds), which is incorrect and misleading. This PR fixes this typo. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12832 Reviewed By: cbi42 Differential Revision: D59492336 Pulled By: ajkr fbshipit-source-id: c2f55a8b996fe078a1510fcbebaea92ec0075929	2024-07-08 18:08:53 -07:00
Richard Barnes	a8dd15ad41	Fix deprecated dynamic exception in internal_repo_rocksdb/repo/java/rocksjni/kv_helper.h +1 Summary: LLVM has detected a violation of `-Wdeprecated-dynamic-exception-spec`. Dynamic exceptions were removed in C++17. This diff fixes the deprecated instance(s). See [Dynamic exception specification](https://en.cppreference.com/w/cpp/language/except_spec) and [noexcept specifier](https://en.cppreference.com/w/cpp/language/noexcept_spec). Reviewed By: palmje Differential Revision: D58528375 fbshipit-source-id: 130fecd3aa556e4cdb955feea53c442bd9fbc864	2024-06-13 12:41:13 -07:00
Davide Angelocola	cee32c5cce	use nullptr instead of NULL / 0 in rocksdbjni (#12575 ) Summary: While I was trying to understand issue https://github.com/facebook/rocksdb/issues/12503, I found this minor problem. Please have a look adamretter rhubner Pull Request resolved: https://github.com/facebook/rocksdb/pull/12575 Reviewed By: ajkr Differential Revision: D57596055 Pulled By: cbi42 fbshipit-source-id: ee0860bdfbee9364cd30c23957b72a04da6acd45	2024-05-21 12:56:07 -07:00
Peter Dillinger	45c105104b	Set optimize_filters_for_memory by default (#12377 ) Summary: This feature has been around for a couple of years and users haven't reported any problems with it. Not quite related: fixed a technical ODR violation in public header for info_log_level in case DEBUG build status changes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12377 Test Plan: unit tests updated, already in crash test. Some unit tests are expecting specific behaviors of optimize_filters_for_memory=false and we now need to bake that in. Reviewed By: jowlyzhang Differential Revision: D54129517 Pulled By: pdillinger fbshipit-source-id: a64b614840eadd18b892624187b3e122bab6719c	2024-04-30 08:33:31 -07:00
Andrew Kryczka	b4c520cadc	change default `CompactionOptions::compression` while deprecating it (#12587 ) Summary: I had a TODO to complete `CompactionOptions`'s compression API but never did it: `d610e14f93/db/compaction/compaction_picker.cc (L371-L373)` Without solving that TODO, the API remains incomplete and unsafe. Now, however, I don't think it's worthwhile to complete it. I think we should instead delete the API entirely. This PR deprecates it in preparation for deletion in a future major release. The `ColumnFamilyOptions` settings for compression should be good enough for `CompactFiles()` since they are apparently good enough for every other compaction, including `CompactRange()`. In the meantime, I also changed the default `CompressionType`. Having callers of `CompactFiles()` use Snappy compression by default does not make sense when the default could be to simply use the same compression type that is used for every other compaction. As a bonus, this change makes the default `CompressionType` consistent with the `CompressionOptions` that will be used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12587 Reviewed By: hx235 Differential Revision: D56619273 Pulled By: ajkr fbshipit-source-id: 1477de49f14b06c72d6f0045616a8ce91d97e66e	2024-04-26 13:03:21 -07:00
Radek Hubner	a8035ebc0b	Fix exception on RocksDB.getColumnFamilyMetaData() (#12474 ) Summary: https://github.com/facebook/rocksdb/issues/12466 reported a bug when `RocksDB.getColumnFamilyMetaData()` is called on an existing database(With files stored on disk). As neilramaswamy mentioned, this was caused by https://github.com/facebook/rocksdb/issues/11770 where the signature of `SstFileMetaData` constructor was changed, but JNI code wasn't updated. This PR fix JNI code, and also properly populate `fileChecksum` on `SstFileMetaData`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12474 Reviewed By: jowlyzhang Differential Revision: D55811808 Pulled By: ajkr fbshipit-source-id: 2ab156f41eaf4a4f30c49e6df421b61e8451230e	2024-04-05 13:55:18 -07:00
Alexey Vinogradov	c4df598b8e	Implement `PerfContex#toString` for the Java API (#12473 ) Summary: I've implemented `PerfContext#toString` for the Java API. See: https://groups.google.com/g/rocksdb/c/qbY_gNhbyAg Pull Request resolved: https://github.com/facebook/rocksdb/pull/12473 Reviewed By: jowlyzhang Differential Revision: D55660871 Pulled By: cbi42 fbshipit-source-id: f0528fba31ac06e16495e4f49b0bafe0dbc1bc61	2024-04-03 14:33:31 -07:00
Radek Hubner	db9eb10b5b	Enable all Java test via CMake (#12446 ) Summary: This PR follows the work done in https://github.com/facebook/rocksdb/issues/11756 and enable all Java test to be run via CMake/Ctest. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12446 Reviewed By: jowlyzhang Differential Revision: D55661635 Pulled By: cbi42 fbshipit-source-id: 3ea49a121a3ba72089632ff43ee7fe4419b08a96	2024-04-03 11:03:11 -07:00
Peter Dillinger	b515a5db3f	Replace ScopedArenaIterator with ScopedArenaPtr<InternalIterator> (#12470 ) Summary: ScopedArenaIterator is not an iterator. It is a pointer wrapper. And we don't need a custom implemented pointer wrapper when std::unique_ptr can be instantiated with what we want. So this adds ScopedArenaPtr<T> to replace those uses. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12470 Test Plan: CI (including ASAN/UBSAN) Reviewed By: jowlyzhang Differential Revision: D55254362 Pulled By: pdillinger fbshipit-source-id: cc96a0b9840df99aa807f417725e120802c0ae18	2024-03-22 13:40:42 -07:00
anand76	98d8a85624	New PerfContext counters for block cache bytes read (#12459 ) Summary: Add PerfContext counters for measuring the cumulative size of blocks found in the block cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12459 Reviewed By: ajkr Differential Revision: D55170694 Pulled By: anand1976 fbshipit-source-id: 8cbba76eece116cefce7f00e2fc9d74757661d25	2024-03-21 10:46:46 -07:00
anand76	4868c10b44	Retry block reads on checksum mismatch (#12427 ) Summary: On file systems that support storage level data checksum and reconstruction, retry SST block reads for point lookups, scans, and flush and compaction if there's a checksum mismatch on the initial read. A file system can indicate its support by setting the `FSSupportedOps::kVerifyAndReconstructRead` bit in `SupportedOps`. Tests: Add new unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/12427 Reviewed By: ajkr Differential Revision: D55025941 Pulled By: anand1976 fbshipit-source-id: dbd990cb75e03f756c8a66d42956f645c0b6d55e	2024-03-18 16:16:05 -07:00
Yu Zhang	f2546b6623	Support returning write unix time in iterator property (#12428 ) Summary: This PR adds support to return data's approximate unix write time in the iterator property API. The general implementation is: 1) If the entry comes from a SST file, the sequence number to time mapping recorded in that file's table properties will be used to deduce the entry's write time from its sequence number. If no such recording is available, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown except if the entry's sequence number is zero, in which case, 0 is returned. This also means that even if `preclude_last_level_data_seconds` and `preserve_internal_time_seconds` can be toggled off between DB reopens, as long as the SST file's table property has the mapping available, the entry's write time can be deduced and returned. 2) If the entry comes from memtable, we will use the DB's sequence number to write time mapping to do similar things. A copy of the DB's seqno to write time mapping is kept in SuperVersion to allow iterators to have lock free access. This also means a new `SuperVersion` is installed each time DB's seqno to time mapping updates, which is originally proposed by Peter in https://github.com/facebook/rocksdb/issues/11928 . Similarly, if the feature is not enabled, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown. Needed follow up: 1) The write time for `kTypeValuePreferredSeqno` should be special cased, where it's already specified by the user, so we can directly return it. 2) Flush job can be updated to use DB's seqno to time mapping copy in the SuperVersion. 3) Handle the case when `TimedPut` is called with a write time that is `std::numeric_limits<uint64_t>::max()`. We can make it a regular `Put`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12428 Test Plan: Added unit test Reviewed By: pdillinger Differential Revision: D54967067 Pulled By: jowlyzhang fbshipit-source-id: c795b1b7ec142e09e53f2ed3461cf719833cb37a	2024-03-15 15:37:37 -07:00
Alan Paxton	d9a441113e	JNI get_helper code sharing / multiGet() use efficient batch C++ support (#12344 ) Summary: Implement RAII-based helpers for JNIGet() and multiGet() Replace JNI C++ helpers `rocksdb_get_helper, rocksdb_get_helper_direct`, `multi_get_helper`, `multi_get_helper_direct`, `multi_get_helper_release_keys`, `txn_get_helper`, and `txn_multi_get_helper`. The model is to entirely do away with a single helper, instead a number of utility methods allow each separate JNI `Get()` and `MultiGet()` method to organise their parameters efficiently, then call the underlying C++ `db->Get()`, `db->MultiGet()`, `txn->Get()`, or `txn->MultiGet()` method itself, and use further utilities to retrieve results. Roughly speaking: * get keys into C++ form * Call C++ Get() * get results and status into Java form We achieve a useful performance gain as part of this work; by using the updated C++ multiGet we immediately pick up its performance gains (batch improvements to multiGet C++ were previously implemented, but not until now used by Java/JNI). multiGetBB already uses the batched C++ multiGet(), and all other benchmarks show consistent improvement after the changes: ## Before: ``` Benchmark (columnFamilyTestType) (keyCount) (keySize) (multiGetSize) (valueSize) Mode Cnt Score Error Units MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 256 thrpt 25 5315.459 ± 20.465 ops/s MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 1024 thrpt 25 5673.115 ± 78.299 ops/s MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 4096 thrpt 25 2616.860 ± 46.994 ops/s MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 16384 thrpt 25 1700.058 ± 24.034 ops/s MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 65536 thrpt 25 791.171 ± 13.955 ops/s MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 256 thrpt 25 6129.929 ± 94.200 ops/s MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 1024 thrpt 25 7012.405 ± 97.886 ops/s MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 4096 thrpt 25 2799.014 ± 39.352 ops/s MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 16384 thrpt 25 1417.205 ± 22.272 ops/s MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 65536 thrpt 25 655.594 ± 13.050 ops/s MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 256 thrpt 25 6147.247 ± 82.711 ops/s MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 1024 thrpt 25 7004.213 ± 79.251 ops/s MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 4096 thrpt 25 2715.154 ± 110.017 ops/s MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 16384 thrpt 25 1408.070 ± 31.714 ops/s MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 65536 thrpt 25 623.829 ± 57.374 ops/s MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 256 thrpt 25 6119.243 ± 116.313 ops/s MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 1024 thrpt 25 6931.873 ± 128.094 ops/s MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 4096 thrpt 25 2678.253 ± 39.113 ops/s MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 16384 thrpt 25 1337.384 ± 19.500 ops/s MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 65536 thrpt 25 625.596 ± 14.525 ops/s ``` ## After: ``` Benchmark (columnFamilyTestType) (keyCount) (keySize) (multiGetSize) (valueSize) Mode Cnt Score Error Units MultiGetBenchmarks.multiGetBB200 no_column_family 10000 1024 100 256 thrpt 25 5191.074 ± 78.250 ops/s MultiGetBenchmarks.multiGetBB200 no_column_family 10000 1024 100 1024 thrpt 25 5378.692 ± 260.682 ops/s MultiGetBenchmarks.multiGetBB200 no_column_family 10000 1024 100 4096 thrpt 25 2590.183 ± 34.844 ops/s MultiGetBenchmarks.multiGetBB200 no_column_family 10000 1024 100 16384 thrpt 25 1634.793 ± 34.022 ops/s MultiGetBenchmarks.multiGetBB200 no_column_family 10000 1024 100 65536 thrpt 25 786.455 ± 8.462 ops/s MultiGetBenchmarks.multiGetBB200 1_column_family 10000 1024 100 256 thrpt 25 5285.055 ± 11.676 ops/s MultiGetBenchmarks.multiGetBB200 1_column_family 10000 1024 100 1024 thrpt 25 5586.758 ± 213.008 ops/s MultiGetBenchmarks.multiGetBB200 1_column_family 10000 1024 100 4096 thrpt 25 2527.172 ± 17.106 ops/s MultiGetBenchmarks.multiGetBB200 1_column_family 10000 1024 100 16384 thrpt 25 1819.547 ± 12.958 ops/s MultiGetBenchmarks.multiGetBB200 1_column_family 10000 1024 100 65536 thrpt 25 803.861 ± 9.963 ops/s MultiGetBenchmarks.multiGetBB200 20_column_families 10000 1024 100 256 thrpt 25 5253.793 ± 28.020 ops/s MultiGetBenchmarks.multiGetBB200 20_column_families 10000 1024 100 1024 thrpt 25 5705.591 ± 20.556 ops/s MultiGetBenchmarks.multiGetBB200 20_column_families 10000 1024 100 4096 thrpt 25 2523.377 ± 15.415 ops/s MultiGetBenchmarks.multiGetBB200 20_column_families 10000 1024 100 16384 thrpt 25 1815.344 ± 11.309 ops/s MultiGetBenchmarks.multiGetBB200 20_column_families 10000 1024 100 65536 thrpt 25 820.792 ± 3.192 ops/s MultiGetBenchmarks.multiGetBB200 100_column_families 10000 1024 100 256 thrpt 25 5262.184 ± 20.477 ops/s MultiGetBenchmarks.multiGetBB200 100_column_families 10000 1024 100 1024 thrpt 25 5706.959 ± 23.123 ops/s MultiGetBenchmarks.multiGetBB200 100_column_families 10000 1024 100 4096 thrpt 25 2520.362 ± 9.170 ops/s MultiGetBenchmarks.multiGetBB200 100_column_families 10000 1024 100 16384 thrpt 25 1789.185 ± 14.239 ops/s MultiGetBenchmarks.multiGetBB200 100_column_families 10000 1024 100 65536 thrpt 25 818.401 ± 12.132 ops/s MultiGetBenchmarks.multiGetList10 no_column_family 10000 1024 100 256 thrpt 25 6978.310 ± 14.084 ops/s MultiGetBenchmarks.multiGetList10 no_column_family 10000 1024 100 1024 thrpt 25 7664.242 ± 22.304 ops/s MultiGetBenchmarks.multiGetList10 no_column_family 10000 1024 100 4096 thrpt 25 2881.778 ± 81.054 ops/s MultiGetBenchmarks.multiGetList10 no_column_family 10000 1024 100 16384 thrpt 25 1599.826 ± 7.190 ops/s MultiGetBenchmarks.multiGetList10 no_column_family 10000 1024 100 65536 thrpt 25 737.520 ± 6.809 ops/s MultiGetBenchmarks.multiGetList10 1_column_family 10000 1024 100 256 thrpt 25 6974.376 ± 10.716 ops/s MultiGetBenchmarks.multiGetList10 1_column_family 10000 1024 100 1024 thrpt 25 7637.440 ± 45.877 ops/s MultiGetBenchmarks.multiGetList10 1_column_family 10000 1024 100 4096 thrpt 25 2820.472 ± 42.231 ops/s MultiGetBenchmarks.multiGetList10 1_column_family 10000 1024 100 16384 thrpt 25 1716.663 ± 8.527 ops/s MultiGetBenchmarks.multiGetList10 1_column_family 10000 1024 100 65536 thrpt 25 755.848 ± 7.514 ops/s MultiGetBenchmarks.multiGetList10 20_column_families 10000 1024 100 256 thrpt 25 6943.651 ± 20.040 ops/s MultiGetBenchmarks.multiGetList10 20_column_families 10000 1024 100 1024 thrpt 25 7679.415 ± 9.114 ops/s MultiGetBenchmarks.multiGetList10 20_column_families 10000 1024 100 4096 thrpt 25 2844.564 ± 13.388 ops/s MultiGetBenchmarks.multiGetList10 20_column_families 10000 1024 100 16384 thrpt 25 1729.545 ± 5.983 ops/s MultiGetBenchmarks.multiGetList10 20_column_families 10000 1024 100 65536 thrpt 25 783.218 ± 1.530 ops/s MultiGetBenchmarks.multiGetList10 100_column_families 10000 1024 100 256 thrpt 25 6944.276 ± 29.995 ops/s MultiGetBenchmarks.multiGetList10 100_column_families 10000 1024 100 1024 thrpt 25 7670.301 ± 8.986 ops/s MultiGetBenchmarks.multiGetList10 100_column_families 10000 1024 100 4096 thrpt 25 2839.828 ± 12.421 ops/s MultiGetBenchmarks.multiGetList10 100_column_families 10000 1024 100 16384 thrpt 25 1730.005 ± 9.209 ops/s MultiGetBenchmarks.multiGetList10 100_column_families 10000 1024 100 65536 thrpt 25 787.096 ± 1.977 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 256 thrpt 25 6896.944 ± 21.530 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 1024 thrpt 25 7622.407 ± 12.824 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 4096 thrpt 25 2927.538 ± 19.792 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 16384 thrpt 25 1598.041 ± 4.312 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 65536 thrpt 25 744.564 ± 9.236 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 1_column_family 10000 1024 100 256 thrpt 25 6853.760 ± 78.041 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 1_column_family 10000 1024 100 1024 thrpt 25 7360.917 ± 355.365 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 1_column_family 10000 1024 100 4096 thrpt 25 2848.774 ± 13.409 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 1_column_family 10000 1024 100 16384 thrpt 25 1727.688 ± 3.329 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 1_column_family 10000 1024 100 65536 thrpt 25 776.088 ± 7.517 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 20_column_families 10000 1024 100 256 thrpt 25 6910.339 ± 14.366 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 20_column_families 10000 1024 100 1024 thrpt 25 7633.660 ± 10.830 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 20_column_families 10000 1024 100 4096 thrpt 25 2787.799 ± 81.775 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 20_column_families 10000 1024 100 16384 thrpt 25 1726.517 ± 6.830 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 20_column_families 10000 1024 100 65536 thrpt 25 787.597 ± 3.362 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 100_column_families 10000 1024 100 256 thrpt 25 6922.445 ± 10.493 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 100_column_families 10000 1024 100 1024 thrpt 25 7604.710 ± 48.043 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 100_column_families 10000 1024 100 4096 thrpt 25 2848.788 ± 15.783 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 100_column_families 10000 1024 100 16384 thrpt 25 1730.837 ± 6.497 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 100_column_families 10000 1024 100 65536 thrpt 25 794.557 ± 1.869 ops/s MultiGetBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 256 thrpt 25 6918.716 ± 15.766 ops/s MultiGetBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 1024 thrpt 25 7626.692 ± 9.394 ops/s MultiGetBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 4096 thrpt 25 2871.382 ± 72.155 ops/s MultiGetBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 16384 thrpt 25 1598.786 ± 4.819 ops/s MultiGetBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 65536 thrpt 25 748.469 ± 7.234 ops/s MultiGetBenchmarks.multiGetListRandomCF30 1_column_family 10000 1024 100 256 thrpt 25 6922.666 ± 17.131 ops/s MultiGetBenchmarks.multiGetListRandomCF30 1_column_family 10000 1024 100 1024 thrpt 25 7623.890 ± 8.805 ops/s MultiGetBenchmarks.multiGetListRandomCF30 1_column_family 10000 1024 100 4096 thrpt 25 2850.698 ± 18.004 ops/s MultiGetBenchmarks.multiGetListRandomCF30 1_column_family 10000 1024 100 16384 thrpt 25 1727.623 ± 4.868 ops/s MultiGetBenchmarks.multiGetListRandomCF30 1_column_family 10000 1024 100 65536 thrpt 25 774.534 ± 10.025 ops/s MultiGetBenchmarks.multiGetListRandomCF30 20_column_families 10000 1024 100 256 thrpt 25 5486.251 ± 13.582 ops/s MultiGetBenchmarks.multiGetListRandomCF30 20_column_families 10000 1024 100 1024 thrpt 25 4920.656 ± 44.557 ops/s MultiGetBenchmarks.multiGetListRandomCF30 20_column_families 10000 1024 100 4096 thrpt 25 3922.913 ± 25.686 ops/s MultiGetBenchmarks.multiGetListRandomCF30 20_column_families 10000 1024 100 16384 thrpt 25 2873.106 ± 4.336 ops/s MultiGetBenchmarks.multiGetListRandomCF30 20_column_families 10000 1024 100 65536 thrpt 25 802.404 ± 8.967 ops/s MultiGetBenchmarks.multiGetListRandomCF30 100_column_families 10000 1024 100 256 thrpt 25 4817.996 ± 18.042 ops/s MultiGetBenchmarks.multiGetListRandomCF30 100_column_families 10000 1024 100 1024 thrpt 25 4243.922 ± 13.929 ops/s MultiGetBenchmarks.multiGetListRandomCF30 100_column_families 10000 1024 100 4096 thrpt 25 3175.998 ± 7.773 ops/s MultiGetBenchmarks.multiGetListRandomCF30 100_column_families 10000 1024 100 16384 thrpt 25 2321.990 ± 12.501 ops/s MultiGetBenchmarks.multiGetListRandomCF30 100_column_families 10000 1024 100 65536 thrpt 25 1753.028 ± 7.130 ops/s ``` Closes https://github.com/facebook/rocksdb/issues/11518 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12344 Reviewed By: cbi42 Differential Revision: D54809714 Pulled By: pdillinger fbshipit-source-id: bee3b949720abac073bce043b59ce976a11e99eb	2024-03-12 12:42:08 -07:00
Alan Paxton	c4d37da826	Java API - Fix handling of CF handles in DB subclasses (#12417 ) Summary: The most general `open()` method for each of RocksDB, TtlDB, OptimisticTransactionDB and TransactionDB should - ensure the default CF is supplied in the list of descriptors - cache the default CF handle - store open CF handles for automatic close on DB close The `close()` method in each of these DB subclasses should `close()` all the owned CF handles. I can’t find a cleaner way to build some generalised open/close that does this for all DB subclasses, so it exists as cut and paste with variations in the 4 different DB subclasses. Added some slightly paranoid testing that CF handles explicitly referred to as default in a list of CF handles in the general open methods, and the simple open that doesn’t supply a CF, end up reading and writing to the same CF. Prompted by the fact that this code is a bit opaque; the first returned handle is the DB. As part of this, fix the bug where the Java side of `OptimisticsTransactionDB` was not setting up default column family; this was visible when setting up an iterator; add a test to validate that the iterator is OK. A single Java reference to the default column family was not being created in the OptimisticsTransactionDB RocksDB subclass; it should be created in all subclasses. The same problem had previously been fixed for TtlDB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12417 Reviewed By: ajkr Differential Revision: D54807643 Pulled By: pdillinger fbshipit-source-id: 66f34e56a822a009a8f2018d401cf8940d91aa35	2024-03-12 10:33:27 -07:00
Radek Hubner	583fded565	Fix regression for Javadoc jar build (#12404 ) Summary: https://github.com/facebook/rocksdb/issues/12371 Introduced regression not defining dependency between `create_javadoc` and `rocksdb_javadocs_jar` build targets. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12404 Reviewed By: pdillinger Differential Revision: D54516862 Pulled By: ajkr fbshipit-source-id: 785a99b2caf979395ae0de60e40e7d1b93059adb	2024-03-06 10:33:17 -08:00
Adam Retter	5458eda5f0	Pass build parallelism flag to Docker builds (#12392 ) Summary: Passed the `-j` flag through to builds happening inside Docker containers. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12392 Reviewed By: cbi42 Differential Revision: D54311937 Pulled By: ajkr fbshipit-source-id: 5cf1bfe4b9059cc2d078fb5331812f32cf9e89ab	2024-02-28 12:51:00 -08:00
Adam Retter	99cc36be9b	Correct CMake Javadoc and source jar builds (#12371 ) Summary: Fix some issues introduced in https://github.com/facebook/rocksdb/pull/12199 (CC rhubner) 1. Previous `jar -v -c -f` was not valid command syntax. 2. Javadoc and source Jar files were prefixed `rocksdb-`, now corrected to `rocksdbjni-` pdillinger This needs to be merged to `main` and also `8.11.fb` (to fix the Windows build for the RocksJava release of 8.11.2) please. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12371 Reviewed By: pdillinger, jowlyzhang Differential Revision: D54136834 Pulled By: hx235 fbshipit-source-id: f356f2401042af359ada607e5f0be627418ccd6c	2024-02-27 15:46:12 -08:00
Yu Zhang	2940acac00	Persist table options use_delta_encoding in options file (#11987 ) Summary: This option is used for encoding keys in block based table files. It has been having a default true value since its introduction. Users may not notice this option is not persisted in options file unless they are explicitly setting it to false. If the users expect `Iterator::GetProperty("rocksdb.iterator.is-key-pinned")` to return 1 when setting `ReadOptions.pin_data = true`, they should have noticed loading options file won't work and have work around for this by always explicitly set this option to false for opening DB. This change won't impact those users except that now they can remove their work around. If the users are not relying on key pinning behavior at all and as a result didn't notice the option is not persisted, this change shouldn't have any visible behavior impact either. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11987 Reviewed By: hx235 Differential Revision: D54093238 Pulled By: jowlyzhang fbshipit-source-id: 256a3348c44cf91349034d1f6e242c437b32b9a5	2024-02-23 14:13:28 -08:00
leedonggyu	ca99a8f153	Add function to check if the RocksDB instance is closed or not (#11337 ) Summary: In RocksDb jni threre is no method to know if the instance is closed or not. so when using a closed instance it makes jvm crash. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11337 Reviewed By: jaykorean Differential Revision: D53941387 Pulled By: ajkr fbshipit-source-id: e3e4e6fe48409fa70a312810e467ec0c4ce356ef	2024-02-20 11:36:28 -08:00
anand76	d227276147	Deprecate some variants of Get and MultiGet (#12327 ) Summary: A lot of variants of Get and MultiGet have been added to `include/rocksdb/db.h` over the years. Try to consolidate them by marking variants that don't return timestamps as deprecated. The underlying DB implementation will check and return Status::NotSupported() if it doesn't support returning timestamps and the caller asks for it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12327 Reviewed By: pdillinger Differential Revision: D53828151 Pulled By: anand1976 fbshipit-source-id: e0b5ca42d32daa2739d5f439a729815a2d4ff050	2024-02-16 09:21:06 -08:00
anand76	28c1c15c29	Sync tickers and histograms across C++ and Java (#12355 ) Summary: The RocksDB ticker and histogram statistics were out of sync between the C++ and Java code, with a number of newer stats missing in TickerType.java and HistogramType.java. Also, there were gaps in numbering in portal.h, which could soon become an issue due to the number of tickers and the fact that we're limited to 1 byte in Java. This PR adds the missing stats, and re-numbers all of them. It also moves some stats around to try to group related stats together. Since this will go into a major release, compatibility shouldn't be an issue. This should be automated at some point, since the current process is somewhat error prone. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12355 Reviewed By: jaykorean Differential Revision: D53825324 Pulled By: anand1976 fbshipit-source-id: 298c180872f4b9f1ee54b8bb22f4e280458e7e09	2024-02-15 17:22:03 -08:00
Peter Dillinger	bfd00bba9c	Use format_version=6 by default (#12352 ) Summary: It's in production for a large storage service, and it was initially released 6 months ago (8.6.0). IMHO that's enough room for "easy downgrade" to most any user's previously integrated version, even if they only update a few times a year. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12352 Test Plan: tests updated, including format capatibility test table_test: ApproximateOffsetOfCompressed is affected because adding index block to metaindex adds about 13 bytes to SST files in format_version 6. This test has historically been problematic and one reason is that, apparently, not only could it pass/fail depending on snappy compression version, but also how long your host name is, because of db_host_id. I've cleared that out for the test, which takes care of format_version=6 and hopefully improves long-term reliability. Suggested follow-up: FinishImpl in table_test.cc takes a table_options that is ignored in some cases and might not match the ioptions.table_factory configuration unless the caller is very careful. This should be cleaned up somehow. Reviewed By: anand1976 Differential Revision: D53786884 Pulled By: pdillinger fbshipit-source-id: 1964cbd40d3ab0a821fdc01c458031df716fcf51	2024-02-15 11:23:48 -08:00
Yu Zhang	4bea83aa44	Remove the force mode for EnableFileDeletions API (#12337 ) Summary: There is no strong reason for user to need this mode while on the other hand, its behavior is destructive. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12337 Reviewed By: hx235 Differential Revision: D53630393 Pulled By: jowlyzhang fbshipit-source-id: ce94b537258102cd98f89aa4090025663664dd78	2024-02-13 18:36:25 -08:00
马越	45668a05f5	add unit test for compactRangeWithNullBoundaries java api (#12333 ) Summary: The purpose of this PR is to supplement a set of unit tests for https://github.com/facebook/rocksdb/pull/12328 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12333 Reviewed By: ltamasi Differential Revision: D53553830 Pulled By: cbi42 fbshipit-source-id: d21490f7ce7b30f42807ee37eda455ca6abdd072	2024-02-13 10:48:31 -08:00
Hui Xiao	1a885fe730	Remove deprecated Options::access_hint_on_compaction_start (#11654 ) Summary: Context: `Options::access_hint_on_compaction_start ` is marked deprecated and now ready to be removed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11654 Test Plan: Multiple db_stress runs with pre-PR and post-PR binary randomly to ensure forward/backward compatibility on options `36a5686ec0` `python3 tools/db_crashtest.py --simple blackbox --interval=30` Reviewed By: cbi42 Differential Revision: D47892459 Pulled By: hx235 fbshipit-source-id: a62f46a0377fe143be7638e218978d5431c15c56	2024-02-05 13:35:19 -08:00

1 2 3 4 5 ...

1005 commits