Summary:
After seeing more people hit issues with thrashing small LRUCache shards and AutoHCC running fully in production for a while on a very large service, here I make these updates:
* In the public API, mark the case of `estimated_entry_charge = 0` (which is how you select AutoHCC) as production-ready and generally preferred. That means devoting a lot less space to how to tune FixedHCC (`estimated_entry_charge > 0`) because it is not generally recommended anymore even though in theory it is the fastest (conditional on a fragile configuration).
* In the public API, add more detail about potential problems with LRUCache and explicitly endorse HCC.
* When a default block cache is created, use AutoHCC instead of LRUCache. It's still a 32MB cache but that's just one cache shard for AutoHCC so the risk of issues with small cache shards is dramatically reduced. And a single AutoHCC shard is still essentially wait-free.
* Improve the handling of the hypothetical scenario of a failed anonymous mmap. This is hardly a concern for 64-bit Linux and likely most other OSes. It would in theory be possible to fall back on LRUCache in that case but the code structure makes that annoying/challenging. Instead we crash with an appropriate message.
* Cleaned up some includes
* Fixed some previously unreported leaks (better assertions on HCC perhaps, some subtle behavior changes)
* Added a new mode to cache_bench (detailed below)
* Avoid a particularly costly sanity check in `~AutoHyperClockTable()` even in debug builds so that unit testing, etc., isn't bogged down, except keep it in ASAN build.
Planned follow-up:
* Update HCC implementation to use my new "bit field atomics" API introduced in https://github.com/facebook/rocksdb/issues/13910 to make it easier to read and maintain
Possible follow-up:
* Re-engineer table cache to use AutoHCC also, instead of LRUCache and a single mutex to ensure no duplication across threads. (a) Pad table cache key to 128 bits for AutoHCC. (b) Stripe/shard the no-duplication mutex. (HCC's consistency model is too weak for concurrent threads to use its API to agree on a winner, even if entries could be inserted in an "open in progress" state.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13964
Test Plan:
existing tests. ClockCacheTest.ClockEvictionEffortCapTest caught a regression during my development, and the crash test has a history of finding subtle HCC bugs.
## Performance
Although we've validated AutoHCC performance under high load, etc., before we haven't really considered whether there will be unacceptable overheads for small DBs and CFs, e.g. in unit tests. For this, I have added a new mode to cache_bench: with the -stress_cache_instances=n parameter, it will create and destroy n empty cache instances several times. In the debug build, this found that a particular check in `~AutoHyperClockTable()` was extremely costly for short-lived caches (fixed). Beyond that, we can answer the question of whether it is feasible for a single process to host 1000 DBs each with 1000 CFs with default block cache instances, after moving LRUCache -> AutoHCC, for example:
```
/usr/bin/time ./cache_bench -stress_
cache_instances=1000000 -cache_type=auto_hyper_clock_cache -cache_size=33554432
```
Release build:
Average 9.8 us per 32MB LRUCache creation, 2.9 us per destruction, 24.6GB max RSS (~25KB each)
->
Average 4.3 us per 32MB AutoHCC creation, 4.9 us per destruction, 4.8GB max RSS (~5KB each)
Debug build:
Average 10.9 us per 32MB LRUCache creation, 3.5 us per destruction, 28.7GB max RSS (~29KB each)
->
Average 4.5 us per 32MB AutoHCC creation, 4.9 us per destruction, 4.7GB max RSS (~5KB each)
Despite the anonymous mmaps, it's apparently more efficient for default/small/empty structures. This is likely due to the dramatically low number of cache shards at this size. If we switch to `-stress_cache_instances=10000 -cache_size=1073741824`:
Release build:
Average 10.6 us per 1GB LRUCache, 2.8 us per destruction, 2.3 GB max RSS (~230KB each)
->
Average 130 us per 1GB AutoHCC creation, 153 us per destruction, 1.5 GB max RSS (~150KB each)
Debug build:
Average 11.2 us per 1GB LRUCache, 3.6 us per destruction, 2.4 GB max RSS (~240KB each)
->
Average 130 us per 1GB AutoHCC creation, 150 us per destruction, 1.6 GB max RSS (~160KB each)
Here it's clear that we are paying a price in time for setting up all those mmaps for the good number of cache shards and potential table growth, even though the RSS is well under control. However, I am not concerned about this at all, as it's unlikely to slow down anything notably such as unit tests. Before and after full testsuite runs confirm:
3327.73user 5188.71system 3:38.88elapsed -> 3312.07user 5704.77system 3:41.61elapsed
There is increased kernel time but acceptable. With ASAN+UBSAN:
11618.70user 15671.30system 5:54.68elapsed -> 12595.81user 16159.67system 6:32.77elapsed
Acceptable given that our ASAN+UBSAN builds are not the slowest in CI
Reviewed By: hx235
Differential Revision: D82661067
Pulled By: pdillinger
fbshipit-source-id: ab25c766ca70f2b8664849c2a838b9e1b4e72d3b
Summary:
... and associated statistics, etc. Someone needs it, so here it is.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13927
Test Plan: Updated / extended / added some unit tests
Reviewed By: cbi42
Differential Revision: D81981469
Pulled By: pdillinger
fbshipit-source-id: 52558c08741890b781310906acbc18d9eb479363
Summary:
See `Makefile` for actual changes:
* ZLIB remains the same
* BZIP2 remains the same
* SNAPPY is a minor update
* LZ4 is a significant update with multithreaded/multicore compression https://github.com/lz4/lz4/releases/tag/v1.10.0
* ZSTD is a significant update RocksDB is called out as benefiting in particular from the performance improvements herein https://github.com/facebook/zstd/releases/tag/v1.5.7
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13609
Reviewed By: archang19
Differential Revision: D77877295
Pulled By: mszeszko-meta
fbshipit-source-id: bf9a257e8f68dec3d02743b339aa2df65df4ab2c
Summary:
Addresses https://github.com/facebook/rocksdb/issues/13587.
This PR exposes the optimized implementation of batched reads through a `Transaction` object to Java clients.
The latency improvement of transactional multiget on production workload achieved by switching the implementation is roughly:
```
quantile=0.2: 21%
quantile=0.5: 28%
quantile=0.8: 46%
quantile=1.0: 239%
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13589
Reviewed By: jaykorean
Differential Revision: D74660169
Pulled By: cbi42
fbshipit-source-id: d01780173e0500c96e5e431ff6645008cbf6e8b5
Summary:
Prior to this PR, for FIFO kChangeTemperature compaction was done by iterating and reading thru the input sst and generate the output sst. This was wasteful since for FIFO we could apply the "trivial" move by copying the input sst to the out sst without need decompress/compress and reading thru the input sst content at all. This PR added "allow_trivial_copy_when_change_temperature" to the CompactionOptionsFIFO.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13562
Reviewed By: cbi42
Differential Revision: D73295404
Pulled By: mikechuangmeta
fbshipit-source-id: 02241c7389797730ecd4a3b636837cb5f912b424
Summary:
**Context/Summary:**
This PR adds new stats to measure compaction readahead size for rocksdb managed prefetching (not FS prefetching). It can be used to verify compaction read-ahead is doing what's configured. This PR also excludes compaction readahead stats from user scan readahead stats measured in existing stats so there is a cleaner separating between these two.
Bonus: this PR also included some typo fixing about "io activities"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13520
Test Plan: Modified existing test to verify stats
Reviewed By: archang19
Differential Revision: D72892850
Pulled By: hx235
fbshipit-source-id: 1a73182061baa044c9c9193a2b0fd967ffe75c4a
Summary:
As titled. This option has been marked deprecated since introduction of a better option `max_write_buffer_size_to_maintain` and acts as its fallback since RocksDB 6.5.0 The internal user we know these options were created for migrated to `max_write_buffer_size_to_maintain` for a long time too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13491
Test Plan: existing tests
Reviewed By: cbi42
Differential Revision: D71984601
Pulled By: jowlyzhang
fbshipit-source-id: c264d4809e311f60fdbad817ebfade256db549b6
Summary:
**Context/Summary:**
- This is an attempt to fix our [build-window-vs2022 failure](https://github.com/facebook/rocksdb/actions/runs/14215681026/job/39831770554?fbclid=IwZXh0bgNhZW0CMTAAAR2BQLjp8kC1u1yyvN1_S5qwmrHEZOfzxJdcbj2vq7mvwwq83n1cbkmiBCA_aem_ygYxQA5EUmxh2y4EjMlTfg) below. snappy-1.1.8's cmake_minimum_required being less than 3.5 seems to trigger the complaint. Hopefully downloading the 1.2.2 which is the [first version starting to use higher cmake_minimum_required version](https://github.com/google/snappy/releases/tag/1.2.2) solves the failure.
```
Directory: D:\a\rocksdb\rocksdb\thirdparty\snappy-1.1.8
Mode LastWriteTime Length Name
---- ------------- ------ ----
d---- 4/2/2025 9:02 AM build
CMake Error at CMakeLists.txt:29 (cmake_minimum_required):
Compatibility with CMake < 3.5 has been removed from CMake.
Update the VERSION argument <min> value. Or, use the <min>...<max> syntax
to tell CMake that the project requires at least <min> but has been updated
to work with policies introduced by <max> or earlier.
Or, add -DCMAKE_POLICY_VERSION_MINIMUM=3.5 to try configuring anyway.
```
- The downloaded snappy do not include the content under nested repos Google Test and Google Benchmark. But snappy cmake by default will attempt to build them. Since we don't change snappy, we don't need building such development suit. This PR also disabled snappy cmake's attempt to build them.
- By running above changes, the same build [complained](https://github.com/facebook/rocksdb/actions/runs/14228883966/job/39874927730?pr=13514) about java cmakelists requiring too low cmake_minimum_required as well. So this PR also upgraded its cmake_minimum_required to be 3.11 aligning with its warning message
```
if(${CMAKE_VERSION} VERSION_LESS "3.11.4")
message("Please consider switching to CMake 3.11.4 or newer")
endif()
```
**Test plan**
Monitor build-window-vs2022 for this PR
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13514
Reviewed By: pdillinger
Differential Revision: D72333581
Pulled By: hx235
fbshipit-source-id: 1a9096738d39c8b1d270fe17fbd78c1ea4c4c45e
Summary:
**Context/Summary:**
MaxMemCompactionLevel() developed 10 years ago simply returns the level a memtable flushed to, which has historically been L0 and have no plan to change to something different for future. It is also not used in test or internally.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13503
Test Plan: CI + fake release
Reviewed By: cbi42
Differential Revision: D72066092
Pulled By: hx235
fbshipit-source-id: 5ff5b16a6664ef3efabd3a6fbd8a2d0529b62460
Summary:
based on the option comment, `ignore_range_deletions` was added due to the overhead of range deletions in read path when a DB does not use DeleteRange(). The current implementation should not have a noticeable performance difference in this case.
`experimental::PromoteL0()` can be replaced by doing a manual compaction with proper CompactRangeOptions.
There are some internal use of these option and API so we will remove them later after the usages are updated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13500
Test Plan:
comment change only.
Performance: benchmark the performance difference with `ignore_range_deletions` and without (borrowed flag `universal_incremental` for this purpose), ran at the same time on the same machine.
- random point get:
- ignore_range_deletions=false: 343078 ops/sec
- ignore_range_deletions=true: 340219 ops/sec (0.8% slower)
```
(for I in $(seq 1 1); do TEST_TMPDIR=/dev/shm/t1 /data/users/changyubi/vscode-root/rocksdb/db_bench --benchmarks=fillseq,waitforcompaction,readrandom --write_buffer_size=67108864 --writes=1000000 --num=2000000 --reads=1000000 --seed=1723056275 --universal_incremental=false 2>&1 | grep "readrandom"; done;) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }';
```
- sequential scan:
- ignore_range_deletions=false: 5378104 ops/sec
- ignore_range_deletions=true: 5393809 ops/sec (0.3% faster)
```
(for I in $(seq 1 10); do TEST_TMPDIR=/dev/shm/t1 /data/users/changyubi/vscode-root/rocksdb/db_bench --benchmarks=fillseq,waitforcompaction,readseq[-X10] --write_buffer_size=67108864 --writes=1000000 --num=2000000 --universal_incremental=true --seed=1723056275 2>1 | grep "\[AVG 10 runs\]"; done;) | awk '{ t += $6; c++; print; } END { printf "%.0f\n", 1.0 * t / c }';
```
The difference in ops/sec for the two benchmarks is likely noise.
Reviewed By: hx235
Differential Revision: D72069223
Pulled By: cbi42
fbshipit-source-id: ad82a051aa4682790d2178cd4fb2d1467397fbb5
Summary:
The new API in https://github.com/facebook/rocksdb/issues/13453 is awkward and precarious because of using RangePtr, which encodes optional keys using raw pointers to Slice. We could use `std::optional<Slice>` instead but that is unsatisfyingly a larger object with an inefficient size (typically 17 bytes).
Here I introduce a custom optional Slice type, `OptSlice`, that is the same size as a Slice, and use it in a number of places to clean up code and make some public APIs easier to work with. This includes
* `atomic_replace_range` (not yet released, OK to change)
* `GetAllKeyVersions()` which gets a behavior change because of its unusual handling of empty keys.
* `DeleteFilesInRanges()`
* TODO in follow-up: `CompactRange()`
Most of the diff is associated updates and refactorings. Also
* Move some relevant things out of db.h to keep it as tidy as possible.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13481
Test Plan: tests updated
Reviewed By: hx235
Differential Revision: D71747774
Pulled By: pdillinger
fbshipit-source-id: b4c8519608d119b8bceca9bb0fd778608f62a141
Summary:
Cleanup post https://github.com/facebook/rocksdb/pull/13284.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13322
Test Plan:
1. We did not find any evidence of breakage in internal pre-release integration pipeline runs after renaming the deprecated API in `9.10`.
2. _To the extent possible_, we manually validated partner use cases of file deletion and confirmed deprecated API is no longer in use.
Reviewed By: jaykorean
Differential Revision: D68476852
Pulled By: mszeszko-meta
fbshipit-source-id: fbe1f873e16ae7c60d7706a3c44ecc695ab86a4b
Summary:
FlushReason enum in C++ has members up to 15, but in Java, the mirroring FlushReason only supports reason codes up to 12. This causes exceptions when adding a flush listener.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13246
Reviewed By: pdillinger
Differential Revision: D68241620
Pulled By: jaykorean
fbshipit-source-id: 1e2856dad28dff0cbb1772f5a8ea03cc1e224088
Summary:
We added a removal warning for public `DB::DeleteFile` API ~4 years ago in https://github.com/facebook/rocksdb/pull/7337. This API seems to sit at wrong layer of abstraction, where instead of exposing a clear interface to delete specific range of keys, callers rely on their own discovery / interpretation of where their data / log possibly resides 'as-of-now'. For example, in case of data, the physical location of the keys might very well change after user obtained their mapping from key(s) to specific SST file. This will lead to `InvalidArgument` response, which if repeated, would put a user in a race condition spinning wheel - the behavior that's inefficient, fairly indeterministic and therefore one that should be strongly discouraged. We're employing a graceful approach to prefixing the public API with `DEPRECATED_` first for better discoverability and ease of self service for product teams should they still use that legacy API. If everything goes smoothly, we intend to remove all the deprecated API references in the next release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13284
Reviewed By: pdillinger
Differential Revision: D67981502
Pulled By: mszeszko-meta
fbshipit-source-id: adc7fe5cf4e2180bcfd21878b8f78f3fb6ead355
Summary:
This option has been officially deprecated in 5.4.0. We're removing all the references to `random_access_max_buffer_size`, related rules and all the clients wrappers. As a part of this refactoring, we're also getting rid of the `options-1-false` (and consequently its' `multiple-conds-all-false` corresponding rule), as condition would not make much sense anymore without the bounding RA max buffer size limit. Motivated by ongoing tech debt reduction effort.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13278
Test Plan: Validated that internal users do not rely on this long-gone option in their workflows.
Reviewed By: jaykorean
Differential Revision: D67909674
Pulled By: mszeszko-meta
fbshipit-source-id: 8f4b59a4a92b0b32b8b91b71ac318aafc17f1da2
Summary:
Reflect RocksDB DailyOffpeakTimeUTC option in Java API. As is standard for options, there are a number of different places where this option needs to be added: it is an option, a DB option, and it is mutable (can be changed while running).
The new option is a string value. This requires an extension to the internal MutableDBOptions parse code, which received the entire options string from C++ and parses it on the Java side.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13148
Reviewed By: cbi42
Differential Revision: D67870402
Pulled By: jaykorean
fbshipit-source-id: 975af69773206da936d230cbadb5f69a002d92a3
Summary:
add a new transaction option `TransactionOptions::commit_bypass_memtable` that will ingest the transaction into a DB as an immutable memtables, skipping memtable writes during transaction commit. This helps to reduce the blocking time of committing a large transaction, which is mostly spent on memtable writes. The ingestion is done by creating WBWIMemTable using transaction's underlying WBWI, and ingest it as the latest immutable memtable. The feature will be experimental.
Major changes are:
1. write path change to ingest the transaction, mostly in WriteImpl() and IngestWBWI() in db_impl_write.cc.
2. WBWI changes to track some per CF stats like entry count and overwritten single deletion count, and track which keys have overwritten single deletions (see 3.). Per CF stat is used to precompute the number of entries in each WBWIMemTable.
3. WBWIMemTable Iterator changes to emit overwritten single deletions. The motivation is explained in the comment above class WBWIMemTable definition. The rest of the changes in WBWIMemTable are moving the iterator definition around.
Some intended follow ups:
1. support for merge operations
2. stats/logging around this option
3. tests improvement, including stress test support for the more comprehensive no_batched_op_stress.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13144
Test Plan:
* added new unit tests
* enabled in multi_ops_txns_stress test
* Benchmark: applying the change in 8222c0cafc4c6eb3a0d05807f7014b44998acb7a, I tested txn size of 10k and check perf context for write_memtable_time, write_wal_time and key_lock_wait_time(repurposed for transaction unlock time). Though the benchmark result number can be flaky, this shows memtable write time improved a lot (more than 100 times). The benchmark also shows that the remaining commit latency is from transaction unlock.
```
./db_bench --benchmarks=fillrandom --seed=1727376962 --threads=1 --disable_auto_compactions=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=100000 --batch_size=10000 --transaction_db=1 --perf_level=4 --enable_pipelined_write=false --commit_bypass_memtable=1
commit_bypass_memtable = false
fillrandom : 3.982 micros/op 251119 ops/sec 0.398 seconds 100000 operations; 27.8 MB/s PERF_CONTEXT:
write_memtable_time = 116950422
write_wal_time = 8535565
txn unlock time = 32979883
commit_bypass_memtable = true
fillrandom : 2.627 micros/op 380559 ops/sec 0.263 seconds 100000 operations; 42.1 MB/s PERF_CONTEXT:
write_memtable_time = 740784
write_wal_time = 11993119
txn unlock time = 21735685
```
Reviewed By: jowlyzhang
Differential Revision: D66307632
Pulled By: cbi42
fbshipit-source-id: 6619af58c4c537aed1f76c4a7e869fb3f5098999
Summary:
With some new use cases onboarding to prefix extractors/seek/filters, one of the risks is existing iterator code, e.g. for maintenance tasks, being unintentionally subject to prefix seek semantics. This is a longstanding known design flaw with prefix seek, and `prefix_same_as_start` and `auto_prefix_mode` were steps in the direction of making that obsolete. However, we can't just immediately set `total_order_seek` to true by default, because that would impact so much code instantly.
Here we add a new DB option, `prefix_seek_opt_in_only` that basically allows users to transition to the future behavior when they are ready. When set to true, all iterators will be treated as if `total_order_seek=true` and then the only ways to get prefix seek semantics are with `prefix_same_as_start` or `auto_prefix_mode`.
Related fixes / changes:
* Make sure that `prefix_same_as_start` and `auto_prefix_mode` are compatible with (or override) `total_order_seek` (depending on your interpretation).
* Fix a bug in which a new iterator after dynamically changing the prefix extractor might mix different prefix semantics between memtable and SSTs. Both should use the latest extractor semantics, which means iterators ignoring memtable prefix filters with an old extractor. And that means passing the latest prefix extractor to new memtable iterators that might use prefix seek. (Without the fix, the test added for this fails in many ways.)
Suggested follow-up:
* Investigate a FIXME where a MergeIteratorBuilder is created in db_impl.cc. No unit test detects a change in value that should impact correctness.
* Make memtable prefix bloom compatible with `auto_prefix_mode`, which might require involving the memtablereps because we don't know at iterator creation time (only seek time) whether an auto_prefix_mode seek will be a prefix seek.
* Add `prefix_same_as_start` testing to db_stress
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13026
Test Plan:
tests updated, added. Add combination of `total_order_seek=true` and `auto_prefix_mode=true` to stress test. Ran `make blackbox_crash_test` for a long while.
Manually ran tests with `prefix_seek_opt_in_only=true` as default, looking for unexpected issues. I inspected most of the results and migrated many tests to be ready for such a change (but not all).
Reviewed By: ltamasi
Differential Revision: D63147378
Pulled By: pdillinger
fbshipit-source-id: 1f4477b730683d43b4be7e933338583702d3c25e
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13022
Currently, `blob_garbage_collection_force_threshold` applies to the oldest batch of blob files, which is typically only a small subset of the blob files currently eligible for garbage collection. This can result in a form of head-of-line blocking: no GC-triggered compactions will be scheduled if the oldest batch does not currently exceed the threshold, even if a lot of higher-numbered blob files do. This can in turn lead to high space amplification that exceeds the soft bound implicit in the force threshold (e.g. 50% would suggest a space amp of <2 and 75% would imply a space amp of <4). The patch changes the semantics of this configuration threshold to apply to the entire set of blob files that are eligible for garbage collection based on `blob_garbage_collection_age_cutoff`. This provides more intuitive semantics for the option and can provide a better write amp/space amp trade-off. (Note that GC-triggered compactions still pick the same SST files as before, so triggered GC still targets the oldest the blob files.)
Reviewed By: jowlyzhang
Differential Revision: D62977860
fbshipit-source-id: a999f31fe9cdda313de513f0e7a6fc707424d4a3
Summary:
* Set write_dbid_to_manifest=true by default
* Add new option write_identity_file (default true) that allows us to opt-in to future behavior without identity file
* Refactor related DB open code to minimize code duplication
_Recommend hiding whitespace changes for review_
Intended follow-up: add support to ldb for reading and even replacing the DB identity in the manifest. Could be a variant of `update_manifest` command or based on it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13019
Test Plan: unit tests and stress test updated for new functionality
Reviewed By: anand1976
Differential Revision: D62898229
Pulled By: pdillinger
fbshipit-source-id: c08b25cf790610b034e51a9de0dc78b921abbcf0
Summary:
Add a couple of ticker stats for corruption retry count and successful retries. This PR also eliminates an extra read attempt when there's a checksum mismatch in a block read from the prefetch buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12923
Test Plan: Update existing tests
Reviewed By: jowlyzhang
Differential Revision: D61024687
Pulled By: anand1976
fbshipit-source-id: 3a08403580ab244000e0d480b7ee0f5a03d76b06
Summary:
The default value for `refillPeriodMicros` is `100 * 1000`, which means 100ms (or 100,000us).
The document comments say 100,000ms (equivalent to 100 seconds), which is incorrect and misleading. This PR fixes this typo.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12832
Reviewed By: cbi42
Differential Revision: D59492336
Pulled By: ajkr
fbshipit-source-id: c2f55a8b996fe078a1510fcbebaea92ec0075929
Summary:
LLVM has detected a violation of `-Wdeprecated-dynamic-exception-spec`. Dynamic exceptions were removed in C++17. This diff fixes the deprecated instance(s).
See [Dynamic exception specification](https://en.cppreference.com/w/cpp/language/except_spec) and [noexcept specifier](https://en.cppreference.com/w/cpp/language/noexcept_spec).
Reviewed By: palmje
Differential Revision: D58528375
fbshipit-source-id: 130fecd3aa556e4cdb955feea53c442bd9fbc864
Summary:
This feature has been around for a couple of years and users haven't reported any problems with it.
Not quite related: fixed a technical ODR violation in public header for info_log_level in case DEBUG build status changes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12377
Test Plan: unit tests updated, already in crash test. Some unit tests are expecting specific behaviors of optimize_filters_for_memory=false and we now need to bake that in.
Reviewed By: jowlyzhang
Differential Revision: D54129517
Pulled By: pdillinger
fbshipit-source-id: a64b614840eadd18b892624187b3e122bab6719c
Summary:
I had a TODO to complete `CompactionOptions`'s compression API but never did it: d610e14f93/db/compaction/compaction_picker.cc (L371-L373)
Without solving that TODO, the API remains incomplete and unsafe. Now, however, I don't think it's worthwhile to complete it. I think we should instead delete the API entirely. This PR deprecates it in preparation for deletion in a future major release. The `ColumnFamilyOptions` settings for compression should be good enough for `CompactFiles()` since they are apparently good enough for every other compaction, including `CompactRange()`.
In the meantime, I also changed the default `CompressionType`. Having callers of `CompactFiles()` use Snappy compression by default does not make sense when the default could be to simply use the same compression type that is used for every other compaction. As a bonus, this change makes the default `CompressionType` consistent with the `CompressionOptions` that will be used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12587
Reviewed By: hx235
Differential Revision: D56619273
Pulled By: ajkr
fbshipit-source-id: 1477de49f14b06c72d6f0045616a8ce91d97e66e
Summary:
https://github.com/facebook/rocksdb/issues/12466 reported a bug when `RocksDB.getColumnFamilyMetaData()` is called on an existing database(With files stored on disk). As neilramaswamy mentioned, this was caused by https://github.com/facebook/rocksdb/issues/11770 where the signature of `SstFileMetaData` constructor was changed, but JNI code wasn't updated.
This PR fix JNI code, and also properly populate `fileChecksum` on `SstFileMetaData`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12474
Reviewed By: jowlyzhang
Differential Revision: D55811808
Pulled By: ajkr
fbshipit-source-id: 2ab156f41eaf4a4f30c49e6df421b61e8451230e
Summary:
ScopedArenaIterator is not an iterator. It is a pointer wrapper. And we don't need a custom implemented pointer wrapper when std::unique_ptr can be instantiated with what we want.
So this adds ScopedArenaPtr<T> to replace those uses.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12470
Test Plan: CI (including ASAN/UBSAN)
Reviewed By: jowlyzhang
Differential Revision: D55254362
Pulled By: pdillinger
fbshipit-source-id: cc96a0b9840df99aa807f417725e120802c0ae18
Summary:
On file systems that support storage level data checksum and reconstruction, retry SST block reads for point lookups, scans, and flush and compaction if there's a checksum mismatch on the initial read. A file system can indicate its support by setting the `FSSupportedOps::kVerifyAndReconstructRead` bit in `SupportedOps`.
Tests:
Add new unit tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12427
Reviewed By: ajkr
Differential Revision: D55025941
Pulled By: anand1976
fbshipit-source-id: dbd990cb75e03f756c8a66d42956f645c0b6d55e
Summary:
This PR adds support to return data's approximate unix write time in the iterator property API. The general implementation is:
1) If the entry comes from a SST file, the sequence number to time mapping recorded in that file's table properties will be used to deduce the entry's write time from its sequence number. If no such recording is available, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown except if the entry's sequence number is zero, in which case, 0 is returned. This also means that even if `preclude_last_level_data_seconds` and `preserve_internal_time_seconds` can be toggled off between DB reopens, as long as the SST file's table property has the mapping available, the entry's write time can be deduced and returned.
2) If the entry comes from memtable, we will use the DB's sequence number to write time mapping to do similar things. A copy of the DB's seqno to write time mapping is kept in SuperVersion to allow iterators to have lock free access. This also means a new `SuperVersion` is installed each time DB's seqno to time mapping updates, which is originally proposed by Peter in https://github.com/facebook/rocksdb/issues/11928 . Similarly, if the feature is not enabled, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown.
Needed follow up:
1) The write time for `kTypeValuePreferredSeqno` should be special cased, where it's already specified by the user, so we can directly return it.
2) Flush job can be updated to use DB's seqno to time mapping copy in the SuperVersion.
3) Handle the case when `TimedPut` is called with a write time that is `std::numeric_limits<uint64_t>::max()`. We can make it a regular `Put`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12428
Test Plan: Added unit test
Reviewed By: pdillinger
Differential Revision: D54967067
Pulled By: jowlyzhang
fbshipit-source-id: c795b1b7ec142e09e53f2ed3461cf719833cb37a
Summary:
The most general `open()` method for each of RocksDB, TtlDB, OptimisticTransactionDB and TransactionDB should
- ensure the default CF is supplied in the list of descriptors
- cache the default CF handle
- store open CF handles for automatic close on DB close
The `close()` method in each of these DB subclasses should `close()` all the owned CF handles.
I can’t find a cleaner way to build some generalised open/close that does this for all DB subclasses, so it exists as cut and paste with variations in the 4 different DB subclasses.
Added some slightly paranoid testing that CF handles explicitly referred to as default in a list of CF handles in the general open methods, and the simple open that doesn’t supply a CF, end up reading and writing to the same CF. Prompted by the fact that this code is a bit opaque; the first returned handle is the DB.
As part of this, fix the bug where the Java side of `OptimisticsTransactionDB` was not setting up default column family; this was visible when setting up an iterator; add a test to validate that the iterator is OK. A single Java reference to the default column family was not being created in the OptimisticsTransactionDB RocksDB subclass; it should be created in all subclasses. The same problem had previously been fixed for TtlDB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12417
Reviewed By: ajkr
Differential Revision: D54807643
Pulled By: pdillinger
fbshipit-source-id: 66f34e56a822a009a8f2018d401cf8940d91aa35
Summary:
Fix some issues introduced in https://github.com/facebook/rocksdb/pull/12199 (CC rhubner)
1. Previous `jar -v -c -f` was not valid command syntax.
2. Javadoc and source Jar files were prefixed `rocksdb-`, now corrected to `rocksdbjni-`
pdillinger This needs to be merged to `main` and also `8.11.fb` (to fix the Windows build for the RocksJava release of 8.11.2) please.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12371
Reviewed By: pdillinger, jowlyzhang
Differential Revision: D54136834
Pulled By: hx235
fbshipit-source-id: f356f2401042af359ada607e5f0be627418ccd6c
Summary:
This option is used for encoding keys in block based table files. It has been having a default true value since its introduction.
Users may not notice this option is not persisted in options file unless they are explicitly setting it to false. If the users expect `Iterator::GetProperty("rocksdb.iterator.is-key-pinned")` to return 1 when setting `ReadOptions.pin_data = true`, they should have noticed loading options file won't work and have work around for this by always explicitly set this option to false for opening DB. This change won't impact those users except that now they can remove their work around. If the users are not relying on key pinning behavior at all and as a result didn't notice the option is not persisted, this change shouldn't have any visible behavior impact either.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11987
Reviewed By: hx235
Differential Revision: D54093238
Pulled By: jowlyzhang
fbshipit-source-id: 256a3348c44cf91349034d1f6e242c437b32b9a5
Summary:
In RocksDb jni threre is no method to know if the instance is closed or not.
so when using a closed instance it makes jvm crash.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11337
Reviewed By: jaykorean
Differential Revision: D53941387
Pulled By: ajkr
fbshipit-source-id: e3e4e6fe48409fa70a312810e467ec0c4ce356ef
Summary:
A lot of variants of Get and MultiGet have been added to `include/rocksdb/db.h` over the years. Try to consolidate them by marking variants that don't return timestamps as deprecated. The underlying DB implementation will check and return Status::NotSupported() if it doesn't support returning timestamps and the caller asks for it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12327
Reviewed By: pdillinger
Differential Revision: D53828151
Pulled By: anand1976
fbshipit-source-id: e0b5ca42d32daa2739d5f439a729815a2d4ff050
Summary:
The RocksDB ticker and histogram statistics were out of sync between the C++ and Java code, with a number of newer stats missing in TickerType.java and HistogramType.java. Also, there were gaps in numbering in portal.h, which could soon become an issue due to the number of tickers and the fact that we're limited to 1 byte in Java. This PR adds the missing stats, and re-numbers all of them. It also moves some stats around to try to group related stats together. Since this will go into a major release, compatibility shouldn't be an issue.
This should be automated at some point, since the current process is somewhat error prone.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12355
Reviewed By: jaykorean
Differential Revision: D53825324
Pulled By: anand1976
fbshipit-source-id: 298c180872f4b9f1ee54b8bb22f4e280458e7e09
Summary:
It's in production for a large storage service, and it was initially released 6 months ago (8.6.0). IMHO that's enough room for "easy downgrade" to most any user's previously integrated version, even if they only update a few times a year.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12352
Test Plan:
tests updated, including format capatibility test
table_test: ApproximateOffsetOfCompressed is affected because adding index block to metaindex adds about 13 bytes
to SST files in format_version 6. This test has historically been problematic and one reason is that, apparently, not only
could it pass/fail depending on snappy compression version, but also how long your host name is, because of db_host_id.
I've cleared that out for the test, which takes care of format_version=6 and hopefully improves long-term reliability.
Suggested follow-up: FinishImpl in table_test.cc takes a table_options that is ignored in some cases and might not match
the ioptions.table_factory configuration unless the caller is very careful. This should be cleaned up somehow.
Reviewed By: anand1976
Differential Revision: D53786884
Pulled By: pdillinger
fbshipit-source-id: 1964cbd40d3ab0a821fdc01c458031df716fcf51
Summary:
There is no strong reason for user to need this mode while on the other hand, its behavior is destructive.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12337
Reviewed By: hx235
Differential Revision: D53630393
Pulled By: jowlyzhang
fbshipit-source-id: ce94b537258102cd98f89aa4090025663664dd78