onflow / atree Goto Github PK
View Code? Open in Web Editor NEWAtree provides scalable arrays and scalable ordered maps.
Home Page: https://onflow.org
License: Apache License 2.0
Atree provides scalable arrays and scalable ordered maps.
Home Page: https://onflow.org
License: Apache License 2.0
Implement benchmarks compatible with benchstat and make it similar to Ramtin's benchmarks used for the spreadsheets. E.g. measure 100 ops rather than 1 op, and group by array size.
Additionally:
Digester
is being created and destroyed in every OMT operation.
Some of the names are from an old design doc, current design doc, or neither. Make naming more consistent and match the latest design doc.
values used for slab flags should be changed to
slab flag (8bits)
- one bit - set if this is a root slab
- one bit - set if it has pointers to other slabs (for fastcopy)
- one bit - set if no limit on the size (external collision group current implementation)
- rest for types (32 types)
- [00000] ArrayData
- [00001] ArrayMeta
- [00010] LargeImutableArray (includes large strings) (not used right now)
- [01000] MapData
- [01001] MapMeta
- [01010] LargeMapEntry (key/value pair)
- [01011] LargeCollisionGroup (External)
- [11111] Storable (general purpose for external objects for now) (edited)
We also need to add method on Slab to check the first few bits with bit masking and exposes
The compiler should optimize this but new code is simpler and easier to read.
Example change from:
midPoint := uint32(math.Ceil(float64(dataSize) / 2))
to:
midPoint := (dataSize + 1) >> 1
Both for security reasons and future migrations we should use dedicated slab flags for
A similar change was made for array, so this change should be done to map.
Merged PR #165 to fix issue #164 disables deltas map reset during Commit
. PR #165 fix was rushed to fix testnet issue related to storageUsed
.
It causes all delta slabs to be encoded and saved again in the next Commit
.
We should only keep slabs with empty address in deltas map after Commit
(only discard saved slabs).
When firstKey
isn't updated, inserts can go to the wrong data slab.
We need to:
firstKey
when empty slab is merged or rebalancedfirstKey
if data slab is the first child@turbolent identified this nice optimization on Friday. It was discussed extensively on Slack today.
Basically, computing type info can be expensive. And type info for some maps and arrays aren't necessary because they aren't save to storage.
By deferring the retrieval of type info until encoding, we can also take advantage of Storage.FastCommit
which uses multiple goroutines for encoding slabs.
Some benchmarks take over 11 minutes. Make them skip running if -short is specified on the command line.
Test array serialization by comparing results of encoding, decoding, and re-encoding again.
There is an intermittent test failure on TestMapHashCollision/deterministic
. Test fails on the number of external slabs containing hash collision.
This bug affects the pointer flag during encoding.
Maybe we can use a significantly faster hash such as XXH128 without sacrificing much security (as long as we handle collisions) because SipHash relies on its key being secret for security.
XXHash v2 is used by many software. XXHash v3 (XXH3) is substantially faster and was finalized in v0.8 of its reference implementation, so its digest output won't change in the future.
The preferred Go implementation of XXH3 and XXH128 is https://github.com/zeebo/xxh3. Coincidentally, our BLAKE3 library in Go is by the same author (zeebo). zeebo's BLAKE3 library is used by Jean-Philippe Aumasson (the co-creator of BLAKE/BLAKE2/BLAKE3 and SipHash) in multi-party-sig (a Taurus Group SA project).
The API of XXH3 and XXH128 by zeebo doesn't support seeds/keys but we can workaround that by hashing the seed as a prefix in front of the data. The overhead of doing this is offset by XXH3 and XXH128 being much faster than SipHash.
I'm not proposing we implement a bunch of alternatives, but I thought it would be useful to document some MAC/Hash combinations that are possible using OMT.
This is required for Cadence integration.
This bug was discovered by @turbolent during integration with Cadence.
When go 1.17 is released, add it to ci.yml.
CI currently uses go 1.16 (other versions were removed to speed up CI).
Test map serialization by comparing results of encoding, decoding, and re-encoding again.
The rebalancing algorithm can be optimized for speed and to reduce number of slabs accessed/loaded.
This can happen when new element triggers the slab to split or merge, causing the root slab size to overflow/underflow without adjustment.
We need to check for such a scenario in Array.Set() to adjust the root slab properly.
Add nested array and nested map in stress test for array and map so that stress test supports arbitrary nested structures.
Supporting this may require implementing deep remove for array and map (currently done in Cadence).
This is meant to run for hours/days using randomized data and operations in random sequences.
It can be useful to serialize version information in case the layout changes in the future.
Consider adding 1 or 2 bytes to hold layout version info.
Lookup speed can be faster.
<picard> โ๏ธ Make it so. </picard>
During Commit
, delta maps are reset so all slabs with empty Address
are wiped out.
EDIT: I was too hasty labeling this as "Bug" instead of "Enhancement". This issue arose because Commit
was called in the middle of a transaction by another project (logic error). My apologies for the confusion. I relabeled this as "Enhancement".
Cadence's SomeStorable
has nested Storable
and we need a way to expose it.
Add Storable.ChildStorables()
to return []Storable
. This will also require changes to Cadence integration.
We can potentially skip SipHash for integer key types to gain speed.
Maybe this optimization can be evaluated after OMT is feature complete and we have more test coverage (and fuzzing) in place.
Thanks @ramtinms for mentioning this possible optimization!
A new validation check found that in some edge cases, the prev
sibling ids can get out of place.
The bug happens when slab A splits and we need to update slab A's old "right" sibling slab B, so slab B's prev
can point to the newly split slab.
Fixing this may require loading an extra data slab. However, prev
doesn't appear to be used in SAT and OMT. It only appears to be encoded/decoded.
So we can just remove prev
since it's unused, which would give us back 16 bytes in each data slab for both SAT and OMT.
Add -seed flag to override time.Now().UnixNano().
Tests can be improved by adding more checks to confirm tree integrity.
For example, element size, count, first key, etc. in slab header should be confirmed to be correct after various operations.
More validation checks, such as encoding size, can be added to validArray()
and validMap()
Slabs implement Storable
interface, which has func StoredValue(SlabStorage) (Value, error)
.
Only root slabs have enough information to recreate Array
/OrderedMap
. Non-root slabs should return error from StoredValue
function.
This affects SAT and OMT.
Copying performance in Cadence can be improved by using batch ops.
Copying performance in Cadence can be improved by using batch append.
SipHash is too slow to use as a level 1 hash.
XXH3 is a very fast alternative but there are some problems with XXH3_64 and XXH3_128.
Current version of github.com/zeebo/xxh3 doesn't support seeds. Adding support for seeds in a way that is compatible with XXH3 and XXH128 requires many lines of changes.
XXH3_64 failed smhasher tests. We'll focus on XXH128 and XXH128low because they both passed rurban/smhasher. UPDATE: XXH128low fails demerphq/smhasher (Strict Avalanche Criterion).
UPDATE: some of these issues may have been fixed upstream and not yet updated in rurban/smhasher.
XXH128 is bijective for 128-bit input -> 128-bit digest but is not bijective for 64-bit input -> 64-bit digest, which can make collisions faster to find.
Example collision on 64-bit input for XXH128low (low 64 bits of XXH128):
data=3f1d4bb70c38aa9f, digest128={54ca4b07dcfa8704 ffddad0ba4c9ace9}
data=ec524de4468a8921, digest128={ce414f5214c1fa36 ffddad0ba4c9ace9}
One workaround for inputs <= 8 bytes is to prefix the data with a 64-bit key. Unfortunately, this approach using unmodified zeebo/xxh3 is slower than SipHash128 and also introduces an extra allocation.
Another solution is to use fxamacker/xxh128p when it becomes available shortly. It's optimized for using a 64-bit prefix key with XXH3_128. This solution doesn't change the algorithm of XXH3_128, so it's easy to verify results with existing XXH3_128 implementations. Although this solution would be faster than SipHash128, it would be slower than unmodified XXH3_128.
The simplest solution appears to be to use fxamacker/circlehash. It passes every test in smhasher (both rurban/smhasher and demerphq/smhasher), support a uint64 seed, produce a 64-bit digest, is faster than cespare/XXH64, and (for some data sizes) is faster than unseeded XXH3.
The biggest advantage of CircleHash64 would be simplicity and ease-of-audit given our release schedule.
Binary search can be faster. Use the same optimization done for SAT, which is to use linear scan if there are insufficient elements to justify the overhead of binary search.
Optimize GetHashInput()
:
HashCode
to GetHashInput
Storable is used incorrectly if:
Value.Equal(Value) bool
can be optimized to provide roughly 2-3% faster speed for OMT Get
operation when measured using 100 ops on 1000 elements. This also improves Set
and Remove
.
Benchmark names are too long, making the results harder to read or compare.
Shorten benchmark names and use standard abbreviations. Correct existing abbreviations if needed.
The current MetaDataSlab layout uses 4 bytes (uint32) to hold child header count.
Remove the child header count (if feasible) or at least use a smaller sized uint such as uint8 or uint16. The current max for child header count is 95 for max slab size 1536 bytes.
For arrays and maps that use data slab as root, next
field is encoded as all 0
s (StorageIDUndefined
). So we can remove this all-zero next
field in encoded data and save 16 bytes for small arrays and maps.
This change only affects encoding and decoding of data slab that is root.
Hashable
interface to HashInputProvider
function root
/ \
meta1a meta1b
/ \ / \
meta2a meta2b meta2c meta2d
/ \ / \ / \ / \
d1 d2 d3 d4 ... dn-1 dn
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.