Comments (1)
It depends a lot on your data. Is it general purpose file storage, or log files (streams)?
The general approach should be dedup -> compress -> erasure code. Deduplicating and compressing first means the erasure coding step works on the smallest amount of data.
However, a lot of the gains from deduplication come from running it across multiple TB of data, so if you treat files as separate entities you will of course not get the main benefit.
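To illustrate the difference, here is a minimal sketch that counts how many bytes must be stored when each file gets its own fragment index versus one index shared across all files. The 4-byte fixed-size fragments, the SHA-256 index and the function names are assumptions made for the example; they are not taken from any particular dedup implementation.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// fragmentSize is deliberately tiny so the effect is visible on toy data.
const fragmentSize = 4

// storeFragments splits data into fixed-size fragments, records every
// unseen fragment hash in seen, and returns the number of new bytes stored.
func storeFragments(seen map[[32]byte]bool, data []byte) int {
	stored := 0
	for start := 0; start < len(data); start += fragmentSize {
		end := start + fragmentSize
		if end > len(data) {
			end = len(data)
		}
		sum := sha256.Sum256(data[start:end])
		if !seen[sum] {
			seen[sum] = true
			stored += end - start
		}
	}
	return stored
}

// perFileBytes deduplicates each file against its own private index.
func perFileBytes(files [][]byte) int {
	total := 0
	for _, f := range files {
		total += storeFragments(make(map[[32]byte]bool), f)
	}
	return total
}

// globalBytes deduplicates all files against one shared index.
func globalBytes(files [][]byte) int {
	seen := make(map[[32]byte]bool)
	total := 0
	for _, f := range files {
		total += storeFragments(seen, f)
	}
	return total
}

func main() {
	files := [][]byte{
		[]byte("AAAABBBBCCCC"),
		[]byte("BBBBCCCCDDDD"),
	}
	fmt.Println("per-file:", perFileBytes(files)) // per-file: 24
	fmt.Println("global:", globalBytes(files))    // global: 16
}
```

The two toy files share two fragments, so only the shared index avoids storing them twice. On real data the overlap between files is what a per-file approach leaves on the table.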
ZPAQ includes deduplication and does it as follows:
- Step 1, deduplicate into fragments of 4-64KB size.
- Step 2, collect fragments into a block until you have 16MB of data.
- Step 3, compress each block.
A file entry then contains information about which blocks and fragments are used to reconstruct each file. The 16MB block size is the maximum penalty for retrieving a single fragment, since the block that contains it has to be decompressed.
For a datacenter-type job, I would look into the possibility of doing deduplication globally. The 'dedup' currently doesn't offer "DIY" splitting, but that could quite easily be added.
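The layout described above can be sketched roughly as follows. This is not ZPAQ's actual format or code: fixed 8-byte fragments stand in for the 4-64KB fragments, a 32-byte block stands in for the 16MB block, `compress/flate` stands in for ZPAQ's compressor, and the `archive` and `fragmentRef` types are hypothetical names invented for the example.

```go
package main

import (
	"bytes"
	"compress/flate"
	"crypto/sha256"
	"fmt"
	"io"
)

const (
	fragmentSize = 8  // stand-in for the 4-64KB fragments
	blockSize    = 32 // stand-in for the 16MB block
)

// fragmentRef locates one fragment inside an uncompressed block.
type fragmentRef struct{ block, offset, length int }

type archive struct {
	index   map[[32]byte]fragmentRef // fragment hash -> location
	blocks  [][]byte                 // sealed, compressed blocks
	pending []byte                   // block currently being filled
	files   map[string][]fragmentRef // file entry: fragments in order
}

func newArchive() *archive {
	return &archive{
		index: make(map[[32]byte]fragmentRef),
		files: make(map[string][]fragmentRef),
	}
}

// seal compresses the pending block (step 3) and starts a new one.
func (a *archive) seal() error {
	if len(a.pending) == 0 {
		return nil
	}
	var buf bytes.Buffer
	w, err := flate.NewWriter(&buf, flate.BestCompression)
	if err != nil {
		return err
	}
	if _, err := w.Write(a.pending); err != nil {
		return err
	}
	if err := w.Close(); err != nil {
		return err
	}
	a.blocks = append(a.blocks, buf.Bytes())
	a.pending = nil
	return nil
}

// add deduplicates data into fragments (step 1) and collects the new
// ones into the pending block (step 2).
func (a *archive) add(name string, data []byte) error {
	var refs []fragmentRef
	for start := 0; start < len(data); start += fragmentSize {
		end := start + fragmentSize
		if end > len(data) {
			end = len(data)
		}
		frag := data[start:end]
		sum := sha256.Sum256(frag)
		ref, known := a.index[sum]
		if !known {
			if len(a.pending)+len(frag) > blockSize {
				if err := a.seal(); err != nil {
					return err
				}
			}
			ref = fragmentRef{len(a.blocks), len(a.pending), len(frag)}
			a.pending = append(a.pending, frag...)
			a.index[sum] = ref
		}
		refs = append(refs, ref)
	}
	a.files[name] = refs
	return nil
}

// restore rebuilds a file, decompressing each referenced block once.
func (a *archive) restore(name string) ([]byte, error) {
	if err := a.seal(); err != nil {
		return nil, err
	}
	refs, ok := a.files[name]
	if !ok {
		return nil, fmt.Errorf("unknown file %q", name)
	}
	cache := make(map[int][]byte)
	var out []byte
	for _, ref := range refs {
		raw, ok := cache[ref.block]
		if !ok {
			r := flate.NewReader(bytes.NewReader(a.blocks[ref.block]))
			var err error
			raw, err = io.ReadAll(r)
			r.Close()
			if err != nil {
				return nil, err
			}
			cache[ref.block] = raw
		}
		out = append(out, raw[ref.offset:ref.offset+ref.length]...)
	}
	return out, nil
}

func main() {
	ar := newArchive()
	_ = ar.add("a.txt", []byte("0123456789abcdef0123456789abcdef"))
	_ = ar.add("b.txt", []byte("89abcdefXXXXXXXX"))
	got, err := ar.restore("b.txt")
	fmt.Println(string(got), err, "blocks:", len(ar.blocks))
}
```

Note that `restore` has to decompress a whole block even when it needs only one fragment from it, which is the read penalty that grows with the block size.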
If you would like to discuss things in more detail, you are very welcome to write a mail with your business case. I would be happy to help out!
from reedsolomon.