lostatc / acid-store

[UNMAINTAINED] A transactional and deduplicating virtual file system

License: Apache License 2.0

Rust 99.82% Dockerfile 0.10% Shell 0.08%

deduplication encryption acid redis s3 sqlite filesystem sftp rclone fuse

acid-store's People

maikelwever falhumai96 mwatts douglasdwyer whymidnight jasoncolburne crrow simonsan securityworks steven-r corban-dallas

acid-store's Issues

wasm wasi?

OpenMode::Create mode is ignored when opening a local directory

Hi, this is a bug report.

Summary

We want to open a repository if it exists, create it if not, which is the purpose of OpenMode::Create.
However, when doing so, at least when using FileRepo and DirectoryConfig, opening the repo fails the second time because the implementation attempts to re-create the directory, which is not what was requested.

How To Reproduce

Repro case (main.rs):

use std::path::{ PathBuf };
use acid_store::repo::{ file::FileRepo, OpenOptions, OpenMode, };
use acid_store::store::{ DirectoryConfig };

fn main() {
    use std::env;
    let args: Vec<String> = env::args().collect();

    let config = DirectoryConfig{
        path: PathBuf::from(&args[1])
    };

    let _repo : FileRepo = OpenOptions::new()
        .mode(OpenMode::Create) // This seems to be ignored!
        .open(&config)
        .unwrap();

    let stdin = std::io::stdin();
    loop {
        let mut input = String::new();
        stdin.read_line(&mut input).expect("failed to read input");
        if input == "exit\n" {
            break;
        }
    }
}

Compile and run it twice:

cargo run /tmp/myrepo
cargo run /tmp/myrepo

Observed

Failure on any run with the same directory argument after the first call with that directory:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Store(File exists (os error 17))', src/main.rs:17:10

Expected

No error, no attempt at creating the directory when we use OpenMode::Create and successful opening of the repository.

Notes

In open-options.rs we can locate this call which shows that the mode is handled after attempting to access the store, instead of in the store.

let mut store = config.open()?; // HERE

match self.mode {
    OpenMode::Open => self.open_repo(store),
    OpenMode::Create => { // HANDLED HERE

Going into the call to config.open() then leads to this code:

fn open(&self) -> crate::Result<Self::Store> {
    // Create the blocks directory in the data store.
    create_dir_all(&self.path)
        .map_err(|error| crate::Error::Store(anyhow::Error::from(error)))?;
    create_dir(self.path.join(STORE_DIRECTORY))
        .map_err(|error| crate::Error::Store(anyhow::Error::from(error)))?; // FAILS HERE
    create_dir(self.path.join(STAGING_DIRECTORY))
        .map_err(|error| crate::Error::Store(anyhow::Error::from(error)))?;
    create_dir(self.path.join(type_path(BlockType::Data)))
        .map_err(|error| crate::Error::Store(anyhow::Error::from(error)))?;
    create_dir(self.path.join(type_path(BlockType::Lock)))
        .map_err(|error| crate::Error::Store(anyhow::Error::from(error)))?;
    create_dir(self.path.join(type_path(BlockType::Header)))
        .map_err(|error| crate::Error::Store(anyhow::Error::from(error)))?;

We can see that the option is not handled at the right level and directory creation is "forced" systematically.
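A minimal sketch of the "open or create" behaviour we expected for the directory layout, using only the standard library. `ensure_dir` is a hypothetical helper and not part of acid-store; it treats an already existing directory as success instead of propagating the "File exists" error.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: create `dir` if it is missing and succeed
/// silently if it already exists as a directory.
fn ensure_dir(dir: &Path) -> io::Result<()> {
    match fs::create_dir(dir) {
        Ok(()) => Ok(()),
        // A pre-existing directory is not an error for "open or create".
        Err(error) if error.kind() == io::ErrorKind::AlreadyExists && dir.is_dir() => Ok(()),
        // Anything else (permissions, a regular file in the way, ...) is still reported.
        Err(error) => Err(error),
    }
}
```

Calling `ensure_dir` twice on the same path succeeds both times, whereas a second bare `create_dir` returns an error, which matches the failure observed above.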

We did not check whether the same issue appears with other kinds of stores. We intend to check with S3.

Real-world usage of the underlying principle

This project experiences frequent breaking API changes and hasn't seen significant real-world usage.

From what I can see the underlying principle has a real-world usage in https://github.com/rustic-rs/rustic

WRT the Backend trait.

Would be nice if you would have a look and give some feedback (:

Cheers! 🥂

How does this project relate to e.g. LSM implementations?

First of all, this is an amazing project.
I am very impressed with how many different back-ends are supported right now, and how readable most of the source code is! 👍

I have one question about how acid-store works in relation to other database/datastore-like systems, for instance those that use a 'log structured merge tree' implementation: incoming transactions are added to a write-ahead log (for consistency/durability) as well as to an in-memory dictionary (for fast lookups of recent data), and (some or all of) this data is periodically merged into the blocks stored on the permanent back-end.

Is this similar to what acid-store does as well? Or does acid-store always immediately (when calling .commit()) perform an update to the particular blocks in the back-end that require a change?

Put differently: Does acid-store perform any kind of amortization to make the average insert/update faster, or not?
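To make the question concrete, here is a toy sketch of the amortization I mean: writes land in an in-memory table and are only flushed to immutable sorted runs once a threshold is reached. `ToyLsm` and all of its fields are hypothetical names for illustration; the sketch leaves out the write-ahead log and the merging of runs, and it says nothing about how acid-store actually works.

```rust
use std::collections::BTreeMap;

/// Toy LSM-style store (hypothetical, not acid-store's design).
struct ToyLsm {
    memtable: BTreeMap<String, String>,
    runs: Vec<Vec<(String, String)>>, // sorted runs, newest last
    flush_threshold: usize,
}

impl ToyLsm {
    fn new(flush_threshold: usize) -> Self {
        ToyLsm { memtable: BTreeMap::new(), runs: Vec::new(), flush_threshold }
    }

    /// Buffer the write; flush only once the memtable is full.
    fn insert(&mut self, key: &str, value: &str) {
        self.memtable.insert(key.to_string(), value.to_string());
        if self.memtable.len() >= self.flush_threshold {
            self.flush();
        }
    }

    /// Move the memtable into a new sorted run (stands in for a back-end write).
    fn flush(&mut self) {
        if self.memtable.is_empty() {
            return;
        }
        // BTreeMap iterates in key order, so the run is already sorted.
        let run: Vec<(String, String)> = std::mem::take(&mut self.memtable).into_iter().collect();
        self.runs.push(run);
    }

    /// Check the memtable first, then the runs from newest to oldest.
    fn get(&self, key: &str) -> Option<&str> {
        if let Some(value) = self.memtable.get(key) {
            return Some(value.as_str());
        }
        for run in self.runs.iter().rev() {
            if let Ok(index) = run.binary_search_by(|(k, _)| k.as_str().cmp(key)) {
                return Some(run[index].1.as_str());
            }
        }
        None
    }
}
```

With a threshold of N, the back-end sees one batched write per N inserts instead of one write per insert, which is the kind of amortization the question is about.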

[WIP] My experiences with acid-store

First of all, thanks a lot for making this crate! I am very interested in trying it out, and am currently experimenting with it. This issue is a platform for communicating my findings and experiences, in the hope of providing you with yet another use case which might one day be supported.

Motivation

I am building a mining platform for crates.io which consists of 3 stages:

fetch changes
- The first fetch will always be all crate versions thus far, which is ~215000. Then it fetches changes again in intervals, and won't see more than say 100 new crate versions per day.
- Receiving the first batch of 215000 is incredibly fast, as it basically serializes all of them from memory into the database.
- It also stores crate objects which are merely the available versions per crate. There are ~36000 of them.
processing
- This stage runs in intervals and iterates over all crate versions to see if there are unfinished tasks. Tasks are currently downloading the crate and extracting it to gather meta-data. Tasks are stored as small objects that serve as markers. For the two tasks currently available, it would store numCrates times two small task types.
- Each task produces a task result to store the outcome of the computation.
- More tasks are planned.
reporting
- Runs in intervals, traverses all crate versions, and generates one or more HTML files for displaying intelligence obtained from the data we gathered thus far.
- To avoid generating a page needlessly, it marks the presence of a page in the database (even though right now it marks them with symlinks).

For the current implementation, we are talking about 6 * 215000 + 36000 objects, that are read often, and usually written in small batches, excluding the first run when we see all ~215k crate versions at once.
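Spelled out, that estimate comes to roughly 1.3 million objects. The tiny helper below just restates the arithmetic; the function and parameter names are made up for illustration.

```rust
/// Rough object count: a fixed number of objects per crate version,
/// plus one object per crate. Names are hypothetical.
fn estimated_objects(objects_per_version: u64, crate_versions: u64, crates: u64) -> u64 {
    objects_per_version * crate_versions + crates
}
```

For the figures above, `estimated_objects(6, 215_000, 36_000)` gives 1,290,000 + 36,000 = 1,326,000.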

Eventually I would like to run 'criner' on a Raspberry Pi with 512MB of memory for all tasks that don't require running cargo/rustc. Thus I would prefer a DB which trades off speed for lower and predictable memory consumption. Sled is clearly optimized for speed, which is great, but it's something I can't even pay for on my current hardware, as it simply consumes too much memory during migrations and when there are too many objects.

The current database

The database of choice is Sled, but I ran into the following issues that make me seek out an alternative.

Despite sled generally being lightning fast, it costs a lot of memory to support it. Now I run into problems where the memory consumption is disproportional and so high that I feel uncomfortable proceeding with it. Database sizes tend to be large, even though that isn't my primary concern. It's the concern of not being able to interact with the data anymore that one spent days producing. For instance, migrating from one version to another once took 50GB in memory (and it's a wonder my MBPro did not die trying, but completed the herculean task).

Observations

Migration

When transferring all data in Sled to an ObjectRepository with SQLite backend, committing only once after all objects were written, the processing got slower and slower the more objects were uncommitted. I will try committing every 1000 objects to see if the performance slowdown is caused by the commit mechanism, or the underlying database. Edit: Committing every 1000 or 100 objects avoids the slowdown.
- When looking at IO statistics at the end of writing ~210k objects into SQLite, it took about 6GB of writes to produce a 115MB file.

SQLite files are small! This seems great, especially after seeing sled easily take 10GB
When using an ObjectRepository with a DirectoryBackend, the removal of obsolete objects was prohibitively slow. The runtime was dominated by IO calls.
It's surprisingly hard to 'open or create' a database.
I would love to disable encryption and compression with cargo features, they pull in around 100 additional crates to compile.
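For reference, the batched-commit workaround from the migration observation can be sketched roughly as follows. `BatchStore`, `migrate` and `MockStore` are hypothetical stand-ins written against the standard library only; they are not acid-store's API, and the real repository types would have to be plugged in where the trait is.

```rust
/// Hypothetical stand-in for a transactional store; not acid-store's API.
trait BatchStore {
    fn write(&mut self, key: u64, value: &[u8]);
    fn commit(&mut self);
}

/// Write every object, committing after each `batch_size` writes and once
/// more at the end for any remainder. Returns the number of commits made.
fn migrate<S, I>(store: &mut S, objects: I, batch_size: usize) -> usize
where
    S: BatchStore,
    I: IntoIterator<Item = (u64, Vec<u8>)>,
{
    assert!(batch_size > 0, "batch_size must be positive");
    let mut pending = 0;
    let mut commits = 0;
    for (key, value) in objects {
        store.write(key, &value);
        pending += 1;
        if pending == batch_size {
            store.commit();
            commits += 1;
            pending = 0;
        }
    }
    if pending > 0 {
        store.commit();
        commits += 1;
    }
    commits
}

/// In-memory mock used only to exercise `migrate`.
#[derive(Default)]
struct MockStore {
    uncommitted: usize,
    committed: usize,
    max_uncommitted: usize,
}

impl BatchStore for MockStore {
    fn write(&mut self, _key: u64, _value: &[u8]) {
        self.uncommitted += 1;
        self.max_uncommitted = self.max_uncommitted.max(self.uncommitted);
    }

    fn commit(&mut self) {
        self.committed += self.uncommitted;
        self.uncommitted = 0;
    }
}
```

The point of the batching is that the number of uncommitted objects never exceeds `batch_size`, which is what avoided the slowdown in my runs with batch sizes of 1000 and 100.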