bossphorus's Introduction

Bossphorus implementation in Rust

This is a partial reimplementation of the BossDB REST API in Rust.

Why?

bossphorus simplifies data-access patterns for data that do not fit into RAM. When you write a 100-gigabyte file, bossphorus automatically slices your dataset into bite-sized pieces.

When you request small pieces of your data for analysis, bossphorus intelligently serves only the parts you need, leaving the rest on disk.

Feature Parity

See Feature Parity for more information.

Disk Usage

Bossphorus caches cuboids in the uploads folder that's created in the current working directory. Currently, it will cache up to 1000 cuboids in this folder. The least recently used cuboids are removed when the cuboid limit is reached.
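As a rough illustration of the eviction rule described above, here is a minimal sketch of least-recently-used bookkeeping. The `CuboidLru` type and its methods are hypothetical names, not taken from the bossphorus source, and the linear scan is only reasonable because the limit is small (1000 entries).

```rust
use std::collections::VecDeque;

/// Tracks which cuboid keys are cached, with the most recently used key
/// at the back. Hypothetical type for illustration only.
pub struct CuboidLru {
    capacity: usize,
    order: VecDeque<String>,
}

impl CuboidLru {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be at least 1");
        CuboidLru { capacity, order: VecDeque::new() }
    }

    /// Records an access to `key`. Returns the key whose file should be
    /// deleted from disk if adding `key` pushed the cache over its limit.
    pub fn touch(&mut self, key: &str) -> Option<String> {
        if let Some(pos) = self.order.iter().position(|k| k == key) {
            // Already cached: move it to the most-recently-used end.
            let k = self.order.remove(pos).unwrap();
            self.order.push_back(k);
            return None;
        }
        self.order.push_back(key.to_string());
        if self.order.len() > self.capacity {
            // Over the limit: the front entry is the least recently used.
            self.order.pop_front()
        } else {
            None
        }
    }

    /// Number of keys currently tracked.
    pub fn len(&self) -> usize {
        self.order.len()
    }
}
```

A caller would invoke `touch` on every read or write of a cuboid and delete the file named by any returned key.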

Configuration

Environment variables take precedence over the Rocket.toml config file.

Environment Variables

BOSSHOST: Sets the Boss DB host
BOSSTOKEN: Token used for Boss auth

Rocket.toml File

bosshost: Sets the Boss DB host
bosstoken: Token used for Boss auth

Defaults

In the absence of both an environment variable and a value in the Rocket.toml file:

bosshost = "api.bossdb.io"
bosstoken = "public"
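The lookup order described in this section (environment variable, then Rocket.toml, then the default) can be sketched as follows. This is an illustration under assumptions, not the project's actual code: `resolve` and `boss_host` are hypothetical helpers, and `file_value` stands in for whatever was parsed from Rocket.toml.

```rust
use std::env;

/// Picks the first available value: the environment variable, then the
/// value read from Rocket.toml, then the built-in default.
pub fn resolve(env_value: Option<String>, file_value: Option<String>, default: &str) -> String {
    env_value
        .or(file_value)
        .unwrap_or_else(|| default.to_string())
}

/// Resolves the Boss host. `file_value` is a placeholder for the
/// `bosshost` entry parsed from Rocket.toml, if any.
pub fn boss_host(file_value: Option<String>) -> String {
    resolve(env::var("BOSSHOST").ok(), file_value, "api.bossdb.io")
}
```

Keeping `resolve` free of any direct environment access makes the precedence rule easy to check in isolation.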

Development

Blosc must be installed manually via a package manager before building. SQLite is also required, but it is included with macOS by default.

For macOS:

brew install c-blosc

For Debian based Linux distros:

sudo apt-get install libblosc-dev sqlite3

For RPM based Linux distros:

sudo yum install blosc sqlite

Due to use of the Rocket web server crate, the nightly Rust toolchain must be used. You can set this as your project default with:

rustup override set nightly

Releases

You can build an optimized release with:

cargo build --release

The binary will be at target/release/bossphorus.

License

FOSSA Status

bossphorus's People

Contributors

dependabot[bot], fossabot, j6k4m8, movestill

Forkers

fossabot

bossphorus's Issues

[ChunkedBloscFileDataManager] File IO parallelism

Right now, each cuboid on disk is read in series, but we know both the shape as well as the filenames a priori. This could theoretically be parallelized, though it'll require some cleverness with the ndarray API, which I don't believe supports parallel reads/writes by default.
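One possible shape for this, using only the standard library: read each cuboid file on its own thread, collect the raw bytes in input order, and do the copy into the output array serially afterwards so that no parallel access to ndarray is needed. The function name is hypothetical, and a thread per file is a simplification; a bounded pool would likely be preferable for large requests.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;
use std::thread;

/// Reads every file on its own thread and returns the contents in the
/// same order as `paths`. Fails with the first I/O error encountered.
pub fn read_all_parallel(paths: Vec<PathBuf>) -> io::Result<Vec<Vec<u8>>> {
    // Start all reads before waiting on any of them.
    let handles: Vec<thread::JoinHandle<io::Result<Vec<u8>>>> = paths
        .into_iter()
        .map(|p| thread::spawn(move || fs::read(p)))
        .collect();

    // Join in spawn order so the results line up with the input paths.
    handles
        .into_iter()
        .map(|h| h.join().expect("reader thread panicked"))
        .collect()
}
```

Decompression and placement of each cuboid into the requested volume are left out of this sketch.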

Allow user to configure the DataManager stack they want to run

I'm thinking we should make the binary configurable in a file, but with sane defaults. Something like:

config.yml

port: 8090
cache: "LRU"

usage_manager: "console"

data_managers:

    - ChunkedFileDataManager:
        upload_path: "uploads/"
    
    - BossDBRelayDataManager:
        host:        "bossdb.io"
        protocol:    "https"
        token:       "public"

And callable with bossphorus --config config.yml

In particular, I think being able to specify which data managers are used and in which order is something users may want to be able to do at runtime, which is perhaps a bit more complicated than the current env-variable technique allows.
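To make the "which order" part concrete, here is a minimal sketch of a stack whose layers are supplied at runtime as a list. The `Layer` trait, `MemoryLayer`, and `Stack` are hypothetical, simplified stand-ins and do not reflect the real DataManager interface; parsing the config file itself is not shown.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for one layer of the stack.
pub trait Layer {
    fn name(&self) -> &str;
    /// Returns the bytes stored under `key`, or `None` on a miss.
    fn get(&self, key: &str) -> Option<Vec<u8>>;
}

/// In-memory layer used only to keep the sketch self-contained.
pub struct MemoryLayer {
    pub name: String,
    pub data: HashMap<String, Vec<u8>>,
}

impl Layer for MemoryLayer {
    fn name(&self) -> &str {
        &self.name
    }
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.data.get(key).cloned()
    }
}

/// Layers in the order they should be consulted, for example the order
/// in which they are listed in a config file.
pub struct Stack {
    layers: Vec<Box<dyn Layer>>,
}

impl Stack {
    pub fn new(layers: Vec<Box<dyn Layer>>) -> Self {
        Stack { layers }
    }

    /// Asks each layer in turn and returns the first hit, together with
    /// the name of the layer that answered.
    pub fn get(&self, key: &str) -> Option<(String, Vec<u8>)> {
        self.layers
            .iter()
            .find_map(|l| l.get(key).map(|d| (l.name().to_string(), d)))
    }
}
```

Because the stack is just a `Vec` of trait objects, the order and choice of layers can be decided after reading configuration rather than at compile time.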

@movestill thoughts?

Implement a cache cleanup and cache-maintenance strategy

Right now, if you create the following DataManager stack:

ChunkedBloscFileDataManager → BossDBRelayDataManager

...then as cache-misses in the ChunkedBloscFileDataManager are fulfilled by the BossDBRelayDataManager, they're saved to disk and returned to the client.

There currently exists no mechanism by which to clear the files from ChunkedBloscFileDataManager, which means that the cache will grow to infinity (or until your drive is full, whichever is sooner).

There should be a cache cleanup strategy, but I believe it makes sense for there to be several strategies which a user can choose between. Perhaps options like:

  • LRU
  • FIFO
  • Most distant (Euclidean) from LRU (??)

Even if we just implement one of these to start with, might be smart to leave space and an abstraction layer to allow for multiple in the future.
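A sketch of what such an abstraction layer might look like: a small trait that the cache calls into, with FIFO as the first implementation. The trait, its method names, and `keys_to_delete` are all hypothetical. Other strategies from the list above would implement the same trait and differ mainly in how they react to `on_access`.

```rust
use std::collections::VecDeque;

/// Hypothetical interface between the file cache and a cleanup strategy.
pub trait EvictionPolicy {
    /// Called when a new cuboid file is written to the cache.
    fn on_insert(&mut self, key: &str);
    /// Called when a cached cuboid is read.
    fn on_access(&mut self, key: &str);
    /// Chooses the next key to delete, if any remain.
    fn evict(&mut self) -> Option<String>;
}

/// First-in, first-out: reads do not affect the eviction order.
#[derive(Default)]
pub struct Fifo {
    queue: VecDeque<String>,
}

impl EvictionPolicy for Fifo {
    fn on_insert(&mut self, key: &str) {
        self.queue.push_back(key.to_string());
    }
    fn on_access(&mut self, _key: &str) {}
    fn evict(&mut self) -> Option<String> {
        self.queue.pop_front()
    }
}

/// Asks the policy for victims until `cached` entries would fit `limit`.
pub fn keys_to_delete(policy: &mut dyn EvictionPolicy, cached: usize, limit: usize) -> Vec<String> {
    let mut victims = Vec::new();
    while cached - victims.len() > limit {
        match policy.evict() {
            Some(k) => victims.push(k),
            None => break, // policy has nothing left to offer
        }
    }
    victims
}
```

The cache only ever sees `&mut dyn EvictionPolicy`, so the strategy could be selected at startup without touching the cleanup loop.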

Support for annotation u64, and image u16 channels

Right now we support all imagery (u8) channels. I think all that's really required here is Generic-izing the u8 code to take an arbitrary dtype, though there may be a few libraries that don't support other datatypes.

Going to use this Issue to document these conflicts as I encounter them so that we can address them with all the info we need.
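As a starting point for the generic version, here is a minimal sketch of decoding a raw byte buffer into u8, u16, or u64 values through one generic function. The `Voxel` trait and `decode` function are hypothetical names, and little-endian byte order is an assumption made for the example, not something confirmed by the source.

```rust
/// Hypothetical trait for element types a channel can hold.
pub trait Voxel: Sized + Copy {
    /// Width of one element in bytes.
    const SIZE: usize;
    /// Decodes one element from exactly `SIZE` little-endian bytes.
    fn from_le(bytes: &[u8]) -> Self;
}

impl Voxel for u8 {
    const SIZE: usize = 1;
    fn from_le(b: &[u8]) -> Self {
        b[0]
    }
}

impl Voxel for u16 {
    const SIZE: usize = 2;
    fn from_le(b: &[u8]) -> Self {
        u16::from_le_bytes([b[0], b[1]])
    }
}

impl Voxel for u64 {
    const SIZE: usize = 8;
    fn from_le(b: &[u8]) -> Self {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(&b[..8]);
        u64::from_le_bytes(buf)
    }
}

/// Decodes `raw` into a vector of `T`. Returns `None` if the buffer
/// length is not a whole number of elements.
pub fn decode<T: Voxel>(raw: &[u8]) -> Option<Vec<T>> {
    if raw.len() % T::SIZE != 0 {
        return None;
    }
    Some(raw.chunks_exact(T::SIZE).map(T::from_le).collect())
}
```

Code that currently handles `Vec<u8>` could then be written against `Vec<T>` where `T: Voxel`, with library conflicts surfacing as missing trait bounds.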

[ChunkedBloscFileDataManager] Save data to disk in a SEPARATE thread from returning data to the user

Right now we perform the following data flow upon a ChunkedBloscFileDataManager cache miss:

  1. Ask the next layer for data
  2. Save the retrieved data to disk
  3. Return a copy of that retrieved data

In order to improve performance and save roundtrip time on that initial request, we should perform step 2 (saving the data to disk) in parallel with returning it to the user. Or, rather, spawn a routine to save data to disk in parallel (and allow it to finish even once the HTTP request is closed).
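A minimal sketch of that idea with a plain OS thread: the data goes back to the caller immediately while a copy is written to disk in the background. The function name is hypothetical, and the sketch ignores details such as compression and what to do when the write fails.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;
use std::thread::{self, JoinHandle};

/// Starts writing `data` to `path` on a background thread and hands the
/// data straight back to the caller without waiting for the write.
pub fn return_and_save(data: Vec<u8>, path: PathBuf) -> (Vec<u8>, JoinHandle<io::Result<()>>) {
    // The writer thread needs its own copy because the caller keeps `data`.
    let copy = data.clone();
    let handle = thread::spawn(move || fs::write(path, copy));
    (data, handle)
}
```

Dropping the returned `JoinHandle` detaches the thread, so the write can finish after the request handler has returned, as long as the process itself keeps running. Joining the handle is only needed when the caller wants to observe the write result.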
