modelardata / modelardb-rs

ModelarDB: Model-Based Time Series Management from Edge to Cloud

License: Apache License 2.0

Rust 99.93% Dockerfile 0.07%
apache-arrow datafusion industrial-iot rust time-series time-series-database

modelardb-rs's People

Contributors

aabduvakhobov, agneborn98, cgodiksen, chrthomsen, skejserjensen

modelardb-rs's Issues

Example data

I want to quickly get started with ModelarDB but don't know how to prepare the data. The user documentation only talks about putting data in a local folder or in the cloud but doesn't mention what type of data files to put there. Can you give me an example of such a data file? If there are multiple data files in the folder, are they supposed to be data from multiple data sources? Thank you!

Evaluate removing uid by storing tags in segments

A way to remedy #187 may be to store tags in segments, if doing so does not significantly increase the amount of space and bandwidth used, and then remove univariate_id. However, this change needs to be evaluated in-depth first, as the schema for compressed segments would no longer be the same for all model tables once tags are included, and it may increase the amount of storage and bandwidth required. Removing univariate_ids has multiple benefits in addition to fixing #187, e.g., it makes each segment self-contained, which reduces the complexity of data transfer and query processing, and it removes the limit on the number of columns in a model table. The problem of each model table having a different schema for compressed segments could be solved by adding the model table's compressed segment schema to ModelTableMetadata.
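
A minimal sketch of the idea, assuming Apache Arrow schemas are built per model table; the field names and types are illustrative, not ModelarDB's actual compressed segment schema:

    use std::sync::Arc;

    use arrow::datatypes::{DataType, Field, Schema, SchemaRef};

    /// Build a compressed segment schema for one model table by appending that
    /// table's tag columns, making each segment self-contained without univariate_id.
    fn compressed_segment_schema(tag_column_names: &[&str]) -> SchemaRef {
        let mut fields = vec![
            Field::new("model_type_id", DataType::UInt8, false),
            Field::new("start_time", DataType::Int64, false),
            Field::new("end_time", DataType::Int64, false),
            Field::new("values", DataType::Binary, false),
        ];

        // The tag columns are what make the schema different for each model table.
        for tag_column_name in tag_column_names {
            fields.push(Field::new(*tag_column_name, DataType::Utf8, false));
        }

        Arc::new(Schema::new(fields))
    }

The per-table schema produced this way could then be stored in ModelTableMetadata so data transfer and query processing know the exact schema of each model table's segments.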

Transfer data for all columns atomically

To make data transfer appear atomic from a user's perspective, the data transfer component should operate at the table level and transfer the data points for the same time range for all columns when transferring data. Otherwise, data points may disappear on the edge without being available in the cloud because data was only transferred for a subset of the columns in a time range.

Support relative and absolute error bounds

Add a function named maximum_allowed_deviation() as a complement to is_within_error_bound() so all error bound calculations are in one place instead of being implemented by each model type.
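
A minimal sketch of how the two functions could fit together for relative and absolute error bounds; the enum and its variants are assumptions for illustration, not ModelarDB's actual types:

    /// An error bound that is either an absolute deviation or a percentage of the real value.
    #[derive(Clone, Copy)]
    enum ErrorBound {
        /// Maximum absolute difference between the real and approximate value.
        Absolute(f32),
        /// Maximum difference as a percentage of the real value.
        Relative(f32),
    }

    impl ErrorBound {
        /// The single place where the error calculation lives, so model types
        /// do not have to implement it themselves.
        fn maximum_allowed_deviation(&self, real_value: f64) -> f64 {
            match self {
                ErrorBound::Absolute(deviation) => *deviation as f64,
                ErrorBound::Relative(percentage) => real_value.abs() * (*percentage as f64 / 100.0),
            }
        }

        fn is_within_error_bound(&self, real_value: f64, approximate_value: f64) -> bool {
            (real_value - approximate_value).abs() <= self.maximum_allowed_deviation(real_value)
        }
    }

    fn main() {
        let error_bound = ErrorBound::Relative(1.0); // A 1% relative error bound.
        assert!(error_bound.is_within_error_bound(100.0, 100.9));
        assert!(!error_bound.is_within_error_bound(100.0, 102.0));
    }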

Support INSERT without values for generated column

For model tables, INSERT INTO is currently processed by TableProvider::insert_into(), but before Apache DataFusion passes the data to that method, it parses the data and verifies that its schema matches the schema returned by schema(), which takes no arguments. However, the schema returned by schema() also defines which columns can be queried. Thus, to allow generated columns to be queried, data must also be inserted for the generated columns, which is then immediately dropped by insert_into(). This can maybe be fixed by implementing some of the other methods in TableProvider, e.g., TableProvider::get_column_default(); if not, an issue and/or a PR should probably be opened in the Apache DataFusion repository to get some form of support for optional columns added.
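
A minimal sketch of the mismatch, assuming only standard Apache Arrow types: the schema data should ideally match when inserted would drop the generated columns from the schema used for queries, but schema() must return the full query schema and is what DataFusion validates inserts against. The helper and parameter names are hypothetical:

    use std::sync::Arc;

    use arrow::datatypes::{Field, Schema, SchemaRef};

    /// Derive the schema inserted data would ideally match by dropping the
    /// generated columns from the schema used for queries.
    fn insert_schema(query_schema: &SchemaRef, generated_columns: &[&str]) -> SchemaRef {
        let fields: Vec<Field> = query_schema
            .fields()
            .iter()
            .filter(|field| !generated_columns.contains(&field.name().as_str()))
            .map(|field| field.as_ref().clone())
            .collect();

        Arc::new(Schema::new(fields))
    }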

Support reading/writing data using Delta Lake

Storing data in Delta Lake instead of manually managed Apache Parquet files removes the need to implement a lot of complex functionality in the storage engine, such as atomic writes and compaction.
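
A minimal sketch, assuming the deltalake (delta-rs) crate and Apache Arrow types matching the version it re-exports; the table path and column names are examples:

    use std::sync::Arc;

    use arrow::array::{Float32Array, TimestampMillisecondArray};
    use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
    use arrow::record_batch::RecordBatch;
    use deltalake::DeltaOps;

    /// Append a batch of data points to a Delta Lake table; each write is an
    /// atomic commit to the Delta log, so readers never see partial files.
    async fn write_batch() -> Result<(), Box<dyn std::error::Error>> {
        let schema = Arc::new(Schema::new(vec![
            Field::new("timestamp", DataType::Timestamp(TimeUnit::Millisecond, None), false),
            Field::new("value", DataType::Float32, false),
        ]));

        let record_batch = RecordBatch::try_new(
            schema,
            vec![
                Arc::new(TimestampMillisecondArray::from(vec![1_000, 2_000, 3_000])),
                Arc::new(Float32Array::from(vec![37.0, 37.1, 37.2])),
            ],
        )?;

        let _table = DeltaOps::try_from_uri("data/example_table")
            .await?
            .write(vec![record_batch])
            .await?;

        Ok(())
    }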

Add support for automatic optimizing and vacuuming

With data and metadata stored in Delta Lake, many small Apache Parquet files may be created. Thus, the storage engine and metadata manager should automatically merge these files by optimizing each Delta Lake table and then delete the old files using vacuuming, without interfering with query execution.
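
A minimal sketch, assuming the deltalake (delta-rs) crate's OPTIMIZE and VACUUM operations; the retention period is an example value:

    use deltalake::{open_table, DeltaOps};

    /// Merge the many small Apache Parquet files in a Delta Lake table and then
    /// physically delete the files the merged versions replaced.
    async fn optimize_and_vacuum(table_uri: &str) -> Result<(), Box<dyn std::error::Error>> {
        // OPTIMIZE rewrites small files into larger ones; running queries keep
        // working because the old files are only removed from the Delta log.
        let table = open_table(table_uri).await?;
        let (table, _optimize_metrics) = DeltaOps(table).optimize().await?;

        // VACUUM deletes files that are no longer referenced by the Delta log
        // and are older than the retention period.
        let (_table, _vacuum_metrics) = DeltaOps(table)
            .vacuum()
            .with_retention_period(chrono::Duration::days(7))
            .await?;

        Ok(())
    }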

Support reading and writing metadata using delta-rs

  • Create a new table metadata manager.
  • Create a new manager metadata manager.
  • Update the start-up process to no longer require a SQLite database in the server.
  • Update the start-up process to no longer require a PostgreSQL database in the manager.
  • Update the documentation to make it clear that an SQL database is no longer needed (maybe also mention Delta Lake).
  • Update Docker deployment to no longer set up a PostgreSQL database in cluster deployment.
  • Look into optimizing the initial version.

Some univariate_id cannot be written to Delta Lake

Some u64 univariate_ids cause errors when written to Delta Lake, as it looks like they are stored as Int64 instead of UInt64 despite the field being defined as Field::new("univariate_id", DataType::UInt64, false). For example, tag_one and/or tag_two previously used in the integration tests cause the following error to occur when executing the tests: Failed to flush data in compressed data manager due to: Failed to convert into Arrow schema: Cast error: Can't cast value 11069825858223412227 to type Int64.
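
The failing value is larger than i64::MAX, which is consistent with the column being cast to Int64 somewhere along the write path; a minimal check:

    fn main() {
        // The univariate_id from the error message in the integration tests.
        let univariate_id: u64 = 11069825858223412227;

        // Any u64 above i64::MAX cannot be represented as a signed 64-bit
        // integer, so a cast to Int64 must fail for this value.
        assert!(univariate_id > i64::MAX as u64);
        assert!(i64::try_from(univariate_id).is_err());
    }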

Batch normal table inserts in CompressedDataBuffer

When data is inserted into a normal table, it is currently written directly to an Apache Parquet file, so many small files are created. Instead, data for normal tables should be merged into larger batches using CompressedDataBuffers, as is done for model tables. This will reduce the number of small files created by the storage engine and simplify the storage engine by managing the data for normal and model tables in the same way.
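
A minimal sketch of the batching, assuming only Apache Arrow; the struct and field names are illustrative, not the actual CompressedDataBuffer:

    use arrow::compute::concat_batches;
    use arrow::error::ArrowError;
    use arrow::record_batch::RecordBatch;

    /// Buffer the batches inserted into a normal table and only produce one
    /// merged batch for writing once enough bytes have accumulated.
    struct NormalTableBuffer {
        batches: Vec<RecordBatch>,
        buffered_bytes: usize,
        flush_threshold_in_bytes: usize,
    }

    impl NormalTableBuffer {
        fn insert(&mut self, record_batch: RecordBatch) {
            self.buffered_bytes += record_batch.get_array_memory_size();
            self.batches.push(record_batch);
        }

        /// Return one merged batch when the threshold is reached so a single
        /// larger Apache Parquet file can be written instead of many small ones.
        fn maybe_flush(&mut self) -> Result<Option<RecordBatch>, ArrowError> {
            if self.batches.is_empty() || self.buffered_bytes < self.flush_threshold_in_bytes {
                return Ok(None);
            }

            let schema = self.batches[0].schema();
            let merged_batch = concat_batches(&schema, &self.batches)?;

            self.batches.clear();
            self.buffered_bytes = 0;

            Ok(Some(merged_batch))
        }
    }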

Remove compaction from query processing

Write segments directly to non-overlapping Apache Parquet files that are perfectly ordered by univariate_id and start_time without compaction during query processing.
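
A minimal sketch of ordering a batch of segments before it is written, assuming Apache Arrow's sort kernels; the column names follow the issue, the function itself is hypothetical:

    use arrow::compute::{lexsort_to_indices, take, SortColumn};
    use arrow::error::ArrowError;
    use arrow::record_batch::RecordBatch;

    /// Sort a batch of compressed segments by univariate_id and then start_time
    /// so the Apache Parquet file written from it is already perfectly ordered.
    fn sort_segments(batch: &RecordBatch) -> Result<RecordBatch, ArrowError> {
        let sort_columns = vec![
            SortColumn {
                values: batch.column_by_name("univariate_id").unwrap().clone(),
                options: None,
            },
            SortColumn {
                values: batch.column_by_name("start_time").unwrap().clone(),
                options: None,
            },
        ];

        let indices = lexsort_to_indices(&sort_columns, None)?;

        let sorted_columns = batch
            .columns()
            .iter()
            .map(|column| take(column.as_ref(), &indices, None))
            .collect::<Result<Vec<_>, ArrowError>>()?;

        RecordBatch::try_new(batch.schema(), sorted_columns)
    }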

Update missing_puncation.yml for latest ast-grep

The version of ast-grep to use in GitHub Actions was set to version 0.21.4 in #175 as version 0.22.0 breaks missing_puncation.yml. Thus, the rule should be updated if ast-grep was purposely changed; if missing_puncation.yml broke due to a bug in ast-grep, this should probably be reported as an issue.

Support planning multiple queries in parallel

Currently, an exclusive write lock must be taken on the storage engine while planning queries. This should be changed to a read lock so multiple queries can be planned in parallel.
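
A minimal sketch of the proposed locking, assuming a tokio RwLock around the storage engine; StorageEngine is a stand-in for the actual type:

    use std::sync::Arc;

    use tokio::sync::RwLock;

    struct StorageEngine;

    /// Planning only reads the storage engine's state, so a read guard lets
    /// multiple queries be planned in parallel.
    async fn plan_query(storage_engine: Arc<RwLock<StorageEngine>>) {
        let _storage_engine = storage_engine.read().await;
        // Plan the query against the storage engine's state.
    }

    /// Inserting data points still requires exclusive access through a write guard.
    async fn insert_data_points(storage_engine: Arc<RwLock<StorageEngine>>) {
        let _storage_engine = storage_engine.write().await;
        // Mutate the storage engine.
    }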

Run tests on FreeBSD using GitHub Actions

To ensure ModelarDB can be installed on and keeps working as expected on FreeBSD, the tests should also be run on FreeBSD. While GitHub does not natively support any of the BSDs, it should be possible to run ModelarDB's tests on them using the GitHub Actions developed by VM Actions. Depending on the complexity, the tests should maybe also be run on other relevant targets that Rust has Tier 1 and hosted Tier 2 support for, such as NetBSD. If this is done, installation instructions should also be added to the user README.

Delete transferred data without blocking queries

The transferred Apache Parquet files cannot be deleted right after they are transferred if they are currently used by a query. However, new queries should also not use them, since the transferred data would then exist on both the edge and in the cloud from a user's perspective.
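
A minimal sketch of one possible approach, not the actual design: queries keep the files they read alive through reference counting, and a transferred file is only deleted when the last query referencing it finishes, while new queries are planned against a list that no longer contains it:

    use std::path::PathBuf;
    use std::sync::Arc;

    /// A transferred Apache Parquet file that deletes itself when the last
    /// reference to it is dropped, i.e., when no running query uses it anymore.
    struct TransferredFile {
        path: PathBuf,
    }

    impl Drop for TransferredFile {
        fn drop(&mut self) {
            let _ = std::fs::remove_file(&self.path);
        }
    }

    /// Remove the file from the list used to plan new queries; in-flight queries
    /// keep their own Arc clones, so the file is only deleted when they finish.
    fn schedule_deletion(query_files: &mut Vec<Arc<TransferredFile>>, transferred: &Arc<TransferredFile>) {
        query_files.retain(|file| !Arc::ptr_eq(file, transferred));
    }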
