modelardata / modelardb-rs

ModelarDB: Model-Based Time Series Management from Edge to Cloud

License: Apache License 2.0

Rust 99.93% Dockerfile 0.07%
apache-arrow datafusion industrial-iot rust time-series time-series-database

modelardb-rs's People

Contributors

aabduvakhobov, agneborn98, cgodiksen, chrthomsen, skejserjensen

modelardb-rs's Issues

Example data

I want to quickly get started with ModelarDB but don't know how to prepare the data. The user documentation only talks about putting data in a local folder or in the cloud but doesn't mention what type of data files to put there. Can you give me an example of such a data file? If there are multiple data files in the folder, are they supposed to be data from multiple data sources? Thank you!

Evaluate removing uid by storing tags in segments

A way to remedy #187 may be to store tags in segments, if doing so does not significantly increase the amount of space and bandwidth used, and then remove univariate_id. However, this change needs to be evaluated in-depth first, as the schema for compressed segments would no longer be the same for all model tables once tags are included, and it may increase the amount of storage and bandwidth required. Removing univariate_ids has multiple benefits in addition to fixing #187, e.g., it makes each segment self-contained, which reduces the complexity of data transfer and query processing, and it removes the limit on the number of columns in a model table. The problem of each model table having a different schema for compressed segments could be solved by adding the model table's compressed segment schema to ModelTableMetadata.
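
A minimal sketch of the idea, assuming Apache Arrow schemas are built per model table; the field names and types are illustrative, not ModelarDB's actual compressed segment schema:

    use std::sync::Arc;

    use arrow::datatypes::{DataType, Field, Schema, SchemaRef};

    /// Build a compressed segment schema for one model table by appending that
    /// table's tag columns, making each segment self-contained without univariate_id.
    fn compressed_segment_schema(tag_column_names: &[&str]) -> SchemaRef {
        let mut fields = vec![
            Field::new("model_type_id", DataType::UInt8, false),
            Field::new("start_time", DataType::Int64, false),
            Field::new("end_time", DataType::Int64, false),
            Field::new("values", DataType::Binary, false),
        ];

        // The tag columns are what make the schema different for each model table.
        for tag_column_name in tag_column_names {
            fields.push(Field::new(*tag_column_name, DataType::Utf8, false));
        }

        Arc::new(Schema::new(fields))
    }

The per-table schema produced this way could then be stored in ModelTableMetadata so data transfer and query processing know the exact schema of each model table's segments.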

Transfer data for all columns atomically

To make data transfer appear atomic from a user's perspective, the data transfer component should operate at the table level and transfer the data points for the same time range for all columns when transferring data. Otherwise, data points may disappear on the edge without being available in the cloud because data was only transferred for a subset of the columns in a time range.

Support relative and absolute error bounds

Add a function named maximum_allowed_deviation() as a complement to is_within_error_bound() so all error bound calculations are in one place instead of being implemented by each model type.
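
A minimal sketch of how the two functions could fit together for relative and absolute error bounds; the enum and its variants are assumptions for illustration, not ModelarDB's actual types:

    /// An error bound that is either an absolute deviation or a percentage of the real value.
    #[derive(Clone, Copy)]
    enum ErrorBound {
        /// Maximum absolute difference between the real and approximate value.
        Absolute(f32),
        /// Maximum difference as a percentage of the real value.
        Relative(f32),
    }

    impl ErrorBound {
        /// The single place where the error calculation lives, so model types
        /// do not have to implement it themselves.
        fn maximum_allowed_deviation(&self, real_value: f64) -> f64 {
            match self {
                ErrorBound::Absolute(deviation) => *deviation as f64,
                ErrorBound::Relative(percentage) => real_value.abs() * (*percentage as f64 / 100.0),
            }
        }

        fn is_within_error_bound(&self, real_value: f64, approximate_value: f64) -> bool {
            (real_value - approximate_value).abs() <= self.maximum_allowed_deviation(real_value)
        }
    }

    fn main() {
        let error_bound = ErrorBound::Relative(1.0); // A 1% relative error bound.
        assert!(error_bound.is_within_error_bound(100.0, 100.9));
        assert!(!error_bound.is_within_error_bound(100.0, 102.0));
    }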

Support INSERT without values for generated column

For model tables, INSERT INTO is currently processed by TableProvider::insert_into(), but before Apache DataFusion passes the data to that method, it parses the data and verifies that its schema matches the schema returned by schema(), which takes no arguments. However, the schema returned by schema() also defines which columns can be queried. Thus, to allow generated columns to be queried, data must also be inserted for the generated columns, which is then immediately dropped by insert_into(). This can maybe be fixed by implementing some of the other methods in TableProvider, e.g., TableProvider::get_column_default(); if not, an issue and/or a PR should probably be opened in the Apache DataFusion repository to get some form of support for optional columns added.
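
A minimal sketch of the mismatch, assuming only standard Apache Arrow types: the schema data should ideally match when inserted would drop the generated columns from the schema used for queries, but schema() must return the full query schema and is what DataFusion validates inserts against. The helper and parameter names are hypothetical:

    use std::sync::Arc;

    use arrow::datatypes::{Field, Schema, SchemaRef};

    /// Derive the schema inserted data would ideally match by dropping the
    /// generated columns from the schema used for queries.
    fn insert_schema(query_schema: &SchemaRef, generated_columns: &[&str]) -> SchemaRef {
        let fields: Vec<Field> = query_schema
            .fields()
            .iter()
            .filter(|field| !generated_columns.contains(&field.name().as_str()))
            .map(|field| field.as_ref().clone())
            .collect();

        Arc::new(Schema::new(fields))
    }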

Support reading/writing data using Delta Lake

Storing data in Delta Lake instead of manually managed Apache Parquet files removes the need to implement a lot of complex functionality in the storage engine, such as atomic writes and compaction.
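
A minimal sketch, assuming the deltalake (delta-rs) crate and Apache Arrow types matching the version it re-exports; the table path and column names are examples:

    use std::sync::Arc;

    use arrow::array::{Float32Array, TimestampMillisecondArray};
    use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
    use arrow::record_batch::RecordBatch;
    use deltalake::DeltaOps;

    /// Append a batch of data points to a Delta Lake table; each write is an
    /// atomic commit to the Delta log, so readers never see partial files.
    async fn write_batch() -> Result<(), Box<dyn std::error::Error>> {
        let schema = Arc::new(Schema::new(vec![
            Field::new("timestamp", DataType::Timestamp(TimeUnit::Millisecond, None), false),
            Field::new("value", DataType::Float32, false),
        ]));

        let record_batch = RecordBatch::try_new(
            schema,
            vec![
                Arc::new(TimestampMillisecondArray::from(vec![1_000, 2_000, 3_000])),
                Arc::new(Float32Array::from(vec![37.0, 37.1, 37.2])),
            ],
        )?;

        let _table = DeltaOps::try_from_uri("data/example_table")
            .await?
            .write(vec![record_batch])
            .await?;

        Ok(())
    }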

Add support for automatic optimizing and vacuuming

With data and metadata stored in Delta Lake, many small Apache Parquet files may be created. Thus, the storage engine and metadata manager should automatically merge these files by optimizing each Delta Lake table and then delete the old files using vacuuming, without interfering with query execution.
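
A minimal sketch, assuming the deltalake (delta-rs) crate's OPTIMIZE and VACUUM operations; the retention period is an example value:

    use deltalake::{open_table, DeltaOps};

    /// Merge the many small Apache Parquet files in a Delta Lake table and then
    /// physically delete the files the merged versions replaced.
    async fn optimize_and_vacuum(table_uri: &str) -> Result<(), Box<dyn std::error::Error>> {
        // OPTIMIZE rewrites small files into larger ones; running queries keep
        // working because the old files are only removed from the Delta log.
        let table = open_table(table_uri).await?;
        let (table, _optimize_metrics) = DeltaOps(table).optimize().await?;

        // VACUUM deletes files that are no longer referenced by the Delta log
        // and are older than the retention period.
        let (_table, _vacuum_metrics) = DeltaOps(table)
            .vacuum()
            .with_retention_period(chrono::Duration::days(7))
            .await?;

        Ok(())
    }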

Support reading and writing metadata using delta-rs

  • Create a new table metadata manager.
  • Create a new manager metadata manager.
  • Update the start-up process to no longer require a SQLite database in the server.
  • Update the start-up process to no longer require a PostgreSQL database in the manager.
  • Update the documentation to make it clear that an SQL database is no longer needed (maybe also mention Delta Lake).
  • Update Docker deployment to no longer set up a PostgreSQL database in cluster deployment.
  • Look into optimizing the initial version.

Some univariate_id cannot be written to Delta Lake

Some u64 univariate_ids cause errors when written to Delta Lake, as it looks like they are stored as Int64 instead of UInt64 despite the field being defined as Field::new("univariate_id", DataType::UInt64, false). For example, tag_one and/or tag_two previously used in the integration tests cause the following error to occur when executing the tests: Failed to flush data in compressed data manager due to: Failed to convert into Arrow schema: Cast error: Can't cast value 11069825858223412227 to type Int64.
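
The failing value is larger than i64::MAX, which is consistent with the column being cast to Int64 somewhere along the write path; a minimal check:

    fn main() {
        // The univariate_id from the error message in the integration tests.
        let univariate_id: u64 = 11069825858223412227;

        // Any u64 above i64::MAX cannot be represented as a signed 64-bit
        // integer, so a cast to Int64 must fail for this value.
        assert!(univariate_id > i64::MAX as u64);
        assert!(i64::try_from(univariate_id).is_err());
    }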

Batch normal table inserts in CompressedDataBuffer

When data is inserted into a normal table, it is currently written directly to an Apache Parquet file, so many small files are created. Instead, data for normal tables should be merged into larger batches using CompressedDataBuffers, as is done for model tables. This will reduce the number of small files created by the storage engine and simplify the storage engine by managing the data for normal and model tables in the same way.
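
A minimal sketch of the batching, assuming only Apache Arrow; the struct and field names are illustrative, not the actual CompressedDataBuffer:

    use arrow::compute::concat_batches;
    use arrow::error::ArrowError;
    use arrow::record_batch::RecordBatch;

    /// Buffer the batches inserted into a normal table and only produce one
    /// merged batch for writing once enough bytes have accumulated.
    struct NormalTableBuffer {
        batches: Vec<RecordBatch>,
        buffered_bytes: usize,
        flush_threshold_in_bytes: usize,
    }

    impl NormalTableBuffer {
        fn insert(&mut self, record_batch: RecordBatch) {
            self.buffered_bytes += record_batch.get_array_memory_size();
            self.batches.push(record_batch);
        }

        /// Return one merged batch when the threshold is reached so a single
        /// larger Apache Parquet file can be written instead of many small ones.
        fn maybe_flush(&mut self) -> Result<Option<RecordBatch>, ArrowError> {
            if self.batches.is_empty() || self.buffered_bytes < self.flush_threshold_in_bytes {
                return Ok(None);
            }

            let schema = self.batches[0].schema();
            let merged_batch = concat_batches(&schema, &self.batches)?;

            self.batches.clear();
            self.buffered_bytes = 0;

            Ok(Some(merged_batch))
        }
    }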

Remove compaction from query processing

Write segments directly to non-overlapping Apache Parquet files that are perfectly ordered by univariate_id and start_time without compaction during query processing.
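
A minimal sketch of ordering a batch of segments before it is written, assuming Apache Arrow's sort kernels; the column names follow the issue, the function itself is hypothetical:

    use arrow::compute::{lexsort_to_indices, take, SortColumn};
    use arrow::error::ArrowError;
    use arrow::record_batch::RecordBatch;

    /// Sort a batch of compressed segments by univariate_id and then start_time
    /// so the Apache Parquet file written from it is already perfectly ordered.
    fn sort_segments(batch: &RecordBatch) -> Result<RecordBatch, ArrowError> {
        let sort_columns = vec![
            SortColumn {
                values: batch.column_by_name("univariate_id").unwrap().clone(),
                options: None,
            },
            SortColumn {
                values: batch.column_by_name("start_time").unwrap().clone(),
                options: None,
            },
        ];

        let indices = lexsort_to_indices(&sort_columns, None)?;

        let sorted_columns = batch
            .columns()
            .iter()
            .map(|column| take(column.as_ref(), &indices, None))
            .collect::<Result<Vec<_>, ArrowError>>()?;

        RecordBatch::try_new(batch.schema(), sorted_columns)
    }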

Update missing_puncation.yml for latest ast-grep

The version of ast-grep to use in GitHub Actions was set to version 0.21.4 in #175 as version 0.22.0 breaks missing_puncation.yml. Thus, the rule should be updated if ast-grep was purposely changed; if missing_puncation.yml broke due to a bug in ast-grep, this should probably be reported as an issue.

Support planning multiple queries in parallel

Currently, an exclusive write lock must be taken on the storage engine while planning queries. This should be changed to a read lock so multiple queries can be planned in parallel.
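
A minimal sketch of the proposed locking, assuming a tokio RwLock around the storage engine; StorageEngine is a stand-in for the actual type:

    use std::sync::Arc;

    use tokio::sync::RwLock;

    struct StorageEngine;

    /// Planning only reads the storage engine's state, so a read guard lets
    /// multiple queries be planned in parallel.
    async fn plan_query(storage_engine: Arc<RwLock<StorageEngine>>) {
        let _storage_engine = storage_engine.read().await;
        // Plan the query against the storage engine's state.
    }

    /// Inserting data points still requires exclusive access through a write guard.
    async fn insert_data_points(storage_engine: Arc<RwLock<StorageEngine>>) {
        let _storage_engine = storage_engine.write().await;
        // Mutate the storage engine.
    }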

Run tests on FreeBSD using GitHub Actions

To ensure ModelarDB can be installed on and keeps working as expected on FreeBSD, the tests should also be run on FreeBSD. While GitHub does not natively support any of the BSDs, it should be possible to run ModelarDB's tests on them using the GitHub Actions developed by VM Actions. Depending on the complexity, the tests should maybe also be run on other relevant targets that Rust has Tier 1 and hosted Tier 2 support for, such as NetBSD. If this is done, installation instructions should also be added to the user README.

Delete transferred data without blocking queries

The transferred Apache Parquet files cannot be deleted right after they are transferred if they are currently used by a query. However, new queries should also not use them, since the transferred data would then exist on both the edge and in the cloud from a user's perspective.
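
A minimal sketch of one possible approach, not the actual design: queries keep the files they read alive through reference counting, and a transferred file is only deleted when the last query referencing it finishes, while new queries are planned against a list that no longer contains it:

    use std::path::PathBuf;
    use std::sync::Arc;

    /// A transferred Apache Parquet file that deletes itself when the last
    /// reference to it is dropped, i.e., when no running query uses it anymore.
    struct TransferredFile {
        path: PathBuf,
    }

    impl Drop for TransferredFile {
        fn drop(&mut self) {
            let _ = std::fs::remove_file(&self.path);
        }
    }

    /// Remove the file from the list used to plan new queries; in-flight queries
    /// keep their own Arc clones, so the file is only deleted when they finish.
    fn schedule_deletion(query_files: &mut Vec<Arc<TransferredFile>>, transferred: &Arc<TransferredFile>) {
        query_files.retain(|file| !Arc::ptr_eq(file, transferred));
    }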
