scylladb / scylla-rust-driver

Async CQL driver for Rust, optimized for ScyllaDB!

License: Apache License 2.0



scylla-rust-driver's Introduction

ScyllaDB Rust Driver

This is a client-side driver for ScyllaDB written in pure Rust with a fully async API using Tokio. Although optimized for ScyllaDB, the driver is also compatible with Apache Cassandra®.

Note: this driver is officially supported but currently available in beta. Bug reports and pull requests are welcome!

Getting Started

The documentation book is a good place to get started. Another useful resource is the Rust and Scylla lesson on Scylla University.

Examples

let uri = "127.0.0.1:9042";

let session: Session = SessionBuilder::new().known_node(uri).build().await?;

let result = session.query("SELECT a, b, c FROM ks.t", &[]).await?;
let mut iter = result.rows_typed::<(i32, i32, String)>()?;
while let Some((a, b, c)) = iter.next().transpose()? {
    println!("a, b, c: {}, {}, {}", a, b, c);
}

Please see the full example program for more information. You can also run the example as follows if you have a Scylla server running:

SCYLLA_URI="127.0.0.1:9042" cargo run --example basic

All examples are available in the examples directory.

Features and Roadmap

The driver supports the following:

Asynchronous API
Token-aware routing
Shard-aware routing (specific to ScyllaDB)
Prepared statements
Query paging
Compression (LZ4 and Snappy algorithms)
CQL binary protocol version 4
Batch statements
Configurable load balancing policies
Driver-side metrics
TLS support (requires OpenSSL to be installed; see https://docs.rs/openssl/0.10.32/openssl/#automatic)
Configurable retry policies
Authentication support
CQL tracing

Ongoing efforts:

CQL Events
More tests
More benchmarks

Getting Help

Please join the #rust-driver channel on ScyllaDB Slack to discuss any issues or questions you might have.

Supported Rust Versions

Our driver's minimum supported Rust version (MSRV) is 1.66.0. Any changes will be explicitly published and will only happen during major releases.

Reference Documentation

CQL binary protocol specification version 4

Other Drivers

cdrs-tokio: Apache Cassandra driver written in pure Rust.
cassandra-rs: Rust wrappers for the DataStax C++ driver for Apache Cassandra.

License

This project is licensed under either of

Apache License, Version 2.0, (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

scylla-rust-driver's Issues

CQL query support

Add support for sending QUERY requests to the server (4.1.4.).

Please note that there are separate issues for query result set parsing (#11) and paging (#10).

Cluster topology discovery

Probe the cluster for its topology to enable things like token-aware routing (you need to know the address of the node that owns a specific token).

std::bad_alloc (sic) when trying to use LZ4 compression

HEAD: a42aef8

I get... an std::bad_alloc error when trying to run the example.rs:

[piodul@localhost scylla-rust-driver]$ cargo run --example example

# ... skipped warnings ...

warning: 5 warnings emitted

    Finished dev [unoptimized + debuginfo] target(s) in 0.03s
     Running `target/debug/examples/example`
Connecting to localhost:9042 ...
Error: Error (code 0): std::bad_alloc

The exception occurs when trying to send the first query with LZ4 compression turned on.

Add reading responses for requests

Implement utilities to read responses for requests (4.2. Responses from CQL spec).

Publish on crates.io

Let's publish the crate on crates.io:

https://doc.rust-lang.org/cargo/reference/publishing.html

Add retrieving replication strategy information from the cluster

In order to implement a token-aware policy, we need to know the replication strategy for a given keyspace, so that we know which nodes are responsible for the data we're trying to read/write.

The first part is to get the result metadata from the response to a PREPARE request. That's already mostly done - look for TableSpec and deser_table_spec for details.

Once we have the keyspace name (which is just a string), we can retrieve its replication strategy by selecting it from a system table:

SELECT replication FROM system_schema.keyspaces WHERE keyspace_name = 'our_name';

After that we need to parse the replication strategy information and it can later be used to determine which nodes to use for load balancing.

I believe that that's the related code from gocql:
https://github.com/gocql/gocql/blob/964d7011f63d85c0c135ca47e2d06032c6be391b/topology.go#L71-L91
https://github.com/gocql/gocql/blob/5913df4d474e0b2492a129d17bbb3c04537a15cd/metadata.go#L544-L598
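The parsing step described above could look roughly like the following sketch. It assumes the replication column has already been deserialized into a map of strings; ReplicationStrategy and parse_replication are hypothetical names used for illustration, not part of the driver.

```rust
use std::collections::HashMap;

/// Replication settings of a keyspace, as far as this sketch understands them.
#[derive(Debug, PartialEq)]
pub enum ReplicationStrategy {
    Simple { replication_factor: usize },
    /// Maps a datacenter name to its replication factor.
    NetworkTopology { datacenter_rf: HashMap<String, usize> },
    /// Any strategy class this sketch does not recognize.
    Other { class: String },
}

/// Parses the `replication` column (a CQL map<text, text>) of system_schema.keyspaces.
/// Returns None if the map is malformed (missing class or non-numeric factor).
pub fn parse_replication(options: &HashMap<String, String>) -> Option<ReplicationStrategy> {
    let class = options.get("class")?;
    // The class may be fully qualified (org.apache.cassandra.locator.X); keep the last segment.
    let short_name = class.rsplit('.').next().unwrap_or(class.as_str());
    match short_name {
        "SimpleStrategy" => {
            let replication_factor: usize = options.get("replication_factor")?.parse().ok()?;
            Some(ReplicationStrategy::Simple { replication_factor })
        }
        "NetworkTopologyStrategy" => {
            let mut datacenter_rf = HashMap::new();
            for (key, value) in options {
                // Every key other than "class" is treated as a datacenter name.
                if key.as_str() != "class" {
                    let rf: usize = value.parse().ok()?;
                    datacenter_rf.insert(key.clone(), rf);
                }
            }
            Some(ReplicationStrategy::NetworkTopology { datacenter_rf })
        }
        _ => Some(ReplicationStrategy::Other { class: class.clone() }),
    }
}
```

The resulting value can then be handed to whatever code computes the replica set for a token.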

LZ4 compression is incompatible with Scylla

LZ4 compression works for small messages with Scylla:

[penberg@nero rust-driver]$ SCYLLA_URI="localhost:9042" cargo run --example cqlsh-rs
    Finished dev [unoptimized + debuginfo] target(s) in 0.04s
     Running `target/debug/examples/cqlsh-rs`
Connecting to localhost:9042 ...
>> USE ks
>>

but not for larger ones:

Connecting to localhost:9042 ...
>> CREATE KEYSPACE IF NOT EXISTS ks WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor' : 1}
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error (code 0): CQL frame LZ4 uncompression failure', examples/cqlsh-rs.rs:22:48
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

We need to find a crate that is compatible with Scylla's LZ4.

Allow passing multiple node addresses to session::connect

We currently only allow passing a single address to Session::connect. It's fine for token-aware routing, since we then pull the topology info and establish connections to each node anyway, but we should also allow non-token-aware users to have sensible load balancing (e.g. round robin over multiple addresses). Token awareness is kind of hardcoded for now, but it's not going to always be the case, so users should have other options.

In order to do that, we should allow passing multiple addresses to connect, just like other drivers tend to do, e.g. Session::connect(&[addr1, addr2, addr3]). These addresses should then be inserted into a pool, which can be used for load balancing.

Run PREPARE on all nodes

Currently, prepare() only sends the request via a single connection. It would be better to use the topology information (if available) to send the preparation request to each node.

Statement preparation

Add support for the PREPARE (4.1.5.) statement, which registers a CQL statement on the cluster for later execution.

Add Scylla Sphinx Theme to Rust Driver Docs

Just like the Python and Java Driver, the Scylla Sphinx Theme should be applied to the Rust Driver.

The Process includes:

Following the steps - https://github.com/scylladb/sphinx-scylladb-theme/blob/master/README.rst
Testing the output
Making sure everything we display is branded Scylla not Datastax
Making sure the links are all directed to Scylla
Adding multi-version when the next version goes out.

Implement RoundRobin load balancing

Here's a reference from another driver docs: https://hexdocs.pm/cassandra/Cassandra.LoadBalancing.RoundRobin.html
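As a minimal sketch of the idea, assuming nodes are identified by plain address strings: the RoundRobin type and its pick method below are hypothetical names for illustration, not the driver's API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hands out nodes in turn; safe to share between tasks thanks to the atomic counter.
pub struct RoundRobin {
    nodes: Vec<String>,
    next: AtomicUsize,
}

impl RoundRobin {
    pub fn new(nodes: Vec<String>) -> Self {
        Self { nodes, next: AtomicUsize::new(0) }
    }

    /// Returns the next node, or None if no nodes are known.
    pub fn pick(&self) -> Option<&str> {
        if self.nodes.is_empty() {
            return None;
        }
        // fetch_add wraps around on overflow, so the index always stays in range.
        let index = self.next.fetch_add(1, Ordering::Relaxed) % self.nodes.len();
        Some(self.nodes[index].as_str())
    }
}
```

With two nodes, successive calls to pick return the first, the second, then the first again.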

Code formatting check

Our CI should check if our code conforms to a style enforced by cargo fmt.

Some of us (at least I) use cargo fmt to format the code that we write locally. We should agree on one version and enforce its formatting, so that we won't have to deal with unwanted formatting changes when working on files formatted by another person who used a different version of cargo fmt (or did not use cargo fmt at all).

Optimize token to shard mapping

I copied gocql's approach for shard_of, but as it turns out, we can implement it more efficiently with 128-bit arithmetic:

uint64_t biased_token = token + ((uint64_t)1 << 63);
biased_token <<= ignore_msb;
int shard = ((unsigned __int128)biased_token * nr_shards) >> 64;

as per:

https://github.com/scylladb/scylla/blob/master/docs/protocol-extensions.md#intranode-sharding
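A direct Rust translation of the C snippet above might look like this. It is a sketch: the function signature is an assumption, and ignore_msb is assumed to be smaller than 64.

```rust
/// Maps a token to a shard using 128-bit arithmetic.
/// `nr_shards` and `ignore_msb` are the sharding parameters reported by the node.
/// Assumes `ignore_msb < 64`; a larger shift would panic in debug builds.
pub fn shard_of(token: i64, nr_shards: u32, ignore_msb: u8) -> u32 {
    // Bias the signed token into the unsigned range [0, 2^64).
    let mut biased_token = (token as u64).wrapping_add(1u64 << 63);
    // Discard the most significant bits that the sharding algorithm ignores.
    biased_token <<= ignore_msb;
    // The high 64 bits of the 128-bit product are the shard number.
    ((u128::from(biased_token) * u128::from(nr_shards)) >> 64) as u32
}
```

For example, with 4 shards and ignore_msb set to 0, the smallest token maps to shard 0, token 0 maps to shard 2, and the largest token maps to shard 3.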

Support CQL collection types: lists, sets, maps, tuples, frozen<> and not frozen

Part of #97

Implement DCAwareRoundRobin load balancing

Again, here's a random reference from another driver: https://www.rubydoc.info/gems/cassandra-driver/2.1.5/Cassandra/LoadBalancing/Policies/DCAwareRoundRobin

The idea is similar to regular round robin, but only within a single datacenter (datacenter info can be fetched from the cluster metadata).
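A rough sketch of such a policy, assuming each node carries its datacenter name from the cluster metadata. Node and DcAwareRoundRobin are hypothetical types made up for this example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Minimal node description; in a real driver this would come from cluster metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub address: String,
    pub datacenter: String,
}

/// Round robin restricted to the nodes of one datacenter.
pub struct DcAwareRoundRobin {
    local_dc: String,
    next: AtomicUsize,
}

impl DcAwareRoundRobin {
    pub fn new(local_dc: &str) -> Self {
        Self { local_dc: local_dc.to_string(), next: AtomicUsize::new(0) }
    }

    /// Picks the next node from the local datacenter, or None if it has no nodes.
    pub fn pick<'a>(&self, nodes: &'a [Node]) -> Option<&'a Node> {
        let local: Vec<&Node> = nodes
            .iter()
            .filter(|node| node.datacenter == self.local_dc)
            .collect();
        if local.is_empty() {
            return None;
        }
        let index = self.next.fetch_add(1, Ordering::Relaxed) % local.len();
        Some(local[index])
    }
}
```

Filtering on every call keeps the sketch short; a real policy would more likely cache the per-datacenter node list and rebuild it when the topology changes.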

Calculate replica sets when performing token-aware routing

Currently, when executing a prepared statement, we calculate the token of the partition key and pick the connection to the node which owns this token (i.e. which owns the token range/vnode that this token lies in).

The owner of a token should be one of the replicas for the given partition, but I'm not sure if it will always be -- maybe in NTS it could happen that this owner lies in a data-center that the keyspace does not replicate to, in which case it won't be a replica.

In any case, there can and usually will be other replicas based on the replication strategy of the keyspace that we're performing the statement on. The connection picking code should calculate this set of replicas and in case executing the statement fails on one of them, try another one. It could also try balancing the load between the replicas somehow by e.g. picking the connections in a round-robin fashion.
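For the simplest case, the replica calculation could be sketched as below. The sketch assumes SimpleStrategy-like placement (walk the ring clockwise from the token's owner and collect distinct nodes) and a ring stored as a sorted map from token to node name; simple_strategy_replicas is a hypothetical helper, and datacenter-aware placement is not covered.

```rust
use std::collections::BTreeMap;

/// Returns up to `rf` distinct nodes for `token`, starting at the token's owner
/// (the first ring entry at or after `token`) and walking the ring clockwise.
pub fn simple_strategy_replicas<'a>(
    ring: &'a BTreeMap<i64, String>,
    token: i64,
    rf: usize,
) -> Vec<&'a str> {
    let mut replicas: Vec<&str> = Vec::new();
    // Entries at or after the token come first, then the walk wraps around to the start.
    let clockwise = ring.range(token..).chain(ring.range(..token));
    for node in clockwise.map(|(_, name)| name.as_str()) {
        if replicas.len() == rf {
            break;
        }
        // A node may own several tokens (vnodes); count it only once.
        if !replicas.contains(&node) {
            replicas.push(node);
        }
    }
    replicas
}
```

The connection picking code could then try the returned nodes in order, or rotate through them to spread the load.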

Token-aware routing

Use cluster topology (#14) to send queries to a node that owns the token of a query.

ORM

I made an orm for https://github.com/AlexPikalov/cdrs a while ago: https://github.com/Jasperav/cdrs_orm. In short:

A lot of pre-generated queries (https://github.com/Jasperav/cdrs_orm/blob/master/cdrs_db_mirror/test_derived_equals/src/gen/generated_some_struct.rs)
Compile time type checked queries: https://github.com/Jasperav/cdrs_orm/blob/master/cdrs_query/cdrs_query_example/src/lib.rs
Automatic table mapping to Rust structs: https://github.com/Jasperav/cdrs_orm/blob/master/cdrs_to_rust/src/test_result/test_table.rs
JSON mapping: https://github.com/Jasperav/cdrs_orm/blob/master/cdrs_db_mirror/example_db_mirror/src/lib.rs

I see that Scylla has an ORM for Go: https://github.com/scylladb/gocqlx.

I think it's a good idea to make an ORM for this crate when it is production ready, maybe we can use cdrs_orm (my crate) as a starting point. I can help with the migration.

LWT support

We need to support and test LWT (lightweight transaction) queries - the ones that have conditions within them, e.g. IF EXISTS.
We also have a Scylla-specific optimization: when preparing a statement, the server returns a flag indicating that the statement is a lightweight transaction. We can then reduce the number of Paxos conflicts by always trying to send requests for the same key to the same node, which is assumed to be the Paxos coordinator.

We should also have tests for LWT queries.

An API for querying rows with known type

Problem

Currently, our query API returns rows in untyped form. Because of its untypedness, it can be quite inconvenient to use - there is a lot of unwrapping involved in order to get a single value:

if let Some(rs) = session.query("SELECT a, b, c FROM ks.t", &[]).await? {
    for r in rs {
        let a = r.columns[0].as_ref().unwrap().as_int().unwrap();
        let b = r.columns[1].as_ref().unwrap().as_int().unwrap();
        let c = r.columns[2].as_ref().unwrap().as_text().unwrap();
        println!("a, b, c: {}, {}, {}", a, b, c);
    }
}

While I think this interface may have some use (e.g. when doing SELECT * FROM ... we may not know the column names and types and we want to discover them along with the response), it's unnecessarily inconvenient when the user knows the schema and knows which types to expect.

We should add another interface which returns rows as tuples of user-specified types. This interface would first check that the types declared by the user match the metadata in the response, and then proceed with deserialization.

Examples

This is how I imagine the example above would be rewritten using such an API:

let result: Option<QueryResult<(i32, i32, String)>> = session.query("SELECT a, b, c FROM ks.t", &[]).await?;
if let Some(rs) = result {
    for (a, b, c) in rs {
        println!("a, b, c: {}, {}, {}", a, b, c);
    }
}

I think there should be some scenarios in which the type inference eliminates the need to specify the types at all:

fn print_row(a: i32, b: i32, c: String) {
    println!("a, b, c: {}, {}, {}", a, b, c);
}

if let Some(rs) = session.query("SELECT a, b, c FROM ks.t", &[]).await? {
    for (a, b, c) in rs {
        print_row(a, b, c);
    }
}

Maybe the Query and PreparedStatement types should encode the returned row type? The old API would be moved to UntypedQuery and UntypedPreparedStatement.
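One way the typed conversion could be structured is sketched below. All names here (CqlValue, Row, FromCqlValue, FromRow, FromRowError) are assumptions made for this example, and the sketch checks types while converting each value rather than validating against the response metadata up front, as the proposal suggests.

```rust
/// A tiny stand-in for an untyped column value; hypothetical, not the driver's type.
#[derive(Debug, Clone, PartialEq)]
pub enum CqlValue {
    Int(i32),
    Text(String),
}

/// An untyped row: one optional value per column (None means NULL).
pub type Row = Vec<Option<CqlValue>>;

#[derive(Debug, PartialEq)]
pub enum FromRowError {
    WrongColumnCount,
    UnexpectedNull,
    WrongType,
}

/// Conversion of a single column into a Rust type.
pub trait FromCqlValue: Sized {
    fn from_cql(value: Option<CqlValue>) -> Result<Self, FromRowError>;
}

impl FromCqlValue for i32 {
    fn from_cql(value: Option<CqlValue>) -> Result<Self, FromRowError> {
        match value {
            Some(CqlValue::Int(v)) => Ok(v),
            Some(_) => Err(FromRowError::WrongType),
            None => Err(FromRowError::UnexpectedNull),
        }
    }
}

impl FromCqlValue for String {
    fn from_cql(value: Option<CqlValue>) -> Result<Self, FromRowError> {
        match value {
            Some(CqlValue::Text(v)) => Ok(v),
            Some(_) => Err(FromRowError::WrongType),
            None => Err(FromRowError::UnexpectedNull),
        }
    }
}

/// Conversion of a whole row into a typed value such as a tuple.
pub trait FromRow: Sized {
    fn from_row(row: Row) -> Result<Self, FromRowError>;
}

impl<A: FromCqlValue, B: FromCqlValue, C: FromCqlValue> FromRow for (A, B, C) {
    fn from_row(row: Row) -> Result<Self, FromRowError> {
        if row.len() != 3 {
            return Err(FromRowError::WrongColumnCount);
        }
        let mut columns = row.into_iter();
        // The length check guarantees three columns; flatten() merges the
        // iterator's Option with the column's own NULL marker.
        let a = A::from_cql(columns.next().flatten())?;
        let b = B::from_cql(columns.next().flatten())?;
        let c = C::from_cql(columns.next().flatten())?;
        Ok((a, b, c))
    }
}
```

Implementations for other tuple sizes would follow the same pattern and could be generated with a macro.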

Implement token aware load balancing

The code is already there, but we want this load balancer to be configurable. It should take an underlying load balancer (e.g. RoundRobin) and use it internally to distribute requests to different nodes.

Random reference: https://www.rubydoc.info/gems/cassandra-driver/2.1.5/Cassandra/LoadBalancing/Policies/DCAwareRoundRobin

Support user-defined types (UDT)

Part of #97

Use server events to trigger metadata refresh

We currently refresh metadata (e.g. token ring) every 10 seconds. Switch to using server events to avoid the periodic polling.

Establish connections to all available nodes on startup

Currently, connections to specific nodes/shards are established lazily - only when a token-aware statement is about to be sent to a specific node that doesn't have a working connection yet. That creates a subtle problem with statements which are propagated from the driver side (e.g. #115), because such statements will only get propagated to existing connections, which may not be enough. In particular, right after calling Session::connect, the number of connections is currently always 1.

The solution is to simply establish a connection the moment we discover (via topology) that there exists a node which wasn't contacted yet.

Provide an option to bind values to each separate statement in a batch

With the current batch API, it is easy to run into a mismatch between the number of statements and the number of values. To avoid it, we need to provide an option to bind values to each separate statement in a batch.

New batch API should support:

Binding values to each separate statement in a batch
Modifying statement's values between batch executions
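The two requirements above could be met by storing the values next to each statement, as in this sketch. Batch and BatchStatement are hypothetical types, and values are plain strings only to keep the example short.

```rust
/// One statement of a batch together with the values bound to it.
#[derive(Debug, Clone, PartialEq)]
pub struct BatchStatement {
    pub cql: String,
    pub values: Vec<String>,
}

/// A batch in which every statement owns its values, so the counts cannot drift apart.
#[derive(Debug, Default)]
pub struct Batch {
    statements: Vec<BatchStatement>,
}

impl Batch {
    /// Appends a statement with its own values and returns its index in the batch.
    pub fn append(&mut self, cql: &str, values: Vec<String>) -> usize {
        self.statements.push(BatchStatement { cql: cql.to_string(), values });
        self.statements.len() - 1
    }

    /// Replaces the values of one statement between executions.
    /// Returns false if `index` does not refer to a statement.
    pub fn set_values(&mut self, index: usize, values: Vec<String>) -> bool {
        match self.statements.get_mut(index) {
            Some(statement) => {
                statement.values = values;
                true
            }
            None => false,
        }
    }

    pub fn statements(&self) -> &[BatchStatement] {
        &self.statements
    }
}
```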

Prepared statement execution

After statement preparation support (#8), add support for the EXECUTE request (4.1.6.) to execute prepared statements. Please note that a prepared statement might be evicted from the prepared statement cache on the server, which requires the driver to re-prepare the statement.

Allow setting the consistency level

As of now, it's not possible to set the consistency level - it's hardcoded everywhere as ONE. We should allow the user to set the consistency level.

Receive sharding info via sending OPTIONS

Refs #18

More informative errors

Currently, we are using the anyhow crate for error handling because it is very easy to use - it exposes one error type anyhow::Error, which is a catch-all for all types implementing std::error::Error.

This approach has a big drawback - while anyhow::Error can be associated with a descriptive error message so that it is easy to understand by a human, it's not really possible to differentiate different kinds of failure in the code - e.g. distinguish between query timeout and connection close. This kind of information will not only be useful for client programs, but for driver internals, too - for example, see #59 (comment)

We should gradually move away from the anyhow::Error type and write meaningful error types ourselves. I'd like to suggest the thiserror crate - it greatly simplifies writing custom error types.
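To illustrate what such an error type could look like, here is a hand-written sketch using only the standard library; thiserror would generate the Display, Error and From implementations from attributes instead. QueryError, its variants and is_retryable are hypothetical names.

```rust
use std::error::Error;
use std::fmt;
use std::io;

/// Distinguishable failure kinds, instead of a single catch-all error.
#[derive(Debug)]
pub enum QueryError {
    Timeout,
    ConnectionClosed,
    Io(io::Error),
}

impl fmt::Display for QueryError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            QueryError::Timeout => write!(f, "query timed out"),
            QueryError::ConnectionClosed => write!(f, "connection closed"),
            QueryError::Io(err) => write!(f, "I/O error: {}", err),
        }
    }
}

impl Error for QueryError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            QueryError::Io(err) => Some(err),
            _ => None,
        }
    }
}

impl From<io::Error> for QueryError {
    fn from(err: io::Error) -> Self {
        QueryError::Io(err)
    }
}

/// Example of code that needs to tell failures apart (hypothetical policy:
/// only timeouts are worth retrying).
pub fn is_retryable(error: &QueryError) -> bool {
    matches!(error, QueryError::Timeout)
}
```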

TLS support

We should support encrypted connections over TLS.

Query paging support

Add support for paging with the QUERY and EXECUTE requests.

Retry policy support

Add support for retrying some operations under the right circumstances (the query is idempotent and a retry policy is configured).

Support USE statement

The USE statement isn't particularly complicated, but it should be treated in a special way - it should be propagated to all underlying connections for a given CQL session, since it changes their state.

Refactor Value API

Currently all values passed to queries are represented like this:

pub enum Value {
    Val(Bytes),
    Null,
    NotSet,
}

This works well for simple types, but I ran into some problems when implementing user-defined types.
The first problem is that converting a UDT into Val(Bytes) can fail if a field is serialized to [bytes] bigger than 2 GiB. This makes the query API ugly, because instead of the simple values! we have to use try_values! and handle the conversion error.
Example with values!:

session
        .query(
            "INSERT INTO ks.t (a, b, c) VALUES (?, ?, ?)",
            &scylla::values!(3, 4, "def"),
        )
        .await?;

And with try_values!:

session
        .query(
            "INSERT INTO ks.t (a, b, c) VALUES (?, ?, ?)",
            &(scylla::try_values!(3, 4, "def") ?),
        )
        .await?;

Additionally the same error can happen when awaiting the query because we serialize Val(Bytes) for the final time when sending the query.
Maybe we could avoid this ugliness if Value was changed to something like:

pub enum Value {
    Serialized(Bytes),
    TooBigTooSend,
}

Then values! would convert Rust types into either a successfully serialized Serialized(Bytes) or, if the value turns out bigger than 2 GiB, TooBigTooSend. Later, the query API would check whether the values are properly serialized and return an error in case of TooBigTooSend.
This would make the user-facing API nicer at the cost of unwrapping Values in the query code.
Maybe splitting Value into variants would also work well with named values in the future (?)

Another issue is that when serializing a UDT with the current API, each field is recursively converted into a Value and then written into the final BytesMut, which means an allocation for each conversion into Value. That is not ideal. A better solution would be to introduce a trait that allows serializing a type as a value by writing into a &mut impl BufMut instance, similarly to how requests are serialized. Then we could convert all Rust types into Value::Serialized using this trait. Currently the bytes inside Value::Val(Bytes) aren't a fully serialized value, so it wouldn't work.

There would be some problems, because, for example, token routing hashes the value bytes only if the value is not null, but this could be solved by checking whether the number of serialized bytes is > 4 and then taking bytes[4..] as the data to hash.
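A sketch of such a trait is shown below. It writes into a plain Vec<u8> instead of a BufMut to stay within the standard library, and SerializeValue, ValueTooBig and payload_for_hashing are hypothetical names. The encoding follows the CQL [bytes] format: a big-endian 4-byte length followed by the payload, with a negative length marking null.

```rust
use std::convert::TryFrom;

/// Returned when a value does not fit into the 4-byte signed length prefix.
#[derive(Debug, PartialEq)]
pub struct ValueTooBig;

/// Serializes a value directly into an output buffer, without intermediate allocations.
pub trait SerializeValue {
    fn serialize(&self, buf: &mut Vec<u8>) -> Result<(), ValueTooBig>;
}

impl SerializeValue for i32 {
    fn serialize(&self, buf: &mut Vec<u8>) -> Result<(), ValueTooBig> {
        buf.extend_from_slice(&4i32.to_be_bytes());
        buf.extend_from_slice(&self.to_be_bytes());
        Ok(())
    }
}

impl SerializeValue for str {
    fn serialize(&self, buf: &mut Vec<u8>) -> Result<(), ValueTooBig> {
        // Fails for payloads longer than i32::MAX bytes (about 2 GiB).
        let len = i32::try_from(self.len()).map_err(|_| ValueTooBig)?;
        buf.extend_from_slice(&len.to_be_bytes());
        buf.extend_from_slice(self.as_bytes());
        Ok(())
    }
}

impl<T: SerializeValue> SerializeValue for Option<T> {
    fn serialize(&self, buf: &mut Vec<u8>) -> Result<(), ValueTooBig> {
        match self {
            Some(value) => value.serialize(buf),
            None => {
                // A length of -1 encodes null; no payload follows.
                buf.extend_from_slice(&(-1i32).to_be_bytes());
                Ok(())
            }
        }
    }
}

/// Returns the payload to hash for token routing, skipping the length prefix.
/// Uses the "more than 4 bytes" check from the text, so null (and empty) values yield None.
pub fn payload_for_hashing(serialized: &[u8]) -> Option<&[u8]> {
    if serialized.len() > 4 {
        Some(&serialized[4..])
    } else {
        None
    }
}
```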

I'm not sure what API for Value would be best, but the current one is not ideal.
I'm going to try some options and see what could work.

Allow configuring load balancing algorithms

Session objects should accept some kind of config information (e.g. a configuration struct), which includes the picked load balancing strategy:

RoundRobin
DCAwareRoundRobin
TokenAware
ShardAware

... where TokenAware and ShardAware also take an underlying policy for internal load balancing (so that a user can configure the load balancing to be TokenAware(RoundRobin) or TokenAware(DCAwareRoundRobin)).
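Such a configuration could be modeled as a recursive enum, sketched below. LoadBalancing, SessionConfig and describe are hypothetical names chosen for this example.

```rust
/// The load balancing strategies listed above; the wrapping variants
/// carry the underlying policy they delegate to.
#[derive(Debug, Clone, PartialEq)]
pub enum LoadBalancing {
    RoundRobin,
    DcAwareRoundRobin { local_dc: String },
    TokenAware(Box<LoadBalancing>),
    ShardAware(Box<LoadBalancing>),
}

impl LoadBalancing {
    /// Renders the policy in the TokenAware(RoundRobin) notation used above.
    pub fn describe(&self) -> String {
        match self {
            LoadBalancing::RoundRobin => "RoundRobin".to_string(),
            LoadBalancing::DcAwareRoundRobin { local_dc } => {
                format!("DCAwareRoundRobin({})", local_dc)
            }
            LoadBalancing::TokenAware(inner) => format!("TokenAware({})", inner.describe()),
            LoadBalancing::ShardAware(inner) => format!("ShardAware({})", inner.describe()),
        }
    }
}

/// Configuration passed when creating a session.
#[derive(Debug, Clone, PartialEq)]
pub struct SessionConfig {
    pub known_nodes: Vec<String>,
    pub load_balancing: LoadBalancing,
}
```

An enum keeps the set of policies closed; a trait object would be the alternative if users should be able to plug in their own policies.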

Authentication support

If authentication is enabled on the server, it will send an AUTHENTICATE (4.2.3.) message, to which the client responds with AUTH_RESPONSE. There are also AUTH_CHALLENGE and AUTH_SUCCESS messages sent by the server.

Add multiplexing to the connection

Our connection class needs to be able to handle multiple communication streams, identified by the stream: u16 field in the CQL specification. Also, CQL allows the server to push messages to the driver via EVENT. In order to handle that correctly, we should prepare the connection class to multiplex multiple communication channels and have separate routines for reading and writing.
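The bookkeeping behind such multiplexing could be sketched as follows. StreamRouter is a hypothetical type, and standard library channels stand in for the async channels a Tokio-based driver would use. In protocol version 4 the stream id is a signed 16-bit value and negative ids are reserved for server-initiated messages such as events, so the sketch hands out only non-negative ids.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Tracks in-flight requests by stream id and routes each response to its waiter.
pub struct StreamRouter {
    pending: HashMap<i16, Sender<Vec<u8>>>,
    next_id: i16,
}

impl StreamRouter {
    pub fn new() -> Self {
        Self { pending: HashMap::new(), next_id: 0 }
    }

    /// Called by the writing side: reserves a free stream id and returns it together
    /// with the receiver on which the response body will arrive.
    /// Returns None if all non-negative ids are in use.
    pub fn register(&mut self) -> Option<(i16, Receiver<Vec<u8>>)> {
        for _ in 0..=i16::MAX {
            let id = self.next_id;
            self.next_id = if id == i16::MAX { 0 } else { id + 1 };
            if !self.pending.contains_key(&id) {
                let (sender, receiver) = channel();
                self.pending.insert(id, sender);
                return Some((id, receiver));
            }
        }
        None
    }

    /// Called by the reading side: delivers a response body to the request waiting
    /// on `stream`. Returns false if nobody is waiting on that stream.
    pub fn dispatch(&mut self, stream: i16, body: Vec<u8>) -> bool {
        match self.pending.remove(&stream) {
            Some(sender) => sender.send(body).is_ok(),
            None => false,
        }
    }
}
```

Responses may arrive in any order; each one reaches the right caller because the lookup is keyed by stream id, and the id becomes reusable once its response has been dispatched.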

Support batches

Support sending multiple statements at once using the BATCH request.

https://github.com/apache/cassandra/blob/dfd9c74c67f6450fc32ea827b7a4d73af4d0e605/doc/native_protocol_v4.spec#L386

Support all CQL types

Support for CQL types is now restricted to pretty much ints and strings. We would like the rest of the types, along with collections and UDTs, to be supported by our driver as well.

Implement shard aware load balancing

As with #124, the backend is already there, but we want this option to be configurable.

Also, shard-aware load balancer should be the default choice in case a user hasn't provided any custom configuration. Shard-aware routing should fall back to token awareness if shard info is not available (e.g. because we're talking to Cassandra instead of Scylla).

Add a way to wait for schema agreement

It's quite an important feature, since otherwise users don't know when it's safe to start sending requests again after they issue some schema modification statements.

Connection breakage if compression algorithm is not supported

If we request a compression algorithm that is not supported by the server, we should either fall back to non-compression or fail the connection with a human-readable error. Right now, our code will just assume that the compression algorithm is supported and send compressed messages, which the server will fail to parse.
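Both behaviours could be covered by a small negotiation step, sketched below. It assumes the list of algorithm names has already been read from the COMPRESSION entry of the server's SUPPORTED response; Compression, UnsupportedCompression and negotiate are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Compression {
    Lz4,
    Snappy,
}

impl Compression {
    /// The algorithm name as it appears in the protocol options.
    pub fn name(self) -> &'static str {
        match self {
            Compression::Lz4 => "lz4",
            Compression::Snappy => "snappy",
        }
    }
}

/// Human-readable failure: what was requested and what the server offers.
#[derive(Debug, PartialEq)]
pub struct UnsupportedCompression {
    pub requested: Compression,
    pub supported: Vec<String>,
}

/// Decides which compression to use for a connection.
/// With `fall_back` set, an unsupported request silently becomes "no compression";
/// otherwise it is reported as an error before any compressed frame is sent.
pub fn negotiate(
    requested: Option<Compression>,
    supported: &[String],
    fall_back: bool,
) -> Result<Option<Compression>, UnsupportedCompression> {
    let requested = match requested {
        Some(compression) => compression,
        None => return Ok(None),
    };
    if supported.iter().any(|name| name.eq_ignore_ascii_case(requested.name())) {
        Ok(Some(requested))
    } else if fall_back {
        Ok(None)
    } else {
        Err(UnsupportedCompression { requested, supported: supported.to_vec() })
    }
}
```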

Wait for Ready message from the server

We currently allow queries to be sent before a Ready message is received from the server. This can cause the server to reject the queries.

Compression support

The CQL binary protocol supports compression. The client must negotiate which compression algorithm to use during connection establishment.

Integration test support in CI

Let's make a Scylla instance available in the CI environment for integration tests, similar to what the Mongo driver does:

https://github.com/mongodb/mongo-rust-driver#running-the-tests

We can use the Scylla Docker image for this: https://hub.docker.com/u/scylladb
