Coder Social home page Coder Social logo

compact_str's Introduction

compact_str

A memory efficient string type that can store up to 24* bytes on the stack.

version on crates.io Minimum supported Rust Version: 1.59 mit license
Continuous Integration Status Cross Platform Status Minimum Supported Rust Version Status Clippy Status

* 12 bytes for 32-bit architectures


About

A CompactString is a more memory efficient string type, that can store smaller strings on the stack, and transparently stores longer strings on the heap (aka a small string optimization). It can mostly be used as a drop in replacement for String and are particularly useful in parsing, deserializing, or any other application where you may have smaller strings.

Properties

A CompactString specifically has the following properties:

  • size_of::<CompactString>() == size_of::<String>()
  • Stores up to 24 bytes on the stack
    • 12 bytes if running on a 32 bit architecture
  • Strings longer than 24 bytes are stored on the heap
  • Clone is O(n)
  • From<String> or From<Box<str>> re-uses underlying buffer
    • Eagerly inlines small strings
  • Heap based string grows at a rate of 1.5x
    • The std library String grows at a rate of 2x
  • Space optimized for Option<_>
    • size_of::<CompactString>() == size_of::<Option<CompactString>>()
  • Uses branchless instructions for string accesses

Traits

This crate exposes two traits, ToCompactString and CompactStringExt.

ToCompactString

Provides the to_compact_string(&self) method for converting types into a CompactString. This trait is automatically implemented for all types that are std::fmt::Display, with specialized higher performance impls for:

  • u8, u16, u32, u64, usize, u128
  • i8, i16, i32, i64, isize, i128
  • f32, f64
  • bool, char
  • NonZeroU*, NonZeroI*
  • String, CompactString

CompactStringExt

Provides two methods join_compact(seperator: impl AsRef<str>) and concat_compact(). This trait is automatically implemented for all types that can be converted into an iterator and yield types that impl AsRef<str>. This allows you to join Vec's, slices, and any other collection to form CompactStrings.

Macros

This crate exposes one macro format_compact! that can be used to create CompactStrings from arguments, like you can Strings with the std::format! macro.

Features

compact_str has the following optional features:

  • serde, which implements Deserialize and Serialize from the popular serde crate, for CompactString
  • bytes, which provides two methods from_utf8_buf<B: Buf>(buf: &mut B) and from_utf8_buf_unchecked<B: Buf>(buf: &mut B), which allows for the creation of a CompactString from a bytes::Buf
  • markup, which implements Render trait, so CompactStrings can be used in templates as HTML escaped strings
  • diesel, which allows using CompactStrings in diesel text columns
  • sqlx-mysql / sqlx-postgres / sqlx-sqlite, which allows using CompactStrings in sqlx text columns
  • arbitrary, which implements the arbitrary::Arbitrary trait for fuzzing
  • proptest, which implements the proptest::arbitrary::Arbitrary trait for fuzzing
  • quickcheck, which implements the quickcheck::Arbitrary trait for fuzzing
  • rkyv, which implements rkyv::Archive, rkyv::Serialize and rkyv::Deserialize for fast zero-copy serialization, interchangable with serialized Strings
  • smallvec, provides the into_bytes() method which enables you to convert a CompactString into a byte vector, using smallvec::SmallVec

How it works

Note: this explanation assumes a 64-bit architecture, for 32-bit architectures generally divide any number by 2.

Normally strings are stored on the heap since they're dynamically sized. In Rust a String consists of three fields, each of which are the size of a usize. e.g. its layout is something like the following:

String: [ ptr<8> | len<8> | cap<8> ]

  1. ptr is a pointer to a location on the heap that stores the string
  2. len is the length of the string
  3. cap is the total capacity of the buffer being pointed to

This results in 24 bytes being stored on the stack, 8 bytes for each field. Then the actual string is stored on the heap, usually with additional memory allocated to prevent re-allocating if the string is mutated.

The idea of CompactString is instead of storing metadata on the stack, just store the string itself. This way for smaller strings we save a bit of memory, and we don't have to heap allocate so it's more performant. A CompactString is limited to 24 bytes (aka size_of::<String>()) so it won't ever use more memory than a String would.

The memory layout of a CompactString looks something like:

CompactString: [ buffer<23> | len<1> ]

Memory Layout

Internally a CompactString has two variants:

  1. Inline, a string <= 24 bytes long
  2. Heap allocated, a string > 24 bytes long

We define a discriminant (aka track which variant we are) within the last byte, specifically:

  1. 0b11111110 - All 1s with a trailing 0, indicates heap allocated
  2. 0b11XXXXXX - Two leading 1s, indicates inline, with the trailing 6 bits used to store the length

and the overall memory layout of a CompactString is:

  1. heap: { ptr: NonNull<u8>, len: usize, cap: Capacity }
  2. inline: { buffer: [u8; 24] }

Both variants are 24 bytes long

For heap allocated strings we use a custom HeapBuffer which normally stores the capacity of the string on the stack, but also optionally allows us to store it on the heap. Since we use the last byte to track our discriminant, we only have 7 bytes to store the capacity, or 3 bytes on a 32-bit architecture. 7 bytes allows us to store a value up to 2^56, aka 64 petabytes, while 3 bytes only allows us to store a value up to 2^24, aka 16 megabytes.

For 64-bit architectures we always inline the capacity, because we can safely assume our strings will never be larger than 64 petabytes, but on 32-bit architectures, when creating or growing a CompactString, if the text is larger than 16MB then we move the capacity onto the heap.

We handle the capacity in this way for two reasons:

  1. Users shouldn't have to pay for what they don't use. Meaning, in the majority of cases the capacity of the buffer could easily fit into 7 or 3 bytes, so the user shouldn't have to pay the memory cost of storing the capacity on the heap, if they don't need to.
  2. Allows us to convert From<String> in O(1) time, by taking the parts of a String (e.g. ptr, len, and cap) and using those to create a CompactString, without having to do any heap allocations. This is important when using CompactString in large codebases where you might have CompactString working alongside of String.

For inline strings we only have a 24 byte buffer on the stack. This might make you wonder how can we store a 24 byte long string, inline? Don't we also need to store the length somewhere?

To do this, we utilize the fact that the last byte of our string could only ever have a value in the range [0, 192). We know this because all strings in Rust are valid UTF-8, and the only valid byte pattern for the last byte of a UTF-8 character (and thus the possible last byte of a string) is 0b0XXXXXXX aka [0, 128) or 0b10XXXXXX aka [128, 192). This leaves all values in [192, 255] as unused in our last byte. Therefore, we can use values in the range of [192, 215] to represent a length in the range of [0, 23], and if our last byte has a value < 192, we know that's a UTF-8 character, and can interpret the length of our string as 24.

Specifically, the last byte on the stack for a CompactString has the following uses:

  • [0, 191] - Is the last byte of a UTF-8 char, the CompactString is stored on the stack and implicitly has a length of 24
  • [192, 215] - Denotes a length in the range of [0, 23], this CompactString is stored on the stack.
  • 216 - Denotes this CompactString is stored on the heap
  • 217 - Denotes this CompactString stores a &'static str.
  • [218, 255] - Unused, denotes e.g. the None variant for Option<CompactString>

Testing

Strings and unicode can be quite messy, even further, we're working with things at the bit level. compact_str has an extensive test suite comprised of unit testing, property testing, and fuzz testing, to ensure our invariants are upheld. We test across all major OSes (Windows, macOS, and Linux), architectures (64-bit and 32-bit), and endian-ness (big endian and little endian).

Fuzz testing is run with libFuzzer, AFL++, and honggfuzz, with AFL++ running on both x86_64 and ARMv7 architectures. We test with miri to catch cases of undefined behavior, and run all tests on every Rust compiler since v1.59 to ensure support for our minimum supported Rust version (MSRV).

unsafe code

CompactString uses a bit of unsafe code because we manually define what variant we are, so unlike an enum, the compiler can't guarantee what value is actually stored. We also have some manually implemented heap data structures, i.e. HeapBuffer, and mess with bytes at a bit level, to make the most out of our resources. That being said, uses of unsafe code in this library are constrained to only where absolutely necessary, and always documented with // SAFETY: <reason>.

Similar Crates

Storing strings on the stack is not a new idea, in fact there are a few other crates in the Rust ecosystem that do similar things, an incomplete list:

  1. smol_str - Can inline 22 bytes, Clone is O(1), doesn't adjust for 32-bit archs
  2. smartstring - Can inline 23 bytes, Clone is O(n), is mutable
  3. kstring - Can inline 15 or 22 bytes dependent on crate features, Clone is O(1), can also store &'static strs
  4. flexstr - Can inline 22 bytes, Clone is O(1), can also store &'static strs

Thanks for readingme!

compact_str's People

Contributors

cad97 avatar dependabot[bot] avatar dragazo avatar kijewski avatar matklad avatar mcronce avatar mishrasamiksha avatar neoeinstein avatar nilstrieb avatar njaard avatar nobodyxu avatar overlookmotel avatar parkmycar avatar tylerhawkes avatar vipentti avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

compact_str's Issues

Is it possible to make Option<CompactStr> same size as Option<String>?

It would be very beneficial for me to reduce size of Option<CompactStr>. I use lots of it in my projects.

I made some tests, and it looks like Option<ComactStr> takes additional 8 bytes in compaction to Option<String>.

Rust has null pointer optimization (see https://doc.rust-lang.org/std/option/index.html#representation) but I not sure if it can by used in userland crate.

sizeofs of popular Strings.

String: 24 bytes
Option<String>: 24 bytes
CompactStr: 24 bytes
Option<CompactStr>: 32 bytes
KString: 24 bytes
Option<KString>: 24 bytes
SmartString<Compact>: 24 bytes
Option<SmartString<Compact>>: 32 bytes
SmolStr: 24 bytes
Option<SmolStr>: 24 bytes

feature: Immutable shared version of CompactStr?

Having a mutable CompactStr is great since it can be a drop-in replacement of String.

However, the original immutable CompactStr provides O(1) clone, which is super useful for immutable string, which is also super useful.

One of the issue with immutable CompactStr is that it makes #16 harder without performance penalty, as std::sync::Arc does not have stable API for MaybeUninit.

But that can be solved with triomphe::Arc as it provides new_uninit_slice methods for creating an uninitialized slice, which makes implementation of #16 possible without any performance penalty.

Also triomphe::Arc is 8 bytes smaller than std::sync::Arc, as it does not support weak reference, which is not used in this crate.

Improve benchmarks

The benchmarks should also include a std::string::String and be parameterized on string length instead of testing a few different variants

Use static verifier kani in our test?

kani can verify that unsafe code does not have:

  • Memory safety (e.g., null pointer dereferences)
  • User-specified assertions (i.e., assert!(...))
  • The absence of panics (e.g., unwrap() on None values)
  • The absence of some types of unexpected behavior (e.g., arithmetic overflows)

feat: Implement the clear() API

This issue for tracking the implementation of the clear() method that exists on String.

We should implement it exactly as it exists on String, that is, it removes all the contents of the CompactString setting the length to 0, but does not touch the capacity.

Consider mutable representation for heap variant

Currently, the heap variant is Arc<str>. This has a couple drawbacks:

  • Arc wastes code/space for weak pointer support which we don't use
  • mutating the string requires a full copy of data
  • (need to re-check this) pointer in Arc<str> points to the start of the ArcInner, not to the start of str, so there's extra offset during deref.

An ideal representation would be to store *const str, which points:

                             here
                               V
| rc: AtomicUsize | cap: usize | str byte data | spare capacity |

This should allow us to just mutate the string just fine (using Arc::make_mut-like API).

Admittedly, this increases the complexity of the implementation quit a bit.

Support larger inline strings

With the recent advancements in const generics, it's theoretically possible to support a CompactString with a user defined size that is > 24, e.g. 32, 40, 48 bytes long. We should try doing this and playing with an API

Add more maintainers?

I think it is unfair to put all the burden of maintaining this crate to @ParkMyCar .

Thus I propose that we should add a few new maintainers and perhaps also move this repository into an organisation.

I volunteer to be one and I will help review the PRs and answer issues.

feature: easy integer to CompactStr

TL;DR: 1234.to_string() but without heap allocation (and without std::fmt).

My use case are file-descriptor which I need to pass to a Command (and do other things with them before) so I need to have them as a &str instead of an i32/RawFd. It would be great if compact_str would support creation from an integer type (possible with itoa).

If you think this does not match the goal of the project fell free to close.

Document minimal supported Rust version

Currently there's no MSRV on CI / in the docs. That's a useful thing to have though. I suggest waiting until 1.56.0 is released (<6 weeks), and then making that an MSRV, so that we can use edition = 2021.

impl PartialEq<CompactStr> for &str

Example that works:

use compact_str::CompactStr;
fn main() {
    dbg!(CompactStr::from("hello") == "hello");
}

Example that don't work:

use compact_str::CompactStr;
fn main() {
    dbg!("hello" == CompactStr::from("hello"));  
}

Error:

error[E0277]: can't compare `&str` with `CompactStr`
 --> src/main.rs:4:18
  |
4 |     dbg!("hello" == CompactStr::from("hello"));
  |                  ^^ no implementation for `&str == CompactStr`
  |
  = help: the trait `PartialEq<CompactStr>` is not implemented for `&str`

For more information about this error, try `rustc --explain E0277`.
error: could not compile `test01` due to previous error

I think it shouldn't make any differences whether you want to compare &str with CompactStr, &str with &CompactStr, CompactStr with &str or &CompactStr with &str.

feat: Implement the drain(range) API

This issue for tracking the implementation of the drain(range) method that exists on String.

We should implement it exactly as it exists on String, that is, it removes the specified range from the CompactString, returning all the removed characters as an iterator.

feature: truncate

Great work on this library! So far substituting String with CompactString has given immediate performance improvements in the projects I've tested it on.

The only unfortunate thing I ran into was that CompactString has no truncate method like that of String. Implementing it user-side is a bit scary because some unsafe is likely involved. Is there any chance it could be added in a future update?

Simple example using the String api:

    let mut s = String::from("Hello World!\n\n");
    assert!(s.len() == 14);
    
    s.truncate(s.trim_end().len());
    assert!(s.len() == 12);

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.