
artichoke / intaglio


UTF-8 string, byte string, and C string interner

Home Page: https://crates.io/crates/intaglio

License: MIT License

Rust 98.64% Ruby 1.36%
string-interning bytes utf-8 symbol artichoke symbol-table rust rust-crate interner

intaglio's Introduction

intaglio


UTF-8 string and byte string interner and symbol table. Used to implement storage for the Ruby Symbol table and the constant name table in Artichoke Ruby.

Symbol objects represent names and some strings inside the Ruby interpreter. They are generated using the :name and :"string" literals syntax, and by the various to_sym methods. The same Symbol object will be created for a given name or string for the duration of a program's execution, regardless of the context or meaning of that name.

Intaglio is a UTF-8 and byte string interner, which means it stores a single copy of an immutable &str or &[u8] that can be referred to by a stable u32 token.

Interned strings and byte strings are cheap to compare and copy because they are represented as a u32 integer.
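To make the "cheap to compare and copy" claim concrete, here is a minimal sketch of the interning idea using only the standard library. `ToySymbol` and `ToyInterner` are hypothetical names invented for this illustration; they are not intaglio's types, and this naive version copies every string, which intaglio is designed to avoid.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an interner token; not intaglio's `Symbol` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ToySymbol(u32);

/// A deliberately naive interner that stores each string twice.
#[derive(Default)]
pub struct ToyInterner {
    ids: HashMap<String, ToySymbol>,
    strings: Vec<String>,
}

impl ToyInterner {
    /// Returns the existing token for `s`, or assigns the next free one.
    pub fn intern(&mut self, s: &str) -> ToySymbol {
        if let Some(&sym) = self.ids.get(s) {
            return sym;
        }
        let sym = ToySymbol(self.strings.len() as u32);
        self.strings.push(s.to_owned());
        self.ids.insert(s.to_owned(), sym);
        sym
    }

    /// Resolves a token back to the string it was created from.
    pub fn get(&self, sym: ToySymbol) -> Option<&str> {
        self.strings.get(sym.0 as usize).map(String::as_str)
    }
}

/// Interns a few strings and checks that tokens behave like plain integers.
pub fn compare_tokens() -> bool {
    let mut table = ToyInterner::default();
    let a = table.intern("abc");
    let b = table.intern("abc");
    let c = table.intern("xyz");
    // Tokens are `Copy`, and equality is a single integer comparison.
    let copy = a;
    a == b && a != c && copy == a && table.get(c) == Some("xyz")
}
```

Once a string has been interned, later comparisons never touch the string data; only the 4-byte token is compared.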

Intaglio is an alternate name for an engraved gem, a gemstone that has been carved with an image. The Intaglio crate is used to implement an immutable Symbol store in Artichoke Ruby.

Usage

Add this to your Cargo.toml:

[dependencies]
intaglio = "1.9.1"

Then intern UTF-8 strings like:

fn intern_and_get() -> Result<(), Box<dyn std::error::Error>> {
    let mut table = intaglio::SymbolTable::new();
    let name: &'static str = "abc";
    let sym = table.intern(name)?;
    let retrieved = table.get(sym);
    assert_eq!(Some(name), retrieved);
    assert_eq!(sym, table.intern("abc".to_string())?);
    Ok(())
}

Or intern byte strings like:

fn intern_and_get() -> Result<(), Box<dyn std::error::Error>> {
    let mut table = intaglio::bytes::SymbolTable::new();
    let name: &'static [u8] = b"abc";
    let sym = table.intern(name)?;
    let retrieved = table.get(sym);
    assert_eq!(Some(name), retrieved);
    assert_eq!(sym, table.intern(b"abc".to_vec())?);
    Ok(())
}

Or intern C strings like:

use std::ffi::{CStr, CString};

fn intern_and_get() -> Result<(), Box<dyn std::error::Error>> {
    let mut table = intaglio::cstr::SymbolTable::new();
    let name: &'static CStr = CStr::from_bytes_with_nul(b"abc\0")?;
    let sym = table.intern(name)?;
    let retrieved = table.get(sym);
    assert_eq!(Some(name), retrieved);
    assert_eq!(sym, table.intern(CString::new(*b"abc")?)?);
    Ok(())
}

Or intern platform strings like:

use std::ffi::{OsStr, OsString};

fn intern_and_get() -> Result<(), Box<dyn std::error::Error>> {
    let mut table = intaglio::osstr::SymbolTable::new();
    let name: &'static OsStr = OsStr::new("abc");
    let sym = table.intern(name)?;
    let retrieved = table.get(sym);
    assert_eq!(Some(name), retrieved);
    assert_eq!(sym, table.intern(OsString::from("abc"))?);
    Ok(())
}

Or intern path strings like:

use std::path::{Path, PathBuf};

fn intern_and_get() -> Result<(), Box<dyn std::error::Error>> {
    let mut table = intaglio::path::SymbolTable::new();
    let name: &'static Path = Path::new("abc");
    let sym = table.intern(name)?;
    let retrieved = table.get(sym);
    assert_eq!(Some(name), retrieved);
    assert_eq!(sym, table.intern(PathBuf::from("abc"))?);
    Ok(())
}

Implementation

Intaglio interns owned and borrowed strings with no additional copying by leveraging Cow and a bit of unsafe code. CI runs drop tests under Miri and LeakSanitizer.
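As a rough illustration of how `Cow` lets one entry point accept both owned and borrowed strings without copying, consider the sketch below. The functions `describe` and `keeps_buffer` are hypothetical helpers written for this example; they do not show intaglio's internals or its unsafe code.

```rust
use std::borrow::Cow;

/// Hypothetical helper, not part of intaglio: accepts anything convertible to
/// `Cow<'static, str>` and reports which variant the conversion produced.
pub fn describe<T>(name: T) -> &'static str
where
    T: Into<Cow<'static, str>>,
{
    match name.into() {
        // A `&'static str` is stored as a borrow; nothing is copied.
        Cow::Borrowed(_) => "borrowed",
        // A `String` is moved in; its existing heap buffer is reused.
        Cow::Owned(_) => "owned",
    }
}

/// Checks that converting a `String` into a `Cow` keeps the same buffer.
pub fn keeps_buffer(s: String) -> bool {
    let before = s.as_ptr();
    let cow: Cow<'static, str> = s.into();
    cow.as_ptr() == before
}
```

Both conversions are moves, so neither a `&'static str` nor a `String` has its bytes duplicated on the way in.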

Crate features

All features are enabled by default.

  • bytes - Enables an additional symbol table implementation for interning byte strings (Vec<u8> and &'static [u8]).
  • cstr - Enables an additional symbol table implementation for interning C strings (CString and &'static CStr).
  • osstr - Enables an additional symbol table implementation for interning platform strings (OsString and &'static OsStr).
  • path - Enables an additional symbol table implementation for interning path strings (PathBuf and &'static Path).

Minimum Supported Rust Version

This crate requires at least Rust 1.58.0. This version can be bumped in minor releases.

License

intaglio is licensed under the MIT License (c) Ryan Lopopolo.

intaglio's People

Contributors

artichoke-ci, cad97, dependabot-preview[bot], dependabot[bot], lopopolo

intaglio's Issues

Miri failure on rustc 1.73.0-nightly (31395ec38 2023-07-24)

https://github.com/artichoke/intaglio/actions/runs/5663120563/job/15344308090

error: Undefined Behavior: trying to retag from <99387> for SharedReadOnly permission at alloc31495[0x0], but that tag does not exist in the borrow stack for this location
   --> /home/runner/work/intaglio/intaglio/src/bytes.rs:713:45
    |
713 |         debug_assert_eq!(self.get(id), Some(slice));
    |                                             ^^^^^
    |                                             |
    |                                             trying to retag from <99387> for SharedReadOnly permission at alloc31495[0x0], but that tag does not exist in the borrow stack for this location
    |                                             this error occurs as part of retag at alloc31495[0x0..0x64]
    |
    = help: this indicates a potential bug in the program: it performed an invalid operation, but the Stacked Borrows rules it violated are still experimental
    = help: see https://github.com/rust-lang/unsafe-code-guidelines/blob/master/wip/stacked-borrows.md for further information
help: <99387> was created by a SharedReadOnly retag at offsets [0x0..0x64]
   --> /home/runner/work/intaglio/intaglio/src/bytes.rs:708:30
    |
708 |         let slice = unsafe { name.as_static_slice() };
    |                              ^^^^^^^^^^^^^^^^^^^^^^
help: <99387> was later invalidated at offsets [0x0..0x64] by a Unique retag (of a reference/box inside this compound value)
   --> /home/runner/work/intaglio/intaglio/src/bytes.rs:711:23
    |
711 |         self.vec.push(name);
    |                       ^^^^
    = note: BACKTRACE (of the first span):
    = note: inside `intaglio::bytes::SymbolTable::intern::<std::vec::Vec<u8>>` at /home/runner/work/intaglio/intaglio/src/bytes.rs:713:45: 713:50
note: inside `bytes::dealloc_owned_data`
   --> tests/leak_drop/bytes.rs:9:22
    |
9   |         let sym_id = table.intern(symbol.clone()).unwrap();
    |                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside closure
   --> tests/leak_drop/bytes.rs:4:25
    |
3   | #[test]
    | ------- in this procedural macro expansion
4   | fn dealloc_owned_data() {
    |                         ^
    = note: this error originates in the attribute macro `test` (in Nightly builds, run with -Z macro-backtrace for more info)

note: some details are omitted, run with `MIRIFLAGS=-Zmiri-backtrace=full` for a verbose backtrace

error: aborting due to previous error

error: test failed, to rerun pass `--test leak_drop`

Version 1.9.0 seems to drop Send trait

In upgrading to version 1.9.0, we hit a break with the following compilation error:

325 | impl NFSFileSystem for MirrorFS {
    |                        ^^^^^^^^ `NonNull<OsStr>` cannot be sent between threads safely
    |
    = help: within `intaglio::internal::Interned<OsStr>`, the trait `Send` is not implemented for `NonNull<OsStr>`

This seems to be a regression from 1.8.*.

This is with Cargo/rustc version 1.69.
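For context, `NonNull<T>` is never `Send` on its own, so any struct that holds one loses the automatic `Send` impl unless the author opts back in. The sketch below shows that mechanism with a hypothetical `Handle` type; it is loosely modeled on the `Interned<OsStr>` named in the error, and it is an assumption about the shape of the problem, not intaglio's actual code or its fix.

```rust
use std::ptr::NonNull;

/// Hypothetical wrapper for this sketch; intaglio's real type may differ.
pub struct Handle(NonNull<str>);

// SAFETY (assumption for this sketch): the pointer is only ever read, and it
// points at `'static` data, so moving the handle to another thread is sound.
unsafe impl Send for Handle {}

impl Handle {
    pub fn new(s: &'static str) -> Self {
        Handle(NonNull::from(s))
    }

    pub fn as_str(&self) -> &str {
        // SAFETY: constructed from a `&'static str`, so the pointer is valid.
        unsafe { self.0.as_ref() }
    }
}

/// Compiles only if `T: Send`; useful as a regression check in tests.
pub fn assert_send<T: Send>() {}

/// Moves a handle into a worker thread and reads it there.
pub fn handle_crosses_threads() -> String {
    assert_send::<Handle>();
    let handle = Handle::new("abc");
    std::thread::spawn(move || handle.as_str().to_owned())
        .join()
        .expect("worker thread panicked")
}
```

Without the `unsafe impl Send` line, both `assert_send::<Handle>()` and the `thread::spawn` call fail to compile with an error of the same kind as the one reported above.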

Add a `PathBuf`/`Path` interner

Following up on #117, let's flesh out the remaining owned string/slice pairs in std.

This should be pretty easy to add and would clean up a lot of code in artichoke-backend which deals with mruby expecting NUL-terminated symbols in its own way.

Things to do:

  • Implement Slice for Path in internal.rs.
  • Copy-paste the source in bytes.rs to path.rs.
  • Fixup the argument types and docs.
  • Add a path feature for exposing this new interner.

Using `NonZeroU32` as the inner type for `Symbol`

Hi, thanks for the great library. I've been working on a compiler that has many Option<Symbol>s scattered throughout. Each one takes 8 bytes to store: 4 bytes for the u32, and 4 bytes for the discriminant plus padding.

If you replace u32 with NonZeroU32, the compiler can see that the all-0 bit pattern is not valid, so it can use that for the None case, saving 4 bytes:

use std::num::NonZeroU32;

fn main() {
    println!("{}", std::mem::size_of::<u32>());
    println!("{}", std::mem::size_of::<Option<u32>>());
    println!("{}", std::mem::size_of::<NonZeroU32>());
    println!("{}", std::mem::size_of::<Option<NonZeroU32>>());
}

prints:

4
8
4
4

I've already made this change locally, and it works well, so I'd like to upstream it (if it's useful to other people). However, this would be a breaking change, and it feels hard to justify a breaking change for a small (but noticeable) perf improvement.

Currently, I see 3 ways of doing this:

  • change the function signatures to accept NonZeroU32 instead of u32 (and replace some From impls with TryFrom)
  • make the affected APIs panic, and add docs saying that they cannot take 0 as a parameter
  • make the APIs simply use u32::MAX when passed 0 (or some other "random" constant)

The first option seems ideal, but is a breaking change to the API signature. The second is what I've implemented in my local change, but is also a breaking change, just without the compiler errors, which is arguably worse. The third option just makes me feel really uneasy.
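The first option could look roughly like the sketch below. `NicheSymbol` is a hypothetical type written for this illustration, assuming a `TryFrom<u32>` conversion replaces the infallible `From<u32>`; it is not intaglio's `Symbol`, and it is not the change described above.

```rust
use std::convert::TryFrom;
use std::num::{NonZeroU32, TryFromIntError};

/// Hypothetical symbol type for this sketch, backed by `NonZeroU32`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct NicheSymbol(NonZeroU32);

impl TryFrom<u32> for NicheSymbol {
    type Error = TryFromIntError;

    /// Fails for 0, which is reserved as the `None` niche.
    fn try_from(id: u32) -> Result<Self, Self::Error> {
        NonZeroU32::try_from(id).map(NicheSymbol)
    }
}

impl From<NicheSymbol> for u32 {
    /// Converting back to a plain integer stays infallible.
    fn from(sym: NicheSymbol) -> u32 {
        sym.0.get()
    }
}

/// Size of an optional symbol in bytes; the niche removes the discriminant.
pub fn optional_symbol_size() -> usize {
    std::mem::size_of::<Option<NicheSymbol>>()
}
```

With this shape, callers that pass 0 get an error they must handle at the call site, which is the compile-time visibility the panicking and remapping options lack.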

It's also worth mentioning that I'm not at all familiar with Ruby, and I understand this project is part of a system that needs to be compatible with existing versions of Ruby in some way. If there is a reason why NonZeroU32 is inappropriate here, then it probably makes sense to just fork it. I came across this repo in a comparison of various string interners, and it was the only one that supported PathBuf.

I'd be happy to PR the changes that I've made, but before going further, I thought it would be best to check what the best way forward would be.

Thanks

Miri flag passing changed

cargo miri test changed to be more compatible with cargo test, but this means we had to find a different way to pass flags to the interpreter, so there now is a MIRIFLAGS environment variable for that. The old way still works for now, but is deprecated.

This affects the following line:

run: cargo miri test -- -- drop

This should now be cargo miri test drop, which runs the same tests as cargo test drop.

Add a `CString`/`CStr` interner

This should be pretty easy to add and would clean up a lot of code in artichoke-backend which deals with mruby expecting NUL-terminated symbols in its own way.

Things to do:

  • Implement Slice for CStr in internal.rs.
  • Copy-paste the source in bytes.rs to cstr.rs.
  • Fixup the argument types and docs.
  • Add a cstr feature for exposing this new interner.

Add an `OsString`/`OsStr` interner

Following up on #117, let's flesh out the remaining owned string/slice pairs in std.

This should be pretty easy to add and would clean up a lot of code in artichoke-backend which deals with mruby expecting NUL-terminated symbols in its own way.

Things to do:

  • Implement Slice for OsStr in internal.rs.
  • Copy-paste the source in bytes.rs to osstr.rs.
  • Fixup the argument types and docs.
  • Add an osstr feature for exposing this new interner.
