
Modularizing the crate (nlprule, open, 4 comments)

bminixhofer commented on May 18, 2024

Modularizing the crate

Comments (4)

sai-prasanna commented on May 18, 2024

I think the spaCy Python library also deals with these modularization questions. It has a layered pipeline architecture that could be worth checking out: https://spacy.io/usage/processing-pipelines


bminixhofer commented on May 18, 2024

One interesting thing this would enable is using more sophisticated ML approaches (e.g. modern Transformers) as suggestion producers. This is not possible at the moment because it fundamentally clashes with the portability goals of nlprule.


drahnr commented on May 18, 2024

I like the write-up and the direction this is pointing in!

If some crates don't make sense without others, re-export them behind a feature flag.

Feel free to ping me for feedback on such a PR :)


bminixhofer commented on May 18, 2024

Thanks @sai-prasanna, spaCy is definitely a good example of this being solved well.

I will move forward with some of the higher-level changes mentioned here before merging #51, since, as @drahnr pointed out, spellchecking would benefit a lot from better modularization being in place first. (Also, I don't currently need nlprule in a project myself, so I'm in no rush to add spellchecking.)

In particular, properly addressing the complexity of combining multiple suggestion providers, which may or may not need some preprocessing (e.g. in the form of the Tokenizer), requires something that can represent a tree structure. I believe the following would work:

let correcter = Pipeline::new((
    tokenizer!("en"),
    Union::new((rules!("en"), spell!("en", "en_GB"))),
));

where Union and Pipeline are completely generic and check compatibility at compile time. The individual components would then have to be completely separated from each other, which should be quite easy. So this is part one.
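To make the compile-time-checked composition concrete, here is a minimal sketch of how such generic Pipeline and Union types could be typed with a trait and associated types. None of this is nlprule's actual API; the Component trait, the toy Tokenizer/Rules/Spell components, and their input/output types are all assumptions for illustration.

```rust
// Hypothetical sketch; `Component`, `Pipeline`, and `Union` as written here
// are NOT nlprule's real API, just an illustration of the typing idea.

/// One stage of processing: consumes an input, produces an output.
trait Component {
    type Input;
    type Output;
    fn apply(&self, input: Self::Input) -> Self::Output;
}

/// Chains two components; compiles only when the first component's output
/// type matches the second component's input type.
struct Pipeline<A, B>(A, B);

impl<A, B> Component for Pipeline<A, B>
where
    A: Component,
    B: Component<Input = A::Output>,
{
    type Input = A::Input;
    type Output = B::Output;
    fn apply(&self, input: Self::Input) -> Self::Output {
        self.1.apply(self.0.apply(input))
    }
}

/// Feeds the same input to two suggestion producers and merges the results.
struct Union<A, B>(A, B);

impl<I, S, A, B> Component for Union<A, B>
where
    I: Clone,
    A: Component<Input = I, Output = Vec<S>>,
    B: Component<Input = I, Output = Vec<S>>,
{
    type Input = I;
    type Output = Vec<S>;
    fn apply(&self, input: Self::Input) -> Self::Output {
        let mut suggestions = self.0.apply(input.clone());
        suggestions.extend(self.1.apply(input));
        suggestions
    }
}

// Toy stand-ins for the real tokenizer, rule set, and spellchecker.
struct Tokenizer;
impl Component for Tokenizer {
    type Input = &'static str;
    type Output = Vec<String>;
    fn apply(&self, text: &'static str) -> Vec<String> {
        text.split_whitespace().map(String::from).collect()
    }
}

struct Rules;
impl Component for Rules {
    type Input = Vec<String>;
    type Output = Vec<String>;
    fn apply(&self, tokens: Vec<String>) -> Vec<String> {
        // Suggest "the" for each occurrence of the typo "teh".
        tokens.iter().filter(|t| t.as_str() == "teh").map(|_| "the".into()).collect()
    }
}

struct Spell;
impl Component for Spell {
    type Input = Vec<String>;
    type Output = Vec<String>;
    fn apply(&self, tokens: Vec<String>) -> Vec<String> {
        // Suggest the en_GB spelling for "color".
        tokens.iter().filter(|t| t.as_str() == "color").map(|_| "colour".into()).collect()
    }
}

fn main() {
    // Mirrors the shape proposed above: tokenize once, then fan out.
    let correcter = Pipeline(Tokenizer, Union(Rules, Spell));
    assert_eq!(correcter.apply("teh color"), vec!["the", "colour"]);
}
```

A mismatched composition, e.g. `Pipeline(Rules, Tokenizer)`, would fail to compile because the associated types don't line up, which is the compile-time compatibility check described above.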


Once spellchecking is implemented, it would add a third binary. I'm not happy with the way binaries are currently included, for a couple of reasons:

  • it is very verbose; needing

let mut tokenizer_bytes: &'static [u8] = include_bytes!(concat!(
    env!("OUT_DIR"),
    "/",
    tokenizer_filename!("en")
));

for a single binary is far too unergonomic.

  • it needs distribution via GitHub releases to be reliable, which has been an issue in the past.
  • it requires the user to add a build.rs.
  • it pushes fetching build directories from Backblaze into application code in nlprule-build. I don't want to guarantee that Backblaze is reliably up either.
  • it pushes inclusion of binaries downstream, which e.g. in cargo-spellcheck was solved by storing the binaries in-tree; that is problematic both in itself and because of the 10MB limit on crates.io.

So, to fix this, I want to deprecate nlprule-build and store the binary source on crates.io instead (it will also stay on GitHub releases for the Python bindings). The crates.io team has kindly already increased the limit for nlprule to 100MB to enable this. That makes it possible to have macros, e.g. tokenizer!("en") and tokenizer_bytes!("en"), which directly include the correct binary at compile time in one line of code. This is part two.
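The macro mechanics could look roughly like the sketch below. The path layout (`data/<lang>_tokenizer.bin`) and the `tokenizer_path!` helper are assumptions for illustration, not nlprule's actual file layout or implementation:

```rust
// Hypothetical sketch of the macro mechanics; the real macro names and the
// in-crate file layout (assumed here: data/<lang>_tokenizer.bin) may differ.
macro_rules! tokenizer_path {
    ($lang:literal) => {
        // `concat!` builds the path string at compile time.
        concat!("data/", $lang, "_tokenizer.bin")
    };
}

// A `tokenizer_bytes!` macro could then wrap `include_bytes!`, embedding the
// binary shipped inside the crate in one line, with no build.rs and no
// download step (only compiles once the data files actually exist):
//
//     macro_rules! tokenizer_bytes {
//         ($lang:literal) => {
//             include_bytes!(concat!(
//                 env!("CARGO_MANIFEST_DIR"),
//                 "/data/",
//                 $lang,
//                 "_tokenizer.bin"
//             )) as &'static [u8]
//         };
//     }

fn main() {
    // The macro expands to a fixed `&'static str` at compile time.
    assert_eq!(tokenizer_path!("en"), "data/en_tokenizer.bin");
}
```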

The key tradeoff here is more convenience for less customizability. In particular, one change I believe I'll have to make is that the binaries will be internally gzipped (together with #20 they'll be a .tar.gz) to avoid having to store multiple binaries, since the user will almost always want this.

I would, as always, appreciate feedback on both of these. Neither is set in stone; this is just what I currently think is the best solution.

I'll open PRs once I have an implementation; in the meantime, I think it makes sense to keep the discussion in this issue, even though the two parts are only tangentially related.

Also note that part one implies changes to the Python bindings as well, the only difference being that correct composition is not checked at compile time (obviously). This would look very similar to pipelines in sklearn.

