
Modularizing the crate (nlprule, open, 4 comments)

bminixhofer commented on May 18, 2024

Modularizing the crate

Comments (4)

sai-prasanna commented on May 18, 2024

I think the spaCy Python library also deals with these modularization questions. It has a layered pipeline architecture that could be worth checking out: https://spacy.io/usage/processing-pipelines


bminixhofer commented on May 18, 2024

One interesting thing this would enable is using more sophisticated ML approaches (e.g. modern Transformers) as suggestion producers. This is not possible at the moment because it fundamentally clashes with the portability goals of nlprule.


drahnr commented on May 18, 2024

I like the write-up and the direction this is pointing in!

If some crates don't make sense without others, re-export them behind a feature flag.

Feel free to ping me for feedback on such a PR :)


bminixhofer commented on May 18, 2024

Thanks @sai-prasanna, spaCy is definitely a good example of this being solved well.

I will move forward with some of the higher-level changes mentioned here before merging #51, since, as @drahnr pointed out, spellchecking would benefit a lot from better modularization being in place first. (Also, I don't currently need nlprule in a project myself, so I'm in no rush to add spellchecking.)

In particular, properly addressing the complexity of combining multiple suggestion providers, which may or may not need some preprocessing (e.g. in the form of the Tokenizer), requires something that can represent a tree structure. I believe the following would work:

let correcter = Pipeline::new((
    tokenizer!("en"),
    Union::new((rules!("en"), spell!("en", "en_GB"))),
));

where Union and Pipeline are completely generic and check compatibility at compile time. The individual components would then have to be completely separated from each other, which should be quite easy. So this is part one.
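To make the compile-time-checked composition concrete, here is a minimal sketch of how such generic Pipeline and Union types could be typed with a trait and associated types. None of this is nlprule's actual API; the Component trait, the toy Tokenizer/Rules/Spell components, and their input/output types are all assumptions for illustration.

```rust
// Hypothetical sketch; `Component`, `Pipeline`, and `Union` as written here
// are NOT nlprule's real API, just an illustration of the typing idea.

/// One stage of processing: consumes an input, produces an output.
trait Component {
    type Input;
    type Output;
    fn apply(&self, input: Self::Input) -> Self::Output;
}

/// Chains two components; compiles only when the first component's output
/// type matches the second component's input type.
struct Pipeline<A, B>(A, B);

impl<A, B> Component for Pipeline<A, B>
where
    A: Component,
    B: Component<Input = A::Output>,
{
    type Input = A::Input;
    type Output = B::Output;
    fn apply(&self, input: Self::Input) -> Self::Output {
        self.1.apply(self.0.apply(input))
    }
}

/// Feeds the same input to two suggestion producers and merges the results.
struct Union<A, B>(A, B);

impl<I, S, A, B> Component for Union<A, B>
where
    I: Clone,
    A: Component<Input = I, Output = Vec<S>>,
    B: Component<Input = I, Output = Vec<S>>,
{
    type Input = I;
    type Output = Vec<S>;
    fn apply(&self, input: Self::Input) -> Self::Output {
        let mut suggestions = self.0.apply(input.clone());
        suggestions.extend(self.1.apply(input));
        suggestions
    }
}

// Toy stand-ins for the real tokenizer, rule set, and spellchecker.
struct Tokenizer;
impl Component for Tokenizer {
    type Input = &'static str;
    type Output = Vec<String>;
    fn apply(&self, text: &'static str) -> Vec<String> {
        text.split_whitespace().map(String::from).collect()
    }
}

struct Rules;
impl Component for Rules {
    type Input = Vec<String>;
    type Output = Vec<String>;
    fn apply(&self, tokens: Vec<String>) -> Vec<String> {
        // Suggest "the" for each occurrence of the typo "teh".
        tokens.iter().filter(|t| t.as_str() == "teh").map(|_| "the".into()).collect()
    }
}

struct Spell;
impl Component for Spell {
    type Input = Vec<String>;
    type Output = Vec<String>;
    fn apply(&self, tokens: Vec<String>) -> Vec<String> {
        // Suggest the en_GB spelling for "color".
        tokens.iter().filter(|t| t.as_str() == "color").map(|_| "colour".into()).collect()
    }
}

fn main() {
    // Mirrors the shape proposed above: tokenize once, then fan out.
    let correcter = Pipeline(Tokenizer, Union(Rules, Spell));
    assert_eq!(correcter.apply("teh color"), vec!["the", "colour"]);
}
```

A mismatched composition, e.g. `Pipeline(Rules, Tokenizer)`, would fail to compile because the associated types don't line up, which is the compile-time compatibility check described above.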


Once spellchecking is implemented, it would add a third binary. I'm not happy with the way binaries are currently included, for a couple of reasons:

  • it is very verbose; needing

let mut tokenizer_bytes: &'static [u8] = include_bytes!(concat!(
    env!("OUT_DIR"),
    "/",
    tokenizer_filename!("en")
));

for a single binary is far too unergonomic.

  • it needs distribution via GitHub releases to be reliable, which has been an issue in the past.
  • it requires the user to add a build.rs.
  • it pushes fetching build directories from Backblaze into application code in nlprule-build. I don't want to guarantee that Backblaze is reliably up either.
  • it pushes inclusion of binaries downstream, which e.g. in cargo-spellcheck was solved by storing the binaries in-tree; that is problematic both in itself and because of the 10MB limit on crates.io.

So, to fix this, I want to deprecate nlprule-build and store the binary source on crates.io instead (it will also stay on GitHub releases for the Python bindings). The crates.io team has kindly already increased the limit for nlprule to 100MB to enable this. That makes it possible to have macros, e.g. tokenizer!("en") and tokenizer_bytes!("en"), which directly include the correct binary at compile time in one line of code. This is part two.
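The macro mechanics could look roughly like the sketch below. The path layout (`data/<lang>_tokenizer.bin`) and the `tokenizer_path!` helper are assumptions for illustration, not nlprule's actual file layout or implementation:

```rust
// Hypothetical sketch of the macro mechanics; the real macro names and the
// in-crate file layout (assumed here: data/<lang>_tokenizer.bin) may differ.
macro_rules! tokenizer_path {
    ($lang:literal) => {
        // `concat!` builds the path string at compile time.
        concat!("data/", $lang, "_tokenizer.bin")
    };
}

// A `tokenizer_bytes!` macro could then wrap `include_bytes!`, embedding the
// binary shipped inside the crate in one line, with no build.rs and no
// download step (only compiles once the data files actually exist):
//
//     macro_rules! tokenizer_bytes {
//         ($lang:literal) => {
//             include_bytes!(concat!(
//                 env!("CARGO_MANIFEST_DIR"),
//                 "/data/",
//                 $lang,
//                 "_tokenizer.bin"
//             )) as &'static [u8]
//         };
//     }

fn main() {
    // The macro expands to a fixed `&'static str` at compile time.
    assert_eq!(tokenizer_path!("en"), "data/en_tokenizer.bin");
}
```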

The key tradeoff here is more convenience for less customizability. In particular, one change I believe I'll have to make is that the binaries will be internally gzipped (together with #20 they'll be a .tar.gz) to avoid having to store multiple binaries, since the user will almost always want this.

I would, as always, appreciate feedback on both of these. Neither is set in stone; this is just what I currently think is the best solution.

I'll open PRs once I have an implementation; in the meantime, I think it makes sense to keep the discussion in this issue, even though the two parts are only tangentially related.

Also note that part one implies changes to the Python bindings as well, the only difference being that correct composition is not checked at compile time (obviously). This would look very similar to pipelines in sklearn.

