Comments (4)
I think the spaCy Python library also deals with these modularization questions. It has a layered pipeline architecture which could be worth checking out: https://spacy.io/usage/processing-pipelines
from nlprule.
One interesting thing this would enable is using more sophisticated ML approaches (e.g. modern Transformers) as suggestion producers. This is not possible at the moment because it fundamentally clashes with the portability goals of nlprule.
I like the writeup and the direction this points!
If some crates don't make sense without others, re-export them behind a feature flag.
Feel free to ping me for feedback on such a PR :)
Thanks @sai-prasanna, spaCy is definitely a good example of this being solved well.
I will move forward with some of the higher-level changes mentioned here before merging #51 since, as @drahnr pointed out, spellchecking would benefit a lot from better modularization beforehand (also, I myself do not need nlprule in a project right now, so I'm not in a rush to add spellchecking).
In particular, properly addressing the complexity of combining multiple suggestion providers, which may or may not need some preprocessing (e.g. in the form of the `Tokenizer`), requires something that can represent a tree structure. I believe the following would work:
```rust
let correcter = Pipeline::new((
    tokenizer!("en"),
    Union::new((rules!("en"), spell!("en", "en_GB"))),
));
```
where the `Union` and `Pipeline` are completely generic and check compatibility at compile time. The individual components would then have to be completely separated from each other, which should be quite easily possible. So this is part one.
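To make the composition idea concrete, here is a minimal sketch of how generic, compile-time-checked composition could work. All names and trait shapes below (`Transform`, `Suggest`, the tuple-struct `Pipeline`/`Union`, and the toy components) are illustrative assumptions, not nlprule's actual API:

```rust
// A preprocessing stage (e.g. a tokenizer).
trait Transform {
    fn transform(&self, input: &str) -> Vec<String>;
}

// A suggestion provider operating on preprocessed input.
trait Suggest {
    fn suggest(&self, tokens: &[String]) -> Vec<String>;
}

// Chains a transform into a provider; the trait bounds enforce
// compatibility at compile time, so an incompatible combination
// simply does not type-check.
struct Pipeline<T, S>(T, S);

impl<T: Transform, S: Suggest> Pipeline<T, S> {
    fn correct(&self, input: &str) -> Vec<String> {
        self.1.suggest(&self.0.transform(input))
    }
}

// Runs two providers over the same preprocessed input and merges
// their suggestions; since it implements Suggest itself, unions nest.
struct Union<A, B>(A, B);

impl<A: Suggest, B: Suggest> Suggest for Union<A, B> {
    fn suggest(&self, tokens: &[String]) -> Vec<String> {
        let mut out = self.0.suggest(tokens);
        out.extend(self.1.suggest(tokens));
        out
    }
}

// Toy components, only to make the sketch runnable.
struct Whitespace;
impl Transform for Whitespace {
    fn transform(&self, input: &str) -> Vec<String> {
        input.split_whitespace().map(str::to_owned).collect()
    }
}

struct Upper;
impl Suggest for Upper {
    fn suggest(&self, tokens: &[String]) -> Vec<String> {
        tokens.iter().map(|t| t.to_uppercase()).collect()
    }
}

struct Count;
impl Suggest for Count {
    fn suggest(&self, tokens: &[String]) -> Vec<String> {
        vec![format!("{} tokens", tokens.len())]
    }
}

fn main() {
    let correcter = Pipeline(Whitespace, Union(Upper, Count));
    let out = correcter.correct("hello world");
    assert_eq!(out, vec!["HELLO", "WORLD", "2 tokens"]);
    println!("{:?}", out);
}
```

Because `Union` itself implements the provider trait, arbitrary trees of providers can be built while each component stays fully decoupled from the others.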
Once spellchecking is implemented, this would add a third binary. I'm not happy with the way binaries are currently included for a couple of reasons:
- it is very verbose:

  ```rust
  let mut tokenizer_bytes: &'static [u8] = include_bytes!(concat!(
      env!("OUT_DIR"),
      "/",
      tokenizer_filename!("en")
  ));
  ```

  for one binary is way too unergonomic.
- it needs distribution via GH releases to be reliable, which has been an issue in the past.
- it requires the user to add a `build.rs`.
- it pushes fetching build directories from Backblaze to application code in `nlprule-build`. I don't really want to guarantee that Backblaze is reliably up either.
- it pushes inclusion of binaries downstream, which e.g. in `cargo-spellcheck` was solved by storing the binaries in-tree, which is problematic in itself and because of the 10MB limit on crates.io.
So to fix this I want to deprecate `nlprule-build` and store the binary source on crates.io instead (it will also stay on GH releases for the Python bindings). The crates.io team has kindly already increased the limit for nlprule to 100MB to enable this. This then makes it possible to have macros, e.g. `tokenizer!("en")` and `tokenizer_bytes!("en")`, which directly include the correct binary at compile time in one line of code. This is part two.
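As a rough illustration of the ergonomics gain, a declarative macro can resolve a language code to a bundled path entirely at compile time; the real `tokenizer!`/`tokenizer_bytes!` macros would additionally wrap `include_bytes!` around such a path inside the published crate. The macro name and file layout below are made up for the sketch:

```rust
// Hypothetical sketch: map a language code to a file path shipped with
// the crate, at compile time. The naming scheme ("binaries/<lang>_...")
// is illustrative only; the real macro would wrap include_bytes! around
// this path so the bytes land in the final binary.
macro_rules! tokenizer_path {
    ($lang:literal) => {
        concat!("binaries/", $lang, "_tokenizer.bin")
    };
}

fn main() {
    // One line instead of the include_bytes!/concat!/env! boilerplate,
    // and no build.rs needed.
    let path: &'static str = tokenizer_path!("en");
    assert_eq!(path, "binaries/en_tokenizer.bin");
    println!("{path}");
}
```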
The key tradeoff here is convenience for less customizability. In particular, one change I believe I'll have to make is that the binaries will be internally gzipped (together with #20 they'll be a `.tar.gz`) to avoid having to store multiple binaries, since the user will almost always want the compressed form anyway.
I would as always appreciate feedback on both these things. They are not set in stone, just what I currently think is the best solution.
I'll open PRs once I have an implementation; in the meantime I think it makes sense to keep discussion in this issue even though the two parts are only tangentially related.
Also note that part one implies changes for the Python bindings as well, the only difference being that correct composition is not checked at compile time (obviously). This would look very similar to the pipelines in sklearn.