Comments (7)
I did some early comparisons in the meantime and my findings are roughly consistent with trishume/syntect#34: I get a ~15% slowdown from switching to fancy-regex. I'll do some more investigation. The vast majority of regexes nlprule runs should be delegated to the regex crate, but there might be a few fancy regexes which take most of the time and whose results could be cached.
I also already pre-compute lots of regex matches (making speed pretty much irrelevant for those) but there is room for improvement there too which should decrease the overall % of time spent on regex matching.
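To make the caching idea concrete, here is a minimal sketch, assuming the slow regex can be treated as a plain function from input text to bool. CachedMatcher and its fields are hypothetical names for illustration, not part of nlprule, and the closure stands in for an expensive regex.

```rust
use std::collections::HashMap;

// Hypothetical wrapper: remembers the result of an expensive matcher per input.
struct CachedMatcher<F: Fn(&str) -> bool> {
    matcher: F,
    cache: HashMap<String, bool>,
    calls: usize, // how often the underlying matcher actually ran
}

impl<F: Fn(&str) -> bool> CachedMatcher<F> {
    fn new(matcher: F) -> Self {
        CachedMatcher { matcher, cache: HashMap::new(), calls: 0 }
    }

    fn is_match(&mut self, text: &str) -> bool {
        // Cache hit: skip the expensive matcher entirely.
        if let Some(&hit) = self.cache.get(text) {
            return hit;
        }
        self.calls += 1;
        let result = (self.matcher)(text);
        self.cache.insert(text.to_owned(), result);
        result
    }
}

fn main() {
    // The closure is a stand-in for a slow backtracking regex.
    let mut m = CachedMatcher::new(|t: &str| t.contains("ss"));
    for word in ["Strasse", "Weg", "Strasse"] {
        println!("{}\t{}", word, m.is_match(word));
    }
    println!("matcher ran {} times", m.calls);
}
```

This only pays off if the same inputs (e.g. frequent tokens) recur often, and a shared cache would need a lock when running in parallel.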
I definitely want to switch to fancy-regex. If things don't work out I might implement something like trishume/syntect#270 but in general I'm OK with slightly slower regex matching if the next release is in total faster than the previous release.
from nlprule.
This is now solved by:
- a modular regex backend like in trishume/syntect#270
- a function from_java_regex which uses regex-syntax to parse the regular expressions, fix errors (e.g. unnecessarily escaped chars) and get them to a state in which both fancy-regex and Oniguruma do the same thing (e.g. removing the case-insensitive flag and instead naively making it case insensitive).
In the tests I evaluate approx. 20k regular expressions on ~100k inputs each. It's quite cool that fancy-regex behaves the same as Oniguruma in every case now.
Regarding speed:
I do not see a significant difference in the benchmark between fancy-regex and Oniguruma when running without parallelism. Curiously, with parallelism enabled fancy-regex is 10%-15% slower.
This would warrant some further investigation. I'm not so sure about the quality of the benchmark. It uses the Python bindings which incur some additional overhead and it runs the entire pipeline. To investigate the slowdown I'd have to do the benchmark in Rust and check which part of the pipeline is slower.
But for now, I'm happy with just having both backends with the performance difference being "inconclusive" (and both being fast!).
from nlprule.
This would probably come with a speedup
I'd be curious to see if there's a speedup or not. Keep in mind that Oniguruma is highly optimized (for its type of regex engine). The situations where I could see fancy-regex win are patterns that can be delegated to the regex crate whereas onig needs backtracking.
from nlprule.
If you have a particular regex that's delegated to regex and is unexpectedly slow, we can also ask burntsushi to have a look. He's very helpful.
from nlprule.
There is now a PR for this: #36. I opted for the solution from trishume/syntect#270 which works quite well. I already had a wrapper around the Regex anyway for serialization so this didn't add a lot of complexity and was quite easy to implement.
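For readers who have not looked at that approach, a rough sketch of what a swappable backend behind a wrapper can look like. All names here (Backend, ExactBackend, CaselessBackend) are made up for illustration and are not taken from nlprule or syntect. The two backends are toy substring matchers standing in for the real engines.

```rust
// Hypothetical trait; each regex engine would get one implementation.
trait Backend: Sized {
    fn compile(pattern: &str) -> Result<Self, String>;
    fn is_match(&self, text: &str) -> bool;
}

// Toy stand-in for one engine: plain substring search.
struct ExactBackend(String);

impl Backend for ExactBackend {
    fn compile(pattern: &str) -> Result<Self, String> {
        Ok(ExactBackend(pattern.to_owned()))
    }
    fn is_match(&self, text: &str) -> bool {
        text.contains(self.0.as_str())
    }
}

// Toy stand-in for a second engine: ASCII case-insensitive substring search.
struct CaselessBackend(String);

impl Backend for CaselessBackend {
    fn compile(pattern: &str) -> Result<Self, String> {
        Ok(CaselessBackend(pattern.to_ascii_lowercase()))
    }
    fn is_match(&self, text: &str) -> bool {
        text.to_ascii_lowercase().contains(self.0.as_str())
    }
}

// Wrapper used by the rest of the code. It keeps the source pattern, so only
// that string has to be serialized, whichever backend is compiled in.
struct Regex<B: Backend> {
    source: String,
    compiled: B,
}

impl<B: Backend> Regex<B> {
    fn new(pattern: &str) -> Result<Self, String> {
        Ok(Regex { source: pattern.to_owned(), compiled: B::compile(pattern)? })
    }
    fn is_match(&self, text: &str) -> bool {
        self.compiled.is_match(text)
    }
    fn source(&self) -> &str {
        &self.source
    }
}

fn main() {
    let exact: Regex<ExactBackend> = Regex::new("Lu").unwrap();
    let caseless: Regex<CaselessBackend> = Regex::new("Lu").unwrap();
    println!("{}\t{}\t{}", exact.source(), exact.is_match("lund"), caseless.is_match("lund"));
}
```

The backend could just as well be picked with a cargo feature and a type alias instead of a type parameter. The point is only that callers see a single Regex type.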
At the moment I still have some problems with mismatches between Oniguruma and fancy-regex / regex:
- Oniguruma has better case folding support
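    // Requires the regex and onig crates.
    fn main() {
        // The regex crate is the engine fancy-regex delegates simple patterns to.
        let regex_fancy = regex::Regex::new(r"(?i)ss|fi").unwrap();
        let regex_onig = onig::Regex::new(r"(?i)ss|fi").unwrap();
        // "ﬁ" is U+FB01 (LATIN SMALL LIGATURE FI), not the two letters "fi".
        for text in &["ß", "ss", "ﬁ", "fi"] {
            println!(
                "{}\t{}\t{}",
                text,
                regex_fancy.is_match(text),
                regex_onig.is_match(text)
            );
        }
    }
prints
    ß     false   true
    ss    true    true
    ﬁ     false   true
    fi    true    true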
- Unicode property classes in a case-insensitive oniguruma regex are still case sensitive:
    // Requires the regex and onig crates.
    fn main() {
        let regex_fancy = regex::Regex::new(r"(?i)\p{Lu}").unwrap();
        let regex_onig = onig::Regex::new(r"(?i)\p{Lu}").unwrap();
        for text in &["A", "a"] {
            println!(
                "{}\t{}\t{}",
                text,
                regex_fancy.is_match(text),
                regex_onig.is_match(text)
            );
        }
    }
prints
    A    true    true
    a    true    false
It is enough to reliably detect and disable regexes with 1.) since they are only a few, but I need a fix for 2.). I tried to sort of escape the \p{Lu} by adding (?-i) before and (?i) afterwards, but I don't think that works in every case, e.g. inside [] sets. I also don't know how to reliably detect 1.); I don't think that's trivial.
from nlprule.
I think I can get around both of these issues by using regex-syntax to naively construct case-insensitive regexes by replacing e.g. a with [aA] for all literals instead of using the (?i) flag. I am parsing regexes which are also used in a Java project, so this behavior would probably be closest to how it behaves there.
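A rough sketch of the idea, to show why it sidesteps both problems. It does not use regex-syntax: Node is a toy pattern tree invented for this example (the real thing would walk the parser's tree), and case_variants, escape and render are hypothetical helpers. Only one-to-one case mappings are expanded, so ß is left alone, and property classes are copied through untouched.

```rust
// Toy pattern tree, for illustration only.
enum Node {
    Literal(char),
    Verbatim(String), // e.g. `\p{Lu}`, copied through unchanged
    Concat(Vec<Node>),
    Alternation(Vec<Node>),
}

// Returns the char if the iterator yields exactly one, else None.
fn single(mut it: impl Iterator<Item = char>) -> Option<char> {
    match (it.next(), it.next()) {
        (Some(c), None) => Some(c),
        _ => None,
    }
}

// The char itself plus its one-to-one lower/upper case mappings.
// Multi-char mappings (ß -> SS) are deliberately skipped: that is the naive part.
fn case_variants(c: char) -> Vec<char> {
    let mut variants = vec![c];
    for v in vec![single(c.to_lowercase()), single(c.to_uppercase())]
        .into_iter()
        .flatten()
    {
        if !variants.contains(&v) {
            variants.push(v);
        }
    }
    variants
}

// Escapes a literal that has no case variants if it is a metacharacter.
fn escape(c: char) -> String {
    if r"\.+*?()|[]{}^$".contains(c) {
        format!("\\{}", c)
    } else {
        c.to_string()
    }
}

// Prints the tree as a pattern without any (?i) flag.
fn render(node: &Node) -> String {
    match node {
        Node::Literal(c) => {
            let variants = case_variants(*c);
            if variants.len() > 1 {
                format!("[{}]", variants.iter().collect::<String>())
            } else {
                escape(*c)
            }
        }
        Node::Verbatim(s) => s.clone(),
        Node::Concat(nodes) => nodes.iter().map(render).collect(),
        Node::Alternation(nodes) => {
            let parts: Vec<String> = nodes.iter().map(render).collect();
            format!("(?:{})", parts.join("|"))
        }
    }
}

fn main() {
    let pattern = Node::Alternation(vec![
        Node::Concat(vec![Node::Literal('s'), Node::Literal('s')]),
        Node::Concat(vec![Node::Literal('a'), Node::Verbatim(r"\p{Lu}".to_owned())]),
    ]);
    println!("{}", render(&pattern));
}
```

Since the output contains no (?i), \p{Lu} stays case sensitive in both engines, and neither engine gets a chance to apply its own case folding to literals.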
from nlprule.
Awesome, glad to hear :).
On 1., that's a limitation of regex: https://github.com/rust-lang/regex/blob/master/UNICODE.md#rl15-simple-loose-matches
On 2., that seems to be the desired behavior in regex: https://github.com/rust-lang/regex/blob/d5bf98f293b48174d5378471d01c2e0ef271bbbc/tests/unicode.rs#L12
Note that PCRE agrees with regex there:
$ perl -e 'print "matches" if "a" =~ /(?i)\p{Lu}/'
matches
from nlprule.