I would like to switch from rust-onig

This is now solved by: a modular regex backend like in <a clas

There is now a PR for this: <a class="issue-link js-issue-link" data-error-text="Faile

I think I can get around both of these issues by using <a href="https://docs.rs/regex-

switch regex engine from oniguruma to fancy-regex about nlprule HOT 7 CLOSED

bminixhofer commented on May 18, 2024

switch regex engine from oniguruma to fancy-regex

from nlprule.

Comments (7)

bminixhofer commented on May 18, 2024 1

I did some early comparisons in the meantime and my findings are roughly consistent with trishume/syntect#34. I get a ~15% slowdown from switching tofancy-regex. I'll do some more investigation. The vast majority of regexes nlprule runs should be delegated to the regex crate, there might be a few fancy regexes which take most of the time where the result could be cached.

I also already pre-compute lots of regex matches (making speed pretty much irrelevant for those) but there is room for improvement there too which should decrease the overall % of time spent on regex matching.

I definitely want to switch to fancy-regex. If things don't work out I might implement something like trishume/syntect#270 but in general I'm OK with slightly slower regex matching if the next release is in total faster than the previous release.

from nlprule.

bminixhofer commented on May 18, 2024 1

This is now solved by:

a modular regex backend like in trishume/syntect#270
a function from_java_regex which uses regex-syntax to parse the regular expressions, fix errors (e. g. unnecessary escaped chars) and get them to a state in which both fancy-regex and Oniguruma do the same thing (e. g. removing the case-insensitive flag and instead naively making it case insensitive).

In the tests I evaluate approx. 20k regular expressions on ~100k inputs each, it's quite cool that fancy-regex behaves the same as Oniguruma in every case now.

Regarding speed:

I do not see a significant difference in the benchmark between fancy-regex and Oniguruma when running without parallelism. Curiously, with parallelism enabled fancy-regex is 10%-15% slower.

This would warrant some further investigation. I'm not so sure about the quality of the benchmark. It uses the Python bindings which incur some additional overhead and it runs the entire pipeline. To investigate the slowdown I'd have to do the benchmark in Rust and check which part of the pipeline is slower.

But for now, I'm happy with just having both backends with the performance difference being "inconclusive" (and both being fast!).

from nlprule.

robinst commented on May 18, 2024

This would probably come with a speedup

I'd be curious to see if there's a speedup or not. Keep in mind that Oniguruma is highly-optimized (for its type of regex engine). The situations where I could see fancy-regex win is for patterns that can be delegated to the regex crate whereas onig needs backtracking.

from nlprule.

robinst commented on May 18, 2024

If you have a particular regex that's delegated to regex and is unexpectedly slow, we can also ask burntsushi to have a look, he's very helpful.

from nlprule.

bminixhofer commented on May 18, 2024

There is now a PR for this: #36. I opted for the solution from trishume/syntect#270 which works quite well. I already had a wrapper around the Regex anyway for serialization so this didn't add a lot of complexity and was quite easy to implement.

At the moment I still have some problems with mismatches between oniguruma and fancy-regex / regex:

Oniguruma has better case folding support

fn main() {
    let regex_fancy = regex::Regex::new(r"(?i)ss|ﬁ").unwrap();
    let regex_onig = onig::Regex::new(r"(?i)ss|ﬁ").unwrap();

    for text in &["ß", "ss", "fi", "ﬁ"] {
        println!(
            "{}\t{}\t{}",
            text,
            regex_fancy.is_match(text),
            regex_onig.is_match(text)
        );
    }
}

prints

ß       false   true
ss      true    true
fi      false   true
ﬁ       true    true

Unicode property classes in a case-insensitive oniguruma regex are still case sensitive:

fn main() {
    let regex_fancy = regex::Regex::new(r"(?i)\p{Lu}").unwrap();
    let regex_onig = onig::Regex::new(r"(?i)\p{Lu}").unwrap();

    for text in &["A", "a"] {
        println!(
            "{}\t{}\t{}",
            text,
            regex_fancy.is_match(text),
            regex_onig.is_match(text)
        );
    }
}

prints

A       true    true
a       true    false

It is enough to reliably detect and disable regexes with 1.) since they are only a few but I need a fix for 2.). I tried to sort of escape the \p{Lu} by adding (?-i) before and and (?i) afterwards but I don't think that works in every case e. g. inside [] sets. I also don't know how to reliably detect 1.) - I don't think that's trivial.

from nlprule.

bminixhofer commented on May 18, 2024

I think I can get around both of these issues by using regex-syntax to naively construct lowercase regexes by replacing e g. a with [aA] for all literals instead of using the (?i) flag. I am parsing regexes which are also used in a Java project, this behavior would probably be closest to how it behaves there.

from nlprule.

robinst commented on May 18, 2024

Awesome, glad to hear :).

On 1., that's a limitation of regex: https://github.com/rust-lang/regex/blob/master/UNICODE.md#rl15-simple-loose-matches

On 2., that seems to be the desired behavior in regex: https://github.com/rust-lang/regex/blob/d5bf98f293b48174d5378471d01c2e0ef271bbbc/tests/unicode.rs#L12

Note that PCRE agrees with regex there:

$ perl -e 'print "matches" if "a" =~ /(?i)\p{Lu}/'
matches

from nlprule.

switch regex engine from oniguruma to fancy-regex about nlprule HOT 7 CLOSED

Comments (7)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent