Comments (7)

bminixhofer commented on May 18, 2024

I did some early comparisons in the meantime, and my findings are roughly consistent with trishume/syntect#34: I get a ~15% slowdown from switching to fancy-regex. I'll do some more investigation. The vast majority of regexes nlprule runs should be delegated to the regex crate, but there might be a few fancy regexes which take most of the time and whose results could be cached.
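
As a rough illustration of the caching idea, here is a minimal sketch that memoizes match results per input string. `CachedMatcher` is a hypothetical name, not something from nlprule, and the closure stands in for an expensive backtracking regex; it assumes the same inputs (e.g. tokens) recur often enough for a lookup to pay off.

```rust
use std::collections::HashMap;

/// Hypothetical wrapper that memoizes match results per input string.
/// `F` stands in for an expensive matcher such as a backtracking regex.
pub struct CachedMatcher<F: Fn(&str) -> bool> {
    matcher: F,
    cache: HashMap<String, bool>,
    /// Number of times the underlying matcher actually ran.
    pub misses: usize,
}

impl<F: Fn(&str) -> bool> CachedMatcher<F> {
    pub fn new(matcher: F) -> Self {
        CachedMatcher {
            matcher,
            cache: HashMap::new(),
            misses: 0,
        }
    }

    /// Returns the stored result if `text` was seen before; otherwise runs
    /// the matcher once and stores its result.
    pub fn is_match(&mut self, text: &str) -> bool {
        if let Some(&hit) = self.cache.get(text) {
            return hit;
        }
        let result = (self.matcher)(text);
        self.misses += 1;
        self.cache.insert(text.to_owned(), result);
        result
    }
}
```

Whether this helps depends on how often inputs repeat; for mostly unique inputs the hashing and allocation could cost more than the match itself.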

I also already pre-compute lots of regex matches (making speed pretty much irrelevant for those), but there is room for improvement there too, which should decrease the overall share of time spent on regex matching.

I definitely want to switch to fancy-regex. If things don't work out I might implement something like trishume/syntect#270 but in general I'm OK with slightly slower regex matching if the next release is in total faster than the previous release.

from nlprule.

bminixhofer commented on May 18, 2024

This is now solved by:

  1. a modular regex backend like in trishume/syntect#270
  2. a function from_java_regex which uses regex-syntax to parse the regular expressions, fix errors (e.g. unnecessarily escaped chars) and get them to a state in which both fancy-regex and Oniguruma do the same thing (e.g. removing the case-insensitive flag and instead naively making the pattern case-insensitive).

In the tests I evaluate approx. 20k regular expressions on ~100k inputs each. It's quite cool that fancy-regex behaves the same as Oniguruma in every case now.

Regarding speed:

I do not see a significant difference in the benchmark between fancy-regex and Oniguruma when running without parallelism. Curiously, with parallelism enabled fancy-regex is 10%-15% slower.

This would warrant some further investigation. I'm not so sure about the quality of the benchmark. It uses the Python bindings which incur some additional overhead and it runs the entire pipeline. To investigate the slowdown I'd have to do the benchmark in Rust and check which part of the pipeline is slower.
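
A Rust-side measurement of this kind could start from something like the sketch below, which times each stage separately using only the standard library. The helper names `timed` and `slowest` are made up for illustration and are not part of nlprule; a proper benchmark would also need warm-up runs and repeated samples.

```rust
use std::time::{Duration, Instant};

/// Runs `stage` once and records how long it took under `name`.
pub fn timed<T>(
    name: &'static str,
    timings: &mut Vec<(&'static str, Duration)>,
    stage: impl FnOnce() -> T,
) -> T {
    let start = Instant::now();
    let out = stage();
    timings.push((name, start.elapsed()));
    out
}

/// Returns the name of the stage with the largest recorded duration, if any.
pub fn slowest(timings: &[(&'static str, Duration)]) -> Option<&'static str> {
    timings
        .iter()
        .max_by_key(|(_, duration)| *duration)
        .map(|(name, _)| *name)
}
```

Running the same stages once per backend and comparing the per-stage durations would show where the two diverge, independent of the Python binding overhead.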

But for now, I'm happy with just having both backends with the performance difference being "inconclusive" (and both being fast!).

robinst commented on May 18, 2024

> This would probably come with a speedup

I'd be curious to see if there's a speedup or not. Keep in mind that Oniguruma is highly optimized (for its type of regex engine). The situations where I could see fancy-regex winning are patterns that can be delegated to the regex crate whereas onig needs backtracking.

robinst commented on May 18, 2024

If you have a particular regex that's delegated to regex and is unexpectedly slow, we can also ask burntsushi to have a look, he's very helpful.

bminixhofer commented on May 18, 2024

There is now a PR for this: #36. I opted for the solution from trishume/syntect#270 which works quite well. I already had a wrapper around the Regex anyway for serialization so this didn't add a lot of complexity and was quite easy to implement.
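
For readers unfamiliar with that approach, a wrapper with a swappable backend could look roughly like the sketch below. Everything in it is hypothetical: the `Backend` trait, `SubstringBackend` and `SerializableRegex` are illustrative names, not the types used in nlprule or syntect, and the toy backend only does substring search so that the example compiles without external crates. Real backends would wrap `fancy_regex::Regex` or `onig::Regex`, typically selected via Cargo features.

```rust
/// Hypothetical backend interface (assumption: real backends would wrap
/// `fancy_regex::Regex` or `onig::Regex` behind Cargo features).
pub trait Backend: Sized {
    fn compile(pattern: &str) -> Result<Self, String>;
    fn is_match(&self, text: &str) -> bool;
}

/// Toy stand-in backend: treats the pattern as a plain substring.
pub struct SubstringBackend(String);

impl Backend for SubstringBackend {
    fn compile(pattern: &str) -> Result<Self, String> {
        if pattern.is_empty() {
            Err("empty pattern".to_owned())
        } else {
            Ok(SubstringBackend(pattern.to_owned()))
        }
    }

    fn is_match(&self, text: &str) -> bool {
        text.contains(self.0.as_str())
    }
}

/// Wrapper that keeps the source pattern (the part worth serializing)
/// next to the compiled, backend-specific matcher.
pub struct SerializableRegex<B: Backend> {
    source: String,
    compiled: B,
}

impl<B: Backend> SerializableRegex<B> {
    pub fn new(pattern: &str) -> Result<Self, String> {
        Ok(SerializableRegex {
            source: pattern.to_owned(),
            compiled: B::compile(pattern)?,
        })
    }

    /// The original pattern string, which is what would be written out
    /// when serializing.
    pub fn source(&self) -> &str {
        &self.source
    }

    pub fn is_match(&self, text: &str) -> bool {
        self.compiled.is_match(text)
    }
}
```

Because only the source string is stored alongside the compiled matcher, switching backends does not change the serialized form in this sketch.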

At the moment I still have some problems with mismatches between Oniguruma and fancy-regex / regex:

  1. Oniguruma has better case folding support:

fn main() {
    let regex_fancy = regex::Regex::new(r"(?i)ss|fi").unwrap();
    let regex_onig = onig::Regex::new(r"(?i)ss|fi").unwrap();

    // The third input is the ligature "ﬁ" (U+FB01); the fourth is plain ASCII "fi".
    for text in &["ß", "ss", "ﬁ", "fi"] {
        println!(
            "{}\t{}\t{}",
            text,
            regex_fancy.is_match(text),
            regex_onig.is_match(text)
        );
    }
}

prints

ß       false   true
ss      true    true
ﬁ       false   true
fi      true    true

  2. Unicode property classes in a case-insensitive Oniguruma regex are still case sensitive:

fn main() {
    let regex_fancy = regex::Regex::new(r"(?i)\p{Lu}").unwrap();
    let regex_onig = onig::Regex::new(r"(?i)\p{Lu}").unwrap();

    for text in &["A", "a"] {
        println!(
            "{}\t{}\t{}",
            text,
            regex_fancy.is_match(text),
            regex_onig.is_match(text)
        );
    }
}

prints

A       true    true
a       true    false

It would be enough to reliably detect and disable the regexes affected by 1., since there are only a few, but I need a fix for 2. I tried to sort of escape the \p{Lu} by adding (?-i) before and (?i) afterwards, but I don't think that works in every case, e.g. inside [] sets. I also don't know how to reliably detect 1.; I don't think that's trivial.

bminixhofer commented on May 18, 2024

I think I can get around both of these issues by using regex-syntax to naively construct case-insensitive regexes, replacing e.g. a with [aA] for all literals instead of using the (?i) flag. I am parsing regexes which are also used in a Java project, and this behavior would probably be closest to how they behave there.
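
To make the [aA] idea concrete, here is a minimal sketch that handles only a plain literal string. It deliberately does not use regex-syntax or walk a parsed regex, so it is not the approach described above, only the per-character expansion at its core; `naive_case_insensitive` and `escape_char` are hypothetical helper names.

```rust
/// Escapes characters that are metacharacters in common regex syntaxes.
fn escape_char(c: char, out: &mut String) {
    if r"\.+*?()|[]{}^$".contains(c) {
        out.push('\\');
    }
    out.push(c);
}

/// Hypothetical helper: turns a plain literal into a pattern that matches
/// it regardless of case, without relying on the (?i) flag.
pub fn naive_case_insensitive(literal: &str) -> String {
    let mut out = String::new();
    for c in literal.chars() {
        let lower: Vec<char> = c.to_lowercase().collect();
        let upper: Vec<char> = c.to_uppercase().collect();
        // Expand only characters whose lower and upper forms are single,
        // distinct characters; everything else is emitted as-is.
        if lower.len() == 1 && upper.len() == 1 && lower[0] != upper[0] {
            out.push('[');
            out.push(lower[0]);
            out.push(upper[0]);
            // Keep the original character too if it is neither form
            // (e.g. a titlecase letter).
            if c != lower[0] && c != upper[0] {
                out.push(c);
            }
            out.push(']');
        } else {
            escape_char(c, &mut out);
        }
    }
    out
}
```

With this expansion, a character such as ß, whose uppercase form is two characters, is left untouched, so multi-character case folding never comes into play, and a class like \p{Lu} in a full regex would keep its case-sensitive meaning because no (?i) flag is set.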

robinst commented on May 18, 2024

Awesome, glad to hear :).

On 1., that's a limitation of regex: https://github.com/rust-lang/regex/blob/master/UNICODE.md#rl15-simple-loose-matches

On 2., that seems to be the desired behavior in regex: https://github.com/rust-lang/regex/blob/d5bf98f293b48174d5378471d01c2e0ef271bbbc/tests/unicode.rs#L12

Note that PCRE agrees with regex there:

$ perl -e 'print "matches" if "a" =~ /(?i)\p{Lu}/'
matches
