Coder Social home page Coder Social logo

Comments (4)

BurntSushi avatar BurntSushi commented on September 21, 2024 1

This could be avoided if rg just ran multiple loops through the data at max size chunks in the pattern file.

Ah yeah this isn't going to happen, sorry. To you it might look like one extra little flag, but to me it looks like completely reshuffling the internals. And it looks like a flag that begets more flags. This sort of thing is definitely something you'll want to build as a layer on top of ripgrep. Building something bespoke is much easier than targeting the general case.

If you do -if with a pattern file with 2M lines, it throws this error:

So for your data, while it contains 2 million patterns, it's actually less than half of that because of duplicates.

But right. As soon as you throw -i/--ignore-case in there, the regex engine has to get involved in order to implement case insensitive searching. Did you know that you can configure different size limits? The defaults are just there so that you don't accidentally consume a whole bunch of memory. But you can do, e.g., --regex-size-limit 10G --dfa-size-limit 10G and searching for your patterns will work, even with -i/--ignore-case enabled. It won't be the fastest thing in the world, but it works.

from ripgrep.

BurntSushi avatar BurntSushi commented on September 21, 2024

I don't understand the request. Whether you shard them manually or not, they all get compiled into a single regex pattern by joining them with |. I think it would be better to avoid the XY problem and describe the real thing you're trying to do. This means including the command you're running, the inputs you're giving to it, the actual output and the expected/desired output. As Jerry Maguire says, help me help you.

Regex engines (with the exception of things like Hyperscan and other more niche engines) generally don't scale well to that many patterns. You're basically up shit's creek without a paddle. ripgrep's default regex engine is based on finite automata and so will generally do much better than a backtracker like PCRE2, but it has its limits. One special case here is if all your patterns are literals (which becomes true if you pass -F). In that case, ripgrep can do a little better by diverting to Aho-Corasick.

from ripgrep.

kochbj avatar kochbj commented on September 21, 2024

Thanks for the quick reply, and the advice about -F; thats helpful. I'm asking (or really just suggesting) a convenience wrapper for when the regex engine maxes out. If you do -if with a pattern file with 2M lines, it throws this error:

compiled regex exceeds size limit of 104857600

This could be avoided if rg just ran multiple loops through the data at max size chunks in the pattern file. I know this isn't normal grep behavior, but would be useful to me as an extra flag for -f. Here's a pattern file; it's fixed strings, but you can reproduce the behavior with -i.

rg_all_conferences_papers.inArxiv.citations.citingids.txt.zip

from ripgrep.

kochbj avatar kochbj commented on September 21, 2024

Haha, I figured that's what you'd say. Fair enough. I did not notice the regex-size-limit flag. That's super useful! Anyways will close the issue. Thanks for the quick responses.

P.S. I meant to grab the uniq one, but it made the point. :)

from ripgrep.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.