Comments (4)
> This could be avoided if rg just ran multiple loops through the data at max size chunks in the pattern file.
Ah yeah this isn't going to happen, sorry. To you it might look like one extra little flag, but to me it looks like completely reshuffling the internals. And it looks like a flag that begets more flags. This sort of thing is definitely something you'll want to build as a layer on top of ripgrep. Building something bespoke is much easier than targeting the general case.
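As a rough illustration of what such a layer on top of ripgrep might look like, here is a minimal Python sketch that splits a pattern file into chunks and runs one `rg -i -F -f` per chunk. The function names, the default chunk size, and the decision to skip blank lines are assumptions made for this example, not anything ripgrep provides. It assumes `rg` is on `PATH` and a POSIX-like system where a temporary file can be read by another process while it is still open.

```python
import subprocess
import tempfile
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(patterns: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` patterns."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(patterns)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def search_in_chunks(pattern_file: str, target: str, size: int = 100_000) -> List[str]:
    """Run one `rg -i -F -f <chunk>` per chunk and collect the output lines.

    A line matched by patterns from several chunks is reported once per
    chunk, and results are grouped by chunk, not by position in the input.
    """
    with open(pattern_file, encoding="utf-8") as fh:
        # Skip blank lines: an empty pattern would match every line.
        patterns = [line.rstrip("\n") for line in fh if line.strip()]
    hits: List[str] = []
    for chunk in chunked(patterns, size):
        with tempfile.NamedTemporaryFile("w", suffix=".txt", encoding="utf-8") as tmp:
            tmp.write("\n".join(chunk) + "\n")
            tmp.flush()
            # rg exits with 1 when nothing matches, so check=True is not used.
            proc = subprocess.run(
                ["rg", "-i", "-F", "-f", tmp.name, target],
                capture_output=True,
                text=True,
            )
        if proc.returncode > 1:
            raise RuntimeError(proc.stderr)
        hits.extend(proc.stdout.splitlines())
    return hits
```

Something like `search_in_chunks("patterns.txt", "corpus.txt")` (both file names hypothetical) would then scan the data once per chunk. The duplicated and reordered output is part of why this is easier to build for one specific use case than as a general flag.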
> If you do `-if` with a pattern file with 2M lines, it throws this error:
So for your data, while it contains 2 million patterns, it's actually less than half of that because of duplicates.
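Removing the duplicates before handing the file to ripgrep is straightforward. A minimal sketch, assuming the patterns are fixed strings (folding case on real regexes could change their meaning) and using made-up sample IDs:

```python
from typing import Iterable, List


def unique_patterns(patterns: Iterable[str], ignore_case: bool = False) -> List[str]:
    """Drop duplicate patterns, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for pattern in patterns:
        # casefold() only approximates how a regex engine folds case.
        key = pattern.casefold() if ignore_case else pattern
        if key not in seen:
            seen.add(key)
            unique.append(pattern)
    return unique


# Hypothetical sample data standing in for the real pattern file.
sample = ["id-a", "id-b", "id-a", "ID-C", "id-c"]
print(len(sample), "->", len(unique_patterns(sample)))
print(len(sample), "->", len(unique_patterns(sample, ignore_case=True)))
```

With `ignore_case=True`, entries that differ only by case collapse too, which is the relevant notion of "duplicate" when searching with `-i`.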
But right. As soon as you throw `-i/--ignore-case` in there, the regex engine has to get involved in order to implement case insensitive searching. Did you know that you can configure different size limits? The defaults are just there so that you don't accidentally consume a whole bunch of memory. But you can do, e.g., `--regex-size-limit 10G --dfa-size-limit 10G` and searching for your patterns will work, even with `-i/--ignore-case` enabled. It won't be the fastest thing in the world, but it works.
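Put together, the full invocation might look like the one built below. This is only a sketch: the file names are hypothetical, and the `10G` value is taken from the comment above, not tuned for any particular data set.

```python
import shlex
from typing import List


def rg_command(pattern_file: str, target: str, limit: str = "10G") -> List[str]:
    """Build an argv list for a case-insensitive search with raised size limits."""
    return [
        "rg",
        "-i",
        "-f", pattern_file,
        "--regex-size-limit", limit,
        "--dfa-size-limit", limit,
        target,
    ]


argv = rg_command("patterns.txt", "corpus.txt")  # hypothetical file names
print(shlex.join(argv))
# To actually run it: subprocess.run(argv, check=False)
```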
from ripgrep.
I don't understand the request. Whether you shard them manually or not, they all get compiled into a single regex pattern by joining them with `|`. I think it would be better to avoid the XY problem and describe the real thing you're trying to do. This means including the command you're running, the inputs you're giving to it, the actual output and the expected/desired output. As Jerry Maguire says, help me help you.
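To make the point about joining concrete, here is a small sketch using Python's `re` module as a stand-in. Python's engine is a backtracker and is not what ripgrep uses, so this illustrates only the joining step; the patterns are invented for the example.

```python
import re
from typing import Iterable


def combine(patterns: Iterable[str]) -> str:
    """Join patterns into one alternation, as described in the comment above."""
    return "|".join(patterns)


shard_a = ["foo", "ba[rz]"]
shard_b = ["qu+x"]

# Joining shard by shard gives exactly the same single pattern as joining
# everything at once, so splitting the pattern file by hand changes nothing
# if all of the shards end up in one invocation.
all_at_once = combine(shard_a + shard_b)
by_shard = combine([combine(shard_a), combine(shard_b)])
assert all_at_once == by_shard

regex = re.compile(all_at_once)
print(all_at_once)
print([bool(regex.search(s)) for s in ["a baz b", "quux", "nothing"]])
```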
Regex engines (with the exception of things like Hyperscan and other more niche engines) generally don't scale well to that many patterns. You're basically up shit's creek without a paddle. ripgrep's default regex engine is based on finite automata and so will generally do much better than a backtracker like PCRE2, but it has its limits. One special case here is if all your patterns are literals (which becomes true if you pass `-F`). In that case, ripgrep can do a little better by diverting to Aho-Corasick.
Thanks for the quick reply, and the advice about `-F`; that's helpful. I'm asking for (or really just suggesting) a convenience wrapper for when the regex engine maxes out. If you do `-if` with a pattern file with 2M lines, it throws this error:
compiled regex exceeds size limit of 104857600
This could be avoided if rg just ran multiple loops through the data at max size chunks in the pattern file. I know this isn't normal grep behavior, but it would be useful to me as an extra flag for `-f`. Here's a pattern file; it's fixed strings, but you can reproduce the behavior with `-i`.
rg_all_conferences_papers.inArxiv.citations.citingids.txt.zip
Haha, I figured that's what you'd say. Fair enough. I did not notice the regex-size-limit flag. That's super useful! Anyways will close the issue. Thanks for the quick responses.
P.S. I meant to grab the uniq one, but it made the point. :)