
ripgrep-all's Introduction

rga: ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz, etc.

rga is a line-oriented search tool that allows you to look for a regex in a multitude of file types. rga wraps the awesome ripgrep and enables it to search in pdf, docx, sqlite, jpg, movie subtitles (mkv, mp4), etc.


For more detail, see this introductory blogpost: https://phiresky.github.io/blog/2019/rga--ripgrep-for-zip-targz-docx-odt-epub-jpg/

rga will recursively descend into archives and match text in every file type it knows.

Here is an example directory with different file types:

demo/
├── greeting.mkv
├── hello.odt
├── hello.sqlite3
└── somearchive.zip
    ├── dir
    │   ├── greeting.docx
    │   └── inner.tar.gz
    │       └── greeting.pdf
    └── greeting.epub


Integration with fzf


See the wiki for instructions on integrating rga with fzf.

INSTALLATION

Linux x64, macOS and Windows binaries are available in GitHub Releases.

Linux

Arch Linux

pacman -S ripgrep-all

Nix

nix-env -iA nixpkgs.ripgrep-all

Debian-based

Download the rga binary and install the dependencies like this:

apt install ripgrep pandoc poppler-utils ffmpeg

If ripgrep is not included in your package sources, get it from here.

rga will search for all binaries it calls in $PATH and in the directory the rga binary itself is in.

Windows

Note that installing via chocolatey or scoop is the only supported download method. If you download the binary from releases manually, you will not get the dependencies (for example pdftotext from poppler).

If you get an error like VCRUNTIME140.DLL could not be found, you need to install vc_redist.x64.exe.

Chocolatey

choco install ripgrep-all

Scoop

scoop install rga

Homebrew/Linuxbrew

rga can be installed with Homebrew:

brew install rga

To install the optional dependencies, which are not strictly necessary but very useful:

brew install pandoc poppler ffmpeg

MacPorts

rga can also be installed on macOS via MacPorts:

sudo port install ripgrep-all

Compile from source

rga should compile with stable Rust (v1.75.0+, check with rustc --version). To build it, run the following (or the equivalent in your OS):

~$ apt install build-essential pandoc poppler-utils ffmpeg ripgrep cargo
~$ cargo install --locked ripgrep_all
~$ rga --version    # this should work now

Available Adapters

rga works with adapters that adapt various file formats. It comes with a few adapters integrated:

rga --rga-list-adapters

You can also add custom adapters. See the wiki for more information.

Adapters:

  • pandoc: Uses pandoc to convert binary/unreadable text documents to plain markdown-like text
    Runs: pandoc --from= --to=plain --wrap=none --markdown-headings=atx
    Extensions: .epub, .odt, .docx, .fb2, .ipynb, .html, .htm

  • poppler: Uses pdftotext (from poppler-utils) to extract plain text from PDF files
    Runs: pdftotext - -
    Extensions: .pdf
    Mime Types: application/pdf

  • postprocpagebreaks: Adds the page number to each line for an input file that specifies page breaks as the ASCII page break character. Mainly to be used internally by the poppler adapter.
    Extensions: .asciipagebreaks

  • ffmpeg: Uses ffmpeg to extract video metadata/chapters, subtitles, lyrics, and other metadata
    Extensions: .mkv, .mp4, .avi, .mp3, .ogg, .flac, .webm

  • zip: Reads a zip file as a stream and recurses down into its contents
    Extensions: .zip, .jar
    Mime Types: application/zip

  • decompress: Reads a compressed file as a stream and runs a different extractor on the contents.
    Extensions: .als, .bz2, .gz, .tbz, .tbz2, .tgz, .xz, .zst
    Mime Types: application/gzip, application/x-bzip, application/x-xz, application/zstd

  • tar: Reads a tar file as a stream and recurses down into its contents
    Extensions: .tar

  • sqlite: Uses sqlite bindings to convert sqlite databases into a simple plain text format
    Extensions: .db, .db3, .sqlite, .sqlite3
    Mime Types: application/x-sqlite3

The following adapters are disabled by default, and can be enabled using '--rga-adapters=+foo,bar':

  • mail: Reads mailbox/mail files and runs extractors on the contents and attachments.
    Extensions: .mbox, .mbx, .eml
    Mime Types: application/mbox, message/rfc822

USAGE:

rga [RGA OPTIONS] [RG OPTIONS] PATTERN [PATH ...]

FLAGS:

--rga-accurate

Use more accurate but slower matching by mime type

By default, rga will match files using file extensions. Some programs, such as sqlite3, don't care about the file extension at all, so users sometimes use any or no extension at all. With this flag, rga will try to detect the mime type of input files using the magic bytes (similar to the `file` utility), and use that to choose the adapter. Detection is only done on the first 8KiB of the file, since we can't always seek on the input (in archives).
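As a rough illustration of matching by magic bytes rather than by extension, here is a minimal Python sketch. The signature table and the name sniff_mime are assumptions made for this example; rga's actual detection logic is not shown in this README.

```python
from typing import Optional

# Illustrative signatures only (an assumption, not rga's real table).
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
    (b"SQLite format 3\x00", "application/x-sqlite3"),
    (b"\x1f\x8b", "application/gzip"),
]

SNIFF_LIMIT = 8 * 1024  # only the first 8 KiB are inspected


def sniff_mime(data: bytes) -> Optional[str]:
    """Guess a mime type from the leading bytes, or return None if unknown."""
    head = data[:SNIFF_LIMIT]
    for magic, mime in MAGIC_SIGNATURES:
        if head.startswith(magic):
            return mime
    return None
```

A file named "notes" with no extension would still be routed to a PDF extractor under this scheme, as long as its content starts with the PDF signature.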

--rga-no-cache

Disable caching of results

By default, rga caches the extracted text, if it is small enough, to a database in ${XDG_CACHE_HOME-~/.cache}/ripgrep-all on Linux, ~/Library/Caches/ripgrep-all on macOS, or C:\Users\username\AppData\Local\ripgrep-all on Windows. This way, repeated searches on the same set of files will be much faster. If you pass this flag, all caching will be disabled.

-h, --help

Prints help information

--rga-list-adapters

List all known adapters

--rga-print-config-schema

Print the JSON Schema of the configuration file

--rg-help

Show help for ripgrep itself

--rg-version

Show version of ripgrep itself

-V, --version

Prints version information

OPTIONS:

--rga-adapters=<adapters>...

Change which adapters to use and in which priority order (descending)

"foo,bar" means use only adapters foo and bar. "-bar,baz" means use all default adapters except for bar and baz. "+bar,baz" means use all default adapters and also bar and baz.
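The three forms of the adapter list can be read as follows. This is a small Python sketch of the semantics described above, not rga's code; the function name resolve_adapters is hypothetical, and placing explicitly added adapters ahead of the defaults is an assumption about priority.

```python
from typing import List


def resolve_adapters(spec: str, defaults: List[str]) -> List[str]:
    """Interpret an --rga-adapters style value against the default adapters."""
    if spec.startswith("-"):
        # "-bar,baz": all defaults except the named ones.
        excluded = set(spec[1:].split(","))
        return [name for name in defaults if name not in excluded]
    if spec.startswith("+"):
        # "+bar,baz": all defaults plus the named ones.
        # Assumption: added adapters are tried before the defaults.
        extra = [name for name in spec[1:].split(",") if name not in defaults]
        return extra + defaults
    # "foo,bar": only the named adapters, in the given order.
    return spec.split(",")
```

For example, with defaults pandoc, poppler and zip, the value "+mail" would select mail in addition to the three defaults, while "-pandoc" would leave poppler and zip.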

--rga-cache-compression-level=<compression-level>

ZSTD compression level to apply to adapter outputs before storing in cache db

Ranges from 1 - 22 [default: 12]

--rga-config-file=<config-file-path>

--rga-max-archive-recursion=<max-archive-recursion>

Maximum nestedness of archives to recurse into [default: 5]

--rga-cache-max-blob-len=<max-blob-len>

Max compressed size to cache

Longest byte length (after compression) to store in cache. Longer adapter outputs will not be cached and will be recomputed every time.

Allowed suffixes on command line: k M G [default: 2000000]
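To make the "after compression" wording concrete, here is a minimal Python sketch of such a size check. It is an illustration under assumptions: zlib stands in for ZSTD only because it ships with Python, and the name should_cache and the use of "less than or equal" are not taken from rga.

```python
import zlib

DEFAULT_MAX_BLOB_LEN = 2_000_000  # default from the option description above


def should_cache(extracted: bytes, max_blob_len: int = DEFAULT_MAX_BLOB_LEN) -> bool:
    """Return True if the compressed adapter output fits within the limit."""
    # Stand-in compressor; the real cache is described as using ZSTD.
    compressed = zlib.compress(extracted, 6)
    return len(compressed) <= max_blob_len
```

The point of the sketch is that the limit applies to the compressed size, so a large but repetitive text extract can still be cached while a smaller, poorly compressible one may not be.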

--rga-cache-path=<path>

Path to store cache db [default: /home/phire/.cache/ripgrep-all]

-h shows a concise overview, --help shows more detail and advanced options.

All other options not shown here are passed directly to rg, especially [PATTERN] and [PATH ...]

Config

The config file location leverages the mechanisms defined by

Development

To enable debug logging:

export RUST_LOG=debug
export RUST_BACKTRACE=1

Also remember to disable caching with --rga-no-cache or clear the cache (~/Library/Caches/rga on macOS, ~/.cache/rga on other Unixes, or C:\Users\username\AppData\Local\rga on Windows) to debug the adapters.

Nix and Direnv

You can use the provided flake.nix to set up all build- and run-time dependencies:

  1. Enable Flakes in your Nix configuration.
  2. Add direnv to your profile: nix profile install nixpkgs#direnv
  3. cd into the directory where you have cloned this repository.
  4. Allow use of .envrc: direnv allow
  5. After the dependencies have been installed, your shell will now have all of the necessary development dependencies.

ripgrep-all's People

Contributors

abelcha, aliesbelik, br1ght0ne, dloss, duxovni, fkarg, fliegendewurst, hacksore, herbygillot, kfogel, lafrenierejm, liskin, makefu, mathieupost, mbrubeck, moonfruit, neved4, nicoulaj, phiresky, prj-2501, richiksc, smokris, svenstaro, sweetbbak, tarnadas, tripleight, uhthomas


ripgrep-all's Issues

possible bug with search path

Hi.
In the following example I'm looking for "static analysis", and I expect to find this phrase in 2 PDFs, in a folder called Compi:
rga "static analysis" . --no-heading -m=1
rga outputs nothing.
However,
mv Compi compi
rga "static analysis" . --no-heading -m=1
will output:
./compi/Lectures/12-abstract-interp-fullscreen.pdf:Page 217: ★ 30% Static Analysis
./compi/Lectures/08-analysis-fullscreen.pdf:Page 8: Static Analysis
Am I missing something?
Thanks.

feature_request(ebooks): kill gremlin characters

1. Summary

It would be nice if it were possible to search text in ebooks that contain gremlin characters.

2. Gremlins

2.1. Definition

Gremlins are invisible, non-printable characters that prevent text in ebooks from being searched. They turn up in books with poor-quality OCR.

2.2. Gremlins example

Paragraph of text from page 10 of “The Enigma of Reason” book:

Okular text

They drink and piss, eat and shit. They sleep and snore. They sweat and
shiver. They lust. They mate. Their births and deaths are messy affairs. Ani-
mals, ­humans are animals! Ah, but ­humans, and ­humans alone, are endowed
with reason. Reason sets them apart, high above other creatures—or so
Western phi­los­o­phers have claimed.

See it in a service that shows non-printable characters:

soscisurvey.de

For example, we can't find the word philosophers in this ebook, because there are 3 gremlins inside it:

Philosophers

It would be nice if ripgrep-all users could find the word philosophers in this book.

2.3. pdftotext gremlins

pdftotext KiraTheEnigmaOfReason.pdf

Paragraph in KiraTheEnigmaOfReason.txt:

They drink and piss, eat and shit. They sleep and snore. They sweat and
shiver. They lust. They mate. Their births and deaths are messy affairs. Animals, ­humans are animals! Ah, but ­humans, and ­humans alone, are endowed
with reason. Reason sets them apart, high above other creatures—or so
Western phi­los­o­phers have claimed.

pdftotext philosophers

pdftotext does not delete gremlins.
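One conceivable post-processing step, sketched in Python under the assumption that the gremlins in this sample are soft hyphens (U+00AD): strip them from the extracted text before searching. The function name is hypothetical, and this is not something ripgrep-all is stated to do.

```python
SOFT_HYPHEN = "\u00ad"  # invisible "optional hyphen" character


def strip_soft_hyphens(text: str) -> str:
    """Remove soft hyphens so that words match as they appear on screen."""
    return text.replace(SOFT_HYPHEN, "")
```

Applied to the paragraph above, the word with three embedded soft hyphens would become a plain "philosophers" that a regex search can match. Other invisible characters would need their own handling.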

3. Additional links

  1. Hunting gremlin characters
  2. Removing non-printable “gremlin” chars from text files

4. Environment

  1. Windows 10.0.18363 Pro N for Workstations 64-bit EN
  2. ripgrep-all 0.9.3 (currently, the latest Windows version)
  3. pdftotext (from conda-forge Poppler) 0.88.0
  4. Okular 1.10.70

Thanks.

rga-accurate problem for pdf files without filename extension

I can't get rga --rga-accurate to work on pdfs with no filename extension.

  1. I take some random pdfs and copy them to random filenames without filename extensions
  2. I check that "file *" correctly identifies both original and copies as pdf files
  3. I search for a generic word which occurs in all/most pdfs: rga --rga-accurate -l "the"
    This only lists the pdfs with the filename extension ".pdf" but not the renamed copies without ".pdf" file extension.

docs(adapters): third-party adapters commands

1. Summary

It would be nice if the ripgrep-all documentation listed the commands that the available third-party adapters run while searching for text in books.

2. Example

For example, for PDF, ripgrep-all uses the command pdftotext -layout <BookName>.pdf. It would be nice if users knew the commands for the other file types.

3. Argumentation

3.1. Common cause

More debugging information.

3.2. Details

Imagine that a ripgrep-all user gets an unexpected result. If the user knows the adapter commands, they can determine where the problem is, in ripgrep-all or in the third-party tools → the user can continue debugging with more of the important data.

Thanks.

feature_request(ebooks): words with hyphens in the ends of lines

1. Summary

It would be nice if ripgrep-all could find words that are split by hyphens at the ends of lines.

Okular expected behavior

2. Argumentation

2.1. Common cause

More correct search.

2.2. Details

A hyphen between word parts is used, for example, in English, German, and Russian.

Currently, ripgrep-all doesn't find words that are divided by hyphens. Users may fail to find the words they need, which can cause problems in their work, because the desired words may be split by hyphens.

2.3. Additional information

  1. About optional hyphens
  2. Soft hyphen in PDF

3. Example

3.1. Data

  1. KiraTheEnigmaOfReason.pdf — “The Enigma of Reason” book.

3.2. Purpose

I want to find the word understandable on page 17 of this book. Okular successfully found this word:

Okular expected behavior

3.3. Current behavior

rga understandable

No results.

3.4. Expected behavior

I converted KiraTheEnigmaOfReason.pdf to KiraTheEnigmaOfReason.txt:

pdftotext KiraTheEnigmaOfReason.pdf

Now I run ripgrep-all:

D:\SashaDebugging\KiraRipgrepAll>rga understandable
KiraTheEnigmaOfReason.txt
distrust. When we talk to ­others, we often have to overcome their understandable lack of trust. If we distrusted ­others only when they ­don’t deserve

pdftotext converted under-standable to understandable. From this I conclude that the problem is in ripgrep-all, not in pdftotext.
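For reference, joining words that are split by a hyphen at a line end can be sketched in a few lines of Python. This is only an illustration of the requested behavior with a hypothetical function name; it does not describe how ripgrep-all or pdftotext handle hyphenation.

```python
import re

# A word character, a hyphen at the very end of a line, then a word character.
_LINE_END_HYPHEN = re.compile(r"(\w)-\n(\w)")


def join_hyphenated(text: str) -> str:
    """Rejoin words that were split across lines with a trailing hyphen."""
    return _LINE_END_HYPHEN.sub(r"\1\2", text)
```

A caveat of this simple rule: a genuine compound that happens to break at its hyphen would also lose the hyphen, and removing the newline changes line numbering.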

4. Environment

  1. Windows 10.0.18363 Pro N for Workstations 64-bit EN
  2. ripgrep-all 0.9.3 (currently, the latest Windows version)
  3. pdftotext (from conda-forge Poppler) 0.88.0
  4. Okular 1.10.70

Thanks.

Option to suppress

Would it be possible to add a flag to automatically skip files where the adapter fails?
My searches are sometimes polluted by quite verbose errors, e.g.:

notebook/symetrical_components.ipynb: preprocessor command failed: '"rga-preproc" "notebook/symetrical_components.ipynb"': 
-------------------------------------------------------------------------------
adapter: pandoc
Unknown reader: ipynb
Error: Broken pipe (os error 32)
-------------------------------------------------------------------------------

(I'm not sure why it attempts to use pandoc on jupyter notebooks, but that's a separate issue :))

Problem with `--pre` parameter when calling ripgrep

On Debian 10, attempting to use rga (0.9.3) with any search results in a ripgrep (11.0.2) error stating that:

error: Found argument '--pre' which wasn't expected, or isn't valid in this context
	Did you mean --pretty?

After taking the latest rg binary from the repo I still get this issue.
Is there any workaround or am I missing anything?

feature_request(windows): third-party instead of built-in dependencies

1. Summary

It would be nice if it were possible to use third-party dependencies instead of the built-in dependencies that are bundled in the ripgrep-all archive.

It would be nice to have normal dependency management instead of old-school built-in dependencies.

2. Example of expected behavior

2.1. Dependencies installation

Users install ripgrep-all via the Chocolatey Windows package manager. We can add dependencies to the ripgrep-all Chocolatey package. For example, we add this line to this file:

<dependency id="pandoc" />

→ Chocolatey installs pandoc when a Windows user installs ripgrep-all via Chocolatey.

Chocolatey works with dependencies as pip does with requirements.txt or npm with package.json.

2.2. lib

Now pandoc.exe is not necessary in the ripgrep-all archive. ripgrep-all developers can remove this file from the ripgrep_all-{version}.7z file. ripgrep-all will use the pandoc.exe version from the PATH environment variable.

3. Argumentation

3.1. Updates

Users will be able to have the newest or specific versions of dependencies and will not have to wait until the ripgrep-all developers add a new version of a dependency. Users can always use the dependency versions they require.

3.2. Disk space

Built-in ripgrep-all dependencies take up space on users' hard drives. Users who already have the newest versions of Pandoc and Poppler don't need extra old versions of this software.

4. Additional links

  1. How Chocolatey saves my time
  2. Creating Chocolatey Packages

Thanks.

Adapter documentation... request for wireshark adapter

Do you have documentation for creating ripgrep-all adapters?

I would like to have a ripgrep adapter for wireshark files (.pcap, .pcapng, etc). A general purpose adapter would have to be highly configurable, but in my case, I generally convert to text using this command:
tshark -V -Y "(ngap || nas-5gs || s1ap || sip || diameter)" -r ${wireshark_file_name}

A general purpose adapter could just use tshark -V -r ${wireshark_file_name}.

An even more general purpose adapter could allow you to specify the following parameters:
RGA_USER_ADAPTER=${path_to_executable}
RGA_USER_ADAPTER_FLAGS=${flags}

As in ...
RGA_USER_ADAPTER=/usr/bin/tshark
RGA_USER_ADAPTER_FLAGS="-V -r {}"
or ...
RGA_USER_ADAPTER_FLAGS="-V -Y \"(ngap || nas-5gs || s1ap || sip || diameter)\" -r {}"
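The proposal above boils down to expanding a flags template with a {} placeholder into a command line. A minimal Python sketch of that expansion follows; RGA_USER_ADAPTER and RGA_USER_ADAPTER_FLAGS are the hypothetical variables suggested in this issue, not existing rga settings, and build_adapter_command is a name made up for the example.

```python
import shlex
from typing import List


def build_adapter_command(executable: str, flags: str, path: str) -> List[str]:
    """Expand a flags template such as '-V -r {}' into an argv list."""
    # shlex.split honours shell-style quoting, so a quoted filter stays one argument.
    tokens = shlex.split(flags)
    return [executable] + [path if token == "{}" else token for token in tokens]
```

The resulting list could be passed to a process launcher without going through a shell, which avoids quoting problems with file names that contain spaces.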

Note also that there do exist some rust pcap libraries...

pdftotext not found in OSX

There are no simple instructions about how to use pdftotext or poppler-utils in OSX, nor are they available on Brew or Cargo that I've found.

Idea: allow adapters to bail out

Sometimes adapters may want to use their own logic to decide whether they can or want to handle a file. It might be useful to allow them to bail out so that a different adapter runs when there are multiple choices.

example: if poppler (pdftotext) detects no text in the pdf, run tesseract (OCR) instead.
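The idea can be sketched as a simple fallback chain in Python. Everything here is hypothetical and only illustrates the proposal: an adapter signals "bail out" by returning None, and the two stand-in adapters do not call poppler or tesseract.

```python
from typing import Callable, List, Optional

# An adapter returns extracted text, or None to bail out.
Adapter = Callable[[bytes], Optional[str]]


def extract_with_fallback(data: bytes, adapters: List[Adapter]) -> Optional[str]:
    """Try adapters in priority order and return the first non-None result."""
    for adapter in adapters:
        text = adapter(data)
        if text is not None:
            return text
    return None


def text_layer(data: bytes) -> Optional[str]:
    """Stand-in for a text extractor that bails out when it finds no text."""
    text = data.decode("utf-8", errors="ignore").strip()
    return text or None


def ocr(data: bytes) -> Optional[str]:
    """Stand-in for an OCR step; a real one would run an OCR engine."""
    return "<text recovered by OCR>"
```

With the adapters ordered as text_layer then ocr, the slower OCR step only runs for inputs where the first adapter found nothing.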

Pdfgrep and rga comparison

Hello; this is not a bug or anything, just a question.

I know that ripgrep is blazingly fast compared to other grep options, and I was recently recommended to use your ripgrep-all for my scripts to mass grep/sort/filter on pdfs. With little knowledge of specifics, I expected little from rga for search in pdfs: it seemed to me that decoding the pdf would be the time-consuming part and ripgrep would not make much difference for, say, a 200 page text over other greps. On the contrary, when I benchmarked rga against pdfgrep the difference was ridiculous and the diffs seem to clear, so no inconsistencies so far.

Could you let me know, briefly, what makes rga so much faster than things like pdfgrep (that is, if you know or can guess)? The speed difference seems so remarkable that for my purposes makes pdfgrep useless.

cache-max-blob-len no effect with certain files

I run rga whatever . in the home directory routinely to cache all the files, to achieve a 0-delay integration with fzf. I currently have a cache with a size of 228M in ~/.cache/rga/data.mdb, which is pretty small, and I don't care about the size of the caches.
However, some files are not cached no matter how big I set the value with --rga-cache-max-blob-len. For instance, I've tried running rga -L whatever . --rga-cache-max-blob-len 100000000000, and there are still files that are not cached after running the command (rga spends plenty of time when searching around them, and I can see a pdftotext process running with rga pointing to these files).
The files are typically large pdf files. The largest one is around 8M or 12.4M after conversion to text file without any compression.
Am I doing anything wrong?

Feature Request: `--glob` and `--iglob` for filenames within an archive

Unless I missed something, there doesn't seem to be a way to filter by filename inside the archive. It would be helpful to be able to do so to avoid uncompressing data that will be ignored anyways.

Right now I'm filtering the output to achieve the same goal but it's inefficient. Here's my workaround.

rga "what to find" --iglob "*.zip" --no-heading --color always | rga -i "^[^:]*:[^:]*\.txt:" --color never -

cargo install ripgrep_all fails

Cargo install ripgrep_all fails with below error.
Machine is linux Rhel 7
cargo 1.34.0

Do let me know if you need more details.

error[E0161]: cannot move a value of type dyn for<'r> std::ops::FnOnce(&'r [u8]) -> std::result::Result<(), failure::error::Error>: the size of dyn for<'r> std::ops::FnOnce(&'r [u8]) -> std::result::Result<(), failure::error::Error> cannot be statically determined
--> /home/<My_home>/.cargo/registry/src/github.com-1ecc6299db9ec823/ripgrep_all-0.9.1/src/preproc_cache.rs:77:17
|
77 | callback(cached)?;
| ^^^^^^^^

error[E0161]: cannot move a value of type dyn std::ops::FnOnce() -> std::result::Result<std::option::Option<std::vec::Vec>, failure::error::Error>: the size of dyn std::ops::FnOnce() -> std::result::Result<std::option::Option<std::vec::Vec>, failure::error::Error> cannot be statically determined
--> /home/<My_home>/.cargo/registry/src/github.com-1ecc6299db9ec823/ripgrep_all-0.9.1/src/preproc_cache.rs:83:36
|
83 | if let Some(got) = runner()? {
| ^^^^^^

error: aborting due to 2 previous errors

For more information about this error, try rustc --explain E0161.
error: failed to compile ripgrep_all v0.9.1, intermediate artifacts can be found at /tmp/cargo-installyxoOOy

Caused by:
Could not compile ripgrep_all.

WSL warning for other users

I wanted to use this tool on Windows WSL (with the Ubuntu there), and wanted to give others who also wanted to try this a note on this. The caching library that ripgrep-all uses doesn't work on WSL (which is WSL's fault from what I read online), so if you use it in the default configs, you'll get a ton of warnings. Just disabling the use of the cache makes the warnings go away.

There really isn't a fix on the horizon, but I do know that WSL2 is designed differently, and in the future using that should let everything broken like this "just work".

feature_request(ebooks): non UTF-8 books support

1. Summary

It would be nice if ripgrep-all supported searching in ebooks that are not UTF-8 encoded.

2. Argumentation

2.1. Common cause

More cases support. Automation.

2.2. Details

Windows-1251 is a character encoding that was popular for Cyrillic texts. In my book list I have at least 7 books in this encoding (I can't run ripgrep-all on all books from this list, because my PC hangs).

I agree with the statement that UTF-8 is a better character encoding. But I did not create all the books in which users need to search.

3. Example

3.1. Data

  1. KiraRome.fb2 file — “Легенды и сказания Древнего Рима” book by Alexandra Neihardt.

I opened the file in a text editor (FB2 is an XML-based format) → I saw the first line of the file:

<?xml version="1.0" encoding="windows-1251"?>

3.2. pandoc

Pandoc output:

D:\SashaDebugging\KiraRipgrepAll>pandoc KiraRome.fb2 -o KiraRome.txt --verbose
UTF-8 decoding error in KiraRome.fb2 at byte offset 246 (c0).
The input must be a UTF-8 encoded text.

ripgrep-all outputs the same:

D:\SashaDebugging\KiraRipgrepAll>rga Валерий
KiraRome.fb2: preprocessor command failed: '"rga-preproc" "KiraRome.fb2"':
-------------------------------------------------------------------------------
adapter: pandoc
UTF-8 decoding error in - at byte offset 246 (c0).
The input must be a UTF-8 encoded text.
Error: subprocess failed: ExitStatus(ExitStatus(92))
-------------------------------------------------------------------------------
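The error comes from pandoc receiving bytes that are not UTF-8. One conceivable pre-processing step is to transcode the file according to its own XML declaration before handing it on. The Python sketch below illustrates that idea only; the function name is hypothetical, unknown encoding names are not handled, and nothing here is part of ripgrep-all or pandoc.

```python
import re

# Matches the encoding attribute inside an XML declaration.
_DECLARED = re.compile(r'(<\?xml[^>]*encoding=")([^"]+)(")')


def xml_to_utf8(raw: bytes) -> bytes:
    """Re-encode an XML document as UTF-8 based on its declared encoding."""
    # The declaration itself is ASCII, so it can be read before real decoding.
    head = raw[:200].decode("ascii", errors="ignore")
    match = _DECLARED.search(head)
    if match is None:
        return raw  # no declared encoding: left unchanged in this sketch
    text = raw.decode(match.group(2))
    # Update the declaration so that it matches the new encoding.
    text = _DECLARED.sub(r"\g<1>utf-8\g<3>", text, count=1)
    return text.encode("utf-8")
```

The output of such a step could then be piped to the converter on standard input, which is how the adapter output above shows pandoc being fed.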

4. Example of expected behavior

ebook-convert supports non-UTF-8 character encodings:

D:\SashaDebugging\KiraRipgrepAll>ebook-convert KiraRome.fb2 KiraRome.txt
1% Converting input to HTML…
InputFormatPlugin: FB2 Input running
on D:\SashaDebugging\KiraRipgrepAll\KiraRome.fb2
Parsing all content…
Forcing index.xhtml into XHTML namespace
Generating default TOC from spine…
34% Running transforms on e-book…
Merging user specified metadata…
Detecting structure…
Auto generated TOC with 27 entries.
Flattening CSS and remapping font sizes…
Source base font size is 12.00000pt
Removing fake margins…
Cleaning up manifest…
Trimming unused files from manifest…
Trimming u'MAR2.png' from manifest
Trimming u'cover.jpg' from manifest
Creating TXT Output…
67% Running TXT Output plugin
Converting XHTML to TXT…
TXT output written to D:\SashaDebugging\KiraRipgrepAll\KiraRome.txt
Output saved to   D:\SashaDebugging\KiraRipgrepAll\KiraRome.txt

And now:

D:\SashaDebugging\KiraRipgrepAll>rga Валерий
KiraRome.txt
К великому ужасу Брута, в числе заговорщиков, кроме брата его жены, оказались и оба его сына – Тит и Тиберий. Послы Тарквиния были изгнаны, а его имущество отдано народу на разграбление, чтобы, получив часть захваченных царем богатств, римский народ навсегда потерял надежду на возможность примирения с бывшим царем. Изменники были судимы и приговорены к казни. Среди привязанных к позорному столбу знатных юношей особенное внимание привлекали сыновья Брута. Они, дети консула, только что освободившего народ, решились предать дело отца, его самого и весь Рим в руки мстительного и несправедливейшего из деспотов! В полном молчании оба консула вышли, сели на свои места и приказали ликторам приступить к свершению унизительной и жестокой казни. С приговоренных были сорваны одежды, их долго секли прутьями, а затем отрубили головы. Консул Публий Валерий с состраданием смотрел на муки осужденных юношей, Брут же словно превратился в статую, ни единым движением не выдал он обуревавших его чувств. Лишь когда покатились головы его сыновей, легкая судорога передернула неподвижное лицо консула.

5. Do not offer

5.1. “Convert your book to UTF-8”

Yes, I can change ebook encoding:

  1. Manually change in ebook: encoding="windows-1251" → encoding="utf-8".
  2. PowerShell command for conversion.

But:

5.1.1. Time

If a user has many non-UTF-8 books, conversion takes time.

5.1.2. Other users

If I convert my book to UTF-8, I change the character encoding solely for the book on my PC. Other users who download this book from services such as LibGen will still have the other encoding → they can't find text in the books via ripgrep-all by default.

  1. These users will have to spend time on conversion.
  2. Not all users have good computer skills. Not all users can understand the problem and convert the file's character encoding to UTF-8.

Programs are written for users. We must make their life easier where possible.

6. Environment

  1. Windows 10.0.18363 Pro N for Workstations 64-bit EN
  2. ripgrep-all 0.9.3
  3. pandoc 2.7.3
  4. Calibre 4.16.0

Thanks.

Debugging after freeze when running command

I have just started using ripgrep-all to batch search PDF files. I am on Ubuntu 18.04.
In a first test it works well on plain text search.

[Edited] I now understand better how to use regex expressions (as in rg), so I move on to the question on the "freeze" below.

I ran a basic command on a large PDF document. Text was found without issues.
I then repeated the run with the option --rga-adapters=+pdfpages,tesseract added.

There is no activity seen after starting the command. Then the desktop freezes (cursor stuck) with no results in terminal. I rebooted and same freeze occurred. I imagine this might be a memory intensive operation. I arranged a monitor (Stacer) to show resource usage and cpu usage went up to 88% and memory usage 6.8 GB out of 8GB total.

I always have to force a shutdown and reboot.

Is there any log file I can inspect when I reboot?
Is my hunch that this is a memory issue on the mark?
If so I guess I might need to order another 8GB to bring RAM to 16GB.

OSX Homebrew formula

Hi,

It would be great to have rga in Homebrew. I have been using rg for a while now, and rga would be handy to have.

Use page numbers as line numbers where appropriate

Editor tools that integrate with rg(a) rely on line number info to jump to the appropriate location. Implementing this would enable them to jump to the right page (as the line number is arbitrary anyway, generated by pdf2text, docx2text... and latex (at compile time), libre office, mobi, word re-flow text to page/margin size ).

Old way:

60:Page 5:       Helpman, 2019)
654:Page 51:            Grossman-Helpman Model)
745:Page 56:    Grossman, G. and Helpman, E., 2019, “Identity politics and trade policy”, Working Paper

New way:

5:60:       Helpman, 2019)
51:654:            Grossman-Helpman Model)
56:745:    Grossman, G. and Helpman, E., 2019, “Identity politics and trade policy”, Working Paper

This could either be a default or a command line --option.
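The requested change is a mechanical reordering of the prefix on each output line. A small Python sketch of that transformation follows, assuming the "line:Page N:" layout shown in the old output; the function name is hypothetical.

```python
import re

# Old layout: "<line number>:Page <page number>:<text>"
_OLD_PREFIX = re.compile(r"^(\d+):Page (\d+):(.*)$")


def page_first(line: str) -> str:
    """Rewrite 'LINE:Page PAGE:text' as 'PAGE:LINE:text'; leave other lines alone."""
    match = _OLD_PREFIX.match(line)
    if match is None:
        return line
    line_no, page, rest = match.groups()
    return f"{page}:{line_no}:{rest}"
```

An editor integration that reads the first number as the jump target would then land on the page rather than on an arbitrary line of the extracted text.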

Panic on `targ` file

ripgrep-all version: 0.9.1.
The panic happens both with and without --rga-accurate option.

File: a text file with targ extension.
Stack trace:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Custom { kind: Other, error: StringError("numeric field was not a number: from the when getting cksum for # -*- mode: makefile -*-\n#\n# Copyright (c) 2012, Joyent, Inc. All rights reserved.\n#\n# Makefile.targ") }', src/libcore/result.rs:997:5
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:39
   1: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:71
   2: std::panicking::default_hook::{{closure}}
             at src/libstd/sys_common/backtrace.rs:59
             at src/libstd/panicking.rs:197
   3: std::panicking::default_hook
             at src/libstd/panicking.rs:211
   4: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:474
   5: std::panicking::continue_panic_fmt
             at src/libstd/panicking.rs:381
   6: rust_begin_unwind
             at src/libstd/panicking.rs:308
   7: core::panicking::panic_fmt
             at src/libcore/panicking.rs:85
   8: core::result::unwrap_failed
   9: <ripgrep_all::adapters::tar::TarAdapter as ripgrep_all::adapters::FileAdapter>::adapt
  10: ripgrep_all::preproc::rga_preproc
  11: <ripgrep_all::adapters::tar::TarAdapter as ripgrep_all::adapters::FileAdapter>::adapt
  12: ripgrep_all::preproc::rga_preproc
  13: <ripgrep_all::adapters::decompress::DecompressAdapter as ripgrep_all::adapters::FileAdapter>::adapt
  14: core::ops::function::FnOnce::call_once{{vtable.shim}}
  15: <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once
  16: <ripgrep_all::preproc_cache::LmdbCache as ripgrep_all::preproc_cache::PreprocCache>::get_or_run
  17: ripgrep_all::preproc::rga_preproc
  18: rga_preproc::main
  19: std::rt::lang_start::{{closure}}
  20: std::panicking::try::do_call
             at src/libstd/rt.rs:49
             at src/libstd/panicking.rs:293
  21: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:87
  22: std::rt::lang_start_internal
             at src/libstd/panicking.rs:272
             at src/libstd/panic.rs:388
             at src/libstd/rt.rs:48
  23: main
  24: __libc_start_main
  25: _start
-------------------------------------------------------------------------------

Let me know if more info is needed.

Respect .rgignore

Hi, first of all thanks for this - this is incredibly useful for me.
Caching makes all the difference compared to the slow pdfgrep!

It seems rga doesn't respect .rgignore file? Would it be possible to add it please?

DOCX but not DOC

Do I understand correctly that DOC is not searched but DOCX is? I made a test where I confirmed this.

No progress when searching with pdfpages adapter

I wanted to try the pdfpages + tesseract search, but it seems that it hangs every time I try to use the pdfpages adapter via the --rga-adapters=+pdfpages,tesseract flag.

Output without adapters (hit in last line):

RUST_LOG=debug rga Peroxid ./AAC-Stunde-14.pdf
[2019-06-20T11:55:57Z DEBUG ripgrep_all::args] our_args: ["rga"]
[2019-06-20T11:55:57Z DEBUG ripgrep_all::args] RGA_CONFIG={}
[2019-06-20T11:55:57Z DEBUG ripgrep_all::args] passthrough_args: ["Peroxid", "./AAC-Stunde-14.pdf"]
[2019-06-20T11:55:57Z DEBUG ripgrep_all::adapters] Chosen adapters: ffmpeg,pandoc,poppler,zip,decompress,tar,sqlite
Page 28:     b) O fast immer -2 (Ausnahme: Peroxide)

Output with adapters (process just hangs and doesn't exit on its own):

RUST_LOG=debug rga Peroxid ./AAC-Stunde-14.pdf --rga-adapters=+pdfpages,tesseract
[2019-06-20T11:57:28Z DEBUG ripgrep_all::args] our_args: ["rga", "--rga-adapters=+pdfpages,tesseract"]
[2019-06-20T11:57:28Z DEBUG ripgrep_all::args] RGA_CONFIG={"adapters":["+pdfpages","tesseract"]}
[2019-06-20T11:57:28Z DEBUG ripgrep_all::args] passthrough_args: ["Peroxid", "./AAC-Stunde-14.pdf"]
[2019-06-20T11:57:28Z DEBUG ripgrep_all::adapters] Chosen adapters: tesseract,pdfpages,ffmpeg,pandoc,poppler,zip,decompress,tar,sqlite

Just let me know if there is any more info you need!

Enable searching of local RPMs

I sometimes have to locate files in a big pool of local RPMs. It would be really neat if this tool supported searching through that.

Feature request: optionally enable use of cache also for plain text files

First of all, thank you for your work on this tool.
It is very useful and cleverly designed.
And thank you for remembering ARM
and providing a binary release for it as well. Otherwise, compiling Rust on the Raspberry Pi and friends takes GBs and hours.

Now for my request.
I'm very interested in this cache functionality.
It works wonderfully.
It allows me to have, for example, a huge
collection of academic papers, books, PDFs, epubs etc. that change neither content nor location in the file system, and to have a kind of index for fast search. And when I say huge, I mean really huge. No more worrying about which format I got a resource in. I now have a single tool to search them all.

I noticed ripgrep-all doesn't create the "data.mdb" file when I only
search in a collection of plain text files.
Example: rga python /usr/share/nvim/runtime/docs

I'm no programmer, please correct me if I'm wrong.

1. This "data.mdb" file is a kind of index, right?

Or is it really just a cache, so that the conversion work of the "heavy tools" like pdftotext doesn't have to be repeated, while the "search work" of ripgrep is still repeated each time?

I'm asking because I would also like to "index" huge amounts of
plain text files: years of notes, wikis in plain markdown (e.g. git wikis), downloaded web sites converted to plain text, user documentation files (e.g. vim docs), and so on.

There are other specialized tools for this,
for example desktop utilities or recoll.
But I prefer the command line, and to use grep or ripgrep.

2. Would it be possible to search against this "data.mdb" without specifying the location of the files?

rga some-important-fact-regex data.mdb

That way I don't have to specify which file or folder to look in. I don't remember where it is; I just know I read it someday in one of my books, notes, or papers. That data.mdb would simply be my personal knowledge base.

It's the old dream of having your text files indexed in one file,
instead of running grep over them again each time you look for something.

Or am I seeing this wrongly, and it doesn't make sense to "index" plain text
files because the computational cost of running grep or ripgrep on them is low? Am I confusing indexing and caching? Is an "index" created with ripgrep something completely different?
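
The caching-versus-indexing distinction raised in this question can be illustrated with a small sketch. It is not rga's actual implementation: the names `ExtractionCache`, `search_with_cache`, `build_index` and `fake_documents` are hypothetical, and `get_or_run` is only borrowed from the `PreprocCache::get_or_run` frame visible in a stack trace reported earlier on this page. The assumption is that a cache stores the extracted text so the expensive conversion is skipped, while the regex scan still runs every time; an index instead stores a word-to-file mapping so the lookup itself avoids a scan.

```python
import re
from collections import defaultdict


class ExtractionCache:
    """Hypothetical cache: stores extracted text keyed by (path, mtime)."""

    def __init__(self, extract):
        self._extract = extract      # the expensive conversion step
        self._store = {}
        self.extract_calls = 0       # counts how often the conversion ran

    def get_or_run(self, path, mtime):
        key = (path, mtime)
        if key not in self._store:
            self.extract_calls += 1
            self._store[key] = self._extract(path)
        return self._store[key]


def search_with_cache(cache, files, pattern):
    """The regex still scans every document's text on every search."""
    regex = re.compile(pattern)
    return [path for path, mtime in files
            if regex.search(cache.get_or_run(path, mtime))]


def build_index(texts):
    """Inverted index: word -> set of paths, so a lookup needs no scan."""
    index = defaultdict(set)
    for path, text in texts.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(path)
    return index


# Made-up stand-ins for real files and their extracted text.
fake_documents = {
    "book.pdf": "Peroxide chemistry notes",
    "notes.md": "vim docs and python tips",
}
cache = ExtractionCache(lambda path: fake_documents[path])
files = [(path, 0) for path in fake_documents]

first = search_with_cache(cache, files, r"python")
second = search_with_cache(cache, files, r"python")   # no new extraction
index = build_index(fake_documents)

print(first, cache.extract_calls, sorted(index["python"]))
```

In this sketch the second search triggers no further extraction, but it still runs the regex over both documents, whereas `index["python"]` is a direct lookup. An index of this kind only answers whole-word queries, which is one reason a regex tool may prefer to cache text and rescan it.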

build failed in ubuntu 18.04 docker

I tried to build ripgrep-all from source and got errors.
I can reproduce the same error in a Docker environment.

  • docker original image: ubuntu/latest
docker run -it --rm ubuntu /bin/bash
  • OS version
root@7927ed67baf9:/# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.2 LTS
Release:	18.04
Codename:	bionic
  • Installation code
apt update
apt upgrade

# install requirements
apt install build-essential pandoc poppler-utils ffmpeg cargo curl

# install rust
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"
  • rust version check
root@7927ed67baf9:/# rustc --version
rustc 1.36.0 (a53f9df32 2019-07-03)
  • Install ripgrep & ripgrep-all
# install ripgrep
cargo install ripgrep

# install ripgrep_all
cargo install ripgrep_all
  • Error messages
error[E0277]: `*const libsqlite3_sys::sqlite3_module` cannot be sent between threads safely
  --> /root/.cargo/registry/src/github.com-1ecc6299db9ec823/rusqlite-0.18.0/src/vtab/series.rs:21:1
   |
21 | / lazy_static! {
22 | |     static ref SERIES_MODULE: Module<SeriesTab> = eponymous_only_module::<SeriesTab>(1);
23 | | }
   | |_^ `*const libsqlite3_sys::sqlite3_module` cannot be sent between threads safely
   |
   = help: within `vtab::Module<vtab::series::SeriesTab>`, the trait `std::marker::Send` is not implemented for `*const libsqlite3_sys::sqlite3_module`
   = note: required because it appears within the type `libsqlite3_sys::sqlite3_vtab`
   = note: required because it appears within the type `vtab::series::SeriesTab`
   = note: required because it appears within the type `std::marker::PhantomData<vtab::series::SeriesTab>`
   = note: required because it appears within the type `vtab::Module<vtab::series::SeriesTab>`
   = note: required because of the requirements on the impl of `std::marker::Sync` for `spin::once::Once<vtab::Module<vtab::series::SeriesTab>>`
   = note: required because it appears within the type `lazy_static::lazy::Lazy<vtab::Module<vtab::series::SeriesTab>>`
   = note: shared static variables must have a type that implements `Sync`
   = note: this error originates in a macro outside of the current crate (in Nightly builds, run with -Z external-macro-backtrace for more info)

error[E0277]: `*mut i8` cannot be sent between threads safely
  --> /root/.cargo/registry/src/github.com-1ecc6299db9ec823/rusqlite-0.18.0/src/vtab/series.rs:21:1
   |
21 | / lazy_static! {
22 | |     static ref SERIES_MODULE: Module<SeriesTab> = eponymous_only_module::<SeriesTab>(1);
23 | | }
   | |_^ `*mut i8` cannot be sent between threads safely
   |
   = help: within `vtab::Module<vtab::series::SeriesTab>`, the trait `std::marker::Send` is not implemented for `*mut i8`
   = note: required because it appears within the type `libsqlite3_sys::sqlite3_vtab`
   = note: required because it appears within the type `vtab::series::SeriesTab`
   = note: required because it appears within the type `std::marker::PhantomData<vtab::series::SeriesTab>`
   = note: required because it appears within the type `vtab::Module<vtab::series::SeriesTab>`
   = note: required because of the requirements on the impl of `std::marker::Sync` for `spin::once::Once<vtab::Module<vtab::series::SeriesTab>>`
   = note: required because it appears within the type `lazy_static::lazy::Lazy<vtab::Module<vtab::series::SeriesTab>>`
   = note: shared static variables must have a type that implements `Sync`
   = note: this error originates in a macro outside of the current crate (in Nightly builds, run with -Z external-macro-backtrace for more info)

error: aborting due to 2 previous errors

For more information about this error, try `rustc --explain E0277`.
error: failed to compile `ripgrep_all v0.9.2`, intermediate artifacts can be found at `/tmp/cargo-installOqSn2W`

Caused by:
  Could not compile `rusqlite`.

To learn more, run the command again with --verbose.
  • With --verbose option
cargo install ripgrep_all --verbose
  • Error messages
   Compiling rusqlite v0.18.0
     Running `rustc --edition=2018 --crate-name rusqlite /root/.cargo/registry/src/github.com-1ecc6299db9ec823/rusqlite-0.18.0/src/lib.rs --color always --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 --cfg 'feature="bundled"' --cfg 'feature="lazy_static"' --cfg 'feature="libsqlite3-sys"' --cfg 'feature="vtab"' -C metadata=b5985805439e28e0 -C extra-filename=-b5985805439e28e0 --out-dir /tmp/cargo-installa2jlTk/release/deps -L dependency=/tmp/cargo-installa2jlTk/release/deps --extern bitflags=/tmp/cargo-installa2jlTk/release/deps/libbitflags-e53b4b50e5713d69.rlib --extern fallible_iterator=/tmp/cargo-installa2jlTk/release/deps/libfallible_iterator-229d1b548363649b.rlib --extern fallible_streaming_iterator=/tmp/cargo-installa2jlTk/release/deps/libfallible_streaming_iterator-3dc8799337b38106.rlib --extern lazy_static=/tmp/cargo-installa2jlTk/release/deps/liblazy_static-b92b4a1bf6babc8c.rlib --extern libsqlite3_sys=/tmp/cargo-installa2jlTk/release/deps/liblibsqlite3_sys-fec7266de6e25bdf.rlib --extern lru_cache=/tmp/cargo-installa2jlTk/release/deps/liblru_cache-b6c9918432310b68.rlib --extern memchr=/tmp/cargo-installa2jlTk/release/deps/libmemchr-292d84c02d2d5263.rlib --extern time=/tmp/cargo-installa2jlTk/release/deps/libtime-d82d709fd943c2c8.rlib --cap-lints allow -L native=/tmp/cargo-installa2jlTk/release/build/libsqlite3-sys-0f04e3e46e346ca2/out`
error[E0277]: `*const libsqlite3_sys::sqlite3_module` cannot be sent between threads safely
  --> /root/.cargo/registry/src/github.com-1ecc6299db9ec823/rusqlite-0.18.0/src/vtab/series.rs:21:1
   |
21 | / lazy_static! {
22 | |     static ref SERIES_MODULE: Module<SeriesTab> = eponymous_only_module::<SeriesTab>(1);
23 | | }
   | |_^ `*const libsqlite3_sys::sqlite3_module` cannot be sent between threads safely
   |
   = help: within `vtab::Module<vtab::series::SeriesTab>`, the trait `std::marker::Send` is not implemented for `*const libsqlite3_sys::sqlite3_module`
   = note: required because it appears within the type `libsqlite3_sys::sqlite3_vtab`
   = note: required because it appears within the type `vtab::series::SeriesTab`
   = note: required because it appears within the type `std::marker::PhantomData<vtab::series::SeriesTab>`
   = note: required because it appears within the type `vtab::Module<vtab::series::SeriesTab>`
   = note: required because of the requirements on the impl of `std::marker::Sync` for `spin::once::Once<vtab::Module<vtab::series::SeriesTab>>`
   = note: required because it appears within the type `lazy_static::lazy::Lazy<vtab::Module<vtab::series::SeriesTab>>`
   = note: shared static variables must have a type that implements `Sync`
   = note: this error originates in a macro outside of the current crate (in Nightly builds, run with -Z external-macro-backtrace for more info)

error[E0277]: `*mut i8` cannot be sent between threads safely
  --> /root/.cargo/registry/src/github.com-1ecc6299db9ec823/rusqlite-0.18.0/src/vtab/series.rs:21:1
   |
21 | / lazy_static! {
22 | |     static ref SERIES_MODULE: Module<SeriesTab> = eponymous_only_module::<SeriesTab>(1);
23 | | }
   | |_^ `*mut i8` cannot be sent between threads safely
   |
   = help: within `vtab::Module<vtab::series::SeriesTab>`, the trait `std::marker::Send` is not implemented for `*mut i8`
   = note: required because it appears within the type `libsqlite3_sys::sqlite3_vtab`
   = note: required because it appears within the type `vtab::series::SeriesTab`
   = note: required because it appears within the type `std::marker::PhantomData<vtab::series::SeriesTab>`
   = note: required because it appears within the type `vtab::Module<vtab::series::SeriesTab>`
   = note: required because of the requirements on the impl of `std::marker::Sync` for `spin::once::Once<vtab::Module<vtab::series::SeriesTab>>`
   = note: required because it appears within the type `lazy_static::lazy::Lazy<vtab::Module<vtab::series::SeriesTab>>`
   = note: shared static variables must have a type that implements `Sync`
   = note: this error originates in a macro outside of the current crate (in Nightly builds, run with -Z external-macro-backtrace for more info)

error: aborting due to 2 previous errors

For more information about this error, try `rustc --explain E0277`.
error: failed to compile `ripgrep_all v0.9.2`, intermediate artifacts can be found at `/tmp/cargo-installa2jlTk`

Caused by:
  Could not compile `rusqlite`.

Caused by:
  process didn't exit successfully: `rustc --edition=2018 --crate-name rusqlite /root/.cargo/registry/src/github.com-1ecc6299db9ec823/rusqlite-0.18.0/src/lib.rs --color always --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 --cfg 'feature="bundled"' --cfg 'feature="lazy_static"' --cfg 'feature="libsqlite3-sys"' --cfg 'feature="vtab"' -C metadata=b5985805439e28e0 -C extra-filename=-b5985805439e28e0 --out-dir /tmp/cargo-installa2jlTk/release/deps -L dependency=/tmp/cargo-installa2jlTk/release/deps --extern bitflags=/tmp/cargo-installa2jlTk/release/deps/libbitflags-e53b4b50e5713d69.rlib --extern fallible_iterator=/tmp/cargo-installa2jlTk/release/deps/libfallible_iterator-229d1b548363649b.rlib --extern fallible_streaming_iterator=/tmp/cargo-installa2jlTk/release/deps/libfallible_streaming_iterator-3dc8799337b38106.rlib --extern lazy_static=/tmp/cargo-installa2jlTk/release/deps/liblazy_static-b92b4a1bf6babc8c.rlib --extern libsqlite3_sys=/tmp/cargo-installa2jlTk/release/deps/liblibsqlite3_sys-fec7266de6e25bdf.rlib --extern lru_cache=/tmp/cargo-installa2jlTk/release/deps/liblru_cache-b6c9918432310b68.rlib --extern memchr=/tmp/cargo-installa2jlTk/release/deps/libmemchr-292d84c02d2d5263.rlib --extern time=/tmp/cargo-installa2jlTk/release/deps/libtime-d82d709fd943c2c8.rlib --cap-lints allow -L native=/tmp/cargo-installa2jlTk/release/build/libsqlite3-sys-0f04e3e46e346ca2/out` (exit code: 1)

No such file or directory (os error 2)

Unsure what's going on here. Running on my unraid server, rg binary installed in current directory.

# PATH=$PATH:. rga "..." /mnt/user/backup/
/mnt/user/backup/_omitted_.zip: No such file or directory (os error 2)

File exists:

root@Tower:/tmp/rga# ls -l /mnt/user/backup/_omitted_.zip
-rw-rw-rw- 1 nobody users ... Aug  4  2018 /mnt/user/backup/_omitted_.zip

ripgrep exists:

# PATH=$PATH:. rg
error: The following required arguments were not provided:
    <PATTERN>

Build fails with "unstable feature" error in rkv dependency

I tried doing cargo build (of master at commit ef2e4eb) and got this error:

  $ cargo build 
      Updating crates.io index
   Downloading crates ...
    Downloaded chrono v0.4.6
    Downloaded encoding_rs v0.8.17
    [...]
     Compiling zip v0.5.2
     Compiling serde_json v1.0.39
     Compiling rkv v0.9.6
  error[E0658]: use of unstable library feature 'try_from' (see issue #33417)
     --> /home/kfogel/.cargo/registry/src/github.com-1ecc6299db9ec823/rkv-0.9.6/src/error.rs:166:11
      |
  166 | impl From<::std::num::TryFromIntError> for MigrateError {
      |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  
  error[E0658]: use of unstable library feature 'try_from' (see issue #33417)
    --> /home/kfogel/.cargo/registry/src/github.com-1ecc6299db9ec823/rkv-0.9.6/src/migrate.rs:78:5
     |
  78 |     convert::TryFrom,
     |     ^^^^^^^^^^^^^^^^
  
  [...many more similar error lines...]
  
  error: aborting due to 12 previous errors
  
  For more information about this error, try `rustc --explain E0658`.
  error: Could not compile `rkv`.
  warning: build failed, waiting for other jobs to finish...
  error: build failed
  $ 

I don't know much Rust, but it looks like rkv is using an unstable feature (rust bug 33417 has more about it), and since rga depends on rkv, this affects the rga build too. I ran rustc --explain E0658 and got some information about how to solve the problem. Presumably those solutions would have to be implemented upstream in rkv if we wanted to solve this for everyone; otherwise I'd have to either build a modified rkv locally or use the nightly version of rustc for the build I just tried.

I'm not sure what ways might be available to solve this within rga. Ideas welcome; like I said, I don't know Rust that well.

Anyway, this was all along the way to submitting a PR for README.md to add installation instructions. I'll submit that PR, and then in its commentary mention this issue.

Add djvu support

So, this is obviously a feature request… Would it be possible to add support for djvu files, using djvutxt from djvulibre?

Failed to run the demo.

Hi,

See the following:

$ rga "hello" ./demo/
./demo/hello.sqlite3
tbl: greeting='hello', from='sqlite database!'

./demo/greeting.mkv
metadata: chapters.chapter.0.tags.title="Chapter 1: Hello"
00:08.398 --> 00:11.758: Hello from a movie!
./demo/hello.odt: preprocessor command failed: '"rga-preproc" "./demo/hello.odt"':

adapter: pandoc
pandoc: Cannot read archive from stdin
CallStack (from HasCallStack):
error, called at pandoc.hs:1300:22 in main:Main
Error: subprocess failed: ExitStatus(ExitStatus(256))

./demo/somearchive.zip: preprocessor command failed: '"rga-preproc" "./demo/somearchive.zip"':

adapter: zip
adapter: pandoc
pandoc: Cannot read archive from stdin
CallStack (from HasCallStack):
error, called at pandoc.hs:1300:22 in main:Main
Error: subprocess failed: ExitStatus(ExitStatus(256))

Any hints?

ps.

$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 9.9 (stretch)
Release: 9.9
Codename: stretch

$ rga -V
ripgrep-all 0.9.2

Minimize (and provide option to restore) content adapter errors

I'm searching a local tree with lots of "crap" files where some are invalid, some are misnamed, others are mis-predicted mime types. It would be great to minimize the overall error logging impact of these files to a single line (maybe two at most) to avoid losing my search results in the noise.
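
One way the requested one-line summary could look is sketched below. This is only an illustration and not an existing rga option: `summarize_failure` is a hypothetical helper, and the input layout (a "preprocessor command failed" header, then `adapter:` lines, then the tool's message) is assumed from the failure output quoted in the demo report above.

```python
def summarize_failure(block: str) -> str:
    """Collapse one multi-line preprocessor failure into a single line."""
    lines = [line.strip() for line in block.splitlines() if line.strip()]
    # The header line starts with the path of the file that failed.
    path = lines[0].split(": preprocessor command failed", 1)[0]
    adapters = []
    message = "unknown error"
    for line in lines[1:]:
        if set(line) == {"-"}:
            continue  # skip separator rules made only of dashes
        if line.startswith("adapter:"):
            adapters.append(line.split(":", 1)[1].strip())
        else:
            message = line  # first line of the tool's own error output
            break
    chain = ">".join(adapters) or "?"
    return f"{path}: [{chain}] {message}"


report = """./demo/somearchive.zip: preprocessor command failed: '"rga-preproc" "./demo/somearchive.zip"':

adapter: zip
adapter: pandoc
pandoc: Cannot read archive from stdin
CallStack (from HasCallStack):
error, called at pandoc.hs:1300:22 in main:Main
Error: subprocess failed: ExitStatus(ExitStatus(256))
"""

print(summarize_failure(report))
```

The full message would still need to be available on request, for instance behind a verbosity flag, which is the "option to restore" part of this request.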

Unicode normalization

rga currently does not find text that is encoded in a non-normalized form in the source documents.

Example:

Modellüberprüfung does not match Modellüberprüfung

(decodeURIComponent("Modellu%CC%88berpru%CC%88fung") != decodeURIComponent("Modell%C3%BCberpr%C3%BCfung"))

Probably would be useful to just always unicode-normalize input text (e.g. in postproc_line_prefix)

example file:
nachklausur1718.pdf
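
The mismatch described in this report, and the effect of normalizing both sides, can be reproduced with Python's standard `unicodedata` module. This is a minimal sketch of the idea only; the `normalize` helper is hypothetical, the choice of NFC over NFD is an assumption, and where such a step would belong inside rga is not shown.

```python
import unicodedata

# "u" followed by U+0308 COMBINING DIAERESIS (decomposed form).
decomposed = "Modellu\u0308berpru\u0308fung"
# Single code point U+00FC (precomposed form).
precomposed = "Modell\u00fcberpr\u00fcfung"


def normalize(text: str) -> str:
    """Bring text into one canonical form (NFC) before comparing or searching."""
    return unicodedata.normalize("NFC", text)


same_raw = decomposed == precomposed
same_normalized = normalize(decomposed) == normalize(precomposed)
print(same_raw, same_normalized)  # prints: False True
```

Both strings render identically, but the decomposed one is two code points longer, so a plain comparison or regex match fails until both the document text and the search pattern are normalized to the same form.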

error running pdf search on windows 10 - 64bit

I tried running the pdf search with the adapter "poppler" on both version 0.9.2 and 0.9.3 and I get the following error message. What am I missing here?

Reference.pdf: preprocessor command failed: '"rga-preproc" "Reference.pdf"':
-------------------------------------------------------------------------------
adapter: poppler
pdftotext version 4.00
Copyright 1996-2017 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>             : first page to convert
  -l <int>             : last page to convert
  -layout              : maintain original physical layout
  -simple              : simple one-column page layout
  -table               : similar to -layout, but optimized for tables
  -lineprinter         : use strict fixed-pitch/height layout
  -raw                 : keep strings in content stream order
  -fixed <number>      : assume fixed-pitch (or tabular) text
  -linespacing <number>: fixed line spacing for LinePrinter mode
  -clip                : separate clipped text
  -nodiag              : discard diagonal text
  -enc <string>        : output text encoding name
  -eol <string>        : output end-of-line convention (unix, dos, or mac)
  -nopgbrk             : don't insert page breaks between pages
  -bom                 : insert a Unicode BOM at the start of the text file
  -opw <string>        : owner password (for encrypted files)
  -upw <string>        : user password (for encrypted files)
  -q                   : don't print any messages or errors
  -cfg <string>        : configuration file to use in place of .xpdfrc
  -v                   : print copyright and version info
  -h                   : print usage information
  -help                : print usage information
  --help               : print usage information
  -?                   : print usage information
Error: The pipe has been ended. (os error 109)

brew tesseract

Unknown command: tesseract

Any idea why this error occurs?
