Comments (4)
But I don't see a log message about skipping the file on the console. The
cindex run looks normal.
Original comment by [email protected]
on 21 Nov 2012 at 7:36
from codesearch.
I encountered all these issues you mentioned and was annoyed enough by them to
implement the following changes for myself at
https://github.com/junkblocker/codesearch
1) Do not stop at the first bad UTF-8 character encountered. Instead, allow a
percentage of non-UTF-8 characters in the document. These are ignored, but
the rest of the document gets indexed. The option, which I call
-maxinvalidutf8ratio, defaults to 0.1. This, combined with treating any
document containing a 0x00 byte as binary, has been working great for me.
2) Allow a custom trigram limit (-maxtrigrams). The current hardcoded limit is
20000 trigrams, but I sadly have to work on code with one important source
file beyond that.
3) Add a message and the reason for every document skipped from indexing.
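The heuristic in item 1 can be sketched roughly as follows. This is a minimal illustration, not the code from the fork: the function name `shouldIndex` and the parameter `maxInvalidRatio` are hypothetical, and the ratio is assumed to be invalid bytes over decoded characters, which may differ from how the real option counts.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// shouldIndex reports whether data looks like text worth indexing and,
// if not, why. maxInvalidRatio is a hypothetical parameter modeled on
// the -maxinvalidutf8ratio option described in the comment.
func shouldIndex(data []byte, maxInvalidRatio float64) (bool, string) {
	// Assumption: any NUL byte marks the file as binary.
	if bytes.IndexByte(data, 0x00) >= 0 {
		return false, "contains a 0x00 byte, treated as binary"
	}
	total, invalid := 0, 0
	for i := 0; i < len(data); {
		r, size := utf8.DecodeRune(data[i:])
		// DecodeRune returns (RuneError, 1) for an invalid encoding.
		if r == utf8.RuneError && size == 1 {
			invalid++
		}
		total++
		i += size
	}
	if total > 0 {
		ratio := float64(invalid) / float64(total)
		if ratio > maxInvalidRatio {
			return false, fmt.Sprintf("invalid UTF-8 ratio %.3f exceeds %.3f", ratio, maxInvalidRatio)
		}
	}
	return true, ""
}

func main() {
	// A Latin-1 encoded author name: 0xE9 is not valid UTF-8 here.
	latin1 := []byte("// Author: Ren\xe9 Dupont\nint main(void) { return 0; }\n")
	ok, reason := shouldIndex(latin1, 0.1)
	fmt.Println(ok, reason)
}
```

Returning the reason alongside the decision also covers item 3: the caller can log why each skipped document was skipped.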
I would love to get those changes merged, or at least considered for an
alternate implementation here in the official sources, but I am not sure how
alive the project is.
Original comment by [email protected]
on 21 Nov 2012 at 3:09
The project is not super alive. Mostly the code just works and we
leave it alone. I think the UTF-8 heuristic works pretty well as does
the trigram size heuristic. It's possible to tune these forever, of
course. How many trigrams does your important file have?
I thought that the indexer already printed a message about files it skipped
if you run it in verbose mode, but maybe I am misremembering.
Original comment by [email protected]
on 6 Dec 2012 at 4:30
All source files being UTF-8 is a pretty big assumption. A lot of files may be
Latin-1 etc., which is the most common problem I encountered. Having a random
European author's name with a diacritic in the source, or some Cyrillic, for
example, loses a whole file from the index, making codesearch something that
can't be depended on at all. When I am changing code based on what codesearch
finds in my codebase, I don't wanna miss some files for this reason. codesearch
should not be less reliable than a regular grep.
The file I mentioned is around 30K trigrams. It was simple to just add a custom
limit flag.
The indexer omits the warning in a couple of places, mainly because of the
assumptions it makes about the input data. The one example I recall off the top
of my head is quietly ignoring symlinked paths (I submitted another patch that
makes following them optional).
Original comment by [email protected]
on 6 Dec 2012 at 5:47