Comments (2)
Indeed it is: you have an overflow on the sentence id. We do not currently support a config option to change the datatype of the sentence id, so the quickest solution is to find the document with >= 312748 sentences and split it into smaller documents (based on your log, it's document number 5179808 in this specific input file).
Note that the issue here is not the number of documents, but rather that one individual document has a very large number of sentences.
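For anyone hitting the same overflow, the workaround above could be sketched roughly as follows. This is plain Python and does not use any datatrove API: the regex sentence splitter, the `MAX_SENTENCES_PER_DOC` value, the `split_document` helper and the `_partN` id suffix are all assumptions made for illustration, so swap in the sentence splitter your pipeline actually uses and a cap that fits your setup.

```python
import re
from typing import Iterator

# Hypothetical cap, chosen only for illustration; pick a value safely
# below whatever limit the sentence id datatype imposes in your setup.
MAX_SENTENCES_PER_DOC = 10_000


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter (assumption): breaks after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_document(
    doc_id: str, text: str, max_sentences: int = MAX_SENTENCES_PER_DOC
) -> Iterator[tuple[str, str]]:
    """Yield (id, text) pairs, each holding at most max_sentences sentences."""
    sentences = split_sentences(text)
    if len(sentences) <= max_sentences:
        # Small enough already: pass the document through unchanged.
        yield doc_id, text
        return
    for part, start in enumerate(range(0, len(sentences), max_sentences)):
        chunk = sentences[start:start + max_sentences]
        # Hypothetical naming scheme for the resulting smaller documents.
        yield f"{doc_id}_part{part}", " ".join(chunk)
```

Running this as a preprocessing step over the input file, before the dedup stage, should keep every document under the sentence limit without dropping any text.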
from datatrove.
Thanks, yes, I just saw that it was on sent.id. I have applied the filter and am re-running. Thank you!