Comments (1)
This is actually the approach used in the original paper; you can find it in their README at https://github.com/google-research/deduplicate-text-datasets:
In our paper we suggest just taking all of these duplicate sequences that have been identified and completely striking them from the dataset. This somewhat breaks the flow of text, for example if previously had an example "Alice wanted to go to the store" and we deduplicated at the level of 10 characters, we might completely strike " to go to the " and be left with "Alice wantedstore". In practice we have found this doesn't break the language model because we remove relatively little text, and so these breaks don't cause harm.
However, that might not hold for your own dataset, or you might have a different use for those duplicated strings, in which case you will have to modify the code.
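To make the README's example concrete, here is a minimal sketch of what "striking" a duplicate span does to a string. It is not the code from either repository: `strike_spans` is a hypothetical helper written for illustration, and it works on character offsets in a Python string, whereas the real tooling may track offsets differently.

```python
def strike_spans(text, spans):
    """Remove half-open [start, end) character ranges from text.

    Hypothetical helper for illustration only; overlapping or
    unsorted spans are tolerated.
    """
    kept = []
    cursor = 0
    for start, end in sorted(spans):
        if start > cursor:
            # Keep the text between the previous removal and this one.
            kept.append(text[cursor:start])
        # Advance past the removed range (max handles overlaps).
        cursor = max(cursor, end)
    kept.append(text[cursor:])
    return "".join(kept)


text = "Alice wanted to go to the store"
dup = " to go to the "
start = text.find(dup)
result = strike_spans(text, [(start, start + len(dup))])
print(result)
```

If the broken flow matters for your use case, this is the kind of step you would change, for example by dropping the whole document or by leaving a separator where the span was removed.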
from text-dedup.