
chenghaomou / text-dedup


All-in-one text de-duplication

License: Apache License 2.0

Languages: Python 81.45%, Jupyter Notebook 17.49%, Makefile 0.67%, Dockerfile 0.39%
Topics: text-processing, de-duplication, nlp, data-processing

text-dedup's Introduction


Installation

pip install text-dedup

or

pip install git+https://github.com/ChenghaoMou/text-dedup

Documentation

Github Pages

Features

This repository contains a collection of text deduplication scripts that are ready to use or to modify based on your needs:

  • RETSim/UniSim, an embedding-based near deduplication (WIP)
  • MinHash + MinHashLSH, including a spark implementation suitable for large (TB) datasets
  • 64 or 128 bit SimHash
  • SuffixArray Substring
  • Bloom Filter
  • Exact Hash (document-level, line-level/ccnet)

I also have big plans for the future.

However, I do not intend to build a general-purpose deduplication library, which was the goal of this repo early on. I will gradually retire the PyPI package as well. The reason is that each use case can be wildly different and requires careful design and consideration. I sincerely encourage you to read the scripts first (they are relatively short) so you understand what is at stake when using them. You can use them to bootstrap your own script, or just use them as a reference.

Acknowledgements

This repository is inspired by the following projects, and is heavily influenced by lessons learned from my own participation in BigScience (Apache 2.0) and BigCode (Apache 2.0). There is a blog post about the journey. Feedback is welcome!

Quick Examples

Native PySpark

MODIFY text_dedup/minhash_spark.py FOR YOUR OWN PROJECT AND DATASET FIRST!

Assuming you have a downloaded dataset (in parquet files) under "./temp-data", you can process the files with your local compute by:

export PYSPARK_PYTHON="path to your python with scipy, xxhash, and numpy installed"
spark-submit --executor-memory 16g \
    --driver-memory 20g \
    --executor-cores 3 \
    --num-executors 2 \
    --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12 \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
    text_dedup/minhash_spark.py\
    --input "./temp-data" \
    --output "./temp-output" \
    --column "text" \
    --threshold 0.7
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Using B=25, R=10
DEBUG __main__ - Loaded documents: 88803
DEBUG __main__ - args.input='./temp-data'
DEBUG __main__ - args.output='./temp-output'
DEBUG __main__ - args.threshold=0.7
DEBUG __main__ - args.ngram_size=5
DEBUG __main__ - args.min_length=5
DEBUG __main__ - args.num_perm=250
DEBUG __main__ - args.column='text'
DEBUG __main__ - id                                                              : bigint
DEBUG __main__ - text                                                            : string
DEBUG __main__ - meta                                                            : struct<warc_headers:struct<warc-record-id:string,warc-date:string,content-type:string,content-length:int,warc-type:string,warc-identified-content-language:string,warc-refers-to:string,warc-target-uri:string,warc-block-digest:string>,identification:struct<label:string,prob:float>,annotations:array<string>,line_identifications:array<struct<label:string,prob:float>>>
DEBUG __main__ - __id__                                                          : bigint
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Initial edges: 52102
DEBUG __main__ - Edges DataFrame: 52102
DEBUG __main__ - Vertices DataFrame: 50206
DEBUG __main__ - Assignment DataFrame: 50206
DEBUG __main__ - Merging records: 88803
INFO  __main__ - Saving with 1 partitions and 44092 rows each
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Number of rows before:    88803
DEBUG __main__ - Number of rows after:     44092
DEBUG __main__ - Percentage of rows kept:  49.65%
DEBUG __main__ - Output:                   ./temp-output
DEBUG __main__ - Time:                     68.80s
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------

Or take a look at bigcode-v2/run.sh on how to run the job with GCP DataProc.

UniSim (WIP)

Based on Google's RETSim model (GitHub, arXiv), this is an embedding-based near-deduplication method.

For large datasets, GPU(s) are recommended for fast inference.

python text_dedup/ann_unisim.py --path truthful_qa --name generation --split validation --output temp --column question

Output:

INFO     Load Dataset                    : 5.56s
INFO     Index Dataset                   : 8.13s
INFO     Clustering                      : 8.72s
INFO     Filtering                       : 0.35s
INFO     Saving                          : 0.01s
INFO     Cleaning                        : 0.00s
INFO     Total                           : 22.77s
INFO     Before                          : 817
INFO     After                           : 788
Suffix Array Substring Exact Deduplication
# input
python -m text_dedup.suffix_array \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/suffix_array/oscar_gl_dedup" \
    --column "text" \
    --google_repo_path "/Users/chenghao/Downloads/Projects/text-dedup/deduplicate-text-datasets" \
    --use_auth_token true

# output
INFO     Loading                       : 2.75 seconds
INFO     Preprocessing                 : 4.78 seconds
INFO     SuffixArray                   : 98.29 seconds
INFO     SelfSimilar                   : 4.24 seconds
INFO     Restore                       : 0.25 seconds
INFO     Deduplicate                   : 6.23 seconds
INFO     Saving                        : 8.91 seconds
INFO     Total                         : 125.45 seconds
INFO     Before                        : 180332342 bytes (88803)
INFO     After                         : 97646271 bytes (40404)
MinHash Near Deduplication
# input
python -m text_dedup.minhash \
  --path "oscar-corpus/OSCAR-2201" \
  --name "gl" \
  --split "train" \
  --cache_dir "./cache" \
  --output "output/minhash/oscar_gl_dedup" \
  --column "text" \
  --batch_size 10000 \
  --use_auth_token true

# output
INFO     Loading                         : 2.62 seconds
INFO     MinHashing                      : 0.08 seconds
INFO     Clustering                      : 2.20 seconds
INFO     Filtering                       : 0.53 seconds
INFO     Saving                          : 9.86 seconds
INFO     Total                           : 15.29 seconds
INFO     Data Number (before)            : 88803
INFO     Data Number (after)             : 44124 (49.69%)
INFO     Duplicate Number                : 44679 (50.31%)
INFO     🤗 Happy Deduplicating 🤗
SimHash Near Deduplication
# input
python -m text_dedup.simhash \
  --path "oscar-corpus/OSCAR-2201" \
  --name "gl" \
  --split "train" \
  --cache_dir "./cache" \
  --output "output/simhash/oscar_gl_dedup" \
  --column "text" \
  --batch_size 10000 \
  --use_auth_token true

# output
INFO     Loading                         : 2.60 seconds
INFO     SimHashing                      : 0.04 seconds
INFO     Indexing                        : 28.88 seconds
INFO     Filtering                       : 0.88 seconds
INFO     Saving                          : 10.41 seconds
INFO     Total                           : 42.80 seconds
INFO     Data Number (before)            : 88803
INFO     Data Number (after)             : 46163 (51.98%)
INFO     Duplicate Number                : 42640 (48.02%)
INFO     🤗 Happy Deduplicating 🤗
Exact Hash Exact Deduplication
# input
python -m text_dedup.exact_hash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/exact_hash/oscar_gl_dedup" \
    --column "text" \
    --batch_size 1000 \
    --use_auth_token true

# output
INFO     Loading                       : 2.95s
INFO     Processing                    : 3.79s
INFO     Filtering                     : 0.10s
INFO     Saving                        : 2.89s
INFO     Total                         : 9.72s
INFO     Before                        : 88803
INFO     After                         : 47049
Bloom Filter Exact Deduplication
# input
python -m text_dedup.bloom_filter \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/bloom_filter/oscar_gl_dedup" \
    --error_rate 1e-5 \
    --column "text" \
    --batch_size 1000 \
    --use_auth_token true

# output
INFO     Loading                       : 2.72s
INFO     Processing                    : 4.84s
INFO     Filtering                     : 0.10s
INFO     Saving                        : 2.88s
INFO     Total                         : 10.54s
INFO     Before                        : 88803
INFO     After                         : 47045

Benchmarks

Note

The Spark implementation has some overhead for small datasets, so I recommend using that script only when you have a large dataset and enough compute resources.

pinecone/core-2020-05-10-deduplication

See tests/benchmark_core.py for reproduction.

| Algorithm | Precision (Duplicates) | Recall (Duplicates) | Precision (Non Duplicates) | Recall (Non Duplicates) | Macro F1 score | Accuracy | Time |
|---|---|---|---|---|---|---|---|
| UniSim | 0.9307 | 0.8924 | 0.9055 | 0.9394 | 0.9181 | 0.9054 | 1305.79s |
| MinHash Spark | 0.957 | 0.9445 | 0.9471 | 0.959 | 0.952 | 0.9202 | 691.77s |
| MinHash | 0.9594 | 0.9445 | 0.9474 | 0.9616 | 0.9534 | 0.924 | 18.88s |
| SimHash | 0.9042 | 0.721 | 0.792 | 0.9329 | 0.8481 | 0.8321 | 644.36s |
| Exact Title | 0.8302 | 0.5521 | 0.7098 | 0.9065 | 0.77 | 0.7456 | - |
| Exact Title Matching [1] | 0.830 | 0.50 | 0.709 | 0.992 | 0.757 | 0.746 | - |
| Simhash Matching [1] | 0.697 | 0.247 | 0.598 | 0.985 | 0.631 | 0.616 | - |
| Document Vector Similarity [1] | 0.912 | 0.779 | 0.861 | 0.986 | 0.885 | 0.883 | - |
| Hybrid Method [1] | 0.908 | 0.828 | 0.899 | 0.979 | 0.904 | 0.903 | - |
| LaBSE [2] | 0.937 | 0.923 | 0.930 | 0.943 | 0.933 | 0.919 | - |
| Multilingual USE [2] | 0.917 | 0.907 | 0.918 | 0.927 | 0.917 | 0.909 | - |
| Multilingual E5-Base [2] | 0.931 | 0.908 | 0.919 | 0.939 | 0.924 | 0.920 | - |
| MinHash + LSH [2] | 0.929 | 0.902 | 0.915 | 0.938 | 0.921 | 0.918 | - |
| RETSim Partial-Dup [2] | 0.945 | 0.941 | 0.945 | 0.949 | 0.945 | 0.928 | - |
| RETSim Near-Dup [2] | 0.928 | 0.937 | 0.942 | 0.934 | 0.935 | 0.926 | - |
NEWS-COPY

See tests/benchmark_news.py for reproduction.

Adjusted Rand Index (ARI) on NEWS-COPY dataset:

| Model/Algorithm | ARI |
|---|---|
| SimHash | 0.612 |
| MinHash (Spark) | 0.740 |
| MinHash | 0.742 |
| RETSim Near-Dup + ANN* | 0.051 |
| n-gram [3] | 0.440 |
| SimHash [2] | 0.695 |
| MinHash [3] | 0.737 |
| MinHash [2] | 0.783 |
| Multilingual USE [2] | 0.730 |
| Multilingual E5-Base [2] | 0.742 |
| S-BERT [3] | 0.700 |
| RETSim Partial-Dup [2] | 0.831 |
| RETSim Near-Dup [2] | 0.704 |
| Re-ranking [3] | 0.937 |
| Bi-encoder [3] | 0.915 |

*: I can't seem to reproduce the results from the paper.

License

Apache 2.0

Citations

Generally, you can cite this repository as:

@software{chenghao_mou_2023_8364980,
  author       = {Chenghao Mou and
                  Chris Ha and
                  Kenneth Enevoldsen and
                  Peiyuan Liu},
  title        = {ChenghaoMou/text-dedup: Reference Snapshot},
  month        = sep,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {2023.09.20},
  doi          = {10.5281/zenodo.8364980},
  url          = {https://doi.org/10.5281/zenodo.8364980}
}

The Spark version was born out of BigCode (Apache 2.0) and BigScience (Apache 2.0), and you can cite the original paper if you want:

@article{kocetkov2023the,
  title={The Stack: 3 {TB} of permissively licensed source code},
  author={Denis Kocetkov and Raymond Li and Loubna Ben allal and Jia LI and Chenghao Mou and Yacine Jernite and Margaret Mitchell and Carlos Mu{\~n}oz Ferrandis and Sean Hughes and Thomas Wolf and Dzmitry Bahdanau and Leandro Von Werra and Harm de Vries},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2023},
  url={https://openreview.net/forum?id=pxpbTdUEpD},
  note={}
}

Footnotes

  1. Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings

  2. RETSim: Resilient and Efficient Text Similarity

  3. Noise-Robust De-Duplication at Scale

text-dedup's People

Contributors

chenghaomou · chris-ha458 · dependabot[bot] · hank0626 · kennethenevoldsen · louisowen6 · wangcho2k


text-dedup's Issues

Can the suffix_array here be used for Chinese encoded in UTF-8?

https://github.com/google-research/deduplicate-text-datasets states, "Second, we don't want UTF8 strings. Everything is a [u8] byte array, because we might be working over token sequences which aren't valid UTF8."

However, in my usage, regardless of how I adjust the value of k (30, 100, 150, 1200), many Chinese documents are reduced to very small lengths and lose their semantic meaning completely. I'm not sure if this is related to the issue mentioned earlier.

Out of memory on Spark

Thanks for your great work first!

Recently, I deduplicated some Chinese books (~2000 books, about 10 GB). I adopted the jieba tokenizer, but Spark throws an out-of-memory error at the groupBy statement. I increased the executor memory to 65 GB, but that did not help. Could you help me figure out where the memory cost is highest? Thanks!

Doubts about the multiprocessing start method and behavior

First of all, thanks for the developers' time in creating such a useful tool.
I am trying to modify text_dedup/minhash.py and use it for deduplication of a ~800GB dataset, with each document varying from 5GB to 50GB in size. The detailed code-running environment is as follows.

CPU: 128-cores
Memory: 2TB
OS: Ubuntu 20.04.5 LTS
Kernel: 3.10.0-1160.el7.x86_64
User: root
Python version: 3.8.10

And I run the code with the following command (I modified the data-loading code so that it can handle JSON files directly):

python -m text_dedup.custom_minhash --path /path/to/src --output /path/to/dst --column text --ngram 7 --min_length 7 --num_perm 64 --threshold 0.85 --local

I found that when the code ran into certain steps, such as fingerprinting, it seemed that only one process was working, as I could only see one core at 100% load at a time. This step could take an extremely long time, about 3 hours. I tried to search the multiprocessing documentation and other related sources, but I cannot figure out why. I also tried to change the start method with multiprocessing.set_start_method, but that did not work either.

So I am wondering: is this an expected situation, or is there something going wrong? And why did you use fork as the multiprocessing start method? Any special consideration?

Appreciate your time and patience!

Load data from several workers with minhash_pyspark.py

Hi, I'm setting up a local Spark cluster, but my data is too large to be stored on a single machine in the cluster for PySpark to load and process later.

My data consists of several small subsets, and each subset is small enough to be stored on one machine. I wonder whether PySpark can load different data stored on different worker machines, then collect and process it?

I am also thinking of another option, which is to deduplicate each part of the data separately, then merge the parts and continue deduplicating. I think that since we use MinHash, this will probably still produce a result equivalent to deduplicating all the data at once. What do you think about this option?

Thank you!

Failed to install using `pip install text-dedup`, but succeeded using `pip install -e .`

Hi,
First, thank you for a great library :-)

When I installed text-dedup using pip install, it failed while trying to uninstall scipy 1.12.0 and install scipy==1.9.1.
Strange errors popped up mentioning meson and compilation.

While digging into the cause of the error, I found that the repo's main branch was updated AFTER text-dedup 0.3.1 was released.
So I cloned it and installed it using pip install -e . without any issues.

I think text-dedup on PyPI should be updated :-)

Best regards,
Han-Cheol

Issue with NON_ALPHA Regular Expression for Multilingual Support

The current NON_ALPHA = re.compile("[^A-Za-z]") is not adequate for multilingual support. For example, NON_ALPHA.split("你好") returns ['', '', ''] instead of the expected ['你好']. I recommend using re.compile("\W", re.UNICODE) to accommodate all languages properly.
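
A quick check in plain Python (not from the repo) illustrates the difference between the two patterns:

import re

NON_ALPHA = re.compile("[^A-Za-z]")       # current pattern
NON_WORD = re.compile(r"\W", re.UNICODE)  # proposed pattern

print(NON_ALPHA.split("你好"))         # ['', '', ''] -- the CJK text is split away entirely
print(NON_WORD.split("你好 world"))    # ['你好', 'world'] -- Unicode word characters survive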

Consistently seeing more rows being dropped in minhash_spark.py compared to minhash.py

Hi @ChenghaoMou ,

I have been using minhash_spark.py via GCP Dataproc (with all the BigCode-specific code removed) to deduplicate my multilingual dataset. To get an understanding of the reproducibility of the result, I also deduped the same multilingual dataset using minhash.py.

Currently, deduplication is performed on one individual language at a time.

When I ran this for the first language, I found that minhash.py retained around 15-20% more documents than minhash_spark.py: the minhash_spark.py output had around ~12M documents and the minhash.py output had around ~14.5M documents.

In #28, you mentioned that for the same algorithm, although which documents get removed is random, the number of documents removed is the same. But I am seeing different behaviour.

To validate my observation, I ran the deduplication over the rest of the language subsets and found that more documents were being dropped by minhash_spark.py.

It would be great if you could help me better understand this by answering a few questions:

  1. Does the connected-components approach used in minhash_spark.py create different clusters than the union-find approach used in minhash.py?
  2. If the number of clusters is the same, shouldn't the number of samples in the output of both scripts be the same?
  3. Is running the scripts on different machines responsible for this behaviour? If yes, what is the reason for it?

I would be grateful if you could share any info, apart from answers to the above questions, that could help me troubleshoot this behaviour!

Thanks.

Suffix Array consumed time

Hi, there

Background: I want to deduplicate a large volume of data with the suffix array algorithm.

Progress:

  • My raw dataset size is about 465 GB (temp_text.txt disk size).
  • Following the code without any modification, the output of the first step of "Suffix Array" is temp_text.txt.part.{string_num}-{next_chunk_string_num} and temp_text.txt.part.{string_num}-{next_chunk_string_num}.table.bin.
  • After constructing the binary table data, which is the first step of the suffix array, the suffix-tree merging is now in progress.

Problem:

  • I found that merging the suffix trees is taking an unexpectedly long time. The first step took less than 12 hours, but the merging step has been running for more than 3 days, and by my estimate it needs 12 more days. What's more, there are more steps (e.g., merging individual tables, self-similar, restore, deduplicate) to complete the whole process, and I can't guess their expected time.

Question:

  • How much time is required to deduplicate 465 GB of data with the provided suffix array algorithm?
  • Is there anything I can try to reduce the expected completion time?

Thank you in advance for sharing the sources.
Cheers!

Question about GoogleSuffixArrayDeduplicator

Hi, this repo makes a lot of deduplication code very easy to use; I'm very grateful for it!

I just noticed that the suffix array is built with this command:

https://github.com/ChenghaoMou/text-dedup/blob/main/text_dedup/exact_dedup/suffix_array/suffix_array_google.py#L60

But right above that command, the suffix array is also built with this command:

https://github.com/ChenghaoMou/text-dedup/blob/main/text_dedup/exact_dedup/suffix_array/suffix_array_google.py#L57

I could be misunderstanding the code, but if lines 60 and 57 actually do the same thing, it would be good to remove line 57, because line 57 tries to load the whole suffix array into memory, which fails for large datasets.

Thanks!

New release

I've noticed that the latest release doesn't have the suffix array changes made in this commit. Do you plan to make a new pip release?

Thanks once again for the great work!

Python 3.9 compatibility

Is Python 3.9 going to be supported? When I try to install from source with a lower Python version, it says it only supports Python >=3.9, <3.12. But then, after installing it and running with Python 3.9, it seems that some idioms are from higher versions and the interface crashes like this:

python -m text_dedup.simhash -h
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 188, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib/python3.9/runpy.py", line 158, in _get_module_details
    code = loader.get_code(mod_name)
  File "<frozen importlib._bootstrap_external>", line 988, in get_code
  File "<frozen importlib._bootstrap_external>", line 918, in source_to_code
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/work/user/text-dedup/text_dedup/minhash.py", line 170
    match args.hash_func:
          ^
SyntaxError: invalid syntax
python -m text_dedup.minhash -h
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/work/user/text-dedup/text_dedup/minhash.py", line 34, in <module>
    from text_dedup.utils.analysis import optimal_param
  File "/work/user/text-dedup/text_dedup/utils/analysis.py", line 13, in <module>
    doc1: str | List[str],
TypeError: unsupported operand type(s) for |: 'type' and '_GenericAlias'

It would be helpful to have the option to run with at least Python 3.9 or even 3.8. Python 3.10 is still too new; it's not available in some environments, and sometimes the user does not have a choice.

Can we accelerate the groupByKey operation by md5 hashing for the Minhash spark version?

Thanks for your great efforts for this excellent work first!

Recently, we have adopted this tool in our research. In detail, we use the MinHash algorithm in the Spark version. Unfortunately, we found the groupByKey operations in lines 80 and 103 to be time-consuming. After diving into your code, we had the idea of replacing the tuple-format key with an MD5 hash string. For example, we replace the following code (line 259 of minhash_spark.py)

return [(band_idx, H, idx) for band_idx, H in enumerate(Hs)]

into

return [(calculate_md5('{}, {}'.format(band_idx, H)), idx) for band_idx, H in enumerate(Hs)]

We have empirically found that this MD5 string replacement speeds things up a lot. However, we cannot be sure whether the change harms the deduplication workflow, so we are raising an issue here to verify it. Could you help us? Thanks a lot.
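
For reference, calculate_md5 is not defined in the snippet above; a minimal sketch of what such a helper might look like (an assumption on my part, using hashlib) is:

import hashlib

def calculate_md5(key: str) -> str:
    """Hypothetical helper: hex MD5 digest of a stringified band key."""
    return hashlib.md5(key.encode("utf-8")).hexdigest()

# Proposed key change: replace the (band_idx, H) tuple with its MD5 string, e.g.
# return [(calculate_md5("{}, {}".format(band_idx, H)), idx) for band_idx, H in enumerate(Hs)]

In principle, as long as the same stringification and hash are applied consistently before the groupByKey, the main correctness risk is a hash collision merging unrelated bands, which is negligible for a 128-bit digest at this scale.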

refactor hash related code

This code base utilizes several different kinds of hashes (MD5, SHA-1, xxhash, SHA-256).
Some are implemented in redundant ways.
For instance, sha1_hash32 in minhash_spark.py and sha1_hash (with default d=32) in minhash.py seem to achieve the same thing.
I think refactoring this along the lines of hashfunc.py in ekzhu/datasketch could be helpful.

If I prepared a PR, would you consider it?
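
As an illustration of the suggestion (a sketch in the spirit of datasketch's hashfunc.py, assuming callers mostly need 32- or 64-bit digests; not the repository's actual API):

import hashlib
import struct

def sha1_hash(data: bytes, d: int = 32) -> int:
    """Return the first d bits of the SHA-1 digest of `data` as an integer."""
    if d == 32:
        return struct.unpack("<I", hashlib.sha1(data).digest()[:4])[0]
    if d == 64:
        return struct.unpack("<Q", hashlib.sha1(data).digest()[:8])[0]
    # Generic fallback for other widths (d must be a multiple of 8 here).
    return int.from_bytes(hashlib.sha1(data).digest()[: d // 8], "little")

A single parameterized function like this could then back both sha1_hash32 in minhash_spark.py and sha1_hash in minhash.py.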

How to get duplicates cluster ids?

Hi, my use case is that I would not only like to remove duplicates from a dataset but also do some analytics on what was clustered as duplicates. So in the result I would like to have a table with columns example_id, cluster_id. Is this possible with the current code? If not, what would be the best place to add that feature?
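
This is not something the current scripts expose, but a minimal sketch of how such a table could be built from the union-find structure that minhash.py populates, assuming the union-find object provides a find(idx) method as typical implementations do:

import pandas as pd

def cluster_table(uf, num_examples: int) -> pd.DataFrame:
    """Map every example id to the id of its cluster root."""
    rows = [{"example_id": idx, "cluster_id": uf.find(idx)} for idx in range(num_examples)]
    return pd.DataFrame(rows)

Rows whose cluster_id equals their own example_id are exactly the representatives the current filtering step keeps.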

Suffix array collect src/main.rs:174 assertion failed: input.len() % size_width == 0

Cleaning up                                                                                                                                                                                                            
    Finished dev [optimized + debuginfo] target(s) in 3.32s                                                                                                                                                            
     Running `target/debug/dedup_dataset self-similar --data-file output/temp_text.txt --length-threshold 100 --cache-dir ./tmp/cache --num-threads 128`                                                               
Start load!                                                                                                                                                                                                            
0 / 60318160                                                                                                                                                                                                           
3093441 / 60318161                                                                                                                                                                                                     
9500721 / 60318161                                                                                                                                                                                                     
19001441 / 60318161                                                                                                                                                                                                    
28502161 / 60318161                                                                                                                                                                                                    
34909441 / 60318161                                                                                                                                                                                                    
44410161 / 60318161                                                                                                                                                                                                    
53910881 / 60318161                                                                                                                                                                                                    
Duplicates found: 191378774                                                                                                                                                                                            
Total time taken: 70972ms                                                                                                                                                                                              
    Finished dev [optimized + debuginfo] target(s) in 0.89s                                                                                                                                                            
     Running `target/debug/dedup_dataset collect --data-file output/temp_text.txt --length-threshold 100 --cache-dir ./tmp/cache`                                                                                      
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5                                                                                                                    
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace                                                                                                                                          
    (the same assertion-failure panic is repeated by every worker thread)
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Any { .. }', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-0.3.2/src/scoped.rs:34:43

The dataset is C4. The code worked well on a small sample, but when I scaled it up to 7.5 GB, this error occurred. I don't know why this happened.
Another question: I have tried a Chinese dataset called WuDao (200 GB), and it always gives the warning "There is a match longer than 50,000,000 bytes." I don't know whether this is the dataset's problem or the code's problem.
Thanks for your help.

MinHash dedup parameters

Hey, how do the following arguments for MinHash:

python -m text_dedup.minhash \
    --ngram 1 \
    --num_perm 128 \
    --threshold 0.8 \

relate to the parameters of the Lee et al. paper? In particular, is num_perm the k parameter of Appendix A? (how do you set r and b then?)

If one wanted to have the exact parameters that this paper used, is there an example somewhere?
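
For what it's worth, here is a sketch of the standard LSH parameter search used by MinHash implementations (the same idea as optimal_param in datasketch and in text_dedup.utils.analysis): num_perm is the total number of permutations, and b (bands) and r (rows per band) are chosen with b * r <= num_perm to minimize the weighted false-positive/false-negative areas of the S-curve for the chosen threshold. The exact mapping to the Lee et al. notation should be confirmed against the paper.

from scipy.integrate import quad

def optimal_param(threshold: float, num_perm: int, fp_weight: float = 0.5, fn_weight: float = 0.5):
    """Pick (b, r) with b * r <= num_perm minimizing weighted FP/FN areas."""

    def false_positive_area(b: int, r: int) -> float:
        # Pairs below the threshold that still become candidates.
        area, _ = quad(lambda s: 1 - (1 - s**r) ** b, 0.0, threshold)
        return area

    def false_negative_area(b: int, r: int) -> float:
        # Pairs above the threshold that are missed.
        area, _ = quad(lambda s: (1 - s**r) ** b, threshold, 1.0)
        return area

    best, best_error = (1, 1), float("inf")
    for b in range(1, num_perm + 1):
        for r in range(1, num_perm // b + 1):
            error = fp_weight * false_positive_area(b, r) + fn_weight * false_negative_area(b, r)
            if error < best_error:
                best, best_error = (b, r), error
    return best

print(optimal_param(threshold=0.8, num_perm=128))  # the (b, r) pair this search selects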

Suffix Array Substring Exact Deduplication

I would like to deduplicate a dataset which is stored in a CSV file ("/foo_dir/foo.csv"). I was wondering how to pass the arguments of text_dedup.suffix_array, particularly --google_repo_path. Please find the dummy dataset below.

import pandas as pd

df = pd.DataFrame({
    'food_name': ['Steak and lobster', 'Steak and crab', 'Steak and lobster', 'smoked meat', 'chickpea veggie burger', 'veggie burger beyond the meat', 'Penne With Tomato Sauce'],
    'resturant_address': ['New York', 'NYC', 'Toronto', 'Montreal', 'Toronto', 'LA', 'Milan']
})

target column: food_name 

Thank you!
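
Not an official recipe, but one way to feed a CSV into the script is to convert it to a Hugging Face dataset saved on disk first; the sketch below makes that assumption (flag behaviour may differ between versions), and --google_repo_path should point to a local clone of google-research/deduplicate-text-datasets with its Rust binary built.

import pandas as pd
from datasets import Dataset

# Load the CSV and save it as a Hugging Face dataset on disk.
df = pd.read_csv("/foo_dir/foo.csv")
Dataset.from_pandas(df).save_to_disk("/foo_dir/foo_hf")

# The saved path can then be passed to the CLI via --path (together with --local,
# as discussed in other issues), with --column "food_name" and --google_repo_path
# pointing at the deduplicate-text-datasets clone.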

Little refactor to allow imports from python instead of cli/subprocess

Currently, there is no real way to import a deduplication algorithm and use it as a dependency in my Python code without almost completely rewriting the content of main (the code under if __name__ == "__main__"). I think it would be beneficial to simply extract the logic into something like a main() function, and then in the real "main" just construct the parser and pass the parsed parameters to main().
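
A minimal sketch of the shape being proposed (assumed names, not the current code):

import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # ... existing argument definitions ...
    return parser

def main(args: argparse.Namespace) -> None:
    # ... existing deduplication logic, driven only by `args` ...
    ...

if __name__ == "__main__":
    main(build_parser().parse_args())

With that split, library users could construct an argparse.Namespace (or their own config object) and call main() directly, without going through the CLI or a subprocess.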

minhash_spark.py [UNABLE_TO_INFER_SCHEMA]

When I use a Spark cluster to execute minhash_spark.py, I occasionally encounter [UNABLE_TO_INFER_SCHEMA] errors, as shown in the screenshots below. I don't know if it's a problem with the data, because workers need to copy data to different machines. Files that error out can run normally after retransmission, but errors may also occur again after a period of time. I don't know whether the file movement or reading has an impact on Spark. I have now set up an NFS server, which ensures that the files read by each worker are consistent, but the problem still occurs. Can you help me analyze where the problem lies?

(screenshots of the [UNABLE_TO_INFER_SCHEMA] error omitted)

the effect of min_length in minhash_spark.py/minhash.py

Appreciate your efforts on this excellent work!
I found that an n-gram encoding step is applied before hashing in minhash_spark.py. If the length of a doc is below min_length, it outputs an empty list. After hashing, that becomes a list of MAX_HASH values, which means all the short docs have the same signature after generate_hash_values. Will they be deduplicated down to only one in the end (based only on length, regardless of content)?
Looking forward to your reply~

no module named numpy._typing

When I run minhash_spark.py with spark-submit, the program exits with this error. I searched for the keyword in the whole project and did not find any related code, and I also could not find any similar situation on the internet. Do you know what causes this bug? Thanks!
My numpy version is 1.22.3.

Should min_hash and min_hash_spark return the same result?

Hi, thank you for your great work!

As the title says, I wonder whether these two scripts return the same result.

I tested both scripts with the default configuration on the same dataset (OSCAR 2201 - gl), but their output datasets are different. The outputs have the same number of rows after removing duplicates, but the content is different.

Also, the number of rows I get after deduplication is different from your example. But maybe that isn't a problem, because I tried the code on several different machines and it always outputs the same number of rows.

Deduplication of union find clusters explained

Hi, I've been reading your MinHash deduplication code to find differences from my implementations and ideas for optimization. Is the filtering step using the UnionFind parents to keep non-duplicates and one sample of each cluster? That is what I understand from here:

function=lambda record, idx: record["__cluster__"] == idx,

At least, that is what I've been doing in my code: filter out every sample id that the union-find structure says is not a parent.

Also, may I ask why sharding is needed? Is the HF Dataset class able to unload shards from memory and only allocate the current one?

for i in tqdm(
    range(0, NUM_SHARDS),
    dynamic_ncols=True,
    desc="Iterating MinHashes...",  # noqa: E501
):
    embedded_shard = embedded.shard(
        num_shards=NUM_SHARDS, index=i, contiguous=True, writer_batch_size=args.batch_size
    )
    for key, Hs in zip(embedded_shard["__id__"], embedded_shard["__signatures__"]):
        for i, H in enumerate(Hs):
            HASH_TABLES[i][H].add(key)

Is that minhash oom error normal ?

I get this error when using the minhash program here.

Iterating MinHashes...: 17%|โ–ˆโ–‹ | 342/1982 [2:04:34<4:07:05, 9.04s/it]

python -m text_dedup.minhash --batch_size 10000 --column "text" --num_perm 9000 --b 450 --r 20

error: Detected 1 oom-kill event(s) in step 2008.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

I fed in a 65 GB Chinese JSONL file with 19,816,925 documents, using 700 GB of memory and 70 CPU cores, but got this error.

When the file size doubles to 130 GB, the error appears at Iterating MinHashes...: 9%, and setting batch_size to 5000 doesn't change that 9% failure point.

According to this calculation, 700 GB of memory can only handle about 10 GB of data. I wonder if this is normal?

the ngram setting of minhash

I built a very small dataset to test MinHash deduplication; the data is as follows:

{'text': 'Farm to Market Road 1506 (FM 1506) is located in Lamar County.'}
{'text': 'Farm to Market Road 1507 (FM 1507) is located in Lamar County.'}
{'text': 'Farm to Market Road 1508 (FM 1508) is located in Lamar County.'}
{'text': 'Farm to Market Road 1514 (FM 1514) is located in San Jacinto County.'}
{'text': 'Farm to Market Road 1511 (FM 1511) is located in Leon County.'}
{'text': 'Farm to Market Road 1512 (FM 1512) is located in Leon County.'}
{'text': 'Farm to Market Road 1513 (FM 1513) is located in Rusk County.'}
{'text': 'Farm to Market Road 1503 (FM 1503) is located in Lamar County.'}
{'text': 'Farm to Market Road 1504 (FM 1504) is located in Van Zandt County.'}
{'text': 'Farm to Market Road 1505 (FM 1505) was located in El Paso County.'}

When using the default settings, the result is the same as the input. It only started removing duplicates once I set ngram to 2.

{'text': 'Farm to Market Road 1506 (FM 1506) is located in Lamar County.'}
{'text': 'Farm to Market Road 1508 (FM 1508) is located in Lamar County.'}
{'text': 'Farm to Market Road 1514 (FM 1514) is located in San Jacinto County.'}
{'text': 'Farm to Market Road 1511 (FM 1511) is located in Leon County.'}
{'text': 'Farm to Market Road 1513 (FM 1513) is located in Rusk County.'}
{'text': 'Farm to Market Road 1504 (FM 1504) is located in Van Zandt County.'}
{'text': 'Farm to Market Road 1505 (FM 1505) was located in El Paso County.'}

Is setting ngram to 2 reasonable for most cases? Are there other details I haven't noticed?
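
A quick back-of-the-envelope check in plain Python shows why: with the default 5-gram shingles, almost every shingle contains the differing road number, so the Jaccard similarity between two of these sentences is far below any reasonable threshold, while 1- or 2-gram shingles overlap heavily.

def word_ngrams(text: str, n: int) -> set:
    tokens = text.split()
    return {" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

s1 = "Farm to Market Road 1506 (FM 1506) is located in Lamar County."
s2 = "Farm to Market Road 1507 (FM 1507) is located in Lamar County."

for n in (5, 2, 1):
    print(n, round(jaccard(word_ngrams(s1, n), word_ngrams(s2, n)), 2))

So for very short, templated records like these, a smaller ngram (or a lower threshold) is needed; for ordinary web-length documents the default is usually a reasonable choice.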

How is it possible to control the deduplication threshold?

Hi

I would like to deduplicate a column of a dataset stored in a CSV file. I am wondering whether we could access/manage the potential argument(s) of text-dedup, in particular version 0.0.12 (https://pypi.org/project/text-dedup/0.0.12/, with sentence-transformers and Annoy), to control the threshold.

For example, if the cosine similarity (which I think that package uses) between two embeddings is less than or greater than 0.9, accept or reject the pair as a duplicate.

Thanks for any clarification and comments on this matter.

many duplicate pairs were not actually similar using minhash_spark.py

minhash_spark.py

When I used this code to process English CC data, I found that many duplicate pairs were not actually similar, although genuinely similar pairs could indeed be captured. Why is this? I have tried many sets of parameters (ngram_size, B, R) and the result is the same. Is there an optimal parameter recommendation?
In addition, when I processed Chinese CC data, I did not encounter this situation.
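
Two effects could explain this (an illustrative sketch, not a definitive diagnosis): LSH is probabilistic, so a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^R)^B and some low-similarity pairs always slip through; and the connected-components step merges pairs transitively, so two documents in the same cluster need not be directly similar to each other. The S-curve for the B=25, R=10 setting shown in the README log:

def candidate_probability(s: float, b: int, r: int) -> float:
    """Probability that a pair with Jaccard similarity s shares at least one LSH band."""
    return 1 - (1 - s**r) ** b

for s in (0.3, 0.5, 0.7, 0.9):
    print(f"s={s}: {candidate_probability(s, b=25, r=10):.4f}")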

Suffix array clean up

I think that here, instead of removing slices from the string (character) array, we should remove them from the byte array, since the suffix array deduplication code returns byte-offset pairs. Since a UTF-8 character can be represented by 1-4 bytes, this can lead to errors with text that has at least one character wider than 1 byte.

Maybe something like this

import numpy as np

def clean_up(text, remove_pairs):
    """
    Remove duplicate substrings from the text, using byte offsets.
    """
    byte_array = np.array(list(text.encode()))
    for start, end in remove_pairs:
        byte_array[start:end] = -1  # sentinel value; real byte values are 0-255
    return bytearray([byte for byte in byte_array if byte != -1]).decode("utf-8", "ignore")

Let me know if I am missing something regarding the suffix_array.py code that might already handle this.
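
A small self-contained demonstration (not from the repo) of the underlying mismatch: the Rust tooling reports byte offsets, and applying them as Python string indices drifts as soon as a multi-byte character appears.

text = "naïve café naïve café"
end = len("naïve café ".encode("utf-8"))   # 13 bytes, but only 11 characters

print(text.encode("utf-8")[end:].decode("utf-8"))  # byte-based removal: "naïve café"
print(text[end:])                                  # char-based removal: "ïve café" (wrong)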

Spark configurations

I have been trying to get the Spark script to work on an 800 GB dataset. I have tried many different configurations with no success so far; I stopped a few runs after getting logs like these:

2023-03-14 06:59:28 ERROR YarnScheduler:73 - Lost executor 31 on  Executor heartbeat timed out after 149802 ms
2023-03-14 06:59:28 WARN  TaskSetManager:69 - Lost task 5607.0 in stage 2.0 (TID 12007) ExecutorLostFailure (executor 31 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 149802 ms
2023-03-14 06:59:28 WARN  TaskSetManager:69 - Lost task 5735.0 in stage 2.0 (TID 12135)  ExecutorLostFailure (executor 31 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 149802 ms
2023-03-14 06:59:28 WARN  TaskSetManager:69 - Lost task 5732.0 in stage 2.0 (TID 12132)   ExecutorLostFailure (executor 31 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 149802 ms
2023-03-14 06:59:28 WARN  TaskSetManager:69 - Lost task 5677.0 in stage 2.0 (TID 12077)   ExecutorLostFailure (executor 31 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 149802 ms
2023-03-14 06:59:28 WARN  TaskSetManager:69 - Lost task 5608.0 in stage 2.0 (TID 12008)  ExecutorLostFailure (executor 31 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 149802 ms
2023-03-14 06:59:28 WARN  BlockManagerMasterEndpoint:69 - No more replicas available for rdd_11_3651 !
2023-03-14 06:59:28 WARN  BlockManagerMasterEndpoint:69 - No more replicas available for rdd_11_1209 !
2023-03-14 06:59:28 WARN  BlockManagerMasterEndpoint:69 - No more replicas available for rdd_11_5090 !
2023-03-14 06:59:28 WARN  BlockManagerMasterEndpoint:69 - No more replicas available for rdd_11_1879 !
2023-03-14 06:59:28 WARN  BlockManagerMasterEndpoint:69 - No more replicas available for rdd_11_411 !

Usually CPU utilization is very low and memory usage is at max:

(screenshots of cluster CPU and memory utilization omitted)

A worker node's utilization during the job:

(screenshot omitted)

Here is my cluster config:

1 driver node, 5 worker nodes all same machine type n1-standard-32.

gcloud dataproc clusters create $CLUSTER_NAME \
    --enable-component-gateway \
    --region us-west1 \
    --zone us-west1-a \
    --master-machine-type n1-standard-32 \
    --master-boot-disk-size 500 \
    --num-workers 5 \
    --worker-machine-type n1-standard-32 \
    --worker-boot-disk-size 500 \
    --image-version 2.0-debian10 \
    --project $PROJECT_ID \
    --max-idle 15m

Here is the latest job config:

From what I read online spark.executor.cores=5 is the recommended setting and I calculated the remaining based on that.

gcloud dataproc jobs submit pyspark --cluster ${CLUSTER_NAME} \
    --driver-log-levels root=WARN \
    --properties="spark.executor.memory"="17g","spark.driver.memory"="17g","spark.executor.cores"="5","spark.driver.cores"=5 \
    neardup_e2e_v2.py -- \
    --output_dir <some input> \
    --input_dir <some input>  \
    --file_pattern <some input>  \
    --threshold 0.3 

Note: I am reading uncompressed json line files from GCS same zone as the cluster.

I assume you have experience running this code on ~1 TB datasets, so what would be your suggestions? Thanks!

How to deduplicate a dataset that was already saved with save_to_disk?

Hey @ChenghaoMou, thanks for building such an amazing tool.

Could you please link to the documentation of text-dedup?

I am looking for a way to deduplicate a dataset that I saved with save_to_disk and am loading with load_from_disk. I see there is a --local flag to be used, but I could not really make this work.

suffix array "No such file or directory" exception

Hi,

I read the RefinedWeb dataset article from the Falcon model and I want to follow the same approach. To do this, I preprocessed my data with some rules, and after that I used MinHash Near Deduplication from your repo. It reduced the dataset like this:

[07/15/23 13:45:54] INFO     Loading                         : 496.09s                                        minhash.py:330
                    INFO     MinHashing                      : 800.25s                                        minhash.py:330
                    INFO     Clustering                      : 554.21s                                        minhash.py:330
                    INFO     Filtering                       : 387.45s                                        minhash.py:330
                    INFO     Saving                          : 50.34s                                         minhash.py:330
                    INFO     Total                           : 2288.34s                                       minhash.py:330
                    INFO     Before                          : 6316662                                        minhash.py:332
                    INFO     After                           : 4530959                                        minhash.py:333

As suggested in the RefinedWeb article, the next step is exact deduplication with suffix arrays. For that, I want to use Suffix Array Substring Exact Deduplication with the following command:

python -m text_dedup.suffix_array \
    --path "output/minhash/oscar_tr_dedup" \
    --local \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/suffix_array/oscar_tr_dedup" \
    --column "filtered_content" \
    --google_repo_path "/data/dedup_for_llm/text-dedup/deduplicate-text-datasets"

Note: I modified the code here to add a local option, like here.

After running the text_dedup.suffix_array command, the process throws the exception below:

[07/17/23 12:05:51] INFO     Loading dataset...                                                          suffix_array.py:305
[07/17/23 12:05:52] INFO     Loading dataset... Done                                                     suffix_array.py:319
                    INFO     Started to preprocessing...                                                 suffix_array.py:322
[07/17/23 12:11:31] INFO     Started to preprocessing... Done                                            suffix_array.py:332
                    INFO     Started to suffix array...                                                  suffix_array.py:335
Data size: 19152440938
Total jobs: 100
Jobs at once: 20
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 0 --end-byte 191624409
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 191524409 --end-byte 383148818
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 383048818 --end-byte 574673227
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 574573227 --end-byte 766197636
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 766097636 --end-byte 957722045
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 957622045 --end-byte 1149246454
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 1149146454 --end-byte 1340770863
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 1340670863 --end-byte 1532295272
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 1532195272 --end-byte 1723819681
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 1723719681 --end-byte 1915344090
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 1915244090 --end-byte 2106868499
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 2106768499 --end-byte 2298392908
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 2298292908 --end-byte 2489917317
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 2489817317 --end-byte 2681441726
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 2681341726 --end-byte 2872966135
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 2872866135 --end-byte 3064490544
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 3064390544 --end-byte 3256014953
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 3255914953 --end-byte 3447539362
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 3447439362 --end-byte 3639063771
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 3638963771 --end-byte 3830588180
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
Waiting for jobs to finish
[... the same "make-part" / "No such file or directory" pair repeats for every remaining byte range ...]
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 18960916491 --end-byte 19152440938
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
Waiting for jobs to finish
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
Checking all wrote correctly
>>>>: output/temp_text.txt.part.0-191624409
Traceback (most recent call last):
  File "/data/dedup_for_llm/text-dedup/deduplicate-text-datasets/scripts/make_suffix_array.py", line 71, in <module>
    size_data = os.path.getsize(x)
  File "/data/miniconda3/lib/python3.9/genericpath.py", line 50, in getsize
    return os.stat(filename).st_size
FileNotFoundError: [Errno 2] No such file or directory: 'output/temp_text.txt.part.0-191624409'
Traceback (most recent call last):
  File "/data/miniconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/miniconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/dedup_for_llm/text-dedup/text_dedup/suffix_array.py", line 378, in <module>
    ds.save_to_disk(args.output)
  File "/data/dedup_for_llm/text-dedup/text_dedup/utils/timer.py", line 19, in __exit__
    raise exc_val
  File "/data/dedup_for_llm/text-dedup/text_dedup/suffix_array.py", line 340, in <module>
    logger.info("Started to suffix array... Done")
  File "/data/dedup_for_llm/text-dedup/text_dedup/utils/timer.py", line 19, in __exit__
    raise exc_val
  File "/data/dedup_for_llm/text-dedup/text_dedup/suffix_array.py", line 336, in <module>
    __run_command(
  File "/data/dedup_for_llm/text-dedup/text_dedup/suffix_array.py", line 247, in __run_command
    raise RuntimeError(f"Command {cmd} failed with code {code}. CWD: {cwd}")
RuntimeError: Command python scripts/make_suffix_array.py output/temp_text.txt failed with code 1. CWD: /data/dedup_for_llm/text-dedup/deduplicate-text-datasets

Based on the error, it couldn't find the file ("FileNotFoundError: [Errno 2] No such file or directory: 'output/temp_text.txt.part.0-191624409'"). I checked several times, but no other files were created except "output/temp.txt".

I couldn't figure out the solution. Could you help me?
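
A likely cause, judging from the log above, is that the Rust binary inside the deduplicate-text-datasets checkout was never compiled: every make-part call fails with "No such file or directory", so no part files are written and the later os.path.getsize call fails. A minimal sketch of the fix, assuming the Rust toolchain (cargo) is installed and the checkout is the directory passed via --google_repo_path:

# Build the missing dedup_dataset binary once before re-running
# text_dedup.suffix_array; this is equivalent to running `cargo build`
# by hand inside the deduplicate-text-datasets directory.
import subprocess

subprocess.run(["cargo", "build"], cwd="deduplicate-text-datasets", check=True)
# After a successful build, ./target/debug/dedup_dataset exists in that checkout.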

Question about code of spark.py

Thanks for the great code. I have some trouble understanding the following code:

a = edges
while True:
    b = a.flatMap(large_star_map).groupByKey().flatMap(large_star_reduce).distinct().cache()
    a = b.map(small_star_map).groupByKey().flatMap(small_star_reduce).distinct().cache()
    changes = a.subtract(b).union(b.subtract(a)).collect()
    if len(changes) == 0:
        break

Is this code used to cluster all the documents using the generated edges (which indicate duplication) and then remove the duplicated docs? If so, is the variable a the texts after dedup?
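
For what it is worth, the loop above is the large-star/small-star connected-components iteration: `a` never holds texts, only pairs of document ids, and at convergence each duplicate document id is (roughly speaking) paired with a representative id, typically the smallest one, of its duplicate cluster. The actual removal happens afterwards by keeping one document per cluster. A minimal, self-contained sketch of what that final assignment looks like, using a plain-Python union-find rather than the repository's Spark code:

# Illustrative only: computes the same kind of (node -> smallest id in its
# component) assignment that the Spark loop converges to.
def component_assignment(edges):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            # Attach the larger root under the smaller one, so every root
            # stays the minimum id of its component.
            parent[max(rx, ry)] = min(rx, ry)

    for u, v in edges:
        union(u, v)
    # One entry per non-minimum node, mirroring the (node, min) pairs in `a`.
    return {x: find(x) for x in parent if find(x) != x}


# Two duplicate clusters: {1, 2, 3, 4} and {7, 8}.
assignment = component_assignment([(1, 2), (2, 3), (3, 4), (7, 8)])
print(assignment)            # {2: 1, 3: 1, 4: 1, 8: 7}
to_remove = set(assignment)  # drop these ids; one document per cluster survives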

Papers, Datasets that use this repo

Recently, the paper for CulturaX, a dataset built using text-dedup, was released.

I am sure there are many more, including The Stack.

Would it be valuable to mention these kinds of use cases in the README.md?
I'll prepare a PR if there is interest.

PySpark without DataProc

First of all, I'd like to thank you for that amazing tool!

I got it running with the Python version, but unfortunately it got stuck at 94%; I think the corpus is too large.
Thus, I would like to recompute it with the Spark version, but on a standalone server rather than in a DataProc environment. Could you explain how to do that? I ran into some errors, and I'm not a Java pro, so I'm stuck here:

python -m text_dedup.minhash_spark --input test-corpus/ --output output/spark


23/09/19 00:23:38 WARN Utils: Your hostname, WIN067 resolves to a loopback address: 127.0.1.1; using 172.25.109.83 instead (on interface eth0)
23/09/19 00:23:38 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/09/19 00:23:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/09/19 00:23:41 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Exception thrown in awaitResult: 
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:322)
        at org.apache.spark.util.ThreadUtils$.parmap(ThreadUtils.scala:396)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readParquetFootersInParallel(ParquetFileFormat.scala:422)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:472)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:464)
        at org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:79)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:853)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:853)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
        at org.apache.spark.scheduler.Task.run(Task.scala:139)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.spark.SparkException: [CANNOT_READ_FILE_FOOTER] Could not read footer for file: file:/home/scheible/git/github/text-dedup/test-corpus/de_dup.txt.
        at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFooterForFileError(QueryExecutionErrors.scala:1077)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:435)
        at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:393)
        at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
        at scala.util.Success.$anonfun$map$1(Try.scala:255)
        at scala.util.Success.map(Try.scala:213)
        at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
        at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
        at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
        at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
        at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Caused by: java.lang.RuntimeException: file:/home/scheible/git/github/text-dedup/test-corpus/de_dup.txt is not a Parquet file. Expected magic number at tail, but found [57, 32, 48, 46]
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:557)
        at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:777)
        at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:658)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:53)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:429)
        ... 14 more
23/09/19 00:23:41 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (172.25.109.83 executor driver): org.apache.spark.SparkException: Exception thrown in awaitResult: 
        [... identical stack trace as the ERROR above, ending in the same "is not a Parquet file. Expected magic number at tail, but found [57, 32, 48, 46]" cause ...]

23/09/19 00:23:41 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "/home/scheible/anaconda3/envs/dedup/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/scheible/anaconda3/envs/dedup/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/scheible/git/github/text-dedup/text_dedup/minhash_spark.py", line 403, in <module>
    df: DataFrame = spark.read.option("mergeSchema", "true").parquet(args.input)
  File "/home/scheible/anaconda3/envs/dedup/lib/python3.10/site-packages/pyspark/sql/readwriter.py", line 531, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "/home/scheible/anaconda3/envs/dedup/lib/python3.10/site-packages/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
  File "/home/scheible/anaconda3/envs/dedup/lib/python3.10/site-packages/pyspark/errors/exceptions/captured.py", line 169, in deco
    return f(*a, **kw)
  File "/home/scheible/anaconda3/envs/dedup/lib/python3.10/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o29.parquet.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (172.25.109.83 executor driver): org.apache.spark.SparkException: Exception thrown in awaitResult: 
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:322)
        at org.apache.spark.util.ThreadUtils$.parmap(ThreadUtils.scala:396)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readParquetFootersInParallel(ParquetFileFormat.scala:422)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:472)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:464)
        at org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:79)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:853)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:853)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
        at org.apache.spark.scheduler.Task.run(Task.scala:139)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.spark.SparkException: [CANNOT_READ_FILE_FOOTER] Could not read footer for file: file:/home/scheible/git/github/text-dedup/test-corpus/de_dup.txt.
        at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFooterForFileError(QueryExecutionErrors.scala:1077)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:435)
        at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:393)
        at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
        at scala.util.Success.$anonfun$map$1(Try.scala:255)
        at scala.util.Success.map(Try.scala:213)
        at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
        at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
        at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
        at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
        at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Caused by: java.lang.RuntimeException: file:/home/scheible/git/github/text-dedup/test-corpus/de_dup.txt is not a Parquet file. Expected magic number at tail, but found [57, 32, 48, 46]
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:557)
        at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:777)
        at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:658)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:53)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:429)
        ... 14 more

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2721)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2720)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2720)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1206)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1206)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1206)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2984)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2923)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2912)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:971)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2263)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2284)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2328)
        at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1019)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:405)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:1018)
        at org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.mergeSchemasInParallel(SchemaMergeUtils.scala:73)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(ParquetFileFormat.scala:476)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.inferSchema(ParquetUtils.scala:132)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.inferSchema(ParquetFileFormat.scala:78)
        at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$11(DataSource.scala:208)
        at scala.Option.orElse(Option.scala:447)
        at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:205)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:407)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:229)
        at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:211)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
        at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:563)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: 
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:322)
        at org.apache.spark.util.ThreadUtils$.parmap(ThreadUtils.scala:396)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readParquetFootersInParallel(ParquetFileFormat.scala:422)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:472)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:464)
        at org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:79)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:853)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:853)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
        at org.apache.spark.scheduler.Task.run(Task.scala:139)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        ... 1 more
Caused by: org.apache.spark.SparkException: [CANNOT_READ_FILE_FOOTER] Could not read footer for file: file:/home/scheible/git/github/text-dedup/test-corpus/de_dup.txt.
        at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFooterForFileError(QueryExecutionErrors.scala:1077)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:435)
        at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:393)
        at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
        at scala.util.Success.$anonfun$map$1(Try.scala:255)
        at scala.util.Success.map(Try.scala:213)
        at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
        at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
        at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
        at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
        at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Caused by: java.lang.RuntimeException: file:/home/scheible/git/github/text-dedup/test-corpus/de_dup.txt is not a Parquet file. Expected magic number at tail, but found [57, 32, 48, 46]
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:557)
        at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:777)
        at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:658)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:53)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:429)
        ... 14 more
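
For anyone hitting the same trace: the root cause is the final "Caused by" above, namely that minhash_spark.py reads its --input with spark.read.parquet while test-corpus/de_dup.txt is a plain-text file. A minimal sketch (hypothetical output path; requires pandas and pyarrow) of converting such a corpus into a Parquet directory the script can read:

# Convert a plain-text corpus (one document per line) into a Parquet file
# with a "text" column, which is what the Spark script expects to load.
from pathlib import Path

import pandas as pd

src = Path("test-corpus/de_dup.txt")  # the file from the report above
dst = Path("test-corpus-parquet")     # hypothetical output directory
dst.mkdir(parents=True, exist_ok=True)

with src.open(encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

pd.DataFrame({"text": lines}).to_parquet(dst / "part-00000.parquet", index=False)

The job can then be pointed at that directory, e.g. --input test-corpus-parquet, whether it is launched with spark-submit or with python -m text_dedup.minhash_spark on a standalone machine.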

boundaries of sub-strings

Hello!

I'm currently using the suffix array method on Persian-language text. However, in some examples the outcome of deduplication is not ideal when substrings are removed from the text: the boundaries of the removed strings fall in the middle of words, leaving broken and sometimes meaningless text. How can I rectify this issue?

one example (translated to english):

ORIGINAL: According to BBC and quoted by Currency, the dollar to ruble rate increased by 0.32% to 55.19 rubles and the euro decreased by 0.36% to 56.09 rubles.

RESULT AFTER DEDUP: o ruble rate increased by 0.32% to 55.19 rubles and the euro decreased by 0.36% to 56.09 rubles.
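
One mitigation worth trying (not something text-dedup does out of the box) is to widen each to-be-removed span to the nearest whitespace, so the cut removes the partial word entirely instead of leaving a fragment behind; for Persian, whitespace is only an approximation of word boundaries, so a proper tokenizer would be better still. A minimal sketch, with an invented helper name:

# Hypothetical helper: widen a removal span so both ends land on whitespace
# (or the edge of the text) before the substring is cut out.
def snap_to_word_boundaries(text: str, start: int, end: int) -> tuple[int, int]:
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end


text = "According to BBC and quoted by Currency, the dollar to ruble rate increased"
# Pretend the suffix array flagged a span starting mid-word at "o ruble ...".
start, end = snap_to_word_boundaries(text, text.index("o ruble"), len(text))
print(text[:start] + text[end:])  # "According to BBC and quoted by Currency, the dollar "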

FileNotFoundError: [Errno 2] No such file or directory: 'output/temp_text.txt.part.0-7320579'

When I tried to run text deduplication on my own local dataset, I got the error below.

python -m text_dedup.suffix_array --path "csv" --data_files train_enron_emails_dataset.csv --cache_dir './cache' --output 'output' --split 'train' --column 'text' --google_repo_path "deduplicate-text-datasets" --no-use_auth_token --local
Dataset({
    features: ['file', 'text'],
    num_rows: 10348
})
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 0 --end-byte 7320579
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 7220579 --end-byte 14541158
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 14441158 --end-byte 21761737
/bin/sh: 1: ./target/debug/dedup_dataset: not found
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 21661737 --end-byte 28882318
/bin/sh: 1: ./target/debug/dedup_dataset: not found
Waiting for jobs to finish
/bin/sh: 1: ./target/debug/dedup_dataset: not found
/bin/sh: 1: ./target/debug/dedup_dataset: not found
Checking all wrote correctly
Traceback (most recent call last):
  File "/root/text-dedup-main/deduplicate-text-datasets/scripts/make_suffix_array.py", line 66, in <module>
    size_data = os.path.getsize(x)
  File "/root/miniconda3/lib/python3.10/genericpath.py", line 50, in getsize
    return os.stat(filename).st_size
FileNotFoundError: [Errno 2] No such file or directory: 'output/temp_text.txt.part.0-7320579'
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/text-dedup-main/text_dedup/suffix_array.py", line 299, in <module>
    with timer("Total"):
  File "/root/text-dedup-main/text_dedup/utils/timer.py", line 19, in __exit__
    raise exc_val
  File "/root/text-dedup-main/text_dedup/suffix_array.py", line 324, in <module>
    with timer("SuffixArray"):
  File "/root/text-dedup-main/text_dedup/utils/timer.py", line 19, in __exit__
    raise exc_val
  File "/root/text-dedup-main/text_dedup/suffix_array.py", line 325, in <module>
    __run_command(
  File "/root/text-dedup-main/text_dedup/suffix_array.py", line 244, in __run_command
    raise RuntimeError(f"Command {cmd} failed with code {code}. CWD: {cwd}")
RuntimeError: Command python scripts/make_suffix_array.py output/temp_text.txt failed with code 1. CWD: deduplicate-text-datasets

I can't figure this out on my own. Can anyone help me solve this problem, and is there a tutorial for custom datasets such as .txt and .csv files?
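
On the custom-dataset part of the question: the --path / --data_files flags map onto Hugging Face datasets.load_dataset arguments, so local .csv and .txt files can be sanity-checked the same way outside the dedup script. A minimal sketch (my_corpus.txt is a made-up file name):

# Check that local .csv / .txt files load and expose the expected text column.
from datasets import load_dataset

csv_ds = load_dataset("csv", data_files="train_enron_emails_dataset.csv", split="train")
txt_ds = load_dataset("text", data_files="my_corpus.txt", split="train")  # one row per line
print(csv_ds.column_names)  # e.g. ['file', 'text'] for the Enron CSV above
print(txt_ds.column_names)  # ['text']

The "not found" errors above, however, point at the same missing Rust binary as in the earlier report: the deduplicate-text-datasets checkout passed via --google_repo_path needs to be compiled (cargo build) before the part files can be produced.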

FileNotFoundError when running the `Finding clusters` step

The run crashes at the `Finding clusters` step.

I found that the issue is caused by new_fingerprint=str(random.getrandbits(128)).

Iterating MinHashes...: 100%|████████████████████████████████████████| 392/392 [06:50<00:00,  1.05s/it]
Clustering...: 100%|████████████████████████████████████████| 25/25 [00:28<00:00,  1.14s/it]
multiprocess.pool.RemoteTraceback:                                                                                                                                                                 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1328, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/usr/local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3507, in _map_single
    if update_data:
FileNotFoundError: [Errno 2] No such file or directory: '/data/sample-txt/cache-185154854570279270480969661689785176288_00010_of_00048.arrow'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/work/minhash.py", line 192, in <module>
    with timer("Total"):
  File "/usr/local/lib/python3.10/site-packages/text_dedup/utils/timer.py", line 19, in __exit__
    raise exc_val
  File "/var/work/minhash.py", line 266, in <module>
    with timer("Filtering"):
  File "/usr/local/lib/python3.10/site-packages/text_dedup/utils/timer.py", line 19, in __exit__
    raise exc_val
  File "/var/work/minhash.py", line 270, in <module>
    ds = ds.map(
  File "/usr/local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 580, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 545, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3180, in map
    for rank, done, content in iflatmap_unordered(
  File "/usr/local/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1354, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/usr/local/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1354, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/usr/local/lib/python3.10/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
FileNotFoundError: [Errno 2] No such file or directory: '/data/sample-txt/cache-185154854570279270480969661689785176288_00010_of_00048.arrow'

Open up more avenues for discussion

It would be helpful if one or more of the following were opened up:

  • Wiki
  • Discussions
  • Projects

so that usage scenarios and future improvement plans can be discussed.

Such avenues would allow users, collaborators, and developers of this repo to discuss matters that are not strictly related to PRs, commits, bugs, or issues.
