I found that suffix_array.py removes ALL duplicated substrings instead of keeping ONE

duplicated substring removal in suffix_array.py about text-dedup HOT 1 CLOSED

chenghaomou commented on August 25, 2024

duplicated substring removal in suffix_array.py

from text-dedup.

Comments (1)

ChenghaoMou commented on August 25, 2024

This is actually what was used in the original paper. You can find this in their README at https://github.com/google-research/deduplicate-text-datasets:

In our paper we suggest just taking all of these duplicate sequences that have been identified and completely striking them from the dataset. This somewhat breaks the flow of text, for example if previously had an example "Alice wanted to go to the store" and we deduplicated at the level of 10 characters, we might completely strike " to go to the " and be left with "Alice wantedstore". In practice we have found this doesn't break the language model because we remove relatively little text, and so these breaks don't cause harm.

However, this might not hold for your own dataset, or you may have a different use case for those duplicated strings, in which case you will have to modify the code.
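To make the difference concrete, here is a minimal sketch of the two behaviours: striking every duplicated span (what the README describes) versus keeping the first occurrence of each duplicated substring. This is not the code in suffix_array.py. The helper names `merge_spans`, `strike_all`, and `keep_first` are hypothetical, and the sketch assumes the duplicates have already been found and are given as half-open `(start, end)` character offsets into a single string.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open [start, end) character offsets (assumed format)


def merge_spans(spans: List[Span]) -> List[Span]:
    """Sort spans and merge any that overlap or touch."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def strike_all(text: str, spans: List[Span]) -> str:
    """Remove every given span, including the first occurrence."""
    out = []
    cursor = 0
    for start, end in merge_spans(spans):
        out.append(text[cursor:start])
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def keep_first(text: str, spans: List[Span]) -> str:
    """Remove a span only if the same substring was already seen earlier."""
    seen = set()
    to_remove: List[Span] = []
    for start, end in sorted(spans):
        piece = text[start:end]
        if piece in seen:
            to_remove.append((start, end))
        else:
            seen.add(piece)
    return strike_all(text, to_remove)


# Illustrative input: one substring that appears twice.
text = "Alice wanted to go to the store. Bob wanted to go to the park."
dup = " wanted to go to the "
first = text.find(dup)
second = text.find(dup, first + 1)
spans = [(first, first + len(dup)), (second, second + len(dup))]

print(strike_all(text, spans))  # Alicestore. Bobpark.
print(keep_first(text, spans))  # Alice wanted to go to the store. Bobpark.
```

`keep_first` compares span contents exactly, so it only helps when the reported spans are identical repeats. Overlapping or partially matching spans would need a different policy.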
