oscar-project / ungoliant

:spider: The pipeline for the OSCAR corpus

Home Page: https://oscar-corpus.com

License: Apache License 2.0

Rust 100.00%
commoncrawl oscar crawler language-classification corpus-linguistics nlp fasttext common-crawl

ungoliant's Introduction

Ungoliant

🕷️ Ungoliant is a high-performance toolkit for building corpus generation pipelines from CommonCrawl. 🕷️

It is currently the generation pipeline for the OSCAR corpus, built from CommonCrawl data. Ungoliant is the replacement for goclassy.

Installation

Installing/Compiling the binary

  • Via cargo: cargo install ungoliant
  • Via git: cargo install --git https://github.com/oscar-corpus/ungoliant

Ungoliant has numerous dependencies that are compiled during installation. Note that cmake and gcc may be needed, since the project uses fasttext-rs.

KenLM feature

The KenLM feature is optional because it relies on unsafe code that can break if the supplied model files are not correct.

To enable it, install KenLM requirements:

apt install -y libboost-all-dev libeigen3-dev

then use cargo install ungoliant --features kenlm, or cargo build --features kenlm if you're building from source.

Getting a language identification file (for fastText):

By default, ungoliant expects the lid.176.bin model by Meta. Use curl https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -o lid.176.bin to get it.

However, you can use the model you want: just point to its path using ungoliant download --lid-path <path to lid>.

Other options are described in each command's --help.

Usage

The usual way of generating corpora is:

  1. Fetch the wet.paths.gz file from the latest CommonCrawl dump and decompress it.
  2. Download the files using the download command.
  3. Generate the corpus using the pipeline command (it may take some time).
  4. Head on to oscar-tools for the packaging steps

You can find more information on each command's --help.

ungoliant 2
corpus generation tool.

USAGE:
    ungoliant <SUBCOMMAND>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    download    Download a CommonCrawl release
    help        Prints this message or the help of the given subcommand(s)
    pipeline    Run pipeline
    rebuild     Rebuild the corpus for a given language.

Documentation

Ungoliant is not yet on docs.rs: use cargo doc --bins --open to open the documentation.

Head on to OSCAR Documentation for more info about the project.

ungoliant's People

Contributors

force1ess, pjox, qhduan, sadra-barikbin, uinelj


ungoliant's Issues

[BUG] Cargo install of Ungoliant not working

Describe the bug
Disclaimer: I'm sorry if the bug I'm facing can be solved easily in some way, but I'm not a Rust developer and couldn't find the error I'm facing on Stack Overflow or similar websites.

When running Ungoliant (ungoliant -h for instance), I get the following message:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/ungoliant-0.1.0/src/main.rs:10:28
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

It works when installing from --git, so you might want to check/update the ungoliant package on crates.io (the git version of ungoliant is 1.0.0, whereas the cargo version is 0.1.0).

To Reproduce
Steps to reproduce the behavior:

  1. Install cargo + rust (https://doc.rust-lang.org/cargo/getting-started/installation.html)
  2. Install Ungoliant with Cargo: cargo install ungoliant
  3. Run Ungoliant: ungoliant -h

Expected behavior
Ungoliant runs correctly.

Desktop (please complete the following information):

  • OS: Ubuntu
  • Version: 20.04

Thanks a lot!

[Feature request] Pipeline: remove downloaded files after processing, and extract a single language

Hi, thank you for releasing this tool!

Since the CC dump is very size/disk demanding,

could we have optional pipeline steps like this:

  1. immediately process (pipeline step) each file as soon as it is downloaded in the download step, instead of waiting for all files to finish downloading
  2. remove each file immediately after it has been processed (this would save disk usage)

Also, could we have an optional pipeline mode where we can choose which language to process instead of processing all of them?

Maybe something like
ungoliant pipeline download/ src/ -lang id

Thanks in advance!

[BUG] Pipeline command not working

Describe the bug
I have the downloaded CC files in a .../download folder, but when I run the pipeline command, I get the following error:

Error: Io(Os { code: 2, kind: NotFound, message: "No such file or directory" })

To Reproduce
Steps to reproduce the behavior:

  1. Have ungoliant installed
  2. Have the downloaded n.txt.gz files in a folder (where n is each file's "id"; I have files 5 to 34)
  3. Create a destination folder for where the processed files will be saved
  4. Run ungoliant pipeline /path/to/1_downloaded/ /path/to/ungoliant/ --lid-path /path/to/lid.176.bin

Expected behaviour
The command works fine.

Desktop (please complete the following information):

  • OS: Ubuntu
  • Version 20.04

Thank you a lot :)

Feature: Multilingual documents

Add a multilingual document identification.

A multilingual document is currently defined as:

  • 10 lines/sentences (after length-based filtering), in order to have documents of a minimum length

  • 90% of the lines have a prediction confidence above 90%, in order to only keep high-quality predictions

  • At most 5 languages are identified (not sure about that one; the idea is to avoid having only one sentence per language)

We also need a "repartition" metric, in order to avoid labelling as "multilingual" a document with 99 French sentences and 1 English one. With two languages, we could do 70/30 (that means that if one of the two languages takes between 30% and 70% of the space, then the document is multilingual).

[Feature request] Train a classifier to better classify languages

Is your feature request related to a problem? Please describe.
Since OSCAR is limited by the fastText language classifier, which was trained on Wikipedia, the datasets also contain sentences in other languages. For instance, the Tajik (tg.txt) dataset contains large chunks of Uzbek sentences in Cyrillic script.

Describe the solution you'd like
Train new models using data other than Wikipedia, for instance text material taken from randomly chosen language-specific websites, language-specific news websites, and text material collected via the CURL portal (https://curl.corpora.uni-leipzig.de).

Describe alternatives you've considered
The Leipzig Corpora, but they also contain some "noise" that needs to be cleaned for efficient language detection.

Additional context
For example:
File: tg.txt Line: 660247: Маҳаллий расмийлар агар у пахта теримига чиқмаса, набираси учун тўланадиган нафақадан маҳрум этилиши билан таҳдид қилишгани.

If you do a simple check using fastText:

import fasttext
model = fasttext.load_model('lid.176.ftz')
print(model.predict('Маҳаллий расмийлар агар у пахта теримига чиқмаса, набираси учун тўланадиган нафақадан маҳрум этилиши билан таҳдид қилишгани.', k=2))
The output will be:

#(('__label__tg', '__label__bg'), array([0.38605371, 0.14384778]))

This indicates that the text is Tajik, but in fact it is not.

[Question] Different multilingual identification methods

Hi
Thank you so much for making this available!
I had a question regarding the two different multilingual identification strategies mentioned in the code: StrictMultilingual and Multilingual.
Specifically, would it be possible to share why the StrictMultilingual strategy was preferred for dataset creation, and whether any benchmarking was done (downstream performance / human annotation / eyeballing results) to say one is better than the other?
Would be grateful for any input on this :)

Thank you!

[BUG] download malfunctioning

Describe the Bug
It appears that the download function is malfunctioning. Previously (6 months ago), I successfully used the download feature with Ungoliant. However, it currently seems to be skipping files, possibly due to an inability to download or a change in the address.

To Reproduce
Attempt to download the wet.paths.gz file from https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-10/index.html.

Expected Behavior
The expected behavior is that the download should occur successfully.

Screenshots
It quickly moves to the next file, indicating that it's not downloading properly.

(screenshot omitted)

Desktop (Please Complete the Following Information):

uname -a
Linux delta 5.14.21-150500.55.36-default #1 SMP PREEMPT_DYNAMIC Tue Oct 31 08:37:43 UTC 2023 (e7a2e23) x86_64 x86_64 x86_64 GNU/Linux

Additional Context
Additionally, the link to the CommonCrawl dump in the readme.md is broken. I have used this link instead: https://commoncrawl.org/overview.

[Feature request] Controlling the number of threads being used

Hello the ungoliant team,

I would like to know if it would be possible to have, as a parameter, the number of threads to use for a given step (for instance the pipeline step). By default, if the parameter is not provided, all threads would be used; if a number of threads below the number available on the machine is provided, only that number would be used (a rough sketch follows below).

Thanks a lot!!
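For what it's worth, if the pipeline's parallelism is rayon-based (an assumption on my part), capping the thread count could be as simple as the following sketch; the function name and the idea of wiring it to a --threads flag are hypothetical.

use rayon::ThreadPoolBuilder;

/// Hypothetical sketch: cap the global rayon thread pool before the pipeline
/// starts. With `None`, rayon keeps its default of one thread per logical CPU.
fn configure_threads(threads: Option<usize>) -> Result<(), rayon::ThreadPoolBuildError> {
    if let Some(n) = threads {
        ThreadPoolBuilder::new().num_threads(n).build_global()?;
    }
    Ok(())
}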

Blocklists checklist

Some changes are required on blocklist management:

  • Use category rather than annotation when documents are detected on blocklists. This is to separate quality annotations from topic classification.
  • Improve the way the blocklist is built. Currently we might have a number of blocklists in parallel, where we'd only need one. This may require updating ut1-rs to provide a blocklist merging feature.
  • Change the adult content detector so that it only checks if the adult category is present in the category field, rather than doing the blocklist lookup itself.

Bug in `MeanLength` filter

There should be an abs() in this line so that the logic becomes correct (a hedged sketch of the intended comparison follows the test output below):
https://github.com/oscar-corpus/ungoliant/blob/6f1571516fd3337fe8fe9e6c533144e73d2d7017/src/filtering/sentence.rs#L109

To reproduce, we could add this assertion to the mean_default test in filtering/sentence.rs:

// an 80-character line is well below the expected mean (mu ≈ 100, sigma ≈ 10),
// so detect() should return false
let short_invalid: String = ['a'].iter().cycle().take(80).collect();
assert_eq!(f.detect(&short_invalid), false);

Output:

---- filtering::sentence::tests::mean_default stdout ----
init rng   : mu:100.000 sig:10.000
from filter: mu:99.492 sig:9.989
thread 'filtering::sentence::tests::mean_default' panicked at 'assertion failed: `(left == right)`
  left: `true`,
 right: `false`'
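A hedged sketch of what the corrected comparison would amount to; the actual field and method names in src/filtering/sentence.rs may differ.

// Illustrative only: a value should be accepted when its deviation from the
// expected mean is within the allowed spread in *either* direction, hence the
// abs(). Without it, values far below the mean (like the 80-character line
// above) slip through.
fn within_spread(observed_mean: f64, expected_mean: f64, allowed_spread: f64) -> bool {
    (observed_mean - expected_mean).abs() <= allowed_spread
}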

[BUG] No hard fail when blocklist path is invalid

Describe the bug
When specifying a blocklist path that is invalid, error messages do appear but the process ends "normally", with a status=0.

Either:

  1. Check path existence beforehand and stop if we have a problem with it (a rough sketch follows below),
  2. Fail the whole pipeline if we encounter this error somewhere.

Moreover, blocklists are instantiated once per shard and it shouldn't be that way. Maybe make ut1-blocklists Send and/or Sync, and then use a shared blocklist. -> #74
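A minimal sketch of option 1 above, assuming the blocklist is given as a filesystem path; the function name and error handling are illustrative.

use std::io;
use std::path::Path;

/// Illustrative sketch of option 1: refuse to start (non-zero exit) when the
/// supplied blocklist path does not exist, instead of only logging warnings.
fn check_blocklist_path(path: &Path) -> io::Result<()> {
    if !path.exists() {
        return Err(io::Error::new(
            io::ErrorKind::NotFound,
            format!("blocklist path not found: {}", path.display()),
        ));
    }
    Ok(())
}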

[BUG] Error when downloading full CC snapshot

Describe the bug
If I only sample 0.01% of the total files, ungoliant download works well.
However, if I download the full CC snapshot, the following error appears.

I'm using the latest version of ungoliant.

To Reproduce

from os import path
from subprocess import run
import multiprocessing

tmp_download_dir_name = "tmp/"

# Recreate the temporary download directory
if path.exists(tmp_download_dir_name):
    run(f"rm -r {tmp_download_dir_name}", shell=True)
run(f"mkdir {tmp_download_dir_name}", shell=True)

num_proc = multiprocessing.cpu_count()

# Fetch and decompress the WET paths file for the 2020-40 snapshot
run("wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-40/wet.paths.gz", shell=True)
run("gzip -d wet.paths.gz", shell=True)
paths_name = "wet-CC-MAIN-2020-40.paths"
run(f"mv wet.paths {paths_name}", shell=True)
segments = open(paths_name, "r").readlines()

# Download all shards listed in the paths file, one thread per CPU
run(f"ungoliant download -t={num_proc} {paths_name} {tmp_download_dir_name}", shell=True)

Screenshots
(screenshot of the error omitted)

Question about the -o <offset> option in download

Hey there, this is a fantastic codebase! I just have a quick question about the -o option. It may be more of a question about Common Crawl itself. Here it is:

Is the content of Common Crawl files uniformly random? So, for example, if I specify -o=79,900, will the 100 files that are downloaded contain a uniformly random sample of the pages from the whole 80,000 files?

Thanks for any help you can give!

Feature: Zipflike validation on documents at character-level

It may or may not work, but the idea is to check whether a provided document follows Zipf's law, in order to infer its potential quality. Zipf's law has been documented to work well on large enough corpora, but not (to my knowledge) at the character level.

The principal difficulty is finding a way to automatically decide whether the document follows the law or not (a rough illustration follows below).

This metric can be used in conjunction with #26
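One possible shape for such a check, purely as an illustration: rank characters by frequency and measure how linear the log-log rank/frequency relationship is. The function, the correlation-based criterion, and the minimum-character threshold are my own assumptions, not anything specified in this issue.

use std::collections::HashMap;

/// Illustrative Zipf-likeness score at the character level: characters are
/// ranked by frequency, and we compute the Pearson correlation between
/// log(rank) and log(frequency). Zipf's law predicts a roughly linear
/// relationship with negative slope, so values close to -1.0 are "Zipf-like".
fn zipf_correlation(text: &str) -> Option<f64> {
    let mut counts: HashMap<char, u64> = HashMap::new();
    for c in text.chars() {
        *counts.entry(c).or_insert(0) += 1;
    }
    if counts.len() < 10 {
        return None; // too few distinct characters to say anything
    }

    let mut freqs: Vec<u64> = counts.into_values().collect();
    freqs.sort_unstable_by(|a, b| b.cmp(a)); // descending frequency

    let points: Vec<(f64, f64)> = freqs
        .iter()
        .enumerate()
        .map(|(i, &f)| (((i + 1) as f64).ln(), (f as f64).ln()))
        .collect();

    let n = points.len() as f64;
    let mean_x = points.iter().map(|&(x, _)| x).sum::<f64>() / n;
    let mean_y = points.iter().map(|&(_, y)| y).sum::<f64>() / n;
    let cov: f64 = points.iter().map(|&(x, y)| (x - mean_x) * (y - mean_y)).sum();
    let var_x: f64 = points.iter().map(|&(x, _)| (x - mean_x).powi(2)).sum();
    let var_y: f64 = points.iter().map(|&(_, y)| (y - mean_y).powi(2)).sum();
    Some(cov / (var_x.sqrt() * var_y.sqrt()))
}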

[Feature request] Secure against length extension attacks

Is your feature request related to a problem? Please describe.
Per-folder SHA-256 hashes can potentially be vulnerable to length extension attacks.

Describe the solution you'd like

  1. Change the hash function to one that is resistant to length extension attacks.
    SHA-384 and SHA-512/256 exist; however, the latter is difficult to canonicalize due to the slash in its name (it is part of the official name). If speed or size is a concern, BLAKE3 is extremely fast and secure as well.
  2. Add the file size in bytes when writing the hash and filename.
    This can work, and can be achieved with minimal code change.
  3. Both of the above (a hedged sketch combining the two appears at the end of this issue). It would be nice to have file sizes alongside the checksums for verification purposes. Also, even if a length extension attack is ever found against the new hash, the scheme will still be secure.

Describe alternatives you've considered
DO NOT:

  1. Use xxhash from the zstd file.
    I thought about this possibility since it is already present, but xxhash64 as used in zstd is a very fast hash function with minimal security guarantees. If we assume that somebody is manipulating the json.zst file with sufficient compute to actually launch a length extension attack, this will not provide any further security.

Additional context
I consider OSCAR an important part of the data pipeline supply chain, so this is the bar I hope OSCAR can clear.
I will be willing to further investigate potential hash functions, implementations and provide PRs if necessary.

Reference : oscar-project/documentation#13
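As an illustration of suggestions 1 and 2 combined, here is a hedged sketch that hashes a file with BLAKE3 (via the blake3 crate, which would need to be added as a dependency) and records its size next to the digest; the "<hex>  <size>  <name>" line format and the function name are made up for this example.

use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Illustrative sketch combining suggestions 1 and 2: hash a file with BLAKE3
/// (not vulnerable to length extension) and record its size in bytes next to
/// the digest.
fn checksum_line(path: &Path) -> io::Result<String> {
    let mut file = File::open(path)?;
    let mut hasher = blake3::Hasher::new();
    let mut buf = [0u8; 64 * 1024];
    let mut size: u64 = 0;
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        size += n as u64;
        hasher.update(&buf[..n]);
    }
    let name = path.file_name().and_then(|n| n.to_str()).unwrap_or("?");
    Ok(format!("{}  {}  {}", hasher.finalize().to_hex(), size, name))
}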

Bug: Bad computation of Identification probability score

at https://github.com/oscar-corpus/ungoliant/blob/627ecd686c41a0aa370d8f23375120d923c39c62/src/pipelines/oscardoc/pipeline.rs#L252

The document score computation is done at line level, taking the score per number of lines rather than the score per byte.
Following this computation, a document with 100 lines classified with prob = 0.8 will have a 1.0 prob.

Change the computation so that it uses bytes rather than lines:

change (lang, bytes) to (lang, bytes, sum(bytes*prob)), and divide sum(bytes*prob) by bytes to get a better averaged score.
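A small sketch of the byte-weighted average described above; the function signature and per-line representation are illustrative, not the actual pipeline code.

/// Illustrative byte-weighted document score: each line contributes its
/// identification probability weighted by its length in bytes, so 100 lines
/// classified at prob = 0.8 average out to 0.8 instead of 1.0.
/// `lines` holds (byte length, identification probability) per line.
fn document_score(lines: &[(usize, f32)]) -> Option<f32> {
    let total_bytes: usize = lines.iter().map(|&(b, _)| b).sum();
    if total_bytes == 0 {
        return None;
    }
    let weighted: f32 = lines.iter().map(|&(b, p)| b as f32 * p).sum();
    Some(weighted / total_bytes as f32)
}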

Feature `std_rng` depends on `rand_hc` which is not an optional dependency

Describe the bug
Cannot install Ungoliant:

$ cargo install --verbose ungoliant
    Updating registry `https://github.com/rust-lang/crates.io-index`
  Installing ungoliant v1.2.3
error: failed to compile `ungoliant v1.2.3`, intermediate artifacts can be found at `/tmp/cargo-install.dSby1MuzbHhE`

Caused by:
  failed to parse registry's information for: rand

Caused by:
  Feature `std_rng` depends on `rand_hc` which is not an optional dependency.

Desktop (please complete the following information):

  • OS: Linux 5.4.0-117-generic #132-Ubuntu SMP x86_64 GNU/Linux
  • Versions:
    • Ungoliant: 1.2.3
    • cargo: 0.19.0

Additional context
I installed rust into a Conda environment via conda install -c pkgw-forge cargo. It's really bare-bones:

$ conda list
# packages in environment at /home/ndavid/miniconda3/envs/rust:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
ca-certificates           2022.4.26            h06a4308_0  
cargo                     0.19.0                        0    pkgw-forge
curl                      7.61.0               h84994c4_0  
libcurl                   7.61.0               h1ad7b7a_0  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libssh2                   1.8.0                h9cfc8f7_4  
openssl                   1.0.2u               h7b6447c_0  
zlib                      1.2.12               h7f8727e_2

[BUG] corrupt deflate stream

Describe the Bug
When running the Ungoliant pipeline, everything proceeds smoothly initially as the JSONL files for each language are built. However, after a couple of hours, an error suddenly appears in the logs, and thereafter, only this error persists. I am curious as to why this occurs and whether it could be resolved by skipping the problematic inputs.

[2024-03-27T23:49:00Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Custom { kind: InvalidInput, error: "corrupt deflate stream" })

To Reproduce
Nothing specific to mention, just the routine: downloading and pipelining.

Expected Behavior
The expected behavior is for the pipeline to function as it did earlier or to skip the corrupt inputs.

Screenshots

At first: (screenshot omitted)

Later: (screenshot omitted)

Desktop (Please Complete the Following Information):

uname -a
Linux delta 5.14.21-150500.55.36-default #1 SMP PREEMPT_DYNAMIC Tue Oct 31 08:37:43 UTC 2023 (e7a2e23) x86_64 x86_64 x86_64 GNU/Linux

Feature: Failures handling

  • Write failed shard downloads into a file, along with their id
  • Enable downloading from a failure file
  • (later milestone?) while downloading, put failed downloads at the end of the download queue to retry

Feature: Add retry option on downloader

When downloading CommonCrawl, the current downloader ignores failed shard downloads, resulting in lost data.

Add a -r n option that enables retrying n times for each failed item (at the end of the download, for example).

We may/should have a failed_items.json file that holds the failed items after downloading, and an ungoliant download -f failed_items.json option that tries to download them again.
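A hedged sketch of the retry part; the wrapper below is generic and does not reflect ungoliant's actual downloader types, and writing the remaining failures to something like failed_items.json would happen in the caller.

use std::{thread, time::Duration};

/// Hypothetical sketch of a retry wrapper: attempt a shard download up to
/// `retries + 1` times with a short pause between attempts; the final error is
/// returned so the caller can record the item for a later re-download.
fn download_with_retry<E, F>(mut download: F, retries: usize) -> Result<(), E>
where
    F: FnMut() -> Result<(), E>,
{
    let mut attempt = 0;
    loop {
        match download() {
            Ok(()) => return Ok(()),
            Err(e) if attempt >= retries => return Err(e),
            Err(_) => {
                attempt += 1;
                thread::sleep(Duration::from_secs(5));
            }
        }
    }
}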

[BUG] Deduplication with Ungoliant

Describe the bug
Hi, I tried to deduplicate downloaded web data (after processing it with ungoliant download and ungoliant pipeline), but I get warnings (see screenshot) and the execution ends without deduplication.

The command
ungoliant dedup <source> <destination>

Expected behavior
A deduplicated dataset generated in the destination directory.

Screenshots
(screenshot omitted)

Improve operation order in pipeline

The current pipeline does the PFiltering and identification before removing the short lines, which:

  • may increase the classification error rate, since short sentences are of worse quality
  • is more computationally costly, since we have to identify lines that will be discarded.

The pipeline should remove short sentences, then PFilter, then identify.

[BUG] UnexpectedEof While running Ungoliant Pipeline

UnexpectedEof while running the Ungoliant pipeline
I tried to run the pipeline to extract the languages from the CC WET files that are already downloaded (only 25 files).

Steps that produce the error
Steps to reproduce the behavior:

  1. Saved the CC index to a paths file 'cc-index.paths'
  2. Run Ungoliant download 'ungoliant download -t 10 <paths> <dst>'
  3. Run Ungoliant pipeline 'ungoliant pipeline --lid-path <model path> <wet dir> <dst>'
  4. See the error
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[... the same line repeated many times]

I debugged all the files and found that one file causing the error couldn't be unzipped.

Desktop:

  • OS: Ubuntu
  • Version: 22.04

[BUG] Cannot install via cargo

Describe the bug
This error blocked the installation:

error[E0107]: this struct takes 0 lifetime arguments but 1 lifetime argument was supplied
  --> /home/norapatbuppodom/.cargo/registry/src/github.com-1ecc6299db9ec823/ungoliant-1.2.3/src/transformers/content_detector.rs:18:9
   |
18 |     bl: Blocklist<'a>,
   |         ^^^^^^^^^---- help: remove these generics
   |         |
   |         expected 0 lifetime arguments
   |
note: struct defined here, with 0 lifetime parameters
  --> /home/norapatbuppodom/.cargo/registry/src/github.com-1ecc6299db9ec823/ut1_blocklist-0.1.1/src/blocklist.rs:27:12
   |
27 | pub struct Blocklist {
   |            ^^^^^^^^^

error[E0107]: this struct takes 0 lifetime arguments but 1 lifetime argument was supplied
  --> /home/norapatbuppodom/.cargo/registry/src/github.com-1ecc6299db9ec823/ungoliant-1.2.3/src/transformers/content_detector.rs:23:20
   |
23 |     pub fn new(bl: Blocklist<'a>) -> Self {
   |                    ^^^^^^^^^---- help: remove these generics
   |                    |
   |                    expected 0 lifetime arguments
   |
note: struct defined here, with 0 lifetime parameters
  --> /home/norapatbuppodom/.cargo/registry/src/github.com-1ecc6299db9ec823/ut1_blocklist-0.1.1/src/blocklist.rs:27:12
   |
27 | pub struct Blocklist {
   |            ^^^^^^^^^

For more information about this error, try `rustc --explain E0107`.
error: could not compile `ungoliant` due to 2 previous errors
warning: build failed, waiting for other jobs to finish...
error: failed to compile `ungoliant v1.2.3`, intermediate artifacts can be found at `/tmp/cargo-installxMoF4S`

To Reproduce
Steps to reproduce the behavior:

  1. Install ungoliant via cargo

Expected behavior
Install ungoliant successfully

Desktop (please complete the following information):

  • OS: MacOS m1, Linux, Rust Container Image (All three not working)
  • Version 1.2.3

Additional context
I have also provided a Dockerfile that reproduces the same error:

FROM rust:1.61
RUN apt-get update && apt-get -y install cmake
RUN cargo install ungoliant

[BUG] ungoliant::io::reader::corpus] [<lang>] no text/meta file.

Hi, it's me again :)

The issue I'm opening is more of a "how to" than a bug.
I just ran the dedup command:

ungoliant dedup /gazelle/corpora/ungoliant/2_pipeline/ /gazelle/corpora/ungoliant/3_dedup

And got these errors:

[2022-03-04T15:14:35Z WARN  ungoliant::io::reader::corpus] ["my"] no text/meta file.
[2022-03-04T15:14:35Z WARN  ungoliant::io::reader::corpus] ["th"] no text/meta file.
[2022-03-04T15:14:35Z WARN  ungoliant::io::reader::corpus] ["multi"] no text/meta file.
[2022-03-04T15:14:35Z WARN  ungoliant::io::reader::corpus] ["nap"] no text/meta file.
[2022-03-04T15:14:35Z WARN  ungoliant::io::reader::corpus] ["uz"] no text/meta file.
[2022-03-04T15:14:35Z WARN  ungoliant::io::reader::corpus] ["pnb"] no text/meta file.
[2022-03-04T15:14:35Z WARN  ungoliant::io::reader::corpus] ["av"] no text/meta file.
[2022-03-04T15:14:35Z WARN  ungoliant::io::reader::corpus] ["af"] no text/meta file.
[2022-03-04T15:14:35Z WARN  ungoliant::io::reader::corpus] ["ku"] no text/meta file.
[2022-03-04T15:14:35Z WARN  ungoliant::io::reader::corpus] ["nl"] no text/meta file.
[........ the same for all languages]

I presume that the reason is that my last run, the pipeline one, might not have finished. However, that previous run worked for about a day on 40 threads and didn't finish, so I want to know if one of the following solutions is possible:

  • Is there a way to generate a partial text/meta file?
  • Can I re-run the pipeline command so that it continues from where it was, so the pipeline doesn't restart from scratch?
  • When running the pipeline command, is it possible to have the text/meta files generated on the fly, so that in case of an early exit of the pipeline command, the already-processed data would be usable as is?

Maybe some of my suggestions are already possible, but I couldn't find any info in the doc.

Thanks again!
Kirian

[Feature request] Document how to set fasttext model

Is your feature request related to a problem? Please describe.
There are multiple fasttext models available, and in principle, one could train their own.
Besides the one indicated by the README.md (lid.176.bin), the official page lists lid.176.ftz.
On Hugging Face there is lid218e available, and
there is also a recent independent lib201 model.

Describe the solution you'd like

  1. I would like the README.md to mention that there are other models available
  2. I would like the code to provide a way to select a model through configuration.
  3. I would like the README.md to reflect how 2. would be implemented

Describe alternatives you've considered
I can download my own model and rename it to lid.176.bin. This is prone to confusion and unsatisfactory.

Additional context

  • There seem to be some unexposed options to achieve this.
    It would be useful to make them modular and document them.
  • The code in model.rs also seems to default to Path::new("lid.176.bin"), but when that is absent it tries to fall back to lid.208a.bin?, and it is unclear where that is obtainable. It's obvious certain efforts were made behind the curtain, so I am hesitant to implement a solution on my own.
  • Since lid.176.bin is the most publicly available, that could be the fallback, while the user could provide/select their own model.
    #21 might be fixed with this change.
