oscar-project / ungoliant

:spider: The pipeline for the OSCAR corpus

Home Page: https://oscar-corpus.com

License: Apache License 2.0

Rust 100.00%
commoncrawl oscar crawler language-classification corpus-linguistics nlp fasttext common-crawl

ungoliant's Introduction

Ungoliant

🕷️ Ungoliant is a high-performance toolkit for building corpus generation pipelines from CommonCrawl. 🕷️

It is currently the generation pipeline for the OSCAR corpus, built from CommonCrawl data. Ungoliant is the replacement for goclassy.

Installation

Installing/Compiling the binary

  • Via cargo: cargo install ungoliant
  • Via git: cargo install --git https://github.com/oscar-corpus/ungoliant

Ungoliant has numerous dependencies that are compiled during installation. Note that cmake and gcc may be needed, since the project uses fasttext-rs.

KenLM feature

The KenLM feature is optional because it relies on unsafe code that can break if the supplied model files are not correct.

To enable it, install KenLM requirements:

apt install -y libboost-all-dev libeigen3-dev

then use cargo install ungoliant --features kenlm, or cargo build --features kenlm if you're building from source.

Getting a language identification file (for fastText):

By default, ungoliant expects the lid.176.bin model by Meta. Use curl https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -o lid.176.bin to get it.

However, you can use the model you want: just point to its path using ungoliant download --lid-path <path to lid>.

Other options are described in each command's --help.

Usage

The usual way of generating corpora is:

  1. Fetch the wet.paths.gz file from the latest CommonCrawl dump and decompress it.
  2. Download the files using the download command.
  3. Generate the corpus using the pipeline command (it may take some time).
  4. Head on to oscar-tools for the packaging steps

You can find more information on each command's --help.

ungoliant 2
corpus generation tool.

USAGE:
    ungoliant <SUBCOMMAND>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    download    Download a CommonCrawl release
    help        Prints this message or the help of the given subcommand(s)
    pipeline    Run pipeline
    rebuild     Rebuild the corpus for a given language.

Documentation

Ungoliant is not yet on docs.rs: use cargo doc --bins --open to open the documentation.

Head on to OSCAR Documentation for more info about the project.

ungoliant's People

Contributors

force1ess, pjox, qhduan, sadra-barikbin, uinelj


ungoliant's Issues

[BUG] Cargo install of Ungoliant not working

Describe the bug
Disclaimer: I'm sorry if the bug I'm facing can be solved easily in some way, but I'm not a Rust developer and couldn't find the error I'm facing on Stack Overflow or similar websites.

When running Ungoliant (ungoliant -h for instance), I get the following message:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/ungoliant-0.1.0/src/main.rs:10:28
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

It works when installing from --git, so you might want to check/update the ungoliant package on crates.io (the git version of ungoliant is 1.0.0, whereas the cargo version is 0.1.0).

To Reproduce
Steps to reproduce the behavior:

  1. Install cargo + rust (https://doc.rust-lang.org/cargo/getting-started/installation.html)
  2. Install Ungoliant with Cargo: cargo install ungoliant
  3. Run Ungoliant: ungoliant -h

Expected behavior
Ungoliant runs correctly.

Desktop (please complete the following information):

  • OS: Ubuntu
  • Version: 20.04

Thanks a lot!

[Feature request] Pipeline: remove downloaded files after processing, and extract a single language

Hi, thank you for releasing this tool!

Since the CC dump is very size/disk demanding,

could we have optional pipeline steps like this:

  1. immediately process (pipeline step) each file as soon as it is downloaded in the download step, instead of waiting for all files to finish downloading
  2. remove each file immediately after it has been processed (this would save disk usage)

Also, could we have an optional pipeline mode where we can choose which language to process instead of processing all of them?

Maybe something like
ungoliant pipeline download/ src/ -lang id

Thanks in advance!

[BUG] Pipeline command not working

Describe the bug
I have the downloaded CC files in a .../download folder, but when I run the pipeline command, I get the following error:

Error: Io(Os { code: 2, kind: NotFound, message: "No such file or directory" })

To Reproduce
Steps to reproduce the behavior:

  1. Have ungoliant installed
  2. Have the downloaded n.txt.gz files in a folder (where n is each file's "id"; I have files 5 to 34)
  3. Create a destination folder for where the processed files will be saved
  4. Run ungoliant pipeline /path/to/1_downloaded/ /path/to/ungoliant/ --lid-path /path/to/lid.176.bin

Expected behaviour
The command works fine.

Desktop (please complete the following information):

  • OS: Ubuntu
  • Version 20.04

Thank you a lot :)

Feature: Multilingual documents

Add a multilingual document identification.

A multilingual document is currently defined as:

  • 10 lines/sentences (after length-based filtering), in order to have documents of a minimum length

  • 90% of the lines have a prediction confidence above 90%, in order to only keep high-quality predictions

  • At most 5 languages are identified (not sure about that one; the idea is to avoid having only one sentence per language)

We also need a "repartition" metric, in order to avoid labelling as "multilingual" a document with 99 French sentences and 1 English one. With two languages, we could do 70/30 (that means that if one of the two languages takes between 30% and 70% of the space, then the document is multilingual).

[Feature request] Train a classifier to better classify languages

Is your feature request related to a problem? Please describe.
Since OSCAR is limited by the fastText language classifier, which was trained on Wikipedia, the datasets also contain sentences in other languages. For instance, the Tajik (tg.txt) dataset contains large chunks of Uzbek sentences in Cyrillic script.

Describe the solution you'd like
Train new models using data other than Wikipedia, for instance text material taken from randomly chosen language-specific websites, language-specific news websites, and text material collected via the CURL portal (https://curl.corpora.uni-leipzig.de).

Describe alternatives you've considered
The Leipzig Corpora, but they also contain some "noise" that needs to be cleaned for efficient language detection.

Additional context
For example:
File: tg.txt Line: 660247: Маҳаллий расмийлар агар у пахта теримига чиқмаса, набираси учун тўланадиган нафақадан маҳрум этилиши билан таҳдид қилишгани.

If you do a simple check using fastText:

import fasttext
model = fasttext.load_model('lid.176.ftz')
print(model.predict('Маҳаллий расмийлар агар у пахта теримига чиқмаса, набираси учун тўланадиган нафақадан маҳрум этилиши билан таҳдид қилишгани.', k=2))
The output will be:

#(('__label__tg', '__label__bg'), array([0.38605371, 0.14384778]))

This indicates that the text is Tajik, but in fact it is not.

[Question] Different multilingual identification methods

Hi
Thank you so much for making this available!
I had a question regarding the two different multilingual identification strategies mentioned in the code: StrictMultilingual and Multilingual.
Specifically, would it be possible to share why the StrictMultilingual strategy was preferred for dataset creation, and whether any benchmarking was done (downstream performance / human annotation / eyeballing results) to say one is better than the other?
Would be grateful for any input on this :)

Thank you!

[BUG] download malfunctioning

Describe the Bug
It appears that the download function is malfunctioning. Previously (6 months ago), I successfully used the download feature with Ungoliant. However, it currently seems to be skipping files, possibly due to an inability to download or a change in the address.

To Reproduce
Attempt to download the wet.paths.gz file from https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-10/index.html.

Expected Behavior
The expected behavior is that the download should occur successfully.

Screenshots
It quickly moves to the next file, indicating that it's not downloading properly.

(screenshot omitted)

Desktop (Please Complete the Following Information):

uname -a
Linux delta 5.14.21-150500.55.36-default #1 SMP PREEMPT_DYNAMIC Tue Oct 31 08:37:43 UTC 2023 (e7a2e23) x86_64 x86_64 x86_64 GNU/Linux

Additional Context
Additionally, the link to the CommonCrawl dump in the readme.md is broken. I have used this link instead: https://commoncrawl.org/overview.

[Feature request] Controlling the number of threads being used

Hello the ungoliant team,

I would like to know if it would be possible to have, as a parameter, the number of threads to use for a given step (for instance the pipeline step). By default, if the parameter is not provided, all threads would be used; if a number of threads below the number available on the machine is provided, only that number would be used (a rough sketch follows below).

Thanks a lot!!
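For what it's worth, if the pipeline's parallelism is rayon-based (an assumption on my part), capping the thread count could be as simple as the following sketch; the function name and the idea of wiring it to a --threads flag are hypothetical.

use rayon::ThreadPoolBuilder;

/// Hypothetical sketch: cap the global rayon thread pool before the pipeline
/// starts. With `None`, rayon keeps its default of one thread per logical CPU.
fn configure_threads(threads: Option<usize>) -> Result<(), rayon::ThreadPoolBuildError> {
    if let Some(n) = threads {
        ThreadPoolBuilder::new().num_threads(n).build_global()?;
    }
    Ok(())
}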

Blocklists checklist

Some changes are required on blocklist management:

  • Use category rather than annotation when documents are detected on blocklists. This is to separate quality annotations from topic classification.
  • Improve the way the blocklist is built. Currently we might have a number of blocklists in parallel, where we'd only need one. This may require updating ut1-rs to provide a blocklist merging feature.
  • Change the adult content detector so that it only checks if the adult category is present in the category field, rather than doing the blocklist lookup itself.

Bug in `MeanLength` filter

There should be an abs() in this line so that the logic becomes correct (a hedged sketch of the intended comparison follows the test output below):
https://github.com/oscar-corpus/ungoliant/blob/6f1571516fd3337fe8fe9e6c533144e73d2d7017/src/filtering/sentence.rs#L109

To reproduce, we could add this assertion to the mean_default test in filtering/sentence.rs:

// an 80-character line is well below the expected mean (mu ≈ 100, sigma ≈ 10),
// so detect() should return false
let short_invalid: String = ['a'].iter().cycle().take(80).collect();
assert_eq!(f.detect(&short_invalid), false);

Output:

---- filtering::sentence::tests::mean_default stdout ----
init rng   : mu:100.000 sig:10.000
from filter: mu:99.492 sig:9.989
thread 'filtering::sentence::tests::mean_default' panicked at 'assertion failed: `(left == right)`
  left: `true`,
 right: `false`'
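A hedged sketch of what the corrected comparison would amount to; the actual field and method names in src/filtering/sentence.rs may differ.

// Illustrative only: a value should be accepted when its deviation from the
// expected mean is within the allowed spread in *either* direction, hence the
// abs(). Without it, values far below the mean (like the 80-character line
// above) slip through.
fn within_spread(observed_mean: f64, expected_mean: f64, allowed_spread: f64) -> bool {
    (observed_mean - expected_mean).abs() <= allowed_spread
}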

[BUG] No hard fail when blocklist path is invalid

Describe the bug
When specifying a blocklist path that is invalid, error messages do appear but the process ends "normally", with a status=0.

Either:

  1. Check path existence beforehand and stop if we have a problem with it (a rough sketch follows below),
  2. Fail the whole pipeline if we encounter this error somewhere.

Moreover, blocklists are instantiated once per shard and it shouldn't be that way. Maybe make ut1-blocklists Send and/or Sync, and then use a shared blocklist. -> #74
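A minimal sketch of option 1 above, assuming the blocklist is given as a filesystem path; the function name and error handling are illustrative.

use std::io;
use std::path::Path;

/// Illustrative sketch of option 1: refuse to start (non-zero exit) when the
/// supplied blocklist path does not exist, instead of only logging warnings.
fn check_blocklist_path(path: &Path) -> io::Result<()> {
    if !path.exists() {
        return Err(io::Error::new(
            io::ErrorKind::NotFound,
            format!("blocklist path not found: {}", path.display()),
        ));
    }
    Ok(())
}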

[BUG] Error when downloading full CC snapshot

Describe the bug
If I only sample 0.01% of the total files, ungoliant download works well.
However, if I download the full CC snapshot, the following error appears.

I'm using the latest version of ungoliant.

To Reproduce

from os import path
from subprocess import run
import multiprocessing

tmp_download_dir_name = "tmp/"

# Recreate the temporary download directory
if path.exists(tmp_download_dir_name):
    run(f"rm -r {tmp_download_dir_name}", shell=True)
run(f"mkdir {tmp_download_dir_name}", shell=True)

num_proc = multiprocessing.cpu_count()

# Fetch and decompress the WET paths file for the 2020-40 snapshot
run("wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-40/wet.paths.gz", shell=True)
run("gzip -d wet.paths.gz", shell=True)
paths_name = "wet-CC-MAIN-2020-40.paths"
run(f"mv wet.paths {paths_name}", shell=True)
segments = open(paths_name, "r").readlines()

# Download all shards listed in the paths file, one thread per CPU
run(f"ungoliant download -t={num_proc} {paths_name} {tmp_download_dir_name}", shell=True)

Screenshots
(screenshot of the error omitted)

Question about the -o <offset> option in download

Hey there, this is a fantastic codebase! I just have a quick question about the -o option. It may be more of a question about Common Crawl itself. Here it is:

Is the content of Common Crawl files uniformly random? So, for example, if I specify -o=79,900, will the 100 files that are downloaded contain a uniformly random sample of the pages from the whole 80,000 files?

Thanks for any help you can give!

Feature: Zipflike validation on documents at character-level

It may or may not work, but the idea is to check whether a provided document follows Zipf's law, in order to infer its potential quality. Zipf's law has been documented to work well on large enough corpora, but not (to my knowledge) at the character level.

The principal difficulty is finding a way to automatically decide whether the document follows the law or not (a rough illustration follows below).

This metric can be used in conjunction with #26
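One possible shape for such a check, purely as an illustration: rank characters by frequency and measure how linear the log-log rank/frequency relationship is. The function, the correlation-based criterion, and the minimum-character threshold are my own assumptions, not anything specified in this issue.

use std::collections::HashMap;

/// Illustrative Zipf-likeness score at the character level: characters are
/// ranked by frequency, and we compute the Pearson correlation between
/// log(rank) and log(frequency). Zipf's law predicts a roughly linear
/// relationship with negative slope, so values close to -1.0 are "Zipf-like".
fn zipf_correlation(text: &str) -> Option<f64> {
    let mut counts: HashMap<char, u64> = HashMap::new();
    for c in text.chars() {
        *counts.entry(c).or_insert(0) += 1;
    }
    if counts.len() < 10 {
        return None; // too few distinct characters to say anything
    }

    let mut freqs: Vec<u64> = counts.into_values().collect();
    freqs.sort_unstable_by(|a, b| b.cmp(a)); // descending frequency

    let points: Vec<(f64, f64)> = freqs
        .iter()
        .enumerate()
        .map(|(i, &f)| (((i + 1) as f64).ln(), (f as f64).ln()))
        .collect();

    let n = points.len() as f64;
    let mean_x = points.iter().map(|&(x, _)| x).sum::<f64>() / n;
    let mean_y = points.iter().map(|&(_, y)| y).sum::<f64>() / n;
    let cov: f64 = points.iter().map(|&(x, y)| (x - mean_x) * (y - mean_y)).sum();
    let var_x: f64 = points.iter().map(|&(x, _)| (x - mean_x).powi(2)).sum();
    let var_y: f64 = points.iter().map(|&(_, y)| (y - mean_y).powi(2)).sum();
    Some(cov / (var_x.sqrt() * var_y.sqrt()))
}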

[Feature request] Secure against length extension attacks

Is your feature request related to a problem? Please describe.
Per-folder SHA-256 hashes can potentially be vulnerable to length extension attacks.

Describe the solution you'd like

  1. Change the hash function to one that is resistant to length extension attacks.
    SHA-384 and SHA-512/256 exist; however, the latter is difficult to canonicalize due to the slash in its name (it is part of the official name). If speed or size is a concern, BLAKE3 is extremely fast and secure as well.
  2. Add the file size in bytes when writing the hash and filename.
    This can work, and can be achieved with minimal code change.
  3. Both of the above (a hedged sketch combining the two appears at the end of this issue). It would be nice to have file sizes alongside the checksums for verification purposes. Also, even if a length extension attack is ever found against the new hash, the scheme will still be secure.

Describe alternatives you've considered
DO NOT:

  1. Use xxhash from the zstd file.
    I thought about this possibility since it is already present, but xxhash64 as used in zstd is a very fast hash function with minimal security guarantees. If we assume that somebody is manipulating the json.zst file with sufficient compute to actually launch a length extension attack, this will not provide any further security.

Additional context
I consider OSCAR an important part of the data pipeline supply chain, so this is the bar I hope OSCAR can clear.
I will be willing to further investigate potential hash functions, implementations and provide PRs if necessary.

Reference : oscar-project/documentation#13
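As an illustration of suggestions 1 and 2 combined, here is a hedged sketch that hashes a file with BLAKE3 (via the blake3 crate, which would need to be added as a dependency) and records its size next to the digest; the "<hex>  <size>  <name>" line format and the function name are made up for this example.

use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Illustrative sketch combining suggestions 1 and 2: hash a file with BLAKE3
/// (not vulnerable to length extension) and record its size in bytes next to
/// the digest.
fn checksum_line(path: &Path) -> io::Result<String> {
    let mut file = File::open(path)?;
    let mut hasher = blake3::Hasher::new();
    let mut buf = [0u8; 64 * 1024];
    let mut size: u64 = 0;
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        size += n as u64;
        hasher.update(&buf[..n]);
    }
    let name = path.file_name().and_then(|n| n.to_str()).unwrap_or("?");
    Ok(format!("{}  {}  {}", hasher.finalize().to_hex(), size, name))
}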

Bug: Bad computation of Identification probability score

at https://github.com/oscar-corpus/ungoliant/blob/627ecd686c41a0aa370d8f23375120d923c39c62/src/pipelines/oscardoc/pipeline.rs#L252

The document score computation is done at line level, taking the score per number of lines rather than the score per byte.
Following this computation, a document with 100 lines classified with prob = 0.8 will have a 1.0 prob.

Change the computation so that it uses bytes rather than lines:

change (lang, bytes) to (lang, bytes, sum(bytes*prob)), and divide sum(bytes*prob) by bytes to get a better averaged score.
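A small sketch of the byte-weighted average described above; the function signature and per-line representation are illustrative, not the actual pipeline code.

/// Illustrative byte-weighted document score: each line contributes its
/// identification probability weighted by its length in bytes, so 100 lines
/// classified at prob = 0.8 average out to 0.8 instead of 1.0.
/// `lines` holds (byte length, identification probability) per line.
fn document_score(lines: &[(usize, f32)]) -> Option<f32> {
    let total_bytes: usize = lines.iter().map(|&(b, _)| b).sum();
    if total_bytes == 0 {
        return None;
    }
    let weighted: f32 = lines.iter().map(|&(b, p)| b as f32 * p).sum();
    Some(weighted / total_bytes as f32)
}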

Feature `std_rng` depends on `rand_hc` which is not an optional dependency

Describe the bug
Cannot install Ungoliant:

$ cargo install --verbose ungoliant
    Updating registry `https://github.com/rust-lang/crates.io-index`
  Installing ungoliant v1.2.3
error: failed to compile `ungoliant v1.2.3`, intermediate artifacts can be found at `/tmp/cargo-install.dSby1MuzbHhE`

Caused by:
  failed to parse registry's information for: rand

Caused by:
  Feature `std_rng` depends on `rand_hc` which is not an optional dependency.

Desktop (please complete the following information):

  • OS: Linux 5.4.0-117-generic #132-Ubuntu SMP x86_64 GNU/Linux
  • Versions:
    • Ungoliant: 1.2.3
    • cargo: 0.19.0

Additional context
I installed rust into a Conda environment via conda install -c pkgw-forge cargo. It's really bare-bones:

$ conda list
# packages in environment at /home/ndavid/miniconda3/envs/rust:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
ca-certificates           2022.4.26            h06a4308_0  
cargo                     0.19.0                        0    pkgw-forge
curl                      7.61.0               h84994c4_0  
libcurl                   7.61.0               h1ad7b7a_0  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libssh2                   1.8.0                h9cfc8f7_4  
openssl                   1.0.2u               h7b6447c_0  
zlib                      1.2.12               h7f8727e_2

[BUG] corrupt deflate stream

Describe the Bug
When running the Ungoliant pipeline, everything proceeds smoothly initially as the JSONL files for each language are built. However, after a couple of hours, an error suddenly appears in the logs, and thereafter, only this error persists. I am curious as to why this occurs and whether it could be resolved by skipping the problematic inputs.

[2024-03-27T23:49:00Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Custom { kind: InvalidInput, error: "corrupt deflate stream" })

To Reproduce
Nothing specific to mention, just the routine: downloading and pipelining.

Expected Behavior
The expected behavior is for the pipeline to function as it did earlier or to skip the corrupt inputs.

Screenshots

At first: (screenshot omitted)

Later: (screenshot omitted)

Desktop (Please Complete the Following Information):

uname -a
Linux delta 5.14.21-150500.55.36-default #1 SMP PREEMPT_DYNAMIC Tue Oct 31 08:37:43 UTC 2023 (e7a2e23) x86_64 x86_64 x86_64 GNU/Linux

Feature: Failures handling

  • Write failed shard downloads into a file, along with their id
  • Enable downloading from a failure file
  • (later milestone?) while downloading, put failed downloads at the end of the download queue to retry

Feature: Add retry option on downloader

When downloading CommonCrawl, the current downloader ignores failed shard downloads, resulting in lost data.

Add a -r n option that enables retrying n times for each failed item (at the end of the download, for example).

We may/should have a failed_items.json file that holds the failed items after downloading, and an ungoliant download -f failed_items.json option that tries to download them again.
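A hedged sketch of the retry part; the wrapper below is generic and does not reflect ungoliant's actual downloader types, and writing the remaining failures to something like failed_items.json would happen in the caller.

use std::{thread, time::Duration};

/// Hypothetical sketch of a retry wrapper: attempt a shard download up to
/// `retries + 1` times with a short pause between attempts; the final error is
/// returned so the caller can record the item for a later re-download.
fn download_with_retry<E, F>(mut download: F, retries: usize) -> Result<(), E>
where
    F: FnMut() -> Result<(), E>,
{
    let mut attempt = 0;
    loop {
        match download() {
            Ok(()) => return Ok(()),
            Err(e) if attempt >= retries => return Err(e),
            Err(_) => {
                attempt += 1;
                thread::sleep(Duration::from_secs(5));
            }
        }
    }
}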

[BUG] Deduplication with Ungoliant

Describe the bug
Hi, I tried to deduplicate downloaded web data (after processing it with ungoliant download and ungoliant pipeline), but I get warnings (see screenshot) and the execution ends without deduplication.

The command
ungoliant dedup <source> <destination>

Expected behavior
A deduplicated dataset generated in the destination directory.

Screenshots
(screenshot omitted)

Improve operation order in pipeline

The current pipeline does the PFiltering and identification before removing the short lines, which:

  • may increase the classification error rate, since short sentences are of worse quality
  • is more computationally costly, since we have to identify lines that will be discarded.

The pipeline should remove short sentences, then PFilter, then identify.

[BUG] UnexpectedEof While running Ungoliant Pipeline

UnexpectedEof while running the Ungoliant pipeline
I tried to run the pipeline to extract the languages from the CC WET files that are already downloaded (only 25 files).

Steps that produce the error
Steps to reproduce the behavior:

  1. Saved the CC index to a paths file 'cc-index.paths'
  2. Run Ungoliant download 'ungoliant download -t 10 <paths> <dst>'
  3. Run Ungoliant pipeline 'ungoliant pipeline --lid-path <model path> <wet dir> <dst>'
  4. See the error
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[... the same line repeated many times]

I debugged all the files and found that one file causing the error couldn't be unzipped.

Desktop:

  • OS: Ubuntu
  • Version: 22.04

[BUG] Cannot install via cargo

Describe the bug
This error blocked the installation:

error[E0107]: this struct takes 0 lifetime arguments but 1 lifetime argument was supplied
  --> /home/norapatbuppodom/.cargo/registry/src/github.com-1ecc6299db9ec823/ungoliant-1.2.3/src/transformers/content_detector.rs:18:9
   |
18 |     bl: Blocklist<'a>,
   |         ^^^^^^^^^---- help: remove these generics
   |         |
   |         expected 0 lifetime arguments
   |
note: struct defined here, with 0 lifetime parameters
  --> /home/norapatbuppodom/.cargo/registry/src/github.com-1ecc6299db9ec823/ut1_blocklist-0.1.1/src/blocklist.rs:27:12
   |
27 | pub struct Blocklist {
   |            ^^^^^^^^^

error[E0107]: this struct takes 0 lifetime arguments but 1 lifetime argument was supplied
  --> /home/norapatbuppodom/.cargo/registry/src/github.com-1ecc6299db9ec823/ungoliant-1.2.3/src/transformers/content_detector.rs:23:20
   |
23 |     pub fn new(bl: Blocklist<'a>) -> Self {
   |                    ^^^^^^^^^---- help: remove these generics
   |                    |
   |                    expected 0 lifetime arguments
   |
note: struct defined here, with 0 lifetime parameters
  --> /home/norapatbuppodom/.cargo/registry/src/github.com-1ecc6299db9ec823/ut1_blocklist-0.1.1/src/blocklist.rs:27:12
   |
27 | pub struct Blocklist {
   |            ^^^^^^^^^

For more information about this error, try `rustc --explain E0107`.
error: could not compile `ungoliant` due to 2 previous errors
warning: build failed, waiting for other jobs to finish...
error: failed to compile `ungoliant v1.2.3`, intermediate artifacts can be found at `/tmp/cargo-installxMoF4S`

To Reproduce
Steps to reproduce the behavior:

  1. Install ungoliant via cargo

Expected behavior
Install ungoliant successfully

Desktop (please complete the following information):

  • OS: MacOS m1, Linux, Rust Container Image (All three not working)
  • Version 1.2.3

Additional context
I have also provided a Dockerfile that reproduces the same error:

FROM rust:1.61
RUN apt-get update && apt-get -y install cmake
RUN cargo install ungoliant

[BUG] ungoliant::io::reader::corpus] [<lang>] no text/meta file.

Hi, it's me again :)

The issue I'm opening is more of a "how to" than a bug.
I just ran the dedup command:

ungoliant dedup /gazelle/corpora/ungoliant/2_pipeline/ /gazelle/corpora/ungoliant/3_dedup

And got these errors:

[2022-03-04T15:14:35Z WARN  ungoliant::io::reader::corpus] ["my"] no text/meta file.
[2022-03-04T15:14:35Z WARN  ungoliant::io::reader::corpus] ["th"] no text/meta file.
[2022-03-04T15:14:35Z WARN  ungoliant::io::reader::corpus] ["multi"] no text/meta file.
[2022-03-04T15:14:35Z WARN  ungoliant::io::reader::corpus] ["nap"] no text/meta file.
[2022-03-04T15:14:35Z WARN  ungoliant::io::reader::corpus] ["uz"] no text/meta file.
[2022-03-04T15:14:35Z WARN  ungoliant::io::reader::corpus] ["pnb"] no text/meta file.
[2022-03-04T15:14:35Z WARN  ungoliant::io::reader::corpus] ["av"] no text/meta file.
[2022-03-04T15:14:35Z WARN  ungoliant::io::reader::corpus] ["af"] no text/meta file.
[2022-03-04T15:14:35Z WARN  ungoliant::io::reader::corpus] ["ku"] no text/meta file.
[2022-03-04T15:14:35Z WARN  ungoliant::io::reader::corpus] ["nl"] no text/meta file.
[........ the same for all languages]

I presume that the reason is that my last run, the pipeline one, might not have finished. However, that previous run worked for about a day on 40 threads and didn't finish, so I want to know if one of the following solutions is possible:

  • Is there a way to generate a partial text/meta file?
  • Can I re-run the pipeline command so that it continues from where it was, so the pipeline doesn't restart from scratch?
  • When running the pipeline command, is it possible to have the text/meta files generated on the fly, so that in case of an early exit of the pipeline command, the already-processed data would be usable as is?

Maybe some of my suggestions are already possible, but I couldn't find any info in the doc.

Thanks again!
Kirian

[Feature request] Document how to set fasttext model

Is your feature request related to a problem? Please describe.
There are multiple fasttext models available, and in principle, one could train their own.
Besides the one indicated by the README.md (lid.176.bin), the official page lists lid.176.ftz.
On Hugging Face there is lid218e available, and
there is also a recent independent lib201 model.

Describe the solution you'd like

  1. I would like the README.md to mention that there are other models available
  2. I would like the code to provide a way to select a model through configuration.
  3. I would like the README.md to reflect how 2. would be implemented

Describe alternatives you've considered
I can download my own model and rename it to lid.176.bin. This is prone to confusion and unsatisfactory.

Additional context

  • There seem to be some unexposed options to achieve this.
    It would be useful to make them modular and document them.
  • The code in model.rs also seems to default to Path::new("lid.176.bin"), but when that is absent it tries to fall back to lid.208a.bin?, and it is unclear where that is obtainable. It's obvious certain efforts were made behind the curtain, so I am hesitant to implement a solution on my own.
  • Since lid.176.bin is the most publicly available, that could be the fallback, while the user could provide/select their own model.
    #21 might be fixed with this change.
