
bigscience-workshop / metadata


Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.

License: Apache License 2.0

Languages: Python 72.82%, Makefile 0.04%, Shell 27.14%

metadata's People

Contributors

cccntu, chkla, jordiclive, manandey, muennighoff, ppommer, saullu, shanyas10, tianjianjiang, timoschick


metadata's Issues

feat: resume training from a checkpoint

Feature request

It might be useful to be able to resume training from a checkpoint; unless I am mistaken, this is not a feature that is currently included.
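
A minimal sketch of what resuming could look like, assuming a plain PyTorch training loop; the checkpoint path and saved keys are hypothetical, not the repo's actual format:

# Sketch only: plain PyTorch loop; the checkpoint layout here is hypothetical.
import os
import torch

def save_checkpoint(path, model, optimizer, scheduler, completed_steps):
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "completed_steps": completed_steps,
        },
        path,
    )

def maybe_resume(path, model, optimizer, scheduler):
    """Return the step to resume from (0 if no checkpoint exists)."""
    if not os.path.isfile(path):
        return 0
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    return state["completed_steps"]

The training loop would then skip (or fast-forward through) the first completed_steps batches before continuing.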

feat: add an option for loss calculation for local metadata

Add an option for loss calculation for local metadata

Feature description

Currently, tokens corresponding to local metadata are not taken into account in the loss calculation. We would like to add an option in the training arguments to choose whether or not to take them into account.
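
As a rough illustration of what such an option could control, here is a sketch that masks local-metadata token positions out of the loss by setting their labels to -100 (the default ignore index of PyTorch's cross-entropy); the local_metadata_mask field is hypothetical:

# Sketch only: `local_metadata_mask` (1 where a token belongs to local metadata)
# is a hypothetical field; -100 is the default ignore_index of cross-entropy.
import torch

def build_labels(input_ids, local_metadata_mask, include_local_metadata_in_loss):
    labels = input_ids.clone()
    if not include_local_metadata_in_loss:
        labels[local_metadata_mask.bool()] = -100  # these positions are skipped by the loss
    return labels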

Error `IndexError: list index out of range` while testing the entity extraction

Error raised

  File "experiments/jz/dataset/c4/python_scripts/add_metadata.py", line 165, in main
    raw_datasets = raw_datasets.map(
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/dataset_dict.py", line 482, in map
    {
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/dataset_dict.py", line 483, in <dictcomp>
    k: dataset.map(
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2018, in map
    return self._map_single(
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 521, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 488, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/fingerprint.py", line 411, in wrapper
    out = func(self, *args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2382, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2269, in apply_function_on_filtered_inputs
    function(*fn_args, effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1978, in decorated
    result = f(decorated_item, *args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/bsmetadata/preprocessing_utils.py", line 225, in preprocess
    ent_desc = self._extract_desc_from_entity(entity)
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/bsmetadata/preprocessing_utils.py", line 213, in _extract_desc_from_entity
    return self.entity_utils.fetch_entity_description_from_keyword(key)
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/bsmetadata/preprocessing_tools/wikipedia_desc_utils.py", line 50, in fetch_entity_description_from_keyword
    text = self.fetch_wikipedia_description_for_title(title)
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/bsmetadata/preprocessing_tools/wikipedia_desc_utils.py", line 26, in fetch_wikipedia_description_for_title
    text = self.wiki_dump_db.get_paragraphs(title)[0].text
IndexError: list index out of range
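
The traceback ends in fetch_wikipedia_description_for_title, where get_paragraphs(title) apparently returns an empty list for some titles. A hedged sketch of a defensive guard (the attribute names mirror the traceback, not the actual implementation):

# Sketch only: guard against titles whose dump entry has no paragraphs.
def fetch_wikipedia_description_for_title(self, title):
    paragraphs = self.wiki_dump_db.get_paragraphs(title)
    if not paragraphs:
        return None  # callers must handle a missing description
    return paragraphs[0].text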

feat: add utility function (and/or data) for URL datasets if necessary

A Light Discussion about Dataset Choices for URL (at least)

Besides a small subset of (m)C4, I would prefer finding intersections among the metadata (URL at least), promptsource, and evaluation WGs.

  • TyDi QA (primary task) is probably the only common dataset

For either of the two other WGs (i.e., excluding us in metadata):

  • From evaluation
    • GEM from eval WG, specifically
      • MLSum
      • WikiLingua
  • From promptsource
    • app_reviews: although not really URL/URI but basically namespace and date
    • CC-News: virtually a subset of C4
    • Probably some more

Improve: Entity extraction speed

If I understand correctly, the extraction strategy for entity metadata uses the REL library. REL uses a flair model to do entity detection. I think it's quite possible that this kind of model could take advantage of batching; that would be very interesting for our dataset processing!

Currently the speed is about 5,000 examples every 50 minutes. Even though we haven't chosen the final size of the dataset yet, at this rate it would take roughly 2,536 days to process all of c4/en.
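
For reference, flair taggers can score many sentences per call via mini_batch_size; a minimal sketch below (the model name and batch size are placeholders, and REL's internal pipeline may not expose this knob directly):

# Sketch only: batched NER with flair; REL may not expose this directly.
from flair.data import Sentence
from flair.models import SequenceTagger

texts = ["Paris is the capital of France.", "The 2018 Winter Olympics were held in South Korea."]
tagger = SequenceTagger.load("ner-fast")  # placeholder model name
sentences = [Sentence(text) for text in texts]
tagger.predict(sentences, mini_batch_size=64)  # batched instead of one-by-one
for sentence in sentences:
    for span in sentence.get_spans("ner"):
        print(span.text, span.tag)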

docs: what if we also want to separate metadata from their target (source?) of raw data?

Background

To reply to @timoschick's questions (via Slack DM), my thoughts get stuck on a question that may sound similar to what @cccntu asked in #12: what are the expected interfaces for us to collaborate?

Three Sub-questions

  1. mC4, one of my preferred datasets, comes with URLs and timestamps directly. Do we still want to have our own JSONL files?
  2. Some datasets for https://github.com/bigscience-workshop/evaluation may also work for us, e.g., GEM (those originating from Wikipedia or WikiHow), CRD3, WikiANN, etc. The situation in (1) appears again and may be a bit more complicated.
  3. (1)+(2): suppose we also want to use some parts of https://github.com/bigscience-workshop/promptsource, especially the "applying templates" part, which I noticed https://github.com/bigscience-workshop/evaluation has been duplicating; my gut tells me such duplication can be avoided, I'm just not sure whether we can get help from data-tooling.

Thoughts

  • Again, as @cccntu mentioned in #12, shall we go for quick solutions or not?
    • To my limited knowledge, fastai has some callback-based transformations;
    • Or we may simply want to have our own copies of the datasets.
  • Perhaps we will have more than one Python package.
    • The current one;
    • The one that may be shared with promptsource, evaluation, metadata, and probably more?

Side Notes

  • The title of this issue follows https://github.com/angular/angular/blob/master/CONTRIBUTING.md#commit, just out of habit;
    • That convention sometimes overlaps with GitHub's default labels, and yet in my opinion labels should serve a different purpose (e.g., search filters);
  • The milestone and the project attached to this issue ticket are also just my habit... my apologies;
  • I took the liberty of assigning some of us; pardon the intrusion or any unintended exclusion;
  • Perhaps we will also need some issue templates and a CODEOWNERS file (and a PR review policy?).

Include entity description

Once @shanyas10's data gets uploaded to the cloud, I plan to use the website description preprocessor to add an additional "entity_description" field for entities ASAP. Previously I was using the Wikipedia API to do so, but since JZ doesn't have access to the internet, that approach wouldn't be of much use now.

perf: filter 404 URLs out by Common Crawl's cluster.idx

Tentative Tasks

  • 0. Download month-wise cluster.idx
  • 1. Convert cluster.idx → a python dictionary of URL (actually SURT) parts;
  • 2. Convert each input URL → (SURT) parts;
  • 3. (Partially) Match (2) with (1) → a list of cdx-\d{5}.gz (with ranges, of course);
  • 4. Convert matched cdx-\d{5}.gz → like (1) except probably for the whole SURTs only;
  • 5. (Exactly) Match (2) with (4) → None or a WARC file path with a range;
  • 6. Try (0)-(5) with more cluster.idx files that are not exactly in the same month as the OpenWebText URLs are.
  • 7. Multi-thread at least for (2) and (4); in theory (1), (3), (5), and by extension (6) are thread-safe, but I'm not 100% sure yet.

Tentative Outcomes

  • Running (0)-(3) sequentially for 2018-10 can be done in 6 minutes on Colab.

Background

Currently trying to actually do partial matching iteratively with cluster.idx and cdx-\d{5}.gz locally.

Below are cut-n-pasted from the pessimistic comments:

Unfortunately, the chance to get a matched URL from cluster.idx is much lower than I've anticipated.
For example, among 10,240 successfully downloaded URLs of 2018-10, only 7 are found in the corresponding cluster.idx.
Since cluster.idx only samples approximately every 3000 URLs (as a cluster) from the whole index, it is after all an understandable outcome...


Although it is possible to develop a fuzzy search that uses a partial URL to close in on potential index files (cdx-\d{5}.gz), and then recursively apply that fuzzy search on those cdx-\d{5}.gz, I probably don't have enough time to do so...
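
To make steps (1)-(3) concrete, here is a rough sketch that locates the candidate cdx-\d{5}.gz block for a URL by binary-searching the sorted cluster.idx keys; it assumes the usual cluster.idx layout (SURT key and timestamp, then cdx filename, offset, length, and cluster id, tab-separated) and the third-party surt package for URL canonicalization:

# Sketch only: assumes cluster.idx lines of the form
#   "<SURT key> <timestamp>\t<cdx file>\t<offset>\t<length>\t<cluster id>"
# and the third-party `surt` package for URL -> SURT conversion.
import bisect
from surt import surt

def load_cluster_index(path):
    keys, blocks = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            key_ts, cdx_file, offset, length, _ = line.rstrip("\n").split("\t")
            keys.append(key_ts.split(" ")[0])  # keep the SURT key only
            blocks.append((cdx_file, int(offset), int(length)))
    return keys, blocks  # cluster.idx is already sorted by SURT key

def candidate_block(url, keys, blocks):
    """Return the (cdx file, offset, length) block that could contain `url`."""
    key = surt(url)
    i = bisect.bisect_right(keys, key) - 1
    return blocks[max(i, 0)]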

Remove the vendor folder

As discussed before, we think it's best to avoid using git submodules when we can just use regular dependencies. It would be great if we could remove the vendor folder with dateutil in it 🙂

feat: refine dependency imports for preprocessing

At the moment, only one option is available: install all the preprocessing dependencies or install none of them. A user might want to install only the dependencies for one type of preprocessing (website description, timestamp, entity, etc.).

As proposed by @cccntu:

We can probably add "if available, then import" checks in the processor definition file. But installing them all at once saves the trouble of hitting an error and then installing them one by one.
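
A minimal sketch of the "if available, then import" pattern using importlib; the package name and pip extra shown are illustrative:

# Sketch only: optional-dependency check; package and extra names are illustrative.
import importlib
import importlib.util

def require(package: str, extra: str):
    """Import `package` lazily, with an error message naming the pip extra."""
    if importlib.util.find_spec(package) is None:
        raise ImportError(
            f"'{package}' is required for this preprocessor; "
            f"install it with `pip install bsmetadata[{extra}]` (hypothetical extra)."
        )
    return importlib.import_module(package)

# e.g. only pulled in when the entity preprocessor is actually used:
# flair = require("flair", extra="entity")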

feat: find a solution to load the dataset

While testing the real data extraction, I encountered a new problem: website descriptions are rarely present in the metadata_website_desc column. As a result, the datasets library cannot load such a dataset by guessing the feature types; it has to know them beforehand.
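
One way around this is to declare the schema explicitly when loading; a hedged sketch with datasets.Features (the column set and types shown are illustrative, only metadata_website_desc comes from this issue):

# Sketch only: declare features up front so sparse columns don't break type inference.
from datasets import Features, Sequence, Value, load_dataset

features = Features(
    {
        "text": Value("string"),
        # mostly-empty column that trips up automatic type guessing:
        "metadata_website_desc": Sequence(Value("string")),
    }
)
dataset = load_dataset("json", data_files="c4_with_metadata.jsonl", features=features)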

question: Error raised by entity extractor

The entity extraction process raises the following warning:

gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/sklearn/base.py:324: UserWarning: Trying to unpickle estimator LogisticRegression from version 0.23.1 when using version 1.0.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations

It might be worth investigating to see if anything harmful is happening.

Add metadata feature - Behavioral question

I was wondering if the output of the add_metadata_and_chunk_examples function was in the desired format.

When we want to add only local metadata, the text example always begins as follows: |||>Releas. Is it right to start with the separator even if no global metadata is added?

feat: add special tokens to request (or not) metadata generation

Add special tokens to request (or not) metadata generation

Feature details

We would like to add an argument to MetadataConfig that would control whether the model should generate one type of metadata or not. I propose to call this argument add_special_token_for_metadata_generation.

EDIT: New specification

Following the offline discussion (during Friday's meeting), this PR has been modified to implement a different format. The adopted format is as follows: add the special local tokens the same way we add global metadata, in the order defined by the user in metadata_list. In addition, I added an argument (local_metadata_special_tokens) so that we can specify the special tokens for local metadata.

Example:

url: https://xx | timestamp: 2018-xx | HtmlOn | EntityOn ||| <div class:xx> this is a word [[entity 1]] </div>
HtmlOn | EntityOn ||| <div class:xx> this is a word [[entity 1]] </div>
 url: https://xx | timestamp: 2018-xx ||| this is a word

EDIT: Old specification

Proposed specification:

  1. I propose to use as special tokens for each type of metadata the term used in the metadata_list argument;
  2. I propose to add a special token related to a metadata type at the beginning of a sample only if this metadata type appears in the example (be aware that this does not necessarily imply that an occurrence of this metadata type appears in every generated example, because the text_with_local_metadata can be split into several examples)
  3. I propose to add these special generation tokens at the beginning of the example by separating them from the rest with the special_token_for_metadata_generation_sep token specified in MetadataConfig
  4. I propose to add these special tokens both for global and local metadata

Example

Let's consider the following example

{
    "text": "the  2018 winter Olympic Games was held between 9 and 25 February 2018 in South Korea",
    "metadata": [
        {"key": "url", "type": "global", "value": "https://www.bbc.com/sport/live/olympics/50974152"},
        {"key": "timestamp", "type": "global", "value": "2018-12-10T13:45:00.000Z"},
        {"key": "html", "type": "local", "char_start_idx": 0, "relative_start_pos": 1, "char_end_idx": 84, "relative_end_pos": 0, "value": "div", "html_attrs": {"attr": ["class"], "value": ["summary"]}}
    ],
}

Generated example 1

With the arguments:

metadata_list = ["url", "timestamp", "html"]
special_token_for_metadata_generation = True
special_token_for_metadata_generation_sep = " ||| "
metadata_probability = 1

Generated sample:

url timestamp html ||| url: https://www.bbc.com/sport/live/olympics/50974152 | timestamp: 2018-12-10T13:45:00.000Z ||| <div class:summary> the  2018 winter Olympic Games was held between 9 and 25 February 2018 in South Korea</div>

Generated example 2

With the arguments:

metadata_list = ["url", "timestamp", "html"]
special_token_for_metadata_generation = True
special_token_for_metadata_generation_sep = " ||| "
metadata_probability = 0 # <- change here

Generated sample:

the  2018 winter Olympic Games was held between 9 and 25 February 2018 in South Korea

Generated example 3

With the arguments:

metadata_list = ["url", "timestamp", "entity", "html"] <- change here
special_token_for_metadata_generation = True
special_token_for_metadata_generation_sep = " ||| "
metadata_probability = 1

Generated sample:

url timestamp html ||| url: https://www.bbc.com/sport/live/olympics/50974152 | timestamp: 2018-12-10T13:45:00.000Z ||| <div class:summary> the  2018 winter Olympic Games was held between 9 and 25 February 2018 in South Korea</div>
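
For clarity, a sketch of how the old-spec generation prefix could be assembled: keep only the metadata types that actually occur in the example, in metadata_list order, and join them before the separator (the helper name is illustrative, not the actual bsmetadata implementation):

# Sketch only: builds the "url timestamp html ||| " prefix from the old spec.
def build_generation_prefix(example, metadata_list, sep=" ||| "):
    present = {m["key"] for m in example["metadata"]}
    tokens = [key for key in metadata_list if key in present]
    return " ".join(tokens) + sep if tokens else ""

# With the example above:
# build_generation_prefix(example, ["url", "timestamp", "entity", "html"])
# -> "url timestamp html ||| "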

Discussion

cc @timoschick, @cccntu, @tianjianjiang, @manandey, @shanyas10, and everybody in the Modeling-Metadata WG ! 🙂

perf: separate C4 or CommonCrawl URLs from OpenWebText URLs

Background

According to https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/text/c4.py, c4/webtextlike uses

OPENWEBTEXT_CC_VERSIONS = (  # August 2018 - July 2019
    "2019-18",  # Original default for single-crawl dataset (April 2019).
    "2019-30",
    "2019-26",
    "2019-22",
    "2019-13",
    "2019-09",
    "2019-04",
    "2018-51",
    "2018-47",
    "2018-43",
    "2018-39",
    "2018-34")

However, OpenWebText URLs are almost all older than the above CC indices, except for 2018-34, 2018-39, and 2018-43.

Since C4 downloads a much larger set of CC WET texts and then filters those texts by different sets of URLs, we probably won't gain much throughput benefit from the C4 configurations.

Another intriguing situation is that the AllenNLP people had tried to replicate c4/webtextlike but stopped, cf. https://huggingface.co/datasets/allenai/c4/blame/f888b0f407c37dd4a0e52d0c3bf56b8a7088f58b/README.md. I wonder what happened...

fix: clean up OpenWebText URL fragments, duplicates, variants, and malformed ones

It turns out the OpenWebText URLs contain some duplicates and many URLs with useless fragments that can cause further duplication.
For example, RS_2013-01.bz2.deduped.txt has

  • 1 malformed URL: http://1:05 EST and I haven't gotten the update yet on lumia 900
  • 1 exact duplicate of http://www.eurekalert.org/pub_releases/2013-01/foas-pcr010213.php
  • 1,218 duplicates caused by scheme, www variation, trailing slash, and fragment differences
  • 6,278 URLs with fragments that may or may not cause the 1,218 duplications above, but are useless anyway
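
A hedged sketch of the clean-up described above: drop fragments, trailing slashes, the www. prefix, and the scheme before deduplicating (which of these normalization rules we actually want is still open):

# Sketch only: one possible canonical form for dedup; the exact rules are still open.
from urllib.parse import urlsplit

def canonical(url: str) -> str:
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{path}{query}"  # scheme and #fragment are dropped

def dedupe(urls):
    seen, kept = set(), []
    for url in urls:
        key = canonical(url)
        if key and key not in seen:
            seen.add(key)
            kept.append(url)
    return kept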

Complete `get_dataloader` method

Initial thoughts

I think that the get_dataloader method is in charge of:

  • retrieving the dataset contained in one (or more?) .jsonl file with the template below
  • applying collation functions and tokenizing the text (in what order?) EDIT: for now, augment the text with metadata and tokenize it as a pre-processing step
  • returning train and evaluation dataloaders (a rough sketch follows after the toy file below)

Toy .jsonl data file

[
    {
        "document_id": 10,
        "text": "this is the input",
        "metadata": [
            {
                "key": "url",
                "type": "global",
                "value": "http://1"
            },
            {
                "key": "entity",
                "type": "local",
                "value": "address",
                "start_idx": 20,
                "end_idx": 40
            }
        ]
    },
    {
        "document_id": 12,
        "text": "this is the second input",
        "metadata": [
            {
                "key": "url",
                "type": "global",
                "value": "http://2"
            },
            {
                "key": "entity",
                "type": "local",
                "value": "date",
                "start_idx": 60,
                "end_idx": 90
            }
        ]
    }
]
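
As mentioned in the initial thoughts, here is a minimal sketch of what get_dataloader could do with a file in this format: load it with datasets, augment and tokenize as a pre-processing step, then wrap the result in train and evaluation DataLoaders (the tokenizer choice and the augmentation step are placeholders):

# Sketch only: tokenizer and the metadata-augmentation step are placeholders.
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, default_data_collator

def get_dataloader(data_path, batch_size=8):
    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
    tokenizer.pad_token = tokenizer.eos_token
    raw = load_dataset("json", data_files={"train": data_path})["train"]

    def add_metadata_and_tokenize(example):
        # placeholder: here the text would be augmented with its metadata entries
        augmented = example["text"]
        return tokenizer(augmented, truncation=True, padding="max_length", max_length=512)

    tokenized = raw.map(add_metadata_and_tokenize, remove_columns=raw.column_names)
    split = tokenized.train_test_split(test_size=0.1)
    train_dl = DataLoader(split["train"], batch_size=batch_size, shuffle=True,
                          collate_fn=default_data_collator)
    eval_dl = DataLoader(split["test"], batch_size=batch_size,
                         collate_fn=default_data_collator)
    return train_dl, eval_dl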
