
bigscience-workshop / metadata


Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.

License: Apache License 2.0

Languages: Python 72.82%, Makefile 0.04%, Shell 27.14%

metadata's People

Contributors

cccntu, chkla, jordiclive, manandey, muennighoff, ppommer, saullu, shanyas10, tianjianjiang, timoschick


metadata's Issues

feat: resume training from a checkpoint

Feature request

It might be useful to be able to resume training from a checkpoint; unless I am mistaken, this is not a feature that is currently included.
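
A minimal sketch of what resuming could look like, assuming a plain PyTorch training loop; the checkpoint path and saved keys are hypothetical, not the repo's actual format:

# Sketch only: plain PyTorch loop; the checkpoint layout here is hypothetical.
import os
import torch

def save_checkpoint(path, model, optimizer, scheduler, completed_steps):
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "completed_steps": completed_steps,
        },
        path,
    )

def maybe_resume(path, model, optimizer, scheduler):
    """Return the step to resume from (0 if no checkpoint exists)."""
    if not os.path.isfile(path):
        return 0
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    return state["completed_steps"]

The training loop would then skip (or fast-forward through) the first completed_steps batches before continuing.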

feat: add an option for loss calculation for local metadata

Add an option for loss calculation for local metadata

Feature description

Currently, tokens corresponding to local metadata are not taken into account in the loss calculation. We would like to add an option in the training arguments to choose whether or not to take them into account.
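
As a rough illustration of what such an option could control, here is a sketch that masks local-metadata token positions out of the loss by setting their labels to -100 (the default ignore index of PyTorch's cross-entropy); the local_metadata_mask field is hypothetical:

# Sketch only: `local_metadata_mask` (1 where a token belongs to local metadata)
# is a hypothetical field; -100 is the default ignore_index of cross-entropy.
import torch

def build_labels(input_ids, local_metadata_mask, include_local_metadata_in_loss):
    labels = input_ids.clone()
    if not include_local_metadata_in_loss:
        labels[local_metadata_mask.bool()] = -100  # these positions are skipped by the loss
    return labels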

Error `IndexError: list index out of range` while testing the entity extraction

Error raised

  File "experiments/jz/dataset/c4/python_scripts/add_metadata.py", line 165, in main
    raw_datasets = raw_datasets.map(
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/dataset_dict.py", line 482, in map
    {
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/dataset_dict.py", line 483, in <dictcomp>
    k: dataset.map(
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2018, in map
    return self._map_single(
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 521, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 488, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/fingerprint.py", line 411, in wrapper
    out = func(self, *args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2382, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2269, in apply_function_on_filtered_inputs
    function(*fn_args, effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1978, in decorated
    result = f(decorated_item, *args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/bsmetadata/preprocessing_utils.py", line 225, in preprocess
    ent_desc = self._extract_desc_from_entity(entity)
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/bsmetadata/preprocessing_utils.py", line 213, in _extract_desc_from_entity
    return self.entity_utils.fetch_entity_description_from_keyword(key)
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/bsmetadata/preprocessing_tools/wikipedia_desc_utils.py", line 50, in fetch_entity_description_from_keyword
    text = self.fetch_wikipedia_description_for_title(title)
  File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/bsmetadata/preprocessing_tools/wikipedia_desc_utils.py", line 26, in fetch_wikipedia_description_for_title
    text = self.wiki_dump_db.get_paragraphs(title)[0].text
IndexError: list index out of range
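
The traceback ends in fetch_wikipedia_description_for_title, where get_paragraphs(title) apparently returns an empty list for some titles. A hedged sketch of a defensive guard (the attribute names mirror the traceback, not the actual implementation):

# Sketch only: guard against titles whose dump entry has no paragraphs.
def fetch_wikipedia_description_for_title(self, title):
    paragraphs = self.wiki_dump_db.get_paragraphs(title)
    if not paragraphs:
        return None  # callers must handle a missing description
    return paragraphs[0].text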

feat: add utility function (and/or data) for URL datasets if necessary

A Light Discussion about Dataset Choices for URL (at least)

Besides a small subset of (m)C4, I would prefer finding intersections among the metadata (URL at least), promptsource, and evaluation WGs.

  • TyDi QA (primary task) is probably the only common dataset

For either of the two other WGs (i.e., excluding us in metadata):

  • From evaluation
    • GEM from eval WG, specifically
      • MLSum
      • WikiLingua
  • From promptsource
    • app_reviews: although not really URL/URI but basically namespace and date
    • CC-News: virtually a subset of C4
    • Probably some more

Improve: Entity extraction speed

If I understand correctly, the extraction strategy for entity metadata uses the REL library. REL uses a flair model to do entity detection. I think it's quite possible that this kind of model could take advantage of batching; that would be very interesting for our dataset processing!

Currently the speed is about 5,000 examples every 50 minutes. Even though we haven't chosen the final size of the dataset yet, at this rate it would take roughly 2,536 days to process all of c4/en.
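
For reference, flair taggers can score many sentences per call via mini_batch_size; a minimal sketch below (the model name and batch size are placeholders, and REL's internal pipeline may not expose this knob directly):

# Sketch only: batched NER with flair; REL may not expose this directly.
from flair.data import Sentence
from flair.models import SequenceTagger

texts = ["Paris is the capital of France.", "The 2018 Winter Olympics were held in South Korea."]
tagger = SequenceTagger.load("ner-fast")  # placeholder model name
sentences = [Sentence(text) for text in texts]
tagger.predict(sentences, mini_batch_size=64)  # batched instead of one-by-one
for sentence in sentences:
    for span in sentence.get_spans("ner"):
        print(span.text, span.tag)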

docs: what if we also want to separate metadata from their target (source?) of raw data?

Background

To reply to @timoschick's questions (via Slack DM), my thoughts get stuck on a question that may sound similar to what @cccntu asked in #12: what are the expected interfaces for us to collaborate?

Three Sub-questions

  1. mC4, one of my preferred datasets, comes with URLs and timestamps directly. Do we still want to have our own JSONL files?
  2. Some datasets for https://github.com/bigscience-workshop/evaluation may also work for us, e.g., GEM (those originating from Wikipedia or WikiHow), CRD3, WikiANN, etc. The situation in (1) appears again and may be a bit more complicated.
  3. (1)+(2): suppose we also want to use some parts of https://github.com/bigscience-workshop/promptsource, especially the "applying templates" part, which I noticed https://github.com/bigscience-workshop/evaluation has been duplicating; my gut tells me such duplication can be avoided, I'm just not sure whether we can get help from data-tooling.

Thoughts

  • Again, as @cccntu mentioned in #12, shall we go for quick solutions or not?
    • To my limited knowledge, fastai has some callback-based transformations;
    • Or we may simply want to have our own copies of the datasets.
  • Perhaps we will have more than one Python package.
    • The current one;
    • The one that may be shared with promptsource, evaluation, metadata, and probably more?

Side Notes

  • The title of this issue follows https://github.com/angular/angular/blob/master/CONTRIBUTING.md#commit, just out of habit;
    • That convention sometimes overlaps with GitHub's default labels, and yet in my opinion labels should serve a different purpose (e.g., search filters);
  • The milestone and the project attached to this issue ticket are also just my habit... my apologies;
  • I took the liberty of assigning some of us; pardon the intrusion or any unintended exclusion;
  • Perhaps we will also need some issue templates and a CODEOWNERS file (and a PR review policy?).

Include entity description

Once @shanyas10's data gets uploaded to the cloud, I plan to use the website description preprocessor to add an additional "entity_description" field for entities ASAP. Previously I was using the Wikipedia API to do so, but since JZ doesn't have access to the internet, that approach wouldn't be of much use now.

perf: filter 404 URLs out by Common Crawl's cluster.idx

Tentative Tasks

  • 0. Download month-wise cluster.idx
  • 1. Convert cluster.idx → a python dictionary of URL (actually SURT) parts;
  • 2. Convert each input URL → (SURT) parts;
  • 3. (Partially) Match (2) with (1) → a list of cdx-\d{5}.gz (with ranges, of course);
  • 4. Convert matched cdx-\d{5}.gz → like (1) except probably for the whole SURTs only;
  • 5. (Exactly) Match (2) with (4) → None or a WARC file path with a range;
  • 6. Try (0)-(5) with more cluster.idx files that are not exactly in the same month as the OpenWebText URLs are.
  • 7. Multi-thread at least for (2) and (4); in theory (1), (3), (5), and by extension (6) are thread-safe, but I'm not 100% sure yet.

Tentative Outcomes

  • Running (0)-(3) sequentially for 2018-10 can be done in 6 minutes on Colab.

Background

Currently trying to actually do partial matching iteratively with cluster.idx and cdx-\d{5}.gz locally.

Below are cut-n-pasted from the pessimistic comments:

Unfortunately, the chance to get a matched URL from cluster.idx is much lower than I've anticipated.
For example, among 10,240 successfully downloaded URLs of 2018-10, only 7 are found in the corresponding cluster.idx.
Since cluster.idx only samples approximately every 3000 URLs (as a cluster) from the whole index, it is after all an understandable outcome...


Although it is possible to develop a fuzzy search that uses a partial URL to close in on potential index files (cdx-\d{5}.gz), and then recursively apply that fuzzy search on those cdx-\d{5}.gz, I probably don't have enough time to do so...
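
To make steps (1)-(3) concrete, here is a rough sketch that locates the candidate cdx-\d{5}.gz block for a URL by binary-searching the sorted cluster.idx keys; it assumes the usual cluster.idx layout (SURT key and timestamp, then cdx filename, offset, length, and cluster id, tab-separated) and the third-party surt package for URL canonicalization:

# Sketch only: assumes cluster.idx lines of the form
#   "<SURT key> <timestamp>\t<cdx file>\t<offset>\t<length>\t<cluster id>"
# and the third-party `surt` package for URL -> SURT conversion.
import bisect
from surt import surt

def load_cluster_index(path):
    keys, blocks = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            key_ts, cdx_file, offset, length, _ = line.rstrip("\n").split("\t")
            keys.append(key_ts.split(" ")[0])  # keep the SURT key only
            blocks.append((cdx_file, int(offset), int(length)))
    return keys, blocks  # cluster.idx is already sorted by SURT key

def candidate_block(url, keys, blocks):
    """Return the (cdx file, offset, length) block that could contain `url`."""
    key = surt(url)
    i = bisect.bisect_right(keys, key) - 1
    return blocks[max(i, 0)]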

Remove the vendor folder

As discussed before, we think it's best to avoid using git submodules when we can just use regular dependencies. It would be great if we could remove the vendor folder with dateutil in it 🙂

feat: refine dependency imports for preprocessing

At the moment, only one option is available: install all the preprocessing dependencies or install none of them. A user might want to install only the dependencies for one type of preprocessing (website description, timestamp, entity, etc.).

As proposed by @cccntu:

We can probably add "if available, then import" checks in the processor definition file. But installing them all at once saves the trouble of hitting an error and then installing them one by one.
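
A minimal sketch of the "if available, then import" pattern using importlib; the package name and pip extra shown are illustrative:

# Sketch only: optional-dependency check; package and extra names are illustrative.
import importlib
import importlib.util

def require(package: str, extra: str):
    """Import `package` lazily, with an error message naming the pip extra."""
    if importlib.util.find_spec(package) is None:
        raise ImportError(
            f"'{package}' is required for this preprocessor; "
            f"install it with `pip install bsmetadata[{extra}]` (hypothetical extra)."
        )
    return importlib.import_module(package)

# e.g. only pulled in when the entity preprocessor is actually used:
# flair = require("flair", extra="entity")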

feat: find a solution to load the dataset

While testing the real data extraction, I encountered a new problem: website descriptions are rarely present in the metadata_website_desc column. As a result, the datasets library cannot load such a dataset by guessing the feature types; it has to know them beforehand.
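
One way around this is to declare the schema explicitly when loading; a hedged sketch with datasets.Features (the column set and types shown are illustrative, only metadata_website_desc comes from this issue):

# Sketch only: declare features up front so sparse columns don't break type inference.
from datasets import Features, Sequence, Value, load_dataset

features = Features(
    {
        "text": Value("string"),
        # mostly-empty column that trips up automatic type guessing:
        "metadata_website_desc": Sequence(Value("string")),
    }
)
dataset = load_dataset("json", data_files="c4_with_metadata.jsonl", features=features)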

question: Error raised by entity extractor

The entity extraction process raises the following warning:

gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/sklearn/base.py:324: UserWarning: Trying to unpickle estimator LogisticRegression from version 0.23.1 when using version 1.0.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations

It might be worth investigating to see if anything harmful is happening.

Add metadata feature - Behavioral question

I was wondering if the output of the add_metadata_and_chunk_examples function was in the desired format.

When we want to add only local metadata, the text example always begins as follows: |||>Releas. Is it right to start with the separator even if no global metadata is added?

feat: add special tokens to request (or not) metadata generation

Add special tokens to request (or not) metadata generation

Feature details

We would like to add an argument to MetadataConfig that would control whether the model should generate one type of metadata or not. I propose to call this argument add_special_token_for_metadata_generation.

EDIT: New specification

Following the offline discussion (during Friday's meeting), this PR has been modified to implement a different format. The adopted format is as follows: add the special local tokens the same way we add global metadata, in the order defined by the user in metadata_list. In addition, I added an argument (local_metadata_special_tokens) so that we can specify the special tokens for local metadata.

Example:

url: https://xx | timestamp: 2018-xx | HtmlOn | EntityOn ||| <div class:xx> this is a word [[entity 1]] </div>
HtmlOn | EntityOn ||| <div class:xx> this is a word [[entity 1]] </div>
 url: https://xx | timestamp: 2018-xx ||| this is a word

EDIT: Old specification

Proposed specification:

  1. I propose to use as special tokens for each type of metadata the term used in the metadata_list argument;
  2. I propose to add a special token related to a metadata type at the beginning of a sample only if this metadata type appears in the example (be aware that this does not necessarily imply that an occurrence of this metadata type appears in every generated example, because the text_with_local_metadata can be split into several examples)
  3. I propose to add these special generation tokens at the beginning of the example by separating them from the rest with the special_token_for_metadata_generation_sep token specified in MetadataConfig
  4. I propose to add these special tokens both for global and local metadata

Example

Let's consider the following example

{
    "text": "the  2018 winter Olympic Games was held between 9 and 25 February 2018 in South Korea",
    "metadata": [
        {"key": "url", "type": "global", "value": "https://www.bbc.com/sport/live/olympics/50974152"},
        {"key": "timestamp", "type": "global", "value": "2018-12-10T13:45:00.000Z"},
        {"key": "html", "type": "local", "char_start_idx": 0, "relative_start_pos": 1, "char_end_idx": 84, "relative_end_pos": 0, "value": "div", "html_attrs": {"attr": ["class"], "value": ["summary"]}}
    ],
}

Generated example 1

With the arguments:

metadata_list = ["url", "timestamp", "html"]
special_token_for_metadata_generation = True
special_token_for_metadata_generation_sep = " ||| "
metadata_probability = 1

Generated sample:

url timestamp html ||| url: https://www.bbc.com/sport/live/olympics/50974152 | timestamp: 2018-12-10T13:45:00.000Z ||| <div class:summary> the  2018 winter Olympic Games was held between 9 and 25 February 2018 in South Korea</div>

Generated example 2

With the arguments:

metadata_list = ["url", "timestamp", "html"]
special_token_for_metadata_generation = True
special_token_for_metadata_generation_sep = " ||| "
metadata_probability = 0 # <- change here

Generated sample:

the  2018 winter Olympic Games was held between 9 and 25 February 2018 in South Korea

Generated example 3

With the arguments:

metadata_list = ["url", "timestamp", "entity", "html"] <- change here
special_token_for_metadata_generation = True
special_token_for_metadata_generation_sep = " ||| "
metadata_probability = 1

Generated sample:

url timestamp html ||| url: https://www.bbc.com/sport/live/olympics/50974152 | timestamp: 2018-12-10T13:45:00.000Z ||| <div class:summary> the  2018 winter Olympic Games was held between 9 and 25 February 2018 in South Korea</div>
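
For clarity, a sketch of how the old-spec generation prefix could be assembled: keep only the metadata types that actually occur in the example, in metadata_list order, and join them before the separator (the helper name is illustrative, not the actual bsmetadata implementation):

# Sketch only: builds the "url timestamp html ||| " prefix from the old spec.
def build_generation_prefix(example, metadata_list, sep=" ||| "):
    present = {m["key"] for m in example["metadata"]}
    tokens = [key for key in metadata_list if key in present]
    return " ".join(tokens) + sep if tokens else ""

# With the example above:
# build_generation_prefix(example, ["url", "timestamp", "entity", "html"])
# -> "url timestamp html ||| "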

Discussion

cc @timoschick, @cccntu, @tianjianjiang, @manandey, @shanyas10, and everybody in the Modeling-Metadata WG ! 🙂

perf: separate C4 or CommonCrawl URLs from OpenWebText URLs

Background

According to https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/text/c4.py, c4/webtextlike uses

OPENWEBTEXT_CC_VERSIONS = (  # August 2018 - July 2019
    "2019-18",  # Original default for single-crawl dataset (April 2019).
    "2019-30",
    "2019-26",
    "2019-22",
    "2019-13",
    "2019-09",
    "2019-04",
    "2018-51",
    "2018-47",
    "2018-43",
    "2018-39",
    "2018-34")

However, OpenWebText URLs are almost all older than the above CC indices, except for 2018-34, 2018-39, and 2018-43.

Since C4 downloads a much larger set of CC WET texts and then filters those texts by different sets of URLs, we probably won't gain much throughput benefit from the C4 configurations.

Another intriguing situation is that the AllenNLP people had tried to replicate c4/webtextlike but stopped, cf. https://huggingface.co/datasets/allenai/c4/blame/f888b0f407c37dd4a0e52d0c3bf56b8a7088f58b/README.md. I wonder what happened...

fix: clean up OpenWebText URL fragments, duplicates, variants, and malformed ones

It turns out the OpenWebText URLs contain some duplicates and many URLs with useless fragments that can cause further duplication.
For example, RS_2013-01.bz2.deduped.txt has

  • 1 malformed URL: http://1:05 EST and I haven't gotten the update yet on lumia 900
  • 1 exact duplicate of http://www.eurekalert.org/pub_releases/2013-01/foas-pcr010213.php
  • 1,218 duplicates caused by scheme, www variation, trailing slash, and fragment differences
  • 6,278 URLs with fragments that may or may not cause the 1,218 duplications above, but are useless anyway
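
A hedged sketch of the clean-up described above: drop fragments, trailing slashes, the www. prefix, and the scheme before deduplicating (which of these normalization rules we actually want is still open):

# Sketch only: one possible canonical form for dedup; the exact rules are still open.
from urllib.parse import urlsplit

def canonical(url: str) -> str:
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{path}{query}"  # scheme and #fragment are dropped

def dedupe(urls):
    seen, kept = set(), []
    for url in urls:
        key = canonical(url)
        if key and key not in seen:
            seen.add(key)
            kept.append(url)
    return kept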

Complete `get_dataloader` method

Initial thoughts

I think that the get_dataloader method is in charge of:

  • retrieving the dataset contained in one (or more?) .jsonl file with the template below
  • applying collation functions and tokenizing the text (in what order?) EDIT: for now, augment the text with metadata and tokenize it as a pre-processing step
  • returning train and evaluation dataloaders (a rough sketch follows after the toy file below)

Toy .jsonl data file

[
    {
        "document_id": 10,
        "text": "this is the input",
        "metadata": [
            {
                "key": "url",
                "type": "global",
                "value": "http://1"
            },
            {
                "key": "entity",
                "type": "local",
                "value": "address",
                "start_idx": 20,
                "end_idx": 40
            }
        ]
    },
    {
        "document_id": 12,
        "text": "this is the second input",
        "metadata": [
            {
                "key": "url",
                "type": "global",
                "value": "http://2"
            },
            {
                "key": "entity",
                "type": "local",
                "value": "date",
                "start_idx": 60,
                "end_idx": 90
            }
        ]
    }
]
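
As mentioned in the initial thoughts, here is a minimal sketch of what get_dataloader could do with a file in this format: load it with datasets, augment and tokenize as a pre-processing step, then wrap the result in train and evaluation DataLoaders (the tokenizer choice and the augmentation step are placeholders):

# Sketch only: tokenizer and the metadata-augmentation step are placeholders.
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, default_data_collator

def get_dataloader(data_path, batch_size=8):
    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
    tokenizer.pad_token = tokenizer.eos_token
    raw = load_dataset("json", data_files={"train": data_path})["train"]

    def add_metadata_and_tokenize(example):
        # placeholder: here the text would be augmented with its metadata entries
        augmented = example["text"]
        return tokenizer(augmented, truncation=True, padding="max_length", max_length=512)

    tokenized = raw.map(add_metadata_and_tokenize, remove_columns=raw.column_names)
    split = tokenized.train_test_split(test_size=0.1)
    train_dl = DataLoader(split["train"], batch_size=batch_size, shuffle=True,
                          collate_fn=default_data_collator)
    eval_dl = DataLoader(split["test"], batch_size=batch_size,
                         collate_fn=default_data_collator)
    return train_dl, eval_dl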
