bigscience-workshop / metadata
Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
License: Apache License 2.0
I haven't prepared a dataset loading script yet, so the .gz files at https://huggingface.co/datasets/bs-modeling-metadata/openwebtext-html-cc don't work with the HF datasets library yet.
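In the meantime, if the .gz files are gzipped JSON-lines (an assumption, since there is no loading script yet), they could probably be read with the generic json builder; a minimal sketch (the file name below is illustrative, not a real file on the Hub):
from datasets import load_dataset

# Assumption: gzipped JSON-lines; replace the URL with an actual file from the dataset repo.
data_files = {"train": "https://huggingface.co/datasets/bs-modeling-metadata/openwebtext-html-cc/resolve/main/train-00000.jsonl.gz"}
dataset = load_dataset("json", data_files=data_files, split="train")
print(dataset[0])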
Perplexity on a website-specific test set. Contact @cccntu and Christopher.
Use a very simple toy dataset, without metadata, to test the training script
It might be useful to be able to resume training from a checkpoint; unless I am mistaken, this is not a feature that is currently included.
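A minimal sketch of what resuming could look like, assuming a plain PyTorch loop and a hypothetical checkpoint layout (model, optimizer, step):
import os
import torch

def maybe_resume(model, optimizer, checkpoint_path):
    """Load model/optimizer state and the last step if a checkpoint exists (hypothetical layout)."""
    if not os.path.exists(checkpoint_path):
        return 0  # no checkpoint: start from scratch
    state = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1  # resume after the last completed step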
Currently, tokens corresponding to local metadata are not taken into account in the loss calculation. We would like to add an option in the training arguments to choose whether or not to take them into account.
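A sketch of what such an option could look like, assuming we already have a boolean mask marking the local-metadata token positions (the argument and mask names here are illustrative, not the actual training arguments):
import torch
import torch.nn.functional as F

def lm_loss(logits, labels, metadata_token_mask, include_metadata_in_loss=False):
    """Next-token cross-entropy, optionally ignoring local-metadata tokens.

    logits: (batch, seq, vocab); labels: (batch, seq);
    metadata_token_mask: (batch, seq) bool, True where the token belongs to local metadata.
    """
    shift_logits = logits[:, :-1].contiguous()
    shift_labels = labels[:, 1:].clone()
    if not include_metadata_in_loss:
        # Positions set to -100 are skipped by cross_entropy via ignore_index.
        shift_labels[metadata_token_mask[:, 1:]] = -100
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )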
As discussed a while ago in a meeting, it would be really great to have a feature that saves the model and stops training after a certain time, as jobs on the JZ cluster are limited to 20 hours.
For example, the architecture and scaling working group added an --exit-duration-in-mins argument to Megatron-DeepSpeed, the library they use to run trainings.
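A rough sketch of what such a flag could do in a plain training loop (the argument name mirrors Megatron-DeepSpeed's --exit-duration-in-mins; the model API and checkpoint layout are assumptions):
import time
import torch

def train(model, optimizer, dataloader, exit_duration_in_mins=None, checkpoint_path="checkpoint.pt"):
    start = time.time()
    for step, batch in enumerate(dataloader):
        loss = model(**batch).loss  # assumes an HF-style model that returns .loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if exit_duration_in_mins is not None and (time.time() - start) / 60 >= exit_duration_in_mins:
            # Save everything needed to resume, then stop before the job hits the 20h limit.
            torch.save(
                {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
                checkpoint_path,
            )
            break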
Error raised
File "experiments/jz/dataset/c4/python_scripts/add_metadata.py", line 165, in main
raw_datasets = raw_datasets.map(
File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/dataset_dict.py", line 482, in map
{
File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/dataset_dict.py", line 483, in <dictcomp>
k: dataset.map(
File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2018, in map
return self._map_single(
File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 521, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 488, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/fingerprint.py", line 411, in wrapper
out = func(self, *args, **kwargs)
File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2382, in _map_single
batch = apply_function_on_filtered_inputs(
File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2269, in apply_function_on_filtered_inputs
function(*fn_args, effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1978, in decorated
result = f(decorated_item, *args, **kwargs)
File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/bsmetadata/preprocessing_utils.py", line 225, in preprocess
ent_desc = self._extract_desc_from_entity(entity)
File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/bsmetadata/preprocessing_utils.py", line 213, in _extract_desc_from_entity
return self.entity_utils.fetch_entity_description_from_keyword(key)
File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/bsmetadata/preprocessing_tools/wikipedia_desc_utils.py", line 50, in fetch_entity_description_from_keyword
text = self.fetch_wikipedia_description_for_title(title)
File "/gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/bsmetadata/preprocessing_tools/wikipedia_desc_utils.py", line 26, in fetch_wikipedia_description_for_title
text = self.wiki_dump_db.get_paragraphs(title)[0].text
IndexError: list index out of range
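The crash comes from indexing an empty list when the wiki dump has no paragraphs for the resolved title. A possible guard, written against the call shown in the traceback (a sketch, not a tested patch; callers would then have to handle None):
def fetch_wikipedia_description_for_title(wiki_dump_db, title):
    """Mirror of the method in wikipedia_desc_utils.py, with a guard against empty results."""
    paragraphs = wiki_dump_db.get_paragraphs(title)
    if not paragraphs:  # some titles (e.g. redirects or missing pages) have no paragraphs in the dump
        return None
    return paragraphs[0].text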
When done, email @VictorSanh.
Find the template here: https://github.com/bigscience-workshop/metadata/tree/master/experiments/jz/templates/SLURM
Besides a small subset of (m)C4, I would prefer finding intersections among the metadata (URL at least), promptsource, and evaluation WGs.
For either of the two WGs other than us (metadata) here,
If I understand correctly, the extraction strategy for entity metadata uses the REL library. REL uses a flair model to do entity detection. I think it's quite possible that this kind of model could take advantage of batching: that would be very interesting for our dataset processing!
Currently, the speed is about 5,000 examples every 50 minutes. Even though we haven't yet chosen the final size of the dataset, at this rate it would take 2,536 days to process all of c4/en.
It isn't required for URL metadata, but some of the preprocessing requirements are shared among the data source, timestamp, website description, and the URL itself. For starters, urllib.parse.urlsplit() helps.
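For illustration, urllib.parse.urlsplit already yields most of the pieces the different preprocessors would want (the attributes below are the standard SplitResult fields; the URL is made up):
from urllib.parse import urlsplit

parts = urlsplit("https://www.example.com/2018/10/some-article?utm_source=x#section-2")
print(parts.scheme)    # 'https'
print(parts.netloc)    # 'www.example.com'  -> domain for website-description lookups
print(parts.path)      # '/2018/10/some-article'  -> sometimes carries a timestamp-like date
print(parts.query)     # 'utm_source=x'
print(parts.fragment)  # 'section-2'  -> usually useless and a source of near-duplicates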
Later I will put together links I can find among PRs and issue tickets.
To reply to @timoschick's questions (via Slack DM), my thinking gets stuck on a question that may sound similar to what @cccntu asked in #12: what are the expected interfaces for us to collaborate?
Once @shanyas10's data gets uploaded to the cloud, I plan to use the website description preprocessor to add an additional field "entity_description" for entities ASAP. Previously I was using the Wikipedia API to do so, and since JZ doesn't have internet access, that approach wouldn't be of much use now.
(1) cluster.idx → a python dictionary of URL (actually SURT) parts; cdx-\d{5}.gz (with ranges, of course);
(2) cdx-\d{5}.gz → like (1) except probably for the whole SURTs only; None or a WARC file path with a range;
Also: the cluster.idx files that are not exactly in the same month as the OpenWebText URLs. 2018-10 can be done in 6 minutes on Colab. Currently trying to really do partial matching iteratively with cluster.idx and cdx-\d{5}.gz locally.
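To make step (1) concrete, here is a sketch of the kind of prefix lookup described above; it assumes the usual cluster.idx layout where each whitespace-separated line starts with a SURT key and names a cdx-XXXXX.gz shard (treat the column positions and the example SURT prefix as assumptions):
def find_candidate_shards(cluster_idx_path, surt_prefix):
    """Return the cdx shard names whose sampled SURT key starts with surt_prefix.

    A linear scan for clarity; since cluster.idx is sorted, a real version could bisect.
    """
    shards = set()
    with open(cluster_idx_path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 3 and fields[0].startswith(surt_prefix):
                shards.add(fields[2])
    return sorted(shards)

# Hypothetical usage:
# find_candidate_shards("cluster.idx", "org,eurekalert)/")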
Below is cut-and-pasted from the pessimistic comments:
Unfortunately, the chance of getting a matched URL from cluster.idx is much lower than I anticipated.
For example, among 10,240 successfully downloaded URLs from 2018-10, only 7 are found in the corresponding cluster.idx.
Since cluster.idx only samples approximately every 3,000 URLs (as a cluster) from the whole index, it is after all an understandable outcome...
Although it is possible to develop a fuzzy search that uses a partial URL to close in on potential index files (cdx-\d{5}.gz), and then recursively apply that fuzzy search to those cdx-\d{5}.gz files, I probably don't have enough time to do so...
As discussed before, we think it's best to avoid using git submodules when we can just use regular dependencies. It would be great if we could remove the vendor folder with dateutil in it 🙂
At the moment, only one option is available: install all the preprocessing dependencies or install none of them. A user might want to install only the dependencies for one type of preprocessing (website description, timestamp, entity, etc.).
As proposed by @cccntu:
We can probably add an "import if available" guard in the processors definition file. But installing them all at once saves the trouble of seeing an error and then installing them one by one.
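A minimal sketch of the "import if available" idea (the extras name is illustrative, not an actual setup.py extra):
try:
    import REL  # heavy dependency, only needed for entity preprocessing
except ImportError:
    REL = None


class EntityPreprocessor:
    def __init__(self):
        if REL is None:
            raise ImportError(
                "Entity preprocessing requires the REL dependencies; "
                "install them with e.g. `pip install bsmetadata[entity]` (extras name is illustrative)."
            )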
There are reports that the current approach might not work with pip install. So I am planning to drop the submodule and rename the modified dateutil so we can add it to the requirements.
cc. @SaulLu @shanyas10
While testing the real data extraction, I encountered a new problem: website descriptions are rarely present in the metadata_website_desc column. Therefore, the datasets library cannot load such a dataset by guessing the feature types; it has to know them beforehand.
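One way around that is to declare the feature types explicitly instead of letting datasets infer them; a sketch (the file name is illustrative and the declared types are guesses at the real schema, to be adjusted):
from datasets import Features, Value, load_dataset

features = Features(
    {
        "text": Value("string"),
        # Mostly-empty column: declaring the type up front avoids inference failing on all-null shards.
        "metadata_website_desc": Value("string"),
    }
)
dataset = load_dataset("json", data_files="extracted.jsonl", features=features, split="train")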
The entity extraction process raises the following warning:
gpfswork/rech/six/commun/conda/lucile-modelling-metadata/lib/python3.8/site-packages/sklearn/base.py:324: UserWarning: Trying to unpickle estimator LogisticRegression from version 0.23.1 when using version 1.0.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations
It might be worth investigating to see if anything harmful is happening.
I was wondering if the output of the add_metadata_and_chunk_examples function was in the desired format. When we want to add only local metadata, the text example always begins as follows: |||>Releas. Is it right to start with the separator even if you don't add global metadata?
The torch version used in requirements.txt is 1.8.1, which seems to have some issues with the flair library I am using (flairNLP/flair#2137). Can we use torch==1.9.0 instead? @SaulLu @cccntu
We would like to add an argument to MetadataConfig that would control whether or not the model should generate one type of metadata. I propose to call this argument add_special_token_for_metadata_generation.
Following the offline discussion (during Friday's meeting), this PR has been modified to implement a different format. The adopted format is as follows: add the special local token the same way we add global metadata, with the order defined by the user in metadata_list. In addition, I added an argument (local_metadata_special_tokens) so that we can specify special tokens for local metadata.
Example:
url: https://xx | timestamp: 2018-xx | HtmlOn | EntityOn ||| <div class:xx> this is a word [[entity 1]] </div>
HtmlOn | EntityOn ||| <div class:xx> this is a word [[entity 1]] </div>
url: https://xx | timestamp: 2018-xx ||| this is a word
Proposed specification:
- the order follows the metadata_list argument;
- text_with_local_metadata can be split into several examples;
- the separator is the special_token_for_metadata_generation_sep token specified in MetadataConfig.
Let's consider the following example
{
"text": "the 2018 winter Olympic Games was held between 9 and 25 February 2018 in South Korea",
"metadata": [
{"key": "url", "type": "global", "value": "https://www.bbc.com/sport/live/olympics/50974152"},
{"key": "timestamp", "type": "global", "value": "2018-12-10T13:45:00.000Z"},
{'key': 'html', 'type': 'local', 'char_start_idx': 0, 'relative_start_pos': 1, 'char_end_idx': 84, 'relative_end_pos': 0, 'value':'div', 'html_attrs': {'attr': ['class'], 'value': ['summary']}}
],
}
With the arguments:
metadata_list = ["url", "timestamp", "html"]
special_token_for_metadata_generation = True
special_token_for_metadata_generation_sep = " ||| "
metadata_probability = 1
Generated sample:
url timestamp html ||| url: https://www.bbc.com/sport/live/olympics/50974152 | timestamp: 2018-12-10T13:45:00.000Z ||| <div class:summary> the 2018 winter Olympic Games was held between 9 and 25 February 2018 in South Korea</div>
With the arguments:
metadata_list = ["url", "timestamp", "html"]
special_token_for_metadata_generation = True
special_token_for_metadata_generation_sep = " ||| "
metadata_probability = 0 # <- change here
Generated sample:
the 2018 winter Olympic Games was held between 9 and 25 February 2018 in South Korea
With the arguments:
metadata_list = ["url", "timestamp", "entity", "html"] <- change here
special_token_for_metadata_generation = True
special_token_for_metadata_generation_sep = " ||| "
metadata_probability = 1
Generated sample:
url timestamp html ||| url: https://www.bbc.com/sport/live/olympics/50974152 | timestamp: 2018-12-10T13:45:00.000Z ||| <div class:summary> the 2018 winter Olympic Games was held between 9 and 25 February 2018 in South Korea</div>
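To make the adopted format concrete, here is a toy rendering of how I read the spec: a special-token prefix listing which metadata kinds are present (in metadata_list order), the separator, the global metadata, and the separator again before the text with local metadata inlined. The function and variable names are illustrative, not the actual bsmetadata code:
def build_prefix(example, metadata_list, sep=" ||| "):
    """Toy rendering of the adopted format; not the real add_metadata_and_chunk_examples."""
    present = [m["key"] for m in example["metadata"]]
    # 1) special tokens announcing which metadata types follow, in metadata_list order
    special = " ".join(k for k in metadata_list if k in present)
    # 2) global metadata rendered as "key: value" pairs
    global_part = " | ".join(
        f'{m["key"]}: {m["value"]}' for m in example["metadata"] if m["type"] == "global"
    )
    return special + sep + global_part + sep

example = {
    "text": "the 2018 winter Olympic Games was held between 9 and 25 February 2018 in South Korea",
    "metadata": [
        {"key": "url", "type": "global", "value": "https://www.bbc.com/sport/live/olympics/50974152"},
        {"key": "timestamp", "type": "global", "value": "2018-12-10T13:45:00.000Z"},
        {"key": "html", "type": "local", "value": "div"},
    ],
}
print(build_prefix(example, ["url", "timestamp", "html"]))
# -> "url timestamp html ||| url: https://www.bbc.com/sport/live/olympics/50974152 | timestamp: 2018-12-10T13:45:00.000Z ||| "
# The real code would then append the text with the local html metadata inlined.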
cc @timoschick, @cccntu, @tianjianjiang, @manandey, @shanyas10, and everybody in the Modeling-Metadata WG! 🙂
According to https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/text/c4.py, c4/webtextlike uses:
OPENWEBTEXT_CC_VERSIONS = ( # August 2018 - July 2019
"2019-18", # Original default for single-crawl dataset (April 2019).
"2019-30",
"2019-26",
"2019-22",
"2019-13",
"2019-09",
"2019-04",
"2018-51",
"2018-47",
"2018-43",
"2018-39",
"2018-34")
However, OpenWebText URLs are almost all older than the above CC indices, except for 2018-34, 2018-39, and 2018-43.
Since C4 downloads a much larger set of CC WET texts and then filters those texts with different sets of URLs, we probably won't gain much throughput benefit from the C4 configurations.
Another intriguing situation is that the AllenNLP people had tried to replicate c4/webtextlike but stopped, cf. https://huggingface.co/datasets/allenai/c4/blame/f888b0f407c37dd4a0e52d0c3bf56b8a7088f58b/README.md. I wonder what happened...
Turns out OpenWebText URLs have some duplicates, and many URLs carry useless fragments that can cause duplicates. For example, RS_2013-01.bz2.deduped.txt has
http://1:05 EST and I haven't gotten the update yet on lumia 900
http://www.eurekalert.org/pub_releases/2013-01/foas-pcr010213.php
as well as www variations, trailing slashes, and fragments.
I think that the get_dataloader method is in charge of loading a .jsonl file with the template below.
Toy .jsonl data file:
[
{
"document_id": 10,
"text": "this is the input",
"metadata": [
{
"key": "url",
"type": "global",
"value": "http://1"
},
{
"key": "entity",
"type": "local",
"value": "address",
"start_idx": 20,
"end_idx": 40
}
]
},
{
"document_id": 12,
"text": "this is the second input",
"metadata": [
{
"key": "url",
"type": "global",
"value": "http://2"
},
{
"key": "entity",
"type": "local",
"value": "date",
"start_idx": 60,
"end_idx": 90
}
]
}
]
to be completed
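A sketch of a get_dataloader that reads such a toy file; it accepts either a JSON array (as shown above) or JSON-lines, since the exact format is still to be decided, and it skips tokenization/collation, which real code would do (all names here are placeholders):
import json
from torch.utils.data import DataLoader, Dataset


class ToyMetadataDataset(Dataset):
    """Reads the toy data file above; the JSON-array-vs-jsonl handling is a guess at the format."""

    def __init__(self, path):
        with open(path, encoding="utf-8") as f:
            raw = f.read().lstrip()
        if raw.startswith("["):
            self.examples = json.loads(raw)
        else:
            self.examples = [json.loads(line) for line in raw.splitlines() if line.strip()]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]


def get_dataloader(path, batch_size=2):
    # collate_fn=list keeps the raw dicts together; real code would tokenize, add metadata, and pad here.
    return DataLoader(ToyMetadataDataset(path), batch_size=batch_size, collate_fn=list)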
As in the Megatron-DeepSpeed repository, we might need an --exit-duration-in-mins argument to save a checkpoint just before the end of the time limit.
When finished, share with Christopher and Shanya.