bigscience-workshop / bigscience
Central place for the engineering/scaling WG: documentation, SLURM scripts and logs, compute environment and data.
License: Other
Can you please provide the files for the bias evaluation on the CrowS-Pairs dataset? The results are given in section 4.9 of the paper, but I do not see the files in the evaluation folder here. Thank you.
The config file lists the sample count of the dataset as 220M and a global batch size of 2048, which equates to ~107K steps per epoch. The main README says the total number of training steps is 95K, which means epoch 1 is never finished. However, the training chronicles suggest more than one epoch of training.
What is the number of epochs for the final training, and what am I missing?
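Not an answer, but here is the back-of-the-envelope arithmetic behind the question, using only the numbers quoted above (220M samples, GBS 2048, 95K total steps); it does not account for the batch-size ramp-up at the start of training:

```python
# Rough arithmetic from the numbers quoted in the question above.
samples_in_dataset = 220_000_000   # sample count listed in the config
global_batch_size = 2_048          # GBS from the config
total_train_steps = 95_000         # total steps from the main README

steps_per_epoch = samples_in_dataset / global_batch_size   # ~107,422 steps
epochs_at_95k_steps = total_train_steps / steps_per_epoch  # ~0.88 epochs

print(f"steps per epoch: {steps_per_epoch:,.0f}")
print(f"epochs covered by {total_train_steps:,} steps: {epochs_at_95k_steps:.2f}")
```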
I was super excited to hear about this project! I was wondering if the model is available anywhere?
In the chronicles of tr1-13B-base it says at the end: "All checkpoints converted to HF format and uploaded to HUB.", which I thought meant that it is available on Huggingface, but I can't seem to find it.
Is it available and I'm just not able to find it, or did I misunderstand and it's not available?
The 1.3B-Pile@300B model is quite strong:
https://docs.google.com/spreadsheets/d/1CI8Q9RCblLRzUOPJ6ViqBmo284-8ojluQ-CmaEuhuv0/edit#gid=1295801165
LAMBADA 0.6088, PIQA 0.7160, HellaSwag 0.5209 --> these are all better than GPT-Neo 1.3B.
Could you share the model? Thank you.
I am reading the content in https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/README.md#global-batch-size, but I don't quite understand some of the numbers. Can someone help explain them?
For example: "So it'll take several days of very inefficient run. We know we get 113 TFLOPs at iteration 512, and since PP=12 and MBS=2, only at 384 (12*2*16) it'll be the first time all pipeline stages will be filled and that's when the performance should be much better, probably around 90 TFLOPs."
I can't understand why GBS needs to reach 384 (12*2*16) before all pipeline stages are filled.
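For what it's worth, the 384 in the quote can be reproduced from the parallelism layout. The sketch below assumes that the 16 in 12*2*16 is the data-parallel size (DP), alongside the PP=12 and MBS=2 given in the quote; the pipeline is only fully occupied once each data-parallel replica has at least PP micro-batches per step.

```python
# Sketch: smallest GBS at which every pipeline stage has work, assuming
# PP=12 and MBS=2 from the quote and DP=16 (my reading of the 12*2*16).
PP, MBS, DP = 12, 2, 16

def micro_batches_per_replica(gbs: int) -> int:
    # each data-parallel replica runs GBS / (MBS * DP) micro-batches per step
    return gbs // (MBS * DP)

def bubble_fraction(m: int, p: int = PP) -> float:
    # fraction of a step spent in the pipeline bubble: (p - 1) / (m + p - 1)
    return (p - 1) / (m + p - 1)

min_gbs_full_pipeline = PP * MBS * DP
print(min_gbs_full_pipeline)  # 384

for gbs in (192, 384, 2048):
    m = micro_batches_per_replica(gbs)
    print(gbs, m, f"{bubble_fraction(m):.0%}")
# 192  ->  6 micro-batches, ~65% bubble (pipeline never full)
# 384  -> 12 micro-batches, ~48% bubble (first GBS where m == PP)
# 2048 -> 64 micro-batches, ~15% bubble (much better utilization)
```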
What kind of machine is required just to run inference on the 176B model? https://huggingface.co/bigscience/bloom
Why is the ZeRO stage set to 0 when DeepSpeed is enabled in the BLOOM training script? And can the BLOOM model be trained with an aligned loss curve when DeepSpeed is disabled? Thanks very much.
DEEPSPEED_ARGS=" \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${ZERO_STAGE} \
--deepspeed-activation-checkpointing \
"
I noticed you evaluated the OPT-175B model; how did you convert it to a Megatron-DeepSpeed checkpoint? I cannot find a 175B Hugging Face Transformers checkpoint. Also, I cannot successfully convert the OPT-66B checkpoint. @thomasw21 Thanks for any reply!
What are the minimum requirements regarding RAM and GPU memory for performing inference only with the BLOOM model?
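Not an official answer, but the dominant term is easy to estimate: the weights of the 176B-parameter model alone take parameter count × bytes per parameter, before any activations, KV cache, or framework overhead. A rough sketch:

```python
# Back-of-the-envelope memory needed just to hold the BLOOM-176B weights.
# Activations, KV cache and framework overhead come on top of this.
n_params = 176e9

for dtype, bytes_per_param in [("fp32", 4), ("bf16/fp16", 2), ("int8", 1)]:
    gib = n_params * bytes_per_param / 2**30
    print(f"{dtype:>9}: ~{gib:,.0f} GiB of weights")
# fp32: ~656 GiB, bf16/fp16: ~328 GiB, int8: ~164 GiB
```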
I am reading the chronicles_prequel, and the last table in the "Trying with ZERO_STAGE=0/1" chapter indicates that the higher TFLOPs is achieved with ZERO_STAGE=1.
ZERO_STAGE=1 reduces the memory cost, but how come it also increases performance, with all other parameters being the same?
| Nodes | Size | ZeRO stage | DP | TP | PP | MBS | GBS | Mem/GPU | Sec/it | TFLOPs | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 48 | 181B | 1 | 4 | 8 | 12 | 2 | 2048 | 37GB | 120.29 | 134.02 | 02-21 |
| 48 | 181B | 0 | 4 | 8 | 12 | 2 | 2048 | 72GB | 137.34 | 113.02 | 02-21 |
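A rough sketch of where the memory difference in the table could come from, assuming mixed-precision Adam (about 12 bytes of fp32 optimizer state per parameter) and the layout from the 02-21 rows (TP=8, PP=12, DP=4): ZeRO stage 1 shards only the optimizer states across the DP group, so that term shrinks by roughly 1/DP. This ignores parameters, gradients, activations and buffers, so it will not match the Mem column exactly.

```python
# Rough per-GPU optimizer-state footprint with and without ZeRO stage 1,
# assuming mixed-precision Adam (~12 bytes of fp32 state per parameter).
# Parameters, gradients, activations and buffers are ignored, so these
# numbers will not match the Mem column above exactly.
total_params = 181e9
TP, PP, DP = 8, 12, 4            # layout from the 02-21 rows in the table

params_per_gpu = total_params / (TP * PP)      # ~1.9B parameters per GPU
opt_state_zs0 = params_per_gpu * 12 / 2**30    # full optimizer copy on every rank
opt_state_zs1 = opt_state_zs0 / DP             # sharded across the DP group (ZeRO-1)

print(f"ZS=0: ~{opt_state_zs0:.0f} GiB of optimizer state per GPU")  # ~21 GiB
print(f"ZS=1: ~{opt_state_zs1:.0f} GiB of optimizer state per GPU")  # ~5 GiB
```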
Hi @TevenLeScao,
I think there are some confusing and broken links in the mC4 data preprocessing section. Can you take a look?
Both of the links are broken here,
The original link should be,
In addition to that, the multinomial data processing code to create the different language splits are in this pull request, bigscience-workshop/Megatron-DeepSpeed#9
Here are a few things:
For reference purposes, if you want to keep the code, I'm happy to open a pull request here. If not, I'll close the pull request in the bigscience-workshop/Megatron-DeepSpeed repo.
Let me know what you think.
Hello,
The final model config seems to be pointing to the wrong tokenizer:
@thomasw21 notified me that this one was used for testing purposes only, since there is already an existing dataset tokenized with this tokenizer.
This issue tracks the fact that at a later stage this should be changed to:
--tokenizer-name-or-path bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v3-dedup-lines-articles \
Hey,
pinging @stas00
I'm a researcher at Tel-Aviv University, and we are thinking about implementing QOS similar to what you have on the Jean Zay cluster.
It would be really helpful to see the slurm.conf you are using for your QOS setting.
Thanks!
Ohad
Back up these folders later today to STORE
/gpfswork/rech/six/commun/bigscience-training/merged-meg-ds_v2
Hello, the evaluation script for bloom-7b1 can be found in the repo at evaluation/results/tr11/scripts/run_trevalharness_7b1.slurm, but I cannot find the training script for bloom-7b1. Can you share the bloom-7b1 training script?
Thank you very much.
CATALOGUE_JSON_PATH=$BIGSCIENCE_REPO/data/catalogue/training_dataset_ratios_merged_nigercongo_v3.json
How can I get the datasets for the 1B3 version? I cannot find a script in https://github.com/bigscience-workshop/bigscience/tree/master/data. Could you give me some suggestions?
How can I get the train-splits.txt and valid-splits.txt files referenced at line 39 of train/tr11-176B-ml/tr11-176B-ml.slurm? Thanks.
TRAIN_DATA_PATH=$MEGATRON_DEEPSPEED_REPO/data/train-splits.txt
VALID_DATA_PATH=$MEGATRON_DEEPSPEED_REPO/data/valid-splits.txt