Project deployment on Hugging Face Spaces: https://huggingface.co/spaces/mohamedemam/Arabic-meeting-summarization
The translated dataset: https://huggingface.co/datasets/mohamedemam/Arabic-samsum-dialogsum
This project is for Arabic meeting summarization. The data was translated from SAMSum and DialogSum using BLOOMZ-3B and LoRA.
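A minimal sketch of that setup, assuming the published dataset above and the bigscience/bloomz-3b base model; the LoRA hyperparameters are illustrative, not the exact values used for this project:

```python
# Load the translated Arabic SAMSum/DialogSum dataset and attach a LoRA
# adapter to BLOOMZ-3B. Hyperparameters below are assumptions for illustration.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

dataset = load_dataset("mohamedemam/Arabic-samsum-dialogsum")

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-3b")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-3b")

# LoRA freezes the base weights and trains small low-rank matrices injected
# into the attention projections ("query_key_value" in the BLOOM implementation).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```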
Information about BLOOM:
- Documentation: https://huggingface.co/docs/transformers/model_doc/bloom
- Model: https://huggingface.co/bigscience/bloom
- GitHub: https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml#readme
Transformers package documentation on huggingface.co:
- Tokenizer Class: https://huggingface.co/docs/transformers/glossary#attention-mask
- Trainer Class: https://huggingface.co/docs/transformers/v4.21.1/en/main_classes/trainer#transformers.Trainer
- Fine-tuning using Trainer (see the sketch after this list): https://huggingface.co/docs/transformers/training
- Token Classification: https://huggingface.co/docs/transformers/tasks/token_classification
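A minimal fine-tuning sketch with the Trainer class, following the documentation linked above. The prompt format, column names, and training arguments are assumptions, not this project's actual settings; the LoRA-wrapped model from the earlier sketch can be passed in the same way.

```python
# Causal-LM fine-tuning with Trainer on the translated dataset.
# Column names ("dialogue", "summary") and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-3b")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-3b")

dataset = load_dataset("mohamedemam/Arabic-samsum-dialogsum")

def preprocess(batch):
    # Concatenate dialogue and summary into one training sequence.
    texts = [d + "\n" + s for d, s in zip(batch["dialogue"], batch["summary"])]
    return tokenizer(texts, truncation=True, max_length=512)

tokenized = dataset.map(
    preprocess, batched=True, remove_columns=dataset["train"].column_names
)

args = TrainingArguments(
    output_dir="bloomz-3b-arabic-summarization",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```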
Architecture explained:
- The Technology Behind BLOOM Training: https://huggingface.co/blog/bloom-megatron-deepspeed
- Understand BLOOM, the Largest Open-Access AI, and Run It on Your Local Computer: https://towardsdatascience.com/run-bloom-the-largest-open-access-ai-model-on-your-desktop-computer-f48e1e2a9a32
Dataset used for Training explained:
- Corpus Map: https://huggingface.co/spaces/bigscience-catalogue-lm-data/corpus-map
- Building a TB Scale Multilingual Dataset for Language Modeling: https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling
Dataset for Finetuning:
- Conll2003: https://huggingface.co/datasets/conll2003
The Model:
- 3B-parameter decoder-only architecture (GPT-like)
BLOOM uses a Transformer architecture composed of an input embeddings layer, a stack of Transformer blocks, and an output language-modeling layer. Each Transformer block has a self-attention layer and a multi-layer perceptron layer, with input and post-attention layer norms.
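The block structure can be inspected directly from the Transformers implementation of BLOOM; a small sketch (attribute names follow modeling_bloom in Transformers, and loading the 3B checkpoint downloads several GB):

```python
# Inspect the decoder-only architecture: embeddings, Transformer blocks, LM head.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("bigscience/bloomz-3b")
print(config.n_layer, config.hidden_size, config.n_head)  # blocks, width, attention heads

model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-3b")
print(model.transformer.word_embeddings)  # input embeddings layer
block = model.transformer.h[0]            # first Transformer block
print(block.input_layernorm)              # input layer norm
print(block.self_attention)               # self-attention layer
print(block.post_attention_layernorm)     # post-attention layer norm
print(block.mlp)                          # multi-layer perceptron
print(model.lm_head)                      # output language-modeling layer
```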
The Dataset:
- Multilingual: 46 languages (full list: https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling)
- 341.6 billion tokens (1.5 TB of text data)
- Tokenizer vocabulary: 250,680 tokens
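A quick way to check the tokenizer figure above and see the multilingual vocabulary in action (the Arabic sentence is just an example):

```python
# Check the BLOOM tokenizer vocabulary size and tokenize an Arabic sentence.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
print(len(tokenizer))  # vocabulary size: 250,680 per the figures above
print(tokenizer.tokenize("اجتماع الفريق غدًا في العاشرة صباحًا"))  # "team meeting tomorrow at 10 a.m."
```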