BanglaLM: Bangla Corpus For Language Model Research

This dataset consists of three parts:

Raw data
Preprocessed V1
Preprocessed V2

Link of the dataset

Kaggle: BanglaLM: Bangla Corpus For Language Model Research

Details of the dataset

The dataset consists of text entries, each a string of variable length. The total volume of the data is 14 gigabytes. We collected the data from various websites, including newspapers, social networks, blog sites, and Wikipedia. The newspaper websites include Prothom Alo, BD news, Jugantor, Jaijaidin, and others.

We collected the raw data using a Python script and applied the necessary preprocessing while saving the data to local storage. We then applied further preprocessing steps, which are described later in the preprocessing section of this article. We have since begun building models on this data; the preliminary findings are satisfactory, which supports the quality of the dataset.

There are a total of 19,132,010 observations in the dataset. We are releasing three versions: (i) Raw data, (ii) Preprocessed V1, and (iii) Preprocessed V2. The raw data can be preprocessed to suit the needs of any particular project, Preprocessed V1 is intended for LSTM-based machine learning models, and Preprocessed V2 is better suited to statistical models. The dataset can also be manually labeled for use in supervised learning.

Fig. 1 below shows a screenshot of the dataset viewed as a pandas data frame. The table has an index column and a text column: each index identifies a particular entry, and the right column holds the text of that entry. Word counts for the raw data and for Preprocessed V2 are shown in Fig. 2 and Fig. 3, respectively.

Here is the summary of Preprocessed Data V2:
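The word counts mentioned above can be approximated with a few lines of Python. This is a minimal sketch, assuming simple whitespace tokenization; the source does not state how its word counts were computed, and the function name `word_counts` and the two sample sentences are illustrative only.

```python
from collections import Counter


def word_counts(entries):
    """Return per-entry word counts and a corpus-wide token frequency table.

    Assumption: tokens are separated by whitespace. The dataset authors may
    have used a different tokenizer.
    """
    per_entry = []
    vocab = Counter()
    for text in entries:
        tokens = text.split()          # whitespace tokenization
        per_entry.append(len(tokens))  # number of words in this entry
        vocab.update(tokens)           # accumulate token frequencies
    return per_entry, vocab


# Illustrative sample entries, not taken from the dataset.
sample = ["আমি বাংলায় গান গাই", "আমি ভাত খাই"]
lengths, vocab = word_counts(sample)
print(lengths)
print(vocab.most_common(1))
```

Because `entries` can be any iterable of strings, the same function can be fed a generator that reads the corpus line by line, which avoids holding all 14 GB in memory.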

The workflow of the data collection procedure is shown below in Fig. 5.

Get the data

You can use direct links to download the dataset.

Name	Size	Link (Compressed ZIP)
`Raw data`	13.27 GB	Download
`Preprocessed V1`	13.22 GB	Download
`Preprocessed V2`	12.89 GB	Download
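Since each archive is roughly 13 GB, it may be preferable to read entries directly from the ZIP file rather than extracting it first. The following is a sketch under stated assumptions: the archive is assumed to contain a UTF-8 text file with one entry per line, and the helper name `iter_zip_lines` is hypothetical. The actual file names and layout inside the archives are not documented here, so check `namelist()` on the downloaded file first.

```python
import io
import zipfile


def iter_zip_lines(zip_path, member=None, encoding="utf-8"):
    """Yield text lines from one member of a ZIP archive without extracting it.

    Assumptions: the member is a plain text file with one entry per line.
    If `member` is None, the first name in the archive is used, which may
    not be the right file for the real archives.
    """
    with zipfile.ZipFile(zip_path) as zf:
        name = member or zf.namelist()[0]
        with zf.open(name) as raw:
            # Wrap the binary stream so lines are decoded lazily.
            for line in io.TextIOWrapper(raw, encoding=encoding):
                yield line.rstrip("\n")
```

For example, `itertools.islice(iter_zip_lines("archive.zip"), 5)` would give the first five entries, where `archive.zip` stands in for the path of whichever archive you downloaded.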

Usage

A bert-base-bangla model (a Transformer-based masked language model) has been developed using this dataset.

This dataset has also been used to train the pretrained model in Bangla FastText Model & Toolkit.

To install the latest release, run:

!pip install BanglaFastText

For further information and introduction you can visit this Github repo: Bangla FastText Model & Toolkit

License

Contents of this repository are restricted to only non-commercial research purposes under the Creative Commons Attribution 4.0 International License. Copyright of the dataset contents belongs to the original copyright holders.

Cite this dataset👍

@inproceedings{kowsher-etal-2021-banglalm,
    title = "BanglaLM: Bangla Corpus for Language Model Research",
    author = "Kowsher, Md. and
      Uddin, Md. Jashim and
      Tahabilder, Anik and
      Ruhul Amin, Md and
      Shahriar, Md. Fahim and
      Sobuj, Md. Shohanur Islam",
    booktitle = "International Conference on Inventive Research in Computing Applications (ICIRCA)",
    month = "September",
    year = "2021",
    address = "Online",
    publisher = "IEEE",
    url = "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3882903"
}
