This dataset consists of three parts:
- Raw data
- Preprocessed V1
- Preprocessed V2
We have collected text data consisting of strings of various lengths; the total volume of the data is 14 GB. The data was collected from various websites, including newspapers, social networks, blogs, and Wikipedia. The newspaper sources include Prothom Alo, BD News, Jugantor, Jaijaidin, and others. We collected the raw data using a Python script and performed the necessary preprocessing while saving the data to local storage. Further preprocessing steps are described later in the preprocessing section of this article. In the meantime, we have started building models on this data, and the preliminary results are satisfactory, which supports the quality of the dataset. There are a total of 19,132,010 observations in the dataset. We are releasing three versions of this dataset: (i) Raw data, (ii) Preprocessed V1, and (iii) Preprocessed V2. The raw data can be preprocessed according to the demands of a particular project; Preprocessed V1 is intended for LSTM-based machine learning models, while Preprocessed V2 is better suited to statistical models. The dataset can also be manually labeled for use in supervised learning. Fig. 1 below shows a screenshot of the dataset viewed as a pandas DataFrame: each index identifies a particular entry, and the right-hand column holds the text value. The word counts of the raw data and Preprocessed V2 are shown in Fig. 2 and Fig. 3, respectively.
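As a minimal sketch of how a released file can be inspected (the file name in the commented-out line is an assumption; substitute the actual path of the extracted CSV), the dataset can be loaded with pandas and per-entry word counts computed, as summarized in Fig. 2 and Fig. 3:

```python
import pandas as pd

# Hypothetical file name -- replace with the actual path of the
# downloaded and extracted file (Raw data or Preprocessed V2):
# df = pd.read_csv("banglalm_preprocessed_v2.csv")

# A tiny in-memory stand-in with the same shape as the dataset view
# in Fig. 1: an integer index and a single "text" column.
df = pd.DataFrame({"text": ["আমি ভাত খাই", "বাংলা ভাষা"]})

# Whitespace-delimited word count per entry.
df["word_count"] = df["text"].str.split().str.len()
print(df["word_count"].tolist())  # [3, 2]
```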
Here is the summary of Preprocessed Data V2:
The workflow of the data collection procedure is shown below in Fig. 5.
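The exact cleaning rules applied during collection are not enumerated here, so the following is an illustrative sketch only, assuming typical Bangla-corpus preprocessing (dropping URLs and non-Bangla characters, collapsing whitespace); the actual V1/V2 pipelines may differ:

```python
import re

def clean_bangla(text: str) -> str:
    """Illustrative cleaning only -- the dataset's real pipeline may differ."""
    text = re.sub(r"https?://\S+", " ", text)        # drop URLs
    text = re.sub(r"[^\u0980-\u09FF\s]", " ", text)  # keep Bangla-block chars
    text = re.sub(r"\s+", " ", text).strip()         # collapse whitespace
    return text

print(clean_bangla("আমি ভাত খাই! see http://example.com"))  # আমি ভাত খাই
```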
You can use direct links to download the dataset.
| Name | Size | Link (Compressed ZIP) |
|---|---|---|
| Raw data | 13.27 GB | Download |
| Preprocessed V1 | 13.22 GB | Download |
| Preprocessed V2 | 12.89 GB | Download |
A bert-base-bangla model (a Transformer-based masked language model) has been developed using this dataset.
This dataset has also been used to train the pretrained Bangla FastText Model & Toolkit.
To install the latest release, run:

```shell
pip install BanglaFastText
```
For further information and an introduction, you can visit this GitHub repo: Bangla FastText Model & Toolkit
Contents of this repository are restricted to only non-commercial research purposes under the Creative Commons Attribution 4.0 International License. Copyright of the dataset contents belongs to the original copyright holders.
Cite this dataset
@inproceedings{kowsher-etal-2021-banglalm,
  title     = "BanglaLM: Bangla Corpus for Language Model Research",
  author    = "Kowsher, Md. and Uddin, Md. Jashim and Tahabilder, Anik and Ruhul Amin, Md. and Shahriar, Md. Fahim and Sobuj, Md. Shohanur Islam",
  booktitle = "International Conference on Inventive Research in Computing Applications (ICIRCA)",
  month     = "September",
  year      = "2021",
  address   = "Online",
  publisher = "IEEE",
  url       = "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3882903"
}