banglaLM

BanglaLM: Bangla Corpus for Language Model Research (size: 40 GB)

This dataset consists of three parts:

  • Raw data
  • Preprocessed V1
  • Preprocessed V2

Link of the dataset

Kaggle: BanglaLM: Bangla Corpus For Language Model Research

Details of the dataset

The dataset consists of text entries, each a string of variable length. The total volume of the data is 14 gigabytes. We collected it from various websites, including newspapers, social networks, blog sites, and Wikipedia. The newspaper websites include Prothom Alo, BD news, Jugantor, and Jaijaidin, among others. We collected the raw data with a Python script and applied the necessary preprocessing while saving the data locally. We then applied further preprocessing steps, which are described in the preprocessing section of this article. We have since started building models on this data, and the preliminary results are satisfactory, which supports the quality of the dataset.

There are 19,132,010 observations in the dataset. We are releasing three versions: (i) Raw data, (ii) Preprocessed V1, and (iii) Preprocessed V2. The raw data can be preprocessed to suit the needs of a particular project; Preprocessed V1 is intended for LSTM-based machine learning models, and Preprocessed V2 is better suited to statistical models. The dataset can also be manually labeled for supervised learning.

Fig. 1 below shows a screen copy of the dataset viewed as a pandas data frame, with an index column and a text column. Each index identifies one entry, and the right-hand column holds that entry's text. Fig. 2 and Fig. 3 show the word counts of the raw data and Preprocessed V2, respectively.

[Figure: screen copy of the data]

[Figure: raw data summary]

Here is the summary of Preprocessed Data V2:

[Figure: summary of preprocessed data v1]
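Word-count summaries like the ones referenced above can be approximated with a few lines of standard-library Python. The sketch below is only an illustration, not the authors' script: the function name `word_count_summary` is hypothetical, and it assumes that words are separated by whitespace and that entries arrive as plain strings (for example, the text column of the data frame).

```python
from collections import Counter
from typing import Dict, Iterable


def word_count_summary(texts: Iterable[str]) -> Dict[str, float]:
    """Summarise whitespace-delimited word counts over an iterable of entries."""
    vocabulary = Counter()
    entries = 0
    total_words = 0
    longest = 0
    for text in texts:
        words = text.split()  # assumption: words are separated by whitespace
        vocabulary.update(words)
        entries += 1
        total_words += len(words)
        longest = max(longest, len(words))
    return {
        "entries": entries,
        "total_words": total_words,
        "unique_words": len(vocabulary),
        "mean_words_per_entry": total_words / entries if entries else 0.0,
        "max_words_per_entry": longest,
    }


if __name__ == "__main__":
    # Two made-up entries, used only to show the call.
    sample = ["আমি বাংলায় গান গাই", "আমি ভাত খাই"]
    print(word_count_summary(sample))
```

Because the function accepts any iterable, the entries can be streamed one at a time rather than loaded into memory at once, which matters for a corpus of this size. The vocabulary counter still grows with the number of distinct words.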


The workflow of the data collection procedure is shown below in Fig. 5.

[Fig. 5: flowchart of the data collection workflow]

Get the data

You can use direct links to download the dataset.

Name             Size      Link (Compressed ZIP)
Raw data         13.27 GB  Download
Preprocessed V1  13.22 GB  Download
Preprocessed V2  12.89 GB  Download
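Since each download is a compressed ZIP of roughly 13 GB, it can be convenient to read the text straight from the archive instead of extracting it first. The following is a minimal sketch under stated assumptions: the helper name `iter_zip_lines` and the archive name in the usage note are hypothetical, and the code assumes the archive members are UTF-8 text with one entry per line. If the archive actually holds CSV files, the lines would need to be parsed with a CSV reader instead.

```python
import io
import zipfile
from typing import Iterator


def iter_zip_lines(zip_path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield non-empty text lines from every file in a ZIP without extracting it."""
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries, they carry no text
            with archive.open(info) as raw:
                # Decode the compressed byte stream lazily, line by line.
                for line in io.TextIOWrapper(raw, encoding=encoding):
                    line = line.rstrip("\r\n")
                    if line:
                        yield line
```

A call such as `for line in iter_zip_lines("preprocessed_v2.zip"): ...` (the file name is a placeholder for whatever the download is saved as) then processes one entry at a time with a small, constant memory footprint.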

Usage

A bert-base-bangla model (a Transformer-based masked language model) has been developed using this dataset.

This dataset has also been used to train the pretrained Bangla FastText Model & Toolkit.

To install the latest release, run:

!pip install BanglaFastText

For further information and an introduction, visit the GitHub repo: Bangla FastText Model & Toolkit

License

Contents of this repository are restricted to non-commercial research purposes only, under the Creative Commons Attribution 4.0 International License. Copyright of the dataset contents belongs to the original copyright holders.

Cite this dataset

@inproceedings{kowsher-etal-2021-banglalm,
    title = "BanglaLM: Bangla Corpus for Language Model Research",
    author = "Kowsher, Md. and
      Uddin, Md. Jashim and
      Tahabilder, Anik and
      Ruhul Amin, Md and
      Shahriar, Md. Fahim and
      Sobuj, Md. Shohanur Islam",
    booktitle = "International Conference on Inventive Research in Computing Applications (ICIRCA)",
    month = "September",
    year = "2021",
    address = "Online",
    publisher = "IEEE",
    url = "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3882903"
}
