loretoparisi / news-topic-reviews-dataset

Machine Learning Dataset for Topic, News, Reviews Text Classification

License: MIT License

Shell 100.00%
machine-learning dataset topic-modeling news artificial-intelligence

news-topic-reviews-dataset

Machine Learning Dataset for Topic and News Text Classification

How to download the dataset

These datasets are hosted by Facebook in a Google Drive folder. Use the download.sh script to fetch them:

git clone https://github.com/loretoparisi/news-topic-reviews-dataset.git
cd news-topic-reviews-dataset
./download.sh

How to extract train and test files

You can then untar each dataset to get the train and test files:

tar xvzf yelp_review_full_csv.tar.gz
x yelp_review_full_csv/
x yelp_review_full_csv/readme.txt
x yelp_review_full_csv/train.csv
x yelp_review_full_csv/test.csv
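
When several archives have been downloaded, the same tar command can be applied to each of them in a loop. The following is a minimal sketch and not part of the repository: the extract_all function name is hypothetical, and it assumes the archives sit in a single directory and end in .tar.gz.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): extract every
# .tar.gz archive found in a directory (default: the current one).
extract_all() {
  dir="${1:-.}"
  for archive in "$dir"/*.tar.gz; do
    # If nothing matches, the glob stays literal; skip it.
    [ -e "$archive" ] || continue
    tar xzf "$archive" -C "$dir"
  done
}

extract_all .
```

Each archive unpacks into its own directory next to the archive, as in the yelp_review_full_csv example above.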

How to verify the train and test files

Each dataset comes with a classes file, a train and a test file:

cd dbpedia_csv
ls -l
total 382688
-rw-------  1 loretoparisi  staff        146 28 Mar  2015 classes.txt
-rw-r--r--  1 loretoparisi  staff       1758 30 Mar  2015 readme.txt
-rw-------  1 loretoparisi  staff   21775285 28 Mar  2015 test.csv
-rw-------  1 loretoparisi  staff  174148970 28 Mar  2015 train.csv

The train and test files contain the training set and the test set, respectively:

head -n1 train.csv 
1,"E. D. Abbott Ltd"," Abbott of Farnham E D Abbott Limited was a British coachbuilding business based in Farnham Surrey trading under that name from 1929. A major part of their output was under sub-contract to motor vehicle manufacturers. Their business closed in 1972."

head -n 1 test.csv 
1,"TY KU"," TY KU /taɪkuː/ is an American alcoholic beverage company that specializes in sake and other spirits. The privately-held company was founded in 2004 and is headquartered in New York City New York. While based in New York TY KU's beverages are made in Japan through a joint venture with two sake breweries. Since 2011 TY KU's growth has extended its products into all 50 states."

How to verify the classes

The classes file contains the class names for the dataset, one per line:

cd dbpedia_csv
cat classes.txt 
Company
EducationalInstitution
Artist
Athlete
OfficeHolder
MeanOfTransportation
Building
NaturalPlace
Village
Animal
Plant
Album
Film
WrittenWork

To count the occurrences of each class, use the count_classes.sh script, which takes the following parameters:

./count_classes.sh COLUMN FILE "SEPARATOR"

Note that the separator must be passed in double quotes, for example "\t" or ",":

./count_classes.sh 1 dbpedia_csv/train.csv ","
1 2,40000
2 3,40000
3 4,40000
4 5,40000
5 6,40000
6 7,40000
7 8,40000
8 9,40000
9 10,40000
10 11,40000
11 12,40000
12 13,40000
13 14,40000
14 1,40000
./count_classes.sh 1 sogou_news_csv/train.csv ","
1 "5",90000
2 "2",90000
3 "4",90000
4 "1",90000
5 "3",90000

In the output, the first column is a running row index, the second is the label, and the third is the number of occurrences of that label. For example, the label "2" occurs 90000 times in sogou_news_csv/train.csv. That file has 5 classes, while dbpedia_csv/train.csv has 14.
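
For readers who want to see how such a count can be computed, here is a minimal awk sketch. It is not the implementation of count_classes.sh: the count_labels function name and its label,count output format are assumptions made for illustration. It splits each line naively on the separator, which is safe for the label in column 1 but not for quoted text fields that contain the separator themselves.

```shell
#!/bin/sh
# Hypothetical sketch (not the repository's count_classes.sh).
# Usage: count_labels COLUMN FILE SEPARATOR
# Prints one "label,count" line per distinct value, sorted by label.
count_labels() {
  awk -F "$3" -v col="$1" '
    {
      gsub(/"/, "", $col)   # drop the quotes around labels such as "5"
      counts[$col]++        # tally this label
    }
    END { for (label in counts) print label "," counts[label] }
  ' "$2" | sort -t, -k1,1n
}
```

Calling count_labels 1 dbpedia_csv/train.csv "," should print one label,count line per class.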

How to use the datasets for training and testing

The train and test files must be normalized before use. Use the normalize.sh script to pre-process the files before training:

cd dbpedia_csv
./normalize.sh train.csv dbpedia.train
./normalize.sh test.csv dbpedia.test

To shuffle the dataset you can use the shuffle.sh script:

cd dbpedia_csv
./shuffle.sh train.csv dbpedia.train
./shuffle.sh test.csv dbpedia.test
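
A line-level shuffle of this kind can be sketched with standard tools. The version below is an illustration only, not the contents of shuffle.sh: the shuffle_lines function name and the optional seed argument are hypothetical, and it assumes every record occupies exactly one line. It prefixes each line with a random key, sorts on that key, and then strips the key again.

```shell
#!/bin/sh
# Hypothetical sketch (not the repository's shuffle.sh).
# Usage: shuffle_lines INPUT OUTPUT [SEED]
shuffle_lines() {
  input="$1"; output="$2"; seed="${3:-$(date +%s)}"
  # Prefix every line with a random key and a tab, sort numerically
  # on the key, then cut the key away again.
  LC_ALL=C awk -v seed="$seed" '
    BEGIN { srand(seed) }
    { printf "%.10f\t%s\n", rand(), $0 }
  ' "$input" | LC_ALL=C sort -n | cut -f2- > "$output"
}
```

Passing an explicit seed makes a run repeatable with the same awk implementation; without one, the current time is used.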

If you prefer to split the input dataset with a different ratio, you can use the split.sh script:

cd dbpedia_csv
./split.sh train.csv 80

This puts 80% of the rows in the training set and the remaining 20% in the test set.
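
The arithmetic behind a percentage split is simple enough to sketch with head and tail. This is an illustration under assumptions, not the contents of split.sh: the split_by_ratio function name and the .train and .test output suffixes are hypothetical. If the rows are grouped by label, shuffle the file first so that both parts contain every class.

```shell
#!/bin/sh
# Hypothetical sketch (not the repository's split.sh).
# Usage: split_by_ratio INPUT PERCENT
# Writes the first PERCENT% of lines to INPUT.train, the rest to INPUT.test.
split_by_ratio() {
  input="$1"; ratio="$2"
  total=$(( $(wc -l < "$input") ))        # total number of lines
  n_train=$(( total * ratio / 100 ))      # integer division rounds down
  head -n "$n_train" "$input" > "$input.train"
  tail -n +"$(( n_train + 1 ))" "$input" > "$input.test"
}
```

With a ratio of 80, a file of 10 lines yields 8 lines in the .train file and 2 in the .test file.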
