news-topic-reviews-dataset

Machine Learning Dataset for Topic and News Text Classification

How to download the dataset

The datasets are hosted by Facebook in a Google Drive folder. Use the download.sh script to fetch them:

git clone https://github.com/loretoparisi/news-topic-reviews-dataset.git
cd news-topic-reviews-dataset
./download.sh

How to extract train and test files

You can then untar each dataset to get the train and test files:

tar xvzf yelp_review_full_csv.tar.gz
x yelp_review_full_csv/
x yelp_review_full_csv/readme.txt
x yelp_review_full_csv/train.csv
x yelp_review_full_csv/test.csv
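
If several archives were downloaded, extracting them one by one gets tedious. The following is a minimal sketch, assuming all archives sit in one directory and end in .tar.gz; the function name extract_all is hypothetical and not part of this repository:

```shell
# Extract every *.tar.gz archive found in a directory (default: current one).
# extract_all is an illustrative helper, not a script shipped with the repo.
extract_all() {
  local dir="${1:-.}"
  local archive
  for archive in "$dir"/*.tar.gz; do
    # Skip the unexpanded pattern when no archive matches.
    [ -e "$archive" ] || continue
    echo "Extracting $archive"
    tar -xzf "$archive" -C "$dir"
  done
}
```

Calling `extract_all .` from the folder holding the downloads should leave one sub-folder per dataset, such as yelp_review_full_csv/.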

How to verify the train and test files

Each dataset comes with a classes file, a train file, and a test file:

cd dbpedia_csv
ls -l
total 382688
-rw-------  1 loretoparisi  staff        146 28 Mar  2015 classes.txt
-rw-r--r--  1 loretoparisi  staff       1758 30 Mar  2015 readme.txt
-rw-------  1 loretoparisi  staff   21775285 28 Mar  2015 test.csv
-rw-------  1 loretoparisi  staff  174148970 28 Mar  2015 train.csv

The train and test files contain the training set and the test set, respectively:

head -n1 train.csv 
1,"E. D. Abbott Ltd"," Abbott of Farnham E D Abbott Limited was a British coachbuilding business based in Farnham Surrey trading under that name from 1929. A major part of their output was under sub-contract to motor vehicle manufacturers. Their business closed in 1972."

head -n 1 test.csv 
1,"TY KU"," TY KU /taɪkuː/ is an American alcoholic beverage company that specializes in sake and other spirits. The privately-held company was founded in 2004 and is headquartered in New York City New York. While based in New York TY KU's beverages are made in Japan through a joint venture with two sake breweries. Since 2011 TY KU's growth has extended its products into all 50 states."
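
Beyond looking at the first row, it can help to confirm that every row starts with a numeric label. The sketch below assumes one record per line and a label in the first field, either bare (as in dbpedia_csv) or wrapped in double quotes (as the sogou_news_csv labels appear to be); check_labels is a hypothetical helper name:

```shell
# Report the number of rows and how many of them do not start with an
# integer label (optionally wrapped in double quotes) followed by a comma.
check_labels() {
  local file="$1"
  local total bad
  total=$(wc -l < "$file")
  # grep -c exits non-zero when the count is 0, hence the "|| true".
  bad=$(grep -cvE '^"?[0-9]+"?,' "$file" || true)
  echo "rows=$((total)) bad=$bad"
}
```

For example, `check_labels train.csv` prints the row count and the count of suspicious rows; a non-zero bad value suggests a truncated download or a record that spans several lines.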

How to verify the classes

The classes file contains the class names for the dataset:

cd dbpedia_csv
cat classes.txt 
Company
EducationalInstitution
Artist
Athlete
OfficeHolder
MeanOfTransportation
Building
NaturalPlace
Village
Animal
Plant
Album
Film
WrittenWork
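
The numeric labels in train.csv and test.csv can be mapped back to these names. The sketch below assumes that label N corresponds to line N of classes.txt, which matches the sample rows above (label 1 is a company) but is not stated explicitly here; class_name is a hypothetical helper:

```shell
# Print the class name stored on line LABEL of a classes file.
# Assumes labels are 1-based line numbers into that file.
class_name() {
  local label="$1"
  local classes_file="$2"
  sed -n "${label}p" "$classes_file"
}
```

With the file listed above, `class_name 1 classes.txt` prints Company.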

To count the occurrences of each class, use the count_classes.sh script, which takes the following parameters:

./count_classes.sh COLUMN FILE "SEPARATOR"

Note that the separator must be enclosed in double quotes, like "\t" or ",". For example:

./count_classes.sh 1 dbpedia_csv/train.csv ","
1 2,40000
2 3,40000
3 4,40000
4 5,40000
5 6,40000
6 7,40000
7 8,40000
8 9,40000
9 10,40000
10 11,40000
11 12,40000
12 13,40000
13 14,40000
14 1,40000
./count_classes.sh 1 sogou_news_csv/train.csv ","
1 "5",90000
2 "2",90000
3 "4",90000
4 "1",90000
5 "3",90000

In the output, the first column is a running row index, the second is the label, and the third is the number of occurrences of that label. For example, the label "2" occurs 90000 times in sogou_news_csv/train.csv, which has 5 classes, while dbpedia_csv/train.csv has 14 classes.
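
The same kind of count can be approximated with standard tools. This is a rough sketch, not the actual contents of count_classes.sh; it assumes the separator is a single literal character and that the chosen column comes before any quoted text containing the separator (column 1 here). The function name count_labels is hypothetical:

```shell
# Count how often each value occurs in one column of a delimited file.
# Usage: count_labels COLUMN FILE "SEPARATOR"
# Prints: INDEX LABEL<SEPARATOR>COUNT, one line per distinct label.
count_labels() {
  local column="$1" file="$2" sep="$3"
  cut -d "$sep" -f "$column" "$file" \
    | sort \
    | uniq -c \
    | awk -v sep="$sep" '{ print NR " " $2 sep $1 }'
}
```

Rows come out sorted by label text, so their order may differ from what the script prints.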

How to use the datasets for training and testing

The train and test files must be normalized before use. Use the normalize.sh script to pre-process them before training:

cd dbpedia_csv
./normalize.sh train.csv dbpedia.train
./normalize.sh test.csv dbpedia.test

To shuffle the dataset you can use the shuffle.sh script:

cd dbpedia_csv
./shuffle.sh train.csv dbpedia.train
./shuffle.sh test.csv dbpedia.test
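
For reference, a line-level shuffle can also be done directly with shuf. This sketch assumes GNU coreutils is installed (on macOS it may need to be installed separately) and that each record fits on one line; it is not necessarily how shuffle.sh works, and shuffle_lines is a hypothetical name:

```shell
# Write the lines of INPUT to OUTPUT in random order.
# Relies on GNU shuf; INPUT and OUTPUT must be different files.
shuffle_lines() {
  local input="$1" output="$2"
  shuf "$input" > "$output"
}
```

For example, `shuffle_lines train.csv train.shuffled.csv` keeps every row but changes the order.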

If you prefer to split the input dataset with a different ratio, you can use the split.sh script:

cd dbpedia_csv
./split.sh train.csv 80

This puts 80% of the rows in the training set and 20% in the test set.
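
The arithmetic behind such a split is simple: with N rows and a ratio of P, the first N * P / 100 rows go to training and the rest to testing. The sketch below illustrates this under stated assumptions; it is not the actual split.sh, and the function name and the .train/.test output names are hypothetical:

```shell
# Split FILE so that the first PERCENT% of its lines go to FILE.train
# and the remaining lines go to FILE.test (hypothetical output names).
split_by_ratio() {
  local file="$1" percent="$2"
  local total train_lines
  total=$(wc -l < "$file")
  # Integer division: the training part is rounded down.
  train_lines=$(( total * percent / 100 ))
  head -n "$train_lines" "$file" > "$file.train"
  tail -n "+$(( train_lines + 1 ))" "$file" > "$file.test"
}
```

Because this version cuts the file in order, it makes sense to shuffle the rows first if the file happens to be ordered by label.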
