The cz4045 from nguyenhuyanhh

CZ4045 Assignment: Online Forum Data Processing

0. Download links

- Library used: NLTK

***********************************************************
$ pip2 install --user --upgrade nltk
$ python2 -m nltk.downloader all
***********************************************************

- Dataset

The following link contains two folders, raw_data and tagged_data.
Please place these two folders under source_code/, as explained
below.

***********************************************************
https://github.com/nguyenhuyanhh/cz4045/releases/download/release/source_code_data.zip
***********************************************************

- Installation guide

Set up Python and NLTK if you haven't done so.
Then extract the zip download or clone the repository:

***********************************************************
$ git clone https://github.com/nguyenhuyanhh/cz4045.git
***********************************************************

Then download the dataset and place it in the appropriate location,
as mentioned above.

1. Project Structure and Documentation

The file structure of this project is as follows:

***********************************************************
source_code/
    raw_data/
        query.sql               # SQL query
        QueryResults.csv        # raw data
                                  (1754 posts)
        TokenTagRaw.csv         # raw annotation data
                                  (100 posts)
        IrregularTokenSent.csv  # irregular tokens
                                  sentences (10)
    tagged_data/
        [100 tagged files]
        Annotation Notes.txt    # some annotation notes
    dataset.py
    stem_and_pos.py
    tokenizer.py
    test.py
    main.py                     # main calling script
report/
    [report materials]
***********************************************************

Calling the main script without arguments (by running
$ python2 source_code/main.py) prints this command-line usage:

***********************************************************
$ python2 source_code/main.py
Invalid arguments! Exiting...
usage: main.py [report|stempos|test|eval|tokenize|
                irregular|commonX]
            report          report dataset stats
            stempos         stemming and POS tagging
                            on dataset
            test            test the tokenizer
            eval            evaluate the tokenizer
                            on annotated dataset
            tokenize        tokenize the dataset,
                            output irregular tokens
            irregular       POS tagging on sentences
                            with irregular tokens
            commonX         get the most common X
                            libraries from the dataset
***********************************************************
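
The usage text above implies that commonX carries its count inside the
argument itself (common5, common10). A minimal sketch of how such an
argument could be parsed is shown here. The names SIMPLE_COMMANDS and
parse_command are hypothetical and are not taken from main.py, which
may dispatch its commands differently.

```python
import re

# Hypothetical command table. The names come from the usage text,
# but the real main.py may organise its dispatch differently.
SIMPLE_COMMANDS = {"report", "stempos", "test", "eval",
                   "tokenize", "irregular"}


def parse_command(arg):
    """Return (command, count); count is None except for commonX."""
    if arg in SIMPLE_COMMANDS:
        return arg, None
    # commonX: the literal word "common" followed by a positive integer.
    match = re.match(r"^common(\d+)$", arg)
    if match and int(match.group(1)) > 0:
        return "common", int(match.group(1))
    raise ValueError("Invalid arguments!")
```

With this sketch, parse_command("common10") returns ("common", 10),
while "common" on its own is rejected.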

2. Sample Project Runs

Reporting of dataset statistics:

***********************************************************
$ python2 source_code/main.py report
Number of questions: 500
Number of answers: 1254
Average number of answers per questions: 2.508
Questions with 1 answer: 259
Questions with 2 answers: 107
Questions with 3 or more answers: 134
***********************************************************
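
The figures above are consistent with each other: 259 + 107 + 134 = 500
questions, and 1254 / 500 = 2.508 answers per question. A minimal
sketch of how such statistics could be computed follows. It assumes
each answer records the Id of its parent question; the function name
answer_stats and its inputs are hypothetical and do not describe how
dataset.py actually reads QueryResults.csv.

```python
from collections import Counter


def answer_stats(question_ids, answer_parent_ids):
    """Summarise how many answers each question received."""
    if not question_ids:
        raise ValueError("no questions given")
    # Number of answers attached to each parent question Id.
    per_question = Counter(answer_parent_ids)
    counts = [per_question.get(qid, 0) for qid in question_ids]
    return {
        "questions": len(question_ids),
        "answers": sum(counts),
        "average": float(sum(counts)) / len(question_ids),
        "one": sum(1 for c in counts if c == 1),
        "two": sum(1 for c in counts if c == 2),
        "three_plus": sum(1 for c in counts if c >= 3),
    }
```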

Stemming and POS tagging on the dataset, which prints four separate
outputs (10 random sentences, the top 20 words before stemming, the
top 20 words after stemming, and the original words):

***********************************************************
$ python2 source_code/main.py stempos
[[('You', 'PRP'), ('are', 'VBP'), ('being', 'VBG'),
('tricked', 'VBN')...]]
[('', 20481), ('I', 947),...]]
[('', 20481), ('I', 947),...]]
[[''], ['I'], ...]]
***********************************************************

Testing of the tokenizer:

***********************************************************
$ python2 source_code/main.py test
................................................
..........................
------------------------------------------------
Ran 74 tests in 0.010s

OK
***********************************************************

Evaluating the tokenizer:

***********************************************************
$ python2 source_code/main.py eval
...
Id: 38810765, precision: 1.000, recall: 1.000, f1: 1.000
Id: 38834478, precision: 1.000, recall: 1.000, f1: 1.000
Id: 39432272, precision: 0.651, recall: 0.719, f1: 0.683
Id: 40488966, precision: 1.000, recall: 1.000, f1: 1.000
Id: 45003750, precision: 1.000, recall: 1.000, f1: 1.000
Overall: precision: 0.972, recall: 0.976, f1: 0.974
***********************************************************
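
For reference, precision, recall and F1 for one post can be computed
from the tokenizer output and the annotated tokens as sketched below.
This is an assumption about the scoring, not the project's confirmed
method: it treats both token lists as multisets and counts the
overlap, whereas the real evaluation may align tokens by position.
The function name token_prf is hypothetical.

```python
from collections import Counter


def token_prf(predicted, gold):
    """Precision, recall and F1 of predicted tokens against gold tokens."""
    # Multiset intersection: a token counts as correct at most as many
    # times as it appears in the gold annotation.
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    precision = float(overlap) / len(predicted) if predicted else 0.0
    recall = float(overlap) / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

F1 is the harmonic mean of precision and recall, so a post scores
1.000 on all three only when the two token lists match exactly under
this measure.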

Outputting irregular tokens from the dataset:

***********************************************************
$ python2 source_code/main.py tokenize
Using Unix dictionary...
[..., ('Django', 64), ('app', 61)...]
***********************************************************
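
The idea of an irregular token can be sketched as a token that does
not appear in a reference word list, counted by frequency. The code
below is an illustration under that assumption only. The function
name irregular_tokens is hypothetical, and the word list is passed in
directly instead of being loaded from the Unix dictionary that the
output above mentions.

```python
from collections import Counter


def irregular_tokens(tokens, dictionary_words):
    """Count tokens whose lowercase form is not a dictionary word."""
    known = set(word.lower() for word in dictionary_words)
    counts = Counter(tok for tok in tokens if tok.lower() not in known)
    # Most frequent irregular tokens first, as (token, count) pairs.
    return counts.most_common()
```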

POS tagging on sentences with irregular tokens:

***********************************************************
$ python2 source_code/main.py irregular
[('I', 'PRP'), ("'m", 'VBP'), ('using', 'VBG'),
('Google', 'NNP'),...
***********************************************************

Getting the most common libraries from the dataset:

***********************************************************
$ python2 source_code/main.py common5
[('numpy', 51), ('re', 32), ('sys', 29), ('os', 27),
('matplotlib', 23)]
$ python2 source_code/main.py common10
[('numpy', 51), ('re', 32), ('sys', 29), ('os', 27),
('matplotlib', 23), ('selenium', 22), ('random', 22),
('collections', 21), ('time', 18), ('pandas', 17)]
***********************************************************
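
One way to obtain counts like these is to scan each post for Python
import statements and tally the top-level module names. The sketch
below assumes that approach; the names IMPORT_RE and common_libraries
are hypothetical, and the project may extract or count libraries
differently (for example, per occurrence instead of per post).

```python
import re
from collections import Counter

# Matches "import x", "import x.y as z" and "from x import y" at the
# start of a line. Only the first module of "import a, b" is captured.
IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([A-Za-z_]\w*)",
                       re.MULTILINE)


def common_libraries(posts, top_n):
    """Return the top_n most frequently imported top-level modules."""
    counts = Counter()
    for text in posts:
        # Count each library at most once per post.
        counts.update(set(IMPORT_RE.findall(text)))
    return counts.most_common(top_n)
```

Here common_libraries(posts, 5) plays the role of the common5 command
on a list of post bodies.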