Assignment for CZ4045: Natural Language Processing
CZ4045 Assignment: Online Forum Data Processing

0. Download links

- Library used: NLTK

***********************************************************
$ pip2 install --user --upgrade nltk
$ python2 -m nltk.downloader all
***********************************************************

- Dataset

The following link contains two folders, raw_data and tagged_data. Please place these two folders under source_code/, as explained below.

***********************************************************
https://github.com/nguyenhuyanhh/cz4045/releases/download/release/source_code_data.zip
***********************************************************

- Installation guide

Set up Python and NLTK if you haven't done so. Then extract the zip download or clone the repository:

***********************************************************
$ git clone https://github.com/nguyenhuyanhh/cz4045.git
***********************************************************

Then download the dataset and place it in the appropriate location as mentioned above.

1. Project Structure and Documentation

The file structure of this project is as follows:

***********************************************************
source_code/
    raw_data/
        query.sql                 # SQL query
        QueryResults.csv          # raw data (1754 posts)
        TokenTagRaw.csv           # raw annotation data (100 posts)
        IrregularTokenSent.csv    # irregular tokens sentences (10)
    tagged_data/
        [100 tagged files]
        Annotation Notes.txt      # some annotation notes
    dataset.py
    stem_and_pos.py
    tokenizer.py
    test.py
    main.py                       # main calling script
report/
    [report materials]
***********************************************************

Calling the main script without arguments (by running $ python2 source_code/main.py) prints this command-line usage:

***********************************************************
$ python2 source_code/main.py
Invalid arguments! Exiting...
usage: main.py [report|stempos|test|eval|tokenize|
               irregular|commonX]
    report      report dataset stats
    stempos     stemming and POS tagging on dataset
    test        test the tokenizer
    eval        evaluate the tokenizer on annotated dataset
    tokenize    tokenize the dataset, output irregular tokens
    irregular   POS tagging on sentences with irregular tokens
    commonX     get the most common X libraries from the dataset
***********************************************************

2. Sample Project Runs

Reporting of dataset statistics:

***********************************************************
$ python2 source_code/main.py report
Number of questions: 500
Number of answers: 1254
Average number of answers per questions: 2.508
Questions with 1 answer: 259
Questions with 2 answers: 107
Questions with 3 or more answers: 134
***********************************************************

Stemming and POS tagging on the dataset (there are four separate print outputs: 10 random sentences, top 20 words before stemming, top 20 words after stemming, the original words):

***********************************************************
$ python2 source_code/main.py stempos
[[('You', 'PRP'), ('are', 'VBP'), ('being', 'VBG'), ('tricked', 'VBN')...]]
[('', 20481), ('I', 947),...]]
[('', 20481), ('I', 947),...]]
[[''], ['I'], ...]]
***********************************************************

Testing of the tokenizer:

***********************************************************
$ python2 source_code/main.py test
................................................
..........................
------------------------------------------------
Ran 74 tests in 0.010s
OK
***********************************************************

Evaluating the tokenizer:

***********************************************************
$ python2 source_code/main.py eval
...
Id: 38810765, precision: 1.000, recall: 1.000, f1: 1.000
Id: 38834478, precision: 1.000, recall: 1.000, f1: 1.000
Id: 39432272, precision: 0.651, recall: 0.719, f1: 0.683
Id: 40488966, precision: 1.000, recall: 1.000, f1: 1.000
Id: 45003750, precision: 1.000, recall: 1.000, f1: 1.000
Overall: precision: 0.972, recall: 0.976, f1: 0.974
***********************************************************

Outputting irregular tokens from the dataset:

***********************************************************
$ python2 source_code/main.py tokenize
Using Unix dictionary...
[..., ('Django', 64), ('app', 61)...]
***********************************************************

POS tagging on sentences with irregular tokens:

***********************************************************
$ python2 source_code/main.py irregular
[('I', 'PRP'), ("'m", 'VBP'), ('using', 'VBG'), ('Google', 'NNP'),...
***********************************************************

Getting the most common libraries from the dataset:

***********************************************************
$ python2 source_code/main.py common5
[('numpy', 51), ('re', 32), ('sys', 29), ('os', 27), ('matplotlib', 23)]
$ python2 source_code/main.py common10
[('numpy', 51), ('re', 32), ('sys', 29), ('os', 27), ('matplotlib', 23), ('selenium', 22), ('random', 22), ('collections', 21), ('time', 18), ('pandas', 17)]
***********************************************************
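
A note on reading the eval output: the per-post lines report token-level precision, recall and F1 against the annotated posts. The sketch below shows one plausible way to compute such scores. It assumes a multiset overlap between predicted and annotated tokens, which may differ from what main.py actually does. The function name token_prf and the sample tokens are hypothetical and are not taken from the repository.

```python
from collections import Counter


def token_prf(predicted, gold):
    """Return (precision, recall, f1) for two token lists.

    Assumption: a predicted token counts as correct if it also occurs in
    the gold list, with multiplicity (multiset intersection). Token
    positions are ignored.
    """
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    precision = overlap / float(len(predicted)) if predicted else 0.0
    recall = overlap / float(len(gold)) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical example: the tokenizer splits "numpy.array" into three
    # pieces, while the annotation keeps it as a single token.
    predicted = ["I", "'m", "using", "numpy", ".", "array"]
    gold = ["I", "'m", "using", "numpy.array"]
    p, r, f = token_prf(predicted, gold)
    print("precision: %.3f, recall: %.3f, f1: %.3f" % (p, r, f))
    # prints: precision: 0.500, recall: 0.750, f1: 0.600
```

Under this assumption, a post whose tokens all match the annotation scores 1.000 on all three measures, as most posts in the sample run do, while over-splitting a token lowers precision more than recall.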