

TEAM ORACLE

Capstone project for DSCI 592, Winter 2021, at Drexel University, by Team Oracle

Introduction

Whole Project on Google Drive

Instructions on Mirroring Project and Running Code


Table of Contents

  1. Team Members
  2. Datasets
  3. Acquisition
  4. Pre-Processing
  5. Visualization
  6. Analysis
  7. Final Reports
  8. Responsibilities

TEAM MEMBERS

Jennifer Bochenek

  • Education
    • B.S. in Psychology with concentrations in Psychobiology of Addiction and Clinical Psychology, and a minor in Biology from Purdue University (May 2011)
    • M.S. in Psychology from New Mexico Highlands University; thesis on Sensation Seeking and Sleep Quality: Activity as a prerequisite for high-quality sleep (December 2012)
  • Occupation
    • Research Associate at Educational Testing Service
  • Skills
    • R, Python, Java, SQL, Unix
    • SPSS, Orange, Weka, Tableau
    • Data collection/acquisition, management and cleaning; descriptive and inferential statistics; machine learning; data visualization; paper writing

Yifan Yang

  • Education
    • B.S. in Statistics and minor in Computer Science from Virginia Tech
  • Occupation
    • Student; formerly IT for a point-of-sale (POS) company
  • Skills
    • SQL, Unix, Java, Python, R
    • R-Studio, Python
    • Descriptive, inferential, non-parametric, and other advanced statistics; machine learning; data visualization

Shibo Yao

  • Education
    • B.S. in Software Engineering
  • Occupation
    • Student, formerly assistant industry analyst for a consulting company
  • Skills
    • Python, Java, R, SQL
    • Eclipse, PyCharm, RStudio
    • Data acquisition, pre-processing, analysis, and interpretation

Joe Larson

  • Education
    • Electrical Engineering, MBA, and Master of Expert Systems from Penn State
  • Occupation
    • Manager at a GSE (government-sponsored enterprise)
  • Skills
    • R, Python
    • Google Colaboratory
    • Management

We have divided the research areas/topics among the team: each member is responsible for acquiring, cleaning, and (if needed) merging the data for their topic, and for analyzing it; all of this feeds into the final dataset and report. The role of meeting facilitator rotates weekly.


EXPLANATION OF DATASETS

The data for this project comes from Kaggle, as part of its ongoing effort to provide datasets for NLP analysis. The data belongs to a posted competition called “Natural Language Processing with Disaster Tweets”, whose stated purpose is to “predict which Tweets are about real disasters and which ones are not”. We chose this dataset because the topic was of interest and the data was not pre-cleaned, which makes it a genuine challenge. The dataset is a selection of tweets from Twitter that human annotators labeled according to whether they describe a real disaster. The annotators also tagged each tweet with a keyword indicating the type of disaster it could concern; these keywords are present regardless of whether the tweet describes a real disaster. Notably, neither the keyword nor the tweet text has been preprocessed or cleaned, which makes the data well suited to our purposes.


DATA ACQUISITION

The dataset was downloaded directly from Kaggle. To do so, we formed a team on Kaggle and joined the competition. There are two files, a training dataset and a test dataset; both were downloaded as CSVs and transferred to the team's Google Drive folder.
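For reference, the download can also be scripted with Kaggle's Python API. This is a minimal sketch, assuming API credentials in ~/.kaggle/kaggle.json and the competition slug nlp-getting-started; it is illustrative rather than the exact commands we ran.

```python
import zipfile

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json

# Downloads a single zip archive containing train.csv and test.csv
api.competition_download_files("nlp-getting-started", path="data")

with zipfile.ZipFile("data/nlp-getting-started.zip") as zf:
    zf.extractall("data")
```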


DATA PRE-PROCESSING

As stated previously, extensive pre-processing had to be done. There are five columns in the original dataset: ID, Keyword, Location, Text, and Target. Location and Keyword both had missing data (Location: 2,533 null in training, 1,105 in test; Keyword: 61 null in training, 26 in test), while Text and Target were never null. There are 7,613 tweets in the training set and 3,263 in the test set. Location is the free-text location from the tweeter's account settings, and some entries are not real locations, so we decided not to use the location data.
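The shapes and null counts above can be reproduced with pandas. A minimal sketch, assuming the Kaggle column names id, keyword, location, text, and target:

```python
import pandas as pd

train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

print(train.shape)           # expected (7613, 5); the test set has no target column
print(train.isnull().sum())  # keyword: 61 nulls, location: 2533 nulls
print(test.isnull().sum())   # keyword: 26 nulls, location: 1105 nulls

# location is free text from account settings and often not a real place,
# so we drop it
train = train.drop(columns=["location"])
test = test.drop(columns=["location"])
```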

The biggest challenge was cleaning the text data and creating additional variables for the machine learning algorithms. To do this in an organized manner, we created a data flow table that describes the order in which the steps should be performed and the variables they affect. Table 1 below shows that process: the description of each step, its input and output variables, and the team members responsible for it. Many of the same steps were also applied to the keyword variable, where sensible. The final dataset contains 91 variables. A short code sketch of representative steps follows the table.

Table 1. Text pre-processing data flow.

| # | Description | In Variable | Out Variable(s) | Responsibility | Notes/Progress |
|---|-------------|-------------|-----------------|----------------|----------------|
| 1 | Change text to lowercase | ['text'] | ['text_to_lower'] | Joe | Done |
| 2 | Remove encoding errors (which would otherwise artificially inflate the character count) | ['text_lower'] | ['text_remove_encoding_error'] | Jenni | Done |
| 3 | Counts: total characters, hashtags (#), URLs, words, punctuation, unique (non-repeated) words, average word length | ['text_remove_encoding_error'] | ['text_count_total_char'], ['text_count_hashtags'], ['text_count_urls'], ['text_count_words'], ['text_count_punctuation'], ['text_count_unique_words'], ['text_mean_words_length'] | Jenni/Joe | Done |
| 4 | Separate hashtags into a new column | ['text_remove_encoding_errors'] | ['text_hashtags'] | Jenni | Done |
| 5 | Edit typos, slang, and informal language | ['text_remove_encoding_errors'] | ['text_informal_language'] | Jenni | Done |
| 6 | Remove URLs | ['text_informal_language'] | ['text_url_removed'] | Joe | Done |
| 7 | Redo the counts on the cleaned text | ['text_informal_language'] | ['clean_text_count_total_char'], ['clean_text_count_hashtags'], ['clean_text_count_urls'], ['clean_text_count_words'], ['clean_text_count_punctuation'], ['clean_text_count_unique_words'], ['clean_text_mean_words_length'] | Joe | Done |
| 8 | Determine reading/comprehension/grade level of text | ['text_informal_language'] | — | Y&S | Decided not to do; judged not impactful |
| 9 | Tokenize | ['text_url_removed'] | ['text_token'] | Joe | Done |
| 10 | Sentiment | df_train (full dataframe) | ['text_affect_dict'], ['text_top_affect'], ['text_affect_freq'], ['text_raw_emotion'] | Jenni | Done; uses NRCLex |
| 11 | Higher-level sentiment analysis | df_train (full dataframe) | ['all_negative'], ['all_positive'], ['anger'], ['disgust'], ['fear'], ['sadness'], ['anticipation'], ['joy'], ['surprise'], ['trust'] | Jenni | Done |
| 12 | Remove punctuation | ['text_url_removed'] | ['text_remove_punctuations'] | Joe | Done |
| 13 | Named entity recognition and POS tagging | ['text_token'] | ['text_pos_tag'], ['text_ner_tag'] | Y&S | Done |
| 14 | Remove stopwords (adding "rt", which stands for retweet, to the stopword list) | ['text_token'] | ['text_token_remove_stopwords'] | Yifan & Shibo | Done |
| 15 | Stem and lemmatize words | ['text_token_remove_stopwords'], ['text_pos_tag'] | ['text_stem'], ['text_clean_lemma'] | Y&S | Done |
| 16 | TF and TF-IDF | ['text_lemma'] | ['text_tf'], ['text_tfidf'] | Jenni | Uses scikit-learn's TfidfVectorizer |
| 17 | Word2Vec | ['text_token'] | ['text_vec'] | Jenni | Uses Gensim |
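The sketch below illustrates a few representative steps from Table 1 (lowercasing, URL removal, punctuation removal, tokenization, stopword removal, lemmatization, NRCLex sentiment, TF-IDF, and Word2Vec). It is an illustration under our variable-naming assumptions, not the team's exact code; note that gensim ≥ 4 uses vector_size where older versions use size.

```python
import re
import string

import nltk
from gensim.models import Word2Vec
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nrclex import NRCLex
from sklearn.feature_extraction.text import TfidfVectorizer

for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

URL_RE = re.compile(r"https?://\S+|www\.\S+")
STOPWORDS = set(stopwords.words("english")) | {"rt"}  # step 14: "rt" = retweet
LEMMATIZER = WordNetLemmatizer()

def clean_and_tokenize(text):
    text = text.lower()                                               # step 1
    text = URL_RE.sub("", text)                                       # step 6
    text = text.translate(str.maketrans("", "", string.punctuation))  # step 12
    tokens = word_tokenize(text)                                      # step 9
    tokens = [t for t in tokens if t not in STOPWORDS]                # step 14
    return [LEMMATIZER.lemmatize(t) for t in tokens]                  # step 15

train["text_clean_lemma"] = train["text"].apply(clean_and_tokenize)

# Step 10: per-tweet emotion frequencies from the NRC emotion lexicon
train["text_affect_freq"] = train["text"].apply(
    lambda t: NRCLex(t).affect_frequencies
)

# Step 16: TF-IDF over the lemmatized tweets
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(train["text_clean_lemma"].str.join(" "))

# Step 17: Word2Vec embeddings trained on the token lists
w2v = Word2Vec(sentences=train["text_clean_lemma"].tolist(),
               vector_size=100, window=5, min_count=2, workers=4)
```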

DATA VISUALIZATION

The following figures were produced:

  • Tweet Histograms
  • Keyword Histograms
  • Pair Plot by Target
  • Correlation Matrix
  • Wordcloud for Tweets
  • Wordcloud for Tweets with non-real Disasters
  • Wordcloud for Tweets with real Disasters
  • Wordcloud for Keywords
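Figures like these can be regenerated along the following lines. A minimal sketch, assuming the count columns from Table 1 and the standard matplotlib/seaborn/wordcloud stack; the exact plotting code is not recorded here.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# Histogram of tweet character counts, split by target
sns.histplot(data=train, x="text_count_total_char", hue="target")
plt.title("Tweet character counts by target")
plt.show()

# Correlation matrix over the numeric count features
sns.heatmap(train.select_dtypes("number").corr(), cmap="coolwarm", center=0)
plt.title("Correlations")
plt.show()

# Word cloud over the real-disaster tweets (target == 1)
real_text = " ".join(train.loc[train["target"] == 1, "text"])
cloud = WordCloud(width=800, height=400,
                  background_color="white").generate(real_text)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```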


DATA ANALYSIS

Feature Importance

Naive Bayes

  • Accuracy: 0.803
  • Recall: 0.69
  • F1: 0.75
  • Precision: 0.82
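For reference, metrics like these can be produced along the following lines. This is a sketch rather than our exact code: it assumes a MultinomialNB over the TF-IDF features from the pre-processing sketch and a hypothetical 80/20 hold-out split (the split actually used is not recorded here).

```python
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical 80/20 split over the TF-IDF matrix built earlier
X_train, X_val, y_train, y_val = train_test_split(
    tfidf, train["target"], test_size=0.2, random_state=23
)

nb = MultinomialNB().fit(X_train, y_train)
print(classification_report(y_val, nb.predict(X_val)))
```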

Support Vector Machines

Before Optimization: SVC(random_state=23)

  • Accuracy: 0.629
  • Recall: 0.397
  • F1: 0.479
  • Precision: 0.604

After Optimization: SVC(C=0.5, class_weight=None, gamma=0.001, kernel='linear', random_state=23)

  • Accuracy: 0.804
  • Recall: 0.668
  • F1: 0.746
  • Precision: 0.845
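Since the tuned configuration is spelled out above, evaluating it can be sketched directly; the split and features are the same assumptions as in the Naive Bayes sketch. Note that scikit-learn ignores gamma when kernel='linear'.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)
from sklearn.svm import SVC

# The tuned configuration reported above
svc = SVC(C=0.5, class_weight=None, gamma=0.001,
          kernel="linear", random_state=23)
svc.fit(X_train, y_train)

pred = svc.predict(X_val)
print("Accuracy :", accuracy_score(y_val, pred))
print("Recall   :", recall_score(y_val, pred))
print("F1       :", f1_score(y_val, pred))
print("Precision:", precision_score(y_val, pred))
```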

BERT

  • Accuracy: 0.831
  • Recall: 0.766
  • F1: 0.796
  • Precision: 0.827
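A BERT run along these lines can be set up with the Hugging Face transformers library. This is a hedged sketch of fine-tuning bert-base-uncased for binary classification; the exact model variant, sequence length, and training schedule we used are not recorded here.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# BERT receives the raw tweet text; its own tokenizer handles
# casing, punctuation, and subword splitting
enc = tokenizer(train["text"].tolist(), padding=True, truncation=True,
                max_length=64, return_tensors="pt")
labels = torch.tensor(train["target"].tolist())

# A single illustrative optimization step; actual fine-tuning loops
# over mini-batches for a few epochs
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
batch = {k: v[:8] for k, v in enc.items()}
out = model(**batch, labels=labels[:8])
out.loss.backward()
optimizer.step()
```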

K-Nearest Neighbor

Before Optimization

  • Accuracy: 0.718
  • Recall: 0.718
  • F1: 0.717
  • Precision: 0.716

After Optimization

  • Accuracy: 0.727
  • Recall: 0.727
  • F1: 0.727
  • Precision: 0.728

Gradient Boosting

Before Optimization

  • Accuracy: 0.691
  • Recall: 0.691
  • F1: 0.692
  • Precision: 0.694

After Optimization

  • Accuracy: 0.718
  • Recall: 0.718
  • F1: 0.717
  • Precision: 0.717
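For both KNN and gradient boosting, "optimization" presumably refers to hyperparameter tuning. A hedged sketch using scikit-learn's GridSearchCV; the parameter grids below are hypothetical, since the report does not list the ranges that were actually searched.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical search grid for KNN
knn_search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": range(3, 31, 2), "weights": ["uniform", "distance"]},
    scoring="f1", cv=5,
)
knn_search.fit(X_train, y_train)
print(knn_search.best_params_, knn_search.best_score_)

# Hypothetical search grid for gradient boosting
gb_search = GridSearchCV(
    GradientBoostingClassifier(random_state=23),
    {"n_estimators": [100, 300], "learning_rate": [0.05, 0.1],
     "max_depth": [2, 3]},
    scoring="f1", cv=5,
)
gb_search.fit(X_train, y_train)
print(gb_search.best_params_, gb_search.best_score_)
```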

Model Comparison

| Metric    | Naive Bayes | SVC   | BERT  | KNN   | Gradient Boosting |
|-----------|-------------|-------|-------|-------|-------------------|
| Accuracy  | 0.802       | 0.804 | 0.831 | 0.727 | 0.718             |
| Recall    | 0.691       | 0.668 | 0.766 | 0.727 | 0.718             |
| F1        | 0.754       | 0.746 | 0.796 | 0.727 | 0.717             |
| Precision | 0.823       | 0.845 | 0.827 | 0.728 | 0.717             |

FINAL REPORTS


RESPONSIBILITY SUMMARY

Jenni Bochenek: Launch report, Data acquisition and Preprocessing/Exploratory Data Analysis report

Joe Larson: Outline and steps to accomplish the project, coding format and some unit testing

Yifan Yang: Pitch Presentation

Shibo Yao: Final Presentation

All: data acquisition, preprocessing, applying the ML models, model evaluation, and the final report.

