Coder Social home page Coder Social logo

tweeting-dogs's Introduction

tweeting-dogs

The repository summarises the data wrangling steps using datasets derived online and ofline.In this case I gather data from the We Rate Dogs Twitter archive. The WeRate Dogs is a renowned dog rating account that allows users to rate images of dogs with a humorous comment about the dog. The rating is a fraction the denominator is always a 10 while the numerator can be less or more than 10.

Data Gathering

In this stage, three datasets relating to the WeRateDogs Twitter account were gathered and imported into the Jupyter Notebook. These are;

  1). The  WeRateDogs twitter archive
  2). the Twitter image predictions
  3). tweet_df.  

The second dataset ocntianed three predictions on the type of dog identified from the images collected from the prediction dataset. This dataset was downloaded using the requests library and stored as a TSV file. The last dataset,tweet_df, contained additional information on the Twitter archive.. This data was queried from Twitter using the Tweepy API.

Data Quality and Tidiness Assessment

The three datasets were assessed visually and programmatically in the jupyter notebook to ascertain their quality and tidiness. The following is a list of the data quality issues identified in the three datasets using the two techniques:

Data Quality issues

• Missing variables in the twitter dataset such as in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status timestamp,and in the master dataset.
• The rating_denominator is redundant because it has only one unique value,10.
• Errors in the extraction of ratings_numerator.
• Presence of retweets in the Twitter data.
• Incorrect data types; for example, tweet_id is an integer in all the datasets.
• Tweet_id duplicated in all the datasets.
• Non descriptive column names e.g p1,p1,p3.

Data Tidiness issues

• The tidiness issue observed from the dataset was that the dog stages were divided into three columns when they should have been in one column. 
• Three tables contain similar information.

Data Cleaning

Data Quality

• All rows that are retweets(nulls in the retweeted_status_timestamp column were dropped.
• All columns conisdered redundant and not useful for final analysis were dropped.
• The extraction of rating _numerators  from the twitter_archive was redone suing Regex and then converted to floats.
• Incorrect data types were corrected.

Data Tidiness

• The three datasets were merged based on the tweet_id.
• The columns on the dog classes were summarised into one by extracting the class from the columns.

Data Analysis

Anlysis was conducted to determine:

 • The top rated image/tweet
 • Dominant dog types/breeds in the dataset

tweeting-dogs's People

Contributors

lunaloclet avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.