
karthik-d / data-mining_preprocessing-analysis



Jupyter Notebook 100.00%
data-cleaning data-mining data-mining-python dataset preprocessing-data python weka weka-library


Data-Mining_Preprocessing-Analysis

Cleaning, transformation, and analysis of large datasets as part of coursework for UCS1629: Data Warehousing and Data Mining.

Dataset

This dataset is a random sample of 10,000,000 rows from the NYC yellow taxi cab trips dataset, one of the Google BigQuery public datasets. It was extracted and uploaded for experimenting with and learning regression models for price prediction. The data leaves plenty of room for cleaning, contains outliers, and is large enough for more realistic model training, testing, and validation.

The analyzed subset of the data is publicly accessible through Kaggle.

Data Attributes

| column | type | nullable | description |
| --- | --- | --- | --- |
| vendor_id | text | required | A code indicating the TPEP provider that provided the record. 1 = Creative Mobile Technologies, LLC; 2 = VeriFone Inc |
| pickup_datetime | datetime | nullable | The date and time when the meter was engaged. |
| dropoff_datetime | datetime | nullable | The date and time when the meter was disengaged. |
| passenger_count | integer | nullable | The number of passengers in the vehicle. This is a driver-entered value. |
| trip_distance | numeric | nullable | The elapsed trip distance in miles reported by the taximeter. |
| rate_code | string | nullable | The final rate code in effect at the end of the trip. 1 = Standard rate; 2 = JFK; 3 = Newark; 4 = Nassau or Westchester; 5 = Negotiated fare; 6 = Group ride |
| store_and_fwd_flag | string | nullable | This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y = store and forward trip; N = not a store and forward trip |
| payment_type | string | nullable | A numeric code signifying how the passenger paid for the trip. 1 = Credit card; 2 = Cash; 3 = No charge; 4 = Dispute; 5 = Unknown; 6 = Voided trip |
| fare_amount | numeric | nullable | The time-and-distance fare calculated by the meter. |
| extra | numeric | nullable | Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges. |
| mta_tax | numeric | nullable | $0.50 MTA tax that is automatically triggered based on the metered rate in use. |
| tip_amount | numeric | nullable | Tip amount. This field is automatically populated for credit card tips. Cash tips are not included. |
| tolls_amount | numeric | nullable | Total amount of all tolls paid in the trip. |
| imp_surcharge | numeric | nullable | $0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015. |
| total_amount | numeric | nullable | The total amount charged to passengers. Does not include cash tips. |
| pickup_location_id | string | nullable | TLC Taxi Zone in which the taximeter was engaged. |
| dropoff_location_id | string | nullable | TLC Taxi Zone in which the taximeter was disengaged. |
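
The attribute descriptions above suggest some simple per-record sanity checks. The following is a minimal sketch, not the repository's actual cleaning procedure. It assumes each row arrives as a dictionary of strings (as `csv.DictReader` would produce), that timestamps use the layout `YYYY-MM-DD HH:MM:SS`, and that "positive value" is a reasonable validity rule for counts, distances, and fares. The helper names `parse_time`, `to_number`, and `find_problems`, and the values in `sample`, are hypothetical and only illustrative; the code tables are copied from the descriptions of `rate_code` and `payment_type`.

```python
from datetime import datetime

# Code tables copied from the attribute descriptions.
RATE_CODES = {
    "1": "Standard rate", "2": "JFK", "3": "Newark",
    "4": "Nassau or Westchester", "5": "Negotiated fare", "6": "Group ride",
}
PAYMENT_TYPES = {
    "1": "Credit card", "2": "Cash", "3": "No charge",
    "4": "Dispute", "5": "Unknown", "6": "Voided trip",
}


def parse_time(value):
    # Assumed timestamp layout; adjust to match the actual export.
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")


def to_number(value):
    # Nullable columns may be empty or missing, so return None on failure.
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def find_problems(row):
    """Return a list of reasons why a trip record looks invalid."""
    problems = []

    # The meter must be disengaged after it was engaged.
    try:
        pickup = parse_time(row["pickup_datetime"])
        dropoff = parse_time(row["dropoff_datetime"])
        if dropoff <= pickup:
            problems.append("dropoff not after pickup")
    except (KeyError, TypeError, ValueError):
        problems.append("missing or malformed timestamp")

    # Assumed rule: these measurements should be strictly positive.
    for field in ("passenger_count", "trip_distance", "fare_amount"):
        number = to_number(row.get(field))
        if number is None:
            problems.append(f"{field} missing or non-numeric")
        elif not number > 0:
            problems.append(f"{field} not positive")

    # Coded fields must use one of the documented codes.
    if row.get("rate_code") not in RATE_CODES:
        problems.append("unknown rate_code")
    if row.get("payment_type") not in PAYMENT_TYPES:
        problems.append("unknown payment_type")

    return problems


# Made-up record for illustration only; not taken from the dataset.
sample = {
    "pickup_datetime": "2018-03-01 08:15:00",
    "dropoff_datetime": "2018-03-01 08:40:00",
    "passenger_count": "1", "trip_distance": "3.2",
    "rate_code": "1", "payment_type": "2", "fare_amount": "14.5",
}
print(find_problems(sample))
```

Returning a list of reasons rather than a single boolean makes it easy to count how often each kind of problem occurs before deciding whether to drop or repair the affected rows. In a notebook, the same rules could be expressed as column-wise filters on a data frame.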

Basic Analysis Steps

Data Cleaning Steps

Data Transformation Steps
