aly202012 / cleaning-data-for-effective-data-science

This project forked from packtpublishing/cleaning-data-for-effective-data-science

Cleaning Data for Effective Data Science, published by Packt

License: MIT License

JavaScript 0.09% Python 0.80% R 0.01% Jupyter Notebook 99.11%

Cleaning Data for Effective Data Science

This is the code repository for Cleaning Data for Effective Data Science, published by Packt. It contains all the supporting project files necessary to work through the book from start to finish.

  • Paperback: 498 pages
  • ISBN-13: 9781801071291
  • Date Of Publication: 30 March 2021

About the Book

It is something of a truism in data science, data analysis, and machine learning that most of the effort needed to achieve your actual purpose lies in cleaning your data. Written in David Mertz's signature friendly and humorous style, this book details the essential steps performed in every production data science or data analysis pipeline and prepares you for data visualization and modeling.

The book dives into the practical application of tools and techniques needed for data ingestion, anomaly detection, value imputation, and feature engineering. It also offers long-form exercises at the end of each chapter to practice the skills acquired.

You will begin with data ingestion from formats such as JSON, CSV, SQL RDBMSes, HDF5, NoSQL databases, image files, and binary serialized data structures. The book also provides numerous example data sets and data files, which are available for download and independent exploration.
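As a rough illustration of what ingestion from two of these formats can look like, here is a minimal sketch using only the Python standard library. It is not taken from the book's notebooks; the inline data, the `station` and `temp_c` field names, and the helper functions are all hypothetical.

```python
import csv
import io
import json

# Hypothetical inline data standing in for files on disk.
CSV_TEXT = "station,temp_c\nA,21.5\nB,\nC,19.0\n"
JSON_TEXT = '[{"station": "D", "temp_c": 22.1}, {"station": "E", "temp_c": null}]'


def to_float(value):
    """Convert a raw field to float, mapping empty or null values to None."""
    if value is None or value == "":
        return None
    return float(value)


def read_csv_records(text):
    """Parse CSV text into a list of dicts with a numeric temp_c field."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"station": row["station"], "temp_c": to_float(row["temp_c"])}
        for row in reader
    ]


def read_json_records(text):
    """Parse a JSON array of objects into the same record shape."""
    return [
        {"station": obj["station"], "temp_c": to_float(obj.get("temp_c"))}
        for obj in json.loads(text)
    ]


# Both sources end up in one uniform list of records.
records = read_csv_records(CSV_TEXT) + read_json_records(JSON_TEXT)
```

Normalizing both sources to the same record shape, with missing values represented explicitly as `None`, keeps later cleaning steps independent of the original file format.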

Moving on from formats, you will impute missing values, detect unreliable data and statistical anomalies, and generate the synthetic features needed for successful data analysis and visualization.
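The three steps named above can be sketched in a few lines of standard-library Python. This is an illustrative example under stated assumptions, not the book's own code: the readings are made up, median imputation and the 1.5 × IQR rule are just one common choice for each step, and the "deviation" feature is a hypothetical stand-in for a synthetic feature.

```python
import statistics

# Hypothetical readings; None marks a missing value.
readings = [10.2, 9.8, None, 10.5, 55.0, 10.1, None, 9.9]

observed = [x for x in readings if x is not None]
median = statistics.median(observed)

# Typical-value imputation: replace each missing reading with the median.
imputed = [median if x is None else x for x in readings]

# Flag outliers with the interquartile range (IQR) rule, using 1.5 * IQR fences.
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = [not (low <= x <= high) for x in imputed]

# A simple synthetic feature: deviation of each reading from the median.
deviation = [x - median for x in imputed]
```

The median is used rather than the mean because a single extreme reading would otherwise pull the imputed value toward the anomaly.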

By the end of this book, you will have acquired a firm understanding of the data cleaning process necessary to perform real-world data science and machine learning tasks.

Instructions and Navigation

All of the code for each chapter is in Jupyter notebooks.

Table of Contents

  1. Preface

    1. Doing the Other 80% of the Work
    2. Types of Grime
    3. Nomenclature
    4. Typography
    5. Taxonomy
    6. Included Code
    7. Running the Book
    8. Using this Book
    9. Data Hygiene
    10. Exercises
  2. Data Ingestion – Tabular Formats

    1. Tidying Up
    2. CSV
    3. Spreadsheets Considered Harmful
    4. SQL RDBMS
    5. Other Formats
    6. Data Frames
    7. Exercises
    8. Denouement
  3. Data Ingestion – Hierarchical Formats

    1. JSON
    2. XML
    3. Configuration Files
    4. NoSQL Databases
    5. Denouement
  4. Data Ingestion – Repurposing Data Sources

    1. Web Scraping
    2. Portable Document Format
    3. Image Formats
    4. Binary Serialized Data Structures
    5. Custom Text Formats
    6. Exercises
    7. Denouement
  5. Anomaly Detection

    1. Missing Data
    2. Miscoded Data
    3. Fixed Bounds
    4. Outliers
    5. Multivariate Outliers
    6. Exercises
    7. Denouement
  6. Data Quality

    1. Missing Data
    2. Biasing Trends
    3. Benford's Law
    4. Class Imbalance
    5. Normalization and Scaling
    6. Cyclicity and Autocorrelation
    7. Bespoke Validation
    8. Exercises
    9. Denouement
  7. Value Imputation

    1. Typical-Value Imputation
    2. Trend Imputation
    3. Sampling
    4. Exercises
    5. Denouement
  8. Feature Engineering

    1. Date/Time Fields
    2. String Fields
    3. String Vectors
    4. Decompositions
    5. Quantization and Binarization
    6. One-Hot Encoding
    7. Polynomial Features
    8. Exercises
    9. Denouement
  9. Closure

    1. What You Know
    2. What You Don't Know (Yet)
  10. Glossary

Contributors

adii1823, davidmertz, packtutkarshr, sabypackt
