Coder Social home page Coder Social logo

data-cleaning's Introduction

DATA-CLEANING

Data is in 2 formats: 1.structured data. 2.Unstructred data.

Gone are the days when we used to have data mostly in row-column format, or we can say Structured data. In present times, the data being collected is more unstructured than structured. We have data in the form of text, images, audio etc and the ratio of Structured to Unstructured data has decreased over the years. FOR UNSTRUCTED DATA: 1.ESCAPING HTML CHARCATERS (import from html.parser) 2.DECODING DATA(#Encode from UTF-8 to ascii encode_tweet =tweet.encode('ascii','ignore') print("encode_tweet = \n{}".format(encode_tweet)) #decode from ascii to UTF-8 decode_tweet=encode_tweet.decode(encoding='UTF-8') print("decode_tweet = \n{}".format(decode_tweet))) 3.APOSTROPHE LOOKUP
4.REMOVAL STOP WORDS(Stop words are the words which occur frequently in the text but add no significant meaning to it) 5.REMOVE PUNCTUATIONS 6.SPLIT ATTACHED WORDS(eg: forthe ---for the) 7.SLANG LOOKUP(eg:asap---as soon as possible) 9.STANDARIZING WORD(eg: drivnig---driving ,misss u ---miss you) 10.REMOVAL OF URLS ADVANCE DATA CLEANING: 1.GRAMMER CHECKING 2.SPELLING CORRECTION NOW THE DATA IS CLEANED,U ARE READY FOR TEXT MINING:

Data cleaning not only refers to removing chunks of unnecessary data, but it’s also often associated with fixing incorrect information within the train-validation-test dataset and reducing duplicates.

*The importance of data cleaning: Data cleaning is a key step before any form of analysis can be made on it. Datasets in pipelines are often collected in small groups and merged before being fed into a model. Merging multiple datasets means that redundancies and duplicates are formed in the data, which then need to be removed.

FOR STRUCTRED DATA :(TABULAR) Here are some key takeaways on the best practices you can employ for data cleaning: (DONE IN PANDAS)

Identify and drop duplicates and redundant data Detect and remove inconsistencies in data by validating with known factors Maintain a strict data quality measure while importing new data. Fix typos and fill in missing regions with efficient and accurate algorithms

DATA EXPLORATION: (USING PANDAS): 1.Reading and writing data.( csv files :df=pd.read_csv() or write ki df=to_csv() for excel use Excel in place of csv) 2.Getting preview of data frame(head or tail) 3.Rename columns of df 4.selecting columns or rows 5.Handling missing values 6.creating new columns 7.aggregate 8.Merging/concate df 9.Applying function to elemnet,column,or df 10.basic stats

data-cleaning's People

Contributors

ruparani777 avatar

Stargazers

 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.