The goal of this lab is for you to read real data analytics projects and give some feedback on them.
Choose one of the readings below, give your opinion about it, and answer the proposed questions.
Feel free to read all of them and make some comments if you are interested!
Reference: A Brief Exploration of Baseball Statistics
The first reading looks for a relationship between certain baseball statistics and players' salaries.
Read only up to the t-test part!
Pay attention to how the author began his analysis:
- He defined his goal.
- He asked himself some questions.
- He formed some well-defined hypotheses.
- He studied the raw data and made some decisions based on what he saw.
- He checked that his code worked as intended by examining his dataframes.
- He knew exactly which visualizations he would need to try to answer his questions.
Think about the process you've been following during your current project.
- Is it similar to what the author of the article did?
- What do you think you can improve in your next project by reading how the author planned his analysis?
- Is there anything you think the author missed or could improve, or do you have any to-do ideas for him?
Reference: I was looking for a house, so I built a web scraper in Python!
The second reading is about scraping data from a real estate website.
After reading the article...
- Try to access the robots.txt file of Giphy and Facebook as explained in the article. Can you scrape those web pages? Do some research and try to understand those files.
- Did you do some web scraping in your project? Did you have to use some of the tricks the author used regarding time sleep, user-agent, ...? Why? Explain your experience.
- Is there anything you think the author missed or could improve, or do you have any to-do ideas for him?
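To make the robots.txt question more concrete, here is a minimal sketch of how such a file can be checked programmatically with Python's standard `urllib.robotparser` module. The robots.txt content, the user-agent names and the `can_scrape` helper are all made up for illustration; the real Giphy and Facebook files will look different, so treat this as a starting point for your research rather than an answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
# Real files live at https://<site>/robots.txt and will differ.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def can_scrape(robots_text: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


# "my-lab-bot" is a hypothetical user-agent name.
print(can_scrape(SAMPLE_ROBOTS, "my-lab-bot", "https://example.com/listings"))
print(can_scrape(SAMPLE_ROBOTS, "my-lab-bot", "https://example.com/private/page"))
print(can_scrape(SAMPLE_ROBOTS, "BadBot", "https://example.com/listings"))
```

For a live site, the same parser can download the file itself through its `set_url()` and `read()` methods instead of receiving the text directly. Note that robots.txt only states what the site owner asks of crawlers; it does not replace reading the site's terms of use.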
Reference: Lies, Damned Lies and Statistics (about TEDTalks)
The third reference is a presentation about the insights gained from an analysis of TED Talks data.
After watching the video...
- What do you think about the visualisations that the speaker used to present his insights? Which one is your favorite?
- What's your opinion about the speech? Did he get his point across?
- Was the presentation about an important topic? Did it keep you glued to the screen?
- Is there anything you think the speaker missed or could improve, or do you have any to-do ideas for him?
- Markdown file with your comments about the reading you chose.
In the Advanced Regular Expressions lesson, you learned about the various components of regular expressions as well as how to use them both in isolation and together with other components.
In this lab, you will practice putting together your own regular expressions from scratch. Some of the examples are similar to the ones we went over in the lesson, while others have slight modifications included to test your knowledge and ensure that you grasp the concepts covered.
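As a reminder of what combining components looks like, here is a small sketch that builds one pattern from character classes, quantifiers, groups and word boundaries. The task (extracting dates written as YYYY-MM-DD) and the sample text are invented for illustration and are not taken from the lab exercises.

```python
import re

# \b      word boundary, so the match cannot start or end inside a longer token
# \d{4}   character class plus quantifier: exactly four digits
# (...)   groups capture year, month and day separately
DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

# Hypothetical sample text.
text = "Records from 2010-01-30 and 2010-02-02 were kept; 20100-01-01 was not."

# findall returns one tuple of captured groups per match.
dates = DATE.findall(text)
print(dates)
```

The third candidate, `20100-01-01`, is rejected because the year part has five digits, which shows how the quantifier and the word boundary work together.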
Open the main.ipynb file in the your-code directory. There are a bunch of questions to be solved. If you get stuck on one exercise, you can skip to the next one. Read each instruction carefully and provide your answer beneath it.
main.ipynb with your responses to each of the exercises.
Upon completion, add your deliverables to git. Then commit your changes and push your branch to the remote.
- Regular Expression Operations | Python Documentation
- Regular Expression How To | Python Documentation
- Python - Regular Expressions | TutorialsPoint
A commonly repeated claim is that 80% of a data scientist's work is data cleaning. Whether or not that number is accurate, data scientists do spend a lot of time and effort collecting, cleaning and preparing data for analysis, because datasets are usually messy and complex. Being able to refine and restructure a dataset into a usable state is therefore an essential skill before moving on to the analysis stage.
Try to tidy the weather data included in Ironhack's database (db: weather, table: temperatures). This dataset is a subset of a global historical climatology network dataset. The data represents the daily weather records for a weather station (MX17004) in Mexico for five months in 2010. The goal of this additional challenge is to produce the tidiest dataset you can.
Hint: variables are stored in both rows and columns.
To accomplish this challenge, you will need to do some research on tidying data and on the melt and pivot operations. Feel free to reference any resources you consider appropriate.
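As a starting point for that research, the sketch below shows the general melt-then-pivot idea in pandas on a tiny made-up table. The column names (`id`, `year`, `month`, `element`, `d1`, `d2`, `d30`) and all the values are assumptions chosen for illustration; check the actual layout of the temperatures table before adapting anything.

```python
import pandas as pd

NA = float("nan")

# Hypothetical miniature of a messy weather table: one row per
# (month, element) and one column per day. Names and values are made up.
raw = pd.DataFrame({
    "id": ["MX17004"] * 4,
    "year": [2010] * 4,
    "month": [1, 1, 2, 2],
    "element": ["tmax", "tmin", "tmax", "tmin"],
    "d1": [NA, NA, NA, NA],
    "d2": [NA, NA, 27.3, 14.4],
    "d30": [27.8, 14.5, NA, NA],
})

# Step 1 (melt): the day columns are values of a variable, so move them into rows.
long_df = (
    raw.melt(
        id_vars=["id", "year", "month", "element"],
        var_name="day",
        value_name="value",
    )
    .dropna(subset=["value"])  # drop days without a measurement
    .assign(day=lambda d: d["day"].str.lstrip("d").astype(int))
)

# Step 2 (pivot): the element rows are variable names, so spread them into columns.
tidy = long_df.pivot_table(
    index=["id", "year", "month", "day"],
    columns="element",
    values="value",
    aggfunc="first",
).reset_index()
tidy.columns.name = None  # remove the leftover "element" label on the columns

print(tidy)
```

In the result, each row is one day of observations and each measured variable has its own column, which is one common reading of "tidy" for this kind of data. Whether to also merge year, month and day into a single date column is a design choice left to you.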