Quick overview:
Business-oriented Data Scientist with experience across the whole data funnel, from preprocessing to model creation. I have worked with programming languages such as Python and SQL for 4+ years, as well as business-oriented tools such as Tableau, PowerBI and Looker, which have enabled me to excel in data-driven projects that deliver actionable insights for companies. I hold a Mathematics & Statistics degree from the University of Warwick and have completed Data Mining & Statistics programmes at Stanford, a Digital MBA at ISDI and a Machine Learning Bootcamp at Ironhack.
Data Scientist work experience: Hipoo, Seedtag (see below)
Individual Projects: Else (see below)
This is a portfolio of the projects I have worked on as a Data Scientist. Click on the blue links to go to the repository of each project.
In this project I was given a dataset of semifinals and finals from 2002-2009, and the objective was to predict the leaderboard order of the 2010 final.
Because this is a supervised regression problem, I decided to predict the number of points rather than the actual position, since regression models are less accurate at predicting discrete values such as ranks. Once I have the predicted points, I can sort them to obtain the leaderboard. Aside from betting data, gender and home/away country were the features that affected the final result the most.
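The "predict points, then sort" idea can be sketched roughly as follows. This is a minimal illustration, not the project's actual model: the toy numbers, the three feature columns, and the helper names `fit_linear`, `predict_linear` and `leaderboard` are all assumptions made up for the example, and plain least squares stands in for whatever regressor was really used.

```python
import numpy as np

# Hypothetical training data: one row per country in a past final.
# Assumed columns: betting score, gender flag, home-country flag.
X_train = np.array([
    [0.9, 1, 0],
    [0.7, 0, 1],
    [0.4, 1, 0],
    [0.2, 0, 0],
    [0.6, 1, 1],
    [0.1, 0, 0],
])
points_train = np.array([240.0, 180.0, 90.0, 40.0, 170.0, 20.0])


def fit_linear(X, y):
    """Fit ordinary least squares with an intercept term."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Predict points for new rows using the fitted coefficients."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ coef


def leaderboard(countries, predicted_points):
    """Order countries by predicted points, highest first."""
    order = np.argsort(-predicted_points, kind="stable")
    return [countries[i] for i in order]


coef = fit_linear(X_train, points_train)

# Hypothetical finalists for the year being predicted.
countries_2010 = ["A", "B", "C"]
X_2010 = np.array([[0.8, 1, 0], [0.3, 0, 0], [0.5, 0, 1]])

predicted = predict_linear(coef, X_2010)
ranking = leaderboard(countries_2010, predicted)
print(ranking)
```

The regression only has to get the relative size of the predicted points right; the position on the leaderboard falls out of the sort.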
2010 Final Winner (Model): Germany
2010 Final Winner (Actual): Germany
Here are the main results of our mid-bootcamp project, which consists of implementing a regression model.
Let us briefly explain what it is about.
Scenario
We are working as analysts for a real estate company. Our company wants to build a machine learning model to predict the selling prices of houses based on a variety of features on which the value of the house is evaluated.
Objective
Our job is to build a model that will predict the price of a house based on the features provided in the dataset. Senior management also wants to explore the characteristics of the houses using business intelligence tools. One of the questions they want answered is which factors are responsible for higher property values ($650K and above).
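One simple way to start on the "$650K and above" question is to split the houses at that threshold and compare feature averages between the two groups. The sketch below assumes a pandas DataFrame; the sample rows and the column names `sqft_living`, `bedrooms` and `grade` are hypothetical and only there to make the example runnable.

```python
import pandas as pd

# Hypothetical sample; column names are assumptions, not the real schema.
houses = pd.DataFrame({
    "price": [300_000, 720_000, 650_000, 410_000, 980_000, 520_000],
    "sqft_living": [1200, 2900, 2500, 1600, 3800, 1900],
    "bedrooms": [2, 4, 4, 3, 5, 3],
    "grade": [6, 9, 8, 7, 11, 7],
})

HIGH_VALUE_THRESHOLD = 650_000  # "$650K and above" from the objective

# Label each house as high value or not.
is_high = houses["price"] >= HIGH_VALUE_THRESHOLD
houses["segment"] = is_high.map({True: "high_value", False: "other"})

# Average of each feature per segment, to see where the groups differ.
feature_cols = ["sqft_living", "bedrooms", "grade"]
profile = houses.groupby("segment")[feature_cols].mean()
print(profile)
```

A comparison like this only shows association, so it is a starting point for the Tableau exploration rather than a statement about what causes higher prices.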
Expected Outcomes
Since this is a regression problem, you can use linear regression to build the model. You are also encouraged to try other models in your project, such as a KNN regressor or decision trees for regression.
1. Explore the data
To explore the data, you can use the techniques discussed in class, such as the describe method, checking for null values, and building visualizations with Matplotlib and Seaborn.
The data has many categorical and numerical variables. Explore the nature of these variables before you start the data cleaning process and then the data pre-processing (scaling numerical variables and encoding categorical variables).
2. Build a model
Use different models, compare their accuracy, and find the model that best fits your data. You can use the accuracy measures discussed in class. When comparing different models, make sure you use the same measure as the benchmark for all of them.
3. Visualize
You will use Tableau to visually explore the data further.
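Steps 1 and 2 above could look roughly like the sketch below, assuming pandas and scikit-learn. The data is synthetic and the column names (`sqft_living`, `bedrooms`, `condition`) are placeholders, not the real dataset. The helper `build_pipeline` is a name introduced here for illustration. All three models share the same preprocessing and are scored with the same metric (R² on one held-out split) so the comparison is like for like.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumed columns).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "sqft_living": rng.uniform(500, 4000, n),
    "bedrooms": rng.integers(1, 6, n),
    "condition": rng.choice(["poor", "average", "good"], n),
})
df["price"] = (150 * df["sqft_living"] + 20_000 * df["bedrooms"]
               + rng.normal(0, 30_000, n))

# Step 1: quick exploration.
print(df.describe())
print(df.isnull().sum())

numeric_cols = ["sqft_living", "bedrooms"]
categorical_cols = ["condition"]


def build_pipeline(model):
    """Scale numeric columns, one-hot encode categoricals, then fit model."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])
    return Pipeline([("preprocess", preprocess), ("model", model)])


X_train, X_test, y_train, y_test = train_test_split(
    df[numeric_cols + categorical_cols], df["price"],
    test_size=0.25, random_state=0,
)

# Step 2: fit each candidate and score it with the same metric.
models = {
    "linear_regression": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=0),
}
scores = {}
for name, model in models.items():
    pipe = build_pipeline(model)
    pipe.fit(X_train, y_train)
    scores[name] = r2_score(y_test, pipe.predict(X_test))

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: R^2 = {score:.3f}")
```

Putting the scaler and encoder inside the pipeline means they are fitted on the training split only, which avoids leaking information from the test set into the comparison.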
For this project, we had the challenge of working with a dataset from the FIFA 19 football game. We used a dataset from this project brief.
We decided to create a fictional data analytics consulting firm, the Data Dribblers, who specialise in the football (or soccer) industry.
The problem that we investigated was that clubs are not getting a return on their investment when buying players. Clubs use a variety of features to choose players, but lack a data-based approach for choosing the players who are the best value for money.
Our hypothesis was that we could identify under-valued players by creating a ranking model using performance attributes.
Using our database as a snapshot of player performance, we developed a model that predicts market value based on objective performance measures and compared it with their actual market value. In this way we can generate lists of undervalued players that are high performers.
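The comparison between predicted and actual market value can be sketched as below. This is an illustrative outline only: the six players, the attribute names (`pace`, `shooting`, `passing`), the values, and the choice of a plain linear regression are all assumptions, not the model we actually shipped. Fitting and predicting on the same rows is a simplification that matches the "snapshot" framing.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical players; names, attributes and values are illustrative only.
players = pd.DataFrame({
    "name": ["P1", "P2", "P3", "P4", "P5", "P6"],
    "pace": [90, 70, 85, 60, 78, 88],
    "shooting": [85, 65, 80, 55, 75, 70],
    "passing": [80, 72, 78, 60, 70, 82],
    "market_value": [60.0, 20.0, 30.0, 8.0, 25.0, 28.0],  # assumed: millions
})

performance_cols = ["pace", "shooting", "passing"]

# Predict market value from performance attributes only.
model = LinearRegression()
model.fit(players[performance_cols], players["market_value"])
players["predicted_value"] = model.predict(players[performance_cols])

# Positive gap: the model thinks the player is worth more than the price.
players["value_gap"] = players["predicted_value"] - players["market_value"]

undervalued = (
    players[players["value_gap"] > 0]
    .sort_values("value_gap", ascending=False)
    [["name", "market_value", "predicted_value", "value_gap"]]
)
print(undervalued)
```

Players at the top of this list perform like more expensive players than their current price suggests, which is the kind of shortlist the Data Dribblers would hand to a club.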
This is my final project for the Data Analysis Bootcamp at Ironhack, in which I use statistics about each player in the Top 500 to predict who will win a head-to-head match.
Canva Presentation: https://www.canva.com/design/DAFpqtA7zsg/3M_0Q-65LLObk-iVmXvC-w/edit
Tableau Public: https://public.tableau.com/app/profile/ricardo.bravo1853/viz/TennisFinalProjectFinal/BreakPointsSavedFirstServes?publish=yes
Model folder: the model itself, together with the dataset obtained by web scraping the official ATP website
SQL folder: the SQL queries I used to extract useful information
Tableau: the Tableau file where I did basic EDA and found the first insights on the data
Web Scraping: the web scraping code for all the data I initially used, plus the APIs I used. The final dataset is at the very end
Streamlit: the code needed to run the Streamlit app I showed in the presentation. The code also contains the model.
Hope you are ready to make some money
Random Forest: https://github.com/ricardobravo98/lab-random-forests-Ricardo/tree/master/files_for_lab
Handling Data Imbalance: https://github.com/ricardobravo98/lab-handling-data-imbalance-classification-Ricardo
Inferential Statistics: https://github.com/ricardobravo98/lab-inferential-statistics-Ricardo
Cross Validation: https://github.com/ricardobravo98/lab-cross-validation-Ricardo
Unsupervised Learning: https://github.com/ricardobravo98/lab-unsupervised-learning-intro-Ricardo
T-test P values: https://github.com/ricardobravo98/lab-t-tests-p-values-Ricardo
Web Scraping: https://github.com/ricardobravo98/lab-web-scraping-single-page-Ricardo