Data Science portfolio

A collection of personal data science and computational biology projects.

About me

I am passionate about solving business problems using Data Science & Machine Learning. I systematically & creatively use my skillset to add tangible value to the team, the business, and the end-user. I am constantly learning, and always looking to improve.

Contact Information

William Guesdon, Aberdeen UK
email
Personal website
LinkedIn
Kaggle


Data Science projects

The goal of this Kaggle competition was to develop a model capable of detecting and locating abnormalities on chest X-rays. The model created by our team ranked in the top 7% of the competition. My role in the team was to run the GPU-powered virtual machine used to improve our model.

This project aims to compare vaccine coverage between countries and to examine the correlation between coverage and vaccine hesitancy.

The goal of this project is to use statistical learning to identify the combination of features most likely to be associated with stroke. For this analysis, I first performed exploratory data analysis and feature engineering. As is often the case in health-related datasets, the proportion of patients with stroke is low, resulting in a severely imbalanced dataset. To compensate for the imbalance, I used the ROSE package to artificially oversample the stroke observations.
I then used the FAMD package to apply a dimension reduction technique adapted to mixed datasets of continuous and categorical variables. This visualization allowed me to identify the features most likely associated with the risk of developing a stroke. Using these selected variables, I built a logistic regression model to evaluate the contribution of each feature to the risk of developing a stroke. The logistic regression model identified Age, Average Glucose Level, Smoking, Hypertension, and Heart Disease as the most likely risk factors for developing a stroke. Age was the principal risk factor in this study.
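The class-balancing step can be illustrated with a minimal Python sketch. It performs plain random oversampling with replacement, which is a simpler technique than the one provided by the R ROSE package used in the project. The function name `random_oversample` and the toy patient records are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, label_key, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the missing rows with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 2 stroke cases among 10 patients.
patients = [{"age": 40 + i, "stroke": 0} for i in range(8)]
patients += [{"age": 75, "stroke": 1}, {"age": 81, "stroke": 1}]

balanced = random_oversample(patients, "stroke")
counts = Counter(row["stroke"] for row in balanced)
print(counts)
```

Oversampling is usually applied to the training split only, so that duplicated rows do not leak into the evaluation data.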

Heart disease is the leading cause of death in the United States (1). To address this significant health issue, the Centers for Disease Control and Prevention (CDC) has a division dedicated to heart disease and stroke prevention (2). The CDC also recently started to use machine learning for prevention and diagnosis, which should be useful in identifying the population at risk (3). Assisting diagnosis is an exciting and promising application of machine learning (4, 5), but the use of black-box models is a potential issue (6). In this analysis, I used a decision tree, an open (interpretable) model, to identify subjects with heart disease. The model achieved 80% accuracy, highlighting the potential of machine learning in assisting and accelerating the diagnosis of heart disease.
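One reason a decision tree is considered an open model is that each node is a readable threshold rule. The sketch below shows, under simplified assumptions, how a single split could be chosen by minimizing the weighted Gini impurity. It is not the project's code: the helper names and the toy feature values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best rule 'value <= threshold'."""
    best = (None, float("inf"))
    candidates = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: one numeric feature and a heart-disease label.
feature = [110, 120, 125, 150, 160, 170]
heart_disease = [1, 1, 1, 0, 0, 0]

threshold, impurity = best_threshold(feature, heart_disease)
print(f"rule: feature <= {threshold}, weighted Gini = {impurity}")
```

A full tree repeats this search recursively on each side of the split, which is why the final model can be read as a list of if/else rules.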

For this project, I explored a dataset of Airbnb properties in NYC. I used Python and the Seaborn library to visualize the data. The examination of the dataset highlights the large variability of the price distribution and the expected impact of the neighborhood and room type on price. To model the property price based on the independent variables, I used the scikit-learn library to implement multiple linear regression and random forest regressors. Given the limited number of independent variables in the dataset, the model accuracy was limited, especially for the higher-priced properties. A more elaborate model could be built using neural networks and natural language processing of the property reviews to improve the model performance.
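The observation that accuracy drops for higher-priced properties can be checked by computing the error separately per price band. The following is a minimal sketch of that idea, not the project's evaluation code. The function `mae_by_band`, the band boundaries, and the prices are hypothetical.

```python
def mae_by_band(actual, predicted, bands):
    """Mean absolute error per price band.

    Each band is (label, low, high) and covers low <= actual price < high.
    """
    result = {}
    for label, low, high in bands:
        errors = [abs(a - p) for a, p in zip(actual, predicted) if low <= a < high]
        result[label] = sum(errors) / len(errors) if errors else None
    return result


# Hypothetical nightly prices and model predictions.
actual = [60, 80, 120, 150, 400, 900]
predicted = [70, 75, 110, 160, 250, 300]
bands = [
    ("under 100", 0, 100),
    ("100 to 199", 100, 200),
    ("200 and over", 200, float("inf")),
]

errors = mae_by_band(actual, predicted, bands)
print(errors)
```

A much larger error in the top band than in the lower ones is the pattern described above.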

For this project, as for the NYC Airbnb analysis, I used Python and the scikit-learn library to predict house prices based on the independent variables. Because there were more variables in this dataset and because several factors were linearly correlated with the price, a multiple linear regression model performed well.
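Whether a feature is linearly related to the price can be screened with the Pearson correlation coefficient before fitting a linear model. The sketch below computes it with the standard library only. The feature names and values are hypothetical and not taken from the project's dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: two candidate features and the sale price.
price = [150, 240, 300, 360, 450]
features = {
    "living_area": [50, 80, 100, 120, 150],  # exactly proportional to price
    "house_number": [7, 2, 9, 2, 7],         # unrelated to price
}

correlations = {name: pearson(values, price) for name, values in features.items()}
# Rank features by the strength of their linear relationship with price.
ranked = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)
print(ranked, correlations)
```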

Users of the A14 road can report incidents using an application. The goal of this challenge, proposed by the organizers of the Project: Hack5 hackathon, was to perform sentiment analysis to obtain new insights from the users' comments and improve the user experience of the application. The sentiment analysis was performed using the TextBlob library in Python. Based on the sentiment analysis, an incident prediction tool was built using a Random Forest algorithm. The model can predict the type of incident with an accuracy of 83% and could be used to add a questionnaire auto-completion functionality to the application.
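To illustrate what a polarity score represents, here is a toy lexicon-based scorer written with the standard library. It is a stand-in for illustration only and is not TextBlob, which the project used and which relies on a far larger lexicon and additional rules. The mini-lexicon, function names, and comments below are all hypothetical.

```python
import re

# Hypothetical mini-lexicon; real sentiment lexicons score thousands of words.
LEXICON = {
    "good": 1.0,
    "great": 1.0,
    "helpful": 0.5,
    "slow": -0.5,
    "bad": -1.0,
    "dangerous": -1.0,
}


def polarity(text):
    """Average score of the lexicon words found in text; 0.0 if none match."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [LEXICON[word] for word in words if word in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def sentiment_label(text):
    """Map the polarity score to a coarse label."""
    score = polarity(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical user comments.
comments = [
    "Great app, very helpful",
    "Slow traffic and a dangerous junction",
    "Lane closed near the junction",
]
labels = [sentiment_label(comment) for comment in comments]
print(list(zip(comments, labels)))
```

Scores like these can then be used as input features, alongside the other report fields, for a classifier such as a random forest.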

66DaysOfData is a challenge to learn data science by committing to work at least 5 minutes every day and to share your progress. The goal of this project is to analyse the messages of the 66DaysOfData Discord server. The first analysis focuses on the Introduction and Progress channels.

This project was proposed by the Data Scientist Syndicate Facebook group.

This project aims to clean up and analyze a dataset of Ph.D. student stipends by university and department over time. I performed the analysis with R and the tidyverse libraries. For the data cleaning, I excluded the variables containing a majority of missing values, combined similar departments, and separated the universities by location. The main findings of the analysis are:

  • The majority of respondents are from the USA; data collection from non-USA universities started in 2013.
  • Student stipends and living wages are higher in the USA.
  • The stipends do not increase with experience.
  • The Ph.D. stipends decreased significantly around the 2008 crisis.
  • The Ph.D. stipends are not equal between departments.
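The first cleaning step, excluding variables that are mostly missing, can be sketched as follows. The project itself used R and the tidyverse; this is a Python illustration of the same idea using plain dictionaries. The function name, the 50% cutoff parameter, and the column names are hypothetical.

```python
def drop_sparse_columns(rows, max_missing=0.5):
    """Drop columns whose share of missing values (None or empty string)
    is greater than max_missing."""
    columns = {key for row in rows for key in row}
    keep = []
    for col in sorted(columns):
        missing = sum(1 for row in rows if row.get(col) in (None, ""))
        if missing / len(rows) <= max_missing:
            keep.append(col)
    return [{col: row.get(col) for col in keep} for row in rows]


# Hypothetical survey records: "fees" is missing in two of three rows.
records = [
    {"university": "A", "stipend": 25000, "fees": None},
    {"university": "B", "stipend": 30000, "fees": ""},
    {"university": "C", "stipend": None, "fees": 1200},
]

cleaned = drop_sparse_columns(records)
print(cleaned)
```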

This project was proposed by the Data Scientist Syndicate Facebook group.

Kasey Hemington runs BrainPost with a fellow Ph.D. friend, Leigh Christopher, as a way to keep in touch with her scientific roots while working as a data scientist.

The goal of this project is to answer the following questions:

  • What content (or types of content) is most popular, what patterns do we see in popular content, and is different content popular among different subgroups (e.g. by source/medium)?

  • Where are people visiting from (source-wise)?

The Food Standards Agency (FSA) is an independent government department working across England, Wales and Northern Ireland to protect public health and consumers’ wider interest in food. The FSA is responsible for making sure food is safe and what it says it is.
On Saturday, November 23, 2019, the FSA and Pivigo organized a hackathon to analyse food survey questionnaires.

This simple project, proposed by DataCamp, aims to visualize the inequalities in life expectancy among countries using R. I performed the data cleaning and wrangling with the dplyr package and the visualization with the ggplot2 package.


Computational Biology projects

Systemic lupus erythematosus (SLE) is an autoimmune disease in which the B lymphocytes produce pathogenic auto-antibodies targeting healthy tissues. The exact causes of the disease are unknown but involve a combination of genetic and environmental factors. In this project, I used the AntibodyMap database to extract the heavy chains of healthy subjects and SLE patients. I was particularly interested in comparing the IgM heavy chain repertoire between healthy controls and SLE patients. I used the Immcantation pipeline developed by Steven Kleinstein’s team and my own custom scripts to compare the B cell repertoire of SLE patients to that of healthy controls.

This project documents the steps used to set up the Ubuntu 18.04 server used for B cell receptor clonal analysis.
