Predicting Mortality of Children under 5 years of age

In this project we set two goals:

  1. Inference: identify factors that are predictive of child mortality

  2. Prediction: build a model that classifies households and assigns each a child mortality risk score

I worked in collaboration with Living Goods, a San Francisco based NGO that has been operating in the field for many years, specifically in Uganda and Kenya. Community Health Providers (CHPs) are identified in each village and given basic training and the tools to visit families in their parishes and provide first-line health assistance. Using a web application developed by Living Goods and deployed on Android phones, each CHP can answer basic questions and gather data as they visit households, but for legal reasons they cannot record deaths.

Our problem was therefore to find a proxy source of information that contained both general household characteristics and mortality data. For that purpose we used data from a Randomized Controlled Trial (RCT), in the form of surveys that targeted the exact same population.

Inference

The first phase of the project was inferential in nature: identify the survey questions that relate to child mortality. The main challenge of the project lay in the data:

  • pretty small data (6,700 data points, one data point being a survey)
  • pretty wide data (361 base features, one feature being one question)
  • extremely imbalanced classes (5.33% minority class, the minority class being a child death)
  • very low signal (almost all questions were categorical, yes/no kind of questions, Likert scales when we were lucky)

We ran each feature through a separate statistical test, chi-squared or ANOVA depending on the nature of the feature (categorical or quantitative), and reported the resulting p-values. We set significance at 0.1, which is pretty common in this kind of low-signal scenario, but corrected for the number of tests we ran using a simple Bonferroni correction.
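A minimal sketch of that screening step, assuming the surveys sit in a pandas DataFrame `X` with a binary outcome `y`; the function name, the categorical-dispatch rule, and the default alpha are illustrative, not the project's actual code:

```python
import numpy as np
import pandas as pd
from scipy import stats

def screen_features(X, y, alpha=0.1):
    """Per-feature screening: chi-squared for categorical features,
    one-way ANOVA for quantitative ones, with a Bonferroni-corrected
    significance level."""
    p_values = {}
    for col in X.columns:
        x = X[col].dropna()
        yy = y.loc[x.index]
        if x.dtype == object or x.nunique() <= 10:    # treat as categorical (assumed rule)
            _, p, _, _ = stats.chi2_contingency(pd.crosstab(x, yy))
        else:                                         # quantitative: ANOVA across outcome groups
            _, p = stats.f_oneway(*(x[yy == cls] for cls in yy.unique()))
        p_values[col] = p
    threshold = alpha / len(p_values)                 # Bonferroni correction
    return {c: p for c, p in p_values.items() if p < threshold}
```

With 361 tests at a 0.1 significance level, the Bonferroni threshold works out to roughly 0.1 / 361 ≈ 0.00028 per feature.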

The resulting set of some 40 features was then passed through:

  • a sanity check: do the results make sense?
  • a reality check: if the results make sense, can Living Goods collect these data in the field?
  • a double check for statistical significance: a deviance analysis, in which we fit as many single-feature logistic regressions as there were significant features surviving the sanity and reality checks, and compared the resulting deviance to that of the same model fit after shuffling the predictor variable (see the sketch after this list). This was aimed at answering the question: beyond statistical significance, do we see real predictive power in the feature?
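A sketch of that deviance check, assuming statsmodels is available. The text above describes a single shuffle; the version below repeats the shuffle to build a small reference distribution, which is an extension of mine, not the original procedure:

```python
import numpy as np
import statsmodels.api as sm

def deviance_check(x, y, n_shuffles=200, seed=0):
    """Fit a single-feature logistic regression and compare its deviance
    with the deviances obtained after shuffling the feature: a feature
    with real predictive power should fit clearly better (lower deviance)
    than its shuffled copies."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    rng = np.random.default_rng(seed)

    def fit_deviance(feature):
        X = sm.add_constant(feature)
        return sm.GLM(y, X, family=sm.families.Binomial()).fit().deviance

    observed = fit_deviance(x)
    shuffled = np.array([fit_deviance(rng.permutation(x))
                         for _ in range(n_shuffles)])
    # Fraction of shuffled fits that do at least as well as the real one
    p_perm = np.mean(shuffled <= observed)
    return observed, p_perm
```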

Model Fitting

The second phase of the project was model fitting. We had to deal with the severe class imbalance and tried the following approaches (a small resampling sketch follows the list):

  • oversampling the minority class: this has the drawback that we are basically duplicating our positive outcomes, and thus reducing the variance of that population. And the variance is our signal.
  • undersampling the majority class: this preserves all the signal in the minority class, but it throws away data, and we don't have much of it in the first place.
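A naive sketch of both strategies using scikit-learn's resample; the helper name and the 0/1 label convention are assumptions:

```python
import numpy as np
from sklearn.utils import resample

def rebalance(X, y, how="under", seed=0):
    """Naive rebalancing for a binary 0/1 outcome: either oversample the
    minority class with replacement, or undersample the majority class
    without replacement."""
    X, y = np.asarray(X), np.asarray(y)
    minority, majority = X[y == 1], X[y == 0]
    if how == "over":
        minority = resample(minority, replace=True,
                            n_samples=len(majority), random_state=seed)
    else:
        majority = resample(majority, replace=False,
                            n_samples=len(minority), random_state=seed)
    X_bal = np.concatenate([majority, minority])
    y_bal = np.array([0] * len(majority) + [1] * len(minority))
    return X_bal, y_bal
```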

We fit Random Forests, AdaBoost, Gradient Boosting and SVMs. We also extended the Bagging algorithm into a Balanced Bagging version, where, between the bootstrap sampling and the actual model fitting, an over-/undersampling of the bootstrapped data set takes place.
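The project's actual implementation isn't reproduced here; the sketch below illustrates the idea as described, with undersampling inside each bootstrap round (the class name, base estimator, and ensemble size are assumptions):

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

class BalancedBagging:
    """Bagging where each bootstrap sample is undersampled to balance
    the classes before the base model is fit."""

    def __init__(self, base=None, n_estimators=50, seed=0):
        self.base = base or DecisionTreeClassifier()
        self.n_estimators = n_estimators
        self.rng = np.random.default_rng(seed)
        self.models_ = []

    def fit(self, X, y):
        n = len(y)
        for _ in range(self.n_estimators):
            boot = self.rng.integers(0, n, size=n)            # bootstrap sample
            Xb, yb = X[boot], y[boot]
            pos = np.flatnonzero(yb == 1)                     # minority indices
            neg = self.rng.choice(np.flatnonzero(yb == 0),    # undersample majority
                                  size=len(pos), replace=False)
            idx = np.concatenate([pos, neg])
            self.models_.append(clone(self.base).fit(Xb[idx], yb[idx]))
        return self

    def predict_proba_pos(self, X):
        # Average the positive-class probability across the ensemble
        return np.mean([m.predict_proba(X)[:, 1] for m in self.models_], axis=0)
```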

Not surprisingly, Gradient Boosting was the best performer in the end. With it we achieved a 0.75 ROC AUC score. Our recall was 0.82, with a False Positive Rate of 0.43. We weren't too concerned with the high FPR, because Living Goods will still visit all households even if the scoring model classifies them as low risk. What we focused on was achieving high sensitivity, and 0.82, compared to the 0.053 baseline without any predictor, is very satisfactory to begin with. Back to why this was not surprising: with a very low signal data set, an algorithm that learns slowly and progressively corrects its prediction errors step by step has a better chance of performing well, among those we tried, than algorithms that fit very deep decision trees, like random forests.
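For reference, the quoted metrics can be computed from held-out predictions like this; the 0.5 threshold is an assumption, and in practice it would be tuned to trade recall against the FPR:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score, roc_auc_score

def evaluate(y_true, prob, threshold=0.5):
    """Compute the headline metrics: ROC AUC, recall (sensitivity),
    and false positive rate at the given threshold."""
    pred = (np.asarray(prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, pred).ravel()
    return {"roc_auc": roc_auc_score(y_true, prob),
            "recall": recall_score(y_true, pred),   # tp / (tp + fn)
            "fpr": fp / (fp + tn)}                  # 1 - specificity
```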

Future work would involve trying an anomaly detection framework to handle the severe imbalance: the literature contains very interesting references to isolation forests, where many random trees are fitted and the quantity of interest is the average number of splits needed to isolate a data point. The hope is that minority class instances will have a significantly shorter average path length, since anomalous points are easier to isolate.
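A sketch of what that could look like with scikit-learn's IsolationForest, on stand-in data; setting contamination to the 5.33% base rate is an assumption:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.default_rng(0).normal(size=(6700, 40))   # stand-in for the survey features

# An isolation forest isolates each point with random splits; points that
# need fewer splits (shorter average path length) score as more anomalous.
iso = IsolationForest(n_estimators=200, contamination=0.053, random_state=0).fit(X)

# score_samples is higher for "normal" points, so negate it to rank
# households from most to least anomalous.
risk = -iso.score_samples(X)
top_risk = np.argsort(risk)[::-1][:100]                # 100 most anomalous households
```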
