
CDSW-certification

Repo to be used to certify for the SME CDSW team

Overview

We're interested in seeing whether self-reported health measures of a person, along with some of their demographic information, might enable us to predict their asthmatic status.

The data we have to work with comes from five years of telephone surveys conducted across the US, in which respondents report on a variety of health conditions and demographics, including their asthmatic status.

Data Set

The health data comes from the CDC's Behavioral Risk Factor Surveillance System (BRFSS). These data sets cover many years, but we'll limit ours to 2011-2015 because the surveys from those years can be compared directly against one another.

These data sets contain a great deal of information. For convenience we will largely use the calculated fields rather than the raw responses.

The information (and the corresponding column headers) we're interested in is listed below.

Data of Interest

The field names and the descriptions we'll use are:

Variable Name Description
CVDCRHD4 Angina
CVDINFR4 Heart attack
DISPCODE Was the interview completed or not
RENTHOM1 Own or rent home
SEQNO Record sequence number
SEX Respondent's sex
_AGE_G Six-level age group
_BMI5CAT BMI category
_CASTHM1 Adult currently has asthma - THIS IS THE RESULT TO BE PREDICTED
_DRDXAR1 Diagnosed with arthritis
_EDUCAG Level of education completed
_HCVU651 Adults with health care coverage
_INCOMG Income group
_MICHD Respondents with heart issues
_RACEGR3 Five-level race group
_RFHLTH Adults with good or better health
_SMOKER3 Four-level smoker status
_STATE State FIPS code

Field Value Meaning & Count by Year

The meaning of each field value, and the count of those values by year, is shown below. We're only going to use the value counts as a sanity check, so only two representative fields (DISPCODE & _CASTHM1) include counts in their tables.

Some of the fields in the original data used a BLANK value to indicate uncertain data. We will convert all BLANK values to a 9 during the data processing phase.

CVDCRHD4

Value Meaning
1 Yes
2 No
7 Don’t know/Not sure
9 Refused

CVDINFR4

Value Meaning
1 Yes
2 No
7 Don’t know/Not sure
9 Refused
BLANK Not asked or Missing

DISPCODE

Note that in 2011 the codes were 110 and 120, as compared to 1100 and 1200 for later years. This will be corrected during the data processing phase.

Value Meaning 2011 2012 2013 2014 2015
1100 Completed interview 463,716 441,608 433,220 413,558 375,059
1200 Partially completed interview 42,751 34,079 58,553 51,106 66,397

RENTHOM1

Indicates whether the respondent rents or owns their home.

Value Meaning
1 Own
2 Rent
3 Other arrangement
7 Don’t know/Not Sure
9 Refused
BLANK Missing

SEQNO

An arbitrary, monotonically increasing sequence number that uniquely identifies each record within a year.

SEX

Value Meaning
1 Male
2 Female

_AGE_G

Value Meaning
1 Age 18 to 24
2 Age 25 to 34
3 Age 35 to 44
4 Age 45 to 54
5 Age 55 to 64
6 Age 65 or older

_CASTHM1

Value Meaning 2011 2012 2013 2014 2015
1 No 457,964 429,280 442,718 418,561 398,154
2 Yes 45,203 43,267 45,630 42,875 40,000
9 Don’t know/Not Sure Or Refused/Missing 3,300 3,140 3,425 3,228 3,302

_BMI5CAT

Value Meaning
1 Underweight
2 Normal Weight
3 Overweight
4 Obese
BLANK Don’t know/Refused/Missing

_DRDXAR1

Value Meaning
1 Diagnosed with arthritis
2 Not diagnosed with arthritis
BLANK Don’t know/Not Sure/Refused/Missing

_EDUCAG

A categorized evaluation of the highest level of education achieved by the respondent

Value Meaning
1 Did not graduate High School
2 Graduated High School
3 Attended College or Technical School
4 Graduated from College or Technical School
9 Don't know, not sure, missing

_HCVU651

Value Meaning
1 Have health care coverage
2 Do not have health care coverage
9 Don’t know/Not Sure, Refused or Missing

_INCOMG

Respondent's annual household income

Value Meaning
1 [$0 - $10,000)
2 [$10,000 - $15,000)
3 [$15,000 - $20,000)
4 [$20,000 - $25,000)
5 [$25,000 - $35,000)
6 [$35,000 - $50,000)
7 [$50,000 - $75,000)
8 [$75,000 - )
77 Don't know, not sure
99 Refused
BLANK Not asked, or missing

_MICHD

Value Meaning
1 Reported having heart problem
2 Did not report having heart problem

_RACEGR3

Value Meaning
1 White only, Non-Hispanic
2 Black only, Non-Hispanic
3 Other race only, Non-Hispanic
4 Multiracial, Non-Hispanic
5 Hispanic
9 Don’t know/Not sure/Refused

_RFHLTH

Value Meaning
1 Good or Better Health
2 Fair or Poor Health
9 Unsure or missing

_SMOKER3

Value Meaning
1 Current smoker - now smokes every day
2 Current smoker - now smokes some days
3 Former smoker
4 Never smoked
9 Don’t know/Refused/Missing

_STATE

State FIPS code - widely known, not reproduced here.

I had hoped to investigate the relationship between cholesterol and asthma, but the data sets for the years 2011 through 2014 don't contain any cholesterol information.

The data is available in SAS XPORT and fixed-width formats. The SAS format doesn't include sufficient data to calculate race properly, so we'll use the fixed-width format.

For statistically accurate reporting, sophisticated weighting models must be applied when analyzing this data. For our purposes (certification for the SME CDSW team) we're more interested in the processing, so we will not apply those models and will simply use the raw data as-is.

The data along with codebooks can be found here:

https://www.cdc.gov/brfss/annual_data/annual_data.htm

Data Ingest

The data is in the form of zip files, each on the order of a few MB.

The ingest process does the minimum necessary to get the data into HDFS:

  • download each zip file locally into a data directory ($HOME/brfss)
  • unzip each file into a fixed width file for that year ($HOME/brfss/YEAR.txt)
  • write that file into HDFS ($HDFS_HOME/brfss/YEAR.txt)

Ingestion is performed by the scripts/1_ingest.py Python 2 script and should only be necessary once.
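
For illustration, here is a minimal sketch of that ingest logic. The URL template is a hypothetical stand-in (check the BRFSS annual data page for the real zip names), and each zip is assumed to contain a single fixed-width member:

import os
import urllib
import zipfile
import subprocess

DATA_DIR = os.path.join(os.environ["HOME"], "brfss")
# Hypothetical URL template -- the real zip names vary by year
URL = "https://www.cdc.gov/brfss/annual_data/%d/files/LLCP%dASC.zip"

if not os.path.isdir(DATA_DIR):
    os.makedirs(DATA_DIR)

for year in range(2011, 2016):
    zip_path = os.path.join(DATA_DIR, "%d.zip" % year)
    txt_path = os.path.join(DATA_DIR, "%d.txt" % year)
    urllib.urlretrieve(URL % (year, year), zip_path)      # download locally
    with zipfile.ZipFile(zip_path) as z:                  # unzip the one member
        member = z.namelist()[0]
        z.extract(member, DATA_DIR)
    os.rename(os.path.join(DATA_DIR, member), txt_path)   # rename to YEAR.txt
    subprocess.check_call(["hdfs", "dfs", "-put", "-f",   # write into HDFS
                           txt_path, "brfss/%d.txt" % year])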

Data Processing

The files are in fixed-width format, and the locations of the fields differ across the five years. Furthermore, the fields present differ from year to year, so we have to calculate fields for the years in which they're missing.

The data processing steps are encoded in the scripts/2_process.py Python 2 script.

Field extraction

We use a JSON codebook file, indexed by year, to describe the fixed-width layout. We then use a helper function (xtract) that extracts a named field's contents given the Spark data frame and the codebook entry for the relevant year.
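
A minimal sketch of the idea, assuming a hypothetical codebook layout of the form {year: {field: [start, length]}} with 1-based column offsets (the offsets shown are made up; spark is the session CDSW provides):

import json
from pyspark.sql.functions import substring, col

with open("codebook.json") as f:
    codebook = json.load(f)     # e.g. {"2011": {"DISPCODE": [18, 4], ...}}

def xtract(df, year, name):
    # Pull a named field out of the raw fixed-width line ('value' column)
    start, length = codebook[year][name]
    return substring(col("value"), start, length)

raw = spark.read.text("brfss/2011.txt")
df = raw.select([xtract(raw, "2011", f).cast("int").alias(f.lower().lstrip("_"))
                 for f in ["DISPCODE", "SEX", "_CASTHM1"]])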

Field Calculation & Correction.

2015

The 2015 data contains all the fields we need and requires no further processing.

Other years

_RACEGR3 is known as _RACEGR2 in 2011 & 2012. If we find a year in which _RACEGR3 is not present, we create a new _RACEGR3 with the value of _RACEGR2.

If we find a value of 110 or 120 in the DISPCODE field we replace it with 1100 or 1200 respectively.

If we find a BLANK value in any field we drop the record. This is the default behavior of the mechanism used to read the records, which makes the coding easier.

_MICHD is not present in the years 2011 through 2014, so it is calculated as defined in the 2015 codebook.
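
Sketched in PySpark, assuming columns have been extracted and lower-cased as above; the _MICHD rule is my reading of the 2015 codebook (reported heart attack or angina):

from pyspark.sql import functions as F

# 2011: promote DISPCODE 110/120 to 1100/1200
df = df.withColumn("dispcode",
                   F.when(F.col("dispcode") == 110, 1100)
                    .when(F.col("dispcode") == 120, 1200)
                    .otherwise(F.col("dispcode")))

# 2011 & 2012: derive _RACEGR3 from _RACEGR2
if "racegr3" not in df.columns:
    df = df.withColumn("racegr3", F.col("racegr2"))

# 2011-2014: calculate _MICHD (1 = reported heart attack or angina, else 2)
if "michd" not in df.columns:
    df = df.withColumn("michd",
                       F.when((F.col("cvdinfr4") == 1) |
                              (F.col("cvdcrhd4") == 1), 1).otherwise(2))

# Drop any record with a BLANK (null) field
df = df.na.drop()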

Validation

Using the figures given in the tables above, we determine whether we have sufficient records in total.
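
For example (expected totals taken from the DISPCODE table above; because records with BLANK fields are dropped, the processed counts will fall somewhat short):

# Expected record totals per year: completed + partially completed interviews
EXPECTED = {2011: 463716 + 42751, 2012: 441608 + 34079, 2013: 433220 + 58553,
            2014: 413558 + 51106, 2015: 375059 + 66397}

def validate(df, year):
    actual = df.count()
    print("%d: %d of %d records (%.1f%%)" %
          (year, actual, EXPECTED[year], 100.0 * actual / EXPECTED[year]))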

Impala

We create Impala tables using the following technique:

First, figure out what the parquet files are called, e.g.:

cdsw@psjfrp91wt2nw5rn:~$ hdfs dfs -ls brfss/*.parquet/part-00000-* | awk '{ print $8 }'
brfss/2011.parquet/part-00000-7b4d4723-9872-4d17-b921-bfcd8c1c9c66.snappy.parquet
brfss/2012.parquet/part-00000-699da841-6c73-449b-ad4a-68def324d033.snappy.parquet
brfss/2013.parquet/part-00000-833cbf0c-5b15-4d4d-8efe-ad01ab07ba3e.snappy.parquet
brfss/2014.parquet/part-00000-80fcdcd0-d1bc-4ad2-8d4c-063294d7764a.snappy.parquet
brfss/2015.parquet/part-00000-843db927-4e9c-4a57-be9c-a29f17f3b5a6.snappy.parquet

Then, using Hue, create the external tables in the Impala editor:

create external table brfss_2011 like parquet '/user/clouderanT/brfss/2011.parquet/part-00000-7b4d4723-9872-4d17-b921-bfcd8c1c9c66.snappy.parquet' 
stored as parquet location '/user/clouderanT/brfss/2011.parquet';
create external table brfss_2012 like parquet '/user/clouderanT/brfss/2012.parquet/part-00000-699da841-6c73-449b-ad4a-68def324d033.snappy.parquet' 
stored as parquet location '/user/clouderanT/brfss/2012.parquet';
create external table brfss_2013 like parquet '/user/clouderanT/brfss/2013.parquet/part-00000-833cbf0c-5b15-4d4d-8efe-ad01ab07ba3e.snappy.parquet' 
stored as parquet location '/user/clouderanT/brfss/2013.parquet';
create external table brfss_2014 like parquet '/user/clouderanT/brfss/2014.parquet/part-00000-80fcdcd0-d1bc-4ad2-8d4c-063294d7764a.snappy.parquet'    
stored as parquet location '/user/clouderanT/brfss/2014.parquet';
create external table brfss_2015 like parquet '/user/clouderanT/brfss/2015.parquet/part-00000-843db927-4e9c-4a57-be9c-a29f17f3b5a6.snappy.parquet'    
stored as parquet location '/user/clouderanT/brfss/2015.parquet';

This is all wrapped up in a single shell script, scripts/2a_generate_sql.sh, which emits the appropriate SQL.
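
A rough Python 2 equivalent of that script, assuming the parquet directories live under brfss/ in the clouderanT user's HDFS home:

import subprocess

USER = "clouderanT"

# 'hdfs dfs -ls -C' prints just the paths, like the awk '{print $8}' above
paths = subprocess.check_output(
    ["hdfs", "dfs", "-ls", "-C", "brfss/*.parquet/part-00000-*"]).split()

for p in paths:
    year = p.split("/")[1].replace(".parquet", "")  # brfss/2011.parquet/... -> 2011
    print("create external table brfss_%s like parquet '/user/%s/%s' "
          "stored as parquet location '/user/%s/brfss/%s.parquet';"
          % (year, USER, p, USER, year))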

We then checked some sample counts to see if the data had been read and converted correctly:

select count(dispcode) from brfss_2011 where dispcode = 1100;

463716

This is a correct value.

select count(dispcode) from brfss_2014 where dispcode = 1200;

51106

This is a correct value.

select count(casthm1) from brfss_2012 where casthm1 = 9;

3140

This is a correct value.

select count(casthm1) from brfss_2013 where casthm1 = 1;

442713

This is an incorrect value. It should be 442718 according to the 2013 codebook.

select count(casthm1) from brfss_2013 where casthm1 = 2;

45630

This is a correct value

select count(casthm1) from brfss_2013 where casthm1 = 9;

3425

This is a correct value

select count(dispcode) from brfss_2013 where dispcode = 1100;

433220

This is a correct value

select count(dispcode) from brfss_2013 where dispcode = 1200;

58553

This is a correct value.

So we have one incorrect value, where the final digit of our count is a 3 but the codebook's is an 8 (442,713 vs 442,718). I think I'm going to ignore it for now!

Exploration

Using scripts/3_explore.R we produced prevalence graphs. Two graphs were produced for each feature (see the sketch after this list):

  • A graph showing the prevalence of the various classes and the asthmatic response as a percentage of the global population
  • A graph showing the prevalence of asthma amongst the various classes of the feature
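
The graphs themselves are produced in R; as a rough illustration of the underlying calculation, here's the second kind of prevalence (asthma rate per class of a feature) sketched in PySpark:

from pyspark.sql import functions as F

# Asthma rate within each smoker class (_CASTHM1 == 2 means 'Yes')
prev = (df.groupBy("smoker3")
          .agg(F.count("*").alias("n"),
               F.avg(F.when(F.col("casthm1") == 2, 1.0)
                      .otherwise(0.0)).alias("asthma_rate"))
          .orderBy("smoker3"))
prev.show()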

Overall we observed that prevalence could vary widely on a global scale (e.g. only about 10% of the population had suffered a heart attack).

Of interest was the monotonic increase in respondent numbers with age (the survey was quite skewed toward elderly people, although they were slightly less likely to be asthma sufferers than the younger population).

Modeling/Analysis

We used three different models:

  • LogisticRegression
  • RandomForest
  • GBTClassifier

Unfortunately none of them gave any useful output; all of them simply predicted 'no asthma', which yields an accuracy of approximately 90% because roughly 90% of respondents are not asthmatic.

To reduce the influence of the negative cases on the model we tried downsampling them by about 60%, but that made no difference to the result.
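
The repo's analysis is written in Scala; a condensed PySpark sketch of the same approach (the feature list is an illustrative subset, and the 0.4 sampling fraction reflects the ~60% reduction described above) looks like this:

from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Label: 1.0 = currently has asthma, 0.0 = not; drop the unknowns (_CASTHM1 == 9)
data = (df.filter(F.col("casthm1") != 9)
          .withColumn("label", F.when(F.col("casthm1") == 2, 1.0).otherwise(0.0)))

# Downsample the dominant negative class by ~60%
data = (data.filter("label = 0").sample(False, 0.4, seed=17)
            .union(data.filter("label = 1")))

features = ["sex", "age_g", "bmi5cat", "educag", "incomg", "smoker3"]
assembler = VectorAssembler(inputCols=features, outputCol="features")
train, test = assembler.transform(data).randomSplit([0.8, 0.2], seed=17)

model = LogisticRegression().fit(train)
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print(evaluator.evaluate(model.transform(test)))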

We did attempt to search several hyperparameters using a TrainValidationSplit, but found no hyperparameters that gave different results. We also found that this model was very slow when calculating a confusionMatrix: on the earlier models this had taken seconds, but on the TVS model (using LogisticRegression) it took many minutes. As a consequence we simply left this code in scripts/old/4_analysis_with_trainvalidationsplit_very_slow.scala in case we want to come back and re-examine it at a future stage.
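
Continuing the sketch above, the hyperparameter search looked roughly like this (grid values illustrative):

from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

lr = LogisticRegression()
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
        .build())
tvs = TrainValidationSplit(estimator=lr, estimatorParamMaps=grid,
                           evaluator=evaluator, trainRatio=0.8)
best = tvs.fit(train).bestModel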


