qss20_s21_proj's People

Contributors

a-reaves, bess-n, camguage, camjwright, chadlass, davidkantor2000, eprice819, euniceyliu, grant-98, helenma0223, helenma0223234, jecalianos, jgourdeau, katherinechristie, kevinchen9865, lauraholland0905, lizard12995, nicole-student, rebeccajohnson88, sambrant


qss20_s21_proj's Issues

Geocoding next steps

## requires geopandas; PATH_TO_INT_OUTPUTS and the state point GeoDataFrames
## (AL_points, KY_points, etc.) are defined upstream in the notebook
import geopandas as gpd
from geopandas import sjoin

## read the 2016 TIGER/Line tract shapefiles for the six TRLA states
al_shapes = gpd.read_file(PATH_TO_INT_OUTPUTS + "tl_2016_01_tract")
ky_shapes = gpd.read_file(PATH_TO_INT_OUTPUTS + "tl_2016_21_tract")
la_shapes = gpd.read_file(PATH_TO_INT_OUTPUTS + "tl_2016_22_tract")
ms_shapes = gpd.read_file(PATH_TO_INT_OUTPUTS + "tl_2016_28_tract")
tn_shapes = gpd.read_file(PATH_TO_INT_OUTPUTS + "tl_2016_47_tract")
tx_shapes = gpd.read_file(PATH_TO_INT_OUTPUTS + "tl_2016_48_tract")

shapefile_crs = tx_shapes.crs

## intersections while setting the CRS/projection of the
## points to match the shapefile's CRS; could be made more efficient
## by storing the points and shapes in a length-2 list and then
## putting that list in a dict with state as key (see the sketch below)
al_join = sjoin(AL_points.set_crs(shapefile_crs), al_shapes, how="right", op="intersects")
ky_join = sjoin(KY_points.set_crs(shapefile_crs), ky_shapes, how="right", op="intersects")
ms_join = sjoin(MS_points.set_crs(shapefile_crs), ms_shapes, how="right", op="intersects")
la_join = sjoin(LA_points.set_crs(shapefile_crs), la_shapes, how="right", op="intersects")
tx_join = sjoin(TX_points.set_crs(shapefile_crs), tx_shapes, how="right", op="intersects")
tn_join = sjoin(TN_points.set_crs(shapefile_crs), tn_shapes, how="right", op="intersects")
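
As a rough illustration of the dict-based refactor mentioned in the comment above (not part of the original notebook; the dictionary name and structure are assumptions), the six joins could be collapsed into a loop:

## minimal sketch, assuming the point and shape GeoDataFrames above are already loaded;
## note that newer geopandas versions use predicate= instead of op=
state_data = {
    "AL": (AL_points, al_shapes),
    "KY": (KY_points, ky_shapes),
    "LA": (LA_points, la_shapes),
    "MS": (MS_points, ms_shapes),
    "TN": (TN_points, tn_shapes),
    "TX": (TX_points, tx_shapes),
}

joins = {
    state: sjoin(points.set_crs(shapefile_crs), shapes, how="right", op="intersects")
    for state, (points, shapes) in state_data.items()
}
## joins["TX"] then plays the role of tx_join above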

Match row-number discrepancy diagnostic code, for reference (deleted from the final script but preserved here):


library(dplyr)

## indices of rows matched by fastLink in dataset A and dataset B
indices_approved = approved_matches$matches$inds.a
indices_approved_b = approved_matches$matches$inds.b
sprintf("In the match object, there are %s unique indices for A; %s for B",
        length(unique(indices_approved)),
        length(unique(indices_approved_b)))
both_ind = intersect(indices_approved, indices_approved_b) # see that they're the same, just in a different order
indices_dropped = setdiff(rownames(approved_only), indices_approved)

## inspect the rows that were dropped from the match
approved_only$row_id = rownames(approved_only)
View(approved_only %>% filter(row_id %in% indices_dropped) %>% select(dedupe_fields, EMPLOYER_NAME))

Paper Feedback To-Do's

Writing Edits

  • "Executive Summary" at the beginning- can put under the new heading for it
  • Jamaica statistic in 2nd paragraph of section 1.1
  • Should "H-2" in first paragraph of section 1.2 be "H-2A"?
  • 2nd paragraph of section 1.3- Feedback: "This paragraph only reports the first mode of enforcement. Please also provide the other modes to complete this paragraph." RJ: "I think for revisions, we could switch phrasing to drawbacks of complaint-driven enforcement and then the next sections outline different ways to improve."
  • 2nd paragraph of section 1.3- Need citation for claim at the end of the paragraph
  • Last paragraph of section 1.4.2- Elaborate on "actual issues" statement
  • Section 1.4.3- Define trusted messengers and provide citation reference
  • Section 3.1- Only list research questions, and move methods/findings to the relevant sections
  • Section 3.1 Question 3- confirm that Cam got the abbreviations (ACS, NAICS, SOC) correct and see if we need more explanation
  • Section 4.1- Provide source reference/website link for the data program
  • Section 4.1- Add footnote explaining how the links link out to publicly-available code
  • Section 4.1- Address "Please clarify" note
  • Section 4.1- Add footnote to clarify what CASE_STATUS means
  • Section 4.2- Add footnote to provide more detailed information about the data, including the webpage
  • Section 4.3- Clarify the dataset used
  • Section 4.3- Add footnote to explain the National Trafficking Hotline
  • Cameron- check with RJ about "N ~ 155"
  • Section 5.1- Explain models in detail, identifying predictors of concern and the outcome variables in the models
  • Section 5.3- Add footnote to describe "fuzzy matching algorithm" in detail
  • Section 5.5.1, second paragraph- DOL comment: "It is unclear for what purpose these three reasons/statements serve. Please clarify and consider only reporting the stringent definition of an investigation"
  • Section 5.5.1, third paragraph- Clarify statement and whether outcome variable is continuous or categorical
  • Section 5.6- Clarify what the "similar criteria" are
  • Section 5.7- Clarify/add footnote with source reference for GroupShuffler() from Scikit learn
  • Section 6.1- Don't use "very low" or other relative language, quantify instead
  • Section 6.1- Clarify what test set refers to
  • All of Section 6.1- Need consistent rounding rule for computing percentages
  • Section 6.2.2- clarify what job orders means in that context
  • Section 6.3- Indicate data sources used when pooling from all states across 2014-2021
  • Section 6.3.1- Try to introduce LASSO without its abbreviation non-awkwardly, and also provide citation references
  • Section 6.3.2, second paragraph- Report quantitative data in the text instead of describing just "higher" or "lower" risk so readers don't need to read the table. Apply throughout the report where applicable.
  • Same paragraph- avoid using "higher risk" and throughout the report do not report differences when they are not statistically significant
  • Section 6.4- Provide the data in the text (in the paragraphs labeled 1, 2, and 3)
  • Section 6.4, last sentence- Quantify and avoid using "higher concentrations"
  • Footnote 6- Add time and link to press release if available
  • Move table 7 and 8 to an appendix "specifically describing the text analysis process"
  • Section 7- Elaborate on distinct entities being "flagged" based on the results of the data analyses
  • Section 7- Summarize data to support the claim that "it was less useful to predict which entities are likely to face local, TRLA-based oversight of activities versus federal oversight"
  • Section 7.3- Provide citation reference for risk-based sampling strategy
  • Move Supplementary Materials section into the appendices

Figure/Table Edits

  • Table 1- Add source citation and year below table. Also do this for other tables/figures where data are cited
  • Table 2- 3 notes in Overleaf (columns not aligned, add note to explain terms so table can stand alone, and move tables 2/3 along with detailed descriptions of analysis process to an appendix)
  • Appendix Table 16 (and other instances of Appendix) mentioned in the text, but no appendices in the report
  • Mentioned just before Table 3 in Overleaf- Add table listing research questions and the corresponding data sources/analysis methods used
  • Table 3- Not strictly speaking a table, they recommend revising
  • Section 6.2.1- Add time period covered in "6,000 - 8,000" range
  • Figure 2- "1. Please use “WHD investigation” to be consistent with the legend. 2. Please add a note to clarify the differences between WHD investigations and violations. 3. Please use a note to spell out all the acronyms in the tables or figures as they need to be able to stand alone without the need to read the text unless these is a glossary page provided upfront in the report."
  • Figure 3- "1. Please do not use a question in the title and revise it to an informative title, including what and how classified. Please apply this comment to revise the figure titles throughout the report where applicable. 2. Please place values on the bars (could be rounded to whole numbers) for clarity as these two figures intend to present counts and percentages. Please apply this comment to the other figures which report counts or percentages. 3. The text in the legend seems to be cut off. Please adjust and apply this comment throughout this report where applicable."
  • Section 6.2.1- Provide the actual data for the four categories in the text (from Figure 4)
  • Section 6.2.1- Provide reasons/data to support conclusion that "TRLA intake records (which largely derive from direct outreach workers to workers)---capture different employers in their enforcement purview"
  • Figure 4- "Please provide the sample sizes (denominators) for the percentage data in the note for clarity. Please apply this comment to revise all the related tables or figures."
  • Figure 5- "Please add a note to explain “job orders” and “Investigations.” Please also explain why the areas after 2020-03-01 are shaded and have no data in the note. Please clarify whether the vertical axis for “Investigations” is counts or percent. Please apply this comment to Figure 6 as well. The x-axis is hard to read. Since it is “rounded to month,” please consider using “Jan-2019,” “Mar-2019” ... instead. Please apply this comment to Figure 6 as well."
  • Section 6.2.3- CG check with RJ how to replace \sim
  • Figure 7- Clarify what the values on the bar represent and use consistent formats, use "WHD investigations" consistently throughout the report, adjust the data labels that are hard to read
  • Figure 9- Avoid using symbols in the labels, like "TRLA intakes?", revise "had confirmed violations" in the title to "confirmed WHD-found violations" to be consistent with the axis, consider removing the None category
  • Section 6.2.3- provide in-text data to support the following claim: "We see that although the two types of oversight focus on generally similar types of employers, the TRLA intake records reflect a slightly higher proportion of agricultural equipment operators than WHD investigations."
  • Figure 10- Provide unit of measurement on the horizontal axis, fix label on vertical axis that seems to be cut off (I think "Farmworkers and Laborers, Crop, Nursery, and" should have "Greenhouse" at the end)
  • Section 6.3.1- Elaborate on what "training/estimation set employers" means here
  • Figure 11- "1. We recommend moving this figure on the results of various tests to an appendix specifically on the modeling process to keep the main report focused on the findings. 2. Please add a note below the figure to spell out and explain all the acronyms in this figure."
  • Figure 12- "1. It is unclear what “label” means here in this comparison. Please clarify and also explain how to interpret these two figures in more detail in the text. 2. As the figures need to stand alone without the need to read the text, please add a note to explain the analytic model and the sample sizes used for the data shown in this figure and in Figure 13."
  • Table 4- "Please include the model applied to generate these probabilities in the title for clarity.
    Please also provide a note to explain the analytic model and the sample size used to generate these predictions. Please also apply this comment to provide analytic models/methods and sample sizes for the other figures or tables where applicable."
  • Figure 17- "Please also add a legend for gray and darker gray in this map"- are they just referring to the areas where the county borders are clustered...?
  • Section 6.5- Explain in a note why the word "translat" from the text analysis is not "translate"
  • Section 6.5- "Please consider revising" term "leading indicator" because our design was observational and not experimental
  • Figure 18- Explain stemming
  • Figure 19- Use more specific title including what data sources are used and how classified

Paper to dos

Writing major revisions: @lizard12995 @cassaydavis

  • Need to write an executive summary
  • 2-column table with glossary -- one column is abbreviation; other column is what it means. Things to include: WHD, TRLA, DOL, others?

Writing minor revisions:

  • Finish all overleaf comments in Sections 1-3 and 7 (intro, gaps, present report, and discussion)- @lizard12995 @cassaydavis
  • Add underlines to links in Sections 1-3 and 7 due to 508 compliance (with exception of references and links to other sections, so just external urls- can be done using \underline{}) - @lizard12995 @cassaydavis
  • Finish all overleaf comments in Sections 4, 5, and 6 (data, methods, results) - @camguage

Figure edits that can be made in overleaf (captions/alt text): @camguage

  • Finish all overleaf comments

Edits that require new code or figure revisions: @camguage and @rebeccajohnson88

  • Figure 3: Place values on the bars (could be rounded to whole number). Apply this comment to the other figures which report counts or percentages

508 Compliance Edits

  • The blue of citations and script links is too light- switch to #0064A8 blue
  • Underline links if possible
  • Links not working as is- fix here will depend on final submission format
  • Need alt text for each figure (tables too?)
  • Make sure all tables are tables- 1-1 cells, no merged, no split (could be an issue with conversion to Word)

ACS percentages code to borrow from

## requires pandas and numpy
import pandas as pd
import numpy as np

## melt to long format
df_acs_long = pd.melt(df_acs, id_vars = ['county', 'state', 'tract']).sort_values(by =
                                    ['tract', 'county'])

## split the variable name into prefix (table id) and suffix columns
df_acs_long[['variable_prefix', 'variable_suffix']] = df_acs_long['variable'].str.split('_', n = 1, expand = True)

## manually indicate which prefix columns don't follow the pattern
prefixes_perc_notrelevant = ['B01002','B25031','B25107']
df_acs_long['perc_NA'] = np.where(df_acs_long.variable_prefix.isin(prefixes_perc_notrelevant),
                                    1, 0)

## tables with an extra level of hierarchy (handled separately)
prefixes_perc_extrahier = ['B25026', 'B25123']
df_acs_long['perc_extrahier'] = np.where(df_acs_long.variable_prefix.isin(prefixes_perc_extrahier),
                                        1, 0)

## group by county, tract, and variable prefix to generate
## percentages (and later variable names)
df_acs_long_toiterate = df_acs_long[df_acs_long.perc_NA == 0].copy()
group_co_tract_varg = df_acs_long_toiterate.groupby(['county', 'tract', 'variable_prefix'])



################################## Generate percentages: auto-calc #################################

## for each (county, tract, table) group, divide each estimate by the first
## row in the group (the table total) to get a percentage
df_acs_long_percentage = []
for group, data in group_co_tract_varg:
    tract = data.tract.iloc[0]
    county = data.county.iloc[0]
    prefix = data.variable_prefix.iloc[0]
    row_list_group = []
    for i in range(1, data.shape[0]):
        numerator = data.value.iloc[i]
        denominator = float(data.value.iloc[0])
        percentage = numerator / denominator
        row = [county, tract, prefix]
        row = row + [data.variable_suffix.iloc[i], percentage]
        row_list_group.append(row)
    df_acs_long_percentage.append(pd.DataFrame(row_list_group))

percentages_all_groups = pd.concat(df_acs_long_percentage)
percentages_all_groups.columns = ['county', 'tract', 'variable_prefix',
                                  'variable_suffix', 'percentage']

## combine prefix and suffix back into a single variable name
percentages_all_groups['variable_prefix_suffix'] = percentages_all_groups.variable_prefix + "_" + percentages_all_groups.variable_suffix
percentages_all_groups.drop(columns = ['variable_prefix', 'variable_suffix'], inplace = True)


####################################### Reshape percentages to wide ##################################

## create a county-tract id to pivot on
percentages_all_groups['county_tract'] = percentages_all_groups.county.astype(str) + "_" + percentages_all_groups.tract.astype(str)

percentages_all_groups.drop(columns = ['county', 'tract'], inplace = True)

## pivot to one row per county-tract, one column per variable
percentages_wide_pivot = percentages_all_groups.pivot(index = 'county_tract',
                                 columns = 'variable_prefix_suffix',
                                 values = 'percentage')

percentages_wide_pivot_reset = percentages_wide_pivot.reset_index()

## other cols to drop
percentages_wide_pivot_reset.drop(columns = ['b01001_002e',
                                      'b01001_026e'], inplace = True)
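
For reference, the group loop above could likely be replaced by a vectorized groupby, sketched below under the same assumption the loop makes (the first row in each county/tract/prefix group is the table total used as the denominator); this is not from the original script:

grp = df_acs_long_toiterate.groupby(['county', 'tract', 'variable_prefix'])['value']

## divide every estimate by the first value in its group (the assumed total)
df_acs_long_toiterate['percentage'] = df_acs_long_toiterate['value'] / grp.transform('first')

## drop the denominator rows themselves, mirroring the loop's range(1, ...)
percentages_vectorized = df_acs_long_toiterate[grp.cumcount() > 0].copy()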

CG questions for RJ

  • Pushed updated figures to github under \output\figs. If these look good I can add in Overleaf
  • One DOL comment on this sentence ("Figure 4 presents these as percentages to account for the low counts of investigations relative to overall employers") says "Please also provide the actual data for the four categories in the text for clarity." 1) I'm not quite sure what 4 categories they're referring to, and 2) with the percentages now on the figure directly, should I include this data in the text as well? Should I include all or just part of it?
  • Figure 11, which is now Figure 19 after moving to appendix (called Model Performance: any WHD investigation) needs the acronyms spelled out/explained. Is there any chance you could spell these out for me?
  • We need to "summarize data" to support the claim that "it was less useful to predict which entities are likely to face local, TRLA-based oversight of activities versus federal oversight" (line 862). Which data/figures should I point to here?
  • Line 353- Sorry but I need another clarification on this one. When I add a footnote, what "ones" should I add that are filtered out- just the status codes like "CERTIFICATION EXPIRED"? Also is this what they're asking us to clarify when their question is about script 2?
  • Per "yep I can share an outline of this" below, wanted to confirm you already implemented the whole table as Table 1?
  • As I was turning the one-column tables into figures, I am running into an issue because what is now Figure 20 contains a table that is longer than a page. This isn't working inside \begin{figure} ... \end{figure}. Not sure what the fix is here. Should I keep it as a table, suppress the name "table", and write a caption* starting with "Figure 20 ... "?
  • For Figures 6 and 7 (in case the numbers get switched around, they are the ones beginning with "Monthly patterns pre and during COVID-19..."), I have 2 questions. 1) Are the vertical axes for investigations counts or percents? Just so I can specify. 2) They suggest changing the x-axis to labels like "Jan-2019", etc. I would go ahead and implement, but I cannot seem to find the code that creates these figures
  • Line 735 says "pooling data from all states across 2014-2021" and we are asked to specify the data sources. Is this just the WHD compliance action data?
  • Lines 738/9 uses the term "training/estimation set employers" and we are asked to specify what this means. If you or Helen could do this that would be awesome
  • Bottom of page 31, I have a footnote link that is hanging over the margin that I cannot seem to fix. Not sure if you knew any fixes or if copyeditor could handle
  • Figures 12 and 13 (both titled "Predictions versus true investigations label") require clarification for how to interpret them in the text, clarification for the word "label" in the title, and "a note to explain the analytic model and the sample sizes used for the data shown." I unfortunately do not understand these figures enough to add these changes
  • On table 4, they're asking for more information about the analytic model/sample size (and to apply this where else is appropriate). If you or Helen could help with this that would be great
  • Line 873, we have "it was less useful to predict which entities are likely to face local, TRLA-based oversight of activities versus federal oversight" and they ask us to summarize the data to support the claim. Was hoping to get help as this seems to be about ML

From Lizzie:

  • Line 121 - can you validate that what I wrote is correct?
  • Line 164 - question about source citation

Exploratory data analysis

  • Load the Wage and Hour Division enforcement data linked to in the README and give some basic descriptives: how businesses are identified (business name, business address, EIN or tax ID), whether there's a field indicating whether the employer is part of the H-2 visa program, the number of rows, and the span of years
  • Similarly, load the H-2 jobs data and start to explore --- @rebeccajohnson88 will check with @TRLegalAid on whether we can get a static copy of the data behind the interface

Report empirical things

Must do:

  • Input TRLA not WHD and WHD not TRLA accuracy results
  • Find feature importances for those
  • add within-COVID over-time variation (could move to "Maybe do"): generate graphs; add to the report along with refreshed versions of the others
  • basic text comparisons (could move to "Maybe do")

Maybe do:

  • Go back to rawer data sources and find link between attorney agent name and attorney firm (probably trla raw-er data; think this was excluded pretty early from h2a disclosures data because not collected consistently across years)
  • Predict WHD violations conditional on investigation
  • Heatmap stuff

Notes on intake data (migrated from slack)

Dispositions:

Closed = TRLA involvement with the case has concluded. A case with this disposition was never rejected. We took it on in some capacity (anywhere from just giving advice to taking on significant litigation) and then the case was closed.
Open = TRLA has accepted the case and it is in progress. The client and their case qualify for our services (re: income, immigration status, and area of law)
Pending = TRLA is evaluating the case before it is accepted. This stage could involve completing an intake application by asking/answering clarifying questions about the applicant and their case.
Rejected = TRLA declined to take the case. This often happens because of ineligibility. The client may be over-income, may stop responding to messages from us after repeated tries, or may be outside of our geographic reach.

How to code adverse parties:

Case # is tied to an individual client; Lead Case # is one number that links those cases together (for a group case, which includes multiple individual cases).

Any case (or group case) could have multiple adverse parties tied to that case. A set of individual cases in a group could also all have the same adverse party.
Lead case AP would apply to all individual cases in the group.

If multiple adverse parties are present, you'd want to separate them out.
So, if the individual AP is missing, use the Lead AP. You have to decide if a "count" is tied to an individual client or to a group case.
If the Lead AP is missing, that is fine; use the individual case AP.
If both are present, you could separate them out into two cases for the first occurrence, or for each, depending on how you've decided to count.

Share case IDs for illogical values, like some of the timestamps or states?


Script 7 notes

  • At the end of script 3, we'll want to output these pre-deduped datasets (investigations_cleaned and approved_only) with ids:
## name_city_state and id_vector_1 for approved_only are assumed to be created earlier in the script
approved_only <- approved_only %>%
  mutate(id = match(name_city_state, id_vector_1))

## same pattern for the investigations data
investigations_cleaned <- investigations_cleaned %>%
  mutate(name_city_state = sprintf("%s, %s, %s", name, city, st_cd))

id_vector_2 <- unique(investigations_cleaned$name_city_state)

investigations_cleaned <- investigations_cleaned %>%
  mutate(id = match(name_city_state, id_vector_2))


  • Discuss deduplication and whether to do it in script 03 or script 07 --- can use fastLink within the same dataset to deduplicate and create ids for distinct businesses --- here's some example code from a different project

For this note, were those matches within the same dataset or across two datasets? It seems like it could be a threshold issue in the earlier script, or a need to do the within-dataset deduplication in that script.

Also, there are issues like DEL'S GRASS FARMS, LTD, SAN ANTONIO, TX showing up as multiple unique entries: one without the apostrophe, one with a period after LTD., and one without the LTD.

library(dplyr)
library(fastLink)

## fields used to decide whether two rows refer to the same business
dedupe_fields = c("derived_businessname",
                    "derived_businessownername",
                   "derived_street")

## match the dataset against itself to find duplicate businesses
er_matches <- fastLink(dfA = er_todedupe,
                       dfB = er_todedupe,
                       varnames = dedupe_fields,
                       stringdist.match = dedupe_fields,
                       dedupe.matches = FALSE)
er_deduped = getMatches(dfA = er_todedupe,
                       dfB = er_todedupe,
                       fl.out = er_matches)

## add multiple flag and a more descriptive id name
er_wnewid = er_deduped %>%
          group_by(dedupe.ids) %>%
          mutate(is_multiple_rows = ifelse(n() > 1, TRUE, FALSE)) %>%
          ungroup() %>%
          rename(derived_biz_id_distinct = dedupe.ids)



  • Once deduplication done and ready for outcomes, would construct categorical variable with a few different values: (1) no match to investigations, (2) match and findings_start_date >= job_start_date but findings_end_date > job_end_date, (3) match but findings_start_date < job_start_date, (4) match and findings_start_date >= job_start_date and findings_start_date <= job_end_date
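
As a purely illustrative pandas sketch of that outcome coding (the real implementation would live in the R scripts; the merged frame and the match-indicator column are assumptions, and date columns are assumed to already be datetimes):

import numpy as np
import pandas as pd

## assumed: one row per job with investigation fields merged on; a missing
## investigations_group_id means no match to investigations
is_match = merged["investigations_group_id"].notna()

conditions = [
    ~is_match,
    is_match & (merged["findings_start_date"] >= merged["job_start_date"]) & (merged["findings_end_date"] > merged["job_end_date"]),
    is_match & (merged["findings_start_date"] < merged["job_start_date"]),
    is_match & (merged["findings_start_date"] >= merged["job_start_date"]) & (merged["findings_start_date"] <= merged["job_end_date"]),
]
choices = [
    "no_match",
    "match_findings_extend_past_job_end",
    "match_findings_start_before_job",
    "match_findings_start_during_job",
]
## np.select uses the first condition that matches, so rows that satisfy both
## the second and fourth conditions resolve to the second in this ordering
merged["outcome_category"] = np.select(conditions, choices, default="other")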

Startup notes

Geolocation data:

Main challenge: getting tract identifiers onto the violations data if they don't already exist

  • Look at the address fields available in the violations data (exploratory notebook here you can copy over to your repo and build on: https://github.com/rebeccajohnson88/qss20_s21_proj/blob/main/code/00_explore_WHD_data.ipynb). Subset to ones with any H2A violations
  • If lat and long don't exist already in the data, either use zip codes as the geographic identifier or geocode using the address and a geocoding API. The Google Maps geocoder, the Geocod.io API, and the Census geocoder are a few options; you'll need to do this in batches, probably in a .py script, given the runtime
  • Once you have lat/long, transform those into spatial points data and intersect them with census tracts to find which census tract a given location is located in
  • Use the Census API or data explorer to pull American Community Survey (ACS) data for relevant years/variables
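
As a rough sketch of that last step (not project code; the vintage, variable id, and state FIPS code are placeholders), ACS tract data can be pulled from the public Census API with requests:

import requests
import pandas as pd

ACS_URL = "https://api.census.gov/data/2019/acs/acs5"   # assumed 5-year vintage
params = {
    "get": "NAME,B01003_001E",   # B01003_001E = total population, as an example variable
    "for": "tract:*",
    "in": "state:48",            # Texas; loop over the TRLA state FIPS codes
    # some vintages also require a county qualifier (e.g. "state:48 county:*") or an API key
}
resp = requests.get(ACS_URL, params=params)
resp.raise_for_status()
rows = resp.json()               # first element is the header row
acs_tx = pd.DataFrame(rows[1:], columns=rows[0])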

Ways to make scope more manageable: possibly focusing on states within TRLA catchment area ('TX', 'MS', 'LA', 'KY', 'AL', 'TN') to have fewer things to geocode and fewer sets of tracts to pull; could also increase geographic unit of analysis to zip code or county, but less interesting; if time, could add housing locations data (in this repo)

DOL staffing:

Main challenge: getting the OPM workforce data into a usable format/investigating granularity of staffing data

  • Visit the site containing OPM workforce data across many years and read through the documentation (https://www.opm.gov/data/index.aspx). Things to investigate: (1) I am pretty sure it already gets down to the granularity of the Wage and Hour Division within DOL, but you'll want to see what the granularity of job titles is (e.g., does it have a category for inspectors or just general DOL staff?) and what the granularity of geographic identifiers is (e.g., does it contain specific office locations, county-level sites, or just state?)
  • Write code that can iterate over the different OPM years, subset the rows you need, and rowbind across years. You can decide on things like location and year range using violations data - https://github.com/rebeccajohnson88/qss20_s21_proj/blob/main/code/00_explore_WHD_data.ipynb
  • Aggregate h2a violations to whatever aggregation level makes sense given DOL data (e.g., state; broader regions)
  • Can then do both descriptives and if you want, ML or statistical modeling --- if doing latter, make sure that staffing is measured pre-investigation, so you're looking at how staffing in 2017, for instance, predicts investigations either later that year or in the following year (depending on time granularity of OPM data)

Text as data:

Main challenge: the data are already acquired, so data acquisition is straightforward; the main goal is using the problem set two text-as-data content to find interesting patterns (expanding beyond that if relevant) and also merging the jobs/violations data

ML/stats:

Main challenge: since this is not explicitly covered in course content, writing ML code in sklearn or other packages

  • Use DOL quarterly jobs data as the universe of potential employers--- see note for above group about where to find/relevant script
  • Decide on time unit of analysis --- e.g., is it going to be all employers ever, employers repeated across months, employers repeated across quarters, etc
  • Use violations data to create different binary labels in employer dataset- "investigated that month," "violation that month", etc
  • Predict those labels using whatever features are interesting w/in jobs data

Next steps coding

Some cleanup of rowbinding code @rebeccajohnson88

  • Relative rather than absolute path names
## relative path name -- should work regardless of user if we have same directory structure (can sync at the meeting)
OUTPUT_DIR = "../../qss20_finalproj_rawdata/summerwork/raw/"

## the absolute pathname has my username in it, e.g.
## /Users/rebeccajohnson/Dropbox/qss20_s21_proj/code

  • Is Cam's code in the issue fully subsumed in script 02?

Cleaning of job addresses/geocoding: maybe @helenma0223234?

Check ACS codebook and pull additional ACS characteristics for all census tracts: maybe @Grant-98 and @eunice30718

Fuzzy matching between job orders data and violations data: maybe @camguage

Coding tasks

Pulling and rowbind DOL quarterly H-2A disclosures data from 2014 onwards (just csv or excel files and not the addendums)

General link: https://www.dol.gov/agencies/eta/foreign-labor/performance
2014-2020 are in the "H-2A program" section
2021 is up top

  • @helenma0223234 leading - either scraping (looking at the page structure and getting links; seems doable) or writing out the links; pull the files and write copies of the file from each year to local Dropbox qss20_finalproj_rawdata/summerwork/raw; have an argument in the script to either pull from source or not

@eunice30718 and @camguage

  • #6
  • Rename cols to reconcile across years as needed/if renaming is clear
  • #7

Geocoding locations of rowbound data (after rowbinding above):

Fuzzy matching between violations and quarterly disclosure data (after rowbinding above):

  • Notes: code here uses Python recordlinkage; switching to fastLink in R (I can provide some starter code)

Running list of TRLA intake data questions

Script here: https://github.com/rebeccajohnson88/qss20_s21_proj/blob/main/code/11_cleanTRLA_intake.Rmd

Category: adverse party/opponent:

I did the hierarchical coding discussed on slack in this chunk of code, so "missing all" means that a given row had none of the following fields: adverse party organization, lead case ap organization, adverse party name, lead case ap name

About 47% are missing all four fields:


  • I assume that some of these we'll fill in when we consolidate rows across case ids/lead case ids, but for others, should we just exclude?
## hierarchical consolidation: prefer AP org, then lead-case AP org,
## then AP name, then lead-case AP name
trla_orig = trla_orig %>%
      mutate(derived_opponent_consolidated = case_when(!is.na(adverse_party_organization) ~ adverse_party_organization,
                                               is.na(adverse_party_organization) & !is.na(lead_case_ap_organization) ~ lead_case_ap_organization,
                                               is.na(adverse_party_organization) & is.na(lead_case_ap_organization) &
                                              !is.na(adverse_party_name) ~ adverse_party_name,
                                              is.na(adverse_party_organization) & is.na(lead_case_ap_organization) &
                                              is.na(adverse_party_name) & !is.na(lead_case_ap_name) ~ lead_case_ap_name,
                                              TRUE ~ NA_character_),
             ## record which field the consolidated opponent came from
             derived_opponent_source = case_when(!is.na(adverse_party_organization) ~ "AP org",
                                               is.na(adverse_party_organization) & !is.na(lead_case_ap_organization) ~ "Lead AP org",
                                               is.na(adverse_party_organization) & is.na(lead_case_ap_organization) &
                                              !is.na(adverse_party_name) ~ "AP name",
                                              is.na(adverse_party_organization) & is.na(lead_case_ap_organization) &
                                              is.na(adverse_party_name) & !is.na(lead_case_ap_name) ~ "Lead AP name",
                                              TRUE ~ "Missing all"))
  • There are some government institutions as opponents (SSA, Texas Workforce, DOL)-- assume we exclude? The code currently constructs a flag for them, but I'm not sure if there are others I should add
## flag for non-employer (government) opponents, built within the same mutate()
derived_is_notemp = case_when(grepl("Social Security Administration|Department of Labor|Workforce", derived_opponent_consolidated) ~ TRUE,
                                      TRUE ~  FALSE)

Intake date: was the pull all cases, or pre-2014? There are a non-negligible number in the 1990s, and possibly one data entry error that should be recoded to maybe 1991?


Outcome variable next steps

Commit with changes here: 26b30b9

Main things I changed were:

  1. Renamed ids to more informative names. Now, there are four ids:
  • job_group_id: id for each distinct group of jobs
  • job_row_id: id for each distinct row in jobs
  • investigations_group_id: id for each distinct group of investigations
  • investigations_row_id: id for each distinct row in investigations
  2. Changed the fuzzy matching code so that jobs are the "base" data that we merge matches onto --- the key thing is we want to retain all jobs and use the presence or NA of matches to code outcomes; the previous code used investigations as the base data, so it lost the majority of jobs that didn't match to an investigation

  3. Also noticed that the names function was producing a bug, so updated that and reran all the fuzzy matching/deduplication


I stopped around line 330; I think it's hopefully straightforward to finish setting all flags to FALSE so things aren't rerun; script 7 can then be restricted to the date-based outcome variable coding

Modeling notes

  • I think you updated this, but we don't want to assume that all booleans are outcomes to predict. Instead, I'd just start a list of outcome vars like:
outcomes = ['outcome_XX', 'outcome_YY']

## existing approach below treats every boolean column as an outcome
df_y = pre_df.select_dtypes(bool)
print("Outcome variables to predict are: " + str(df_y.columns.values))
y1 = list(df_y.iloc[:, 0])
y2 = list(df_y.iloc[:, 1])
# remove them from the preMatrix ... because that would be too easy!
pre_df = pre_df.select_dtypes(exclude=['bool'])
  • For predictors, I don't think the capital letter heuristic will work once we add in more data, so I'd: (1) read in the csv file with the merged job disclosure data, (2) use the list of columns for that as the initial features to consider using (see the sketch below)
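
A minimal sketch of that approach (the outcome names come from the later issue notes; the file path and id columns are assumptions):

import pandas as pd

## explicit outcome list instead of "every boolean is an outcome"
outcomes = ["outcome_is_investigation_overlapsd", "outcome_is_viol_overlapsd"]
df_y = pre_df[outcomes]

## candidate predictors: columns that appear in the merged job disclosure data,
## excluding ids and the outcomes themselves
merged_jobs = pd.read_csv("intermediate/job_combined_acs_premerging.csv")   # assumed path
id_cols = ["job_row_id", "job_group_id"]
candidate_features = [c for c in merged_jobs.columns if c not in outcomes + id_cols]
X = pre_df[[c for c in candidate_features if c in pre_df.columns]]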

Predictive modeling next steps

  • separate out preprocessing script from modeling script
  • read in the following new datasets (csv versions) in the preprocessing script: (1) "WHD data" (all states; no TRLA outcomes)- whd_violations.csv; (2) "TRLA data" (7 states; WHD and TRLA outcomes)
  • get existing preprocessing code to work with the new data, making sure that the ACS variables are newly added as features
  • Make sure the ACS variables are included in the mean imputation (separate for train and test) and do min-max scaling separately in each dataset
  • Ping me after preprocessing is done for me to look over
  • Then work on the model estimation: (a) for the WHD data, this should iterate over two labels: (1) investigations (outcome_is_investigation_overlapsd) and (2) violations (outcome_is_viol_overlapsd); (b) for the data combining TRLA, use the variable outcome_compare_TRLA_WHD and use np.where to define four outcomes: investigated by either, investigated by TRLA only, investigated by WHD only, investigated by both
  • For the model estimation script, write the following outputs for each model and label to the model_outputs folder in GitHub (so not Dropbox): (1) confusion matrix (can melt to long format where one col is each of the four categories, another col is the count), (2) coefficients/feature importances, (3) continuous predicted probabilities for each row (should have job_row_id attached); see the sketch after this list
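
A minimal sketch of the requested output format in the last bullet (the model choice, train/test objects, and id vector are placeholders, not the project's actual estimation code):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

## (1) confusion matrix melted to long format: one row per (true, predicted) cell
labels = sorted(pd.unique(y_test))
cm = confusion_matrix(y_test, model.predict(X_test), labels=labels)
cm_long = (pd.DataFrame(cm, index=labels, columns=labels)
             .rename_axis("true_label")
             .reset_index()
             .melt(id_vars="true_label", var_name="predicted_label", value_name="count"))
cm_long.to_csv("model_outputs/confusion_matrix_long.csv", index=False)

## (2) feature importances (coefficients for linear models)
pd.DataFrame({"feature": X_train.columns, "importance": model.feature_importances_}) \
  .to_csv("model_outputs/feature_importances.csv", index=False)

## (3) continuous predicted probabilities with job_row_id attached
probs = pd.DataFrame(model.predict_proba(X_test), columns=model.classes_)
probs["job_row_id"] = job_row_ids_test.values   # assumed id vector aligned with X_test
probs.to_csv("model_outputs/predicted_probabilities.csv", index=False)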

Notes on paper deadline - September 1st

The Researcher(s) will produce a final paper and will submit it via email to the DOL Summer Data Challenge address ([email protected]) that will:

  1. present the results of the analysis;

  2. discuss the questions that the research was not able to answer;

  3. identify other gaps in knowledge;

  4. recommend new data collection efforts that should be considered to address these gaps;

  5. propose what the DOL should fund in the future.

The researchers will deliver a draft paper by September 2021. They will deliver a briefing to DOL on the paper in October 2021 on a date yet to be determined.

Things to check for script 3

  • Creation of new unique ids if deduping results in jobs spanning multiple states or investigations spanning multiple states

Currently, the code seems to contain a step to create a modified id if an investigations_group_id spans multiple states. The main issue is:

  • This should be done before the match step and immediately after the deduping because we want the correct ids to be in the deduped dataset before the random sample
  • Change 1: right after the investigations_deduped_clean creation on line 255, do the new id creation to create new ids for same name entities that span multiple states- this needs to occur here because otherwise the sampling is with the wrong ids. Then, use that new id to do the sampling of 1 within an id
  • Change 2: similar for jobs, after creating approved_deduped_clean around line 208, do the new id creation and sampling within that id
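
A small pandas sketch of that id fix (illustrative only; the real change belongs in the R script, and the column names are assumptions):

import pandas as pd

def qualify_ids_spanning_states(df: pd.DataFrame, id_col: str, state_col: str) -> pd.Series:
    ## number of distinct states each deduped id covers
    n_states = df.groupby(id_col)[state_col].transform("nunique")
    ## append the state to the id only when the id spans more than one state
    return df[id_col].astype(str).where(
        n_states == 1,
        df[id_col].astype(str) + "_" + df[state_col].astype(str),
    )

## e.g. before sampling one row per id:
## investigations_deduped_clean["investigations_group_id"] = qualify_ids_spanning_states(
##     investigations_deduped_clean, "investigations_group_id", "state")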

Descriptives next steps

Questions when reading in fuzzy matching outputs to construct outcomes variables:

  • Correct to assume that jobs_group_id is the final identifier for a unique employer (and safe to drop jobs_group_id_part1 and jobs_group_id_part2 from the df if those were intermediates created for the cross-state stuff)?
  • For @eunice30718: when I read in the ACS merged file, it has the original variable names and not the ones renamed from the codebook. Is there an easy fix somewhere to get the renamed versions? If not, we can maybe skip it, but the renamed ones are important / I know that was a big focus of the earlier scripts. Maybe just do the renaming in R by melting and reading in the codebook?
library(data.table)  ## for fread()
acs_pred = fread("intermediate/job_combined_acs_premerging.csv")


Next steps:

  • Ideally fix ACS var naming (can just do in R by reading in codebook if needed)
  • Line 144: work on deduplication of ACS-jobs merge
  • Write two outputs
  • Visualizations using ACS characteristics of three groups: (1) no investigation, (2) investigation but no violation, (3) investigation and violation (TRLA and WHD)
  • Visualizations of things like naics codes and attorney agent between the three groups

Next steps on TRLA

  • review the commit here and see if there are any other changes you'd want to make- one part I wasn't sure about was standardizing the case dispositions between the two sources in this code chunk, but we also aren't likely to use that other than descriptively
## standardize dispositions across the two disposition fields
## (fragment from a larger mutate() call)
derived_casedispo = case_when(case_disposition_orig == "Closed" |
                                            case_disposition_more == "CLO" ~ "Closed",
                                            case_disposition_orig == "Pending" |
                                            case_disposition_more == "PEN" ~ "Pending",
                                            case_disposition_orig == "Open" |
                                            case_disposition_more == "ACC" ~ "Open/accepted",
                                            TRUE ~ "Unknown dispo"))
  • copy a version of this script and name it something like 13_fuzzymatching_TRLA.R- https://github.com/rebeccajohnson88/qss20_s21_proj/blob/main/code/03_fuzzy_matching.R

  • work on edits to that script where, in places where that script uses the investigations data, substitute in the TRLA data- we can either touch base to go over general script logic before you dig into it or once you've started making modifications; the basic logic is: (1) dedupe both datasets (for jobs, you can just read in the ones with the job_group_ids added; for TRLA, maybe skip this step), (2) fuzzy match between the two, (3) add back the rows removed during the deduplication

New step on predictive ML

  • gen_topk_dummies for selected_cat_large (the threshold should make the final unique values in the col around 100); see the sketch after this list
  • #25
  • run a random forest classifier with outcome variables[1]
  • run a naive Bayes model
  • get the confusion matrix and test accuracy (should be similar to B_feature_matrix_prep.ipynb)
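
An illustrative sketch of the top-k dummy step (this is not the project's gen_topk_dummies function; the column names and k are placeholders):

import pandas as pd

def topk_dummies(df, col, k=100, prefix=None):
    """Keep the k most frequent values of `col`, lump the rest into OTHER, then one-hot encode."""
    top_values = df[col].value_counts().nlargest(k).index
    collapsed = df[col].where(df[col].isin(top_values), other="OTHER")
    return pd.get_dummies(collapsed, prefix=prefix if prefix is not None else col)

## e.g., for each high-cardinality column listed in selected_cat_large:
## dummy_block = topk_dummies(pre_df, col, k=100)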

Take a look at the fuzzy matching code

# use the find status function and put into a new column
h2a <- h2a %>%
  mutate(status = find_status(CASE_STATUS)) # RStudio is taking forever to view so hard for me to tell if this worked
