qss20_s21_proj's People

Contributors

a-reaves, bess-n, camguage, camjwright, chadlass, davidkantor2000, eprice819, euniceyliu, grant-98, helenma0223, helenma0223234, jecalianos, jgourdeau, katherinechristie, kevinchen9865, lauraholland0905, lizard12995, nicole-student, rebeccajohnson88, sambrant


qss20_s21_proj's Issues

Geocoding next steps

## requires geopandas; PATH_TO_INT_OUTPUTS and the state point GeoDataFrames
## (AL_points, KY_points, etc.) are defined upstream in the notebook
import geopandas as gpd
from geopandas import sjoin

## read the 2016 TIGER/Line tract shapefiles for the six TRLA states
al_shapes = gpd.read_file(PATH_TO_INT_OUTPUTS + "tl_2016_01_tract")
ky_shapes = gpd.read_file(PATH_TO_INT_OUTPUTS + "tl_2016_21_tract")
la_shapes = gpd.read_file(PATH_TO_INT_OUTPUTS + "tl_2016_22_tract")
ms_shapes = gpd.read_file(PATH_TO_INT_OUTPUTS + "tl_2016_28_tract")
tn_shapes = gpd.read_file(PATH_TO_INT_OUTPUTS + "tl_2016_47_tract")
tx_shapes = gpd.read_file(PATH_TO_INT_OUTPUTS + "tl_2016_48_tract")

shapefile_crs = tx_shapes.crs

## intersections while setting the CRS/projection of the
## points to match the shapefile's CRS; could be made more efficient
## by storing the points and shapes in a length-2 list and then
## putting that list in a dict with state as key (see the sketch below)
al_join = sjoin(AL_points.set_crs(shapefile_crs), al_shapes, how="right", op="intersects")
ky_join = sjoin(KY_points.set_crs(shapefile_crs), ky_shapes, how="right", op="intersects")
ms_join = sjoin(MS_points.set_crs(shapefile_crs), ms_shapes, how="right", op="intersects")
la_join = sjoin(LA_points.set_crs(shapefile_crs), la_shapes, how="right", op="intersects")
tx_join = sjoin(TX_points.set_crs(shapefile_crs), tx_shapes, how="right", op="intersects")
tn_join = sjoin(TN_points.set_crs(shapefile_crs), tn_shapes, how="right", op="intersects")
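
As a rough illustration of the dict-based refactor mentioned in the comment above (not part of the original notebook; the dictionary name and structure are assumptions), the six joins could be collapsed into a loop:

## minimal sketch, assuming the point and shape GeoDataFrames above are already loaded;
## note that newer geopandas versions use predicate= instead of op=
state_data = {
    "AL": (AL_points, al_shapes),
    "KY": (KY_points, ky_shapes),
    "LA": (LA_points, la_shapes),
    "MS": (MS_points, ms_shapes),
    "TN": (TN_points, tn_shapes),
    "TX": (TX_points, tx_shapes),
}

joins = {
    state: sjoin(points.set_crs(shapefile_crs), shapes, how="right", op="intersects")
    for state, (points, shapes) in state_data.items()
}
## joins["TX"] then plays the role of tx_join above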

Match row-number discrepancy diagnostic code, for reference (deleted from the final script but preserved here):


library(dplyr)

## indices of rows matched by fastLink in dataset A and dataset B
indices_approved = approved_matches$matches$inds.a
indices_approved_b = approved_matches$matches$inds.b
sprintf("In the match object, there are %s unique indices for A; %s for B",
        length(unique(indices_approved)),
        length(unique(indices_approved_b)))
both_ind = intersect(indices_approved, indices_approved_b) # see that they're the same, just in a different order
indices_dropped = setdiff(rownames(approved_only), indices_approved)

## inspect the rows that were dropped from the match
approved_only$row_id = rownames(approved_only)
View(approved_only %>% filter(row_id %in% indices_dropped) %>% select(dedupe_fields, EMPLOYER_NAME))

Paper Feedback To-Do's

Writing Edits

  • "Executive Summary" at the beginning- can put under the new heading for it
  • Jamaica statistic in 2nd paragraph of section 1.1
  • Should "H-2" in first paragraph of section 1.2 be "H-2A"?
  • 2nd paragraph of section 1.3- Feedback: "This paragraph only reports the first mode of enforcement. Please also provide the other modes to complete this paragraph." RJ: "I think for revisions, we could switch phrasing to drawbacks of complaint-driven enforcement and then the next sections outline different ways to improve."
  • 2nd paragraph of section 1.3- Need citation for claim at the end of the paragraph
  • Last paragraph of section 1.4.2- Elaborate on "actual issues" statement
  • Section 1.4.3- Define trusted messengers and provide citation reference
  • Section 3.1- Only list research questions, and move methods/findings to the relevant sections
  • Section 3.1 Question 3- confirm that Cam got the abbreviations (ACS, NAICS, SOC) correct and see if we need more explanation
  • Section 4.1- Provide source reference/website link for the data program
  • Section 4.1- Add footnote explaining how the links link out to publicly-available code
  • Section 4.1- Address "Please clarify" note
  • Section 4.1- Add footnote to clarify what CASE_STATUS means
  • Section 4.2- Add footnote to provide more detailed information about the data, including the webpage
  • Section 4.3- Clarify the dataset used
  • Section 4.3- Add footnote to explain the National Trafficking Hotline
  • Cameron- check with RJ about "N ~ 155"
  • Section 5.1- Explain models in detail, identifying predictors of concern and the outcome variables in the models
  • Section 5.3- Add footnote to describe "fuzzy matching algorithm" in detail
  • Section 5.5.1, second paragraph- DOL comment: "It is unclear for what purpose these three reasons/statements serve. Please clarify and consider only reporting the stringent definition of an investigation"
  • Section 5.5.1, third paragraph- Clarify statement and whether outcome variable is continuous or categorical
  • Section 5.6- Clarify what the "similar criteria" are
  • Section 5.7- Clarify/add footnote with source reference for GroupShuffler() from Scikit learn
  • Section 6.1- Don't use "very low" or other relative language, quantify instead
  • Section 6.1- Clarify what test set refers to
  • All of Section 6.1- Need consistent rounding rule for computing percentages
  • Section 6.2.2- clarify what job orders means in that context
  • Section 6.3- Indicate data sources used when pooling from all states across 2014-2021
  • Section 6.3.1- Try to introduce LASSO without its abbreviation non-awkwardly, and also provide citation references
  • Section 6.3.2, second paragraph- Report quantitative data in the text instead of describing just "higher" or "lower" risk so readers don't need to read the table. Apply throughout the report where applicable.
  • Same paragraph- avoid using "higher risk" and throughout the report do not report differences when they are not statistically significant
  • Section 6.4- Provide the data in the text (in the paragraphs labeled 1, 2, and 3)
  • Section 6.4, last sentence- Quantify and avoid using "higher concentrations"
  • Footnote 6- Add time and link to press release if available
  • Move table 7 and 8 to an appendix "specifically describing the text analysis process"
  • Section 7- Elaborate on distinct entities being "flagged" based on the results of the data analyses
  • Section 7- Summarize data to support the claim that "it was less useful to predict which entities are likely to face local, TRLA-based oversight of activities versus federal oversight"
  • Section 7.3- Provide citation reference for risk-based sampling strategy
  • Move Supplementary Materials section into the appendices

Figure/Table Edits

  • Table 1- Add source citation and year below table. Also do this for other tables/figures where data are cited
  • Table 2- 3 notes in Overleaf (columns not aligned, add note to explain terms so table can stand alone, and move tables 2/3 along with detailed descriptions of analysis process to an appendix)
  • Appendix Table 16 (and other instances of Appendix) mentioned in the text, but no appendices in the report
  • Mentioned just before Table 3 in Overleaf- Add table listing research questions and the corresponding data sources/analysis methods used
  • Table 3- Not strictly speaking a table, they recommend revising
  • Section 6.2.1- Add time period covered in "6,000 - 8,000" range
  • Figure 2- "1. Please use “WHD investigation” to be consistent with the legend. 2. Please add a note to clarify the differences between WHD investigations and violations. 3. Please use a note to spell out all the acronyms in the tables or figures as they need to be able to stand alone without the need to read the text unless these is a glossary page provided upfront in the report."
  • Figure 3- "1. Please do not use a question in the title and revise it to an informative title, including what and how classified. Please apply this comment to revise the figure titles throughout the report where applicable. 2. Please place values on the bars (could be rounded to whole numbers) for clarity as these two figures intend to present counts and percentages. Please apply this comment to the other figures which report counts or percentages. 3. The text in the legend seems to be cut off. Please adjust and apply this comment throughout this report where applicable."
  • Section 6.2.1- Provide the actual data for the four categories in the text (from Figure 4)
  • Section 6.2.1- Provide reasons/data to support conclusion that "TRLA intake records (which largely derive from direct outreach workers to workers)---capture different employers in their enforcement purview"
  • Figure 4- "Please provide the sample sizes (denominators) for the percentage data in the note for clarity. Please apply this comment to revise all the related tables or figures."
  • Figure 5- "Please add a note to explain “job orders” and “Investigations.” Please also explain why the areas after 2020-03-01 are shaded and have no data in the note. Please clarify whether the vertical axis for “Investigations” is counts or percent. Please apply this comment to Figure 6 as well. The x-axis is hard to read. Since it is “rounded to month,” please consider using “Jan-2019,” “Mar-2019” ... instead. Please apply this comment to Figure 6 as well."
  • Section 6.2.3- CG check with RJ how to replace \sim
  • Figure 7- Clarify what the values on the bar represent and use consistent formats, use "WHD investigations" consistently throughout the report, adjust the data labels that are hard to read
  • Figure 9- Avoid using symbols in the labels, like "TRLA intakes?", revise "had confirmed violations" in the title to "confirmed WHD-found violations" to be consistent with the axis, consider removing the None category
  • Section 6.2.3- provide in-text data to support the following claim: "We see that although the two types of oversight focus on generally similar types of employers, the TRLA intake records reflect a slightly higher proportion of agricultural equipment operators than WHD investigations."
  • Figure 10- Provide unit of measurement on the horizontal axis, fix label on vertical axis that seems to be cut off (I think "Farmworkers and Laborers, Crop, Nursery, and" should have "Greenhouse" at the end)
  • Section 6.3.1- Elaborate on what "training/estimation set employers" means here
  • Figure 11- "1. We recommend moving this figure on the results of various tests to an appendix specifically on the modeling process to keep the main report focused on the findings. 2. Please add a note below the figure to spell out and explain all the acronyms in this figure."
  • Figure 12- "1. It is unclear what “label” means here in this comparison. Please clarify and also explain how to interpret these two figures in more detail in the text. 2. As the figures need to stand alone without the need to read the text, please add a note to explain the analytic model and the sample sizes used for the data shown in this figure and in Figure 13."
  • Table 4- "Please include the model applied to generate these probabilities in the title for clarity.
    Please also provide a note to explain the analytic model and the sample size used to generate these predictions. Please also apply this comment to provide analytic models/methods and sample sizes for the other figures or tables where applicable."
  • Figure 17- "Please also add a legend for gray and darker gray in this map"- are they just referring to the areas where the county borders are clustered...?
  • Section 6.5- Explain in a note why the word "translat" from the text analysis is not "translate"
  • Section 6.5- "Please consider revising" term "leading indicator" because our design was observational and not experimental
  • Figure 18- Explain stemming
  • Figure 19- Use more specific title including what data sources are used and how classified

Paper to dos

Writing major revisions: @lizard12995 @cassaydavis

  • Need to write an executive summary
  • 2-column table with glossary -- one column is abbreviation; other column is what it means. Things to include: WHD, TRLA, DOL, others?

Writing minor revisions:

  • Finish all overleaf comments in Sections 1-3 and 7 (intro, gaps, present report, and discussion)- @lizard12995 @cassaydavis
  • Add underlines to links in Sections 1-3 and 7 due to 508 compliance (with exception of references and links to other sections, so just external urls- can be done using \underline{}) - @lizard12995 @cassaydavis
  • Finish all overleaf comments in Sections 4, 5, and 6 (data, methods, results) - @camguage

Figure edits that can be made in overleaf (captions/alt text): @camguage

  • Finish all overleaf comments

Edits that require new code or figure revisions: @camguage and @rebeccajohnson88

  • Figure 3: Place values on the bars (could be rounded to whole number). Apply this comment to the other figures which report counts or percentages

508 Compliance Edits

  • The blue of citations and script links is too light- switch to #0064A8 blue
  • Underline links if possible
  • Links not working as is- fix here will depend on final submission format
  • Need alt text for each figure (tables too?)
  • Make sure all tables are tables- 1-1 cells, no merged, no split (could be an issue with conversion to Word)

ACS percentages code to borrow from

## requires pandas and numpy
import pandas as pd
import numpy as np

## melt to long format
df_acs_long = pd.melt(df_acs, id_vars = ['county', 'state', 'tract']).sort_values(by =
                                    ['tract', 'county'])

## split the variable name into prefix (table id) and suffix columns
df_acs_long[['variable_prefix', 'variable_suffix']] = df_acs_long['variable'].str.split('_', n = 1, expand = True)

## manually indicate which prefix columns don't follow the pattern
prefixes_perc_notrelevant = ['B01002','B25031','B25107']
df_acs_long['perc_NA'] = np.where(df_acs_long.variable_prefix.isin(prefixes_perc_notrelevant),
                                    1, 0)

## tables with an extra level of hierarchy (handled separately)
prefixes_perc_extrahier = ['B25026', 'B25123']
df_acs_long['perc_extrahier'] = np.where(df_acs_long.variable_prefix.isin(prefixes_perc_extrahier),
                                        1, 0)

## group by county, tract, and variable prefix to generate
## percentages (and later variable names)
df_acs_long_toiterate = df_acs_long[df_acs_long.perc_NA == 0].copy()
group_co_tract_varg = df_acs_long_toiterate.groupby(['county', 'tract', 'variable_prefix'])



################################## Generate percentages: auto-calc #################################

## for each (county, tract, table) group, divide each estimate by the first
## row in the group (the table total) to get a percentage
df_acs_long_percentage = []
for group, data in group_co_tract_varg:
    tract = data.tract.iloc[0]
    county = data.county.iloc[0]
    prefix = data.variable_prefix.iloc[0]
    row_list_group = []
    for i in range(1, data.shape[0]):
        numerator = data.value.iloc[i]
        denominator = float(data.value.iloc[0])
        percentage = numerator / denominator
        row = [county, tract, prefix]
        row = row + [data.variable_suffix.iloc[i], percentage]
        row_list_group.append(row)
    df_acs_long_percentage.append(pd.DataFrame(row_list_group))

percentages_all_groups = pd.concat(df_acs_long_percentage)
percentages_all_groups.columns = ['county', 'tract', 'variable_prefix',
                                  'variable_suffix', 'percentage']

## combine prefix and suffix back into a single variable name
percentages_all_groups['variable_prefix_suffix'] = percentages_all_groups.variable_prefix + "_" + percentages_all_groups.variable_suffix
percentages_all_groups.drop(columns = ['variable_prefix', 'variable_suffix'], inplace = True)


####################################### Reshape percentages to wide ##################################

## create a county-tract id to pivot on
percentages_all_groups['county_tract'] = percentages_all_groups.county.astype(str) + "_" + percentages_all_groups.tract.astype(str)

percentages_all_groups.drop(columns = ['county', 'tract'], inplace = True)

## pivot to one row per county-tract, one column per variable
percentages_wide_pivot = percentages_all_groups.pivot(index = 'county_tract',
                                 columns = 'variable_prefix_suffix',
                                 values = 'percentage')

percentages_wide_pivot_reset = percentages_wide_pivot.reset_index()

## other cols to drop
percentages_wide_pivot_reset.drop(columns = ['b01001_002e',
                                      'b01001_026e'], inplace = True)
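
For reference, the group loop above could likely be replaced by a vectorized groupby, sketched below under the same assumption the loop makes (the first row in each county/tract/prefix group is the table total used as the denominator); this is not from the original script:

grp = df_acs_long_toiterate.groupby(['county', 'tract', 'variable_prefix'])['value']

## divide every estimate by the first value in its group (the assumed total)
df_acs_long_toiterate['percentage'] = df_acs_long_toiterate['value'] / grp.transform('first')

## drop the denominator rows themselves, mirroring the loop's range(1, ...)
percentages_vectorized = df_acs_long_toiterate[grp.cumcount() > 0].copy()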

CG questions for RJ

  • Pushed updated figures to github under \output\figs. If these look good I can add in Overleaf
  • One DOL comment on this sentence ("Figure 4 presents these as percentages to account for the low counts of investigations relative to overall employers") says "Please also provide the actual data for the four categories in the text for clarity." 1) I'm not quite sure what 4 categories they're referring to, and 2) with the percentages now on the figure directly, should I include this data in the text as well? Should I include all or just part of it?
  • Figure 11, which is now Figure 19 after moving to appendix (called Model Performance: any WHD investigation) needs the acronyms spelled out/explained. Is there any chance you could spell these out for me?
  • We need to "summarize data" to support the claim that "it was less useful to predict which entities are likely to face local, TRLA-based oversight of activities versus federal oversight" (line 862). Which data/figures should I point to here?
  • Line 353- Sorry but I need another clarification on this one. When I add a footnote, what "ones" should I add that are filtered out- just the status codes like "CERTIFICATION EXPIRED"? Also is this what they're asking us to clarify when their question is about script 2?
  • Per "yep I can share an outline of this" below, wanted to confirm you already implemented the whole table as Table 1?
  • As I was turning the one-column tables into figures, I am running into an issue because what is now Figure 20 contains a table that is longer than a page. This isn't working inside \begin{figure} ... \end{figure}. Not sure what the fix is here. Should I keep it as a table, suppress the name "table", and write a caption* starting with "Figure 20 ... "?
  • For Figures 6 and 7 (in case the numbers get switched around, they are the ones beginning with "Monthly patterns pre and during COVID-19..."), I have 2 questions. 1) Are the vertical axes for investigations counts or percents? Just so I can specify. 2) They suggest changing the x-axis to labels like "Jan-2019", etc. I would go ahead and implement, but I cannot seem to find the code that creates these figures
  • Line 735 says "pooling data from all states across 2014-2021" and we are asked to specify the data sources. Is this just the WHD compliance action data?
  • Lines 738/9 uses the term "training/estimation set employers" and we are asked to specify what this means. If you or Helen could do this that would be awesome
  • Bottom of page 31, I have a footnote link that is hanging over the margin that I cannot seem to fix. Not sure if you knew any fixes or if copyeditor could handle
  • Figures 12 and 13 (both titled "Predictions versus true investigations label") require clarification for how to interpret them in the text, clarification for the word "label" in the title, and "a note to explain the analytic model and the sample sizes used for the data shown." I unfortunately do not understand these figures enough to add these changes
  • On table 4, they're asking for more information about the analytic model/sample size (and to apply this where else is appropriate). If you or Helen could help with this that would be great
  • Line 873, we have "it was less useful to predict which entities are likely to face local, TRLA-based oversight of activities versus federal oversight" and they ask us to summarize the data to support the claim. Was hoping to get help as this seems to be about ML

From Lizzie:

  • Line 121 - can you validate that what I wrote is correct?
  • Line 164 - question about source citation

Exploratory data analysis

  • Load the Wage and Hour Division enforcement data linked to in the README and give some basic descriptives: how businesses are identified (business name, business address, EIN or tax ID), whether there's a field indicating whether the employer is part of the H-2 visa program, the number of rows, and the span of years
  • Similarly, load the H-2 jobs data and start to explore --- @rebeccajohnson88 will check with @TRLegalAid on whether we can get a static copy of the data behind the interface

Report empirical things

Must do:

  • Input TRLA not WHD and WHD not TRLA accuracy results
  • Find feature importances for those
  • add within-COVID over-time variation (could move to "Maybe do"): generate graphs; add to the report along with refreshed versions of the others
  • basic text comparisons (could move to "Maybe do")

Maybe do:

  • Go back to rawer data sources and find link between attorney agent name and attorney firm (probably trla raw-er data; think this was excluded pretty early from h2a disclosures data because not collected consistently across years)
  • Predict WHD violations conditional on investigation
  • Heatmap stuff

Notes on intake data (migrated from slack)

Dispositions:

Closed = TRLA involvement with the case has concluded. A case with this disposition was never rejected. We took it on in some capacity (anywhere from just giving advice to taking on significant litigation) and then the case was closed.
Open = TRLA has accepted the case and it is in progress. The client and their case qualify for our services (re: income, immigration status, and area of law)
Pending = TRLA is evaluating the case before it is accepted. This stage could involve completing an intake application by asking/answering clarifying questions about the applicant and their case.
Rejected = TRLA declined to take the case. This often happens because of ineligibility. The client may be over-income, may stop responding to messages from us after repeated tries, or may be outside of our geographic reach.

How to code adverse parties:

Case # is tied to an individual client; Lead Case # is one number that links those cases together (for a group case, which includes multiple individual cases).

Any case (or group case) could have multiple adverse parties tied to that case. A set of individual cases in a group could also all have the same adverse party.
Lead case AP would apply to all individual cases in the group.

If multiple adverse parties are present, you'd want to separate them out.
So, if the individual AP is missing, use the Lead AP. You have to decide if a "count" is tied to an individual client or to a group case.
If the Lead AP is missing, that is fine; use the individual case AP.
If both are present, you could separate them out into two cases for the first occurrence, or for each, depending on how you've decided to count.

Share case IDs for illogical values, like some of the timestamps or states?


Script 7 notes

  • At the end of script 3, we'll want to output these pre-deduped datasets (investigations_cleaned and approved_only) with ids:
## name_city_state and id_vector_1 for approved_only are assumed to be created earlier in the script
approved_only <- approved_only %>%
  mutate(id = match(name_city_state, id_vector_1))

## same pattern for the investigations data
investigations_cleaned <- investigations_cleaned %>%
  mutate(name_city_state = sprintf("%s, %s, %s", name, city, st_cd))

id_vector_2 <- unique(investigations_cleaned$name_city_state)

investigations_cleaned <- investigations_cleaned %>%
  mutate(id = match(name_city_state, id_vector_2))


  • Discuss deduplication and whether to do it in script 03 or script 07 --- can use fastLink within the same dataset to deduplicate and create ids for distinct businesses --- here's some example code from a different project

For this note, were those matches within the same dataset or across two datasets? It seems like it could be a threshold issue in the earlier script, or a need to do the within-dataset deduplication in that script.

Also, there are issues like DEL'S GRASS FARMS, LTD, SAN ANTONIO, TX showing up as multiple unique entries: one without the apostrophe, one with a period after LTD., and one without the LTD.

library(dplyr)
library(fastLink)

## fields used to decide whether two rows refer to the same business
dedupe_fields = c("derived_businessname",
                    "derived_businessownername",
                   "derived_street")

## match the dataset against itself to find duplicate businesses
er_matches <- fastLink(dfA = er_todedupe,
                       dfB = er_todedupe,
                       varnames = dedupe_fields,
                       stringdist.match = dedupe_fields,
                       dedupe.matches = FALSE)
er_deduped = getMatches(dfA = er_todedupe,
                       dfB = er_todedupe,
                       fl.out = er_matches)

## add multiple flag and a more descriptive id name
er_wnewid = er_deduped %>%
          group_by(dedupe.ids) %>%
          mutate(is_multiple_rows = ifelse(n() > 1, TRUE, FALSE)) %>%
          ungroup() %>%
          rename(derived_biz_id_distinct = dedupe.ids)



  • Once deduplication done and ready for outcomes, would construct categorical variable with a few different values: (1) no match to investigations, (2) match and findings_start_date >= job_start_date but findings_end_date > job_end_date, (3) match but findings_start_date < job_start_date, (4) match and findings_start_date >= job_start_date and findings_start_date <= job_end_date
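
As a purely illustrative pandas sketch of that outcome coding (the real implementation would live in the R scripts; the merged frame and the match-indicator column are assumptions, and date columns are assumed to already be datetimes):

import numpy as np
import pandas as pd

## assumed: one row per job with investigation fields merged on; a missing
## investigations_group_id means no match to investigations
is_match = merged["investigations_group_id"].notna()

conditions = [
    ~is_match,
    is_match & (merged["findings_start_date"] >= merged["job_start_date"]) & (merged["findings_end_date"] > merged["job_end_date"]),
    is_match & (merged["findings_start_date"] < merged["job_start_date"]),
    is_match & (merged["findings_start_date"] >= merged["job_start_date"]) & (merged["findings_start_date"] <= merged["job_end_date"]),
]
choices = [
    "no_match",
    "match_findings_extend_past_job_end",
    "match_findings_start_before_job",
    "match_findings_start_during_job",
]
## np.select uses the first condition that matches, so rows that satisfy both
## the second and fourth conditions resolve to the second in this ordering
merged["outcome_category"] = np.select(conditions, choices, default="other")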

Startup notes

Geolocation data:

Main challenge: getting tract identifiers onto the violations data if they don't already exist

  • Look at the address fields available in the violations data (exploratory notebook here you can copy over to your repo and build on: https://github.com/rebeccajohnson88/qss20_s21_proj/blob/main/code/00_explore_WHD_data.ipynb). Subset to ones with any H2A violations
  • If lat and long don't exist already in the data, either use zip codes as the geographic identifier or geocode using the address and a geocoding API. The Google Maps geocoder, the Geocod.io API, and the Census geocoder are a few options; you'll need to do this in batches, probably in a .py script, given the runtime
  • Once you have lat/long, transform those into spatial points data and intersect them with census tracts to find which census tract a given location is located in
  • Use the Census API or data explorer to pull American Community Survey (ACS) data for relevant years/variables
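
As a rough sketch of that last step (not project code; the vintage, variable id, and state FIPS code are placeholders), ACS tract data can be pulled from the public Census API with requests:

import requests
import pandas as pd

ACS_URL = "https://api.census.gov/data/2019/acs/acs5"   # assumed 5-year vintage
params = {
    "get": "NAME,B01003_001E",   # B01003_001E = total population, as an example variable
    "for": "tract:*",
    "in": "state:48",            # Texas; loop over the TRLA state FIPS codes
    # some vintages also require a county qualifier (e.g. "state:48 county:*") or an API key
}
resp = requests.get(ACS_URL, params=params)
resp.raise_for_status()
rows = resp.json()               # first element is the header row
acs_tx = pd.DataFrame(rows[1:], columns=rows[0])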

Ways to make scope more manageable: possibly focusing on states within TRLA catchment area ('TX', 'MS', 'LA', 'KY', 'AL', 'TN') to have fewer things to geocode and fewer sets of tracts to pull; could also increase geographic unit of analysis to zip code or county, but less interesting; if time, could add housing locations data (in this repo)

DOL staffing:

Main challenge: getting the OPM workforce data into a usable format/investigating granularity of staffing data

  • Visit the site containing OPM workforce data across many years and read through the documentation (https://www.opm.gov/data/index.aspx). Things to investigate: (1) I am pretty sure it already gets down to the granularity of the Wage and Hour Division within DOL, but you'll want to see what the granularity of job titles is (e.g., does it have a category for inspectors or just general DOL staff?) and what the granularity of geographic identifiers is (e.g., does it contain specific office locations, county-level sites, or just state?)
  • Write code that can iterate over the different OPM years, subset the rows you need, and rowbind across years. You can decide on things like location and year range using violations data - https://github.com/rebeccajohnson88/qss20_s21_proj/blob/main/code/00_explore_WHD_data.ipynb
  • Aggregate h2a violations to whatever aggregation level makes sense given DOL data (e.g., state; broader regions)
  • Can then do both descriptives and if you want, ML or statistical modeling --- if doing latter, make sure that staffing is measured pre-investigation, so you're looking at how staffing in 2017, for instance, predicts investigations either later that year or in the following year (depending on time granularity of OPM data)

Text as data:

Main challenge: the data are already acquired, so data acquisition is straightforward; the main goal is using the problem set two text-as-data content to find interesting patterns (expanding beyond that if relevant) and also merging the jobs/violations data

ML/stats:

Main challenge: since this is not explicitly covered in course content, writing ML code in sklearn or other packages

  • Use DOL quarterly jobs data as the universe of potential employers--- see note for above group about where to find/relevant script
  • Decide on time unit of analysis --- e.g., is it going to be all employers ever, employers repeated across months, employers repeated across quarters, etc
  • Use violations data to create different binary labels in employer dataset- "investigated that month," "violation that month", etc
  • Predict those labels using whatever features are interesting w/in jobs data

Next steps coding

Some cleanup of rowbinding code @rebeccajohnson88

  • Relative rather than absolute path names
## relative path name -- should work regardless of user if we have same directory structure (can sync at the meeting)
OUTPUT_DIR = "../../qss20_finalproj_rawdata/summerwork/raw/"

## the absolute pathname has my username in it, e.g.
## /Users/rebeccajohnson/Dropbox/qss20_s21_proj/code

  • Is Cam's code in the issue fully subsumed in script 02?

Cleaning of job addresses/geocoding: maybe @helenma0223234?

Check ACS codebook and pull additional ACS characteristics for all census tracts: maybe @Grant-98 and @eunice30718

Fuzzy matching between job orders data and violations data: maybe @camguage

Coding tasks

Pulling and rowbind DOL quarterly H-2A disclosures data from 2014 onwards (just csv or excel files and not the addendums)

General link: https://www.dol.gov/agencies/eta/foreign-labor/performance
2014-2020 are in the "H-2A program" section
2021 is up top

  • @helenma0223234 leading - either scraping (looking at the page structure and getting links; seems doable) or writing out the links; pull the files and write copies of the file from each year to local Dropbox qss20_finalproj_rawdata/summerwork/raw; have an argument in the script to either pull from source or not

@eunice30718 and @camguage

  • #6
  • Rename cols to reconcile across years as needed/if renaming is clear
  • #7

Geocoding locations of rowbound data (after rowbinding above):

Fuzzy matching between violations and quarterly disclosure data (after rowbinding above):

  • Notes: code here uses Python recordlinkage; switching to fastLink in R (I can provide some starter code)

Running list of TRLA intake data questions

Script here: https://github.com/rebeccajohnson88/qss20_s21_proj/blob/main/code/11_cleanTRLA_intake.Rmd

Category: adverse party/opponent:

I did the hierarchical coding discussed on slack in this chunk of code, so "missing all" means that a given row had none of the following fields: adverse party organization, lead case ap organization, adverse party name, lead case ap name

About 47% are missing all four fields:


  • I assume that some of these we'll fill in when we consolidate rows across case ids/lead case ids, but for others, should we just exclude?
## hierarchical consolidation: prefer AP org, then lead-case AP org,
## then AP name, then lead-case AP name
trla_orig = trla_orig %>%
      mutate(derived_opponent_consolidated = case_when(!is.na(adverse_party_organization) ~ adverse_party_organization,
                                               is.na(adverse_party_organization) & !is.na(lead_case_ap_organization) ~ lead_case_ap_organization,
                                               is.na(adverse_party_organization) & is.na(lead_case_ap_organization) &
                                              !is.na(adverse_party_name) ~ adverse_party_name,
                                              is.na(adverse_party_organization) & is.na(lead_case_ap_organization) &
                                              is.na(adverse_party_name) & !is.na(lead_case_ap_name) ~ lead_case_ap_name,
                                              TRUE ~ NA_character_),
             ## record which field the consolidated opponent came from
             derived_opponent_source = case_when(!is.na(adverse_party_organization) ~ "AP org",
                                               is.na(adverse_party_organization) & !is.na(lead_case_ap_organization) ~ "Lead AP org",
                                               is.na(adverse_party_organization) & is.na(lead_case_ap_organization) &
                                              !is.na(adverse_party_name) ~ "AP name",
                                              is.na(adverse_party_organization) & is.na(lead_case_ap_organization) &
                                              is.na(adverse_party_name) & !is.na(lead_case_ap_name) ~ "Lead AP name",
                                              TRUE ~ "Missing all"))
  • There are some government institutions as opponents (SSA, Texas Workforce, DOL)-- assume we exclude? The code currently constructs a flag for them, but I'm not sure if there are others I should add
## flag for non-employer (government) opponents, built within the same mutate()
derived_is_notemp = case_when(grepl("Social Security Administration|Department of Labor|Workforce", derived_opponent_consolidated) ~ TRUE,
                                      TRUE ~  FALSE)

Intake date: was the pull all cases, or pre-2014? There are a non-negligible number in the 1990s, and possibly one data entry error that should be recoded to maybe 1991?


Outcome variable next steps

Commit with changes here: 26b30b9

Main things I changed were:

  1. Renamed ids to more informative names. Now, there are four ids:
  • job_group_id: id for each distinct group of jobs
  • job_row_id: id for each distinct row in jobs
  • investigations_group_id: id for each distinct group of investigations
  • investigations_row_id: id for each distinct row in investigations
  2. Changed the fuzzy matching code so that jobs are the "base" data that we merge matches onto --- the key thing is we want to retain all jobs and use the presence or NA of matches to code outcomes; the previous code used investigations as the base data, so it lost the majority of jobs that didn't match to an investigation

  3. Also noticed that the names function was producing a bug, so updated that and reran all the fuzzy matching/deduplication


I stopped around line 330; I think it's hopefully straightforward to finish setting all flags to FALSE so things aren't rerun; script 7 can then be restricted to the date-based outcome variable coding

Modeling notes

  • I think you updated this, but we don't want to assume that all booleans are outcomes to predict. Instead, I'd just start a list of outcome vars like:
outcomes = ['outcome_XX', 'outcome_YY']

## existing approach below treats every boolean column as an outcome
df_y = pre_df.select_dtypes(bool)
print("Outcome variables to predict are: " + str(df_y.columns.values))
y1 = list(df_y.iloc[:, 0])
y2 = list(df_y.iloc[:, 1])
# remove them from the preMatrix ... because that would be too easy!
pre_df = pre_df.select_dtypes(exclude=['bool'])
  • For predictors, I don't think the capital letter heuristic will work once we add in more data, so I'd: (1) read in the csv file with the merged job disclosure data, (2) use the list of columns for that as the initial features to consider using (see the sketch below)
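
A minimal sketch of that approach (the outcome names come from the later issue notes; the file path and id columns are assumptions):

import pandas as pd

## explicit outcome list instead of "every boolean is an outcome"
outcomes = ["outcome_is_investigation_overlapsd", "outcome_is_viol_overlapsd"]
df_y = pre_df[outcomes]

## candidate predictors: columns that appear in the merged job disclosure data,
## excluding ids and the outcomes themselves
merged_jobs = pd.read_csv("intermediate/job_combined_acs_premerging.csv")   # assumed path
id_cols = ["job_row_id", "job_group_id"]
candidate_features = [c for c in merged_jobs.columns if c not in outcomes + id_cols]
X = pre_df[[c for c in candidate_features if c in pre_df.columns]]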

Predictive modeling next steps

  • separate out preprocessing script from modeling script
  • read in the following new datasets (csv versions) in the preprocessing script: (1) "WHD data" (all states; no TRLA outcomes)- whd_violations.csv; (2) "TRLA data" (7 states; WHD and TRLA outcomes)
  • get existing preprocessing code to work with the new data, making sure that the ACS variables are newly added as features
  • Make sure the ACS variables are included in the mean imputation (separate for train and test) and do min-max scaling separately in each dataset
  • Ping me after preprocessing is done for me to look over
  • Then work on the model estimation: (a) for the WHD data, this should iterate over two labels: (1) investigations (outcome_is_investigation_overlapsd) and (2) violations (outcome_is_viol_overlapsd); (b) for the data combining TRLA, use the variable outcome_compare_TRLA_WHD and use np.where to define four outcomes: investigated by either, investigated by TRLA only, investigated by WHD only, investigated by both
  • For the model estimation script, write the following outputs for each model and label to the model_outputs folder in GitHub (so not Dropbox): (1) confusion matrix (can melt to long format where one col is each of the four categories, another col is the count), (2) coefficients/feature importances, (3) continuous predicted probabilities for each row (should have job_row_id attached); see the sketch after this list
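
A minimal sketch of the requested output format in the last bullet (the model choice, train/test objects, and id vector are placeholders, not the project's actual estimation code):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

## (1) confusion matrix melted to long format: one row per (true, predicted) cell
labels = sorted(pd.unique(y_test))
cm = confusion_matrix(y_test, model.predict(X_test), labels=labels)
cm_long = (pd.DataFrame(cm, index=labels, columns=labels)
             .rename_axis("true_label")
             .reset_index()
             .melt(id_vars="true_label", var_name="predicted_label", value_name="count"))
cm_long.to_csv("model_outputs/confusion_matrix_long.csv", index=False)

## (2) feature importances (coefficients for linear models)
pd.DataFrame({"feature": X_train.columns, "importance": model.feature_importances_}) \
  .to_csv("model_outputs/feature_importances.csv", index=False)

## (3) continuous predicted probabilities with job_row_id attached
probs = pd.DataFrame(model.predict_proba(X_test), columns=model.classes_)
probs["job_row_id"] = job_row_ids_test.values   # assumed id vector aligned with X_test
probs.to_csv("model_outputs/predicted_probabilities.csv", index=False)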

Notes on paper deadline - September 1st

The Researcher(s) will produce a final paper and will submit it via email to the DOL Summer Data Challenge address ([email protected]) that will:

  1. present the results of the analysis;

  2. discuss the questions that the research was not able to answer;

  3. identify other gaps in knowledge;

  4. recommend new data collection efforts that should be considered to address these gaps;

  5. propose what the DOL should fund in the future.

The researchers will deliver a draft paper by September 2021. They will deliver a briefing to DOL on the paper in October 2021 on a date yet to be determined.

Things to check for script 3

  • Creation of new unique ids if deduping results in jobs spanning multiple states or investigations spanning multiple states

Currently, the code seems to contain a step to create a modified id if an investigations_group_id spans multiple states. The main issue is:

  • This should be done before the match step and immediately after the deduping because we want the correct ids to be in the deduped dataset before the random sample
  • Change 1: right after the investigations_deduped_clean creation on line 255, do the new id creation to create new ids for same name entities that span multiple states- this needs to occur here because otherwise the sampling is with the wrong ids. Then, use that new id to do the sampling of 1 within an id
  • Change 2: similar for jobs, after creating approved_deduped_clean around line 208, do the new id creation and sampling within that id
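
A small pandas sketch of that id fix (illustrative only; the real change belongs in the R script, and the column names are assumptions):

import pandas as pd

def qualify_ids_spanning_states(df: pd.DataFrame, id_col: str, state_col: str) -> pd.Series:
    ## number of distinct states each deduped id covers
    n_states = df.groupby(id_col)[state_col].transform("nunique")
    ## append the state to the id only when the id spans more than one state
    return df[id_col].astype(str).where(
        n_states == 1,
        df[id_col].astype(str) + "_" + df[state_col].astype(str),
    )

## e.g. before sampling one row per id:
## investigations_deduped_clean["investigations_group_id"] = qualify_ids_spanning_states(
##     investigations_deduped_clean, "investigations_group_id", "state")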

Descriptives next steps

Questions when reading in fuzzy matching outputs to construct outcomes variables:

  • Correct to assume that jobs_group_id is the final identifier for a unique employer (and safe to drop jobs_group_id_part1 and jobs_group_id_part2 from the df if those were intermediates created for the cross-state stuff)?
  • For @eunice30718: when I read in the ACS merged file, it has the original variable names and not the ones renamed from the codebook. Is there an easy fix somewhere to get the renamed versions? If not, we can maybe skip it, but the renamed ones are important / I know that was a big focus of the earlier scripts. Maybe just do the renaming in R by melting and reading in the codebook?
library(data.table)  ## for fread()
acs_pred = fread("intermediate/job_combined_acs_premerging.csv")


Next steps:

  • Ideally fix ACS var naming (can just do in R by reading in codebook if needed)
  • Line 144: work on deduplication of ACS-jobs merge
  • Write two outputs
  • Visualizations using ACS characteristics of three groups: (1) no investigation, (2) investigation but no violation, (3) investigation and violation (TRLA and WHD)
  • Visualizations of things like naics codes and attorney agent between the three groups

Next steps on TRLA

  • review the commit here and see if there are any other changes you'd want to make- one part I wasn't sure about was standardizing the case dispositions between the two sources in this code chunk, but we also aren't likely to use that other than descriptively
## standardize dispositions across the two disposition fields
## (fragment from a larger mutate() call)
derived_casedispo = case_when(case_disposition_orig == "Closed" |
                                            case_disposition_more == "CLO" ~ "Closed",
                                            case_disposition_orig == "Pending" |
                                            case_disposition_more == "PEN" ~ "Pending",
                                            case_disposition_orig == "Open" |
                                            case_disposition_more == "ACC" ~ "Open/accepted",
                                            TRUE ~ "Unknown dispo"))
  • copy a version of this script and name it something like 13_fuzzymatching_TRLA.R- https://github.com/rebeccajohnson88/qss20_s21_proj/blob/main/code/03_fuzzy_matching.R

  • work on edits to that script where, in places where that script uses the investigations data, substitute in the TRLA data- we can either touch base to go over general script logic before you dig into it or once you've started making modifications; the basic logic is: (1) dedupe both datasets (for jobs, you can just read in the ones with the job_group_ids added; for TRLA, maybe skip this step), (2) fuzzy match between the two, (3) add back the rows removed during the deduplication

New step on predictive ML

  • gen_topk_dummies for selected_cat_large (the threshold should make the final unique values in the col around 100); see the sketch after this list
  • #25
  • run a random forest classifier with outcome variables[1]
  • run a naive Bayes model
  • get the confusion matrix and test accuracy (should be similar to B_feature_matrix_prep.ipynb)
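
An illustrative sketch of the top-k dummy step (this is not the project's gen_topk_dummies function; the column names and k are placeholders):

import pandas as pd

def topk_dummies(df, col, k=100, prefix=None):
    """Keep the k most frequent values of `col`, lump the rest into OTHER, then one-hot encode."""
    top_values = df[col].value_counts().nlargest(k).index
    collapsed = df[col].where(df[col].isin(top_values), other="OTHER")
    return pd.get_dummies(collapsed, prefix=prefix if prefix is not None else col)

## e.g., for each high-cardinality column listed in selected_cat_large:
## dummy_block = topk_dummies(pre_df, col, k=100)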

Take a look at the fuzzy matching code

# use the find status function and put into a new column
h2a <- h2a %>%
  mutate(status = find_status(CASE_STATUS)) # RStudio is taking forever to view so hard for me to tell if this worked
