
h2a_oversight's People

Contributors

camguage, euniceyliu, rebeccajohnson88


h2a_oversight's Issues

Updating analysis of text addendums

Start of script here (may want to move out of notebook form for language translation part): https://github.com/rebeccajohnson88/h2a_oversight_paper/blob/main/code/20a_combine_clean_addendums.ipynb

General notes: based on your qss20 code, with the following changes: (1) keep the unit of analysis at the section level rather than concatenating into a single job description (we can concatenate later, but we may want to keep sections separate); (2) rather than detecting Spanish based on keywords, shift to language model-based detection

Next steps (feel free to skip over the section name/numbering cleaning updates for now):

  • Generalize this part to run with all addendums (may want to move to .py): make sure the function works if multiple languages are detected in the same string (may want to create a dummy example for that)
sample_add = addendum.sample(n = 200, random_state = 91988)

## example true positive in spanish:
## CASE_NUMBER: H-300-20063-372516

## test language detection code on a couple examples
examples = sample_add.loc[sample_add.CASE_NUMBER.isin(["H-300-20063-372516",
                                                     "H-300-19316-139384"])].copy()
examples

### for eunice: not sure this is robust enough
### to deal with multiple languages in the same
### part of the text, so might want to generalize
from langdetect import DetectorFactory, detect_langs

## langdetect is non-deterministic by default; fix the seed for reproducibility
DetectorFactory.seed = 0

def detect_onestr(one_str):

    ## detect_langs returns a list of candidate languages with probabilities
    res = detect_langs(one_str)

    ## keep only the first (most probable) candidate as [language code, probability]
    top = res[0]
    return [top.lang, top.prob]


## run detection once per row, then add language and probability to dataframe
detected = [detect_onestr(one_str) for one_str in examples.SECTION_DETAILS]

examples['lang'] = [d[0] for d in detected]

examples['lang_prob'] = [d[1] for d in detected]

  • Ping me when done w/ previous step and I can share some example code for language translation - it takes a bit of time to run so we should only run on the ones with some non-English languages
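
One possible way to approach the multi-language check from the first bullet, and the "only translate the non-English ones" filter from the second, is to detect language per sentence-like chunk rather than per whole section. The sketch below is an illustration under assumptions, not the notebook's code: `language_shares`, `needs_translation`, and `toy_detect` are hypothetical names, the sentence splitting is deliberately crude, and `toy_detect` is a keyword-based stand-in so the dummy example runs without any language model installed.

```python
import re
from collections import defaultdict


def language_shares(text, detect_fn, min_chars=15):
    """Return {language code: share of characters} across sentence-like chunks.

    detect_fn is any callable mapping a string to a single language code;
    it is injected so the splitting logic can be checked without a model.
    """
    # Split after sentence-ending punctuation followed by whitespace, or on newlines
    chunks = [c.strip() for c in re.split(r"(?<=[.!?;])\s+|\n+", text)]
    chars_by_lang = defaultdict(int)
    for chunk in chunks:
        # Very short chunks tend to give unreliable detections, so skip them
        if len(chunk) < min_chars:
            continue
        chars_by_lang[detect_fn(chunk)] += len(chunk)
    total = sum(chars_by_lang.values())
    return {lang: n / total for lang, n in chars_by_lang.items()} if total else {}


def needs_translation(shares, base_lang="en"):
    """True if any detected chunk is in a language other than base_lang."""
    return any(lang != base_lang for lang in shares)


def toy_detect(chunk):
    """Hypothetical stand-in detector: flags Spanish by a few common words."""
    spanish_words = {"el", "la", "los", "las", "de", "que", "para", "trabajadores"}
    tokens = set(re.findall(r"[a-záéíóúñ]+", chunk.lower()))
    return "es" if len(tokens & spanish_words) >= 2 else "en"


# Dummy example with two languages in the same string
mixed = (
    "Workers will harvest and pack vegetables by hand. "
    "Los trabajadores deben llegar a tiempo para el trabajo."
)
english_only = "Workers will harvest and pack vegetables by hand."

mixed_shares = language_shares(mixed, toy_detect)
english_shares = language_shares(english_only, toy_detect)
print(mixed_shares)
print(needs_translation(mixed_shares), needs_translation(english_shares))
```

With langdetect installed, its `detect` function could be passed as `detect_fn` in place of `toy_detect`. It can raise an exception on strings with no usable text, so a small wrapper may be needed before running this over all addendums.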

Updating fuzzy matching

Focal script: can keep as same script number and just make direct edits - https://github.com/rebeccajohnson88/h2a_oversight_paper/blob/main/code/03_fuzzy_matching.R

  • Use the here library, the R project, and relative paths to avoid hardcoding the pathname. Point DATA_DIR to the new Dropbox folder (accessible via here if you clone the repo within Dropbox and use relative paths), so it'll look something like ../../h2a_all_data/something
  • Edit this chunk to keep all in regardless of certification status
approved_only <- h2a %>%
  filter(status_cleaned == "- CERTIFICATION" | status_cleaned == "- PARTIAL CERTIFICATION") %>%
  filter(EMPLOYER_NAME != "") %>%
  mutate(state_formatch = ifelse(EMPLOYER_STATE == "", 
                                 WORKSITE_STATE, EMPLOYER_STATE))
  • Edit this chunk to keep in all investigations regardless of registration act or naics code
investigations_filtered <- investigations %>%
  filter((`Registration Act` == "H2A" | `Registration Act` == "FLSA" | `Registration Act` == "MSPA") & 
           (str_detect(naic_cd, "^11") | naic_cd %in% h2a_NAICS$NAICS_CODE))
  • Remove view statements from final version
  • Make sure final merged data retains the following cols in addition to others: naic_cd, status_cleaned, and Registration Act (might want to rename cols in compliance action data to lowercase/no spaces)
  • For the saveRDS and csv outputs, make sure they're written to the Dropbox folder and not the repo

Translation addendums next steps
