qss20_slides_activities

Contributors: jacklauer31, jlau31, rebeccajohnson88
qss20_slides_activities's Issues

PS4 1.2 A, B, C

A. Using the original pharma press release (so the one before stripping punctuation/digits), use spaCy to extract all named entities from the press release.

B. Print the unique named entities with the tag: LAW

C. Use Google to summarize in one sentence what the RICO named entity means and why this might apply to a pharmaceutical case (and not just a mafia case...)

Output for unique named entities with tag LAW:

(screenshot omitted)
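A sketch of parts A and B; the spaCy calls are commented out since they require the en_core_web_sm model to be downloaded, and the entity list below is a hypothetical stand-in just to show the unique-LAW filtering step:

```python
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
# import spacy
# nlp = spacy.load("en_core_web_sm")
# doc = nlp(pharma)  # pharma = the raw press-release string
# ents = [(ent.text, ent.label_) for ent in doc.ents]

# Hypothetical stand-in entities to illustrate part B's unique-LAW filter:
ents = [("RICO", "LAW"), ("John Kapoor", "PERSON"),
        ("Boston", "GPE"), ("RICO", "LAW")]
unique_law = sorted({text for text, label in ents if label == "LAW"})
print(unique_law)  # ['RICO']
```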

PS2 1.2 Part A

Keep the above ~58k row dataset that is subsetted only to sentences involving incarceration. Then, further subset the rows to:

  • Rows where judgeid_derived == "judge_21"
  • Rows where simplified_offense_derived == "Narcotics"

Use shape to print the dimensions of the resulting dataframe

Concepts and resources: row subsetting using logical conditions; see above resources

Hint on output: we get 53 rows after this filtering step. Don't worry about matching this exactly but try to get in the ballpark range.
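A minimal sketch of the two-condition subset (the toy rows are made up; the column names match the pset):

```python
import pandas as pd

# Hypothetical toy version of the incarceration-only sentencing data
sent = pd.DataFrame({
    "judgeid_derived": ["judge_21", "judge_21", "judge_3"],
    "simplified_offense_derived": ["Narcotics", "Retail Theft", "Narcotics"],
})

# Combine the two logical conditions with & (each wrapped in parentheses)
narc_21 = sent[(sent.judgeid_derived == "judge_21") &
               (sent.simplified_offense_derived == "Narcotics")].copy()
print(narc_21.shape)  # (1, 2)
```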

PS2 1.1 Part A

  • Read in the data
  • Filter to sentences that involve incarceration (same Illinois Department of Corrections logic as in problem set one:
    incarceration is indicated by COMMITMENT_TYPE == "Illinois Department of Corrections")
  • Using the COMMITMENT_UNIT column that represents sentence length, filter out the following non-numeric sentence lengths: Term, Pounds, Dollars, Ounces (year, months, natural life, days, hours, and weeks should remain)
  • Filter to Black or White defendants

Concepts tested and resources: this question tests filtering rows based on logical conditions. Here are some resources:

Hints on output: we get 58289 rows remaining after this filtering. Don't worry about matching this exactly but try to get in the ballpark range.
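A sketch of the filtering chain, using isin / ~isin for the unit and race filters (toy rows; the real unit strings may be formatted slightly differently):

```python
import pandas as pd

# Toy rows illustrating the filters (the real data has many more columns)
df = pd.DataFrame({
    "COMMITMENT_TYPE": ["Illinois Department of Corrections"] * 3 + ["Probation"],
    "COMMITMENT_UNIT": ["Year(s)", "Pounds", "Months", "Year(s)"],
    "RACE": ["Black", "White", "Asian", "White"],
})

nonnumeric_units = ["Term", "Pounds", "Dollars", "Ounces"]
incarc = df[
    (df.COMMITMENT_TYPE == "Illinois Department of Corrections") &  # incarceration only
    (~df.COMMITMENT_UNIT.isin(nonnumeric_units)) &                  # drop non-numeric units
    (df.RACE.isin(["Black", "White"]))                              # Black or White defendants
].copy()
print(incarc.shape)  # (1, 3)
```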

PS4 2.4

2.4 Add topics back to main data and explore correlation between manual labels and our estimated topics (10 points)

A. Extract the document-level topic probabilities. Within get_document_topics, use the argument minimum_probability = 0 to make sure all 3 topic probabilities are returned. Write an assert statement to make sure the length of the list is equal to the number of rows in the doj_subset_wscores dataframe

B. Add the topic probabilities to the doj_subset_wscores dataframe as columns and create a column, top_topic, that assigns each document its highest-probability topic (e.g., topic 1, 2, or 3)

C. For each of the manual labels in topics_clean (Hate Crime, Civil Rights, Project Safe Childhood), print the breakdown of the % of documents with each top topic (so, for instance, Hate Crime has 246 documents-- if 123 of those documents are coded to topic_1, that would be 50%; and so on). Hint: pd.crosstab and normalize may be helpful: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.crosstab.html

D. Using a couple press releases as examples, write a 1-2 sentence interpretation of why some of the manual topics map on more cleanly to an estimated topic than other manual topic(s)

Resources:

Output: NOTE - do not worry if your numbers are slightly different or if your topics have different labels (though they should range from 1...3 rather than 0...2). The labels are affected by the random_state argument at the model-estimation stage, but the results should broadly show hate crimes and Project Safe Childhood each mapping cleanly to one topic, with civil rights more split.

(screenshot omitted)
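The part C breakdown can be sketched with pd.crosstab and normalize="index" (hypothetical labels/topics):

```python
import pandas as pd

# Hypothetical manual labels and estimated top topics
df = pd.DataFrame({
    "topics_clean": ["Hate Crime", "Hate Crime", "Civil Rights", "Civil Rights"],
    "top_topic": ["topic_1", "topic_1", "topic_1", "topic_2"],
})

# normalize="index" converts each row (manual label) to proportions
breakdown = pd.crosstab(df.topics_clean, df.top_topic, normalize="index")
print(breakdown)
```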

PS4 3 A and B

3. Extend the analysis from unigrams to bigrams (10 points)

In the previous question, you found top words via a unigram representation of the text. Now, we want to see how those top words change with bigrams (pairs of words)

A. Using the doj_subset_wscore data and the processed_text column (so the words after stemming/other preprocessing), create a column in the data called processed_text_bigrams that combines each consecutive pair of words into a bigram separated by an underscore. E.g.:

"depart reach settlem" would become "depart_reach reach_settlem"

Do this by writing a function create_bigram_onedoc that takes in a single processed_text string and returns a string with its bigrams structured similarly to above example

Hint: there are many ways to solve but zip may be helpful: https://stackoverflow.com/questions/21303224/iterate-over-all-pairs-of-consecutive-items-in-a-list
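One possible implementation following the zip hint (matching the example output above):

```python
def create_bigram_onedoc(processed_text: str) -> str:
    """Turn 'depart reach settlem' into 'depart_reach reach_settlem'."""
    words = processed_text.split()
    # zip over the list and the list shifted by one to get consecutive pairs
    return " ".join(f"{w1}_{w2}" for w1, w2 in zip(words, words[1:]))

print(create_bigram_onedoc("depart reach settlem"))  # depart_reach reach_settlem
```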

B. Print the id, processed_text, and processed_text_bigrams columns for the press release with id = 16-217

Output from example (fine if yours differs slightly):

(screenshot omitted)

PS3 1.3.1

1.3.1 Converting to upper (2 points)

A. Convert the names EMPLOYER_NAME and Name to uppercase using list comprehension rather than df.varname.str.upper() (it's fine to do a separate list comprehension line for each of the two columns)

B. Print a random sample of 15 values of each result

C. Assign the full vector of uppercase names back to the original data, writing over the original EMPLOYER_NAME and Name columns

Resources:
- Activity code with list comprehension: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/solutions/00_pandas_datacleaning_solutions.ipynb
- Sampling from a list without replacement using the random module: https://note.nkmk.me/en/python-random-choice-sample-choices/
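A sketch of parts A and B, using the random module mentioned in the resources (toy names; the pset samples n=15):

```python
import random

# Toy employer names; in the pset these come from jobs.EMPLOYER_NAME / debar.Name
names = ["american crawfish llc", "Acme Farms", "delta harvesting"]

names_upper = [name.upper() for name in names]  # part A: list comprehension, not .str.upper()
sample = random.sample(names_upper, 2)          # part B: sample without replacement
print(sample)
```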

Output: since everyone's random sample will be different, we don't have a uniform output; here's an example of the uppercase conversion for the American Crawfish employer in jobs: AMERICAN CRAWFISH LLC

(screenshot omitted)

PS5 2.3

2.3 Find the (1) number of rows in the database, (2) number of distinct states, (3) number of distinct years (3 points)

Interpret the results - how do you think the data is structured in terms of states and years (eg long format where each state repeated; wide format)?
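All three quantities can come from one query. Here is a sketch run against an in-memory SQLite stand-in; the table and column names are assumptions, since the pset uses the course database connection:

```python
import sqlite3

# Build a tiny stand-in table so the query is runnable here
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE math_gencompare (state TEXT, year INTEGER, value REAL)")
con.executemany("INSERT INTO math_gencompare VALUES (?, ?, ?)",
                [("CA", 2015, 275.0), ("CA", 2017, 276.0), ("TX", 2015, 274.0)])

query = """
SELECT COUNT(*)              AS n_rows,
       COUNT(DISTINCT state) AS n_states,
       COUNT(DISTINCT year)  AS n_years
FROM math_gencompare
"""
result = con.execute(query).fetchone()
print(result)  # (3, 2, 2)
```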

Output:

(screenshot omitted)

PS5 2.1

2.1 Load database credentials (if you haven't already) and establish a connection (1 point)

Note: you need to be on eduroam or VPN

Output of successful connection if you print connection object:

(screenshot omitted)

Problem set one questions and responses (S21)

Group portion

  • Question: for question 1.1 ("Print the mean and median number of charges per case/participant"), should we calculate the mean and median numbers per case and then per participant?

  • Response: ok with any reasonable answer but originally was one combined --- I'd basically think of it as grouping by both case and participant, find the number of charges for each case-participant group, and then the mean/median of that (sorry a bit confusing but just results from fact of participants nested in cases)

  • Question: for question 1.2.2, if the category does not contain aggravated, arson, homicide, vehicle/vehicular, or battery, what should appear in the simplified_offense_derived column?

  • Response: for other categories that aren't changed, the simplified_offense_derived column should be set to the original value for that category (eg if Retail Theft isn't changed by any of the above, simplified_offense_derived stays as Retail Theft)

  • Question: for question 1.3, if the race == NaN, should that be categorized as none of the above?

  • Response: in short doesn't matter since we'll be subsetting to black and white; so can either code as true for other or leave as missing/false for all.

  • Question: just wanted to clarify that we should use sentenceymd_derived column for part 2.1?

  • Response: for all the time based questions going forward, you can use either sentenceym_derived or sentenceymd_derived (whichever makes more sense for the question)

  • Question: what does winsorize in this part of 1.3 mean: "Winsorize the top 0.01% of values to be equal to the 99.99th percentile value pre-winsorization. Call this age_derived"

  • Response: word winsorize isn't important for answering (but some discussion here!: https://en.wikipedia.org/wiki/Winsorizing). In this question, it involves: (1) finding what the 99.99th percentile value for age at incident is in the raw data, (2) if a given row is above that 99.99th percentile value, recode it to the 99.99th percentile value (so recode it rather than drop it). Recommend this command: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html
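A sketch of that recipe with pandas quantile and clip, which recodes rather than drops values above the cap (toy ages):

```python
import pandas as pd

age = pd.Series([20, 25, 31, 44, 1000])   # toy ages with one erroneous value
cap = age.quantile(0.9999)                # (1) 99.99th percentile pre-winsorization
age_derived = age.clip(upper=cap)         # (2) recode values above the cap to the cap

print(age_derived.max() == cap)  # True: the outlier is pulled down to the cap
```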

  • Question: for 2.3.3, I understand the comparison by time frame but I’m wondering what type of disparities are we looking to compare? Raw number of retail theft sentences, sentence length for retail thefts or a different column?

  • Response: good question! It should be (1) just looking at retail thefts, (2) looking at what percentage of defendants are incarcerated (so repeating aggregation from 2.3.2 but doing the time variation)

  • Question: if working locally, why might I be having issues loading the sentencing_cleaned.pkl file and what can I do about it?

  • Response: pkl files are a filetype that preserve things like the datatypes (e.g., datetime) over csv. The downside is that there are what are called serialization protocols that differ between versions of python. The pkl I shared with you corresponds to python 3.8+ (version on jupyter hub).

If you want to work locally and you don't have python 3.8+ installed yet (can check versions by going to terminal and typing python --version) you can load the csv and use the following code to recast the two datetime columns back to the correct type (screenshot below is me checking that this works)

sentencing_csv['sentenceymd_derived'] = pd.to_datetime(sentencing_csv.sentenceymd_derived)
sentencing_csv['sentenceym_derived'] = pd.to_datetime(sentencing_csv.sentenceym_derived)

(screenshot omitted)

Question: for questions like 2.3.2 and 3.3 that say "print table", is printing a pandas dataframe fine?

Response: yep just a dataframe or whatever showing the results is fine!

Question: I am still a bit confused on what we are supposed to do for this question "Use mathematical notation to write out each of the proportions using summation notation," after reviewing the in-class Latex example. Are we supposed to export the table we created to latex and in latex make the summation notation?

Response: (1) use the menu at top to change the cell to markdown, (2) using math mode (e.g, open and closed $ $), write out the proportions using mathematical notation--- eg the numerator or count of Black defendants is $\sum_{i} Black_{i}$

Individual portion

Question: for question 3.2, are we supposed to print the median sentence length for all defendants/judges (ie plot grouped by judge and by race) or just by overall values (ie just one median value each for white and black and only grouped by race)

Response: yep, to unpack this a bit: for each judge with at least 20 Black and at least 20 white defendants, plot the (1) median sentence length for Black defendants and (2) median sentence length for white defendants (factor variable on x axis for each judge_id who qualifies; group by race)

  • Filter to narcotics offenses and judges who have 20+ Black and 20+ white defendants
  • Group by judge and race to calculate median sentence length --- so graph should have 2 bars or points for each judge- one showing median sentence length for that judge's Black defendants; other for his/her white defendants

Question: for question 3.3, who are our focal defendants, every defendant for a narcotics offense for judge 21? Does using the "same narcotics offenses" mean we're only looking at Black and White defendants like in 3.2, and does that also mean we shouldn't consider other races when doing the matched pairs? And lastly, if every defendant is a focal defendant, won't we get repeat pairs just in different orders?

Response:

  • Sample: every Black or White defendant with a narcotics offense sentenced by judge 21
  • Yep keep as Black or White only so continue excluding other races
  • Yep, you iterate so that every defendant is a focal defendant; you'll have some repeat matches in the output, and some focal defendants may not have any matches

PS5 1.4

1.4 Reproduce the queries from 1.1 and 1.2 using a user-defined function (4 points)

Create a function, construct_naep_query that takes in two arguments:

  • year: this should be a list with all years (so if one year, single element list; if multiple years, list with those years)
  • place: this should be a string with the name of the state or jurisdiction to pull

Have the function return the query and make sure it's identical to the queries you wrote for 1.1 and 1.2 (can use assert or other checker)

No output to display since output is just the query you're being asked to produce
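One possible shape for the function. NOTE: the base URL and parameter names below are illustrative assumptions; check the NAEP API documentation for the exact field names your 1.1/1.2 queries used:

```python
def construct_naep_query(year, place,
                         base_url="https://www.nationsreportcard.gov/DataService/GetAdhocData.aspx"):
    """year: list of years (single-element list if one year);
    place: string name of the state or jurisdiction to pull.
    Parameter names here are hypothetical placeholders."""
    year_str = ",".join(str(y) for y in year)
    return (f"{base_url}?type=data&subject=mathematics&subscale=MRPCM"
            f"&grade=8&variable=GENDER&jurisdiction={place}&year={year_str}")

q = construct_naep_query([2013, 2015, 2017, 2019], "CA")
# An assert like the pset asks for would compare q to your hand-written 1.2 query:
# assert q == query_from_1_2
print(q)
```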

PS3 2.3

2.3 Put the code from 2.2 into a fuzzy matching function that generalizes some steps (8 points)

You want to see how the matches change if you add the city and not only state as a field and also want to automate the process of matching a bit to try different distance thresholds.

A. Extract the City from the City, State column of debar

B. Convert that new city column to uppercase and convert the EMPLOYER_CITY column in jobs to uppercase

C. Write a function surrounding the code in recordlinkage that you wrote in problem 2.2 (so you don't need to recode the package from scratch) that (1) takes in each dataset, (2) blocks on two-digit state, and (3) fuzzy matches on employer name and employer city. See below notes on partial versus full credit for function structure.

D. Execute the function with a couple different string distance thresholds and print the resulting matches for each

6 out of 8 points: function takes arguments for input datasets, varname to block on, two varnames to fuzzy match on, string distance function, and string distance threshold

8 out of 8: above but function is also general enough that it takes a variable # of strings to match on--- so should work if you either execute just using employer name or also work if you execute using employer name and employer city

Concepts and resources: same as above. One hint is that for us, the easiest way to get to the full credit solution was to feed the function a dictionary where each key corresponds to one string variable to fuzzy match on; the values contain that variable's name in each dataset, the distance metric, and the distance threshold.
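The overall shape of the 6-out-of-8-credit function can be sketched without recordlinkage, using difflib's similarity ratio as a stand-in for the string-distance metric (all data and names below are hypothetical; with recordlinkage you would instead use its Index blocking and string comparison):

```python
from difflib import SequenceMatcher

def fuzzy_match(df_a, df_b, block_var, match_vars, threshold, sim=None):
    """Block on block_var, then keep pairs whose similarity on every
    variable in match_vars meets threshold. df_a/df_b are lists of dicts
    here for simplicity; in the pset they are dataframes."""
    sim = sim or (lambda x, y: SequenceMatcher(None, x, y).ratio())
    matches = []
    for a in df_a:
        for b in df_b:
            if a[block_var] != b[block_var]:
                continue  # blocking step: only compare within the same state
            if all(sim(a[v], b[v]) >= threshold for v in match_vars):
                matches.append((a, b))
    return matches

jobs = [{"state": "LA", "name": "AMERICAN CRAWFISH LLC", "city": "BATON ROUGE"}]
debar = [{"state": "LA", "name": "AMERICAN CRAWFISH", "city": "BATON ROUGE"}]
print(len(fuzzy_match(jobs, debar, "state", ["name", "city"], 0.75)))  # 1
```

Because match_vars is a variable-length list, the same function works whether you match on employer name only or on name and city, which is the generalization the full-credit note asks for.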

Example output: IMPORTANT- yours will likely differ because of string distance metrics etc. Here's our example though for a distance threshold of 0.9 for both business name and city name and then a threshold of 0.75 for each

(screenshot omitted)

Independent study set up tasks

  • Make sure you're able to login to DataCamp
  • Clone this qss20_slides_activities repo locally
  • Make sure you have python 3.8 installed (3.7 is okay but 3.8 works better for pkl stuff if not too big an issue to upgrade from 3.7) and have anaconda to work with jupyter notebooks

First parts of the course will be (can discuss good deadlines Monday)

PS2 1.2 Part C

Using the results from Part B, use pd.concat or another approach to create a dataframe that compares the (1) race and sentence length for the focal defendant to (2) the sentence length for other defendants. Using this dataframe, show this comparison for focal defendant: CASE_PARTICIPANT_ID == 808109112733

PS5 2.2

2.2 Run a query to select all columns and the first 5 rows of the math_gencompare database to explore structure (2 points)

Read the results in as a pandas dataframe and print the result

Output:

(screenshot omitted)

PS5 1.3

1.3 Create a line plot to show variation in the scores across years (2 points)

Using the results from 1.2, create a plot where the x axis has the year and the y axis is the math scores (value in dataframe), and there are separate lines/colors for male versus female students

Start the limits of the y axis at 270 and add informative labels

Our graph (doesn't need to match exactly in terms of aesthetics/labels):

(screenshot omitted)

PS4 1.1 A

1. NLP on one press release (10 points)

Focus on the following press release: id == "17-1204" about this pharmaceutical kickback prosecution: https://www.forbes.com/sites/michelatindera/2017/11/16/fentanyl-billionaire-john-kapoor-to-plead-not-guilty-in-opioid-kickback-case/?sh=21b8574d6c6c

The contents column is the one we're treating as a document. You may need to convert it from a pandas series to a single string.

We'll call this press release pharma

1.1 part of speech tagging (3 points)

A. Preprocess the pharma press release to remove all punctuation / digits (so can use .isalpha() to subset)

Example output from first five tokens:

(screenshot omitted)

PS4 2.2

2.2 Create a document-term matrix from the preprocessed press releases and to explore top words (5 points)

A. Use the create_dtm function I provide (alternately, feel free to write your own!) and create a document-term matrix using the preprocessed press releases; make sure metadata contains the following columns: id, compound sentiment column you added, and the topics_clean column

B. Print the top 10 words for press releases with compound sentiment in the top 5% (so the most positive sentiment)

C. Print the top 10 words for press releases with compound sentiment in the bottom 5% (so the most negative sentiment)

Hint: for these, remember the pandas quantile function from pset one.

D. Print the top 10 words for press releases in each of the three topics_clean

For steps B - D, to receive full credit, write a function get_topwords that helps you avoid duplicated code when you find top words for the different subsets of the data. There are different ways to structure it but one way is to feed it subsetted data (so data subsetted to one topic etc.) and for it to get the top words for that subset.
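One possible get_topwords, assuming the document-term matrix is a pandas dataframe with one column per term (toy counts; the subsetting to one topic or sentiment tail happens before the call):

```python
import pandas as pd

def get_topwords(dtm_subset, n_top=10):
    """dtm_subset: rows = documents in one subset, columns = terms,
    values = counts. Returns the n_top terms by total count."""
    return dtm_subset.sum(axis=0).sort_values(ascending=False).head(n_top)

# Hypothetical 3-document, 4-term matrix
dtm = pd.DataFrame({"settlement": [2, 0, 2], "charge": [1, 1, 0],
                    "victim": [0, 3, 0], "court": [1, 0, 0]})
print(get_topwords(dtm, n_top=2))
```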

Resources:

Output for top words based on compound scores:

(screenshot omitted)

Output for top words based on the topics_clean:

(screenshots omitted)

PS2 1.2 Part D

Using the results from Part C, group by the focal defendant's race and find the proportion of that defendant's matches who had a LONGER sentence than the focal defendant

Concepts and resources: can use groupby and agg

Hints on output: since we're grouping by race in the full pool, the resulting output should have two values: one for the rate at which Black defendants have matches with higher sentences; another for the rate at which White defendants have matches with higher sentences
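The groupby/agg step might look like this sketch, where the comparison dataframe from Part C is assumed to have one row per focal-defendant/match pair (hypothetical column names):

```python
import pandas as pd

# Hypothetical stacked comparisons: one row per (focal defendant, match) pair
comp = pd.DataFrame({
    "focal_race": ["Black", "Black", "White", "White"],
    "match_longer": [True, False, False, False],  # did the match get a LONGER sentence?
})

# Mean of a boolean column within each group = the proportion that are True
rates = comp.groupby("focal_race")["match_longer"].mean()
print(rates)  # Black 0.5, White 0.0
```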

PS5 2.5

2.5 group by year and find the percentage of states where male scores are higher than females (4 points)

A. Write a query that (1) groups by year and (2) finds the percentage of states that have higher scores for males than females in this year

B. Print the resulting dataframe and interpret the results

Hint: you can either use a subquery to construct is_male_higher and then use it, or do it all in one query with a comparison; the AVG command is useful
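A sketch of the one-query approach with AVG over a CASE expression, demonstrated on an in-memory SQLite stand-in (the wide table layout and column names are assumptions):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scores (year INT, state TEXT, male REAL, female REAL)")
con.executemany("INSERT INTO scores VALUES (?,?,?,?)", [
    (2015, "CA", 276.0, 275.0), (2015, "TX", 273.0, 274.0),
    (2017, "CA", 277.0, 275.0), (2017, "TX", 275.0, 274.0)])

# AVG of a 1/0 indicator = the share of states where males score higher
query = """
SELECT year,
       AVG(CASE WHEN male > female THEN 1.0 ELSE 0.0 END) AS pct_male_higher
FROM scores
GROUP BY year
ORDER BY year
"""
rows = con.execute(query).fetchall()
print(rows)  # [(2015, 0.5), (2017, 1.0)]
```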

Output:

(screenshot omitted)

PS2 1.2 Part B

For each defendant sentenced by judge_21, you want to construct a "matched group" of defendants who:

  • Are the same exact age as the focal defendant (age_derived) and
  • Are the same gender as the focal defendant (is_male_derived) but
  • Differ in race from the focal defendant (is_black_derived/is_white_derived)

Write a user-defined function to find any/all matched defendants for each focal defendant of judge 21. You can structure the function in various ways but one way is to write a function similar to the class example where we iterate over different DC crimes (focal crimes) and find similar crimes. For this problem, we want to:

  • Iterate over unique defendants sentenced by judge 21 (use CASE_PARTICIPANT_ID to identify each unique defendant)
  • Find other defendants in the judge 21 pool who (1) have a different race from that focal defendant but (2) the same gender and age

Concepts and resources:

Hints on output:

  • Some focal defendants may not have any matches; they can be excluded from the results
  • In the way we wrote our function, each iteration of the function returns a single dataframe; because we execute via list comprehension, the output of this step is a list of dataframes (it's okay if you take a diff approach!)
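A minimal sketch of the matching function and the list-comprehension execution described above (toy pool; column names from the pset):

```python
import pandas as pd

# Hypothetical mini pool of judge-21 defendants
pool = pd.DataFrame({
    "CASE_PARTICIPANT_ID": [1, 2, 3],
    "age_derived": [30, 30, 41],
    "is_male_derived": [True, True, True],
    "race": ["Black", "White", "Black"],
})

def find_matches(focal_id, pool):
    """Return defendants with the same age and gender but different race."""
    focal = pool[pool.CASE_PARTICIPANT_ID == focal_id].iloc[0]
    return pool[(pool.age_derived == focal.age_derived) &
                (pool.is_male_derived == focal.is_male_derived) &
                (pool.race != focal.race)].copy()

# Executed via list comprehension -> a list of dataframes, one per focal defendant
match_dfs = [find_matches(i, pool) for i in pool.CASE_PARTICIPANT_ID.unique()]
print([len(m) for m in match_dfs])  # [1, 1, 0] -- defendant 3 has no match
```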

PS3 1.2

1.2 Try exact merge on business name (2 points)

  • Use the EMPLOYER_NAME field of the jobs dataset
  • Use the Name field of the debar dataset

A. Use pd.merge with an inner join on those fields to see whether there are any exact matches.

B. If there are exact matches, subset to the following columns and print the rows with exact matches:

  • Employer_NAME and Name
  • Date range of debarment (Start date and End date in debar)
  • Location from each data (City, State in debar and EMPLOYER_CITY and EMPLOYER_STATE in jobs)
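The inner join in part A might look like the following sketch (toy rows; column names from the pset):

```python
import pandas as pd

# Toy versions of the two datasets
jobs = pd.DataFrame({"EMPLOYER_NAME": ["AMERICAN CRAWFISH LLC", "ACME FARMS"],
                     "EMPLOYER_STATE": ["LA", "GA"]})
debar = pd.DataFrame({"Name": ["AMERICAN CRAWFISH LLC"],
                      "City, State": ["Baton Rouge, LA"]})

# Inner join keeps only rows with an exact name match in both datasets
exact = pd.merge(jobs, debar, left_on="EMPLOYER_NAME", right_on="Name",
                 how="inner")
print(exact.shape)  # (1, 4)
```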

Example output:

(screenshot omitted)

PS4 3 C, D, and E

C. Use the create_dtm function and the processed_text_bigrams column to create a document-term matrix (dtm_bigram) with these bigrams. Keep the following three columns in the data: id, topics_clean, and compound

D. Print the (1) dimensions of the dtm matrix from question 2.2 and (2) the dimensions of the dtm_bigram matrix. Comment on why the bigram matrix has more dimensions than the unigram matrix

Output (don't need to match number of columns exactly):

(screenshot omitted)

E. Find and print the 10 most prevalent bigrams for each of the three topics_clean using the get_topwords function from 2.2

(screenshot omitted)

PS3 1.3.2 Part C

C. Use that pattern in conjunction with re.sub and list comprehension to clean the columns in each dataset. Save the new columns as name_clean in each. Then, use row subsetting to (1) subset to rows that changed names and for:

  • debar print the Name and name_clean columns
  • jobs print the EMPLOYER_NAME and name_clean columns

Make sure to use the uppercase versions of the variables

Concepts and resources: for the last part of row subsetting and printing two columns in each dataframe, use the .loc examples shown here: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/solutions/00_pandas_datacleaning_solutions.ipynb

Output:

Head of changed employer names in jobs (your head may be different depending on order of data)

(screenshot omitted)

Head of changed employer names in debar:

(screenshot omitted)

Pset 1 questions and responses (W22)

QUESTION ONE:

Question: for question 1.1 ("Print the mean and median number of charges per case/participant"), should we calculate the mean and median numbers per case and then per participant?
Response: ok with any reasonable answer but originally was one combined --- I’d basically think of it as grouping by both case and participant, find the number of charges for each case-participant group, and then the mean/median of that (sorry a bit confusing but just results from fact of participants nested in cases)

Question For 1.1, "Does the data seem to enable us to follow the same defendant across different cases they're charged in?" I am a little confused about how to get started on this one. I've tried grouping by case id and finding various ways to look for unique case participants, but I can't seem to find something that works
Response: i think this is close but a bit off. to see if the same defendant can be followed across cases, i'd instead group by the case_participant_id and see the number of unique case ids within the group; if there is only 1 case for each case_participant_id, we can't follow

Question: I am a little confused with 1.2.1. I don't have to reformat the data, right? It's just saying for each row check if the original offense is the same as the changed offense, right? E.g.: sentencing_data["OFFENSE_CATEGORY"] == sentencing_data["CHANGED_OFFENSE_CATEGORY"]. I guess the part that's confusing me is "case-participant-charge"
Response:

  • Do you have to reformat the data? You'll be adding a new column to the data, is_changed_offense, that takes one of two values: True if OFFENSE_CATEGORY is not equal to UPDATED_OFFENSE_CATEGORY; False otherwise
  • What do we mean by case-participant-charge observations (rows)? This is just to note that you can use something like np.where for this question and keep the unit of analysis at the row level. For later questions, we will be aggregating over different charges within the same case, so this hint is just to indicate that it's okay to keep everything at the row level for now/not yet group the data.

Question: for 1.2.1, what does it mean to "construct an indicator"
Response: see above answer - means to add a new column, is_changed_offense, to your dataframe

Question: for 1.2.1, how might I print one example of a changed offense?
Response: there are various ways to do this that avoid "hard coding" a specific row number. One way is to (1) filter to offenses that have changed using row subsetting, (2) run df.head(1) to print the head

Question: for question 1.2.2, if the category does not contain aggravated, arson, homicide, vehicle/vehicular, or battery, what should appear in the simplified_offense_derived column?
Response: for other categories that aren’t changed, the simplified_offense_derived column should be set to the original value for that category (eg if Retail Theft isn’t changed by any of the above, simplified_offense_derived stays as Retail Theft)

Question: for 1.2.2, do I need to use advanced regex?
Response: nope; functions like str.startswith or str.replace that operate directly on pandas dataframe columns work fine

Question: for 1.2.2, I know how to generally structure the np.select but get stuck at the part of coding of leaving the others that i'm not recoding at their UPDATED_OFFENSE_CATEGORY. How might I do this?
Response: see the code on slide 21- https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/slides/w22_slides/qss20_w22_unit2_pandas.pdf. You can use the same "default" syntax within np.select. But for the default, instead of coding it to a single string, set default to df['UPDATED_OFFENSE_CATEGORY']. This serves the function of leaving everything that's not recoded at its original value.
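A small example of that default pattern (toy categories):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"UPDATED_OFFENSE_CATEGORY":
                   ["Aggravated Battery", "Retail Theft", "Arson"]})

conditions = [
    df.UPDATED_OFFENSE_CATEGORY.str.contains("Battery", na=False),
    df.UPDATED_OFFENSE_CATEGORY.str.contains("Arson", na=False),
]
choices = ["Battery", "Arson"]

# default = the original column, so unrecoded categories pass through unchanged
df["simplified_offense_derived"] = np.select(conditions, choices,
                                             default=df.UPDATED_OFFENSE_CATEGORY)
print(df.simplified_offense_derived.tolist())  # ['Battery', 'Retail Theft', 'Arson']
```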

Question: for 1.3, I'm getting this error: TypeError: bad operand type for unary ~: 'float'
Response: wherever you have str.contains, try including na=False inside the parenthesis

Question: for question 1.3, if the race == NaN, should that be categorized as none of the above?
Response: in short doesn’t matter since we’ll be subsetting to black and white; so can either code as true for other or leave as missing/false for all.

Question: for question 1.3, I'm stuck on this part of cleaning up the timestamps:
"For full credit, before converting, you notice that some of the years have been mistranscribed (e.g., 291X or 221X instead of 201X). Programmatically fix those (eg 2914 -> 2014). Even after cleaning, there will still be some that are after the year 2021 that we'll filter out later. For partial credit, you can ignore the timestamps that cause errors and set errors = "coerce" within pd.to_datetime() to allow the conversion to proceed."
Response: since we're not covering regex until later, and people have different COSC backgrounds, here's some code you can adapt for that cleaning. It relies on the re.sub command (some examples here: https://lzone.de/examples/Python%20re.sub) and the list comprehension we covered in lecture 1

## replace erroneous dates like 291X/221X with 201X
## (raw string on the search pattern avoids an invalid-escape warning)
sentence['tmp_clnsdate'] = [re.sub(r'2[1-9]([0-9]+)', r"20\1", str(date))
                            if bool(re.search(r'/2[1-9][0-9]+', str(date)))
                            else str(date)
                            for date in sentence.SENTENCE_DATE]

## cast as datetime
sentence['sentenceymd_derived'] = pd.to_datetime(sentence.tmp_clnsdate)

Question: what does winsorize in this part of 1.3 mean: "Winsorize the top 0.01% of values to be equal to the 99.99th percentile value pre-winsorization. Call this age_derived"
Response: word winsorize isn’t important for answering (but some discussion here!: https://en.wikipedia.org/wiki/Winsorizing). In this question, it involves: (1) finding what the 99.99th percentile value for age at incident is in the raw data, (2) if a given row is above that 99.99th percentile value, recode it to the 99.99th percentile value (so recode it rather than drop it). Recommend this command: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html

Question: Also a question on 1.3, do the values have to start at "judge_1"? for some reason, mine are starting at "judge_-1"
Response: take a look at this documentation: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.ngroup.html and pay particular attention to the parameters in the ngroup documentation

Question: In general, when recoding & cleaning variables how concerned should we be with monitoring / handling missing values? For example, in Question 1.3, some of the variables we’re working with, such as SENTENCE_JUDGE have missing values, but the directions don’t seem to address this explicitly.
Response: yep in general missingness is really important to deal with for social science research but for this pset, feel free to largely ignore except in cases where we give you explicit filtering instructions like in 1.4 — "Filter out observations where judge is nan or nonsensical"

Question: for question 1.3, should we include "male name no gender given" in the is_male_derived indicator? Or would this be other?
Response: either is fine!

Question: For part 1.3, I'm not quite sure how to go about assigning the judges different id numbers. Any tips/hints?
Response: i'd think about grouping by judges and reading the documentation for this command that assigns one id for each group: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.ngroup.html
you can then paste the judge_ prefix in a separate step if that makes it easier!
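A sketch of the ngroup recipe, with the +1 shift so the ids start at judge_1 (toy judges; ngroup numbers groups from 0 in sorted order):

```python
import pandas as pd

df = pd.DataFrame({"SENTENCE_JUDGE": ["Smith", "Jones", "Smith", "Lee"]})

# ngroup() assigns 0, 1, 2, ... per group in sort order; add 1 to start at 1,
# then paste the judge_ prefix in a separate step
df["judgeid_derived"] = ("judge_" +
                         (df.groupby("SENTENCE_JUDGE").ngroup() + 1).astype(str))
print(df.judgeid_derived.tolist())  # ['judge_3', 'judge_1', 'judge_3', 'judge_2']
```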

Question: for 1.3, should we omit rows where the SENTENCE_JUDGE is NA?
Response: don't worry too much about na's here (fine w/ whatever) as we filter them out in 1.4

Question: for 1.3, any hints on rounding the sentenceymd_derived column to the month-year only version (sentenceym_derived)?
Response: see notes here in response to class questions. one of the responses covers how to round time in the pset, which involves month from datetime objects; can also extract years and then paste month-year together (making sure result is still datetime)- https://docs.google.com/document/d/1yzV0V8GeuW2fXGQuO9iD0X2g-9E4ORgADv1ECR3L5cQ/edit#heading=h.y94z198yrx86

Question: for 1.3, how do I check if my conversion of sentence_judge to a string is working?

Response: check dtypes; a string column is represented as object

(screenshot omitted)

Question: For question 1.4 task 2, is it possible for a participant to receive 0 charge?
Response: for this part — "subset to a participant's primary charge and their current sentence (PRIMARY_CHARGE_FLAG is True and CURRENT_SENTENCE_FLAG is True). Double check that this worked by confirming there are no longer multiple charges for the same case-participant" —
it could be that both flags are false for a given participant in which case they'd be filtered out

Question: for question 1.4, is it okay if the number of rows in my filtered dataframe doesn't match the number of rows in the sentencing_cleaned.csv file I read in in part two?
Response: yep that's fine! no need to compare

Question: is 1.4 asking us to convert the data from long to wide format? where there's charge_1 charge_2 charge_3 as an individual column for each unique participant row?
Response: ah no, it's simpler than that; just remove rows (CASE_IDs) where more than one participant is charged in that case; so conversely, subset to case ids that are .isin a list of the ones that only have one participant

PS5 1.2

1.2 Write a query to pull 8th-grade mathematics scores in 2013, 2015, 2017, and 2019 from California by gender (1 point)

Same as 1.1 but pull the years 2013, 2015, 2017, and 2019 (search documentation for how to combine) in one query

Output subsetting to relevant cols (can either print all cols or just the relevant ones):

image

PS2 1.2 Part E

Write 1-2 lines commenting on the results from Part D. What other defendant or offense-level characteristics would you like to match on to investigate claims about racial disparities?

PS5 1.1

1.1 Write a query to pull 8th-grade mathematics scores in 2015 from California by gender (1 point)

  • Subject: mathematics
  • Subscale: MRPCM composite scale
  • Grade: 8
  • Year: 2015
  • grouping variable: GENDER

Print the output in dataframe format and briefly interpret; what do scores look like between the genders?

Output (subsetting to some relevant cols; feel free to subset or print all):

image

PS5 2.4

2.4 Construct a new variable, is_male_higher that takes the value of 1 if the math scores of males exceed that of females in that state and year (each row) (2 points)

Read in the results, print the head, and find the mean across all rows

Output:

image
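A sketch of constructing the indicator and taking its mean after reading the query results into pandas (the score column names are assumptions; match them to whatever your query returns):

```python
import pandas as pd

# Toy results frame: one row per state-year with male/female math scores
scores = pd.DataFrame({
    "state": ["CA", "CA", "WY"],
    "year": [2015, 2017, 2015],
    "male_score": [276.0, 277.0, 280.0],
    "female_score": [278.0, 275.0, 279.0],
})

# 1 if males scored higher than females in that state-year, else 0
scores["is_male_higher"] = (scores["male_score"] >
                            scores["female_score"]).astype(int)

# Mean across all rows = share of state-years where males scored higher
share_male_higher = scores["is_male_higher"].mean()
```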

Problem set two questions and responses (S21)

Question: I'm working on jhub and having difficulty loading states crosswalk from web

Response: now put a states_pset2.csv file in public_data within qss20_slides_activities (on jhub in shared/qss20) that you should be able to load using pd.read_csv() and use (can comment out web code)

Question: for 1.2, do we need to do merge diagnostics as was done in the example code?

Response: nope! there should be very few matches so no need to do diagnostics of differences

Question: is it okay for 1.4.3 if adding the city makes the matches worse (e.g., more false positives)

Response: yes that's totally fine if happens; main goal of exercise is seeing false negatives and false positives from different approaches to fuzzy matching/setting parameters at different values

PS4 1.3

A. Use a SentimentIntensityAnalyzer and polarity_scores to score the entire press release for its sentiment (use the raw string of the pharma press release without punctuation/digits removed)

Output for A (don't worry about matching exactly):

image

B. First, remove all punctuations and digits from the string. Then, remove all named entities. With this string, score the sentiment of the press release (hint: for removing named entities, an or condition in re.sub may be useful). Did the neutral score go up or down relative to the raw pharma press release in part A? Why do you think this occurred?

Output for B (don't worry about matching exactly):

image

C. With the version of the string from part B that removes named entities, try to split the press release into discrete sentences (hint: re.split() may be useful since it allows or conditions in the pattern you're looking for, which can help deal with special characters). Print the first 5 sentences of the split press release (there will not be deductions if there remain some erroneous splits; just make sure it's generally splitting)

Output for C (don't worry about matching exactly):

image

D. Score each sentence in the split press release and print the top 5 sentences in the press release with the most negative sentiment (use the neg score- higher values = more negative). Hint: you can use pd.DataFrame to rowbind a list of dictionaries; you can then add the press release sentence for each row back as a column in that dataframe and use sort_values()

Output for D (don't worry about matching exactly):

image

EXTRA CREDIT PS3 1.3.3 Parts C and D

C. Iterate over the name_clean column in debar and execute the regex. Use it to create two new columns in debar:
- co_name: A column for company (full name_clean string if no match; pattern before COMPANY if one extracted)
- ind_name: A column for individual (full name_clean string if no match; pattern before INDIVIDUAL if one extracted)

D. Print three columns for the rows in debar containing the negative example and positive example described above (county fair farm and cisco produce):

  • name_clean
  • co_name
  • ind_name
  • Violation

Note: as shown in the outcome there may be duplicates of the same company reflecting different violations

Output for part D:

image

PS5 2.6

2.6 Group by state and find the percentage of years where males score higher than females

A. Write a query that (1) groups by state and (2) finds the percentage of years that have higher scores for males than females in that state

B. Plot the results ordering the states from males higher all 4 years to males higher none of the years

Output (don't need to match aesthetics, but it does need to be ordered):

image
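A rough sketch of the group-by query, using an in-memory sqlite3 database as a stand-in for however you are executing SQL in the pset (table and column names are assumptions):

```python
import sqlite3
import pandas as pd

# Build a toy scores table with the 0/1 indicator from 2.4
conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "state": ["CA", "CA", "WY", "WY"],
    "year": [2015, 2017, 2015, 2017],
    "is_male_higher": [0, 1, 1, 1],
}).to_sql("scores", conn, index=False)

# AVG of a 0/1 indicator is a proportion; multiply by 100 for a percentage
q = """
SELECT state, 100.0 * AVG(is_male_higher) AS pct_male_higher
FROM scores
GROUP BY state
ORDER BY pct_male_higher DESC
"""
result = pd.read_sql_query(q, conn)
```

Sorting in the query (ORDER BY) gives you the ordering the plot asks for.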

PS4 1.2 D, E

D. You want to extract the possible sentence lengths the CEO is facing; pull out the named entities with (1) the label DATE and (2) that contain the word year or years (hint: you may want to use the re module for that second part). Print these named entities.

E. Pull and print the original parts of the press releases where those year lengths are mentioned (e.g., the sentences or rough region of the press release). Describe in your own words (1 sentence) what length of sentence (prison) and probation (supervised release) the CEO may be facing if convicted after this indictment (if there are multiple lengths mentioned describe the maximum). Hint: you may want to use re.search or re.findall

Output for D:

image

Output for E:

image
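For part D, once you have the spaCy entities, the filtering itself is plain `re`. A sketch (the `ents` list here is made up and stands in for `[(e.text, e.label_) for e in doc.ents]`):

```python
import re

# Stand-in for the (text, label) pairs pulled from the spaCy doc
ents = [("10 years", "DATE"), ("March 2019", "DATE"),
        ("RICO", "LAW"), ("three years", "DATE")]

# Keep DATE entities whose text contains "year" or "years"
year_pat = re.compile(r"\byears?\b", re.IGNORECASE)
year_dates = [text for text, label in ents
              if label == "DATE" and year_pat.search(text)]
```

For part E, you can then feed each of those strings into `re.search`/`re.findall` against the raw press release to pull the surrounding text.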

PS4 1.4

A. Subset the press releases to those labeled with one of three topics via topics_clean: Civil Rights, Hate Crimes, and Project Safe Childhood. We'll call this doj_subset going forward and it should have 717 rows.

Dimensions after part A:

image

B. Write a function that takes one press release string as an input and:

  • Removes named entities from each press release string
  • Scores the sentiment of the entire press release
  • Returns the length-four (negative, positive, neutral, compound) sentiment dictionary

Apply that function to each of the press releases in doj_subset.

Hints:

  • You may want to use re.escape at some point to avoid errors relating to escape characters like ( in the press release
  • I used a function + list comprehension to execute and it takes about 30 seconds on my local machine and about 2 mins on jhub; if it's taking a very long time, you may want to check your code for inefficiencies. If you can't fix those, for partial credit on this part/full credit on remainder, you can take a small random sample of the 717

C. Add the four sentiment scores to the doj_subset dataframe. Sort from highest neg to lowest neg score and print the top id, contents, and neg columns of the two most neg press releases.

Output for top 2 most negative (truncated):

image

D. With the dataframe from part C, find the mean compound score for each of the three topics in topics_clean using group_by and agg.

Output for part D:

image

E. Add a 1 sentence interpretation of why we might see the variation in scores (remember that compound is a standardized summary where -1 is most negative; +1 is most positive)

PS3 1.1

Load the following datasets stored in pset3_inputdata

  • Historical H2A debarments (debar.csv); call this debar
  • Q1 2021 H2A job postings (jobs.csv); call this jobs

Example output:

Head of debar data:

image

Head of jobs data (columns truncated):

image

PS4 1.1 B and C

B. With the preprocessed press release from part A, use the part of speech tagger within nltk to tag all the words in that one press release with their part of speech.

C. Using the output from B, extract the adjectives and sort those adjectives from most occurrences to fewest occurrences. Print a dataframe with the 5 most frequent adjectives and their counts in the pharma release. See here for a list of the names of adjectives within nltk: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/

Example output from part C:

image

PS3 2.1

2.1 Preprocessing location (2 points)

You want to block on state but notice that states in debar are a mix of two-digit codes and full state names (e.g., GA versus Georgia), while states in jobs are all two-digit codes.

A. Run the code below to load the states crosswalk (this matches state abbreviations with their full name).

B. Use that crosswalk to create a field in debar that has the two-digit state abbreviation for all locations (hint: you may need to first split the City, State string on the ", " or use str.replace to extract the state into a new column before converting it to all abbreviations)

Hint: the GitHub issue contains the value_counts() for the cleaned two-digit states after this step

C. Use an assert statement to check that all the states in debar are two-digits after the cleaning

Notes: you can filter out states that are NaN

Concepts and resources:

  • For part B, draw on regex or pandas str functions to create a State variable from the City, State column in debar
  • Then, one approach is to use pd.merge to merge the state crosswalk onto debar (example head() shown in the GitHub issue)
  • You may want to structure the pd.merge to only merge on rows without two-digit names to the crosswalk with those two-digit names and pd.concat to bind the dataframe back together
  • How to structure an assert statement: https://www.w3schools.com/python/ref_keyword_assert.asp

Output from .head() statement after merging on the two digit code-state name crosswalk:

image

Output from value_counts() after converting all states to two digits:

image
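A sketch of the merge-then-concat approach from the hints (the column names in the toy debar and crosswalk frames are assumptions; swap in the real ones):

```python
import pandas as pd

debar = pd.DataFrame({"City, State": ["JEFFERSON, GA", "AUGUSTA, Maine"]})
crosswalk = pd.DataFrame({"state_name": ["Georgia", "Maine"],
                          "state_abbrev": ["GA", "ME"]})

# Pull whatever follows the final comma as the raw state string
debar["state_raw"] = debar["City, State"].str.split(", ").str[-1]

# Rows already two-digit stay as-is; others merge onto the crosswalk
is_two = debar["state_raw"].str.len() == 2
already = debar[is_two].assign(state_2dig=lambda d: d["state_raw"])
needs = debar[~is_two].merge(crosswalk, left_on="state_raw",
                             right_on="state_name", how="left")
needs["state_2dig"] = needs["state_abbrev"]
debar_clean = pd.concat([already, needs], ignore_index=True)

# Part C: all cleaned states should now be two characters
assert (debar_clean["state_2dig"].str.len() == 2).all()
```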

PS4 2.1

2.1 Preprocess the data by removing stopwords, punctuation, and non-alpha words (5 points)

A. Write a function that:

  • Takes in a single raw string in the contents column from that dataframe

  • Does the following preprocessing steps:

    • Converts the words to lowercase
    • Removes stopwords, adding the custom stopwords in your code cell below to the default stopwords list
    • Only retains alpha words (so removes digits and punctuation)
    • Only retains words 4 characters or longer
    • Uses the snowball stemmer from nltk to stem
  • Returns a joined preprocessed string

B. Use apply or list comprehension to execute that function and create a new column in the data called processed_text

C. Print the id, contents, and processed_text columns for the following press releases:

id = 16-718 (this case: https://www.seattletimes.com/nation-world/doj-miami-police-reach-settlement-in-civil-rights-case/)

id = 16-217 (this case: https://www.wlbt.com/story/32275512/three-mississippi-correctional-officers-indicted-for-inmate-assault-and-cover-up/)

Resources:

Output for those two press releases (cutoff at end but you can get gist):

image

image
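A stdlib sketch of the preprocessing function. The tiny stopword set here is a stand-in for nltk's English stopwords plus the custom list from the pset, and the stemming step is left as a comment since the real pset uses nltk's SnowballStemmer:

```python
import re

CUSTOM_STOP = {"justice", "department"}            # stand-ins for the pset's custom list
STOPWORDS = {"the", "a", "and", "of", "to"} | CUSTOM_STOP  # nltk's list in the real pset

def preprocess(raw: str) -> str:
    # Lowercase and keep alpha-only tokens (drops digits and punctuation)
    tokens = re.findall(r"[A-Za-z]+", raw.lower())
    # Drop stopwords and words shorter than 4 characters
    kept = [t for t in tokens if t not in STOPWORDS and len(t) >= 4]
    # In the pset, stem each token here with
    # nltk.stem.SnowballStemmer("english").stem(t)
    return " ".join(kept)
```

You can then build `processed_text` with `df["contents"].apply(preprocess)` or a list comprehension.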

PS3 2.2

A. Write fuzzy matching code (don't yet put inside a user-defined function, you'll do that in 2.3) that:

  • Blocks on two-digit state code
  • Finds matches based on similarity between the employer name (name_clean) in debar (uppercase and cleaned) and name_clean in jobs (uppercase and cleaned). You can choose which distance metric and threshold to use (feel free to set a threshold low enough to get some matches even if that leads to some false positives).

For the steps after compute, just take any match with a non-zero value rather than using a classifier (so skip the k-means or e-m step from the class example code)

Concepts and resources:

Hint: you may need to deduplicate records in the datasets for the recordlinkage package to work. See drop_duplicates within pandas and subset command to only deduplicate using name_clean and the two-digit state code

B. Print the following columns in the resulting matched dataset and comment on examples of ones that seem like true positives and ones that seem like false matches:

  • name_clean for jobs and debar
  • state in debar
  • state in jobs

Output for part B: IMPORTANT NOTE - since different solutions will use different string distance functions and thresholds, don't try to replicate this exact output; instead, we show it as one example of how to structure the dataset in Part B to showcase matches

image
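The pset uses the recordlinkage package, but the block-then-score-then-threshold logic can be illustrated with the stdlib `difflib` (the toy records and the 0.8 threshold are assumptions; tune the threshold as the question allows):

```python
from difflib import SequenceMatcher

# Toy (name_clean, state) records standing in for the real datasets
debar = [("CISCO PRODUCE INC", "GA"), ("COUNTY FAIR FARM", "ME")]
jobs = [("CISCO PRODUCE", "GA"), ("SUNNY ORCHARDS", "ME")]

matches = []
for d_name, d_state in debar:
    for j_name, j_state in jobs:
        if d_state != j_state:          # blocking: only compare within state
            continue
        # String similarity in [0, 1]; recordlinkage offers jarowinkler etc.
        sim = SequenceMatcher(None, d_name, j_name).ratio()
        if sim >= 0.8:                  # threshold (tune; lower = more matches)
            matches.append((d_name, j_name, round(sim, 2)))
```

A lower threshold will surface more candidate matches, including false positives, which is exactly what part B asks you to inspect.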

PS2 1.1 Part B

Then, follow the instructions in the codebook (combining COMMITMENT_TERM with COMMITMENT_UNIT) to create a standard sentence length in days column (senlength_derived) that measures the sentence in days. To simplify, you can assume that:

  • 1 hour = 1/24th of a day
  • 1 year = 365 days
  • 1 month = 30.5 days
  • 1 week = 7 days
  • Natural life = difference between the age of 100 and the defendant's age at incident (cleaned; if missing, code to age 20); note that this is a simplification since age at incident != age at sentencing

Print the following cols for an example of each type (e.g., an example of originally hours; an example of natural life): COMMITMENT_TERM, COMMITMENT_UNIT, age_derived and your new standardized sentence length column (senlength_derived)

Print the summary of that column (senlength_derived) using the .describe() command

Concepts tested and resources: there are many possible approaches; np.select or np.where with logical conditions on the unit column are two common ones

Hint on output: see GitHub issue for the summary stats we get from running .describe()

image
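One sketch of the unit conversion using `np.select` (column names follow the codebook; the toy rows are made up, and the unit labels may differ slightly from the real data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "COMMITMENT_TERM": [2.0, 18.0, 36.0, 1.0],
    "COMMITMENT_UNIT": ["Year(s)", "Months", "Hours", "Natural Life"],
    "age_derived": [35, 22, 41, np.nan],
})

age = df["age_derived"].fillna(20)      # missing ages coded to 20 per the pset
unit = df["COMMITMENT_UNIT"].str.lower()

# First matching condition wins; "days" falls through to the default
df["senlength_derived"] = np.select(
    [unit.str.contains("year"), unit.str.contains("month"),
     unit.str.contains("week"), unit.str.contains("hour"),
     unit.str.contains("natural")],
    [df["COMMITMENT_TERM"] * 365, df["COMMITMENT_TERM"] * 30.5,
     df["COMMITMENT_TERM"] * 7, df["COMMITMENT_TERM"] / 24,
     (100 - age) * 365],
    default=df["COMMITMENT_TERM"],
)
```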

PS4 2.3

2.3 Estimate a topic model using those preprocessed words (5 points)

A. Going back to the preprocessed words from part 2.1, estimate a topic model with 3 topics, since you want to see if the unsupervised topic models recover different themes for each of the three manually-labeled areas (civil rights; hate crimes; project safe childhood). You have free rein over the other topic model parameters beyond the number of topics.

B. After estimating the topic model, print the top 15 words in each topic.

Hints and Resources:

  • Same topic modeling resources linked to above
  • Make sure to use the random_state argument within the model so that the numbering of topics does not move around between runs of your code

Output: NOTE: yours will likely be different because it depends on (1) parameter values within ldamodel and (2) the random state you use; do not worry about this

image

EXTRA CREDIT PS3 1.3.3 Parts A and B

1.3.3 Regex to separate companies from individuals (4 points)

You notice some employers in debar have both the name of the company and the name of individual, e.g.:

COUNTY FAIR FARM (COMPANY) AND ANDREW WILLIAMSON (INDIVIDUAL)*

Use the uppercase/cleaned version of Name in debar

A. Write a regex pattern that does the following:
- Captures the pattern that occurs before COMPANY if (COMPANY) is in string-- so in above, extracts COUNTY FAIR FARM
- Captures the pattern that occurs before INDIVIDUAL if (INDIVIDUAL) is also in string -- so in above, extracts ANDREW WILLIAMSON (so omit the "and")

B. Test the pattern on pos_example and neg_example -- make sure the former returns a list (if using re.findall) or a match object (if using re.search) with the company name and individual name separated out; make sure the latter returns empty

Hints and resources: for step A, you can either use re.search, re.match, or re.findall; don't worry about matching B&R Harvesting and Paul Cruz (Individual)

  • Same regex resources as above

Output:

  • Example output from pos_example = "COUNTY FAIR FARM (COMPANY) AND ANDREW WILLIAMSON (INDIVIDUAL)*" (this corresponds to the re.findall approach; could also use other method)

image

  • Example output from neg_example = "CISCO PRODUCE INC"

Empty / returns no match
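A sketch of one pattern that works on those two examples (lazy `.*?` groups capture the text before each tag, and an optional non-capturing group swallows the "AND"):

```python
import re

# Group 1: text before (COMPANY); group 2: text before (INDIVIDUAL),
# with the connecting "AND" matched outside the capture groups
pat = re.compile(r"(.*?)\s*\(COMPANY\)\s*(?:AND\s+)?(.*?)\s*\(INDIVIDUAL\)")

pos_example = "COUNTY FAIR FARM (COMPANY) AND ANDREW WILLIAMSON (INDIVIDUAL)*"
neg_example = "CISCO PRODUCE INC"

pos_match = pat.findall(pos_example)
neg_match = pat.findall(neg_example)   # empty list: no match
```

For part C, you can loop over name_clean, and where `findall` comes back empty, fall back to the full string for both co_name and ind_name.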

PS5 2.7

2.7 Use a subquery to create an indicator and group by that indicator (6 points)

The following states were the first 6 to expand the right to vote to women before the uniform federal expansion in 1920

  • Wyoming 1890
  • Colorado 1893
  • Utah 1896
  • Idaho 1896
  • Washington 1910
  • California 1911

A. Create an indicator is_early_vote for whether a state is in that list or not; do so without typing the state names inside the string and instead collapsing the list of states we provide and using something like format. Hint on how to combine the state names while preserving the quotes around each: https://stackoverflow.com/questions/12007686/join-a-list-of-strings-in-python-and-wrap-each-string-in-quotation-marks

B. Then, group by the is_early_vote indicator and year and find the percentage of states in each group where males had higher scores than females

C. Print the resulting dataframe and interpret. Does early expansion of voting seem to be correlated with girls scoring better on the math tests a century later?

Hint: in order to group by the indicator in step b, you may need to use a subquery

Output (fine for yours to be formatted differently):

image
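A sketch of the quote-wrapping join from the hint plus interpolating it into the query string (the table and column names in the query are assumptions):

```python
early_states = ["Wyoming", "Colorado", "Utah", "Idaho",
                "Washington", "California"]

# Wrap each name in single quotes and comma-separate for an IN (...) clause
quoted = ", ".join("'{}'".format(s) for s in early_states)

query = ("SELECT *, CASE WHEN state IN ({}) THEN 1 ELSE 0 END "
         "AS is_early_vote FROM scores").format(quoted)
```

The resulting SELECT can then sit inside a subquery that you group by is_early_vote and year.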

PS3 1.3.2 Part A and B

1.3.2 Cleaning up punctuation (4 points)

You notice that INC, CO, and LLC are sometimes followed by a period (.) but sometimes not

A. For each dataset, write a regex pattern using re.sub to remove the . but only if it's preceded by INC, LLC, or CO

Make sure LLC, INC, CO remain part of the string but just without the dot

B. Test the pattern on the positive and negative examples we provide below and print the result. See the GitHub issue for examples of what to return

Hint: https://stackoverflow.com/questions/7191209/python-re-sub-replace-with-matched-content

Resources:
- Regex slides: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/slides/w22_slides/06_qss20_w22_unit6_regex.pdf
- Regex activity code: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/solutions/04_basicregex_solutions.ipynb

Output:

  • Output for pos_example_1 = "CISCO PRODUCE INC."

image

  • Output for pos_example_2 = "AVOYELLES HONEY CO., LLC"

image

  • Output for neg_example = "E.V. RANCH LLP"

image
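A sketch of one `re.sub` pattern that handles all three examples (the `\b` keeps it from firing inside longer words, and the backreference `\1` keeps the abbreviation in the string):

```python
import re

def strip_corp_dot(name: str) -> str:
    # Remove "." only when it directly follows INC, LLC, or CO as a
    # whole word; \1 writes the matched abbreviation back without the dot
    return re.sub(r"\b(INC|LLC|CO)\.", r"\1", name)

pos_1 = strip_corp_dot("CISCO PRODUCE INC.")
pos_2 = strip_corp_dot("AVOYELLES HONEY CO., LLC")
neg = strip_corp_dot("E.V. RANCH LLP")   # should come back unchanged
```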

PS5 1.5

1.5 Write and execute user-defined function that takes in a query and returns a pandas dataframe with the content of the response (4 points)

  • Write a user-defined function (process_naep_query) that takes in the NAEP query as a string, calls the API, and transforms the response into a pandas dataframe. Have the function return that pandas dataframe

  • Make sure the function is flexible enough to handle queries that return an error; for queries that return an error, have the function return the string "Data not found; check your query"

  • Execute the function on the query that pulls 2013-2019 data (either from handwriting the query or the result in 1.4)

  • Print the resulting dataframe

  • Then execute the function on a query that pulls a state that doesn't exist (call this state ZZ)

**Output from correct query**: same as output from 1.2

Output from incorrect query:

image
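One way to structure process_naep_query so the error handling is easy to exercise is to pass the fetching step in as an argument; in the pset you would call it with something like `lambda u: requests.get(u).json()`. The `"result"` key and the column names are assumptions about the NAEP response shape, and the stub fetchers below just stand in for the live API:

```python
import pandas as pd

def process_naep_query(query_url, fetch):
    """Call the NAEP API via `fetch` and return the rows as a DataFrame;
    on any failure return the required error string."""
    try:
        payload = fetch(query_url)
        return pd.DataFrame(payload["result"])
    except Exception:
        return "Data not found; check your query"

# Stub fetchers standing in for the live API in this sketch
def fake_ok(url):
    return {"result": [{"jurisdiction": "CA", "value": 276.0}]}

def fake_bad(url):
    raise ValueError("state ZZ does not exist")

good = process_naep_query("https://example/naep?query=...", fake_ok)
bad = process_naep_query("https://example/naep?query=ZZ", fake_bad)
```

The broad `except Exception` is what makes the function "flexible enough to handle queries that return an error", whether the failure is a bad HTTP response or malformed JSON.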

PS5 3.1

3.1 Obtain a Census API key, place it in credentials yaml, load the yaml file, and initialize connection to Census API using the census package (0 points)

What a successful api connection looks like:

image
