rebeccajohnson88 / qss20_slides_activities Goto Github PK

View Code? Open in Web Editor NEW

8.0 8.0 29.0 121.49 MB

Repo for slides and activities for qss 20

Jupyter Notebook 11.80% HTML 88.20%

qss20_slides_activities's People

Contributors

Stargazers

Watchers

qss20_slides_activities's Issues

Update naming in problem sets folder

Relevant directory: https://github.com/rebeccajohnson88/qss20_slides_activities/tree/main/problemsets

See notes for renaming-- can either keep each pset in a separate directory or have them all in the outer directory
Rename and update README in that directory

PS4 1.2 A, B, C

A. Using the original pharma press release (so the one before stripping punctuation/digits), use spaCy to extract all named entities from the press release.

B. Print the unique named entities with the tag: LAW

C. Use Google to summarize in one sentence what the RICO named entity means and why this might apply to a pharmaceutical case (and not just a mafia case...)

Output for unique named entities with tag LAW:

PS2 1.2 Part A

Keep the above ~58k row dataset that is subsetted to only to sentences involving incarceration. Then, further subset the rows to:

Rows where judgeid_derived = judge_21
simplified_offense_derived == "Narcotics"

Use shape to print the dimensions of the resulting dataframe

Concepts and resources: row subsetting using logical conditions; see above resources

Hint on output: we get 53 rows after this filtering step. Don't worry about matching this exactly but try to get in the ballpark range.

PS2 1.1 Part A

Read in the data
Filter to sentences that involve incarceration (same Illinois Department of Corrections logic as in problem set one:
incarceration is indicated by COMMITMENT_TYPE == "Illinois Department of Corrections")
Using the COMMITMENT_UNIT column that represents sentence length, filter out the following non-numeric sentence lengths: Term, Pounds, Dollars, Ounces (year, months, natural life, days, hours, and weeks should remain)
Filter to Black or White defendants

Concepts tested and resources: this question tests filtering rows based on logical conditions. Here are some resources:

DataCamp on .loc: https://campus.datacamp.com/courses/data-manipulation-with-pandas/slicing-and-indexing-dataframes?ex=3
Lecture 2 slide 19: https://campus.datacamp.com/courses/data-manipulation-with-pandas/slicing-and-indexing-dataframes?ex=3
Row subsetting section in this activity: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/solutions/00_pandas_datacleaning_solutions.ipynb

Hints on output: we get 58289 rows remaining after this filtering. Don't worry about matching this exactly but try to get in the ballpark range.

2.4 Add topics back to main data and explore correlation between manual labels and our estimated topics (10 points)

A. Extract the document-level topic probabilities. Within get_document_topics, use the argument minimum_probability = 0 to make sure all 3 topic probabilities are returned. Write an assert statement to make sure the length of the list is equal to the number of rows in the doj_subset_wscores dataframe

B. Add the topic probabilities to the doj_subset_wscores dataframe as columns and create a column, top_topic, that reflects each document to its highest-probability topic (eg topic 1, 2, or 3)

C. For each of the manual labels in topics_clean (Hate Crime, Civil Rights, Project Safe Childhood), print the breakdown of the % of documents with each top topic (so, for instance, Hate Crime has 246 documents-- if 123 of those documents are coded to topic_1, that would be 50%; and so on). Hint: pd.crosstab and normalize may be helpful: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.crosstab.html

D. Using a couple press releases as examples, write a 1-2 sentence interpretation of why some of the manual topics map on more cleanly to an estimated topic than other manual topic(s)

Resources:

End of this code (Additional summaries of topics and documents) contains example of how to use get_document_topics and other steps to add topic probabilities back to data: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/solutions/06_textasdata_partII_topicmodeling_solution.ipynb
If you're getting errors, use shape, len, and other commands to check the dimensionality of things at different steps

Output: NOTE - do not worry if your numbers are slightly different and if your topics have different labels (though should range from 1...3 rather than 0..2)--- the labels are impacted by the random_state arg at the model estimation stage but it should broadly show hate crimes and proj safe childhood mapping cleanly to one topic each; civil rights more split

PS4 3 A and B

3. Extend the analysis from unigrams to bigrams (10 points)

In the previous question, you found top words via a unigram representation of the text. Now, we want to see how those top words change with bigrams (pairs of words)

A. Using the doj_subset_wscore data and the processed_text column (so the words after stemming/other preprocessing), create a column in the data called processed_text_bigrams that combines each consecutive pairs of word into a bigram separated by an underscore. Eg:

"depart reach settlem" would become "depart_reach reach_settlem"

Do this by writing a function create_bigram_onedoc that takes in a single processed_text string and returns a string with its bigrams structured similarly to above example

Hint: there are many ways to solve but zip may be helpful: https://stackoverflow.com/questions/21303224/iterate-over-all-pairs-of-consecutive-items-in-a-list

B. Print the id, processed_text, and processed_text_bigram columns for press release with id = 16-217

Output from example (fine if yours differs slightly):

PS3 1.3.1

1.3.1 Converting to upper (2 points)

A. Convert the names EMPLOYER_NAME and Name to uppercase using list comprehension rather than df.varname.str.upper() (it's fine to do a separate list comprehension line for each of the two columns)

B. Print a random sample of 15 values of each result

C. Assign the full vector of uppercase names back to the original data, writing over the original EMPLOYER_NAME and Name columns

Resources:
- Activity code with list comprehension: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/solutions/00_pandas_datacleaning_solutions.ipynb
- Sampling from a list without replacement using the random module: https://note.nkmk.me/en/python-random-choice-sample-choices/

Output: since everyone's random sample will be different, we don't have a uniform output; here's an example of the uppercase conversion for the American Crawfish employer in jobs: AMERICAN CRAWFISH LLC

PS5 2.3

2.3 Find the (1) number of rows in the database, (2) number of distinct states, (3) number of distinct years (3 points)

Interpret the results - how do you think the data is structured in terms of states and years (eg long format where each state repeated; wide format)?

Output:

PS5 2.1

2.1 Load database credentials if haven't already and establish a connection (1 point)

Note: you need to be on eduroam or VPN

Output of successful connection if you print connection object:

Problem set one questions and responses (S21)

Group portion

Question: for question 1.1, Print the mean and median number of charges per case/participant should we calculate the mean and median numbers per case and then per participant?
Response: ok with any reasonable answer but originally was one combined --- I'd basically think of it as grouping by both case and participant, find the number of charges for each case-participant group, and then the mean/median of that (sorry a bit confusing but just results from fact of participants nested in cases)
Question: for question 1.2.2, if the category does not contain aggravated, arson, homicide, vehicle/vehicular, or battery, what should appear in the simplified_offense_derived column?
Response: for other categories that aren't changed, the simplified_offense_derived column should be set to the original value for that category (eg if Retail Theft isn't changed by any of the above, simplified_offense_derived stays as Retail Theft)
Question: for question 1.3, if the race == NaN, should that be categorized as none of the above?
Response:: in short doesn't matter since we'll be subsetting to black and white; so can either code as true for other or leave as missing/false for all.
Question: just wanted to clarify that we should use sentenceymd_derived column for part 2.1?
Response: for all the time based questions going forward, you can use either sentenceym_derived or sentenceymd_derived (whichever makes more sense for the question)
Question: what does winsorize in this part of 1.3 mean- Winsorsize the top 0.01% of values to be equal to the 99.99th percentile value pre-winsorization. Call this age_derived
Response: word winsorize isn't important for answering (but some discussion here!: https://en.wikipedia.org/wiki/Winsorizing). In this question, it involves: (1) finding what the 99.99th percentile value for age at incident is in the raw data, (2) if a given row is above that 99.99th percentile value, recode it to the 99.99th percentile value (so recode it rather than drop it). Recommend this command: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html
Question: for 2.3.3, I understand the comparison by time frame but I’m wondering what type of disparities are we looking to compare? Raw number of retail theft sentences, sentence length for retail thefts or a different column?
Response: good question! It should be (1) just looking at retail thefts, (2) looking at what percentage of defendants are incarcerated (so repeating aggregation from 2.3.2 but doing the time variation)
Question: if working locally, why might I be having issues loading the sentencing_cleaned.pkl file and what can I do about it?
Response: pkl files are a filetype that preserve things like the datatypes (e.g., datetime) over csv. The downside is that there are what are called serialization protocols that differ between versions of python. The pkl I shared with you corresponds to python 3.8+ (version on jupyter hub).

If you want to work locally and you don't have python 3.8+ installed yet (can check versions by going to terminal and typing python --version) you can load the csv and use the following code to recast the two datetime columns back to the correct type (screenshot below is me checking that this works)

sentencing_csv['sentenceymd_derived'] = pd.to_datetime(sentencing_csv.sentenceymd_derived)
sentencing_csv['sentenceym_derived'] = pd.to_datetime(sentencing_csv.sentenceym_derived)

Question: for questions like 2.3.2 and 3.3 that say "print table", is printing a pandas dataframe fine?

Response: yep just a dataframe or whatever showing the results is fine!

Question: I am still a bit confused on what we are supposed to do for this question "Use mathematical notation to write out each of the proportions using summation notation," after reviewing the in-class Latex example. Are we supposed to export the table we created to latex and in latex make the summation notation?

Response: (1) use the menu at top to change the cell to markdown, (2) using math mode (e.g, open and closed $ $), write out the proportions using mathematical notation--- eg the numerator or count of Black defendants is $\sum_{i} Black_{i}$

Individual portion

Question: for question 3.2, are we supposed to print the median sentence length for all defendants/judges (ie plot grouped by judge and by race) or just by overall values (ie just one median value each for white and black and only grouped by race)

Response: yep to unpack this a bit For each judge with at least 20 Black and at least 20 white defendants, plot the (1) median sentence length for Black defendants and (2) median sentence length for white defendants (factor variable on x axis for each judge_id who qualifies; group by race)

Filter to narcotics offenses and judges who have 20+ Black and 20+ white defendants
Group by judge and race to calculate median sentence length --- so graph should have 2 bars or points for each judge- one showing median sentence length for that judge's Black defendants; other for his/her white defendants

Question: for question 3.3, who are our focal defendants, every defendant for a narcotics offense for judge 21? Does using the "same narcotics offenses" mean we're only looking at Black and White defendants like in 3.2, and does that also mean we shouldn't consider other races when doing the matched pairs? And lastly, if every defendant is a focal defendant, won't we get repeat pairs just in different orders?

Response:

Sample: every Black or White defendant with a narcotics offense sentenced by judge 21
Yep keep as Black or White only so continue excluding other races
Yep you iterate so that every defendant is a focal defendant; yep you'll have some repeat matches in output and some in the above analytic may not have any matches

PS5 1.4

1.4 Reproduce the queries from 1.1 and 1.2 using a user-defined function (4 points)

Create a function, construct_naep_query that takes in two arguments:

year: this should be a list with all years (so if one year, single element list; if multiple years, list with those years)
place: this should be a string with the name of the state or jurisdiction to pull

Have the function return the query and make sure it's identical to the queries you wrote for 1.1 and 1.2 (can use assert or other checker)

No output to display since output is just the query you're being asked to produce

PS3 2.3

2.3 Put the code from 2.2 into a fuzzy matching function that generalizes some steps (8 points)

You want to see how the matches change if you add the city and not only state as a field and also want to automate the process of matching a bit to try different distance thresholds.

A. Extract the City from the City, State column of debar

B. Convert that new city column to uppercase and convert the EMPLOYER_CITY column in jobs to uppercase

C. Write a function surrounding the code in recordlinkage that you wrote in problem 2.2 (so you don't need to recode the package from scratch) that (1) takes in each dataset, (2) blocks on two-digit state, and (3) fuzzy matches on employer name and employer city. See below notes on partial versus full credit for function structure.

D. Execute the function with a couple different string distance thresholds and print the resulting matches for each

6 out of 8 points: function takes arguments for input datasets, varname to block on, two varnames to fuzzy match on, string distance function, and string distance threshold

8 out of 8: above but function is also general enough that it takes a variable # of strings to match on--- so should work if you either execute just using employer name or also work if you execute using employer name and employer city

Concepts and resources: same as above. One hint is that for us, the easiest way to get to the full credit solution was to feed the function a dictionary where each key corresponds to one string variable to fuzzy match on; the values contain that variable's name in each dataset, the distance metric, and the distance threshold.

Example output: IMPORTANT- yours will likely differ because of string distance metrics etc. Here's our example though for a distance threshold of 0.9 for both business name and city name and then a threshold of 0.75 for each

Independent study set up tasks

Make sure you're able to login to DataCamp
Clone this qss20_slides_activities repo locally
Make sure you have python 3.8 installed (3.7 is okay but 3.8 works better for pkl stuff if not too big an issue to upgrade from 3.7) and have anaconda to work with jupyter notebooks

First parts of course will be (can discuss good deadlines monday)

Completing the DataCamp pandas data manipulation module
Completing problem set one: https://github.com/rebeccajohnson88/qss20_slides_activities/tree/main/problemsets/01_pset1
While completing problem set one, making question clarifications based on these questions that arose: #1

PS2 1.2 Part C

Using the results from Part B, use pd.concat or another approach to create a dataframe that compares the (1) race and sentence length for the focal defendant to (2) the sentence length for other defendants. Using this dataframe, show this comparison for focal defendant: CASE_PARTICIPANT_ID == 808109112733

PS5 2.2

2.2 Run a query to select all columns and the first 5 rows of the math_gencompare database to explore structure (2 points)

Read the results in as a pandas dataframe and print the result

Output:

PS5 1.3

1.3 Create a line plot to show variation in the scores across years (2 points)

Using the results from 1.2, create a plot where the x axis has the year and the y axis is the math scores (value in dataframe), and there are separate lines/colors for male versus female students

Start the limits of the y axis at 270 and add informative labels

Our graph (doesn't need to match exactly in terms of aesthetics/labels:

PS4 1.1 A

1. NLP on one press release (10 points)

Focus on the following press release: id == "17-1204" about this pharmaceutical kickback prosecution: https://www.forbes.com/sites/michelatindera/2017/11/16/fentanyl-billionaire-john-kapoor-to-plead-not-guilty-in-opioid-kickback-case/?sh=21b8574d6c6c

The contents column is the one we're treating as a document. You may need to to convert it from a pandas series to a single string.

We'll call this press release pharma

1.1 part of speech tagging (3 points)

A. Preprocess the pharma press release to remove all punctuation / digits (so can use .isalpha() to subset)

Example output from first five tokens:

PS4 2.2

2.2 Create a document-term matrix from the preprocessed press releases and to explore top words (5 points)

A. Use the create_dtm function I provide (alternately, feel free to write your own!) and create a document-term matrix using the preprocessed press releases; make sure metadata contains the following columns: id, compound sentiment column you added, and the topics_clean column

B. Print the top 10 words for press releases with compound sentiment in the top 5% (so the most positive sentiment)

C. Print the top 10 words for press releases with compound sentiment in the bottom 5% (so the most negative sentiment)

Hint: for these, remember the pandas quantile function from pset one.

D. Print the top 10 words for press releases in each of the three topics_clean

For steps B - D, to receive full credit, write a function get_topwords that helps you avoid duplicated code when you find top words for the different subsets of the data. There are different ways to structure it but one way is to feed it subsetted data (so data subsetted to one topic etc.) and for it to get the top words for that subset.

Resources:

Here contains an example of applying the create_dtm function: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/solutions/06_textasdata_partII_topicmodeling_solution.ipynb

Output for top words based on compound scores:

Output for top words based on the topics_clean:

PS2 1.2 Part D

Using the results from Part C, group by the focal defendant's race and find the proportion of that defendant's matches who had a LONGER sentence than the focal defendant

Concepts and resources: can use groupby and agg

Groupby and agg code: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/solutions/00_pandas_datacleaning_solutions.ipynb

Hints on output: since we're grouping by race in the full pool, the resulting output should have two values: one for the rate at which Black defendants have matches with higher sentences; another for the rate at which White defendants have matches with higher sentences

PS5 2.5

2.5 group by year and find the percentage of states where male scores are higher than females (4 points)

A. Write a query that (1) groups by year and (2) finds the percentage of states that have higher scores for males than females in this year

B. Print the resulting dataframe and interpret the results

Hint: can either use subquery to construct the is_male_higher and use it or do it all in one query with a comparison; the avg command is useful

Output:

PS2 1.2 Part B

For each defendant sentenced by judge_21, you want to construct a "matched group" of defendants who:

Are the same exact age as the focal defendant (age_derived) and
Are the same gender as the focal defendant (is_male_derived) but
Differ in race from the focal defendant (is_black_derived/is_white_derived)

Write a user-defined function to find any/all matched defendants for each focal defendant of judge 21. You can structure the function in various ways but one way is to write a function similar to the class example where we iterate over different DC crimes (focal crimes) and find similar crimes. For this problem, we want to:

Iterate over unique defendants sentenced by judge 21 (use CASE_PARTICIPANT_ID to identify each unique defendant)
Find other defendants in the judge 21 pool who (1) have a different race from that focal defendant but (2) the same gender and age

Concepts and resources:

Slides and activity code on user-defined functions and iterating using list comprehension: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/solutions/02_loopsfunctions_solutions.ipynb
You can either write code in the function to add columns with the attributes of the focal defendant (existing material) or using pd.merge to join these on after; we'll be covering pd.merge Monday 01.24 but here are last year's slides (slide 17-20) in meantime: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/slides/s21_slides/qss20_s21_class4.pdf

Hints on output:

Some focal defendants may not have any matches; they can be excluded from the results
In the way we wrote our function, each iteration of the function returns a single dataframe; because we execute via list comprehension, the output of this step is a list of dataframes (it's okay if you take a diff approach!)

PS3 1.2

1.2 Try exact merge on business name (2 points)

Use the EMPLOYER_NAME field of the jobs dataset
Use the Name field of the debar dataset

A. Use pd.merge with an inner join on those fields to see whether there are any exact matches.

B. If there are exact matches, subset to the following columns and print the rows with exact matches:

Employer_NAME and Name
Date range of debarment (Start date and End date in debar)
Location from each data (City, State in debar and EMPLOYER_CITY and EMPLOYER_STATE in jobs)

Example output:

PS4 3 C, D, and E

C. Use the create_dtm function and the processed_text_bigrams column to create a document-term matrix (dtm_bigram) with these bigrams. Keep the following three columns in the data: id, topics_clean, and compound

D. Print the (1) dimensions of the dtm matrix from question 2.2 and (2) the dimensions of the dtm_bigram matrix. Comment on why the bigram matrix has more dimensions than the unigram matrix

Output (don't need to match number of columns exactly):

E. Find and print the 10 most prevelant bigrams for each of the three topics_clean using the get_topwords function from 2.2

PS3 1.3.2 Part C

C. Use that pattern in conjunction with re.sub and list comprehension to clean the columns in each dataset. Save the new columns as name_clean in each. Then, use row subsetting to (1) subset to rows that changed names and for:

debar print the Name and name_clean columns
jobs print the EMPLOYER_NAME and name_clean columns

Make sure to use the uppercase versions of the variables

Concepts and resources: for the last part of row subsetting and printing two columns in each dataframe, use the .loc examples shown here: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/solutions/00_pandas_datacleaning_solutions.ipynb

Output:

Head of changed employer names in jobs (your head may be different depending on order of data)

Head of changed employer names in debar:

Pset 1 questions and responses (W22)

QUESTION ONE:

Question: for question 1.1, Print the mean and median number of charges per case/participant should we calculate the mean and median numbers per case and then per participant?
Response: ok with any reasonable answer but originally was one combined --- I’d basically think of it as grouping by both case and participant, find the number of charges for each case-participant group, and then the mean/median of that (sorry a bit confusing but just results from fact of participants nested in cases)

Question For 1.1, "Does the data seem to enable us to follow the same defendant across different cases they're charged in?" I am a little confused about how to get started on this one. I've tried grouping by case id and finding various ways to look for unique case participants, but I can't seem to find something that works
Response: i think this is close but a bit off. to see if the same defendant can be followed across cases, i'd instead group by the case_participant_id and see the number of unique case ids within the group; if there is only 1 case for each case_participant_id, we can't follow

Question: I am a little confused with 1.2.1. I don't have to reformat the data right? It's just saying for each row check if the original offense is the same as the changed offense right. EX: sentencing_data["OFFENSE_CATEGORY] == sentencing_data["CHANGED_OFFENSE_CATEGORY]. I guess the part that's confusing me is "case-participant-charge"
Response:

Do you have to reformat the data? You'll be adding a new column to the data is_changed_offense that takes one of two values: True if OFFENSE_CATEGORY is not equal to UPDATED_OFFENSE_CATEGORY; false otherwise
What do we mean by case-participant-charge observations (rows)? This is just to note that you can use something like np.where for this question and keep the unit of analysis at the row level. For later questions, we will be aggregating over different charges within the same case, so this hint is just to indicate that it's okay to keep everything at the row level for now/not yet group the data.

Question: for 1.2.1, what does it mean to "construct an indicator"
Response: see above answer - means to add a new column, is_changed_offense, to your dataframe

Question: for 1.2.1, how might I print one example of a changed offense?
Response: there are various ways to do this that avoid "hard coding" a specific row number. One way is to (1) filter to offenses that have changed using row subsetting, (2) run df.head(1) to print the head

Question: for question 1.2.2, if the category does not contain aggravated, arson, homicide, vehicle/vehicular, or battery, what should appear in the simplified_offense_derived column?
Response: for other categories that aren’t changed, the simplified_offense_derived column should be set to the original value for that category (eg if Retail Theft isn’t changed by any of the above, simplified_offense_derived stays as Retail Theft)

Question: for 1.2.2, do I need to use advanced regex?
Response: nope; functions like str.startswith or str.replace that operate directly on pandas dataframe columns work fine

Question: for 1.2.2, I know how to generally structure the np.select but get stuck at the part of coding of leaving the others that i'm not recoding at their UPDATED_OFFENSE_CATEGORY. How might I do this?
Response: see the code on slide 21- https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/slides/w22_slides/qss20_w22_unit2_pandas.pdf. You can use the same "default" syntax within np.select. But for the default, instead of coding it to a single string, set default to df['UPDATED_OFFENSE_CATEGORY']. This serves function of leaving everything that's not recoded at its original value.

Question: for 1.3, I'm getting this error: TypeError: bad operand type for unary ~: 'float'
Response: wherever you have str.contains, try including na=False inside the parenthesis

Question: for question 1.3, if the race == NaN, should that be categorized as none of the above?
Response: in short doesn’t matter since we’ll be subsetting to black and white; so can either code as true for other or leave as missing/false for all.

Question: for question 1.3, I'm stuck on this part of cleaning up the timestamps:
"For full credit, before converting, you notice that some of the years have been mistranscribed (e.g., 291X or 221X instead of 201X). Programatically fix those (eg 2914 -> 2014). Even after cleaning, there will still be some that are after the year 2021 that we'll filter out later. For partial credit, you can ignore the timestamps that cause errors and set errors = "coerce" within pd.to_datetime() to allow the conversion to proceed."
Response: since we're not covering regex until later, and people have different COSC backgrounds, here's some code you can adapt for that cleaning. It relies on the re.sub command (some examples here: https://lzone.de/examples/Python%20re.sub) and the list comprehension we covered in lecture 1

## replace erroneous dates
sentence['tmp_clnsdate'] = [re.sub(r'2[1-9]([0-9]+)', r"20\1", str(date)) 
                             if bool(re.search('\/2[1-9][0-9]+', str(date))) else 
                            str(date) 
                            for date in 
                            sentence.SENTENCE_DATE]

## cast as datetime
sentence['sentenceymd_derived'] = pd.to_datetime(sentence.tmp_clnsdate)

Question: what does winsorize in this part of 1.3 mean- Winsorsize the top 0.01% of values to be equal to the 99.99th percentile value pre-winsorization. Call this age_derived
Response: word winsorize isn’t important for answering (but some discussion here!: https://en.wikipedia.org/wiki/Winsorizing). In this question, it involves: (1) finding what the 99.99th percentile value for age at incident is in the raw data, (2) if a given row is above that 99.99th percentile value, recode it to the 99.99th percentile value (so recode it rather than drop it). Recommend this command: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html

Question: Also a question on 1.3, do the values have to start at "judge_1"? for some reason, mine are starting at "judge_-1"
Reponse: take a look at this documentation: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.ngroup.html and pay particular attention to the parameters in the ngroup documentation

Question: In general, when recoding & cleaning variables how concerned should we be with monitoring / handling missing values? For example, in Question 1.3, some of the variables we’re working with, such as SENTENCE_JUDGE have missing values, but the directions don’t seem to address this explicitly.
Response: yep in general missingness is really important to deal with for social science research but for this pset, feel free to largely ignore except in cases where we give you explicit filtering instructions like in 1.4 — "Filter out observations where judge is nan or nonsensical"

Question: for question 1.3, should we include "male name no gender given" in the is_male_derived indicator? Or would this be other?
Response: either is fine!

Question: For part 1.3, I'm not quite sure how to go about assigning the judges different id numbers. Any tips/hints?
Response: i'd think about grouping by judges and reading the documentation for this command that assigns one id for each group: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.ngroup.html
you can then paste the judge_ prefix in a separate step if that makes it easier!

Question: for 1.3, should we omit rows where the SENTENCE_JUDGE is NA?
Response: don't worry too much about na's here (fine w/ whatever) as we filter them out in 1.4

Question: for 1.3, any hints on rounding the sentenceymd_derived column to the month-year only version (sentenceym_derived)?
Response: see notes here in response to class questions. one of the responses covers how to round time in the pset, which involves month from datetime objects; can also extract years and then paste month-year together (making sure result is still datetime)- https://docs.google.com/document/d/1yzV0V8GeuW2fXGQuO9iD0X2g-9E4ORgADv1ECR3L5cQ/edit#heading=h.y94z198yrx86

Question: for 1.3, how do I check if my conversion of sentence_judge to a string is working?

check dytpes; string is represented as object

Question: For question 1.4 task 2, is it possible for a participant to receive 0 charge?
Response: for this part — "subset to a participant's primary charge and their current sentence (PRIMARY_CHARGE_FLAG is True and CURRENT_SENTENCE_FLAG is True). Double check that this worked by confirming there are no longer multiple charges for the same case-participant" —
it could be that both flags are false for a given participant in which case they'd be filtered out

Question: for question 1.4, is it okay if the number of rows in my filtered dataframe doesn't match the number of rows in the sentencing_cleaned.csv file I read in in part two?
Response: yep that's fine! no need to compare

Question: is 1.4 asking us to convert the data from long to wide format? where there's charge_1 charge_2 charge_3 as an individual column for each unique participant row?
Response: ah no it's simpler than that; just remove rows (CASE_IDs) where more than one participant is charged in that case; so conversely, subset to case ids .isin a list of ones that only have one participant

PS5 1.2

1.2 Write a query to pull 8th-grade mathematics scores in 2013, 2015, 2017, and 2019 from California by gender (1 point)

Same as 1.1 but pull the years 2013, 2015, 2017, and 2019 (search documentation for how to combine) in one query

Output subsetting to rel cols (can either print all cols or just relevant ones):

PS2 1.2 Part E

Write 1-2 lines commenting on the results from Part D. What other defendant or offense-level characteristics would you like to match on to investigate claims about racial disparities?

PS5 1.1

1.1 Write a query to pull 8th-grade mathematics scores in 2015 from California by gender (1 point)

Subject: mathematics
Subscale: MRPCM composite scale
Grade: 8
Year: 2015
grouping variable: GENDER

Print the output in dataframe format and briefly interpret; what do scores look like between the genders?

Output (subsetting to some relevant cols; feel free to either subset to print all:

PS5 2.4

2.4 Construct a new variable, `is_male_higher` that takes the value of 1 if the math scores of males exceed that of females in that state and year (each row) (2 points)

Read in the results, print the head, and find the mean across all rows

Output:

Problem set two questions and responses (S21)

Question: I'm working on jhub and having difficulty loading states crosswalk from web

Response: now put a states_pset2.csv file in public_data within qss20_slides_activities (on jhub in shared/qss20) that you should be able to load using pd.read_csv() and use (can comment out web code)

Question: for 1.2, do we need to do merge diagnostics as was done in the example code?

Response: nope! there should be very few matches so no need to do diagnostics of differences

Question: is it okay for 1.4.3 if adding the city makes the matches worse (e.g., more false positives)

Response: yes that's totally fine if happens; main goal of exercise is seeing false negatives and false positives from different approaches to fuzzy matching/setting parameters at different values

PS4 1.3

A. Use a SentimentIntensityAnalyzer and polarity_scores to score the entire press release for its sentiment (use the raw string of the pharma press release without punctuation/digits removed)

Output for A (don't worry about matching exactly):

B. First, remove all punctuations and digits from the string. Then, remove all named entities. With this string, score the sentiment of the press release (hint: for removing named entities, an or condition in re.sub may be useful). Did the neutral score go up or down relative to the raw pharma press release in part A? Why do you think this occurred?

Output for B (don't worry about matching exactly):

C. With the version of the string from part B that removes named entities, try to split the press release into discrete sentences (hint: re.split() may be useful since it allows or conditions in the pattern you're looking for, which can help deal with special characters). Print the first 5 sentences of the split press release (there will not be deductions if there remain some erroneous splits; just make sure it's generally splitting)

Output for C (don't worry about matching exactly):

D. Score each sentence in the split press release and print the top 5 sentences in the press release with the most negative sentiment (use the neg score- higher values = more negative). Hint: you can use pd.DataFrame to rowbind a list of dictionaries; you can then add the press release sentence for each row back as a column in that dataframe and use sort_values()

Output for D (don't worry about matching exactly):

EXTRA CREDIT PS3 1.3.3 Parts C and D

C. Iterate over the name_clean column in debar and execute the regex. Use it to create two new columns in debar:
- co_name: A column for company (full name_clean string if no match; pattern before COMPANY if one extracted)
- ind_name: A column for individual (full name_clean string if no match; pattern before INDIVIDUAL if one extracted)

D. Print three columns for the rows in debar containing the negative example and positive example described above (county fair farm and cisco produce):

name_clean
co_name
ind_name
Violation

Note: as shown in the outcome there may be duplicates of the same company reflecting different violations

Output for part D:

PS5 2.6

2.6 group by state and find the percentage of years where male scores higher than females

A. Write a query that (1) groups by state and (2) finds the percentage of years that have higher scores for males than females in that state

B. Plot the results ordering the states from males higher all 4 years to males higher none of the years

Output (dont need to match aesthetics but does need to be ordered):

PS4 1.2 D, E

D. You want to extract the possible sentence lengths the CEO is facing; pull out the named entities with (1) the label DATE and (2) that contain the word year or years (hint: you may want to use the re module for that second part). Print these named entities.

E. Pull and print the original parts of the press releases where those year lengths are mentioned (e.g., the sentences or rough region of the press release). Describe in your own words (1 sentence) what length of sentence (prison) and probation (supervised release) the CEO may be facing if convicted after this indictment (if there are multiple lengths mentioned describe the maximum). Hint: you may want to use re.search or re.findall

Output for D:

Output for E:

PS4 1.3

A. Subset the press releases to those labeled with one of three topics via topics_clean: Civil Rights, Hate Crimes, and Project Safe Childhood. We'll call this doj_subset going forward and it should have 717 rows.

Dimensions after part A:

B. Write a function that takes one press release string as an input and:

Removes named entities from each press release string
Scores the sentiment of the entire press release
Returns the length-four (negative, positive, neutral, compound) sentiment dictionary

Apply that function to each of the press releases in doj_subset.

Hints:

You may want to use re.escape at some point to avoid errors relating to escape characters like ( in the press release
I used a function + list comprehension to execute and it takes about 30 seconds on my local machine and about 2 mins on jhub; if it's taking a very long time, you may want to check your code for inefficiencies. If you can't fix those, for partial credit on this part/full credit on remainder, you can take a small random sample of the 717

C. Add the four sentiment scores to the doj_subset dataframe. Sort from highest neg to lowest neg score and print the top id, contents, and neg columns of the two most neg press releases.

Output for top 2 most negative (truncated):

D. With the dataframe from part C, find the mean compound score for each of the three topics in topics_clean using group_by and agg.

Output for part D:

E. Add a 1 sentence interpretation of why we might see the variation in scores (remember that compound is a standardized summary where -1 is most negative; +1 is most positive)

Problem Set 2 Submission

My notebook for problem set 2 is uploaded to https://github.com/dscolby/qss20_w22_assignments/tree/main/code/Problem%20Set%202

PS3 1.1

Load the following datasets stored in pset3_inputdata

Historical H2A debarments (debar.csv); call this debar
Q1 2021 H2A job postings (jobs.csv); call this jobs

Example output:

Head of debar data:

Head of jobs data (columns truncated):

PS4 1.1 B and C

B. With the preprocessed press release from part A, use the part of speech tagger within nltk to tag all the words in that one press release with their part of speech.

C. Using the output from B, extract the adjectives and sort those adjectives from most occurrences to fewest occurrences. Print a dataframe with the 5 most frequent adjectives and their counts in the pharma release. See here for a list of the names of adjectives within nltk: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/

Example output from part C:

PS3 2.1

2.1 Preprocessing location (2 points)

You want to block on state but notice that states in debar have a mix of two digit codes and full state names (e.g., GA versus Georgia) while states in jobs are all two-digit codes

A. Run the code below to load the states crosswalk (this matches state abbreviations with their full name).

B. Use that crosswalk to create a field in debar that has the two-digit state abbreviation for all locations (hint: you may need to first split the City, State string on the ", " or use str.replace to extract the state into a new column before converting it to all abbreviations)

Hint: the GitHub issue contains the value_counts() for the cleaned two-digit states after this step

C. Use an assert statement to check that all the states in debar are two-digits after the cleaning

Notes: you can filter out states that are NaN

Concepts and resources:

For part B, draw on regex or pandas str functions to create a State variable from the City, State column in debar
Then, one approach is to use pd.merge to merge the state crosswalk onto debar (example head() shown in the GitHub issue)
You may want to structure the pd.merge to only merge on rows without two-digit names to the crosswalk with those two-digit names and pd.concat to bind the dataframe back together
How to structure an assert statement: https://www.w3schools.com/python/ref_keyword_assert.asp

Output from .head() statement after merging on the two digit code-state name crosswalk:

Output from value_counts() after converting all states to two digits:

PS4 2.1

2.1 Preprocess the data by removing stopwords, punctuation, and non-alpha words (5 points)

A. Write a function that:

Takes in a single raw string in the contents column from that dataframe
Does the following preprocessing steps:
- Converts the words to lowercase
- Removes stopwords, adding the custom stopwords in your code cell below to the default stopwords list
- Only retains alpha words (so removes digits and punctuation)
- Only retains words 4 characters or longer
- Uses the snowball stemmer from nltk to stem
Returns a joined preprocessed string

B. Use apply or list comprehension to execute that function and create a new column in the data called processed_text

C. Print the id, contents, and processed_text columns for the following press releases:

id = 16-718 (this case: https://www.seattletimes.com/nation-world/doj-miami-police-reach-settlement-in-civil-rights-case/)

id = 16-217 (this case: https://www.wlbt.com/story/32275512/three-mississippi-correctional-officers-indicted-for-inmate-assault-and-cover-up/)

Resources:

Here's code examples for the snowball stemmer: https://www.geeksforgeeks.org/snowball-stemmer-nlp/
Here's more condensed code with topic modeling steps: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/solutions/06_textasdata_partII_topicmodeling_solution.ipynb
Here's longer code with more broken-out topic modeling steps: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/06_textasdata_partII_topicmodeling.ipynb

Output for those two press releases (cutoff at end but you can get gist):

PS3 2.2

A. Write fuzzy matching code (don't yet put inside a user-defined function, you'll do that in 2.3) that:

Blocks on two-digit state code
Finds matches based on similarity between the employer name (name_clean) in debar (uppercase and cleaned) and name_clean in jobs (uppercase and cleaned). You can choose which distance metric and threshold to use (feel free to set a threshold low enough to get some matches even if that leads to some false positives).

For the steps after compute, just take any match with non-zero value rather than using a classifier (so skip the k-means or e-m step from the class example code code)

Concepts and resources:

Solutions code here (more consolidated): https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/solutions/05_merging_fuzzy_activity_solutions.ipynb
Example code here (more step by step): https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/solutions/05_merging_fuzzy_codeexample.ipynb

Hint: you may need to deduplicate records in the datasets for the recordlinkage package to work. See drop_duplicates within pandas and subset command to only deduplicate using name_clean and the two-digit state code

B. Print the following columns in the resulting matched dataset and comment on examples of ones that seem like true positives and ones that seem like false matches:

name_clean for jobs and debar
state in debar
state in jobs

Output for part B: IMPORTANT NOTE - since different solutions will use different string distance functions and thresholds, don't try to replicate this exact output; instead, we show it as one example of how to structure the dataset in Part B to showcase matches

PS2 1.1 Part B

Then, follow the instructions in the codebook (combining COMMITMENT_TERM with COMMITMENT_UNIT) to create a standard sentence length in days column (senlength_derived) that measures the sentence in days. To simplify, you can assume that:

1 hour = 1/24th of a day
1 year = 365 days
1 month = 30.5 days
1 week = 7 days
Natural life = difference between the age of 100 and the defendant's age at incident (cleaned; if missing, code to age 20); note that this is a simplification since age at incident != age at sentencing

Print the following cols for an example of each type (eg an example of originally hours; an example of natural life): COMMITMENT_TERM, COMMITMENT_UNIT, age_derived and your new standardized sentence length column (senlength_derived)

Print the summary of that column (senlength_derived) using the .describe() command

Concepts tested and resources: there are many approaches but a couple ways are:

np.select covered in the slides and this activity notebook: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/solutions/00_pandas_datacleaning_solutions.ipynb
writing a function that takes in one row as an argument and has a series of if, elif, else conditions where different commitment_units are translated into days. To execute this function, you can use the .apply function but apply it with axis = 1 (row-wise). Resources for that include: (1) the activity notebook on user-defined functions (https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/solutions/02_loopsfunctions_solutions.ipynb); (2) the activity notebook covering apply: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/solutions/00_pandas_datacleaning_solutions.ipynb

Hint on output: see GitHub issue for the summary stats we get from running .describe()

PS4 2.3

2.3 Estimate a topic model using those preprocessed words (5 points)

A. Going back to the preprocessed words from part 2.3.1, estimate a topic model with 3 topics, since you want to see if the unsupervised topic models recover different themes for each of the three manually-labeled areas (civil rights; hate crimes; project safe childhood). You have free rein over the other topic model parameters beyond the number of topics.

B. After estimating the topic model, print the top 15 words in each topic.

Hints and Resources:

Same topic modeling resources linked to above
Make sure to use the random_state argument within the model so that the numbering of topics does not move around between runs of your code

Output: NOTE- yours will likely be different bc it depends on (1) parameter values within ldamodel and (2) the random state you use; do not worry about this

EXTRA CREDIT PS3 1.3.3 Parts A and B

1.3.3 Regex to separate companies from individuals (4 points)

You notice some employers in debar have both the name of the company and the name of individual, e.g.:

COUNTY FAIR FARM (COMPANY) AND ANDREW WILLIAMSON (INDIVIDUAL)*

Use the uppercase/cleaned version of Name in debar

A. Write a regex pattern that does the following:
- Captures the pattern that occurs before COMPANY if (COMPANY) is in string-- so in above, extracts COUNTY FAIR FARM
- Captures the pattern that occurs before INDIVIDUAL if (INDIVIDUAL) is also in string -- so in above, extracts ANDREW WILLIAMSON (so omit the "and")

B. Test the pattern on pos_example and neg_example-- make sure former returns a list (if using find.all) or match object (if using re.search) with the company name and individual name separated out; make sure latter returns empty

Hints and resources: for step A, you can either use re.search, re.match, or re.findall; don't worry about matching B&R Harvesting and Paul Cruz (Individual)

Same regex resources as above

Output:

Example output from pos_example = "COUNTY FAIR FARM (COMPANY) AND ANDREW WILLIAMSON (INDIVIDUAL)*" (this corresponds to the re.findall approach; could also use other method)

Example output from neg_example = "CISCO PRODUCE INC"

Empty / returns no match

PS5 2.7

2.7 Use a subquery to create an indicator and group by that indicator (6 points)

The following states were the first 6 to expand the right to vote to women before the uniform federal expansion in 1920

Wyoming 1890
Colorado 1893
Utah 1896
Idaho 1896
Washington 1910
California 1911

A. Create an indicator is_early_vote for whether a state is in that list or not; do so without typing the state names inside the string and instead collapsing the list of states we provide and using something like format. Hint on how to combine the state names while preserving the quotes around each: https://stackoverflow.com/questions/12007686/join-a-list-of-strings-in-python-and-wrap-each-string-in-quotation-marks

B. Then, group by the is_early_vote indicator and year and find the percencentage of states in each group where males had higher scores than females

C. Print the resulting dataframe and interpret. Does early expansion of voting seem to be correlated with girls scoring better on the math tests a century later?

Hint: in order to group by the indicator in step b, you may need to use a subquery

Output (fine for yours to be formatted differently):

PS3 1.3.2 Part A and B

1.3.2 Cleaning up punctuation (4 points)

You notice that INC, CO, and LLC are sometimes followed by a period (.) but sometimes not

A. For each dataset, write a regex pattern using re.sub to remove the . but only if it's preceded by INC, LLC, or CO

Make sure LLC, INC, CO remain part of the string but just without the dot

B. Test the pattern on the positive and negative example we provide below and print the result. See the Github issue for examples of what to return

Hint: https://stackoverflow.com/questions/7191209/python-re-sub-replace-with-matched-content

Resources:
- Regex slides: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/slides/w22_slides/06_qss20_w22_unit6_regex.pdf
- Regex activity code: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/solutions/04_basicregex_solutions.ipynb

Output:

Output for pos_example_1 = "CISCO PRODUCE INC."

Output for pos_example_2 = "AVOYELLES HONEY CO., LLC"

Output for neg_example = "E.V. RANCH LLP"

PS5 1.5

1.5 Write and execute user-defined function that takes in a query and returns a pandas dataframe with the content of the response (4 points)

Write a user-defined function (process_naep_query) that takes in the NAEP query as a string, calls the API, and transforms the response into a pandas dataframe. Have the function return that pandas dataframe
Make sure the function is flexible enough to handle queries that return an error; for queries that return an error, have the function return the string "Data not found; check your query"
Execute the function on the query that pulls 2013-2019 data (either from handwriting the query or the result in 1.4)
Print the resulting dataframe
Then execute the function on a query that pulls a state that doesn't exist (call this state ZZ)

** Output from correct query**: same as output from 1.2

Output from incorrect query:

PS5 3.1

3.1 Obtain a Census API key, place it in credentials yaml, load the yaml file, and initialize connection to Census API using the `census` package (0 points)

Obtain a census API key from here: https://api.census.gov/data/key_signup.html
Place it in a credentials yaml file (feel free to name the key whatever; can be in same yaml file as the database credentials or different one)
Load the credentials file with the key
Documentation here for the census package on establishing an API connection: https://github.com/datamade/census

What a successful api connection looks like:

Twitter module

https://www.cambridge.org/core/journals/politics-and-the-life-sciences/article/twitter-as-research-data/6B31D18C5E2F9B8F9C0301BFB05F1C27

fake pset submission

Here's my pset: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/s21_activities/00_latex_output_examples.ipynb

rebeccajohnson88 / qss20_slides_activities Goto Github PK

qss20_slides_activities's People

Contributors

Stargazers

Watchers

Forkers

qss20_slides_activities's Issues

2.4 Add topics back to main data and explore correlation between manual labels and our estimated topics (10 points)

3. Extend the analysis from unigrams to bigrams (10 points)

1.3.1 Converting to upper (2 points)

2.3 Find the (1) number of rows in the database, (2) number of distinct states, (3) number of distinct years (3 points)

2.1 Load database credentials if haven't already and establish a connection (1 point)

Group portion

Individual portion

1.4 Reproduce the queries from 1.1 and 1.2 using a user-defined function (4 points)

2.3 Put the code from 2.2 into a fuzzy matching function that generalizes some steps (8 points)

2.2 Run a query to select all columns and the first 5 rows of the math_gencompare database to explore structure (2 points)

1.3 Create a line plot to show variation in the scores across years (2 points)

1. NLP on one press release (10 points)

1.1 part of speech tagging (3 points)

2.2 Create a document-term matrix from the preprocessed press releases and to explore top words (5 points)

2.5 group by year and find the percentage of states where male scores are higher than females (4 points)

1.2 Try exact merge on business name (2 points)

1.2 Write a query to pull 8th-grade mathematics scores in 2013, 2015, 2017, and 2019 from California by gender (1 point)

1.1 Write a query to pull 8th-grade mathematics scores in 2015 from California by gender (1 point)

2.4 Construct a new variable, is_male_higher that takes the value of 1 if the math scores of males exceed that of females in that state and year (each row) (2 points)

2.6 group by state and find the percentage of years where male scores higher than females

2.1 Preprocessing location (2 points)

2.1 Preprocess the data by removing stopwords, punctuation, and non-alpha words (5 points)

2.3 Estimate a topic model using those preprocessed words (5 points)

1.3.3 Regex to separate companies from individuals (4 points)

2.7 Use a subquery to create an indicator and group by that indicator (6 points)

1.3.2 Cleaning up punctuation (4 points)

1.5 Write and execute user-defined function that takes in a query and returns a pandas dataframe with the content of the response (4 points)

3.1 Obtain a Census API key, place it in credentials yaml, load the yaml file, and initialize connection to Census API using the census package (0 points)

Recommend Projects

Recommend Topics

Recommend Org

2.4 Construct a new variable, `is_male_higher` that takes the value of 1 if the math scores of males exceed that of females in that state and year (each row) (2 points)

3.1 Obtain a Census API key, place it in credentials yaml, load the yaml file, and initialize connection to Census API using the `census` package (0 points)