rebeccajohnson88 / qss20_slides_activities

Repo for slides and activities for QSS 20
Relevant directory: https://github.com/rebeccajohnson88/qss20_slides_activities/tree/main/problemsets
A. Using the original pharma press release (so the one before stripping punctuation/digits), use spaCy to extract all named entities from the press release.
B. Print the unique named entities with the tag: LAW
C. Use Google to summarize in one sentence what the RICO
named entity means and why this might apply to a pharmaceutical case (and not just a mafia case...)
Output for unique named entities with tag LAW:
Keep the above ~58k row dataset that is subsetted to only sentences involving incarceration. Then, further subset the rows to:
- judgeid_derived == "judge_21"
- simplified_offense_derived == "Narcotics"

Use shape to print the dimensions of the resulting dataframe
Concepts and resources: row subsetting using logical conditions; see above resources
Hint on output: we get 53 rows after this filtering step. Don't worry about matching this exactly but try to get in the ballpark range.
- COMMITMENT_TYPE == "Illinois Department of Corrections"
- Using the COMMITMENT_UNIT column that represents sentence length, filter out the following non-numeric sentence lengths: Term, Pounds, Dollars, Ounces (year, months, natural life, days, hours, and weeks should remain)

Concepts tested and resources: this question tests filtering rows based on logical conditions. Here are some resources:
Hints on output: we get 58289 rows remaining after this filtering. Don't worry about matching this exactly but try to get in the ballpark range.
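The two filtering steps above can be sketched in pandas; the dataframe here is a made-up stand-in, but the column names follow the pset:

```python
import pandas as pd

# Toy stand-in for the sentencing data; rows are invented,
# column names (COMMITMENT_TYPE, COMMITMENT_UNIT) follow the pset
df = pd.DataFrame({
    "COMMITMENT_TYPE": ["Illinois Department of Corrections"] * 4 + ["Probation"],
    "COMMITMENT_UNIT": ["Year(s)", "Pounds", "Months", "Term", "Year(s)"],
})

# keep IDOC commitments whose units are NOT in the non-numeric list
non_numeric_units = ["Term", "Pounds", "Dollars", "Ounces"]
filtered = df[
    (df["COMMITMENT_TYPE"] == "Illinois Department of Corrections")
    & (~df["COMMITMENT_UNIT"].isin(non_numeric_units))
]
print(filtered.shape)  # (2, 2)
```

The `~` negates the `isin` mask, which is the usual idiom for "filter out" conditions.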
A. Extract the document-level topic probabilities. Within get_document_topics, use the argument minimum_probability = 0 to make sure all 3 topic probabilities are returned. Write an assert statement to make sure the length of the list is equal to the number of rows in the doj_subset_wscores dataframe
B. Add the topic probabilities to the doj_subset_wscores dataframe as columns and create a column, top_topic, that assigns each document to its highest-probability topic (e.g., topic 1, 2, or 3)
C. For each of the manual labels in topics_clean
(Hate Crime, Civil Rights, Project Safe Childhood), print the breakdown of the % of documents with each top topic (so, for instance, Hate Crime has 246 documents-- if 123 of those documents are coded to topic_1, that would be 50%; and so on). Hint: pd.crosstab and normalize may be helpful: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.crosstab.html
D. Using a couple press releases as examples, write a 1-2 sentence interpretation of why some of the manual topics map more cleanly onto an estimated topic than others
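The crosstab hint in part C can be sketched as follows; the labels and topic assignments here are invented for illustration:

```python
import pandas as pd

# Hypothetical manual label vs. estimated top topic for 6 documents
df = pd.DataFrame({
    "topics_clean": ["Hate Crime", "Hate Crime", "Civil Rights",
                     "Civil Rights", "Project Safe Childhood",
                     "Project Safe Childhood"],
    "top_topic": ["topic_1", "topic_1", "topic_1",
                  "topic_2", "topic_3", "topic_3"],
})

# normalize="index" turns counts into within-row proportions, i.e.,
# the share of each manual label's documents assigned to each top topic
breakdown = pd.crosstab(df["topics_clean"], df["top_topic"],
                        normalize="index")
print(breakdown)
```

Each row sums to 1, so a label that maps cleanly onto one estimated topic shows a single cell near 1.0.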
Resources:
Additional summaries of topics and documents: this notebook contains an example of how to use get_document_topics and other steps to add topic probabilities back to the data: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/solutions/06_textasdata_partII_topicmodeling_solution.ipynb

Output: NOTE - do not worry if your numbers are slightly different or if your topics have different labels (though they should range from 1...3 rather than 0...2) --- the labels are affected by the random_state argument at the model estimation stage, but the results should broadly show hate crimes and Project Safe Childhood mapping cleanly to one topic each, with civil rights more split
In the previous question, you found top words via a unigram representation of the text. Now, we want to see how those top words change with bigrams (pairs of words)
A. Using the doj_subset_wscore data and the processed_text column (so the words after stemming/other preprocessing), create a column in the data called processed_text_bigrams that combines each consecutive pair of words into a bigram separated by an underscore. E.g., "depart reach settlem" would become "depart_reach reach_settlem". Do this by writing a function create_bigram_onedoc that takes in a single processed_text string and returns a string with its bigrams structured similarly to the above example
Hint: there are many ways to solve this, but zip may be helpful: https://stackoverflow.com/questions/21303224/iterate-over-all-pairs-of-consecutive-items-in-a-list
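A minimal sketch of create_bigram_onedoc using the zip approach from the hint:

```python
def create_bigram_onedoc(processed_text: str) -> str:
    """Turn 'depart reach settlem' into 'depart_reach reach_settlem'."""
    words = processed_text.split()
    # zip the word list against itself shifted by one position
    # to iterate over consecutive pairs
    bigrams = ["_".join(pair) for pair in zip(words, words[1:])]
    return " ".join(bigrams)

print(create_bigram_onedoc("depart reach settlem"))
# depart_reach reach_settlem
```

The function can then be applied across the column, e.g. with `.apply(create_bigram_onedoc)`.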
B. Print the id, processed_text, and processed_text_bigram columns for the press release with id = 16-217
Output from example (fine if yours differs slightly):
A. Convert the names EMPLOYER_NAME and Name to uppercase using list comprehension rather than df.varname.str.upper() (it's fine to do a separate list comprehension line for each of the two columns)
B. Print a random sample of 15 values of each result
C. Assign the full vector of uppercase names back to the original data, writing over the original EMPLOYER_NAME and Name columns
Resources:
- Activity code with list comprehension: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/solutions/00_pandas_datacleaning_solutions.ipynb
- Sampling from a list without replacement using the random module: https://note.nkmk.me/en/python-random-choice-sample-choices/
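A short sketch of the list-comprehension and sampling steps; the jobs dataframe here is a made-up stand-in that borrows the pset's column name:

```python
import random
import pandas as pd

# Hypothetical stand-in for the jobs data
jobs = pd.DataFrame({"EMPLOYER_NAME": ["american crawfish llc",
                                       "acme farms",
                                       "delta harvest"]})

# A. uppercase via list comprehension rather than
#    jobs.EMPLOYER_NAME.str.upper()
upper_names = [name.upper() for name in jobs["EMPLOYER_NAME"]]

# B. random sample without replacement (2 of 3 here; pset asks for 15)
print(random.sample(upper_names, 2))

# C. write the uppercase vector back over the original column
jobs["EMPLOYER_NAME"] = upper_names
```

The same pattern repeats for the Name column in debar.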
Output: since everyone's random sample will be different, we don't have a uniform output; here's an example of the uppercase conversion for the American Crawfish employer in jobs: AMERICAN CRAWFISH LLC
Question: for question 1.1, "Print the mean and median number of charges per case/participant", should we calculate the mean and median numbers per case and then per participant?
Response: ok with any reasonable answer but originally was one combined --- I'd basically think of it as grouping by both case and participant, find the number of charges for each case-participant group, and then the mean/median of that (sorry a bit confusing but just results from fact of participants nested in cases)
Question: for question 1.2.2, if the category does not contain aggravated, arson, homicide, vehicle/vehicular, or battery, what should appear in the simplified_offense_derived column?
Response: for other categories that aren't changed, the simplified_offense_derived column should be set to the original value for that category (eg if Retail Theft isn't changed by any of the above, simplified_offense_derived stays as Retail Theft)
Question: for question 1.3, if the race == NaN, should that be categorized as none of the above?
Response: in short doesn't matter since we'll be subsetting to black and white; so can either code as true for other or leave as missing/false for all.
Question: just wanted to clarify that we should use the sentenceymd_derived column for part 2.1?
Response: for all the time based questions going forward, you can use either sentenceym_derived or sentenceymd_derived (whichever makes more sense for the question)
Question: what does winsorize in this part of 1.3 mean - "Winsorize the top 0.01% of values to be equal to the 99.99th percentile value pre-winsorization. Call this age_derived"
Response: word winsorize isn't important for answering (but some discussion here!: https://en.wikipedia.org/wiki/Winsorizing). In this question, it involves: (1) finding what the 99.99th percentile value for age at incident is in the raw data, (2) if a given row is above that 99.99th percentile value, recode it to the 99.99th percentile value (so recode it rather than drop it). Recommend this command: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html
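Steps (1) and (2) of that response can be sketched like this; the ages are invented, and np.where is just one of several ways to do the recode:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one implausible outlier
ages = pd.Series([23, 31, 45, 52, 60, 450])

# (1) find the 99.99th percentile in the raw (pre-winsorization) data
cap = ages.quantile(0.9999)

# (2) recode values above that percentile to the percentile value
#     itself (recode rather than drop)
age_derived = pd.Series(np.where(ages > cap, cap, ages))
```

After this, no value exceeds the pre-winsorization 99.99th percentile, but every row is retained.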
Question: for 2.3.3, I understand the comparison by time frame but I’m wondering what type of disparities are we looking to compare? Raw number of retail theft sentences, sentence length for retail thefts or a different column?
Response: good question! It should be (1) just looking at retail thefts, (2) looking at what percentage of defendants are incarcerated (so repeating aggregation from 2.3.2 but doing the time variation)
Question: if working locally, why might I be having issues loading the sentencing_cleaned.pkl file and what can I do about it?
Response: pkl files are a filetype that preserve things like the datatypes (e.g., datetime) over csv. The downside is that there are what are called serialization protocols that differ between versions of python. The pkl I shared with you corresponds to python 3.8+ (version on jupyter hub).
If you want to work locally and you don't have python 3.8+ installed yet (can check versions by going to the terminal and typing python --version), you can load the csv and use the following code to recast the two datetime columns back to the correct type (screenshot below is me checking that this works)
sentencing_csv['sentenceymd_derived'] = pd.to_datetime(sentencing_csv.sentenceymd_derived)
sentencing_csv['sentenceym_derived'] = pd.to_datetime(sentencing_csv.sentenceym_derived)
Question: for questions like 2.3.2 and 3.3 that say "print table", is printing a pandas dataframe fine?
Response: yep just a dataframe or whatever showing the results is fine!
Question: I am still a bit confused on what we are supposed to do for this question "Use mathematical notation to write out each of the proportions using summation notation," after reviewing the in-class Latex example. Are we supposed to export the table we created to latex and in latex make the summation notation?
Response: (1) use the menu at top to change the cell to markdown, (2) using math mode (e.g., open and closed $ signs), write out each proportion in summation notation
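For illustration, a markdown cell using math mode might contain something like the following; the proportion and symbols here are hypothetical placeholders, not the pset's exact quantities:

```latex
$$
p_{\text{incarcerated}} = \frac{\sum_{i=1}^{N} \mathbb{1}(\text{incarcerated}_i = 1)}{N}
$$
```

Here $N$ would be the number of rows in the relevant subset and $\mathbb{1}(\cdot)$ an indicator that is 1 when the condition holds.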
Question: for question 3.2, are we supposed to print the median sentence length for all defendants/judges (ie plot grouped by judge and by race) or just by overall values (ie just one median value each for white and black and only grouped by race)
Response: yep, to unpack this a bit: for each judge with at least 20 Black and at least 20 white defendants, plot the (1) median sentence length for Black defendants and (2) median sentence length for white defendants (factor variable on x axis for each judge_id who qualifies; group by race)
Question: for question 3.3, who are our focal defendants, every defendant for a narcotics offense for judge 21? Does using the "same narcotics offenses" mean we're only looking at Black and White defendants like in 3.2, and does that also mean we shouldn't consider other races when doing the matched pairs? And lastly, if every defendant is a focal defendant, won't we get repeat pairs just in different orders?
Response:
Create a function, construct_naep_query, that takes in two arguments:
Have the function return the query and make sure it's identical to the queries you wrote for 1.1 and 1.2 (can use assert or other checker)
No output to display since output is just the query you're being asked to produce
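One possible shape for the function, assuming the two arguments are year and grade; the base URL and parameter names below are placeholders, so substitute the exact query string you built in 1.1 and 1.2:

```python
def construct_naep_query(year: int, grade: int) -> str:
    # Placeholder query format: swap in the exact base URL and
    # parameters from your answers to 1.1 and 1.2
    return ("https://www.nationsreportcard.gov/DataService/GetAdhocData.aspx"
            f"?type=data&subject=mathematics&grade={grade}&year={year}")

# check the generated query against a query written by hand for 1.1
handwritten_query = ("https://www.nationsreportcard.gov/DataService/GetAdhocData.aspx"
                     "?type=data&subject=mathematics&grade=8&year=2019")
assert construct_naep_query(2019, 8) == handwritten_query
```

The assert is the "checker" the instructions mention: it fails loudly if the generated query drifts from the hand-written one.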
You want to see how the matches change if you add the city and not only state as a field and also want to automate the process of matching a bit to try different distance thresholds.
A. Extract the City from the City, State column of debar
B. Convert that new city column to uppercase and convert the EMPLOYER_CITY column in jobs to uppercase
C. Write a function wrapping the recordlinkage code that you wrote in problem 2.2 (so you don't need to recode the package from scratch) that (1) takes in each dataset, (2) blocks on two-digit state, and (3) fuzzy matches on employer name and employer city. See below notes on partial versus full credit for function structure.
D. Execute the function with a couple different string distance thresholds and print the resulting matches for each
6 out of 8 points: function takes arguments for input datasets, varname to block on, two varnames to fuzzy match on, string distance function, and string distance threshold
8 out of 8: above but function is also general enough that it takes a variable # of strings to match on--- so should work if you either execute just using employer name or also work if you execute using employer name and employer city
Concepts and resources: same as above. One hint is that for us, the easiest way to get to the full credit solution was to feed the function a dictionary where each key corresponds to one string variable to fuzzy match on; the values contain that variable's name in each dataset, the distance metric, and the distance threshold.
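The hinted dictionary might look like the following; every key, column name, metric, and threshold here is illustrative:

```python
# Hypothetical structure for the full-credit solution: one key per
# string variable to fuzzy match on; the values give that variable's
# name in each dataset plus its string-distance settings
fuzzy_vars = {
    "employer_name": {"jobs_col": "EMPLOYER_NAME", "debar_col": "Name",
                      "method": "jarowinkler", "threshold": 0.9},
    "employer_city": {"jobs_col": "EMPLOYER_CITY", "debar_col": "city",
                      "method": "jarowinkler", "threshold": 0.75},
}

# a wrapper function could then loop over the entries and add one
# fuzzy-match comparison rule per variable before computing matches
for varname, spec in fuzzy_vars.items():
    print(varname, spec["method"], spec["threshold"])
```

Because the function loops over however many entries the dictionary has, it works with just employer name or with name plus city.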
Example output: IMPORTANT- yours will likely differ because of string distance metrics etc. Here's our example though for a distance threshold of 0.9 for both business name and city name and then a threshold of 0.75 for each
Clone the qss20_slides_activities repo locally. First parts of the course will be (can discuss good deadlines Monday)
Using the results from Part B, use pd.concat or another approach to create a dataframe that compares the (1) race and sentence length for the focal defendant to (2) the sentence length for other defendants. Using this dataframe, show this comparison for focal defendant: CASE_PARTICIPANT_ID == 808109112733
Using the results from 1.2, create a plot where the x axis has the year and the y axis is the math scores (value in the dataframe), and there are separate lines/colors for male versus female students
Start the limits of the y axis at 270 and add informative labels
Our graph (doesn't need to match exactly in terms of aesthetics/labels):
Focus on the following press release: id == "17-1204" about this pharmaceutical kickback prosecution: https://www.forbes.com/sites/michelatindera/2017/11/16/fentanyl-billionaire-john-kapoor-to-plead-not-guilty-in-opioid-kickback-case/?sh=21b8574d6c6c
The contents column is the one we're treating as a document. You may need to convert it from a pandas series to a single string. We'll call this press release pharma.
A. Preprocess the pharma press release to remove all punctuation / digits (so you can use .isalpha() to subset)
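A minimal sketch of the .isalpha() idea on an invented string (the real input is the pharma press release, and your tokenizer may differ):

```python
# Tokenize on whitespace and keep purely alphabetic tokens only;
# this drops tokens containing digits or punctuation
raw = "In 2017, the CEO -- John Kapoor -- was charged under RICO."
tokens = [tok for tok in raw.split() if tok.isalpha()]
print(tokens[:5])  # ['In', 'the', 'CEO', 'John', 'Kapoor']
```

Note that .isalpha() also drops tokens with trailing punctuation (e.g., "RICO."); if you tokenize with nltk instead, punctuation becomes separate tokens and fewer words are lost.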
Example output from first five tokens:
A. Use the create_dtm function I provide (alternately, feel free to write your own!) and create a document-term matrix using the preprocessed press releases; make sure the metadata contains the following columns: id, the compound sentiment column you added, and the topics_clean column
B. Print the top 10 words for press releases with compound sentiment in the top 5% (so the most positive sentiment)
C. Print the top 10 words for press releases with compound sentiment in the bottom 5% (so the most negative sentiment)
Hint: for these, remember the pandas quantile function from pset one.
D. Print the top 10 words for press releases in each of the three topics_clean
For steps B - D, to receive full credit, write a function get_topwords that helps you avoid duplicated code when you find top words for the different subsets of the data. There are different ways to structure it but one way is to feed it subsetted data (so data subsetted to one topic etc.) and for it to get the top words for that subset.
Resources:
Output for top words based on compound scores:
Output for top words based on the topics_clean:
Using the results from Part C, group by the focal defendant's race and find the proportion of that defendant's matches who had a LONGER sentence than the focal defendant
Concepts and resources: can use groupby and agg
Hints on output: since we're grouping by race in the full pool, the resulting output should have two values: one for the rate at which Black defendants have matches with higher sentences; another for the rate at which White defendants have matches with higher sentences
A. Write a query that (1) groups by year and (2) finds the percentage of states that have higher scores for males than females in this year
B. Print the resulting dataframe and interpret the results
Hint: you can either use a subquery to construct is_male_higher and use it, or do it all in one query with a comparison; the avg command is useful
Output:
For each defendant sentenced by judge_21, you want to construct a "matched group" of defendants who:
- match on age (age_derived) and
- sex (is_male_derived) but
- differ on race (is_black_derived/is_white_derived)

Write a user-defined function to find any/all matched defendants for each focal defendant of judge 21. You can structure the function in various ways but one way is to write a function similar to the class example where we iterate over different DC crimes (focal crimes) and find similar crimes. For this problem, we want to:
- iterate over each focal defendant (can use CASE_PARTICIPANT_ID to identify each unique defendant)

Concepts and resources: pd.merge to join these on after; we'll be covering pd.merge Monday 01.24 but here are last year's slides (slides 17-20) in the meantime: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/slides/s21_slides/qss20_s21_class4.pdf

Hints on output:
- the EMPLOYER_NAME field of the jobs dataset
- the Name field of the debar dataset

A. Use pd.merge with an inner join on those fields to see whether there are any exact matches.
B. If there are exact matches, subset to the following columns and print the rows with exact matches:
- the names (EMPLOYER_NAME and Name)
- the dates (Start date and End date in debar)
- the locations (City, State in debar and EMPLOYER_CITY and EMPLOYER_STATE in jobs)

Example output:
C. Use the create_dtm function and the processed_text_bigrams column to create a document-term matrix (dtm_bigram) with these bigrams. Keep the following three columns in the data: id, topics_clean, and compound
D. Print the (1) dimensions of the dtm matrix from question 2.2 and (2) the dimensions of the dtm_bigram matrix. Comment on why the bigram matrix has more dimensions than the unigram matrix
Output (don't need to match number of columns exactly):
E. Find and print the 10 most prevalent bigrams for each of the three topics_clean using the get_topwords function from 2.2
C. Use that pattern in conjunction with re.sub and list comprehension to clean the columns in each dataset. Save the new columns as name_clean in each. Then, use row subsetting to subset to rows that changed names and:
- for debar, print the Name and name_clean columns
- for jobs, print the EMPLOYER_NAME and name_clean columns

Make sure to use the uppercase versions of the variables.
Concepts and resources: for the last part of row subsetting and printing two columns in each dataframe, use the .loc examples shown here: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/solutions/00_pandas_datacleaning_solutions.ipynb
Output:
Head of changed employer names in jobs (your head may be different depending on order of data)
Head of changed employer names in debar:
QUESTION ONE:
Question: for 1.1, "Does the data seem to enable us to follow the same defendant across different cases they're charged in?" I am a little confused about how to get started on this one. I've tried grouping by case id and finding various ways to look for unique case participants, but I can't seem to find something that works
Response: I think this is close but a bit off. To see if the same defendant can be followed across cases, I'd instead group by the case_participant_id and see the number of unique case ids within the group; if there is only 1 case for each case_participant_id, we can't follow the same defendant across cases
Question: I am a little confused with 1.2.1. I don't have to reformat the data, right? It's just saying for each row check if the original offense is the same as the changed offense, right? E.g.: sentencing_data["OFFENSE_CATEGORY"] == sentencing_data["CHANGED_OFFENSE_CATEGORY"]. I guess the part that's confusing me is "case-participant-charge"
Response: create a new column, is_changed_offense, that takes one of two values: True if OFFENSE_CATEGORY is not equal to UPDATED_OFFENSE_CATEGORY; False otherwise. On the phrase "case-participant-charge observations (rows)": this is just to note that you can use something like np.where for this question and keep the unit of analysis at the row level. For later questions, we will be aggregating over different charges within the same case, so this hint is just to indicate that it's okay to keep everything at the row level for now/not yet group the data.

Question: for 1.2.1, what does it mean to "construct an indicator"?
Response: see above answer - it means to add a new column, is_changed_offense, to your dataframe
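That row-level indicator can be sketched with np.where on made-up data:

```python
import numpy as np
import pandas as pd

# Invented charges; one row per case-participant-charge
df = pd.DataFrame({
    "OFFENSE_CATEGORY": ["Retail Theft", "Narcotics", "Burglary"],
    "UPDATED_OFFENSE_CATEGORY": ["Retail Theft", "Armed Violence",
                                 "Burglary"],
})

# True where the offense category changed, False otherwise;
# the unit of analysis stays at the row level
df["is_changed_offense"] = np.where(
    df["OFFENSE_CATEGORY"] != df["UPDATED_OFFENSE_CATEGORY"], True, False)
print(df["is_changed_offense"].tolist())  # [False, True, False]
```

(The comparison alone already yields a boolean column; np.where is shown because the pset hint names it.)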
Question: for 1.2.1, how might I print one example of a changed offense?
Response: there are various ways to do this that avoid "hard coding" a specific row number. One way is to (1) filter to offenses that have changed using row subsetting, (2) run df.head(1) to print the head
Question: for 1.2.2, do I need to use advanced regex?
Response: nope; functions like str.startswith or str.replace that operate directly on pandas dataframe columns work fine
Question: for 1.2.2, I know how to generally structure the np.select but get stuck on leaving the categories I'm not recoding at their UPDATED_OFFENSE_CATEGORY. How might I do this?
Response: see the code on slide 21 - https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/slides/w22_slides/qss20_w22_unit2_pandas.pdf. You can use the same "default" syntax within np.select. But for the default, instead of coding it to a single string, set default to df['UPDATED_OFFENSE_CATEGORY']. This serves the function of leaving everything that's not recoded at its original value.
Question: for 1.3, I'm getting this error: TypeError: bad operand type for unary ~: 'float'
Response: wherever you have str.contains, try including na=False inside the parenthesis
Question: for question 1.3, I'm stuck on this part of cleaning up the timestamps:
"For full credit, before converting, you notice that some of the years have been mistranscribed (e.g., 291X or 221X instead of 201X). Programatically fix those (eg 2914 -> 2014). Even after cleaning, there will still be some that are after the year 2021 that we'll filter out later. For partial credit, you can ignore the timestamps that cause errors and set errors = "coerce" within pd.to_datetime() to allow the conversion to proceed."
Response: since we're not covering regex until later, and people have different COSC backgrounds, here's some code you can adapt for that cleaning. It relies on the re.sub command (some examples here: https://lzone.de/examples/Python%20re.sub) and the list comprehension we covered in lecture 1
```python
## replace erroneous dates (e.g., 2914 -> 2014)
sentence['tmp_clnsdate'] = [re.sub(r'2[1-9]([0-9]+)', r"20\1", str(date))
                            if bool(re.search(r'\/2[1-9][0-9]+', str(date)))
                            else str(date)
                            for date in sentence.SENTENCE_DATE]

## cast as datetime
sentence['sentenceymd_derived'] = pd.to_datetime(sentence.tmp_clnsdate)
```
Question: Also a question on 1.3, do the values have to start at "judge_1"? for some reason, mine are starting at "judge_-1"
Response: take a look at this documentation: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.ngroup.html and pay particular attention to the parameters in the ngroup documentation
Question: In general, when recoding & cleaning variables how concerned should we be with monitoring / handling missing values? For example, in Question 1.3, some of the variables we’re working with, such as SENTENCE_JUDGE have missing values, but the directions don’t seem to address this explicitly.
Response: yep in general missingness is really important to deal with for social science research but for this pset, feel free to largely ignore except in cases where we give you explicit filtering instructions like in 1.4 — "Filter out observations where judge is nan or nonsensical"
Question: for question 1.3, should we include "male name no gender given" in the is_male_derived indicator? Or would this be other?
Response: either is fine!
Question: For part 1.3, I'm not quite sure how to go about assigning the judges different id numbers. Any tips/hints?
Response: i'd think about grouping by judges and reading the documentation for this command that assigns one id for each group: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.ngroup.html
you can then paste the judge_ prefix in a separate step if that makes it easier!
Question: for 1.3, should we omit rows where the SENTENCE_JUDGE is NA?
Response: don't worry too much about na's here (fine w/ whatever) as we filter them out in 1.4
Question: for 1.3, any hints on rounding the sentenceymd_derived column to the month-year only version (sentenceym_derived)?
Response: see notes here in response to class questions. one of the responses covers how to round time in the pset, which involves month from datetime objects; can also extract years and then paste month-year together (making sure result is still datetime)- https://docs.google.com/document/d/1yzV0V8GeuW2fXGQuO9iD0X2g-9E4ORgADv1ECR3L5cQ/edit#heading=h.y94z198yrx86
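One pandas sketch of the month rounding (dates invented): converting to a monthly period and back to a timestamp keeps the result a datetime:

```python
import pandas as pd

# Hypothetical full dates
sentenceymd_derived = pd.to_datetime(
    pd.Series(["2016-03-14", "2016-03-29", "2017-11-02"]))

# round down to the first of the month while staying datetime-typed
sentenceym_derived = (sentenceymd_derived
                      .dt.to_period("M")
                      .dt.to_timestamp())
print(sentenceym_derived.tolist())
```

The two March dates collapse to the same month-year value, which is the point of the sentenceym_derived column.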
Question: for 1.3, how do I check if my conversion of sentence_judge to a string is working?
Response: check dtypes; a string column is represented as object
Question: For question 1.4 task 2, is it possible for a participant to receive 0 charge?
Response: for this part — "subset to a participant's primary charge and their current sentence (PRIMARY_CHARGE_FLAG is True and CURRENT_SENTENCE_FLAG is True). Double check that this worked by confirming there are no longer multiple charges for the same case-participant" —
it could be that both flags are false for a given participant in which case they'd be filtered out
Question: for question 1.4, is it okay if the number of rows in my filtered dataframe doesn't match the number of rows in the sentencing_cleaned.csv file I read in in part two?
Response: yep that's fine! no need to compare
Question: is 1.4 asking us to convert the data from long to wide format? where there's charge_1 charge_2 charge_3 as an individual column for each unique participant row?
Response: ah no it's simpler than that; just remove rows (CASE_IDs) where more than one participant is charged in that case; so conversely, subset to case ids .isin a list of ones that only have one participant
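That response's "subset to case ids .isin a list of ones that only have one participant" can be sketched as (data invented):

```python
import pandas as pd

# Made-up data: case 2 has two participants and should be removed
df = pd.DataFrame({"CASE_ID": [1, 2, 2, 3],
                   "CASE_PARTICIPANT_ID": [10, 20, 21, 30]})

# count unique participants per case, keep case ids with exactly one
n_participants = df.groupby("CASE_ID")["CASE_PARTICIPANT_ID"].nunique()
solo_cases = n_participants[n_participants == 1].index
df_solo = df[df["CASE_ID"].isin(solo_cases)]
print(df_solo["CASE_ID"].tolist())  # [1, 3]
```

No long-to-wide reshaping is involved; only rows are dropped.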
Same as 1.1 but pull the years 2013, 2015, 2017, and 2019 (search documentation for how to combine) in one query
Output subsetting to rel cols (can either print all cols or just relevant ones):
Write 1-2 lines commenting on the results from Part D. What other defendant or offense-level characteristics would you like to match on to investigate claims about racial disparities?
Print the output in dataframe format and briefly interpret; what do scores look like between the genders?
Output (subsetting to some relevant cols; feel free to subset or print all):
Question: I'm working on jhub and having difficulty loading states crosswalk from web
Response: I've now put a states_pset2.csv file in public_data within qss20_slides_activities (on jhub in shared/qss20) that you should be able to load using pd.read_csv() and use (can comment out the web code)
Question: for 1.2, do we need to do merge diagnostics as was done in the example code?
Response: nope! there should be very few matches so no need to do diagnostics of differences
Question: is it okay for 1.4.3 if adding the city makes the matches worse (e.g., more false positives)
Response: yes that's totally fine if happens; main goal of exercise is seeing false negatives and false positives from different approaches to fuzzy matching/setting parameters at different values
A. Use a SentimentIntensityAnalyzer and polarity_scores to score the entire press release for its sentiment (use the raw string of the pharma press release without punctuation/digits removed)
Output for A (don't worry about matching exactly):
B. First, remove all punctuation and digits from the string. Then, remove all named entities. With this string, score the sentiment of the press release (hint: for removing named entities, an or condition in re.sub may be useful). Did the neutral score go up or down relative to the raw pharma press release in part A? Why do you think this occurred?
Output for B (don't worry about matching exactly):
C. With the version of the string from part B that removes named entities, try to split the press release into discrete sentences (hint: re.split() may be useful since it allows or conditions in the pattern you're looking for, which can help deal with special characters). Print the first 5 sentences of the split press release (there will not be deductions if there remain some erroneous splits; just make sure it's generally splitting)
Output for C (don't worry about matching exactly):
D. Score each sentence in the split press release and print the top 5 sentences in the press release with the most negative sentiment (use the neg score - higher values = more negative). Hint: you can use pd.DataFrame to rowbind a list of dictionaries; you can then add the press release sentence for each row back as a column in that dataframe and use sort_values()
Output for D (don't worry about matching exactly):
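The rowbind-and-sort step from part D can be sketched with placeholder score dictionaries standing in for real polarity_scores() output:

```python
import pandas as pd

# Placeholder sentences and scores; real scores come from
# SentimentIntensityAnalyzer().polarity_scores(sentence)
sentences = ["He faces decades in prison.",
             "The company settled.",
             "Patients were harmed."]
scores = [{"neg": 0.40, "neu": 0.50, "pos": 0.10, "compound": -0.6},
          {"neg": 0.00, "neu": 0.90, "pos": 0.10, "compound": 0.2},
          {"neg": 0.55, "neu": 0.45, "pos": 0.00, "compound": -0.8}]

# rowbind the list of dicts, add the sentences back as a column,
# then sort by the neg score (higher = more negative)
score_df = pd.DataFrame(scores)
score_df["sentence"] = sentences
top_neg = score_df.sort_values("neg", ascending=False)
print(top_neg["sentence"].iloc[0])  # Patients were harmed.
```

With the real scorer, you would build the scores list by looping over the split sentences from part C.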
C. Iterate over the name_clean column in debar and execute the regex. Use it to create two new columns in debar:
- co_name: a column for the company (full name_clean string if no match; pattern before COMPANY if one extracted)
- ind_name: a column for the individual (full name_clean string if no match; pattern before INDIVIDUAL if one extracted)
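A possible sketch of the part C logic; the regex here is a guess at the pattern's shape (capture the text before COMPANY / INDIVIDUAL), not the pset's exact pattern, and the example string is invented:

```python
import re

def split_name(name_clean: str):
    """Return (co_name, ind_name) per the fallback rule in part C."""
    # lazily capture everything before the keyword, if present
    co = re.search(r"^(.*?)\s*COMPANY", name_clean)
    ind = re.search(r"^(.*?)\s*INDIVIDUAL", name_clean)
    # full string if no match; captured prefix if one extracted
    co_name = co.group(1) if co else name_clean
    ind_name = ind.group(1) if ind else name_clean
    return co_name, ind_name

print(split_name("CISCO PRODUCE COMPANY JOHN DOE INDIVIDUAL"))
```

In the pset this logic would run over debar.name_clean (e.g., via a list comprehension) to populate the two new columns.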
D. Print the following columns for the rows in debar containing the negative example and positive example described above (county fair farm and cisco produce):
- name_clean
- co_name
- ind_name
- Violation
Note: as shown in the outcome there may be duplicates of the same company reflecting different violations
Output for part D:
A. Write a query that (1) groups by state and (2) finds the percentage of years that have higher scores for males than females in that state
B. Plot the results ordering the states from males higher all 4 years to males higher none of the years
Output (don't need to match aesthetics but does need to be ordered):
D. You want to extract the possible sentence lengths the CEO is facing; pull out the named entities with (1) the label DATE and (2) that contain the word year or years (hint: you may want to use the re module for that second part). Print these named entities.
E. Pull and print the original parts of the press releases where those year lengths are mentioned (e.g., the sentences or rough region of the press release). Describe in your own words (1 sentence) what length of sentence (prison) and probation (supervised release) the CEO may be facing if convicted after this indictment (if there are multiple lengths mentioned describe the maximum). Hint: you may want to use re.search or re.findall
Output for D:
Output for E:
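A sketch of the filtering step for part D, with the spaCy entities stubbed out as `(ent.text, ent.label_)` tuples so the logic is visible without running the full pipeline (the entity values are illustrative):

```python
import re

# Illustrative stand-in for spaCy output: [(ent.text, ent.label_), ...]
ents = [("January 2020", "DATE"), ("20 years", "DATE"),
        ("one year", "DATE"), ("FDA", "ORG")]

# Keep DATE entities whose text contains "year" or "years"
year_pat = re.compile(r"\byears?\b", re.IGNORECASE)
year_ents = [text for text, label in ents
             if label == "DATE" and year_pat.search(text)]
```

With a real `doc`, you would build `ents` as `[(ent.text, ent.label_) for ent in doc.ents]`.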
A. Subset the press releases to those labeled with one of three topics via `topics_clean`: Civil Rights, Hate Crimes, and Project Safe Childhood. We'll call this `doj_subset` going forward; it should have 717 rows.
Dimensions after part A:
B. Write a function that takes one press release string as an input and:
Apply that function to each of the press releases in `doj_subset`.
Hints:
C. Add the four sentiment scores to the `doj_subset` dataframe. Sort from highest to lowest `neg` score and print the `id`, `contents`, and `neg` columns of the two most negative press releases.
Output for top 2 most negative (truncated):
D. With the dataframe from part C, find the mean compound score for each of the three topics in `topics_clean` using `groupby` and `agg`.
Output for part D:
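The group-and-aggregate step in part D might look like this, with illustrative scores standing in for the real `doj_subset`:

```python
import pandas as pd

# Illustrative doj_subset with sentiment scores already attached
doj_subset = pd.DataFrame({
    "topics_clean": ["Civil Rights", "Civil Rights",
                     "Hate Crimes", "Project Safe Childhood"],
    "compound": [0.2, 0.4, -0.6, -0.8],
})

# Mean compound score per topic via groupby + agg
mean_compound = (doj_subset.groupby("topics_clean")
                 .agg(mean_compound=("compound", "mean"))
                 .reset_index())
```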
E. Add a 1-sentence interpretation of why we might see this variation in scores (remember that compound is a standardized summary where -1 is most negative and +1 is most positive)
B. With the preprocessed press release from part A, use the part of speech tagger within nltk to tag all the words in that one press release with their part of speech.
C. Using the output from B, extract the adjectives and sort them from most to fewest occurrences. Print a dataframe with the 5 most frequent adjectives and their counts in the pharma release. See here for a list of the adjective tag names within nltk: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/
Example output from part C:
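A sketch of the counting step, assuming you already have `(word, tag)` pairs in the format `nltk.pos_tag` returns (the tagged words below are made up; JJ, JJR, and JJS are nltk's adjective tags):

```python
from collections import Counter

import pandas as pd

# Illustrative stand-in for nltk.pos_tag output on the press release
tagged = [("opioid", "JJ"), ("crisis", "NN"), ("fraudulent", "JJ"),
          ("fraudulent", "JJ"), ("large", "JJ"), ("larger", "JJR")]

# Keep adjectives, count them, and show the 5 most frequent as a dataframe
adjectives = [word for word, tag in tagged if tag in ("JJ", "JJR", "JJS")]
adj_counts = pd.DataFrame(Counter(adjectives).most_common(5),
                          columns=["adjective", "count"])
```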
You want to block on state but notice that states in `debar` are a mix of two-digit codes and full state names (e.g., GA versus Georgia), while states in `jobs` are all two-digit codes
A. Run the code below to load the `states` crosswalk (this matches state abbreviations with their full names).
B. Use that crosswalk to create a field in `debar` that has the two-digit state abbreviation for all locations (hint: you may need to first split the `City, State` string on ", ", or use str.replace to extract the state into a new column, before converting it to all abbreviations)
Hint: the GitHub issue contains the value_counts() for the cleaned two-digit states after this step
C. Use an `assert` statement to check that all the states in `debar` are two digits after the cleaning
Note: you can filter out states that are `NaN`
Concepts and resources:
- Extracting a `State` variable from the `City, State` column in `debar`
- `pd.merge` to merge the state crosswalk onto `debar` (example `head()` shown in the GitHub issue)
- `pd.merge` to only merge the rows without two-digit names onto the crosswalk with those two-digit names, then `pd.concat` to bind the dataframe back together

Output from `.head()` statement after merging on the two-digit code-state name crosswalk:
Output from `value_counts()` after converting all states to two digits:
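One possible shape for the crosswalk merge and the part C check, using a tiny illustrative crosswalk (the column names here are assumptions; match them to the real `states` crosswalk):

```python
import pandas as pd

# Illustrative crosswalk and a toy debar State column
states = pd.DataFrame({"state_full": ["GEORGIA", "MAINE"],
                       "state_abbrev": ["GA", "ME"]})
debar = pd.DataFrame({"State": ["GA", "GEORGIA", "MAINE"]})

# Left-merge on the full name; rows already two-digit get NaN abbreviations
merged = debar.merge(states, how="left",
                     left_on="State", right_on="state_full")

# Keep the abbreviation where the merge found one, else the original value
merged["state_2dig"] = merged["state_abbrev"].fillna(merged["State"])

# Part C: all cleaned states should now be two characters
assert (merged["state_2dig"].str.len() == 2).all()
```

The alternative mentioned above (merge only the non-two-digit rows, then `pd.concat` the pieces back together) avoids the `fillna` step.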
A. Write a function that:
- Takes in a single raw string from the `contents` column of that dataframe
- Does the following preprocessing steps:
- Returns a joined preprocessed string
B. Use `apply` or a list comprehension to execute that function and create a new column in the data called `processed_text`
C. Print the `id`, `contents`, and `processed_text` columns for the following press releases:
id = 16-718 (this case: https://www.seattletimes.com/nation-world/doj-miami-police-reach-settlement-in-civil-rights-case/)
id = 16-217 (this case: https://www.wlbt.com/story/32275512/three-mississippi-correctional-officers-indicted-for-inmate-assault-and-cover-up/)
Resources:
Output for those two press releases (cut off at the end, but you can get the gist):
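A minimal sketch of the kind of preprocessing function part A asks for, with an illustrative stop word list standing in for whatever list the problem set specifies (the exact steps required are listed above):

```python
import re

# Illustrative stop word list; the problem set likely uses nltk's
STOP_WORDS = {"the", "a", "in", "of", "to"}

def preprocess(raw):
    """Lowercase, strip punctuation/digits, drop stop words, rejoin."""
    text = raw.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

# Part B: list comprehension over the contents column (toy example)
contents = ["The officers were indicted in 2016."]
processed_text = [preprocess(doc) for doc in contents]
```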
A. Write fuzzy matching code (don't yet put it inside a user-defined function; you'll do that in 2.3) that computes string distances between `name_clean` in `debar` (uppercase and cleaned) and `name_clean` in `jobs` (uppercase and cleaned). You can choose which distance metric and threshold to use (feel free to set a threshold low enough to get some matches, even if that leads to some false positives). For the steps after computing distances, just take any match with a non-zero value rather than using a classifier (so skip the k-means or E-M step from the class example code).
Concepts and resources:
Hint: you may need to deduplicate records in the datasets for the recordlinkage package to work. See `drop_duplicates` within pandas and its subset argument to deduplicate using only `name_clean` and the two-digit state code
B. Print the following columns in the resulting matched dataset and comment on examples of ones that seem like true positives and ones that seem like false matches:
- `name_clean` for `jobs` and `debar`
- `state` in `debar`
- `state` in `jobs`
Output for part B: IMPORTANT NOTE - since different solutions will use different string distance functions and thresholds, don't try to replicate this exact output; instead, we show it as one example of how to structure the dataset in part B to showcase matches
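Since the exact recordlinkage code varies, here is a hedged sketch of the matching logic using `difflib.SequenceMatcher` as a stand-in string-distance metric, with toy data, blocking on state, and an arbitrary threshold:

```python
from difflib import SequenceMatcher

import pandas as pd

debar = pd.DataFrame({"name_clean": ["CISCO PRODUCE INC"], "state": ["GA"]})
jobs = pd.DataFrame({"name_clean": ["CISCO PRODUCE", "ACME FARMS LLC"],
                     "state": ["GA", "GA"]})

THRESHOLD = 0.8  # illustrative; tune to trade off false positives/negatives

matches = []
for _, d in debar.iterrows():
    for _, j in jobs.iterrows():
        if d["state"] != j["state"]:  # block on state
            continue
        sim = SequenceMatcher(None, d["name_clean"], j["name_clean"]).ratio()
        if sim >= THRESHOLD:
            matches.append({"debar_name": d["name_clean"],
                            "jobs_name": j["name_clean"],
                            "similarity": sim})
matched = pd.DataFrame(matches)
```

The recordlinkage package replaces the explicit loops with an index/compare/compute pipeline, but the threshold logic is the same.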
Then, follow the instructions in the codebook (combining `COMMITMENT_TERM` with `COMMITMENT_UNIT`) to create a standardized sentence-length-in-days column (`senlength_derived`). To simplify, you can assume that:
Print the following columns for an example of each type (e.g., an example originally in hours; an example of natural life): `COMMITMENT_TERM`, `COMMITMENT_UNIT`, `age_derived`, and your new standardized sentence length column (`senlength_derived`)
Print the summary of that column (`senlength_derived`) using the `.describe()` command
Concepts tested and resources: there are many approaches but a couple ways are:
Hint on output: see GitHub issue for the summary stats we get from running .describe()
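One possible approach to the unit conversion, with illustrative multipliers and an assumed coding for Natural Life (check the simplifying assumptions stated in the problem set and codebook):

```python
import pandas as pd

# Illustrative unit-to-days multipliers (assumptions; match the codebook)
DAYS_PER = {"Year(s)": 365, "Months": 30, "Weeks": 7, "Days": 1, "Hours": 1 / 24}
LIFE_DAYS = 365 * 100  # one possible coding of "Natural Life"

def to_days(term, unit):
    """Convert a (term, unit) pair into a sentence length in days."""
    if unit == "Natural Life":
        return LIFE_DAYS
    return float(term) * DAYS_PER[unit]

sent = pd.DataFrame({"COMMITMENT_TERM": [2, 6, 1],
                     "COMMITMENT_UNIT": ["Year(s)", "Months", "Natural Life"]})
sent["senlength_derived"] = [to_days(t, u) for t, u in
                             zip(sent["COMMITMENT_TERM"],
                                 sent["COMMITMENT_UNIT"])]
print(sent["senlength_derived"].describe())
```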
A. Going back to the preprocessed words from part 2.3.1, estimate a topic model with 3 topics, since you want to see if the unsupervised topic models recover different themes for each of the three manually-labeled areas (civil rights; hate crimes; project safe childhood). You have free rein over the other topic model parameters beyond the number of topics.
B. After estimating the topic model, print the top 15 words in each topic.
Hints and Resources:
Use the `random_state` argument within the model so that the numbering of topics does not move around between runs of your code.
Output: NOTE - yours will likely be different because it depends on (1) the parameter values within ldamodel and (2) the random state you use; do not worry about this
You notice some employers in `debar` have both the name of the company and the name of an individual, e.g.:
COUNTY FAIR FARM (COMPANY) AND ANDREW WILLIAMSON (INDIVIDUAL)*
Use the uppercase/cleaned version of `Name` in `debar`.
A. Write a regex pattern that does the following:
- Captures the pattern that occurs before COMPANY if (COMPANY) is in the string -- so in the above, it extracts COUNTY FAIR FARM
- Captures the pattern that occurs before INDIVIDUAL if (INDIVIDUAL) is also in the string -- so in the above, it extracts ANDREW WILLIAMSON (i.e., omit the "AND")
B. Test the pattern on `pos_example` and `neg_example` -- make sure the former returns a list (if using re.findall) or match object (if using re.search) with the company name and individual name separated out; make sure the latter returns no match
Hints and resources: for step A, you can use re.search, re.match, or re.findall; don't worry about matching B&R Harvesting and Paul Cruz (Individual)
Output:
pos_example = "COUNTY FAIR FARM (COMPANY) AND ANDREW WILLIAMSON (INDIVIDUAL)*"
(this corresponds to the re.findall approach; other methods could also be used)
neg_example = "CISCO PRODUCE INC"
Empty / returns no match
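One pattern that satisfies both tests (yours may differ):

```python
import re

pos_example = "COUNTY FAIR FARM (COMPANY) AND ANDREW WILLIAMSON (INDIVIDUAL)*"
neg_example = "CISCO PRODUCE INC"

# Capture text before (COMPANY), skip the AND, capture text before (INDIVIDUAL)
pattern = r"(.*?)\s*\(COMPANY\)\s*(?:AND\s*)?(.*?)\s*\(INDIVIDUAL\)"

pos_match = re.findall(pattern, pos_example)
# → [('COUNTY FAIR FARM', 'ANDREW WILLIAMSON')]
neg_match = re.findall(pattern, neg_example)
# → []
```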
The following states were the first 6 to expand the right to vote to women before the uniform federal expansion in 1920
A. Create an indicator `is_early_vote` for whether a state is in that list or not; do so without typing the state names inside the string, instead collapsing the list of states we provide and using something like `format`. Hint on how to combine the state names while preserving the quotes around each: https://stackoverflow.com/questions/12007686/join-a-list-of-strings-in-python-and-wrap-each-string-in-quotation-marks
B. Then, group by the `is_early_vote` indicator and `year` and find the percentage of states in each group where males had higher scores than females
C. Print the resulting dataframe and interpret. Does early expansion of voting seem to be correlated with girls scoring better on the math tests a century later?
Hint: in order to group by the indicator in step b, you may need to use a subquery
Output (fine for yours to be formatted differently):
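A sketch of the join-with-quotes trick and the indicator, using a stand-in state list (substitute the six states the problem provides):

```python
import pandas as pd

# Stand-in list; substitute the six early-expansion states from the problem
early_states = ["WYOMING", "COLORADO", "UTAH"]

# Join-with-quotes trick for building a SQL IN (...) clause via format
quoted = ", ".join('"{}"'.format(s) for s in early_states)
query = "SELECT * FROM naep WHERE state IN ({})".format(quoted)

# Equivalent indicator directly in pandas
naep = pd.DataFrame({"state": ["WYOMING", "TEXAS", "UTAH"]})
naep["is_early_vote"] = naep["state"].isin(early_states).astype(int)
```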
You notice that INC, CO, and LLC are sometimes followed by a period (.) but sometimes not.
A. For each dataset, write a regex pattern using `re.sub` to remove the period, but only if it is preceded by INC, LLC, or CO. Make sure LLC, INC, and CO remain part of the string, just without the dot.
B. Test the pattern on the positive and negative examples we provide below and print the result. See the GitHub issue for examples of what to return
Hint: https://stackoverflow.com/questions/7191209/python-re-sub-replace-with-matched-content
Resources:
- Regex slides: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/slides/w22_slides/06_qss20_w22_unit6_regex.pdf
- Regex activity code: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/solutions/04_basicregex_solutions.ipynb
Output:
pos_example_1 = "CISCO PRODUCE INC."
pos_example_2 = "AVOYELLES HONEY CO., LLC"
neg_example = "E.V. RANCH LLP"
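One `re.sub` pattern that passes all three examples; the backreference keeps the matched abbreviation and drops only the dot:

```python
import re

pos_example_1 = "CISCO PRODUCE INC."
pos_example_2 = "AVOYELLES HONEY CO., LLC"
neg_example = "E.V. RANCH LLP"

def strip_dot(name):
    """Drop a '.' only when it directly follows INC, LLC, or CO as a word."""
    return re.sub(r"\b(INC|LLC|CO)\.", r"\1", name)
```

The `\b` word boundary keeps the pattern from firing inside other words (e.g., the CO inside CISCO), and "E.V." is untouched because V is not in the alternation.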
Write a user-defined function (`process_naep_query`) that takes in the NAEP query as a string, calls the API, and transforms the response into a pandas dataframe. Have the function return that pandas dataframe.
Make sure the function is flexible enough to handle queries that return an error; for queries that return an error, have the function return the string "Data not found; check your query".
- Execute the function on the query that pulls 2013-2019 data (either from handwriting the query or from the result in 1.4)
- Print the resulting dataframe
- Then execute the function on a query that pulls a state that doesn't exist (call this state ZZ)
**Output from correct query**: same as output from 1.2
Output from incorrect query:
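A hedged sketch of `process_naep_query`. The injectable `fetch` parameter is an added assumption so the function can be exercised without a network call, and the `"result"` key is a guess at the response schema; check the actual NAEP API documentation:

```python
import json
import urllib.request

import pandas as pd

def process_naep_query(query_url, fetch=None):
    """Call the NAEP API and return the response as a pandas DataFrame.

    Returns "Data not found; check your query" if the call or parse fails.
    Assumes the response JSON holds its records under a "result" key.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url) as resp:
                return resp.read().decode("utf-8")
    try:
        payload = json.loads(fetch(query_url))
        return pd.DataFrame(payload["result"])
    except Exception:
        return "Data not found; check your query"
```

Calling it on the 2013-2019 query URL returns a dataframe; calling it on a query with a nonexistent state (ZZ) hits the error branch and returns the message string.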
`census` package (0 points)
Resources: the `census` package documentation on establishing an API connection: https://github.com/datamade/census
What a successful API connection looks like: