The biocong from soodoku


Congressional Biographies

97th to 104th Congress

We extract the text from the PDFs (downloaded from Google Books, where they are freely available) and then parse it.

Scripts

parse
clean

105th to 115th Congress

We scrape congressional biographies for the 105th to the 115th Congress from the Congressional Directory. We download the biographical files (e.g., https://www.govinfo.gov/content/pkg/CDIR-2018-10-29/html/CDIR-2018-10-29-STATISTICALINFORMATION-2.htm) and parse them to extract information such as birthdate, number of children, and education.
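The extraction step can be illustrated with a minimal sketch. It assumes a biography is a single string of semicolon-separated clauses such as "born in ..." and "children: ..."; that layout, the function name `extract_fields`, and the regular expression are assumptions made for illustration, not the repository's actual parsing code.

```python
import re

# Month names used to recognise a date like "March 3, 1950".
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Assumed pattern: "born in <place>, <Month> <day>, <year>" inside one clause.
BIRTHDATE_RE = re.compile(rf"born in [^;]*?((?:{MONTHS}) \d{{1,2}}, \d{{4}})")


def extract_fields(biography: str) -> dict:
    """Pull a few fields out of one biography string (hypothetical helper)."""
    fields = {"birthdate": None, "children": None, "n_children": None}

    match = BIRTHDATE_RE.search(biography)
    if match:
        fields["birthdate"] = match.group(1)

    # Look for a clause that starts with "children:" and count the names in it.
    for clause in biography.split(";"):
        clause = clause.strip()
        if clause.lower().startswith("children:"):
            names = clause.split(":", 1)[1].strip()
            parts = [p.strip() for p in re.split(r",|\band\b", names) if p.strip()]
            fields["children"] = names
            fields["n_children"] = len(parts)
    return fields
```

Real biographies vary a good deal (for example, "three children" instead of a list of names), so a production parser would need more patterns than this sketch covers.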

Scripts

Scrapes the Congressional Directory: produces biocong.csv, biocong-browsepath.csv, and the HTML files (tar.gz)
Download Congressional Biographies Using the API: a script for downloading the data using the API. (It produces incomplete data, so we don't use this script.)
Parse: iterates through biocong-browsepath.csv, parses the HTML files (tar.gz), and produces biocong-parsed.csv
Clean: takes biocong-parsed.csv and produces biocong-cleaned.csv
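As a rough sketch of the parse step, the snippet below walks an index CSV, reads the matching HTML file out of a tar.gz archive, and reduces it to plain text using only the standard library. It assumes the `htmlfile` column holds the member name inside the archive; that mapping, and the helper names `html_to_text` and `iter_biography_text`, are assumptions for illustration rather than the repository's own code.

```python
import csv
import io
import tarfile
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Strip tags and collapse runs of whitespace into single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def iter_biography_text(index_csv, archive_path):
    """Yield (index row, plain text) for each row whose HTML file is in the archive."""
    with tarfile.open(archive_path, "r:gz") as archive, \
            open(index_csv, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            try:
                # Assumption: `htmlfile` is the member name inside the archive.
                member = archive.extractfile(row["htmlfile"])
            except KeyError:
                continue  # file listed in the index but missing from the archive
            if member is None:
                continue  # not a regular file
            yield row, html_to_text(member.read().decode("utf-8", errors="replace"))
```

Skipping missing members keeps a long run from aborting on one bad row; whether to skip or fail loudly is a design choice that depends on how complete the archive is expected to be.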

Data

The final dataset, biocong-cleaned.csv, has the following columns:

'level', 'docCount', 'browsePath', 'title', 'lastpage', 'granuleid', 'packageid', 'pdffile', 'pdf', 'text',
 'agencyLevel', 'nodeStatus', 'textfile', 'htmlfile', 'browseline1', 'processingcode', 'nodetype', 'index.1', 
 'publishdate', 'part', 'forGpo', 'hasChildren', 'hasParents', 'rootNode', 'documentResults', 'hasDocumentResults',
 'collectionCode', 'searchPath', 'isContentArea', 'pageSize', 'pageNumber', 'count', 'digitizedFR', 'section',
 'firstpage', 'congress', 'biography', 'name', 'party', 'location', 'born_in', 'birthdate', 'education', 'professional', 
 'married', 'children', 'committees', 'url', 'n_children'
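A short sketch of how the columns above might be used: the function below averages `n_children` by `congress` and `party` using only the standard library. It assumes `n_children` is either a number or blank in the CSV; the function name `mean_children_by_party` is hypothetical.

```python
import csv
from collections import defaultdict


def mean_children_by_party(rows):
    """Average n_children per (congress, party); rows are dicts keyed by column name."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for row in rows:
        value = (row.get("n_children") or "").strip()
        try:
            n = float(value)
        except ValueError:
            continue  # blank or non-numeric: skip the row
        if n != n:
            continue  # the literal string "nan" parses as NaN: skip it too
        key = (row.get("congress"), row.get("party"))
        totals[key][0] += n
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}
```

To run it on the real file, pass `csv.DictReader(open("biocong-cleaned.csv", newline="", encoding="utf-8"))` as `rows`.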
