We download the PDFs from Google Books, where they are freely available, and parse the text extracted from them.
We scrape congressional biographies for the 105th through the 115th Congress from the Congressional Directory. We download the biographical files (e.g., https://www.govinfo.gov/content/pkg/CDIR-2018-10-29/html/CDIR-2018-10-29-STATISTICALINFORMATION-2.htm) and parse them to extract information such as birthdate, number of children, and education.
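The example URL above suggests how a file address is assembled from the package and granule identifiers. The sketch below is a hypothetical helper, not the project's scraper: the names `build_html_url` and `fetch_html` are made up for illustration, and the URL pattern is inferred from the single example URL rather than from the scraping script itself.

```python
from urllib.request import urlopen

# Base path taken from the example URL in the text above.
GOVINFO_BASE = "https://www.govinfo.gov/content/pkg"


def build_html_url(packageid, granuleid):
    """Build the HTML address for one granule of a Congressional Directory package.

    Assumption: the pattern <base>/<packageid>/html/<granuleid>.htm holds for
    every file; it is inferred from one example URL only.
    """
    return f"{GOVINFO_BASE}/{packageid}/html/{granuleid}.htm"


def fetch_html(packageid, granuleid, timeout=30):
    """Download one HTML file and return it as text (needs network access)."""
    with urlopen(build_html_url(packageid, granuleid), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

For instance, `build_html_url("CDIR-2018-10-29", "CDIR-2018-10-29-STATISTICALINFORMATION-2")` reproduces the example address given above.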
- Scrapes the Congressional Directory and produces biocong.csv, biocong-browsepath.csv, and the HTML files (tar.gz)
- Download Congressional Biographies Using the API provides the script for downloading the data using the API. (It produces incomplete data so we don't use this script.)
- Parse iterates through biocong-browsepath.csv, parses the HTML files (tar.gz), and produces biocong-parsed.csv
- Clean takes biocong-parsed.csv and produces biocong-cleaned.csv
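
The parse step can be pictured roughly as follows. This is a minimal sketch using only the standard library, not the actual Parse script. It assumes that the index file has an `htmlfile` column whose values match the member names inside the tar.gz archive; the function names and the `raw_text` key are hypothetical. Extracting fields such as birthdate or education from the text is left out because the rules for that live in the real script.

```python
import csv
import tarfile
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Return the text of an HTML string with runs of whitespace collapsed.

    Text nodes are joined with a space, so a word split by an inline tag
    would be broken in two; good enough for a sketch.
    """
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def parse_archive(index_file, archive_path, html_column="htmlfile"):
    """Yield one dict per index row, with the text of its HTML file added.

    index_file:   open text file for the index (e.g. biocong-browsepath.csv)
    archive_path: path to the tar.gz holding the HTML files
    html_column:  assumed name of the column with the archive member name
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        members = set(tar.getnames())
        for row in csv.DictReader(index_file):
            name = row.get(html_column, "")
            if name not in members:
                continue  # no downloaded HTML file for this row
            handle = tar.extractfile(name)
            if handle is None:
                continue  # member is not a regular file
            raw = handle.read().decode("utf-8", errors="replace")
            yield {**row, "raw_text": html_to_text(raw)}
```

Because `parse_archive` is a generator that keeps the archive open while it runs, consume it fully (for example with `list(...)`) before doing anything else with the archive.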
The final dataset, biocong-cleaned.csv, has the following columns:
'level', 'docCount', 'browsePath', 'title', 'lastpage', 'granuleid', 'packageid', 'pdffile', 'pdf', 'text',
'agencyLevel', 'nodeStatus', 'textfile', 'htmlfile', 'browseline1', 'processingcode', 'nodetype', 'index.1',
'publishdate', 'part', 'forGpo', 'hasChildren', 'hasParents', 'rootNode', 'documentResults', 'hasDocumentResults',
'collectionCode', 'searchPath', 'isContentArea', 'pageSize', 'pageNumber', 'count', 'digitizedFR', 'section',
'firstpage', 'congress', 'biography', 'name', 'party', 'location', 'born_in', 'birthdate', 'education', 'professional',
'married', 'children', 'committees', 'url', 'n_children'
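
Many of these columns look like metadata carried over from the scrape, while the columns from `congress` onward appear to hold the parsed biography fields. The sketch below shows one way to load the file and keep only those fields. The split between metadata and biography columns is an assumption based on the column names, and `load_biographies` is a hypothetical helper.

```python
import csv

# Assumed biography columns, copied from the end of the column list above.
BIO_COLUMNS = [
    "congress", "biography", "name", "party", "location", "born_in",
    "birthdate", "education", "professional", "married", "children",
    "committees", "url", "n_children",
]


def load_biographies(csv_file, columns=BIO_COLUMNS):
    """Read an open CSV file and return a list of dicts with only `columns`.

    Raises KeyError if any requested column is missing from the header.
    All values are returned as strings, as read by the csv module.
    """
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [c for c in columns if c not in header]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return [{c: row[c] for c in columns} for row in reader]
```

A typical call would be `with open("biocong-cleaned.csv", newline="", encoding="utf-8") as f: rows = load_biographies(f)`; the encoding is a guess and may need adjusting.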