This repository extracts the bill similarity functions from the BillMap project. Both are open source, under the Public Domain License.
A separate repository (github.com/aih/bills) contains tools in Go to process bill data.
This repository can be installed as a Python package directly from GitHub:
pip install -e git+https://github.com/aih/billsim.git#egg=billsim-aih
Then it can be imported with import billsim.
The repository can also be cloned with git clone https://github.com/aih/billsim.git.
## Installation (quick): from a local repository
- Clone the repository to a directory called billsim:
  git clone https://github.com/aih/billsim.git
- Create a virtual environment with Python >=3.8. For example, with pyenv virtualenv:
  pyenv virtualenv 3.9.12 billsim
- Activate the virtual environment:
  pyenv activate billsim
- Install the dependencies
- Run:
  python setup.py install
To install BillSim, the following steps are suggested.
If you are running this on an M1 Mac, it is suggested to first install brew on arch in the Users folder.
Next, run brew install pyenv-virtualenv to install pyenv-virtualenv. It sets up a virtual environment for Python, which helps ensure that architecture errors do not occur when installing certain packages.
The following is a good set of instructions for quickly setting up a virtual environment with pyenv-virtualenv: https://towardsdatascience.com/python-how-to-create-a-clean-learning-environment-with-pyenv-pyenv-virtualenv-pipx-ed17fbd9b790
A more detailed guide can be found here: https://realpython.com/intro-to-pyenv/
You may get the following error when trying to set up the virtual environment:
Failed to activate virtualenv.
Perhaps pyenv-virtualenv has not been loaded into your shell properly.
Please restart current shell and try again.
To remedy this, run the following to ensure the Python environment can be activated properly:
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"
Finally, to check that everything works, run pytest. One caveat is that a globally installed pytest runs on the default global Python version, so the tests may run on a version of Python you are not using.
To avoid this, install pytest in the virtual environment and run it with python -m pytest, which ensures that the Python from the virtual environment, rather than the global environment, runs the tests.
This Stack Overflow question explains the issue: https://stackoverflow.com/questions/40718770/pytest-running-with-another-version-of-python
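To confirm which interpreter is actually in use, a small check like the one below can be run before the tests. It is a minimal sketch: the function names are illustrative and not part of billsim, and it relies only on the standard library's sys.prefix and sys.base_prefix, which differ inside a virtual environment.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv/virtualenv, sys.prefix points at the environment, while
    # sys.base_prefix points at the interpreter the environment was created from.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def describe_interpreter() -> str:
    """Summarize the Python version, executable path and virtualenv status."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    return f"Python {version} at {sys.executable} (virtualenv: {in_virtualenv()})"


print(describe_interpreter())
```

If the printed path is not inside the billsim environment, the tests are probably running on the global interpreter.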
The default functions assume that the bill XML files are in a CONGRESS_DATA directory. The absolute path to CONGRESS_DATA must be defined as PATH_TO_CONGRESS_DATA in environment variables, or set in the .env file inside the billsim directory. It should be a path to a directory of the form [abspath]/data/congress/117 (for the 117th Congress). See .env-sample for an example.
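As a rough illustration of the requirement above, the variable could be read and validated as follows. This is a sketch only: resolve_congress_data_path is a hypothetical helper, not a billsim function, and it looks only at the process environment (it does not parse the .env file).

```python
import os
from pathlib import Path


def resolve_congress_data_path(env=None) -> Path:
    """Return PATH_TO_CONGRESS_DATA as a Path, or raise if it is unusable."""
    # Hypothetical helper: accepts a mapping so it can be exercised without
    # touching the real environment.
    env = os.environ if env is None else env
    raw = env.get("PATH_TO_CONGRESS_DATA")
    if not raw:
        raise RuntimeError("PATH_TO_CONGRESS_DATA is not set")
    path = Path(raw)
    if not path.is_absolute():
        raise ValueError(f"PATH_TO_CONGRESS_DATA must be absolute, got {raw!r}")
    return path
```

Calling resolve_congress_data_path() with no arguments reads the real environment and fails early, with a readable message, when the variable is missing or relative.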
The 'PATHTYPE_DEFAULT' sets the expected hierarchy structure. The 'congressdotgov' structure is /116/bills/hr1818/BILLS-116hr1818ih.xml, while the 'unitedstates' pathtype structure follows the hierarchy that is created by the scraper in github.com/unitedstates/congress: 116/bills/hr/hr1818/text-versions/ih/BILLS-116hr1818ih.xml.
Then you can run the following command to process all the bill XML files in the directory:
>>> from billsim.utils import getBillXmlPaths
>>> z=getBillXmlPaths()
>>> z[30]
BillPath(billnumber_version='116hr1818ih', path='[abspath]/billsim/data/congress/116/bills/hr1818/BILLS-116hr1818ih.xml', fileName='BILLS-116hr1818ih.xml')
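The two directory layouts described above can be sketched in a few lines. This is an illustration under assumptions: bill_xml_path and the regular expression are hypothetical and not part of billsim, and the paths are returned relative to the data root.

```python
import re
from pathlib import PurePosixPath

# Assumed shape of a billnumber_version: congress, bill type, number, version,
# e.g. '116hr1818ih' -> ('116', 'hr', '1818', 'ih').
BILL_RE = re.compile(
    r"^(?P<congress>\d+)(?P<type>[a-z]+)(?P<number>\d+)(?P<version>[a-z]+\d*)$"
)


def bill_xml_path(billnumber_version: str, pathtype: str = "congressdotgov") -> PurePosixPath:
    """Build the relative XML path for a bill under the given layout."""
    match = BILL_RE.match(billnumber_version)
    if match is None:
        raise ValueError(f"Unrecognized billnumber_version: {billnumber_version!r}")
    congress, billtype, number, version = match.group("congress", "type", "number", "version")
    billnumber = f"{billtype}{number}"
    filename = f"BILLS-{billnumber_version}.xml"
    if pathtype == "congressdotgov":
        return PurePosixPath(congress, "bills", billnumber, filename)
    if pathtype == "unitedstates":
        return PurePosixPath(
            congress, "bills", billtype, billnumber, "text-versions", version, filename
        )
    raise ValueError(f"Unknown pathtype: {pathtype!r}")


print(bill_xml_path("116hr1818ih"))
print(bill_xml_path("116hr1818ih", pathtype="unitedstates"))
```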
Install Elasticsearch (>=7.0.0 and <=7.10.2). This can be done from docker as follows:
$ docker pull docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2
Note: Using podman instead of docker on macOS works better for me. Note also that the docker container memory may be limited, slowing down Elasticsearch processes. In production, the docker-compose should set enough memory to ensure performance (-m=4g sets the max memory to 4 GB).
Disk watermark settings for Elasticsearch (set in Kibana). These are necessary; without them, ES returns errors during high-volume indexing and processing of documents:
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "100gb",
    "cluster.routing.allocation.disk.watermark.high": "50gb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "10gb",
    "cluster.info.update.interval": "1m"
  }
}
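If Kibana is not available, the same request can be sent directly to the cluster settings endpoint. The sketch below uses only the standard library; the function names and the default base URL (http://localhost:9200, matching the port published by the docker command in this section) are assumptions, and apply_settings needs a running server.

```python
import json
import urllib.request

# Same transient settings as in the Kibana request above.
WATERMARK_SETTINGS = {
    "transient": {
        "cluster.routing.allocation.disk.watermark.low": "100gb",
        "cluster.routing.allocation.disk.watermark.high": "50gb",
        "cluster.routing.allocation.disk.watermark.flood_stage": "10gb",
        "cluster.info.update.interval": "1m",
    }
}


def build_settings_request(base_url: str = "http://localhost:9200") -> urllib.request.Request:
    """Build (but do not send) the PUT _cluster/settings request."""
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=json.dumps(WATERMARK_SETTINGS).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def apply_settings(base_url: str = "http://localhost:9200") -> str:
    """Send the request to a running Elasticsearch server and return the response body."""
    with urllib.request.urlopen(build_settings_request(base_url), timeout=10) as response:
        return response.read().decode("utf-8")
```

Call apply_settings() once the container is up; transient settings are lost when the cluster restarts, so the call has to be repeated after a restart.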
$ docker run -m=4g -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2 &
Then, in a virtualenv with all requirements installed, run the following commands in Python from the src directory to create the index:
>>> from billsim.elastic_load import initializeBillSectionsIndex
>>> initializeBillSectionsIndex()
This will gather all of the bill paths in the directory specified in .env and create an Elasticsearch index with the name specified in .env (or the default in constants). Creating the index takes approximately 5 minutes per Congress directory on a reasonably fast server (e.g. 16 GB RAM, 3 GHz), without any concurrent processing or other optimizations.
Note: This will not delete the index if it already exists. To do so, and start over, pass delete=True to billsim.elastic_load.createIndex or delete_index=True to billsim.elastic_load.initializeBillSectionsIndex.
Note: The Elasticsearch versions after 7.10.2 are forked between the full 'OSS' version and a more restrictive license (as a challenge to cloud services like AWS). The Python client library must match the version of the Elasticsearch server.
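The version constraints mentioned in this section can be checked with a small helper. This is a sketch under assumptions: the function names are illustrative, 7.10.2 is treated as the inclusive upper bound because the Docker image above uses that version, and "match" is interpreted here as the same major and minor version, which is an assumption rather than a documented rule.

```python
def parse_version(text: str) -> tuple:
    """Turn '7.10.2' into (7, 10, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def is_supported_server(version: str, minimum: str = "7.0.0", maximum: str = "7.10.2") -> bool:
    """Check that a server version lies in the inclusive supported range."""
    return parse_version(minimum) <= parse_version(version) <= parse_version(maximum)


def client_matches_server(client_version: str, server_version: str) -> bool:
    """Assumed rule: client and server share the same major and minor version."""
    return parse_version(client_version)[:2] == parse_version(server_version)[:2]


print(is_supported_server("7.10.2"))
print(is_supported_server("7.11.0"))
```

Comparing tuples of integers avoids the string-comparison trap in which "7.9.3" sorts after "7.10.2".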
- Install the elasticdump command-line application with npm:
  npm install -g elasticdump
- Store the billsim and bill_full indices to .gz files:
  elasticdump --input=http://localhost:9200/billsim --output=$ | gzip > ./elasticdump.billsim.json.gz
  elasticdump --input=http://localhost:9200/bill_full --output=$ | gzip > ./elasticdump.bill_full.json.gz
- Import data from .json:
  - Unzip the .json.gz files:
    gzip -d elasticdump.billsim.json.gz
    gzip -d elasticdump.bill_full.json.gz
  - Restore data to Elasticsearch:
    elasticdump \
    --input "elasticdump.billsim.json" \
    --output=http://localhost:9200/billsim --limit=50 \
    --ignore-errors true
    elasticdump \
    --input "elasticdump.bill_full.json" \
    --output=http://localhost:9200/bill_full --limit=50 \
    --ignore-errors true
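Before restoring, it can be useful to sanity-check a dump file. The sketch below counts the documents per index in a gzipped dump. It assumes that each non-empty line of the dump is one JSON object with an "_index" field, which is how elasticdump data dumps are usually laid out; verify this against your own files. count_documents is a hypothetical helper, not part of billsim or elasticdump.

```python
import gzip
import json
from collections import Counter


def count_documents(dump_path: str) -> Counter:
    """Count documents per index in a gzipped, line-delimited JSON dump."""
    counts = Counter()
    with gzip.open(dump_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Assumption: each record carries the name of its source index.
            counts[record.get("_index", "unknown")] += 1
    return counts
```

For example, count_documents("elasticdump.billsim.json.gz") returns a Counter keyed by index name, which can be compared with the document count reported by Elasticsearch.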
To run the Elasticsearch similarity algorithm, followed by a function to calculate similarity scores between pairs of bills, run the following command from the command line:
$ python compare.py $MAX_BILLS_TO_COMPARE
Where $MAX_BILLS_TO_COMPARE is the maximum number of bills to compare (chosen randomly from the bill XML files). If no value is passed, the default is all bills.
Note: Running 995 bills this way took ~700 minutes on my machine (16 GB RAM, 2.9 GHz), an average of 41.7 seconds per bill.
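The timing in the note gives a rough way to budget a run. The sketch below simply scales the reported average of 41.7 seconds per bill. It assumes runtime grows linearly with the number of bills and that your hardware is comparable, neither of which is guaranteed.

```python
SECONDS_PER_BILL = 41.7  # average reported in the note above


def estimated_minutes(bill_count: int, seconds_per_bill: float = SECONDS_PER_BILL) -> float:
    """Estimate total runtime in minutes, assuming linear scaling."""
    return bill_count * seconds_per_bill / 60


# 995 bills at 41.7 s each is about 691.5 minutes, in line with the ~700 minutes reported.
print(round(estimated_minutes(995), 1))
```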
The bill_similarity.py script includes functions to find similar bills by billnumber and version. The default functions assume that the bill XML files are in a directory three levels up from the bill_similarity.py file, of the form congress/data/. The default data directory can also be set in a .env file.
Then you can run the following command to find and save similar bills (the bill itself should be found as the first result):
>>> from billsim.compare import processSimilarBills
>>> processSimilarBills('116hr1818ih')
OR for many bills:
>>> from billsim.compare import processSimilarBills
>>> billnumber_versions=['116hr133enr', '115hr4275ih', '117s235is', '117hr4459ih', '117hr4350ih', '117s2766is', '117hr5466ih', '116hr8939ih', '116s160is', '117s2685is', '117hr4041ih', '116hr2812ih', '116hr2709ih', '117s2812is', '116sres178is', '116hres391ih']
>>> for billnumber_version in billnumber_versions:
...     processSimilarBills(billnumber_version)
# Additional bills
# ['117hres158ih', '117hr1768ih', '117hres318ih','117sres356is', '117s2563is', '117s1816is', '117s1588is', '117hr1992ih', '117s2685is', '116sres178is']
or for all bills in one Congress:
>>> from billsim.utils import getBillXmlPaths
>>> billnumberversions117 = [billPath.billnumber_version for billPath in getBillXmlPaths(congresses=[117])]
>>> for billnumber_version in billnumberversions117:
...     processSimilarBills(billnumber_version)
The processSimilarBills function is the equivalent of the following:
>>> from billsim.bill_similarity import getSimilarBillSections, getBillToBill
>>> from billsim.utils_db import save_bill_to_bill, save_bill_to_bill_sections
>>> s = getSimilarBillSections('116hr200ih')
>>> b2b = getBillToBill(s)
>>> b2b
{'116hr200ih': BillToBillModel(id=None, billnumber_version='116hr200ih', length=7313, length_to=None, score_es=190.614846, score=None, score_to=None, reasons=None, billnumber_version_to='116hr200ih', identified_by=None, title=None, title_to=None, sections=[Section(billnumber_version='116hr200ih', section_id='HE90F34DBB44149C6B9BBD6747EB6F645', label='2.', header='Border wall trust fund', length=None, similar_sections=[SimilarSection(billnumber_version='116hr200ih', section_id='HE90F34DBB44149C6B9BBD6747EB6F645', label='2.', header='Border wall trust fund', length=1264, score_es=97.936806, score=None, score_to=None)]), Section(bill...
>>> for bill in b2b:
...     save_bill_to_bill(b2b[bill])
...     save_bill_to_bill_sections(b2b[bill]) # This should save the individual sections and the sections to section mapping
# Get similarity scores for bill-to-bill
>>> similar_bills=b2b.keys()
# Calls comparematrix from bills (Golang);
# The compiled executable is in the `bin` directory.
>>> from billsim.compare import getCompareMatrix
>>> c = getCompareMatrix(similar_bills)
>>> c[0][0]
{'Score': 1, 'ScoreOther': 1, 'Explanation': 'bills-identical', 'ComparedDocs': '116hr222ih-116hr222ih'}
>>> c[0][1]
{'Score': 0.86, 'ScoreOther': 0.86, 'Explanation': 'bills-nearly_identical', 'ComparedDocs': '116hr222ih-115hr198ih'}
>>> from billsim.pymodels import BillToBillModel
>>> for row in c:
...     for column in row:
...         bill, bill_to = column['ComparedDocs'].split('-')
...         if bill and bill_to:
...             b2bModel = BillToBillModel(billnumber_version=bill, billnumber_version_to=bill_to, score=column['Score'], score_to=column['ScoreOther'], reasons=[column['Explanation']])
...             save_bill_to_bill(b2bModel)
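To inspect a comparison matrix without writing to the database, the cells can be flattened and ranked. The following is a minimal sketch: rank_matches is a hypothetical helper, not part of billsim, and it assumes each cell has the 'Score', 'Explanation' and 'ComparedDocs' keys shown in the output above.

```python
def rank_matches(matrix, min_score=0.0):
    """Flatten a compare matrix into (bill, bill_to, score, explanation) tuples,
    dropping self-comparisons and sorting by score, highest first."""
    matches = []
    for row in matrix:
        for cell in row:
            bill, _, bill_to = cell["ComparedDocs"].partition("-")
            if not bill or not bill_to or bill == bill_to:
                continue  # skip malformed cells and a bill compared with itself
            if cell["Score"] >= min_score:
                matches.append((bill, bill_to, cell["Score"], cell["Explanation"]))
    return sorted(matches, key=lambda match: match[2], reverse=True)


# Sample cells copied from the output shown above.
sample = [[
    {"Score": 1, "ScoreOther": 1, "Explanation": "bills-identical",
     "ComparedDocs": "116hr222ih-116hr222ih"},
    {"Score": 0.86, "ScoreOther": 0.86, "Explanation": "bills-nearly_identical",
     "ComparedDocs": "116hr222ih-115hr198ih"},
]]
print(rank_matches(sample))
# [('116hr222ih', '115hr198ih', 0.86, 'bills-nearly_identical')]
```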
To find similar bills from ES, without reference to the file system, use the getSimilarBillSections_es function.
Tests, built with pytest, are found in the tests directory. To run the tests, run make (requires cmake and pytest installed) or run pytest -rs tests directly.
The tests use the pytest-order plugin. See https://pytest-dev.github.io/pytest-order/dev/
While you can run this script locally, a Dockerfile is provided as an alternative. To run it, do the following:
docker build -t billsim .
docker run billsim -m INSERT OTHER ARGS HERE
$ mkdir -p $HOME/docker/volumes/postgres
$ docker run --rm --name pg-docker -e POSTGRES_PASSWORD=$POSTGRES_PW -d -p 5432:5432 -v $HOME/docker/volumes/postgres:/var/lib/postgresql/data postgres:alpine
Create a local postgres user:
createuser -s postgres
Install the tables:
$ python pymodels.py
2021-12-11 15:48:29,657 INFO sqlalchemy.engine.Engine select pg_catalog.version()
...
CREATE TABLE bill (
id SERIAL,
length INTEGER,
billnumber VARCHAR NOT NULL,
version VARCHAR NOT NULL,
PRIMARY KEY (id),
CONSTRAINT billnumber_version UNIQUE (billnumber, version)
)
...
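The UNIQUE (billnumber, version) constraint in the table above means that each bill version can be stored only once. The sketch below illustrates that behavior using the standard library's sqlite3 as a stand-in for PostgreSQL. The schema is adapted (SERIAL becomes INTEGER PRIMARY KEY), insert_bill is a hypothetical helper, and the sample values are illustrative only.

```python
import sqlite3

# SQLite adaptation of the bill table; not the schema that pymodels.py emits.
SCHEMA = """
CREATE TABLE bill (
    id INTEGER PRIMARY KEY,
    length INTEGER,
    billnumber VARCHAR NOT NULL,
    version VARCHAR NOT NULL,
    CONSTRAINT billnumber_version UNIQUE (billnumber, version)
)
"""


def insert_bill(conn, billnumber, version, length=None) -> bool:
    """Insert a bill row; return False if (billnumber, version) already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO bill (billnumber, version, length) VALUES (?, ?, ?)",
                (billnumber, version, length),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
print(insert_bill(conn, "116hr200", "ih", 7313))  # first insert succeeds
print(insert_bill(conn, "116hr200", "ih", 7313))  # duplicate is rejected
```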
To access the database from the command line:
psql postgresql://postgres:$POSTGRES_PW@localhost:5432/postgres
To run pgadmin4 from docker:
docker run -p 5050:80 -e "PGADMIN_DEFAULT_EMAIL=[email protected]" -e "PGADMIN_DEFAULT_PASSWORD=a12345678" -d dpage/pgadmin4
The admin panel is available at http://localhost:5050/
The database is backed up with:
pg_dump billsim > billsim-bk.sql
Or, without ownership and privilege statements (-O -x), and gzipped:
pg_dump billsim -O -x | gzip -9 > billsim-bk.sql.gz
Or, from the url:
pg_dump postgresql://postgres:postgres@localhost -O -x | gzip -9 > billsim-bk.sql.gz