BillSim: Utilities to process similarity of bills

Table of Contents

Install
- As a python package
Installation (m1 mac)
Processing bill XML files
Creating Elasticsearch index of bill sections
- Elasticsearch backup
Find similar bills
- Bill similarity functions with Elasticsearch
Build and test
Run with Docker
Run with Postgres (docker)
Postgres back-up

This repository extracts the bill similarity functions from the BillMap project. Both are open source, under the Public Domain License.

A separate repository (github.com/aih/bills) contains tools in Go to process bill data.

Install

As a python package

This repository can be installed as a Python package directly from GitHub: pip install -e git+https://github.com/aih/billsim.git#egg=billsim-aih

Then it can be imported as import billsim

The repository can also be cloned with git clone https://github.com/aih/billsim.git and installed from the local copy.

Installation (quick)

From a local repository:

Clone the repository to a directory called billsim: git clone https://github.com/aih/billsim.git
Create a virtual environment with Python >=3.8. For example, with pyenv-virtualenv: pyenv virtualenv 3.9.12 billsim
Activate the virtual environment: pyenv activate billsim
Install the dependencies.
Run python setup.py install

Installation (m1 mac)

To install BillSim on an M1 Mac, the following steps are suggested.

First, follow the instructions below to install Homebrew for the ARM architecture in the Users folder.

https://stackoverflow.com/questions/64963370/error-cannot-install-in-homebrew-on-arm-processor-in-intel-default-prefix-usr

Next, run brew install pyenv-virtualenv. pyenv-virtualenv sets up a virtual environment for Python, which helps avoid architecture errors when installing certain packages.

The following is a good set of instructions for quickly setting up a virtual environment with pyenv-virtualenv: https://towardsdatascience.com/python-how-to-create-a-clean-learning-environment-with-pyenv-pyenv-virtualenv-pipx-ed17fbd9b790

A more detailed guide can be found here: https://realpython.com/intro-to-pyenv/

You may get the following error when trying to set up the virtual environment:

Failed to activate virtualenv.

Perhaps pyenv-virtualenv has not been loaded into your shell properly.
Please restart current shell and try again.

To remedy this, run the following to ensure the Python environment can be activated properly:

eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"

Finally, to check that everything works, run pytest. Note that a bare pytest command may resolve to the default global installation, so the tests can end up running on a version of Python you are not using.

To solve this, install pytest in the virtual environment and invoke it as python -m pytest, so that the tests run with the virtual environment's Python rather than the global one.

This Stack Overflow question explains the issue: https://stackoverflow.com/questions/40718770/pytest-running-with-another-version-of-python
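As a quick check of which interpreter is actually in use, a small helper along the following lines can be run (or imported from a test). This is only a sketch: the function name in_virtualenv is made up for illustration and is not part of billsim.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv/virtualenv, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


# Show the interpreter path and whether it lives in a virtual environment.
print("interpreter:", sys.executable)
print("in virtualenv:", in_virtualenv())
```

If the printed interpreter path is not inside the billsim environment, the tests are probably running under the global Python.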

Processing bill XML files

The default functions assume that the bill XML files are in a CONGRESS_DATA directory. The absolute path to CONGRESS_DATA must be defined as PATH_TO_CONGRESS_DATA in environment variables, or set in the .env file, inside the billsim directory. It should be a path to a directory of the form [abspath]/data/congress/117 (for the 117th Congress). See .env-sample for an example.

The 'PATHTYPE_DEFAULT' sets the expected hierarchy structure. The 'congressdotgov' structure is /116/bills/hr1818/BILLS-116hr1818ih.xml, while the unitedstates pathtype structure follows the hierarchy that is created by the scraper in github.com/unitedstates/congress: 116/bills/hr/hr1818/text-versions/ih/BILLS-116hr1818ih.xml.

Then you can run the following to collect the paths of all the bill XML files in the directory:

>>> from billsim.utils import getBillXmlPaths
>>> z=getBillXmlPaths()
>>> z[30]
BillPath(billnumber_version='116hr1818ih', path='[abspath]/billsim/data/congress/116/bills/hr1818/BILLS-116hr1818ih.xml', fileName='BILLS-116hr1818ih.xml')
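
The two path layouts described above can be illustrated with a short sketch. It is not the code billsim uses internally: the function bill_xml_relpath and the regular expression are assumptions made for illustration, based only on the example paths in this README.

```python
import re
from pathlib import PurePosixPath

# Splits an identifier such as '116hr1818ih' into congress, bill type,
# bill number and version. The pattern is inferred from the examples here.
BILL_RE = re.compile(
    r"^(?P<congress>\d+)(?P<type>[a-z]+)(?P<number>\d+)(?P<version>[a-z]+)$"
)


def bill_xml_relpath(billnumber_version: str,
                     pathtype: str = "congressdotgov") -> PurePosixPath:
    """Build the path of a bill XML file relative to the data directory."""
    match = BILL_RE.match(billnumber_version)
    if match is None:
        raise ValueError(f"Unrecognized billnumber_version: {billnumber_version!r}")
    congress, billtype, number, version = match.group(
        "congress", "type", "number", "version"
    )
    bill = f"{billtype}{number}"
    filename = f"BILLS-{billnumber_version}.xml"
    if pathtype == "congressdotgov":
        # e.g. 116/bills/hr1818/BILLS-116hr1818ih.xml
        return PurePosixPath(congress, "bills", bill, filename)
    if pathtype == "unitedstates":
        # e.g. 116/bills/hr/hr1818/text-versions/ih/BILLS-116hr1818ih.xml
        return PurePosixPath(
            congress, "bills", billtype, bill, "text-versions", version, filename
        )
    raise ValueError(f"Unknown pathtype: {pathtype!r}")
```

For example, bill_xml_relpath('116hr1818ih', 'unitedstates') yields the unitedstates-style path shown in the text above.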

Creating Elasticsearch index of bill sections

Install Elasticsearch (>=7.0.0 and <=7.10.2). This can be done from docker as follows:

$ docker pull docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2

Note: Using `podman` instead of docker on macOS works better for me. Note also that the Docker container's memory may be limited, which slows down Elasticsearch processes. In production, the docker-compose configuration should allocate enough memory to ensure performance (-m=4g sets the maximum memory to 4 GB).

Disk watermark settings for Elasticsearch (set in the Kibana console). These are necessary; without them, ES returns errors during high-volume indexing and processing of documents:

PUT _cluster/settings
{
   "transient": {
       "cluster.routing.allocation.disk.watermark.low": "100gb",
       "cluster.routing.allocation.disk.watermark.high": "50gb",
       "cluster.routing.allocation.disk.watermark.flood_stage": "10gb",
       "cluster.info.update.interval": "1m"
   }
}
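
If Kibana is not available, the same settings can in principle be sent straight to the Elasticsearch REST endpoint. The following is a minimal sketch using only the Python standard library; it assumes an unauthenticated node at http://localhost:9200, and the helper names build_settings_request and apply_settings are hypothetical.

```python
import json
import urllib.request

# Same values as the PUT _cluster/settings body shown above.
WATERMARK_SETTINGS = {
    "transient": {
        "cluster.routing.allocation.disk.watermark.low": "100gb",
        "cluster.routing.allocation.disk.watermark.high": "50gb",
        "cluster.routing.allocation.disk.watermark.flood_stage": "10gb",
        "cluster.info.update.interval": "1m",
    }
}


def build_settings_request(base_url: str = "http://localhost:9200") -> urllib.request.Request:
    """Prepare (but do not send) the PUT request for the cluster settings."""
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=json.dumps(WATERMARK_SETTINGS).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def apply_settings(base_url: str = "http://localhost:9200") -> dict:
    """Send the request and return the parsed JSON response from the node."""
    with urllib.request.urlopen(build_settings_request(base_url)) as response:
        return json.load(response)
```

Calling apply_settings() only makes sense once the Elasticsearch container is up and reachable.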

$ docker run -m=4g -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2 &

Then, in a virtualenv with all requirements installed, run the following commands in Python, from the src directory to create the index:

>>> from billsim.elastic_load import initializeBillSectionsIndex
>>> initializeBillSectionsIndex()

This will gather all of the bill paths in the directory specified in .env and create an Elasticsearch index with the name specified in .env (or the default in constants). Creating the index will take approximately 5 minutes per Congress directory on a reasonably fast server (e.g. 16GB ram, 3 GHz), without any concurrent processing or other optimizations.

Note: This will not delete the index if it already exists. To delete it and start over, pass `delete=True` to `billsim.elastic_load.createIndex` or `delete_index=True` to `billsim.elastic_load.initializeBillSectionsIndex`.

Note: After version 7.10.2, Elasticsearch split between an open-source fork of the 'OSS' version and releases under a more restrictive license (as a challenge to cloud services like AWS). The Python client library must match the version of the Elasticsearch server.
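
A version string, such as the value of version.number in the JSON that an Elasticsearch node returns at its root URL, can be checked against the range mentioned above. The sketch below is illustrative only: the helper names are made up, and it assumes the upper bound 7.10.2 is inclusive, since that is the version pulled in the docker example.

```python
def parse_version(version: str) -> tuple:
    """Turn '7.10.2' (or '7.10.2-SNAPSHOT') into a comparable tuple (7, 10, 2)."""
    core = version.split("-")[0]
    return tuple(int(part) for part in core.split(".")[:3])


def is_supported(version: str, minimum: str = "7.0.0", maximum: str = "7.10.2") -> bool:
    """Check that a server version falls inside the assumed supported range."""
    return parse_version(minimum) <= parse_version(version) <= parse_version(maximum)


def same_major(client_version: str, server_version: str) -> bool:
    """Loose reading of 'client must match server': same major version."""
    return parse_version(client_version)[0] == parse_version(server_version)[0]
```

Comparing tuples of integers avoids the pitfall of comparing version strings directly, where '7.9.3' would sort after '7.10.2'.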

Elasticsearch backup

See https://github.com/elasticsearch-dump/elasticsearch-dump

Install elasticdump command-line application with npm

npm install -g elasticdump

Store billsim and bill_full indices to .gz files

elasticdump --input=http://localhost:9200/billsim --output=$ | gzip > ./elasticdump.billsim.json.gz

elasticdump --input=http://localhost:9200/bill_full --output=$ | gzip > ./elasticdump.bill_full.json.gz

Import data from .json
- Unzip the .json.gz

gzip -d elasticdump.billsim.json.gz
gzip -d elasticdump.bill_full.json.gz

Restore data to Elasticsearch

elasticdump \
  --input "elasticdump.billsim.json" \
  --output=http://localhost:9200/billsim --limit=50 \
  --ignore-errors true

elasticdump \
  --input "elasticdump.bill_full.json" \
  --output=http://localhost:9200/bill_full --limit=50 \
  --ignore-errors true
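
The backup step above could also be scripted from Python. This is a sketch under a few assumptions: elasticdump is installed and on the PATH, Elasticsearch listens on http://localhost:9200, and the helper names are invented for illustration. It mirrors the shell pipeline by streaming elasticdump's standard output into a gzip file.

```python
import gzip
import shutil
import subprocess

INDICES = ("billsim", "bill_full")


def dump_command(index: str, host: str = "http://localhost:9200") -> list:
    """Arguments for elasticdump; '--output=$' makes it write to stdout."""
    return ["elasticdump", f"--input={host}/{index}", "--output=$"]


def backup_filename(index: str) -> str:
    """File name used in the commands above, e.g. elasticdump.billsim.json.gz."""
    return f"elasticdump.{index}.json.gz"


def backup_index(index: str, host: str = "http://localhost:9200") -> str:
    """Stream one index through gzip into the current directory."""
    target = backup_filename(index)
    with gzip.open(target, "wb") as archive:
        process = subprocess.Popen(dump_command(index, host), stdout=subprocess.PIPE)
        shutil.copyfileobj(process.stdout, archive)
        if process.wait() != 0:
            raise RuntimeError(f"elasticdump failed for index {index!r}")
    return target
```

Backing up both indices is then a loop over INDICES that calls backup_index for each name.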

Find similar bills

To run the Elasticsearch similarity algorithm, followed by a function to calculate similarity scores between pairs of bills, run the following command from the command line:

$ python compare.py $MAX_BILLS_TO_COMPARE

Where $MAX_BILLS_TO_COMPARE is the maximum number of bills to compare (chosen randomly from the bill XML files). If no value is passed, the default is all bills.

Note: Running 995 bills this way took ~700 minutes on my machine (16 GB RAM, 2.9 GHz), an average of 41.7 seconds per bill.
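
The timing above gives a rough way to budget a run before starting it. The sketch below simply scales the reported average; the per-bill figure comes from one machine and will differ elsewhere, and the function name is made up.

```python
SECONDS_PER_BILL = 41.7  # average reported above for a single machine


def estimated_minutes(n_bills: int, seconds_per_bill: float = SECONDS_PER_BILL) -> float:
    """Linear estimate of total run time in minutes for n_bills."""
    return n_bills * seconds_per_bill / 60


# 995 bills at 41.7 s each comes to a little over 690 minutes.
print(round(estimated_minutes(995)))
```

The estimate is linear in the number of bills, so passing a smaller $MAX_BILLS_TO_COMPARE shortens the run proportionally.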

Bill similarity functions with Elasticsearch

The bill_similarity.py script includes functions to find similar bills by billnumber and version. The default functions assume that the bill XML files are in a directory three levels up from the bill_similarity.py file, of the form congress/data/. The default data directory can also be set in a .env file.

Then you can run the following command to find and save similar bills (the bill itself should be found as the first result):

>>> from billsim.compare import processSimilarBills
>>> processSimilarBills('116hr1818ih')

OR for many bills:

>>> from billsim.compare import processSimilarBills
>>> billnumber_versions = ['116hr133enr', '115hr4275ih', '117s235is', '117hr4459ih', '117hr4350ih', '117s2766is', '117hr5466ih', '116hr8939ih', '116s160is', '117s2685is', '117hr4041ih', '116hr2812ih', '116hr2709ih', '117s2812is', '116sres178is', '116hres391ih']
>>> for billnumber_version in billnumber_versions:
...     processSimilarBills(billnumber_version)

# Additional bills
# ['117hres158ih', '117hr1768ih', '117hres318ih','117sres356is', '117s2563is', '117s1816is', '117s1588is', '117hr1992ih', '117s2685is', '116sres178is']

or for all bills in one Congress:

>>> from billsim.compare import processSimilarBills
>>> from billsim.utils import getBillXmlPaths
>>> billnumberversions117 = [billPath.billnumber_version for billPath in getBillXmlPaths(congresses=[117])]
>>> for billnumber_version in billnumberversions117:
...     processSimilarBills(billnumber_version)
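
Because a run over a whole Congress can take many hours, it may help to keep going when a single bill fails and to review the failures afterwards. The following is a sketch only: process_many is a hypothetical helper, and the processing function is passed in as an argument so the example does not depend on billsim being installed.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def process_many(
    billnumber_versions: Iterable[str],
    process: Callable[[str], object],
) -> Tuple[List[str], Dict[str, str]]:
    """Apply `process` to each bill, collecting successes and failures."""
    done: List[str] = []
    failed: Dict[str, str] = {}
    for billnumber_version in billnumber_versions:
        try:
            process(billnumber_version)
        except Exception as error:  # record the failure and continue the batch
            failed[billnumber_version] = repr(error)
        else:
            done.append(billnumber_version)
    return done, failed
```

With billsim installed, this would be called as process_many(billnumber_versions, processSimilarBills).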

The processSimilarBills function is the equivalent of the following:

>>> from billsim.bill_similarity import getSimilarBillSections, getBillToBill
>>> from billsim.utils_db import save_bill_to_bill, save_bill_to_bill_sections
>>> s = getSimilarBillSections('116hr200ih')
>>> b2b = getBillToBill(s)
>>> b2b
{'116hr200ih': BillToBillModel(id=None, billnumber_version='116hr200ih', length=7313, length_to=None, score_es=190.614846, score=None, score_to=None, reasons=None, billnumber_version_to='116hr200ih', identified_by=None, title=None, title_to=None, sections=[Section(billnumber_version='116hr200ih', section_id='HE90F34DBB44149C6B9BBD6747EB6F645', label='2.', header='Border wall trust fund', length=None, similar_sections=[SimilarSection(billnumber_version='116hr200ih', section_id='HE90F34DBB44149C6B9BBD6747EB6F645', label='2.', header='Border wall trust fund', length=1264, score_es=97.936806, score=None, score_to=None)]), Section(bill...
>>> for bill in b2b:
...     save_bill_to_bill(b2b[bill])
...     save_bill_to_bill_sections(b2b[bill])  # This should save the individual sections and the section-to-section mapping

# Get similarity scores for bill-to-bill
>>> similar_bills = b2b.keys()
# Calls comparematrix from bills (Golang);
# the compiled executable is in the `bin` directory.
>>> from billsim.compare import getCompareMatrix
>>> c = getCompareMatrix(similar_bills)
>>> c[0][0]
{'Score': 1, 'ScoreOther': 1, 'Explanation': 'bills-identical', 'ComparedDocs': '116hr222ih-116hr222ih'}
>>> c[0][1]
{'Score': 0.86, 'ScoreOther': 0.86, 'Explanation': 'bills-nearly_identical', 'ComparedDocs': '116hr222ih-115hr198ih'}

>>> from billsim.pymodels import BillToBillModel
>>> for row in c:
...     for column in row:
...         bill, bill_to = column['ComparedDocs'].split('-')
...         if bill and bill_to:
...             b2bModel = BillToBillModel(billnumber_version=bill, billnumber_version_to=bill_to, score=column['Score'], score_to=column['ScoreOther'], reasons=[column['Explanation']])
...             save_bill_to_bill(b2bModel)
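
To inspect a row of the comparison matrix without saving it, the cells can be reduced to a ranked list of matches. The sketch below works on plain dictionaries shaped like the output shown above; rank_matches is a hypothetical helper and the sample row reuses the two cells from that output.

```python
def rank_matches(row):
    """Return (bill_to, score, explanation) tuples, best match first.

    Self-comparisons (a bill compared with itself) are skipped.
    """
    matches = []
    for cell in row:
        bill, bill_to = cell["ComparedDocs"].split("-")
        if bill == bill_to:
            continue
        matches.append((bill_to, cell["Score"], cell["Explanation"]))
    return sorted(matches, key=lambda match: match[1], reverse=True)


# The two cells shown in the example output above.
sample_row = [
    {"Score": 1, "ScoreOther": 1, "Explanation": "bills-identical",
     "ComparedDocs": "116hr222ih-116hr222ih"},
    {"Score": 0.86, "ScoreOther": 0.86, "Explanation": "bills-nearly_identical",
     "ComparedDocs": "116hr222ih-115hr198ih"},
]
print(rank_matches(sample_row))
```

Sorting by Score only reflects the first bill's side of the comparison; ScoreOther could be used instead to rank from the other bill's perspective.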

To find similar bills from ES, without reference to the file system, use the getSimilarBillSections_es function.

Build and test

Tests, built with pytest, are found in the tests directory. To run the tests, run make (requires cmake and pytest installed) or run pytest -rs tests directly.

Uses the pytest-order plugin. See https://pytest-dev.github.io/pytest-order/dev/

Run with Docker

While you can run this script locally, a Dockerfile is provided as an alternative. To run it, do the following:

docker build -t billsim .
docker run billsim -m INSERT OTHER ARGS HERE

Run with Postgres (docker)

$ mkdir -p $HOME/docker/volumes/postgres
$ docker run --rm   --name pg-docker -e POSTGRES_PASSWORD=$POSTGRES_PW -d -p 5432:5432 -v $HOME/docker/volumes/postgres:/var/lib/postgresql/data  postgres:alpine

Create a local postgres user: createuser -s postgres

Install the tables:

$ python pymodels.py
2021-12-11 15:48:29,657 INFO sqlalchemy.engine.Engine select pg_catalog.version()
...
CREATE TABLE bill (
        id SERIAL,
        length INTEGER,
        billnumber VARCHAR NOT NULL,
        version VARCHAR NOT NULL,
        PRIMARY KEY (id),
        CONSTRAINT billnumber_version UNIQUE (billnumber, version)
)
...

To access the database from the command line: psql postgresql://postgres:$POSTGRES_PW@localhost:5432/postgres

To run pgadmin4 from docker: docker run -p 5050:80 -e "PGADMIN_DEFAULT_EMAIL=[email protected]" -e "PGADMIN_DEFAULT_PASSWORD=a12345678" -d dpage/pgadmin4

The admin panel is available at http://localhost:5050/
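
The connection URL used with psql above can also be assembled in Python, for example to pass it to a database library. This is a minimal sketch: postgres_url is a made-up helper, and the defaults (user postgres, localhost, port 5432, database postgres) are taken from the commands in this section. Percent-encoding the password keeps characters such as @ or / from breaking the URL.

```python
import os
from urllib.parse import quote


def postgres_url(
    password: str,
    user: str = "postgres",
    host: str = "localhost",
    port: int = 5432,
    database: str = "postgres",
) -> str:
    """Build a postgresql:// URL, percent-encoding the user and password."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


# Reads the same variable used in the docker run command above.
url = postgres_url(os.environ.get("POSTGRES_PW", ""))
```

Whether billsim itself reads POSTGRES_PW this way is not stated here; the sketch only mirrors the shell commands.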

Postgres back-up

The database is backed up with: pg_dump billsim > billsim-bk.sql

Or, without ownership and privilege statements (-O -x), and gzipped: pg_dump billsim -O -x | gzip -9 > billsim-bk.sql.gz

Or, from the url: pg_dump postgresql://postgres:postgres@localhost -O -x | gzip -9 > billsim-bk.sql.gz

(See https://jer-k.github.io/docker-postgres-image-with-seeded-data)
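
The gzipped backup can likewise be scripted from Python, for instance from a scheduled job. The sketch below assumes pg_dump is on the PATH and that the database is named billsim as in the commands above; the helper names are hypothetical.

```python
import gzip
import shutil
import subprocess


def pg_dump_command(database: str = "billsim") -> list:
    """pg_dump arguments: -O omits ownership, -x omits privilege statements."""
    return ["pg_dump", "-O", "-x", database]


def backup_database(database: str = "billsim",
                    target: str = "billsim-bk.sql.gz") -> str:
    """Stream pg_dump output into a gzip file at maximum compression."""
    with gzip.open(target, "wb", compresslevel=9) as archive:
        process = subprocess.Popen(pg_dump_command(database), stdout=subprocess.PIPE)
        shutil.copyfileobj(process.stdout, archive)
        if process.wait() != 0:
            raise RuntimeError(f"pg_dump failed for database {database!r}")
    return target
```

A connection URL can be passed in place of the database name, as in the last pg_dump command above.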
