billsim's Introduction

BillSim: Utilities to process similarity of bills

This repository extracts the bill similarity functions from the BillMap project. Both are open source, under the Public Domain License.

A separate repository (github.com/aih/bills) contains tools in Go to process bill data.

Install

As a python package

This repository can be installed as a Python package directly from GitHub: pip install -e git+https://github.com/aih/billsim.git#egg=billsim-aih

Then it can be imported as import billsim

The repository can also be installed with git clone https://github.com/aih/billsim.git.

Installation (quick)

From a local repository

  • Clone the repository to a directory called billsim: git clone https://github.com/aih/billsim.git

  • Create a virtual environment with Python >=3.8. For example, with pyenv-virtualenv: pyenv virtualenv 3.9.12 billsim

  • Activate the virtual environment: pyenv activate billsim

  • Install the dependencies

  • Run python setup.py install

Installation (m1 mac)

To install BillSim, the following steps are suggested.

If you are running this on an M1 Mac, it is suggested to first install brew on arch in the Users folder.

Next, run brew install pyenv-virtualenv to install pyenv-virtualenv, which sets up a virtual environment for Python so that architecture errors do not occur when installing certain packages.

The following is a good set of instructions to set up a virtual environment with pyenv-virtualenv quickly: https://towardsdatascience.com/python-how-to-create-a-clean-learning-environment-with-pyenv-pyenv-virtualenv-pipx-ed17fbd9b790

A more detailed guide can be found here: https://realpython.com/intro-to-pyenv/

You may get the following error when trying to set up the virtual environment:

Failed to activate virtualenv.

Perhaps pyenv-virtualenv has not been loaded into your shell properly.
Please restart current shell and try again.

To remedy this, run the following to ensure the Python environment can be activated properly:

eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"

Finally, to see if everything works, run pytest. Note that a bare pytest command may resolve to the default global installation, so the tests can end up running on a version of Python you are not using.

To solve this, install pytest in the virtual environment and run it as python -m pytest, so that the Python interpreter of the virtual environment, rather than the global one, runs the tests.
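As a quick check of which interpreter is active, the following minimal sketch reports whether Python is running inside a virtual environment. The helper name in_virtualenv is hypothetical and not part of billsim; it relies only on the standard sys module.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv/virtualenv, sys.prefix points at the environment, while
    # sys.base_prefix points at the interpreter the environment was created from.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside virtualenv:", in_virtualenv())
```

Running the same check from a test file executed by pytest shows which interpreter the test run actually uses.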

Processing bill XML files

The default functions assume that the bill XML files are in a CONGRESS_DATA directory. The absolute path to CONGRESS_DATA must be defined as PATH_TO_CONGRESS_DATA in environment variables, or set in the .env file, inside the billsim directory. It should be a path to a directory of the form [abspath]/data/congress/117 (for the 117th Congress). See .env-sample for an example.

The 'PATHTYPE_DEFAULT' sets the expected hierarchy structure. The 'congressdotgov' structure is /116/bills/hr1818/BILLS-116hr1818ih.xml, while the unitedstates pathtype structure follows the hierarchy that is created by the scraper in github.com/unitedstates/congress: 116/bills/hr/hr1818/text-versions/ih/BILLS-116hr1818ih.xml.

Then you can run the following command to process all the bill XML files in the directory:

>>> from billsim.utils import getBillXmlPaths
>>> z=getBillXmlPaths()
>>> z[30]
BillPath(billnumber_version='116hr1818ih', path='[abspath]/billsim/data/congress/116/bills/hr1818/BILLS-116hr1818ih.xml', fileName='BILLS-116hr1818ih.xml')
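The two directory layouts described above can be illustrated with a small sketch that builds the relative path of a bill XML file from its billnumber_version. This is not the code used by getBillXmlPaths; the function name bill_xml_relpath and the regular expression are assumptions made for illustration, and identifiers with unusual version codes may not match the pattern.

```python
import re
from pathlib import PurePosixPath

# Assumed shape of identifiers such as '116hr1818ih':
# congress number, bill type, bill number, version code.
BILLNUMBER_VERSION = re.compile(r"^(\d+)([a-z]+)(\d+)([a-z]+)$")


def bill_xml_relpath(billnumber_version: str, pathtype: str = "congressdotgov") -> str:
    """Build the path of a bill XML file relative to the data/congress directory."""
    match = BILLNUMBER_VERSION.match(billnumber_version)
    if match is None:
        raise ValueError(f"unrecognized billnumber_version: {billnumber_version!r}")
    congress, billtype, number, version = match.groups()
    bill = f"{billtype}{number}"
    filename = f"BILLS-{billnumber_version}.xml"
    if pathtype == "congressdotgov":
        parts = (congress, "bills", bill, filename)
    elif pathtype == "unitedstates":
        parts = (congress, "bills", billtype, bill, "text-versions", version, filename)
    else:
        raise ValueError(f"unknown pathtype: {pathtype!r}")
    return str(PurePosixPath(*parts))


if __name__ == "__main__":
    print(bill_xml_relpath("116hr1818ih", "congressdotgov"))
    print(bill_xml_relpath("116hr1818ih", "unitedstates"))
```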

Creating Elasticsearch index of bill sections

Install Elasticsearch (>=7.0.0 and <=7.10.2). This can be done from docker as follows:

$ docker pull docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2
Note
Using podman instead of docker on macOS works better for me. Note also that the docker container memory may be limited, slowing down Elasticsearch processes. In production, the docker-compose should set enough memory to ensure performance (-m=4g sets the max memory to 4 GB).

Memory settings for elasticsearch (set in Kibana). These are necessary or ES returns errors during high-volume indexing and processing of documents:

PUT _cluster/settings
{
   "transient": {
       "cluster.routing.allocation.disk.watermark.low": "100gb",
       "cluster.routing.allocation.disk.watermark.high": "50gb",
       "cluster.routing.allocation.disk.watermark.flood_stage": "10gb",
       "cluster.info.update.interval": "1m"
   }
}
$ docker run -m=4g -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2 &

Then, in a virtualenv with all requirements installed, run the following commands in Python, from the src directory to create the index:

>>> from billsim.elastic_load import initializeBillSectionsIndex
>>> initializeBillSectionsIndex()

This will gather all of the bill paths in the directory specified in .env and create an Elasticsearch index with the name specified in .env (or the default in constants). Creating the index will take approximately 5 minutes per Congress directory on a reasonably fast server (e.g. 16GB ram, 3 GHz), without any concurrent processing or other optimizations.

Note
This will not delete the index if it already exists. To do so, and start over, pass delete=True to billsim.elastic_load.createIndex or delete_index=True to billsim.elastic_load.initializeBillSectionsIndex.
Note
The Elasticsearch versions after 7.10.2 are forked between the full 'OSS' version and a more restrictive license (as a challenge to Cloud services like AWS). The python client library must match the version of the Elasticsearch server.

Elasticsearch backup

  • Install elasticdump command-line application with npm

npm install -g elasticdump

  • Store billsim and bill_full indices to .gz files

elasticdump --input=http://localhost:9200/billsim --output=$ | gzip > ./elasticdump.billsim.json.gz

elasticdump --input=http://localhost:9200/bill_full --output=$ | gzip > ./elasticdump.bill_full.json.gz

  • Import data from .json

    • Unzip the .json.gz

gzip -d elasticdump.billsim.json.gz
gzip -d elasticdump.bill_full.json.gz

  • Restore data to Elasticsearch

elasticdump \
  --input "elasticdump.billsim.json" \
  --output=http://localhost:9200/billsim --limit=50 \
  --ignore-errors true

elasticdump \
  --input "elasticdump.bill_full.json" \
  --output=http://localhost:9200/bill_full --limit=50 \
  --ignore-errors true

Find similar bills

To run the Elasticsearch similarity algorithm, followed by a function to calculate similarity scores between pairs of bills, run the following command from the command line:

$ python compare.py $MAX_BILLS_TO_COMPARE

Where $MAX_BILLS_TO_COMPARE is the maximum number of bills to compare (chosen randomly from the bill XML files). If no value is passed, the default is all bills.

Note
Running 995 bills this way took ~700 minutes on my machine (16GB ram, 2.9 GHz) (average 41.7 seconds per bill).

Bill similarity functions with Elasticsearch

The bill_similarity.py script includes functions to find similar bills by billnumber and version. The default functions assume that the bill XML files are in a directory three levels up from the bill_similarity.py file, of the form congress/data/. The default data directory can also be set in a .env file.

Then you can run the following command to find and save similar bills (the bill itself should be found as the first result):

>>> from billsim.compare import processSimilarBills
>>> processSimilarBills('116hr1818ih')

OR for many bills:

>>> from billsim.compare import processSimilarBills
>>> billnumber_versions=['116hr133enr', '115hr4275ih', '117s235is', '117hr4459ih', '117hr4350ih', '117s2766is', '117hr5466ih', '116hr8939ih', '116s160is', '117s2685is', '117hr4041ih', '116hr2812ih', '116hr2709ih', '117s2812is', '116sres178is', '116hres391ih']
>>> for billnumber_version in billnumber_versions:
...     processSimilarBills(billnumber_version)

# Additional bills
# ['117hres158ih', '117hr1768ih', '117hres318ih','117sres356is', '117s2563is', '117s1816is', '117s1588is', '117hr1992ih', '117s2685is', '116sres178is']

or for all bills in one Congress:

>>> from billsim.compare import processSimilarBills
>>> from billsim.utils import getBillXmlPaths
>>> billnumberversions117 = [billPath.billnumber_version for billPath in getBillXmlPaths(congresses=[117])]
>>> for billnumber_version in billnumberversions117:
...     processSimilarBills(billnumber_version)
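For long runs over many bills, it may help to keep going when a single bill fails and to record which ones did. The sketch below is a generic wrapper, not part of billsim: process_all is a hypothetical name, and it assumes the processing function signals failure by raising an exception.

```python
from typing import Callable, Dict, Iterable, List


def process_all(billnumber_versions: Iterable[str],
                process: Callable[[str], object]) -> Dict[str, List[str]]:
    """Apply `process` to each bill and record which calls succeeded or failed."""
    results: Dict[str, List[str]] = {"ok": [], "failed": []}
    for billnumber_version in billnumber_versions:
        try:
            process(billnumber_version)
        except Exception as err:  # keep going so one bad bill does not stop the run
            print(f"{billnumber_version}: {err}")
            results["failed"].append(billnumber_version)
        else:
            results["ok"].append(billnumber_version)
    return results
```

With billsim installed, it could be called as process_all(billnumber_versions, processSimilarBills).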

The processSimilarBills function is the equivalent of the following:

>>> from billsim.bill_similarity import getSimilarBillSections, getBillToBill
>>> from billsim.utils_db import save_bill_to_bill, save_bill_to_bill_sections
>>> s = getSimilarBillSections('116hr200ih')
>>> b2b = getBillToBill(s)
>>> b2b
{'116hr200ih': BillToBillModel(id=None, billnumber_version='116hr200ih', length=7313, length_to=None, score_es=190.614846, score=None, score_to=None, reasons=None, billnumber_version_to='116hr200ih', identified_by=None, title=None, title_to=None, sections=[Section(billnumber_version='116hr200ih', section_id='HE90F34DBB44149C6B9BBD6747EB6F645', label='2.', header='Border wall trust fund', length=None, similar_sections=[SimilarSection(billnumber_version='116hr200ih', section_id='HE90F34DBB44149C6B9BBD6747EB6F645', label='2.', header='Border wall trust fund', length=1264, score_es=97.936806, score=None, score_to=None)]), Section(bill...
>>> for bill in b2b:
...     save_bill_to_bill(b2b[bill])
...     save_bill_to_bill_sections(b2b[bill])  # This should save the individual sections and the section-to-section mapping

# Get similarity scores for bill-to-bill
>>> similar_bills = b2b.keys()
# Calls comparematrix from bills (Golang);
# the compiled executable is in the `bin` directory.
>>> from billsim.compare import getCompareMatrix
>>> c = getCompareMatrix(similar_bills)
>>> c[0][0]
{'Score': 1, 'ScoreOther': 1, 'Explanation': 'bills-identical', 'ComparedDocs': '116hr222ih-116hr222ih'}
>>> c[0][1]
{'Score': 0.86, 'ScoreOther': 0.86, 'Explanation': 'bills-nearly_identical', 'ComparedDocs': '116hr222ih-115hr198ih'}

>>> from billsim.pymodels import BillToBillModel
>>> for row in c:
...     for column in row:
...         bill, bill_to = column['ComparedDocs'].split('-')
...         if bill and bill_to:
...             b2bModel = BillToBillModel(billnumber_version=bill, billnumber_version_to=bill_to, score=column['Score'], score_to=column['ScoreOther'], reasons=[column['Explanation']])
...             save_bill_to_bill(b2bModel)
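The shape of the compare matrix can be explored without a database or the Go executable. The following sketch flattens a matrix into plain dictionaries, using the two sample cells shown above as input. The function name flatten_compare_matrix and the output keys are chosen for illustration; they mirror the BillToBillModel fields used above but do not call billsim.

```python
from typing import Dict, List


def flatten_compare_matrix(matrix: List[List[dict]]) -> List[Dict[str, object]]:
    """Turn a matrix of comparison cells into one record per bill pair."""
    records = []
    for row in matrix:
        for cell in row:
            # 'ComparedDocs' holds both identifiers, joined by a hyphen.
            bill, _, bill_to = cell["ComparedDocs"].partition("-")
            if not (bill and bill_to):
                continue  # skip malformed cells
            records.append({
                "billnumber_version": bill,
                "billnumber_version_to": bill_to,
                "score": cell["Score"],
                "score_to": cell["ScoreOther"],
                "reasons": [cell["Explanation"]],
            })
    return records


# Sample cells copied from the output shown above.
sample = [[
    {"Score": 1, "ScoreOther": 1, "Explanation": "bills-identical",
     "ComparedDocs": "116hr222ih-116hr222ih"},
    {"Score": 0.86, "ScoreOther": 0.86, "Explanation": "bills-nearly_identical",
     "ComparedDocs": "116hr222ih-115hr198ih"},
]]

if __name__ == "__main__":
    for record in flatten_compare_matrix(sample):
        print(record)
```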

To find similar bills from ES, without reference to the file system, use the getSimilarBillSections_es function.

Build and test

Tests, built with pytest, are found in the tests directory. To run the tests, run make (requires cmake and pytest installed) or run pytest -rs tests directly.

Uses the pytest-order plugin. See https://pytest-dev.github.io/pytest-order/dev/

Run with Docker

While you can run this script locally, a Dockerfile is provided as an alternative. To run it, do the following:

docker build -t billsim .
docker run billsim -m INSERT OTHER ARGS HERE

Run with Postgres (docker)

$ mkdir -p $HOME/docker/volumes/postgres
$ docker run --rm   --name pg-docker -e POSTGRES_PASSWORD=$POSTGRES_PW -d -p 5432:5432 -v $HOME/docker/volumes/postgres:/var/lib/postgresql/data  postgres:alpine

Create a local postgres user:app-name: createuser -s postgres

Install the tables:

$ python pymodels.py
2021-12-11 15:48:29,657 INFO sqlalchemy.engine.Engine select pg_catalog.version()
...
CREATE TABLE bill (
        id SERIAL,
        length INTEGER,
        billnumber VARCHAR NOT NULL,
        version VARCHAR NOT NULL,
        PRIMARY KEY (id),
        CONSTRAINT billnumber_version UNIQUE (billnumber, version)
)
...

To access the database from the command line: psql postgresql://postgres:$POSTGRES_PW@localhost:5432/postgres

To run pgadmin4 from docker: docker run -p 5050:80 -e "PGADMIN_DEFAULT_EMAIL=[email protected]" -e "PGADMIN_DEFAULT_PASSWORD=a12345678" -d dpage/pgadmin4

The admin panel is available at http://localhost:5050/

Postgres back-up

The database is backed up with: pg_dump billsim > billsim-bk.sql

Or, without user/pw, and gzipped: pg_dump billsim -O -x | gzip -9 > billsim-bk.sql.gz

Or, from the url: pg_dump postgresql://postgres:postgres@localhost -O -x | gzip -9 > billsim-bk.sql.gz

billsim's People

Contributors

aih, amprokop, awsninja, leschonander, locjay, selama1

Forkers

amprokop locjay

billsim's Issues

Run tests on PR

This issue is to create a GitHub Action to run tests on each PR.

Issues with running billsim on Python3.8 and Python3.9

I tried running billsim with python3.8 and python3.9, and got two separate issues.

With 3.8

___________ ERROR collecting tests/constants_test.py ___________
tests/constants_test.py:3: in <module>
    from billsim.pymodels import BillPath
src/billsim/pymodels.py:36: in <module>
    class Section(SectionMeta):
src/billsim/pymodels.py:37: in Section
    similar_sections: list[SimilarSection]
E   TypeError: 'type' object is not subscriptable
_____________ ERROR collecting tests/utils_test.py _____________
tests/utils_test.py:9: in <module>
    from billsim.pymodels import BillPath
src/billsim/pymodels.py:36: in <module>
    class Section(SectionMeta):
src/billsim/pymodels.py:37: in Section
    similar_sections: list[SimilarSection]
E   TypeError: 'type' object is not subscriptable

With 3.9

from lxml import etree
ImportError: dlopen(/opt/homebrew/lib/python3.9/site-packages/lxml/etree.cpython-39-darwin.so, 2): no suitable image found.  Did find:
        /opt/homebrew/lib/python3.9/site-packages/lxml/etree.cpython-39-darwin.so: mach-o, but wrong architecture
        /opt/homebrew/lib/python3.9/site-packages/lxml/etree.cpython-39-darwin.so: mach-o, but wrong architecture

The latter seems to be an issue with M1 Macs, since the error complains about architecture.
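The 3.8 error comes from the annotation list[SimilarSection]: built-in collection types only became subscriptable in Python 3.9 (PEP 585). One possible fix that works on both versions is typing.List. The sketch below uses plain dataclasses as stand-ins, since the actual base classes in pymodels.py are not shown here; the fields are a reduced, illustrative subset.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SimilarSection:
    billnumber_version: str
    section_id: str
    score_es: Optional[float] = None


@dataclass
class Section:
    billnumber_version: str
    section_id: str
    # typing.List is subscriptable on Python 3.8, unlike the built-in list.
    similar_sections: List[SimilarSection] = field(default_factory=list)
```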

Need to pass in environment variables to BillSim

For a Library of Congress project, we need to connect to Postgres on port 5433, not port 5432.

The billsim code doesn't seem to use the values defined in the .env file at the top level. I thought it did use those .env values, but I may have hardcoded port 5433 into the billsim package and forgotten about it, then reinstalled billsim and wiped that change away.

How can we best pass the postgres port and host into the billsim package? Potentially we should pass it at runtime rather than as an env variable? Or is there some way to have the billsim package use the .env file at the top level of the repo? This would appear to be an issue for any user of the billsim package, as generally you'd want to pass in username/password/etc and not use defaults.
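One option, sketched here under assumptions, is to build the connection URL from environment variables with fallbacks to the current defaults. Only POSTGRES_PW appears elsewhere in this README; the variable names POSTGRES_USER, POSTGRES_HOST, POSTGRES_PORT and POSTGRES_DB, and the function postgres_url, are hypothetical and not part of billsim.

```python
import os
from typing import Mapping, Optional
from urllib.parse import quote


def postgres_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Build a Postgres connection URL from environment-style settings."""
    env = os.environ if env is None else env
    user = env.get("POSTGRES_USER", "postgres")
    password = env.get("POSTGRES_PW", "")
    host = env.get("POSTGRES_HOST", "localhost")
    port = int(env.get("POSTGRES_PORT", "5432"))  # fails early on a non-numeric port
    database = env.get("POSTGRES_DB", "postgres")
    # Percent-encode the password so characters like '@' or '/' do not break the URL.
    return f"postgresql://{user}:{quote(password, safe='')}@{host}:{port}/{database}"


if __name__ == "__main__":
    print(postgres_url({"POSTGRES_PORT": "5433", "POSTGRES_PW": "secret"}))
```

Accepting a mapping as an argument also allows the settings to be passed at runtime instead of read from the process environment.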

Create generic array similarity functions

This function builds on the functions in this repository and the Go functions in aih/bills.

Assumptions:

  • Each 'document' consists of an array of strings. The document has a unique id and each item in the array is also uniquely identified (either by an id or its ordinal position in the array).
  • The length of each document array may vary

The generic similarity functions would:

  1. Calculate a vocabulary of n-grams from the total corpus of documents (an array of documents).
  2. Vectorize the documents so that each document can be stored as a (sparse) array of the length of the vocabulary
  3. Store the vectorized matrix of all documents (MOD) in a pickle file (or eventually in Postgresql)
  4. Calculate the similarity between each item of each array and all other items in the MOD
  5. Apply an item threshold to find similar items for each item in a document
  6. Apply a document threshold to find similar documents
  7. Return 5 and 6 in a model form that can be stored to a database (item-to-item and document-to-document similarity)
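The steps above can be sketched in pure Python. This is an illustration under simplifying assumptions, not the proposed implementation: it uses word bigrams, stores each item as a Counter (a sparse vector whose keys form the vocabulary implicitly), scores items with cosine similarity, and scores a document pair by the fraction of matched items. Persistence (step 3) is omitted, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

ItemKey = Tuple[str, int]  # (document id, ordinal position of the item)


def vectorize(item: str, n: int = 2) -> Counter:
    """Count the word n-grams of one item; the Counter acts as a sparse vector."""
    tokens = item.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def similar_items(corpus: Dict[str, List[str]], item_threshold: float = 0.5):
    """Return (item, item, score) for items of different documents above the threshold."""
    vectors = {(doc_id, idx): vectorize(item)
               for doc_id, items in corpus.items() for idx, item in enumerate(items)}
    keys = list(vectors)
    matches = []
    for i, key_a in enumerate(keys):
        for key_b in keys[i + 1:]:
            if key_a[0] == key_b[0]:
                continue  # only compare items across documents
            score = cosine(vectors[key_a], vectors[key_b])
            if score >= item_threshold:
                matches.append((key_a, key_b, score))
    return matches


def similar_documents(corpus, matches, doc_threshold: float = 0.5):
    """Score each document pair by the share of items that found a match."""
    matched: Dict[Tuple[str, str], set] = {}
    for (doc_a, idx_a), (doc_b, _), _score in matches:
        matched.setdefault((doc_a, doc_b), set()).add(idx_a)
    result = {}
    for (doc_a, doc_b), indices in matched.items():
        score = len(indices) / max(len(corpus[doc_a]), len(corpus[doc_b]))
        if score >= doc_threshold:
            result[(doc_a, doc_b)] = score
    return result


corpus = {
    "doc1": ["the secretary shall establish a trust fund",
             "this act may be cited as the example act"],
    "doc2": ["the secretary shall establish a trust fund",
             "funds are authorized to be appropriated"],
    "doc3": ["nothing in common here at all"],
}

if __name__ == "__main__":
    item_matches = similar_items(corpus)
    print(item_matches)
    print(similar_documents(corpus, item_matches))
```

Comparing every item against every other item is quadratic in the number of items, so a real corpus would likely need the sparse matrix representation described in steps 2 and 3.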

Improve batch saving

See https://github.com/aih/billsim/blob/main/src/billsim/utils_db.py#L382

We currently use sqlalchemy for this batch save operation. However, in some cases, it causes errors:

Traceback (most recent call last):
  File "/home/ubuntu/.pyenv/versions/3.9.1/envs/py391/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1799, in _execute_context
    self.dialect.do_execute(
  File "/home/ubuntu/.pyenv/versions/3.9.1/envs/py391/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 717, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.StatementTooComplex: stack depth limit exceeded
HINT:  Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate.

We may try to increase max_stack_depth, but the statements also appear to be unnecessarily large and recursive.

Can we take advantage of psycopg3 improvements and use it directly for saves?
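Independently of the driver, one mitigation worth trying is to split a large save into smaller batches so that no single statement grows too large. Whether this avoids the StatementTooComplex error here is an untested assumption. The helper below is a generic sketch; chunked is a hypothetical name and is not part of billsim or sqlalchemy.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    for batch in chunked(range(5), 2):
        print(batch)
```

Each batch could then be passed to the existing save function in its own transaction.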
