This repository contains a Flask app designed to improve the process of classifying responses received in the GOV.UK intent survey. The application is hosted on GOV.UK PaaS.
A blog post about the GOV.UK intent survey is available on gov.uk, whilst the code is available as a Python package and supporting scripts.
The underlying framework of the app is based heavily on the micro blogging site by Miguel Grinberg which features in the O'Reilly book Flask Web Development.
git clone [email protected]:alphagov/classifyintentsapp.git
cd classifyintentsapp
Configure the app by creating a .env file in the classifyintentsapp directory:
DATABASE_URL=postgres://USER@localhost:5432/classifyintentsapp
DEV_DATABASE_URL=postgres://USER@localhost:5432/classifyintentsapp-dev
TEST_DATABASE_URL=postgres://USER@localhost:5432/classifyintentsapp-test
FLASKY_ADMIN=[email protected]
SECRET_KEY=key-to-prevent_csrf
FLASK_CONFIG=development
NOTIFY_API_KEY=govuk-notify-api-key
and edit:
- USER to your local machine's login name
- FLASKY_ADMIN to be your email address (for exception emails and signifying the admin account)
- NOTIFY_API_KEY to be your notify API key
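Before starting the app it can help to confirm that the .env file defines every key listed above. The following is a minimal sketch, not part of the repository: `parse_env` and `missing_keys` are hypothetical helper names, and the parser assumes simple KEY=VALUE lines with no quoting or variable expansion.

```python
# Keys taken from the example .env file above.
REQUIRED_KEYS = (
    "DATABASE_URL",
    "DEV_DATABASE_URL",
    "TEST_DATABASE_URL",
    "FLASKY_ADMIN",
    "SECRET_KEY",
    "FLASK_CONFIG",
    "NOTIFY_API_KEY",
)


def parse_env(text):
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]
```

Calling `missing_keys(parse_env(open(".env").read()))` from the classifyintentsapp directory would then list any keys that still need to be added.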
mkvirtualenv classifyintentsapp
pip install -r requirements.txt
Install PostgreSQL. On macOS it may be convenient to use Postgres.app. Now create the databases:
createdb classifyintentsapp-dev
createdb classifyintentsapp-test
createdb classifyintentsapp
Set up the classifyintentsapp-dev database and add dummy data:
python manage.py deploy_local
Create an admin account:
python manage.py shell
Then:
from app.models import User, Role
admin_id = Role.query.filter(Role.name=='Administrator').with_entities(Role.id).scalar()
u = User(username='admin', email='[email protected]', password='pass', role=Role.query.get(admin_id), confirmed=True)
db.session.add(u)
db.session.commit()
Start the app:
python manage.py runserver
Open in a browser http://127.0.0.1:5000/
Login as: [email protected]
password: pass
Note that the .cfignore file mirrors the .gitignore file, so any files you wish to exclude from being pushed onto the PaaS instance should be added to the .gitignore. To deploy manually on PaaS, navigate to the root of the project and run:
cf push
Environment variables for the instance should be set in manifest.yml. This file should be in the following format:
---
applications:
- name: classifyapp
  env:
    DATABASE_URL: postgres://username:password@host:port/database
    DEV_DATABASE_URL: postgres://username:password@host:port/database
    TEST_DATABASE_URL: postgres://username:password@host:port/database
    FLASKY_ADMIN: [email protected]
    SECRET_KEY: key-to-prevent_csrf
    FLASK_CONFIG: production
    NOTIFY_API_KEY: govuk-notify-api-key
---
SECRET_KEY should be a random string - create using:
python -c 'import random, string; print("".join([random.SystemRandom().choice("{}{}".format(string.ascii_letters, string.digits)) for i in range(50)]))'
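As an alternative to the one-liner above, the standard library's `secrets` module (Python 3.6+) draws from the same operating-system random source and reads a little more clearly. This is a sketch only; `generate_secret_key` is a hypothetical helper name, not something the repository provides.

```python
import secrets
import string


def generate_secret_key(length=50):
    """Return a random key of ASCII letters and digits from the OS CSPRNG."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


# Print a fresh key that can be pasted into manifest.yml or .env.
print(generate_secret_key())
```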
When deploying the application for the first time you must log into the instance running the application and deploy the application manually. To access the server run:
cf ssh classifyapp
You will then need to activate the local environment:
export DEPS_DIR=/home/vcap/deps
for f in /home/vcap/app/.profile.d/*.sh; do source $f; done
then:
cd app/ # navigate to the project root
python manage.py db upgrade
If you wish to populate the database with dummy data, you can also run:
python manage.py deploy
See below for more details on generating dummy data. A local server can then be started with:
python manage.py runserver
and accessed at http://127.0.0.1:5000.
Ensure that you have specified your email address in the FLASKY_ADMIN environment variable, and then register with the application using the registration page. You will automatically be granted administrator rights to the web application.
If you are running the server without access to Notify, you will need to create a user manually. Open a shell:
python manage.py shell
Then:
from app.models import User, Role
admin_id = Role.query.filter(Role.name=='Administrator').with_entities(Role.id).scalar()
u = User(username='admin', email='[email protected]', password='pass', role=Role.query.get(admin_id), confirmed=True)
db.session.add(u)
db.session.commit()
Dummy data is generated as part of the python manage.py deploy_local command, but the same methods can be run independently of it by running python manage.py populate, or by opening an app-specific shell with python manage.py shell and executing the commands:
Role.insert_roles()
Raw.generate_fake()
Codes.generate_fake()
ProjectCodes.generate_fake()
User.generate_fake()
Classified.generate_fake()
Each method accepts as its first argument the number of records to create. Classified.generate_fake() also accepts a second argument, which specifies the number of random users over which the Classified records will be spread.
Note that it is possible to 'run out' of eligible surveys to classify using this method, in which case more fake surveys should be generated with Raw.generate_fake().
When hosting databases on GOV.UK PaaS, it is not possible to make a direct connection between your local machine and the remote server. This must be handled using an SSH tunnel. More information is available in the GOV.UK PaaS documentation.
To see the details of the postgres database run:
cf env APP_NAME
which will return JSON containing the server configuration.
To create an SSH tunnel via the instance running the web application run:
cf ssh classifyapp -L 6666:HOST:PORT
In a new terminal window then run:
psql postgres://USERNAME:PASSWORD@localhost:6666/DATABASE_NAME
substituting the database details.
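Passwords issued for hosted databases often contain characters such as `@` or `/` that break a connection URL unless they are percent-encoded. The sketch below builds the URL for the local end of the tunnel; `tunnel_url` is a hypothetical helper, and the default port 6666 simply matches the `cf ssh` example above.

```python
from urllib.parse import quote


def tunnel_url(username, password, database, local_port=6666):
    """Build a postgres:// URL pointing at the local end of the SSH tunnel."""
    return "postgres://{}:{}@localhost:{}/{}".format(
        quote(username, safe=""),  # percent-encode reserved characters
        quote(password, safe=""),
        local_port,
        quote(database, safe=""),
    )


# Placeholder credentials; substitute the values reported by `cf env`.
print(tunnel_url("USERNAME", "PASSWORD", "DATABASE_NAME"))
```

The resulting string can be passed directly to psql in place of the hand-written URL.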
The order in which surveys are presented to users is controlled by the prioritisation view. Any view could be created in its place with a new set of criteria as required, but at present the prioritisation works on the basis that at least half of all the people coding a survey need to agree before a code can be set. The prioritisation rules are set out below:
Priority | Conditions |
---|---|
1 | Any survey for which there is not yet a majority (>=50%) of users assigning a single code, and where <= 5 users have coded the survey |
2 | New surveys that have not been coded |
3 | A survey for which a majority (>=50%) has been found, but <5 people have coded the survey. |
6 | The survey has been automatically classified by algorithm. |
7 | Survey is recalcitrant: when >5 people have coded the survey and there is still no majority. |
8 | Survey contains Personally Identifiable Information (has been tagged as such once or more times) |
9 | There is a majority, and more than 5 people have coded the survey. |
Surveys with priority >=6 are removed from circulation (in the case of 7 and 8: pending further action).
Within each priority code, surveys are ordered by descending date, so that the most recent survey will always come up first.
Note that respondent_id is not unique in the priority view. Where there is no discernible majority, i.e. two or more codes each hold a vote share below 0.5, each of these entries will appear in the priority view.
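The rules in the table can be expressed as a small function, which may make the thresholds easier to check. This is an illustrative sketch only: the real logic lives in the SQL view, `survey_priority` is a hypothetical name, and the sketch assumes one code per coder and that the PII and automatic-classification flags take precedence over the vote counts. The table does not say what happens when exactly 5 people have coded a survey and a majority exists, so the sketch returns None for that case rather than guessing.

```python
from collections import Counter


def survey_priority(codes, has_pii=False, auto_classified=False):
    """Return the priority for a survey given the codes assigned so far.

    `codes` holds one entry per coder (an assumption of this sketch).
    """
    if has_pii:
        return 8  # tagged as containing PII at least once
    if auto_classified:
        return 6  # classified automatically by algorithm
    coders = len(codes)
    if coders == 0:
        return 2  # new survey, not yet coded
    # Share of votes held by the most frequently assigned code.
    top_share = Counter(codes).most_common(1)[0][1] / coders
    if top_share < 0.5:
        return 1 if coders <= 5 else 7  # no majority yet / recalcitrant
    if coders < 5:
        return 3  # majority found, but fewer than 5 coders
    if coders > 5:
        return 9  # majority found with more than 5 coders
    return None  # exactly 5 coders with a majority: not covered by the table
```

For example, three coders who each choose a different code give a top share of one third, so the survey stays at priority 1 until more people code it.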
Tests are in the tests/ folder. Either run python manage.py test to execute them all (this is required for database setup and teardown), or run individual tests with python -m unittest tests/test_lookup.py (for example).
Tests must be run on a postgres database, so the TEST_DATABASE_URL environment variable must be set in .env.
To complete the tests that use Selenium, you will need to download chromedriver and add it to your path; otherwise these tests will still pass, without reporting a failure.
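One way to make a missing chromedriver visible is to skip the browser tests explicitly, so the test report shows them as skipped. The sketch below illustrates the idea with the standard library only; `chromedriver_available` and `BrowserTestCase` are hypothetical names and do not describe how the repository's own tests are written.

```python
import shutil
import unittest


def chromedriver_available():
    """Return True if a chromedriver executable can be found on the PATH."""
    return shutil.which("chromedriver") is not None


class BrowserTestCase(unittest.TestCase):
    """Hypothetical test case that is skipped when chromedriver is absent."""

    @unittest.skipUnless(chromedriver_available(), "chromedriver not found on PATH")
    def test_placeholder(self):
        # A real test would start a Selenium WebDriver here.
        self.assertTrue(chromedriver_available())
```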
Note that the views: priority, leaders, daily_leaders, and weekly_leaders are not created in the migration script, but instead by running the queries contained in sql/views/priority.sql and sql/views/leaders.sql. This is automatically handled in the python manage.py deploy and python manage.py deploy_local commands.
This may cause confusion if you create tables using db.create_all() from the shell instead of using python manage.py deploy (which is the best approach).
From the python manage.py shell these queries can be executed with:
from app.queryloader import *
query = query_loader('sql/views/priority.sql')
db.session.execute(query)
db.session.commit()
Note that Raw.generate_fake() will use real GOV.UK URLs from the govukurls.txt file.
These entries are created by the Raw.get_urls() method, which queries the gov.uk/random page.
Results are stored in the govukurls.txt file and can be appended to by running the Raw.get_urls() method, which takes the number of new pages to add as its first argument.
Note that this process can be quite slow, as a 5 second gap is required between each query in order to return a unique URL.
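The collection loop described above can be sketched as follows. This is not the repository's Raw.get_urls() implementation: `collect_urls` and `fetch_random_url` are hypothetical names, the full address https://www.gov.uk/random is an assumption, and the sketch assumes that page redirects to a random page so that the final URL can be read from the response. The `fetch` and `sleep` parameters are injectable so the loop can be exercised without network access or real delays.

```python
import time
import urllib.request

RANDOM_PAGE = "https://www.gov.uk/random"  # assumed full address


def fetch_random_url():
    """Follow the redirect from the random page and return the final URL."""
    with urllib.request.urlopen(RANDOM_PAGE) as response:
        return response.geturl()


def collect_urls(count, existing=(), fetch=fetch_random_url,
                 sleep=time.sleep, gap=5, max_attempts=100):
    """Collect up to `count` URLs not already in `existing`.

    Pauses `gap` seconds between requests and gives up after `max_attempts`.
    """
    seen = set(existing)
    new_urls = []
    for _ in range(max_attempts):
        if len(new_urls) >= count:
            break
        url = fetch()
        if url not in seen:
            seen.add(url)
            new_urls.append(url)
        if len(new_urls) < count:
            sleep(gap)  # wait before asking for another random page
    return new_urls
```

The returned list could then be appended to govukurls.txt, one URL per line.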
The error AttributeError: 'NoneType' object has no attribute 'drivername' indicates that the DEV_DATABASE_URL environment variable has not been set.