classifyintentsapp's Introduction

Classify intents survey web app

This repository contains a Flask app designed to improve the process of classifying surveys received in the GOV.UK intents survey. The application is hosted on GOV.UK PaaS.

A blog post about the GOV.UK intents survey is available on gov.uk; the related code is available as a Python package and supporting scripts.

The underlying framework of the app is based heavily on the microblogging app by Miguel Grinberg, which features in the O'Reilly book Flask Web Development.

Getting started

Deploying locally

git clone git@github.com:alphagov/classifyintentsapp.git
cd classifyintentsapp

Configure the app by creating a .env file in the classifyintentsapp directory:

DATABASE_URL=postgres://USER@localhost:5432/classifyintentsapp
DEV_DATABASE_URL=postgres://USER@localhost:5432/classifyintentsapp-dev
TEST_DATABASE_URL=postgres://USER@localhost:5432/classifyintentsapp-test
FLASKY_ADMIN=[email protected]
SECRET_KEY=key-to-prevent_csrf
FLASK_CONFIG=development
NOTIFY_API_KEY=govuk-notify-api-key

and edit:

  • USER to your local machine's login name
  • FLASKY_ADMIN to be your email address (for exception emails and signifying the admin account)
  • NOTIFY_API_KEY to be your notify API key
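
As a rough illustration of the KEY=VALUE format used in .env, the sketch below parses such a file into a dictionary. The `parse_env` helper is hypothetical and is not taken from the app, whose own loading code may behave differently.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


example = """
# local development settings
DEV_DATABASE_URL=postgres://USER@localhost:5432/classifyintentsapp-dev
FLASK_CONFIG=development
"""
settings = parse_env(example)
print(settings["FLASK_CONFIG"])
```

Splitting on the first `=` only matters for values such as database URLs with query parameters.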
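
Create a virtual environment (mkvirtualenv is provided by virtualenvwrapper) and install the dependencies:

mkvirtualenv classifyintentsapp
pip install -r requirements.txt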

Install PostgreSQL. On macOS it may be convenient to use Postgres.app. Now create the databases:

createdb classifyintentsapp-dev
createdb classifyintentsapp-test
createdb classifyintentsapp

Set up the classifyintentsapp-dev database and add dummy data:

python manage.py deploy_local

Create an admin account:

python manage.py shell

Then:

from app.models import User, Role
admin_id = Role.query.filter(Role.name=='Administrator').with_entities(Role.id).scalar()
u = User(username='admin', email='[email protected]', password='pass', role=Role.query.get(admin_id), confirmed=True)
db.session.add(u)
db.session.commit()

Start the app:

python manage.py runserver

Open http://127.0.0.1:5000/ in a browser and log in as [email protected] with the password pass.

Deploying the app onto PaaS

To deploy manually on PaaS, navigate to the root of the project and run the command below. Note that the .cfignore file mirrors the .gitignore file, so any files you wish to exclude from being pushed onto the PaaS instance should be added to the .gitignore.

cf push

Environment variables for the instance should be set in manifest.yml. This file should be in the following format:

---
applications:
- name: classifyapp
  env:
    DATABASE_URL: postgres://username:password@host:port/database
    DEV_DATABASE_URL: postgres://username:password@host:port/database
    TEST_DATABASE_URL: postgres://username:password@host:port/database
    FLASKY_ADMIN: [email protected]
    SECRET_KEY: key-to-prevent_csrf
    FLASK_CONFIG: production
    NOTIFY_API_KEY: govuk-notify-api-key
---

SECRET_KEY should be a random string - create using:

python -c 'import random, string; print("".join([random.SystemRandom().choice("{}{}".format(string.ascii_letters, string.digits)) for i in range(50)]))'
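
An equivalent approach, sketched here as an alternative rather than the project's own method, is the standard-library `secrets` module (Python 3.6+), which also draws from the operating system's secure random source. The function name `make_secret_key` is illustrative.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def make_secret_key(length=50):
    """Return a random alphanumeric string of the requested length."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


print(make_secret_key())
```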

Setting up the database

When deploying the application for the first time you must log into the instance running the application and deploy the application manually. To access the server run:

cf ssh classifyapp

You will then need to activate the local environment:

export DEPS_DIR=/home/vcap/deps
for f in /home/vcap/app/.profile.d/*.sh; do source $f; done

then:

cd app/ # navigate to the project root
python manage.py db upgrade

If you wish to populate the database with dummy data, you can also run:

python manage.py deploy

See below for more details on generating dummy data. A local server can then be deployed with:

python manage.py runserver

and accessed at http://127.0.0.1:5000.

Getting admin access to the application

Ensure that you have specified your email address in the FLASKY_ADMIN environment variable, and then register with the application using the registration page. You will automatically be granted administrator rights to the web application.

If you are running the server without access to Notify, you will need to create a user manually. Open a shell:

python manage.py shell

Then:

from app.models import User, Role
admin_id = Role.query.filter(Role.name=='Administrator').with_entities(Role.id).scalar()
u = User(username='admin', email='[email protected]', password='pass', role=Role.query.get(admin_id), confirmed=True)
db.session.add(u)
db.session.commit()

Generating dummy data

Dummy data is generated as part of the python manage.py deploy_local command. The same methods can be run independently with python manage.py populate, or by opening an app-specific shell with python manage.py shell and executing:

Role.insert_roles()
Raw.generate_fake()
Codes.generate_fake()
ProjectCodes.generate_fake()
User.generate_fake()
Classified.generate_fake()

Each method accepts as its first argument the number of records to create. Classified.generate_fake() also accepts a second argument, which specifies the number of random users over which the Classified records will be spread. Note that it is possible to 'run out' of eligible surveys to classify using this method, in which case more fake surveys should be generated with Raw.generate_fake().

Connecting to the database on GOV.UK PaaS

When hosting databases on GOV.UK PaaS, it is not possible to make a direct connection between your local machine and the remote server. This must be handled using an SSH tunnel. More information is available in the GOV.UK PaaS documentation.

To see the details of the postgres database run:

cf env APP_NAME

which will return JSON containing the server configuration.

To create an SSH tunnel via the instance running the web application run:

cf ssh classifyapp -L 6666:HOST:PORT

In a new terminal window then run:

psql postgres://USERNAME:PASSWORD@localhost:6666/DATABASE_NAME

substituting the database details.
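
The substitution can be scripted. The sketch below assumes you have copied the VCAP_SERVICES JSON from the cf env output, and that the database credentials sit under `postgres[0].credentials` with the keys `host`, `port`, `username`, `password` and `name`. That layout is an assumption and should be checked against your own output; the function names are illustrative.

```python
import json
from urllib.parse import quote


def _credentials(vcap_services_json):
    # Assumed layout: {"postgres": [{"credentials": {...}}]}
    return json.loads(vcap_services_json)["postgres"][0]["credentials"]


def tunnel_command(vcap_services_json, app_name="classifyapp", local_port=6666):
    """Build the cf ssh command that opens the tunnel."""
    creds = _credentials(vcap_services_json)
    return "cf ssh {} -L {}:{}:{}".format(
        app_name, local_port, creds["host"], creds["port"]
    )


def tunnel_url(vcap_services_json, local_port=6666):
    """Build the psql URL pointing at the local end of the tunnel."""
    creds = _credentials(vcap_services_json)
    return "postgres://{}:{}@localhost:{}/{}".format(
        quote(creds["username"], safe=""),  # percent-encode reserved characters
        quote(creds["password"], safe=""),
        local_port,
        creds["name"],
    )
```

Percent-encoding the username and password keeps the URL valid when the generated password contains characters such as `@` or `/`.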

How new surveys are selected (the priority view)

The order in which surveys are presented to users is controlled by the prioritisation view. Any view could be created in its place with a new set of criteria as required, but at present the prioritisation works on the basis that at least half of all the people coding a survey need to agree before a code can be set. The prioritisation rules are set out below:

Priority | Conditions
---------|-----------
1 | Any survey for which there is not yet a majority (>=50%) of users assigning a single code, and where <=5 users have coded the survey.
2 | New surveys that have not been coded.
3 | A survey for which a majority (>=50%) has been found, but <5 people have coded the survey.
6 | The survey has been automatically classified by algorithm.
7 | Survey is recalcitrant: >5 people have coded the survey and there is still no majority.
8 | Survey contains Personally Identifiable Information (has been tagged as such one or more times).
9 | There is a majority, and more than 5 people have coded the survey.

Surveys with priority >=6 are removed from circulation (in the case of 7 and 8: pending further action).

Within the priority codes, surveys are ordered by descending date order, so that the most recent survey will always come up first.

Note that respondent_id is not unique in the priority view. Where there is no discernible majority, i.e. two or more codes each hold less than half of the votes, each of these entries will appear in the priority view.
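
The rules above can be sketched in Python. This is an illustration, not the SQL in sql/views/priority.sql. Two points are assumptions: PII tagging and automatic classification are checked first, and a survey with exactly 5 coders and a majority is given priority 9, because the rules only cover "<5" and "more than 5". The function name and arguments are hypothetical.

```python
from collections import Counter


def priority(code_ids, pii_tags=0, auto_classified=False):
    """Return the priority of one survey.

    code_ids holds one code per user who has coded the survey.
    """
    # Assumption: these two conditions take precedence over vote counts.
    if pii_tags >= 1:
        return 8
    if auto_classified:
        return 6
    if not code_ids:
        return 2

    coders = len(code_ids)
    top_count = Counter(code_ids).most_common(1)[0][1]
    has_majority = top_count / coders >= 0.5

    if not has_majority:
        return 1 if coders <= 5 else 7
    # Assumption: exactly 5 coders with a majority counts as priority 9.
    return 3 if coders < 5 else 9


print(priority([1, 1, 2]))
```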

Is it tested?

You bet. Tests are in the tests/ folder. Run python manage.py test to execute them all (this is required for database setup and teardown), or run an individual test with, for example, python -m unittest tests/test_lookup.py.

Tests must be run on a postgres database, so the TEST_DATABASE_URL environment variable must be set in .env.

To complete the tests that use selenium, you will need to download chromedriver and add it to your path; otherwise these tests will not run, and will not be reported as failures.
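
One way such a guard might look is sketched below. It is not the repository's actual test code: the test case is a placeholder that only demonstrates how a missing chromedriver leads to a skip rather than a failure.

```python
import shutil
import unittest

# True when a chromedriver executable can be found on PATH.
HAS_CHROMEDRIVER = shutil.which("chromedriver") is not None


@unittest.skipUnless(HAS_CHROMEDRIVER, "chromedriver not found on PATH")
class BrowserTestCase(unittest.TestCase):
    def test_placeholder(self):
        # A real test would start a browser here.
        self.assertTrue(HAS_CHROMEDRIVER)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(BrowserTestCase)
result = unittest.TextTestRunner(verbosity=2).run(suite)
print("skipped:", len(result.skipped))
```

A skipped test still leaves the run successful, which is why a missing chromedriver is easy to overlook.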

Gotchas

Manually creating views

Note that the views priority, leaders, daily_leaders, and weekly_leaders are not created by the migration script. They are created by running the queries in sql/views/priority.sql and sql/views/leaders.sql, which the python manage.py deploy and python manage.py deploy_local commands do automatically. This may cause confusion if you create tables using db.create_all() from the shell instead of using python manage.py deploy (which is the best approach). From the python manage.py shell, these queries can be executed with:

from app.queryloader import *
query = query_loader('sql/views/priority.sql')
db.session.execute(query)
db.session.commit()

Note that Raw.generate_fake() uses real GOV.UK URLs from govukurls.txt. These entries are created by the Raw.get_urls() method, which queries the gov.uk/random page. Results are stored in govukurls.txt, and more can be appended by calling Raw.get_urls() with the number of new pages to add as the first argument. This process can be quite slow, as a 5 second gap is required between queries in order to return a unique URL.

Missing DEV_DATABASE_URL environment variable

The error AttributeError: 'NoneType' object has no attribute 'drivername' indicates that the DEV_DATABASE_URL environment variable has not been set.
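
A clearer failure can be produced by checking for the variable before it is used. The helper below is a hypothetical sketch and is not part of the app's configuration code.

```python
import os


def require_env(name, environ=os.environ):
    """Return an environment variable's value, or fail with a clear message."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(
            "{} is not set; add it to .env or export it".format(name)
        )
    return value


# A plain dict stands in for the real environment in this example.
fake_environ = {
    "DEV_DATABASE_URL": "postgres://USER@localhost:5432/classifyintentsapp-dev"
}
print(require_env("DEV_DATABASE_URL", fake_environ))
```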

classifyintentsapp's People

Contributors

andreagrandi, ivyleavedtoadflax, miguelgrinberg

classifyintentsapp's Issues

Move prioritisation view in Python

Running this query in SQL is slow. It would probably be faster in Python.

This is a start:

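# Intended to be run from `python manage.py shell`, where Raw is available.

def extract_code_id(classifications):
    """Return the code_id of each classification attached to a survey."""
    return [c.code_id for c in classifications]


def max_vote(code_ids):
    """Return the most frequently assigned code, or None if there are no votes."""
    if code_ids:
        return max(set(code_ids), key=code_ids.count)
    return None


surveys = Raw.query.all()
codes_per_survey = [extract_code_id(s.classified) for s in surveys]
# NOTE: ratio() is not defined anywhere in this issue; it still needs to be written.
ratios = [ratio(codes) for codes in codes_per_survey]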
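
The issue does not define the ratio step. One possible reading is the share of votes held by the most common code. The helper below is a hypothetical sketch under that assumption, with plain lists standing in for the per-survey lists of code ids.

```python
from collections import Counter


def top_code_and_share(code_ids):
    """Return (most common code, its share of all votes), or (None, 0.0)."""
    if not code_ids:
        return None, 0.0
    code, count = Counter(code_ids).most_common(1)[0]
    return code, count / len(code_ids)


# Stand-in data: one list of code ids per survey.
votes_per_survey = [[4, 4, 1], [2, 3], []]
shares = [top_code_and_share(votes) for votes in votes_per_survey]
print(shares)
```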

Add random element to survey selection

Surveys coded as 4 cause a particular problem, and are coded many times:

      id|start_date|max codes|max/total|total codes|coder ids|pii?|priority
      ----------------------
      56779845 | 2017-04-30 23:37:00 |  12 | 0.666667 |    18 | {2,6,10,16,19,25,28,29,31,35,37,38,40,41,46,47,48,49} |   0 |        4
      56775873 | 2017-04-30 20:45:00 |  16 | 0.888889 |    18 | {2,6,10,16,19,25,28,29,31,35,37,38,40,41,46,47,48,49} |   0 |        4
      56763526 | 2017-04-30 13:38:00 |   9 | 0.529412 |    17 | {2,6,10,16,19,25,28,29,31,35,37,38,41,46,47,48,49}    |   0 |        4
      56757743 | 2017-04-30 10:04:00 |  10 | 0.588235 |    17 | {2,6,10,16,19,25,28,29,31,35,37,38,41,46,47,48,49}    |   0 |        4
      56733177 | 2017-04-29 14:59:00 |   7 | 0.411765 |    17 | {2,6,10,16,19,25,28,29,31,35,37,38,41,46,47,48,49}    |   0 |        4
      56662562 | 2017-04-28 16:07:00 |   9 | 0.529412 |    17 | {2,6,10,16,19,25,28,29,31,35,37,38,41,46,47,48,49}    |   0 |        4
      56654589 | 2017-04-28 15:01:00 |   9 |   0.5625 |    16 | {2,6,10,16,19,25,28,29,31,35,38,41,46,47,48,49}       |   0 |        4
      56649658 | 2017-04-28 13:59:00 |   8 | 0.533333 |    15 | {2,6,10,16,19,25,28,31,35,38,41,46,47,48,49}          |   0 |        4
      56637715 | 2017-04-28 11:31:00 |   7 | 0.466667 |    15 | {2,6,10,16,19,25,28,31,35,38,41,46,47,48,49}          |   0 |        4
      56627471 | 2017-04-28 09:14:00 |   8 | 0.533333 |    15 | {2,6,10,16,19,25,28,31,35,38,41,46,47,48,49}          |   0 |        4
      56599601 | 2017-04-27 20:53:00 |   7 | 0.466667 |    15 | {2,6,10,16,19,25,28,31,35,38,41,46,47,48,49}          |   0 |        4
      56599066 | 2017-04-27 20:49:00 |  11 | 0.733333 |    15 | {2,6,10,16,19,25,28,31,35,38,41,46,47,48,49}          |   0 |        4
      56573967 | 2017-04-27 14:47:00 |   8 | 0.533333 |    15 | {2,6,10,16,19,25,28,31,35,38,41,46,47,48,49}          |   0 |        4
      56573057 | 2017-04-27 14:36:00 |   7 | 0.466667 |    15 | {2,6,10,16,19,25,28,31,35,38,41,46,47,48,49}          |   0 |        4
      56570616 | 2017-04-27 14:06:00 |  12 |      0.8 |    15 | {2,6,10,16,19,25,28,31,35,38,41,46,47,48,49}          |   0 |        4
      56544963 | 2017-04-27 07:47:00 |   8 | 0.533333 |    15 | {2,6,10,16,19,25,28,31,35,38,41,46,47,48,49}          |   0 |        4
      56541907 | 2017-04-27 05:27:00 |  12 | 0.857143 |    14 | {2,10,16,19,25,28,31,35,38,41,46,47,48,49}            |   0 |        4
      56513016 | 2017-04-26 15:42:00 |   8 | 0.571429 |    14 | {2,10,16,19,25,28,31,35,38,41,46,47,48,49}            |   0 |        4
      56511412 | 2017-04-26 15:19:00 |   8 | 0.571429 |    14 | {2,10,16,19,25,28,31,35,38,41,46,47,48,49}            |   0 |        4
      56495293 | 2017-04-26 11:27:00 |   9 | 0.642857 |    14 | {2,10,16,19,25,28,31,35,38,41,46,47,48,49}            |   0 |        4
      56488253 | 2017-04-26 09:53:00 |   9 | 0.642857 |    14 | {2,10,16,19,25,28,31,35,38,41,46,47,48,49}            |   0 |        4
      56485502 | 2017-04-26 09:12:00 |   8 | 0.571429 |    14 | {2,10,16,19,25,28,31,35,38,41,46,47,48,49}            |   0 |        4
      56478805 | 2017-04-26 06:46:00 |   9 |     0.75 |    12 | {2,10,16,19,25,28,31,38,41,46,48,49}                  |   0 |        4

Set up user permissions

Currently all authenticated users have access to classification. Set this up to allow two stages: moderators and users.

Some surveys are not getting classified

On occasion a survey pops up more than once (though perhaps not more than twice).

On searching through the raw table there is no duplicate, and no record of the first supposed classification in the classification table.

Assumption is that the first classification was simply not recorded.

Group radio buttons

Currently users would need to zoom out to get all the buttons on one page.
Group them in several columns?

Is there some way that we can cluster them together in a meaningful way?

Create a prioritisation view for surveys

This might look something like:

  • User has not classified before
  • A survey that has been classified n <= 3 times in descending order by number of classifications.
  • A survey that has no classifications at all.
  • A survey that has been classified 3 < n < 10 times.
  • n% of ok classifications from the machine learning model

Add existing data to the database

Add user names and emails from existing data. Passwords can be set to any random string; users can then set their own by resetting the password.

Gravatar hashes can be generated with:

import hashlib
hashlib.md5('[email protected]'.encode('utf-8')).hexdigest()
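
Gravatar expects the address to be trimmed and lowercased before hashing, so a small wrapper avoids mismatched hashes when importing existing data. The function name and the example.com addresses below are illustrative.

```python
import hashlib


def gravatar_hash(email):
    """Return the MD5 hex digest of the trimmed, lowercased email address."""
    normalised = email.strip().lower()
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


# Both spellings of the same address produce the same hash.
print(gravatar_hash("  User@Example.com ") == gravatar_hash("user@example.com"))
```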

None codes are not being accepted with real data

<ul id="code">
<li>
<input id="code-0" name="code" type="radio" value="0"> 
<label for="code-0">none</label></li>
<li><input id="code-1" name="code" type="radio" value="1"> 
<label for="code-1">govuk-specific</label>
</li>
<li><input id="code-2" name="code" type="radio" value="2"> 
<label for="code-2">contact-government</label>
</li>
...
</ul>

Update for WTForms 3.0 deprecations

classifyintentsapp/app/main/forms.py:9: DeprecationWarning: Required is going away in WTForms 3.0, use DataRequired
  name = StringField('What is your name?', validators=[Required()])
classifyintentsapp/app/main/forms.py:21: DeprecationWarning: Required is going away in WTForms 3.0, use DataRequired
  email = StringField('Email', validators=[Required(), Length(1, 64),
classifyintentsapp/app/main/forms.py:24: DeprecationWarning: Required is going away in WTForms 3.0, use DataRequired
  Required(), Length(1, 64), Regexp('^[A-Za-z][A-Za-z0-9_.]*$', 0,
classifyintentsapp/app/main/forms.py:51: DeprecationWarning: Required is going away in WTForms 3.0, use DataRequired
  code = RadioField('code_radio', coerce=int, validators=[Required()])
classifyintentsapp/app/main/forms.py:57: DeprecationWarning: Required is going away in WTForms 3.0, use DataRequired
  project_code = RadioField('project_code_radio', coerce=int, default='1', validators=[Required()])
classifyintentsapp/app/auth/forms.py:9: DeprecationWarning: Required is going away in WTForms 3.0, use DataRequired
  email = StringField('Email', validators=[Required(), Length(1, 64),
/classifyintentsapp/app/auth/forms.py:11: DeprecationWarning: Required is going away in WTForms 3.0, use DataRequired
  password = PasswordField('Password', validators=[Required()])
classifyintentsapp/app/auth/forms.py:21: DeprecationWarning: Required is going away in WTForms 3.0, use DataRequired
  Required(),
classifyintentsapp/app/auth/forms.py:29: DeprecationWarning: Required is going away in WTForms 3.0, use DataRequired
  Required(), Length(1, 64), Regexp('^[A-Za-z][A-Za-z0-9_.\ ]*$', 0,
classifyintentsapp/app/auth/forms.py:33: DeprecationWarning: Required is going away in WTForms 3.0, use DataRequired
  Required(), EqualTo('password2', message='Passwords must match.')])
classifyintentsapp/app/auth/forms.py:34: DeprecationWarning: Required is going away in WTForms 3.0, use DataRequired
  password2 = PasswordField('Confirm password', validators=[Required()])
classifyintentsapp/app/auth/forms.py:47: DeprecationWarning: Required is going away in WTForms 3.0, use DataRequired
  old_password = PasswordField('Old password', validators=[Required()])
classifyintentsapp/app/auth/forms.py:49: DeprecationWarning: Required is going away in WTForms 3.0, use DataRequired
  Required(), EqualTo('password2', message='Passwords must match')])
classifyintentsapp/app/auth/forms.py:50: DeprecationWarning: Required is going away in WTForms 3.0, use DataRequired
  password2 = PasswordField('Confirm new password', validators=[Required()])
classifyintentsapp/app/auth/forms.py:55: DeprecationWarning: Required is going away in WTForms 3.0, use DataRequired
  email = StringField('Email', validators=[Required(), Length(1, 64),
classifyintentsapp/app/auth/forms.py:61: DeprecationWarning: Required is going away in WTForms 3.0, use DataRequired
  email = StringField('Email', validators=[Required(), Length(1, 64),
classifyintentsapp/app/auth/forms.py:64: DeprecationWarning: Required is going away in WTForms 3.0, use DataRequired
  Required(), EqualTo('password2', message='Passwords must match')])
classifyintentsapp/app/auth/forms.py:65: DeprecationWarning: Required is going away in WTForms 3.0, use DataRequired
  password2 = PasswordField('Confirm password', validators=[Required()])
classifyintentsapp/app/auth/forms.py:74: DeprecationWarning: Required is going away in WTForms 3.0, use DataRequired
  email = StringField('New Email', validators=[Required(), Length(1, 64),
classifyintentsapp/app/auth/forms.py:76: DeprecationWarning: Required is going away in WTForms 3.0, use DataRequired
  password = PasswordField('Password', validators=[Required()])
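
Since every warning asks for the same substitution, the rename could be applied mechanically. The sketch below is a hypothetical helper that rewrites the bare name `Required` to `DataRequired` in a source string. It is purely textual, so it would also change the word inside strings or comments, and its output should be reviewed before committing.

```python
import re

# Matches the bare name Required, but not DataRequired or InputRequired.
_REQUIRED = re.compile(r"\bRequired\b")


def use_data_required(source):
    """Replace the deprecated validator name throughout a source string."""
    return _REQUIRED.sub("DataRequired", source)


line = "password = PasswordField('Password', validators=[Required()])"
print(use_data_required(line))
```

The word boundary keeps the substitution idempotent, so running it twice over the same file leaves `DataRequired` unchanged.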

URL lookup on gov.uk content api

Process should look like:

  • Raw.full_url is cleaned with the clean_url() function based on a set of rules.
  • Cleaned urls are looked up using the content API
  • Duplicates (based on full_url, sections, orgs) are dropped

A date of lookup is included to be used for linking back to the full_url with time relevance.
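
The cleaning and de-duplication steps might be sketched as follows. The issue does not state the cleaning rules, so the ones used here (drop the query string, fragment and trailing slash) are assumptions, as are the dictionary-based records; the content API lookup itself is left out.

```python
from urllib.parse import urlsplit, urlunsplit


def clean_url(full_url):
    """Assumed rules: drop the query string, fragment and trailing slash."""
    parts = urlsplit(full_url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def drop_duplicates(records):
    """Keep the first record for each (full_url, sections, orgs) combination."""
    seen = set()
    unique = []
    for record in records:
        key = (record["full_url"], record["sections"], record["orgs"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


print(clean_url("https://www.gov.uk/example-page/?utm=1#top"))
```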

Stop collecting `none` project_codes

This really is an absence, whereas none for a code is an assertion that no other code can be applied.

When none is applied, it would be better to simply insert null. The ability to select null is still important, however, as without it a user would not be able to fix mistakes.
