datamade / bga-payroll
How much do your public officials make?
we have some questions about what to do with them in the import. the links could lead to a search page with employees within the selected range.
we will soon be managing multiple years of data. sometimes, we will want to be able to filter that data by year. the first year data appears is accessible through `vintage__standardized_files__reporting_year` on each object. a record is created for every incoming `Salary`, every year, such that we can intuit when related `Job` and `Person` objects come and go, based on objects related to the `Salary`. that means filtering should be done at the `Salary` level, and that means our orm and sql queries need to be refactored to do this filtering by default, in a way that is also configurable via user input.
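a minimal sketch of what year-aware filtering at the `Salary` level could look like, assuming django models along the lines described above; the module path and helper function are illustrative, not the project's actual code:

```python
# hypothetical helper: filter salaries by the year their data first
# appeared, defaulting to all years; the lookup path mirrors the
# vintage__standardized_files__reporting_year lookup named above
from payroll.models import Salary  # assumed module path

def salaries_for_year(year=None):
    queryset = Salary.objects.all()
    if year is not None:
        # filtering happens at the Salary level, per the note above
        queryset = queryset.filter(
            vintage__standardized_files__reporting_year=year
        )
    return queryset
```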
let's talk a bit about this irl.
i don't have any further thoughts on this, at this time.
the `flush` / `match_or_create` routines provide us an opportunity to intercept records that error out for some reason when we try to insert them into the database. let's make a queue for those things so the user can review what went wrong and act accordingly.

related to #55.
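for illustration, a rough sketch of parking failed inserts on a queue for later review; the redis queue name, record shape, and `insert_record` helper are assumptions, not the project's actual code:

```python
import json
import redis

r = redis.Redis()

def insert_record(record):
    # stand-in for the real database insert performed during import
    raise NotImplementedError

def match_or_create(record):
    try:
        insert_record(record)
    except Exception as e:
        # keep the record and the error together, so a reviewer can
        # see what went wrong and act accordingly
        r.rpush('import-errors', json.dumps({'record': record, 'error': str(e)}))
```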
total expenditure and other high-level stats would make employer pages richer
The existing charts rely too heavily on tooltips. Add some labels that don't require interacting with the chart.
once the app talks to us, wire up a delayed task to download & store files from google drive. more on that rig here.
we've defined a base task class with dynamic shared context for all of our delayed work in e52602f. this context is contingent on the standardized file we're operating on, e.g., each task expects a standardized file id.

however, the need for dynamic context introduces a challenge, because "the `__init__` constructor (of the `Task` class) will only be called once per process." this means we cannot use the `__init__` method to establish context, given a standardized file id. instead, we define a `setup` method that accepts this id and sets class attributes accordingly.

we run this method each time a task is issued via celery's `task_prerun` signal. this signal provides access to the pending task (`sender`), as well as its args and kwargs, with which we can run `setup` prior to executing any task code.

this has the effect of giving us access to those contextual attributes in the task, without having to call `setup` at the top of each one.
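to make that concrete, here's a minimal sketch of the `setup`-via-`task_prerun` approach described above; `SetupTask`, the broker url, and the example task are assumptions, not the project's actual code:

```python
import celery
from celery.signals import task_prerun

app = celery.Celery('tasks', broker='redis://localhost:6379/0')

class SetupTask(celery.Task):
    def setup(self, s_file_id):
        # establish shared context from the standardized file id,
        # since __init__ only runs once per process
        self.s_file_id = s_file_id

@task_prerun.connect
def run_setup(sender=None, args=None, kwargs=None, **extras):
    # `sender` is the pending task; hand it the id from its args or
    # kwargs so context is in place before any task code runs
    if isinstance(sender, SetupTask):
        s_file_id = (kwargs or {}).get('s_file_id') or (args[0] if args else None)
        sender.setup(s_file_id)

@app.task(base=SetupTask, bind=True)
def copy_to_database(self, s_file_id):
    # self.s_file_id is already set, without calling setup here
    return self.s_file_id
```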
however, it feels a little hacky.
when a task method is bound to a base task class, the code in the bound task is injected into that class as the `run` method. because the task code is injected as the `run` method of the base class itself, it's not possible to define a common `run` method on the base class and extend it via `super()` in each task: running `super(BaseClass, self).run()` calls the `run` method of celery's `Task` class (the base class's parent), which raises a `NotImplementedError`.
source:

```python
def run(self, *args, **kwargs):
    """The body of the task executed by workers."""
    raise NotImplementedError('Tasks must define the run method.')
```
exception:

```
tp = <class 'celery.backends.base.NotImplementedError'>
value = NotImplementedError('Tasks must define the run method.',), tb = None

    def reraise(tp, value, tb=None):
        """Reraise exception."""
        if value.__traceback__ is not tb:
            raise value.with_traceback(tb)
>       raise value
E   celery.backends.base.NotImplementedError: Tasks must define the run method.
```
is there something else we should hook into, to run common code prior to task execution? or are the conventions here just unconventional?
Right now, we link a person to a position through a Salary model. I think we might want to have a "Job" model that links those two. There are three reasons.

`start_date` is currently attached to the salary, but `start_date` is not a property of salary conceptually. `start_date` is a property of how long someone has held a position, i.e., a "job".
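For illustration, one possible shape for that, sketched with django; the field and model names are illustrative, not a final schema:

```python
from django.db import models

class Person(models.Model):
    name = models.CharField(max_length=255)

class Position(models.Model):
    title = models.CharField(max_length=255)

class Job(models.Model):
    # the job, not the salary, is what a start date describes
    person = models.ForeignKey(Person, on_delete=models.CASCADE)
    position = models.ForeignKey(Position, on_delete=models.CASCADE)
    start_date = models.DateField(null=True)

class Salary(models.Model):
    # pay then hangs off the job, one record per reporting year
    job = models.ForeignKey(Job, on_delete=models.CASCADE)
    amount = models.DecimalField(max_digits=12, decimal_places=2)
```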
our payroll models are now related to the standardized file they came from via their `vintage`. let's update the source urls in the front end to lead to those files!
after the first import (#49), there will be a canonical universe of data we need to squash new data into. wire this up, collapsing records only when we can be absolutely sure they belong together.
source link for each data element
It should accept parameters for employer / person / year...
Related to, but separate from, #4.
It looks like there are multiple records for one employee when they get a raise. For example, there are 3 unique records for Justin R Thiede in the 2017 data, each with a different "salary": $8.75, $9.25 and $9.50.
2 | 613 | WEST CHICAGO PARK DISTRICT | THIEDE | JUSTIN R | | AQUATICS | 8.75 | 3/19/13 | 2017
2 | 613 | WEST CHICAGO PARK DISTRICT | THIEDE | JUSTIN R | | AQUATICS | 9.25 | 3/19/13 | 2017
2 | 613 | WEST CHICAGO PARK DISTRICT | THIEDE | JUSTIN R | | AQUATICS | 9.5 | 3/19/13 | 2017
How should this be represented?
They'll look a little something like:
City of Chicago > Chicago Public Library > Brian A Bannon
when multiple users are reviewing, or when one user is reviewing but navigates away from the page, the queue can be exhausted without all records having been reviewed. checked-out records expire after five minutes. tell the user about the situation, and ask them to come back in a few minutes. at that point, either review will be complete, or the checked-out records will be available again.

the app isn't yet smart enough to advance to the next step without observing that the queue is empty. it doesn't make that observation until someone tries to access review. in the transition code, check whether there is anything to review in the queue; if there isn't, trigger the next step.
our choice for distributed review queues, saferedisqueue, has bugs.

perhaps most pressingly, the re-queueing mechanism "may work okay in a single-consumer setup", but doesn't work otherwise. this is a big strike against distributed work!

let's:

- there's a decent blueprint over in the stopgap `import_data` command. start there, but do better.
- celery & redis!
Query for summary statistics separately, and only get the people we need, rather than fetching all 100k (or whatever) of them at once, which takes too long.
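A sketch of what that could look like with django aggregates; the model relationships and page size here are assumptions:

```python
from django.db.models import Avg, Count, Sum

def employer_summary(employer):
    # let the database aggregate instead of materializing every person
    stats = employer.salary_set.aggregate(
        headcount=Count('id'),
        total_pay=Sum('amount'),
        average_pay=Avg('amount'),
    )
    # then fetch only the people we actually need to display
    top_paid = employer.salary_set.order_by('-amount')[:25]
    return stats, top_paid
```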
There are 240 of them.
i.e., links to bga stories about the chicago public library, on the chicago public library page
Do it good.
historically, bga has gathered prospective pay for a standard period (the calendar year). someday, they may gather actual pay. on that occasion, because employees start or leave jobs at all times of year, not just jan 1 or dec 31, it would be nice to keep track of bounding dates for that pay. (see #56 / #57.)
chicago parks / chicago are a good place to start
More tk.
celery has an api for inspecting what's being done / enqueued: http://docs.celeryproject.org/en/latest/reference/celery.app.control.html

figure out how to use that as an indicator of whether work is underway for a file (e.g., it must be a quick check; this approach is very heavy.)
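a rough sketch of such a check using the documented `inspect()` api; per the note above this is heavy, so it shouldn't run on every request, and the task argument convention is an assumption:

```python
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')

def work_underway(s_file_id):
    inspector = app.control.inspect()
    # active() maps each worker to the tasks it's currently executing
    for tasks in (inspector.active() or {}).values():
        for task in tasks:
            # assumes tasks take the standardized file id as their
            # first positional argument
            if s_file_id in (task.get('args') or []):
                return True
    return False
```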
this is mostly a to-do for me, in the morning
we should keep track of review decisions. let's do this with a review model. here's an example from large-lots.

Once this is done, protect admin views in `data_import`.
Users are asked to review unknown responding agencies and employers during the data import process. They can choose to link them to an existing entity, or add them as a new entity. If they choose to link to an existing entity, we should retain both names for the entity as aliases. This will allow us to link the employer using either representation in future years of data.
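For illustration, one possible shape for retaining aliases, sketched as a django model; `EmployerAlias` is an assumption, not the project's actual schema:

```python
from django.db import models

class Employer(models.Model):
    canonical_name = models.CharField(max_length=255)

class EmployerAlias(models.Model):
    employer = models.ForeignKey(Employer, on_delete=models.CASCADE,
                                 related_name='aliases')
    name = models.CharField(max_length=255, unique=True)

# a future year's import could then match an incoming name against
# aliases as well as canonical names:
# Employer.objects.filter(aliases__name=incoming_name)
```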