zooniverse / caesar

Backend automation and orchestration

Home Page: https://zooniverse.github.io/caesar

License: Apache License 2.0

Languages: Ruby 88.24%, HTML 10.48%, SCSS 0.71%, JavaScript 0.17%, Dockerfile 0.17%, Python 0.12%, Shell 0.10%
Topics: panoptes-platform, automation, aggregation, caesar

caesar's Introduction

README

Caesar is an evolution of the Nero codebase, made more generic. In essence, Caesar receives classifications from the event stream (a Lambda script sends them to Caesar's HTTP API).

Development

Prepare the Docker containers:

docker-compose build
docker-compose run --rm app bin/rails db:setup
docker-compose run --rm -e RAILS_ENV=test app bin/rails db:create

Run tests with:

docker-compose run --rm -e RAILS_ENV=test app bin/rspec

Or run them interactively / manually in a Docker shell:

docker-compose run --rm -e RAILS_ENV=test app bash
# from the bash prompt
bin/rspec

Start a local server with:

docker-compose up

To have it listen to the stream:

AWS_REGION=us-east-1 kinesis-tail zooniverse-staging | bin/stream_to_server

Or to override the configuration for a given workflow, create a local file in tmp/ (or anywhere else, but that directory is ignored by git) and run:

AWS_REGION=us-east-1 kinesis-tail zooniverse-staging | bin/override_workflow_configuration workflow_id tmp/path_to_nero_config.json | bin/stream_to_server

Kinesis / Lambda

Panoptes posts classifications into Kinesis. Caesar has a Lambda script that reads from Kinesis and then POSTs those classifications into Caesar's API. Docs on how to change that Lambda script are in the kinesis-to-http directory.

Mutation tests

RAILS_ENV=test bundle exec mutant -r ./config/environment --use rspec Reducers::ExternalReducer

caesar's People

Contributors

adammcmaster, amy-langley, amyrebecca, camallen, ckrawczyk, dependabot-preview[bot], dependabot-support, dependabot[bot], lcjohnso, mariamsaeedi, marten, nciemniak, ramanakumars, simensta, wgranger, yuenmichelle1, zwolf


caesar's Issues

Allow reducer to filter extracts by extractor_id?

If we're using a reducer like the stats reducer and there are several different extractors defined, the current behavior is that it merges all of the extracts and then calculates statistics, which is a perfectly sane default. But depending on how people define extractors, it may sometimes be useful to restrict which extractors' extracts are processed by that reducer, similar to how we can restrict ourselves to looking at the first N extracts for a given subject.
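A hypothetical sketch of what such a filter could look like in a reducer config, written as a Ruby hash so the assumption can be annotated (the extractor_keys filter is an assumption, not an existing option):

{
  reducers: {
    stats: {
      type: 'stats',
      filters: {
        extractor_keys: ['question_1'] # hypothetical: only this extractor's extracts get reduced
      }
    }
  }
}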

Switch stream processing to Sidekiq

There are some workers prepared, but right now the Kinesis controller just processes inline. We should switch this over to use the workers.
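A minimal sketch of that switch, assuming Sidekiq; the worker and pipeline names are assumptions, not the actual classes:

require 'sidekiq'

class StreamEventWorker
  include Sidekiq::Worker

  # Process the raw Kinesis payload off the request cycle.
  def perform(payload)
    ClassificationPipeline.new.process(payload) # hypothetical entry point
  end
end

# In the Kinesis controller, the inline processing would become:
#   StreamEventWorker.perform_async(payload)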

Support including subject metadata in CSV export

This is a requirement for Intro2Astro for some of the exports they want from Caesar. They would like the subject metadata included with the CSV export, with one column per metadata field.
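A minimal sketch of that flattening, assuming each subject exposes its metadata as a Hash (all names here are assumptions):

require 'csv'

def subjects_csv(subjects)
  keys = subjects.flat_map { |s| s.metadata.keys }.uniq.sort
  CSV.generate do |csv|
    csv << ['subject_id', *keys.map { |k| "metadata.#{k}" }] # one column per metadata field
    subjects.each do |subject|
      csv << [subject.id, *keys.map { |k| subject.metadata[k] }]
    end
  end
end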

extractor that applies jsonpath expression to classification

Given that we have a very composable design for our extract/reduce pipeline, I can imagine lots of times that we'd want to just pluck a single field out of the classification (for example, pulling out classroom groupings). JsonPath is a standard that specifies a query language for JSON blobs similar to what XPath provides for XML. It would be neat and relatively quick to write a JsonPath/pluck extractor to get the value of a single field, and ruby gems exist that apply JsonPath expressions to JSON blobs or ruby hashes.

https://github.com/joshbuddy/jsonpath
https://stackoverflow.com/questions/13422607/filter-hash-with-jsonpath
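A minimal sketch using the jsonpath gem linked above; the extractor class and its interface are assumptions, not Caesar's actual extractor API:

require 'jsonpath'

class PluckExtractor
  def initialize(expression)
    @path = JsonPath.new(expression) # e.g. '$.metadata.classroom'
  end

  # classification can be a Hash or a JSON string; keep the first match.
  def process(classification)
    { 'value' => @path.on(classification).first }
  end
end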

Caesar should be able to group extracts before reduction

There is some desire for caesar to be able to do reductions on a subject for less than an entire workflow's worth of classifications for a subject. It should be possible to introduce a grouping clause for reducers similar to the subrange clause they already have. This would result in multiple reductions being created in a single reduction pass, so we'd need to add a subgroup name column to the table and to the unique key.
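A minimal sketch of the grouping pass, assuming a hypothetical group_by clause that names a field inside each extract's data:

def grouped_reductions(extracts, group_field)
  extracts.group_by { |extract| extract.data[group_field] }.map do |subgroup, group|
    # one reduction per subgroup; subgroup would fill the new subgroup-name
    # column and join the unique key
    [subgroup, reduce(group)] # reduce() stands in for the configured reducer
  end
end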

Create extractors for every task/tool type

Here is a list of every task type on PFE that needs extractors:

Public

  • Question (single and multiple)
  • Text
  • Survey
    • Top level choice
    • Secondary questions
  • Drawing
    • Bezier
    • Circle
    • Column
    • Ellipse
    • Full Width Line
    • Full Height Line
    • Line
    • Point
    • Polygon
    • Rectangle
    • Triangle
    • Sub-tasks

Experimental

  • Combo
  • Crop
  • Drop down
  • Highlighter tool
  • Slider
  • Shortcut
  • Drawing
    • Grid
    • Freehand

Expose CSV export functionality via REST endpoint

Because the exports might take a little while to run, we are going to need a system like the one Panoptes uses. A controller sketch of the filtering follows the list below.

For extracts, you should be able to specify one or more of: user id, workflow id, subject id
For reductions, you should be able to specify one or more of: (someday) user id, workflow id, subject id, group value
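A minimal controller sketch of that filtering for extracts (names and params are assumptions):

def index
  scope = Extract.where(workflow_id: params[:workflow_id])
  scope = scope.where(user_id: params[:user_id]) if params[:user_id]
  scope = scope.where(subject_id: params[:subject_id]) if params[:subject_id]
  send_data to_csv(scope), filename: 'extracts.csv' # to_csv is hypothetical
end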

Reducer filter "repeated_classifications"

Right now we take all repeated classifications into account.

The default should probably be to only use the first classification made on a subject by a user.

Other options might be "use the last classification" or "use all" (what we have currently).
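A minimal sketch of the three options, with hypothetical setting names:

def filter_repeats(extracts, mode)
  case mode
  when 'keep_first' # the proposed default
    extracts.group_by { |e| [e.user_id, e.subject_id] }
            .values.map { |dups| dups.min_by(&:classification_at) }
  when 'keep_last'
    extracts.group_by { |e| [e.user_id, e.subject_id] }
            .values.map { |dups| dups.max_by(&:classification_at) }
  else # 'keep_all' -- current behavior
    extracts
  end
end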

Foreign key issue can occur if a kinesis post contains a lot of data

PG::ForeignKeyViolation: ERROR: insert or update on table "actions" violates foreign key constraint "fk_rails_697fdb9010"
DETAIL: Key (subject_id)=(6440741) is not present in table "subjects".

File "/app/app/models/effects/effect.rb" line 10 in prepare
File "/app/app/models/rules/rule.rb" line 13 in block in process
File "/app/app/models/rules/rule.rb" line 12 in each
File "/app/app/models/rules/rule.rb" line 12 in process
File "/app/app/models/rules/engine.rb" line 11 in block in process
File "/app/app/models/rules/engine.rb" line 11 in each
File "/app/app/models/rules/engine.rb" line 11 in process
File "/app/app/models/classification_pipeline.rb" line 46 in check_rules
File "/app/app/workers/check_rules_worker.rb" line 7 in perform

Better automatic reprocessing

Configs sometimes change. We want to make sure to recalculate things automatically so that we don't have to keep babysitting and deleting data from the database (a change-detection sketch follows this list):

  • Maybe we can detect when a new extractor is added, and only add extracts for that?
  • Maybe we can detect when a new reducer is added, and only add reductions for that?
  • Can we detect if a specific extractor's config changed, and recalculate that extractor?
  • Can we detect if a specific reducer's config changed, and recalculate that reducer?
  • What do we do with external extractors/reducers?
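A minimal change-detection sketch, assuming each extractor/reducer records a fingerprint of the config it was last run with (the last_fingerprint attribute is an assumption):

require 'json'
require 'digest'

def config_fingerprint(config)
  # digest of the serialized config, with top-level key order normalized
  Digest::SHA256.hexdigest(JSON.dump(config.sort.to_h))
end

def needs_reprocessing?(step)
  step.last_fingerprint != config_fingerprint(step.config)
end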

Extractors can also cause unique violations the way reducers did

PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "index_extracts_on_classification_id_and_extractor_id"
DETAIL: Key (classification_id, extractor_id)=(56443012, choice) already exists.
File "/app/app/models/classification_pipeline.rb" line 29 in block in extract
File "/app/app/models/classification_pipeline.rb" line 17 in each
File "/app/app/models/classification_pipeline.rb" line 17 in extract
File "/app/app/workers/extract_worker.rb" line 11 in perform
ActiveRecord::RecordNotUnique: PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "index_extracts_on_classification_id_and_extractor_id"
DETAIL: Key (classification_id, extractor_id)=(56443012, choice) already exists.
: INSERT INTO "extracts" ("classification_id", "classification_at", "extractor_id", "project_id", "workflow_id", "user_id", "subject_id", "data", "created_at", "updated_at") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10) RETURNING "id"
File "/app/app/models/classification_pipeline.rb" line 29 in block in extract
File "/app/app/models/classification_pipeline.rb" line 17 in each45
File "/app/app/models/classification_pipeline.rb" line 17 in extract
File "/app/app/workers/extract_worker.rb" line 11 in perform

Panoptes error when subject was already in set

The add_subject_to_set action will trigger an error in panoptes if the subject was already a member. This might be best resolved in Panoptes?

Panoptes::Client::ServerError: {"errors"=>[{"message"=>"PG::UniqueViolation: ERROR: duplicate key value violates unique constraint \"index_set_member_subjects_on_subject_id_and_subject_set_id\"\nDETAI
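Until Panoptes makes the call idempotent, Caesar could swallow just this error. A minimal sketch, where the client method name is an assumption:

def add_subject_to_set(client, subject_set_id, subject_id)
  client.add_subjects_to_subject_set(subject_set_id, [subject_id]) # hypothetical method name
rescue Panoptes::Client::ServerError => e
  raise unless e.message.include?('UniqueViolation') # already a member: treat as success
end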

Generalize the SurveyReducer

Rather than having configurable subranges, the reducer could just be responsible for a single specific range (with support for 0..-1) and a workflow would specify multiple instances of the reducer if it wants multiple ranges.

It also doesn't need to be tied specifically to the Survey, it could also reduce "flags" for instance, or anything else.
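A hypothetical config sketch of that shape, written as a Ruby hash (every key here is an assumption): two instances of one generic range reducer instead of a single reducer with configurable subranges.

{
  reducers: {
    survey_first_ten: { type: 'range_count', field: 'choice', range: [0, 9] },
    survey_all: { type: 'range_count', field: 'choice', range: [0, -1] } # 0..-1 covers everything
  }
}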

Pause workflows for a little while if they fail repeatedly

If a workflow fails enough times in a row or enough times within a small window, there's probably something wrong with the config and we should stop.

Marten adds:
Yeah, good idea. There might be gems for a pattern called a circuit breaker.
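Gems such as stoplight implement this pattern; a bare-bones sketch of the idea, where all names and thresholds are assumptions:

class WorkflowBreaker
  THRESHOLD = 5       # consecutive failures before pausing
  COOL_OFF = 10 * 60  # seconds to stay paused

  def initialize
    @failures = Hash.new(0)
    @paused_until = {}
  end

  def paused?(workflow_id)
    t = @paused_until[workflow_id]
    !t.nil? && Time.now < t
  end

  def record_failure(workflow_id)
    @failures[workflow_id] += 1
    return unless @failures[workflow_id] >= THRESHOLD
    @paused_until[workflow_id] = Time.now + COOL_OFF # pause the workflow
    @failures[workflow_id] = 0
  end

  def record_success(workflow_id)
    @failures[workflow_id] = 0 # any success resets the count
  end
end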

Add capability to ignore classifications before certain version

Might want to default to ignoring all but the latest workflow version? But there are plenty of cases where things didn't significantly change (example: Wildcam Darien has 45k classifications on an older version, where the changes were just a few added cross-links between species, and the "How many" question removed for "Human").
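A minimal sketch of the version check, assuming classifications carry a Panoptes-style "major.minor" workflow_version string:

def recent_enough?(classification, minimum_version)
  Gem::Version.new(classification.workflow_version) >= Gem::Version.new(minimum_version)
end

# e.g. skip extraction when !recent_enough?(classification, '46.0')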

REST API isn't versioned

This hasn't been a big deal so far but once we get more people using ExternalExtractors and ExternalReducers we're going to want the ability to occasionally make breaking changes without making consumers of the API recode their apps.

parameters to ReductionsController#index

When you ask for the reductions for a given workflow, you're required to specify workflow_id and reducer_id in the route, but the controller doesn't filter by reducer_id. Also, you're required to provide a subject_id as a param, even though it's not part of the route, because the controller filters on that.

It might make more sense to do /workflows/1/subjects/123/reductions with an optional reducer_id param.
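Sketched as Rails routes (this is the proposal from the issue, not the current routing):

Rails.application.routes.draw do
  resources :workflows, only: [] do
    resources :subjects, only: [] do
      resources :reductions, only: [:index] # ?reducer_id=... stays an optional filter
    end
  end
end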

Extracts controller finding extract to update

Currently it uses find_or_initialize_by with workflow_id, subject_id, and extractor_id, but that combination is only unique for reductions, not for extracts. Instead we probably want to use classification_id and extractor_id.
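Sketch of the proposed lookup, keyed on the pair the extracts unique index actually covers (mirroring the upsert sketch above):

extract = Extract.find_or_initialize_by(
  classification_id: params[:classification_id],
  extractor_id: params[:extractor_id]
)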

Add "webhooks" concept to nero_config

Marten Veldthuis [11:28 AM]: so they would have {extractors: {s: {type: 'blank'}}, reducers: {}, rules: [], webhooks: {"asdf": {url: "https://example.org"}}} as their nero config
Marten Veldthuis [11:30 AM]: but there could be a filter inside that {url: 'zxy', events: ['extract_changed']} if we really need to
Marten Veldthuis [11:32 AM]: i guess we could debounce it quite easily, and then do one call with multiple events (for all the changes)
Marten Veldthuis [11:34 AM]: so let's make sure the API call we're making out defines it as us sending them an array of events

Replaces #60
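A sketch of the outgoing call described above; the transcript only pins down that the body is an array of events, so the field names here are assumptions:

events = [
  { event: 'extract_changed', workflow_id: 1, subject_id: 123 },   # hypothetical fields
  { event: 'reduction_changed', workflow_id: 1, subject_id: 123 }
]
# one debounced POST per configured webhook url, with JSON.dump(events) as the body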

Move workflow configuration to Caesar

Putting Panoptes in charge of Caesar's config was, in retrospect, a bad decision. Caesar should be the principal owner of all data related to Caesar. That means doing less work on events in the stream, since we don't have to keep checking whether we have to update the workflow config.

The main issue with this was that the UI for Caesar needs some way of setting the configuration, and the existing HTTP Basic authentication as used by Kinesis wasn't going to cut it. But with #84 we now have working OAuth authentication in Caesar.

To resolve this issue we will need to check (by calling Panoptes) that the current_user has write access to the Panoptes project (i.e. is an owner or collaborator on that project). This will also help future tickets that can provide proper security on the rest of the API.

UniqueViolation can occur when storing reductions

PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "index_reductions_on_workflow_id_and_subject_id_and_reducer_id"
DETAIL: Key (workflow_id, subject_id, reducer_id)=(3559, 7316469, s) already exists.

File "/app/app/models/classification_pipeline.rb" line 39 in block in reduce
File "/app/app/models/classification_pipeline.rb" line 34 in each
File "/app/app/models/classification_pipeline.rb" line 34 in reduce
File "/app/app/workers/reduce_worker.rb" line 10 in perform

Allow extractors/reducers to be re-run for a workflow

For aggregation if a user wants to change anything (say a clustering parameter) there should be a nice way to re-run reducers for a workflow.

Also if caesar is set up after some images are retired (or the entire project is finished) it would be nice to be able to force re-run the extractors.

Asynchronous extract/reduction handling and the pipeline

If an extraction is done asynchronously with the ExternalExtractor, then either an empty hash or a special value should be returned to indicate that no further processing should be performed and the ExtractWorker should not queue up a ReduceWorker. Reduction will instead occur after an extract is posted to the ExtractsController by a remote service.

If a reduction is done asynchronously with the ExternalReducer, then either an empty hash or a special value should be returned to indicate that no further processing should be performed and the ReduceWorker should not queue up a CheckRulesWorker. Rule checking will instead occur after a reduction is posted to the ReductionsController by a remote service.
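A minimal sketch of that convention on the extract side (the sentinel value and the class internals are assumptions):

require 'sidekiq'

class ExtractWorker
  include Sidekiq::Worker

  DEFERRED = 'pending'.freeze # hypothetical special value for async external extractors

  def perform(classification_id)
    extracts = pipeline.extract(classification_id) # hypothetical pipeline call
    return if extracts.nil? || extracts == {} || extracts == DEFERRED
    ReduceWorker.perform_async(classification_id) # only queue reduction for real data
  end
end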
