zooniverse / caesar

Backend automation and orchestration

Home Page: https://zooniverse.github.io/caesar

License: Apache License 2.0

Languages: Ruby 88.24%, HTML 10.48%, SCSS 0.71%, JavaScript 0.17%, Dockerfile 0.17%, Python 0.12%, Shell 0.10%
Topics: panoptes-platform, automation, aggregation, caesar

caesar's Introduction

README

Caesar is an evolution of the Nero codebase, made more generic. In essence, Caesar receives classifications from the event stream (a Lambda script sends them to Caesar's HTTP API).

Development

Prepare the Docker containers:

docker-compose build
docker-compose run --rm app bin/rails db:setup
docker-compose run --rm -e RAILS_ENV=test app bin/rails db:create

Run tests with:

docker-compose run --rm -e RAILS_ENV=test app bin/rspec

Or run them interactively / manually in a Docker shell:

docker-compose run --rm -e RAILS_ENV=test app bash
# from the bash prompt
bin/rspec

Start a local server with:

docker-compose up

To have it listen to the stream:

AWS_REGION=us-east-1 kinesis-tail zooniverse-staging | bin/stream_to_server

Or to override the configuration for a given workflow, create a local file in tmp/ (or anywhere else, but that directory is ignored by git) and run:

AWS_REGION=us-east-1 kinesis-tail zooniverse-staging | bin/override_workflow_configuration workflow_id tmp/path_to_nero_config.json | bin/stream_to_server

Kinesis / Lambda

Panoptes posts classifications into Kinesis. Caesar has a Lambda script that reads from Kinesis and then POSTs those classifications into Caesar's API. Docs on how to change that Lambda script are in the kinesis-to-http directory.

Mutation tests

RAILS_ENV=test bundle exec mutant -r ./config/environment --use rspec Reducers::ExternalReducer

caesar's People

Contributors

adammcmaster, amy-langley, amyrebecca, camallen, ckrawczyk, dependabot-preview[bot], dependabot-support, dependabot[bot], lcjohnso, mariamsaeedi, marten, nciemniak, ramanakumars, simensta, wgranger, yuenmichelle1, zwolf


caesar's Issues

Allow reducer to filter extracts by extractor_id?

If we're using a reducer like the stats reducer and there are several different extractors defined, the current behavior is that it merges all of the extracts and then calculates statistics, which is a perfectly sane default. But depending on how people define extractors, it may sometimes be useful to restrict which extractors' extracts are processed by that reducer, similar to how we can restrict ourselves to looking at the first N extracts for a given subject.
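A hypothetical sketch of what such a filter could look like in a reducer config, written as a Ruby hash so the assumption can be annotated (the extractor_keys filter is an assumption, not an existing option):

{
  reducers: {
    stats: {
      type: 'stats',
      filters: {
        extractor_keys: ['question_1'] # hypothetical: only this extractor's extracts get reduced
      }
    }
  }
}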

Switch stream processing to Sidekiq

There are some workers prepared, but right now the Kinesis controller just processes inline. We should switch this over to use the workers.
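A minimal sketch of that switch, assuming Sidekiq; the worker and pipeline names are assumptions, not the actual classes:

require 'sidekiq'

class StreamEventWorker
  include Sidekiq::Worker

  # Process the raw Kinesis payload off the request cycle.
  def perform(payload)
    ClassificationPipeline.new.process(payload) # hypothetical entry point
  end
end

# In the Kinesis controller, the inline processing would become:
#   StreamEventWorker.perform_async(payload)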

Support including subject metadata in CSV export

This is a requirement for Intro2Astro for some of the exports they want from Caesar. They would like the subject metadata included with the CSV export, with one column per metadata field.
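A minimal sketch of that flattening, assuming each subject exposes its metadata as a Hash (all names here are assumptions):

require 'csv'

def subjects_csv(subjects)
  keys = subjects.flat_map { |s| s.metadata.keys }.uniq.sort
  CSV.generate do |csv|
    csv << ['subject_id', *keys.map { |k| "metadata.#{k}" }] # one column per metadata field
    subjects.each do |subject|
      csv << [subject.id, *keys.map { |k| subject.metadata[k] }]
    end
  end
end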

extractor that applies jsonpath expression to classification

Given that we have a very composable design for our extract/reduce pipeline, I can imagine lots of times that we'd want to just pluck a single field out of the classification (for example, pulling out classroom groupings). JsonPath is a standard that specifies a query language for JSON blobs similar to what XPath provides for XML. It would be neat and relatively quick to write a JsonPath/pluck extractor to get the value of a single field, and ruby gems exist that apply JsonPath expressions to JSON blobs or ruby hashes.

https://github.com/joshbuddy/jsonpath
https://stackoverflow.com/questions/13422607/filter-hash-with-jsonpath
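A minimal sketch using the jsonpath gem linked above; the extractor class and its interface are assumptions, not Caesar's actual extractor API:

require 'jsonpath'

class PluckExtractor
  def initialize(expression)
    @path = JsonPath.new(expression) # e.g. '$.metadata.classroom'
  end

  # classification can be a Hash or a JSON string; keep the first match.
  def process(classification)
    { 'value' => @path.on(classification).first }
  end
end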

Caesar should be able to group extracts before reduction

There is some desire for caesar to be able to do reductions on a subject for less than an entire workflow's worth of classifications for a subject. It should be possible to introduce a grouping clause for reducers similar to the subrange clause they already have. This would result in multiple reductions being created in a single reduction pass, so we'd need to add a subgroup name column to the table and to the unique key.
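A minimal sketch of the grouping pass, assuming a hypothetical group_by clause that names a field inside each extract's data:

def grouped_reductions(extracts, group_field)
  extracts.group_by { |extract| extract.data[group_field] }.map do |subgroup, group|
    # one reduction per subgroup; subgroup would fill the new subgroup-name
    # column and join the unique key
    [subgroup, reduce(group)] # reduce() stands in for the configured reducer
  end
end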

Create extractors for every task/tool type

Here is a list of every task type on PFE that needs extractors:

Public

  • Question (single and multiple)
  • Text
  • Survey
    • Top level choice
    • Secondary questions
  • Drawing
    • Bezier
    • Circle
    • Column
    • Ellipse
    • Full Width Line
    • Full Height Line
    • Line
    • Point
    • Polygon
    • Rectangle
    • Triangle
    • Sub-tasks

Experimental

  • Combo
  • Crop
  • Drop down
  • Highlighter tool
  • Slider
  • Shortcut
  • Drawing
    • Grid
    • Freehand

Expose CSV export functionality via REST endpoint

Because the exports might take a little while to run, we are going to need a system like the one Panoptes uses. A controller sketch of the filtering follows the list below.

For extracts, you should be able to specify one or more of: user id, workflow id, subject id
For reductions, you should be able to specify one or more of: (someday) user id, workflow id, subject id, group value
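A minimal controller sketch of that filtering for extracts (names and params are assumptions):

def index
  scope = Extract.where(workflow_id: params[:workflow_id])
  scope = scope.where(user_id: params[:user_id]) if params[:user_id]
  scope = scope.where(subject_id: params[:subject_id]) if params[:subject_id]
  send_data to_csv(scope), filename: 'extracts.csv' # to_csv is hypothetical
end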

Reducer filter "repeated_classifications"

Right now we take all repeated classifications into account.

The default should probably be to only use the first classification made on a subject by a user.

Other options might be "use the last classification" or "use all" (what we have currently).
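A minimal sketch of the three options, with hypothetical setting names:

def filter_repeats(extracts, mode)
  case mode
  when 'keep_first' # the proposed default
    extracts.group_by { |e| [e.user_id, e.subject_id] }
            .values.map { |dups| dups.min_by(&:classification_at) }
  when 'keep_last'
    extracts.group_by { |e| [e.user_id, e.subject_id] }
            .values.map { |dups| dups.max_by(&:classification_at) }
  else # 'keep_all' -- current behavior
    extracts
  end
end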

Foreign key issue can occur if a kinesis post contains a lot of data

PG::ForeignKeyViolation: ERROR: insert or update on table "actions" violates foreign key constraint "fk_rails_697fdb9010"
DETAIL: Key (subject_id)=(6440741) is not present in table "subjects".

File "/app/app/models/effects/effect.rb" line 10 in prepare
File "/app/app/models/rules/rule.rb" line 13 in block in process
File "/app/app/models/rules/rule.rb" line 12 in each
File "/app/app/models/rules/rule.rb" line 12 in process
File "/app/app/models/rules/engine.rb" line 11 in block in process
File "/app/app/models/rules/engine.rb" line 11 in each
File "/app/app/models/rules/engine.rb" line 11 in process
File "/app/app/models/classification_pipeline.rb" line 46 in check_rules
File "/app/app/workers/check_rules_worker.rb" line 7 in perform

Better automatic reprocessing

Configs sometimes change. We want to make sure to recalculate things automatically so that we don't have to keep babysitting and deleting data from the database (a change-detection sketch follows this list):

  • Maybe we can detect when a new extractor is added, and only add extracts for that?
  • Maybe we can detect when a new reducer is added, and only add reductions for that?
  • Can we detect if a specific extractor's config changed, and recalculate that extractor?
  • Can we detect if a specific reducer's config changed, and recalculate that reducer?
  • What do we do with external extractors/reducers?
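A minimal change-detection sketch, assuming each extractor/reducer records a fingerprint of the config it was last run with (the last_fingerprint attribute is an assumption):

require 'json'
require 'digest'

def config_fingerprint(config)
  # digest of the serialized config, with top-level key order normalized
  Digest::SHA256.hexdigest(JSON.dump(config.sort.to_h))
end

def needs_reprocessing?(step)
  step.last_fingerprint != config_fingerprint(step.config)
end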

Extractors can also cause unique violations the way reducers did

PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "index_extracts_on_classification_id_and_extractor_id"
DETAIL: Key (classification_id, extractor_id)=(56443012, choice) already exists.
File "/app/app/models/classification_pipeline.rb" line 29 in block in extract
File "/app/app/models/classification_pipeline.rb" line 17 in each
File "/app/app/models/classification_pipeline.rb" line 17 in extract
File "/app/app/workers/extract_worker.rb" line 11 in perform
ActiveRecord::RecordNotUnique: PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "index_extracts_on_classification_id_and_extractor_id"
DETAIL: Key (classification_id, extractor_id)=(56443012, choice) already exists.
: INSERT INTO "extracts" ("classification_id", "classification_at", "extractor_id", "project_id", "workflow_id", "user_id", "subject_id", "data", "created_at", "updated_at") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10) RETURNING "id"
File "/app/app/models/classification_pipeline.rb" line 29 in block in extract
File "/app/app/models/classification_pipeline.rb" line 17 in each45
File "/app/app/models/classification_pipeline.rb" line 17 in extract
File "/app/app/workers/extract_worker.rb" line 11 in perform

Panoptes error when subject was already in set

The add_subject_to_set action will trigger an error in panoptes if the subject was already a member. This might be best resolved in Panoptes?

Panoptes::Client::ServerError: {"errors"=>[{"message"=>"PG::UniqueViolation: ERROR: duplicate key value violates unique constraint \"index_set_member_subjects_on_subject_id_and_subject_set_id\"\nDETAI
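Until Panoptes makes the call idempotent, Caesar could swallow just this error. A minimal sketch, where the client method name is an assumption:

def add_subject_to_set(client, subject_set_id, subject_id)
  client.add_subjects_to_subject_set(subject_set_id, [subject_id]) # hypothetical method name
rescue Panoptes::Client::ServerError => e
  raise unless e.message.include?('UniqueViolation') # already a member: treat as success
end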

Generalize the SurveyReducer

Rather than having configurable subranges, the reducer could just be responsible for a single specific range (with support for 0..-1) and a workflow would specify multiple instances of the reducer if it wants multiple ranges.

It also doesn't need to be tied specifically to the Survey, it could also reduce "flags" for instance, or anything else.
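A hypothetical config sketch of that shape, written as a Ruby hash (every key here is an assumption): two instances of one generic range reducer instead of a single reducer with configurable subranges.

{
  reducers: {
    survey_first_ten: { type: 'range_count', field: 'choice', range: [0, 9] },
    survey_all: { type: 'range_count', field: 'choice', range: [0, -1] } # 0..-1 covers everything
  }
}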

Pause workflows for a little while if they fail repeatedly

If a workflow fails enough times in a row or enough times within a small window, there's probably something wrong with the config and we should stop.

Marten adds:
Yeah, good idea. There might be gems for a pattern called a circuit breaker.
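Gems such as stoplight implement this pattern; a bare-bones sketch of the idea, where all names and thresholds are assumptions:

class WorkflowBreaker
  THRESHOLD = 5       # consecutive failures before pausing
  COOL_OFF = 10 * 60  # seconds to stay paused

  def initialize
    @failures = Hash.new(0)
    @paused_until = {}
  end

  def paused?(workflow_id)
    t = @paused_until[workflow_id]
    !t.nil? && Time.now < t
  end

  def record_failure(workflow_id)
    @failures[workflow_id] += 1
    return unless @failures[workflow_id] >= THRESHOLD
    @paused_until[workflow_id] = Time.now + COOL_OFF # pause the workflow
    @failures[workflow_id] = 0
  end

  def record_success(workflow_id)
    @failures[workflow_id] = 0 # any success resets the count
  end
end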

Add capability to ignore classifications before certain version

Might want to default to ignoring all but the latest workflow version? But there are plenty of cases where things didn't significantly change (example: Wildcam Darien has 45k classifications on an older version, where the changes were just a few added cross-links between species, and the "How many" question removed for "Human").
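A minimal sketch of the version check, assuming classifications carry a Panoptes-style "major.minor" workflow_version string:

def recent_enough?(classification, minimum_version)
  Gem::Version.new(classification.workflow_version) >= Gem::Version.new(minimum_version)
end

# e.g. skip extraction when !recent_enough?(classification, '46.0')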

REST API isn't versioned

This hasn't been a big deal so far but once we get more people using ExternalExtractors and ExternalReducers we're going to want the ability to occasionally make breaking changes without making consumers of the API recode their apps.

parameters to ReductionsController#index

When you ask for the reductions for a given workflow, you're required to specify workflow_id and reducer_id in the route, but the controller doesn't filter by reducer_id. Also, you're required to provide a subject_id as a param, even though it's not part of the route, because the controller filters on that.

It might make more sense to do /workflows/1/subjects/123/reductions with an optional reducer_id param.
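Sketched as Rails routes (this is the proposal from the issue, not the current routing):

Rails.application.routes.draw do
  resources :workflows, only: [] do
    resources :subjects, only: [] do
      resources :reductions, only: [:index] # ?reducer_id=... stays an optional filter
    end
  end
end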

Extracts controller finding extract to update

Currently it uses find_or_initialize_by with workflow_id, subject_id, and extractor_id, but that combination is only unique for reductions, not for extracts. Instead we probably want to use classification_id and extractor_id.
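Sketch of the proposed lookup, keyed on the pair the extracts unique index actually covers (mirroring the upsert sketch above):

extract = Extract.find_or_initialize_by(
  classification_id: params[:classification_id],
  extractor_id: params[:extractor_id]
)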

Add "webhooks" concept to nero_config

Marten Veldthuis [11:28 AM]: so they would have {extractors: {s: {type: 'blank'}}, reducers: {}, rules: [], webhooks: {"asdf": {url: "https://example.org"}}} as their nero config
Marten Veldthuis [11:30 AM]: but there could be a filter inside that {url: 'zxy', events: ['extract_changed']} if we really need to
Marten Veldthuis [11:32 AM]: i guess we could debounce it quite easily, and then do one call with multiple events (for all the changes)
Marten Veldthuis [11:34 AM]: so let's make sure the API call we're making out defines it as us sending them an array of events

Replaces #60
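A sketch of the outgoing call described above; the transcript only pins down that the body is an array of events, so the field names here are assumptions:

events = [
  { event: 'extract_changed', workflow_id: 1, subject_id: 123 },   # hypothetical fields
  { event: 'reduction_changed', workflow_id: 1, subject_id: 123 }
]
# one debounced POST per configured webhook url, with JSON.dump(events) as the body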

Move workflow configuration to Caesar

Putting Panoptes in charge of Caesar's config was, in retrospect, a bad decision. Caesar should be the principal owner of all data related to Caesar. That means doing less work on events in the stream, since we don't have to keep checking whether we have to update the workflow config.

The main issue with this was that the UI for Caesar needs some way of setting the configuration, and the existing HTTP Basic authentication as used by Kinesis wasn't going to cut it. But with #84 we now have working OAuth authentication in Caesar.

To resolve this issue we will need to check (by calling Panoptes) that the current_user has write access to the Panoptes project (i.e. is an owner or collaborator on that project). This will also help future tickets that can provide proper security on the rest of the API.

UniqueViolation can occur when storing reductions

PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "index_reductions_on_workflow_id_and_subject_id_and_reducer_id"
DETAIL: Key (workflow_id, subject_id, reducer_id)=(3559, 7316469, s) already exists.

File "/app/app/models/classification_pipeline.rb" line 39 in block in reduce
File "/app/app/models/classification_pipeline.rb" line 34 in each
File "/app/app/models/classification_pipeline.rb" line 34 in reduce
File "/app/app/workers/reduce_worker.rb" line 10 in perform

Allow extractors/reducers to be re-run for a workflow

For aggregation if a user wants to change anything (say a clustering parameter) there should be a nice way to re-run reducers for a workflow.

Also if caesar is set up after some images are retired (or the entire project is finished) it would be nice to be able to force re-run the extractors.

Asynchronous extract/reduction handling and the pipeline

If an extraction is done asynchronously with the ExternalExtractor, then either an empty hash or a special value should be returned to indicate that no further processing should be performed and the ExtractWorker should not queue up a ReduceWorker. Reduction will instead occur after an extract is posted to the ExtractsController by a remote service.

If a reduction is done asynchronously with the ExternalReducer, then either an empty hash or a special value should be returned to indicate that no further processing should be performed and the ReduceWorker should not queue up a CheckRulesWorker. Rule checking will instead occur after a reduction is posted to the ReductionsController by a remote service.
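A minimal sketch of that convention on the extract side (the sentinel value and the class internals are assumptions):

require 'sidekiq'

class ExtractWorker
  include Sidekiq::Worker

  DEFERRED = 'pending'.freeze # hypothetical special value for async external extractors

  def perform(classification_id)
    extracts = pipeline.extract(classification_id) # hypothetical pipeline call
    return if extracts.nil? || extracts == {} || extracts == DEFERRED
    ReduceWorker.perform_async(classification_id) # only queue reduction for real data
  end
end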
