ngageoint / scale

Processing framework for containerized algorithms

Home Page: http://ngageoint.github.io/scale/

License: Apache License 2.0

Shell 0.49% Python 97.85% Makefile 0.11% Batchfile 0.13% HTML 0.90% CSS 0.23% JavaScript 0.01% HCL 0.01% Dockerfile 0.27%
Topics: mesos, dcos, scale, docker

scale's Introduction

Scale

Join the chat at https://gitter.im/ngageoint/scale

Scale is a system that provides management of automated processing on a cluster of machines. It allows users to define jobs, which can be any type of script or algorithm. These jobs run on ingested source data and produce product files. These products can be disseminated to appropriate users and/or used to evaluate the producing algorithm in terms of performance and accuracy.

Mesos and Nodes

Scale runs across a cluster of networked machines (called nodes) that process the jobs. Scale utilizes Apache Mesos, a free and open source project, for managing the available resources on the nodes. Mesos informs Scale of available computing resources and Scale schedules jobs to run on those resources.

Ingest

Scale ingests source files using a Scale component called Strike. Strike is a process that monitors an ingest directory into which source data files are being copied. After a new source data file has been ingested, Scale produces and places jobs on the queue depending on the type of the ingested file. Many Strike processes can be run simultaneously, allowing Scale to monitor many different ingest directories.
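At its core, a Strike process boils down to: watch a directory, detect new files, and queue jobs based on file type. A minimal sketch of that loop is below; the suffix-to-job-type mapping and function names are invented for illustration and are not Scale's actual API (real Strike configurations are far more expressive).

```python
from pathlib import Path

# Hypothetical suffix-to-job-type mapping; real Strike configurations are
# more expressive than this sketch.
JOB_TYPE_FOR_SUFFIX = {".h5": "parse-hdf5", ".tif": "ortho-image"}

def scan_ingest_dir(ingest_dir, already_seen, queue):
    """Poll an ingest directory once, queueing a job for each new file whose
    type is recognized. Files already seen are skipped, so repeated polls
    are idempotent."""
    for path in sorted(Path(ingest_dir).iterdir()):
        if not path.is_file() or path.name in already_seen:
            continue
        already_seen.add(path.name)
        job_type = JOB_TYPE_FOR_SUFFIX.get(path.suffix)
        if job_type is not None:
            queue.append({"job_type": job_type, "input_file": path.name})
```

Running several such loops against different directories is what lets Scale monitor many ingest points at once.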

Jobs

Scale creates jobs based on its known job types. A job type defines key characteristics about an algorithm that Scale needs to know in order to run it (what command to run, the algorithm's inputs and outputs, etc.) Job types are labeled with versions, allowing Scale to run multiple versions of the same algorithm. Jobs may be created automatically due to an event, such as the ingest of a particular type of source data file, or they may be created manually by a user. Jobs that need to be executed are placed onto and prioritized within a queue before being scheduled onto an available node. When multiple jobs need to be run in a serial or parallel sequence, a recipe can be created that defines the job workflow.
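The "versioned job types" idea can be sketched as a registry keyed by (name, version), so multiple versions of the same algorithm coexist. This is an illustrative in-memory model only; a real Scale job type also defines a full algorithm interface (inputs, outputs, resources, error mappings), and the names below are invented.

```python
# Illustrative in-memory registry, not Scale's implementation.
JOB_TYPES = {}

def register_job_type(name, version, command):
    """Store a job type keyed by (name, version) so versions coexist."""
    JOB_TYPES[(name, version)] = {"name": name, "version": version,
                                  "command": command}

def get_job_type(name, version):
    return JOB_TYPES[(name, version)]

# Two versions of the same (hypothetical) algorithm, both runnable.
register_job_type("landsat-parse", "1.0.0", "./parse_landsat.sh ${input}")
register_job_type("landsat-parse", "2.0.0", "./parse_landsat.sh --fast ${input}")
```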

Products

Jobs can produce products as a result of their successful execution. Products may be disseminated to users or used to analyze and improve the algorithms that produced them. Scale allows the creation of different workspaces. A workspace defines a separate location for storing source or product files. When a job is created, it is given a workspace to use for storing its results, allowing a user to control whether the job's results are available to a wider audience or are restricted to a private workspace for the user's own use.

Scale Dependencies

Scale requires several external components to run as intended. PostgreSQL is used to store all internal system state and must be accessible to both the scheduler and web server processes. Fluentd and Elasticsearch are used to collect and store all algorithm logs. A message broker is required for in-flight storage of internal Scale messages and must be accessible to all system components. The following versions of these services are required to support Scale:

  • Elasticsearch 6.6.2
  • Fluentd 1.4
  • PostgreSQL 9.4+
  • PostGIS 2.0+
  • Message Broker (RabbitMQ 3.6+ or Amazon SQS)

Note: We strongly recommend using managed services for PostgreSQL (AWS RDS), Messaging (AWS SQS) and Elasticsearch (AWS Elasticsearch Service), if available to you. Use of these services in Docker containers should be avoided in all but development environments. Reference the Architecture documentation for additional details on configuring supporting services.

Quick Start

While Scale can be entirely run on a pure Apache Mesos cluster, we strongly recommend using Data Center Operating System (DC/OS). DC/OS provides service discovery, load-balancing and fail-over for Scale, as well as deployment scripts for nearly all imaginable target infrastructures. This stack allows Scale users to focus on use of the framework while minimizing effort spent on deployment and configuration. A complete quick start guide can be found at:

https://ngageoint.github.io/scale/quickstart.html

Algorithm Development

Scale is designed to allow development of recipes and jobs for your domain without having to concern yourself with the complexities of cluster scheduling or data flow management. As long as your processing can be accomplished with discrete inputs on a Linux command line, it can be run in Scale. Simple examples of a complete processing chain can be found within the above quick start or you can refer to our in-depth documentation for step-by-step Scale integration:

https://ngageoint.github.io/scale/docs/algorithm_integration/index.html

Scale Development

If you want to contribute to the Scale open source project, we welcome your contributions. There are two primary components of Scale; the links in the repository provide specific development environment setup instructions for each individual component.

Build

Scale is tested and built using a combination of Travis CI and Docker Hub. All unit test execution and documentation generation are done using Travis CI. We require that any pull request fully pass unit test checks prior to being merged. Between releases, Docker Hub builds are saved to x.x.x-snapshot image tags; on release, image tags are matched to the release version.

A new release can be cut using the generate-release.sh shell script from a cloned Scale repository (where numbers refer to MAJOR MINOR PATCH versions respectively):

./generate-release.sh 4 0 0 

There is no direct connection between the Travis CI and Docker Hub builds, but both are launched via push to the GitHub repository.

Contributing

Scale was developed at the National Geospatial-Intelligence Agency (NGA). The government has "unlimited rights" and is releasing this software to increase the impact of government investments by providing developers with the opportunity to take things in new directions. The software use, modification, and distribution rights are stipulated within the Apache 2.0 license.

All pull request contributions to this project will be released under the Apache 2.0 or compatible license. Software source code previously released under an open source license and then modified by NGA staff is considered a "joint work" (see 17 USC § 101); it is partially copyrighted, partially public domain, and as a whole is protected by the copyrights of the non-government authors and must be released according to the terms of the original open source license.

scale's People

Contributors

antman1p, bald6354, bernardfazziniais, bowmanmc, chrisw1229, ckras34, dfaller, dickmancj-mil, droessne, emimaesmith, ericsvendsen, fizz11, gisjedi, gzehring, jefferey, jotofosho, loganr-w, matttalda, mheppner, mikenholt, steph2a1, steveais, steveb-ais, tclarke, thomasneal, wardcr


scale's Issues

Add job_type.max_scheduled to REST API

A new field was added to job_type called max_scheduled which optionally limits the number of jobs of each type that can be scheduled to run at the same time. Add this new field to applicable serializers, REST API calls, and documentation. This should be an editable field for the purposes of import/export. When this is complete, create an issue for the UI to display this value on the job type detail page.

Product files lose path information

Output products in subdirectories lose their directory information.

For example: /scale_data/job_exe__/outputs/output_data/a/b/c.png
will end up in: /products/foo/1.0.0//year/month/day/job_exe*/c.png

Directory information should be kept.
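One way the fix might look (a sketch, not the actual Scale code): compute each product's path relative to the job's outputs directory and re-join it under the workspace, instead of using only the file's base name.

```python
import os

def product_destination(local_path, outputs_root, workspace_root):
    """Preserve a product's subdirectory structure when it is moved into a
    workspace, by keeping the path relative to the outputs directory."""
    rel = os.path.relpath(local_path, outputs_root)
    return os.path.join(workspace_root, rel)
```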

Job load chart height

The job load chart's initial height is not correct when loaded (on the standalone job load page). The chart switches to the proper height when the "older/newer" navigation buttons are used.

Job type scheduling limit

Add a limit to how many jobs of a particular type can be scheduled at one time.
Allow this to be set to unlimited (negative value?).
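The gate itself is simple; a sketch under the "negative means unlimited" convention suggested above (names are illustrative, not Scale's):

```python
def can_schedule(running_counts, job_type_name, max_scheduled):
    """Return True if another job of this type may be scheduled now.
    A negative max_scheduled means unlimited."""
    if max_scheduled < 0:
        return True
    return running_counts.get(job_type_name, 0) < max_scheduled
```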

Exit codes from jobs seem to be getting lost

Exit codes from pre and post are ok, but execution status of the job itself doesn't seem to be making it out of the docker container. Are we not sending it correctly or are we losing it? Or is this just a display bug?

Add Job to Recipe

Develop a different UI for adding a job to a recipe. The current modal method doesn't work well when there are lots of jobs present in the system.

Move automatic node pausing to NodeManager

Right now the automatic node pausing is triggered during the ScaleScheduler.statusUpdate() callback, adding unnecessary execution time. Also, the logic for when automatic pausing is triggered should be looked at. The current query is highly inefficient and expensive.

Move the automated node pausing logic into the NodeManager. The NodeManager can collect information about job execution failures during scheduler callbacks. Then it can perform automatic pausing when needed in its sync_with_database() thread.
We need to have a group discussion to figure out the best approach for automatically triggering node pausing.

Refactor ingest system

Refactor the ingest system to remove NFS requirement.
Implement new monitor system to allow Strike to have different ways of receiving ingest file events

Sync is_online back to node models

Right now the REST API to return status information back for nodes relies upon an HTTP request to Mesos to determine whether a node is currently online. After the major scheduler refactor, the scheduler now tracks is_online status for each node using the scheduler.sync.node_manager.NodeManager class. The scheduler should be considered the source of truth for this value and should sync this value back to the database.

Add non-nullable is_online boolean field to node model (default true)
Update NodeManager.sync_with_database() to set the is_online field for each node to the correct value as it changes in the scheduler
Resolve the 2 related TODOs in scheduler.offer.node.py
Update all applicable REST APIs to no longer query Mesos to determine online status; use the new field. Since the new field "costs" nothing, it can be added to all REST APIs that return node models.
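The sync step described above might look like the following sketch, with plain dicts standing in for node model rows. Returning only the hostnames whose value changed keeps database writes to a minimum; the function and field names are assumptions for illustration.

```python
def sync_is_online(node_rows, scheduler_view):
    """Push the scheduler's is_online view (hostname -> bool) into node
    records, returning hostnames whose stored value actually changed."""
    changed = []
    for row in node_rows:
        online = scheduler_view.get(row["hostname"], False)
        if row["is_online"] != online:
            row["is_online"] = online
            changed.append(row["hostname"])
    return changed
```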

Add workspace limits to scheduler

Add limits that the scheduler enforces for workspaces.
Probably need both a global and per node limit for each workspace.
Consider read vs write limits?

Job type icons on nodes

Icons representing currently running job types don't show up initially on a node - they only appear when subsequent calls are made to the running jobs API endpoint.

Allow workspaces to deny file moves

Add the ability for workspaces to be configured to not move files, even if the parse results of a job indicate that a file in the workspace should move. This will be accomplished with a new boolean field added to the workspace model called "is_move_enabled" with a default value of True.
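The behavior of the proposed flag can be sketched like this (function name and dict-based workspace are illustrative, not the actual model):

```python
def resolve_move(workspace, current_path, requested_path):
    """Honor the proposed is_move_enabled flag: if the workspace disallows
    moves, a parse-result move request is ignored and the file stays put."""
    if not workspace.get("is_move_enabled", True):
        return current_path
    return requested_path
```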

Recipe Superseding

Requires issue #51 to be done first.

Update the recipe model to handle the concept of recipes superseding one another.
Add is_superseded, root_superseded_recipe_id, superseded_recipe_id, and superseded fields to recipe model
Add is_original to recipe_job model (think about the field name)

Update recipe handling to correctly handle jobs being in multiple recipes

  • update recipe handling methods to use named tuples (from previous pull request comment)
  • create class to handle recipe definitions

Update RecipeManager.get_recipe_for_job() to return the original recipe for the job
Update file ancestry links to only grab the root recipe for a job

Update RecipeManager.create_recipe() to accept a recipe being superseded

  • the two recipes may have different job_names for the jobs
  • all of the 'not is_original' recipe jobs (the "copied" jobs) should only depend upon one another

Job Superseding

Update the job model to handle the concept of jobs superseding one another.
Add root_id, is_superseded, superseded_id, superseded_by_id, and delete_superseded as fields to the job model.
Update JobManager.create_job() to accept a job that is being superseded by the new job.

Nodes need better self-cleanup after jobs.

Many post steps are leaving trash behind on the node. They need to be better at cleaning up after themselves, or we need a "maid" task to come through every so often to clean house.

Vagrant up fails on "master"

I performed a git clone of "master" today and ran the vagrant related commands to run against VirtualBox on Mac OS 10.11.

ansible 2.0.1.0
config file =
configured module search path = Default w/o overrides
Vagrant 1.8.1
VirtualBox 5.0.10

However, it's failing.

The first failure was a typo at the bottom of this file:
scale/ansible/group_vars/vagrant - line 55:

# These are for the example database
nfs_server: "{{ mesos_master_ip }}""

The problem is the stray second double quote at the end of the line. I fixed that and re-ran vagrant up.

TASK [scale-configs : Install local_settings.py] *******************************
task path: /Users/marshall/DEV/GEOINT/scale/ansible/roles/scale-configs/tasks/main.yml:8
fatal: [slave2]: FAILED! => {"changed": false, "failed": true, "msg": "AnsibleFilterError: |password_hash requires the passlib python module to generate password hashes on Mac OS X/Darwin"}

Which might be something I can fix, so I will dig into that error message.

Product Superseding

Requires issue #51 to be done first

Update the product model to support products superseding one another
Add is_superseded, superseded_id, superseded_by_id fields to product model
Update product saving to optionally take superseded product:

  • becoming is_superseded makes is_published false
  • grab UUID from superseded product
  • do not save products if their job is already superseded (locking might be complicated here)

Upgrade to latest Mesos

Update Scale to use the latest stable version of Mesos.

This should remove our need to do an explicit Docker pull in the pre-task. Test this!

Update the node.slave_id column to agent_id to reflect the new name in Mesos.

Check for updated Python bindings/stubs and see if we can remove the protobuf dependency.

Delete superseded products

Requires issue #53 to be done first

In the Scale Cleanup Job, delete superseded products if the job_exe completed.
The post task must not mark cleanup as done if the job_exe's job has delete_superseded.

Products being deleted without a specific superseding job should be marked as is_superseded as well as being deleted.

Cancel multiple tasks

Update the backend to support canceling multiple jobs at once (similar to re-queuing multiple jobs).
Also take this as an opportunity to consider switching the model locking order from job_exe, job to job, job_exe, which would be beneficial for canceling jobs.

Get Scale into Docker

Get all of the Scale pieces into Docker and update the build.xml to do Docker builds.
We will need to update the system job types to use Docker.

This requires solving the NFS mounting issue across containers.

This is desperately needed to ease deployment.

Create a tool to fake a strike ingest of existing data

Create a tool (command line is fine) to generate database entries for existing ingested data. This should create the file entries like strike does so that existing data from another scale instance can be ingested without having to copy it into the ingest directory again.

Jobs page performance

When a number of jobs have built up the response time of the Jobs list page seriously degrades. I'm guessing this is a result of using a sort on the update time of the job combined with paging. This is going to cause the entire jobs table to be processed by the database so that the appropriate page of data can be returned.

A simple solution would be to add a days/time back filter to ensure nothing older than a week or so would be processed. This could be exposed as a calendar control and passed to the API. This would reduce the total dataset that would need to be processed when viewing the job list.
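The proposed filter amounts to bounding the dataset by age before any sort or page is applied. A sketch of the effect (in plain Python rather than a database query, with illustrative field names):

```python
from datetime import datetime, timedelta

def recent_jobs(jobs, now, max_age_days=7):
    """Apply the suggested time-back filter before sorting and paging, so
    only a bounded window of jobs is ever processed."""
    cutoff = now - timedelta(days=max_age_days)
    return [j for j in jobs if j["last_modified"] >= cutoff]
```

In the real API this would translate to a WHERE clause on the job's last-modified timestamp, which keeps the sort from touching the entire jobs table.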

Data feed history

Add the ability to view data feed history using the same UI as the job load chart.

Node hostnames and job history in EC2 environment

Where (virtual?) nodes may automatically spin up, join, shut down, and be destroyed on a regular basis, the concepts of active/inactive and offline/online may not make as much sense. Should we abandon hostnames for a GUI name, or just "pass the buck" through to Mesos?

Refactor task handling

Refactor the task handler system to separate Scale logic from the Mesos interface.
Speed up the status updates in the Scheduler by reducing the number of queries needed.
Add the automatic retry decorator to the database queries made for status updates.

This is an important refactor for improving the performance and reliability of the scheduler.

Add support for constants for input values in Recipe Types

In order to improve reusability of Job Types, Recipes should support specifying constants for values of Job Type inputs.

Example: Job Type "Send E-mail" that has inputs "sender" and "recipient". Scale currently requires creating separate Job Types and Docker images for each unique configuration. The Recipe Type should support user defined values for these inputs.
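The resolution step could be sketched as a merge where recipe-type constants take precedence over per-run inputs, so a single "Send E-mail" job type is reusable across recipe types (names and merge policy here are an assumption about the feature, not its implementation):

```python
def resolve_job_inputs(recipe_constants, runtime_inputs):
    """Merge per-run inputs with recipe-type constants; constants win so the
    recipe type fully determines values like 'sender' and 'recipient'."""
    inputs = dict(runtime_inputs)
    inputs.update(recipe_constants)
    return inputs
```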

Responsive design issues

There are a few places in the UI that don't respond well to different browser window/screen sizes. For example, the scheduler pause button on the Overview screen and the timeline visualization on the Job Details page.

MPI

Trev:

Can we use the MPI for SciPy project to help build a better log capturing mechanism? Might give us the interactive-real-time delivery we want.

Improve source file ancestor query

The method FileAncestryLinkManager.get_source_ancestors() is an expensive query. In particular, the DISTINCT ON portion of the query. Improve it so the post-task takes less time holding a database connection.

Job Load Graph fails to update

If the overview page is left open for long periods of time (> 8 hours) the Job Load table stops updating. The other panels on the page appear to stay in sync.

Update Django and django-rest-framework

Update the versions of Django and django-rest-framework used by Scale. This will require some investigation into the Django changes and possibly making appropriate updates to Scale.

Modify the main.yml file to fix Docker error

Docker does not load properly using YUM unless lvm2 is updated first. Here is the fix.

file: scale/ansible/roles/common/tasks/main.yml

...
- name: Install python-pip
  yum: name=python-pip
  become: true
- name: Update lvm2
  yum: name=lvm2 state=latest
  become: true
- name: Install docker-py
  pip: name=docker-py
  become: true
- name: Install docker
  yum: name=docker
  become: true
...

Re-queue multiple jobs

Improve the REST API to allow re-queuing multiple jobs in one REST call.
Include an option to reset the jobs to a new priority.
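A sketch of what the single bulk request body might contain (field names are illustrative, not Scale's actual REST API):

```python
def build_requeue_payload(job_ids, new_priority=None):
    """Build one re-queue request body covering many jobs, optionally
    resetting them to a new priority. De-duplicates and sorts the IDs."""
    payload = {"job_ids": sorted(set(job_ids))}
    if new_priority is not None:
        payload["priority"] = new_priority
    return payload
```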

Fix the hard coding for the machine.vm.synced_folder in the Vagrant file

Fix the hard coding for the machine.vm.synced_folder in the Vagrant file. Using the "" will prevent the user from needing to create the hard-coded path specified in the file. Modify the vagrant quick start documentation to reflect that the synced folder may need to be modified if the project is not installed in the user's home directory.
file:
scale/vagrant
....
machine.vm.synced_folder "
/scale/", "/scale/"
...

Exception in new scheduler

The following exception is occurring often with the latest changes to the scheduler.

ERROR [scheduler.scale_scheduler(293)] Error handling status update for job execution: #
Traceback (most recent call last):
  File "/opt/scale/scheduler/scale_scheduler.py", line 265, in statusUpdate
    task_dir = get_slave_task_directory(hostname, port, task_id)
  File "/opt/scale/mesos_api/api.py", line 241, in get_slave_task_directory
    raise MesosError('Task not found: %s' % task_id)

Refactor log system

Refactor the log system:

  • Remove all of the Django database logging (LogEntry model in the Error app)
  • No longer include task logs in the database
  • Figure out a new method of log storage so that logs don't need to be moved once a task finishes
  • Store task logs independently (pre/job/post tasks)
  • See if stdout and stderr can be both combined and separated
  • Remove the stdout/stderr columns in the database and figure out new columns for URLs that can be used for log viewing
  • Update the JobExecutionManager.task_started() and JobExecutionManager.task_ended() methods and update the tasks to remove stdout/stderr and the URLs
  • Should this use the workspace system?

This is an important issue for improving the reliability and performance of the scheduler since logs will no longer have to be dealt with during status updates.

Batch

Create the foundation for adding historical batch processing:

  • New Batch model with definition, status (just setup status?), setup_job_id, event_id, title, description, created, and last_modified fields
  • New BatchJob model with batch_id, job_id (nullable), superseded_job_id (nullable), and created fields
  • New BatchRecipe model with batch_id, recipe_id (nullable), superseded_recipe_id (nullable), and created fields
  • Create a new system job 'Scale Batch Setup' with a blank implementation
  • Create a create_batch() method that validates the definition schema (currently empty), creates a Batch model, and queues a Scale Batch Setup job
