
warelab / sciapps

SciApps: a cloud-based platform for reproducible bioinformatics workflows

Home Page: https://www.sciapps.org

License: Apache License 2.0

Languages: Perl 20.25%, HTML 0.87%, CSS 1.02%, JavaScript 40.94%, Shell 0.01%, Raku 0.39%, Less 36.53%
Topics: cyverse, workflow, agave

sciapps's Introduction

SciApps: a cloud-based platform for reproducible bioinformatics workflows

Introduction

SciApps is a bioinformatics workflow package developed to leverage local clusters or TACC/XSEDE resources for computing, and the CyVerse Data Store for storage. SciApps is built on top of the Agave API, which can also virtualize commercial resources (e.g., Amazon EC2/S3) for computing and storage. Both a GUI and a RESTful API are available for interactive or batch processing of NGS data.

Installation of SciApps

git clone https://github.com/warelab/sciapps.git
cd sciapps/agavedancer
sudo npm install -g grunt-cli
npm install
grunt package
sudo /usr/sbin/apachectl graceful  

Providing CyVerse credentials

Update defaultUser to "XXX" (your CyVerse username) in agavedancer/environments/production.yml (or development.yml), then create the following file.

cd sciapps/agavedancer
touch .agave
  .agave content:
      {"username":"XXX","password":"YYY"}

Setting up iRODS (for accessing CyVerse Data Store)

wget ftp://ftp.renci.org/pub/irods/releases/4.1.10/centos7/irods-icommands-4.1.10-centos7-x86_64.rpm
sudo yum install fuse fuse-libs
sudo rpm -i irods-icommands-4.1.10-centos7-x86_64.rpm 
cd /usr/share/httpd
sudo touch irodsEnv
sudo chmod 664 irodsEnv
sudo chown apache:apache irodsEnv
  irodsEnv content:
    {
      "irods_host": "data.iplantcollaborative.org",
      "irods_user_name": "XXX",
      "irods_port": 1247,
      "irods_zone_name": "iplant",
      "irods_authentication_file": "/usr/share/httpd/irodsA"
    }
sudo touch irodsA
sudo chmod 664 irodsA
sudo chown apache:apache irodsA
sudo -u apache /bin/bash
export IRODS_ENVIRONMENT_FILE=/usr/share/httpd/irodsEnv
iinit

Integrating new Apps/Tools

Follow the Agave instructions for developing new apps, then put the app json file in the following assets folder (e.g., Bismark-0.14.4.json).

cd agavedancer/public/assets
touch agaveAppsList.json
  agaveAppsList.json content (a JSON array, one entry per app):
    [
      {
        "tags": ["Methylation"],
        "id": "Bismark-0.14.4",
        "label": "Bismark",
        "name": "Bismark",
        "version": "0.14.4"
      },
      {
        ...
      }
    ]

Configuring web server

SciApps.org can be configured with an Apache server using the following demo configuration file. The sciapps.conf file should be placed under /etc/httpd/conf.d/ (CentOS 7) or /usr/local/apache2/conf/ (CentOS 6). Note that an SSL certificate is needed to be able to authenticate to the cloud systems.

<VirtualHost 143.48.220.100:443>
    SSLEngine on
    SSLCertificateFile /etc/letsencrypt/live/www.sciapps.org/fullchain.pem
    SSLCertificateKeyFile /etc/letsencrypt/live/www.sciapps.org/privkey.pem
    SSLCertificateChainFile /etc/letsencrypt/live/www.sciapps.org/chain.pem
    ServerName       www.sciapps.org
    ServerAlias      sciapps.org
    DocumentRoot     /home/YOURUSERNAME/sciapps/agavedancer/public
    RewriteEngine on
    RewriteRule ^/app_id/(.*)      https://www.sciapps.org/?app_id=$1 [L]
    RewriteRule ^/page/(.*)     https://www.sciapps.org/?page_id=$1 [L]
    RewriteRule ^/data/(.*)     https://www.sciapps.org/?page_id=dataWorkflows&data_item=$1 [L]

    SetEnv DANCER_ENVIRONMENT "production"
    <Directory "/home/YOURUSERNAME/sciapps/agavedancer/public">
        AllowOverride none
        Require all granted
        DirectoryIndex index.html index.php
    </Directory>
    <Location />
        SetHandler perl-script
        PerlResponseHandler Plack::Handler::Apache2
        PerlSetVar psgi_app /home/YOURUSERNAME/sciapps/agavedancer/bin/app.pl
    </Location>
</VirtualHost>

Citation

Wang, L., Lu, Z., Van Buren, P., & Ware, D. (2018). SciApps: a cloud-based platform for reproducible bioinformatics workflows. Bioinformatics, 34(22), 3917-3920.

sciapps's People

Contributors: ajo2995, kapeel, zhlu9890

sciapps's Issues

Visualizing scientific workflow

Visualization of the workflow should provide the following functions:

  1. A graphic representation of the workflow that can be downloaded for publication (e.g., Pegasus)
  2. A graphical representation that can be edited before saving (e.g., reorganization of nodes and links)
  3. A graphical representation in which nodes are clickable for inputs/outputs, app, and parameter info
  4. A graphical representation that can show the status of jobs
  5. A graphical representation that can be re-submitted with new parameters/inputs

Some reviews of, and packages for, workflow visualization are below. We need to be able to generate a graphic representation of the workflow. It would be nice to update job status on the graph (like bioExtract did; they seem to have adopted PinBall, below), but for now it is fine to just create a static image of the workflow (like Pegasus did; they also seem to have adopted PinBall). A sketch of generating such a graph follows.
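As a starting point, here is a minimal sketch that builds a mermaid graph definition from a saved workflow (mermaid is the renderer mentioned in the node-wrapping issue below; the workflow object shape used here is an assumption, not the actual SciApps schema):

// Build a mermaid "graph LR" definition from a workflow description.
// Shape assumed: { steps: [{ id, appId, inputsFrom }] } -- hypothetical.
function workflowToMermaid(workflow) {
  const lines = ['graph LR'];
  for (const step of workflow.steps) {
    lines.push(`  ${step.id}["${step.appId}"]`); // one node per step
    for (const dep of step.inputsFrom || []) {
      lines.push(`  ${dep} --> ${step.id}`); // one edge per input dependency
    }
  }
  return lines.join('\n');
}

// Example: a three-step chain.
console.log(workflowToMermaid({
  steps: [
    { id: 's1', appId: 'GLM-5.1.23' },
    { id: 's2', appId: 'AdjustP-0.0.1', inputsFrom: ['s1'] },
    { id: 's3', appId: 'XYPlot-0.0.2', inputsFrom: ['s2'] }
  ]
}));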

Here is an old review of different visualizations developed by Galaxy and Taverna.
www.molgenis.org/raw-attachment/blog/2013/01/31/BIOINFORMATICS_2013_24_CR%20(2).pdf

andrew
I always liked the look of Seven Bridges Genomics' workflows. They have an open source project that might have some useful viz tools: https://www.rabix.org/

kapeel
http://stackoverflow.com/questions/14292636/dagdirected-acyclic-graph-dynamic-job-scheduler (DAG (directed acyclic graph) dynamic job scheduler)

http://link.springer.com/article/10.1007%2Fs11227-009-0284-7#/page-1 (DAGMap: efficient and dependable scheduling of DAG workflow jobs in Grid)

Investigating workflow ideas

Workflow ideas (change the output link to): http://data.sciapps.org/results/test/readme.txt?jobid=n

  • jobid starts from 1, 2, … (in the right column)
  • the appid, inputid1, inputid2 that generated the output can be retrieved from the jobid
  • outputs are unique given the jobid and output filename
  • assuming output names are always the same no matter how inputs are changed, for any app
  • a workflow page will be built for constructing an automatic workflow
  • all steps have been run at least once
  • can delete nodes (failed analyses can be repeated later; not urgent)
  • can save
  • can run (archiving to brie, so no need to modify the Agave job submission)

  • can bring up used inputs and parameters for modification (parameter sweep)
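A sketch of the jobid-keyed output mapping described above (the jobs table and its field names are hypothetical):

// Outputs are addressed by (jobid, filename), per the link pattern above.
function outputUrl(jobid, filename) {
  return `http://data.sciapps.org/results/test/${filename}?jobid=${jobid}`;
}

// The app and inputs that produced an output are recoverable from the jobid.
const jobs = {
  1: { appid: 'GLM-5.1.23', inputids: ['input1', 'input2'] },
  2: { appid: 'AdjustP-0.0.1', inputids: [outputUrl(1, 'results.txt')] }
};

console.log(outputUrl(2, 'adjusted.txt'));
console.log(jobs[2].appid, jobs[2].inputids);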

Error when inputs have no metadata associated

For the workflow diagram, when there is no metadata associated with the inputs, clicking on an input makes the entire diagram unresponsive. The metadata associated with the app is not displayed either.
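A minimal sketch of a defensive fix; every name here is a hypothetical stand-in for the actual SciApps handlers:

// Guard against missing metadata so one bad input cannot freeze the diagram.
function onNodeClick(node) {
  if (!node || !node.metadata) {
    showMessage('No metadata associated with this input.');
    return; // keep the diagram responsive instead of failing mid-render
  }
  renderMetadataPanel(node.metadata);
}

// Stubs so the sketch runs standalone.
function showMessage(msg) { console.log(msg); }
function renderMetadataPanel(md) { console.log('metadata:', md); }

onNodeClick({ id: 'input1' }); // no metadata: message shown, no freeze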

Setup Drupal

Install Drupal modules on Brie for developing the MaizeCode web portal

Enable visualization with JBrowse

Configure JBrowse with data.sciapps.org.
Configure the Agave archived-results folder with JBrowse.
Visualize Maker outputs (GFF3), BAM, VCF, etc.
Support compressed formats on the server.
Create compressed Maker outputs.
Compress large folders as a tarball before archiving, then automatically uncompress.

Speed optimization

  1. Start caching the datastore automatically once the web page loads or refreshes
  2. Remember the last path the user selected (instead of starting from the root every time); a sketch follows this list
  3. Or, in the long term, create a static JSON for the MaizeCode data set
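A minimal sketch for item 2, persisting the last browsed path in localStorage (the storage key and function names are assumptions):

const LAST_PATH_KEY = 'sciapps.lastDataStorePath'; // hypothetical key

// Call whenever the user navigates in the file browser.
function saveLastPath(path) {
  localStorage.setItem(LAST_PATH_KEY, path);
}

// Start the browser where the user left off, falling back to the root.
function initialBrowsePath(defaultRoot) {
  return localStorage.getItem(LAST_PATH_KEY) || defaultRoot;
}

// Usage: openFileBrowser(initialBrowsePath('/iplant/home/shared'));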

Add direct link to each app

This is functional (done by Zhenyuan), and a URL rewrite is used to improve the direct link. However, it would be better to keep the pre-rewrite URL in the address bar. It would also be nice to have the link displayed in the address bar when an app is clicked; a sketch with the History API follows.
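A minimal sketch, assuming a form loader like the hypothetical loadAppForm below:

// Show the friendly /app_id/<id> URL instead of the rewritten ?app_id=<id>.
function showApp(appId) {
  loadAppForm(appId);
  history.pushState({ appId }, '', '/app_id/' + encodeURIComponent(appId));
}

// Restore the right app form on back/forward navigation.
window.addEventListener('popstate', event => {
  if (event.state && event.state.appId) loadAppForm(event.state.appId);
});

function loadAppForm(appId) { console.log('load form for', appId); } // stub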

Workflow and archiving

Archiving should be avoided when the subsequent step(s) run on the same server.

However, for cases like a-->b, a-->c, where a & b run on cloud A and c runs on cloud B, shall we archive results_a on cloud B or leave it on cloud A? The ideal solution is to leave it on cloud A: if it is archived to cloud B, it will need to be copied back to cloud A again to run b.

Here we are considering clouds, not individual servers. All servers at CSHL count as one cloud, similar to the TACC cloud and the UA cloud (main data store).

The workflow management system needs to query each app for its operating cloud.

Workflow re-launch points

For example, given a five-step workflow, to re-run from any step:

  • Step 1: re-run the entire workflow with different parameters/files
  • Step 2: choose steps 2, 3, 4, 5 in the right panel, build a new workflow, customize, then run
  • Step 3: choose steps 3, 4, 5 in the right panel, build a new workflow, customize, then run
  • Step 4: choose steps 4, 5 in the right panel, build a new workflow, customize, then run
  • Step 5: re-launch this (last) step directly from the right panel

The current assumption is that we always re-run the entire workflow; this will allow the user to re-run from any step of it.

Support login

For jobs:
Use Agave to support login, and use the user's credentials for running a job. Save tokens for checking job status and debugging.

For inputs:
At the same time, capture the user's username to browse the logged-in user's data in the iRODS-based datastore (ils /iplant/home/some-user/sci_data).

For outputs:
Save results back to /iplant/home/some-user/sci_data/analysis/

For security:
Tokens expire after 4 hours. The user must log out and log in again to renew the token.
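A client-side sketch of the 4-hour expiry handling; the storage key and field names are assumptions (the token itself comes from Agave's login flow):

// Track token age client-side; once the 4-hour lifetime is reached,
// drop the token so the UI sends the user back through login.
const TOKEN_TTL_MS = 4 * 60 * 60 * 1000;

function storeToken(token) {
  localStorage.setItem('sciapps.token',
    JSON.stringify({ token, issuedAt: Date.now() }));
}

function currentToken() {
  const raw = localStorage.getItem('sciapps.token');
  if (!raw) return null;
  const { token, issuedAt } = JSON.parse(raw);
  if (Date.now() - issuedAt >= TOKEN_TTL_MS) {
    localStorage.removeItem('sciapps.token'); // expired: force re-login
    return null;
  }
  return token;
}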

Test Case: Metagenomics pipeline

Rough ideas:

  1. Host copy of databases locally (update?).
  2. Build Agave pipeline or Docker based pipeline.
  3. Archive results back to remote database (or download).
  4. Metadata management.

Relative path problem

When browsing the 'Data Store', we need a variable (storagePath) to define the relative path.

For the CSHL storage system, $storagePath='data.sciapps.org/example_data'

For the CyVerse Data Store, $storagePath='/iplant/home/UserName/sci_data'; we also need to add a button to allow access to community data: /iplant/home/shared

To do this, we need to update the Agave system to point the root folder one level up from example_data. When browsing files from the App form, the user is directed to $storagePath and has no right to access data above $storagePath.

All workflows need to be rebuilt since the paths have changed.

'example_data' needs to be removed from filesInfo.js after these changes.
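A sketch of the mapping and the access guard (the system labels are hypothetical; the paths are the ones listed above):

// Root path per storage system, per this issue.
function storagePath(system, username) {
  switch (system) {
    case 'cshl':      return 'data.sciapps.org/example_data';
    case 'cyverse':   return `/iplant/home/${username}/sci_data`;
    case 'community': return '/iplant/home/shared';
    default: throw new Error('unknown storage system: ' + system);
  }
}

// Users may browse at or below $storagePath, never above it.
function isAllowed(system, username, path) {
  return path.startsWith(storagePath(system, username));
}

console.log(isAllowed('cyverse', 'alice', '/iplant/home/alice/sci_data/run1')); // true
console.log(isAllowed('cyverse', 'alice', '/iplant/home/alice')); // false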

Optimizing building workflow

  1. Add a checkbox before the number of each step, or at the right end
  2. Clicking 'Build a Workflow' opens a popup window; a workflow diagram is drawn from the checked jobs in the history
  3. If fewer than 2 jobs are checked, the popup window displays "Please select/check at least two jobs from the right column" (see the sketch after this list)
  4. The user can close the popup diagram, de-select one or more jobs, and redo 'Build Workflow'
  5. To save, the user needs to add a workflow name and a brief description (fields are needed in the popup window)
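A minimal sketch of the guard in item 3 (drawDiagram is a hypothetical stand-in for the popup renderer):

function buildWorkflow(checkedJobIds) {
  if (checkedJobIds.length < 2) {
    alert('Please select/check at least two jobs from the right column');
    return;
  }
  drawDiagram(checkedJobIds); // render the popup workflow diagram
}

function drawDiagram(jobIds) { console.log('diagram for jobs', jobIds); } // stub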

Resolve mod_perl conflicts between production and development (on the same server)

Use cases:

  1. sciapps.org: prod, dev1, and dev2 are mixed together because of mod_perl. Zhenyuan has two separate local gits for prod and dev1: he commits new developments to dev1, then does a git pull from the same GitHub repo (warelab/MaizeCode) to update prod. Liya commits new developments to dev2 and then to prod; he does a git pull from warelab/MaizeCode to merge Zhenyuan's new developments into dev2. The problem occurs at the mod_perl level, since our code adopted Cornel's Perl SDK for Agave. We also use Perl Dancer, but it is not related to the problem.
  2. dnasubway.org: uses plain old HTML::Mason on top of mod_perl. Cornel's dev environment is a VM on his laptop, so he can work anywhere without an internet connection (unless working on Agave stuff, which needs a connection). He also has a VM running on balaur using libvirtd for the live site. There is only the live site, no dev. Pan has her own VM.

Solutions:

  1. Run several Apache instances on different ports. (We didn't adopt this.)
  2. Run each VirtualHost with PerlOptions +Parent. This gives each VirtualHost its own Perl interpreter. (We adopted this one; a sketch follows.)
    For more details see: http://www.gossamer-threads.com/lists/modperl/modperl/98162
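Applied to a VirtualHost like the demo configuration in the README, the adopted fix is a single directive; the dev hostname below is hypothetical, and only the PerlOptions line is the actual change:

<VirtualHost *:443>
    ServerName dev1.sciapps.org
    # Give this VirtualHost its own Perl interpreter pool, so the prod
    # and dev code trees no longer share mod_perl state.
    PerlOptions +Parent
    ...
</VirtualHost>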

Updating Maker

Maker updates (update the app json, then push to SciApps dev2, then SciApps prod):

  1. Mark the genome sequence as required (true for both annotation and re-annotation); to do
  2. Change the 'otherEvidences' label to 'Maker derived evidence' (so that the example workflow still works); to do
  3. Add more parameters to control options for re-annotation? Low priority

Re-route URL input

Currently there is a bug in Agave that modifies files uploaded by URL. One solution is to replace the input URL with an Agave path. This solves the problem for data hosted on an Agave system (e.g., brie); a sketch follows.
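A sketch of the rewrite, assuming the hosting server is registered as an Agave storage system; the system id used here is hypothetical:

// Replace an http URL with the equivalent agave:// path when the file
// lives on a storage system Agave already knows about.
function rewriteUrlInput(url) {
  const m = url.match(/^https?:\/\/data\.sciapps\.org\/(.+)$/);
  return m ? 'agave://data.sciapps.org/' + m[1] : url; // hypothetical system id
}

console.log(rewriteUrlInput('http://data.sciapps.org/results/test/readme.txt'));
// -> agave://data.sciapps.org/results/test/readme.txt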

Wrap Text in Nodes of a Diagram for long names

One idea is to replace a longlonglongname with longlon... before passing it to mermaid.

Another idea (?) is to replace it with
longlon-
glongna-
me

The second idea might be better if it is wrapped just once:
longlong-
longname

The first idea is simple, and also fine if we can display the full name when clicked (in the metadata section).
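A sketch of the first idea (the truncation width is an arbitrary choice):

// Shorten long node labels before handing the graph to mermaid; the
// full name stays available for the metadata section on click.
function shortLabel(name, max = 10) {
  return name.length <= max ? name : name.slice(0, max - 3) + '...';
}

console.log(shortLabel('longlonglongname')); // "longlon..."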

Adding a dropdown menu for workflows

Change the workflow link in the header to a dropdown menu.

  1. Build a workflow: this brings up the workflow-building page. The user needs to have already run each step of the workflow at least once. There are a few tricks:
    a. Run some simple existing workflows; the history of each step will appear in the right column. The user can then build complicated workflows from them. This is similar to chaining several sub-workflows together.
    b. For batch processing, e.g., running bwa on 10 samples, the user can run it on 1 sample (or a fraction), then build a 10-step workflow to batch-process all 10 samples.
  2. Load a workflow: this brings up the workflow-loading page, which can load a workflow in JSON format.
  3. Example workflows: this brings up the example workflow page, which has links to download the JSON for example workflows and links to load each of the example workflow pages.

Reduce dependency on Agave interaction

Need to talk to Dave and Cornel to see whether we can simplify the Agave services and build our own servers for the federated system. This would also help resolve the issue that Agave fails to recognize the local iRODS-based storage system (aka resource server). In addition, a standalone fork might allow us to build more efficient workflows.

Need a search box for app

  1. The search should filter the list of apps, then display the results by category in the left column (with all categories expanded/opened); a filter sketch follows this list
  2. The search box should appear under the 'Apps' header
  3. The search term can be cleared to re-list all categories
  4. It would be nice to have 'auto-complete' for search terms
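A minimal filter sketch over the agaveAppsList.json entry shape ({ id, label, tags }) shown in the README:

// Filter apps by label, id, or tag; an empty query re-lists everything.
function filterApps(apps, query) {
  const q = query.trim().toLowerCase();
  if (!q) return apps; // cleared search: show all categories again
  return apps.filter(app =>
    app.label.toLowerCase().includes(q) ||
    app.id.toLowerCase().includes(q) ||
    (app.tags || []).some(t => t.toLowerCase().includes(q)));
}

console.log(filterApps(
  [{ id: 'Bismark-0.14.4', label: 'Bismark', tags: ['Methylation'] }],
  'methyl'));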

Example workflow for developing the workflow script

Here is a four-step workflow, where XYPlot2 branches off GLM directly:

GLM --> AdjustP --> XYPlot1
GLM --> XYPlot2 (not dependent on AdjustP)

Output folders:
GLM: http://data.sciapps.org/results/glm-tassel-5-1-23-0fJGkCBl8B/
AdjustP: http://data.sciapps.org/results/adjustpvalue-0-0-1-Tx0ryhviOK/
XYPlot1: http://data.sciapps.org/results/xyplot-0-0-2-wyGcfBpaoL/
XYPlot2: http://data.sciapps.org/results/xyplot-0-0-2-O9ULw4qog4/

One thing we ignored before is that Agave might fail for whatever reason, so the workflow engine will need to check outputs (is that possible?) and re-submit the job if it failed; a sketch follows.
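A sketch of that check-and-retry loop; submitJob and outputsExist are hypothetical stand-ins for the actual Agave calls:

// Check a step's outputs after submission and re-submit on failure,
// up to a bounded number of retries.
async function runStepWithRetry(step, submitJob, outputsExist, retries = 1) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    const job = await submitJob(step);
    if (await outputsExist(job)) return job; // expected outputs are present
  }
  throw new Error(`step ${step.id} failed after ${retries + 1} attempts`);
}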

Enable browsing data

Before we have a GUI to select files from the Data Store, we need a way to take the file path from browsing and send it to an Agave app for analysis.

User analysis management

Use CyVerse authentication (might need to pay for https):
https://auth.iplantcollaborative.org/cas4/login?service=http://sciapps.org/

We will capture the username to display user data (we will add the username to our database (1st name); a cookie will be saved in the browser):
/iplant/home/username/sci_data (can go up just one level)
/iplant/home/shared (can't go up)

On the SciApps side, we will add user account management:
Job history: save to the database, load from the database (last 20, then next 20?), delete, search
Workflows: list of workflows for reloading, downloading, saving, deleting, searching
Data: not for now, but will be useful for organizing into experiments with metadata

On the SciApps side, job execution:
Use the maizecode account: users cannot run their private apps unless shared with maizecode; might be slow since the maizecode user runs all jobs.
Use the user's account: capture the user secret at authentication? Doable? Ask Nirav/Dennis/Tony/Edwin. More intuitive, since users can check their jobs via the Agave command-line SDK. This no longer needs write permission to the sci_data folder; results still need to go into the user's sci_data folder, which must be readable for building workflows (make an analysis folder under sci_data?).

TRAM API: display results in the right column (does this work for non-public but shared-readable data?)

Add example job id to workflow

When a workflow is generated, example job ids need to be added to the JSON file. When a workflow is loaded, the example output should be loaded into the History column.

In cases where output folders have been removed or the example job id field is empty, no history, or only partial history (some but not all steps), will be loaded.

Test case: Maker for annotation

A complete pipeline with two apps: Maker and SNAP.
Step 0. Run STAR on the RNA-seq data to build the ESTs.
Step 0. BLAST the UniProt/SwissProt protein database or the NCBI NR protein database for protein evidence.
Step 1. Given the ESTs and proteins, run Maker to get GFF1.
Step 2. Run SNAP with GFF1 to estimate HMM1.
Step 3. Run Maker again with HMM1 to get GFF2 (no EST/protein).
Step 4. Run SNAP with GFF2 to re-estimate HMM2.
Step 5. Run Maker again with HMM2 to get GFF3 (final, for JBrowse).

Questions:

  1. The above pipeline ignores repeats and other gene predictors.
  2. Is it more efficient to keep them as one app?
  3. Check the three option files to optimize data transfer.
  4. Does Maker take compressed files?

Fix Agave timezone bug

The Agave time zone handling is messed up:
Example 1:
Submitted on: 8/18/2016, 10:56:20 AM
Started on: 8/18/2016, 9:56:20 AM
Finished on: 8/17/2016, 8:11:38 PM
Real job finished time (NY): Aug 17 2016 7:11 PM

Example 2:
Submitted on: 8/18/2016, 2:58:46 AM
Started on: 8/18/2016, 1:58:47 AM
Finished on: 8/17/2016, 2:04:27 PM
Real job finished time (NY): Aug 17 2016 1:04 PM

Contacting Rion for a possible solution. Otherwise, fix it manually on our side? A client-side sketch follows.
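If the fix has to happen on our side, one workaround, assuming the raw API timestamps are actually UTC in ISO 8601 with the zone designator dropped (this needs to be confirmed against the examples above), is to reattach the designator and let the browser localize:

// Treat a zone-less Agave timestamp as UTC and render it in the
// viewer's local time zone. Assumption: the API drops a trailing "Z".
function localizeAgaveTime(isoNoZone) {
  return new Date(isoNoZone + 'Z').toLocaleString();
}

console.log(localizeAgaveTime('2016-08-17T20:11:38'));
// e.g. "8/17/2016, 4:11:38 PM" when viewed from America/New_York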

Re-launch a job

Add a "Relaunch" button in the right column between Status and Results to bring up the app form with set parameters.

In the short term, just keep the set parameters (not reset to defaults) after submission
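A minimal sketch of the short-term behavior (all names are hypothetical):

let lastSubmitted = null; // cache of the most recent submission's values

function onSubmit(formValues) {
  lastSubmitted = { ...formValues };
  // ...the existing job submission would happen here...
}

// Repopulate the app form instead of resetting it to defaults.
function relaunch(populateForm) {
  if (lastSubmitted) populateForm(lastSubmitted);
}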

Optimizing displaying outputs in the right panel

Right now, we list outputs with Agave files-list via the job id, which is encoded with the app name.

on brie: dev2.sciapps.org/results/job-folder
on halcott: data.sciapps.org/results/job-folder

Since we have already loaded the job JSON into the browser, we know the results folder name; can we bypass Agave? This assumes we can list the contents of a web folder.

On de.sciapps.org:

We will use ils, so it's much simpler and we can definitely bypass Agave.
