
warelab / sciapps

SciApps: a cloud-based platform for reproducible bioinformatics workflows

Home Page: https://www.sciapps.org

License: Apache License 2.0

Languages: Perl 20.25%, HTML 0.87%, CSS 1.02%, JavaScript 40.94%, Shell 0.01%, Raku 0.39%, Less 36.53%
Topics: cyverse, workflow, agave

sciapps's Introduction

SciApps: a cloud-based platform for reproducible bioinformatics workflows

Introduction

SciApps is a bioinformatics workflow package developed to leverage local clusters or TACC/XSEDE resources for computing, and the CyVerse Data Store for storage. SciApps is built on top of the Agave API, which can also virtualize commercial resources (e.g., Amazon EC2/S3) for computing and storage. Both a GUI and a RESTful API are available for interactive or batch processing of NGS data.

Installation of SciApps

git clone https://github.com/warelab/sciapps.git
cd sciapps/agavedancer
sudo npm install -g grunt-cli
npm install
grunt package
sudo /usr/sbin/apachectl graceful  

Providing CyVerse credentials

Update defaultUser to "XXX" (your CyVerse username) in agavedancer/environments/production.yml (or development.yml), then create the following file.

cd sciapps/agavedancer
touch .agave
  .agave content:
      {"username":"XXX","password":"YYY"}

Setting up iRODS (for accessing CyVerse Data Store)

wget ftp://ftp.renci.org/pub/irods/releases/4.1.10/centos7/irods-icommands-4.1.10-centos7-x86_64.rpm
sudo yum install fuse fuse-libs
sudo rpm -i irods-icommands-4.1.10-centos7-x86_64.rpm 
cd /usr/share/httpd
sudo touch irodsEnv
sudo chmod 664 irodsEnv
sudo chown apache:apache irodsEnv
  irodsEnv content:
    {
      "irods_host": "data.iplantcollaborative.org",
      "irods_user_name": "XXX",
      "irods_port": 1247,
      "irods_zone_name": "iplant",
      "irods_authentication_file": "/usr/share/httpd/irodsA"
    }
sudo touch irodsA
sudo chmod 664 irodsA
sudo chown apache:apache irodsA
sudo -u apache /bin/bash
export IRODS_ENVIRONMENT_FILE=/usr/share/httpd/irodsEnv
iinit

Integrating new Apps/Tools

Follow the Agave instructions for developing new apps, then put the app json file in the following assets folder (e.g., Bismark-0.14.4.json).

cd agavedancer/public/assets
touch agaveAppsList.json
  agaveAppsList.json content (a JSON array, one entry per app):
    [
      {
        "tags": ["Methylation"],
        "id": "Bismark-0.14.4",
        "label": "Bismark",
        "name": "Bismark",
        "version": "0.14.4"
      },
      {
        ...
      }
    ]

Configuring web server

SciApps.org can be configured with an Apache server using the following demo configuration file. The sciapps.conf file should be placed under /etc/httpd/conf.d/ (CentOS 7) or /usr/local/apache2/conf/ (CentOS 6). Note that an SSL certificate is needed to be able to authenticate to the cloud systems.

<VirtualHost 143.48.220.100:443>
    SSLEngine on
    SSLCertificateFile /etc/letsencrypt/live/www.sciapps.org/fullchain.pem
    SSLCertificateKeyFile /etc/letsencrypt/live/www.sciapps.org/privkey.pem
    SSLCertificateChainFile /etc/letsencrypt/live/www.sciapps.org/chain.pem
    ServerName       www.sciapps.org
    ServerAlias      sciapps.org
    DocumentRoot     /home/YOURUSERNAME/sciapps/agavedancer/public
    RewriteEngine on
    RewriteRule ^/app_id/(.*)      https://www.sciapps.org/?app_id=$1 [L]
    RewriteRule ^/page/(.*)     https://www.sciapps.org/?page_id=$1 [L]
    RewriteRule ^/data/(.*)     https://www.sciapps.org/?page_id=dataWorkflows&data_item=$1 [L]

    SetEnv DANCER_ENVIRONMENT "production"
    <Directory "/home/YOURUSERNAME/sciapps/agavedancer/public">
        AllowOverride none
        Require all granted
        DirectoryIndex index.html index.php
    </Directory>
    <Location />
        SetHandler perl-script
        PerlResponseHandler Plack::Handler::Apache2
        PerlSetVar psgi_app /home/YOURUSERNAME/sciapps/agavedancer/bin/app.pl
    </Location>
</VirtualHost>

Citation

Wang, L., Lu, Z., Van Buren, P., & Ware, D. (2018). SciApps: a cloud-based platform for reproducible bioinformatics workflows. Bioinformatics, 34(22), 3917-3920.

sciapps's People

Contributors: ajo2995, kapeel, zhlu9890

sciapps's Issues

Visualizing scientific workflow

Visualization of the workflow should provide the following functions:

  1. A graphic representation of the workflow that can be downloaded for publication (e.g., Pegasus)
  2. A graphical representation that can be edited before saving (e.g., reorganization of nodes and links)
  3. A graphical representation in which nodes are clickable for inputs/outputs, app, and parameter info
  4. A graphical representation that can show the status of jobs
  5. A graphical representation that can be re-submitted with new parameters/inputs

Some reviews of, and packages for, workflow visualization are below. We need to be able to generate a graphic representation of the workflow. It would be nice to update job status on the graph (like bioExtract did; they seem to have adopted PinBall, below), but for now it is fine to just create a static image of the workflow (like Pegasus did; they also seem to have adopted PinBall). A sketch of generating such a graph follows.
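As a starting point, here is a minimal sketch that builds a mermaid graph definition from a saved workflow (mermaid is the renderer mentioned in the node-wrapping issue below; the workflow object shape used here is an assumption, not the actual SciApps schema):

// Build a mermaid "graph LR" definition from a workflow description.
// Shape assumed: { steps: [{ id, appId, inputsFrom }] } -- hypothetical.
function workflowToMermaid(workflow) {
  const lines = ['graph LR'];
  for (const step of workflow.steps) {
    lines.push(`  ${step.id}["${step.appId}"]`); // one node per step
    for (const dep of step.inputsFrom || []) {
      lines.push(`  ${dep} --> ${step.id}`); // one edge per input dependency
    }
  }
  return lines.join('\n');
}

// Example: a three-step chain.
console.log(workflowToMermaid({
  steps: [
    { id: 's1', appId: 'GLM-5.1.23' },
    { id: 's2', appId: 'AdjustP-0.0.1', inputsFrom: ['s1'] },
    { id: 's3', appId: 'XYPlot-0.0.2', inputsFrom: ['s2'] }
  ]
}));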

Here is an old review of different visualizations developed by Galaxy and Taverna.
www.molgenis.org/raw-attachment/blog/2013/01/31/BIOINFORMATICS_2013_24_CR%20(2).pdf

andrew
I always liked the look of Seven Bridges Genomics' workflows. They have an open source project that might have some useful viz tools: https://www.rabix.org/

kapeel
http://stackoverflow.com/questions/14292636/dagdirected-acyclic-graph-dynamic-job-scheduler (DAG (directed acyclic graph) dynamic job scheduler)

http://link.springer.com/article/10.1007%2Fs11227-009-0284-7#/page-1 (DAGMap: efficient and dependable scheduling of DAG workflow jobs in Grid)

Investigating workflow ideas

Workflow ideas (change the output link to): http://data.sciapps.org/results/test/readme.txt?jobid=n

  • jobid starts from 1, 2, … (in the right column)
  • the appid, inputid1, inputid2 that generated the output can be retrieved from the jobid
  • outputs are unique given the jobid and output filename
  • assuming output names are always the same no matter how inputs are changed, for any app
  • a workflow page will be built for constructing an automatic workflow
  • all steps have been run at least once
  • can delete nodes (failed analyses can be repeated later; not urgent)
  • can save
  • can run (archiving to brie, so no need to modify the Agave job submission)

  • can bring up used inputs and parameters for modification (parameter sweep)
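A sketch of the jobid-keyed output mapping described above (the jobs table and its field names are hypothetical):

// Outputs are addressed by (jobid, filename), per the link pattern above.
function outputUrl(jobid, filename) {
  return `http://data.sciapps.org/results/test/${filename}?jobid=${jobid}`;
}

// The app and inputs that produced an output are recoverable from the jobid.
const jobs = {
  1: { appid: 'GLM-5.1.23', inputids: ['input1', 'input2'] },
  2: { appid: 'AdjustP-0.0.1', inputids: [outputUrl(1, 'results.txt')] }
};

console.log(outputUrl(2, 'adjusted.txt'));
console.log(jobs[2].appid, jobs[2].inputids);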

Error when inputs have no metadata associated

For the workflow diagram, when there is no metadata associated with the inputs, clicking on an input makes the entire diagram unresponsive. The metadata associated with the app is not displayed either.
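A minimal sketch of a defensive fix; every name here is a hypothetical stand-in for the actual SciApps handlers:

// Guard against missing metadata so one bad input cannot freeze the diagram.
function onNodeClick(node) {
  if (!node || !node.metadata) {
    showMessage('No metadata associated with this input.');
    return; // keep the diagram responsive instead of failing mid-render
  }
  renderMetadataPanel(node.metadata);
}

// Stubs so the sketch runs standalone.
function showMessage(msg) { console.log(msg); }
function renderMetadataPanel(md) { console.log('metadata:', md); }

onNodeClick({ id: 'input1' }); // no metadata: message shown, no freeze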

Setup Drupal

Install Drupal modules on Brie for developing the MaizeCode web portal

Enable visualization with JBrowse

Configure JBrowse with data.sciapps.org.
Configure the Agave archived-results folder with JBrowse.
Visualize Maker outputs (GFF3), BAM, VCF, etc.
Support compressed formats on the server.
Create compressed Maker outputs.
Compress large folders as a tarball before archiving, then automatically uncompress.

Speed optimization

  1. Start caching the datastore automatically once the web page loads or refreshes
  2. Remember the last path the user selected (instead of starting from the root every time); a sketch follows this list
  3. Or, in the long term, create a static JSON for the MaizeCode data set
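A minimal sketch for item 2, persisting the last browsed path in localStorage (the storage key and function names are assumptions):

const LAST_PATH_KEY = 'sciapps.lastDataStorePath'; // hypothetical key

// Call whenever the user navigates in the file browser.
function saveLastPath(path) {
  localStorage.setItem(LAST_PATH_KEY, path);
}

// Start the browser where the user left off, falling back to the root.
function initialBrowsePath(defaultRoot) {
  return localStorage.getItem(LAST_PATH_KEY) || defaultRoot;
}

// Usage: openFileBrowser(initialBrowsePath('/iplant/home/shared'));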

Add direct link to each app

This is functional (done by Zhenyuan), and a URL rewrite is used to improve the direct link. However, it would be better to keep the pre-rewrite URL in the address bar. It would also be nice to have the link displayed in the address bar when an app is clicked; a sketch with the History API follows.
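A minimal sketch, assuming a form loader like the hypothetical loadAppForm below:

// Show the friendly /app_id/<id> URL instead of the rewritten ?app_id=<id>.
function showApp(appId) {
  loadAppForm(appId);
  history.pushState({ appId }, '', '/app_id/' + encodeURIComponent(appId));
}

// Restore the right app form on back/forward navigation.
window.addEventListener('popstate', event => {
  if (event.state && event.state.appId) loadAppForm(event.state.appId);
});

function loadAppForm(appId) { console.log('load form for', appId); } // stub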

Workflow and archiving

Archiving should be avoided when the subsequent step(s) run on the same server.

However, for cases like a-->b, a-->c, where a & b run on cloud A and c runs on cloud B, shall we archive results_a on cloud B or leave it on cloud A? The ideal solution is to leave it on cloud A: if it is archived to cloud B, it will need to be copied back to cloud A again to run b.

Here we are considering clouds, not individual servers. All servers at CSHL count as one cloud, similar to the TACC cloud and the UA cloud (main data store).

The workflow management system needs to query each app for its operating cloud.

Workflow re-launch points

For example, given a five-step workflow, to re-run from any step:

  • Step 1: re-run the entire workflow with different parameters/files
  • Step 2: choose steps 2, 3, 4, 5 in the right panel, build a new workflow, customize, then run
  • Step 3: choose steps 3, 4, 5 in the right panel, build a new workflow, customize, then run
  • Step 4: choose steps 4, 5 in the right panel, build a new workflow, customize, then run
  • Step 5: re-launch this (last) step directly from the right panel

The current assumption is that we always re-run the entire workflow; this will allow the user to re-run from any step of it.

Support login

For jobs:
Use Agave to support login, and use the user's credentials for running a job. Save tokens for checking job status and debugging.

For inputs:
At the same time, capture the user's username to browse the logged-in user's data in the iRODS-based datastore (ils /iplant/home/some-user/sci_data).

For outputs:
Save results back to /iplant/home/some-user/sci_data/analysis/

For security:
Tokens expire after 4 hours. The user must log out and log in again to renew the token.
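A client-side sketch of the 4-hour expiry handling; the storage key and field names are assumptions (the token itself comes from Agave's login flow):

// Track token age client-side; once the 4-hour lifetime is reached,
// drop the token so the UI sends the user back through login.
const TOKEN_TTL_MS = 4 * 60 * 60 * 1000;

function storeToken(token) {
  localStorage.setItem('sciapps.token',
    JSON.stringify({ token, issuedAt: Date.now() }));
}

function currentToken() {
  const raw = localStorage.getItem('sciapps.token');
  if (!raw) return null;
  const { token, issuedAt } = JSON.parse(raw);
  if (Date.now() - issuedAt >= TOKEN_TTL_MS) {
    localStorage.removeItem('sciapps.token'); // expired: force re-login
    return null;
  }
  return token;
}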

Test Case: Metagenomics pipeline

Rough ideas:

  1. Host copy of databases locally (update?).
  2. Build Agave pipeline or Docker based pipeline.
  3. Archive results back to remote database (or download).
  4. Metadata management.

Relative path problem

When browsing the 'Data Store', we need a variable (storagePath) to define the relative path.

For the CSHL storage system, $storagePath='data.sciapps.org/example_data'

For the CyVerse Data Store, $storagePath='/iplant/home/UserName/sci_data'; we also need to add a button to allow access to community data: /iplant/home/shared

To do this, we need to update the Agave system to point the root folder one level up from example_data. When browsing files from the App form, the user is directed to $storagePath and has no right to access data above $storagePath.

All workflows need to be rebuilt since the paths have changed.

'example_data' needs to be removed from filesInfo.js after these changes.
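A sketch of the mapping and the access guard (the system labels are hypothetical; the paths are the ones listed above):

// Root path per storage system, per this issue.
function storagePath(system, username) {
  switch (system) {
    case 'cshl':      return 'data.sciapps.org/example_data';
    case 'cyverse':   return `/iplant/home/${username}/sci_data`;
    case 'community': return '/iplant/home/shared';
    default: throw new Error('unknown storage system: ' + system);
  }
}

// Users may browse at or below $storagePath, never above it.
function isAllowed(system, username, path) {
  return path.startsWith(storagePath(system, username));
}

console.log(isAllowed('cyverse', 'alice', '/iplant/home/alice/sci_data/run1')); // true
console.log(isAllowed('cyverse', 'alice', '/iplant/home/alice')); // false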

Optimizing building workflow

  1. Add a checkbox before the number of each step, or at the right end
  2. Clicking 'Build a Workflow' opens a popup window; a workflow diagram is drawn from the checked jobs in the history
  3. If fewer than 2 jobs are checked, the popup window displays "Please select/check at least two jobs from the right column" (see the sketch after this list)
  4. The user can close the popup diagram, de-select one or more jobs, and redo 'Build Workflow'
  5. To save, the user needs to add a workflow name and a brief description (fields are needed in the popup window)
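A minimal sketch of the guard in item 3 (drawDiagram is a hypothetical stand-in for the popup renderer):

function buildWorkflow(checkedJobIds) {
  if (checkedJobIds.length < 2) {
    alert('Please select/check at least two jobs from the right column');
    return;
  }
  drawDiagram(checkedJobIds); // render the popup workflow diagram
}

function drawDiagram(jobIds) { console.log('diagram for jobs', jobIds); } // stub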

Resolve mod_perl conflicts between production and development (on the same server)

Use cases:

  1. sciapps.org: prod, dev1, and dev2 are mixed together because of mod_perl. Zhenyuan has two separate local gits for prod and dev1: he commits new developments to dev1, then does a git pull from the same GitHub repo (warelab/MaizeCode) to update prod. Liya commits new developments to dev2 and then to prod; he does a git pull from warelab/MaizeCode to merge Zhenyuan's new developments into dev2. The problem occurs at the mod_perl level, since our code adopted Cornel's Perl SDK for Agave. We also use Perl Dancer, but it is not related to the problem.
  2. dnasubway.org: uses plain old HTML::Mason on top of mod_perl. Cornel's dev environment is a VM on his laptop, so he can work anywhere without an internet connection (unless working on Agave stuff, which needs a connection). He also has a VM running on balaur using libvirtd for the live site. There is only the live site, no dev. Pan has her own VM.

Solutions:

  1. Run several Apache instances on different ports. (We didn't adopt this.)
  2. Run each VirtualHost with PerlOptions +Parent. This gives each VirtualHost its own Perl interpreter. (We adopted this one; a sketch follows.)
    For more details see: http://www.gossamer-threads.com/lists/modperl/modperl/98162
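Applied to a VirtualHost like the demo configuration in the README, the adopted fix is a single directive; the dev hostname below is hypothetical, and only the PerlOptions line is the actual change:

<VirtualHost *:443>
    ServerName dev1.sciapps.org
    # Give this VirtualHost its own Perl interpreter pool, so the prod
    # and dev code trees no longer share mod_perl state.
    PerlOptions +Parent
    ...
</VirtualHost>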

Updating Maker

Maker updates (update the app json, then push to SciApps dev2, then SciApps prod):

  1. Mark the genome sequence as required (true for both annotation and re-annotation); to do
  2. Change the 'otherEvidences' label to 'Maker derived evidence' (so that the example workflow still works); to do
  3. Add more parameters to control options for re-annotation? Low priority

Re-route URL input

Currently there is a bug in Agave that modifies files uploaded by URL. One solution is to replace the input URL with an Agave path. This solves the problem for data hosted on an Agave system (e.g., brie); a sketch follows.
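A sketch of the rewrite, assuming the hosting server is registered as an Agave storage system; the system id used here is hypothetical:

// Replace an http URL with the equivalent agave:// path when the file
// lives on a storage system Agave already knows about.
function rewriteUrlInput(url) {
  const m = url.match(/^https?:\/\/data\.sciapps\.org\/(.+)$/);
  return m ? 'agave://data.sciapps.org/' + m[1] : url; // hypothetical system id
}

console.log(rewriteUrlInput('http://data.sciapps.org/results/test/readme.txt'));
// -> agave://data.sciapps.org/results/test/readme.txt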

Wrap Text in Nodes of a Diagram for long names

One idea is to replace a longlonglongname with longlon... before passing it to mermaid.

Another idea (?) is to replace it with
longlon-
glongna-
me

The second idea might be better if it is wrapped just once:
longlong-
longname

The first idea is simple, and also fine if we can display the full name when clicked (in the metadata section).
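A sketch of the first idea (the truncation width is an arbitrary choice):

// Shorten long node labels before handing the graph to mermaid; the
// full name stays available for the metadata section on click.
function shortLabel(name, max = 10) {
  return name.length <= max ? name : name.slice(0, max - 3) + '...';
}

console.log(shortLabel('longlonglongname')); // "longlon..."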

Adding a dropdown menu for workflows

Change the workflow link in the header to a dropdown menu.

  1. Build a workflow: this brings up the workflow-building page. The user needs to have already run each step of the workflow at least once. There are a few tricks:
    a. Run some simple existing workflows; the history of each step will appear in the right column. The user can then build complicated workflows from them. This is similar to chaining several sub-workflows together.
    b. For batch processing, e.g., running bwa on 10 samples, the user can run it on 1 sample (or a fraction), then build a 10-step workflow to batch-process all 10 samples.
  2. Load a workflow: this brings up the workflow-loading page, which can load a workflow in JSON format.
  3. Example workflows: this brings up the example workflow page, which has links to download the JSON for example workflows and links to load each of the example workflow pages.

Reduce dependency on Agave interaction

Need to talk to Dave and Cornel to see whether we can simplify the Agave services and build our own servers for the federated system. This would also help resolve the issue that Agave fails to recognize the local iRODS-based storage system (aka resource server). In addition, a standalone fork might allow us to build more efficient workflows.

Need a search box for app

  1. The search should filter the list of apps, then display the results by category in the left column (with all categories expanded/opened); a filter sketch follows this list
  2. The search box should appear under the 'Apps' header
  3. The search term can be cleared to re-list all categories
  4. It would be nice to have 'auto-complete' for search terms
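A minimal filter sketch over the agaveAppsList.json entry shape ({ id, label, tags }) shown in the README:

// Filter apps by label, id, or tag; an empty query re-lists everything.
function filterApps(apps, query) {
  const q = query.trim().toLowerCase();
  if (!q) return apps; // cleared search: show all categories again
  return apps.filter(app =>
    app.label.toLowerCase().includes(q) ||
    app.id.toLowerCase().includes(q) ||
    (app.tags || []).some(t => t.toLowerCase().includes(q)));
}

console.log(filterApps(
  [{ id: 'Bismark-0.14.4', label: 'Bismark', tags: ['Methylation'] }],
  'methyl'));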

Example workflow for developing the workflow script

Here is a four-step workflow, where XYPlot2 branches off GLM directly:

GLM --> AdjustP --> XYPlot1
GLM --> XYPlot2 (not dependent on AdjustP)

Output folders:
GLM: http://data.sciapps.org/results/glm-tassel-5-1-23-0fJGkCBl8B/
AdjustP: http://data.sciapps.org/results/adjustpvalue-0-0-1-Tx0ryhviOK/
XYPlot1: http://data.sciapps.org/results/xyplot-0-0-2-wyGcfBpaoL/
XYPlot2: http://data.sciapps.org/results/xyplot-0-0-2-O9ULw4qog4/

One thing we ignored before is that Agave might fail for whatever reason, so the workflow engine will need to check outputs (is that possible?) and re-submit the job if it failed; a sketch follows.
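A sketch of that check-and-retry loop; submitJob and outputsExist are hypothetical stand-ins for the actual Agave calls:

// Check a step's outputs after submission and re-submit on failure,
// up to a bounded number of retries.
async function runStepWithRetry(step, submitJob, outputsExist, retries = 1) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    const job = await submitJob(step);
    if (await outputsExist(job)) return job; // expected outputs are present
  }
  throw new Error(`step ${step.id} failed after ${retries + 1} attempts`);
}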

Enable browsing data

Before we have a GUI to select files from the Data Store, we need a way to take the file path from browsing and send it to an Agave app for analysis.

User analysis management

Use CyVerse authentication (might need to pay for https):
https://auth.iplantcollaborative.org/cas4/login?service=http://sciapps.org/

We will capture the username to display user data (we will add the username to our database (1st name); a cookie will be saved in the browser):
/iplant/home/username/sci_data (can go up just one level)
/iplant/home/shared (can't go up)

On the SciApps side, we will add user account management:
Job history: save to the database, load from the database (last 20, then next 20?), delete, search
Workflows: list of workflows for reloading, downloading, saving, deleting, searching
Data: not for now, but will be useful for organizing into experiments with metadata

On the SciApps side, job execution:
Use the maizecode account: users cannot run their private apps unless shared with maizecode; might be slow since the maizecode user runs all jobs.
Use the user's account: capture the user secret at authentication? Doable? Ask Nirav/Dennis/Tony/Edwin. More intuitive, since users can check their jobs via the Agave command-line SDK. This no longer needs write permission to the sci_data folder; results still need to go into the user's sci_data folder, which must be readable for building workflows (make an analysis folder under sci_data?).

TRAM API: display results in the right column (does this work for non-public but shared-readable data?)

Add example job id to workflow

When a workflow is generated, example job ids need to be added to the JSON file. When a workflow is loaded, the example output should be loaded into the History column.

In cases where output folders have been removed or the example job id field is empty, no history, or only partial history (some but not all steps), will be loaded.

Test case: Maker for annotation

A complete pipeline with two apps: Maker and SNAP.
Step 0. Run STAR on the RNA-seq data to build the ESTs.
Step 0. BLAST the UniProt/SwissProt protein database or the NCBI NR protein database for protein evidence.
Step 1. Given the ESTs and proteins, run Maker to get GFF1.
Step 2. Run SNAP with GFF1 to estimate HMM1.
Step 3. Run Maker again with HMM1 to get GFF2 (no EST/protein).
Step 4. Run SNAP with GFF2 to re-estimate HMM2.
Step 5. Run Maker again with HMM2 to get GFF3 (final, for JBrowse).

Questions:

  1. The above pipeline ignores repeats and other gene predictors.
  2. Is it more efficient to keep them as one app?
  3. Check the three option files to optimize data transfer.
  4. Does Maker take compressed files?

Fix Agave timezone bug

The Agave time zone handling is messed up:
Example 1:
Submitted on: 8/18/2016, 10:56:20 AM
Started on: 8/18/2016, 9:56:20 AM
Finished on: 8/17/2016, 8:11:38 PM
Real job finished time (NY): Aug 17 2016 7:11 PM

Example 2:
Submitted on: 8/18/2016, 2:58:46 AM
Started on: 8/18/2016, 1:58:47 AM
Finished on: 8/17/2016, 2:04:27 PM
Real job finished time (NY): Aug 17 2016 1:04 PM

Contacting Rion for a possible solution. Otherwise, fix it manually on our side? A client-side sketch follows.
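If the fix has to happen on our side, one workaround, assuming the raw API timestamps are actually UTC in ISO 8601 with the zone designator dropped (this needs to be confirmed against the examples above), is to reattach the designator and let the browser localize:

// Treat a zone-less Agave timestamp as UTC and render it in the
// viewer's local time zone. Assumption: the API drops a trailing "Z".
function localizeAgaveTime(isoNoZone) {
  return new Date(isoNoZone + 'Z').toLocaleString();
}

console.log(localizeAgaveTime('2016-08-17T20:11:38'));
// e.g. "8/17/2016, 4:11:38 PM" when viewed from America/New_York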

Re-launch a job

Add a "Relaunch" button in the right column between Status and Results to bring up the app form with set parameters.

In the short term, just keep the set parameters (not reset to defaults) after submission
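A minimal sketch of the short-term behavior (all names are hypothetical):

let lastSubmitted = null; // cache of the most recent submission's values

function onSubmit(formValues) {
  lastSubmitted = { ...formValues };
  // ...the existing job submission would happen here...
}

// Repopulate the app form instead of resetting it to defaults.
function relaunch(populateForm) {
  if (lastSubmitted) populateForm(lastSubmitted);
}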

Optimizing displaying outputs in the right panel

Right now, we list outputs with Agave files-list via the job id, which is encoded with the app name.

on brie: dev2.sciapps.org/results/job-folder
on halcott: data.sciapps.org/results/job-folder

Since we have already loaded the job JSON into the browser, we know the results folder name; can we bypass Agave? This assumes we can list the contents of a web folder.

On de.sciapps.org:

We will use ils, so it's much simpler and we can definitely bypass Agave.
