iusca / bioloop

Scientific data management portal and pipeline application template

License: Other

Shell 0.63% Dockerfile 0.47% JavaScript 37.58% HTML 0.08% Vue 46.83% CSS 0.56% Python 13.85%
data-delivery data-management data-management-platform pipeline research science workflow workflow-automation

bioloop's Introduction

Bioloop

Data Management Portal and Pipeline Application for Research Teams

Overview

Bioloop is a web-based portal that simplifies the management of large-scale datasets shared among research teams in scientific domains. The platform optimizes data handling by using both cold and hot storage solutions, such as tape and disk, to reduce overall storage costs.

Key Features:

  1. Project-Based Organization: Data is assigned to projects, allowing collaborators to work within specific project environments, ensuring data isolation and efficient collaboration.

  2. Data Ingestion: Bioloop simplifies data ingestion by offering automated triggers for instrument-based data ingestion and supporting manual uploads for datasets.

  3. Data Provenance Tracking: Bioloop tracks data lineage, recording the origin of raw datasets and their subsequent derived data products, promoting data transparency and accountability.

  4. Custom Pipelines: Bioloop allows custom data processing pipelines, leveraging Python's Celery task queue system to efficiently scale processing workers as needed.

  5. Secure Downloads: Bioloop ensures data security with token-based access mechanisms for downloading data, restricting access to authorized users.

  6. Microservice Architecture: Bioloop utilizes Docker containers for effortless deployment, allowing flexibility between local infrastructure and public cloud environments.

Getting started

Dependencies

Bioloop leverages a few other projects to get up and running.

Architecture


bioloop's People

Contributors

charlesbrandt, deepakduggirala, dependabot[bot], karthiek390, ri-pandey, ryanlong89


bioloop's Issues

CSS class conflicts

CSS conflict for .mt-x between Tailwind and Vuestic.

Vuestic has a margin-top: calc(var(--va-grid-spacing-base) * 3)!important; rule somewhere.


UI performance and accessibility

Lighthouse Report cpa.sca.iu.edu_2023-06-30.html.txt

  • Enable text compression - Nginx config
  • Serve static assets with an efficient cache policy - Nginx config
  • Reduce unused JavaScript - see rollup dependency stats.html (need to completely remove moment.js and moment-timezone)
  • Eliminate render-blocking resources - ??

Accessibility:

  • Image elements do not have [alt] attributes
  • Background and foreground colors do not have a sufficient contrast ratio.

Develop a workflow view

  • compact workflow view / task view?

    • Name / id / description
    • progress + status + times
      • if active, show progress bar, elapsed & ETA
      • else, show status, start date, run duration (problematic with resumed workflows)
      • last updated?
  • detailed workflow view (expanded) / task view?

    • info tab showing args, hostname, etc.
    • results tab
    • logs tab (stdout + stderr)

Where to show task retries?

How to show previous task runs?

Audit for the workflow object:

  • change in status (events in the state machine)
  • action events
    • create
    • pause
    • resume

Stats / Tracking / Analytics section

CMG has a Stats / Tracking page for operators to see things like:

  • Number of downloads / day
    • Download mechanism used (Direct download vs Slate path access)
  • Total number of users
  • Total amount of data being managed by the system (exists on Dashboard already)


Some features not currently available in CMG that would be helpful to include are:

  • "Most staged" files (list of datasets sorted by total number of times a stage action has completed)
  • Total number of stage requests
  • Most downloaded files

Nice-to-haves could be:

  • bandwidth utilized per user (# downloads per user * download file size) - could potentially show top 10/20 users in terms of bandwidth
  • total space / project (integrate with project directly?)

In CMG, we were not able to track actual downloads, so we only track button clicks for those links. Now that we have the secure download server (#31), we should have better visibility into actual download actions.

It may be possible to combine this with the Dashboard page, and this may be related to #39

Operators - Users page

As an operator, I should be able to:

  • view users and their data
  • create new users with the user role (but not users with operator or admin roles); the Role field is pre-filled and disabled in the user form
  • edit users with user roles (but not users with operator or admin roles)

Operators should not be able to:

  • see the sudo user button

API changes are required.

API docker build - pnpm not installing new packages

After installing a new package (e.g., pnpm i multer), package.json and pnpm-lock.yaml are updated. However, pnpm fetch does not install these new packages, and the API fails to start.

Deleting the Docker volume and doing a fresh pnpm fetch is the workaround.

Worker - Stage - make it lazy + idempotent

Check the current state of the system before executing any changes, and only perform those changes that are necessary to bring the system to the desired state.

If the files are already on disk, do not download and extract them.
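
A minimal sketch of what the lazy check might look like; download_and_extract and the per-file checksum list are hypothetical stand-ins for the worker's actual helpers:

import hashlib
from pathlib import Path

def md5sum(path: Path) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(2 ** 20), b""):
            h.update(chunk)
    return h.hexdigest()

def stage(dataset: dict, dest: Path) -> None:
    # Desired state: every registered file exists on disk with a matching checksum.
    files = dataset["files"]  # assumed shape: [{"path": ..., "md5": ...}, ...]
    if all(
        (dest / f["path"]).is_file() and md5sum(dest / f["path"]) == f["md5"]
        for f in files
    ):
        return  # already staged; do nothing (idempotent)
    download_and_extract(dataset, dest)  # hypothetical: fetch from SDA and unpack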

Make content grid responsive

In #19, we discovered that the content grid is not responsive enough on extra small screen sizes.

As an example, on extra small screen widths the content on the Dashboard (/dashboard) page starts overlapping instead of falling into a neat single-column grid.

One way of avoiding this overlap would be to have the open sidebar not shift the main content to the right on smaller screens. This is how the Vuestic website does it. To see this in action, open the Vuestic website in mobile view (which should collapse the left sidebar), then open the left sidebar. You will note that the main content does not shift to the right when the sidebar is opened.

API Security

  • CSRF (with axios)
  • CORS
  • rate limiting

API Security Checklist

Checklist of the most important security countermeasures when designing, testing, and releasing your API.


Authentication

  • Don't use Basic Auth. Use standard authentication instead (e.g., JWT).
  • Don't reinvent the wheel in Authentication, token generation, password storage. Use the standards.
  • Use Max Retry and jail features in Login.
  • Use encryption on all sensitive data.

JWT (JSON Web Token)

  • Use a random complicated key (JWT Secret) to make brute forcing the token very hard.
  • Don't extract the algorithm from the header. Force the algorithm in the backend (HS256 or RS256); see the sketch after this list.
  • Make token expiration (TTL, RTTL) as short as possible.
  • Don't store sensitive data in the JWT payload, it can be decoded easily.
  • Avoid storing too much data. JWT is usually shared in headers and they have a size limit.
  • Problems with JWT
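
A minimal sketch of pinning the algorithm server-side with PyJWT (token and secret are assumed inputs):

import jwt  # PyJWT

def verify(token: str, secret: str) -> dict:
    # algorithms=["HS256"] forces the algorithm; a token whose header claims
    # alg=none or any other algorithm is rejected.
    return jwt.decode(token, secret, algorithms=["HS256"])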

Access

  • Limit requests (Throttling) to avoid DDoS / brute-force attacks.
  • Use HTTPS on server side with TLS 1.2+ and secure ciphers to avoid MITM (Man in the Middle Attack).
  • [?] Use HSTS header with SSL to avoid SSL Strip attacks.
  • Turn off directory listings.
  • [?] For private APIs, allow access only from safelisted IPs/hosts.

Authorization

OAuth

  • Always validate redirect_uri server-side to allow only safelisted URLs.
  • Always try to exchange for code and not tokens (don't allow response_type=token).
  • Use state parameter with a random hash to prevent CSRF on the OAuth authorization process.
  • Define the default scope, and validate scope parameters for each application.

Input

  • Use the proper HTTP method according to the operation: GET (read), POST (create), PUT/PATCH (replace/update), and DELETE (to delete a record), and respond with 405 Method Not Allowed if the requested method isn't appropriate for the requested resource.
  • Validate content-type on request Accept header (Content Negotiation) to allow only your supported format (e.g., application/xml, application/json, etc.) and respond with 406 Not Acceptable response if not matched.
  • Validate content-type of posted data as you accept (e.g., application/x-www-form-urlencoded, multipart/form-data, application/json, etc.).
  • Validate user input to avoid common vulnerabilities (e.g., XSS, SQL-Injection, Remote Code Execution, etc.).
  • Don't use any sensitive data (credentials, passwords, security tokens, or API keys) in the URL, but use the standard Authorization header.
  • Use only server-side encryption.
  • Use an API Gateway service to enable caching, Rate Limit policies (e.g., Quota, Spike Arrest, or Concurrent Rate Limit) and deploy APIs resources dynamically.

Processing

  • Check if all the endpoints are protected behind authentication to avoid broken authentication process.
  • User own resource ID should be avoided. Use /me/orders instead of /user/654321/orders. - both are okay
  • Don't auto-increment IDs. Use UUID instead.
  • If you are parsing XML data, make sure entity parsing is not enabled to avoid XXE (XML external entity attack).
  • If you are parsing XML, YAML or any other language with anchors and refs, make sure entity expansion is not enabled to avoid Billion Laughs/XML bomb via exponential entity expansion attack.
  • Use a CDN for file uploads.
  • If you are dealing with a huge amount of data, use Workers and Queues to process as much as possible in the background and return the response fast to avoid HTTP blocking.
  • Do not forget to turn the DEBUG mode OFF.
  • Use non-executable stacks when available.

Output

  • Send X-Content-Type-Options: nosniff header.
  • Send X-Frame-Options: deny header.
  • Send Content-Security-Policy: default-src 'none' header.
  • Remove fingerprinting headers - X-Powered-By, Server, X-AspNet-Version, etc.
  • Force content-type for your response. If you return application/json, then your content-type response is application/json.
  • Don't return sensitive data like credentials, passwords, or security tokens.
  • Return the proper status code according to the operation completed. (e.g., 200 OK, 400 Bad Request, 401 Unauthorized, 405 Method Not Allowed, etc.).

CI & CD

  • Audit your design and implementation with unit/integration tests coverage.
  • Use a code review process and disregard self-approval.
  • Ensure that all components of your services are statically scanned by AV software before pushing to production, including vendor libraries and other dependencies.
  • Continuously run security tests (static/dynamic analysis) on your code.
  • Check your dependencies (both software and OS) for known vulnerabilities.
  • Design a rollback solution for deployments.

Monitoring

  • Use centralized logging for all services and components.
  • Use agents to monitor all traffic, errors, requests, and responses.
  • Use alerts for SMS, Slack, Email, Telegram, Kibana, Cloudwatch, etc.
  • Ensure that you aren't logging any sensitive data like credit cards, passwords, PINs, etc.
  • Use an IDS and/or IPS system to monitor your API requests and instances.

Implement delete user

Generally it is preferable to disable user accounts via the Modify User modal. Operators should have this ability.

In some cases, it may be necessary to delete a user record altogether. Only Admin users should have this ability. The action should be triggered via the red trashcan icon (existing).

Dark mode doesn't work on some components

After switching the theme to Dark Mode, the Footer still shows up with a white background.

The "There are no active workflows." alert also shows up with a white background when Dark Mode is toggled on. This alert is visible on the homepage when running the app locally without a connection to the workflow service.

Build secure download server

Secure Download Server
New UI based on Google Drive

  • Permission data: postgres app database (app.sca.iu.edu)
  • Actual files: scratch (colo node)

Option 1: hosted on colo23

  • Authenticate users
  • Resolve permissions by talking to the API for a particular app
    • The app API should have a consistent route for resolving permissions: 200 OK / 403 Forbidden
  • Resolve the project_uuid to a path on scratch - 2 options:
    • hardcoded + symlink
    • API call which responds with the actual path on scratch
  • NGINX X-Accel header sends the file / UI JavaScript initiates the browser download

Pros:

  • Common UI for all apps

Cons:

  • It needs to know which app to talk to via URL / app hostnames / scratch configs
  • Extra auth for service-to-service

Option 2: Bundle this file tree UI with apps (import component)

  • Construct a file tree from the postgres dataset_files table
  • There is no extra API to resolve permissions and file paths
  • User clicks on a file to download; the API creates a token + URL, and the UI makes a CORS request to the NGINX API (nginx + express API) hosted on colo23
    • File: /abc/123.fastq.gz
    • Token:
      • contains the URL: asd12412sad/abc/123.fastq.gz
      • token expiry time: 1 minute
      • signed with the API's private key
      • includes the public key id
  • The nginx API on colo23 validates the token (with the public key) and serves that file
  • Nginx: resource server / resource owner

OAuth 2.0 authz server: Ory Hydra
Flow: OAuth 2.0 client credentials

Pros:

  • App-API driven: no overhead on permissions and path resolutions
  • Stats on who downloaded what

Constraints:

  • File tree is static
  • Allows searching for files without staging
  • Allows staging individual files from a given archive

Pros (for both):

  • User-level permissions on project downloads
  • Revoking permission applies immediately
  • Multi-modal authentication: IU, Google, Apple

Cons (for both):

  • No wget
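
A minimal sketch of the Option 2 token flow using PyJWT with RS256. The key paths, claim names, and helper names are assumptions for illustration, not the project's actual code; the 1-minute expiry follows the notes above:

from datetime import datetime, timedelta, timezone

import jwt  # PyJWT

PRIVATE_KEY = open("download_private.pem").read()  # held by the app API
PUBLIC_KEY = open("download_public.pem").read()    # held by the nginx API on colo23

def issue_download_token(file_path: str) -> str:
    # The app API signs a short-lived token naming the file to be served.
    claims = {
        "path": file_path,  # e.g. /abc/123.fastq.gz
        "exp": datetime.now(timezone.utc) + timedelta(minutes=1),  # 1-minute expiry
    }
    return jwt.encode(claims, PRIVATE_KEY, algorithm="RS256")

def validate_download_token(token: str) -> str:
    # The nginx-side API verifies signature and expiry with the public key,
    # then serves the file (e.g., via an X-Accel-Redirect response).
    claims = jwt.decode(token, PUBLIC_KEY, algorithms=["RS256"])
    return claims["path"]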

FileBrowser: Add state(view) transitions to the browser history

FileBrowser: https://github.com/IUSCA/bioloop/blob/main/ui/src/components/filebrowser/README.md

Add file browser navigation & search state to the browser history by manipulating the browser URL query parameters.

The user should be able to use the browser back button to go back to the previous directory / previous search result after a navigation / search action.

The user should be able to use the URL to load the page again in the same state (same pwd / same search filters).

Speed up Inspect and Validate steps

Use multiprocessing to leverage multiple CPUs in colo nodes and parallelize the computation of file checksums.

Distribute the files among many processors to run the same function:

  • Processes must consume the files from a queue. This is because the time to compute a checksum depends on file size, and file sizes in a dataset are not uniform, so giving each process a fixed number of files is not ideal. Files could instead be partitioned by size (and by the number of available processors), with each process given a partition, but this is complicated.
  • Make processes "nice" (linux) so that they do not hog resources and slow other running processes.

Tracking progress will be difficult as multiple processors are completing work.
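
A minimal sketch of the queue-based approach with Python's multiprocessing; the pool size and niceness value are assumptions:

import hashlib
import os
from multiprocessing import Pool

def _lower_priority():
    os.nice(10)  # make worker processes "nice" so they don't hog the node

def md5sum(path: str) -> tuple[str, str]:
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(2 ** 20), b""):
            h.update(chunk)
    return path, h.hexdigest()

def checksum_files(paths: list[str], workers: int = 8) -> dict[str, str]:
    # imap_unordered feeds the pool from a shared task queue, so each process
    # pulls the next file as soon as it finishes one; no static partitioning
    # by file count or size is needed.
    with Pool(processes=workers, initializer=_lower_priority) as pool:
        return dict(pool.imap_unordered(md5sum, paths))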

Explore storing checksum of bundles in the dataset object to reduce computation

Here, "bundle" means the tar, tar + zip, or tar + zip + enc of the dataset directory.

In the archive step, after making the bundle to upload to the SDA, save its checksum to the database.

This will be useful to avoid computing the checksum of the local bundle file, provided that only the application has write permissions on the location of these bundles.

Try storing the file metadata (mtime, size) for a quick check?
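
A minimal sketch of the quick check; the stored-record field names are assumptions:

import os

def bundle_unchanged(path: str, stored: dict) -> bool:
    # If size and mtime match what was recorded at archive time, trust the
    # stored checksum instead of re-reading the whole bundle.
    st = os.stat(path)
    return st.st_size == stored["size"] and int(st.st_mtime) == stored["mtime"]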

Workers - Log when API retries

2023-05-29 17:51:09,674 - ERROR - base.py:98 - Thread-1 - Failed to contact API for work, (Attempt 1/5)
2023-05-29 17:51:09,674 - ERROR - base.py:247 - MainThread -
2023-05-29 17:51:09,674 - ERROR - base.py:101 - Thread-1 - Retrying in 10 seconds
2023-05-29 17:53:20,745 - ERROR - https.py:20 - MainThread - HTTPSConnectionPool(host='cmg.sca.iu.edu', port=443): Max retries exceeded with url: /api/workers/5fe21439ecbe3fc1bcb6aaf2 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f25a3a324c0>: Failed to establish a new connection: [Errno 110] Connection timed out'))

When the workers try to reach the express API and fail, log that error as well as when the next retry will happen.

Fix paths in READMEs

The README.md files in the project reference paths to files that appear to be outdated. These paths should be updated to reflect the correct paths of these files.

Registration - Show the wait for no recent activity

When the register worker finds a new dataset, it usually waits for some time to make sure nothing has changed in the folder. After this, the dataset gets registered, and only then does it show up in the UI.

This doesn't give immediate feedback and often confuses the operators.

Register the dataset as soon as it is found, but only trigger the workflow after the no-recent-activity wait time. Also show a message in the UI that the system is waiting for x more minutes before it starts processing.
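
A minimal sketch of the proposed flow; every helper here (register_dataset, last_modified, set_ui_message, trigger_workflow) is a hypothetical stand-in:

import time

STABILITY_WAIT = 15 * 60  # assumed: seconds of no recent activity required

def on_new_dataset(path: str) -> None:
    dataset = register_dataset(path)  # register immediately so it shows up in the UI
    while (idle := time.time() - last_modified(path)) < STABILITY_WAIT:
        remaining_min = (STABILITY_WAIT - idle) / 60
        set_ui_message(dataset, f"waiting {remaining_min:.0f} more minutes before processing")
        time.sleep(60)
    trigger_workflow(dataset)  # start processing only after the quiet period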

FileBrowser: Implement filtering by directory sizes

FileBrowser: https://github.com/IUSCA/bioloop/blob/main/ui/src/components/filebrowser/README.md

API:
As the file metadata is immutable, it is feasible to recursively calculate the directory sizes using the graph data structure used to insert the data into the DB.
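
A minimal sketch of the calculation over a flat file list (the input shape is an assumption):

from collections import defaultdict
from pathlib import PurePosixPath

def dir_sizes(files: list[tuple[str, int]]) -> dict[str, int]:
    # files: (relative path, size in bytes); every ancestor directory of a
    # file accumulates that file's size.
    sizes: dict[str, int] = defaultdict(int)
    for path, size in files:
        for parent in PurePosixPath(path).parents:
            sizes[str(parent)] += size
    return dict(sizes)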

UI
In browser view:

  • should be able to sort directories by size
  • group directories and files separately while sorting by size - similar to sorting by name

In Search View:

  • should be able to filter directories by min and max size

Filebrowser: Allow download of full tar file

Now that we have the ability to download individual files in a dataset (#31) (!!) we have received a request to also be able to download the complete dataset. We can use the same file that is archived on the SDA for this purpose. Once we stage the tar file, we can keep it around for direct downloading. Downloading the tar file should be protected by the same secure download features as individual files.

This could be added as a third option on the "Data Access Options" modal (screenshot attached to the issue).

UI Base features

TODO

  • Refactor api/prisma/schema.js - split data and code
  • Sync bioloop/workers and cpa/workers

Prisma - Use Referential actions

https://www.prisma.io/docs/concepts/components/prisma-schema/relations/referential-actions

To delete a dataset, all the associated relations have to be deleted first; alternatively, this can be specified in the schema.prisma config with referential actions (e.g., onDelete: Cascade on the relation fields).

// Current approach: delete every associated relation and the dataset itself
// in one transaction. With onDelete: Cascade referential actions in
// schema.prisma, most of these explicit deletes become unnecessary.
async function hard_delete(id) {
  const deleteFiles = prisma.dataset_file.deleteMany({
    where: {
      dataset_id: id,
    },
  });
  const deleteWorkflows = prisma.workflow.deleteMany({
    where: {
      dataset_id: id,
    },
  });
  const deleteAudit = prisma.dataset_audit.deleteMany({
    where: {
      dataset_id: id,
    },
  });
  const deleteStates = prisma.dataset_state.deleteMany({
    where: {
      dataset_id: id,
    },
  });
  const deleteAssociations = prisma.dataset_hierarchy.deleteMany({
    where: {
      OR: [
        {
          source_id: id,
        },
        {
          derived_id: id,
        },
      ],
    },
  });
  const deleteDataset = prisma.dataset.delete({
    where: {
      id,
    },
  });

  await prisma.$transaction([
    deleteFiles,
    deleteWorkflows,
    deleteAudit,
    deleteStates,
    deleteAssociations,
    deleteDataset,
  ]);
}

FileBrowser: Implement sorting in search view

FileBrowser: https://github.com/IUSCA/bioloop/blob/main/ui/src/components/filebrowser/README.md

User should be able to sort the search results by name, size, and filetype. Follow the multisort strategy used in browser view to group files and directories separately.

User should be able to see the number of search results.

Optional:
As of now, the number of search results is limited to 1000 to prevent sending too much data to the browser. The table is virtual; debate the use of lazy loading vs. pagination to show the next batch of results beyond the initial 1000 (if such a feature is required).

Develop a Workflow Manager Page and Workflow Summary component for Dashboard

In addition to showing active workflows and current resource utilization, the dashboard should show the status of workers, regardless of whether they are active. This is useful when troubleshooting issues to know, e.g., when a worker was last seen by the API and which workers are configured for the system. This feature would only show workers configured for a specific instance of bioloop, not all workers across all instances (as may be the case in #39?).

In CMG, this functionality looks like this: (screenshot attached to the issue)

Dark mode

  • persist to local storage
  • create a dark mode switcher component and add it to the header

Worker - TODO

Workers:

  • embellish workflow, step, and task objects with explicit statuses, start and end times, progress, etc., which are currently stored implicitly in the order of tasks and in task attributes.
  • docker for workflow server
  • pm2 config for daemon processes - auto start, background
  • add /health
  • If the API call fails at the end of the task, do not retry the entire task. This could happen when the API is momentarily down.
  • isolate workflow steps into a config (todo in api too)
  • use a hook to write task status to the result backend as soon as the task starts; otherwise, a task object is not created in the result backend until the task succeeds, fails, or sends a status update from task code.
  • In workflow code: handle gracefully if a task is not found
  • Enable pm2 to start services automatically if the server restarts. - not possible in colo23?
  • Change the get-workflow API: instead of last_task_run and prev_task_runs, return task_runs as an array sorted chronologically. Change the query param to num_task_runs, with -1 representing all tasks.
  • workflow resume: if the pending step is not the first step, and it has never run before, then get the args from the previous step.
  • celery: read more about inbuilt workers in celery that clean up tasks periodically
  • celery config for long-running tasks: a worker cannot claim / prefetch, acks late, etc.
    • documentation on resiliency - task level, worker level
  • Design WorkflowTask to not retry on a custom exception (NonRetryableException); see the sketch after this list - https://github.com/celery/celery/blob/b4d23f290713ebea25ab517d9f980ae542885577/celery/app/autoretry.py#L14
  • Make a list of all the options that can be configured on the task, and how to overwrite that config on the WorkflowTask.
  • Update task code without restarting the celery worker
  • Application specific queues
  • Standardize status and status sets names across layers
  • The error which caused a task to be retried is lost when the task sends progress updates / on success.
  • Add an option to reject a task and mark it as to-be-skipped. When the workflow resumes, it should execute the next step. (This is helpful when the task was completed manually and the workflow has to complete the remaining steps.) When the task is truly lazy, it will see that the desired state is already reached, do nothing, and complete; but for non-lazy tasks, this option is required.
  • Await-stability being a task is not an efficient design. When there are multiple concurrent data ingestions, most of the celery tasks will be idle waiting for stability of some datasets while the inspect and archive tasks of other, ready datasets wait in the queue. Move the await-stability functionality to watch, where all the datasets can be checked at once in one loop.
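
A minimal sketch of the non-retryable design, overriding Task.__call__ on a base class. The class and exception names are assumptions, and it relies on Celery invoking the overridden __call__ with the task request in place:

from celery import Celery, Task

app = Celery("workers")

class NonRetryableException(Exception):
    """Raised by task code for failures that a retry cannot fix."""

class WorkflowTask(Task):
    max_retries = 3
    default_retry_delay = 10  # seconds

    def __call__(self, *args, **kwargs):
        try:
            return super().__call__(*args, **kwargs)
        except NonRetryableException:
            raise  # fail immediately, skipping the retry machinery
        except Exception as exc:
            raise self.retry(exc=exc)  # retry everything else

@app.task(base=WorkflowTask, bind=True)
def inspect(self, dataset_id):
    ...  # task body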

Logging

  • use celery logger
  • log file management, rotation for each task
  • Include filename and line number - celery logger formatter
  • use python logger instead of print in register.py and app.py

Config

  • hierarchical config - for dev and prod envs
  • refactor celeryconfig to use data from main config

Workers - Inspection and Validation Steps - Use queue based multiprocessing to speed up

Use multiprocessing to leverage multiple CPUs in colo nodes and parallelize the computation of file checksums.

Distribute the files among many processors to run the same function:

Processes must consume the files from a queue. This is because the time to compute a checksum depends on file size, and file sizes in a dataset are not uniform, so giving each process a fixed number of files is not ideal. Files could instead be partitioned by size (and by the number of available processors), with each process given a partition, but this is complicated.
Make processes "nice" (linux) so that they do not hog resources and slow other running processes.

Tracking progress will be difficult as multiple processors are completing work. (See the multiprocessing sketch under "Speed up Inspect and Validate steps" above.)
