iusca / bioloop

Scientific data management portal and pipeline application template

License: Other

Shell 0.63% Dockerfile 0.47% JavaScript 37.58% HTML 0.08% Vue 46.83% CSS 0.56% Python 13.85%
data-delivery data-management data-management-platform pipeline research science workflow workflow-automation

bioloop's Introduction

Bioloop

Data Management Portal and Pipeline Application for Research Teams

Overview

Bioloop is a web-based portal that simplifies the management of large-scale datasets shared among research teams in scientific domains. The platform optimizes data handling by using both cold and hot storage solutions, such as tape and disk, to reduce overall storage costs.

Key Features:

  1. Project-Based Organization: Data is assigned to projects, allowing collaborators to work within specific project environments, ensuring data isolation and efficient collaboration.

  2. Data Ingestion: Bioloop simplifies data ingestion by offering automated triggers for instrument-based data ingestion and supporting manual uploads for datasets.

  3. Data Provenance Tracking: Bioloop tracks data lineage, recording the origin of raw datasets and their subsequent derived data products, promoting data transparency and accountability.

  4. Custom Pipelines: Bioloop allows custom data processing pipelines, leveraging Python's Celery task queue system to efficiently scale processing workers as needed.

  5. Secure Downloads: Bioloop ensures data security with token-based access mechanisms for downloading data, restricting access to authorized users.

  6. Microservice Architecture: Bioloop utilizes Docker containers for effortless deployment, allowing flexibility between local infrastructure and public cloud environments.

Getting started

Dependencies

Bioloop leverages a few other projects to get up and running.

Architecture


bioloop's People

Contributors

charlesbrandt, deepakduggirala, dependabot[bot], karthiek390, ri-pandey, ryanlong89


bioloop's Issues

CSS class conflicts

CSS conflict for .mt-x between Tailwind and Vuestic.

Vuestic has a margin-top: calc(var(--va-grid-spacing-base) * 3)!important; rule somewhere.


UI performance and accessibility

Lighthouse Report cpa.sca.iu.edu_2023-06-30.html.txt

  • Enable text compression - Nginx config
  • Serve static assets with an efficient cache policy - Nginx config
  • Reduce unused JavaScript - see rollup dependency stats.html (need to completely remove moment.js and moment-timezone)
  • Eliminate render-blocking resources - ??

Accessibility:

  • Image elements do not have [alt] attributes
  • Background and foreground colors do not have a sufficient contrast ratio.

Develop a workflow view

  • compact workflow view / task view?

    • Name / id / description
    • progress + status + times
      • if active, show progress bar, elapsed & ETA
      • else, show status, start date, run duration (problematic with resumed workflows)
      • last updated?
  • detailed workflow view (expanded) / task view?

    • info tab showing args, hostname, etc.
    • results tab
    • logs tab (stdout + stderr)

Where to show task retries?

How to show previous task runs?

Audit for the workflow object:

  • change in status (events in the state machine)
  • action events
    • create
    • pause
    • resume

Stats / Tracking / Analytics section

CMG has a Stats / Tracking page for operators to see things like:

  • Number of downloads / day
    • Download mechanism used (Direct download vs Slate path access)
  • Total number of users
  • Total amount of data being managed by the system (exists on Dashboard already)


Some features not currently available in CMG that would be helpful to include are:

  • "Most staged" files (list of datasets sorted by total number of times a stage action has completed)
  • Total number of stage requests
  • Most downloaded files

Nice-to-haves could be:

  • bandwidth utilized per user (# downloads per user * download file size) - could potentially show top 10/20 users in terms of bandwidth
  • total space / project (integrate with project directly?)

In CMG, we were not able to track actual downloads, so we only track button clicks for those links. Now that we have the secure download server (#31), we should have better visibility into actual download actions.

It may be possible to combine this with the Dashboard page, and this may be related to #39

Operators - Users page

As an operator, I should be able to:

  • view users and their data
  • create new users with the user role (but not users with operator or admin roles); the Role field is pre-filled and disabled in the user form
  • edit users with user roles (but not users with operator or admin roles)

Operators should not be able to:

  • see the sudo user button

API changes are required.

API docker build - pnpm not installing new packages

After installing a new package (e.g., pnpm i multer), package.json and pnpm-lock.yaml are updated. However, pnpm fetch does not install these new packages, and the API fails to start.

Deleting the Docker volume and doing a fresh pnpm fetch is the workaround.

Worker - Stage - make it lazy + idempotent

Check the current state of the system before executing any changes, and only perform those changes that are necessary to bring the system to the desired state.

If the files are already on disk, do not download and extract them.
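
A minimal sketch of what the lazy check might look like; download_and_extract and the per-file checksum list are hypothetical stand-ins for the worker's actual helpers:

import hashlib
from pathlib import Path

def md5sum(path: Path) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(2 ** 20), b""):
            h.update(chunk)
    return h.hexdigest()

def stage(dataset: dict, dest: Path) -> None:
    # Desired state: every registered file exists on disk with a matching checksum.
    files = dataset["files"]  # assumed shape: [{"path": ..., "md5": ...}, ...]
    if all(
        (dest / f["path"]).is_file() and md5sum(dest / f["path"]) == f["md5"]
        for f in files
    ):
        return  # already staged; do nothing (idempotent)
    download_and_extract(dataset, dest)  # hypothetical: fetch from SDA and unpack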

Make content grid responsive

In #19, we discovered that the content grid is not responsive enough on extra small screen sizes.

As an example, on extra small screen widths the content on the Dashboard (/dashboard) page starts overlapping instead of falling into a neat single-column grid.

One way of avoiding this overlap would be to have the open sidebar not shift the main content to the right on smaller screens. This is how the Vuestic website does it. To see this in action, open the Vuestic website in mobile view (which should collapse the left sidebar), then open the left sidebar. You will note that the main content does not shift to the right when the sidebar is opened.

API Security

  • CSRF (with axios)
  • CORS
  • rate limiting

API Security Checklist

Checklist of the most important security countermeasures when designing, testing, and releasing your API.


Authentication

  • Don't use Basic Auth. Use standard authentication instead (e.g., JWT).
  • Don't reinvent the wheel in Authentication, token generation, password storage. Use the standards.
  • Use Max Retry and jail features in Login.
  • Use encryption on all sensitive data.

JWT (JSON Web Token)

  • Use a random complicated key (JWT Secret) to make brute forcing the token very hard.
  • Don't extract the algorithm from the header. Force the algorithm in the backend (HS256 or RS256); see the sketch after this list.
  • Make token expiration (TTL, RTTL) as short as possible.
  • Don't store sensitive data in the JWT payload, it can be decoded easily.
  • Avoid storing too much data. JWT is usually shared in headers and they have a size limit.
  • Problems with JWT
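
A minimal sketch of pinning the algorithm server-side with PyJWT (token and secret are assumed inputs):

import jwt  # PyJWT

def verify(token: str, secret: str) -> dict:
    # algorithms=["HS256"] forces the algorithm; a token whose header claims
    # alg=none or any other algorithm is rejected.
    return jwt.decode(token, secret, algorithms=["HS256"])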

Access

  • Limit requests (Throttling) to avoid DDoS / brute-force attacks.
  • Use HTTPS on server side with TLS 1.2+ and secure ciphers to avoid MITM (Man in the Middle Attack).
  • [?] Use HSTS header with SSL to avoid SSL Strip attacks.
  • Turn off directory listings.
  • [?] For private APIs, allow access only from safelisted IPs/hosts.

Authorization

OAuth

  • Always validate redirect_uri server-side to allow only safelisted URLs.
  • Always try to exchange for code and not tokens (don't allow response_type=token).
  • Use state parameter with a random hash to prevent CSRF on the OAuth authorization process.
  • Define the default scope, and validate scope parameters for each application.

Input

  • Use the proper HTTP method according to the operation: GET (read), POST (create), PUT/PATCH (replace/update), and DELETE (to delete a record), and respond with 405 Method Not Allowed if the requested method isn't appropriate for the requested resource.
  • Validate content-type on request Accept header (Content Negotiation) to allow only your supported format (e.g., application/xml, application/json, etc.) and respond with 406 Not Acceptable response if not matched.
  • Validate content-type of posted data as you accept (e.g., application/x-www-form-urlencoded, multipart/form-data, application/json, etc.).
  • Validate user input to avoid common vulnerabilities (e.g., XSS, SQL-Injection, Remote Code Execution, etc.).
  • Don't use any sensitive data (credentials, passwords, security tokens, or API keys) in the URL, but use the standard Authorization header.
  • Use only server-side encryption.
  • Use an API Gateway service to enable caching, Rate Limit policies (e.g., Quota, Spike Arrest, or Concurrent Rate Limit) and deploy APIs resources dynamically.

Processing

  • Check if all the endpoints are protected behind authentication to avoid broken authentication process.
  • User own resource ID should be avoided. Use /me/orders instead of /user/654321/orders. - both are okay
  • Don't auto-increment IDs. Use UUID instead.
  • If you are parsing XML data, make sure entity parsing is not enabled to avoid XXE (XML external entity attack).
  • If you are parsing XML, YAML or any other language with anchors and refs, make sure entity expansion is not enabled to avoid Billion Laughs/XML bomb via exponential entity expansion attack.
  • Use a CDN for file uploads.
  • If you are dealing with a huge amount of data, use Workers and Queues to process as much as possible in the background and return the response fast to avoid HTTP blocking.
  • Do not forget to turn the DEBUG mode OFF.
  • Use non-executable stacks when available.

Output

  • Send X-Content-Type-Options: nosniff header.
  • Send X-Frame-Options: deny header.
  • Send Content-Security-Policy: default-src 'none' header.
  • Remove fingerprinting headers - X-Powered-By, Server, X-AspNet-Version, etc.
  • Force content-type for your response. If you return application/json, then your content-type response is application/json.
  • Don't return sensitive data like credentials, passwords, or security tokens.
  • Return the proper status code according to the operation completed. (e.g., 200 OK, 400 Bad Request, 401 Unauthorized, 405 Method Not Allowed, etc.).

CI & CD

  • Audit your design and implementation with unit/integration tests coverage.
  • Use a code review process and disregard self-approval.
  • Ensure that all components of your services are statically scanned by AV software before pushing to production, including vendor libraries and other dependencies.
  • Continuously run security tests (static/dynamic analysis) on your code.
  • Check your dependencies (both software and OS) for known vulnerabilities.
  • Design a rollback solution for deployments.

Monitoring

  • Use centralized logging for all services and components.
  • Use agents to monitor all traffic, errors, requests, and responses.
  • Use alerts for SMS, Slack, Email, Telegram, Kibana, Cloudwatch, etc.
  • Ensure that you aren't logging any sensitive data like credit cards, passwords, PINs, etc.
  • Use an IDS and/or IPS system to monitor your API requests and instances.

Implement delete user

Generally it is preferable to disable user accounts via the Modify User modal. Operators should have this ability.

In some cases, it may be necessary to delete a user record altogether. Only Admin users should have this ability. The action should be triggered via the red trashcan icon (existing).

Dark mode doesn't work on some components

After switching the theme to Dark Mode, the Footer still shows up with a white background.

The "There are no active workflows." alert also shows up with a white background when Dark Mode is toggled on. This alert is visible on the homepage when running the app locally without a connection to the workflow service.

Build secure download server

Secure Download Server
New UI based on Google Drive

  • Permission data: postgres app database (app.sca.iu.edu)
  • Actual files: scratch (colo node)

Option 1: hosted on colo23

  • Authenticate users
  • Resolve permissions by talking to the API for a particular app
    • The app API should have a consistent route for resolving permissions: 200 OK / 403 Forbidden
  • Resolve the project_uuid to a path on scratch - 2 options:
    • hardcoded + symlink
    • API call which responds with the actual path on scratch
  • NGINX X-Accel header sends the file / UI JavaScript initiates the browser download

Pros:

  • Common UI for all apps

Cons:

  • It needs to know which app to talk to via URL / app hostnames / scratch configs
  • Extra auth for service-to-service

Option 2: Bundle this file tree UI with apps (import component)

  • Construct a file tree from the postgres dataset_files table
  • There is no extra API to resolve permissions and file paths
  • User clicks on a file to download; the API creates a token + URL, and the UI makes a CORS request to the NGINX API (nginx + express API) hosted on colo23
    • File: /abc/123.fastq.gz
    • Token:
      • contains the URL: asd12412sad/abc/123.fastq.gz
      • token expiry time: 1 minute
      • signed with the API's private key
      • includes the public key id
  • The nginx API on colo23 validates the token (with the public key) and serves that file
  • Nginx: resource server / resource owner

OAuth 2.0 authz server: Ory Hydra
Flow: OAuth 2.0 client credentials

Pros:

  • App-API driven: no overhead on permissions and path resolutions
  • Stats on who downloaded what

Constraints:

  • File tree is static
  • Allows searching for files without staging
  • Allows staging individual files from a given archive

Pros (for both):

  • User-level permissions on project downloads
  • Revoking permission applies immediately
  • Multi-modal authentication: IU, Google, Apple

Cons (for both):

  • No wget
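
A minimal sketch of the Option 2 token flow using PyJWT with RS256. The key paths, claim names, and helper names are assumptions for illustration, not the project's actual code; the 1-minute expiry follows the notes above:

from datetime import datetime, timedelta, timezone

import jwt  # PyJWT

PRIVATE_KEY = open("download_private.pem").read()  # held by the app API
PUBLIC_KEY = open("download_public.pem").read()    # held by the nginx API on colo23

def issue_download_token(file_path: str) -> str:
    # The app API signs a short-lived token naming the file to be served.
    claims = {
        "path": file_path,  # e.g. /abc/123.fastq.gz
        "exp": datetime.now(timezone.utc) + timedelta(minutes=1),  # 1-minute expiry
    }
    return jwt.encode(claims, PRIVATE_KEY, algorithm="RS256")

def validate_download_token(token: str) -> str:
    # The nginx-side API verifies signature and expiry with the public key,
    # then serves the file (e.g., via an X-Accel-Redirect response).
    claims = jwt.decode(token, PUBLIC_KEY, algorithms=["RS256"])
    return claims["path"]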

FileBrowser: Add state(view) transitions to the browser history

FileBrowser: https://github.com/IUSCA/bioloop/blob/main/ui/src/components/filebrowser/README.md

Add file browser navigation & search state to the browser history by manipulating the browser URL query parameters.

The user should be able to use the browser back button to go back to the previous directory / previous search result after a navigation / search action.

The user should be able to use the URL to load the page again in the same state (same pwd / same search filters).

Speed up Inspect and Validate steps

Use multiprocessing to leverage multiple CPUs in colo nodes and parallelize the computation of file checksums.

Distribute the files among many processors to run the same function:

  • Processes must consume the files from a queue. This is because the time to compute a checksum depends on file size, and file sizes in a dataset are not uniform, so giving each process a fixed number of files is not ideal. Files could instead be partitioned by size (and by the number of available processors), with each process given a partition, but this is complicated.
  • Make processes "nice" (linux) so that they do not hog resources and slow other running processes.

Tracking progress will be difficult as multiple processors are completing work.
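
A minimal sketch of the queue-based approach with Python's multiprocessing; the pool size and niceness value are assumptions:

import hashlib
import os
from multiprocessing import Pool

def _lower_priority():
    os.nice(10)  # make worker processes "nice" so they don't hog the node

def md5sum(path: str) -> tuple[str, str]:
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(2 ** 20), b""):
            h.update(chunk)
    return path, h.hexdigest()

def checksum_files(paths: list[str], workers: int = 8) -> dict[str, str]:
    # imap_unordered feeds the pool from a shared task queue, so each process
    # pulls the next file as soon as it finishes one; no static partitioning
    # by file count or size is needed.
    with Pool(processes=workers, initializer=_lower_priority) as pool:
        return dict(pool.imap_unordered(md5sum, paths))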

Explore storing checksum of bundles in the dataset object to reduce computation

Here, "bundle" means the tar, tar + zip, or tar + zip + enc of the dataset directory.

In the archive step, after making the bundle to upload to the SDA, save its checksum to the database.

This will be useful to avoid computing the checksum of the local bundle file, provided that only the application has write permissions on the location of these bundles.

Try storing the file metadata (mtime, size) for a quick check?
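
A minimal sketch of the quick check; the stored-record field names are assumptions:

import os

def bundle_unchanged(path: str, stored: dict) -> bool:
    # If size and mtime match what was recorded at archive time, trust the
    # stored checksum instead of re-reading the whole bundle.
    st = os.stat(path)
    return st.st_size == stored["size"] and int(st.st_mtime) == stored["mtime"]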

Workers - Log when API retries

2023-05-29 17:51:09,674 - ERROR - base.py:98 - Thread-1 - Failed to contact API for work, (Attempt 1/5)
2023-05-29 17:51:09,674 - ERROR - base.py:247 - MainThread -
2023-05-29 17:51:09,674 - ERROR - base.py:101 - Thread-1 - Retrying in 10 seconds
2023-05-29 17:53:20,745 - ERROR - https.py:20 - MainThread - HTTPSConnectionPool(host='cmg.sca.iu.edu', port=443): Max retries exceeded with url: /api/workers/5fe21439ecbe3fc1bcb6aaf2 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f25a3a324c0>: Failed to establish a new connection: [Errno 110] Connection timed out'))

When the workers try to reach the express API and fail, log that error as well as when the next retry will happen.

Fix paths in READMEs

The README.md files in the project reference paths to files that appear to be outdated. These paths should be updated to reflect the correct paths of these files.

Registration - Show the wait for no recent activity

When the register worker finds a new dataset, it usually waits for some time to make sure nothing has changed in the folder. After this, the dataset gets registered, and only then does it show up in the UI.

This doesn't give immediate feedback and often confuses the operators.

Register the dataset as soon as it is found, but only trigger the workflow after the no-recent-activity wait time. Also show a message in the UI that the system is waiting for x more minutes before it starts processing.
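
A minimal sketch of the proposed flow; every helper here (register_dataset, last_modified, set_ui_message, trigger_workflow) is a hypothetical stand-in:

import time

STABILITY_WAIT = 15 * 60  # assumed: seconds of no recent activity required

def on_new_dataset(path: str) -> None:
    dataset = register_dataset(path)  # register immediately so it shows up in the UI
    while (idle := time.time() - last_modified(path)) < STABILITY_WAIT:
        remaining_min = (STABILITY_WAIT - idle) / 60
        set_ui_message(dataset, f"waiting {remaining_min:.0f} more minutes before processing")
        time.sleep(60)
    trigger_workflow(dataset)  # start processing only after the quiet period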

FileBrowser: Implement filtering by directory sizes

FileBrowser: https://github.com/IUSCA/bioloop/blob/main/ui/src/components/filebrowser/README.md

API:
As the file metadata is immutable, it is feasible to recursively calculate the directory sizes using the graph data structure used to insert the data into the DB.
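
A minimal sketch of the calculation over a flat file list (the input shape is an assumption):

from collections import defaultdict
from pathlib import PurePosixPath

def dir_sizes(files: list[tuple[str, int]]) -> dict[str, int]:
    # files: (relative path, size in bytes); every ancestor directory of a
    # file accumulates that file's size.
    sizes: dict[str, int] = defaultdict(int)
    for path, size in files:
        for parent in PurePosixPath(path).parents:
            sizes[str(parent)] += size
    return dict(sizes)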

UI
In browser view:

  • should be able to sort directories by size
  • group directories and files separately while sorting by size - similar to sorting by name

In Search View:

  • should be able to filter directories by min and max size

Filebrowser: Allow download of full tar file

Now that we have the ability to download individual files in a dataset (#31) (!!) we have received a request to also be able to download the complete dataset. We can use the same file that is archived on the SDA for this purpose. Once we stage the tar file, we can keep it around for direct downloading. Downloading the tar file should be protected by the same secure download features as individual files.

This could be added as a third option on the "Data Access Options" modal (screenshot attached to the issue).

UI Base features

TODO

  • Refactor api/prisma/schema.js - split data and code
  • Sync bioloop/workers and cpa/workers

Prisma - Use Referential actions

https://www.prisma.io/docs/concepts/components/prisma-schema/relations/referential-actions

To delete a dataset, all the associated relations have to be deleted first; alternatively, this can be specified in the schema.prisma config with referential actions (e.g., onDelete: Cascade on the relation fields).

// Current approach: delete every associated relation and the dataset itself
// in one transaction. With onDelete: Cascade referential actions in
// schema.prisma, most of these explicit deletes become unnecessary.
async function hard_delete(id) {
  const deleteFiles = prisma.dataset_file.deleteMany({
    where: {
      dataset_id: id,
    },
  });
  const deleteWorkflows = prisma.workflow.deleteMany({
    where: {
      dataset_id: id,
    },
  });
  const deleteAudit = prisma.dataset_audit.deleteMany({
    where: {
      dataset_id: id,
    },
  });
  const deleteStates = prisma.dataset_state.deleteMany({
    where: {
      dataset_id: id,
    },
  });
  const deleteAssociations = prisma.dataset_hierarchy.deleteMany({
    where: {
      OR: [
        {
          source_id: id,
        },
        {
          derived_id: id,
        },
      ],
    },
  });
  const deleteDataset = prisma.dataset.delete({
    where: {
      id,
    },
  });

  await prisma.$transaction([
    deleteFiles,
    deleteWorkflows,
    deleteAudit,
    deleteStates,
    deleteAssociations,
    deleteDataset,
  ]);
}

FileBrowser: Implement sorting in search view

FileBrowser: https://github.com/IUSCA/bioloop/blob/main/ui/src/components/filebrowser/README.md

User should be able to sort the search results by name, size, and filetype. Follow the multisort strategy used in browser view to group files and directories separately.

User should be able to see the number of search results.

Optional:
As of now, the number of search results is limited to 1000 to prevent sending too much data to the browser. The table is virtual; debate the use of lazy loading vs. pagination to show the next batch of results beyond the initial 1000 (if such a feature is required).

Develop a Workflow Manager Page and Workflow Summary component for Dashboard

In addition to showing active workflows and current resource utilization, the dashboard should show the status of workers, regardless of whether they are active. This is useful when troubleshooting issues to know, e.g., when a worker was last seen by the API and which workers are configured for the system. This feature would only show workers configured for a specific instance of bioloop, not all workers across all instances (as may be the case in #39?).

In CMG, this functionality looks like this: (screenshot attached to the issue)

Dark mode

  • persist to local storage
  • create a dark mode switcher component and add it to the header

Worker - TODO

Workers:

  • embellish workflow, step, and task objects with explicit statuses, start and end times, progress, etc., which are currently stored implicitly in the order of tasks and in task attributes.
  • docker for workflow server
  • pm2 config for daemon processes - auto start, background
  • add /health
  • If the API call fails at the end of the task, do not retry the entire task. This could happen when the API is momentarily down.
  • isolate workflow steps into a config (todo in api too)
  • use a hook to write task status to the result backend as soon as the task starts; otherwise, a task object is not created in the result backend until the task succeeds, fails, or sends a status update from task code.
  • In workflow code: handle gracefully if a task is not found
  • Enable pm2 to start services automatically if the server restarts. - not possible in colo23?
  • Change the get-workflow API: instead of last_task_run and prev_task_runs, return task_runs as an array sorted chronologically. Change the query param to num_task_runs, with -1 representing all tasks.
  • workflow resume: if the pending step is not the first step, and it has never run before, then get the args from the previous step.
  • celery: read more about inbuilt workers in celery that clean up tasks periodically
  • celery config for long-running tasks: a worker cannot claim / prefetch, acks late, etc.
    • documentation on resiliency - task level, worker level
  • Design WorkflowTask to not retry on a custom exception (NonRetryableException); see the sketch after this list - https://github.com/celery/celery/blob/b4d23f290713ebea25ab517d9f980ae542885577/celery/app/autoretry.py#L14
  • Make a list of all the options that can be configured on the task, and how to overwrite that config on the WorkflowTask.
  • Update task code without restarting the celery worker
  • Application specific queues
  • Standardize status and status sets names across layers
  • The error which caused a task to be retried is lost when the task sends progress updates / on success.
  • Add an option to reject a task and mark it as to-be-skipped. When the workflow resumes, it should execute the next step. (This is helpful when the task was completed manually and the workflow has to complete the remaining steps.) When the task is truly lazy, it will see that the desired state is already reached, do nothing, and complete; but for non-lazy tasks, this option is required.
  • Await-stability being a task is not an efficient design. When there are multiple concurrent data ingestions, most of the celery tasks will be idle waiting for stability of some datasets while the inspect and archive tasks of other, ready datasets wait in the queue. Move the await-stability functionality to watch, where all the datasets can be checked at once in one loop.
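
A minimal sketch of the non-retryable design, overriding Task.__call__ on a base class. The class and exception names are assumptions, and it relies on Celery invoking the overridden __call__ with the task request in place:

from celery import Celery, Task

app = Celery("workers")

class NonRetryableException(Exception):
    """Raised by task code for failures that a retry cannot fix."""

class WorkflowTask(Task):
    max_retries = 3
    default_retry_delay = 10  # seconds

    def __call__(self, *args, **kwargs):
        try:
            return super().__call__(*args, **kwargs)
        except NonRetryableException:
            raise  # fail immediately, skipping the retry machinery
        except Exception as exc:
            raise self.retry(exc=exc)  # retry everything else

@app.task(base=WorkflowTask, bind=True)
def inspect(self, dataset_id):
    ...  # task body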

Logging

  • use celery logger
  • log file management, rotation for each task
  • Include filename and line number - celery logger formatter
  • use python logger instead of print in register.py and app.py

Config

  • hierarchical config - for dev and prod envs
  • refactor celeryconfig to use data from main config

Workers - Inspection and Validation Steps - Use queue based multiprocessing to speed up

Use multiprocessing to leverage multiple CPUs in colo nodes and parallelize the computation of file checksums.

Distribute the files among many processors to run the same function:

Processes must consume the files from a queue. This is because the time to compute a checksum depends on file size, and file sizes in a dataset are not uniform, so giving each process a fixed number of files is not ideal. Files could instead be partitioned by size (and by the number of available processors), with each process given a partition, but this is complicated.
Make processes "nice" (linux) so that they do not hog resources and slow other running processes.

Tracking progress will be difficult as multiple processors are completing work. (See the multiprocessing sketch under "Speed up Inspect and Validate steps" above.)
