umccr / htsget-rs

A server implementation of the htsget protocol for bioinformatics in Rust

Home Page: https://samtools.github.io/hts-specs/htsget.html

License: MIT License

Languages: Rust 97.10%, JavaScript 0.16%, TypeScript 2.60%, Dockerfile 0.13%
Topics: bioinformatics, rust, noodles, bioinfo, aws-lambda, htsget, serverless

htsget-rs's Introduction

htsget-rs


A server implementation of the htsget protocol for bioinformatics in Rust. It is:

  • Fully-featured: supports BAM and CRAM for reads, and VCF and BCF for variants, as well as other aspects of the protocol such as TLS and CORS.
  • Serverless: supports local server instances using Actix Web, and serverless instances using the AWS Lambda Rust Runtime.
  • Storage interchangeable: supports local filesystem storage as well as object storage via MinIO and AWS S3.
  • Thoroughly tested and benchmarked: tested using a purpose-built test suite and benchmarked using criterion-rs.

To get started, see Usage.

Note: htsget-rs is still experimental, and subject to change.

Overview

Htsget-rs implements the htsget protocol, an HTTP-based protocol for querying bioinformatics files. The protocol outlines how a htsget server should behave, and it is an effective way to fetch regions of large bioinformatics files.

A htsget server responds to queries asking for regions of bioinformatics files. It does this by returning an array of URL tickets that the client must fetch and concatenate. This process is outlined in the diagram below:

[diagram: the htsget ticket request/response flow]

htsget-rs implements this process as closely as possible to the specification, and aims to return byte ranges that are as small as possible. It is written asynchronously using the Tokio runtime, and aims to be efficient and safe, backed by a thorough set of tests and benchmarks.

htsget-rs implements the following components of the protocol:

  • GET requests.
  • POST requests.
  • BAM and CRAM for the reads endpoint.
  • VCF and BCF for the variants endpoint.
  • service-info endpoint.
  • TLS on the data block server.
  • CORS support on the ticket and data block servers.

Usage

Htsget-rs is configured using environment variables; for details on how to set them, see htsget-config.
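A minimal sketch of a local configuration (aside from HTSGET_S3_BUCKET, which appears in the issues below, the variable names here are assumptions; htsget-config is the authoritative reference):

export HTSGET_ADDR=127.0.0.1:8080   # hypothetical ticket server address variable
export HTSGET_PATH=data             # hypothetical data directory variable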

Local

To run a local instance of htsget-rs, run htsget-actix by executing the following:

cargo run -p htsget-actix

Using the default configuration, this will start a ticket server on 127.0.0.1:8080 and a data block server on 127.0.0.1:8081 with data accessible from the data directory. See htsget-actix for more information.
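For example, once the servers are running, a header-only region query can be issued with curl (this ID comes from the example data used in the htsget-http-actix README, quoted later on this page):

curl '127.0.0.1:8080/variants/data/vcf/sample1-bcbio-cancer?format=VCF&class=header'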

Cloud

Cloud-based htsget-rs uses htsget-lambda. For more information, and for an example deployment of this crate, see deploy.

Tests

Tests can be run by executing:

cargo test --all-features

To run benchmarks, see the benchmark sections of htsget-actix and htsget-search.

Project Layout

This repository consists of a workspace composed of multiple crates, among them htsget-config, htsget-search, htsget-actix and htsget-lambda.

Other directories contain further applications or data:

  • data: Contains example data files which can be used by htsget-rs, in folders denoting the file type. This directory also contains example events used by a cloud instance of htsget-rs in the events subdirectory.
  • deploy: An example deployment of htsget-lambda.

In htsget-rs the ticket server is handled by htsget-actix or htsget-lambda, and the data block server is handled by the storage backend, either locally or using AWS S3. This project layout is structured to allow for extensibility and modularity. For example, a new ticket server and data server could be implemented using Cloudflare Workers in a htsget-http-workers crate and Cloudflare R2 in htsget-search.

See the htsget-search overview for more information on the storage backend.

Contributing

Thanks for your interest in contributing; we would love to have you! See the contributing guide for more information.

License

This project is licensed under the MIT license.

htsget-rs's People

Contributors

brainstorm, castillodel, chris-zen, dependabot[bot], github-actions[bot], github-merge-queue[bot], mmalenic, norling, victorskl


htsget-rs's Issues

VCF search interface implementation (/variants endpoint)

As per the htsget spec, implement the variants endpoint:

GET /variants/<id>

This entails understanding the current BAM (/reads) structure and following how the code is currently organised. In particular, your implementation would live in:

 htsget-mvp/htsget-search/src/htsget/bcf.rs

For information on how a BCF file looks internally (most importantly, its uncompressed counterpart, VCF), please refer to the appropriate PDF under hts-specs. That said, Noodles has a fairly clean and straightforward module that describes the format idiomatically in Rust.

EDIT: Since the bcf crate for Noodles is currently in the works on the bcf branch, it would be reasonable to implement vcf.gz business logic instead; that format is also more commonly found in the wild than BCF.

Document usage of GZI

A feature of htsget-rs is its use of GZI, which allows fine-grained byte range responses without reading the underlying file. Its use should be documented, and it should be recommended as a supplementary index alongside the regular index files.
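For reference, a GZI index can be produced for an existing BGZF-compressed file with htslib's bgzip (flag quoted from memory; double-check against the bgzip man page):

bgzip --reindex sample1.vcf.gz    # writes sample1.vcf.gz.gzi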

Implement Crypt4GH

Relevant high level read about it:

https://www.ga4gh.org/news/crypt4gh-a-secure-method-for-sharing-human-genetic-data/

And the actual spec for this enhancement:

https://samtools.github.io/hts-specs/crypt4gh.pdf

It is worth paying close attention to section A.3, Other Considerations, since it directly affects how htsget should behave with this encryption scheme.

Some useful repos:

https://github.com/elixir-oslo/crypt4gh
https://github.com/EGA-archive/crypt4gh
https://github.com/neicnordic/LocalEGA-cryptor

/cc @viklund

Implement proper AsyncSeek+AsyncReader

We should really fix this issue so that the get_content() method does not download excessive bytes. These changes could potentially live further upstream so that all storage backends can benefit from the improvement:

async fn get_content<K: AsRef<str> + Send>(&self, key: K, options: GetOptions) -> Result<Bytes> {
  // It would be nice to use a ready-made type with a ByteStream that implements AsyncRead + AsyncSeek
  // in order to avoid reading the whole byte buffer into memory. A custom type could be made similar to
  // https://users.rust-lang.org/t/what-to-pin-when-implementing-asyncread/63019/2 which could be based off
  // StreamReader.
  // ...
}

Otherwise the side effects are unnecessary egress charges for cloud and prohibitively slow requests/responses.
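A minimal sketch of the kind of type the comment suggests, using tokio_util's StreamReader (the stream shape here is an assumption; a real S3 backend would adapt the SDK's ByteStream into it):

use bytes::Bytes;
use futures::Stream;
use tokio::io::AsyncRead;
use tokio_util::io::StreamReader;

/// Wraps a stream of byte chunks in an AsyncRead, so callers can read
/// incrementally instead of buffering the whole object into memory.
fn reader_from_stream<S>(stream: S) -> impl AsyncRead
where
  S: Stream<Item = std::io::Result<Bytes>>,
{
  StreamReader::new(stream)
}

AsyncSeek is the harder part: for object storage it would need to translate seeks into new ranged requests rather than wrapping a single stream.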

README.md: cargo run -p htsget-http-actix does not listen to 8080 out of the box

% cargo run -p htsget-http-actix
   Compiling htsget-search v0.1.0 (/Users/rvalls/dev/umccr/htsget-rs/htsget-search)
   Compiling htsget-http-core v0.1.0 (/Users/rvalls/dev/umccr/htsget-rs/htsget-http-core)
   Compiling htsget-http-actix v0.1.0 (/Users/rvalls/dev/umccr/htsget-rs/htsget-http-actix)
    Finished dev [unoptimized + debuginfo] target(s) in 10.27s
     Running `target/debug/htsget-http-actix`
2022-06-06T10:44:53.933646Z  INFO htsget_config::config: Config created from environment variables. config=Ok(Config { addr: 127.0.0.1:8080, resolver: RegexResolver { regex: .*, substitution_string: "$0" }, path: ".", service_info: ConfigServiceInfo { id: None, name: None, version: None, organization_name: None, organization_url: None, contact_url: None, documentation_url: None, created_at: None, updated_at: None, environment: None }, storage_type: LocalStorage, ticket_server_addr: 127.0.0.1:8081, ticket_server_key: "key.pem", ticket_server_cert: "cert.pem", s3_bucket: "" })
2022-06-06T10:44:53.943689Z  INFO actix_server::builder: Starting 8 workers
2022-06-06T10:44:53.945003Z  INFO actix_server::server: Actix runtime found; starting in Actix runtime
Error: Custom { kind: NotFound, error: IoError("Failed to open key file", Os { code: 2, kind: NotFound, message: "No such file or directory" }) }

Improve error messages and handling

Some parts of the code have errors that are vague, or that lose information when wrapping the original error. For example, this error mapping doesn't record why a Header could not be parsed. This should be improved so that information on why errors occur is properly returned to the user.

Blocking/async versions (dead code) warnings

After merging #56, the warnings below appeared. The main function only uses the specific methods/objects relevant to it being async or blocking, so cargo build without flags yields the following:

warning: unused imports: `get`, `post`, `reads_service_info`, `variants_service_info`
 --> htsget-http-actix/src/main.rs:4:33
  |
4 | use crate::handlers::blocking::{get, post, reads_service_info, variants_service_info};
  |                                 ^^^  ^^^^  ^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^
  |
  = note: `#[warn(unused_imports)]` on by default

warning: function is never used: `reads`
  --> htsget-http-actix/src/handlers/blocking/get.rs:16:14
   |
16 | pub async fn reads<H: HtsGet>(
   |              ^^^^^
   |
   = note: `#[warn(dead_code)]` on by default

warning: function is never used: `variants`
  --> htsget-http-actix/src/handlers/blocking/get.rs:31:14
   |
31 | pub async fn variants<H: HtsGet>(
   |              ^^^^^^^^

warning: function is never used: `reads`
  --> htsget-http-actix/src/handlers/blocking/post.rs:14:14
   |
14 | pub async fn reads<H: HtsGet>(
   |              ^^^^^

warning: function is never used: `variants`
  --> htsget-http-actix/src/handlers/blocking/post.rs:28:14
   |
28 | pub async fn variants<H: HtsGet>(
   |              ^^^^^^^^

warning: function is never used: `get_service_info_json`
  --> htsget-http-actix/src/handlers/blocking/mod.rs:15:4
   |
15 | fn get_service_info_json<H: HtsGet>(app_state: &AppState<H>, endpoint: Endpoint) -> impl Responder {
   |    ^^^^^^^^^^^^^^^^^^^^^

warning: function is never used: `reads_service_info`
  --> htsget-http-actix/src/handlers/blocking/mod.rs:23:14
   |
23 | pub async fn reads_service_info<H: HtsGet>(app_state: Data<AppState<H>>) -> impl Responder {
   |              ^^^^^^^^^^^^^^^^^^

warning: function is never used: `variants_service_info`
  --> htsget-http-actix/src/handlers/blocking/mod.rs:28:14
   |
28 | pub async fn variants_service_info<H: HtsGet>(app_state: Data<AppState<H>>) -> impl Responder {
   |              ^^^^^^^^^^^^^^^^^^^^^

warning: type alias is never used: `HtsGetStorage`
  --> htsget-http-actix/src/main.rs:45:1
   |
45 | type HtsGetStorage = HtsGetFromStorage<LocalStorage>;
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

warning: field is never read: `htsget`
  --> htsget-http-actix/src/main.rs:54:3
   |
54 |   htsget: H,
   |   ^^^^^^^^^

warning: field is never read: `config`
  --> htsget-http-actix/src/main.rs:55:3
   |
55 |   config: Config,
   |   ^^^^^^^^^^^^^^

warning: 11 warnings emitted

CORS support for standalone running instance

Context:

  • A low-priority change request: add direct CORS support to the built-in servers for standalone runs

Use Case:

  • Local dev runs with cargo run -p htsget-http-actix, using htsget-rs as a backend in pipeline development and/or Portal integration with IGV.js

Workaround:

  • Use Nginx/HAProxy to handle CORS, then reverse proxy to the htsget-rs servers

Need Changes:

I guess this should be quick; there are two parts:

  • the actix_cors crate for actix-web, for the ticket server (see the sketch below)
  • activate the tower_http::cors feature in tower/axum, for the built-in storage streaming server
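A minimal sketch of the actix-cors part, assuming actix-cors is added as a dependency (the origin and routes here are placeholders, not the project's actual setup):

use actix_cors::Cors;
use actix_web::{App, HttpServer};

#[actix_web::main]
async fn main() -> std::io::Result<()> {
  HttpServer::new(|| {
    // Allow a local dev frontend (e.g. IGV.js) to call the ticket server.
    let cors = Cors::default()
      .allowed_origin("http://localhost:3000") // hypothetical dev origin
      .allowed_methods(vec!["GET", "POST"]);
    App::new().wrap(cors)
    // ... the htsget routes would be registered here
  })
  .bind("127.0.0.1:8080")?
  .run()
  .await
}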

Implement htsget id resolver interface

When a query is sent to htsget via, for instance, GET /reads/<id>, the spec says that id can take any of these forms:

[screenshot: the htsget spec's table describing the allowed forms of <id>]

The spec is intentionally loose in this regard, since different organizations might have differing requirements for how these endpoints map onto the internal (organizational) structure they have in place.

A "resolver" interface should allow for those organizations to easily "plug and play" htsget into their systems more easily and/or filter what's a legal ID or not, perhaps via regexps like the reference implementation does.

Change `Option<Class>` to `Class`

Right now there are two ways of representing a Query in which the client didn't specify the headers class: class: None or class: Class::Body. This duplication could be avoided by having only Class in the Query struct, unless there is a specific reason not to.

Also note that the htsget specification only accepts the headers value in the class parameter.
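A sketch of the proposed shape (field names are illustrative only):

#[derive(Debug, PartialEq)]
enum Class {
  Header,
  Body,
}

struct Query {
  id: String,
  // Previously Option<Class>; None and Class::Body meant the same thing.
  class: Class,
}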

Implementation of storage class for S3

This task consists of implementing the specifics of the generic "storage" module, so it will begin by understanding the contents of:

htsget-mvp/htsget-search/src/storage/local.rs

And then implementing the corresponding AWS S3 module:

htsget-mvp/htsget-search/src/storage/aws_s3.rs 

Please note that for now we are focusing on AWS's implementation of S3; as you might be aware, there are other S3 implementations, such as MinIO, which we will not focus on for this task.

For instance, this implementation will begin by defining the bucket and key concepts relevant to S3 in a structure:

/// Implementation for the [Storage] trait using AWS S3
#[derive(Debug)]
pub struct S3Storage {
  bucket: String,
  key: String,
}

For information on how S3 works, please refer to the official AWS S3 documentation.

Not to be confused with AWS S3's "internal" storage classes, namely DEEP ARCHIVE, ARCHIVE, STANDARD, IA, etc.

More accurate byte ranges for BAM search results

VirtualPositions encode the start of a BGZF block in their high bits (also known as the compressed part, or the coffset). When getting results from a BAI query with noodles, the ending VirtualPosition's compressed value points to the beginning of a block that might still include reads that need to be returned. If we want to translate that ending VirtualPosition into a byte offset in the file, we need to know where that block finishes, but that's not something the BAI index provides out of the box.

Our current solution is naive: it just adds the maximum block size to the compressed part of the ending intervals, to make sure that we don't lose any reads (at the expense of possibly adding many reads that don't belong to the sequence range that was queried).

A better alternative would be to construct a HashMap<u64, u64> from the BAI that maps each block's starting offset to the beginning of the closest following block that we can infer from the BAI index. We could use that map to find the corresponding end for every compressed offset that we get for an end interval. In case we don't find it in the map, we could fall back to the current strategy of adding the maximum block size.

To get that information from the BAI we can do something like:

  // Assuming noodles' bai module and std collections are in scope:
  use std::collections::{HashMap, HashSet};

  let index = bai::read(path.as_ref().with_extension("bam.bai"))?;

  let mut ref_seq_intervals: Vec<HashMap<u64, u64>> = Vec::with_capacity(index.reference_sequences().len());

  for idx_ref_seq in index.reference_sequences() {
    // Only reference sequences with metadata have chunks worth indexing.
    if idx_ref_seq.metadata().is_some() {
      // Unique compressed (file) offsets of every chunk boundary.
      let blocks: HashSet<u64> = idx_ref_seq
        .bins()
        .iter()
        .flat_map(|bin| bin.chunks().iter())
        .flat_map(|chunk| vec![chunk.start(), chunk.end()])
        .map(|vpos| vpos.compressed())
        .collect();
      let mut blocks: Vec<u64> = blocks.into_iter().collect();
      blocks.sort_unstable();

      // Map each block start to the start of the next known block, giving
      // an upper bound on where that block ends.
      let intervals: HashMap<u64, u64> = blocks
        .iter()
        .take(blocks.len().saturating_sub(1))
        .zip(blocks.iter().skip(1))
        .map(|(start, end)| (*start, *end))
        .collect();

      ref_seq_intervals.push(intervals);
    }
  }

Two local test failures related to cargo lambda

These two minor defects are related to cargo-lambda/cargo-lambda#153.

$ cargo test
(...)
running 6 tests
test test_parameterized_get ... FAILED
test test_get ... FAILED
test test_parameterized_post_class_header ... FAILED
test test_service_info ... FAILED
test test_post ... FAILED
test test_parameterized_post ... FAILED

failures:

---- test_parameterized_get stdout ----
thread 'test_parameterized_get' panicked at 'Failed to invoke request.', htsget-http-lambda/tests/integration_tests.rs:55:3
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- test_get stdout ----
thread 'test_get' panicked at 'Failed to invoke request.', htsget-http-lambda/tests/integration_tests.rs:55:3

---- test_parameterized_post_class_header stdout ----
thread 'test_parameterized_post_class_header' panicked at 'Failed to invoke request.', htsget-http-lambda/tests/integration_tests.rs:55:3

---- test_service_info stdout ----
thread 'test_service_info' panicked at 'Failed to invoke request.', htsget-http-lambda/tests/integration_tests.rs:55:3

---- test_post stdout ----
thread 'test_post' panicked at 'Failed to invoke request.', htsget-http-lambda/tests/integration_tests.rs:55:3

---- test_parameterized_post stdout ----
thread 'test_parameterized_post' panicked at 'Failed to invoke request.', htsget-http-lambda/tests/integration_tests.rs:55:3


failures:
    test_get
    test_parameterized_get
    test_parameterized_post
    test_parameterized_post_class_header
    test_post
    test_service_info

test result: FAILED. 0 passed; 6 failed; 0 ignored; 0 measured; 0 filtered out; finished in 21.99s

error: test failed, to rerun pass '-p htsget-http-lambda --test integration_tests'
% cargo test --no-default-features
   Compiling htsget-http-lambda v0.1.0 (/Users/rvalls/dev/umccr/htsget-rs/htsget-http-lambda)
   Compiling htsget-http-actix v0.1.0 (/Users/rvalls/dev/umccr/htsget-rs/htsget-http-actix)
error[E0004]: non-exhaustive patterns: `AwsS3Storage` not covered
  --> htsget-http-actix/src/main.rs:20:9
   |
20 |   match config.storage_type {
   |         ^^^^^^^^^^^^^^^^^^^ pattern `AwsS3Storage` not covered
   |
  ::: /Users/rvalls/dev/umccr/htsget-rs/htsget-config/src/config.rs:68:3
   |
68 |   AwsS3Storage,
   |   ------------ not covered
   |
   = help: ensure that all possible cases are being handled, possibly by adding wildcards or more match arms
   = note: the matched value is of type `StorageType`

For more information about this error, try `rustc --explain E0004`.
error: could not compile `htsget-http-actix` due to previous error
warning: build failed, waiting for other jobs to finish...
error[E0004]: non-exhaustive patterns: `AwsS3Storage` not covered
  --> htsget-http-lambda/src/main.rs:16:9
   |
16 |   match config.storage_type {
   |         ^^^^^^^^^^^^^^^^^^^ pattern `AwsS3Storage` not covered
   |
  ::: /Users/rvalls/dev/umccr/htsget-rs/htsget-config/src/config.rs:68:3
   |
68 |   AwsS3Storage,
   |   ------------ not covered
   |
   = help: ensure that all possible cases are being handled, possibly by adding wildcards or more match arms
   = note: the matched value is of type `StorageType`

error: build failed

Benchmarking

As mentioned in the original GSoC proposals, this project is not complete until we validate that it performs well :)

For this, I've been cooking this repo: https://github.com/umccr/aws-benchmarks

As an example, it uses AWS S3 client libraries as subjects to test both CPU and I/O throughput... but the idea is to apply those benches over here.

This task depends on #45 being finished.

Convert Storage and HtsGet traits to use async/await

For the current MVP, having blocking operations is fine. But for real life we should convert those interfaces to use async/await.

I think this is something we can leave for later, once we have the basic parts of the search logic figured out; it possibly becomes more relevant before we start working on the HTTP layer.

#[async_trait(?Send)]
pub trait Storage {
  async fn get<K: AsRef<str>>(&self, key: K, options: GetOptions) -> Result<PathBuf>;
  async fn url<K: AsRef<str>>(&self, key: K, options: UrlOptions) -> Result<Url>;
}
#[async_trait(?Send)]
pub trait HtsGet {
  async fn search(&self, query: Query) -> Result<Response>;
}

README contains disparaging FUD about other projects

From README.md:

This implementation gets rid of the unsafe interfacing with the C-based htslib, which has had many vulnerabilities along with other problematic third party dependencies such as OpenSSL.

There are four real CVEs listed for HTSlib. The last non-trivial one of those was fixed two years and seven releases ago.

Your “many vulnerabilities” link is to a list of HTSlib PRs derived from OSS-Fuzz reports, many of which will likely have been short-lived problems in code on the develop branch that were never present in an HTSlib release. Moreover, would you prefer to use a C library that is not being constantly fuzz-tested and does not have such issues identified and fixed?

As the author of this README text has previously been informed, HTSlib itself does not directly use OpenSSL. Only its S3 support (in hfile_s3*.c) — which can be built separately as a plugin if desired, thus isolating the dependency — will use OpenSSL as one option for providing the cryptographic routines it needs. If you prefer an HTSlib built without that dependency at all, it is simple to configure your HTSlib build accordingly.

So this text serves no purpose for your users, is largely incorrect or misleading, and casts unwarranted aspersions at HTSlib and its authors — several of whom, incidentally, are involved with htsget.

This text does not reflect well on your project.

Consider Fields and Tags.

Currently the search functions don't seem to do anything with the Query tags and fields. We should implement tag and field filtering as it's an optional part of the htsget protocol.

Append EOF for BAM and CRAM

When returning the sliced data, these two formats do not terminate the response with the right EOF marker:

BAM: 1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 1b 00 03 00 00 00 00 00 00 00 00 00
CRAM: 0f 00 00 00 ff ff ff ff 0f e0 45 4f 46 00 00 00 00 01 00 05 bd d9 4f 00 01 00 06 06 01 00 01 00 01 00 ee 63 01 4b

This leads samtools (and most probably other clients) to error out accordingly:

[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::bgzf_read_block] Invalid BGZF header at offset 4668
[E::bgzf_read] Read block operation failed with error 2 after 0 of 4 bytes
samtools view: error reading file "bam_slice.bam": No such process
samtools view: error closing "bam_slice.bam": -1

There are noodles methods that return this EOF constant per format; use them (a sketch follows the caveats).

Caveats:

  1. Do not append the EOF if the request is for the header only; append it only for the body.
  2. Do not append the EOF if the request is for the whole file/object, which already ends with one.
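An illustrative sketch, hardcoding the BAM EOF bytes quoted above (in practice the noodles constant should be used instead):

/// The 28-byte BGZF EOF marker for BAM, as listed in this issue.
const BAM_EOF: [u8; 28] = [
  0x1f, 0x8b, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0x06, 0x00,
  0x42, 0x43, 0x02, 0x00, 0x1b, 0x00, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00,
];

fn append_eof(mut slice: Vec<u8>) -> Vec<u8> {
  // Per the caveats above: skip this for header-only requests and for
  // whole-file requests, which already end with the marker.
  slice.extend_from_slice(&BAM_EOF);
  slice
}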

Fix (benchmarking-related?) CI issues

https://github.com/umccr/htsget-rs/runs/6725133792?check_suite_focus=true#step:11:294

It seems that criterion might be trying to fish out some prior result to compare against: https://github.com/umccr/htsget-rs/runs/6725133792?check_suite_focus=true#step:13:20

Could also be related to my recent await-ification of many of the tests. I'll investigate.

A future-proofing warning worth fixing too: https://github.com/umccr/htsget-rs/runs/6725133792?check_suite_focus=true#step:11:139

Also need to investigate whether to run the benchmarks with --release mode on, as that is the production baseline we want to compare against.

Also, when running the benchmarks locally, they don't seem to finish:

cargo bench --bench search-benchmarks --bench request-benchmarks -- LIGHT --output-format bencher | tee output.txt

Tests failing

Several tests are failing right now. It seems some recent noodles updates broke them; I'm investigating which commit it was.

Consider multiple buckets support

Context:

export HTSGET_S3_BUCKET=my-primary-data

Desirable:

  • It would be good if we could configure multiple buckets, as data may be sparsely located across multiple locations/buckets

Related:

  • This may also relate to the htsget REST API endpoint resource ID mapping; currently it uses the regex pattern .* to match all keys under a bucket
  • Hence, in turn, this request indirectly asks to consider including the bucket as part of the REST endpoint resource ID mapping
  • i.e.
s3://my-bucket1/subject/key/to/data1.bam
s3://my-bucket2/another/subject/key/to/data1.bam

Then, the htsget REST endpoint resource IDs would become

https://htsget-rs.my.org/my-bucket1/subject/key/to/data1.bam
https://htsget-rs.my.org/my-bucket2/another/subject/key/to/data1.bam

(or without extension suffix is fine)
https://htsget-rs.my.org/my-bucket1/subject/key/to/data1
https://htsget-rs.my.org/my-bucket2/another/subject/key/to/data1

Or, with the htsget URI scheme (ongoing discussion: #581, umccr/data-portal-client#88)

htsget://data.my.org/my-bucket1/subject/key/to/data1
htsget://data.my.org/my-bucket2/another/subject/key/to/data1

Providing the file size through the Storage abstraction

For the BAM format, we need to know the size of the file to provide a closed byte range for all the unplaced unmapped reads. The current alternative is to leave the end of the range open.

Possible names:

  • head: the HTTP verb you would use to get the size information (as well as other details, like whether the object exists) from S3
  • stats: the metadata information you get for a file on Unix
  • size: more limited in scope and focused on just getting the size.

We have two options for the result (where T will depend on the final semantics that we decide for this new method):

  • Result<Option<T>>: Result encodes either a success or a failure, and the Option whether the file exists or not.
  • Result<T>: Result encodes both success/failure and whether the file exists or not as one of the possible Error variants.

One possible example:

pub trait Storage {
  fn get<K: AsRef<str>>(&self, key: K, options: GetOptions) -> Result<PathBuf>;
  fn url<K: AsRef<str>>(&self, key: K, options: UrlOptions) -> Result<Url>;

  fn size<K: AsRef<str>>(&self, key: K) -> Result<usize>;
}
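For the local backend, a sketch of the size variant might look like this (base_path is a stand-in for wherever LocalStorage roots its keys, not the crate's actual field):

use std::io::Result;
use std::path::Path;

/// Returns the size in bytes of the object stored under `key`.
fn size(base_path: &Path, key: &str) -> Result<u64> {
  Ok(std::fs::metadata(base_path.join(key))?.len())
}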

Responses always contain the same class.

JSON responses always contain the same class string, either only "body" or only "header". Responses that contain a body also label the header as "body". For example, this test returns a "body" class for the first byte range, even though it represents a header:

Ok(
    Response {
        format: Bam,
        urls: [
            Url {
                url: "http://127.0.0.1:8081/data/htsnexus_test_NA12878.bam",
                headers: Some(
                    Headers(
                        {
                            "Range": "bytes=0-4667",
                        },
                    ),
                ),
                class: Body,
            },
            Url {
                url: "http://127.0.0.1:8081/data/htsnexus_test_NA12878.bam",
                headers: Some(
                    Headers(
                        {
                            "Range": "bytes=256721-647345",
                        },
                    ),
                ),
                class: Body,
            },
            Url {
                url: "http://127.0.0.1:8081/data/htsnexus_test_NA12878.bam",
                headers: Some(
                    Headers(
                        {
                            "Range": "bytes=824361-842100",
                        },
                    ),
                ),
                class: Body,
            },
            Url {
                url: "http://127.0.0.1:8081/data/htsnexus_test_NA12878.bam",
                headers: Some(
                    Headers(
                        {
                            "Range": "bytes=977196-996014",
                        },
                    ),
                ),
                class: Body,
            },
            Url {
                url: "data:;base64,H4sIBAAAAAAA/wYAQkMCABsAAwAAAAAAAAAAAA==",
                headers: None,
                class: Body,
            },
        ],
    },
)

This does not line up with the example in the htsget protocol, which labels headers with the "header" class:

{
   "htsget" : {
      "format" : "BAM",
      "urls" : [
         {
            "url" : "data:application/vnd.ga4gh.bam;base64,QkFNAQ==",
            "class" : "header"
         },
         {
            "url" : "https://htsget.blocksrv.example/sample1234/header",
            "class" : "header"
         },
         {
            "url" : "https://htsget.blocksrv.example/sample1234/run1.bam",
            "headers" : {
               "Authorization" : "Bearer xxxx",
               "Range" : "bytes=65536-1003750"
             },
            "class" : "body"
         },
         {
            "url" : "https://htsget.blocksrv.example/sample1234/run1.bam",
            "headers" : {
               "Authorization" : "Bearer xxxx",
               "Range" : "bytes=2744831-9375732"
            },
            "class" : "body"
         }
      ]
   }
}

I'm not sure what the best class is for ranges that contain both header and body bytes. Maybe in this case it would be best to drop the class string, as it is optional anyway.

Allow built-in storage streaming server to serve over HTTP scheme

Context:

  • To allow the built-in tower/axum storage streaming server the option of being configured with either HTTP or HTTPS

Use Case:

  • HTTP is still desirable for local dev, because XHR JavaScript clients (like the one embedded in igv.js) will fail and block with local self-signed (local-trusted-CA) certs in the browser

Prefer:

  • It would be great if it came with the HTTP scheme out of the box, and was then configurable to HTTPS as required.

Only one curl example works from htsget-http-actix README.md

This is most probably a regression introduced in the latest changes? I do recall these working. In any case, it's a good opportunity to put those examples under CI and not just have them on the README.md... I'll determine the root cause and fix this for good ;)

% curl '127.0.0.1:8080/variants/data/vcf/sample1-bcbio-cancer'
curl: (52) Empty reply from server
% curl --header "Content-Type: application/json" -d '{}' '127.0.0.1:8080/variants/data/vcf/sample1-bcbio-cancer'
curl: (52) Empty reply from server
% curl '127.0.0.1:8080/variants/data/vcf/sample1-bcbio-cancer?format=VCF&class=header'
curl: (52) Empty reply from server
% curl --header "Content-Type: application/json" -d '{"format": "VCF", "regions": [{"referenceName": "chrM"}]}' '127.0.0.1:8080/variants/data/vcf/sample1-bcbio-cancer'
curl: (52) Empty reply from server
% curl 127.0.0.1:8080/variants/service-info
{
  "id": "",
  "name": "HtsGet service",
  "version": "",
  "organization": {
    "name": "Snake oil",
    "url": "https://en.wikipedia.org/wiki/Snake_oil"
  },
  "type": {
    "group": "org.ga4gh",
    "artifact": "htsget",
    "version": "1.3.0"
  },
  "htsget": {
    "datatype": "variants",
    "formats": [
      "VCF",
      "BCF"
    ],
    "fieldsParametersEffective": false,
    "TagsParametersEffective": false
  },
  "contactUrl": "",
  "documentationUrl": "https://github.com/umccr/htsget-rs/tree/main/htsget-http-actix",
  "createdAt": "",
  "UpdatedAt": "",
  "environment": "testing"
}

Meanwhile on the other "server" console:

% cargo run -p htsget-http-actix
    Finished dev [unoptimized + debuginfo] target(s) in 0.24s
     Running `target/debug/htsget-http-actix`
thread 'actix-rt:worker:0' panicked at 'there is no reactor running, must be called from the context of a Tokio 1.x runtime', /Users/rvalls/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/blocking/pool.rs:84:33
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'actix-rt:worker:1' panicked at 'there is no reactor running, must be called from the context of a Tokio 1.x runtime', /Users/rvalls/dev/umccr/htsget-rs/htsget-http-core/src/async_http_core.rs:45:18
thread 'actix-rt:worker:2' panicked at 'there is no reactor running, must be called from the context of a Tokio 1.x runtime', /Users/rvalls/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.12.0/src/runtime/blocking/pool.rs:84:33
thread 'actix-rt:worker:3' panicked at 'there is no reactor running, must be called from the context of a Tokio 1.x runtime', /Users/rvalls/dev/umccr/htsget-rs/htsget-http-core/src/async_http_core.rs:45:18

/cc @CastilloDel

Merge overlapping byte ranges

In some cases we could get overlapping byte ranges like:

    Response {
        format: Bam,
        urls: [
            Url {
                url: "file:///Users/chris/dev/personal/htsget/data/htsnexus_test_NA12878.bam",
                headers: Some(
                    Headers(
                        {
                            "Range": "bytes=4668-1042732",
                        },
                    ),
                ),
                class: None,
            },
            Url {
                url: "file:///Users/chris/dev/personal/htsget/data/htsnexus_test_NA12878.bam",
                headers: Some(
                    Headers(
                        {
                            "Range": "bytes=977196-2177677",
                        },
                    ),
                ),
                class: None,
            },
            Url {
                url: "file:///Users/chris/dev/personal/htsget/data/htsnexus_test_NA12878.bam",
                headers: Some(
                    Headers(
                        {
                            "Range": "bytes=2060795-",
                        },
                    ),
                ),
                class: None,
            },
        ],
    }
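A minimal sketch of the merging step (the (start, end) pair representation is an assumption, not the crate's actual types; open-ended ranges like bytes=2060795- would need to be treated as extending to the end of the file):

/// Merges overlapping or adjacent byte ranges, assuming inclusive ends.
fn merge_ranges(mut ranges: Vec<(u64, u64)>) -> Vec<(u64, u64)> {
  ranges.sort_unstable_by_key(|&(start, _)| start);
  let mut merged: Vec<(u64, u64)> = Vec::with_capacity(ranges.len());
  for (start, end) in ranges {
    match merged.last_mut() {
      // This range overlaps or touches the previous one: extend it.
      Some(last) if start <= last.1.saturating_add(1) => last.1 = last.1.max(end),
      _ => merged.push((start, end)),
    }
  }
  merged
}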

Remove `file://`

To implement the benchmarks, we need to remove the file:// scheme (which also isn't allowed by the htsget spec). See here. Last time I just changed it to http://localhost, but this time I was thinking maybe we should take a more permanent route? If I understand correctly, LocalStorage shouldn't be constrained to localhost.
