refgenie / refgenieserver

Serves a web interface and RESTful API for reference genome assets.

Home Page: http://refgenie.databio.org

License: BSD 2-Clause "Simplified" License

Dockerfile 0.29% Python 66.75% HTML 25.74% CSS 1.71% Shell 5.51%

api-server docker-container genome-assembly reference-genome

refgenieserver's Introduction

A standardized reference genome resource manager. See the documentation.

refgenieserver's Issues

do not update the refgenie config file

A new YAML file should be created for server use after refgenieserver archive runs, rather than editing the original one.

location: where the genome_archive key points

use logging in refgenieserver archive

use logging (possibly logmuse) instead of warnings in refgenieserver archive

Error with archive

Separate from #32; if I give the config then I get this error:

refgenieserver -c $REFGENIE archive
email-validator not installed, email fields will be treated as str.
To install, run: pip install email-validator
Traceback (most recent call last):
  File "/home/nsheff/miniconda3/lib/python3.7/site-packages/attmap/pathex_attmap.py", line 31, in __getattr__
    v = super(PathExAttMap, self).__getattribute__(item)
AttributeError: 'YacAttMap' object has no attribute 'set_default'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/nsheff/miniconda3/lib/python3.7/site-packages/attmap/ordattmap.py", line 45, in __getitem__
    return super(OrdAttMap, self).__getitem__(item)
KeyError: 'set_default'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/nsheff/miniconda3/lib/python3.7/site-packages/attmap/pathex_attmap.py", line 34, in __getattr__
    return self.__getitem__(item, expand)
  File "/home/nsheff/miniconda3/lib/python3.7/site-packages/attmap/pathex_attmap.py", line 51, in __getitem__
    v = super(PathExAttMap, self).__getitem__(item)
  File "/home/nsheff/miniconda3/lib/python3.7/site-packages/attmap/ordattmap.py", line 47, in __getitem__
    return AttMap.__getitem__(self, item)
  File "/home/nsheff/miniconda3/lib/python3.7/site-packages/attmap/attmap.py", line 32, in __getitem__
    return self.__dict__[item]
KeyError: 'set_default'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/nsheff/.local/bin/refgenieserver", line 10, in <module>
    sys.exit(main())
  File "/home/nsheff/.local/lib/python3.7/site-packages/refgenieserver/main.py", line 124, in main
    archive(rgc, args)
  File "/home/nsheff/.local/lib/python3.7/site-packages/refgenieserver/server_builder.py", line 52, in archive
    asset_desc = rgc[CFG_GENOMES_KEY][genome][CFG_ASSETS_KEY][asset_name].set_default(CFG_ASSET_DESC_KEY, "NA")
  File "/home/nsheff/miniconda3/lib/python3.7/site-packages/attmap/pathex_attmap.py", line 38, in __getattr__
    raise AttributeError(item)
AttributeError: set_default
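One possible reading of the traceback above is that `set_default` was meant to be the standard mapping method `setdefault`. The sketch below shows that method on a plain dict; it assumes the asset entry behaves like a standard mutable mapping, and the key name `asset_description` is a hypothetical stand-in for the value of CFG_ASSET_DESC_KEY.

```python
# A plain dict stands in for the asset entry from the config (assumption).
asset = {"path": "indexed_bowtie2"}

# setdefault returns the existing value if the key is present; otherwise it
# stores the default under that key and returns it.
asset_desc = asset.setdefault("asset_description", "NA")

print(asset_desc)
print(asset)
```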

naming

I originally used refgenie_server as the name of this repository, and then used refgenies as the command to invoke to start the server.

I'm not sure if refgenies is good because it is just so close to refgenie that it may cause some confusion.

I wanted to see what you thought about this and if you had any other suggestion for CLI name.

some ideas:

refgenserve
rgserver
rgcserver
refgenieserver

I'm leaning toward rgserver for the command name....

TypeError: __call__() missing 2 required positional arguments: 'receive' and 'send'

docker build -t fastapi .
docker run --rm -p 80:80 --name fastapi -v $(pwd):/app -v $(pwd)/files:/genomes fastapi refgenieserver -c refgenie.yaml serve

INFO: Started server process [1]
INFO: Uvicorn running on http://0.0.0.0:80 (Press CTRL+C to quit)
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/uvicorn/protocols/http/httptools_impl.py", line 371, in run_asgi
    asgi = app(self.scope)
TypeError: __call__() missing 2 required positional arguments: 'receive' and 'send'
INFO: ('172.17.0.1', 35496) - "GET / HTTP/1.1" 500
ERROR: Exception in ASGI application

endpoint to identify genome by hash?

Will we want to be able to return server metadata for a genome that is requested by hash value?

version API

Some reading:

https://stackoverflow.com/questions/389169/best-practices-for-api-versioning

https://www.troyhunt.com/your-api-versioning-is-wrong-which-is/

idea:

change main.py to:

from .version1 import *
from .version2 import *

and version1.py is:

VERSION = "v1"
@app.get("/")
@app.get(f"/{VERSION}/index")
async def index(request: Request):
    ...

Then refgenconf says VERSION="v1" and can be updated as needed.
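A note on building versioned route paths for this idea: `os.path.join` discards every earlier component once it meets an absolute one, so joining a version string with "/index" loses the version. The sketch below uses `posixpath` (the POSIX implementation behind `os.path`) so the result is the same on any platform; the route strings are only illustrative.

```python
import posixpath

VERSION = "v1"

# posixpath.join drops earlier components when a later one starts with "/".
joined = posixpath.join(VERSION, "/index")

# Plain string formatting keeps the version prefix.
route = f"/{VERSION}/index"

print(joined)  # /index
print(route)   # /v1/index
```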

refgenies archive command

We need a script that will take a refgenie.yaml config file and build the individual tar archives the server uses.

this is a separate command-line tool that comes with the server package.

Archive function also requires a `genomes_desc` argument

According to readme, the archive function only requires a config file:
"refgenieserver archive -c CONFIG
It just requires a -c argument or $REFGENIE environment variable."

However, it also appears to require genomes_desc:

refgenieserver archive -c test_multiple_servers.yaml 
Traceback (most recent call last):
  File "/home/user/.local/bin/refgenieserver", line 10, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.7/site-packages/refgenieserver/main.py", line 32, in main
    archive(rgc, arp, args.force, args.remove, selected_cfg)
TypeError: archive() missing 1 required positional argument: 'genomes_desc'

change command order

right now we say refgenie pull -c CONFIG but it's refgenieserver -c CONFIG serve. We should make the config come after the command as it does with refgenie. I find that more intuitive.

archiver should use YAML in the genome_archive if available

When using --asset and/or --genome options the archiver should use the config file from genome_archive to prevent overwriting the assets metadata stored in the existing one.

New Salmon indices with Genome

Hi guys,

Thanks for maintaining such an awesome resource.
I am from the salmon developers' group, and we recently released a new version of salmon that supports highly accurate quantification (check this preprint for more details). Some of our users are interested in prebuilt salmon indices (COMBINE-lab/salmon#444), and we are exploring resources for uploading new genome-based salmon indices that are downloadable and queryable, maybe by species or by reference sequence. I found refgenie very useful and am wondering whether you are open to updating your database with the latest salmon indices. For starters, support for just the human and mouse indices would be great too.

refgenie pull consistently fails to retrieve complete asset (large asset only?)

Not sure if this is related to the size of the asset, but refgenie pull consistently is unable to retrieve the complete file.

$ refgenie pull -g hg38 -a bowtie2
Starting pull for 'hg38/bowtie2'
'hg38/bowtie2' archive size: 3.5GB
Downloading URL: http://refgenomes.databio.org/asset/hg38/bowtie2/archive
hg38/bowtie2:  16%|████████████████████████▉                                                                                                                                   | 600M/3.75G [00:19<01:44, 30.2MB/s]
<urlopen error retrieval incomplete: got only 600489984 out of 3749238990 bytes>
'hg38/bowtie2' download incomplete

$ refgenie pull -g hg38 -a bowtie2
Starting pull for 'hg38/bowtie2'
'hg38/bowtie2' archive size: 3.5GB
Downloading URL: http://refgenomes.databio.org/asset/hg38/bowtie2/archive
hg38/bowtie2:   5%|███████▊                                                                                                                                                    | 186M/3.75G [00:10<03:22, 17.6MB/s]
<urlopen error retrieval incomplete: got only 186474496 out of 3749238990 bytes>
'hg38/bowtie2' download incomplete

$ refgenie pull -g hg38 -a bowtie2
Starting pull for 'hg38/bowtie2'
'hg38/bowtie2' archive size: 3.5GB
Downloading URL: http://refgenomes.databio.org/asset/hg38/bowtie2/archive
hg38/bowtie2:  58%|██████████████████████████████████████████████████████████████████████████████████████████▏                                                                | 2.18G/3.75G [01:06<00:47, 32.8MB/s]
<urlopen error retrieval incomplete: got only 2181976064 out of 3749238990 bytes>
'hg38/bowtie2' download incomplete

$ refgenie pull -g hg38 -a bowtie2
Starting pull for 'hg38/bowtie2'
'hg38/bowtie2' archive size: 3.5GB
Downloading URL: http://refgenomes.databio.org/asset/hg38/bowtie2/archive
hg38/bowtie2:  14%|█████████████████████                                                                                                                                       | 507M/3.75G [00:27<02:57, 18.2MB/s]
<urlopen error retrieval incomplete: got only 506863616 out of 3749238990 bytes>
'hg38/bowtie2' download incomplete

$ refgenie pull -g hg38 -a bowtie2
Starting pull for 'hg38/bowtie2'
'hg38/bowtie2' archive size: 3.5GB
Downloading URL: http://refgenomes.databio.org/asset/hg38/bowtie2/archive
hg38/bowtie2:  14%|█████████████████████▌                                                                                                                                      | 517M/3.75G [00:28<02:58, 18.2MB/s]
<urlopen error retrieval incomplete: got only 517259264 out of 3749238990 bytes>
'hg38/bowtie2' download incomplete

$ refgenie pull -g hg38 -a bowtie2
Starting pull for 'hg38/bowtie2'
'hg38/bowtie2' archive size: 3.5GB
Downloading URL: http://refgenomes.databio.org/asset/hg38/bowtie2/archive
hg38/bowtie2:  26%|█████████████████████████████████████████▏                                                                                                                  | 990M/3.75G [00:31<01:28, 31.0MB/s]
<urlopen error retrieval incomplete: got only 989806592 out of 3749238990 bytes>
'hg38/bowtie2' download incomplete
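One client-side mitigation for interrupted transfers like these is to resume from the bytes already on disk using an HTTP Range request. The sketch below only builds such a request. It assumes the server honours Range headers, which the logs above do not confirm, and `build_resume_request` is a hypothetical helper, not part of refgenie.

```python
import os
import urllib.request


def build_resume_request(url: str, partial_path: str) -> urllib.request.Request:
    """Build a GET request that asks only for the bytes not yet on disk."""
    request = urllib.request.Request(url)
    have = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    if have > 0:
        # Open-ended byte range: from the first missing byte to the end.
        request.add_header("Range", f"bytes={have}-")
    return request
```

A caller would open the request with `urllib.request.urlopen`, append the response body to the partial file, and compare the final size against the advertised archive size before unpacking.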

add custom operationIds

in support of: refgenie/refgenconf#64

see failed tests: https://travis-ci.org/databio/refgenconf/jobs/595340325

staging config

@nmagee can we make the staging server operate on a different genome config file from the main server?

more frequent server config updates in archiver

new entries (asset with its attributes) should be added to the config every time an asset is processed, rather than after the archive run

wait for refgenie/refgenie#26

passing config file to the server

we'll need a way for the server to get the configuration filepath it should use. perhaps we should use what we did for caravel to do this... but I can't reconcile that with how to easily run it in a container right now...

add genome level tarball attributes in archiver

originally proposed by @nsheff in #17

the 'all asset' link also needs a checksum and filesize. we need a way to build this information into the 'archive' command

are mounting app docs incorrect?

in our directions we use -v to mount the /app dir.

I know you used to need to do this, but with the switch to importing uvicorn to run on the command line, is this still necessary? Can we update the docs if not?

add hook to list genomes by asset

Once refgenconf has a function to list genomes by asset (refgenie/refgenconf#7), there should be an endpoint to access it.

Refgenomes.databio.org master server is giving a 404 error

http://refgenomes.databio.org/ is down.

It was actually down (giving a 404 error) before the release... but anyway, now I just released it and it's still down...

server interface ideas

add to assets list:

size of the files
links to complete archives (add a separate API endpoint for these)
links to individual asset tarballs (already exists)

add to index landing page:

link to /docs swagger auto-docs
links to some common api points (/genomes, /assets, or whatever)
switch to bootstrap
~~some text introducing what this is~~ Added, needs text (not in a new tab, but on the landing page, under the list)
link to refgenie.databio.org; github for refgenie_server?
refgenie_server version
logo

Multiple sources of truth

There's a config file in this project's root that differs in structure from the one in, e.g., refgenconf's root. The one here should move to refgenconf, and then the one here should point to that one.

PyPi installation error

requirements are not included in the package, because MANIFEST.in is missing.

> pip3.6 install --user refgenieserver
Collecting refgenieserver
  Using cached https://files.pythonhosted.org/packages/c3/3e/c013539317e35d22a5201129ddb21df0394e45cf079fdf5a017108fe73c0/refgenieserver-0.3.2.tar.gz
    ERROR: Command errored out with exit status 1:
     command: /usr/local/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/3f/0wj7rs2144l9zsgxd3jn5nxc0000gn/T/pip-install-qq9uihwz/refgenieserver/setup.py'"'"'; __file__='"'"'/private/var/folders/3f/0wj7rs2144l9zsgxd3jn5nxc0000gn/T/pip-install-qq9uihwz/refgenieserver/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base pip-egg-info
         cwd: /private/var/folders/3f/0wj7rs2144l9zsgxd3jn5nxc0000gn/T/pip-install-qq9uihwz/refgenieserver/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/3f/0wj7rs2144l9zsgxd3jn5nxc0000gn/T/pip-install-qq9uihwz/refgenieserver/setup.py", line 13, in <module>
        with open("requirements/requirements-all.txt", 'r') as reqs_file:
    FileNotFoundError: [Errno 2] No such file or directory: 'requirements/requirements-all.txt'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

packaging up the refgenie_server

should we make refgenie server a python package?

Server GUI changes

add links to all genomes on the server up at the top
display genomes in an alphabetical order

add examples to api docs

the swagger autodocs should include examples. right now it says 'none available'.

we just need to add examples in the code in the appropriate way to give users example queries they can try out.

archive tars the files with the directory structure

see -C or --strip-components option
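If the archiver builds tarballs from Python rather than by calling tar, the `arcname` argument of `tarfile.TarFile.add` has the same effect as `-C`. This is a sketch under that assumption; `archive_asset` is a hypothetical helper and the layout genomes/hg38/bowtie2 is only an example.

```python
import os
import tarfile


def archive_asset(asset_dir: str, tar_path: str) -> None:
    """Tar asset_dir so members start at its basename, not the full path."""
    top_level = os.path.basename(os.path.normpath(asset_dir))
    with tarfile.open(tar_path, "w:gz") as tar:
        # arcname replaces the on-disk path prefix inside the archive.
        tar.add(asset_dir, arcname=top_level)
```

With this, archiving /data/genomes/hg38/bowtie2 produces members named bowtie2/... rather than data/genomes/hg38/bowtie2/...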

introduce tagging concept

add the following logic to the asset pull request:

if no tag: return default (if it exists)
if tag: return that tag (if it exists)
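The two rules above can be sketched as a small lookup function. The asset layout (a "tags" mapping plus a "default_tag" key) and the name `resolve_tag` are assumptions made for illustration, not the refgenconf data model.

```python
from typing import Mapping, Optional


def resolve_tag(asset: Mapping, tag: Optional[str] = None) -> str:
    """Pick the tag to serve: the requested one, else the asset's default."""
    tags = asset.get("tags", {})
    if tag is None:
        # No tag requested: fall back to the default, if one exists.
        tag = asset.get("default_tag")
        if tag is None:
            raise KeyError("no tag requested and no default tag set")
    if tag not in tags:
        raise KeyError(f"tag not found: {tag}")
    return tag


asset = {"default_tag": "v2", "tags": {"v1": {}, "v2": {}}}
print(resolve_tag(asset))        # v2
print(resolve_tag(asset, "v1"))  # v1
```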

docs for archive

I am unable to figure out how the refgenie archive command works. Do I give it a config file?

home page style updates

a few updates:

reduce font size on navbar to match our other pages (eg http://refgenie.databio.org)
make the genome->asset connection more clear by moving the genome name into a column of the asset table. one row could be 'all assets', which would be the link currently anchored by the genome name.
add version numbers at footer, as we have done with other packages.
we should add asset descriptions... I suppose this will have to be provided manually for now but it seems important to describe exactly what the asset is on this page.
~~the 'all asset' link also needs a checksum and filesize. we need a way to build this information into the 'archive' command ?~~ moved to #18
there is a 'list_assets.html' page, but at what endpoint is this rendered?
we should add links to other endpoints from the front page as well, for each asset/genome
rename template files so that one isn't called template.html, which is confusing
column sizes in tables

table representation of the resources

now the data that can be served/downloaded is represented as a list with badges providing additional info

it should be represented in a form of a set of tables, one for each genome, where assets are rows and columns are the assets' attributes

dev branch

Per request of @nsheff added a dev branch to the project, based on 045ae81. Deployments using .travis.yml have been updated as well for automation. Build/deploys are taking ~60 seconds. Follow Travis-CI here

DEV/Development Branch

http://dev.refgenomes.uvasomrc.io/
http://dev.refgenomes.databio.org/ [DNS required]
http://dev.refgenomes.uvadcos.io/ [DCOS preview - in testing]

MASTER/Production Branch

http://refgenomes.uvasomrc.io/
http://refgenomes.databio.org/
http://refgenomes.uvadcos.io/ [DCOS preview - in testing]

Merging Upstream:

Minor tweaks and changes can be done via merge.
New features / testing / UAT should be done via PRs.

To Do:

Update DNS for dev.refgenomes.databio.org to point as a CNAME to dev1.uvasomrc.io.
Continue testing of Traefik LB per #23 to determine move to DCOS for both branches. This includes Redis data migration, if required.

adding new attributes to genome config file format

we originally described the format as just one layer deep: refgenie/refgenie#6

like:

# Genome configuration

genome_folder: $GENOMES
genome_server: http://localhost

genomes:
  hg38:
    bowtie2: indexed_bowtie2
    hisat2: indexed_hisat2
    tss_annotation: TSS.bed.gz
    gtf: blah.gtf
  mm10:
    bowtie2: indexed_bowtie2
    blacklist: blacklist/mm10.bed
 
  rCRSd:
    bowtie2: indexed_bowtie2

but the server requires a bit more information: the archive folder at the project level, and then for individual assets we need filesize, checksum, etc. So for server we need a bit more info. I therefore propose this format:

# Genome configuration

genome_folder: $GENOMES
genome_server: http://localhost
genome_archive: /path/to/archives

genomes:
  hg38:
    bowtie2:
      path: indexed_bowtie2
      archive_checksum: mm20349234n20349280345mv2035
      asset_size: 32G
      archive_size: 7G
    hisat2:
      path: indexed_hisat2
      archive_checksum: mm20349234n20349280345mv2035
      asset_size: 12G
      archive_size: 4G
    tss_annotation:
      path: TSS.bed.gz
      archive_checksum: mm20349234n20349280345mv2035
      asset_size: 21M
      archive_size: 3M

Endpoints for improved information

In light of #7, we need to rethink how our endpoints will share this new information. I propose:

/{genome}/{asset} - provides a complete JSON with all attributes for that asset (so, path, archive_size, asset_size, and checksum...and anything else we add).

/{genome}/{asset}/archive - downloads the archive itself (currently what we're serving with /{genome}/{asset})

I think that suffices. I don't really see the need to allow accessing each attribute individually. The CLI can use the main JSON endpoint to get info about file sizes and checksums for display and verification (which is not possible with our current endpoints).
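A framework-free sketch of what the proposed JSON endpoint would return, using a config dict shaped like the nested format proposed in #7. The function name `asset_attributes` is hypothetical, and the values are the placeholder ones from that proposal.

```python
import json

# Config shaped like the proposed nested format (placeholder values).
CONFIG = {
    "genome_archive": "/path/to/archives",
    "genomes": {
        "hg38": {
            "bowtie2": {
                "path": "indexed_bowtie2",
                "archive_checksum": "mm20349234n20349280345mv2035",
                "asset_size": "32G",
                "archive_size": "7G",
            }
        }
    },
}


def asset_attributes(config: dict, genome: str, asset: str) -> str:
    """Return every attribute of one asset as a JSON document."""
    return json.dumps(config["genomes"][genome][asset], sort_keys=True)


print(asset_attributes(CONFIG, "hg38", "bowtie2"))
```

A client could read `archive_size` before downloading and compare `archive_checksum` afterwards.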

do not rely on args.config in archiver

refgenieserver archive
email-validator not installed, email fields will be treated as str.
To install, run: pip install email-validator
Traceback (most recent call last):
  File "/home/nsheff/.local/bin/refgenieserver", line 10, in <module>
    sys.exit(main())
  File "/home/nsheff/.local/lib/python3.7/site-packages/refgenieserver/main.py", line 124, in main
    archive(rgc, args)
  File "/home/nsheff/.local/lib/python3.7/site-packages/refgenieserver/server_builder.py", line 26, in archive
    server_rgc_path = os.path.join(rgc[CFG_ARCHIVE_KEY], os.path.basename(args.config))
  File "/home/nsheff/miniconda3/lib/python3.7/posixpath.py", line 146, in basename
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType

it should be reading the config from REFGENIE? or at least giving an understandable error.
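A sketch of the requested behaviour: prefer the command-line value, fall back to the environment variable, and fail with a readable message otherwise. `resolve_config_path` is a hypothetical helper; only the REFGENIE variable name and the -c flag come from the report above.

```python
import os
from typing import Optional


def resolve_config_path(cli_path: Optional[str], env_var: str = "REFGENIE") -> str:
    """Return the config path from the CLI argument or the environment."""
    chosen = cli_path or os.environ.get(env_var)
    if not chosen:
        # Fail early with a message instead of passing None to os.path.
        raise ValueError(
            f"No config file given: pass -c CONFIG or set ${env_var}"
        )
    return chosen
```

Resolving the path once, before any `os.path` call, avoids the TypeError raised when `args.config` is None.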

endpoint for genome information

Related to #36: right now there is no endpoint to retrieve metadata about a genome. We will need this to implement collection-level checksums.

generation of links to the relevant endpoints in asset splash page

we need to systematically construct a list of relevant endpoints in asset splash page.

We should source them from the OpenAPI description JSON file by fetching just the operation_ids that describe the endpoints of interest. To do that, the operation_ids should be stored in a dictionary in refgenconf that defines the hierarchy we need.
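A sketch of the lookup, assuming the description follows the standard OpenAPI layout (paths, then HTTP method, then an `operationId` field). The function name and the example operation ids are hypothetical.

```python
from typing import Dict, Iterable


def paths_by_operation_id(openapi: dict, wanted: Iterable[str]) -> Dict[str, str]:
    """Map each wanted operationId to the path template that declares it."""
    wanted = set(wanted)
    found = {}
    for path, methods in openapi.get("paths", {}).items():
        for spec in methods.values():
            # Path items may also hold non-operation entries; skip those.
            if isinstance(spec, dict) and spec.get("operationId") in wanted:
                found[spec["operationId"]] = path
    return found


# Hypothetical excerpt of an OpenAPI description.
example = {
    "paths": {
        "/asset/{genome}/{asset}": {"get": {"operationId": "asset_attrs"}},
        "/asset/{genome}/{asset}/archive": {"get": {"operationId": "asset_archive"}},
        "/genomes": {"get": {"operationId": "list_genomes"}},
    }
}
print(paths_by_operation_id(example, ["asset_attrs", "asset_archive"]))
```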

hg19 ensembl_gtf asset is incorrect

It appears that the ensembl_gtf asset for hg19 either came from the wrong initial file or was changed along the way. It should contain the following identifiers in column 3:

gzip -dc ensembl_gtf/hg19.gtf.gz | awk '{print $3}' | sort -u

CDS
exon
five_prime_utr
gene
Selenocysteine
start_codon
stop_codon
three_prime_utr
transcript

But instead it has:

CDS
exon
gene
Selenocysteine
start_codon
stop_codon
transcript
UTR

Checking the original source file, as we reference in the docs, includes the "correct" values. Not sure what the original file from http://refgenomes.databio.org is derived from...
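The same check can be scripted when comparing several GTF files. This sketch mirrors the gzip and awk pipeline above with two differences: it splits on tabs, as the GTF format specifies, and it skips header lines starting with "#". `feature_types` is a hypothetical helper.

```python
import gzip
from typing import Set


def feature_types(gtf_gz_path: str) -> Set[str]:
    """Collect the distinct values of column 3 in a gzipped GTF file."""
    found = set()
    with gzip.open(gtf_gz_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # header or comment line
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                found.add(fields[2])
    return found
```

Comparing `feature_types("a.gtf.gz") - feature_types("b.gtf.gz")` shows which feature names appear in only one file, such as `five_prime_utr` versus `UTR`.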

alternate way to count asset downloads

raised by @nmagee

The FS used by the container is read-only, so we can't update the config file in order to store download counts:

INFO: ('10.0.3.50', 36128) - "GET /asset/ERCC92/bowtie1/archive HTTP/1.1" 500
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/uvicorn/protocols/http/httptools_impl.py", line 368, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/usr/local/lib/python3.7/site-packages/starlette/applications.py", line 133, in __call__
    await self.error_middleware(scope, receive, send)
  File "/usr/local/lib/python3.7/site-packages/starlette/middleware/errors.py", line 172, in __call__
    raise exc from None
  File "/usr/local/lib/python3.7/site-packages/starlette/middleware/errors.py", line 150, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.7/site-packages/starlette/exceptions.py", line 73, in __call__
    raise exc from None
  File "/usr/local/lib/python3.7/site-packages/starlette/exceptions.py", line 62, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.7/site-packages/starlette/routing.py", line 585, in __call__
    await route(scope, receive, send)
  File "/usr/local/lib/python3.7/site-packages/starlette/routing.py", line 207, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.7/site-packages/starlette/routing.py", line 40, in app
    response = await func(request)
  File "/usr/local/lib/python3.7/site-packages/fastapi/routing.py", line 117, in app
    raw_response = await dependant.call(**values)
  File "/app/refgenieserver/main.py", line 67, in download_asset
    update_stats(rgc, genome, asset)
  File "/app/refgenieserver/helpers.py", line 96, in update_stats
    rgc.write()
  File "/usr/local/lib/python3.7/site-packages/yacman/yacman.py", line 34, in write
    with open(filename, 'w') as f:
OSError: [Errno 30] Read-only file system: '/genomes/genomes.yaml'

We need an alternate way to track downloads:

Other suggestions if you need alternate ways to track counts – you could +1 to a specific key in Redis (there is a ‘counter’ data type made for that), which we have available, or to a more robust DB either relational or nosql.

To do:

revert f486b1d
implement new approach
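As an illustration of keeping counts outside the config file, here is a sketch that uses the standard library `sqlite3` module as a stand-in for the Redis counter suggested above (with Redis, the analogous operation is the INCR command on a per-asset key). The table layout and function names are hypothetical, and a file-backed database would need a writable location outside the read-only mount.

```python
import sqlite3


def open_counter_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the download-count table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS downloads ("
        "genome TEXT, asset TEXT, n_downloads INTEGER NOT NULL DEFAULT 0, "
        "PRIMARY KEY (genome, asset))"
    )
    return conn


def record_download(conn: sqlite3.Connection, genome: str, asset: str) -> int:
    """Add one to the asset's counter and return the new total."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO downloads (genome, asset, n_downloads) "
            "VALUES (?, ?, 0)",
            (genome, asset),
        )
        conn.execute(
            "UPDATE downloads SET n_downloads = n_downloads + 1 "
            "WHERE genome = ? AND asset = ?",
            (genome, asset),
        )
    row = conn.execute(
        "SELECT n_downloads FROM downloads WHERE genome = ? AND asset = ?",
        (genome, asset),
    ).fetchone()
    return row[0]
```

Either way, the download handler no longer needs to rewrite genomes.yaml on each request.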

use checksums

the builder should produce and save an md5 checksum of the archive after building.
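A sketch of computing the checksum together with the file size, reading in chunks so that multi-gigabyte archives are not loaded into memory. `md5_and_size` is a hypothetical helper name.

```python
import hashlib
import os
from typing import Tuple


def md5_and_size(path: str, chunk_size: int = 1 << 20) -> Tuple[str, int]:
    """Return the hex MD5 digest and the size in bytes of a file."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)
```

Both values could be written into the server config next to the archive path so that clients can verify a download.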

asset column width variability

set the asset name column width to a constant value, so that the table width is equal among assets

tagging in archive

similar to #39, but for refgenieserver archive

favicon

@nsheff can you create a square-shaped logo (maybe it's just the oil lamp from the original one) so that we can create a favicon out of it?

server has url hard coded

if you visit 'http://refgenomes.uvadcos.io/' it says: Welcome to refgenomes.databio.org.

that must be hard-coded...should probably make that banner dynamic?

quotes in asset and archive digest endpoints

is there a reason the digest endpoints return the digests with quotes?

http://staging.refgenomes.databio.org/v2/asset/hg19/fasta/default/archive_digest?tag=default

"06e47db59dad7054c4b45df7abbaef20"

Finding seek keys for an asset

Raised in conversation with @johanneskoester.

When browsing available assets via the web interface, one of the things you really want to know is: what files are available in this asset (aka, which seek keys are there?). This is buried a bit: you have to click the 'attributes' tag and then look for them in the resulting JSON.

Should we move the seek keys to be displayed prominently on the web page?

Another possibility: could we have a "splash page" for each asset that shows all of this information more nicely formatted? It should show the seek keys, the parents, children, with links to them, for example.

This would not be hard to do; we just need a new template for an "asset detail" page that displays all info for just 1 asset, and then we need an API point that serves that for a given asset, then link to each "asset detail" page from the main asset list page. Would this solve your request, @johanneskoester?

no way to remove archives programmatically

currently there's no way to programmatically remove archives.
refgenieserver archive just checks for assets that are present in the genome config but not in the server genome config and builds them. It does not do the reverse check, which would be followed by archive removal.
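The reverse check amounts to a set difference between the two configs. The sketch below assumes both are available as plain genome-to-assets mappings; `archives_to_remove` is a hypothetical helper, and the actual file removal is left out.

```python
from typing import List, Mapping, Tuple


def archives_to_remove(
    genome_cfg: Mapping[str, Mapping], server_cfg: Mapping[str, Mapping]
) -> List[Tuple[str, str]]:
    """List (genome, asset) pairs archived on the server but gone locally."""
    local = {(g, a) for g, assets in genome_cfg.items() for a in assets}
    served = {(g, a) for g, assets in server_cfg.items() for a in assets}
    return sorted(served - local)


genome_cfg = {"hg38": {"bowtie2": {}}}
server_cfg = {"hg38": {"bowtie2": {}, "hisat2": {}}, "mm10": {"bowtie2": {}}}
print(archives_to_remove(genome_cfg, server_cfg))
```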

Calling refgenieserver alone at commandline leads to AttributeError

$ refgenieserver
Traceback (most recent call last):
  File "/home/user/.local/bin/refgenieserver", line 10, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.7/site-packages/refgenieserver/main.py", line 22, in main
    logger_args = dict(name=PKG_NAME, fmt=LOG_FORMAT, level=5) if args.debug else dict(name=PKG_NAME, fmt=LOG_FORMAT)
AttributeError: 'Namespace' object has no attribute 'debug'

Ideally should default to the --help messaging yes?
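A sketch of that behaviour with `argparse`: when no subcommand is given, print the help text and exit with a non-zero status before any subcommand-specific attribute is read. The parser layout here (the -c and -d flags on each subcommand) is an assumption for illustration, not the actual refgenieserver parser.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="refgenieserver")
    subparsers = parser.add_subparsers(dest="command")
    for name in ("serve", "archive"):
        sub = subparsers.add_parser(name)
        sub.add_argument("-c", "--config")
        sub.add_argument("-d", "--debug", action="store_true")
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.command is None:
        # No subcommand: show usage instead of failing on args.debug.
        parser.print_help()
        return 1
    # Subcommand dispatch would go here.
    return 0
```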

/genome endpoint is misdocumented.

Related to #31

the /genome/{genome} endpoint documentation says:

Returns a tarball with all the archived assets available for the genome. Requires the genome name as an input.

but the query: http://refgenomes.databio.org/genome/hg38

yields

{"detail":"No such genome on server"}

which is false.