aws-igenomes's Introduction

Hi! I'm @ewels (Phil Ewels) 👋

I'm a bioinformatician from the UK 🇬🇧 living in Sweden 🇸🇪

I've worked in a range of different positions and am currently product manager for open-source software at Seqera.

I created MultiQC, co-founded nf-core and am heavily involved with Nextflow. I also have a host of small side-projects, such as SRA-explorer, AWS-iGenomes and more general-use tools like rich-click and rich-codex.

I'm passionate about making software development best practices available to scientists, opening up high-throughput reproducible research to those working at the forefront of their fields.

aws-igenomes's People

Contributors

apeltzer, ewels

aws-igenomes's Issues

Add Canfam4 genome?

Hello,

Would it be possible to update CanFam to v4 of the genome? v3.1 is very old and v4 is becoming the new standard for analysis.

Thank you,
Keyur

What are the versions of genome files?

Hello AWS-iGenomes,
Thanks for making this database.
I'm currently using the nf-core ATAC-seq pipeline. Its introduction page says it uses the files from this database (https://ewels.github.io/AWS-iGenomes/) as reference genomes for alignment.
I now need to use the BAM files for downstream RGT-HINT analysis. The RGT-HINT package uses the GTF files from Gencode vM25 (mouse) and Gencode v21 (human). I believe these versions do not match those in AWS-iGenomes, because I keep receiving error messages that the gene coordinates do not match.

# I believe the versions of AWS-iGenomes are not Gencode vM25 version (mouse) and Gencode v21 version (human), because the nf-core output file says:
The contents of the annotation directories were downloaded from UCSC on: July 17, 2015.
SmallRNA annotation files were downloaded from miRBase release 21.
# I keep receiving error messages that the gene coordinates do not match.
Report: The scikit HMM encountered errors when applied. in region (10,52417320,52418086). This iteration will be skipped.

The contents of the annotation directories were downloaded from UCSC on: July 17, 2015. Could you please tell me which version you downloaded at that time both for human (https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/) and mouse (https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/)?

Thanks!
Best,
Yuanjian

Bowtie index for alternative haplotype in hg19

Hi,
I have a question about the haplotype information in hg19. Are both the H1 and H2 haplotypes of the MAPT locus (chr17q21.31 region) included in the iGenomes hg19 version? If not, how can I create a Bowtie index for the H2 haplotype?
Appreciate your help.
Jenny

Add ENCODE blacklists

Add ENCODE blacklists for assemblies where available:
https://sites.google.com/site/anshulkundaje/projects/blacklists

These were originally generated relative to UCSC genome builds. The files can be added to a new folder called Annotation/Mappability/. Alternatively, they could be added to the existing Sequence/AbundantSequences/, but the files don't really contain any actual sequence, so it's probably not appropriate.

Depending on the source, the files will have to be unzipped.

These files can be used in conjunction with the liftOver tool to make them available for other assemblies of the same organism, or by changing the chromosome identifiers to map between Ensembl and UCSC (I have done this already for the latest human and mouse Ensembl genomes). However, for now I think this would be a good start!
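The identifier change mentioned above can be sketched in a few lines of Python. This is only an illustration, not the script the author used: the function names are hypothetical, and the simple rule (drop the `chr` prefix, rename `chrM` to `MT`) is an assumption that holds for primary chromosomes only. Unplaced scaffolds and alternate contigs are named differently in the two schemes and would need a proper lookup table.

```python
def ucsc_to_ensembl_chrom(name):
    """Map a UCSC-style primary chromosome name to Ensembl style.

    Assumption: only primary chromosomes are handled correctly.
    Scaffold names (e.g. chrUn_...) need a real mapping table.
    """
    if name == "chrM":
        return "MT"
    if name.startswith("chr"):
        return name[3:]
    return name


def convert_bed_line(line):
    """Rewrite the chromosome column of one tab-separated BED line."""
    # Leave blank lines, comments and track/browser headers untouched.
    if not line.strip() or line.startswith(("#", "track", "browser")):
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    fields[0] = ucsc_to_ensembl_chrom(fields[0])
    return "\t".join(fields) + newline


# Hypothetical blacklist entry, for illustration only.
print(convert_bed_line("chr1\t100\t200\texample_region\n"), end="")
```

Going the other way (Ensembl to UCSC) would be the mirror image, with the same caveat about non-primary sequences.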

Pre-generated STAR indices are incompatible with newer versions of STAR

It appears STAR made a backward-incompatible change in recent releases: the existing indices in AWS-iGenomes only work for STAR <=2.7.1a. Later versions need new indices generated with a more recent version of STAR.

Error when running STAR v2.7.2b is:

EXITING because of FATAL ERROR: Genome version: 20201 is INCOMPATIBLE with running STAR version: 2.7.2b
SOLUTION: please re-generate genome from scratch with running version of STAR, or with version: 2.7.1a
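A pipeline could guard against this before launching an alignment. The sketch below is a rough illustration, assuming the cut-off reported above (2.7.1a) is the only boundary that matters and that version strings look like `2.7.2b`. The function names are hypothetical, and very old STAR releases are not checked.

```python
import re

# Assumption taken from the report above: the pre-built indices work
# with STAR releases up to and including 2.7.1a.
LAST_COMPATIBLE = "2.7.1a"


def parse_star_version(version):
    """Split e.g. '2.7.2b' into (2, 7, 2, 'b') so tuples sort in release order."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)([a-z]?)", version.strip())
    if match is None:
        raise ValueError(f"Unrecognised STAR version string: {version!r}")
    major, minor, patch, suffix = match.groups()
    return (int(major), int(minor), int(patch), suffix)


def can_use_prebuilt_index(version):
    """True if the given STAR version is at or below the assumed cut-off."""
    return parse_star_version(version) <= parse_star_version(LAST_COMPATIBLE)


for v in ("2.7.1a", "2.7.2b"):
    print(v, can_use_prebuilt_index(v))
```

When the check fails, the remaining options are the ones STAR itself suggests: regenerate the index with the running version, or run a compatible release.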

Manifest contains spaces in paths

Hi Phil,

Thanks for making this available!

ngi-igenomes_file_manifest.txt wrongly contains extra spaces (right after the bucket name) on 71 lines (mostly for STAR indices). The path without the space is obviously the correct one.

Andreas
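Until the manifest is regenerated, a consumer could normalise the affected lines on the fly. This is a minimal sketch, assuming the broken entries look like `s3://ngi-igenomes /igenomes/...` (the exact layout of the manifest is not shown in the report, so that form is a guess). `fix_manifest_line` is a hypothetical helper name.

```python
import re

BUCKET = "s3://ngi-igenomes"

# Matches the bucket name followed by one or more stray spaces or tabs.
_STRAY_SPACE = re.compile(re.escape(BUCKET) + r"[ \t]+")


def fix_manifest_line(line):
    """Remove whitespace that directly follows the bucket name."""
    return _STRAY_SPACE.sub(BUCKET, line)


# Assumed shape of a broken entry; the path itself appears elsewhere on this page.
broken = "s3://ngi-igenomes /igenomes/Homo_sapiens/NCBI/build37.2/Sequence/WholeGenomeFasta/\n"
print(fix_manifest_line(broken), end="")
```

Lines that are already correct pass through unchanged, so the function can be applied to every line of the file.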

access denied

I am getting access denied when running both locally and on AWS. I have AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY exported in my environment. Any idea why this might not be working?

aws s3 --region eu-west-1 sync s3://ngi-igenomes/igenomes/Homo_sapiens/NCBI/build37.2/Sequence/WholeGenomeFasta/ .
fatal error: An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied

The future of AWS-iGenomes

I'm currently planning to include the AWS-iGenomes script in an RNA-seq pipeline I'm working on. However, I read that the dataset is only hosted until 30 January 2021, with the hope that the agreement will be renewed. That raised the question of whether there is an alternative plan to host the dataset and keep the script working if AWS doesn't renew the current agreement.

This matters for deciding whether or not to include the script in my pipeline, and I imagine the question is important for other users as well.

Bulk download for a species

Hi Phil,

First of all, thank you for setting up this super practical resource. My question comes from the fact that any given pipeline or analysis needs a number of references for a species. For instance, nf-core/rnaseq uses (iirc) the genome sequence, STAR, RSEM and salmon indexes, a GTF, etc.

Since downloading each one individually is quite tedious, is it possible to get the full dataset for a particular species, say for human? An example would be if the command bash aws-igenomes.sh -g Homo_sapiens -s Ensembl -b GRCh37 downloaded all indexes for Ensembl/GRCh37 instead of asking for -t.

I guess another alternative would be for me to download from iGenomes directly and produce the missing indexes :)

Cheers,
António
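One way to approximate the bulk download requested above, without changing the script, is to sync the whole build prefix in a single call. The sketch below only builds the command. It assumes the bucket layout `s3://ngi-igenomes/igenomes/<genome>/<source>/<build>/` seen in the `aws s3` commands quoted elsewhere on this page, and that the chosen genome, source and build exist under that prefix. `build_sync_command` is a hypothetical helper, not part of aws-igenomes.sh.

```python
import shlex


def build_sync_command(genome, source, build, dest="."):
    """Return an argument list that syncs every file for one genome build."""
    # Assumed prefix layout, inferred from commands quoted on this page.
    prefix = f"s3://ngi-igenomes/igenomes/{genome}/{source}/{build}/"
    return [
        "aws", "s3", "--no-sign-request", "--region", "eu-west-1",
        "sync", prefix, dest,
    ]


cmd = build_sync_command(
    "Homo_sapiens", "Ensembl", "GRCh37",
    dest="./references/Homo_sapiens/Ensembl/GRCh37/",
)
print(shlex.join(cmd))  # inspect first; pass cmd to subprocess.run to execute
```

A full build includes every pre-generated index, so the download is likely to be large. Printing the command before running it makes it easy to check the prefix.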

Beagle reference set

Hello,

I am planning to use beagle5 in one of my Nextflow pipelines (for allele-specific copy number estimation of somatic samples). It would be great if a set of references and a plink genetic map set could be included in iGenomes. Otherwise, users will need to download and create those reference sets themselves, which is a bit of a hassle since it is done per chromosome. Those sets are pretty much standard and open-access, so I guess there won't be an issue adding them here?

Some time ago, I also saw that someone from the nf-core community was using beagle5 for imputation analysis, but I guess the pipeline was never published.

Here are the sources:

Web service is not downloading correct reference

Hi @ewels ,

Thanks for such nice work making this whole bioinformatics world easier. I'm using the web service to download all the reference files required by Sarek.

I'm requesting the GRCh38 version from the GATK source with this command:

aws s3 --no-sign-request --region eu-west-1 sync s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/ ./Homo_sapiens_v2/GATK/GRCh38/

However, together with the correct version, it also downloads the hg19 one 😓

download: s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr12.txt to Homo_sapiens_v2/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr12.txt
download: s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr11.txt to Homo_sapiens_v2/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr11.txt
download: s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr10.txt to Homo_sapiens_v2/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr10.txt
download: s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr1.txt to Homo_sapiens_v2/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr1.txt
download: s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr13.txt to Homo_sapiens_v2/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr13.txt
download: s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr14.txt to Homo_sapiens_v2/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr14.txt
download: s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr15.txt to Homo_sapiens_v2/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr15.txt
download: s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19.zip to Homo_sapiens_v2/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19.zip
download: s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr17.txt to Homo_sapiens_v2/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr17.txt
download: s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr16.txt to Homo_sapiens_v2/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr16.txt
....

I guess it's a little bug. Thanks in advance!
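Whether those hg19 files belong under the GRCh38 prefix is for the maintainers to decide. In the meantime, `aws s3 sync` accepts `--exclude` patterns, so the files can be skipped on the client side. This is a minimal sketch, assuming the pattern `*G1000_alleles_hg19*` covers everything unwanted (it is derived only from the log lines above). `build_filtered_sync` is a hypothetical helper name.

```python
import shlex


def build_filtered_sync(prefix, dest, exclude_patterns):
    """Return an `aws s3 sync` argument list with one --exclude per pattern."""
    cmd = [
        "aws", "s3", "--no-sign-request", "--region", "eu-west-1",
        "sync", prefix, dest,
    ]
    for pattern in exclude_patterns:
        cmd += ["--exclude", pattern]
    return cmd


cmd = build_filtered_sync(
    "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/",
    "./Homo_sapiens_v2/GATK/GRCh38/",
    ["*G1000_alleles_hg19*"],  # assumed pattern, based on the log above
)
print(shlex.join(cmd))
```

`shlex.join` quotes the wildcard pattern, so the printed command can be pasted into a shell without the glob being expanded locally.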
