ewels / aws-igenomes

Documentation and description of AWS iGenomes S3 resource.

Home Page: https://ewels.github.io/AWS-iGenomes/

License: MIT License

Groovy 24.06% Shell 30.47% HTML 45.47%

aws-igenomes's Introduction

Hi! I'm @ewels (Phil Ewels) 👋

I'm a bioinformatician from the UK 🇬🇧 living in Sweden 🇸🇪

I've worked in a range of different positions and am currently product manager for open-source software at Seqera.

I created MultiQC, co-founded nf-core and am heavily involved with Nextflow. I also have a host of small side-projects, such as SRA-explorer, AWS-iGenomes and more general-use tools like rich-click and rich-codex.

I'm passionate about making software development best practices available to scientists, unlocking high-throughput reproducible research for those working at the forefront of their fields.

aws-igenomes's Issues

Add Canfam4 genome?

Hello,

Would it be possible to update CanFam to v4 of the genome? v3.1 is very old, and v4 is becoming the new standard for analysis.

Thank you,
Keyur

What are the versions of genome files?

Hello AWS-iGenomes,
Thanks for making this database.
I'm using the nf-core ATAC-seq pipeline. Its introduction page says that the pipeline uses the files from this database (https://ewels.github.io/AWS-iGenomes/) as reference genomes for alignment.
I now need to use the BAM files for downstream RGT-HINT analysis. The RGT-HINT package uses the GTF files from GENCODE vM25 (mouse) and GENCODE v21 (human). I believe these do not match the versions in AWS-iGenomes, because I keep receiving error messages saying that the gene coordinates do not match.

# I believe the AWS-iGenomes annotations are not GENCODE vM25 (mouse) or GENCODE v21 (human), because the nf-core output file says:
The contents of the annotation directories were downloaded from UCSC on: July 17, 2015.
SmallRNA annotation files were downloaded from miRBase release 21.

# I keep receiving error messages saying that the gene coordinates do not match.
Report: The scikit HMM encountered errors when applied. in region (10,52417320,52418086). This iteration will be skipped.

The contents of the annotation directories were downloaded from UCSC on: July 17, 2015. Could you please tell me which GENCODE releases that download corresponds to, for both human (https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/) and mouse (https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/)?

Thanks!
Best,
Yuanjian

Bowtie index for alternative haplotype in hg19

Hi,
I have a question about the haplotype information in hg19. Are both the H1 and H2 haplotypes of the MAPT locus (chr17q21.31 region) included in the iGenomes hg19 build? If not, how can I create a Bowtie index for the H2 haplotype?
I appreciate your help.
Jenny

Add ENCODE blacklists

Add ENCODE blacklists for assemblies where available:
https://sites.google.com/site/anshulkundaje/projects/blacklists

These were originally generated relative to UCSC genome builds. The files can be added to a new folder called Annotation/Mappability/. Alternatively, they could be added to the existing Sequence/AbundantSequences/, but the files don't contain any actual sequence, so that is probably not appropriate.

GRCh38
- https://raw.githubusercontent.com/nf-core/atacseq/master/assets/blacklists/hg38-blacklist.bed
```
<IGENOME_BASE>/Homo_sapiens/NCBI/GRCh38/Annotation/Mappability/
```
- May need to rename this to GRCh38-blacklist.bed
GRCh37
- https://raw.githubusercontent.com/nf-core/atacseq/dev/assets/blacklists/GRCh37-blacklist.bed
```
<IGENOME_BASE>/Homo_sapiens/Ensembl/GRCh37/Annotation/Mappability/
```
GRCm38
- https://raw.githubusercontent.com/nf-core/atacseq/dev/assets/blacklists/GRCm38-blacklist.bed
```
<IGENOME_BASE>/Mus_musculus/Ensembl/GRCm38/Annotation/Mappability/
```
hg38
- https://raw.githubusercontent.com/nf-core/atacseq/master/assets/blacklists/hg38-blacklist.bed
```
<IGENOME_BASE>/Homo_sapiens/UCSC/hg38/Annotation/Mappability/
```
hg19
- https://raw.githubusercontent.com/nf-core/atacseq/dev/assets/blacklists/hg19-blacklist.bed
```
<IGENOME_BASE>/Homo_sapiens/UCSC/hg19/Annotation/Mappability/
```
mm10
- https://raw.githubusercontent.com/nf-core/atacseq/dev/assets/blacklists/mm10-blacklist.bed
```
<IGENOME_BASE>/Mus_musculus/UCSC/mm10/Annotation/Mappability/
```
mm9
- http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/mm9-mouse/mm9-blacklist.bed.gz
```
<IGENOME_BASE>/Mus_musculus/UCSC/mm9/Annotation/Mappability/
```
ce10
- http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/ce10-C.elegans/ce10-blacklist.bed.gz
```
<IGENOME_BASE>/Caenorhabditis_elegans/UCSC/ce10/Annotation/Mappability/
```
dm3
- http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/dm3-D.melanogaster/dm3-blacklist.bed.gz
```
<IGENOME_BASE>/Drosophila_melanogaster/UCSC/dm3/Annotation/Mappability/
```

Depending on the source, the files will have to be decompressed first (the mitra.stanford.edu links above are gzipped).
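
The placement step described above could look roughly like the following. This is only a sketch: `install_blacklist` is a hypothetical helper name, and it assumes the blacklist has already been downloaded to a local path and that the destination is one of the `Annotation/Mappability/` folders proposed in the list.

```shell
# Hypothetical helper: copy a blacklist into an assembly's Mappability folder,
# decompressing it on the way when the source file is gzipped.
install_blacklist() {
    src="$1"        # local path to a downloaded .bed or .bed.gz file
    dest_dir="$2"   # e.g. "$IGENOME_BASE/Mus_musculus/UCSC/mm9/Annotation/Mappability"
    mkdir -p "$dest_dir"
    case "$src" in
        # Gzipped source: write the decompressed BED under the name without ".gz"
        *.gz) gunzip -c "$src" > "$dest_dir/$(basename "$src" .gz)" ;;
        # Plain BED: copy as-is
        *)    cp "$src" "$dest_dir/" ;;
    esac
}
```

For example, `install_blacklist mm9-blacklist.bed.gz "$IGENOME_BASE/Mus_musculus/UCSC/mm9/Annotation/Mappability"` would leave an uncompressed `mm9-blacklist.bed` in that folder.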

These files can be used with the liftOver tool to make them available for other assemblies of the same organism, or converted between Ensembl and UCSC by changing the chromosome identifiers (I have already done this for the latest human and mouse Ensembl genomes). However, for now I think this would be a good start!
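
As a rough illustration of the identifier change mentioned above, the sketch below rewrites UCSC-style names in a BED file to Ensembl-style names. The function name `ucsc_to_ensembl_bed` is made up for this example, and the mapping is an assumption that only covers the primary chromosomes (`chr1` to `1`, `chrM` to `MT`). Unplaced scaffolds and alternate contigs use different accessions in Ensembl and would need a proper lookup table instead.

```shell
# Hypothetical filter: read a UCSC-named BED on stdin, write Ensembl-style names.
# Assumption: only primary chromosomes are present; scaffolds are NOT handled correctly.
ucsc_to_ensembl_bed() {
    awk 'BEGIN { FS = OFS = "\t" }
         # The mitochondrial chromosome is named differently, not just unprefixed
         $1 == "chrM" { $1 = "MT"; print; next }
         # Everything else: drop the leading "chr"
         { sub(/^chr/, "", $1); print }'
}
```

Usage would be along the lines of `ucsc_to_ensembl_bed < hg19-blacklist.bed > GRCh37-blacklist.bed`.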

Pre-generated STAR indices are incompatible with newer versions of STAR

It appears STAR has made a backward-incompatible change in recent releases: the existing indices in AWS-iGenomes only work for STAR <=2.7.1a. Later versions need new indices generated with a more recent version of STAR.

Error when running STAR v2.7.2b is:

EXITING because of FATAL ERROR: Genome version: 20201 is INCOMPATIBLE with running STAR version: 2.7.2b
SOLUTION: please re-generate genome from scratch with running version of STAR, or with version: 2.7.1a
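
A pipeline could guard against this before alignment. The sketch below is one possible approach, not something the project ships: both function names are hypothetical, the version comparison assumes a `sort` that supports `-V` (GNU coreutils), and the 2.7.1a cut-off is taken from the report above. The thread count is an arbitrary placeholder.

```shell
# Succeeds when the given STAR version is <= 2.7.1a, i.e. when the
# pre-built AWS-iGenomes indices are expected to load (per the report above).
star_index_compatible() {
    newest=$(printf '%s\n' "$1" "2.7.1a" | sort -V | tail -n 1)
    [ "$newest" = "2.7.1a" ]
}

# Rebuild an index with whichever STAR is on PATH.
# Arguments: genome FASTA, annotation GTF, output directory.
regenerate_star_index() {
    fasta="$1"; gtf="$2"; out_dir="$3"
    mkdir -p "$out_dir"
    STAR --runMode genomeGenerate \
         --genomeDir "$out_dir" \
         --genomeFastaFiles "$fasta" \
         --sjdbGTFfile "$gtf" \
         --runThreadN 4   # placeholder; adjust to the machine
}
```

A caller might run `star_index_compatible "$(STAR --version)" || regenerate_star_index genome.fa genes.gtf ./STARIndex`, where the file names are placeholders.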

Variation vcf missing from GRCh38

As the title says, it looks like NCBI GRCh38 does not contain a VCF. Is that intentional?

Manifest contains spaces in paths

Hi Phil,

Thanks for making this available!

ngi-igenomes_file_manifest.txt contains stray spaces right after the bucket name on 71 lines (mostly for STAR indices). The path without the space is clearly the correct one.

Andreas
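
Until the manifest is regenerated, a local copy could be cleaned up with a filter like the one below. This is a guess at the shape of the problem: it assumes the affected lines contain `s3://ngi-igenomes` with spaces on either side of the slash that follows the bucket name, which the report does not spell out. `fix_manifest_spaces` is a hypothetical name.

```shell
# Hypothetical filter: remove spaces around the slash that follows the bucket name.
# Lines that are already correct pass through unchanged.
fix_manifest_spaces() {
    sed -E 's#(s3://ngi-igenomes) *(/) *#\1\2#'
}
```

For example: `fix_manifest_spaces < ngi-igenomes_file_manifest.txt > manifest.fixed.txt`.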

Web API to get filenames

The filename lookup functionality could be moved into a web API.

Salmo_salar (Atlantic Salmon) ref?

Hello

Could you please add Atlantic salmon (Salmo salar) to the references so that I can easily use it via the AWS-iGenomes references in nf-core? It is available on Ensembl: https://www.ensembl.org/Salmo_salar/Info/Index
(ICSASG_v2, GCA_000233375.4)

Thanks
Marwa

access denied

I am getting "access denied" both when running locally and on AWS. I have AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY exported in my environment. Any idea why this might not be working?

aws s3 --region eu-west-1 sync s3://ngi-igenomes/igenomes/Homo_sapiens/NCBI/build37.2/Sequence/WholeGenomeFasta/ .
fatal error: An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied
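
One thing worth trying here is an unsigned request. The sync command quoted in the "Web service is not downloading correct reference" report on this page passes `--no-sign-request`, which tells the AWS CLI not to attach the exported credentials at all. Whether the credentials are the actual cause of this particular error is an assumption. The wrapper name `sync_public` is hypothetical.

```shell
# Hypothetical wrapper: sync from the bucket without signing the request,
# so locally exported AWS credentials are not sent.
sync_public() {
    aws s3 --no-sign-request --region eu-west-1 sync "$1" "$2"
}
```

The failing command above would then become `sync_public s3://ngi-igenomes/igenomes/Homo_sapiens/NCBI/build37.2/Sequence/WholeGenomeFasta/ .`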

Feature request: new rat reference mRatBN7.2

Any interest in adding the most recent rat reference?

https://rgd.mcw.edu/wg/news/new-rat-genome/

Happy to contribute where I can, although I've never added to iGenomes before.

GENCODE release and assembly info

Is there a way to tell which release of GENCODE (https://www.gencodegenes.org/human/) the files were created with? There are many releases tied to GRCh38 at this point.

For example the STAR files:

Should that be a dropdown in the sync command builder?
https://ewels.github.io/AWS-iGenomes/

Given that the last update of this repo was 4 years ago, I'm worried that these files are 4+ years out of date.

The future of AWS-iGenomes

I'm currently planning to include the AWS-iGenomes script in an RNA-seq pipeline I'm working on. However, I read that the dataset is only hosted until 30 January 2021, with the hope that the agreement will be renewed. Is there an alternative plan to host the dataset and keep the script working if AWS doesn't renew the current agreement?

This matters for deciding whether to include the script in my pipeline, and I imagine the question is relevant to other users as well.

Add GRCh38 to nextflow.config

Best to use a version that doesn't contain ALT chromosomes.

Bulk download for a species

Hi Phil,

First of all, thank you for setting up this super practical resource. My question comes from the fact that any given pipeline or analysis needs a number of references for a species. For instance, nf-core/rnaseq uses, iirc, the genome sequence, STAR, RSEM and salmon indexes, a GTF, etc.

Since downloading each one individually is quite tedious, is it possible to get the full dataset for a particular species, say for human? For example, the command bash aws-igenomes.sh -g Homo_sapiens -s Ensembl -b GRCh37 could download all indexes for Ensembl/GRCh37 instead of asking for -t.

I guess another alternative would be for me to download from iGenomes directly and produce the missing indexes :)

Cheers,
António
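
In the meantime, syncing the whole build prefix with the AWS CLI directly may cover this use case. The sketch below follows the bucket layout and flags seen in the sync commands quoted elsewhere on this page (`s3://ngi-igenomes/igenomes/<genome>/<source>/<build>/`). The function name `sync_build` and the local `./references/` destination are made up for illustration.

```shell
# Hypothetical helper: fetch everything under one genome/source/build prefix.
# Arguments: genome (e.g. Homo_sapiens), source (e.g. Ensembl), build (e.g. GRCh37).
sync_build() {
    genome="$1"; source_name="$2"; build="$3"
    aws s3 --no-sign-request --region eu-west-1 sync \
        "s3://ngi-igenomes/igenomes/$genome/$source_name/$build/" \
        "./references/$genome/$source_name/$build/"
}
```

`sync_build Homo_sapiens Ensembl GRCh37` would then mirror every annotation, sequence and index folder for that build. Adding `--dryrun` to the `aws s3 sync` call first is a cheap way to see what would be transferred.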

Beagle reference set

Hello,

I am planning to use beagle5 in one of my Nextflow pipelines (for allele-specific copy number estimation of somatic samples). It would be great if a set of references and a plink genetic map set were included in iGenomes. Otherwise, each user has to download and create those reference sets, which is a bit of a hassle since it is done per chromosome. Those sets are pretty much standard and open-access, so I guess there won't be an issue adding them here?

Some time ago, I saw that someone from the nf-core community was also using beagle5 for imputation analysis, but I guess the pipeline was never published.

Here are the sources:

Web service is not downloading correct reference

Hi @ewels ,

Thanks for all the nice work making the bioinformatics world easier. I'm using the web service to download all the reference files required by Sarek.

I'm requesting the GRCh38 version from GATK source with this command:

aws s3 --no-sign-request --region eu-west-1 sync s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/ ./Homo_sapiens_v2/GATK/GRCh38/

However, together with the correct version, it also downloads hg19 files 😓

download: s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr12.txt to Homo_sapiens_v2/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr12.txt
download: s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr11.txt to Homo_sapiens_v2/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr11.txt
download: s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr10.txt to Homo_sapiens_v2/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr10.txt
download: s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr1.txt to Homo_sapiens_v2/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr1.txt
download: s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr13.txt to Homo_sapiens_v2/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr13.txt
download: s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr14.txt to Homo_sapiens_v2/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr14.txt
download: s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr15.txt to Homo_sapiens_v2/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr15.txt
download: s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19.zip to Homo_sapiens_v2/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19.zip
download: s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr17.txt to Homo_sapiens_v2/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr17.txt
download: s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr16.txt to Homo_sapiens_v2/GATK/GRCh38/Annotation/ASCAT/G1000_alleles_hg19/G1000_alleles_hg19_chr16.txt
....

I guess it's a little bug. Thanks in advance!
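
The log above shows the hg19 allele files sitting inside the GRCh38 prefix itself, so a plain sync of that prefix will always pick them up. Whether they are there by mistake or on purpose is not clear from this report. As a workaround sketch, `aws s3 sync` accepts an `--exclude` pattern. The wrapper name `sync_without_hg19` is hypothetical, and the `*hg19*` pattern assumes every unwanted key contains that string.

```shell
# Hypothetical wrapper: sync a prefix but skip any key whose path contains "hg19".
sync_without_hg19() {
    aws s3 --no-sign-request --region eu-west-1 sync "$1" "$2" --exclude '*hg19*'
}
```

Applied to the command above: `sync_without_hg19 s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/ ./Homo_sapiens_v2/GATK/GRCh38/`.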