Hyb-Seq in Galaxy

Description

HybPiper is currently one of the most convenient ways to process hybrid sequencing (Hyb-Seq) data. However, it runs exclusively from the command line, and many of the scientists who would benefit most from it are not familiar with command-line tools and their flags and inputs. This can be an obstacle for researchers working with Hyb-Seq data.

The tools in this repository aim to make HybPiper available in Galaxy, a workflow and pipeline platform that can link different tools together. Galaxy offers a graphical user interface (GUI), so anyone who can navigate a webpage can use it.

With these Galaxy tools, the aim is to make HybPiper and Hyb-Seq data analysis accessible to more people.

The tools can currently be downloaded here and have also been installed on Naturalis' Galaxy development portal. Naturalis employees who are permitted to use the portal can request access from the Naturalis IT department.

Once final tests have been concluded and the ability to scale jobs has been implemented, the tools will be integrated into the main Naturalis Galaxy portal. This is currently not yet the case.

Future plans

Include the generate_heatmap function, already available in HybPiper, in the HybPiper Galaxy tool.
Include the retrieve paralogs functionality, already available in HybPiper, in the HybPiper Galaxy tool.

Documentation

Guides, documentation and tutorials on how to use HybPiper in general are available at the original HybPiper GitHub repository. This README describes the Galaxy tool wrappers themselves. More in-depth documentation can be found in the /tools directory, where each tool directory has its own readme.md file explaining how to install and use that tool. The links in the Tools section below lead to these readme.md files.

Tools

Hybpiper Galaxy Tool Link

A wrapper for the HybPiper pipeline to be used as a single tool in Galaxy. It takes a ZIP file with reads in FASTQ format and a single FASTA file with target sequences and runs HybPiper. The result is another zip file with all of the output of the HybPiper pipeline.

Extract Loci Tool Link

A Galaxy tool designed to take the ZIP archive generated by the HybPiper Galaxy tool above and extract the sequence files (with options for FNA, FAA and intron sequence files) to be used in the creation of multiple sequence alignments (MSAs).

Hybpiper Analysis tool Link

A Galaxy tool designed to take the ZIP archive with folders for each locus generated by the Extract Loci tool above, concatenate the sequence files into one FASTA file per locus, run MUSCLE to generate an alignment in PHYLIP format for each locus, and concatenate the alignments into a supermatrix for further phylogenetic inference.

Inspiration and Credits

HybSeq is a data generation technique that constitutes a hybrid between shotgun high-throughput sequencing and amplicon sequencing. The technique results in markers suitable for phylogenomics (because they are single copy, universally present, and evolve at the optimal rate) by targeted sequencing using predefined bait kits, such as those developed by Kew and others. As such, the sequencing results in multiplexed FASTQ data that need to be both demultiplexed back to the input species and sorted by the various baits. The ideal end result is a set of as many multiple sequence alignments (MSAs) as there were baits, each containing as many rows as there were species, as the input for a multi-marker phylogenetic analysis yielding a robust tree. The artefacts in this repository are developed to achieve this ideal end state. The repository is inspired by the following prior art:

Nikolov et al., 2019 - a seminal paper describing how to resolve the phylogenetic backbone of the Brassicaceae using HybSeq. The paper describes an informal reference pipeline.
brassicaceae-hybseq-pipeline - an implementation of the Nikolov et al. pipeline using snakemake. The implementation is not quite complete but very close, lacking the last few steps of MSA file conversion, concatenation, and phylogenetics.
galaxy-tool-hybseq - a first attempt to port the snakemake pipeline to Galaxy. Many of the tools used by Nikolov et al. are already available in Galaxy's package management system (the 'tool shed') but some are missing. This repo attempts to wrap those missing tools, especially YASRA.
HybPiper - a more current HybSeq pipeline. It is very possible that this is preferable over the Nikolov et al. pipeline. This needs to be investigated.
Naturalis Galaxy Portal, a local install of the Galaxy workflow management system. This is where the wrappers for the different pipeline steps will be deployed eventually. The portal is administered by Dick Groenenberg.

Running the HybPiper Tools

Step-by-step Guide

In this tutorial we will perform a Hyb-Seq analysis in Galaxy with the tools in this GitHub repository. Before you start, make sure you have installed each tool in your Galaxy instance as described in their separate README.md files.

Preparing the data

The dataset and target file used in this tutorial can be downloaded from here. Get the test_targets.fasta file and the test_reads.fastq.tar.gz archive, and unpack the archive manually. Then compress the FASTQ files again into a regular ZIP file, with the FASTQ files placed directly at the top level of the archive.
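The unpack-and-rezip step can also be scripted. The following is a minimal Python sketch, not part of the Galaxy tools: the function name `repack_fastq_tar_as_flat_zip` is hypothetical, and it assumes the reads in the archive end in `.fastq` or `.fq`. It reads each file fully into memory, which is fine for the small tutorial dataset but may not suit large read sets.

```python
import tarfile
import zipfile
from pathlib import PurePosixPath


def repack_fastq_tar_as_flat_zip(tar_path, zip_path):
    """Copy every FASTQ member of a .tar.gz into a ZIP with no subfolders."""
    written = []
    with tarfile.open(tar_path, "r:gz") as tar, \
            zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for member in tar.getmembers():
            # Keep only the file name, dropping any folders from the tar path.
            name = PurePosixPath(member.name).name
            if not member.isfile() or name.startswith("."):
                continue  # skip directories, links and hidden files
            if not name.endswith((".fastq", ".fq")):
                continue  # assumption: reads use one of these extensions
            if name in written:
                raise ValueError(f"duplicate file name after flattening: {name}")
            zf.writestr(name, tar.extractfile(member).read())
            written.append(name)
    return sorted(written)
```

Calling `repack_fastq_tar_as_flat_zip("test_reads.fastq.tar.gz", "readfile.zip")` would return the sorted list of file names written to the ZIP.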

If you follow this tutorial with your own target file and read files, use this same format for your readfile.zip. The ZIP file must not contain any subdirectories or folders, or the Galaxy tools might crash.

For a guide on how to prepare your target file and read files, see the original HybPiper GitHub page over here.
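Before uploading, it can help to check that a ZIP file really is flat. The sketch below is an illustration only; `find_nested_entries` is a hypothetical helper, not something the Galaxy tools provide. It lists every archive entry that is a folder or sits inside one.

```python
import zipfile


def find_nested_entries(zip_path):
    """Return ZIP entries that are directories or live inside a directory."""
    with zipfile.ZipFile(zip_path) as zf:
        # Directory entries end in "/", and nested files contain a separator.
        return [n for n in zf.namelist() if "/" in n or "\\" in n]
```

An empty list means the archive has the flat layout described above; any returned names point to the entries that need to be moved to the top level.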

Step 1

First off we must load in data. To do so, click on the small icon in the top left.

Step 2, 3 and 4

Then upload your ZIP file with the paired FASTQ files and the FASTA file with the targets, either by pressing the 'Choose local file' button or by fetching data from a dedicated repository. In this case, we do the former. Then specify the format of each file manually (auto-detect often has issues with ZIP files) and press start.

Once the files have uploaded, close the upload window. The files should now appear in the history on the right, displayed in green.

Step 5

Now it is time to run the first tool, namely the HybPiper Galaxy tool, which will map the sequences in the FASTQ files to the target sequences in the FASTA file in order to output enriched marker sequences sorted by locus.

Here is a short list of the settings that can be changed in this tool and what they mean:

Read Files:

The ZIP file containing the FASTQ read files. Will be used as input for the tool.

Target file:

The FASTA file containing the targets. Will be used as input for the tool.

Sequence Format:

The type of sequences in the input files. The options are amino acid and DNA sequences.

Mapping Method:

The mapping algorithm used to map the sequences to the targets. The choice is between BWA and DIAMOND, plus a default option that runs HybPiper without specifying a mapping method.

Run Intronerate:

Here the user can specify whether they want to run intronerate to recover the supercontigs with introns.

Generate Heatmap:

This option is non-functional and does not do anything, so can be ignored. It is a remnant of a planned feature that would have been implemented, but wasn't due to time constraints.

Timeout percentage:

With this option the user can specify a number X. During the HybPiper run, jobs taking too long (X percent longer than average) will be killed. This option is for large datasets and can be used if jobs get stuck a lot. Entering a value of '0' (zero) will run HybPiper without a timeout.
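To make the timeout rule concrete, here is a small Python sketch that reads the description above literally: a job is killed once its run time is more than X percent above the average. The function `exceeds_timeout` is hypothetical and illustrates the wording only; how HybPiper itself measures the average and applies the threshold is not specified here and may differ.

```python
def exceeds_timeout(elapsed_seconds, average_seconds, timeout_percent):
    """True if a job ran more than `timeout_percent` percent longer than average."""
    if timeout_percent == 0:
        return False  # a value of 0 disables the timeout
    # Allowed run time: the average plus X percent of the average.
    limit = average_seconds * (1 + timeout_percent / 100)
    return elapsed_seconds > limit
```

Under this reading, with an average job time of 100 seconds and a timeout percentage of 40, a job would be stopped once it runs past roughly 140 seconds.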

Step 6

For our example, our data are DNA sequences, which we will map using BWA, while also running intronerate just for the demonstration. Because this tutorial dataset is small, we won't specify a number for the timeout option.

Once all the settings are the way they are supposed to be for your data/project, click the 'Execute' button. This will then run the HybPiper tool, which will take a little bit to run. The larger your dataset, the longer it will take.

Step 7 and 8

While the tool/job is running, it will have a loading bar and remain an orange color. Once the tool has finished it will change its color to green to indicate this.

You can download the HybPiper output ZIP archive by clicking on the Hybpiper Output variable and pressing the small download button. However, we are going to use two more tools to easily process the enriched marker sequences HybPiper generated.

Step 9

The files that the HybPiper tool generated are each in their own subdirectories. With a few hundred sequences, copying and pasting them manually to a new folder to use in analyses would take a lot of time. In order to automate this process, we have the HybPiper Extract Loci tool.

As with the previous tool, we specify an input ZIP file. Here we select the HybPiper output as the input.

We can specify which type of files generated by HybPiper we want to sort by locus by using the File type option. We can choose between 'FNA', for DNA sequence files; 'FAA' for amino acid sequence files, and 'Intron sequences' for the files generated by intronerate (if it was run in the HybPiper tool steps).

In our case, we want FNA files so we select that and press execute again.
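For readers curious what sorting by locus amounts to, the following Python sketch shows the general idea on an unpacked HybPiper output folder. It is not the tool's actual implementation. The function `collect_by_locus` is hypothetical, and the layout is an assumption: each sample is a top-level folder, and each sequence file is named after its locus (for example `gene1.FNA`).

```python
import shutil
from pathlib import Path


def collect_by_locus(hybpiper_dir, out_dir, extension=".FNA"):
    """Copy every file with `extension` into out_dir/<locus>/, one folder per locus."""
    root = Path(hybpiper_dir)
    out = Path(out_dir)
    copied = {}
    # sorted() lists all matches up front, so newly copied files are never revisited.
    for path in sorted(root.rglob(f"*{extension}")):
        if not path.is_file():
            continue
        sample = path.relative_to(root).parts[0]  # assumed: top-level folder = sample
        locus = path.stem                         # assumed: file name = locus name
        target_dir = out / locus
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / f"{sample}_{locus}{extension}"
        shutil.copyfile(path, target)
        copied.setdefault(locus, []).append(target.name)
    return copied
```

The returned dictionary maps each locus to the files collected for it, which mirrors the per-locus grouping the Extract Loci tool produces.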

Step 10

Once the Extract Loci tool has finished, we have one folder for each locus, containing the files of our chosen type for that locus. In this case, these are the .FNA files.

Step 11

Next up is one more tool, the HybPiper Phylo Analysis tool. This tool concatenates the files we got from the Extract Loci tool and runs MUSCLE to create alignment files in PHYLIP format. Finally, it merges these alignments into a supermatrix.

To use the tool, specify the output of the Extract loci tool as the input of this one and specify which sequence type we have, which is the same as in the previous tools.

Then press execute and let the tool run.
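The supermatrix step can be illustrated with a short Python sketch. This is a simplified stand-in, not the tool's own code: `build_supermatrix` and `to_phylip` are hypothetical names, the alignments are assumed to be already computed (MUSCLE is not called here), missing taxa are padded with gaps, and the partition lines use a RAxML-style `DNA, name = start-end` format that may differ from the partition.txt the tool writes.

```python
def build_supermatrix(alignments):
    """Concatenate per-locus alignments into one supermatrix.

    `alignments` maps locus name -> {taxon: aligned sequence}.
    Taxa missing from a locus are padded with gaps ("-").
    Returns (supermatrix dict, list of partition lines).
    """
    taxa = sorted({t for aln in alignments.values() for t in aln})
    matrix = {t: [] for t in taxa}
    partitions = []
    start = 1
    for locus in sorted(alignments):
        aln = alignments[locus]
        lengths = {len(s) for s in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{locus}: sequences are not aligned to one length")
        length = lengths.pop()
        for t in taxa:
            matrix[t].append(aln.get(t, "-" * length))
        # Record which supermatrix columns (1-based, inclusive) hold this locus.
        partitions.append(f"DNA, {locus} = {start}-{start + length - 1}")
        start += length
    return {t: "".join(parts) for t, parts in matrix.items()}, partitions


def to_phylip(matrix):
    """Format a non-empty {taxon: sequence} dict as relaxed sequential PHYLIP text."""
    n_sites = len(next(iter(matrix.values())))
    lines = [f"{len(matrix)} {n_sites}"]
    lines += [f"{taxon} {seq}" for taxon, seq in matrix.items()]
    return "\n".join(lines) + "\n"
```

Padding missing taxa with gaps keeps every row the same length, which is what allows loci with incomplete sampling to be combined into one matrix.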

Step 12 and 13

The result is a ZIP file with alignment files, one large supermatrix file in PHYLIP format, and a partition.txt file with the partition information for that supermatrix.

These results can then be used to create a phylogenetic tree using tools like RAxML or phylogeny.org, which can then be visualized using a tool like FigTree.

That concludes this tutorial. We hope it serves you well!
