zavolanlab / htsinfer Goto Github PK

View Code? Open in Web Editor NEW

9.0 6.0 22.0 6.88 MB

Infer metadata for your downstream analysis straight from your RNA-seq data

License: Apache License 2.0

Python 99.35% Dockerfile 0.65%

bioinformatics ngs high-throughput-sequencing inference hacktoberfest

htsinfer's Introduction

HTSinfer

HTSinfer infers metadata from Illumina high-throughput sequencing (HTS) data.

Examples

Single-ended library*

htsinfer tests/files/adapter_single.fastq

Paired-ended library*

htsinfer tests/files/adapter_1.fastq tests/files/adapter_2.fastq

Output is written to STDOUT in JSON format. The log is written to STDERR.

Example output

This is the output (STDOUT) of the above-mentioned call on a paired-ended example library:

{
   "library_stats": {
      "file_1": {
         "read_length": {
            "min": 75,
            "max": 75,
            "mean": 75.0,
            "median": 75,
            "mode": 75
         }
      },
      "file_2": {
         "read_length": {
            "min": 75,
            "max": 75,
            "mean": 75.0,
            "median": 75,
            "mode": 75
         }
      }
   },
   "library_source": {
      "file_1": {
         "short_name": "hsapiens",
         "taxon_id": "9606"
      },
      "file_2": {
         "short_name": "hsapiens",
         "taxon_id": "9606"
      }
   },
   "library_type": {
      "file_1": "first_mate",
      "file_2": "second_mate",
      "relationship": "split_mates"
   },
   "read_orientation": {
      "file_1": "SF",
      "file_2": "SR",
      "relationship": "ISF"
   },
   "read_layout": {
      "file_1": {
         "adapt_3": "AATGATACGGCGACC",
         "polyA_frac": 10.0
      },
      "file_2": {
         "adapt_3": "AATGATACGGCGACC",
         "polyA_frac": 10.0
      }
   }
}

To better understand the output, please refer to the Results model in the API documentation. Note that Results model has several nested child models, such as enumerators of possible outcomes. Simply follow the references in each parent model for detailed descriptions of each child model's attributes.

General usage

htsinfer [--output-directory PATH]
         [--temporary-directory PATH]
         [--cleanup-regime {DEFAULT,KEEP_ALL,KEEP_NONE,KEEP_RESULTS}]
         [--records INT]
         [--threads INT]
         [--transcripts FASTA]
         [--read-layout-adapters PATH]
         [--read-layout-min-match-percentage FLOAT]
         [--read-layout-min-frequency-ratio FLOAT]
         [--library-source-min-match-percentage FLOAT]
         [--library-source-min-frequency-ratio FLOAT]
         [--library-type-max-distance INT]
         [--library-type-mates-cutoff FLOAT]
         [--read-orientation-min-mapped-reads INT]
         [--read-orientation-min-fraction FLOAT]
         [--tax-id INT]
         [--verbosity {DEBUG,INFO,WARN,ERROR,CRITICAL}]
         [-h] [--version]
         PATH [PATH]

Installation

In order to use the HTSinfer, clone the repository and install the dependencies via Conda:

git clone https://github.com/zavolanlab/htsinfer
cd htsinfer
conda env create --file environment.yml
# Alternatively, to install with development dependencies,
# run the following instead
conda env create --file environment-dev.yml

Note that creating the environment takes non-trivial time and it is strongly recommended that you install Mamba and replace conda with mamba in the previous command.

Then, activate the htsinfer Conda environment with:

conda activate htsinfer

If you have installed the development/testing dependencies, you may first want to verify that HTSinfer was installed correctly by executing the tests shipped with the package:

python -m pytest

Otherwise just go ahead and try one of the examples.

API documentation

Auto-built API documentation is hosted on ReadTheDocs.

Contributing

This project lives off your contributions, be it in the form of bug reports, feature requests, discussions, or fixes and other code changes. Please refer to the contributing guidelines if you are interested to contribute. Please mind the code of conduct for all interactions with the community.

Contact

For questions or suggestions regarding the code, please use the issue tracker. For any other inquiries, please contact us by email: [email protected]

htsinfer's People

Contributors

Stargazers

Watchers

htsinfer's Issues

Implement unit tests for read orientation inference

Is your feature request related to a problem? Please describe.

Currently, comprehensive unit tests are missing for module htsinfer.get_read_orientation.

Describe the solution you'd like

Implement unit tests for the indicated module. The goal is not only to achieve 100% test coverage, i.e., run each line of code in a given test run, but to test the intended/expected behaviors of each individual method/function separately and comprehensively. Available unit tests for other modules (e.g., htsinfer.py, get_read_layout) can serve as an example.

Describe alternatives you've considered

N/A

Additional context

Note that in order to pass CI checks, coverage checks for the module have been excluded in setup.cfg. Please remove the module from the corresponding config section before filing the pull request.

Add optional CLI parameter for organism to main app

In order to select the transcripts that the reads should be mapped to in later steps (see #26 and #27) of the read orientation inference process, the organism that the sample was derived from needs to be known. As we currently do not have a functionality to infer the organism from the data (it has been started and is almost ready, but has not been merged), we need to be able to pass that information when we are executing the read orientation inference (#23).

Add a command-line interface (CLI) parameter --organism that either takes an organism short name (a string) or a taxon ID (an integer). Have a look at data/transcripts.fasta to see how these look like (4th and 5th fields of identifier lines, respectively). Ensure that passing --organism when calling htsinfer.py is optional. Validate that the input is either of type str or of type int. Don't forget to document the new CLI parameter in the repo's README.md.

Handling exceptions to reduce duplicated code

# helper classes
class RaiseFileProblem:
    """Raise ``FileProblem`` ."""

I have recently found a way to write a generic exception raiser class. You can just pass it the exception that you want to raise. Reduces some duplicate code. Check here: https://github.com/zavolanlab/zarp-cli/blob/4b93643b8650656b18b9301e91413469bff23e6f/tests/utils.py#L6-L17

Maybe you want to add that and replace the usage of RaiseFileProblem, RaiseMetadataWarning, RaiseOSError and RaiseValueError? :)

Call read orientation module from main app

When building the package, an executable htsinfer is created, which will call the app's main entry point htsinfer/htsinfer.py, which coordinates calling the different inference functionalities (single/paired, read layout, read orientation).

Add the necessary code to the htsinfer.py module's main() function that will call the read orientation inference functionality in the htsinfer/infer_read_orientation.py module and accepts its results. Note that you need to pass an organism as well (see #24). Raise an error if the user has not provided an argument to the --organism parameter.

Improve library source inference

Consistently update set of reference transcripts for library source inference and include more genes for better accuracy
Include organisms/sources from all clades in Ensembl: Bacteria, Plants, Fungi, Metazoa etc.

Infer read layout generically

Is your feature request related to a problem? Please describe.

Currently, only 3' adapters are inferred and these are identified via scanning for the presence of a list of known adapters. Devise and implement a solution that generically infers complex read layout patterns, including 5' adapters, UMIs and other linkers, other fixed length/sequence adapters etc.

Describe the solution you'd like

TBD

Describe alternatives you've considered

TBD

Additional context

N/A

Logic for confidence of infer_organism

Accepting two new parameters from user, otherwise default:

Min-Match % : minimum match percentage that top organism needs to have among all.
Factor : by how much factor top organism first organism is greater than the second.

If both conditions meet, then we do have a organism with confidence else we are not sure.
Still, if user needs to view the whole count percentage info data for all organisms, for that there will be count_organism_info file for every run.

Passing more than two input files to the CLI prints the stack trace

Describe the bug
Passing no input file to the CLI prints the usage but no strack trace. Inconsistent with this standard behavior, passing more than two input files raises an uncaught ValueError including the stack trace.

To Reproduce
Install HTSinfer as described, then run, e.g.,:

htsinfer file_1 file_2 file_3

Expected behavior
Instead of the stack trace, the following error message should be printed below the usage string:

<USAGE>

htsinfer: error: no more than two of the following arguments are allowed: PATH

Compare the behavior for passing no positional arguments:

htsinfer

which prints the following to the screen:

usage: htsinfer [--output-directory PATH] [--temporary-directory PATH]
                [--cleanup-regime {DEFAULT,KEEP_ALL,KEEP_NONE,KEEP_RESULTS}]
                [--records INT ] [--verbosity {DEBUG,INFO,WARN,ERROR,CRITICAL}]
                [-h] [--version]
                FASTQ_PATH [FASTQ_PATH]
        
htsinfer: error: the following arguments are required: FASTQ_PATH

While fixing this bug, FASTQ_PATH should be renamed PATH.

Docker image for the htsinfer project

Is your feature request related to a problem? Please describe.
This Issue stems from the fact that conda on macOS(Monterey and possibly higher) with arm64 chip is not able to create htsinfer environment due to the following error:

Solving environment: failed

ResolvePackageNotFound: 
  - kallisto[version='>=0.46.1']
  - star[version='>=2.7.6']
  - pysam[version='>=0.16.0']

Describe the solution you'd like
I couldn't find any user-friendly fix or workaround, and since I sometimes develop on a mac I decided to create a docker image.

Describe alternatives you've considered
NA

Additional context
The image can be downloaded by docker pull edirexbot/htsinfer:dev and executed via:

docker run --platform linux/amd64 \
      -v $HOME/htsinfer/tests/files/:/tmp \
      edirexbot/htsinfer:dev htsinfer /tmp/adapter_single.fastq

--platform switch is there specifically for the arm64 architecture of the mac M1 chip and is not necessary. The volume is mounted on the /tmp directory in the container in this case, but similarly the command:

docker run edirexbot/htsinfer:dev htsinfer tests/files/adapter_single.fastq

could have been executed instead, as the entire htsinfer repo is inside the container in the /htsinfer directory.

The image is built by creating the conda environment inside the image. This is just a temporary solution that was easy to implement. If the container version of htsinfer is something that is desired, I could install the dependencies without the conda and make the image a bit more lightweight and also include the docker image build in the CI/CD pipeline when the dev branch is changed.

Prepare transcript index for mapping

In order to efficiently map reads to reference sequences, mapping programs require an index of the reference sequences (here: transcripts of genes coding for highly expressed ribosomal proteins, see #25 for more details).

Write a function that prepares an index of the subset of transcripts prepared in #25 with the STAR program. Accept the path to the transcripts file as an input and return the path to the built index. Use a subprocess to call the index generation. Write the resulting index to a directory path in the temporary directory created in #33. Make sure you have installed STAR v2.7.6a first (via Conda) and add it to the environment.yml file (if it is not already present from #27). Insert a call to your new function in the main function infer() in the htsinfer/infer_read_orientation.py module. Assign the result (i.e., the index directory path) to a variable that can be used subsequently to map the reads (#27).

Compute histograms of inter-motif distances

Given a set of motifs and a set of sequences, compute histograms of inter-motif distances for all pairs of motifs and return pairs for which there is a significantly preferred distance. That is, loop across all sequences and record the distance between that start positions of the two motifs in every sequence. Note the symmetry, namely that distances between motif1 and motif2 will be the negatives of distances between motif2 and motif1. To not deal get twice the same underlying configuration, we can keep track only of non-negative distances. If there really is a preferred distance, this distance should occur more often than all others. We will have to decide on a cutoff. Here we could use the statistics we compute for generating the randomized sequences to calculate how often we expect specific motifs at specific distances. But maybe a simpler rule of thumb would work well enough, e.g. for each motif, the most frequent distance, as long as it's x fold higher than the average or than the second most common distance.

Refactor argument passing to class instances

Is your feature request related to a problem? Please describe.

Currently, arguments are passed to class constructors individually. This has several disadvantages:

Arguments need to be described in multiple locations (e.g., CLI interface, HtsIinfer class, classes for each inference functionality)
Adding new arguments to classes requires code changes in multiple modules
Constructor interfaces become unwieldy; pylint messages that warn against unwieldy interfaces were globally disabled (specifically: R0902: too-many-instance-attributes and R0903: too-many-arguments)

Describe the solution you'd like

Define a (Pydantic) configuration model Args with all command-line arguments in htsinfer/models.py; add appropriate defaults
Define a (Pydantic) configuration model Config in htsinfer/models.py with properties args of type htsinfer.models.Args and results of type htsinfer.models.Results
Instantiate an instance of the model in the HTSinfer class constructor - which will then take only two arguments, args and results, both optional, with defaults being specified in models.py
Pass arguments in bulk to Args() in HtsInfer instantiation in htsinfer/cli.py
Redefine constructors of classes in all other modules (e.g., GetLibSource, GetReadOrientation) such that they only take a single argument, config of type htsinfer.models.Config; collapse separate classes for handling two or only one input file into one, where applicable (newer functionalities use only a single class) as there is no solid reason to keep these separate, again causing repetition of descriptions and sometimes code
Refactor calls to instantiate the modified classes in htsinfer/htsinfer.py so that only a single argument config is passed

Describe alternatives you've considered

N/A

Additional context

Following the suggested solution still requires arguments to be described in two places: the model definition in [htsinfer/models.py]
https://github.com/zavolanlab/htsinfer/blob/dev/htsinfer/models.py) and the CLI definition in htsinfer/cli.py. An ideal solution would re-use the definitions from model.py in model for argparse help messages in cli.py. However, this is likely not trivial and should probably constitute a separate issue.

Extend motifs based on nucleotide representation in sequenced reads

Given a set of core motifs, check whether they represent parts of longer motifs in sequenced reads. Extend left and right if the nucleotide composition in these positions is very biased. That is, starting from all occurrences of a core motif in the reads, determine the frequencies of A,C,G,T at the left- and right-flanking positions, and calculate the entropy of this distribution. Ideally, the motif can be extended if the flanking positions contain one specific nucleotide, i.e. entropy is 0. But sequencing errors could introduce some noise. So it may be necessary to introduce a cutoff. Continue the extension greedily until the entropy becomes too high.

Add missing unit tests for single/paired inference

While tests in tests/test_infer_single_paired.py cover every line of code in the corresponding htsinfer/infer_single_paired.py module, they do not include unit tests for any function except the module's entry point infer().

Add test cases to test each function/code unit independently.

Infer read layout: Single vs Paired

Trying to infer single vs paired (or rather which file contains which mate):

First Things before to check for:

Do read IDs adhere to Illumina format as per the Wikipedia FASTQ format.

Following can be the possible cases while taking input from the User :

If user provides 1 file & @SEQ_IDs have /1 identifier tag:
Possible outcomes:
- A single-end library.
- Library containing the first mate of a paired-end library ( and the file with the second mate is missing ).
If user provides 1 file & @SEQ_IDs have /2 identifier tag:
Possible outcome:
- Library containing the second mate of a paired-end library ( and the file with the first mate is missing ).
If user provides 1 file & @SEQ_IDs have both /1 or /2 :
Possible outcomes:
- A mixed paired-end library.
If user provides 2 files and @SEQ_IDs have /1 in one file and /2 in the other:
Possible outcomes:
- A split paired-end library.
  CHECK : Do Read IDs match between the two files.

Randomize nucleotide sequences

Given a set of input sequences, generate randomized variants using a markov process. The order of the markov chain to be considered is a parameter. More concretely, we want to generate shuffled variants of the initial sequences that are similar in some way, but are not constructed to contain specific adaptor/barcode motifs. The simplest way to generate such variant sequences is to shuffle (permute the letters of) the input sequences. However, mammalian genes/genomes have some very distinct nucleotide composition biases such as being relatively depleted in CG dinucleotides (relative to the expected frequency of the dinucleotide given the frequencies of C's and G's), which, if ignored in the shuffling, will lead to the misestimation of frequencies of these motifs relative to the genome-average. To better preserve genome composition statistics in the shuffling, we would have to enforce that dinucleotide frequencies of the input sequences are maintained. An approximation that is simpler to implement is to maintain dinucleotide frequencies not strictly, but on average, which we can do by computing the frequencies of observing A,C,G,T following each of the A,C,G,T nucleotides and then generating as many and as long random sequences as the initial sequences, extending a sequence base by base, and choosing the next base given the current base and the probabilities we just mentioned.

Another parameter could be the number of sets of randomizations to generate.

Count occurrences of adaptors/contaminants

Given a set of known adaptors/contaminants and an input fastq file, count the number of matches to potential adaptors/contaminants. Similar to issue #19, but instead of making position-dependent profiles, the number of matches and proportion of matched reads should be determined. Thus, the same inner function can be used for both to identify motif matches within a read. Then the number of motifs per read can be tabulated, where partial occurrences would be counted with a weight which is the fraction of the motif that was matched to the read.

Decide read orientation

Use the read alignment configuration counts (from #28) to decide whether the library is

stranded, reads in forward configuration
stranded, reads in reverse-complemented configuration
unstranded

Write a function that takes as input the counts tuple from #28, implements the logic to decide the read orientation and returns a Salmon LIBTYPE string describing the read orientation. For now exclude the first part (inward, matching, outward) that only applies to paired-end samples, so the only valid outcomes are SF, SR and U. Insert a call to your new function in the main function infer() in the htsinfer/infer_read_orientation.py module. Assign the result to a variable that can be returned to the calling function in the main() function in htsinfer/htsinfer.py after the temporary data directory was deleted (#33).

Design testing strategy

Is your feature request related to a problem? Please describe.

Before publishing, we need to see how the tool performs against libraries for which the metadata is known. A collection for libraries for which (some of) the metadata is already mined is here. But as this list will be expanding, we need to make sure that we have a testing strategy in place to easily run these tests.

Describe the solution you'd like
The testing strategy should include

Fetching of the data (this Snakemake workflow can be used)
Arranging of the data in some defined directory structure
Convert SRA to FASTQ (fastqdump from SRA toolkit, there may be an alternative fasterqdump)
Running HTSinfer over all files in the directory
Compilation of the results in a matrix format
Comparing the results with the mined data, as collected in the metadata sheet linked above
Analysis of comparison results
Reporting of final test results

Add missing known adapters to adapters file

From the test sample matrix, we know a few adapters that should be in the data but were not found. This is likely because these adapter (fragments) are not included in data/adapter_fragments.txt.

Also, from sourcing the kits for the list of test samples, and by going through the kits, we might a few othe adapters that are not yet in the list.

Finally, have a search online to see if similar lists/resources exist, e.g., this repo for small RNA adapters and official resources like this and this.

Add all known, missing adapter fragments (starting at 5' adapter and using only 15 nts) to data/adapter_fragments.txt.

Results object as Pydantic model Class

The results object (currently a simple dictionary) should probably be a model class in order to allow proper validation. This can be done with the Pydantic package, which makes use of Python's type hinting system to define model classes with very little overhead. This makes the code much safer and will also make it easier to document/understand the code as it is getting more complex

Infer presence of polyA tails in reads

Is your feature request related to a problem? Please describe.
TBA

Describe the solution you'd like
TBA

Describe alternatives you've considered
TBA

Additional context
TBA

Handle multimappers correctly

A given read can map to multiple locations. It is important, however, that every read is counted only once. Also, the resulting alignments may sometimes be to the sense and other times to the antisense strand.
So to make sure these constraints are considered, it is necessary to collapse the outcomes of all the alignments of a given read into one outcome. Say, read 12345 aligns to 5 different locations in the RP transcripts. If all of the alignments do not require taking the reverse complement of the read (i.e., they are on the sense strand), you should count 1 towards "forward", not 5. If all of the alignments need to be reverser complemented, then you should count 1 towards "reverse", not 5. Now, if some of the alignments are "forward" and some of them are "reverse", I think it's best if you omit that read from the counting altogether, because the origin of that particular read cannot be safely established and thus the orientiation information from that read will also be insecure. Alternatively (especially if we don't get many reads aligning to the RP transcripts), you could assing fractions of counts to "forward" and "reverse": e.g., you had 4 alignments consistent with "foward" and 1 with "reverse", you could count 0.8 for "forward" and 0.2 for "reverse".

Identify potential barcode k-mers

Barcodes should consist of a small set of k-mers with the same length and a very similar position-dependent frequency of occurrence within reads. From the motif enrichment function we will get a list of strings that are considered significantly enriched among the reads. Another function will also implement computing the position-dependent frequency of occurrence of a k-mer in a set of input reads. We will use these two pieces of information, as well as the reads themselves to identify a set of motifs of the same length that together are represented across a minimal fraction of the reads and therefore could represent library barcodes. Inputs: set of motifs (of potentially different lengths), set of reads. The function will first construct position-dependent frequency distributions. These will be compared pairwise across motifs of the same length. The list of pairwise distances will be traversed from smallest to largest, until a maximum threshold is reached, building sets of k-mers of the same length and with similar position-dependent frequencies. The coverage of reads by each set will be determined. The k for which the coverage satisfies a min threshold is selected. If multiple k's satisfy the condition, the largest one is chosen.

Adapter count

Searching all truncated adapters (length : 15) for n number of reads.
For now implementing KMP/Aho-Corasick algorithms for better time constraints.

Populate reference data matrix

Is your feature request related to a problem? Please describe.

We have recently started collecting reference data for rigorously testing HTSinfer in this matrx. In order to check the performance/reliability, we would need a lot more data sets and associated metadata, covering

a range of diverse organisms from different clades of the phylogentic tree (among the supported organisms; can be obtained by parsing the set of ribosomal transcripts)
different sequencing depths (<1 mio to >10 mio)
different read lengths (35 - 200 nt)
different read orientations
single-/paired-ended libraries

Note that we do not need all possible combinations of those different parameters - just a good general coverage to have at least a small number of samples per organism and for each of the categories (single/paired) and extremes (small/big libraries, short/long reads).

Describe the solution you'd like

Search/query the SRA for data sets of interest (i.e., small/big ones, short/longs reads, different organisms)
Shortlist data sets that have one or more publications associated
Mine the relevant metadata from SRA and the associated publications and enter into the list

Note that the library source/organism and whether a library is single- or paired-ended should be recorded in the SRA metadata. However, 3'-adapters are, if at all, usually only described in the methods sections of associated papers. Library orientations are not usually recorded at all, but can be determined if the name of the kit that was used to prepare the library is provided (kits generally use the same library orientation), by reading the kit instructions. This may also shed light on the sequence of the 3'-adapters used.

For that reason, it is probably a good idea to add a "Sequencing kits" tab to the reference data spreadsheet in which the library orientation (and possibly adapters) for the different kits can be recorded. In that way, we have to check each kit manual only once.

Note that this is an ongoing issue that can easily be addressed by multiple people without risk of conflicts/overlaps.

Find motif matches within a read

For constructing the position-dependent profile of motif occurrences and counting the reads that match individual motifs (issues #18 and #19), we need a functionality of finding where a specific motif occurs within a read. We could do this by sliding the motif from left to right, starting from a minimum overlap between motif end and read start and ending at some minimum overlap between motif start and read end, reporting all of the consistent matches as a list of tuples, (position:fraction of motif matched). Running this function over all reads and all motifs will be time-consuming, so minimize the number of tests that are done to decide whether the motif matches or not.

ci: move to GitHub Actions

I'd like to contribute the the HTSInfer repo too.

I see you are still running on Travis, I could transfer the CI to GH Actions. I like setting up CIs, I do this for fun :)

Would you like that @uniqueg?

Change from conda to mamba in CI workflows

Describe the bug
The current CI workflow uses conda to set up the work environment. It is very slow and workflow run times can climb up to almost 30 minutes.

To Reproduce
Push or Pull request

Expected behavior
conda-incubator/setup-miniconda@v2 offers the Experimental mamba-version: "*" option which enables mamba as the default package manager.
It would be reasonable to use mamba since the run is ~10 times faster.

Add support for paired-end samples

This is a meta issue used for planning support for paired-end samples. Once plans are finalized, individual issues will be created based off the bullet points in the following list:

1. Prepare test files

Prepare small FASTQ files (5 entries) representing inward, outward and matching read orientation for unstranded and stranded (both with read 1 mapping to forward and reverse strands) libraries (see here for more info)

2. Get matching subsets of reads for both mates

Write a function that splits up paired-end libraries that contain reads for both mates in a single file
Write a function that generates a mate identifier given the identifier of its mate (make use of or refactor parse_mate_info_from_id() from htsinfer/infer_single_paired.py/)
When selecting a subset of random reads from a library (#11), extract matching reads from the corresponding mate library; make use of the function above to find matching mates by identifiers

3. Test mapping strategy

STAR has the ability to consider both mates of a read when computing alignments; however, information required to infer read orientation (which mate aligns to which strand and in what relative position) may potentially be lost in that case; to decide whether to use STAR in this mode, we first need to test if the resulting output has all required information

Compute alignments with STAR for all test files; inspect 0x10 bit in SAM FLAG and CIGAR string to ensure that all required information to infer read orientation is present in output; if not, we need to run STAR independently for each mate

4. Adapt mapping (#27), parameter extraction (#28) and inference of read orientation (#29) to account for paired-end libraries

Details depend on mapping strategy

Infer organism from fastQ sample: Selecting Genes

The goal here is to identify which genes to pick, in order to map them across the fastQ reads which will determine the type of organism from the derived samples.

Selection of Genes are being done on the following factors:

Genes which are universally conserved across some organisms.
Genes that are highly expressed among organisms.
Most abundant Ribosomal Proteins :
One Ribosomal RNA
One protein coding gene.

Infer library type if sequence identifiers do not contain mate info

Is your feature request related to a problem? Please describe.

When depositing FASTQ files to the Sequence Read Archive (SRA), the original sequence identifiers are not preserved. In the sequence identifiers generated by the SRA, mate information is not retained. Therefore, the current regex-based inference of the library type (single vs paired) does not work for FASTQ samples derived from the SRA.

As using HTSinfer on libraries obtained from SRA is an important use case, this is a critical problem.

Describe the solution you'd like

Whether two different libraries are mates of one another can also be inferred indirectly, based on two assumptions:

Mates have the same sequence identifiers
Mates of a paired-ended library generally (though possibly not always) map to the same transcripts. Also, the alignment orientation of each mate should match the ones expected based on the read orientation.

So, for libraries whose type cannot be inferred based on the regex approach (faster!), mate library files could be individually mapped to the set of transcripts (or mappings used from another step) and the alignments could be analyzed for the fraction of mate pairs (reads with same identifier) that map to the same transcript.

Describe alternatives you've considered

Alternatively, one could possibly use a mapping strategy where both mate library files are supplied.

Support additional source naming schemes

Is your feature request related to a problem? Please describe.

Currently short names and taxon identifiers are used to refer to library sources (e.g., hsapiens and 9606, respectively). However, some downstream applications (e.g., ZARP) use other naming schemes, such as home_sapiens.

Describe the solution you'd like

See if additional library source naming schemes can be mined and provided. Also check if there are more official names for these identifiers (short_name is very much ad hoc).

Describe alternatives you've considered

N/A

Additional context

N/A

Infer Organism from FastQ Sample: Selecting Organisms

To infer organisms form the fastQ samples, the top most over-represented organisms in SRA are being selected based on the following parameters:

Fraction of samples per organism.

Construct position-dependent frequency profiles

Given a set of motifs and a set of sequences, construct a position-dependent frequency profile of occurrence of the motifs within reads. This can be done by scanning all the reads with a particular motif, looking for perfectly compatible matches. That is, if the motif occurs internally in the read, it should be found precisely, and if the motif occurs at the 3' or 5' end of the reads it should be matched perfectly over the bases that are present. Record all the positions in the reads where the motif occurs and construct a histogram. If the motif occurs more than once in a read, count it fractionally towards each of those positions (e.g. if the motif occurs at 2 positions, each of these positions gets a count of 1/2). The simple implementation of this procedure will be slow (nr motifs * nr reads * read length), so make use of multiprocessing to speed things up.

Calculate motif enrichment

Given two dictionaries of motif counts (foreground and background), calculate the enrichment (ratio of the probabilities of occurrence of the motif in foreground and background) and p-value (assuming binomial sampling with average frequency taken from the background set) for all motifs in the foreground set. Return these as a dictionary.

Extract RP transcripts for provided organism

The repository contains a file with transcript sequences of highly expressed ribosomal proteins for hundreds of organisms (prepared for the purpose of inferring the organism from the data, a functionality which has not yet been merged). You can find it at data/transcripts.fasta. In order to map the reads against the transcriptome in #26 and #27 of the read orientation inference process. we will need to select from this file only the FASTA entries of those transcripts that originate from the organism that was provided through #23 & #24.

To do so, implement a function that parses the identifier lines of data/transcripts.fasta and selects those entries (identifier and corresponding sequence) that match the 4th field if the provided organism is a string (organism short name) or the 5th column if the provided organism is an integer (taxon ID). See #24 for more info on this. Write the resulting records to a file in the temporary directory created in #33 and return the path to the resulting transcripts file. Make sure you call your new function in the main function infer() in the htsinfer/infer_read_orientation.py module and assign the result to a variable that can be used subsequently to create an index of the file (#26).

Modify Final Output

For now, We are keeping the ouput as strings for infer_organism and infer_adapter just like infer_single_paired:

infer_organism.py

Output: organism name, NA, invalid_file

infer_adapter.py

Output: adapter sequence, NA, invalid_file, not_available

We can later make use of Pydantic package to change the final output, once the other functionalities are done.

Align motifs

Given a set of motif pairs with specified inter-motif distances, align them and generate clusters of motifs. Motif pairs that occur in reads at a preferred distance would be defined in a regular expression-like manner as [M1]+[N][M2]+, where [M1]+ indicates the bases of motif 1, [M2]+ indicates the bases of motif 2, and [N] indicates that there can be spacers of unknown sequence in between. We will align these motifs using a greedy approach, calculating the similarity of motifs in each pair and aligning those with highest similarity. Since we are looking for well-defined sequences (no ambiguity of bases at any position), we can be very strict, aligning only motifs which are compatible with each other (no mismatches). To calculate the similarity, we slide one motif over the other and record the number of matching bases. 'N's do not count. Once two motifs are aligned, they generate a new motif, which is the consensus of the two, with 'N's in one motif being resolved by bases at the corresponding positions in the other motif. All distinct consensus motifs at the end of this clustering procedure should be reported.

Example: let's say we have the following motifs as input ['ACANNNTGT', 'CACNNNGTG', 'CANNNGT', 'GGANGGA', 'GGCNNGA'] along with a threshold a minim score).
At each step, we have to calculate the similarity (score) for all motif pairs. Taking the first two, we would start sliding one motif over the other, calculating the number of matching bases:
ACANNNTGT
CACNNNGTG
------------------------- matches = 0

ACANNNTGT
CACNNNGTG
------------------------ matches = 0
...
ACANNNTGT
CACNNNGTG
------------------------ matches = 5 (Note: I counted C-N and T-N each as a half match)
...
ACANNNTGT
CACNNNGTG
------------------------ matches = 5 (similar to the one above)
etc.
Based on the above alignments, the best score for the two motifs would be 5, with 2 alternative alignments.

We would similarly calculate all pairwise scores, and then merge the two motifs with the highest scores into one. In the case of ties, we could chose randomly among the possibilities. Other choices are possible, so this can be implemented as an option. Assuming that we take the following alignment
ACANNNTGT
CACNNNGTG
The merged motif will look like this: ACACNNTGTG, where some of the N's in the original motifs became constrained by the alignment. After the merge, the motif list will be shorter by one. We keep going, always picking the two motifs with highest score and merging them, until the list consists of a single motif, or until the score of any two motifs drops below the threshold that we have as input parameter. We then return the list of remaining motifs.

Create temporary directory

Several functions that need to be added to implement the read orientation inference functionality will need to write temporary data to the disk. In order to keep things organized and allow cleaning up temporary data after we are done with the inference, a temporary directory should be created to keep all of the data that is written in subsequent steps.

Implement a function that accepts a directory path, creates a temporary directory in that directory and return the path to the temporary directory. Make sure you call your new function in the main function infer() in the htsinfer/infer_read_orientation.py module and assign the result to a variable that can be used subsequently (#25, #26, #27, #34). Also make sure to write at the very end of the infer() function the code required to delete the temporary directory! All other required calls/code to implement the read orientation inference functionality will go between the creation and deletion of the temporary directory.

Handle mates

Check:

Whether searching for adapters in second mates is the same as in first mates
Whether they are possible reversed/complemented

We can probably only properly check that once #73 is merged.

Integration among TEST PYPI and PYPI

In the deployment code, It should go like this: If Publishing for Test PYPI fails then don't go for PYPI. Go for PYPI only if previous provider successes. However at the moment, both will run simultaneously.

deploy:
  # Publishing on TestPyPI
- provider: pypi
   
  # Publishing on PyPI  
- provider: pypi

Extract a subset of random reads from a fastq file

To speed up inferences about read architecture in a library, extract a subset of reads (default all). Should take as parameters the number or proportion of reads to be extracted and should take into account that the reads may or may not be sorted alphabetically.

Count motifs in a set of sequences

Given a set of sequences, calculate the number of occurrences of all motifs within a range of length that occur in the input sequences. Return the results as a dictionary.

Identify potential 'spacer' k-mers

Given a set of reads, identify k-mers with very high occurrence at a specific position in the reads. 'Spacer' k-mers should occur in most sequences with a highly peaked position-dependent frequency. We could identify these cases by fitting a mixture of two uniform distributions to the positions of the motif in reads, expecting that one component of the mixture will have much narrower range than the other and a much higher weight in the mixture. A package that could be used for the fitting is pomegranate (https://pomegranate.readthedocs.io/).

Implement unit tests for library source inference

Is your feature request related to a problem? Please describe.

Currently, comprehensive unit tests are missing for module htsinfer.get_library_source.

Describe the solution you'd like

Describe alternatives you've considered

N/A

Additional context

Note that in order to pass CI checks, coverage checks for the module have been excluded in setup.cfg. Please remove the module from the corresponding config section before filing the pull request.

Extract alignment configuration

Mappers will align reads either in sense or antisense (i.e., reverse complement) configuration. The standard format for reporting read alignemnts, in human-readable form, is the SAM format. Whether a read was mapped either to the sense or antisense strand of the template is visible from the FLAG field (see SAM specification, section 1.4), which is a combination of bitwise flags, expressed as a decimal integer. In particular, the 0x10 bit tells us whether a given read was reverse-complemented for alignment.

Write a function that takes as an input a path to a BAM file and returns a tuple of two items representing the counts of
alignemnts in that file, where the 0x10 bit is (1) unset and (2) set, i.e., where the read could be aligned in its given or reverse-complemented configuration, respectively. Note that due to the very large file sizes of typical alignment files, SAM files are typically compressed, and will be available in either BAM or CRAM format. They can be parsed with the Python package pysam. Insert a call to your new function, taking the path of the pseudobam file created in #27 as its argument, into the main function infer() in the htsinfer/infer_read_orientation.py module. Assign the result (i.e., the results directory) to a variable that can be used subsequently.

pytest issue on macOS Big Sur

Describe the bug
Os-dependent issue when testing the installation of development/testing dependencies.
Error is: htsinfer.exceptions.StarProblem: Failed to create STAR index
If solved then: htsinfer.exceptions.StarProblem: Failed to generate STAR alignments

To Reproduce
Steps to reproduce the behavior:

Use computer with macOS Big Sur and Miniconda(4.12.0) installation.
Follow the Installation tutorial for htsinfer.

Expected behavior
To pass all tests.

Desktop (please complete the following information):

OS: macOS Big Sur

Additional context
The problem solution is trivial although not up to me to decide
if it should be a part of the codebase.

Solution
In file htsinfer/htsinfer/get_read_orientation.py (dev branch)
Add index_dir.mkdir(parents=True, exist_ok=True) in line 255.
Add Path(out_dir).mkdir(parents=True, exist_ok=True) in line 319.

Map reads to RP transcripts

Alignments to transcripts will relatively easily tell us the orientation in which RNA sequence fragments are read in a given library. Here, we will use the transcriptome index generated in #26 to map the reads selected from #11.

Write a function to implement the mapping. Use a subprocess to compute alignments with STAR. For the results directory, provide a path in the in the temporary directory created in #33. Return the path to the results directory to the calling function. Make sure you have installed STAR v2.7.6a first (via Conda) and add it to the environment.yml file (if it is not already present from #26). Insert a call to your new function in the main function infer() in the htsinfer/infer_read_orientation.py module. Assign the result (i.e., the results directory) to a variable that can be used subsequently.

Compile a list of common model organisms to retrieve test samples for

Include the following:

Classical model organisms (human, mouse, rat, yeast, zebrafish, worm, fruit fly, frog)
Domesticated animals (cat, dog, horse, cow, pig, goat)

Try to find, for each family of organims, also one or two closely related organisms, preferring the more common ones.
Pick only organisms that are in data/transcripts.fasta.gz.

Once the list is assembled, pass to @BorisYourich for inclusion in the retrieval script.

zavolanlab / htsinfer Goto Github PK

htsinfer's Introduction

HTSinfer

Examples

Example output

General usage

Installation

API documentation

Contributing

Contact

htsinfer's People

Contributors

Stargazers

Watchers

Forkers

htsinfer's Issues

Trying to infer single vs paired (or rather which file contains which mate):

infer_organism.py

infer_adapter.py

Recommend Projects

Recommend Topics

Recommend Org