vanheeringen-lab / ananse

Prediction of key transcription factors in cell fate determination using enhancer networks. See full ANANSE documentation for detailed installation instructions and usage examples.

Home Page: http://anansepy.readthedocs.io

License: MIT License

Python 100.00%

grn key-transcription-factors cell-fate-determination enhancer-database bioinformatics

ananse's Introduction

ANANSE: ANalysis Algorithm for Networks Specified by Enhancers

Prediction of key transcription factors in cell fate determination using enhancer networks

ANANSE is a computational approach to infer enhancer-based gene regulatory networks (GRNs) and to identify key transcription factors between two GRNs. You can use it to study transcription regulation during development and differentiation, or to generate a shortlist of transcription factors for trans-differentiation experiments.

ANANSE is written in Python and comes with a command-line interface that includes 3 main commands: ananse binding, ananse network, and ananse influence. A graphical overview of the tools is shown below.

Check out the ANANSE documentation for detailed installation instructions and usage examples.

For documentation on the development version see here.

Citation

ANANSE: an enhancer network-based computational approach for predicting key transcription factors in cell fate determination. Quan Xu, Georgios Georgiou, Siebren Frölich, Maarten van der Sande, Gert Jan C Veenstra, Huiqing Zhou, Simon J van Heeringen. Nucleic Acids Research, gkab598. https://doi.org/10.1093/nar/gkab598

scANANSE: Gene regulatory network and motif analysis of single-cell clusters

scANANSE is a pipeline developed for single-cell RNA-sequencing data and single-cell ATAC-sequencing data. It can export single-cell cluster data from either Seurat or Scanpy objects, and it runs the clusters through ANANSE using a snakemake workflow, which significantly simplifies the process. Afterwards, results can be imported back into your single-cell object.

For more info on this implementation check out the

Help and Support

The preferred way to get support is through the GitHub issues page.

License

MIT license
Copyright 2020 © vanheeringen-lab.


ananse's Issues

auto update docs

If I got it correctly, docs are controlled by mkdocs.yml, which compiles stuff in docs/.
Is there a github action that can compile these docs on a PR?

Bonus: can we put mkdocs.yml in the docs/ directory too?

Is ananse network supposed to have an output option?

Hello,

I have just installed the latest conda version of ananse and when I try to run:

ananse network -b cDC1.F5_enhancers.binding/binding.tsv -e tpm_files/cDC1.ananse.tpm.txt -n 4

I get the following error:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/gimme/bin/ananse", line 306, in <module>
    args.func(args)
  File "/opt/anaconda3/envs/gimme/lib/python3.8/site-packages/ananse/commands/network.py", line 31, in network
    outfile=args.outfile,
AttributeError: 'Namespace' object has no attribute 'outfile'

However, I cannot see an output option among the arguments to ananse network, and when I try to pass one I get an error as well (although older instructions suggest that such an option used to exist).

Am I just missing something really simple?

Best,

Jonas

Combine two tools

The ananse interaction and the ananse network can be combined into one command (ananse network). The file that is saved by ananse interaction is an intermediate file, but not really something that you would generally need. You can add a --keep-intermediate flag to save it, otherwise it can be deleted.
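A minimal sketch of how such a flag could behave. The `run_network` helper and the intermediate file name are hypothetical and only illustrate the keep-or-delete logic; they are not taken from the ANANSE code base.

```python
import argparse
import os
import tempfile


def build_parser():
    """Parser for a hypothetical combined `network` command."""
    parser = argparse.ArgumentParser(prog="ananse network")
    parser.add_argument(
        "--keep-intermediate",
        action="store_true",
        help="keep the intermediate interaction file instead of deleting it",
    )
    return parser


def run_network(keep_intermediate=False, workdir=None):
    """Write a placeholder intermediate file and delete it unless asked to keep it.

    Returns the path of the intermediate file so callers can check whether
    it still exists.
    """
    workdir = workdir or tempfile.mkdtemp()
    intermediate = os.path.join(workdir, "interaction.tmp.txt")  # hypothetical name
    with open(intermediate, "w") as handle:
        handle.write("placeholder for the interaction step\n")

    # ... the network step would read `intermediate` here ...

    if not keep_intermediate:
        os.remove(intermediate)
    return intermediate
```

With this layout the intermediate file is an implementation detail by default, and users who want to inspect it opt in explicitly.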

Ananse influence command

I found two potential problems when running the command ananse influence. The first problem I found was related to the -p tag. When I run for instance this command:

ananse influence \
-s /jdeleuw/cell_type1.txt \
-t /jdeleuw/cell_type2.txt \
-d /jdeleuw/degenes.tsv \
-p /jdeleuw/out/type_1VS_type2.pdf \
-o /jdeleuw/out/type_1VS_type2.txt

I get the following error message:

usage: ananse [-h] <subcommand> [options]
ananse: error: unrecognized arguments: /scratch/bacint/jdeleuw/ananse/ananse_influence/test.pdf

I get the same error message when specifying true (-p true).
When I don't specify the path after the -p tag it seems to work fine:

ananse influence \
-s /jdeleuw/cell_type1.txt \
-t /jdeleuw/cell_type2.txt \
-d /jdeleuw/degenes.tsv \
-p \
-o /jdeleuw/out/type_1VS_type2.txt

I also found a problem related to the documentation. As explanation for the -o tag in the ananse influence command, the documentation states "The folder to save results". This should however not be the folder but a txt document as correctly demonstrated in the example of the documentation and in the first two commands of this issue. When I try to specify a folder I get this error message:

Matplotlib is building the font cache; this may take a moment.
2021-02-03 11:25:13.047 | INFO     | ananse.influence:__init__:244 - Reading network(s)
2021-02-03 11:49:26.210 | INFO     | ananse.influence:run_influence:338 - Run target score
Traceback (most recent call last):
  File "/mbshome/jdeleuw/miniconda3/envs/ananse/bin/ananse", line 364, in <module>
    args.func(args)
  File "/mbshome/jdeleuw/miniconda3/envs/ananse/lib/python3.7/site-packages/ananse/commands/influence.py", line 28, in influence
    a.run_influence(args.plot, args.fin_expression)  # -p and --expression (HGNC gene names and TPM)
  File "/mbshome/jdeleuw/miniconda3/envs/ananse/lib/python3.7/site-packages/ananse/influence.py", line 339, in run_influence
    influence_file = self.run_target_score()
  File "/mbshome/jdeleuw/miniconda3/envs/ananse/lib/python3.7/site-packages/ananse/influence.py", line 298, in run_target_score
    influence_file = open(self.outfile, "w")
IsADirectoryError: [Errno 21] Is a directory: '/scratch/bacint/jdeleuw/ananse/ananse_influence/test'

I am using ananse version 0.1.7+1.ge998494. Additional information about the environment can be found in the attachment below.
ananse.txt
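In the log above, the IsADirectoryError is only raised after the networks have been read, roughly 24 minutes into the run. A minimal sketch of validating the output path before any heavy work starts; `check_outfile` is a hypothetical helper, not part of ANANSE.

```python
import os


def check_outfile(path):
    """Return `path` if it can be used as an output file, else raise a clear error."""
    if os.path.isdir(path):
        raise ValueError(
            f"'{path}' is a directory; please provide the name of an output file"
        )
    parent = os.path.dirname(os.path.abspath(path))
    if not os.path.isdir(parent):
        raise ValueError(f"output directory '{parent}' does not exist")
    return path
```

Calling such a check right after argument parsing would turn a late crash into an immediate, readable message.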

Add a progress bar

You can use tqdm to add a progress bar. Some of the steps take a long time, so it would be good to see that the program is actually doing something.
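A minimal sketch of wrapping a slow loop with tqdm. The `score_regions` function and its placeholder work are assumptions for illustration; the fallback keeps the code usable when tqdm is not installed.

```python
try:
    from tqdm import tqdm
except ImportError:  # fall back to a no-op wrapper if tqdm is not installed
    def tqdm(iterable, **kwargs):
        return iterable


def score_regions(regions):
    """Stand-in for a slow per-region step; the scoring itself is a placeholder."""
    scores = []
    for region in tqdm(regions, desc="Scoring regions", unit="region"):
        scores.append(len(region))  # placeholder work
    return scores
```

tqdm writes the bar to stderr, so redirected standard output is not polluted by progress updates.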

ananse binding: what does -d do?

The docs and --help message of ananse binding both state that the -d flag keeps the detail files.

Where do these go?
What are they?

Question 1 might be solved by making the -o flag a directory instead of a file.

Why 100,000 edges?

ANANSE/ananse/influence.py

Lines 34 to 44 in f6fd473

    
    def read_network(fname, edges=100000):
        """Read network file and return networkx DiGraph."""
        G = nx.DiGraph()
        rnet = pd.read_csv(fname, sep="\t")
        nrnet = rnet.sort_values("prob", ascending=False)
        if len(nrnet) < edges:
            usenet = nrnet
        else:
            usenet = nrnet[:edges]

Running ananse binding without arguments should print help

ananse network error

When running ananse network, I supplied a single gene expression file formatted as described; however, I get this error:

ananse network -e tpm.txt -b binding_output.txt -o network_output.txt -a ~/.local/share/genomes/mm10/mm10.annotation.bed.gz -g mm10 --include-promoter

2020-06-18 16:39:02.214 | INFO | ananse.network:run_network:746 - Read data
2020-06-18 16:39:06.236 | INFO | ananse.network:run_network:756 - Aggregate binding
[########################################] | 100% Completed | 5min 38.6s
2020-06-18 16:50:44.196 | INFO | ananse.network:run_network:759 - Join expression
Traceback (most recent call last):
  File "/home/cjr78/miniconda3/envs/ananse/bin/ananse", line 326, in <module>
    args.func(args)
  File "/home/cjr78/miniconda3/envs/ananse/lib/python3.7/site-packages/ananse/commands/network.py", line 30, in network
    args.binding, args.fin_expression, args.corrfiles, args.outfile
  File "/home/cjr78/miniconda3/envs/ananse/lib/python3.7/site-packages/ananse/network.py", line 761, in run_network
    expression_file = self.get_expression(fin_expression, features)
  File "/home/cjr78/miniconda3/envs/ananse/lib/python3.7/site-packages/ananse/network.py", line 453, in get_expression
    df[col + ".scale"] = minmax_scale(df[col])
  File "/home/cjr78/miniconda3/envs/ananse/lib/python3.7/site-packages/sklearn/preprocessing/_data.py", line 502, in minmax_scale
    dtype=FLOAT_DTYPES, force_all_finite='allow-nan')
  File "/home/cjr78/miniconda3/envs/ananse/lib/python3.7/site-packages/sklearn/utils/validation.py", line 586, in check_array
    context))
ValueError: Found array with 0 sample(s) (shape=(0,)) while a minimum of 1 is required.

Since the last call is 'join expression', I thought it might be to do with the expression file. Using the example expression file, the same error occurs.

Change default genome to hg38

bug? fatal: bad revision 'HEAD'

When I try to run ananse binding using the following code:

ananse binding \
-r /home/jsmits/ananse/KC_enh_int.bed \
-o /home/jsmits/ananse/binding_KC.txt \
-a /home/jsmits/tools/GRCh38.p13/GRCh38.p13.annotation.bed \
-g /home/jsmits/tools/GRCh38.p13/GRCh38.p13.fa \
-p /home/jsmits/git/ANANSE/data/gimme.vertebrate.v5.1.pfm

I get an error:

fatal: bad revision 'HEAD'
Traceback (most recent call last):
  File "/home/jsmits/anaconda3/envs/ananse/bin/ananse", line 286, in <module>
    args.func(args)
  File "/home/jsmits/anaconda3/envs/ananse/lib/python3.7/site-packages/ananse/commands/binding.py", line 23, in binding
    a.run_binding(args.fin_rpkm, args.outfile)
  File "/home/jsmits/anaconda3/envs/ananse/lib/python3.7/site-packages/ananse/binding.py", line 229, in run_binding
    filter_bed = self.clear_peak(peak_bed)
  File "/home/jsmits/anaconda3/envs/ananse/lib/python3.7/site-packages/ananse/binding.py", line 114, in clear_peak
    peaks = self.set_peak_size(peaks, 200)
  File "/home/jsmits/anaconda3/envs/ananse/lib/python3.7/site-packages/ananse/binding.py", line 82, in set_peak_size
    for peak in peaks:
  File "pybedtools/cbedtools.pyx", line 792, in pybedtools.cbedtools.IntervalIterator.__next__
  File "pybedtools/cbedtools.pyx", line 656, in pybedtools.cbedtools.create_interval_from_list
IndexError: list index out of range

That is not really clearly telling me where stuff goes awry. What is going wrong here? :)

Check filtering steps in influence

Unrelated to PR. But the amount of (unexpected to me) filtering is strange, this is one example. I think this is something we should take a look at and think about.

E.g. removing not-validated TFs by default will of course improve the AUPRC, since we have no ground-truth data for those. However, if we are confident in the model, I am not sure this is the way to go.

Originally posted by @Maarten-vd-Sande in #44 (comment)

Biological replicates

When reading the docs on GitHub, I noticed that the enhancer command uses a file called "KRT_H3K27ac_rep1.bam". Since I haven't seen any "rep2.bam", I wonder if there is a way to handle biological replicates for the peak data.

Pairwise comparison of multiple time points

I want to use ANANSE to compare GRNs from an early sample and an old sample of the same tissue. I have RNA-seq and ATAC-seq for multiple time points. Do you think it is reasonable to look for transcription factors using pairwise comparisons? I wanted to compare time point 1 vs time point 0, time point 2 vs time point 1, and so on. I would expect this to show whether there is a common TF program across all time points or whether it differs at each step.

Running ananse without arguments gives an error

heeringen@cn45:~$ ananse
Traceback (most recent call last):
  File "/vol/mbconda/heeringen/envs/ananse_dev/bin/ananse", line 375, in <module>
    if args.func.__name__.startswith("run_"):
AttributeError: 'Namespace' object has no attribute 'func'
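The traceback shows `args.func` being accessed when no subcommand was given, in which case argparse never sets it. A minimal sketch of a guard that prints the help text instead; the parser layout and the placeholder handler are assumptions, not the actual ANANSE entry point.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="ananse")
    subparsers = parser.add_subparsers(title="subcommands")
    binding = subparsers.add_parser("binding", help="predict TF binding")
    binding.set_defaults(func=lambda args: "binding called")  # placeholder handler
    return parser


def main(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    if not hasattr(args, "func"):
        # No subcommand given: show the help text instead of crashing.
        parser.print_help()
        return 1
    return args.func(args)
```

The same pattern can be applied per subcommand, so that running a subcommand without its required arguments also prints its help.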

Update influence.py

Personal preference, but I am linking pep to make it look like it isn't (https://www.python.org/dev/peps/pep-0008/#programming-recommendations)

For sequences, (strings, lists, tuples), use the fact that empty sequences are false:

# Correct:
if not seq:
if seq:

# Wrong:
if len(seq):
if not len(seq):

Anyways, I think line[1] should have a descriptive name.

Originally posted by @Maarten-vd-Sande in #44 (comment)

ananse influence: command line

The docs refer to 'first cell' and 'second cell'. Do you mean 'output from command x/y', or sequencing samples?

improve documentation

Your functions do a lot, but what they do and how they do it is not explained, which makes them almost magic unless you delve into the code!

Here are some places in the documentation that might need improvement:

What does each step do? A short explanation in the docs might help.
What do the output files mean? binding.txt, full_features.txt and {TF}.txt aren't descriptive (to me).
The ananse influence command requires:
1. a network
2. gene expression
3. differential gene expression with another condition
But what does it do with them? From what I can tell it will infer the ATAC accessibility (and I know it doesn't!).

enhancer command ValueError: Cannot take a larger sample than population when 'replace=False'

The ananse enhancer command is not working for my data. As the genome I used the FASTA file of the assembly (my assembly cannot be retrieved with genomepy); the folder containing that .fa file also stores the .fai, .gaps.bed and .sizes files. In addition, the folder containing the .bam file contains the .bai file. The .narrowPeak file was computed with macs2.

Execution command

ananse enhancer \
        -g ~/nfur/genome/Nfu_20150522.genes_20140630/Nfu_20150522.softmasked_genome.fa  \
        -t ATAC \
        -b ~/nfur/ATACseq/nucfree/8sem_FB/N-fur-ATAC-8sem-FB1.bam \
        -p ~/nfur/ATACseq/nucfree/8sem_FB/CONS_PEAKS/8sem_FB_rep1_peaks.narrowPeak \
        -o ~/8sem_FB1_enhancer.bed

Traceback

Traceback (most recent call last):
  File "/home/ska/areyes/miniconda3/envs/ananse/bin/ananse", line 364, in <module>
    args.func(args)
  File "/home/ska/areyes/miniconda3/envs/ananse/lib/python3.8/site-packages/ananse/commands/enhancer.py", line 43, in enhancer
    b.run_enhancer(
  File "/home/ska/areyes/miniconda3/envs/ananse/lib/python3.8/site-packages/ananse/enhancer.py", line 234, in run_enhancer
    quantile_bed = self.quantileNormalize(epeak, bed_cov, bed_output)
  File "/home/ska/areyes/miniconda3/envs/ananse/lib/python3.8/site-packages/ananse/enhancer.py", line 213, in quantileNormalize
    rank = pd.read_csv(self.peak_rank, header=None).sample(n = enahcer_number, random_state = 1).sort_values(0)[0].tolist()
  File "/home/ska/areyes/miniconda3/envs/ananse/lib/python3.8/site-packages/pandas/core/generic.py", line 4995, in sample
    locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
  File "mtrand.pyx", line 959, in numpy.random.mtrand.RandomState.choice
ValueError: Cannot take a larger sample than population when 'replace=False'
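The error arises because pandas `sample` draws without replacement by default, which fails when `n` exceeds the number of available rows. The standard library's `random.sample` has the same restriction, so the condition can be sketched without pandas. Switching to sampling with replacement when too few values are available is only one possible workaround and an assumption here; whether it suits the quantile normalization in ANANSE is not something the report establishes.

```python
import random


def sample_ranks(ranks, n, seed=1):
    """Draw `n` values from `ranks` and return them sorted.

    Samples without replacement when enough values are available, and with
    replacement (values repeat) when `n` exceeds the population size.
    """
    rng = random.Random(seed)
    if n <= len(ranks):
        picked = rng.sample(ranks, n)      # without replacement
    else:
        picked = rng.choices(ranks, k=n)   # with replacement
    return sorted(picked)
```

An alternative is to raise an explicit error that reports both sizes, which at least tells the user which input is larger than expected.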

ananse network: full_features.txt

Full features.txt contains 3 columns and 2 headers. Not sure if this is a bug or a feature.

head of results/full_features.txt:

source_target   binding prob
AFP_A1BG        0.1798608119532118      0.06992326783886157
AFP_AADAT       0.18974581561592965     0.07715223098288283
AFP_AASDHPPT    0.16624518204520597     0.0604560220908372
AFP_ABCA4       0.2680246705361807      0.14873876044464707
AFP_ABCB1       0.16239896707039708     0.05791679392663198
AFP_ABCB6       0.5455329820882627      0.5883154303566295
AFP_ABCC10      0.15746549560853199     0.0548652417730351
AFP_ABCC6       0.32669678721144424     0.2189110249077235
AFP_ABCG1       0.5327329751529389      0.5654740206058142

zenodo changelog parser

@siebrenf implemented a nice CHANGELOG parser for seq2science, which automatically adds the relevant piece of CHANGELOG to the release description. Might be nice?

Originally posted by @Maarten-vd-Sande in #35 (comment)

ananse binding genome error

When trying to run ananse binding, I get this error:

ananse binding -r counts.bed -o ananse_binding/H3K27ac.txt -g mm10 -p "/home/cjr78/miniconda3/pkgs/gimmemotifs-0.14.4-py37h516909a_0/lib/python3.7/site-packages/data/motif_databases/gimme.vertebrate.v5.0.pfm"
Traceback (most recent call last):
  File "/home/cjr78/miniconda3/envs/gimme/bin/ananse", line 286, in <module>
    args.func(args)
  File "/home/cjr78/miniconda3/envs/gimme/lib/python3.7/site-packages/ananse/commands/binding.py", line 21, in binding
    genome=args.genome, gene_bed=args.annotation, pfmfile=args.pfmfile
  File "/home/cjr78/miniconda3/envs/gimme/lib/python3.7/site-packages/ananse/binding.py", line 40, in __init__
    self.gsize = g.props["sizes"]["sizes"]
AttributeError: 'Genome' object has no attribute 'props'

My genome is installed via genomepy and works with gimmemotifs. I tried to specify the .fa file directly but got the same error.

filtering

Multiple functions filter by gene names (binding and network at least).
I'd like to know where so we can discuss where and how the filtering should be performed.

If it is required in multiple places we can maybe add a filter function to utils.py.

In ANANSE binding, this happens in enhancer_binding.py's filter_transcription_factors() (this used to be binding.py's clear_tfs())
However, we could also filter earlier, on gimme's motifs.pfm file, which would reduce the scanning time as well!

Replace True/False with flags

Change all command-line flags that now require a True or False argument to flags that change the setting using store_true or store_false.
For instance the --keep_detail argument.

    p.add_argument(
        "-d",
        "--keep_detail",
        dest="detail",
        help="Keep detail files",
        # no metavar: store_true takes no value, and argparse rejects metavar here
        action="store_true",
        default=False,
    )
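A small self-contained sketch of the resulting behaviour, using a standalone parser rather than the real ANANSE one:

```python
import argparse

parser = argparse.ArgumentParser(prog="ananse binding")
parser.add_argument(
    "-d",
    "--keep_detail",
    dest="detail",
    action="store_true",  # the flag takes no value; its presence means True
    help="Keep detail files",
)

print(parser.parse_args([]).detail)      # False
print(parser.parse_args(["-d"]).detail)  # True
```

With `store_true`, a trailing value such as `-d true` is no longer consumed by the flag, so argparse reports it as an unrecognized argument.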

ananse influence: AttributeError: 'Namespace' object has no attribute 'fin_expression'

Hello again and sorry for the spam :)

After I updated my local environment with pip install git+https://github.com/vanheeringen-lab/[email protected], ananse network ran smoothly, but I have now encountered a problem when running:

ananse influence -s cDC1.F5_enhancers.binding/cDC1.network.txt -t cDC2.F5_enhancers.binding/cDC2.network.txt -d log2fc_files/cDC1vscDC2.ananse.log2fc.txt -o influence_out/cDC1TOcDC2.out.txt -n 12

I get the following error:

2021-06-03 14:10:03 | INFO | Reading network(s)
Traceback (most recent call last):
  File "/opt/anaconda3/envs/gimme/bin/ananse", line 317, in <module>
    args.func(args)
  File "/opt/anaconda3/envs/gimme/lib/python3.8/site-packages/ananse/commands/influence.py", line 29, in influence
    args.plot, args.fin_expression
AttributeError: 'Namespace' object has no attribute 'fin_expression'

So when I look at lines 28-30 it says a.run_influence( args.plot, args.fin_expression ) # -p and --expression (HGNC gene names and TPM)

Note that it only says that here when ananse is updated with the command above but in the current github instance of influence.py that does not seem to be the case.

Should something have been included in the network files that is missing because ananse network didn't run properly? I assume -p is the plot option, but where does --expression come from? Am I just missing something?

Best,
Jonas

Motif database

In the new GimmeMotifs the motif2factors file looks like this:

Motif   Factor  Evidence        Curated
GM.5.0.Sox.0001 SRY     JASPAR  Y
GM.5.0.Sox.0001 SOX9    Transfac        Y
GM.5.0.Sox.0001 Sox9    Transfac        N
GM.5.0.Sox.0001 SOX9    SELEX   Y
GM.5.0.Sox.0001 Sox9    SELEX   N
GM.5.0.Sox.0001 SOX13   ChIP-seq        Y
GM.5.0.Sox.0001 SOX9    ChIP-seq        Y
GM.5.0.Sox.0001 Sox9    ChIP-seq        N
GM.5.0.Sox.0001 SRY     SELEX   Y
GM.5.0.Sox.0001 SOX15   SMiLE-seq       Y
GM.5.0.Sox.0001 Sox12   PBM     N
GM.5.0.Sox.0001 Sox15   PBM     N
GM.5.0.Sox.0001 Sox18   PBM     N

Please update ananse to use this file. We should not need the factortable.txt file. This also means that the motif database can be optional. If not specified on the command line, it uses the GimmeMotifs default db. Make sure to always use the gimmemotifs.utils.pfmfile_location() function to get the motif file. Then you can use all the dbs included with GimmeMotifs by name.

In [3]: pfmfile_location(None)                                                                 
Out[3]: '/home/simon/anaconda3/envs/ananse/lib/python3.6/site-packages/gimmemotifs/../data/motif_databases/gimme.vertebrate.v5.0.pfm'

In [4]: pfmfile_location("JASPAR2020_vertebrates")                                             
Out[4]: '/home/simon/anaconda3/envs/ananse/lib/python3.6/site-packages/gimmemotifs/../data/motif_databases/JASPAR2020_vertebrates.pfm'

In [5]: pfmfile_location("HOMER")                                                              
Out[5]: '/home/simon/anaconda3/envs/ananse/lib/python3.6/site-packages/gimmemotifs/../data/motif_databases/HOMER.pfm'

If you give this function a custom PFM name it will just return that:

In [6]: pfmfile_location("data/gimme.vertebrate.v5.2.pfm")                                     
Out[6]: 'data/gimme.vertebrate.v5.2.pfm'
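Reading the motif2factors table shown above needs nothing beyond the standard library. The sketch below is an illustration only; `read_motif2factors` and its `curated_only` option are hypothetical names, not part of GimmeMotifs or ANANSE.

```python
import csv
import io
from collections import defaultdict


def read_motif2factors(handle, curated_only=False):
    """Map each motif ID to the sorted list of factor names listed for it."""
    motif2factors = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        if curated_only and row["Curated"] != "Y":
            continue
        motif2factors[row["Motif"]].add(row["Factor"])
    return {motif: sorted(factors) for motif, factors in motif2factors.items()}


# First rows of the table above, tab-separated.
EXAMPLE = (
    "Motif\tFactor\tEvidence\tCurated\n"
    "GM.5.0.Sox.0001\tSRY\tJASPAR\tY\n"
    "GM.5.0.Sox.0001\tSOX9\tTransfac\tY\n"
    "GM.5.0.Sox.0001\tSox9\tTransfac\tN\n"
)
print(read_motif2factors(io.StringIO(EXAMPLE), curated_only=True))
```

Because factor names are kept case-sensitive here, `SOX9` and `Sox9` remain distinct entries; any case normalization would be a separate design decision.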

Update ananse network help

Rephrase:

All TFs binding file
Expression scores

Remove correlation argument (it is not used?)

Include promoters by default, also change in help message.

Use logging

In line with #14, it might be good to add some status messages that explain what the program is doing. loguru would probably be a good option; it is a relatively straightforward logging utility.
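A minimal sketch of status messages with loguru, falling back to the standard `logging` module when loguru is not installed. The `run_steps` helper and the step names are assumptions for illustration.

```python
try:
    from loguru import logger
except ImportError:  # fall back to the standard library if loguru is missing
    import logging

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s | %(levelname)s | %(message)s",
    )
    logger = logging.getLogger("ananse")


def run_steps(steps):
    """Run named steps in order, logging a status message before each one."""
    results = []
    for name, func in steps:
        logger.info(f"Starting step: {name}")
        results.append(func())
    return results
```

For example, `run_steps([("Read data", load), ("Aggregate binding", aggregate)])` would emit one timestamped line per step, where `load` and `aggregate` stand for whatever callables do the work.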

custom genome ananse network

Trying to run ananse network using an assembly not from genomepy gave me a problem. I encountered the error ValueError: Found array with 0 sample(s) (shape=(0,)) while a minimum of 1 is required. after "Join expression" is printed, which in a former issue was related to gene names.

Before running ananse network, I consulted Maarten, who suggested and shared code to generate a motif2factors file for my gene ids. The resulting motif2factors file contains the names of genes that are orthologous to those in the original motif2factors file, but from different assemblies. Here it is:

Motif	Factor	Evidence	Curated
GM.5.0.Sox.0001	SOX9A	Orthologs	N
GM.5.0.Sox.0001	SOX9B	Orthologs	N
GM.5.0.Sox.0001	AL929022.1	Orthologs	N
GM.5.0.Sox.0001	SOX18	Orthologs	N
GM.5.0.Sox.0001	SOX7	Orthologs	N
GM.5.0.Sox.0001	SOX4A	Orthologs	N
GM.5.0.Sox.0001	SOX4B	Orthologs	N
GM.5.0.Homeodomain.0001	TGIF1	Orthologs	N
GM.5.0.Homeodomain.0001	TGIF2	Orthologs	N
GM.5.0.Mixed.0001	Nfu_g_1_013515	Orthologs	N

I filtered it to keep those lines containing my gene names (I also removed the "_" and transformed to uppercase, suggested by Maarten):

Motif	Factor	Evidence	Curated
GM.5.0.Mixed.0001	NFUG1013515	Orthologs	N
GM.5.0.bHLH.0001	NFUG1022543	Orthologs	N
GM.5.0.bHLH.0001	NFUG1006036	Orthologs	N
GM.5.0.C2H2_ZF.0001	NFUG1013515	Orthologs	N
GM.5.0.bZIP.0002	NFUG1018863	Orthologs	N

Then, I used this command:

ananse binding -r enhancer.bed -o binding.txt -g Nfu_20150522.genes_20140630/Nfu_20150522.softmasked_genome.fa -t ATAC -p Nfu_20150522.gimme.vertebrate.v5.0.pfm -f ../tf_nfu.txt --unremove-curated --include-notfs

Resulting binding.txt file

factor	enhancer	zscore	log10_peakRPKM	binding
NFUG1013515	scaffold01062:16192-16392	-0.07880385405158605	0.4758762314880969	0.0936134202605538
NFUG1013515	sgr01:12854699-12854899	-0.17834431235782186	0.7764678514396647	0.15454978522361046
NFUG1013515	sgr01:23638699-23638899	0.6767994812144132	1.2505841736269574	0.45572099669473215
NFUG1013515	sgr01:38279565-38279765	-1.3940636981458527	1.1807659445934546	0.1686181312937036

Finally, here are the ananse network command that produced the error I described at the beginning and the files it used:

ananse network -e tpm.txt -b binding.txt -o net.txt -a annotation.bed -g Nfu_20150522.genes_20140630/Nfu_20150522.softmasked_genome.fa --exclude-promoter --include-enhancer

tpm.txt

target_id	tpm
NFUG1013515	5.3
NFUG1000002	5.29503121149557
NFUG1000003	19.6412583379942
NFUG1000004	9.03547608404788
NFUG1000005	31.93858899898

annotation.bed

scaffold00001	10325	22501	NFUG1003772	0	-	10325	22501	0	7	16,89,44,283,79,27,23,	0,1123,2270,8520,11102,12103,12153,
scaffold00001	22086	26253	NFUG1003773	0	+	22675	25081	0	3	910,499,1186,	0,1701,2981,
scaffold00001	26043	273596	NFUG1003771	0	-	26043	273596	0	12	45,26,79,173,258,186,150,209,196,6,123,124,	0,120,231,397,2641,8138,91840,111272,143840,203384,244535,247429,

I don't know where the problem could be, I believe my custom gene ids are overlapping in all files.
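One way to test the assumption that the identifiers overlap is to count shared IDs between each pair of files. The sketch below uses only the identifiers visible in the excerpts above, so its counts say nothing about the complete files; `id_overlap` is a hypothetical helper.

```python
def id_overlap(**id_sets):
    """Report, per pair of named ID collections, how many identifiers they share."""
    names = sorted(id_sets)
    report = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = set(id_sets[first]) & set(id_sets[second])
            report[(first, second)] = len(shared)
    return report


# Identifiers copied from the excerpts shown above (heads of the files only).
binding_factors = {"NFUG1013515"}
expression_ids = {
    "NFUG1013515", "NFUG1000002", "NFUG1000003", "NFUG1000004", "NFUG1000005",
}
annotation_ids = {"NFUG1003772", "NFUG1003773", "NFUG1003771"}

print(id_overlap(binding=binding_factors, expression=expression_ids, annotation=annotation_ids))
```

Running the same check on the full ID columns of each file would show directly whether any pair of inputs has an empty intersection.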

ananse network: correlation file

Where does the correlation file come from? What is it, and how can I make one for my assembly/assemblies?

problem ananse network

Hello,

While running the command "ananse network" the error AttributeError: 'float' object has no attribute 'upper' pops up. I have checked my input files and cannot find anything wrong with them. Does anyone have any suggestions to fix this problem?
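One possible cause, which the report does not confirm, is an empty gene-name cell: tabular readers such as pandas represent missing values as the float NaN, and calling `.upper()` on a float raises exactly this AttributeError. A minimal sketch of a guard, with `normalize_gene_name` as a hypothetical helper:

```python
import math


def normalize_gene_name(value):
    """Upper-case a gene name, returning None for missing values such as NaN."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    return str(value).upper()
```

Rows for which the helper returns None can then be reported or dropped before any further processing.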

Kind regards,
S. van den Oever

differential expression data .tsv not .csv

These are small things, but whenever I encounter them I make an issue so it's clear where mistakes are left in the docs.
In the documentation, the example differential expression input is a .csv file. However, in the input data section of the docs the example is a .tsv.

Furthermore, many of the links to examples in the documentation are dead and give a 404 error.

Update ananse influence

Print full help message when run without arguments.
Update the description of the arguments.

README: command line help

Split the help clearly in required arguments and optional arguments (like I did with ananse binding).

Peaks data only from ATAC-seq

I understand from the enhancer data section that the enhancer command needs ATAC-seq and H3K27ac ChIP-seq. Can I find the enhancers with ATAC-seq data only?

List of TODO's

Let's keep a list of things that still need to be done, I'll update this when I come across more issues.

Remove all irrelevant data from the data/ directory.
Add two example files, for instance for fibroblasts and keratinocytes, to be able to test the method.
Create a main script ananse with several subcommands (binding, grn and influence).
Make sure the input is not size-specific. If the regions are larger than 200bp they should be resized to 200 bp.
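The resizing in the last item can be sketched as centring a fixed-width window on the midpoint of each region. This is an illustration under that assumption (the item does not say how the 200 bp window should be placed), and `resize_region` is a hypothetical helper.

```python
def resize_region(start, end, size=200):
    """Return a (start, end) interval of `size` bp centred on the original midpoint.

    The start is clamped at 0; clipping at the chromosome end is not handled.
    """
    midpoint = (start + end) // 2
    new_start = max(0, midpoint - size // 2)
    return new_start, new_start + size


print(resize_region(1000, 1500))  # (1150, 1350)
```

Regions shorter than 200 bp are widened by the same rule, so every output interval has the same width.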

Network annotation file incorrect error

When running ananse network with a custom BED file that has the wrong chromosome identifiers, the error is:

2021-03-05 14:03:23 | INFO | Loading expression
2021-03-05 14:04:39 | INFO | creating expression dataframe
2021-03-05 14:05:51 | INFO | Aggregate binding
2021-03-05 14:05:51 | INFO | reading enhancers
Traceback (most recent call last):
  File "/vol/mbconda/jsmits/envs/ananse_dev/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2898, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Start_b'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/vol/mbconda/jsmits/envs/ananse_dev/bin/ananse", line 371, in <module>
    args.func(args)
  File "/vol/mbconda/jsmits/envs/ananse_dev/lib/python3.8/site-packages/ananse/commands/network.py", line 28, in network
    b.run_network(
  File "/vol/mbconda/jsmits/envs/ananse_dev/lib/python3.8/site-packages/ananse/network.py", line 552, in run_network
    df_binding = self.aggregate_binding(
  File "/vol/mbconda/jsmits/envs/ananse_dev/lib/python3.8/site-packages/ananse/network.py", line 372, in aggregate_binding
    gene_df = self.enhancer2gene(
  File "/vol/mbconda/jsmits/envs/ananse_dev/lib/python3.8/site-packages/ananse/network.py", line 274, in enhancer2gene
    (genes["Start_b"] + genes["End_b"]) / 2 - genes["Start"]
  File "/vol/mbconda/jsmits/envs/ananse_dev/lib/python3.8/site-packages/pandas/core/frame.py", line 2906, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/vol/mbconda/jsmits/envs/ananse_dev/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2900, in get_loc
    raise KeyError(key) from err
KeyError: 'Start_b'

Changing this to a more insightful error message would be nice. :)
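A minimal sketch of the kind of early check that could replace the KeyError with a readable message. `check_chromosomes` is a hypothetical helper and the wording of the message is only a suggestion.

```python
def check_chromosomes(enhancer_chroms, annotation_chroms):
    """Raise a descriptive error if the two inputs share no chromosome names."""
    shared = set(enhancer_chroms) & set(annotation_chroms)
    if not shared:
        raise ValueError(
            "No chromosome names are shared between the enhancer regions "
            f"(e.g. {sorted(set(enhancer_chroms))[:3]}) and the annotation "
            f"(e.g. {sorted(set(annotation_chroms))[:3]}). "
            "Check that both files use the same genome assembly and naming."
        )
    return shared
```

Showing a few example names from each side usually makes a `chr1` versus `1` mismatch obvious at a glance.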

ananse enhancer: ValueError: Cannot take a larger sample than population when 'replace=False'

Hi,
I am attempting to run ananse enhancer using the hg38 genome (installed with genomepy) for ATAC-seq data. The peaks were called with Genrich and the bam file is sorted, and the index exists. Here is my command:

ananse enhancer -g hg38 -t ATAC \
                  -b celltype2.bam \
                  -p celltype2.Genrich.peaks.narrowPeak \
                  -o celltype2.ananse.enhancer.bed

and this is the error I get:

Could not find/load indexes.
Traceback (most recent call last):
  File "/home/andrewbcaldwell/miniconda3/envs/ananse/bin/ananse", line 364, in <module>
    args.func(args)
  File "/home/andrewbcaldwell/miniconda3/envs/ananse/lib/python3.8/site-packages/ananse/commands/enhancer.py", line 43, in enhancer
    b.run_enhancer(
  File "/home/andrewbcaldwell/miniconda3/envs/ananse/lib/python3.8/site-packages/ananse/enhancer.py", line 234, in run_enhancer
    quantile_bed = self.quantileNormalize(epeak, bed_cov, bed_output)
  File "/home/andrewbcaldwell/miniconda3/envs/ananse/lib/python3.8/site-packages/ananse/enhancer.py", line 213, in quantileNormalize
    rank = pd.read_csv(self.peak_rank, header=None).sample(n = enahcer_number, random_state = 1).sort_values(0)[0].tolist()
  File "/home/andrewbcaldwell/miniconda3/envs/ananse/lib/python3.8/site-packages/pandas/core/generic.py", line 4995, in sample
    locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
  File "mtrand.pyx", line 959, in numpy.random.mtrand.RandomState.choice
ValueError: Cannot take a larger sample than population when 'replace=False'

I've been able to run the whole pipeline (ananse enhancer through influence) for 3/4 of my cell types, but for some reason I run into this issue for 1 cell type. Processing for bam files and peak calling parameters were the same for all 4 cell types. Do you have any idea what may be causing this error? Thanks!

ananse network, genomepy hg38

When trying to run ananse network using a genomepy-downloaded version of hg38, it crashes with: 'ERROR: chrom "chr10_KN196480v1_fix" not found in genome file. Exiting.'.

I've told my student to use the annotation file "hg38_genes.bed", which I downloaded a while back from the ananse GitHub (but I could be mistaken). Now it seems to run. However, I think it's good to check that it works with genomepy genome and annotation files.

Gr Jos

reuse ananse binding when peakset did not change

I have timeseries data (e.g. 4 timepoints) with an identical peakset, which means that I am running ananse binding 4 times with the exact same peakset. The motif scanning takes really long 😢

I would like to reuse the result of the motif scanning.
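One way to reuse a scan is to cache its result under a hash of the peak set. The sketch below is an illustration with hypothetical helper names; a real cache key would presumably also need to cover the genome and the motif database, which are left out here.

```python
import hashlib
import os
import pickle
import tempfile


def peakset_key(regions):
    """Stable hash of a peak set, independent of the order of the regions."""
    digest = hashlib.sha256()
    for region in sorted(regions):
        digest.update(region.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def cached_scan(regions, scan_func, cache_dir):
    """Return scan_func(regions), reusing a cached result for an identical peak set."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_file = os.path.join(cache_dir, peakset_key(regions) + ".pkl")
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as handle:
            return pickle.load(handle)
    result = scan_func(regions)
    with open(cache_file, "wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []


def slow_scan(regions):
    """Placeholder for the expensive motif scan; records each real invocation."""
    calls.append(1)
    return {region: len(region) for region in regions}


cache_dir = tempfile.mkdtemp()
first = cached_scan(["chr1:100-300", "chr2:50-250"], slow_scan, cache_dir)
second = cached_scan(["chr2:50-250", "chr1:100-300"], slow_scan, cache_dir)
```

In this usage the second call finds the cached file, so `slow_scan` runs only once even though the regions are listed in a different order.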

Enhancer command: pandas.errors.EmptyDataError

Hi,
I am attempting to run ananse enhancer using the hg38 genome (installed with genomepy) for ATAC-seq data. The peaks were called with Genrich and the bam file is sorted. Here is my command:

ananse enhancer -g hg38 -t ATAC \
                  -b NDC.merged.bam \
                  -p NDC.merged.Genrich.peaks.narrowPeak \
                  -o NDC.ananse.enhancer.bed

and this is the error I get:

Could not find/load indexes.

Traceback (most recent call last):
  File "/home/andrewbcaldwell/miniconda3/envs/ananse/bin/ananse", line 364, in <module>
    args.func(args)
  File "/home/andrewbcaldwell/miniconda3/envs/ananse/lib/python3.8/site-packages/ananse/commands/enhancer.py", line 43, in enhancer
    b.run_enhancer(
  File "/home/andrewbcaldwell/miniconda3/envs/ananse/lib/python3.8/site-packages/ananse/enhancer.py", line 234, in run_enhancer
    quantile_bed = self.quantileNormalize(epeak, bed_cov, bed_output)
  File "/home/andrewbcaldwell/miniconda3/envs/ananse/lib/python3.8/site-packages/ananse/enhancer.py", line 215, in quantileNormalize
    bed = pd.read_csv(bed_input, header=None, sep="\t")
  File "/home/andrewbcaldwell/miniconda3/envs/ananse/lib/python3.8/site-packages/pandas/io/parsers.py", line 688, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/andrewbcaldwell/miniconda3/envs/ananse/lib/python3.8/site-packages/pandas/io/parsers.py", line 454, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/home/andrewbcaldwell/miniconda3/envs/ananse/lib/python3.8/site-packages/pandas/io/parsers.py", line 948, in __init__
    self._make_engine(self.engine)
  File "/home/andrewbcaldwell/miniconda3/envs/ananse/lib/python3.8/site-packages/pandas/io/parsers.py", line 1180, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/andrewbcaldwell/miniconda3/envs/ananse/lib/python3.8/site-packages/pandas/io/parsers.py", line 2010, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 540, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

Both 'before' and 'after' networks should be able to become the 'only network'

One thing to consider is maybe sqlite r-tree.

To consider.. not sure about this yet.

One thing to consider is maybe sqlite r-tree. It's the perfect data structure for this (linking proximal-ish peaks with TSSs), I think, and should be fast and dynamic. I implemented it for peaksql, and I could test whether it's (much) faster. The advantage is that we can put everything on disk, so the memory usage should be very low, and sqlite in general is fast. The big disadvantage is maintainability and perhaps portability between systems.

Originally posted by @Maarten-vd-Sande in #37 (comment)
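To make the R-tree idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It assumes the SQLite build includes the R*Tree extension, and all table, column, and function names are hypothetical; it is not the peaksql or ANANSE implementation. It uses the integer variant `rtree_i32` because the default R-tree stores 32-bit floats, which cannot represent large genomic coordinates exactly.

```python
import sqlite3


def build_peak_index(peaks, db_path=":memory:"):
    """Index (chrom, start, end) peaks in a one-dimensional integer R-tree.

    Pass a file path as db_path to keep the index on disk.
    """
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE VIRTUAL TABLE peak_idx USING rtree_i32(id, min_pos, max_pos)"
    )
    # The R-tree only stores coordinates, so chromosome names live in a side table.
    con.execute("CREATE TABLE peak_meta (id INTEGER PRIMARY KEY, chrom TEXT)")
    for i, (chrom, start, end) in enumerate(peaks):
        con.execute("INSERT INTO peak_idx VALUES (?, ?, ?)", (i, start, end))
        con.execute("INSERT INTO peak_meta VALUES (?, ?)", (i, chrom))
    con.commit()
    return con


def peaks_near_tss(con, chrom, tss, window=100_000):
    """Return peaks on chrom that overlap [tss - window, tss + window]."""
    query = (
        "SELECT m.chrom, p.min_pos, p.max_pos "
        "FROM peak_idx AS p JOIN peak_meta AS m ON m.id = p.id "
        "WHERE m.chrom = ? AND p.max_pos >= ? AND p.min_pos <= ? "
        "ORDER BY p.min_pos"
    )
    return con.execute(query, (chrom, tss - window, tss + window)).fetchall()


# Hypothetical example peaks.
peaks = [("chr1", 1000, 1200), ("chr1", 500000, 500200), ("chr2", 1100, 1300)]
con = build_peak_index(peaks)
hits = peaks_near_tss(con, "chr1", 50_000)
```

Whether this is actually faster than an in-memory interval join would need to be measured, as the comment itself notes.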

Ananse influence score for which TFs

When running ananse influence, the output does not contain influence scores for all TFs, only for (I guess) differentially expressed TFs? Or there seems to be some other cutoff that determines for which TFs the influence score is calculated. Or does which TFs are plotted depend on the top 1000 edges being used? I have no clue; the final choice of which TFs get an influence score and which don't isn't very clear.
Especially for time-series analysis, where a TF might have a huge influence at one timepoint but a lower or almost non-existent influence at another, this is problematic.

Is this TF selection based on whether a TF's targets are included among the selected edges? Or on RNA-seq expression? If there is some kind of cutoff determining which TFs get an influence score calculated, it would be nice if this were tweakable as an input.

Would you consider citing pyranges since you use it?

Thanks!

ananse binding: ValueError: Found array with 0 sample(s) (shape=(0, 2)) while a minimum of 1 is required.

command:

ananse binding -r results_ananse/enhancer/atac_13.25.bed -o results_ananse/binding_13.25.txt -g genomes/UCB_Xtro_10.0/UCB_Xtro_10.0.fa --ncore 8 --pfmfile motif2factors/UCB_Xtro_10.0.gimme.vertebrate.v5.0.pfm --etype ATAC

output:

2021-02-16 17:58:02.372 | INFO     | ananse.binding:run_binding:253 - Peak initialization
2021-02-16 17:58:02.809 | INFO     | ananse.binding:run_binding:257 - Motif scan
  6%|█▏                  | 3989/66434 [01:40<03:29, 297.51it/s]
100%|████████████████████| 66434/66434 [29:13<00:00, 37.88it/s]
2021-02-16 18:27:28.233 | INFO     | ananse.binding:run_binding:261 - Predicting TF binding sites
Traceback (most recent call last):
  File "/vol/mbconda/sande/envs/ananse/bin/ananse", line 364, in <module>
    args.func(args)
  File "/vol/mbconda/sande/envs/ananse/lib/python3.7/site-packages/ananse/commands/binding.py", line 30, in binding
    a.run_binding(args.fin_rpkm, args.outfile)
  File "/vol/mbconda/sande/envs/ananse/lib/python3.7/site-packages/ananse/binding.py", line 264, in run_binding
    table = self.get_binding_score(pfm, peak)
  File "/vol/mbconda/sande/envs/ananse/lib/python3.7/site-packages/ananse/binding.py", line 245, in get_binding_score
    table["binding"] = clf.predict_proba(table[["zscore", "log10_peakRPKM"]])[:, 1]
  File "/vol/mbconda/sande/envs/ananse/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py", line 1651, in predict_proba
    return super()._predict_proba_lr(X)
  File "/vol/mbconda/sande/envs/ananse/lib/python3.7/site-packages/sklearn/linear_model/_base.py", line 307, in _predict_proba_lr
    prob = self.decision_function(X)
  File "/vol/mbconda/sande/envs/ananse/lib/python3.7/site-packages/sklearn/linear_model/_base.py", line 268, in decision_function
    X = check_array(X, accept_sparse='csr')
  File "/vol/mbconda/sande/envs/ananse/lib/python3.7/site-packages/sklearn/utils/validation.py", line 586, in check_array
    context))
ValueError: Found array with 0 sample(s) (shape=(0, 2)) while a minimum of 1 is required.

The files can be found here:
/scratch/sande/ananse_valueerror/

I don't know what is going wrong here, if I check my files they look fine to me.. I am running it with my own "custom" motif2factors file, but it looks good to me.. 🤔

Update influence to use pandas

This whole function could be pandas, but again.. Not part of the PR 🙃

Originally posted by @Maarten-vd-Sande in #44 (comment)

bug in gimmemotifs(?)

The following occurs when running ananse binding on Ensembl-aligned data.
@simonvh is working on a fix. Posting it here for clarity.

Motif scanning
         0 - 10000 enhancers
2020-03-18 17:09:50,497 - INFO - using 10000 sequences
2020-03-18 17:09:50,497 - INFO - Creating index for genomic GC frequencies.
Traceback (most recent call last):
  File "/home/siebrenf/anaconda3/envs/ananse/bin/ananse", line 286, in <module>
    args.func(args)
  File "/home/siebrenf/anaconda3/envs/ananse/lib/python3.7/site-packages/ananse/commands/binding.py", line 23, in binding
    a.run_binding(args.fin_rpkm, args.outfile)
  File "/home/siebrenf/anaconda3/envs/ananse/lib/python3.7/site-packages/ananse/binding.py", line 232, in run_binding
    pfm_weight = self.get_PWMScore(filter_bed)
  File "/home/siebrenf/anaconda3/envs/ananse/lib/python3.7/site-packages/ananse/binding.py", line 186, in get_PWMScore
    for seq, scores in zip(chunk_seqs, it):
  File "/home/siebrenf/anaconda3/envs/ananse/lib/python3.7/site-packages/gimmemotifs/scanner.py", line 872, in best_score
    for matches in self.scan(seqs, 1, scan_rc, zscore=zscore, gc=gc):
  File "/home/siebrenf/anaconda3/envs/ananse/lib/python3.7/site-packages/gimmemotifs/scanner.py", line 946, in scan
    self.set_meanstd(gc=gc)
  File "/home/siebrenf/anaconda3/envs/ananse/lib/python3.7/site-packages/gimmemotifs/scanner.py", line 623, in set_meanstd
    self.set_background(gc=gc)
  File "/home/siebrenf/anaconda3/envs/ananse/lib/python3.7/site-packages/gimmemotifs/scanner.py", line 747, in set_background
    tmp.name, genome, number=nseq, length=size, bins=gc_bins
  File "/home/siebrenf/anaconda3/envs/ananse/lib/python3.7/site-packages/gimmemotifs/background.py", line 387, in gc_bin_bedfile
    create_gc_bin_index(genome, fname, min_bin_size=min_bin_size)
  File "/home/siebrenf/anaconda3/envs/ananse/lib/python3.7/site-packages/gimmemotifs/background.py", line 352, in create_gc_bin_index
    df.reset_index()[cols].to_feather(fname)
  File "/home/siebrenf/anaconda3/envs/ananse/lib/python3.7/site-packages/pandas/core/frame.py", line 2152, in to_feather
    to_feather(self, fname)
  File "/home/siebrenf/anaconda3/envs/ananse/lib/python3.7/site-packages/pandas/io/feather_format.py", line 66, in to_feather
    feather.write_feather(df, path)
  File "/home/siebrenf/anaconda3/envs/ananse/lib/python3.7/site-packages/pyarrow/feather.py", line 183, in write_feather
    writer.write(df)
  File "/home/siebrenf/anaconda3/envs/ananse/lib/python3.7/site-packages/pyarrow/feather.py", line 94, in write
    table = Table.from_pandas(df, preserve_index=False)
  File "pyarrow/table.pxi", line 1177, in pyarrow.lib.Table.from_pandas
  File "/home/siebrenf/anaconda3/envs/ananse/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 580, in dataframe_to_arrays
    convert_fields))
  File "/home/siebrenf/anaconda3/envs/ananse/lib/python3.7/concurrent/futures/_base.py", line 598, in result_iterator
    yield fs.pop().result()
  File "/home/siebrenf/anaconda3/envs/ananse/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/home/siebrenf/anaconda3/envs/ananse/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/home/siebrenf/anaconda3/envs/ananse/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/siebrenf/anaconda3/envs/ananse/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 560, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array
TypeError: an integer is required (got type str)

ananse binding error

When running ananse binding, the following error occurs:

2020-06-17 16:30:06.413 | INFO     | ananse.binding:run_binding:250 - Peak initialization
Traceback (most recent call last):
  File "/mnt/fls01-home01/r77887ia/.conda/envs/ananse/bin/ananse", line 326, in <module>
    args.func(args)

  File "/mnt/fls01-home01/r77887ia/.conda/envs/ananse/lib/python3.7/site-packages/ananse/commands/binding.py", line 28, in binding
    a.run_binding(args.fin_rpkm, args.outfile)

  File "/mnt/fls01-home01/r77887ia/.conda/envs/ananse/lib/python3.7/site-packages/ananse/binding.py", line 252, in run_binding
    filter_bed = self.set_peak_size(peak_bed)

  File "/mnt/fls01-home01/r77887ia/.conda/envs/ananse/lib/python3.7/site-packages/ananse/binding.py", line 132, in set_peak_size
    if start > 0 and end < gsizedic[peak.chrom]:
TypeError: '<' not supported between instances of 'int' and 'str'

The enhancer BED file scores have been generated using deepTools computeMatrix.

The genome has been installed with genomepy and I have updated to the latest version of ANANSE.

What could be causing this? Thank you.
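The traceback shows that the right-hand side of `end < gsizedic[peak.chrom]` was a string, so the chromosome sizes were apparently not converted to integers somewhere. Why that happened here is not known. The sketch below only illustrates the general pattern of parsing a chromosome-sizes table with integer lengths; `read_chrom_sizes` and `peak_fits` are hypothetical names, not ANANSE functions.

```python
def read_chrom_sizes(lines):
    """Parse "<chrom> <length>" lines into a dict with integer lengths."""
    sizes = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Convert explicitly: values read from text are always str.
        sizes[fields[0]] = int(fields[1])
    return sizes


def peak_fits(chrom, start, end, sizes):
    """Same bounds check as in the traceback, with integer sizes."""
    return start > 0 and end < sizes[chrom]


# Hypothetical example values.
sizes = read_chrom_sizes(["chr1\t1000\n", "\n"])
inside = peak_fits("chr1", 10, 200, sizes)
outside = peak_fits("chr1", 900, 1100, sizes)
```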

Gene names get capitalized

When working with e.g. frog genes (all lower case), they get capitalized here:

https://github.com/vanheeringen-lab/ANANSE/blob/master/ananse/influence.py#L99-L100

However, this assumes that the motif database also used lower-case letters (which it didn't in my case).

import networkx as nx
import pandas as pd


def read_network(fname, edges=100000):
    """Read network file and return networkx DiGraph."""

    G = nx.DiGraph()

    rnet = pd.read_csv(fname, sep="\t")
    # Sort edges by probability and keep at most the top `edges` rows.
    nrnet = rnet.sort_values("prob", ascending=False)
    if len(nrnet) < edges:
        usenet = nrnet
    else:
        usenet = nrnet[:edges]
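One way to avoid depending on the case convention of either source is to match names case-insensitively and keep the motif database's own spelling. This is a minimal sketch under assumptions: `match_gene_names` is a hypothetical helper, not the fix applied in ANANSE, and it assumes that no two factor names differ only by case.

```python
def match_gene_names(network_genes, motif_factors):
    """Map each network gene to the motif-database spelling, ignoring case.

    Genes without a case-insensitive match are left out of the result.
    """
    # Index the motif database names by their upper-cased form.
    by_upper = {name.upper(): name for name in motif_factors}
    return {
        gene: by_upper[gene.upper()]
        for gene in network_genes
        if gene.upper() in by_upper
    }


# Hypothetical example names.
mapping = match_gene_names(["sox2", "pax6", "foo"], ["Sox2", "PAX6"])
```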
