Comments (23)
The additional columns in the `run_results` output table should now be available under the branch `issue_421`.
When running the pyANI analysis on 2 genomes, it will perform 2 comparisons (forward and reverse), resulting in the following changes:
- The number of rows in the `run_results` output table will double (one for the forward comparison and one for the reverse comparison).
- Two columns for percentage identity will be present, one for `% query identity` and one for `% subject identity`.
- Separate columns for percentage coverage will be included, one for `% query coverage` and one for `% subject coverage`.
- Two columns for alignment lengths will be provided, one for `query alignment length` and one for `subject alignment length`.
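The doubling of the row count follows from running ordered rather than unordered pairs; a minimal sketch (not pyani code) of the counting:

```python
# Minimal sketch (not pyani code): running both directions means ordered
# pairs, so N genomes yield N*(N-1) comparisons instead of N*(N-1)/2.
from itertools import permutations

genomes = ["genome_A", "genome_B", "genome_C"]
comparisons = list(permutations(genomes, 2))
print(len(comparisons))  # 6 ordered comparisons for 3 genomes
```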
Rerunning the analysis under the `issue_421` branch, the following output was generated:
| | Comparison ID | Query ID | Query description | Subject ID | Subject description | % query identity | % subject identity | % query coverage | % subject coverage | query alignment length | subject alignment length | similarity errors | program | version | fragment size | maxmatch | Run ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | genome_B | 1 | genome_A | 0.9795918367346939 | 0.98 | 1.0 | 1.0 | 98 | 100 | 2 | nucmer | Darwin_3.1 (/opt/anaconda3/envs/pyani_issue_421/bin/nucmer) | | False | 1 |
| 1 | 2 | 1 | genome_A | 2 | genome_B | 0.98 | 0.9795918367346939 | 1.0 | 1.0 | 100 | 98 | 2 | nucmer | Darwin_3.1 (/opt/anaconda3/envs/pyani_issue_421/bin/nucmer) | | False | 1 |
The code, input, and output data are provided in aln_length_issue.zip.
As we are now running 2 comparisons (forward and reverse) instead of just one, we also had to update the matrices. However, since this was initially explained in more detail in issue #151, I discussed it further there.
from pyani.
Considering our discussion yesterday @kiepczi - I've updated the reference matrix for ANIm identity consistency tests in 97e5570. This now reflects (i) the new method for calculating identity, and (ii) running A vs B and B vs A (i.e. nonsymmetry).
A couple of things worth noting:
- The difference between the new and old methods was typically at the fourth decimal place, so we shouldn't expect much to change in terms of classifications
- The results don't agree exactly with `dnadiff` from MUMmer, even though we have adopted their approach almost exactly. We might like to assume that the remaining difference is because `dnadiff` sums precalculated percentages for each alignment fragment in an `.mcoords` file, rounded to two decimal places (and so is subject to accumulation of rounding error), whereas we maintain a single `float` while processing the `.delta` file directly (a more exact calculation). Even so, it would be a good idea to confirm that by running a similar calculation where we do the rounding, and checking against the `dnadiff` output.
(pyani_py311) lpritc@Rodan pyani % pytest -v tests/test_anim.py::test_deltadir_parsing [8:47:35]
=============================================================== test session starts ===============================================================
platform darwin -- Python 3.11.7, pytest-7.4.3, pluggy-1.3.0 -- /Users/lpritc/opt/anaconda3/envs/pyani_py311/bin/python3.11
cachedir: .pytest_cache
rootdir: /Users/lpritc/Documents/Development/GitHub/pyani
configfile: pytest.ini
plugins: cov-4.1.0, anyio-4.2.0
collected 1 item
tests/test_anim.py::test_deltadir_parsing PASSED [100%]
================================================================ 1 passed in 1.30s ================================================================
2. The `tests/test_anim.py::test_genome_sorting` test is now failing because, in modifying the codebase to run both comparisons (forward and reverse), I did not take sorting into consideration. I think restoring sorting would be beneficial, as it would help us obtain consistent and reproducible results for future testing and easier debugging.
It would be worth looking back through the commit logs to be sure, but my memory is that sorting was introduced because…
- @peterjc brought in a separation of `nucmer_output` into subdirectories, for easier file management (a good thing).
- The original, unsorted, one-way `anim` comparison carried out one pairwise comparison which may have gone into `A/A_vs_B` or `B/B_vs_A` - somewhat unpredictably.
- We wanted a deterministic, predictable placing of those files, and sorting inputs was the way we decided to do it.
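That rationale can be sketched in a few lines (illustrative only, not the pyani implementation): sorting the input pair fixes which genome is treated as query, and hence which output subdirectory the comparison lands in.

```python
# Sketch (not pyani code): sorting the pair makes the output path deterministic.
def output_subdir(genome_1, genome_2):
    query, subject = sorted([genome_1, genome_2])
    return f"{query}/{query}_vs_{subject}"

print(output_subdir("B", "A"))  # A/A_vs_B
print(output_subdir("A", "B"))  # A/A_vs_B (same, regardless of input order)
```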
Now, with an `anim` that does both A vs B and B vs A, I don't think we need to sort any more. The last commit I made above skips a sorting test in `test_anim.py` - you might like to use that as a template for how to handle the `test_genome_sorting()` test here.
- The `tests/test_subcmd_09_fastani.py::TestfastANISubcommand::test_fastani` test fails because there is no longer an `aln_length` keyword. This was expected, as we modified the codebase to run pairwise comparisons in both directions (forward and reverse). Because of this, we have decided to change how the alignment lengths are calculated and reported. The `aln_length` keyword in the `Comparison` class was replaced with `query_aln_length` and `subject_aln_length`. I would need a little more investigation of the `fastani` approach to come up with the best solution for this.
One solution here is - essentially - polymorphism. We would declare new classes that describe output from distinct tools (e.g. `DnadiffResult`, `NucmerResult`, `FastANIResult`), and have the ORM function process them differently. That's probably the most elegant and explicit way to do it, rather than trying to overextend the function that's being called. We should talk about that in person and get something down in a document.
Looking through the changes, I see that the calculation looks sensible, and the changes to reporting etc. seem OK - good job! There are a few snags, some of which are being picked up when running `pytest` locally. The main points are, I think…
- We're now handling database entries for a comparison differently. In the previous code we stored `query_identity` and `subject_identity`:
run.comparisons.append(
    Comparison(
        query=job.query,
        subject=job.subject,
        query_aln_length=qaln_length,
        subject_aln_length=saln_length,
        sim_errs=sim_errs,
        query_identity=query_pid,
        subject_identity=subject_pid,
        cov_query=qcov,
        cov_subject=scov,
        program="nucmer",
In the new code we're storing a single value, `perc_id`:
run.comparisons.append(
    Comparison(
        query=job.query,
        subject=job.subject,
        query_aln_length=qaln_length,
        subject_aln_length=saln_length,
        sim_errs=sim_errs,
        perc_id=pc_id,
        cov_query=qcov,
        cov_subject=scov,
        program="nucmer",
but the database schema and `pyani_orm.py` code have not been updated. What are the plans for this?
- We should implement more tests under `tests/` to be sure that the new percentage identity code works as expected/intended.
There are other side-effects (e.g. presence/absence of the `sym` argument, other keyword incompatibilities) flagged by `pytest` that need attention. I'll try to address a couple of these just now, and we'll talk more at the Monday meeting.
We're getting quite a bit of discrepancy in the concordance test for ANIm:
> assert result_pid - tgt_pid == pytest.approx(0, abs=tolerance_anim)
E assert array([[0.0, ... dtype=object) == 0 ± 1.0e-01
E comparison failed
E Obtained: [[0.0 -0.27337324666439144 0.05531150545986918]\n [-0.29824688924479403 0.0 -1.0995470765852389]\n [0.05778732127045316 -1.073189551092085 0.0]]
E Expected: 0 ± 1.0e-01
tests/test_concordance.py:203: AssertionError
We'll need to look into this to see what's going on.
I've modified the expected output in `test_anim_delta()` so that the tests pass, assuming that the calculations are correct. We should go through the calculation for that test file manually and confirm the result - logging here once we've done so.
UPDATE: this induces a new error in `test_deltadir_parsing()` - I'm looking into it. The expected/returned matrices are quite different. We'll need to confirm which is correct.
NC_002696 NC_010338 NC_011916 NC_014100
NC_002696 1.0 0.760899 0.999974 0.693209
NC_010338 1.0 1.000000 0.760913 0.700076
NC_011916 1.0 1.000000 1.000000 0.693673
NC_014100 1.0 1.000000 1.000000 1.000000
NC_002696 NC_010338 NC_011916 NC_014100
NC_002696 1.000000 0.852072 0.999974 0.868745
NC_010338 0.852072 1.000000 0.852027 0.853775
NC_011916 0.999974 0.852027 1.000000 0.868795
NC_014100 0.868745 0.853775 0.868795 1.000000
The test results matrix is on top (expected values are underneath). The lower triangle there is all `1.0` as that's how the dataframe is initialised and - under the assumption of a symmetrical comparison - we didn't populate that half of the matrix. Correcting that should be straightforward.
The other values are more worrying. They agree in only one case, and we'll need to investigate why.
- We're now handling database entries for a comparison differently. In the previous code we stored `query_identity` and `subject_identity`:

run.comparisons.append(
    Comparison(
        query=job.query,
        subject=job.subject,
        query_aln_length=qaln_length,
        subject_aln_length=saln_length,
        sim_errs=sim_errs,
        query_identity=query_pid,
        subject_identity=subject_pid,
        cov_query=qcov,
        cov_subject=scov,
        program="nucmer",

In the new code we're storing a single value, `perc_id`:

run.comparisons.append(
    Comparison(
        query=job.query,
        subject=job.subject,
        query_aln_length=qaln_length,
        subject_aln_length=saln_length,
        sim_errs=sim_errs,
        perc_id=pc_id,
        cov_query=qcov,
        cov_subject=scov,
        program="nucmer",
These changes were made following our conversation last week about how we want to calculate the percentage identity. Our previous assumption was that the percentage identity should be calculated separately for the query and subject, as follows:
Percentage identity = 1 - (similarity errors / alignment length)
However, we later agreed that the best option would be to calculate the average across all alignments. This would be calculated as follows (assuming that there are two alignments between the genomes we are comparing):
Alignment 1 weighted identity = (query alignment 1 length + subject alignment 1 length) - (2 * alignment 1 similarity errors)
Alignment 2 weighted identity = (query alignment 2 length + subject alignment 2 length) - (2 * alignment 2 similarity errors)
Percentage identity = (alignment 1 weighted identity + alignment 2 weighted identity) / (query alignment 1 length + subject alignment 1 length + query alignment 2 length + subject alignment 2 length)
With this, I assumed that there would only be one calculation of percentage identity, hence just one column rather than two.
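The averaged calculation can be sketched in a few lines of Python (a hypothetical helper, not the pyani implementation); the figures used here are taken from the two-alignment viral-genome example worked through later in this thread.

```python
# Hypothetical helper (not pyani code) for the weighted-average identity:
# each alignment contributes (query_len + subject_len) - 2 * sim_errors
# "identical" bases, and the identity is the ratio of the two sums.
def weighted_identity(alignments):
    """alignments: iterable of (query_len, subject_len, sim_errors) tuples."""
    total_len = sum(qlen + slen for qlen, slen, _ in alignments)
    total_ident = sum((qlen + slen) - 2 * errs for qlen, slen, errs in alignments)
    return total_ident / total_len

# Values from the two-alignment viral-genome comparison later in the thread:
print(weighted_identity([(37629, 37636, 215), (21545, 21551, 7)]))
# ≈ 0.996249, i.e. the 99.62% figure in that worked example
```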
However, we later agreed that the best option would be to calculate the average across all alignments. This would be calculated as follows (assuming that there are two alignments between the genomes we are comparing):
[…]
With this, I assumed that there would only be one calculation for percentage identity, hence just one column, rather than two.
Yes - the point here is rather that the call to create a `Comparison()` object does not match what is defined in the `pyani_orm.py` file, so we need to consider how these changes to the code affect (i) how the program behaves and (ii) how we should - or should not - modify the database schema to accommodate that.
NC_002696 NC_010338 NC_011916 NC_014100
NC_002696 1.0 0.760899 0.999974 0.693209
NC_010338 1.0 1.000000 0.760913 0.700076
NC_011916 1.0 1.000000 1.000000 0.693673
NC_014100 1.0 1.000000 1.000000 1.000000

NC_002696 NC_010338 NC_011916 NC_014100
NC_002696 1.000000 0.852072 0.999974 0.868745
NC_010338 0.852072 1.000000 0.852027 0.853775
NC_011916 0.999974 0.852027 1.000000 0.868795
NC_014100 0.868745 0.853775 0.868795 1.000000
The test results matrix is on top (expected values are underneath). The lower triangle there is all `1.0` as that's how the dataframe is initialised and - under the assumption of a symmetrical comparison - we didn't populate that half of the matrix. Correcting that should be straightforward.

I'm looking into this just now, and running this on my local machine I get the following `matrix_identity`:
NC_002696:1 NC_010338:2 NC_011916:3 NC_014100:4
NC_002696:1 1.0 0.7608992767 0.9999742381000001 0.6932093551
NC_010338:2 0.760899887 1.0 0.7609126078 0.7000758335
NC_011916:3 0.9999742381000001 0.7609120054 1.0 0.6936734489
NC_014100:4 0.6934934418 0.6997481289 0.6939592582 1.0
Am I right to assume that you have run these comparisons under branch `issue_421`?
NC_002696 NC_010338 NC_011916 NC_014100
NC_002696 1.0 0.760899 0.999974 0.693209
NC_010338 1.0 1.000000 0.760913 0.700076
NC_011916 1.0 1.000000 1.000000 0.693673
NC_014100 1.0 1.000000 1.000000 1.000000

NC_002696 NC_010338 NC_011916 NC_014100
NC_002696 1.000000 0.852072 0.999974 0.868745
NC_010338 0.852072 1.000000 0.852027 0.853775
NC_011916 0.999974 0.852027 1.000000 0.868795
NC_014100 0.868745 0.853775 0.868795 1.000000
I've run `dnadiff` locally on the NC_011916.fna vs NC_014100.fna comparison `.filter` file. This returns:
[REF] [QRY]
[Sequences]
TotalSeqs 1 1
AlignedSeqs 1(100.00%) 1(100.00%)
UnalignedSeqs 0(0.00%) 0(0.00%)
[Bases]
TotalBases 4042929 4655622
AlignedBases 2259662(55.89%) 2261394(48.57%)
UnalignedBases 1783267(44.11%) 2394228(51.43%)
[Alignments]
1-to-1 1189 1189
TotalLength 2263684 2264006
AvgLength 1903.86 1904.13
AvgIdentity 86.88 86.88
M-to-M 1189 1189
TotalLength 2263684 2264006
AvgLength 1903.86 1904.13
AvgIdentity 86.88 86.88
NC_002696 NC_010338 NC_011916 NC_014100
NC_002696 1.0 0.760899 0.999974 0.693209
NC_010338 1.0 1.000000 0.760913 0.700076
NC_011916 1.0 1.000000 1.000000 0.693673
NC_014100 1.0 1.000000 1.000000 1.000000

NC_002696 NC_010338 NC_011916 NC_014100
NC_002696 1.000000 0.852072 0.999974 0.868745
NC_010338 0.852072 1.000000 0.852027 0.853775
NC_011916 0.999974 0.852027 1.000000 0.868795
NC_014100 0.868745 0.853775 0.868795 1.000000

The test results matrix is on top (expected values are underneath). The lower triangle there is all `1.0` as that's how the dataframe is initialised and - under the assumption of a symmetrical comparison - we didn't populate that half of the matrix. Correcting that should be straightforward.

I'm looking into this just now, and running this on my local machine I get the following `matrix_identity`:

NC_002696:1 NC_010338:2 NC_011916:3 NC_014100:4
NC_002696:1 1.0 0.7608992767 0.9999742381000001 0.6932093551
NC_010338:2 0.760899887 1.0 0.7609126078 0.7000758335
NC_011916:3 0.9999742381000001 0.7609120054 1.0 0.6936734489
NC_014100:4 0.6934934418 0.6997481289 0.6939592582 1.0

Am I right to assume that you have run these comparisons under branch `issue_421`?
Yes - which branch should I be looking at?
Yes - which branch should I be looking at?
Branch `Issue_421`
So we're running under the same branch but getting different results?
In both cases the calculated identities do not correspond to the expected identity, or to applying `dnadiff` to the same `.delta`-format files. This needs to be addressed.
I have updated more tests and fixtures in `test_anim.py` to reflect the changes made in the codebase (see da8852c). This included:

i) Adding more commands to the `mummer_cmds_four` fixture. We now run pairwise comparisons in both directions (forward and reverse). This adjustment allowed us to pass the `test_mummer_multiple` and `test_mummer_job_generation` tests.
(pyani_issue_421) angelikakiepas@Angelikas-MacBook-Pro pyani % pytest -v tests/test_anim.py::test_mummer_multiple
=================================================================== test session starts ===================================================================
platform darwin -- Python 3.8.18, pytest-8.1.1, pluggy-1.4.0 -- /opt/anaconda3/envs/pyani_issue_421/bin/python3.8
cachedir: .pytest_cache
rootdir: /Users/angelikakiepas/Desktop/pyani
configfile: pytest.ini
plugins: anyio-4.3.0, cov-4.1.0, ordering-0.6
collected 1 item
tests/test_anim.py::test_mummer_multiple PASSED [100%]
==================================================================== 1 passed in 0.12s ====================================================================
(pyani_issue_421) angelikakiepas@Angelikas-MacBook-Pro pyani %
(pyani_issue_421) angelikakiepas@Angelikas-MacBook-Pro pyani % pytest -v tests/test_anim.py::test_mummer_job_generation
=================================================================== test session starts ===================================================================
platform darwin -- Python 3.8.18, pytest-8.1.1, pluggy-1.4.0 -- /opt/anaconda3/envs/pyani_issue_421/bin/python3.8
cachedir: .pytest_cache
rootdir: /Users/angelikakiepas/Desktop/pyani
configfile: pytest.ini
plugins: anyio-4.3.0, cov-4.1.0, ordering-0.6
collected 1 item
tests/test_anim.py::test_mummer_job_generation PASSED [100%]
==================================================================== 1 passed in 0.13s ====================================================================
(pyani_issue_421) angelikakiepas@Angelikas-MacBook-Pro pyani %
ii) Modification of the `deltafile_parsed` fixture. We now expect different values to be returned by the `parse_delta` function. This change is due to adjustments made in the codebase: we no longer double-count overlapping regions, we calculate a weighted ANIm %ID, and we return additional items such as query alignment length, subject alignment length, weighted %ID, and similarity errors. This allowed the `test_deltafile_parsing` test to pass.
(pyani_issue_421) angelikakiepas@Angelikas-MacBook-Pro pyani % pytest -v tests/test_anim.py::test_deltafile_parsing
=================================================================== test session starts ===================================================================
platform darwin -- Python 3.8.18, pytest-8.1.1, pluggy-1.4.0 -- /opt/anaconda3/envs/pyani_issue_421/bin/python3.8
cachedir: .pytest_cache
rootdir: /Users/angelikakiepas/Desktop/pyani
configfile: pytest.ini
plugins: anyio-4.3.0, cov-4.1.0, ordering-0.6
collected 1 item
tests/test_anim.py::test_deltafile_parsing PASSED [100%]
==================================================================== 1 passed in 0.13s ====================================================================
(pyani_issue_421) angelikakiepas@Angelikas-MacBook-Pro pyani %
Although the majority of tests pass, three are currently failing:
================================================================= short test summary info =================================================================
FAILED tests/test_anim.py::test_genome_sorting - AssertionError: assert ('nucmer --mu...first.filter') == ('nucmer --mu...econd.filter')
FAILED tests/test_parsing.py::test_anim_delta - AssertionError: assert (4016947, 401...7031752, 2191) == (4016947, 401...4447228, 2191)
FAILED tests/test_subcmd_09_fastani.py::TestfastANISubcommand::test_fastani - TypeError: 'aln_length' is an invalid keyword argument for Comparison
======================================= 3 failed, 115 passed, 1 xfailed, 1 xpassed, 3 warnings in 389.17s (0:06:29) =======================================
Here are my current thoughts on them:
1. The `tests/test_subcmd_09_fastani.py::TestfastANISubcommand::test_fastani` test fails because there is no longer an `aln_length` keyword. This was expected, as we modified the codebase to run pairwise comparisons in both directions (forward and reverse). Because of this, we have decided to change how the alignment lengths are calculated and reported. The `aln_length` keyword in the `Comparison` class was replaced with `query_aln_length` and `subject_aln_length`. I would need a little more investigation of the `fastani` approach to come up with the best solution for this.

2. The `tests/test_anim.py::test_genome_sorting` test is now failing because, in modifying the codebase to run both comparisons (forward and reverse), I did not take sorting into consideration. I think restoring sorting would be beneficial, as it would help us obtain consistent and reproducible results for future testing and easier debugging.

3. The `tests/test_parsing.py::test_anim_delta` test fails because of the rounding error, which, as @widdowquinn suggested, should be investigated a little bit more.

@widdowquinn, if you have any ideas or comments regarding this, let me know.
3. The `tests/test_parsing.py::test_anim_delta` test fails because of the rounding error, which, as @widdowquinn suggested, should be investigated a little bit more.

Thing 1 is to satisfy ourselves that that's the origin of the discrepancy ;) If it is, then the job of implementing that part of `pyani dnadiff` is slightly easier than we at first thought…
- The `tests/test_parsing.py::test_anim_delta` test fails because of the rounding error, which, as @widdowquinn suggested, should be investigated a little bit more.

Thing 1 is to satisfy ourselves that that's the origin of the discrepancy ;) If it is, then the job of implementing that part of `pyani dnadiff` is slightly easier than we at first thought…
I looked into how `dnadiff.pl` calculates the Average %ID for 1-to-1 alignments.

To begin, it initialises the combined alignment length sum (`rqSumLen1`) and the weighted alignment identity sum (`rqSumIdy1`) to 0:
my ($rqSumLen1, $rqSumLenM) = (0,0); # combined alignment length sum
my ($rqSumIdy1, $rqSumIdyM) = (0,0); # weighted alignment identity sum
The weighted alignment identity sum is then accumulated by dividing the %ID of each alignment by 100 and multiplying it by the sum of the lengths of both the reference and query alignments (note that these values are extracted from the `1coords` file):
$rqSumIdy1 += ($A[6] / 100.0) * ($A[4] + $A[5]);
The combined alignment length sum is accumulated as the sum of the lengths of both the reference and query alignments:
$rqSumLen1 += ($A[4] + $A[5]);
The Average Identity is then calculated by dividing the weighted alignment identity sum by the combined alignment length sum, and multiplying that value by 100:
($rqSumLen1 ? $rqSumIdy1 / $rqSumLen1 * 100.0 : 0);
I manually calculated this for a small comparison of two viral genomes where, according to `dnadiff` (1-to-1 alignments), the average %ID is 99.63:
[Alignments]
1-to-1 2 2
TotalLength 59174 59187
AvgLength 29587.00 29593.50
AvgIdentity 99.63 99.63
The `1coords` file:
85 37713 1 37636 37629 37636 99.43 39253 39594 95.86 95.05 MGV_MGV-GENOME-0264574 MGV_MGV-GENOME-0266457
17709 39253 17626 39176 21545 21551 99.97 39253 39594 54.89 54.43 MGV_MGV-GENOME-0264574 MGV_MGV-GENOME-0266457
Then, the calculations I get are:
alignment weighted 1 = (99.43 / 100) * (37629 + 37636) = 74835.9895
alignment weighted 2 = (99.97 / 100) * (21545 + 21551) = 43083.0712
Average %ID = (74835.9895 + 43083.0712) / (37629 + 37636 + 21545 + 21551) * 100 = 99.6266174669021 ≈ 99.63
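The same steps can be reproduced in a short Python sketch, using the two `1coords` rows quoted above (column indices as in the Perl snippet: `$A[4]`, `$A[5]` are the alignment lengths and `$A[6]` the per-alignment %ID):

```python
# Sketch of the dnadiff-style average: per-alignment %IDs (already rounded
# to 2 d.p. in the 1coords file) are weighted by combined alignment length.
coords = [(37629, 37636, 99.43), (21545, 21551, 99.97)]  # (ref_len, qry_len, pct_id)

sum_idy = sum(pid / 100.0 * (rlen + qlen) for rlen, qlen, pid in coords)
sum_len = sum(rlen + qlen for rlen, qlen, _ in coords)
print(round(sum_idy / sum_len * 100.0, 2))  # 99.63, matching dnadiff's AvgIdentity
```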
I attempted to replicate this value using `delta` files instead but, as we know, these were slightly different, usually at the second decimal place. We discussed earlier that this could be due to rounding error coming from the %ID of each individual alignment given in the `1coords` file.

So, let's calculate these values without rounding the individual alignment %IDs, using the `delta` file. The file is as follows:
/Users/angelikakiepas/Desktop/pyani/issue_421/rounding_error/scripts/../data/donovan_test/input/MGV-GENOME-0264574.fna /Users/angelikakiepas/Desktop/pyani/issue_421/rounding_error/scripts/../data/donovan_test/input/MGV-GENOME-0266457.fna
NUCMER
>MGV_MGV-GENOME-0264574 MGV_MGV-GENOME-0266457 39253 39594
85 37713 1 37636 215 215 0
-3013
-24624
-1
-1
-1
-1
-1
0
17709 39253 17626 39176 7 7 0
-9994
-1
-1
-1
-1
-1
0
For the first alignment calculations:
aligned reference bases = 37713 - 85 + 1 = 37629
aligned query bases = 37636 - 1 + 1 = 37636
similarity errors: 215
percentage id = ((37629 + 37636) - (2 * 215)) / (37629 + 37636) *100 = 99.42868531189795
alignment weighted 1 = (99.42868531189795 / 100) * (37629 + 37636) = 74834.99999999999
For the second alignment calculations:
aligned reference bases = 39253 - 17709 + 1 = 21545
aligned query bases = 39176 - 17626 + 1 = 21551
similarity errors: 7
percentage id = ((21545 + 21551) - (2 * 7)) / (21545 + 21551) * 100 = 99.96751438648599
alignment weighted 2 = (99.96751438648599 / 100) * (21545 + 21551) = 43082.0
Average %ID = (74834.99999999999 + 43082.0) / (37629 + 37636 + 21545 + 21551) * 100 = 99.62487643733999
@widdowquinn suggested that there could be a quicker way to calculate this by skipping the intermediate calculation of identity as follows:
alignment 1 identity weighted = (37629 + 37636) - (2 * 215) = 74835
alignment 2 identity weighted = (21545 + 21551) - (2 * 7) = 43082
Average %ID = (74835 + 43082) / (37629 + 37636 + 21545 + 21551) *100 = 99.62487643734
This example demonstrates that the differences in the reported numbers could arise from a rounding error, and it appears that our calculations would be more accurate.
I have written a Python script that calculates these values using these approaches (available here). I ran this on a few examples and found that in some cases, even when we calculate the Average %ID with the individual per-alignment %IDs rounded, the values differ from the ones reported by `dnadiff`.
When the values are calculated in Python, I obtain the following results:
- 85.24 = Follows the approach implemented by `dnadiff` (percentage identities for each alignment rounded to the 2nd decimal place).
- 85.24244273445755 = Follows the approach implemented by `dnadiff`, but with no rounding of the per-alignment %IDs.
- 85.24244273445755 = Skipping the intermediate calculations.
The average %ID reported by `dnadiff` is 85.35%, so the difference is quite significant:
[Alignments]
1-to-1 1455 1455
TotalLength 1679781 1679569
AvgLength 1154.49 1154.34
AvgIdentity 85.35 85.35
I checked whether the percentage identities calculated for each alignment match the ones reported in the `1coords` file. To my surprise, I found that the majority of them do not match.
One example where the percentage identity for the alignment does not match is:
1486 3045 124165 125722 1560 1558 88.49 16110 7787608 9.68 0.02 NZ_NOWW01000027.1 NZ_CP026730.1
Using the above information, I found the alignment in the `delta` file:
>NZ_NOWW01000027.1 NZ_CP026730.1 16110 7787608
1486 3045 124165 125722 180 180 0
8
-53
493
-4
743
-18
3
123
-6
33
0
When calculating the percentage identity manually, I found that the values do not match. In this example, the difference is almost 0.035%. As this is happening for most of the alignments in this comparison, the differences add up to the point that the overall Average %ID differs significantly.
Reference alignment length = 3045 - 1486 + 1 = 1560
Query alignment length = 125722 - 124165 + 1 = 1558
Percentage ID = ((1560 + 1558) - (2 * 180)) / (1560 + 1558) * 100 = 88.45413726747915
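The manual calculation above, as a runnable sketch using the header line of the alignment quoted from the `.delta` file:

```python
# Per-alignment identity computed directly from the .delta alignment header
# "1486 3045 124165 125722 180 180 0" quoted above.
ref_start, ref_end = 1486, 3045
qry_start, qry_end = 124165, 125722
sim_errors = 180

ref_len = ref_end - ref_start + 1   # 1560
qry_len = qry_end - qry_start + 1   # 1558
pct_id = ((ref_len + qry_len) - 2 * sim_errors) / (ref_len + qry_len) * 100
print(pct_id)  # ~88.454, vs the 88.49 printed in the 1coords file
```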
I should probably mention that all code and data used for the investigation can be found here.
After discussion between @kiepczi and me, and inspecting the code of `show-coords`, it appears that - in some cases at least - the graph model internal to `show-coords` leads to a different measure of sequence identity than would be obtained from our (naive) reading of the `.delta` files.

We cannot then guarantee that our `pyani`-calculated percentage identities for any particular region exactly match those from `dnadiff`, or that they will match within simple rounding error (although in many cases they actually do).
At this point, my view is that the `pyani` approach of considering all alignments in the `.delta` file, and the `dnadiff` approach of (apparently) constructing a graph model from those alignments, are not straightforward to reconcile.
The least time-consuming, and probably most useful, way forward is, I think, this:
- Provide the (corrected) `pyani` calculation method as the standard method for `pyani anim`.
- Provide a second `pyani dnadiff` method that generates the `show-coords` and `show-diff` outputs using `MUMmer`'s tools, allowing us to recreate the key parts of the `dnadiff` output.
- Using `pyani compare`, make a thorough comparison between the operation of both methods for the resulting publication.
Adding the `pyani dnadiff` command should be reserved for `pyani plus`, as per #424.
One solution here is - essentially - polymorphism. We would declare new classes that describe output from distinct tools (e.g. DnadiffResult, NucmerResult, FastANIResult), and have the ORM function process them differently. That's probably the most elegant and explicit way to do it, rather than try to overextend the function that's being called. We should talk about that in person and get something down in a document.
To expand on this… the error below
FAILED tests/test_subcmd_09_fastani.py::TestfastANISubcommand::test_fastani - TypeError: 'aln_length' is an invalid keyword argument for Comparison
arises because the keywords provided from a `fastANI` comparison are no longer the same as for an `anim` comparison. Commit 693d0c2 changed the requirement from `aln_length` to `query_aln_length` and `subject_aln_length`, though it is not clear that the underlying database schema changed (we need to check this, and update to accommodate it if needed!).

If we require only one of those two lengths for the ANI result, then we should prefer one over the other, and we ought to be able to provide it to the `Comparison` object through a consistent interface. For example…
class ResultfastANI:
    [...]
        self.aln_length = ...  # total alignment length, as only one length makes sense
    [...]
    @property
    def mylen(self):
        return self.aln_length

class ResultNucmer:
    [...]
        self.query_aln_length = ...  # self-explanatory
        self.ref_aln_length = ...  # self-explanatory
    [...]
    @property
    def mylen(self):
        return self.query_aln_length  # because we are prioritising the _query_ alignment length by choice
But, if we require two lengths, then we will have to change the schema/reporting/ORM to accommodate this.
Which do we think we want?
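To make the trade-off concrete, here is a self-contained sketch of the single-length option (class, attribute, and function names are illustrative, not pyani's actual API): each tool-specific result class exposes a common property, so the ORM-facing code stays tool-agnostic.

```python
# Illustrative sketch (hypothetical names, not pyani's API): tool-specific
# result classes share an aln_length property, so the consuming code never
# needs to know which tool produced the result.
class NucmerResult:
    def __init__(self, query_aln_length, subject_aln_length):
        self.query_aln_length = query_aln_length
        self.subject_aln_length = subject_aln_length

    @property
    def aln_length(self):
        # Prioritise the query alignment length, by choice
        return self.query_aln_length


class FastANIResult:
    def __init__(self, total_aln_length):
        self._total_aln_length = total_aln_length

    @property
    def aln_length(self):
        # Only one total length makes sense for this tool's output
        return self._total_aln_length


def store_comparison(result):
    # Stand-in for the ORM call: consumes any result via the shared interface
    return {"aln_length": result.aln_length}

print(store_comparison(NucmerResult(98, 100)))  # {'aln_length': 98}
print(store_comparison(FastANIResult(1500)))    # {'aln_length': 1500}
```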
If we only require one alignment length per comparison for our reporting purposes, then the first solution is sufficient.

Having both results from a tool like `nucmer`, which reports query and reference alignment lengths (perhaps differently for A vs B and B vs A comparisons), might be interesting to some, but changing the internals is a large task, and would be better postponed to `pyani-plus`, where it can be accommodated in a schema redesign.