Comments (7)

macmanes avatar macmanes commented on September 3, 2024

Note that searching the original file for illicit characters does not return anything.

grep -v '^>' GCF_011100685.1_UU_Cfam_GSD_1.0_genomic.fna.masked | grep -o '[^ATCGactgNn]'

from cactus.

glennhickey avatar glennhickey commented on September 3, 2024

That's a new one. Given that the error mentions the greater-than sign (">"), perhaps an empty sequence name is triggering it? Cactus does its own check for non-agctn characters well upstream of this, so it's likely something less simple.

You should be able to confirm by creating a work directory (I'm using ./work below) and rerunning the failing command with these flags added:

--restart --caching false --cleanWorkDir never --workDir ./work

then use find ./work to locate the fasta files being input into lastz and inspect them yourself.

macmanes avatar macmanes commented on September 3, 2024

yup so there are some issues with some of the fasta files - but not empty headers.

1st lines of one of the offending fasta files (which I wrapped to make parsing easier) - looks good here.

Line 1: >id=GCF_009873245.2_mBalMus1.pri.v3|NC_045787.1|171266408|90000000
Line 2: ctttaatccattttgagtttatttttgtgtgtggtgttaggaagtgttctaatttcattcttttacatgtagctgtccagttttcccagcaccacttatt
Line 3: gaagaggctgtcttttctccactgtatattcttgcctcctttgtcaaagataaggtgaccatatctgcgtgggtttatctctgggctttctatcctgttc

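Somewhere in the middle of the file, the newline before a 2nd fasta entry is missing, so its header is glued onto the end of the preceding sequence line and then wrapped onto the next line.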

Line 783808: attgacacgtggcactgaacccagagtggcaagtcttccccgtttcccagagaacccacaattccccgtcctatgtgaaatcccccaagttttaaatacc
Line 783810: GACATTATAGATACATTTGATAATTAAAAGGAATAGTACGTATTCCAGCTAGGAGGAGGAGCCCTCCTTTTCGACTGGTTTTAGTCGATTAAGAAGGTTG
Line 783811: TGGGGTTTTGTATGTATGTTAAGATGATACCAGTTTTTGTCTTCATCACGGCTCTGAGCTGTTCAGATAGCTTATTCATCTAAGGTGAG>id=GCF_009
Line 783812: 873245.2_mBalMus1.pri.v3|NC_045788.1|144968589|0acaaggagtagcccccactagccacaactagaggaagtccacatgcagcaat
Line 783813: gaagacacaacgcagccaaaaataaataatgaataaataaataagttaattaattaattaattaaaaaaataagagtagagtggaaattcaggaagttga

I certainly hope this is not a fatal flaw requiring a total restart, but I suspect it is. Wondering about the cause here. Any ideas?
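A quick way to locate this kind of damage is to scan for a ">" that appears anywhere other than the start of a line. The following is a minimal sketch, not part of Cactus; the function name `find_embedded_headers` is hypothetical. It only inspects sequence lines, on the assumption that a ">" inside a header description may be legitimate.

```python
import sys


def find_embedded_headers(lines):
    """Yield (line_number, column) for each sequence line containing '>'.

    Lines that start with '>' are treated as normal FASTA headers and skipped.
    Line numbers are 1-based; columns are 0-based.
    """
    for line_number, line in enumerate(lines, start=1):
        if line.startswith(">"):
            continue  # ordinary header line
        column = line.find(">")
        if column != -1:
            yield line_number, column


if __name__ == "__main__":
    # Usage: python find_embedded_headers.py some_file.fa
    with open(sys.argv[1]) as handle:
        for line_number, column in find_embedded_headers(handle):
            print(f"line {line_number}, column {column}: '>' inside a sequence line")
```

Run against a file like the one above, this would flag the line where the header fragment follows the sequence, which narrows down where to look in a multi-gigabyte fasta.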

macmanes avatar macmanes commented on September 3, 2024

@glennhickey any chance it's the pipes in the fasta headers that are causing this issue?

glennhickey avatar glennhickey commented on September 3, 2024

Could be. When I try to change some names in the test data to look like yours, I get an error right away

RuntimeError: An invalid character was found in the first word of a fasta header. Acceptable characters for headers in an assembly hub include alphanumeric characters plus '_', '-', ':', and '.'. Please modify your headers to eliminate other characters. The offending header: 'id=simMouse_chr6|873245.2_mBalMus1.pri.v1|NC_045788.1|144968589|0' in 'simMouse_chr6'
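The rule stated in that message can be checked ahead of time on your own headers. This is a rough sketch of the rule as the error text describes it, not the code Cactus actually runs; the function name `invalid_header_characters` is hypothetical, and it assumes "alphanumeric" means ASCII letters and digits.

```python
import string

# Characters the error message lists as acceptable in the first word of a header.
ALLOWED = set(string.ascii_letters + string.digits + "_-:.")


def invalid_header_characters(header):
    """Return the sorted disallowed characters in the first word of a FASTA header."""
    words = header.lstrip(">").split()
    first_word = words[0] if words else ""
    return sorted(set(first_word) - ALLOWED)


if __name__ == "__main__":
    for header in (
        ">NC_045788.1 some description",
        ">GCF_009873245.2_mBalMus1.pri.v3|NC_045788.1|144968589|0",
    ):
        print(header, invalid_header_characters(header))
```

Under this rule, pipe-delimited names like the ones in this thread would be reported as containing "|".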

macmanes avatar macmanes commented on September 3, 2024

it's funny that I don't get that error until much later - does "sanitize_fasta" deal with |'s? They are sadly common in NCBI downloaded genomes.

Anyway, I went for the clean/full restart to see if the error is reproducible or if it could have been related to some (possibly transient) read/write issue.

glennhickey avatar glennhickey commented on September 3, 2024

That check is on by default because the UCSC Genome Browser doesn't (or didn't) support these characters in assembly hubs. If I disable the check by setting checkAssemblyHub="0" in the config, then my test runs through fine.

halStats em.hal --sequenceStats simMouse_chr6
SequenceName, Length, NumTopSegments, NumBottomSegments
873245.2_mBalMus1.pri.v3|NC_045788.1|144968589|0, 636262, 46692, 0
873245.2_mBalMus1.pri.v1|NC_045788.1|144968589|0, 850, 104, 0
873245.2_mBalMus1.pri.v4|NC_045788.1|144968589|0, 1250, 129, 0

But since the check is on by default, I don't know why it didn't complain for you. It's not in cactus_sanitize_fasta_headers but slightly upstream when running cactus.

In any case, I do not know what caused your error, and suggest double checking your input file. But if you are sure it's cactus causing the problem, please send me the input so I can try to reproduce.
