hal2sg's Introduction

Welcome

I am a freelance software consultant, specializing in comparative genomics research. I work primarily on open source projects using C++ and Python, such as Cactus and vg. I have a PhD in Computer Science from McGill University.

My consulting company, Arenaria, has been in business since 2015. If you are interested in working together, feel free to contact me (my first name dot my last name @arenaria.ca).

hal2sg's People

Contributors

adamnovak, glennhickey

Forkers

adamnovak

hal2sg's Issues

Minor updates to SQL schemas

I added a name field to both VariantSet and Allele. Both can be left NULL if desired. I also took the now-unused VariantID out of AlleleCall:

CREATE TABLE VariantSet (ID INTEGER PRIMARY KEY,
    referenceSetID INTEGER NOT NULL,
    name TEXT, -- seems to be used in practice this way
    FOREIGN KEY(referenceSetID) REFERENCES ReferenceSet(ID));
CREATE TABLE Allele (ID INTEGER PRIMARY KEY, 
    variantSetID INTEGER,  -- TODO: Can this be null? 
    name TEXT,
    FOREIGN KEY(variantSetID) REFERENCES VariantSet(ID));
CREATE TABLE AlleleCall (alleleID INTEGER NOT NULL, 
    callSetID INTEGER NOT NULL,
    ploidy INTEGER NOT NULL,
    PRIMARY KEY(alleleID, callSetID),
    FOREIGN KEY(alleleID) REFERENCES allele(ID),
    FOREIGN KEY(callSetID) REFERENCES CallSet(ID)); -- TODO: metadata!

Adjust output SQL to match 😄
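
As a quick sanity check of the schema above, here is a minimal sketch that loads it into an in-memory SQLite database with Python's sqlite3 module and confirms that name may be left NULL. The ReferenceSet and CallSet definitions are not shown in this issue, so the one-column versions below are placeholders (assumptions), and the inserted rows are made up for illustration.

```python
import sqlite3

SCHEMA = """
-- Placeholder parent tables (assumed; the real definitions are not shown here).
CREATE TABLE ReferenceSet (ID INTEGER PRIMARY KEY);
CREATE TABLE CallSet (ID INTEGER PRIMARY KEY);

CREATE TABLE VariantSet (ID INTEGER PRIMARY KEY,
    referenceSetID INTEGER NOT NULL,
    name TEXT,
    FOREIGN KEY(referenceSetID) REFERENCES ReferenceSet(ID));
CREATE TABLE Allele (ID INTEGER PRIMARY KEY,
    variantSetID INTEGER,
    name TEXT,
    FOREIGN KEY(variantSetID) REFERENCES VariantSet(ID));
CREATE TABLE AlleleCall (alleleID INTEGER NOT NULL,
    callSetID INTEGER NOT NULL,
    ploidy INTEGER NOT NULL,
    PRIMARY KEY(alleleID, callSetID),
    FOREIGN KEY(alleleID) REFERENCES Allele(ID),
    FOREIGN KEY(callSetID) REFERENCES CallSet(ID));
"""

def load_schema():
    """Create an in-memory database holding the tables above."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
    conn.executescript(SCHEMA)
    return conn

conn = load_schema()
# Made-up rows: one named VariantSet, one unnamed VariantSet, one unnamed Allele.
conn.execute("INSERT INTO ReferenceSet VALUES (0)")
conn.execute("INSERT INTO VariantSet VALUES (0, 0, 'example')")
conn.execute("INSERT INTO VariantSet VALUES (1, 0, NULL)")
conn.execute("INSERT INTO Allele VALUES (0, 0, NULL)")
names = [row[0] for row in conn.execute("SELECT name FROM VariantSet ORDER BY ID")]
```

Running the tool's emitted SQL through the same loader would be one way to confirm the output matches the schema.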

Minor SQL formatting corrections needed

I'm putting all these together, as I suspect they individually require next to no work, and will probably be fixed as a group:

  • Add quotes around 2nd argument of INSERT INTO FASTA statement
  • Add semicolon to end of INSERT INTO FASTA statement
  • Add semicolon to end of INSERT INTO ReferenceSet statement
  • Add quotes around TRUE and FALSE in INSERT INTO GraphJoin statement
  • Add semicolon to end of INSERT INTO GraphJoin statement
  • Add quotes around last argument of INSERT INTO VariantSet statement
  • Add keyword VALUES after INSERT INTO Allele
  • Add semicolon to end of INSERT INTO AllelePathItem statement
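
Problems like the ones listed above can be caught mechanically before loading a whole file. The sketch below is only an illustration, not part of hal2sg: check_statements is a hypothetical helper, the cut-down Allele table and the sample statements are assumptions, and it relies on Python's sqlite3.complete_statement to spot a missing semicolon and on SQLite itself to reject other syntax errors.

```python
import sqlite3

def check_statements(conn, statements):
    """Return (statement, problem) pairs for statements SQLite would reject."""
    problems = []
    for stmt in statements:
        # complete_statement() is False when the terminating semicolon is missing.
        if not sqlite3.complete_statement(stmt):
            problems.append((stmt, "missing terminating semicolon"))
            continue
        try:
            conn.execute(stmt)
        except sqlite3.Error as exc:
            problems.append((stmt, str(exc)))
    return problems

conn = sqlite3.connect(":memory:")
# Cut-down table, assumed for this example only.
conn.execute("CREATE TABLE Allele (ID INTEGER PRIMARY KEY, variantSetID INTEGER, name TEXT)")

sample = [
    "INSERT INTO Allele VALUES (0, 0, 'Anc0refChr0');",  # well formed
    "INSERT INTO Allele VALUES (1, 0, 'Anc0refChr1')",   # no semicolon
    "INSERT INTO Allele (2, 0);",                        # VALUES keyword missing
]
problems = check_statements(conn, sample)
for stmt, problem in problems:
    print(f"{problem}: {stmt}")
```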

Segmentation Fault

Hello,

I am trying to use hal2sg to export one of my HAL files (which you claim it ought to work on). However, I'm getting a segmentation fault.

I'm running this command line:

/cluster/home/anovak/build/hal2sg/hal2sg altRegions/MHC/graph.hal altRegions/MHC/server/database.fa altRegions/MHC/server/database.sql

I'm running in this directory on Kolossus:

/cluster/home/anovak/hive/sgdev

Here is what gdb has to say:

[anovak@kolossus sgdev]$ gdb /cluster/home/anovak/build/hal2sg/hal2sg
GNU gdb (GDB) 7.8.50.20140829-cvs
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /cluster/home/anovak/build/hal2sg/hal2sg...done.
(gdb) run altRegions/MHC/graph.hal altRegions/MHC/server/database.fa altRegions/MHC/server/database.sql
Starting program: /cluster/home/anovak/build/hal2sg/hal2sg altRegions/MHC/graph.hal altRegions/MHC/server/database.fa altRegions/MHC/server/database.sql
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string (this=0x9aa0c8, 
    __str=<error reading variable: Cannot access memory at address 0x1>)
    at /cluster/home/anovak/build/objdir/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/basic_string.tcc:173
173 /cluster/home/anovak/build/objdir/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/basic_string.tcc: No such file or directory.
(gdb) bt
#0  std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string (this=0x9aa0c8, 
    __str=<error reading variable: Cannot access memory at address 0x1>)
    at /cluster/home/anovak/build/objdir/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/basic_string.tcc:173
#1  0x0000000000440c0f in construct (this=0x7fffffffcfd0, __val=<error reading variable: Cannot access memory at address 0x1>, 
    __p=0x9aa0c8) at /cluster/home/anovak/.local/include/c++/4.9.0/ext/new_allocator.h:130
#2  std::deque<std::string, std::allocator<std::string> >::_M_push_front_aux (this=this@entry=0x7fffffffcfd0, 
    __t=<error reading variable: Cannot access memory at address 0x1>)
    at /cluster/home/anovak/.local/include/c++/4.9.0/bits/deque.tcc:496
#3  0x00000000004363d2 in push_front (__x=<error reading variable: Cannot access memory at address 0x1>, this=0x7fffffffcfd0)
    at /cluster/home/anovak/.local/include/c++/4.9.0/bits/stl_deque.h:1375
#4  hal::HDF5Alignment::getLeafNamesBelow (this=0x99be50, name=<error reading variable: Cannot access memory at address 0x1>)
    at hdf5_impl/hdf5Alignment.cpp:614
#5  0x0000000000406ddd in main (argc=<optimized out>, argv=<optimized out>) at hal2sg.cpp:117
(gdb) 

I am using commit 39ab8ad.

Can you reproduce this issue with your build? Can you figure out what's going on?

Thanks,
-Adam

question regarding usage

Hi Glenn,
I am looking for a way to convert the HAL files generated by progressiveCactus into vg files. I came across two of your converters, hal2sg and sg2vg, which sound like they could be the solution I am looking for. Do you think one could use hal2sg and sg2vg on an existing HAL file to create a vg file? Our variants are based on alignments of near-complete whole genomes, so we don't have VCF files.
Thank you.
Mahul

Too Many VariantSets

hal2sg is producing a VariantSet for every genome in the HAL, but is only putting Alleles into VariantSet 0:

INSERT INTO VariantSet VALUES (0, 0, 'Anc0');
INSERT INTO VariantSet VALUES (1, 0, 'ref');
INSERT INTO VariantSet VALUES (2, 0, 'GI568335879');
INSERT INTO VariantSet VALUES (3, 0, 'GI568335954');
INSERT INTO VariantSet VALUES (4, 0, 'GI568335976');
INSERT INTO VariantSet VALUES (5, 0, 'GI568335986');
INSERT INTO VariantSet VALUES (6, 0, 'GI568335989');
INSERT INTO VariantSet VALUES (7, 0, 'GI568335992');
INSERT INTO VariantSet VALUES (8, 0, 'GI568335994');
INSERT INTO VariantSet VALUES (9, 0, 'GI568335997');

INSERT INTO Allele VALUES (0, 0, 'Anc0refChr0');
INSERT INTO Allele VALUES (1, 0, 'Anc0refChr1');
INSERT INTO Allele VALUES (2, 0, 'Anc0refChr2');
INSERT INTO Allele VALUES (3, 0, 'Anc0refChr3');
INSERT INTO Allele VALUES (4, 0, 'Anc0refChr4');
INSERT INTO Allele VALUES (5, 0, 'Anc0refChr5');
INSERT INTO Allele VALUES (6, 0, 'Anc0refChr6');
INSERT INTO Allele VALUES (7, 0, 'Anc0refChr7');
...

Instead, it should produce only one VariantSet.
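
For anyone wanting to check their own output for this, a sketch of a query that lists VariantSets with no Alleles attached follows. It uses Python's sqlite3 with a shortened copy of the rows above; the table definitions are simplified (no foreign keys) and empty_variant_sets is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE VariantSet (ID INTEGER PRIMARY KEY, referenceSetID INTEGER NOT NULL, name TEXT);
CREATE TABLE Allele (ID INTEGER PRIMARY KEY, variantSetID INTEGER, name TEXT);
INSERT INTO VariantSet VALUES (0, 0, 'Anc0');
INSERT INTO VariantSet VALUES (1, 0, 'ref');
INSERT INTO VariantSet VALUES (2, 0, 'GI568335879');
INSERT INTO Allele VALUES (0, 0, 'Anc0refChr0');
INSERT INTO Allele VALUES (1, 0, 'Anc0refChr1');
""")

def empty_variant_sets(conn):
    """Return (ID, name) for every VariantSet that no Allele refers to."""
    query = """
        SELECT vs.ID, vs.name
        FROM VariantSet AS vs
        LEFT JOIN Allele AS a ON a.variantSetID = vs.ID
        WHERE a.ID IS NULL
        ORDER BY vs.ID
    """
    return conn.execute(query).fetchall()

unused = empty_variant_sets(conn)
for variant_set_id, name in unused:
    print(f"VariantSet {variant_set_id} ({name}) has no Alleles")
```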

Occasional invalid entry in Allele SQL table

I'm seeing this quite a bit, in various positions in the Allele table:

INSERT INTO Allele (0, 0);
INSERT INTO Allele (1, 4250022048784970411);
INSERT INTO Allele (2, 1);
INSERT INTO Allele (3, 2);

Bad pointer or uninitialized value?
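
A rough way to flag such rows automatically, assuming the second value is meant to be a variantSetID (as in the Allele schema quoted elsewhere on this page) and assuming VariantSets 0 through 2 exist, is to look for Alleles that point at a VariantSet that is not there. This is a sketch using Python's sqlite3; the two-column tables and the dangling_alleles helper are illustrative only, and the VALUES keyword has been added so the rows load.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE VariantSet (ID INTEGER PRIMARY KEY);
CREATE TABLE Allele (ID INTEGER PRIMARY KEY, variantSetID INTEGER);
-- Assumed set of valid VariantSet IDs.
INSERT INTO VariantSet VALUES (0);
INSERT INTO VariantSet VALUES (1);
INSERT INTO VariantSet VALUES (2);
-- Rows from the output above.
INSERT INTO Allele VALUES (0, 0);
INSERT INTO Allele VALUES (1, 4250022048784970411);
INSERT INTO Allele VALUES (2, 1);
INSERT INTO Allele VALUES (3, 2);
""")

def dangling_alleles(conn):
    """Return (ID, variantSetID) for Alleles whose VariantSet does not exist."""
    query = """
        SELECT a.ID, a.variantSetID
        FROM Allele AS a
        LEFT JOIN VariantSet AS vs ON vs.ID = a.variantSetID
        WHERE a.variantSetID IS NOT NULL AND vs.ID IS NULL
        ORDER BY a.ID
    """
    return conn.execute(query).fetchall()

bad_rows = dangling_alleles(conn)
```

With foreign key enforcement switched on (PRAGMA foreign_keys = ON) and the constraint declared, SQLite would refuse such a row at insert time instead.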

AllelePath SQL output is odd

The SQL schema for the AllelePathItem table should be:

CREATE TABLE AllelePathItem (alleleID INTEGER, 
    pathItemIndex INTEGER NOT NULL, -- one-based index of this pathItem within the entire path
    sequenceID INTEGER NOT NULL, start INTEGER NOT NULL,
    length INTEGER NOT NULL, strandIsForward BOOLEAN NOT NULL,
    PRIMARY KEY(alleleID, pathItemIndex),
    FOREIGN KEY(alleleID) REFERENCES allele(ID),
    FOREIGN KEY(sequenceID) REFERENCES Sequence(ID));

Thus, the first two values taken together comprise the primary key. This assumption is being violated in all the output I've seen so far. For example:

INSERT INTO AllelePathItem VALUES (2, 2, 0, 0, 2, 'TRUE')
INSERT INTO AllelePathItem VALUES (2, 2, 0, 1926, 1925, 'FALSE')
INSERT INTO AllelePathItem VALUES (2, 2, 2, 0, 10, 'TRUE')
INSERT INTO AllelePathItem VALUES (2, 2, 0, 8533, 6579, 'FALSE')
INSERT INTO AllelePathItem VALUES (2, 2, 3, 0, 10, 'TRUE')
INSERT INTO AllelePathItem VALUES (2, 2, 0, 17436, 8891, 'FALSE')
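
To make the violation concrete, the sketch below counts how often each (alleleID, pathItemIndex) pair occurs in the rows above, then shows one possible repair: numbering the items of each allele sequentially from one, as the schema comment describes. It is written in Python and is not hal2sg's code; duplicate_keys and renumber are hypothetical helpers, and the repair assumes the rows are emitted in path order.

```python
from collections import Counter

# The six rows quoted above, as (alleleID, pathItemIndex, sequenceID,
# start, length, strandIsForward) tuples.
rows = [
    (2, 2, 0, 0, 2, "TRUE"),
    (2, 2, 0, 1926, 1925, "FALSE"),
    (2, 2, 2, 0, 10, "TRUE"),
    (2, 2, 0, 8533, 6579, "FALSE"),
    (2, 2, 3, 0, 10, "TRUE"),
    (2, 2, 0, 17436, 8891, "FALSE"),
]

def duplicate_keys(rows):
    """Map each repeated (alleleID, pathItemIndex) key to its occurrence count."""
    counts = Counter((allele_id, index) for allele_id, index, *_ in rows)
    return {key: n for key, n in counts.items() if n > 1}

def renumber(rows):
    """Assign a one-based pathItemIndex per allele, in the order rows appear."""
    next_index = {}
    fixed = []
    for allele_id, _old_index, *rest in rows:
        index = next_index.get(allele_id, 1)
        next_index[allele_id] = index + 1
        fixed.append((allele_id, index, *rest))
    return fixed

before = duplicate_keys(rows)
after = duplicate_keys(renumber(rows))
```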
