Hello,
I started working with the DBGenerator.py script a few days ago. It's a great help, and I got it to work with Python 2.7 and with genomes that still have their version numbers attached (e.g. NC_008253.1). It worked well on my example data, but not on the whole dataset.
The problem appears to be duplicated genes in the gene_presence_absence.csv table. In that table, a sample can have multiple gene IDs for one gene, separated by tabs. In genomas_locus.csv, I then get multiple entries as well, like this:
NZ_CP027766.1|['NZ_CP027766.1_00163', 'NZ_CP027766.1_00164']
NZ_CP027766.1|['NZ_CP027766.1_00163', 'NZ_CP027766.1_00164']
I am not sure how to proceed with this. Did this happen in your analysis as well? You have the
lines in the get_locus_sequence() function, so maybe you were looking at this already?
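In case it helps to make the problem concrete, here is a minimal sketch of the kind of handling I have in mind. It assumes the gene IDs in one cell are tab-separated as described above, and that repeated entries for the same genome can simply be dropped. The helper names split_gene_ids and unique_entries are hypothetical and not part of DBGenerator.py, so this is only an illustration, not a proposed patch.

```python
def split_gene_ids(cell):
    """Split one gene_presence_absence.csv cell into individual gene IDs."""
    # Assumption: multiple gene IDs within one cell are separated by tabs.
    return [gene_id.strip() for gene_id in cell.split("\t") if gene_id.strip()]


def unique_entries(entries):
    """Drop repeated (genome, gene IDs) entries, keeping the original order."""
    seen = set()
    unique = []
    for genome, gene_ids in entries:
        # Lists are not hashable, so use a tuple of the IDs as part of the key.
        key = (genome, tuple(gene_ids))
        if key not in seen:
            seen.add(key)
            unique.append((genome, gene_ids))
    return unique


# Example cell with two gene IDs for one gene, as in my dataset.
cell = "NZ_CP027766.1_00163\tNZ_CP027766.1_00164"
gene_ids = split_gene_ids(cell)

# Two identical entries, like the ones I see in genomas_locus.csv.
entries = [("NZ_CP027766.1", gene_ids), ("NZ_CP027766.1", gene_ids)]
deduplicated = unique_entries(entries)
```

Whether dropping the repeated entry is the right behaviour, or whether each gene ID should get its own row instead, is exactly what I am unsure about.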
Thanks!
@LilithElina