Coder Social home page Coder Social logo

primers-java's People

Contributors

ahmedihafez avatar dependabot[bot] avatar srgnis avatar

Stargazers

 avatar

Watchers

 avatar  avatar

Forkers

srgnis

primers-java's Issues

Detect and warn about alternative products of non-unique PCR

The code is evidently forked from version 1.5 of the C and so didn't catch the changes to amplicons.c in June 2019 (versions 1.6 and 1.61) to detect and warn about alternative products of non-unique PCR. Version 1.6 is a regression and was fixed by 1.61, so the best diff to look at is the one that compares Version 1.51 with 1.61: https://github.com/ssb22/PrimerPooler/compare/90a82bd5ec920e8e198b1990f18ce7584fbf918d..2f07c4d6ac8442c85b5e77b8c0a49d0db71bf73d and it's the changes to amplicons.c part-way down the diff that need looking at (all other changes are just to the version number).

I have confirmed that the fix from 1.5 to 1.51 is not relevant to this fork, but 1.6 / 1.61 probably should be ported in (or documented as missing).

Make "throws Exception" more specific + do not ignore exceptions or errors

Writing throws Exception is generally considered "sloppy" Java, we are supposed to be more specific about the type of exception we throw. A quick way to fix this is to (1) remove the throws Exception, (2) try to compile, look carefully at the compiler's error messages to see exactly which types of exception it says are uncaught (for example IOException), and (3) write throws that kind of exception, not just any exception. That way we can count on the compiler to flag it up if we later call something that throws some other kind of exception without realising it; we also avoid inadvertently sending out a signal that says "this code was not written carefully".

In Splitter.java

public void runSplit(PrintStream redirErr)
public void runSplit()
void poolsplit_thread

In PoolSplit.java

static void randomise_pools
static public int suggest_num_pools
public static int[]  split_into_pools

In Fasta2Bit.java we have public Fasta2Bit(String genomeFile) throws Exception

In AllPrimers.java we have public void loadFasta(String fileName) throws Exception and it also has two // TODO Auto-generated catch block which just print the stack trace and return without error which is not good at all (if it can throw the exception, then we should let it throw the exception, not just catch it and carry on regardless).

Additionally, loadSequences (while not being declared as being able to throw exceptions) also has an auto-generated catch block that simply prints the stack trace and returns without error, this is wrong, the exception should propagate and be handled by the caller, since carrying on anyway after there's been an exception is not the correct thing for the caller to do. We should take out that auto-generated catch block, find out exactly which exceptions the complier tells us can be thrown, and declare throws those.

There is also one of these in Genome.java in go_through_genome (prints stack trace and returns success anyway).

And in Fasta2Bit.java we have // TODO :: Invalid signature (is this really a .2bit genome file?) (in two places), and carries on regardless, where the C version raised an error. It's especially important to flag this error because we have not yet implemented fasta_genome: if someone tries to pass in a .fa genome file instead of a .2bit one, we do not want to carry on regardless. Also, there is one of those auto-generated catch blocks in that same method (prints stack trace and carries on regardless), errors need to propagate back to caller.

And PoolerMain.java has a similar problem in runPooler, stack trace is printed but the caller is not informed that there was a problem (I assume runPooler is called by the user interface, and standard output is made visible in some way? does the UI need to be told whether there was a problem?)

On the subject of exception handling, we should also consider how poolsplit_thread is stopped when a time limit is not set (or if it quickly prints a good-enough solution and the user wants to stop it early). Splitter.java prints press Control-C to stop, but the handling of Control-C in the C version relies on a call to signal(SIGINT, intHandler) which does not seem to have been replaced with anything in Java. I think Control-C becomes an InterruptedException, but the GUI might prefer to have an "Early Stop" button that sets some variable somewhere that is periodically checked like the timer.

Confusing comments

In DegeneratePrimer64.java :
// TODO :: this (bonds == 0 ) ? 0 : bonds is not need for now,
// the difference is the operation in C is undefinde and give 1 for 0 ??
// here it return 0 for 0
int shift = 64-2-Long.numberOfLeadingZeros((bonds == 0 ) ? 0 : bonds);

this comment doesn't seem to make sense, because the C just said int shift = 64-2-leading0_64(bonds); where leading0_64 simply maps to __builtin_clzll and there is no special case for 0, I don't know if that came from a translation script gone wrong? anyway I think that comment can be deleted and just say int shift = 64-2-Long.numberOfLeadingZeros(bonds);

Incidentally later on in that file there's if(! (overlap != 0)) which I really think would be more readable to write as if (overlap==0) (it was if (!overlap) in C, which we read as "if not overlap")

and in PrimerFactory.java getScore,
// TODO :: here make sure that the order of primer does not matter here
Yes, ordering of a pair of primers does not matter when calculating score or deltaG between them.

But backword should be spelled backward throughout PrimerFactory :)

Confusing comments

In DegeneratePrimer64.java :
// TODO :: this (bonds == 0 ) ? 0 : bonds is not need for now,
// the difference is the operation in C is undefinde and give 1 for 0 ??
// here it return 0 for 0
int shift = 64-2-Long.numberOfLeadingZeros((bonds == 0 ) ? 0 : bonds);

this comment doesn't seem to make sense, because the C just said int shift = 64-2-leading0_64(bonds); where leading0_64 simply maps to __builtin_clzll and there is no special case for 0, I don't know if that came from a translation script gone wrong? anyway I think that comment can be deleted and just say int shift = 64-2-Long.numberOfLeadingZeros(bonds);

Incidentally later on in that file there's if(! (overlap != 0)) which I really think would be more readable to write as if (overlap==0) (it was if (!overlap) in C, which we read as "if not overlap")

and in PrimerFactory.java getScore,
// TODO :: here make sure that the order of primer does not matter here
Yes, ordering of a pair of primers does not matter when calculating score or deltaG between them.

But backword should be spelled backward throughout PrimerFactory :)

Data corruption possible because some right-shift operators are incorrectly ported

The C version of PrimerPooler uses unsigned integers to represent all bit patterns, which means the right-shift operator >> always brings in 0 bits on the left.

The Java language has no unsigned integers, so we must change all instances of the right-shift operator >> into >>> so as to explicitly specify that it should be a logical right-shift, not an arithmetic right-shift. If >> is used instead of >>>, and if there is a 1 bit in the leftmost position of the operand, then a new 1 will be added to the bit pattern, and this can corrupt our data in some cases.

In most cases the port is correctly using >>>, but there are still instances in PoolSplit.java and primers64/DegeneratePrimer64.java where >> is used. I have not done a full analysis of exactly which circumstances would cause this to corrupt data, so I suggest after fixing it we inform users that they should re-run their calculations in the new version, because the first version might have given them inaccurate results due to this bit-pattern corruption.

Change package name

Technically we're not supposed to use package names from domains we do not control, for example pooler.org currently directs to a domain-parking company, somebody could buy it and claim we're using their domain.

I've been told I can use *.ssb22.ucam.org in perpetuity, so I suppose it makes sense to use org.ucam.ssb22.pooler instead of org.pooler throughout. The code that calls it will also have to use the new package name.

Implement AllPrimers commented-out warnings

In AllPrimers.java methods loadFasta and loadSequences there are commented-out calls to fprintf which are supposed to be there to warn the user about errors in the input; leaving them commented out means the user will not be warned about such mistakes by the Java version. The user does need to be warned in some way if the code reaches these branches.

list of copyright owners

In your Readme.md in the sentence
"Primer3 copyrights are reserved to its authors..."
Please list all authors of the 2012 NAR paper as copyright owners.

Implement parallelisation

PrimerPooler is supposed to be fast, and part of its speed comes from parallelisation. The Java port is currently not parallelised, so will be slower than the C version (and I'd like users to know if the C version is many times faster☺)

Java does not have OpenMP, but we may be able to port our use of it into standard Java threads.

For dGandScoreCounts and dGprintBonds, in each case the #pragma omp parallel simply means "find out how many cores the CPU has (I think it's Runtime.getRuntime().availableProcessors() in Java) and start that number of threads, each of them running a copy of this block". So to do that in Java we'd have to put the block into a public void run() method of a class that 'extends Thread, then create N instances of that class, call .start()on each (so they all start), then call.join()` on each (so we wait till they've all finished).

Each thread will want to know what its thread number is, so we'd best pass the loop counter in to the constructor and save it somewhere so we can pass it in to the call to Triangle.t_iBounds(np) (which needs to know thread number and total number of threads, like "we are thread number 3 of 4", so it can calculate a sensible share of the run for itself: the code to do this is already ported, we just need to set its nThreads and tNum to something other than 1 and 0).

And then there is the matter of the #pragma omp critical section, which is equivalent to synchronized in Java. So that block would have to go into a method of AllPrimers (or some other class of which there is only one instance), declared synchronized so that only one thread at a time can run it (with parameters bucket and score), and the counts, minScore and maxScore arrays will also have to be looked after by that object. Similarly for dGprintBonds which needs to synchronize access to sr in the same way.

poolsplit_thread in Splitter.java can also be parallelised in this way, but beware it has multiple critical sections (and each thread needs its own copy of all parameters, rather than having them all shared in the Splitter object). Note also the use of ThreadRand which must return a different random sequence to each thread (otherwise there's no point in parallelising here); this is also why the C version does not parallelise here if seedless is set.

The most difficult part will be the parallelisation of the genome scan. In this case, we cannot rely on "share the chromosomes equally between the threads" being a good strategy (we don't know these chromosomes are the same lengths, they might be very different). So the C version uses #pragma omp parallel for schedule(dynamic) which basically means "create N threads (one per core), give one chromosome to each thread, and then whenever any thread finishes, give it another chromosome, until there are no chromosomes left, then wait for all threads to finish".

Warn if primers truncated

As the Java version limits the primer length to 64 (as we don't have a 128-bit type), we should warn the user if they try to supply a primer longer than 64bp. The C version has such a warning in bit-common.c in checkLenLimit, called from numPrimers in load-common.c, which is called from loadFASTA during a preliminary pass to figure out how much memory to allocate. In other words, we check if we need to warn about primers being too long as a side-effect to figuring out how much memory to allocate for them. But the Java version of loadFasta, like many things in Java, expands as it goes along, and therefore lacks the preliminary pass to figure out how much memory to allocate. Unfortunately, that removal of the first pass i.e. the removal of the call to numPrimers has also removed the code that warns about primers being too long. Perhaps I should have thought of a function name that made it more obvious it was not just dimensioning the memory but also performing a check. Anyway, we now need to put the warning back into the Java version, in case anyone tries to feed it a long primer.

DegeneratePrimer64 constructor handles unrecognised bases badly

In the C version (parseDegeneratePrimer64 in 64.h) we check to see if l is in fact present in degenerateCombos at all; if it's not, strchr returns NULL and the else branch of if(combo) is taken, resulting in a call to reportUnrecognisedBase.

Now, reportUnrecognisedBase can do one of two things: if the "unrecognised base" is just whitespace, we can ignore this and carry on regardless (so it doesn't matter if somebody adds whitespace in their bases strings). If it is anything else, we still ignore it but we make sure to warn the user about it.

In the Java version (DegeneratePrimer64.java), indexOf returns -1 but we simply add 1 to this to make it 0, which results in the addition of a base that is "not allowed to be anything" (a 0 bit in all four registers). The bit-field operations in other parts of the code will interpret this as "we have gone beyond the edge of the primer". So primers will get truncated without warning. This will happen if someone inserts whitespace, and will also happen if they insert any non-recognised character that they really should have been warned about.

There needs to be a proper handler for the case of indexOf returning -1 here. I suspect the reason why it got ignored is that a lot of C code that checks things aren't NULL is checking things like return values of malloc, and those checks can indeed be removed when porting to Java. But strchr is different and its NULL case still needs attention.

Ignoring variant chromosomes should be optional

The code is evidently forked from version 1.5 of the C and so didn't catch the change made in March 2020 to make ignoring variant chromosomes optional: ssb22/PrimerPooler@b595384 This change was made after a user emailed me when they couldn't scan chromosomes with - and _ in their names. The code to ignore - and _ was there because the Cambridge group I was helping were using an hg38.2bit with some variants they didn't want: any chromosome with - or _ in its name was unwanted, and I assumed this was a universal thing so I hard-coded it. But evidently I was wrong: some other users do indeed want these variants, so it's not helpful to remove them without asking or warning, hence version 1.7 makes this removal optional.

Genome.java still unconditionally skips if seqData.seq.seqname.contains("-") || seqData.seq.seqname.contains("_"), this needs to be made into an option.

seedless option does nothing

PoolerMain.java has a seedless option but Splitter.java comments out the use of it.

But for seedless to be actually useful, I would suggest not using Java's ThreadLocalRandom but instead porting the fast XorShift generator from the C version, since that should cause the exact same sequence to be generated between the C version and the Java version, which could make it possible to test for regressions between the C and Java versions.

Implement should_stick_to_pool in AllPrimers.java

Currently commented out. The Java version will be something like this:

String n = this.names.get(i);
if (n.startsWith("@")) {
    int end = n.indexOf(':');
    if (end > -1) {
        try {
            int pool = Integer.parseInt(n.substring(1,end));
            return pool - 1;
        } catch (NumberFormatException n) {} // not a valid @poolNo:
    }
}
return -1;

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.