hazyresearch / dd-genomics
The Genomics DeepDive project
License: Apache License 2.0
There should be some sub-tree of the HPO DAG that we can eliminate (didn't we already do this?)
This seems to occur when loading the PMID-to-MeSH mapping, but that doesn't make sense... need to figure this out. Also, in general, reducing the memory overhead for pheno would be a good thing to do.
This might be unnecessary- is simple enough (e.g. select LIKE %.Body% versus %.Abstract%)
Not doing this currently because we're still aiming for precision rather than recall. Once the focus changes, do it. Talk to me (@Colossus) or assign the task to me.
Not a big issue at all but probably can be separated out a bit more to clean up code. Might be irrelevant post-dashboard integration
Esp. to make it easier to e.g. switch to UMLS
When this is ready
Basically for speed. This should cut our sentences_input size by a factor of 8.
We are currently using the tsv_extractor. Should we switch to e.g. plpy or piggy extractor?
@netj any comments on what best practices are for this right now?
Also, more minor: right now we pre-convert postgres arrays to strings (sentences -> sentences_input). Is this actually saving any time? Would be cleaner / simpler to do without this. May change if we switch extractors.
Now that we have the association vs. causation multinomial, let's get real and make supervision rules to sharply distinguish between the two sentence types.
Is this about resurrecting bad_genes.tsv, or (more ideally) are there specific distant supervision rules we can add and/or some systematic errors in our gene-list generation step?
PM abstracts, ScienceDirect, etc... how many of these have XML or clean enough HTML vs. OCR/PDF?
Write a guidelines document and create a standardized dataset: a truly random subset of all sentences which have a G or P
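A rough sketch of how the random subset could be drawn reproducibly. The field names (has_gene, has_pheno), the function name, and the toy sentences are all assumptions for illustration, not the actual schema; in practice the filter would more likely be a query against the sentences table.

```python
import random


def sample_sentences(sentences, k, seed=0):
    """Draw up to k sentences uniformly at random from those with a G or P mention.

    `sentences` is a list of dicts with hypothetical boolean fields
    `has_gene` and `has_pheno`; a fixed seed makes the subset reproducible.
    """
    eligible = [s for s in sentences if s["has_gene"] or s["has_pheno"]]
    rng = random.Random(seed)
    return rng.sample(eligible, min(k, len(eligible)))


# Toy input, for illustration only
toy = [
    {"id": 1, "has_gene": True, "has_pheno": False},
    {"id": 2, "has_gene": False, "has_pheno": False},
    {"id": 3, "has_gene": False, "has_pheno": True},
    {"id": 4, "has_gene": True, "has_pheno": True},
]
picked = sample_sentences(toy, 2, seed=1)
```

Fixing the seed means every labeler works from the same subset, and the same sample can be regenerated later for comparison.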
Update mindtagger first to have "interesting", "association", "causal" tags
tags:
@amwenger Will discuss with you today- we have a simple elasticsearch layer in bazaar/view which we could use to set up a REST API that you could hit via faceted search. This would be instead of just handing you a db dump
@netj @raphaelhoffmann Should I be able to do this decently quickly?
Ability to determine if a given distant supervision flag/value/set of values in config.py has any positive effect
UMLS raw files at /dfs/scratch0/ajratner/UMLS/
e.g. "San Diego" / SAN
e.g. in a hidden expandable div
Reduce size of loaded data structures
Example ideas:
Find out if we use noncanonical gene names.
If we use noncanonical gene names, MARK AS OBSOLETE.
Otherwise, implement noncanonical gene names. Figure out what the issues are! A lot more gene names will be coming in then, and there will be clashes with other abbreviations and English words.
Get a list of common bio abbreviations (PSD = postsynaptic density; PCR = polymerase chain reaction; ...). Get overlap and figure out if we can create a blacklist or figure out if the abbreviation or the gene name is used.
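The overlap step could start as a plain set intersection, something like the sketch below. The function name and the inputs are hypothetical toy values, not entries from our actual gene list, and the match is exact and case-sensitive, which is an assumption we may want to relax.

```python
def find_ambiguous_symbols(gene_symbols, abbreviations):
    """Return the gene symbols that also appear as common abbreviations.

    Matching is exact and case-sensitive (an assumption, not a decision).
    """
    return sorted(set(gene_symbols) & set(abbreviations))


# Toy inputs, for illustration only
gene_symbols = ["BRCA1", "PSD", "TP53", "PCR"]
abbreviations = {
    "PSD": "postsynaptic density",
    "PCR": "polymerase chain reaction",
}

# Symbols that would need a blacklist entry or context-based disambiguation
blacklist_candidates = find_ambiguous_symbols(gene_symbols, abbreviations)
```

This only finds the clashes; deciding per mention whether the abbreviation or the gene name is meant would still need context.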
Switch everything -> multinomial for GP inference
This includes changing the mindtagger template, the config VALS, the distant supervision rules, application.conf, etc.
We think that this will better structure our efforts and avoid a lot of the confusion / debate around this issue. Basically, we have many examples that meet the "Things Aaron would find interesting" bar but are not "causal", and we would like to retain these in some structured manner
Also, as a recap: this is not a philosophical distinction (for the most part) but a methodological one. As an example of two extremes:
For better readability of overall pipeline for each entity/relation, seems more intuitive
E.g. for tracking precision against previously-labeled training set, # of labeled examples, etc
See email from Harendra- the dataset he and Aaron are now using
For supervision re: the causal vs. associative division?
The idea here is simple- we want to implement a version of a multinomial variable that allows each individual variable from a template to have scope X, where X is a subset of a shared reference set of values.
One motivating example is entity linking- each candidate phenotype mention would be represented as a ScopedMultinomial variable which had some limited scope (say: False U {a couple possible nearest-neighbor HPO codes}) which would be defined in reference to the full set of HPO codes
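To make the idea concrete, here is a minimal sketch of what such a variable could look like in plain Python. It is not DeepDive's API; the class layout, the method names, the FALSE sentinel, and the placeholder codes ("HP:A" etc.) are all assumptions for illustration.

```python
FALSE = "False"  # sentinel value meaning "not a phenotype mention"


class ScopedMultinomial:
    """A categorical variable restricted to a subset (scope) of a shared reference set."""

    def __init__(self, reference, scope):
        scope = frozenset(scope)
        unknown = scope - reference
        if unknown:
            raise ValueError("scope values not in reference set: %s" % sorted(unknown))
        if not scope:
            raise ValueError("scope must be non-empty")
        self.reference = reference  # shared across variables, not copied
        self.scope = scope

    def allows(self, value):
        """True if this variable may take the given value."""
        return value in self.scope

    def uniform_prior(self):
        """Uniform distribution over the scope only, not the full reference set."""
        p = 1.0 / len(self.scope)
        return {value: p for value in sorted(self.scope)}


# Shared reference set: FALSE plus placeholder stand-ins for HPO codes
REFERENCE = frozenset([FALSE, "HP:A", "HP:B", "HP:C", "HP:D"])

# One candidate phenotype mention, scoped to FALSE plus two nearest-neighbor codes
mention = ScopedMultinomial(REFERENCE, [FALSE, "HP:B", "HP:C"])
```

The point of keeping a shared reference set is that each variable only carries its own small scope, while values stay comparable across variables.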
Check this!
Once #16 done