dd-genomics issues from hazyresearch

Finish moving raw PMC data -> dfs

P: Get rid of some bad phenos still showing up e.g. "chronic"

Should be some sub-tree of the HPO DAG to eliminate (didn't we already do this?)
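
One way to drop a whole sub-tree is to collect every descendant of a chosen root term and filter those out of the phenotype list. The sketch below assumes the DAG is available as a parent -> children dictionary; the term IDs and the `descendants` helper are made up for illustration and are not real HPO codes or existing repo code.

```python
from collections import deque

def descendants(children, root):
    """Return root plus every term reachable from it via child edges."""
    seen = {root}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, ()):
            if child not in seen:  # a DAG node can have several parents
                seen.add(child)
                queue.append(child)
    return seen

# Toy DAG with invented IDs (assumption: not real HPO codes)
toy_children = {
    "T:ROOT": ["T:MODIFIER", "T:ABNORMALITY"],
    "T:MODIFIER": ["T:CHRONIC", "T:ACUTE"],
    "T:ABNORMALITY": ["T:HEADACHE"],
}

blocked = descendants(toy_children, "T:MODIFIER")
kept = sorted(t for t in descendants(toy_children, "T:ROOT") if t not in blocked)
```

Here `blocked` would hold the modifier-like terms (such as "chronic") and `kept` the terms that remain as phenotype candidates.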

Some sort of memory error w pheno_extract_candidates

Seems like it occurs when loading the pmid to mesh mapping, but this doesn't make sense... need to figure this out. Also, in general, reducing the memory overhead for pheno would be a good thing to do.

Build a bit of light wrapper code for switching between doc section types?

This might be unnecessary; switching is simple enough already (e.g. select LIKE %.Body% versus %.Abstract%)
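
If a wrapper turns out to be worth it, it could be as small as the sketch below. The column name `doc_id`, the `SECTION_PATTERNS` table and the `section_filter` helper are assumptions for illustration; only the two LIKE patterns come from the note above. The `%s` placeholder follows the psycopg2 parameter style.

```python
# Hypothetical mapping from section type to the LIKE pattern mentioned above
SECTION_PATTERNS = {"body": "%.Body%", "abstract": "%.Abstract%"}

def section_filter(section, column="doc_id"):
    """Return (sql_fragment, params) for a parameterized LIKE filter.

    `column` is interpolated directly, so it must be a trusted identifier.
    """
    pattern = SECTION_PATTERNS[section.lower()]
    return "{} LIKE %s".format(column), (pattern,)

fragment, params = section_filter("Abstract")
```

The fragment and params can then be passed to a cursor's `execute` as part of a larger WHERE clause.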

Include noncanonical gene names in gene-pheno relationship extractor

Not doing this currently because we're still aiming for precision rather than recall. Once the focus changes, do it. Talk to me (@Colossus) or assign the task to me.

GP: Work on handling negation / hypothetical statements better!

Move delta-eval stuff to eval directory / separate out a bit?

Not a big issue at all, but this can probably be separated out a bit more to clean up the code. Might be irrelevant post-dashboard integration.

Repo cleanup: onto directory, delete archived, util code, etc

Transition to new deepdive release

Add a Defined Acronym extractor, use this to avoid incorrect G extractions

Centralize HPO canonicalization / other HPO DAG functions in util

Esp. to make it easier to e.g. switch to UMLS

Remove sentences from current sentences_input that do not contain a gene name

Basically for speed. This should cut our sentences_input size by a factor of 8.
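
A minimal sketch of the filter, assuming each sentence is available as an (id, list of words) pair and gene names are matched by exact, case-sensitive token lookup. The helper names and the toy data are hypothetical; the real filter may well be done in SQL instead.

```python
def contains_gene(words, gene_names):
    """True if any token is an exact match for a known gene name."""
    return any(w in gene_names for w in words)

def keep_gene_sentences(sentences, gene_names):
    """Keep only (sentence_id, words) pairs that mention at least one gene."""
    return [(sid, words) for sid, words in sentences
            if contains_gene(words, gene_names)]

# Toy data (assumption: invented gene names and sentences)
toy_genes = {"GENE1", "GENE2"}
toy_sentences = [
    ("s1", ["GENE1", "is", "linked", "to", "headaches"]),
    ("s2", ["No", "gene", "mentioned", "here"]),
]
filtered = keep_gene_sentences(toy_sentences, toy_genes)
```

Using a set for `gene_names` keeps each token lookup constant-time, which matters at the scale of sentences_input.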

Extract relationships from abstracts and titles only

Switch to faster extractors?

We are currently using the tsv_extractor. Should we switch to e.g. plpy or piggy extractor?

@netj any comments on what best practices are for this right now?

Also, more minor: right now we pre-convert postgres arrays to strings (sentences -> sentences_input). Is this actually saving any time? It would be cleaner / simpler to do without this. May change if we switch extractors.

Process & run on all current XML data: PLoS full + PMC

Adapt MindTagger for joint causation/association viewing

Association VS. Causation supervision rules

Now that we have the association vs. causation multinomial, let's get real and make supervision rules to sharply distinguish between the two sentence types.

G: Get rid of certain 'bad' gene names

Is this about resurrecting bad_genes.tsv, or (more ideally) are there specific distant supervision rules we can add and/or some systematic errors in our gene-list generation step?

[Mindtagger] Dependency path visualizations

Incorporate datasets beyond PLoS, PMC and PubMed

PM abstracts, ScienceDirect, etc... how many of these have XML or clean enough HTML vs. OCR/PDF?

Get Aaron, Gill, Johannes, (Alex?) to label set independently

Write a guidelines document and have a standardized dataset: a truly random subset of all sentences which have a G or P

Update mindtagger first to have "interesting", "association", "causal" tags

[Mindtagger] Start mindtagger with fixed tag set

tags:

Association
Causation
gene_error
pheno_error

Set up REST API for handoff -> Aaron

@amwenger Will discuss with you today. We have a simple elasticsearch layer in bazaar/view which we could use to set up a REST API that you could hit via faceted search. This would be instead of just handing you a db dump.

@netj @raphaelhoffmann Should I be able to do this decently quickly?

Switch from fixed PLoS ID -> PMID dictionary to direct from XML

Implement DS rule testing framework

Ability to determine if a given distant supervision flag/value/set of values in config.py has any positive effect

Integrate UMLS

UMLS raw files at /dfs/scratch0/ajratner/UMLS/

G: Handling gene names that are part of multi-word proper nouns

e.g. "San Diego" / SAN
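
One possible heuristic, sketched below: reject a gene match when the matched token is written in title case (first letter upper, rest lower) and sits next to another title-case token. This is an assumption about how the problem could be approached, not an existing rule in the repo, and the helper name is hypothetical.

```python
def in_proper_noun_phrase(words, i):
    """True if words[i] is title-case (e.g. 'San') and a neighbor is too."""
    def title_like(w):
        return len(w) > 1 and w[0].isupper() and w[1:].islower()

    if not title_like(words[i]):
        return False
    left = i > 0 and title_like(words[i - 1])
    right = i + 1 < len(words) and title_like(words[i + 1])
    return left or right

toy_words = ["Patients", "in", "San", "Diego", "carried", "SAN", "variants"]
```

With `toy_words`, index 2 ("San") is flagged as part of a proper noun, while index 5 ("SAN") is left alone as a plausible gene mention. A sentence-initial capitalized word followed by lowercase text is not flagged.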

Load PMID to HPO via mesh into database table + add PMID column to sentences_input

Write DepPathDAG test extractor

G: Add in Generifs dataset for additional supervision?

[Mindtagger] Put all the other extractions in same sentence in

e.g. in a hidden expandable div

Make pheno_extract_candidates less top-heavy!

Reduce size of loaded data structures

Set up eval against saved mindtagger labels (shell scripts + Dashboard)

G: Add some targeted distant supervision rules

Example ideas:

"signs of G", "symptoms of G" [neg]
"G pathway" [neg]

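The example patterns above could be expressed as named regex templates, so that the name of the rule that fired can be recorded alongside the negative label. This is a sketch under assumptions: the rule names, `NEG_RULES` and `fired_negative_rule` are hypothetical and do not reflect the actual config.py layout.

```python
import re

# (rule_name, template) pairs; {g} is replaced by the escaped gene mention
NEG_RULES = [
    ("signs_of_G", r"\bsigns of {g}\b"),
    ("symptoms_of_G", r"\bsymptoms of {g}\b"),
    ("G_pathway", r"\b{g} pathway\b"),
]

def fired_negative_rule(sentence, gene):
    """Return the name of the first negative rule matching, else None.

    Matching is case-insensitive, which also applies to the gene mention.
    """
    g = re.escape(gene)
    for name, template in NEG_RULES:
        if re.search(template.format(g=g), sentence, flags=re.IGNORECASE):
            return name
    return None

rule = fired_negative_rule("Activation of the WNT1 pathway was observed.", "WNT1")
```
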
Fix the genes canonical/noncanonical issue

Find out if we use noncanonical gene names.
If we use noncanonical gene names, MARK AS OBSOLETE.
Otherwise, implement noncanonical gene names. Figure out what the issues are! Many more gene names will come in at that point, and there will be clashes with other abbreviations and English words.

Hook G & P up to config.py

List of common bio abbreviations

Get a list of common bio abbreviations (PSD = postsynaptic density; PCR = polymerase chain reaction; ...). Compute the overlap with our gene names and figure out whether we can create a blacklist, or otherwise determine whether a given mention is the abbreviation or the gene name.
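
The overlap step could look like the sketch below, assuming the abbreviations are loaded as a symbol -> expansion dictionary. The gene set is toy data and `ambiguous_symbols` is a hypothetical helper name.

```python
def ambiguous_symbols(gene_symbols, abbreviations):
    """Map each gene symbol that is also a known abbreviation to its expansion."""
    return {s: abbreviations[s] for s in gene_symbols if s in abbreviations}

toy_abbrevs = {
    "PSD": "postsynaptic density",
    "PCR": "polymerase chain reaction",
}
toy_genes = {"PSD", "GENE1"}  # assumption: toy list, not from a gene database

clashes = ambiguous_symbols(toy_genes, toy_abbrevs)
```

The resulting `clashes` dictionary is a candidate blacklist; keeping the expansions makes it possible to later check whether the long form appears in the same document.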

More fine-grained labels: Two type columns, one with the exact supervision rule that fired

Add in genetic variant tagging (as preprocessing step)

Use http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3661051/

GP: Add Multinomial {causation, association, NONE}

Switch everything -> multinomial for GP inference

This includes changing the mindtagger template, the config VALS, the distant supervision rules, application.conf, etc.

We think that this will better structure our efforts and avoid a lot of the confusion / debate around this issue. Basically, we have many examples that meet the "Things Aaron would find interesting" bar but are not "causal", and we would like to retain these in some structured manner

Also, as a recap: this is not a philosophical distinction (for the most part) but a methodological one. As an example of two extremes:

Causative link: "We knocked out X in mice and they got headaches"
Associative link: "We performed a GWAS study on a mouse population and found a slight statistical correlation between X and headaches"

Change config file to be ordered by entity

For better readability of overall pipeline for each entity/relation, seems more intuitive

Use Trends in Dashboard

E.g. for tracking precision against previously-labeled training set, # of labeled examples, etc

Update to new Charite dataset

See email from Harendra: the dataset he and Aaron are now using

Make dashboard have 'multinomial' values

GP: Can we use NHGRI GWAS catalogue / other similar datasets?

For supervision re: the causal vs. associative division?

[deepdive] Implement ScopedMultinomial / non-heuristic ENTITY LINKING...?

The idea here is simple: we want to implement a version of a multinomial variable that allows each individual variable from a template to have scope X, where X is a subset of a shared reference set of values.

One motivating example is entity linking: each candidate phenotype mention would be represented as a ScopedMultinomial variable with some limited scope (say: False U {a couple possible nearest-neighbor HPO codes}), defined in reference to the full set of HPO codes.
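
To make the idea concrete, here is a toy sketch of such a variable outside of deepdive: the scope must be a subset of the shared reference set, and probabilities are normalized over in-scope values only. The class below is purely illustrative and is not a deepdive API; the value IDs are invented and the softmax over scores is an assumption about how weights might be turned into a distribution.

```python
import math

class ScopedMultinomial:
    """Multinomial variable restricted to a subset of a shared value set."""

    def __init__(self, reference, scope):
        scope = frozenset(scope)
        if not scope:
            raise ValueError("scope must not be empty")
        unknown = scope - frozenset(reference)
        if unknown:
            raise ValueError("scope values outside reference set: %s"
                             % sorted(unknown))
        self.scope = scope

    def distribution(self, scores):
        """Softmax over in-scope values; out-of-scope scores are ignored."""
        in_scope = {v: scores.get(v, 0.0) for v in self.scope}
        top = max(in_scope.values())  # subtract max for numerical stability
        exp = {v: math.exp(s - top) for v, s in in_scope.items()}
        total = sum(exp.values())
        return {v: e / total for v, e in exp.items()}

# Toy reference set (assumption: invented IDs standing in for HPO codes)
reference = {"FALSE", "T:1", "T:2", "T:3"}
mention = ScopedMultinomial(reference, {"FALSE", "T:1"})
dist = mention.distribution({"T:1": 1.0, "T:3": 5.0})
```

In this example the high score for "T:3" has no effect, because that value is outside the mention's scope.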
