metametrics's Introduction

title: "Metametrics: an R package with metadata metrics for annotation of genomic compendia"
author: "Vincent J. Carey, stvjc at channing.harvard.edu"
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Semantic metrics for cancer corpus}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    highlight: pygments
    number_sections: true
    theme: united
    toc: true
suppressPackageStartupMessages({
library(ggplot2)
library(plotly)
library(metametrics)
library(ssrch)
})

Basic observations on a corpus of human RNA-seq studies in cancer

Using the Omicidx system, we harvested metadata about human samples for which RNA-seq data was deposited in NCBI SRA.

We work with a subset of 1009 studies for which a cancer-related term was present in the study title as recorded at NCBI SRA.

library(ggplot2)
library(plotly)
library(metametrics)
data(study_publ_dates) # harvesting omicidx early 2019
library(lubridate)
ds_ca = DocSet_ca1009()
ds_ca

We accumulate (over dates of study submissions) the set of fields used in the sample annotation of the 1009 cancer studies.

study_publ_dates = na.omit(study_publ_dates)
studs1009 = ls(docs2kw(ds_ca))  # in cancer corpus
stud_dates = as_datetime(study_publ_dates[,2])
names(stud_dates) = study_publ_dates[,1]
stud_dates = stud_dates[studs1009]  # limit to corpus
stud_dates = sort(stud_dates)
ofields = lapply(names(stud_dates), 
    function(x) names(retrieve_doc(x, ds_ca)))
freqs = table(unlist(ofields))
#sort(freqs,decreasing=TRUE)[1:20]
cumfields = ofields
for (i in 2:length(cumfields)) cumfields[[i]] = 
    union(cumfields[[i]], cumfields[[i-1]])
csiz = sapply(cumfields,length)
bag_fields_ca1009 = unique(unlist(cumfields))
nfields = length(bag_fields_ca1009)
mydf = data.frame(date_published=stud_dates, nfields=csiz)
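
The accumulation step can be tried without the package data. The following is a minimal sketch in base R. The study accessions and field names in `toy_fields` are invented for illustration and are not taken from the corpus. It uses `Reduce(..., accumulate = TRUE)` in place of the explicit loop to build the running union.

```r
# Toy stand-in for the per-study field names (hypothetical values),
# assumed to be already ordered by publication date.
toy_fields <- list(
  SRP001 = c("tissue", "age"),
  SRP002 = c("tissue", "cell_line"),
  SRP003 = c("age", "tumor_stage", "sex")
)

# Running union: element i holds every field seen in studies 1..i.
cum_toy <- Reduce(union, toy_fields, accumulate = TRUE)

# Size of the accumulated vocabulary after each study.
csiz_toy <- lengths(cum_toy)
print(csiz_toy)
```

One reason to prefer `Reduce` in a sketch like this is that it also behaves sensibly for a one-study corpus, where a loop over `2:length(x)` would count backwards and fail.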

The growth in size of the set of fields in use over time is displayed here:

ggplot(mydf, aes(x=date_published, y=nfields)) + geom_point()
library(plotly)
ddf = data.frame(date=stud_dates[-1], newly_introduced_fields=diff(csiz),
    study=paste0(names(stud_dates[-1]), "\na"))

The next display is interactive: hover over a point to see the study accession number and the names of the newly introduced fields.

incrs = lapply(2:length(cumfields), function(x) setdiff(cumfields[[x]],
   cumfields[[x-1]]))
incrs = unlist(lapply(incrs, function(x) paste0(x, collapse="\n")))
sn = names(stud_dates[-1])
incrs = paste(sn, incrs, sep="\n")
dddf = cbind(ddf, incrs)
g2 = ggplot(dddf, aes(x=date, y=newly_introduced_fields, text=incrs)) + geom_point()
ggplotly(g2)
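
The hover labels are built from differences between successive cumulative field sets. A self-contained sketch of that step follows, again on invented data: `cum_toy` and `study_ids` are hypothetical and stand in for `cumfields` and `names(stud_dates)`.

```r
# Hypothetical cumulative field sets for three studies in date order.
cum_toy <- list(
  c("tissue", "age"),
  c("tissue", "age", "cell_line"),
  c("tissue", "age", "cell_line", "tumor_stage", "sex")
)
study_ids <- c("SRP001", "SRP002", "SRP003")

# Fields present in cumulative set i but absent from set i-1.
new_toy <- Map(setdiff, cum_toy[-1], cum_toy[-length(cum_toy)])

# Number of newly introduced fields per study (matches diff of set sizes).
n_new <- lengths(new_toy)

# One label per study: accession first, then each new field on its own line.
labels_toy <- paste(
  study_ids[-1],
  vapply(new_toy, paste, character(1), collapse = "\n"),
  sep = "\n"
)
cat(labels_toy, sep = "\n---\n")
```

The first study has no predecessor, so the labels and counts cover studies 2 onward, which is why the vignette code drops the first element with `[-1]`.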

Reference resources for reducing metadata isolation and variability

Use of common data elements is promoted by various initiatives. Dictionaries, thesauri, and ontologies are all relevant. We have examples of each in the metametrics package.

A snapshot of the Genomic Data Commons gdcdictionary, with fields and values related to diagnosis and sample characteristics, is provided in gdc_dx_sam.

gdc_dx_sam

A table with all entries from several ontologies and the NCI Thesaurus is provided by load_ontolookup:

olook = load_ontolookup()
olook

Statistics on field use

We use robust linear modeling to estimate growth over time in the vocabulary of fields employed. The data.frame mydf includes a variable nfields that takes a value for each study publication date. The value of nfields associated with date $d$ records the number of distinct fields used to annotate all studies published up to date $d$.

library(MASS)
nsecpy = 3600*24*365
summary( mm <- rlm(nfields~I(as.numeric(date_published)/nsecpy), data=mydf))
plot(nfields~I(as.numeric(date_published)/nsecpy), data=mydf)
abline(mm)
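
Because the date is divided by the number of seconds per year, the slope of the fitted line is expressed in fields per year. The sketch below illustrates this on simulated data only. The dates, the growth rate of 100 fields per year, and the noise level are all assumptions chosen for the example, not estimates from the corpus.

```r
library(MASS)  # provides rlm()

set.seed(1)
nsecpy <- 3600 * 24 * 365  # seconds per (non-leap) year

# Hypothetical publication dates: 50 studies spread over about 5 years.
dates <- as.POSIXct("2014-01-01", tz = "UTC") + sort(runif(50, 0, 5 * nsecpy))

# Seconds since 1970 rescaled to years, as in the model formula above.
years <- as.numeric(dates) / nsecpy

# Simulated vocabulary size: about 100 new fields per year, plus noise.
nfields_toy <- round(100 * (years - min(years)) + rnorm(50, sd = 10))

# Robust linear fit; the coefficient on years is the growth per year.
fit <- rlm(nfields_toy ~ years)
slope <- unname(coef(fit)["years"])
print(slope)
```

On the simulated data the estimated slope should land close to the rate used to generate it. On real data the same coefficient would be read as the estimated number of new annotation fields entering use per year.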

Proximity of terms in use to endorsed terminologies
