matthewhirschey / ddh.org

datadrivenhypothesis.org is a resource to query 100+ GB of raw biological science data to develop data-driven hypotheses

Dockerfile 0.05% R 4.27% Shell 0.15% Python 0.01% HTML 95.51% CSS 0.02%

ddh.org's People

Contributors: dleehr, johnbradley, matthewhirschey

ddh.org's Issues

Clean up Dockerfile

The root Dockerfile was originally intended to include all the code needed to build datasets in OpenShift/k8s. Now that we have the Makefile, we only use it as a base set of dependencies.

Remove the code and WORKDIR from /Dockerfile to make this clearer.

within OpenShift app.R can't find the code directory

Within OpenShift the current release is failing with:

Warning in file(filename, "r", encoding = encoding) :
cannot open file '/srv/code/current_release.R': No such file or directory

The shiny Dockerfile copies the contents of the code directory to a directory named shiny-server.
https://github.com/hirscheylab/ddh/blob/97422a539ab464838ce3b2715823a32b178bb201/Dockerfile.shiny#L40-L42

The app.R script tries to import files within code based on the .here file.
https://github.com/hirscheylab/ddh/blob/97422a539ab464838ce3b2715823a32b178bb201/code/app.R#L20

Need better feedback for gene not found in db

The only feedback when a gene is not found is to "type the official gene symbol". But in some cases the gene is legitimately not in the DB (either because it was never there, or because I filtered it out). The message should be more informative in that case; currently it throws an error.

[Screenshot, 2020-01-14]

ADIPOQ is a gene that has a gene symbol (in gene_summary), but I filtered it out because it was below my stated threshold.
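A hedged sketch of what that feedback could look like. The function name and the object layouts (`gene_summary` with an `approved_symbol` column, `achilles` with genes as columns) are assumptions for illustration, not the app's actual structures:

```r
# Sketch: distinguish "unknown symbol" from "valid gene filtered out of this dataset".
gene_feedback <- function(symbol, gene_summary, achilles) {
  if (!symbol %in% gene_summary$approved_symbol) {
    return("Symbol not recognized; please type the official gene symbol.")
  }
  if (!symbol %in% colnames(achilles)) {
    return(paste0(symbol, " is a valid gene symbol, but it fell below the ",
                  "threshold and was filtered out of this dataset."))
  }
  NULL  # gene is present; no message needed
}
```

The app could then show the returned message instead of letting the lookup error propagate.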

Remaining issues to hosting application

I pulled the latest changes to the repo and downloaded the data from the dds project. I was able to get the app launched locally under docker, and have identified the following issues:

  • The expression_join object is missing - I don't see a script that generates it.
  • Increase memory limits in openshift project. Our cloud project has a default limit of 2GB. The app requires at least 2.6 GB of memory to run, so I requested an increase to 8GB quota. We may be able to optimize this but I wanted to get it up and running first.
  • app.R loads files without the versioning prefix (e.g. achilles.RData) but the files we have include the prefix (e.g. 19Q3_achilles.Rdata). I addressed this with symlinks for now but it relates to #6
  • The includeMarkdown(here::here("code", "methods.md")) fails under openshift because it's looking in the data directory for code
  • make_enrichment_table on line 93 fails with Error in .f: object 'enrichr' not found
  • Report download fails, but the error is 'expression_join' not found

I'll create separate issues for showing you how to log in to openshift, reconfiguring the auto-build from GitHub, and configuring the CNAME. But we're close!

consolidate to a single data generation method

We currently have two files that both specify the data generation process:

code/quarterly_release.R is currently used for developing and testing new releases

https://github.com/matthewhirschey/ddh/blob/16b744b7e1089a74a783006fbc3c10ba1c7198e3/code/quarterly_release.R#L4-L9

The Makefile is used to generate data for the website.

https://github.com/matthewhirschey/ddh/blob/16b744b7e1089a74a783006fbc3c10ba1c7198e3/Makefile#L48-L58

I would like for us to consider dropping one of these methods.

Discussion: https://github.com/matthewhirschey/ddh/pull/72#discussion_r392593820

create_gene_summary fails with synonyms not found

Running Rscript code/create_gene_summary.R fails with:

Error in is_string(y) : object 'synonyms' not found
Calls: create_gene_summary ... vars_select_eval -> map_if -> map -> .f -> : -> is_string
Execution halted

This is the problematic code:
https://github.com/hirscheylab/ddh/blob/ce56caaa77f8bd33e220728717d114516af97842/code/create_gene_summary.R#L58-L59

I think this is due to a change in the value returned from gene_names.org:
https://github.com/hirscheylab/ddh/blob/ce56caaa77f8bd33e220728717d114516af97842/code/create_gene_summary.R#L8

I think the synonyms column has been renamed to alias_symbols.

Example data returned:

   hgnc_id approved_symbol approved_name previous_symbols alias_symbols
   <chr>   <chr>           <chr>         <chr>            <chr>        
 1 HGNC:5  A1BG            Alpha-1-b gl… NA               NA           
 2 HGNC:3… A1BG-AS1        A1bg antisen… NCRNA00181, A1B… FLJ23569     
 3 HGNC:2… A1CF            Apobec1 comp… NA               ACF, ASP, AC…
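One low-risk fix would be to tolerate either column name when the response comes back, so the script survives future renames. A minimal sketch (column names taken from the example above):

```r
library(dplyr)

# If the HGNC response uses the new alias_symbols name, rename it back to
# synonyms so downstream select() calls keep working.
normalize_synonyms <- function(df) {
  if (!"synonyms" %in% names(df) && "alias_symbols" %in% names(df)) {
    df <- rename(df, synonyms = alias_symbols)
  }
  df
}
```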

DepMap using non-official gene symbols

I've noticed a few instances where DepMap does not use the official gene symbol. E.g.
Official gene symbol: MMUT
DepMap: MUT

If you put in MUT, the first gene-lookup page gives the appropriate response that this is not the official gene symbol. If you put in MMUT, it gives the Entrez Gene summary, but none of the data is shown on subsequent pages. If you put in MUT and go past the first page of the summary, the appropriate information is shown.

Need to think about the best way to assess (count) these mismatches, and then how to handle them.

Code Duplication between R Markdown and R scripts

There is some code duplication between some R Scripts and R Markdown files

The R Scripts above are meant to be easily run to build data for a production environment.
The R Markdown files are exploratory in nature and contain important notes/comments that we wish to preserve. The R Scripts were created based on the R Markdown files.

Use readRDS/saveRDS instead of read/load

We currently use a mixture of Rds and RData files.
An RData file can contain multiple objects; an Rds file contains exactly one.
Where we use RData files, we only ever store a single object in them.

One drawback to RData files is that it is not obvious what items are loaded.
For example:
https://github.com/hirscheylab/ddh/blob/510637d8afbe6cfcf74595b2cb629dec36d94301/code/app.R#L26

In contrast where we load a Rds file the item name is explicit:
https://github.com/hirscheylab/ddh/blob/510637d8afbe6cfcf74595b2cb629dec36d94301/code/app.R#L34
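A minimal sketch of the proposed convention (stand-in data and a temp path for illustration; the real code would use here::here("data", ...)):

```r
# One object per .Rds file; the object name is explicit at the load site.
gene_summary <- data.frame(symbol = c("TP53", "BRCA1"), stringsAsFactors = FALSE)
path <- file.path(tempdir(), "gene_summary.Rds")
saveRDS(gene_summary, path)
gene_summary <- readRDS(path)  # the reader sees exactly which object is created

# Compare with load(), where the created names are invisible at the call site:
# load(here::here("data", "gene_summary.RData"))  # silently creates `gene_summary`
```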

Gene names spread across table columns

Current deployment (number 15 in depmap-shiny) has gene names spread across columns.
This is due to a dependency issue for the DataTables package.

Easy fix

  • Install DT to the container (install.packages("DT"))

  • Load DT library in app.R

Gene names spread across columns; snapshot of www.datadrivenhypothesis.org

cannot search for C#orf###

C-open-reading-frame genes (Corfs, e.g. C9orf72) have lowercase characters in their official gene symbols, so str_to_upper(input$gene_symbol) breaks these lookups.
I'll add an if_else to detect this rare case.
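A sketch of that special case. The regex is an assumption about the symbol pattern (C, chromosome number, "orf", number); the real check may need to consult gene_summary instead:

```r
library(dplyr)
library(stringr)

# Uppercase user input, but keep "orf" lowercase for C#orf### symbols like C9orf72.
normalize_symbol <- function(x) {
  if_else(str_detect(x, regex("^c[0-9]+orf[0-9]+$", ignore_case = TRUE)),
          str_replace(str_to_upper(x), "ORF", "orf"),
          str_to_upper(x))
}
```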

capture user emails

Short term:
Would be good to ask for and capture user email addresses, perhaps on the download report page (as opposed to a login front page required for use of the site); getting user information would be helpful for notifying of future releases, etc.

Long term:
Would be good to have login credentials for users that would allow them to store previous lookups, previous reports (PDFs). Inspired by gene-clime.org (Broad Institute resource)

Allow users to view data for a pathway

Currently the website only allows users to view data for one gene.
During our last meeting @hirscheylab mentioned that supporting a pathway oriented view of the data would be a meaningful expansion to the value this website provides.
For this improvement we need to consider how to organize the content into separate pages.

improve instructions for running with singularity

There are some instructions in the README for how to generate /data using singularity:

This project also supports using a container runtime like singularity to run Rscript. To run under Singularity, set the RSCRIPT_CMD environment variable as noted in build-slurm.sh. This script expects site-specific environment variables to be exported from a config.sh file. This file is not included in the repo, as it …

The Makefile also includes a container_image target that will download the docker image named in the DOCKER_IMG variable and prepare it for singularity.

Please provide instructions on how to set up/build the container_image and use it with build-slurm.sh.

'gene_symbol not found'

gene_symbol does not appear to be dynamically set. I sometimes code gene_symbol as a hard value to work on functions outside of the shiny app. I recently cleared my environment, and now have this error. Not sure which code you or I wrote to break it (TBH, it was me), but want to kick it over to you to see if you have any ideas.

Screencast explaining the error here

Fix 'FATAL: kernel too old' error generating data

When attempting to generate data I started receiving a FATAL: kernel too old error.
This is coming from singularity when it attempts to run the image built from the docker image created for this repo (matthewhirschey/ddh:latest).
In #92 we upgraded rocker/tidyverse:3.6.1 to rocker/tidyverse:3.6.3 to fix an error generating the correct data. The underlying problem seems to be the version of glibc used in tidyverse:3.6.2+ images.
I think this singularity issue is related - apptainer/singularity#1275

"<a href" in webpage title

Problem

There is a strange title when not on the initial landing page.
It looks like an HTML anchor tag is accidentally being used for the title.

Example link https://www.datadrivenhypothesis.org/?show=detail&content=gene&symbol=BRCA1
[Screenshot, 2020-05-17]

Relevant code:
https://github.com/matthewhirschey/ddh/blob/df2a08bf9f553860e714a4c56dbcdff87f1857c6/code/app.R#L274

https://github.com/matthewhirschey/ddh/blob/df2a08bf9f553860e714a4c56dbcdff87f1857c6/code/app.R#L321

Possible solution

I think we just need to add the windowTitle argument to all calls to navbarPage: https://shiny.rstudio.com/reference/shiny/1.0.5/navbarPage.html
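A sketch of the fix (the navbar title contents shown here are assumed, not the app's actual markup):

```r
library(shiny)

# windowTitle supplies plain text for the browser tab, so the HTML anchor used
# as the navbar title no longer leaks into the page <title>.
ui <- navbarPage(
  title = a(href = "/", "Data-Driven Hypothesis"),  # rich HTML for the navbar itself
  windowTitle = "Data-Driven Hypothesis",           # plain text for the browser tab
  tabPanel("Home")
)
```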

streamline/document update process

Broad releases new data every quarter, which is processed by this app.
Currently, updating for this data is a multi-step process:

  1. @hirscheylab performs some data validation, then creates a PR that updates release name, URLs and methods details. Example PR: #57
  2. After the above PR is merged a docker image is automatically built. Wait for the docker image to build.
  3. On our HPC cluster I generate the data and upload the results to a DukeDS project. This process uses a singularity image created from the above docker image and a clone of this repo.
  4. On my laptop I update the list of files to download into the openshift ddh-data volume based on the new contents of the DukeDS project. Then I rerun the job to download data and manually redeploy the app. Finally I create a PR with the updates to openshift/file-list.json.

I would like to simplify this process or at least have these details recorded so I don't forget them.

Notes based on steps above

Step 3 - Generate data - Docker Image

The docker image is used to supply the R libraries used by data generation. I currently also clone the repo and use the two in combination. I am wondering if I could use just the docker/singularity image.

Step 3 - Generate data - directories to create

I need to manually create the following directories after cloning the repo: logs, singularity/images, and data. I'm not sure why the Makefile doesn't create the data directory.

Step 3 - Generate data - sbatch commands

We have some notes here: https://github.com/hirscheylab/ddh#singularity
Basically I set up a config file, run sbatch build-slurm.sh, wait for it to finish successfully, then run sbatch upload-slurm.

Step 4 - Update website - Update list of files to download

To update the list of files to download into openshift we have openshift/make-file-list.py. This script lists every file in the DukeDS project, not just the current release, so I manually remove the older files from the list.

Step 4 - Update website - Download

Downloading data into the openshift app requires installing and configuring the OpenShift oc command.
Rerunning the job to download data usually requires deleting the previous job first:

oc delete job download-ddh-data

Then creating/running the job to download data:

oc create -f DownloadJob.yaml 

Once the job finishes I redeploy the website using the okd application console: depmap -> Applications -> ddh-shiny-app -> Click Deploy.

FYI: @dleehr

enrichr_loop column conversion error

I encountered an error running generate_depmap_pathways.R with the 19Q4 data.

When running the enrichr_loop for gene VPS45 with the 19Q4 data the following error occurs:

Error: Column `Term` can't be converted from character to integer

This error is raised from here:
https://github.com/hirscheylab/ddh/blob/aeab2187e864d516f598858dbf57960ee04930ac/code/generate_depmap_pathways.R#L23

You can reproduce this error by running the following:

gene_list <- c("SPANXN5","ART1","IFNK","DEFB124","MT1B","CA5A","PPIAL4G","G6PC2","LYPD2","SSX3","RFPL3","FMO2","OR8S1","PMPCB","IL26","XAGE3","LRRC18")
focused_lib <- c("PPI_Hub_Proteins", "Rare_Diseases_AutoRIF_ARCHS4_Predictions")
enrichr_loop(gene_list, focused_lib) 

The problem seems to be that the PPI_Hub_Proteins response contains a single row whose Term column holds the integer value 231403, while all the other libraries return a character value for Term. So the error occurs when we bind these data frames together.

I was able to work around this issue by casting the Term field to character before running bind_rows.
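The workaround, sketched below with stand-in data (the real code would apply the cast inside enrichr_loop to each library's response):

```r
library(dplyr)
library(purrr)

# Stand-in for the per-library Enrichr results: one data frame per library.
# PPI_Hub_Proteins returns an integer Term; the others return character.
results_list <- list(
  PPI_Hub_Proteins = data.frame(Term = 231403L),
  Rare_Diseases    = data.frame(Term = "p53 signaling", stringsAsFactors = FALSE)
)

# Coerce Term to character everywhere before combining, so bind_rows()
# never sees a character/integer type conflict.
combined <- results_list %>%
  map(~ mutate(.x, Term = as.character(Term))) %>%
  bind_rows()
```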

Need better way to rank search output

While it seemed OK to scroll to find the gene of interest in a small pond of genes, in the case of "TP53" you never see the actual gene: the head=10 threshold means several other alphabetically ranked genes push TP53 off the bottom of the list.

Need to think about a better way to return gene of interest.

One idea: sequential search. Instead of str_detect... | str_detect, can we

  1. Search gene_name %>% (most specific)
  2. Search aka %>% (most likely alternative)
  3. Search approved_name (most generic)

And then row_bind, but never resort? And then present up to 20 (10 genes, 10 pathways, max) but probably fewer choices?
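The sequential-search idea might look something like this (the column names `approved_symbol`, `aka`, and `approved_name` are assumed from gene_summary; the real implementation would also split the gene/pathway caps):

```r
library(dplyr)
library(stringr)

# Search each field in priority order, bind without re-sorting, and cap results.
rank_search <- function(query, gene_summary, limit = 20) {
  by_symbol <- filter(gene_summary, str_detect(approved_symbol, query))  # most specific
  by_aka    <- filter(gene_summary, str_detect(aka, query))              # likely alternative
  by_name   <- filter(gene_summary, str_detect(approved_name, query))    # most generic
  bind_rows(by_symbol, by_aka, by_name) %>%
    distinct() %>%   # a gene may match more than one field
    head(limit)      # never resort, so exact-symbol hits stay on top
}
```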

Configure custom domain in openshift hosting

Application is currently hosted on a subdomain of apps.cloud.duke.edu through an openshift route. We can use a custom domain name, as described here: https://openshift-docs.cloud.duke.edu/cluster-details/dns/#non-clouddukeedu

  • Determine hostname to use (we may want a dev and a prod hostname)
  • Create CNAME record to point to os-node-lb-fitz.oit.duke.edu in DNS (@hirscheylab )
  • (if needed) request approval for the host name from OIT
  • Update the openshift route for the new hostname

Make file extensions consistent

I noticed when building the makefile in #5 that some of the scripts use the Rdata extension and some use RData. Which is preferred?

.Rdata

generate_depmap_data.R:save(achilles, file = here::here("data", paste0(release, "_achilles.Rdata")))
generate_depmap_data.R:save(expression, file = here::here("data", paste0(release, "_expression.Rdata")))
generate_depmap_data.R:save(expression_id, file = here::here("data", paste0(release, "_expression_id.Rdata")))
generate_depmap_data.R:save(achilles_cor, file = here::here("data", paste0(release, "_achilles_cor.Rdata")))
generate_depmap_pathways.R:load(file = here::here("data", paste0(release, "_achilles_cor.Rdata")))
generate_depmap_stats.R:load(file = here::here("data", paste0(release, "_achilles_cor.Rdata")))
generate_depmap_tables.R:load(file = here::here("data", paste0(release, "_achilles_cor.Rdata")))

.RData

app.R:  # Read gene_summary saved as RData using: save(gene_summary, file=here::here("data", "gene_summary.RData"))
app.R:  load(here::here("data", "gene_summary.RData"), envir=tmp.env)
app.R:load(file=here::here("data", "achilles.RData"))
app.R:load(file=here::here("data", "achilles_cor.RData"))
app.R:load(file=here::here("data", "expression_join.RData"))
app.R:load(file=here::here("data", "master_bottom_table.RData"))
app.R:load(file=here::here("data", "master_top_table.RData"))
app.R:load(file=here::here("data", "master_positive.RData"))
app.R:load(file=here::here("data", "master_negative.RData"))
create_gene_summary.R:gene_summary_output_filename <- "gene_summary.RData"
generate_depmap_pathways.R:load(file = here::here("data", "gene_summary.RData"))
generate_depmap_pathways.R:save(master_positive, file=here::here("data", "master_positive.RData")) #change file name to include decX
generate_depmap_pathways.R:save(master_negative, file=here::here("data", "master_negative.RData")) #change file name to include decX
generate_depmap_tables.R:load(file = here::here("data", "gene_summary.RData"))
generate_depmap_tables.R:save(master_top_table, file=here::here("data", "master_top_table.RData"))
generate_depmap_tables.R:save(master_bottom_table, file=here::here("data", "master_bottom_table.RData"))

Makefile parallel compatibility

Some of the Makefile rules unnecessarily run commands multiple times when using the parallel jobs option (-j <num_jobs>). This flag tells make that it can run up to num_jobs processes instead of the default of 1. Typically this would just speed up running the Makefile.

An example of the problem can be seen by running:

make depmap_data -j 10

It will run 5 copies of Rscript code/generate_depmap_data.R.
The rules that have this problem are those with multiple output targets. In this case there are 5 output targets.

This isn't a high priority since we can just make targets that need to run in parallel separately. The rest we can run without the parallel jobs flag.

More details: https://github.com/hirscheylab/ddh/pull/19/files#r363313502
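One common fix, sketched below under the assumption that all five outputs come from a single Rscript run, is a sentinel file (on GNU make >= 4.3, grouped targets with `&:` are an alternative). File names here are illustrative:

```make
# The real rule lists five *.Rdata outputs; two shown for brevity.
depmap_files := data/achilles.Rdata data/expression.Rdata

# Each output depends on the sentinel; the empty recipe (';') marks them as built.
$(depmap_files): .depmap_data.sentinel ;

# The generating script runs exactly once, even under -j.
.depmap_data.sentinel: code/generate_depmap_data.R
	Rscript code/generate_depmap_data.R
	touch $@
```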

more informative error for multi-gene graph

Problem: If a user enters custom genes (or a pathway?) that do not draw a graph, the error is uninformative. See pix below for same gene pair with two different filters.

Possible solution: Add a validate() in make_graph().

[Screenshots, 2020-05-18: the same gene pair with two different filters]
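A sketch of that solution. The wrapper name and message are illustrative; the real change would go inside make_graph() itself:

```r
library(shiny)

# validate()/need() stop rendering with a readable message instead of a traceback.
make_graph_checked <- function(graph_data) {
  validate(
    need(!is.null(graph_data) && nrow(graph_data) > 0,
         "No connections meet the current filters; try relaxing the thresholds.")
  )
  make_graph(graph_data)  # existing plotting function, reached only when data exists
}
```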

Configure openshift build/deploy from GitHub repo

We have a few decisions to make here and probably some basic tutorial/training on openshift, but the goal is to allow @hirscheylab to push changes to GitHub that result in the application auto-deploying on Openshift. Currently the openshift app deploys from our fork (https://github.com/Duke-GCB/ddh), using the openshift-download-dds-data branch.

I propose the following:

  1. Configure a development instance of the application (related to #9) to auto-deploy changes pushed to a development branch. This could remain master or switch to develop. Either way, @hirscheylab's development would happen on this branch and changes could easily be seen live.
  2. Configure a production instance of the application that auto-deploys from a production branch. This branch does not yet exist, but should only be updated after changes have been tested on the development instance.
  3. Meet with @hirscheylab to demonstrate how to access openshift, manually start a deployment, roll back to previous versions, and add dependencies to the build.

Auto-build not functioning

On 3/30 @johnbradley wrote:

I saw you merged your PR. For some reason the build didn't automatically happen - I'll look into this later.
I manually triggered the build. It looks like PDF generation is fixed on the website now.

For the PR merge on 5/13, auto-build is also not working (looking at manage.cloud.duke.edu)

Error in UseMethod("arrange_") running generate_depmap_pathways.R

When running generate_depmap_pathways.R for certain genes the following error occurs during the Bottom pathway enrichment analyses phase:

Error in UseMethod("arrange_") : 
  no applicable method for 'arrange_' applied to an object of class "NULL"

Some example genes that demonstrate this issue: ACBD5, ACTL7B, ADORA1, and AGAP9
There are 149 genes that have this problem. These genes have no correlated genes after filtering for achilles_lower.

I think the error occurs in the arrange function call here:
https://github.com/hirscheylab/ddh/blob/56cda91836d2cd95ba63c38502ce825c649740ee/code/generate_depmap_pathways.R#L102-L113

For these genes enrichr_loop returns NULL since gene_list is empty.
This causes the above error.
https://github.com/hirscheylab/ddh/blob/56cda91836d2cd95ba63c38502ce825c649740ee/code/generate_depmap_pathways.R#L16-L20
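A guard along these lines would avoid the NULL. The stub below stands in for enrichr_loop (real signature differs), and the Adjusted.P.value column name is an assumption about the Enrichr output:

```r
library(dplyr)

# Stand-in for enrichr_loop(), which returns NULL when gene_list is empty.
enrichr_loop_stub <- function(gene_list) {
  if (length(gene_list) == 0) return(NULL)
  data.frame(Term = "p53 signaling", Adjusted.P.value = 0.01,
             stringsAsFactors = FALSE)
}

# Guard: skip arrange() when there is nothing to rank.
rank_pathways <- function(gene_list) {
  pathways <- enrichr_loop_stub(gene_list)
  if (is.null(pathways) || nrow(pathways) == 0) {
    return(data.frame())  # empty result; downstream code must tolerate this
  }
  arrange(pathways, Adjusted.P.value)
}
```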

Improve concurrent user handling / scalability

After deploying ddh to openshift, we find that the shiny server process (Shiny Server Open Source) can only respond to a single user action at a time. A consequence is that when an instance is busy doing work (e.g. spending 5-10 seconds building a PDF for download), it will not respond to any other requests (including displaying the main page) until the action completes. We discussed this on the call today, and it's an important concern before promoting the site. See also #45 (comment)

To handle traffic from tens or hundreds of active users, we should run more than 1 instance. While 1 is probably too few, 100 would be overkill; I suspect 5-10 would be a good fit. Some tasks below that I'll assign 😄

  • We need to have an idea how many users to expect. Do we expect 100 users all attempting to download 50 reports each for 4 hours at a time? Do we expect 1000 users to load the site, 100 to browse data, and 50 to download reports over a 6 hour period? @hirscheylab
  • We need to understand how the application behaves when it is at or over capacity. @johnbradley
  • Shiny-server open source is not stateless and user actions depend on sessions. So any load balancing configuration that distributes to multiple replicas would need to keep sessions sticky to replicas. In our current configuration with 2 replicas, we need to confirm this is working correctly (It seems to be) @dleehr

Other notes:

  • Thanks to some optimizations (#34, #52), each instance uses 1GB RAM (down from 5GB). Our project quota is 4GB for requests, so we could run up to 4 instances. 4 or 5 is probably a good start to handle 20-50 users. There is increased cost to using more sustained resources, so it's important to get the balance right.
  • Openshift does support horizontal pod autoscaling. This automatically launches or terminates pods based on CPU or memory utilization. This only works if those metrics are a proxy for server load. With shiny server, I haven't seen these values fluctuate when using the app. It's either busy or not busy. Unless we can rewrite the app to use more memory or CPU when it's busy (and less when it's not), I don't think this helps.
  • I suggested moving some of the longer-running actions (like PDF download and plot generation) to dedicated URL routes/paths so that they do not block simple browsing. We could run fewer instances, but users who want to get reports may have to wait. I think that's a reasonable tradeoff, and we do this in datadelivery using nginx for long-running downloads. That application is stateless and has clear URL paths for different services, and those don't hold true for this one. I have some other ideas along this line, but they require some more significant refactoring/rearchitecting.

Streamline the release version in scripts and filenames

The R scripts all currently set the release version as 19Q3, and the makefile proposed in #5 hard-codes it in the expected filenames. While it may seem simple to change this in the code, the best practice would be to provide it as a variable when generating the data files, and not have it be hard-coded in scripts and Makefiles that are otherwise reusable.

We don't need an immediate solution, but a good use of issues is to suggest and discuss solutions, so let's :)
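One direction, sketched below. The environment-variable name is an assumption, not an existing convention in this repo:

```r
# Read the release once, from the environment, with a fallback for local work.
release <- Sys.getenv("DDH_RELEASE", unset = "19Q3")
achilles_file <- file.path("data", paste0(release, "_achilles.Rdata"))
```

The Makefile could then export the same variable so the scripts and the expected filenames stay in sync across releases.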

simplify report generation code

The current report generation logic repeats each Rmd filename multiple times and duplicates the logic to move the markdown to a temp directory. Streamline this logic.

For example in render_report_to_file:
https://github.com/matthewhirschey/ddh/blob/25566ee751e4f6f78db6ecc5ea5d0c80037a8c8d/code/fun_reports.R#L112-L121
Then within render_complete_report()
https://github.com/matthewhirschey/ddh/blob/25566ee751e4f6f78db6ecc5ea5d0c80037a8c8d/code/fun_reports.R#L77

Change the code so we only reference the Rmd file once.
Have one location that makes a temp directory/copies the Rmd and renders it.

Background

When the website runs in OpenShift the /code directory is read only.
This is good from a security perspective, but it causes problems for rmarkdown::render(), which creates intermediate files in the same directory as the .Rmd file it is rendering.
Even if we made /code writable, I am concerned that two users generating PDFs at the same time would get their wires crossed. So to use rmarkdown::render() we need to copy the .Rmd files to a temp directory first.
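A sketch of the single-location helper (name and signature are assumptions):

```r
# Copy the Rmd into a fresh per-call temp directory and render there, so the
# read-only /code directory is never written to and concurrent users cannot
# collide on intermediate files.
render_in_tempdir <- function(rmd_path, output_file, params = list()) {
  tmp <- tempfile("report_")  # unique directory per call
  dir.create(tmp)
  staged <- file.path(tmp, basename(rmd_path))
  file.copy(rmd_path, staged)
  rmarkdown::render(staged, output_file = output_file, params = params)
}
```

Each report function would then reference its Rmd filename exactly once, passing it to this helper.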

twitter bot

I wrote a little R script [code/twitter_bot.R] that takes the data, generates some text and four plots, and posts them to a new twitter account (@ddhypothesis) that I made just for this. I got a dev account with twitter and set up the tokens for access. Locally, everything works fine.

Q: Can you help me set up a cron job scheduler for this on the server* to run the script 1x/day? (*is it allowed?)

Given the app is always loaded, I thought this might be a fun way to drive some interest.

p.s. I'm still working in the enhance pathways branch, so pushed changes there. Also, I .gitignored the R file where I save the tokens. This is supposed to be loaded into the environment, so will need to do that for it to have access.

remove bias in determining na_cutoff value

The na_cutoff value is arbitrary.
I'm thinking of removing the worst 5% of values, as determined by the number of genes in the gene list (too many = garbage).

To address:

  • Write function to find_threshold()
  • Test threshold locally (don't break data)
