GWAS Catalog Ontology and Curation Infrastructure

License: Apache License 2.0

Java 80.39% PLSQL 2.04% CSS 1.06% JavaScript 2.04% HTML 12.97% Shell 0.21% Python 1.20% Perl 0.08% Dockerfile 0.03%

goci's Introduction

GOCI

GWAS Catalog Ontology and Curation Infrastructure from SPOT at EBI.

Introduction

This project is a result of a collaboration between the NHGRI and the EBI to produce ontology-based curation and search functionality for the GWAS catalog. This includes ontology-based query expansion in the public interface and curator tools for annotating studies as they are entered into the GWAS catalog.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Building this Project

This project is built with Maven (http://maven.apache.org) so make sure you have an up-to-date installation of Maven before proceeding.

Clone this project, change to the root GOCI directory and run

mvn clean install

to build all binaries of this project.

Module Structure

The GWAS Ontology and Curation Infrastructure (GOCI) is organised into several main strands: tools for working with the ontology, tools for enhancing curation activities, and tools to generate a diagram of GWAS catalog data.

GOCI Core

This module hosts the core classes underlying the GOCI tooling suite. It includes sub-modules for the key model objects that GOCI is based on, repository and service modules for accessing the data model and modules for diagram generation and interacting with ontologies.

GOCI Interfaces

This module includes all the different ways to interface with the GWAS Catalog. It includes modules for the curation system, the public GWAS Catalog portal, the diagram generation service and a placeholder module with some config for the GWAS Solr index.

GOCI Tools

This module contains a range of stand-alone tools, including a datapublisher to convert the relational GWAS database into RDF/OWL, a mapper to annotate Catalog data with genomic context information from Ensembl via their REST API, a Solr indexer to load the database into a Solr index, as well as a range of util classes and one-off tools used for analysis.

GOCI Parent & GOCI Dependencies

These are convenience modules used for dependency management.

Acknowledgements

The GOCI project makes use of many public and freely available software resources - we would like to thank them all for their continued support.

Bamboo: Continuous integration, continuous deployment and release management.
Fisheye: Browse, search and track your source code repositories.
Spring: Application framework and inversion of control container for the Java platform
ThymeLeaf: Template engine capable of processing and generating HTML, XML, JavaScript, CSS and text, and can work both in web and non-web environments
Bootstrap: HTML, CSS, and JS framework for developing responsive, mobile first projects on the web.
OWL API: Java OWL API
HermiT: OWL reasoner
Solr: Search server
OpenLink Virtuoso: RDF triple store provider for SPARQL endpoint
Apache Tomcat: Web server
Apache Maven: Software library dependency management
GitHub: source code hosting

goci's Issues

Update curation queue metrics script

Update curation script in the gwas-curation-utils repo named curation_queue_with_ancestry.py to send an email with the script output and schedule this to run Sunday night at 19:00.

Update link to Known issues

On this page: https://www.ebi.ac.uk/gwas/docs/known-issues, a link to the old Jira tickets is used. Suggest to update to GitHub or remove the link and/or page.

Add link from submission summary page to a view of the imported studies

After importing a submission from the submission summary page, curators should be able to link directly to a view of the imported studies and begin editing them.

Link either to Curation Home, filtered by the PMID of the imported submission, or to the publication-centric view for that PMID. This decision (which location to link to) should wait until more curators have started using the new submission/import process and we can collect more information about the preferred workflow.

Studies are not consistently imported into the curation interface

Example: Kristjansson RP 26838040

There are 6 studies in the template, but when I clicked “Import”, only 2 studies were imported into the curation interface. Also the submission still appeared in the View Submissions list as if it hadn’t been imported yet. I deleted the new study entries and tried again, but this time only 1 new study was created (the first one in the template). Again, the submission still appears as if not imported in the View Submissions list.

Create Download of all Available GWAS including no PubMed ID

To support the pre-published, unpublished workflow, we need to create a new download of all studies including studies from publications without a PubMed ID (from Deposition). Will need to determine what fields to include in download, format and link from download page.

Import button in curation interface remains active after being clicked

After clicking the “Import” button, it remains active. It should be disabled once the data is imported, to prevent multiple import actions for the same submission.

For investigation: if the button does remain active, what happens if it is clicked a second time? Is the import action repeated? How does this affect the imported data/studies?
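
Disabling the button only protects the browser side. As a minimal sketch of a complementary server-side guard, the following remembers which submissions have already been imported and ignores repeat requests. The class name ImportGuard, its method and the submission ID are hypothetical and are not taken from the GOCI code base.

```python
from typing import Callable, Set


class ImportGuard:
    """Remembers imported submissions so that a second click is a no-op."""

    def __init__(self) -> None:
        self._imported: Set[str] = set()

    def import_once(self, submission_id: str, do_import: Callable[[str], None]) -> bool:
        """Run do_import the first time only; return True if the import ran."""
        if submission_id in self._imported:
            return False
        # If do_import raises, the ID is not recorded, so a retry stays possible.
        do_import(submission_id)
        self._imported.add(submission_id)
        return True


# Simulate two clicks on "Import" for the same (hypothetical) submission.
calls = []
guard = ImportGuard()
first = guard.import_once("submission-1", calls.append)
second = guard.import_once("submission-1", calls.append)
```

In a real deployment the set would have to live in the database rather than in memory, so that the guard survives restarts and works across several application instances.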

website is down - test

search homepage returns 404

remove unused modules

We identified a number of modules that are no longer being used; these should be removed to avoid confusion.
https://docs.google.com/spreadsheets/d/1dvJ6u8dww2Bp98eCykk4TfCqkliIBeVFgonrRJLQX_k/edit#gid=0

Add default note to the curator version of the Excel submission form

In the curation interface, when a curator creates a new Note with the Note subject “Initial extraction”, it automatically populates the corresponding Note field with a default note containing a list of headings. Curators would like the same default note to be available in the Excel submission template. The blank curator template could have the first note prefilled, with the default note in the Note field and “Initial extraction” in the Note subject field.

Indicate active submission PubMed ID in curation

Curators need a visual indicator that a publication is being submitted by the author prior to curation, to prevent duplication of effort and allow collaboration with author early in review process. Curation app should show a visual indicator next to PubMed ID to indicate that there is an active submission and allow the curator to view the submission data in the Deposition app.

Bug in construction of sample description for replication samples after import

This bug concerns studies imported into the curation interface from deposition.

For replication samples only, when the Number of cases and Number of controls are not given, they are being incorrectly imported as 0 in the curation interface - and therefore the free-text replication description incorrectly reads “0 cases, 0 controls” (Example: Chibnik LB 28322283). Instead these fields should be blank.
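
The expected behaviour can be sketched as follows, assuming that a missing count is represented as None and that the description is a simple comma-separated string. The function name and the exact wording of the description are illustrative assumptions, not the format used by the curation interface.

```python
from typing import Optional


def describe_replication(cases: Optional[int], controls: Optional[int]) -> str:
    """Build a free-text description, skipping counts that were not given."""
    parts = []
    if cases is not None:
        parts.append(f"{cases} cases")
    if controls is not None:
        parts.append(f"{controls} controls")
    # An empty string means "leave the field blank".
    return ", ".join(parts)


blank = describe_replication(None, None)
given = describe_replication(120, 300)
true_zero = describe_replication(0, 0)
```

The point of the sketch is that a count that was never supplied is kept distinct from a genuine zero, so only the latter is ever written out as "0 cases, 0 controls".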

Change colouring in Excel submission template

In the Excel template, the summary stats column headers should be orange as they are mandatory. This applies to both the curator and external user versions of the template.

User requested change to separator for multiple EFOs in the download file

Reported on gwas-info 26-02-20:
In the GWAS Catalog download TSV (both versions 1.0 and 1.01), the EFO name column (MAPPED TRAIT) can have multiple values, which are separated by commas. However, several GWAS Catalog EFO traits have commas in their names (e.g. EFO_0008009, 1,5 anhydroglucitol measurement), which means it is impossible to parse this column correctly without prior knowledge of these terms.
Solutions to this problem may be to quote the commas in the EFO names, escape them with a preceding slash or backslash, or use a different separator string. Currently ";" or "|" would work with all GWAS Catalog traits.

Need to investigate whether we can change this separator. It would definitely improve the usability of the download file.
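
The ambiguity is easy to reproduce. The sketch below joins two trait names, one of which contains a comma, and then splits the cell again with each separator. The second trait name and the exact spacing around the separator are assumptions made for illustration.

```python
from typing import List


def split_traits(cell: str, sep: str) -> List[str]:
    """Split a MAPPED TRAIT cell on sep and trim surrounding whitespace."""
    return [part.strip() for part in cell.split(sep) if part.strip()]


# Two traits; the first one has a comma inside its name.
traits = ["1,5 anhydroglucitol measurement", "body mass index"]

comma_cell = ", ".join(traits)
pipe_cell = " | ".join(traits)

comma_parsed = split_traits(comma_cell, ",")  # the first trait is split in two
pipe_parsed = split_traits(pipe_cell, "|")    # both traits survive intact
```

With the comma separator a consumer gets three fragments instead of two traits, whereas the "|" separator round-trips cleanly.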

Allow curators to bypass the checklist before creating a submission in the deposition interface

Curators will be working with the system regularly, so should already have their Elixir ID, Globus registration etc. Therefore going through the tick boxes is unnecessary and time-consuming. Can we allow curators (but not external users) to bypass this step?

Tag Publications as Under Submission

Curators need to be able to see which of the studies they are working on might be affected by an author submission. An indicator should be added next to the PubMed ID to show which publications also have a submission in the deposition system.

Monitor PubMed for publication of pre-published paper

Curators should be notified if a PubMedID is added to queue that may already have been submitted pre-publication. Send alert message or automatically update with PMIDs via tracking of EuropePMC or BioRxiv status (RSS feed?)

Create 'Requires Review' state for curation

When a publication is imported from deposition, the existing studies should be moved to a state to indicate they were changed by import. We need to create a new state to filter the old studies from the data reports. 'Requires Review' indicates the curators need to review the old studies for potential deletion.

Create time-sensitive submission

Users should have the ability to embargo submissions so they are not released publicly in Catalog until after publication or at a date specified by the submitter. The system should monitor submissions for a release_date flag and not publish any studies prior to their release date. The system should also provide the ability to remove the release_date and allow any studies to be released at the next scheduled interval.
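
A minimal sketch of the release rule described above, assuming that each submission carries an optional release_date and that a missing date means "release at the next scheduled interval". The Submission class and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Submission:
    submission_id: str
    release_date: Optional[date] = None  # None means no embargo


def releasable(submissions: List[Submission], today: date) -> List[Submission]:
    """Return the submissions whose embargo is absent or has expired."""
    return [
        s for s in submissions
        if s.release_date is None or s.release_date <= today
    ]


queue = [
    Submission("a"),                                   # no embargo
    Submission("b", release_date=date(2020, 3, 1)),    # embargo already over
    Submission("c", release_date=date(2020, 12, 31)),  # still embargoed
]
ready = [s.submission_id for s in releasable(queue, today=date(2020, 4, 1))]
```

Removing an embargo then amounts to setting release_date back to None, after which the submission is picked up by the next scheduled run.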

Change "Readme file" column header in submission template

Change "Readme file" column header in submission template from "Readme file" to "Readme text"

Change documentation (submission instructions) to reflect this change

Add new page to GWAS Catalog website displaying a list of valid Countries of Recruitment

Background:
Users who are submitting data to the GWAS Catalog are asked to follow the instructions on this page to find out how to correctly fill in the "sample" tab in the sumstats+metadata template: https://www.ebi.ac.uk/gwas/docs/submission-summary-statistics-plus-metadata#samples. Users are currently directed to The United Nations M49 Standard of Geographic Regions for the list of allowed country names that can be entered in the field "Country of recruitment". However the country names in that resource do not exactly match the list of country names that we actually use in the database. Therefore, we need to provide users with our own definitive list of country names. This will help to ensure that users enter valid values in the template (we have had some problems so far with "U.K." vs "UK") and reduce the amount of editing by curators after import.

Tasks:

Add a new page to the GWAS Catalog documentation:

Suggested filename: countries.adoc
Suggested URL: https://www.ebi.ac.uk/gwas/docs/countries

(The new page does not have to be accessible directly from the GWAS Catalog homepage, so no new icons/buttons need to be created. The page will only be linked to from https://www.ebi.ac.uk/gwas/docs/submission-summary-statistics-plus-metadata.)

Provide an up-to-date list of countries of recruitment from the GWAS Catalog database:

you could either add the list directly to the new page or send a text file to Elliot to add later, whatever's easier. Don't worry about formatting, I can sort that out.

Create project containing multiple GWAS

Users should be able to submit data for a “project” that can have multiple GWAS. This can be for a publication with a PubMed ID, a publication or pre-print with a different identifier, a manuscript that has not yet been published, or a paper with no associated publication accession.

Download option and icon for .csv tables have disappeared from the UI

The icon for the option to download customised tables for studies, associations or all SumStats studies has disappeared from the user interface.

Add field for curators to record corresponding author

Background:
We need to be able to store the corresponding author email for each publication in the Catalog, so that in the future we can contact authors to request sumstats. Daniel previously did some investigation on whether we can pull this information from ePMC, but found it was not reliable for all publications. In the curation interface there are currently non-editable fields in the Study Details tab for corr author name and email, but they don't seem to be populated. Curators need to be able to add this information manually to each publication, and have the possibility to edit any information that is pulled in automatically.
Tasks:

Find out if the fields in the study details tab are being populated from anywhere
Add fields for curators to add this manually. This needs to be at the publication level, so the best place could be in the publication-centric view. The information should then be propagated to the Study Details tab for every study in a publication

Can't import author-submitted publication to curation interface

Nothing happens when I click 'import' on this page https://www.ebi.ac.uk/gwas/curation/submissions/5e84a7644c489b0001928d8e

PMID: 31969693

Author: Coleman JRI

Find ChEBI IDs for metabolites

For the Schlosser paper, find the ChEBI IDs for the metabolites: https://app.zenhub.com/files/2995118/b4dda359-081e-437d-a978-21dafa2eabd5/download.

Once all terms are mapped, import needed terms to EFO:
EBISPOT/efo#726

Support Unpublished and Prepublished GWAS

Possible incorrect rounding of p-values for imported associations

Possible incorrect rounding of p-values for associations imported from deposition into the curation interface

Example:
Buchwald J 32157176, in the study with Study tag “NMR”: for at least 1 SNP (rs56113850) the p-value was entered in the template as “5.54e-261”, so the mantissa should be rounded up to 6, but instead it appears in the curation interface rounded down to “5x 10-261”.

Please investigate how p-values are converted from floats (scientific notation) in the submission template into integer mantissa and exponent in the curation interface. How is the rounding being handled?
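
For comparison, here is one way the conversion could be done so that the mantissa is rounded rather than truncated. This is a sketch under stated assumptions and not a description of the current import code: it parses the template text with Python's decimal module, because a value such as 5.54e-261 is close to the lower limit of what a double-precision float can represent, and it rounds halves up. The function name split_pvalue is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Tuple


def split_pvalue(text: str) -> Tuple[int, int]:
    """Convert a p-value string into an (integer mantissa, exponent) pair."""
    value = Decimal(text)
    if value <= 0:
        raise ValueError("p-value must be positive")
    exponent = value.adjusted()  # power of ten of the leading digit
    mantissa = value.scaleb(-exponent).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    if mantissa == 10:
        # e.g. 9.6e-8 rounds to 10e-8, which is normalised to 1e-7
        mantissa, exponent = Decimal(1), exponent + 1
    return int(mantissa), exponent


reported = split_pvalue("5.54e-261")
carried = split_pvalue("9.6e-8")
```

Simply truncating the mantissa of 5.54 gives 5, which matches the value seen in the curation interface, although that alone does not prove truncation is the cause.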

Add check that effect value (beta/OR) falls within CI

A user reported an instance where an effect value did not fall within the quoted confidence interval (CI). This is an error as the effect (beta/OR) must fall within the CI quoted.

Implement a check of association data with an error produced if effect does not fall within CI.

Association data is entered into the Catalog in 3 ways:

Upload templates in deposition
Manual entry in curation interface
Summary statistics file upload

This QC will need adding in each of these places.

This check should also be carried out on all data currently in the Catalog, but the importance of doing this should be discussed first, as curation resources would be needed to follow up on any errors identified.
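
The check itself is small. A minimal sketch, assuming the effect and the CI bounds are available as numbers and that a missing value (None) means the check cannot be applied; the function name is hypothetical.

```python
from typing import Optional


def effect_within_ci(
    effect: Optional[float],
    lower: Optional[float],
    upper: Optional[float],
) -> Optional[bool]:
    """Return True/False for a complete record, or None if a value is missing."""
    if effect is None or lower is None or upper is None:
        return None  # nothing to validate
    if lower > upper:
        raise ValueError("CI lower bound is greater than the upper bound")
    return lower <= effect <= upper


ok = effect_within_ci(1.20, 1.10, 1.40)
bad = effect_within_ci(1.50, 1.10, 1.40)
skipped = effect_within_ci(None, 1.10, 1.40)
```

Keeping the rule in one function like this would make it easier to apply the same logic at all three entry points.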

Monitor prepublished submissions for publication in PubMed

To prevent duplicate publications being added into curation, we need to provide curators with feedback that a publication has previously been submitted by the author without a PubMed ID.
- Need to add status of submission publications to curation interface
- Need to display match likelihood for publication prior to study creation
- Need to add PubMed ID to publication in Deposition
- Changes needed to Curation and Deposition UI/backend

Incorrect status for some imported studies

When a submission is imported it should create new entries for each study in the publication - these should all take the status “Level 1 curation done”. The “old” entry for that publication in the curation interface (i.e. the one that was there before the import) should be given the status “Requires Review”.

At the moment, some of the newly created studies are incorrectly also getting the status “Requires Review”.

Examples:

Buchwald J 32157176, new study with Study tag “NMR” was incorrectly marked “Requires Review”.
Chibnik LB 28322283, new study with Study tag “Gross infarct” was incorrectly marked “Requires Review”.

Add data format check for "Update EFO Traits" page

Add data format checks for "Update EFO Traits" page so that the trait field accepts only text and the URI field accepts only a properly formatted URI, e.g. http://purl.obolibrary.org/obo/MONDO_0019472
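
As an illustration of what "properly formatted" could mean, the sketch below accepts only absolute http(s) URIs with a host and a non-empty path. These acceptance rules are an assumption; the curation app may need stricter ones, such as a whitelist of ontology prefixes.

```python
from urllib.parse import urlparse


def is_valid_trait_uri(value: str) -> bool:
    """Accept absolute http(s) URIs that have a host and a non-empty path."""
    value = value.strip()
    if not value or " " in value:
        return False
    parsed = urlparse(value)
    return (
        parsed.scheme in ("http", "https")
        and bool(parsed.netloc)
        and parsed.path not in ("", "/")
    )


full_uri = is_valid_trait_uri("http://purl.obolibrary.org/obo/MONDO_0019472")
bare_id = is_valid_trait_uri("EFO_0004541")
```

A bare term ID such as EFO_0004541 has no scheme or host, so it is rejected, while the full PURL is accepted.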

Assign data provenance for studies in project

Users should be able to submit provenance info for this “project”. Sources of provenance include PubMed ID, bioRxiv ID, UK Biobank ID, or no external unique identifier. Users should be able to see the connection between their data and the study or studies in the project.
(https://docs.google.com/document/d/1SvRSbuVeud7M5_86uVDYRzshac2qkqdcsL0k5MqoS58/edit?usp=sharing) (and Excel format https://docs.google.com/spreadsheets/d/19frfYpk2C_BU7zPiNMJaqetWrFOhnJ45hm7k16dvnw8/edit#gid=0)
Change to collect doi (e.g. for BioRxiv manuscript) rather than ID specific to a resource (e.g. asking for BioRxiv ID)
Add - URL to manuscript or information describing project

Range field imported incorrectly for studies with no OR or Beta

This bug concerns associations imported into the curation interface from deposition.

For associations with no effect size (i.e. no Odds ratio or Beta), the Range column should be empty. In Chibnik LB 28322283, it is incorrectly being filled with “[0.0-null]”.

Exclude publications and their related studies with status "requires review" from Reporting metrics

For publications and their related studies with status "requires review", exclude these from metrics in the Curation app "Reports" tab for "Overall weekly progress" and "Open targets weekly progress".

Update project provenance identifiers

Curators or submitters should be able to add publication identifiers to studies in submitted “projects” that are subsequently published. This would associate a previously unidentified paper with a recognized unique identifier (PubMed ID, bioRxiv ID, UK BB ID).

Discrepancy in UI between search and top search, the trait page does not include child traits

For cardiovascular disease, with 'child traits' ticked, the number of associations is 331. When unticked and re-ticked, the number is 5027.

https://www.ebi.ac.uk/gwas/efotraits/EFO_0000319

Update EFO_TRAIT table for obsolete terms

Following the "Data release QC report - 2020-02-25" the term "influenza infection" was reported as not being able to be retrieved from OLS. This term is now obsolete and replaced by "influenza" with URI http://www.ebi.ac.uk/efo/EFO_0007328. Update the data in the EFO_TRAIT table to reflect this update in EFO.

Replace Ensembl REST calls for data re-mapping

The Ensembl REST API is the bottleneck for data re-mapping. As the GWAS Catalog continues to grow, the time it takes to remap the Catalog for every Ensembl release will have a significant impact. Ensembl provides direct database access, which would significantly improve mapping time by eliminating the time spent making REST calls. It would also simplify the code and database, as we would be able to remove the Ensembl_Rest_Call table and the logic around its lookup.
Next Ensembl release is scheduled for April 2020. Plan to complete by then.

Validation SumStats once deposition app is back up

This is a reminder to validate SumStats submitted by Matthew Sampson
For the study with PMID: 29903748
The submission ID is 5e613a18b8b8fd000143a5a2

FTP link in search interface does not work for summary stats that come through deposition

An external user submitted summary statistics for the publication Nag A PMID 31960908. The summary statistics were transferred using Globus. I believe they are now stored at this location: gwas_cat/c27cd630-6d6f-468e-9d33-4fc86339c9c1.

The submission has now been imported and the publication has been re-published in the GWAS Catalog. However, on the search interface page for this publication (https://wwwdev.ebi.ac.uk/gwas/publications/31960908), the link to the summary stats on the FTP does not work.

It seems to be attempting to link to a directory using the old sumstats folder naming convention (i.e. ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/NagA_31960908_GCST009885) which does not exist, because the summary stats were introduced through Globus and are now in a folder with a different name.

Bug in Notes for imported studies

When a submission is imported into the curation interface, the “old” entry for that publication in the curation interface (i.e. the one that was there before the import) should have a note created saying “Review for deletion, replaced by deposition import”. The newly-created studies should not get this note.

At the moment, some of the newly created studies are incorrectly also getting this note, sometimes with multiple copies.

Examples:

Buchwald J 32157176, new study with Study tag “NMR” was incorrectly given 8 copies of this note
Chibnik LB 28322283, new study with Study tag “Gross infarct” was incorrectly given 1 copy of this note.

Add link from submission summary in curation interface to the original submission in deposition

Curators need a more efficient way to get from the submission summary (View Submissions/Review) in the curation interface to the original submission in the deposition interface (including the spreadsheet).

Suggest adding the submission ID to the summary screen, which would link to the depo app.

Fix bug in Update EFO trait form in Curation app

Add and Edit EFO form is broken - The URI field can be any string, e.g. EFO_0004541

Beta not imported into curation interface from deposition

Beta was entered in the submission template but was not imported into the curation interface - Example Buchwald J 32157176.

Daily Summary Stats release report email is not giving correct information

The summary stats release report email was developed when we moved to a daily release of sumstats (Jira GOCI-2801), to give curators an overview of which folders have been released.

Background:
Curators indicate that a study has sumstats by ticking the “full p value set” box in the curation interface. There is no QC linked to this process; the folders on the FTP are manually created by the curator. The release report is in 4 parts that let curators know that

they have correctly ticked the box and created the file, the two have been successfully linked and data has been released
folders named with internal study_id have been successfully renamed with the GCST
the box has been ticked but no matching folder has been found
a folder has been created but the box has not been ticked

Bug:
Since 29th Jan 2020, there has been an accumulation of studies listed in the first (“This round summary statistics of n studies were released”) and second (“The following folders were renamed in the staging area”) sections. Example: HoglundJ_31727947_GCST009522 has been listed as released in every email since 29th Jan. Expected behaviour is that a study, once released, should not appear in the report again unless it is unpublished and republished. Also, the release report is going out every night when it should only go out when sumstats are actually released.

Also, in the last few days (26th & 27th), some unexpected files have appeared in the final section called “(A Document Being Saved By AUHelperService)”. It isn’t critical to get rid of this but it’s just additional clutter that reduces the utility of the report.
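
The expected behaviour can be expressed with set operations. The sketch below covers three of the four report sections (released, ticked without a folder, folder without a tick) plus the "only send when something was released" rule; the renaming section is left out. The function names, the idea of keeping a set of already released studies, and the accessions GCST000001 and GCST000002 are assumptions for illustration.

```python
from typing import Dict, Set


def build_release_report(
    ticked: Set[str], folders: Set[str], already_released: Set[str]
) -> Dict[str, Set[str]]:
    """Classify studies by comparing ticked boxes with folders on the FTP."""
    matched = ticked & folders
    return {
        # Studies reported in an earlier round are not listed again.
        "newly_released": matched - already_released,
        "ticked_without_folder": ticked - folders,
        "folder_without_tick": folders - ticked,
    }


def should_send(report: Dict[str, Set[str]]) -> bool:
    """Send the email only when sumstats were actually released this round."""
    return bool(report["newly_released"])


report = build_release_report(
    ticked={"GCST009522", "GCST000001"},
    folders={"GCST009522", "GCST000002"},
    already_released={"GCST009522"},
)
send = should_send(report)
```

Here GCST009522 was released in an earlier round, so it is not reported again and no email is sent for this run.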

Allow addition of PubMed ID to submission manuscript

Curator and submitter users need the ability to associate a PubMed ID with a previously submitted publication. The endpoint should allow the update of a manuscript and add a PubMed ID to it.

Genes disappear from LD plot at certain plot widths

Daniel reported a bug with the LD plot. Genes that are visible in the 50kbp window disappear in the 25 and sometimes 200kbp window (genes displayed in track "Ensembl genes", below the plot; plot width changed using the drop-down on top right of the plot). I replicated the behaviour with a couple of other variants. As Daniel observes, it seems to occur when the gene is larger than the initial plot width:

https://www.ebi.ac.uk/gwas/variants/rs6466479
https://www.ebi.ac.uk/gwas/variants/rs45446698

but not

https://www.ebi.ac.uk/gwas/variants/rs7412

This is misleading for users and needs fixing so the genes are visible at all window sizes.

Original GWAS-info email:
I wanted to show off with the LD plot and I have bumped into a bug. Check out this variant: https://www.ebi.ac.uk/gwas/variants/rs35407685

This variant is an intron variant (https://www.ensembl.org/Homo_sapiens/Variation/Mappings?db=core;r=7:140785945-140786947;v=rs35407685;vdb=variation;vf=20866875) overlapping with BRAF. When the LD plot is loaded, the BRAF gene on the reverse strand is correctly displayed (50kb window by default). However, it disappears when the window size is decreased to 25kb, and when you try to change back to 50kb it is still not shown; it reappears only when the largest 500kbp window is selected.

I think when the plot is updated only those genes are shown whose start and end coordinates both fall within the window. So I suspect the fix should not be too complicated.
Please take a look and verify whether this is indeed not the expected behavior.
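
The reporter's hypothesis can be illustrated with two interval tests. The sketch is in Python for brevity, although the plot itself runs in the browser, and the coordinates are made up: a gene 200 kb long and a 25 kb window centred inside it.

```python
def fully_contained(gene_start: int, gene_end: int, win_start: int, win_end: int) -> bool:
    """Suspected current rule: both gene ends must lie inside the window."""
    return win_start <= gene_start and gene_end <= win_end


def overlaps(gene_start: int, gene_end: int, win_start: int, win_end: int) -> bool:
    """Proposed rule: show the gene if any part of it falls in the window."""
    return gene_start <= win_end and gene_end >= win_start


gene = (100_000, 300_000)    # hypothetical gene, 200 kb long
window = (187_500, 212_500)  # hypothetical 25 kb window inside the gene

hidden_by_containment = not fully_contained(*gene, *window)
shown_by_overlap = overlaps(*gene, *window)
```

Under the containment rule a gene larger than the window can never be drawn, which would be consistent with the observation that the problem appears when the gene is larger than the plot width.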

Provide stable, unique ID for project upon submission

Users should be provided with an accession ID for the “project” and accession IDs (GCSTs) for each GWAS within this project. These should be provided as soon as the submission passes validation so that the authors can include these in the manuscript and also to Journal editors to confirm that they have submitted.

Update EFO term labels in database to match OLS

Update the EFO term labels in the database to match the exact format in EFO, e.g. capitalization, etc. to resolve the errors reported by the data release QC script.

Studies without sumstats incorrectly marked as with sumstats upon import

In “View Submissions”, submissions that do not have summary statistics are being incorrectly labelled as “Metadata Summary Stats and Top Associations”.

Example: Buchwald J 32157176 was submitted by a curator with metadata and top associations but no summary stats (see submission:
https://www.ebi.ac.uk/gwas/deposition/submission/5e846f9e4c489b0001928d5a) but it appears in the View Submissions table as “Metadata Summary Stats and Top Associations”. When the studies were imported, they were therefore incorrectly marked as having summary statistics.