genome-nexus / g2s

A standalone component to provide mappings between protein sequence positions and PDB 3D protein structure models

License: GNU Affero General Public License v3.0

Java 93.54% CSS 0.02% Python 0.19% HTML 5.27% Shell 0.98%

g2s's Introduction

Tutorial for running this project (Beta 1)

Prerequisites:
OS: Linux 64-bit
java: openjdk_1.8.0
maven: 3.3.9
mysql: Ver 15.1 Distrib 10.0.21-MariaDB
blast: 2.4.0+
*Please make sure java, mvn, mysql, and blastp are all on your PATH.

How to run this project:
Step 1. Initialize the database
1. In mysql, create an empty database named "pdb" and a user "cbio" with password "cbio":
	At the mysql prompt, type:
	CREATE USER 'cbio'@'localhost' IDENTIFIED BY 'cbio';
	GRANT ALL PRIVILEGES ON * . * TO 'cbio'@'localhost';
	FLUSH PRIVILEGES;
	create database pdb;
2. In your code workspace, git clone https://github.com/cBioPortal/pdb-annotation.git
3. Change settings in src/main/resources/application.properties
	(i) Set workspace to the directory that holds the input sequences (${workdir}).
	(ii) Set resource_dir to "~/pdb-annotation/pdb/src/main/resources/"
	(iii) Adjust ensembl_input_interval to balance memory use and performance
	(iv) * If you want to use other test Ensembl sequences, change both ensembl_download_file and ensembl_fasta_file in your workspace
4. mvn package
5. in pdb-annotation/pdb-alignment-pipeline/target/: java -jar -Xmx7000m pdb-0.1.0.jar init
 
Step 2. Check the API
1. in pdb-annotation/pdb-alignment-api/: mvn spring-boot:run
2. Swagger-UI:
http://localhost:8080/swagger-ui.html
3. Directly using API:
http://localhost:8080/pdb_annotation/StructureMappingQuery?ensemblId=ENSP00000483207.2
http://localhost:8080/pdb_annotation/ProteinIdentifierRecognitionQuery?ensemblId=ENSP00000483207.2

Step 3. Weekly update
1. in pdb-annotation/pdb-alignment-pipeline/target/: java -jar -Xmx7000m pdb-0.1.0.jar update

Notes:
Typical Running time for pipeline Init  : 80219.905 Seconds (around 22 hours)
Typical Running time for pipeline Update: 1062.796  Seconds (around 20 minutes)

Tested on Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 8 cores, 8G Memory
Linux version 3.18.7-200.fc21.x86_64 (gcc version 4.9.2 20141101 (Red Hat 4.9.2-1) (GCC) ) #1 SMP
OpenJDK version "1.8.0_65" 64-Bit Server VM (build 25.65-b01, mixed mode)
mysql  Ver 15.1 Distrib 10.0.21-MariaDB, for Linux (x86_64) using  EditLine wrapper

Please let me know if you have questions.

g2s's People

onursumer sheridancbio juexinwang marriott-er

g2s's Issues

construction of mysql command hardcodes max_allowed_packet setting

in PdbScriptsPipelineRunCommand.java, we have:
private List makeDBCommand() {
List list = new ArrayList();
// Building the following process command
list.add(rc.mysql);
list.add("--max_allowed_packet=1024M");

This max_allowed_packet setting must be configured in both the mysql client and the mysqld server to function correctly. The value should not be hardcoded here; it should be read from application.properties.
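
A minimal sketch of how the value could be taken from configuration instead. The class name and the property key mysql_max_allowed_packet are hypothetical (the key does not exist in the current application.properties); only the --max_allowed_packet client option comes from the code quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class MysqlCommandSketch {
    // Hypothetical property key, chosen for illustration only.
    static final String PACKET_KEY = "mysql_max_allowed_packet";

    // Builds the start of the mysql client command line. The option is added only
    // when the property is set, so the client default applies otherwise.
    static List<String> makeDBCommand(String mysqlPath, Properties props) {
        List<String> list = new ArrayList<String>();
        list.add(mysqlPath);
        String packet = props.getProperty(PACKET_KEY);
        if (packet != null && !packet.trim().isEmpty()) {
            list.add("--max_allowed_packet=" + packet.trim());
        }
        return list;
    }
}
```

The server side would still need a matching max_allowed_packet in the mysqld configuration; this sketch only covers the client command.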

Clean up endpoints

typo in PdbScriptsPipelineRunCommand.java (dataVesrsion)

The class member dataVesrsion should be changed to dataVersion

mysql command steps of pdb-annotation-pipeline not working

at the end of the update process (and the init process as well) the mysql client is not actually executing the "insert" and other commands. No exception or error is reported.

I will soon be debugging this problem

PdbScriptsPipelineMakeSQL.java should be improved if we are keeping it

If we are switching to using a direct DB interaction through a java repository, then this class will be mostly deleted.

Otherwise, responsibilities should be grouped and broken out into other classes. One example might be to separate the parsing of blast xml, and separate the generation of insert statements.

Other improvements:

"choose" argument into parse2sql should be renamed so that it is clear what choice is being made. Also "1" and "0" as choices are unclear. Maybe use an enum with named values ... or define static final int class members at least.
parseblastresults : we have alternatives SmallMem and Single, but maybe there is no use for Single? If SmallMem will always work, drop Single from code.
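
A possible shape for the enum suggested above, as a sketch only. The constant names INIT and UPDATE are placeholders: the issue does not say what the values 0 and 1 of "choose" actually select, so the real names would have to come from the code.

```java
enum ParseMode {
    // Placeholder names; replace with whatever 0 and 1 really mean in parse2sql.
    INIT(0),
    UPDATE(1);

    private final int legacyCode;

    ParseMode(int legacyCode) {
        this.legacyCode = legacyCode;
    }

    int legacyCode() {
        return legacyCode;
    }

    // Converts an old integer argument into a named value and rejects anything else.
    static ParseMode fromLegacyCode(int code) {
        for (ParseMode mode : values()) {
            if (mode.legacyCode == code) {
                return mode;
            }
        }
        throw new IllegalArgumentException("Unknown parse mode: " + code);
    }
}
```

With this in place, a call site would read parse2sql(ParseMode.UPDATE, ...) rather than parse2sql(1, ...), assuming the signature is changed accordingly.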

experiment with spring-MVC and spring-boot for creating Web API

work through spring tutorial(s) and try ApiController example from cbioportal repository, to gain familiarity with Web API techniques.
also begin to think about what Web API endpoints (URL) and services (input parameters, and response format / java model) would make sense

PdbScriptsPipelineRunCommand.java is too big

I would separate all the "construct a command" type of functions to a utility class (or classes). I would create separate classes to handle:
- run a local process functions (runwithRedirect)
- downloadfile functionality
- blast processing functions (makedb, blastp)
- gunzip functionality
- ftp file download and parsing
That would leave the main run methods (run, runInit, runUpdatePDB) .. which might also be split up unless they share some common code. The ReadConfig class can/should be used in many places to provide access to the properties.

Different java versions specified in different modules

The root pom.xml file specifies java 1.7, but the alignment-api specifies java 1.8.
The project as a whole should be using one consistent version of java, so these should be reconciled.

delete (or support) -word_size argument to blast program

PdbScriptsPipelineRunCommand.java has commented out reference to the -word_size parameter to blast.

Delete this comment ... or make this a parameter in application.properties and include the parameter to blast.

Add "DATE_ADDED" field to table pdb_ensembl_alignment

Since the alignment table is subject to modification, it would be helpful for development and debugging to know when records were added to the table. The same goes for pdb_entry and any other tables which are modified as part of the update process

gzip external process for insert archiving is broken

command : java -jar -Xmx10g pdb-alignment-pipeline/target/pdb-alignment-pipeline-0.1.0.jar init

This was run with batch increment sizes of 10000 and a Ensembl sequence database size of 50000.

First, the cleanup steps which gzip insert scripts fail when a gzip file already exists in the current directory. This should be fixed to succeed (or not be run) when the user issues the "init" command on the java command line.

Second, there is a loop for reading the error stream. This loop has several problems:
(1) The standard error stream is a stream of characters, but the code reads these characters as integer values and outputs them with tab characters between them. This destroys the meaning of the character stream.
(2) The loop condition is set by calling InputStream.available(), which only tells how many bytes are currently sitting in the memory buffer. Why should we be printing out only the very small memory buffer? We should print the entire contents of the error stream. read() returns -1 when the stream reaches EOF.
(3) The process's standard output and standard error streams should be read no matter what, not just when there is an error. We must drain both streams so that the buffers do not fill up and cause the program to hang because the next write operation to a standard stream cannot complete due to a full buffer.
Use a thread which reads an input stream and collects the output into a String. I will send you example code.

Here are examples of the stderr output stream:

As the code is currently written:
2016-08-15 14:23:23 ERROR CommandProcessUtil:33 - [Process] Error: 103 122 105 112 58 32 47 85 115 101 114 115 47 115 104 101 114 105 100 97 114 47 114 101 112 111 115 47 115 104 101 114 105 100 97 110 99 98 105 111 47 112 100 98 45 97 110 110 111 116 97 116 105 111 110 47 103 115 111 99

After changing to output characters, not integers:
2016-08-15 14:53:59 ERROR CommandProcessUtil:33 - [Process] Error: g z i p : / U s e r s / s c

After recoding the loop, so that the reading continues until end of file, and getting rid of tab characters between each letter:
2016-08-15 15:17:19 ERROR CommandProcessUtil:36 - [Process] Error: gzip: /Users/sheridar/repos/sheridancbio/pdb-annotation/gsoc_3d_testing/insert.sql.0.gz already exists; not overwritten
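
For illustration, here is one way such a reader thread might look. This is a sketch written for this note, not the example code promised above; the class name StreamCollector is made up, and UTF-8 is assumed for the process output.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Reads a stream on its own thread until EOF and keeps the text.
final class StreamCollector extends Thread {
    private final InputStream in;
    private final StringBuilder text = new StringBuilder();
    private volatile IOException failure;

    StreamCollector(InputStream in) {
        this.in = in;
    }

    @Override
    public void run() {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buffer = new char[4096];
            int count;
            // read() returns -1 at end of stream; available() is never consulted.
            while ((count = reader.read(buffer)) != -1) {
                synchronized (text) {
                    text.append(buffer, 0, count);
                }
            }
        } catch (IOException e) {
            failure = e;
        }
    }

    // Call after join(); returns everything that was read as characters.
    String getText() throws IOException {
        if (failure != null) {
            throw failure;
        }
        synchronized (text) {
            return text.toString();
        }
    }
}
```

The intended use would be to start one collector on process.getInputStream() and one on process.getErrorStream() right after starting the process, call waitFor(), then join() both collectors before logging their text.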

Does the "updateweekly" mode work correctly currently?

I have not tested the scheduled weekly updates using the Calendar function and Timer.schedule option.

From looking at the code it seems that it will run the update at a future time and then exit .. but won't the user need to reset this process after each update? If the user needs to reset after each week, then they are not saving any time compared to running the process manually each week. So I think the internal scheduler should loop ... scheduling the next update after it finishes each update.

If the desired functionality is that the program run continuously in this mode, running each week without needing to be manually restarted by an external user, then convert this issue from "Question" to "Bug" (unless the code is functioning correctly already and I misunderstood it).
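
If continuous operation is what is wanted, a repeating schedule could look roughly like the sketch below. It swaps the Timer.schedule approach mentioned above for java.util.concurrent.ScheduledExecutorService; the class and method names are made up for illustration.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

final class RepeatingUpdateSketch {
    // Runs the update again after each run finishes, so no manual restart is needed.
    static ScheduledFuture<?> scheduleRepeating(ScheduledExecutorService scheduler,
                                                final Runnable update,
                                                long initialDelay,
                                                long delayBetweenRuns,
                                                TimeUnit unit) {
        Runnable guarded = new Runnable() {
            @Override
            public void run() {
                try {
                    update.run();
                } catch (RuntimeException e) {
                    // An exception escaping the task would suppress all later runs,
                    // so it is reported here and the schedule continues.
                    System.err.println("Update failed: " + e);
                }
            }
        };
        return scheduler.scheduleWithFixedDelay(guarded, initialDelay, delayBetweenRuns, unit);
    }
}
```

A weekly schedule would then be something like scheduleRepeating(Executors.newSingleThreadScheduledExecutor(), task, 0, 7, TimeUnit.DAYS), where task wraps the existing update run.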

runwithRedirectFrom() call needs error checking and output stream handling

In PdbScriptsPipelineRunCommand.java : runwithRedirectFrom(), a process is created and input is redirected to it, but output streams are ignored and the return code / status code is not checked.

Long standard output or standard error streams can cause the small process buffers to overload .. but in the case of mysql we probably don't expect that. Still, maybe we should redirect output streams to temporary files and then delete them when the mysql commands are complete.

But return codes must be checked to see if the mysql command failed or exited with an error.
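
A rough sketch of both ideas together: output goes to temporary files and the exit code is checked. The class name is made up, and the method is a stand-in for runwithRedirectFrom(), whose real signature is not shown in this issue.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.List;

final class CheckedProcessSketch {
    // Runs a command with stdin taken from inputFile. stdout and stderr go to
    // temporary files so the pipe buffers cannot fill up, and a non-zero exit
    // code is turned into an exception that carries the stderr text.
    static void runWithRedirectFrom(List<String> command, File inputFile)
            throws IOException, InterruptedException {
        File out = File.createTempFile("process", ".out");
        File err = File.createTempFile("process", ".err");
        try {
            ProcessBuilder builder = new ProcessBuilder(command);
            builder.redirectInput(inputFile);
            builder.redirectOutput(out);
            builder.redirectError(err);
            int exitCode = builder.start().waitFor();
            if (exitCode != 0) {
                String errors = new String(Files.readAllBytes(err.toPath()), StandardCharsets.UTF_8);
                throw new IOException("Command " + command + " exited with status "
                        + exitCode + ": " + errors);
            }
        } finally {
            // The temporary files are removed whether or not the command succeeded.
            out.delete();
            err.delete();
        }
    }
}
```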

access levels not appropriate for data members

PdbScriptsPipelineRunCommand.java has some public data members, and one for which no access level is specified (so it defaults to "package" level access).

In general all data members should have access level "private" specified unless there is a good reason not to. Make these private or discuss alternatives.

choose schema and create database tables for storing sequences and alignments

Create database tables for holding sequences. Handle sequence identifiers properly, including a mapping of original sequence identifiers from sequence database source (such as Ensembl or Uniprot) to a set of non-redundant sequences with unique internal identifiers.
Create database tables for holding alignments between sequence database sequences and pdb sequences. The primary goal is that the system can retrieve an amino acid to amino acid mapping between a sequence database sequence and pdb structure coordinates (resSeq). The secondary goal is to retrieve information about the entire alignment (such as alignment start / end positions and a text representation of the alignment which shows matches, mismatches and gaps as shown in a blast hit report).

Choice : custom application.properties parser versus spring based properties

There is currently custom code for parsing the properties files and making values available to the code. There are also spring based directives for connecting to properties files.

We should have one or the other.

If we keep the custom parsing code then:

drop the spring annotations for properties processing
make the parsing code function so that the file is parsed only once, values are cached in memory, and many java classes can read the properties without causing a new parsing of the file.

If we get the spring based property parsing to work, then remove the custom code, and replace all properties references using the spring framework.
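
If the custom parser is kept, the "parse once, cache in memory" behaviour could be sketched as below. The names CachedProperties and Source are hypothetical and are not taken from the existing ReadConfig class; parseCount exists only to make the single parse visible.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

final class CachedProperties {
    // Supplies the raw properties content, for example by opening application.properties.
    interface Source {
        InputStream open() throws IOException;
    }

    private final Source source;
    private Properties cached;
    private int parseCount;

    CachedProperties(Source source) {
        this.source = source;
    }

    // Parses the source on the first call only; later calls read the in-memory copy.
    synchronized String get(String key) throws IOException {
        if (cached == null) {
            Properties loaded = new Properties();
            try (InputStream in = source.open()) {
                loaded.load(in);
            }
            cached = loaded;
            parseCount++;
        }
        return cached.getProperty(key);
    }

    synchronized int parseCount() {
        return parseCount;
    }
}
```

A single shared instance would then be handed to every class that needs configuration values.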

plan components and packages / modules for gsoc code development

consider what classes and other resources are likely to be needed for the gsoc project development; draft a design list
group new development items into components
decide on reasonable code organization [modules, packages, paths] for components being planned

exception when "interval" properties divide evenly into alignment count

The two application.properties settings:
ensembl_input_interval=10000
sql_insert_output_interval=10000

There are errors when these numbers evenly divide into the number of generated alignments. We need to add code into the pipeline to avoid these exceptions. We could try to detect and delete empty files before attempting to parse them .. or use clever record counting and interval number factorization (using disparate prime numbers)
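
One simple alternative, sketched under the assumption that the exceptions come from an empty trailing batch file: only open a new batch when there is at least one record to put in it. The class and method names are illustrative and do not exist in the pipeline.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchSketch {
    // Splits records into consecutive batches of at most `interval` items.
    // A batch is created only when a record exists for it, so a record count
    // that is an exact multiple of `interval` yields no empty trailing batch.
    static <T> List<List<T>> split(List<T> records, int interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        List<List<T>> batches = new ArrayList<List<T>>();
        for (int start = 0; start < records.size(); start += interval) {
            int end = Math.min(start + interval, records.size());
            batches.add(new ArrayList<T>(records.subList(start, end)));
        }
        return batches;
    }
}
```

With interval 10000 and exactly 20000 alignments, this produces two full batches and nothing else.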

examine the project proposal and milestone document - collect questions

looking forward through the plan in the google docs, consider upcoming decisions and questions which will need to be answered. Add new questions to the google document holding project questions.

PdbScriptsPipelinePreprocessing.java has commented out alternatives

preprocessPDBsequences() comments out alternative ways of writing the fasta file. Either create a functional conditional alternative which is chosen in application.properties, or delete the commented out code.

support genome build version in ID instead of id_type

support GRCh37:17:g.79478130C>G
support GRCh38:17:g.79478130C>G.
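
A small sketch of how IDs in this form might be parsed. It assumes the layout build:chromosome:g.positionREF>ALT seen in the two examples and handles single-nucleotide substitutions only; the class name GenomicId and the accepted chromosome names are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GenomicId {
    // build : chromosome : g. position ref > alt, e.g. GRCh37:17:g.79478130C>G
    private static final Pattern FORMAT = Pattern.compile(
            "(GRCh37|GRCh38):([0-9]{1,2}|X|Y|MT):g\\.([0-9]+)([ACGT])>([ACGT])");

    final String build;
    final String chromosome;
    final long position;
    final char ref;
    final char alt;

    private GenomicId(String build, String chromosome, long position, char ref, char alt) {
        this.build = build;
        this.chromosome = chromosome;
        this.position = position;
        this.ref = ref;
        this.alt = alt;
    }

    // Parses the whole string; anything that does not match the pattern is rejected.
    static GenomicId parse(String id) {
        Matcher m = FORMAT.matcher(id);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognised genomic ID: " + id);
        }
        return new GenomicId(m.group(1), m.group(2), Long.parseLong(m.group(3)),
                m.group(4).charAt(0), m.group(5).charAt(0));
    }
}
```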

Repo still alive?

Is this repo still actively used anywhere? The main site seems to just use endpoints like https://mmtf.rcsb.org/v1.0/full/6GYR. Maybe we should archive this repo and remove its reference from https://docs.cbioportal.org/2.1-deployment/architecture-overview#g2s?

@inodb , @sheridancbio

default data import is too slow

With the current DB settings, the data import service runs slowly. Find a way to optimize it; a possible solution is MySQL bulk load.

Rev2: changes to the gsoc project completion wiki page

Here are some suggested improvements and points of feedback:

The title of the page doesn't need to say "wiki", since that is obvious because the page is found under the "wiki" tab. It also doesn't need to say pdb-annotation .. since visitors already know the project they are looking at. For a title, describe what this particular wiki page is meant to capture .. my suggestion: "gsoc 2016 project summary and guide"
delete the welcome line (not helpful)
Use better text formatting and indentation to make clear which parts of the instructions in section 2 (Installation Guide) are "commands for the user to type". If these are all in a distinctive font and have a standard indent it makes it much easier for the reader to know what they are supposed to type and where they are in the process.
The instructions themselves in section 2 (Installation Guide) should be rewritten so that someone who knows nothing about the project will still be able to know what to do and how to get the program installed and running. Too many of these instructions assume the user knows the project and can figure out appropriate settings ... and some of the example file paths will probably need to be different depending on where the user chooses to install and run the code.
the API Documentation Section "Functional Details" for the first API endpoint is not accurate. Rewrite this text so it describes what the API actually does
the one main figure / diagram is good. Look for other opportunities to present helpful images which capture the purpose of document sections "at a glance" A figure for the API Documentation section might be helpful for example.
there are many suggested rewordings and comments on how to improve the text here: https://docs.google.com/document/d/1tU0lrjmdRu0NrESq1K4mgcHMR7FwMt5HLQklgSw8ikQ Consider these suggestions and make improvements as possible.
consider adding a section describing the remaining functionality that needs to be added to meet all the needs of the cBioPortal project. For example, describe the need for an API query to find the PDB coordinate for a specific residue in the protein sequence ... and mention some of the challenges that need to be dealt with, such as the segmentation problem and gaps in the pdb sequence.

Add a cache for pdb header data

Cache the data retrieved from the PDB header service (service/internal/PdbDataServiceImpl.java)
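
A minimal in-memory cache could be sketched as follows. The loader stands in for whatever call PdbDataServiceImpl makes to the PDB header service; the class name, the generic value type and the upper-casing of IDs are assumptions, and there is no eviction or expiry.

```java
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

final class HeaderCacheSketch<V> {
    private final ConcurrentMap<String, V> cache = new ConcurrentHashMap<String, V>();
    private final Function<String, V> loader;

    // loader performs the real (slow) lookup for one PDB ID.
    HeaderCacheSketch(Function<String, V> loader) {
        this.loader = loader;
    }

    // Returns the cached value, calling the loader only the first time an ID is seen.
    // IDs are normalised to upper case so "3qfb" and "3QFB" share one entry.
    V get(String pdbId) {
        return cache.computeIfAbsent(pdbId.toUpperCase(Locale.ROOT), loader);
    }
}
```

Since the PDB data changes with the weekly update, a real cache would probably need to be cleared or expired after each update.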

Web API does not report exceptions

If the database is inaccessible, or other exceptions are thrown during database query, the exceptions are not seen by the end user and the api may return incorrect values. For example, the following exception was recorded on startup in a local installation but querying the Api endpoint http://localhost:8080/pdb_annotation/ProteinIdentifierRecognitionQuery?ensemblId=ENSP00000485937.1 resulted in a simple return value of "false"

Beginning of exception stacktrace:

2016-09-13 05:52:30.490 ERROR 11452 --- [ main] o.a.tomcat.jdbc.pool.ConnectionPool : Unable to create initial connections of pool.

com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Unknown database 'pdbtest'
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.8.0_91]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[na:1.8.0_91]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[na:1.8.0_91]
at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[na:1.8.0_91]
at com.mysql.jdbc.Util.handleNewInstance(Util.java:404) ~[mysql-connector-java-5.1.38.jar:5.1.38]
at com.mysql.jdbc.Util.getInstance(Util.java:387) ~[mysql-connector-java-5.1.38.jar:5.1.38]

mapping issues from genomic to pdb

reported by user:

a strange case where a genomic position maps onto several different residues on the same protein chain. Here is an example:
https://g2s.genomenexus.org/api/alignments/hgvs-grch37/chr12:g.104713303C%3ET/pdb/3QFB_A/residueMapping

After getting the genome nexus data, I suspect that the gene_id was used for retrieving alignments. If so, transcript_id or protein_id should be used instead.

database update process does not report numbers

The output from running the update step of the pdb-alignment-pipeline should report the number of pdb files added, deleted, modified in the runtime report. Total Input Queries is reported in the runtime report below (596) --- this is good. We also would be able to report the number of deletions from the alignment table if we used a true database interaction rather than the external .sql script.

[SHELL] Weekly Update: Create deleted list
[Preprocessing] Preprocessing PDB sequences...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/sheridar/repos/sheridancbio/pdb-annotation/pdb-alignment-pipeline/target/pdb-alignment-pipeline-0.1.0.jar!/lib/logback-classic-1.1.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/sheridar/repos/sheridancbio/pdb-annotation/pdb-alignment-pipeline/target/pdb-alignment-pipeline-0.1.0.jar!/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [ch.qos.logback.classic.util.ContextSelectorStaticBinder]
[BLAST] Running makeblastdb command...
[BLAST] Command makeblastdb complete
[BLAST] Running blastp command...
[BLAST] Command blastp complete
[BLAST] Read blast results from xml file...
[BLAST] Total Input Queries = 596
[SHELL] Start Write insert.sql File...
[SHELL] Write insert.sql Done
[DATABASE] Running mysql command...
[DATABASE] Command mysql complete
[Shell] Generating delete SQL
[DATABASE] Running mysql command...
[DATABASE] Command mysql complete
[Shell] All Execution time is 52.981 seconds

install and test necessary software tools for gsoc project

Install blast and test protein alignment generation
Install database system (mysql or other rDBMS)
Install or test pdb parsing library (unless all sequence and coordinate info can be downloaded from rcsb)
Install web application container if not using spring-boot to provide a container

practice using github fork to submit pull request

create a fork of this repository
clone the fork to a local machine
add a change, such as a comment in a README file or addition of a document file
commit the change on the local machine using git
push the change back to the fork on github
create a new pull request to this repository

for the purposes of this exercise, use the default branch (master). In normal development practice, we typically create a new branch off the master branch before making changes. We also make pull requests on a different branch (such as hotfix or rc) when submitting PR's back to the cbioportal repository. So after exercise is complete, read about creating and switching to a new branch -- and maybe repeat the exercise using a named development branch (it could be called "pull-request-exercise" or something similar)

Deploy beta to production

Deploy beta war to production. Then create frontend fix for new beta/

Errors not reported if mysql insert script fails

When running the init step (java -jar -Xmx7000m target/pdb-alignment-pipeline-0.1.0.jar init), if the import of alignment records fails (for instance if there is no database present, no tables have been created, or the data does not properly fit into the created table schema), no error report is given.

If import fails, an error message must be returned to the user and the application must exit with a non-0 status code.

Add POST of bulk query using genome nexus POST

wget functionality via external process : error detection needed

PdbScriptsPipelineRunCommand.java has code for downloadfile(), but the error detection of failed downloads is not adequate. The return code / exit code of the wget command must be checked. Also the output streams from the process must be handled.

(These problems also occur for other ProcessBuilder use cases)

Support only Human Uniprot and remove alignment limitation of 50

Use newly developed G2Smutation parts

many functions with Boolean return type always return true

There are functions with a boolean return type (hypothetically to indicate success or failure of function) but which only return a true value. Either identify, detect, and report failures, or change these functions into return type "void". Failures can be reported either by the return type or by throwing an exception, but if the return type is never "false" then switch to void.

Also, if a function does return a true/false success value the calling code must check that value and report a problem if there was a failure, then take corrective action.

If exceptions are thrown, the appropriate "throws" clause must be put on the throwing function and the calling code should have an appropriate "try/catch". Do not use the "unchecked" RuntimeException classes unless they are actually appropriate to the case. The program should not "exception out" of execution unless it is a serious and hard to predict situation.

examples:
PdbScriptsPipelinePreprocessing.java
	preprocessPDBsequences()
	preprocessPDBsequencesUpdate()
PdbScriptsPipelineMakeSQL.java
	parse2sql
	parseblastresultsSmallMem
	generateSQLstatementsSingle
	genereateSQLstatementsSmallMem
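
To make the suggestion concrete, here is a sketch of the "void plus checked exception" style with a caller that reacts to the failure. The method bodies are placeholders and do not reproduce the real preprocessing code; the readable-file check is just an example of a detectable failure.

```java
import java.io.File;
import java.io.IOException;

final class PreprocessingSketch {
    // Before: boolean preprocess(...) { ...; return true; }  (failure was never reported)
    // After: no return value; a failure is reported through a checked exception.
    static void preprocess(File fastaFile) throws IOException {
        if (!fastaFile.isFile() || !fastaFile.canRead()) {
            throw new IOException("Cannot read input file: " + fastaFile.getPath());
        }
        // The real preprocessing work would go here.
    }

    // Caller: reports the problem and returns a non-zero status instead of ignoring it.
    static int runStep(File fastaFile) {
        try {
            preprocess(fastaFile);
            return 0;
        } catch (IOException e) {
            System.err.println("[Preprocessing] Failed: " + e.getMessage());
            return 1;
        }
    }
}
```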

FTP connections through URLConnection fail from within private network / firewall

Code in PdbScriptsPipelineRunCommand.java may attempt to access files through a URLConnection object from the standard java library. If the url specifies the "ftp:" protocol, and if the program is running inside a private network or behind a firewall, the data connection for the ftp transfer may fail.

Luckily, all URLs in use currently come from servers which offer dual ftp/http protocol services, so we can avoid the exceptions by specifying in the URL to use http protocol. But it would be better if the code used an FTP client which had proper support for passive mode file transfers such as org.apache.commons.net.ftp.FTPClient.

Example of the error as observed:

[SHELL] Weekly Update: Create deleted list
java.io.IOException: sun.net.ftp.FtpProtocolException: PORT 172,18,233,39,194,100:500 Illegal command, EPSV ALL in effect.

at sun.net.www.protocol.ftp.FtpURLConnection.getInputStream(FtpURLConnection.java:488)
at org.cbioportal.pdb_annotation.scripts.PdbScriptsPipelineRunCommand.readFTPfile2List(PdbScriptsPipelineRunCommand.java:348)
at org.cbioportal.pdb_annotation.scripts.PdbScriptsPipelineRunCommand.prepareUpdatePDBFile(PdbScriptsPipelineRunCommand.java:404)
at org.cbioportal.pdb_annotation.scripts.PdbScriptsPipelineRunCommand.runUpdatePDB(PdbScriptsPipelineRunCommand.java:484)
at org.cbioportal.pdb_annotation.scripts.PdbScriptsPipelineStarter.main(PdbScriptsPipelineStarter.java:123)

(local machine on wireless network was assigned IP address 172.18.233.39)
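
The http workaround described above could be applied in one place with a helper like this sketch. It only rewrites the scheme, and it is only valid for servers that really serve the same path over http; the class name and the host in the usage note are placeholders.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UrlSchemeSketch {
    // Returns the same location with the http scheme when the URL uses ftp;
    // any other URL is returned unchanged.
    static String preferHttp(String url) throws URISyntaxException {
        URI uri = new URI(url);
        if (!"ftp".equalsIgnoreCase(uri.getScheme())) {
            return url;
        }
        return new URI("http", uri.getAuthority(), uri.getPath(),
                uri.getQuery(), uri.getFragment()).toString();
    }
}
```

For example, preferHttp("ftp://ftp.example.org/pub/file.txt") gives "http://ftp.example.org/pub/file.txt". This does not replace a proper passive-mode FTP client; it only centralises the current workaround.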

change front page text

<h1>G2S (Genome to Structure)</h1>

<div class="container">
		<h2>
			
		</h2>
		
		<p>The G2S web service provides an API that maps genomic and proteomic coordinates to locations in 3D structures in PDB. The underlying alignment data are updated on a weekly basis.</p>
		<p>
			There are two components in G2S web service:
		</p>
		<ul>
		     <li><b><a href="/pageapi">G2S API</a></b>:       Query by Uniprot/Human Ensembl/Genomic Position. </li>
		     <li><b><a href="/sequence">Sequence Web Interface</a></b> : Query by any protein sequences. </li>
		</ul>
		<p>
		     G2S web service supports:
		</p>
		<ul>
			<li>Retrieving aligned protein structure chains for an
				Ensembl/UniProt entry.</li>
			<li>Retrieving aligned protein structure chains for any protein
				sequence.</li>
			<li>Retrieving residue mapping between protein sequences and
				protein structures.</li>
			<li>Retrieving structural position of a genomic variant.</li>
		</ul>
		
	</div>

Maybe use a larger font for the text too.

Choice : mysql external script processing versus java DB interaction (repository)

The current code base has a java interface to the database for use by the API. The pipeline section uses an approach where external sql syntax script files are created and then processed with an external process call to the mysql client.

We should choose whether we will continue to use the external mysql script files, or switch to using the java database interface for doing updates.

If we continue to use the mysql script files, then we should clean up / break up the java classes which generate and execute the mysql scripts. We should also do better error detection and deal with the output streams from the external processes since they could block if too much output is generated by the mysql process and it is not taken from the buffers.

If we switch to using a java interface to the databases, we should drop all the code involved in external "mysql" processes and the writing of mysql INSERT or DELETE statements. Of course, there would need to be development of proper logic for doing insert operations and checking for exceptions and errors.

support querying dbsnp

A common request is to query by dbSNP ID.

Add an option dbsnp/[dbsnp_id] as id_type/id to the end points that support genomic locations.
Use VEP to map dbsnp to protein annotation and then map to structures, similar to genomic locations.

adding new query sequences dynamically

Currently, we need to rebuild/re-initialize the whole database from scratch in order to update to a new UniProt release. We should really eliminate this step.

support adding new query sequences into the database
when a new sequence (or a new uniprot/ensembl id) is queried, the alignments will be stored
the new query sequences will also be used for the weekly update to the PDB data

Once that's finished, adding a new sequence (or UniProt entry) is just a matter of querying the API once.

rename db_schema in application.properties to db_name or db_database_name

This property is to hold the name of the database which is being used to store data.

schema is a more general concept, as multiple databases can have identical schema.

In the mysql documentation, they use the pattern "db_name" to represent this string, for examples see:
https://dev.mysql.com/doc/refman/5.7/en/mysql.html
https://dev.mysql.com/doc/refman/5.7/en/alter-database.html

add page parameters to api endpoints

We need an API to determine whether any alignments exist for a given transcript, so that we don't have to pull all alignments just to determine that.

After discussion, we would like to add pagination similar to what we do in the cBioPortal API.

An empty list should be returned if there are no alignments in the result.

Please add the following parameters similar to cBioPortal endpoints:

If it's too much work to add to all, please start with /api/alignments/{id_type}/{id} which is used by cBioPortal.
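
The paging behaviour itself could be as simple as the sketch below. The parameter names pageSize and pageNumber are assumptions, since the parameter list is not reproduced in this issue, and a real implementation would more likely page in the database query than in memory.

```java
import java.util.Collections;
import java.util.List;

final class PageSketch {
    // Returns one page of results. pageNumber starts at 0; a page past the end
    // (or any page of an empty result) is an empty list rather than an error.
    static <T> List<T> page(List<T> all, int pageSize, int pageNumber) {
        if (pageSize <= 0 || pageNumber < 0) {
            throw new IllegalArgumentException("pageSize must be > 0 and pageNumber >= 0");
        }
        long from = (long) pageSize * pageNumber;
        if (from >= all.size()) {
            return Collections.emptyList();
        }
        int to = (int) Math.min(from + pageSize, (long) all.size());
        return all.subList((int) from, to);
    }
}
```

Asking for page 0 with pageSize 1 would then be enough to tell whether any alignments exist for a transcript.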

Rewording some messages

There are some messages that could have simpler / more accurate wording. Examples will be added in discussion comments to this Issue.