genome-nexus / g2s

A standalone component to provide mappings between protein sequence positions and PDB 3D protein structure models

License: GNU Affero General Public License v3.0

Java 93.54% CSS 0.02% Python 0.19% HTML 5.27% Shell 0.98%

g2s's Introduction

Tutorial for running this project (Beta 1)

Prerequisites:
OS: Linux 64bit
java: openjdk_1.8.0
maven: 3.3.9
mysql: Ver 15.1 Distrib 10.0.21-MariaDB
blast: 2.4.0+
*Please make sure java, mvn, mysql, and blastp are all on your PATH.

How to run this project:
Step 1. Init the Database
1. Create an empty database schema named "pdb" and a user "cbio" with password "cbio" in mysql:
	At the mysql prompt, type:
	CREATE USER 'cbio'@'localhost' IDENTIFIED BY 'cbio';
	GRANT ALL PRIVILEGES ON * . * TO 'cbio'@'localhost';
	FLUSH PRIVILEGES;
	create database pdb;
2. In your code workspace, git clone https://github.com/cBioPortal/pdb-annotation.git
3. Change settings in src/main/resources/application.properties 
	(i) Change workspace to the directory where the input sequences are located (${workdir}).
	(ii) Change resource_dir to "~/pdb-annotation/pdb/src/main/resources/"
	(iii) Change ensembl_input_interval to tune memory use and performance
	(iv) * If you want to use other Ensembl test sequences, change both ensembl_download_file and ensembl_fasta_file in your workspace
4. mvn package
5. in pdb-annotation/pdb-alignment-pipeline/target/: java -jar -Xmx7000m pdb-0.1.0.jar init
 
Step 2. Check the API
1. in pdb-annotation/pdb-alignment-api/: mvn spring-boot:run
2. Swagger-UI:
http://localhost:8080/swagger-ui.html
3. Directly using API:
http://localhost:8080/pdb_annotation/StructureMappingQuery?ensemblId=ENSP00000483207.2
http://localhost:8080/pdb_annotation/ProteinIdentifierRecognitionQuery?ensemblId=ENSP00000483207.2

Step 3. Weekly update
1. in pdb-annotation/pdb-alignment-pipeline/target/: java -jar -Xmx7000m pdb-0.1.0.jar update

Notes:
Typical Running time for pipeline Init  : 80219.905 Seconds (around 22 hours)
Typical Running time for pipeline Update: 1062.796  Seconds (around 20 minutes)

Tested on Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 8 cores, 8G Memory
Linux version 3.18.7-200.fc21.x86_64 (gcc version 4.9.2 20141101 (Red Hat 4.9.2-1) (GCC) ) #1 SMP
OpenJDK version "1.8.0_65" 64-Bit Server VM (build 25.65-b01, mixed mode)
mysql  Ver 15.1 Distrib 10.0.21-MariaDB, for Linux (x86_64) using  EditLine wrapper

Please let me know if you have questions.

g2s's People

Contributors

inodb, juexinwang, onursumer, sheridancbio

g2s's Issues

construction of mysql command hardcodes max_allowed_packet setting

in PdbScriptsPipelineRunCommand.java, we have:
private List makeDBCommand() {
List list = new ArrayList();
// Building the following process command
list.add(rc.mysql);
list.add("--max_allowed_packet=1024M");

This max_allowed_packet setting must be configured in both the client and the mysqld server to function correctly. The setting should not be hardcoded here; it should be set in application.properties.
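
A minimal sketch of how the command could be built from configuration rather than a hardcoded flag. The class name and the property keys "mysql" and "mysql_max_allowed_packet" are assumptions made for illustration, not names taken from the repository; only the --max_allowed_packet client option itself comes from the code quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper: the class name and property keys are assumptions.
class MysqlCommandBuilder {

    // Builds the mysql client command line from configured values.
    static List<String> makeDBCommand(Properties props) {
        List<String> command = new ArrayList<>();
        // Path to the mysql client; falls back to "mysql" on the PATH.
        command.add(props.getProperty("mysql", "mysql"));
        // Pass the flag only when a value was configured, so the client
        // default applies otherwise.
        String maxPacket = props.getProperty("mysql_max_allowed_packet");
        if (maxPacket != null && !maxPacket.trim().isEmpty()) {
            command.add("--max_allowed_packet=" + maxPacket.trim());
        }
        return command;
    }
}
```

The matching server-side value would still have to be set in the mysqld configuration; this sketch only covers the client side.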

PdbScriptsPipelineMakeSQL.java should be improved if we are keeping it

If we are switching to using a direct DB interaction through a java repository, then this class will be mostly deleted.

Otherwise, responsibilities should be grouped and broken out into other classes. One example might be to separate the parsing of blast xml, and separate the generation of insert statements.

Other improvements:

  • "choose" argument into parse2sql should be renamed so that it is clear what choice is being made. Also "1" and "0" as choices are unclear. Maybe use an enum with named values ... or define static final int class members at least.
  • parseblastresults : we have alternatives SmallMem and Single, but maybe there is no use for Single? If SmallMem will always work, drop Single from code.

experiment with spring-MVC and spring-boot for creating Web API

  • work through spring tutorial(s) and try ApiController example from cbioportal repository, to gain familiarity with Web API techniques.
  • also begin to think about what Web API endpoints (URL) and services (input parameters, and response format / java model) would make sense

PdbScriptsPipelineRunCommand.java is too big

I would separate all the "construct a command" type of functions to a utility class (or classes). I would create separate classes to handle:
- run a local process functions (runwithRedirect)
- downloadfile functionality
- blast processing functions (makedb, blastp)
- gunzip functionality
- ftp file download and parsing
That would leave the main run methods (run, runInit, runUpdatePDB) .. which might also be split up unless they share some common code. The ReadConfig class can/should be used in many places to provide access to the properties.

Add "DATE_ADDED" field to table pdb_ensembl_alignment

Since the alignment table is subject to modification, it would be helpful for development and debugging to know when records were added to the table. The same goes for pdb_entry and any other tables which are modified as part of the update process

gzip external process for insert archiving is broken

command : java -jar -Xmx10g pdb-alignment-pipeline/target/pdb-alignment-pipeline-0.1.0.jar init

This was run with batch increment sizes of 10000 and an Ensembl sequence database size of 50000.

  • First, the cleanup steps which gzip insert scripts fail when a gzip file already exists in the current directory .. this should be fixed to succeed (or not be run) when the user issues the "init" command on the java command line
  • There is a loop for reading the error stream. This loop has several problems:
    (1) The standard error stream is a stream of characters, but the code reads these characters as integer values and outputs them with tab characters between them. This destroys the meaning of the character stream.
    (2) The loop condition is set by calling InputStream.available(), which only tells how many characters are currently in the memory buffer. Why should we be printing out only the very small memory buffer? We should print the entire contents of the error stream. read() returns -1 when the system reaches EOF.
    (3) The reading of both output streams of the process (standard output and standard error) should be done no matter what, not just when there is an error. We must drain both streams so that the buffers do not fill up and cause the program to hang because the next write operation to the standard stream cannot complete due to a full buffer. Use a thread which reads an input stream and collects output into a String. I will send you example code.

Here are examples of the stderr output stream:

As the code is currently written:
2016-08-15 14:23:23 ERROR CommandProcessUtil:33 - [Process] Error: 103 122 105 112 58 32 47 85 115 101 114 115 47 115 104 101 114 105 100 97 114 47 114 101 112 111 115 47 115 104 101 114 105 100 97 110 99 98 105 111 47 112 100 98 45 97 110 110 111 116 97 116 105 111 110 47 103 115 111 99

After changing to output characters, not integers:
2016-08-15 14:53:59 ERROR CommandProcessUtil:33 - [Process] Error: g z i p : / U s e r s / s c

After recoding the loop, so that the reading continues until end of file, and getting rid of tab characters between each letter:
2016-08-15 15:17:19 ERROR CommandProcessUtil:36 - [Process] Error: gzip: /Users/sheridar/repos/sheridancbio/pdb-annotation/gsoc_3d_testing/insert.sql.0.gz already exists; not overwritten
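
The reader thread described in point (3) could look roughly like the following. This is a sketch under assumptions, not the example code mentioned above: the class name StreamCollector and the choice of UTF-8 decoding are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not from the repository): drains a stream on its own
// thread and collects everything up to end of file into a String.
class StreamCollector extends Thread {
    private final InputStream in;
    private final StringBuilder collected = new StringBuilder();
    private volatile IOException failure;

    StreamCollector(InputStream in) {
        this.in = in;
    }

    @Override
    public void run() {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buffer = new char[4096];
            int n;
            // read() returns -1 only at end of stream, so the loop does not
            // stop early the way a check on available() can.
            while ((n = reader.read(buffer)) != -1) {
                collected.append(buffer, 0, n);
            }
        } catch (IOException e) {
            failure = e;
        }
    }

    // Waits for the reader thread to finish, then returns the collected text.
    String await() {
        try {
            join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining stream", e);
        }
        if (failure != null) {
            throw new UncheckedIOException(failure);
        }
        return collected.toString();
    }
}
```

In use, one collector would be started on process.getInputStream() and another on process.getErrorStream() before calling process.waitFor(), so neither pipe buffer can fill up while the external command is running.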

Does the "updateweekly" mode work correctly currently?

I have not tested the scheduled weekly updates using the Calendar function and Timer.schedule option.

From looking at the code it seems that it will run the update at a future time and then exit .. but won't the user need to reset this process after each update? If the user needs to reset after each week, then they are not saving any time compared to running the process manually each week. So I think the internal scheduler should loop ... scheduling the next update after it finishes each update.

If the desired functionality is that the program run continuously in this mode, running each week without needing to be manually restarted by an external user, then convert this issue from "Question" to "Bug" (unless the code is functioning correctly already and I misunderstood it).
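
If continuous operation is the goal, one possible shape is a repeating schedule instead of a one-shot timer. The sketch below uses java.util.concurrent rather than the Timer currently in the code; the class name and the guarding of failed runs are assumptions for illustration.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical scheduler: runs the update repeatedly without a manual restart.
class RepeatingUpdateScheduler {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    // Schedules the update to run after initialDelay and then once per period.
    ScheduledFuture<?> start(Runnable update, long initialDelay, long period, TimeUnit unit) {
        // An uncaught exception would cancel all later runs of a periodic
        // task, so failures are logged here and the schedule continues.
        Runnable guarded = () -> {
            try {
                update.run();
            } catch (RuntimeException e) {
                System.err.println("Update failed, will retry at next interval: " + e);
            }
        };
        return executor.scheduleAtFixedRate(guarded, initialDelay, period, unit);
    }

    // Stops the scheduler so the JVM can exit.
    void stop() {
        executor.shutdownNow();
    }
}
```

A weekly run would then be started with something like start(task, delayUntilFirstRun, 7, TimeUnit.DAYS), where delayUntilFirstRun is computed from the desired day and time.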

runwithRedirectFrom() call needs error checking and output stream handling

In PdbScriptsPipelineRunCommand.java : runwithRedirectFrom(), a process is created and input is redirected to it, but output streams are ignored and the return code / status code is not checked.

Long standard output or standard error streams can cause the small process buffers to overload .. but in the case of mysql we probably don't expect that. Still, maybe we should redirect output streams to temporary files and then delete them when the mysql commands are complete.

But return codes must be checked to see if the mysql command failed or exited with an error.
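
A sketch of the approach described above: input is redirected from a file, both output streams go to a temporary file, and a non-zero exit status is turned into an exception. The class name and the error message format are assumptions; the method name mirrors runwithRedirectFrom() only for readability and is not the repository's implementation.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.List;

// Hypothetical helper, not the repository's code.
class RedirectedCommand {

    // Runs the command with stdin taken from inputFile. Output goes to a
    // temporary file, so the process cannot block on a full pipe buffer.
    static void runWithRedirectFrom(List<String> command, File inputFile)
            throws IOException, InterruptedException {
        File log = File.createTempFile("command-output", ".log");
        try {
            ProcessBuilder builder = new ProcessBuilder(command);
            builder.redirectInput(inputFile);
            builder.redirectErrorStream(true); // merge stderr into stdout
            builder.redirectOutput(log);
            int exitCode = builder.start().waitFor();
            if (exitCode != 0) {
                String output = new String(
                        Files.readAllBytes(log.toPath()), StandardCharsets.UTF_8);
                throw new IOException("Command " + command
                        + " exited with status " + exitCode + ": " + output);
            }
        } finally {
            // The temporary file is removed whether or not the command succeeded.
            Files.deleteIfExists(log.toPath());
        }
    }
}
```

The caller can catch the IOException, report it to the user, and exit with a non-zero status code.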

access levels not appropriate for data members

PdbScriptsPipelineRunCommand.java has some public data members, and one whose access level is not specified (so it defaults to "package" level access).

In general all data members should have access level "private" specified unless there is a good reason not to. Make these private or discuss alternatives.

choose schema and create database tables for storing sequences and alignments

  • Create database tables for holding sequences. Handle sequence identifiers properly, including a mapping of original sequence identifiers from sequence database source (such as Ensembl or Uniprot) to a set of non-redundant sequences with unique internal identifiers.
  • Create database tables for holding alignments between sequence database sequences and pdb sequences. Primary goal is that the system can retrieve an amino acid to amino acid mapping between sequence database sequence and pdb structure coordinates (resSeq). Secondary goal is to retrieve information about the entire alignment (such as alignment start / end positions and a text representation of the alignment which shows matches, mismatches and gaps as shown in a blast hit report).

Choice : custom application.properties parser versus spring based properties

There is currently custom code for parsing the properties files and making values available to the code. There are also spring based directives for connecting to properties files.

We should have one or the other.

If we keep the custom parsing code then:

  • drop the spring annotations for properties processing
  • make the parsing code function so that the file is parsed only once, values are cached in memory, and many java classes can read the properties without causing a new parsing of the file.

If we get the spring based property parsing to work, then remove the custom code, and replace all properties references using the spring framework.
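
For the first option (keeping custom parsing), "parse once, cache in memory" could be sketched as below. The class name ConfigCache and the use of a Supplier to open the file are assumptions for illustration; the existing ReadConfig class may be organized differently.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.function.Supplier;

// Hypothetical cache: the properties source is opened and parsed at most once.
class ConfigCache {
    private final Supplier<InputStream> source;
    private Properties cached; // null until the first lookup

    ConfigCache(Supplier<InputStream> source) {
        this.source = source;
    }

    // Returns the value for key, parsing the source on first use only.
    synchronized String get(String key) {
        if (cached == null) {
            Properties loaded = new Properties();
            try (InputStream in = source.get()) {
                loaded.load(in);
            } catch (IOException e) {
                throw new UncheckedIOException("cannot read properties", e);
            }
            cached = loaded;
        }
        return cached.getProperty(key);
    }
}
```

A single shared instance can then be handed to every class that needs settings, so no class triggers a second parse of application.properties.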

plan components and packages / modules for gsoc code development

  • consider what classes and other resources are likely to be needed for the gsoc project development; draft a design list
  • group new development items into components
  • decide on reasonable code organization [modules, packages, paths] for components being planned

exception when "interval" properties divide evenly into alignment count

The two application.properties settings:
ensembl_input_interval=10000
sql_insert_output_interval=10000

There are errors when these numbers evenly divide into the number of generated alignments. We need to add code into the pipeline to avoid these exceptions. We could try to detect and delete empty files before attempting to parse them .. or use clever record counting and interval number factorization (using disparate prime numbers)
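
One way to avoid the problem at its source is to never produce an empty trailing batch. The sketch below assumes that the exceptions come from an empty final file being written when the count is an exact multiple of the interval; that cause is an inference from the description above, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits records into batches of at most `interval`
// items and never emits an empty batch.
class BatchSplitter {

    static <T> List<List<T>> split(List<T> records, int interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        // The loop starts a batch only while records remain, so an exact
        // multiple of the interval yields no extra empty batch.
        for (int start = 0; start < records.size(); start += interval) {
            int end = Math.min(start + interval, records.size());
            batches.add(new ArrayList<>(records.subList(start, end)));
        }
        return batches;
    }
}
```

With 20000 records and an interval of 10000 this gives exactly two batches, and each batch can be written to its own file without a later emptiness check.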

default data import is too slow

With the current DB settings, the data import service runs slowly. Find a way to optimize it. A possible solution is MySQL bulk load.

Rev2: changes to the gsoc project completion wiki page

Here are some suggested improvements and points of feedback:

  • The title of the page doesn't need to say "wiki", since that is obvious because the page is found under the "wiki" tab. It also doesn't need to say pdb-annotation .. since visitors already know the project they are looking at. For a title, describe what this particular wiki page is meant to capture .. my suggestion: "gsoc 2016 project summary and guide"
  • delete the welcome line (not helpful)
  • Use better text formatting and indentation to make clear which parts of the instructions in section 2 (Installation Guide) are "commands for the user to type". If these are all in a distinctive font and have a standard indent it makes it much easier for the reader to know what they are supposed to type and where they are in the process.
  • The instructions themselves in section 2 (Installation Guide) should be rewritten so that someone who knows nothing about the project will still be able to know what to do and how to get the program installed and running. Too many of these instructions assume the user knows the project and can figure out appropriate settings ... and some of the example file paths will probably need to be different depending on where the user chooses to install and run the code.
  • the API Documentation Section "Functional Details" for the first API endpoint is not accurate. Rewrite this text so it describes what the API actually does
  • the one main figure / diagram is good. Look for other opportunities to present helpful images which capture the purpose of document sections "at a glance" A figure for the API Documentation section might be helpful for example.
  • there are many suggested rewordings and comments on how to improve the text here: https://docs.google.com/document/d/1tU0lrjmdRu0NrESq1K4mgcHMR7FwMt5HLQklgSw8ikQ Consider these suggestions and make improvements as possible.
  • consider adding a section describing the remaining functionality that needs to be added to meet all the needs of the cBioPortal project. For example, describe the need for an API query to find the PDB coordinate for a specific residue in the protein sequence ... and mention some of the challenges that need to be dealt with, such as the segmentation problem and gaps in the pdb sequence.

Web API does not report exceptions

If the database is inaccessible, or other exceptions are thrown during database query, the exceptions are not seen by the end user and the api may return incorrect values. For example, the following exception was recorded on startup in a local installation but querying the Api endpoint http://localhost:8080/pdb_annotation/ProteinIdentifierRecognitionQuery?ensemblId=ENSP00000485937.1 resulted in a simple return value of "false"

Beginning of exception stacktrace:

2016-09-13 05:52:30.490 ERROR 11452 --- [ main] o.a.tomcat.jdbc.pool.ConnectionPool : Unable to create initial connections of pool.

com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Unknown database 'pdbtest'
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.8.0_91]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[na:1.8.0_91]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[na:1.8.0_91]
at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[na:1.8.0_91]
at com.mysql.jdbc.Util.handleNewInstance(Util.java:404) ~[mysql-connector-java-5.1.38.jar:5.1.38]
at com.mysql.jdbc.Util.getInstance(Util.java:387) ~[mysql-connector-java-5.1.38.jar:5.1.38]

database update process does not report numbers

The output from running the update step of the pdb-alignment-pipeline should report the number of pdb files added, deleted, modified in the runtime report. Total Input Queries is reported in the runtime report below (596) --- this is good. We also would be able to report the number of deletions from the alignment table if we used a true database interaction rather than the external .sql script.

[SHELL] Weekly Update: Create deleted list
[Preprocessing] Preprocessing PDB sequences...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/sheridar/repos/sheridancbio/pdb-annotation/pdb-alignment-pipeline/target/pdb-alignment-pipeline-0.1.0.jar!/lib/logback-classic-1.1.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/sheridar/repos/sheridancbio/pdb-annotation/pdb-alignment-pipeline/target/pdb-alignment-pipeline-0.1.0.jar!/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [ch.qos.logback.classic.util.ContextSelectorStaticBinder]
[BLAST] Running makeblastdb command...
[BLAST] Command makeblastdb complete
[BLAST] Running blastp command...
[BLAST] Command blastp complete
[BLAST] Read blast results from xml file...
[BLAST] Total Input Queries = 596
[SHELL] Start Write insert.sql File...
[SHELL] Write insert.sql Done
[DATABASE] Running mysql command...
[DATABASE] Command mysql complete
[Shell] Generating delete SQL
[DATABASE] Running mysql command...
[DATABASE] Command mysql complete
[Shell] All Execution time is 52.981 seconds

install and test necessary software tools for gsoc project

  • Install blast and test protein alignment generation
  • Install database system (mysql or other rDBMS)
  • Install or test pdb parsing library (unless all sequence and coordinate info can be downloaded from rcsb)
  • Install web application container if not using spring-boot to provide a container

practice using github fork to submit pull request

  • create a fork of this repository
  • clone the fork to a local machine
  • add a change, such as a comment in a README file or addition of a document file
  • commit the change on the local machine using git
  • push the change back to the fork on github
  • create a new pull request to this repository

For the purposes of this exercise, use the default branch (master). In normal development practice, we typically create a new branch off the master branch before making changes. We also make pull requests on a different branch (such as hotfix or rc) when submitting PRs back to the cbioportal repository. So after the exercise is complete, read about creating and switching to a new branch, and maybe repeat the exercise using a named development branch (it could be called "pull-request-exercise" or something similar).

Errors not reported if mysql insert script fails

When running the init step (java -jar -Xmx7000m target/pdb-alignment-pipeline-0.1.0.jar init), if the import of alignment records fails (for instance if there is no database present, no tables have been created, or the data does not properly fit into the created table schema), no error report is given.

If import fails, an error message must be returned to the user and the application must exit with a non-0 status code.

wget functionality via external process : error detection needed

PdbScriptsPipelineRunCommand.java has code for downloadfile(), but the error detection of failed downloads is not adequate. The return code / exit code of the wget command must be checked. Also the output streams from the process must be handled.

(These problems also occur for other ProcessBuilder use cases)

many functions with Boolean return type always return true

There are functions with a boolean return type (hypothetically to indicate success or failure of function) but which only return a true value. Either identify, detect, and report failures, or change these functions into return type "void". Failures can be reported either by the return type or by throwing an exception, but if the return type is never "false" then switch to void.

Also, if a function does return a true/false success value the calling code must check that value and report a problem if there was a failure, then take corrective action.

If exceptions are thrown, the appropriate "throws" clause must be put on the throwing function and the calling code should have an appropriate "try/catch". Do not use the "unchecked" RuntimeException classes unless they are actually appropriate to the case. The program should not "exception out" of execution unless it is a serious and hard to predict situation.

Examples:

PdbScriptsPipelinePreprocessing.java
  • preprocessPDBsequences()
  • preprocessPDBsequencesUpdate()

PdbScriptsPipelineMakeSQL.java
  • parse2sql
  • parseblastresultsSmallMem
  • generateSQLstatementsSingle
  • genereateSQLstatementsSmallMem
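
As an illustration of the pattern asked for here, the sketch below replaces an always-true boolean with a useful return value plus a checked exception, and shows a caller that handles the failure. The class and method names are hypothetical and do not correspond to the functions listed above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical example of reporting failure through a checked exception.
class PreprocessingStep {

    // Returns the number of FASTA header lines; an unreadable file is
    // reported through IOException instead of a boolean flag.
    static long countFastaRecords(Path fasta) throws IOException {
        try (Stream<String> lines = Files.lines(fasta)) {
            return lines.filter(line -> line.startsWith(">")).count();
        }
    }

    // Caller: reports the problem and converts it into an exit status.
    static int run(Path fasta) {
        try {
            long records = countFastaRecords(fasta);
            System.out.println("[Preprocessing] " + records + " sequences read");
            return 0;
        } catch (IOException e) {
            System.err.println("[Preprocessing] failed: " + e);
            return 1;
        }
    }
}
```

The exit status returned by run() can be passed to System.exit() in the main class, so a failed step is visible to the shell that launched the pipeline.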

FTP connections through URLConnection fail from within private network / firewall

Code in PdbScriptsPipelineRunCommand.java may attempt to access files through a URLConnection object from the standard java library. If the url specifies the "ftp:" protocol, and if the program is running inside a private network or behind a firewall, the data connection for the ftp transfer may fail.

Luckily, all URLs in use currently come from servers which offer dual ftp/http protocol services, so we can avoid the exceptions by specifying in the URL to use http protocol. But it would be better if the code used an FTP client which had proper support for passive mode file transfers such as org.apache.commons.net.ftp.FTPClient.

Example of the error as observed:

[SHELL] Weekly Update: Create deleted list
java.io.IOException: sun.net.ftp.FtpProtocolException: PORT 172,18,233,39,194,100:500 Illegal command, EPSV ALL in effect.

at sun.net.www.protocol.ftp.FtpURLConnection.getInputStream(FtpURLConnection.java:488)
at org.cbioportal.pdb_annotation.scripts.PdbScriptsPipelineRunCommand.readFTPfile2List(PdbScriptsPipelineRunCommand.java:348)
at org.cbioportal.pdb_annotation.scripts.PdbScriptsPipelineRunCommand.prepareUpdatePDBFile(PdbScriptsPipelineRunCommand.java:404)
at org.cbioportal.pdb_annotation.scripts.PdbScriptsPipelineRunCommand.runUpdatePDB(PdbScriptsPipelineRunCommand.java:484)
at org.cbioportal.pdb_annotation.scripts.PdbScriptsPipelineStarter.main(PdbScriptsPipelineStarter.java:123)

(local machine on wireless network was assigned IP address 172.18.233.39)

change front page text

<h1>G2S (Genome to Structure)</h1>

<div class="container">
		<h2>
			
		</h2>
		
		<p>The G2S web service provides an API that maps genomic and proteomic coordinates to locations in 3D structures in PDB. The underlying alignment data are updated on a weekly basis.</p>
		<p>
			There are two components in G2S web service:
		</p>
		<ul>
		     <li><b><a href="/pageapi">G2S API</a></b>:       Query by Uniprot/Human Ensembl/Genomic Position. </li>
		     <li><b><a href="/sequence">Sequence Web Interface</a></b> : Query by any protein sequences. </li>
		</ul>
		<p>
		     G2S web service supports:
		</p>
		<ul>
			<li>Retrieving aligned protein structure chains for an
				Ensembl/UniProt entry.</li>
			<li>Retrieving aligned protein structure chains for any protein
				sequence.</li>
			<li>Retrieving residue mapping between protein sequences and
				protein structures.</li>
			<li>Retrieving structural position of a genomic variant.</li>
		</ul>
		
	</div>

Maybe use a larger font for the text too.

Choice : mysql external script processing versus java DB interaction (repository)

The current code base has a java interface to the database for use by the API. The pipeline section uses an approach where external sql syntax script files are created and then processed with an external process call to the mysql client.

We should choose whether we will continue to use the external mysql script files, or switch to using the java database interface for doing updates.

If we continue to use the mysql script files, then we should clean up / break up the java classes which generate and execute the mysql scripts. We should also do better error detection and deal with the output streams from the external processes since they could block if too much output is generated by the mysql process and it is not taken from the buffers.

If we switch to using a java interface to the databases, we should drop all the code involved in external "mysql" processes and the writing of mysql INSERT or DELETE statements. Of course, there would need to be development of proper logic for doing insert operations and checking for exceptions and errors.

support querying dbsnp

A common request is to query by dbSNP ID.

  • Add an option dbsnp/[dbsnp_id] as id_type/id to the end points that support genomic locations.
  • Use VEP to map dbsnp to protein annotation and then map to structures, similar to genomic locations.

adding new query sequences dynamically

Currently, we need to rebuild/re-initialize the whole database from scratch in order to update to a new UniProt release. We should really eliminate this step.

  • support adding new query sequences into the database
  • when a new sequence (or a new uniprot/ensembl id) is queried, the alignments will be stored
  • the new query sequences will also be used for the weekly update to the PDB data

Once that's finished, adding a new sequence (or uniprot entry) is just a matter of querying the API once.

add page parameters to api endpoints

We need an API to determine whether alignments exist at all for a given transcript so that we don't have to pull all alignments just to determine that.

After discussion, we would like to add pagination similar to what we do in the cBioPortal API.

An empty list should be returned if there are no alignments in the result.

Please add the following parameters similar to cBioPortal endpoints:

[image of the parameters, not preserved]

If it's too much work to add to all, please start with /api/alignments/{id_type}/{id} which is used by cBioPortal.
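
A rough sketch of the paging logic, including the empty-list behaviour requested above. Because the parameter list was only given as an image, the names pageNumber and pageSize and the zero-based page numbering are assumptions, and the class name is hypothetical.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical helper: returns one page of a result list.
class Pager {

    // pageNumber is zero-based; a page past the end yields an empty list.
    static <T> List<T> page(List<T> all, int pageNumber, int pageSize) {
        if (pageNumber < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("invalid page parameters");
        }
        // Computed as long so that large page numbers cannot overflow int.
        long from = (long) pageNumber * pageSize;
        if (from >= all.size()) {
            return Collections.emptyList();
        }
        int to = (int) Math.min(from + pageSize, (long) all.size());
        return all.subList((int) from, to);
    }
}
```

Requesting page 0 with a page size of 1 would then be enough to tell whether any alignments exist for a transcript. In a real endpoint the paging would more likely be pushed down into the database query than applied to a fully loaded list.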

Rewording some messages

There are some messages that could have simpler / more accurate wording. Examples will be added in discussion comments to this Issue.
