genome-nexus / g2s

A standalone component to provide mappings between protein sequence positions and PDB 3D protein structure models.

License: GNU Affero General Public License v3.0
Tutorial: Running this project (Beta 1)

Prerequisites:
- OS: 64-bit Linux
- Java: OpenJDK 1.8.0
- Maven: 3.3.9
- MySQL: Ver 15.1 Distrib 10.0.21-MariaDB
- BLAST: 2.4.0+

Please make sure java, mvn, mysql, and blastp are all on your PATH.

How to run this project:

Step 1. Initialize the database
1. Create an empty database schema named "pdb", with username "cbio" and password "cbio". At the mysql prompt, type:
   CREATE USER 'cbio'@'localhost' IDENTIFIED BY 'cbio';
   GRANT ALL PRIVILEGES ON * . * TO 'cbio'@'localhost';
   FLUSH PRIVILEGES;
   create database pdb;
2. In your code workspace: git clone https://github.com/cBioPortal/pdb-annotation.git
3. Change settings in src/main/resources/application.properties:
   (i) Change workspace to the ${workdir} where the input sequences are located.
   (ii) Change resource_dir to "~/pdb-annotation/pdb/src/main/resources/".
   (iii) Change ensembl_input_interval for memory/performance considerations.
   (iv) If you want to use other test Ensembl sequences, change both ensembl_download_file and ensembl_fasta_file in your workspace.
4. mvn package
5. In pdb-annotation/pdb-alignment-pipeline/target/: java -jar -Xmx7000m pdb-0.1.0.jar init

Step 2. Check the API
1. In pdb-annotation/pdb-alignment-api/: mvn spring-boot:run
2. Swagger UI: http://localhost:8080/swagger-ui.html
3. Directly using the API:
   http://localhost:8080/pdb_annotation/StructureMappingQuery?ensemblId=ENSP00000483207.2
   http://localhost:8080/pdb_annotation/ProteinIdentifierRecognitionQuery?ensemblId=ENSP00000483207.2

Step 3. Weekly update
1. In pdb-annotation/pdb-alignment-pipeline/target/: java -jar -Xmx7000m pdb-0.1.0.jar update

Notes: Typical running time for pipeline init: 80219.905 seconds (around 22 hours). Typical running time for pipeline update: 1062.796 seconds (around 20 minutes). Tested on an Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 8 cores, 8G memory; Linux version 3.18.7-200.fc21.x86_64 (gcc version 4.9.2 20141101 (Red Hat 4.9.2-1) (GCC)) #1 SMP; OpenJDK version "1.8.0_65" 64-Bit Server VM (build 25.65-b01, mixed mode); mysql Ver 15.1 Distrib 10.0.21-MariaDB, for Linux (x86_64) using the EditLine wrapper. Please let me know if you have questions.
In PdbScriptsPipelineRunCommand.java, we have:

private List<String> makeDBCommand() {
    List<String> list = new ArrayList<String>();
    // Building the following process command
    list.add(rc.mysql);
    list.add("--max_allowed_packet=1024M");
This max_allowed_packet setting must be configured in both the client and the mysqld server to function correctly. The setting should not be hardcoded here; it should be set in application.properties.
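A minimal sketch of a property-driven version (the property key mysql_max_allowed_packet and its fallback default are assumptions, not keys that currently exist in application.properties):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class MysqlCommandBuilder {
    /**
     * Build the mysql client command, taking the packet size from the
     * loaded properties rather than hardcoding it. Falls back to 1024M
     * when the (hypothetical) property is absent.
     */
    public static List<String> makeDBCommand(Properties props, String mysqlPath) {
        List<String> list = new ArrayList<String>();
        list.add(mysqlPath);
        String packet = props.getProperty("mysql_max_allowed_packet", "1024M");
        list.add("--max_allowed_packet=" + packet);
        return list;
    }
}
```

Note that the client-side flag still only works if max_allowed_packet is also raised in the mysqld server configuration.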
The class member dataVesrsion should be renamed to dataVersion.
At the end of the update process (and the init process as well), the mysql client does not actually execute the "insert" and other commands. No exception or error is reported.
I will soon be debugging this problem
If we switch to a direct DB interaction through a Java repository, then this class will be mostly deleted.
Otherwise, its responsibilities should be grouped and broken out into other classes; for example, the parsing of the BLAST XML output could be separated from the generation of INSERT statements.
Other improvements:
I would separate all the "construct a command" type of functions into a utility class (or classes). I would create separate classes to handle:
- running local processes (runwithRedirect)
- downloadfile functionality
- blast processing functions (makedb, blastp)
- gunzip functionality
- FTP file download and parsing
That would leave the main run methods (run, runInit, runUpdatePDB), which might also be split up unless they share some common code. The ReadConfig class can/should be used in many places to provide access to the properties.
The root pom.xml file specifies Java 1.7, but the alignment-api specifies Java 1.8.
The project as a whole should use one consistent version of Java, so these should be reconciled.
PdbScriptsPipelineRunCommand.java has a commented-out reference to the -word_size parameter to blast.
Either delete this comment, or make it a parameter in application.properties and pass it through to blast.
Since the alignment table is subject to modification, it would be helpful for development and debugging to know when records were added to the table. The same goes for pdb_entry and any other tables that are modified as part of the update process.
command : java -jar -Xmx10g pdb-alignment-pipeline/target/pdb-alignment-pipeline-0.1.0.jar init
This was run with batch increment sizes of 10000 and an Ensembl sequence database size of 50000.
Here are examples of the stderr output stream:
As the code is currently written:
2016-08-15 14:23:23 ERROR CommandProcessUtil:33 - [Process] Error: 103 122 105 112 58 32 47 85 115 101 114 115 47 115 104 101 114 105 100 97 114 47 114 101 112 111 115 47 115 104 101 114 105 100 97 110 99 98 105 111 47 112 100 98 45 97 110 110 111 116 97 116 105 111 110 47 103 115 111 99
After changing to output characters, not integers:
2016-08-15 14:53:59 ERROR CommandProcessUtil:33 - [Process] Error: g z i p : / U s e r s / s c
After recoding the loop, so that the reading continues until end of file, and getting rid of tab characters between each letter:
2016-08-15 15:17:19 ERROR CommandProcessUtil:36 - [Process] Error: gzip: /Users/sheridar/repos/sheridancbio/pdb-annotation/gsoc_3d_testing/insert.sql.0.gz already exists; not overwritten
I have not tested the scheduled weekly updates using the Calendar function and Timer.schedule option.
From looking at the code, it seems that it will run the update at a future time and then exit. But won't the user need to restart this process after each update? If the user needs to restart it each week, they are not saving any time compared to running the process manually each week. So I think the internal scheduler should loop, scheduling the next update after it finishes each update.
If the desired functionality is that the program runs continuously in this mode, executing each week without needing to be manually restarted by an external user, then convert this issue from "Question" to "Bug" (unless the code is already functioning correctly and I misunderstood it).
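A sketch of the looping behavior described above, using java.util.Timer's fixed-rate scheduling so that each update automatically re-fires a week later without the user restarting anything (the class and method names are illustrative, not existing code):

```java
import java.util.Timer;
import java.util.TimerTask;

public class WeeklyUpdateScheduler {
    static final long WEEK_MS = 7L * 24 * 60 * 60 * 1000;

    /**
     * Schedule the update task to run repeatedly, once per week.
     * scheduleAtFixedRate keeps re-firing the task, so there is no
     * need to reschedule manually after each run, and the process
     * does not exit after the first update.
     */
    public static Timer scheduleWeekly(Runnable update, long firstDelayMs) {
        Timer timer = new Timer("weekly-update", false);
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                update.run();
            }
        }, firstDelayMs, WEEK_MS);
        return timer;
    }
}
```

The returned Timer can be cancelled on shutdown; using a non-daemon timer thread keeps the JVM alive between updates.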
In PdbScriptsPipelineRunCommand.java : runwithRedirectFrom(), a process is created and input is redirected to it, but the output streams are ignored and the return code / status code is not checked.
Long standard output or standard error streams can overflow the small process pipe buffers. In the case of mysql we probably don't expect that, but we should still consider redirecting the output streams to temporary files and deleting them when the mysql commands are complete.
In any case, return codes must be checked to see whether the mysql command failed or exited with an error.
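A sketch of the safer pattern (assuming nothing about the existing runwithRedirectFrom() signature): redirect both output streams to temporary files so the child process cannot block on full pipe buffers, and fail loudly on a non-zero exit code:

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

public class ProcessRunner {
    /**
     * Run an external command with stdout/stderr redirected to
     * temporary files, optionally feeding a file to stdin, and
     * throw if the exit code is non-zero.
     */
    public static int runChecked(List<String> command, File input)
            throws IOException, InterruptedException {
        File out = File.createTempFile("proc-", ".out");
        File err = File.createTempFile("proc-", ".err");
        ProcessBuilder pb = new ProcessBuilder(command);
        if (input != null) {
            pb.redirectInput(input);
        }
        pb.redirectOutput(out).redirectError(err);
        int exit = pb.start().waitFor();
        if (exit != 0) {
            // Keep stderr for post-mortem inspection on failure.
            throw new IOException("Command " + command + " failed with exit code "
                    + exit + "; stderr kept in " + err.getAbsolutePath());
        }
        out.delete();
        err.delete();
        return exit;
    }
}
```

The same pattern applies to the mysql, wget, and blast invocations elsewhere in the pipeline.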
PdbScriptsPipelineRunCommand.java has some public data members, and one whose access level is not specified (so it defaults to package-level access).
In general, all data members should be declared private unless there is a good reason not to. Make these private or discuss alternatives.
There is currently custom code for parsing the properties files and making values available to the code. There are also Spring-based directives for connecting to properties files.
We should have one or the other.
If we keep the custom parsing code, remove the Spring property directives.
If we get the Spring-based property parsing to work, then remove the custom code and replace all property references using the Spring framework.
The two application.properties settings:
ensembl_input_interval=10000
sql_insert_output_interval=10000
There are errors when these numbers evenly divide the number of generated alignments. We need to add code to the pipeline to avoid these exceptions. We could try to detect and delete empty files before attempting to parse them, or use clever record counting and interval-number factorization (using disparate prime numbers).
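One way to implement the first option: detect and delete empty insert.sql chunk files before they reach the parser. A sketch (the real pipeline would call this wherever the chunk files are enumerated):

```java
import java.io.File;

public class InsertFileGuard {
    /**
     * Return true if the generated insert.sql chunk file is worth
     * parsing. When the interval size evenly divides the alignment
     * count, the pipeline can emit a trailing zero-length file;
     * delete it and skip it instead of failing during parsing.
     */
    public static boolean keepOrDelete(File f) {
        if (f.exists() && f.length() == 0) {
            f.delete();
            return false;
        }
        return f.exists();
    }
}
```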
preprocessPDBsequences() contains commented-out alternative ways of writing the FASTA file. Either create a functional alternative selected via application.properties, or delete the commented-out code.
GRCh37:17:g.79478130C>G
GRCh38:17:g.79478130C>G
Is this repo still actively used anywhere? The main site seems to just use endpoints like https://mmtf.rcsb.org/v1.0/full/6GYR. Maybe we should archive this repo and remove its reference from https://docs.cbioportal.org/2.1-deployment/architecture-overview#g2s?
With the current DB settings, the data import service runs slowly; find a way to optimize it. A possible solution is MySQL bulk load.
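If we go the bulk-load route, the idea is to dump the alignments to a tab-separated file and hand MySQL a single LOAD DATA statement instead of many INSERTs. A sketch of building that statement (the table name and terminators are illustrative; LOCAL requires local_infile to be enabled on both client and server):

```java
public class BulkLoad {
    /**
     * Build a MySQL bulk-load statement for a tab-separated dump.
     * LOAD DATA INFILE is typically much faster than executing
     * row-by-row INSERT statements from a script file.
     */
    public static String loadDataSql(String tsvPath, String table) {
        return "LOAD DATA LOCAL INFILE '" + tsvPath + "' INTO TABLE " + table
                + " FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'";
    }
}
```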
Here are some suggested improvements and points of feedback:
Cache the data retrieved from the PDB header service (service/internal/PdbDataServiceImpl.java)
If the database is inaccessible, or other exceptions are thrown during a database query, the exceptions are not surfaced to the end user and the API may return incorrect values. For example, the following exception was recorded on startup in a local installation, but querying the API endpoint http://localhost:8080/pdb_annotation/ProteinIdentifierRecognitionQuery?ensemblId=ENSP00000485937.1 resulted in a simple return value of "false".
Beginning of exception stacktrace:
2016-09-13 05:52:30.490 ERROR 11452 --- [ main] o.a.tomcat.jdbc.pool.ConnectionPool : Unable to create initial connections of pool.
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Unknown database 'pdbtest'
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.8.0_91]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[na:1.8.0_91]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[na:1.8.0_91]
at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[na:1.8.0_91]
at com.mysql.jdbc.Util.handleNewInstance(Util.java:404) ~[mysql-connector-java-5.1.38.jar:5.1.38]
at com.mysql.jdbc.Util.getInstance(Util.java:387) ~[mysql-connector-java-5.1.38.jar:5.1.38]
Reported by a user: a strange case where a genomic position maps onto several different residues on the same protein chain. Here is an example:
https://g2s.genomenexus.org/api/alignments/hgvs-grch37/chr12:g.104713303C%3ET/pdb/3QFB_A/residueMapping
After getting the genome nexus data, I suspect that the gene_id was used for retrieving alignments. If so, transcript_id or protein_id should be used instead.
The output from running the update step of the pdb-alignment-pipeline should report the number of PDB files added, deleted, and modified in the runtime report. Total Input Queries is reported in the runtime report below (596), which is good. We would also be able to report the number of deletions from the alignment table if we used a true database interaction rather than the external .sql script.
[SHELL] Weekly Update: Create deleted list
[Preprocessing] Preprocessing PDB sequences...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/sheridar/repos/sheridancbio/pdb-annotation/pdb-alignment-pipeline/target/pdb-alignment-pipeline-0.1.0.jar!/lib/logback-classic-1.1.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/sheridar/repos/sheridancbio/pdb-annotation/pdb-alignment-pipeline/target/pdb-alignment-pipeline-0.1.0.jar!/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [ch.qos.logback.classic.util.ContextSelectorStaticBinder]
[BLAST] Running makeblastdb command...
[BLAST] Command makeblastdb complete
[BLAST] Running blastp command...
[BLAST] Command blastp complete
[BLAST] Read blast results from xml file...
[BLAST] Total Input Queries = 596
[SHELL] Start Write insert.sql File...
[SHELL] Write insert.sql Done
[DATABASE] Running mysql command...
[DATABASE] Command mysql complete
[Shell] Generating delete SQL
[DATABASE] Running mysql command...
[DATABASE] Command mysql complete
[Shell] All Execution time is 52.981 seconds
For the purposes of this exercise, use the default branch (master). In normal development practice, we typically create a new branch off the master branch before making changes. We also make pull requests against a different branch (such as hotfix or rc) when submitting PRs back to the cbioportal repository. So after the exercise is complete, read about creating and switching to a new branch, and maybe repeat the exercise using a named development branch (it could be called "pull-request-exercise" or something similar).
Deploy beta war to production. Then create frontend fix for new beta/
When running the init step (java -jar -Xmx7000m target/pdb-alignment-pipeline-0.1.0.jar init), if the import of alignment records fails (for instance, if there is no database present, no tables have been created, or the data does not properly fit into the created table schema), no error report is given.
If import fails, an error message must be returned to the user and the application must exit with a non-zero status code.
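A minimal sketch of the required behavior: map the import outcome to a process exit status so that calling scripts can detect the failure (the class and method names are hypothetical, not existing code):

```java
public class PipelineExit {
    /**
     * Map an import outcome to a process exit status: 0 on success,
     * non-zero so shells and schedulers can detect the failure.
     */
    public static int exitStatus(boolean importSucceeded) {
        return importSucceeded ? 0 : 1;
    }

    /** Report the failure to the user and exit with the mapped status. */
    public static void finish(boolean importSucceeded) {
        if (!importSucceeded) {
            System.err.println("[DATABASE] Import of alignment records failed");
        }
        System.exit(exitStatus(importSucceeded));
    }
}
```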
PdbScriptsPipelineRunCommand.java has code for downloadfile(), but the error detection for failed downloads is not adequate. The return code / exit code of the wget command must be checked, and the output streams from the process must be handled.
(These problems also occur for other ProcessBuilder use cases.)
Use newly developed G2Smutation parts
There are functions with a boolean return type (presumably to indicate success or failure) which only ever return true. Either identify, detect, and report failures, or change these functions to return void. Failures can be reported either via the return value or by throwing an exception, but if the return value is never false, switch to void.
Also, if a function does return a true/false success value, the calling code must check that value, report a problem if there was a failure, and take corrective action.
If exceptions are thrown, the appropriate "throws" clause must be put on the throwing function, and the calling code should have an appropriate try/catch. Do not use the unchecked RuntimeException classes unless they are actually appropriate to the case. The program should not "exception out" of execution unless the situation is serious and hard to predict.
examples:
PdbScriptsPipelinePreprocessing.java
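As an illustration of the suggested refactoring, here is a hypothetical always-true boolean check rewritten as a void method with a checked exception (the names are invented, not taken from PdbScriptsPipelinePreprocessing.java):

```java
import java.io.File;
import java.io.IOException;

public class PreprocessStep {
    /**
     * Instead of a boolean return value that is always true, report
     * failure through a checked exception and return void. Callers
     * must then handle the failure explicitly via try/catch or a
     * "throws" clause, rather than silently ignoring a return value.
     */
    public static void checkFastaReadable(File fasta) throws IOException {
        if (!fasta.isFile() || !fasta.canRead()) {
            throw new IOException("Cannot read FASTA input: " + fasta);
        }
    }
}
```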
Code in PdbScriptsPipelineRunCommand.java may attempt to access files through a URLConnection object from the standard Java library. If the URL specifies the "ftp:" protocol, and the program is running inside a private network or behind a firewall, the data connection for the FTP transfer may fail.
Luckily, all URLs currently in use come from servers which offer dual ftp/http protocol services, so we can avoid the exceptions by specifying the http protocol in the URL. But it would be better if the code used an FTP client with proper support for passive-mode file transfers, such as org.apache.commons.net.ftp.FTPClient.
Example of the error as observed:
[SHELL] Weekly Update: Create deleted list
java.io.IOException: sun.net.ftp.FtpProtocolException: PORT 172,18,233,39,194,100:500 Illegal command, EPSV ALL in effect.
at sun.net.www.protocol.ftp.FtpURLConnection.getInputStream(FtpURLConnection.java:488)
at org.cbioportal.pdb_annotation.scripts.PdbScriptsPipelineRunCommand.readFTPfile2List(PdbScriptsPipelineRunCommand.java:348)
at org.cbioportal.pdb_annotation.scripts.PdbScriptsPipelineRunCommand.prepareUpdatePDBFile(PdbScriptsPipelineRunCommand.java:404)
at org.cbioportal.pdb_annotation.scripts.PdbScriptsPipelineRunCommand.runUpdatePDB(PdbScriptsPipelineRunCommand.java:484)
at org.cbioportal.pdb_annotation.scripts.PdbScriptsPipelineStarter.main(PdbScriptsPipelineStarter.java:123)
(local machine on wireless network was assigned IP address 172.18.233.39)
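The interim workaround described above (preferring http over ftp in the URL) can be isolated in a small helper. This relies, as noted, on the servers serving the same paths over both protocols:

```java
public class FtpUrlFix {
    /**
     * Rewrite an ftp:// URL to http:// so that URLConnection does not
     * attempt an active-mode FTP data connection, which fails behind
     * NAT/firewalls (as in the PORT/EPSV error above). Only valid for
     * hosts that serve the same content over both protocols.
     */
    public static String preferHttp(String url) {
        if (url.startsWith("ftp://")) {
            return "http://" + url.substring("ftp://".length());
        }
        return url;
    }
}
```

A proper fix would use an FTP client with passive-mode support, but this keeps the pipeline working until that is in place.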
<h1>G2S (Genome to Structure)</h1>
<div class="container">
<h2>
</h2>
<p>The G2S web service provides an API that maps genomic and proteomic coordinates to locations in 3D structures in the PDB. The underlying alignment data are updated on a weekly basis.</p>
<p>
There are two components in the G2S web service:
</p>
<ul>
<li><b><a href="/pageapi">G2S API</a></b>: Query by UniProt/Human Ensembl/Genomic Position.</li>
<li><b><a href="/sequence">Sequence Web Interface</a></b>: Query by any protein sequence.</li>
</ul>
<p>
The G2S web service supports:
</p>
<ul>
<li>Retrieving aligned protein structure chains for an
Ensembl/UniProt entry.</li>
<li>Retrieving aligned protein structure chains for any protein
sequence.</li>
<li>Retrieving residue mapping between protein sequences and
protein structures.</li>
<li>Retrieving structural position of a genomic variant.</li>
</ul>
</div>
Maybe use a larger font for the text too.
The current code base has a Java interface to the database for use by the API. The pipeline section uses an approach where external SQL script files are created and then processed with an external process call to the mysql client.
We should choose whether to continue using the external mysql script files, or to switch to using the Java database interface for doing updates.
If we continue to use the mysql script files, then we should clean up and break up the Java classes which generate and execute them. We should also do better error detection and handle the output streams from the external processes, since these could block if the mysql process generates too much output and it is not drained from the buffers.
If we switch to using a Java interface to the databases, we should drop all the code involved in external "mysql" processes and the writing of mysql INSERT or DELETE statements. Of course, proper logic would then need to be developed for performing insert operations and for checking exceptions and errors.
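If we switch to the Java interface, the insert path could look roughly like this JDBC batching sketch (the table and column names are placeholders, not the actual schema):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class AlignmentWriter {
    // Hypothetical column list; the real schema lives in the pdb database.
    public static final String INSERT_SQL =
            "INSERT INTO pdb_seq_alignment (seq_id, pdb_no, alignment) VALUES (?, ?, ?)";

    /**
     * Insert alignments through JDBC batching instead of writing an
     * external insert.sql file and shelling out to the mysql client.
     * SQLExceptions then surface directly to the caller instead of
     * being lost in a child process's output stream.
     */
    public static int[] insertBatch(Connection conn, List<String[]> rows) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.setString(3, row[2]);
                ps.addBatch();
            }
            return ps.executeBatch();
        }
    }
}
```

Batching also sidesteps the max_allowed_packet issue, since no multi-megabyte SQL script is shipped to the client.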
A common request is to query by dbSNP ID. Add
dbsnp/[dbsnp_id]
as id_type/id
to the endpoints that support genomic locations.

Currently, we need to rebuild/re-initialize the whole database from scratch in order to update to a new UniProt release. We should really eliminate this step.
Once that's finished, adding a new sequence (or UniProt entry) is just a matter of querying the API once.
This property holds the name of the database which is being used to store the data.
"Schema" is a more general concept, as multiple databases can have identical schemas.
In the mysql documentation, they use the pattern "db_name" to represent this string, for examples see:
https://dev.mysql.com/doc/refman/5.7/en/mysql.html
https://dev.mysql.com/doc/refman/5.7/en/alter-database.html
We need an API to determine whether any alignments exist at all for a given transcript, so that we don't have to pull all alignments just to determine that.
After discussion, we would like to add pagination similar to what we do in the cBioPortal API.
An empty list should be returned if there are no alignments in the result.
Please add the following parameters, similar to cBioPortal endpoints:
If it's too much work to add them to all endpoints, please start with /api/alignments/{id_type}/{id},
which is used by cBioPortal.
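A sketch of the pagination semantics, assuming cBioPortal-style zero-based pageNumber and pageSize parameters, and returning an empty list (never null) when the page is out of range:

```java
import java.util.Collections;
import java.util.List;

public class Pagination {
    /**
     * Apply pageNumber/pageSize parameters to a result list.
     * Returns an empty list when the page is out of range, the
     * parameters are invalid, or there are no alignments at all.
     */
    public static <T> List<T> page(List<T> all, int pageNumber, int pageSize) {
        if (pageSize <= 0 || pageNumber < 0) {
            return Collections.emptyList();
        }
        int from = pageNumber * pageSize;
        if (from >= all.size()) {
            return Collections.emptyList();
        }
        int to = Math.min(from + pageSize, all.size());
        return all.subList(from, to);
    }
}
```

The alignment-existence check could then be implemented as a page of size 1, or better, as a dedicated count query in the repository layer.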
There are some messages that could have simpler / more accurate wording. Examples will be added in discussion comments to this Issue.