
Methods for the parallel and distributed analysis and mining of the Protein Data Bank using MMTF and Apache Spark.

License: Apache License 2.0

Java 100.00%
protein-structure protein-data-bank protein-sequences protein-ligand-interactions protein-protein-interaction machine-learning big-data apache-spark scientific-computing

mmtf-spark


MMTF-Spark is an open source Java project that provides APIs and sample applications for the scalable mining of 3D biomacromolecular structures, such as the Protein Data Bank (PDB) archive. MMTF-Spark uses Big Data technologies to enable high-performance parallel processing of macromolecular structures. MMTF-Spark uses the following technology stack:

  • Apache Spark: a fast and general engine for large-scale distributed data processing
  • MMTF: the Macromolecular Transmission Format for compact data storage, transmission, and high-performance parsing
  • Hadoop Sequence File: a Big Data file format for parallel I/O
  • Apache Parquet: a columnar data format to store dataframes
  • BioJava: a framework for processing biological data

Tutorials

The companion project mmtf-workshop-2017 offers an introduction to Apache Spark, along with in-depth tutorials and sample code that show how to use MMTF-Spark.

In addition, a Python version, MMTF-PySpark, is under development. MMTF-PySpark offers demos as Jupyter notebooks as well as an experimental zero-install Binder 2.0 deployment.

Installation

MacOS and LINUX

Windows

PDB archive as MMTF-Hadoop Sequence Files

For high-performance, parallel processing, mmtf-spark can read the PDB archive in the MMTF file format from Hadoop Sequence Files. See mmtf.rcsb.org for more details. The installation instructions cover the download of MMTF-Hadoop Sequence files.

Running a Demo Application using spark-submit

Example of running a simple structural query (see PolyPeptideChainStatistics.java).

spark-submit --class edu.sdsc.mmtf.spark.mappers.demos.PolyPeptideChainStatistics  INSTALL_DIRECTORY/mmtf-spark/target/mmtf-spark-0.3.0-SNAPSHOT.jar

Example of running a structural alignment (see DemoQueryVsAll.java).

spark-submit --class edu.sdsc.mmtf.spark.alignments.demos.DemoQueryVsAll  INSTALL_DIRECTORY/mmtf-spark/target/mmtf-spark-0.3.0-SNAPSHOT.jar

Example of retrieving PDB metadata (see PdbMetadataDemo.java).

spark-submit --class edu.sdsc.mmtf.spark.datasets.demos.PdbMetadataDemo  INSTALL_DIRECTORY/mmtf-spark/target/mmtf-spark-0.3.0-SNAPSHOT.jar

Example of retrieving PDB annotations from the SIFTS project (see SiftsDataDemo.java).

spark-submit --class edu.sdsc.mmtf.spark.datasets.demos.SiftsDataDemo INSTALL_DIRECTORY/mmtf-spark/target/mmtf-spark-0.3.0-SNAPSHOT.jar

Example with command line arguments. This example reads the PDB files in an input directory (recursively) and creates an MMTF-Hadoop Sequence file directory (see PdbToMmtfFull.java).

spark-submit --class edu.sdsc.mmtf.spark.io.demos.PdbToMmtfFull  INSTALL_DIRECTORY/mmtf-spark/target/mmtf-spark-0.3.0-SNAPSHOT.jar PDB_FILE_DIRECTORY MMTF_HADOOP_FILE_DIRECTORY
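
The recursive directory scan mentioned above can be sketched in plain Java. This is a stand-alone illustration of the idea, not the code in PdbToMmtfFull.java: the class name PdbFileFinder and the list of file suffixes are assumptions made for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of mmtf-spark.
final class PdbFileFinder {
    // Assumed file suffixes; adjust them to match your local archive.
    private static final List<String> SUFFIXES =
            Arrays.asList(".pdb", ".pdb.gz", ".ent", ".ent.gz");

    private PdbFileFinder() {}

    /** Walks the directory tree under root and returns matching files in sorted order. */
    static List<Path> findPdbFiles(Path root) throws IOException {
        // Files.walk descends into all subdirectories; the stream must be closed.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> hasPdbSuffix(p.getFileName().toString()))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    private static boolean hasPdbSuffix(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return SUFFIXES.stream().anyMatch(lower::endsWith);
    }

    public static void main(String[] args) throws IOException {
        // Uses the first argument as the input directory, or the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        findPdbFiles(root).forEach(System.out::println);
    }
}
```

Running the class with PDB_FILE_DIRECTORY as its argument prints the files that such a scan would pick up, which can help confirm the input directory before submitting the Spark job.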

How to Cite this Work

Bradley AR, Rose AS, Pavelka A, Valasatava Y, Duarte JM, Prlić A, Rose PW (2017) MMTF - an efficient file format for the transmission, visualization, and analysis of macromolecular structures. PLOS Computational Biology 13(6): e1005575. doi: 10.1371/journal.pcbi.1005575

Valasatava Y, Bradley AR, Rose AS, Duarte JM, Prlić A, Rose PW (2017) Towards an efficient compression of 3D coordinates of macromolecular structures. PLOS ONE 12(3): e0174846. doi: 10.1371/journal.pone.0174846

Rose AS, Bradley AR, Valasatava Y, Duarte JM, Prlić A, Rose PW (2018) NGL viewer: web-based molecular graphics for large complexes, Bioinformatics, bty419. doi: 10.1093/bioinformatics/bty419

Rose AS, Bradley AR, Valasatava Y, Duarte JM, Prlić A, Rose PW (2016) Web-based molecular graphics for large complexes. In Proceedings of the 21st International Conference on Web3D Technology (Web3D '16). ACM, New York, NY, USA, 185-186. doi: 10.1145/2945292.2945324

Funding

This project is supported by the National Cancer Institute of the National Institutes of Health under Award Number U01CA198942. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

mmtf-spark's People

Contributors

fnothaft, gjbekker, pwrose, yuy079


mmtf-spark's Issues

StructureToPolymerSequences

StructureToPolymerSequences has a parameter to remove duplicates by using a list. However, the sequences ArrayList is always returned, regardless of the value of the "remove duplicates" flag.
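
A minimal sketch of the expected behavior, in which the flag decides which collection is returned. SequenceCollector and its collect method are hypothetical names used only for this illustration; this is not the StructureToPolymerSequences code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical stand-in for illustration; not the mmtf-spark class.
final class SequenceCollector {
    private SequenceCollector() {}

    /**
     * Returns a copy of the input sequences, optionally without duplicates.
     * First-occurrence order is preserved in both cases.
     */
    static List<String> collect(List<String> sequences, boolean removeDuplicates) {
        if (removeDuplicates) {
            // LinkedHashSet drops repeats while keeping insertion order.
            return new ArrayList<>(new LinkedHashSet<>(sequences));
        }
        return new ArrayList<>(sequences);
    }

    public static void main(String[] args) {
        List<String> chains = Arrays.asList("MKV", "GSH", "MKV");
        System.out.println(collect(chains, true));
        System.out.println(collect(chains, false));
    }
}
```

The point of the sketch is that the return statement depends on the flag; returning the unfiltered list on both paths is the behavior this issue reports.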

Installation on Windows x64 machine

I've followed the instructions provided (including extra steps of downloading hadoop-2.6.0, setting path variables, etc.) but am still unable to successfully build mmtf-spark in eclipse. Any help would be greatly appreciated.

Can't tell if this is a POM issue as it gives all of the 'error reading' problems from the .m2 directory, or if the important part is the 'invalid LOC header'.

Windows 7 - 64bit
Eclipse IDE for Java Developers - Oxygen Release (4.7.0)
JDK v 1.8.0_131 (double-checked the jdk is being used not the jre)

[INFO] Scanning for projects...
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building mmtf-spark 0.0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[WARNING] The POM for org.biojava.thirdparty:forester:jar:1.038 is invalid, transitive dependencies (if any) will not be available, enable debug logging for more details
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ mmtf-spark ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 1 resource
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ mmtf-spark ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 131 source files to C:\Users\kmskvf\Desktop\Fragment_search\mmtf-spark\mmtf-spark\target\classes
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\biojava\biojava-structure\5.0.0-alpha8\biojava-structure-5.0.0-alpha8.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\apache\spark\spark-core_2.11\2.1.0\spark-core_2.11-2.1.0.jar; invalid CEN header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\codehaus\jackson\jackson-mapper-asl\1.9.13\jackson-mapper-asl-1.9.13.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\apache\hadoop\hadoop-common\2.2.0\hadoop-common-2.2.0.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\apache\commons\commons-math\2.1\commons-math-2.1.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\apache\hadoop\hadoop-hdfs\2.2.0\hadoop-hdfs-2.2.0.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\mortbay\jetty\jetty-util\6.1.26\jetty-util-6.1.26.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\apache\hadoop\hadoop-mapreduce-client-app\2.2.0\hadoop-mapreduce-client-app-2.2.0.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\apache\hadoop\hadoop-mapreduce-client-common\2.2.0\hadoop-mapreduce-client-common-2.2.0.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\com\google\inject\guice\3.0\guice-3.0.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\apache\hadoop\hadoop-yarn-server-common\2.2.0\hadoop-yarn-server-common-2.2.0.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\apache\hadoop\hadoop-yarn-api\2.2.0\hadoop-yarn-api-2.2.0.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\apache\hadoop\hadoop-mapreduce-client-core\2.2.0\hadoop-mapreduce-client-core-2.2.0.jar; invalid CEN header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\apache\hadoop\hadoop-yarn-common\2.2.0\hadoop-yarn-common-2.2.0.jar; invalid CEN header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\apache\spark\spark-network-common_2.11\2.1.0\spark-network-common_2.11-2.1.0.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\fusesource\leveldbjni\leveldbjni-all\1.8\leveldbjni-all-1.8.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\scala-lang\scala-library\2.11.8\scala-library-2.11.8.jar; invalid CEN header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\scala-lang\scala-compiler\2.11.0\scala-compiler-2.11.0.jar; invalid CEN header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\io\netty\netty-all\4.0.42.Final\netty-all-4.0.42.Final.jar; invalid CEN header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\io\netty\netty\3.8.0.Final\netty-3.8.0.Final.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\scala-lang\scala-reflect\2.11.7\scala-reflect-2.11.7.jar; invalid CEN header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\apache\ivy\ivy\2.4.0\ivy-2.4.0.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\scalatest\scalatest_2.11\2.2.6\scalatest_2.11-2.2.6.jar; invalid CEN header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\apache\spark\spark-sql_2.11\2.1.0\spark-sql_2.11-2.1.0.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\com\univocity\univocity-parsers\2.2.1\univocity-parsers-2.2.1.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\apache\spark\spark-catalyst_2.11\2.1.0\spark-catalyst_2.11-2.1.0.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\codehaus\janino\janino\3.0.0\janino-3.0.0.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\apache\parquet\parquet-column\1.8.1\parquet-column-1.8.1.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\apache\spark\spark-mllib_2.11\2.1.0\spark-mllib_2.11-2.1.0.jar; invalid CEN header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\apache\spark\spark-graphx_2.11\2.1.0\spark-graphx_2.11-2.1.0.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\spire-math\spire_2.11\0.7.4\spire_2.11-0.7.4.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\com\chuusai\shapeless_2.11\2.0.0\shapeless_2.11-2.0.0.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\org\jpmml\pmml-model\1.2.15\pmml-model-1.2.15.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\net\sourceforge\f2j\arpack_combined_all\0.1\arpack_combined_all-0.1.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\com\github\fommil\netlib\netlib-native_ref-osx-x86_64\1.1\netlib-native_ref-osx-x86_64-1.1-natives.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\com\github\fommil\netlib\netlib-native_ref-linux-x86_64\1.1\netlib-native_ref-linux-x86_64-1.1-natives.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\com\github\fommil\netlib\netlib-native_ref-linux-i686\1.1\netlib-native_ref-linux-i686-1.1-natives.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\com\github\fommil\netlib\netlib-native_ref-win-x86_64\1.1\netlib-native_ref-win-x86_64-1.1-natives.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\com\github\fommil\netlib\netlib-native_ref-win-i686\1.1\netlib-native_ref-win-i686-1.1-natives.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\com\github\fommil\netlib\netlib-native_ref-linux-armhf\1.1\netlib-native_ref-linux-armhf-1.1-natives.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\com\github\fommil\netlib\netlib-native_system-osx-x86_64\1.1\netlib-native_system-osx-x86_64-1.1-natives.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\com\github\fommil\netlib\netlib-native_system-linux-x86_64\1.1\netlib-native_system-linux-x86_64-1.1-natives.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\com\github\fommil\netlib\netlib-native_system-linux-i686\1.1\netlib-native_system-linux-i686-1.1-natives.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\com\github\fommil\netlib\netlib-native_system-linux-armhf\1.1\netlib-native_system-linux-armhf-1.1-natives.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\com\github\fommil\netlib\netlib-native_system-win-x86_64\1.1\netlib-native_system-win-x86_64-1.1-natives.jar; invalid LOC header (bad signature)
[WARNING] error reading C:\Users\kmskvf\.m2\repository\com\github\fommil\netlib\netlib-native_system-win-i686\1.1\netlib-native_system-win-i686-1.1-natives.jar; invalid LOC header (bad signature)
[INFO] 
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ mmtf-spark ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory C:\Users\kmskvf\Desktop\Fragment_search\mmtf-spark\mmtf-spark\src\test\resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ mmtf-spark ---
[INFO] Nothing to compile - all classes are up to date
[INFO] 
[INFO] --- maven-surefire-plugin:2.19.1:test (default-test) @ mmtf-spark ---

-------------------------------------------------------
 T E S T S
-------------------------------------------------------

Results :

Tests run: 0, Failures: 0, Errors: 0, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 9.155 s
[INFO] Finished at: 2017-07-27T13:23:59-05:00
[INFO] Final Memory: 45M/362M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test (default-test) on project mmtf-spark: Execution default-test of goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test failed: java.lang.NoClassDefFoundError: org/apache/spark/api/java/function/PairFlatMapFunction: org.apache.spark.api.java.function.PairFlatMapFunction -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
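
The "invalid LOC header" and "invalid CEN header" warnings in this log usually mean that the jar files in the local Maven repository are damaged, for example by interrupted downloads, rather than that the POM is wrong. One way to check is to try opening every jar as a zip archive. The sketch below uses only the Java standard library; the class name JarIntegrityCheck and the default repository path are assumptions made for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Enumeration;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical diagnostic helper; not part of mmtf-spark or Maven.
final class JarIntegrityCheck {
    private JarIntegrityCheck() {}

    /** Returns true if the archive opens and every entry can be read to the end. */
    static boolean isReadableArchive(Path jar) {
        byte[] buffer = new byte[8192];
        // Opening the file reads the central directory; draining each entry
        // also exercises the local headers and the compressed data.
        try (ZipFile zip = new ZipFile(jar.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                try (InputStream in = zip.getInputStream(entries.nextElement())) {
                    while (in.read(buffer) != -1) {
                        // Discard the data; only readability matters here.
                    }
                }
            }
            return true;
        } catch (IOException e) { // ZipException is a subclass of IOException
            return false;
        }
    }

    /** Lists all *.jar files under root that fail the readability check. */
    static List<Path> findUnreadableJars(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".jar"))
                    .filter(p -> !isReadableArchive(p))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location of the local Maven repository.
        Path repo = args.length > 0
                ? Paths.get(args[0])
                : Paths.get(System.getProperty("user.home"), ".m2", "repository");
        findUnreadableJars(repo).forEach(p -> System.out.println("unreadable: " + p));
    }
}
```

If the check reports damaged jars, deleting their directories from the local repository and rebuilding makes Maven download them again.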

MmtfImporter: setParseBioAssembly

By default, MMCIFFileReader does not read bioassembly records. The file parsing parameter must be set:
params.setParseBioAssembly(true);

InteractionFilter.java docstring typos

[line 334] filter.setQueryAtomNames(true, "CA", ,"CB"); -> filter.setQueryAtomNames(true, "CA", "CB");
[line 347] @param groups -> @param atomNames
[line 367] filter.setQueryAtomNames(true, "CA", ,"CB"); -> filter.setQueryAtomNames(true, "CA", "CB");
[line 380] @param groups -> @param atomNames
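
For context, the corrected docstring can be illustrated with a small varargs method. AtomNameFilter below is a hypothetical stand-in written for this example, not the InteractionFilter class; the include/exclude semantics shown are an assumption based on the method name and the corrected @param tag.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for illustration; not the mmtf-spark InteractionFilter.
final class AtomNameFilter {
    private boolean include = true;
    private Set<String> atomNames = new HashSet<>();

    /**
     * Sets the atom names to include in, or exclude from, the query.
     *
     * Example: filter.setQueryAtomNames(true, "CA", "CB");
     *
     * @param include   if true, only the named atoms match; if false, all other atoms match
     * @param atomNames one or more atom names
     */
    void setQueryAtomNames(boolean include, String... atomNames) {
        this.include = include;
        this.atomNames = new HashSet<>(Arrays.asList(atomNames));
    }

    /** Returns true if the given atom name passes the configured filter. */
    boolean matches(String atomName) {
        if (atomNames.isEmpty()) {
            return true; // No names configured: nothing is filtered out.
        }
        return include == atomNames.contains(atomName);
    }

    public static void main(String[] args) {
        AtomNameFilter filter = new AtomNameFilter();
        filter.setQueryAtomNames(true, "CA", "CB");
        System.out.println(filter.matches("CA"));
        System.out.println(filter.matches("N"));
    }
}
```

Because the names are passed as varargs, the doubled comma in the original docstring example would not compile, which is why the example call needs the fix as well as the @param tag.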

DatasetFileConverter

A converter is needed for Spark datasets that can convert among the supported file formats and compression codecs.
