sbl-sdsc / mmtf-pyspark

Methods for the parallel and distributed analysis and mining of the Protein Data Bank using MMTF and Apache Spark.

License: Apache License 2.0

pyspark binder protein-data-bank jupyter-notebook jupyter machine-learning scientific-computing big-data protein-structure protein-sequences protein-protein-interaction protein-ligand-interactions apache-spark

mmtf-pyspark's Introduction

MMTF PySpark

mmtfPyspark is a Python package that provides APIs and sample applications for distributed analysis and scalable mining of 3D biomacromolecular structures, such as the Protein Data Bank (PDB) archive. mmtfPyspark uses Big Data technologies to enable high-performance parallel processing of macromolecular structures. mmtfPyspark uses the following technology stack:

Apache Spark: a fast and general engine for large-scale distributed data processing
MMTF: the Macromolecular Transmission Format for compact data storage, transmission and high-performance parsing
Hadoop Sequence File: a Big Data file format for parallel I/O
Apache Parquet: a columnar data format to store dataframes

This project is under development.

Run mmtf-pyspark in your Web Browser

The Jupyter Notebooks in this repository can be run in your web browser using two freely available servers: Binder and CyVerse/VICE. Click on the buttons below to launch Jupyter Lab. It may take several minutes for Jupyter Lab to launch.

Navigate to the demos directory to run any of the example notebooks.

Binder

Binder is an experimental platform for reproducible research developed by Project Jupyter. Learn more about Binder. There are specific links for each notebook below; however, once Jupyter Lab is launched, you can navigate to any of the other notebooks using the Jupyter Lab file panel.

NOTE: Authentication is now required to launch binder! Sign into GitHub from your browser, then click on the launch binder badge below to launch Jupyter Lab.

CyVerse (experimental version)

The new VICE (Visual Interactive Computing Environment) in the CyVerse Discovery Environment enables users to run Jupyter Lab in a production environment. To use VICE, sign up for a free CyVerse account.

The VICE environment supports large-scale analyses. Users can upload and download files, and save and share results of their analyses in their user accounts (up to 100GB of data). The environment is preloaded with a local copy of the entire Protein Data Bank (~148,000 structures).

Follow these steps to run Jupyter Lab on VICE

Documentation

In Depth Tutorial

Installation

Python

We strongly recommend installing Anaconda. mmtfPyspark requires Python 3.6 or later. To check your Python version:

python --version

mmtfPyspark and dependencies

Since mmtfPyspark uses parallel computing to ensure high performance, it requires additional dependencies such as Apache Spark. Therefore, please follow the installation instructions for your OS carefully:

MacOS and LINUX

Windows

Hadoop Sequence Files

This project uses the PDB archive in the form of MMTF Hadoop Sequence Files. The full and reduced representations can be downloaded with:

curl -O https://mmtf.rcsb.org/v1.0/hadoopfiles/full.tar
tar -xvf full.tar

curl -O https://mmtf.rcsb.org/v1.0/hadoopfiles/reduced.tar
tar -xvf reduced.tar

On Mac and Linux, the Hadoop Sequence Files can be downloaded, and their locations saved in environment variables, by running the following commands:

curl https://raw.githubusercontent.com/sbl-sdsc/mmtf-pyspark/master/bin/download_mmtf_files.sh -o download_mmtf_files.sh
. ./download_mmtf_files.sh

How to Cite this Work

Bradley AR, Rose AS, Pavelka A, Valasatava Y, Duarte JM, Prlić A, Rose PW (2017) MMTF - an efficient file format for the transmission, visualization, and analysis of macromolecular structures. PLOS Computational Biology 13(6): e1005575. doi: 10.1371/journal.pcbi.1005575

Valasatava Y, Bradley AR, Rose AS, Duarte JM, Prlić A, Rose PW (2017) Towards an efficient compression of 3D coordinates of macromolecular structures. PLOS ONE 12(3): e0174846. doi: 10.1371/journal.pone.0174846

Rose AS, Bradley AR, Valasatava Y, Duarte JM, Prlić A, Rose PW (2018) NGL viewer: web-based molecular graphics for large complexes, Bioinformatics, bty419. doi: 10.1093/bioinformatics/bty419

Rose AS, Bradley AR, Valasatava Y, Duarte JM, Prlić A, Rose PW (2016) Web-based molecular graphics for large complexes. In Proceedings of the 21st International Conference on Web3D Technology (Web3D '16). ACM, New York, NY, USA, 185-186. doi: 10.1145/2945292.2945324

Binder

Project Jupyter, et al. (2018) Binder 2.0 - Reproducible, Interactive, Sharable Environments for Science at Scale. Proceedings of the 17th Python in Science Conference. 2018. doi: 10.25080/Majora-4af1f417-011

CyVerse

Merchant N, Lyons E, Goff S, Vaughn M, Ware D, Micklos D, et al. (2016) The iPlant Collaborative: Cyberinfrastructure for Enabling Data to Discovery for the Life Sciences. PLoS Biol 14(1): e1002342. doi: 10.1371/journal.pbio.1002342

Py3Dmol

Rego N, Koes D (2015) 3Dmol.js: molecular visualization with WebGL, Bioinformatics 31, 1322–1324. doi: 10.1093/bioinformatics/btu829

Funding

The MMTF project (Compressive Structural BioInformatics: High Efficiency 3D Structure Compression) is supported by the National Cancer Institute of the National Institutes of Health under Award Number U01CA198942. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

The CyVerse project is supported by the National Science Foundation under Award Numbers DBI-0735191, DBI-1265383, and DBI-1743442. URL: www.cyverse.org

mmtf-pyspark's Issues

Add classes to calculate interaction fingerprints

Java versions:

Interactions:
https://github.com/sbl-sdsc/mmtf-spark/blob/master/src/main/java/edu/sdsc/mmtf/spark/interactions/LigandInteractionFingerprint.java

https://github.com/sbl-sdsc/mmtf-spark/blob/master/src/main/java/edu/sdsc/mmtf/spark/interactions/PolymerInteractionFingerprint.java

Interactions demos
https://github.com/sbl-sdsc/mmtf-spark/blob/master/src/main/java/edu/sdsc/mmtf/spark/interactions/demos/LigandInteractionFingerprintDemo.java

https://github.com/sbl-sdsc/mmtf-spark/blob/master/src/main/java/edu/sdsc/mmtf/spark/interactions/demos/PolymerInteractionFingerprintDemo.java

Interactions tests
https://github.com/sbl-sdsc/mmtf-spark/blob/master/src/test/java/edu/sdsc/mmtf/spark/interactions/PolymerInteractionFingerprintTest.java

Upgrade to Spark 2.3.0

The download script still references the old Spark version 2.1.0.

Implement SiftsDataDemo

https://github.com/sbl-sdsc/mmtf-spark/blob/master/src/main/java/edu/sdsc/mmtf/spark/datasets/demos/SiftsDataDemo.java

[!] mmtfReader.download_reduced_mmtf_files: url is not defined

This method does not work:
structures = mmtfReader.download_reduced_mmtf_files(pdbids, sc)

It gives the following error:

File "/Users/peter/anaconda3/lib/python3.6/site-packages/mmtfPyspark/io/mmtfReader.py", line 186, in _get_structure
unpack = default_api.get_raw_data_from_url(pdbId, reduced)
File "/Users/peter/anaconda3/lib/python3.6/site-packages/mmtf/api/default_api.py", line 53, in get_raw_data_from_url
url = get_url(pdb_id,reduced)
File "/Users/peter/anaconda3/lib/python3.6/site-packages/mmtf/api/default_api.py", line 69, in get_url
return BASE_URL_REDUCED + pdb_id
NameError: name 'BASE_URL_REDUCED' is not defined

Implement PdbMetadataDemo

https://github.com/sbl-sdsc/mmtf-spark/blob/master/src/main/java/edu/sdsc/mmtf/spark/datasets/demos/PdbMetadataDemo.java

ProteinFoldDatasetCreator.ipynb

The following line should be changed from:

when((col("alpha") > maxThreshold) & (col("beta") < minThreshold), "alpha+beta")

to:
when((col("alpha") > maxThreshold) & (col("beta") > maxThreshold), "alpha+beta")
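
The corrected condition can be checked outside Spark with a minimal plain-Python sketch. Only the "alpha+beta" rule comes from the issue above; the other branches, the function name classify_fold, and the default threshold values are assumptions made for illustration and may differ from the notebook.

```python
def classify_fold(alpha, beta, min_threshold=0.05, max_threshold=0.25):
    """Assign a fold class from secondary-structure fractions (0.0 to 1.0).

    Only the "alpha+beta" rule is taken from the issue; the remaining
    rules and the default thresholds are illustrative assumptions.
    """
    if alpha > max_threshold and beta < min_threshold:
        return "alpha"
    if beta > max_threshold and alpha < min_threshold:
        return "beta"
    # Corrected rule: both fractions must exceed the upper threshold.
    if alpha > max_threshold and beta > max_threshold:
        return "alpha+beta"
    return "other"


print(classify_fold(alpha=0.40, beta=0.30))  # alpha+beta
print(classify_fold(alpha=0.40, beta=0.01))  # alpha
```

With the original condition (beta < minThreshold), the second call would have matched the "alpha+beta" rule as well, which is why the two classes could not be told apart.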

Add atom name filter criteria to interactionFilter

See Java version for added functionality to filter by atom names

https://github.com/sbl-sdsc/mmtf-spark/blob/master/src/main/java/edu/sdsc/mmtf/spark/interactions/InteractionFilter.java

[!] MetalInteractionsAdvanced: error

MetalInteractionsAdvanced:

interaction = self._get_interactions(arrays, queryAtomIndex, box)
File "/srv/conda/lib/python3.6/site-packages/mmtfPyspark/interactions/structureToAtomInteractions.py", line 117, in _get_interactions
n for neighbors in neighborIndices for n in neighbors]
File "/srv/conda/lib/python3.6/site-packages/mmtfPyspark/interactions/structureToAtomInteractions.py", line 117, in
n for neighbors in neighborIndices for n in neighbors]
TypeError: 'int' object is not iterable

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)

Build is failing

[!] structureviewer.view_structure

If the input contains a chain ID, change:

viewer.setStyle({'hetflag': True},{'stick':{'singleBond':False}})

to:

viewer.setStyle({'chain': chainid, 'hetflag': True},{'stick':{'singleBond':False}})

pdbid,chainid = pdbIds[i].split('.')
viewer = py3Dmol.view(query='pdb:' + pdbid, options={'doAssembly': bioAssembly})
viewer.setStyle({})
viewer.setStyle({'chain': chainid}, {style: {'color': color}})
viewer.setStyle({'hetflag': True},{'stick':{'singleBond':False}})
viewer.zoomTo({'chain': chainid})

[!] Update PdbjMineSearch

https://github.com/sbl-sdsc/mmtf-spark/blob/master/src/main/java/edu/sdsc/mmtf/spark/webfilters/PdbjMineSearch.java

proteinSequenceEncoder.py

Change the following line from:
'A' : [2.34,0.29,6.13,-1.01,10.74,0.36,0.25],

to:
'R' : [2.34,0.29,6.13,-1.01,10.74,0.36,0.25],

demos/filters/FilterByPolymerChainTypeDemo.ipynb

There are no results using the updated sample files.

Dealing with MMTF path

If we insert the following check, the notebooks should work both locally as well as in binder:

import os
# Set MMTF_FULL only when it is not already defined (e.g., by Binder).
if "MMTF_FULL" not in os.environ:
    os.environ["MMTF_FULL"] = path
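
The same check can be written with setdefault, which keeps an existing value and only fills in a missing one. This is a minimal sketch: the helper name ensure_mmtf_path and the example paths are hypothetical, and plain dicts stand in for os.environ in the demonstration.

```python
import os


def ensure_mmtf_path(default_path, env_name="MMTF_FULL", environ=None):
    """Return the path stored in env_name, setting it to default_path if unset."""
    if environ is None:
        environ = os.environ
    # setdefault leaves an existing value (e.g., one set by Binder) untouched.
    return environ.setdefault(env_name, default_path)


# Demonstration with plain dicts standing in for os.environ.
local_env = {}
binder_env = {"MMTF_FULL": "/binder/mmtf_full_sample"}
print(ensure_mmtf_path("./mmtf_full_sample", environ=local_env))   # ./mmtf_full_sample
print(ensure_mmtf_path("./mmtf_full_sample", environ=binder_env))  # /binder/mmtf_full_sample
```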

Remove .part-*.crc files in mmtf sample file directories

Remove the .part-*.crc files in the sample files directory.

These files have caused problems before.
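
One way to do the cleanup is a short script. This is a sketch only: the function name remove_crc_files and the dry_run default are assumptions, and the directory passed in is whichever sample directory needs cleaning.

```python
from pathlib import Path


def remove_crc_files(sample_dir, dry_run=True):
    """Find Hadoop checksum files such as .part-00000.crc under sample_dir.

    Returns the matching paths; deletes them only when dry_run is False.
    """
    crc_files = sorted(Path(sample_dir).rglob(".part-*.crc"))
    if not dry_run:
        for crc_file in crc_files:
            crc_file.unlink()
    return crc_files
```

Running it first with dry_run=True lists the files that would be deleted, so the result can be reviewed before anything is removed.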

structureViewer.view_binding_site: Jupyter notebook scrollbar issue

This method causes the Jupyter notebook scrollbar to appear. It may be related to the zoom call.

Remove the line: viewer.zoomTo(center)

Replace the line: viewer.zoom(0.3, 1000) with: viewer.zoomTo(neighbors)

Hosting sample files in a separate Git repo

We should consider hosting sample files in a separate Git repo (e.g., mmtf-samples).

For the binder setup, we should be able to download the data in the postBuild file (see: https://media.readthedocs.org/pdf/mybinder/latest/mybinder.pdf):

curl -O -L https://github.com/sbl-sdsc/mmtf-samples/raw/master/resources/mmtf_full_sample/part-0000[0-7]
mkdir mmtf_full_sample
mv part-0000* ./mmtf_full_sample

Implement DrugBankDemo

https://github.com/sbl-sdsc/mmtf-spark/blob/master/src/main/java/edu/sdsc/mmtf/spark/datasets/demos/DrugBankDemo.java

demos/mappers/MapToListDemo.ipynb: fix import

demos/mappers/MapToListDemo.ipynb

import for experimentalMethods needs to be fixed:

AttributeError Traceback (most recent call last)
in ()
----> 1 pdb = pdb.filter(experimentalMethods(experimentalMethods.X_RAY_DIFFRACTION))

AttributeError: module 'mmtfPyspark.filters.experimentalMethods' has no attribute 'X_RAY_DIFFRACTION'

Implement DrugBankDataset & test

https://github.com/sbl-sdsc/mmtf-spark/blob/master/src/main/java/edu/sdsc/mmtf/spark/datasets/DrugBankDataset.java

https://github.com/sbl-sdsc/mmtf-spark/blob/master/src/test/java/edu/sdsc/mmtf/spark/datasets/DrugBankDatasetTest.java

[!] mmtfWriter.write_sequence_file: error when writing Hadoop Sequence file

Read a structure

pdb = mmtfReader.download_full_mmtf_files(["4HHB"], sc)

Write a structure

mmtfWriter.write_sequence_file(path, sc, pdb)

There are some error messages regarding char encoding and altLocList. This may have something to do with the lazy initialization. Here is part of the error message:

File "/Users/peter/anaconda3/lib/python3.6/site-packages/mmtf/api/mmtf_writer.py", line 219, in encode_data
output_data["altLocList"] = encode_array(self.alt_loc_list, 6, 0)
File "/Users/peter/anaconda3/lib/python3.6/site-packages/mmtf/codecs/default_codec.py", line 29, in encode_array
return add_header(codec_dict[codec].encode(input_array, param), codec, len(input_array), param)
File "/Users/peter/anaconda3/lib/python3.6/site-packages/mmtf/codecs/codecs.py", line 62, in encode
converters.convert_chars_to_ints(in_array)),4)
File "/Users/peter/anaconda3/lib/python3.6/site-packages/mmtf/converters/converters.py", line 88, in convert_chars_to_ints
return [ord(x) for x in in_chars]
File "/Users/peter/anaconda3/lib/python3.6/site-packages/mmtf/converters/converters.py", line 88, in
return [ord(x) for x in in_chars]
TypeError: ord() expected string of length 1, but int found

ExperimentalMethods filter: change in logic

The logic of the ExperimentalMethods filter has changed. The filter now returns true if any, rather than all, of the specified experimental methods match. The documentation also needs to be updated accordingly.

See Java version
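
The difference between the old and new behavior can be illustrated in plain Python. This is a sketch of the matching semantics only, not the filter's actual implementation; the function names and the example method lists are assumptions.

```python
def matches_all(structure_methods, query_methods):
    """Old behavior: every queried method must be present."""
    return all(method in structure_methods for method in query_methods)


def matches_any(structure_methods, query_methods):
    """New behavior: one matching method is enough."""
    return any(method in structure_methods for method in query_methods)


# Hypothetical structure annotated with two experimental methods.
structure_methods = ["X-RAY DIFFRACTION", "NEUTRON DIFFRACTION"]
query_methods = ["X-RAY DIFFRACTION", "SOLUTION NMR"]

print(matches_all(structure_methods, query_methods))  # False
print(matches_any(structure_methods, query_methods))  # True
```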

Change in ColumnarStructureX

In getNormalizedbFactors(), the check for types[i] should be changed to:
if (! (types[i].equals("WAT"))) {

Implement SwissModel dataset

https://github.com/sbl-sdsc/mmtf-spark/blob/master/src/main/java/edu/sdsc/mmtf/spark/datasets/SwissModelDataset.java

Implement PdbjMineDataset test

https://github.com/sbl-sdsc/mmtf-spark/blob/master/src/test/java/edu/sdsc/mmtf/spark/datasets/PdbjMineDatasetTest.java

[!] StructureToPolymerChains: missing atoms

StructureToPolymerChains doesn't copy atoms:

pdb = mmtfReader.download_full_mmtf_files(["4HHB"], sc)
traverseStructureHierarchy.print_structure_data(pdb.first()[1])
*** STRUCTURE DATA ***
Number of models : 1
Number of chains : 14
Number of groups : 801
Number of atoms : 4779
Number of bonds : 4700

pdb = pdb.flatMap(StructureToPolymerChains())
traverseStructureHierarchy.print_structure_data(pdb.first()[1])

*** STRUCTURE DATA ***
Number of models : 1
Number of chains : 1
Number of groups : 141
Number of atoms : 12
Number of bonds : 11

ContainsGroup filter method: change logic

The logic of the ContainsGroup method has changed. The filter now returns true if any, rather than all, of the specified groups match. The documentation also needs to be updated accordingly.
See Java version

SequenceSimilarity filter

The current SequenceSimilarity filter is only a placeholder that needs to be implemented.

A full implementation is available in mmtf-spark: https://github.com/sbl-sdsc/mmtf-spark/blob/master/src/main/java/edu/sdsc/mmtf/spark/webfilters/SequenceSimilarity.java

Restore TraverseStructureHierarchy

TraverseStructureHierarchy must have been deleted accidentally. Please restore and put in utils.

Here is an old commit:
7f595ca

StructureToBioassembly: redundant entity_list

The entity list should contain only unique entities, but the entities are repeated. Example:

mmtfReader.download_full_mmtf_files(["1STP"], sc).flatMap(StructureToBioassembly()).first()[1].entity_list

Here, identical entities are listed 4 times:

[{'description': 'STREPTAVIDIN COMPLEX WITH BIOTIN',
'type': 'polymer',
'chainIndexList': [0],
'sequence': 'DPSKDSKAQVSAAEAGITGTWYNQLGSTFIVTAGADGALTGTYESAVGNAESRYVLTGRYDSAPATDGSGTALGWTVAWKNNYRNAHSATTWSGQYVGGAEARINTQWLLTSGTTEANAWKSTLVGHDTFTKVKPSAASIDAAKKAGVNNGNPLDAVQQ'},
{'description': 'STREPTAVIDIN COMPLEX WITH BIOTIN',
'type': 'polymer',
'chainIndexList': [1],
'sequence': 'DPSKDSKAQVSAAEAGITGTWYNQLGSTFIVTAGADGALTGTYESAVGNAESRYVLTGRYDSAPATDGSGTALGWTVAWKNNYRNAHSATTWSGQYVGGAEARINTQWLLTSGTTEANAWKSTLVGHDTFTKVKPSAASIDAAKKAGVNNGNPLDAVQQ'},
{'description': 'STREPTAVIDIN COMPLEX WITH BIOTIN',
'type': 'polymer',
'chainIndexList': [2],
'sequence': 'DPSKDSKAQVSAAEAGITGTWYNQLGSTFIVTAGADGALTGTYESAVGNAESRYVLTGRYDSAPATDGSGTALGWTVAWKNNYRNAHSATTWSGQYVGGAEARINTQWLLTSGTTEANAWKSTLVGHDTFTKVKPSAASIDAAKKAGVNNGNPLDAVQQ'},
{'description': 'STREPTAVIDIN COMPLEX WITH BIOTIN',
'type': 'polymer',
'chainIndexList': [3],
'sequence': 'DPSKDSKAQVSAAEAGITGTWYNQLGSTFIVTAGADGALTGTYESAVGNAESRYVLTGRYDSAPATDGSGTALGWTVAWKNNYRNAHSATTWSGQYVGGAEARINTQWLLTSGTTEANAWKSTLVGHDTFTKVKPSAASIDAAKKAGVNNGNPLDAVQQ'},
{'description': 'BIOTIN',
'type': 'non-polymer',
'chainIndexList': [4],
'sequence': ''},
{'description': 'BIOTIN',
'type': 'non-polymer',
'chainIndexList': [5],
'sequence': ''},
{'description': 'BIOTIN',
'type': 'non-polymer',
'chainIndexList': [6],
'sequence': ''},
{'description': 'BIOTIN',
'type': 'non-polymer',
'chainIndexList': [7],
'sequence': ''},
{'description': 'water',
'type': 'water',
'chainIndexList': [8],
'sequence': ''},
{'description': 'water',
'type': 'water',
'chainIndexList': [9],
'sequence': ''},
{'description': 'water',
'type': 'water',
'chainIndexList': [10],
'sequence': ''},
{'description': 'water',
'type': 'water',
'chainIndexList': [11],
'sequence': ''}]
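
One possible fix is to merge entries that share the same description, type, and sequence and to combine their chainIndexList values. The sketch below assumes that these three fields define entity identity; the function name merge_duplicate_entities is hypothetical and the sequences are shortened placeholders.

```python
def merge_duplicate_entities(entity_list):
    """Collapse identical entities into one entry with a combined chainIndexList."""
    merged = {}
    for entity in entity_list:
        # Assumption: description, type, and sequence together identify an entity.
        key = (entity["description"], entity["type"], entity["sequence"])
        if key not in merged:
            merged[key] = {**entity, "chainIndexList": []}
        merged[key]["chainIndexList"].extend(entity["chainIndexList"])
    return list(merged.values())


# Shortened stand-in for the 1STP output above (placeholder sequence).
entities = [
    {"description": "STREPTAVIDIN COMPLEX WITH BIOTIN", "type": "polymer",
     "chainIndexList": [0], "sequence": "DPSKDSKAQV"},
    {"description": "STREPTAVIDIN COMPLEX WITH BIOTIN", "type": "polymer",
     "chainIndexList": [1], "sequence": "DPSKDSKAQV"},
    {"description": "BIOTIN", "type": "non-polymer",
     "chainIndexList": [4], "sequence": ""},
    {"description": "BIOTIN", "type": "non-polymer",
     "chainIndexList": [5], "sequence": ""},
]

for entity in merge_duplicate_entities(entities):
    print(entity["description"], entity["chainIndexList"])
```

For the full 1STP example this would yield three entities (polymer, BIOTIN, water), each listing four chain indices.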

cKDTree for spatial neighbor search

Explore whether cKDTree can be used instead of the distanceBox method for spatial neighbor search within a search radius.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.cKDTree.html#scipy.spatial.cKDTree

Implement MyVariantDataset

code:
https://github.com/sbl-sdsc/mmtf-spark/blob/master/src/main/java/edu/sdsc/mmtf/spark/datasets/MyVariantDataset.java

test:
https://github.com/sbl-sdsc/mmtf-spark/blob/master/src/test/java/edu/sdsc/mmtf/spark/datasets/MyVariantDatasetTest.java

StructureToPolymerChains: excludeDuplicates has no effect

Setting excludeDuplicates=True has no effect. Chains with identical sequences should be excluded if this flag is set.
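
The intended behavior can be sketched in plain Python: keep the first chain for each distinct sequence and skip later chains with the same sequence. This is an illustration, not the mapper's code; the function name, the (chain key, sequence) tuple layout, and the placeholder sequences are assumptions.

```python
def exclude_duplicate_chains(chains):
    """Keep only the first chain for each distinct sequence.

    chains is an iterable of (chain_key, sequence) tuples.
    """
    seen_sequences = set()
    unique_chains = []
    for chain_key, sequence in chains:
        if sequence in seen_sequences:
            continue  # an identical sequence was already emitted
        seen_sequences.add(sequence)
        unique_chains.append((chain_key, sequence))
    return unique_chains


# Placeholder sequences: chains A/C and B/D are pairwise identical.
chains = [("4HHB.A", "SEQ_ALPHA"), ("4HHB.B", "SEQ_BETA"),
          ("4HHB.C", "SEQ_ALPHA"), ("4HHB.D", "SEQ_BETA")]
print(exclude_duplicate_chains(chains))
```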

pip install --upgrade mmtf-pyspark/: doesn't work properly

For example, it can't find webfilters, although the module is in the git repo.

ModuleNotFoundError Traceback (most recent call last)
in ()
1 from pyspark import SparkConf, SparkContext
2 from mmtfPyspark.io import mmtfReader
----> 3 from mmtfPyspark.webfilters import AdvancedQuery, ChemicalStructureQuery, Pisces
4 from mmtfPyspark.mappers import StructureToPolymerChains
5 from mmtfPyspark.structureViewer import view_structure

ModuleNotFoundError: No module named 'mmtfPyspark.webfilters'

New spark initialization

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("1-Input").getOrCreate()
spark.stop()

Remove the sc argument from the mmtfReader methods. Inside each read/download method, obtain the context from the active session:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

simple_structure_viewer: handle pdb.chainId

This method currently only accepts pdbIds. However, in many cases, we are working with chains.

If the passed in id is of the form: pdbid.chainId, then we should split the string by "." and use the chainId to only display that specified chain. Here is a simple example of only showing a specific chain:

pdbid = "4HHB"
chainid = "A"
viewer = py3Dmol.view(query="pdb:" + pdbid)
viewer.setStyle({})
viewer.setStyle({'chain': chainid}, {'cartoon': {'color': 'spectrum'}})
viewer.zoomTo({'chain': chainid})
viewer.show()
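
The ID handling described above can be sketched as a small helper that accepts either form. The name split_structure_id is hypothetical; the helper returns None for the chain when the ID contains no ".", so the caller can fall back to displaying the whole structure.

```python
def split_structure_id(structure_id):
    """Split 'pdbid.chainId' into (pdb_id, chain_id); chain_id is None if absent."""
    pdb_id, _, chain_id = structure_id.partition(".")
    return pdb_id, chain_id or None


print(split_structure_id("4HHB.A"))  # ('4HHB', 'A')
print(split_structure_id("4HHB"))    # ('4HHB', None)
```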

Disable unit tests that need network access

Fix path in MetalInteractionsExample.ipynb

ColumnarStructure: getEntityTypes()

The way water groups are identified has changed:
was:
} else if (ccType.equals("HOH")) {

should be:
} else if (groupNames[start].equals("HOH") || groupNames[start].equals("DOD")) {

Update unit tests for experimentalMethods filter

With the change in logic in this filter, the current unit tests will fail. See the updated Java unit tests.

Naming of download scripts

The names of the download scripts in /bin don't match the content.

Need new viewer method

Need a method similar to this one:
def view_group_interaction(pdbIds, interacting_atom='None', style='cartoon', color='spectrum'):

but, it would take an interacting_group instead of interacting_atom.

if interacting_group != "None":

        viewer.setStyle({'resn': interacting_group}, {
                        'sphere': {}})

    return viewer.show()

StructureToPolymerSequences: error

The following example:

mmtfReader.download_full_mmtf_files(["4HHB"], sc).flatMap(StructureToPolymerSequences()).keys().collect()

gives this error message:

polymer = structure.entity_list[i]['type'] == 'polymer'
IndexError: list index out of range
