sbl-sdsc / mmtf-pyspark

Methods for the parallel and distributed analysis and mining of the Protein Data Bank using MMTF and Apache Spark.

License: Apache License 2.0


mmtf-pyspark's Introduction

MMTF PySpark


mmtfPyspark is a Python package that provides APIs and sample applications for distributed analysis and scalable mining of 3D biomacromolecular structures, such as the Protein Data Bank (PDB) archive. mmtfPyspark uses Big Data technologies to enable high-performance parallel processing of macromolecular structures. mmtfPyspark uses the following technology stack:

  • Apache Spark: a fast and general engine for large-scale distributed data processing.
  • MMTF: the Macromolecular Transmission Format for compact data storage, transmission, and high-performance parsing.
  • Hadoop Sequence File: a Big Data file format for parallel I/O.
  • Apache Parquet: a columnar data format to store dataframes.

This project is under development.

Run mmtf-pyspark in your Web Browser

The Jupyter Notebooks in this repository can be run in your web browser using two freely available servers: Binder and CyVerse/VICE. Click on the buttons below to launch Jupyter Lab. It may take several minutes for Jupyter Lab to launch.

Navigate to the demos directory to run any of the example notebooks.

Binder

Binder is an experimental platform for reproducible research developed by Project Jupyter. Learn more about Binder. There are specific links for each notebook below; however, once Jupyter Lab is launched, you can navigate to any of the other notebooks using the Jupyter Lab file panel.

NOTE: Authentication is now required to launch binder! Sign into GitHub from your browser, then click on the launch binder badge below to launch Jupyter Lab.

CyVerse (experimental version)

The new VICE (Visual Interactive Computing Environment) in the CyVerse Discovery Environment enables users to run Jupyter Lab in a production environment. To use VICE, sign up for a free CyVerse account.

The VICE environment supports large-scale analyses. Users can upload and download files, and save and share results of their analyses in their user accounts (up to 100GB of data). The environment is preloaded with a local copy of the entire Protein Data Bank (~148,000 structures).


Follow these steps to run Jupyter Lab on VICE

Documentation

In Depth Tutorial

Installation

Python

We strongly recommend installing Anaconda; Python 3.6 or later is required. To check your Python version:

python --version

mmtfPyspark and dependencies

Because mmtfPyspark relies on parallel computing for high performance, it requires additional dependencies such as Apache Spark. Please follow the installation instructions for your OS carefully:

MacOS and LINUX

Windows

Hadoop Sequence Files

This project reads the PDB archive in the form of MMTF Hadoop Sequence Files. The full and reduced files can be downloaded and unpacked with:

curl -O https://mmtf.rcsb.org/v1.0/hadoopfiles/full.tar
tar -xvf full.tar

curl -O https://mmtf.rcsb.org/v1.0/hadoopfiles/reduced.tar
tar -xvf reduced.tar

On Mac and Linux, the following commands download the Hadoop Sequence Files and set the corresponding environment variables:

curl https://raw.githubusercontent.com/sbl-sdsc/mmtf-pyspark/master/bin/download_mmtf_files.sh -o download_mmtf_files.sh
. ./download_mmtf_files.sh

How to Cite this Work

Bradley AR, Rose AS, Pavelka A, Valasatava Y, Duarte JM, Prlić A, Rose PW (2017) MMTF - an efficient file format for the transmission, visualization, and analysis of macromolecular structures. PLOS Computational Biology 13(6): e1005575. doi: 10.1371/journal.pcbi.1005575

Valasatava Y, Bradley AR, Rose AS, Duarte JM, Prlić A, Rose PW (2017) Towards an efficient compression of 3D coordinates of macromolecular structures. PLOS ONE 12(3): e0174846. doi: 10.1371/journal.pone.0174846

Rose AS, Bradley AR, Valasatava Y, Duarte JM, Prlić A, Rose PW (2018) NGL viewer: web-based molecular graphics for large complexes, Bioinformatics, bty419. doi: 10.1093/bioinformatics/bty419

Rose AS, Bradley AR, Valasatava Y, Duarte JM, Prlić A, Rose PW (2016) Web-based molecular graphics for large complexes. In Proceedings of the 21st International Conference on Web3D Technology (Web3D '16). ACM, New York, NY, USA, 185-186. doi: 10.1145/2945292.2945324

Binder

Project Jupyter, et al. (2018) Binder 2.0 - Reproducible, Interactive, Sharable Environments for Science at Scale. Proceedings of the 17th Python in Science Conference. 2018. doi: 10.25080/Majora-4af1f417-011

CyVerse

Merchant N, Lyons E, Goff S, Vaughn M, Ware D, Micklos D, et al. (2016) The iPlant Collaborative: Cyberinfrastructure for Enabling Data to Discovery for the Life Sciences. PLoS Biol 14(1): e1002342. doi: 10.1371/journal.pbio.1002342

Py3Dmol

Rego N, Koes D (2015) 3Dmol.js: molecular visualization with WebGL, Bioinformatics 31, 1322–1324. doi: 10.1093/bioinformatics/btu829

Funding

The MMTF project (Compressive Structural BioInformatics: High Efficiency 3D Structure Compression) is supported by the National Cancer Institute of the National Institutes of Health under Award Number U01CA198942. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

The CyVerse project is supported by the National Science Foundation under Award Numbers DBI-0735191, DBI-1265383, and DBI-1743442. URL: www.cyverse.org

mmtf-pyspark's People

Contributors

dkoes, marshuang80, pwrose, sbliven, william-tzuhuan-hsu, yuy079

mmtf-pyspark's Issues

Add classes to calculate interaction fingerprints

[!] mmtfReader.download_reduced_mmtf_files: url is not defined

This method does not work:
structures = mmtfReader.download_reduced_mmtf_files(pdbids, sc)

It gives the following error:

File "/Users/peter/anaconda3/lib/python3.6/site-packages/mmtfPyspark/io/mmtfReader.py", line 186, in _get_structure
unpack = default_api.get_raw_data_from_url(pdbId, reduced)
File "/Users/peter/anaconda3/lib/python3.6/site-packages/mmtf/api/default_api.py", line 53, in get_raw_data_from_url
url = get_url(pdb_id,reduced)
File "/Users/peter/anaconda3/lib/python3.6/site-packages/mmtf/api/default_api.py", line 69, in get_url
return BASE_URL_REDUCED + pdb_id
NameError: name 'BASE_URL_REDUCED' is not defined

ProteinFoldDatasetCreator.ipynb

The following line should be changed from:

when((col("alpha") > maxThreshold) & (col("beta") < minThreshold), "alpha+beta")

to:
when((col("alpha") > maxThreshold) & (col("beta") > maxThreshold), "alpha+beta")
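The effect of the corrected condition can be illustrated without Spark. The following is a minimal sketch in plain Python, not the notebook's code: the function name `classify_fold`, the default threshold values, and the "alpha", "beta", and "other" branches are assumptions made for illustration; only the "alpha+beta" condition comes from the issue.

```python
def classify_fold(alpha, beta, min_threshold=0.05, max_threshold=0.15):
    """Label a chain by its fractions of alpha and beta secondary structure.

    The threshold defaults are placeholders, not values taken from the notebook.
    """
    if alpha > max_threshold and beta < min_threshold:
        return "alpha"
    if beta > max_threshold and alpha < min_threshold:
        return "beta"
    # Corrected condition from the issue: both fractions exceed the upper threshold
    if alpha > max_threshold and beta > max_threshold:
        return "alpha+beta"
    return "other"
```

If the "alpha" branch uses the test assumed here, the original "alpha+beta" condition repeated it exactly, so mixed chains could never receive their own label.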

[!] MetalInteractionsAdvanced: error

MetalInteractionsAdvanced:

interaction = self._get_interactions(arrays, queryAtomIndex, box)
File "/srv/conda/lib/python3.6/site-packages/mmtfPyspark/interactions/structureToAtomInteractions.py", line 117, in _get_interactions
n for neighbors in neighborIndices for n in neighbors]
File "/srv/conda/lib/python3.6/site-packages/mmtfPyspark/interactions/structureToAtomInteractions.py", line 117, in
n for neighbors in neighborIndices for n in neighbors]
TypeError: 'int' object is not iterable

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
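
The TypeError above is what Python raises when the nested comprehension meets a bare integer where it expects a list of neighbors. One plausible cause (an assumption, the traceback does not confirm it) is that the neighbor search returned a flat list of indices rather than one list per query point. A defensive flatten, under the hypothetical name `flatten_neighbors`, might look like this:

```python
from numbers import Integral


def flatten_neighbors(neighbor_indices):
    """Flatten neighbor results given as a flat list or as a list of lists."""
    flat = []
    for neighbors in neighbor_indices:
        if isinstance(neighbors, Integral):
            # Assumption: a bare index means the search returned a flat list
            flat.append(neighbors)
        else:
            flat.extend(neighbors)
    return flat
```

This is a sketch of a workaround, not the fix applied in the package.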

[!] structureviewer.view_structure

If the input contains a chain id, change:

viewer.setStyle({'hetflag': True},{'stick':{'singleBond':False}})

to:

viewer.setStyle({'chain': chainid, 'hetflag': True},{'stick':{'singleBond':False}})

pdbid,chainid = pdbIds[i].split('.')
viewer = py3Dmol.view(query='pdb:' + pdbid, options={'doAssembly': bioAssembly})
viewer.setStyle({})
viewer.setStyle({'chain': chainid}, {style: {'color': color}})
viewer.setStyle({'hetflag': True},{'stick':{'singleBond':False}})
viewer.zoomTo({'chain': chainid})

proteinSequenceEncoder.py

Change the following line from:
'A' : [2.34,0.29,6.13,-1.01,10.74,0.36,0.25],

to

'R' : [2.34,0.29,6.13,-1.01,10.74,0.36,0.25],
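
A mislabeled key of this kind is easy to miss because Python silently keeps only the last value when a dict literal repeats a key. Assuming the table already contains another 'A' entry (the issue does not show the surrounding lines), a small check based on the standard `ast` module can flag such repeats. The helper name `duplicate_dict_keys` and the shortened sample table are illustrative only, and the sketch targets Python 3.8 or newer.

```python
import ast


def duplicate_dict_keys(source):
    """Return constant keys that occur more than once in any dict literal."""
    duplicates = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Dict):
            seen = set()
            for key in node.keys:
                # key is None for ** unpacking inside a dict literal
                if isinstance(key, ast.Constant):
                    if key.value in seen:
                        duplicates.append(key.value)
                    seen.add(key.value)
    return duplicates


# Shortened, made-up table: the second 'A' row stands in for the mislabeled 'R' row
table_source = "props = {'A': [0.1, 0.2], 'A': [2.34, 0.29], 'N': [0.3, 0.4]}"
print(duplicate_dict_keys(table_source))  # prints ['A']
```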

Dealing with MMTF path

If we insert the following check, the notebooks should work both locally and in Binder:

import os
if "MMTF_FULL" not in os.environ:
    os.environ["MMTF_FULL"] = path
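
The same check can be wrapped in a small helper so every notebook shares it. This is a minimal sketch: the function name `ensure_mmtf_path` and the fallback directory are hypothetical, while `MMTF_FULL` is the variable name used above.

```python
import os


def ensure_mmtf_path(default_path, variable="MMTF_FULL"):
    """Return the MMTF directory, setting the variable only if it is unset."""
    # setdefault leaves a value that is already set untouched
    return os.environ.setdefault(variable, default_path)


# Hypothetical local fallback; point this at your unpacked Hadoop Sequence Files
path = ensure_mmtf_path("./resources/mmtf_full_sample")
```

Because `os.environ.setdefault` only writes when the variable is missing, an environment that already defines the path keeps its own value.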

demos/mappers/MapToListDemo.ipynb: fix import

demos/mappers/MapToListDemo.ipynb

import for experimentalMethods needs to be fixed:


AttributeError Traceback (most recent call last)
in ()
----> 1 pdb = pdb.filter(experimentalMethods(experimentalMethods.X_RAY_DIFFRACTION))

AttributeError: module 'mmtfPyspark.filters.experimentalMethods' has no attribute 'X_RAY_DIFFRACTION'

[!] mmtfWriter.write_sequence_file: error when writing Hadoop Sequence file

Read a structure

pdb = mmtfReader.download_full_mmtf_files(["4HHB"], sc)

Write a structure

mmtfWriter.write_sequence_file(path, sc, pdb)

There are some error messages regarding char encoding and altLocList. This may have something to do with the lazy initialization. Here is part of the error message:

File "/Users/peter/anaconda3/lib/python3.6/site-packages/mmtf/api/mmtf_writer.py", line 219, in encode_data
output_data["altLocList"] = encode_array(self.alt_loc_list, 6, 0)
File "/Users/peter/anaconda3/lib/python3.6/site-packages/mmtf/codecs/default_codec.py", line 29, in encode_array
return add_header(codec_dict[codec].encode(input_array, param), codec, len(input_array), param)
File "/Users/peter/anaconda3/lib/python3.6/site-packages/mmtf/codecs/codecs.py", line 62, in encode
converters.convert_chars_to_ints(in_array)),4)
File "/Users/peter/anaconda3/lib/python3.6/site-packages/mmtf/converters/converters.py", line 88, in convert_chars_to_ints
return [ord(x) for x in in_chars]
File "/Users/peter/anaconda3/lib/python3.6/site-packages/mmtf/converters/converters.py", line 88, in
return [ord(x) for x in in_chars]
TypeError: ord() expected string of length 1, but int found
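
The final frame shows `ord()` receiving an integer, which suggests that some entries of the list were already code points rather than single characters. A tolerant conversion, sketched here under the hypothetical name `chars_to_ints_tolerant`, passes integers through unchanged. It is an illustration of the failure mode, not the library's fix.

```python
def chars_to_ints_tolerant(in_chars):
    """Convert single characters to code points, passing integers through."""
    # Assumption: integer entries are already code points and need no conversion
    return [x if isinstance(x, int) else ord(x) for x in in_chars]
```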

[!] StructureToPolymerChains: missing atoms

StructureToPolymerChains doesn't copy atoms:

pdb = mmtfReader.download_full_mmtf_files(["4HHB"], sc)
traverseStructureHierarchy.print_structure_data(pdb.first()[1])
*** STRUCTURE DATA ***
Number of models : 1
Number of chains : 14
Number of groups : 801
Number of atoms : 4779
Number of bonds : 4700

pdb = pdb.flatMap(StructureToPolymerChains())
traverseStructureHierarchy.print_structure_data(pdb.first()[1])

*** STRUCTURE DATA ***
Number of models : 1
Number of chains : 1
Number of groups : 141
Number of atoms : 12
Number of bonds : 11

StructureToBioassembly: redundant entity_list

The entity list should contain only unique entities; however, the entities are repeated. Example:

mmtfReader.download_full_mmtf_files(["1STP"], sc).flatMap(StructureToBioassembly()).first()[1].entity_list

Here, identical entities are listed 4 times:

[{'description': 'STREPTAVIDIN COMPLEX WITH BIOTIN',
'type': 'polymer',
'chainIndexList': [0],
'sequence': 'DPSKDSKAQVSAAEAGITGTWYNQLGSTFIVTAGADGALTGTYESAVGNAESRYVLTGRYDSAPATDGSGTALGWTVAWKNNYRNAHSATTWSGQYVGGAEARINTQWLLTSGTTEANAWKSTLVGHDTFTKVKPSAASIDAAKKAGVNNGNPLDAVQQ'},
{'description': 'STREPTAVIDIN COMPLEX WITH BIOTIN',
'type': 'polymer',
'chainIndexList': [1],
'sequence': 'DPSKDSKAQVSAAEAGITGTWYNQLGSTFIVTAGADGALTGTYESAVGNAESRYVLTGRYDSAPATDGSGTALGWTVAWKNNYRNAHSATTWSGQYVGGAEARINTQWLLTSGTTEANAWKSTLVGHDTFTKVKPSAASIDAAKKAGVNNGNPLDAVQQ'},
{'description': 'STREPTAVIDIN COMPLEX WITH BIOTIN',
'type': 'polymer',
'chainIndexList': [2],
'sequence': 'DPSKDSKAQVSAAEAGITGTWYNQLGSTFIVTAGADGALTGTYESAVGNAESRYVLTGRYDSAPATDGSGTALGWTVAWKNNYRNAHSATTWSGQYVGGAEARINTQWLLTSGTTEANAWKSTLVGHDTFTKVKPSAASIDAAKKAGVNNGNPLDAVQQ'},
{'description': 'STREPTAVIDIN COMPLEX WITH BIOTIN',
'type': 'polymer',
'chainIndexList': [3],
'sequence': 'DPSKDSKAQVSAAEAGITGTWYNQLGSTFIVTAGADGALTGTYESAVGNAESRYVLTGRYDSAPATDGSGTALGWTVAWKNNYRNAHSATTWSGQYVGGAEARINTQWLLTSGTTEANAWKSTLVGHDTFTKVKPSAASIDAAKKAGVNNGNPLDAVQQ'},
{'description': 'BIOTIN',
'type': 'non-polymer',
'chainIndexList': [4],
'sequence': ''},
{'description': 'BIOTIN',
'type': 'non-polymer',
'chainIndexList': [5],
'sequence': ''},
{'description': 'BIOTIN',
'type': 'non-polymer',
'chainIndexList': [6],
'sequence': ''},
{'description': 'BIOTIN',
'type': 'non-polymer',
'chainIndexList': [7],
'sequence': ''},
{'description': 'water',
'type': 'water',
'chainIndexList': [8],
'sequence': ''},
{'description': 'water',
'type': 'water',
'chainIndexList': [9],
'sequence': ''},
{'description': 'water',
'type': 'water',
'chainIndexList': [10],
'sequence': ''},
{'description': 'water',
'type': 'water',
'chainIndexList': [11],
'sequence': ''}]
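
One way to obtain a list of unique entities is to merge entries that share the same description, type, and sequence, and to collect their chain indices into a single chainIndexList. The sketch below works on plain dictionaries with the keys shown above; the function name `merge_entities` is hypothetical, and this is not necessarily how the package resolved the issue.

```python
def merge_entities(entity_list):
    """Merge entities that share description, type and sequence."""
    merged = {}
    for entity in entity_list:
        key = (entity["description"], entity["type"], entity["sequence"])
        if key not in merged:
            # Copy the entity so the input list is left unmodified
            merged[key] = {**entity, "chainIndexList": []}
        merged[key]["chainIndexList"].extend(entity["chainIndexList"])
    return list(merged.values())
```

Applied to the output above, this would yield three entities (streptavidin, biotin, water), each listing four chain indices.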

pip install --upgrade mmtf-pyspark/: doesn't work properly

For example, it can't find webfilters, although that module is in the git repo.

ModuleNotFoundError Traceback (most recent call last)
in ()
1 from pyspark import SparkConf, SparkContext
2 from mmtfPyspark.io import mmtfReader
----> 3 from mmtfPyspark.webfilters import AdvancedQuery, ChemicalStructureQuery, Pisces
4 from mmtfPyspark.mappers import StructureToPolymerChains
5 from mmtfPyspark.structureViewer import view_structure

ModuleNotFoundError: No module named 'mmtfPyspark.webfilters'

New spark initialization

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("1-Input").getOrCreate()
spark.stop()

Remove the sc argument from mmtfReader. Inside each read/download method, obtain the context instead:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

simple_structure_viewer: handle pdb.chainId

This method currently only accepts pdbIds. However, in many cases, we are working with chains.

If the passed-in id has the form pdbid.chainId, we should split the string on "." and use the chainId to display only the specified chain. Here is a simple example of showing only a specific chain:

pdbid = "4HHB"
chainid = "A"
viewer = py3Dmol.view(query="pdb:" + pdbid)
viewer.setStyle({})
viewer.setStyle({'chain': chainid}, {'cartoon': {'color': 'spectrum'}})
viewer.zoomTo({'chain': chainid})
viewer.show()
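
The id handling itself needs no viewer. A minimal sketch of the split, using the hypothetical helper name `split_structure_id`, returns `None` for the chain when the id contains no ".":

```python
def split_structure_id(structure_id):
    """Split 'pdbid.chainId' into (pdbid, chainid); chainid is None if absent."""
    pdbid, _, chainid = structure_id.partition(".")
    return pdbid, (chainid or None)


print(split_structure_id("4HHB.A"))  # prints ('4HHB', 'A')
print(split_structure_id("4HHB"))    # prints ('4HHB', None)
```

Unlike unpacking the result of `split('.')` into two names, `partition` does not raise for plain pdbIds, so one code path can serve both forms.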

ColumnarStructure: getEntityTypes()

The way water groups are identified has changed.

Was:
} else if (ccType.equals("HOH")) {

Should be:
} else if (groupNames[start].equals("HOH") || groupNames[start].equals("DOD")) {
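
For reference, the same rule can be written in Python. The helper name `is_water_group` and the constant `WATER_GROUP_NAMES` are hypothetical and are not taken from the package; only the two group names come from the snippet above.

```python
# Group names treated as water: regular water (HOH) and heavy water (DOD)
WATER_GROUP_NAMES = frozenset({"HOH", "DOD"})


def is_water_group(group_name):
    """Return True if the group name denotes a water molecule."""
    return group_name in WATER_GROUP_NAMES
```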

Need new viewer method

We need a method similar to this one:
def view_group_interaction(pdbIds, interacting_atom='None', style='cartoon', color='spectrum'):

but it would take an interacting_group instead of interacting_atom:

    if interacting_group != "None":
        viewer.setStyle({'resn': interacting_group}, {'sphere': {}})

    return viewer.show()

StructureToPolymerSequences: error

The following example:

mmtfReader.download_full_mmtf_files(["4HHB"], sc).flatMap(StructureToPolymerSequences()).keys().collect()

gives this error message:

polymer = structure.entity_list[i]['type'] == 'polymer'
IndexError: list index out of range

Solve PyCharm import issues

PyCharm cannot find the import for mmtfPyspark. We need instructions on how to import this project into PyCharm.
