biopandas / biopandas

Working with molecular structures in pandas DataFrames

Home Page: https://BioPandas.github.io/biopandas/

License: BSD 3-Clause "New" or "Revised" License

Python 98.08% Shell 1.11% TeX 0.80%
bioinformatics computational-biology drug-discovery mol2 molecular-structures molecule molecules pandas-dataframe pdb pdb-files protein-structure

biopandas's Introduction


Working with molecular structures in pandas DataFrames



If you are a computational biologist, chances are that you cursed one too many times about protein structure files. Yes, I am talking about ye Goode Olde Protein Data Bank format, aka "PDB files." Nothing against PDB, it's a neatly structured format (if deployed correctly); yet, it is a bit cumbersome to work with PDB files in "modern" programming languages -- I am pretty sure we all agree on this.

As a machine learning and "data science" person, I fell in love with pandas DataFrames for handling just about everything that can be loaded into memory.
So, why don't we take pandas to the structural biology world? Working with molecular structures of biological macromolecules (from PDB and MOL2 files) in pandas DataFrames is what BioPandas is all about!


Examples


# Initialize a new PandasPdb object
# and fetch the PDB file from rcsb.org
>>> from biopandas.pdb import PandasPdb
>>> ppdb = PandasPdb().fetch_pdb('3eiy')
>>> ppdb.df['ATOM'].head()


# Load structures from your drive and compute the
# Root Mean Square Deviation
>>> from biopandas.pdb import PandasPdb
>>> pl1 = PandasPdb().read_pdb('./docking_pose_1.pdb')
>>> pl2 = PandasPdb().read_pdb('./docking_pose_2.pdb')
>>> r = PandasPdb.rmsd(pl1.df['HETATM'], pl2.df['HETATM'],
...                    s='hydrogen', invert=True)
>>> print('RMSD: %.4f Angstrom' % r)

RMSD: 2.6444 Angstrom





Quick Install

  • install the latest version (from GitHub): pip install git+https://github.com/rasbt/biopandas.git#egg=biopandas
  • install the latest PyPI version: pip install biopandas
  • install biopandas via conda-forge: conda install biopandas -c conda-forge

Requirements

For more information, please see https://BioPandas.github.io/biopandas/installation/.





Cite as

If you use BioPandas as part of your workflow in a scientific publication, please consider citing the BioPandas repository with the following DOI:

  • Sebastian Raschka. Biopandas: Working with molecular structures in pandas dataframes. The Journal of Open Source Software, 2(14), jun 2017. doi: 10.21105/joss.00279. URL http://dx.doi.org/10.21105/joss.00279.
@article{raschkas2017biopandas,
  doi = {10.21105/joss.00279},
  url = {http://dx.doi.org/10.21105/joss.00279},
  year  = {2017},
  month = {jun},
  publisher = {The Open Journal},
  volume = {2},
  number = {14},
  author = {Sebastian Raschka},
  title = {BioPandas: Working with molecular structures in pandas DataFrames},
  journal = {The Journal of Open Source Software}
}


biopandas's Issues

Unnecessary source code in a tutorial

Describe the documentation issue

A little issue with a PDB tutorial on the BioPandas website: unnecessary code lines are present at the beginning of the tutorial:

import pandas as pd
import numpy as np
import sys
import gzip
from warnings import warn
try:
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError
except ImportError:
    from urllib2 import urlopen, HTTPError, URLError  # Python 2.7 compatible
from biopandas.pdb.engines import pdb_records
from biopandas.pdb.engines import pdb_df_columns
from biopandas.pdb.engines import amino3to1dict
import warnings
from distutils.version import LooseVersion



class PandasPdb(object):
    """
    Object for working with Protein Databank structure files.

...

    def parse_sse(self):
        """Parse secondary structure elements"""

ppdb = PandasPdb().fetch_pdb('3eiy')
ppdb.df

Suggest a potential improvement or addition

Delete the PandasPdb class source code from this section.

Thank you!

Get Carbon Method in PandasPDB

Describe the bug

Hiya, just wanted to check in if this is the intended behaviour

    @staticmethod
    def _get_carbon(df, invert):
        """Return c-alpha atom entries from a DataFrame"""
        if invert:
            return df[df["element_symbol"] == "C"]
        else:
            return df[df["element_symbol"] != "C"]

I think the cases should be switched; invert=True grabs the "C" entries, whereas invert=False grabs the non-carbon entries.

If this is a bug I'll submit a PR
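For reference, a minimal sketch of the swapped logic this issue proposes (the standalone function and toy DataFrame below are illustrative, not the actual BioPandas code):

```python
import pandas as pd

def get_carbon(df, invert=False):
    """Return carbon entries (element_symbol == "C"); with invert=True,
    return the non-carbon entries instead -- the swapped behavior proposed here."""
    if invert:
        return df[df["element_symbol"] != "C"]
    return df[df["element_symbol"] == "C"]

atoms = pd.DataFrame({"element_symbol": ["C", "N", "C", "O"]})
carbons = get_carbon(atoms)                    # the two "C" rows
non_carbons = get_carbon(atoms, invert=True)   # the "N" and "O" rows
```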

Sort when Saving PDB

When using to_pdb() to save a PandasPdb() object with modified data frames, the following warning is issued:

FutureWarning: Sorting because non-concatenation axis is not aligned. A future version of pandas will change to not sort by default.

  To accept the future behavior, pass 'sort=False'.

  To retain the current behavior and silence the warning, pass 'sort=True'.

    df = pd.concat(dfs)

Unfortunately, to_pdb() does not accept (or forward) a sort=False keyword argument. I need to write the new data frames unsorted, but this does not seem to be possible at the moment (despite what the warning is suggesting).
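For illustration, the pandas behavior the warning refers to can be reproduced without BioPandas (toy frames; the column names are made up):

```python
import pandas as pd

# Two frames whose columns appear in different orders, as when several
# record-type DataFrames are concatenated for export.
atom = pd.DataFrame({"record_name": ["ATOM"], "line_idx": [0]})
hetatm = pd.DataFrame({"line_idx": [1], "record_name": ["HETATM"]})

# sort=False keeps the column order of the first frame instead of sorting
# the non-concatenation axis alphabetically -- the keyword to_pdb() would
# need to forward for unsorted output.
df = pd.concat([atom, hetatm], sort=False)
```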

Conda environment:

Python: 3.6
BioPandas: 0.2.3
Pandas: 0.23.4
Numpy: 1.15.4
Scipy: 1.2.0

Unable to export PDB DFs with model_id columns

Currently, PDB export breaks when model_id columns are added to the dataframe. This should be an easy fix: copy the dataframe and drop any extraneous (i.e., non-standard PDB) columns prior to generating the output from the dataframe.

I also added a workaround that's needed when exporting (i.e., calling to_pdb() on) PandasPdb objects that have had a model_id column added to them. In the long term, I think it'd be good to have a fix merged into the master branch of BioPandas that types the model_id column as a str -> object column, but this workaround I've proposed should work for now.

Originally posted by @amorehead in a-r-j/graphein#309 (comment)
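A rough sketch of the suggested fix (the column list here is a made-up subset; the authoritative list lives in biopandas.pdb.engines.pdb_df_columns):

```python
import pandas as pd

# Hypothetical subset of the standard PDB coordinate-record columns.
PDB_COLUMNS = ["record_name", "atom_number", "atom_name", "residue_name",
               "chain_id", "residue_number", "x_coord", "y_coord", "z_coord"]

def drop_nonstandard(df, keep=PDB_COLUMNS):
    """Copy the frame and drop extraneous columns (e.g. model_id) before export."""
    return df[[c for c in df.columns if c in keep]].copy()

df = pd.DataFrame({"record_name": ["ATOM"], "model_id": [1], "x_coord": [0.0]})
clean = drop_nonstandard(df)
```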

[fyi] Dask for dataframes bigger than memory

Hey there @rasbt,

You may already know of it but just in case --

As machine learning and "data science" person, I fell in love with pandas DataFrames for handling just about everything that can be loaded into memory.

For things that don't fit into memory, check out Dask and Dask DataFrames.

https://dask.org
http://docs.dask.org/en/latest/dataframe.html
http://docs.dask.org/en/latest/why.html

I've had great experiences with it recently

I don't know anything about your field! I was looking around at some interesting things on Discover and topics :)

Cheers

Latest conda-forge release 0.2.7 instead of 0.2.8?

Describe the bug

Hi!

Thank you for cutting a new release v0.2.8 on GH!
https://github.com/rasbt/biopandas/releases/tag/v0.2.8

As far as I understand, you also cut a new release on conda-forge, right?
conda-forge/biopandas-feedstock#12

However, when installing biopandas via conda-forge I only retrieve version 0.2.7.
Could you please check if you run into the same issue?

Steps/Code to Reproduce

conda create -n biopandas biopandas
conda activate biopandas
conda list biopandas

Expected Results

biopandas 0.2.8 installation.

Actual Results

biopandas 0.2.7 installation.

# Name                    Version                   Build  Channel
biopandas                 0.2.7              pyh9f0ad1d_1    conda-forge

If I ask for 0.2.8 directly

conda install biopandas=0.2.8

I get

PackagesNotFoundError: The following packages are not available from current channels:

  - biopandas=0.2.8

Current channels:

  - https://conda.anaconda.org/conda-forge/linux-64
  - https://conda.anaconda.org/conda-forge/noarch
  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

Versions

>>> import biopandas; print("biopandas", biopandas.__version__)
biopandas 0.2.7
>>> import platform; print(platform.platform())
Linux-4.15.0-154-generic-x86_64-with-glibc2.10
>>> import sys; print("Python", sys.version)
Python 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05) 
[GCC 9.3.0]

# conda install scikit-learn numpy scipy -y
>>> import sklearn; print("Scikit-learn", sklearn.__version__)
Scikit-learn 0.24.2
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.21.2
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.7.1

Column names differ from VMD. Were they chosen intentionally?

I just started using biopandas and it seems nice. Toward promoting discussion related to its design, I have a general question regarding the column names, particularly for df['ATOM']. Note that I work with simulations of molecular structures so I care most about the df['ATOM'] field.

Coming from using VMD for several years, I noticed many differences to access the PDB fields. I have not compared to Chimera or PyMol, but below is a comparison between biopandas, VMD, and the documentation for the PDB file format. I would prefer the VMD options, partially because they are familiar but they are also more succinct. They are probably not as clear as a verbose name, but they are a closer match to the PDB field.

@rasbt, did you have a strong motivation for the selections you made? If anyone else has other ideas or preferences, I would be interested to hear them.

BIOPANDAS        VMD        | COLUMNS        DATA  TYPE    FIELD        DEFINITION
----------------------------|---------------------------------------------------------------------------------------
record_name      atom       |  1 -  6        Record name   "ATOM  "
atom_number      index      |  7 - 11        Integer       serial       Atom  serial number.
atom_name        name       | 13 - 16        Atom          name         Atom name.
alt_loc          altloc     | 17             Character     altLoc       Alternate location indicator.
residue_name     resname    | 18 - 20        Residue name  resName      Residue name.
chain_id         chain      | 22             Character     chainID      Chain identifier.
residue_number   resid      | 23 - 26        Integer       resSeq       Residue sequence number.
insertion        insertion  | 27             AChar         iCode        Code for insertion of residues.
x_coord          x          | 31 - 38        Real(8.3)     x            Orthogonal coordinates for X in Angstroms.
y_coord          y          | 39 - 46        Real(8.3)     y            Orthogonal coordinates for Y in Angstroms.
z_coord          z          | 47 - 54        Real(8.3)     z            Orthogonal coordinates for Z in Angstroms.
occupancy        occupancy  | 55 - 60        Real(6.2)     occupancy    Occupancy.
b_factor         beta       | 61 - 66        Real(6.2)     tempFactor   Temperature  factor.
segment_id       segname    |
element_symbol   element    | 77 - 78        LString(2)    element      Element symbol, right-justified.
charge           charge     | 79 - 80        LString(2)    charge       Charge  on the atom.

Implement CIF and mmCIF format support for large biological assemblies

Discussed in #92

Originally posted by rjboyd00 March 6, 2022

Word on the street is that the PDB is moving towards mmCIF as its primary file format, and it would be great to have support for a pythonic way to interact with these files that plays well with large structures.

As discussed in #92, it would be nice to add a PandasCIF/PandasMmCIF (analogous to PandasPdb) class to support the new file format.

Renumbering residues

Hi! Nice library, has a lot of potential.

Is there a way to renumber residues?
Renumbering atoms seems trivial (just assign a range to the atom_number), however renumbering residues would probably require some heavy duty group_by magic and could be built in.
(renumbering atoms could also be built in:)
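The group-by magic is less heavy-duty than it sounds; a sketch using pd.factorize (toy frame; a robust version should also fold the insertion code into the grouping key):

```python
import pandas as pd

atoms = pd.DataFrame({
    "chain_id":       ["A", "A", "A", "A", "B", "B"],
    "residue_number": [ 12,  12,  14,  14,   3,   3],
})

# Each unique (chain_id, residue_number) pair, in order of appearance,
# becomes a new consecutive residue number starting at 1.
keys = pd.Series(list(zip(atoms["chain_id"], atoms["residue_number"])))
codes, _ = pd.factorize(keys)
atoms["residue_number"] = codes + 1
```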

pypi version completely useless

The pypi tarball has major packaging problems. Simple commands like

from biopandas.pdb import PandasPDB

do not work, as no submodules are included:

$ find /usr/local/lib/python3.4/site-packages/biopandas
/usr/local/lib/python3.4/site-packages/biopandas
/usr/local/lib/python3.4/site-packages/biopandas/__init__.py
/usr/local/lib/python3.4/site-packages/biopandas/__pycache__
/usr/local/lib/python3.4/site-packages/biopandas/__pycache__/__init__.cpython-34.pyc

Biopandas PDB output formatting leads to a ton of segments when reading with MDAnalysis: reason and my quick fix

Hi all,

was a bit baffled opening biopandas PDB output with MDAnalysis. Instead of some dozen segments, I got thousands. Here's why & my hacky fix:

Biopandas outputs the rows in a following way:

ATOM  50786  CB  ASP q  96     219.123 233.404 332.880  1.00 97.39           C
ATOM  50787  N   PRO q  97     222.483 233.701 332.586  1.00 100.66           N

while in MDAnalysis expects this format:

ATOM  51419  O   UNK r 113     214.624 201.542 285.597  1.00 99.63           O
ATOM  51420  CB  UNK r 113     217.297 202.297 286.117  1.00100.32           C

Due to this formatting, when B-factors have five characters (>99.99), MDAnalysis parses the last digit of the B-factor as the segid and uses them as chains; see the code for the parser:
Line 297:

                segids.append(line[66:76].strip())

Lines 304-306:

        # If segids not present, try to use chainids
        if not any(segids):
            segids = chainids

As a quick fix, I commented out the last if statement in MDAnalysis.

Biopandas cannot open ent.gz

Hi!

I have a problem with opening gzipped PDB files. We have a local copy of the PDB, downloaded in the ent.gz format, as they provide it. However, biopandas since a commit on 22nd October throws an error of:
"Wrong file format; allowed file formats are .pdb and .pdb.gz."
Since this is the filename provided by PDB, it would be nice to get it open without renaming/gzipping.

I have two proposed solutions:
Either only check for .gz, or include another elif branch for .ent.gz.
Alternatively, providing a way to parse a PDB file from handle could also work.

Thank you!

ppdb.amino3to1()

Appears to drop from df['ATOM'] based only on 'residue_number' rather than also including 'insertion'.

Mol2 files bonds

PandasMol2().read_mol2() reads and parses a mol2 file, however, in the dataframe, only the @ATOM section is present. Is there any way to access the bonds?

Thanks
Botond

RMSD calculation for whole PDBs

Describe the workflow you want to enable

From what I can understand, PandasPDB.rmsd can calculate the RMSD only if both dataframes have the same length. However, if I want to compare two PDBs (one chain each) where the target protein (UniProt ID) is the same (for example 2vua vs. 2vu9), I can't: although the protein is the same, one has a different purification tag, and thus I cannot use the rmsd function.

Describe your proposed solution

It should be pretty easy to calculate the sequence identity between two structures and select only the shared residues. This can of course be done manually by each user, but I think it would be a great improvement.
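One way to sketch this without touching BioPandas internals is to align the frames on shared residue/atom labels first and compute the RMSD over the intersection only (column names follow the BioPandas ATOM frame; rmsd_common is a hypothetical helper):

```python
import numpy as np
import pandas as pd

def rmsd_common(df1, df2, keys=("residue_number", "atom_name")):
    """Merge two coordinate frames on shared labels, then compute the RMSD
    over the common atoms only, so differing tags/lengths don't matter."""
    m = df1.merge(df2, on=list(keys), suffixes=("_1", "_2"))
    d = (m[["x_coord_1", "y_coord_1", "z_coord_1"]].to_numpy()
         - m[["x_coord_2", "y_coord_2", "z_coord_2"]].to_numpy())
    return float(np.sqrt((d ** 2).sum(axis=1).mean()))

# Toy frames: df2 lacks residue 1 (e.g. a purification tag) and adds residue 3.
df1 = pd.DataFrame({"residue_number": [1, 2], "atom_name": ["CA", "CA"],
                    "x_coord": [0.0, 0.0], "y_coord": 0.0, "z_coord": 0.0})
df2 = pd.DataFrame({"residue_number": [2, 3], "atom_name": ["CA", "CA"],
                    "x_coord": [1.0, 5.0], "y_coord": 0.0, "z_coord": 0.0})
r = rmsd_common(df1, df2)  # only residue 2 is compared
```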

amino3to1 in protein-protein complexes

Some pdb files consist of different protein chains with different amino acid sequence, for example 5mtn. It would be great if amino3to1 took this into account and returned something like dictionary of chain_ids and corresponding series of 1-letter codes.

At the moment, amino3to1 for 5mtn returns
SLEPEPWFFKNLSRKDAERQLLAPGNTHGSFLIRESESTAGSFSLSVRDFDQGEVVKHYKIRNLDNGGFYISPRITFPGLHELVRHYTSVSSST

although the residues in the pdb file are

>5mtn.pdb chain A 
 SLEPEPWFFK NLSRKDAERQ LLAPGNTHGS FLIRESESTA GSFSLSVRDF DQGEVVKHYK
 IRNLDNGGFY ISPRITFPGL HELVRHYT

>5mtn.pdb chain B 
 SVSSVPTKLE VVAATPTSLL ISWDAPAVTV VYYLITYGET GSPWPGGQAF EVPGSKSTAT
 ISGLKPGVDY TITVYAHRSS YGYSENPISI NYRT
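The proposed per-chain behavior can be sketched with a plain groupby (the tiny stand-in dictionary below is illustrative; the full mapping ships as biopandas.pdb.engines.amino3to1dict):

```python
import pandas as pd

# Minimal stand-in for the 3-letter -> 1-letter residue mapping.
aa3to1 = {"SER": "S", "LEU": "L", "VAL": "V"}

def amino3to1_by_chain(df):
    """Return {chain_id: one-letter sequence}, one entry per chain."""
    ca = df[df["atom_name"] == "CA"]
    return {chain: "".join(aa3to1.get(r, "X") for r in grp["residue_name"])
            for chain, grp in ca.groupby("chain_id", sort=False)}

atoms = pd.DataFrame({
    "atom_name":    ["CA", "CA", "CA", "CA"],
    "residue_name": ["SER", "LEU", "VAL", "SER"],
    "chain_id":     ["A", "A", "B", "B"],
})
seqs = amino3to1_by_chain(atoms)
```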

Merging HETATM and ATOM entries into one DataFrame

Following up on the comment by @wojdyr in #52

by the way, having atoms in two separate frames is rather not a good idea.

At first glance it may look like the protein chains are all ATOM, but wwPDB uses a different criterion:

only natural amino acids (and nucleic acids) are marked as ATOM, and the modified ones are marked as HETATM.
So MET is ATOM but MSE is HETATM.

If you keep them both separately such an example:

  ppdb.df['ATOM']['b_factor'].plot(kind='hist')

won't work as expected - it may skip some residues

That's a good point and I haven't thought of that! The reason why I kept these separate is that I am mostly working on cases where HETATMs refer to non-protein residues. The HETATM--MSE issue should definitely be addressed somehow and I would have to think about it more ... Suggestions would be welcome.
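Until a built-in solution exists, one user-side sketch is to concatenate both frames and restore file order via line_idx before computing statistics (toy b-factor values):

```python
import pandas as pd

atom = pd.DataFrame({"b_factor": [10.0, 12.0], "line_idx": [0, 2]})
hetatm = pd.DataFrame({"b_factor": [30.0], "line_idx": [1]})  # e.g. an MSE residue

# Concatenate both record types and sort by line_idx so histograms and
# other statistics also cover modified residues such as MSE.
both = (pd.concat([atom, hetatm], ignore_index=True)
          .sort_values("line_idx")
          .reset_index(drop=True))
```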

Stream support for exporting pdbs

Describe the workflow you want to enable

I'd like to be able to export a pdb to a stream instead of to disk. In particular the reason why I'd like to do so is so that I can pass the stream directly to wandb.Molecule

Describe your proposed solution

The PandasPdb.to_pdb method could accept a path_or_stream: typing.Union[io.StringIO, str] instead of just a path: str argument. Internally, if path_or_stream happens to be a io.StringIO object, we don't need an openf function and instread can just execute the internal loops seen here, where f is now the io.StringIO object.

Making this change would enable inplace filling the stream with the pdb text.
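The dual path-or-stream behavior can be sketched as follows (to_pdb_stream and the pre-formatted lines are illustrative, not the actual to_pdb internals):

```python
import io

def to_pdb_stream(lines, path_or_stream):
    """Write pre-formatted PDB lines either to a text stream or to a path."""
    text = "\n".join(lines) + "\n"
    if isinstance(path_or_stream, io.StringIO):
        path_or_stream.write(text)  # fill the stream in place, no disk I/O
    else:
        with open(path_or_stream, "w") as f:
            f.write(text)

buf = io.StringIO()
to_pdb_stream(["ATOM      1  N   ASN A   1", "END"], buf)
```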

Describe alternatives you've considered, if relevant

Currently I am needlessly writing to disk temporarily, reopening the file, and passing its contents to the wandb.Molecule object.

Additional context

RMSD calculation of two RNA residues returns nan

The following code snippet returns a nan, although the two residues have the same size and order of atom names:

from biopandas.pdb import PandasPdb

ehz = PandasPdb().fetch_pdb('1ehz')
at = ehz.df['ATOM']

a64 = at[at['residue_number'] == 64]
a66 = at[at['residue_number'] == 66]

r = PandasPdb.rmsd(a64, a66)
print(r)
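A plausible explanation (an assumption, not confirmed in this thread) is pandas index alignment: the two residue slices keep their original row indices, so element-wise subtraction aligns on disjoint indices and yields NaN. Dropping to NumPy arrays sidesteps this (toy coordinates):

```python
import numpy as np
import pandas as pd

at = pd.DataFrame({"residue_number": [64, 64, 66, 66],
                   "x_coord": [0.0, 1.0, 3.0, 5.0],
                   "y_coord": 0.0, "z_coord": 0.0})
cols = ["x_coord", "y_coord", "z_coord"]
a64 = at[at["residue_number"] == 64]   # keeps row indices 0, 1
a66 = at[at["residue_number"] == 66]   # keeps row indices 2, 3

naive = a64[cols] - a66[cols]          # index-aligned subtraction: all NaN

d = a64[cols].to_numpy() - a66[cols].to_numpy()   # positional subtraction
r = float(np.sqrt((d ** 2).sum(axis=1).mean()))
```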

Add/Parse Information about Secondary Structure Elements

Add a new dataframe object as PandasPdb.df['SSE'], which contains secondary structure element information. This pandas DataFrame would have the same number of rows as the coordinate section DataFrame ('ATOM') and columns "helix" and "sheet" with type "bool."

For ease of use, the SSE dataframe could share the dataframe indices with the PandasPdb.df['ATOM'] section. It may only get tricky if one ('ATOM' or 'SSE') gets updated & reindex and not the other.

Thus, alternatively, SSE information could be added directly to the 'ATOM' DataFrame, for instance as bool columns 'helix' and 'sheet'.

In any case, I would suggest to make this feature optional; for instance, by calling a function "parse_sse" that uses the information provided in the .pdb_text.

Handling multi-PDB files

I am cross-posting a discussion from the mailing list with regard to multi-PDB files containing MODEL & ENDMDL tags, which are currently not handled by BioPandas.

However, it should definitely be handled one way or the other. Currently, I don't have a best idea of how to handle it and would welcome any thoughts and feedback (let me cross-post this on the GitHub issue tracker -- maybe better to continue the discussion about potential ways to implement it there).

I think one of the problems with the DataFrame format is that having them all in one DataFrame would probably result in a lot of weird -- or unexpected -- results, thus it would probably best to separate the structures one way or the other ...

  1. One option would be to provide a utility function (analogous to the split_multimol2 function, http://rasbt.github.io/biopandas/tutorials/Working_with_MOL2_Structures_in_DataFrames/#parsing-multi-mol2-files) that generates multiple PandasPdb objects from such a file. I.e., it would simply be a list

    pdbs = [pdb_1, pdb_2, .... pdb_n]

which would preserve the current functionality of the library without any e.g., backwards-incompatible changes. This would then also help with using the multiprocessing library more easily and efficiently for the analysis of multiple PandasPdb objects in parallel.

  2. Right now, the PandasPdb objects have a dictionary containing multiple DataFrames
    dict_keys(['ATOM', 'HETATM', 'ANISOU', 'OTHERS'])

For multi-PDB files, the dictionary could be expanded to

dict_keys(['ATOM_1', 'HETATM_1', 'ANISOU_1', 'OTHERS_1', 'ATOM_2', 'HETATM_2', 'ANISOU_2', 'OTHERS_2', ...])

I strongly favor scenario 1) though; however, I would love to hear feedback on this and am open to other suggestions!

In any case, an error (or at least a warning) should also be raised if MODEL & ENDMDL tags are found in a PDB file when the current read_pdb method is used, so that this doesn't lead to any unexpected behavior.
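Scenario 1) can be sketched as a text-level splitter, analogous to split_multimol2 (split_multimodel is a hypothetical name; each yielded block could then feed a future from-lines constructor):

```python
def split_multimodel(pdb_text):
    """Yield one list of record lines per MODEL/ENDMDL block."""
    model, in_model = [], False
    for line in pdb_text.splitlines():
        if line.startswith("MODEL"):
            model, in_model = [], True
        elif line.startswith("ENDMDL"):
            in_model = False
            yield model
        elif in_model:
            model.append(line)

text = "MODEL     1\nATOM      1\nENDMDL\nMODEL     2\nATOM      2\nENDMDL\n"
models = list(split_multimodel(text))
```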

Missing files in the PyPi tarball

Hi,

The setup.py has the lines:

      package_data={'': ['LICENSE.txt',
                         'README.md',
                         'requirements.txt']
                    },

But the license and requirements files are not present in the 0.2.5 tarball, thus it is not possible to install the package from source.

Column 'line_idx' gets 'object' dtype for empty frames

If a PDB file has no records of a certain type (for instance, no HETATM or no ATOM), then the (empty) dataframe is created with the 'line_idx' column as dtype 'object' (the pandas default).
I've noticed that there's no 'line_idx' record in the 'pdb_atomdict' (engines.py).
The suggested fix would be either to add it to that dict, removing the need for this 'hack' (starred) in 'pandas_pdb.py', line 363:

            df = pd.DataFrame(r[1], columns=[c['id'] for c in
                                             pdb_records[r[0]]] ** + ['line_idx']** )

Unfortunately, I have no idea if this will have a cascading effect, as I'm certain this was done on purpose.

Another quick and dirty workaround would be to add the (starred) line:

            for c in pdb_records[r[0]]:
                try:
                    df[c['id']] = df[c['id']].astype(c['type'])
                except ValueError:
                # expect ValueError if float/int columns are empty strings
                    df[c['id']] = pd.Series(np.nan, index=df.index)
            **df['line_idx'] = df['line_idx'].astype(int)**

after the dtype-correction loop that follows the code above (line 363).

This is an incredibly minor issue, but has caused some unexpected glitches for me when fetching the columns with type 'object' and then converting them to string in both ATOM and HETATM frames, as one frame would have the wrong datatype and conversion would crash.

Error handling when reading wrong file formats

Describe the workflow you want to enable

Thanks again for your work on biopandas!

I have a small comment on the error handling when loading pdb files with read_mol2 (or mol2 files with read_pdb).

The current behavior looks like this:

mol2 module
from biopandas.mol2 import PandasMol2
pmol = PandasMol2()
pmol.read_mol2("xxxx.pdb")

Example output: UnboundLocalError: local variable 'first_idx' referenced before assignment (might look different depending on the input file and file format).

pdb module
from biopandas.pdb import PandasPdb
ppdb = PandasPdb()
ppdb.read_pdb("xxxx.mol2")

Example output: All data is loaded into the dict key "OTHER" (might look different depending on the input file and file format).

Describe your proposed solution

Would you consider adding a check for the correct input and throwing a descriptive error message?

I am using a ValueError at the moment but I am sure there are nicer ways to handle this:
https://github.com/volkamerlab/opencadd/blob/912d4e98e89edf38707249fd4f034cea136e1932/opencadd/io/dataframe.py#L202

This issue is not urgent at all.
It simply would make it easier / less verbose to use biopandas in other packages where we try to catch common user mistakes.
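The check could be as small as an extension guard (check_extension is a hypothetical helper; a real version might additionally sniff the first record of the file):

```python
def check_extension(path, allowed=(".pdb", ".pdb.gz")):
    """Raise a descriptive ValueError early instead of failing later with
    an unrelated error such as UnboundLocalError."""
    if not path.endswith(allowed):
        raise ValueError(
            f"Wrong file format for {path!r}; allowed formats: {allowed}")

check_extension("3eiy.pdb")        # passes silently
try:
    check_extension("xxxx.mol2")
except ValueError as err:
    msg = str(err)
```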

Thank you again for your time and work!

Describe alternatives you've considered, if relevant

None.

Additional context

None.

Let PandasPDB's and PandasMOL2's distance method accept an additional dataframe for comparison

The current signature of PandasMOL2's distance methods is

distance(self, xyz=(0.00, 0.00, 0.00))

where the pair-wise distance of all atoms in PandasMOL2.df are compared to the xyz distance. This can be quite wasteful in certain instances; for example, if we are only interested in the distance to certain atoms.

The signature should be changed to

distance(self, df=None, xyz=(0.00, 0.00, 0.00))

where the behavior remains unchanged if df=None, but if an argument (DataFrame) for df is provided, it will be used instead of PandasMOL2.df
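A sketch of the proposed signature (written as a free function over an explicit frame for clarity; column names follow the BioPandas coordinate frames):

```python
import numpy as np
import pandas as pd

def distance(own_df, df=None, xyz=(0.0, 0.0, 0.0)):
    """Distances to xyz for df if given, else for the object's own frame
    (passed explicitly here as own_df for the sketch)."""
    target = own_df if df is None else df
    coords = target[["x_coord", "y_coord", "z_coord"]].to_numpy()
    return pd.Series(np.sqrt(((coords - np.asarray(xyz)) ** 2).sum(axis=1)),
                     index=target.index)

frame = pd.DataFrame({"x_coord": [3.0, 0.0], "y_coord": [4.0, 0.0],
                      "z_coord": [0.0, 0.0]})
d_all = distance(frame)                       # distances to the origin
d_sub = distance(frame, df=frame.iloc[:1])    # only the first atom
```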

Using SIFTS data for renumbering residues to match the Uniprot sequence resids

Hi all,

stumbled upon this paper describing the mapping of PDB residue id's to the ones in the sequence deposited in Uniprot:

  • Choudhary, P.; Anyango, S.; Berrisford, J.; Varadi, M.; Tolchard, J.; Velankar, S. Unified Access to up-to-Date Residue-Level Annotations from UniProt and Other Biological Databases for PDB Data via PDBx/mmCIF Files. bioRxiv, 2022, 2022.08.10.503473. https://doi.org/10.1101/2022.08.10.503473.

Frustrated by the inconsistencies in numbering, I'm writing some code to output PDBs with these UniProt-sequence-matching ids, using biopandas for the crunching.

The mmCIF's with the mapped residues can be downloaded from the url:

https://www.ebi.ac.uk/pdbe/entry-files/download/{pdb_id}_updated.cif

The CIF file is nicely read with the mmCIF parser. The resid matching the one in UniProt is in the column pdbx_sifts_xref_db_num, which gives None for entries without a mapping to the sequence, e.g. ligands and UNKs.

This paper/python code/webserver describes a similar thing using the SIFTS:

  • Faezov, B.; Dunbrack, R. L., Jr. PDBrenum: A Webserver and Program Providing Protein Data Bank Files Renumbered according to Their UniProt Sequences. PLoS One 2021, 16 (7), e0253411. https://doi.org/10.1371/journal.pone.0253411.

For the residues without a mapping, the residues are renumbered using an offset of 5k/50k so that there's no overlap with the new resids of amino acids.

However, occasionally part of the chain consists of UNKs, so I will implement a way to use continuous numbering with respect to the UniProt-mapped resids for these.

Work in progress - if there's an already existing way to do this, let me know :)

PDB link on rcsb has changed

The method _fetch_pdb uses the template http://www.rcsb.org/pdb/files/%s.pdb but RCSB seems to use https://files.rcsb.org/download/%s.pdb now.

What's up next?

Hey,

Love pandas, would like to contribute.

what's on the todo?

Please ship tests to pypi

Downstream distro maintainers love to test during packaging. Please included the test suite in the pypi tarball so that we can test during packaging.

Store path of file that was read in

Often when maniplating a PDB, I want to create an output PDB using a name that matches the input path with a one/two word description appended to the name. To facilitate this, I wonder your thoughts about storing the path string for an input file as metadata on the PandasPdb class. Looking at the code called when reading in a PDB, it does not seem this is already done (though I may have missed it).

https://github.com/rasbt/biopandas/blob/master/biopandas/pdb/pandas_pdb.py#L58-L74

This could potentially be added to the __str__ method so that printing the object returns more helpful info than <biopandas.pdb.pandas_pdb.PandasPdb object at 0x7fe1f5f2a390>.
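With the input path stored, deriving the output name becomes a one-liner; a sketch with pathlib (derive_output_path is a hypothetical helper):

```python
from pathlib import Path

def derive_output_path(input_path, tag):
    """Append a short description to the stem of the input file name."""
    p = Path(input_path)
    return str(p.with_name(f"{p.stem}_{tag}{p.suffix}"))

out = derive_output_path("./docking_pose_1.pdb", "aligned")
```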

Read pdb from list of strings (instead of from file)

Describe the workflow you want to enable

Thank you for providing this great package - I am using it in most of my projects!

When a database is queried for structural data in e.g. the pdb format, the file content is often returned in the form of a string.
I would like to load DataFrames from such a string (or list of strings):
https://github.com/volkamerlab/opencadd/blob/912d4e98e89edf38707249fd4f034cea136e1932/opencadd/io/dataframe.py#L128

Currently, I use the private PandasPdb method _construct_df, which - I know - is bad practice.

Describe your proposed solution

In the mol2 module, we can load DataFrames from a file or from a list of strings.

pmol = PandasMol2()
pmol.read_mol2()
pmol.read_mol2_from_list()

Would it be possible to provide the same behavior in the pdb module?

ppdb = PandasPdb()
ppdb.read_pdb()
ppdb.read_pdb_from_list()  # New feature?
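The first step such a method would perform, grouping raw lines by record type, can be sketched without the private _construct_df (records_from_lines is illustrative and far simpler than the real parser):

```python
def records_from_lines(lines):
    """Group raw PDB lines by record type, the first step of a hypothetical
    read_pdb_from_list."""
    groups = {"ATOM": [], "HETATM": [], "OTHERS": []}
    for line in lines:
        key = line[:6].strip()
        groups[key if key in groups else "OTHERS"].append(line)
    return groups

pdb_lines = ["ATOM      1  N   ASN A   1", "HETATM    2  O   HOH A 101", "END"]
groups = records_from_lines(pdb_lines)
```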

Thank you for your time!

Describe alternatives you've considered, if relevant

None.

Additional context

None.

0.5.0dev Release

Hey @rasbt, can we get a dev release? I think we're good to go with the recent PRs.

FWIW, it might be worth setting up a GitHub Actions workflow to push to PyPI, triggered by making a new release.

PandasPdb.to_pdb() will add one more blank for coordinate larger than 1000

input.pdb file looks like

PFRMAT TS
TARGET T1039
MODEL 3
PARENT N/A
ATOM      1  N   ASN     1    1176.362 706.517 524.380  1.00 60.95           N
ATOM      2  CA  ASN     1    1176.148 706.575 522.918  1.00 60.95           C
ATOM      3  CB  ASN     1    1174.870 707.376 522.602  1.00 60.95           C
ATOM      4  CG  ASN     1    1174.505 707.144 521.143  1.00 60.95           C
ATOM      5  OD1 ASN     1    1174.080 706.054 520.762  1.00 60.95           O
ATOM      6  ND2 ASN     1    1174.672 708.199 520.301  1.00 60.95           N
ATOM      7  C   ASN     1    1177.308 707.249 522.267  1.00 60.95           C
ATOM      8  O   ASN     1    1178.430 706.745 522.293  1.00 60.95           O
ATOM      9  N   ASN     2    1177.054 708.428 521.672  1.00 63.15           N
ATOM     10  CA  ASN     2    1178.084 709.133 520.973  1.00 63.15           C
ATOM     11  CB  ASN     2    1177.574 710.406 520.271  1.00 63.15           C
ATOM     12  CG  ASN     2    1178.616 710.795 519.237  1.00 63.15           C
ATOM     13  OD1 ASN     2    1179.777 710.402 519.335  1.00 63.15           O
ATOM     14  ND2 ASN     2    1178.192 711.579 518.208  1.00 63.15           N
ATOM     15  C   ASN     2    1179.154 709.511 521.945  1.00 63.15           C
ATOM     16  O   ASN     2    1180.339 709.379 521.645  1.00 63.15           O
TER
END

from biopandas.pdb import PandasPdb
target_biopdb = PandasPdb().read_pdb('input.pdb')
target_biopdb.to_pdb('output.pdb')

output.pdb file looks like

PFRMAT TS
TARGET T1039
MODEL 3
PARENT N/A
ATOM      1  N   ASN     1     1176.362 706.517 524.380  1.00 60.95           N
ATOM      2  CA  ASN     1     1176.148 706.575 522.918  1.00 60.95           C
ATOM      3  CB  ASN     1     1174.870 707.376 522.602  1.00 60.95           C
ATOM      4  CG  ASN     1     1174.505 707.144 521.143  1.00 60.95           C
ATOM      5  OD1 ASN     1     1174.080 706.054 520.762  1.00 60.95           O
ATOM      6  ND2 ASN     1     1174.672 708.199 520.301  1.00 60.95           N
ATOM      7  C   ASN     1     1177.308 707.249 522.267  1.00 60.95           C
ATOM      8  O   ASN     1     1178.430 706.745 522.293  1.00 60.95           O
ATOM      9  N   ASN     2     1177.054 708.428 521.672  1.00 63.15           N
ATOM     10  CA  ASN     2     1178.084 709.133 520.973  1.00 63.15           C
ATOM     11  CB  ASN     2     1177.574 710.406 520.271  1.00 63.15           C
ATOM     12  CG  ASN     2     1178.616 710.795 519.237  1.00 63.15           C
ATOM     13  OD1 ASN     2     1179.777 710.402 519.335  1.00 63.15           O
ATOM     14  ND2 ASN     2     1178.192 711.579 518.208  1.00 63.15           N
ATOM     15  C   ASN     2     1179.154 709.511 521.945  1.00 63.15           C
ATOM     16  O   ASN     2     1180.339 709.379 521.645  1.00 63.15           O
TER
END

One extra blank character is added to each ATOM line, which makes the output unreadable for some applications.

Add function for structural alignments via least-squares fit

A staticmethod for least-squares superposition. We could call it align and implement it with parameters similar to the rmsd function. In addition, it would be nice to have a substructure parameter for substructure alignment -- maybe accepting an iterable of residue numbers here.
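The standard recipe for this is the Kabsch algorithm: center both coordinate sets, compute the optimal rotation from an SVD of their covariance, and correct for reflections. A sketch on plain NumPy arrays (an align method would extract these arrays from the x/y/z columns):

```python
import numpy as np

def kabsch_align(P, Q):
    """Least-squares superposition of point set P onto Q (rows are atoms)."""
    Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)   # center both sets
    U, S, Vt = np.linalg.svd(Pc.T @ Qc)               # covariance SVD
    d = np.sign(np.linalg.det(U @ Vt))                # reflection guard
    R = U @ np.diag([1.0, 1.0, d]) @ Vt               # optimal rotation
    return Pc @ R + Q.mean(axis=0)

# Rotating a tetrahedron by 90 degrees and aligning it back recovers Q.
Q = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
theta = np.pi / 2
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
P = Q @ Rz.T
aligned = kabsch_align(P, Q)
```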
