douweschulte / pdbtbx Goto Github PK

View Code? Open in Web Editor NEW

49.0 3.0 12.0 17.48 MB

A library to open/edit/save (crystallographic) Protein Data Bank (PDB) and mmCIF files in Rust.

Home Page: https://crates.io/crates/pdbtbx

License: MIT License

Rust 97.75% Python 2.25%

crystallography rust rust-crate pdb pdb-parser pdbx-files scientific atoms mmcif

pdbtbx's Introduction

Description

This is a Rust library helping to parse, edit and save crystallographic PDB/mmCIF files. It can read most atomic data from PDB/mmCIF files. Its high level goal is to create a stable, efficient and easy to use interface to PDB/mmCIF files written in pure Rust.

Contributing

As this is a library in active development, feel free to share your thoughts, ideas, hopes, and criticisms. Every comment will be read and discussed so that this library is as useful as possible for all users. Of course we all like a civilised discussion so please follow the community guidelines, but over all please be a civilised human being.

License

MIT

Why

It started as a way to use Rust in a scientific project. But it moved to an open source project because I think that using Rust in scientific computing is be really helpful and a great addition alongside the ubiquitous Python. So by creating it I hope to extend the usability of Rust a little bit more. Since Nature published an article (technology feature) which laid out the benefits of using Rust and showed that Rust is used more and more, I am planning on working more with Rust in scientific projects. And I think that the best way to help Rust move forward (in the scientific community) is by creating more support for scientific projects in Rust.

Contributors

Supported features

As the main goal of this library is to allow access to the atomical data many metadata features of both PDB and mmCIF are unsupported. For both file formats the recent versions (PDB v3.30 and mmcif v5.338) are used, but as both are quite stable file formats the exact version should not matter to end users.

PDB Feature	PDB	mmCIF	Corresponding in mmCIF
HEADER (ID)	✔️	✔️	entry.id
REMARK	✔️	❌	_pdbx_database_remark.id
ATOM	✔️	✔️	atom_site
ANISOU	✔️	✔️	atom_site
SCALE	✔️	✔️	_atom_sites.Cartn_transf
ORIGX	✔️	✔️	_database_PDB_matrix.origx
MATRIX	✔️	✔️	struct_ncs_oper
CRYSTAL	✔️	✔️	cell + symmetry
MODEL	✔️	✔️	atom_site
MASTER	〰️	❌	_pdbx_database_PDB_master
SEQRES	〰️	❌	?
DBREF	✔️	❌	pdbx_dbref
DBREF1/2	✔️	❌	pdbx_dbref
MODRES	✔️	❌	?
SEQADV	✔️	❌	?

Section	Keywords	Support
Heterogen	HET, HETNAM, HETSYN, FORMUL	🔍
Secondary structure	HELIX, SHEET	🔍
Connectivity	SSBOND, LINK, CISPEP, CONNECT	🔍
Title	OBSLTE, TITLE, SPLIT, CAVEAT, COMPND, SOURCE, KEYWDS, EXPDTA, NUMMDL, MDLTYP, AUTHOR, REVDAT, SPRSDE, JRNL, HEADER (other columns)	❌
Misc.	SITE	❌

Symbol	Description
✔️	Supported
〰️	Partially supported
⏲	Support planned (v1.0)
🔍	Support envisioned (>v1.0)
❌	Support not envisioned

The features where support is planned are planned to be included in the 1.0 release. The features where support is envisioned are candidates to be included, but not necessarily in the 1.0 release. The features where support is not planned are thought to be unnecessary for atomic data computations on theses files. If any of these are really needed for your use case, please raise an issue and we can discuss its inclusion. Also if you need a feature that is 'planned' or 'envisioned', feel free to raise an issue to guide development to spots where it can make a real life impact.

Additional crate features

The crate leverages the power of the crate Rayon to provide multithreaded iterators over PDB structures.
The crate leverages the power of the crate rstar to provide very efficient spatial lookup.
The crate has a performant way of selecting atom(s), see Search.
The crate has many nice helper methods for common PDB operations (renumbering, sorting, atomic properties lookup).
The crate has many ways of iterating over the PDB structure to allow for convenient access and control over the performance.

Latest update

v0.11.0

Added support for zipped (.gz) files (Thanks to OWisset)
Does not automatically convert chain names to uppercase anymore (Thanks to OWisset)

Also see changelog.

Support and development

When I am actively using this crate in my own projects this crate is actively worked on and extended. I am more than happy to receive and work on PRs and Issues even if the project seems a bit stale. But if anyone finds this project stale and wants to take over moderation and/or main development feel free to reach out and we can discuss. I would be happy to transfer the project and access to crates.io if that means the project will live on.

Join the Discussion

If you are interested in helping develop this crate and want to share ideas and plans, feel free to join our discord server.

https://discord.gg/wbjRznTVZ7

pdbtbx's People

Contributors

Stargazers

Watchers

Forkers

tianyishi2001 byron amadeusine dockde i-vresse owissett ykgkhakase rvhonorato moomboh truman-xu nbruciaferri y1zhou

pdbtbx's Issues

Human friendly find methods

It would be really nice if a 'human friendly' find method could be built in. This should allow for the selection of all matching hierarchies based on the given information. Every level in the hierarchy should have the option to present no information, the index, or any of the info of that level. An additional variant which gives mutable hierarchies is also useful.

//PDB
pub fn find(&'a self, model: ModelFind, chain: ChainFind, residue: ResidueFind, conformer: ConformerFind, atom: AtomFind) ->
 Vec<AtomConformerResidueChainModel<'a>> {
    //find all matching models and dispatch the find code to those models using their comparable find methods
   //integrate the model info in the resulting hierarchies
}

//eg ModelFind
pub enum ModelFind {
    NoInfo,
    Index(usize),
    SerialNumber(usize),
}

Making each level in the hierarchy "slice-like"

This will allow semantics that looks more idiomatic for some people, such as:

for chain in &mut model { // instead of model.chains_mut()
    for residue in &mut chain { // instead of chain.residues_mut()
        ....
    }
}

and

chain[256] = some_residue

[The answers to this stackoverflow question] seem to help to achieve this.

Handle Models

If I open and save pTLS-6484.pdb (an ensemble crystallographic pdb file) it crashes Chimera, but the original file does not. So clearly there is something different in the output. Also it would be cool to know what has to be enforced in terms of Atom list equality between the models, right now it enforces that every model has the exact same Atoms (except different position etc) which is not true for the mentioned PDB file.

Detect number overflows on save

If a number is too big (takes 5 chars in a place where 4 is the max) it now just overflows and causes trouble with the file format. This should be handled. I suppose generating a warning while at the same time using modulo to set it back to a valid number. Also PDB.renumber() should never introduce numbers that are too big.

Standardize precision of floating points in mmCIF output

PDB itself servers files that have occupancy of 1.00, it should be easy enough to put at least a point and one digit after that;
The above goes even for xyz coordinates, some of them outputted without a dot.
16 digits after the dot in coordinates... Excessive precision is being frowned upon in the community when it comes to reporting in e.g. "Table 1". This implies that all the digits are meaningful. Writing out 16 digits in coordinates makes the precision of 1e-26 meters. Proton size is ~1e-15 meters (from Wiki). Let alone every digit and every whitespace to format them nicely is 1 byte on the disk.

(thanks to Oleg for raising the issue!)

I would propose to have a minimum of 1 decimal and a maximum of 5 decimals.

Support CONNECT

CONNECT provides connectivity information between atoms. This needs to be supported in the parser, IR, and save systems. I guess that having a vec of serial numbers (Vec<usize>) is a right way of saving the information. But thought needs to go into changing the serial number of an Atom and how that should be updated in the whole PDB struct.

Support MMTF

MMTF is a more compact and easier to parse (apperently) file format for molecular structures. See https://mmtf.rcsb.org/ for more details. Also there already is a (WIP) crate to parse these structures: https://crates.io/crates/mmtf. It could be interesting to implement, but it does not have high priority (I have never seen these files 'in the wild').

Change charge representation to isize

Internally charge is represented using [char; 2] which can be changed to isize. The format if pretty easy but somewhat of a pain to parse and save. In the PDB file the charge is saved like this 1+, 2-, (no charge). So to parse this the chars have to be reverted. And to save this needs to happen again.

Also the atom struct needs adequate getters/setters of which the last variant needs to do bounds checking on the new charge [-9..10].

Residue serial numbers > 9999

This probably never comes up when using "regular" PDB files but when solvating a protein (usually with water) it frequently happens that the number of residues exceeds 9999. At first I thought there was a bug in my code but it seems the pdbtbx open_pdb function is to blame. This is the output I get after opening such a PDB file and printing it back to stdout:

ATOM      1 N    HIM     1      20.872  37.523  43.475  0.00  0.00           N
ATOM      2 H1   HIM     1      21.537  38.222  43.773  0.00  0.00           H
ATOM      3 H2   HIM     1      21.347  36.635  43.400  0.00  0.00           H
ATOM      4 CA   HIM     1      20.442  37.855  42.095  0.00  0.00           C
ATOM      5 HA   HIM     1      20.697  37.031  41.428  0.00  0.00           H
ATOM      6 CB   HIM     1      18.932  38.092  42.024  0.00  0.00           C
ATOM      7 HB2  HIM     1      18.594  38.695  42.867  0.00  0.00           H
ATOM      8 HB3  HIM     1      18.678  38.597  41.092  0.00  0.00           H
ATOM      9 CG   HIM     1      18.135  36.826  42.056  0.00  0.00           C
ATOM     10 ND1  HIM     1      18.100  35.999  43.156  0.00  0.00           N
ATOM     11 CE1  HIM     1      17.336  34.954  42.900  0.00  0.00           C
ATOM     12 HE1  HIM     1      17.192  34.166  43.639  0.00  0.00           H
ATOM     13 NEM  HIM     1      16.857  35.084  41.678  0.00  0.00           N
ATOM     14 CD2  HIM     1      17.352  36.239  41.124  0.00  0.00           C
ATOM     15 HD2  HIM     1      17.072  36.495  40.102  0.00  0.00           H
ATOM     16 C    HIM     1      21.229  39.064  41.640  0.00  0.00           C
ATOM     17 O    HIM     1      21.045  40.168  42.148  0.00  0.00           O
ATOM     18 CME  HIM     1      15.954  34.127  41.011  0.00  0.00           C
ATOM     19 HM1  HIM     1      15.474  34.610  40.161  0.00  0.00           H
ATOM     20 HM2  HIM     1      15.192  33.791  41.715  0.00  0.00           H
ATOM     21 HM3  HIM     1      16.527  33.267  40.662  0.00  0.00           H
ATOM  32543 O    HIM     1      11.318  29.317   3.009  0.00  0.00           O
ATOM  32544 H1   HIM     1      11.670  28.720   2.350  0.00  0.00           H
ATOM  32545 H2   HIM     1      10.434  28.992   3.180  0.00  0.00           H
ATOM     22 N    THR     2      22.144  38.833  40.709  0.00  0.00           N
ATOM     23 H    THR     2      22.246  37.909  40.315  0.00  0.00           H
ATOM     24 CA   THR     2      23.072  39.863  40.292  0.00  0.00           C
ATOM     25 HA   THR     2      22.590  40.827  40.129  0.00  0.00           H
ATOM     26 CB   THR     2      24.177  40.038  41.352  0.00  0.00           C
ATOM     27 HB   THR     2      23.704  40.224  42.316  0.00  0.00           H

I think what happened is that all residues with serial number > 9999 (all waters in my case) got wrapped around to serial number 1 again and were then appended to the already existing residue with that serial number (retaining the atom serial number). Would it be possible to fix this?
I know that in cases like this there should be an insertion code or something to distinguish residues but unfortunately I have no say in the creating of these files.

Error messages

I really like fancy (helpful) error messages so I think it would be really helpful to add those to the parser (and rest of the project). In general a sort of inner library (sub-module) needs to be constructed to help with presenting the information in a way that is as clear as possible to users as to programmers of the library itself.

I think the Rust approach is quite good (see below) where first a header is with the concise error (along with the level error/warning), followed by the line (in a parser, or something else for the structs) where the error is visually underlined/shown, followed by some explanation and/or help to help resolve the issue as fast as possible for the users.

error[E0616]: field `unit_cell` of struct `pdb::PDB` is private
   --> src\parser\parser.rs:124:25
    |
124 |                     pdb.unit_cell = Some(UnitCell::new(a, b, c, alpha, beta, gamma));
    |                         ^^^^^^^^^ private field

error: aborting due to previous error

I wrote something like this in an earlier project (https://git.science.uu.nl/d.schulte/research-project-amino-acid-alignment/-/blob/master/source/ParseBatchfiles/ErrorMessage.cs) so this can be blatantly translated into Rust (licenses are all fine).

Create continuous benchmarks

To keep an eye on performance and see when thing unnecessarily digress it would be cool to have continuous benchmarks upon pushing to the server. There is a GitHub action made for this: https://github.com/marketplace/actions/continuous-benchmark. But keeping in mind that lots of behaviour is not yet implemented the run times will get higher over time.

Proposal for benchmarks

open
pdb.apply_transformation, rotation x 90°
pdb.remove_atom_by, every odd numbered atom
save
pdb.atoms, calculate average B factor
validate
pdb.clone

All benchmarks should be run on 1ubq.pdb and pTLS-6484.pdb to give an idea of the impact of the PDB size.

Unit tests

Having unit tests, especially in combination with continuous integration is needed if this project will become more used.

Some tests that can be written:

Integration: test if the PDB struct is equal to itself saved and than read back
Unit test: test invariants in the code (like sizes of names, allowed characters etc)

Serde support

Supporting serde could be a great addition to this library, maybe in the way of a optional thing (like in https://docs.rs/rstar/0.8.2/rstar/struct.RTree.html).

Document every function

To help people using the library in depth documentation for each function is needed along with structs etc. Also documentation to tricky parts of the code internals would be nice.

It would be best if those can be automatically compiled by cargo doc so use doc comments (https://doc.rust-lang.org/rust-by-example/meta/doc.html).

Support negative sequence numbers

Apparently this is a thing...
https://proteopedia.org/wiki/index.php/Unusual_sequence_numbering
Example: 5LCC

b-value out of bounds

I think I have reported this earlier and misinterpreted this somewhat.
I noticed that when validate_pdb() is called on a sample PDB of mine and the resulting vec printed out, a lot of warnings are shown that say:

LooseWarning: Atom b factor out of bounds

Atom 24628 has a b factor which is out of bounds, max is 999.99 min is 0.00.

However, none of my b factors are actually below 0.00 or above 999.99.

Have a good name, and publish!

I think calling it rust-pdb would be pretty lame, so @TianyiShi2001 do you have any ideas? I think about something stressing the crystallographic or protein part as there apparently is a Microsoft data structure also called pdb getting high results when googling for rust pdb. With a good name I think we are ready to publish the first version on crates.io with the remark that it still is pretty experimental and so can (and probably will) break between versions.

Implement fast predicate removal of children

As outlined in merge request #10, a fast way to remove children of a struct (PDB/Model/Chain/Residue) is really nice to have. The merge request implemented this for Chains. It would be nice to extend this to the other structs.

Multithreading support

As some operations over iterators van benefit greatly from multithreading git would be great if the library supported Rayon iterators. Problems could arise in combination with #57 though so that should be looked into.

Minimum b-factor value

Hi again!
Your fix for the TER statements works fine but I realized that another issue cropped up: I use the occupancy and b-value columns to assign atom to regions in QM/MM calculations. As such, most of the values are necessarily zero. I realize that this is contrary to the intended use of the b-value column but this behavior breaks my use case.
I'm not sure what a good solution might be here. Maybe b-values of 0.00 could be allowed (with warnings shown perhaps) with StrictnessLevel::Loose for open_pdb?
I'm currently in the process of refactoring my code to use this crate and it would be a pity if this excluded the possibility.

Remove pointers from children

After some thought I think that the pointers from children to its parent are suboptimal. And completely removing them would solve some issues I foresee in the future (eg .remove() on an element while iterating...). So I guess I should vote in favour of removing them altogether, but at the same time I like the idea of having acces to some information, like the backbone() function. But that could be solved by saving a boolean in the Atom. And I guess that function (and some like that I would like to provide) are much less important then the safety and stability of the library.

So removing all pointers and the functions around it would be my proposal, but @TianyiShi2001 I would like some input if you have some? Do you think those pointers serve a purpose or are they just forced by me as I wanted to copy CCTBX?

Add a join on PDB (or extend?)

Having the option to join/extend two PDBs is quite handy. So a &mut PDB.join(PDB) in the fashion of the other join functions would be cool, otherwise adding an &mut PDB.extend(IntoIterator) would be cool, as long as it is ergonomic to use from the position of two PDB variables in the top scope, and with the minimum amount of cloning structs.

Parsing PDB file without header

Okay, so I have what might be construed as a weird edge case. I am only working with the atoms section of PDB files, the header and all other info is of no interest to me. Also I usually run some kind of preprocessing on them to add missing hydrogens, remove alternative conformations and so on. I just tried to read such a file and save it again, however it was mangled in the process:

Input file:

ATOM      1 N    HIE     1      66.397  49.061  85.017  0.00  1.00           N          
ATOM      2 H1   HIE     1      66.306  48.101  84.696  0.00  1.00           H
ATOM      3 H2   HIE     1      67.181  49.491  84.536  0.00  1.00           H
ATOM      4 CA   HIE     1      66.603  49.087  86.441  2.00  1.00           C
ATOM      5 HA   HIE     1      67.052  50.039  86.723  0.00  1.00           H
ATOM      6 CB   HIE     1      65.332  48.876  87.271  0.00  1.00           C
ATOM      7 HB2  HIE     1      64.794  47.999  86.927  0.00  1.00           H
ATOM      8 HB3  HIE     1      65.620  48.701  88.303  0.00  1.00           H
ATOM      9 CG   HIE     1      64.499  50.108  87.226  0.00  1.00           C
ATOM     10 ND1  HIE     1      64.139  50.737  86.058  0.00  1.00           N
...

Output file:

ATOM      1  N   HIE    1      66.397  49.061  85.017  0.00  1.00           N           
TER    1      HIE    1
ATOM      2  H1  HIE    1      66.306  48.101  84.696  0.00  1.00           H
TER    2      HIE    1
ATOM      3  H2  HIE    1      67.181  49.491  84.536  0.00  1.00           H
TER    3      HIE    1
ATOM      4  CA  HIE    1      66.603  49.087  86.441  2.00  1.00           C
TER    4      HIE    1
ATOM      5  HA  HIE    1      67.052  50.039  86.723  0.00  1.00           H
TER    5      HIE    1
ATOM      6  CB  HIE    1      65.332  48.876  87.271  0.00  1.00           C
TER    6      HIE    1

So apparently a TER statement was inserted after each atom. I assume this isn't intended behavior and it probably is a result of me using unusual PDB files. I just wanted to bring this to your attention because I'm not sure if this is something you'd want to support or not.
In principle I wouldn't mind helping out with a pull request but I'm only just learning Rust and your PDB parsing function looks a bit daunting to me :D
However, if I can help (with some directions maybe) I will.
Best!

Remove set_{parent}_pointer

As &mut Struct is automatically dereferenced to *mut Struct the set_{parent}pointer functions are unnecessary and should be removed. The set{parent} functions should accept *mut Struct instead.

Detect overlapping atoms in validation

It would be nice if the validation detected atoms being way too close to each other.

Contributors preferences

I added a CITATON.cff file to help other people cite the software and recognise all contributors. If there is anything you would like to add/change or if you do not want to be listed please let me know.
@TianyiShi2001 @DocKDE

current authors section:

authors:
  - affiliation: Utrecht University
    family-names: Schulte
    given-names: Douwe
    orcid: https://orcid.org/0000-0003-0594-0993
  - family-names: Shi
    given-names: Tianyi
  - name: DocKDE

PDB file recognition

I recently tried to point the openPDB function at a random text file and it seems no error is thrown when that happens. I realize this is not an intended use of the function at all but I thought it might be worth a thought to add a warning or breaking error for such cases. Something like "No data could be parsed from file, please make sure it is a valid PDB file" maybe.
Would that maybe be reasonable?

Get transformation for space group

In the international tables for crystallography vol A are some tables which provide the transformations for every space group to get the full unit cell construct. A way to retrieve those in the code would be cool. But there are some problems: I cannot find the tables (no online access and no tables ready to be found), issues with copyright? (think about it a little more).

Mutable atoms_with_hierarchy

I realized that the AtomWithHierarchy struct comes with the limitation of containing only immutable references to everything. So I came across the issue that if I want to replace iterations over, e.g., all residues and then all atoms within those residues with just the iteration over AtomWithHierarchys, I cannot mutate those (which is something i frequently need to do).
Something like:

pub struct AtomWithHierarchyMut<'a> {
    /// The Chain containing this Atom
    pub chain: &'a mut Chain,
    /// The Residue containing this Atom
    pub residue: &'a mut Residue,
    /// The Conformer containing this Atom
    pub conformer: &'a mut Conformer,
    /// This Atom
    pub atom: &'a mut Atom,
}

And a method (or two) to instantiate this?

Would it be feasible to create a method like atoms_with_hierarchy_mut to remedy this? I haven't tried implementing this because I was afraid I had overlooked some obvious problem with borrowing this would raise and I wanted to discuss it first.

Add ENDMDL records

Make sure all ENDMDL records are properly placed when saving a pdb.

Conformer copy incorrect occupancy

If an atom is shared between conformers (it's alt_loc is blank) it is now copied to all conformers as is. But it's occupancy should reflect the fact that it is now present multiple times. So the occupancy should be set to one over the amount of conformers.

Reassess the handling of Hetero atoms

I am not sure the current placement (of separate Hetero Chains) is flexible enough to handle all valid ways to use hetero atoms. I propose to take a look at other PDB packages to see how they handle Hetero atoms and look again at the PDB file format to see if another placement would be better.

Atom serial numbers wrap at 9999

Okay, last one for today:
I noticed the atom serial number in the PDB struct wraps around at 9999 even if that's not the case in the input PDB file. Is this standard behavior? Because I seem to recall that 5 columns are reserved for this value so it could contain larger numbers, no?
Is that something that could be changed with a separate method or an argument?
Sorry for pestering you lately.
Have a great weekend!

Update mmCIF correspondences

Update the image and the planning based on the following link.
https://mmcif.wwpdb.org/docs/pdb_to_pdbx_correspondences.html

Document what is supported in open/save

As there is a pdb and mmCIF open and save options, it will now be needed to document what is supported in each of those functions. Also this can be used to show what will have to be supported for version 1.0.

Represent space group differently

Currently space groups are represented as a char followed by a vec of usize. Which proves to be wrong for many types of space groups, so the best would be to save it as a string, while at the same time saving it as the index in the IToC Vol A table (reference_tables::get_index_for_symbol).

Test another remove_child method

There is a method that does exactly what the implementation in remove_child_by does, so it could be that this is faster, in which case it should be used. If it is equal use this method as well, because it is cleaner.

atom() method Index

Hi,
I have a test PDB file. Excerpt shown here:

ATOM     26  N   TYR     3      71.828  47.686  89.455  0.00  0.00           N
ATOM     27  H   TYR     3      71.781  46.672  89.403  0.00  0.00           H
ATOM     28  CA  TYR     3      72.947  48.274  90.203  0.00  0.00           C
ATOM     29  HA  TYR     3      73.190  49.241  89.761  0.00  0.00           H
ATOM     30  CB  TYR     3      72.523  48.510  91.664  0.00  0.00           C
ATOM     31 HB2  TYR     3      73.354  48.961  92.204  0.00  0.00           H
ATOM     32 HB3  TYR     3      71.713  49.240  91.671  0.00  0.00           H
ATOM     33  CG  TYR     3      72.079  47.277  92.434  0.00  0.00           C

When I call pdb.atom(26) I find that, in fact, The H atom of TYR is returned. Is this intentional? I must say I find this counterintuitive. If unintentional this is probably due to 0- vs. 1-based indexing in Rust and PDB files respectively.

Generate data appropriate to strictnesslevel in save

Based on the strictnesslevel different records should be omitted. I propose to only give the following with Strict:

MASTER
ORIGX (unless defined)
SCALE (unless defined)

The following should only be generated if a DatabaseReference is present:

DBREF
SEQADV
SEQRES

The following should only be generated in medium and up:

TER full line (copies the data from the last ATOM)

Add examples

Having examples which show how a library works and how people can use it for their own projects is a very important part of documentation. I would like to invite anyone to add more examples to the documentation, especially if you have used this library for one of your own projects.

A great place to initially add an example would be to the integration tests /tests/ folder in this way it will automatically be run every time the code is updated so it is guaranteed to keep running. As soon as an example is added we can discuss the best place to put it like the readme or docs or something else.

SEQRES issues

With the removal of Hetero chains #46 and a fix in the clone of Chain in 91f0e41 the SEQRES residues give rise to problems in the saving as PDB files.

Spatial lookup

Hi,
I saw that you updated the docs with the version upgrade. If I'm reading the rtree docs correctly (and from my own experience), the locate_within_distance takes a squared distance as input so in your example the atoms within 12.5^0.5 would be located.
You may want to check up on that and adjust tests accordingly.

Accept Hall symbols for symmetries

Sometimes the symmetry in the CRYST line is provided as a Hall symbol instead of Hermann Mauguin. It would be great if the program can detect this and still handle it.

Residue Name and Serial Number in Atom struct

Okay, so I have the following situation:

    pub fn find_contacts(&self, level: u8) {
        let mut vdw_radii = HashMap::new();
        vdw_radii.insert("H".to_string(), 1.54);
        vdw_radii.insert("C".to_string(), 1.90);
        vdw_radii.insert("N".to_string(), 1.79);
        vdw_radii.insert("O".to_string(), 1.71);
        vdw_radii.insert("P".to_string(), 2.23);
        vdw_radii.insert("S".to_string(), 2.14);
        vdw_radii.insert("CL".to_string(), 2.06);
        vdw_radii.insert("NA".to_string(), 2.25);
        vdw_radii.insert("CU".to_string(), 2.17);

        let mut table = Table::new();
        table.set_format(*format::consts::FORMAT_NO_LINESEP_WITH_TITLE);
        table.set_titles(row![
            "Atom ID 1",
            "Atom Name 1",
            "Residue Name 1",
            "Atom ID 2",
            "Atom Name 2",
            "Residue Name 2",
            "Distance"
        ]);

        let mut tree = RTree::new();

        for residue in &self.residues {
            for atom in &residue.atoms {
                tree.insert(PointWithData::new(atom, atom.coords))
            }
        }

        for residue in &self.residues {
            for atom in &residue.atoms {
                let radius: f64 = match level {
                    0 => 1.0,
                    1 => {
                        let rad = vdw_radii
                            .get(&atom.element.to_uppercase())
                            .expect("No Radius found for given element.");
                        rad * rad
                    }
                    _ => panic!("Too high level given"),
                };
                let contacts = tree.locate_within_distance(atom.coords, radius);
                for item in contacts {
                    if item.data.atom_id < atom.atom_id
                        && !(item.data.res_id == atom.res_id && item.data.res_name == atom.res_name)
                        && !(item.data.atom_name == "C"
                            && atom.atom_name == "N"
                            && item.data.res_id + 1 == atom.res_id)
                    {
                        let distance = item.data.calc_distance(atom);
                            table.add_row(row![
                                bByFd =>
                                item.data.atom_id,
                                item.data.atom_name,
                                item.data.res_name,
                                atom.atom_id,
                                atom.atom_name,
                                atom.res_name,
                                format!("{:.2}", distance)
                    }
                }
            }
        }
        if table.len() > 1 {
            if level == 0 {
                println!("\nClash Analysis");
            } else if level == 1 {
                println!("\nContact Analysis");
            }
            table.printstd();
        }
    }

This function builds an R-Tree from Atoms and uses that to conduct searches for other Points that are spatially nearby. The idea is to use such information to find clashes of atoms that are too close by (had a problem with such a case recently). In order to filter out results from atoms that belong to the same residue (and thus are close to each other by construction) I use the information about the respective residue that I stored in the Atom structs.
During the process of refactoring my code with pdbtbx I couldn't think of a straightforward way to do this. In most cases I can get what I need by just iterating over residues and the atoms within them but here I only have the Points in the R-Tree as results which are Atom structs. I would have to do a separate iteration over the PDB struct to determine which residues these belonged to and then go from there.
So for cases such as this I would propose to add a residue_serial_number and residue_name field to the Atom struct (or maybe re-add since I think you had something like this at some point). If the respective fields of the Residue struct are kept, this should not break existing code or cause trouble with ownership, yes? I realize this would necessitate changing the instantiation methods and unit test and I tried implementing this myself but failed making the macro errors that cropped up go away...
What are your thoughts on this?

Add lookup functions for elementary data on Atoms

Looking up the elementary number && radius would be handy functionality. Obviously it should have all elements (up to 118) and return Option<>s.

Get information across hierarchies

Hello!
I recently started working on a project involving reading and editing PDB files for a very specific purpose. Somewhere along the way I became aware of this crate and am considering switching to using it because it's much more feature-complete than what I have written.
However, a question came to mind: Is there a way to access information across different hierarchies in the Structure? E.g. get a list of atoms from a given residue or determine which residue a given atom belongs to? I have not seen this functionality in the docs but I may have missed it.
Great work by the way!

Use thiserror for error handling

I suggest using thiserror for error handling. There's a great article explaining why it's more ergonomic and efficient than manual implementation (which convinced me to use thiserror in my crates): https://nick.groenen.me/posts/rust-error-handling/

rust-bio have also adopted thiserror recently.

Proofread docs

As documentation is very important for reuse of code and especially important for libraries it would be great if people using this library can help make the docs readable and understandable. As the main writer of the library it is hard to write documentation that is sufficiently clear for starters with this library.

Feel free to open a PR (Pull Request) with changes to the docs or point grammar errors etc out here.

Prefer `Vec<u8>/[u8]` over `Vec<char>/[char]`

When creating a Vec<char> from a string, s.chars().collect() is the standard practice. The .chars() method works on any UTF-8 encoded string and it takes time to find the boundary for each character. However, we know that PDB files contain ASCII characters only, therefore it will be faster to transmute a string directly to bytes using as_bytes() and use vectors/arrays/slices of bytes instead of chars throughout the crate. This practice is used, for example in rust-bio and seq-io

Spatial query support

It would be great if the library supported ways to get close atoms to specified points. Proposed is to do this by making an r*tree of all atoms upon request from the user fn create_rtree(self: &PDB). For this Atom should implement Point. Caution should be taken to avoid mutability problems as the rtree will be disentangled from the PDB hierarchy.
https://docs.rs/rstar/0.8.2/rstar/struct.RTree.html

douweschulte / pdbtbx Goto Github PK

pdbtbx's Introduction

Description

Contributing

License

Why

Contributors

Supported features

Additional crate features

Latest update

v0.11.0

Support and development

Join the Discussion

pdbtbx's People

Contributors

Stargazers

Watchers

Forkers

pdbtbx's Issues

Recommend Projects

Recommend Topics

Recommend Org