biojulia / biostructures.jl

A Julia package to read, write and manipulate macromolecular structures (particularly proteins)

License: Other

Julia 100.00%

bioinformatics structural-biology julia pdb protein-structure biology structural-bioinformatics biojulia

biostructures.jl's Introduction

BioStructures.jl

Description

BioStructures provides functionality to read, write and manipulate macromolecular structures, in particular proteins. Protein Data Bank (PDB), mmCIF and MMTF format files can be read into a hierarchical data structure. Spatial calculations and functions to access the PDB are also provided. It compares favourably in terms of performance to other PDB parsers - see some benchmarks online.

Installation

Install BioStructures from the Julia package REPL, which can be accessed by pressing ] from the Julia REPL:

add BioStructures

See the documentation for information on how to use BioStructures.

Citation

If you use BioStructures, please cite the paper:

Greener JG, Selvaraj J and Ward BJ. BioStructures.jl: read, write and manipulate macromolecular structures in Julia, Bioinformatics 36(14):4206-4207 (2020)

Contributing and questions

We appreciate contributions from users including reporting bugs, fixing issues, improving performance and adding new features.

If you have a question about contributing or using this package, you are encouraged to use the #biology channel of the Julia Slack or the Bio category of the Julia discourse site.

biostructures.jl's Issues

BioStructures fails to parse certain PDB files from SCOPe/ASTRAL archive

Expected Behavior

BioStructures.jl should be able to parse all files from SCOPe/ASTRAL: http://scop.berkeley.edu/downloads/pdbstyle/pdbstyle-sel-gs-bib-95-2.07.tgz

Current Behavior

BioStructures.jl fails on 2962 files, for example d9pcya_. Full list of files attached:
biostructures_test_scop95.txt

Steps to Reproduce (for bugs)

Download the PDB files from http://scop.berkeley.edu/downloads/pdbstyle/pdbstyle-sel-gs-bib-95-2.07.tgz and extract the archive.

using BioStructures
read("d9pcya_.ent", PDB)
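
A list of failing files like the one attached could be gathered with a loop that records which paths raise an error. The following is a minimal sketch in plain Julia, not code from the package; `collect_failures` and its `parsefile` argument are hypothetical names introduced here.

```julia
# Try to parse every path and record the ones that throw, with the error message.
# `parsefile` is a placeholder for whatever parsing call is being tested.
function collect_failures(parsefile, paths)
    failures = Pair{String,String}[]
    for path in paths
        try
            parsefile(path)
        catch err
            # Store the path together with the rendered error text.
            push!(failures, path => sprint(showerror, err))
        end
    end
    return failures
end
```

Assuming the call shown in this report, the parser argument could be `path -> read(path, PDB)`, and the paths could come from `readdir` on the extracted archive.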

Context

Your Environment

Package Version used: 0.4.0
Julia Version used: 1.0.3
Operating System and version (desktop or mobile): Linux 3.10.0-862.14.4.el7.x86_64
Link to your project:

Error using BioStructures, no "libz1" module found

Expected Behavior

Current Behavior

The module "libz1" cannot be found.

Possible Solution / Implementation

Steps to Reproduce (for bugs)

1. add BioStructures
2. using BioStructures

Context

I am installing a fresh JuliaPro with Julia 1.3 and am now adding packages.

Your Environment

Package Version used:
Julia Version used:
Operating System and version (desktop or mobile):
Link to your project:

convert pdb to mol

Not an issue, the package looks great. I am looking to feed a ligand into the MolecularGraph.jl package and was wondering if you had a suggestion for how to convert a ligand.pdb file into mol/sdf format in Julia?

Secondary Structure Information

Hello. In my project, I am using secondary structures. Without going into detail, I need to sample CA atoms from different secondary structures for my algorithm. This can easily be done by parsing strings, I know, but I think it might be a nice feature to be able to obtain secondary structure information like any other feature in a PDB file.

Expected Behavior

One solution might be to create a function such as helix(struc::ProteinStructure) or sheet(struc::ProteinStructure) and obtain the lines that contain the starting and ending residues of the secondary structure. Then we can get the list of atoms/residues with a combination like this:

collectatoms(struc::ProteinStructure, calphaselector)[helix(struc::ProteinStructure)[1]] # to obtain the atoms belonging the first Alpha Helix

Current Behavior

I could not find anything related to this suggestion in the documentation. If there is something, I am genuinely sorry.

Possible Solution / Implementation

Though I have some experience with the source code in the spatial.jl file, I have no experience of parsing PDB files. My suggestion would be to write a string parser as a function, but I am not sure how we could connect it to a ProteinStructure.

Context

My project is related to secondary structures. I think it would be nice for those who need it to be able to obtain this information.

export fixlists!

I am doing structure modeling from images, so I need to manipulate the atoms and residues. Is it possible to export the fixlists! function so that I can update the atom_list after adding some atoms? Or are there other ways I can build a new protein structure using the current API?

Strip whitespaces from atom/element names

When looking at an atom record I discovered that whitespace is not stripped from the atom/element names when they are parsed.

An example: Dict{String, AbstractAtom}(" CA " => Atom CA with serial ,.....

As you can see, CA is surrounded by spaces, which is kind of inconvenient. I suggest changing the function parseatomname in pdb.jl to the following:

function parseatomname(line::String, line_n::Integer=1)
    try
        return strip(line[13:16])
    catch
        throw(PDBParseError("could not read atom name", line_n, line))
    end
end

And the same for parseelement.
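
The effect of the proposed change can be seen on a single fixed-width record. This is a small sketch in plain Julia that does not use BioStructures; the record is assembled by hand for illustration, and the dictionary value is a placeholder rather than a real atom object.

```julia
# Build the start of a fixed-width PDB ATOM record piece by piece so the
# column positions are explicit.
line = string(
    "ATOM  ",          # columns 1-6: record name
    lpad("2", 5),      # columns 7-11: atom serial number
    " ",               # column 12: blank
    " CA ",            # columns 13-16: atom name, padded as in the file
    " ",               # column 17: alternate location indicator
    "ALA",             # columns 18-20: residue name
    " A",              # columns 21-22: blank, then chain ID
    lpad("1", 4),      # columns 23-26: residue number
)

raw_name = line[13:16]          # keeps the padding
clean_name = strip(raw_name)    # removes the padding

# Keying a dictionary on the stripped name allows lookup by "CA".
atoms = Dict(String(clean_name) => "placeholder atom")
haskey(atoms, "CA")
```

Note that `strip` returns a `SubString`, so converting with `String` may be needed where a `String` key or field is expected.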

MMCIF reader parsing mistakes, some keys are missing / corrupted

While running the MMCIF reader over the whole PDB archive, my run aborted on PDB entry 1JUF. There seem to be some mix-ups in parsing the file, with some keys missing and some data values ending up being interpreted as keys. There might be more PDB entries with parsing problems; I'll try to collect all the failures I can find in the next few days.

Perhaps it would be good to have a more detailed test that also parses the whole PDB with the Biopython MMCIF reader (or a reference mmCIF parser, if there is one) and checks whether the generated dictionaries are exactly the same.
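
The dictionary comparison suggested here could look roughly like the sketch below. It is plain Julia operating on ordinary `Dict`s; `diffdicts` is a hypothetical helper, and the two small dictionaries are toy data made up to mimic the symptom described, not the real contents of 1JUF.

```julia
# Compare two dictionaries and report the keys that differ.
function diffdicts(a::AbstractDict, b::AbstractDict)
    only_a = sort!(collect(setdiff(keys(a), keys(b))))   # keys missing from b
    only_b = sort!(collect(setdiff(keys(b), keys(a))))   # keys missing from a
    # Keys present in both whose values disagree.
    changed = sort!([k for k in intersect(keys(a), keys(b)) if a[k] != b[k]])
    return (only_a = only_a, only_b = only_b, changed = changed)
end

# Toy data: the second dictionary has lost a key and gained a value-like key.
reference = Dict("_entry.id" => ["1JUF"], "_exptl.method" => ["X-RAY DIFFRACTION"])
parsed = Dict("_entry.id" => ["1JUF"], "X-RAY DIFFRACTION" => ["?"])
report = diffdicts(reference, parsed)
```

An empty report for every entry would indicate that the two parsers agree; any non-empty field points at the keys to inspect.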

Steps to Reproduce

using BioStructures
cif = MMCIFDict("1JUF.cif")
cif["_entry.id"]            # works
cif["_exptl.method"]        # fails with missing key, but grep shows it's there
keys(cif)                   # some keys look like data values

I tried playing around deleting sections or adding them to a new cif file but wasn't able to isolate the problem yet.

Your Environment

BioStructures v0.11.0
Julia 1.5.1
Linux, openSUSE 15.0

MMCIF reading error

When obtaining an mmcif dictionary, if the structure contains atom ids with prime (') it gives an error such as:
ArgumentError("Line ended with quote open: 'C3 H7 N "O2'" 89.093 ")
or
ArgumentError("Opening quote in middle of word: 11 NE2 ? A HIS 16 ? A HIS 16 ? 1_555 CO ? E B12 . ? A B12 201 ? 1_555 C6' ? F FWK . ? B FWK 501 ? 1_555 173.8 ? ")

The second error makes sense since it wasn't surrounded in double quotes but I believe the first error should not happen.

You can test this out with 6H9E

https://files.rcsb.org/download/6H9E.cif

EDIT: Never mind, it was an error within my file. Sorry about that.

Create a ContactMap from a BitArray

Would it be possible to provide the functionality to create a ContactMap from a BitArray{2}? Or to modify the .data property of an existing ContactMap?

Suggestion: Make structural elements mutable

I was just trying to use BioStructures to solve a problem that required making copies of chains, renaming them, altering atom serial numbers and making other modifications to structural elements. In the current implementation only the atom coordinates can be modified. This package is great, but I think it could be much more useful if the corresponding structures were declared as mutable, and I don't see any downside to doing that.
It would be great to be able to modify atoms, residues and chains and create new models/structures with them.
Best regards,
Amaury

Request for symmetries

This is a great library and I'm already using it lots. I would like to evaluate positions of a protein's atoms with the full expression of symmetry in the complete unit cell or entire crystal. I can't seem to find it in the documentation for this library, nor in the code.

Expected Behavior

I would like to see documented examples of how to retrieve the atom positions along with all the symmetries of those positions.

Current Behavior

I've written my own basic parser (below), but it only works for PDB files. Molecules in the PDB can often be downloaded in either PDB format or MMCIF format; however, some appear to be downloadable only as mmCIF, e.g. "6PEM". In these cases, downloadpdb throws an exception unless the MMCIF format is specified.

julia> BioStructures.downloadpdb("6PEM")   # no .pdb file available!
[ Info: Downloading PDB: 6PEM
┌ Error: Download failed: curl: (22) The requested URL returned error: 404 Not Found
└ @ Base download.jl:43
ERROR: failed process: Process(`/usr/bin/curl -s -S -g -L -f -o /var/folders/ry/wg4yqb1j5lz3lcyzyls96k2r0000gn/T/jl_32sXTz http://files.rcsb.org/download/6PEM.pdb.gz`, ProcessExited(22)) [22]

julia> BioStructures.downloadpdb("6PEM", file_format=MMCIF)   # works!
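
Until the available formats can be queried, one workaround is to try each format in turn and keep the first that succeeds. The sketch below is plain Julia under that assumption; `first_available` and its `download_one` argument are hypothetical names, not part of BioStructures.

```julia
# Call `download_one(id, fmt)` for each format in order and return the first
# result that does not throw, together with the format that worked.
function first_available(download_one, id, formats)
    for fmt in formats
        try
            return (format = fmt, result = download_one(id, fmt))
        catch
            # This format is unavailable; fall through to the next one.
        end
    end
    error("no requested format could be fetched for $id")
end
```

With the calls shown in this report, `download_one` might be `(id, fmt) -> downloadpdb(id, file_format=fmt)`. Swallowing every exception is a simplification; a more careful version would only catch download failures.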

Possible Solution / Implementation

Perhaps two functions could be added to the library: one that informs the user which file types a protein can be downloaded in, and another that allows the user to read symmetries regardless of the file type. However, there may be more natural solutions that integrate better with the BioStructures.Model (e.g. adding a field for parsed symmetries). I would be happy with anything that supported both formats and allowed me to ultimately get all the atoms in the unit cell or entire crystal.

By way of example of the second function, the code that I implemented as a workaround follows. It assumes that you have already downloaded a PDB file to the directory named by pdb_cache_dir, and it only works with .pdb format files.

function read_symmetries(protein_name)
    # Read the symmetry operations of the complex: one rotation matrix and one
    # translation vector per operation. Each operation spans three lines:
    #                     index         rotation M               translation
    # REMARK 350   BIOMT1   1  1.000000  0.000000  0.000000        0.00000
    # REMARK 350   BIOMT2   1  0.000000  1.000000  0.000000        0.00000
    # REMARK 350   BIOMT3   1  0.000000  0.000000  1.000000        0.00000
    #
    # pdb_cache_dir is assumed to be a global holding the download directory.
    full_protein_name = protein_name * ".pdb"
    fn = joinpath(pdb_cache_dir, full_protein_name)
    lines = readlines(fn)
    remark = "REMARK 350   BIOMT"
    lines_defining_symmetry = filter(x -> startswith(x, remark), lines)
    @assert (length(lines_defining_symmetry) % 3 == 0) "number of BIOMT lines is not a multiple of 3"
    nsymmetries = div(length(lines_defining_symmetry), 3)
    rotation_matrices = Array{Float64,2}[]
    translation_vectors = Array{Float64,1}[]
    for i in 1:nsymmetries
        idx = (i-1)*3
        # Whitespace-separated fields 5-7 hold a rotation row, field 8 the translation.
        s1 = split(lines_defining_symmetry[idx+1])
        s2 = split(lines_defining_symmetry[idx+2])
        s3 = split(lines_defining_symmetry[idx+3])
        rotation_matrix = [ parse(Float64, s1[5]) parse(Float64, s1[6]) parse(Float64, s1[7]);
                            parse(Float64, s2[5]) parse(Float64, s2[6]) parse(Float64, s2[7]);
                            parse(Float64, s3[5]) parse(Float64, s3[6]) parse(Float64, s3[7]) ]
        translation_vector = [ parse(Float64, s1[8]), parse(Float64, s2[8]), parse(Float64, s3[8]) ]
        push!(rotation_matrices, rotation_matrix)
        push!(translation_vectors, translation_vector)
    end
    rotation_matrices, translation_vectors
end
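
Once the rotation matrices and translation vectors are available, each operation can be applied to the coordinates as R * x + t. The following is a minimal sketch in plain Julia, assuming coordinates are stored as a 3×N matrix with one atom per column; `apply_symmetry` and `apply_symmetries` are hypothetical helper names, and the example operations are made up for illustration.

```julia
# Apply one symmetry operation to a 3×N coordinate matrix (one atom per column).
# R is a 3×3 rotation matrix and t a length-3 translation vector.
function apply_symmetry(R::AbstractMatrix, t::AbstractVector, coords::AbstractMatrix)
    return R * coords .+ t
end

# Apply every operation and return one transformed copy per operation.
function apply_symmetries(Rs, ts, coords)
    return [apply_symmetry(R, t, coords) for (R, t) in zip(Rs, ts)]
end

coords = [1.0 0.0; 2.0 0.0; 3.0 1.0]   # two atoms: (1, 2, 3) and (0, 0, 1)
identity_op = [1.0 0.0 0.0; 0.0 1.0 0.0; 0.0 0.0 1.0]
twofold_z = [-1.0 0.0 0.0; 0.0 -1.0 0.0; 0.0 0.0 1.0]   # 180 degree turn about z
copies = apply_symmetries([identity_op, twofold_z], [zeros(3), [10.0, 0.0, 0.0]], coords)
```

The two return values of read_symmetries above could be passed as `Rs` and `ts`. Note that BIOMT records describe the biological assembly; filling the unit cell would need the crystallographic operators instead.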

Your Environment

Package Version used: v0.6.0
Julia Version used: v1.3.1
Operating System and version (desktop or mobile): macOS Version 10.15.2

Multiple selection and Merge Selection

Can we use collectresidues to select multiple regions of a protein? For example:

domain = collectresidues(chain, res -> (13 <= resnumber(res) <= 436) || (504 <= resnumber(res) <= 534), allselector)

Or can different selections be merged into a new selection?
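
Because a selector in the line above is just a function that returns true or false, separate selections can be merged by combining the predicates. This sketch is plain Julia with integers standing in for residues; `anyof` and `inrange` are hypothetical helpers, not functions from BioStructures.

```julia
# Combine any number of predicates into one that accepts an item if at least
# one of them matches.
anyof(preds...) = x -> any(p -> p(x), preds)

# Build a predicate for an inclusive range of a numeric property.
# `getnumber` is a placeholder for an accessor such as `resnumber`.
inrange(getnumber, lo, hi) = x -> lo <= getnumber(x) <= hi

# Integers stand in for residues here, so the accessor is `identity`.
in_domain = anyof(inrange(identity, 13, 436), inrange(identity, 504, 534))
selected = filter(in_domain, 1:600)
```

Under the assumption that collectresidues accepts any such function, `anyof(inrange(resnumber, 13, 436), inrange(resnumber, 504, 534))` could be passed in place of the anonymous function in the question.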

Feature suggestion: support BinaryCIF file format

I think it would be good to support the BinaryCIF format. This might perhaps make for a good GSoC project.

"BinaryCIF is a data format for storing text based CIF files using a more efficient binary encoding. "

https://github.com/molstar/BinaryCIF

BinaryCIF is the replacement format for MMTF:

"As of July 2, 2024, RCSB PDB will no longer serve PDB data in the MMTF compression format. Users are strongly encouraged to switch to accessing the data files offered in the compressed BinaryCIF (BCIF) format."

https://www.rcsb.org/news/65a1af31c76ca3abcc925d0c
