biojulia / biostructures.jl
A Julia package to read, write and manipulate macromolecular structures (particularly proteins)
License: Other
BioStructures.jl should be able to parse all files from SCOPe/ASTRAL: http://scop.berkeley.edu/downloads/pdbstyle/pdbstyle-sel-gs-bib-95-2.07.tgz
BioStructures.jl fails on 2962 files, for example d9pcya_. Full list of files attached: biostructures_test_scop95.txt
To reproduce, download the PDB files from http://scop.berkeley.edu/downloads/pdbstyle/pdbstyle-sel-gs-bib-95-2.07.tgz, unzip them, and run:
using BioStructures
read("d9pcya_.ent", PDB)
Can we use collectresidues to select multiple regions of a protein? For example:
domain = collectresidues(chain, res -> (13 <= resnumber(res) <= 436) || (504 <= resnumber(res) <= 534), allselector)
Or is there a way to merge different selections into a new selection?
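One way to keep a multi-region selection readable is to build the predicate from a list of ranges. The sketch below is plain Julia and does not call BioStructures; inranges is a hypothetical helper name, not part of the package.

```julia
# Hypothetical helper (not part of BioStructures): returns a predicate that is
# true when a residue number falls inside any of the given ranges.
inranges(ranges::AbstractRange...) = n -> any(r -> n in r, ranges)

# The two regions from the question above.
in_domain = inranges(13:436, 504:534)

# Demonstrate on plain residue numbers.
selected = filter(in_domain, 1:600)
```

Assuming resnumber and collectresidues behave as in the question, the predicate could presumably be passed as res -> in_domain(resnumber(res)).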
While running the MMCIF reader over the whole PDB archive, my run aborted on PDB entry 1JUF. There seems to be a mix-up in parsing the file, with some keys missing and some data values being interpreted as keys. There might be more PDB entries with parsing problems; I'll try to collect all the failures I can find over the next few days.
Perhaps it would be good to have a more detailed test that also parses the whole PDB with the Biopython MMCIF reader (or a reference mmCIF parser, if there is one) and checks that the generated dictionaries are exactly the same.
using BioStructures
cif = MMCIFDict("1JUF.cif")
cif["_entry.id"] # works
cif["_exptl.method"] # fails with missing key, but grep shows it's there
keys(cif) # some keys look like data values
I tried deleting sections and copying them into a new CIF file, but I haven't been able to isolate the problem yet.
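The dictionary comparison suggested above could look roughly like the following. It is a minimal sketch that assumes both parsers' output has already been converted to plain Dict{String, Vector{String}} values; dictdiff is a hypothetical name, and the sample data is made up to mimic a value that has been misread as a key.

```julia
# Hypothetical helper: report keys present on only one side and keys whose
# values differ between two parsed dictionaries.
function dictdiff(a::AbstractDict, b::AbstractDict)
    only_a = sort!([k for k in keys(a) if !haskey(b, k)])
    only_b = sort!([k for k in keys(b) if !haskey(a, k)])
    differing = sort!([k for k in keys(a) if haskey(b, k) && a[k] != b[k]])
    return (only_a = only_a, only_b = only_b, differing = differing)
end

# Made-up example: the second dictionary lost a key and gained a data value as a key.
reference = Dict("_entry.id" => ["1JUF"], "_exptl.method" => ["X-RAY DIFFRACTION"], "_x" => ["1"])
candidate = Dict("_entry.id" => ["1JUF"], "X-RAY DIFFRACTION" => ["?"], "_x" => ["2"])

report = dictdiff(reference, candidate)
```

An empty report for every entry in the archive would indicate that the two parsers agree.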
This template is rather extensive. Fill out all that you can; if you are a new contributor or you're unsure about any section, leave it unchanged and a reviewer will help you. This template is simply a tool to help everyone remember the BioJulia guidelines; if you feel anything in this template is not relevant, simply delete it.
The module "libz1" cannot be found.
1. add BioStructures
2. using BioStructures
I am setting up a fresh JuliaPro installation with Julia 1.3 and am now adding packages.
I think it would be good to support the BinaryCIF format. This might perhaps make for a good GSoC project.
"BinaryCIF is a data format for storing text based CIF files using a more efficient binary encoding. "
https://github.com/molstar/BinaryCIF
BinaryCIF is the replacement format for MMTF:
"As of July 2, 2024, RCSB PDB will no longer serve PDB data in the MMTF compression format. Users are strongly encouraged to switch to accessing the data files offered in the compressed BinaryCIF (BCIF) format."
I am doing structure modeling from images, so I need to manipulate the atoms and residues. Is it possible to export the fixlists! function, so I can update the atom_list after adding some atoms? Or are there other ways I can build a new protein structure using the current API?
I was just trying to use BioStructures to solve a problem that required making copies of chains, renaming them, altering atom serial numbers, and making other modifications to structural elements. In the current implementation only the atom coordinates can be modified. This package is great, but I think it could be much more useful if the corresponding structs were declared mutable, and I don't see any downside to doing that.
It would be great to be able to modify atoms, residues and chains and create new models/structures with them.
Best regards,
Amaury
Would it be possible to provide the functionality to create a ContactMap from a BitArray{2}? Or to modify the .data property of an existing ContactMap?
When obtaining an mmcif dictionary, if the structure contains atom ids with prime (') it gives an error such as:
ArgumentError("Line ended with quote open: 'C3 H7 N "O2'" 89.093 ")
or
ArgumentError("Opening quote in middle of word: 11 NE2 ? A HIS 16 ? A HIS 16 ? 1_555 CO ? E B12 . ? A B12 201 ? 1_555 C6' ? F FWK . ? B FWK 501 ? 1_555 173.8 ? ")
The second error makes sense since it wasn't surrounded in double quotes but I believe the first error should not happen.
You can test this out with 6H9E: https://files.rcsb.org/download/6H9E.cif
EDIT: Never mind, it was an error within my file. Sorry about that.
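For reference, the quoting convention at play can be sketched as a small tokenizer. This is an illustration of the CIF 1.1 rule as I understand it (a quote opens a string only at the start of a token and closes it only when followed by whitespace or the end of the line), not the parser used by BioStructures. cif_tokens is a hypothetical name, comments and multi-line values are ignored, and the sketch is more permissive than the parser in the report because it accepts a quote character inside an unquoted token.

```julia
# Split one CIF data line into tokens (sketch; ignores '#' comments and
# semicolon-delimited multi-line values).
function cif_tokens(line::AbstractString)
    tokens = String[]
    chars = collect(line)
    n = length(chars)
    i = 1
    while i <= n
        c = chars[i]
        if isspace(c)
            i += 1
        elseif c == '\'' || c == '"'
            # Quoted token: find the matching quote that is followed by
            # whitespace or ends the line.
            j = i + 1
            while j <= n && !(chars[j] == c && (j == n || isspace(chars[j + 1])))
                j += 1
            end
            j > n && throw(ArgumentError("Line ended with quote open: $line"))
            push!(tokens, String(chars[i+1:j-1]))
            i = j + 1
        else
            # Unquoted token: runs until the next whitespace.
            j = i
            while j <= n && !isspace(chars[j])
                j += 1
            end
            push!(tokens, String(chars[i:j-1]))
            i = j
        end
    end
    return tokens
end

formula_tokens = cif_tokens("ALA 'C3 H7 N O2' 89.093")
prime_tokens = cif_tokens("C6' \"O2'\" 1_555")
```

Under this rule a primed atom name such as O2' survives intact when it is wrapped in double quotes.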
When looking at an atom record I discovered that whitespace is not stripped from the atom/element names when they are parsed.
An example: Dict{String, AbstractAtom}(" CA " => Atom CA with serial ,.....
As you can see, CA is surrounded by spaces, which is rather inconvenient. I suggest changing the function parseatomname in pdb.jl to the following:
function parseatomname(line::String, line_n::Integer=1)
    try
        # Columns 13-16 hold the atom name; strip the padding spaces
        return strip(line[13:16])
    catch
        throw(PDBParseError("could not read atom name", line_n, line))
    end
end
And the same for parseelement.
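The effect of the proposed change can be seen on a single fixed-width ATOM line. The sketch below is standalone Julia and does not use BioStructures. The record is assembled with lpad so the column positions are explicit, and the column numbers follow the PDB format description (atom name in columns 13-16, element symbol in columns 77-78).

```julia
# Assemble a fixed-width ATOM record column by column.
line = "ATOM  " *               # columns 1-6: record name
       lpad("2", 5) * " " *     # columns 7-11: serial, column 12: blank
       " CA " * " " *           # columns 13-16: atom name, column 17: altloc
       "ALA" * " " * "A" *      # columns 18-20: residue name, column 22: chain
       lpad("1", 4) * " "^4 *   # columns 23-26: residue number, 27-30: blank
       lpad("11.104", 8) * lpad("6.134", 8) * lpad("-6.504", 8) *  # x, y, z
       lpad("1.00", 6) * lpad("0.00", 6) *                         # occupancy, B-factor
       " "^10 * lpad("C", 2)    # columns 77-78: element symbol

raw_name = line[13:16]          # padded, as currently stored
atom_name = strip(line[13:16])  # stripped, as proposed
element = strip(line[77:78])
```

Note that strip returns a SubString; wrapping the result in String(...) would give a plain String if the rest of the code expects one.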
Not an issue, the package looks great, but I was looking to feed a ligand into the MolecularGraph.jl package and I was wondering if you had a suggestion for how to convert a ligand.pdb into a mol/sdf file in Julia?
This is a great library and I'm already using it lots. I would like to evaluate positions of a protein's atoms with the full expression of symmetry in the complete unit cell or entire crystal. I can't seem to find it in the documentation for this library, nor in the code.
I would like to see documented examples of how to retrieve the atoms' positions along with all the symmetry-related copies of those positions.
I've written my own basic parser (below), but it only works for PDB files. Molecules in the PDB can often be downloaded in either PDB format or mmCIF format; however, some appear to be downloadable only as mmCIF, e.g. "6PEM". In these cases, downloadpdb throws an exception unless the MMCIF format is specified.
julia> BioStructures.downloadpdb("6PEM") # no .pdb file available!
[ Info: Downloading PDB: 6PEM
┌ Error: Download failed: curl: (22) The requested URL returned error: 404 Not Found
└ @ Base download.jl:43
ERROR: failed process: Process(`/usr/bin/curl -s -S -g -L -f -o /var/folders/ry/wg4yqb1j5lz3lcyzyls96k2r0000gn/T/jl_32sXTz http://files.rcsb.org/download/6PEM.pdb.gz`, ProcessExited(22)) [22]
julia> BioStructures.downloadpdb("6PEM", file_format=MMCIF) # works!
Perhaps two functions could be added to the library: one that tells the user which file types a protein can be downloaded in, and another that allows the user to read symmetries regardless of the file type. However, there may be more natural solutions that integrate better with the BioStructures.Model (e.g. adding a field for parsed symmetries). I would be happy with anything that supported both formats and allowed me to ultimately get all the atoms in the unit cell or entire crystal.
By way of example of the second function, the code that I implemented as a workaround follows. It assumes that you have already downloaded a PDB file to the directory named by pdb_cache_dir, and it only works with .pdb format files.
function read_symmetries(protein_name)
    # Get the symmetries of the complex as rotation matrices and translation vectors.
    # They are in the form:
    #                   index  rotation M                  translation
    # REMARK 350 BIOMT1 1 1.000000 0.000000 0.000000 0.00000
    # REMARK 350 BIOMT2 1 0.000000 1.000000 0.000000 0.00000
    # REMARK 350 BIOMT3 1 0.000000 0.000000 1.000000 0.00000
    #
    # pdb_cache_dir is assumed to be a global defined by the caller.
    full_protein_name = protein_name * ".pdb"
    fn = joinpath(pdb_cache_dir, full_protein_name)
    lines = readlines(fn)
    # Match regardless of how many spaces separate the fields
    remark = r"^REMARK 350\s+BIOMT"
    lines_defining_symmetry = filter(x -> occursin(remark, x), lines)
    @assert (length(lines_defining_symmetry) % 3 == 0) "number of BIOMT lines is not a multiple of 3"
    nsymmetries = div(length(lines_defining_symmetry), 3)
    rotation_matrices = Array{Float64,2}[]
    translation_vectors = Array{Float64,1}[]
    for i in 1:nsymmetries
        idx = (i - 1) * 3
        # Each operator spans three consecutive lines, one per matrix row
        s1 = split(lines_defining_symmetry[idx+1])
        s2 = split(lines_defining_symmetry[idx+2])
        s3 = split(lines_defining_symmetry[idx+3])
        rotation_matrix = [ parse(Float64, s1[5]) parse(Float64, s1[6]) parse(Float64, s1[7]);
                            parse(Float64, s2[5]) parse(Float64, s2[6]) parse(Float64, s2[7]);
                            parse(Float64, s3[5]) parse(Float64, s3[6]) parse(Float64, s3[7]) ]
        translation_vector = [ parse(Float64, s1[8]), parse(Float64, s2[8]), parse(Float64, s3[8]) ]
        push!(rotation_matrices, rotation_matrix)
        push!(translation_vectors, translation_vector)
    end
    rotation_matrices, translation_vectors
end
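Once the operators are read, each copy of the structure is obtained as x' = R x + t. The sketch below shows that step on a plain 3×N coordinate matrix; apply_symmetries is a hypothetical name, and the sample rotation and translation are made up. Note that REMARK 350 BIOMT records describe the biological assembly, while the crystallographic operators live in REMARK 290 SMTRY records, so the parser above would need to be pointed at those to fill the unit cell.

```julia
# Apply each (rotation, translation) pair to a 3×N matrix of coordinates,
# returning one transformed 3×N matrix per operator.
function apply_symmetries(coords::AbstractMatrix{<:Real}, rotations, translations)
    length(rotations) == length(translations) ||
        throw(ArgumentError("need one translation per rotation"))
    # R * coords rotates every column; .+ t shifts every column by t
    return [R * coords .+ t for (R, t) in zip(rotations, translations)]
end

# Two atoms, stored as columns.
coords = [1.0 0.0;
          0.0 1.0;
          0.0 0.0]

# Made-up operator: 90 degree rotation about z, then a shift of 5 along z.
rotations = [[0.0 -1.0 0.0; 1.0 0.0 0.0; 0.0 0.0 1.0]]
translations = [[0.0, 0.0, 5.0]]

copies = apply_symmetries(coords, rotations, translations)
```

The output of read_symmetries could presumably be passed straight in as the rotations and translations arguments.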
Hello. In my project, I am using secondary structures. Without going into detail, I need to sample CA atoms from different secondary structures for my algorithm. This can easily be done by parsing strings, I know, but I think it might be a nice feature to be able to obtain secondary structure information like any other feature in a PDB file.
One solution might be to create a function such as helix(struc::ProteinStructure) or sheet(struc::ProteinStructure) that returns the lines containing the starting and ending residues of each secondary structure element. Then we can get the list of atoms/residues with a combination like this:
collectatoms(struc::ProteinStructure, calphaselector)[helix(struc::ProteinStructure)[1]] # to obtain the atoms belonging the first Alpha Helix
I could not find anything related to this suggestion in the documentation; if it is there, I am genuinely sorry.
Though I have some experience with the source code in the spatial.jl file, I do not have any experience with parsing PDB files. My suggestion would be to write a string parser as a function, but I am not sure how we can connect it with a ProteinStructure structure.
My project is related to secondary structures. I think it would be nice to be able to obtain this information for those who need it.
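A string parser of the kind suggested could start from the HELIX records. The sketch below is standalone Julia and is not connected to ProteinStructure. helix_ranges is a hypothetical name, and the column positions (chain in column 20, first residue number in columns 22-25, last residue number in columns 34-37) follow the PDB format description. Insertion codes and helices that span chains are ignored.

```julia
# Collect (chain, first:last) residue ranges from the HELIX records in a
# list of PDB lines.
function helix_ranges(lines)
    ranges = Tuple{Char, UnitRange{Int}}[]
    for line in lines
        startswith(line, "HELIX ") || continue
        length(line) >= 37 || continue   # too short to hold the end residue
        chain = line[20]
        first_res = parse(Int, strip(line[22:25]))
        last_res = parse(Int, strip(line[34:37]))
        push!(ranges, (chain, first_res:last_res))
    end
    return ranges
end

# Assemble a HELIX record column by column so the positions are explicit.
helix_line = "HELIX " * " " * lpad("1", 3) * " " * lpad("H1", 3) * " " *
             "ALA" * " " * "A" * " " * lpad("10", 4) * " " * " " *
             "GLY" * " " * "A" * " " * lpad("25", 4)

found = helix_ranges([helix_line, "ATOM      1  N   ALA A  10"])
```

The resulting ranges could then be used to filter residues by number, for example to pick the CA atoms of the first helix.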