ihmwg / ihmcif


📖 mmCIF support for hybrid/integrative models

Home Page: https://pdb-dev.wwpdb.org

License: Creative Commons Zero v1.0 Universal

Shell 100.00%

mmcif pdb hybrid-model dictionary integrative-modeling

ihmcif's Introduction

Overview

The IHMCIF dictionary provides the data representation required for archiving integrative/hybrid structural models in PDB-Dev.

This dictionary is an extension of the PDBx/mmCIF dictionary and provides the additional definitions required to handle integrative / hybrid models.

The IHMCIF dictionary provides a mechanism to capture the following information regarding integrative / hybrid models:

Definitions of multi-scale, multi-state, and ordered ensembles
Definition of models with heterogeneous composition
Descriptions of the starting structural models of individual molecular components obtained from experimental and computational techniques such as
- X-ray diffraction
- NMR spectroscopy
- Electron microscopy
- Computational models
- Integrative models
Definitions of the spatial restraints derived from a variety of experimental data such as
- 2D and 3D electron microscopy (2DEM and 3DEM)
- Chemical crosslinking mass spectrometry (CX-MS)
- Small angle scattering (SAS)
- Förster resonance energy transfer (FRET)
- Electron paramagnetic resonance spectroscopy (EPR)
- Hydrogen-deuterium exchange mass spectrometry (HDX-MS)
- Atomic force microscopy (AFM)
- Distance restraints from coevolution data
- Generic distance restraints obtained from biophysical and proteomics methods
Referencing associated data from external resources
Definitions of the ambiguities/uncertainties associated with the experimental data and preliminary validation metrics.
Description of the modeling workflow

The IHMCIF dictionary currently has over 30 new data categories and 300 new data items.

For more details regarding the dictionary, see the IHMCIF dictionary documentation and the mmCIF resources website.

For tips on structuring integrative modeling studies to be amenable to deposition, see this page.

Browse the wiki page for archived information regarding weekly meetings and descriptions of integrative modeling examples.

Organization of the repository

README.md - this file

IHMCIF extension - IHM dictionary extension

IHMCIF complete - IHM dictionary extension merged with the parent PDBx/mmCIF dictionary

dictionary_documentation - directory with detailed documentation regarding the data categories defined in the IHMCIF dictionary along with examples.

examples - directory with examples of integrative models compliant with the IHM dictionary

deposition - directory that tracks support for generating IHM dictionary compliant data files by modeling software such as IMP.

Discussion

Discussion on the file formats is conducted via email - please subscribe to the mailing list.
To get an email every time this GitHub repository is updated, please subscribe to the IHM-mmCIF-commits mailing list.

Deposition of models to PDB-Dev

Models can be deposited to PDB-Dev in a semi-automated fashion, via the deposition and data harvesting system. The system accepts mmCIF files compliant with the PDBx/mmCIF and IHMCIF dictionaries. Compliant files can be generated using the python-ihm software library. Modeling software such as IMP has internal support for IHMCIF. See the deposition directory for more information.

Visualization of integrative models

There is currently basic support for visualization of IHM mmCIF models in daily builds of UCSF ChimeraX and in Molstar.

ihmcif's Issues

Local paths in ihm_external_files.file_path

Does ihm_external_files.file_path require paths to be separated with / or \ or is either acceptable? Windows builds of our software will naturally output paths separated with \ while the provided examples all use /. The docs say "This data item assumes a POSIX-like directory structure" but I'm not sure if that means "you must use / to separate paths".

For local files, the path is required to be a relative path. But relative to what?
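Whichever separator the dictionary ends up requiring, a writer running on Windows could normalize paths before output. The following is a minimal sketch, assuming that "/" is the intended separator (the question above leaves that open); `to_posix_relative` is a hypothetical helper name, not part of any existing tool.

```python
from pathlib import PureWindowsPath


def to_posix_relative(path_text: str) -> str:
    """Convert a Windows- or POSIX-style relative path to '/'-separated form."""
    # PureWindowsPath accepts both '\\' and '/' as separators and never
    # touches the file system, so it works on any platform.
    path = PureWindowsPath(path_text)
    if path.drive or path.root:
        raise ValueError(f"expected a relative path, got {path_text!r}")
    return path.as_posix()
```

One caveat of this approach: a POSIX file name that legitimately contains a backslash would be split into directories, so it only suits writers that know their input came from a Windows-style path.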

Add ensemble ids to ihm_localization_density_files for exosome and mediator

The ensemble ids are "." in the ihm_localization_density_files table for the exosome and mediator examples. They need to have a valid ensemble id to know which ensemble the localization density is for.

Add position and scaling matrix for nup84 2D EM

nup84.cif has a reference to a 2D EM map (*.pgm) file which does not contain pixel size information or alignment to the sphere models.

Merge this repository with ihmwg/IHM-dictionary ?

While most development of the IHM dictionary occurs here, there's another repository linked to from the PDB-dev website which contains similar information: https://github.com/ihmwg/IHM-dictionary

Any reason not to merge the two? Since both are public and contain similar information it seems to make more sense to me to drop one (or both) of them in favor of a single repo, so things stay in sync.

My suggestion is to use the ihmwg organization (since it reinforces the notion that we're not Sali-lab-specific) but (mainly) use the content from this repo (since there's a lot more of it, including issues etc).

If, @brindakv, you agree, I should be able to do this pretty easily if you make me an admin in the ihmwg organization.

Add reference to external sphere model ensemble files

Would like IHM to be able to reference sphere model ensemble files that are too big to be put into the IHM file. Currently the nup84 example has such ensembles in the DOI zip archive in PDB format, and I hacked references to those files into the nup84 ChimeraX demo by adding a file reference in the ihm_model_list table as follows.

_ihm_model_list.ordinal_id
_ihm_model_list.model_id
_ihm_model_list.model_name
_ihm_model_list.model_group_id
_ihm_model_list.model_group_name
_ihm_model_list.assembly_id
_ihm_model_list.protocol_id
_ihm_model_list.file
1 1 'Cluster 1 best score' 1 'Cluster 1' 1 1 .
2 2 'Cluster 2 best score' 2 'Cluster 2' 1 1 .
3 3 'Cluster 1 ensemble' 1 'Cluster 1' 1 1 extra_data/ensembles/cluster1.pdb
4 4 'Cluster 2 ensemble' 2 'Cluster 2' 1 1 extra_data/ensembles/cluster2.pdb

Instead the link out should be an id to another table that lists all the externally linked files either by DOI+archive path, URL, or local file path. That table would also list the file format -- for instance the sphere model ensembles might be in RMF format (currently used by IMP) or a binary trajectory format.

Make a time series example

The nup84, mediator and exosome examples do not use the time series capabilities of the IHM format. Ben and Brinda and I discussed making an example using time series but Ben said the Sali lab has not published a time series example, so we did not come up with any candidate example systems.

ihm_starting_model_seq_dif.db_entity_id

When a starting model differs from the input crystal structure by a MSE -> MET mutation, am I required to add a new entity to the mmCIF file to list the input crystal structure's sequence (with MSE)? I assume that's what ihm_starting_model_seq_dif.db_entity_id is for, but the dictionary MSE example has ihm_starting_model_seq_dif.entity_id = ihm_starting_model_seq_dif.db_entity_id, which surely can't happen if the sequences are different, right?

List all linked out files in a single new table.

In order to use IHM format as a working format while a modeling project is in progress, the externally linked files (comparative models, sequence alignments, EM maps, ensembles of result structures, localization maps, ...) should be listed in a table that refers to local files (on the local disk) instead of referencing a DOI zip archive.

One design would give an integer id to every external file, external database reference, external DOI reference, and all other tables would use this id. Any external data could be referenced in any of these 3 ways (local file, database, DOI archive).

Having external references in one table will simplify validating that deposited structures do not reference missing data.
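To illustrate the validation benefit, here is a rough sketch of the single-table design. The `ExternalReference` fields and the `find_dangling_ids` helper are hypothetical names chosen for illustration; the actual table layout would be whatever the dictionary defines.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class ExternalReference:
    """One row of a hypothetical unified external-reference table."""
    id: int
    kind: str                            # 'local file', 'database', or 'DOI archive'
    location: str                        # file path, accession code, or DOI
    archive_path: Optional[str] = None   # path inside a DOI archive, if any


def find_dangling_ids(references: Iterable[ExternalReference],
                      used_ids: Iterable[int]) -> List[int]:
    """Return ids used by other tables that no reference row defines."""
    known = {ref.id for ref in references}
    return sorted(set(used_ids) - known)
```

With every table pointing at the same id space, a deposition check reduces to collecting the ids used elsewhere in the file and passing them to one function.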

Starting model coordinates should link out to standard mmCIF files.

Currently IHM starting model coordinates are in the ihm_starting_model_coord table. This is a redundant representation of atomic model coordinates usually represented by the mmCIF atom_site table. Important information such as secondary structure is not available when representing coordinates with this special redundant table. Also software capable of reading the atom_site table can't easily be reused on this table because of the different name and all the missing associated mmCIF tables (such as secondary structure).

It seems more sensible to have the IHM file contain a starting model table that links out to mmCIF atomic model files, rather than trying to duplicate that information in the IHM file.

Add ihm_external_files.details

The old ihm_dataset_other category used to have a details field. This category is largely replaced now by ihm_external_files. Would be useful to have a details field in that category so we can give a human-readable description of the file.

Predicted Contacts

Predicted residue-residue contacts are another commonly used form of distance restraints, like crosslinks. They are commonly used together with crosslinks in integrative modeling. I cannot find explicit support for them in the current form of the dictionary - should general_distance_constraints from the NMR-Star dictionary be used for that? Are they compatible with the IHM dict? Can multiple extensions be used together? Sorry if that is obvious, I'm not so deep into mmCIF yet :)
Best,
Lukas

Use ".ihm" file suffix instead of ".cif".

The IHM file format should use a different suffix than ".cif", I suggest ".ihm". This will allow software that reads this format to easily identify the dictionary in use, distinguishing it from the mmCIF dictionary that covers only atomic models.

The idea of sticking with ".cif" and putting some table in the file that identifies that this is an IHM flavor cif file will make it much harder for software developers to add support for reading this file because essentially all software distinguishes file types by the file suffix, rather than looking inside the file.

I suggest we should call these "integrated hybrid model" or IHM files, not "mmCIF" files when talking about the format. mmCIF is strongly associated with atomic-only resolution models, so that name is very misleading when applied to IHM files.

Why are datasets repeated in ihm_dataset_list for each group id?

In the nup84.cif example the ihm_dataset_list table repeats every dataset 3 times with group ids 1, 2 and 3. This is confusing. The 3 copies of each data set have the same ihm_dataset_list.id but different ihm_dataset_list.ordinal_id values.

It seems this table is being used to give ids to the datasets but also to say which groups each dataset belongs to. But then fields like "data_type" and "database_hosted" have to be repeated for each line listing the same dataset in another group. Maybe the table is trying to do too much and it should only list the unique datasets and a new table should identify which groups each dataset belongs to?

The ihm_starting_model_details table has the same type of issue where the same starting model is repeated multiple times so that one field can be varied -- but then all other fields need to be copied. Maybe tables should be designed so properties that can only have one value for a given object and properties that can have multiple values for a given object (like multiple templates for one comparative model, or multiple groups for one dataset) should not be mixed in the same table.
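The split suggested above can be sketched as follows. This is only an illustration of the idea, assuming rows of the form (ordinal_id, dataset id, group id, data_type); the function name and the tuple layout are hypothetical, not taken from the dictionary.

```python
from typing import Dict, List, Tuple

# Each input row mirrors the current layout:
# (ordinal_id, dataset id, group id, data_type).
Row = Tuple[int, int, int, str]


def split_dataset_table(rows: List[Row]) -> Tuple[Dict[int, str], List[Tuple[int, int]]]:
    """Split repeated rows into unique datasets and (group_id, dataset_id) links."""
    datasets: Dict[int, str] = {}
    links: List[Tuple[int, int]] = []
    for _ordinal_id, dataset_id, group_id, data_type in rows:
        # Single-valued properties are stored once; a conflict means the
        # repeated rows disagreed with each other.
        previous = datasets.setdefault(dataset_id, data_type)
        if previous != data_type:
            raise ValueError(f"dataset {dataset_id} has conflicting data_type values")
        links.append((group_id, dataset_id))
    return datasets, links
```

A side effect of the split is that inconsistent copies of a single-valued field, which the current layout permits, become detectable.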

Web-browsable IHM dictionary?

It might be easier for newcomers to the dictionary if they could browse it with a web browser in the same way as existing extension dictionaries, e.g. something like http://mmcif.wwpdb.org/dictionaries/mmcif_em.dic/Index/. Is there any way to feed the existing draft dictionary into whatever tool generates those views? Ideally that would be automated and update weekly (or whenever the repository is updated) and the data dumped somewhere accessible, such as pdb-dev.

How to document interface mapping information

Information about interfaces can be used in the modelling (e.g. by HADDOCK). This can be, for example:

mutations shown to disrupt the binding
footprinting experiments for DNA/RNA
H/D exchange data from MS or NMR
NMR chemical shift perturbations
oxidation protection (e.g. detected by MS)
and many others.

Basically those will provide lists of residues that are expected to be part of the interface. This info is typically not stored in dedicated databases and should be captured here.

Link _atom_site.pdbx_PDB_model_num to ihm_model_list

For I/H models with atomistic representations, the _atom_site.pdbx_PDB_model_num data item needs to be linked to the ihm_model_list category.

Clarify docs for ihm_starting_model_details.starting_model_sequence_offset

It isn't clear from the dictionary docs how ihm_starting_model_details.starting_model_sequence_offset is supposed to be interpreted.

Currently I assume it means the same as the IMP definition, which is I/H model residue # = starting model residue # + offset

However, it could just as easily mean starting model residue # = I/H model residue # + offset

Easy enough to modify IMP if the second meaning is intended.
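To make the ambiguity concrete, here is a small sketch of the two readings. The convention labels are just the two formulas above spelled out as strings; nothing here reflects a decision by the dictionary maintainers.

```python
def model_residue_number(starting_residue: int, offset: int, convention: str) -> int:
    """Map a starting-model residue number to an I/H model residue number."""
    if convention == "model = starting + offset":
        # The reading currently assumed (the IMP definition).
        return starting_residue + offset
    if convention == "starting = model + offset":
        # The alternative reading, solved for the model residue number.
        return starting_residue - offset
    raise ValueError(f"unknown convention: {convention!r}")
```

For starting residue 10 and offset 5 the two readings give 15 and 5 respectively, and they only agree when the offset is 0. The example rows quoted on this page all use an offset of 0, so they cannot settle which reading is intended.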

Add references to sphere model ensemble PDB files for nup84, exosome and mediator

The ihm_ensemble_info table now has an ensemble_file_id field to reference the ensemble sphere files. These are in the nup84 example as multimodel PDB files (not sure about mediator and exosome). Should include reference to these ensemble files in the example data.

Need to be more clear about what each "state" is

A model containing two states may actually represent a bulk sample of 10²³ molecules, where some fraction of those molecules are in one state (different conformation and/or composition) and some other fraction are in a different state. Alternatively, the experimental data may have been determined from a single molecule experiment, in which case the different states in the model are adopted by the same molecule at different points in time (and the experiment returns some kind of convolution of these states). These two conditions should be distinguished - perhaps with an extra enumeration in the ihm_multi_state_modeling table. (An alternative solution would be to require that the latter case be treated as a single state, with multiple time points; the problem here is that since the data are convoluted there is no way to order the states.)

Mediator missing comparative template models

The mediator example lists 6 comparative models in the ihm_starting_model_details table but there is no ihm_starting_comparative_models table giving the templates.

_ihm_starting_model_details.ordinal_id
_ihm_starting_model_details.entity_id
_ihm_starting_model_details.entity_description
_ihm_starting_model_details.asym_id
_ihm_starting_model_details.seq_id_begin
_ihm_starting_model_details.seq_id_end
_ihm_starting_model_details.starting_model_source
_ihm_starting_model_details.starting_model_db_name
_ihm_starting_model_details.starting_model_db_code
_ihm_starting_model_details.starting_model_auth_asym_id
_ihm_starting_model_details.starting_model_sequence_offset
_ihm_starting_model_details.starting_model_id
_ihm_starting_model_details.dataset_list_id
1 1 med6 A 1 192 'experimental model' PDB 4GWP G 0 med6-m1 1
2 2 med8 B 23 214 'experimental model' PDB 4GWP C 0 med8-m1 1
3 3 med11 C 4 115 'experimental model' PDB 4GWP A 0 med11-m1 1
4 4 med17 D 182 687 'experimental model' PDB 4GWP B 0 med17-m1 1
5 5 med18 E 2 301 'experimental model' PDB 4GWP E 0 med18-m1 1
6 6 med20 F 2 210 'experimental model' PDB 4GWP F 0 med20-m1 1
7 7 med22 G 1 121 'experimental model' PDB 4GWP D 0 med22-m1 1
8 8 med4 H 37 127 'comparative model' ? ? D 0 med4-m1 2
9 9 med7 I 12 206 'comparative model' ? ? G 0 med7-m1 2
10 10 med9 J 65 149 'comparative model' ? ? I 0 med9-m1 2
11 11 med31 K 19 110 'comparative model' ? ? Z 0 med31-m1 2
12 12 med21 L 2 128 'comparative model' ? ? U 0 med21-m1 2
13 21 med16 U 8 538 'comparative model' ? ? ' ' 0 med16-m1 3

Need association between comparative model asym_ids and sphere model asym_ids

In the mediator.cif example file, comparative model cr_mid_fullmed10.pdb listed in table ihm_dataset_other has chain identifiers (asym_ids) that do not appear to match the sphere model asym_ids. This causes the comparative model to not be aligned to the sphere model when opened in ChimeraX. The ihm_starting_model_details table lists the association between starting models and this comparative model but there is no table field to say which asym_id in the comparative model corresponds to a given starting model asym_id. There is a "starting_model_db_pdb_auth_asym_id" field for models referenced with "starting_model_db_name" and "starting_model_db_code", but comparative models not at a database (e.g. when a project is in progress) do not have the associated asym_id identified.

How will model domains be defined?

The fly genome example final_2L_60_161.ihm has spheres 1-102 in two domains, 1-50 and 51-102. While those two domains can be named using the ihm_struct_assembly table, John Westbrook notes that this is not the envisioned use of the assembly table, which is usually for defining complexes consisting of multiple entities rather than subparts of a single entity. Possibly a new idea is needed about how to specify domains.

Avoid ihm_model_id in atom_site table

Currently IHM inserts a new field ihm_model_id into the atom_site table. This complicates code that reads these files because it will likely use an existing mmCIF reader for atom_site that does not know about the ihm_model_id. It might be better to use the atom_site pdbx_PDB_model_num field. That field is necessary in any case to correctly parse the atom_site table into multiple models. The mapping of pdbx_PDB_model_num to ihm_model_id could be another IHM table allowing the IHM extension to be self-contained without injecting fields into categories from other dictionaries.

Document the data representation model in IHM

It would be useful to describe the model representation used by IHM in a precise way but in higher level terms than the current IHM documentation

https://github.com/salilab/mmcif/blob/master/dictionary_documentation/documentation.md

Such a description will help assess the generality of the model representation. I'll attach an attempt at such a description. This probably belongs in the above documentation.md but I didn't want to tamper with that file unless others agree.

Mediator example, DOI zip archives not found.

The DOI archive in the mediator.cif example

https://zenodo.org/record/556216/files/integrativemodeling/mediator-v1.0.2.zip

gets a file not found. It appears that the correct URL is

https://zenodo.org/record/556216/files/mediator-v1.0.2.zip

Likewise the cluster DOI zip files

https://zenodo.org/record/556216/files/integrativemodeling/cluster1.zip

are not found and correct URL appears to be

https://zenodo.org/record/556216/files/cluster1.zip

content_type for localization density

Would be nice to specify the content_type for localization density ".mrc" files in ihm_external_files as "localization density". Currently for the nup84, exosome, mediator examples it describes the content_type as 'Modeling or post-processing output' which is quite a bit vaguer.

Starting models table contains repeated comparative models

In the nup84.cif example the ihm_starting_model_details table contains multiple lines for a single comparative model, one line for each template that was used in building that comparative model. Here is an example

loop_
_ihm_starting_model_details.ordinal_id
_ihm_starting_model_details.entity_id
_ihm_starting_model_details.entity_description
_ihm_starting_model_details.asym_id
_ihm_starting_model_details.seq_id_begin
_ihm_starting_model_details.seq_id_end
_ihm_starting_model_details.starting_model_source
_ihm_starting_model_details.starting_model_auth_asym_id
_ihm_starting_model_details.starting_model_sequence_offset
_ihm_starting_model_details.starting_model_id
_ihm_starting_model_details.dataset_list_id
1 1 Nup84 A 7 436 'comparative model' A 0 Nup84-m1 5
2 1 Nup84 A 33 424 'comparative model' A 0 Nup84-m1 5
3 1 Nup84 A 429 488 'comparative model' A 0 Nup84-m1 5
4 1 Nup84 A 506 726 'comparative model' A 0 Nup84-m1 5
...

The same comparative model is repeated 4 times, once for each template that was used to build it. This is a confusing organization. Note that the first two lines give sequence ranges that overlap. The lines differ only in the sequence ranges specified. It seems like details of how templates were used to make this comparative model are being shoe-horned into this table. That seems odd given that there is a separate ihm_starting_comparative_models table that lists those templates.

Perhaps the sequence ranges given in the four entries for comparative model Nup84-m1 above should in fact be in the ihm_starting_comparative_models table, and the ihm_starting_model_details table should just have one line for this comparative model.
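A quick way to see the overlap in the four Nup84-m1 rows is to compare the inclusive ranges pairwise. This is a throwaway sketch; `overlapping_pairs` is a hypothetical helper, and the ranges are copied from the table above.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive (seq_id_begin, seq_id_end)


def overlapping_pairs(ranges: List[Range]) -> List[Tuple[Range, Range]]:
    """Return every pair of inclusive residue ranges that share a residue."""
    pairs = []
    for i, (begin_a, end_a) in enumerate(ranges):
        for begin_b, end_b in ranges[i + 1:]:
            # Two inclusive ranges overlap when each starts before the other ends.
            if begin_a <= end_b and begin_b <= end_a:
                pairs.append(((begin_a, end_a), (begin_b, end_b)))
    return pairs


# seq_id_begin / seq_id_end values from the four rows quoted above.
nup84_m1_ranges = [(7, 436), (33, 424), (429, 488), (506, 726)]
```

Run on these ranges, the check reports the first and second rows, and also the first and third (residues 429-436), which fits the reading that the rows describe per-template coverage rather than distinct pieces of one starting model.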

Units for ihm_external_files.file_size_kb?

The dictionary is unclear what the units are for ihm_external_files.file_size_kb. Are "kilobytes" old-style (1024 bytes) or new-style (1000 bytes)? If the former, it would be clearer to say "kibibytes". Either way the multiplier should be stated in the dictionary so it's clear.
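For a sense of how much the choice matters, a small sketch (the function name is hypothetical, and the multiplier is passed explicitly because the dictionary does not state it):

```python
def size_in_bytes(file_size_kb: float, multiplier: int) -> float:
    """Convert a file_size_kb value to bytes under a stated multiplier."""
    if multiplier not in (1000, 1024):
        raise ValueError("multiplier must be 1000 (kB) or 1024 (KiB)")
    return file_size_kb * multiplier
```

A stored value of 1,000,000 means 1,000,000,000 bytes under one reading and 1,024,000,000 under the other, a 2.4% difference, which is enough to make an exact size check against a downloaded file fail.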

Make an IHM example from a Rosetta project

The nup84, mediator and exosome examples are all based on published analysis done with IMP. To see if the format will accommodate hybrid models computed with other software we should make an example using Rosetta, or other non-IMP methods.

Need for a more generic distance information class

The model now allows for cross-link data to be stored. But this is rather specific. In principle any kind of distance should be stored, coming from MS or any other experimental method providing such information (e.g. FRET, DEER, ...)

Allow external file reference for ihm_model_list entries

Want to be able to reference an external file for sphere models, both individual models and ensembles. For the nup84 ChimeraX demo I added a file reference to access the sphere model ensembles (multimodel PDB file) to the ihm_model_list table.

A problem with this is that there are not good formats established for sphere models. So IMP sometimes uses multi-model PDB files for this. We have talked about using DCD or other binary trajectory formats for this purpose. RMF is another format used by IMP for sphere model ensembles. But the actual format of the linked file is not something the IHM dictionary needs to worry about -- it just needs to provide the ability to link to external sphere files in whatever format IMP, Rosetta, or other modeling programs use.

Access individual DOI archive files for better performance

Currently fetching the DOI archive for the exosome from Zenodo takes about 30 minutes (1.3 Gbytes). This means the exosome IHM file is not viewable in ChimeraX for 30 minutes after trying to open it. This is ridiculously slow and unusable given it is just trying to get some small localization density maps from the file. I believe the bulk archive is ensemble models (which are currently not referenced by the IHM file).

If Zenodo allows accessing individual files from the DOI that should be used in the IHM file (ihm_external_files table) to improve performance, so only the data files that are actually being viewed get downloaded.

The current slow performance will inhibit most users of these files. If the files are only available as one Gbyte download this is a poor design and other archiving methods that allow access to individual files should be investigated.

Use two model groups for clusters 1 and 2 of nup84.cif

Would be good to use the model group capability of IHM for cluster 1 and 2 of the nup84 example data. Also name the two groups "Cluster 1" and "Cluster 2". This requires changing nup84.cif table

_ihm_model_list.ordinal_id
_ihm_model_list.model_id
_ihm_model_list.model_group_id
_ihm_model_list.model_group_name
_ihm_model_list.assembly_id
_ihm_model_list.protocol_id
1 1 1 'Sphere models' 1 1
2 2 1 'Sphere models' 1 1

Without this there is no indication in the nup84 file that the two sphere models are associated with two clusters.

Use standard EMDB accession codes in references

Currently the mediator.cif IHM example contains in the ihm_dataset_related_db_reference table a reference to EMDB map 2634. The table field "accession_code" is filled in with EMD-2634, but the actual accession code is 2634 without the leading "EMD-". Brinda Vallat points out that mmCIF files (e.g. 4ux1) in the pdbx_database_related table have a "db_id" field that uses the format "EMD-2759" and the IHM example was following that. This mmCIF field is not called an accession_code. It seems reasonable that a field called "accession_code" should really be the accession code.
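Until the dictionary settles which form belongs in the field, a reader can accept both. A minimal sketch, assuming EMDB references are either a bare number or that number prefixed with "EMD-"; `split_emdb_reference` is a hypothetical helper and does not decide which form is the true accession code.

```python
import re
from typing import Tuple

# Accepts '2634' or 'EMD-2634'; anything else is rejected.
_EMDB_PATTERN = re.compile(r"^(?:EMD-)?(\d+)$")


def split_emdb_reference(text: str) -> Tuple[str, str]:
    """Return (bare number, 'EMD-' prefixed form) for an EMDB reference."""
    match = _EMDB_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not an EMDB reference: {text!r}")
    number = match.group(1)
    return number, f"EMD-{number}"
```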

Chemical Crosslinks

The chemical moiety used in cross linking experiments is currently defined in the _ihm_cross_link_list.type data item, which consists of an enumerated list (EDC, DSS, other). A better way to define this may be to include the chemical moiety in the chemical component dictionary and use the _chem_comp.id in the ihm_cross_link_list table.

Exosome model is missing starting model coordinates for Rrp46_gfp-m2

We refer to a starting model Rrp46_gfp-m2 in the .cif file for the Exosome example but don't have any coordinates for it in ihm_starting_model_coord. This is probably because the residue numbers are offset between the starting model and the output IMP model, and the IMP code is confused somehow. Will investigate.

Added ihm_model_list table model_name to nup84, mediator, exosome examples

Add the "model_name" field to the ihm_model_list tables for the example data sets. This name should be something like "Cluster 1 best score" or something descriptive of how this one representative sphere model was chosen.

How is distance_threshold interpreted in ihm_cross_link_restraint table?

David Castillo notes that the fly genome example final_2L_60_161.ihm uses 3 types of crosslink restraints with different interpretations of the distance_threshold field in the ihm_cross_link_restraint table:

"These are the restraints we use for our modelling

LowerBoundHarmonic
IMP.core.HarmonicLowerBound where the restraint is satisfied if the distance between the particles is above a distance threshold.

UpperBoundHarmonic
IMP.core.HarmonicUpperBound where the restraint is satisfied if the distance between the particles is below a distance threshold.

Harmonic
IMP.core.Harmonic where the restraint is satisfied if the distance between the particles is equal to the equilibrium distance "

In order to color violated restraints differently from satisfied restraints in ChimeraX, these different interpretations of the distance_threshold would need to be encoded in the IHM file.
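The three interpretations could be encoded roughly as below. This is a sketch only: the restraint type labels are hypothetical strings rather than dictionary enumeration values, the bounds are treated as inclusive, and the `tolerance` argument for the harmonic case is an assumption, since an exact match to the equilibrium distance is unlikely in practice.

```python
def is_satisfied(restraint_type: str, distance: float, threshold: float,
                 tolerance: float = 0.0) -> bool:
    """Report whether a measured distance satisfies one restraint."""
    if restraint_type == "lower bound":
        # Satisfied when the particles are at least `threshold` apart.
        return distance >= threshold
    if restraint_type == "upper bound":
        # Satisfied when the particles are at most `threshold` apart.
        return distance <= threshold
    if restraint_type == "harmonic":
        # Satisfied when the distance is within `tolerance` of the threshold.
        return abs(distance - threshold) <= tolerance
    raise ValueError(f"unknown restraint type: {restraint_type!r}")
```

The same distance and threshold give different answers depending on the type, so a viewer cannot color violations correctly from distance_threshold alone.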

Add comparative model alignment files to nup84 example

Would be useful to reference the comparative model alignment files in the nup84 example using the new ihm_starting_model_alignment_files table.

Add ihm_localization_density_files to nup84 example

Would be good to include references to localization map MRC files in the nup84 example using the new ihm_localization_density_files table. I think the MRC files in the DOI archive may not align with the sphere models. I have made new MRC localization files from the ensemble models using Chimera. We could include those in a new DOI archive in order to make nup84 a better example of the IHM format.

Make a multistate example

The current nup84, mediator and exosome IHM data examples are not multistate, so the multistate features of the format have not been tested.

Ben and Brinda and I discussed making a multistate example. Ben said the exosome project was 2 states but done as 2 separate IMP jobs for the two states, so it is not an ideal example. Another multistate system Ben mentioned was published years ago and does not use current IMP protocols, which would make it hard to put into IHM format. So we didn't come up with any candidate example systems.

nup84 2D EM map in PGM format can't be read by Python Image Library

The 2D EM map in the nup84.cif example is in ASCII PGM format which is not supported by the Python Image Library (Pillow 4.0.0, latest available). Binary PGM is supported. This is such an obscure format it is not worth adding support in ChimeraX to read it. Better to convert to a commonly used format like MRC or CCP4 or PNG or TIFF to make the example represent best practice.
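If the map has to stay in PGM for now, re-encoding it from ASCII (P2) to binary (P5) is a small job with the standard library. The sketch below assumes a well-formed P2 file and follows the Netpbm layout (text header, then one byte per sample, or two big-endian bytes when maxval is 256 or more); the function names are hypothetical.

```python
from typing import List, Tuple


def parse_ascii_pgm(text: str) -> Tuple[int, int, int, List[int]]:
    """Parse a plain (P2) PGM image into width, height, maxval, and pixels."""
    tokens: List[str] = []
    for line in text.splitlines():
        # Anything after '#' on a line is a comment in the plain format.
        tokens.extend(line.split("#", 1)[0].split())
    if not tokens or tokens[0] != "P2":
        raise ValueError("not an ASCII (P2) PGM image")
    width, height, maxval = (int(token) for token in tokens[1:4])
    pixels = [int(token) for token in tokens[4:]]
    if len(pixels) != width * height:
        raise ValueError("pixel count does not match the header")
    return width, height, maxval, pixels


def to_binary_pgm(text: str) -> bytes:
    """Re-encode a P2 image as binary (P5) PGM bytes."""
    width, height, maxval, pixels = parse_ascii_pgm(text)
    header = f"P5\n{width} {height}\n{maxval}\n".encode("ascii")
    if maxval < 256:
        body = bytes(pixels)
    else:
        # 16-bit samples are stored most significant byte first.
        body = b"".join(value.to_bytes(2, "big") for value in pixels)
    return header + body
```

This only changes the encoding. It does not add the pixel size or alignment information that the file is also missing, so converting to a format such as MRC remains the better fix.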

Need permanent identifiers/accessions for PDB-dev entries

The dictionary is now at the point where we can usefully encode entire structures. We and other labs would like to start depositing structures in PDB-dev in future publications. In order for this to work effectively, we'll need to be able to uniquely refer to a deposit - so we'll need some sort of permanent identifier or accession code.

Use full file paths within DOI archives

Currently the ihm_dataset_other fields "doi" and "content_filename" are used to refer to a DOI of a zip archive and a file within that archive. In our nup84, mediator and exosome example data the currently supplied content_filename is not actually the full path within the zip archive -- it excludes the top level directory listed in the archive. In the nup84 example data the "content_filename" is given, for example, as "data/ScNup84_7-488_506-726_new.pdb", but the actual path in the archive is "integrativemodeling-nup84-a69f895/data/ScNup84_7-488_506-726_new.pdb". Currently the top level directory is not being included because it is technically difficult to determine in an automated fashion according to Ben Webb. That top level directory name includes the GitHub repository version string and is automatically included in an upload to Zenodo. These details are specific to current Sali lab practices and should not alter the fact that "content_filename" should always be a full path to a file in the referenced archive.
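Until the files carry full paths, a reader has to work around the missing top-level directory. A rough sketch of such a workaround, assuming exactly one top-level directory may be missing from content_filename; `resolve_in_archive` is a hypothetical helper, and the member names would come from something like `zipfile.ZipFile(...).namelist()`.

```python
from typing import Iterable, List


def resolve_in_archive(content_filename: str, member_names: Iterable[str]) -> str:
    """Find the archive member for a path that may omit the top-level directory."""
    matches: List[str] = []
    for name in member_names:
        if name == content_filename:
            return name  # already a full path within the archive
        # Drop the first path component and compare the remainder.
        _top_level, _, rest = name.partition("/")
        if rest == content_filename:
            matches.append(name)
    if len(matches) != 1:
        raise LookupError(
            f"{content_filename!r} matched {len(matches)} archive members")
    return matches[0]
```

Every reader having to carry a guess like this is a good argument for storing the full path in the file instead.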

SSL certificate for pdb-dev website has expired

Looks like the SSL certificate for https://pdb-dev.rcsb.rutgers.edu/ expired on Feb 7th 2017. So anybody trying to visit the page now will get an 'insecure connection' error (I can't see it at all in Firefox without adding an exception). Obviously this is going to put a lot of people off using the website.

Align ensemble models to match localization density for nup84

The current nup84 example ensemble models seem to be positioned with random orientations. This is not good for visualization. Our examples should encourage others to follow best practices, and I think in this case the ensemble models should be aligned in whatever way they were aligned when computing the localization density maps. There are at least 2 different possible reasonable alignments: either aligning all beads (with equal weighting per bead) to minimize RMSD, or aligning only chains A, E, G, which were held fixed in the ensemble calculation. I believe we are using the localization density maps I computed using the first alignment choice. I can probably write the new DCD files if this is deemed reasonable. ChimeraX has options to do the alignments on the fly, but it is a real nuisance to use those options.

Representing a hierarchy of structural components in 3D genome structures

We need a method for representing a hierarchy of structural components, especially in 3D genome structures. The ihm_struct_assembly category provides a way of achieving this. However, the hierarchy is implicit and not explicitly defined. More information regarding the provenance of the hierarchy is required to create a generic data representation. The em_entity_assembly category in the PDBx/mmCIF dictionary is a good example with a self-linking parent.

Add model_name field to ihm_model_list table

Currently there is a model id for sphere models given in the ihm_model_list table but no text description explaining what the model represents. Would be good to add a "model_name" field to this table. For nup84.cif names for the two sphere models could be

"Cluster 1 best score"
"Cluster 2 best score"

I think the two sphere models in that example represent the best scoring models for each of the two clusters, but there is currently no text in the file that indicates that.

Support localization densities for output clusters

As @shruthivis suggested, it would make sense to include localization density maps for output model clusters (e.g. Fig 6A in the Nup84 paper), i.e. the mmCIF file should contain one or more cluster representatives but also localization density maps to show the variability in position of each subunit.

We propose to do this in two ways (could use none, one or both of these in a deposited model).

1. Deposit MRC files in GitHub and link to them from the mmCIF file via DOI (we'll do this for Nup84, as per integrativemodeling/nup84#2).
2. Store the densities in the mmCIF file itself as a GMM (set of Gaussians; ihm_gaussian_obj_site in the mmCIF file).

@cgreenberg, for #2, we need to store the XYZ coordinates of the center of the Gaussian, a 3x3 covariance matrix, and a weight, correct? The weight is necessary? I notice in your GMM output files you call the center "mean" instead - is this "more correct"?
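On the weight question for the GMM option above, it may help to write out how a density would be evaluated from the stored values. The sketch below is the textbook Gaussian mixture formula in plain Python, not the IMP implementation; the type aliases and function names are hypothetical. Each component contributes its weight times a normalized 3D Gaussian with the given mean (center) and 3x3 covariance.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]
Component = Tuple[Vector, Matrix, float]  # (mean, 3x3 covariance, weight)


def _cofactor(m: Matrix, i: int, j: int) -> float:
    """Signed cofactor of a 3x3 matrix, using cyclic index arithmetic."""
    a, b = (i + 1) % 3, (i + 2) % 3
    c, d = (j + 1) % 3, (j + 2) % 3
    return m[a][c] * m[b][d] - m[a][d] * m[b][c]


def gaussian_density(point: Vector, mean: Vector, covariance: Matrix) -> float:
    """Evaluate a normalized 3D Gaussian at one point."""
    det = sum(covariance[0][j] * _cofactor(covariance, 0, j) for j in range(3))
    if det <= 0.0:
        # Cheap sanity check only; it does not prove positive definiteness.
        raise ValueError("covariance must have a positive determinant")
    # inverse[i][j] = cofactor(j, i) / det
    inverse = [[_cofactor(covariance, j, i) / det for j in range(3)]
               for i in range(3)]
    delta = [p - m for p, m in zip(point, mean)]
    quad = sum(delta[i] * inverse[i][j] * delta[j]
               for i in range(3) for j in range(3))
    return math.exp(-0.5 * quad) / math.sqrt((2.0 * math.pi) ** 3 * det)


def mixture_density(point: Vector, components: List[Component]) -> float:
    """Sum the weighted densities of all Gaussian components."""
    return sum(weight * gaussian_density(point, mean, covariance)
               for mean, covariance, weight in components)
```

In this formulation the weights set how much each Gaussian contributes to the total, so dropping them would only be harmless if every component were meant to contribute equally. The center of each Gaussian enters the formula as its mean, which is probably why the GMM output uses that name.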

What is recommended usage for model groups?

In David Castillo's fly genome model final_2L_60_161.ihm he uses group id 1 for "Centroid cluster 1" consisting of one model and group id 2 for "Cluster 1" which contains 505 models of an ensemble, all specified in the sphere_obj_site table. It seems reasonable to use 2 groups for these two conceptually distinct collections of models. In the nup84, exosome and mediator examples a single group id is used for "Best scoring model cluster 1" with a single sphere model in the ihm file, and the Cluster 1 ensemble, which is a reference to an external DCD file. This provides two levels of grouping. The best scoring model is grouped with the ensemble, and the ensemble itself is a different kind of grouping. But those two levels of grouping cannot be done in the fly genome example because the ensemble models are not in an external file. The fly genome file currently does not have an ihm_ensemble_info table. If one were added it would have to specify a single group id, and if the group id included both the centroid and the 505 ensemble models then the centroid would be considered one of the ensemble models, which seems wrong. So a single group id for centroid and ensemble does not seem to work in the fly genome case but is used in the nup84, exosome, mediator cases. It seems the fly genome example should contain an ihm_ensemble_info table and two separate groups should be used (as in the current file) so that the ihm_ensemble_info can properly refer only to the ensemble members.

The nup84, exosome and mediator cases may be different in that the "Best scoring model" I think actually is one of the ensemble members. Those examples currently define the ensemble using a group that contains both the best scoring model and the ensemble of models. Technically this would appear to make the best scoring model appear two times in the ihm_ensemble, once from the sphere_obj_site table and once from the external DCD file for the ensemble models. This does not seem ideal.