Comments (34)
Hi Yan,
To generate distograms with the contacts_to_distograms.py script you need two inputs.
- a CSV with your contacts where each row has the following format: ResiduePosition1 ResiduePosition2 FDR
ResiduePositions are 1-indexed, i.e., starting from 1.
FDR corresponds to the uncertainty for this particular contact, essentially the level of noise you expect. It's a number between 0 and 1. E.g., 0.05 would mean there is a 5% chance this contact is noisy/ wrong. In principle, you can use an FDR of 0 but this would be more akin to forcing constraints.
Example:
10 15 0.05
30 50 0.2 - the cutoff for your contact information in Angstrom (CA-CA).
The script currently only allows one cutoff. If you want to mix multiple cutoffs (which is possible), a quick hack would be to run this script with different cutoffs and concatenate the output files.
The output is another CSV with the distograms per contact pair that you can use for prediction.
Run like this:
python contacts_to_distograms.py --csv my_contact_file --cutoff 10 --output my_distogram_output
Let me know how it's working out with your data, I'm curious!
Kolja
from alphalink.
Sorry! It's fixed in the new version.
from alphalink.
(there should be no header in the file, just the comma-separated list of restraints with the lines like my comment above)
from alphalink.
it works now without headers. thank you so much!
from alphalink.
Hi Yan,
--distograms is only a flag, it doesn't take an argument.
restraints.csv then corresponds to your restraints_distribution.csv.
7K3N_A.fasta -- you need to have some sort of FASTA input file with your sequence.
pdb_mmcif/mmcif_files -- yes, follow the OpenFold instructions, also for setting up the databases like uniref90, mgnify, and pdb70.
Kolja
from alphalink.
Hello, This thread helped me to understand the scripts. Is there a sample input for testing:
- preprocessing_distributions.py - e.g. the csv described here #9 (comment)
- along with sample input for these predict_with_crosslinks.py
Currently I am working out some issues to build the dockerfile (which is failing on my linux instance, but I should be able to work through). Afterwards, I would like to test with existing input. It appears there is testing data (test_set), but this may be already processed to some degree?
Thanks!
from alphalink.
I pushed a fix. Thanks for reporting the issue!
from alphalink.
Hi Kolja,
I just tried the contacts_to_distograms.py script. it showed the following ERROR:
22 def get_uniform(cutoff, fdr):
23 d = np.ones(128)
---> 24 maximum_bin = np.argmax(distogram_bins > cutoff)
25 d[:maximum_bin] /= np.sum(d[:maximum_bin])
26 d[:maximum_bin] *= 1 - fdr
TypeError: '>' not supported between instances of 'numpy.ndarray' and 'str'
do you know how to fix it?
Thank you.
Yan
from alphalink.
Hi Kolja,
the new script works now.
Thank you so much!
Yan
from alphalink.
Hi Kolja,
i have a new problem on the 'preprocessing_distributions.py' script. what is the format of the infile "restraints.csv"? Could you please give an example for the csv file?
Thank you so much!
Yan
from alphalink.
specifically, it gave the errror:
cannot parse restraint type in line
('From', 'To', 'mu', 'sigma', 'type')
Thank you.
Yan
from alphalink.
hello,
if you run
preprocessing_distributions.py --help
it should give you the format of restraints.csv (I'll fix the way it's displayed and include an example and clarification in the readme)
The file should be made up of lines like (for example)
125,51,10.0,5.0,normal
16,10,10.0,5.0,log-normal
These lines specify
a restraint from resi 125 to resi 51 with a mean distance of 10 Angstrom, standard deviation of 5 and a normal distribution
a restraint from resi 16 to resi 10 with a mean distance of 10, standard deviation of 5 and a log-normal distribution
from alphalink.
yes, i did run the preprocessing_distributions.py --help it showed
"residueFrom, residueTo, mean_dist, stdev, distribution_type"
and i changed them to ('From', 'To', 'mu', 'sigma', 'type') in your original script, but it still do not work and showed cannot parse restraint type in line
('From', 'To', 'mu', 'sigma', 'type')
from alphalink.
could you attach your restraint.csv file here?
from alphalink.
Hi,
Again, now I have questions on the predict_with_crosslinks.py script. In the READ.ME protocol, command shows
'python predict_with_crosslinks.py --distograms --checkpoint_path resources/AlphaLink_params/finetuning_model_5_ptm_CACA_10A.pt 7K3N_A.fasta restraints.csv uniref90.fasta mgy_clusters.fa pdb70/pdb70 pdb_mmcif/mmcif_files uniclust30_2018_08/uniclust30_2018_08'
i dissected into the following way:
python predict_with_crosslinks.py
--distograms restrains_distribution.csv (#csv file generated by preprocessing_distributions.py script)
--checkpoint_path resources/AlphaLink_params/finetuning_model_5_ptm_CACA_10A.pt
? 7K3N_A.fasta (#i guess it's the input file? then what script we should run?)
? restraints.csv (#why there is restraints.csv file here? what py script we should run?)
--uniref90_database_path uniref90.fasta
--mgnify_database_path mgy_clusters.fa
--pdb70_database_path pdb70/pdb70
? pdb_mmcif/mmcif_files (#what script we should run? like openfold command format? )
--uniclust30_database_path uniclust30_2018_08/uniclust30_2018_08
i left questions on each line of the commands.
Thank you so much!
Yan
from alphalink.
Thank you so much Kolja.
from alphalink.
Hi, again the error as following:
"RuntimeError: Error(s) in loading state_dict for AlphaFold:
size mismatch for xl_embedder.linear.weight: copying a param with shape torch.Size([128, 1]) from checkpoint, the shape in current model is torch.Size([128, 128])"
How to easily figure it out?
Thanks
Yan
from alphalink.
Hi,
never mind about my last question. i made it work with distogram.pt.
but it showed
' File "/opt/AlphaLink/openfold/np/relax/amber_minimize.py", line 521, in run_pipeline
max_attempts=max_attempts,
File "/opt/AlphaLink/openfold/np/relax/amber_minimize.py", line 459, in _run_one_iteration
raise ValueError(f"Minimization failed after {max_attempts} attempts.")
ValueError: Minimization failed after 100 attempts."
and the prediction gave only one structure model. does it make sense? how to figure it out?
Cheers,
Yan
from alphalink.
In the end, you get two models, one relaxed and one unrelaxed. It fails in your case in the relax stage. I haven't run into this issue before. It's probably due to a lot of clashes. You should have a look at the unrelaxed model to see if it makes sense.
from alphalink.
Thank you so much Kolja.
Is there a simple way to run multiple different sequences and restraints data at the same time?
Thanks,
Yan
from alphalink.
No, not in parallel. Sequentially, the easiest would be to wrap everything in a for-loop in bash.
from alphalink.
Okay. thanks.
from alphalink.
Hello,
Just to clarify: AlphaLink is run with a protein sequence (.fasta format) and a set of distance restraints (usually crosslinking MS data). The restraints can be represented in 2 ways:
-the default representation, which is a space-separated file with ResidueFrom ResidueTo FDR as shown in the ReadMe. In this case, predict_with_crosslinks.py is run with no additional flags. The same information can be given as a pyTorch dictionary with Numpy arrays.
-as distance distributions. In this case, you may either provide the programs with your own distance distributions, or generate some for your restraints using preprocessing_distributions.py . In this case, you will then use the output of this script as the input for predict_with_crosslinks.py, which you run with the --distogram flag.
In addition to these 2 files, you may or may not want to use openfold to generate the msa features. For example, you can run the msa stage of alphafold2 and then take the msa directories coming out of that by pointing to that directory with the --use_precomputed_alignments flag.
In the test directory, you will find the CDK case: there you have pyTorch dictionaries, sequences and precomputed msa files to run predict_with_crosslinks.py. If you want, we can also add the space separated restraint file instead (it looks very much like the one in the ReadMe).
Hope this helps!
from alphalink.
Thanks it helps. It would be helpful to have the space separated restraint file for the CDK example (this would be the data informed by crosslinking MS data, correct?). And if I understand, then I do not need the preprocessing_distributions and may follow the 'default representation'? Having those data in text form rather than pyTorch dictionary would be helpful as I am not (yet?) very familiar with the representation and our own data would be closer to the text form.
from alphalink.
I uploaded the corresponding CSV files. Note that CDK was a theoretical experiment (proof-of-concept), the links correspond to simulated data. The real data for the membrane set is also in the git (still in PyTorch format though).
Yes, for the CDK example you can use the CSV directly as an input to the photo-AA network.
For your data, you could do the same, if your data is close to 10A. If you need a different cutoff, you would need to use the distogram network. Here you could use the contacts_to_distograms.py to preprocess your CSV data (same format as the input to the network ("default representation")).
from alphalink.
Hi,
Could you share some detailed commands to play with -neff flag? should we run it with predict_with_crosslinks.py script?
predict_with_crosslinks.py --distograms --checkpoint_path resources/AlphaLink_params/finetuning_model_5_ptm_CACA_10A.pt 7K3N_A.fasta restraints.csv uniref90.fasta mgy_clusters.fa pdb70/pdb70 pdb_mmcif/mmcif_files uniclust30_2018_08/uniclust30_2018_08 --neff
Thanks.
Yan
from alphalink.
Neff is a number - the number of effective sequences in the MSA, as described in the AlphaLink paper, the original AlphaFold2 paper (Fig.5) and previous publicaitons. It acts at the MSA level by subsampling MSA to a given number of effective sequences.
The fewer effective sequences, the weaker the MSA evidence will be. Thus, for Neff=10, in your command:
predict_with_crosslinks.py --distograms --checkpoint_path resources/AlphaLink_params/finetuning_model_5_ptm_CACA_10A.pt 7K3N_A.fasta restraints.csv uniref90.fasta mgy_clusters.fa pdb70/pdb70 pdb_mmcif/mmcif_files uniclust30_2018_08/uniclust30_2018_08 --neff 10
from alphalink.
Hi,
i tried the command and gave the error;
File "/opt/AlphaLink/predict_with_crosslinks.py", line 569, in
main(args)
File "/opt/AlphaLink/predict_with_crosslinks.py", line 421, in main
feature_dict['deletion_matrix'] = feature_dict['deletion_matrix'][indices]
KeyError: 'deletion_matrix'
Any solution/
Thanks,
from alphalink.
Hi,
is there positional argument 'pdb_mmcif/mmcif_files' in predict_with_crosslinks.py script? it seems i can not add template_mmcif_dir.
Thanks,
Yan
from alphalink.
AlphaLink was trained on model_5 and doesn't support templates. I updated the README with the db flags since they were missing.
from alphalink.
Hi,
I am quite intrigued by the approach you used for processing input data when training on distograms. Specifically, for static structures, the distance for a selected pair is a fixed value. Could you kindly explain how you transform this value into a distribution for training purposes? Or perhaps, I might have misunderstood the model's methodology. I would greatly appreciate your insights on this matter.
Thank you for your time!
Lin
from alphalink.
Hi Lin,
In different ways. The simplest is just a "contact" distogram, i.e., an uniform distribution. Let's say we want to simulate a true restraint with an upper bound of 10 A. We choose any restraint that is < 10 A in the crystal structure. The distogram then has uniform probability over the bins up to 10 A. We sometimes replaced the uniform distribution with a subsampled "real" distance distribution over X restraints with < 10 A or a distance distribution based on simulated crosslinking data.
from alphalink.
Hi Lin, In different ways. The simplest is just a "contact" distogram, i.e., an uniform distribution. Let's say we want to simulate a true restraint with an upper bound of 10 A. We choose any restraint that is < 10 A in the crystal structure. The distogram then has uniform probability over the bins up to 10 A. We sometimes replaced the uniform distribution with a subsampled "real" distance distribution over X restraints with < 10 A or a distance distribution based on simulated crosslinking data.
Hi!
Thank you for your reply. I understand the operation for the "contact" distribution map, but I noticed that you have trained two sets of parameters separately. The first set, "finetuning_model_5_ptm_CACA_10A.pt", should be the one you mentioned that was run with a uniform distribution of 10A. I am curious about how the "finetuning_model_5_ptm_distogram.pt" set was trained. It seems to involve more complex procedures.
Thank you for your time!
Lin
from alphalink.
"finetuning_model_5_ptm_CACA_10A.pt" doesn't use a distogram as the internal representation of the crosslinking data. Here, the input data was simply a contact map.
from alphalink.
Related Issues (18)
- Over-weight of crosslinking data HOT 4
- Wrong array shape when using single crosslink record HOT 1
- problem with model loading HOT 12
- CUDA stub file missing for cpu-only build HOT 1
- No MSA output - precomputed alignments called automatically HOT 3
- Request for an example folder HOT 1
- Request for training HOT 5
- Problem with Crosslinking data input HOT 12
- Inter-subunit crosslinking data HOT 2
- Run "python preprocessing_distributions.py --infile restraints.csv" but get an error HOT 4
- Running "predict_with_crosslinks.py" with "restraints.csv --distograms" gives the error HOT 6
- Alphalink install failed HOT 3
- Installation unclear HOT 1
- crosslinks need to be provided as comma separated, not space HOT 1
- FDR Description in Arg parse is potentially wrong HOT 3
- Issue with multimer HOT 3
- Save pkl files HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from alphalink.