Comments (12)

lhatsk avatar lhatsk commented on July 29, 2024

The xl array (as well as a contact map or the pair representation) is symmetric, which is why you have both (i, j) and (j, i). The grouping array is an artefact; it's no longer required in the distogram network. Here, we just assign every crosslink to its own group, indicated by an integer.
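For illustration, here is a minimal sketch of that symmetry and the per-link grouping (array shapes, names, and values are assumptions for illustration, not the exact layout of the .pt input):

    import numpy as np

    L = 100                            # sequence length (hypothetical)
    crosslinks = [(8, 40), (15, 72)]   # 0-based residue pairs (hypothetical)

    xl = np.zeros((L, L), dtype=np.float32)
    grouping = np.zeros((L, L), dtype=np.int64)

    for group_id, (i, j) in enumerate(crosslinks, start=1):
        xl[i, j] = xl[j, i] = 1.0                    # symmetric: both (i, j) and (j, i) are set
        grouping[i, j] = grouping[j, i] = group_id   # every crosslink gets its own integer group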

To reproduce the results, you need to disable all sources of non-determinism, for example, the MSA masking.

Li-dacheng avatar Li-dacheng commented on July 29, 2024

Thank you for your response. Based on your explanation, am I correct in understanding that grouping_array doesn't serve any purpose in the model?
Shouldn't using the crosslink data in both PT and CSV formats yield the same results?
Also, the CDK_neff10.pkl file contains MSA features, but since the inference process didn't utilize MSA, it shouldn't affect my reproduction efforts, correct?

Li-dacheng avatar Li-dacheng commented on July 29, 2024

I noticed you mentioned the example T1064 in another issue (issues/13). I ran the data as per your instructions, but the resulting pLDDT score doesn't match the 82.371 displayed there. Additionally, it differs significantly from the TM-score in the attached model.cif file. Could you please help me identify the issue?
Below are my input commands and output files.
python predict_with_crosslinks.py \
    T1064.fasta \
    T1064_8_LEU_10A_CA.pt \
    --features T1064.pkl \
    --checkpoint_path /AlphaLink/resources/AlphaLink_params/finetuning_model_5_ptm_CACA_10A.pt \
    --output_dir $output_dir
link

lhatsk avatar lhatsk commented on July 29, 2024

Based on your explanation, am I correct in understanding that grouping_array doesn't serve any purpose in the model? Shouldn't using the crosslink data in both PT and CSV formats yield the same results?

It doesn't serve any purpose, but unfortunately it can still affect the results because it injects randomness.

Also, the CDK_neff10.pkl file contains MSA features, but since the inference process didn't utilize MSA, it shouldn't affect my reproduction efforts, correct?

What do you mean, it didn't utilize the MSA? For this example, there will not be any random subsampling of the MSAs, since the MSA size is below the threshold, but by default, there is always MSA masking. This also applies to T1064: you'd need to remove any source of randomness, including MSA masking. We removed any non-determinism to make the results comparable to AlphaFold.

Li-dacheng avatar Li-dacheng commented on July 29, 2024

What do you mean, it didn't utilize the MSA?

I noticed in predict_with_crosslinks.py that if a PKL file is provided, the MSA search won't be performed since the PKL file already contains the MSA information, is that correct?

This also applies to T1064: you'd need to remove any source of randomness, including MSA masking.

For this example, there will not be any random subsampling of the MSAs, since the MSA size is below the threshold, but by default, there is always MSA masking.

So, when you refer to the random subsampling of the MSAs, what does that mean? Do I need to input the neff parameter? How do I remove MSA masking? Can you give an example?

We removed any non-determinism to make the results comparable to AlphaFold.

By the way, you trained on model_5_ptm. When comparing with AlphaFold, did you use the results from model_5? Which checkpoint did you use, the one from AlphaFold or OpenFold?

Thank you very much for your patient responses. Looking forward to your reply.

lhatsk avatar lhatsk commented on July 29, 2024

What do you mean, it didn't utilize the MSA?

I noticed in predict_with_crosslinks.py that if a PKL file is provided, the MSA search won't be performed since the PKL file already contains the MSA information, is that correct?

Yes, no MSA search will be performed if you supply a pickle file. The pickle already contains all the features, including the MSA. This way the MSA stays fixed (at a given Neff), which ensures comparability with AlphaFold, since we used exactly the same input features. The only difference is the crosslinks (+ additional training).
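If it helps, a quick way to inspect such a pickle (the key names follow the usual AlphaFold/OpenFold feature dictionaries and are assumptions here, not a guarantee of the file's contents):

    import pickle

    with open("CDK_neff10.pkl", "rb") as fh:
        features = pickle.load(fh)

    print(sorted(features.keys()))                 # e.g. 'aatype', 'msa', 'deletion_matrix_int', ...
    print("MSA depth:", features["msa"].shape[0])  # number of sequences in the fixed MSA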

So, when you refer to the random subsampling of the MSAs, what does that mean? Do I need to input the neff parameter?

To limit memory consumption, AlphaFold limits the size of the input MSAs. How many MSA sequences are used is defined in the model configuration; see https://github.com/lhatsk/AlphaLink/blob/main/openfold/config_crosslinks.py#L197.

If the MSA is bigger than max_msa_clusters, it is subsampled to max_msa_clusters sequences and the rest is aggregated in the extra MSA stack.
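A simplified sketch of that size cap (the real pipeline selects cluster centres and aggregates the remainder; the constants here are placeholders, the values in config_crosslinks.py are authoritative):

    import numpy as np

    MAX_MSA_CLUSTERS = 128   # placeholder
    MAX_EXTRA_MSA = 1024     # placeholder

    def split_msa(msa, rng=None):
        """msa: [num_seqs, seq_len] array; row 0 is the query sequence."""
        if rng is None:
            rng = np.random.default_rng()
        n = msa.shape[0]
        if n <= MAX_MSA_CLUSTERS:
            return msa, msa[:0]                       # small MSA: no subsampling needed
        order = rng.permutation(np.arange(1, n))      # never drop the query
        keep = np.concatenate(([0], order[:MAX_MSA_CLUSTERS - 1]))
        rest = order[MAX_MSA_CLUSTERS - 1:][:MAX_EXTRA_MSA]
        return msa[keep], msa[rest]                   # main MSA stack, extra MSA stack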

How do I remove MSA masking? Can you give an example?

https://github.com/lhatsk/AlphaLink/blob/main/openfold/config_crosslinks.py#L196

Set this to 0.0.
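In the file, that amounts to an edit like the following (a sketch of the change; the exact nesting around the linked line may differ in your checkout):

    # openfold/config_crosslinks.py, prediction settings near the linked line
    "masked_msa_replace_fraction": 0.0,   # default masks ~15% of MSA entries; 0.0 disables it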

By the way, you trained on model_5_ptm. When comparing with AlphaFold, did you use the results from model_5? Which checkpoint did you use, the one from AlphaFold or OpenFold?

We used the 2.0 AlphaFold weights for model_5_ptm, both as the starting point for fine-tuning and for the AlphaFold predictions. The predictions were made in OpenFold with the AlphaFold weights, which produces the same (or reasonably close) results as AlphaFold.

Li-dacheng avatar Li-dacheng commented on July 29, 2024

Yes, no MSA search will be performed if you supply a pickle file. The pickle already contains all the features, including the MSA. This way the MSA stays fixed (at a given Neff), which ensures comparability with AlphaFold, since we used exactly the same input features. The only difference is the crosslinks (+ additional training).

Sorry, OpenFold cannot accept a feature file as input, right? So how do you ensure that you are using exactly the same input? When creating feature files, you mentioned using different 'neff' values. How is this variable controlled when comparing with AlphaFold2?

How do I remove MSA masking? Can you give an example?
https://github.com/lhatsk/AlphaLink/blob/main/openfold/config_crosslinks.py#L196
Set this to 0.0.

Thank you very much for your prompt reply. After setting this config to 0.0, the TM-score from the AlphaLink inference increased from 0.365 to 0.8675. Could you please explain why this has such a significant impact?
Do we always need to set masked_msa_replace_fraction to 0.0 when using AlphaLink?
And, when comparing with AlphaFold2, do I also need to set masked_msa_replace_fraction in the OpenFold config to 0.0?

lhatsk avatar lhatsk commented on July 29, 2024

Sorry, OpenFold cannot accept a feature file as input, right? So how do you ensure that you are using exactly the same input?

No, not by default, but it's easy to change. I just removed crosslinks from AlphaLink and used the original AlphaFold weights with the same inputs.

When creating feature files, you mentioned using different 'neff' values. How is this variable controlled when comparing with AlphaFold2?

By using the same features, which include the MSA with a fixed Neff.

Thank you very much for your prompt reply. After setting this config to 0.0, the TM-score from the AlphaLink inference increased from 0.365 to 0.8675. Could you please explain why this has such a significant impact?

The MSA masking affects the Neff. It randomly removes 15% of the information in the MSA. The effect is obviously much stronger for MSAs that contain little information to begin with (low Neff). Depending on what is masked and how well the network is able to reconstruct it, you may end up with a lower or higher Neff than before. For example, the masking might remove parts that would help with noise rejection, or information that is highly complementary to the crosslinks, resulting in worse results and more variance. Here, the masking was just unlucky; it could also have helped.
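A rough illustration of the scale of that perturbation (BERT-style corruption of the MSA input; the real implementation also re-samples replacements from a profile, this toy example only shows how much gets touched):

    import numpy as np

    rng = np.random.default_rng(0)
    msa = rng.integers(0, 21, size=(64, 200))   # toy MSA: 64 sequences x 200 columns
    mask = rng.random(msa.shape) < 0.15         # ~15% of entries picked at random
    corrupted = msa.copy()
    corrupted[mask] = 21                        # stand-in for the mask/replacement token
    print(f"fraction perturbed: {mask.mean():.3f}")

With a shallow (low-Neff) MSA, those masked entries can be a large share of the usable signal, which is why the effect on the prediction can be this big.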

Do we always need to set masked_msa_replace_fraction to 0.0 when using AlphaLink?

No, I would keep it on for normal usage.

And, when comparing with AlphaFold2, do I also need to set masked_msa_replace_fraction in the OpenFold config to 0.0?

Yes, you should set it to 0.0 to keep the comparison fair for both methods.

Li-dacheng avatar Li-dacheng commented on July 29, 2024

By using the same features, which include the MSA with a fixed Neff.

Thank you. I would like to know how the number of effective sequences (Neff) is defined. Did you set the parameter neff=10 when running AlphaLink and AlphaFold2 on the dataset? Is this done to reflect the impact of crosslink data?

I want to ask this question because when I ran MSA with neff=10 on the example 6LKI_B (ma-rap-alink-0001), the results differ from using feature inputs (skipping MSA). The TM scores with the ground truth are 0.8087 and 0.9012, respectively.

lhatsk avatar lhatsk commented on July 29, 2024

Thank you. I would like to know how the number of effective sequences (Neff) is defined. Did you set the parameter neff=10 when running AlphaLink and AlphaFold2 on the dataset? Is this done to reflect the impact of crosslink data?

The Neff is defined in the "MSA subsampling" section of the paper. We subsampled the MSAs to a given Neff to simulate challenging targets and show the impact of crosslinking MS data.
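For reference, one widely used convention weights every sequence by the inverse of the number of sequences within a given identity cutoff (a hedged sketch; the paper's "MSA subsampling" section and get_eff in openfold/data/msa_subsampling.py are authoritative, and the 80% cutoff here is an assumption):

    import numpy as np

    def neff(msa, identity_cutoff=0.8):
        """msa: [num_seqs, seq_len] integer array of aligned residues."""
        ident = (msa[:, None, :] == msa[None, :, :]).mean(axis=-1)  # pairwise sequence identity
        cluster_size = (ident >= identity_cutoff).sum(axis=-1)      # neighbours within the cutoff
        return float((1.0 / cluster_size).sum())                    # down-weights redundant sequences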

I want to ask this question because when I ran MSA with neff=10 on the example 6LKI_B (ma-rap-alink-0001), the results differ from using feature inputs (skipping MSA). The TM scores with the ground truth are 0.8087 and 0.9012, respectively.

6LKI is part of the low-Neff CAMEO targets; they are already challenging with low Neffs (at most 25; for 6LKI it's 15), so we didn't do any MSA subsampling. Your subsampling will further reduce the Neff and make the target harder, which likely results in a lower TM-score.

linjyshanghaitech avatar linjyshanghaitech commented on July 29, 2024

Hello, I noticed in the data_module_xl.py file, specifically at line 24, that you have imported the MSA subsampling module with from openfold.data.msa_subsampling import get_eff, subsample_msa, subsample_msa_sequentially, subsample_msa_random. However, looking further into the file, I didn't find any usage of this module. Could you please explain why it was imported but not used?

lhatsk avatar lhatsk commented on July 29, 2024

data_module_xl.py is not used. It's some legacy stuff that I didn't clean up.
