
mkusner / grammarVAE


Code for the "Grammar Variational Autoencoder" https://arxiv.org/abs/1703.01925

Python 91.25% Makefile 0.01% C++ 1.95% Shell 0.09% HTML 0.28% TeX 0.96% CSS 0.03% Jupyter Notebook 1.78% Batchfile 0.01% Gnuplot 0.01% C 1.48% Cuda 2.16%

grammarVAE's People

Contributors

chriscummins, mkusner


grammarVAE's Issues

Tokenize

Hi,

When tokenizing SMILES strings like 'c1(c(c(nc(c1Cl)Cl)C(=O)OCl)N' I get this: ['c', '1', '(', 'c', '(', 'c', '(', 'n', 'c', '(', 'c', '1', 'C', 'l', ')', 'C', 'l', ')', 'C', '(', '=', 'O', ')', 'O', ')', 'C', 'l', ')', 'N'], and the string then can't be parsed, since 'l' is not a terminal symbol in the grammar. How can I fix this?
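One fix I am considering (a sketch of my own, not the repo's tokenizer) is to match multi-character atoms such as Cl and Br before falling back to single characters:

import re

# Match two-character atoms first so 'Cl' is never split into 'C', 'l'.
TOKEN_RE = re.compile(r'Cl|Br|.')

def tokenize(smiles):
    return TOKEN_RE.findall(smiles)

print(tokenize("c1(c(c(nc(c1Cl)Cl)C(=O)OCl)N"))  # 'Cl' now survives as a single token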

Training accuracy is decreasing, loss is decreasing

While running the model on the same data set, I observe that after several steps the loss goes down but the training accuracy also seems to go down, which is very counterintuitive.

Is this behavior normal? Or am I misunderstanding something?

Thanks

Running the code

Thank you so much for your code. I have read your paper and tried to run your code, but many of the required packages are deprecated. I therefore updated your code to make it compatible with Python 3.7.6. Some of the updated parts of model_eq.py are shown below:

def conditional(x_true, x_pred):
    #most_likely = K.argmax(x_true)
    most_likely = tf.math.argmax(x_true)

    most_likely = tf.reshape(most_likely, [-1])                    # flatten most_likely
    ix2 = tf.expand_dims(tf.gather(ind_of_ind_K, most_likely), 1)  # index ind_of_ind with res
    ix2 = tf.cast(ix2, tf.int32)                                   # cast indices as ints
    M2 = tf.gather_nd(masks_K, ix2)                                # get slices of masks_K with indices
    M3 = tf.reshape(M2, [-1, MAX_LEN, DIM])                        # reshape them
    #P2 = tf.mul(K.exp(x_pred), M3)
    P2 = tf.math.multiply(tf.math.exp(x_pred), tf.cast(M3, tf.float32))  # apply masks to the exp-predictions
    #P2 = tf.div(P2, K.sum(P2, axis=-1, keepdims=True))
    P2 = tf.math.divide(P2, tf.math.reduce_sum(P2, axis=-1, keepdims=True))  # normalize predictions
    return P2

def vae_loss(x, x_decoded_mean):
    x_decoded_mean = conditional(x, x_decoded_mean)
    #x = K.flatten(x)
    #x_decoded_mean = K.flatten(x_decoded_mean)
    x = tf.keras.layers.Flatten()(x)
    x_decoded_mean = tf.keras.layers.Flatten()(x_decoded_mean)

    #xent_loss = max_length * objectives.binary_crossentropy(x, x_decoded_mean)
    xent_loss = max_length * tf.keras.losses.binary_crossentropy(x, x_decoded_mean)
    #kl_loss = - 0.5 * K.mean(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
    kl_loss = - 0.5 * tf.reduce_mean(1 + z_log_var - tf.math.square(z_mean) - tf.math.exp(z_log_var), axis=-1)

    return xent_loss + kl_loss

#return (vae_loss, Lambda(sampling, output_shape=(latent_rep_size,), name='lambda')([z_mean, z_log_var]))
return (vae_loss, tf.keras.layers.Lambda(sampling, output_shape=(latent_rep_size,), name='lambda')([z_mean, z_log_var]))

The equivalent lines of the previous code are left in as comments. But after running the code, I get the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'time_distributed_1_target' with dtype float and shape [?,?,?]
[[{{node time_distributed_1_target}}]]

After experimenting with the code, I realized that when I comment out the line "x_decoded_mean = conditional(x, x_decoded_mean)", the code starts running, but the accuracy is not correct. Commenting out the line "P2 = tf.math.divide(P2, tf.math.reduce_sum(P2, axis=-1, keepdims=True))" does not remove the error, but replacing "P2 = tf.math.multiply(tf.math.exp(x_pred), tf.cast(M3, tf.float32))" with "P2 = tf.math.exp(x_pred)" does. So the error arises from the "conditional" function, specifically from M3. I do not know exactly what this function does. Could you please help me solve this bug? It seems to be the only error preventing the code from running. Once it works, I can send you the updated code to put on GitHub.
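For what it's worth, my current understanding of what conditional() computes, as a plain-numpy sketch (the shapes and mask values below are toy assumptions, not the real grammar): for each timestep, look up which production rule was actually used, find the mask of rules valid for that rule's left-hand-side non-terminal, zero out the exp-predictions of invalid rules, and renormalize.

import numpy as np

# Toy setup: MAX_LEN timesteps, DIM production rules, 2 non-terminals.
MAX_LEN, DIM = 4, 3
masks = np.array([[1., 1., 0.],          # rules valid when expanding non-terminal 0
                  [0., 0., 1.]])         # rules valid when expanding non-terminal 1
ind_of_ind = np.array([0, 0, 1])         # rule index -> row of `masks` for its LHS

x_true = np.eye(DIM)[[0, 1, 2, 0]][None]   # one-hot ground-truth rules, (1, MAX_LEN, DIM)
x_pred = np.random.randn(1, MAX_LEN, DIM)  # decoder logits

most_likely = x_true.argmax(-1).ravel()    # which rule was used at each step
M3 = masks[ind_of_ind[most_likely]].reshape(1, MAX_LEN, DIM)  # valid-rule mask per step
P2 = np.exp(x_pred) * M3                   # zero out grammatically invalid rules
P2 = P2 / P2.sum(axis=-1, keepdims=True)   # renormalize -> masked softmax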

Thank you

Questions about VAE training

Hi Matt,
Great piece of work and great to see that you seem to answer questions posted here!
I came across your paper a few months ago and ported it to PyTorch with the intention of using the same principles in a different task. However, I am struggling to get good reconstruction accuracy when training the VAE. The main difference from your training method is that instead of using pre-existing data, I generate random sentences on the fly while training (I don't have the same semantic-validity restrictions as SMILES strings). To ensure unbiased training, my sentence lengths are drawn uniformly at random.

I measure reconstruction accuracy as the proportion of strings (from a randomly generated set of 1000) that are exactly recovered by the model. I am getting 0-5% at best, and the reconstruction loss (BCE) plateaus at a fairly high level. I am annealing the KLD weight from 0 to 1/latent_size, but there is only so much this achieves.

A few questions:

  • What accuracy were you getting, using the same metric (if you did use it)? Did you observe the same behaviour?
  • In your experience, was exact reconstruction accuracy essential to the quality of the logP predictor? Maybe it is not such a useful metric in assessing the quality of the latent representations learned by the model, in terms of usability by the BO predictor?
  • According to the paper, you were able to train the VAE independently from the logP prediction task, i.e. you didn't need to manipulate the latent space to include relevant 'features' for the BO task. Did you run any experiments where you tried to link the two during the VAE training phase? Do you have any reflections on this?
  • In molecule_vae.py, the method _sample_using_masks() uses a Gumbel-Softmax distribution. If my understanding is correct, the main benefit of taking the argmax of a G-S distribution is that it provides a differentiable alternative to sampling from a classical Softmax (as described in your paper GANs for Sequences of Discrete Elements with the Gumbel-Softmax Distribution). However, as far as I can tell, we never actually backpropagate through this portion of the code, so what was the motivation for using GS rather than Softmax?
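For context on the last question: the Gumbel-max trick guarantees that argmax(log p + Gumbel noise) is an exact sample from Categorical(p), with or without backpropagation. A quick numerical check (illustrative only, not the repo's code):

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.7, 0.2, 0.1])
# argmax of log-probs plus i.i.d. Gumbel(0, 1) noise samples from Categorical(p)
draws = [np.argmax(np.log(p) + rng.gumbel(size=3)) for _ in range(100000)]
print(np.bincount(draws) / len(draws))  # approximately [0.7, 0.2, 0.1]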

Thanks

KL divergence term in loss function

Hi,
I believe there is a bug in the implementation of the VAE loss. You take the mean over the latent dimension rather than the sum. I.e. the line:
kl_loss = - 0.5 * K.mean(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis = -1)
should be:
kl_loss = - 0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis = -1)
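For reference, the closed-form KL between the diagonal Gaussian posterior and the standard normal prior sums over the $d$ latent dimensions:

$$\mathrm{KL}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big) = -\tfrac{1}{2}\sum_{j=1}^{d}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$

Using K.mean instead divides this sum by $d$, which silently down-weights the KL term by a factor of the latent size.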

The same bug is present in the original character-based SMILES autoencoder:
maxhodak/keras-molecules#59

Replicating the results in PyTorch

Dear Matt,

after hearing the talk you gave in Cambridge on the Grammar VAE, I thought it would be fun to play with it in PyTorch, so I ported your code to PyTorch/Python 3, now at https://github.com/ZmeiGorynych/grammarVAE_pytorch

However, I have some questions when trying to replicate the training: I use the Adam optimizer with lr = 5e-4, decreasing to 1e-4 on plateaus, and the loss function looks like this

BCE = seq_len * self.bce_loss(model_out_x, target_x)
KLD_element = (1 + log_var - mu*mu - log_var.exp())
KLD = -0.5 * torch.mean(KLD_element)
loss = BCE + KLD          

and the encoder/decoder (settings here), which as far as I can tell is an exact replica of your functions in model_zinc.py. I'm using batch size 200, as that is the most that'll fit on a p2.xlarge in my implementation of the network.

Now you seem to train for 100 epochs, which would be 125,000 batches for me. However, when I train with the above parameters, running one validation batch after every 10 training batches, I get the following loss values (x-axis is batches):

[plot: training/validation loss vs. batches, saturating around 1.8]

In other words, the loss saturates at a value of about 1.8 after a couple of epochs and stays there.

Now when I add a lot of dropout, turn off the sampling of z (just take the mean instead), and replace the KL term with a simple deviation of the z batch mean and covariance matrix from those of N(0,1), the model trains much better, getting to a loss of about 0.5 over the same period as in the figure above.

[plot: loss curve for the modified model, reaching about 0.5]

Any idea what I could be doing wrong? Should I just de-weight the KL term further until it works?
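For concreteness, the kind of de-weighting I have in mind is a linear β warm-up on the KL term (names here are my own, not from either repo):

def beta_schedule(batch_idx, warmup_batches=20000, beta_max=1.0):
    # Linearly anneal the KL weight from 0 to beta_max over warmup_batches.
    return beta_max * min(1.0, float(batch_idx) / warmup_batches)

# inside the training loop:
# loss = BCE + beta_schedule(batch_idx) * KLD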

Thanks a lot for any suggestions,
E.

Latent dimensionality consistency

Hi,
While trying to figure out the reason for the very low accuracy, I noticed some inconsistencies. For the grammar-based training, the default in train_zinc.py is LATENT = 56, but in model_zinc.py, latent_rep_size = 2 appears twice: in def create(...) and in def load(...).
For the string-based training, the default in train_zinc_str.py is also LATENT = 56, but in model_zinc_str.py, latent_rep_size = 292 appears twice, in def create(...) and in def load(...).
Is this normal? Is it necessary to keep them consistent for each type of training?
In keras-molecules, the default latent size is 292; was there any particular consideration behind choosing 56 for your grammar-based latent space?
Thanks!
Toushi68

PyTorch implementation of Bayesian Optimization

Hi,
I have tried the Bayesian optimization, which is coded in Theano. It works great (though it takes a lot of time for training and sampling), but I want to port the whole pipeline to PyTorch.

Could you point me to any sources where I can find this implementation in PyTorch?

Latent Space Gradient for Equations Data

Hi! I used PCA to project the GrammarVAE latent space down to 2 dimensions and colored the points by their symbolic-regression score (similar to the logP plot in your paper). However, the colors are quite mixed: I don't see the score gradient I expected. (Otherwise the model performs well.)
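For reference, the projection I am doing is essentially this (sklearn/matplotlib sketch; the latent_points and scores placeholders stand in for my model's encodings and the regression scores):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

latent_points = np.random.randn(1000, 56)  # placeholder for model.encode(...) output
scores = np.random.rand(1000)              # placeholder for symbolic-regression scores

xy = PCA(n_components=2).fit_transform(latent_points)  # project 56-d codes to 2-d
plt.scatter(xy[:, 0], xy[:, 1], c=scores, s=5, cmap='viridis')
plt.colorbar(label='score')
plt.show()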

Have you encountered this issue with equations? Or did you only attempt plotting the LogP?

Thanks in advance for the information!

Question on Latent Space Dimension

@mkusner Hi Matt,
I have some questions about the dimension of the latent space.

Checking the pretrained folder and the training scripts for GVAE and CVAE, you use 56 as the latent dimension for both. In the paper appendix, you also plot a 2-dimensional space.

So two questions here:

  1. It seems you show the VAE with 2- and 56-dimensional latent spaces. Have you tried optimizing over other latent dimensions? Or did you explore that during Bayesian optimization?
  2. In Gómez-Bombarelli's paper and code, both 56 and 292 are tried as latent dimensions. Have you also tried those experiments? Or did you skip 292 because 56 is better?

Please correct me if I missed something.

Variational encoding

How can I remove the "variational" part of the autoencoder, to get a plain autoencoder? What parameters or sections of the code control this?
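My current guess, as a self-contained toy sketch (not this repo's architecture): use the encoder's mean directly instead of the sampling Lambda, and keep only the reconstruction term in the loss (drop the KL). Is that right?

import tensorflow as tf
from tensorflow import keras

input_dim, latent_dim = 100, 56

inputs = keras.Input(shape=(input_dim,))
h = keras.layers.Dense(128, activation='relu')(inputs)
z = keras.layers.Dense(latent_dim, name='z_mean')(h)       # deterministic code: no sampling layer
h_dec = keras.layers.Dense(128, activation='relu')(z)
outputs = keras.layers.Dense(input_dim, activation='sigmoid')(h_dec)

ae = keras.Model(inputs, outputs)
ae.compile(optimizer='adam', loss='binary_crossentropy')   # reconstruction loss only, no KL term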

The setting `MAX_LEN = 277` in `models/model_zinc` seems too small

If we try to encode the following SMILES string with the pretrained ZINC model, it results in an "index out of bounds" error, because num_productions = 288 for this string. Should I simply set MAX_LEN larger?

grammar_model.encode(["O1[C@@H](O[C@H]2[C@H](O)[C@@H](O[C@@H]3OC[C@@](O)(C)[C@H]([NH2+]C)[C@H]3O)[C@@H]([NH2+]CC)C[C@H]2[NH3+])[C@@H]([NH3+])CC=C1C[NH3+]"])


stochastic encoding?

Hi Matt,

Thank you very much for posting the code of your very nice work.

You mention in the paper appendix, in the context of calculating the reconstruction accuracy, that you encode each of the 5000 test molecules 10 times "as encoding and decoding are stochastic". Looking at the code here, I can see that the decoding is stochastic, but the encoding looks deterministic to me: https://github.com/mkusner/grammarVAE/blob/master/molecule_vae.py#L71-L84
Where does the stochasticity of the encoding come in?

I changed the Keras imports to use tf.keras and plan to train using CUDA 9.0 and TF 1.8.0-dev; I hope I will be able to reproduce your results. Did you obtain the ./pretrained weights using the hyperparameters in this repo? For example, did you use the default Adam learning rate?

Bayesian optimization (run_bo.py) seems not to produce reasonable results

Hi Matt,

I've recently been playing with the BO part of your code. I only ran simulation 1, and the results seem a bit off. I'm posting the log here; could you take a look and see if this looks right to you?

In summary, the predicted means of the selected batches are often much lower (e.g. -28) than the actual scores (e.g. 2); and of the 5 BO iterations, only the 2nd batch decodes into mostly valid SMILES strings; the other 4 iterations yield only 1, 0, 3, and 3 valid strings out of the 50 points in each batch.

FYI, I modified generate_latent_features_and_targets.py to encode the SMILES strings in batches (otherwise the program always got killed at this line; I don't know why);
that is, I replaced line 68

latent_points = grammar_model.encode(smiles_rdkit)

by the following for-loop

nn = len(smiles_rdkit)  # was len(smiles); the slicing below uses smiles_rdkit
batch_size = 5000
num_batches = nn // batch_size  # integer division
all_latent_points = []
print('encoding ...')
for i in range(num_batches):
    print('batch %d / %d' % (i, num_batches))
    latent_points_batch = grammar_model.encode(smiles_rdkit[i * batch_size:(i + 1) * batch_size])
    all_latent_points.append(latent_points_batch)
extra = nn - num_batches * batch_size
if extra > 0:
    print('the last batch ...')
    # start the remainder at num_batches * batch_size; (i+1)*batch_size is
    # undefined when the loop never runs (nn < batch_size)
    latent_points_batch = grammar_model.encode(smiles_rdkit[num_batches * batch_size:nn])
    all_latent_points.append(latent_points_batch)
latent_points = np.vstack(all_latent_points)

After running this modified version of generate_latent_features_and_targets.py, I ran

python run_bo.py > run_bo.log

Attached is "run_bo.log".

run_bo.log

Thanks!

Convert Exception

I got this exception when I ran encode_decode_zinc.py:

TypeError: Failed to convert object of type <class 'theano.tensor.var.TensorVariable'> to Tensor. Contents: argmax. Consider casting elements to a supported type.

So I'm wondering if this is caused by a different TensorFlow/Keras version. Can you specify which versions you are using?

BTW, I tested with both Keras and TensorFlow at version 1.2.1.

Cannot replicate results

Hi, I recently cloned the repository, and before editing the code I attempted to replicate the results. The training accuracy builds to ~5%, then drops to ~1.9% and stays there until I kill the training. I cannot figure out why I do not get the same accuracy as in the paper. Any assistance would be helpful. (On the ZINC dataset.)

Binary Cross Entropy

Hi,

Congrats on your amazing paper! I just have a question about the GVAE loss function. It's not clear why you use BCE instead of categorical cross-entropy. At each stage of decoding (after masking invalid rules), the model should select one of the valid production rules, so it seems more reasonable to use CCE instead of BCE, right? Or is there a reason I'm not seeing?

Best,
Mohsen

binary accuracy

Hi Matt,
First of all, thanks for your contributions. I just want to know what binary_accuracy the grammar VAE reaches when training the molecular (SMILES) model.
I think it is great progress compared with the CVAE.
Best wishes,
Yuanpeng Li

Failing to run run_bo.py

Hi, I get the following error while trying to run run_bo.py with the character VAE for molecules.

File "run_bo.py", line 164, in
current_log_P_value = Descriptors.MolLogP(MolFromSmiles(valid_smiles_final[ i ]))
File "/is/ps2/pghosh/.anaconda/grammar_RAE/lib/python2.7/site-packages/rdkit/Chem/Crippen.py", line 170, in
MolLogP = lambda *x, **y: rdMolDescriptors.CalcCrippenDescriptors(*x, **y)[0]
ValueError: Sanitization error: Explicit valence for atom # 17 O, 4, is greater than permitted

I suspect it is trying to evaluate the quality of a molecule that is not physically possible. However, such a molecule should presumably be ignored, or given the worst evaluation score. I have a different version of RDKit (2017.09.3); is that a problem? How can I fix this?
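A workaround I am considering (my own sketch, not the repo's handling; the penalty value and the sign convention are guesses, since I am not sure whether run_bo.py maximizes or minimizes here) is to guard the descriptor call and give invalid molecules a worst-case score:

from rdkit.Chem import Descriptors, MolFromSmiles

def safe_log_p(smiles, penalty=-100.0):
    # MolFromSmiles returns None for most invalid SMILES; sanitization
    # errors like the one above raise ValueError in the descriptor call.
    try:
        mol = MolFromSmiles(smiles)
        if mol is None:
            return penalty
        return Descriptors.MolLogP(mol)
    except ValueError:
        return penalty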

Installation issues

pip install -r requirements.txt

ERROR: tensorflow_gpu-0.12.1-cp27-none-linux_x86_64.whl is not a supported wheel on this platform.

Looks like Python 2 is required. It would be helpful to specify that in the installation description.

What does each script do

Hi,

I want to apply the code to RNA data with a different grammar. Could you please tell me what molecule_vae.py does? What is the difference between make_zinc_dataset_grammar.py and make_zinc_dataset_str.py? If I want to change the grammar and the input data, which parts of the code need to be changed?

Thank you

how to sample from the generative model

Hello,

I'd like to sample a batch of molecules from a pretrained GrammarVAE. Using encode_decode_zinc.py as inspiration, I first load the grammar weights and the grammar model, then sample from a standard normal and call the model's decode function on each sample.

import numpy as np
import molecule_vae

grammar_weights = "pretrained/zinc_vae_grammar_L56_E100_val.hdf5"
grammar_model = molecule_vae.ZincGrammarModel(grammar_weights)
latent_rep_size = 56
epsilon_std = 1.0
batch_size = 1000
# draw latent codes from the standard normal prior
prior_z_samples = np.random.normal(loc=0.0, scale=epsilon_std, size=(batch_size, latent_rep_size))
decoded_samples = []
for i in range(batch_size):
    decoded_samples.append(grammar_model.decode(prior_z_samples[i][None, :])[0])

Does this seem like a reasonable way to get samples from the generative model? The code above runs, but the output molecules seem off. For example, if I compare the empirical distribution of QED scores of the sampled molecules with that of the ZINC dataset, the GrammarVAE distribution is highly overdispersed and has a lower average QED score.
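Before comparing QED distributions I also filter for validity, since the decoder can emit unparseable SMILES (standard RDKit usage; MolFromSmiles returns None for those):

from rdkit import Chem

valid = [s for s in decoded_samples if Chem.MolFromSmiles(s) is not None]
print('%d / %d decoded samples are valid SMILES' % (len(valid), len(decoded_samples)))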

Low accuracy

I'm getting very low training performance on train_zinc.py, even using the pretrained model and after having regenerated the data multiple times. The accuracy starts at about 0.15% and trends down to 0.05%, while the loss goes from 2.1 to 1.7.

Strangely, BO metrics seem almost unaffected.

tensorflow wheel

I am using TensorFlow 1.8.0. I want to switch the .whl from CPU to GPU, but I don't know which .whl file is the right one. Could you provide a new .whl file? Many thanks.

Why did you encode each molecule 10 times?

In your paper, you said: "For each molecule we encode it 10 times, and we decode each encoding 100 times (as encoding and decoding are stochastic). This results in 1000 decoded molecules for each of the 5000 input molecules."

I understand that decoding is stochastic, but when I encoded a molecule several times, I got the same latent code z1 every time. With the pretrained weights, I think encoding is not stochastic. Am I right?
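For reference, the encode in molecule_vae.py appears to return z_mean directly, which would explain the identical codes. A stochastic encoding would draw z via the reparameterization trick, e.g. (numpy sketch, assuming access to z_mean and z_log_var from the encoder):

import numpy as np

def sample_latent(z_mean, z_log_var):
    # z = mu + sigma * eps, with eps ~ N(0, I)
    eps = np.random.normal(size=z_mean.shape)
    return z_mean + np.exp(0.5 * z_log_var) * eps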

Bayesian Optimization

Hi,

For the Bayesian optimization: in the decode_from_latent_space function, why did you pick the most popular samples?

Thank you,
Narges

data

Could you please send me a copy of eq2_grammar_dataset.h5 and eq2_str_dataset.h5? I can't access the Dropbox link at all. My email is [email protected]. Thank you very much!

Training time

What's the expected (approx.) training time for symbolic regression/zinc with the hyper-parameters in the paper (i.e. 100 epochs)?
