
Comments (21)

haighal commented on May 23, 2024

Hey Marc,

Just wanted to say thanks again for your very helpful comments here (that was indeed the issue I was having at test time). We got an implementation working on Python ASTs using only the Syntax edges and got results pretty similar to what you reported in your ablation from the ICLR '18 paper!

Accuracy@1: 34.5671%
Accuracy@3: 42.4864%
Accuracy@5: 45.6902%

The code lives at https://github.com/haighal/graph-based-code-modelling/ and has a thorough summary of both (1) our modifications to your code base and (2) the steps to generate our results. Let me know if you have any questions - my email is haighal [at] cs.stanford.edu

Cheers,
Alex

mmjb commented on May 23, 2024

mmjb commented on May 23, 2024

mmjb commented on May 23, 2024

Graphs with a substantial diameter are the norm -- this truncation is relevant for more or less every sample in the dataset. I'm currently at ICLR and am having a hard time doing proper hacking, but I can try to look into recreating the ICLR'18 VarNaming dataset next week.

mmjb commented on May 23, 2024

Getting the NextToken edges right is indeed a bit harder to do with the Python parser (in Roslyn, we can rely on the fact that the source code visitor visits things in the right order; see

private void AddToken(SyntaxNode parentNode, SyntaxToken token)

in the C# data extraction code).

However, Patrick, whom Miltos and I worked with on a recent ICLR paper, also released Python2Graph code, so you might want to look into simply reusing that:
https://github.com/CoderPat/structured-neural-summarization/blob/master/parsers/sourcecode/barone/ast_graph_generator.py
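If you do end up rolling your own for Python, the core trick is just to chain the terminals in the order the visitor reaches them. A rough sketch using the plain ast module (illustrative only; the node ids, and treating AST leaves as "tokens", are simplifications on my part, not what Patrick's code does):

    import ast

    def syntax_edges(source):
        """Sketch: collect Child edges from the Python AST and chain leaf
        nodes (standing in for tokens) with NextToken edges in visit order."""
        tree = ast.parse(source)
        ids, child_edges, terminals = {}, [], []

        def visit(node):
            ids[node] = len(ids)
            children = list(ast.iter_child_nodes(node))
            if not children:
                terminals.append(node)  # a leaf plays the role of a "token" here
            for child in children:
                visit(child)
                child_edges.append((ids[node], ids[child]))

        visit(tree)
        # NextToken: link each terminal to its successor in traversal order
        next_token_edges = [(ids[a], ids[b]) for a, b in zip(terminals, terminals[1:])]
        return child_edges, next_token_edges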

Re type embeddings: Setting the size to 0 should just work, but it may require making a few things more robust in the tensorisation pipeline (i.e., access to the types would need to become optional). However, this data is just passed through, so the changes required should be fairly minor.

Marc

mmjb commented on May 23, 2024

Heya,

This sounds mostly right (there may be some patching required in tensorise to look at the variable name instead of the target expression). The code released here is indeed quite similar to what we used for the ICLR'18 model (it's the basis before some refactorings / extensions for the ICLR'19 paper). The Graph2SeqModel is a bit different from what we used, but you should be able to retool it fairly easily.

The only obvious problem I see is that I removed the subtokenization bits at some point (when introducing the CharCNNs for node labels). If you really want to reproduce the original result, you would need to patch this back in (essentially, look for uses of the cg_node_label_embedding_style hyperparameter to find places you need to modify). Apart from that, I see no obvious issue, but I also haven't tried this ... so good luck, and come back with questions? :-)

Re the Cvitkovic et al. extensions: These are controlled by the cg_add_subtoken_nodes hyperparameter (for adding the meta-nodes for each used subtoken) and the cg_node_label_embedding_style hyperparameter (which is CharCNN for their best-performing configuration).

I think we also added residual GGNN connections, which you should be able to turn off with cg_ggnn_residual_connections: {}.
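Put together, the overrides discussed above would look roughly like this; the value types are my best guess (double-check them against the hyperparameter defaults in the code), and the subtoken label embedding style would first need to be patched back in:

    # Sketch of overrides to move towards the ICLR'18-style setup; value types
    # here are assumptions, not verified against the current defaults.
    hyper_overrides = {
        "cg_add_subtoken_nodes": False,          # no subtoken meta-nodes (Cvitkovic et al.)
        "cg_ggnn_residual_connections": {},      # disable residual GGNN connections
        # "cg_node_label_embedding_style": ...,  # subtoken embeddings, once re-added
    }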

Good luck!

haighal commented on May 23, 2024

Thanks for your help here! Seriously appreciate it.

Right now, I'm working with tensorize.py to process the data, and I'm getting an error in the _load_metadata_from_sample function in contextgraphmodel.py. It seems like the context graphs in the dataset don't have EdgeValues fields, so the program crashes at the line

for edge_type, values in raw_sample['ContextGraph']['EdgeValues'].items():

After poking around a bit, I'm also unsure what the cg_edge_value_sizes field of the metadata does, so I'm not sure how to patch this up. Any ideas? Did you change the way you stored the graphs between papers? An example of the program graph in the old dataset is here
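For what it's worth, the naive patch I was going to try is just treating the field as optional inside _load_metadata_from_sample, roughly like the sketch below (I haven't checked whether anything downstream actually needs the edge-value sizes):

    # Sketch: tolerate context graphs that carry no EdgeValues field.
    edge_values = raw_sample['ContextGraph'].get('EdgeValues', {})
    for edge_type, values in edge_values.items():
        ...  # existing per-edge-type metadata collection, unchanged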

haighal commented on May 23, 2024

Awesome. As you mentioned in the first post, the decoder needs to look at the variable name instead of the expression that you're trying to generate. In SeqDecoder.load_metadata_from_sample, it looks at the SymbolLabels field of the sample, which doesn't exist in the graph samples I have.

From the sample graph in the repo, SymbolLabels is just keeping a count of how often each subtoken appears in an answer in order to specify the output vocabulary. The most similar thing in the dataset from ICLR '18 is the field called SymbolCandidates, which per the description is a:

List of type-correct in-scope variables that can be used in the considered slot. Each variable represented by a dictionary in which "SymbolDummyNode" identifies the node in "ContextGraph" corresponding to the candidate, "IsCorrect" marks if the candidate is the correct choice, and "SymbolName" gives the name of the variable candidate [Unneeded for task, used for debugging].

My plan is to just find the correct variable name and split it up into subtokens based on camelCase (as you say in the paper: the name inputStreamBuffer is treated as the sequence [input, stream, buffer]). This seems reasonable, but I want to get your input first to make sure that is in fact what you do. Also, is there any chance you've already written a utility that splits camelCase identifiers into subtokens somewhere?
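For concreteness, the splitter I have in mind is something like this sketch (it handles camelCase and snake_case and lower-cases everything; I haven't checked it against whatever you used):

    import re

    def split_identifier(name):
        """Sketch: 'inputStreamBuffer' -> ['input', 'stream', 'buffer']."""
        subtokens = []
        for part in re.split(r'[_\W]+', name):  # snake_case and other separators
            # split camelCase/PascalCase, keeping runs of capitals (acronyms) together
            subtokens += re.findall(r'[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+', part)
        return [t.lower() for t in subtokens if t]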

Looking forward to training time, there's also a Productions field that gets used in SeqDecoder.load_data_from_sample and in collect_token_seq (in model.py). This seems like an artifact of expanding your program graph based on production rules and is unnecessary for the simpler VarNaming task - can I just get rid of that bit and replace it with the actual answer tokens?

Thanks again for the help!

haighal commented on May 23, 2024

Just wanted to say thanks again for the help here - we have the pipeline working and were able to overfit a small dataset. Now training on the full dataset!

mmjb commented on May 23, 2024

haighal commented on May 23, 2024

Yeah, if you could share that, that'd be great! Although rather than re-compiling the dataset, it might be easier just to traverse the existing graph instance and replace instances of the answer token (if that makes any sense).

My edits are in a fork of this repo, and when I get a chance I'll make a little readme explaining my changes and the steps I took to make it work.

haighal commented on May 23, 2024

Hey Marc,

Just following up here - would love suggestions on the easiest way to create a dataset like the one you used for VarNaming.

Also: do you remember the hyperparameters you used when training the VarNaming model (namely how many epochs)? The default patience value (5) gives up after ~20 epochs since there isn't any improvement from epochs 15-20 when I run it.

mmjb commented on May 23, 2024

Heya,

I've just pushed a new branch (dev/mabrocks/varnaming) with the data extractor for variable naming (copy & pasted from our internal repo from back then).

The problem with using the existing data is that the graphs we are extracting are truncated. More precisely, we have the notion of a "hole" (e.g., the removed expression) and we then copy a subgraph around this hole, based on the idea that the GNN in the end will only use K=8 propagation steps anyway. For VariableNaming, we have more holes, corresponding to different uses of the variable, and so the graphs are richer than what you would get from just using the dataset we generated for expression generation...
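Roughly, the truncation does something like the following (a sketch with made-up names, not the actual extractor code):

    from collections import deque

    def truncate_graph(adjacency, hole_nodes, num_hops=8):
        """Sketch: keep only nodes within num_hops (= GNN propagation steps)
        of any hole node, treating all edge types as undirected."""
        kept = set(hole_nodes)
        frontier = deque((n, 0) for n in hole_nodes)
        while frontier:
            node, dist = frontier.popleft()
            if dist == num_hops:
                continue
            for neighbour in adjacency.get(node, ()):
                if neighbour not in kept:
                    kept.add(neighbour)
                    frontier.append((neighbour, dist + 1))
        return kept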

Re hypers, @mallamanis should know the details...

mallamanis commented on May 23, 2024

I believe that we used the default hyperparameters here. I am sure that a more exhaustive hyperparameter optimization could have helped.

Note that for variable naming we used a GRU decoder that generated the variable name subtoken by subtoken. This isn't included in this repo.

haighal commented on May 23, 2024

Thanks for all the help (and the clarifications on the nuances of your dataset), guys! I'll check out what you did with the data extractor and look a little more into doing an exhaustive hyperparameter search. How often did you encounter graphs with diameter > 8 where this would make a difference?

@mallamanis - I thought the SeqDecoder in this repo did what you're saying. Did you use something different?

haighal commented on May 23, 2024

Awesome - would be very much appreciated. Have fun at ICLR!

mallamanis commented on May 23, 2024

@haighal You're right! I missed that SeqDecoder could be used for variable naming too.

haighal commented on May 23, 2024

Great! Thanks for the confirmation.

I know it's a slightly different task than what you've published (I'm really doing variable naming with only 1 masked instance and thus a different/truncated subgraph), but I'm still hoping to get interesting results with this.

I trained the model pretty quickly, but testing is taking ages: I ran the test script with 4 parallel workers on Friday afternoon and it still isn't complete (it's now Tuesday at 4 pm here in California). Do you guys have any idea why it might be taking so long/where there might be inefficiencies in the test script? I'm not tensorizing the test graphs before running test.py, but I don't think that's the issue because model.test() seems to do that for us.

Alex

haighal commented on May 23, 2024

Also, after I can get it working with C# code, I'm hoping to apply a similar model to a Python dataset. This obviously limits the scope of edge types I can use, and I was planning to start with only the syntax edges (Child and NextToken) and not use variable type information (since it's unavailable in Python). Two questions on that end:

  1. What's the easiest way to remove the type hierarchy from the model? It seems like we can just set 'cg_node_type_embedding_size' = 0, but is there anywhere else in the tensorisation pipeline we'd need to make substantial modifications?

  2. How do you actually compute the NextToken edges from the AST? I found this in the paper but am still a bit unsure (do you just arbitrarily link the terminals based on the order in which they're traversed?):

We use Child edges to connect nodes according to the AST. As this does not induce an order on children of a syntax node, we additionally add NextToken edges connecting each syntax token to its successor. An example of this is shown in Fig. 2a.

Thanks!

haighal commented on May 23, 2024

Perfect, thanks for sharing! And do you have any ideas on why testing might be so slow (two posts above)?

mmjb commented on May 23, 2024

Oh, sorry, yes. I don't have a clear idea, but I have a hunch. Concretely, I observed in the experiments for the ICLR'19 paper that the sequence decoder would get slower with time; in the end, I think I traced it to the tensorflow session collecting more and more constants or something. This problem resolved itself when I introduced per-file parallelisation (https://github.com/microsoft/graph-based-code-modelling/blob/master/Models/utils/test.py#L53) because that is creating a new model copy (with its own tf.Session) for every file that it's testing on. Now, my test files happened to be quite small; if yours are large, the problem may continue to exist for you...

So, to make that actionable:
(1) Run test from the start, track how fast it's making progress. If you see that it's getting slower: Yay, you're running into this issue.
(2) If this is indeed your problem, then instead of a principled fix, you can just take the test files, quickly split them into smaller files (e.g. along the lines of the sketch below), and everything should magically work...
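Something along these lines should do for the splitting (a sketch only; it assumes each test file holds a gzipped JSON list of samples, so adapt it to whatever chunk format you are actually using):

    import gzip, json, sys

    def split_test_file(path, samples_per_chunk=500):
        """Sketch: split one gzipped JSON list of samples into smaller files so
        the per-file test loop gets a fresh model/session more frequently."""
        with gzip.open(path, 'rt') as f:
            samples = json.load(f)
        for i in range(0, len(samples), samples_per_chunk):
            out_path = '%s.part%03d.gz' % (path, i // samples_per_chunk)
            with gzip.open(out_path, 'wt') as out:
                json.dump(samples[i:i + samples_per_chunk], out)

    if __name__ == '__main__':
        split_test_file(sys.argv[1])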
