Comments (29)
Okay, seems like we are coming closer to the reason. Thanks a lot for providing all this information!
from astminer.
Hi @appupulla, sorry for the long wait. I reproduced it in a more pronounced form: I found a large .c file for which we extract a lot of path contexts but store 0 tokens and 0 paths in the corresponding files. I'm still investigating the reason behind this behavior.
from astminer.
Hello! Could you please share the generated data?
from astminer.
processed_data.zip
I have attached the generated data. I had to limit it to just 10 path contexts because of the file size limit.
from astminer.
Are those the whole tokens and paths files? It seems like a huge mismatch compared to indices in path_contexts.csv.
Could you also share the exact command that you used to run the jar?
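The mismatch between the indices in path_contexts.csv and the tokens and paths files can be checked mechanically. Below is a minimal Python sketch, assuming (this layout is an assumption, not confirmed in this thread) that tokens.csv and paths.csv hold `id,value` rows with an optional header, and that each line of path_contexts.csv is a label followed by space-separated `start,path,end` index triples. The function names `parse_ids` and `find_missing_ids` are hypothetical helpers, not part of astminer.

```python
def parse_ids(csv_lines):
    """Collect integer ids from the first column of `id,value` rows (header rows are skipped)."""
    ids = set()
    for line in csv_lines:
        first = line.strip().split(",", 1)[0]
        if first.isdecimal():
            ids.add(int(first))
    return ids


def find_missing_ids(context_lines, token_ids, path_ids):
    """Return (missing_tokens, missing_paths): ids used in contexts but absent from the vocab files."""
    missing_tokens, missing_paths = set(), set()
    for line in context_lines:
        fields = line.split()
        for triple in fields[1:]:  # ASSUMPTION: fields[0] is the label
            parts = triple.split(",")
            if len(parts) != 3 or not all(p.isdecimal() for p in parts):
                continue  # not a start,path,end triple; ignore it
            start, path, end = map(int, parts)
            missing_tokens.update({start, end} - token_ids)
            if path not in path_ids:
                missing_paths.add(path)
    return missing_tokens, missing_paths


# Tiny in-memory example; real use would pass open file handles instead of lists.
tokens = parse_ids(["id,token", "1,foo", "2,bar"])
paths = parse_ids(["id,path", "1,5 6 7"])
missing_tokens, missing_paths = find_missing_ids(["main 1,1,2 2,1,3"], tokens, paths)
print(sorted(missing_tokens), sorted(missing_paths))
```

If either returned set is non-empty, the contexts refer to tokens or paths that were never written to the vocabulary files, which is the symptom described here.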
from astminer.
Also, did you run preprocessing beforehand?
from astminer.
Yes, the tokens and paths files are complete. The mismatch is the problem.
I ran it using the following command:
java -Xmx45g -jar cli.jar code2vec --lang c,cpp --project /home/c_cpp_data/ --output /home/code2vec_train/ --maxH 5 --maxW 3 --maxContexts 1000000 --maxTokens 100000 --maxPaths 100000
This command gave me the above-mentioned numbers.
However, the attached output that I shared was generated without the maxContexts, maxTokens and maxPaths flags.
The issue is the same in both cases.
from astminer.
Yes, I did run preprocessing beforehand, and that did not solve my issue either.
from astminer.
Thanks for the complete information, looking into it.
from astminer.
Thanks a lot @egor-bogomolov. Appreciate the help and quick responses
from astminer.
Which version of the CLI did you use?
from astminer.
astminer-cli-0.3-all.jar
from astminer.
Are you sure that it was the command? I'm not sure if it is important but for me, astminer-cli-0.3-all generates path_contexts files in a different format (space-separated triples of indices, not semicolon-separated).
from astminer.
Also, if the data that you used is publicly available, could you provide a link to download it (e.g., github repo that you parsed)?
from astminer.
Yes, that was the command. I edited path_contexts.csv to suit the code2vec training in the py_example code. The data is not publicly available.
from astminer.
Oh, okay, we should make this part simpler.
from astminer.
So far I've checked the code and run the CLI on several projects of different sizes. The issue hasn't reproduced for me yet. Could you share the script that converts the output to be compatible with the py example?
from astminer.
The script just strips each line and replaces the commas, spaces and semicolons. I will run it on a small subset of my data, replicate the error, and send the data over to you.
from astminer.
Hi Egor,
The error cannot be replicated on a small amount of data. I am running it on a huge dataset that contains over 3000 C and C++ files, each with more than 1000 lines of code. The dataset is too large to upload here. I tried replicating it on 100 C and C++ files and the error does not occur. Maybe there's a bug in the partitioning of the path_context files. Can you please try running it on the biggest dataset you've got and see if you can replicate the result? I will try to get you the dataset to replicate the error as soon as I can.
from astminer.
I ran the CLI on just 30 files and it generated 4396 tokens, while running it on 3000 files gives just ~7900 tokens, i.e. fewer than twice as many tokens for 100 times more files.
Data folder structure:
main folder
  folder1: 1500 files
  folder2: 1500 files
from astminer.
If you want to replicate the error, make sure the dataset is large enough that the -Xmx45g/-Xmx40g flag in the java -jar command is absolutely essential, i.e. the run needs 40-45 GB of Java heap space. Otherwise the error does not reproduce.
from astminer.
Hi, @appupulla! What happens when you set a smaller limit on the Java heap? Since batching is in use, it should work normally as long as it does not hit the memory limit because of an abnormally large file.
from astminer.
The CLI fails with a Java heap space error. It looks like batching is not working as expected on abnormally large files, then.
from astminer.
The error message:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at gnu.trove.set.hash.THashSet.rehash(THashSet.java:163)
at gnu.trove.impl.hash.THash.postInsertHook(THash.java:388)
at gnu.trove.set.hash.THashSet.add(THashSet.java:112)
at org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph.addVertex(TinkerGraph.java:284)
at io.shiftleft.codepropertygraph.cpgloading.ProtoToCpg.addNodes(ProtoToCpg.java:63)
at io.shiftleft.codepropertygraph.cpgloading.ProtoCpgLoader.loadFromListOfProtos(ProtoCpgLoader.java:104)
at io.shiftleft.fuzzyc2cpg.output.inmemory.OutputModule.constructTinkerGraphFromCpg(OutputModule.java:54)
at io.shiftleft.fuzzyc2cpg.output.inmemory.OutputModule.persist(OutputModule.java:42)
at io.shiftleft.fuzzyc2cpg.output.inmemory.OutputModuleFactory.persist(OutputModuleFactory.java:33)
at io.shiftleft.fuzzyc2cpg.FuzzyC2Cpg.runAndOutput(FuzzyC2Cpg.scala:34)
at astminer.parse.cpp.FuzzyCppParser.parse(FuzzyCppParser.kt:87)
at astminer.common.model.Parser$DefaultImpls.parseProject(ParsingModel.kt:55)
at astminer.parse.cpp.FuzzyCppParser.parseProject(FuzzyCppParser.kt:23)
at astminer.common.model.Parser$DefaultImpls.parseWithExtension(ParsingModel.kt:64)
at astminer.parse.cpp.FuzzyCppParser.parseWithExtension(FuzzyCppParser.kt:23)
at cli.Code2VecExtractor.extract(Code2VecExtractor.kt:107)
at cli.Code2VecExtractor.run(Code2VecExtractor.kt:132)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:136)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:14)
at com.github.ajalt.clikt.core.CliktCommand.parse(CliktCommand.kt:216)
at com.github.ajalt.clikt.core.CliktCommand.parse$default(CliktCommand.kt:213)
at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:231)
at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:250)
at cli.MainKt.main(Main.kt:14)
[1]+ Exit 1 nohup java -jar cli.jar code2vec --lang c,cpp --project /home/c_cpp_data/ --output /home/code2vec_train/
from astminer.
I've run the jar on several large projects (thousands of cpp/c files) and didn't face the error. Could you please share several of the extremely large source files?
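One way to pick out the extremely large source files from a project directory is to rank files by size. The following is a minimal Python sketch; the function name `largest_sources` and the default extension list are assumptions made for illustration and are not part of astminer.

```python
import os


def largest_sources(root, extensions=(".c", ".cpp", ".h", ".hpp"), top=10):
    """Return up to `top` (size_in_bytes, path) pairs under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # str.endswith accepts a tuple, so one call covers all extensions.
            if name.lower().endswith(extensions):
                path = os.path.join(dirpath, name)
                sizes.append((os.path.getsize(path), path))
    return sorted(sizes, reverse=True)[:top]
```

For example, `largest_sources("/home/c_cpp_data/")` would list the ten largest C/C++ files in the project directory used in the command above.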
from astminer.
@egor-bogomolov were you able to reproduce the error with this data?
from astminer.
@egor-bogomolov, @iromeo, @olegs, @gsvgit Can anyone look into this issue please?
from astminer.
Hey, I've investigated the issue more deeply and found the reason behind the enormous memory usage and, most likely, your problem as well. The tool should now fit in a reasonable amount of memory. Please download the updated version of the CLI from Google Drive (it's not published in the repo yet). Note that to use batching you should now pass --batchMode --batchSize 1000 (this turns off the limits on maxTokens and maxPaths). Alternatively, if you have enough memory, you can pass the limits and not use batching.
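For readers unfamiliar with the idea, batching in general means processing the input in fixed-size chunks so that only one chunk has to be held in memory at a time. The Python sketch below illustrates that general idea only; it is not astminer's implementation, and the helper name `batched` is hypothetical.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items from `items`."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # input exhausted
        yield batch
```

With a batch size of 1000, a list of 3000 files would be handled as three chunks of 1000 files each.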
from astminer.
Issue solved. Thanks a lot, @egor-bogomolov.
from astminer.