I am trying to generate the code2vec data using the cli.jar on a bunch of custom progr

cli.jar produces inconsistent results (JetBrains-Research/astminer)

egor-bogomolov commented on May 29, 2024

Okay, seems like we are coming closer to the reason. Thanks a lot for providing all this information!

from astminer.

egor-bogomolov commented on May 29, 2024

Hi @appupulla, sorry for the long wait. I reproduced it in a more pronounced form: I found a large .c file for which we extract a lot of path contexts but store 0 tokens and 0 paths in the corresponding file. I'm still investigating the reason behind this behavior.

egor-bogomolov commented on May 29, 2024

Hello! Could you please share the generated data?

appupulla commented on May 29, 2024

processed_data.zip
I have attached the generated data. Because of the file size limit, I could include only 10 path contexts.

egor-bogomolov commented on May 29, 2024

Are those the whole tokens and paths files? It seems like a huge mismatch compared to indices in path_contexts.csv.
Could you also share the exact command that you used to run the jar?
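A mismatch of this kind can be checked mechanically. The sketch below is only an illustration: it assumes each path_contexts line has the form `<label> <start>,<path>,<end> ...` with integer indices, and that the known token and path ids have already been loaded into sets. Neither the line format nor the function name `find_unknown_ids` comes from astminer itself.

```python
def find_unknown_ids(context_lines, token_ids, path_ids):
    """Return (unknown_tokens, unknown_paths) referenced by path contexts.

    Assumed line format (not confirmed by the thread):
    "<label> <start>,<path>,<end> <start>,<path>,<end> ..."
    token_ids and path_ids are sets of the ids present in the vocabulary files.
    """
    unknown_tokens, unknown_paths = set(), set()
    for line in context_lines:
        fields = line.split()
        for triple in fields[1:]:  # fields[0] is the label
            start, path, end = (int(x) for x in triple.split(","))
            # Collect token ids that the vocabulary does not contain.
            unknown_tokens.update({start, end} - token_ids)
            if path not in path_ids:
                unknown_paths.add(path)
    return unknown_tokens, unknown_paths


# Tiny made-up example: token 9 and path 2 are not in the vocabularies.
missing_tokens, missing_paths = find_unknown_ids(["main 1,1,2 2,2,9"], {1, 2}, {1})
print(missing_tokens, missing_paths)
```

If either returned set is non-empty, the contexts refer to ids that the tokens or paths file does not define.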

egor-bogomolov commented on May 29, 2024

Also, did you run preprocessing beforehand?

appupulla commented on May 29, 2024

Yes, the tokens and paths files are complete. The mismatch is the problem.
I ran it using the following command:
java -Xmx45g -jar cli.jar code2vec --lang c,cpp --project /home/c_cpp_data/ --output /home/code2vec_train/ --maxH 5 --maxW 3 --maxContexts 1000000 --maxTokens 100000 --maxPaths 100000
This command gave me the above-mentioned numbers.
However, the attached output was generated without the --maxContexts, --maxTokens and --maxPaths flags.
The issue is the same in both cases.

appupulla commented on May 29, 2024

Yes, I did run preprocessing beforehand, and that did not solve my issue either.

egor-bogomolov commented on May 29, 2024

Thanks for the complete information, looking into it.

appupulla commented on May 29, 2024

Thanks a lot @egor-bogomolov. Appreciate the help and quick responses

egor-bogomolov commented on May 29, 2024

Which version of the CLI did you use?

appupulla commented on May 29, 2024

astminer-cli-0.3-all.jar

egor-bogomolov commented on May 29, 2024

Are you sure that it was the command? I'm not sure if it is important but for me, astminer-cli-0.3-all generates path_contexts files in a different format (space-separated triples of indices, not semicolon-separated).

egor-bogomolov commented on May 29, 2024

Also, if the data that you used is publicly available, could you provide a link to download it (e.g., github repo that you parsed)?

appupulla commented on May 29, 2024

Yes, that was the command. I edited path_contexts.csv to suit the code2vec training in the py_example code. The data is not publicly available.

egor-bogomolov commented on May 29, 2024

Oh, okay, we should make this part simpler.

egor-bogomolov commented on May 29, 2024

So far I've checked the code and run the CLI on several projects of different sizes. The issue hasn't reproduced for me yet. Could you share the script that converts the output to be compatible with the py_example?

appupulla commented on May 29, 2024

The script just strips and replaces the commas, spaces and semicolons. I will run it on a small subset of my data, replicate the error and send that data over to you.
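For illustration only, a separator-only conversion of that kind might look like the sketch below. The target layout (fields joined by semicolons) is an assumption based on the "semicolon-separated" description earlier in the thread, and `to_semicolon_format` is a hypothetical name, not the script actually used.

```python
def to_semicolon_format(line):
    """Strip one line and re-join its whitespace-separated fields with ';'.

    Assumption: the input is "<label> <triple> <triple> ..." and the target
    format wants the same fields separated by semicolons.
    """
    return ";".join(line.split())


print(to_semicolon_format("main 1,1,2  2,2,9\n"))
```

A conversion like this changes only the separators and leaves every index untouched.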

appupulla commented on May 29, 2024

Hi Egor,
The error cannot be replicated on a small amount of data. I am running it on a huge dataset containing over 3000 C and C++ files, each with more than 1000 lines of code. The dataset is too large to upload here. I tried replicating it on 100 C and C++ files and the error does not occur. Maybe there's a bug in the partitioning of the path_context files. Could you please try running it on the biggest dataset you have and see whether you can replicate the result? I will try to get you the dataset to replicate the error as soon as I can.

appupulla commented on May 29, 2024

Running the CLI on just 30 files generated 4396 tokens, while running it on 3000 files gives only ~7900 tokens, i.e. the token count less than doubles on 100 times more files.
Data folder structure:
main folder
    folder1
        1500 files
    folder2
        1500 files
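The token counts reported above can be compared directly. This is plain arithmetic on the numbers from the comment; 7900 is the approximate figure given as "~7900".

```python
tokens_30_files = 4396
tokens_3000_files = 7900  # approximate, reported as "~7900"

file_growth = 3000 / 30
token_growth = tokens_3000_files / tokens_30_files

print(f"files grew {file_growth:.0f}x, tokens grew {token_growth:.2f}x")
# prints: files grew 100x, tokens grew 1.80x
```

Some sublinear growth in distinct tokens is expected as a corpus grows, so the ratio on its own is a hint rather than proof of a bug.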

appupulla commented on May 29, 2024

If you try to replicate the error, make sure the run really needs the -Xmx45g/-Xmx40g flag in the java -jar command, i.e. run it on a dataset large enough to require 40-45 GB of Java heap space. Otherwise the error does not reproduce.

egor-bogomolov commented on May 29, 2024

Hi, @appupulla! What happens when you set a smaller limit on the Java heap? Since batching is in use, it should work normally as long as it does not hit the memory limit because of an abnormally large file.
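The general idea of batching can be sketched as follows. This is a generic illustration, not astminer's implementation; `batched`, `process_in_batches` and `handle_batch` are hypothetical names.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_in_batches(files, batch_size, handle_batch):
    """Call handle_batch on one batch at a time.

    Only the current batch has to be held in memory, so peak usage depends
    on the largest batch rather than on the whole dataset.
    """
    for batch in batched(files, batch_size):
        handle_batch(batch)


print(list(batched(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```

Under this scheme a single abnormally large file still has to fit in memory together with the rest of its batch, which is the caveat mentioned above.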

appupulla commented on May 29, 2024

The CLI fails with a Java heap space error. It looks like batching is not working as expected on abnormally large files, then.

appupulla commented on May 29, 2024

The error message:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at gnu.trove.set.hash.THashSet.rehash(THashSet.java:163)
    at gnu.trove.impl.hash.THash.postInsertHook(THash.java:388)
    at gnu.trove.set.hash.THashSet.add(THashSet.java:112)
    at org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph.addVertex(TinkerGraph.java:284)
    at io.shiftleft.codepropertygraph.cpgloading.ProtoToCpg.addNodes(ProtoToCpg.java:63)
    at io.shiftleft.codepropertygraph.cpgloading.ProtoCpgLoader.loadFromListOfProtos(ProtoCpgLoader.java:104)
    at io.shiftleft.fuzzyc2cpg.output.inmemory.OutputModule.constructTinkerGraphFromCpg(OutputModule.java:54)
    at io.shiftleft.fuzzyc2cpg.output.inmemory.OutputModule.persist(OutputModule.java:42)
    at io.shiftleft.fuzzyc2cpg.output.inmemory.OutputModuleFactory.persist(OutputModuleFactory.java:33)
    at io.shiftleft.fuzzyc2cpg.FuzzyC2Cpg.runAndOutput(FuzzyC2Cpg.scala:34)
    at astminer.parse.cpp.FuzzyCppParser.parse(FuzzyCppParser.kt:87)
    at astminer.common.model.Parser$DefaultImpls.parseProject(ParsingModel.kt:55)
    at astminer.parse.cpp.FuzzyCppParser.parseProject(FuzzyCppParser.kt:23)
    at astminer.common.model.Parser$DefaultImpls.parseWithExtension(ParsingModel.kt:64)
    at astminer.parse.cpp.FuzzyCppParser.parseWithExtension(FuzzyCppParser.kt:23)
    at cli.Code2VecExtractor.extract(Code2VecExtractor.kt:107)
    at cli.Code2VecExtractor.run(Code2VecExtractor.kt:132)
    at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:136)
    at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:14)
    at com.github.ajalt.clikt.core.CliktCommand.parse(CliktCommand.kt:216)
    at com.github.ajalt.clikt.core.CliktCommand.parse$default(CliktCommand.kt:213)
    at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:231)
    at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:250)
    at cli.MainKt.main(Main.kt:14)
[1]+ Exit 1 nohup java -jar cli.jar code2vec --lang c,cpp --project /home/c_cpp_data/ --output /home/code2vec_train/

egor-bogomolov commented on May 29, 2024

I've run the jar on several large projects (thousands of cpp/c files) and didn't face the error. Could you please share several of the extremely large source files?
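One way to pick out such files is to rank the sources by size on disk. The helper below is a hypothetical sketch; the extension list and the use of byte size as a proxy for "extremely large" are assumptions.

```python
from pathlib import Path


def largest_sources(root, extensions=(".c", ".cpp"), top=10):
    """Return up to `top` (path, size_in_bytes) pairs, largest first."""
    sized = [
        (path, path.stat().st_size)
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix in extensions
    ]
    # Sort by size only, so ties never fall back to comparing paths.
    sized.sort(key=lambda pair: pair[1], reverse=True)
    return sized[:top]


# Example usage (the directory is the one from the command in this thread):
# for path, size in largest_sources("/home/c_cpp_data/"):
#     print(size, path)
```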

appupulla commented on May 29, 2024

@egor-bogomolov were you able to reproduce the error with this data?

appupulla commented on May 29, 2024

@egor-bogomolov, @iromeo, @olegs, @gsvgit Can anyone look into this issue please?

egor-bogomolov commented on May 29, 2024

Hey, I've investigated the issue more deeply and found the reason behind the enormous memory usage and, most likely, your problem as well. The tool should now fit in a reasonable amount of memory. Please download the updated version of the CLI from Google Drive (it's not published in the repo yet). Note that to use batching you should now pass --batchMode --batchSize 1000 (this turns off the limits on maxTokens and maxPaths). Alternatively, if you have enough memory, you can pass the limits and not use batching.
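The two modes described above could be assembled as in the sketch below. The flags are the ones quoted in this thread; the jar name `cli.jar` is a placeholder for the updated build, and `build_command` is a hypothetical helper, not part of astminer.

```python
def build_command(project, output, batch_size=None, limits=None):
    """Build the argument list for either batch mode or limit mode.

    batch_size: if given, use --batchMode --batchSize and pass no limits,
                since batching is described as turning those limits off.
    limits:     optional dict such as {"maxTokens": 100000, "maxPaths": 100000},
                used only when batching is not requested.
    """
    command = [
        "java", "-jar", "cli.jar",  # placeholder jar name
        "code2vec", "--lang", "c,cpp",
        "--project", project,
        "--output", output,
    ]
    if batch_size is not None:
        command += ["--batchMode", "--batchSize", str(batch_size)]
    elif limits:
        for flag, value in limits.items():
            command += [f"--{flag}", str(value)]
    return command


print(" ".join(build_command("/home/c_cpp_data/", "/home/code2vec_train/", batch_size=1000)))
```

The resulting list can be passed to `subprocess.run` as is.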

appupulla commented on May 29, 2024

Issue solved. Thanks a lot, @egor-bogomolov.
