Comments (29)

egor-bogomolov avatar egor-bogomolov commented on May 29, 2024 1

Okay, seems like we are coming closer to the reason. Thanks a lot for providing all this information!

from astminer.

egor-bogomolov avatar egor-bogomolov commented on May 29, 2024 1

Hi @appupulla, sorry for the long wait. I reproduced it in a more extreme form: I found a large .c file for which we extract a lot of path contexts but store 0 tokens and 0 paths in the corresponding file. I'm still investigating the reason behind this behavior.

egor-bogomolov avatar egor-bogomolov commented on May 29, 2024

Hello! Could you please share the generated data?

appupulla avatar appupulla commented on May 29, 2024

processed_data.zip
I have attached the generated data. I had to limit it to just 10 path contexts because of the file size limit.

egor-bogomolov avatar egor-bogomolov commented on May 29, 2024

Are those the whole tokens and paths files? It seems like a huge mismatch compared to indices in path_contexts.csv.
Could you also share the exact command that you used to run the jar?
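
A mismatch like this can be checked mechanically. The following is a minimal sketch, not part of astminer: it assumes each path-context line has the form `<label> <start>,<path>,<end> ...` (the comma inside each triple is an assumption; adjust the split to the actual files), and the function name `find_missing_ids` is hypothetical. It reports every token or path index that the contexts reference but the vocabularies do not contain.

```python
def find_missing_ids(path_context_lines, token_ids, path_ids):
    """Return (missing_token_ids, missing_path_ids) referenced by the contexts.

    Assumed line layout (hypothetical, adjust to your files):
        <label> <start>,<path>,<end> <start>,<path>,<end> ...
    """
    missing_tokens, missing_paths = set(), set()
    for line in path_context_lines:
        fields = line.split()
        for triple in fields[1:]:  # fields[0] is the label
            start, path, end = (int(x) for x in triple.split(","))
            for tok in (start, end):
                if tok not in token_ids:
                    missing_tokens.add(tok)
            if path not in path_ids:
                missing_paths.add(path)
    return missing_tokens, missing_paths


# Toy data: token 9 and path 7 are referenced but not defined.
contexts = ["main 1,1,2 2,7,9"]
missing = find_missing_ids(contexts, token_ids={1, 2}, path_ids={1})
print(missing)
```

If both returned sets are empty, the three files are at least consistent with each other.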

egor-bogomolov avatar egor-bogomolov commented on May 29, 2024

Also, did you run preprocessing beforehand?

appupulla avatar appupulla commented on May 29, 2024

Yes, the tokens and paths files are complete. The mismatch is the problem.
I ran it using the following command:
java -Xmx45g -jar cli.jar code2vec --lang c,cpp --project /home/c_cpp_data/ --output /home/code2vec_train/ --maxH 5 --maxW 3 --maxContexts 1000000 --maxTokens 100000 --maxPaths 100000
This command gave me the above-mentioned numbers.
However, the attached output I shared was generated without the maxContexts, maxTokens and maxPaths flags.
The issue remains the same in both cases.

appupulla avatar appupulla commented on May 29, 2024

Yes, I did run preprocessing beforehand, and that did not solve my issue either.

egor-bogomolov avatar egor-bogomolov commented on May 29, 2024

Thanks for the complete information, looking into it.

appupulla avatar appupulla commented on May 29, 2024

Thanks a lot @egor-bogomolov. I appreciate the help and the quick responses.

egor-bogomolov avatar egor-bogomolov commented on May 29, 2024

Which version of the CLI did you use?

appupulla avatar appupulla commented on May 29, 2024

astminer-cli-0.3-all.jar

egor-bogomolov avatar egor-bogomolov commented on May 29, 2024

Are you sure that it was the command? I'm not sure if it is important but for me, astminer-cli-0.3-all generates path_contexts files in a different format (space-separated triples of indices, not semicolon-separated).

egor-bogomolov avatar egor-bogomolov commented on May 29, 2024

Also, if the data that you used is publicly available, could you provide a link to download it (e.g., github repo that you parsed)?

appupulla avatar appupulla commented on May 29, 2024

Yes, that was the command. I edited path_contexts.csv to suit the code2vec training in the py_example code. The data is not publicly available.

egor-bogomolov avatar egor-bogomolov commented on May 29, 2024

Oh, okay, we should make this part simpler.

egor-bogomolov avatar egor-bogomolov commented on May 29, 2024

So far I've checked the code and run the CLI on several projects of different sizes. The issue hasn't reproduced for me yet. Could you share the script that converts the output to be compatible with the py example?

appupulla avatar appupulla commented on May 29, 2024

The script just strips and replaces commas, spaces, and semicolons in each line. I will run it on a small subset of my data, replicate the error, and send the data over to you.

appupulla avatar appupulla commented on May 29, 2024

Hi Egor,
The error cannot be replicated on a small amount of data. I am running it on a huge dataset that contains over 3000 C and C++ files, each with more than 1000 lines of code. The dataset is too large to upload here. I tried replicating it on 100 C and C++ files and the error does not occur. Maybe there's a bug in the partitioning of the path_context files. Can you please try running it on the biggest dataset you have and see if you can replicate the result? I will try to get you the dataset to replicate the error as soon as I can.

appupulla avatar appupulla commented on May 29, 2024

I ran the CLI on just 30 files and the number of tokens generated was 4396, while running it on 3000 files gives just ~7900 tokens, i.e. the token count does not even double on 100 times more files.
Data folder structure:
main folder
    folder1: 1500 files
    folder2: 1500 files
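
As a quick check of the numbers quoted above, the growth factors can be computed directly. This is only arithmetic on the two reported token counts (the 3000-file count is approximate), not output from the tool:

```python
tokens_30_files = 4396
tokens_3000_files = 7900  # approximate value reported above

file_ratio = 3000 / 30
token_ratio = tokens_3000_files / tokens_30_files

print(f"files grew {file_ratio:.0f}x, tokens grew {token_ratio:.2f}x")
# prints "files grew 100x, tokens grew 1.80x"
```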

appupulla avatar appupulla commented on May 29, 2024

If you want to replicate the error, note that the -Xmx45g/-Xmx40g flag in the java -jar command is absolutely essential: you need to run it on a dataset large enough to require 40-45 GB of Java heap space. Otherwise the error does not reproduce.

egor-bogomolov avatar egor-bogomolov commented on May 29, 2024

Hi, @appupulla! What happens when you set a smaller limit on the Java heap? Since batching is in use, it should work normally as long as it does not hit the memory limit because of an abnormally large file.

appupulla avatar appupulla commented on May 29, 2024

The CLI fails with a Java heap space error. It looks like batching is not working as expected on abnormally large files, then.

appupulla avatar appupulla commented on May 29, 2024

The error message:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at gnu.trove.set.hash.THashSet.rehash(THashSet.java:163)
at gnu.trove.impl.hash.THash.postInsertHook(THash.java:388)
at gnu.trove.set.hash.THashSet.add(THashSet.java:112)
at org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph.addVertex(TinkerGraph.java:284)
at io.shiftleft.codepropertygraph.cpgloading.ProtoToCpg.addNodes(ProtoToCpg.java:63)
at io.shiftleft.codepropertygraph.cpgloading.ProtoCpgLoader.loadFromListOfProtos(ProtoCpgLoader.java:104)
at io.shiftleft.fuzzyc2cpg.output.inmemory.OutputModule.constructTinkerGraphFromCpg(OutputModule.java:54)
at io.shiftleft.fuzzyc2cpg.output.inmemory.OutputModule.persist(OutputModule.java:42)
at io.shiftleft.fuzzyc2cpg.output.inmemory.OutputModuleFactory.persist(OutputModuleFactory.java:33)
at io.shiftleft.fuzzyc2cpg.FuzzyC2Cpg.runAndOutput(FuzzyC2Cpg.scala:34)
at astminer.parse.cpp.FuzzyCppParser.parse(FuzzyCppParser.kt:87)
at astminer.common.model.Parser$DefaultImpls.parseProject(ParsingModel.kt:55)
at astminer.parse.cpp.FuzzyCppParser.parseProject(FuzzyCppParser.kt:23)
at astminer.common.model.Parser$DefaultImpls.parseWithExtension(ParsingModel.kt:64)
at astminer.parse.cpp.FuzzyCppParser.parseWithExtension(FuzzyCppParser.kt:23)
at cli.Code2VecExtractor.extract(Code2VecExtractor.kt:107)
at cli.Code2VecExtractor.run(Code2VecExtractor.kt:132)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:136)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:14)
at com.github.ajalt.clikt.core.CliktCommand.parse(CliktCommand.kt:216)
at com.github.ajalt.clikt.core.CliktCommand.parse$default(CliktCommand.kt:213)
at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:231)
at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:250)
at cli.MainKt.main(Main.kt:14)
[1]+ Exit 1 nohup java -jar cli.jar code2vec --lang c,cpp --project /home/c_cpp_data/ --output /home/code2vec_train/

egor-bogomolov avatar egor-bogomolov commented on May 29, 2024

I've run the jar on several large projects (thousands of cpp/c files) and didn't face the error. Could you please share several of the extremely large source files?

appupulla avatar appupulla commented on May 29, 2024

@egor-bogomolov were you able to reproduce the error with this data?

appupulla avatar appupulla commented on May 29, 2024

@egor-bogomolov, @iromeo, @olegs, @gsvgit Can anyone look into this issue please?

egor-bogomolov avatar egor-bogomolov commented on May 29, 2024

Hey, I've investigated the issue more deeply and found the reason behind the enormous memory usage and, most likely, your problem as well. The tool should now fit in a reasonable amount of memory. Please download the updated version of the CLI from Google Drive (it's not published in the repo yet). Note that to use batching you should now pass --batchMode --batchSize 1000 (it toggles off the limits on maxTokens and maxPaths). Alternatively, you can pass the limits without batching if you have enough memory.
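
For readers unfamiliar with the idea, here is a minimal sketch of what batching means in general. It is not astminer's implementation; the `batched` helper and the file names are hypothetical. The point is that only one batch of files is held at a time, so peak memory depends on the batch size rather than on the size of the whole project.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical file list; a real run would walk the project directory.
files = [f"file_{i}.c" for i in range(2500)]
sizes = [len(batch) for batch in batched(files, 1000)]
print(sizes)  # [1000, 1000, 500]
```

A single abnormally large file can still exceed the heap on its own, since a batch cannot be smaller than one file.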

appupulla avatar appupulla commented on May 29, 2024

Issue solved. Thanks a lot @egor-bogomolov
