authorship-detection's People

Contributors

egor-bogomolov

authorship-detection's Issues

evaluation on the collected datasets

Hello, I am very interested in the evaluation on the collected datasets in Section 7 of "Authorship Attribution of Source Code: A Language-Agnostic Approach and Applicability in Software Engineering", and I would like to reproduce this experiment. Can you share the datasets with separated work contexts and the time-separated datasets used in the paper?

Error in reading token.csv

After extracting the four CSV files from the Python dataset with the command:

java -jar attribution/pathminer/extract-path-contexts.jar snapshot
--project datasets\gcjpy/
--output processed\gcjpy_antlr3_8_2\py/
--java-parser antlr
--maxContexts 2000 --maxH 8 --maxW 3

I launch the classification with:
python run_classification.py configs\gcjpy\nn\cv_32_hidden_10_epochs.yaml

and I get the error "KeyError: 'token'", as if there were a mismatch of indices. Am I doing something wrong?

datasets

Can you provide the C++ and Python datasets?

Error building the pathminer package

Hi! I am trying to reproduce your results and ran into a problem when rebuilding the pathminer Kotlin project. It depends on a package named astminer, but the Gradle build cannot locate that package. Is it an additional package that I have to download from somewhere else?

500 random path-contexts

Hello, I have a question about "Authorship Attribution of Source Code: A Language-Agnostic Approach and Applicability in Software Engineering". The paper states: "To speed up the PbNN's computations, at each training iteration we only take up to 500 random path-contexts for each sample." How are these up to 500 random path-contexts chosen for each sample? Is the selection for a single sample affected by the other samples in the dataset?
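
For illustration only, here is a minimal sketch of one plausible reading of the quoted sentence: each sample's path-contexts are subsampled independently of every other sample, and a fresh subset is drawn each time the sample is visited. This is an assumption, not the authors' confirmed implementation; the function name `sample_path_contexts` and the parameter `max_contexts` are hypothetical.

```python
import random


def sample_path_contexts(contexts, max_contexts=500, rng=None):
    """Return at most `max_contexts` path-contexts from one sample.

    Assumption: sampling is uniform, without replacement, and depends
    only on this sample's own list of path-contexts.
    """
    rng = rng or random
    contexts = list(contexts)
    if len(contexts) <= max_contexts:
        # Small samples are kept whole.
        return contexts
    return rng.sample(contexts, max_contexts)


# Calling the function again for the same sample (for example at the
# next training iteration) draws a new random subset.
large_sample = list(range(2000))
subset = sample_path_contexts(large_sample, rng=random.Random(0))
```

Under this reading, a sample with fewer than 500 path-contexts always contributes all of them, and what is drawn for one sample has no effect on any other.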

test model

Hello. We want to test the model produced by training, but the tokens.csv and paths.csv generated for the test set do not match the tokens.csv and paths.csv generated for the training set: the numbers of tokens and paths are completely different. As a result, the trained model cannot be reused or tested repeatedly. Can you share the source code of the program (attribution/pathminer/extract-path-contexts.jar) that generates these CSV files? We would like to adjust it so that the model can be reused. Thank you.
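
As a rough sketch of the kind of adjustment described above, the IDs from a separately extracted test set could be remapped onto the training vocabulary after extraction. This assumes each vocabulary file has an `id` column plus one value column (for example `token`); that layout, the helper names `load_vocab` and `remap_ids`, and the use of 0 as the ID for unseen values are all assumptions made for illustration, not the tool's documented behaviour.

```python
import csv
import io


def load_vocab(csv_text, value_column):
    """Parse vocabulary CSV text into a {value: id} dictionary."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[value_column]: int(row["id"]) for row in reader}


def remap_ids(train_vocab, test_vocab, unknown_id=0):
    """Map each test-set id to the training-set id of the same value.

    Values that never occurred in training get `unknown_id`
    (a hypothetical placeholder).
    """
    return {
        test_id: train_vocab.get(value, unknown_id)
        for value, test_id in test_vocab.items()
    }


# Tiny in-memory example standing in for two tokens.csv files.
train_tokens = "id,token\n1,foo\n2,bar\n"
test_tokens = "id,token\n1,bar\n2,baz\n"
mapping = remap_ids(
    load_vocab(train_tokens, "token"),
    load_vocab(test_tokens, "token"),
)
```

The same remapping would have to be applied to paths.csv, and the resulting mapping used to rewrite the IDs in the test set's path-context file before feeding it to the trained model.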

path context sequences

Hi, why do the path-context sequences generated from a single cpp file on its own differ from the sequences generated for the same file when it is processed together with other cpp files?

tokens

I ran data extraction to mine path-contexts from the source files:
java -jar attribution/pathminer/extract-path-contexts.jar snapshot
--project datasets/java40/
--output processed/java40/
--java-parser antlr
--maxContexts 1000 --maxL 8 --maxW 3

This outputs four CSV files.
The total number of IDs in tokens.csv is not equal to the number of unique tokens reported for Java in Table 1 of the paper "Authorship Attribution of Source Code: A Language-Agnostic Approach and Applicability in Software Engineering". In addition, the total number of IDs in paths.csv is not equal to the number of unique paths in Table 1. Should these quantities be equal?
