jetbrains-research / authorship-detection

Evaluation of source authorship attribution tool

License: MIT License

Kotlin 16.73% C++ 9.68% Python 53.37% Shell 0.04% Java 20.18%

authorship-detection's Issues

How to run the code for the sample data given in the datasets folder?

Hello,
The paper seems very interesting to me.
Could you please give a step-by-step guide for running the model on the sample dataset (e.g., the cpp samples)?

evaluation on the collected datasets

Hello, I am very interested in the evaluation on the collected datasets in Section 7 of "Authorship Attribution of Source Code: A Language-Agnostic Approach and Applicability in Software Engineering". I want to reproduce this experiment. Can you share the datasets with separated work contexts and the time-separated datasets used in the paper?

Error in reading token.csv

After extracting the four CSV files from the Python dataset with the command:

java -jar attribution/pathminer/extract-path-contexts.jar snapshot
--project datasets\gcjpy/
--output processed\gcjpy_antlr3_8_2\py/
--java-parser antlr
--maxContexts 2000 --maxH 8 --maxW 3

I launch the classification with:
python run_classification.py configs\gcjpy\nn\cv_32_hidden_10_epochs.yaml

and I get the error "KeyError: 'token'", as if there were an index mismatch. Am I doing something wrong?
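One way to narrow down an error like this is to look at the header row of the extracted file and compare it with the column name the loader asks for. The sketch below assumes the file is comma-separated with a header row; the helper name `missing_columns` is hypothetical and not part of the repository, and the only column name taken from the report is 'token'.

```python
import csv
import io


def missing_columns(csv_text, required):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty list if the file has no rows at all
    # Drop a possible byte-order mark and stray whitespace, either of which
    # can make a column such as 'token' look absent to a dictionary lookup.
    present = {name.lstrip("\ufeff").strip() for name in header}
    return [name for name in required if name not in present]
```

Calling it on the text of the extracted tokens file with `["token"]` returns an empty list when the column is present. A non-empty result would suggest that the extractor and the Python loader disagree on the file layout, for example because they come from different versions.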

datasets

Can you provide the C++ and Python datasets?

Error building the pathminer package

Hi! I am trying to reproduce your results and ran into a problem when rebuilding the pathminer Kotlin project. It depends on a package named astminer, but the Gradle build can't locate it. Is it an additional package that I have to download from somewhere else?

500 random path-contexts

Hello, I have a question about "Authorship Attribution of Source Code: A Language-Agnostic Approach and Applicability in Software Engineering". The paper states: "To speed up the PbNN’s computations, at each training iteration we only take up to 500 random path-contexts for each sample." How are the up to 500 random path-contexts chosen for each sample? Is the selection for a single sample affected by the other samples?
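One plausible reading of the quoted sentence is that each sample is subsampled on its own, without looking at any other sample. The sketch below illustrates that reading only; it is not the authors' implementation, and the function name `sample_path_contexts` and its parameters are hypothetical.

```python
import random


def sample_path_contexts(contexts, max_contexts=500, rng=None):
    """Return at most max_contexts randomly chosen path-contexts of one sample."""
    rng = rng if rng is not None else random
    contexts = list(contexts)
    if len(contexts) <= max_contexts:
        # Small samples are kept whole; nothing is padded or duplicated.
        return contexts
    # Sampling without replacement, using only this sample's own contexts.
    return rng.sample(contexts, max_contexts)
```

Under this reading the function would be called once per sample at every training iteration, so a sample with more than 500 path-contexts contributes a different random subset each time, and the choice for one sample does not depend on the rest of the dataset.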

test model

Hello. We want to test the model produced by training, but the tokens.csv and paths.csv generated for the test set do not match the tokens.csv and paths.csv of the training set. The numbers of tokens and paths are completely different, so the trained model cannot be reused or tested on new data. Can you share the source code of attribution/pathminer/extract-path-contexts.jar, which generates these CSV files? We would like to adjust the program so that the model can be reused. Thank you.
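A possible workaround that does not require changing the extractor is to translate the test-set IDs into the training-set IDs by matching the token (or path) strings. The sketch below assumes that each vocabulary file can be loaded as a mapping between IDs and strings and that the model can accept a reserved ID for unseen entries. Both are assumptions, and `remap_to_train_ids` and `unknown_id` are hypothetical names.

```python
def remap_to_train_ids(test_id_to_token, train_token_to_id, unknown_id=0):
    """Map each test-set ID to the training-set ID of the same token string.

    Tokens that never occurred in the training vocabulary are mapped to
    unknown_id, a placeholder the model would have to be trained to handle.
    """
    return {
        test_id: train_token_to_id.get(token, unknown_id)
        for test_id, token in test_id_to_token.items()
    }
```

The same function would apply to paths. The resulting mapping can then be used to rewrite the IDs in the test set's path-contexts before they are fed to the trained model.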

path context sequences

Hi, why do the path-context sequences generated for a cpp file differ depending on whether the file is processed on its own or together with other cpp files?

tokens

After running data extraction to mine path-contexts from the source files,
java -jar attribution/pathminer/extract-path-contexts.jar snapshot
--project datasets/java40/
--output processed/java40/
--java-parser antlr
--maxContexts 1000 --maxL 8 --maxW 3

I get four CSV files as output.
The total number of IDs in tokens.csv is not equal to the number of unique tokens reported for Java in Table 1 of the paper "Authorship Attribution of Source Code: A Language-Agnostic Approach and Applicability in Software Engineering". In addition, the total number of IDs in paths.csv is not equal to the number of unique paths in Table 1. Should these quantities be equal?
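For a comparison like this it helps to count distinct values rather than rows, in case the file contains repeated entries. The sketch below assumes a comma-separated file with a header row; the column name `id` used in the usage note is an assumption about the file layout, and `count_distinct` is a hypothetical helper.

```python
import csv
import io


def count_distinct(csv_text, column):
    """Count the distinct values in one named column of a CSV document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return len({row[column] for row in reader})
```

Running `count_distinct(text, "id")` on the contents of tokens.csv and paths.csv gives the numbers to compare against Table 1. Note that the counts also depend on the extraction settings, such as the limits passed on the command line, so a mismatch does not by itself indicate an error.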
