jetbrains-research / authorship-detection
Evaluation of source authorship attribution tool
License: MIT License
Hello,
The paper seems very interesting to me.
Could you please give a step-by-step guide for running the model on one of the sample datasets (e.g., the cpp samples)?
Hello, I am very interested in the evaluation on the collected datasets in Section 7 of "Authorship Attribution of Source Code: A Language-Agnostic Approach and Applicability in Software Engineering", and I want to reproduce this experiment. Could you share the datasets with separated working contexts and the time-separated datasets used in the paper?
After extracting the four CSV files from the Python dataset with the command:
java -jar attribution/pathminer/extract-path-contexts.jar snapshot
--project datasets\gcjpy/
--output processed\gcjpy_antlr3_8_2\py/
--java-parser antlr
--maxContexts 2000 --maxH 8 --maxW 3
I launch the classification with:
python run_classification.py configs\gcjpy\nn\cv_32_hidden_10_epochs.yaml
and I get an error:
"KeyError: 'token'", as if there were a mismatch in the indices. Am I doing something wrong?
Can you provide the C++ and Python datasets?
Hello, I have a question about "Authorship Attribution of Source Code: A Language-Agnostic Approach and Applicability in Software Engineering". The paper says: "To speed up the PbNN's computations, at each training iteration we only take up to 500 random path-contexts for each sample." How are the up to 500 random path-contexts selected for each sample? Does the selection for a single sample depend on the other samples in the dataset?
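For reference, one way to read the quoted sentence is plain per-sample subsampling without replacement. The sketch below is not the authors' implementation; the function names and the representation of a path-context as a (start token ID, path ID, end token ID) tuple are assumptions made for illustration. Under this reading, each sample is drawn independently of every other sample, and samples with 500 or fewer path-contexts are left unchanged.

```python
import random

def sample_path_contexts(contexts, max_contexts=500, rng=None):
    """Return at most max_contexts path-contexts, drawn without replacement.

    `contexts` is assumed to be a sequence of (start_token_id, path_id,
    end_token_id) tuples for ONE sample (hypothetical representation).
    """
    rng = rng or random
    contexts = list(contexts)
    if len(contexts) <= max_contexts:
        # Small samples are kept whole; nothing is padded or repeated.
        return contexts
    return rng.sample(contexts, max_contexts)

def sample_batch(batch, max_contexts=500, rng=None):
    """Subsample every sample in a batch independently of the others."""
    return [sample_path_contexts(c, max_contexts, rng) for c in batch]
```

Calling `sample_batch` anew at every training iteration would give each sample a fresh random subset each time, which is how "at each training iteration" is interpreted here.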
Hello. We want to test the model produced by training, but the tokens.csv and paths.csv generated for the test set do not match those generated for the training set. The numbers of tokens and paths are completely different, so the trained model cannot be reused or tested on new data. Could you share the source code of the program (attribution/pathminer/extract-path-contexts.jar) that generates these CSV files? We would like to adjust it so that the model can be reused. Thank you.
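A possible workaround that does not require changing the extractor is to translate the test-set IDs into the training-set IDs after extraction. The sketch below is only an illustration under stated assumptions: it assumes the vocabularies have already been loaded into dictionaries mapping an integer ID to its token or path string (the actual CSV layout is not described in this thread), and `UNKNOWN_ID` is a hypothetical ID reserved for values never seen during training.

```python
UNKNOWN_ID = 0  # hypothetical ID reserved for out-of-vocabulary values

def build_id_remap(train_vocab, test_vocab, unknown_id=UNKNOWN_ID):
    """Map each test-set ID to the training-set ID of the same string.

    Both arguments are assumed to be dicts of {id: string}. Strings that
    do not occur in the training vocabulary are mapped to unknown_id.
    """
    value_to_train_id = {value: idx for idx, value in train_vocab.items()}
    return {
        test_id: value_to_train_id.get(value, unknown_id)
        for test_id, value in test_vocab.items()
    }

def remap_contexts(contexts, token_remap, path_remap):
    """Rewrite (start_token_id, path_id, end_token_id) tuples using
    training-set IDs (the tuple layout is an assumption)."""
    return [
        (token_remap[start], path_remap[path], token_remap[end])
        for start, path, end in contexts
    ]
```

For example, with a training token vocabulary `{1: "foo", 2: "bar"}` and a test token vocabulary `{1: "bar", 2: "baz"}`, the test ID 1 is rewritten to 2 and the unseen token "baz" falls back to `UNKNOWN_ID`. The model would also need to reserve an embedding for that fallback ID.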
After running data extraction to mine path-contexts from the source files,
java -jar attribution/pathminer/extract-path-contexts.jar snapshot
--project datasets/java40/
--output processed/java40/
--java-parser antlr
--maxContexts 1000 --maxL 8 --maxW 3
the tool outputs four CSV files.
However, the total number of IDs in tokens.csv is not equal to the number of unique tokens reported for Java in Table 1 of the paper "Authorship Attribution of Source Code: A Language-Agnostic Approach and Applicability in Software Engineering". In addition, the total number of IDs in paths.csv is not equal to the number of unique paths in Table 1. Should these quantities be equal?