doofuslarge / lscp
A lightweight source code preprocessor
Currently, setting option "doIdentifiers" implies that option "doStringLiterals" will also be set, due to the way that the getIdentifers method is implemented. Provide a better implementation of getIdentifiers, and a new implementation of getStringLiterals.
In bug reports or emails, for example, we may want to detect identifier names (camelCase, under_score) and leave them unprocessed. This will help any IR model link the bug report or email to the right file.
Currently, the implementation is quite simple: it stores all terms in a single long string, e.g.:
"one two three four five"
Each method takes this string as input, and returns this (modified) string as output.
Perhaps it would be more efficient and cleaner to maintain a list of tokens that is passed around. This way, we can add attributes to each token (like what its original form was, which steps were applied to it, etc.) for better logging. Just an idea.
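A minimal sketch of the token-list idea, in Python rather than lscp's Perl. The `Token` class, `tokenize`, and `apply_step` are hypothetical names invented for illustration and are not part of lscp.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Token:
    """One term plus a record of how it was transformed (hypothetical structure)."""
    text: str
    original: str
    steps: List[str] = field(default_factory=list)


def tokenize(line: str) -> List[Token]:
    # A whitespace split mirrors the current "one two three" string representation.
    return [Token(text=w, original=w) for w in line.split()]


def apply_step(tokens: List[Token], name: str, fn: Callable[[str], str]) -> List[Token]:
    """Apply fn to each token, logging the step name only when the text changes."""
    for tok in tokens:
        new_text = fn(tok.text)
        if new_text != tok.text:
            tok.text = new_text
            tok.steps.append(name)
    return tokens


tokens = apply_step(tokenize("One two THREE"), "lowercase", str.lower)
```

Each token keeps its original form and the list of steps that changed it, which is the information the logging idea above would need.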
If we have an identifier, like JText, we want it to be split (tokenized) into J and Text, not JT and ext.
Thanks to [email protected] for noticing this.
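One possible way to get the desired split is sketched below in Python, assuming a regex-based splitter. The function name `split_identifier` is hypothetical, and this is not how lscp currently does it.

```python
import re

# An uppercase run ends before the capital that starts a lowercase word,
# so "JText" yields "J" + "Text" rather than "JT" + "ext".
_PARTS = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+")


def split_identifier(identifier: str) -> list:
    """Split camelCase and under_score identifiers into their parts."""
    return _PARTS.findall(identifier)


print(split_identifier("JText"))        # ['J', 'Text']
print(split_identifier("under_score"))  # ['under', 'score']
```

The negative lookahead makes an all-caps run give back its last letter when that letter begins a capitalized word.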
It would be nice to have a statistical summary of what the preprocessing produced:
How many words were found?
How many identifiers were split?
How many stopwords were removed?
How many email addresses were removed?
How many words are left?
Etc.
The way stopwords are checked, words are first lowercased, and they remain so, even if the option "doLowerCase" is not set. Remove this implied behavior.
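A sketch of the intended behavior, in Python: lowercase only for the comparison and leave the surviving words as they were. The stopword set and the function name are illustrative assumptions.

```python
# Tiny illustrative stopword set; a real list would be much longer.
STOPWORDS = {"the", "a", "of"}


def remove_stopwords(words):
    # Lowercase only for the membership test; surviving words keep their case.
    return [w for w in words if w.lower() not in STOPWORDS]


print(remove_stopwords(["The", "Parser", "of", "JText"]))  # ['Parser', 'JText']
```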
Add functionality to allow rare terms (e.g., occurring in less than 1% of the files) or common terms (e.g., occurring in more than 80% of the files) to be removed.
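A rough Python sketch of such a filter, assuming each file has already been reduced to a list of terms. The function name and the `min_df`/`max_df` parameter names are hypothetical; the default thresholds are the 1% and 80% examples from the text.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=0.01, max_df=0.80):
    """Drop terms found in fewer than min_df or more than max_df of the docs.

    docs is a list of term lists, one per file.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per file
    keep = {term for term, count in df.items() if min_df <= count / n <= max_df}
    return [[term for term in doc if term in keep] for doc in docs]
```

Because the counts are per file rather than per occurrence, this needs a first pass over the whole corpus before any file can be filtered.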
The way getIdentifers and getComments are currently implemented, the resulting output will contain all identifiers, and then all the comments, appearing out of order from the original file. This is fine for bag-of-words IR models like VSM, LSI, or LDA. But some new IR models will use word order information, so we need to fix this.
See papers by Alberto Bacchelli or Nic Bettenburg for ways to enhance our email preprocessing, e.g., island grammars to automatically detect source code snippets; learning signature patterns; detecting stack traces or command output; etc.
Currently, "it's" will result in "it" and "s" if the option doRemovePunctuation is set to 1. A better approach is to yield "it" and "is". Ditto for other common contractions.
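One way this could work is to expand contractions before punctuation is removed. The Python sketch below assumes a small hand-written lookup table; the table and the function name are illustrative only. Note that "it's" can also mean "it has", which a plain lookup cannot tell apart.

```python
import re

# Small illustrative table; a real one would cover many more contractions.
CONTRACTIONS = {
    "it's": "it is",
    "don't": "do not",
    "can't": "cannot",
    "we're": "we are",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    # Matching ignores case, but the replacement is always the lowercase expansion.
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


print(expand_contractions("it's fine, don't worry"))  # it is fine, do not worry
```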
Add the ability for the user to specify which file extensions should be considered in the input directory.
Remove all <code> tags, and everything that's in between.
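A minimal Python sketch, assuming a regular expression is good enough for the input at hand. The function name is hypothetical, and nested <code> elements are not handled.

```python
import re

# Non-greedy match from an opening <code ...> to the nearest closing </code>.
_CODE_BLOCK = re.compile(r"<code\b[^>]*>.*?</code\s*>", re.IGNORECASE | re.DOTALL)


def remove_code_blocks(text: str) -> str:
    # Replace each block with a space so the neighbouring words do not fuse.
    return _CODE_BLOCK.sub(" ", text)
```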
In the future, we may employ automatic acronym expansion, so we need a way to keep acronyms from disappearing into the background. Right now, if they get lowercased, there's no way for us to know that they were once acronyms.
We want to have the option to ignore comments that have to do with copyrights, disclaimers, legal issues, etc., as they don't add any value to the IR models. I used to have this functionality when I used the xscc.awk script, but since I moved to Perl-only, I need to add this back in.
Currently, the list of keyword stopwords includes keywords from C, C++, and Java. Split these up and auto-detect which to use, based on file extension.
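A sketch of extension-based selection in Python. The keyword sets are deliberately tiny and the extension mapping is an assumption. In particular, `.h` is ambiguous between C and C++ and is mapped to C here only for illustration.

```python
import os

# Deliberately tiny keyword sets; the real lists would be much longer.
KEYWORDS = {
    "c": {"int", "return", "struct", "typedef"},
    "cpp": {"int", "return", "class", "namespace", "template"},
    "java": {"int", "return", "class", "extends", "implements"},
}

# Hypothetical mapping from file extension to language.
EXTENSION_TO_LANGUAGE = {
    ".c": "c", ".h": "c",
    ".cpp": "cpp", ".cc": "cpp", ".hpp": "cpp",
    ".java": "java",
}


def keywords_for(path: str) -> set:
    """Return the keyword stopwords for a file, or an empty set if unknown."""
    ext = os.path.splitext(path)[1].lower()
    lang = EXTENSION_TO_LANGUAGE.get(ext)
    return KEYWORDS.get(lang, set())
```

Files with an unrecognized extension get no keyword stopwords at all. Falling back to the combined list, which matches the current behavior, would be another option.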
Add an option, doRemoveHTMLTags, that will remove any and all HTML tags, but not what's between an opening and closing tag.
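A minimal Python sketch of what such an option might do, assuming a regex approach. The function name is hypothetical. HTML comments and attribute values that contain ">" are not handled.

```python
import re

# An opening or closing tag: "<", optional "/", a letter, then anything up to ">".
_TAG = re.compile(r"</?[A-Za-z][^>]*>")


def remove_html_tags(text: str) -> str:
    # Only the tags are dropped; the text between them is kept.
    return _TAG.sub(" ", text)
```

Requiring a letter right after "<" leaves comparisons such as "a < b" untouched.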