doofuslarge / lscp

A lightweight source code preprocessor

Perl 91.87% Java 7.17% C 0.96%

lscp's People

stepthom henglicad mkmknd

lscp's Issues

Decouple string literals from identifiers

Currently, setting the option "doIdentifiers" implies that the option "doStringLiterals" will also be set, due to the way that the getIdentifers method is implemented. Provide a better implementation of getIdentifiers, and a new implementation of getStringLiterals.
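
One possible way to separate the two concerns, sketched in Python rather than lscp's Perl. The function names mirror the ones mentioned above, but the regexes are simplified assumptions (double-quoted strings only, no comment handling, and language keywords still count as identifiers):

```python
import re

# Simplified pattern: double-quoted literals with backslash escapes (assumption).
STRING_RE = re.compile(r'"(?:\\.|[^"\\])*"')
IDENT_RE = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')

def get_string_literals(source):
    """Return only the contents of string literals."""
    return [m.group(0)[1:-1] for m in STRING_RE.finditer(source)]

def get_identifiers(source):
    """Return identifiers found outside string literals."""
    without_strings = STRING_RE.sub(' ', source)
    return IDENT_RE.findall(without_strings)
```

With this split, each option can call its own function, and neither needs the other to be enabled.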

Allow option for identifier names to be preserved (i.e., not stemmed)

In bug reports or emails, for example, we may want to detect identifier names (camelCase, under_score) and not preprocess them. This will help any IR model link the bug report or email to the right file.
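
A rough Python sketch of the idea (not lscp's Perl code). The heuristic and both function names are assumptions: a word counts as identifier-like if it has a lowercase letter or digit directly followed by an uppercase letter, or an underscore between alphanumerics.

```python
import re

# Heuristic (assumption): lower-to-upper transition, or an inner underscore.
IDENTIFIER_LIKE = re.compile(r'[a-z0-9][A-Z]|[A-Za-z0-9]_[A-Za-z0-9]')

def looks_like_identifier(word):
    return bool(IDENTIFIER_LIKE.search(word))

def protect_identifiers(words, transform):
    """Apply transform (e.g. a stemmer) only to words that are not identifier-like."""
    return [w if looks_like_identifier(w) else transform(w) for w in words]
```

For example, `protect_identifiers(["getUserName", "Crashes", "user_id"], str.lower)` leaves the two identifiers alone and lowercases only "Crashes". Single-word identifiers such as "count" are not detected by this heuristic.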

Refactor to be token-based, not string-based?

Currently, the implementation is quite simple: all terms are stored in a single long string, e.g.:

"one two three four five"

Each method takes this string as input, and returns this (modified) string as output.

Perhaps it would be more efficient and cleaner to maintain a list of tokens that are passed around. This way, we can add attributes to each token (like what its original form was, which steps were applied to it, etc.) for better logging. Just an idea.
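
To make the idea concrete, here is a minimal Python sketch (lscp itself is Perl). The `Token` class, its fields, and the helper names are all hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    text: str                                   # current form of the term
    original: str                               # form as it appeared in the input
    steps: list = field(default_factory=list)   # names of steps that changed it

def tokenize_terms(line):
    return [Token(text=w, original=w) for w in line.split()]

def apply_step(tokens, name, func):
    """Apply func to every token's text, recording the step when the text changes."""
    for tok in tokens:
        new_text = func(tok.text)
        if new_text != tok.text:
            tok.text = new_text
            tok.steps.append(name)
    return tokens
```

Each preprocessing method would then take and return a list of tokens, and the `steps` list gives a per-token log for free.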

Incorrect handling of AAa splitting (tokenize) pattern

If we have an identifier, like JText, we want it to be split (tokenized) into J and Text, not JT and ext.

Thanks to [email protected] for noticing this.
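
The desired behaviour can be sketched with two zero-width boundaries: one between a lowercase letter (or digit) and an uppercase letter, and one between an uppercase letter and an uppercase letter that starts a lowercase run. This is a Python illustration, not the Perl fix itself, and `split_identifier` is a made-up name:

```python
import re

BOUNDARY_RE = re.compile(
    r'(?<=[a-z0-9])(?=[A-Z])'      # aA: camel|Case
    r'|(?<=[A-Z])(?=[A-Z][a-z])'   # AAa: J|Text, XML|Parser
)

def split_identifier(identifier):
    # Insert a space at every boundary, then split on whitespace.
    return BOUNDARY_RE.sub(' ', identifier).split()
```

With this pattern, `split_identifier("JText")` gives `["J", "Text"]` and `split_identifier("XMLParser")` gives `["XML", "Parser"]`.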

Better statistics on word output

It would be nice to have a statistical summary of what the preprocessing produced:

How many words were found?
How many identifiers were split?
How many stopwords were removed?
How many email addresses were removed?
How many words are left?

Etc.
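
One way to collect these numbers is a small counter object that each step updates. The Python sketch below is only an illustration; the class, its field names, and the helper are assumptions, not existing lscp code:

```python
from dataclasses import dataclass

@dataclass
class PreprocessStats:
    words_found: int = 0
    identifiers_split: int = 0
    stopwords_removed: int = 0
    emails_removed: int = 0
    words_left: int = 0

    def summary(self):
        # One "name: value" line per counter, in declaration order.
        return "\n".join(f"{name}: {value}" for name, value in vars(self).items())

def remove_stopwords_counted(words, stopwords, stats):
    """Example step: drop stopwords and update the counters as a side effect."""
    kept = [w for w in words if w.lower() not in stopwords]
    stats.words_found += len(words)
    stats.stopwords_removed += len(words) - len(kept)
    stats.words_left += len(kept)
    return kept
```

Printing `stats.summary()` at the end of a run would give the kind of report asked for above.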

Stopwords implies lowercase; remove this implication

The way stopwords are checked, words are first lowercased, and they remain so, even if the option "doLowerCase" is not set. Remove this implied behavior.
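
The fix amounts to lowercasing only for the comparison, not for the output. A Python sketch of that separation (the function and parameter names are illustrative, and the stopword set is assumed to be stored in lowercase):

```python
def remove_stopwords(words, stopwords, do_lower_case=False):
    # Lowercase a copy for the lookup only; the kept word keeps its case.
    kept = [w for w in words if w.lower() not in stopwords]
    # Lowercasing the output is a separate, explicit choice.
    return [w.lower() for w in kept] if do_lower_case else kept
```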

Allow pruning of rare and common words

Add functionality to allow rare words (e.g., occurring in less than 1% of the files) or common words (e.g., occurring in more than 80% of the files) to be removed.
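
This is a document-frequency filter. A minimal Python sketch, assuming each file has already been reduced to a list of words (the function name and parameter names are hypothetical; the default thresholds are the example values above):

```python
from collections import Counter

def prune_by_document_frequency(docs, min_fraction=0.01, max_fraction=0.80):
    """docs is a list of word lists, one per file."""
    n_docs = len(docs)
    df = Counter()
    for words in docs:
        df.update(set(words))          # count each word once per file
    keep = {w for w, c in df.items()
            if min_fraction <= c / n_docs <= max_fraction}
    return [[w for w in words if w in keep] for words in docs]
```

Note that this needs two passes over the corpus: one to count, one to filter.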

Preserve word order when doing comments, idents, and strings

The way getIdentifers and getComments are currently implemented, the resulting output will contain all identifiers, and then all the comments, appearing out of order from the original file. This is fine for bag-of-words IR models like VSM, LSI, or LDA. But some new IR models will use word order information, so we need to fix this.
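
One way to keep the original order is a single left-to-right scan that classifies each match, instead of one pass per kind. The Python sketch below uses simplified C/Java-style patterns (an assumption) and a hypothetical function name:

```python
import re

TOKEN_RE = re.compile(r'''
    (?P<comment>//[^\n]*|/\*.*?\*/)
  | (?P<string>"(?:\\.|[^"\\])*")
  | (?P<identifier>[A-Za-z_][A-Za-z0-9_]*)
''', re.VERBOSE | re.DOTALL)

def extract_in_order(source, kinds=("comment", "string", "identifier")):
    """Return (kind, text) pairs in source order, limited to the requested kinds."""
    return [(m.lastgroup, m.group())
            for m in TOKEN_RE.finditer(source)
            if m.lastgroup in kinds]
```

Because comments and strings are matched before identifiers, words inside them are not reported a second time as identifiers.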

Much better email handling

See papers by Alberto Bacchelli or Nic Bettenburg for ways to enhance our email preprocessing, e.g., island grammars to automatically detect source code snippets, learning signature patterns, and detecting stack traces or command output.

Detect and handle contractions (isn't, it's, didn't) better.

Currently, "it's" will result in "it" and "s" if the option doRemovePunctuation is set to 1. A better approach is to yield "it" and "is". Ditto for other common contractions.
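
A lookup table applied before punctuation removal would do this. The Python sketch below is illustrative only: the table is a tiny sample, the names are made up, and "it's" is always expanded to "it is" even though it can also mean "it has".

```python
import re

# Tiny sample table (assumption); a real list would be much longer.
CONTRACTIONS = {
    "it's": "it is",
    "isn't": "is not",
    "didn't": "did not",
}

CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text):
    # The expansion comes out lowercased, whatever the case of the input.
    return CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)
```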

Add ability to filter file types in input directory

Add the ability for the user to specify which file extensions should be considered in the input directory.
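
A Python sketch of such a filter (not lscp's Perl code). The function name is hypothetical, and whether subdirectories should be searched is an assumption; this version recurses:

```python
from pathlib import Path

def select_files(input_dir, extensions):
    """Return files under input_dir whose extension is in the given list.

    Extensions may be given with or without the leading dot; matching
    ignores case.
    """
    wanted = {e.lower() if e.startswith(".") else "." + e.lower()
              for e in extensions}
    return sorted(p for p in Path(input_dir).rglob("*")
                  if p.is_file() and p.suffix.lower() in wanted)
```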

Be able to strip out the <code> HTML tag

Remove all <code> tags, and everything that's in between.
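
For simple cases a non-greedy regex is enough. This Python sketch (with a made-up function name) assumes `<code>` elements are not nested and are properly closed:

```python
import re

# Matches <code ...> through the nearest </code>, across line breaks.
CODE_BLOCK_RE = re.compile(r'<code\b[^>]*>.*?</code\s*>',
                           re.IGNORECASE | re.DOTALL)

def strip_code_blocks(text):
    # Replace with a space so the surrounding words do not run together.
    return CODE_BLOCK_RE.sub(' ', text)
```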

Allow option for acronyms to be preserved (i.e., not stemmed)

In the future, we may employ automatic acronym expansion, so we need a way to keep these from disappearing into the background. Right now, if they get lowercased, then there's no way for us to know that they were once acronyms.
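
A minimal Python sketch of the idea, assuming an acronym is any word of two or more uppercase letters, optionally followed by digits. That definition and the function name are assumptions, and ordinary words typed in all caps would be preserved too:

```python
import re

ACRONYM_RE = re.compile(r'^[A-Z]{2,}[0-9]*$')

def lowercase_preserving_acronyms(words):
    """Lowercase every word except those that look like acronyms."""
    return [w if ACRONYM_RE.match(w) else w.lower() for w in words]
```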

Add functionality to prune Copyright comments

We want to have the option to ignore comments that have to do with copyrights, disclaimers, legal issues, etc., as they don't add any value to the IR models. I used to have this functionality when I used the xscc.awk script, but since I moved to Perl-only, I need to add this back in.
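
A keyword filter over already-extracted comments is one simple approach. In this Python sketch the keyword list and the function name are assumptions, and the filter is deliberately blunt: any comment that mentions one of the terms is dropped, even a legitimate one.

```python
import re

# Illustrative keyword list (assumption), matched case-insensitively.
LEGAL_RE = re.compile(
    r'\b(copyright|licen[cs]ed?|all rights reserved|warranty|disclaimer)\b|\(c\)',
    re.IGNORECASE,
)

def drop_legal_comments(comments):
    """Keep only comments that contain none of the legal keywords."""
    return [c for c in comments if not LEGAL_RE.search(c)]
```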

Split up keyword stopwords by programming language, and auto-detect which to use

Currently, the list of keyword stopwords includes keywords from C, C++, and Java. Split these up and auto-detect which to use, based on file extension.
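
The auto-detection could be a plain extension lookup. The Python sketch below is illustrative only: the keyword sets are heavily abbreviated, and the extension mapping and names are assumptions.

```python
import os

# Abbreviated keyword sets (assumption); real lists would be complete.
KEYWORDS = {
    "c": {"int", "return", "struct", "typedef"},
    "cpp": {"int", "return", "class", "namespace", "template"},
    "java": {"int", "return", "class", "extends", "implements"},
}

EXTENSION_TO_LANGUAGE = {
    ".c": "c", ".h": "c",
    ".cpp": "cpp", ".cc": "cpp", ".hpp": "cpp",
    ".java": "java",
}

def keyword_stopwords_for(filename):
    """Pick the keyword stopword set from the file extension (empty if unknown)."""
    ext = os.path.splitext(filename)[1].lower()
    return KEYWORDS.get(EXTENSION_TO_LANGUAGE.get(ext), set())
```

Mapping `.h` to C is a guess; header files are ambiguous between C and C++.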

Be able to strip out HTML tags

Add an option, doRemoveHTMLTags, that will remove any and all HTML tags, but not what's between an opening and closing tag.
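
A regex-based Python sketch of what doRemoveHTMLTags could do (the function name is hypothetical). It removes the tags themselves and keeps the text between them; a bare "<" that does not start a tag is left alone. Malformed markup, or ">" inside attribute values, would need a real parser such as Python's `html.parser`.

```python
import re

# An opening or closing tag: "<", optional "/", a letter, then up to ">".
HTML_TAG_RE = re.compile(r'</?[A-Za-z][^>]*>')

def remove_html_tags(text):
    # Replace with a space so adjacent words stay separated.
    return HTML_TAG_RE.sub(' ', text)
```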