lscp's People

Contributors

doofuslarge, tchalmers

lscp's Issues

Decouple string literals from identifiers

Currently, setting the option "doIdentifiers" implies that the option "doStringLiterals" is also set, because of the way the getIdentifers method is implemented. Provide a better implementation of getIdentifiers, and a new implementation of getStringLiterals.
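
One way the two options could be separated is sketched below in Python rather than lscp's Perl. The function names, the regular expressions, and the C-like string syntax are all assumptions made for illustration; this is not how lscp currently works. Note that the identifier pattern also matches language keywords, and comments are not handled here.

```python
import re

# Hypothetical patterns for a C-like language; not lscp's actual rules.
STRING_RE = re.compile(r'"(?:\\.|[^"\\])*"')
IDENT_RE = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')


def get_string_literals(source):
    """Return the contents of double-quoted string literals, without quotes."""
    return [m.group(0)[1:-1] for m in STRING_RE.finditer(source)]


def get_identifiers(source):
    """Return identifiers (and keywords) found outside string literals."""
    # Blank out string literals first so their words are not reported
    # as identifiers.
    without_strings = STRING_RE.sub(' ', source)
    return IDENT_RE.findall(without_strings)
```

With this split, each option can call only its own extractor, so enabling one no longer drags in the other.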

Refactor to be token-based, not string-based?

Currently, the implementation is quite simple: it stores all terms in a single long string, e.g.:

"one two three four five"

Each method takes this string as input, and returns this (modified) string as output.

Perhaps it would be more efficient and cleaner to maintain a list of tokens that is passed around. This way, we can add attributes to each token (such as what its original form was, which steps were applied to it, etc.) for better logging. Just an idea.
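
A minimal sketch of what such a token list might look like, written in Python for illustration. The Token class, its fields, and the two example steps are hypothetical names, not part of lscp.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    text: str        # current (possibly modified) form
    original: str    # form as it appeared in the input
    steps: list = field(default_factory=list)  # names of steps applied so far


def tokenize(line):
    """Split a whitespace-separated string into Token objects."""
    return [Token(text=w, original=w) for w in line.split()]


def lowercase(tokens):
    """Example step: lowercase each token and record that the step changed it."""
    for tok in tokens:
        lowered = tok.text.lower()
        if lowered != tok.text:
            tok.text = lowered
            tok.steps.append("lowercase")
    return tokens


def remove_stopwords(tokens, stopwords):
    """Example step: drop tokens whose current text is a stopword."""
    return [tok for tok in tokens if tok.text not in stopwords]
```

Each step takes a list of tokens and returns a list of tokens, mirroring the current string-in, string-out design while keeping per-token history available for logging.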

Better statistics on word output

It would be nice to have a statistical summary of what the preprocessing resulted in:

How many words were found?
How many identifiers were split?
How many stopwords were removed?
How many email addresses were removed?
How many words are left?

Etc.
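
The counters above could be collected while the steps run. The following Python sketch assumes a simplified pipeline (email removal, camelCase and underscore splitting, lowercasing, stopword removal); the function names, the email pattern, and the counter keys are illustrative assumptions, not lscp's behavior.

```python
import re
from collections import Counter

# Deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')


def split_identifier(word):
    """Split camelCase and under_score identifiers into their parts."""
    spaced = re.sub(r'([a-z0-9])([A-Z])', r'\1 \2', word)
    return spaced.replace('_', ' ').split()


def preprocess_with_stats(words, stopwords):
    """Return the surviving words and a Counter describing what happened."""
    stats = Counter()
    stats["words_found"] = len(words)
    kept = []
    for word in words:
        if EMAIL_RE.fullmatch(word):
            stats["emails_removed"] += 1
            continue
        parts = split_identifier(word)
        if len(parts) > 1:
            stats["identifiers_split"] += 1
        for part in parts:
            part = part.lower()
            if part in stopwords:
                stats["stopwords_removed"] += 1
            else:
                kept.append(part)
    stats["words_left"] = len(kept)
    return kept, stats
```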

Allow pruning of rare and common words

Add functionality to allow rare words (e.g., occurring in less than 1% of the files) or common words (e.g., occurring in more than 80% of the files) to be removed.
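
Pruning of this kind is usually based on document frequency, i.e., the fraction of files in which a word occurs. A minimal Python sketch follows; the function name and the min_df and max_df parameters are assumptions, with defaults taken from the 1% and 80% examples above.

```python
from collections import Counter


def prune_by_document_frequency(docs, min_df=0.01, max_df=0.80):
    """Drop words occurring in fewer than min_df or more than max_df of docs.

    docs is a list of word lists, one per file; thresholds are fractions.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each word at most once per file
    keep = {w for w, c in df.items() if min_df <= c / n_docs <= max_df}
    return [[w for w in doc if w in keep] for doc in docs]
```

Because the thresholds depend on the whole corpus, this step has to run after every file has been preprocessed, not file by file.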

Preserve word order when doing comments, idents, and strings

The way getIdentifers and getComments are currently implemented, the resulting output contains all the identifiers followed by all the comments, so the words appear out of order relative to the original file. This is fine for bag-of-words IR models like VSM, LSI, or LDA. But some newer IR models use word order information, so we need to fix this.
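
One possible approach is to scan the file once with a single combined pattern, so that comments, strings, and identifiers are emitted in the order they occur. The Python sketch below assumes C-like syntax; the pattern, the function name, and the kinds parameter are hypothetical and not part of lscp.

```python
import re

# Hypothetical lexer for a C-like language. Alternatives are tried in
# order at each position, so words inside a comment or string are
# consumed by that comment or string, not reported as identifiers.
TOKEN_RE = re.compile(r'''
    (?P<comment>//[^\n]*|/\*.*?\*/)      # line or block comment
  | (?P<string>"(?:\\.|[^"\\])*")        # double-quoted string literal
  | (?P<ident>[A-Za-z_][A-Za-z0-9_]*)    # identifier or keyword
''', re.VERBOSE | re.DOTALL)


def extract_in_order(source, kinds=("comment", "string", "ident")):
    """Return (kind, text) pairs in the order they appear in source."""
    out = []
    for m in TOKEN_RE.finditer(source):
        kind = m.lastgroup
        if kind in kinds:
            out.append((kind, m.group(kind)))
    return out
```

Passing a subset such as kinds=("comment", "string") selects which categories are kept without changing their relative order.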

Much better email handling

See papers by Alberto Bacchelli or Nic Bettenburg for ways to enhance our email preprocessing, e.g., island grammars to automatically detect source code snippets; learning signature patterns; detecting stack traces or command output; etc.

Add functionality to prune Copyright comments

We want to have the option to ignore comments that have to do with copyrights, disclaimers, legal issues, etc., as they don't add any value to the IR models. I used to have this functionality when I used the xscc.awk script, but since I moved to Perl-only, I need to add this back in.
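
A simple keyword filter might be enough as a first pass. The Python sketch below is an assumption about how such an option could work, not a port of xscc.awk; the keyword list is hypothetical and would need tuning, since a comment that merely mentions one of these words is dropped too.

```python
import re

# Hypothetical keyword list; tune for the corpus at hand.
LEGAL_RE = re.compile(
    r'copyright|all rights reserved|licen[cs]e|warranty|disclaimer',
    re.IGNORECASE,
)


def prune_legal_comments(comments):
    """Return only the comments that do not look like legal boilerplate."""
    return [c for c in comments if not LEGAL_RE.search(c)]
```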

Be able to strip out HTML tags

Add an option, doRemoveHTMLTags, that removes any and all HTML tags but keeps the text between an opening and a closing tag.
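
A rough Python sketch of the intended behavior, using a regular expression. The function name and pattern are assumptions; a regex like this is only an approximation (it also strips generic-type syntax such as List<String> and does not handle a ">" inside attribute values), so a real HTML parser may be preferable. Replacing each tag with a space and then collapsing whitespace is a design choice made here so that adjacent words are not glued together.

```python
import re

# Matches things that look like tags: "<", optional "/", a letter, then
# anything up to the next ">". A bare "<" in "a < b" is left alone.
TAG_RE = re.compile(r'</?[A-Za-z][^>]*>')


def remove_html_tags(text):
    """Remove HTML tags, keep the text between them, and collapse whitespace."""
    without_tags = TAG_RE.sub(' ', text)
    return ' '.join(without_tags.split())
```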
