This project forked from electronart/wordlistanalyser


To assist optimizing stemming rules and for comparing stemmers.

Home Page: http://www.dtsearch.co.uk/products/list-analyser.aspx

License: Apache License 2.0

C# 100.00%

wordlistanalyser's Introduction

WordListAnalyzer Project

WELCOME!

Please read the LICENSE.TXT and other docs in the /docs folder.

What is it for?

It was developed primarily to assist in optimizing stemming rules in various languages and for comparing stemmers. It compares two word lists of the same length, normally the input and output of a stemmer, although it could be useful for other purposes. It displays the input and output lists with measures of similarity and difference, and calculates measures of stemmer strength, under- and over-stemming counts, and error rate relative to truncation (ERRT) according to the method described by Chris D. Paice of Lancaster University (Chris Paice, 'Method for Evaluation of Stemming Algorithms Based on Error Counting': http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.89.9560&rep=rep1&type=pdf ).
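The under- and over-stemming counts can be illustrated with a short sketch. The Python below is not taken from the Word List Analyzer source (which is C#); it is a minimal illustration of the pair-counting indices as described in Paice's paper. It assumes a third list that gives each word a concept-group label, and it assumes the input words are distinct. The function name paice_indices and the sample words, stems and group labels are hypothetical. ERRT additionally needs a baseline built from truncation stemmers and is not computed here.

```python
from collections import Counter


def paice_indices(words, stems, groups):
    """Return (UI, OI) for three parallel lists.

    words  -- the stemmer input (assumed to contain no duplicates)
    stems  -- the stemmer output, one stem per input word
    groups -- a concept-group label for each input word (an assumption:
              the README itself only mentions 'grouped' word lists)
    """
    if not (len(words) == len(stems) == len(groups)):
        raise ValueError("the three lists must have the same length")
    total = len(words)

    group_sizes = Counter(groups)
    stem_sizes = Counter(stems)
    # Number of words falling in each (group, stem) cell.
    cell_counts = Counter(zip(groups, stems))

    # Desired merge total: word pairs inside the same concept group.
    gdmt = sum(0.5 * n * (n - 1) for n in group_sizes.values())
    # Desired non-merge total: word pairs that span two concept groups.
    gdnt = sum(0.5 * n * (total - n) for n in group_sizes.values())

    # Unachieved merges: same group, but the stemmer gave different stems.
    gumt = sum(0.5 * c * (group_sizes[g] - c)
               for (g, _s), c in cell_counts.items())
    # Wrong merges: same stem, but the words belong to different groups.
    gwmt = sum(0.5 * c * (stem_sizes[s] - c)
               for (_g, s), c in cell_counts.items())

    ui = gumt / gdmt if gdmt else 0.0  # understemming index
    oi = gwmt / gdnt if gdnt else 0.0  # overstemming index
    return ui, oi


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    words = ["connect", "connected", "connection",
             "general", "generally", "generate", "generated"]
    stems = ["connect", "connect", "connection",
             "gener", "gener", "gener", "gener"]
    groups = ["A", "A", "A", "B", "B", "C", "C"]
    ui, oi = paice_indices(words, stems, groups)
    print(f"UI={ui:.2f} OI={oi:.2f}")  # prints "UI=0.40 OI=0.25"
```

In this sample, 2 of the 5 desired merges are missed ('connection' keeps its own stem) and 4 of the 16 cross-group pairs are wrongly merged (the 'general' and 'generate' groups share the stem 'gener').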

HISTORY

Word List Analyzer started life as an internal tool used while developing stemming rules in various languages for the dtSearch Engine. It was released in 2013 (as List Analyser 1.0 'beta') free of charge to customers of dtSearch Corp and 'dtSearch UK' and to academic users. It is a single executable written in C# (originally using the open source IDE SharpDevelop, http://www.icsharpcode.net/ ). The last update was in 2016 (1.1.5876 beta), which added Error Rate Relative to Truncation (ERRT), a method devised by Chris Paice of Lancaster University. The December 2018 Release build on GitHub, v1.1.6916, was a rebuild using Visual Studio 2017 with no major code changes. The Stemming Tester 1.4 executable (no source), for use with the List Analyzer (see https://www.dtsearch.co.uk/products/stemming-tester.aspx ), was added to the Release build in March 2019.

Chris Paice (1941 - 2016), obituary: http://wp.lancs.ac.uk/chrispaice/

While testing the ERRT calculations we discovered a couple of bugs in the Java Stemmer Test Software used at Lancaster University, which gave errors in some published academic papers. Consequently we added an option to the File>Preferences menu to emulate the bugs so that the results agree with the academic papers. Details are in the Wiki pages of this repository.

OPEN SOURCE RELEASE 2018

By releasing our code as open source, I hope that others will build on the work of Chris D. Paice and his colleagues. The Wiki contains a section on the errors mentioned above, and also references some later papers that improve on the original ERRT method. We decided to keep the status as 'beta' since the project was primarily for internal use and a limited number of external users. It performs well enough but contains 'hacks' as quick workarounds, and it is definitely not to 'production' standard. Because large 'grouped' word lists are scarce and few academic papers provide verified data on ERRT, we are assuming that our results are accurate because they agree very closely with the papers published by Lancaster University.

You are welcome to use this software for anything you choose, be it academic research, commercial use, a hobby, or just out of curiosity! We hope that you will also contribute to this project. If you have suggestions for improvements or find bugs, please raise an Issue here on GitHub, or email us at [email protected]. The Master branch is protected to prevent deletion; please fork and submit a pull request once you've thoroughly tested your code, thank you. The Wiki contains the full content of the original WebHelp, and includes additional release notes, a wish list and a contributors list.

A final note. Some like to promote the idea that there is some kind of 'war' between open source and closed source software companies. My point of view is that both have their place, and it's a case of 'horses for courses': choose whatever does the job you want it to. We have used and contributed to both open source and closed source software for over 30 years, and were pioneers in 'shareware' distribution in the early '90s. Word List Analyzer has always been, and will continue to be, distributed by us free of charge. Please feel free to add your name to the List of Contributors if you find a bug, make an edit or use it in your research!

Kind Regards
Ray Harris BA MIET
Founder & CEO of ElectronArt Design Ltd
https://electronart.co.uk
https://www.dtsearch.co.uk/products/list-analyser.aspx

wordlistanalyser's People

Contributors

electronart
