customized-indonesian-analyzer's Introduction

Customized Indonesian Analyzer For Apache Lucene

Motivation / Motivasi

Apache Lucene already provides an analyzer for Bahasa Indonesia with stopword removal and stemming. However, that analyzer does not let you choose whether stopword removal and/or stemming is applied. This repository provides a customized Indonesian analyzer in which stopword removal and stemming can each be turned on or off. This repository is also used for an Advanced Information Retrieval course assignment at the Faculty of Computer Science, Universitas Indonesia.

Apache Lucene sudah menyediakan analyzer untuk Bahasa Indonesia dengan pembuangan stopword dan imbuhan. Namun, analyzer tersebut tidak menyediakan fitur untuk memilih apakah pembuangan stopword dan/atau imbuhan digunakan atau tidak. Repositori ini menyediakan analyzer Bahasa Indonesia yang dikustomisasi sehingga proses pembuangan stopword dan/atau imbuhan dapat diaktifkan atau dinonaktifkan. Repositori ini juga digunakan untuk tugas kuliah Perolehan Informasi Lanjut di Fakultas Ilmu Komputer Universitas Indonesia.

How to Use / Cara Menggunakan

Indexing / Pengindeksan

  • Put your corpus of text files in a folder; see the corpus-example folder for an example.
  • Change the variables in the IndexFiles file to your own settings. / Ganti variabel di berkas IndexFiles dengan pengaturan Anda.
// Corpus Path of text files to be indexed
public static final String CORPUS_PATH = "corpus-example/";
// Path for index result
public static final String INDEX_PATH = "index-example/";
// True if you want to use stemmer, false otherwise
private static final boolean USE_STEMMER = false;
// True if you want to use stopword removal, false otherwise
private static final boolean USE_STOPWORD = false;
  • Run compile_run_index.bat or compile_run_index.sh to create the index. / Gunakan compile_run_index.bat atau compile_run_index.sh untuk membuat indeks.
  • The resulting index will be written to the folder specified by the INDEX_PATH variable. / Hasil indeks Anda akan tersedia pada folder yang dispesifikasikan pada variabel INDEX_PATH.
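
To illustrate what the USE_STEMMER and USE_STOPWORD flags control, here is a minimal plain-Java sketch of an analysis chain with both steps switchable. It is not the repository's code and does not use Lucene: the class name ToggleAnalyzerSketch, the five-word stopword list, and the toyStem helper (which only strips the suffixes "-nya" and "-kan") are all hypothetical stand-ins, and applying stopword removal before stemming is this sketch's own choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analysis chain: lowercasing, optional stopword removal, optional stemming.
// Not Lucene code; every name in this class is hypothetical.
class ToggleAnalyzerSketch {
    // Hypothetical, tiny stopword list used only for illustration.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("yang", "dan", "di", "ke", "dari"));

    private final boolean useStemmer;
    private final boolean useStopword;

    ToggleAnalyzerSketch(boolean useStemmer, boolean useStopword) {
        this.useStemmer = useStemmer;
        this.useStopword = useStopword;
    }

    // Hypothetical stand-in for a real stemmer: strips only "-nya" and "-kan".
    static String toyStem(String token) {
        for (String suffix : new String[] {"nya", "kan"}) {
            if (token.length() > suffix.length() + 2 && token.endsWith(suffix)) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Lowercase the text, split on non-alphanumeric characters,
    // then apply whichever optional steps are enabled.
    List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            if (useStopword && STOPWORDS.contains(token)) {
                continue; // stopword removal is on and this token is a stopword
            }
            terms.add(useStemmer ? toyStem(token) : token);
        }
        return terms;
    }

    public static void main(String[] args) {
        String text = "Bukunya dibaca di perpustakaan";
        // prints [bukunya, dibaca, di, perpustakaan]
        System.out.println(new ToggleAnalyzerSketch(false, false).analyze(text));
        // prints [buku, dibaca, perpustakaan]
        System.out.println(new ToggleAnalyzerSketch(true, true).analyze(text));
    }
}
```

With both flags off, every lowercased token is indexed as-is; turning a flag on changes which terms end up in the index, which is why the same flags matter again at search time.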

Searching / Pencarian

  • Change variables in SearchFiles file with your own settings. / Ganti variabel di berkas SearchFiles dengan pengaturan Anda.
// Index from IndexFiles path
public static final String INDEX_PATH = "index-example/";
// True if you want to use stemmer, false otherwise
public static final boolean USE_STEMMER = false;
// True if you want to use stopword removal, false otherwise
public static final boolean USE_STOPWORD = false;

IMPORTANT NOTE: Make sure SearchFiles uses the same configuration as the index in INDEX_PATH. For example, if you created your index with the stemmer but without stopword removal, then you should also set USE_STEMMER = true and USE_STOPWORD = false in SearchFiles.

CATATAN PENTING: Pastikan SearchFiles menggunakan konfigurasi yang sama dengan index yang ada di INDEX_PATH. Contoh: Jika Anda membuat index menggunakan pembuangan imbuhan tetapi tidak pembuangan stopword, maka Anda juga harus membuat USE_STEMMER = true dan USE_STOPWORD = false di berkas SearchFiles.

  • Run compile_run_search.bat or compile_run_search.sh to run the search. / Gunakan compile_run_search.bat atau compile_run_search.sh untuk menjalankan pencarian.
  • The search reads the index from the folder specified by the INDEX_PATH variable. / Pencarian membaca indeks dari folder yang dispesifikasikan pada variabel INDEX_PATH.
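
The important note above, that search must use the same configuration as indexing, can be seen in a small plain-Java sketch. This is a toy in-memory inverted index, not the repository's IndexFiles or SearchFiles and not Lucene: the class ConfigMismatchSketch, its three-word stopword list, and the toyStem helper (which only strips "-nya") are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy demonstration of why index-time and query-time analysis must match.
// Not Lucene code; every name in this class is hypothetical.
class ConfigMismatchSketch {
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("yang", "dan", "di"));

    // Hypothetical stand-in for a real stemmer: strips only the suffix "-nya".
    static String toyStem(String token) {
        return (token.length() > 5 && token.endsWith("nya"))
                ? token.substring(0, token.length() - 3)
                : token;
    }

    static List<String> analyze(String text, boolean useStemmer, boolean useStopword) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty() || (useStopword && STOPWORDS.contains(token))) {
                continue;
            }
            terms.add(useStemmer ? toyStem(token) : token);
        }
        return terms;
    }

    // Map each analyzed term to the ids (list positions) of the documents containing it.
    static Map<String, Set<Integer>> buildIndex(List<String> docs,
                                                boolean useStemmer, boolean useStopword) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : analyze(docs.get(id), useStemmer, useStopword)) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    // Return the ids of documents that contain every analyzed query term.
    static Set<Integer> search(Map<String, Set<Integer>> index, String query,
                               boolean useStemmer, boolean useStopword) {
        Set<Integer> hits = null;
        for (String term : analyze(query, useStemmer, useStopword)) {
            Set<Integer> postings = index.getOrDefault(term, Collections.<Integer>emptySet());
            if (hits == null) {
                hits = new TreeSet<>(postings);
            } else {
                hits.retainAll(postings);
            }
        }
        return hits == null ? Collections.<Integer>emptySet() : hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("Bukunya ada di meja", "Buku itu mahal");
        // Index built with the stemmer on and stopword removal off.
        Map<String, Set<Integer>> index = buildIndex(docs, true, false);

        // Same configuration at search time: "bukunya" is stemmed to "buku".
        System.out.println(search(index, "bukunya", true, false));  // prints [0, 1]
        // Stemmer off at search time: "bukunya" is not a term in the index.
        System.out.println(search(index, "bukunya", false, false)); // prints []
    }
}
```

In the mismatched case the query term is looked up in its unstemmed form, while the index only holds stemmed terms, so matching documents are silently missed.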

Contact / Kontak

Feel free to contact me at remmy.augusta [at] ui.ac.id for any inquiries.

Silakan kontak saya di remmy.augusta [at] ui.ac.id untuk pertanyaan apapun.
