The habeas from sotheanithsok

Fix inaccuracies on Milestone1

From the Milestone1 Grade.

some issues with a phrase query
Wildcard query was strangely a little bit off
NEAR query was strangely a little bit off

JsonFileDocument

Create new implementation of FileDocument(JsonFileDocument) to work with json files

New getContent()
New getTitle()

Visual Enhancement with color theme

favorite colors
jesse - burgundy, #550015
nith - neon blue
yashua - greenish blue
sella - mint

K-Gram Index

Build a k-gram index base on input hashset

"Right now your index uses a binary search to locate postings for a vocabulary term;
this requires log2 T disk reads. This can be improved with a properly configured B+ tree mapping from vocabulary terms to disk locations. Do not write a B+ tree yourself. Find a library with a permissible license that implements a B+ tree that can map from a string term to a long disk location. Import the library into your project and configure your DiskPositionalIndex class to use the B+ tree instead of a binary search over the vocab.bin file."

Code Coverage Report

code coverage with our unit tests

Electron GUI integration

GUI #14
the dependencies

dotnet add package ElectronNet.API -v 5.22.14
dotnet add package Microsoft.AspNetCore.App
dotnet add package Microsoft.AspNetCore.Razor.Design -v 2.2.0
dotnet add package Microsoft.VisualStudio.Web.CodeGeneration.Design -v 2.2.3
dotnet tool install ElectronNET.CLI -g
npm install electron-builder --global
electronize start

OnDiskHashMap

Create a generic on-disk hashmap to deal with k-gram and soundex

Ranked Retrievals

"This is the biggest new requirement. Your main program must operate in two modes:
Boolean query mode, and ranked query mode.
In ranked query mode, you must process a query without any Boolean operators and return the top K = 10 documents satisfying the query. Use the 'term at a time' algorithm as discussed in class:

For each term t in the query:
(a) Calculate wq;t = ln (1 + N/dft)
(b) For each document d in t's postings list:
i. Acquire an accumulator value Ad (the design of this system is up to you).
ii. Calculate wd;t = 1 + ln (tft;d).
iii. Increase Ad by wd;t × wq;t.
For each non-zero Ad, divide Ad by Ld, where Ld is read from the docWeights.bin file.
Select and return the top K = 10 documents by largest Ad value. (Use a binary heap priority queue to select the largest results; do not sort the accumulators.)

Use 8-byte floating point numbers for all the calculations.

(print ranked retrieval results: Please print the title of each document returned from a ranked retrieval, as well as the final accumulator value for that document.)"

Core#2 New Index with Positions

Program the PositionalInvertedIndex()
Incorporate PositionalInvertedIndex() into the indexing process, such as AddTerm()
Update Posting() to store positions

Building The Index

"Your PositionalInvertedIndex must be written to disk using the DiskIndexWriter class from Homework 5. Your index file must be constructed in the following pattern: dft d tf t,d p1 p2 · · · pi, where d is the document ID, tf t,d is the term frequency of a term in the document (i.e., the number of positions
that the term appears at), and p1 p2 · · · pi are each of the i positions of that term in that document. All document IDs and positions must be written as gaps."

Spelling Correction

"Implement a spelling correction module for your search engine. Any time a user runs a query using a term that is either missing from the vocabulary or whose document frequency is below some threshold value (your decision), run the query and give results as normal, but then print a suggested modified query where the possibly misspelled term(s) is replaced by a most-likely correction. Ask the user if they would like to run this modified query. To select the most-likely correction:

Select all vocabulary types that have k-grams in common with the misspelled term, as described in lecture.
Calculate the Jaccard coefficient for each type in the selection.
For each type whose coefficient exceeds some threshold (your decision), calculate the edit distance from that type to the misspelled term.
Select the type with the lowest edit distance. If multiple types tie, select the type with the highest dft (when stemmed)."

Adv#5 Near operator

NearLiteral class
Implement GetPostings()
Upgrade findNextLiteral() to recognize NEAR query
unit test on it

Near query example:
[angels NEAR/2 baseball]
To find documents that "baseball" appears at most 2 positions away from "angels".

"angels baseball" - match
"angels love baseball" - match
"angels really love baseball" - not match

Bug: GUI Header

"Vocab" & "Index" options do not disappear in navigation bar when the user selects the "Index" option from the search page.

Adv#2 GUI

GUI first prompts user to "select a directory to index"
GUI shows text field for user to type queries
GUI shows list of doc titles matching query results
Double clicking an item opens a new window showing the contents of the document
GUI features alternative for special query ":q"
GUI features alternative for special query ":stem token"
GUI features alternative for special query ":index directoryname"
GUI features alternative for special query ":vocab"
GUI for special query ":author name"

Core#1 Corpus Directory Selection and JSON File Documents

SPIMI Algorithm

"Program the SPIMI algorithm for creating an on-disk index. Set some constant threshold
value that determines when your in-memory index is 'full'. Process tokens into an in-memory positional inverted index until the index is full, then save the index to a 'partial' index using your DiskIndexWriter class. Clear the index, then process tokens again, repeating the process until all documents are processed. Merge the partial indexes together into one final index using the SPIMI merge algorithm. You can simplify the merge algorithm in this way:

Assume that the entire vocabulary fits into main memory. Construct this full vocabulary by reading the vocabularies from each bucket index and unioning those together.
For each term in the sorted vocabulary, read the postings for that term from each partial index in
order. Merge those postings into one final list, then write that list to disk.
Repeat.

In this way, you do not have to program the 'priority queue' aspect of the SPIMI algorithm.
Modify your search engine to always use the SPIMI algorithm instead of the default DiskIndexWriter
construction method. Demonstrate that your code works by indexing the entire mlb-articles-full.zip
corpus on BeachBoard, which has 200,000 articles and takes the example search engine implementation 15 minutes to index with SPIMI"

Core#6 Main Application

1. Ask Directory to index
2. Print how long indexing takes
3. Ask a document to view
4. Print the content of the selected document

Tiered Index

Implement a three-tier index as described in the text in section 7.2.1.
Perform a complete in-memory indexing using your PositionalInvertedIndex class as before, but now instead of creating a single on-disk index, create 3, and place documents into only one of the three based on some tft,d threshold you choose with care. (Speak with me if you don't know where to start with this.) For Boolean queries, you will need to pull postings from all 3 indexes prior to merging results. For Ranked queries, you will attempt to satisfy the query using only documents from the "top" index. If you score fewer than K documents in this way, then you need to start over and score documents using the top 2 indexes; likewise involve the third index only if this second attempt fails. For this requirement, let K = 20. Compare your new search engine with your baseline search engine from Milestone 2 in the following ways:

the Mean-Average-Precision for the test collection data set
the average number of nonzero accumulators used in the queries
the mean response time to satisfy a query
the throughput (queries/second) of the system
the percentage of queries that were satisfied only using documents in the top tier

Check Through Demo Rubric!

Project passes test cases on core features (tests cases provided by instructor)
Project passes test cases on additional features
Project meets code review standards for appropriate object-oriented design techniques (well-designed classes, methods, high cohesion, low coupling)
Project meets code review standards for intelligent use of data structures and algorithms
Project meets code review standards for efficient use of memory and CPU time
Project has thorough documentation of all classes and methods in the standard style set by C#
Entire code base must feature consistent set of rules governing its coding style
Project has a professional appearance

Core#3 New Token Processor

Remove all non-alphanumeric characters form the beginning and end of the token but not the the middle
Remove all apostrophes
Remove quotation marks
Hyphenated Words
Convert the token to lowercase
Stem the token using Porter2 Stemmer

Test on the actual corpus

Neal's test on the actual corpus

UnitTest for PositionalInvertedIndex

Also, check the code coverage..?

Variable Byte Encoding

"Encode the index files using variable byte encoding. Modify IndexWriter so all document ID gaps, position counts, and position gaps are written using variable byte encoding as described in lecture and in the book. Modify DiskPositionalIndex so it accounts for variable byte counts when reading postings and positions. Hint: you cannot assume that 4 bytes should be read for each int, so you now must read 1 byte at a time and decide whether you need more bytes for the number you are decoding. This also means you cannot seek/skip past positional postings if you don't need them (for non-phrase queries); you need to scan past them, counting the number of times you read a byte with a top-most bit of 1 and continuing to scan until you encounter one such byte for every position you are attempting to skip. Your encoding and decoding methods must be efficient. You cannot use string variables in these operations; you must only work with integer types, bytes, and arrays of bytes."

Modify IndexWriter so all document ID gaps, position counts, and position gaps are written using variable byte encoding as described in lecture and in the book
Modify DiskPositionalIndex so it accounts for variable byte counts when reading postings and positions. Hint: you cannot assume that 4 bytes should be read for each int, so you now must read 1 byte at a time and decide whether you need more bytes for the number you are decoding. This also means you cannot seek/skip past positional postings if you don't need them (for non-phrase queries); you need to scan past them, counting the number of times you read a byte with a top-most bit of 1 and continuing to scan until you encounter one such byte for every position you are attempting to skip
Encoding and decoding methods must be efficient. You cannot use string variables in these operations; you must only work with integer types, bytes, and arrays of bytes

Enable Stemming for Indexing and Query

Check SOLID software principles

Check the structure and method access types
Alarms

too many static methods?
is everything public?
S
O
L
I
D

Unit Testing Framework

"You must include tests for ranked retrieval by hand-computing scores for a few selected documents on a few selected queries and testing your algorithms on those results."

To-Dos For Final Report

1. Fix the speed problem
2. Redistribute to 3 tiers
3. Conduct experiments on the tier sizes to maximize the precision and throughput
4. Write the report!

Excluding url and only processing body for the document

K-Gram Index On Disk

"Save your wildcard index to disk, and incorporate wildcards into ranked retrieval queries. Extend the 'create an index' procedure of your program to also generate a disk-based wildcard index, and
likewise extend the 'query an index' procedure to load this wildcard index for use with wildcard queries. I will leave the design of this index up to you, but will gladly give input if you need it. You may create a design in which the entire wildcard index is read into and retained in memory for the duration of the search engine; you do not need to architect a system that reads wildcard information from the binary file each time a wildcard query is needed. To incorporate wildcards into ranked retrievals, simply include every vocabulary type that matches the wildcard token in the ranking procedure. (Yes, this gives higher scores to documents that contain multiple different words matching the wildcard query. If you can come up with a better procedure, I welcome your
proposal.)"

Update GUI

Have GUI reflect all new Habeas features from milestone 2.

Soundex Index On Disk

"Apply the logic in the DiskIndexWriter class(es) to saving your Soundex index, and similarly apply the logic of the DiskPositionalIndex class to reading your Soundex index. You will still only use the index to satisfy: author queries in Boolean retrieval mode."

Core#4 Query Language

Reading The Index

DiskPositionalIndex
BinarySearchVocabulary()
GetPostings()
GetPositionalPostings()

"You must create a new implementation of the Index interface, DiskPositionalIndex, which reads postings from an on-disk index as created by your DiskIndexWriter. Each Index interface method must be implemented; to program getPostings(String term), you must locate the byte position within the postings.bin file for the given term, which requires binary searching your vocabTable.bin (using vocab.bin as well). This is tricky; to get you started, I have implemented most of this behavior as part of some starter code in DiskFoundations.zip on BeachBoard. Once you have a byte location, use a seek method on a DataInputStream (Java) or BinaryReader (C#) to jump to that position and then read and translate integer values from disk into Postings objects in your program."

Adv#3 Wildcard query

Adv#1 Unit Test

Things to test

Libraries to use

XUnit
FluentAssertion

To start XUnit

To use XUnit in the package,
$ dotnet new xunit --force
This sets up all it needs for the project to configure.
It changes CECS-529-Search-Engine-Project.csproj file and creates UnitTest1.cs
To resolve multiple entry points issue,
Add a line to CECS-529-Search-Engine-Project.csproj file
<GenerateProgramFile>false</GenerateProgramFile>
inside of <PropertyGroup>
Build the project
$ dotnet build

Done. This creates a simple unit test base code.

FluentAssertion

a library to help easier assertion

Adv#4 SoundEx

SoundexIndex class
- hashMap [sound hash key] -> [docIDs]
- Construct by reading author from corpus
- GetPostings(string name)
Author field in IDocument and its implementations
Querying in Main() :author ______
tests

Build custom MemoryMappedFile manager

As discussed at #31 with @sotheanith,
mapName used in MemoryMappedFile.OpenExisting() is a feature provided by Windows, so it can't be used in Unix.

Need to build a manager to store memoryMappedFile

in order to handle the problem of calling createFileFromDisk multiple times.
One file being used by other processor errors.
when 1) testing index, 2) searching and opening a file twice

Calculating Document Weights

"Ranked retrieval requires calculating a 'weight' of each document to use in normalization, so that long documents don't receive 'extra relevance' just because they are longer. The weight of each document is called Ld and can be calculated during the indexing process. Each Ld value is equal to the Euclidean normalization of the vector of wd;t weights for the document, where wd;t = 1 + ln (tft;d) Note that this formula only needs tft;d, which is the number of times a particular term occurs in the document. So to calculate Ld for a document that you just finished indexing, you need to know each term that occurred at least once in the document along with the number of times that term occurred; a HashMap from term to integer, updated as you read each token, can help track this. For each term in the final map, the integer it maps to is tft;d for that term and can be used to calculate wd;t for that term. To normalize those wd;t terms and find Ld, sum the squares of all the wd;t terms, and then take the square root: Ld = qPt (wd;t)2. Each Ld term needs to be written as an 8-byte double in document order to a file called docWeights.bin. Modify DiskIndexWriter so that it creates this file during the indexing process. Modify DiskPositionalIndex so it knows how to open this file and skip to an appropriate location to read a 8-byte double for Ld. Ld values will be used when calculating ranked retrieval scores."

Core#5 Special Queries

Querying The Index

"You must not read the entire index into main memory when processing user queries. You must only read postings lists for the terms required to satisfy the query. In other words, you will not be using PositionalInvertedIndex or NaiveInvertedIndex when processing ranked or Boolean queries; that will fall to your new DiskPositionalIndex. Your DiskPositionalIndex class must have two different functions for retrieving postings lists: one for when you don't care about positional information, and one for when you do want positional information in the returned postings. You will need to modify the Index interface with the second method, and clearly indicate which method loads positions and which does not. Your query engine must use the appropriate index retrieval method depending on the type of query: only a few Boolean query component types need positions (phrase queries, NEAR operator), whereas ranked queries only need tft;d (and not the actual positions themselves)."

K-Gram generator

Add k-gram generation functionality to the token processor.

sotheanithsok / habeas Goto Github PK

habeas's People

Contributors

Watchers

habeas's Issues

Things to test

Libraries to use

To start XUnit

FluentAssertion

Need to build a manager to store memoryMappedFile

Recommend Projects

Recommend Topics

Recommend Org