Coder Social home page Coder Social logo

tascate / suffix-arrays-in-cpp Goto Github PK

View Code? Open in Web Editor NEW
13.0 1.0 2.0 37 KB

C++ Implementations of Suffix Array Constructions Algorithms for SA-IS and Skew as well as Kasai's algorithm for LCP construction.

CMake 0.71% C++ 99.29%
suffixarray suffix-array cpp

suffix-arrays-in-cpp's Introduction

C++ Suffix Arrays

This is a set of Suffix Array implementations using the SA-IS algorithm and the Skew Algorithm.
LCP Array construction from a Suffix Array is implemented by the Kasai Algorithm and used to find the Longest Common Substring in k-Strings.
Resources to learn about the algorithim, Suffix Arrays, and their applications can be found at the end of this readme.

Compiling & Building

In order to build the project, you will need to install:

  • CMake
  • Make

Clone the Repository.

git clone "https://github.com/Tascate/Suffix-Array-Implementation.git"

Compile using CMake.

cmake .

Build using the Makefile.

make

Run to view test code.

./SuffixArray

Example Usage

Given the following code:

#include "Skew.h"

Skew sa;
sa.addString("cabbage");
sa.makeSuffixArray();
sa.printSuffixArray();

This will output the computed Suffix Array using the Skew Algorithm:

Suffix Array : 7 1 4 3 2 0 6 5

Things to note

The abstract class SuffixArray represents a generic base class that any class can derive from to implement different Suffix Array Construction Algorithms. The classes: Naive, Skew, SAIS, all derive from the SuffixArray class to implement their respective algorithms.

Skew Algorithm

O(n) time, O(n) work
The algorithm is fairly simple to implement, requiring less code overall than the SA-IS implementation. Because of this, it is quite useful for quick testing and usage of Suffix Arrays. This implementation has support for multiple strings which is useful for finding the LCS of k-Strings.

SA-IS Algorithm

O(n) time, O(n) work
This is an implementation of the SA-IS algorithm for my own understanding as well as improving my C++. The code is largely based on Screwtape's A walk through the SA-IS Suffix Array Construction Algorithm and may not necessarily be optimized. Screwtape's walkthrough is a great resource for learning the SA-IS algorithm!

Kasai Algorithm

O(n) time, O(1) work
A simple algorithm to construct the LCP Array from a Suffix Array in linear time. In this code, the LCP Array can be used to find the Longest Common Substring.

Longest Common String for k-Strings

O(n) time, O(n) work)
Finding the LCS is based upon the premise that for a sliding window from i to j on the LCP array. The LCP value for that sliding window is the minimum value from i+1 to j provided that the first character for the suffix i and suffix j are the same. Finding the minimum value in our sliding window takes amortized O(1) time using a deque. An algorithim for finding the minimum value of a sliding window can be found here.
We can then use bookkeeping to find the longest common substring.

Finding out which string a suffix belongs to:

For coding simplicity, finding out the parent string for a given suffix takes O(logm) time (where m = # of strings) and O(1) space using binary search. However, it is possible to optimize for O(1) time by tracking the parent string for a suffix in memory. (For example, a map or array)

Good Resources I used to Learn about Suffix Arrays

Algorithm Documentation:

  • SA-IS: "Linear Suffix Array Construction by Almost Pure Induced-Sorting" by G. Nong, S. Zhang and W. H. Chan"
  • Skew: "Simple Linear Work Suffix Array Construction" by Juha Kärkkäinen and Peter Sanders)
  • Kasai: "Linear-Time Longest Common-Prefix Computation in Suffix Arrays and Its Applications" by Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa and Kunsoo Park

Learning:

Videos:

suffix-arrays-in-cpp's People

Contributors

tascate avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

suffix-arrays-in-cpp's Issues

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.