Skip to content

Suffix Array thoughts

pierce314159 edited this page Jul 19, 2021 · 5 revisions

Quick introduction

A Suffix Array is a sorted array of all suffixes of a string. This data structure is useful for certain string algorithms including lossless data compression and string sorting

Example:
Given a string "banana$", the suffixes are as follows
    s[0]="banana$"
    s[1]="anana$"
    s[2]="nana$"
    s[3]="ana$"
    s[4]="na$"
    s[5]="a$"
    s[6]="$"
The suffix array of string "banana$" is the array of indices of sorted suffixes
    s[6]="$"
    s[5]="a$"
    s[3]="ana$"
    s[1]="anana$"
    s[0]="banana$"
    s[4]="na$"
    s[2]="nana$"
so sa=[6,5,3,1,0,4,2]

Arkouda has pull requests (See PR #865 and #627) which implement suffix arrays with linear time construction

Suffix Array construction as an operation on Strings

In the current implementation, suffix array is a standalone object with the client having disjoint SArrays and Strings classes. I've started thinking that suffix array construction should be framed as an operation performed on a Strings object.

My reasoning is a suffix array is an intermediary representation which doesn't make sense without an associated string. We won't ever have an instance where we need a suffix array without a string, so instead of converting Strings objects into SArrays, we could have the suffix array construction methods reside in the strings class.

There are 2 approaches that come to mind when making this happen:

  • Have all algorithms which require suffix arrays to create them on the fly as necessary
    • This shouldn't be an algorithmic bottle neck since the construction is linear
  • Automatically create the suffix array as an additional component of SegStrings on the server
    • Increased memory required to avoid repeated construction of suffix arrays if multiple algorithms requiring them are called on the same SegString
    • This extra memory is already happening with building SArrays objects from Strings

I am in favor of building suffix arrays as needed while we are implementing the algorithms which require them. We can then do a bake-off to see how the performance compares with our current non-suffix array methods. If these perform better, we can look into whether having more permanent storage for suffix arrays as an additional component in SegStrings on the server is worthwhile

Clone this wiki locally