-
Notifications
You must be signed in to change notification settings - Fork 96
Suffix Array thoughts
A Suffix Array is a sorted array of all suffixes of a string. This data structure is useful for certain string algorithms including lossless data compression and string sorting
Example:
Given a string "banana$", the suffixes are as follows
s[0]="banana$"
s[1]="anana$"
s[2]="nana$"
s[3]="ana$"
s[4]="na$"
s[5]="a$"
s[6]="$"
The suffix array of string "banana$" is the array of indices of sorted suffixes
s[6]="$"
s[5]="a$"
s[3]="ana$"
s[1]="anana$"
s[0]="banana$"
s[4]="na$"
s[2]="nana$"
so sa=[6,5,3,1,0,4,2]
Arkouda has pull requests (See PR #865 and #627) which implement suffix arrays with linear time construction
In the current implementation, suffix array is a standalone object with the client having disjoint SArrays and Strings classes. I've started thinking that suffix array construction should be framed as an operation performed on a Strings object.
My reasoning is a suffix array is an intermediary representation which doesn't make sense without an associated string. We won't ever have an instance where we need a suffix array without a string, so instead of converting Strings objects into SArrays, we could have the suffix array construction methods reside in the strings class.
There are 2 approaches that come to mind when making this happen:
- Have all algorithms which require suffix arrays to create them on the fly as necessary
- This shouldn't be an algorithmic bottle neck since the construction is linear
- Automatically create the suffix array as an additional component of
SegStringson the server- Increased memory required to avoid repeated construction of suffix arrays if multiple algorithms requiring them are called on the same
SegString - This extra memory is already happening with building
SArraysobjects fromStrings
- Increased memory required to avoid repeated construction of suffix arrays if multiple algorithms requiring them are called on the same
I am in favor of building suffix arrays as needed while we are implementing the algorithms which require them. We can then do a bake-off to see how the performance compares with our current non-suffix array methods. If these perform better, we can look into whether having more permanent storage for suffix arrays as an additional component in SegStrings on the server is worthwhile