This article presents an O(n)-time algorithm called SACA-K for sorting the suffixes of an input string T[0, n-1] over an alphabet A[0, K-1]. The problem of sorting the suffixes of T is also known as constructing the suffix array (SA) for T. The theoretical memory usage of SACA-K is n log K + n log n + K log n bits. Moreover, we also have a practical implementation for SACA-K that uses n bytes + (n + 256) words and is suitable for strings over any alphabet up to full ASCII, where a word is log n bits. In our experiment, SACA-K outperforms SA-IS that was previously the most time- and space-efficient linear-time SA construction algorithm (SACA). SACA-K is around 33% faster and uses a smaller deterministic workspace of K words, where the workspace is the space needed beyond the input string and the output SA. Given K = O(1), SACA-K runs in linear time and O(1) workspace. To the best of our knowledge, such a result is the first reported in the literature with a practical source code publicly available.
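To make the problem statement concrete, the following minimal sketch builds a suffix array by directly sorting suffix start positions. It is not SACA-K: it is a naive baseline whose running time is far from linear, and the function name `naive_suffix_array` is a hypothetical one chosen for illustration.

```python
def naive_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text` in lexicographic order.

    Naive baseline for illustration only: each comparison may scan O(n)
    characters, so this is much slower than a linear-time SACA.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


# Suffixes of "banana": a(5) < ana(3) < anana(1) < banana(0) < na(4) < nana(2)
print(naive_suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

The output lists suffix start positions, so entry 0 of the array is the position of the lexicographically smallest suffix.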
We present a new suffix array construction algorithm that aims to build, in external memory, the suffix array for an input string of length n measured in the magnitude of tens of Giga characters over a constant or integer alphabet. The core of this algorithm is adapted from the framework of the original internal memory SA-DS algorithm that samples fixed-size d-critical substrings. This new external-memory algorithm, called EM-SA-DS, uses novel cache data structures to construct a suffix array in a sequential scanning manner with good data spatial locality: data is read from or written to disk sequentially. On the assumed external-memory model with RAM capacity Ω((nB)^0.5), disk capacity O(n), and size of each I/O block B, all measured in log n-bit words, the I/O complexity of EM-SA-DS is O(n/B). This work provides a general cache-based solution that could be further exploited to develop external-memory solutions for other suffix-array-related problems, for example, computing the longest-common-prefix array, using a modern personal computer with a typical memory configuration of 4GB RAM and a single disk.
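As a rough feel for the stated bounds, the sketch below evaluates (nB)^0.5 and n/B for given n and B, both in words. It deliberately drops every constant factor hidden by the Ω and O notation, so the numbers are orders of magnitude only and say nothing about the actual requirements of EM-SA-DS. The function name `em_model_orders` and the sample values of n and B are assumptions made for illustration.

```python
import math


def em_model_orders(n: int, block: int) -> tuple[int, int]:
    """Evaluate (n*B)^0.5 and n/B with all hidden constants dropped.

    n and block are both measured in words. The results only indicate the
    growth terms in the stated bounds, not real memory or I/O figures.
    """
    ram_words = math.isqrt(n * block)   # floor of sqrt(n * B)
    block_transfers = -(-n // block)    # ceiling of n / B
    return ram_words, block_transfers


# Hypothetical sizes: n = 2^34 words, B = 2^16 words per I/O block.
print(em_model_orders(2**34, 2**16))  # (33554432, 262144)
```

With these sample values the RAM term is 2^25 words and the I/O term is 2^18 block transfers.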
We present in this article an external memory algorithm, called disk SA-IS (DSA-IS), to exactly emulate the induced sorting algorithm SA-IS previously proposed for sorting suffixes in RAM. DSA-IS is a new disk-friendly method for sequentially retrieving the preceding character of a sorted suffix to induce the order of the preceding suffix. For a size-n string over a constant or integer alphabet, given the RAM capacity Ω((nW)^0.5), where W is the size of each I/O buffer that is large enough to amortize the overhead of each access to disk, both the CPU time and peak disk use of DSA-IS are O(n). Our experimental study shows that on average, DSA-IS achieves the best time and space results of all of the existing external memory algorithms based on the induced sorting principle.
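The induced sorting principle mentioned above can be sketched in RAM as follows: once a sample of suffixes is in sorted order, scanning the partially filled array and looking at the character preceding each sorted suffix is enough to place the preceding suffix in its bucket. This is an in-memory illustration only, not DSA-IS and not a full SA-IS: the sampled (LMS) suffixes are sorted naively here as a shortcut, and the function name `induced_sort_demo` is hypothetical.

```python
def induced_sort_demo(text: str) -> list[int]:
    """Build the suffix array of `text` by induced sorting (in-memory sketch)."""
    # Append a unique smallest sentinel (code 0); shift real characters by +1.
    t = [ord(c) + 1 for c in text] + [0]
    n = len(t)

    # is_s[i] is True when suffix i is S-type (smaller than suffix i + 1).
    is_s = [False] * n
    is_s[n - 1] = True
    for i in range(n - 2, -1, -1):
        is_s[i] = t[i] < t[i + 1] or (t[i] == t[i + 1] and is_s[i + 1])

    # Bucket boundaries: one bucket per character, in character order.
    counts: dict[int, int] = {}
    for c in t:
        counts[c] = counts.get(c, 0) + 1
    heads, tails, pos = {}, {}, 0
    for c in sorted(counts):
        heads[c] = pos
        pos += counts[c]
        tails[c] = pos - 1

    sa = [-1] * n

    # Seed: LMS suffixes (S-type with an L-type predecessor) in sorted order.
    # Shortcut (assumption): they are sorted naively instead of recursively.
    lms = [i for i in range(1, n) if is_s[i] and not is_s[i - 1]]
    lms.sort(key=lambda i: t[i:])
    tail = dict(tails)
    for i in reversed(lms):
        sa[tail[t[i]]] = i
        tail[t[i]] -= 1

    # Induce L-type suffixes: scan left to right, place the preceding suffix
    # at the head of the bucket of its first (preceding) character.
    head = dict(heads)
    for k in range(n):
        j = sa[k] - 1
        if sa[k] > 0 and not is_s[j]:
            sa[head[t[j]]] = j
            head[t[j]] += 1

    # Induce S-type suffixes: scan right to left, place at bucket tails.
    tail = dict(tails)
    for k in range(n - 1, -1, -1):
        j = sa[k] - 1
        if sa[k] > 0 and is_s[j]:
            sa[tail[t[j]]] = j
            tail[t[j]] -= 1

    return sa[1:]  # drop the sentinel suffix


print(induced_sort_demo("banana"))  # [5, 3, 1, 0, 4, 2]
```

In both scans the only data needed about a sorted suffix at position j is the single character t[j - 1] and its type, which is the access pattern the abstract describes making sequential on disk.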