Inverted Files Versus Suffix Arrays for Locating Patterns in
Primary Memory
Simon Puglisi
Department of Computing,
Curtin University of Technology,
Perth, Australia
Bill Smyth
Department of Computer Science,
McMaster University,
Canada.
Andrew Turpin
School of Computer Science and Information Technology,
RMIT University,
Melbourne, Australia.
Status
Proc. 13th Symposium on String Processing and
Information Retrieval (SPIRE 2006),
Glasgow,
to appear October 2006.
Abstract
Recent advances in the asymptotic resource costs of pattern matching with
compressed suffix arrays
are attractive, but a key rival structure, the compressed inverted file,
has been dismissed or ignored in papers presenting the new structures.
In this paper we examine the resource requirements of
compressed suffix array algorithms against
compressed inverted file data structures for general pattern matching in
genomic and English texts.
In both cases, the inverted file indexes $q$-grams, thus allowing full
pattern matching capabilities, rather than simple word based search, making
their functionality equivalent to the compressed suffix array structures.
When using equivalent memory for the two structures,
inverted files are faster at reporting the location of patterns
when the number of occurrences of the patterns is high.
Furthermore, inverted files easily scale up to work on external storage,
while the use of succinct self indexes in external memory is an open
problem.