conference

Generalization of Efficient Implementation of Compression by Substring Enumeration

Shumpei Sakuma, Kazuyuki Narisawa, Ayumi Shinohara

Data Compression Conference 2016 (DCC 2016), , pp.630 (2016), [peer-reviewed]
Event Date: March 30-April 1, 2016

DOI: 10.1109/dcc.2016.86 External Link

Abstract / 概要

Summary form only given. Compression via Substring Enumeration (CSE) is a lossless universal data compression scheme, introduced by Dube and Beaudoin [1]. CSE compresses a target binary string by enumerating substrings occurred in it, and encodes the numbers of occurrences effectively, by calculating its upper-bound and lower-bound based on the previous numbers. They used a data structure called Compacted Substring Tree (CST) for counting the occurrences. Instead of CST, Kanai et al. [2] proposed an elegant and efficient implementation for CSE by utilizing Burrows-Wheeler Transform (BWT) Matrix and several auxiliary arrays. In this paper, we extend it in two ways, (1) to deal with the explicit phase awareness for byte-oriented source, and (2) to treat multiple characters for a finite alphabet source. Extension for Explicit Phase Awareness The original CSE is not effective for non-textual and byte-oriented data. Beliveau et al. [3] improved CSE to deal with the phase unaware problem, by categorizes the substrings based on the positions that start in a byte, and counting the occurrences separately. We adopt it and naturally extend Kanai’s method. The running time is $O(n)$. Extension for CSE with Finite Alphabet. The original CSE is intended for binary strings, and no experimental results are shown to deals with multiple characters explicitly so far, for the best of our knowledges. We extend Kanai’s method for a finite alphabet, by utilizing some additional auxiliary arrays and the Wavelet Tree (or Wavelet Matrix). It also runs in $O(n)$ time assuming that alphabet size is constant.