Abstract
String dictionaries store a collection \(\left( s_i\right) _{0\le i < m}\) of m variable-length keys (strings) over an alphabet \(\varSigma \) and support the operations lookup (given a string \(s\in \varSigma ^*\), decide if \(s_i=s\) for some i, and return this i) and access (given an integer \(0\le i < m\), return the string \(s_i\)). We show how to modify the Lempel–Ziv-78 data compression algorithm to store the strings space-efficiently and support the operations lookup and access in optimal time. Our approach is validated experimentally on dictionaries of up to 1.5 GB of uncompressed text. We achieve compression ratios often outperforming the existing alternatives, especially on dictionaries containing many repeated substrings. Our query times remain competitive.
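To make the two operations concrete, here is a minimal sketch of a string dictionary whose keys are parsed with textbook LZ78 against one shared phrase table. It is not the data structure developed in the article: the class name `ToyLZ78Dictionary`, the shared phrase table, the sorted key order, and the binary-search lookup are all assumptions made for illustration, and the sketch makes no attempt at succinct storage or optimal query time.

```python
from typing import Dict, List, Optional, Tuple

# A phrase is (id of parent phrase, appended character); id 0 is the empty phrase.
Phrase = Tuple[int, str]


class ToyLZ78Dictionary:
    """Illustrative only: keys parsed with textbook LZ78 into shared phrases."""

    def __init__(self, keys: List[str]) -> None:
        self.phrases: List[Phrase] = [(0, "")]        # phrase 0 = empty phrase
        self._trie: Dict[Tuple[int, str], int] = {}   # (parent id, char) -> phrase id
        self.encoded: List[List[int]] = []            # per key: sequence of phrase ids
        # Assumption: keys are deduplicated and sorted so lookup can binary-search.
        for key in sorted(set(keys)):
            self.encoded.append(self._parse(key))

    def _parse(self, key: str) -> List[int]:
        """LZ78-parse one key, extending the shared phrase table as needed."""
        ids: List[int] = []
        node = 0
        for ch in key:
            nxt = self._trie.get((node, ch))
            if nxt is not None:
                node = nxt            # keep extending the current phrase
                continue
            # Longest known phrase plus one new character becomes a new phrase.
            self.phrases.append((node, ch))
            new_id = len(self.phrases) - 1
            self._trie[(node, ch)] = new_id
            ids.append(new_id)
            node = 0
        if node != 0:
            ids.append(node)          # key ended inside an already known phrase
        return ids

    def _expand(self, pid: int) -> str:
        """Spell out a phrase by following parent links back to the empty phrase."""
        chars: List[str] = []
        while pid != 0:
            parent, ch = self.phrases[pid]
            chars.append(ch)
            pid = parent
        return "".join(reversed(chars))

    def access(self, i: int) -> str:
        """Return the i-th key (in sorted order)."""
        if not 0 <= i < len(self.encoded):
            raise IndexError(i)
        return "".join(self._expand(p) for p in self.encoded[i])

    def lookup(self, s: str) -> Optional[int]:
        """Return i with access(i) == s, or None if s is not stored."""
        lo, hi = 0, len(self.encoded)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.access(mid) < s:
                lo = mid + 1
            else:
                hi = mid
        if lo < len(self.encoded) and self.access(lo) == s:
            return lo
        return None
```

For example, building the sketch from the keys `banana`, `bandana`, `band`, and `ban` stores `banana` as three phrases (`ba`, `na`, `na`), the last of which reuses a phrase created earlier. Repeated substrings across keys are what such a parse exploits.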
Notes
The depiction of the PDT is simplified; a concrete implementation includes some technical details that are not required to understand the techniques described in this section.
This is only theoretically interesting, as for our datasets no phrase exceeds a length of 127 characters.
References
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410 (1990)
Arroyuelo, D., Navarro, G.: Space-efficient construction of Lempel–Ziv compressed text indexes. Inf. Comput. 209(7), 1070–1102 (2011)
Arroyuelo, D., Navarro, G., Sadakane, K.: Stronger Lempel–Ziv based compressed text indexing. Algorithmica 62(1–2), 54–101 (2012)
Arz, J., Fischer, J.: LZ-compressed string dictionaries. In: Proceedings of the DCC, pp. 322–331. IEEE Press (2014)
Belazzougui, D., Navarro, G.: Optimal lower and upper bounds for representing sequences. ACM Trans. Algorithms 11(4), 31 (2015)
Boldi, P., Codenotti, B., Santini, M., Vigna, S.: Ubicrawler: a scalable fully distributed web crawler. Softw. Pract. Exp. 34(8), 711–726 (2004)
Böttcher, S., Lohrey, M., Maneth, S., Rytter, W. (eds): Abstracts collection—structure-based compression of complex massive data. No. 08261 in Dagstuhl Seminar Proceedings, Schloss Dagstuhl—Leibniz-Zentrum für Informatik, Germany (2008)
Brisaboa, N.R., Cánovas, R., Claude, F., Martínez-Prieto, M.A., Navarro, G.: Compressed string dictionaries. In: Proceedings of the 10th International Symposium on Experimental Algorithms (SEA 2011), Springer, Lecture Notes in Computer Science, vol. 6630, pp. 136–147 (2011)
Clark, D.R.: Compact Pat Trees. PhD thesis, Waterloo, ON, Canada (1998)
Ferragina, P., Venturini, R.: Compressed permuterm index. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2007), ACM, pp. 535–542 (2007)
Fischer, J., Gawrychowski, P.: Alphabet-dependent string searching with wexponential search trees. In: Proceedings of the CPM, Springer, LNCS, vol. 9133, pp. 160–171 (2015)
Fischer, J., I, T., Köppl, D.: Lempel Ziv computation in small space (LZ-CISS). In: Proceedings of the CPM, Springer, LNCS, vol. 9133, pp. 172–184 (2015)
Fredman, M.L., Komlós, J., Szemerédi, E.: Storing a sparse table with \({O}(1)\) worst case access time. J. ACM 31(3), 538–544 (1984)
Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Proceedings of the SEA, Springer, LNCS, vol. 8504, pp. 326–337 (2014)
Grossi, R., Ottaviano, G.: Fast compressed tries through path decompositions. ACM J. Exp. Algorithmics 19, 3–4 (2014)
Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithm (SODA 2003), ACM/SIAM, pp. 841–850 (2003)
Hu, T.C., Tucker, A.C.: Optimal computer search trees and variable-length alphabetical codes. SIAM J. Appl. Math. 21(4), 514–532 (1971)
Huffman, D.A.: A method for the construction of minimum-redundancy codes. Proc. IRE 40(9), 1098–1101 (1952)
Jacobson, G.: Space-efficient static trees and graphs. In: Proceedings of the 30th Annual Symposium on Foundations of Computer Science (FOCS 1989), IEEE Computer Society, pp. 549–554 (1989)
Jansson, J., Sadakane, K., Sung, W.: Linked dynamic tries with applications to LZ-compression in sublinear time and space. Algorithmica 71(4), 969–988 (2015)
Knuth, D.E.: Sorting and Searching, The Art of Computer Programming, vol. 3, 2nd edn. Addison Wesley, Reading (1998)
Kosaraju, S.R., Manzini, G.: Compression of low entropy strings with Lempel–Ziv algorithms. SIAM J. Comput. 29(3), 893–911 (1999)
Larsson, N.J., Moffat, A.: Offline dictionary-based compression. In: Proceedings of the Data Compression Conference (DCC 1999), IEEE Computer Society, pp. 296–305 (1999)
Maneth, S., Navarro, G.: Indexes and computation over compressed structured data (Dagstuhl Seminar 13232). Dagstuhl Rep. 3(6), 22–37 (2013)
Martínez-Prieto, M.A., Brisaboa, N.R., Cánovas, R., Claude, F., Navarro, G.: Practical compressed string dictionaries. Inf. Syst. 56, 73–108 (2016)
Mehlhorn, K., Sanders, P.: Algorithms and Data Structures: The Basic Toolbox. Springer, Berlin (2008)
Müller, I., Ratsch, C., Färber, F.: Adaptive string dictionary compression in in-memory column-store database systems. In: Proceedings of the 17th International Conference on Extending Database Technology (EDBT), OpenProceedings.org, pp. 283–294 (2014)
Munro, J.I.: Tables. In: Proceedings of the 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 1996), Springer, Lecture Notes in Computer Science, vol. 1180, pp. 37–42 (1996)
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39, 2 (2007)
Navarro, G., Providel, E.: Fast, small, simple rank/select on bitmaps. In: Proceedings of the SEA, Springer, LNCS, vol. 7276, pp. 295–306 (2012)
Russo, L.M.S., Oliveira, A.L.: A compressed self-index using a Ziv–Lempel dictionary. Inf. Retr. 11(4), 359–388 (2008)
Russo, L.M.S., Navarro, G., Oliveira, A.L.: Fully compressed suffix trees. ACM Trans. Algorithms 7(4), 53 (2011)
Vigna, S.: Broadword implementation of rank/select queries. In: Proceedings of the 7th International Workshop on Experimental Algorithms (WEA 2008), Springer, Lecture Notes in Computer Science, vol. 5038, pp. 154–168 (2008)
Williams, H.E., Zobel, J.: Compressing integers for fast file access. Comput. J. 42(3), 193–201 (1999)
Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann, San Francisco (1999)
Yata, S.: Dictionary compression using nested prefix/Patricia tries (in Japanese). In: Proceedings of the 17th Annual Meeting on Natural Language Processing (NLP2001), pp. 576–578 (2011). http://www.anlp.jp/proceedings/annual_meeting/2011/pdf_dir/F2-6.pdf
Zhou, D., Andersen, D.G., Kaminsky, M.: Space-efficient, high-performance rank and select structures on uncompressed bit sequences. In: Proceedings of the SEA, Springer, LNCS, vol. 7933, pp. 151–163 (2013)
Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24(5), 530–536 (1978)
Acknowledgements
Many people helped to improve this article in different ways. First, we thank Giuseppe Ottaviano for providing his data sets, and Francisco Claude and Miguel Ángel Martínez-Prieto for the source codes of their implementations. Second, we thank Paweł Gawrychowski for interesting discussions on this topic, and Giuseppe Ottaviano, Rossano Venturini, and Gonzalo Navarro for pointing out the work by Russo and Oliveira [31] during the Dagstuhl Seminar 13232 “Indexes and Computation over Compressed Structured Data” [24]. Gonzalo Navarro also brought Lemma 2.3 from Kosaraju and Manzini [22] to our attention. We further thank Simon Gog for bringing [36] to our attention, and the anonymous reviewers for their comments that helped to improve this article.
Additional information
A preliminary version of this paper was published at the 24th Data Compression Conference [4], and in the first author’s diploma thesis at KIT.
About this article
Cite this article
Arz, J., Fischer, J. Lempel–Ziv-78 Compressed String Dictionaries. Algorithmica 80, 2012–2047 (2018). https://doi.org/10.1007/s00453-017-0348-7
DOI: https://doi.org/10.1007/s00453-017-0348-7