Lempel–Ziv-78 Compressed String Dictionaries

Algorithmica

Abstract

String dictionaries store a collection \((s_i)_{0\le i<m}\) of \(m\) variable-length keys (strings) over an alphabet \(\varSigma\) and support two operations: lookup (given a string \(s\in\varSigma^*\), decide whether \(s_i=s\) for some \(i\), and if so return this \(i\)) and access (given an integer \(0\le i<m\), return the string \(s_i\)). We show how to modify the Lempel–Ziv-78 data compression algorithm to store the strings space-efficiently and support the operations lookup and access in optimal time. Our approach is validated experimentally on dictionaries of up to 1.5 GB of uncompressed text. The compression ratios we achieve often outperform those of the existing alternatives, especially on dictionaries containing many repeated substrings. Our query times remain competitive.
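To make the two operations and the underlying compression scheme concrete, the following is a minimal sketch and not the data structure of the article. It assumes an uncompressed baseline that assigns each key its rank in sorted order as ID, and it shows the textbook Lempel–Ziv-78 factorization on a concatenation of keys. The names `NaiveStringDictionary`, `lz78_parse` and `lz78_decode`, and the use of a newline as key separator, are hypothetical choices made for illustration only.

```python
from bisect import bisect_left


class NaiveStringDictionary:
    """Uncompressed baseline (an assumption for illustration):
    keys are kept sorted, and the ID of a key is its rank."""

    def __init__(self, keys):
        self._keys = sorted(set(keys))

    def lookup(self, s):
        """Return i with s_i == s, or None if s is not stored."""
        i = bisect_left(self._keys, s)
        if i < len(self._keys) and self._keys[i] == s:
            return i
        return None

    def access(self, i):
        """Return the string s_i for 0 <= i < m."""
        return self._keys[i]


def lz78_parse(text):
    """Textbook LZ78: return a list of (parent_phrase_id, char) pairs.

    Phrase 0 is the empty phrase. Every new phrase extends an
    earlier phrase by exactly one character.
    """
    trie = {}      # (parent_phrase_id, char) -> phrase_id
    phrases = []
    node = 0       # current position in the LZ78 trie
    for ch in text:
        child = trie.get((node, ch))
        if child is not None:
            node = child                      # keep extending the match
        else:
            trie[(node, ch)] = len(phrases) + 1
            phrases.append((node, ch))        # emit a new phrase
            node = 0
    if node != 0:
        # The text ended inside a known phrase: emit it without a new char.
        phrases.append((node, ""))
    return phrases


def lz78_decode(phrases):
    """Invert lz78_parse by rebuilding every phrase from its parent."""
    decoded = [""]
    for parent, ch in phrases:
        decoded.append(decoded[parent] + ch)
    return "".join(decoded[1:])


if __name__ == "__main__":
    keys = ["banana", "bandana", "band"]
    d = NaiveStringDictionary(keys)
    print(d.lookup("band"), d.access(1))
    text = "\n".join(sorted(keys))  # separator choice is an assumption
    phrases = lz78_parse(text)
    print(len(text), "characters ->", len(phrases), "phrases")
```

Keys that share long substrings produce few phrases relative to their total length, which is the effect the abstract refers to for dictionaries with many repeated substrings. How the article maps phrases back to individual keys for lookup and access is not shown here.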

Notes

  1. Available at http://ls11-www.cs.uni-dortmund.de/_media/fischer/research/code-lz-csd.rar.

  2. The depiction of the PDT is simplified; a concrete implementation includes some technical details that are not required to understand the techniques described in this section.

  3. This is only theoretically interesting, as for our datasets no phrase exceeds a length of 127 characters.

  4. http://pizzachili.di.unipi.it.

  5. https://code.google.com/archive/p/tx-trie/.

  6. http://github.com/ot/path_decomposed_tries.

References

  1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410 (1990)

  2. Arroyuelo, D., Navarro, G.: Space-efficient construction of Lempel–Ziv compressed text indexes. Inf. Comput. 209(7), 1070–1102 (2011)

  3. Arroyuelo, D., Navarro, G., Sadakane, K.: Stronger Lempel–Ziv based compressed text indexing. Algorithmica 62(1–2), 54–101 (2012)

  4. Arz, J., Fischer, J.: LZ-compressed string dictionaries. In: Proceedings of the DCC, pp. 322–331. IEEE Press (2014)

  5. Belazzougui, D., Navarro, G.: Optimal lower and upper bounds for representing sequences. ACM Trans. Algorithms 11(4), 31 (2015)

  6. Boldi, P., Codenotti, B., Santini, M., Vigna, S.: Ubicrawler: a scalable fully distributed web crawler. Softw. Pract. Exp. 34(8), 711–726 (2004)

  7. Böttcher, S., Lohrey, M., Maneth, S., Rytter, W. (eds): Abstracts collection—structure-based compression of complex massive data. No. 08261 in Dagstuhl Seminar Proceedings, Schloss Dagstuhl—Leibniz-Zentrum für Informatik, Germany (2008)

  8. Brisaboa, N.R., Cánovas, R., Claude, F., Martínez-Prieto, M.A., Navarro, G.: Compressed string dictionaries. In: Proceedings of the 10th International Symposium on Experimental Algorithms (SEA 2011), Springer, Lecture Notes in Computer Science, vol. 6630, pp. 136–147 (2011)

  9. Clark, D.R.: Compact Pat Trees. PhD thesis, Waterloo, ON, Canada (1998)

  10. Ferragina, P., Venturini, R.: Compressed permuterm index. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2007), ACM, pp. 535–542 (2007)

  11. Fischer, J., Gawrychowski, P.: Alphabet-dependent string searching with wexponential search trees. In: Proceedings of the CPM, Springer, LNCS, vol. 9133, pp. 160–171 (2015)

  12. Fischer, J., I, T., Köppl, D.: Lempel Ziv computation in small space (LZ-CISS). In: Proceedings of the CPM, Springer, LNCS, vol. 9133, pp. 172–184 (2015)

  13. Fredman, M.L., Komlós, J., Szemerédi, E.: Storing a sparse table with \({O}(1)\) worst case access time. J. ACM 31(3), 538–544 (1984)

  14. Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Proceedings of the SEA, Springer, LNCS, vol. 8504, pp. 326–337 (2014)

  15. Grossi, R., Ottaviano, G.: Fast compressed tries through path decompositions. ACM J. Exp. Algorithmics 19, 3–4 (2014)

  16. Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2003), ACM/SIAM, pp. 841–850 (2003)

  17. Hu, T.C., Tucker, A.C.: Optimal computer search trees and variable-length alphabetical codes. SIAM J. Appl. Math. 21(4), 514–532 (1971)

  18. Huffman, D.A.: A method for the construction of minimum-redundancy codes. Proc. IRE 40(9), 1098–1101 (1952)

  19. Jacobson, G.: Space-efficient static trees and graphs. In: Proceedings of the 30th Annual Symposium on Foundations of Computer Science (FOCS 1989), IEEE Computer Society, pp. 549–554 (1989)

  20. Jansson, J., Sadakane, K., Sung, W.: Linked dynamic tries with applications to LZ-compression in sublinear time and space. Algorithmica 71(4), 969–988 (2015)

  21. Knuth, D.E.: Sorting and Searching, The Art of Computer Programming, vol. 3, 2nd edn. Addison Wesley, Reading (1998)

  22. Kosaraju, S.R., Manzini, G.: Compression of low entropy strings with Lempel–Ziv algorithms. SIAM J. Comput. 29(3), 893–911 (1999)

  23. Larsson, N.J., Moffat, A.: Offline dictionary-based compression. In: Proceedings of the Data Compression Conference (DCC 1999), IEEE Computer Society, pp. 296–305 (1999)

  24. Maneth, S., Navarro, G.: Indexes and computation over compressed structured data (Dagstuhl Seminar 13232). Dagstuhl Rep. 3(6), 22–37 (2013)

  25. Martínez-Prieto, M.A., Brisaboa, N.R., Cánovas, R., Claude, F., Navarro, G.: Practical compressed string dictionaries. Inf. Syst. 56, 73–108 (2016)

  26. Mehlhorn, K., Sanders, P.: Algorithms and Data Structures: The Basic Toolbox. Springer, Berlin (2008)

  27. Müller, I., Ratsch, C., Färber, F.: Adaptive string dictionary compression in in-memory column-store database systems. In: Proceedings of the 17th International Conference on Extending Database Technology (EDBT), OpenProceedings.org, pp. 283–294 (2014)

  28. Munro, J.I.: Tables. In: Proceedings of the 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 1996), Springer, Lecture Notes in Computer Science, vol. 1180, pp. 37–42 (1996)

  29. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39, 2 (2007)

  30. Navarro, G., Providel, E.: Fast, small, simple rank/select on bitmaps. In: Proceedings of the SEA, Springer, LNCS, vol. 7276, pp. 295–306 (2012)

  31. Russo, L.M.S., Oliveira, A.L.: A compressed self-index using a Ziv–Lempel dictionary. Inf. Retr. 11(4), 359–388 (2008)

  32. Russo, L.M.S., Navarro, G., Oliveira, A.L.: Fully compressed suffix trees. ACM Trans. Algorithms 7(4), 53 (2011)

  33. Vigna, S.: Broadword implementation of rank/select queries. In: Proceedings of the 7th International Workshop on Experimental Algorithms (WEA 2008), Springer, Lecture Notes in Computer Science, vol. 5038, pp. 154–168 (2008)

  34. Williams, H.E., Zobel, J.: Compressing integers for fast file access. Comput. J. 42(3), 193–201 (1999)

  35. Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann, San Francisco (1999)

  36. Yata, S.: Dictionary compression using nested prefix/Patricia tries (in Japanese). In: Proceedings of the 17th Annual Meeting on Natural Language Processing (NLP2011), pp. 576–578 (2011). http://www.anlp.jp/proceedings/annual_meeting/2011/pdf_dir/F2-6.pdf

  37. Zhou, D., Andersen, D.G., Kaminsky, M.: Space-efficient, high-performance rank and select structures on uncompressed bit sequences. In: Proceedings of the SEA, Springer, LNCS, vol. 7933, pp. 151–163 (2013)

  38. Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24(5), 530–536 (1978)

Acknowledgements

Many people helped to improve this article in different ways. First, we thank Giuseppe Ottaviano for providing his data sets, and Francisco Claude and Miguel Ángel Martínez-Prieto for the source codes of their implementations. Second, we thank Paweł Gawrychowski for interesting discussions on this topic, and Giuseppe Ottaviano, Rossano Venturini, and Gonzalo Navarro for pointing out the work by Russo and Oliveira [31] during the Dagstuhl Seminar 13232 “Indexes and Computation over Compressed Structured Data” [24]. Gonzalo Navarro also brought Lemma 2.3 from Kosaraju and Manzini [22] to our attention. We further thank Simon Gog for bringing [36] to our attention, and the anonymous reviewers for their comments that helped to improve this article.

Author information

Corresponding author

Correspondence to Johannes Fischer.

Additional information

A preliminary version of this paper was published at the 24th Data Compression Conference [4], and in the first author’s diploma thesis at KIT.

About this article

Cite this article

Arz, J., Fischer, J. Lempel–Ziv-78 Compressed String Dictionaries. Algorithmica 80, 2012–2047 (2018). https://doi.org/10.1007/s00453-017-0348-7

  • DOI: https://doi.org/10.1007/s00453-017-0348-7
