
Top-k Frequent Patterns in Streams and Parameterized-Space LZ Compression

Authors Patrick Dinklage, Johannes Fischer, Nicola Prezza




File

LIPIcs.SEA.2024.9.pdf
  • Filesize: 0.84 MB
  • 20 pages

Author Details

Patrick Dinklage
  • TU Dortmund University, Germany
Johannes Fischer
  • TU Dortmund University, Germany
Nicola Prezza
  • Ca' Foscari University of Venice, Italy

Acknowledgements

The authors gratefully acknowledge the computing time provided on the Linux HPC cluster at Technical University Dortmund (LiDO3), partially funded in the course of the Large-Scale Equipment Initiative by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) as project 271512359.

Cite As

Patrick Dinklage, Johannes Fischer, and Nicola Prezza. Top-k Frequent Patterns in Streams and Parameterized-Space LZ Compression. In 22nd International Symposium on Experimental Algorithms (SEA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 301, pp. 9:1-9:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024) https://doi.org/10.4230/LIPIcs.SEA.2024.9

Abstract

We present novel online approximations of the Lempel-Ziv 77 (LZ77) and Lempel-Ziv 78 (LZ78) compression schemes [Lempel & Ziv, 1977/1978] with parameterizable space usage based on estimating which k patterns occur the most frequently in the streamed input for parameter k. This new approach overcomes the issue of finding only local repetitions, which is a natural limitation of algorithms that compress using a sliding window or by partitioning the input into blocks. For this, we introduce the top-k trie, a summary for maintaining online the top-k frequent consecutive patterns in a stream of characters based on a combination of the Lempel-Ziv 78 compression scheme and the Misra-Gries algorithm for frequent item estimation in streams. Using straightforward encoding, our implementations yield compression ratios (output over input size) competitive with established general-purpose LZ-based compression utilities such as gzip or xz.
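
The abstract names two classic building blocks, the LZ78 parsing and the Misra-Gries frequent-item summary. The following is a minimal textbook sketch of each, given only for orientation. It is not the authors' top-k trie and makes no claim about their implementation; the function names `misra_gries` and `lz78_parse` and the convention of using `k - 1` counters are assumptions chosen for this illustration.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Textbook Misra-Gries summary keeping at most k - 1 counters.

    Each returned count underestimates the true frequency of its item
    by at most n / k, where n is the length of the stream.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those reaching zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def lz78_parse(text: str) -> List[Tuple[int, str]]:
    """Textbook LZ78 parsing.

    Each phrase is (index of the longest previous phrase, next character),
    where index 0 denotes the empty phrase.
    """
    trie: Dict[Tuple[int, str], int] = {}  # (parent node, char) -> child node
    phrases: List[Tuple[int, str]] = []
    node = 0  # current trie node; 0 is the root
    for ch in text:
        child = trie.get((node, ch))
        if child is not None:
            node = child  # keep extending the current match
        else:
            trie[(node, ch)] = len(trie) + 1  # new phrase gets the next index
            phrases.append((node, ch))
            node = 0
    if node != 0:
        # Input ended inside a known phrase; emit it with an empty extension.
        phrases.append((node, ""))
    return phrases


if __name__ == "__main__":
    print(misra_gries("abacabad", 3))
    print(lz78_parse("abababa"))
```

In this sketch the LZ78 trie grows without bound, whereas the summary uses a fixed number of counters; the abstract describes combining the two ideas so that space is governed by the parameter k.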

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
  • Theory of computation → Pattern matching
  • Theory of computation → Sketching and sampling
Keywords
  • compression
  • streaming
  • heavy hitters
  • algorithm engineering

References

  1. Pankaj K. Agarwal, Graham Cormode, Zengfeng Huang, Jeff M. Phillips, Zhewei Wei, and Ke Yi. Mergeable summaries. ACM Trans. Database Syst., 38(4):26, 2013. URL: https://doi.org/10.1145/2500128.
  2. Charu C. Aggarwal and Philip S. Yu. A survey of synopsis construction in data streams. In Data Streams - Models and Algorithms, volume 31 of Advances in Database Systems, pages 169-207. Springer, 2007. URL: https://doi.org/10.1007/978-0-387-47534-9_9.
  3. Sergio De Agostino. Bounded size dictionary compression: Relaxing the LRU deletion heuristic. Int. J. Found. Comput. Sci., 17(6):1273-1280, 2006. URL: https://doi.org/10.1142/S0129054106004406.
  4. Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci., 58(1):137-147, 1999. URL: https://doi.org/10.1006/jcss.1997.1545.
  5. Alberto Apostolico, Matteo Comin, and Laxmi Parida. Bridging lossy and lossless compression by motif pattern discovery. In General Theory of Information Transfer and Combinatorics, volume 4123 of Lecture Notes in Computer Science, pages 793-813. Springer, 2006. URL: https://doi.org/10.1007/11889342_51.
  6. Diego Arroyuelo, Rodrigo Cánovas, Johannes Fischer, Dominik Köppl, Marvin Löbel, Gonzalo Navarro, and Rajeev Raman. Engineering practical Lempel-Ziv tries. ACM J. Exp. Algorithmics, 26:14:1-14:47, 2021. URL: https://doi.org/10.1145/3481638.
  7. Julian Arz and Johannes Fischer. Lempel-Ziv-78 compressed string dictionaries. Algorithmica, 80(7):2012-2047, 2018. URL: https://doi.org/10.1007/S00453-017-0348-7.
  8. Sairam Behera, Sutanu Gayen, Jitender S. Deogun, and N. V. Vinodchandran. KmerEstimate: A streaming algorithm for estimating k-mer counts with optimal space usage. In ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB), pages 438-447. ACM, 2018. URL: https://doi.org/10.1145/3233547.3233587.
  9. Prosenjit Bose, Evangelos Kranakis, Pat Morin, and Yihui Tang. Bounds for frequency estimation of packet streams. In 10th International Colloquium on Structural Information Complexity (SIROCCO), volume 17 of Proceedings in Informatics, pages 33-42. Carleton Scientific, 2003.
  10. Graham Cormode and Marios Hadjieleftheriou. Finding the frequent items in streams of data. Commun. ACM, 52(10):97-105, 2009. URL: https://doi.org/10.1145/1562764.1562789.
  11. Peter Deutsch. DEFLATE compressed data format specification version 1.3. RFC, 1951:1-17, 1996. URL: https://doi.org/10.17487/RFC1951.
  12. Jonas Ellert. Sublinear time Lempel-Ziv (LZ77) factorization. In 30th International Symposium on String Processing and Information Retrieval (SPIRE), volume 14240 of Lecture Notes in Computer Science, pages 171-187. Springer, 2023. URL: https://doi.org/10.1007/978-3-031-43980-3_14.
  13. Johannes Fischer, Travis Gagie, Pawel Gawrychowski, and Tomasz Kociumaka. Approximating LZ77 via small-space multiple-pattern matching. In 23rd European Symposium on Algorithms (ESA), volume 9294 of Lecture Notes in Computer Science, pages 533-544. Springer, 2015. URL: https://doi.org/10.1007/978-3-662-48350-3_45.
  14. Johannes Fischer, Volker Heun, and Stefan Kramer. Optimal string mining under frequency constraints. In 10th European Conference on Principles and Practice of Knowledge Discovery in Databases PKDD, volume 4213 of Lecture Notes in Computer Science, pages 139-150. Springer, 2006. URL: https://doi.org/10.1007/11871637_17.
  15. Johannes Fischer, Tomohiro I, Dominik Köppl, and Kunihiko Sadakane. Lempel-Ziv factorization powered by space efficient suffix trees. Algorithmica, 80(7):2048-2081, 2018. URL: https://doi.org/10.1007/S00453-017-0333-1.
  16. E. Fredkin. Trie memory. Commun. ACM, 3:490-499, 1960. URL: https://doi.org/10.1145/367390.367400.
  17. Wensheng Gan, Jerry Chun-Wei Lin, Philippe Fournier-Viger, Han-Chieh Chao, and Philip S. Yu. A survey of parallel sequential pattern mining. ACM Trans. Knowl. Discov. Data, 13(3):25:1-25:34, 2019. URL: https://doi.org/10.1145/3314107.
  18. Torben Hagerup. Sorting and searching on the word RAM. In 15th Annual Symposium on Theoretical Aspects of Computer Science (STACS), volume 1373 of Lecture Notes in Computer Science, pages 366-398. Springer, 1998. URL: https://doi.org/10.1007/BFb0028575.
  19. Aaron Hong, Massimiliano Rossi, and Christina Boucher. LZ77 via prefix-free parsing. In 25th Workshop on Algorithm Engineering and Experiments (ALENEX), pages 123-134. SIAM, 2023. URL: https://doi.org/10.1137/1.9781611977561.CH11.
  20. Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Linear time Lempel-Ziv factorization: Simple, fast, small. In 24th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 7922 of Lecture Notes in Computer Science, pages 189-200. Springer, 2013. URL: https://doi.org/10.1007/978-3-642-38905-4_19.
  21. Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Lempel-Ziv parsing in external memory. In Data Compression Conference, DCC 2014, Snowbird, UT, USA, 26-28 March, 2014, pages 153-162. IEEE, 2014. URL: https://doi.org/10.1109/DCC.2014.78.
  22. Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. Linear work suffix array construction. J. ACM, 53(6):918-936, 2006. URL: https://doi.org/10.1145/1217856.1217858.
  23. Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: string attractors. In 50th Annual ACM Symposium on Theory of Computing (STOC), pages 827-840. ACM, 2018. URL: https://doi.org/10.1145/3188745.3188814.
  24. Dmitry Kosolobov, Daniel Valenzuela, Gonzalo Navarro, and Simon J. Puglisi. Lempel-Ziv-like parsing in small space. Algorithmica, 82(11):3195-3215, 2020. URL: https://doi.org/10.1007/S00453-020-00722-6.
  25. Shirou Maruyama and Yasuo Tabei. Fully online grammar compression in constant space. In Data Compression Conference, DCC 2014, Snowbird, UT, USA, 26-28 March, 2014, pages 173-182. IEEE, 2014. URL: https://doi.org/10.1109/DCC.2014.69.
  26. Páll Melsted and Bjarni V. Halldórsson. KmerStream: streaming algorithms for k-mer abundance estimation. Bioinform., 30(24):3541-3547, 2014. URL: https://doi.org/10.1093/bioinformatics/btu713.
  27. Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Efficient computation of frequent and top-k elements in data streams. In 10th International Conference on Database Theory (ICDT), volume 3363 of Lecture Notes in Computer Science, pages 398-412. Springer, 2005. URL: https://doi.org/10.1007/978-3-540-30570-5_27.
  28. Jayadev Misra and David Gries. Finding repeated elements. Sci. Comput. Program, 2(2):143-152, 1982. URL: https://doi.org/10.1016/0167-6423(82)90012-0.
  29. Hamid Mohamadi, Hamza Khan, and Inanç Birol. ntCard: a streaming algorithm for cardinality estimation in genomics data. Bioinform., 33(9):1324-1330, 2017. URL: https://doi.org/10.1093/bioinformatics/btw832.
  30. S. Muthukrishnan. Data streams: algorithms and applications. In 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 413-413. SIAM, 2003. URL: http://dl.acm.org/citation.cfm?id=644108.644174.
  31. Prashant Pandey, Michael A. Bender, Rob Johnson, and Rob Patro. Squeakr: an exact and approximate k-mer counting system. Bioinform., 34(4):568-575, 2018. URL: https://doi.org/10.1093/bioinformatics/btx636.
  32. Alberto Policriti and Nicola Prezza. LZ77 computation based on the run-length encoded BWT. Algorithmica, 80(7):1986-2011, 2018. URL: https://doi.org/10.1007/S00453-017-0327-Z.
  33. Kunihiko Sadakane. Compressed suffix trees with full functionality. Theory Comput. Syst., 41(4):589-607, 2007. URL: https://doi.org/10.1007/S00224-006-1198-X.
  34. Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory, 23(3):337-343, 1977. URL: https://doi.org/10.1109/TIT.1977.1055714.
  35. Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Trans. Inform. Theory, 24(5):530-536, 1978. URL: https://doi.org/10.1109/TIT.1978.1055934.