Estimates the number of distinct elements in a data stream using the HyperLogLog++ algorithm. The transforms that create sketches, merge sketches, and extract estimates from sketches are HllCount.Init, HllCount.MergePartial, and HllCount.Extract, respectively.

You can read more about what a sketch is at https://github.com/google/zetasketch.
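To give a rough feel for what "create", "merge", and "extract" mean for a sketch, here is a minimal plain-Java illustration. It is a simplified classic HyperLogLog, not the HyperLogLog++ implementation that ZetaSketch and HllCount use, and its output is not compatible with their serialized sketches. The class name TinyHll, the hash mixer, and the precision bounds are all assumptions made for this illustration.

```java
/** Illustrative classic HyperLogLog. NOT the HLL++ sketch format used by HllCount. */
final class TinyHll {
  private final int p;            // precision: number of hash bits used as register index
  private final byte[] registers; // m = 2^p registers, each holding a max "rank"

  TinyHll(int p) {
    // Bounds are an assumption of this sketch; the alpha constant below needs m >= 128.
    if (p < 7 || p > 16) throw new IllegalArgumentException("p must be in [7, 16]");
    this.p = p;
    this.registers = new byte[1 << p];
  }

  /** Stand-in 64-bit hash (SplitMix64-style mixer); real sketches use their own hash. */
  private static long mix(long z) {
    z += 0x9E3779B97F4A7C15L;
    z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
    z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
    return z ^ (z >>> 31);
  }

  /** "Create": fold one element into the sketch. */
  void add(long value) {
    long h = mix(value);
    int idx = (int) (h >>> (64 - p)); // top p bits select a register
    long rest = h << p;               // remaining bits, left-aligned
    int rank = Math.min(Long.numberOfLeadingZeros(rest), 64 - p) + 1;
    if (rank > registers[idx]) registers[idx] = (byte) rank;
  }

  /** "Merge": register-wise maximum, which yields the sketch of the union. */
  void merge(TinyHll other) {
    if (other.p != p) throw new IllegalArgumentException("precision mismatch");
    for (int i = 0; i < registers.length; i++) {
      registers[i] = (byte) Math.max(registers[i], other.registers[i]);
    }
  }

  /** "Extract": estimate the number of distinct elements seen so far. */
  long estimate() {
    int m = registers.length;
    double sum = 0;
    int zeros = 0;
    for (byte r : registers) {
      sum += Math.pow(2, -r);
      if (r == 0) zeros++;
    }
    double alpha = 0.7213 / (1 + 1.079 / m);
    double raw = alpha * m * m / sum;
    if (raw <= 2.5 * m && zeros > 0) {
      // Small-range correction: fall back to linear counting.
      return Math.round(m * Math.log((double) m / zeros));
    }
    return Math.round(raw);
  }

  public static void main(String[] args) {
    TinyHll left = new TinyHll(12);
    TinyHll right = new TinyHll(12);
    for (long i = 0; i < 60_000; i++) left.add(i);
    for (long i = 40_000; i < 100_000; i++) right.add(i);
    left.merge(right); // now summarizes the union 0..99999
    System.out.println("estimated distinct count: " + left.estimate());
  }
}
```

Because merging is a register-wise maximum, merging two sketches gives exactly the same registers as sketching the union of their inputs directly. That property is what allows sketches to be combined across workers or stored and merged later.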

Examples

Example 1: creates a long-type sketch for a PCollection<Long> with a custom precision:

 PCollection<Long> input = ...;
 int p = ...;
 PCollection<byte[]> sketch = input.apply(HllCount.Init.forLongs().withPrecision(p).globally());
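As a rough guide to choosing p, the following sketch tabulates the register count 2^p and the textbook relative standard error of classic HyperLogLog, about 1.04 / sqrt(2^p). Both the formula and the precision range 10 to 24 used in the loop are assumptions for illustration: HyperLogLog++ applies additional bias correction and a sparse representation, so actual accuracy and memory use can differ, and the precisions HllCount accepts should be checked against its Javadoc. The class name HllPrecisionTable is hypothetical.

```java
/** Back-of-the-envelope precision table for a HyperLogLog-style sketch. */
final class HllPrecisionTable {
  /** Number of registers for precision p. */
  static long registerCount(int p) {
    return 1L << p;
  }

  /** Approximate relative standard error of classic HyperLogLog with 2^p registers. */
  static double approxStdError(int p) {
    return 1.04 / Math.sqrt((double) registerCount(p));
  }

  public static void main(String[] args) {
    // Assumed precision range; verify against the HllCount Javadoc.
    for (int p = 10; p <= 24; p += 2) {
      System.out.printf(
          "p=%d registers=%d approx. error=%.3f%%%n",
          p, registerCount(p), 100 * approxStdError(p));
    }
  }
}
```

Each increment of p doubles the number of registers and reduces the approximate error by a factor of about 1.41, so higher precision trades sketch size for accuracy.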

Example 2: creates a bytes-type sketch per key for a PCollection<KV<String, byte[]>>:

 PCollection<KV<String, byte[]>> input = ...;
 PCollection<KV<String, byte[]>> sketch = input.apply(HllCount.Init.forBytes().perKey());

Example 3: merges existing sketches in a PCollection<byte[]> into a new sketch, which summarizes the union of the inputs that were aggregated in the merged sketches:

 PCollection<byte[]> sketches = ...;
 PCollection<byte[]> mergedSketch = sketches.apply(HllCount.MergePartial.globally());

Example 4: estimates the count of distinct elements in a PCollection<String>:

 PCollection<String> input = ...;
 PCollection<Long> countDistinct =
     input.apply(HllCount.Init.forStrings().globally()).apply(HllCount.Extract.globally());

Example 5: extracts the distinct-count estimate from an existing sketch:

 PCollection<byte[]> sketch = ...;
 PCollection<Long> countDistinct = sketch.apply(HllCount.Extract.globally());