HLL

Description

HLL is an engineering implementation based on the HyperLogLog algorithm. It is used to store the intermediate results of the HyperLog calculation process. It can only be used as the value column type of the table. By aggregating to reduce the amount of data continuously, in order to achieve the purpose of speeding up the query, based on which an estimated result, the error is about 1%. The HLL column is generated by other columns or data in the imported data. When imported, the hll_hash function is used to specify which column in the data is used to generate the HLL column. It is often used to replace count distinct, and to quickly calculate UV in business by combining rollup.

The correlation function:

TOTAL UNION This function is an aggregation function, which is used to calculate the cardinality estimation of all data satisfying the conditions. This function can also be used to analyze functions. It only supports the default window and does not support the window clause.

Coach L.u RAW AGG This function is an aggregation function that aggregates HLL type fields and returns HLL type.

HLL_CARDINALITY(hll) This function is used to estimate the cardinality of a single HLL sequence

HLL_HASH(column_name) Generate HLL column types for insert or import, see the instructions for the use of imports

EMPTY_HLL() Generate empty HLL column types for insert or import, see the instructions for the use of imports

example

First create a table with HLL columns create table test( dt date, id int, name char(10), Province of char (10), The char (1), the European Union, European Union distributed by hash(id) buckets 32;

Import data. See help curl for the way you import it.

A. Generate HLL columns using columns in tables

 curl --location-trusted -uname:password -T data -H "label:load_1" -H "columns:dt, id, name, province, os, set1=hll_hash(id), set2=hll_hash(name)"
     http://host/api/test_db/test/_stream_load

B. Generate HLL columns using a column in the data

 curl --location-trusted -uname:password -T data -H "label:load_1" -H "columns:dt, id, name, province, sex, cuid, os, set1=hll_hash(cuid), set2=hll_hash(os)"
     http://host/api/test_db/test/_stream_load

There are three common ways of aggregating data: (without aggregating the base table directly, the speed may be similar to that of using APPROX_COUNT_DISTINCT directly)

A. Create a rollup that allows HLL columns to generate aggregation. alter table test add rollup test_rollup(dt, set1);

B. Create another table dedicated to computing uv, and insert data)

create table test_uv( dt date, uv_set hll hll_union) distributed by hash(dt) buckets 32;

insert into test_uv select dt, set1 from test;

C. Create another table dedicated to computing uv, then insert and generate HLL columns from other non-hll columns of test through hll_hash

create table test_uv( dt date, id_set hll hll_union) distributed by hash(dt) buckets 32;

insert into test_uv select dt, hll_hash(id) from test;

Query, HLL column is not allowed to query its original value directly, it can be queried by matching functions.

a. 27714; 24635; uv select HLL_UNION_AGG(uv_set) from test_uv;

B. Seek every day’s UV select dt, HLL_CARDINALITY(uv_set) from test_uv;

C. Find the aggregate value of Set1 in the test table select dt, HLL_CARDINALITY(uv) from (select dt, HLL_RAW_AGG(set1) as uv from test group by dt) tmp; select dt, HLL_UNION_AGG(set1) as uv from test group by dt;

keyword

HLL