Develop an aggregate function

Prerequisite

To develop an aggregate function for MatrixOne, you need a basic knowledge of Golang programming. You can go through this excellent Golang tutorial to get some Golang basic concepts.

Preparation

Before you start, please make sure that you have Go installed, cloned the MatrixOne code base. Please refer to Preparation and Contribute Code for more details.

What is an aggregation function?

In database systems, an aggregate function or aggregation function is a function where the values of multiple rows are grouped together to form a single summary value.

Common aggregate functions include:

  • COUNT counts how many rows are in a particular column.
  • SUM adds together all the values in a particular column.
  • MIN and MAX return the lowest and highest values in a particular column, respectively.
  • AVG calculates the average of a group of selected values.

Aggregate function in MatrixOne

As MatrixOne’s one major key feature, via factorisation, the database join in MatrixOne is highly efficient and less redundant in comparison with other state-of-the-art databases. Therefore, many operations in MatrixOne need to be adapted to the factorisation method, in order to improve efficiency when performing join. Aggregate function is an important one among those operations.

To implement aggragate function in MatrixOne, we design a data structure named Ring. Every aggregate function needs to implement the Ring interface in order to be factorised when join occurs.

For the common aggregate function AVG as an example, we need to calculate the number of groups and their total numeric sum, then get an average result, which is the common practice for any database design. But when a query with join occurs for two tables, the common method is to get a Cartesian product by joining tables first, then perform an AVG with that Cartesian product, which is an expensive computational cost as Cartesian product can be very large. In MatrixOne’s implementation, the factorisation method pushs down the calculation of group statistic and sum before the join operation is performed. This method helps to reduce a lot in computational and storage cost. This factorisation is realized by the Ring interface and its inner functions.

To checkout more about the factorisation theory and factorized database, please refer to Principles of Factorised Databases.

What is a Ring

Ring is an important data structure for MatrixOne factorisation, as well as an mathematical algebraic concept with a clear definition). An algebraic Ring is a set equipped with two binary operations + (addition) and (multiplication) satisfying several axioms.

A Ring in MatrixOne is an interface, with several functions similar to the algebraic Ring structure. We use Ring interface to implement aggragate functions, the +(addition) is defined as merging two Rings groups, the (multiplication) operation is defined as the computation of a grouped aggregate value combined with its grouping key frequency information.

Method of Ring InterfaceDo What
CountReturn the number of groups
SizeReturn the memory size of Ring
DupDuplicate a Ring of same type
TypeReturn the type of a Ring
StringReturn some basic information of Ring for execution plan log
FreeFree the Ring memory
GrowAdd a group for the Ring
GrowsAdd multiple groups for the Ring
SetLengthShrink the size of Ring, keep the first N groups
ShrinkShrink the size of Ring, keep the designated groups
EvalReturn the eventual result of the aggregate function
FillUpdate the data of Ring by a row
BulkFillUpdate the ring data by a whole vector
BatchFillUpdate the ring data by a part of vector
AddMerge a couple of groups for two Rings
BatchAddMerge several couples of groups for two Rings
MulMultiplication between groups for two Rings, called when join occurs

The implementation of Ring data structure is under /pkg/container/ring/.

How does Ring work with query:

To better understand the Ring interface, we can take aggregate function sum() as an example. We’ll walk you through the whole process of Ring.

There are two different scenarios for aggregation functions with Rings.

1. Query with single table.

In the single table scenario, when we run the below query, it generates one or several Rings, depending on the storage blocks the T1 table is stored. The number of blocks depends on the storage strategy. Each Ring will store several groups of sums, the number of group depends how many duplicate rows are in this Ring.

  1. T1 (id, class, age)
  2. +------+-------+------+
  3. | id | class | age |
  4. +------+-------+------+
  5. | 1 | one | 23 |
  6. | 2 | one | 20 |
  7. | 3 | one | 22 |
  8. | 4 | two | 20 |
  9. | 5 | two | 19 |
  10. | 6 | three | 18 |
  11. | 7 | three | 20 |
  12. | 8 | three | 21 |
  13. | 9 | three | 24 |
  14. | 10 | three | 19 |
  15. +------+-------+------+
  16. select class, sum(age) from T1 group by class;

For example, if two Rings were generated, the first Ring holds the group sums of the first 4 rows, which will be like:

  1. | one | 23+20+22 |
  2. | two | 20 |

The second Ring holds the group sums of the last 6 rows, which will be like:

  1. | two | 19 |
  2. | three | 18+20+21+24+19 |

Then the Add method of Ring will be called to merge two groups together, and at last the Eval method will return the overall result to user.

  1. | one | 23+20+22 |
  2. | two | 20+19 |
  3. | three | 18+20+21+24+19 |

2. Query with joining multiple tables.

In the multiple tables join scenario, we have two tables Tc and Ts. The query looks like the following.

  1. Tc (id, class)
  2. +------+-------+
  3. | id | class |
  4. +------+-------+
  5. | 1 | one |
  6. | 2 | one |
  7. | 3 | one |
  8. | 4 | two |
  9. | 5 | two |
  10. | 6 | three |
  11. | 7 | three |
  12. | 8 | three |
  13. | 9 | three |
  14. | 10 | three |
  15. +------+-------+
  16. Ts (id, age)
  17. +------+------+
  18. | id | age |
  19. +------+------+
  20. | 1 | 23 |
  21. | 2 | 20 |
  22. | 3 | 22 |
  23. | 4 | 20 |
  24. | 5 | 19 |
  25. | 6 | 18 |
  26. | 7 | 20 |
  27. | 8 | 24 |
  28. | 9 | 24 |
  29. | 10 | 19 |
  30. +------+------+
  31. select class, sum(age) from Tc join Ts on Tc.id = Ts.id group by class;

When we run this query, it will firstly generate Rings for the Ts table, as we are performing aggeration over age column. It might generate also one or several Rings, same as the single table. For simplyfing, we imagine only one Ring is created for each table. The Ring-Ts will start to count sums for the group of id, as all ids are different, so it will maintain the same. Then a hashtable will be created for performing join operation.

The Ring-Tc is created in the same time as join is performed. This Ring-Tc will count the appearing frequency f of id. Then the Mul method of Ring-Ts is called, to calculate the sum calculated from the Ring-Ts and frequency from Ring-Tc.

  1. sum[i] = sum[i] * f[i]

Now we get values of [class, sum(age)], then performing a group by with class will give us the final result.

The secret of factorisation

From the above example, you can see that the Ring performs some pre calculations and only the result(like sum) is stored in its structure. When performing operations like join, only simple Add or Multiplication is needed to get the result, which is called a push down calculation in factorisation. With the help of this push down, we no longer need to deal with costly Cartesian product. As the joined table number increases, the factorisation allow us to take linear cost performing that, instead of exponential increase.

Take the implementation of Variance function as an example.

The variance formula is as below:

  1. Variance = Σ [(xi - x̅)^2]/n
  2. Example: xi = 10,8,6,12,14, x̅ = 10
  3. Calculation: ((10-10)^2+(8-10)^2+(6-10)^2+(12-10)^2+(14-10)^2)/5 = 8

If we proceed implementation with this formula, we have to record all values of each group, and keep maintaining these values with Add and Mul operations of Ring. Eventual result is calculated in Eval() function. This implementation has a drawback of high memory cost, as we store all values during processing.

But in the Avg implementation, it doesn’t store all values in the Ring. Instead it stores only the sum of each group and the null numbers. It returns the final result with a simple division. This method saves a lot of memory space.

Now let’s turn the Variance formula a bit to a different form:

  1. Variance = Σ (xi^2)/n-x̅^2
  2. Example: xi = 10,8,6,12,14, x̅ = 10
  3. Calculation: (10^2+8^2+6^2+12^2+14^2)/5-10^2 = 8

This formula’s result is actually exactly the same as the previous one, but we only have to record the values sum of xi^2 and the sum of xi. We can largely reduce the memory space with this kind of reformulation.

To conclude, every aggregate function needs to find a way to record as little values as possible in order to reduce memory cost. Below is two different implementation for Variance (the second one has a better performance):

  1. //Implementation1
  2. type VarRing struct {
  3. // Typ is vector's value type
  4. Typ types.Type
  5. // attributes for computing the variance
  6. Data []byte // store all the Sums' bytes
  7. Sums []float64 // sums of each group, its memory address is same to Data
  8. Values [][]float64 // values of each group
  9. NullCounts []int64 // group to record number of the null value
  10. }
  11. //Implementation2
  12. type VarRing struct {
  13. // Typ is vector's value type
  14. Typ types.Type
  15. // attributes for computing the variance
  16. Data []byte
  17. SumX []float64 // sum of x, its memory address is same to Data, because we will use it to store result finally.
  18. SumX2 []float64 // sum of x^2
  19. NullCounts []int64 // group to record number of the null value
  20. }

Develop an var() function

In this tutorial, we walk you through the complete implementation of Variance (get the standard overall variance value) aggregate function as an example with two different methods.

Step 1: register function

MatrixOne doesn’t distinguish between operators and functions. In our code repository, the file pkg/sql/viewexec/transformer/types.go register aggregate functions as operators and we assign each operator a distinct integer number. To add a new function var(), add a new const Variance in the const declaration and var in the name declaration.

  1. const (
  2. Sum = iota
  3. Avg
  4. Max
  5. Min
  6. Count
  7. StarCount
  8. ApproxCountDistinct
  9. Variance
  10. )
  11. var TransformerNames = [...]string{
  12. Sum: "sum",
  13. Avg: "avg",
  14. Max: "max",
  15. Min: "min",
  16. Count: "count",
  17. StarCount: "starcount",
  18. ApproxCountDistinct: "approx_count_distinct",
  19. Variance: "var",
  20. }

Step2: implement the Ring interface

1. Define Ring structure

Create variance.go under pkg/container/ring, and define a structure of VarRing.

As we calculate the overall variance, we need to calculate:

  • The numeric Sums and the null value numbers of each group, to calculate the average.
  • Values of each group, to calculate the variance.
  1. //Implementation1
  2. type VarRing struct {
  3. // Typ is vector's value type
  4. Typ types.Type
  5. // attributes for computing the variance
  6. Data []byte // store all the Sums' bytes
  7. Sums []float64 // sums of each group, its memory address is same to Data
  8. Values [][]float64 // values of each group
  9. NullCounts []int64 // group to record number of the null value
  10. }
  11. //Implementation2
  12. type VarRing struct {
  13. // Typ is vector's value type
  14. Typ types.Type
  15. // attributes for computing the variance
  16. Data []byte
  17. SumX []float64 // sum of x, its memory address is same to Data, because we will use it to store result finally.
  18. SumX2 []float64 // sum of x^2
  19. NullCounts []int64 // group to record number of the null value
  20. }

2. Implement the functions of Ring interface

You can checkout the full implmetation at variance.go.

  • Fill function
  1. //Implementation1
  2. func (v *VarRing) Fill(i, j int64, z int64, vec *vector.Vector) {
  3. var value float64 = 0
  4. switch vec.Typ.Oid {
  5. case types.T_int8:
  6. value = float64(vec.Col.([]int8)[j])
  7. case ...
  8. }
  9. for k := z; k > 0; k-- {
  10. v.Values[i] = append(v.Values[i], value)
  11. }
  12. v.Sums[i] += value * float64(z)
  13. if nulls.Contains(vec.Nsp, uint64(z)) {
  14. v.NullCounts[i] += z
  15. }
  16. }
  17. //Implementation2
  18. func (v *VarRing) Fill(i, j int64, z int64, vec *vector.Vector) {
  19. var value float64 = 0
  20. switch vec.Typ.Oid {
  21. case types.T_int8:
  22. value = float64(vec.Col.([]int8)[j])
  23. case ...
  24. }
  25. v.SumX[i] += value * float64(z)
  26. v.SumX2[i] += math.Pow(value, 2) * float64(z)
  27. if nulls.Contains(vec.Nsp, uint64(z)) {
  28. v.NullCounts[i] += z
  29. }
  30. }
  • Add function
  1. //Implementation1
  2. func (v *VarRing) Add(a interface{}, x, y int64) {
  3. v2 := a.(*VarRing)
  4. v.Sums[x] += v2.Sums[y]
  5. v.NullCounts[x] += v2.NullCounts[y]
  6. v.Values[x] = append(v.Values[x], v2.Values[y]...)
  7. }
  8. //Implementation2
  9. func (v *VarRing) Add(a interface{}, x, y int64) {
  10. v2 := a.(*VarRing)
  11. v.SumX[x] += v2.SumX[y]
  12. v.SumX2[x] += v2.SumX2[y]
  13. v.NullCounts[x] += v2.NullCounts[y]
  14. }
  • Mul function
  1. //Implementation1
  2. func (v *VarRing) Mul(a interface{}, x, y, z int64) {
  3. v2 := a.(*VarRing)
  4. {
  5. v.Sums[x] += v2.Sums[y] * float64(z)
  6. v.NullCounts[x] += v2.NullCounts[y] * z
  7. for k := z; k > 0; k-- {
  8. v.Values[x] = append(v.Values[x], v2.Values[y]...)
  9. }
  10. }
  11. }
  12. //Implementation2
  13. func (v *VarRing) Mul(a interface{}, x, y, z int64) {
  14. v2 := a.(*VarRing)
  15. {
  16. v.SumX[x] += v2.SumX[y] * float64(z)
  17. v.SumX2[x] += v2.SumX2[y] * float64(z)
  18. v.NullCounts[x] += v2.NullCounts[y] * z
  19. }
  20. }

* Eval function

  1. //Implementation1
  2. func (v *VarRing) Eval(zs []int64) *vector.Vector {
  3. defer func() {
  4. ...
  5. }()
  6. nsp := new(nulls.Nulls)
  7. for i, z := range zs {
  8. if n := z - v.NullCounts[i]; n == 0 {
  9. nulls.Add(nsp, uint64(i))
  10. } else {
  11. v.Sums[i] /= float64(n)
  12. var variance float64 = 0
  13. avg := v.Sums[i]
  14. for _, value := range v.Values[i] {
  15. variance += math.Pow(value-avg, 2.0) / float64(n)
  16. }
  17. v.Sums[i] = variance
  18. }
  19. }
  20. return ...
  21. }
  22. //Implementation2
  23. func (v *VarRing) Eval(zs []int64) *vector.Vector {
  24. defer func() {
  25. ...
  26. }()
  27. nsp := new(nulls.Nulls)
  28. for i, z := range zs {
  29. if n := z - v.NullCounts[i]; n == 0 {
  30. nulls.Add(nsp, uint64(i))
  31. } else {
  32. v.SumX[i] /= float64(n) // compute E(x)
  33. v.SumX2[i] /= float64(n) // compute E(x^2)
  34. variance := v.SumX2[i] - math.Pow(v.SumX[i], 2)
  35. v.SumX[i] = variance // using v.SumX to record the result and return.
  36. }
  37. }
  38. return ...
  39. }

3. Implement encoding and decoding for VarRing

In the pkg/sql/protocol/protocol.go file, implement the code for serialization and deserialization of VarRing.

Serialization functionDeserialization function
EncodeRingDecodeRing
DecodeRingWithProcess

Serialization:

  1. case *variance.VarRing:
  2. buf.WriteByte(VarianceRing)
  3. // NullCounts
  4. n := len(v.NullCounts)
  5. buf.Write(encoding.EncodeUint32(uint32(n)))
  6. if n > 0 {
  7. buf.Write(encoding.EncodeInt64Slice(v.NullCounts))
  8. }
  9. // Sumx2
  10. n = len(v.SumX2)
  11. buf.Write(encoding.EncodeUint32(uint32(n)))
  12. if n > 0 {
  13. buf.Write(encoding.EncodeFloat64Slice(v.SumX2))
  14. }
  15. // Sumx
  16. da := encoding.EncodeFloat64Slice(v.SumX)
  17. n = len(da)
  18. buf.Write(encoding.EncodeUint32(uint32(n)))
  19. if n > 0 {
  20. buf.Write(da)
  21. }
  22. // Typ
  23. buf.Write(encoding.EncodeType(v.Typ))
  24. return nil

Deserialization:

  1. case VarianceRing:
  2. r := new(variance.VarRing)
  3. data = data[1:]
  4. // decode NullCounts
  5. n := encoding.DecodeUint32(data[:4])
  6. data = data[4:]
  7. if n > 0 {
  8. r.NullCounts = make([]int64, n)
  9. copy(r.NullCounts, encoding.DecodeInt64Slice(data[:n*8]))
  10. data = data[n*8:]
  11. }
  12. // decode Sumx2
  13. n = encoding.DecodeUint32(data[:4])
  14. data = data[4:]
  15. if n > 0 {
  16. r.SumX2 = make([]float64, n)
  17. copy(r.SumX2, encoding.DecodeFloat64Slice(data[:n*8]))
  18. data = data[n*8:]
  19. }
  20. // decode Sumx
  21. n = encoding.DecodeUint32(data[:4])
  22. data = data[4:]
  23. if n > 0 {
  24. r.Data = data[:n]
  25. data = data[n:]
  26. }
  27. r.SumX = encoding.DecodeFloat64Slice(r.Data)
  28. // decode typ
  29. typ := encoding.DecodeType(data[:encoding.TypeSize])
  30. data = data[encoding.TypeSize:]
  31. r.Typ = typ
  32. // return
  33. return r, data, nil

Here we go. Now we can fire up MatrixOne and try with our var() function.

Compile and run MatrixOne

Once the aggregation function is ready, we could compile and run MatrixOne to see the function behavior.

Step1: Run make config and make build to compile the MatrixOne project and build binary file.

  1. make config
  2. make build

Info

make config generates a new configuration file, in this tutorial, you only need to run it once. If you modify some code and want to recompile, you only have to run make build.

Step2: Run ./mo-server system_vars_config.toml to launch MatrixOne, the MatrixOne server will start to listen for client connecting.

  1. ./mo-server system_vars_config.toml

Info

The logger print level of system_vars_config.toml is set to default as DEBUG, which will print a lot of information for you. If you only care about what your function will print, you can modify the system_vars_config.toml and set cubeLogLevel and level to ERROR level.

cubeLogLevel = “error”

level = “error”

Info

Sometimes a port is in use error at port 50000 will occur. You could checkout what process in occupying port 50000 by lsof -i:50000. This command helps you to get the PIDNAME of this process, then you can kill the process by kill -9 PIDNAME.

Step3: Connect to MatrixOne server with a MySQL client. Use the built-in test account for example:

user: dump password: 111

  1. $ mysql -h 127.0.0.1 -P 6001 -udump -p
  2. Enter password:

Step4: Test your function behavior with some data. Below is an example. You can check if you get the right mathematical variance result. You can also try an inner join and check the result, if the result is correct, the factorisation is valid.

  1. mysql>select * from variance;
  2. +------+------+
  3. | a | b |
  4. +------+------+
  5. | 1 | 4 |
  6. | 10 | 3 |
  7. | 19 | 12 |
  8. | 239 | 114 |
  9. | 49 | 149 |
  10. | 10 | 159 |
  11. | 1 | 3 |
  12. | 34 | 35 |
  13. +------+------+
  14. 8 rows in set (0.04 sec)
  15. mysql> select * from variance2;
  16. +------+------+
  17. | a | b |
  18. +------+------+
  19. | 14 | 3514 |
  20. | 10 | 3514 |
  21. | 1 | 61 |
  22. +------+------+
  23. 3 rows in set (0.02 sec)
  24. mysql> select var(variance.a), var(variance.b) from variance;
  25. +-----------------+-----------------+
  26. | var(variance.a) | var(variance.b) |
  27. +-----------------+-----------------+
  28. | 5596.2344 | 4150.1094 |
  29. +-----------------+-----------------+
  30. 1 row in set (0.06 sec)
  31. mysql> select variance.a, var(variance.b) from variance inner join variance2 on variance.a = variance2.a group by variance.a;
  32. +------------+-----------------+
  33. | variance.a | var(variance.b) |
  34. +------------+-----------------+
  35. | 10 | 6084.0000 |
  36. | 1 | 0.2500 |
  37. +------------+-----------------+
  38. 2 rows in set (0.04 sec)

Bingo!

Info

Except for var(), MatrixOne has already some neat examples for aggregate functions, such as sum(), count(), max(), min() and avg(). With some minor corresponding changes, the procedure is quite the same as other functions.

Write unit Test for your function

We recommend you to also write a unit test for the new function. Go has a built-in testing command called go test and a package testing which combine to give a minimal but complete testing experience. It automates execution of any function of the form.

  1. func TestXxx(*testing.T)

To write a new test suite, create a file whose name ends _test.go that contains the TestXxx functions as described here. Put the file in the same package as the one being tested. The file will be excluded from regular package builds but will be included when the go test command is run.

Step1: Create a file named variance_test.go under pkg/container/ring/variance/ directory. Import the testing framework and the reflect framework we are going to use for testing.

  1. package variance
  2. import (
  3. "fmt"
  4. "github.com/matrixorigin/matrixone/pkg/container/types"
  5. "reflect"
  6. "testing"
  7. )
  8. // TestVariance just for verify varRing related process
  9. func TestVariance(t *testing.T) {
  10. }

Step2: Implement the TestVariance function with some predefined values.

  1. func TestVariance(t *testing.T) {
  2. // verify that if we can calculate
  3. // the variance of {1, 2, null, 0, 3, 4} and {2, 3, null, null, 4, 5} correctly
  4. // 1. make the test case
  5. v1 := NewVarianceRing(types.Type{Oid: types.T_float64})
  6. v2 := v1.Dup().(*VarRing)
  7. {
  8. // first 3 rows.
  9. // column1: {1, 2, null}, column2: {2, 3, null}
  10. v1.SumX = []float64{1+2, 2+3}
  11. v1.SumX2 = []float64{1*1+2*2, 2*2+3*3}
  12. v1.NullCounts = []int64{1, 1}
  13. }
  14. {
  15. // last 3 rows.
  16. // column1: {0, 3, 4}, column2: {null, 4, 5}
  17. v2.SumX = []float64{0+3+4, 4+5}
  18. v2.SumX2 = []float64{3*3+4*4, 4*4+5*5}
  19. v2.NullCounts = []int64{0, 1}
  20. }
  21. v1.Add(v2, 0, 0)
  22. v1.Add(v2, 1, 1)
  23. result := v1.Eval([]int64{6, 6})
  24. expected := []float64{2.0, 1.25}
  25. if !reflect.DeepEqual(result.Col, expected) {
  26. t.Errorf(fmt.Sprintf("TestVariance wrong, expected %v, but got %v", expected, result.Col))
  27. }
  28. }

Step3: Complete the unit test for Serialization and Deserialization in the function TestRing in the file pkg/sql/protocol/protocol_test.go. You can check the complete test code of VarRing there.

  1. &variance.VarRing{
  2. NullCounts: []int64{1, 2, 3},
  3. SumX: []float64{4, 9, 13},
  4. SumX2: []float64{16, 81, 169},
  5. Typ: types.Type{Oid: types.T(types.T_float64), Size: 8},
  6. }
  7. case *variance.VarRing:
  8. oriRing := r.(*variance.VarRing)
  9. // Sumx
  10. if string(ExpectRing.Data) != string(encoding.EncodeFloat64Slice(oriRing.SumX)) {
  11. t.Errorf("Decode varRing Sums failed.")
  12. return
  13. }
  14. // NullCounts
  15. for i, n := range oriRing.NullCounts {
  16. if ExpectRing.NullCounts[i] != n {
  17. t.Errorf("Decode varRing NullCounts failed. \nExpected/Got:\n%v\n%v", n, ExpectRing.NullCounts[i])
  18. return
  19. }
  20. }
  21. // Sumx2
  22. for i, v := range oriRing.SumX2 {
  23. if !reflect.DeepEqual(ExpectRing.SumX2[i], v) {
  24. t.Errorf("Decode varRing Values failed. \nExpected/Got:\n%v\n%v", v, ExpectRing.SumX2[i])
  25. return
  26. }
  27. }

Step4: Launch Test.

Within the same directory as the test:

  1. go test

This picks up any files matching packagename_test.go. If you are getting a PASS, you are passing the unit test.

In MatrixOne, we have a bvt test framework which will run all the unit tests defined in the whole package, and each time your make a pull request to the code base, the test will automatically run. You code will be merged only if the bvt test pass.

Conduct a Performance Test

Aggregate function is an important feature of a database system, with queries on hundreds of millions of data rows, the time consumption of aggregate function is quite significant. So we recommend you to run a performance test.

Step1: Download the standard test dataset.

We have prepared a single table SSB query dataset with 10 million rows of data. The raw data file size is about 4GB, 500MB after being zipped. You can get the data files directly:

  1. https://community-shared-data-1308875761.cos.ap-beijing.myqcloud.com/lineorder_flat.tar.bz2

Step2: Unzip the file and Load the data into MatrixOne.

With the following SQL you can create the database and table, and load the lineorder_flat.tbl into MatrixOne.

  1. create database if not exists ssb;
  2. use ssb;
  3. drop table if exists lineorder_flat;
  4. CREATE TABLE lineorder_flat(
  5. LO_ORDERKEY bigint primary key,
  6. LO_LINENUMBER int,
  7. LO_CUSTKEY int,
  8. LO_PARTKEY int,
  9. LO_SUPPKEY int,
  10. LO_ORDERDATE date,
  11. LO_ORDERPRIORITY char(15),
  12. LO_SHIPPRIORITY tinyint,
  13. LO_QUANTITY double,
  14. LO_EXTENDEDPRICE double,
  15. LO_ORDTOTALPRICE double,
  16. LO_DISCOUNT double,
  17. LO_REVENUE int unsigned,
  18. LO_SUPPLYCOST int unsigned,
  19. LO_TAX double,
  20. LO_COMMITDATE date,
  21. LO_SHIPMODE char(10),
  22. C_NAME varchar(25),
  23. C_ADDRESS varchar(25),
  24. C_CITY char(10),
  25. C_NATION char(15),
  26. C_REGION char(12),
  27. C_PHONE char(15),
  28. C_MKTSEGMENT char(10),
  29. S_NAME char(25),
  30. S_ADDRESS varchar(25),
  31. S_CITY char(10),
  32. S_NATION char(15),
  33. S_REGION char(12),
  34. S_PHONE char(15),
  35. P_NAME varchar(22),
  36. P_MFGR char(6),
  37. P_CATEGORY char(7),
  38. P_BRAND char(9),
  39. P_COLOR varchar(11),
  40. P_TYPE varchar(25),
  41. P_SIZE int,
  42. P_CONTAINER char(10)
  43. );
  44. load data infile '/Users/YOURPATH/lineorder_flat.tbl' into table lineorder_flat FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';

If you load successfully this dataset, you are normally getting a result as:

  1. Query OK, 10272594 rows affected (1 min 7.09 sec)

Step3: Run your aggregate function and sum(), avg() on the column LO_SUPPKEY respectively to check the performance.

  1. select avg(LO_SUPPKEY) from lineorder_flat;
  2. select sum(LO_SUPPKEY) from lineorder_flat;
  3. select yourfunction(LO_SUPPKEY) from lineorder_flat;

Step4: When you submit your PR, please submit these performance results in your PR comment as well.