AMPLab Big Data Benchmark

See https://amplab.cs.berkeley.edu/benchmark/

Sign up for a free account at https://aws.amazon.com. It requires a credit card, email, and phone number. Get a new access key at https://console.aws.amazon.com/iam/home?nc2=h_m_sc#security_credential

Run the following in the console:

  1. $ sudo apt-get install s3cmd
  2. $ mkdir tiny; cd tiny;
  3. $ s3cmd sync s3://big-data-benchmark/pavlo/text-deflate/tiny/ .
  4. $ cd ..
  5. $ mkdir 1node; cd 1node;
  6. $ s3cmd sync s3://big-data-benchmark/pavlo/text-deflate/1node/ .
  7. $ cd ..
  8. $ mkdir 5nodes; cd 5nodes;
  9. $ s3cmd sync s3://big-data-benchmark/pavlo/text-deflate/5nodes/ .
  10. $ cd ..

Run the following ClickHouse queries:

  1. CREATE TABLE rankings_tiny
  2. (
  3. pageURL String,
  4. pageRank UInt32,
  5. avgDuration UInt32
  6. ) ENGINE = Log;
  7. CREATE TABLE uservisits_tiny
  8. (
  9. sourceIP String,
  10. destinationURL String,
  11. visitDate Date,
  12. adRevenue Float32,
  13. UserAgent String,
  14. cCode FixedString(3),
  15. lCode FixedString(6),
  16. searchWord String,
  17. duration UInt32
  18. ) ENGINE = MergeTree(visitDate, visitDate, 8192);
  19. CREATE TABLE rankings_1node
  20. (
  21. pageURL String,
  22. pageRank UInt32,
  23. avgDuration UInt32
  24. ) ENGINE = Log;
  25. CREATE TABLE uservisits_1node
  26. (
  27. sourceIP String,
  28. destinationURL String,
  29. visitDate Date,
  30. adRevenue Float32,
  31. UserAgent String,
  32. cCode FixedString(3),
  33. lCode FixedString(6),
  34. searchWord String,
  35. duration UInt32
  36. ) ENGINE = MergeTree(visitDate, visitDate, 8192);
  37. CREATE TABLE rankings_5nodes_on_single
  38. (
  39. pageURL String,
  40. pageRank UInt32,
  41. avgDuration UInt32
  42. ) ENGINE = Log;
  43. CREATE TABLE uservisits_5nodes_on_single
  44. (
  45. sourceIP String,
  46. destinationURL String,
  47. visitDate Date,
  48. adRevenue Float32,
  49. UserAgent String,
  50. cCode FixedString(3),
  51. lCode FixedString(6),
  52. searchWord String,
  53. duration UInt32
  54. ) ENGINE = MergeTree(visitDate, visitDate, 8192);

Go back to the console:

  1. $ for i in tiny/rankings/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO rankings_tiny FORMAT CSV"; done
  2. $ for i in tiny/uservisits/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO uservisits_tiny FORMAT CSV"; done
  3. $ for i in 1node/rankings/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO rankings_1node FORMAT CSV"; done
  4. $ for i in 1node/uservisits/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO uservisits_1node FORMAT CSV"; done
  5. $ for i in 5nodes/rankings/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO rankings_5nodes_on_single FORMAT CSV"; done
  6. $ for i in 5nodes/uservisits/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO uservisits_5nodes_on_single FORMAT CSV"; done

Queries for obtaining data samples:

  1. SELECT pageURL, pageRank FROM rankings_1node WHERE pageRank > 1000
  2. SELECT substring(sourceIP, 1, 8), sum(adRevenue) FROM uservisits_1node GROUP BY substring(sourceIP, 1, 8)
  3. SELECT
  4. sourceIP,
  5. sum(adRevenue) AS totalRevenue,
  6. avg(pageRank) AS pageRank
  7. FROM rankings_1node ALL INNER JOIN
  8. (
  9. SELECT
  10. sourceIP,
  11. destinationURL AS pageURL,
  12. adRevenue
  13. FROM uservisits_1node
  14. WHERE (visitDate > '1980-01-01') AND (visitDate < '1980-04-01')
  15. ) USING pageURL
  16. GROUP BY sourceIP
  17. ORDER BY totalRevenue DESC
  18. LIMIT 1

Original article