AMPLab Big Data Benchmark

See https://amplab.cs.berkeley.edu/benchmark/

Sign up for a free account at https://aws.amazon.com. Registration requires a credit card, email address, and phone number. Then obtain a new access key at https://console.aws.amazon.com/iam/home?nc2=h_m_sc#security_credential
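Before downloading the data, `s3cmd` needs to know these credentials. Running `s3cmd --configure` writes them to `~/.s3cfg` interactively; a minimal sketch of that file looks like the following (the key values are placeholders, not real credentials):

```ini
# ~/.s3cfg — minimal s3cmd configuration (placeholder values)
[default]
access_key = YOUR_ACCESS_KEY_ID
secret_key = YOUR_SECRET_ACCESS_KEY
```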

Run the following commands in the console:

``` bash
sudo apt-get install s3cmd
mkdir tiny; cd tiny;
s3cmd sync s3://big-data-benchmark/pavlo/text-deflate/tiny/ .
cd ..
mkdir 1node; cd 1node;
s3cmd sync s3://big-data-benchmark/pavlo/text-deflate/1node/ .
cd ..
mkdir 5nodes; cd 5nodes;
s3cmd sync s3://big-data-benchmark/pavlo/text-deflate/5nodes/ .
cd ..
```

Run the following queries in ClickHouse:

``` sql
CREATE TABLE rankings_tiny
(
    pageURL String,
    pageRank UInt32,
    avgDuration UInt32
) ENGINE = Log;

CREATE TABLE uservisits_tiny
(
    sourceIP String,
    destinationURL String,
    visitDate Date,
    adRevenue Float32,
    UserAgent String,
    cCode FixedString(3),
    lCode FixedString(6),
    searchWord String,
    duration UInt32
) ENGINE = MergeTree(visitDate, visitDate, 8192);

CREATE TABLE rankings_1node
(
    pageURL String,
    pageRank UInt32,
    avgDuration UInt32
) ENGINE = Log;

CREATE TABLE uservisits_1node
(
    sourceIP String,
    destinationURL String,
    visitDate Date,
    adRevenue Float32,
    UserAgent String,
    cCode FixedString(3),
    lCode FixedString(6),
    searchWord String,
    duration UInt32
) ENGINE = MergeTree(visitDate, visitDate, 8192);

CREATE TABLE rankings_5nodes_on_single
(
    pageURL String,
    pageRank UInt32,
    avgDuration UInt32
) ENGINE = Log;

CREATE TABLE uservisits_5nodes_on_single
(
    sourceIP String,
    destinationURL String,
    visitDate Date,
    adRevenue Float32,
    UserAgent String,
    cCode FixedString(3),
    lCode FixedString(6),
    searchWord String,
    duration UInt32
) ENGINE = MergeTree(visitDate, visitDate, 8192);
```

Go back to the console and run the following commands:

``` bash
for i in tiny/rankings/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO rankings_tiny FORMAT CSV"; done
for i in tiny/uservisits/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO uservisits_tiny FORMAT CSV"; done
for i in 1node/rankings/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO rankings_1node FORMAT CSV"; done
for i in 1node/uservisits/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO uservisits_1node FORMAT CSV"; done
for i in 5nodes/rankings/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO rankings_5nodes_on_single FORMAT CSV"; done
for i in 5nodes/uservisits/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO uservisits_5nodes_on_single FORMAT CSV"; done
```
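After loading, it is worth sanity-checking that each table actually received rows before running the benchmark queries. A minimal check (the exact counts depend on the dataset snapshot you downloaded, so none are given here):

``` sql
SELECT count() FROM rankings_tiny;
SELECT count() FROM uservisits_tiny;
SELECT count() FROM rankings_1node;
SELECT count() FROM uservisits_1node;
```

A result of 0 for any table usually means the corresponding `INSERT` loop failed, for example because `zlib-flate` (from the qpdf package) is not installed.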

Simple query examples:

``` sql
SELECT pageURL, pageRank FROM rankings_1node WHERE pageRank > 1000;

SELECT substring(sourceIP, 1, 8), sum(adRevenue) FROM uservisits_1node GROUP BY substring(sourceIP, 1, 8);

SELECT
    sourceIP,
    sum(adRevenue) AS totalRevenue,
    avg(pageRank) AS pageRank
FROM rankings_1node ALL INNER JOIN
(
    SELECT
        sourceIP,
        destinationURL AS pageURL,
        adRevenue
    FROM uservisits_1node
    WHERE (visitDate > '1980-01-01') AND (visitDate < '1980-04-01')
) USING pageURL
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1;
```

Original article