使用窗口函数分析数据

本教程简要介绍 Drill 1.2 的分析,即分析窗口函数的 ANSI SQL 标准。Drill 支持以下窗口函数:

  • PARTITION BY 和 OVER 语法
  • 不同种类的聚合函数,如 Sum,Max,Min,Count,Avg
  • 解析函数,例如 First_Value, Last_Value, Lead, Lag, NTile, Row_Number 和 Rank

窗口函数是多功能的。你可以减少连接,子查询,和显示游标,你只需要写而已。窗口函数以最小的编码工作,解决了各种复杂的情况。

本教程建立在前面的教程之上,A-Y-A-D 数据分析高度动态的数据集分析,并且使用的是相同的 Yelp 数据集。

开始

  1. 在开始前,下载 Yelp(商业评论)。
  2. 安装并启动 Drill
  3. 在 Drill 中列出可用的 Schema。

    1. SHOW schemas;
    2. +---------------------+
    3. | SCHEMA_NAME |
    4. +---------------------+
    5. | INFORMATION_SCHEMA |
    6. | cp.default |
    7. | dfs.default |
    8. | dfs.root |
    9. | dfs.tmp |
    10. | dfs.yelp |
    11. | sys |
    12. +---------------------+
    13. 7 rows selected (1.755 seconds)
  4. 切换工作目录。

    1. USE dfs.yelp;
    2. +-------+---------------------------------------+
    3. | ok | summary |
    4. +-------+---------------------------------------+
    5. | true | Default schema changed to [dfs.yelp] |
    6. +-------+---------------------------------------+
    7. 1 row selected (0.129 seconds)
  5. 开始探索 Yelp 中可用的数据集信息。

    1. SELECT * FROM `business.json` LIMIT 1;
    2. +------------------------+-----------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+--------------------------------+---------+--------------+-------------------+-------------+-------+-------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+
    3. | business_id | full_address | hours | open | categories | city | review_count | name | longitude | state | stars | latitude | attributes | type | neighborhoods |
    4. +------------------------+--------------+------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+--------------------------------+---------+--------------+-------------------+-------------+-------+-------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+
    5. | vcNAWiLM4dR7D2nwwJ7nCA | 4840 E Indian School Rd Ste 101 Phoenix, AZ 85018 | {"Tuesday":{"close":"17:00","open":"08:00"},"Friday":{"close":"17:00","open":"08:00"},"Monday":{"close":"17:00","open":"08:00"},"Wednesday":{"close":"17:00","open":"08:00"},"Thursday":{"close":"17:00","open":"08:00"},"Sunday":{},"Saturday":{}} | true | ["Doctors","Health & Medical"] | Phoenix | 7 | Eric Goldberg, MD | -111.983758 | AZ | 3.5 | 33.499313 | {"By Appointment Only":true,"Good Ambience":{},"Parking":{},"Music":{},"Hair Types Specialized In":{},"Payment Types":{},"Dietary Restrictions":{}} | business | [] |
    6. +-------------+--------------+-------+------+------------+------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+--------------------------------+---------+--------------+-------------------+-------------+-------+-------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+
    7. 1 row selected (0.514 seconds)

使用窗口函数做简单的查询

  1. 获取 Yelp 的业务基础评论数据。

    1. SELECT name, city, review_count, row_number()
    2. OVER (PARTITION BY city ORDER BY review_count DESC) AS rownum
    3. FROM `business.json` LIMIT 15;
    4. +----------------------------------------+------------+---------------+---------+
    5. | name | city | review_count | rownum |
    6. +----------------------------------------+------------+---------------+---------+
    7. | Cupz N' Crepes | Ahwatukee | 124 | 1 |
    8. | My Wine Cellar | Ahwatukee | 98 | 2 |
    9. | Kathy's Alterations | Ahwatukee | 12 | 3 |
    10. | McDonald's | Ahwatukee | 7 | 4 |
    11. | U-Haul | Ahwatukee | 5 | 5 |
    12. | Hi-Health | Ahwatukee | 4 | 6 |
    13. | Healthy and Clean Living Environments | Ahwatukee | 4 | 7 |
    14. | Active Kids Pediatrics | Ahwatukee | 4 | 8 |
    15. | Roberto's Authentic Mexican Food | Anthem | 117 | 1 |
    16. | Q to U BBQ | Anthem | 74 | 2 |
    17. | Outlets At Anthem | Anthem | 64 | 3 |
    18. | Dara Thai | Anthem | 56 | 4 |
    19. | Cafe Provence | Anthem | 53 | 5 |
    20. | Shanghai Club | Anthem | 50 | 6 |
    21. | Two Brothers Kitchen | Anthem | 43 | 7 |
    22. +----------------------------------------+------------+---------------+---------+
    23. 15 rows selected (0.67 seconds)
  2. 检查每个业务的数量相比在城市的所有业务的平均数量的评论。

    1. SELECT name, city,review_count,
    2. Avg(review_count) OVER (PARTITION BY City) AS city_reviews_avg
    3. FROM `business.json` LIMIT 15;
    4. +----------------------------------------+------------+---------------+---------------------+
    5. | name | city | review_count | city_reviews_avg |
    6. +----------------------------------------+------------+---------------+---------------------+
    7. | Hi-Health | Ahwatukee | 4 | 32.25 |
    8. | My Wine Cellar | Ahwatukee | 98 | 32.25 |
    9. | U-Haul | Ahwatukee | 5 | 32.25 |
    10. | Cupz N' Crepes | Ahwatukee | 124 | 32.25 |
    11. | McDonald's | Ahwatukee | 7 | 32.25 |
    12. | Kathy's Alterations | Ahwatukee | 12 | 32.25 |
    13. | Healthy and Clean Living Environments | Ahwatukee | 4 | 32.25 |
    14. | Active Kids Pediatrics | Ahwatukee | 4 | 32.25 |
    15. | Anthem Community Center | Anthem | 4 | 14.492063492063492 |
    16. | Scrapbooks To Remember | Anthem | 4 | 14.492063492063492 |
    17. | Hungry Howie's Pizza | Anthem | 7 | 14.492063492063492 |
    18. | Pinata Nueva | Anthem | 3 | 14.492063492063492 |
    19. | Starbucks Coffee Company | Anthem | 13 | 14.492063492063492 |
    20. | Pizza Hut | Anthem | 6 | 14.492063492063492 |
    21. | Rays Pizza | Anthem | 19 | 14.492063492063492 |
    22. +----------------------------------------+------------+---------------+---------------------+
    23. 15 rows selected (0.395 seconds)
  3. 检查每个企业的评论数量为城市的所有业务的总数量的贡献。

    1. SELECT name, city,review_count,
    2. Sum(review_count) OVER (PARTITION BY City) AS city_reviews_sum
    3. FROM `business.json`limit 15;
    4. +----------------------------------------+------------+---------------+-------------------+
    5. | name | city | review_count | city_reviews_sum |
    6. +----------------------------------------+------------+---------------+-------------------+
    7. | Hi-Health | Ahwatukee | 4 | 258 |
    8. | My Wine Cellar | Ahwatukee | 98 | 258 |
    9. | U-Haul | Ahwatukee | 5 | 258 |
    10. | Cupz N' Crepes | Ahwatukee | 124 | 258 |
    11. | McDonald's | Ahwatukee | 7 | 258 |
    12. | Kathy's Alterations | Ahwatukee | 12 | 258 |
    13. | Healthy and Clean Living Environments | Ahwatukee | 4 | 258 |
    14. | Active Kids Pediatrics | Ahwatukee | 4 | 258 |
    15. | Anthem Community Center | Anthem | 4 | 913 |
    16. | Scrapbooks To Remember | Anthem | 4 | 913 |
    17. | Hungry Howie's Pizza | Anthem | 7 | 913 |
    18. | Pinata Nueva | Anthem | 3 | 913 |
    19. | Starbucks Coffee Company | Anthem | 13 | 913 |
    20. | Pizza Hut | Anthem | 6 | 913 |
    21. | Rays Pizza | Anthem | 19 | 913 |
    22. +----------------------------------------+------------+---------------+-------------------+
    23. 15 rows selected (0.543 seconds)

使用窗口函数进行复杂查询

  1. 排名前 10 名的城市和他们的排名最高的企业数量的评论。使用 Drill 窗口函数,例如 rank,dense_sank。

    1. WITH X
    2. AS
    3. (SELECT name, city, review_count,
    4. RANK()
    5. OVER (PARTITION BY city
    6. ORDER BY review_count DESC) AS review_rank
    7. FROM `business.json`)
    8. SELECT X.name, X.city, X.review_count
    9. FROM X
    10. WHERE X.review_rank =1 ORDER BY review_count DESC LIMIT 10;
    11. +-------------------------------------------+-------------+---------------+
    12. | name | city | review_count |
    13. +-------------------------------------------+-------------+---------------+
    14. | Mon Ami Gabi | Las Vegas | 4084 |
    15. | Studio B | Henderson | 1336 |
    16. | Phoenix Sky Harbor International Airport | Phoenix | 1325 |
    17. | Four Peaks Brewing Co | Tempe | 1110 |
    18. | The Mission | Scottsdale | 783 |
    19. | Joe's Farm Grill | Gilbert | 770 |
    20. | The Old Fashioned | Madison | 619 |
    21. | Cornish Pasty Company | Mesa | 578 |
    22. | SanTan Brewing Company | Chandler | 469 |
    23. | Yard House | Glendale | 321 |
    24. +-------------------------------------------+-------------+---------------+
    25. 10 rows selected (0.49 seconds)
  2. 在城市的顶部和底部的评论计数的每个业务的评论数量比较。

    1. SELECT name, city, review_count,
    2. FIRST_VALUE(review_count)
    3. OVER(PARTITION BY city ORDER BY review_count DESC) AS top_review_count,
    4. LAST_VALUE(review_count)
    5. OVER(PARTITION BY city ORDER BY review_count DESC) AS bottom_review_count
    6. FROM `business.json` limit 15;
    7. +----------------------------------------+------------+---------------+-------------------+----------------------+
    8. | name | city | review_count | top_review_count | bottom_review_count |
    9. +----------------------------------------+------------+---------------+-------------------+----------------------+
    10. | My Wine Cellar | Ahwatukee | 98 | 124 | 12 |
    11. | McDonald's | Ahwatukee | 7 | 124 | 12 |
    12. | U-Haul | Ahwatukee | 5 | 124 | 12 |
    13. | Hi-Health | Ahwatukee | 4 | 124 | 12 |
    14. | Healthy and Clean Living Environments | Ahwatukee | 4 | 124 | 12 |
    15. | Active Kids Pediatrics | Ahwatukee | 4 | 124 | 12 |
    16. | Cupz N' Crepes | Ahwatukee | 124 | 124 | 12 |
    17. | Kathy's Alterations | Ahwatukee | 12 | 124 | 12 |
    18. | Q to U BBQ | Anthem | 74 | 117 | 117 |
    19. | Dara Thai | Anthem | 56 | 117 | 117 |
    20. | Cafe Provence | Anthem | 53 | 117 | 117 |
    21. | Shanghai Club | Anthem | 50 | 117 | 117 |
    22. | Two Brothers Kitchen | Anthem | 43 | 117 | 117 |
    23. | The Tennessee Grill | Anthem | 32 | 117 | 117 |
    24. | Dollyrockers Boutique and Salon | Anthem | 30 | 117 | 117 |
    25. +----------------------------------------+------------+---------------+-------------------+----------------------+
    26. 15 rows selected (0.516 seconds)
  3. 比较前后的业务评论数量。

    1. SELECT city, review_count, name,
    2. LAG(review_count, 1) OVER(PARTITION BY city ORDER BY review_count DESC)
    3. AS preceding_count,
    4. LEAD(review_count, 1) OVER(PARTITION BY city ORDER BY review_count DESC)
    5. AS following_count
    6. FROM `business.json` limit 15;
    7. +------------+---------------+----------------------------------------+------------------+------------------+
    8. | city | review_count | name | preceding_count | following_count |
    9. +------------+---------------+----------------------------------------+------------------+------------------+
    10. | Ahwatukee | 124 | Cupz N' Crepes | null | 98 |
    11. | Ahwatukee | 98 | My Wine Cellar | 124 | 12 |
    12. | Ahwatukee | 12 | Kathy's Alterations | 98 | 7 |
    13. | Ahwatukee | 7 | McDonald's | 12 | 5 |
    14. | Ahwatukee | 5 | U-Haul | 7 | 4 |
    15. | Ahwatukee | 4 | Hi-Health | 5 | 4 |
    16. | Ahwatukee | 4 | Healthy and Clean Living Environments | 4 | 4 |
    17. | Ahwatukee | 4 | Active Kids Pediatrics | 4 | null |
    18. | Anthem | 117 | Roberto's Authentic Mexican Food | null | 74 |
    19. | Anthem | 74 | Q to U BBQ | 117 | 64 |
    20. | Anthem | 64 | Outlets At Anthem | 74 | 56 |
    21. | Anthem | 56 | Dara Thai | 64 | 53 |
    22. | Anthem | 53 | Cafe Provence | 56 | 50 |
    23. | Anthem | 50 | Shanghai Club | 53 | 43 |
    24. | Anthem | 43 | Two Brothers Kitchen | 50 | 32 |
    25. +------------+---------------+----------------------------------------+------------------+------------------+
    26. 15 rows selected (0.518 seconds)