A Comprehensive Hive Case Study

1. Preparing the Data Source

First, download the experiment data from http://grouplens.org/ (the MovieLens ratings dataset, file u.data), then load it into Hive for the exercises below.


2. Internal (Managed) Tables

1 Create an internal table and load data

    hadoop@hadoopmaster:~$ beeline -u jdbc:hive2://hadoopmaster:10000/
    Beeline version 2.1.0 by Apache Hive
    0: jdbc:hive2://hadoopmaster:10000/> show databases;
    OK
    +----------------+--+
    | database_name |
    +----------------+--+
    | default |
    | fincials |
    +----------------+--+
    2 rows selected (1.038 seconds)
    0: jdbc:hive2://hadoopmaster:10000/> use default;
    OK
    No rows affected (0.034 seconds)
    0: jdbc:hive2://hadoopmaster:10000/> create table u_data (userid INT, movieid INT, rating INT, unixtime STRING) row format delimited fields terminated by '\t' lines terminated by '\n';
    OK
    No rows affected (0.242 seconds)
    0: jdbc:hive2://hadoopmaster:10000/> LOAD DATA LOCAL INPATH '/home/hadoop/u.data' OVERWRITE INTO TABLE u_data;
    Loading data to table default.u_data
    OK
    No rows affected (0.351 seconds)
    0: jdbc:hive2://hadoopmaster:10000/> select * from u_data;
    OK
    +----------------+-----------------+----------------+------------------+--+
    | u_data.userid | u_data.movieid | u_data.rating | u_data.unixtime |
    +----------------+-----------------+----------------+------------------+--+
    | 196 | 242 | 3 | 881250949 |
    | 186 | 302 | 3 | 891717742 |
    | 22 | 377 | 1 | 878887116 |
    | 244 | 51 | 2 | 880606923 |
    | 166 | 346 | 1 | 886397596 |
    | 298 | 474 | 4 | 884182806 |
    | 115 | 265 | 2 | 881171488 |
    | 253 | 465 | 5 | 891628467 |
    | 305 | 451 | 3 | 886324817 |
    | 6 | 86 | 3 | 883603013 |
    | 62 | 257 | 2 | 879372434 |
    | 286 | 1014 | 5 | 879781125 |
    ...

2 Check the HDFS space used

    hadoop@hadoopmaster:~$ hdfs dfs -ls /user/hive/warehouse/u_data
    Found 1 items
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:19 /user/hive/warehouse/u_data/u.data

3 Load the data repeatedly with a script

First, check how many rows the table currently holds:

    0: jdbc:hive2://hadoopmaster:10000/> select count(*) from u_data;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
    Query ID = hadoop_20160722102853_77aa1bc6-79c2-4916-9b07-a763d112ef41
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1468978056881_0003, Tracking URL = http://hadoopmaster:8088/proxy/application_1468978056881_0003/
    Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1468978056881_0003
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2016-07-22 10:28:58,786 Stage-1 map = 0%, reduce = 0%
    2016-07-22 10:29:03,890 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.89 sec
    2016-07-22 10:29:10,005 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.71 sec
    MapReduce Total cumulative CPU time: 1 seconds 710 msec
    Ended Job = job_1468978056881_0003
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 1.71 sec HDFS Read: 1987050 HDFS Write: 106 SUCCESS
    Total MapReduce CPU Time Spent: 1 seconds 710 msec
    OK
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
    +---------+--+
    | c0 |
    +---------+--+
    | 100000 |
    +---------+--+
    1 row selected (17.757 seconds)

Hive on the MapReduce engine is genuinely slow: counting 100,000 rows took almost 18 seconds, well behind a typical relational database. An engine such as Spark is a better fit for this kind of query.

Next, note that Hive can run a one-off statement from the shell with `hive -e`:

    hadoop@hadoopmaster:~$ hive -e "LOAD DATA LOCAL INPATH '/home/hadoop/u.data' INTO TABLE u_data;"
    Loading data to table default.u_data
    OK
    Time taken: 1.239 seconds

Finally, write the script:

    #!/bin/bash
    # Load the same file into u_data repeatedly (10 iterations per run).
    for (( c=1; c<=10; c++ ))
    do
        echo "Loading batch $c ..."
        hive -e "LOAD DATA LOCAL INPATH '/home/hadoop/u.data' INTO TABLE u_data;"
    done
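Each `hive -e` invocation pays the JVM and session startup cost again. If the goal is simply to multiply the test data, a single-session alternative is to let Hive append the table to itself; each execution doubles the row count. This is a sketch, not part of the original walkthrough:

```sql
-- Appends a copy of the table's current contents to itself,
-- doubling the row count on every execution.
INSERT INTO TABLE u_data SELECT * FROM u_data;
```

Run it n times inside one beeline session to get 2^n copies, at the cost of one MapReduce job per doubling instead of one `hive` process per copy.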

When the loads finish, check the query cost again:

    0: jdbc:hive2://hadoopmaster:10000/> select count(*) from u_data;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
    Query ID = hadoop_20160722104633_18c3467d-9263-4785-8714-1570fc3bb9ae
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1468978056881_0009, Tracking URL = http://hadoopmaster:8088/proxy/application_1468978056881_0009/
    Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1468978056881_0009
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2016-07-22 10:46:39,037 Stage-1 map = 0%, reduce = 0%
    2016-07-22 10:46:46,190 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.82 sec
    2016-07-22 10:46:52,310 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.67 sec
    MapReduce Total cumulative CPU time: 2 seconds 670 msec
    Ended Job = job_1468978056881_0009
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.67 sec HDFS Read: 77198770 HDFS Write: 107 SUCCESS
    Total MapReduce CPU Time Spent: 2 seconds 670 msec
    OK
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
    +----------+--+
    | c0 |
    +----------+--+
    | 3900000 |
    +----------+--+
    1 row selected (20.173 seconds)

About 20 seconds this time; the fixed startup cost of a MapReduce job is clearly significant. The warehouse directory now holds one copy of the file per load:

    hadoop@hadoopmaster:~$ hdfs dfs -ls /user/hive/warehouse/u_data
    Found 39 items
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:37 /user/hive/warehouse/u_data/u.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:38 /user/hive/warehouse/u_data/u_copy_1.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:40 /user/hive/warehouse/u_data/u_copy_10.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:40 /user/hive/warehouse/u_data/u_copy_11.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:41 /user/hive/warehouse/u_data/u_copy_12.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:42 /user/hive/warehouse/u_data/u_copy_13.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:42 /user/hive/warehouse/u_data/u_copy_14.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:42 /user/hive/warehouse/u_data/u_copy_15.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:42 /user/hive/warehouse/u_data/u_copy_16.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_17.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_18.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_19.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_2.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_20.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_21.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_22.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_23.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_24.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_25.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_26.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_27.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_28.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_29.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_3.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_30.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_31.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_32.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_33.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_34.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_35.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_36.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:46 /user/hive/warehouse/u_data/u_copy_37.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:46 /user/hive/warehouse/u_data/u_copy_38.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_4.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_5.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_6.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_7.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_8.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:40 /user/hive/warehouse/u_data/u_copy_9.data

3. External Tables

1 Create an external table

    0: jdbc:hive2://hadoopmaster:10000/> create external table u_data_external_table (userid INT, movieid INT, rating INT, unixtime STRING) row format delimited fields terminated by '\t' lines terminated by '\n';
    OK
    No rows affected (0.047 seconds)
    0: jdbc:hive2://hadoopmaster:10000/> show tables;
    OK
    +------------------------+--+
    | tab_name |
    +------------------------+--+
    | employees |
    | t_hive |
    | t_hive2 |
    | u_data |
    | u_data_external_table |
    +------------------------+--+
    5 rows selected (0.036 seconds)

2 Load the data

    hive -e "LOAD DATA LOCAL INPATH '/home/hadoop/u.data' INTO TABLE u_data_external_table;"

3 Differences between internal and external tables

After dropping both of the tables created above with DROP TABLE (one internal, one external), the warehouse looks like this:

    hadoop@hadoopmaster:~$ hdfs dfs -ls /user/hive/warehouse/
    Found 5 items
    drwxrwxr-x - hadoop supergroup 0 2016-07-20 17:25 /user/hive/warehouse/employees
    drwxrwxr-x - hadoop supergroup 0 2016-07-21 15:52 /user/hive/warehouse/fincials.db
    drwxrwxr-x - hadoop supergroup 0 2016-07-20 09:50 /user/hive/warehouse/t_hive
    drwxrwxr-x - hadoop supergroup 0 2016-07-20 09:54 /user/hive/warehouse/t_hive2
    drwxrwxr-x - hadoop supergroup 0 2016-07-22 11:04 /user/hive/warehouse/u_data_external_table

The internal table's data is gone entirely, while the external table's data is still there.

To summarize the differences between internal (managed) and external tables in Hive:

  • An external table does not take ownership of its data: typically it points at files in a location you specify, and Hive does not treat those files as under its own management, whereas an internal table's data lives in, and is managed through, Hive's warehouse directory.
  • Dropping an internal table deletes both the metadata and the data; dropping an external table deletes only the metadata and leaves the data files untouched.

Which should you use? In most cases there is little practical difference, so it largely comes down to preference. As a rule of thumb: if all processing will be done by Hive, use internal tables; if other tools also need access to the data, use external tables.
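The ownership rule is easiest to see when an external table is given an explicit LOCATION. A minimal sketch in HiveQL (the HDFS path /data/movielens is a hypothetical example, not taken from this walkthrough):

```sql
-- Point the table at pre-existing files instead of the warehouse directory.
CREATE EXTERNAL TABLE u_data_ext (
  userid   INT,
  movieid  INT,
  rating   INT,
  unixtime STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION '/data/movielens';

-- Removes only the metadata; the files under /data/movielens remain.
DROP TABLE u_data_ext;
```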

4. Partitioned Tables

    0: jdbc:hive2://hadoopmaster:10000/> create table u_data_partitioned_table (userid INT, movieid INT, rating INT, unixtime STRING) partitioned by(day int) row format delimited fields terminated by '\t' lines terminated by '\n';
    OK
    No rows affected (0.256 seconds)
    0: jdbc:hive2://hadoopmaster:10000/> LOAD DATA LOCAL INPATH '/home/hadoop/u.data' INTO TABLE u_data_partitioned_table partition(day=20160101);
    Loading data to table default.u_data_partitioned_table partition (day=20160101)
    OK
    No rows affected (0.424 seconds)
    0: jdbc:hive2://hadoopmaster:10000/> LOAD DATA LOCAL INPATH '/home/hadoop/u.data' INTO TABLE u_data_partitioned_table partition(day=20160101);
    Loading data to table default.u_data_partitioned_table partition (day=20160101)
    OK
    No rows affected (0.424 seconds)
    0: jdbc:hive2://hadoopmaster:10000/> LOAD DATA LOCAL INPATH '/home/hadoop/u.data' INTO TABLE u_data_partitioned_table partition(day=20160102);
    Loading data to table default.u_data_partitioned_table partition (day=20160102)
    OK
    No rows affected (0.499 seconds)

Each partition is stored as its own subdirectory under the table's directory:

    hadoop@hadoopmaster:~$ hdfs dfs -ls /user/hive/warehouse/u_data_partitioned_table
    Found 2 items
    drwxrwxr-x - hadoop supergroup 0 2016-07-22 13:51 /user/hive/warehouse/u_data_partitioned_table/day=20160101
    drwxrwxr-x - hadoop supergroup 0 2016-07-22 13:51 /user/hive/warehouse/u_data_partitioned_table/day=20160102
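The payoff of partitioning is that a predicate on the partition column lets Hive scan only the matching subdirectories (partition pruning). A sketch of the usual operations in standard HiveQL (these statements are illustrative, not taken from the session above):

```sql
-- List the partitions Hive knows about.
SHOW PARTITIONS u_data_partitioned_table;

-- Only the day=20160102 subdirectory is scanned, not the whole table.
SELECT count(*) FROM u_data_partitioned_table WHERE day = 20160102;

-- Drop a single partition (for an internal table, its data is deleted too).
ALTER TABLE u_data_partitioned_table DROP PARTITION (day=20160101);
```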

5. Bucketed Tables

    0: jdbc:hive2://hadoopmaster:10000/> CREATE TABLE bucketed_data_user (userid INT, movieid INT, rating INT, unixtime STRING) CLUSTERED BY (userid) INTO 4 BUCKETS row format delimited fields terminated by '\t' lines terminated by '\n';
    OK
    No rows affected (0.045 seconds)
    0: jdbc:hive2://hadoopmaster:10000/> insert overwrite table bucketed_data_user select userid,movieid,rating,unixtime from u_data_partitioned_table;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
    Query ID = hadoop_20160722140142_c272bc07-b74d-4b5b-9689-0bec2ce71780
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 4
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1468978056881_0010, Tracking URL = http://hadoopmaster:8088/proxy/application_1468978056881_0010/
    Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1468978056881_0010
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
    2016-07-22 14:01:48,774 Stage-1 map = 0%, reduce = 0%
    2016-07-22 14:01:55,978 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.89 sec
    2016-07-22 14:02:06,236 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 5.66 sec
    2016-07-22 14:02:07,272 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 9.43 sec
    MapReduce Total cumulative CPU time: 9 seconds 430 msec
    Ended Job = job_1468978056881_0010
    Loading data to table default.bucketed_data_user
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 1 Reduce: 4 Cumulative CPU: 9.43 sec HDFS Read: 5959693 HDFS Write: 5937879 SUCCESS
    Total MapReduce CPU Time Spent: 9 seconds 430 msec
    OK
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
    No rows affected (26.251 seconds)
    0: jdbc:hive2://hadoopmaster:10000/> select count(*) from bucketed_data_user ;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
    Query ID = hadoop_20160722141056_eaf582be-4107-403a-bacd-0a18f567f576
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1468978056881_0012, Tracking URL = http://hadoopmaster:8088/proxy/application_1468978056881_0012/
    Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1468978056881_0012
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2016-07-22 14:11:04,156 Stage-1 map = 0%, reduce = 0%
    2016-07-22 14:11:09,331 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.94 sec
    2016-07-22 14:11:15,488 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.78 sec
    MapReduce Total cumulative CPU time: 1 seconds 780 msec
    Ended Job = job_1468978056881_0012
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 1.78 sec HDFS Read: 5945855 HDFS Write: 106 SUCCESS
    Total MapReduce CPU Time Spent: 1 seconds 780 msec
    OK
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
    +---------+--+
    | c0 |
    +---------+--+
    | 300000 |
    +---------+--+
    1 row selected (20.397 seconds)

Each bucket is written out as a separate file:

    hadoop@hadoopmaster:~$ hdfs dfs -ls /user/hive/warehouse/bucketed_data_user
    Found 4 items
    -rwxrwxr-x 2 hadoop supergroup 1400994 2016-07-22 14:02 /user/hive/warehouse/bucketed_data_user/000000_0
    -rwxrwxr-x 2 hadoop supergroup 1493856 2016-07-22 14:02 /user/hive/warehouse/bucketed_data_user/000001_0
    -rwxrwxr-x 2 hadoop supergroup 1566738 2016-07-22 14:02 /user/hive/warehouse/bucketed_data_user/000002_0
    -rwxrwxr-x 2 hadoop supergroup 1475931 2016-07-22 14:02 /user/hive/warehouse/bucketed_data_user/000003_0
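Bucketing pays off mainly in sampling and in bucket map joins. For example, TABLESAMPLE can read a single bucket file instead of scanning the whole table. This is standard HiveQL, not from the session above:

```sql
-- Reads only bucket 1 of 4 (roughly a quarter of the rows)
-- instead of performing a full table scan.
SELECT count(*)
FROM bucketed_data_user TABLESAMPLE (BUCKET 1 OUT OF 4 ON userid);
```

Because the table was already clustered by userid into 4 buckets, the sample maps directly onto one of the 000000_0 ... 000003_0 files listed above.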