Integration with Hive UDFs/UDAFs/UDTFs

Description

Spark SQL supports integration of Hive UDFs, UDAFs and UDTFs. Similar to Spark UDFs and UDAFs, Hive UDFs work on a single row as input and generate a single row as output, while Hive UDAFs operate on multiple rows and return a single aggregated row as a result. In addition, Hive also supports UDTFs (User Defined Tabular Functions) that act on one row as input and return multiple rows as output. To use Hive UDFs/UDAFs/UTFs, the user should register them in Spark, and then use them in Spark SQL queries.

Examples

Hive has two UDF interfaces: UDF and GenericUDF. An example below uses GenericUDFAbs derived from GenericUDF.

  1. -- Register `GenericUDFAbs` and use it in Spark SQL.
  2. -- Note that, if you use your own programmed one, you need to add a JAR containing it
  3. -- into a classpath,
  4. -- e.g., ADD JAR yourHiveUDF.jar;
  5. CREATE TEMPORARY FUNCTION testUDF AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs';
  6. SELECT * FROM t;
  7. +-----+
  8. |value|
  9. +-----+
  10. | -1.0|
  11. | 2.0|
  12. | -3.0|
  13. +-----+
  14. SELECT testUDF(value) FROM t;
  15. +--------------+
  16. |testUDF(value)|
  17. +--------------+
  18. | 1.0|
  19. | 2.0|
  20. | 3.0|
  21. +--------------+
  22. -- Register `UDFSubstr` and use it in Spark SQL.
  23. -- Note that, it can achieve better performance if the return types and method parameters use Java primitives.
  24. -- e.g., UDFSubstr. The data processing method is UTF8String <-> Text <-> String. we can avoid UTF8String <-> Text.
  25. CREATE TEMPORARY FUNCTION hive_substr AS 'org.apache.hadoop.hive.ql.udf.UDFSubstr';
  26. select hive_substr('Spark SQL', 1, 5) as value;
  27. +-----+
  28. |value|
  29. +-----+
  30. |Spark|
  31. +-----+

An example below uses GenericUDTFExplode derived from GenericUDTF.

  1. -- Register `GenericUDTFExplode` and use it in Spark SQL
  2. CREATE TEMPORARY FUNCTION hiveUDTF
  3. AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDTFExplode';
  4. SELECT * FROM t;
  5. +------+
  6. | value|
  7. +------+
  8. |[1, 2]|
  9. |[3, 4]|
  10. +------+
  11. SELECT hiveUDTF(value) FROM t;
  12. +---+
  13. |col|
  14. +---+
  15. | 1|
  16. | 2|
  17. | 3|
  18. | 4|
  19. +---+

Hive has two UDAF interfaces: UDAF and GenericUDAFResolver. An example below uses GenericUDAFSum derived from GenericUDAFResolver.

  1. -- Register `GenericUDAFSum` and use it in Spark SQL
  2. CREATE TEMPORARY FUNCTION hiveUDAF
  3. AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum';
  4. SELECT * FROM t;
  5. +---+-----+
  6. |key|value|
  7. +---+-----+
  8. | a| 1|
  9. | a| 2|
  10. | b| 3|
  11. +---+-----+
  12. SELECT key, hiveUDAF(value) FROM t GROUP BY key;
  13. +---+---------------+
  14. |key|hiveUDAF(value)|
  15. +---+---------------+
  16. | b| 3|
  17. | a| 3|
  18. +---+---------------+