PXF External Tables and API

You can use the PXF API to create your own connectors to access any other type of parallel data store or processing engine.

The PXF Java API lets you extend PXF functionality and add new services and formats without changing HAWQ. The API includes three classes that are extended to allow HAWQ to access an external data source: Fragmenter, Accessor, and Resolver.

The Fragmenter produces a list of data fragments that can be read in parallel from the data source. The Accessor produces a list of records from a single fragment, and the Resolver both deserializes and serializes records.

Together, the Fragmenter, Accessor, and Resolver classes implement a connector. PXF includes plug-ins for HDFS and JSON files and tables in HBase and Hive.

Creating an External Table

The syntax for an EXTERNAL TABLE that uses the PXF protocol is as follows:

    CREATE [READABLE|WRITABLE] EXTERNAL TABLE <table_name>
        ( <column_name> <data_type> [, ...] | LIKE <other_table> )
    LOCATION('pxf://<host>[:<port>]/<path-to-data>?<pxf-parameters>[&<custom-option>=<value>[...]]')
    FORMAT 'custom' (formatter='pxfwritable_import|pxfwritable_export');

where <pxf-parameters> is:

    [FRAGMENTER=<fragmenter_class>&ACCESSOR=<accessor_class>
     &RESOLVER=<resolver_class>] | PROFILE=<profile-name>

Note: Not every PXF profile supports writable external tables. Refer to Writing Data to HDFS for a detailed discussion of the HDFS plug-in profiles that support this feature.

Table 1. Parameter values and descriptions

  <host>
      The PXF host. While <host> may identify any PXF agent node, use the HDFS
      NameNode as it is guaranteed to be available in a running HDFS cluster. If
      HDFS High Availability is enabled, <host> must identify the HDFS NameService.
  <port>
      The PXF port. If <port> is omitted, PXF assumes <host> identifies a High
      Availability HDFS Nameservice and connects to the port number designated by
      the pxf_service_port server configuration parameter value. The default port
      is 51200.
  <path-to-data>
      A directory, file name, wildcard pattern, table name, etc.
  PROFILE
      The profile PXF uses to access the data. PXF supports multiple plug-ins that
      currently expose profiles named HBase, Hive, HiveRC, HiveText, HiveORC,
      HiveVectorizedORC, HdfsTextSimple, HdfsTextMulti, Avro, SequenceWritable,
      and Json.
  FRAGMENTER
      The Java class the plug-in uses for fragmenting data. Used for READABLE
      external tables only.
  ACCESSOR
      The Java class the plug-in uses for accessing the data. Used for READABLE
      and WRITABLE tables.
  RESOLVER
      The Java class the plug-in uses for serializing and deserializing the data.
      Used for READABLE and WRITABLE tables.
  <custom-option>
      Additional values to pass to the plug-in at runtime. A plug-in can parse
      custom options with the PXF helper class
      org.apache.hawq.pxf.api.utilities.InputData.

Note: When creating PXF external tables, you cannot use the HEADER option in your FORMAT specification.

About the Java Class Services and Formats

The LOCATION string in a PXF CREATE EXTERNAL TABLE statement is a URI that specifies the host and port of an external data source and the path to the data in the external data source. The query portion of the URI, introduced by the question mark (?), must include the PXF profile name or the plug-in’s FRAGMENTER (readable tables only), ACCESSOR, and RESOLVER class names.

PXF profiles are defined in the /etc/pxf/conf/pxf-profiles.xml file. Profile definitions include plug-in class names. For example, the HdfsTextSimple profile definition is:

    <profile>
        <name>HdfsTextSimple</name>
        <description>This profile is suitable for use when reading delimited
            single line records from plain text files on HDFS.
        </description>
        <plugins>
            <fragmenter>org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter>
            <accessor>org.apache.hawq.pxf.plugins.hdfs.LineBreakAccessor</accessor>
            <resolver>org.apache.hawq.pxf.plugins.hdfs.StringPassResolver</resolver>
        </plugins>
    </profile>

The parameters in the PXF URI are passed from HAWQ as headers to the PXF Java service. You can pass custom information to user-implemented PXF plug-ins by adding optional parameters to the LOCATION string.

The Java PXF service retrieves the source data from the external data source and converts it to a HAWQ-readable table format.

The Accessor, Resolver, and Fragmenter Java classes extend the org.apache.hawq.pxf.api.utilities.Plugin class:

    package org.apache.hawq.pxf.api.utilities;

    /**
     * Base class for all plug-in types (Accessor, Resolver, Fragmenter, ...).
     * Manages the meta data.
     */
    public class Plugin {
        protected InputData inputData;

        /**
         * Constructs a plug-in.
         *
         * @param input the input data
         */
        public Plugin(InputData input) {
            this.inputData = input;
        }

        /**
         * Checks if the plug-in is thread safe or not, based on inputData.
         *
         * @return true if plug-in is thread safe
         */
        public boolean isThreadSafe() {
            return true;
        }
    }

The parameters in the LOCATION string are available to the plug-ins through methods in the org.apache.hawq.pxf.api.utilities.InputData class. Plug-ins can look up the custom parameters added to the location string with the getUserProperty() method.

    /**
     * Common configuration available to all PXF plug-ins. Represents input data
     * coming from client applications, such as HAWQ.
     */
    public class InputData {

        /**
         * Constructs an InputData from a copy.
         * Used to create from an extending class.
         *
         * @param copy the input data to copy
         */
        public InputData(InputData copy);

        /**
         * Returns the value of a user-defined property.
         *
         * @param userProp the lookup user property
         * @return property value as a String
         */
        public String getUserProperty(String userProp);

        /**
         * Sets the byte serialization of a fragment's metadata.
         *
         * @param location start, len, and location of the fragment
         */
        public void setFragmentMetadata(byte[] location);

        /** Returns the byte serialization of a data fragment. */
        public byte[] getFragmentMetadata();

        /**
         * Gets any custom user data that may have been passed from the
         * fragmenter. Will mostly be used by the accessor or resolver.
         */
        public byte[] getFragmentUserData();

        /**
         * Sets any custom user data that needs to be shared across plug-ins.
         * Will mostly be set by the fragmenter.
         */
        public void setFragmentUserData(byte[] userData);

        /** Returns the number of segments. */
        public int getTotalSegments();

        /** Returns the current segment ID. */
        public int getSegmentId();

        /** Returns true if there is a filter string to parse. */
        public boolean hasFilter();

        /** Returns the filter string, <tt>null</tt> if #hasFilter is <tt>false</tt>. */
        public String getFilterString();

        /** Returns the tuple description. */
        public ArrayList<ColumnDescriptor> getTupleDescription();

        /** Returns the number of columns in the tuple description. */
        public int getColumns();

        /** Returns the column at the given index of the tuple description. */
        public ColumnDescriptor getColumn(int index);

        /**
         * Returns the column descriptor of the recordkey column. If the recordkey
         * column was not specified by the user in the CREATE TABLE statement,
         * returns null.
         */
        public ColumnDescriptor getRecordkeyColumn();

        /** Returns the data source of the required resource (i.e., a file path or a table name). */
        public String getDataSource();

        /** Sets the data source for the required resource. */
        public void setDataSource(String dataSource);

        /** Returns the ClassName for the Java class that was defined as Accessor. */
        public String getAccessor();

        /** Returns the ClassName for the Java class that was defined as Resolver. */
        public String getResolver();

        /**
         * Returns the ClassName for the Java class that was defined as Fragmenter,
         * or null if no fragmenter was defined.
         */
        public String getFragmenter();

        /**
         * Returns the contents of pxf_remote_service_login set in HAWQ.
         * Should the user set it to an empty string, this function returns null.
         *
         * @return remote login details if set, null otherwise
         */
        public String getLogin();

        /**
         * Returns the contents of pxf_remote_service_secret set in HAWQ.
         * Should the user set it to an empty string, this function returns null.
         *
         * @return remote password if set, null otherwise
         */
        public String getSecret();

        /**
         * Returns true if the request is thread safe. Default true. Should be set
         * by a user to false if the request contains non-thread-safe plug-ins or
         * components, such as the BZip2 codec.
         */
        public boolean isThreadSafe();

        /**
         * Returns a data fragment index. Planned to be deprecated in favor of
         * getFragmentMetadata().
         */
        public int getDataFragment();
    }
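
A plug-in typically reads its custom LOCATION options in its constructor. The following sketch is illustrative only and is not part of the PXF API: the MyDemoPlugin class name and the MY_BATCH_SIZE option are hypothetical, and the code assumes getUserProperty() returns null when an option was not supplied.

    import org.apache.hawq.pxf.api.utilities.InputData;
    import org.apache.hawq.pxf.api.utilities.Plugin;

    // Hypothetical plug-in that reads an optional custom parameter passed in the
    // LOCATION string, for example ...&MY_BATCH_SIZE=500.
    public class MyDemoPlugin extends Plugin {
        private final int batchSize;

        public MyDemoPlugin(InputData input) {
            super(input);
            // assumed to return null if the option was not supplied
            String batch = input.getUserProperty("MY_BATCH_SIZE");
            batchSize = (batch == null) ? 100 : Integer.parseInt(batch);
        }
    }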

Fragmenter

Note: You use the Fragmenter class to read data into HAWQ. You cannot use this class to write data out of HAWQ.

The Fragmenter is responsible for passing data source metadata back to HAWQ. It also returns a list of data fragments to the Accessor or Resolver. Each data fragment describes some part of the requested data set and contains the data source name, such as the file or table name, and the hostname where it is located. For example, if the source is an HDFS file, the Fragmenter returns a list of data fragments, one per HDFS file block, and each fragment includes the location of its block. If the source data is an HBase table, the Fragmenter returns information about table regions, including their locations.

The ANALYZE command retrieves advanced statistics for PXF readable tables by estimating the number of tuples in a table, creating a sample table from the external table, and running advanced statistics queries on the sample table, in the same way statistics are collected for native HAWQ tables.

The configuration parameter pxf_enable_stat_collection controls the collection of advanced statistics. If pxf_enable_stat_collection is set to false, no analysis is performed on PXF tables. An additional parameter, pxf_stat_max_fragments, controls the number of fragments sampled to build the sample table. By default, pxf_stat_max_fragments is set to 100, so even if a table has more than 100 fragments, only that many fragments are sampled by ANALYZE. Increasing this number results in better sampling, but can also impact performance.

When a PXF table is analyzed, any of the following conditions might result in a warning message with no statistics gathered for the table:

  • pxf_enable_stat_collection is set to off,
  • an error occurs because the table is not defined correctly,
  • the PXF service is down, or
  • getFragmentsStats() is not implemented

If ANALYZE is running over all tables in the database, a failure while processing one PXF table does not stop the command; ANALYZE moves on to the next table.

For a detailed explanation about HAWQ statistical data gathering, refer to the ANALYZE SQL command reference.

Note:

  • Depending on the external table size, the time required to complete an ANALYZE operation can be lengthy. The boolean parameter pxf_enable_stat_collection enables statistics collection for PXF. The default value is on. Turning this parameter off (disabling PXF statistics collection) can help decrease the time needed for the ANALYZE operation.
  • You can also use pxf_stat_max_fragments to limit the number of fragments to be sampled by decreasing it from the default (100). However, if the number is too low, the sample might not be uniform and the statistics might be skewed.
  • You can also implement getFragmentsStats() to throw an error. This causes ANALYZE on a table with this Fragmenter to fail immediately, and default statistics values are used for that table (see the sketch after this note).
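
A minimal sketch of that last option, assuming a custom Fragmenter (the MyNoStatsFragmenter class name is hypothetical) that opts out of fragment statistics so that ANALYZE falls back to default statistics values:

    import java.util.List;

    import org.apache.hawq.pxf.api.Fragment;
    import org.apache.hawq.pxf.api.Fragmenter;
    import org.apache.hawq.pxf.api.FragmentsStats;
    import org.apache.hawq.pxf.api.utilities.InputData;

    // Illustrative only: a Fragmenter that refuses to provide fragment statistics.
    public class MyNoStatsFragmenter extends Fragmenter {
        public MyNoStatsFragmenter(InputData metaData) {
            super(metaData);
        }

        @Override
        public List<Fragment> getFragments() throws Exception {
            // real fragment discovery would populate the inherited 'fragments' list here
            return fragments;
        }

        @Override
        public FragmentsStats getFragmentsStats() throws Exception {
            // throwing here makes ANALYZE fail fast for this table and use default statistics
            throw new UnsupportedOperationException("ANALYZE is not supported for this data source");
        }
    }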

The following table lists the Fragmenter plug-in implementations included with the PXF API.

Table 2. Fragmenter base classes

  org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter
      Fragmenter for HDFS and JSON files
  org.apache.hawq.pxf.plugins.hbase.HBaseDataFragmenter
      Fragmenter for HBase tables
  org.apache.hawq.pxf.plugins.hive.HiveDataFragmenter
      Fragmenter for Hive tables
  org.apache.hawq.pxf.plugins.hive.HiveInputFormatFragmenter
      Fragmenter for Hive tables with RC, ORC, or text file formats

A Fragmenter class extends org.apache.hawq.pxf.api.Fragmenter:

org.apache.hawq.pxf.api.Fragmenter

    package org.apache.hawq.pxf.api;

    /**
     * Abstract class that defines the splitting of a data resource into fragments
     * that can be processed in parallel.
     */
    public abstract class Fragmenter extends Plugin {
        protected List<Fragment> fragments;

        public Fragmenter(InputData metaData) {
            super(metaData);
            fragments = new LinkedList<Fragment>();
        }

        /**
         * Gets the fragments of a given path (source name and location of each
         * fragment). Used to get fragments of data that could be read in parallel
         * from the different segments.
         */
        public abstract List<Fragment> getFragments() throws Exception;

        /**
         * Default implementation of statistics for fragments. The default is:
         * <ul>
         * <li>number of fragments - as gathered by {@link #getFragments()}</li>
         * <li>first fragment size - 64MB</li>
         * <li>total size - number of fragments times first fragment size</li>
         * </ul>
         * Each fragmenter implementation can override this method to better match
         * its fragments stats.
         *
         * @return default statistics
         * @throws Exception if statistics cannot be gathered
         */
        public FragmentsStats getFragmentsStats() throws Exception {
            List<Fragment> fragments = getFragments();
            long fragmentsNumber = fragments.size();
            return new FragmentsStats(fragmentsNumber,
                    FragmentsStats.DEFAULT_FRAGMENT_SIZE,
                    fragmentsNumber * FragmentsStats.DEFAULT_FRAGMENT_SIZE);
        }
    }

The fragments returned by getFragments() are serialized to JSON format and passed back to HAWQ. For example, if the input path is an HDFS directory, the source name for each fragment should include the full path of the file to which the fragment belongs.

Class Description

The Fragmenter.getFragments() method returns a List<Fragment>:

    package org.apache.hawq.pxf.api;

    /*
     * Fragment holds a data fragment's information.
     * Fragmenter.getFragments() returns a list of fragments.
     */
    public class Fragment
    {
        private String sourceName;  // File path+name, table name, etc.
        private int index;          // Fragment index (incremented per sourceName)
        private String[] replicas;  // Fragment replicas (1 or more)
        private byte[] metadata;    // Fragment metadata information (starting point + length, region location, etc.)
        private byte[] userData;    // Third-party data added to a fragment. Ignored if null
        ...
    }

org.apache.hawq.pxf.api.FragmentsStats

The Fragmenter.getFragmentsStats() method returns a FragmentsStats:

    package org.apache.hawq.pxf.api;

    /**
     * FragmentsStats holds statistics for a given path.
     */
    public class FragmentsStats {
        // number of fragments
        private long fragmentsNumber;
        // first fragment size
        private SizeAndUnit firstFragmentSize;
        // total fragments size
        private SizeAndUnit totalSize;

        /**
         * Enum to represent unit (Bytes/KB/MB/GB/TB)
         */
        public enum SizeUnit {
            /** Byte */
            B,
            /** KB */
            KB,
            /** MB */
            MB,
            /** GB */
            GB,
            /** TB */
            TB;
        };

        /**
         * Container for size and unit
         */
        public class SizeAndUnit {
            long size;
            SizeUnit unit;
            ...

The statistics returned by getFragmentsStats() are serialized to JSON format and passed back to HAWQ. For example, if the input path is an HDFS directory containing three files of one block each, the statistics include the number of fragments (3), the size of the first file, and the total size of all files in that directory.
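
As an illustrative sketch (not taken from the PXF sources), a custom Fragmenter could override getFragmentsStats() to report sizes it actually knows, reusing the three-argument FragmentsStats constructor shown in the abstract class above; the MySizedFragmenter class name and the fileSizes field are hypothetical:

    import java.util.LinkedList;
    import java.util.List;

    import org.apache.hawq.pxf.api.Fragment;
    import org.apache.hawq.pxf.api.Fragmenter;
    import org.apache.hawq.pxf.api.FragmentsStats;
    import org.apache.hawq.pxf.api.utilities.InputData;

    // Illustrative only: report real per-fragment sizes instead of the 64MB default.
    public class MySizedFragmenter extends Fragmenter {
        // assume getFragments() also records each fragment's size in bytes
        private List<Long> fileSizes = new LinkedList<Long>();

        public MySizedFragmenter(InputData metaData) {
            super(metaData);
        }

        @Override
        public List<Fragment> getFragments() throws Exception {
            // real fragment discovery would populate 'fragments' and 'fileSizes' here
            return fragments;
        }

        @Override
        public FragmentsStats getFragmentsStats() throws Exception {
            getFragments();
            long count = fileSizes.size();
            long first = fileSizes.isEmpty() ? 0 : fileSizes.get(0);
            long total = 0;
            for (long size : fileSizes) {
                total += size;
            }
            // (number of fragments, first fragment size, total size)
            return new FragmentsStats(count, first, total);
        }
    }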

Accessor

The Accessor retrieves specific fragments and passes their records back to the Resolver. For example, the HDFS plug-ins create an org.apache.hadoop.mapred.FileInputFormat and an org.apache.hadoop.mapred.RecordReader for an HDFS file and send these to the Resolver. In the case of HBase or Hive tables, the Accessor returns single rows from an HBase or Hive table. PXF includes the following Accessor implementations:

Table 3. Accessor base classes

  org.apache.hawq.pxf.plugins.hdfs.HdfsAtomicDataAccessor
      Base class for accessing data sources that cannot be split. These are
      accessed by a single HAWQ segment
  org.apache.hawq.pxf.plugins.hdfs.QuotedLineBreakAccessor
      Accessor for TEXT files that have records with embedded linebreaks
  org.apache.hawq.pxf.plugins.hdfs.HdfsSplittableDataAccessor
      Base class for accessing HDFS files using RecordReaders
  org.apache.hawq.pxf.plugins.hdfs.LineBreakAccessor
      Accessor for TEXT files (replaces the deprecated TextFileAccessor and
      LineReaderAccessor)
  org.apache.hawq.pxf.plugins.hdfs.AvroFileAccessor
      Accessor for Avro files
  org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor
      Accessor for sequence files
  org.apache.hawq.pxf.plugins.hbase.HBaseAccessor
      Accessor for HBase tables
  org.apache.hawq.pxf.plugins.hive.HiveAccessor
      Accessor for Hive tables
  org.apache.hawq.pxf.plugins.hive.HiveLineBreakAccessor
      Accessor for Hive tables stored in text file format
  org.apache.hawq.pxf.plugins.hive.HiveRCFileAccessor
      Accessor for Hive tables stored in RC file format
  org.apache.hawq.pxf.plugins.hive.HiveORCAccessor
      Accessor for Hive tables stored in ORC file format
  org.apache.hawq.pxf.plugins.hive.HiveORCVectorizedAccessor
      Vectorized Accessor for Hive tables stored in ORC file format
  org.apache.hawq.pxf.plugins.json.JsonAccessor
      Accessor for JSON files

The class must extend the org.apache.hawq.pxf.api.utilities.Plugin class and implement one or both of the following interfaces:

  • org.apache.hawq.pxf.api.ReadAccessor
  • org.apache.hawq.pxf.api.WriteAccessor

    package org.apache.hawq.pxf.api;

    /*
     * Internal interface that defines the access to data on the source
     * data store (e.g., a file on HDFS, a region of an HBase table, etc.).
     * All classes that implement actual access to such data sources must
     * respect this interface.
     */
    public interface ReadAccessor {
        boolean openForRead() throws Exception;
        OneRow readNextObject() throws Exception;
        void closeForRead() throws Exception;
    }

    package org.apache.hawq.pxf.api;

    /*
     * An interface for writing data into a data store
     * (e.g., a sequence file on HDFS).
     * All classes that implement actual access to such data sources must
     * respect this interface.
     */
    public interface WriteAccessor {
        boolean openForWrite() throws Exception;
        boolean writeNextObject(OneRow onerow) throws Exception;
        void closeForWrite() throws Exception;
    }

The Accessor calls openForRead() to read existing data. After reading the data, it calls closeForRead(). readNextObject() returns one of the following:

  • a single record, encapsulated in a OneRow object
  • null if it reaches EOF

The Accessor calls openForWrite() to write data out, writes each record as a OneRow object with writeNextObject(), and calls closeForWrite() when done. OneRow represents a key-value item.
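
The following sketch is illustrative rather than part of the PXF framework; it shows the read lifecycle these methods imply (the ReadLoopSketch class name is hypothetical):

    import org.apache.hawq.pxf.api.OneRow;
    import org.apache.hawq.pxf.api.ReadAccessor;

    // Illustrative only: the loop a caller drives over a ReadAccessor for one fragment.
    public class ReadLoopSketch {
        static void readFragment(ReadAccessor accessor) throws Exception {
            if (!accessor.openForRead()) {
                return;                       // nothing to read for this fragment
            }
            try {
                OneRow row;
                while ((row = accessor.readNextObject()) != null) {
                    // a real caller would pass each row to the Resolver here
                }
            } finally {                       // null from readNextObject() signals EOF
                accessor.closeForRead();
            }
        }
    }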

org.apache.hawq.pxf.api.OneRow

    package org.apache.hawq.pxf.api;

    /*
     * Represents one row in the external system data store. Supports
     * the general case where one row contains both a record and a
     * separate key like in the HDFS key/value model for MapReduce
     * (Example: HDFS sequence file)
     */
    public class OneRow {
        /* Default constructor */
        public OneRow();

        /* Constructor sets key and data */
        public OneRow(Object inKey, Object inData);

        /* Setter for key */
        public void setKey(Object inKey);

        /* Setter for data */
        public void setData(Object inData);

        /* Accessor for key */
        public Object getKey();

        /* Accessor for data */
        public Object getData();

        /* Show content */
        public String toString();
    }

Resolver

The Resolver deserializes records in the OneRow format and serializes them to a list of OneField objects. PXF converts the OneField objects into a HAWQ-readable GPDBWritable format. PXF 1.x or higher contains the following implementations:

Table 4. Resolver base classes

  org.apache.hawq.pxf.plugins.hdfs.StringPassResolver
      StringPassResolver replaces the deprecated TextResolver. It passes whole
      records (composed of any data types) as strings without parsing them
  org.apache.hawq.pxf.plugins.hdfs.WritableResolver
      Resolver for custom Hadoop Writable implementations. The custom class can be
      specified with the schema in DATA-SCHEMA. Supports the following types:
      DataType.BOOLEAN, DataType.INTEGER, DataType.BIGINT, DataType.REAL,
      DataType.FLOAT8, DataType.VARCHAR, DataType.BYTEA
  org.apache.hawq.pxf.plugins.hdfs.AvroResolver
      Supports the same field objects as WritableResolver
  org.apache.hawq.pxf.plugins.hbase.HBaseResolver
      Supports the same field objects as WritableResolver and also supports:
      DataType.SMALLINT, DataType.NUMERIC, DataType.TEXT, DataType.BPCHAR,
      DataType.TIMESTAMP
  org.apache.hawq.pxf.plugins.hive.HiveResolver
      Supports the same field objects as WritableResolver and also supports:
      DataType.SMALLINT, DataType.TEXT, DataType.TIMESTAMP
  org.apache.hawq.pxf.plugins.hive.HiveStringPassResolver
      Specialized HiveResolver for a Hive table stored as a text file. Should be
      used together with HiveInputFormatFragmenter/HiveLineBreakAccessor.
  org.apache.hawq.pxf.plugins.hive.HiveColumnarSerdeResolver
      Specialized HiveResolver for a Hive table stored as an RC file. Should be
      used together with HiveInputFormatFragmenter/HiveRCFileAccessor.
  org.apache.hawq.pxf.plugins.hive.HiveORCSerdeResolver
      Specialized HiveResolver for a Hive table stored in ORC format. Should be
      used together with HiveInputFormatFragmenter/HiveORCAccessor.
  org.apache.hawq.pxf.plugins.hive.HiveORCVectorizedResolver
      Specialized HiveResolver for a Hive table stored in ORC format. Should be
      used together with HiveInputFormatFragmenter/HiveORCVectorizedAccessor.

The class needs to extend the org.apache.hawq.pxf.api.utilities.Plugin class, and implement one or both of the following interfaces:

  • org.apache.hawq.pxf.api.ReadResolver
  • org.apache.hawq.pxf.api.WriteResolver

    package org.apache.hawq.pxf.api;

    /*
     * Interface that defines the deserialization of one record brought from
     * the data Accessor. Every implementation of a deserialization method
     * (e.g., Writable, Avro, ...) must implement this interface.
     */
    public interface ReadResolver {
        public List<OneField> getFields(OneRow row) throws Exception;
    }

    package org.apache.hawq.pxf.api;

    /*
     * Interface that defines the serialization of data read from the DB
     * into a OneRow object.
     * Every implementation of a serialization method
     * (e.g., Writable, Avro, ...) must implement this interface.
     */
    public interface WriteResolver {
        public OneRow setFields(List<OneField> record) throws Exception;
    }

Note:

  • getFields() should return a List<OneField>, with each OneField representing a single field.
  • setFields() should return a single OneRow object, given a List<OneField>.

org.apache.hawq.pxf.api.OneField

    package org.apache.hawq.pxf.api;

    /*
     * Defines one field on a deserialized record.
     * 'type' is in OID values recognized by GPDBWritable
     * 'val' is the actual field value
     */
    public class OneField {
        public OneField() {}

        public OneField(int type, Object val) {
            this.type = type;
            this.val = val;
        }

        public int type;
        public Object val;
    }

The value of type should follow the org.apache.hawq.pxf.api.io.DataType enum values, and val should be an instance of the matching Java class. The supported types are:

Table 5. Resolver supported types

  DataType recognized OID    Field value
  -----------------------    -----------
  DataType.SMALLINT          Short
  DataType.INTEGER           Integer
  DataType.BIGINT            Long
  DataType.REAL              Float
  DataType.FLOAT8            Double
  DataType.NUMERIC           String ("651687465135468432168421")
  DataType.BOOLEAN           Boolean
  DataType.VARCHAR           String
  DataType.BPCHAR            String
  DataType.TEXT              String
  DataType.BYTEA             byte[]
  DataType.TIMESTAMP         Timestamp
  DataType.DATE              Date
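
As a brief illustrative sketch (not taken from a PXF plug-in), a resolver might build OneField objects for an integer and a text column as shown below. It assumes the DataType enum exposes a getOID() accessor, as used in the Dummy Resolver example later in this topic; the OneFieldSketch class name is hypothetical.

    import java.util.LinkedList;
    import java.util.List;

    import org.apache.hawq.pxf.api.OneField;

    import static org.apache.hawq.pxf.api.io.DataType.INTEGER;
    import static org.apache.hawq.pxf.api.io.DataType.TEXT;

    // Illustrative only: build the field list for one record with two columns.
    public class OneFieldSketch {
        static List<OneField> twoColumnRecord(int id, String name) {
            List<OneField> fields = new LinkedList<OneField>();
            fields.add(new OneField(INTEGER.getOID(), id));   // OID for INTEGER, value is a Java Integer
            fields.add(new OneField(TEXT.getOID(), name));    // OID for TEXT, value is a Java String
            return fields;
        }
    }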

About Custom Profiles

Administrators can add new profiles or edit the built-in profiles in /etc/pxf/conf/pxf-profiles.xml. See Using Profiles to Read and Write Data for information on how to add custom profiles.

About Query Filter Push-Down

If a query includes a number of WHERE clause filters, HAWQ may push some or all of those filters down to PXF. If pushed down, the Accessor can use the filtering information when accessing the data source, so that only records that pass the filter conditions are fetched. This reduces data processing and reduces network traffic from the SQL engine.

This topic includes the following information:

  • Filter Availability and Ordering
  • Creating a Filter Builder class
  • Filter Operations
  • Sample Implementation
  • Using Filters

Filter Availability and Ordering

PXF allows push-down filtering if the following rules are met:

  • Uses only single expressions or a group of AND’ed expressions - no OR’ed expressions.
  • Uses only expressions of supported data types and operators.

FilterParser scans the pushed-down filter list and uses the user’s build() implementation to build the filter.

  • For simple expressions (e.g., a >= 5), FilterParser places column objects on the left of the expression and constants on the right.
  • For compound expressions (e.g., <expression> AND <expression>), it handles three cases in the build() function:
    1. Simple Expression: <Column Index> <Operation> <Constant>
    2. Compound Expression: <Filter Object> AND <Filter Object>
    3. Compound Expression: <List of Filter Objects> AND <Filter Object>

Creating a Filter Builder Class

To check whether the query pushed a filter down to PXF, call the InputData.hasFilter() method:

    /*
     * Returns true if there is a filter string to parse
     */
    public boolean hasFilter()
    {
        return filterStringValid;
    }

If hasFilter() returns false, there is no filter information. If it returns true, PXF parses the serialized filter string into a meaningful filter object to use later. To do so, create a filter builder class that implements the FilterParser.FilterBuilder interface:

    package org.apache.hawq.pxf.api;

    /*
     * Interface a user of FilterParser should implement.
     * This is used to let the user build filter expressions in the manner she
     * sees fit.
     *
     * When an operator is parsed, this function is called to let the user decide
     * what to do with its operands.
     */
    interface FilterBuilder {
        public Object build(Operation operation, Object left, Object right) throws Exception;
    }

While PXF parses the serialized filter string from the incoming HAWQ query, it calls the build() function once for each condition or filter pushed down to PXF. Your implementation of this function returns a filter object or representation that the Fragmenter, Accessor, or Resolver uses at runtime to filter out records. The build() function accepts an Operation as input, along with left and right operands.

Filter Operations

    /*
     * Operations supported by the parser
     */
    public enum Operation
    {
        HDOP_LT,           // less than
        HDOP_GT,           // greater than
        HDOP_LE,           // less than or equal
        HDOP_GE,           // greater than or equal
        HDOP_EQ,           // equal
        HDOP_NE,           // not equal
        HDOP_LIKE,
        HDOP_IS_NULL,
        HDOP_IS_NOT_NULL,
        HDOP_IN
    };

    /**
     * Logical operators
     */
    public enum LogicalOperation {
        HDOP_AND,
        HDOP_OR,
        HDOP_NOT
    }

Filter Operands

There are three types of operands:

  • Column Index
  • Constant
  • Filter Object

Column Index

    /*
     * Represents a column index
     */
    public class ColumnIndex
    {
        public ColumnIndex(int idx);

        public int index();
    }

Constant

    /*
     * The class represents a constant object (String, Long, ...)
     */
    public class Constant
    {
        public Constant(Object obj);

        public Object constant();
    }

Filter Object

Filter objects can be internal, such as those you define, or external, such as those the remote system uses. For example, for HBase you use the HBase Filter class (org.apache.hadoop.hbase.filter.Filter), while for Hive you use an internal default representation created by the PXF framework, called BasicFilter. You can choose the filter object to use, including writing a new one. BasicFilter is the most common:

    /*
     * Basic filter provided for cases where the target storage system does not provide its own filter
     * For example: Hbase storage provides its own filter but for a Writable based record in a SequenceFile
     * there is no filter provided and so we need to have a default
     */
    static public class BasicFilter
    {
        /*
         * C'tor
         */
        public BasicFilter(Operation inOper, ColumnIndex inColumn, Constant inConstant);

        /*
         * Returns oper field
         */
        public Operation getOperation();

        /*
         * Returns column field
         */
        public ColumnIndex getColumn();

        /*
         * Returns constant field
         */
        public Constant getConstant();
    }

Sample Implementation

Let’s look at the following sample implementation of a filter builder class and its build() function, which handles all three cases. Assume that BasicFilter is used to hold the filter operations.

    import java.util.LinkedList;
    import java.util.List;

    import org.apache.hawq.pxf.api.FilterParser;
    import org.apache.hawq.pxf.api.utilities.InputData;

    public class MyDemoFilterBuilder implements FilterParser.FilterBuilder
    {
        private InputData inputData;

        public MyDemoFilterBuilder(InputData input)
        {
            inputData = input;
        }

        /*
         * Translates a filterString into a FilterParser.BasicFilter or a list of such filters
         */
        public Object getFilterObject(String filterString) throws Exception
        {
            FilterParser parser = new FilterParser(this);
            Object result = parser.parse(filterString);

            if (!(result instanceof FilterParser.BasicFilter) && !(result instanceof List))
                throw new Exception("String " + filterString + " resolved to no filter");

            return result;
        }

        public Object build(FilterParser.Operation opId,
                            Object leftOperand,
                            Object rightOperand) throws Exception
        {
            if (leftOperand instanceof FilterParser.BasicFilter || leftOperand instanceof List)
            {
                //sanity check
                if (opId != FilterParser.Operation.HDOP_AND || !(rightOperand instanceof FilterParser.BasicFilter))
                    throw new Exception("Only AND is allowed between compound expressions");

                //case 3
                if (leftOperand instanceof List)
                    return handleCompoundOperations((List<FilterParser.BasicFilter>)leftOperand, (FilterParser.BasicFilter)rightOperand);
                //case 2
                else
                    return handleCompoundOperations((FilterParser.BasicFilter)leftOperand, (FilterParser.BasicFilter)rightOperand);
            }

            //sanity check
            if (!(rightOperand instanceof FilterParser.Constant))
                throw new Exception("expressions of column-op-column are not supported");

            //case 1 (assume column is on the left)
            return handleSimpleOperations(opId, (FilterParser.ColumnIndex)leftOperand, (FilterParser.Constant)rightOperand);
        }

        private FilterParser.BasicFilter handleSimpleOperations(FilterParser.Operation opId,
                                                                FilterParser.ColumnIndex column,
                                                                FilterParser.Constant constant)
        {
            return new FilterParser.BasicFilter(opId, column, constant);
        }

        private List handleCompoundOperations(List<FilterParser.BasicFilter> left,
                                              FilterParser.BasicFilter right)
        {
            left.add(right);
            return left;
        }

        private List handleCompoundOperations(FilterParser.BasicFilter left,
                                              FilterParser.BasicFilter right)
        {
            List<FilterParser.BasicFilter> result = new LinkedList<FilterParser.BasicFilter>();
            result.add(left);
            result.add(right);
            return result;
        }
    }

The example above creates a filter-builder class that implements the FilterParser.FilterBuilder interface, implements the build() function, and generates the filter object. The Accessor, the Resolver, or both can then call the getFilterObject() function:

    if (inputData.hasFilter())
    {
        String filterStr = inputData.getFilterString();
        MyDemoFilterBuilder demobuilder = new MyDemoFilterBuilder(inputData);
        Object filter = demobuilder.getFilterObject(filterStr);
        ...
    }

Using Filters

Once you have built the Filter object(s), you can use them to read data and filter out records that do not meet the filter conditions:

  1. Check whether you have a single filter or a list of filters.
  2. If you have a list, iterate over it and evaluate each filter; disqualify the record if any filter condition fails.
    if (filter instanceof List)
    {
        for (Object f : (List)filter)
            <evaluate f>; //may want to break if evaluation results in negative answer for any filter
    }
    else
    {
        <evaluate filter>;
    }

Example showing evaluation of a single filter:

    //Get our BasicFilter Object
    FilterParser.BasicFilter bFilter = (FilterParser.BasicFilter)filter;

    //Get operation and operand values
    FilterParser.Operation op = bFilter.getOperation();
    int colIdx = bFilter.getColumn().index();
    String val = bFilter.getConstant().constant().toString();

    //Get more info about the column if desired
    ColumnDescriptor col = inputData.getColumn(colIdx);
    String colName = col.columnName();

    //Now evaluate it against the actual column value in the record...
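
To complete the picture, here is a minimal evaluation sketch. It is illustrative only: it assumes the record's column value has already been extracted as a string, compares lexicographically rather than by column type, and handles only a few operations; the FilterEvalSketch class name is hypothetical.

    import org.apache.hawq.pxf.api.FilterParser;

    // Illustrative only: compare a record's column value with the filter constant.
    public class FilterEvalSketch {
        static boolean matches(FilterParser.Operation op, String recordValue, String filterValue) {
            int cmp = recordValue.compareTo(filterValue);
            switch (op) {
                case HDOP_EQ: return cmp == 0;
                case HDOP_NE: return cmp != 0;
                case HDOP_LT: return cmp < 0;
                case HDOP_GT: return cmp > 0;
                case HDOP_LE: return cmp <= 0;
                case HDOP_GE: return cmp >= 0;
                default:      return true;    // unhandled operations: keep the record
            }
        }
    }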

Examples

This section contains external table examples and plug-in examples.

External Table Examples

Example 1

Shows an external table that can analyze all SequenceFiles that are populated with Writable serialized records and reside in the HDFS directory sales/2012/01. SaleItem.class is a Java class that implements the Writable interface and describes a Java record with three class members.

Note: In this example, the class member names do not necessarily match the database attribute names, but the types match. SaleItem.class must exist in the classpath of every DataNode and NameNode.

    CREATE EXTERNAL TABLE jan_2012_sales (id int, total int, comments varchar)
    LOCATION ('pxf://10.76.72.26:51200/sales/2012/01/*.seq'
              '?FRAGMENTER=org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter'
              '&ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor'
              '&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.WritableResolver'
              '&DATA-SCHEMA=SaleItem')
    FORMAT 'custom' (formatter='pxfwritable_import');

Example 2

Example 2 shows an external table that can analyze an HBase table called sales. It has 10 column families (cf1 through cf10) and many qualifier names in each family. This example focuses on the rowkey, the qualifier saleid inside column family cf1, and the qualifier comments inside column family cf8, and uses direct mapping:

    CREATE EXTERNAL TABLE hbase_sales
      (hbaserowkey text, "cf1:saleid" int, "cf8:comments" varchar)
    LOCATION ('pxf://10.76.72.26:51200/sales?PROFILE=HBase')
    FORMAT 'custom' (formatter='pxfwritable_import');

Example 3

This example uses indirect mapping. Note how the attribute names change and how they correspond to the HBase lookup table. When you execute SELECT * FROM my_hbase_sales, the attribute names are automatically converted to their HBase counterparts.

    CREATE EXTERNAL TABLE my_hbase_sales (hbaserowkey text, id int, cmts varchar)
    LOCATION ('pxf://10.76.72.26:51200/sales?PROFILE=HBase')
    FORMAT 'custom' (formatter='pxfwritable_import');

Example 4

Shows a writable external table that writes compressed data.

    CREATE WRITABLE EXTERNAL TABLE sales_aggregated_2012
      (id int, total int, comments varchar)
    LOCATION ('pxf://10.76.72.26:51200/sales/2012/aggregated'
              '?PROFILE=HdfsTextSimple'
              '&COMPRESSION_CODEC=org.apache.hadoop.io.compress.BZip2Codec')
    FORMAT 'TEXT';

Example 5

Shows a writable external table that writes data into a sequence file, using a data schema. For writable tables, the formatter is pxfwritable_export.

    CREATE WRITABLE EXTERNAL TABLE sales_max_2012
      (id int, total int, comments varchar)
    LOCATION ('pxf://10.76.72.26:51200/sales/2012/max'
              '?FRAGMENTER=org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter'
              '&ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor'
              '&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.WritableResolver'
              '&DATA-SCHEMA=SaleItem')
    FORMAT 'custom' (formatter='pxfwritable_export');

Plug-in Examples

This section contains sample dummy implementations of all three plug-ins. It also includes a usage example.

Dummy Fragmenter

    import java.util.List;

    import org.apache.hawq.pxf.api.Fragment;
    import org.apache.hawq.pxf.api.Fragmenter;
    import org.apache.hawq.pxf.api.utilities.InputData;

    /*
     * Class that defines the splitting of a data resource into fragments that can
     * be processed in parallel.
     * getFragments() returns the fragments information of a given path (source name
     * and location of each fragment). Used to get fragments of data that could be
     * read in parallel from the different segments.
     * Dummy implementation, for documentation.
     */
    public class DummyFragmenter extends Fragmenter {
        public DummyFragmenter(InputData metaData) {
            super(metaData);
        }

        /*
         * path is a data source URI that can appear as a file name, a directory name or a wildcard
         * returns the data fragments - identifiers of data and a list of available hosts
         */
        @Override
        public List<Fragment> getFragments() throws Exception {
            String localhostname = java.net.InetAddress.getLocalHost().getHostName();
            String[] localHosts = new String[]{localhostname, localhostname};
            fragments.add(new Fragment(inputData.getDataSource() + ".1" /* source name */,
                    localHosts /* available hosts list */,
                    "fragment1".getBytes()));
            fragments.add(new Fragment(inputData.getDataSource() + ".2" /* source name */,
                    localHosts /* available hosts list */,
                    "fragment2".getBytes()));
            fragments.add(new Fragment(inputData.getDataSource() + ".3" /* source name */,
                    localHosts /* available hosts list */,
                    "fragment3".getBytes()));
            return fragments;
        }
    }

Dummy Accessor

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hawq.pxf.api.OneRow;
    import org.apache.hawq.pxf.api.ReadAccessor;
    import org.apache.hawq.pxf.api.WriteAccessor;
    import org.apache.hawq.pxf.api.utilities.InputData;
    import org.apache.hawq.pxf.api.utilities.Plugin;

    /*
     * Internal interface that defines the access to a file on HDFS. All classes
     * that implement actual access to an HDFS file (sequence file, Avro file, ...)
     * must respect this interface.
     * Dummy implementation, for documentation.
     */
    public class DummyAccessor extends Plugin implements ReadAccessor, WriteAccessor {
        private static final Log LOG = LogFactory.getLog(DummyAccessor.class);
        private int rowNumber;
        private int fragmentNumber;

        public DummyAccessor(InputData metaData) {
            super(metaData);
        }

        @Override
        public boolean openForRead() throws Exception {
            /* fopen or similar */
            return true;
        }

        @Override
        public OneRow readNextObject() throws Exception {
            /* return next row, <key=fragmentNo.rowNo, val=rowNo,text,fragmentNo> */
            /* check for EOF */
            if (fragmentNumber > 0)
                return null; /* signal EOF, close will be called */
            int fragment = inputData.getDataFragment();
            String fragmentMetadata = new String(inputData.getFragmentMetadata());
            /* generate row */
            OneRow row = new OneRow(fragment + "." + rowNumber, /* key */
                    rowNumber + "," + fragmentMetadata + "," + fragment /* value */);
            /* advance */
            rowNumber += 1;
            if (rowNumber == 2) {
                rowNumber = 0;
                fragmentNumber += 1;
            }
            /* return data */
            return row;
        }

        @Override
        public void closeForRead() throws Exception {
            /* fclose or similar */
        }

        @Override
        public boolean openForWrite() throws Exception {
            /* fopen or similar */
            return true;
        }

        @Override
        public boolean writeNextObject(OneRow onerow) throws Exception {
            LOG.info(onerow.getData());
            return true;
        }

        @Override
        public void closeForWrite() throws Exception {
            /* fclose or similar */
        }
    }

Dummy Resolver

    import java.util.LinkedList;
    import java.util.List;

    import org.apache.hawq.pxf.api.OneField;
    import org.apache.hawq.pxf.api.OneRow;
    import org.apache.hawq.pxf.api.ReadResolver;
    import org.apache.hawq.pxf.api.WriteResolver;
    import org.apache.hawq.pxf.api.utilities.InputData;
    import org.apache.hawq.pxf.api.utilities.Plugin;

    import static org.apache.hawq.pxf.api.io.DataType.INTEGER;
    import static org.apache.hawq.pxf.api.io.DataType.VARCHAR;

    /*
     * Class that defines the deserialization of one record brought from the external input data.
     * Every implementation of a deserialization method (Writable, Avro, BP, Thrift, ...)
     * must inherit this abstract class.
     * Dummy implementation, for documentation.
     */
    public class DummyResolver extends Plugin implements ReadResolver, WriteResolver {
        private int rowNumber;

        public DummyResolver(InputData metaData) {
            super(metaData);
            rowNumber = 0;
        }

        @Override
        public List<OneField> getFields(OneRow row) throws Exception {
            /* break up the row into fields */
            List<OneField> output = new LinkedList<OneField>();
            String[] fields = ((String) row.getData()).split(",");
            output.add(new OneField(INTEGER.getOID() /* type */, Integer.parseInt(fields[0]) /* value */));
            output.add(new OneField(VARCHAR.getOID(), fields[1]));
            output.add(new OneField(INTEGER.getOID(), Integer.parseInt(fields[2])));
            return output;
        }

        @Override
        public OneRow setFields(List<OneField> record) throws Exception {
            /* should read inputStream row by row */
            return rowNumber > 5
                    ? null
                    : new OneRow(null, "row number " + rowNumber++);
        }
    }

Usage Example

    psql=# CREATE EXTERNAL TABLE dummy_tbl
              (int1 integer, word text, int2 integer)
           LOCATION ('pxf://localhost:51200/dummy_location'
                     '?FRAGMENTER=DummyFragmenter'
                     '&ACCESSOR=DummyAccessor'
                     '&RESOLVER=DummyResolver')
           FORMAT 'custom' (formatter = 'pxfwritable_import');
    CREATE EXTERNAL TABLE

    psql=# SELECT * FROM dummy_tbl;
     int1 |   word    | int2
    ------+-----------+------
        0 | fragment1 |    0
        1 | fragment1 |    0
        0 | fragment2 |    0
        1 | fragment2 |    0
        0 | fragment3 |    0
        1 | fragment3 |    0
    (6 rows)

    psql=# CREATE WRITABLE EXTERNAL TABLE dummy_tbl_write
              (int1 integer, word text, int2 integer)
           LOCATION ('pxf://localhost:51200/dummy_location'
                     '?ACCESSOR=DummyAccessor'
                     '&RESOLVER=DummyResolver')
           FORMAT 'custom' (formatter = 'pxfwritable_export');
    CREATE EXTERNAL TABLE

    psql=# INSERT INTO dummy_tbl_write VALUES (1, 'a', 11), (2, 'b', 22);
    INSERT 0 2