PXF External Tables and API
You can use the PXF API to create your own connectors to access any other type of parallel data store or processing engine.
The PXF Java API lets you extend PXF functionality and add new services and formats without changing HAWQ. The API includes three classes that are extended to allow HAWQ to access an external data source: Fragmenter, Accessor, and Resolver.
The Fragmenter produces a list of data fragments that can be read in parallel from the data source. The Accessor produces a list of records from a single fragment, and the Resolver both deserializes and serializes records.
Together, the Fragmenter, Accessor, and Resolver classes implement a connector. PXF includes plug-ins for HDFS and JSON files and tables in HBase and Hive.
Creating an External Table
The syntax for an EXTERNAL TABLE that uses the PXF protocol is as follows:
CREATE [READABLE|WRITABLE] EXTERNAL TABLE <table_name>( <column_name> <data_type> [, ...] | LIKE <other_table> )LOCATION('pxf://<host>[:<port>]/<path-to-data>?<pxf-parameters>[&<custom-option>=<value>[...]]')FORMAT 'custom' (formatter='pxfwritable_import|pxfwritable_export');
where
[FRAGMENTER=<fragmenter_class>&ACCESSOR=<accessor_class>&RESOLVER=<resolver_class>] | ?PROFILE=profile-name
Note: Not every PXF profile supports writable external tables. Refer to Writing Data to HDFS for a detailed discussion of the HDFS plug-in profiles that support this feature.
Table 1. Parameter values and description
| Parameter | Value and description |
|---|---|
| <host> | The PXF host. While <host> may identify any PXF agent node, use the HDFS NameNode as it is guaranteed to be available in a running HDFS cluster. If HDFS High Availability is enabled, <host> must identify the HDFS NameService. |
| <port> | The PXF port. If <port> is omitted, PXF assumes <host> identifies a High Availability HDFS Nameservice and connects to the port number designated by the pxf_service_port server configuration parameter value. Default is 51200. |
| <path-to-data> | A directory, file name, wildcard pattern, table name, etc. |
| PROFILE | The profile PXF uses to access the data. PXF supports multiple plug-ins that currently expose profiles named HBase, Hive, HiveRC, HiveText, HiveORC, HiveVectorizedORC, HdfsTextSimple, HdfsTextMulti, Avro, SequenceWritable, and Json. |
| FRAGMENTER | The Java class the plug-in uses for fragmenting data. Used for READABLE external tables only. |
| ACCESSOR | The Java class the plug-in uses for accessing the data. Used for READABLE and WRITABLE tables. |
| RESOLVER | The Java class the plug-in uses for serializing and deserializing the data. Used for READABLE and WRITABLE tables. |
| <custom-option> | Additional values to pass to the plug-in at runtime. A plug-in can parse custom options with the PXF helper class org.apache.hawq.pxf.api.utilities.InputData. |
Note: When creating PXF external tables, you cannot use the HEADER option in your FORMAT specification.
About the Java Class Services and Formats
The LOCATION string in a PXF CREATE EXTERNAL TABLE statement is a URI that specifies the host and port of an external data source and the path to the data in the external data source. The query portion of the URI, introduced by the question mark (?), must include the PXF profile name or the plug-in’s FRAGMENTER (readable tables only), ACCESSOR, and RESOLVER class names.
PXF profiles are defined in the /etc/pxf/conf/pxf-profiles.xml file. Profile definitions include plug-in class names. For example, the HdfsTextSimple profile definition is:
<profile><name>HdfsTextSimple</name><description> This profile is suitable for use when reading delimitedsingle line records from plain text files on HDFS.</description><plugins><fragmenter>org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter><accessor>org.apache.hawq.pxf.plugins.hdfs.LineBreakAccessor</accessor><resolver>org.apache.hawq.pxf.plugins.hdfs.StringPassResolver</resolver></plugins></profile>
The parameters in the PXF URI are passed from HAWQ as headers to the PXF Java service. You can pass custom information to user-implemented PXF plug-ins by adding optional parameters to the LOCATION string.
The Java PXF service retrieves the source data from the external data source and converts it to a HAWQ-readable table format.
The Accessor, Resolver, and Fragmenter Java classes extend the org.apache.hawq.pxf.api.utilities.Plugin class:
package org.apache.hawq.pxf.api.utilities;/*** Base class for all plug-in types (Accessor, Resolver, Fragmenter, ...).* Manages the meta data.*/public class Plugin {protected InputData inputData;/*** Constructs a plug-in.** @param input the input data*/public Plugin(InputData input) {this.inputData = input;}/*** Checks if the plug-in is thread safe or not, based on inputData.** @return true if plug-in is thread safe*/public boolean isThreadSafe() {return true;}}
The parameters in the LOCATION string are available to the plug-ins through methods in the org.apache.hawq.pxf.api.utilities.InputData class. Plug-ins can look up the custom parameters added to the location string with the getUserProperty() method.
/*** Common configuration available to all PXF plug-ins. Represents input data* coming from client applications, such as HAWQ.*/public class InputData {/*** Constructs an InputData from a copy.* Used to create from an extending class.** @param copy the input data to copy*/public InputData(InputData copy);/*** Returns value of a user defined property.** @param userProp the lookup user property* @return property value as a String*/public String getUserProperty(String userProp);/*** Sets the byte serialization of a fragment meta data* @param location start, len, and location of the fragment*/public void setFragmentMetadata(byte[] location);/** Returns the byte serialization of a data fragment */public byte[] getFragmentMetadata();/*** Gets any custom user data that may have been passed from the* fragmenter. Will mostly be used by the accessor or resolver.*/public byte[] getFragmentUserData();/*** Sets any custom user data that needs to be shared across plug-ins.* Will mostly be set by the fragmenter.*/public void setFragmentUserData(byte[] userData);/** Returns the number of segments in GP. */public int getTotalSegments();/** Returns the current segment ID. */public int getSegmentId();/** Returns true if there is a filter string to parse. */public boolean hasFilter();/** Returns the filter string, <tt>null</tt> if #hasFilter is <tt>false</tt> */public String getFilterString();/** Returns tuple description. */public ArrayList<ColumnDescriptor> getTupleDescription();/** Returns the number of columns in tuple description. */public int getColumns();/** Returns column index from tuple description. */public ColumnDescriptor getColumn(int index);/*** Returns the column descriptor of the recordkey column. If the recordkey* column was not specified by the user in the create table statement will* return null.*/public ColumnDescriptor getRecordkeyColumn();/** Returns the data source of the required resource (i.e a file path or a table name). */public String getDataSource();/** Sets the data source for the required resource */public void setDataSource(String dataSource);/** Returns the ClassName for the java class that was defined as Accessor */public String getAccessor();/** Returns the ClassName for the java class that was defined as Resolver */public String getResolver();/*** Returns the ClassName for the java class that was defined as Fragmenter* or null if no fragmenter was defined*/public String getFragmenter();/*** Returns the contents of pxf_remote_service_login set in Hawq.* Should the user set it to an empty string this function will return null.** @return remote login details if set, null otherwise*/public String getLogin();/*** Returns the contents of pxf_remote_service_secret set in Hawq.* Should the user set it to an empty string this function will return null.** @return remote password if set, null otherwise*/public String getSecret();/*** Returns true if the request is thread safe. Default true. Should be set* by a user to false if the request contains non thread-safe plug-ins or* components, such as BZip2 codec.*/public boolean isThreadSafe();/*** Returns a data fragment index. plan to deprecate it in favor of using* getFragmentMetadata().*/public int getDataFragment();}
Fragmenter
Note: You use the Fragmenter class to read data into HAWQ. You cannot use this class to write data out of HAWQ.
The Fragmenter is responsible for passing datasource metadata back to HAWQ. It also returns a list of data fragments to the Accessor or Resolver. Each data fragment describes some part of the requested data set. It contains the datasource name, such as the file or table name, including the hostname where it is located. For example, if the source is an HDFS file, the Fragmenter returns a list of data fragments containing an HDFS file block. Each fragment includes the location of the block. If the source data is an HBase table, the Fragmenter returns information about table regions, including their locations.
The ANALYZE command now retrieves advanced statistics for PXF readable tables by estimating the number of tuples in a table, creating a sample table from the external table, and running advanced statistics queries on the sample table in the same way statistics are collected for native HAWQ tables.
The configuration parameter pxf_enable_stat_collection controls collection of advanced statistics. If pxf_enable_stat_collection is set to false, no analysis is performed on PXF tables. An additional parameter, pxf_stat_max_fragments, controls the number of fragments sampled to build a sample table. By default pxf_stat_max_fragments is set to 100, which means that even if there are more than 100 fragments, only this number of fragments will be used in ANALYZE to sample the data. Increasing this number will result in better sampling, but can also impact performance.
When a PXF table is analyzed, any of the following conditions might result in a warning message with no statistics gathered for the table:
pxf_enable_stat_collectionis set to off,- an error occurs because the table is not defined correctly,
- the PXF service is down, or
getFragmentsStats()is not implemented
If ANALYZE is running over all tables in the database, the next table will be processed – a failure processing one table does not stop the command.
For a detailed explanation about HAWQ statistical data gathering, refer to the ANALYZE SQL command reference.
Note:
- Depending on external table size, the time required to complete an
ANALYZEoperation can be lengthy. The boolean parameterpxf_enable_stat_collectionenables statistics collection for PXF. The default value ison. Turning this parameter off (disabling PXF statistics collection) can help decrease the time needed for theANALYZEoperation. - You can also use
pxf_stat_max_fragmentsto limit the number of fragments to be sampled by decreasing it from the default (100). However, if the number is too low, the sample might not be uniform and the statistics might be skewed. - You can also implement
getFragmentsStats()to return an error. This will causeANALYZEon a table with thisFragmenterto fail immediately, and default statistics values will be used for that table.
The following table lists the Fragmenter plug-in implementations included with the PXF API.
Fragmenter class | Description |
|---|---|
| org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter | Fragmenter for HDFS, JSON files |
| org.apache.hawq.pxf.plugins.hbase.HBaseDataFragmenter | Fragmenter for HBase tables |
| org.apache.hawq.pxf.plugins.hive.HiveDataFragmenter | Fragmenter for Hive tables |
| org.apache.hawq.pxf.plugins.hdfs.HiveInputFormatFragmenter | Fragmenter for Hive tables with RC, ORC, or text file formats |
A Fragmenter class extends org.apache.hawq.pxf.api.Fragmenter:
org.apache.hawq.pxf.api.Fragmenter
package org.apache.hawq.pxf.api;/*** Abstract class that defines the splitting of a data resource into fragments* that can be processed in parallel.*/public abstract class Fragmenter extends Plugin {protected List<Fragment> fragments;public Fragmenter(InputData metaData) {super(metaData);fragments = new LinkedList<Fragment>();}/*** Gets the fragments of a given path (source name and location of each* fragment). Used to get fragments of data that could be read in parallel* from the different segments.*/public abstract List<Fragment> getFragments() throws Exception;/*** Default implementation of statistics for fragments. The default is:* <ul>* <li>number of fragments - as gathered by {@link #getFragments()}</li>* <li>first fragment size - 64MB</li>* <li>total size - number of fragments times first fragment size</li>* </ul>* Each fragmenter implementation can override this method to better match* its fragments stats.** @return default statistics* @throws Exception if statistics cannot be gathered*/public FragmentsStats getFragmentsStats() throws Exception {List<Fragment> fragments = getFragments();long fragmentsNumber = fragments.size();return new FragmentsStats(fragmentsNumber,FragmentsStats.DEFAULT_FRAGMENT_SIZE, fragmentsNumber* FragmentsStats.DEFAULT_FRAGMENT_SIZE);}}
getFragments() returns a string in JSON format of the retrieved fragment. For example, if the input path is a HDFS directory, the source name for each fragment should include the file name including the path for the fragment.
Class Description
The Fragmenter.getFragments() method returns a List<Fragment>:
package org.apache.hawq.pxf.api;/** Fragment holds a data fragment' information.* Fragmenter.getFragments() returns a list of fragments.*/public class Fragment{private String sourceName; // File path+name, table name, etc.private int index; // Fragment index (incremented per sourceName)private String[] replicas; // Fragment replicas (1 or more)private byte[] metadata; // Fragment metadata information (starting point + length, region location, etc.)private byte[] userData; // ThirdParty data added to a fragment. Ignored if null...}
org.apache.hawq.pxf.api.FragmentsStats
The Fragmenter.getFragmentsStats() method returns a FragmentsStats:
package org.apache.hawq.pxf.api;/*** FragmentsStats holds statistics for a given path.*/public class FragmentsStats {// number of fragmentsprivate long fragmentsNumber;// first fragment sizeprivate SizeAndUnit firstFragmentSize;// total fragments sizeprivate SizeAndUnit totalSize;/*** Enum to represent unit (Bytes/KB/MB/GB/TB)*/public enum SizeUnit {/*** Byte*/B,/*** KB*/KB,/*** MB*/MB,/*** GB*/GB,/*** TB*/TB;};/*** Container for size and unit*/public class SizeAndUnit {long size;SizeUnit unit;...
getFragmentsStats() returns a string in JSON format of statistics for the data source. For example, if the input path is a HDFS directory of 3 files, each one of 1 block, the output will be the number of fragments (3), the size of the first file, and the size of all files in that directory.
Accessor
The Accessor retrieves specific fragments and passes records back to the Resolver. For example, the HDFS plug-ins create a org.apache.hadoop.mapred.FileInputFormat and a org.apache.hadoop.mapred.RecordReader for an HDFS file and sends this to the Resolver. In the case of HBase or Hive files, the Accessor returns single rows from an HBase or Hive table. PXF includes the following Accessor implementations:
Accessor class | Description |
|---|---|
| org.apache.hawq.pxf.plugins.hdfs.HdfsAtomicDataAccessor | Base class for accessing datasources which cannot be split. These will be accessed by a single HAWQ segment |
| org.apache.hawq.pxf.plugins.hdfs.QuotedLineBreakAccessor | Accessor for TEXT files that have records with embedded linebreaks |
| org.apache.hawq.pxf.plugins.hdfs.HdfsSplittableDataAccessor | Base class for accessing HDFS files using |
| org.apache.hawq.pxf.plugins.hdfs.LineBreakAccessor | Accessor for TEXT files (replaced the deprecated TextFileAccessor, LineReaderAccessor) |
| org.apache.hawq.pxf.plugins.hdfs.AvroFileAccessor | Accessor for Avro files |
| org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor | Accessor for Sequence files |
| org.apache.hawq.pxf.plugins.hbase.HBaseAccessor | Accessor for HBase tables |
| org.apache.hawq.pxf.plugins.hive.HiveAccessor | Accessor for Hive tables |
| org.apache.hawq.pxf.plugins.hive.HiveLineBreakAccessor | Accessor for Hive tables stored as text file format |
| org.apache.hawq.pxf.plugins.hive.HiveRCFileAccessor | Accessor for Hive tables stored as RC file format |
| org.apache.hawq.pxf.plugins.hive.HiveORCAccessor | Accessor for Hive tables stored as ORC format |
| org.apache.hawq.pxf.plugins.hive.HiveORCVectorizedAccessor | Accessor for Hive tables stored as ORC format |
| org.apache.hawq.pxf.plugins.json.JsonAccessor | Accessor for JSON files |
The class must extend the org.apache.hawq.pxf.Plugin class, and implement one or both of the interfaces:
org.apache.hawq.pxf.api.ReadAccessororg.apache.hawq.pxf.api.WriteAccessor
package org.apache.hawq.pxf.api;/** Internal interface that defines the access to data on the source* data store (e.g, a file on HDFS, a region of an HBase table, etc).* All classes that implement actual access to such data sources must* respect this interface*/public interface ReadAccessor {boolean openForRead() throws Exception;OneRow readNextObject() throws Exception;void closeForRead() throws Exception;}
package org.apache.hawq.pxf.api;/** An interface for writing data into a data store* (e.g, a sequence file on HDFS).* All classes that implement actual access to such data sources must* respect this interface*/public interface WriteAccessor {boolean openForWrite() throws Exception;OneRow writeNextObject(OneRow onerow) throws Exception;void closeForWrite() throws Exception;}
The Accessor calls openForRead() to read existing data. After reading the data, it calls closeForRead(). readNextObject() returns one of the following:
- a single record, encapsulated in a
OneRowobject - null if it reaches
EOF
The Accessor calls openForWrite() to write data out. After writing the data, it writes a OneRow object with writeNextObject(), and when done calls closeForWrite(). OneRow represents a key-value item.
org.apache.hawq.pxf.api.OneRow
package org.apache.hawq.pxf.api;/** Represents one row in the external system data store. Supports* the general case where one row contains both a record and a* separate key like in the HDFS key/value model for MapReduce* (Example: HDFS sequence file)*/public class OneRow {/** Default constructor*/public OneRow();/** Constructor sets key and data*/public OneRow(Object inKey, Object inData);/** Setter for key*/public void setKey(Object inKey);/** Setter for data*/public void setData(Object inData);/** Accessor for key*/public Object getKey();/** Accessor for data*/public Object getData();/** Show content*/public String toString();}