Loading Data

This topic describes the different ways to add data to Greenplum Database.

INSERT Statement with Column Values

A singleton INSERT statement with values adds a single row to a table. The row flows through the master and is distributed to a segment. This is the slowest method and is not suitable for loading large amounts of data.
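
For illustration, a minimal single-row insert; the table and values here are hypothetical:

  -- Hypothetical single-row insert: the row passes through the master
  -- and lands on exactly one segment.
  INSERT INTO customer (id, name, region) VALUES (1, 'Acme Corp', 'EMEA');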

COPY Statement

The PostgreSQL COPY statement copies data from an external file into a database table. It can insert multiple rows more efficiently than an INSERT statement, but the rows are still passed through the master. All of the data is copied in one command; it is not a parallel process.

Data input to the COPY command is from a file or the standard input. For example:

  COPY table FROM '/data/mydata.csv' WITH CSV HEADER;

Use COPY to add relatively small sets of data, for example dimension tables with up to ten thousand rows, or for one-time data loads. It is also appropriate when scripting a process that loads small amounts of data, fewer than 10 thousand rows.

Since COPY is a single command, there is no need to disable autocommit when you use this method to populate a table.

You can run multiple concurrent COPY commands to improve performance.

External Tables

External tables provide access to data in sources outside of Greenplum Database. They can be accessed with SELECT statements and are commonly used with the Extract, Load, Transform (ELT) pattern, a variant of the Extract, Transform, Load (ETL) pattern that takes advantage of Greenplum Database’s fast parallel data loading capability.

With ETL, data is extracted from its source, transformed outside of the database using external transformation tools, such as Informatica or Datastage, and then loaded into the database.

With ELT, Greenplum external tables provide access to data in external sources, which could be read-only files (for example, text, CSV, or XML files), Web servers, Hadoop file systems, executable OS programs, or the Greenplum gpfdist file server, described in the next section. External tables support SQL operations such as select, sort, and join so the data can be loaded and transformed simultaneously, or loaded into a load table and transformed in the database into target tables.

The external table is defined with a CREATE EXTERNAL TABLE statement, which has a LOCATION clause to define the location of the data and a FORMAT clause to define the formatting of the source data so that the system can parse the input data. Files use the file:// protocol, and must reside on a segment host in a location accessible by the Greenplum superuser. The data can be spread out among the segment hosts with no more than one file per primary segment on each host. The number of files listed in the LOCATION clause is the number of segments that will read the external table in parallel.
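
A minimal sketch of such a definition, assuming hypothetical column names, segment host names, and file paths:

  -- Hypothetical readable external table over CSV files staged on two segment hosts.
  CREATE EXTERNAL TABLE ext_expenses (
      name          text,
      expense_date  date,
      amount        numeric,
      category      text
  )
  LOCATION ('file://seghost1/data/expenses1.csv',
            'file://seghost2/data/expenses2.csv')
  FORMAT 'CSV' (HEADER);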

External Tables with Gpfdist

The fastest way to load large fact tables is to use external tables with gpfdist. gpfdist is a file server program that uses the HTTP protocol to serve external data files to Greenplum Database segments in parallel. A gpfdist instance can serve 200 MB/second, and many gpfdist processes can run simultaneously, each serving up a portion of the data to be loaded. When you begin the load using a statement such as INSERT INTO <table> SELECT * FROM <external_table>, the INSERT statement is parsed by the master and distributed to the primary segments. The segments connect to the gpfdist servers and retrieve the data in parallel, parse and validate the data, calculate a hash from the distribution key data and, based on the hash key, send the row to its destination segment. By default, each gpfdist instance accepts up to 64 connections from segments. With many segments and gpfdist servers participating in the load, data can be loaded at very high rates.
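
A sketch of such a load, assuming hypothetical table, host, port, and path names:

  -- Hypothetical external table served by two gpfdist instances running on ETL hosts.
  CREATE EXTERNAL TABLE ext_sales (LIKE sales)
  LOCATION ('gpfdist://etl1:8081/sales/*.csv',
            'gpfdist://etl2:8081/sales/*.csv')
  FORMAT 'CSV';

  -- The segments retrieve the files in parallel and redistribute each row
  -- to its destination segment based on the target table's distribution key.
  INSERT INTO sales SELECT * FROM ext_sales;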

Primary segments access external files in parallel when using gpfdist, up to the value of gp_external_max_segs. When optimizing gpfdist performance, maximize the parallelism as the number of segments increases. Spread the data evenly across as many ETL nodes as possible. Split very large data files into equal parts and spread the data across as many file systems as possible.

Run two gpfdist instances per file system. gpfdist tends to be CPU bound on the segment nodes when loading, but if, for example, there are eight racks of segment nodes, there is a lot of available CPU on the segments to drive more gpfdist processes. Run gpfdist on as many interfaces as possible. Be aware of bonded NICs and be sure to start enough gpfdist instances to fully utilize them.

It is important to keep the work even across all of these resources. The load is only as fast as the slowest node, so skew in the load file layout will cause the overall load to bottleneck on that resource.

The gp_external_max_segs configuration parameter controls the number of segments each gpfdist process serves. The default is 64. You can set a different value in the postgresql.conf configuration file on the master. Always keep gp_external_max_segs and the number of gpfdist processes an even factor; that is, the gp_external_max_segs value should be a multiple of the number of gpfdist processes. For example, if there are 12 segments and 4 gpfdist processes, the planner assigns the segment connections in round-robin fashion as follows:

  Segment 1  - gpfdist 1
  Segment 2  - gpfdist 2
  Segment 3  - gpfdist 3
  Segment 4  - gpfdist 4
  Segment 5  - gpfdist 1
  Segment 6  - gpfdist 2
  Segment 7  - gpfdist 3
  Segment 8  - gpfdist 4
  Segment 9  - gpfdist 1
  Segment 10 - gpfdist 2
  Segment 11 - gpfdist 3
  Segment 12 - gpfdist 4

Drop indexes before loading into existing tables and re-create the indexes after loading. Creating an index on pre-existing data is faster than updating it incrementally as each row is loaded.

Run ANALYZE on the table after loading. Disable automatic statistics collection during loading by setting gp_autostats_mode to NONE. Run VACUUM after load errors to recover space.
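
A minimal sketch of that sequence, assuming a hypothetical target table named sales:

  -- Turn off automatic statistics collection for the load session.
  SET gp_autostats_mode = 'NONE';

  -- ... perform the load here, for example INSERT INTO sales SELECT * FROM ext_sales; ...

  -- Collect statistics once the load completes.
  ANALYZE sales;

  -- If the load reported errors, reclaim space used by the aborted rows.
  VACUUM sales;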

Performing small, high frequency data loads into heavily partitioned column-oriented tables can have a high impact on the system because of the number of physical files accessed per time interval.

Gpload

gpload is a data loading utility that acts as an interface to the Greenplum external table parallel loading feature.

Beware of using gpload as it can cause catalog bloat by creating and dropping external tables. Use gpfdist instead, since it provides the best performance.

gpload executes a load using a specification defined in a YAML-formatted control file. It performs the following operations:

  • Invokes gpfdist processes
  • Creates a temporary external table definition based on the source data defined in the control file
  • Executes an INSERT, UPDATE, or MERGE operation to load the source data into the target table in the database
  • Drops the temporary external table
  • Cleans up gpfdist processes

The load is accomplished in a single transaction.
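
The sequence is roughly equivalent to the following SQL sketch; the table names and gpfdist location are hypothetical, and gpload generates its own temporary external table names:

  -- Illustrative only: the work gpload performs, expressed as plain SQL.
  CREATE EXTERNAL TABLE ext_stage (LIKE sales)
  LOCATION ('gpfdist://etl1:8081/sales.csv')
  FORMAT 'CSV';

  -- INSERT, UPDATE, or MERGE, depending on the MODE set in the control file.
  INSERT INTO sales SELECT * FROM ext_stage;

  DROP EXTERNAL TABLE ext_stage;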

Best Practices

  • Drop any indexes on an existing table before loading data and recreate the indexes after loading. Creating an index on a fully loaded table is faster than updating the index incrementally as each row is loaded.
  • Disable automatic statistics collection during loading by setting the gp_autostats_mode configuration parameter to NONE.
  • External tables are not intended for frequent or ad hoc access.
  • External tables have no statistics to inform the optimizer. You can set rough estimates for the number of rows and disk pages for the external table in the pg_class system catalog with a statement like the following:

    UPDATE pg_class SET reltuples=400000, relpages=400
    WHERE relname='myexttable';
  • When using gpfdist, maximize network bandwidth by running one gpfdist instance for each NIC on the ETL server. Divide the source data evenly between the gpfdist instances.

  • When using gpload, run as many simultaneous gpload instances as resources allow. Take advantage of the CPU, memory, and networking resources available to increase the amount of data that can be transferred from ETL servers to the Greenplum Database.
  • Use the SEGMENT REJECT LIMIT clause of the COPY statement to set a limit for the number or percentage of rows that can have errors before the COPY FROM command is aborted. The reject limit is per segment; when any one segment exceeds the limit, the command is aborted and no rows are added. Use the LOG ERRORS clause to save error rows. If a row has formatting errors (for example, missing or extra values, or incorrect data types), Greenplum Database stores the error information and the row internally. Use the gp_read_error_log() built-in SQL function to access this stored information; see the sketch after this list.
  • If the load has errors, run VACUUM on the table to recover space.
  • After you load data into a table, run VACUUM on heap tables, including system catalogs, and ANALYZE on all tables. It is not necessary to run VACUUM on append-optimized tables. If the table is partitioned, you can vacuum and analyze just the partitions affected by the data load. These steps clean up any rows from aborted loads, deletes, or updates and update statistics for the table.
  • Recheck for segment skew in the table after loading a large amount of data. You can use a query like the following to check for skew:

    SELECT gp_segment_id, count(*)
    FROM schema.table
    GROUP BY gp_segment_id ORDER BY 2;
  • By default, gpfdist assumes a maximum record size of 32K. To load data records larger than 32K, you must increase the maximum row size parameter by specifying the -m <bytes> option on the gpfdist command line. If you use gpload, set the MAX_LINE_LENGTH parameter in the gpload control file.

    Note: Integrations with Informatica Power Exchange are currently limited to the default 32K record length.
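
As referenced above, a sketch of an error-tolerant load; the table name, file path, and reject limit are hypothetical:

  -- Hypothetical load that tolerates up to 50 malformed rows per segment
  -- and logs the rejected rows internally.
  COPY sales FROM '/data/sales.csv' WITH CSV HEADER
      LOG ERRORS SEGMENT REJECT LIMIT 50 ROWS;

  -- Inspect the rows that were rejected during loads of this table.
  SELECT * FROM gp_read_error_log('sales');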

Additional Information

See the Greenplum Database Reference Guide for detailed instructions for loading data using gpfdist and gpload.