Loading Data with hawq load

The HAWQ hawq load utility loads data using readable external tables and the HAWQ parallel file server ( gpfdist or gpfdists). It handles parallel file-based external table setup and allows users to configure their data format, external table definition, and gpfdist or gpfdists setup in a single configuration file.

To use hawq load

  1. Ensure that your environment is set up to run hawq load. Some dependent files from your HAWQ /> installation are required, such as gpfdist and Python, as well as network access to the HAWQ segment hosts.
  2. Create your load control file. This is a YAML-formatted file that specifies the HAWQ connection information, gpfdist configuration information, external table options, and data format.

    For example:

    1. ---
    2. VERSION: 1.0.0.1
    3. DATABASE: ops
    4. USER: gpadmin
    5. HOST: mdw-1
    6. PORT: 5432
    7. GPLOAD:
    8. INPUT:
    9. - SOURCE:
    10. LOCAL_HOSTNAME:
    11. - etl1-1
    12. - etl1-2
    13. - etl1-3
    14. - etl1-4
    15. PORT: 8081
    16. FILE:
    17. - /var/load/data/*
    18. - COLUMNS:
    19. - name: text
    20. - amount: float4
    21. - category: text
    22. - description: text
    23. - date: date
    24. - FORMAT: text
    25. - DELIMITER: '|'
    26. - ERROR_LIMIT: 25
    27. - ERROR_TABLE: payables.err_expenses
    28. OUTPUT:
    29. - TABLE: payables.expenses
    30. - MODE: INSERT
    31. SQL:
    32. - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
    33. - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
  3. Run hawq load, passing in the load control file. For example:

    1. $ hawq load -f my_load.yml