ETL

The Extractor, Transformer and Loader (ETL) module for OrientDB moves data to and from OrientDB databases using ETL processes.

  • Configuration: The ETL module uses a configuration file written in JSON.
  • Extractor: Pulls data from the source database.
  • Transformers: Convert the data in the pipeline from its source format to one accessible to the target database.
  • Loader: Loads the data into the target database.

How ETL Works

The ETL module receives a backup file from another database, converts its fields into an accessible format, and loads them into OrientDB.

  EXTRACTOR => TRANSFORMERS[] => LOADER

For example, consider the process for a CSV file. Using the ETL module, OrientDB loads the file, applies whatever changes it needs, then stores each record as a document in the current OrientDB database.

  +-----------+-----------------------+-----------+
  |           |             PIPELINE              |
  + EXTRACTOR +-----------------------+-----------+
  |           |     TRANSFORMERS      |  LOADER   |
  +-----------+-----------------------+-----------+
  |   FILE   ==>  CSV->FIELD->MERGE  ==> OrientDB |
  +-----------+-----------------------+-----------+
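
To make the EXTRACTOR => TRANSFORMERS[] => LOADER flow concrete, here is a minimal Python sketch of the same idea. It is only an illustration of the pattern, not the ETL module's implementation: the function names (extract, csv_transformer, field_transformer, run_pipeline) and the sample data are hypothetical, and an in-memory list stands in for the OrientDB database.

```python
import csv
import io


def extract(source):
    """Extractor: yield one raw line at a time from the source."""
    for line in source:
        yield line.rstrip("\n")


def csv_transformer(header):
    """Transformer: split a raw CSV line into a dict keyed by the header names."""
    def transform(line):
        values = next(csv.reader([line]))
        return dict(zip(header, values))
    return transform


def field_transformer(name, value):
    """Transformer: return a copy of the record with one extra field set."""
    def transform(record):
        updated = dict(record)
        updated[name] = value
        return updated
    return transform


def run_pipeline(source, transformers, load):
    """Pass every extracted item through each transformer in order, then load it."""
    for item in extract(source):
        for transform in transformers:
            item = transform(item)
        load(item)


# Hypothetical input; a plain list plays the role of the target database.
source = io.StringIO("Ada,London\nLinus,Helsinki\n")
stored = []
run_pipeline(
    source,
    [csv_transformer(["name", "city"]), field_transformer("imported", True)],
    stored.append,
)
print(stored)
```

Each transformer receives the output of the previous one, which is why their order in the list matters: the CSV step has to run before any step that expects named fields.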

You can modify this pipeline, allowing the transformation and loading phases to run in parallel by setting the configuration variable "parallel" to true.

  {"parallel": true}

Installation

Beginning with version 2.0, OrientDB bundles the ETL module with the official release. Follow these steps to use the module:

  • Clone the repository on your computer, by executing:
    • git clone https://github.com/orientechnologies/orientdb-etl.git
  • Compile the module, by executing:
    • mvn clean install
  • Copy script/oetl.sh (or .bat under Windows) to $ORIENTDB_HOME/bin
  • Copy target/orientdb-etl-2.0-SNAPSHOT.jar to $ORIENTDB_HOME/lib

Usage

To use the ETL module, run the oetl.sh script with the configuration file given as an argument.

  $ $ORIENTDB_HOME/bin/oetl.sh config-dbpedia.json

NOTE: If you are importing data for use in a distributed database, you must set ridBag.embeddedToSbtreeBonsaiThreshold=Integer.MAX_VALUE for the ETL process. This avoids replication errors when the database is updated online.

Run-time Configuration

When you run the ETL module, you pass it a JSON configuration file. The module resolves the variables used in that file at run-time, as it starts up.

You can also define the values for these variables through command-line options. For example, you could set the database URL in the configuration file to ${databaseURL}, then pass the actual value on the command line:

  $ $ORIENTDB_HOME/bin/oetl.sh config-dbpedia.json \
      -databaseURL=plocal:/tmp/mydb

When the ETL module initializes, it pulls plocal:/tmp/mydb from the command line and substitutes it for ${databaseURL} in the configuration file.
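
The substitution step can be pictured with a short Python sketch. This is an assumption-laden illustration of placeholder resolution in general, not the module's actual code: parse_args and resolve are hypothetical helpers, and the "dbURL" key in the sample configuration text is assumed for the example.

```python
import re


def parse_args(argv):
    """Collect options of the form -name=value into a dict; ignore other arguments."""
    variables = {}
    for arg in argv:
        if arg.startswith("-") and "=" in arg:
            name, value = arg[1:].split("=", 1)
            variables[name] = value
    return variables


def resolve(text, variables):
    """Replace each ${name} placeholder with its value; leave unknown names untouched."""
    return re.sub(
        r"\$\{(\w+)\}",
        lambda match: variables.get(match.group(1), match.group(0)),
        text,
    )


# Hypothetical configuration fragment; the "dbURL" key is an assumption.
config_text = '{"dbURL": "${databaseURL}"}'
variables = parse_args(["config-dbpedia.json", "-databaseURL=plocal:/tmp/mydb"])
resolved = resolve(config_text, variables)
print(resolved)
```

In this sketch a placeholder with no matching option is left as written rather than replaced with an empty string, so a missing argument stays visible in the resolved text.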

Available Components

Examples: