DataX

Overview

DataX task type is used to execute DataX synchronization jobs. For DataX task nodes, the worker executes ${DATAX_HOME}/bin/datax.py to parse and run the input json configuration file.
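In other words, when the task runs, the worker generates a json job file from the node configuration and launches it with the DataX launcher script, roughly equivalent to the command below. The job file path shown is illustrative; the worker produces the actual file at runtime.

```shell
# Illustrative only: the worker builds the json job file itself at runtime.
python ${DATAX_HOME}/bin/datax.py /tmp/execution/datax_job.json
```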

Create Task

  • Click Project Management -> Project Name -> Workflow Definition, and click the Create Workflow button to enter the DAG editing page.
  • Drag the DataX task node from the toolbar onto the canvas.

Task Parameters

  • Node name: The node name within a workflow definition must be unique.
  • Run flag: Indicates whether the node is scheduled normally; if the node does not need to execute, select prohibition execution.
  • Task priority: When the number of worker threads is insufficient, tasks execute in order of priority from high to low; tasks with the same priority execute in first-in, first-out order.
  • Description: Describes the function of the node.
  • Worker group: Assigns the task to the machines of the selected worker group. If Default is selected, a worker machine is chosen at random.
  • Environment Name: Configures the environment in which the script runs.
  • Number of failed retries: The number of times a failed task is resubmitted.
  • Failed retry interval: The interval (in minutes) between resubmissions of a failed task.
  • CPU quota: Assigns the specified CPU time quota to the task, as a percentage. The default -1 means unlimited. For example, the full CPU load of one core is 100%, and that of 16 cores is 1600%. This function is controlled by task.resource.limit.state.
  • Max memory: Assigns the specified maximum memory (in MB) to the task. Exceeding this limit triggers an OOM kill and the task is not automatically retried. The default -1 means unlimited. This function is controlled by task.resource.limit.state.
  • Delayed execution time: The time, in minutes, by which task execution is delayed.
  • Timeout alarm: Check the timeout alarm and timeout failure. When the task runs longer than the timeout period, an alarm email is sent and the task execution fails.
  • Custom template: Customize the content of the DataX node's json profile when the default data sources provided do not meet your requirements.
  • json: The json configuration file for DataX synchronization.
  • Custom parameters: Works the same way as in the SQL task type, where custom parameters set values for the statement in order. The custom parameter types and data types are the same as for the stored procedure task type; the difference is that SQL-type custom parameters replace the ${variable} placeholders in the SQL statement.
  • Data source: Select the data source from which the data is extracted.
  • SQL statement: The SQL statement used to extract data from the source database. The column names of the query are automatically parsed when the node executes and mapped to the column names of the target table. When source and target column names differ, they can be converted with column aliases.
  • Target library: Select the target library for data synchronization.
  • Pre-sql: SQL executed before the extraction SQL statement (executed by the target library).
  • Post-sql: SQL executed after the extraction SQL statement (executed by the target library).
  • Stream limit (number of bytes): Limits the number of bytes of the query.
  • Limit flow (number of records): Limits the number of records of the query.
  • Running memory: The minimum and maximum memory required can be configured to suit the actual production environment.
  • Predecessor task: Selecting a predecessor task sets it as upstream of the current task.

Task Example

This example demonstrates importing data from Hive into MySQL.

Configuring the DataX environment in DolphinScheduler

If you are using the DataX task type in a production environment, you must configure the required environment first, in the configuration file /dolphinscheduler/conf/env/dolphinscheduler_env.sh.
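For example, entries similar to the following could be added to dolphinscheduler_env.sh. The installation paths /opt/soft/python and /opt/soft/datax are assumptions and should be replaced with the actual locations in your environment.

```shell
# Assumed installation paths; adjust to your environment.
export PYTHON_HOME=/opt/soft/python
export DATAX_HOME=/opt/soft/datax
export PATH=$PYTHON_HOME/bin:$DATAX_HOME/bin:$PATH
```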

datax_task01

After the environment has been configured, DolphinScheduler needs to be restarted.

Configuring DataX Task Node

As the default data sources do not support reading data from Hive, a custom json is required; refer to the DataX HDFS Reader and MySQL Writer documentation. Note: partition directories exist on the HDFS path. When importing data in real-world situations, it is recommended to pass the partition as a custom parameter, as shown in the sketch below.
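Below is a minimal sketch of such a custom json using the DataX hdfsreader and mysqlwriter plugins. The defaultFS address, HDFS path, field delimiter, column layout, JDBC URL, table name, and credentials are all illustrative assumptions; the partition directory is referenced through a custom parameter ${dt} instead of being hard-coded.

```json
{
  "job": {
    "setting": {
      "speed": { "channel": 1 }
    },
    "content": [
      {
        "reader": {
          "name": "hdfsreader",
          "parameter": {
            "defaultFS": "hdfs://localhost:8020",
            "path": "/user/hive/warehouse/demo.db/ods_user/dt=${dt}/*",
            "fileType": "text",
            "fieldDelimiter": "\t",
            "column": [
              { "index": 0, "type": "long" },
              { "index": 1, "type": "string" }
            ]
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "writeMode": "insert",
            "username": "root",
            "password": "******",
            "column": ["id", "name"],
            "connection": [
              {
                "jdbcUrl": "jdbc:mysql://localhost:3306/demo",
                "table": ["ods_user"]
              }
            ]
          }
        }
      }
    ]
  }
}
```

In the node's custom parameters, a parameter named dt can then be defined (for example with a built-in time expression such as $[yyyyMMdd]) so that the partition directory is resolved at scheduling time.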

After writing the required json file, you can configure the node content by following the steps in the diagram below.

datax_task02

View run results

datax_task03

Note

If the default data sources provided do not meet your needs, enable the custom template option and configure the DataX reader and writer for your actual environment; the available plugins are documented at https://github.com/alibaba/DataX.