Pipelines

Data Prepper Pipeline

To use Data Prepper, you define pipelines in a configuration YAML file. Each pipeline is a combination of a source, a buffer, zero or more processors, and one or more sinks. For example:

    simple-sample-pipeline:
      workers: 2 # the number of workers
      delay: 5000 # in milliseconds, how long workers wait between read attempts
      source:
        random:
      buffer:
        bounded_blocking:
          buffer_size: 1024 # max number of records the buffer accepts
          batch_size: 256 # max number of records the buffer drains after each read
      processor:
        - string_converter:
            upper_case: true
      sink:
        - stdout:

  • Sources define where your data comes from. In this case, the source is a random UUID generator (random).

  • Buffers store data as it passes through the pipeline.

    By default, Data Prepper uses its one and only buffer, the bounded_blocking buffer, so you can omit this section unless you developed a custom buffer or need to tune the buffer settings.

  • Processors perform some action on your data: filter, transform, enrich, etc.

    You can have multiple processors, which run sequentially from top to bottom, not in parallel. In this case, the string_converter processor transforms strings by making them uppercase.

  • Sinks define where your data goes. In this case, the sink is stdout.
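
Because the bounded_blocking buffer is the default, the sample pipeline above can also be written without a buffer section. The following is a minimal sketch of that shorter form; it reuses only the plugins and settings from the sample, and the buffer then runs with whatever defaults your Data Prepper version ships with rather than the 1024/256 values shown earlier.

```yaml
# Sketch: the sample pipeline with the buffer section omitted.
# Data Prepper falls back to its default bounded_blocking buffer.
simple-sample-pipeline:
  workers: 2
  delay: 5000
  source:
    random:
  processor:
    # Processors run in the order listed, top to bottom.
    - string_converter:
        upper_case: true
  sink:
    - stdout:
```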

Examples

This section provides pipeline examples that you can use as a starting point for your own pipelines. For more information, see the Data Prepper configuration reference guide.

The Data Prepper repository has several sample applications to help you get started.

Log ingestion pipeline

The following example demonstrates how to use the HTTP source and Grok processor plugins to process unstructured log data.

    log-pipeline:
      source:
        http:
          ssl: false
      processor:
        - grok:
            match:
              log: [ "%{COMMONAPACHELOG}" ]
      sink:
        - opensearch:
            hosts: [ "https://opensearch:9200" ]
            insecure: true
            username: admin
            password: admin
            index: apache_logs

This example uses weak security. We strongly recommend securing all plugins which open external ports in production environments.
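
As a rough illustration of what a more secure variant of the log ingestion pipeline might look like, the sketch below enables TLS and HTTP basic authentication on the HTTP source. The option names (ssl_certificate_file, ssl_key_file, authentication with http_basic) reflect our understanding of the HTTP source plugin and should be checked against the configuration reference for your Data Prepper version. The file paths and credentials are hypothetical placeholders.

```yaml
# Sketch only: verify option names against the HTTP source reference
# for your Data Prepper version before using this configuration.
log-pipeline:
  source:
    http:
      ssl: true
      # Hypothetical paths; point these at your own certificate and key.
      ssl_certificate_file: "/full/path/to/certificate.crt"
      ssl_key_file: "/full/path/to/private-key.key"
      authentication:
        http_basic:
          # Placeholder credentials; replace with your own.
          username: "my-user"
          password: "my-strong-password"
  processor:
    - grok:
        match:
          log: [ "%{COMMONAPACHELOG}" ]
  sink:
    - opensearch:
        hosts: [ "https://opensearch:9200" ]
        # Replace the default admin credentials in production.
        username: admin
        password: admin
        index: apache_logs
```

Unlike the example above, this sketch also drops insecure: true from the sink, so the OpenSearch certificate is verified.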

Trace analytics pipeline

The following example demonstrates how to build a pipeline that supports the Trace Analytics OpenSearch Dashboards plugin. This pipeline takes data from the OpenTelemetry Collector and uses two other pipelines as sinks. These two pipelines index the trace documents and the service map documents for the Dashboards plugin.

Classic

This pipeline definition will be deprecated in 2.0. We recommend using the Event record type pipeline definition instead.

    entry-pipeline:
      delay: "100"
      source:
        otel_trace_source:
          ssl: false
      sink:
        - pipeline:
            name: "raw-pipeline"
        - pipeline:
            name: "service-map-pipeline"
    raw-pipeline:
      source:
        pipeline:
          name: "entry-pipeline"
      processor:
        - otel_trace_raw_prepper:
      sink:
        - opensearch:
            hosts: ["https://localhost:9200"]
            insecure: true
            username: admin
            password: admin
            index_type: trace-analytics-raw
    service-map-pipeline:
      delay: "100"
      source:
        pipeline:
          name: "entry-pipeline"
      processor:
        - service_map_stateful:
      sink:
        - opensearch:
            hosts: ["https://localhost:9200"]
            insecure: true
            username: admin
            password: admin
            index_type: trace-analytics-service-map

Event record type

Starting with version 1.4, Data Prepper supports the event record type in the source, buffer, and processors of trace analytics pipelines.

    entry-pipeline:
      delay: "100"
      source:
        otel_trace_source:
          ssl: false
          record_type: event
      buffer:
        bounded_blocking:
          buffer_size: 10240
          batch_size: 160
      sink:
        - pipeline:
            name: "raw-pipeline"
        - pipeline:
            name: "service-map-pipeline"
    raw-pipeline:
      source:
        pipeline:
          name: "entry-pipeline"
      buffer:
        bounded_blocking:
          buffer_size: 10240
          batch_size: 160
      processor:
        - otel_trace_raw:
      sink:
        - opensearch:
            hosts: ["https://localhost:9200"]
            insecure: true
            username: admin
            password: admin
            index_type: trace-analytics-raw
    service-map-pipeline:
      delay: "100"
      source:
        pipeline:
          name: "entry-pipeline"
      buffer:
        bounded_blocking:
          buffer_size: 10240
          batch_size: 160
      processor:
        - service_map_stateful:
      sink:
        - opensearch:
            hosts: ["https://localhost:9200"]
            insecure: true
            username: admin
            password: admin
            index_type: trace-analytics-service-map

To maintain ingestion throughput and latency similar to the Classic pipeline, we recommend scaling buffer_size and batch_size by the estimated maximum batch size in the client request payload.
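
As a worked illustration of this scaling, the sketch below assumes that the Classic pipeline ran with a buffer_size of 512 and a batch_size of 8, and that each client request carries at most 20 spans. Both starting values and the factor of 20 are assumptions chosen for illustration; substitute the values from your own deployment.

```yaml
# Sketch of the scaling arithmetic (all inputs are assumed values):
#   assumed Classic buffer_size: 512
#   assumed Classic batch_size:  8
#   assumed max spans per client request: 20
buffer:
  bounded_blocking:
    buffer_size: 10240 # 512 * 20
    batch_size: 160    # 8 * 20
```

Under these assumptions, the result matches the buffer settings used in the Event record type example above.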

Metrics pipeline

Data Prepper supports metrics ingestion using OTel. It currently supports the following metric types:

  • Gauge
  • Sum
  • Summary
  • Histogram

Other types are not supported. Data Prepper drops all other types, including Exponential Histogram. Additionally, Data Prepper does not support Scope instrumentation.

To set up a metrics pipeline:

    metrics-pipeline:
      source:
        otel_metrics_source:
      processor:
        - otel_metrics_raw_processor:
      sink:
        - opensearch:
            hosts: ["https://localhost:9200"]
            username: admin
            password: admin

S3 log ingestion pipeline

The following example demonstrates how to use the S3 Source and Grok Processor plugins to process unstructured log data from Amazon Simple Storage Service (Amazon S3). This example uses Application Load Balancer logs. As the Application Load Balancer writes logs to S3, S3 creates notifications in Amazon SQS. Data Prepper reads those notifications, then reads the corresponding S3 objects to retrieve and process the log data.

    log-pipeline:
      source:
        s3:
          notification_type: "sqs"
          compression: "gzip"
          codec:
            newline:
          sqs:
            queue_url: "https://sqs.us-east-1.amazonaws.com/12345678910/ApplicationLoadBalancer"
          aws:
            region: "us-east-1"
            sts_role_arn: "arn:aws:iam::12345678910:role/Data-Prepper"
      processor:
        - grok:
            match:
              message: ["%{DATA:type} %{TIMESTAMP_ISO8601:time} %{DATA:elb} %{DATA:client} %{DATA:target} %{BASE10NUM:request_processing_time} %{DATA:target_processing_time} %{BASE10NUM:response_processing_time} %{BASE10NUM:elb_status_code} %{DATA:target_status_code} %{BASE10NUM:received_bytes} %{BASE10NUM:sent_bytes} \"%{DATA:request}\" \"%{DATA:user_agent}\" %{DATA:ssl_cipher} %{DATA:ssl_protocol} %{DATA:target_group_arn} \"%{DATA:trace_id}\" \"%{DATA:domain_name}\" \"%{DATA:chosen_cert_arn}\" %{DATA:matched_rule_priority} %{TIMESTAMP_ISO8601:request_creation_time} \"%{DATA:actions_executed}\" \"%{DATA:redirect_url}\" \"%{DATA:error_reason}\" \"%{DATA:target_list}\" \"%{DATA:target_status_code_list}\" \"%{DATA:classification}\" \"%{DATA:classification_reason}"]
        - grok:
            match:
              request: ["(%{NOTSPACE:http_method})? (%{NOTSPACE:http_uri})? (%{NOTSPACE:http_version})?"]
        - grok:
            match:
              http_uri: ["(%{WORD:protocol})?(://)?(%{IPORHOST:domain})?(:)?(%{INT:http_port})?(%{GREEDYDATA:request_uri})?"]
        - date:
            from_time_received: true
            destination: "@timestamp"
      sink:
        - opensearch:
            hosts: [ "https://localhost:9200" ]
            username: "admin"
            password: "admin"
            index: alb_logs

Migrating from Logstash

Data Prepper supports Logstash configuration files for a limited set of plugins. To use one, mount your Logstash configuration file in place of the pipeline configuration when you run Data Prepper:

    docker run --name data-prepper \
      -v /full/path/to/logstash.conf:/usr/share/data-prepper/pipelines.conf \
      opensearchproject/opensearch-data-prepper:latest

This feature is limited by Data Prepper's feature parity with Logstash. As of the Data Prepper 1.2 release, the following plugins from the Logstash configuration are supported:

  • HTTP Input plugin
  • Grok Filter plugin
  • Elasticsearch Output plugin
  • Amazon Elasticsearch Output plugin

Configure the Data Prepper server

Data Prepper itself provides administrative HTTP endpoints, such as /list to list pipelines and /metrics/prometheus to provide Prometheus-compatible metrics data. The port that serves these endpoints has a TLS configuration and is specified in a separate YAML file. By default, the Data Prepper Docker images secure these endpoints. We strongly recommend providing your own configuration file to secure production environments. Here is an example data-prepper-config.yaml:

    ssl: true
    keyStoreFilePath: "/usr/share/data-prepper/keystore.jks"
    keyStorePassword: "password"
    privateKeyPassword: "other_password"
    serverPort: 1234

To configure the Data Prepper server, run Data Prepper with the additional YAML file mounted alongside the pipeline configuration:

    docker run --name data-prepper \
      -v /full/path/to/pipelines.yaml:/usr/share/data-prepper/pipelines.yaml \
      -v /full/path/to/data-prepper-config.yaml:/usr/share/data-prepper/data-prepper-config.yaml \
      opensearchproject/data-prepper:latest