Amazon AWS Kinesis Streams Connector

The Kinesis connector provides access to Amazon AWS Kinesis Streams.

To use the connector, add the following Maven dependency to your project:

  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kinesis_2.11</artifactId>
    <version>1.9.0</version>
  </dependency>

The flink-connector-kinesis_2.11 artifact has a dependency on code licensed under the Amazon Software License (ASL). Linking to flink-connector-kinesis will include ASL-licensed code in your application.

The flink-connector-kinesis_2.11 artifact is not deployed to Maven central as part of Flink releases because of this licensing issue. Therefore, you need to build the connector yourself from source.

Download the Flink source or check it out from the git repository. Then, use the following Maven command to build the module:

  mvn clean install -Pinclude-kinesis -DskipTests
  # In Maven 3.3 the shading of flink-dist doesn't work properly in one run, so we need to run mvn for flink-dist again.
  cd flink-dist
  mvn clean install -Pinclude-kinesis -DskipTests

Attention For Flink versions 1.4.2 and below, the KPL client version used by default in the Kinesis connectors, KPL 0.12.5, is no longer supported by AWS Kinesis Streams (see here). This means that when building the Kinesis connector, you will need to specify a higher KPL client version (0.12.6 or above) in order for the Flink Kinesis Producer to work. You can do this by specifying the preferred version via the aws.kinesis-kpl.version property, like so:

  mvn clean install -Pinclude-kinesis -Daws.kinesis-kpl.version=0.12.6 -DskipTests

The streaming connectors are not part of the binary distribution. See how to link with them for cluster execution here.

Using the Amazon Kinesis Streams Service

Follow the instructions from the Amazon Kinesis Streams Developer Guide to set up Kinesis streams. Make sure to create the appropriate IAM policy and user to read / write to the Kinesis streams.

Kinesis Consumer

The FlinkKinesisConsumer is an exactly-once parallel streaming data source that subscribes to multiple AWS Kinesis streams within the same AWS service region, and can transparently handle resharding of streams while the job is running. Each subtask of the consumer is responsible for fetching data records from multiple Kinesis shards. The number of shards fetched by each subtask will change as shards are closed and created by Kinesis.

Before consuming data from Kinesis streams, make sure that all streams are created with the status “ACTIVE” in the AWS dashboard.

Java:

  Properties consumerConfig = new Properties();
  consumerConfig.put(AWSConfigConstants.AWS_REGION, "us-east-1");
  consumerConfig.put(AWSConfigConstants.AWS_ACCESS_KEY_ID, "aws_access_key_id");
  consumerConfig.put(AWSConfigConstants.AWS_SECRET_ACCESS_KEY, "aws_secret_access_key");
  consumerConfig.put(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST");

  StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

  DataStream<String> kinesis = env.addSource(new FlinkKinesisConsumer<>(
      "kinesis_stream_name", new SimpleStringSchema(), consumerConfig));

Scala:

  val consumerConfig = new Properties()
  consumerConfig.put(AWSConfigConstants.AWS_REGION, "us-east-1")
  consumerConfig.put(AWSConfigConstants.AWS_ACCESS_KEY_ID, "aws_access_key_id")
  consumerConfig.put(AWSConfigConstants.AWS_SECRET_ACCESS_KEY, "aws_secret_access_key")
  consumerConfig.put(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST")

  val env = StreamExecutionEnvironment.getExecutionEnvironment

  val kinesis = env.addSource(new FlinkKinesisConsumer[String](
      "kinesis_stream_name", new SimpleStringSchema, consumerConfig))

The above is a simple example of using the consumer. Configuration for the consumer is supplied with a java.util.Properties instance, the configuration keys for which can be found in AWSConfigConstants (AWS-specific parameters) and ConsumerConfigConstants (Kinesis consumer parameters). The example demonstrates consuming a single Kinesis stream in the AWS region "us-east-1". The AWS credentials are supplied using the basic method, in which the AWS access key ID and secret access key are directly supplied in the configuration (other options are setting AWSConfigConstants.AWS_CREDENTIALS_PROVIDER to ENV_VAR, SYS_PROP, PROFILE, ASSUME_ROLE, or AUTO). Also, data is consumed from the newest position in the Kinesis stream (the other option is to set ConsumerConfigConstants.STREAM_INITIAL_POSITION to TRIM_HORIZON, which lets the consumer start reading the Kinesis stream from the earliest record possible).
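
For instance, rather than embedding the access key and secret in the configuration, the consumer can pick up credentials automatically. A minimal sketch, assuming credentials are discoverable via the AWS SDK's default provider chain (environment variables, system properties, profile file, etc.):

  Properties consumerConfig = new Properties();
  consumerConfig.put(AWSConfigConstants.AWS_REGION, "us-east-1");
  // Let the AWS SDK's default chain locate the credentials instead of
  // supplying the access key ID and secret access key directly:
  consumerConfig.put(AWSConfigConstants.AWS_CREDENTIALS_PROVIDER, "AUTO");
  // Start from the earliest record still retained by Kinesis:
  consumerConfig.put(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "TRIM_HORIZON");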

Other optional configuration keys for the consumer can be found in ConsumerConfigConstants.

Note that the configured parallelism of the Flink Kinesis Consumer source can be completely independent of the total number of shards in the Kinesis streams. When the number of shards is larger than the parallelism of the consumer, each consumer subtask can subscribe to multiple shards; otherwise, if the number of shards is smaller than the parallelism of the consumer, some consumer subtasks will simply be idle and wait until they are assigned new shards (i.e., when the streams are resharded to increase the number of shards for higher provisioned Kinesis service throughput).
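
For example, the source parallelism can be set explicitly, regardless of the shard count (a minimal sketch reusing env and consumerConfig from the earlier example):

  DataStream<String> kinesis = env
      .addSource(new FlinkKinesisConsumer<>(
          "kinesis_stream_name", new SimpleStringSchema(), consumerConfig))
      .setParallelism(10); // may be more or fewer than the number of shards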

Also note that the assignment of shards to subtasks may not be optimal when shard IDs are not consecutive (as a result of dynamic re-sharding in Kinesis). For cases where skew in the assignment leads to significantly imbalanced consumption, a custom implementation of KinesisShardAssigner can be set on the consumer, as sketched below.
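
A minimal sketch of such an assigner, which hashes the shard ID to spread non-consecutive shards more evenly across subtasks (the hashing strategy here is illustrative, not a recommendation):

  FlinkKinesisConsumer<String> consumer = new FlinkKinesisConsumer<>(
      "kinesis_stream_name", new SimpleStringSchema(), consumerConfig);
  // KinesisShardAssigner maps a shard handle to a subtask index:
  consumer.setShardAssigner((shard, numParallelSubtasks) ->
      Math.abs(shard.getShard().getShardId().hashCode() % numParallelSubtasks));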

Configuring Starting Position

The Flink Kinesis Consumer currently provides the following options to configure where to start reading Kinesis streams, simply by setting ConsumerConfigConstants.STREAM_INITIAL_POSITION to one of the following values in the provided configuration properties (the option names follow exactly the names used by the AWS Kinesis Streams service):

  • LATEST: read all shards of all streams starting from the latest record.
  • TRIM_HORIZON: read all shards of all streams starting from the earliest record possible (data may be trimmed by Kinesis depending on the retention settings).
  • AT_TIMESTAMP: read all shards of all streams starting from a specified timestamp. The timestamp must also be specified in the configuration properties by providing a value for ConsumerConfigConstants.STREAM_INITIAL_TIMESTAMP, in one of the following date patterns (see the sketch after this list):
    • a non-negative double value representing the number of seconds that has elapsed since the Unix epoch (for example, 1459799926.480).
    • a user-defined pattern, which must be a valid pattern for SimpleDateFormat, provided via ConsumerConfigConstants.STREAM_TIMESTAMP_DATE_FORMAT. If ConsumerConfigConstants.STREAM_TIMESTAMP_DATE_FORMAT is not defined, the default pattern is yyyy-MM-dd'T'HH:mm:ss.SSSXXX (for example, a timestamp value of 2016-04-04 with a user-provided pattern of yyyy-MM-dd, or a timestamp value of 2016-04-04T19:58:46.480-00:00 without a pattern).
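
A minimal sketch of starting from a point in time (reusing consumerConfig from the earlier example; the timestamp values are illustrative):

  consumerConfig.put(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "AT_TIMESTAMP");
  // Either seconds since the Unix epoch ...
  consumerConfig.put(ConsumerConfigConstants.STREAM_INITIAL_TIMESTAMP, "1459799926.480");
  // ... or a date string matching the default pattern yyyy-MM-dd'T'HH:mm:ss.SSSXXX:
  // consumerConfig.put(ConsumerConfigConstants.STREAM_INITIAL_TIMESTAMP, "2016-04-04T19:58:46.480-00:00");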

Fault Tolerance for Exactly-Once User-Defined State Update Semantics

With Flink’s checkpointing enabled, the Flink Kinesis Consumer will consume records from shards in Kinesis streams and periodically checkpoint each shard’s progress. In case of a job failure, Flink will restore the streaming program to the state of the latest complete checkpoint and re-consume the records from Kinesis shards, starting from the progress that was stored in the checkpoint.

The interval of drawing checkpoints therefore defines how much the program may have to go back at most, in case of a failure.

To use fault tolerant Kinesis Consumers, checkpointing of the topology needs to be enabled at the execution environment:

Java:

  final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
  env.enableCheckpointing(5000); // checkpoint every 5000 msecs

Scala:

  val env = StreamExecutionEnvironment.getExecutionEnvironment()
  env.enableCheckpointing(5000) // checkpoint every 5000 msecs

Also note that Flink can only restart the topology if enough processing slots are available. Therefore, if the topology fails due to loss of a TaskManager, there must still be enough slots available afterwards. Flink on YARN supports automatic restart of lost YARN containers.

Event Time for Consumed Records

Java:

  final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

Scala:

  val env = StreamExecutionEnvironment.getExecutionEnvironment()
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

If streaming topologies choose to use the event time notion for record timestamps, an approximate arrival timestamp will be used by default. This timestamp is attached to records by Kinesis once they were successfully received and stored by streams. Note that this timestamp is typically referred to as a Kinesis server-side timestamp, and there are no guarantees about its accuracy or order correctness (i.e., the timestamps may not always be ascending).

Users can choose to override this default with a custom timestamp, as described here, or use one of the predefined ones. After doing so, it can be passed to the consumer in the following way:

Java:

  DataStream<String> kinesis = env.addSource(new FlinkKinesisConsumer<>(
      "kinesis_stream_name", new SimpleStringSchema(), kinesisConsumerConfig));
  kinesis = kinesis.assignTimestampsAndWatermarks(new CustomTimestampAssigner());

Scala:

  val kinesis = env.addSource(new FlinkKinesisConsumer[String](
      "kinesis_stream_name", new SimpleStringSchema, kinesisConsumerConfig))
  val kinesisWithTimestamps = kinesis.assignTimestampsAndWatermarks(new CustomTimestampAssigner)
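
The CustomTimestampAssigner above is not part of the connector; a hypothetical implementation might extend the predefined BoundedOutOfOrdernessTimestampExtractor, assuming each record carries its own timestamp (the payload format here is an assumption for illustration):

  // A periodic watermark assigner that reads an epoch-millis timestamp
  // from the front of each record, tolerating 10 seconds of out-of-orderness.
  public class CustomTimestampAssigner extends BoundedOutOfOrdernessTimestampExtractor<String> {

      public CustomTimestampAssigner() {
          super(Time.seconds(10));
      }

      @Override
      public long extractTimestamp(String element) {
          // assumption: each record is "<epochMillis>,<payload>"
          return Long.parseLong(element.substring(0, element.indexOf(',')));
      }
  }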

Threading Model

The Flink Kinesis Consumer uses multiple threads for shard discovery and data consumption.

For shard discovery, each parallel consumer subtask will have a single thread that constantly queries Kinesis for shard information, even if the subtask initially did not have shards to read from when the consumer was started. In other words, if the consumer is run with a parallelism of 10, there will be a total of 10 threads constantly querying Kinesis regardless of the total number of shards in the subscribed streams.

For data consumption, a single thread will be created to consume each discovered shard. A thread terminates when the shard it is responsible for consuming is closed as a result of stream resharding. In other words, there will always be one thread per open shard.

Internally Used Kinesis APIs

The Flink Kinesis Consumer uses the AWS Java SDK internally to call Kinesis APIs for shard discovery and data consumption. Due to Amazon’s service limits for Kinesis Streams on these APIs, the consumer will be competing with other non-Flink consuming applications that the user may be running. Below is a list of APIs called by the consumer, with descriptions of how the consumer uses each API, as well as information on how to deal with any errors or warnings that the Flink Kinesis Consumer may have due to these service limits.

  • DescribeStream: this is constantly called by a single thread in each parallel consumer subtask to discover any new shards as a result of stream resharding. By default, the consumer performs the shard discovery at an interval of 10 seconds, and will retry indefinitely until it gets a result from Kinesis. If this interferes with other non-Flink consuming applications, users can slow down the rate at which the consumer calls this API by setting a value for ConsumerConfigConstants.SHARD_DISCOVERY_INTERVAL_MILLIS in the supplied configuration properties, which sets the discovery interval to a different value. Note that this setting directly impacts the maximum delay between a new shard being created and the consumer starting to consume it, as shards will not be discovered during the interval.

  • GetShardIterator: this is called only once when per-shard consuming threads are started, and will retry if Kinesis complains that the transaction limit for the API has been exceeded, up to a default of 3 attempts. Note that since the rate limit for this API is per shard (not per stream), the consumer itself should not exceed the limit. Usually, if this happens, users can either try to slow down any other non-Flink consuming applications that call this API, or modify the retry behaviour of this API call in the consumer by setting keys prefixed by ConsumerConfigConstants.SHARD_GETITERATOR_* in the supplied configuration properties.

  • GetRecords: this is constantly called by per-shard consuming threads to fetch records from Kinesis. When a shard has multiple concurrent consumers (when there are any other non-Flink consuming applications running), the per-shard rate limit may be exceeded. By default, on each call of this API, the consumer will retry if Kinesis complains that the data size / transaction limit for the API has been exceeded, up to a default of 3 attempts. Users can either try to slow down other non-Flink consuming applications, or adjust the throughput of the consumer by setting the ConsumerConfigConstants.SHARD_GETRECORDS_MAX and ConsumerConfigConstants.SHARD_GETRECORDS_INTERVAL_MILLIS keys in the supplied configuration properties; see the sketch after this list. Setting the former adjusts the maximum number of records each consuming thread tries to fetch from shards on each call (default is 10,000), while the latter modifies the sleep interval between each fetch (default is 200 milliseconds). The retry behaviour of the consumer when calling this API can also be modified by using the other keys prefixed by ConsumerConfigConstants.SHARD_GETRECORDS_*.
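
A minimal sketch of tuning the consumer’s API usage via the keys discussed above (the values are illustrative, not recommendations):

  // Discover new shards every 30 seconds instead of the default 10:
  consumerConfig.put(ConsumerConfigConstants.SHARD_DISCOVERY_INTERVAL_MILLIS, "30000");
  // Fetch at most 5,000 records per GetRecords call (default 10,000):
  consumerConfig.put(ConsumerConfigConstants.SHARD_GETRECORDS_MAX, "5000");
  // Sleep 500 ms between fetches (default 200 ms):
  consumerConfig.put(ConsumerConfigConstants.SHARD_GETRECORDS_INTERVAL_MILLIS, "500");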

Kinesis Producer

The FlinkKinesisProducer uses the Kinesis Producer Library (KPL) to put data from a Flink stream into a Kinesis stream.

Note that the producer does not participate in Flink’s checkpointing and doesn’t provide exactly-once processing guarantees. Also, the Kinesis producer does not guarantee that records are written in order to the shards (see here and here for more details).

In case of a failure or a resharding, data will be written again to Kinesis, leading to duplicates. This behavior is usually called “at-least-once” semantics.

To put data into a Kinesis stream, make sure the stream is marked as “ACTIVE” in the AWS dashboard.

For the monitoring to work, the user accessing the stream needs access to the CloudWatch service.

Java:

  Properties producerConfig = new Properties();
  // Required configs
  producerConfig.put(AWSConfigConstants.AWS_REGION, "us-east-1");
  producerConfig.put(AWSConfigConstants.AWS_ACCESS_KEY_ID, "aws_access_key_id");
  producerConfig.put(AWSConfigConstants.AWS_SECRET_ACCESS_KEY, "aws_secret_access_key");
  // Optional configs
  producerConfig.put("AggregationMaxCount", "4294967295");
  producerConfig.put("CollectionMaxCount", "1000");
  producerConfig.put("RecordTtl", "30000");
  producerConfig.put("RequestTimeout", "6000");
  producerConfig.put("ThreadPoolSize", "15");
  // Disable Aggregation if it's not supported by a consumer
  // producerConfig.put("AggregationEnabled", "false");
  // Switch KinesisProducer's threading model
  // producerConfig.put("ThreadingModel", "PER_REQUEST");

  FlinkKinesisProducer<String> kinesis = new FlinkKinesisProducer<>(new SimpleStringSchema(), producerConfig);
  kinesis.setFailOnError(true);
  kinesis.setDefaultStream("kinesis_stream_name");
  kinesis.setDefaultPartition("0");

  DataStream<String> simpleStringStream = ...;
  simpleStringStream.addSink(kinesis);

Scala:

  val producerConfig = new Properties()
  // Required configs
  producerConfig.put(AWSConfigConstants.AWS_REGION, "us-east-1")
  producerConfig.put(AWSConfigConstants.AWS_ACCESS_KEY_ID, "aws_access_key_id")
  producerConfig.put(AWSConfigConstants.AWS_SECRET_ACCESS_KEY, "aws_secret_access_key")
  // Optional KPL configs
  producerConfig.put("AggregationMaxCount", "4294967295")
  producerConfig.put("CollectionMaxCount", "1000")
  producerConfig.put("RecordTtl", "30000")
  producerConfig.put("RequestTimeout", "6000")
  producerConfig.put("ThreadPoolSize", "15")
  // Disable Aggregation if it's not supported by a consumer
  // producerConfig.put("AggregationEnabled", "false")
  // Switch KinesisProducer's threading model
  // producerConfig.put("ThreadingModel", "PER_REQUEST")

  val kinesis = new FlinkKinesisProducer[String](new SimpleStringSchema, producerConfig)
  kinesis.setFailOnError(true)
  kinesis.setDefaultStream("kinesis_stream_name")
  kinesis.setDefaultPartition("0")

  val simpleStringStream = ...
  simpleStringStream.addSink(kinesis)

The above is a simple example of using the producer. To initialize FlinkKinesisProducer, users are required to pass in AWS_REGION, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY via a java.util.Properties instance. Users can also pass in KPL configurations as optional parameters to customize the KPL underlying the FlinkKinesisProducer. The full list of KPL configs and explanations can be found here. The example demonstrates producing to a single Kinesis stream in the AWS region “us-east-1”.

If users don’t specify any KPL configs and values, FlinkKinesisProducer will use the default config values of KPL, except for RateLimit. RateLimit limits the maximum allowed put rate for a shard, as a percentage of the backend limits. KPL’s default value is 150, but it makes KPL throw RateLimitExceededException too frequently and breaks the Flink sink as a result. Thus, FlinkKinesisProducer overrides the KPL default with 100.
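
If a different cap is needed, it can be set explicitly like any other KPL config (a sketch; the value is illustrative):

  // Override the rate limit that FlinkKinesisProducer applies by default (100),
  // expressed as a percentage of the per-shard backend limit:
  producerConfig.put("RateLimit", "80");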

Instead of a SerializationSchema, it also supports a KinesisSerializationSchema. The KinesisSerializationSchema allows sending the data to multiple streams. This is done via the KinesisSerializationSchema.getTargetStream(T element) method. Returning null there will instruct the producer to write the element to the default stream; otherwise, the returned stream name is used.
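
A minimal sketch of such a schema, routing elements that start with "error" to a hypothetical side stream named error_stream (both the routing rule and the stream name are assumptions for illustration):

  KinesisSerializationSchema<String> schema = new KinesisSerializationSchema<String>() {

      @Override
      public ByteBuffer serialize(String element) {
          return ByteBuffer.wrap(element.getBytes(StandardCharsets.UTF_8));
      }

      @Override
      public String getTargetStream(String element) {
          // null means "write to the default stream"
          return element.startsWith("error") ? "error_stream" : null;
      }
  };

  FlinkKinesisProducer<String> kinesis = new FlinkKinesisProducer<>(schema, producerConfig);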

Threading Model

Since Flink 1.4.0, FlinkKinesisProducer switches its default underlying KPL from a one-thread-per-request mode to a thread-pool mode. KPL in thread-pool mode uses a queue and thread pool to execute requests to Kinesis. This limits the number of threads that KPL’s native process may create, and therefore greatly lowers CPU utilization and improves efficiency. Thus, we highly recommend that Flink users use the thread-pool mode. The default thread pool size is 10. Users can set the pool size in the java.util.Properties instance with the key ThreadPoolSize, as shown in the above example.

Users can still switch back to the one-thread-per-request mode by setting a key-value pair of ThreadingModel and PER_REQUEST in the java.util.Properties instance, as shown in the commented-out code in the above example.

Backpressure

By default, FlinkKinesisProducer does not apply backpressure. Instead, records that cannot be sent because of the rate restriction of 1 MB per second per shard are buffered in an unbounded queue and dropped when their RecordTtl expires.

To avoid data loss, you can enable backpressure by restricting the size of the internal queue:

  // 200 bytes per record, 1 shard
  kinesis.setQueueLimit(500);

The value for queueLimit depends on the expected record size. To choose a good value, consider that Kinesis is rate-limited to 1 MB per second per shard. If less than one second’s worth of records is buffered, then the queue may not be able to operate at full capacity. With the default RecordMaxBufferedTime of 100 ms, a queue size of 100 kB per shard should be sufficient. The queueLimit can then be computed via

  queue limit = (number of shards * queue size per shard) / record size

For example, with 200 bytes per record and 8 shards, a queue limit of 4000 is a good starting point. If the queue size limits throughput (below 1 MB per second per shard), try increasing the queue limit slightly.
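
As a worked check, plugging those numbers into the formula above:

  queue limit = (8 shards * 100 kB per shard) / 200 bytes per record
              = 800,000 bytes / 200 bytes
              = 4000 records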

Using Non-AWS Kinesis Endpoints for Testing

It is sometimes desirable to have Flink operate as a consumer or producer against a non-AWS Kinesis endpoint such as Kinesalite; this is especially useful when performing functional testing of a Flink application. The AWS endpoint that would normally be inferred from the AWS region set in the Flink configuration must be overridden via a configuration property.

To override the AWS endpoint, taking the producer for example, set the AWSConfigConstants.AWS_ENDPOINT property in the Flink configuration, in addition to the AWSConfigConstants.AWS_REGION required by Flink. Although the region is required, it will not be used to determine the AWS endpoint URL.

The following example shows how one might supply the AWSConfigConstants.AWS_ENDPOINT configuration property:

Java:

  Properties producerConfig = new Properties();
  producerConfig.put(AWSConfigConstants.AWS_REGION, "us-east-1");
  producerConfig.put(AWSConfigConstants.AWS_ACCESS_KEY_ID, "aws_access_key_id");
  producerConfig.put(AWSConfigConstants.AWS_SECRET_ACCESS_KEY, "aws_secret_access_key");
  producerConfig.put(AWSConfigConstants.AWS_ENDPOINT, "http://localhost:4567");

Scala:

  val producerConfig = new Properties()
  producerConfig.put(AWSConfigConstants.AWS_REGION, "us-east-1")
  producerConfig.put(AWSConfigConstants.AWS_ACCESS_KEY_ID, "aws_access_key_id")
  producerConfig.put(AWSConfigConstants.AWS_SECRET_ACCESS_KEY, "aws_secret_access_key")
  producerConfig.put(AWSConfigConstants.AWS_ENDPOINT, "http://localhost:4567")