Introduction

Overview

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.

The use of Apache Flume is not restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data, including but not limited to network traffic data, social-media-generated data, email messages, and virtually any other data source.

Apache Flume is a top level project at the Apache Software Foundation.

System Requirements

  • Java Runtime Environment - Java 1.8 or later
  • Memory - Sufficient memory for configurations used by sources, channels or sinks
  • Disk Space - Sufficient disk space for configurations used by channels or sinks
  • Directory Permissions - Read/Write permissions for directories used by agent

Architecture

Data flow model

A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).
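
As a sketch, the Java snippet below constructs such an event with Flume's EventBuilder helper; the header keys and the payload are made up for the example.

    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.flume.Event;
    import org.apache.flume.event.EventBuilder;

    public class EventExample {
        public static void main(String[] args) {
            // Optional string attributes (headers); keys and values are illustrative
            Map<String, String> headers = new HashMap<String, String>();
            headers.put("host", "web01.example.com");

            // The payload is an opaque byte array
            Event event = EventBuilder.withBody(
                    "GET /index.html HTTP/1.1".getBytes(StandardCharsets.UTF_8),
                    headers);

            System.out.println(new String(event.getBody(), StandardCharsets.UTF_8));
        }
    }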

Agent component diagram

A Flume source consumes events delivered to it by an external source like a web server. The external source sends events to Flume in a format that is recognized by the target Flume source. For example, an Avro Flume source can be used to receive Avro events from Avro clients or from other Flume agents in the flow that send events from an Avro sink. A similar flow can be defined using a Thrift Flume source to receive events from a Thrift sink, a Flume Thrift RPC client, or Thrift clients written in any language generated from the Flume Thrift protocol. When a Flume source receives an event, it stores it into one or more channels. The channel is a passive store that keeps the event until it is consumed by a Flume sink. The file channel is one example: it is backed by the local filesystem. The sink removes the event from the channel and puts it into an external repository like HDFS (via the Flume HDFS sink) or forwards it to the Flume source of the next Flume agent (next hop) in the flow. The source and sink within a given agent run asynchronously with the events staged in the channel.
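
As a concrete sketch, the properties-file configuration below wires one agent with a single source, channel, and sink; the agent and component names (a1, r1, c1, k1) and the port are arbitrary choices for the example.

    # example.conf: a single-node agent
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: accepts newline-terminated text on a TCP port
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444
    a1.sources.r1.channels = c1

    # Channel: the passive store between source and sink
    a1.channels.c1.type = memory

    # Sink: writes each event to the agent's log
    a1.sinks.k1.type = logger
    a1.sinks.k1.channel = c1

An agent defined this way would be started with something like flume-ng agent --conf conf --conf-file example.conf --name a1.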

Complex flows

Flume allows a user to build multi-hop flows where events travel through multiple agents before reaching the final destination. It also allows fan-in and fan-out flows, contextual routing and backup routes (fail-over) for failed hops.
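
A minimal sketch of one such multi-hop flow, assuming two agents named web and collector and an arbitrary port 4545: the first agent's Avro sink forwards events to the second agent's Avro source.

    # Hop 1, agent "web": tail an application log and forward over Avro
    web.sources = r1
    web.channels = c1
    web.sinks = k1
    web.sources.r1.type = exec
    web.sources.r1.command = tail -F /var/log/app.log
    web.sources.r1.channels = c1
    web.channels.c1.type = memory
    web.sinks.k1.type = avro
    web.sinks.k1.hostname = collector.example.com
    web.sinks.k1.port = 4545
    web.sinks.k1.channel = c1

    # Hop 2, agent "collector": receive the Avro events and log them
    collector.sources = r2
    collector.channels = c2
    collector.sinks = k2
    collector.sources.r2.type = avro
    collector.sources.r2.bind = 0.0.0.0
    collector.sources.r2.port = 4545
    collector.sources.r2.channels = c2
    collector.channels.c2.type = memory
    collector.sinks.k2.type = logger
    collector.sinks.k2.channel = c2

Fan-out is configured the same way, by listing more than one channel on a source; fan-in simply means several agents point their sinks at the same downstream source.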

Reliability

The events are staged in a channel on each agent. The events are then delivered to the next agent or terminal repository (like HDFS) in the flow. The events are removed from a channel only after they are stored in the channel of the next agent or in the terminal repository. This is how the single-hop message delivery semantics in Flume provide end-to-end reliability of the flow.

Flume uses a transactional approach to guarantee the reliable delivery of the events. The sources and sinks encapsulate, in transactions provided by the channel, the storage and retrieval, respectively, of the events. This ensures that the set of events is reliably passed from point to point in the flow. In the case of a multi-hop flow, the sink of the previous hop and the source of the next hop both have their transactions running to ensure that the data is safely stored in the channel of the next hop.
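
In code terms, a sink drains the channel with roughly the pattern sketched below; deliver() is a hypothetical stand-in for whatever external write the sink performs, and sources follow the same pattern with channel.put() in place of channel.take().

    import org.apache.flume.Channel;
    import org.apache.flume.Event;
    import org.apache.flume.Transaction;

    public class SinkTransactionSketch {
        // Take one event from the channel inside a channel transaction
        static void drainOne(Channel channel) {
            Transaction tx = channel.getTransaction();
            tx.begin();
            try {
                Event event = channel.take();  // stage removal of one event
                if (event != null) {
                    deliver(event);            // hypothetical write to the external store
                }
                tx.commit();                   // the event is removed only on commit
            } catch (Exception e) {
                tx.rollback();                 // on failure the event stays for retry
                throw new RuntimeException(e);
            } finally {
                tx.close();
            }
        }

        // Hypothetical stand-in for e.g. an HDFS write
        static void deliver(Event event) { }
    }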

Recoverability

The events are staged in the channel, which manages recovery from failure. Flume supports a durable file channel, which is backed by the local file system. There is also a memory channel that simply stores the events in an in-memory queue; this is faster, but any events still left in the memory channel when an agent process dies cannot be recovered.
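
For example, a durable file channel could be configured as below; the directory paths are illustrative, and the memory channel alternative needs little more than its type.

    # Durable file channel: checkpoints and queue data live on disk,
    # so staged events survive an agent restart (paths are illustrative)
    a1.channels.c1.type = file
    a1.channels.c1.checkpointDir = /var/flume/checkpoint
    a1.channels.c1.dataDirs = /var/flume/data

    # Memory channel: faster, but events are lost if the process dies
    a1.channels.c2.type = memory
    a1.channels.c2.capacity = 1000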