Topology Design Considerations

Flume is very flexible and allows a large range of possible deployment scenarios. If you plan to use Flume in a large, production deployment, it is prudent to spend some time thinking about how to express your problem in terms of a Flume topology. This section covers a few considerations.

Is Flume a good fit for your problem?

If you need to ingest textual log data into Hadoop/HDFS then Flume is the right fit for your problem, full stop. For other use cases, here are some guidelines:

Flume is designed to transport and ingest regularly-generated event data over relatively stable, potentially complex topologies. The notion of "event data" is very broadly defined. To Flume, an event is just a generic blob of bytes. There are some limitations on how large an event can be - for instance, it cannot be larger than what you can store in memory or on disk on a single machine - but in practice, Flume events can be everything from textual log entries to image files. The key property of events is that they are generated in a continuous, streaming fashion. If your data is not regularly generated (i.e. you are trying to do a single bulk load of data into a Hadoop cluster) then Flume will still work, but it is probably overkill for your situation.

Flume likes relatively stable topologies. Your topologies do not need to be immutable, because Flume can deal with changes in topology without losing data and can also tolerate periodic reconfiguration due to fail-over or provisioning. It probably won't work well if you plan to change topologies every day, because reconfiguration takes some thought and overhead.

Flow reliability in Flume

The reliability of a Flume flow depends on several factors. By adjusting these factors, you can achieve a wide array of reliability options with Flume:

What type of channels you use. Flume has both durable channels (those which will persist data to disk) and non-durable channels (those which will lose data if a machine fails). Durable channels use disk-based storage, and data stored in such channels will persist across machine restarts or non-disk-related failures.

Whether your channels are sufficiently provisioned for the workload. Channels in Flume act as buffers at various hops. These buffers have a fixed capacity, and once that capacity is full you will create back pressure on earlier points in the flow. If this pressure propagates to the source of the flow, Flume will become unavailable and may lose data.

Whether you use redundant topologies. Flume lets you replicate flows across redundant topologies. This can provide a very easy source of fault tolerance, one which overcomes both disk and machine failures, as the configuration sketch below illustrates.
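
A minimal agent configuration touching on these factors might look like the sketch below. The agent name, channel names, capacities, and directory paths are illustrative assumptions, not recommendations; the property names themselves are standard Flume channel and selector settings.

    # Durable file channel: events persist to disk and survive agent
    # restarts (checkpoint and data directories are illustrative paths).
    agent1.channels.ch1.type = file
    agent1.channels.ch1.checkpointDir = /var/flume/checkpoint
    agent1.channels.ch1.dataDirs = /var/flume/data
    agent1.channels.ch1.capacity = 1000000

    # Non-durable memory channel: faster, but events are lost if the
    # process or machine fails. Capacity bounds the buffer; once it is
    # full, puts fail and back pressure builds upstream.
    agent1.channels.ch2.type = memory
    agent1.channels.ch2.capacity = 10000
    agent1.channels.ch2.transactionCapacity = 1000

    # Replicating selector: copy every event from the source into both
    # channels, each drained by its own sink, for a simple redundant flow.
    agent1.sources.src1.selector.type = replicating
    agent1.sources.src1.channels = ch1 ch2

With each channel drained by a sink pointed at a different downstream agent, a single disk or machine failure downstream need not stop the flow.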

The best way to think about reliability in a Flume topology is to consider various failure scenarios and their outcomes. What happens if a disk fails? What happens if a machine fails? What happens if your terminal sink (e.g. HDFS) goes down for some time and you have back pressure? The space of possible designs is huge, but the underlying questions you need to ask are just a handful.

Flume topology design

The first step in designing a Flume topology is to enumerate all sources and destinations (terminal sinks) for your data. These will define the edge points of your topology. The next consideration is whether to introduce intermediate aggregation tiers or event routing. If you are collecting data from a large number of sources, it can be helpful to aggregate the data in order to simplify ingestion at the terminal sink. An aggregation tier can also smooth out burstiness from sources or unavailability at sinks, by acting as a buffer. If you are routing data between different locations, you may also want to split flows at various points: this creates sub-topologies which may themselves include aggregation points.
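
As one concrete illustration, an aggregation tier is commonly built by pointing the Avro sinks of many edge agents at the Avro source of a collector agent, and routing can be expressed with a multiplexing channel selector. In the sketch below, the agent names (edge, collector), hostname, port, header name, and header values are hypothetical; the property names are standard Flume settings.

    # Edge agent: forward local events to the aggregation tier.
    edge.sinks.k1.type = avro
    edge.sinks.k1.hostname = collector01.example.com
    edge.sinks.k1.port = 4141
    edge.sinks.k1.channel = ch1

    # Collector agent: accept events from many edge agents on one Avro source.
    collector.sources.r1.type = avro
    collector.sources.r1.bind = 0.0.0.0
    collector.sources.r1.port = 4141
    collector.sources.r1.channels = ch1 ch2

    # Split the flow by routing on a header value (the header name and
    # its values are assumptions for illustration).
    collector.sources.r1.selector.type = multiplexing
    collector.sources.r1.selector.header = datacenter
    collector.sources.r1.selector.mapping.us-east = ch1
    collector.sources.r1.selector.default = ch2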

Sizing a Flume deployment

Once you have an idea of what your topology will look like, the next question is how much hardware and networking capacity is needed. This starts by quantifying how much data you generate. That is not always a simple task! Most data streams are bursty (for instance, due to diurnal patterns) and potentially unpredictable. A good starting point is to think about the maximum throughput you'll have in each tier of the topology, both in terms of events per second and bytes per second. Once you know the required throughput of a given tier, you can calculate a lower bound on how many nodes you require for that tier. To determine attainable throughput, it's best to experiment with Flume on your hardware, using synthetic or sampled event data. In general, disk-based channels should get 10's of MB/s and memory-based channels should get 100's of MB/s or more. Performance will vary widely, however, depending on hardware and operating environment.
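
As a worked example with purely illustrative numbers: if a tier must absorb a peak of 20,000 events per second at an average of 500 bytes per event, that is about 10 MB/s of required throughput. If experiments on your hardware show that a single agent with a disk-based channel sustains roughly 30 MB/s, then one node meets the throughput lower bound for that tier, and the final node count will instead be driven by the redundancy and burst-absorption concerns discussed next.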

Sizing aggregate throughput gives you a lower bound on the number of nodes you will need for each tier. There are several reasons to have additional nodes, such as increased redundancy and better ability to absorb bursts in load.