Section 1: Introduction to Apache Spark

In this section, we demonstrate how simple it is to analyze web logs using
Apache Spark. We’ll show how to load a Resilient Distributed Dataset
(RDD) of access log lines and use Spark transformations and actions to compute
some statistics for web server monitoring. In the process, we’ll introduce
the Spark SQL and Spark Streaming libraries.

The code snippets in this explanation are in Java 8. Sample code in
Java 6, Scala, and Python is also included in this directory. Each of
those folders contains a README with instructions for building and running
the examples, along with the build files that declare the required dependencies.

This section covers the following topics:

  • First Log Analyzer in Spark - A first standalone Spark application for log analysis.
  • Spark SQL - This example computes the same statistics as the first example, but uses SQL syntax instead of Spark transformations and actions.
  • Spark Streaming - This example covers how to calculate log statistics using the streaming library.
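
As a rough preview of the transformation-and-action style used throughout, the sketch below computes two simple statistics over a handful of access log lines. It is an analogy only: it uses plain Java 8 streams (`java.util.stream`) rather than the Spark API, so it runs without any Spark dependency. In both models, operations such as `filter` and `map` are lazy and only describe the computation, while an operation such as `count` triggers it. The class name `LogStatsPreview`, the helper methods, and the sample log lines are all hypothetical and are not taken from the sample code in this directory.

```java
import java.util.Arrays;
import java.util.List;
import java.util.LongSummaryStatistics;

// Hypothetical preview class: plain Java 8 streams, not the Spark API.
final class LogStatsPreview {

    // Made-up sample lines in Apache common log format (assumption).
    static final List<String> LINES = Arrays.asList(
        "127.0.0.1 - - [21/Jul/2014:09:00:00 -0700] \"GET /index.html HTTP/1.1\" 200 1024",
        "127.0.0.1 - - [21/Jul/2014:09:00:01 -0700] \"GET /missing HTTP/1.1\" 404 512",
        "10.0.0.2 - - [21/Jul/2014:09:00:02 -0700] \"GET /logo.png HTTP/1.1\" 200 2048");

    // The response code is the second-to-last space-separated field.
    static int responseCode(String line) {
        String[] fields = line.split(" ");
        return Integer.parseInt(fields[fields.length - 2]);
    }

    // The content size in bytes is the last space-separated field.
    static long contentSize(String line) {
        String[] fields = line.split(" ");
        return Long.parseLong(fields[fields.length - 1]);
    }

    // mapToLong is lazy (like a transformation);
    // summaryStatistics forces evaluation (like an action).
    static LongSummaryStatistics contentSizeStats(List<String> lines) {
        return lines.stream()
                    .mapToLong(LogStatsPreview::contentSize)
                    .summaryStatistics();
    }

    // filter is lazy; count forces evaluation.
    static long countWithCode(List<String> lines, int code) {
        return lines.stream()
                    .filter(line -> responseCode(line) == code)
                    .count();
    }

    public static void main(String[] args) {
        LongSummaryStatistics stats = contentSizeStats(LINES);
        System.out.println("Content size min: " + stats.getMin()
            + ", max: " + stats.getMax());
        System.out.println("Responses with code 200: " + countWithCode(LINES, 200));
    }
}
```

The splitting on single spaces is a deliberate simplification that only works for well-formed lines like the samples above; real access logs generally call for a proper parser.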