Exercise 5: Summarize Data Using HDInsight Spark

Duration: 20 mins

Synopsis: In this exercise, attendees will prepare a summary of flight delay data in HDFS using Spark SQL.

Task 1: Summarize Delays by Airport

  1. Navigate to the blade for your Spark cluster in the Azure Portal. An easy way to find your Spark cluster is to navigate to the resource group you created during the workshop setup using the Resource Group item on the leftmost navigation area in the portal. Once you select your resource group you will see all of the resources that were created for the workshop (including the Spark cluster).
  2. In the Quick Links section, click Cluster Dashboards.

    Screenshot

  3. From the Cluster Dashboards , click Jupyter Notebooks.

    Screenshot

  4. A new browser tab should open and you will be prompted for credentials. Log in with the following:

    • User name: cortana
    • Password: Password.1!!
  5. On the Jupyter notebooks screen, click on the New dropdown list from top right corner and click Spark.

    Screenshot

  6. Copy the below text and paste it into the Jupyter notebook.

    Screenshot

    1. import sqlContext.implicits._
    2. val flightDelayTextLines = sc.textFile("wasb:///Scored_FlightsAndWeather.csv")
    3. case class AirportFlightDelays(OriginAirportCode:String,OriginLatLong:String,Month:Integer,Day:Integer,Hour:Integer,Carrier:String,DelayPredicted:Integer,DelayProbability:Double)
    4. val flightDelayRowsWithoutHeader = flightDelayTextLines.map(s => s.split(",")).filter(line => line(0) != "OriginAirportCode")
    5. val resultDataFrame = flightDelayRowsWithoutHeader.map(
    6. s => AirportFlightDelays(
    7. s(0), //Airport code
    8. s(13) + "," + s(14), //Lat,Long
    9. s(1).toInt, //Month
    10. s(2).toInt, //Day
    11. s(3).toInt, //Hour
    12. s(5), //Carrier
    13. s(11).toInt, //DelayPredicted
    14. s(12).toDouble //DelayProbability
    15. )
    16. ).toDF()
    17. resultDataFrame.write.mode("overwrite").saveAsTable("FlightDelays")
  7. Click the Play icon at the top of the screen to execute this code and create the FlightDelays table. You will know a command is executing by the asterisk to the left of the box. Once the command has completed, the asterisk will be replaced with a number.

    Screenshot

    Screenshot

  8. Once the previous command has completed, click in the empty paragraph below the paragraph in which you entered your Scala script. In this paragraph, you are going to author a SQL query to view the results of the table you just created. In order to switch from running Scala code, to running SQL, your first line in the paragraph must start with %%. As before, click the Play icon at the top of the screen to run the command.

    1. %%sql
    2. SELECT * FROM FlightDelays
  9. Click on the Table button.

  10. You should see the results appear in a table form similar to the following:

    Screenshot

  11. Next, you will create a table that summarizes the flight delays data. Instead of containing one row per flight, this new summary table will contain one row per origin airport at a given hour along with a count of the quantity of anticipated delays.

  12. In a new paragraph below, try running the following query.

    1. %%sql
    2. SELECT OriginAirportCode, OriginLatLong, Month, Day, Hour, Sum(DelayPredicted) NumDelays, Avg(DelayProbability) AvgDelayProbability
    3. FROM FlightDelays
    4. WHERE Month = 4
    5. GROUP BY OriginAirportCode, OriginLatLong, Month, Day, Hour
    6. Having Sum(DelayPredicted) > 1
  13. Click the Play icon in the top of the screen to execute this code.

    Screenshot

  14. This query should return a table that appears similar to the following:

    Screenshot

  15. Since the summary data looks good, the final step is save this summary calculation as a table that we can later query using Power BI in the next exercise.

  16. To accomplish creating the table, enter a new paragraph and add the following Scala code and run it.

    1. val summary = sqlContext.sql("SELECT OriginAirportCode, OriginLatLong, Month, Day, Hour, Sum(DelayPredicted) NumDelays, Avg(DelayProbability) AvgDelayProbability FROM FlightDelays WHERE Month = 4 GROUP BY OriginAirportCode, OriginLatLong, Month, Day, Hour Having Sum(DelayPredicted) > 1")
    2. summary.write.mode("overwrite").saveAsTable("FlightDelaysSummary")
  17. Click the Play icon in the top of the screen to execute this code.

    Screenshot

  18. To verify your table was successfully created, go to another new paragraph and enter and run the following query.

    1. %%sql
    2. SELECT * FROM FlightDelaysSummary
  19. Click the Play icon in the top of the screen to execute this code.

    Screenshot

  20. You should see results similar to the following:

    Screenshot

  21. You can also click on the other buttons like Pie , Scatter , Line , Area , and Bar to view these visualizations.

Next Exercise: Exercise 6 - Visualizing in Power BI Desktop