Tutorial: Big Data Analytics using Apache Spark with Smart City Applications

29 Nov 2017
09:00 - 12:00
Seminar Room 1

Big data analytics examines large volumes of data to uncover hidden patterns and other insights that matter to businesses, healthcare, government, retail, and smart cities. It can also reduce costs, improve performance and decision-making, and help discover new products and services based on user feedback and preferences. Apache Spark is a leading platform for big data analytics. Its in-memory processing handles large-scale data faster than other data processing platforms such as Hadoop.
In this tutorial, you will learn how to develop big data analytics applications using Apache Spark. Two main smart city examples illustrate the concepts: graph computing for shortest paths in a road network, and Twitter data analysis for event detection. We will use GraphX, MLlib, and Spark SQL to build these applications. The emphasis is on teaching big data management concepts and practice using the Apache Spark platform.

Level: Basic to Intermediate

Knowledge of and skills in programming, databases, and Linux would be an advantage. Familiarity with parallel processing and cluster computing is useful but not required.


  • Apache Spark: An overview
    • Spark Stack
    • Execution Flow of Spark
    • Resilient Distributed Dataset (RDD) concept
    • Spark operations
    • Cluster Deployment
    • Working with Spark: Interactive and Standalone
  • Data Processing:
    • RDDs
    • Spark SQL
  • Data Analytics:
    • MLlib
    • GraphX
  • Building Spark applications:
    • Case 1: Shortest path problem of US road networks
      • Graph computing: US road networks
    • Case 2: Analyzing Twitter data to detect events
      • Event detection using Twitter data analysis
  • Performance tuning of Apache Spark:
    • Cluster resource utilization
  • Advanced features of Apache Spark:
    • Accumulators
    • Broadcast variables
    • Memory persistence levels
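
As a taste of Case 1, the sketch below shows the vertex-centric ("Pregel-style") idea behind shortest-path computation on a graph: in each superstep, vertices whose distance just improved send candidate distances along their outgoing edges, and each receiving vertex keeps the minimum. This is plain Python for illustration only, not the GraphX API and not the tutorial's own code. The function name pregel_sssp, the four-node road network, and its edge weights are hypothetical, and non-negative edge weights are assumed.

```python
import math


def pregel_sssp(edges, source):
    """Single-source shortest paths via vertex-centric supersteps.

    edges: list of (src, dst, weight) directed road segments
           (weights assumed non-negative).
    Returns a dict mapping each vertex to its distance from source.
    """
    vertices = {v for s, d, _ in edges for v in (s, d)} | {source}
    dist = {v: math.inf for v in vertices}
    dist[source] = 0.0
    active = {source}  # vertices whose distance changed in the last superstep

    while active:
        # Send phase: every active vertex offers a candidate distance
        # to each neighbour; messages to the same vertex merge by minimum.
        inbox = {}
        for s, d, w in edges:
            if s in active:
                candidate = dist[s] + w
                if candidate < inbox.get(d, math.inf):
                    inbox[d] = candidate

        # Vertex phase: a vertex updates (and stays active) only if the
        # merged message improves on its current distance.
        active = set()
        for v, candidate in inbox.items():
            if candidate < dist[v]:
                dist[v] = candidate
                active.add(v)

    return dist


# Hypothetical toy road network: (from, to, length).
ROADS = [
    ("A", "B", 4.0),
    ("A", "C", 1.0),
    ("C", "B", 2.0),
    ("B", "D", 5.0),
    ("C", "D", 8.0),
]

distances = pregel_sssp(ROADS, "A")
for node in sorted(distances):
    print(node, distances[node])
```

On a cluster, the same send/merge/update loop would run in parallel over partitions of a much larger road graph, which is the kind of workload the GraphX part of the tutorial addresses.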