Syllabus

  1. Introduction
  2. Sampling & Streaming
    • Motivations
    • Techniques for sampling large datasets (e.g., Poisson & Bernoulli sampling, reservoir sampling)
    • Learning from continuous data streams
      • Linear models (e.g., Naive Bayes, Perceptrons)
      • Non-linear models (e.g., Very Fast Decision Trees, SVMs, Neural Nets)
  3. Hashing
    • Hash indexes
    • Hashing for exact matches (e.g., equi-join)
    • Hashing for similarity detection (e.g., locality-sensitive hashing, LSH)
    • Hashing to handle an unknown number of features (e.g., count-min sketch), if time permits
  4. Parallel & distributed computing
    • Computing architectures: shared memory architectures, clusters, grids
    • Parallel computing paradigms: multi-threading, message passing, MapReduce
    • Performance evaluation
    • Which configuration for which problem?
  5. Hands-on with Hadoop
    • Analysis of GPS tracking data with Hadoop
    • Sampling versus distributed computing
    • Every participant gets a Hadoop cluster!
  6. Discussion

Disclaimer: the syllabus is subject to change without notice.

The distributed computing infrastructure for the hands-on exercises with Hadoop is provided by the Flemish Supercomputer Centre.