Syllabus

Introduction
Sampling & Streaming
- Motivations
- Techniques for sampling large datasets (e.g., Poisson & Bernoulli sampling, reservoir sampling)
- Learning from continuous data streams
  - Linear models (e.g., Naive Bayes, Perceptrons)
  - Non-linear models (e.g., Very Fast Decision Trees, SVMs, Neural Nets)
Hashing
- Hash indexes
- Hashing for exact matches (e.g., equi-join)
- Hashing for similarity detection (e.g., LSH)
- Hashing to handle an unknown number of features (e.g., count-min sketch) (if time permits)
Parallel & distributed computing
- Computing architectures: shared memory architectures, clusters, grids
- Parallel computing paradigms: multi-threading, message passing, MapReduce
- Performance evaluation
- Which configuration for which problem?
Hands-on with Hadoop
- Analysis of GPS tracking data with Hadoop
- Sampling versus distributed computing
- Every participant gets a Hadoop cluster!
Discussion

Disclaimer: the syllabus is subject to change without notice.

The distributed computing infrastructure for the hands-on exercises with Hadoop is provided by the Flemish Supercomputer Centre.