- Introduction
- Sampling & Streaming
  - Motivations
  - Techniques for sampling large datasets (e.g., Poisson & Bernoulli sampling, reservoir sampling)
  - Learning from continuous data streams
  - Linear models (e.g., Naive Bayes, Perceptrons)
  - Non-linear models (e.g., Very Fast Decision Trees, SVMs, Neural Nets)
- Hashing
  - Hash indexes
  - Hashing for exact matches (e.g., equi-join)
  - Hashing for similarity detection (e.g., LSH)
  - Hashing to handle an unknown number of features (e.g., count-min sketch), if time permits
- Parallel & distributed computing
  - Computing architectures: shared memory architectures, clusters, grids
  - Parallel computing paradigms: multi-threading, message passing, MapReduce
  - Performance evaluation
  - Which configuration for which problem?
- Hands-on with Hadoop
  - Analysis of GPS tracking data with Hadoop
  - Sampling versus distributed computing
  - Every participant gets a Hadoop cluster!
  - Discussion
Disclaimer: the syllabus is subject to change without notice.
The distributed computing infrastructure for the hands-on exercises with Hadoop is provided by the Flemish Supercomputer Centre.