GEO5240 - High Performance Computing

Course Description

What does ‘big data’ mean for the applied healthcare data scientist? Do I need to know Spark, Hadoop, or similar tools? When are these tools useful, and when should I try to implement them? These questions are the underlying motivation for this course. In today’s market, data scientists are expected to have a good understanding of how to handle ‘big data’ in the data science workflow. This includes knowing how to perform ETL (extract, transform, load) on large datasets, as well as how to implement inferential and predictive models on them. Implicit in this is a solid understanding of the hardware and software used for these tasks.

In this course, we will cover a variety of topics central to working with large datasets or performing computationally expensive calculations on smaller datasets. We will use R and Python as the languages of choice. We presume all students have a solid foundation in both languages from prior coursework; the basics of these languages will not be taught. We also assume students have a solid understanding of data management, inferential models such as multivariate linear models, and machine learning techniques in general.

Learner Outcomes

Upon completion of the course, students will be able to:
Use R and data.table to efficiently analyze large in-memory datasets
Use Python and Dask to analyze larger-than-memory data on a workstation
Profile and optimize code using R or Python
Understand the history of Hadoop, Apache Spark, and clustered computing in the context of data science
Perform data management with Apache Spark and H2O
Understand statistical models, loss functions, and optimization in the context of learning algorithms
Identify when meta-heuristic techniques are required to solve a non-standard optimization problem
Implement genetic algorithms for NP-Hard problems
Understand implications of serial versus parallel calculations
Compare the performance of classification, regression, and clustering algorithms in R, Python, Apache Spark, and H2O
Identify the best-performing optimization algorithms based on data dimensionality and the problem at hand
Differentiate between the neural network architectures commonly used in healthcare data analysis
Implement feed-forward artificial neural networks using TensorFlow
Implement recurrent neural networks using TensorFlow
Implement convolutional neural networks using TensorFlow

Prerequisites

Introductory understanding of computer programming and statistical concepts.

Duration

30 Hours | 5 Days or 10 Nights
