
Course Description

What does ‘big data’ mean for the applied healthcare data scientist? Do I need to know Spark (or Hadoop, or other tools)? When are these tools useful, and when should I try to implement them? These questions are the underlying motivation for this course. In today’s market, data scientists are expected to have a good understanding of how to handle ‘big data’ in the data science workflow. This includes knowing how to do ETL on large datasets, as well as how to implement inferential and predictive models on them. Implicit in this is a solid understanding of the hardware and software used for these tasks.

In this course, we will cover a variety of topics central to working with large datasets or performing very expensive computations on smaller datasets. We will use R and Python as the languages of choice; I presume all students have a solid foundation in both from prior coursework (we will not teach the basics of these languages). I also assume students have a solid understanding of data management, inferential models such as multivariate linear models, and machine learning techniques in general.

Learner Outcomes

Upon completion of the course, students will be able to:
  • Use R and data.table to efficiently analyze large in-memory datasets
  • Use Python and Dask to analyze larger than memory data on a workstation
  • Profile and optimize code using R or Python
  • Understand the history of Hadoop, Apache Spark, and clustered computing in the context of data science
  • Perform data management with Apache Spark and H2O
  • Understand statistical models, loss functions, and optimization in the context of learning algorithms
  • Identify when meta-heuristic techniques are required to solve a non-standard optimization problem
  • Implement genetic algorithms for NP-Hard problems
  • Understand implications of serial versus parallel calculations
  • Compare the performance of classification, regression, and clustering algorithms in R, Python, Apache Spark, and H2O
  • Identify the best-performing optimization algorithms based on data dimensionality and the problem at hand
  • Differentiate among the neural network architectures commonly used in healthcare data analysis
  • Implement feed-forward artificial neural networks using TensorFlow
  • Implement recurrent neural networks using TensorFlow
  • Implement convolutional neural networks using TensorFlow

Prerequisites

Introductory understanding of computer programming and statistical concepts.

Duration

30 Hours | 5 Days or 10 Nights

Thank you for your interest in this course. Unfortunately, the course you have selected is currently not open for enrollment. Please complete a Course Inquiry or call 314-977-3226 so that we may promptly notify you when enrollment opens.
