Data Science – Intermediate




Course title: Data Science – Intermediate
Course number: 900-081-EQ
Platform: R
Duration: 21 hours
Prerequisites: Basic understanding of probability and algebra; basic data analysis and processing in Excel, Tableau or other database systems; basic linear regression.
Target audience: Data analysts; computer analysts; individuals in any role who deal with small or large amounts of data and need to model it to obtain predictions.
Dates: March 7, 12, 14, 19, 21, 26 and 28, 2018
Instructor: Diego Perea, Ph.D.
Room: BH-210
Schedule: Monday and Wednesday, 6:30 p.m. – 9:30 p.m.
Gouvernement du Québec fee: $42.00
General public fee: $338.03

Recommended textbook: An Introduction to Statistical Learning with Applications in R by G. James, D. Witten, T. Hastie and R. Tibshirani.

NB: A certificate is provided to all participants who complete at least 80% of the course hours.

Course Description
Please note that this is a non-credit course.
The descriptive analytic methods covered in previous big data courses form the basis for the predictive analytic methods at the heart of this new course. Participants will learn the standard statistical methods currently used in industry to perform predictive analytics. These include non-linear regression and several classification methods, such as logistic regression, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA) and k-nearest neighbors (KNN). Participants will learn how to investigate the available data and choose the best predictive method to apply. Key components of the course are understanding these methods, the methodology for evaluating them and the criteria for choosing the best one.

The course methodology is based on lectures led by the instructor, who will present the concepts using examples. Each lecture is followed by a lab using real data, where the participants will complete specific tasks in R designed to reinforce the concepts introduced in the lecture.

 

Topics Covered in this Course
  1. Introduction to R and RStudio
  2. Basic statistical analysis in R: histograms, box and scatter plots
  3. Continuous variables regression beyond linearity
  4. Logistic regression I: Bayes, Maximum Likelihood and Linear Discriminant Analysis (LDA) methods
  5. Logistic regression II: Quadratic Discriminant Analysis (QDA), K-Nearest Neighbors (KNN) and other classifiers
  6. Practical applications of classification and regression
  7. Connecting analytics to Big Data distributed processing systems
  8. Forecasting

 

Weekly Topics
Please note that the instructor reserves the right to modify this schedule.
Week 1 Topics 1 and 2
Introduction, course description and R overview.
Basic statistical analysis in R: histograms, box and scatter plots
Week 2 Topics 3 and 4
Continuous variables regression beyond linearity
Logistic regression I: Bayes, Maximum Likelihood and Linear Discriminant Analysis methods
Week 3 Topics 5 and 6
Logistic regression II: Quadratic Discriminant Analysis, K-Nearest Neighbors and other classifiers
Practical applications of classification and regression
Week 4 Topic 7
Forecasting and concluding remarks

 

Software to Be Used

For the course, we will mainly use R, which is widely used in industry for statistical learning and provides functions for most of the methods covered. Other software will be discussed during the course to give participants a broader view of statistical learning.

 

Labs and Datasets

In the labs, the participant will apply the prediction and classification methods seen in class using practical datasets. Among others, we will use the following datasets.

  1. Uber trip data: Trip information including Uber service type, source, destination, distance, duration and paid fare. Continuous regression methods are applied to estimate the trip fare based on trip distance, time and service demand. Example:

https://public.tableau.com/views/Lab4-DatacharacterizationA-categoricalfields/Dashboard2

  2. Advertisement data: Dataset containing the budget a company spent on advertising in different markets. Continuous variable regression methods help design the best advertising plan to maximize profit. Example:

https://public.tableau.com/shared/W5KC9CF5S

  3. Stock returns data: Dataset containing stock returns from previous years for different companies. Logistic regression methods are applied to predict whether a stock's return will go up or down and to devise the best investment strategy. Example:

https://public.tableau.com/shared/CTJTM9ZYN

  4. Automobile features data: Dataset containing car characteristics, used to develop a model that predicts whether a car gets high or low gas mileage. Example:

https://public.tableau.com/views/auto-mpgdatasetboxplots/Dashboard2

  5. Airline flights data: Dataset containing historical airline flight data. Forecasting methods are applied to predict the number of flights and revenue in upcoming months. Example:

https://public.tableau.com/views/Lab7B-AirPassengersForecasting/ForecastDashboard

In addition to the datasets above, participants are encouraged to bring their own data. The following websites are good sources of data for machine learning.

https://www.kaggle.com/

https://archive.ics.uci.edu/ml/datasets.html
