Data Science – Level 2

NEW!

Course Title: Data Science – Level 2
Course Number: 900-081-EQ
Platform: R
Duration: 21 hours
Prerequisites: Basic understanding of probability and algebra; basic data analysis and processing in Excel, Tableau or other database systems; basic linear regression (Data Science Level 1)
Target Audience: Data analysts; computer analysts; individuals in any role who work with small or large amounts of data and need to model it to obtain predictions
Dates: October 22, 24, 29, 31; November 5, 7, 12
Instructor: Diego Perea, Ph.D.
Room: BH-210
Schedule: Monday & Wednesday, 6:30 p.m. – 9:30 p.m.
Gouvernement du Québec fee: $42.00
General public fee: $394.42

NB: A certificate is provided to all participants who complete at least 80% of the course hours.

Recommended textbook
[1] An Introduction to Statistical Learning with Applications in R by G. James, D. Witten, T. Hastie and R. Tibshirani.

[2] “Foundations of Data Analysis, Part I and II”, University of Texas online courses. https://www.class-central.com/report/best-statistics-probability-courses-data-science/
Course Description
Please note that this is a non-credit course.
The descriptive analytic methods seen in previous big data courses form the basis for the predictive analytic methods that are at the heart of this new course. In this course, participants will learn the standard statistical methods currently used in industry to perform predictive analytics. These include non-linear regression and several classification methods such as logistic regression, LDA, QDA and KNN. Participants will learn how to examine the available data and choose the best predictive method to apply. Key components of this course are an understanding of these methods, the methodology used to evaluate them, and the criteria for choosing the best method.

The course methodology is based on lectures led by the instructor, who will present the concepts using examples. Each lecture is followed by a lab using real data, where the participants will complete specific tasks in R designed to reinforce the concepts introduced in the lecture.

Students will also formulate a small prediction project, which they will complete in the Data Science Level 3 course.
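
As a rough illustration of the evaluate-and-choose step mentioned in the description, the sketch below compares a K-Nearest Neighbors classifier at several values of k on held-out data and keeps the value with the highest accuracy. It is written in plain Python purely for illustration (the course labs use R), and the dataset, the feature names and the helper functions `knn_predict` and `holdout_accuracy` are all invented for this example rather than taken from the course material.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # `train` is a list of (features, label) pairs; distance is Euclidean.
    neighbors = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def holdout_accuracy(train, test, k):
    """Fraction of held-out points whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


# Invented toy data: (weight in tonnes, engine size in litres) -> mileage class.
# The point at (1.05, 1.3) is deliberately mislabeled to act as noise.
train = [
    ((0.9, 1.0), "high"), ((1.0, 1.2), "high"), ((1.1, 1.4), "high"),
    ((1.2, 1.6), "high"), ((1.05, 1.3), "low"),
    ((1.6, 2.5), "low"), ((1.8, 3.0), "low"),
    ((2.0, 3.5), "low"), ((2.2, 4.0), "low"),
]
test = [
    ((1.0, 1.3), "high"), ((1.9, 3.2), "low"),
    ((1.3, 1.8), "high"), ((1.7, 2.8), "low"),
]

# Evaluate each candidate k on the held-out set and keep the most accurate one.
scores = {k: holdout_accuracy(train, test, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this toy data the single mislabeled point misleads k = 1, while larger k values vote it down. With real data the held-out set would be a random split, or cross-validation would be used instead.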

 

Topics Covered in this Course
  1. Review of R and RStudio
  2. Basic statistical analysis in R: histograms, box and scatter plots
  3. Regression for continuous variables beyond the linear model
  4. Regression beyond linearity
  5. Maximum Likelihood and Logistic regression Classifier, K-Nearest Neighbors (KNN) Classifier
  6. Bayes, Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) methods
  7. Support Vector Machine Classifier and Practical applications of classification and regression
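
To give a flavour of topic 5, here is a minimal sketch of a one-variable logistic regression classifier fitted by maximum likelihood, using plain gradient ascent on the log-likelihood. It is written in Python for illustration only (in the course's R labs a built-in fitting function would normally be used), and the data, the learning rate, the step count and the function names `sigmoid` and `fit_logistic` are assumptions made for this example.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit p(y=1 | x) = sigmoid(b0 + b1*x) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals y - p drive the gradient of the mean log-likelihood.
        residuals = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        g0 = sum(residuals) / n
        g1 = sum(r * x for r, x in zip(residuals, xs)) / n
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1


# Invented toy data: a predictor value and a binary outcome (1 = "up", 0 = "down").
# The classes overlap on purpose, so the maximum-likelihood estimate is finite.
xs = [-2.0, -1.5, -1.0, -0.5, -0.3, 0.0, 0.3, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)


def predict(x):
    """Classify as 1 when the fitted probability exceeds 0.5."""
    return 1 if sigmoid(b0 + b1 * x) > 0.5 else 0


print(b0, b1, predict(-2.0), predict(2.0))
```

The fixed learning rate and step count are chosen for simplicity. Statistical software typically fits the same model with a faster iterative method, but the quantity being maximized is the same log-likelihood.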

 

Weekly Topics
Please note that the instructor reserves the right to modify this schedule.
Week 1: Topics 1 and 2
  Introduction, course description and R overview
  Basic statistical analysis in R: histograms, box and scatter plots
Week 2: Topics 3 and 4
  Continuous variables regression beyond linearity
  Logistic regression I: Bayes, Maximum Likelihood and Linear Discriminant Analysis methods
Week 3: Topics 5 and 6
  Supervised classification
Week 4: Topic 7
  Support Vector Machines and concluding remarks

 

SOFTWARE TO BE USED

For the course, we will mainly use R, which is widely used in industry for statistical learning and provides functions for most of the methods covered. Other software will also be discussed to give participants a holistic view of statistical learning.

 

LABS and DATASETS

In the labs, the participant will apply the prediction and classification methods seen in class using practical datasets. Among others, we will use the following datasets.

  1. Uber trip data: Trip information including Uber service type, source, destination, distance, duration and paid fare. Continuous regression methods are applied to estimate the trip fare based on trip distance, time and service demand. Example:

https://public.tableau.com/views/Lab4-DatacharacterizationA-categoricalfields/Dashboard2

  2. Advertisement data: Dataset containing the budget a company spent on advertisement in different markets. Continuous-variable regression methods help design the best advertisement plan to maximize profit. Example:

https://public.tableau.com/shared/W5KC9CF5S

  3. Stock returns data: Dataset containing stock returns from previous years for different companies. Logistic regression methods are applied to predict whether a stock's return will go up or down and to devise the best investment strategy. Example:

https://public.tableau.com/shared/CTJTM9ZYN

  4. Automobile features data: Dataset containing car characteristics, used to develop a model that predicts whether a car gets high or low gas mileage. Example:

https://public.tableau.com/views/auto-mpgdatasetboxplots/Dashboard2

  5. Airline flights data: Dataset containing historical airline flight data. Forecasting methods are applied to predict the number of flights and the revenue in upcoming months. Example:

https://public.tableau.com/views/Lab7B-AirPassengersForecasting/ForecastDashboard

In addition to the datasets above, participants are encouraged to bring their own data. The following websites are good sources of data for machine learning.

https://www.kaggle.com/

https://archive.ics.uci.edu/ml/datasets.html

The projects presented by the students in the previous term are showcased here:

https://public.tableau.com/StudentProjects/StudentProjects
