|Course Title||Data Science – Intermediate|
|Prerequisites||Basic understanding of probability and algebra. Basic data analysis and processing in Excel, Tableau or other database systems. Basic linear regression.|
|Target Audience||Data Analysts; Computer Analysts; Individuals in any role dealing with small or large amounts of data needing to model the data to obtain predictions.|
|Dates||March 7, 12, 14, 19, 21, 26 and 28, 2018|
|Instructor||Diego Perea – Ph.D.|
|Schedule||Monday & Wednesday 6:30 p.m. – 9:30 p.m.|
|Gouvernement du Québec fee||$42.00|
|General public fee||$338.03|
Recommended textbook: An Introduction to Statistical Learning with Applications in R by G. James, D. Witten, T. Hastie and R. Tibshirani.
NB: A certificate is provided to all participants who have completed 80% of the course hours.
|Please note that this is a non-credit course.|
|The descriptive analytic methods seen in previous big data courses form the basis for the predictive analytic methods that are at the heart of this new course. In this course, participants will learn the standard statistical methods currently used in industry to perform predictive analytics. These include non-linear regression and several classification methods such as logistic regression, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA) and K-nearest neighbors (KNN). Participants will learn how to examine the available data and choose the best predictive method to apply. The key components of this course are an understanding of these methods, the methodology used to evaluate them and the criteria for choosing the best one.
The course methodology is based on lectures led by the instructor, who will present the concepts using examples. Each lecture is followed by a lab using real data, where the participants will complete specific tasks in R designed to reinforce the concepts introduced in the lecture.
|Topics Covered in this Course|
Please note that the instructor reserves the right to modify this schedule.
|Week 1||Topics 1 and 2
Introduction, course description and R overview.
Basic statistical analysis in R: histograms, box and scatter plots
|Week 2||Topics 3 and 4
Continuous variables regression beyond linearity
Logistic regression I: Bayes, Maximum Likelihood and Linear Discriminant Analysis methods
|Week 3||Topics 5 and 6
Logistic regression I: Quadratic Discriminant Analysis, K-Nearest Neighbors and other classifiers
|Week 4||Topic 7
Forecasting and concluding remarks
SOFTWARE TO BE USED
For the course, we will mainly use R, which is widely used in industry for statistical learning and provides functions for most of the methods covered. Other software will be addressed in the course to give the participant a holistic view of statistical learning.
LABS and DATASETS
In the labs, the participant will apply the prediction and classification methods seen in class using practical datasets. Among others, we will use the following datasets.
- Uber trip data: Trip information including Uber service type, source, destination, distance, duration and paid fare. Continuous regression methods are applied to estimate the trip fare based on trip distance, time and service demand.
- Advertisement data: Dataset containing the budget spent on advertisement by a company in different markets. Continuous variable regression methods help design the best advertisement plan to maximize profit.
- Stock returns data: Dataset containing stock returns from previous years for different companies. Logistic regression methods are applied to predict whether a stock's return will go up or down and to devise the best investment strategy.
- Automobile features data: Dataset containing car characteristics, used to develop a model that predicts whether a car gets high or low gas mileage.
- Airline flights data: Dataset containing historical airline flight data. Forecasting methods are applied to predict the number of flights and revenue in upcoming months.
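To give a rough idea of the kind of task tackled in the automobile lab, the following is a minimal sketch of a K-nearest neighbors classifier that labels a car as "high" or "low" gas mileage. It is an illustration only: the labs themselves are done in R, Python is used here just to keep the example self-contained, and the records, the two features (weight and horsepower), their scaling and the function name `knn_predict` are all invented for this sketch rather than taken from the course dataset.

```python
import math
from collections import Counter

# Hypothetical training records, NOT from the course dataset:
# ((weight in tonnes, horsepower in hundreds), mileage label).
# The features are assumed to be on comparable scales already.
training = [
    ((1.0, 0.7), "high"),
    ((1.1, 0.8), "high"),
    ((1.2, 0.9), "high"),
    ((1.8, 1.5), "low"),
    ((1.9, 1.6), "low"),
    ((2.0, 1.8), "low"),
]


def knn_predict(train, query, k=3):
    """Return the majority label among the k training records closest to query."""
    # Sort the training records by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda record: math.dist(record[0], query))
    # Count the labels of the k nearest records and pick the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Classify a light, low-powered car and a heavy, high-powered one.
print(knn_predict(training, (1.05, 0.75)))
print(knn_predict(training, (1.95, 1.70)))
```

In practice the choice of `k` and the scaling of the features strongly affect the result, which is why evaluating and comparing methods is a core part of the course.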
In addition to the previous datasets, participants are encouraged to bring their own data. The following are some websites with good data sources for Machine Learning.