About the Data Science Course
Course Code: Data1001
Duration: 3 days
Course Outline
Module 1 – Introduction to Data Science
- Origins of Data Science and a brief history of the Big Data revolution.
- The Big Data landscape.
- How much data is there really, and does it matter?
- Un-siloing data: use paradigms for organisational data and public data.
- Descriptive, predictive and prescriptive analysis.
- From recommendations to insights: black-box and white-box analytics.
Module 2 – Data as an Asset
- The V’s of Big Data: Volume, Velocity, Variability, Veracity.
- Data business strategies.
- Data sources, synergies and differentiation.
Module 3 – Data Life Cycle
- The analytic value chain.
- Overview of the data analysis cycle: connecting data science to business problems.
- Work cycle of a data scientist: wrangling, modelling and validation.
- Managing research.
Module 4 – Privacy and Ethics in Big Data
- The societal impacts of Big Data.
- Privacy in Australia and global perspectives.
- Big Data ethics: history and current thinking.
- Opportunities and risks for organisations and individuals.
Module 5 – Data Engineering for Analysis
- Data Science engineering and its drivers for change.
- Data volumes, data structures, and how they vary.
- Data Science architectures: the common stages.
- The usual suspects: distributed file systems, MapReduce, Spark.
Module 6 – Visualisation
- Practical and effective visualisation: beyond bar charts.
- Finding the unexpected: the role of visualisation in exploratory analysis.
- Communicating findings: the role of visualisation in communicating Data Science outputs.
- Standard tools: R, Tableau, D3.
Module 7 – Data Wrangling and Exploratory Analysis
- Determining data quality. Data cleansing.
- Entity matching.
- Imputation.
- Background modelling.
- Exploratory analysis.
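As an illustration of the imputation topic above, here is a minimal sketch of mean imputation in Python. It assumes missing values are represented as None; the function name impute_mean and the sample readings are hypothetical and not part of the course material. Mean imputation is only one of several possible strategies.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical sensor readings with two missing entries.
readings = [4.0, None, 6.0, 8.0, None]
print(impute_mean(readings))
```

Filling with the mean keeps the column average unchanged but shrinks its variance, which is one reason more careful methods are often preferred.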
Module 8 – Fundamentals of Statistics
- Types of data: numerical, categorical, ordinal.
- Statistical summaries: mean, standard deviation, quantiles, correlation.
- Simple data visualisation: histograms, boxplots, time plots and scatterplots.
- Cross-tabulations.
- Causality vs. association, independence.
- Randomisation and random sampling.
- Statistical inference using bootstrapping.
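To make the bootstrapping topic concrete, the sketch below estimates a percentile confidence interval for a sample mean by resampling with replacement. It is an illustrative sketch only: the function name bootstrap_ci, its parameters, and the sample data are assumptions rather than course material, and the percentile method is one of several bootstrap interval constructions.

```python
import random
from statistics import mean


def bootstrap_ci(sample, stat=mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(sample)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    # Recompute the statistic on many resamples drawn with replacement.
    stats = sorted(stat(rng.choices(sample, k=n)) for _ in range(n_resamples))
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical measurements.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6]
low, high = bootstrap_ci(sample)
print(f"sample mean = {mean(sample):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Because the interval is read off the resampled statistics, the same function works for other summaries (for example, passing statistics.median as stat) without any distributional formula.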
Module 9 – Model Creation and Validation
- Prediction: linear regression, nonparametric regression, k-NN.
- Forecasting: auto.arima and Error-Trend-Seasonal exponential smoothing algorithms.
- Hold-out sets, cross-validation, AIC.
- Classification: logistic regression, classification trees, SVM.
- Clustering: k-means, hierarchical clustering.
- Supervised vs. unsupervised vs. semi-supervised learning.
- Dimension reduction: principal components.
- Languages and environments (e.g. R, Python, MATLAB or even Excel) and standards (PMML).
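The validation topics in this module (hold-out sets and cross-validation) can be sketched as follows for simple linear regression. This is a minimal illustration using only the Python standard library; the function names fit_line and k_fold_mse and the data are hypothetical, and the folds are formed by interleaving indices, which assumes the rows are not ordered in a way that matters (in practice the data are usually shuffled first).

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def k_fold_mse(xs, ys, k=5):
    """Average held-out mean squared error over k interleaved folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th row is held out
        pairs = list(enumerate(zip(xs, ys)))
        train = [p for i, p in pairs if i not in test_idx]
        test = [p for i, p in pairs if i in test_idx]
        slope, intercept = fit_line(*zip(*train))  # fit on training rows only
        fold_errors.append(
            mean((y - (slope * x + intercept)) ** 2 for x, y in test)
        )
    return mean(fold_errors)


# Hypothetical data: roughly y = 2x + 1 with small deviations.
xs = list(range(10))
ys = [1.2, 2.9, 5.1, 7.0, 8.8, 11.1, 13.0, 14.9, 17.2, 19.0]
print(f"5-fold cross-validated MSE: {k_fold_mse(xs, ys):.4f}")
```

Each row is used for testing exactly once and never contributes to the model that predicts it, which is what distinguishes the cross-validated error from the in-sample fit error.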
Module 10 – Operationalisation and the Model Life Cycle
- Determining the needs: how much data must decisions be based on, how often and how quickly must they be made, and how often must models be refreshed?
- Plugging into existing data paths and choosing appropriate technologies.
- Stale models and model refreshing.
- Operationalisation from a business perspective: determining value and making Data Science outputs part of standard business and decision-making processes.
Module 11 – Panel: Building a Data-Driven Enterprise
- Data Science as a process, rather than as a point event.
- The role of high-level management in enabling data-driven decisions.
- The role of direct management: on the un-Gantt-ability of research.
Module 12 – Case Study
- Operational efficiency through predictive analytics.
- Architectural choices for integration and efficacy.