Examples and materials for the module on Python in the course in Coding for Data Science and Data Management
The course syllabus is organized around 4 main case studies. For each case study, we will implement solutions, preferably from scratch, in order to introduce all the main Python topics that are part of the course program.
The syllabus and the lectures assume knowledge of the programming content covered in the crash course on coding. Those who have difficulty are advised to complete the Python tutorial (https://docs.python.org/3/tutorial/) as an initial introduction to the language.
The course is not associated with a specific textbook, but a good introduction to Python for Data Science and Machine Learning can be found in the following book:
Géron, A. (2018). Hands-On Machine Learning with Scikit-Learn and TensorFlow. O’Reilly Media Inc., USA.
During the lectures no slides will be used. Materials are limited to blackboard, Jupyter notebooks and Python code examples.
The code developed during the lectures, as well as other materials provided by the lecturers, will be available in the dscoding GitHub repository of this module at https://github.com/afflint/dscoding.
We learn how to implement a model for generating words that resemble those of several different languages. We start from a naive random model and progressively improve its quality by using Markov chains trained on a real language-specific text dataset.
Sept 23 - Sept 30
- Advanced use of Python dictionaries
- Introduction to pandas
- Introduction to nltk
- Read/write from files and csv
- Advanced use of numpy
- Introduction to object oriented programming
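As a rough illustration of the word-generation idea in this case study, the sketch below trains a first-order, character-level Markov chain and samples new words from it. It is not the lecture code: the tiny hard-coded word list, the `train` and `generate` function names, and the `^`/`$` boundary markers are all assumptions made for the example, whereas the lectures work with real language-specific text datasets.

```python
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # hypothetical markers for word boundaries


def train(words):
    """Count, for each character, how often each next character follows it."""
    transitions = defaultdict(Counter)
    for word in words:
        chars = [START] + list(word.lower()) + [END]
        for current, following in zip(chars, chars[1:]):
            transitions[current][following] += 1
    return transitions


def generate(transitions, rng, max_len=12):
    """Sample one word by walking the chain from START until END or max_len."""
    current, out = START, []
    while len(out) < max_len:
        options = transitions[current]
        # Draw the next character with probability proportional to its count.
        following = rng.choices(list(options), weights=list(options.values()))[0]
        if following == END:
            break
        out.append(following)
        current = following
    return "".join(out)


# Made-up toy corpus; a real run would read words from a text dataset.
corpus = ["data", "date", "gate", "mate", "meta", "team"]
model = train(corpus)
rng = random.Random(0)  # fixed seed so repeated runs give the same samples
samples = [generate(model, rng) for _ in range(5)]
print(samples)
```

A dictionary of counters keeps the transition table easy to inspect, which is one reason this kind of model pairs naturally with the dictionary topics listed above. Conditioning on the previous two or three characters instead of one would be the obvious next refinement.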
We implement the KMeans clustering algorithm from scratch, working on several datasets, both 2D and multidimensional. At the end, we compare our implementation with the KMeans implementation of scikit-learn. Furthermore, we see how to visualize the clusters as well as the dynamics of the KMeans algorithm.
Oct 7 - Oct 14
- Advanced use of pandas for cleaning and preparing the datasets
- Advanced use of numpy
- Introduction to scikit-learn for PCA, clustering and evaluation metrics
- Introduction to matplotlib and overview of other plot libraries (e.g., plotly, bokeh)
- More on object oriented programming
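To make the from-scratch KMeans idea concrete, here is a minimal sketch of Lloyd's algorithm (alternating assignment and update steps). It deliberately uses only the standard library so that it stays self-contained, while the lectures rely on numpy. The `kmeans` function name, its parameters, and the synthetic two-blob data are assumptions for illustration only.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm on a list of equal-length tuples.

    Returns (centroids, labels).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise with k distinct data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for each point.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(c) / len(members) for c in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # centroids did not move, so the assignments are stable
        centroids = new_centroids
    return centroids, labels


# Two well-separated synthetic blobs in 2D (made-up data).
data_rng = random.Random(42)
blob_a = [(data_rng.gauss(0, 0.3), data_rng.gauss(0, 0.3)) for _ in range(50)]
blob_b = [(data_rng.gauss(5, 0.3), data_rng.gauss(5, 0.3)) for _ in range(50)]
points = blob_a + blob_b

centroids, labels = kmeans(points, k=2)
print(centroids)
```

Because `math.dist` accepts tuples of any length, the same function works unchanged on multidimensional points. Recording `centroids` at every iteration would be one way to plot the dynamics of the algorithm.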
We implement linear regression and gradient descent from scratch. Both will be compared with the implementations available in standard libraries and tested on several datasets.
Oct 28 - Nov 4
- Advanced use of pandas, numpy and scikit-learn
- Pipelines, feature selection and model selection in scikit-learn
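The combination of linear regression and gradient descent in this case study can be sketched as follows for a single feature. This is a minimal example under stated assumptions, not the course implementation: the `fit_linear_regression` name, the learning rate, the number of steps, and the noise-free toy data are all chosen for illustration, and the loss is taken to be the mean squared error.

```python
def fit_linear_regression(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y ~ w * x + b by batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Residuals of the current model on every sample.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of MSE = (1/n) * sum(errors^2) with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Noise-free toy data generated from y = 3x + 1 (made-up values).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x + 1.0 for x in xs]

w, b = fit_linear_regression(xs, ys)
print(w, b)
```

On this toy data the fitted `w` and `b` should approach 3 and 1. With a learning rate that is too large for the scale of the features the updates diverge, which is one practical reason to standardize features before running gradient descent.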
We implement a complex simulation environment that requires careful software design and advanced skills in object oriented programming.
Nov 11 - Nov 18 - Nov 25 - Dec 2
- Introduction to software design
- Advanced object oriented programming
- DB interaction with SQLAlchemy and pymongo
- Introduction to networkx
- Introduction to graphical and interactive data apps with streamlit
- Virtual environments using venv