fau-cs6 / kdd

Lecture and exercise of "Knowledge Discovery in Databases"

License: GNU General Public License v3.0

Jupyter Notebook 18.85% Python 0.97% TeX 80.02% Dockerfile 0.01% Makefile 0.14%

kdd's Introduction

KDD - Knowledge Discovery in Databases

Responsibilities

(Changed in the summer semester 2024, since Melanie B. Sigl is not participating this time.)

Lecture

Course Intro - Dominik Probst
Introduction - Dominik Probst
Data - Dominik Probst
Preprocessing - Dominik Probst
OLAP - Dominik Probst
Mining Frequent Patterns, Associations and Correlations - Dominik Probst
Classification - Dominik Probst
Cluster Analysis - Dominik Probst
Outlier Analysis - Dominik Probst
Current Research at CS6 - Dominik Probst

Exercise

Introduction to Python & Pandas - Dominik Probst
Data analysis & data preprocessing - Dominik Probst
Frequent Patterns - Dominik Probst
Classification - Dominik Probst
Clustering - Dominik Probst

Submission

Frequent Patterns - Dominik Probst
Classification - Dominik Probst
Clustering - Dominik Probst

Summer semester 2024

Semester duration: 15 April 2024 – 19 July 2024
Public holidays:
- Wednesday, 1 May 2024
- Thursday, 9 May 2024
- Monday, 20 May 2024
- Thursday, 30 May 2024
FAU specific holidays (no lectures and exercises):
- Tuesday, 21 May 2024
- Friday, 31 May 2024

Setup for Building Lecture Slides Locally

To build these lecture slides locally on your machine, you’ll need an up-to-date LaTeX distribution such as TeX Live or MiKTeX.

Setup for Committing

We use the framework pre-commit to manage our pre-commit hooks. This simplifies the maintenance of the hooks, especially on heterogeneous systems, but requires each user to complete an initial installation.

First, the framework itself must be installed. This process is explained on the framework’s website under “Installation”.

Second, the pre-commit hooks themselves must be installed. This can be achieved by running the command pre-commit install in the root directory of this project.

We assume that each commit has been validated with these pre-commit hooks and will not accept pull requests that contain unvalidated commits (the pre-commit hooks are also checked again on the server side by a GitHub action).


kdd's Issues

Review lecture 1: Prologue

Update Lecture 8: Clustering

Review lecture 2: Introduction

Add Pre-Commit Hook

Create a pre-commit hook that removes trailing whitespace, beautifies LaTeX code, and incorporates a linter to check for errors and bugs.

Exercise Rework

For the Summer semester of 2024, we are planning an overhaul of our exercise modules. The objective is to create two new types of exercises from the existing ones:

1. In-Person Exercises:
The in-person exercises will (likely) be designed so that they do not require any prior preparation from the students (aside from reviewing the lecture material), as the tasks will be worked on together step by step. There will be two categories of in-person exercises, which will roughly alternate:

Data Science Experiments:
Leveraging our existing database, we will use Python libraries in a hands-on setting to analyze the data and unearth knowledge. This practical approach aims to complement the lecture material and offer a glimpse into the life of a data scientist.
Theoretical Tasks on Methods:
In the last semester, we already covered theoretical tasks on methods like Apriori and FP-Growth, similar to what one might find in an exam. These will continue to be part of the in-person exercises and will be expanded to include new tasks (for instance, on classification).

2. Implementation Exercises:
The implementation exercises will be done as pure homework by the students. These will largely reflect the previous exercise sheets where methods were implemented step by step by hand. They will be enhanced to allow for automatic "correction" using Otter-Grader. Students who invest the effort to implement the tasks on their own and achieve a certain percentage of the points will gain access to a small bonus exam.

Dataset(s) for Exercise

Motivation
Exercises are based on datasets. While every exercise could be based on its own dataset, a single dataset used by all exercise groups (e.g. Association Rules, Classification, Clustering, Outlier) is desirable. One candidate is a sales dataset originating from an ERP/DWH system, similar to the dataset referenced in the book by Han et al. ("Data Mining").

Definition of Done
Specify datasets for some lectures as well as one dataset that acts as a common theme throughout the semester.

Tasks

Specify dataset(s)
Integrate dataset with a download script

Make the artifacts available without accessing the GitHub Actions

This issue was split off from Issue #12. We have found that the best ways to make the artifacts directly available (without accessing the GitHub Actions) are either to publish the artifacts via a release (which would only make sense on the master branch) or to use GitLab Pages (which won't work until we make the repository public).

For that reason, this issue is on hold until then.

Minor updates to lecture 7: Classification

The majority of the lecture is up to date. The following minor reviews and adjustments are outstanding:

Update example of evaluation metrics with confusion matrix
Review hypothesis testing and significance
Naive Bayes needs minor cosmetic fixes
Rule induction, more specifically the "sequential covering algorithm", needs review and cosmetic fixes

Update Lecture 5: OLAP

Definition of Done
All sections updated, as well as reviewed with incorporated feedback.

Tasks

Read chapter 4 of book
Insert general DWH workflow

Classification: Make Example Dataset "buys_computer" Consistent to Quinlan's Original Dataset

PR #55 found an inconsistency in the presented example dataset and the information gain calculations. Our lecture example follows Quinlan's original table, whereas our exercise dataset deviates in the attribute value of one tuple. Review commit 05b2d11 corrected the lecture table, but the changes were not carried over to the exercise dataset.

For the time being, we changed the lecture slides to prevent any confusion. For the upcoming semester, however, we may want to bring our examples in line with Quinlan's original dataset. To do so, reintroduce the changes made in commit 05b2d11, this time for both the lecture and the exercise.
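When checking that the lecture and exercise tables agree, it can help to recompute the information gain directly from class counts. The following is a minimal sketch, not the code used in the exercise sheets. The function names are hypothetical, and the counts (9 positive and 5 negative tuples, split into three partitions) are the ones commonly quoted for the 14-tuple textbook example; they are an assumption here and are not taken from this repository's files.

```python
import math


def entropy(counts):
    """Entropy (in bits) of a class distribution given as absolute counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(class_counts, partitions):
    """Entropy of the whole set minus the weighted entropy of its partitions."""
    total = sum(class_counts)
    expected = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(class_counts) - expected


# Assumed counts (illustrative only): [yes, no] overall and per attribute value.
overall = [9, 5]
by_attribute_value = [[2, 3], [4, 0], [3, 2]]

print(f"Info(D) = {entropy(overall):.3f}")
print(f"Gain    = {information_gain(overall, by_attribute_value):.3f}")
```

Changing the attribute value of a single tuple moves one count from one partition to another, so rerunning the calculation on both tables shows quickly whether the slides and the exercise still yield the same gain.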

Provide documentation on contribution

Students are encouraged to report errors on the slides or in the assignment sheets or correct them right away via the GitHub repository. The first pull request showed that a small rule guide from our side could be helpful.

According to our Jour Fixe on 02.05.2022, the guide should include:

Commit messages should follow our prefix scheme
(e.g. "lecture-[xxx]: [description of the changes]" or "exercise-[xxx]: [description of the changes]")
Commit messages should clearly describe what has been changed
(e.g. "lecture-prologue: fixed typos on slide 15" instead of "lecture: fixed some stuff")
Pull requests should be directed to the dev branch
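The prefix scheme above could be checked automatically. The sketch below is one possible reading of it, not an existing hook in this repository: the regular expression and the function name follows_prefix_scheme are assumptions, and the scheme only fixes the "lecture-[xxx]: " and "exercise-[xxx]: " prefixes, so the allowed characters for [xxx] are a guess.

```python
import re

# Assumed pattern: "lecture-<topic>: <description>" or "exercise-<topic>: <description>".
# The character set allowed for <topic> is a guess, not part of the agreed scheme.
PREFIX_PATTERN = re.compile(r"^(lecture|exercise)-[a-z0-9-]+: \S.*$")


def follows_prefix_scheme(message: str) -> bool:
    """Return True if the first line of a commit message matches the prefix scheme."""
    lines = message.splitlines()
    return bool(lines) and PREFIX_PATTERN.match(lines[0]) is not None


print(follows_prefix_scheme("lecture-prologue: fixed typos on slide 15"))  # True
print(follows_prefix_scheme("lecture: fixed some stuff"))  # False
```

A check like this only covers the prefix. Whether the description clearly states what has been changed still needs a human reviewer.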

Create Exercise "Outlier"

Create Exercise "Clustering"

Build Zip-Files for Every Exercise With Github Actions

Our exercise Jupyter notebooks may end up containing additional files such as Python scripts and images. Extending our current GitHub Actions workflow to build a ZIP file for every single exercise unit would make distribution easier.
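The packaging step itself could look roughly like the sketch below, which a workflow could call as a script. It assumes a hypothetical layout with one subdirectory per exercise unit, and the function name zip_exercises is made up for illustration; the actual directory structure of this repository may differ.

```python
from pathlib import Path
import zipfile


def zip_exercises(exercise_root: Path, output_dir: Path) -> list[Path]:
    """Create one ZIP archive per subdirectory of exercise_root.

    Assumes output_dir lies outside exercise_root, so archives are not
    picked up as exercise units themselves.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    archives = []
    for unit in sorted(p for p in exercise_root.iterdir() if p.is_dir()):
        archive = output_dir / f"{unit.name}.zip"
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            for file in sorted(unit.rglob("*")):
                if file.is_file():
                    # Keep the unit folder as the top-level entry in the archive.
                    zf.write(file, file.relative_to(exercise_root).as_posix())
        archives.append(archive)
    return archives
```

Each archive then contains the notebook together with its scripts and images under a single folder, so students can unpack one file per exercise unit.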

Update Lecture 4: Preprocessing

Fix the formulas for correlation analysis
Implement the small typo and content fixes Prof. Meyer-Wegener proposed

Review Lecture 2: Getting To Know Your Data

Adding Cite Footnotes to the Frequent Pattern Lecture

As suggested by @melsigl in #68, the sources should be added as footnotes in the Frequent Pattern lecture.

Create Exercise "Data"

Create an exercise that corresponds to lecture 3 (Getting To Know Your Data) and lecture 4 (Data Preprocessing).

Update Readme

Implementation of the new README contents decided in the meeting on 28.01.2022.

Link to the CI Artifacts
Responsibility for the exercise sheets

Update Lecture 9: Outlier

Definition of Done
All sections updated, as well as reviewed with incorporated feedback.

Tasks

Read chapter 12 of book
Update section "Outlier and Outlier Analysis"
Update section "Outlier-Detection Methods"
Update section "Statistical Approaches"
Update section "Proximity-Based Approaches"
Update section "Summary"
Reviewed

Update Lecture 7: Classification

Definition of Done
All sections updated, as well as reviewed with incorporated feedback.

Tasks

Read chapter 8 of book
Update section "Decision-Tree"
Update section "Bayes Classification"
Update section "Rule-Based Classification"
Update section "Model Evaluation & Selection"
Update section "Ensemble Methods"
Reviewed

Update Lecture 6: Frequent Patterns

Refactor Makefile

Shorten the Makefile and incorporate the steps from GitHub Actions into it. Additionally, add the option to build all LaTeX PDF files.

Exercise 7 - Classification: Add Calculation and Plotting of ROC Curve

Add an exercise to calculate the ROC curve table and then plot the ROC curve, similar to Figures 8.18 and 8.19 in the 3rd edition of the book.
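As a starting point, the ROC table can be built by sorting the tuples by decreasing score and updating the true-positive and false-positive counts one tuple at a time. The sketch below is an assumption about how such an exercise might look, not the planned solution. The function names and the six labelled scores are made up for illustration, and tied scores are not handled.

```python
def roc_points(labels, scores):
    """Return the (FPR, TPR) points obtained by lowering the threshold tuple by tuple.

    labels: 1 for positive, 0 for negative; a higher score means "more positive".
    Assumes distinct scores; tied scores would have to be processed as a group.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical classifier output, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]

for fpr, tpr in roc_points(labels, scores):
    print(f"FPR = {fpr:.2f}  TPR = {tpr:.2f}")
print(f"AUC = {area_under_curve(roc_points(labels, scores)):.3f}")
```

The resulting list of points can be passed to any plotting library to draw the curve, with the false-positive rate on the x-axis and the true-positive rate on the y-axis.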

Discuss and Resolve Exercises for The Upcoming Semesters

PR #59 showed that several unresolved questions exist. As we aim to continuously improve our lectures as well as the exercises, we should discuss them in order to implement them before the upcoming semester.

The points discussed so far include, but may not be limited to:

Should the exercise notebooks be split per week or per algorithm? What is the best strategy here? Which one confuses the students less? Which one is the most intuitive for students?
Should we follow an object-oriented paradigm that is consistent with sklearn and consolidates all necessary state variables and functions into an intuitive fit/predict procedure, without jeopardizing a guided step-by-step walk through the algorithm? Again, the question is which strategy is best.
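To make the second point concrete, here is a minimal sketch of what an sklearn-style class with visible intermediate steps could look like. It is only an illustration for the discussion, not a decision and not code from the exercises. The class name NearestCentroidSketch and its helper methods are hypothetical, and the nearest-centroid rule was chosen only because it is short.

```python
from collections import defaultdict
import math


class NearestCentroidSketch:
    """Hypothetical example: fit/predict interface with explicit, numbered steps."""

    def fit(self, X, y):
        # Step 1: group the training tuples by class label.
        groups = defaultdict(list)
        for row, label in zip(X, y):
            groups[label].append(row)
        # Step 2: store one centroid per class. The trailing underscore follows
        # the sklearn convention for state learned during fit.
        self.centroids_ = {label: self._mean(rows) for label, rows in groups.items()}
        return self

    def predict(self, X):
        # Step 3: assign each tuple to the class with the closest centroid.
        return [self._closest_class(row) for row in X]

    @staticmethod
    def _mean(rows):
        """Component-wise mean of a list of equally long numeric rows."""
        return [sum(column) / len(rows) for column in zip(*rows)]

    def _closest_class(self, row):
        """Label of the centroid with the smallest Euclidean distance to row."""
        return min(self.centroids_, key=lambda label: math.dist(row, self.centroids_[label]))


model = NearestCentroidSketch().fit([[1, 1], [1, 2], [8, 8], [9, 8]], ["a", "a", "b", "b"])
print(model.predict([[0, 0], [10, 10]]))  # ['a', 'b']
```

Students could still implement and inspect each helper method in its own notebook cell, while all state lives on the object, which is the trade-off the discussion is about.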

Everyone involved has already commented on these points, including the advantages and disadvantages they see and their own preferences. See PR #59 for the detailed discussion.

Further comments and topics that we should discuss in the future shall be added here, or mentioned directly in our weekly Jour Fixe.

Update Images in Lecture OLAP

Update and extend images in lecture 5 OLAP:

Update images to use relative positioning to make resizing and moving single elements easier.
Update the image on slide "Cube: A Lattice of Cuboids" to conform to Figure 4.5 on page 139 of the book (some texts are missing). Additionally, put text in the foreground, lines in the background, and maybe use a different color for the lines.
Optionally, in the image on the slide "Data Warehouse Development: A Recommended Approach", use color coding for the "Model refinement" lines.

Create Exercise/Workshop "Introduction to pandas"

Create Exercise "Frequent Patterns"

Create Prologue

Create Exercise "Classification"
