kdd's Introduction

KDD - Knowledge Discovery in Databases

Responsibilities

(Changed in 2024SS since Melanie B. Sigl is not participating this time)

Lecture

  1. Course Intro - Dominik Probst
  2. Introduction - Dominik Probst
  3. Data - Dominik Probst
  4. Preprocessing - Dominik Probst
  5. OLAP - Dominik Probst
  6. Mining Frequent Patterns, Associations and Correlations - Dominik Probst
  7. Classification - Dominik Probst
  8. Cluster Analysis - Dominik Probst
  9. Outlier Analysis - Dominik Probst
  10. Current Research at CS6 - Dominik Probst

Exercise

  1. Introduction to Python & Pandas - Dominik Probst
  2. Data analysis & data preprocessing - Dominik Probst
  3. Frequent Patterns - Dominik Probst
  4. Classification - Dominik Probst
  5. Clustering - Dominik Probst

Submission

  1. Frequent Patterns - Dominik Probst
  2. Classification - Dominik Probst
  3. Clustering - Dominik Probst

Summer semester 2024

  • Semester duration: 15 April 2024 – 19 July 2024
  • Public holidays:
    • Wednesday, 1 May 2024
    • Thursday, 9 May 2024
    • Monday, 20 May 2024
    • Thursday, 30 May 2024
  • FAU-specific holidays (no lectures and exercises):
    • Tuesday, 21 May 2024
    • Friday, 31 May 2024

Setup for Building Lecture Slides Locally

To build the lecture slides locally, you need an up-to-date LaTeX distribution such as TeX Live or MiKTeX.

Setup for Committing

We use the pre-commit framework to manage our pre-commit hooks. This simplifies maintaining the hooks, especially on heterogeneous systems, but requires each user to complete a one-time installation.

First, the framework itself must be installed. This process is explained on the framework’s website under “Installation”.

Second, the pre-commit hooks themselves must be installed. To do so, run the command pre-commit install in the root directory of this project.

We assume that each commit has been validated with these pre-commit hooks and will not accept pull requests that contain unvalidated commits (the pre-commit hooks are also checked again on the server side by a GitHub action).

kdd's People

Contributors

melsigl, dominik-probst, quicktus, lucew, itodnerd, mahdi-qanbari, dependabot[bot], marcjulianschwarz

kdd's Issues

Add Pre-Commit Hook

Create a pre-commit hook that removes trailing whitespace, formats the LaTeX code, and runs a linter to check for errors and bugs.

Exercise Rework

For the Summer semester of 2024, we are planning an overhaul of our exercise modules. The objective is to create two new types of exercises from the existing ones:

1. In-Person Exercises:
The in-person exercises will likely be designed so that students need no preparation beyond reviewing the lecture material, since the tasks will be worked through together step by step. There will be two categories of in-person exercises, roughly alternating:

  • Data Science Experiments:
    Leveraging our existing database, we will use Python libraries in a hands-on setting to analyze the data and unearth knowledge. This practical approach aims to complement the lecture material and offer a glimpse into the life of a data scientist.
  • Theoretical Tasks on Methods:
    Last semester we already covered theoretical tasks on methods such as Apriori and FP-Growth, similar to what one might find in an exam. These will remain part of the in-person exercises and will be expanded with new tasks (for instance, on classification).

2. Implementation Exercises:
The implementation exercises will be done as pure homework by the students. These will largely reflect the previous exercise sheets where methods were implemented step by step by hand. They will be enhanced to allow for automatic "correction" using Otter-Grader. Students who invest the effort to implement the tasks on their own and achieve a certain percentage of the points will gain access to a small bonus exam.

Dataset(s) for Exercise

Motivation
Exercises are based on datasets. While every exercise could use its own dataset, a single dataset shared by all exercise groups (e.g. Association Rules, Classification, Clustering, Outlier) is desirable. One candidate is a sales dataset originating from an ERP/DWH system, similar to the dataset referenced in the book by Han et al. ("Data Mining").

Definition of Done
Specify datasets for some lectures as well as one dataset that acts as a common theme throughout the semester.

Tasks

  • Specify dataset(s)
  • Integrate dataset with a download script
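
A download script for the second task could look roughly like the sketch below. The function name fetch_dataset and its arguments are hypothetical, and no real dataset URL is assumed; the sketch only shows the "download once, reuse afterwards" behaviour that such a script would likely need.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_dataset(url: str, target: Path) -> Path:
    """Download url to target unless the file already exists.

    Hypothetical helper: the URL and target path are supplied by the caller.
    """
    if target.exists():
        # Reuse the local copy so repeated notebook runs do not download again.
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(target))
    return target
```

A notebook could call fetch_dataset once in its first cell and then load the returned path with Pandas.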

Make the artifacts available without accessing the GitHub Actions

This issue was split off from Issue #12. We found that the best ways to make the artifacts directly available (without accessing GitHub Actions) are either to publish them via a release (which would only make sense on the master branch) or to use GitLab Pages (which won't work until we make the repository public).

For that reason, this issue is on hold until then.

Minor updates to lecture 7: Classification

The majority of the lecture is up to date. The following minor reviews and adjustments are outstanding:

  • Update example of evaluation metrics with confusion matrix
  • Review hypothesis testing and significance
  • Naive Bayes in need of minor cosmetics
  • Rule induction, more specifically "sequential covering algorithm" in need of review and cosmetics

Update Lecture 5: OLAP

Definition of Done
All sections updated, as well as reviewed with incorporated feedback.

Tasks

  • Read chapter 4 of book
  • Insert general DWH workflow

Classification: Make Example Dataset "buys_computer" Consistent to Quinlan's Original Dataset

PR #55 found an inconsistency between the presented example dataset and the information gain calculations. Our lecture example follows Quinlan's original table, whereas our exercise dataset deviates in one tuple's attribute value. Review commit 05b2d11 corrected the lecture table, but the change was not carried over to the exercise dataset.

For the time being, we changed the lecture slides to prevent confusion. In the upcoming semester, however, we may want to bring our examples in line with Quinlan's original dataset. To do so, reintroduce the changes from commit 05b2d11, this time for both the lecture and the exercise.
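
To see why a single differing attribute value matters, the information gain calculation can be sketched as follows. This is a minimal illustration of the standard entropy-based formula; the function names are our own, and the rows in the usage note are made-up toy data, not the lecture or exercise table.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Info(D) = -sum_i p_i * log2(p_i) over the class distribution of labels."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Gain(A) = Info(D) - sum_v |D_v| / |D| * Info(D_v)."""
    labels = [row[target] for row in rows]
    # Group the class labels by the value of the chosen attribute.
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    weighted = sum(len(part) / len(rows) * entropy(part)
                   for part in partitions.values())
    return entropy(labels) - weighted
```

Calling information_gain(rows, "student", "buys_computer") on a list of dictionaries, then changing the "student" value of one row and calling it again, generally yields a different gain. That is how lecture and exercise results can diverge when the two tables differ in one tuple.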

Provide documentation on contribution

Students are encouraged to report errors in the slides or assignment sheets, or to correct them directly via the GitHub repository. The first pull request showed that a short contribution guide from our side would be helpful.

According to our Jour Fixe on 02.05.2022, the guide should include:

  • Commit messages should follow our prefix scheme
    (e.g. "lecture-[xxx]: [description of the changes]" or "exercise-[xxx]: [description of the changes]")
  • Commit messages should clearly describe what has been changed
    (e.g. "lecture-prologue: fixed typos on slide 15" instead of "lecture: fixed some stuff")
  • Pull requests should be directed to the dev branch
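
The prefix scheme above could be checked automatically, for example in a commit-msg hook. The sketch below is only an assumption about how such a check might look: the regular expression, in particular the characters allowed in the [xxx] part, is a guess and not an agreed rule.

```python
import re

# Assumed pattern: "lecture-<topic>: <description>" or "exercise-<topic>: <description>".
# The character set allowed for <topic> is a guess; adjust it to the real scheme.
COMMIT_PATTERN = re.compile(r"^(lecture|exercise)-[a-z0-9-]+: \S.*$")


def follows_prefix_scheme(message: str) -> bool:
    """Return True if the first line of a commit message matches the assumed scheme."""
    lines = message.splitlines()
    first_line = lines[0] if lines else ""
    return COMMIT_PATTERN.match(first_line) is not None
```

With this pattern, "lecture-prologue: fixed typos on slide 15" is accepted, while "lecture: fixed some stuff" is rejected because the topic part is missing. Whether a description is clear enough still needs a human reviewer.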

Build Zip-Files for Every Exercise With GitHub Actions

Our exercise Jupyter notebooks may end up depending on additional files such as Python scripts and images. Extending our current GitHub Actions workflow to build a ZIP file for every exercise unit would make distribution easier.
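
One way to build the archives is a small Python script that the workflow calls. The sketch below assumes a hypothetical layout in which every exercise unit is a direct subdirectory of one exercise folder; the function name and directory names are illustrative only.

```python
import zipfile
from pathlib import Path


def zip_exercises(exercise_root: Path, output_dir: Path) -> list[Path]:
    """Create one ZIP archive per immediate subdirectory of exercise_root."""
    output_dir.mkdir(parents=True, exist_ok=True)
    archives = []
    unit_dirs = sorted(
        p for p in exercise_root.iterdir()
        # Skip the output directory in case it lives inside the exercise folder.
        if p.is_dir() and p.resolve() != output_dir.resolve()
    )
    for unit_dir in unit_dirs:
        archive_path = output_dir / f"{unit_dir.name}.zip"
        with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
            for file_path in sorted(unit_dir.rglob("*")):
                if file_path.is_file():
                    # Keep the unit folder name as the top-level entry in the archive.
                    arcname = file_path.relative_to(exercise_root).as_posix()
                    archive.write(file_path, arcname)
        archives.append(archive_path)
    return archives
```

A workflow step could run this script and upload the resulting directory as an artifact, so that notebooks, helper scripts, and images stay bundled per unit.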

Create Exercise "Data"

Create an exercise that corresponds to lecture 3 (Getting To Know Your Data) and lecture 4 (Data Preprocessing).

Update Readme

Implement the new README contents decided in the meeting on 28.01.2022:

  • Link to the CI Artifacts
  • Responsibility for the exercise sheets

Update Lecture 9: Outlier

Definition of Done
All sections updated, as well as reviewed with incorporated feedback.

Tasks

  • Read chapter 12 of book
  • Update section "Outlier and Outlier Analysis"
  • Update section "Outlier-Detection Methods"
  • Update section "Statistical Approaches"
  • Update section "Proximity-Based Approaches"
  • Update section "Summary"
  • Reviewed

Update Lecture 7: Classification

Definition of Done
All sections updated, as well as reviewed with incorporated feedback.

Tasks

  • Read chapter 8 of book
  • Update section "Decision-Tree"
  • Update section "Bayes Classification"
  • Update section "Rule-Based Classification"
  • Update section "Model Evaluation & Selection"
  • Update section "Ensemble Methods"
  • Reviewed

Refactor Makefile

Shorten the Makefile and incorporate the steps from the GitHub Actions workflow into it. Additionally, add the possibility to build all LaTeX PDF files.

Discuss and Resolve Exercises for The Upcoming Semesters

PR #59 showed that several unresolved questions exist. As we aim to continuously improve our lectures as well as the exercises, we should discuss them in order to implement them before the upcoming semester.

The points discussed so far include, but may not be limited to:

  1. Should exercise notebooks be split per week or per algorithm? Which strategy confuses students less and gives them the most intuitive approach?
  2. Should we follow an object-oriented paradigm that is consistent with sklearn and consolidates all necessary state variables and functions into an intuitive fit/predict procedure, without jeopardizing a guided step-by-step walk through the algorithm? Again, the question is which strategy is best.
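
As a basis for the discussion of the second point, the sketch below shows one possible shape of such a class. It is not taken from the exercises: the class name StepwiseKNN, the choice of k-nearest neighbours, and the method names distances and nearest_labels are assumptions made for illustration. Only fit and predict mirror the sklearn convention.

```python
from collections import Counter
from math import dist


class StepwiseKNN:
    """Hypothetical k-nearest-neighbour classifier with an sklearn-like interface."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # Learned state carries a trailing underscore, following the sklearn convention.
        self.X_ = [tuple(x) for x in X]
        self.y_ = list(y)
        return self

    # Each step of the algorithm is its own method, so a notebook can call
    # and inspect it separately before moving on to the next step.
    def distances(self, x):
        """Euclidean distance from x to every training point."""
        return [dist(x, train_x) for train_x in self.X_]

    def nearest_labels(self, x):
        """Labels of the k training points closest to x."""
        ranked = sorted(zip(self.distances(x), self.y_), key=lambda pair: pair[0])
        return [label for _, label in ranked[: self.k]]

    def predict(self, X):
        """Majority vote over the k nearest labels for every query point."""
        return [Counter(self.nearest_labels(x)).most_common(1)[0][0] for x in X]
```

In a notebook, students could first implement and test distances, then nearest_labels, and only afterwards call fit and predict. This keeps the guided walk through the algorithm while the final interface stays consistent with sklearn.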

Each party concerned has already commented on these matters, including advantages, disadvantages, and individual opinions. See PR #59 for the detailed discussion.

Further comments and topics that we should discuss in the future shall be added here, or mentioned directly in our weekly Jour Fixe.

Update Images in Lecture OLAP

Update and extend images in lecture 5 OLAP:

  • Update images to use relative positioning to make resizing and moving single elements easier.
  • Update the image on slide "Cube: A Lattice of Cuboids" to conform to Figure 4.5 on page 139 of the book (some texts are missing). Additionally, put text in the foreground, lines in the background, and maybe use a different color for the lines.
  • Optionally, in the image on slide "Data Warehouse Development: A Recommended Approach", use color coding for the "Model refinement" lines.
