This repository contains the joint work of Mohammad Mohammadzadeh and me for the projects of the "Introduction to Data Mining" course held at the Ferdowsi University of Mashhad. The projects were carried out in several phases, as follows:
In this phase, we first had to select a university as our crawling target, then extract the course information from its course catalog pages and collect it according to the project description. We finally chose the University of Technology Sydney (UTS).
In our code, we had to inherit from the given class `BaseCrawler`, which served as an interface. We used the `requests` and `BeautifulSoup` modules to crawl and parse the web pages. Additionally, to speed up the process, the code runs multi-threaded (threads_count=<INTENDED_THREADS_NUM>).
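As a rough sketch of how such a crawler can be structured — the selectors, URLs, and `THREADS_COUNT` value below are illustrative assumptions (the `BaseCrawler` interface is course-specific, so it is omitted here), not our actual code:

```python
# Illustrative sketch of a multi-threaded catalog crawler; selectors,
# URLs, and THREADS_COUNT are made-up placeholders, not the real code.
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

THREADS_COUNT = 8  # stands in for <INTENDED_THREADS_NUM>


def parse_course(html: str) -> dict:
    """Extract a course record from a catalog page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.select_one("h1").get_text(strip=True),
        "description": soup.select_one(".description").get_text(strip=True),
    }


def fetch_course(url: str) -> dict:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return parse_course(resp.text)


def crawl(urls):
    # Fetch catalog pages concurrently to speed up crawling.
    with ThreadPoolExecutor(max_workers=THREADS_COUNT) as pool:
        return list(pool.map(fetch_course, urls))


# Parsing demo on a static page (no network needed):
page = "<h1>Data Mining</h1><div class='description'>Patterns in data.</div>"
record = parse_course(page)
```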
On our target website, the following information was provided for most of the courses:
- Projects
- Scores
- Prerequisite
- Objective
- Description
- Course title
- Department name
- Outcome
- References
The output of this phase was a CSV file containing all the crawled courses (rows), as well as their information (columns).
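Writing such a file is straightforward with the standard `csv` module; the record and column subset below are made-up placeholders for the crawled courses:

```python
import csv

# Hypothetical crawled record; the real file has one row per course,
# with one column per field listed above.
courses = [
    {"Course title": "Data Mining", "Department name": "Computer Science",
     "Description": "Patterns in data.", "Prerequisite": "Databases"},
]
fieldnames = ["Course title", "Department name", "Description", "Prerequisite"]

with open("courses.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(courses)

# Sanity check: read the file back.
with open("courses.csv", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
```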
In this phase, we had to preprocess the data crawled in the previous phase and then extract features/keywords from it. For each course, we worked with three text columns (features): Description, Objective, and Outcome, each of which contains one or more sentences.
More concretely, we applied stemming, lemmatization, and stopword removal to clean and preprocess the texts. It is worth mentioning that we did not implement any of these techniques from scratch; instead, we took advantage of the built-in modules provided by the `nltk` library.
!!! Please note that, for the stopword-removal step, we relied only on the list provided by `nltk` and did not apply any further method to remove/ignore less valuable words, i.e. words that appear numerous times but do not show up among the `nltk` stopwords, such as the word develop. We could easily have used TF-IDF or a similar weighting scheme to filter out these words, but since they did not cause any trouble for our work, we skipped such approaches!
After tokenizing and preprocessing the texts, it was time to extract the keywords. For this purpose, we used KeyBERT.
According to its website:
> KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.
The following figure illustrates the word clouds obtained separately from the extracted keywords of each department. As you may have noticed, some words, e.g. develop, appear in most of the plots; this is mainly caused by the stopword issue discussed above.
With the keywords in hand, extracting the frequent patterns was a piece of cake! We used the `mlxtend` library to do so. Below, you can find a small snippet of this section's output.
This phase itself is divided into three sections: Embedding, Clustering, and Classification.
In the previous phases, we crawled the required data, preprocessed it, and extracted the keywords. However, to compute the similarity/distance between two courses (for clustering and classification), the individual tokens are not very handy.
To address this problem, we needed to work with vectors, so that we could effortlessly use cosine distance to compare the similarity between different courses. Hence, we took advantage of SentenceTransformers, which, according to its webpage, is a:

> Python framework for SOTA sentence, text and image embeddings. . . . You can use this framework to compute sentence/text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-similarity to find sentences with a similar meaning.
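Once each course is embedded as a vector, comparing two courses reduces to a cosine computation. A toy sketch with made-up four-dimensional "embeddings" (real SentenceTransformers vectors have several hundred dimensions):

```python
# Cosine-similarity sketch on made-up low-dimensional vectors standing
# in for the real SentenceTransformers course embeddings.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

course_a = np.array([[0.9, 0.1, 0.0, 0.2]])
course_b = np.array([[0.8, 0.2, 0.1, 0.3]])  # similar direction to a
course_c = np.array([[0.0, 0.9, 0.8, 0.1]])  # different direction

sim_ab = cosine_similarity(course_a, course_b)[0, 0]
sim_ac = cosine_similarity(course_a, course_c)[0, 0]
```

Since cosine similarity depends only on the angle between vectors, `course_a` and `course_b` come out far more similar than `course_a` and `course_c`.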
We used the K-means and DBSCAN algorithms to cluster the obtained vectors. Although the clustering was done in an unsupervised manner, we needed some kind of labels to be able to evaluate the clusters; hence, we set each course's department as its label.
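A clustering-and-evaluation sketch in this spirit, on synthetic vectors (in the project, the inputs were the sentence embeddings and the labels were the department names; the hyperparameters and evaluation metric here are illustrative):

```python
# Clustering sketch: K-means and DBSCAN on synthetic vectors, evaluated
# against known labels. Hyperparameters and metric are illustrative.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-ins for embeddings (X) and department labels.
X, departments = make_blobs(n_samples=120, centers=3, random_state=0)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)

# Compare the unsupervised clusters with the "department" labels.
ari_kmeans = adjusted_rand_score(departments, kmeans_labels)
```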
With the vectors and their labels ready, classifying the courses was not a pain in the rear! For this section, we used the SVM, Perceptron, and MLP algorithms. These algorithms gave us a bit more freedom to try different hyperparameters, and we made an effort to tune them.
We finally took the best result obtained from each algorithm and compared them. The results may seem discouraging at first, but please note that achieving an accuracy of ~68% on a classification problem with 40+ classes (random accuracy ≈ 2.5%), using only a very basic (and sometimes generic) description of each course, makes the results quite valuable!