
Data Mining course projects

This repository contains our joint work (by Mohammad Mohammadzadeh and me) on the projects for the "Introduction to Data Mining" course held at the Ferdowsi University of Mashhad. The projects were to be done in several phases, as follows:

  1. Data crawling
  2. Preprocessing & Feature extraction
  3. Clustering & Classification

Phase 1: Data crawling

In this phase, we first had to select a university as our crawling target and then extract the course information from its course catalog pages, collecting the fields specified in the project description. We finally chose the University of Technology Sydney (UTS).

Our code had to inherit from the provided BaseCrawler class, which served as an interface. We used the requests and BeautifulSoup modules to fetch and parse the web pages. Additionally, to speed up the process, the crawler runs with multiple threads (threads_count=<INTENDED_THREADS_NUM>).
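The multi-threaded part can be sketched roughly as follows. The real BaseCrawler interface and page-parsing logic are not shown in this README, so the names and structure here are purely illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch only: the actual BaseCrawler interface provided in
# the course may differ, and the real fetcher used requests/BeautifulSoup.
class BaseCrawler:
    def crawl(self):
        raise NotImplementedError

class UTSCrawler(BaseCrawler):
    def __init__(self, urls, fetch, threads_count=8):
        self.urls = urls
        self.fetch = fetch                  # e.g. fetch + parse one course page
        self.threads_count = threads_count

    def crawl(self):
        # Fetch course pages concurrently; results keep the order of `urls`.
        with ThreadPoolExecutor(max_workers=self.threads_count) as pool:
            return list(pool.map(self.fetch, self.urls))
```

With a real fetch function built on requests/BeautifulSoup, `UTSCrawler(urls, fetch).crawl()` would return one record per course page.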

On our target website, the following information was provided for most courses:

  • Projects
  • Scores
  • Prerequisite
  • Objective
  • Description
  • Course title
  • Department name
  • Outcome
  • References

The output of this phase was a CSV file with one row per crawled course and one column per extracted field.
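Writing such a file is straightforward with Python's csv module. The field names below just mirror the list above; the actual column names in our file may differ:

```python
import csv
import io

# Illustrative only: column names follow the bullet list above.
def write_courses_csv(courses, fileobj):
    fields = ["Course title", "Department name", "Description",
              "Objective", "Outcome", "Prerequisite", "Projects",
              "Scores", "References"]
    # Missing fields are filled with an empty string (DictWriter's restval).
    writer = csv.DictWriter(fileobj, fieldnames=fields)
    writer.writeheader()
    writer.writerows(courses)
```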

Phase 2: Preprocessing & Feature extraction

In this phase, we had to preprocess the data crawled in the previous phase and then extract features/keywords from it.

Preprocessing

We had to work with 3 columns (features) for each course: Description, Objective, and Outcome, each of which contains one or more sentences.

More concretely, we applied stemming, lemmatization, and stopword removal to clean and preprocess the texts. It is worth mentioning that we did not implement any of these steps from scratch; instead, we took advantage of the built-in modules provided by the nltk library.
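As a rough illustration of that pipeline (this toy version hand-rolls everything; the project itself used nltk's stemmer, lemmatizer, and stopword list):

```python
import re

# Simplified stand-in for the nltk pipeline: tokenize -> drop stopwords
# -> stem. The tiny stopword set and suffix stripping below are only
# for illustration; they are nothing like nltk's real implementations.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}

def crude_stem(word):
    # Toy suffix stripping, far weaker than Porter stemming.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]
```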

!!! Please note that for stopword removal we relied solely on the stopword list provided by `nltk` and did not apply any further method to remove/ignore less valuable words, i.e. words that appear numerous times but are not among `nltk`'s stopwords, such as develop. We could have easily used TF-IDF or a similar weighting scheme to filter these words out, but since they did not cause any trouble for our work, we skipped such approaches!
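For the record, the TF-IDF weighting mentioned above would indeed zero out a word like develop that shows up in every document; a minimal sketch:

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: score} dict per doc."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in docs for t in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        # A term in every document gets idf = log(n / n) = 0.
        scores.append({t: tf[t] / len(doc) * math.log(n / df[t]) for t in tf})
    return scores
```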

Keyword Extraction

After tokenizing and preprocessing the texts, it was time to extract the keywords. For this, we used KeyBERT. According to its website:

KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.

The following figure illustrates the word clouds obtained separately from the extracted keywords of each department. As you may have noticed, some words, e.g. develop, appear in most of the plots. This is mainly caused by the stopword issue discussed above.

Frequent Pattern Extraction

With the keywords in hand, extracting the frequent patterns was a piece of cake! We used the mlxtend library to do so. Below, you can find a small snippet of this section's output.
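mlxtend's apriori does the heavy lifting in a couple of lines; conceptually it computes something like this brute-force, stdlib-only version (illustrative, and far less efficient than the real algorithm):

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force version of what mlxtend's apriori computes: itemsets
    appearing in at least a min_support fraction of the transactions."""
    n = len(transactions)
    counts = Counter()
    for items in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(set(items)), size):
                counts[combo] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}
```

Here each "transaction" would be the keyword set of one course.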

Phase 3: Clustering & Classification

This phase itself is divided into 3 sections: Embedding, Clustering, and Classification.

Embedding

In the previous phases, we crawled the required data, preprocessed it, and extracted the keywords. However, individual tokens are not convenient for computing the similarity/distance between two courses (as needed for clustering and classification).

To address this problem, we needed to work with vectors, so that we could effortlessly use cosine distance to compare the similarity between different courses. Hence, we took advantage of SentenceTransformers, which, according to its webpage, is a:

Python framework for SOTA sentence, text and image embeddings. . . . You can use this framework to compute sentence/text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-similarity to find sentences with a similar meaning.
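With each course embedded as a vector, the cosine comparison from the quote boils down to a few lines (plain-Python sketch; in the project the vectors came from SentenceTransformers):

```python
import math

def cosine_similarity(u, v):
    # Dot product of the two vectors divided by the product of their norms;
    # 1.0 means identical direction, 0.0 means orthogonal.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm
```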

Clustering

We used the K-means and DBSCAN algorithms to cluster the obtained vectors. Although the clustering was done unsupervised, we needed some kind of labels for our data to be able to evaluate the clusters. Hence, we set each course's department as its label.
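One simple way to score clusters against such department labels is cluster purity (an illustrative metric, not necessarily the evaluation we actually used):

```python
from collections import Counter, defaultdict

def purity(cluster_ids, labels):
    """Fraction of items whose label matches the majority label of
    their cluster -- a simple score for K-means/DBSCAN output when
    each course's department serves as its label."""
    by_cluster = defaultdict(list)
    for cid, label in zip(cluster_ids, labels):
        by_cluster[cid].append(label)
    # Count, per cluster, how many members carry the majority label.
    majority = sum(Counter(members).most_common(1)[0][1]
                   for members in by_cluster.values())
    return majority / len(labels)
```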

Classification

With the vectors and their labels ready, classifying the courses was not a pain in the rear! For this section, we used the SVM, Perceptron, and MLP algorithms. These algorithms gave us a bit more freedom to try different hyperparameters, and we had to make an effort to tune them.
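Of the three, the Perceptron is simple enough to sketch from scratch (the project presumably used library implementations; this toy version just shows the mistake-driven update rule on a 2-class problem):

```python
# Minimal perceptron for labels in {-1, +1}; purely illustrative.
def train_perceptron(X, y, epochs=20, lr=0.1):
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != target:  # update weights only on mistakes
                w = [wi + lr * target * xi for wi, xi in zip(w, x)]
                b += lr * target
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```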

We finally took the best result obtained from each algorithm and compared them. The results may seem discouraging at first, but please note that achieving ~68% accuracy on a classification problem with 40+ classes (random accuracy ≈ 2.5%), using only a very basic (and sometimes generic) description of each course, makes them much more valuable!
