This repository contains a Jupyter Notebook for analyzing job data using Python and various libraries. The notebook covers several key steps in the data analysis process, including data scraping, cleaning, transformation, and clustering. Below, you'll find an overview of each section in the notebook.
- Preparing JSON Files
- Preparing DataFrames
- Creating a Corpus
- Removing Noise
- Creating the TF-IDF Matrix
- Clustering with K-Means
- Visualizing the Data
- Creating a Spark Context and Reading the Data
- Building a Pipeline and Clustering with K-Means
This section demonstrates how to scrape job data from websites using Scrapy. It provides Python code to define Scrapy spiders, run them, and save the scraped data in JSON format.
In this section, the notebook reads the previously created JSON files into two Pandas DataFrames, 'df1' and 'df2'. It also defines a function to clean the data, and prints an error message if the JSON files cannot be loaded.
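A sketch of this loading step, with the error handling wrapped in a small helper (the helper name is an assumption; the file names 'jobs_1.json' and 'jobs_2.json' come from the notebook):

```python
import pandas as pd


def load_jobs(path):
    """Load a scraped JSON file into a DataFrame, reporting failures."""
    try:
        return pd.read_json(path)
    except (ValueError, FileNotFoundError) as exc:
        # Fall back to an empty frame so downstream code still runs.
        print(f"Could not load {path}: {exc}")
        return pd.DataFrame()


df1 = load_jobs("jobs_1.json")
df2 = load_jobs("jobs_2.json")
```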
This part of the notebook initializes an empty list to store job titles and iterates through the 'jobTitle' column of the DataFrames to create a corpus of job titles.
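The corpus-building loop amounts to something like the following (the sample job titles are illustrative stand-ins for the scraped data):

```python
import pandas as pd

# Stand-ins for the two DataFrames built from the scraped JSON files.
df1 = pd.DataFrame({"jobTitle": ["Data Engineer", "ML Engineer"]})
df2 = pd.DataFrame({"jobTitle": ["Data Analyst", None]})

# Collect all non-null job titles from both DataFrames into one corpus.
corpus = []
for df in (df1, df2):
    corpus.extend(df["jobTitle"].dropna().tolist())

print(corpus)  # -> ['Data Engineer', 'ML Engineer', 'Data Analyst']
```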
Here, the notebook defines a function to remove noise from text data using the NLTK library. It removes non-alphanumeric characters, converts tokens to lowercase, and eliminates stopwords.
This section involves creating a TF-IDF matrix from the job title corpus. It uses the Scikit-Learn library's TfidfVectorizer to convert the text data into a numerical format for further analysis.
In this part, the notebook applies the K-Means clustering algorithm to cluster job titles based on their TF-IDF representations. It uses Scikit-Learn's KMeans class to create clusters and assigns cluster labels to job titles.
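A minimal version of this clustering step (the number of clusters and random seed here are illustrative choices, not necessarily the notebook's):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["data engineer", "data scientist", "sales manager", "sales representative"]
tfidf = TfidfVectorizer().fit_transform(corpus)

# KMeans accepts the sparse TF-IDF matrix directly.
km = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = km.fit_predict(tfidf)  # one cluster label per job title
```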
This section attempts to visualize the clustered job titles using PCA for dimensionality reduction and Matplotlib and Seaborn for plotting. It generates a scatterplot to visualize the data points in two dimensions.
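The projection and plotting step could look like this; the corpus and clustering parameters are illustrative, and the output file name is an assumption:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["data engineer", "data scientist", "sales manager", "sales representative"]
tfidf = TfidfVectorizer().fit_transform(corpus)
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(tfidf)

# Project the TF-IDF vectors down to two principal components.
coords = PCA(n_components=2).fit_transform(tfidf.toarray())

sns.scatterplot(x=coords[:, 0], y=coords[:, 1], hue=labels)
plt.title("Job title clusters (PCA projection)")
plt.savefig("clusters.png")
```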
Here, the notebook utilizes PySpark to create a SparkContext and a SparkSession. It reads the JSON data from the previously created files ('jobs_1.json' and 'jobs_2.json') into Pandas DataFrames and then converts them into a PySpark DataFrame named 'jobs_dataFrame'.
In the final section, the notebook creates a data processing and modeling pipeline using PySpark's MLlib. It tokenizes the job titles, removes stopwords, calculates TF-IDF features, and applies the K-Means clustering algorithm. The first 25 rows of the clustered results are then displayed.
This Jupyter Notebook provides a comprehensive guide for scraping, cleaning, analyzing, and clustering job data using various Python libraries and tools. It is intended for educational purposes and can be used as a reference for similar data analysis tasks.