Dimensionality-Reduction

Introduction :-

Dimensionality reduction is a technique used to reduce the number of features in a dataset while retaining as much of the important information as possible. In other words, it is a process of transforming high-dimensional data into a lower-dimensional space that still preserves the essence of the original data.

Before looking at why dimensionality reduction matters, let's explore the problems caused by high-dimensional data.

Problems with High Dimensionality :-

High-dimensional data can make learning computationally expensive.
It often leads to over-fitting, which means the model performs well on the training data but poorly on test data.
Data are rarely randomly distributed in high dimensions; features are often highly correlated, sometimes spuriously.
The distances from a point to its nearest and farthest neighbours can become nearly equal in high dimensions, which can hamper the accuracy of distance-based analysis tools.

These limitations make it hard to obtain accurate predictions or classifications from a model.
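
The distance problem listed above can be illustrated with a small sketch. It assumes points drawn uniformly at random from a unit hypercube and uses a hypothetical helper, distance_contrast; neither is taken from the project notebooks.

```python
import numpy as np


def distance_contrast(n_points, n_dims, seed=0):
    """Relative gap between the farthest and nearest point from a random query.

    Values near 0 mean the nearest and farthest points are almost equally far.
    """
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # uniform samples in [0, 1)^n_dims
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()


# The contrast shrinks as the number of dimensions grows.
for dims in (2, 10, 100, 1000):
    print(dims, round(distance_contrast(500, dims), 3))
```

The printed contrast should fall as the dimension increases, which is why nearest-neighbour style methods tend to lose discriminating power on raw high-dimensional features.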

Dimensionality reduction is therefore important because :-

It helps with these problems while trying to preserve most of the relevant information needed to learn accurate, predictive models.
The final prediction often depends on too many factors. These factors are variables called features.
The higher the number of features, the harder it gets to visualize the training set and then work on it.
Many of these features are often correlated, and hence redundant. This is where dimensionality reduction algorithms come into play.
It reduces the time and storage space required.
It helps remove multicollinearity, which improves the interpretation of the parameters of the machine learning model.
It becomes easier to visualize the data when it is reduced to very low dimensions such as 2D or 3D.

Dataset :-

A text dataset generated by prompting an AI is used as the raw data. The data covers 15 Telugu movies across 3 genres (comedy, suspense, and thriller), with 5 movies per genre. The dataset is an .xlsx file in my Git repository. It is laid out so that the movie names appear in the first column and each movie's description of 500-600 words appears in the second column. Altogether, the file contains 7,500-9,000 words of movie descriptions.

Methodology :-

The whole project is divided into 5 tasks, and the specific functionality of each task is as follows :-

Task 1: Text Data Reading and Pre-processing

Objective: In this initial task, the raw text data was ingested and prepared for subsequent analysis.

Approach:

The dataset was read using the Pandas Python library, employing a .csv file format.
The data was organized into two distinct columns: one for movie titles and another for movie descriptions.
The computation of Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in the corpus was automated using the TfidfVectorizer function from the sklearn library.
Standardization of TF-IDF features was performed, centering the data around its mean and scaling it by its standard deviation.
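
The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook code: the in-memory DataFrame, its column names ("movie", "description"), and the three short descriptions are hypothetical stand-ins for the real dataset file.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real file; column names are assumptions.
df = pd.DataFrame({
    "movie": ["Movie A", "Movie B", "Movie C"],
    "description": [
        "a funny comedy about two friends",
        "a tense thriller about a missing friend",
        "a suspense story with a funny twist",
    ],
})

# TF-IDF scores: one row per description, one column per term.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(df["description"])  # sparse matrix

# Mean-centering needs a dense array, so the sparse matrix is converted first.
scaler = StandardScaler()
X_std = scaler.fit_transform(tfidf.toarray())

print(X_std.shape)  # (number of descriptions, number of distinct terms)
```

Converting to a dense array is fine for 15 short documents, but it would not scale to a large corpus, where centering a sparse TF-IDF matrix becomes expensive.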

Task 2: PCA Implementation

Objective: Task 2 focused on the reduction of dimensionality within the TF-IDF features while preserving meaningful information.

Approach:

Principal Component Analysis (PCA), a linear dimensionality reduction technique available in the sklearn library, was utilized.
The explained variance ratio of each component was examined to determine the proportion of variance captured by each.
A bar plot was generated to visualize the explained variance ratio, enabling a reasoned choice for 'k,' the number of principal components to retain.
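
A minimal sketch of this step is shown below. It assumes a random matrix as a stand-in for the standardized TF-IDF features, and the 90% cumulative-variance threshold used to pick 'k' is an illustrative assumption rather than the rule used in the project.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 15 documents, 40 features (the real input is the TF-IDF matrix).
rng = np.random.default_rng(0)
X = rng.normal(size=(15, 40))

# Fit PCA with all components to inspect how variance is distributed.
pca = PCA()
pca.fit(X)

ratios = pca.explained_variance_ratio_  # one value per component, descending
cumulative = np.cumsum(ratios)

# Smallest k whose cumulative explained variance reaches the assumed threshold.
threshold = 0.90
k = int(np.searchsorted(cumulative, threshold) + 1)

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: ratio={r:.3f} cumulative={c:.3f}")
print("chosen k:", k)
```

The ratios array is what the bar plot would display, for example with matplotlib's bar function and the component index on the x-axis.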

Task 3: Projection and Reconstruction

Objective: Task 3 entailed exploring the impacts of dimensionality reduction through PCA on the TF-IDF features and assessing the quality of data reconstruction.

Approach:

The TF-IDF features were projected into a reduced space defined by 'k' principal components.
Subsequently, the reverse operation was executed to reconstruct the original TF-IDF features.
Measurement of the quality of reconstruction was conducted through the computation of Mean Squared Error (MSE) between the original and reconstructed features.
A line plot depicting the reconstruction loss was generated while varying 'k,' allowing for a reevaluation of the choice of 'k' from Task 2 in light of the reconstruction loss results.
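
The projection, reconstruction, and loss computation can be sketched as below. The random input matrix and the helper name reconstruction_mse are assumptions made for illustration; the real input is the standardized TF-IDF matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error

# Stand-in data: 15 documents, 40 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(15, 40))


def reconstruction_mse(X, k):
    """Project X onto k principal components, map back, and return the MSE."""
    pca = PCA(n_components=k)
    Z = pca.fit_transform(X)          # projection into the reduced space
    X_hat = pca.inverse_transform(Z)  # reconstruction in the original space
    return mean_squared_error(X, X_hat)


# With 15 samples, centered data has rank at most 14, so k ranges over 1..14.
losses = {k: reconstruction_mse(X, k) for k in range(1, 15)}

for k, loss in losses.items():
    print(k, round(loss, 4))
```

Plotting losses against k gives the reconstruction-loss curve. The loss can only decrease as k grows, so the useful signal is where the curve flattens rather than where it reaches its minimum.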

Task 4: Interpretation of Results

Objective: Task 4 involved the interpretation of results obtained from PCA analysis and the understanding of the significance of choices made.

Approach:

An analysis of the impact of standardization on PCA results was performed, emphasizing its influence on data distribution and scale.
The choice of 'k' was shown to play a crucial role in balancing the trade-off between information retention and dimensionality reduction.
Additionally, the extraction of the top 'k' terms associated with the final choice of 'k' was conducted, with insights provided regarding their contribution to the underlying structure of the original text data.

Task 5: PCA vs t-SNE Comparison

Objective: In the final task, a comparison between PCA, a linear dimensionality reduction technique, and t-SNE, a non-linear method, was undertaken to assess their effectiveness in data visualization.

Approach:

t-SNE, another dimensionality reduction technique from the sklearn library, was employed to obtain two components for comparison.
Instance types were used as labels to set colors in scatter plots to enhance interpretability.
Scatter plots were generated for both PCA and t-SNE projections, with a qualitative analysis of differences provided, offering insights into the strengths and weaknesses of each method.
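
A rough sketch of the comparison follows. The synthetic three-cluster data, the integer labels, and the perplexity value are assumptions chosen so the example runs on 15 samples; they do not reproduce the project's settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic stand-in: 3 groups of 5 samples each in 20 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 20))
X = np.repeat(centers, 5, axis=0) + rng.normal(size=(15, 20))
labels = np.repeat([0, 1, 2], 5)  # hypothetical class labels used for colouring

# Linear projection onto the first two principal components.
X_pca = PCA(n_components=2).fit_transform(X)

# Non-linear embedding; perplexity must be smaller than the number of samples.
X_tsne = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)

print(X_pca.shape, X_tsne.shape)
```

Each result can be passed to a scatter plot with labels as the colour argument. PCA axes are directions of maximal variance and can be compared across runs, whereas t-SNE axes carry no such meaning and the layout depends on perplexity and the random seed.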

Conclusion

In conclusion, this assignment has provided a comprehensive exploration of text data analysis and dimensionality reduction techniques. A systematic approach to text data preparation and analysis, including dimensionality reduction via PCA and a comparison with t-SNE, has yielded valuable insights into the dataset's structure and the impact of analytical decisions.

Detailed code implementations corresponding to each task can be found in the respective Jupyter notebook files.
