This documentation details the steps and methodologies used to build a predictive model for fraud detection from a training dataset. The objective is to create an accurate model that identifies fraudulent transactions. Local Interpretable Model-agnostic Explanations (LIME) principles are incorporated so that the model's predictions remain interpretable and accessible.
- Libraries for data manipulation, visualization, statistical methods, sampling methods, model selection, dimensionality reduction, simple ML models, and ensemble learning are imported.
- Set a seed for reproducibility.
- Load the dataset using pandas to read the CSV file containing the transaction data.
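The setup steps above can be sketched as follows. The file name and column names (`Time`, `Amount`, `Class`) are illustrative assumptions, since the actual dataset is not shown; a small in-memory CSV stands in for the real file so the snippet is self-contained.

```python
import io

import numpy as np
import pandas as pd

# Seed for reproducibility of any numpy-based sampling and model fitting.
SEED = 42
np.random.seed(SEED)

# In practice this would be pd.read_csv("transactions.csv") (hypothetical
# path); an in-memory CSV is used here so the example runs on its own.
csv_data = io.StringIO(
    "Time,Amount,Class\n"
    "0,149.62,0\n"
    "1,2.69,0\n"
    "2,378.66,1\n"
)
df = pd.read_csv(csv_data)
print(df.shape)  # (3, 3)
```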
- Get a preliminary understanding of the dataset with `df.info()` to check the structure and data types.
- Use `df.describe()` to generate summary statistics of the dataset.
- Analyze the distribution of the target variable to understand the class imbalance.
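A minimal sketch of the class-imbalance check, assuming a binary target column named `Class` (1 = fraud, 0 = legitimate; the name is a hypothetical stand-in):

```python
import pandas as pd

# Toy target: 2% fraud, mirroring the heavy imbalance typical of fraud data.
y = pd.Series([0] * 98 + [1] * 2, name="Class")

counts = y.value_counts()               # absolute counts per class
ratios = y.value_counts(normalize=True) # class proportions
print(counts.to_dict())  # {0: 98, 1: 2}
print(ratios.to_dict())  # {0: 0.98, 1: 0.02}
```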
- Examine the basic statistics and distribution of key predictors like the transaction amount.
- Use histograms and log-scale transformations to visualize data distributions.
- Create a correlation matrix to identify relationships between variables.
- Visualize correlations using a heatmap.
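One way to sketch the correlation step, using synthetic stand-in features (column names are assumptions) and matplotlib's headless backend; `seaborn.heatmap(corr)` is a common alternative for the plot:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is required
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic stand-in for the transaction features.
df = pd.DataFrame({
    "Amount": rng.normal(100, 30, 200),
    "Time": np.arange(200, dtype=float),
})
df["Class"] = (df["Amount"] > 140).astype(int)

# Pairwise correlations between all numeric columns.
corr = df.corr()

# Heatmap of the correlation matrix.
fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
fig.savefig("correlation_heatmap.png")
```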
- Identify columns with missing values.
- Impute missing values using median or mode, depending on the data type.
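The two missing-value steps can be sketched as below, on a small hypothetical frame with one numeric and one categorical column (column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Amount": [10.0, np.nan, 30.0, 40.0],
    "Merchant": ["a", "b", None, "b"],
})

# Report which columns contain missing values.
missing = df.isna().sum()
print(missing[missing > 0])

# Median for the numeric column, mode for the categorical one.
df["Amount"] = df["Amount"].fillna(df["Amount"].median())
df["Merchant"] = df["Merchant"].fillna(df["Merchant"].mode()[0])
```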
- Detect outliers using statistical methods like the z-score.
- Visualize outliers using boxplots and decide on handling strategies.
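A z-score check along the lines described above might look like this (boxplots via `Series.plot.box()` serve the visualization step; the threshold of 2 is an illustrative choice):

```python
import pandas as pd

amounts = pd.Series([10.0, 12.0, 11.0, 13.0, 9.0, 500.0])

# Z-score: how many standard deviations each value lies from the mean.
z = (amounts - amounts.mean()) / amounts.std()

# Flag values more than 2 standard deviations from the mean.
outliers = amounts[z.abs() > 2]
print(outliers.tolist())  # [500.0]
```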
- Identify and remove duplicate observations to ensure data quality.
- Split the dataset into fraudulent and non-fraudulent transactions.
- Use histograms to compare the distribution of transaction amounts for both classes.
- Apply log transformation for better visualization.
- Plot transaction amounts over time to identify any time-based patterns in fraudulent activities.
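The class-wise comparison with a log transform can be sketched as follows; the data is synthetic (fraud amounts are made larger purely for illustration), and `log1p` is used so zero-valued amounts do not break the transform:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is required
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in: fraud amounts skew higher in this toy example.
df = pd.DataFrame({
    "Amount": np.concatenate([rng.exponential(50, 500),
                              rng.exponential(300, 20)]),
    "Class": [0] * 500 + [1] * 20,
})

fraud = df[df["Class"] == 1]
legit = df[df["Class"] == 0]

# Overlaid histograms on a log scale make the skewed amounts comparable.
fig, ax = plt.subplots()
ax.hist(np.log1p(legit["Amount"]), bins=30, alpha=0.5, label="legitimate")
ax.hist(np.log1p(fraud["Amount"]), bins=30, alpha=0.5, label="fraud")
ax.set_xlabel("log(1 + Amount)")
ax.legend()
fig.savefig("amount_by_class.png")
```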
- SMOTE is used to oversample the minority class by creating synthetic examples.
- An undersampling technique such as NearMiss selects examples from the majority class that are close to the minority-class examples.
- Apply both oversampling and undersampling techniques to balance the dataset.
- Apply Principal Component Analysis (PCA) to reduce the number of features while retaining most of the variance in the data.
- Use Singular Value Decomposition (SVD) to decompose the data matrix into singular vectors and values for dimensionality reduction.
- Use LDA to project data in a way that maximizes class separability.
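The three dimensionality-reduction techniques above, sketched with scikit-learn on random stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.integers(0, 2, 100)

# PCA: keep enough components to explain 95% of the variance.
X_pca = PCA(n_components=0.95, random_state=0).fit_transform(X)

# Truncated SVD works directly on the (possibly sparse) data matrix.
X_svd = TruncatedSVD(n_components=5, random_state=0).fit_transform(X)

# LDA projects onto at most (n_classes - 1) dimensions, here 1.
X_lda = LinearDiscriminantAnalysis().fit_transform(X, y)
```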
- Split the data into training and testing sets to evaluate model performance.
- Implement cross-validation techniques to ensure robust model evaluation.
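A sketch of the split-and-validate step, using a synthetic imbalanced dataset from `make_classification`; stratification keeps the fraud ratio consistent across partitions and folds:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)

X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=42)

# Stratified split preserves the minority-class ratio in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratified k-fold cross-validation for a more robust estimate.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=cv)
print(scores.mean())
```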
- Logistic Regression
- k-Nearest Neighbors (k-NN)
- Decision Tree
- Stochastic Gradient Descent (SGD)
- Random Forest
- Stochastic Gradient Boosting
- Stacking
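The model lineup above can be compared in one loop; this is a minimal sketch on synthetic data (the stacking base estimators and the use of F1 as the comparison metric are illustrative choices, not prescribed by the document):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, weights=[0.85, 0.15], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "tree": DecisionTreeClassifier(random_state=42),
    "sgd": SGDClassifier(random_state=42),
    "forest": RandomForestClassifier(random_state=42),
    "gboost": GradientBoostingClassifier(random_state=42),
    "stack": StackingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("rf", RandomForestClassifier(random_state=42))],
        final_estimator=LogisticRegression(max_iter=1000),
    ),
}

# F1 is more informative than accuracy under class imbalance.
results = {name: f1_score(y_test, m.fit(X_train, y_train).predict(X_test))
           for name, m in models.items()}
print(results)
```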
- Summarize findings and recommend the best-performing model based on evaluation metrics.