GMV Forecasting

Our initiative is centered on forecasting Gross Merchandise Volume (GMV) through sophisticated machine learning models. Crafted to navigate the complexities of real-world data scenarios, including cold starts, sparse signals, and voluminous datasets, our forecasting system is equipped with a suite of solutions engineered to methodically overcome these challenges.

Exploratory Data Analysis
- Findings
- Feature Engineering
Methodology
Forecast
- Workflow
- Results
Engineering
- High-Level Flow
- Result Reproducible

Exploratory Data Analysis

Findings:

User Behavior Variability: There is a noticeable fluctuation in user behaviors when observed over various time frames, such as hourly or monthly.
Store Activity: Recent transaction records from around 60,000 stores suggest active commerce. Notably, the store with the highest frequency of visits appears 11 times, indicating its popularity.
Spatial Relationship Analysis: K-means clustering has been applied to categorize spatial locations, effectively preserving their inherent spatial relationships.
Context Feature Exploration: The RMF (Recency, Frequency, Monetary) analysis method has been utilized to pinpoint key context features that are significant.

For a comprehensive examination, please consult the exploratory data analysis notebook provided in the documentation - exploratory data analysis notebook.

Feature Engineering:

Missing Value Imputation: Apply median imputation to fill missing values in numeric features, and mode imputation for categorical features to maintain data integrity.
Context Feature Creation: Develop new context features derived from insights gathered during exploratory data analysis, enhancing model input.
Binarization of Continuous Features: Convert continuous features into binary form using summary statistics thresholds to streamline the xDeepFM model training process.

Methodology

The forecasting of GMV is approached through two steps:

Predicting Purchase Probability: Utilize a Click-Through Rate (CTR) model to predict the likelihood of a user completing a purchase at a given store in the subsequent period.
Estimating GMV: Use the predicted purchase probability in conjunction with the store's average GMV to forecast the forthcoming GMV for a user or a specific day.

For an in-depth understanding, please refer to the solution design documentation - solution design.

Training Model: Extreme DeepFM (xDeepFM)

The Extreme DeepFM (xDeepFM) model is an advanced neural network designed for click-through rate prediction and recommendation systems. It enhances the standard DeepFM framework by integrating a Compressed Interaction Network (CIN) layer, enabling the model to capture complex feature interactions at a higher order with greater efficacy.

Key Components:

Embedding Layer: Transforms categorical features into dense vectors, enabling nuanced feature interactions.
Deep Component: Comprises several fully connected layers that learn intricate data patterns.
Compressed Interaction Network (CIN): Efficiently computes feature interactions across layers, ideal for handling large datasets.

Training Process:

Sampling: Trains on both positive (real transactions) and negative (randomly sampled) labels to differentiate between outcomes.
Loss Function: Employs binary cross-entropy loss to quantify the difference between predicted probabilities and actual labels.
Training Data Generation:
- Implement a strategy to redefine unseen users/stores during model validation and testing to prevent data leakage.
- Utilize a leave-one experiment approach for creating train/validation/test datasets, e.g.,
  - Test: Current month
  - Valid: One month earlier than the test month
  - Train: One month earlier than the validation month

Architecture Overview:

The architecture diagram of xDeepFM illustrates the intricate design of the model, showcasing the embedding layers, deep network components, and the innovative CIN layer for advanced feature interaction learning.

Advantages of xDeepFM:

Complex Feature Learning: The CIN layer in xDeepFM intricately learns feature interactions, enhancing prediction accuracy.
Large-Scale Data Handling: Its efficient computation makes xDeepFM ideal for processing extensive datasets.
Versatile Application:: The model's flexible design adapts to various domains, including recommendation systems and digital marketing.

The xDeepFM model's sophisticated capabilities enable it to grasp the nuanced dynamics between users and stores, refining GMV forecast precision.

Training Strategy: Rolling Window

Temporal Dynamics: A rolling window strategy captures time-dependent patterns, addressing autocorrelation concerns.
Dynamic Adaptation: Regular updates to the training data ensure the model stays attuned to evolving data trends.
Informed Predictions: Historical data within each window informs the model, bolstering prediction reliability.
Feature Enrichment: Rolling windows compute RFM-like metrics, offering additional data insights.

For visual representation, please include the rolling window figure in the final documentation as referenced.

Training Validation

During a typical training month, the AUC score averages around 0.84. There is room for enhancement through careful hyperparameter optimization.

Forecast

Workflow

The process from the start_date to the end_date involves the following steps:
1. Embedding Utilization: Leverage user and store embeddings, adjusted for various dates, in conjunction with context embeddings from the trained model.
2. Top-1 Store Prediction by FAISS: Use FAISS to pinpoint the store where a user is most likely to transact, informed by EDA insights.
3. GMV Estimation: Calculate the expected daily user GMV using the established formula.
Afterward, synthesize the findings to present a dual perspective: individual user behavior and daily aggregate trends.

Results

The forecasted GMV results are accessible as follows:
- User GMV: Available in the file results/user_gmv_20220101_20220131.csv.
- Daily GMV (YayYay): Located in results/daily_gmv_20220101_20220131.csv.
Time-series comparison
- Visualizations and further analysis can be found in the (notebook)[notebooks/forecast_analysis.ipynb]. F
While the evaluation process may be simplified for this challenge, the necessary evaluation functions are prepared for integration within forecaster/evaluation/.
It is recommended to assess the mean-squared errors of the GMV forecasts to gain a deeper understanding of the model's performance.

Engineering

High-Level Flow

The architecture of the forecaster is depicted in the provided diagram, illustrating the system's workflow and component interaction.

Result Reproducible

To replicate the forecasting results, adhere to the following procedure:

Prerequisite: Confirm the installation of Miniconda is installed.
Environment Setup: Use the command for environment preparation.
```
make install
```

Model Training: Initiate rolling-window model training with

Default
```
make train
```

Custom data path:

source activate forecaster
python forecaster/run_training.py \
  --user_data_path "{user_data_path}" \
  --transaction_data_path "{transaction_data_path}" \
  --store_data_path "{transaction_data_path}"

GMV Forecasting: Generate GMV forecasts by running

Default:
```
make forecast
```

Custom data path and predicted date range:

 source activate forecaster
 python forecaster/run_forecasting.py \
   --user_data_path "{user_data_path}" \
   --transaction_data_path "{transaction_data_path}" \
   --store_data_path "{transaction_data_path}" \
   --start_date "{yyyymmdd}" \
   --end_date "{yyyymmdd}"

egpivo / gmv-forecaster Goto Github PK

gmv-forecaster's Introduction

GMV Forecasting

Table of Contents