The purpose of this project is to analyze marketing data from Kaggle. The steps are to get acquainted with the data, clean the data so it is ready for analysis, develop some questions for analysis, and analyze variables within the data to find patterns and insights on these questions.
Files: marketing_data
For my data mining I used one source: Kaggle.
Information regarding the features of the data is located in the Column section on the website.
- There are 28 columns and 2240 rows.
- The name and datatype of each column -- most values are integers in this dataset.
- The income column has missing data, values that are not integers or floats, and an extra space in the column name, so some cleaning will be necessary for this column prior to conducting EDA.
- The column names could be renamed for more consistency.
- Some basic summary statistics on each of the numerical variables.
The Income column needs some cleaning, and the column names are renamed for overall consistency. To do this, the following is done:
- Changed all column names to snake case format using regex and a list comprehension
- Changed Income values to floats
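The two cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the exact notebook code: the raw column names in `raw_columns` and the currency-style income string are assumptions made for the example, and `to_snake_case` and `parse_income` are hypothetical helper names.

```python
import re

def to_snake_case(name):
    """Convert a column name such as 'MntWines' to 'mnt_wines'."""
    # Strip stray spaces, then add an underscore wherever a lowercase
    # letter or digit is directly followed by an uppercase letter.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip()).lower()

def parse_income(value):
    """Turn a raw income value into a float; return None when it is missing."""
    if value is None:
        return None
    # Assumption: non-numeric incomes look like '$84,835.00 ', so everything
    # except digits and the decimal point is dropped before converting.
    cleaned = re.sub(r"[^0-9.]", "", str(value))
    return float(cleaned) if cleaned else None

# Hypothetical raw column names, used only for illustration.
raw_columns = [" Income ", "Year_Birth", "MntWines", "NumDealsPurchases"]
clean_columns = [to_snake_case(col) for col in raw_columns]
```

With a DataFrame, the list comprehension result would be assigned back to the column names, and `parse_income` would be applied to each value of the income column.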
The Income distribution is then looked at using boxplots. Since there is one large outlier, it is removed from marketing_data. Next, the missing values are replaced with the mean income using the .mean() method.
This boxplot showed a major outlier on the right, so I removed it by limiting the income variable in the dataset.
As seen above, after removing the outlier the distribution is more symmetric. There are still some outliers; however, with no major skewness or huge outliers remaining, the income variable is ready for analysis.
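A rough sketch of the outlier removal and mean imputation is shown here. The toy values and the `INCOME_CAP` threshold are assumptions for illustration only; the actual cutoff used in the analysis is not stated.

```python
import pandas as pd

# Toy data standing in for marketing_data; the values are made up.
marketing_data = pd.DataFrame(
    {"income": [30000.0, 52000.0, None, 61000.0, 666666.0]}
)

# Assumed cutoff, chosen only so the example has one clear outlier.
INCOME_CAP = 200_000

# Keep rows below the cap, plus rows with missing income so that
# they can be filled in the next step.
keep = marketing_data["income"].isna() | (marketing_data["income"] < INCOME_CAP)
marketing_data = marketing_data[keep].copy()

# Replace the missing incomes with the mean of the remaining values.
mean_income = marketing_data["income"].mean()
marketing_data["income"] = marketing_data["income"].fillna(mean_income)
```

Removing the outlier before computing the mean keeps the extreme value from inflating the number used to fill the gaps.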
The marketing_data DataFrame contains a year_birth column; however, a column with the age of each customer may be easier for analysis. Because of this, the following is done:
- Added a new column called age by subtracting each value of year_birth from 2020 (the year the dataset is from).
- Removed the outliers in the age column that could affect the analysis by limiting age.
- After removing the major outliers, the age distribution is symmetric and ready for analysis.
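The age steps might look something like the sketch below. The birth years are made up, and `MAX_AGE` is an assumed limit, since the exact cutoff used to remove the outliers is not given.

```python
import pandas as pd

# Toy data standing in for marketing_data; the values are made up.
marketing_data = pd.DataFrame({"year_birth": [1970, 1985, 1996, 1893]})

# Age as of 2020, the year the dataset is from.
marketing_data["age"] = 2020 - marketing_data["year_birth"]

# Assumed limit for illustration; implausibly old customers are dropped.
MAX_AGE = 100
marketing_data = marketing_data[marketing_data["age"] < MAX_AGE].copy()
```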
The education variable is another column to focus on in the analysis. A boxplot is plotted to see if any cleaning is needed before EDA. There is no missing data or other issues.
After some data cleaning and tidying, the DataFrame is ready for EDA. The following are the independent variables to focus on in the analysis:
income
education
age
The goal is to see how these independent variables associate with the following dependent variables:
mnt_wines
mnt_fruits
mnt_meat_products
mnt_fish_products
mnt_sweet_products
mnt_gold_products
num_deals_purchases
num_web_purchases
num_catalog_purchases
num_store_purchases
The hope is to answer the following question:
- Does a shopper's income, education level, and/or age relate to their purchasing behavior?
To observe the dataset as a whole, DataFrame.hist() is used to view the distributions of all numerical variables. Most of the amount bought and number purchased variables are skewed right and have similar distributions.
Next, correlations between all numerical variables are checked using a heat map. The heat map shows that income has the strongest association with numerous variables. Interestingly, it also shows that age may not be a huge factor overall.
The overview below shows that the purchase behavior columns are all skewed to the right.
The table of correlations below does not offer much help as there are too many numbers to read through. However, the heat map shows that income will be the major variable to focus on in the analysis.
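Since the full correlation table is hard to read, one option is to look only at the income column of that table. The sketch below uses made-up values and three illustrative columns; sorting by absolute correlation is an addition for readability, not necessarily something done in the original analysis.

```python
import pandas as pd

# Toy data standing in for marketing_data; the values are made up.
marketing_data = pd.DataFrame({
    "income": [20000, 35000, 50000, 65000, 80000],
    "age": [45, 30, 62, 38, 51],
    "num_store_purchases": [3, 4, 6, 8, 9],
})

# Pairwise Pearson correlations between all numerical columns.
corr = marketing_data.corr()

# Correlations with income only, strongest (by absolute value) first.
income_corr = corr["income"].drop("income").abs().sort_values(ascending=False)
```

The same `corr` table is what a heat map would be drawn from.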
A for loop is used to see the relationship between income and each num_{type}_purchases variable. The hue parameter with the education variable is used to see if there are any patterns that can be deciphered between education and num_{type}_purchases.
Scatterplots are used first, followed by regression plots.
There is a fairly strong, positive linear relationship between income and the following three variables:
num_catalog_purchases
num_store_purchases
num_web_purchases
Between income and num_deals_purchases, however, there is no obvious relationship. It appears there might be a weak, negative linear relationship, but it is not strong enough to be confident. It is also difficult to decipher any patterns associated with education in the plots, so further analysis will be done on this variable.
To get a better look at the linear relationships, .regplot() was used. num_catalog_purchases and num_store_purchases have the strongest positive, linear relationship with income.
These plots also show that income and num_deals_purchases have a linear, negative relationship; however, it is still too weak to be conclusive.
For some further analysis, a new column called total_purchases is added to the marketing_data DataFrame. It is the sum of all num_{type}_purchases variables. The same analysis with the .scatterplot() and .regplot() plot methods is done on this new column.
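A minimal sketch of building total_purchases is shown below, using made-up values for the four num_{type}_purchases columns listed earlier.

```python
import pandas as pd

# Toy data standing in for marketing_data; the values are made up.
marketing_data = pd.DataFrame({
    "num_deals_purchases": [1, 2],
    "num_web_purchases": [4, 3],
    "num_catalog_purchases": [2, 0],
    "num_store_purchases": [6, 5],
})

# Build the num_{type}_purchases column names from the purchase types.
purchase_cols = [
    f"num_{kind}_purchases" for kind in ("deals", "web", "catalog", "store")
]

# Row-wise sum across the purchase columns.
marketing_data["total_purchases"] = marketing_data[purchase_cols].sum(axis=1)
```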
The overall relationship between income and total_purchases is strong and linear. Unfortunately, it is still hard to decipher any relationship between education and total_purchases, as the points are scattered randomly across the plot.
The following analysis is very similar to the previous one. However, instead of looking at the relationship between income and num_{type}_purchases, this analysis looks at the relationship between income and mnt_{type}_products. The steps for this analysis are essentially the same.
These plots all show a positive relationship between income and each mnt_{type}_products variable. However, there is not enough visual evidence to see that it is linear. For further analysis, the log scale of the income variable and the mnt_{type}_products variables are plotted.
With the log scaled variables, it is easy to see there is a fairly strong linear, positive relationship between the variables across the board. It is still hard to see how education plays a role, however.
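One way to produce log scaled columns is sketched below. How the log scale was actually applied in the analysis is not stated, so this is an assumption: `np.log1p` (the log of 1 + x) is swapped in here because amount columns may contain zeros, where a plain log is undefined. The values and the `log_` column prefix are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy data standing in for marketing_data; the values are made up.
marketing_data = pd.DataFrame({
    "income": [25000.0, 50000.0, 100000.0],
    "mnt_wines": [0, 120, 900],
})

# Add a log-scaled copy of each column; log1p keeps zero amounts finite.
for col in ["income", "mnt_wines"]:
    marketing_data[f"log_{col}"] = np.log1p(marketing_data[col])
```

The log_ columns can then be passed to the same scatter and regression plots as before.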
Seaborn's FacetGrid is used to see how education affects purchasing behavior along with income. It gives a much clearer picture than the hue parameter in previous plots. In this analysis, a loop and a dynamic Python variable are used to plot six sets of FacetGrid plots.
After observing the plots detailing the relationship between income, education, and purchasing behavior, the following can be seen:
- This store does not have many shoppers with a Basic education level.
- Regardless of the shopper's educational level, there is a positive, linear relationship for each mnt_{type}_products.
- mnt_wines has the strongest positive, linear relationship with education by income.
The last main variable in our analysis plan is age. The .scatterplot() method is used to see if there is any relationship between age and any purchasing behavior variables. The initial analysis showed no evidence of a relationship, as shown in all the graphs below. The graphs shown are:
- total_purchases vs. age
- mnt_{type}_products vs. age
- num_{type}_purchases vs. age
The process used to plot each one of these graphs is very similar to the one outlined in the Purchasing Behavior by income section.
It is hard to see any relationship between age and total_purchases in this plot.
To do further analysis on the age variable, a new column called age_group is added to marketing_data. It contains the following categories of ages:
18 to 35
36 to 50
51 to 70
71 and Older
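A sketch of one way to create these categories uses pd.cut. The bin edges below are chosen to match the four labels; whether the original analysis used pd.cut or another approach is not stated, and the ages are made up.

```python
import pandas as pd

# Toy data standing in for marketing_data; the values are made up.
marketing_data = pd.DataFrame({"age": [24, 35, 36, 50, 51, 70, 71, 80]})

# Bins are right-inclusive by default: (17, 35], (35, 50], (50, 70], (70, inf].
bins = [17, 35, 50, 70, float("inf")]
labels = ["18 to 35", "36 to 50", "51 to 70", "71 and Older"]

marketing_data["age_group"] = pd.cut(
    marketing_data["age"], bins=bins, labels=labels
)
```

Because the result is an ordered categorical, the groups keep their age order when plotted.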
The age_group variable quickly proved to be much more useful, as a bar chart showed that the 36 to 50 and 51 to 70 age groups dominated shopping at the store.
To take the analysis further, a new DataFrame is created, which only has information about shopper age (age and age_group) and the total purchase amounts each age group buys (mnt_{type}_products). This new DataFrame will have age_group as row data to make plotting a grouped bar graph easier.
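The reshaping described above could be done roughly as follows. This is only a sketch with made-up values and two of the amount columns; `age_group_totals` is a hypothetical name, and the original analysis may have built its DataFrame differently.

```python
import pandas as pd

# Toy data standing in for marketing_data; the values are made up.
marketing_data = pd.DataFrame({
    "age_group": ["18 to 35", "36 to 50", "36 to 50", "51 to 70"],
    "mnt_wines": [100, 300, 200, 400],
    "mnt_fruits": [10, 20, 30, 40],
})

amount_cols = ["mnt_wines", "mnt_fruits"]

# One row per age group, one column per product type, summed amounts as values.
age_group_totals = marketing_data.groupby("age_group")[amount_cols].sum()
```

With age groups as the row index, calling `age_group_totals.plot(kind="bar")` draws one cluster of bars per age group when matplotlib is installed.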
Across the board, age_group does not seem to affect purchasing habits. Wine is the most popular item bought for each age group, followed by meat products. The least popular item for each age group is fruits. The next analysis of interest is to see if age_group affects how many items customers buy each time.
This chart yields some very interesting insights. Here are some notable ones:
- The 18 to 35 and 71 and Older age groups tend to be the least interested in deals.
- On average, 71 and Older age group customers tend to shop the most online, in store, and through the catalog.
- The 36 to 50 and 51 to 70 age groups are interested in deals. Most likely this is because they receive more deals since they have more loyal customers.
This information could be super helpful for a marketing department, as strategies could be used to increase 36 to 50 and 71 and Older customers for the store.
It has been shown that income has the strongest relationship with the purchase behavior of customers. However, interesting insights about education and age, along with age_group, have still been noted. These insights would be very helpful for how this store markets deals to its customers and prices items such as wine, since higher income groups tend to dominate alcohol sales. There is also an opportunity to increase marketing to the 18 to 35 and 71 and Older age groups to drive product sales.