I'm considering whether we should handle categorical variables in plot_correlation. One use case of plot_correlation is plot_correlation(df, x = label), which ranks the features by how strongly they are correlated with the label. For this scenario, it is important to have a uniform way to measure correlation for both categorical and continuous variables.
My idea is to add one measure that handles categorical variables, such as Cramer's V (based on the chi-square test) or the Uncertainty Coefficient (based on mutual information). For a continuous variable, we would bin its values and treat the result as a categorical variable.
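To make the idea concrete, here is a minimal sketch of the Cramer's V option, using only the Python standard library. The names equal_width_bins and cramers_v are hypothetical and are not part of the existing plot_correlation code. Equal-width binning and the default of 10 bins are assumptions; quantile bins would work equally well. The sketch applies no bias correction and does not cover the Uncertainty Coefficient.

```python
import math
from collections import Counter


def equal_width_bins(values, n_bins=10):
    """Map each number to the index of an equal-width bin (0 .. n_bins - 1).

    Hypothetical helper: the bin count and equal-width strategy are assumptions.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has a single bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall into bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def cramers_v(xs, ys):
    """Cramer's V between two equal-length sequences of category labels."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    joint = Counter(zip(xs, ys))
    k = min(len(x_counts), len(y_counts))
    if n == 0 or k < 2:
        # V is undefined for an empty or constant column; this sketch reports 0.
        return 0.0
    # Pearson chi-square statistic over the full contingency table.
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (k - 1)))


# Usage: a continuous feature is binned first, then compared with the label.
label = ["yes", "no", "yes", "no"]
age = [61.0, 23.0, 58.0, 30.0]
print(cramers_v(equal_width_bins(age, n_bins=2), label))
```

Because both categorical and binned continuous columns go through the same function, the resulting values lie in [0, 1] and can be sorted directly to rank features against the label.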
Adding this measure would require one more tab in the current output of plot_correlation(df) and plot_correlation(df, x), showing the Cramer's V or Uncertainty Coefficient for all columns. Please let me know what you think. @dovahcrow @jnwang @Waterpine @brandonlockhart