Wednesday, June 3, 2015

SAS Predictive Discriminant Analysis (1)

  Discriminant analysis is very useful for studying group effects on a set of variables and for classifying records.
A descriptive discriminant analysis typically follows a multivariate analysis of variance (MANOVA) when the group means test as significantly different; it gives the discriminant function of the variables that is most responsible for separating the group means. A predictive discriminant analysis is a technique for classifying records. The classifications are based on discriminant scores, which are found by applying the discriminant functions to the variables. The predictors need to be continuous and multivariate normal (the method is fairly robust to deviations from normality), and the covariance matrices of the predictors need to be similar across classes; most importantly, discriminant analysis is very sensitive to outliers. This technique deserves more attention: in Data Mining for Business Intelligence, Shmueli notes that when the assumptions are reasonably met, discriminant analysis is 30% more efficient than logistic regression, meaning it needs 30% less data to reach the same accuracy level.
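
To make "discriminant score" concrete, here is the usual textbook form of the linear discriminant function, written under the equal-covariance assumption described above. The symbols are my own notation, not taken from the SAS output: x is the vector of predictors, \mu_k is the mean vector of class k, \Sigma is the pooled covariance matrix, and \pi_k is the prior probability of class k.

\[
d_k(x) = \mu_k^{\top} \Sigma^{-1} x - \tfrac{1}{2}\, \mu_k^{\top} \Sigma^{-1} \mu_k + \ln \pi_k
\]

A record is assigned to the class with the largest score d_k(x).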

  Anyway, the point of this post is to use SAS to do discriminant analysis; the dataset I will use can be found at the following link: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
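
If you still need to load and split the data, the following is a rough sketch of one way to do it, not necessarily how I prepared mine. The file path, the 70% sampling rate, the seed, and the data set names bank, bank_flagged, and banktrain are all assumptions for illustration; bank-full.csv from the UCI archive is semicolon-delimited.

```sas
/* Hypothetical path: point this at your downloaded copy of bank-full.csv */
proc import datafile="/path/to/bank-full.csv" out=bank dbms=dlm replace;
   delimiter=';';
   getnames=yes;
   guessingrows=1000;
run;

/* STRATA requires the data to be sorted by the stratum variable */
proc sort data=bank;
   by y;
run;

/* OUTALL keeps every record and adds Selected = 1 for sampled rows, 0 otherwise */
proc surveyselect data=bank out=bank_flagged method=srs
                  samprate=0.7 seed=2015 outall;
   strata y;
run;

/* Sampled rows become the training set; the rest become the test set */
data banktrain banktest;
   set bank_flagged;
   if selected = 1 then output banktrain;
   else output banktest;
   drop selected;
run;
```

Stratifying on y keeps the share of yes responses roughly the same in both pieces, which matters here because yes is the rare class.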

  The predicted variable in the data set is y: whether or not the person will accept the term deposit offer.
I will use balance, duration, and previous as predictors. Please note that these predictors deviate a lot from multivariate normality and have some very extreme outliers; you can refer to my other discriminant analysis post to learn how to deal with them.
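
If you want to see these problems for yourself, a minimal sketch of the checks looks like this. It assumes a training data set called banktrain, which is a hypothetical name; substitute your own. Keep in mind that univariate normality of each predictor is necessary but not sufficient for multivariate normality, so this is only a first look.

```sas
/* Assumption: banktrain is a hypothetical name for your training data set */

/* Normality tests plus box plots and probability plots to spot outliers */
proc univariate data=banktrain normal plots;
   var balance duration previous;
run;

/* POOL=TEST requests a test of homogeneity of the within-class covariance matrices */
proc discrim data=banktrain pool=test;
   class y;
   var balance duration previous;
run;
```

With POOL=TEST, PROC DISCRIM pools the covariance matrices (a linear discriminant function) only if the homogeneity test is not significant; otherwise it uses the within-class matrices (a quadratic discriminant function).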

  Assuming you have already split the data set, you are now ready to fit the model. PROC DISCRIM is the standard procedure for discriminant analysis, but I created the following macro to make using it more efficient.

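%macro pdiscrim(train, test, class, var);

/* Step 1: fit on the training data; OUTSTAT= saves the discriminant functions */
proc discrim data=&train outstat=discrim_fun noprint;
   class &class;
   var &var;
run;

/* Step 2: read the saved functions back in and classify the test data */
proc discrim data=discrim_fun testdata=&test;
   class &class;
run;

%mend pdiscrim;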

The parameters are:
train: the name of the training dataset,
test: the name of the testing dataset,
class: the class variable to be predicted,
var: the list of predictors.

Next, you just need to specify values for the parameters.
In my case, the predictors are balance, duration, and previous, and the training data set is sub.
Sub is my oversampled training data set (I will explain this in a later post).

%let xlist= balance duration previous;                                                                                                
%pdiscrim(sub,banktest,y,&xlist);

The output is a confusion matrix.


Specificity (predicting no when the actual value is no) is 85.44% and sensitivity (predicting yes when the actual value is yes) is 59.91%. That sensitivity is much higher than the naive classification sensitivity of 11.77% (found with PROC FREQ). You may not get as high a sensitivity when you play with it yourself; if so, try an over-sampling technique (which I will cover in another post), since over-sampling makes the model more aggressive in detecting rare events.
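
If you would rather compute these rates from a data set than read them off the printed output, here is a sketch of one way to do it. It repeats the two PROC DISCRIM steps and adds the TESTOUT= option, which writes each test record together with its predicted class in the variable _INTO_. The output data set name scored is my own placeholder.

```sas
/* Fit on the training data (sub) and save the discriminant functions */
proc discrim data=sub outstat=discrim_fun noprint;
   class y;
   var balance duration previous;
run;

/* Classify the test data; TESTOUT= stores the predicted class in _INTO_ */
proc discrim data=discrim_fun testdata=banktest testout=scored noprint;
   class y;
run;

/* Rows are actual classes, columns are predicted classes;
   the row percentages give specificity (no row) and sensitivity (yes row) */
proc freq data=scored;
   tables y*_INTO_ / nocol nopercent;
run;

/* Class proportions in the test data, for the naive benchmark */
proc freq data=banktest;
   tables y;
run;
```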

I skipped the whole exploratory analysis section; I will pick up some interesting points from the EDA in a future post.











