Wednesday, June 3, 2015

SAS Predictive Discriminant Analysis (1)

  Discriminant analysis is very useful for studying group effects on a set of variables and for classifying records.
A descriptive discriminant analysis typically follows a multivariate analysis of variance when the group means have tested significantly different; it gives the discriminant function of the variables that is most responsible for separating the group means. A predictive discriminant analysis is a technique we can use to classify records. The classifications are based on discriminant scores, which are found by applying the discriminant functions to the variables. The predictors need to be continuous and multivariate normal (the method is fairly robust to deviations from normality), and the covariance matrices of the predictors need to be similar across classes; most importantly, discriminant analysis is very sensitive to outliers. This technique deserves more attention: in the book Data Mining for Business Intelligence, Shmueli notes that when the assumptions are reasonably met, discriminant analysis is 30% more efficient than logistic regression, meaning that discriminant analysis needs 30% less data to reach the same accuracy level as logistic regression.

  Anyway, the point of this blog post is to use SAS to do discriminant analysis; the dataset I will use can be found at the following link: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

  The predicted variable in the data set is y: whether or not the person will accept the term deposit offer.
I will use balance, duration, and previous as predictors. Please note that these predictors deviate a lot from multivariate normality and have some very extreme outliers; you can refer to my other discriminant analysis post to learn how to deal with them.
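The next step assumes the data has already been split into training and test sets. A minimal sketch of one way to do that with PROC SURVEYSELECT is shown here. The input name bank, the output name banktrain, the 70/30 split, and the seed are all assumptions for illustration; only banktest matches a name used later in this post.

```sas
/* Hypothetical input: work.bank holds the full Bank Marketing data.   */
/* OUTALL keeps every record and adds a 0/1 flag variable, Selected.   */
proc surveyselect data=bank out=bank_split outall
                  method=srs samprate=0.7 seed=12345;
run;

/* Route flagged records to the training set and the rest to the test set */
data banktrain banktest;
   set bank_split;
   if selected = 1 then output banktrain;
   else output banktest;
   drop selected;
run;
```

Fixing the seed makes the split reproducible, which helps when you compare different sets of predictors on the same test records.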

  Assuming you have already split the data set, you are now ready to fit the model. PROC DISCRIM is the standard procedure for discriminant analysis, but I created the following macro to make the workflow more efficient.

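%Macro pdiscrim(train, test, class, var);

/* Fit on the training data; save the discriminant functions to discrim_fun */
proc discrim outstat=discrim_fun data=&train noprint;
class &class;
var &var;
run;

/* Classify the test data with the saved functions and print the results */
proc discrim data=discrim_fun testdata=&test;
class &class;
run;

%Mend pdiscrim;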

The parameters are:
train: the name of the training dataset,
test: the name of the testing dataset,
class: the class variable to be predicted,
var: the predictors

Next, you just need to specify values for the parameters.
For the predictors, I used balance, duration, and previous. For the training data set I used sub,
which is my oversampled training data set (I will explain this in a later post).

%let xlist= balance duration previous;                                                                                                
%pdiscrim(sub,banktest,y,&xlist);

The output is a confusion matrix.


Specificity (predicting no when the actual value is no) is 85.44% and sensitivity (predicting yes when the actual value is yes) is 59.91%. This sensitivity is much higher than the naive classification sensitivity of 11.77% (found with PROC FREQ). You may not get as high a sensitivity when you try this yourself; if so, try an over-sampling technique (which I will cover in another post). Over-sampling makes the model more aggressive in detecting rare events.
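If you want the classified test records in a data set rather than only in the printed output, one option is the TESTOUT= option of PROC DISCRIM, which writes each test record along with its predicted class in the variable _INTO_. The sketch below assumes the macro above has already been run (so discrim_fun exists) and that the test data set and class variable are banktest and y as in the call above; the output name scored is hypothetical.

```sas
/* Score the test data again, keeping the predictions in work.scored */
proc discrim data=discrim_fun testdata=banktest testout=scored noprint;
   class y;
run;

/* Cross-tabulate actual class (rows) against predicted class (columns). */
/* With only row percentages shown, the yes row gives sensitivity and    */
/* the no row gives specificity.                                         */
proc freq data=scored;
   tables y*_INTO_ / nocol nopercent;
run;
```

Having the predictions in a data set also makes it easy to inspect the misclassified records directly.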

I skipped the whole exploratory analysis section; I will pick up some interesting points from the EDA in a future post.












Monday, June 1, 2015

Cleaning data to do time series analysis.

 I will use Tableau’s sample data set coffee-chain to demonstrate how to clean up data to perform time series analysis.

 /*importing data using proc import*/

 proc import out=work.coffee
   datafile="C:/Users/lgh2811/Desktop/time_series/coffee.csv"
   dbms=csv replace;
   getnames=yes;
   datarow=2;
 run;

 /*print first 10 observations of the dataset*/

 proc print data=coffee(obs=10); run;

 /*Please note that date is in datetime format, which includes both a date part and a time part. We want to use the month as the id for the time series, so we need to keep only the date part of the date variable; we can do this with the datepart() function. intck('month',date1,date2) counts the number of month boundaries between the two dates. We add 1 so that the first month is 1 instead of 0.*/


 data coffee;
   set coffee;
   month=intck('month','01JAN2012'd,datepart(date))+1;
 run;
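To see how the month index behaves, here is a small check you can run on its own. The datetime and date literals are made-up values chosen for illustration, not values taken from the coffee data.

```sas
data _null_;
   /* datepart() strips the time portion from a datetime value */
   d  = datepart('15JAN2012:08:30:00'dt);
   /* Same month as the start date: 0 boundaries crossed, so index 1 */
   m1 = intck('month', '01JAN2012'd, d) + 1;
   /* January 2012 to March 2013 crosses 14 month boundaries, so index 15 */
   m2 = intck('month', '01JAN2012'd, '01MAR2013'd) + 1;
   put d= date9. m1= m2=;
run;
```

The log should show d=15JAN2012 m1=1 m2=15. Because intck counts month boundaries rather than elapsed days, every date within the same calendar month gets the same index.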


 /*Next, we need to group the records by the month number. We are interested in the sum of sales by month; so we can use the proc sql statement below to do that.*/

 proc sql;
   create table coffee1 as
   select month, sum(sales) as sales
   from coffee
   group by month;
 quit;

 /*printing a sample to make sure the table is the way we wanted*/
 proc print data=coffee1(obs=10); run;