Course 4 Week 1: Assignment

This week’s assignment involves decision trees, and more specifically, classification trees. Decision trees are predictive models that allow for a data-driven exploration of nonlinear relationships and interactions among many explanatory variables in predicting a response or target variable. When the response variable is categorical (two levels), the model is called a classification tree. Explanatory variables can be quantitative, categorical, or both. Decision trees create segmentations, or subgroups, in the data by repeatedly applying a series of simple rules or criteria that choose the variable combinations that best predict the response (i.e. target) variable.

A dataset called treeaddhealth was created by the course instructors for this particular week’s assignment, and since I am a bit overworked this week, I am going to use that same dataset and do my assignment with a few specific changes. In the course videos, a decision tree was grown for a variable named TREG1 to create a model that correctly classifies those people who have smoked on a regular basis. Several explanatory variables, both categorical and quantitative, were used for that model, and we then saw the results it produced.

In my assignment, I am choosing Alcohol Use Without Supervision as my target variable, to see what the model predicts. Before modeling, though, I changed the category values for the alcohol use variable, alcevr1: I recoded 0 (No) as 2 (No), while 1 (Yes) remained as it is. The reason for this change was mentioned in the video: SAS predicts the lowest value of the target variable, which would make the model event level zero, or no. So I needed to recode the no’s for alcohol use to a two, keeping one equal to yes. To be able to interpret the trees correctly, it’s important to pay attention to this detail. I did this using a PROC SQL step, and the rest can be seen in the program below:
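The actual recode was done in SAS with PROC SQL (see the linked program); purely as an illustration of the mapping, here is the same recode sketched in plain Python. The function name and the sample values are mine, not from the assignment code.

```python
# Hypothetical illustration of the recode done in SAS PROC SQL:
# alcevr1 = 0 (No) becomes 2, alcevr1 = 1 (Yes) stays 1, missing stays missing.
def recode_alcevr1(value):
    """Recode so the lowest value of the target is 1 = Yes, the event of interest."""
    if value is None:          # missing data stays missing
        return None
    return 2 if value == 0 else value

sample = [0, 1, 1, 0, None]
recoded = [recode_alcevr1(v) for v in sample]
print(recoded)  # [2, 1, 1, 2, None]
```

After this recode the lowest non-missing value of the target is 1 (Yes), so the model event level becomes “yes” rather than “no”.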

GitHub Link

Decision tree analysis was performed to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable. All possible separations (categorical) or cut points (quantitative) are tested. For the present analyses, the entropy “goodness of split” criterion was used to grow the tree and a cost complexity algorithm was used for pruning the full tree into a final subtree.
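To make the entropy “goodness of split” criterion concrete, here is a small self-contained sketch of how a candidate split is scored: the split that most reduces the class entropy of the parent node is preferred. The functions and the toy counts are illustrative, not taken from the assignment output.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of the class counts at a node."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def split_gain(parent, children):
    """Entropy reduction achieved by splitting `parent` into child nodes."""
    n = sum(parent)
    weighted_child = sum(sum(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted_child

# Toy node: 60 "yes" / 40 "no", split into two candidate child nodes.
gain = split_gain([60, 40], [[45, 5], [15, 35]])
print(round(gain, 3))
```

A pure node (all one class) has entropy 0 and a 50/50 node has entropy 1, so the tree grows by choosing, at each node, the split with the largest gain.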

The following explanatory variables were included as possible contributors to a classification tree model evaluating alcohol use (my response variable): age, gender, race/ethnicity (Hispanic, White, Black, Native American, and Asian), smoking experimentation, marijuana use, cocaine use, inhalant use, availability of cigarettes in the home, whether or not either parent was on public assistance, any experience with being expelled from school, alcohol problems, deviance, violence, depression, self-esteem, parental presence, parental activities, family connectedness, school connectedness, and grade point average.

Now interpreting the results obtained:

We can see in the Model Information table that the decision tree that SAS grew has 189 leaves before pruning and 7 leaves following pruning.

Model event level lets us confirm that the tree is predicting the value one, that is yes, for our target variable Alcohol Use.

Notice too that the number of observations read from my data set was 6,564 while the number of observations used was only 4,575. 4,575 represents the number of observations with valid data for the target variable, and each of the explanatory variables. Those observations with missing data on even one variable have been set aside.
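This is listwise (complete-case) deletion: a row is used only if the target and every explanatory variable are non-missing. A minimal sketch of the idea, with invented rows and variable names, just to show how the “used” count ends up smaller than the “read” count:

```python
# Listwise deletion sketch: keep only rows with valid data for the target
# and every explanatory variable. Rows and field names are made up.
rows = [
    {"alcevr1": 1, "age": 16, "deviant1": 0.1},
    {"alcevr1": 2, "age": None, "deviant1": 0.0},   # dropped: missing age
    {"alcevr1": None, "age": 15, "deviant1": 0.3},  # dropped: missing target
    {"alcevr1": 1, "age": 17, "deviant1": 0.5},
]
used = [r for r in rows if all(v is not None for v in r.values())]
print(len(rows), len(used))  # 4 2
```

With 21 variables in the model, even modest per-variable missingness can set aside a large share of the sample, which is how 6,564 observations read becomes 4,575 used.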

Next, by default PROC HPSPLIT creates a plot of the cross-validated average square error, ASE, against the number of leaves for each of the trees generated on the training sample.

A vertical reference line is drawn at the tree with the number of leaves that has the lowest cross-validated ASE, in this case the 7-leaf tree. The horizontal reference line represents that minimum ASE plus one standard error. This 1-SE rule is applied when pruning via the cost-complexity method, in order to potentially select a smaller tree that has only a slightly higher error rate than the minimum-ASE tree. Selecting the smallest tree that has an ASE below the horizontal reference line is, in effect, implementing the 1-SE rule, and by default SAS uses this rule to select and display the final tree.

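The 1-SE selection rule itself is simple enough to sketch. The triples of (leaves, cross-validated ASE, standard error) below are invented for illustration; they are not the values from my pruning plot.

```python
# 1-SE pruning rule sketch: choose the smallest subtree whose cross-validated
# ASE is within one standard error of the minimum ASE. Numbers are made up.
def one_se_choice(candidates):
    """candidates: list of (n_leaves, ase, se) tuples. Returns chosen n_leaves."""
    best_leaves, best_ase, best_se = min(candidates, key=lambda t: t[1])
    threshold = best_ase + best_se          # horizontal reference line
    eligible = [c for c in candidates if c[1] <= threshold]
    return min(eligible, key=lambda t: t[0])[0]   # smallest eligible tree

trees = [(1, 0.240, 0.004), (4, 0.205, 0.004),
         (7, 0.196, 0.004), (30, 0.195, 0.004), (189, 0.210, 0.005)]
print(one_se_choice(trees))  # 7
```

Here the 30-leaf tree has the minimum ASE, but the 7-leaf tree sits below the threshold line, so the simpler tree is selected, which is exactly the trade-off the rule is designed to make.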

Following the pruning plot, which chose a model with 14 split levels and 7 leaves, the final, smaller tree is presented. It shows our model, with splits on alcohol problems, marijuana use, and deviant behavior.

Alcohol Problems was the first variable to separate the sample into two subgroups. Adolescents with an alcohol problems score greater than or equal to 0.060 (range 0 to 6) were more likely to have used alcohol than adolescents not meeting this cutoff; in fact, 100% of them had. This is easy to interpret: the variable is discrete and takes values from 0 to 6, so any adolescent whose alcohol problems score is anything but 0 must, by definition, have been using alcohol without supervision, which is our target variable. For such adolescents the probability is therefore 1, or 100%, and this is exactly what the decision tree shows.

Of the adolescents with alcohol problems scores less than 0.060, a further subdivision was made with the dichotomous variable of marijuana use. Adolescents who reported having used marijuana were more likely to have been using alcohol without supervision, about 76% of them.

Among adolescents with alcohol problems scores less than 0.060 who had never used marijuana, a further subdivision was made with the deviance score. Adolescents with a deviance score greater than or equal to 0.270 were more likely to have been using alcohol without supervision (56%). Adolescents with a deviance score less than 0.270, who also had no alcohol problems and had never used marijuana, were less likely to have been using alcohol without supervision: 78% of them had not.

SAS also generated a model-based confusion matrix which shows how well the final classification tree performed.

The total model correctly classifies 71% of those who have used alcohol without supervision (one minus the error rate of .29) and 81% of those who have not (one minus the 19% error rate). So we are able to predict fairly well both those who used alcohol during adolescence and those who did not.
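Those two percentages are the sensitivity and specificity of the classifier, each read off one row of the confusion matrix. The counts below are invented to match the reported rates; only the 71% and 81% come from my output.

```python
# Sensitivity and specificity from confusion matrix counts.
# Counts are hypothetical, chosen to reproduce the reported 71% / 81% rates.
def rates(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)   # fraction correct among actual "yes"
    specificity = tn / (tn + fp)   # fraction correct among actual "no"
    return sensitivity, specificity

sens, spec = rates(tp=710, fn=290, tn=810, fp=190)
print(round(sens, 2), round(spec, 2))  # 0.71 0.81
```

Reporting both rates matters here, since a model could reach high overall accuracy while badly misclassifying the smaller class.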

Finally, SAS also shows a variable importance table. Because decision trees attempt to maximize correct classification with the simplest tree structure, it’s possible for variables that do not represent primary splits in the model to still be notably important in predicting the target variable. When potential explanatory variables are highly correlated, or provide similar information, some of them tend not to make the final cut as splitting variables. The absence of such an alternate variable from the model does not necessarily suggest that it’s unimportant, but rather that it’s masked by another.

To evaluate this phenomenon of masking, an importance measure is calculated both for the primary splitting variables and for competing variables that were not selected as primary predictors in the final model. The importance score measures a variable’s ability to mimic the chosen tree and to play the role of a stand-in for variables appearing as primary splits. Here we see that Alcohol Problems and Marijuana Use are the two most important variables for this particular model.

That is it for this week’s assignment.