Predicting Blood Donations – DrivenData

Data Analysis and Interpretation Capstone WEEK 1:

I’ll be doing the Blood Donation Prediction problem from DrivenData: https://www.drivendata.org/competitions/2/warm-up-predict-blood-donations/

Here are more details:

  1. Research Question:
    This problem is hosted on DrivenData, and the dataset comes from a mobile blood donation vehicle in Taiwan. The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive. Given information about past donors, we want to predict whether or not a donor will give blood the next time the vehicle comes to campus. It’s a supervised machine learning classification problem.
  2. My motivation:
    This is my first time trying an actual machine learning problem, and I wanted to start easy and simple. That’s why I chose this problem: I was able to grasp it easily, and I will try to do my best.
  3. Potential Implications
    I have been working as a Business Analyst for the past 8 months, and my usual working domain is time series data. I have yet to work in other analytics domains, so this problem can help me understand a bit about other machine learning techniques.

K-means Cluster Analysis

Course 4 Week 4: Assignment

Introduction

In this final week I’m going to talk about cluster analysis, and specifically the k-means clustering technique. The goal of cluster analysis is to group, or cluster, observations into subsets based on the similarity of their responses on multiple variables: observations with similar response patterns are grouped together to form clusters. In other words, the goal is to partition the observations in a data set into a smaller set of clusters, with each observation belonging to exactly one cluster.

Cluster analysis is an unsupervised learning method, meaning there is no specific response variable included in the analysis.

Below is the assignment; since the program code is really long, I’ve put it at the end of the blog post.

Assignment

A k-means cluster analysis was conducted to identify underlying subgroups of adolescents based upon the similarity of responses on 11 variables that represent characteristics that could have an impact on adolescent depression.

The clustering variables included two binary variables indicating whether or not the adolescent had ever used alcohol or marijuana. Quantitative variables included a scale measuring problems with alcohol usage, a scale measuring engagement in deviant behavior (such as vandalism, other property damage, lying, stealing, running away from home, driving without parental permission, selling drugs and unexcused school absence), and scales measuring violent behavior, self-esteem, parental presence, parental activities, family connectedness, school connectedness and academic achievement (measured as grade point average). All of the clustering variables were standardized to a mean of 0 and standard deviation of 1.

Using simple random sampling, the data were split into a training set that included 70% (N=3201) of the observations and a test set that included 30% (N=1701). A series of k-means cluster analyses was conducted on the training set specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables accounted for by the clusters (the R-square value) was plotted for each of the nine cluster solutions in an elbow curve, to provide visual guidance for choosing the number of clusters to interpret.
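Before looking at the elbow plot, here is roughly what those steps might look like in SAS. This is only a sketch, not the actual assignment program (which is linked at the end of the post); the dataset name work.subset and the AddHealth-style variable names are my assumptions:

* standardize the 11 clustering variables to mean 0 and standard deviation 1;
PROC STANDARD DATA=work.subset OUT=clustvar MEAN=0 STD=1;
VAR alcevr1 marever1 alcprobs1 deviant1 viol1 esteem1 parpres paractv famconct schconn1 gpa1;
RUN;

* 70/30 simple random split into training and test sets;
PROC SURVEYSELECT DATA=clustvar OUT=traintest SEED=123 SAMPRATE=0.7 METHOD=srs OUTALL;
RUN;

DATA clus_train clus_test;
SET traintest;
IF selected=1 THEN OUTPUT clus_train;
ELSE OUTPUT clus_test;
RUN;

* k-means (Euclidean distance) for k = 1 through 9 -- the OUTSTAT data sets hold the R-square values used for the elbow plot;
%MACRO kmean(K);
PROC FASTCLUS DATA=clus_train OUT=outdata&K. OUTSTAT=cluststat&K. MAXCLUSTERS=&K. MAXITER=300;
VAR alcevr1 marever1 alcprobs1 deviant1 viol1 esteem1 parpres paractv famconct schconn1 gpa1;
RUN;
%MEND;
%kmean(1); %kmean(2); %kmean(3); %kmean(4); %kmean(5); %kmean(6); %kmean(7); %kmean(8); %kmean(9);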

[Figure: elbow curve of R-square values for the k = 1 to 9 cluster solutions]

We can see bends in the curve at 2, 4, 6 and 7 clusters. A two-cluster solution yields only one canonical variable, so it can’t be examined in a two-dimensional scatterplot; we therefore check the variability of the data for the 4-, 6- and 7-cluster solutions, to see whether the clusters overlap, whether the patterns of means on the clustering variables are unique and meaningful, and whether there are significant differences between the clusters on our external validation variable, Depression Level.

Here are the plots of 4, 6 and 7:

[Figures: scatterplots of the first two canonical variables for the 4-, 6- and 7-cluster solutions]

The scatterplot of the first two canonical variables for the 4-cluster solution indicates that the observations are densely packed, with low within-cluster variance, and that the clusters do not overlap significantly. The 6- and 7-cluster solutions, on the other hand, were for the most part distinct, but with greater spread among the observations, suggesting higher within-cluster variance and the least compact clusters of all. These plot results suggest that an optimal cluster solution may have 4 or fewer clusters, so the 4-cluster solution was examined further.

So we look into the cluster means table for the 4-cluster solution to observe the relationships:

[Figure: cluster means table for the 4-cluster solution]

The means of the clustering variables show that compared to the other clusters, adolescents in cluster 3 had a relatively low likelihood of ever having used alcohol or marijuana, or of having problems with alcohol, deviant behavior, or violence. Furthermore, cluster 3 adolescents also showed higher self-esteem, school and family connectedness, and achieved the highest GPA scores compared with the other adolescent clusters.

Cluster 1 adolescents appear to be the most troubled adolescents, having the highest likelihood of alcohol and marijuana use, alcohol-related problems, deviant behavior, and violence. Additionally, they showed the lowest self-esteem, school, and family connectedness and lowest GPA scores.

Finally, in order to externally validate the clusters, an analysis of variance (ANOVA) was performed to test for significant differences between the clusters on adolescent depression. The boxplot clearly shows that Depression Level is the highest in cluster 1 and the lowest in cluster 3.

[Figure: boxplot of Depression Level by cluster]

A Tukey test was used for post hoc comparisons between the clusters. The results indicated significant differences between the clusters on Depression Level, with the Tukey post hoc comparisons showing significant differences between every pair of clusters.

[Figure: Tukey post hoc comparison results]

Adolescents in cluster 1 had the highest Depression Level (Mean=15.3777778, SD=8.5677158), and cluster 3 had the lowest Depression Level (Mean=6.2380637, SD=4.8378852).
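For reference, the external validation step might look roughly like this in SAS; merged_clus (the training data merged with the 4-cluster assignments) and the depression variable dep1 are assumed names, not the actual program:

* ANOVA of the depression score by cluster, with Tukey post hoc comparisons;
PROC ANOVA DATA=merged_clus;
CLASS cluster;
MODEL dep1 = cluster;
MEANS cluster / TUKEY;
RUN;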

And that’s the final assignment of this fourth course of this specialization. Below is the program code. Thanks for reading.

Program

GitHub Link


Running a Lasso Regression Analysis

Course 4 Week 3: Assignment

This week’s assignment involves another machine learning technique called lasso regression. Lasso regression is a penalized regression method, often used in machine learning to select a subset of variables. It is a supervised machine learning method. Specifically, LASSO is a shrinkage and variable selection method for linear regression models; LASSO is an acronym for Least Absolute Shrinkage and Selection Operator.

I again use Alcohol Use Without Supervision as my response (target) variable.
The candidate explanatory variables include gender; race; marijuana, cocaine, and inhalant use; regular smoking; availability of cigarettes in the home; whether or not either parent was on public assistance; and any experience with being expelled from school. Age, alcohol problems, deviance, violence, depression, self-esteem, parental presence, activities with parents, family and school connectedness, and grade point average are the quantitative variables.

PROGRAM

GitHub Link

__________________________________________________________________________________________

Data were randomly split into a training set that included 65% of the observations (N=2972) and a test set that included 35% of the observations (N=1600) out of total 4572 observations. The least angle regression algorithm with k=10 fold cross-validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross-validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
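As a rough sketch (not the actual program, which is linked above), the lasso fit via least angle regression with 10-fold cross-validation might look like this in SAS; the training/test data set names and the AddHealth-style variable names are my assumptions:

* lasso via LAR, choosing the model by 10-fold cross-validation;
PROC GLMSELECT DATA=train TESTDATA=test PLOTS=all SEED=123;
MODEL alcevr1 = male hispanic white black namerican asian age treg1
marever1 cocever1 inhever1 cigavail passist expel1 alcprobs1 deviant1
viol1 dep1 esteem1 parpres paractv famconct schconn1 gpa1
/ SELECTION=lar(CHOOSE=cv STOP=none) CVMETHOD=random(10);
RUN;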

Here are the results:

[Figure: selection summary with training and test ASE at each step]

The ASE and Test ASE columns show the average squared error (the same as the mean squared error) for the training and the test data. At the beginning there are no predictors in the model, just the intercept. Variables are then entered one at a time, in order of the magnitude of the reduction in the average squared error, so they are ordered by how important they are in predicting alcohol use (plain R-squared would not work for this, since it always increases as explanatory variables are added). According to the lasso regression results, the most important predictor of alcohol use is marijuana use, followed by alcohol problems, deviant behavior, and so on.

Moreover, SAS produces a really brilliant graph, shown below:

[Figure: coefficient progression plot]

The coefficient progression plot shows the change in the regression coefficients at each step, with the vertical line marking the selected model. This plot shows the relative importance of the predictors selected at each step of the selection process, how the regression coefficients changed as each new predictor was added, and the steps at which each variable entered the model.

We can also see whether each variable is related to the response positively or negatively: variables whose paths lie below the horizontal zero line are negatively related, and those above it positively.

The lower plot shows how the chosen selection criterion, in this example CVPRESS (the residual sum of squares summed across all the cross-validation folds in the training set), changes as variables are added to the model. Initially it decreases rapidly, and then it levels off to a point where adding more predictors doesn’t lead to much further reduction in the residual sum of squares.

During the estimation process, marijuana use and age were most strongly associated with alcohol use, followed by deviant behavior and cocaine abuse. Marijuana use, age, and deviant behavior were negatively associated with alcohol use, while cocaine abuse was positively associated.

Finally, the output below shows the R-Square and adjusted R-Square for the selected model and the mean square error for both the training and test data. It also shows the final 12 explanatory variables for our model and what their estimated regression coefficients would be for the selected model.

[Figure: fit statistics and parameter estimates for the selected model]

So that’s all from this week.

Running a Random Forest

Course 4 Week 2: Assignment

This week’s assignment involves another machine learning technique called random forests. Random forests are predictive models that allow for a data-driven exploration of many explanatory variables in predicting a response or target variable. They provide importance scores for each explanatory variable and also let you evaluate any increase in correct classification as smaller and larger numbers of trees are grown.

Random forest is a data mining algorithm based on decision trees, but it proceeds by growing many trees, i.e. a decision tree forest, which in some ways directly addresses the problem of model reproducibility.

We use Alcohol Use Without Supervision as the response (target) variable.
The candidate explanatory variables include gender; race; marijuana, cocaine, and inhalant use; regular smoking; availability of cigarettes in the home; whether or not either parent was on public assistance; and any experience with being expelled from school, plus the quantitative variables age, alcohol problems, deviance, violence, depression, self-esteem, parental presence, activities with parents, family and school connectedness, and grade point average.
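As a rough sketch (not the actual program, which is linked below), the forest might be grown like this in SAS with PROC HPFOREST, declaring categorical and quantitative inputs separately; the variable names here are my assumptions:

* random forest -- nominal target plus nominal and interval inputs;
PROC HPFOREST;
TARGET alcevr1 / LEVEL=nominal;
INPUT male hispanic white black namerican asian treg1 marever1 cocever1 inhever1 cigavail passist expel1 / LEVEL=nominal;
INPUT age alcprobs1 deviant1 viol1 dep1 esteem1 parpres paractv famconct schconn1 gpa1 / LEVEL=interval;
RUN;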

Program

GitHub Link

_______________________________________________________________

We get the following results from running the above code:

[Figure: baseline fit statistics]

The number of observations read from my data set was 6,504, while the number of observations used was 6,444. The baseline fit statistics output displays the misclassification rate of the random forest: here the forest misclassified 19.8% of the sample, meaning it correctly classified 80.2% of the sample.

Next we have the variable importance table:

[Figure: variable importance table]

The variables are listed from highest importance to lowest importance in predicting alcohol use. In this way, random forests are sometimes used as a data reduction technique, where variables are chosen in terms of their importance to be included in regression and other types of statistical models. Here we see that some of the most important variables in predicting alcohol use include marijuana use, deviant behavior, regular smoking, cigarette availability, race, inhalant use, cocaine use, etc.

To summarize, like decision trees, random forests are a type of data mining algorithm that can select from among a large number of variables, those that are most important in determining the target or response variable to be explained. Also, like decision trees, the target variable in a random forest can be categorical or quantitative. And the group of explanatory variables can be categorical or quantitative, or any combination.

Thus this concludes this week’s assignment.

Running a Classification Tree: Decision Tree

Course 4 Week 1: Assignment

This week’s assignment involves decision trees, and more specifically, classification trees. Decision trees are predictive models that allow for a data-driven exploration of nonlinear relationships and interactions among many explanatory variables in predicting a response or target variable. When the response variable is categorical (two levels), the model is called a classification tree. Explanatory variables can be quantitative, categorical or both. Decision trees create segmentations or subgroups in the data by applying a series of simple rules or criteria over and over again, choosing at each step the variable and split that best predict the response (i.e. target) variable.

A new dataset called treeaddhealth was created by the course instructor for this particular week’s assignment, and since I am a bit overworked this week, I am going to use that dataset and adapt the assignment with my own changes. In the course videos, a decision tree was built for a variable named TREG1, to create a model that correctly classifies people who have smoked on a regular basis, using several categorical and quantitative explanatory variables, and we then examined the results.

In my assignment, I am choosing Alcohol Use Without Supervision as my response variable, to see what the model predicts. Before modeling, however, I changed the category values for the Alcohol Use variable (alcevr1): 0 (No) became 2, while 1 (Yes) remained as it is. The reason for this change was mentioned in the video: SAS predicts the lowest value of the target variable, which would make the model event level zero, or “no”. So I recoded the no’s for alcohol use to a two, keeping one equal to yes; to interpret your trees correctly, it’s important to pay attention to this detail. I did this using PROC SQL (a sketch follows; the rest can be seen in the program below):
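Here is a minimal sketch of that recode, assuming the source table is treeaddhealth and the recoded variable is called alcuse; the actual code is in the program linked below:

* recode alcevr1 so that No (0) becomes 2 and Yes (1) stays 1;
PROC SQL;
CREATE TABLE work.new AS
SELECT *, CASE WHEN alcevr1 = 0 THEN 2 ELSE alcevr1 END AS alcuse
FROM treeaddhealth;
QUIT;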

GitHub Link

Decision tree analysis was performed to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable. All possible separations (categorical) or cut points (quantitative) are tested. For the present analyses, the entropy “goodness of split” criterion was used to grow the tree and a cost complexity algorithm was used for pruning the full tree into a final subtree.

The following explanatory variables were included as possible contributors to a classification tree model evaluating alcohol use (my response variable): age, gender, race/ethnicity (Hispanic, White, Black, Native American and Asian), smoking experimentation, marijuana use, cocaine use, inhalant use, availability of cigarettes in the home, whether or not either parent was on public assistance, any experience with being expelled from school, alcohol problems, deviance, violence, depression, self-esteem, parental presence, parental activities, family connectedness, school connectedness and grade point average.
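Putting this together, the tree itself might be grown with something like the following PROC HPSPLIT call (a sketch under assumed variable names, with entropy growth and cost-complexity pruning as described above):

* classification tree -- grow with entropy, prune with cost-complexity;
PROC HPSPLIT SEED=123;
CLASS alcuse male hispanic white black namerican asian treg1 marever1 cocever1 inhever1 cigavail passist expel1;
MODEL alcuse = age male hispanic white black namerican asian treg1 marever1 cocever1 inhever1 cigavail passist expel1 alcprobs1 deviant1 viol1 dep1 esteem1 parpres paractv famconct schconn1 gpa1;
GROW entropy;
PRUNE costcomplexity;
RUN;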

Now interpreting the results obtained:

[Figure: model information table]

We can see in the Model Information table that the decision tree SAS grew has 189 leaves before pruning and 7 leaves after pruning.

Model event level lets us confirm that the tree is predicting the value one, that is yes, for our target variable Alcohol Use.

Notice too that the number of observations read from my data set was 6,564 while the number of observations used was only 4,575. 4,575 represents the number of observations with valid data for the target variable, and each of the explanatory variables. Those observations with missing data on even one variable have been set aside.

Next, by default PROC HPSPLIT creates a plot of the cross-validated average squared error (ASE) for trees with different numbers of leaves, generated on the training sample.

[Figure: plot of cross-validated ASE by number of leaves]
A vertical reference line is drawn at the tree with the number of leaves that has the lowest cross-validated ASE; in this case, the 7-leaf tree. The horizontal reference line represents that minimum ASE plus one standard error. This one-SE rule is applied when pruning via the cost-complexity method, to potentially select a smaller tree that has only a slightly higher error rate than the minimum ASE. Selecting the smallest tree with an ASE below the horizontal reference line in effect implements the one-SE rule, and by default SAS uses this rule to select and display the final tree.


Following the pruning plot, which chose a model with 14 split levels and 7 leaves, the final, smaller tree is presented; it shows our model, with splits on alcohol problems, marijuana use, and deviant behavior.

[Figure: final pruned classification tree]

Alcohol Problems was the first variable to separate the sample into two subgroups. Adolescents with an alcohol problems score greater than or equal to 0.060 (the scale ranges from 0 to 6) were more likely to have used alcohol than adolescents below this cutoff; in fact, 100% of them had. This is easy to interpret: the variable takes discrete values from 0 to 6, so any adolescent whose alcohol problems score is anything but 0 must, by definition, have been using alcohol without supervision, which is our target variable. For such adolescents the predicted probability is therefore 1, or 100%, and this is exactly what the decision tree shows.

For the adolescents with scores less than 0.060, a further subdivision was made on the dichotomous marijuana use variable. Adolescents who reported having used marijuana were more likely to have been using alcohol without supervision (about 76%).

Among adolescents with an alcohol problems score less than 0.060 who had never used marijuana, a further subdivision was made on the deviance score. Adolescents with a deviance score greater than or equal to 0.270 were less likely to have been using alcohol without supervision (56%), while adolescents with a deviance score less than 0.270 who had no alcohol problems and had never used marijuana were more likely to have been using alcohol without supervision (78%).

SAS also generated a model-based confusion matrix which shows how well the final classification tree performed.

[Figure: model-based confusion matrix]

The total model correctly classifies 71% of those who used alcohol without supervision (one minus the error rate of 0.29) and 81% of those who did not (one minus the 19% error rate). So we are able to predict fairly well both those who used alcohol during adolescence and those who did not.

Finally, SAS also shows a variable importance table. Because decision trees attempt to maximize correct classification with the simplest tree structure, it’s possible for variables that do not represent primary splits in the model to still be of notable importance in predicting the target variable. When potential explanatory variables are highly correlated, or provide similar information, some of them tend not to make the final cut; the absence of such an alternate variable from the model does not necessarily suggest that it’s unimportant, but rather that it’s masked by the other.

[Figure: variable importance table]

To evaluate this masking phenomenon, an importance measure is calculated for the primary splitting variables and for competing variables that were not selected as primary predictors in the final model. The importance score measures a variable’s ability to mimic the chosen tree and to stand in for variables appearing as primary splits. Here we see that Alcohol Problems and Marijuana Use are the two most important variables for this particular model.

That is it for this week’s assignment.

Assignment 4: Test a Logistic Regression Model (Does using protection really decrease the chance of getting an STD?)

Course 3 Week 4:

SUMMARY

This last week, the assignment is about logistic regression: the study of relationships where the response variable is binary, i.e. it takes only two values, 0 (the absence of something) or 1 (the presence of something). This notion of 0 meaning absence and 1 meaning presence is not a hard and fast rule; it can be the other way around as well.

So this week I’ll check whether addiction to any kind of alcohol or drug increases the probability of being affected by a sexually transmitted disease, and by how much. Moreover, I am taking an additional variable describing the use of contraception, to check whether using contraception really decreases the probability of getting an STD.

My initial hypothesis is that both addiction and not using protection will increase the chance of getting an STD, but I believe not using protection might overpower the analysis and turn out to be a confounding variable, since it is simply the more directly relevant factor.

VARIABLES

I am using the already available AddHealth dataset for my week 4 analysis.

Binary Explanatory Variables: Addiction and Protection
In Addiction: 0 = “Doesn’t use any alcohol or drug”, 1 = “Addicted”

In Protection: 0 = “Contraception used”, 1 = “Contraception not used”
(The coding is reversed from the usual convention because I want to see the effect of not using contraception on getting an STD.)

Binary Response Variable: STD (0 means no STD and 1 means the person suffers from an STD)

PROGRAM

FILENAME REFFILE '/home/amitamoladun0/course/addhealth_pds.csv';
PROC IMPORT DATAFILE=REFFILE
DBMS=CSV
OUT=WORK.IMPORT;
GETNAMES=YES;
RUN;

DATA new;set IMPORT;
Keep H1CO16F H1CO16G H1CO16H H1CO16I H1CO16J H1FV5 H1FV7 H1FV8 H1FV10 H1CO3 Addict Alco Marijuana Coke Inhal Ille STD;

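* recode each substance-use item (H1TO15, H1TO32, H1TO36, H1TO39, H1TO42) into 4 ordinal levels (roughly, higher = heavier use);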
IF (H1TO15 EQ 7) OR (H1TO15 EQ 96) OR (H1TO15 EQ 97) OR (H1TO15 EQ 98) then Alco=1;
ELSE IF (H1TO15 EQ 5) OR (H1TO15 EQ 6) then Alco=2;
ELSE IF (H1TO15 EQ 3) OR (H1TO15 EQ 4) then Alco=3;
ELSE Alco=4;

IF (H1TO32 EQ 0) OR (H1TO32 EQ 996) OR (H1TO32 EQ 997) OR (H1TO32 EQ 998) OR (H1TO32 EQ 999) then Marijuana=1;
ELSE IF (H1TO32 GE 1) AND (H1TO32 LE 10) then Marijuana=2;
ELSE IF (H1TO32 GE 11) AND (H1TO32 LE 99) then Marijuana=3;
ELSE Marijuana=4;

IF (H1TO36 EQ 0) OR (H1TO36 EQ 996) OR (H1TO36 EQ 997) OR (H1TO36 EQ 998) OR (H1TO36 EQ 999) then Coke=1;
ELSE IF (H1TO36 GE 1) AND (H1TO36 LE 2) then Coke=2;
ELSE IF (H1TO36 GE 3) AND (H1TO36 LE 5) then Coke=3;
ELSE Coke=4;

IF (H1TO39 EQ 0) OR (H1TO39 EQ 996) OR (H1TO39 EQ 997) OR (H1TO39 EQ 998) OR (H1TO39 EQ 999) then Inhal=1;
ELSE IF (H1TO39 GE 1) AND (H1TO39 LE 4) then Inhal=2;
ELSE IF (H1TO39 GE 5) AND (H1TO39 LE 29) then Inhal=3;
ELSE Inhal=4;

IF (H1TO42 EQ 0) OR (H1TO42 EQ 996) OR (H1TO42 EQ 997) OR (H1TO42 EQ 998) OR (H1TO42 EQ 999) then Ille=1;
ELSE IF (H1TO42 GE 1) AND (H1TO42 LE 2) then Ille=2;
ELSE IF (H1TO42 GE 3) AND (H1TO42 LE 10) then Ille=3;
ELSE Ille=4;

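* overall addiction level = the highest level reached on any substance;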
IF (Alco EQ 4) OR (Marijuana EQ 4) OR (Ille EQ 4) OR (Inhal EQ 4) then Addict=4;
ELSE IF (Alco EQ 3) OR (Marijuana EQ 3) OR (Ille EQ 3) OR (Inhal EQ 3) then Addict=3;
ELSE IF (Alco EQ 2) OR (Marijuana EQ 2) OR (Ille EQ 2) OR (Inhal EQ 2) then Addict=2;
ELSE Addict=1;

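* flag STD=1 if any of the H1CO16A-J items indicates an STD diagnosis;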
IF (H1CO16A EQ 1) OR (H1CO16B EQ 1) OR (H1CO16C EQ 1) OR (H1CO16D EQ 1) OR (H1CO16E EQ 1) OR (H1CO16F EQ 1) OR (H1CO16G EQ 1) OR (H1CO16H EQ 1) OR (H1CO16I EQ 1) OR (H1CO16J EQ 1) then STD=1;
ELSE STD=0;

Label Addict="Addicted to either alcohol or any drug."
          STD="Diagnosed with some kind of Sexually Transmitted Disease.";
run;

Data New2; Set New;
If Addict=1 then Addiction=0;
Else Addiction=1;

Proc Format ;
Value Add 0="Doesn't use any alcohol or drug"
                   1="Addicted";
Value ST 0="No"
                1="Yes";
Value Vio 0="No"
                  1="Yes";

Proc Freq ;
Tables STD Addiction;
Format Addiction add. STD ST.;
Run;

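* simple logistic regression of STD on Addiction (DESCENDING makes STD=1 the modeled event);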
Proc logistic descending;
model STD=Addiction;
run;

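* treat the value 7 in the contraception item H1CO3 as missing;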
Data New3; Set New2;
If H1CO3 EQ 7 then H1CO3=.;
run;

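* drop rows with any missing value, then dichotomize contraception use into Protection;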
Data New4; Set New3;
if cmiss(of _all_) then delete;
If H1CO3 EQ 1 then Protection=0;
Else Protection=1;

Proc Format ;
Value Pro 0="Contraception used"
1="Contraception not used";

Proc Freq; Tables STD Addiction Protection;
Format Addiction add. STD ST. Protection pro.;

Proc logistic descending; model STD=Protection;
Proc logistic descending; model STD=Addiction Protection;
Run;

RESULTS

After a lot of data management and creating binary explanatory and response variables, I ran logistic regression in SAS using PROC LOGISTIC.

First the result of Addiction on STD:

[Figure: logistic regression output for STD on Addiction]

We see that there is a significant association between addiction and STDs. The odds ratio of 3.264 suggests that a person addicted to alcohol or any other kind of drug is 3.264 times more likely to catch an STD. Moreover, we can say with 95% confidence that if another sample were taken, the odds ratio estimate would lie between 2.332 and 4.566.
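As a side note on where these numbers come from: the odds ratio is just the exponentiated logistic regression coefficient. Assuming the parameter estimate behind this table was roughly β ≈ 1.183 (back-calculated from the odds ratio, since the raw estimate isn’t reproduced here), then OR = exp(β) ≈ exp(1.183) ≈ 3.264, and the 95% confidence limits come from exp(β ± 1.96 × SE(β)).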

On the other hand, for protection (contraception use), the result is:

[Figure: logistic regression output for STD on Protection]

We see that there is a significant association between protection and STDs as well. The odds ratio of 1.58 suggests that a person not using any kind of protection during intercourse is 1.58 times more likely to catch an STD. Moreover, we can say with 95% confidence that if another sample were taken, the odds ratio estimate would lie between 1.164 and 2.145.

Now this becomes interesting, as both explanatory variables are significant and show some association with the response variable. So, to see whether one variable has a confounding effect on the other, we run the logistic regression with both explanatory variables and get this result:

[Figure: logistic regression output for STD on Addiction and Protection]

As soon as we put the two variables into the logistic regression model simultaneously, Addiction is no longer significantly associated with STDs, while Protection remains significant; Protection therefore turns out to be the confounding variable. Thus we can say that a young adult not using protection during intercourse is 1.574 times more likely to get an STD than a young adult who does use protection, after controlling for addiction.

Thus my initial hypothesis was correct, and that’s the end of this course.

Thanks for reading. 😀

Assignment 3: Test a Multiple Regression Model

Course 3 Week 3:

RESEARCH QUESTION

In last week’s assignment, we checked whether the CO2 emissions of a country are related to its citizens’ average life expectancy, and by various tests we concluded that there is no significant association between the two. The question now is: if not CO2 emissions, then which variable might be associated with life expectancy? In other words, was there a confounding variable that we neglected in our analysis? For that purpose, I included more explanatory variables: alcohol consumption and the urban rate of a country.


VARIABLES

I am using the already available Gapminder dataset for my week 3 analysis.

Quantitative Explanatory Variables: co2emissions urbanrate alcconsumption
Quantitative Response Variable: lifeexpectancy

I then use their centered values, obtained by subtracting each variable’s mean from its values.

PROGRAM

FILENAME REFFILE '/home/amitamoladun0/course/gapminder.csv';

PROC IMPORT DATAFILE=REFFILE
DBMS=CSV
OUT=WORK.IMPORT;
GETNAMES=YES;
RUN;

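* scatterplot of life expectancy against CO2 emissions with a fitted regression line and confidence limits;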
proc sgplot ;
reg x=co2emissions y=lifeexpectancy / lineattrs=(color=blue thickness=2) clm;
yaxis label="Life Expectancy";
xaxis label="CO2 Emissions";
run;

proc sgplot ;
reg x=co2emissions y=lifeexpectancy / lineattrs=(color=blue thickness=2) degree=1 clm;
reg x=co2emissions y=lifeexpectancy / lineattrs=(color=green thickness=2) degree=2 clm;
yaxis label="Life Expectancy";
xaxis label="CO2 Emissions";
run;

Proc Means Data=work.Import; var co2emissions urbanrate alcconsumption;
run;

* centering quantitative explanatory variables;

Data new; Set import;
if co2emissions ne . and lifeexpectancy ne . and urbanrate ne .;
co2emissions_c= co2emissions-5531075587;
urbanrate_c= urbanrate-56.7693596;
alcconsumption_c=alcconsumption-6.6894118;
run;

Proc means ;var urbanrate_c co2emissions_c alcconsumption_c;
run;

proc sgplot ;
reg x=urbanrate_c y=lifeexpectancy / lineattrs=(color=blue thickness=2) degree=1 clm;
yaxis label="Life Expectancy";
xaxis label="Urban Rate";
run;

proc sgplot ;
reg x=alcconsumption_c y=lifeexpectancy / lineattrs=(color=green thickness=2) degree=1 clm;
yaxis label="Life Expectancy";
xaxis label="Alcohol Consumption";
run;

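* multiple regression of life expectancy on the three centered predictors, with parameter estimates and confidence limits;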
PROC glm ;
model lifeexpectancy= urbanrate_c co2emissions_c alcconsumption_c/solution clparm;
run;

PROC glm PLOTS(unpack)=all;
model lifeexpectancy=urbanrate_c co2emissions_c alcconsumption_c/solution clparm;
output residual=res student=stdres out=results;
run;

OUTPUT FOR CHECKING THE CENTERING

The quantitative explanatory variables were centered by subtracting their means, and respective new variables were created. The table below shows the results before and after centering.

[Figure: PROC MEANS output before and after centering]

Other OUTPUT & RESULTS

The various SGPLOT procedures gave the following graphs:

The first graph is the regression line plot of Life Expectancy against CO2 Emissions. We further see that a degree-two polynomial regression fit is better than the linear fit, which is shown very well in the second graph.

[Figures: linear fit, and linear vs. quadratic fits, of Life Expectancy on CO2 Emissions]

Along similar lines, we check how the other two variables under consideration relate to our response variable; the results are given below:

[Figures: regression fits of Life Expectancy on Urban Rate and on Alcohol Consumption]

Now, with the GLM procedure, we examine the R-square values for these fits and test whether they are significant enough to reject our null hypothesis that there is no association.

So the result for all our centered variables is:

[Figure: GLM parameter estimates for the centered explanatory variables]

We clearly see that CO2 Emissions, just as in last week’s assignment, is not significant, indicating no association between the two variables. Moreover, its estimate being 0, with a 95% CI from -0 to 0, shows that keeping CO2 Emissions in the model doesn’t affect the model at all; in fact, removing the variable from the analysis produces no change in the R² value.

On the other hand, Alcohol Consumption and Urban Rate, both with p-values < 0.05, show a significant association with our response variable, Life Expectancy. Moreover, the R² value of 0.41 indicates that the model explains 41% of the variability of the response around its mean, which is fairly good.

Analyzing other plots and graphs:

[Figure: Q-Q plot of the residuals]

A Q-Q plot plots the quantiles of the residuals that we would theoretically see if the residuals followed a normal distribution against the quantiles of the residuals estimated from our regression model. We want to see whether the points follow a straight line; deviations from the line mean the model’s estimated residuals are not what we would expect if the residuals were normally distributed. We see above that the residuals deviate considerably, indicating that they do not follow a perfectly normal distribution. This could mean that the linear association we observed in our scatterplot may not be a good estimate. I did check other polynomial forms of the same variable, but none gave a significant association. So I can say that there may be other explanatory variables we might consider including in our model that could improve estimation.

Another important plot is the standardized residual plot, shown below:

[Figure: standardized residual plot]

If we take a look at this plot, we see that most of the residuals fall within about one standard deviation of the mean, roughly between -1 and 1.5. A few countries (9, precisely) have residuals that are more than two standard deviations above or below the mean of zero; for the standard normal distribution, we would expect 95% of the residuals to fall within two standard deviations of the mean. There are no observations that are three or more standard deviations from the mean, so we do not appear to have any extreme outliers.

And at last a leverage plot:

[Figure: leverage plot]

Finally, we can examine a leverage plot to identify observations that have an unusually large influence on the estimation of the predicted value of the response variable. The leverage of an observation can be thought of in terms of how much the predicted scores for other observations would differ if the observation in question were not included in the analysis. Leverage always takes values between zero and one; a point with zero leverage has no effect on the regression model. Outliers are observations with residuals greater than two or less than negative two. SAS shows outliers as red symbols, observations with high leverage values as green symbols, and observations that are both outliers and high leverage as brown symbols.

We see in the leverage plot that we have a few outliers, i.e. countries with a residual greater than 2 or less than -2. These outliers have small (close to zero) leverage values, meaning that although they are outlying observations, they do not appear to have a strong influence on the estimation of the regression parameters. On the other hand, there are a few cases with higher than average leverage, and one in particular is more obvious in terms of influencing the estimation of the predicted value of Life Expectancy. This observation has high leverage but is not an outlier; we don’t have any observations that are both high leverage and outliers.

That brings me to the conclusion of this week’s assignment. I found that CO2 Emissions have no association with Life Expectancy at all, whereas two confounding variables, Alcohol Consumption and Urban Rate, do affect the life expectancy of a country.

See you next week, when I’ll be talking about logistic regression.

Assignment 2: Test a Basic Linear Regression Model

Course 3 Week 2:

RESEARCH QUESTION

Is CO2 emission of a country related to its citizens’ average life expectancy?
After a little searching on the internet, I found that various studies have already been done on this topic, all indicating that people living in countries with low carbon emissions can attain a reasonably high life expectancy. I’ll try to see whether I come to the same conclusion in my analysis.

VARIABLES

I created a new data set using the available data from http://www.gapminder.org/data. Both my explanatory variable and my response variable represent change over 20 years.

Quantitative Explanatory Variable: co2final(representing the change in CO2 emission from 1989 to 2009)

Quantitative Response Variable: clefinal (representing the change in life expectancy from 1989 to 2009)

Centered Explanatory Variable: cco2final (the mean is very close to zero)

Centered Explanatory Variable without Two Extreme Outliers: nocincomeperperson (the mean is very close to zero)

PROGRAM

Data new; set Import;
Keep country co2_89 co2_09 le89 le09 co2final clefinal;

LABEL co2_89="CO2 Emissions in 1989"
co2_09="CO2 Emissions in 2009"
le89="Life Expectancy in 1989"
le09="Life Expectancy in 2009";

co2final=co2_09-co2_89;
clefinal=le09-le89;

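* treat changes in life expectancy of 6 or more years (either direction) as extreme and set them to missing;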
If clefinal ge 6 OR clefinal le (-6) then clefinal=.;

PROC MEANS; var co2final;
run;

Data new1; set new;
Keep country co2_89 co2_09 le89 le09 co2final cco2final clefinal;

cco2final= co2final-55029.50;
PROC MEANS; var co2final;
PROC MEANS; var cco2final;

PROC GLM; Model clefinal=cco2final;
Run;

OUTPUT FOR CHECKING THE CENTERING

The quantitative explanatory variable, co2final, was centered by subtracting its mean. A new variable named cco2final was created to serve as the centered version of co2final; its mean is very close to zero. The means procedure output for co2final and its centered form, cco2final, is as follows:

[Figure: PROC MEANS output for co2final and cco2final]


In addition to this, I found that CO2 emissions from India and China were far higher than the rest, making them outliers in my data. After removing these outliers, the result of the means procedure changed to the following:

[Figure: PROC MEANS output after removing the outliers]


OUTPUT AND RESULT FOR THE LINEAR REGRESSION MODEL

(WITH AND WITHOUT EXTREME OUTLIERS)

OUTPUT WITH OUTLIERS

[Figures: GLM output with extreme outliers included]

There were 167 countries in this test, including the extreme outliers. The result of the linear regression model indicated that co2final (beta/regression coefficient = 0.000000703, p = 0.3456) was not significant, so there is no evidence of an association between CO2 emission and life expectancy.

The r² (r-square) of 0.005 suggests that knowing the change in CO2 emissions allows us to predict only 0.5% of the variability we will see in life expectancy.

OUTPUT WITHOUT TWO EXTREME OUTLIERS

After removing the two countries from the CO2 emissions data, and also removing outliers from life expectancy, the result of the GLM analysis is:

[Figures: GLM output with extreme outliers removed]

There were 115 countries in this test. The result of the linear regression model indicated that co2final (beta/regression coefficient = 0.000003272, p = 0.1697) was still not significant, so there is still no association between CO2 emission and life expectancy after the removal of the outliers.

The r² (r square) of 0.016625 suggests that if we know the CO2 emission change, we can predict only 1.66% of the variability we will see in life expectancy.

The estimates changed somewhat after the extreme outliers were removed, but the conclusion stayed the same.

Thus I conclude that there is no significant association between CO2 Emission of a country and the Life Expectancy of that country.


Effects of Corruption on Inflation and common people

Originally posted as an answer to a question on Quora: What is the link between inflation and corruption?

I can’t cite what I am going to answer, but as a student of Economics, I hope I do justice to this question.

As per the definition of Corruption by Corruptie.org :

  1. Corruption is the misuse of public power (by elected politician or appointed civil servant) for private gain.
  2. In case of private between individuals and businesses –
    Corruption is the misuse of entrusted power (by heritage, education, marriage, election, appointment or whatever else) for private gain.
  3. And now for the most complicated but to the point definition:
    Corruption is an improbity or decay in the decision-making process in which a decision-maker consents to deviate or demands deviation from the criterion which should rule his or her decision-making, in exchange for a reward or for the promise or expectation of a reward, while these motives influencing his or her decision-making cannot be part of the justification of the decision.

So now that we have defined corruption, let’s see how it can affect inflation.

When someone commits some kind of fraud, it very often involves money, and money is what drives inflation. But how exactly can corruption affect inflation? Good question, right?

If the person who committed the fraud then does something involving that money, it might indeed affect inflation. When a person or entity takes money that was intended for someone else or somewhere else, they can do it in one of two ways: either out in the open, or without letting anyone know about the fraud.

The first kind is what I refer to as legal corruption: corruption carried out within legal bounds. For example, a tender awarded by the head of a governing body to one of his relations, even though it was supposed to go to some other party. The head could do this because he was the only one who knew all the tender values; in this way he committed fraud in a formally legal manner, since no one would ever doubt his act and the companies obviously wouldn’t collude.

The example above is made up purely to explain the point; what actually happens in reality is not known to me. So anyway, moving on.

So in the case of legal corruption, the money is not parked anywhere that prevents it from being injected back into the market, so it doesn’t really affect the market. But if the corruption is illegal, and the money is parked somewhere and doesn’t go directly into the free market, then it decreases the money supply in the money market. And that is something that does affect inflation in a big way.

To explain this, I am using some specific points from The link between Money Supply and Inflation (you can read the whole article at that link):

  • In normal economic circumstances, if the money supply grows faster than real output it will cause inflation.
  • In a depressed economy (liquidity trap) this correlation breaks down because of a fall in the velocity of circulation. This is why, in a depressed economy, central banks can increase the money supply without causing inflation; this occurred in the US between 2008 and 2011, with a large increase in the money supply and no inflation.
  • However, when the economy recovers and the velocity of circulation rises, an increased money supply is likely to cause inflation (the equation of exchange below makes the mechanism explicit).
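These points are essentially the quantity theory of money, which the equation of exchange summarizes (this is a standard identity, not something specific to the article above):

M × V = P × Y

where M is the money supply, V the velocity of circulation, P the price level and Y real output. If V and Y are roughly stable, growth in M must show up in P, i.e. inflation; in a liquidity trap V falls, which is why M can grow without P rising.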

In our case, corruption has decreased the money supply, so I reckon it would lead to a decrease in the level of inflation.

Now here’s the thing: the parked money is eventually spent, so how does it affect the economy?

The answer is this:

The money from illegal corruption is parked somewhere outside institutions like banks, which hampers its injection back into the economy. It does eventually get injected, that’s absolutely true; what good is money taken and never used? But we should keep in mind the time value of money. During the gap before the parked money’s eventual injection, the market has already been affected, and that hurts the economy in a big way. So yes, the money does get injected eventually, but by then the damage is done.

Note: do share or tell me if there is some error in my assumptions or anything else. I am quite new to economics, and I would really appreciate it if you pointed out my mistakes here. Thanks!

Assignment 1: Writing About Your Data

Course 3 Week 1:
In the first week of the Course 3 of specialization, we are asked to give details about the sample we have chosen for our analysis, the procedure for collecting the data and the measures that we took.

The Sample
The sample is taken from GapMinder.org website which is a non-profit venture promoting sustainable global development and achievement of the United Nations Millennium Development Goals. Gapminder seeks to increase the use and understanding of statistics about social, economic, and environmental development at local, national, and global levels.

Our sample contains data on 213 countries around the world, including information on income per person, alcohol consumption, internet users and other similar variables. In total there are 15 variables, each containing its respective data for every nation.

Procedure for collecting the sample
GapMinder collects data from multiple sources, including the Institute for Health Metrics and Evaluation, the US Census Bureau’s International Database, the United Nations Statistics Division, and the World Bank.

The data for this study was collected using a longitudinal survey method. According to the Gapminder website’s FAQs, it uses survey software called SurveyGizmo to create its surveys for data collection.

Measures
In Course 2 of the specialization, after studying the Gapminder codebook, I decided that I am particularly interested in determining whether alcohol consumption is related to a person’s income.

I, therefore, added to my codebook the following two variables: alcoconsumption and incomeperperson. Each of these contained data for each country as per their name suggests.

‘alcconsumption’ – 2008 alcohol consumption per adult (age 15+), in litres: recorded and estimated average alcohol consumption, adult (15+) per capita, in litres of pure alcohol. This data was taken from the WHO.

‘incomeperperson’ – 2010 Gross Domestic Product per capita in constant 2000 US$. Inflation, but not the differences in the cost of living between countries, has been taken into account. This data was taken from the World Bank.

Studying the codebook, as well as some previous related work on the particular data set, is very helpful both in determining a valid research question to explore and in choosing the variables to study.

My explanatory variables were incomeperperson and femaleemployrate. All these variables are quantitative and continuous, so there is no need to manage or recode them for regression analysis. Moreover, there might be a few confounding variables that are the actual reason for the association, but for this week this is all I am supposed to discuss; we will look at confounding variables next week, I believe.