About the project

Write a short description about the course and add a link to your github repository here. This is an R markdown (.Rmd) file so you can use R markdown syntax. See the ‘Useful links’ page in the mooc area (chapter 1) for instructions.

This is a test entry and I am getting help with it. The short description of this course: it is cool. I like it so far and I am planning to complete all the assignments.
My GitHub repository link: link


ANALYSIS part of the Exercise Set 2

1. Reading the data from the previous step, exploring it and describing the dataset

data <- read.table("data/learning2014.txt", header = TRUE, sep = "\t")
head(data)
##   gender Age Attitude Points        deep        Stra        Surf
## 1      F  53       37     25 -0.20335689  0.37527124 -0.39314657
## 2      M  55       31     12 -1.40867870 -0.43457118  0.70406641
## 3      F  49       25     24 -0.35402212  0.69920821 -1.02012542
## 4      M  53       35     10 -0.35402212  0.05133427 -1.02012542
## 5      M  49       37     22 -0.05269166  0.69920821  0.07708756
## 6      F  38       38     21  1.90595628  0.69920821 -0.70663599
dim(data)
## [1] 166   7

This dataset has 166 rows and 7 columns, corresponding to 166 participants with 7 measured or calculated characteristics. Since the R function scale() was applied to the last three columns, they now contain scaled values, including negative ones. The dataset has information on age, gender and total exam points, as well as the three combination variables deep, stra and surf described in the course outline. Rows with zero exam points were removed, which brought the number of participants down from 183 to 166.

2. Graphical representation of the data using ggplot2

Let’s check how the variable Attitude looks in relation to Points. I will use ggplot2 to simply plot Points against Attitude.

library(ggplot2)
p1 <- ggplot(data, aes(x = Attitude, y = Points))
p2 <- p1 + geom_point()
p2

Let’s also check Age and deep in relation to Points.

p3 <- ggplot(data, aes(x = Age, y = Points))
p4 <- p3 + geom_point()
p4

The Age distribution is a bit skewed, as the participants are probably typical students around 20 years of age. The exam scores, in contrast, are well spread out.

p5 <- ggplot(data, aes(x = deep, y = Points))
p6 <- p5 + geom_point()
p6

This distribution is centered around 0, as the variable “deep” was scaled to have mean 0.
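The effect of scale() can be verified on a small vector: it subtracts the mean and divides by the standard deviation, so the result always has mean 0 and standard deviation 1. A minimal sketch with made-up numbers:

```r
x <- c(2, 4, 6, 8)                     # made-up raw scores
s <- as.vector(scale(x))               # (x - mean(x)) / sd(x)
round(mean(s), 10)                     # 0: scaled values are centred
sd(s)                                  # 1: and have unit standard deviation
```

This is exactly why the scaled variables deep, stra and surf contain negative values: every observation below the original mean becomes negative.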

summary(data)
##  gender       Age           Attitude         Points     
##  F:110   Min.   :17.00   Min.   :14.00   Min.   : 7.00  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:19.00  
##          Median :22.00   Median :32.00   Median :23.00  
##          Mean   :25.51   Mean   :31.43   Mean   :22.72  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:27.75  
##          Max.   :55.00   Max.   :50.00   Max.   :33.00  
##       deep               Stra               Surf          
##  Min.   :-3.81932   Min.   :-2.37819   Min.   :-2.274083  
##  1st Qu.:-0.65535   1st Qu.:-0.59654   1st Qu.:-0.706636  
##  Median :-0.05269   Median : 0.13232   Median : 0.077088  
##  Mean   :-0.02909   Mean   : 0.04646   Mean   :-0.009783  
##  3rd Qu.: 0.70063   3rd Qu.: 0.69921   3rd Qu.: 0.704066  
##  Max.   : 2.20729   Max.   : 2.48086   Max.   : 2.898492

The R function summary() provides the basic statistics of each variable; for the factor gender it shows frequencies instead.

3. Multiple regression with three variables and Points as dependent variable

Now we will use the lm() function to fit a regression model.

model1 <- lm(Points ~ Attitude + deep + Age, data = data)
summary(model1)
## 
## Call:
## lm(formula = Points ~ Attitude + deep + Age, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.206  -3.430   0.204   3.979  10.950 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.38008    2.26458   5.908 1.98e-08 ***
## Attitude     0.35941    0.05696   6.310 2.56e-09 ***
## deep        -0.33338    0.41500  -0.803    0.423    
## Age         -0.07716    0.05322  -1.450    0.149    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.307 on 162 degrees of freedom
## Multiple R-squared:  0.2043, Adjusted R-squared:  0.1896 
## F-statistic: 13.87 on 3 and 162 DF,  p-value: 4.305e-08

Attitude has a statistically significant p-value (2.56e-09). deep and Age do not, so they should be removed from the model. The overall p-value for model1 is good (4.305e-08), so let’s now refit the model without Age and deep.

model2 <- lm(Points ~ Attitude, data = data)
summary(model2)
## 
## Call:
## lm(formula = Points ~ Attitude, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Now the p-value for Attitude is even smaller (4.12e-09), so Age and deep did not contribute significantly to my model with Attitude. Let’s look at the plot of it.

p1 <- ggplot(data, aes(x = Attitude, y = Points))
p2 <- p1 + geom_point()+ geom_smooth(method = "lm")
p2

Nice graph of my model!

4. Interpretation of the summary of lm model

model2 <- lm(Points ~ Attitude, data = data)
summary(model2)
## 
## Call:
## lm(formula = Points ~ Attitude, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

First of all, the summary reports the residuals. Since the residuals should be centred on zero, we look at the median, 0.4339 here (with Age and deep still included it was closer to zero, 0.204). Next, R outputs the regression coefficients of the model. For each coefficient (only Attitude is left), the estimate, standard error, t-statistic and two-sided p-value are presented. There is a positive relationship (the estimate is 0.35255) and it is significant, as the p-value is 4.12e-09, also indicated by the three asterisks. There are 164 degrees of freedom (166 - 2), and Multiple R-squared is 0.1906. That means the predictor Attitude explains only 19.06% of the observed variance in the dependent variable ‘Points’.
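In a simple regression with one predictor, Multiple R-squared is exactly the squared Pearson correlation between the two variables, which can be checked directly. A sketch on made-up data (the course data file is not bundled here, so the variable names and coefficients are invented for illustration):

```r
set.seed(1)
attitude_sim <- runif(100, 14, 50)                        # made-up "Attitude" scores
points_sim   <- 11.6 + 0.35 * attitude_sim + rnorm(100, sd = 5)

fit <- lm(points_sim ~ attitude_sim)
r2  <- summary(fit)$r.squared
# With a single predictor, R-squared equals the squared correlation
all.equal(r2, cor(attitude_sim, points_sim)^2)            # TRUE
```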

5. Four diagnostic plots for linear regression

model2 <- lm(Points ~ Attitude, data = data)
plot(model2)

  1. Residuals vs Fitted

This plot shows if residuals have non-linear patterns. As seen from the plot, there are equally spread residuals around a horizontal line without distinct patterns, which is a good indication we don’t have non-linear relationships.

  2. Normal Q-Q

This plot shows whether the residuals are normally distributed: they may follow the straight line well or deviate from it severely. Here the residuals are well aligned, deviating somewhat at the “ends” of the graph, but overall the alignment is good.

  3. Scale-Location

This plot shows if residuals are spread equally along the ranges of predictors. This is how we can check the assumption of equal variance (homoscedasticity). We can say that we see a sort of horizontal line with equally (randomly) spread points.

  4. Residuals vs Leverage

This plot helps us find influential cases (i.e., subjects), if any. Not all outliers are influential in linear regression analysis: even though some observations have extreme values, they may not be influential in determining the regression line, meaning the results would hardly change whether we include or exclude them. Such points follow the overall trend and do not really matter; they are not influential. Here we see some labelled cases that deviate, but I would not exclude them from the analysis.
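plot() on an lm object draws these diagnostics one at a time; the four classic ones can also be drawn together by setting up a 2x2 layout and selecting them with the `which` argument. A sketch on the built-in mtcars data as a stand-in for the course data:

```r
# Fit a simple model on built-in data (stand-in for model2)
m_toy <- lm(mpg ~ wt, data = mtcars)

# 1 = Residuals vs Fitted, 2 = Normal Q-Q,
# 3 = Scale-Location,      5 = Residuals vs Leverage
par(mfrow = c(2, 2))
plot(m_toy, which = c(1, 2, 3, 5))
par(mfrow = c(1, 1))
```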


ANALYSIS part of the Exercise Set 3

1&2. Reading the joined student alcohol consumption data into R and exploring it

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
joined_data<-read.table("Joined student alcohol consumption data.txt", header=T)
dim(joined_data)
## [1] 382  35
glimpse(joined_data)
## Observations: 382
## Variables: 35
## $ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,...
## $ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, ...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, G...
## $ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, ...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob       <fct> at_home, at_home, at_home, health, other, services,...
## $ Fjob       <fct> teacher, other, other, services, other, other, othe...
## $ reason     <fct> course, course, other, home, home, reputation, home...
## $ nursery    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, ye...
## $ internet   <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes,...
## $ guardian   <fct> mother, father, mother, mother, father, mother, mot...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, ...
## $ famsup     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes,...
## $ paid       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, ...
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes,...
## $ higher     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ romantic   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...

The dataset “joined_data” includes information on student achievement in secondary education at two Portuguese schools. The 35 variables include student grades and demographic, social and school-related features, and the data were collected using school reports and questionnaires. Two source datasets cover performance in two distinct subjects, Mathematics (mat) and Portuguese language (por); they were joined for this analysis.

3. Choosing variables and study hypotheses.

The purpose of my analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, I chose the following 4 variables: 1. sex (the hypothesis is that boys on average consume more alcohol and would more often be high users), 2. age (older students would consume more, so more of them would be found in the high-use group), 3. internet (those who have internet access would consume more or be high users) and 4. absences (the idea being that more absences would be associated with high use).

4. The distributions of the chosen variables: what about my study hypotheses?

joined_data %>% group_by(sex) %>% summarise(count = n())
## # A tibble: 2 x 2
##   sex   count
##   <fct> <int>
## 1 F       198
## 2 M       184
joined_data %>% group_by(age) %>% summarise(count = n())
## # A tibble: 7 x 2
##     age count
##   <int> <int>
## 1    15    81
## 2    16   107
## 3    17   100
## 4    18    81
## 5    19    11
## 6    20     1
## 7    22     1
joined_data %>% group_by(internet) %>% summarise(count = n())
## # A tibble: 2 x 2
##   internet count
##   <fct>    <int>
## 1 no          58
## 2 yes        324
joined_data %>% group_by(absences) %>% summarise(count = n())
## # A tibble: 26 x 2
##    absences count
##       <int> <int>
##  1        0    65
##  2        1    51
##  3        2    58
##  4        3    41
##  5        4    36
##  6        5    22
##  7        6    21
##  8        7    12
##  9        8    20
## 10        9    12
## # ... with 16 more rows
library(ggplot2)
g1 <- ggplot(joined_data, aes(x = high_use, y = age))
g1 + geom_boxplot() + ylab("age")

The hypothesis about age looks correct: high users are older on average.

library(ggplot2)
g1 <- ggplot(joined_data, aes(high_use))
g1 + geom_bar(aes(fill=sex)) 

The hypothesis about sex seems to be correct, too: in the high-use group there seem to be more boys than girls.

library(ggplot2)
g1 <- ggplot(joined_data, aes(high_use))
g1 + geom_bar(aes(fill=internet)) 

Here it is less clear: it looks like the high-use group contains more students with internet access, but the relationship is not so obvious and requires testing.

library(ggplot2)
g1 <- ggplot(joined_data, aes(x = high_use, y = absences))
g1 + geom_boxplot() + ylab("absences")

In the case of absences we need actual calculations: the picture is no longer obvious, though higher numbers of absences do seem to occur in the high-use group.

5. Logistic regression

Four variables are: sex, age, internet, absences.

m <- glm(high_use ~ age + sex + internet + absences, data = joined_data, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ age + sex + internet + absences, family = "binomial", 
##     data = joined_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2985  -0.8443  -0.6330   1.0846   2.1044  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.76662    1.74079  -2.738  0.00618 ** 
## age          0.17582    0.10153   1.732  0.08333 .  
## sexM         0.98257    0.24155   4.068 4.75e-05 ***
## internetyes  0.03073    0.33826   0.091  0.92761    
## absences     0.09197    0.02331   3.945 7.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 427.03  on 377  degrees of freedom
## AIC: 437.03
## 
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept)         age        sexM internetyes    absences 
## -4.76662101  0.17582448  0.98256834  0.03072935  0.09196982

Summary of glm: the intercept is significant (p = 0.00618), and the biggest effect size belongs to sex (the estimate for boys is 0.98). Next comes age, which is logical, as getting older perhaps makes it easier to obtain alcohol, though its p-value is above 0.05 (0.08333). Absences are in third place and show a significant association with high alcohol use. Internet seems to matter the least and is not significantly associated with being a high user. The same was seen in the box plots and bar plots I made in the previous step.

Now I will compute coefficients of the model as odds ratios and provide confidence intervals for them.

OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                      OR        2.5 %    97.5 %
## (Intercept) 0.008509084 0.0002648954 0.2474229
## age         1.192228782 0.9784009269 1.4580461
## sexM        2.671308277 1.6736842196 4.3224336
## internetyes 1.031206374 0.5402411256 2.0509852
## absences    1.096331736 1.0494764765 1.1500097
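These odds ratios can be sanity-checked: for a single binary predictor, exp() of its coefficient equals the odds ratio computed directly from a 2x2 table. A sketch with made-up counts (not the course data):

```r
# Made-up counts: high use among 100 boys (30 yes) and 100 girls (15 yes)
d <- data.frame(
  sex  = factor(c("F", "F", "M", "M")),
  high = c(1, 0, 1, 0),
  n    = c(15, 85, 30, 70)
)
fit <- glm(high ~ sex, weights = n, data = d, family = "binomial")

# Odds ratio from the model...
or_model <- exp(coef(fit))["sexM"]
# ...and computed directly from the table: (30/70) / (15/85)
or_table <- (30 / 70) / (15 / 85)
c(or_model, or_table)
```

An odds ratio above 1 (here about 2.4 for boys) means the odds of being a high user are that many times higher in the group in question, which is how the sexM value of 2.67 in the output above should be read.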

6. Binary prediction

library(dplyr)
m <- glm(high_use ~ age + sex + internet + absences, data = joined_data, family = "binomial")
probabilities <- predict(m, type = "response")
joined_data <- mutate(joined_data, probability = probabilities)
joined_data <- mutate(joined_data, prediction = probability > 0.5)
select(joined_data, age, internet, absences, sex, high_use, probability, prediction) %>% tail(10)
##     age internet absences sex high_use probability prediction
## 373  19       no        0   M    FALSE   0.3909341      FALSE
## 374  18       no        7   M     TRUE   0.5061439       TRUE
## 375  18       no        1   F    FALSE   0.1809669      FALSE
## 376  18      yes        6   F    FALSE   0.2651770      FALSE
## 377  19      yes        2   F    FALSE   0.2294734      FALSE
## 378  18      yes        2   F    FALSE   0.1998693      FALSE
## 379  18       no        2   F    FALSE   0.1950003      FALSE
## 380  18       no        3   F    FALSE   0.2098432      FALSE
## 381  17      yes        4   M     TRUE   0.4021709      FALSE
## 382  18      yes        2   M     TRUE   0.4002213      FALSE
table(high_use = joined_data$high_use, prediction = joined_data$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   260    8
##    TRUE     84   30

Comment: the model correctly identified 30 of the true high users but missed 84 of them; of the non-high-users, 260 were predicted correctly and 8 wrongly. That looks like quite a large error to me. Let’s see how it looks on plots.

library(dplyr)
library(ggplot2)
g <- ggplot(joined_data, aes(x = probability, y = high_use, col = prediction))
g + geom_point()

table(high_use = joined_data$high_use, prediction = joined_data$prediction) %>% prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.68062827 0.02094241 0.70157068
##    TRUE  0.21989529 0.07853403 0.29842932
##    Sum   0.90052356 0.09947644 1.00000000

Now I calculate the training error, i.e. the average number of wrong predictions in the training data.

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = joined_data$high_use, prob = 0)
## [1] 0.2984293

So the average number of wrong predictions is higher than in the DataCamp example (it was 0.26 there). Note, though, that prob = 0 here means predicting that nobody is a high user, so this 0.298 is simply the share of high users in the data; using the model’s fitted probabilities, the training error from the confusion matrix above would be (84 + 8)/382, about 0.24. In my judgement this is still quite a large error, so the model based on these 4 variables does not work too well. Maybe internet (and possibly age) should be dropped from the model.
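The loss function can be sanity-checked on a tiny made-up vector where we can count the misclassifications by hand (re-stated here so the snippet is self-contained):

```r
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Made-up truth and predicted probabilities: cases 3 and 5 are misclassified
class_toy <- c(1, 0, 1, 1, 0)
prob_toy  <- c(0.9, 0.2, 0.4, 0.8, 0.6)
loss_func(class = class_toy, prob = prob_toy)   # 2 wrong out of 5 = 0.4
```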

Bonus

I will do K-fold cross-validation, as shown in the example, with K = 10.

library(boot)
cv <- cv.glm(data = joined_data, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.2539267

The average number of wrong predictions in the cross-validation is 0.2539267, which is lower than the error of my model computed above. So 10-fold cross-validation gives a lower error estimate.


ANALYSIS part of the Exercise Set 4

1&2. Reading the Boston data from the MASS package and exploring it

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston) and has been used extensively throughout the literature to benchmark algorithms. The dataset is small, with only 506 cases. The data were originally published by Harrison, D. and Rubinfeld, D.L., ‘Hedonic prices and the demand for clean air’, J. Environ. Economics & Management, vol. 5, 81-102, 1978.

3. Variables of Boston data.

There are 14 attributes in each case of the dataset. They are: CRIM - per capita crime rate by town; ZN - proportion of residential land zoned for lots over 25,000 sq.ft.; INDUS - proportion of non-retail business acres per town; CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise); NOX - nitric oxides concentration (parts per 10 million); RM - average number of rooms per dwelling; AGE - proportion of owner-occupied units built prior to 1940; DIS - weighted distances to five Boston employment centres; RAD - index of accessibility to radial highways; TAX - full-value property-tax rate per $10,000; PTRATIO - pupil-teacher ratio by town; B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town; LSTAT - % lower status of the population; MEDV - Median value of owner-occupied homes in $1000’s.

Let’s see the scatter-plot matrix of the Boston variables.

pairs(Boston)

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Let’s see correlations between variables in the dataset.

library(dplyr)
library(corrplot)
## corrplot 0.84 loaded
cor_matrix<-cor(Boston) %>% round(digits = 2)
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

Interpretation: we can see that some variables correlate strongly with each other; for example, INDUS, NOX and AGE correlate positively with one another, while DIS correlates negatively with all three. The correlation plot is very informative about the relationships between the variables.

4. Standardizing the dataset using scale() and summaries of the scaled data.

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
class(boston_scaled)
## [1] "data.frame"

As we can see, the mean of each variable is now set to zero. We also converted the boston_scaled object from a matrix to a data frame using R functions and verified the class of the object.
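That every column really ends up with mean 0 (and standard deviation 1) can be checked programmatically; a sketch using the built-in mtcars data as a stand-in for Boston:

```r
scaled <- scale(mtcars)                         # stand-in for scale(Boston)
max(abs(colMeans(scaled)))                      # all column means are numerically zero
all(abs(apply(scaled, 2, sd) - 1) < 1e-12)      # all column sds are one

# scale() returns a matrix; convert it to a data frame for dplyr-style work
scaled_df <- as.data.frame(scaled)
class(scaled_df)                                # "data.frame"
```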

Now we will create a categorical variable of the crime rate from the scaled crime rate using the quantiles as the break points in the categorical variable. We need to drop the old crime rate variable from the dataset, too.

Let’s have a look at the variable “crim”.

summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110

Now we make a quantile vector of it and check it.

bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610

Now we make a categorical variable out of the variable “crim” with 4 labels. Then we check the frequencies with table().

crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
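Using the quantiles as break points splits a variable into four roughly equal-sized groups by construction, which is why the four classes above each contain about a quarter of the 506 towns. A minimal sketch with made-up data:

```r
x <- 1:100                                       # made-up continuous variable
bins <- quantile(x)                              # 0%, 25%, 50%, 75%, 100%
cats <- cut(x, breaks = bins, include.lowest = TRUE,
            labels = c("low", "med_low", "med_high", "high"))
table(cats)                                      # 25 observations per class
```

The include.lowest = TRUE argument matters: without it, the minimum value would fall outside the first interval and become NA.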

Next step is to remove original crim from the dataset and add the new categorical value to scaled data. We will also check it afterwards.

boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
table(boston_scaled$crime)
## 
##      low  med_low med_high     high 
##      127      126      126      127

Now we will create a train set (80% of the observations) and a test set from the scaled Boston data.

n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
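Note that sample() is random, so this split (and all results downstream, such as the confusion matrix below) will change every time the document is knitted unless the random seed is fixed. A reproducible version of the same idea, sketched on the built-in mtcars data:

```r
set.seed(123)                        # fix the RNG so the split is reproducible
n <- nrow(mtcars)
ind <- sample(n, size = round(n * 0.8))
train_toy <- mtcars[ind, ]           # 80% of the rows
test_toy  <- mtcars[-ind, ]          # the remaining 20%
c(nrow(train_toy), nrow(test_toy))   # 26 and 6 rows, with no overlap
```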

5. Linear discriminant analysis

Let’s run linear discriminant analysis.

lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2475248 0.2400990 0.2673267 0.2450495 
## 
## Group means:
##                   zn      indus         chas        nox          rm
## low       0.96415088 -0.8872982 -0.075474056 -0.8886618  0.42873365
## med_low  -0.07283487 -0.2717505 -0.028797094 -0.5703153 -0.21352474
## med_high -0.38322343  0.1437378  0.238035778  0.3646033  0.09465255
## high     -0.48724019  1.0149946  0.006051757  1.0361143 -0.32934304
##                 age        dis        rad        tax      ptratio
## low      -0.9030881  0.8912268 -0.6924575 -0.7217232 -0.454299467
## med_low  -0.3777287  0.3984921 -0.5355082 -0.4505113 -0.008032196
## med_high  0.4079866 -0.3760454 -0.4065743 -0.3294479 -0.314769825
## high      0.7989217 -0.8366353  1.6596029  1.5294129  0.805778429
##               black       lstat        medv
## low       0.3787438 -0.76957232  0.52779588
## med_low   0.3049605 -0.10357860 -0.05860452
## med_high  0.0794439  0.03333508  0.18449539
## high     -0.8051289  0.81289267 -0.63233324
## 
## Coefficients of linear discriminants:
##                  LD1           LD2         LD3
## zn       0.107831700  6.219678e-01 -0.93912477
## indus    0.035019054 -1.065905e-01  0.30829282
## chas    -0.068458437 -3.695115e-02  0.06682377
## nox      0.224439618 -8.841736e-01 -1.17164240
## rm      -0.118145567 -1.568722e-01 -0.26586907
## age      0.250011132 -2.988660e-01 -0.12994559
## dis     -0.110759509 -1.458461e-01  0.23740978
## rad      3.456416321  8.159788e-01 -0.20555858
## tax     -0.002526983  1.649667e-01  0.58598343
## ptratio  0.120739134  4.646396e-03 -0.20508331
## black   -0.146597461 -4.982588e-05  0.11755787
## lstat    0.181882938 -2.644175e-01  0.31846127
## medv     0.171566586 -3.448778e-01 -0.15856327
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9518 0.0366 0.0116

Now we will plot the results of lda.

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

6. Class prediction with the LDA model on the test data, and cross-tabulation of the results.

correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       18       8        1    0
##   med_low    6      18        5    0
##   med_high   0       5       12    1
##   high       0       0        1   27
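The cross-tabulation can be condensed into a single accuracy figure by dividing the diagonal (correct predictions) by the total. Re-entering the counts from the table above by hand (they will differ on a re-knit, since the train/test split is random):

```r
# Confusion matrix copied from the cross-tabulation above
cm <- matrix(c(18,  8,  1,  0,
                6, 18,  5,  0,
                0,  5, 12,  1,
                0,  0,  1, 27),
             nrow = 4, byrow = TRUE,
             dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                             predicted = c("low", "med_low", "med_high", "high")))
sum(diag(cm)) / sum(cm)   # 75 correct out of 102, about 0.74 accuracy
```

Most confusion happens between adjacent classes (low vs med_low), while the high class is predicted almost perfectly.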

7. Reloading the Boston data, distance measures and the k-means algorithm.

library(MASS)
data("Boston")
dist_eu <- dist(scale(Boston))
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
dist_man <- dist(scale(Boston), method = "manhattan")
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

Now we do k-means clustering and plot it!

km <- kmeans(Boston, centers = 3)
pairs(Boston, col = km$cluster)

The next step is to investigate a suitable number of clusters k.

library(ggplot2)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')

km <- kmeans(Boston, centers = 2)
pairs(Boston, col = km$cluster)

Interpretation: a good number of clusters is indicated by the point where the total WCSS drops sharply and then levels off. In this case, two clusters seem optimal, so we reran k-means with two clusters.
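To see what distinguishes the two clusters, one could also compare the mean of each variable within each cluster (a sketch, assuming `km` from the two-cluster run above):

```r
# Sketch: mean of each Boston variable within the two k-means clusters
aggregate(Boston, by = list(cluster = km$cluster), FUN = mean)
```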

Super-Bonus.

model_predictors <- dplyr::select(train, -crime)
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')

Nice plot!
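As a small extension of the super-bonus, the same 3D scatter plot could be colored by the crime classes of the training set (a sketch; `matrix_product` and `train` come from the chunks above):

```r
# Sketch: 3D plot of the LDA projection, colored by crime class
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = 'scatter3d', mode = 'markers', color = train$crime)
```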


ANALYSIS part of the Exercise Set 5

1. Reading the human data from the data folder and exploring it

human<-read.table("C:/Users/ELITEBOOK/Documents/GitHub/IODS-project/data/human.txt", header=T)
library(dplyr)
library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(corrplot)
# ggpairs(human, cardinality_threshold = 155) # this is crashing my laptop
#str(human)
human<-human %>% 
  mutate_at(vars(GNI), as.numeric)
summary(human)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI           Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :  1.0   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.: 39.5   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 78.0   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 78.0   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.:116.5   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :155.0   Max.   :1100.0   Max.   :204.80   Max.   :57.50
cor(human) %>% corrplot #  the correlation matrix and corrplot

I tried to show a graphical overview of the data using GGally’s ggpairs, but it crashes my computer, as there are too many countries; I left the code here but do not run it. corrplot worked and shows which variables correlate strongly (bigger circles) or weakly (smaller circles), directly (blue) or inversely (red). The distributions of the variables can be checked from the summaries: the ratio variables stay close to the 0–1 range, while the others vary greatly.
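To back up the corrplot reading with numbers, the strongest pairwise correlations can be listed explicitly (a sketch on the `human` data loaded above):

```r
# Sketch: variable pairs ordered by absolute correlation
cm <- cor(human)
cm[upper.tri(cm, diag = TRUE)] <- NA   # keep each pair only once
pairs <- as.data.frame(as.table(cm))   # long format: Var1, Var2, Freq
pairs <- pairs[!is.na(pairs$Freq), ]
head(pairs[order(-abs(pairs$Freq)), ]) # strongest correlations first
```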

2. PCA on the non-standardized data

Let’s first use non-standardized data and go directly to PCA.

pca_human <- prcomp(human)
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

We will have a look at the principal components and the variance.

s <- summary(pca_human)
s
## Importance of components:
##                             PC1      PC2      PC3     PC4     PC5     PC6
## Standard deviation     214.2937 44.75162 26.34667 11.4791 4.06656 1.60664
## Proportion of Variance   0.9416  0.04106  0.01423  0.0027 0.00034 0.00005
## Cumulative Proportion    0.9416  0.98267  0.99690  0.9996 0.99995 1.00000
##                           PC7    PC8
## Standard deviation     0.1905 0.1587
## Proportion of Variance 0.0000 0.0000
## Cumulative Proportion  1.0000 1.0000
pca_pr <- round(100*s$importance[2,], digits = 1) 
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 94.2  4.1  1.4  0.3  0.0  0.0  0.0  0.0

We can see that PC1 explains 94% of the variance in the data set. Together, the first two components PC1 and PC2 explain almost the whole variance (98%). Let’s generate a somewhat clearer plot of the two components.

pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

3&4. PCA on the standardized data. Interpretation. Should we scale the data?

Now we repeat the same steps with the scaled data.

human_std <- scale(human)
summary(human_std)
##     Edu2.FM           Labo.FM           Life.Exp          Edu.Exp       
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.3503   Median : 0.2316   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 1.4218   Max.   : 2.4730  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-1.7154   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.8577   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median : 0.0000   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.8577   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 1.7154   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
pca_human <- prcomp(human_std)
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))

The plot looks completely different! Let’s have a look at the principal components.

s1 <- summary(pca_human)
s1
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.9659 1.1387 0.9896 0.8662 0.69949 0.54002 0.46700
## Proportion of Variance 0.4831 0.1621 0.1224 0.0938 0.06116 0.03645 0.02726
## Cumulative Proportion  0.4831 0.6452 0.7676 0.8614 0.92254 0.95899 0.98625
##                            PC8
## Standard deviation     0.33165
## Proportion of Variance 0.01375
## Cumulative Proportion  1.00000
pca_pr <- round(100*s1$importance[2,], digits = 1) 
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 48.3 16.2 12.2  9.4  6.1  3.6  2.7  1.4
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

Now we got completely different results: PC1 explains barely half of the variance, and together PC1 and PC2 explain about 65% of it. It is worth remembering that each principal component is a linear combination of all the original variables, not a single variable. In the non-standardized case, PC1 is dominated by the variables with the largest numeric ranges (such as Mat.Mor and GNI), which is why the first two components appeared to explain almost all of the variance. Once the dataset is scaled, every variable contributes on an equal footing, and the first two components explain about two thirds of the variance. The decision whether to scale the variables is crucial here. In general, one should look at the variables’ distributions before and after scaling; sometimes scaling can obscure the true nature and meaning of a measurement. In this case, though, scaling clearly served a good purpose: the dataset is no longer summarized by just two components. When the variables are on very different scales, e.g. some measured in hundreds (GNI) and some between 0 and 1 (the ratios Edu2.FM and Labo.FM), it makes sense to scale the dataset to make them comparable.
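The effect of scaling can also be checked directly from the numbers, by comparing the raw standard deviations of the variables with the loadings of the standardized PCA (a sketch; `human` is the data loaded above and `pca_human` now holds the standardized fit):

```r
# Sketch: why scaling matters
round(sapply(human, sd), 1)          # raw scales differ by orders of magnitude
round(pca_human$rotation[, 1:2], 2)  # loadings of PC1 and PC2 after scaling
```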

5. Tea dataset: load, explore, and MCA.

Let’s load the required packages and the dataset, and have a “glimpse” at it.

library(FactoMineR)
library(ggplot2)
library(tidyr)
library(dplyr)
data("tea")
glimpse(tea)
## Observations: 300
## Variables: 36
## $ breakfast        <fct> breakfast, breakfast, Not.breakfast, Not.brea...
## $ tea.time         <fct> Not.tea time, Not.tea time, tea time, Not.tea...
## $ evening          <fct> Not.evening, Not.evening, evening, Not.evenin...
## $ lunch            <fct> Not.lunch, Not.lunch, Not.lunch, Not.lunch, N...
## $ dinner           <fct> Not.dinner, Not.dinner, dinner, dinner, Not.d...
## $ always           <fct> Not.always, Not.always, Not.always, Not.alway...
## $ home             <fct> home, home, home, home, home, home, home, hom...
## $ work             <fct> Not.work, Not.work, work, Not.work, Not.work,...
## $ tearoom          <fct> Not.tearoom, Not.tearoom, Not.tearoom, Not.te...
## $ friends          <fct> Not.friends, Not.friends, friends, Not.friend...
## $ resto            <fct> Not.resto, Not.resto, resto, Not.resto, Not.r...
## $ pub              <fct> Not.pub, Not.pub, Not.pub, Not.pub, Not.pub, ...
## $ Tea              <fct> black, black, Earl Grey, Earl Grey, Earl Grey...
## $ How              <fct> alone, milk, alone, alone, alone, alone, alon...
## $ sugar            <fct> sugar, No.sugar, No.sugar, sugar, No.sugar, N...
## $ how              <fct> tea bag, tea bag, tea bag, tea bag, tea bag, ...
## $ where            <fct> chain store, chain store, chain store, chain ...
## $ price            <fct> p_unknown, p_variable, p_variable, p_variable...
## $ age              <int> 39, 45, 47, 23, 48, 21, 37, 36, 40, 37, 32, 3...
## $ sex              <fct> M, F, F, M, M, M, M, F, M, M, M, M, M, M, M, ...
## $ SPC              <fct> middle, middle, other worker, student, employ...
## $ Sport            <fct> sportsman, sportsman, sportsman, Not.sportsma...
## $ age_Q            <fct> 35-44, 45-59, 45-59, 15-24, 45-59, 15-24, 35-...
## $ frequency        <fct> 1/day, 1/day, +2/day, 1/day, +2/day, 1/day, 3...
## $ escape.exoticism <fct> Not.escape-exoticism, escape-exoticism, Not.e...
## $ spirituality     <fct> Not.spirituality, Not.spirituality, Not.spiri...
## $ healthy          <fct> healthy, healthy, healthy, healthy, Not.healt...
## $ diuretic         <fct> Not.diuretic, diuretic, diuretic, Not.diureti...
## $ friendliness     <fct> Not.friendliness, Not.friendliness, friendlin...
## $ iron.absorption  <fct> Not.iron absorption, Not.iron absorption, Not...
## $ feminine         <fct> Not.feminine, Not.feminine, Not.feminine, Not...
## $ sophisticated    <fct> Not.sophisticated, Not.sophisticated, Not.sop...
## $ slimming         <fct> No.slimming, No.slimming, No.slimming, No.sli...
## $ exciting         <fct> No.exciting, exciting, No.exciting, No.exciti...
## $ relaxing         <fct> No.relaxing, No.relaxing, relaxing, relaxing,...
## $ effect.on.health <fct> No.effect on health, No.effect on health, No....
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

Dataset “tea” has 300 rows and 36 columns and concerns a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions), what their perceptions of the product are (12 questions) and some personal details (4 questions). Rows represent the individuals, columns represent the different questions. The first 18 questions are active variables, the 19th is a supplementary quantitative variable (age) and the remaining variables are supplementary categorical variables. We can check summaries and some plots of the data, but since it is almost entirely categorical, the available visualization tools are quite limited.

summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255  
##  Not.feminine:171   sophisticated    :215   slimming   : 45  
##                                                              
##                                                              
##                                                              
##                                                              
##                                                              
##         exciting          relaxing              effect.on.health
##  exciting   :116   No.relaxing:113   effect on health   : 66    
##  No.exciting:184   relaxing   :187   No.effect on health:234    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
## 
gather(tea) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

Now let’s do MCA, but first we keep only some columns of tea, as is done in the DataCamp exercise.

keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, one_of(keep_columns))
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
plot(mca, invisible=c("ind"), habillage = "quali")

We got a biplot of the first two MCA dimensions. Judging by the categorical variables’ eta² values above, Dim 1 and Dim 2 are driven mostly by “how” (0.708 and 0.522) and “where” (0.702 and 0.681): “how” is whether the individuals prefer tea bags, unpackaged tea or both, and “where” corresponds to where the tea was purchased, in a chain store, a tea shop or both. The plot supports this interpretation, with the categories of these two variables spread furthest along both dimensions.
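The same conclusion can be checked numerically from the MCA object, for example by listing which categories contribute most to the first dimension (a sketch on the `mca` object fitted above):

```r
# Sketch: categories ordered by their contribution to Dim 1
contrib <- mca$var$contrib
head(contrib[order(-contrib[, 1]), ])
```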


ANALYSIS part of the Exercise Set 6

1. Reading the data and exploring it (the long formats of rats and BPRS)

I will repeat the steps from the data wrangling R script here, so that the data transformations are available and I can start the analyses.

library(dplyr)
library(tidyr)
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep = "", header =T)
rats<-read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", sep = "", header =T)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
rats$ID <- factor(rats$ID)
rats$Group <- factor(rats$Group)
BPRSL <-  BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)
BPRSL <-  BPRSL %>% mutate(week = as.integer(substr(weeks,5,5)))
ratsL <-  rats %>% gather(key =WD, value = rats, -ID, -Group)
ratsL <-  ratsL %>% mutate(Time = as.integer(substr(WD,3,4)))

Now a quick look at the data structures of the long formats, as I already went through them in the R script. We can also use head() to get a quick glance at the data, in case we completely forgot what is where.

glimpse(BPRSL)
## Observations: 360
## Variables: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0"...
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, ...
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
glimpse(ratsL)
## Observations: 176
## Variables: 5
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1,...
## $ WD    <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", ...
## $ rats  <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 5...
## $ Time  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8,...
head(BPRSL)
##   treatment subject weeks bprs week
## 1         1       1 week0   42    0
## 2         1       2 week0   58    0
## 3         1       3 week0   54    0
## 4         1       4 week0   55    0
## 5         1       5 week0   72    0
## 6         1       6 week0   48    0
head(ratsL)
##   ID Group  WD rats Time
## 1  1     1 WD1  240    1
## 2  2     1 WD1  225    1
## 3  3     1 WD1  245    1
## 4  4     1 WD1  260    1
## 5  5     1 WD1  255    1
## 6  6     1 WD1  260    1

Now that we have recalled what our datasets contain, we can proceed to the analyses of the rats data (Chapter 8).

2. RATS data analyses (Chapter 8, ANOVA).

We will start with a plot of the individual measurements (weight) in the three groups.

library(ggplot2)
ggplot(ratsL, aes(x = Time, y = rats, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(ratsL$rats), max(ratsL$rats)))

These plots show the variability of weight in the three groups: in Group 1 the weights are the lowest, while Group 2 has higher values and one rat with the highest weight of all. Group 3 is more homogeneous and also has high values.

Data standardization

We now standardize the variable to see whether this changes how the data are distributed.

ratsL <- ratsL %>%
  group_by(Time) %>%
  mutate(stdrats = (rats - mean(rats))/sd(rats) ) %>%
  ungroup()
glimpse(ratsL)
## Observations: 176
## Variables: 6
## $ ID      <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ Group   <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, ...
## $ WD      <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1"...
## $ rats    <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445,...
## $ Time    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, ...
## $ stdrats <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.881...
ggplot(ratsL, aes(x = Time, y = stdrats, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "standardized rats")

We can now see that the y-axis has a common scale, as weight is standardized across all groups. So we did not lose the variability inside the groups, but we made the data a bit more homogeneous between the groups. Next we will make another plot, showing the groups’ mean weights across time.

n <- ratsL$Time %>% unique() %>% length()
ratsS <- ratsL %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(rats), se = sd(rats)/sqrt(n) ) %>%
  ungroup()
glimpse(ratsS)
## Observations: 33
## Variables: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, ...
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 26...
## $ se    <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.5529...
ggplot(ratsS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.9,0.5)) +
  scale_y_continuous(name = "mean(rats) +/- se(rats)")

Now we see the plot of the three groups’ mean weights, and it gives a very nice overview of the groups against each other. As in the previous plots, Group 1 has the lowest mean weight, while Groups 2 and 3 are closer to each other.

Outliers in the data?

Now we will check whether the rats data contain outliers.

ratsL8S <- ratsL %>%
  filter(Time > 0) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(rats) ) %>%
  ungroup()
glimpse(ratsL8S)
## Observations: 16
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.72...
# Draw a boxplot of the mean versus group
# (fun.y was deprecated in ggplot2 3.3.0; newer versions use fun = "mean")
ggplot(ratsL8S, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(rats), WD ALL")

# Create a new dataset by filtering out the outliers, then draw the plot again
ratsL8S1 <- ratsL8S %>%
  filter(mean < 550)  # first outlier (highest mean, Group 2) removed
ratsL8S2 <- ratsL8S1 %>%
  filter(mean > 250)  # second outlier (lowest mean, Group 1) removed; ratsL8S2 is our data
ggplot(ratsL8S2, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(rats minus 2 outliers), WD ALL")

We have now removed two outliers: one from Group 1 with the lowest mean and one from Group 2 with the highest mean. I am not sure this is the definitive way to deal with them, but I follow the course example. One mild outlier in Group 3 stays, as it lies within the range of the Group 2 means. Now we are ready for ANOVA.
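As a less ad hoc alternative to eyeballing the boxplots, the 1.5 × IQR rule (the same rule geom_boxplot uses for its whiskers) can be computed directly. A sketch on hypothetical subject means chosen to resemble the ones above:

```r
# Hypothetical subject means: one low extreme, one high extreme
means <- c(238, 594, 261, 264, 267, 269, 275, 255)

q   <- quantile(means, c(0.25, 0.75))
iqr <- q[2] - q[1]

# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
outliers <- means[means < q[1] - 1.5 * iqr | means > q[2] + 1.5 * iqr]
outliers  # here: 238 and 594
```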

ANOVA

# Add the baseline (day 1 weight) from the original wide data as a new variable.
# NB: this starts from the full summary ratsL8S, so the two outliers removed
# above are back in the data used for the ANOVA (hence the 16 observations).
ratsL8S2 <- ratsL8S %>%
  mutate(baseline = rats$WD1)
# Fit the linear model with the mean as the response 
fit <- lm(mean ~ baseline + Group, data = ratsL8S2)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value    Pr(>F)    
## baseline   1 252125  252125 2237.0655 5.217e-15 ***
## Group      2    726     363    3.2219   0.07586 .  
## Residuals 12   1352     113                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The very low P value (5.217e-15) belongs to the baseline term: the starting weight is a highly significant predictor of the mean weight. The Group term itself has P = 0.076, so after conditioning on the baseline, the difference between the groups is not statistically significant at the 0.05 level (only borderline). Note also that ANOVA would not tell us between which groups a difference lies, as its null hypothesis is that all group means are equal (mean1 = mean2 = mean3); a post-hoc test would be needed for pairwise comparisons. With that, the ANOVA for the three groups is complete.
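If the Group term had been significant, a post-hoc procedure such as Tukey's HSD would show which pairs of groups differ. A sketch on simulated toy data (not the rats data):

```r
set.seed(1)

# Simulated toy data: three groups with clearly different true means
toy <- data.frame(group = factor(rep(1:3, each = 8)),
                  y = c(rnorm(8, mean = 260, sd = 15),
                        rnorm(8, mean = 450, sd = 15),
                        rnorm(8, mean = 520, sd = 15)))

# All pairwise group comparisons with family-wise error control
TukeyHSD(aov(y ~ group, data = toy))
```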

3. BPRS data analyses (Chapter 9, Linear Mixed Effects Models).

Now we switch to the BPRS data, where we have bprs measurements for 40 men who received either Treatment 1 or Treatment 2 (20 subjects in each group). Let’s have a quick look at the data.

ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
  geom_line()

Here we see one line per individual across the nine measurement occasions (weeks 0 to 8), showing the changes in the bprs measurement.

A better way would be this kind of a plot:

ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))

Now we can see them separated by treatment, which is a much better way to view the data.

Linear regression model

Let’s proceed to a linear model that (naively) assumes the measurements are independent. That is a basic regression model fitted with the lm() function in R. Here we do not remove outliers as we did with the rats dataset; we simply follow the course example.

BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)

# print out a summary of the model
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

The model shows a statistically significant effect of week (P < 2e-16): bprs decreases by about 2.27 points per week. The treatment coefficient, however, is not significant (P = 0.661), so the difference is across weeks, not between treatments.
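To read the coefficients concretely: the fitted fixed-effect line is bprs = 46.45 − 2.27 × week + 0.57 × [treatment 2]. A small sketch plugging in the estimates from the summary above (the helper function pred is mine, for illustration only):

```r
# Coefficients copied from the lm() summary above
b <- c(intercept = 46.4539, week = -2.2704, treatment2 = 0.5722)

# Predicted bprs for a given week and treatment (1 or 2)
pred <- function(week, treatment) {
  unname(b["intercept"] + b["week"] * week + b["treatment2"] * (treatment == 2))
}

pred(0, 1)  # baseline, treatment 1: 46.4539
pred(8, 2)  # week 8, treatment 2: 46.4539 - 2.2704*8 + 0.5722 = 28.8629
```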

Random intercept model

Now we fit the random intercept model, as measurements are not necessarily independent and we need to check different models for the data analysis.

library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
# Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)

# Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

Fitting a random intercept model allows the linear regression fit for each man to differ in intercept from other men. Now we can move on to fit the random intercept and random slope model to the bprs measurement data.
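The subject-specific intercept deviations can be inspected with ranef(). Since the BPRSL object is not reproduced here, the sketch below demonstrates the same idea on lme4's bundled sleepstudy data:

```r
library(lme4)

# Random intercept model on lme4's built-in sleepstudy data
m <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy, REML = FALSE)

# One intercept deviation per subject: how far each subject's own
# regression line sits above or below the population intercept
head(ranef(m)$Subject)
```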

Random Intercept and Random Slope Model

We will fit a new model and then compare the two nested models with a likelihood ratio test via anova().

BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8201  8.0511        
##           week         0.9608  0.9802   -0.51
##  Residual             97.4307  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## BPRS_ref   5 2748.7 2768.1 -1369.4   2738.7                           
## BPRS_ref1  7 2745.4 2772.6 -1365.7   2731.4 7.2721      2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let’s check the chi-squared statistic (7.2721) and P value (0.02636) of the likelihood ratio test between BPRS_ref1 and BPRS_ref. The lower the P value, the stronger the evidence that the more complex model fits better than the comparison model. Here P = 0.02636, below the usual 0.05 cutoff for statistical significance, so the random intercept and slope model BPRS_ref1 is preferred. Let’s try adding an interaction of week and treatment and see if that improves the fit further.
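The chi-squared statistic and P value can be reproduced by hand from the deviances and parameter counts shown in the anova() table above:

```r
# Likelihood ratio statistic = difference in deviances (values from the table)
lrt_stat <- 2738.7 - 2731.4          # BPRS_ref deviance - BPRS_ref1 deviance

# Degrees of freedom = difference in number of parameters (7 - 5 = 2)
p_value <- pchisq(lrt_stat, df = 2, lower.tail = FALSE)
p_value  # approximately 0.026, matching the anova() output
```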

Random Intercept and Random Slope Model with interaction

Here we fit a random intercept and slope model that allows for a treatment × week interaction. We run the new model and test ANOVA for the new one and the previous model.

BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0767  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 65.0016  8.0624        
##           week         0.9688  0.9843   -0.51
##  Residual             96.4699  9.8219        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2522  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
##           Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## BPRS_ref1  7 2745.4 2772.6 -1365.7   2731.4                           
## BPRS_ref2  8 2744.3 2775.4 -1364.1   2728.3 3.1712      1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again we check the chi-squared statistic (3.1712) and P value (0.07495) of the likelihood ratio test, now between BPRS_ref2 and BPRS_ref1. This time the P value is greater than 0.05, so the interaction does not significantly improve the previous model.

Visualization for the Random Intercept and Random Slope Model fitting

Let’s visualize the data before and after fitting the random intercept and random slope model BPRS_ref1. Since the last model (with the interaction) had a P value greater than 0.05, I will focus on BPRS_ref1, where we at least saw a significant improvement in fit.

ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
  geom_line() +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
  scale_y_continuous(name = "bprs") +
  theme(legend.position = "top")

# Create a vector of the fitted values
Fitted <- fitted(BPRS_ref1)

# Create a new column fitted to BPRSL
BPRSL <- BPRSL %>%
  mutate(Fitted)

# draw the plot of BPRSL with fitted value
ggplot(BPRSL, aes(x = week, y = Fitted, group = subject)) +
  geom_line() +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
  scale_y_continuous(name = "Fitted bprs") +
  theme(legend.position = "top")
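The same fitted() workflow can be checked end-to-end on lme4's built-in sleepstudy data, which ships with the package (unlike BPRSL):

```r
library(lme4)
library(ggplot2)

# Random intercept and slope model, analogous in structure to BPRS_ref1
m <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy, REML = FALSE)

# Attach the fitted values and draw one fitted line per subject
sleepstudy$Fitted <- fitted(m)
ggplot(sleepstudy, aes(x = Days, y = Fitted, group = Subject)) +
  geom_line()
```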