This course seems cool so far, and I am planning to complete all the assignments.
My GitHub repository link: link
data<- read.table("data/learning2014.txt", header=T, sep ="\t")
head(data)
## gender Age Attitude Points deep Stra Surf
## 1 F 53 37 25 -0.20335689 0.37527124 -0.39314657
## 2 M 55 31 12 -1.40867870 -0.43457118 0.70406641
## 3 F 49 25 24 -0.35402212 0.69920821 -1.02012542
## 4 M 53 35 10 -0.35402212 0.05133427 -1.02012542
## 5 M 49 37 22 -0.05269166 0.69920821 0.07708756
## 6 F 38 38 21 1.90595628 0.69920821 -0.70663599
dim(data)
## [1] 166 7
This dataset has 166 rows and 7 columns, corresponding to 166 participants with 7 measured or calculated characteristics. Since the R function scale() was applied to the last three columns, those variables are standardized and can take negative values. The dataset has info on age, gender and total exam points, as well as the three variables deep, stra and surf described in the course outline. The rows containing zeroes were removed, bringing the final number down to 166 participants (from 183).
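As a small illustration of what scale() does (toy numbers, not the course data): it subtracts the mean and divides by the standard deviation, which is why roughly half of the scaled values come out negative.

```r
x <- c(2, 4, 6, 8)
z <- as.vector(scale(x))  # center to mean 0, scale to sd 1
z                         # values below the mean are negative, above it positive
```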
Let’s check how the variable Attitude looks in relation to Points. I will use ggplot2 to plot Points vs Attitude.
library(ggplot2)
p1 <- ggplot(data, aes(x = Attitude, y = Points))
p2 <- p1 + geom_point()
p2
Let’s also check Age and deep in relation to Points.
p3 <- ggplot(data, aes(x = Age, y = Points))
p4 <- p3 + geom_point()
p4
The distribution is a bit skewed, as the participants are probably typical students around 20 years of age. The exam scores are well spread out.
p5 <- ggplot(data, aes(x = deep, y = Points))
p6 <- p5 + geom_point()
p6
This distribution is centered around 0, as the variable “deep” was scaled to have mean 0.
summary(data)
## gender Age Attitude Points
## F:110 Min. :17.00 Min. :14.00 Min. : 7.00
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:19.00
## Median :22.00 Median :32.00 Median :23.00
## Mean :25.51 Mean :31.43 Mean :22.72
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:27.75
## Max. :55.00 Max. :50.00 Max. :33.00
## deep Stra Surf
## Min. :-3.81932 Min. :-2.37819 Min. :-2.274083
## 1st Qu.:-0.65535 1st Qu.:-0.59654 1st Qu.:-0.706636
## Median :-0.05269 Median : 0.13232 Median : 0.077088
## Mean :-0.02909 Mean : 0.04646 Mean :-0.009783
## 3rd Qu.: 0.70063 3rd Qu.: 0.69921 3rd Qu.: 0.704066
## Max. : 2.20729 Max. : 2.48086 Max. : 2.898492
The R function summary() provides the basic statistics of each variable, except for gender, of course, for which it shows frequencies.
Now we will use the lm() function to fit a regression model.
model1 <- lm(Points ~ Attitude + deep + Age, data = data)
summary(model1)
##
## Call:
## lm(formula = Points ~ Attitude + deep + Age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.206 -3.430 0.204 3.979 10.950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.38008 2.26458 5.908 1.98e-08 ***
## Attitude 0.35941 0.05696 6.310 2.56e-09 ***
## deep -0.33338 0.41500 -0.803 0.423
## Age -0.07716 0.05322 -1.450 0.149
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.307 on 162 degrees of freedom
## Multiple R-squared: 0.2043, Adjusted R-squared: 0.1896
## F-statistic: 13.87 on 3 and 162 DF, p-value: 4.305e-08
Attitude has a statistically significant p-value of 2.56e-09. deep and Age do not, so they should be removed from the model. The overall p-value for model1 is good (4.305e-08), so let’s now refit without Age and deep.
model2 <- lm(Points ~ Attitude, data = data)
summary(model2)
##
## Call:
## lm(formula = Points ~ Attitude, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Now the p-value is even better: 4.12e-09. So Age and deep do not significantly influence my model with Attitude. Let’s look at a plot of it.
p1 <- ggplot(data, aes(x = Attitude, y = Points))
p2 <- p1 + geom_point()+ geom_smooth(method = "lm")
p2
Nice graph of my model!
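A more formal way to justify dropping Age and deep would be a nested-model F test with anova(); a sketch on simulated data (the real model1 and model2 objects fitted above could be compared the same way):

```r
set.seed(1)
n <- 166
attitude <- rnorm(n, 31, 7)
age      <- rnorm(n, 25, 6)
deep     <- rnorm(n)                               # noise predictor by construction
points   <- 13 + 0.35 * attitude + rnorm(n, sd = 5)
full    <- lm(points ~ attitude + deep + age)
reduced <- lm(points ~ attitude)
anova(reduced, full)  # F test: do the extra terms improve the fit?
```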
First of all, the summary gives the residuals. The smaller the residuals are on average, the better the model fits the data. So here we look at the median, 0.4339 (in fact, with Age and deep included it was smaller, 0.2). Next, R outputs the regression coefficients of the model. For each coefficient (we have only Attitude left), the standard error, t-statistic and two-sided p-value are presented in the output. There is a positive relationship (the estimate is 0.35255), and it is significant, as the p-value is 4.12e-09, also indicated by the three asterisks. There are 164 degrees of freedom (166 - 2), and multiple R-squared is 0.1906. That means that the predictor Attitude explains only 19.06% of the observed variance in the dependent variable ‘Points’.
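The R-squared interpretation can be checked by hand: in simple regression, multiple R-squared is exactly the squared correlation between the predictor and the response (a sketch on toy data, not the course data):

```r
set.seed(2)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
fit <- lm(y ~ x)
all.equal(summary(fit)$r.squared, cor(x, y)^2)  # TRUE: same quantity
```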
model2 <- lm(Points ~ Attitude, data = data)
plot(model2)
This plot shows whether the residuals have non-linear patterns. As seen from the plot, the residuals are equally spread around a horizontal line without distinct patterns, which is a good indication that we don’t have non-linear relationships.
This plot shows whether the residuals are normally distributed: they may follow the straight line well, or deviate from it severely. Ours follow the line well, deviating somewhat at the tails of the graph, but overall the alignment is good.
This plot shows if residuals are spread equally along the ranges of predictors. This is how we can check the assumption of equal variance (homoscedasticity). We can say that we see a sort of horizontal line with equally (randomly) spread points.
This plot helps us find influential cases (i.e., subjects), if any. Not all outliers are influential in linear regression analysis: even though the data have extreme values, those points might not be influential in determining the regression line, meaning the results wouldn’t change much whether we included or excluded them. Such points follow the trend of the majority of cases and don’t really matter. Here we see some labelled cases that deviate, but I would not exclude them from the analysis.
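To view the diagnostic plots all at once instead of paging through them, the plotting area can be split first. A sketch using the built-in cars data as a stand-in for model2 (model2 itself would be plotted the same way):

```r
fit <- lm(dist ~ speed, data = cars)  # stand-in model for illustration
par(mfrow = c(2, 2))                  # 2 x 2 grid of plots
plot(fit)                             # residuals vs fitted, Q-Q, scale-location, leverage
par(mfrow = c(1, 1))                  # restore the single-plot layout
```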
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
joined_data<-read.table("Joined student alcohol consumption data.txt", header=T)
dim(joined_data)
## [1] 382 35
glimpse(joined_data)
## Observations: 382
## Variables: 35
## $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,...
## $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, ...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, G...
## $ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, ...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fct> at_home, at_home, at_home, health, other, services,...
## $ Fjob <fct> teacher, other, other, services, other, other, othe...
## $ reason <fct> course, course, other, home, home, reputation, home...
## $ nursery <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, ye...
## $ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes,...
## $ guardian <fct> mother, father, mother, mother, father, mother, mot...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, ...
## $ famsup <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes,...
## $ paid <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, ...
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes,...
## $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ romantic <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
The dataset “joined_data” includes info on student achievement in secondary education at two Portuguese schools. The 35 variables include student grades and demographic, social and school-related features, and the data was collected using school reports and questionnaires. Two datasets were provided on performance in two distinct subjects, Mathematics (mat) and Portuguese language (por).
The purpose of my analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, I choose the following 4 variables: 1. sex (the hypothesis is that boys on average consume more alcohol and would more often be high-users), 2. age (older students would consume more, and more of the older ones would be found in the high-users’ group), 3. internet (those who have internet would be high-users or consume more) and 4. absences (the idea is that more absences would be associated with being a high-user).
joined_data %>% group_by(sex) %>% summarise(count = n())
## # A tibble: 2 x 2
## sex count
## <fct> <int>
## 1 F 198
## 2 M 184
joined_data %>% group_by(age) %>% summarise(count = n())
## # A tibble: 7 x 2
## age count
## <int> <int>
## 1 15 81
## 2 16 107
## 3 17 100
## 4 18 81
## 5 19 11
## 6 20 1
## 7 22 1
joined_data %>% group_by(internet) %>% summarise(count = n())
## # A tibble: 2 x 2
## internet count
## <fct> <int>
## 1 no 58
## 2 yes 324
joined_data %>% group_by(absences) %>% summarise(count = n())
## # A tibble: 26 x 2
## absences count
## <int> <int>
## 1 0 65
## 2 1 51
## 3 2 58
## 4 3 41
## 5 4 36
## 6 5 22
## 7 6 21
## 8 7 12
## 9 8 20
## 10 9 12
## # ... with 16 more rows
library(ggplot2)
g1 <- ggplot(joined_data, aes(x = high_use, y = age))
g1 + geom_boxplot() + ylab("age")
The hypothesis about age looks correct: high-users are older on average.
library(ggplot2)
g1 <- ggplot(joined_data, aes(high_use))
g1 + geom_bar(aes(fill=sex))
The hypothesis about sex seems correct, too: in the high-users group there seem to be more boys than girls.
library(ggplot2)
g1 <- ggplot(joined_data, aes(high_use))
g1 + geom_bar(aes(fill=internet))
Here it is less clear: it looks like there are more internet users among the high-users, but this relationship is not so obvious and requires testing.
library(ggplot2)
g1 <- ggplot(joined_data, aes(x = high_use, y = absences))
g1 + geom_boxplot() + ylab("absences")
In the case of absences we need to do the calculations: it is no longer obvious, though higher numbers of absences do seem to be observed in the high-users group.
Four variables are: sex, age, internet, absences.
m <- glm(high_use ~ age + sex + internet + absences, data = joined_data, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ age + sex + internet + absences, family = "binomial",
## data = joined_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2985 -0.8443 -0.6330 1.0846 2.1044
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.76662 1.74079 -2.738 0.00618 **
## age 0.17582 0.10153 1.732 0.08333 .
## sexM 0.98257 0.24155 4.068 4.75e-05 ***
## internetyes 0.03073 0.33826 0.091 0.92761
## absences 0.09197 0.02331 3.945 7.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 427.03 on 377 degrees of freedom
## AIC: 437.03
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) age sexM internetyes absences
## -4.76662101 0.17582448 0.98256834 0.03072935 0.09196982
Summary of the glm: the model fits nicely, with a significant p-value for the intercept (0.00618). The effect size is biggest for sex (0.98 is the estimate for boys). Next comes age, which is logical, as getting older perhaps makes it easier to obtain alcohol somehow, though its p-value is greater than 0.05 (0.08333). Absences come third, and they show a significant association with being a high-user of alcohol. Internet seems to matter least and is not significantly associated with being a high-user. The same was seen in the box plots and bar charts I made in the previous step.
Now I will compute coefficients of the model as odds ratios and provide confidence intervals for them.
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.008509084 0.0002648954 0.2474229
## age 1.192228782 0.9784009269 1.4580461
## sexM 2.671308277 1.6736842196 4.3224336
## internetyes 1.031206374 0.5402411256 2.0509852
## absences 1.096331736 1.0494764765 1.1500097
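To read the table: each odds ratio is simply the exponentiated logit coefficient. A quick check with the sexM estimate copied from the model summary above:

```r
b_sexM <- 0.98256834  # logit coefficient for sexM, copied from summary(m) above
exp(b_sexM)           # about 2.67: boys have roughly 2.7 times the odds of high use
```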
library(dplyr)
m <- glm(high_use ~ age + sex + internet + absences, data = joined_data, family = "binomial")
probabilities <- predict(m, type = "response")
joined_data <- mutate(joined_data, probability = probabilities)
joined_data <- mutate(joined_data, prediction = probability > 0.5)
select(joined_data, age, internet, absences, sex, high_use, probability, prediction) %>% tail(10)
## age internet absences sex high_use probability prediction
## 373 19 no 0 M FALSE 0.3909341 FALSE
## 374 18 no 7 M TRUE 0.5061439 TRUE
## 375 18 no 1 F FALSE 0.1809669 FALSE
## 376 18 yes 6 F FALSE 0.2651770 FALSE
## 377 19 yes 2 F FALSE 0.2294734 FALSE
## 378 18 yes 2 F FALSE 0.1998693 FALSE
## 379 18 no 2 F FALSE 0.1950003 FALSE
## 380 18 no 3 F FALSE 0.2098432 FALSE
## 381 17 yes 4 M TRUE 0.4021709 FALSE
## 382 18 yes 2 M TRUE 0.4002213 FALSE
table(high_use = joined_data$high_use, prediction = joined_data$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 260 8
## TRUE 84 30
Comment: the model correctly predicted 30 true high-users, while 8 students were wrongly predicted to be high-users. For the opposite case, 260 non-high-users were predicted correctly and 84 high-users were missed. That looks like quite a large error to me. Let’s see how it looks on plots.
library(dplyr)
library(ggplot2)
g <- ggplot(joined_data, aes(x = probability, y = high_use, col = prediction))
g + geom_point()
table(high_use = joined_data$high_use, prediction = joined_data$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.68062827 0.02094241 0.70157068
## TRUE 0.21989529 0.07853403 0.29842932
## Sum 0.90052356 0.09947644 1.00000000
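The counts from the cross-tabulation above can be summarised into a few standard rates (the counts are copied from the output; this is an illustration, not new results):

```r
tn <- 260; fp <- 8   # correctly / wrongly predicted non-high-users
fn <- 84;  tp <- 30  # missed / correctly predicted high-users
n <- tn + fp + fn + tp  # 382 students in total
(fp + fn) / n        # share of wrong predictions, about 0.24
tp / (tp + fn)       # sensitivity: high-users actually caught, about 0.26
tn / (tn + fp)       # specificity: non-high-users correctly passed, about 0.97
```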
Now I calculate the training error, i.e. the average number of wrong predictions in the training data.
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = joined_data$high_use, prob = 0) # prob = 0 scores the strategy of always predicting FALSE
## [1] 0.2984293
Note that the call above uses prob = 0, i.e. it scores the trivial strategy of never predicting high use, so 0.298 is simply the share of high-users in the data, a baseline to beat; it is higher than the 0.26 in the DataCamp example. With the model’s own predictions, the training error read from the table above is (8 + 84)/382, about 0.24, which beats the baseline but is still quite a large error in my experience. So the model does not work too well based on these 4 variables; maybe dropping internet and age, which were not significant, would be worth trying.
I will do K-fold cross-validation, as shown in the example with K=10.
library(boot)
cv <- cv.glm(data = joined_data, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.2539267
The average number of wrong predictions in the cross-validation is 0.2539267, which is lower than in my model (see above). So 10-fold cross-validation gives a lower error than my model.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston) and has been used extensively throughout the literature to benchmark algorithms. However, these comparisons were primarily done outside of Delve and are thus somewhat suspect. The dataset is small, with only 506 cases. The data was originally published by Harrison, D. and Rubinfeld, D.L., ‘Hedonic prices and the demand for clean air’, J. Environ. Economics & Management, vol. 5, 81-102, 1978.
There are 14 attributes in each case of the dataset. They are: CRIM - per capita crime rate by town; ZN - proportion of residential land zoned for lots over 25,000 sq.ft.; INDUS - proportion of non-retail business acres per town; CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise); NOX - nitric oxides concentration (parts per 10 million); RM - average number of rooms per dwelling; AGE - proportion of owner-occupied units built prior to 1940; DIS - weighted distances to five Boston employment centres; RAD - index of accessibility to radial highways; TAX - full-value property-tax rate per $10,000; PTRATIO - pupil-teacher ratio by town; B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town; LSTAT - % lower status of the population; MEDV - Median value of owner-occupied homes in $1000’s.
Let’s see the scatterplot matrix of the Boston variables.
pairs(Boston)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Let’s see correlations between variables in the dataset.
library(dplyr)
library(corrplot)
## corrplot 0.84 loaded
cor_matrix<-cor(Boston) %>% round(digits = 2)
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
Interpretation: we can see that some variables correlate strongly with each other; for example, INDUS, NOX and AGE correlate positively with one another, while DIS correlates negatively with all three. The correlation plot is very informative about the relationships between the variables.
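The strongest relationship can also be pulled out of the correlation matrix programmatically (a sketch; for Boston the top pair should be rad and tax, with a correlation around 0.9):

```r
library(MASS)                # for the Boston dataset
cm <- cor(Boston)
diag(cm) <- 0                # ignore the trivial self-correlations
idx <- which(abs(cm) == max(abs(cm)), arr.ind = TRUE)[1, ]
rownames(cm)[idx]            # names of the most correlated pair of variables
max(abs(cm))                 # size of that correlation
```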
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
class(boston_scaled)
## [1] "data.frame"
As we can see, the mean of each variable is now set to zero. We also converted the boston_scaled object from a matrix to a data frame and verified the class of the object.
Now we will create a categorical variable of the crime rate from the scaled crime rate using the quantiles as the break points in the categorical variable. We need to drop the old crime rate variable from the dataset, too.
Let’s have a look at the variable “crime”.
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
Now we make a quantile vector of it and check it.
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
Now we make a categorical variable out of the scaled crim variable with 4 labels, then check the frequencies with table().
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
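The same cut-by-quantiles recipe on toy data, to show why the four groups come out roughly equal in size:

```r
x <- 1:100
b <- quantile(x)                                  # min, Q1, median, Q3, max
g <- cut(x, breaks = b, include.lowest = TRUE,
         labels = c("low", "med_low", "med_high", "high"))
table(g)                                          # four groups of 25 each here
```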
The next step is to remove the original crim from the dataset and add the new categorical variable to the scaled data. We will also check it afterwards.
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
table(boston_scaled$crime)
##
## low med_low med_high high
## 127 126 126 127
Now we will create a train dataset (80% of the data) and a test dataset from Boston.
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
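Note that sample() is random, so the split (and everything downstream of it) changes between knits; fixing a seed makes the 80/20 split reproducible. A sketch with the same row count:

```r
set.seed(42)                      # any fixed seed gives a reproducible split
n <- 506
ind <- sample(n, size = n * 0.8)  # size 404.8 is truncated to 404 rows
length(ind)
```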
Let’s run linear discriminant analysis.
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2475248 0.2400990 0.2673267 0.2450495
##
## Group means:
## zn indus chas nox rm
## low 0.96415088 -0.8872982 -0.075474056 -0.8886618 0.42873365
## med_low -0.07283487 -0.2717505 -0.028797094 -0.5703153 -0.21352474
## med_high -0.38322343 0.1437378 0.238035778 0.3646033 0.09465255
## high -0.48724019 1.0149946 0.006051757 1.0361143 -0.32934304
## age dis rad tax ptratio
## low -0.9030881 0.8912268 -0.6924575 -0.7217232 -0.454299467
## med_low -0.3777287 0.3984921 -0.5355082 -0.4505113 -0.008032196
## med_high 0.4079866 -0.3760454 -0.4065743 -0.3294479 -0.314769825
## high 0.7989217 -0.8366353 1.6596029 1.5294129 0.805778429
## black lstat medv
## low 0.3787438 -0.76957232 0.52779588
## med_low 0.3049605 -0.10357860 -0.05860452
## med_high 0.0794439 0.03333508 0.18449539
## high -0.8051289 0.81289267 -0.63233324
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.107831700 6.219678e-01 -0.93912477
## indus 0.035019054 -1.065905e-01 0.30829282
## chas -0.068458437 -3.695115e-02 0.06682377
## nox 0.224439618 -8.841736e-01 -1.17164240
## rm -0.118145567 -1.568722e-01 -0.26586907
## age 0.250011132 -2.988660e-01 -0.12994559
## dis -0.110759509 -1.458461e-01 0.23740978
## rad 3.456416321 8.159788e-01 -0.20555858
## tax -0.002526983 1.649667e-01 0.58598343
## ptratio 0.120739134 4.646396e-03 -0.20508331
## black -0.146597461 -4.982588e-05 0.11755787
## lstat 0.181882938 -2.644175e-01 0.31846127
## medv 0.171566586 -3.448778e-01 -0.15856327
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9518 0.0366 0.0116
Now we will plot the results of lda.
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 18 8 1 0
## med_low 6 18 5 0
## med_high 0 5 12 1
## high 0 0 1 27
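From this cross-tabulation, the share of correct predictions is the diagonal over the total (counts copied from the table above; they will differ between runs, since the train/test split is random):

```r
conf <- matrix(c(18,  8,  1,  0,
                  6, 18,  5,  0,
                  0,  5, 12,  1,
                  0,  0,  1, 27), nrow = 4, byrow = TRUE)
sum(diag(conf)) / sum(conf)  # 75 of 102 correct, about 0.74
```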
library(MASS)
data("Boston")
dist_eu <- dist(scale(Boston))
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
dist_man <- dist(scale(Boston), method = "manhattan")
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
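The two distance measures on a single pair of points make the difference concrete (the classic 3-4-5 triangle):

```r
m <- rbind(c(0, 0), c(3, 4))
dist(m)                        # euclidean: sqrt(3^2 + 4^2) = 5
dist(m, method = "manhattan")  # manhattan: 3 + 4 = 7
```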
Now we do k-means clustering and plot it!
km <-kmeans(Boston, centers = 3)
pairs(Boston, col = km$cluster)
The next step is to investigate the optimal number of clusters k.
library(ggplot2)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
km <-kmeans(Boston, centers = 2)
pairs(Boston, col = km$cluster)
Interpretation: the optimal number of clusters is where the total WCSS drops radically. In this case two clusters seem optimal, so we reran k-means with two clusters.
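Because kmeans() starts from random centers, the clusters can differ between runs; setting a seed makes them reproducible (and arguably the scaled data should be clustered, since the distances above were computed on it). A sketch:

```r
library(MASS)                             # for the Boston dataset
set.seed(123)                             # fix the random initial centers
km <- kmeans(scale(Boston), centers = 2)  # cluster the standardized variables
table(km$cluster)                         # sizes of the two clusters
```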
model_predictors <- dplyr::select(train, -crime)
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
Nice plot!
human<-read.table("C:/Users/ELITEBOOK/Documents/GitHub/IODS-project/data/human.txt", header=T)
library(dplyr)
library(GGally)
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(corrplot)
#ggpairs(human,cardinality_threshold = 155)# this is crashing my laptop
#str(human)
human<-human %>%
mutate_at(vars(GNI), as.numeric)
summary(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 1.0 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 39.5 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 78.0 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 78.0 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.:116.5 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :155.0 Max. :1100.0 Max. :204.80 Max. :57.50
cor(human) %>% corrplot # the correlation matrix and corrplot
I tried to show a graphical overview of the data using GGally’s ggpairs, but it crashes my computer, as there are too many countries; I left the code in but do not run it. corrplot worked and shows which variables correlate strongly (bigger circles) or weakly (smaller circles), directly (blue) or inversely (red). The distribution of the variables can be checked from the summaries: the ratio variables vary between roughly 0 and 1, while the others vary greatly.
Let’s first use non-standardized data and go directly to PCA.
pca_human <- prcomp(human)
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
We will have a look at the principal components and the variance.
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 214.2937 44.75162 26.34667 11.4791 4.06656 1.60664
## Proportion of Variance 0.9416 0.04106 0.01423 0.0027 0.00034 0.00005
## Cumulative Proportion 0.9416 0.98267 0.99690 0.9996 0.99995 1.00000
## PC7 PC8
## Standard deviation 0.1905 0.1587
## Proportion of Variance 0.0000 0.0000
## Cumulative Proportion 1.0000 1.0000
pca_pr <- round(100*s$importance[2,], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 94.2 4.1 1.4 0.3 0.0 0.0 0.0 0.0
We can see that PC1 explains 94% of the variance in the (unstandardized) data set. Together, the first two components PC1 and PC2 explain almost all of the variance (98%). Let’s generate a somewhat clearer plot of the two components.
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
Now we repeat the same steps with the scaled data.
human_std <- scale(human)
summary(human_std)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7188 Min. :-2.7378
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6425 1st Qu.:-0.6782
## Median : 0.3503 Median : 0.2316 Median : 0.3056 Median : 0.1140
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.6717 3rd Qu.: 0.7126
## Max. : 2.6646 Max. : 1.6632 Max. : 1.4218 Max. : 2.4730
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-1.7154 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.8577 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median : 0.0000 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.8577 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 1.7154 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
pca_human <- prcomp(human_std)
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
The plot looks completely different! Let's have a look at the principal components.
s1 <- summary(pca_human)
s1
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9659 1.1387 0.9896 0.8662 0.69949 0.54002 0.46700
## Proportion of Variance 0.4831 0.1621 0.1224 0.0938 0.06116 0.03645 0.02726
## Cumulative Proportion 0.4831 0.6452 0.7676 0.8614 0.92254 0.95899 0.98625
## PC8
## Standard deviation 0.33165
## Proportion of Variance 0.01375
## Cumulative Proportion 1.00000
pca_pr <- round(100*s1$importance[2,], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 48.3 16.2 12.2 9.4 6.1 3.6 2.7 1.4
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
Now we got completely different results: PC1 explains barely half of the variance, and together PC1 and PC2 explain about 65% of it.
Let's consider what drives PC1 and PC2 in each case. With the non-standardized data, PCA is dominated by the variables measured on the largest numeric scales (such as GNI), so the first two components soaked up almost all of the variance. Once the dataset is scaled, every variable contributes on an equal footing, and the first two components explain only about two thirds of the variance.
The decision whether or not to scale the variables is crucial here. In general, we should inspect each variable's distribution before and after scaling; sometimes scaling makes things worse, because the true nature and meaning of the measurement vanishes. In this case, scaling probably served a better purpose: our dataset is no longer explained by just two components. When the variables live on very different scales, some measured in hundreds or thousands (GNI) and some between 0 and 1 (the ratios Edu2.FM and Labo.FM), it makes sense to scale the dataset and make them comparable.
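A minimal sketch of this scale effect on toy data (the columns big and small are hypothetical, not the actual human variables):

```r
# One high-variance column swallows PC1 unless we scale first.
set.seed(42)
d <- cbind(big   = rnorm(50, sd = 100),  # e.g. a GNI-like scale
           small = rnorm(50, sd = 1))    # e.g. a 0-1 ratio
share1 <- function(x) { p <- prcomp(x); p$sdev[1]^2 / sum(p$sdev^2) }
c(raw = share1(d), std = share1(scale(d)))  # raw PC1 takes nearly everything
```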
Let's load the required packages and the dataset, and have a "glimpse" at it.
library(FactoMineR)
library(ggplot2)
library(tidyr)
library(dplyr)
data("tea")
glimpse(tea)
## Observations: 300
## Variables: 36
## $ breakfast <fct> breakfast, breakfast, Not.breakfast, Not.brea...
## $ tea.time <fct> Not.tea time, Not.tea time, tea time, Not.tea...
## $ evening <fct> Not.evening, Not.evening, evening, Not.evenin...
## $ lunch <fct> Not.lunch, Not.lunch, Not.lunch, Not.lunch, N...
## $ dinner <fct> Not.dinner, Not.dinner, dinner, dinner, Not.d...
## $ always <fct> Not.always, Not.always, Not.always, Not.alway...
## $ home <fct> home, home, home, home, home, home, home, hom...
## $ work <fct> Not.work, Not.work, work, Not.work, Not.work,...
## $ tearoom <fct> Not.tearoom, Not.tearoom, Not.tearoom, Not.te...
## $ friends <fct> Not.friends, Not.friends, friends, Not.friend...
## $ resto <fct> Not.resto, Not.resto, resto, Not.resto, Not.r...
## $ pub <fct> Not.pub, Not.pub, Not.pub, Not.pub, Not.pub, ...
## $ Tea <fct> black, black, Earl Grey, Earl Grey, Earl Grey...
## $ How <fct> alone, milk, alone, alone, alone, alone, alon...
## $ sugar <fct> sugar, No.sugar, No.sugar, sugar, No.sugar, N...
## $ how <fct> tea bag, tea bag, tea bag, tea bag, tea bag, ...
## $ where <fct> chain store, chain store, chain store, chain ...
## $ price <fct> p_unknown, p_variable, p_variable, p_variable...
## $ age <int> 39, 45, 47, 23, 48, 21, 37, 36, 40, 37, 32, 3...
## $ sex <fct> M, F, F, M, M, M, M, F, M, M, M, M, M, M, M, ...
## $ SPC <fct> middle, middle, other worker, student, employ...
## $ Sport <fct> sportsman, sportsman, sportsman, Not.sportsma...
## $ age_Q <fct> 35-44, 45-59, 45-59, 15-24, 45-59, 15-24, 35-...
## $ frequency <fct> 1/day, 1/day, +2/day, 1/day, +2/day, 1/day, 3...
## $ escape.exoticism <fct> Not.escape-exoticism, escape-exoticism, Not.e...
## $ spirituality <fct> Not.spirituality, Not.spirituality, Not.spiri...
## $ healthy <fct> healthy, healthy, healthy, healthy, Not.healt...
## $ diuretic <fct> Not.diuretic, diuretic, diuretic, Not.diureti...
## $ friendliness <fct> Not.friendliness, Not.friendliness, friendlin...
## $ iron.absorption <fct> Not.iron absorption, Not.iron absorption, Not...
## $ feminine <fct> Not.feminine, Not.feminine, Not.feminine, Not...
## $ sophisticated <fct> Not.sophisticated, Not.sophisticated, Not.sop...
## $ slimming <fct> No.slimming, No.slimming, No.slimming, No.sli...
## $ exciting <fct> No.exciting, exciting, No.exciting, No.exciti...
## $ relaxing <fct> No.relaxing, No.relaxing, relaxing, relaxing,...
## $ effect.on.health <fct> No.effect on health, No.effect on health, No....
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
Dataset "tea" has 300 rows and 36 columns. The data come from a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions), what their perception of the product is (12 questions) and some personal details (4 questions). Rows represent the individuals and columns the questions. The first 18 questions are active ones, the 19th is a supplementary quantitative variable (age) and the remaining variables are supplementary categorical variables. We can check summaries and some plots of the data, but since it is almost entirely categorical, the plotting tools for visualizing it are quite limited.
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming
## feminine :129 Not.sophisticated: 85 No.slimming:255
## Not.feminine:171 sophisticated :215 slimming : 45
##
##
##
##
##
## exciting relaxing effect.on.health
## exciting :116 No.relaxing:113 effect on health : 66
## No.exciting:184 relaxing :187 No.effect on health:234
##
##
##
##
##
gather(tea) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
Now let's do MCA, but first we keep only some columns of tea, as is done in the DataCamp exercise.
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, one_of(keep_columns))
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
plot(mca, invisible=c("ind"), habillage = "quali")
We got a biplot of the first two MCA dimensions. Judging by the eta2 values above, the variables "how" and "where" contribute most to Dim 1 and Dim 2: "how" is whether the individuals prefer tea bags or unpackaged tea, and "where" is where the tea was purchased (a chain store or a tea shop). The plot supports this interpretation.
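Under the hood, MCA runs a correspondence analysis on the indicator matrix of the data, one 0/1 dummy column per category level. A base-R sketch of that expansion on hypothetical toy factors (not the actual tea columns):

```r
# Expand two toy factors into a full dummy (indicator) matrix.
f <- data.frame(sugar = factor(c("sugar", "No.sugar", "sugar")),
                how   = factor(c("tea bag", "unpackaged", "tea bag")))
ind <- model.matrix(~ . - 1, data = f,
                    contrasts.arg = lapply(f, contrasts, contrasts = FALSE))
ind   # 3 rows; one 0/1 column per level; each row sums to the number of variables
```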
I will repeat the data transformation steps from the R script here, so that the transformation is in this document and I can start the analyses from it.
library(dplyr)
library(tidyr)
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep = "", header =T)
rats<-read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", sep = "", header =T)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
rats$ID <- factor(rats$ID)
rats$Group <- factor(rats$Group)
BPRSL <- BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)
BPRSL <- BPRSL %>% mutate(week = as.integer(substr(weeks,5,5)))
ratsL <- rats %>% gather(key =WD, value = rats, -ID, -Group)
ratsL <- ratsL %>% mutate(Time = as.integer(substr(WD,3,4)))
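The gather()/mutate() steps above can also be sketched in base R; a toy wide-to-long reshape with hypothetical columns week0 and week1:

```r
# reshape() melts the weekN columns into (week, bprs) pairs per subject.
wide <- data.frame(subject = 1:2, week0 = c(42, 58), week1 = c(36, 51))
long <- reshape(wide, direction = "long",
                varying = c("week0", "week1"), v.names = "bprs",
                timevar = "week", times = 0:1, idvar = "subject")
long[order(long$subject, long$week), c("subject", "week", "bprs")]
```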
Now a quick look at the data structures of the long formats, as I already went through them in the R script. We can also use head() to glance at how the data look, in case we completely forgot what is where.
glimpse(BPRSL)
## Observations: 360
## Variables: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0"...
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, ...
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
glimpse(ratsL)
## Observations: 176
## Variables: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1,...
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", ...
## $ rats <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 5...
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8,...
head(BPRSL)
## treatment subject weeks bprs week
## 1 1 1 week0 42 0
## 2 1 2 week0 58 0
## 3 1 3 week0 54 0
## 4 1 4 week0 55 0
## 5 1 5 week0 72 0
## 6 1 6 week0 48 0
head(ratsL)
## ID Group WD rats Time
## 1 1 1 WD1 240 1
## 2 2 1 WD1 225 1
## 3 3 1 WD1 245 1
## 4 4 1 WD1 260 1
## 5 5 1 WD1 255 1
## 6 6 1 WD1 260 1
Now that we have recalled our datasets, we can proceed to the analyses of the rats data (Chapter 8).
We will start by visualizing the individual measurements (weight) for the three groups.
library(ggplot2)
ggplot(ratsL, aes(x = Time, y = rats, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(ratsL$rats), max(ratsL$rats)))
These plots show the variability of weight in the three groups: in Group 1 the weights are the lowest, in Group 2 we have higher values and one rat with the highest weight of all, and Group 3 is more homogeneous, also with higher values.
We now want to standardize the variable to see whether this changes how the data are distributed.
ratsL <- ratsL %>%
group_by(Time) %>%
mutate(stdrats = (rats - mean(rats))/sd(rats) ) %>%
ungroup()
glimpse(ratsL)
## Observations: 176
## Variables: 6
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, ...
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1"...
## $ rats <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445,...
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, ...
## $ stdrats <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.881...
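The group_by/mutate standardization above can be cross-checked with base R's ave(); after scaling, each time point has mean 0 and standard deviation 1 (toy weights here, not the rats data):

```r
# Standardize within each time point: (x - mean) / sd per group.
x    <- c(240, 260, 410, 400, 250, 270, 430, 420)
time <- rep(c(1, 8), each = 4)
z    <- ave(x, time, FUN = function(v) (v - mean(v)) / sd(v))
tapply(z, time, mean)   # both group means are (numerically) zero
```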
ggplot(ratsL, aes(x = Time, y = stdrats, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized rats")
We can now see that the y-axis has a universal scale, since weight is standardized within each time point. So we did not lose the variability inside the groups, but we made the data more comparable between the groups. Next we will make another plot, showing the group mean weights across time.
n <- ratsL$Time %>% unique() %>% length()
ratsS <- ratsL %>%
group_by(Group, Time) %>%
summarise( mean = mean(rats), se = sd(rats)/sqrt(n) ) %>%
ungroup()
glimpse(ratsS)
## Observations: 33
## Variables: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, ...
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 26...
## $ se <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.5529...
ggplot(ratsS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.3) +
theme(legend.position = c(0.9,0.5)) +
scale_y_continuous(name = "mean(rats) +/- se(rats)")
Now we see the plot of the group mean weights, which gives a very nice overview of the three groups against each other. As in the previous plots, Group 1 has the lowest mean weight, while Groups 2 and 3 are closer to each other.
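The error bars above use the standard error sd/sqrt(n); note that in the code n counts the distinct time points (following the DataCamp example), not the group size. The formula itself in base R, on a toy vector:

```r
# Standard error of the mean for a toy vector of weights.
w  <- c(250, 255, 245, 260)
se <- sd(w) / sqrt(length(w))
se
```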
Now we will check whether the rats data contain outliers.
ratsL8S <- ratsL %>%
filter(Time > 0) %>%
group_by(Group, ID) %>%
summarise( mean=mean(rats) ) %>%
ungroup()
glimpse(ratsL8S)
## Observations: 16
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.72...
# Draw a boxplot of the mean versus treatment
ggplot(ratsL8S, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(rats), WD ALL")
# Create a new data by filtering the outlier and adjust the ggplot code the draw the plot again with the new data
ratsL8S1 <- ratsL8S %>%
filter(mean < 550) # first outlier gone
ratsL8S2 <- ratsL8S1 %>%
filter(mean > 250) # second outlier gone; ratsL8S2 is our data
ggplot(ratsL8S2, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(rats minus 2 outliers), WD ALL")
Now we have removed two outliers: one from Group 1 with the lowest mean and one from Group 2 with the highest mean. I am not sure this is the standard way to deal with them, but I follow the example. The high value in Group 3 stays, as it is inside the range of the Group 2 means. Now we are ready for ANOVA.
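The cutoffs 550 and 250 above were picked by eye from the boxplot; base R's boxplot.stats() flags outliers with the 1.5*IQR rule, which catches the same kind of points (toy means here, not the actual group means):

```r
# Values beyond 1.5*IQR from the hinges are reported in $out.
v <- c(261, 238, 260, 266, 269, 594)   # one clearly extreme value
boxplot.stats(v)$out
```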
# Add the baseline from the original wide data as a new variable to the summary data.
# Note: this starts again from ratsL8S (all 16 animals), overwriting the filtered
# ratsL8S2 from above, so that the rows line up with rats$WD1.
ratsL8S2 <- ratsL8S %>%
mutate(baseline = rats$WD1)
# Fit the linear model with the mean as the response
fit <- lm(mean ~ baseline + Group, data = ratsL8S2)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 252125 252125 2237.0655 5.217e-15 ***
## Group 2 726 363 3.2219 0.07586 .
## Residuals 12 1352 113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The baseline term has a very low P value (5.217e-15), so the starting weight strongly predicts the mean weight. The Group term, however, has P = 0.076, so after adjusting for baseline there is no statistically significant difference between the groups at the 0.05 level, only a weak suggestion of one. In any case, ANOVA would not tell us which groups differ, since it tests the null hypothesis that all group means are equal. So we have completed the baseline-adjusted ANOVA for the three groups.
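The structure of that baseline-adjusted ANOVA table can be reproduced on toy data; the hypothetical variables base and grp below mirror baseline and Group above:

```r
# anova(lm(...)) partitions variance into baseline, group and residual terms.
set.seed(7)
base <- rnorm(12, mean = 100, sd = 10)
grp  <- factor(rep(1:3, each = 4))
resp <- base + rnorm(12, sd = 2)   # response tracks baseline
tab  <- anova(lm(resp ~ base + grp))
tab$Df                             # 1 df for baseline, 2 for groups, 8 residual
```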
Now we switch to the BPRS data, where we have bprs measurements for 40 men who received either treatment 1 or treatment 2 (20 subjects in each group). Let's have a quick look at the data.
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
geom_line()
Here we see plots for all individuals across 8 weeks showing the changes in bprs measurement.
A better way would be this kind of a plot:
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
Now we can see them by treatment, which is a much better way to do it.
Let's proceed to a linear model that assumes the measurements are independent: a basic regression model fitted with R's lm() function. Here I do not remove outliers as in the rats dataset; I simply follow the example, and we will compare this fit against mixed models next.
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)
# print out a summary of the model
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
The week coefficient is highly significant (P < 2e-16), so bprs decreases over the weeks, but the treatment coefficient is not significant (P = 0.661), so this model finds no difference between the treatments.
Now we fit the random intercept model, as measurements are not necessarily independent and we need to check different models for the data analysis.
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
##
## expand
# Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
# Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
Fitting a random intercept model allows the linear regression fit for each man to have its own intercept. Now we can move on to fit the random intercept and random slope model to the bprs measurement data.
We will fit the new model and compare the two models with a likelihood ratio test via anova().
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8201 8.0511
## week 0.9608 0.9802 -0.51
## Residual 97.4307 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let's check the chi-squared statistic (7.2721) and P value (0.02636) of the likelihood ratio test between BPRS_ref1 and BPRS_ref. The lower the AIC, the better the fit relative to the comparison model. The P value of 0.02636 is below the usual 0.05 cutoff for statistical significance, so the random intercept and slope model fits the data better. Let's try adding an interaction of week and treatment and see whether that improves the fit further.
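The chi-squared statistic and its P value can be reproduced by hand from the deviances and parameter counts in the table (small differences from the printed 7.2721 come from the deviances being rounded to one decimal):

```r
# LRT: chi-squared = deviance difference, df = difference in parameter counts.
chisq <- 2738.7 - 2731.4   # deviances of BPRS_ref and BPRS_ref1
p     <- pchisq(chisq, df = 7 - 5, lower.tail = FALSE)
round(c(chisq = chisq, p = p), 4)
```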
Here we fit a random intercept and slope model that allows for a treatment × week interaction. We fit the new model and compare it against the previous one with anova().
BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0767 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 65.0016 8.0624
## week 0.9688 0.9843 -0.51
## Residual 96.4699 9.8219
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2522 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again we check the chi-squared statistic (3.1712) and P value (0.07495) of the likelihood ratio test, this time between BPRS_ref2 and BPRS_ref1. Now the P value is greater than 0.05, so the interaction does not seem to improve on the previous model.
Let's visualize the data before and after fitting the random intercept and random slope model BPRS_ref1. Since the last model (with the interaction) had a P value greater than 0.05, I will focus on the previous model, which had the better supported fit.
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
geom_line() +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
scale_y_continuous(name = "bprs") +
theme(legend.position = "top")
# Create a vector of the fitted values
Fitted <- fitted(BPRS_ref1)
# Create a new column fitted to BPRSL
BPRSL <- BPRSL %>%
mutate(Fitted)
# draw the plot of BPRSL with fitted value
ggplot(BPRSL, aes(x = week, y = Fitted, group = subject)) +
geom_line() +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
scale_y_continuous(name = "Fitted bprs") +
theme(legend.position = "top")