Cross Validated Asked by wdnsd on January 1, 2022
I have two questions about how to specify random effects structures in mgcv using bam. I’m using bam because I have a large data set (~15,000 data points) that consists of interviews with different speakers. The binary outcome variable consists of a particular choice between two words, which is influenced by the last choice that a speaker made and the (log) time elapsed between the choices.
m0 = bam(obs ~ prev*log.lag, data=mydata, family='binomial')
I want to include random intercepts and slopes by speaker for the effect of the previous choice. The random intercept models the fact different speakers choose one form or another at different rates, the random slope models the fact that different speakers have different degrees of influence from the previous choice. Everybody tends to repeat their choice, but some people tend to do so more than others.
My first question is which of the following two models (if either) accurately reflects the random effects structure that I want. The first way of doing this, which I’ve seen suggested a couple of places, includes two random effects terms as in m1
:
m1 = bam(obs ~ prev*log.lag + s(speaker, bs='re') + s(prev, speaker, bs='re'), data=mydata, family='binomial')
This seems redundant. For example, when I get the values out of m1
using plot.gam it looks like s(prev, speaker, bs='re')
contains both a random intercept and a random slope for each speaker. There are twice as many values returned as there are factor levels for speaker.
plot.m1 = plot.gam(m1)
> length(plot.m1[[2]]$fit) == 2*length(unique(mydata$speaker))
[1] TRUE
Is this an appropriate way of specifying the random effects structure? or is the model fitting a random intercept by speaker twice? What would it mean if it were?
Another way of specifying the random effects is to not include a separate random intercept term.
m2 = bam(obs ~ prev*log.lag + s(prev, speaker, bs='re'), data=mydata, family='binomial')
plot.m2 = plot.gam(m2)
> length(plot.m2[[1]]$fit) == 2*length(unique(mydata$speaker))
[1] TRUE
Could someone help me understand what the difference between these two is? In either case, extracting $fit
from s(prev, speaker, bs='re')
yields what look like both random intercepts and slopes. Which of these is which? Are random intercepts for all speakers listed then random slopes, are they listed as pairs per speaker? Looking at the code of plot.gam
it seems like these $fit
values are a function of the random effects matrix, but I’m not sure exactly how they follow.
I also want to include a random smooth by speaker to account for aspects of each unique interview that could account for changes in choices by each speaker, as in m3
. My second question is whether the random smooths include both random intercepts and slopes.
m3 = bam(obs ~ prev*log.lag + s(prev, speaker, bs='re') + s(time, by=speaker, bs="fs", m=1), data=mydata, family='binomial')
Looking at the mgcv
documentation (p.171) it seems that using ‘by’ with the factor speaker will make the smooths subject to a centering constraint. How is this centering constraint related to random intercepts?
Thanks!
This is an old question, but I ran into a similar situation when trying to run a gamm
model using bam
.
I believe what's going on is that s(prev, speaker, bs='re')
in the model above is estimating a variance term for the ~prev:speaker-1
interaction (random slopes), whereas s(time, by=speaker, bs="fs", m=1)
is estimating a separate slope term for each speaker. These slopes do not share a variance term, meaning that they aren't really a random effect (and aren't necessarily centred on zero).
Here's a demonstration of the differences between the two methods, using gamm
to compare them. These simulated data are from a normal distribution, but this should work for whatever distribution you're using:
#Idea: y = [b0 + boi] + x[b1 + b1i] + s(z) + error,
#
# where b0i and b1i are random intercepts and slopes, and s(z) is some curvy spline
library(mgcv)
#Generate some fake data
set.seed(123)
ndat <- 900 #900 data points
ngroup <- 30 #30 groups
gInd <- sample(rep(1:30,length.out=900)) #Group index
xVal <- runif(ndat,-1,1) #Range of x values
zVal <- runif(ndat,-pi,pi) #Range of z values
bo <- -1 #Intercept
b1 <- 2 #Slope of x
b0i <- rnorm(ndat/ngroup,0,0.5) #Random intercepts
b1i <- rnorm(ndat/ngroup,0,0.5) #Random slopes
s_z <- sin(zVal) #Sine curve to estimate by spline
err <- rnorm(ndat,0,0.2) #Noise
yVal <- (bo + b0i[gInd]) + (b1 + b1i[gInd])*xVal + s_z + err #Generate values
plot(xVal,yVal) #Looks OK
plot(zVal,yVal)
dat <- data.frame(y=yVal,x=xVal,z=zVal,group=factor(gInd)) #Assemble into dataframe
str(dat)
#Help file for s(x,bs='re')
# ?smooth.construct.re.smooth.spec
#GAMM MODEL
mod1 <- gamm(y~1+x+s(z,bs='cr'),random=list(group=~1+x),data=dat,
control=list(msVerbose=TRUE))
summary(mod1$lme) #Estimates variance terms well, slope terms well
summary(mod1$gam)
plot(mod1$gam); curve(sin(x),-pi,pi,col='red',add=T) #Pretty good estimation of sine curve
#BAM MODEL
#Using "by" notation within s()
mod2a <- bam(y~1+x+s(z,bs='cr')+s(group,bs='re')+s(x,by=group,bs='re'),data=dat)
gam.vcomp(mod2a) #Note the number of sd terms estimated -- 1 for each group
summary(mod2a)
#Using paired notation
mod2b <- bam(y~1+x+s(z,bs='cr')+s(group,bs='re')+s(x,group,bs='re'),data=dat)
gam.vcomp(mod2b) #Fewer sd terms estimated
summary(mod2b)
#Compare estimates of intercepts and slopes to actual values:
ranInt <- data.frame(actual=b0i,gamm=ranef(mod1$lme)[[2]][,1],
bamA=coef(mod2a)[grepl('s\(group\).',names(coef(mod2a)))],
bamB=coef(mod2b)[grepl('s\(group\).',names(coef(mod2b)))])
ranSlope <- data.frame(actual=b1i,gamm=ranef(mod1$lme)[[2]][,2],
bamA=coef(mod2a)[grepl('s\(x\).',names(coef(mod2a)))],
bamB=coef(mod2b)[grepl('s\(x,group\).',names(coef(mod2b)))])
#Good estimates of random intercept terms
pairs(ranInt,upper.panel=function(x,y) {points(x,y); abline(0,1,col='red')},
lower.panel=function(x,y) text(mean(x),mean(y),round(cor(x,y),3),cex=0.1+cor(x,y)*2),
main='Intercepts')
#bamA slope estimates are offset by a constant value when using "by"
pairs(ranSlope,upper.panel=function(x,y) {points(x,y); abline(0,1,col='red')},
lower.panel=function(x,y) text(mean(x),mean(y),round(cor(x,y),3),cex=0.1+cor(x,y)*2),
main='Slopes')
Answered by S. Robinson on January 1, 2022
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP