6.5 The multinomial logit model

The multinomial logit model is used to model mutually exclusive discrete outcomes or qualitative response variables. However, this model assumes the independence of irrelevant alternatives (IIA), meaning that the choice between two alternatives does not depend on a third alternative. We consider the multinomial mixed logit model (not to be confused with the random parameters logit model), which accounts for both alternative-varying regressors (conditional) and alternative-invariant regressors (multinomial) simultaneously.²⁸

In this setting, there are \(L\) mutually exclusive alternatives, and the dependent variable \(y_{il}\) is equal to 1 if the \(l\)th alternative is chosen by individual \(i\), and 0 otherwise, where \(l=\left\{1,2,\dots,L\right\}\). The likelihood function is

\[ p(\boldsymbol{\beta} \mid \boldsymbol{y}, \boldsymbol{X}) = \prod_{i=1}^{N} \prod_{l=1}^{L} p_{il}^{y_{il}}, \]

where the probability that individual \(i\) chooses the alternative \(l\) is given by

\[ p_{il} := p(y_i = l \mid \boldsymbol{\beta}, \boldsymbol{X}) = \frac{\exp\left\{\boldsymbol{x}_{il}^{\top} \boldsymbol{\beta}_l\right\}}{\sum_{j=1}^{L} \exp\left\{\boldsymbol{x}_{ij}^{\top} \boldsymbol{\beta}_j\right\}}, \]

\(\boldsymbol{y}\) and \(\boldsymbol{X}\) are the vector and matrix of the dependent variable and regressors, respectively, and \(\boldsymbol{\beta}\) is the vector containing all the coefficients.

Remember that coefficients associated with alternative-invariant regressors are set to 0 for the baseline category, and the coefficients associated with the alternative-varying regressors are the same for all the categories. In addition, we assume \(\boldsymbol{\beta} \sim N(\boldsymbol{\beta}_0, \boldsymbol{B}_0)\) as the prior distribution. Thus, the posterior distribution is

\[ \pi(\boldsymbol{\beta} \mid \boldsymbol{y}, \boldsymbol{X}) \propto p(\boldsymbol{\beta} \mid \boldsymbol{y}, \boldsymbol{X}) \times \pi(\boldsymbol{\beta}). \]

As the multinomial logit model does not have a standard posterior distribution, P. E. Rossi, Allenby, and McCulloch (2012) propose a “tailored” independent Metropolis-Hastings algorithm where the proposal distribution is a multivariate Student’s \(t\) distribution with \(v\) degrees of freedom (tuning parameter), mean equal to the maximum likelihood estimator, and scale equal to the inverse of the Hessian matrix.

Example: Simulation exercise

Let’s conduct a simulation exercise to evaluate the performance of the Metropolis-Hastings algorithm for inference in the multinomial logit model. We consider a scenario with three alternatives, one alternative-invariant regressor (plus the intercept), and three alternative-varying regressors. The population parameters are given by \(\boldsymbol{\beta}_1 = [1 \ -2.5 \ 0.5 \ 0.8 \ -3]^{\top}\), \(\boldsymbol{\beta}_2 = [1 \ -3.5 \ 0.5 \ 0.8 \ -3]^{\top}\), and \(\boldsymbol{\beta}_3 = [0 \ 0 \ 0.5 \ 0.8 \ -3]^{\top}\), where the first two elements of each vector correspond to the intercept and the alternative-invariant regressor, while the last three elements correspond to the alternative-varying regressors. The sample size is 1000, and all regressors are simulated from standard normal distributions.

We can deploy our GUI using the command line at the beginning of this chapter. We should follow the next Algorithm to run multinomial logit models in our GUI (see Chapter 5 for details).

Algorithm: Multinomial logit models

Select Univariate Models on the top panel
Select Multinomial Logit model using the left radio button
Upload the dataset by first selecting whether there is a header in the file, and the kind of separator in the csv file of the dataset (comma, semicolon, or tab). Then, use the Browse button under the Choose File legend. You should see a preview of the dataset
Select MCMC iterations, burn-in, and thinning parameters using the Range sliders
Select dependent and independent variables using the Formula builder table
Select the Base Alternative
Select the Number of choice categorical alternatives
Select the Number of alternative specific variables
Select the Number of Non-alternative specific variables
Click the Build formula button to generate the formula in R syntax.
Set the hyperparameters: mean vector and covariance matrix. This step is not necessary as by default our GUI uses non-informative priors
Select the tuning parameter for the Metropolis-Hastings algorithm, that is, the Degrees of freedom: Multivariate Student’s t distribution
Click the Go! button
Analyze results
Download posterior chains and diagnostic plots using the Download Posterior Chains and Download Posterior Graphs buttons

The following code in R demonstrates how to implement the M-H algorithm from scratch. The first part simulates the dataset, the second part constructs the log-likelihood function, and the third part implements the M-H algorithm. We use vague priors centered at zero, with a covariance matrix of \(1000\mathbf{I}_7\). We observe that the posterior estimates closely match the population parameters, and all 95% credible intervals contain the population parameters.

remove(list = ls())
set.seed(12345)
# Simulation of data
N<-1000  # Sample Size
B<-c(0.5,0.8,-3); B1<-c(-2.5,-3.5,0); B2<-c(1,1,0)
# Alternative specific attributes of choice 1, for instance, price, quality and duration of choice 1
X1<-matrix(cbind(rnorm(N,0,1),rnorm(N,0,1),rnorm(N,0,1)),N,length(B)) 
# Alternative specific attributes of choice 2, for instance, price, quality and duration of choice 2
X2<-matrix(cbind(rnorm(N,0,1),rnorm(N,0,1),rnorm(N,0,1)),N,length(B))
# Alternative specific attributes of choice 3, for instance, price, quality and duration of choice 3
X3<-matrix(cbind(rnorm(N,0,1),rnorm(N,0,1),rnorm(N,0,1)),N,length(B))
X4<-matrix(rnorm(N,1,1),N,1)
V1<-B2[1]+X1%*%B+B1[1]*X4; V2<-B2[2]+X2%*%B+B1[2]*X4; V3<-B2[3]+X3%*%B+B1[3]*X4
suma<-exp(V1)+exp(V2)+exp(V3)
p1<-exp(V1)/suma; p2<-exp(V2)/suma; p3<-exp(V3)/suma
p<-cbind(p1,p2,p3)
y<- apply(p,1, function(x)sample(1:3, 1, prob = x, replace = TRUE))
y1<-y==1; y2<-y==2; y3<-y==3
# Log likelihood
log.L<- function(Beta){
    V1<-Beta[1]+Beta[3]*X4+X1%*%Beta[5:7]
    V2<-Beta[2]+Beta[4]*X4+X2%*%Beta[5:7]
    V3<- X3%*%Beta[5:7]
    suma<-exp(V1)+exp(V2)+exp(V3)
    p11<-exp(V1)/suma;  p22<-exp(V2)/suma;  p33<-exp(V3)/suma
    suma2<-NULL
    for(i in 1:N){
        suma1<-y1[i]*log(p11[i])+y2[i]*log(p22[i])+y3[i]*log(p33[i])
        suma2<-c(suma2,suma1)}
    logL<-sum(suma2)
    return(-logL)
}
# Parameters: Proposal
k <- 7
res.optim<-optim(rep(0, k), log.L, method="BFGS", hessian=TRUE)
MeanT <- res.optim$par
ScaleT <- as.matrix(Matrix::forceSymmetric(solve(res.optim$hessian))) # Force this matrix to be symmetric
# Hyperparameters: Priors
B0 <- 1000*diag(k); b0 <- rep(0, k)
MHfunction <- function(iter, tuning){
    Beta <- rep(0, k); Acept <- NULL 
    BetasPost <- matrix(NA, iter, k)
    pb <- winProgressBar(title = "progress bar", min = 0, max = iter, width = 300)
    for(s in 1:iter){
        LogPostBeta <- -log.L(Beta) + mvtnorm::dmvnorm(Beta, mean = b0, sigma = B0, log = TRUE)
        BetaC <- c(LaplacesDemon::rmvt(n=1, mu = MeanT, S = ScaleT, df = tuning))
        LogPostBetaC <- -log.L(BetaC) + mvtnorm::dmvnorm(BetaC, mean = b0, sigma = B0, log = TRUE)
        alpha <- min(exp((LogPostBetaC-mvtnorm::dmvt(BetaC, delta = MeanT, sigma = ScaleT, df = tuning, log = TRUE))-(LogPostBeta-mvtnorm::dmvt(Beta, delta = MeanT, sigma = ScaleT, df = tuning, log = TRUE))) ,1)
        u <- runif(1)
        if(u <= alpha){
            Acepti <- 1; Beta <- BetaC
        }else{
            Acepti <- 0; Beta <- Beta
        }
        BetasPost[s, ] <- Beta; Acept <- c(Acept, Acepti)
        setWinProgressBar(pb, s, title=paste( round(s/iter*100, 0),"% done"))
    }
    close(pb); AcepRate <- mean(Acept)
    Results <- list(AcepRate = AcepRate, BetasPost = BetasPost)
    return(Results)
}
# MCMC parameters
mcmc <- 10000; burnin <- 1000; thin <- 5; iter <- mcmc + burnin; keep <- seq(burnin, iter, thin); tuning <- 6 # Degrees of freedom
ResultsPost <- MHfunction(iter = iter, tuning = tuning)
summary(coda::mcmc(ResultsPost$BetasPost[keep[-1], ]))

## 
## Iterations = 1:2000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 2000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##         Mean      SD Naive SE Time-series SE
## [1,]  0.9711 0.20162 0.004508       0.004508
## [2,]  0.9742 0.20934 0.004681       0.004681
## [3,] -2.4350 0.18950 0.004237       0.004137
## [4,] -3.4195 0.24656 0.005513       0.005513
## [5,]  0.5253 0.07396 0.001654       0.001654
## [6,]  0.8061 0.08007 0.001790       0.001790
## [7,] -3.0853 0.17689 0.003955       0.003955
## 
## 2. Quantiles for each variable:
## 
##         2.5%     25%     50%     75%   97.5%
## var1  0.5862  0.8367  0.9650  1.1017  1.3683
## var2  0.5679  0.8310  0.9681  1.1151  1.3761
## var3 -2.8239 -2.5607 -2.4291 -2.3050 -2.0812
## var4 -3.9176 -3.5806 -3.4074 -3.2496 -2.9423
## var5  0.3840  0.4761  0.5250  0.5759  0.6647
## var6  0.6555  0.7494  0.8064  0.8616  0.9604
## var7 -3.4476 -3.1991 -3.0777 -2.9641 -2.7500

References

Rossi, Peter E, Greg M Allenby, and Rob McCulloch. 2012. Bayesian Statistics and Marketing. John Wiley & Sons.

The multinomial mixed logit model can be implemented as a conditional logit model.↩︎