Why the Median of the Residuals in an \(L_1\) Regression Model With an Intercept is Zero

"Everyone" knows that if you fit a regression model with a constant term using ordinary least squares the average of the residuals is zero. Recall that a regression model with intercept means it is assumed that each observation \(y_i\) can be predicted by a linear combination of \(k\) predictors \(x_{ij}\) and an intercept term: $$ y_i= \alpha + \beta_1x_{i1} + \dots + \beta_kx_{ik} + \epsilon_i $$ The numbers \(\alpha, \beta_1, \dots, \beta_k;\) are known as parameters (in Machine Learning, they are often referred to weights as weights instead). The \(\epsilon_i\) terms represent random and unpredicatble error. It is usually assumed that these errors are independent and identically distributed, and we shall do so for the rest of this article. Usually the parameters are unknown and must be estimated from the observed data. There multiple means by which parameter estimates, denoted by \(\hat{\alpha}, \hat{\beta_1}, \dots, \hat{\beta_k};\) can be computed. Using these estimates, the \(y_i\) are estimated by \(\hat{y}_i = \hat{\alpha} + \hat{\beta}_1x_{i1} + \dots + \hat{\beta}_kx_{ik}.\) Why would we try to estimate the \(y_i\) when we already know them? Because it allows us to estimate the error terms \(\epsilon_i\) using the residuals: $$ \hat{\epsilon}_i = y - \hat{y}_i $$ The most common approach to fitting this model is given by least squares regression. Here the values are chosen to minimise the sum of squared residuals: $$ (\hat{\alpha}, \hat{\beta}_1, \dots, \hat{\beta}_k) = \underset{(\alpha, \beta_1, \dots, \beta_2)}{\arg\min}\sum_{i=1}^n |y_i - \alpha - \beta_1x_{i1} - \dots - \beta_kx_{ik}|^2 $$ It's common knowledge that if the model is fitted by least squares, then the average of the residuals is zero. For example, open an instance of R and type in the folowing code:

	attach(cars)

	model <- lm(dist ~ speed)      # fit a least squares model
	residuals <- model$residuals   # extract the residuals from the fitted model

	print(mean(residuals))         # the mean of the residuals, zero up to floating point error
	[1] 8.65974e-17

What readers might not know is that if you fit a linear regression model using the least absolute error criterion, then the median of the residuals is zero. Recall that least absolute errors regression, usually called \(L_1\) regression, chooses the parameters to minimise the sum of absolute residuals instead of the sum of squared residuals: $$\sum_{i=1}^n |y_i - \alpha - \beta_1x_{i1} - \dots - \beta_kx_{ik}|$$ \(L_1\) regression is more robust against badly behaved data, but least squares regression (or \(L_2\) regression) is easier to fit and performs better on well-behaved data. To demonstrate this you can use the L1pack package to fit an \(L_1\) model in R and take the median of the residuals:

	library(L1pack)
	attach(cars)

	l1.model <- l1fit(x = speed, y = dist)   # fit an L1 model (intercept included by default)
	l1.residuals <- l1.model$residuals       # extract the residuals

	print(median(l1.residuals))              # print the median of the residuals
	[1] 0
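
As an aside, the robustness claim above is easy to check informally. The following sketch (my own, not from the L1pack documentation) plants a single gross outlier in the cars data and refits both models; the outlier value of 500 is an arbitrary choice:

	dist.corrupted <- dist
	dist.corrupted[1] <- 500                        # plant one gross outlier
	l2.fit <- lm(dist.corrupted ~ speed)            # least squares fit
	l1.fit <- l1fit(x = speed, y = dist.corrupted)  # L1 fit

	print(coef(l2.fit))            # the least squares coefficients shift noticeably
	print(l1.fit$coefficients)     # the L1 coefficients are far less affected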

Looking at Least Squares More Closely

Given a set of numbers \(x_1, \dots, x_n,\) the \(L_2\) location problem is defined as that of finding the number \(\mu\) that minimises the least squares distance to the \(x_i\): $$ \hat{\mu} = \underset{\mu}{\arg\min}\sum_{i=1}^n |x_i - \mu|^2 $$ It's not hard to see that the minimiser is given by the sample mean \(\bar{x}\): differentiating the objective with respect to \(\mu\) and setting the derivative to zero gives $$ \hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x} $$ What about the linear regression model with intercept? In least squares regression, the model is fitted by choosing the coefficients to minimise the least squares error: $$ (\hat{\alpha}, \hat{\beta}_1, \dots, \hat{\beta}_k) = \underset{(\alpha, \beta_1, \dots, \beta_k)}{\arg\min}\sum_{i=1}^n |y_i - \alpha - \beta_1x_{i1} - \dots - \beta_kx_{ik}|^2 $$ What can we say about \(\hat{\alpha}\)? Assume for a moment that the \(\hat{\beta}_j\) are known and let \(z_i = y_i - \hat{\beta}_1x_{i1} - \dots - \hat{\beta}_kx_{ik}\). Written in this form, it can be seen that \(\hat{\alpha}\) is given by: $$ \hat{\alpha} = \underset{\alpha}{\arg\min}\sum_{i=1}^n |z_i - \alpha|^2 $$ But this is just the \(L_2\) location problem, so \(\hat{\alpha} = \bar{z}\). This in turn means that the sum (and hence the mean) of the residuals is zero: $$ \begin{align*} \sum_{i=1}^n \left(y_i - \hat{\alpha} - \hat{\beta}_1x_{i1} - \dots - \hat{\beta}_kx_{ik}\right) & = \sum_{i=1}^n\left(y_i - \hat{\beta}_1x_{i1} - \dots - \hat{\beta}_kx_{ik}\right) - \sum_{i=1}^n\hat{\alpha} \\ &= \sum_{i=1}^n z_i - n\hat{\alpha}\\ &= \sum_{i=1}^n z_i - n\bar{z}\\ &= 0 \end{align*} $$
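
You can also verify the location result numerically. The sketch below (using a small made-up sample) minimises the sum of squared deviations with base R's one-dimensional optimiser optimize() and compares the answer to the sample mean:

	x <- c(2.1, 3.5, 0.7, 5.2, 4.4)             # a small made-up sample
	sse <- function(mu) sum((x - mu)^2)         # sum of squared deviations
	opt <- optimize(sse, interval = range(x))   # one-dimensional minimisation

	print(opt$minimum)   # numerical minimiser
	print(mean(x))       # sample mean; the two agree up to the optimiser's tolerance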

\(L_1\) Regression Again

What about the case of \(L_1\) regression? Assume again that the \(\hat{\beta}_j\) are known and let \(z_i = y_i - \hat{\beta}_1x_{i1} - \dots - \hat{\beta}_kx_{ik}\). In this case absolute errors are used instead of squared errors, so that: $$ \hat{\alpha} = \underset{\alpha}{\arg\min}\sum_{i=1}^n |z_i - \alpha| $$ Analogously with the case of the mean above, this means that \(\hat{\alpha}\) is the median of the \(\{z_i\}\) (strictly, for an even number of points any value between the two middle order statistics minimises the sum, and the conventional median is one such minimiser). Since the median is translation equivariant, i.e. \(\text{median}(z_i - c) = \text{median}(z_i) - c\), it follows that: $$\begin{align*} \text{median}\left( y_i - \hat{\alpha} - \hat{\beta}_1x_{i1} - \dots - \hat{\beta}_kx_{ik}\right) &= \text{median}(z_i -\hat{\alpha})\\ &= \text{median}(z_i) - \hat{\alpha}\\ &= \hat{\alpha} - \hat{\alpha} \\ &= 0 \end{align*}$$
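
The same numerical check works for the \(L_1\) location problem. This sketch minimises the sum of absolute deviations over the same kind of made-up sample and compares the answer to the sample median:

	z <- c(2.1, 3.5, 0.7, 5.2, 4.4)             # a small made-up sample (odd size, so the minimiser is unique)
	sae <- function(a) sum(abs(z - a))          # sum of absolute deviations
	opt <- optimize(sae, interval = range(z))   # one-dimensional minimisation

	print(opt$minimum)   # numerical minimiser
	print(median(z))     # sample median; the two agree up to the optimiser's tolerance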

Non-Linear \(L_1\) Regression With an Intercept

Suppose instead our model is \(y = \alpha + f(\mathbf{x}; \boldsymbol{\beta})\), where \(f(\mathbf{x};\boldsymbol{\beta})\) is a nonlinear function of the parameter vector \(\boldsymbol{\beta}\) and \(\alpha\) acts as an intercept term. Then as before, the values of \(\alpha\) and \(\boldsymbol{\beta}\) are chosen to minimise \(\sum_{i=1}^n |y_i - \alpha - f(\mathbf{x}_i; \boldsymbol{\beta})|\). As before, if the optimal value of \(\boldsymbol{\beta}\) is known, then \(\hat{\alpha} = \text{median}\left(y_i - f(\mathbf{x}_i; \hat{\boldsymbol{\beta}})\right)\). It then follows that: $$\begin{align*} \text{median}\left(y_i - \hat{\alpha} - f(\mathbf{x}_i; \hat{\boldsymbol{\beta}})\right) &= \text{median}\left(y_i - f(\mathbf{x}_i; \hat{\boldsymbol{\beta}})\right) - \hat{\alpha}\\ &= \hat{\alpha} - \hat{\alpha}\\ &= 0 \end{align*}$$
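
To see this in practice, here is a minimal sketch. The exponential mean function \(f(x; \beta_1, \beta_2) = \beta_1 e^{\beta_2 x}\) and the simulated data are my own illustrative choices; the fit uses base R's general-purpose optimiser optim() to minimise the sum of absolute residuals:

	set.seed(1)
	x <- seq(0, 2, length.out = 50)
	y <- 1 + 2 * exp(0.8 * x) + rnorm(50, sd = 0.3)   # simulated data for illustration

	sae <- function(p) sum(abs(y - p[1] - p[2] * exp(p[3] * x)))  # sum of absolute residuals
	fit <- optim(c(0, 1, 1), sae)                                 # Nelder-Mead minimisation

	res <- y - fit$par[1] - fit$par[2] * exp(fit$par[3] * x)
	print(median(res))   # approximately zero

Because optim() only finds an approximate minimiser, the median of the residuals here will be close to zero rather than exactly zero.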
