Why the Median of Residuals in an \(L_1\) Regression Model With an Intercept Is Zero
"Everyone" knows that if you fit a regression model with a constant term using ordinary least squares the average of the residuals is zero. Recall that a regression model with intercept means it is assumed that each observation \(y_i\) can be predicted by a linear combination of \(k\) predictors \(x_{ij}\) and an intercept term:
$$ y_i= \alpha + \beta_1x_{i1} + \dots + \beta_kx_{ik} + \epsilon_i $$
The numbers \(\alpha, \beta_1, \dots, \beta_k\) are known as parameters (in Machine Learning, they are often referred to as weights instead). The \(\epsilon_i\) terms represent random, unpredictable error. It is usually assumed that these errors are independent and identically distributed, and we shall assume so for the rest of this article.
Usually the parameters are unknown and must be estimated from the observed data. There are multiple methods by which parameter estimates, denoted by \(\hat{\alpha}, \hat{\beta}_1, \dots, \hat{\beta}_k\), can be computed.
Using these estimates, the \(y_i\) are estimated by \(\hat{y}_i = \hat{\alpha} + \hat{\beta}_1x_{i1} + \dots + \hat{\beta}_kx_{ik}.\) Why would we try to estimate the \(y_i\) when we already know them? Because it allows us to estimate the error terms \(\epsilon_i\) using the residuals:
$$ \hat{\epsilon}_i = y_i - \hat{y}_i $$
The most common approach to fitting this model is least squares regression. Here the parameter estimates are chosen to minimise the sum of squared residuals:
$$ (\hat{\alpha}, \hat{\beta}_1, \dots, \hat{\beta}_k) = \underset{(\alpha, \beta_1, \dots, \beta_k)}{\arg\min}\sum_{i=1}^n |y_i - \alpha - \beta_1x_{i1} - \dots - \beta_kx_{ik}|^2 $$
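As a quick sanity check (this sketch is not part of the original argument), minimising this criterion numerically should recover the same estimates that R's `lm` computes in closed form. Using the built-in `cars` data set with a single predictor:

```r
# Sketch: minimise the sum of squared residuals directly with optim()
# and compare against lm(), using the built-in cars data set.
sse <- function(par, x, y) {
  sum((y - par[1] - par[2] * x)^2)  # par[1] = intercept, par[2] = slope
}
fit <- optim(c(0, 0), sse, x = cars$speed, y = cars$dist, method = "BFGS")
print(fit$par)                              # numerical minimiser
print(coef(lm(dist ~ speed, data = cars)))  # closed-form least squares fit
```

The two sets of estimates agree to several decimal places, as expected for a smooth quadratic criterion.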
It's common knowledge that if the model is fitted by least squares, then the average of the residuals is zero. For example, open an instance of R and type in the following code:
attach(cars)
model <- lm(dist ~ speed)  # fit a least squares model with an intercept
residuals <- model$residuals  # extract the residuals from the fitted model
print(mean(residuals))  # the mean of the residuals
[1] 8.65974e-17
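The tiny nonzero value is just floating-point noise. The underlying reason the mean is exactly zero is the first-order condition for the intercept: setting the derivative of the least squares criterion with respect to \(\alpha\) to zero gives

$$ \frac{\partial}{\partial \alpha} \sum_{i=1}^n (y_i - \alpha - \beta_1x_{i1} - \dots - \beta_kx_{ik})^2 = -2\sum_{i=1}^n \hat{\epsilon}_i = 0, $$

so the residuals of any least squares fit with an intercept sum, and therefore average, to zero.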
What readers mightn't know is that if you fit a linear regression model using the least absolute error criterion, then the median of the residuals is zero. Recall that least absolute errors regression, usually called \(L_1\) regression, chooses the parameters to minimise the sum of absolute residuals instead of the sum of squared residuals: $$\sum_{i=1}^n |y_i - \alpha - \beta_1x_{i1} - \dots - \beta_kx_{ik}|$$ \(L_1\) regression is more robust against badly behaved data, but least squares regression (or \(L_2\) regression) is easier to fit and performs better on nice data. To demonstrate this, you can use the L1pack package to fit an \(L_1\) model in R and take the median of the residuals:
library(L1pack)
attach(cars)
l1.model <- l1fit(y = dist, x = speed)  # fit an L1 model
l1.residuals <- l1.model$residuals  # extract the residuals
print(median(l1.residuals))  # the median of the residuals
[1] 0
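Why does this happen? Here is a sketch of the argument. Hold the fitted slopes \(\hat{\beta}_1, \dots, \hat{\beta}_k\) fixed and consider shifting the fitted intercept by an amount \(t\). Each residual becomes \(\hat{\epsilon}_i - t\), so the \(L_1\) criterion, as a function of the shift, is

$$ g(t) = \sum_{i=1}^n |\hat{\epsilon}_i - t|. $$

Because \(\hat{\alpha}\) is optimal, \(g\) is minimised at \(t = 0\). But a classical fact says that \(\sum_i |e_i - t|\) is minimised over \(t\) precisely at a median of the \(e_i\). Hence zero is a median of the residuals. (The same shifting argument with squared errors explains the least squares case: \(\sum_i (e_i - t)^2\) is minimised at the mean, so the mean of the residuals must be zero.)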