1

1

The bias-variance trade-off is treated almost like a law in machine learning: you cannot have the best of both worlds. What is it about supervised learning that makes it impossible to minimize both at the same time?

5

The tradeoff between bias and variance summarizes the "tug of war" game between fitting a model that predicts the underlying training dataset well (low bias) and producing a model that doesn't change much with the training dataset (low variance).

What statisticians and mathematicians realized a while ago is that a sufficiently flexible model can be made to fit the dataset at hand perfectly (i.e. have zero error on the training data, which is what I loosely call zero bias here). Look, for instance, at this picture from the Wikipedia page on overfitting:

The graph depicts two binary classifiers that are trying to distinguish between the blue class and the red class. The classifiers have been trained on all of the blue and red observations shown in the graph to generate their boundaries. Notice that the green line has zero bias; it perfectly separates the blue and red classes whereas the black line clearly has a few errors on the training set and therefore has higher bias.

The problem? Observations in datasets are realizations of random variables that follow some unknown probability distribution. Thus, observed data will always have some sort of noise, containing observations that don't actually represent the underlying probability distribution. These anomalies can be seen in the graph above in which there are red and blue observations that appear to be "crossing" the implied boundary between the two classes. Clearly, the majority of observations are easily distinguishable in this simple example except for a few cases. These few cases can be viewed as noise, which is inherently random and therefore, cannot be predicted.

The green line, as a result, is essentially fitting random noise, which by definition is unpredictable and non-representative. If we were to train both classifiers again using a new training dataset (from the same process), the "green boundary model" would be expected to generate a very different boundary from the one shown above, since it was influenced by data that did not represent the actual underlying data generating process. This is overfitting, i.e. high variance. The model associated with the black line, which has refrained from fitting the noise in the data, should be expected to remain relatively stable, since it was not influenced by data that was not actually representative of the population.

What does this all mean? The model associated with the black line is closer to reality (the "truth"). Therefore, if we were to pull new observations from "reality" that neither classifier has seen, the black-line model would, on average, be expected to produce lower overall error **despite** having higher bias. This is the bias-variance tradeoff, which is really just a way of stating the tradeoff between underfitting and overfitting. I don't have another graph to illustrate this, but imagine if I straight up drew a diagonal line from the bottom left corner to the top right corner, and called this another classifier. Now, my "model" has zero variance; in fact, it is completely independent of the training dataset. But now, I haven't even fit the training dataset. I have bought zero variance at the price of tons of bias and large overall error.
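To make the "retrain on a new dataset" thought experiment concrete, here is a minimal simulation sketch. Everything in it is my own assumption for illustration, not something from the graph above: a 1-D regression problem instead of a classifier, a sine curve as the "truth", Gaussian noise, and k-nearest-neighbour averaging as the model. With k=1 the model reproduces every training point (like the green line), k=5 is smoother (like the black line), and the constant model ignores the data entirely (like the diagonal line).

```python
import math
import random


def true_f(x):
    # Assumed "truth" for this toy problem.
    return math.sin(2 * math.pi * x)


def sample_training_set(rng, n=50, noise_sd=0.3):
    # One realization of the data generating process: truth plus noise.
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def knn_predict(xs, ys, x0, k):
    # Average the targets of the k training points closest to x0.
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


def bias_variance(predict, n_sets=200, seed=0):
    # Refit on many fresh training sets and look at how the predictions
    # on a fixed grid behave: their average error (bias) and their spread
    # from one training set to the next (variance).
    rng = random.Random(seed)
    grid = [i / 20 for i in range(21)]
    preds = [[] for _ in grid]
    for _ in range(n_sets):
        xs, ys = sample_training_set(rng)
        for j, x0 in enumerate(grid):
            preds[j].append(predict(xs, ys, x0))
    bias_sq = 0.0
    variance = 0.0
    for x0, p in zip(grid, preds):
        mean_p = sum(p) / len(p)
        bias_sq += (mean_p - true_f(x0)) ** 2
        variance += sum((v - mean_p) ** 2 for v in p) / len(p)
    return bias_sq / len(grid), variance / len(grid)


models = {
    "k=1 (fits every training point)": lambda xs, ys, x0: knn_predict(xs, ys, x0, 1),
    "k=5 (smoother)": lambda xs, ys, x0: knn_predict(xs, ys, x0, 5),
    "constant 0 (ignores the data)": lambda xs, ys, x0: 0.0,
}
results = {name: bias_variance(model) for name, model in models.items()}
for name, (b, v) in results.items():
    print(f"{name}: bias^2={b:.3f}  variance={v:.3f}  sum={b + v:.3f}")
```

Under these assumptions the k=1 model should show the largest variance, the constant model exactly zero variance but by far the largest bias, and the smoother k=5 model the smallest sum of the two.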

The decomposition of MSE into bias, variance, and irreducible error encapsulates the tradeoff. When you hope to fit a model that **generalizes well to the population**, simply having a model with low bias is not enough and minimizing variance is also important, if not more important. But, if we are to take variance into consideration, we now have to sacrifice some ability to predict the training set well (through regularization) in hopes of obtaining a model that is closer to the truth.
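For reference, the decomposition can be written out as follows. The notation is my own assumption: the data follow $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, $\hat f$ is the model fitted on a random training set, and the expectation is taken over training sets and over the noise in the new observation at a fixed $x$.

$$
\mathbb{E}\big[(y - \hat f(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[\big(\hat f(x) - \mathbb{E}[\hat f(x)]\big)^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible error}}
$$

Only the first two terms depend on the model. The last term is the noise in the new observation itself, which no model can remove.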

Most ML methods offer some form of regularization to trade bias (again, worse performance on the training dataset) for, hopefully, lower variance. Some examples:

- L1/L2/elastic net penalties add extra terms to the objective behind the usual maximum likelihood estimators for the regression coefficients in GLMs. These terms make the estimators biased (the coefficients are shrunk toward zero), but the idea is to penalize in an effective way such that we get bigger gains in reducing variance, which in turn results in lower overall error.
- Support vector machines (see the picture above) have a cost parameter that controls how smooth or "wiggly" the boundary is. As you can see in the graph above, a smoother boundary accepts some bias by purposely missing some observations in the training dataset, in exchange for a model that generalizes better to the population overall.
- Neural networks have dropout layers that effectively introduce bias by simply removing hidden units from the model at random during training. The goal here is to purposely drop parts of the model that have learned from the training dataset, in hopes of preventing complex co-adaptations that are highly specific to the training set and not to the general population.
- Decision trees can be controlled through pruning (cost complexity pruning, i.e. removing branches that do not significantly reduce in-sample error or lead to large enough information gains) and through limiting the depth of the tree. The result of both methods? Again, higher bias (we remove rules near the bottom of the tree that may have been created only to better predict non-representative noise observations, leading to sparse terminal nodes) but, hopefully, lower variance.
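The first bullet is the easiest one to check numerically. Below is a minimal sketch, assuming a one-coefficient linear model without intercept, for which the L2-penalized (ridge) estimate has the closed form `sum(x*y) / (sum(x*x) + lam)`. The true slope, noise level, sample size, and penalty `lam=5.0` are all made-up values chosen so the effect is visible on a small, noisy dataset.

```python
import random


def ridge_slope(xs, ys, lam):
    # Closed-form L2-penalized least squares for y = beta * x (no intercept).
    # lam = 0 gives the ordinary least-squares estimate.
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def simulate(lam, true_beta=0.5, n=10, noise_sd=2.0, n_sets=2000, seed=0):
    # Re-estimate the slope on many fresh training sets and summarize
    # how the estimates behave around the (assumed) true slope.
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_sets):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [true_beta * x + rng.gauss(0, noise_sd) for x in xs]
        estimates.append(ridge_slope(xs, ys, lam))
    mean_est = sum(estimates) / n_sets
    bias = mean_est - true_beta
    variance = sum((e - mean_est) ** 2 for e in estimates) / n_sets
    return bias, variance, bias ** 2 + variance


ols = simulate(lam=0.0)
ridge = simulate(lam=5.0)
print("unpenalized: bias=%.3f variance=%.3f mse=%.3f" % ols)
print("ridge:       bias=%.3f variance=%.3f mse=%.3f" % ridge)
```

In this setup the penalized estimate should come out clearly biased toward zero, yet with a variance small enough that its overall squared error is lower than that of the unpenalized estimate. With a much larger `lam`, or with far more data, the penalty would cost more than it gains.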

1

Bias and variance are just descriptions for the two ways that a model can give subpar results. Either the model hasn't learned enough yet and its understanding of the problem is very general (bias), or it has learned the data given to it too well and cannot relate that knowledge to new data (variance). By carefully monitoring how our model is doing in relation to training, test, and validation sets, we are able to minimize bias and variance.
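As a rough sketch of what that monitoring can look like, the snippet below compares training and validation error over several model complexities and only touches the test set once at the end. The data generating process, the k-nearest-neighbour model, and the candidate values of k are assumptions made for illustration.

```python
import math
import random


def make_data(rng, n, noise_sd=0.3):
    # Assumed toy process: noisy samples of a sine curve.
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def knn_predict(train_x, train_y, x0, k):
    # Average the targets of the k training points closest to x0.
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))[:k]
    return sum(train_y[i] for i in nearest) / k


def mse(train_x, train_y, eval_x, eval_y, k):
    # Mean squared error of the k-NN model on the evaluation points.
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2
              for x, y in zip(eval_x, eval_y)]
    return sum(errors) / len(errors)


rng = random.Random(0)
train_x, train_y = make_data(rng, 100)
val_x, val_y = make_data(rng, 100)
test_x, test_y = make_data(rng, 100)

# Small k = flexible model, large k = rigid model (k=100 is the global mean).
candidates = [1, 3, 5, 10, 20, 50, 100]
scores = {
    k: (mse(train_x, train_y, train_x, train_y, k),
        mse(train_x, train_y, val_x, val_y, k))
    for k in candidates
}
for k, (train_err, val_err) in scores.items():
    print(f"k={k:3d}  train MSE={train_err:.3f}  validation MSE={val_err:.3f}")

# Pick the complexity on the validation set, then report the test error once.
best_k = min(candidates, key=lambda k: scores[k][1])
test_mse = mse(train_x, train_y, test_x, test_y, best_k)
print(f"chosen k={best_k}, test MSE={test_mse:.3f}")
```

A training error of zero next to a much larger validation error (k=1 here) is the variance problem. Training and validation errors that are both large (k=100 here) are the bias problem.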

Even if we minimize some error function over our test sets, we likely still had to trade some bias for lower variance. All regularization techniques (which, if you are tuning hyperparameters, you are probably using) introduce bias in some fashion and trade higher bias for hopefully lower variance. – aranglol – 2019-08-27T16:38:09.520