In the previous article of the series, we talked about Neural Networks, giving an introduction to their structure, architecture, neurons, layers, and so on. We also saw that it is necessary to have as much data as possible to train a neural network, so that it can generalize well and achieve high accuracy.
In this second part of the series, we will see how to divide this data into different datasets and why we must do this. We will also introduce the problems of underfitting and overfitting, along with some robust techniques used to solve them. If you want to read the first article of this series, feel free to access it through this link.
When developing a new neural network system, it is necessary to split the entire dataset into three sets: the training, development, and test sets.
The workflow is that you keep training the neural network using the training set and use the development set (also called the hold-out or simply dev set) to see which of the different trained models performs best. The training set is used exclusively to update the weights and biases of the neural network. After a number of training runs, the dev set gives an estimate of how well each model is doing, which is used to verify which one achieved the best performance.
After having done this for long enough, you can take the best model and evaluate it on the test set to get an unbiased estimate of how well your neural network is performing. The test set is very important and must be used only when training is already over. By doing this, we can verify whether the system performs well on data it has never seen, which will be the case when the neural network is used in any real application.
It is important to choose the dev and test sets from the same distribution, and they must be drawn randomly from all the available data. Moreover, the dev and test sets should reflect the data you expect to get in the future. For example, consider the case where you have access to 100 k high-resolution images of cats to use as the training set. If your application will run on a device that acquires low-resolution photos of cats, both your dev and test sets must contain low-resolution photos of cats. Otherwise, the accuracy found when testing your system will not be meaningful for the application the system was developed for.
The proportion of images in each set depends on the amount of data you have. In the earlier era of machine learning, it was common to split the data 60/20/20% (train/dev/test). However, since datasets are much larger nowadays, it is now common to split the data 98/1/1% or in some similar proportion.
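A random 98/1/1 split like this can be sketched in a few lines of NumPy. The helper function below is purely illustrative (not from any particular library), and integer IDs stand in for actual images:

```python
import numpy as np

def train_dev_test_split(data, dev_frac=0.01, test_frac=0.01, seed=0):
    """Randomly split an array of samples into train/dev/test sets (98/1/1 by default)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))           # random order over all samples
    n_dev = int(len(data) * dev_frac)
    n_test = int(len(data) * test_frac)
    dev = data[idx[:n_dev]]
    test = data[idx[n_dev:n_dev + n_test]]
    train = data[idx[n_dev + n_test:]]
    return train, dev, test

samples = np.arange(100_000)                   # stand-ins for 100 k images
train, dev, test = train_dev_test_split(samples)
print(len(train), len(dev), len(test))         # 98000 1000 1000
```

Because every sample is drawn from the same shuffled permutation, the three sets are disjoint and all come from the same distribution.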
Consider now an example where you have 100 k images of cats from the web (which have high resolution) and 8 k images from the target distribution (images with the same quality your application will see). In this case, you could have dev and test sets with 2 k images each and leave the remaining 104 k images for the training set. However, based on what you have learned, what is the best way to divide these images? The figure below shows how you should perform this split.
All of the images taken from the web go into the training set. Since the dev and test sets must come from the same distribution, and they must be images with quality similar to the target distribution, we separate the 8 k target images into two groups of 4 k images. The first 4 k images are shuffled together with the 100 k web images to create the training set. The other 4 k images are also shuffled and separated into two groups of 2 k images each, which become the dev and test sets.
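The split described above can be sketched as follows, again using integer IDs as stand-ins for images (all names and counts mirror the example, but the code itself is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
web_images = np.arange(100_000)               # 100 k high-resolution web images
target_images = np.arange(100_000, 108_000)   # 8 k target-distribution images

# Shuffle the target-distribution images before dividing them.
shuffled = rng.permutation(target_images)

# 4 k of them join the web images to form the 104 k-image training set...
train = np.concatenate([web_images, shuffled[:4_000]])
rng.shuffle(train)

# ...and the remaining 4 k are split evenly into dev and test sets,
# so both come entirely from the target distribution.
dev, test = shuffled[4_000:6_000], shuffled[6_000:]
print(len(train), len(dev), len(test))        # 104000 2000 2000
```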
Bias vs Variance
To understand the concepts of bias and variance, consider the figure below, where we want to build a classifier that can distinguish between two output classes: red X and black O.
In the leftmost image, the system fits a simple straight line to the data. Since there are many red X in the lower part of the graph, this classifier is not a good fit to the data. In this case, we say that the classifier has high bias and is underfitting the data.
In the rightmost image, we created a very complex classifier, which correctly classifies every sample. However, this classifier is not a good fit either, since it will only perform well if the data is always distributed exactly as shown in that figure. Therefore, this system will probably not work on new samples of these two classes. In this case, we say that the classifier has high variance and is overfitting the data.
Finally, in the center figure, the classifier sits between the other two systems. It is neither too simple nor too complex, so it achieves high accuracy and will also probably classify new samples correctly. This classifier is a much more reasonable way to fit the data, so we can call it just right.
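The same three regimes can be reproduced in a toy curve-fitting experiment: fitting polynomials of increasing degree to noisy data. The degrees, noise level, and target curve below are arbitrary example choices:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)      # noisy training data

# Fresh samples from the same distribution, standing in for a dev set.
y_new = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

results = {}
for degree in (1, 4, 15):                     # underfit, just right, overfit
    p = Polynomial.fit(x, y, degree)          # least-squares polynomial fit
    train_mse = np.mean((p(x) - y) ** 2)
    new_mse = np.mean((p(x) - y_new) ** 2)
    results[degree] = (train_mse, new_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, new-data MSE {new_mse:.3f}")
```

The training error can only fall as the degree grows, while the error on fresh data generally stops improving and can rise again once the high-degree fit starts chasing the noise, just like the rightmost classifier above.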
When developing a neural network, you must avoid building a classifier that overfits or underfits the data. But how can we know whether the system is overfitting or underfitting? It's simple: we just have to analyze the errors on the training and dev sets we have just studied. Let's consider the scenarios shown in the figure below:
In the first scenario, the system performed very well on the training set, but on data it has never seen (the dev set) the error is much higher. In this scenario, the neural network has high variance and, consequently, overfits the data.
In the second scenario, the system performed poorly on both the training and dev sets, which characterizes a high-bias problem. In this case, the system is underfitting the data.
In the third scenario, the system performed poorly on the training set and even worse on the dev set. This particular system has both high bias and high variance.
Finally, in the fourth scenario, the system performed well on both the training and dev sets. In this case, there is no problem with the neural network and the system is ready to be used.
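These four scenarios can be condensed into a simple rule of thumb. The helper below is purely illustrative: the thresholds and the target (roughly human-level) error are assumed example values, not fixed rules:

```python
def diagnose(train_err, dev_err, target_err=0.01):
    """Rough bias/variance diagnosis from training and dev set errors."""
    high_bias = train_err - target_err > 0.05   # far from the achievable error
    high_variance = dev_err - train_err > 0.05  # large train/dev gap
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_variance:
        return "high variance (overfitting)"
    if high_bias:
        return "high bias (underfitting)"
    return "looks good"

print(diagnose(0.01, 0.11))  # scenario 1: high variance (overfitting)
print(diagnose(0.15, 0.16))  # scenario 2: high bias (underfitting)
print(diagnose(0.15, 0.30))  # scenario 3: high bias and high variance
print(diagnose(0.01, 0.02))  # scenario 4: looks good
```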
Now that we know how to identify whether the system has high variance or high bias, we must learn how to solve these problems and build a better neural network. In general, high variance is the most common problem, and it can be addressed by increasing the number of samples used during training. However, it is not always possible to get more data, so other techniques, such as regularization, can be applied to reduce the variance of the neural network.
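As a first taste of regularization, the sketch below shows L2 regularization (weight decay) applied to a single gradient-descent step; the learning rate and regularization strength are arbitrary example values:

```python
import numpy as np

def sgd_step(w, grad, lr=0.1, lam=0.01):
    """One gradient-descent step with L2 regularization (weight decay).

    L2 regularization adds lam * w to the loss gradient, shrinking the weights
    toward zero and discouraging the overly complex fits seen above.
    """
    return w - lr * (grad + lam * w)

w = np.array([2.0, -3.0])
w = sgd_step(w, grad=np.zeros(2))
print(w)  # weights shrink slightly even when the loss gradient is zero
```

Even with a zero loss gradient, each step multiplies the weights by (1 - lr * lam), so large weights are steadily penalized; this is one way regularization reduces variance without requiring more data.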
In this article, we focused on explaining how the data must be separated into different sets to properly train and evaluate a neural network. We also discussed the importance of carefully selecting the samples for each set.
Finally, we discussed how to evaluate a neural network through the analysis of bias and variance. Both properties can be assessed using the errors on the training and development sets, and they guide what must be done to achieve better accuracy.