I’m currently in South Africa attending a wedding. Whilst the South Africans are complaining about the weather because it’s their winter, I’m walking around in shorts, sandals, and short-sleeved summer shirts. Still, this didn’t stop me packing my spare MacBook, as I have deadlines. Imperial College London is teaching a computational medicine degree, and we are focusing on teaching machine learning. On top of this, we have to write material for the students, which could in time be turned into a textbook. What I love about writing educational material is that you have to think about each concept in another way. I had reached the overfitting section before I had to close the laptop. Some local South African bankers gave me their numbers and wanted to take me out to dinner to try some of the local food. We started a WhatsApp group to organise when and where to go. Thankfully, they were treating me as a guest to the country, so I got to decide what I liked the sound of. This was a very different story to my birthday a couple of years ago, which resulted in too many people attending. The contrast hit me: if you’ve thrown a party, you already understand overfitting in machine learning.
Overfitting is when an algorithm performs well on its training dataset but does not generalise well to data it has not seen. A human comparison is revising for an exam. A very bad strategy would be to memorise all the answers to previous exams. When you go over those exams again, you will score very well because you know all the answers. However, you will not generalise well: if the new exam tests your understanding of a concept from a different angle, there is a high chance you will get it wrong.
To test for this, we split our dataset randomly into a training set and a test set. The algorithm is trained on the training set and then evaluated on the unseen test set. The metrics on the test set are the ones that should be considered when evaluating the algorithm. Whether overfitting is the cause of bad test metrics can be diagnosed with a learning curve, where the cost function on the training set and on the test set are plotted against the amount of training data the algorithm has consumed.
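To make the exam analogy concrete, here is a minimal sketch of a random train/test split and a model that simply memorises its training data. Everything in it is an assumption for illustration: the dataset is made up (a single feature `x` with noisy labels), and the helper names `make_dataset`, `train_test_split`, `memoriser_predict` and `accuracy` are hypothetical, not taken from any course material or library.

```python
import random

def make_dataset(n, seed=0):
    """Hypothetical data: x in [0, 1), label is 1 if x > 0.5, with 20% of labels flipped."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.2:  # label noise, so memorising is a bad strategy
            label = 1 - label
        rows.append((x, label))
    return rows

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def memoriser_predict(train, x):
    """The 'memorise the past papers' model: copy the label of the closest training point."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]

def accuracy(train, rows):
    """Fraction of rows whose label the memoriser predicts correctly."""
    correct = sum(memoriser_predict(train, x) == y for x, y in rows)
    return correct / len(rows)

data = make_dataset(400)
train, test = train_test_split(data)
train_acc = accuracy(train, train)  # scored on the exams it memorised
test_acc = accuracy(train, test)    # scored on an unseen exam
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Scored on its own training set, the memoriser finds every point it has already seen and gets everything right. On the held-out test set it has also memorised the noise, so its score should drop, which is the gap the test metrics are there to expose.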
Initially, the error on the training set should be low and the error on the testing set should be high. As the amount of data consumed increases, the error on the training set increases and the error on the testing set decreases. This is because, although the training data fed into the algorithm starts small and grows, the model is always evaluated against the full testing set. To understand this, let’s look at the extremes. Let’s say that we only feed one data point from the training set, a single journey, into the algorithm. When tested against that training data it will have 100 percent accuracy, because all it has to do is fit the weights to one outcome. However, when you test the model against all the data in the testing set, it will have terrible accuracy. As more data is fed into the algorithm, it has to optimise the weights to accommodate more outcomes.
Anyone who has ever thrown a party knows that the more people you invite, the harder it is to please everyone. If you have one person round, you can overfit to their every desire. You can watch that strange French movie they love and order or cook food that satisfies every quirk they have. Then, at the last minute, your flatmate calls up 20 random friends and invites them round. Whilst your initial friend is completely satisfied, it’s unlikely that the 20 random guests will leave the party happy that they sat through a subtitled black and white French movie whilst eating quirky food.
Later on you throw another party and invite 20 people. Your food has to be a bit more general, but your entertainment can still be niche because you can overfit to your friendship group. It’s unlikely to be that French film, though. Your flatmate’s 20 last-minute friends will be a bit more satisfied, but not as much as your own friends, because there is some selection bias in whom you chose to invite.
Now, finally, you throw another party and really push the boat out. Because you’re inviting 60 friends, you can’t be so picky.
You have to invite people from work, old school friends, and people you’re not so close with to make up the numbers. Your food now has to be very generic, and so does the entertainment. There is a very high chance that the 20 last-minute friends your flatmate invites will be just as satisfied with the party as the 60 people you originally invited.
Generally, it’s better for your training set to be bigger than your testing set.
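The three parties can be replayed as a small learning curve. What follows is only a sketch under stated assumptions: the data is invented (a noisy straight line, `y = 3x + 2`), the model is a least squares line, and `make_data`, `fit_line` and `mse` are hypothetical helper names. The smallest party here has two guests rather than one, because a line has two weights and needs two points before it can fit them exactly.

```python
import random

def make_data(n, rng):
    """Hypothetical data: y = 3x + 2 plus Gaussian noise."""
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        rows.append((x, 3 * x + 2 + rng.gauss(0, 1)))
    return rows

def fit_line(rows):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x

def mse(model, rows):
    """Mean squared error of the fitted line on the given rows (the cost function)."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in rows) / len(rows)

rng = random.Random(42)
train = make_data(300, rng)  # the guests you get to plan for
test = make_data(100, rng)   # the flatmate's friends: always the full, fixed group

sizes = [2, 5, 20, 100, 300]
train_errors, test_errors = [], []
for size in sizes:
    subset = train[:size]            # only this many guests are invited
    model = fit_line(subset)         # the party is tuned to exactly these guests
    train_errors.append(mse(model, subset))
    test_errors.append(mse(model, test))

for size, tr, te in zip(sizes, train_errors, test_errors):
    print(f"{size:>3} training points: train MSE {tr:.3f}, test MSE {te:.3f}")
```

With two training points the line passes through both, so the training error is essentially zero while the test error is whatever those two guests happened to dictate. As the subset grows, the training error should rise towards the noise level and the test error should settle near the same value. Plotting `train_errors` and `test_errors` against `sizes` gives the learning curve described earlier.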