With the 2018 FIFA World Cup rapidly approaching, our data science team figured it was time to seek out an answer to the question everyone has been asking: which team will win the final? Inspired by a blog post written by Carlos Iglesias Fernández, we decided to put our own spin on his approach, using neural networks to predict the tournament’s results. In the next sections, we’ll detail the tools and data we used and tell you about our findings.
To make our predictions, we used the International Football Results from 1872 to 2018 dataset from Kaggle. We filtered out the teams that aren’t taking part in this year’s World Cup, then transformed the categorical values using one-hot encoding so that our model, a stateful LSTM (Long Short-Term Memory) neural network, would not assume any undue dependencies. In other words, this guarantees that values like team names aren’t treated as having any kind of sequential order.
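To make the encoding step concrete, here is a minimal sketch of one-hot encoding in plain Python. It is an illustration only, not the code used for the project: the function name `one_hot_encode` and the sample team list are assumptions made for this example.

```python
def one_hot_encode(teams):
    """Map each distinct team name to a one-hot vector."""
    # Sorting only fixes a reproducible column order; it carries no meaning.
    vocabulary = sorted(set(teams))
    index = {team: i for i, team in enumerate(vocabulary)}
    vectors = {}
    for team in vocabulary:
        vector = [0] * len(vocabulary)
        vector[index[team]] = 1  # exactly one position is switched on
        vectors[team] = vector
    return vectors


# Hypothetical sample of team names, for illustration only
encoded = one_hot_encode(["Brazil", "Germany", "Spain", "Brazil"])
print(encoded["Brazil"])   # [1, 0, 0]
print(encoded["Germany"])  # [0, 1, 0]
```

Every team ends up equally distant from every other team, so the model cannot read an ordering into the names the way it could if teams were numbered 1, 2, 3.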
We decided neural networks would be useful for our purpose because, much like the human brain, they identify patterns within a set of inputs and outputs, allowing them to predict an output for a given input based on past results. This ability gives neural networks practical use in many different fields, including object detection, translation, and time series forecasting. In theory, a neural network can find patterns in the outcomes of previous soccer games to help us generate predictions for the outcome of future games.
The representative example above shows this: the input runs through the network’s hidden layers, which apply functions shaped by patterns in the training data.
Recurrent neural networks are a special class of neural network in which output data can also serve as input data. Our data spans all the way back to 1872, and soccer teams have obviously changed a great deal since then, so we chose a stateful LSTM (Long Short-Term Memory) neural network, which focuses on finding long-term dependencies and patterns over time. In other words, the model we chose can take into account the fact that more recent games will have more bearing on the winners of future games.
Keras is one of the most widely used libraries for building neural networks due to its simplicity and efficiency. Given our data and time constraints, we decided it was the best tool for building our model. Our stateful LSTM neural network is composed of two stacked LSTM layers, each with 128 units. The output layer consists of three nodes, one for each outcome class: home_team (home win), away_team (away win), and draw. Here’s the code for building the model:
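from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

model = Sequential()
# Stateful LSTMs need a fixed batch shape: 1 sample, 1 timestep
model.add(LSTM(128, batch_input_shape=(1, 1, X_train.shape[2]),
               return_sequences=True, stateful=True))
model.add(Dropout(0.2))
model.add(LSTM(128, stateful=True))
model.add(Dropout(0.2))
model.add(Dense(3, activation="softmax"))  # home win, away win, draw
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])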
To train the model, we used the data from 1902 to 2014 for training and held out the data from 2014 to 2018 for validation. We also configured a checkpoint that saves the best model, and implemented early stopping and learning rate reduction to help with our training. Our best fit was a model with 53% accuracy on the validation set.
model.fit(x=X_train, y=y_train,
          validation_data=(X_test, y_test),
          epochs=30, batch_size=1, shuffle=False,
          callbacks=[earlyStopping, mcp_save, reduce_lr_loss])
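The split by date described above can be sketched in plain Python as follows. This is a rough illustration under stated assumptions: the helper `chronological_split`, the row layout, the placeholder team names, and the exact cutoff date are all hypothetical, since the post only says the boundary falls in 2014.

```python
from datetime import date


def chronological_split(matches, cutoff):
    """Split matches into (train, validation) by date, keeping time order."""
    ordered = sorted(matches, key=lambda m: m["date"])
    train = [m for m in ordered if m["date"] < cutoff]
    validation = [m for m in ordered if m["date"] >= cutoff]
    return train, validation


# Hypothetical rows with placeholder teams; not taken from the real dataset
matches = [
    {"date": date(2016, 6, 10), "home_team": "A", "away_team": "B"},
    {"date": date(1998, 7, 12), "home_team": "C", "away_team": "D"},
    {"date": date(2010, 7, 11), "home_team": "E", "away_team": "F"},
]

# Assumed cutoff date, chosen only for this example
train, validation = chronological_split(matches, cutoff=date(2014, 1, 1))
print(len(train), len(validation))  # 2 1
```

Keeping the rows in time order matches the `shuffle=False` setting in the training call: a stateful LSTM carries its state from one sample to the next, so the order of the games matters.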
In order to account for the distinction between home and away teams in our data, we calculated the prediction by averaging the results of two predictions generated by swapping the home and away positions for each pair of teams.
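Here is a minimal sketch of that averaging step. It is not the project's code: the class order (`HOME`, `AWAY`, `DRAW`), the function `symmetric_prediction`, and the stand-in predictor with made-up probabilities are all assumptions for illustration. The one detail worth noting is that when the teams swap positions, the home-win and away-win probabilities swap meaning too, so they have to be matched up per team before averaging.

```python
HOME, AWAY, DRAW = 0, 1, 2  # assumed order of the three output classes


def symmetric_prediction(predict, team_a, team_b):
    """Average two predictions, one for each home/away assignment."""
    p_ab = predict(team_a, team_b)  # team_a listed as the home team
    p_ba = predict(team_b, team_a)  # team_b listed as the home team
    return {
        team_a: (p_ab[HOME] + p_ba[AWAY]) / 2,
        team_b: (p_ab[AWAY] + p_ba[HOME]) / 2,
        "draw": (p_ab[DRAW] + p_ba[DRAW]) / 2,
    }


# Stand-in for the trained model: made-up probabilities, not real output
TOY_OUTPUT = {
    ("Team A", "Team B"): [0.6, 0.2, 0.2],
    ("Team B", "Team A"): [0.4, 0.3, 0.3],
}


def toy_predict(home, away):
    return TOY_OUTPUT[(home, away)]


result = symmetric_prediction(toy_predict, "Team A", "Team B")
print(result)
```

With this scheme the result no longer depends on which team happens to be listed first, which suits World Cup games, where most teams are not really playing at home.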
After training the model, we decided to run it on the 2014 World Cup knockout stage games, just to see how many results it would get right. To our surprise, it correctly predicted the outcome of 75% of the games, although it has a clear bias toward Brazil. The image below shows the results for the knockout stage of the 2014 World Cup. The green marks indicate the games for which we got the right prediction.
The knockout results of the 2014 World Cup. The green marks show the ones we got right.
Going back to the 2018 World Cup, we ran our model to calculate the results of the group phase. The predicted 1st- and 2nd-place teams in each group are as follows:
Group | A | B | C | D |
1st | Russia | Spain | France | Argentina |
2nd | Uruguay | Portugal | Denmark | Nigeria |

Group | E | F | G | H |
1st | Brazil | Germany | England | Colombia |
2nd | Serbia | Mexico | Belgium | Poland |
Then, we assembled the brackets and used the model to pit the teams against each other. These are the results our model predicted:
Brazil vs. Germany in the finals, and the victory goes to Brazil!
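The bracket step can be sketched as a simple single-elimination loop. This is an illustration only: `play_knockout`, the placeholder team names, and the made-up ratings are hypothetical, and the sketch assumes the winner-picking function always returns one of the two teams (the post does not say how predicted draws were resolved in the knockout rounds).

```python
def play_knockout(teams, pick_winner):
    """Run a single-elimination bracket; adjacent teams meet in each round."""
    n = len(teams)
    if n == 0 or n & (n - 1):
        raise ValueError("team count must be a power of two")
    rounds = []
    current = list(teams)
    while len(current) > 1:
        # Pair up neighbours: (0, 1), (2, 3), ...
        current = [pick_winner(current[i], current[i + 1])
                   for i in range(0, len(current), 2)]
        rounds.append(current)
    return current[0], rounds


# Made-up ratings for placeholder teams; a real run would query the model
ratings = {"T1": 4, "T2": 1, "T3": 3, "T4": 2}


def toy_pick_winner(a, b):
    return a if ratings[a] >= ratings[b] else b


champion, rounds = play_knockout(["T1", "T2", "T3", "T4"], toy_pick_winner)
print(champion)  # T1
print(rounds)    # [['T1', 'T3'], ['T1']]
```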
Although these findings represent an interesting application of deep learning and neural networks, we’re in no way insinuating that these will be the actual results of the 2018 World Cup. Our model is based only on historical data, and there are a lot of factors we simply couldn’t account for, like variability in individual players’ performance from game to game, changes in teams’ rosters, and injuries. Additionally, it is important to remember that soccer results are known for their unpredictability. As the Brazilian saying goes, soccer is a box of surprises!
During the World Cup, at the end of each phase, we’re gonna update our predictions with the real results, so stay connected with us!
Check out our video explaining a little bit more about this project: