With the World Cup fast approaching, all 32 nations that will compete in Russia are engaged in last-minute preparation. The deadline for submitting the squad is May 19, and there are no more friendlies in which to assess players. As a hardcore England fan, I started thinking about the pool of talent England has and what kind of strategy they might adopt at this year's tournament.

England manager Gareth Southgate tried out around 50 players during the World Cup qualifiers and has to cut the squad down to 23 by next month. With an average age of around 25, England has a relatively young and inexperienced squad. The majority of the players play in the Premier League, and 75% of them come from the league's top 6 clubs.

In this article, I analyze the attributes of these players. The dataset contains information on their all-around skills, for instance dribbling, crossing, finishing, and free kicks. It also covers pace, shot accuracy, acceleration, strength, agility, and so on. Close to 30 variables describe each player. All the players are initially classified into 4 categories according to the position they play for their respective clubs and England:
- Defense
- Midfield (including wings)
- Striker (including central attacking midfielders)
- Goalkeepers
What we are trying to do here is a cluster analysis: group these players according to their attributes and draw insights from the groups. We will perform K-means clustering, a simple unsupervised learning algorithm that groups the data into a user-specified number (k) of clusters. Even if k is not the right number, the algorithm will still split the data into k clusters. We use the elbow method to find the optimum number of clusters. This method calculates the sum of squared errors (SSE) between each data point and the center of its cluster. We want a small SSE, but the SSE always moves towards zero as k increases, so we choose a small value of k that still has a low SSE: the "elbow" where adding more clusters stops reducing the SSE by much.
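To make the SSE concrete, here is a minimal sketch that computes it by hand and compares it with the value `kmeans()` reports. The data (`toy`) and the helper `sse_for_k` are made up for illustration and are not part of the original analysis.

```r
# Minimal sketch: compute the within-cluster SSE by hand and compare it
# with kmeans()'s tot.withinss. `toy` is synthetic, not the player dataset.
set.seed(1)
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),  # 20 points around (0, 0)
  matrix(rnorm(40, mean = 5), ncol = 2)   # 20 points around (5, 5)
)

# Hypothetical helper: fit k-means and return both SSE values
sse_for_k <- function(data, k) {
  fit <- kmeans(data, centers = k, nstart = 10)
  # One row per data point, holding the center of its assigned cluster
  assigned_centers <- fit$centers[fit$cluster, , drop = FALSE]
  manual_sse <- sum((data - assigned_centers)^2)
  c(manual = manual_sse, reported = fit$tot.withinss)
}

sse_k1 <- sse_for_k(toy, 1)
sse_k2 <- sse_for_k(toy, 2)
print(rbind(k1 = sse_k1, k2 = sse_k2))
```

On data with two well-separated groups like this, the SSE drops sharply from k = 1 to k = 2, which is the kind of bend the elbow method looks for.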
We used the R code below to produce the elbow plot.
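set.seed(123456)
# x is assumed to hold the numeric player attributes
wcss <- numeric(10)
# Total within-cluster sum of squares for k = 1..10
for (i in 1:10) wcss[i] <- sum(kmeans(x, i)$withinss)
plot(1:10, wcss, type = "b", main = "Elbow method", xlab = "Number of clusters", ylab = "WCSS")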
From the above graph, we choose k=4. With the number of clusters fixed at 4, we first perform dimensionality reduction using principal component analysis (PCA). PCA reduces the many variables to a smaller number of components while retaining as much of the variance in the dataset as possible. The code below implements it.
library(caret)
library(e1071)
# Project the attributes onto the first 2 principal components
pca <- preProcess(x, method = "pca", pcaComp = 2)
x <- predict(pca, x)
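How much variance two components actually retain can be checked directly. The sketch below uses base R's `prcomp()` instead of caret, on synthetic data (`toy`, with made-up column names), so the numbers it prints say nothing about the player dataset; it only shows how the check could be done.

```r
# Sketch: share of variance kept by the first two principal components.
# `toy` is synthetic data with 6 correlated columns, not the player dataset.
set.seed(42)
factors  <- matrix(rnorm(100 * 2), ncol = 2)   # two hidden factors
loadings <- matrix(runif(2 * 6), nrow = 2)     # how factors map to 6 attributes
toy <- factors %*% loadings + matrix(rnorm(100 * 6, sd = 0.3), ncol = 6)
colnames(toy) <- paste0("attr", 1:6)

# Center and scale before PCA so no attribute dominates because of its units
pc <- prcomp(toy, center = TRUE, scale. = TRUE)

var_share <- pc$sdev^2 / sum(pc$sdev^2)   # proportion of variance per component
cum_share <- cumsum(var_share)            # cumulative proportion
print(round(cum_share, 3))

reduced <- pc$x[, 1:2]   # two-component representation, ready for clustering
```

If the cumulative share for the first two components were low on the real data, keeping more components before clustering would be worth considering.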
Having chosen the number of clusters and reduced the dimensionality of the dataset, we now run the K-means cluster analysis with the code below.
cl <- kmeans(x, 4, iter.max = 300, nstart = 10)
# Attach each player's cluster label to the original dataset
dataset$cluster <- cl$cluster
library(cluster)
clusplot(x, cl$cluster, lines = 0, shade = TRUE, color = TRUE, labels = 2, plotchar = FALSE, span = FALSE, main = "Cluster of Players")
> table(cl$cluster)
 1  2  3  4
15  5 16 10
The table shows how many players fall into each cluster.
Cluster 1: Mainly comprises the strikers and CAMs, that is, players who can attack and score goals, with high finishing ability. Our analysis puts 15 players in this cluster. When we initially tagged each player by position for club and country, only 9 fell into this category, so the clustering adds 6 more players capable of offensive play.
Cluster 2: Consists of all the goalkeepers. They were all classified into a single cluster, and the count (5) matches our initial observation.
Cluster 3: Consists of the players who can play in midfield. Our initial tagging gave 13 players in this position; the analysis adds 3 more players with similar attributes who could fit here, for a total of 16.
Cluster 4: Consists of the players who can play in defense. Our initial tagging gave 19 defenders, but the analysis shows that only 10 players in the total pool have the attributes of a defender.
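The comparison above between initial position tags and assigned clusters can be sketched as a cross-tabulation. The data frame below is made up for illustration, and `position` is a hypothetical column name for the initial tag; with the real data, the equivalent call would be something like `table(dataset$position, cl$cluster)`, assuming such a column exists.

```r
# Sketch: cross-tabulate the initial position tag against the assigned cluster.
# `players` is a made-up example, not the England dataset.
players <- data.frame(
  position = c("Defense", "Defense", "Defense", "Midfield",
               "Midfield", "Striker", "Striker", "Goalkeeper"),
  cluster  = c(4, 4, 3, 3, 1, 1, 1, 2)
)

# Rows: initial tag, columns: cluster number
crosstab <- table(players$position, players$cluster)
print(crosstab)

# Share of each initial position that lands in each cluster (rows sum to 1)
row_share <- prop.table(crosstab, margin = 1)
print(round(row_share, 2))
```

A table like this shows not only how many players each cluster gained or lost, but also which positions the reclassified players came from.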
The cluster analysis tells us that this England team consists of players with stronger offensive than defensive traits. Manager Gareth Southgate is trying out players who can fill a double role. For example, a player who is tagged as a defender can also play in midfield or on the wings. Ashley Young turned out to be an interesting data point. He plays in defense for both club and country, but our analysis groups him with the likes of Harry Kane and Sterling (Cluster 1, strikers and CAMs). Also, if we compare the initial tags with the results obtained, nearly 50% of the defenders have been grouped under the cluster of midfielders.
Each department requires a different skill set. A striker needs more pace, shooting power, finishing, etc., whereas a defender needs strength, headed clearances, etc. It appears that England is developing a system to play a vibrant, exciting counter-attacking game, which is supported by the data: 31 out of 46 players are grouped under Clusters 1 and 3 (strikers and midfielders). The dataset also shows that the pace and strength of this team are quite good.
From this analysis, we could conclude that England is going to prepare a team with a possession-based, counter-attacking strategy. Having defenders who can press forward and help feed the ball to midfielders or strikers is going to be an added advantage. Overall, we could say that England has a good team with a lot of pace and agility, which allows them to be flexible.