Initialize Notebook

Load required packages.

library(tidyverse)
library(ggthemr)
library(ggrepel)

Set up workspace, i.e., remove all existing data from working memory, initialize the random number generator, turn of scientific notation of large numbers, set a standard theme for plotting. Finally, we activate parallel processing on multiple CPU cores.

rm(list=ls())
set.seed(42)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
options(scipen=10000)
ggthemr('fresh')

Problem Description

Pokémon, also known as Pocket Monsters in Japan, is a Japanese media franchise managed by The Pokémon Company, a company founded by Nintendo, Game Freak, and Creatures. The franchise was created by Satoshi Tajiri in 1995 and is centered on fictional creatures called “Pokémon”, which humans, known as Pokémon Trainers, catch and train to battle each other for sport.

The dataset used in this tutorial is from Kaggle: https://www.kaggle.com/rounakbanik/pokemon

Prepare Data

Load data

Read data from CSV file.

data <- read_csv("pokemon.csv")
Rows: 780 Columns: 8
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): name
dbl (7): base_total, attack, defense, speed, capture_rate, height_m, weight_kg

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Prepare data

Remove some variables and standardize (i.e., center and scale) remaining numerical variables.

data_prep <- data %>%
  select(-name) %>% 
  scale()

Clustering

K-Means Clustering

Perform k-means clustering with k=5. Extract cluster assignments and cluster centroids.

kmeans_cluster_model <- kmeans(data_prep, centers=5, nstart=20)

kmeans_clusters <- kmeans_cluster_model$cluster
data_prep_w_kmeans_clusters <- cbind(data, kmeans_clusters)

centroids <- t(kmeans_cluster_model$centers)

Hierarchical Clustering

Perform hierarchical clustering and plot results as dendrogram.

distances <- dist(data_prep, method="euclidean")
h_cluster_model <- hclust(distances)

plot(h_cluster_model, labels=as.data.frame(data)[,1])

Cut the dendrogram at k=5 and extract cluster assignments.

plot(h_cluster_model, labels=as.data.frame(data)[,1])
rect.hclust(h_cluster_model, k=5, border="red") 


h_clusters <- cutree(h_cluster_model, k=5)
data_prep_w_h_clusters <- cbind(data, h_clusters)

Calculate summary statistics per cluster to interpret them. Note: calculating the mean of each variable per cluster is basically the same as looking at the cluster’s centroid.

data_prep_w_h_clusters_agg <- data_prep_w_h_clusters %>% 
  group_by(h_clusters) %>%
  summarise(
    n = n(),
    attack = mean(attack),
    defense = mean(defense),
    height_m = mean(height_m),
    weight_kg = mean(weight_kg)
  )
  

Principal Components Analysis (PCA)

Estimate PCA model and show results.

pca_model <- prcomp(data_prep)
pca_model
Standard deviations (1, .., p=7):
[1] 1.9166724 1.0859646 0.8881759 0.7084026 0.6271856 0.5929397 0.3337936

Rotation (n x k) = (7 x 7):
                    PC1        PC2        PC3         PC4         PC5         PC6         PC7
base_total    0.4855330 -0.1764071  0.1140251 -0.03834291 -0.14663083 -0.02420832 -0.83461349
attack        0.4094367 -0.1164073  0.1372341 -0.69643895  0.51471241  0.02337807  0.22243002
defense       0.3592070  0.2898041  0.6128269 -0.01328290 -0.53923184 -0.10787536  0.32991301
speed         0.2517701 -0.6913699 -0.3643265 -0.03409929 -0.45008822  0.14199370  0.31934510
capture_rate -0.3951548  0.2054898 -0.2010699 -0.71187615 -0.46674431 -0.04991278 -0.18463012
height_m      0.3659557  0.3127230 -0.5324700  0.06298111 -0.00145025 -0.68734525  0.09134689
weight_kg     0.3378319  0.5033184 -0.3686749  0.03780382 -0.04866010  0.70152216  0.02624489
summary(pca_model)
Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
Standard deviation     1.9167 1.0860 0.8882 0.70840 0.62719 0.59294 0.33379
Proportion of Variance 0.5248 0.1685 0.1127 0.07169 0.05619 0.05023 0.01592
Cumulative Proportion  0.5248 0.6933 0.8060 0.87766 0.93386 0.98408 1.00000

Create scree plot.

plot(pca_model)

Use predict function to extract principal components for each pokemon.

pcomps <- predict(pca_model, newdata = data_prep)

Attach principal components and cluster assignments to the original dataset and visualize them using a scatterplot.

data_w_pc <- data %>% 
  cbind(pcomps[,1:7]) %>% 
  cbind(cluster = as.factor(data_prep_w_h_clusters[,9]))

ggplot(data=data_w_pc) +
  geom_point(mapping=aes(x=PC1, y=PC2, color=cluster)) +
  geom_text_repel(mapping=aes(x=PC1, y=PC2, label=name)) 

From Bulbapedia: Wailord is a huge Pokémon based on the blue whale. It has small beady eyes, a huge mouth, and a throat that is lined with grooves. It has a blue back with and four white spots, and a tan, grooved underbelly. It has two pairs of fins along its sides and a horizontal tail at the back.

When it is jumping out of the water, it makes a giant splash due to its large size. It can create shockwaves by breaching and crashing onto the water, which can knock out the opponents. It can dive deep at 10,000 feet (3,000 meters) in one breath. It lives in the sea in large groups called pods. A pod of Wailord travels together in order to search for food and is able to eat large quantities at one time, such as schools of Wishiwashi. When a Wailmer is attacked by its predators Sharpedo or Dhelmise, its whole pod works to defend it. Wailord is popular due to its immense size. As a result, Wailord-watching is a favorite sightseeing activity in various parts of the world. However, they are known to be driven away by fishermen, so they don’t eat too much fish Pokémon.

LS0tCnRpdGxlOiAiRGF0YSBTY2llbmNlIGZvciBCdXNpbmVzcyAtIFdlZWsgMTI6IFVuc3VwZXJ2aXNlZCBMZWFybmluZyIKYXV0aG9yOiAiT2xpdmVyIE11ZWxsZXIiCm91dHB1dDogaHRtbF9ub3RlYm9vawotLS0KCiMjIEluaXRpYWxpemUgTm90ZWJvb2sKCkxvYWQgcmVxdWlyZWQgcGFja2FnZXMuCgpgYGB7ciBsb2FkIHBhY2thZ2VzLCB3YXJuaW5nPUZBTFNFLCBtZXNzYWdlPUZBTFNFfQpsaWJyYXJ5KHRpZHl2ZXJzZSkKbGlicmFyeShnZ3RoZW1yKQpsaWJyYXJ5KGdncmVwZWwpCgpgYGAKClNldCB1cCB3b3Jrc3BhY2UsIGkuZS4sIHJlbW92ZSBhbGwgZXhpc3RpbmcgZGF0YSBmcm9tIHdvcmtpbmcgbWVtb3J5LCBpbml0aWFsaXplIHRoZSByYW5kb20gbnVtYmVyIGdlbmVyYXRvciwgdHVybiBvZiBzY2llbnRpZmljIG5vdGF0aW9uIG9mIGxhcmdlIG51bWJlcnMsIHNldCBhIHN0YW5kYXJkIHRoZW1lIGZvciBwbG90dGluZy4gRmluYWxseSwgd2UgYWN0aXZhdGUgcGFyYWxsZWwgcHJvY2Vzc2luZyBvbiBtdWx0aXBsZSBDUFUgY29yZXMuCgpgYGB7ciBzZXR1cH0Kcm0obGlzdD1scygpKQpzZXQuc2VlZCg0MikKa25pdHI6Om9wdHNfY2h1bmskc2V0KGVjaG8gPSBUUlVFLCB3YXJuaW5nID0gRkFMU0UpCm9wdGlvbnMoc2NpcGVuPTEwMDAwKQpnZ3RoZW1yKCdmcmVzaCcpCgpgYGAKCiMjIFByb2JsZW0gRGVzY3JpcHRpb24KCiFbXShwb2tlbW9uLnBuZykKClBva8OpbW9uLCBhbHNvIGtub3duIGFzIFBvY2tldCBNb25zdGVycyBpbiBKYXBhbiwgaXMgYSBKYXBhbmVzZSBtZWRpYSBmcmFuY2hpc2UgbWFuYWdlZCBieSBUaGUgUG9rw6ltb24gQ29tcGFueSwgYSBjb21wYW55IGZvdW5kZWQgYnkgTmludGVuZG8sIEdhbWUgRnJlYWssIGFuZCBDcmVhdHVyZXMuIFRoZSBmcmFuY2hpc2Ugd2FzIGNyZWF0ZWQgYnkgU2F0b3NoaSBUYWppcmkgaW4gMTk5NSBhbmQgaXMgY2VudGVyZWQgb24gZmljdGlvbmFsIGNyZWF0dXJlcyBjYWxsZWQgIlBva8OpbW9uIiwgd2hpY2ggaHVtYW5zLCBrbm93biBhcyBQb2vDqW1vbiBUcmFpbmVycywgY2F0Y2ggYW5kIHRyYWluIHRvIGJhdHRsZSBlYWNoIG90aGVyIGZvciBzcG9ydC4KClRoZSBkYXRhc2V0IHVzZWQgaW4gdGhpcyB0dXRvcmlhbCBpcyBmcm9tIEthZ2dsZTogPGh0dHBzOi8vd3d3LmthZ2dsZS5jb20vcm91bmFrYmFuaWsvcG9rZW1vbj4KCiMjIFByZXBhcmUgRGF0YQoKIyMjIExvYWQgZGF0YQoKUmVhZCBkYXRhIGZyb20gQ1NWIGZpbGUuCgpgYGB7cn0KZGF0YSA8LSByZWFkX2NzdigicG9rZW1vbi5jc3YiKQoKYGBgCgojIyMgUHJlcGFyZSBkYXRhCgpSZW1vdmUgc29tZSB2YXJpYWJsZXMgYW5kIHN0YW5kYXJkaXplIChpLmUuLCBjZW50ZXIgYW5kIHNjYWxlKSByZW1haW5pbmcgbnVtZXJpY2FsIHZhcmlhYmxlcy4KCmBgYHtyfQpkYXRhX3ByZXAgPC0gZGF0YSAlPiUKICBzZWxlY3QoLW5hbWUpICU+JSAKICBzY2FsZSgpCgpgYGAKCiMjIENsdXN0ZXJpbmcKCiMjIyBLLU1lYW5zIENsdXN0ZXJpbmcKClBlcmZvcm0gay1tZWFucyBjbHVzdGVyaW5nIHdpdGggaz01LiBFeHRyYWN0IGNsdXN0ZXIgYXNzaWdubWVudHMgYW5kIGNsdXN0ZXIgY2VudHJvaWRzLgoKYGBge3J9CmttZWFuc19jbHVzdGVyX21vZGVsIDwtIGttZWFucyhkYXRhX3ByZXAsIGNlbnRlcnM9NSwgbnN0YXJ0PTIwKQoKa21lYW5zX2NsdXN0ZXJzIDwtIGttZWFuc19jbHVzdGVyX21vZGVsJGNsdXN0ZXIKZGF0YV9wcmVwX3dfa21lYW5zX2NsdXN0ZXJzIDwtIGNiaW5kKGRhdGEsIGttZWFuc19jbHVzdGVycykKCmNlbnRyb2lkcyA8LSB0KGttZWFuc19jbHVzdGVyX21vZGVsJGNlbnRlcnMpCgpgYGAKCiMjIyBIaWVyYXJjaGljYWwgQ2x1c3RlcmluZwoKUGVyZm9ybSBoaWVyYXJjaGljYWwgY2x1c3RlcmluZyBhbmQgcGxvdCByZXN1bHRzIGFzIGRlbmRyb2dyYW0uCgpgYGB7cn0KZGlzdGFuY2VzIDwtIGRpc3QoZGF0YV9wcmVwLCBtZXRob2Q9ImV1Y2xpZGVhbiIpCmhfY2x1c3Rlcl9tb2RlbCA8LSBoY2x1c3QoZGlzdGFuY2VzKQoKcGxvdChoX2NsdXN0ZXJfbW9kZWwsIGxhYmVscz1hcy5kYXRhLmZyYW1lKGRhdGEpWywxXSkKCmBgYAoKQ3V0IHRoZSBkZW5kcm9ncmFtIGF0IGs9NSBhbmQgZXh0cmFjdCBjbHVzdGVyIGFzc2lnbm1lbnRzLgoKYGBge3J9CnBsb3QoaF9jbHVzdGVyX21vZGVsLCBsYWJlbHM9YXMuZGF0YS5mcmFtZShkYXRhKVssMV0pCnJlY3QuaGNsdXN0KGhfY2x1c3Rlcl9tb2RlbCwgaz01LCBib3JkZXI9InJlZCIpIAoKaF9jbHVzdGVycyA8LSBjdXRyZWUoaF9jbHVzdGVyX21vZGVsLCBrPTUpCmRhdGFfcHJlcF93X2hfY2x1c3RlcnMgPC0gY2JpbmQoZGF0YSwgaF9jbHVzdGVycykKCmBgYAoKQ2FsY3VsYXRlIHN1bW1hcnkgc3RhdGlzdGljcyBwZXIgY2x1c3RlciB0byBpbnRlcnByZXQgdGhlbS4gTm90ZTogY2FsY3VsYXRpbmcgdGhlIG1lYW4gb2YgZWFjaCB2YXJpYWJsZSBwZXIgY2x1c3RlciBpcyBiYXNpY2FsbHkgdGhlIHNhbWUgYXMgbG9va2luZyBhdCB0aGUgY2x1c3RlcidzIGNlbnRyb2lkLgoKYGBge3J9CmRhdGFfcHJlcF93X2hfY2x1c3RlcnNfYWdnIDwtIGRhdGFfcHJlcF93X2hfY2x1c3RlcnMgJT4lIAogIGdyb3VwX2J5KGhfY2x1c3RlcnMpICU+JQogIHN1bW1hcmlzZSgKICAgIG4gPSBuKCksCiAgICBhdHRhY2sgPSBtZWFuKGF0dGFjayksCiAgICBkZWZlbnNlID0gbWVhbihkZWZlbnNlKSwKICAgIGhlaWdodF9tID0gbWVhbihoZWlnaHRfbSksCiAgICB3ZWlnaHRfa2cgPSBtZWFuKHdlaWdodF9rZykKICApCiAgCmBgYAoKIyMgUHJpbmNpcGFsIENvbXBvbmVudHMgQW5hbHlzaXMgKFBDQSkKCkVzdGltYXRlIFBDQSBtb2RlbCBhbmQgc2hvdyByZXN1bHRzLgoKYGBge3J9CnBjYV9tb2RlbCA8LSBwcmNvbXAoZGF0YV9wcmVwKQpwY2FfbW9kZWwKc3VtbWFyeShwY2FfbW9kZWwpCgpgYGAKCkNyZWF0ZSBzY3JlZSBwbG90LgoKYGBge3J9CnBsb3QocGNhX21vZGVsKQoKYGBgCgpVc2UgcHJlZGljdCBmdW5jdGlvbiB0byBleHRyYWN0IHByaW5jaXBhbCBjb21wb25lbnRzIGZvciBlYWNoIHBva2Vtb24uCgpgYGB7cn0KcGNvbXBzIDwtIHByZWRpY3QocGNhX21vZGVsLCBuZXdkYXRhID0gZGF0YV9wcmVwKQoKYGBgCgpBdHRhY2ggcHJpbmNpcGFsIGNvbXBvbmVudHMgYW5kIGNsdXN0ZXIgYXNzaWdubWVudHMgdG8gdGhlIG9yaWdpbmFsIGRhdGFzZXQgYW5kIHZpc3VhbGl6ZSB0aGVtIHVzaW5nIGEgc2NhdHRlcnBsb3QuCgpgYGB7cn0KZGF0YV93X3BjIDwtIGRhdGEgJT4lIAogIGNiaW5kKHBjb21wc1ssMTo3XSkgJT4lIAogIGNiaW5kKGNsdXN0ZXIgPSBhcy5mYWN0b3IoZGF0YV9wcmVwX3dfaF9jbHVzdGVyc1ssOV0pKQoKZ2dwbG90KGRhdGE9ZGF0YV93X3BjKSArCiAgZ2VvbV9wb2ludChtYXBwaW5nPWFlcyh4PVBDMSwgeT1QQzIsIGNvbG9yPWNsdXN0ZXIpKSArCiAgZ2VvbV90ZXh0X3JlcGVsKG1hcHBpbmc9YWVzKHg9UEMxLCB5PVBDMiwgbGFiZWw9bmFtZSkpIAoKYGBgCgpGcm9tIEJ1bGJhcGVkaWE6ICoqV2FpbG9yZCoqIGlzIGEgaHVnZSBQb2vDqW1vbiBiYXNlZCBvbiB0aGUgYmx1ZSB3aGFsZS4gSXQgaGFzIHNtYWxsIGJlYWR5IGV5ZXMsIGEgaHVnZSBtb3V0aCwgYW5kIGEgdGhyb2F0IHRoYXQgaXMgbGluZWQgd2l0aCBncm9vdmVzLiBJdCBoYXMgYSBibHVlIGJhY2sgd2l0aCBhbmQgZm91ciB3aGl0ZSBzcG90cywgYW5kIGEgdGFuLCBncm9vdmVkIHVuZGVyYmVsbHkuIEl0IGhhcyB0d28gcGFpcnMgb2YgZmlucyBhbG9uZyBpdHMgc2lkZXMgYW5kIGEgaG9yaXpvbnRhbCB0YWlsIGF0IHRoZSBiYWNrLgoKIVtdKHdhaWxvcmQucG5nKQoKV2hlbiBpdCBpcyBqdW1waW5nIG91dCBvZiB0aGUgd2F0ZXIsIGl0IG1ha2VzIGEgZ2lhbnQgc3BsYXNoIGR1ZSB0byBpdHMgbGFyZ2Ugc2l6ZS4gSXQgY2FuIGNyZWF0ZSBzaG9ja3dhdmVzIGJ5IGJyZWFjaGluZyBhbmQgY3Jhc2hpbmcgb250byB0aGUgd2F0ZXIsIHdoaWNoIGNhbiBrbm9jayBvdXQgdGhlIG9wcG9uZW50cy4gSXQgY2FuIGRpdmUgZGVlcCBhdCAxMCwwMDAgZmVldCAoMywwMDAgbWV0ZXJzKSBpbiBvbmUgYnJlYXRoLiBJdCBsaXZlcyBpbiB0aGUgc2VhIGluIGxhcmdlIGdyb3VwcyBjYWxsZWQgcG9kcy4gQSBwb2Qgb2YgV2FpbG9yZCB0cmF2ZWxzIHRvZ2V0aGVyIGluIG9yZGVyIHRvIHNlYXJjaCBmb3IgZm9vZCBhbmQgaXMgYWJsZSB0byBlYXQgbGFyZ2UgcXVhbnRpdGllcyBhdCBvbmUgdGltZSwgc3VjaCBhcyBzY2hvb2xzIG9mIFdpc2hpd2FzaGkuIFdoZW4gYSBXYWlsbWVyIGlzIGF0dGFja2VkIGJ5IGl0cyBwcmVkYXRvcnMgU2hhcnBlZG8gb3IgRGhlbG1pc2UsIGl0cyB3aG9sZSBwb2Qgd29ya3MgdG8gZGVmZW5kIGl0LiAqKldhaWxvcmQgaXMgcG9wdWxhciBkdWUgdG8gaXRzIGltbWVuc2Ugc2l6ZS4qKiBBcyBhIHJlc3VsdCwgV2FpbG9yZC13YXRjaGluZyBpcyBhIGZhdm9yaXRlIHNpZ2h0c2VlaW5nIGFjdGl2aXR5IGluIHZhcmlvdXMgcGFydHMgb2YgdGhlIHdvcmxkLiBIb3dldmVyLCB0aGV5IGFyZSBrbm93biB0byBiZSBkcml2ZW4gYXdheSBieSBmaXNoZXJtZW4sIHNvIHRoZXkgZG9uJ3QgZWF0IHRvbyBtdWNoIGZpc2ggUG9rw6ltb24uCg==