Initialize Notebook
Load required packages.
library(tidyverse)
library(ggthemr)
library(ggrepel)
Set up workspace, i.e., remove all existing data from working memory, initialize the random number generator, turn of scientific notation of large numbers, set a standard theme for plotting. Finally, we activate parallel processing on multiple CPU cores.
rm(list=ls())
set.seed(42)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
options(scipen=10000)
ggthemr('fresh')
Problem Description

Pokémon, also known as Pocket Monsters in Japan, is a Japanese media franchise managed by The Pokémon Company, a company founded by Nintendo, Game Freak, and Creatures. The franchise was created by Satoshi Tajiri in 1995 and is centered on fictional creatures called “Pokémon”, which humans, known as Pokémon Trainers, catch and train to battle each other for sport.
The dataset used in this tutorial is from Kaggle: https://www.kaggle.com/rounakbanik/pokemon
Prepare Data
Load data
Read data from CSV file.
data <- read_csv("pokemon.csv")
Rows: 780 Columns: 8
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): name
dbl (7): base_total, attack, defense, speed, capture_rate, height_m, weight_kg
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Prepare data
Remove some variables and standardize (i.e., center and scale) remaining numerical variables.
data_prep <- data %>%
select(-name) %>%
scale()
Clustering
K-Means Clustering
Perform k-means clustering with k=5. Extract cluster assignments and cluster centroids.
kmeans_cluster_model <- kmeans(data_prep, centers=5, nstart=20)
kmeans_clusters <- kmeans_cluster_model$cluster
data_prep_w_kmeans_clusters <- cbind(data, kmeans_clusters)
centroids <- t(kmeans_cluster_model$centers)
Hierarchical Clustering
Perform hierarchical clustering and plot results as dendrogram.
distances <- dist(data_prep, method="euclidean")
h_cluster_model <- hclust(distances)
plot(h_cluster_model, labels=as.data.frame(data)[,1])

Cut the dendrogram at k=5 and extract cluster assignments.
plot(h_cluster_model, labels=as.data.frame(data)[,1])
rect.hclust(h_cluster_model, k=5, border="red")

h_clusters <- cutree(h_cluster_model, k=5)
data_prep_w_h_clusters <- cbind(data, h_clusters)
Calculate summary statistics per cluster to interpret them. Note: calculating the mean of each variable per cluster is basically the same as looking at the cluster’s centroid.
data_prep_w_h_clusters_agg <- data_prep_w_h_clusters %>%
group_by(h_clusters) %>%
summarise(
n = n(),
attack = mean(attack),
defense = mean(defense),
height_m = mean(height_m),
weight_kg = mean(weight_kg)
)
Principal Components Analysis (PCA)
Estimate PCA model and show results.
pca_model <- prcomp(data_prep)
pca_model
Standard deviations (1, .., p=7):
[1] 1.9166724 1.0859646 0.8881759 0.7084026 0.6271856 0.5929397 0.3337936
Rotation (n x k) = (7 x 7):
PC1 PC2 PC3 PC4 PC5 PC6 PC7
base_total 0.4855330 -0.1764071 0.1140251 -0.03834291 -0.14663083 -0.02420832 -0.83461349
attack 0.4094367 -0.1164073 0.1372341 -0.69643895 0.51471241 0.02337807 0.22243002
defense 0.3592070 0.2898041 0.6128269 -0.01328290 -0.53923184 -0.10787536 0.32991301
speed 0.2517701 -0.6913699 -0.3643265 -0.03409929 -0.45008822 0.14199370 0.31934510
capture_rate -0.3951548 0.2054898 -0.2010699 -0.71187615 -0.46674431 -0.04991278 -0.18463012
height_m 0.3659557 0.3127230 -0.5324700 0.06298111 -0.00145025 -0.68734525 0.09134689
weight_kg 0.3378319 0.5033184 -0.3686749 0.03780382 -0.04866010 0.70152216 0.02624489
summary(pca_model)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 1.9167 1.0860 0.8882 0.70840 0.62719 0.59294 0.33379
Proportion of Variance 0.5248 0.1685 0.1127 0.07169 0.05619 0.05023 0.01592
Cumulative Proportion 0.5248 0.6933 0.8060 0.87766 0.93386 0.98408 1.00000
Create scree plot.
plot(pca_model)

Use predict function to extract principal components for each pokemon.
pcomps <- predict(pca_model, newdata = data_prep)
Attach principal components and cluster assignments to the original dataset and visualize them using a scatterplot.
data_w_pc <- data %>%
cbind(pcomps[,1:7]) %>%
cbind(cluster = as.factor(data_prep_w_h_clusters[,9]))
ggplot(data=data_w_pc) +
geom_point(mapping=aes(x=PC1, y=PC2, color=cluster)) +
geom_text_repel(mapping=aes(x=PC1, y=PC2, label=name))

From Bulbapedia: Wailord is a huge Pokémon based on the blue whale. It has small beady eyes, a huge mouth, and a throat that is lined with grooves. It has a blue back with and four white spots, and a tan, grooved underbelly. It has two pairs of fins along its sides and a horizontal tail at the back.

When it is jumping out of the water, it makes a giant splash due to its large size. It can create shockwaves by breaching and crashing onto the water, which can knock out the opponents. It can dive deep at 10,000 feet (3,000 meters) in one breath. It lives in the sea in large groups called pods. A pod of Wailord travels together in order to search for food and is able to eat large quantities at one time, such as schools of Wishiwashi. When a Wailmer is attacked by its predators Sharpedo or Dhelmise, its whole pod works to defend it. Wailord is popular due to its immense size. As a result, Wailord-watching is a favorite sightseeing activity in various parts of the world. However, they are known to be driven away by fishermen, so they don’t eat too much fish Pokémon.
LS0tCnRpdGxlOiAiRGF0YSBTY2llbmNlIGZvciBCdXNpbmVzcyAtIFdlZWsgMTI6IFVuc3VwZXJ2aXNlZCBMZWFybmluZyIKYXV0aG9yOiAiT2xpdmVyIE11ZWxsZXIiCm91dHB1dDogaHRtbF9ub3RlYm9vawotLS0KCiMjIEluaXRpYWxpemUgTm90ZWJvb2sKCkxvYWQgcmVxdWlyZWQgcGFja2FnZXMuCgpgYGB7ciBsb2FkIHBhY2thZ2VzLCB3YXJuaW5nPUZBTFNFLCBtZXNzYWdlPUZBTFNFfQpsaWJyYXJ5KHRpZHl2ZXJzZSkKbGlicmFyeShnZ3RoZW1yKQpsaWJyYXJ5KGdncmVwZWwpCgpgYGAKClNldCB1cCB3b3Jrc3BhY2UsIGkuZS4sIHJlbW92ZSBhbGwgZXhpc3RpbmcgZGF0YSBmcm9tIHdvcmtpbmcgbWVtb3J5LCBpbml0aWFsaXplIHRoZSByYW5kb20gbnVtYmVyIGdlbmVyYXRvciwgdHVybiBvZiBzY2llbnRpZmljIG5vdGF0aW9uIG9mIGxhcmdlIG51bWJlcnMsIHNldCBhIHN0YW5kYXJkIHRoZW1lIGZvciBwbG90dGluZy4gRmluYWxseSwgd2UgYWN0aXZhdGUgcGFyYWxsZWwgcHJvY2Vzc2luZyBvbiBtdWx0aXBsZSBDUFUgY29yZXMuCgpgYGB7ciBzZXR1cH0Kcm0obGlzdD1scygpKQpzZXQuc2VlZCg0MikKa25pdHI6Om9wdHNfY2h1bmskc2V0KGVjaG8gPSBUUlVFLCB3YXJuaW5nID0gRkFMU0UpCm9wdGlvbnMoc2NpcGVuPTEwMDAwKQpnZ3RoZW1yKCdmcmVzaCcpCgpgYGAKCiMjIFByb2JsZW0gRGVzY3JpcHRpb24KCiFbXShwb2tlbW9uLnBuZykKClBva8OpbW9uLCBhbHNvIGtub3duIGFzIFBvY2tldCBNb25zdGVycyBpbiBKYXBhbiwgaXMgYSBKYXBhbmVzZSBtZWRpYSBmcmFuY2hpc2UgbWFuYWdlZCBieSBUaGUgUG9rw6ltb24gQ29tcGFueSwgYSBjb21wYW55IGZvdW5kZWQgYnkgTmludGVuZG8sIEdhbWUgRnJlYWssIGFuZCBDcmVhdHVyZXMuIFRoZSBmcmFuY2hpc2Ugd2FzIGNyZWF0ZWQgYnkgU2F0b3NoaSBUYWppcmkgaW4gMTk5NSBhbmQgaXMgY2VudGVyZWQgb24gZmljdGlvbmFsIGNyZWF0dXJlcyBjYWxsZWQgIlBva8OpbW9uIiwgd2hpY2ggaHVtYW5zLCBrbm93biBhcyBQb2vDqW1vbiBUcmFpbmVycywgY2F0Y2ggYW5kIHRyYWluIHRvIGJhdHRsZSBlYWNoIG90aGVyIGZvciBzcG9ydC4KClRoZSBkYXRhc2V0IHVzZWQgaW4gdGhpcyB0dXRvcmlhbCBpcyBmcm9tIEthZ2dsZTogPGh0dHBzOi8vd3d3LmthZ2dsZS5jb20vcm91bmFrYmFuaWsvcG9rZW1vbj4KCiMjIFByZXBhcmUgRGF0YQoKIyMjIExvYWQgZGF0YQoKUmVhZCBkYXRhIGZyb20gQ1NWIGZpbGUuCgpgYGB7cn0KZGF0YSA8LSByZWFkX2NzdigicG9rZW1vbi5jc3YiKQoKYGBgCgojIyMgUHJlcGFyZSBkYXRhCgpSZW1vdmUgc29tZSB2YXJpYWJsZXMgYW5kIHN0YW5kYXJkaXplIChpLmUuLCBjZW50ZXIgYW5kIHNjYWxlKSByZW1haW5pbmcgbnVtZXJpY2FsIHZhcmlhYmxlcy4KCmBgYHtyfQpkYXRhX3ByZXAgPC0gZGF0YSAlPiUKICBzZWxlY3QoLW5hbWUpICU+JSAKICBzY2FsZSgpCgpgYGAKCiMjIENsdXN0ZXJpbmcKCiMjIyBLLU1lYW5zIENsdXN0ZXJpbmcKClBlcmZvcm0gay1tZWFucyBjbHVzdGVyaW5nIHdpdGggaz01LiBFeHRyYWN0IGNsdXN0ZXIgYXNzaWdubWVudHMgYW5kIGNsdXN0ZXIgY2VudHJvaWRzLgoKYGBge3J9CmttZWFuc19jbHVzdGVyX21vZGVsIDwtIGttZWFucyhkYXRhX3ByZXAsIGNlbnRlcnM9NSwgbnN0YXJ0PTIwKQoKa21lYW5zX2NsdXN0ZXJzIDwtIGttZWFuc19jbHVzdGVyX21vZGVsJGNsdXN0ZXIKZGF0YV9wcmVwX3dfa21lYW5zX2NsdXN0ZXJzIDwtIGNiaW5kKGRhdGEsIGttZWFuc19jbHVzdGVycykKCmNlbnRyb2lkcyA8LSB0KGttZWFuc19jbHVzdGVyX21vZGVsJGNlbnRlcnMpCgpgYGAKCiMjIyBIaWVyYXJjaGljYWwgQ2x1c3RlcmluZwoKUGVyZm9ybSBoaWVyYXJjaGljYWwgY2x1c3RlcmluZyBhbmQgcGxvdCByZXN1bHRzIGFzIGRlbmRyb2dyYW0uCgpgYGB7cn0KZGlzdGFuY2VzIDwtIGRpc3QoZGF0YV9wcmVwLCBtZXRob2Q9ImV1Y2xpZGVhbiIpCmhfY2x1c3Rlcl9tb2RlbCA8LSBoY2x1c3QoZGlzdGFuY2VzKQoKcGxvdChoX2NsdXN0ZXJfbW9kZWwsIGxhYmVscz1hcy5kYXRhLmZyYW1lKGRhdGEpWywxXSkKCmBgYAoKQ3V0IHRoZSBkZW5kcm9ncmFtIGF0IGs9NSBhbmQgZXh0cmFjdCBjbHVzdGVyIGFzc2lnbm1lbnRzLgoKYGBge3J9CnBsb3QoaF9jbHVzdGVyX21vZGVsLCBsYWJlbHM9YXMuZGF0YS5mcmFtZShkYXRhKVssMV0pCnJlY3QuaGNsdXN0KGhfY2x1c3Rlcl9tb2RlbCwgaz01LCBib3JkZXI9InJlZCIpIAoKaF9jbHVzdGVycyA8LSBjdXRyZWUoaF9jbHVzdGVyX21vZGVsLCBrPTUpCmRhdGFfcHJlcF93X2hfY2x1c3RlcnMgPC0gY2JpbmQoZGF0YSwgaF9jbHVzdGVycykKCmBgYAoKQ2FsY3VsYXRlIHN1bW1hcnkgc3RhdGlzdGljcyBwZXIgY2x1c3RlciB0byBpbnRlcnByZXQgdGhlbS4gTm90ZTogY2FsY3VsYXRpbmcgdGhlIG1lYW4gb2YgZWFjaCB2YXJpYWJsZSBwZXIgY2x1c3RlciBpcyBiYXNpY2FsbHkgdGhlIHNhbWUgYXMgbG9va2luZyBhdCB0aGUgY2x1c3RlcidzIGNlbnRyb2lkLgoKYGBge3J9CmRhdGFfcHJlcF93X2hfY2x1c3RlcnNfYWdnIDwtIGRhdGFfcHJlcF93X2hfY2x1c3RlcnMgJT4lIAogIGdyb3VwX2J5KGhfY2x1c3RlcnMpICU+JQogIHN1bW1hcmlzZSgKICAgIG4gPSBuKCksCiAgICBhdHRhY2sgPSBtZWFuKGF0dGFjayksCiAgICBkZWZlbnNlID0gbWVhbihkZWZlbnNlKSwKICAgIGhlaWdodF9tID0gbWVhbihoZWlnaHRfbSksCiAgICB3ZWlnaHRfa2cgPSBtZWFuKHdlaWdodF9rZykKICApCiAgCmBgYAoKIyMgUHJpbmNpcGFsIENvbXBvbmVudHMgQW5hbHlzaXMgKFBDQSkKCkVzdGltYXRlIFBDQSBtb2RlbCBhbmQgc2hvdyByZXN1bHRzLgoKYGBge3J9CnBjYV9tb2RlbCA8LSBwcmNvbXAoZGF0YV9wcmVwKQpwY2FfbW9kZWwKc3VtbWFyeShwY2FfbW9kZWwpCgpgYGAKCkNyZWF0ZSBzY3JlZSBwbG90LgoKYGBge3J9CnBsb3QocGNhX21vZGVsKQoKYGBgCgpVc2UgcHJlZGljdCBmdW5jdGlvbiB0byBleHRyYWN0IHByaW5jaXBhbCBjb21wb25lbnRzIGZvciBlYWNoIHBva2Vtb24uCgpgYGB7cn0KcGNvbXBzIDwtIHByZWRpY3QocGNhX21vZGVsLCBuZXdkYXRhID0gZGF0YV9wcmVwKQoKYGBgCgpBdHRhY2ggcHJpbmNpcGFsIGNvbXBvbmVudHMgYW5kIGNsdXN0ZXIgYXNzaWdubWVudHMgdG8gdGhlIG9yaWdpbmFsIGRhdGFzZXQgYW5kIHZpc3VhbGl6ZSB0aGVtIHVzaW5nIGEgc2NhdHRlcnBsb3QuCgpgYGB7cn0KZGF0YV93X3BjIDwtIGRhdGEgJT4lIAogIGNiaW5kKHBjb21wc1ssMTo3XSkgJT4lIAogIGNiaW5kKGNsdXN0ZXIgPSBhcy5mYWN0b3IoZGF0YV9wcmVwX3dfaF9jbHVzdGVyc1ssOV0pKQoKZ2dwbG90KGRhdGE9ZGF0YV93X3BjKSArCiAgZ2VvbV9wb2ludChtYXBwaW5nPWFlcyh4PVBDMSwgeT1QQzIsIGNvbG9yPWNsdXN0ZXIpKSArCiAgZ2VvbV90ZXh0X3JlcGVsKG1hcHBpbmc9YWVzKHg9UEMxLCB5PVBDMiwgbGFiZWw9bmFtZSkpIAoKYGBgCgpGcm9tIEJ1bGJhcGVkaWE6ICoqV2FpbG9yZCoqIGlzIGEgaHVnZSBQb2vDqW1vbiBiYXNlZCBvbiB0aGUgYmx1ZSB3aGFsZS4gSXQgaGFzIHNtYWxsIGJlYWR5IGV5ZXMsIGEgaHVnZSBtb3V0aCwgYW5kIGEgdGhyb2F0IHRoYXQgaXMgbGluZWQgd2l0aCBncm9vdmVzLiBJdCBoYXMgYSBibHVlIGJhY2sgd2l0aCBhbmQgZm91ciB3aGl0ZSBzcG90cywgYW5kIGEgdGFuLCBncm9vdmVkIHVuZGVyYmVsbHkuIEl0IGhhcyB0d28gcGFpcnMgb2YgZmlucyBhbG9uZyBpdHMgc2lkZXMgYW5kIGEgaG9yaXpvbnRhbCB0YWlsIGF0IHRoZSBiYWNrLgoKIVtdKHdhaWxvcmQucG5nKQoKV2hlbiBpdCBpcyBqdW1waW5nIG91dCBvZiB0aGUgd2F0ZXIsIGl0IG1ha2VzIGEgZ2lhbnQgc3BsYXNoIGR1ZSB0byBpdHMgbGFyZ2Ugc2l6ZS4gSXQgY2FuIGNyZWF0ZSBzaG9ja3dhdmVzIGJ5IGJyZWFjaGluZyBhbmQgY3Jhc2hpbmcgb250byB0aGUgd2F0ZXIsIHdoaWNoIGNhbiBrbm9jayBvdXQgdGhlIG9wcG9uZW50cy4gSXQgY2FuIGRpdmUgZGVlcCBhdCAxMCwwMDAgZmVldCAoMywwMDAgbWV0ZXJzKSBpbiBvbmUgYnJlYXRoLiBJdCBsaXZlcyBpbiB0aGUgc2VhIGluIGxhcmdlIGdyb3VwcyBjYWxsZWQgcG9kcy4gQSBwb2Qgb2YgV2FpbG9yZCB0cmF2ZWxzIHRvZ2V0aGVyIGluIG9yZGVyIHRvIHNlYXJjaCBmb3IgZm9vZCBhbmQgaXMgYWJsZSB0byBlYXQgbGFyZ2UgcXVhbnRpdGllcyBhdCBvbmUgdGltZSwgc3VjaCBhcyBzY2hvb2xzIG9mIFdpc2hpd2FzaGkuIFdoZW4gYSBXYWlsbWVyIGlzIGF0dGFja2VkIGJ5IGl0cyBwcmVkYXRvcnMgU2hhcnBlZG8gb3IgRGhlbG1pc2UsIGl0cyB3aG9sZSBwb2Qgd29ya3MgdG8gZGVmZW5kIGl0LiAqKldhaWxvcmQgaXMgcG9wdWxhciBkdWUgdG8gaXRzIGltbWVuc2Ugc2l6ZS4qKiBBcyBhIHJlc3VsdCwgV2FpbG9yZC13YXRjaGluZyBpcyBhIGZhdm9yaXRlIHNpZ2h0c2VlaW5nIGFjdGl2aXR5IGluIHZhcmlvdXMgcGFydHMgb2YgdGhlIHdvcmxkLiBIb3dldmVyLCB0aGV5IGFyZSBrbm93biB0byBiZSBkcml2ZW4gYXdheSBieSBmaXNoZXJtZW4sIHNvIHRoZXkgZG9uJ3QgZWF0IHRvbyBtdWNoIGZpc2ggUG9rw6ltb24uCg==