K-means clustering is a technique for partitioning observations into groups whose
members are similar to one another. The algorithm assigns observations to groups, or clusters, so as to
minimize within-group variability, which in turn maximizes between-group variability.
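To make the within-group and between-group idea concrete, here is a minimal sketch on made-up data (the two toy groups below are an assumption for illustration, not the lake data). It only uses base R's kmeans() and the sums of squares it reports.

```r
# Toy data (made up for illustration): two well-separated groups in 2-D.
set.seed(1)
x <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.5), ncol = 2)
)
km <- kmeans(x, centers = 2, nstart = 25)

# Total variability splits into a within-cluster and a between-cluster part.
km$totss          # total sum of squares
km$tot.withinss   # within-cluster sum of squares (what k-means minimizes)
km$betweenss      # between-cluster sum of squares

# The total is fixed by the data, so shrinking the within part grows the between part.
all.equal(km$totss, km$tot.withinss + km$betweenss)
```

Since the total sum of squares depends only on the data, a partition with a smaller within-cluster sum necessarily has a larger between-cluster sum.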
I wanted to see how 20 lakes compared to one another in terms of their chemistry, so I ran a k-means clustering on some basic water chemistry data: dissolved oxygen (DO),
temperature, and conductivity.
# Load packages and munge data
library(cluster)
library(factoextra)  # provides fviz_nbclust()
library(ggplot2)     # provides geom_vline()
library(reshape2)
setwd('C:/Users/kk/Documents/ponds')
dat <- read.csv("chem.csv")
str(dat)

# Reshape from long to wide: one row per site, one set of columns per sample date
dat2 <- reshape(dat, idvar = "site", timevar = "sampledate", direction = "wide")
rownames(dat2) <- dat2$site
dat3 <- dat2[2:13]  # keep the measurement columns (2 through 13)

# Describe data through basic statistics
desc_stats <- data.frame(
  Min  = apply(dat3, 2, min),     # minimum
  Med  = apply(dat3, 2, median),  # median
  Mean = apply(dat3, 2, mean),    # mean
  SD   = apply(dat3, 2, sd),      # standard deviation
  Max  = apply(dat3, 2, max)      # maximum
)

# Generate an elbow plot
set.seed(123)
fviz_nbclust(dat3, kmeans, method = "wss") +
  geom_vline(xintercept = 5, linetype = 2)

# Compute k-means clustering
set.seed(123)
km.res <- kmeans(dat3, 5, nstart = 25)
print(km.res)
From this plot it looks like we have 5 groups of lakes.
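For anyone curious what an elbow plot is built from, the y-axis is the total within-cluster sum of squares for each candidate number of clusters. Here is a rough sketch of computing that curve by hand in base R. The toy matrix is a made-up stand-in for dat3, and the range of k values is my own choice.

```r
set.seed(123)
# Toy stand-in for dat3 (made up): 20 rows, 3 columns, three loose groups.
toy <- rbind(
  matrix(rnorm(21, mean = 0), ncol = 3),
  matrix(rnorm(21, mean = 6), ncol = 3),
  matrix(rnorm(18, mean = 12), ncol = 3)
)

# Fit k-means for each k and keep the total within-cluster sum of squares.
ks <- 1:8
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

# The "elbow" is where adding another cluster stops giving a big drop.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Picking the elbow is a judgment call, so it is worth looking at the curve rather than relying on a single number.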
Lake 15 may be an outlier since it is not similar to any of the other lakes.
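One simple way to flag a candidate outlier like this is to look for clusters with only one member. Below is a small sketch on made-up data (the lake names and coordinates are hypothetical, not my measurements), where one point sits far away from two tight groups.

```r
set.seed(123)
# Made-up data: two tight groups of five lakes plus one lake far from both.
toy <- rbind(
  matrix(rnorm(10, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(10, mean = 10, sd = 0.1), ncol = 2),
  c(100, 100)
)
rownames(toy) <- paste0("lake", 1:11)

km <- kmeans(toy, centers = 3, nstart = 25)

# Clusters with a single member are candidates for a closer look.
singletons <- which(km$size == 1)
lone_lakes <- names(km$cluster)[km$cluster %in% singletons]
lone_lakes
```

A singleton cluster is only a hint. It could reflect a genuinely unusual lake, a data entry error, or simply a choice of k that is too large.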
From here we might be able to work out why some lakes are more similar to one another than others.
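A possible starting point is to compare the average chemistry of each cluster. The sketch below uses made-up values and placeholder column names (do, temp, cond), so treat it as an illustration of the idea rather than my actual results.

```r
set.seed(123)
# Made-up measurements for six lakes; column names are placeholders.
toy <- data.frame(
  do   = c(8.1, 7.9, 8.3, 3.2, 3.0, 3.4),
  temp = c(12, 13, 12, 22, 23, 21),
  cond = c(110, 120, 115, 480, 500, 470)
)

km <- kmeans(toy, centers = 2, nstart = 25)

# Mean of each variable within each cluster.
cluster_means <- aggregate(toy, by = list(cluster = km$cluster), FUN = mean)
cluster_means
```

Variables whose means differ a lot between clusters are the ones most likely driving the grouping. Note that variables on larger numeric scales (conductivity here) carry more weight in the distance calculation unless the data are standardized first.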