K-means clustering is a technique for partitioning data into groups whose members are similar to one another. The algorithm looks for groups, or clusters, that minimize within-group variability and, as a result, maximize between-group variability. I wanted to see how 20 lakes compared to one another in terms of their chemistry, so I ran a k-means clustering on some basic water chemistry data (dissolved oxygen (DO), temperature and conductivity).
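To make the within-group versus between-group idea concrete, here is a minimal sketch on made-up data (not the lake data; the matrix `x` and its column names are placeholders I invented). It checks by hand that the within-cluster sum of squares reported by `kmeans()` is the squared distance of each point to its cluster mean, and that the total sum of squares splits into a within part and a between part.

```r
# Synthetic stand-in for the lake data: 20 "lakes", 3 made-up variables
set.seed(42)
x <- matrix(rnorm(60), nrow = 20, ncol = 3,
            dimnames = list(paste0("lake", 1:20), c("DO", "temp", "cond")))

km <- kmeans(x, centers = 3, nstart = 25)

# Within-cluster sum of squares computed by hand:
# squared deviations of each point from its own cluster mean
wss_by_hand <- sum(sapply(1:3, function(k) {
  members <- x[km$cluster == k, , drop = FALSE]
  sum(sweep(members, 2, colMeans(members))^2)
}))

# Should agree with the value kmeans() reports
matches_kmeans <- isTRUE(all.equal(wss_by_hand, km$tot.withinss))

# Total variability = within-cluster part + between-cluster part
decomposes <- isTRUE(all.equal(km$totss, km$tot.withinss + km$betweenss))

print(c(matches_kmeans = matches_kmeans, decomposes = decomposes))
```

Because the total is fixed for a given data set, making the within-cluster part smaller automatically makes the between-cluster part larger.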
# Load packages and munge the data
library(cluster)
library(factoextra)
library(reshape2)

setwd('C:/Users/kk/Documents/ponds')
dat <- read.csv("chem.csv")
str(dat)

# Reshape from long to wide: one row per site, one set of columns per sample date
# (reshape() here is the base R function, not one from reshape2)
dat2 <- reshape(dat, idvar = "site", timevar = "sampledate", direction = "wide")
rownames(dat2) <- dat2$site
dat3 <- dat2[2:13]  # keep columns 2 to 13 (the measurements)

# Describe the data with basic statistics, one row per column of dat3
desc_stats <- data.frame(
  Min  = apply(dat3, 2, min),     # minimum
  Med  = apply(dat3, 2, median),  # median
  Mean = apply(dat3, 2, mean),    # mean
  SD   = apply(dat3, 2, sd),      # standard deviation
  Max  = apply(dat3, 2, max)      # maximum
)

# Generate an elbow plot to see how many clusters to use
set.seed(123)
fviz_nbclust(dat3, kmeans, method = "wss") +
  geom_vline(xintercept = 5, linetype = 2)
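For anyone curious what the elbow plot is built from, here is a rough base R sketch. It is my own approximation, not factoextra's internals: it runs `kmeans()` for a range of cluster counts and plots the total within-cluster sum of squares for each. The matrix `x` is random placeholder data, and the range 1 to 10 is just a choice I made for illustration.

```r
# Placeholder data standing in for dat3 (20 rows, 3 made-up variables)
set.seed(123)
x <- matrix(rnorm(60), nrow = 20, ncol = 3)

# Total within-cluster sum of squares for each candidate number of clusters
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is where adding more clusters stops reducing wss by much
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

With purely random data like this there may be no clear elbow; on real data the bend is what suggests a reasonable number of clusters.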
# Compute k-means clustering with the desired number of clusters
set.seed(123)
km.res <- kmeans(dat3, 5, nstart = 25)
print(km.res)
From this plot, it looks like we have 5 groups of lakes. Lake 15 may be an outlier, since it is not similar to any of the other lakes. From here, we might be able to work out why some lakes are more similar than others.
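One quick way to follow up on a possible outlier is to look at the cluster sizes: a cluster with a single member means that observation did not group with anything else. The sketch below uses invented data (two tight groups plus one extreme row; none of it is the real lake chemistry) and lists whichever rows end up alone in a cluster.

```r
# Made-up data: two tight groups of 10 "lakes" plus one extreme row
set.seed(123)
x <- rbind(
  matrix(rnorm(30, mean = 0, sd = 0.2), ncol = 3),
  matrix(rnorm(30, mean = 5, sd = 0.2), ncol = 3),
  c(50, 50, 50)
)
rownames(x) <- paste0("lake", 1:21)

km <- kmeans(x, centers = 3, nstart = 25)

# Number of lakes in each cluster
sizes <- table(km$cluster)
print(sizes)

# Row names of any lakes that sit in a cluster by themselves
singleton_ids <- as.integer(names(sizes)[sizes == 1])
singletons <- rownames(x)[km$cluster %in% singleton_ids]
print(singletons)
```

On the real data the same check would be `table(km.res$cluster)`, assuming `km.res` is the object fitted above.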