
Cluster Analysis

K-means clustering is a technique for partitioning observations into groups (clusters) whose members are similar to one another. The algorithm aims to minimize within-cluster variability, which is the same as maximizing between-cluster variability. I wanted to see how 20 lakes compared to one another in terms of their chemistry, so I ran a k-means clustering on some basic water chemistry data (dissolved oxygen (DO), temperature and conductivity).
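To make the within versus between idea concrete, here is a minimal sketch on simulated data. The matrix `x` and its two made-up groups are assumptions for illustration, not the lake data. `kmeans()` reports the total sum of squares split into a within-cluster part, which the algorithm tries to shrink, and a between-cluster part.

```r
# Simulated stand-in data: 20 points in two loose groups (not the lake data)
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))

km <- kmeans(x, centers = 2, nstart = 25)

km$tot.withinss   # variability within clusters (what k-means minimizes)
km$betweenss      # variability between clusters
km$totss          # total; equals tot.withinss + betweenss
```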

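# load the required packages, then munge the data
library(cluster)
library(factoextra)  # provides fviz_nbclust() for the elbow plot
library(reshape2)    # note: reshape() used below is the base R function, not from reshape2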

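# set the working directory and read in the chemistry data
setwd('C:/Users/kk/Documents/ponds')
dat <- read.csv("chem.csv")
str(dat)

# reshape from long to wide: one row per site, one column per variable and sample date
dat2 <- reshape(dat, idvar = "site", timevar = "sampledate", direction = "wide")
rownames(dat2) <- dat2$site

# keep columns 2 to 13 (the chemistry measurements)
dat3 <- dat2[, 2:13]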
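For readers unfamiliar with `reshape()`, the sketch below shows what the long-to-wide step does on a tiny made-up table. The data frame `toy`, its site names, dates and values are hypothetical; only the column names `site` and `sampledate` come from the code above.

```r
# Hypothetical long-format data: one row per site per sampling date
toy <- data.frame(
  site       = rep(c("A", "B"), each = 2),
  sampledate = rep(c("d1", "d2"), times = 2),
  DO         = c(8.1, 7.9, 6.5, 6.8),
  temp       = c(20, 22, 19, 21)
)

# One row per site; each variable gets one column per sampling date
wide <- reshape(toy, idvar = "site", timevar = "sampledate", direction = "wide")
rownames(wide) <- wide$site
names(wide)
```

Each lake then becomes a single row of numbers, which is the shape `kmeans()` expects.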

# describe the data with basic summary statistics (one row per column of dat3)
desc_stats <- data.frame(
  Min  = apply(dat3, 2, min),    # minimum
  Med  = apply(dat3, 2, median), # median
  Mean = apply(dat3, 2, mean),   # mean
  SD   = apply(dat3, 2, sd),     # standard deviation
  Max  = apply(dat3, 2, max)     # maximum
)

[Figure: table of summary statistics]

# generate an elbow plot to help choose the number of clusters
set.seed(123)
fviz_nbclust(dat3, kmeans, method = "wss") +
 geom_vline(xintercept = 5, linetype = 2)
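The elbow plot shows the total within-cluster sum of squares for each candidate number of clusters; the "elbow" is where adding more clusters stops paying off much. As a rough sketch of that calculation in base R, assuming a simulated matrix `x` in place of the lake data and a range of 1 to 10 clusters chosen for illustration:

```r
# Simulated stand-in for the lake matrix: 20 rows, 3 columns (not real data)
set.seed(123)
x <- matrix(rnorm(60), ncol = 3)

# Total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# Look for the k where the curve bends and flattens out
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```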

[Figure: elbow plot of within-cluster sum of squares against number of clusters]
# Compute k-means clustering with desired number of clusters
set.seed(123)
km.res <- kmeans(dat3, 5, nstart = 25)
print(km.res)
[Figure: cluster plot of the lakes]


From this plot it looks like we have 5 groups of lakes. Lake 15 may be an outlier, since it is not similar to any of the other lakes. From here we might be able to work out why some lakes are more similar than others.
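One way to check group membership without a plot is to list the lakes in each cluster and look for clusters that contain a single lake. This is only a sketch on simulated data: the matrix `x`, the row names `lake1` to `lake20` and the extreme twentieth row are made up for illustration and are not the real measurements.

```r
# Simulated stand-in: 19 similar lakes plus one extreme lake (not real data)
set.seed(123)
x <- matrix(rnorm(19 * 3), ncol = 3)
x <- rbind(x, c(15, 15, 15))
rownames(x) <- paste0("lake", 1:20)

km <- kmeans(x, centers = 5, nstart = 25)

# Lakes grouped by assigned cluster
members <- split(names(km$cluster), km$cluster)
members

# Clusters containing a single lake are outlier candidates
sizes <- table(km$cluster)
members[names(sizes)[sizes == 1]]
```

With the real data, the same two steps applied to `km.res` would show which lakes sit together and whether any lake ends up alone.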

 
