
Cluster Analysis

K-means clustering is a technique for partitioning data into groups whose
members are similar to one another.  The algorithm assigns observations to
clusters so as to minimize within-group variability, which in turn maximizes
between-group variability.
I wanted to see how 20 lakes compared to one another in terms of their chemistry, so I ran a k-means clustering on some basic water chemistry data
(dissolved oxygen (DO), temperature, and conductivity).
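
To make the within-group versus between-group idea concrete, here is a minimal sketch on made-up data (not the lake data). The object names `toy` and `km_toy` are just placeholders I picked. It shows that `kmeans()` reports the total sum of squares split into a within-cluster part and a between-cluster part.

```r
# Toy data (hypothetical, not the lake data): two separated groups, two variables
set.seed(123)
toy <- rbind(
  matrix(rnorm(20, mean = 0), ncol = 2),  # 10 points around (0, 0)
  matrix(rnorm(20, mean = 4), ncol = 2)   # 10 points around (4, 4)
)

km_toy <- kmeans(toy, centers = 2, nstart = 25)

# Total sum of squares and its two components
km_toy$totss         # total variability
km_toy$tot.withinss  # variability left inside the clusters
km_toy$betweenss     # variability between the cluster centers

# Share of the total variability accounted for by the grouping
km_toy$betweenss / km_toy$totss
```

Because the total is fixed for a given data set, making the within-cluster part smaller is the same thing as making the between-cluster part larger.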

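# Load packages and munge the data
library(cluster)
library(factoextra)

setwd('C:/Users/kk/Documents/ponds')
dat <- read.csv("chem.csv")
str(dat)

# Reshape from long to wide: one row per site, with one column per
# variable and sample date. reshape() comes with base R (stats package),
# so reshape2 does not need to be loaded for this step.
dat2 <- reshape(dat, idvar = "site", timevar = "sampledate", direction = "wide")
rownames(dat2) <- dat2$site

# Keep columns 2 to 13 (the measurements), dropping the site column
dat3 <- dat2[2:13]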
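
Since I can't share chem.csv here, this is a small sketch of what the reshape step does, using a made-up long-format table. The site names, dates and values are all invented, and the column names `DO`, `temp` and `cond` are my assumption about how a file like this might be laid out.

```r
# Hypothetical long-format data: 3 sites, 2 sample dates, 3 variables
long <- data.frame(
  site       = rep(c("A", "B", "C"), times = 2),
  sampledate = rep(c("d1", "d2"), each = 3),
  DO         = c(8.1, 7.4, 9.0, 7.9, 7.0, 8.8),
  temp       = c(18.2, 20.1, 17.5, 21.0, 22.3, 19.8),
  cond       = c(120, 340, 95, 125, 355, 99)
)

# One row per site; each variable gets one column per sample date
wide <- reshape(long, idvar = "site", timevar = "sampledate", direction = "wide")
rownames(wide) <- wide$site

# Drop the site column so only numeric columns are left for kmeans()
chem <- wide[-1]
names(chem)
```

With three variables, every sample date adds three columns to the wide table, so the number of columns to keep depends on how many dates are in the file.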

# Describe the data with basic statistics
desc_stats <- data.frame(
  Min  = apply(dat3, 2, min),    # minimum
  Med  = apply(dat3, 2, median), # median
  Mean = apply(dat3, 2, mean),   # mean
  SD   = apply(dat3, 2, sd),     # standard deviation
  Max  = apply(dat3, 2, max)     # maximum
)
desc_stats
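
One thing these statistics are handy for is spotting variables on very different scales. K-means works on Euclidean distances, so a variable with a much larger spread can dominate the result. I did not standardize the lake data in this post. The sketch below only shows how `scale()` could be used for that, on made-up numbers where the names and magnitudes are assumptions.

```r
# Hypothetical data with very different spreads per variable
set.seed(123)
toy_chem <- data.frame(
  DO   = rnorm(20, mean = 8,   sd = 1),
  temp = rnorm(20, mean = 20,  sd = 2),
  cond = rnorm(20, mean = 300, sd = 100)
)

# Spread before standardizing
apply(toy_chem, 2, sd)

# Center each column to mean 0 and rescale to standard deviation 1
toy_scaled <- scale(toy_chem)
apply(toy_scaled, 2, sd)
```

Whether to standardize is a judgment call that depends on whether the raw units should carry weight in the comparison.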


# Generate an elbow plot: total within-cluster sum of squares (WSS) against k
set.seed(123)
fviz_nbclust(dat3, kmeans, method = "wss") +
  geom_vline(xintercept = 5, linetype = 2)  # dashed line at the chosen k = 5

[Figure: elbow plot of total within-cluster sum of squares against number of clusters]
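
For anyone curious what goes into an elbow plot, here is a rough base-R sketch that computes the total within-cluster sum of squares for a range of k by hand. It uses made-up data with three groups, and the range of k and `nstart = 25` are my choices for the sketch, not necessarily what `fviz_nbclust()` does internally.

```r
# Hypothetical data: three groups of 10 points in two variables
set.seed(123)
toy_pts <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(20, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(20, mean = 8, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1, ..., 8
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy_pts, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is where adding another cluster stops reducing WSS by much
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Reading the elbow is subjective, so it is worth treating the chosen k as a reasonable guess rather than a hard answer.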
# Compute k-means clustering with k = 5, keeping the best of 25 random starts
set.seed(123)
km.res <- kmeans(dat3, 5, nstart = 25)
print(km.res)
[Figure: cluster plot of the lakes from the k-means result]


From this plot, it looks like the lakes fall into 5 groups.
Lake 15 may be an outlier, since it does not group with any of the other lakes.
From here, we can start to investigate why some lakes are more similar to each other than others.
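
As a possible next step, here is a small sketch of how a lake that sits in a cluster by itself could be flagged, and how cluster means could be compared. The data are made up (21 points with one deliberately isolated), and the names `lake1` to `lake21`, `var1` and `var2` are placeholders, not the real lakes or variables.

```r
# Hypothetical data: two tight groups plus one isolated point
set.seed(123)
toy_lakes <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),  # 10 points near (0, 0)
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2),  # 10 points near (5, 5)
  c(20, 20)                                         # one isolated point
)
rownames(toy_lakes) <- paste0("lake", 1:21)
colnames(toy_lakes) <- c("var1", "var2")

km_lakes <- kmeans(toy_lakes, centers = 3, nstart = 25)

# Cluster sizes: a size of 1 means an observation has a cluster to itself
km_lakes$size

# Names of observations that sit alone in their cluster
singletons <- which(km_lakes$size == 1)
flagged <- names(km_lakes$cluster)[km_lakes$cluster %in% singletons]
flagged

# Mean of each variable per cluster, to see what sets the groups apart
cluster_means <- aggregate(as.data.frame(toy_lakes),
                           by = list(cluster = km_lakes$cluster),
                           FUN = mean)
cluster_means
```

A lake in a cluster of its own is only a candidate outlier, so the raw measurements for that lake are worth checking before drawing conclusions.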

 
