In-class Ex5b: Correlation Matrix

Published

February 8, 2023

Modified

March 11, 2023

Correlation Matrix

Installing and launching R packages, and loading data

pacman::p_load(corrplot, ggstatsplot, tidyverse)
wine <- read_csv("data/wine_quality.csv")

Building a basic scatterplot matrix using the pairs() function

pairs(wine[,1:11])

To show only the lower half or only the upper half of the matrix, set upper.panel = NULL or lower.panel = NULL respectively

pairs(wine[,2:12], upper.panel = NULL)

pairs(wine[,2:12], lower.panel = NULL)

To show the correlation coefficient of each pair of variables instead of a scatter plot, define a panel.cor() function and pass it to pairs(). The function prints the absolute value of the coefficient and draws higher values in a larger font.

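panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
  # Save the plot coordinates and restore them when the panel is done
  usr <- par("usr")
  on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))
  # Absolute correlation, ignoring rows with missing values
  r <- abs(cor(x, y, use = "complete.obs"))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste(prefix, txt, sep = "")
  if (missing(cex.cor)) cex.cor <- 0.8 / strwidth(txt)
  # Scale the text so that stronger correlations are drawn larger
  text(0.5, 0.5, txt, cex = cex.cor * (1 + r) / 2)
}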

pairs(wine[,2:12], 
      upper.panel = panel.cor)

Building a correlation matrix using ggcorrmat() from the ggstatsplot package, which has better visuals. However, there is some conflict with the ggtern package.

ggstatsplot::ggcorrmat(data = wine,
                       cor.vars = 1:11)

ggstatsplot::ggcorrmat(
  data = wine, 
  cor.vars = 1:11,
  ggcorrplot.args = list(outline.color = "black", 
                         hc.order = TRUE,
                         tl.cex = 10),
  title    = "Correlogram for wine dataset",
  subtitle = "Four pairs are not significant at p < 0.05"
)

Building multiple plots with facets using grouped_ggcorrmat() function from ggstatsplot package

grouped_ggcorrmat(
  data = wine,
  cor.vars = 1:11,
  grouping.var = type,
  type = "robust",
  p.adjust.method = "holm",
  plotgrid.args = list(ncol = 2),
  ggcorrplot.args = list(outline.color = "black", 
                         hc.order = TRUE,
                         tl.cex = 10),
  annotation.args = list(
    tag_levels = "a",
    title = "Correlogram for wine dataset",
    subtitle = "The measures are: alcohol, sulphates, fixed acidity, citric acid, chlorides, residual sugar, density, free sulfur dioxide and volatile acidity",
    caption = "Dataset: UCI Machine Learning Repository"
  )
)
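The call above sets p.adjust.method = "holm". As a rough illustration of what a Holm adjustment does to a set of p-values, here is a small base R sketch using stats::p.adjust(). The p-values are made up for the example and are not taken from the wine data.

```r
# Hypothetical raw p-values from four pairwise tests (illustration only)
p.raw <- c(0.01, 0.02, 0.03, 0.04)

# Holm adjustment: the smallest p-value is multiplied by the number of tests,
# the next by one fewer, and so on, while keeping the results non-decreasing
p.holm <- p.adjust(p.raw, method = "holm")
p.holm

# Number of tests significant at the 5% level before and after adjustment
sum(p.raw < 0.05)
sum(p.holm < 0.05)
```

With several correlations tested at once, fewer pairs usually remain significant after adjustment than before.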

Building a corrgram using corrplot(). We first need to compute the correlation matrix of the wine data frame, then pass it to corrplot().

wine.cor <- cor(wine[, 1:11])
corrplot(wine.cor)
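Before plotting, it can help to look at the object that cor() returns. The sketch below uses the built-in mtcars data set as a stand-in for the wine data (an assumption for illustration, so the example runs without the CSV file) and checks the basic properties of a correlation matrix.

```r
# Stand-in data: first five columns of the built-in mtcars data set
# (an assumption for illustration; the tutorial itself uses wine[, 1:11])
demo.cor <- cor(mtcars[, 1:5], use = "complete.obs")

# A correlation matrix is square and symmetric, with 1s on the diagonal
dim(demo.cor)
isSymmetric(demo.cor)
all.equal(unname(diag(demo.cor)), rep(1, 5))

# Every coefficient lies between -1 and 1
range(demo.cor)
```

Because the matrix is symmetric, plotting only one triangle loses no information.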

There are seven visual geometries (the method argument) that can be used to encode the attribute values: circle, square, ellipse, number, shade, color and pie. The default, circle, can be changed with the method argument.

corrplot(wine.cor, 
         method = "shade") 

corrplot() supports three layout types: full, upper and lower. The default is full; it can be changed using the type argument.

corrplot(wine.cor, 
         method = "shade", 
         type="lower")

Further customization can be done using diag and tl.col arguments to turn off the diagonal cells and to change the axis text colour.

corrplot(wine.cor, 
         method = "shade", 
         type="lower",
         diag = FALSE,
         tl.col = "black")

A corrgram can also combine two visual methods, one for each half of the matrix, using corrplot.mixed(). The tl.pos argument specifies the placement of the axis labels.

corrplot.mixed(wine.cor, 
               lower = "ellipse", 
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")

In statistical analysis, we are also interested in knowing which pairs of variables have statistically significant correlation coefficients. We can use cor.mtest() to compute the p-values and confidence intervals for each pair of variables, then pass the p-values to the p.mat argument of corrplot().

# cor.mtest() expects the data itself, not the correlation matrix
wine.sig <- cor.mtest(wine[, 1:11], conf.level = .95)
corrplot(wine.cor,
         method = "number",
         type = "lower",
         diag = FALSE,
         tl.col = "black",
         tl.srt = 45,
         p.mat = wine.sig$p,
         sig.level = .05)
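To make the idea of a p-value matrix concrete, here is a minimal base R sketch that builds one by hand with stats::cor.test(). It uses four columns of the built-in mtcars data set as a stand-in for the wine data (an assumption for illustration). It is not the implementation of cor.mtest(), only a picture of the kind of matrix that p.mat expects.

```r
# Stand-in data (assumption): four columns of the built-in mtcars data set
dat <- mtcars[, c("mpg", "disp", "hp", "wt")]
k <- ncol(dat)

# Empty k x k matrix that will hold one p-value per pair of variables
p.mat <- matrix(NA_real_, k, k, dimnames = list(names(dat), names(dat)))

for (i in seq_len(k - 1)) {
  for (j in (i + 1):k) {
    # cor.test() tests H0: correlation = 0 (Pearson by default)
    p <- cor.test(dat[[i]], dat[[j]], conf.level = 0.95)$p.value
    p.mat[i, j] <- p.mat[j, i] <- p
  }
}

# A variable is not tested against itself; 0 is used here as a convention
diag(p.mat) <- 0

# Pairs whose correlation is not significant at the 5% level
which(p.mat > 0.05, arr.ind = TRUE)
```

Each cell compares two columns of the data, which is why the significance test needs the observations rather than the correlation matrix.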

The corrplot package supports four sorting methods, set with the order argument. More methods can be found in the seriation package.

  • "AOE" is for the angular order of the eigenvectors. See Michael Friendly (2002) for details.

  • "FPC" for the first principal component order.

  • "hclust" for hierarchical clustering order, and "hclust.method" for the agglomeration method to be used. "hclust.method" should be one of "ward", "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median" or "centroid".

  • "alphabet" for alphabetical order.

corrplot.mixed(wine.cor, 
               lower = "ellipse", 
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               order="AOE",
               tl.col = "black")

corrplot(wine.cor, 
         method = "ellipse", 
         tl.pos = "lt",
         tl.col = "black",
         order="hclust",
         hclust.method = "ward.D",
         addrect = 3)
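The hierarchical clustering order can be pictured with base R alone. The sketch below is an illustration, not corrplot's internal code: it assumes 1 - r as the dissimilarity between variables and uses the built-in mtcars data set as a stand-in for the wine data.

```r
# Stand-in correlation matrix (assumption): six columns of mtcars
demo.cor <- cor(mtcars[, 1:6])

# Turn correlations into dissimilarities so that highly correlated
# variables are "close" (1 - r is one common choice, assumed here)
d <- as.dist(1 - demo.cor)

# Agglomerative clustering with the same method name as hclust.method = "ward.D"
hc <- hclust(d, method = "ward.D")

# hc$order is a variable ordering that places similar variables side by side
ord <- hc$order
demo.cor[ord, ord]

# cutree() splits the variables into three groups, analogous to addrect = 3
groups <- cutree(hc, k = 3)
groups
```

Reordering the rows and columns this way is what makes blocks of related variables visible, and the rectangles drawn by addrect mark those blocks.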