Correlation Matrix
chart
Also known as: correlation heatmap, correlation plot, correlogram
Description
A correlation matrix displays the pairwise correlation coefficients between a set of numerical variables in a symmetric grid. Each cell represents the relationship between two variables, with color encoding the correlation value: typically a diverging scale from strong negative (e.g., blue) through zero (white or light) to strong positive (e.g., red). The diagonal always shows perfect self-correlation (+1), and the matrix is symmetric around it.
The visualization serves as a powerful screening tool in exploratory data analysis. By scanning the color pattern, an analyst can quickly identify which pairs of variables move together (positive correlation), which move in opposite directions (negative correlation), and which are largely unrelated (near-zero correlation). Clusters of similarly colored cells reveal groups of co-varying variables, suggesting underlying factors or potential multicollinearity issues in regression models.
While the correlation matrix is information-dense, it relies on color perception, which is less precise than judging position along a common scale. Readers can easily distinguish strong from weak correlations but struggle with fine differences (e.g., 0.72 vs. 0.68). Annotating cells with numeric values, or linking each cell to a scatterplot that opens on click, can address this limitation.
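To make the grid described above concrete, the following is a minimal sketch of how the pairwise coefficients could be computed and reshaped into one record per cell. It assumes Pearson's coefficient and a `{var1, var2, r}` record shape chosen to match the Code Reference; the function names `pearson` and `correlationPairs` and the toy data are hypothetical, not part of any library.

```javascript
// Pearson correlation between two equal-length numeric arrays.
function pearson(x, y) {
  const n = x.length;
  const meanX = x.reduce((a, b) => a + b, 0) / n;
  const meanY = y.reduce((a, b) => a + b, 0) / n;
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = x[i] - meanX;
    const dy = y[i] - meanY;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  // Yields NaN when either variable is constant (zero variance).
  return sxy / Math.sqrt(sxx * syy);
}

// Build one {var1, var2, r} record per cell of the full symmetric matrix.
function correlationPairs(rows, variables) {
  const columns = new Map(variables.map(v => [v, rows.map(d => d[v])]));
  return variables.flatMap(var1 =>
    variables.map(var2 => ({
      var1,
      var2,
      r: pearson(columns.get(var1), columns.get(var2))
    }))
  );
}

// Hypothetical toy data, for illustration only.
const rows = [
  { height: 1, weight: 2, age: 8 },
  { height: 2, weight: 4, age: 6 },
  { height: 3, weight: 6, age: 4 },
  { height: 4, weight: 8, age: 2 }
];
const correlations = correlationPairs(rows, ["height", "weight", "age"]);
```

With three variables this produces nine records: the diagonal cells, plus each off-diagonal pair twice (once per triangle), which is the redundancy the symmetric layout implies.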
When to Use
- Exploring pairwise relationships among many variables (5-50+) in a single compact view
- Screening for multicollinearity before building regression models
- Identifying variable clusters that co-vary, suggesting latent factors
- Communicating the overall relationship structure of a dataset at a glance
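The multicollinearity screening mentioned above can be sketched as a simple filter over the cell records. This assumes the same hypothetical `{var1, var2, r}` shape used in the Code Reference; the helper name `highlyCorrelatedPairs`, the 0.8 cutoff (a rule of thumb, not a standard), and the sample values are all illustrative assumptions.

```javascript
// Return each unordered pair once, keeping those with |r| at or above the
// cutoff, strongest first. The name comparison drops the diagonal and the
// mirrored duplicate of each pair.
function highlyCorrelatedPairs(correlations, threshold = 0.8) {
  return correlations
    .filter(d => d.var1 < d.var2 && Math.abs(d.r) >= threshold)
    .sort((a, b) => Math.abs(b.r) - Math.abs(a.r));
}

// Hypothetical correlation values, for illustration only.
const correlations = [
  { var1: "income", var2: "income", r: 1 },
  { var1: "income", var2: "spending", r: 0.91 },
  { var1: "income", var2: "age", r: 0.35 },
  { var1: "spending", var2: "income", r: 0.91 },
  { var1: "spending", var2: "spending", r: 1 },
  { var1: "spending", var2: "age", r: -0.12 },
  { var1: "age", var2: "income", r: 0.35 },
  { var1: "age", var2: "spending", r: -0.12 },
  { var1: "age", var2: "age", r: 1 }
];

const flagged = highlyCorrelatedPairs(correlations);
```

Here only the income/spending pair passes the cutoff. A flagged pair is a prompt to inspect the variables further, not proof that one of them must be dropped from a model.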
When NOT to Use
- When you have only 2-3 variables: a scatterplot or scatterplot matrix shows more detail
- When the relationship is non-linear: Pearson's coefficient measures linear association only, and rank-based coefficients such as Spearman's capture only monotonic trends; use a scatterplot to see the actual shape
- When precise values are critical: color discrimination is imprecise; add numeric annotations or use a table
- For categorical data: the correlation coefficients shown here are defined for numerical variables; use a contingency table or mosaic plot instead
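The non-linearity caveat above can be illustrated with a small worked example. The sketch below uses a hypothetical `pearson` helper (not a library function) and made-up data in which one variable is fully determined by the other, yet the coefficient gives no hint of it.

```javascript
// Pearson correlation between two equal-length numeric arrays.
function pearson(x, y) {
  const n = x.length;
  const meanX = x.reduce((a, b) => a + b, 0) / n;
  const meanY = y.reduce((a, b) => a + b, 0) / n;
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    sxy += (x[i] - meanX) * (y[i] - meanY);
    sxx += (x[i] - meanX) ** 2;
    syy += (y[i] - meanY) ** 2;
  }
  return sxy / Math.sqrt(sxx * syy);
}

const x = [-2, -1, 0, 1, 2];

// Linear relationship: y = 3x + 1.
const rLinear = pearson(x, x.map(v => 3 * v + 1));

// Quadratic relationship: y = x^2, symmetric around x = 0.
const rQuadratic = pearson(x, x.map(v => v * v));
```

For this data `rLinear` is 1 and `rQuadratic` is 0, even though the quadratic `y` is a deterministic function of `x`. In a correlation matrix that cell would be drawn in the neutral color, which is why a scatterplot is the safer view when curvature is suspected.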
Anatomy
- Grid cells: Square cells at each (i, j) position, colored by the correlation coefficient between variables i and j.
- Color scale: A diverging scale (e.g., blue-white-red) centered at zero, with a legend showing the value-to-color mapping.
- Diagonal: Self-correlations (always 1.0), often shown in a neutral color or omitted.
- Variable labels: Names along both axes, in the same order.
- Cell annotations: Optional numeric values (e.g., “0.85”) printed inside each cell for precision.
- Upper/lower triangle: Since the matrix is symmetric, one triangle is sometimes hidden or replaced with a different encoding (e.g., circle size).
Variations
- Half-matrix: Only the lower (or upper) triangle is shown, eliminating redundancy.
- Clustered correlation matrix: Variables are reordered by hierarchical clustering to group correlated variables together.
- Bubble correlation matrix: Circle size encodes the absolute correlation value, and color encodes direction, providing a dual encoding.
- Annotated matrix: Numeric correlation values printed in each cell alongside the color fill.
- Significance-masked matrix: Cells for non-significant correlations are blanked out or marked with an “X”.
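The half-matrix variation above amounts to filtering the cell records before plotting. The following is a minimal sketch, assuming the hypothetical `{var1, var2, r}` record shape from the Code Reference, with `var1` indexing columns and `var2` indexing rows listed top to bottom; the helper name `lowerTriangle` and the sample values are illustrative.

```javascript
// Keep only cells strictly below the diagonal, given the display order of
// the variables. The diagonal itself is dropped as well.
function lowerTriangle(correlations, order) {
  const index = new Map(order.map((name, i) => [name, i]));
  return correlations.filter(d => index.get(d.var2) > index.get(d.var1));
}

// Hypothetical correlation values, for illustration only.
const order = ["a", "b", "c"];
const full = [
  { var1: "a", var2: "a", r: 1 },
  { var1: "a", var2: "b", r: 0.6 },
  { var1: "a", var2: "c", r: -0.2 },
  { var1: "b", var2: "a", r: 0.6 },
  { var1: "b", var2: "b", r: 1 },
  { var1: "b", var2: "c", r: 0.4 },
  { var1: "c", var2: "a", r: -0.2 },
  { var1: "c", var2: "b", r: 0.4 },
  { var1: "c", var2: "c", r: 1 }
];

const half = lowerTriangle(full, order);
```

For n variables this keeps n(n-1)/2 cells, three in this case. When plotting the filtered records, it is probably worth passing the same `order` array as the explicit domain of both axes so that rows and columns stay aligned; a clustered variation would only change how `order` is produced.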
Code Reference
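// Observable Plot - correlation matrix
// Assumes `correlations` is an array with one object per cell:
// { var1, var2, r }, where r is the correlation coefficient in [-1, 1].
Plot.plot({
  marks: [
    // One colored cell per variable pair
    Plot.cell(correlations, {
      x: "var1",
      y: "var2",
      fill: "r",
      tip: true
    }),
    // Numeric annotation; white text on strongly colored cells for contrast
    Plot.text(correlations, {
      x: "var1",
      y: "var2",
      text: d => d.r.toFixed(2),
      fill: d => Math.abs(d.r) > 0.5 ? "white" : "black",
      fontSize: 10
    })
  ],
  color: {
    // "BuRd" maps negative values to blue and positive values to red,
    // matching the description; "RdBu" would give the opposite mapping.
    scheme: "BuRd",
    pivot: 0,
    domain: [-1, 1],
    legend: true,
    label: "Correlation"
  },
  x: { label: null, tickRotate: -45 },
  y: { label: null }
})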