Scatterplot
chart
Also known as: scatter chart, scatter diagram, scatter plot, XY plot
Description
A scatterplot positions each observation as a point in a two-dimensional space defined by two quantitative variables. The horizontal position encodes one variable and the vertical position encodes the other. The resulting cloud of points reveals the nature of the relationship between the variables: linear or nonlinear correlation, clusters of similar observations, and outliers that deviate from the overall pattern.
Scatterplots are the workhorse of exploratory data analysis. They answer the fundamental question “how are these two measurements related?” without imposing any model or aggregation on the data. Each individual observation is visible, which makes scatterplots ideal for detecting patterns that summary statistics would obscure, such as Simpson’s paradox or heteroscedasticity.
Additional variables can be encoded through visual channels: color or shape for categorical grouping, size for a third quantitative variable (making it a bubble chart), and opacity to manage overplotting when datasets are large.
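When size carries a third variable, a common convention is to scale the area of each point, not its radius, in proportion to the value, so that a value four times larger looks four times larger. The sketch below illustrates that idea in plain JavaScript. The function name `bubbleRadius` and the default maximum radius of 20 pixels are assumptions made for this example, not part of any library.

```javascript
// Hypothetical helper: map a non-negative value to a point radius so that
// the point's AREA is proportional to the value (radius grows with sqrt).
function bubbleRadius(value, maxValue, maxRadius = 20) {
  // Guard against negative, NaN, or empty domains.
  if (!(value >= 0) || !(maxValue > 0)) return 0;
  return maxRadius * Math.sqrt(value / maxValue);
}

// A value that is a quarter of the maximum gets half the radius,
// which is a quarter of the area.
const radii = [100, 25, 0].map((v) => bubbleRadius(v, 100));
console.log(radii);
```

Scaling the radius linearly instead would exaggerate large values, because perceived size follows area.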
Prompt Examples
Try these prompts with Claude, ChatGPT, or other AI tools:
“Draw a scatterplot of the relationship between income and life expectancy. Show population as point size and continent as color.”
“Make a scatter plot of height vs weight, colored by gender, with a regression line.”
“Create a bubble chart where x=GDP, y=happiness, size=population, color=continent.”
When to Use
- Investigating the correlation or association between two quantitative variables
- Identifying clusters, gaps, or outliers in multivariate data
- Checking assumptions before fitting a regression model
- Exploring a dataset without strong prior hypotheses about structure
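A scatterplot is often read alongside a numeric summary of linear association. As a minimal sketch, the Pearson correlation coefficient for two equal-length arrays can be computed as follows. The function name `pearson` is an assumption for this example, and the coefficient only measures linear association, so it complements the plot and does not replace it.

```javascript
// Hypothetical helper: Pearson correlation coefficient of two numeric arrays.
// Returns NaN when the inputs differ in length, have fewer than two
// observations, or when either variable has zero variance.
function pearson(xs, ys) {
  const n = xs.length;
  if (n !== ys.length || n < 2) return NaN;

  const mean = (a) => a.reduce((sum, v) => sum + v, 0) / n;
  const mx = mean(xs);
  const my = mean(ys);

  let sxy = 0; // sum of products of deviations
  let sxx = 0; // sum of squared deviations in x
  let syy = 0; // sum of squared deviations in y
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - mx;
    const dy = ys[i] - my;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy);
}

console.log(pearson([1, 2, 3, 4], [2, 4, 6, 8]));
```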
When NOT to Use
- When one variable is categorical — use a bar chart or strip plot
- Time series data where temporal order matters — use a line graph (a scatterplot loses the sequence)
- When overplotting makes individual points indistinguishable (thousands of points) — consider a heatmap (binned), density contours, or hexbin plot
- Comparing magnitudes of a single variable across groups — a bar chart or box plot is more direct
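The binned alternative to an overplotted scatterplot can be sketched as a count grid: divide the plotting area into rectangular cells and count the points that fall in each. The example below is a plain JavaScript illustration with rectangular bins (not hexagons). The function name `bin2d` and its option names are assumptions for this sketch.

```javascript
// Hypothetical helper: count points into a rows x cols grid of rectangular bins.
// `points` is an array of [x, y] pairs; points outside the extent are skipped.
function bin2d(points, { xMin, xMax, yMin, yMax, cols, rows }) {
  const counts = Array.from({ length: rows }, () => new Array(cols).fill(0));
  for (const [x, y] of points) {
    if (x < xMin || x > xMax || y < yMin || y > yMax) continue;
    // Clamp so that values exactly on the upper edge land in the last bin.
    const c = Math.min(cols - 1, Math.floor(((x - xMin) / (xMax - xMin)) * cols));
    const r = Math.min(rows - 1, Math.floor(((y - yMin) / (yMax - yMin)) * rows));
    counts[r][c] += 1;
  }
  return counts;
}

const grid = bin2d(
  [[0, 0], [0.1, 0.1], [0.9, 0.9], [1, 1]],
  { xMin: 0, xMax: 1, yMin: 0, yMax: 1, cols: 2, rows: 2 }
);
console.log(grid);
```

Each cell count can then be mapped to a color to draw the heatmap.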
Anatomy
- Points (dots): Each mark represents one observation. Position encodes two quantitative values.
- X-axis: The horizontal quantitative variable (often the independent or predictor variable by convention).
- Y-axis: The vertical quantitative variable (often the dependent or response variable).
- Color/Shape encoding: Optional mapping of a categorical variable to distinguish groups.
- Size encoding: Optional mapping of a third quantitative variable (transforms the scatterplot into a bubble chart).
- Trend line: An optional fitted line (linear regression, LOESS) overlaid to summarize the relationship.
- Marginal distributions: Optional histograms or density plots along the axes showing each variable’s univariate distribution.
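For the linear case of the trend line listed above, an ordinary least-squares fit can be sketched in a few lines. This is an illustration only. The function name `linearFit` is an assumption, and a smoother such as LOESS would need a different, more involved procedure.

```javascript
// Hypothetical helper: ordinary least-squares line y = slope * x + intercept.
// Returns null when the inputs are unusable (length mismatch, fewer than two
// points, or no variation in x).
function linearFit(xs, ys) {
  const n = xs.length;
  if (n !== ys.length || n < 2) return null;

  const mx = xs.reduce((s, v) => s + v, 0) / n;
  const my = ys.reduce((s, v) => s + v, 0) / n;

  let sxy = 0;
  let sxx = 0;
  for (let i = 0; i < n; i++) {
    sxy += (xs[i] - mx) * (ys[i] - my);
    sxx += (xs[i] - mx) ** 2;
  }
  if (sxx === 0) return null; // vertical line: slope is undefined

  const slope = sxy / sxx;
  return { slope, intercept: my - slope * mx };
}

console.log(linearFit([0, 1, 2, 3], [1, 3, 5, 7]));
```

The two endpoints of the overlay line are obtained by evaluating the fitted equation at the minimum and maximum x values.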
Variations
- Bubble chart: Adds a size channel to encode a third quantitative variable.
- Connected scatterplot: Points connected in temporal order, showing a trajectory through the 2D space over time.
- Jittered scatterplot: Random displacement added to points to reduce overplotting when values are discrete or rounded.
- Hexbin / 2D histogram: Aggregates dense point clouds into hexagonal bins colored by count, solving overplotting.
- Scatterplot matrix (SPLOM): A grid of small scatterplots for every pair of variables in a multivariate dataset.
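The jittered variation can be sketched as a small uniform random offset added to each value before plotting. The function name `jitter`, the `width` parameter, and the injectable random source are assumptions for this example. Passing a fixed generator makes the result reproducible.

```javascript
// Hypothetical helper: add a uniform random offset in [-width/2, +width/2)
// to each value. `rng` must return numbers in [0, 1), like Math.random.
function jitter(values, width, rng = Math.random) {
  return values.map((v) => v + (rng() - 0.5) * width);
}

// Discrete ratings (1 to 5) spread out by at most 0.2 in either direction.
const ratings = [1, 1, 2, 3, 3, 3, 5];
console.log(jitter(ratings, 0.4));
```

The jitter width should stay well below the spacing between distinct values so that points are not mistaken for a neighboring value.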
Code Reference
// Observable Plot - scatterplot with color grouping
Plot.plot({
  marks: [
    Plot.dot(data, {
      x: "income",
      y: "life_expectancy",
      fill: "continent",
      r: 3,
      opacity: 0.7,
      tip: true
    })
  ],
  x: { label: "Income per capita ($)" },
  y: { label: "Life expectancy (years)" },
  color: { legend: true }
})