Scatterplot

Description

A scatterplot places each observation as a point in a two-dimensional space defined by two quantitative variables. Horizontal position encodes one variable and vertical position encodes the other. The resulting cloud of points reveals the nature of the relationship between the variables: linear or nonlinear correlation, clusters of similar observations, and outliers that deviate from the overall pattern.

Scatterplots are the workhorse of exploratory data analysis. They answer the fundamental question "how are these two measurements related?" without imposing any model or aggregation on the data. Because every individual observation stays visible, scatterplots are well suited to detecting patterns that summary statistics can obscure, such as Simpson's paradox or heteroscedasticity.
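
To make the Simpson's paradox point concrete, here is a minimal sketch in plain JavaScript. The data and the `pearson` and `correlationOf` helpers are hypothetical, invented for illustration: each group trends upward on its own, yet the pooled correlation comes out negative.

```javascript
// Pearson correlation coefficient for two equal-length numeric arrays.
function pearson(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Hypothetical data: both groups rise from left to right,
// but group B sits lower and further right than group A.
const points = [
  { x: 1, y: 10, group: "A" },
  { x: 2, y: 11, group: "A" },
  { x: 3, y: 12, group: "A" },
  { x: 6, y: 4, group: "B" },
  { x: 7, y: 5, group: "B" },
  { x: 8, y: 6, group: "B" },
];

// Correlation between the x and y fields of a set of rows.
function correlationOf(rows) {
  return pearson(rows.map((d) => d.x), rows.map((d) => d.y));
}

const pooled = correlationOf(points);
const withinA = correlationOf(points.filter((d) => d.group === "A"));
const withinB = correlationOf(points.filter((d) => d.group === "B"));

// The pooled value is negative while both within-group values are positive.
console.log({ pooled, withinA, withinB });
```

A scatterplot colored by group would show both the within-group rise and the overall downward drift at a glance, whereas the single pooled coefficient reports only the latter.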

Additional variables can be encoded through other visual channels: color or shape for categorical grouping, size for a third quantitative variable (which turns the plot into a bubble chart), and opacity to manage overplotting when datasets are large.
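
When size encodes a quantity, a common convention is to make the circle's area, rather than its radius, proportional to the value, so that large values are not visually exaggerated. The sketch below illustrates that convention; `radiusScale` and its parameters are hypothetical names, not part of any library.

```javascript
// Returns a function mapping a non-negative value to a radius in pixels,
// such that circle area is proportional to the value.
// maxValue: the largest value expected in the data (assumed > 0).
// maxRadius: the radius, in pixels, to use for maxValue.
function radiusScale(maxValue, maxRadius) {
  return (value) => maxRadius * Math.sqrt(Math.max(0, value) / maxValue);
}

const radiusFor = radiusScale(400, 20);

// A value four times smaller gets half the radius, hence a quarter of the area.
console.log(radiusFor(400), radiusFor(100), radiusFor(0));
```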

Prompt Examples

Try these prompts with Claude, ChatGPT, or other AI tools:

"Draw a scatterplot of the relationship between income and life expectancy. Show population as point size and continent as color."

"Make a scatter plot of height vs weight, colored by gender, with a regression line."

"Create a bubble chart where x=GDP, y=happiness, size=population, color=continent."

When to Use

Investigating the correlation or association between two quantitative variables
Identifying clusters, gaps, or outliers in multivariate data
Checking assumptions before fitting a regression model
Exploring a dataset without strong prior hypotheses about its structure

When NOT to Use

When one variable is categorical -- use a bar chart or strip plot
When the data is a time series and temporal order matters -- use a line graph (a scatterplot loses the sequence)
When overplotting makes individual points indistinguishable (thousands of points) -- consider a binned heatmap, density contours, or a hexbin plot
When comparing magnitudes of a single variable across groups -- a bar chart or box plot is more direct
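
The binned heatmap alternative mentioned above comes down to counting points per grid cell and coloring each cell by its count. A minimal sketch of the counting step, assuming a rectangular grid and a hypothetical `bin2d` helper (the extent and grid size are choices made by the caller):

```javascript
// Counts points falling into each cell of a cols x rows grid over
// [xMin, xMax] x [yMin, yMax]. Points outside the extent are skipped.
// Returns counts[row][col], with row 0 holding the smallest y values.
function bin2d(points, { xMin, xMax, yMin, yMax, cols, rows }) {
  const counts = Array.from({ length: rows }, () => new Array(cols).fill(0));
  for (const { x, y } of points) {
    if (x < xMin || x > xMax || y < yMin || y > yMax) continue;
    // Clamp so that points exactly on the upper edge land in the last cell.
    const col = Math.min(cols - 1, Math.floor(((x - xMin) / (xMax - xMin)) * cols));
    const row = Math.min(rows - 1, Math.floor(((y - yMin) / (yMax - yMin)) * rows));
    counts[row][col] += 1;
  }
  return counts;
}

// Hypothetical sample: two points near the origin, two near (1, 1), one outside.
const sample = [
  { x: 0.1, y: 0.1 },
  { x: 0.2, y: 0.3 },
  { x: 0.9, y: 0.9 },
  { x: 1, y: 1 },
  { x: 2, y: 2 },
];
const grid = bin2d(sample, { xMin: 0, xMax: 1, yMin: 0, yMax: 1, cols: 2, rows: 2 });
console.log(grid);
```

The resulting counts can then be mapped to a sequential color scale, which stays readable no matter how many points fall into a cell.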

Anatomy

Points (dots): Each mark represents one observation. Its position encodes two quantitative values.
X-axis: The horizontal quantitative variable (by convention, often the independent or predictor variable).
Y-axis: The vertical quantitative variable (often the dependent or response variable).
Color/Shape encoding: Optional mapping of a categorical variable to distinguish groups.
Size encoding: Optional mapping of a third quantitative variable (turns the scatterplot into a bubble chart).
Trend line: An optional fitted line or curve (linear regression, LOESS) overlaid to summarize the relationship.
Marginal distributions: Optional histograms or density plots along the axes showing each variable's univariate distribution.
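
As an illustration of the trend line component, the sketch below fits a straight line by ordinary least squares. The `linearFit` helper and the sample data are hypothetical, and a LOESS curve would need a different, local fitting procedure.

```javascript
// Ordinary least squares fit of y = intercept + slope * x.
// Expects at least two points with differing x values.
function linearFit(points) {
  const n = points.length;
  const meanX = points.reduce((sum, p) => sum + p.x, 0) / n;
  const meanY = points.reduce((sum, p) => sum + p.y, 0) / n;
  let sxy = 0;
  let sxx = 0;
  for (const { x, y } of points) {
    sxy += (x - meanX) * (y - meanY);
    sxx += (x - meanX) ** 2;
  }
  const slope = sxy / sxx;
  const intercept = meanY - slope * meanX;
  return { slope, intercept, predict: (x) => intercept + slope * x };
}

// Hypothetical observations lying exactly on y = 1 + 2x.
const fit = linearFit([
  { x: 1, y: 3 },
  { x: 2, y: 5 },
  { x: 3, y: 7 },
  { x: 4, y: 9 },
]);
console.log(fit.slope, fit.intercept, fit.predict(5));
```

Evaluating `predict` at the smallest and largest x values gives the two endpoints needed to draw the line over the points.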

Variations

Bubble chart: Adds a size channel to encode a third quantitative variable.
Connected scatterplot: Points are connected in temporal order, showing a trajectory through the 2D space over time.
Jittered scatterplot: A small random displacement is added to points to reduce overplotting when values are discrete or rounded.
Hexbin / 2D histogram: Aggregates dense point clouds into hexagonal (or, for a 2D histogram, rectangular) bins colored by count, which addresses overplotting.
Scatterplot matrix (SPLOM): A grid of small scatterplots, one for every pair of variables in a multivariate dataset.
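
Jittering, as described in the list above, can be sketched in a few lines. The `jitterX` helper below is hypothetical; it displaces only the x value, on the assumption that x is the discrete or rounded variable, and it takes the random source as a parameter so results can be made reproducible.

```javascript
// Returns a copy of the points with x shifted by a random offset in
// [-width / 2, +width / 2). The input array is left unmodified.
// random: a function returning numbers in [0, 1); defaults to Math.random.
function jitterX(points, width, random = Math.random) {
  return points.map((p) => ({ ...p, x: p.x + (random() - 0.5) * width }));
}

// Hypothetical survey ratings: many observations share the same integer x.
const ratings = [
  { x: 3, y: 41 },
  { x: 3, y: 42 },
  { x: 4, y: 40 },
];
const jittered = jitterX(ratings, 0.4);
console.log(jittered);
```

The jitter width should stay well below the spacing between adjacent discrete values (here 1), so that points remain visually attached to their true value.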

Code Reference

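// Observable Plot: scatterplot with color grouping.
// Assumes `data` is an array of objects with numeric `income` and
// `life_expectancy` fields and a categorical `continent` field.
// Outside Observable notebooks, import the library first:
//   import * as Plot from "@observablehq/plot";
Plot.plot({
  marks: [
    Plot.dot(data, {
      x: "income",            // horizontal position
      y: "life_expectancy",   // vertical position
      fill: "continent",      // categorical color encoding
      r: 3,                   // constant radius in pixels
      opacity: 0.7,           // partial transparency to soften overplotting
      tip: true               // show values on hover
    })
  ],
  x: { label: "Income per capita ($)" },
  y: { label: "Life expectancy (years)" },
  color: { legend: true }     // render a legend for the continent colors
})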