Histogram

Description

A histogram divides the range of a continuous numerical variable into contiguous, non-overlapping intervals (bins) and draws one bar per bin, with the bar's height representing the count or density of observations falling within that interval. Unlike a bar chart, the bars are adjacent with no gaps, which visually communicates that the underlying variable is continuous rather than categorical.
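The binning step described above can be sketched in a few lines. This is a minimal illustration, not a library implementation: the helper name `binCounts`, its parameters, and the sample ages are all assumptions made for the example. It uses equal-width, half-open bins and closes the last bin on the right so the maximum value is still counted.

```javascript
// Hypothetical helper: count observations in equal-width bins.
// Bins are half-open [lo, hi), except the last, which also includes its right edge.
function binCounts(values, start, width, binCount) {
  const counts = new Array(binCount).fill(0);
  const end = start + width * binCount;
  for (const v of values) {
    let i = Math.floor((v - start) / width);
    if (v === end) i = binCount - 1; // keep the maximum in the last bin
    if (i >= 0 && i < binCount) counts[i] += 1; // values outside the range are ignored
  }
  return counts;
}

// Illustrative data (made up for this sketch).
const ages = [21, 24, 25, 29, 30, 33, 38, 40];

// Four 5-year bins: [20,25) [25,30) [30,35) [35,40]
const counts = binCounts(ages, 20, 5, 4);
console.log(counts);
```

Each entry of `counts` becomes the height of one bar; the value 25 lands in the second bin because the left edge is inclusive.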

The histogram is one of the most common tools for understanding the shape of a univariate distribution. It reveals modality (unimodal, bimodal, multimodal), skewness (left, right, or none when the distribution is symmetric), spread, and the presence of outliers or gaps. Summary statistics such as the mean and standard deviation do not show these distributional features, which makes histograms essential for exploratory data analysis.

Bin width is the critical design choice. Too few bins over-smooth the data and hide structure; too many bins create noisy, hard-to-read charts. Common heuristics include Sturges' rule, the Freedman-Diaconis rule, and Scott's rule, but visual inspection and domain knowledge should guide the final decision.
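As a rough sketch of how these heuristics differ, the snippet below computes a suggested bin width from each rule using their textbook forms: Sturges (number of bins = ceil(log2 n) + 1), Scott (width = 3.49 * s / n^(1/3), with s the sample standard deviation), and Freedman-Diaconis (width = 2 * IQR / n^(1/3)). The function names, the linear-interpolation quantile, and the sample values are assumptions for illustration; libraries may compute quartiles differently and so give slightly different widths.

```javascript
// Quantile by linear interpolation between order statistics (one of several conventions).
function quantile(sorted, p) {
  const pos = (sorted.length - 1) * p;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Hypothetical helper: suggested bin widths under three common rules.
function binWidthRules(values) {
  const n = values.length;
  const sorted = [...values].sort((a, b) => a - b);
  const range = sorted[n - 1] - sorted[0];
  const mean = values.reduce((s, v) => s + v, 0) / n;
  const sd = Math.sqrt(values.reduce((s, v) => s + (v - mean) ** 2, 0) / (n - 1));
  const iqr = quantile(sorted, 0.75) - quantile(sorted, 0.25);
  const sturgesBins = Math.ceil(Math.log2(n)) + 1;
  return {
    sturges: range / sturgesBins,             // range split into ceil(log2 n) + 1 bins
    scott: (3.49 * sd) / Math.cbrt(n),        // based on the standard deviation
    freedmanDiaconis: (2 * iqr) / Math.cbrt(n) // based on the IQR, less sensitive to outliers
  };
}

// Illustrative data (made up for this sketch).
const widths = binWidthRules([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]);
console.log(widths);
```

The three rules can disagree noticeably on the same data, which is one reason to treat them as starting points rather than final answers.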

Prompt Examples

Try these prompts with Claude, ChatGPT, or other AI tools:

"Create a histogram of the customer age distribution with 5-year bins."

"Create a histogram of test scores with 10-point bins. Add a normal distribution overlay."

When to Use

Understanding the shape of a continuous variable's distribution (salary, temperature, test scores)
Checking for skewness, outliers, or multimodality before statistical modeling
Comparing the distribution of a variable across a few groups using overlapping or faceted histograms
Quality control: verifying that measurements fall within expected ranges

When NOT to Use

Comparing values across categories -- use a bar chart (histograms are for continuous data)
When the sample size is very small (fewer than 20 observations) -- plotting the individual data points, for example in a dot plot, is more informative
Comparing distributions across many groups -- a box plot or violin plot is more compact
When temporal ordering matters -- use a line graph

Anatomy

Bins (bars): Contiguous rectangular bars. Each bar's width spans its bin interval; its height encodes frequency or density.
X-axis: The continuous variable, divided into bin intervals.
Y-axis: Count (frequency), relative frequency, or density.
Bin edges: The boundaries between adjacent bins, which define the intervals.
No gaps: Bars touch each other, signaling a continuous variable. This distinguishes histograms from bar charts.
Rug plot: Optional tick marks along the x-axis showing individual data points, adding detail that binning hides.
Density curve: An optional smoothed overlay (kernel density estimate) summarizing the distributional shape.
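For readers curious how the optional density curve is obtained, here is a minimal sketch of a Gaussian kernel density estimate. The function name `gaussianKde`, the bandwidth of 0.5, the evaluation grid, and the sample values are assumptions chosen for illustration; charting libraries typically choose the bandwidth automatically and may use other kernels.

```javascript
// Gaussian kernel density estimate; returns a function x -> estimated density.
// `bandwidth` is an assumed, caller-chosen smoothing parameter.
function gaussianKde(values, bandwidth) {
  const norm = 1 / (values.length * bandwidth * Math.sqrt(2 * Math.PI));
  return (x) =>
    norm *
    values.reduce((sum, v) => {
      const z = (x - v) / bandwidth;
      return sum + Math.exp(-0.5 * z * z);
    }, 0);
}

// Illustrative data (made up for this sketch).
const sample = [2.1, 2.4, 3.0, 3.2, 5.8];
const estimate = gaussianKde(sample, 0.5);

// Evaluate on an evenly spaced grid from 0 to 8 to get points for the overlay line.
const step = 0.1;
const curve = Array.from({ length: 81 }, (_, i) => {
  const x = i * step;
  return { x, density: estimate(x) };
});
console.log(curve.length);
```

The curve is on a density scale. To overlay it on a count histogram with equal-width bins, multiply each density value by the number of observations times the bin width.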

Variations

Density histogram: The y-axis shows probability density rather than counts, so the bar areas sum to 1. This enables comparison across datasets of different sizes.
Cumulative histogram: Each bar shows the running total of observations up to and including its bin, which is useful for percentile analysis.
Stacked histogram: Bars are subdivided by a categorical variable to compare group distributions.
Overlapping histograms: Semi-transparent histograms from multiple groups drawn on the same axes.
2D histogram (hexbin): Extends the concept to two dimensions by binning the points of a scatterplot into rectangles or hexagons colored by count.
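The density and cumulative variations are simple transformations of the raw bin counts. The sketch below assumes equal-width bins; the helper names `toDensity` and `toCumulative` and the counts are hypothetical, chosen only to show the arithmetic.

```javascript
// Density height for each bin: count / (n * binWidth), so that bar areas sum to 1.
function toDensity(counts, binWidth) {
  const n = counts.reduce((s, c) => s + c, 0);
  return counts.map((c) => c / (n * binWidth));
}

// Running totals for a cumulative histogram.
function toCumulative(counts) {
  let running = 0;
  return counts.map((c) => (running += c));
}

// Illustrative counts for four bins of width 10 (made up for this sketch).
const counts = [2, 5, 9, 4];
const binWidth = 10;

const density = toDensity(counts, binWidth);
const area = density.reduce((s, d) => s + d * binWidth, 0); // total bar area
const cumulative = toCumulative(counts);
console.log(density, area, cumulative);
```

Dividing the cumulative totals by the number of observations gives the empirical proportion at or below each bin's upper edge, which is what makes this form convenient for reading off percentiles.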

Code Reference

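// Observable Plot - histogram with automatic binning
// Assumes `data` is an array of objects with a numeric `age` field.
// Outside an Observable notebook, import the library first:
import * as Plot from "@observablehq/plot";

Plot.plot({
  marks: [
    // binX groups `age` into bins; the "count" reducer sets each bar's height
    Plot.rectY(data, Plot.binX({ y: "count" }, {
      x: "age",
      fill: "steelblue",
      tip: true // show bin details on hover
    })),
    Plot.ruleY([0]) // baseline at y = 0
  ],
  x: { label: "Age" },
  y: { label: "Count" }
})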