Python | Data Processing: Histograms

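Overview (a histogram is the entry point for grasping the distribution of values at a glance)
Basic usage (minimal code with pandas, matplotlib, and seaborn)
Key parameters in depth (bins, range, density, alpha, cumulative)
Practical use (outliers, skew, multimodality, log scale)
Preprocessing patterns (dtype, missing values, range setup) and how to compare
Applications (relationship between two variables, frequency tables, automatic bin width)
Summary ("correct dtype → appropriate bins → range and density chosen for the purpose"; show outliers and multimodality honestly)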

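Overview (a histogram is the entry point for grasping the distribution of values at a glance)

A histogram divides numeric data into intervals (bins) and draws one bar per interval showing how many observations fall in it. It shows what the mean and median hide: skew, heavy tails, the influence of outliers, and multimodality (two or more peaks). pandas and matplotlib produce one in a few lines of code, and tuning bins (number of intervals), range (binning range), density (normalization), and alpha (transparency) noticeably improves the analysis.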

Basic usage (minimal code with pandas, matplotlib, and seaborn)

A quick plot with pandas

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"score": [55, 60, 65, 70, 72, 75, 78, 80, 82, 85, 88, 90, 92, 95, 100]})
ax = df["score"].plot.hist(bins=10, figsize=(6,4), title="Score distribution")
ax.set_xlabel("Score")
ax.set_ylabel("Count")
plt.tight_layout(); plt.show()

Calling plot.hist on a DataFrame or Series and passing the number of intervals (bins) is enough to get the basic form.

Flexible plotting with matplotlib

import matplotlib.pyplot as plt

plt.figure(figsize=(6,4))
plt.hist(df["score"], bins=10, color="#4e79a7", edgecolor="white")
plt.xlabel("Score"); plt.ylabel("Count"); plt.title("Score distribution")
plt.grid(True, axis="y", alpha=0.3)
plt.tight_layout(); plt.show()

Colors, bar outlines (edgecolor), grid lines, and other visual details can all be adjusted freely.

seaborn for polish plus a density curve

import seaborn as sns

sns.histplot(data=df, x="score", bins=10, kde=True)  # kde=True overlays a smooth density curve

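Overlaying a density curve (kde) makes the shape of the distribution (peaks and tails) easier to grasp intuitively.

Key parameters in depth (bins, range, density, alpha, cumulative)

The choice of bins (number of intervals) changes what you see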

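df["score"].plot.hist(bins=5)   # coarse: overall shape only
plt.show()                      # finish this figure so the next call starts a new one
df["score"].plot.hist(bins=20)  # fine: more detail, more noise
plt.show()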

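Too few bins hide the shape; too many emphasize noise. Starting at around 10 and then adjusting to the sample size and spread gives stable results.

Fix the scale with range (binning range)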

df["score"].plot.hist(bins=10, range=(0, 100))

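Using the same range across plots keeps comparisons with other metrics consistent. When outliers are very large, either cut the range or add an annotation or a separate figure for the outliers.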
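As a small check of what cutting the range does to the data, the sketch below uses np.histogram, which matplotlib's hist relies on for binning. The sample values are made up for illustration.

```python
import numpy as np

# Hypothetical sample: four values inside 0-100 and two outside it.
values = np.array([-5, 10, 50, 90, 100, 150])

counts, edges = np.histogram(values, bins=10, range=(0, 100))

n_inside = int(counts.sum())          # values that landed in a bin
n_dropped = len(values) - n_inside    # values outside range are ignored, not clipped
print(f"binned: {n_inside}, dropped: {n_dropped}")
```

Because out-of-range values are left out without any warning, it is worth reporting how many were dropped next to the figure.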

density (density instead of counts) makes distributions comparable

df["score"].plot.hist(bins=10, density=True)

Setting the vertical axis to probability density lets you compare distributions with different sample sizes on the same plot (the bar areas sum to 1).
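The "area equals 1" property can be verified numerically. This is a minimal sketch using np.histogram with the same score values as the sample DataFrame; the variable names are chosen here for illustration.

```python
import numpy as np

scores = np.array([55, 60, 65, 70, 72, 75, 78, 80, 82, 85, 88, 90, 92, 95, 100])

density, edges = np.histogram(scores, bins=10, density=True)
widths = np.diff(edges)                  # width of each bin
area = float((density * widths).sum())   # bar height x bar width, summed over bins
print(round(area, 6))
```

The bar heights themselves are densities, not probabilities, so with narrow bins an individual bar can be taller than 1 while the total area stays at 1.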

alpha (transparency) makes overlaid plots readable

import numpy as np
x = np.random.normal(70, 10, 300)
y = np.random.normal(80, 8, 300)

plt.hist(x, bins=15, alpha=0.6, label="Group A")
plt.hist(y, bins=15, alpha=0.6, label="Group B")
plt.legend(); plt.show()

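To compare several distributions, draw them semi-transparent (alpha below 1) and overlay them. Always add a legend to prevent misreading.

Cumulative histogram (see what share falls at or below a threshold)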

plt.hist(df["score"], bins=10, cumulative=True, density=True)

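cumulative=True visualizes what share of the data is at or below a given value. Combined with density=True the bars rise to 1, which is handy when considering a pass mark or an SLA threshold.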
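A cumulative histogram can only be read off at bin edges. When an exact figure for one threshold is needed, it can be computed directly, as in this sketch. The threshold of 80 is an assumed pass mark, not something taken from the data.

```python
import numpy as np

scores = np.array([55, 60, 65, 70, 72, 75, 78, 80, 82, 85, 88, 90, 92, 95, 100])
threshold = 80  # hypothetical pass mark

# The mean of a boolean array is the share of True values.
share_at_or_below = float((scores <= threshold).mean())
share_above = 1.0 - share_at_or_below
print(f"at or below {threshold}: {share_at_or_below:.3f}, above: {share_above:.3f}")
```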

Practical use (outliers, skew, multimodality, log scale)

Check the influence of outliers and present two figures

plt.figure(figsize=(10,4))
plt.subplot(1,2,1); plt.hist(df["score"], bins=10); plt.title("With outliers")
plt.subplot(1,2,2); plt.hist(df["score"].clip(0, 100), bins=10); plt.title("Clipped (0–100)")
plt.tight_layout(); plt.show()

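Placing the figure that includes outliers next to the one where they have been handled presents the data honestly and keeps the discussion sound.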
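How the outliers are "handled" changes the second figure. The sketch below contrasts clip with filtering on a made-up Series; the value 250 is a hypothetical outlier added for illustration.

```python
import pandas as pd

s = pd.Series([50, 60, 70, 80, 250])   # hypothetical data, 250 is the outlier

clipped = s.clip(0, 100)               # 250 becomes 100, row count unchanged
filtered = s[s.between(0, 100)]        # 250 is removed from the data

print(clipped.tolist())
print(filtered.tolist())
```

With clip, capped values pile up in the edge bin, so the figure title or caption should say which treatment was used.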

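Reading skew (right tail, left tail)

A long right tail means a few values are large; a long left tail means a few are small. This helps decide whether the median should be preferred over the mean and whether outlier handling is needed.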
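The visual impression of a tail can be backed by numbers. This sketch uses a small made-up right-skewed sample and pandas' mean, median, and skew methods.

```python
import pandas as pd

right_skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 50])   # hypothetical sample

mean = right_skewed.mean()       # pulled upward by the single large value
median = right_skewed.median()   # stays near the bulk of the data
skew = right_skewed.skew()       # sample skewness: positive for a long right tail

print(mean, median, skew > 0)
```

A mean well above the median together with positive skewness points to a right tail, in which case the median is usually the safer summary.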

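If the distribution is multimodal (several peaks), suspect hidden segments

If there are two or more peaks, plotting by category (for example region or product type) reveals the structure.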

import seaborn as sns
# Assumes df has a "category" column with the segment label (the sample df above does not).
sns.histplot(data=df, x="score", hue="category", bins=15, element="step", stat="density")

Log scale makes wide ranges of magnitude readable

vals = pd.Series([1, 2, 5, 10, 50, 100, 1000, 10000])
plt.hist(vals, bins=10)
plt.yscale("log")   # log-scale the y axis (for a log x axis, transform the values before plotting)

For data spanning several orders of magnitude, such as sales or page views, a log scale makes the trend easier to read.
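For a log-scaled horizontal axis, one alternative to transforming the values is to keep the raw values and use logarithmically spaced bin edges. The sketch below assumes all values are positive. The choice of 8 bins is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

vals = np.array([1, 2, 5, 10, 50, 100, 1000, 10000])

# 9 edges, equally spaced on a log scale between the smallest and largest value.
edges = np.geomspace(vals.min(), vals.max(), num=9)

fig, ax = plt.subplots(figsize=(6, 4))
counts, _, _ = ax.hist(vals, bins=edges)
ax.set_xscale("log")               # bars now appear equally wide
ax.set_xlabel("Value (log scale)")
ax.set_ylabel("Count")
```

With equal-width bins on a linear axis, almost all of these values would fall into the first bin. Log-spaced edges spread them across the bins.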

Preprocessing patterns (dtype, missing values, range setup) and how to compare

Convert dtype to numeric (convert numeric strings first)

raw = pd.DataFrame({"price": ["100", "200", "x", None]})
prices = pd.to_numeric(raw["price"], errors="coerce")
plt.hist(prices.dropna(), bins=10)

to_numeric(errors="coerce") turns values that cannot be converted into NaN. Dropping them with dropna yields the correct distribution.
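Since coercion discards bad values silently, it can help to count how many were lost before plotting. A minimal sketch on the same raw data; the counter names are chosen here for illustration.

```python
import pandas as pd

raw = pd.DataFrame({"price": ["100", "200", "x", None]})
prices = pd.to_numeric(raw["price"], errors="coerce")

n_missing_before = int(raw["price"].isna().sum())   # already missing (None)
n_missing_after = int(prices.isna().sum())          # missing plus unparseable
n_coerced = n_missing_after - n_missing_before      # strings that failed to parse

print(f"unparseable values turned into NaN: {n_coerced}")
```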

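Design the range and granularity for the purpose

For test scores, fix the range to 0–100 and use a meaningful granularity such as 5-point steps.
For age, 0–100 in 5-year steps is natural. Choose bins to match the purpose (policy design, reporting).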
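A fixed step such as "5 points" is easiest to express as explicit bin edges instead of a bin count. The sketch below builds the edges with np.arange and counts with np.histogram; the same edges array can be passed as bins to plt.hist or plot.hist.

```python
import numpy as np

scores = np.array([55, 60, 65, 70, 72, 75, 78, 80, 82, 85, 88, 90, 92, 95, 100])

edges = np.arange(0, 101, 5)    # 0, 5, 10, ..., 100 -> 20 bins of width 5
counts, _ = np.histogram(scores, bins=edges)

# Each bin is [left, right), except the last one, which also includes 100.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    if n:
        print(f"{left:>3}-{right:<3}: {n}")
```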

Use density to compare distributions with different sample sizes

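# groupA and groupB are numeric arrays of different sizes (synthetic data here)
groupA = np.random.normal(70, 10, 300)
groupB = np.random.normal(80, 8, 120)
plt.hist(groupA, bins=20, density=True, alpha=0.6, label="A (n=300)")
plt.hist(groupB, bins=20, density=True, alpha=0.6, label="B (n=120)")
plt.legend()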

density=True normalizes the area to 1, so you compare shapes only, without being misled by the difference in sample size.
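When bins is given as a number, each hist call derives its own edges from its own data, so the bars of two groups may not line up. One way around this, sketched below with synthetic data and a fixed seed, is to compute a single set of edges from the pooled values and reuse it for both groups.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, synthetic data for illustration
group_a = rng.normal(70, 10, 300)
group_b = rng.normal(80, 8, 120)

# One set of 20 bins covering both groups.
pooled = np.concatenate([group_a, group_b])
edges = np.histogram_bin_edges(pooled, bins=20)

dens_a, _ = np.histogram(group_a, bins=edges, density=True)
dens_b, _ = np.histogram(group_b, bins=edges, density=True)

widths = np.diff(edges)
print((dens_a * widths).sum(), (dens_b * widths).sum())
```

Passing bins=edges to both plt.hist calls draws the two groups on identical bins.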

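Applications (relationship between two variables, frequency tables, automatic bin width)

Histogram plus box plot side by side for a robust summary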

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8,4))
ax1.hist(df["score"], bins=15, color="#4e79a7", alpha=0.8); ax1.set_title("Histogram")
ax2.boxplot(df["score"], vert=False); ax2.set_title("Box plot")
plt.tight_layout(); plt.show()

A box plot shows outliers, quartiles, and the median at a glance. Looking at both plots speeds up understanding of the distribution.
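The numbers behind a box plot can also be computed directly. This sketch assumes the common 1.5 × IQR whisker rule, which is matplotlib's default, and uses the sample scores.

```python
import pandas as pd

scores = pd.Series([55, 60, 65, 70, 72, 75, 78, 80, 82, 85, 88, 90, 92, 95, 100])

q1, median, q3 = scores.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1                      # interquartile range (height of the box)
lower_fence = q1 - 1.5 * iqr       # points below this are drawn as outliers
upper_fence = q3 + 1.5 * iqr       # points above this are drawn as outliers
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

print(q1, median, q3, len(outliers))
```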

When you need the frequency table (count per class) as numbers

import numpy as np

counts, edges = np.histogram(df["score"], bins=10, range=(0,100))
table = pd.DataFrame({"bin_left": edges[:-1], "bin_right": edges[1:], "count": counts})
print(table)

This is the pattern to use when you want to put a histogram into a report as a table.
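An alternative sketch builds the table with pd.cut, which labels each class as an interval and makes it easy to add relative frequencies. The edge convention differs from np.histogram: pd.cut classes are closed on the right by default, for example (50, 60], so values that sit exactly on an edge are counted one class lower.

```python
import numpy as np
import pandas as pd

scores = pd.Series([55, 60, 65, 70, 72, 75, 78, 80, 82, 85, 88, 90, 92, 95, 100])
edges = np.arange(0, 101, 10)

# include_lowest=True keeps a score of exactly 0 inside the first class.
classes = pd.cut(scores, bins=edges, include_lowest=True)

table = classes.value_counts().sort_index().rename("count").to_frame()
table["share"] = table["count"] / table["count"].sum()
print(table)
```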

Automatic bin width selection (the Freedman–Diaconis rule of thumb)

import numpy as np

x = df["score"].dropna().values
q75, q25 = np.percentile(x, [75, 25])
iqr = q75 - q25
bin_width = 2 * iqr / np.cbrt(len(x))  # FD rule
bins = max(1, int(np.ceil((x.max() - x.min()) / bin_width)))
plt.hist(x, bins=bins)

This is useful when you want a guideline for a bin count that is reasonably smooth without being overly noisy.
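NumPy also ships this rule as a named estimator, so the manual calculation can be cross-checked or replaced. A minimal sketch using np.histogram_bin_edges with bins="fd" on the sample scores:

```python
import numpy as np

x = np.array([55, 60, 65, 70, 72, 75, 78, 80, 82, 85, 88, 90, 92, 95, 100], dtype=float)

edges = np.histogram_bin_edges(x, bins="fd")   # Freedman-Diaconis estimator
n_bins = len(edges) - 1

print(n_bins, edges)
```

plt.hist accepts the same string, as in plt.hist(x, bins="fd"). Other named estimators such as "sturges" and "auto" can be tried the same way.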

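Summary ("correct dtype → appropriate bins → range and density chosen for the purpose"; show outliers and multimodality honestly)

A histogram is the basic chart for grasping the shape of a distribution quickly. First convert the data to numeric, then adjust bins by experiment, and choose range and density for the purpose. Present outlier handling honestly with two figures, and consider segmenting when there are several peaks. Use transparency to make comparisons readable, and add cumulative plots, log scales, or density curves where they help. Following this pattern, even beginners can reliably produce distribution plots that communicate.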
