Python | Data Processing: Scatter Plots

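Overview (a scatter plot is the basic chart for grasping the relationship between two variables at a glance)
Basic plotting (three ways: pandas, matplotlib, seaborn)
Color, size, and transparency in depth (basic techniques for layering information)
Regression lines and reading correlation (grasp the relationship from "slope" and "spread")
Practical examples (quality checks, clusters, relationships in time-series data, annotations)
Common pitfalls (overplotting, scale, missing values, categories, saving)
Summary ("correct dtypes → basic plot → add information with color and size → confirm the relationship with regression")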

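Overview (a scatter plot is the basic chart for grasping the relationship between two variables at a glance)

A scatter plot puts X (the explanatory variable) on the horizontal axis and Y (the response variable) on the vertical axis, and draws each row as one point. The arrangement of the points reveals positive correlation, negative correlation, or no correlation, as well as outliers and clusters (groups of points). Beginners who learn three things can quickly make charts that are usable in practice: how to draw the plot (pandas, matplotlib, or seaborn), how to adjust color, size, and transparency, and how to read regression lines and correlation.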

Basic plotting (three ways: pandas, matplotlib, seaborn)

Quick plotting with pandas (DataFrame.plot.scatter)

import pandas as pd

df = pd.DataFrame({
    "x": [5, 7, 8, 7, 6, 9, 5, 6, 7, 8],
    "y": [99, 86, 87, 88, 100, 86, 103, 87, 94, 78]
})

ax = df.plot.scatter(x="x", y="y", figsize=(6,4), title="Basic scatter")
ax.set_xlabel("X")
ax.set_ylabel("Y")

A few lines are enough to draw a basic scatter plot. Adding axis labels and a title alone makes the plot much easier to read.

Flexible plotting with matplotlib (plt.scatter)

import matplotlib.pyplot as plt

plt.figure(figsize=(6,4))
plt.scatter(df["x"], df["y"], s=50, c="#4e79a7", alpha=0.7, marker="o")
plt.xlabel("X"); plt.ylabel("Y"); plt.title("Matplotlib scatter")
plt.grid(True, alpha=0.3)
plt.show()

Parameters such as s (size), c (color), alpha (transparency), and marker (shape) can be customized freely.

Easy color coding and regression lines with seaborn

import seaborn as sns

df = pd.DataFrame({
    "x": [5,7,8,7,6,9,5,6,7,8],
    "y": [99,86,87,88,100,86,103,87,94,78],
    "category": ["A","B","A","B","A","B","A","A","B","B"]
})

sns.scatterplot(data=df, x="x", y="y", hue="category", palette="Set2")

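Color coding by category (hue) and automatic legend generation are simple. This is handy when you want a presentable figure quickly.

Color, size, and transparency in depth (basic techniques for layering information)

Color by category (also possible with pandas)

# Convert categories to numeric codes and use them for color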
df["category_code"] = df["category"].astype("category").cat.codes
ax = df.plot.scatter(x="x", y="y", c="category_code", cmap="Set2", s=60, alpha=0.8)

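A column of category labels (strings) cannot be used as color values as-is, so convert it to numeric codes and color the points through cmap. Build the legend separately, or use seaborn to get it automatically.

Encode continuous values as size or color (adding a third piece of information)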

import numpy as np

df["size"] = np.linspace(30, 200, len(df))  # e.g. the scale of each item
df["color_val"] = df["y"]                  # e.g. used for color intensity

plt.figure(figsize=(6,4))
plt.scatter(df["x"], df["y"], s=df["size"], c=df["color_val"], cmap="viridis", alpha=0.7)
plt.colorbar(label="Y intensity")  # the colorbar gives a reference for the values

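Giving meaning to size (s) and color (c) lets a single plot convey multidimensional information.

Reduce overplotting with transparency (alpha) and jitter

# Add small random values to shift overlapping points apart (jitter)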
jitter_x = df["x"] + np.random.normal(scale=0.05, size=len(df))
jitter_y = df["y"] + np.random.normal(scale=0.05, size=len(df))
plt.scatter(jitter_x, jitter_y, alpha=0.5)

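For data with overlapping points, lowering alpha and adding a little jitter makes the distribution easier to see.

Regression lines and reading correlation (grasp the relationship from "slope" and "spread")

Overlay a regression line to visualize the trend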

import seaborn as sns
sns.regplot(data=df, x="x", y="y", scatter_kws={"alpha":0.6}, line_kws={"color":"red"})

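regplot overlays a linear regression line. The sign of the slope indicates the direction of the relationship, and how tightly the points cluster around the line gives a rough sense of its strength.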
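
To put numbers on the slope and the spread, a least-squares line can also be fitted directly. The following is a minimal sketch that uses np.polyfit on the article's sample data; it is an illustration of the idea, not necessarily the exact computation regplot performs internally, and the variable names are assumptions.

```python
import numpy as np

# Sample data from the basic examples in this article
x = np.array([5, 7, 8, 7, 6, 9, 5, 6, 7, 8])
y = np.array([99, 86, 87, 88, 100, 86, 103, 87, 94, 78])

# Fit y = slope * x + intercept by ordinary least squares
slope, intercept = np.polyfit(x, y, deg=1)

# Residuals: vertical distance of each point from the fitted line
residuals = y - (slope * x + intercept)
spread = residuals.std()

print(f"slope={slope:.2f}, intercept={intercept:.2f}, residual std={spread:.2f}")
```

A negative slope means Y tends to fall as X rises; a large residual spread relative to the range of Y means the line explains little of the variation.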

Confirm "quantitatively" with the correlation coefficient

corr = df[["x","y"]].corr().loc["x","y"]
print(f"Correlation (Pearson): {corr:.2f}")

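Confirm the relationship you saw in the scatter plot with the correlation coefficient, which ranges from -1 to 1. A value near 0 suggests no linear correlation, and an absolute value of 0.7 or more is a rough guide for a fairly strong correlation.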
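
To see where that number comes from, Pearson's coefficient can be computed by hand as the covariance of the two columns divided by the product of their standard deviations. The sketch below does this on the article's sample data and compares the result with pandas; the names df_xy, r_manual, and r_pandas are illustrative.

```python
import numpy as np
import pandas as pd

df_xy = pd.DataFrame({
    "x": [5, 7, 8, 7, 6, 9, 5, 6, 7, 8],
    "y": [99, 86, 87, 88, 100, 86, 103, 87, 94, 78],
})

# Deviations from the mean for each column
dx = df_xy["x"] - df_xy["x"].mean()
dy = df_xy["y"] - df_xy["y"].mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
r_pandas = df_xy["x"].corr(df_xy["y"])  # Pearson is the default method

print(f"manual: {r_manual:.4f}, pandas: {r_pandas:.4f}")
```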

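Be aware of the influence of outliers (they pull the mean and the regression line)

A single outlier can change the slope of the regression line and the correlation substantially. Look for outliers in the scatter plot and, if needed, consider countermeasures such as robust regression, removing outliers, or trimming at quantiles.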
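
The effect is easy to reproduce. The sketch below uses made-up numbers (not data from this article) to compare Pearson's correlation with and without one extreme point. Spearman's rank correlation is included as one example of a measure that is less sensitive to outliers; it is added here for illustration and is not one of the countermeasures listed above.

```python
import pandas as pd

# Hypothetical data: a clean upward trend
clean = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6, 7, 8],
    "y": [2, 4, 5, 4, 6, 7, 8, 9],
})

# The same data plus one extreme point (an assumed outlier)
with_outlier = pd.concat(
    [clean, pd.DataFrame({"x": [9], "y": [-20]})],
    ignore_index=True,
)

r_clean = clean.corr().loc["x", "y"]            # Pearson, clean data
r_outlier = with_outlier.corr().loc["x", "y"]   # Pearson, one outlier added
rho_outlier = with_outlier.corr(method="spearman").loc["x", "y"]

print(f"Pearson without outlier: {r_clean:.2f}")
print(f"Pearson with outlier:    {r_outlier:.2f}")
print(f"Spearman with outlier:   {rho_outlier:.2f}")
```

In this toy data a single point is enough to flip the sign of Pearson's coefficient, while the rank-based measure stays positive.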

Practical examples (quality checks, clusters, relationships in time-series data, annotations)

Quality check: align numeric dtypes, then plot

df = pd.DataFrame({"height": ["160","175","180","170"], "weight": ["55","68","75","60"]})
df["height"] = pd.to_numeric(df["height"], errors="coerce")
df["weight"] = pd.to_numeric(df["weight"], errors="coerce")

ax = df.plot.scatter(x="height", y="weight", title="Height vs Weight")
ax.set_xlabel("Height (cm)"); ax.set_ylabel("Weight (kg)")

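Converting the dtype to numeric first is the basic step. If the values stay as strings, the plot will not be drawn correctly.

Visualizing clusters (color by category)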

df = pd.DataFrame({
    "x": [1,2,2,8,9,9],
    "y": [1,2,3,8,9,7],
    "cluster": ["A","A","A","B","B","B"]
})
sns.scatterplot(data=df, x="x", y="y", hue="cluster", palette="Set2")

Coloring by clustering result or category shows at a glance how well the groups separate.

Relationships in time-series data (example: sales and ad spend)

df = pd.DataFrame({
    "date": pd.date_range("2025-01-01", periods=8, freq="D"),
    "sales": [100,120,90,140,110,130,115,150],
    "ads":   [20,30,15,40,25,35,28,45]
})
ax = df.plot.scatter(x="ads", y="sales", title="Ads vs Sales")
ax.set_xlabel("Ad spend (k JPY)"); ax.set_ylabel("Sales (k JPY)")

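When you suspect a cause-and-effect relationship, first look at the "shape of the relationship" in a scatter plot, then dig deeper with time lags and regression.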
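
One way to sketch the lag check: shift the ad-spend column by a few days and recompute the correlation with sales. The data repeats the article's sample; the range of lags (0 to 2 days) and the name lag_corr are assumptions made for illustration.

```python
import pandas as pd

df_ts = pd.DataFrame({
    "date": pd.date_range("2025-01-01", periods=8, freq="D"),
    "sales": [100, 120, 90, 140, 110, 130, 115, 150],
    "ads":   [20, 30, 15, 40, 25, 35, 28, 45],
})

# Correlate sales with the ad spend from `lag` days earlier.
# shift() leaves NaN in the first `lag` rows; Series.corr skips those pairs.
lag_corr = {
    lag: df_ts["sales"].corr(df_ts["ads"].shift(lag))
    for lag in range(3)  # lags of 0, 1, and 2 days (assumed range)
}

for lag, r in lag_corr.items():
    print(f"lag={lag} day(s): r={r:.2f}")
```

With only eight rows the lagged values rest on very few pairs, so treat them as a prompt for further analysis rather than as evidence of causation.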

Highlight meaningful points with annotations

import matplotlib.pyplot as plt

ax = df.plot.scatter(x="ads", y="sales")
max_idx = df["sales"].idxmax()
ax.annotate("Peak", (df.loc[max_idx, "ads"], df.loc[max_idx, "sales"]),
            xytext=(10,10), textcoords="offset points",
            arrowprops=dict(arrowstyle="->"))

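Pointing out peaks and exceptional points with annotations makes a report more persuasive.

Common pitfalls (overplotting, scale, missing values, categories, saving)

When points overlap heavily, use transparency, size, and binning

Lower alpha and reduce s; if that is still not enough, show density with hexbin (a hexagonal heatmap) or a 2D histogram.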
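
As a rough illustration of the hexbin option, the sketch below bins 5,000 synthetic points (random data generated only for this example) into hexagons and colors each cell by its count. The gridsize and colormap values are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, heavily overlapping data (assumed for illustration)
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = 0.6 * x + rng.normal(scale=0.8, size=5000)

fig, ax = plt.subplots(figsize=(6, 4))
# mincnt=1 leaves empty hexagons blank instead of painting them
hb = ax.hexbin(x, y, gridsize=30, cmap="viridis", mincnt=1)
fig.colorbar(hb, ax=ax, label="Points per bin")
ax.set_xlabel("X"); ax.set_ylabel("Y"); ax.set_title("Hexbin density")
```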

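Extremely different scales lead to misreading

Set the axis ranges (xlim/ylim) appropriately and, where needed, use a logarithmic scale (log) or standardization to bring the variables into a comparable range.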
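
The difference a log scale makes can be seen by drawing the same points on linear and on log-log axes side by side. The data below is hypothetical and chosen only because it spans several orders of magnitude.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data spanning several orders of magnitude
df_scale = pd.DataFrame({
    "population": [1_200, 8_500, 40_000, 310_000, 2_700_000, 13_000_000],
    "stores": [1, 3, 9, 40, 210, 900],
})

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

# Left: linear axes, where the small values pile up near the origin
ax_lin.scatter(df_scale["population"], df_scale["stores"])
ax_lin.set_title("Linear axes")

# Right: the same points on log-log axes
ax_log.scatter(df_scale["population"], df_scale["stores"])
ax_log.set_xscale("log")
ax_log.set_yscale("log")
ax_log.set_title("Log-log axes")

for ax in (ax_lin, ax_log):
    ax.set_xlabel("Population"); ax.set_ylabel("Stores")
fig.tight_layout()
```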

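Handle missing values beforehand

Convert unparseable values to NaN with to_numeric(errors="coerce"), then drop them with dropna or fill them with fillna. When missing values are mixed in, points drop out and the color and size mappings can break.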
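
A minimal sketch of that cleaning step follows, assuming a small made-up table in which "n/a" and an empty string stand in for bad input. Counting the NaN values before dropping them shows how much data the plot will lose.

```python
import pandas as pd

# Hypothetical raw input with two unparseable entries
raw = pd.DataFrame({
    "height": ["160", "175", "n/a", "170", "182"],
    "weight": ["55", "68", "75", "", "80"],
})

cols = ["height", "weight"]

# Unparseable strings become NaN instead of raising an error
numeric = raw[cols].apply(pd.to_numeric, errors="coerce")

# Count missing values per column before deciding how to handle them
n_missing = numeric.isna().sum()

# Keep only rows where both plotted columns are present
cleaned = numeric.dropna(subset=cols)

print(n_missing)
print(f"{len(raw) - len(cleaned)} of {len(raw)} rows dropped")
```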

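Encoding is the key to handling categories

To color by category in pandas, convert to numbers with astype("category").cat.codes, then pass c=codes, cmap=…. Build the legend yourself, or let seaborn generate it automatically.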
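
For the "build the legend yourself" route, one common matplotlib pattern is to draw one scatter call per category and give each call a label. This is an alternative to the cat.codes approach rather than a continuation of it, sketched here on the article's sample data; the name df_cat is illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df_cat = pd.DataFrame({
    "x": [5, 7, 8, 7, 6, 9, 5, 6, 7, 8],
    "y": [99, 86, 87, 88, 100, 86, 103, 87, 94, 78],
    "category": ["A", "B", "A", "B", "A", "B", "A", "A", "B", "B"],
})

fig, ax = plt.subplots(figsize=(6, 4))

# One scatter call per category: each call takes the next color in the
# default cycle and contributes one labeled entry to the legend
for name, group in df_cat.groupby("category"):
    ax.scatter(group["x"], group["y"], s=60, alpha=0.8, label=name)

ax.legend(title="category")
ax.set_xlabel("X"); ax.set_ylabel("Y")
```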

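Finishing and saving

State the title, axis labels, grid, and legend explicitly, and adjust margins with plt.tight_layout(). For reports, export with plt.savefig("scatter.png", dpi=200) for a high-resolution raster image, or use SVG/PDF for vector output.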
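
Put together, the finishing steps might look like the sketch below. It writes to a temporary directory so that it runs anywhere; the output location and file names are assumptions, and in a real report you would pass your own path.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter([5, 7, 8, 7, 6], [99, 86, 87, 88, 100], label="samples")
ax.set_title("Basic scatter")
ax.set_xlabel("X"); ax.set_ylabel("Y")
ax.grid(True, alpha=0.3)
ax.legend()
fig.tight_layout()  # adjust margins so labels are not clipped

out_dir = tempfile.mkdtemp()  # assumed output location for this sketch
png_path = os.path.join(out_dir, "scatter.png")
svg_path = os.path.join(out_dir, "scatter.svg")

fig.savefig(png_path, dpi=200)  # raster: dpi controls pixel density
fig.savefig(svg_path)           # vector: format inferred from the extension
```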

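Summary ("correct dtypes → basic plot → add information with color and size → confirm the relationship with regression")

A scatter plot is the quickest entry point for visualizing the relationship between two variables. First convert the dtypes to numeric, then draw the plot with pandas, matplotlib, or seaborn. Use color for categories and size or color intensity for continuous values, reduce overplotting with transparency, and quantify the relationship with a regression line and the correlation coefficient where needed. Tidy the axis ranges and missing values, and mark the key points with annotations. With these steps alone, even beginners can reliably produce scatter plots that communicate.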