Python | Web / API: CSS Selectors

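Overview (CSS selectors: a powerful way to target structure concisely and precisely)
Basic usage (the essentials)
1. select and select_one: the shortest route
2. Selecting by tag, class, and id
Targeting structure in depth (child, descendant, sibling)
1. Child (direct) vs. descendant (anything nested)
2. Sibling and adjacent elements
Attributes, partial matches, and pseudo-class selectors (flexible filtering)
Practical patterns (scoping, safe extraction, speed)
Worked examples (from the staples to one step beyond)
Summary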

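Overview (CSS selectors: a powerful way to target structure concisely and precisely)

BeautifulSoup's CSS selector methods (select / select_one) let you express tag names, classes, ids, and parent-child relationships in a single string. Beginners should focus on four points: writing the shortest selector that is still correct, the difference in return values (one element vs. a list), attribute and partial-match selectors, and hierarchy selectors (parent-child and sibling). Learn the basic forms first (tag, .class, #id), then the structural selectors (child / descendant / sibling), and finally attribute and partial-match selectors. With those in hand, you can pull data out of complex pages with short, safe selectors.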

Basic usage (the essentials)

select and select_one: the shortest route

# pip install requests beautifulsoup4 lxml
import requests
from bs4 import BeautifulSoup

url = "https://httpbin.org/html"
r = requests.get(url, timeout=5)
r.raise_for_status()
soup = BeautifulSoup(r.text, "lxml")

# First match only (for elements expected once, such as a heading)
h1 = soup.select_one("h1")
print(h1.get_text(strip=True) if h1 else "")

# Every match (for elements expected many times, such as list items)
links = soup.select("a[href]")
print([a.get("href") for a in links][:5])

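select_one returns the first match, or None if nothing matches. select returns a list of every match, or an empty list if nothing matches. The key point is that the follow-up safety handling differs: a None check for select_one, and code that tolerates an empty list for select.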
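The difference in return values is easy to check on a tiny inline document. The following is a minimal sketch, not part of the original walkthrough: the HTML string is made up for illustration, and it uses the built-in "html.parser" only so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; "html.parser" is used so the sketch has no lxml dependency.
html = "<ul><li>A</li><li>B</li><li>C</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("li")        # a Tag, because at least one <li> exists
missing = soup.select_one("table")   # None, because nothing matches
every = soup.select("li")            # list-like result with all three <li> tags
none_found = soup.select("table")    # empty list, never None

print(first.get_text())                  # A
print(missing is None)                   # True
print([li.get_text() for li in every])   # ['A', 'B', 'C']
print(len(none_found))                   # 0

# select also accepts a limit keyword that caps the number of results.
print([li.get_text() for li in soup.select("li", limit=2)])  # ['A', 'B']
```

Because select never returns None, a for loop over its result is safe without any guard, while anything taken from select_one needs a check before calling methods on it.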

Selecting by tag, class, and id

from bs4 import BeautifulSoup

html = """
<ul id="news">
  <li><a class="title" href="/a">A News</a></li>
  <li><a class="title" href="/b">B News</a></li>
</ul>
"""
soup = BeautifulSoup(html, "lxml")

# Tag name
print([a.get_text() for a in soup.select("a")])

# class (dot notation)
print([a.get_text() for a in soup.select("a.title")])

# id (hash notation) combined with a tag
print([li.get_text(strip=True) for li in soup.select("ul#news li")])

A tag name is written as-is, a class as ".class", and an id as "#id". Combine them to narrow the target with the shortest selector that still works.
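Two further ways of combining the basic forms come up often: chaining several classes on one element, and grouping alternatives with a comma. The sketch below uses made-up markup (the "featured" class and the "#main" container are assumptions for illustration) and the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<div id="main">
  <h1 class="headline">Top</h1>
  <a class="title featured" href="/a">A News</a>
  <a class="title" href="/b">B News</a>
  <h2>Sub</h2>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Chained classes: the element must carry both classes.
featured = [a.get_text() for a in soup.select("a.title.featured")]
print(featured)   # ['A News']

# Comma-separated group: elements matching either selector, in document order.
headings = [h.get_text() for h in soup.select("#main h1, #main h2")]
print(headings)   # ['Top', 'Sub']
```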

Targeting structure in depth (child, descendant, sibling)

Child (direct) vs. descendant (anything nested)

html = """
<div id="a">
  <p class="x">direct</p>
  <div><p class="x">nested</p></div>
</div>
"""
soup = BeautifulSoup(html, "lxml")

# Direct children only (child combinator)
print([p.get_text() for p in soup.select("#a > p.x")])  # → ['direct']

# Every descendant (descendant combinator)
print([p.get_text() for p in soup.select("#a p.x")])    # → ['direct', 'nested']

">" selects direct children only, while a space selects every descendant. Picking the one that matches your intent reduces false matches.
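The same distinction applies when you already hold a container element and call select on it. The sketch below assumes the :scope pseudo-class provided by Soup Sieve (the selector engine behind select), where :scope stands for the element select was called on. The markup mirrors the example above, and "html.parser" is used so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

html = """
<div id="a">
  <p class="x">direct</p>
  <div><p class="x">nested</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
container = soup.select_one("#a")

# ":scope > p.x" means: <p class="x"> elements that are direct children of container.
direct_only = [p.get_text() for p in container.select(":scope > p.x")]

# A bare "p.x" searches every descendant of container.
all_nested = [p.get_text() for p in container.select("p.x")]

print(direct_only)  # ['direct']
print(all_nested)   # ['direct', 'nested']
```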

Sibling and adjacent elements

html = """
<div id="wrap">
  <h2 id="hdr">Heading</h2>
  <p class="desc">Body 1</p>
  <p class="desc">Body 2</p>
</div>
"""
soup = BeautifulSoup(html, "lxml")

# The paragraph immediately after the heading (adjacent sibling)
first = soup.select_one("#hdr + p.desc")
print(first.get_text(strip=True) if first else "")

# Every paragraph that follows the heading (general sibling)
all_after = [p.get_text(strip=True) for p in soup.select("#hdr ~ p.desc")]
print(all_after)

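"+" selects the adjacent sibling (the element immediately after), and "~" selects all following siblings. They give you a short way to target the element right after a heading, or the run of elements that follows it.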
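One consequence worth seeing in action is that "+" only matches when the target is the very next element. The sketch below inserts a hypothetical div.ad between the heading and the paragraphs (that element is an assumption added for illustration) and uses "html.parser".

```python
from bs4 import BeautifulSoup

# Same shape as the example above, plus a made-up <div class="ad"> in between.
html = """
<div id="wrap">
  <h2 id="hdr">Heading</h2>
  <div class="ad">Ad</div>
  <p class="desc">Body 1</p>
  <p class="desc">Body 2</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# "+" fails here: the element right after #hdr is div.ad, not p.desc.
adjacent = soup.select_one("#hdr + p.desc")
print(adjacent is None)  # True

# "~" still finds every later sibling that matches.
following = [p.get_text(strip=True) for p in soup.select("#hdr ~ p.desc")]
print(following)  # ['Body 1', 'Body 2']
```

When a page may insert banners or wrappers between elements, "~" combined with select_one is the more forgiving choice, at the cost of matching paragraphs further away.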

Attributes, partial matches, and pseudo-class selectors (flexible filtering)

Attribute presence and value

html = """
<div>
  <a href="/docs/a" data-type="doc">Doc A</a>
  <a href="/blog/b">Blog B</a>
</div>
"""
soup = BeautifulSoup(html, "lxml")

# Only links that have an href attribute
print([a["href"] for a in soup.select('a[href]')])

# A specific attribute value
print([a.get_text() for a in soup.select('a[data-type="doc"]')])

[attr] tests for presence, and [attr="value"] tests for an exact value. Leaning on attributes that carry the HTML's meaning improves precision.

Prefix, suffix, and substring matches on class (CSS attribute selectors)

html = """
<div>
  <p class="hello">Hello</p>
  <p class="morning">Good morning</p>
  <p class="night">Good night</p>
</div>
"""
soup = BeautifulSoup(html, "lxml")

# Prefix match (^=), suffix match ($=), substring match (*=)
print([p.get_text() for p in soup.select('p[class^="he"]')])  # matches class="hello"
print([p.get_text() for p in soup.select('p[class$="ing"]')]) # matches class="morning"
print([p.get_text() for p in soup.select('p[class*="igh"]')]) # matches class="night"

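The attribute-selector matches (^=, $=, *=) are most useful on sites with consistent naming conventions. Note that they compare against the whole class attribute string, so on an element with several classes, such as class="btn hello", [class^="he"] does not match.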
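Attribute selectors on class compare against the full attribute string rather than each class name, which matters as soon as an element carries more than one class. The sketch below illustrates this with made-up markup (the "btn" class is an assumption) and "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <p> has two classes.
html = """
<div>
  <p class="hello">one class</p>
  <p class="btn hello">two classes</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def texts(selector):
    """Texts of every element matching selector."""
    return [p.get_text() for p in soup.select(selector)]

# ^= looks at the start of the whole string "btn hello", so the second <p> is missed.
print(texts('p[class^="he"]'))     # ['one class']

# *= is a plain substring test, so both match.
print(texts('p[class*="hello"]'))  # ['one class', 'two classes']

# ~= and the dot notation both test whole whitespace-separated class names.
print(texts('p[class~="hello"]'))  # ['one class', 'two classes']
print(texts('p.hello'))            # ['one class', 'two classes']
```

When the goal is "has this class", the dot notation is the safer default. The ^=, $=, and *= forms fit generated names such as a shared prefix.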

Positional selectors such as nth-child (only when needed)

html = """
<ul id="items">
  <li>A</li><li>B</li><li>C</li><li>D</li>
</ul>
"""
soup = BeautifulSoup(html, "lxml")

# The second item only
print(soup.select_one("#items li:nth-child(2)").get_text())  # B

# Even-numbered items
print([li.get_text() for li in soup.select("#items li:nth-child(even)")])  # ['B', 'D']

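The nth-child family is a handy way to select by position. However, positions shift easily when elements are inserted into the DOM, so where stability matters, prefer attribute or text conditions.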
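A common source of off-by-one surprises is that nth-child counts every sibling element, not only those with the named tag. The sketch below contrasts it with nth-of-type on a made-up fragment (the "#box" container is an assumption for illustration), using "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: an <h3> sits in front of the paragraphs.
html = """
<div id="box">
  <h3>Title</h3>
  <p>first</p>
  <p>second</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# p:nth-child(2): a <p> that is the 2nd child of its parent (the <h3> counts as the 1st).
by_child = soup.select_one("#box p:nth-child(2)")
print(by_child.get_text())   # first

# p:nth-of-type(2): the 2nd <p> among its siblings, ignoring other tags.
by_type = soup.select_one("#box p:nth-of-type(2)")
print(by_type.get_text())    # second

# p:nth-child(1) matches nothing, because the 1st child is the <h3>.
no_match = soup.select_one("#box p:nth-child(1)")
print(no_match is None)      # True
```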

Practical patterns (scoping, safe extraction, speed)

Narrow the scope to the parent container first

html = """
<section id="products">
  <div class="item"><h3>Alpha</h3><span class="price">¥1,200</span></div>
  <div class="item"><h3>Beta</h3><span class="price">¥980</span></div>
</section>
<section id="news"><div class="item"><h3>Noise</h3></div></section>
"""
soup = BeautifulSoup(html, "lxml")

# Restrict to the products section first, then pick out the children
for card in soup.select("section#products div.item"):
    name = card.select_one("h3")
    price = card.select_one("span.price")
    print(
        (name.get_text(strip=True) if name else ""),
        (price.get_text(strip=True) if price else "")
    )

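Rather than running select against the whole page right away, narrow down in the order "parent container → child elements". This greatly reduces false matches, and it can also be faster, since each per-item search only walks that container's subtree instead of the whole document.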
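The container-first pattern can be wrapped in a small function that returns plain dictionaries. This is a sketch under stated assumptions: extract_products is a hypothetical helper name, the markup is the sample from this section, and "html.parser" is used so it runs without lxml.

```python
from bs4 import BeautifulSoup

def extract_products(html):
    """Return name/price dicts for the items inside section#products only."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one("section#products")
    if container is None:          # the section is absent: nothing to extract
        return []
    rows = []
    for card in container.select("div.item"):
        name = card.select_one("h3")
        price = card.select_one("span.price")
        rows.append({
            "name": name.get_text(strip=True) if name else "",
            "price": price.get_text(strip=True) if price else "",
        })
    return rows

html = """
<section id="products">
  <div class="item"><h3>Alpha</h3><span class="price">¥1,200</span></div>
  <div class="item"><h3>Beta</h3><span class="price">¥980</span></div>
</section>
<section id="news"><div class="item"><h3>Noise</h3></div></section>
"""
products = extract_products(html)
print(products)
```

The "Noise" item in section#news never appears in the result, because every search after the first is made relative to the products container.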

Handle None and the empty list safely

# soup is the object parsed in the previous example
card = soup.select_one("div.item")      # may be None
h3 = card.select_one("h3") if card else None
title = h3.get_text(strip=True) if h3 else ""

cards = soup.select("div.item")         # [] when nothing matches
titles = [c.select_one("h3").get_text(strip=True) for c in cards if c.select_one("h3")]

select_one gives you None and select gives you []. Write your extraction code around that difference and it will not crash on missing elements.
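The repeated "if found else default" checks can be folded into one helper. The sketch below is one possible shape: select_text is a hypothetical name rather than a BeautifulSoup API, and the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

def select_text(node, selector, default=""):
    """Text of the first match under node, or default when node or the match is missing."""
    found = node.select_one(selector) if node is not None else None
    return found.get_text(strip=True) if found is not None else default

# Hypothetical markup: the second item has no <h3>.
html = '<div class="item"><h3> Alpha </h3></div><div class="item"></div>'
soup = BeautifulSoup(html, "html.parser")

titles = [select_text(c, "h3", default="(no title)") for c in soup.select("div.item")]
print(titles)  # ['Alpha', '(no title)']

# Passing None (for example a failed select_one) is tolerated as well.
fallback = select_text(soup.select_one("div.missing"), "h3", default="n/a")
print(fallback)  # n/a
```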

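Parser, timeout, and garbled-text (mojibake) countermeasures

"lxml" is the recommended parser. Always set a timeout when fetching, and call resp.raise_for_status() to turn HTTP failures into exceptions. If a Japanese site comes out garbled, overwrite resp.encoding with the guess in resp.apparent_encoding before parsing.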
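These checks can be collected into one function that takes a requests-style response. The sketch below is a rough outline under stated assumptions: soup_from_response is a hypothetical helper, and the rule "re-guess only when the encoding is missing or ISO-8859-1" is a heuristic chosen here, based on requests falling back to ISO-8859-1 for text responses whose headers declare no charset. The parser defaults to "html.parser" so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

def soup_from_response(resp, parser="html.parser"):
    """Parse a requests-style response, re-guessing the encoding when it looks unreliable."""
    resp.raise_for_status()  # raise on 4xx / 5xx instead of parsing an error page
    # Heuristic (an assumption of this sketch): a missing or ISO-8859-1 encoding
    # usually means the server declared no charset, which garbles Japanese pages.
    if resp.encoding is None or resp.encoding.lower() == "iso-8859-1":
        resp.encoding = resp.apparent_encoding
    return BeautifulSoup(resp.text, parser)

# Typical call (needs network access, so it is left commented out):
# import requests
# resp = requests.get("https://httpbin.org/html", timeout=5)
# soup = soup_from_response(resp, parser="lxml")
```

Keeping the network call outside the function makes the parsing step easy to exercise with a stub response in tests.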

Worked examples (from the staples to one step beyond)

Example 1: titles and links from a news list

from bs4 import BeautifulSoup

html = """
<ul id="news">
  <li><a class="title" href="/a">A News</a><span class="date">2025-12-01</span></li>
  <li><a class="title" href="/b">B News</a><span class="date">2025-12-02</span></li>
</ul>
"""
soup = BeautifulSoup(html, "lxml")
items = soup.select("ul#news li")

# Build one record per list item, falling back to "" for anything missing
records = []
for it in items:
    a = it.select_one("a.title[href]")
    d = it.select_one("span.date")
    records.append({
        "title": a.get_text(strip=True) if a else "",
        "href": a.get("href", "") if a else "",
        "date": d.get_text(strip=True) if d else ""
    })
print(records)

Example 2: improving precision with child vs. descendant

from bs4 import BeautifulSoup

html = """
<div id="a">
  <p class="x">direct</p>
  <div><p class="x">nested</p></div>
</div>
"""
soup = BeautifulSoup(html, "lxml")
print([p.get_text() for p in soup.select("#a > p.x")])  # direct child only
print([p.get_text() for p in soup.select("#a p.x")])    # direct child + nested

Example 3: attribute match, partial match, and position

from bs4 import BeautifulSoup

html = """
<div>
  <a href="/docs/a" data-type="doc">Doc A</a>
  <a href="/blog/b">Blog B</a>
  <ul id="items"><li>A</li><li>B</li><li>C</li><li>D</li></ul>
</div>
"""
soup = BeautifulSoup(html, "lxml")

print([a["href"] for a in soup.select('a[href]')])                  # attribute presence
print([a.get_text() for a in soup.select('a[data-type="doc"]')])    # exact value
print(soup.select_one("#items li:nth-child(2)").get_text())         # position

Example 4: picking up the adjacent description with sibling selectors

from bs4 import BeautifulSoup

html = """
<div id="wrap">
  <h2 id="hdr">Heading</h2>
  <p class="desc">Body 1</p>
  <p class="desc">Body 2</p>
</div>
"""
soup = BeautifulSoup(html, "lxml")
first = soup.select_one("#hdr + p.desc")
print(first.get_text(strip=True) if first else "")
rest = [p.get_text(strip=True) for p in soup.select("#hdr ~ p.desc")]
print(rest)

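Summary

CSS selectors are a core technique for expressing structure as a short string and picking out exactly the elements you want. Write safe extraction code around the difference in return values between select_one (one element or None) and select (a list, possibly empty). Start from the basics of tag, .class, and #id; specify structure with the child (>), descendant (space), and sibling (+ / ~) combinators; and sharpen precision with attribute matches, partial matches, and nth-child. Narrow the scope to a parent container first, and keep things stable with lxml, a timeout, raise_for_status, and encoding handling. With these patterns in hand, even beginners can extract just the information they need from complex pages, concisely and accurately.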