Excel VBA 逆引き集 | 類似判定(簡易版)

Excel VBA
スポンサーリンク

類似判定(簡易版)

「“完全一致”じゃないけれど、だいたい同じものを見つけたい」——削除や統合の前に“似ている候補”だけを絞るための、軽くて速い簡易テンプレです。まずは文字の正規化→部分一致やワイルドカード→スコア判定(トークン重なり)という順で使うと失敗が減ります。


正規化ユーティリティ(日本語向けの表記揺れ吸収)

Option Explicit

' 日本語の表記揺れを吸収して比較しやすくする
Public Function NormText(ByVal s As String) As String
    Dim t As String
    t = Trim$(s)
    ' 全角→半角(英数記号)
    t = StrConv(t, vbNarrow)
    ' ひらがな→カタカナ(どちらかに統一)
    t = StrConv(t, vbKatakana)
    ' 大文字化(英字)
    t = UCase$(t)
    ' スペース・タブ類の連続を単一スペースに(簡易)
    t = Replace(t, vbTab, " ")
    t = Replace(t, " ", " ") ' 全角スペース
    Do While InStr(t, "  ") > 0
        t = Replace(t, "  ", " ")
    Loop
    NormText = t
End Function
VB
  • 狙い: 全角・半角、ひらがな・カタカナ、余分なスペース、大小文字の違いを吸収して“ほぼ同じ”を拾いやすくします。
  • 使い所: データ読み込み時や比較直前にNormTextを必ずかませる。

まずは“簡易一致”で拾う(包含・前方一致・後方一致・ワイルドカード)

' 部分一致(含まれているか)
Public Function LikeContains(ByVal a As String, ByVal b As String) As Boolean
    LikeContains = (InStr(1, NormText(a), NormText(b), vbTextCompare) > 0)
End Function

' 前方一致
Public Function LikeStartsWith(ByVal a As String, ByVal prefix As String) As Boolean
    LikeStartsWith = (NormText(a) Like (NormText(prefix) & "*"))
End Function

' 後方一致
Public Function LikeEndsWith(ByVal a As String, ByVal suffix As String) As Boolean
    LikeEndsWith = (NormText(a) Like ("*" & NormText(suffix)))
End Function

' ワイルドカード(?=1文字、*=任意文字列、#=数字)
Public Function LikePattern(ByVal a As String, ByVal pattern As String) As Boolean
    LikePattern = (NormText(a) Like NormText(pattern))
End Function
VB
  • 使い分けのコツ
    • Contains: まずは“含まれている”で粗く拾う。
    • Starts/Ends: 型番や接尾辞が意味を持つデータで強い。
    • Pattern: ルールが分かっているときに最短(例「PRJ-####」)。

スコア式の“簡易類似”(トークン重なり率)

文字を単語(トークン)に分け、重なり率で似ている度合いを0〜1で評価。0.6以上なら“類似”など、しきい値で判定します。

' 文字列→トークン配列(スペースと記号で分割、空を除去)
Private Function Tokens(ByVal s As String) As Collection
    Dim c As New Collection, i As Long, x As Variant
    Dim clean As String: clean = NormText(s)
    ' よくある記号をスペースに変換(簡易版)
    Dim punct As Variant: punct = Array("-", "_", "/", ",", ".", "(", ")", "[", "]", "・", "/", "-", "―")
    For i = LBound(punct) To UBound(punct)
        clean = Replace(clean, punct(i), " ")
    Next
    ' 分割
    For Each x In Split(clean, " ")
        If Len(x) > 0 Then c.Add x
    Next
    Set Tokens = c
End Function

' Jaccard係数(重なり率):共通トークン数 / ユニークトークン総数
Public Function Similarity_Jaccard(ByVal a As String, ByVal b As String) As Double
    Dim ta As Collection: Set ta = Tokens(a)
    Dim tb As Collection: Set tb = Tokens(b)
    Dim setA As Object: Set setA = CreateObject("Scripting.Dictionary")
    Dim setB As Object: Set setB = CreateObject("Scripting.Dictionary")
    Dim i As Long
    For i = 1 To ta.Count: setA(ta(i)) = True: Next
    For i = 1 To tb.Count: setB(tb(i)) = True: Next
    Dim unionCount As Long, interCount As Long
    Dim k As Variant
    For Each k In setA.Keys
        If setB.Exists(k) Then interCount = interCount + 1
        unionCount = unionCount + 1
    Next
    For Each k In setB.Keys
        If Not setA.Exists(k) Then unionCount = unionCount + 1
    Next
    If unionCount = 0 Then Similarity_Jaccard = 0 Else Similarity_Jaccard = interCount / unionCount
End Function

' 類似判定(しきい値でTrue/False)
Public Function IsSimilar(ByVal a As String, ByVal b As String, Optional ByVal threshold As Double = 0.6) As Boolean
    IsSimilar = (Similarity_Jaccard(a, b) >= threshold)
End Function
VB
  • ポイント
    • 強み: 漢字+英数字が混在する商品名や型番でも効く。
    • しきい値: 0.5〜0.7が実務的。迷ったら0.6でスタート。

列全体から“類似候補”を洗い出すテンプレ(色付け・一覧出力)

類似候補を色でマーキング

Sub MarkSimilarCells_InColumn()
    Application.ScreenUpdating = False
    Application.EnableEvents = False
    Application.Calculation = xlCalculationManual

    Dim ws As Worksheet: Set ws = Worksheets("Data")
    Dim lastRow As Long: lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row
    Dim threshold As Double: threshold = 0.6

    ' 既存色クリア
    ws.Range("A2:A" & lastRow).Interior.ColorIndex = xlNone

    Dim i As Long, j As Long
    For i = 2 To lastRow
        Dim baseText As String: baseText = CStr(ws.Cells(i, "A").Value)
        If Len(Trim$(baseText)) = 0 Then GoTo contI

        For j = i + 1 To lastRow
            Dim cmpText As String: cmpText = CStr(ws.Cells(j, "A").Value)
            If Len(Trim$(cmpText)) = 0 Then GoTo contJ

            If IsSimilar(baseText, cmpText, threshold) Or _
               LikeContains(baseText, cmpText) Or _
               LikeContains(cmpText, baseText) Then
                ws.Cells(i, "A").Interior.Color = RGB(255, 245, 180)
                ws.Cells(j, "A").Interior.Color = RGB(255, 245, 180)
            End If
contJ:
        Next
contI:
    Next

    Application.Calculation = xlCalculationAutomatic
    Application.EnableEvents = True
    Application.ScreenUpdating = True

    MsgBox "類似候補を色付けしました(しきい値=" & threshold & ")"
End Sub
VB
  • 処理の工夫: まずJaccardで絞り、念のため相互包含(Contains)でも拾う二段構え。

類似ペアを一覧出力(スコア付き)

Private Function EnsureSheet(ByVal name As String, Optional ByVal clear As Boolean = True) As Worksheet
    Dim ws As Worksheet
    On Error Resume Next
    Set ws = Worksheets(name)
    On Error GoTo 0
    If ws Is Nothing Then
        Set ws = Worksheets.Add(After:=Worksheets(Worksheets.Count))
        ws.Name = name
    End If
    If clear Then ws.Cells.Clear
    Set EnsureSheet = ws
End Function

Sub ListSimilarPairs_InColumn()
    Application.ScreenUpdating = False
    Application.EnableEvents = False
    Application.Calculation = xlCalculationManual

    Dim ws As Worksheet: Set ws = Worksheets("Data")
    Dim lastRow As Long: lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row
    Dim threshold As Double: threshold = 0.6

    Dim out As Worksheet: Set out = EnsureSheet("類似候補一覧", True)
    out.Range("A1:D1").Value = Array("行1", "行2", "テキスト類似度", "備考")

    Dim r As Long: r = 2
    Dim i As Long, j As Long
    For i = 2 To lastRow
        Dim s1 As String: s1 = CStr(ws.Cells(i, "A").Value)
        If Len(Trim$(s1)) = 0 Then GoTo contI

        For j = i + 1 To lastRow
            Dim s2 As String: s2 = CStr(ws.Cells(j, "A").Value)
            If Len(Trim$(s2)) = 0 Then GoTo contJ

            Dim score As Double: score = Similarity_Jaccard(s1, s2)
            If score >= threshold Or LikeContains(s1, s2) Or LikeContains(s2, s1) Then
                out.Cells(r, 1).Value = i
                out.Cells(r, 2).Value = j
                out.Cells(r, 3).Value = Round(score, 3)
                out.Cells(r, 4).Value = IIf(LikeContains(s1, s2) Or LikeContains(s2, s1), "部分一致あり", "")
                r = r + 1
            End If
contJ:
        Next
contI:
    Next

    out.Columns.AutoFit
    Application.Calculation = xlCalculationAutomatic
    Application.EnableEvents = True
    Application.ScreenUpdating = True

    MsgBox "類似候補一覧を作成しました。件数: " & r - 2
End Sub
VB
  • 見やすさ: スコアを並べて高い順にソートすれば、確認が速い。

よくある落とし穴と対策

  • 全角・半角・かな揺れで見逃す
    • 対策: NormTextを必ず通す。記号・スペースも正す。
  • スコア閾値が合わない
    • 対策: 0.5〜0.7で試し、一覧を見て現場の“納得”に合わせて調整。
  • O(n^2)比較で遅い(全件ペア比較)
    • 対策: 先に先頭トークンや文字数帯でバケット分けしてから比較、または対象列をフィルタして件数を絞る。
  • 数字・型番の微差が拾えない
    • 対策: ワイルドカードパターン(例「ABC-####」)を別条件で加点するなどのルール追加。

例題で練習

'例1:A列の類似セルを色付け(しきい値0.6)
Sub Example_MarkSimilar()
    MarkSimilarCells_InColumn
End Sub

'例2:A列の類似ペアを一覧(スコア付き)に出力
Sub Example_ListSimilarPairs()
    ListSimilarPairs_InColumn
End Sub

'例3:前方/後方/パターン一致のテスト
Sub Example_SimpleLike()
    Debug.Print LikeStartsWith("東京都港区芝浦", "東京都")
    Debug.Print LikeEndsWith("PRJ-2025-0001", "0001")
    Debug.Print LikePattern("PRJ-2025-0001", "PRJ-####-####")
End Sub
VB
タイトルとURLをコピーしました