類似判定(簡易版)
「“完全一致”じゃないけれど、だいたい同じものを見つけたい」——削除や統合の前に“似ている候補”だけを絞るための、軽くて速い簡易テンプレです。まずは文字の正規化→部分一致やワイルドカード→スコア判定(トークン重なり)という順で使うと失敗が減ります。
正規化ユーティリティ(日本語向けの表記揺れ吸収)
Option Explicit
' 日本語の表記揺れを吸収して比較しやすくする
Public Function NormText(ByVal s As String) As String
Dim t As String
t = Trim$(s)
' 全角→半角(英数記号)
t = StrConv(t, vbNarrow)
' ひらがな→カタカナ(どちらかに統一)
t = StrConv(t, vbKatakana)
' 大文字化(英字)
t = UCase$(t)
' スペース・タブ類の連続を単一スペースに(簡易)
t = Replace(t, vbTab, " ")
t = Replace(t, " ", " ") ' 全角スペース
Do While InStr(t, " ") > 0
t = Replace(t, " ", " ")
Loop
NormText = t
End Function
VB- 狙い: 全角・半角、ひらがな・カタカナ、余分なスペース、大小文字の違いを吸収して“ほぼ同じ”を拾いやすくします。
- 使い所: データ読み込み時や比較直前に
NormTextを必ずかませる。
まずは“簡易一致”で拾う(包含・前方一致・後方一致・ワイルドカード)
' 部分一致(含まれているか)
Public Function LikeContains(ByVal a As String, ByVal b As String) As Boolean
LikeContains = (InStr(1, NormText(a), NormText(b), vbTextCompare) > 0)
End Function
' 前方一致
Public Function LikeStartsWith(ByVal a As String, ByVal prefix As String) As Boolean
LikeStartsWith = (NormText(a) Like (NormText(prefix) & "*"))
End Function
' 後方一致
Public Function LikeEndsWith(ByVal a As String, ByVal suffix As String) As Boolean
LikeEndsWith = (NormText(a) Like ("*" & NormText(suffix)))
End Function
' ワイルドカード(?=1文字、*=任意文字列、#=数字)
Public Function LikePattern(ByVal a As String, ByVal pattern As String) As Boolean
LikePattern = (NormText(a) Like NormText(pattern))
End Function
VB- 使い分けのコツ
- Contains: まずは“含まれている”で粗く拾う。
- Starts/Ends: 型番や接尾辞が意味を持つデータで強い。
- Pattern: ルールが分かっているときに最短(例「PRJ-####」)。
スコア式の“簡易類似”(トークン重なり率)
文字を単語(トークン)に分け、重なり率で似ている度合いを0〜1で評価。0.6以上なら“類似”など、しきい値で判定します。
' 文字列→トークン配列(スペースと記号で分割、空を除去)
Private Function Tokens(ByVal s As String) As Collection
Dim c As New Collection, i As Long, x As Variant
Dim clean As String: clean = NormText(s)
' よくある記号をスペースに変換(簡易版)
Dim punct As Variant: punct = Array("-", "_", "/", ",", ".", "(", ")", "[", "]", "・", "/", "-", "―")
For i = LBound(punct) To UBound(punct)
clean = Replace(clean, punct(i), " ")
Next
' 分割
For Each x In Split(clean, " ")
If Len(x) > 0 Then c.Add x
Next
Set Tokens = c
End Function
' Jaccard係数(重なり率):共通トークン数 / ユニークトークン総数
Public Function Similarity_Jaccard(ByVal a As String, ByVal b As String) As Double
Dim ta As Collection: Set ta = Tokens(a)
Dim tb As Collection: Set tb = Tokens(b)
Dim setA As Object: Set setA = CreateObject("Scripting.Dictionary")
Dim setB As Object: Set setB = CreateObject("Scripting.Dictionary")
Dim i As Long
For i = 1 To ta.Count: setA(ta(i)) = True: Next
For i = 1 To tb.Count: setB(tb(i)) = True: Next
Dim unionCount As Long, interCount As Long
Dim k As Variant
For Each k In setA.Keys
If setB.Exists(k) Then interCount = interCount + 1
unionCount = unionCount + 1
Next
For Each k In setB.Keys
If Not setA.Exists(k) Then unionCount = unionCount + 1
Next
If unionCount = 0 Then Similarity_Jaccard = 0 Else Similarity_Jaccard = interCount / unionCount
End Function
' 類似判定(しきい値でTrue/False)
Public Function IsSimilar(ByVal a As String, ByVal b As String, Optional ByVal threshold As Double = 0.6) As Boolean
IsSimilar = (Similarity_Jaccard(a, b) >= threshold)
End Function
VB- ポイント
- 強み: 漢字+英数字が混在する商品名や型番でも効く。
- しきい値: 0.5〜0.7が実務的。迷ったら0.6でスタート。
列全体から“類似候補”を洗い出すテンプレ(色付け・一覧出力)
類似候補を色でマーキング
Sub MarkSimilarCells_InColumn()
Application.ScreenUpdating = False
Application.EnableEvents = False
Application.Calculation = xlCalculationManual
Dim ws As Worksheet: Set ws = Worksheets("Data")
Dim lastRow As Long: lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row
Dim threshold As Double: threshold = 0.6
' 既存色クリア
ws.Range("A2:A" & lastRow).Interior.ColorIndex = xlNone
Dim i As Long, j As Long
For i = 2 To lastRow
Dim baseText As String: baseText = CStr(ws.Cells(i, "A").Value)
If Len(Trim$(baseText)) = 0 Then GoTo contI
For j = i + 1 To lastRow
Dim cmpText As String: cmpText = CStr(ws.Cells(j, "A").Value)
If Len(Trim$(cmpText)) = 0 Then GoTo contJ
If IsSimilar(baseText, cmpText, threshold) Or _
LikeContains(baseText, cmpText) Or _
LikeContains(cmpText, baseText) Then
ws.Cells(i, "A").Interior.Color = RGB(255, 245, 180)
ws.Cells(j, "A").Interior.Color = RGB(255, 245, 180)
End If
contJ:
Next
contI:
Next
Application.Calculation = xlCalculationAutomatic
Application.EnableEvents = True
Application.ScreenUpdating = True
MsgBox "類似候補を色付けしました(しきい値=" & threshold & ")"
End Sub
VB- 処理の工夫: まずJaccardで絞り、念のため相互包含(Contains)でも拾う二段構え。
類似ペアを一覧出力(スコア付き)
Private Function EnsureSheet(ByVal name As String, Optional ByVal clear As Boolean = True) As Worksheet
Dim ws As Worksheet
On Error Resume Next
Set ws = Worksheets(name)
On Error GoTo 0
If ws Is Nothing Then
Set ws = Worksheets.Add(After:=Worksheets(Worksheets.Count))
ws.Name = name
End If
If clear Then ws.Cells.Clear
Set EnsureSheet = ws
End Function
Sub ListSimilarPairs_InColumn()
Application.ScreenUpdating = False
Application.EnableEvents = False
Application.Calculation = xlCalculationManual
Dim ws As Worksheet: Set ws = Worksheets("Data")
Dim lastRow As Long: lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row
Dim threshold As Double: threshold = 0.6
Dim out As Worksheet: Set out = EnsureSheet("類似候補一覧", True)
out.Range("A1:D1").Value = Array("行1", "行2", "テキスト類似度", "備考")
Dim r As Long: r = 2
Dim i As Long, j As Long
For i = 2 To lastRow
Dim s1 As String: s1 = CStr(ws.Cells(i, "A").Value)
If Len(Trim$(s1)) = 0 Then GoTo contI
For j = i + 1 To lastRow
Dim s2 As String: s2 = CStr(ws.Cells(j, "A").Value)
If Len(Trim$(s2)) = 0 Then GoTo contJ
Dim score As Double: score = Similarity_Jaccard(s1, s2)
If score >= threshold Or LikeContains(s1, s2) Or LikeContains(s2, s1) Then
out.Cells(r, 1).Value = i
out.Cells(r, 2).Value = j
out.Cells(r, 3).Value = Round(score, 3)
out.Cells(r, 4).Value = IIf(LikeContains(s1, s2) Or LikeContains(s2, s1), "部分一致あり", "")
r = r + 1
End If
contJ:
Next
contI:
Next
out.Columns.AutoFit
Application.Calculation = xlCalculationAutomatic
Application.EnableEvents = True
Application.ScreenUpdating = True
MsgBox "類似候補一覧を作成しました。件数: " & r - 2
End Sub
VB- 見やすさ: スコアを並べて高い順にソートすれば、確認が速い。
よくある落とし穴と対策
- 全角・半角・かな揺れで見逃す
- 対策:
NormTextを必ず通す。記号・スペースも正す。
- 対策:
- スコア閾値が合わない
- 対策: 0.5〜0.7で試し、一覧を見て現場の“納得”に合わせて調整。
- O(n^2)比較で遅い(全件ペア比較)
- 対策: 先に先頭トークンや文字数帯でバケット分けしてから比較、または対象列をフィルタして件数を絞る。
- 数字・型番の微差が拾えない
- 対策: ワイルドカードパターン(例「ABC-####」)を別条件で加点するなどのルール追加。
例題で練習
'例1:A列の類似セルを色付け(しきい値0.6)
Sub Example_MarkSimilar()
MarkSimilarCells_InColumn
End Sub
'例2:A列の類似ペアを一覧(スコア付き)に出力
Sub Example_ListSimilarPairs()
ListSimilarPairs_InColumn
End Sub
'例3:前方/後方/パターン一致のテスト
Sub Example_SimpleLike()
Debug.Print LikeStartsWith("東京都港区芝浦", "東京都")
Debug.Print LikeEndsWith("PRJ-2025-0001", "0001")
Debug.Print LikePattern("PRJ-2025-0001", "PRJ-####-####")
End Sub
VB