Simplest way to compare two files containing lists of song titles

Time: 2015-06-30 21:57:17

Tags: database list comparison recordset fuzzy-comparison

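I have two lists of song titles, each in a plain text file; they are the filenames of licensed lyric files. I want to check whether the titles in the shorter list (the needle) appear in the longer list (the haystack). The script/application should return the list of needle titles that are not in the haystack.

I would prefer Python or a shell script (BASH), or simply a visual diff program that can handle the fuzziness required.

The main problem is that the titles need to be matched fuzzily to account for data-entry errors and possibly also word order.

Haystack sample (note some duplicate and near-duplicate lines; matches are highlighted):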

Yearn
Yesterday, Today And Forever
Yesterday, Today, Forever
You
You Alone
You Are Here (The Same Power)
You Are Holy
You Are Holy (Prince Of Peace)
You Are Mighty
You Are Mine
You Are My All In All
You Are My Hiding Place
You Are My King (Amazing Love)
You Are Righteous (Hope)
You Are So Faithful
You Are So Good to Me
You Are Worthy Of My Praise
You Have Been Good
You Led Me To The Cross
You Reign
You Rescued Me
You Said
You Sent Your Own
You Set Me Apart (Dwell In Your House)
You alone are worthy (Glory in the highest)
You are God in heaven (let my words be few)
You are always fighting for us (Hallelujah you have overcome)
You are beautiful (I stand in awe)
You are beautiful beyond description
You are mighty
You are my all in all
You are my hiding place
You are my passion
You are still Holy
You are the Holy One (We exalt Your name)
You are the mighty King
You are the mighty warrior
You are the vine
**You chose the cross (Lost in wonder)**
You have shown me favour unending
You hold the broken hearted
You laid aside Your majesty
You said
You're Worthy Of My Praise
You're calling me (Unashamed love)
You're the God of this city
You're the Lion of Judah
You're the word of God the Father (Across the lands)
You've put a new song in my heart
Your Beloved
Your Grace is Enough
Your Great Name We Praise
Your Great Name We Praise-2
Your Light (You Have Turned)
Your Light Is Over Me (His Love)
**Your Love**
**Your Love Is Amazing**
Your Love Is Deep
Your Love Is Deeper - Jesus, Lord of Heaven (Phil Wickham)
Your Love Oh Lord
Your Love Oh Lord (Psalm 36)
Your Love is Extravagant
Your Power (Send Me)
Your blood speaks a better word
Your everlasting love
**Your grace is enough**
**Your grace is enough (Great is Your faithfulness)**
Your mercy is falling
Your mercy taught us how to dance (Dancing generation)
Your voice stills the oceans (nothing is impossible)
Yours Is The Kingdom

Needle sample:

You Are Good (I Want To Scream It Out)
You Are My Strength (In The Fullness)
You Are My Vision O King Of My Heart
You Are The King Of Glory (Hosanna To The Son)
**You Chose The Cross (Lost In Wonder)**
**Your Grace Is Enough (This Is Our God)**
**Your Love Is Amazing Steady And Unchanging**
**Your Love Shining Like The Sun**

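Note that the needle title "Your Love Shining Like The Sun" is only a possible match for "Your Love". It is better to err on the side of not matching, so any uncertain title match should appear in the output.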

comm -1 -3 <(sort haystack.txt) <(sort needle.txt)

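finds no matches. diff and grep seem to have the same problem: they are not fuzzy enough. Kdiff3 and diffnow.com are no faster than comparing by hand, because I still have to scan for nearly every match; they only handle whitespace and letter-case differences.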

ExamDiff Pro from prestosoft.com looks like a possibility, but it is MS Windows only, and I would prefer a native Linux solution before I start messing with WINE or VirtualBox.

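The needle is actually a CSV, so I have considered loading it into LibreOffice and either treating it as a database and running SQL queries, or using a spreadsheet with hlookup... Another question led me to { {3}}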

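This seems like a common class of problem (it is basically "record linkage", which often uses [Levenshtein] edit-distance calculations), so how should I approach it? Any suggestions?

4 answers:

Answer 0: (score: 3)

I did something similar in MySQL. I used the following code to define Levenshtein distance and ratio functions (I got it from an answer to this question):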

DROP FUNCTION IF EXISTS `levenshtein`;
CREATE FUNCTION `levenshtein`(s1 text, s2 text) RETURNS int(11) DETERMINISTIC
BEGIN 
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR; 
    DECLARE cv0, cv1 text; 
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0; 
    IF s1 = s2 THEN 
      RETURN 0; 
    ELSEIF s1_len = 0 THEN 
      RETURN s2_len; 
    ELSEIF s2_len = 0 THEN 
      RETURN s1_len; 
    ELSE 
      WHILE j <= s2_len DO 
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1; 
      END WHILE; 
      WHILE i <= s1_len DO 
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1; 
        WHILE j <= s2_len DO 
          SET c = c + 1; 
          IF s1_char = SUBSTRING(s2, j, 1) THEN  
            SET cost = 0; ELSE SET cost = 1; 
          END IF; 
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost; 
          IF c > c_temp THEN
            SET c = c_temp;
          END IF; 
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1; 
          IF c > c_temp THEN  
            SET c = c_temp;  
          END IF; 
          SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1; 
        END WHILE; 
        SET cv1 = cv0, i = i + 1; 
      END WHILE; 
    END IF; 
    RETURN c; 
END;

DROP FUNCTION IF EXISTS `levenshtein_ratio`;
CREATE FUNCTION `levenshtein_ratio`(s1 text, s2 text) RETURNS int(11) DETERMINISTIC
BEGIN 
    DECLARE s1_len, s2_len, max_len INT;
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2); 
    IF s1_len > s2_len THEN  
      SET max_len = s1_len;  
    ELSE  
      SET max_len = s2_len;  
    END IF; 
    RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100); 
END;

Assuming you import your lists as the following two tables:

needle (title VARCHAR)
haystack (title VARCHAR)

You can then compare the two tables with a query like the one below.

SELECT title, best_match,
    levenshtein_ratio(TRIM(LOWER(title)), TRIM(LOWER(best_match))) AS ratio
FROM (
    SELECT n.title AS title, (
                SELECT h.title
                FROM haystack h
                ORDER BY levenshtein_ratio(TRIM(LOWER(n.title)), TRIM(LOWER(h.title))) DESC
                LIMIT 1
           ) AS best_match
    FROM needle n
) x
ORDER BY ratio DESC

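Choose a cutoff; every row whose ratio falls below it counts as having no match. If you would rather use the edit distance directly, you can use levenshtein() instead of levenshtein_ratio(); in that case change the ORDER BY clauses to ASC, since a smaller distance means a closer match.

Note that this makes no special provision for word-order differences. Also, if your lists are large, the comparison may be slow.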
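For anyone who would rather not load the lists into a database, the same distance, ratio and best-match logic can be sketched in plain Python. This is only a rough sketch, not a translation of the stored functions above: `normalize` and `best_match` are hypothetical helper names, and the inputs are assumed to be lists of title strings already read from the two files.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute all cost 1), two rolling rows."""
    if a == b:
        return 0
    if not a:
        return len(b)
    if not b:
        return len(a)
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_ratio(a: str, b: str) -> int:
    """Similarity from 0 to 100, same formula as the SQL ratio function."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100  # assumption: two empty strings count as identical
    return round((1 - levenshtein(a, b) / longest) * 100)


def normalize(title: str) -> str:
    """Hypothetical helper mirroring TRIM(LOWER(title))."""
    return title.strip().lower()


def best_match(needle, haystack):
    """Return (best haystack title, ratio) for one needle title."""
    best = max(haystack,
               key=lambda h: levenshtein_ratio(normalize(needle), normalize(h)))
    return best, levenshtein_ratio(normalize(needle), normalize(best))
```

Calling `best_match(title, haystack)` for each needle title and keeping the rows whose ratio is below your chosen cutoff gives the "not in the haystack" list. Like the SQL version, this compares every needle with every haystack title, so it will also be slow on large lists.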

Answer 1: (score: 2)

Fuzzywuzzy can make your dreams come true:

import csv
from fuzzywuzzy import fuzz

# Grab CSV data (assumes the title is in the first column of each file):
with open('needles.csv', newline='') as z:
    needles = [row[0] for row in csv.reader(z) if row]

with open('haystack.csv', newline='') as w:
    haystack = [row[0] for row in csv.reader(w) if row]

# Collect the needles that have no close enough match in the haystack
NeedlesNotInHaystack = []

Fuzziness = 80 # ADJUST THIS VALUE TO FINE-TUNE RESULTS 
for x in needles:
    # Keep x only if no haystack title scores above the threshold
    if not any(fuzz.ratio(x, y) > Fuzziness for y in haystack):
        NeedlesNotInHaystack.append(x)

# Export results to CSV (one title per row):
with open('Results', 'w', newline='') as csvfile:
    temp = csv.writer(csvfile)
    temp.writerows([x] for x in NeedlesNotInHaystack)

Instead of fuzz.ratio, you may get better results with:

fuzz.token_sort_ratio OR fuzz.token_set_ratio

IMO, this tutorial is the best and most concise overview of fuzzy matching in Python.
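To illustrate why a token-sorted comparison helps with word-order differences, here is a rough standard-library sketch of the idea. It is not fuzzywuzzy's implementation: it uses `difflib.SequenceMatcher` as a stand-in scorer, and `token_sort_key`, `token_sort_score` and `unmatched` are hypothetical names chosen for this example.

```python
import re
from difflib import SequenceMatcher


def token_sort_key(title):
    """Lowercase, drop punctuation, and sort the words alphabetically."""
    words = re.findall(r"[a-z0-9']+", title.lower())
    return " ".join(sorted(words))


def token_sort_score(a, b):
    """Similarity from 0 to 100 that ignores case, punctuation and word order."""
    matcher = SequenceMatcher(None, token_sort_key(a), token_sort_key(b))
    return round(100 * matcher.ratio())


def unmatched(needles, haystack, cutoff=80):
    """Return the needles whose best haystack score is below the cutoff."""
    return [n for n in needles
            if max((token_sort_score(n, h) for h in haystack), default=0) < cutoff]
```

Because both titles are reduced to the same sorted word list before scoring, two titles that differ only in word order, case or punctuation score 100. The cutoff of 80 is just the starting value used above and needs tuning against real data.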

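Answer 2: (score: 2)

If you want to go the OpenRefine way, your best option is to set up a local reconciliation server; I recommend reconcile-csv.

Load the haystack into the reconciliation server, then have OpenRefine process your needle file and send it to the reconciliation service.

The reconciliation server returns a score for each proposal, so you can (I picked these scores only as an example; don't take them for granted):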

  • Bulk-accept everything above 0.9
  • Manually check everything between 0.8 and 0.9
  • Discard everything below 0.8
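The three bands above amount to a small decision rule. A minimal Python sketch, assuming the scores have already been exported as (needle, proposal, score) tuples; the function names, the tuple layout and the default thresholds are illustrative only and are not part of OpenRefine or reconcile-csv:

```python
def triage(score, accept_above=0.9, review_from=0.8):
    """Classify one reconciliation score as accept, review or discard."""
    if score > accept_above:
        return "accept"
    if score >= review_from:
        return "review"
    return "discard"


def bucket(candidates, **thresholds):
    """Group (needle, proposal, score) tuples by their triage decision."""
    buckets = {"accept": [], "review": [], "discard": []}
    for needle, proposal, score in candidates:
        buckets[triage(score, **thresholds)].append((needle, proposal, score))
    return buckets
```

For the question as asked, the "review" and "discard" buckets together are the titles that should appear in the output, since any uncertain match is meant to be reported.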

Answer 3: (score: 1)

You might want to take a look at fuzzywuzzy (https://github.com/seatgeek/fuzzywuzzy).

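from fuzzywuzzy import process

# Read one title per line; splitlines() avoids a trailing empty entry
with open('needle') as f:
    needles = f.read().splitlines()
with open('haystack') as f:
    haystack = f.read().splitlines()

# Print each needle with its haystack candidates scoring at least 90
for a in needles:
    print(a, '->', process.extractBests(a, haystack, score_cutoff=90))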

Useful parameters of the extract functions are limit, scorer and processor.