在我的数据库中,我有很多非常相似但不完全相同的条目。例如,只有两个字符可能不同,例如:
第1行:“天气很好,请参阅http://xyz56.com”
第二行:“天气很好,请参阅http://xyz31.com”
我想摆脱这些部分重复,只收到这两行的一个结果。无论它是哪一个,我建议使用第一个出现的那个。
我可以使用MySQL的任何功能来有效地执行此操作吗?我的第一个想法是提取更多的数据并对字符串进行比较,如果匹配的字符超过某个阈值而不是忽略它。下行是我永远不会知道我必须从数据库中提取多少条目,因为我必须将每一行与所有其他行(O(n²))进行比较,所以它也很安静。
更新: 更具体地说明用例:方差的位置并不总是在字符串的末尾,也可能只有2个字符会发生变化。字符串长度随每行而变化。
答案 0 :(得分:4)
我的建议是使用Levenshtein distance,这是字符串相似度的衡量标准。要让MySQL直接计算它,你必须在存储过程中实现它,例如:http://www.artfulsoftware.com/infotree/queries.php#552。
还有PHP和Java的常见实现。
答案 1 :(得分:2)
您可以使用SOUNDEX。
SOUNDEX(str)
Returns a soundex string from str. Two strings that sound almost the same should have identical soundex strings. A standard soundex string is four characters long, but the SOUNDEX() function returns an arbitrarily long string. You can use SUBSTRING() on the result to get a standard soundex string. All non-alphabetic characters in str are ignored. All international alphabetic characters outside the A-Z range are treated as vowels.
mysql> SELECT SOUNDEX('Hello');
+---------------------------------------------------------+
| SOUNDEX('Hello') |
+---------------------------------------------------------+
| H400 |
+---------------------------------------------------------+
1 row in set (0.00 sec)
来源:http://www.tutorialspoint.com/mysql/mysql-string-functions.htm#operator_sounds-like
例如,在Oracle PL / SQL中,您的字符串具有相同的SOUNDEX并且SOUNDEX无法区分:
select soundex ('The weather is nice, see http://xyz56.com') from dual;
SOUNDEX('THEWEATHERISNICE,SEEHTTP://XYZ56.COM')
-----------------------------------------------
T362
1 row selected.
select soundex ('The weather is nice, see http://xyz31.com') from dual;
SOUNDEX('THEWEATHERISNICE,SEEHTTP://XYZ31.COM')
-----------------------------------------------
T362
1 row selected.
答案 2 :(得分:2)
SELECT * FROM test GROUP BY SUBSTR(mytext, 1, 10);
答案 3 :(得分:0)
MySQL的Levenshtein距离算法:
请参阅:Implementation of Levenshtein distance for mysql/fuzzy search?
Levenshtein距离 两个字符串之间的Levenshtein距离是将一个字符串转换为另一个字符串所需的最小操作数,其中操作可以是一个字符的插入,删除或替换。 Jason Rust在http://www.codejanitor.com/wp/发布了这个MySQL算法。
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
-- max strlen=255
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN
SET cost = 0; ELSE SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END;
辅助功能:
CREATE FUNCTION levenshtein_ratio( s1 VARCHAR(255), s2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, max_len INT;
SET s1_len = LENGTH(s1), s2_len = LENGTH(s2);
IF s1_len > s2_len THEN
SET max_len = s1_len;
ELSE
SET max_len = s2_len;
END IF;
RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
END;
Levenshtein距离算法:Oracle PL / SQL实现
消息来源:http://www.merriampark.com/ldplsql.htm
CREATE OR REPLACE FUNCTION ld -- Levenshtein distance
(p_source_string IN VARCHAR2,
p_target_string IN VARCHAR2)
RETURN NUMBER
DETERMINISTIC
AS
v_length_of_source NUMBER := NVL (LENGTH (p_source_string), 0);
v_length_of_target NUMBER := NVL (LENGTH (p_target_string), 0);
TYPE mytabtype IS TABLE OF NUMBER INDEX BY BINARY_INTEGER;
column_to_left mytabtype;
current_column mytabtype;
v_cost NUMBER := 0;
BEGIN
IF v_length_of_source = 0 THEN
RETURN v_length_of_target;
ELSIF v_length_of_target = 0 THEN
RETURN v_length_of_source;
ELSE
FOR j IN 0 .. v_length_of_target LOOP
column_to_left(j) := j;
END LOOP;
FOR i IN 1.. v_length_of_source LOOP
current_column(0) := i;
FOR j IN 1 .. v_length_of_target LOOP
IF SUBSTR (p_source_string, i, 1) =
SUBSTR (p_target_string, j, 1)
THEN v_cost := 0;
ELSE v_cost := 1;
END IF;
current_column(j) := LEAST (current_column(j-1) + 1,
column_to_left(j) + 1,
column_to_left(j-1) + v_cost);
END LOOP;
FOR j IN 0 .. v_length_of_target LOOP
column_to_left(j) := current_column(j);
END LOOP;
END LOOP;
END IF;
RETURN current_column(v_length_of_target);
END ld;
如果您假设有一个名为EMPLOYEES的表,其中包含一个名为FIRST_NAME的列为VARCHAR2的列,您可以轻松找到Levenshtein Distance = 1的记录,如下所示:
SELECT *
FROM employees alfa
WHERE EXISTS
(SELECT 'X'
FROM employees beta
WHERE ld (beta.first_name, alfa.first_name) = 1);
使用此查询,您可以在结果集的每一行中显示具有Levenshtein Distance = 1的first_name列表:
SELECT a.first_name, b.first_name
FROM employees a
INNER JOIN
employees b
ON ld (b.first_name, a.first_name) = 1;
一个例子:
SELECT DISTINCT a.first_name, b.first_name
FROM employees a
INNER JOIN
employees b
ON ld (b.first_name, a.first_name) <= 2
AND ld (b.first_name, a.first_name) > 0;
FIRST_NAME;FIRST_NAME_1
Jean;John
Nancy;Vance
Alana;Allan
Alana;Clara
Ellen;Eleni
John;Jean
Daniel;Danielle
Danielle;Daniel
Shelley;Shelli
Sundita;Nandita
Lisa;Luis
Stephen;Steven
Nanette;Janette
Diana;Alana
TJ;Ki
Luis;Lisa
Sarath;Sarah
Louise;Luis
Ki;TJ
Allan;Ellen
Luis;Louise
Den;Lex
Clara;Alana
Matthew;Mattea
Shelli;Shelley
Sarah;Sarath
Girard;Gerald
Vance;Nancy
Mattea;Martha
Allan;Alana
Nandita;Sundita
Ellen;Allan
Jean;Den
Eleni;Ellen
Gerald;Girard
Lex;Den
Janette;Nanette
Steven;Stephen
Mattea;Matthew
Den;Jean
Martha;Mattea
Alana;Diana