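Simhash is an algorithm that derives a hash value (a fingerprint) from the content of a text. Implemented in MySQL, it can be used for tasks such as text deduplication and similarity queries. One way to write it as a stored function: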
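DELIMITER //
CREATE FUNCTION simhash(text TEXT) RETURNS BIGINT UNSIGNED
DETERMINISTIC
BEGIN
  DECLARE words TEXT;
  DECLARE word TEXT;
  DECLARE stopwords TEXT;
  DECLARE acc BIGINT DEFAULT 0;            -- signed running sum of weighted CRC32 values
  DECLARE hash BIGINT UNSIGNED DEFAULT 0;  -- bit operations in MySQL yield unsigned 64-bit values
  DECLARE weight INT DEFAULT 1;
  DECLARE bits INT DEFAULT 64;
  DECLARE i INT;

  -- REPLACE() matches literally, so REGEXP_REPLACE() (MySQL 8.0+) is used for the pattern.
  SET words = TRIM(LOWER(REGEXP_REPLACE(text, '[^[:alnum:]_]+', ' ')));
  -- FIND_IN_SET() expects a comma-separated list; strings are joined with CONCAT(), not +.
  SET stopwords = CONCAT('a,an,and,are,as,at,be,by,for,from,had,he,i,in,is,it,',
                         'of,on,or,that,the,there,this,to,was,with');

  SET i = 1;
  wordloop: WHILE i <= CHAR_LENGTH(words) DO
    SET word = SUBSTRING_INDEX(SUBSTRING(words, i), ' ', 1);
    SET i = i + CHAR_LENGTH(word) + 1;
    IF FIND_IN_SET(word, stopwords) THEN SET weight = -1; ELSE SET weight = 1; END IF;
    -- CRC32() is unsigned; cast it so that a weight of -1 does not overflow.
    SET acc = acc + weight * CAST(CRC32(word) AS SIGNED);
  END WHILE;

  -- Carry the accumulated sum into the bit loop (resetting it to 0 here would always return 0).
  SET hash = CAST(acc AS UNSIGNED);
  SET i = 1;
  bitloop: WHILE i <= bits DO
    SET hash = hash | ((BIT_COUNT(hash >> i) MOD 2) << (i - 1));
    SET i = i + 1;
  END WHILE;
  RETURN hash;
END//
DELIMITER ;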
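The simhash function works in two stages: tokenization and hash computation. Inside the function, every run of non-word characters in the input is replaced with a single space and the text is converted to lowercase. The function then walks the string word by word, splitting on spaces, and checks whether each word is a stop word. It computes the CRC32 value of each word and adds it to a running sum with its weight (-1 for stop words, +1 for all other words). Finally, a pass over the 64 bit positions mixes the bits of the accumulated sum to produce the simhash value. Because whole CRC32 values are summed rather than individual bits, this is a simplified variant of Simhash.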
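In the standard Simhash construction, each bit position collects a vote from every word, and a fingerprint bit is set when its votes are positive. The following is a minimal sketch of that voting step, not the function above. It assumes MySQL 8.0+ for REGEXP_REPLACE; the function name simhash_bits, the table docs(id, fp), and the distance threshold of 3 are all hypothetical; every word gets weight 1; and the fingerprint is only 32 bits wide because CRC32 yields 32 bits.

```sql
DELIMITER //
CREATE FUNCTION simhash_bits(doc TEXT) RETURNS BIGINT UNSIGNED
DETERMINISTIC
BEGIN
  DECLARE words TEXT;
  DECLARE word TEXT;
  DECLARE h BIGINT UNSIGNED;              -- CRC32 of the current word
  DECLARE fp BIGINT UNSIGNED DEFAULT 0;   -- fingerprint being built
  DECLARE b INT DEFAULT 0;                -- current bit position, 0..31
  DECLARE pos INT;
  DECLARE vote INT;

  -- Normalize: collapse non-word runs to one space, lowercase, trim.
  SET words = TRIM(LOWER(REGEXP_REPLACE(doc, '[^[:alnum:]_]+', ' ')));

  -- One pass over the words per bit position (simple, not fast).
  WHILE b < 32 DO
    SET vote = 0;
    SET pos = 1;
    WHILE pos <= CHAR_LENGTH(words) DO
      SET word = SUBSTRING_INDEX(SUBSTRING(words, pos), ' ', 1);
      SET pos = pos + CHAR_LENGTH(word) + 1;
      SET h = CRC32(word);
      -- A set bit votes +1, a clear bit votes -1.
      IF ((h >> b) & 1) = 1 THEN
        SET vote = vote + 1;
      ELSE
        SET vote = vote - 1;
      END IF;
    END WHILE;
    -- The fingerprint bit is 1 only when the votes for this position are positive.
    IF vote > 0 THEN
      SET fp = fp | (1 << b);
    END IF;
    SET b = b + 1;
  END WHILE;
  RETURN fp;
END//
DELIMITER ;

-- Hamming distance between two fingerprints: fewer differing bits suggests more similar text.
SELECT BIT_COUNT(simhash_bits('the quick brown fox jumps over the lazy dog')
               ^ simhash_bits('the quick brown fox jumped over the lazy dog')) AS distance;

-- Near-duplicate pairs in a hypothetical table docs(id, fp) with fp precomputed;
-- the threshold 3 is an assumption to tune per data set.
SELECT a.id, b.id
FROM docs a JOIN docs b ON a.id < b.id
WHERE BIT_COUNT(a.fp ^ b.fp) <= 3;
```

The self-join compares every pair of rows, so it is only practical for small tables; it is shown here to illustrate how BIT_COUNT and XOR turn fingerprints into a similarity test.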