
MySQL Simhash

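Simhash is an algorithm that derives a hash value from the content of a text. Implemented in MySQL, it can be used for text deduplication, similarity queries, and related tasks.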
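For similarity queries, fingerprints of this kind are typically compared by Hamming distance, that is, the number of bit positions in which two values differ. In MySQL this can be expressed with the bitwise XOR operator and BIT_COUNT(). A minimal illustration, where 10 and 12 are made-up values standing in for real fingerprints:

```sql
-- 10 = 1010b, 12 = 1100b; their XOR is 0110b, which has two set bits.
SELECT BIT_COUNT(10 ^ 12) AS hamming_distance;  -- returns 2
```

The smaller the distance, the more bits the two fingerprints share; the threshold that counts as "similar" is an application choice.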

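DELIMITER //

CREATE FUNCTION simhash(input_text TEXT)
RETURNS BIGINT UNSIGNED DETERMINISTIC
BEGIN
  DECLARE words TEXT;
  DECLARE word TEXT;
  DECLARE stopwords TEXT;
  DECLARE acc BIGINT DEFAULT 0;            -- signed weighted sum of word hashes
  DECLARE hash BIGINT UNSIGNED DEFAULT 0;  -- final 64-bit value
  DECLARE weight INT DEFAULT 1;
  DECLARE bits INT DEFAULT 64;
  DECLARE i INT;

  -- Collapse every run of non-word characters into one space
  -- (REGEXP_REPLACE requires MySQL 8.0 or later), then lowercase.
  SET words = REGEXP_REPLACE(input_text, '[^[:alnum:]_]+', ' ');
  SET words = LOWER(words);
  -- FIND_IN_SET expects a comma-separated list.
  SET stopwords = CONCAT('a,an,and,are,as,at,be,by,for,from,had,he,i,in,is,it,',
                         'of,on,or,that,the,there,this,to,was,with');

  -- Part 1: split on spaces and accumulate the weighted CRC32 of each word.
  SET i = 1;
  wordloop: WHILE i <= CHAR_LENGTH(words) DO
    SET word = SUBSTRING_INDEX(SUBSTRING(words, i), ' ', 1);
    SET i = i + CHAR_LENGTH(word) + 1;
    IF FIND_IN_SET(word, stopwords) THEN
      SET weight = -1;
    ELSE
      SET weight = 1;
    END IF;
    -- CRC32() is unsigned; cast it so that a weight of -1 does not overflow.
    SET acc = acc + weight * CAST(CRC32(word) AS SIGNED);
  END WHILE wordloop;

  -- Part 2: reinterpret the sum as an unsigned 64-bit value and fold it bit by bit.
  SET hash = CAST(acc AS UNSIGNED);
  SET i = 1;
  bitloop: WHILE i <= bits DO
    SET hash = hash | ((BIT_COUNT(hash >> i) MOD 2) << (i - 1));
    SET i = i + 1;
  END WHILE bitloop;
  RETURN hash;
END//

DELIMITER ;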
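Once the function is created, it can be called like any built-in function. The sketch below shows one possible way to use it for deduplication; the table `articles` and its columns are hypothetical names chosen only for illustration.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE articles (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  body TEXT NOT NULL,
  fingerprint BIGINT UNSIGNED DEFAULT NULL,  -- 64-bit value returned by simhash()
  KEY idx_fingerprint (fingerprint)
);

-- Compute the fingerprint for rows that do not have one yet.
UPDATE articles
SET fingerprint = simhash(body)
WHERE fingerprint IS NULL;

-- List groups of rows that share the same fingerprint.
SELECT fingerprint, COUNT(*) AS copies, GROUP_CONCAT(id ORDER BY id) AS ids
FROM articles
GROUP BY fingerprint
HAVING COUNT(*) > 1;
```

Storing the fingerprint in an indexed column avoids recomputing it for every comparison, since the function scans the whole text each time it is called.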

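The simhash function works in two parts: tokenization and hash computation. Inside the function, every run of non-word characters in the input text is replaced with a space by a regular expression, and the result is converted to lowercase. The function then walks through the string, splits it on spaces, and checks whether each word is a stopword. It computes the CRC32 value of each word and accumulates these values according to their weight (-1 for stopwords, +1 for all other words). Finally, the accumulated result is rearranged bit by bit to produce the simhash value.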
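For comparison, the classic Simhash scheme does not sum the word hashes directly. It keeps one weighted counter per bit position and sets each output bit from the sign of its counter, so that texts sharing most of their words tend to get fingerprints with a small Hamming distance. The following is a minimal sketch of that variant, not the function described above. It assumes MySQL 8.0 or later (for REGEXP_REPLACE), gives every word a weight of 1, and produces only 32 bits because CRC32 yields 32 bits. The name `simhash_votes` is hypothetical.

```sql
DELIMITER //

CREATE FUNCTION simhash_votes(input_text TEXT)
RETURNS INT UNSIGNED DETERMINISTIC
BEGIN
  DECLARE words TEXT;
  DECLARE word TEXT;
  DECLARE fingerprint INT UNSIGNED DEFAULT 0;
  DECLARE vote INT;
  DECLARE b INT DEFAULT 0;   -- current bit position, 0..31
  DECLARE pos INT;

  SET words = LOWER(REGEXP_REPLACE(input_text, '[^[:alnum:]_]+', ' '));

  WHILE b < 32 DO
    SET vote = 0;
    SET pos = 1;
    -- Every word votes +1 if bit b of its CRC32 is set, otherwise -1.
    WHILE pos <= CHAR_LENGTH(words) DO
      SET word = SUBSTRING_INDEX(SUBSTRING(words, pos), ' ', 1);
      SET pos = pos + CHAR_LENGTH(word) + 1;
      IF word <> '' THEN
        SET vote = vote + IF((CRC32(word) >> b) & 1, 1, -1);
      END IF;
    END WHILE;
    -- Bit b of the fingerprint is 1 when the positive votes win.
    IF vote > 0 THEN
      SET fingerprint = fingerprint | (1 << b);
    END IF;
    SET b = b + 1;
  END WHILE;

  RETURN fingerprint;
END//

DELIMITER ;

-- Number of differing bits between two fingerprints.
SELECT BIT_COUNT(simhash_votes('the quick brown fox jumps')
               ^ simhash_votes('the quick brown fox leaps')) AS hamming_distance;
```

The sketch re-tokenizes the text once per bit, which keeps the code short but makes it slow on long inputs; it is meant to show the voting step rather than to be used as-is on large tables.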