Finding repeated strings in a large string

Time: 2019-05-13 18:27:33

Tags: javascript arrays regex duplicates compression

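I have a text file as input, roughly 5-10 MB in size, that contains many partially repeated strings. I need to find the repeated substrings (with a length between min and max) and save them to build a dictionary, so that each one can be replaced with a shorter string.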

Example input:

This is just a sample STRING: at the address of "https://example.com/content/1.jpeg" and another image address in another address maybe here https://example.com/content/3242341.jpeg.
And this sample string can be countinue for ever and you can see that there is no structure for the partial strings...

Expected output:

min=4,max=100

$1:this 
$2: sample string
$3: address
$4:https://example.com/content/
$5:.jpeg
$6: another
$7:here 
$8:And 

$1 is just a$2: at the$3 of "$41$5" and$6 image$3 in$6$3 maybe $7$43242341.$5.
$8$1$2 can be countinue for ever and you can see that t$7is no structure for the partial strings...

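The example is not very well written, but hopefully you get the idea. I would like to know whether something like this is feasible, or whether it makes no sense at all. I can define a special character such as $, or use (__) or any other characters that can mark a variable.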

(Note: the string can contain almost any UTF-8 character, but I can reserve a few characters.)
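To make the reserved-character idea concrete, here is a minimal sketch of the substitution step, assuming the dictionary is already known. The names `MARK`, `END`, `encode`, and `decode` are hypothetical, and the two control characters are only an assumption about what could be reserved. A closing character is used because a bare `$4` followed by a literal `1` (as in `$41$5` in the expected output) cannot be told apart from a reference to entry 41.

```javascript
// Hypothetical reserved characters, assumed never to occur in the input text.
const MARK = "\u0001"; // starts a dictionary reference
const END = "\u0002";  // ends a reference, so entry 4 followed by "1" is unambiguous

// Escape regex metacharacters so dictionary entries are matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace every dictionary entry in one pass with MARK + index + END.
// Entries are assumed to be non-empty; longer entries are tried first.
function encode(text, dictionary) {
  if (dictionary.length === 0) return text;
  const indexOf = new Map(dictionary.map((entry, i) => [entry, i]));
  const alternatives = [...dictionary]
    .sort((a, b) => b.length - a.length)
    .map(escapeRegExp)
    .join("|");
  return text.replace(
    new RegExp(alternatives, "g"),
    (match) => MARK + indexOf.get(match) + END
  );
}

// Expand every MARK + index + END reference back into its dictionary entry.
function decode(encoded, dictionary) {
  const reference = new RegExp(MARK + "(\\d+)" + END, "g");
  return encoded.replace(reference, (_, i) => dictionary[Number(i)]);
}
```

A single regex pass is used instead of repeated replacements so that text already turned into a reference is never scanned again. Note that an entry only saves space if it is longer than the reference that replaces it.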

Any ideas for an algorithm? Or a regular expression?
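For illustration only, one naive direction is a seed-and-extend scan: remember where each window of length min first appeared and, when the same window shows up again, extend the match forward until the two copies differ or max is reached. This is a sketch under assumptions, not a tuned solution. The function name `findRepeats` is hypothetical, lengths are counted in UTF-16 code units, and the map of windows may need a lot of memory on a 5-10 MB input. A suffix array with longest-common-prefix information is a more scalable way to find repeated substrings, but it is more involved.

```javascript
// Return distinct substrings of length min..max that occur at least twice
// without overlapping, longest first. Matches are found greedily, so the
// result is a set of candidates rather than an optimal dictionary.
function findRepeats(text, min, max) {
  const firstSeen = new Map(); // min-length window -> index of its first occurrence
  const found = new Set();     // candidate repeated substrings
  let i = 0;
  while (i + min <= text.length) {
    const seed = text.slice(i, i + min);
    const prev = firstSeen.get(seed);
    if (prev === undefined) {
      firstSeen.set(seed, i);
      i += 1;
      continue;
    }
    if (prev + min > i) {
      // The earlier copy overlaps this one; skip it.
      i += 1;
      continue;
    }
    // Extend while both copies agree, stay apart, and the length is below max.
    let len = min;
    while (
      len < max &&
      i + len < text.length &&
      prev + len < i &&
      text[prev + len] === text[i + len]
    ) {
      len += 1;
    }
    found.add(text.slice(i, i + len));
    i += len; // continue after the matched region
  }
  return [...found].sort((a, b) => b.length - a.length);
}

// Example call with a small input:
console.log(findRepeats("abcdef--abcdef++abcdxy", 4, 100));
```

Deciding which candidates are actually worth putting in the dictionary (for example, by how many characters each one saves) is left out of this sketch.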

0 answers:

No answers