Question

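I have a task to create a script that takes a huge text file as input. It then needs to find all the words and how often each occurs, and create a new file in which each line shows a unique word and its occurrence count.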

As an example, take a file with this content:

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor 
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud 
exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure
dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.   
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt 
mollit anim id est laborum.

I need to create a file that looks like this:

1 AD
1 ADIPISICING
1 ALIQUA
...
1 ALIQUIP
1 DO
2 DOLOR
2 DOLORE
...

For this I wrote a script using tr, sort and uniq:

#!/bin/sh
INPUT=$1
OUTPUT=$2
# [:special:] is not a character class that tr defines, so it is omitted here.
if [ -f "$INPUT" ]
then
    tr '[:space:][\-_?!.;\:]' '\n' < "$INPUT" |
        tr -d '[:punct:][:digit:]' |
        tr '[:lower:]' '[:upper:]' |
        sort |
        uniq -c > "$OUTPUT"
fi

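What this does is split the text into words, using whitespace as the delimiter. If a word contains any of -_?!.;: I split it into words again. I remove punctuation, special characters and digits, and convert the whole string to uppercase. Once that is done, I sort it and pass it through uniq to get it into the format I want.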

I then downloaded the Bible in txt format and used it as input. Timing the script, I got:

scripts|$ time ./text-to-word.sh text.txt b     
./text-to-word.sh text.txt b  16.17s user 0.09s system 102% cpu 15.934 total

I did the same thing with a Python script:

import re
from collections import Counter
from itertools import chain
import sys

file = open(sys.argv[1])

c = Counter()

for line in file.readlines():
    c.update([re.sub('[^a-zA-Z]', '', l).upper()
            for l in chain(*[re.split('[-_?!.;:]', word)
                    for word in line.split()])])

file2 = open('output.txt', 'w')
for key in sorted(c):
    file2.write(key + ' ' + str(c[key]) + '\n')

When I executed the script, I got:

scripts|$ time python text-to-word.py text.txt
python text-to-word.py text.txt  7.23s user 0.04s system 97% cpu 7.456 total

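As you can see, it ran in 7.23s, compared to the shell script, which ran in 16.17s. I have tried with bigger files, and Python always seems to win. I have a few questions about the scenario above: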

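Why is the Python script faster, given that the shell commands are written in C? I do realize that the shell script may not be optimal.
How can I improve the shell script?
Can I improve the Python script?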

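To be clear, I am not comparing Python to shell scripts. I am not trying to start a flame war, and I do not need answers showing that some other language is faster. Following the UNIX philosophy of piping small commands together to do a task, how do I make the shell script faster?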

Answer 1

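An important point here is probably inter-process I/O. The Python script holds all of its data in memory, so no I/O takes place while it processes the data.

Also note that Python is not slow as such. Most of the functionality in Python is implemented in C.

The shell script has to start 5 processes, and each of them has to read the whole text from stdin and write it to stdout, so the text is copied through a pipe four times.

There might be a way to make the Python script a little faster: you can read the whole text into a single string, then remove all punctuation, split it into words and count them: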

text = file.read()
# '-' goes last in the class so that it is a literal, not a range
text = re.sub(r'[.,:;_-]', '', text)
text = text.upper()
# split on runs of whitespace
words = re.split(r'\s+', text)
c = Counter()
c.update(words)

This avoids the overhead of several nested loops.
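
A self-contained version of this idea might look like the sketch below. It is only an illustration: the names `count_words` and `count_file` and the use of `re.findall` to extract words are assumptions made here, not part of the snippet above. Treating every non-letter as a separator also does not reproduce the question's script exactly; for example, `don't` is counted as `DON` and `T` here, but as `DONT` in the original.

```python
import re
from collections import Counter


def count_words(text):
    """Count runs of the letters A-Z after uppercasing the text."""
    # One uppercase pass and one regex scan replace the nested loops.
    return Counter(re.findall(r"[A-Z]+", text.upper()))


def count_file(path):
    """Read the whole file into one string and count its words."""
    with open(path) as handle:
        return count_words(handle.read())


sample = "Duis aute irure dolor in reprehenderit in voluptate"
for word, count in sorted(count_words(sample).items()):
    print(count, word)
```

Because `findall` only ever returns non-empty matches, this version never counts an empty string as a word.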

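As for the shell script: you should try to reduce the number of processes. The three tr processes can probably be replaced by a single call to sed.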

Answer 2

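This is not a matter of one language versus another. Your approaches are different.

In Python, you increment a counter for each word as you encounter it, then iterate over the counter to produce the output. This is O(n).

In bash, you put all the words individually into one long tuple, sort the tuple, and then count the instances. This is most likely O(n log n) because of the sort.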
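
The difference between the two strategies can be sketched in Python. Both functions are illustrative only (the names `count_by_hashing` and `count_by_sorting` are made up for this example); the second one mimics what `sort | uniq -c` does, sorting every word occurrence and then counting runs of adjacent equal words.

```python
from collections import Counter
from itertools import groupby


def count_by_hashing(words):
    """One pass with a hash table: expected O(n) for n words."""
    return dict(Counter(words))


def count_by_sorting(words):
    """Sort all occurrences, then count adjacent equal items.

    The sort dominates the cost at O(n log n) comparisons.
    """
    return {word: sum(1 for _ in run) for word, run in groupby(sorted(words))}


words = "DOLOR SIT DOLORE DOLOR IN DOLORE".split()
# Both strategies agree on the counts; only the amount of work differs.
assert count_by_hashing(words) == count_by_sorting(words)
```

Note that the Python script in the question does sort as well, but only the unique words when writing the output, which is a much smaller set than all word occurrences.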

Answer 3

You can improve the bash script:

sed 's/[^a-zA-Z][^a-zA-Z]*/\'$'\n/g' <"$INPUT" | sort -f -u >"$OUTPUT"

But the short answer to your question is: because you are using completely different algorithms.

Answer 4

You can try this:

Take the input file to be Input.txt

Bash script

tr '[:space:]' '\n' < Input.txt | grep -v '^[[:space:]]*$' | tr '[:lower:]' '[:upper:]' | sort | uniq -c | sort -bnr

Answer 5

One way using GNU awk:

WHINY_USERS=1 awk '{ for (i=1; i<=NF; i++) { sub("[,.]","",$i); array[toupper($i)]++ } } END { for (j in array) print array[j], j }' file.txt

Pseudocode / explanation:

## WHINY_USERS=1 enables sorting by keys. A bit of a trick.
## Now loop through each word on each line, removing commas, full-stops,
## adding each word in uppercase to an array.
## Loop through the array printing vals and keys

YMMV

Answer 6

bash solution

#!/bin/bash
IFS=' -_?!.;\:,'
while read -r line; do
  for word in $line; do
    word=${word//[^[:alpha:]]/}
    [ $word ] || continue
    word=$(tr '[:lower:]' '[:upper:]' <<<"$word")
    ((_w_$word++))
  done
done <"$INPUT"
IFS=' '
for wword in ${!_w_*}; do echo "${!wword} ${wword#_w_}"; done > $OUTPUT.v1

perl golf solution

perl -nle '$h{uc()}++for/(\w+)/g}{print"$h{$_} $_"for sort keys%h'  $INPUT > $OUTPUT.v2
