Well, we can use a `memoryview` to build the dict, which avoids creating string objects until it's time to print the output:
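
    import operator
    import re

    def count_words(filename):
        with open(filename, 'rb') as fp:
            # Read-only view over the file's bytes; slicing it copies nothing.
            data = memoryview(fp.read())
        word_counts = {}
        for match in re.finditer(br'\S+', data):
            # A memoryview slice over bytes is hashable, so it can be a dict key.
            word = data[match.start():match.end()]
            try:
                word_counts[word] += 1
            except KeyError:
                word_counts[word] = 1
        # Sort by count, most frequent first.
        word_counts = sorted(word_counts.items(), key=operator.itemgetter(1), reverse=True)
        for word, count in word_counts:
            # Only now is each word turned into a str.
            print(word.tobytes().decode(), count)
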
We could also use `mmap.mmap`.

For reasons I never quite understood, Python has a `collections.Counter` for the purpose of counting things. It's a bit cleaner.
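
As a rough sketch of how those two suggestions might fit together (the function name `count_words_mmap` is made up for illustration, and the empty-file check is an assumption added because `mmap` refuses to map a zero-length file):

```python
import mmap
import os
import re
from collections import Counter


def count_words_mmap(filename):
    """Count whitespace-separated byte strings in a memory-mapped file."""
    with open(filename, 'rb') as fp:
        # mmap raises ValueError on an empty file, so handle that case first.
        if os.fstat(fp.fileno()).st_size == 0:
            return Counter()
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as data:
            # match.group() returns a bytes object for each word, so this
            # version does create one small object per match.
            counts = Counter(m.group() for m in re.finditer(br'\S+', data))
    return counts
```

`counts.most_common()` then gives the `(word, count)` pairs already sorted by descending count, which replaces the explicit `sorted(...)` call. The keys are `bytes`, so they still need `.decode()` before printing.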
This doesn't do quite the same thing, though, since it isn't Unicode-aware: a bytes pattern such as `br'\S+'` only treats ASCII whitespace as a separator.
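
To make the difference concrete, here is a small comparison. The helper names and the sample string are invented for illustration; the sample uses U+2003 (EM SPACE), which a `str` pattern counts as whitespace but a bytes pattern does not.

```python
import re
from collections import Counter


def count_words_text(text):
    """Count words in a str; whitespace follows Unicode rules."""
    return Counter(re.findall(r'\S+', text))


def count_words_bytes(data):
    """Count words in bytes; only ASCII whitespace separates words."""
    return Counter(re.findall(br'\S+', data))


# Hypothetical sample: an EM SPACE between the first two words,
# an ordinary space between the last two.
sample = 'na\u00efve\u2003caf\u00e9 na\u00efve'

text_counts = count_words_text(sample)
byte_counts = count_words_bytes(sample.encode('utf-8'))

print(text_counts)  # three words found, one of them twice
print(byte_counts)  # the first two words are glued into a single token
```

On this input the text version sees the first word twice, while the bytes version sees two different tokens once each, because the UTF-8 bytes of the EM SPACE are treated as part of a word.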