Back in May I wrote a little command-line utility called Topfew (GitHub). It was fun to write, and faster than the shell incantation it replaced. Dirkjan Ochtman dropped in a comment noting that he’d written Topfew along the same lines, but in Rust (GitHub), and that it was 2.86 times faster; at GitHub, the README now says that with a few optimizations it’s now 6.7x faster. I found this puzzling and annoying, so I did some optimization too, encountering surprises along the way. You already know whether you’re the kind of person who wants to read the rest of this.
Reminder: What topfew does is replace the sort | uniq -c | sort -rn | head pipeline that you use to do things like find the most popular API call or API caller by plowing through a logfile.
Initial numbers ·
I ran these tests against a sample of the ongoing Apache access_log, around 2.2GB, containing 8,699,657 lines. In each case, I looked at field number 7, which for most entries gives you the URL being fetched.
Each number is the average of three runs following a warm-up run. (The variation between runs was very minor.)
| | Elapsed (s) | User (s) | System (s) | vs Rust |
|---|---|---|---|---|
| Rust | 4.96 | 4.45 | 0.50 | 1.0 |
| ls/uniq/sort | 142.80 | 149.66 | 1.29 | 28.80 |
| topfew 1.0 | 27.20 | 28.31 | 0.54 | 5.49 |
Yep, the Rust code (Surprise #1) measures over five times faster than the Go. In this task, the computation is trivial and the whole thing should be pretty well 100% I/O-limited. Granted, the Mac I’m running this on has 32GB of RAM, so after the first run the code is almost certainly coming straight out of RAM cache.
But still, the data access should dominate the compute. I’ve heard plenty of talk about how fast Rust code runs, so I’d be unsurprised if its user CPU was smaller. But elapsed time!?
What Dirkjan did · I glanced over his code. Gripe: Compared to other modern languages, I find Rust suffers from a significant readability gap, and for me, that’s a big deal. But I understand the appeal of memory-safety and performance.
Anyhow, Dirkjan, with some help from others, had added a few optimizations:
1. Changed the regex from \s+ to [ \t].
2. The algorithm has a simple hash-map of keys to occurrence counts. I’d originally done this as just string-to-number. He made it string-to-pointer-to-number, so after you find a key in the hash table, you don’t have to update the hash-map, you just increment the pointed-to value.
3. Broke the file up into segments and read them in separate threads, in an attempt to parallelize the I/O.
I’d already thought of #3 based on my intuition that this task is I/O-bound, but not #2, even though it’s obvious.
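In Go terms, #2 looks something like this sketch (mine, not a transcription of Dirkjan’s Rust, and the names are made up):

```go
package main

import "fmt"

// Sketch of the pointer-to-count idea: because the map values are pointers,
// a repeat key just increments through the pointer; there's no second write
// into the hash-map on the hot path.
func main() {
	counts := make(map[string]*uint64)
	for _, key := range []string{"/a", "/b", "/a", "/a"} {
		if p, ok := counts[key]; ok {
			*p++ // key already present: bump the count in place
		} else {
			n := uint64(1)
			counts[key] = &n // first sighting: one map insert
		}
	}
	for k, p := range counts {
		fmt.Println(k, *p)
	}
}
```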
What I did · In his comment, Dirkjan pointed out correctly that I’d built my binary “naively with ‘go build’”. So I went looking for how to make builds that are super-optimized for production and (Surprise #2) there aren’t any. (There’s a way to turn optimizations off to get more deterministic behavior.) Now that I think about it I guess this is good; make all the optimizations safe and just do them, don’t make the developer ask for them.
(Alternatively, maybe there are techniques for optimizing the build, but some combination of poor documentation or me not being smart enough led to me not finding them.)
Next, I tried switching in the simpler regex, as in Dirkjan’s optimization #1.
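In Go terms the change is just the pattern. Here’s a sketch of the shape (my guess, not the actual Topfew source):

```go
package main

import (
	"fmt"
	"regexp"
)

// Sketch of the regex-based field extraction; the names and the Split call
// are guesses at the shape, not the actual Topfew source. The only change
// from the original is the pattern: [ \t] instead of \s+. It assumes fields
// are separated by single spaces, as in a typical access_log line.
var fieldSep = regexp.MustCompile(`[ \t]`)

func main() {
	line := `192.0.2.1 - - [10/Oct/2020:13:55:36 -0700] "GET /ongoing/x HTTP/1.1" 200 1234`
	fields := fieldSep.Split(line, -1)
	fmt.Println(fields[6]) // field number 7: the URL being fetched
}
```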
| | Elapsed (s) | User (s) | System (s) | vs Rust |
|---|---|---|---|---|
| Rust | 4.96 | 4.45 | 0.50 | 1.0 |
| ls/uniq/sort | 142.80 | 149.66 | 1.29 | 28.80 |
| topfew 1.0 | 27.20 | 28.31 | 0.54 | 5.49 |
| “[ \t]” | 25.03 | 26.18 | 0.53 | 5.05 |
That’s a little better. Hmm, now I’m feeling regex anxiety.
Breakthrough · At this point I fat-fingered one of the runs to select on the first rather than the seventh field and that really made topfew run a lot faster. Which strengthened my growing suspicion that I was spending a lot of my time selecting the field out of the record.
At this point I googled “golang regexp slow” and got lots of hits; there is indeed a widely expressed opinion that Go’s regexp implementation is far from the fastest. Your attitude to this probably depends on your attitude toward using regular expressions at all, particularly in performance-critical code.
So I tossed out the regexp and hand-wrangled a primitive brute-force field finder. Which by the way runs on byte-array slices rather than strings, thus skipping at least one conversion step. The code ain’t pretty but passed the unit tests and look what happened!
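Here’s a simplified sketch of the idea, not the actual Topfew code; it glosses over leading separators and other edge cases, and the numbers follow below.

```go
package main

import "fmt"

// Simplified sketch of a brute-force field finder over a byte slice: skip
// past n-1 fields and their trailing separators, then return the bytes up
// to the next space or tab. No regexp, no []byte-to-string conversion.
func extractField(record []byte, n int) []byte {
	isSep := func(b byte) bool { return b == ' ' || b == '\t' }
	i := 0
	for f := 1; f < n; f++ {
		for i < len(record) && !isSep(record[i]) {
			i++
		}
		for i < len(record) && isSep(record[i]) {
			i++
		}
	}
	start := i
	for i < len(record) && !isSep(record[i]) {
		i++
	}
	return record[start:i]
}

func main() {
	line := []byte(`192.0.2.1 - - [10/Oct/2020:13:55:36 -0700] "GET /ongoing/x HTTP/1.1" 200 1234`)
	fmt.Printf("%s\n", extractField(line, 7)) // the URL being fetched
}
```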
| | Elapsed (s) | User (s) | System (s) | vs Rust |
|---|---|---|---|---|
| Rust | 4.96 | 4.45 | 0.50 | 1.0 |
| ls/uniq/sort | 142.80 | 149.66 | 1.29 | 28.80 |
| topfew 1.0 | 27.20 | 28.31 | 0.54 | 5.49 |
| “[ \t]” | 25.03 | 26.18 | 0.53 | 5.05 |
| no regexp | 4.23 | 4.83 | 0.66 | 0.85 |
There are a few things to note here. Most obviously (Surprise #3), the Go implementation now has less elapsed time. So yeah, the regexp performance was sufficiently bad to make this process CPU-limited.
Less obviously, in the Rust implementation the user CPU time is less than the elapsed; user+system CPU are almost identical to elapsed, all of which suggests it’s really I/O-limited. Whereas in all the different Go permutations, the user CPU exceeds the elapsed. So there’s some concurrency happening in there somewhere even though my code was all single-threaded.
I’m wondering if Go’s sequential file-reading code is doing multi-threaded buffer-shuffling voodoo. It seems highly likely, since that could explain the Go implementation’s smaller elapsed time on what seems like an I/O-limited problem.
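For reference, my read loop is the obvious single-threaded bufio thing, shaped roughly like this (a sketch with a made-up filename, not the actual source):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

// Sketch of the single-threaded read loop (the shape of it, not the actual
// Topfew source): one bufio reader handing back one record at a time. Any
// concurrency beyond this is the runtime's and the OS's, not mine.
func main() {
	f, err := os.Open("access_log") // hypothetical filename
	if err != nil {
		panic(err)
	}
	defer f.Close()

	lines := 0
	reader := bufio.NewReader(f)
	for {
		record, err := reader.ReadBytes('\n')
		if len(record) > 0 {
			lines++
			// ... extract the key field from record and count it ...
		}
		if err != nil {
			break // io.EOF, or a real error this sketch doesn't distinguish
		}
	}
	fmt.Println(lines, "lines")
}
```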
[Update: Dirkjan, after reading this, also introduced regex avoidance, but reports that it didn’t appear to speed up the program. Interesting.]
Non-breakthroughs · At this point I was pretty happy, but not ready to give up, so I implemented Dirkjan’s optimization #2: storing pointers to counts to allow increments without hash-table updates.
| | Elapsed (s) | User (s) | System (s) | vs Rust |
|---|---|---|---|---|
| Rust | 4.96 | 4.45 | 0.50 | 1.0 |
| ls/uniq/sort | 142.80 | 149.66 | 1.29 | 28.80 |
| topfew 1.0 | 27.20 | 28.31 | 0.54 | 5.49 |
| “[ \t]” | 25.03 | 26.18 | 0.53 | 5.05 |
| no regexp | 4.23 | 4.83 | 0.66 | 0.85 |
| hash pointers | 4.16 | 5.10 | 0.65 | 0.84 |
Um well that wasn’t (Surprise #4) what I expected. Avoiding almost nine million hash-table updates had almost no observable effect on latency, while slightly increasing user CPU. At one level, since there’s evidence the code is limited by I/O throughput, the lack of significant change in elapsed time shouldn’t be too surprising. The increase in user CPU is, though; possibly it’s just a measurement anomaly?
Well, I thought, if I’m I/O-limited in the filesystem, let’s be a little more modern and instead of reading the file, let’s just pull it into virtual memory with mmap(2). So it should be the VM paging code getting the data, and everyone knows that’s faster, right?
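For the record, here’s roughly the shape of an mmap-based reader, written against the raw syscall package rather than whichever library I actually grabbed (a sketch; error handling is crude and a final line without a trailing newline gets dropped):

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"syscall"
)

// Sketch of an mmap-based reader for Unix-ish systems: map the whole file
// into memory, then walk it as one big byte slice, splitting on newlines.
// Not the library I actually used.
func main() {
	f, err := os.Open("access_log") // hypothetical filename
	if err != nil {
		panic(err)
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		panic(err)
	}

	data, err := syscall.Mmap(int(f.Fd()), 0, int(info.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(data)

	lines := 0
	for len(data) > 0 {
		nl := bytes.IndexByte(data, '\n')
		if nl < 0 {
			break
		}
		// ... extract the key field from data[:nl] and count it ...
		lines++
		data = data[nl+1:]
	}
	fmt.Println(lines, "lines")
}
```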
| | Elapsed (s) | User (s) | System (s) | vs Rust |
|---|---|---|---|---|
| Rust | 4.96 | 4.45 | 0.50 | 1.0 |
| ls/uniq/sort | 142.80 | 149.66 | 1.29 | 28.80 |
| topfew 1.0 | 27.20 | 28.31 | 0.54 | 5.49 |
| “[ \t]” | 25.03 | 26.18 | 0.53 | 5.05 |
| no regexp | 4.23 | 4.83 | 0.66 | 0.85 |
| hash pointers | 4.16 | 5.10 | 0.65 | 0.84 |
| mmap | 7.17 | 8.33 | 0.62 | 1.45 |
So by my math, that’s (Surprise #5) 72% slower. I am actually surprised, because if the paging code just calls through to the filesystem, it ought to be a pretty thin layer and not slow things down that much. I have three hypotheses: First of all, Go’s runtime maybe does some sort of super-intelligent buffer wrangling to make the important special case of sequential filesystem I/O run fast. Second, the Go mmap library I picked more or less at random is not that great. Third, the underlying MacOS mmap(2) implementation might be the choke point. More than one of these could be true.
A future research task is to spin up a muscular EC2 instance and run a few of these comparisons there to see how a more typical server-side Linux box would fare.
Parallel? · Finally, I thought about Dirkjan’s parallel I/O implementation. But I decided not to go there, for two reasons. First of all, I didn’t think it would actually be real-world useful, because it requires having a file to seek around in, and most times I’m running this kind of a pipe I’ve got a grep or something in front of it to pull out the lines I’m interested in. Second, the Rust program that does this is already acting I/O limited, while the Go program that doesn’t is getting there faster. So it seems unlikely there’s much upside.
But hey, Go makes concurrency easy, so I refactored slightly. First, I wrapped a mutex around the hash-table access. Then I put the code that extracts the key and calls the counter in a goroutine, throwing a high-fanout high-TPS concurrency problem at the Go runtime without much prior thought. Here’s what happened.
| | Elapsed (s) | User (s) | System (s) | vs Rust |
|---|---|---|---|---|
| Rust | 4.96 | 4.45 | 0.50 | 1.0 |
| ls/uniq/sort | 142.80 | 149.66 | 1.29 | 28.80 |
| topfew 1.0 | 27.20 | 28.31 | 0.54 | 5.49 |
| “[ \t]” | 25.03 | 26.18 | 0.53 | 5.05 |
| no regexp | 4.23 | 4.83 | 0.66 | 0.85 |
| hash pointers | 4.16 | 5.10 | 0.65 | 0.84 |
| goroutines! | 6.53 | 57.84 | 7.68 | 1.32 |
(Note: I turned the mmap off before I turned the goroutines on.)
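For concreteness, the refactor was shaped roughly like this (a sketch, not the actual diff; keyField is a made-up stand-in for the field extraction):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"sync"
)

// Sketch of the brute-force concurrent version: a mutex around the counts
// map, and one goroutine per record. This is the shape of the experiment,
// not code anyone should copy; the huge fan-out is the point of Surprise #6.
type counter struct {
	mu     sync.Mutex
	counts map[string]*uint64
}

func (c *counter) add(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if p, ok := c.counts[key]; ok {
		*p++
	} else {
		n := uint64(1)
		c.counts[key] = &n
	}
}

// keyField stands in for the field-extraction step.
func keyField(line string) string { return line }

func main() {
	c := &counter{counts: make(map[string]*uint64)}
	var wg sync.WaitGroup
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		line := scanner.Text()
		wg.Add(1)
		go func(l string) { // one goroutine per record: enormous fan-out
			defer wg.Done()
			c.add(keyField(l))
		}(line)
	}
	wg.Wait()
	fmt.Println(len(c.counts), "distinct keys")
}
```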
I’m going to say Surprise #6 at this point even though this was a dumbass brute-force move that I probably wouldn’t consider doing in a normal state of mind. But wow, look at that user time; my Mac claims to be an 8-core machine but I was getting more than 8 times as many CPU-seconds as clock seconds. Uh huh. Pity those poor little mutexes.
I guess a saner approach would be to call runtime.NumCPU() and limit the concurrency to that value, probably by using channels. But once again, given that this smells I/O-limited, the pain and overhead of introducing parallel processing seem unlikely to be cost-effective.
Or, maybe I’m just too lazy.
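If I ever do it, the shape I have in mind is something like this, unbuilt and unbenchmarked; again, keyField is a made-up stand-in for the field extraction:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"sync"
)

// Sketch of the saner fan-out: runtime.NumCPU() workers reading records from
// a channel, each keeping its own counts, merged at the end.
// keyField stands in for the field-extraction step.
func keyField(line string) string { return line }

func main() {
	workers := runtime.NumCPU()
	records := make(chan string, 1024)
	results := make(chan map[string]uint64, workers)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			counts := make(map[string]uint64)
			for r := range records {
				counts[keyField(r)]++
			}
			results <- counts
		}()
	}

	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		records <- scanner.Text()
	}
	close(records)
	wg.Wait()
	close(results)

	total := make(map[string]uint64)
	for m := range results {
		for k, v := range m {
			total[k] += v
		}
	}
	fmt.Println(len(total), "distinct keys") // ... then sort and print the top few
}
```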
Lessons · First of all, Rust is not magically and universally faster than Go. Second, using regular expressions can lead to nasty surprises (but we already knew that, didn’t we? DIDN’T WE?!). Third, efficient file handling matters.
Thanks to Dirkjan! I’ll be interested to see if someone else finds a way to push this further forward. And I’ll report back any Linux testing results.