My “wiggle” program, which applies a patch to a different version of a file, needs to compute the shortest edit path between two texts: the smallest set of additions and deletions that must be made to one file so as to produce the other. From this edit path it is easy to produce a list of sections that are the same or different, and wiggle uses this to work out what changes need to be made, and where to make them.
When the two files being compared are large – on the order of tens of thousands of words – this process can start taking a long time. I recently started hitting such a case fairly often, so it was clearly time to optimize the code. I managed to find two optimizations that don’t change the result at all, and two others that degrade the result on large files in exchange for substantial speed improvements.
Improvements that preserve correctness.
To get some ideas on how to proceed I had a look at GNU-diff. Wiggle performs a similar task to GNU-diff using a generally similar algorithm but completely different code – because I wanted to write it myself. This means that any ideas I found in GNU-diff would probably apply, but I would have to write new code to implement them.
The first improvement comes from diff’s trick of ignoring unique lines. diff always looks at whole lines, while wiggle can look at lines but usually works with words, so the heuristic of discarding unique lines becomes a heuristic for discarding unique words.
By “unique” here, I mean words that appear in one file but not the other. Such words can never be part of a match between the files, ignoring them isn’t too expensive, and it can help.
The result of finding a minimal edit path is a list of common sublists. Each sublist identifies a contiguous sequence of words in one file which also exists as a contiguous sequence in the other file. If we completely ignored unique words we might end up thinking we had found two common sublists when really either or both may have some unique words mixed in. So that isn’t quite what we do. Instead we look for a sequence of two or more unique words and ignore all but the first. This ensures that no common sublist can contain this sequence, as none can contain its first word, while most of the unique words are still removed from consideration.
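To make the shape of this concrete, here is a minimal sketch in C. It is not wiggle’s actual code – the function names, the types and the quadratic lookup are all invented for illustration (a real implementation would hash the word counts rather than scan) – but it shows the heuristic:

/* Sketch of the "discard runs of unique words" idea; not wiggle's code. */
#include <string.h>
#include <stdbool.h>

/* Does 'word' appear anywhere in the other file? */
bool in_other(const char *word, const char **other, int m)
{
        int i;

        for (i = 0; i < m; i++)
                if (strcmp(word, other[i]) == 0)
                        return true;
        return false;
}

/* In any run of two or more consecutive words that are unique to
 * 'file', keep the first and mark the rest to be skipped.  No common
 * sublist can span the run, because none can contain its first word. */
void mark_skippable(const char **file, int n,
                    const char **other, int m, bool *skip)
{
        int i, run = 0;

        for (i = 0; i < n; i++) {
                if (in_other(file[i], other, m)) {
                        run = 0;
                        skip[i] = false;
                } else {
                        run++;
                        skip[i] = (run >= 2);   /* keep only the first of the run */
                }
        }
}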
For the particular files that were causing me problems, this optimization helped a lot when looking for line-based differences, but not so much for word-based differences. In this case, almost all the words from one file occurred in the other, so few were removed. Many of the lines were made of different combinations of these words, so when looking at lines as wholes, many could be removed. So while this didn’t help my particular use case much, it does seem like a generally worthwhile optimization.
The second optimization that preserves the result is a little more complex, so I’ll need to briefly explain the algorithm – hopefully not so briefly that you cannot follow it.
To find the edit distance (number of adds plus number of deletes) between two files (lists of lines or words) we imagine an MxN array of points, where there are M-1 words in the first file and N-1 in the second. Each point represents a sequence of words. The bottom-left point has no words at all, the top-left has all the words of the first file, the bottom-right has all the words of the second file, and the top-right has all the words of both files.
In this matrix an edit path starts at the top-left and needs to end at the bottom-right. Stepping downward deletes a word and has a cost of 1. Stepping rightward adds a word and also has a cost of 1. If the mth word of the first file and the nth word of the second file are the same, then stepping diagonally from (m-1, n-1) to (m, n) has no cost.
We walk this matrix in breadth-first order starting at the top-left, where the path has a cost of zero. The first level we explore is all the points with a path cost of one (from the top-left), then those with a path cost of two, and so on. Each level allows us to find the next level. Each level being considered is a set of points which, if joined together, make a zig-zag line that divides the matrix into two parts: to the left and above are the points with a lower cost; to the right and below are the points with a higher cost. Eventually this line (which can be thought of as a wave-front) reaches the bottom-right and we know what the minimum total cost is. (How we record the path which achieves that cost is not relevant here.)
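Wiggle’s own code is organised quite differently, but this breadth-first wave-front walk is essentially the classic Myers shortest-edit-script search, so a compact sketch of that formulation may help. The names here are invented, the orientation of the matrix differs slightly from the description above, and it only returns the distance without recording the path:

/* Myers-style wave-front edit distance over two word lists a[] and b[];
 * an illustrative sketch, not wiggle's implementation. */
#include <stdlib.h>
#include <string.h>

int edit_distance(const char **a, int n, const char **b, int m)
{
        int max = n + m;
        int off = max + 1;
        /* v[off + k] = furthest index into a[] reached so far on
         * diagonal k, where k = (index into a) - (index into b). */
        int *v = calloc(2 * max + 3, sizeof(int));
        int d, k, x, y, result = max;

        for (d = 0; d <= max; d++) {            /* cost levels 0, 1, 2, ... */
                for (k = -d; k <= d; k += 2) {  /* the wave-front for cost d */
                        if (k == -d || (k != d && v[off+k-1] < v[off+k+1]))
                                x = v[off+k+1];         /* a step that adds a word */
                        else
                                x = v[off+k-1] + 1;     /* a step that deletes a word */
                        y = x - k;
                        /* follow the free diagonal while the words match */
                        while (x < n && y < m && strcmp(a[x], b[y]) == 0)
                                x++, y++;
                        v[off+k] = x;
                        if (x >= n && y >= m) { /* reached the far corner */
                                result = d;
                                goto out;
                        }
                }
        }
out:
        free(v);
        return result;
}

The array v holds, for each diagonal of the matrix, how far the current wave-front has reached; each pass of the outer loop advances the whole front by one cost level.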
At each level we can calculate the worst possible total cost from each point, being the cost so far plus the distance from that point to the end assuming no matches – lots of deletes to get to the bottom, then lots of inserts to get to the right. For each point we can also compute a best case, which is the cost so far plus the number of insertions or deletions required to end up with the correct number of words – assuming every remaining word that can be matched is matched.
Towards the bottom-left and top-right of this “wave-front” the best case can be quite poor, as there aren’t many opportunities left to find matches: at the bottom-left we’ve already deleted most of the first file, and at the top-right we’ve already added most of the second file. So the best case at the ends can easily be worse than the overall worst case – the smallest worst case anywhere on the front. When that happens those end points can never be on a minimal path, so we can reduce the length of the wave-front that we examine at each level.
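As a rough sketch – again with invented names rather than wiggle’s own data structures – the bounds and the trimming look something like this, for a point that has spent cost reaching it after consuming i words of the first file and j words of the second (out of n and m in total):

/* Illustrative bounds and trimming for the wave-front; not wiggle's code. */
struct point { int i, j, cost; };       /* a point on the wave-front */

/* Worst case: no more matches after this point, so we must delete
 * everything left of the first file and add everything left of the second. */
int worst_case(struct point p, int n, int m)
{
        return p.cost + (n - p.i) + (m - p.j);
}

/* Best case: every remaining word that can be matched is matched, so we
 * only pay for the difference in the remaining lengths. */
int best_case(struct point p, int n, int m)
{
        int diff = (n - p.i) - (m - p.j);

        return p.cost + (diff < 0 ? -diff : diff);
}

/* Shrink the wave-front from both ends: a point whose best case is
 * already worse than the smallest worst case found anywhere on the
 * front can never lie on a minimal path, so it need not be explored. */
void trim_front(struct point *front, int *lo, int *hi, int n, int m)
{
        int i, min_worst = worst_case(front[*lo], n, m);

        for (i = *lo + 1; i <= *hi; i++) {
                int w = worst_case(front[i], n, m);
                if (w < min_worst)
                        min_worst = w;
        }
        while (*lo < *hi && best_case(front[*lo], n, m) > min_worst)
                (*lo)++;                /* keep stepping in, not just one step */
        while (*hi > *lo && best_case(front[*hi], n, m) > min_worst)
                (*hi)--;
}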
My code was already doing this a little bit, but not properly. When the best case was poor it would step back a single step towards the middle, but sometimes many more steps were possible. Fixing this up – which is really fixing a bug rather than adding an optimization – gave me a 20% performance improvement in my test case. So I was happy about that.
Optimizations which change the result.
It isn’t always necessary to get a perfect result. Sometimes near enough is good enough, particularly if getting a perfect result takes much longer. So throwing away information can be justified.
Wiggle already does this sometimes. When it has some fragments of a file from a patch, and needs to align them with parts of the base file, it discards some “words” that are uninteresting (mostly this is strings of spaces). Once it has figured out where an approximate match is and has a smaller range of words to work with, it can go back and do something more precise.
The general idea of “reducing the number of words we have to work with” has already come up above when we discarded unique words. I found a way to apply it again. Wiggle normally treats strings of alphanumerics as words and strings of spaces as words, but treats each punctuation character as a word on its own. When comparing code, individual punctuation characters are often interesting in their own right, so it is good to keep them separate. But when that results in too many words, it probably isn’t worth the effort. So I added an option to treat punctuation just like alphanumerics, so we end up with fewer, longer words. Then, if a file has more than 50,000 words, that option gets enabled automatically. The result was a substantial speed-up that made wiggle actually usable on my test case. Not brilliant, but usable.
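The word-splitting rule looks roughly like the sketch below. It is not wiggle’s actual tokenizer – the names and the exact character classes are assumptions – but it shows how merging punctuation into words reduces the word count:

/* Sketch of the word-splitting rule; not wiggle's tokenizer. */
#include <ctype.h>
#include <stdbool.h>

/* Is this character part of a multi-character word? */
bool wordish(unsigned char c, bool merge_punct)
{
        if (isalnum(c))
                return true;
        /* with merge_punct, punctuation joins words like alphanumerics do */
        return merge_punct && !isspace(c);
}

/* Length of the word starting at s (s must not be empty). */
int next_word(const char *s, bool merge_punct)
{
        int len = 1;

        if (isspace((unsigned char)s[0])) {
                while (isspace((unsigned char)s[len]))
                        len++;          /* a run of spaces is one word */
        } else if (wordish((unsigned char)s[0], merge_punct)) {
                while (s[len] && wordish((unsigned char)s[len], merge_punct))
                        len++;
        }
        /* otherwise a single punctuation character is a word by itself */
        return len;
}

With merge_punct set, a run like “));” becomes a single word instead of three, and once the word count passes 50,000 this mode is turned on automatically.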
The final heuristic is based on another idea from GNU-diff: if things are taking too long, take the best you have so far and hope that it is close enough. The exact details of how this decision is made in wiggle are totally different from those in GNU-diff, but the principle is much the same.
The slowest part of wiggle is certainly walking that MxN matrix and advancing the wave-front, so if that takes more than 20 msecs we give up. We choose the point on the front where the worst-case cost is least, and decide to use an edit path which goes through that point. Having made that decision, everything else falls into place. Exactly why is not worth explaining here, but it isn’t too hard to see if you examine the code (maybe…).
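A rough sketch of that cut-off, reusing the same point representation and worst-case bound as the earlier sketch (the names, the timing mechanism and the 20ms check here are all invented for illustration, not how wiggle actually measures time):

/* Sketch of the "give up after ~20ms" cut-off; not wiggle's code. */
#include <time.h>

struct point { int i, j, cost; };       /* as in the trimming sketch */

double elapsed_ms(struct timespec start)
{
        struct timespec now;

        clock_gettime(CLOCK_MONOTONIC, &now);
        return (now.tv_sec - start.tv_sec) * 1000.0 +
               (now.tv_nsec - start.tv_nsec) / 1e6;
}

/* If the search has run for more than ~20ms, pick the point on the
 * (non-empty) front whose worst-case total cost is least and force the
 * edit path through it; otherwise keep refining the wave-front. */
int maybe_give_up(struct point *front, int lo, int hi,
                  int n, int m, struct timespec start,
                  struct point *chosen)
{
        int i, best = -1, best_worst = 0;

        if (elapsed_ms(start) <= 20.0)
                return 0;               /* still within budget, keep going */
        for (i = lo; i <= hi; i++) {
                /* worst case: no further matches after this point */
                int w = front[i].cost + (n - front[i].i) + (m - front[i].j);
                if (best < 0 || w < best_worst) {
                        best = i;
                        best_worst = w;
                }
        }
        *chosen = front[best];
        return 1;                       /* commit to a path through *chosen */
}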
This optimization guarantees that wiggle will complete fairly quickly. It doesn’t guarantee that it will produce the correct answer. But for my test case it does, so that will do for now. I suspect that in most cases it will, and in the cases where it fails, the differences are probably so great that it was not going to be able to do anything really useful anyway.
So: wiggle is now faster for some cases. I should probably make a new release …. but there is a bug that I know about but don’t have a test case for. I’ll wait a while and see if I can find a test case.