Tonight, OpenTafl saw its first micro-optimization!
Let me start from the beginning, though. OpenTafl has a static Coord type, which holds precalculated information about coordinates: what their index is, where they are on the board, and so on. Before network mode, OpenTafl updated the Coord type at the start of each game to correspond to the board size for the selected rules variation.
Astute readers may have noticed the problem. On the server, multiple games may be ongoing, on boards of different sizes. The Coord object should therefore have all the information required for any size of board. The network mode patches included a rewrite of Coord to hold all the information at once, indexed in maps by board size. This solved the problem, but yielded about a performance hit of about ten to fifteen percent. (Obviously, the Coord code paths come up a lot.)
At first, I wasn’t going to worry about it. Then, I had a little downtime this afternoon and decided, “Well, I’m not going to get anything else done right now.” Firing up the old profiler, I found a hotspot to work on.
When exploring the game tree, OpenTafl caches information about how pieces can move. (Calculating it out every time is computationally expensive, or at least more computationally expensive than can be justified doing it every time.) Clearing this cache involves zeroing an array. The usual way to do this is to loop over the array and set everything to zero. (Whether you choose a library function or write it by hand, this is usually what happens.) Unfortunately, that ‘reset’ function, according to the profiler, took longer on aggregate than any other single operation, including such expensive calculations as shieldwall detection and surrounding victories. This would not do.
The solution? Well, although Java’s library methods for array filling do not use native code memory copying, Java’s library method for copying one array to another does. The titular micro-optimization, therefore, was to set the first slot of the array to the desired value, then copy that to the second, then copy those two to the second two, then copy those four to the next four, and so on until the array is fully filled. Since this works directly on system memory, it turns out to be faster.
How much faster? To answer that question, I wrote a little benchmark function for OpenTafl, which runs a 30-second brandub search, a 30-second tablut 9×9 search, a 60-second Copenhagen search, and a 60-second Tablut 15×15 search, which verified the 10-15% slowdown from pre-network to post-network. Fortunately, the move cache optimization yielded a speedup of about 15%, bringing us right back to where we were before. Et voila.