On tafl: game tree search optimizations

As you’ll recall from the tafl piece about complexity, tafl has a high branching factor compared to, say, chess1. That means that a tafl game tree is broad: it has a lot of nodes relative to its depth. Since searching deep into a broad tree is difficult, getting good lookahead out of a tafl engine is going to be difficult. And, to be clear on where this post is going, it started off being extremely difficult.

Before I get there, though, I want to talk about the state of the art in chess AI as of now, or at least a few years ago. (Recent sources are probably easy to find if I were to look in the right places, but I forgot I still have a .edu email address with which I can probably sneak into a bunch of online article repositories.) Chess engines, today, on a moderately powerful desktop computer, are able to generate about four million game states per second. They also have highly advanced pruning2: not only the standard sort I’ll go into in a later post, but also a more qualitative sort, discarding moves that pass the plausibility test but fail the usefulness test.

Four million states per second is an extremely high bar to hit. Chess engines have been in development since the dawn of digital computing, just about, and fifty to sixty years of optimization really shows3. I told myself I’d be happy with somewhere between ten and twenty-five percent of chess’s state of the art.

How far did I have to go? Well, I coded up my state explorer, put a bit of benchmarking code in, and set it exploring three ply deep in brandub. It chugged along for a while, and then, almost forty seconds later, finished. The number? 1,000 states per second, which comes to between four hundred and one thousand times too slow.

Ouch.

I quickly discovered a few obvious things I could fix. First, I was doing an expensive operation (getting the allowable moves for a piece) for every piece on each side, to check if either side had no allowable moves remaining. I replaced that with a short-circuit check that returns true as soon as it discovers any allowable move for either side, which is a much quicker proposition. Second, I was checking for edge forts at the end of every turn, whether or not the game allowed for edge forts. I put a stop to that. Third, I was checking for edge forts in a vastly more expensive way than I needed to, to allow for a situation that does not occur in any known tafl variant4.

Confidently, I ran the benchmark again: 4,000 states per second.

Well, I’d been hoping for more than that.

I did a little reading, and decided some of my problem was with the programming language I’d selected. My language of choice for most of my side projects is a scripting language called Groovy. It’s Java’s little brother, sort of; it provides a bunch of shortcuts Java does not for displaying things to the terminal, handling user inputs, playing around with files, and iterating over sets of objects. I suspected from the start that Groovy might cause me headaches, and profiling suggested that was correct—my code was spending most of its execution time in Groovy’s internals. Rewriting it in Java was quick, as such things go, and with that squared away, and me optimistic as to my chances, I set up another benchmark.

15,000 states per second.

Three times faster is great, but I still had a few orders of magnitude to go.

I decided that the board representation I had started with was clever, but inefficient. My profiler suggested I was using literally tens of millions of objects: each space was a first-class object, in the first reckoning. This would not do; objects in Java may be small, and instantiating them may be fast, but there’s a limit to what a Java virtual machine can be expected to put up with. I removed the class of object which represented spaces, and set about rewriting the code to reference an array of small integers to represent a board.

Moving to an integer representation meant that I had an opportunity to come up with an encoding I can eventually use to rid myself of helper objects altogether. (As I write this, I’m still using a class of objects to represent taflmen—they’re inflated from the integer representation when needed, and compressed when the state is no longer active during a game tree search.) The first six bits of the integer are used for an ID number—each taflman-representation has a unique ID within its side. Six bits lets me define up to 127 taflmen per side, which is sufficient for every known variant, and leaves some room for busier alea evangelii variants, if desired. The next three bits define the piece’s type. I’m aware of five potential types of taflman: a standard taflman, a knight or a commander from berserker, a mercenary (a defending piece which switches sides when captured, seen in one proposed alea evangelii variant), and a king. Finally, the next bit defines whether the piece is a besieger or a defender.

I was a little put out to realize that my efforts did not actually buy me very much—I was still at about 15,000 states per second.

My next task was to observe the behavior of my code, and find out where it was performing poorly. An unexpected but obvious optimization presented itself. When you run a Java program, it defaults to using a small amount of memory. You can provide command line switches to allow it to use a much larger amount of memory. Doing that, I found I was up to 50,000 states per second—that’s getting somewhere.

In fact, that’s what I’d call having gotten somewhere. 50,000 states per second is sufficient to play a game of brandub to a search depth of 3, without undue waiting. (It takes about 40,000 states explored to get to depth 3, and about 1.2 million to depth 4.) I whipped up a quick evaluation function (which I’ll describe in a subsequent post), hooked it up, and quickly found that it played only slightly more poorly than I do5.

With that heartening development, I also stumbled upon a happy fact. Java is, at first, an interpreted language, but it compiles sections of programs to faster, native code when it discovers that it runs into them a lot. After the computer took its first turn, I saw the benchmark report that it had explored 200,000 states per second. 200,000 states per second I can live with—my laptop is several years old and a laptop, and in testing, my desktop was about five times faster in an earlier version of the code. I intend to test it again soon, and see what a powerful machine can do with it.

I made one final optimization after that. Since the AI code used a thick representation of game states, it was extremely memory-hungry—brandub to depth 3 took over four gigabytes of memory, and other games were impossible altogether. I solved that one with two changes: first, I now represent the explored nodes of a tree as a list of moves required to reach each node from the root. Second, I trigger garbage collection manually.

Now, this may seem like a bad idea, to those of you with some Java development experience, but in this case, it’s called for. The default Java garbage collection scheme is two-leveled: objects just created go in a group called the Eden space. The Eden space contains objects whose longevity is not known, and which will therefore likely be quickly removed. If an object survives garbage collection in the Eden space, it moves into the Survivor space, where middle-longevity objects live. The Survivor space is touched less frequently by garbage collection. Finally, if an object lives in the Survivor space through a couple of garbage collections, it moves into the Tenured space, where it is troubled by garbage collections least frequently of all.

A lot of my objects end up in the Tenured generation—the objects which represent boards, particularly. They’re long-lived and durable, while almost every other object in the code comes and goes somewhat quickly. That means that Board objects, which are rather large, and which also contain references to other, shorter-lived objects, stay in memory more often than I’d like, and keep their sub-objects in memory, too. Triggering a garbage collection manually garbage collects everything, regardless of the space in which it lives, so the Board objects are cleaned up, freeing space. It does slow the code down somewhat: I get about 120,000 states per second on my laptop now, under ideal conditions.

Like I said, I can live with that. It’s not where I want it to be, but it’ll do for the time being. Coming up in the not-too-distant future: writing on evaluation function, although I have some research to do before I’m done with that.

1. Go starts out branchier than 11×11 tafl variants, obviously, since you can play a stone at any point on the board. Thinking in terms of smart opening moves, I suspect the order still goes go, 11×11 tafl, chess. Large (that is, 19×19) tafl variants probably branch more than go at almost every stage of the game; the board is the same size.
2. That is, trimming unproductive branches from the game tree before expending the effort to explore them.
3. Here’s one for you. Computer memory is organized into bytes: eight slots to hold either 0 or 1 (called ‘bits’). Some data types are eight bytes long, or 64 bits. The chess board is 64 spaces, so someone had the idea to represent chess boards in a single eight-byte long integer. You need a few of these bitboards to represent a whole chess game state, obviously: minimally, two to represent which side controls each space, and six to represent which kind of piece occupies each space. Even so, it’s very compact, and bitwise operations on 64 bits at a time are blazingly quick on today’s 64-bit CPUs.
Unfortunately, this does not generalize well to tafl; most of the boards we concern ourselves with are larger than 64 spaces.
4. Namely, allowing black, using Berserker-style pieces, to jump into a Copenhagen edge fort to break it.
5. This says a lot more about me than about the AI.

Leave a Reply