Nickname: Frictionless Bananas
Name: Jeremy Sawicki (jeremy at sawicki dot us)
Language: C++
Provisional results: 2 ms, 18th place (1st among non-zero runtimes)
Final results: 9 ms, 1st place
This page describes my CodinGame Spring Challenge 2025 entry. On the provisional leaderboard I had the lowest non-zero runtime of 2 ms with no hardcoded solutions.
The task is to simulate a dice game, computing all possible game states reachable from a given state after a given number of moves (or fewer moves for terminal states), including duplicate states reached via different move sequences. For each state, compute a “state hash” consisting of the dice values encoded as a base-10 number. Sum those state hashes, and report the result modulo 2^30. Scoring is based on total runtime for a set of test cases.
My solution uses breadth first search, computing the set of reachable states one depth at a time. At depth 0, the set is initialized with the input state. At each later depth, I compute the valid moves and resulting states for each state from the previous depth.
I represent a game state as a 32-bit integer. Die values are represented with 3 bits each, for a total of 27 bits. The remaining 5 bits store the number of zeros in the game state, allowing quick detection of terminal states.
Because the same state can be reached via multiple move sequences, it is necessary to merge duplicate states for the search to finish in a reasonable time. Therefore, each state is stored together with a count of the number of ways it can be reached. The count can be stored as a 32-bit integer because the final result is reported modulo 2^30. When adding a new state to the set, I perform a lookup in a hash table to check whether it already exists. If so, the existing count is updated, otherwise the state is added to the set.
Each game state can be transformed into 8 equivalent forms (4 rotations, 2 mirrorings). Runtime can be reduced significantly by storing and processing only one of the equivalent forms. Therefore, I only store game states in one canonical form (the one with the smallest integer representation), along with 8 counts representing the number of ways that each transformed version of that state can be reached.
The set of states for a depth is stored using two parallel arrays: an array of states, where each entry contains the 32-bit state value, and an array of counts, where each entry contains 8 32-bit counts. Entries are added to the arrays in order, starting at index 1. (Index 0 is used as a special value in the hash table.)
In principle, I need two state sets, one for the current depth and one for the next depth. In practice, I have two count arrays but only one state array due to the use of an intermediate move list (see below).
The hash table is a power-of-two sized array of 32-bit values, initialized with zeros. A zero represents an empty slot. A non-zero value represents an index into the state set arrays. To look up a state, I hash it and then use the low bits of the hash to read a value from the hash table. If it is zero, the state is not yet in the set. If it is non-zero, I check whether the corresponding entry in the state array has the state we are looking for. If yes, we found an existing state. Otherwise, I use linear probing, trying successive hash table entries until I find a zero or the state we are looking for.
The hash function is simple, fast, and good enough:
u32 calcHash(u32 state)
{
    u32 hash = state * 0x587a1437;
    hash ^= hash >> 16;
    return hash;
}
The hash table is only used when adding values to the state set, so only one hash table is needed.
When generating moves from the states at one depth, I don’t immediately add new states to the state set for the next depth. Instead, I use an intermediate array to store the states before they have been deduplicated. Each entry contains a 32-bit value representing the new state and a 32-bit index into the old state set, used to look up the counts.
The first step of processing each depth is to generate the new states reachable from each state and add them to the move list. I generate valid moves using a set of lookup tables that are precomputed at compile time.
For each of the nine board positions, I first check whether it is zero. If so, I combine the bits from each of the neighbor cells into an index into a lookup table. The table contains all possible moves for that set of neighbor cells. Corner cells always have exactly one move, so the table simply contains that move. For edge cells, the table contains a 4-element array which may have between 1 and 4 moves. For the center cell, the table contains a 16-element array which may have between 1 and 11 moves.
The moves in the lookup tables are “deltas”: 32-bit values which can be added to an old state to produce a new state. The deltas remove any captured dice, add the newly placed die, and adjust the count of zeros.
The next step is to convert each state on the move list to canonical form by generating all 8 transformed states and choosing the smallest one.
I generate the 8 transformed states with the help of another set of lookup tables. First, note that the middle die never moves as a result of the transformations. In principle, the 24 bits representing the other 8 dice can be used as an index into a table that stores the 8 transformations of those dice. Such a table would be too large, but I do something similar using three separate tables. The 24 bits are arbitrarily divided into three sets of 8 bits. Each set of 8 bits is used as an index into a much smaller table that stores the transformations of those 8 bits. Specifically, each table entry contains 8 32-bit values that can be combined with the original state using XOR to produce a state with the bits in their new locations.
To process all 8 transformations at once, I load the values from the lookup tables into 256-bit SIMD registers and XOR them together with the original state, producing all 8 transformed states within a single register. I use SIMD operations to find the minimum of the 8 states, as well as a value from 0-7 indicating which transformation was chosen (needed to permute the counts in the next step).
After I have the canonical representation of the new state, I look it up in the hash table. If the new state is already present, I add the counts to the existing entry. Otherwise, I create a new entry. In either case, I must first permute the counts according to the chosen transformation. I use the value from 0-7 determined in the previous step to look up the correct permutation, and apply the permutation to the counts using a SIMD permutation instruction.
When the hash table becomes too large, hash table lookups become one of the slowest operations due to CPU cache misses. I use prefetching to alleviate that problem.
As I loop over the states on the move list, I work on two states at once. In each loop iteration, I compute the canonical representation of a state, hash it, and use a prefetch instruction to bring its hash table entry into the cache. Then I finish the work for the previous state that was started during the previous loop iteration: performing the actual hash table lookup, permuting the counts, and adding the state to the state set.
When a terminal state is encountered before the final depth, and for every state encountered at the final depth, I compute the base-10 state hash of each of the 8 transformed states and multiply it by the corresponding count.
To compute the state hashes, I use yet another set of lookup tables. The 27 bits representing the 9 dice are divided into three groups of 9 bits representing 3 dice each. Those values are used as indices into three lookup tables. Each table entry contains the base-10 representation of the three dice in all 8 transformations. Values from the three tables are loaded into SIMD registers, added, and multiplied by the counts.
I noticed that my program was spending a significant amount of time zeroing memory, both in the kernel (showing up as “sys” time using the time command) and in my own code (showing up in profiling with perf). I started allocating memory with mmap and carefully managing it to avoid that overhead.
When a process requests memory from the OS, it initially receives address space that is mapped to a shared read-only page filled with zeros. Performing the allocation is fast, even for large sizes. Reading zeros from the allocated memory is probably also fast, though I did not check. When a page is written for the first time, the OS replaces it with an unshared writable page filled with zeros. It takes time for the kernel to fill the page with zeros. If memory pages are returned to the OS and later allocated again, for example when resizing an array, the same memory ends up being zeroed again, which is wasteful.
My program allocates all of its main data structures with mmap at a size sufficient to handle the largest possible test cases, so they will not need to be resized (although the resizing code still exists, just in case). Most of the arrays are written to in order, so only the portion of the array that is actually used will be zeroed, and only once. The hash table needs to be zeroed for each depth. I make sure to zero only the portion that has already been used for a previous depth, because the rest will already contain zeros from the OS.
I implemented another more questionable optimization. I noticed that the CodinGame servers allow a program to run for a few milliseconds before the input becomes available. That time cannot be used for anything problem-specific, but it is possible to use it for work that is independent of the problem. In my case, I allocate my program’s main arrays and perform a write into each memory page, causing the OS to fill the page with zeros and make it fully available for my program’s use without further delays. I attempt to initialize 21 MiB of memory in that way, though there is not always enough time to finish the initialization before the input is ready. It is questionable whether this is a legitimate improvement, but it probably happens quite often by accident whenever a program happens to perform some initialization work before reading the input.
Here are the submissions I made with running times and brief descriptions of what was changed:
 1  559 ms  Initial implementation with custom hash table
 2  384 ms  Lookup tables for move generation
 3   68 ms  Symmetry
 4   63 ms  Prefetch
 5   38 ms  SIMD for transformations and state hashes; faster hash function
 6   25 ms  Don’t hash terminal states or last depth
 7   13 ms  Optimized memory allocation with mmap
 8   14 ms  Lookup tables for transformations
 9   10 ms  Lookup tables for state hashes
10    7 ms  Smaller lookup tables for transformations
11    8 ms  Don’t reduce state hash sums until the end
12    8 ms  Only one state array
13    7 ms  Add terminal and last depth states to move array
14    7 ms  Reorder move generation (corners, then edges, then center)
15    8 ms  Don’t zero more hash table memory than needed
16    2 ms  Zero memory before input is ready
17    2 ms  Non-temporal stores to move list
Many participants discussed solving this problem with depth first search, using a hash table to cache the results for (state, depth) pairs. A DFS-based solution can perform reasonably well, but I don’t think it can quite be competitive with BFS.
One interesting possibility with DFS that I don’t see how to do with BFS is reusing the work for a state at multiple depths, as long as the maximum number of moves that can be made from the state is smaller than the depths in question.
It is unusual to see a programming contest that is judged based on execution time. I enjoy performance optimization, so I had a lot of fun with this contest. Thank you to everyone who made it happen.