This is a trick that I learned from Matthias Wloka during my job interview at NVIDIA. I thought I had a good understanding of the behavior of the vertex post-transform cache, but embarrassingly it turned out it wasn’t good enough. I’m sure many people don’t know this either, so here it goes.
Rendering regular grids of triangles is common enough to make it worth spending some time thinking about how to do it most efficiently. They are used to render terrains, water effects, curved surfaces, and in general any regularly tessellated object.
It’s possible to simulate the native hardware tessellation by rendering a single grid multiple times, and the fastest way of doing that is using instancing. That idea was first proposed in Generic Mesh Refinement on GPU and at NVIDIA we also have examples that show how to do that in OpenGL and Direct3D.
That’s enough for the motivation. Imagine we have a 4×4 grid. The first two rows would look like this:
* - * - * - * - * | / | / | / | / | * - * - * - * - * | / | / | / | / | * - * - * - * - *
With a vertex cache with 8 entries, the location of the vertices after rendering the first 6 triangles of the first row should be as follows:
7 - 5 - 3 - 1 - * | / | / | / | / | 6 - 4 - 2 - 0 - * | / | / | / | / | * - * - * - * - *
And after the next two triangles:
* - 7 - 5 - 3 - 1 | / | / | / | / | * - 6 - 4 - 2 - 0 | / | / | / | / | * - * - * - * - *
Notice that the first two vertices are no longer in the cache. As we proceed to the next two triangles two of the vertices that were previously in the cache need to be loaded again:
* - * - * - 7 - 5 | / | / | / | / | 3 - 1 - * - 6 - 4 | / | / | / | / | 2 - 0 - * - * - *
Instead of using the straightforward traversal, it’s possible to traverse the triangles in Morton or Hilbert order, which are known to have better cache behavior. Another possibility is to feed the triangles to any of the standard mesh optimization algorithms.
All these options are better than not doing anything, but still produce results that are far from the optimal. In the table below you can see the results obtained for a 16×16 grid and with a FIFO cache with 20 entries:
Method ACMR ATVR Scanline 1.062 1.882 NVTriStrip 0.818 1.450 Morton 0.719 1.273 K-Cache-Reorder 0.711 1.260 Hilbert 0.699 1.239 Forsyth 0.666 1.180 Tipsy 0.658 1.166 Optimal 0.564 1.000
Note that I’m using my own implementation for all of these methods. So, the results with the code from the original author might differ slightly.
The most important observation is that, for every row of triangles, the only vertices that are reused are the vertices that are at the bottom of the triangles, and these are the vertices that we would like to have in the cache when rendering the next row of triangles.
When traversing triangles in scanline order the cache interleaves vertices from the first and second row. However, we can avoid that by prefetching the first row of vertices:
4 - 3 - 2 - 1 - 0 | / | / | / | / | * - * - * - * - * | | | | | * - * - * - * - *
That can be done issuing degenerate triangles. Once the first row of vertices is in the cache, you can continue adding the triangles in scanline order. The cool thing now is that the vertices that leave the cache are always vertices that are not going to be used anymore:
* - 7 - 6 - 5 - 4 | / | / | / | / | 3 - 2 - 1 - 0 - * | / | / | / | / | * - * - * - * - *
In general, the minimum cache size to render a W*W grid without transforming any vertex multiple times is W+2.
The degenerate triangles have a small overhead, so you also want to avoid them when the cache is sufficiently large to store two rows of vertices. When the cache is too small you also have to split the grid into smaller sections and apply this method to each of them. The following code accomplishes that:
void gridGen(int x0, int x1, int y0, int y1, int width, int cacheSize) { if (x1 - x0 + 1 < cacheSize) { if (2 * (x1 - x0) + 1 > cacheSize) { for (int x = x0; x < x1; x++) { indices.push_back(x + 0); indices.push_back(x + 0); indices.push_back(x + 1); } } for (int y = y0; y < y1; y++) { for (int x = x0; x < x1; x++) { indices.push_back((width + 1) * (y + 0) + (x + 0)); indices.push_back((width + 1) * (y + 1) + (x + 0)); indices.push_back((width + 1) * (y + 0) + (x + 1)); indices.push_back((width + 1) * (y + 0) + (x + 1)); indices.push_back((width + 1) * (y + 1) + (x + 0)); indices.push_back((width + 1) * (y + 1) + (x + 1)); } } } else { int xm = x0 + cacheSize - 2; gridGen(x0, xm, y0, y1, width, cacheSize); gridGen(xm, x1, y0, y1, width, cacheSize); } }
This may not be the most optimal grid partition, but the method still performs pretty well in those cases. Here are the results for a cache with 16 entries:
Method ACMR ATVR Scanline 1.062 1.882 NVTriStrip 0.775 1.374 K-Cache-Reorder 0.766 1.356 Hilbert 0.754 1.336 Morton 0.750 1.329 Tipsy 0.711 1.260 Forsyth 0.699 1.239 Optimal 0.598 1.059
And for a cache with only 12 entries:
Method ACMR ATVR Scanline 1.062 1.882 NVTriStrip 0.875 1.550 Forsyth 0.859 1.522 K-Cache-Reorder 0.807 1.491 Morton 0.812 1.439 Hilbert 0.797 1.412 Tipsy 0.758 1.343 Optimal 0.600 1.062
In all cases, the proposed algorithm is significantly faster than the other approaches. In the future it would interesting to take into account some of these observations in a general mesh optimization algorithm.
Good stuff. I guess this is because it’s an unusual cache that doesn’t actually LRU – it doesn’t promote vertices when you use ones existing in the cache.
One thing I like to see is what’s the behavior when you optimize for the wrong cache size. eg. if you optimize for 8-member cache then run on 16-member cache, how does it compare to Forsyth?
If I optimize for 8, but run it on a 16 entry cache I get:
Optimal ATVR: 1.125
Forsyth’s ATVR: 1.433
While if the cache has only 8-entries:
Optimal ATVR: 1.125
Forsyth’s ATVR: 1.547
The proposed method wins in both cases, but increases in the size of the actual cache do not improve the result. It’s possible that at some extreme configurations this method might lose.
Ha, that’s cool.
Nice, article. Is there any way to know the cache size for specific gpu?
alpha2: D3DQUERYTYPE_VCACHE (late reply, but I was looking for the same info…)
Seems this code generates triangle list. Do you have code for triangle strip?
No, but it should be trivial to generate a triangle strip instead. That is left as an exercise. Feel free to share your implementation!
Ok, gridGen(0, mWidth-1, 0, mHeight-1, mWidth, vcache_size);
Looks good, but note that when prewarming the cache you should use degenerate triangles, and in your code, only the first triangle is degenerate.
I think you can go one step further and represent the new vertex layout using a z-order curve with a swizzle pattern based on the vertex cache size and perhaps the output memory’s cache size. For example: yyyyxxxxyyyyxxxx
Group “x” bits to form the row size based on the vertex cache, and group “y” bits to decide the size of tiles of output.