Batching and Sorting

raynorpat · Post by **raynorpat** » Sun Aug 28, 2011 11:44 pm

Hey, all. I've been working on porting the Quake renderer to OpenGL ES 2

As it stands right now, it's a clean port, using vertex array's in all areas, along with simple GLSL shaders to fill in for the loss of the fixed-function pipeline.

Speed is good of course on my PC, but for the platform's that I'm targetting, I'm quite sure it's not that great. (some 5,000+ draw calls a frame would probably murder it)

I am at a loss of whether to sort my drawables by texture, shader, maybe by state even.
Or if anyone actually uses a sort to batch together drawables.
Any ideas?

mh · Post by mh » Mon Aug 29, 2011 1:06 am

Yeah, I sort heavily, use indexes and batch agressively in DirectQ; a typical ID1 scene comes out at a total of maybe 20-40 draw calls per frame (that's including the status bar and any notification text).

Each MDL can be collapsed to a single draw call (glDrawElements with - preferably - 16-bit indexes). Animate and interpolate on the GPU (just set up two pointers - one to verts for last frame and one to verts for curr frame and send the blend weight as a uniform) and the MDL can go in a static VBO. VBO switching is cheap in D3D so DirectQ keeps each MDL in it's own VBO; it's expensive in OpenGL so you'd put them all in a single VBO. Use vertex attrib pointers instead of old-style vertex arrays and send a 4-byte position (some hardware is unhappy with 3 - set w to 1 here). Alternatively abuse glColorPointer and glSecondaryColorPointer.

DirectQ can also use hardware instancing for MDLs where available, but that needs hardware support (which you may not have on an ES device) and a second dynamic VBO for per-instance data (and my experience is that dynamic VBOs are worthless in GL). I also believe that with GL you need to be pushing a lot more verts and a lot more instances than with D3D for hardware instancing to be really effective.

For brush surfaces I sort by both texture and lightmap. All brush surfaces are collected into an intermediate "modelsurfs" array, which is a collection of structs containing pointers to the original surf and the entity that contained it (because multiple entities can share the same model - ammo boxes!) (there's a few other things in there too but they're just for convenience) This is then walked through and sorted into linked lists of textures. Each such linked list is then walked through and sorted into further lists by lightmap, and finally these lists are drawn.

The drawing looks roughly like this:

Code: Select all

vertex_t *verts; // pointer to VBO/varray data
unsigned short *indexes; // pointer to index data
int numverts = 0;
int numindexes = 0;

for (surf = first; surf; surf = surf->next)
{
   // if we run out of buffer space we draw and reset
   if ((numverts + surf->numverts) > maxverts || (numindexes + surf->numindexes) > maxindexes)
   {
      DrawIndexedPrimitive/glDrawElements
      numverts = 0;
      numindexes = 0;
      // reset pointers here too....
   }

   for (i = 0, vert = surf->verts; i < surf->numverts; i++, vert++)
      verts[i] = vert;

   verts += surf->numverts;

   // i actually use a modified duffs device for this but i want you to understand it!!!
   for (i = 2; i < surf->numverts; i++, indexes += 3)
   {
      indexes[0] = numverts;
      indexes[1] = numverts + i - 1;
      indexes[2] = numverts + i;
   }

   numverts += surf->numverts;
   numindexes += surf->numindexes; // (surf->numverts - 2) * 3
}

// draw anything left over
DrawIndexedPrimitive/glDrawElements

Sprites and the 2D stuff are likewise collected into big arrays and drawn when the arrays fill or state changes.

Particles are interesting. I use Microsoft's "Shader Instancing (with Draw Call Batching)" trick for these. Each particle is defined by two float4s - position and colour (I stash a particle size into the spare float for position) - and the rest of the data needed for them comes from a static VBO. Because of limited constants register space I limit particle batches to 100 at a time, which still works out a good deal faster than sending the full vertex list dynamically. I've a funky vertex shader to do the billboarding in hardware from a combination of the constants registers and the VBO, and a funky pixel shader to generate the particle texture procedurally so that I get good detail at any resolution without needing a texture image. The neat thing about this is that you don't need high-end hardware; I've versions of these shaders written in both HLSL and ARB ASM that I've run on everything from an integrated Intel piece of crap to a modern NVIDIA. They run fast and look great.

There's a few more specific tricks but they're for special problem case content rather than more general case use, so that's enough to be going on for now.

Finally, no strips or fans; it's plain triangles with 16-bit indexes all the way. This is what works best on PC hardware; an ES device may or may not be different.

The total combination can run something like the big open area in the Marcher Fortress at close to 200 FPS on an Intel 945 - so yeah, it works quite well I'd say.

Enough (too much!) info?

raynorpat · Post by **raynorpat** » Mon Aug 29, 2011 1:34 am

That's a well written summary, Thanks.

Spike · Post by **Spike** » Mon Aug 29, 2011 1:46 am

create some sort of batch structure. each surface has a reference to one batch. when each surface is seen, add the surface to the batch. once all lightmaps were updated and all surfaces have been seen, loop through the batch structures and draw each that has some geometry.
this method means you group surfaces with similar properties with themselves. specifically that is texture number and lightmap number.

attempt to arrange for nearer surfaces and walls to be drawn before further stuff as well as floors within their batch or so, so pay attention to the bsp order and avoid inverting it. you don't otherwise need to sort

ents will probably need to have their own batches independantly of each other due to different model matricies, frames, etc. you only need to manually depth-sort models which are blended. traditionally, glquake blends only particles, and you don't want to depth-sort those, so you likely need no depth sorting.

how you manage your actual batches is up to you. vbo all potential surfaces in a batch if you want, but it won't be noticable. you can easily just add new verts to each batch structure. just use something other than trianglefan/polygon.

you should be able to get 80fps or so easily enough on an arm+powervr board.

mh · Post by mh » Mon Aug 29, 2011 3:09 am

Convert from a poly/fan to a strip:

Code: Select all

 	        for (i = 0, j = numverts; i < numverts; i++, j--)
 	        {
 	                int lindex = mod->surfedges[surf->firstedge + i];
 	                int stripdst = i ? (i * 2 - 1) : 0;
 	
 	                if (stripdst >= numverts) stripdst = j * 2;

Then write the data into verts[stripdst] and draw using GL_TRIANGLE_STRIP.

mh · Post by mh » Wed Aug 31, 2011 12:15 am

In relation to the topic of batching, a section from a GL Intercept (highly recommended by the way) log looks like this:

Code: Select all

glBindTexture(GL_TEXTURE_2D,39)
glDrawArrays(GL_TRIANGLE_FAN,19200,4) VP=36  FP=39  Time= 3us
glDrawArrays(GL_TRIANGLE_FAN,19254,3) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,19248,3) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,19290,4) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,19350,4) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,19306,4) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,19326,4) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,19318,4) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,19366,4) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,19370,6) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,19151,4) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,19145,6) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,19139,6) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,19136,3) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,19130,6) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,19108,3) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,19103,5) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,19096,7) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,20014,5) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,19989,4) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,19998,4) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,20002,4) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,20010,4) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,20006,4) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,19993,5) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,19904,4) VP=36  FP=39 
glDrawArrays(GL_TRIANGLE_FAN,19916,3) VP=36  FP=39

It's obvious that some extra work is being done in the first glDrawArrays call, but the rest of them take next to no time (there's obviously some small cost for each subsequent call but it's too low to register in the log - they're certainly not free though). That could be lazy state changes, it could be switching the driver into some internal "drawing stuff now" mode, it could be anything.

The upshot though is that you want to get as many consecutive primitives drawn as possible. It's a bad idea to draw stuff out of state order as individual draw calls (with state changes between them) all exhibit the same kind of setup cost:

Code: Select all

glActiveTextureARB(GL_TEXTURE0)
glBindTexture(GL_TEXTURE_2D,79)
glActiveTextureARB(GL_TEXTURE1)
glBindTexture(GL_TEXTURE_2D,184)
glDrawArrays(GL_TRIANGLE_FAN,23964,4) VP=36  FP=39  Time= 3us
glActiveTextureARB(GL_TEXTURE0)
glBindTexture(GL_TEXTURE_2D,52)
glDrawArrays(GL_TRIANGLE_FAN,19300,6) VP=36  FP=39  Time= 3us
glBindTexture(GL_TEXTURE_2D,51)
glDrawArrays(GL_TRIANGLE_FAN,20441,4) VP=36  FP=39  Time= 4us
glDrawArrays(GL_TRIANGLE_FAN,20417,4) VP=36  FP=39 
glBindTexture(GL_TEXTURE_2D,49)
glDrawArrays(GL_TRIANGLE_FAN,21515,6) VP=36  FP=39  Time= 2us
glDrawArrays(GL_TRIANGLE_FAN,21505,6) VP=36  FP=39 
glBindTexture(GL_TEXTURE_2D,48)
glDrawArrays(GL_TRIANGLE_FAN,18255,4) VP=36  FP=39  Time= 2us

So in the first example we got 39 draw calls with a cost of ~3us (there was actually about three times that number of calls in this section of the log but it would be just silly to paste it all). In the second we got 8 calls with a cost of ~14us. (And note the second glDrawArrays in a batch of two there which didn't incur a setup cost).

glDrawElements is interesting; I'm obviously getting a whole lot more triangles in a single draw call, but there is a much higher setup cost too:

Code: Select all

glDrawElements(GL_TRIANGLES,1041,GL_UNSIGNED_SHORT,0x35f4) VP=15  FP=16  Time= 23us

You'll probably need to evaluate each individual case and see where gains can be made or lost - sometimes a whole bunch of glDrawArrays calls comes out preferable to a single glDrawElements, particularly if it's going to be something like brush surfaces where you don't really know at creation time how many of them there are and in what order they're going to be for any given scene. You could set things up for a nice glDrawElements call only to find that you've just got a relatively small number of surfaces and the cost of setting it up in your program plus the added cost of glDrawElements actually runs slower.

mh · Post by mh » Fri Sep 16, 2011 10:48 pm

Just been doing quite a lot more hands-on research into this whole area; some conclusions here: http://mhquake.blogspot.com/2011/09/sur ... notes.html

The summary is - use glDraw(Range)Elements with GL_TRIANGLES and 16-bit indexes and you won't go too far wrong. (I already knew that but it was nice to formally confirm it and have the working code to prove it.) Some older hardware might be happier with glDraw(Range)Elements and GL_TRIANGLE_STRIP, and it's not too bad (a bit fiddly to get things straight though) to provide both options and let the user select which one they want. With newer hardware you can just blast a load of glDrawArrays calls and it will run almost as fast and save you a lot of work in setup - you most definitely need a big static VBO for this (and that means that you also need shaders if you want to do water and sky animations).

Key point number two is that there are a number of cases where - if you violate your hardware's limits - OpenGL will drop you back to software emulation without any warning. This would be just on the T&L stage of the pipeline, but it's still bye bye high performance, hello a drop to potentially one third or less of the framerate you had before (depending on scene complexity) and you scratching your head and wondering why you put in all that extra work only for things to run slower (if you're interested, D3D's reaction to this is normally a crash - I prefer the crash because I can catch it happening and take corrective action sooner, but I can see why OpenGL's behaviour exists). Hardware T&L devices are fussy and you need to be aware of this in order to properly get the fast path.

InsideQC Forums

Batching and Sorting

Batching and Sorting

Re: Batching and Sorting