Fast Dynamic Lighting

mh · Post by mh » Sun Mar 20, 2011 3:54 pm

A compilation of sorts of various tricks and techniques I've built up over time. Using these will result in an engine that's capable of blasting through scenes with heavy dynamic lighting as if the lighting was almost not even happening.

Don't use GL_RGB

I commonly see GL_RGB used as a lightmap format in engine sources, and can only assume that people somehow think it "saves memory". It not only doesn't, but it slows things down too. Read this first.

And if you are interested, most GPUs like chunks of 4 bytes. In other words, RGBA or BGRA is preferred. RGB and BGR are considered bizarre since most GPUs, most CPUs and any other kind of chip don't handle 24 bits. This means the driver converts your RGB or BGR to what the GPU prefers, which typically is BGRA.

On NVIDIA, using GL_BGRA can upload textures up to 6 times faster than GL_RGB. On Intel it's something similar but subtly (or not so subtly - see below too) different. ATI, oddly enough, doesn't seem to care much, but nonetheless it makes sense to use the format that performs best on as much hardware as possible.
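
To make the 4-byte point concrete, here is a minimal sketch of the kind of repacking that has to happen somewhere when 3-byte RGB texels are handed over. It is an illustration only: rgb_to_bgra is a made-up helper name, and what any particular driver really does internally is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: expand tightly packed 3-byte RGB texels into
   4-byte BGRA texels. Storing lightmaps as BGRA in the first place
   means nobody has to run a loop like this at upload time. */
void rgb_to_bgra (const unsigned char *rgb, unsigned char *bgra, size_t texels)
{
	size_t i;

	assert (rgb != NULL && bgra != NULL);

	for (i = 0; i < texels; i++)
	{
		bgra[i * 4 + 0] = rgb[i * 3 + 2];	/* blue */
		bgra[i * 4 + 1] = rgb[i * 3 + 1];	/* green */
		bgra[i * 4 + 2] = rgb[i * 3 + 0];	/* red */
		bgra[i * 4 + 3] = 255;			/* opaque alpha */
	}
}
```

If the engine builds its lightmap data in BGRA order to begin with, this loop simply never needs to exist, on either side of the API.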

Don't use GL_UNSIGNED_BYTE

This one really only affects Intel, but it's no harm to use it for everything. With any type other than GL_UNSIGNED_INT_8_8_8_8_REV Intel seems to pull the texture data back to system memory for modification, whereas using GL_UNSIGNED_INT_8_8_8_8_REV allows glTexSubImage2D to send it directly. A combination of GL_BGRA and GL_UNSIGNED_INT_8_8_8_8_REV will run about 40 times faster on Intel than GL_RGB/GL_UNSIGNED_BYTE.

Both of these are only available if your GL_VERSION is 1.2 or higher, but I think that's a reasonable requirement to have these days. Of course you'll need to define them in your glquake.h file, so here they are:

#define GL_BGRA 0x80E1
#define GL_UNSIGNED_INT_8_8_8_8_REV 0x8367
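
For the version requirement, a rough sketch of a check might look like the following. GL_VersionAtLeast is a hypothetical helper name, and it assumes a desktop-style version string that starts with "major.minor".

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if a GL_VERSION-style string ("1.2.1", "2.1 Mesa 7.10", ...)
   is at least major.minor, otherwise 0. Hypothetical helper. */
int GL_VersionAtLeast (const char *version, int major, int minor)
{
	int vmajor = 0, vminor = 0;

	assert (version != NULL);

	if (sscanf (version, "%d.%d", &vmajor, &vminor) != 2)
		return 0;	/* unparseable: treat as too old */

	if (vmajor != major) return vmajor > major;
	return vminor >= minor;
}
```

In an engine you would feed it (const char *) glGetString (GL_VERSION) once at startup and fall back to the old formats if it returns 0.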

Get your glTexSubImage calls in the right place

If you just do the above changes you'll probably notice that nothing at all has changed in terms of performance; especially if your renderer is set up like GLQuake's. This is because of the dreaded R_DrawSequentialPoly function, which is one of the most evil things in GLQuake.

The single worst thing you can do is modify a resource, then use it, then modify it again, then use it again, and so on, in the same frame. This completely breaks CPU/GPU parallelism and means that your CPU will be constantly waiting for your GPU to be ready, and your pipeline will be constantly stalling.

This is also the reason why disabling multitexturing is sometimes used as a performance enhancer with some maps - the non-multitextured path more commonly does things the right way, avoids the stalls, and therefore seems to be the faster one, even though it's actually substantially slower than a properly designed multitexture path.

Instead set things up so that you can blast through all of your visible surfaces in a first pass, updating lightmaps as you go, then do a second pass for actually drawing them. If this first pass can do something else useful - like sorting surfaces into texture chains - all the better.
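
The two-pass idea can be sketched without any GL at all. Everything below is a stand-in: surf_t, its fields, LogEvent and R_DrawSurfaces are hypothetical names, and the event log just records where the glTexSubImage2D and draw calls would be issued so the ordering is visible.

```c
#include <assert.h>

#define MAX_EVENTS 64

/* Hypothetical stand-in for the engine's surface type. */
typedef struct surf_s
{
	int lightmapnum;	/* which lightmap texture this surface uses */
	int dlit;		/* non-zero if a dynamic light touched it this frame */
} surf_t;

typedef enum { EV_UPLOAD, EV_DRAW } event_t;

event_t events[MAX_EVENTS];
int numevents = 0;

static void LogEvent (event_t ev)
{
	assert (numevents < MAX_EVENTS);
	events[numevents++] = ev;
}

/* Pass 1: touch lightmap data only, then upload each modified lightmap
   once. Pass 2: draw only. No upload is issued after the first draw. */
void R_DrawSurfaces (surf_t *surfs, int numsurfs, int *modified, int numlightmaps)
{
	int i;

	for (i = 0; i < numsurfs; i++)
		if (surfs[i].dlit) modified[surfs[i].lightmapnum] = 1;

	for (i = 0; i < numlightmaps; i++)
	{
		if (!modified[i]) continue;
		LogEvent (EV_UPLOAD);	/* glTexSubImage2D would go here */
		modified[i] = 0;
	}

	for (i = 0; i < numsurfs; i++)
		LogEvent (EV_DRAW);	/* the actual surface draw would go here */
}
```

The point of the shape is that each lightmap is uploaded at most once per frame and every upload happens before the first draw, so the modify/use/modify/use pattern never arises.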

Conclusion

There's frequently a reason why id Software did things the way they did in Quake, but sometimes that reason may be one of:

Quake had to run on an MS-DOS machine with a P60 and 8MB RAM.
It worked OK on the hardware that was available in 1996 (I'm thinking 3DFX in particular).
They were learning and experimenting, and didn't really know what they were doing.
It wasn't noticed as a problem because there were worse bottlenecks elsewhere (fillrate, software T&L, etc).

It does no harm to occasionally re-evaluate how things are set up and fix them to work properly on more recent hardware. Why would you compromise performance for 99% of your users just to keep the 1% that still has ancient crappy hardware happy? I wouldn't.

mh · Post by mh » Sun Mar 20, 2011 5:15 pm

And that's plenty of mouth, so here's the trousers.

Baker · Post by **Baker** » Sun Mar 20, 2011 6:10 pm

I find it interesting that uploading as unsigned integer is faster than bytes ... but get the concepts behind all of this.

Not too much to say, except lightmaps certainly can be a point of contention in rendering speed.

I like the in-depth hardware oriented performance analysis of this.

I will certainly be curious to play around with this on the 3 or 4 Intel video equipped machines around to see the extent that frames per second improve (one includes my Mac Mini which has Intel Video).

[And maybe this stuff makes the FitzQuake renderer perform on par with some of the other engines in heavy dynamic lighting situations.]

More stuff of substance to experiment with

mh · Post by mh » Sun Mar 20, 2011 7:10 pm

It's not the unsigned integer so much as the "_REV" part that's crucial here. The way I figure it, this is acting as a hint to the driver that "the data is already laid out in the format you prefer to use, so there's no need to send it through a slow path; just take it direct instead".

This is only really important for Intels that have hardware T&L - say, the 965 onwards; it seems as though earlier generations are more tolerant.

mh · Post by mh » Tue Mar 29, 2011 9:24 pm

Lightmap Rectangle Updates

GLQuake updates the full width of a dynamic lightmap, which can be a lot more of the lightmap than actually needs to be updated. We can do better than that by supplying it with a proper subrectangle.

The following replacement structure for glRect_t will define a proper rectangle for use in the rest of this discussion:

typedef struct gl_rect_s
{
	// use a proper rect
	int left, top, right, bottom;
} gl_rect_t;

You'll need one of these for each lightmap (which we'll call the "dirtyrect") and one for each surface (which we'll call the "lightrect"; you may as well store it in the msurface_t struct too).

surf->lightrect.left is equal to surf->light_s, surf->lightrect.right is equal to surf->light_s + smax, and I bet you can guess how the rest of them are calculated.

The dirtyrects are initialized similar to the current rectchange, with left set to BLOCK_WIDTH, right to 0, etc.
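
As a sketch of those two setup steps, assuming (as in GLQuake) that light_s/light_t give the position of a surface's block inside the lightmap and smax/tmax give its size in texels. R_SetLightRect and R_ResetDirtyRect are hypothetical names, and the 128x128 block size is GLQuake's default, used here as an assumption.

```c
#include <assert.h>

#define BLOCK_WIDTH	128
#define BLOCK_HEIGHT	128

typedef struct gl_rect_s
{
	int left, top, right, bottom;
} gl_rect_t;

/* light_s/light_t: position of the surface's block within the lightmap;
   smax/tmax: its size in texels. right/bottom are exclusive. */
void R_SetLightRect (gl_rect_t *lightrect, int light_s, int light_t, int smax, int tmax)
{
	assert (light_s >= 0 && light_s + smax <= BLOCK_WIDTH);
	assert (light_t >= 0 && light_t + tmax <= BLOCK_HEIGHT);

	lightrect->left = light_s;
	lightrect->top = light_t;
	lightrect->right = light_s + smax;
	lightrect->bottom = light_t + tmax;
}

/* An "inside out" rect: any real lightrect merged into it will win
   every min/max comparison. */
void R_ResetDirtyRect (gl_rect_t *dirtyrect)
{
	dirtyrect->left = BLOCK_WIDTH;
	dirtyrect->top = BLOCK_HEIGHT;
	dirtyrect->right = 0;
	dirtyrect->bottom = 0;
}
```

The lightrect only needs to be filled in once, when the surface is allocated its lightmap block; the dirtyrect is reset after each upload.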

When a lightmap is modified you can then mark out the changed region with code similar to this:

		if (surf->lightrect.left < dirtyrect->left) dirtyrect->left = surf->lightrect.left;
		if (surf->lightrect.right > dirtyrect->right) dirtyrect->right = surf->lightrect.right;
		if (surf->lightrect.top < dirtyrect->top) dirtyrect->top = surf->lightrect.top;
		if (surf->lightrect.bottom > dirtyrect->bottom) dirtyrect->bottom = surf->lightrect.bottom;
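
Wrapped up as a self-contained function, the same four comparisons might look like this. R_AddToDirtyRect and R_DirtyRectIsEmpty are hypothetical names; the empty test relies on the dirtyrect having been reset "inside out" (left greater than right) as described above.

```c
#include <assert.h>

typedef struct gl_rect_s
{
	int left, top, right, bottom;
} gl_rect_t;

/* Grow dirtyrect so that it also covers lightrect. */
void R_AddToDirtyRect (gl_rect_t *dirtyrect, const gl_rect_t *lightrect)
{
	assert (lightrect->left < lightrect->right && lightrect->top < lightrect->bottom);

	if (lightrect->left < dirtyrect->left) dirtyrect->left = lightrect->left;
	if (lightrect->right > dirtyrect->right) dirtyrect->right = lightrect->right;
	if (lightrect->top < dirtyrect->top) dirtyrect->top = lightrect->top;
	if (lightrect->bottom > dirtyrect->bottom) dirtyrect->bottom = lightrect->bottom;
}

/* A freshly reset dirtyrect has left > right, so nothing needs uploading. */
int R_DirtyRectIsEmpty (const gl_rect_t *dirtyrect)
{
	return dirtyrect->left >= dirtyrect->right || dirtyrect->top >= dirtyrect->bottom;
}
```

Checking for an empty dirtyrect before uploading lets you skip lightmaps that nothing touched this frame.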

Now to update the lightmap.

The first thing we need is to tell OpenGL how the source data is laid out in memory by calling glPixelStorei (GL_UNPACK_ROW_LENGTH, BLOCK_WIDTH). This lets OpenGL know the length, in texels, of each row of the system-memory copy of the lightmap, so that when you do a partial update of a row it will skip to the start of the next one each time. Otherwise we'll get corrupted lightmap updates, as it will treat the source as tightly packed and read the rest of the current row as if it were the start of the next row of the update region. Call glPixelStorei (GL_UNPACK_ROW_LENGTH, 0) to set it back to default behaviour when done.
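
If the row length business seems abstract, this CPU-side imitation shows the addressing it asks for. CopySubRect is a hypothetical helper, not anything GL exposes, and LIGHTMAP_BYTES is assumed to be 4 to match the BGRA format.

```c
#include <assert.h>
#include <string.h>

#define LIGHTMAP_BYTES 4

/* CPU-side imitation of what GL_UNPACK_ROW_LENGTH asks for: read `width`
   texels per row, but step `row_length` texels to reach the next row.
   `src` points at the first texel of the sub-rectangle. */
void CopySubRect (const unsigned char *src, unsigned char *dst,
	int row_length, int width, int height)
{
	int y;

	assert (width <= row_length);

	for (y = 0; y < height; y++)
	{
		/* destination rows are tightly packed, source rows are not */
		memcpy (dst + (size_t) y * width * LIGHTMAP_BYTES,
			src + (size_t) y * row_length * LIGHTMAP_BYTES,
			(size_t) width * LIGHTMAP_BYTES);
	}
}
```

With row_length equal to width you get the default tightly packed behaviour, which is exactly the case that corrupts a partial update of a wider lightmap.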

Finally we have our glTexSubImage2D call; an example might look something like this:

		glTexSubImage2D
		(
			GL_TEXTURE_2D,
			0,
			dirtyrect->left,
			dirtyrect->top,
			(dirtyrect->right - dirtyrect->left),
			(dirtyrect->bottom - dirtyrect->top),
			GL_BGRA,
			GL_UNSIGNED_INT_8_8_8_8_REV,
			gl_lightmaps[i].data + (dirtyrect->top * BLOCK_WIDTH + dirtyrect->left) * LIGHTMAP_BYTES
		);

And we've just cut down on bandwidth usage for lightmap updating by a potentially significant amount.
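
To put a rough number on it, here is a small sketch that compares the two upload sizes for the same dirty rectangle. The function names are hypothetical, and BLOCK_WIDTH of 128 and LIGHTMAP_BYTES of 4 are assumptions matching GLQuake's block size and the BGRA format.

```c
#include <assert.h>

#define BLOCK_WIDTH	128
#define LIGHTMAP_BYTES	4

typedef struct gl_rect_s
{
	int left, top, right, bottom;
} gl_rect_t;

/* Bytes sent when the full width of the lightmap is uploaded for the
   changed rows. */
int FullWidthUploadBytes (const gl_rect_t *r)
{
	assert (r->top <= r->bottom);
	return BLOCK_WIDTH * (r->bottom - r->top) * LIGHTMAP_BYTES;
}

/* Bytes sent when only the dirty rectangle is uploaded. */
int SubRectUploadBytes (const gl_rect_t *r)
{
	assert (r->left <= r->right && r->top <= r->bottom);
	return (r->right - r->left) * (r->bottom - r->top) * LIGHTMAP_BYTES;
}
```

For an 18x18 dirty region that works out to 1,296 bytes instead of 9,216, roughly a sevenfold reduction; the real saving depends entirely on how the lit surfaces happen to be packed into the lightmap.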

Note that this technique is useless on its own. You need to stop syncing the GPU with the CPU by following the techniques I've outlined up above first. Use this in addition to the above to get more speed, not instead of it.

ceriux · Post by **ceriux** » Wed Mar 30, 2011 3:07 am

i heard the reason quakes d lights are so slow is because they're based on doom 3's . i would like to see a quake engine which can handle dynamic lighting and not bog down my machines.

leileilol · Post by **leileilol** » Wed Mar 30, 2011 3:12 am

AmigaQuake had a 'crude dynamic lights' feature which turned the attenuation fades into weird star patterns. Don't know if this would speed up calculation of a changed texture on x86, though.

ceriux wrote:i heard the reason quakes d lights are so slow is because they're based on doom 3's .

*facepalm*

Do you honestly believe that?

obviously, you heard from a retard. Doom3 has a completely different, unrelated lighting model. Also how does one base on code from the future?

metlslime · Post by **metlslime** » Wed Mar 30, 2011 8:03 am

He probably heard someone talking about Tenebrae or Darkplaces rtlights, both of which could be said to be "based on" doom 3 lighting.

ceriux · Post by **ceriux** » Wed Mar 30, 2011 4:02 pm

metlslime wrote:He probably heard someone talking about Tenebrae or Darkplaces rtlights, both of which could be said to be "based on" doom 3 lighting.

ahh maybe thats it, maybe i was confused a little.

metlslime · Post by **metlslime** » Mon Aug 01, 2011 2:28 am

Thanks for this research. I haven't implemented these yet but I read through the diffs; pretty straightforward stuff.

Notes/questions:

1. While you need version 1.2 for GL_BGRA to be a core feature, it seems you could also check for GL_EXT_bgra, which seems to have virtually 100% support (even among version 1.1 drivers) according to this site (which you linked recently)

2. I assume the same changes would speed up texture loading at map init too, have you tried that?

3. I also wonder if this would be faster for downloading images for screenshots or the imagedump command. (Since TGAs are BGR anyway, it would at least save CPU time.)
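
On point 1, a sketch of an extension check, assuming the classic space-separated string returned by glGetString (GL_EXTENSIONS). GL_HasExtension is a hypothetical helper name. Note that GL_EXT_bgra only covers the BGRA format itself, not the GL_UNSIGNED_INT_8_8_8_8_REV type.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if `name` appears as a whole space-separated token in a
   GL_EXTENSIONS-style string. A plain strstr is not enough because one
   extension name can be a prefix of another. */
int GL_HasExtension (const char *extensions, const char *name)
{
	size_t len;
	const char *p;

	assert (extensions != NULL && name != NULL);

	len = strlen (name);
	if (len == 0) return 0;

	for (p = extensions; (p = strstr (p, name)) != NULL; p += len)
	{
		int starts = (p == extensions || p[-1] == ' ');
		int ends = (p[len] == ' ' || p[len] == '\0');

		if (starts && ends) return 1;
	}

	return 0;
}
```

A 1.1 driver that passes this check for "GL_EXT_bgra" can take the BGRA path even though it fails the version test.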

mh · Post by mh » Mon Aug 01, 2011 10:45 am

Pretty much "yes" to all counts, although I personally haven't noticed much performance difference in regular texture loading (I guess the other work that glTexImage2D is doing - allocating GPU memory, etc - represents the bulk of time spent there).

szo · Post by **szo** » Mon Aug 01, 2011 4:22 pm

How is GL_UNSIGNED_INT_8_8_8_8_REV handled on big endian systems along with GL_BGRA or GL_RGBA? Should the texture data be generated according to host's endianness?

mh · Post by mh » Mon Aug 01, 2011 6:18 pm

szo wrote:How is GL_UNSIGNED_INT_8_8_8_8_REV handled on big endian systems along with GL_BGRA or GL_RGBA? Should the texture data be generated according to host's endianness?

I'd figure that's the most sensible way of doing it, although it makes sense to benchmark first as CPU endianness and GPU endianness may not necessarily be the same thing.

revelator · Post by **revelator** » Thu Aug 25, 2011 10:04 am

a jokebot oh hardyharhar

mh · Post by mh » Mon Jul 02, 2012 2:12 am

Just resurrecting this old one regarding the endianness issue mentioned by szo.

The GL spec (page 97, 1.2.1 version) clearly states which bits are assigned to which components for the UNSIGNED_INT_8_8_8_8_REV type so endianness is basically not an issue - 4th component goes in bits 31-24, 3rd in 23-16, 2nd in 15-8 and 1st in 7-0 and if an implementation does otherwise then it's non-conformant.

Where it would be an issue is if you used unsigned int * for your source data type, but because Tex(Sub)Image takes a GLvoid * parameter for data you can still use byte * even with this type and satisfy all requirements.
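
The bit assignment quoted from the spec can be written out as a tiny packing function. PackBGRA8888Rev is a hypothetical name; the sketch only restates the layout above for the GL_BGRA case, where the component order is blue, green, red, alpha.

```c
#include <stdint.h>

/* Pack one texel the way the spec lays out GL_UNSIGNED_INT_8_8_8_8_REV:
   1st component in bits 7-0, 2nd in 15-8, 3rd in 23-16, 4th in 31-24.
   With GL_BGRA the component order is blue, green, red, alpha. */
uint32_t PackBGRA8888Rev (uint8_t b, uint8_t g, uint8_t r, uint8_t a)
{
	return ((uint32_t) a << 24) | ((uint32_t) r << 16) | ((uint32_t) g << 8) | (uint32_t) b;
}
```

Written as a hex constant the result reads 0xAARRGGBB, which is the same regardless of the host's byte order because shifts operate on values rather than on memory.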

InsideQC Forums
