Fast Dynamic Lighting

mh
Posts: 2292
Joined: Sat Jan 12, 2008 1:38 am

Post by mh »

A compilation of sorts of various tricks and techniques I've built up over time. Using these will result in an engine that's capable of blasting through scenes with heavy dynamic lighting as if the lighting was almost not even happening.

Don't use GL_RGB

I commonly see GL_RGB used as a lightmap format in engine sources, and can only assume that people somehow think it "saves memory". It not only doesn't, but it slows things down too. Read this first.
And if you are interested, most GPUs like chunks of 4 bytes. In other words, RGBA or BGRA is preferred. RGB and BGR are considered bizarre since most GPUs, most CPUs and any other kind of chip don't handle 24 bits. This means the driver converts your RGB or BGR to what the GPU prefers, which typically is BGRA.
On NVIDIA, using GL_BGRA can upload textures up to 6 times faster than GL_RGB. On Intel it's something similar but subtly (or not so subtly - see below too) different. ATI, oddly enough, doesn't seem to care much, but nonetheless it makes sense to use the format that performs best on as much hardware as possible.

Don't use GL_UNSIGNED_BYTE

This one really only affects Intel, but it's no harm to use it for everything. With any type other than GL_UNSIGNED_INT_8_8_8_8_REV Intel seems to pull the texture data back to system memory for modification, whereas using GL_UNSIGNED_INT_8_8_8_8_REV allows glTexSubImage2D to send it directly. A combination of GL_BGRA and GL_UNSIGNED_INT_8_8_8_8_REV will run about 40 times faster on Intel than GL_RGB/GL_UNSIGNED_BYTE.
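For anyone whose engine currently keeps 3-byte RGB lightmap data, here is a minimal sketch of expanding it to 4-byte texels suitable for GL_BGRA with GL_UNSIGNED_INT_8_8_8_8_REV. The function names pack_bgra and rgb_to_bgra are made up for illustration and are not from GLQuake; alpha is simply forced to 255.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack one texel for GL_BGRA + GL_UNSIGNED_INT_8_8_8_8_REV:
   blue in bits 7-0, green in 15-8, red in 23-16, alpha in 31-24.
   (Hypothetical helper, not part of any engine source.) */
uint32_t pack_bgra(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
	return ((uint32_t)a << 24) | ((uint32_t)r << 16) | ((uint32_t)g << 8) | (uint32_t)b;
}

/* Expand 'count' tightly packed 3-byte RGB texels into 4-byte texels. */
void rgb_to_bgra(const uint8_t *rgb, uint32_t *out, size_t count)
{
	size_t i;

	assert(rgb != NULL && out != NULL);

	for (i = 0; i < count; i++, rgb += 3)
		out[i] = pack_bgra(rgb[0], rgb[1], rgb[2], 255);
}
```

In a real engine you would more likely write the lightmap in this layout in the first place rather than convert afterwards; the conversion is only shown to make the texel layout concrete.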

Both of these are only available if your GL_VERSION is 1.2 or higher, but I think that's a reasonable requirement to have these days. Of course you'll need to define them in your glquake.h file, so here they are:

Code:

#define GL_BGRA 0x80E1
#define GL_UNSIGNED_INT_8_8_8_8_REV 0x8367

Get your glTexSubImage calls in the right place

If you just do the above changes you'll probably notice that nothing at all has changed in terms of performance; especially if your renderer is set up like GLQuake's. This is because of the dreaded R_DrawSequentialPoly function, which is one of the most evil things in GLQuake.

The single worst thing you can do is modify a resource, then use it, then modify it again, then use it again, and so on, in the same frame. This completely breaks CPU/GPU parallelism and means that your CPU will be constantly waiting for your GPU to be ready, and your pipeline will be constantly stalling.

This is also the reason why disabling multitexturing is sometimes used as a performance enhancer with some maps - the non-multitextured path more commonly does things the right way, avoids the stalls, and therefore seems to be the faster one, even though it's actually substantially slower than a properly designed multitexture path.

Instead set things up so that you can blast through all of your visible surfaces in a first pass, updating lightmaps as you go, then do a second pass for actually drawing them. If this first pass can do something else useful - like sorting surfaces into texture chains - all the better.
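A rough sketch of the two-pass idea follows. Everything in it (surface_t, render_frame, the counters, the stubbed upload and draw functions) is a hypothetical stand-in so that the ordering can be checked without a GL context; a real engine would put its lightmap rebuild, glTexSubImage2D and draw calls where the comments indicate.

```c
#include <assert.h>

#define MAX_LIGHTMAPS 4

/* Hypothetical stand-in for a world surface; not GLQuake's msurface_t. */
typedef struct surface_s
{
	int lightmapnum;	/* which lightmap texture this surface uses */
	int dlit;		/* non-zero if a dynamic light touched it this frame */
} surface_t;

int lightmap_dirty[MAX_LIGHTMAPS];
int uploads, draws, uploads_after_first_draw;

/* Stub: a real engine would call glTexSubImage2D here. */
void upload_lightmap(int i)
{
	uploads++;
	if (draws > 0) uploads_after_first_draw++;
	lightmap_dirty[i] = 0;
}

/* Stub: a real engine would bind textures and submit the polygon here. */
void draw_surface(const surface_t *s)
{
	(void)s;
	draws++;
}

void render_frame(surface_t *surfs, int count)
{
	int i;

	/* pass 1: rebuild lightmap data in system memory and mark what changed */
	for (i = 0; i < count; i++)
	{
		assert(surfs[i].lightmapnum >= 0 && surfs[i].lightmapnum < MAX_LIGHTMAPS);
		if (surfs[i].dlit) lightmap_dirty[surfs[i].lightmapnum] = 1;
	}

	/* upload each modified lightmap once, before anything is drawn */
	for (i = 0; i < MAX_LIGHTMAPS; i++)
		if (lightmap_dirty[i]) upload_lightmap(i);

	/* pass 2: draw; no texture is modified from here on */
	for (i = 0; i < count; i++)
		draw_surface(&surfs[i]);
}
```

The point of the counters is just to show that no upload happens after the first draw, which is the modify/use/modify/use pattern the text above warns about.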

Conclusion

There's frequently a reason why id Software did things the way they did in Quake, but sometimes that reason may be one of:
  • Quake had to run on an MS-DOS machine with a P60 and 8MB RAM.
  • It worked OK on the hardware that was available in 1996 (I'm thinking 3dfx in particular).
  • They were learning and experimenting, and didn't really know what they were doing.
  • It wasn't noticed as a problem because there were worse bottlenecks elsewhere (fillrate, software T&L, etc).
It does no harm to occasionally re-evaluate how things are set up and fix them to work properly on more recent hardware. Why would you compromise performance for 99% of your users just to keep the 1% that still has ancient crappy hardware happy? I wouldn't.
We had the power, we had the space, we had a sense of time and place
We knew the words, we knew the score, we knew what we were fighting for
Baker
Posts: 3666
Joined: Tue Mar 14, 2006 5:15 am

Post by Baker »

I find it interesting that uploading as unsigned integer is faster than bytes ... but I get the concepts behind all of this.

Not too much to say, except lightmaps certainly can be a point of contention in rendering speed.

I like the in-depth hardware oriented performance analysis of this.

I will certainly be curious to play around with this on the 3 or 4 Intel video equipped machines I have around, to see how much the frames per second improve (one of them is my Mac Mini, which has Intel video).

[And maybe this stuff makes the FitzQuake renderer perform on par with some of the other engines in heavy dynamic lighting situations.]

More stuff of substance to experiment with ;)
The night is young. How else can I annoy the world before sunrise? 8) Inquisitive minds want to know ! And if they don't -- well like that ever has stopped me before ..
mh
Posts: 2292
Joined: Sat Jan 12, 2008 1:38 am

Post by mh »

It's not the unsigned integer so much as the "_REV" part that's crucial here. The way I figure it, this is acting as a hint to the driver that "the data is already laid out in the format you prefer to use, so there's no need to send it through a slow path; just take it direct instead".

This is only really important for Intels that have hardware T&L - say, the 965 onwards; it seems as though earlier generations are more tolerant.
mh
Posts: 2292
Joined: Sat Jan 12, 2008 1:38 am

Post by mh »

Lightmap Rectangle Updates

GLQuake updates the full width of a dynamic lightmap, which can be a lot more of the lightmap than actually needs to be updated. We can do better than that by supplying it with a proper subrectangle.

The following replacement structure for glRect_t will define a proper rectangle for use in the rest of this discussion:

Code:

typedef struct gl_rect_s
{
	// use a proper rect
	int left, top, right, bottom;
} gl_rect_t;

You'll need one of these for each lightmap (which we'll call the "dirtyrect") and one for each surface (which we'll call the "lightrect"; you may as well store it in the msurface_t struct too).

surf->lightrect.left is equal to surf->light_s, surf->lightrect.right is equal to surf->light_s + smax (where smax is the surface's lightmap width, as in the stock GLQuake code), and top and bottom are calculated the same way from surf->light_t and tmax.

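The dirtyrects are initialized similarly to the current rectchange: left is set to BLOCK_WIDTH, top to BLOCK_HEIGHT, and right and bottom to 0, so that the first surface merged in always replaces all four values.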
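The rectangle bookkeeping described so far can be sketched as below. This assumes GLQuake's convention that light_s/light_t hold a surface's position in its lightmap and that the size is (extents >> 4) + 1; surf_t is a cut-down stand-in for msurface_t, and the helper names and the 128x128 block size are assumptions for the sake of the example.

```c
#include <assert.h>

#define BLOCK_WIDTH	128	/* assumed lightmap size */
#define BLOCK_HEIGHT	128

typedef struct gl_rect_s
{
	int left, top, right, bottom;
} gl_rect_t;

/* Cut-down stand-in for msurface_t with only the fields used here. */
typedef struct surf_s
{
	short extents[2];	/* surface size in texture space */
	int light_s, light_t;	/* position of the surface inside its lightmap */
	gl_rect_t lightrect;
} surf_t;

/* Work out the region of the lightmap that this surface occupies. */
void surf_calc_lightrect(surf_t *surf)
{
	int smax = (surf->extents[0] >> 4) + 1;
	int tmax = (surf->extents[1] >> 4) + 1;

	assert(surf->light_s >= 0 && surf->light_s + smax <= BLOCK_WIDTH);
	assert(surf->light_t >= 0 && surf->light_t + tmax <= BLOCK_HEIGHT);

	surf->lightrect.left = surf->light_s;
	surf->lightrect.top = surf->light_t;
	surf->lightrect.right = surf->light_s + smax;
	surf->lightrect.bottom = surf->light_t + tmax;
}

/* Reset to an "inverted" rect so that any merge will overwrite it. */
void dirtyrect_reset(gl_rect_t *r)
{
	r->left = BLOCK_WIDTH;
	r->top = BLOCK_HEIGHT;
	r->right = 0;
	r->bottom = 0;
}

/* An untouched (still inverted) rect means there is nothing to upload. */
int dirtyrect_is_empty(const gl_rect_t *r)
{
	return r->right <= r->left || r->bottom <= r->top;
}
```

The lightrect only needs calculating once per surface at load time; the dirtyrect gets reset after each upload.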

When a lightmap is modified you can then mark out the changed region with code similar to this:

Code:

		if (surf->lightrect.left < dirtyrect->left) dirtyrect->left = surf->lightrect.left;
		if (surf->lightrect.right > dirtyrect->right) dirtyrect->right = surf->lightrect.right;
		if (surf->lightrect.top < dirtyrect->top) dirtyrect->top = surf->lightrect.top;
		if (surf->lightrect.bottom > dirtyrect->bottom) dirtyrect->bottom = surf->lightrect.bottom;

Now to update the lightmap.

The first thing we need is to tell OpenGL some information about the texture you're updating by calling glPixelStorei (GL_UNPACK_ROW_LENGTH, BLOCK_WIDTH). This lets OpenGL know the length of each row in the texture, so that when you do a partial update of a row it will skip to the start of the next one each time. Otherwise we'll get corrupted lightmap updates as it will most likely append data intended for the start of the next row to the end of the current update region. Call glPixelStorei (GL_UNPACK_ROW_LENGTH, 0) to set it back to default behaviour when done.
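To make the row-length behaviour concrete, here is a small CPU-side emulation of how a sub-rectangle gets read out of a full-width lightmap buffer once GL_UNPACK_ROW_LENGTH is set to BLOCK_WIDTH. It is only an illustration of the addressing, not what any driver literally does; read_subrect is a made-up name and BLOCK_WIDTH is shrunk to 8 to keep the example tiny.

```c
#include <assert.h>
#include <string.h>

#define BLOCK_WIDTH	8	/* tiny value for illustration only */
#define LIGHTMAP_BYTES	4

/* Copy a width x height region starting at (left, top) out of a buffer whose
   rows are BLOCK_WIDTH texels long. The source pointer starts at the first
   texel of the region and advances by a FULL row each time, which is what
   GL_UNPACK_ROW_LENGTH = BLOCK_WIDTH asks OpenGL to do. */
void read_subrect(const unsigned char *data, int left, int top,
	int width, int height, unsigned char *out)
{
	const unsigned char *src = data + (top * BLOCK_WIDTH + left) * LIGHTMAP_BYTES;
	int row;

	assert(left >= 0 && top >= 0 && width > 0 && left + width <= BLOCK_WIDTH);

	for (row = 0; row < height; row++)
	{
		memcpy(out + row * width * LIGHTMAP_BYTES, src, width * LIGHTMAP_BYTES);
		src += BLOCK_WIDTH * LIGHTMAP_BYTES;	/* full row stride, not the update width */
	}
}
```

Without the row length set, the stride in that last line would effectively be the update width instead, which is exactly the corruption described above.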

Finally we have our glTexSubImage2D call; an example might look something like this:

Code:

		glTexSubImage2D
		(
			GL_TEXTURE_2D,
			0,
			dirtyrect->left,
			dirtyrect->top,
			(dirtyrect->right - dirtyrect->left),
			(dirtyrect->bottom - dirtyrect->top),
			GL_BGRA,
			GL_UNSIGNED_INT_8_8_8_8_REV,
			gl_lightmaps[i].data + (dirtyrect->top * BLOCK_WIDTH + dirtyrect->left) * LIGHTMAP_BYTES
		);

And we've just cut down on bandwidth usage for lightmap updating by a potentially significant amount.

Note that this technique is useless on its own. You need to stop syncing the GPU with the CPU by following the techniques I've outlined up above first. Use this in addition to the above to get more speed, not instead of it.
ceriux
Posts: 2230
Joined: Sat Sep 06, 2008 3:30 pm
Location: Indiana, USA

Post by ceriux »

i heard the reason quakes d lights are so slow is because they're based on doom 3's . i would like to see a quake engine which can handle dynamic lighting and not bog down my machines.
leileilol
Posts: 2783
Joined: Fri Oct 15, 2004 3:23 am

Post by leileilol »

AmigaQuake had a 'crude dynamic lights' feature which turned the attenuation fades into weird star patterns. Don't know if this would speed up calculation of a changed texture on x86, though.
ceriux wrote: i heard the reason quakes d lights are so slow is because they're based on doom 3's .
*facepalm*

Do you honestly believe that?

obviously, you heard from a retard. Doom3 has a completely different, unrelated lighting model. Also how does one base on code from the future?
i should not be here
metlslime
Posts: 316
Joined: Tue Feb 05, 2008 11:03 pm

Post by metlslime »

He probably heard someone talking about Tenebrae or Darkplaces rtlights, both of which could be said to be "based on" doom 3 lighting.
ceriux
Posts: 2230
Joined: Sat Sep 06, 2008 3:30 pm
Location: Indiana, USA

Post by ceriux »

metlslime wrote: He probably heard someone talking about Tenebrae or Darkplaces rtlights, both of which could be said to be "based on" doom 3 lighting.
ahh maybe that's it, maybe i was confused a little.
metlslime
Posts: 316
Joined: Tue Feb 05, 2008 11:03 pm

Post by metlslime »

Thanks for this research. I haven't implemented these yet but I read through the diffs; pretty straightforward stuff.

Notes/questions:

1. While you need version 1.2 for GL_BGRA to be a core feature, it seems you could also check for GL_EXT_bgra, which seems to have virtually 100% support (even among version 1.1 drivers) according to this site (which you linked recently).

2. I assume the same changes would speed up texture loading at map init too, have you tried that?

3. I also wonder if this would be faster for downloading images for screenshots or the imagedump command. (Since TGAs are BGR anyway, it would at least save CPU time.)
mh
Posts: 2292
Joined: Sat Jan 12, 2008 1:38 am

Post by mh »

Pretty much "yes" to all counts, although I personally haven't noticed much performance difference in regular texture loading (I guess the other work that glTexImage2D is doing - allocating GPU memory, etc - represents the bulk of time spent there).
szo
Posts: 132
Joined: Mon Dec 06, 2010 4:42 pm

Post by szo »

How is GL_UNSIGNED_INT_8_8_8_8_REV handled on big endian systems along with GL_BGRA or GL_RGBA? Should the texture data be generated according to the host's endianness?
mh
Posts: 2292
Joined: Sat Jan 12, 2008 1:38 am

Post by mh »

szo wrote: How is GL_UNSIGNED_INT_8_8_8_8_REV handled on big endian systems along with GL_BGRA or GL_RGBA? Should the texture data be generated according to the host's endianness?
I'd figure that's the most sensible way of doing it, although it makes sense to benchmark first as CPU endianness and GPU endianness may not necessarily be the same thing.
revelator
Posts: 2621
Joined: Thu Jan 24, 2008 12:04 pm
Location: inside tha debugger

Post by revelator »

a jokebot oh hardyharhar :shock:
mh
Posts: 2292
Joined: Sat Jan 12, 2008 1:38 am

Re: Fast Dynamic Lighting

Post by mh »

Just resurrecting this old one regarding the endianness issue mentioned by szo.

The GL spec (page 97, 1.2.1 version) clearly states which bits are assigned to which components for the UNSIGNED_INT_8_8_8_8_REV type so endianness is basically not an issue - 4th component goes in bits 31-24, 3rd in 23-16, 2nd in 15-8 and 1st in 7-0 and if an implementation does otherwise then it's non-conformant.

Where it would be an issue is if you used unsigned int * for your source data type, but because Tex(Sub)Image takes a GLvoid * parameter for data you can still use byte * even with this type and satisfy all requirements.
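For anyone who wants to check this on their own hardware, here is a small sketch that builds a texel with the bit assignments quoted above and then looks at how its bytes sit in memory. As far as I can tell from the spec's packed-pixel rules, the bit positions apply to each texel read as a native 32-bit unsigned integer, so the byte-by-byte view depends on the host; treat that reading as an assumption to verify on a real big endian machine. All function names here are made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build a texel for GL_BGRA + GL_UNSIGNED_INT_8_8_8_8_REV using the bit
   assignments from the spec: 1st component (blue) in bits 7-0, 2nd (green)
   in 15-8, 3rd (red) in 23-16, 4th (alpha) in 31-24. */
uint32_t pack_texel(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
	return ((uint32_t)a << 24) | ((uint32_t)r << 16) | ((uint32_t)g << 8) | (uint32_t)b;
}

/* Returns 1 if the lowest-addressed byte of an integer is its low byte. */
int host_is_little_endian(void)
{
	const uint32_t one = 1;
	unsigned char first;

	memcpy(&first, &one, 1);
	return first == 1;
}

/* Returns 1 if the texel's bytes, read in address order, are B, G, R, A. */
int bytes_read_as_bgra(uint32_t texel, uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
	unsigned char bytes[4];

	assert(sizeof(texel) == sizeof(bytes));
	memcpy(bytes, &texel, sizeof(bytes));

	return bytes[0] == b && bytes[1] == g && bytes[2] == r && bytes[3] == a;
}
```

On a little endian host the bytes come out in B, G, R, A order, so filling the buffer through a byte pointer in that order and building texels with shifts give the same data. On a big endian host the two would differ, which is why it's worth testing there before relying on either approach.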