Optimizing

leileilol · Post by **leileilol** » Tue Mar 27, 2012 6:19 pm

What are some to keep in mind?

I've already learned the hard way about having texture atlas/pages

qbism · Post by **qbism** » Tue Mar 27, 2012 11:27 pm

Isn't that the trend? Entire scenes/maps with a single texture. UDK, etc.

leileilol · Post by **leileilol** » Wed Mar 28, 2012 12:05 am

yeah but with bsp, tiled texturing and some 3d cards with 256x256 limits, ultrapaging everything isn't exactly feasible, but paging things like having muzzleflash, projectile AND impact on one texture, all the marks on another texture, etc. would help.

It's funny because Quake3 is fast and it didn't make an effort to atlas anything but the text....except on a OpenGL ES port where Q3 is a slug.

mh · Post by mh » Wed Mar 28, 2012 8:19 am

Simplest optimization is to get your lightmap update parameters matching the actual native GPU formats. You can also quite easily remove a huge amount of pipeline stalling by scheduling them better.

Where to go from there depends on your target hardware. Do you have hardware T&L? Do you have shaders? What kind of bandwidth and fillrate are we talking about? These all influence the best places to attack next.

Spike · Post by **Spike** » Wed Mar 28, 2012 3:01 pm

as you mention opengles, I assume you're making an arm/mobile port.

q3 does batching, but it uses memcpys to get those batches together.
arm devices don't exactly have a lot of cache, so memory access like that can be kinda slow.

combining multiple textures into one is problematic if you need to consider border regions between textures. this makes world/model textures rather painful, so when the content is not specifically designed for it, it only really works for 2d stuff.

Do you have an arm jit? try to compile gamecode natively?

quake3 uses about as many floats as quake1. make sure your arm port has hardware fpu enabled where possible.

various people seem to recommend using 16bit textures for mobile devices. I'm not sure if that's just due to memory usage or slowness with 32bit formats.

a device that supports only gles1 is generally considered too slow for a real fps. devices that supports gles2 are generally what you want to be using, even if you only use gles1.

the biggest optimisation possible is to get it to do nothing...

mh · Post by mh » Wed Mar 28, 2012 10:05 pm

This doc released by id for Linux driver writers around the time of Q3's release is worth reading: http://www.gamers.org/dEngine/quake3/johnc_glopt.html

Much contained in it is no longer relevant, but it does offer useful guidance on the kind of optimizations driver writers would have been encouraged to use in order to benchmark well in tests at the time. Moving a Q1 codebase to something similar is a good enough first step.

The big thing to watch out for with it is the use of padded 4-float position attributes and multiple (i.e. non-interleaved) vertex streams. That was something you would do in the software T&L days where vertex transforms would gain benefits of SIMD ops from it. With hardware T&L vertex transforms are no longer done on the CPU and SIMD is not relevant, so all it does is offer unnecessary bandwidth overhead and ruin cache locality for each full vertex.

With Q1 the biggest performance gains I've gotten have come from the following (roughly in order, and ignoring oddities like D3D9 or below's heavy draw call overhead):

Changing lightmaps to native GPU formats and scheduling updates to reduce pipeline stalls.
Get rid of R_DrawSequentialPoly and use texture chains for the multitextured path (you need to do this on brush models too, just pass the model_t * as a param to your DrawTextureChains function).
Concatening all MDL vertexes into a single GL_TRIANGLES (D3DPT_TRIANGLELIST actually, same thing) indexed list.
VBOs and making as much per-frame data as possible totally static (shaders are an absolute requirement for MDL frame interpolation and sky/water animation).
Concatenating brush surfaces to a single GL_TRIANGLES (as above, blah) list that only gets drawn when state needs to change.
Using a texture array for lightmaps.
Moving sky and water animations from the CPU to the GPU (goes hand-in-hand with VBOs but can be done separately).
Changing R_CullBox to use bounding spheres.
Switching particles to quads (yes, it's an extra vertex per particle, but look at all the wasted fillrate you save).

Most of these involve tradeoffs. You need to accept that you're going to get things like increased load times, extra memory usage or higher hardware requirements in exchange for the performance. Some of them accept a slow down in one area (e.g. more CPU work) in exchange for a speed up in another (e.g. more efficient GPU usage) with coming out on the right side of that tradeoff being the objective. And not all of them are appropriate for all platforms: some mobile platforms, I believe, still prefer triangle strips, whereas desktop platforms have long since moved on to preferring indexed triangles.

Ultimately the stock GL Q1 codebase is grotty and evil, and does things in ways that were appropriate for the hardware of the time (including piling a lot of work onto the CPU) but which stand as classic examples of the worst possible things you could do today. The only thing in it's favour is that Q2 is even worse in places.

leileilol · Post by **leileilol** » Thu Mar 29, 2012 12:47 pm

Spike wrote:as you mention opengles, I assume you're making an arm/mobile port.

i'm not making a mobile port, but i'm rebooting an asset project so that it doesn't suck on one

You may think that video shows it runs fine, but here's the kicker - this is vertexlight mode, and the inferior PowerVR PCX2 is faster than this

atlasing would help, notice the slowdown when the shotty fires - the muzzleflash alone cycles through 8 textures per puff.

Knightmare · Post by **Knightmare** » Thu Apr 05, 2012 1:45 am

mh wrote:[*]Changing lightmaps to native GPU formats and scheduling updates to reduce pipeline stalls.

That's GL_RGBA for all lightmaps, right? By scheduling updates you mean creating a set of dynamic lightmaps with dlights/lightstyles blended in BEFORE rendering world surfaces?

Spike · Post by **Spike** » Thu Apr 05, 2012 4:50 am

all PC hardware prefers GL_BGRA, for some weird reason, and GL_EXT_bgra is present in all windows-based opengl libraries, even microsoft's.

The general recommendation is to use internalformat=GL_RGBA8, format=GL_BGRA, type=GL_UNSIGNED_INT_8_8_8_8_REV (ie: the only byte ordering allowed by direct3d).

note that GL_BGRA is core in gl1.2 (but otherwise exists as GL_EXT_BGRA in all windows implementations).
note that GL_UNSIGNED_INT_8_8_8_8_REV also only exists from gl1.2, but is meant to be functionally identical to GL_UNSIGNED_BYTE on little endian systems.

This really only affects the speed of glTexImage calls and not the format stored on the gpu, so won't affect framerates in q3 at all, only load times. Quake1/2 engines will likely want to use it for lightmaps, but there are bigger issues there (like uploading the same lightmap 20 times per frame, along with multiple stalls) which should be addressed first.

leileilol: that video still has a higher framerate than fte running in the android emulator does.

mobile devices generally have shared video memory. the gles1 devices don't even support vbos - they'd be no faster. All I can really suggest is to reduce world geometry and make sure you're not overdrawing too much. Also avoid blending as that requires reading from the framebuffer too.
the fte android port is really just a gles1 port of fte, its basically identical to the pc version except for general gles issues (glClampd->glClampf, 16bit vertex indicies, glTexEnvf instead of glTexEnvi to work around android bugs... that's pretty much it, renderwise). It does also use 16bit textures, but not lightmaps because that would be more messy.
most people seem to get a playable framerate (>20) and enough seem to achieve vsync (at 55 or 60fps) that I'm not too concerned about raw performance with it. I do have to point out that q1 maps are generally quite a lot less detailed than your openarena maps, so I'm kinda wondering what the framerate differences is like with ftedroid and typical q1 vs q3 bsps.

Knightmare · Post by **Knightmare** » Sun Apr 22, 2012 3:34 am

Quake2 uses a temp lightmap image slot for the updates it does per-surface right before rendering each face. Batching these at the start of each frame will obviously require creating a second set of lightmap textures that are updated copies of the originals.

Minimizing uploads means updating each lightmap image with lightstyle and dlight info for every surface that uses it, before calling qglTexSubImage2D. Doing this efficently requires a way to quickly iterate through all surfaces that use a given lightmap image, and not looping through all surfaces in the BSP looking for a texturenum match for each lightmap image. Would using a special texturechain for each lightmap image generated on map load be the best way to do this?

What about for bmodels? Origin/angles for each bmodel are not available until it is rendered in the entity loop, so lightstyle/dlight info will need to be updated separately for each bmodel. Bmodels can be used by more than one entity, so batch updating their lightmaps at the beginning of the frame would be problematic, even if the entity info was available then.

mh · Post by mh » Sun Apr 22, 2012 12:02 pm

No need for any of that. You just keep a system memory copy of the lightmap data, same way as Quake does. R_BuildLightmap writes into that copy.

Both the worldmodel and brush models are modified to use texture chains. There are two steps here: (1) build the texture chain, and (2) draw the texture chain.

In step (1), as well as building the chain you check appropriate surfaces for lightmap modification. Again, the modification just updates the system memory copy, and if a lightmap is modified you set a qboolean indicating so. Also track the modified rectangle, same way as Quake.

In step (2), before drawing, you iterate through all your lightmaps calling glTexSubImage on the ones that are modified. Clear the qboolean and reset the rectangle.

There's a clear tradeoff here: you accept some extra system memory usage in exchange for hugely reducing the performance hit of lightmap updating (especially on AMD and Intel hardware).

The only remaining complication is sorting surfaces by both lightmap and texture to minimize state changes and allow bigger draw batches. One solution is to use a texture array which only needs to be bound once (at the start of your texture chain drawing function) so all you need to do is sort by texture, which using texture chains will already do for you.

Some sample code.

Initial lightmap build:

Code: Select all

	GL_BindTexture (GL_TEXTURE2, GL_TEXTURE_2D_ARRAY, 0, gl_state.lightmap_textures);

	glTexImage3D (GL_TEXTURE_2D_ARRAY,
		0,
		GL_RGBA8,
		LIGHTMAP_SIZE, LIGHTMAP_SIZE,
		gl_lms.current_lightmap_texture,
		0,
		GL_BGRA,
		GL_UNSIGNED_INT_8_8_8_8_REV,
		gl_lms.lmhunkbase);

Surface verts:

Code: Select all

			// lightmaps must always be built before surfaces so these are valid to set
			// texcoord[2] contains the slice index for a texture array
			verts->lightmap[0] = s;
			verts->lightmap[1] = t;
			verts->lightmap[2] = surf->lightmaptexturenum;

Check for modification:

Code: Select all

		int	smax, tmax;
		unsigned *base = gl_lms.lightmap_data[surf->lightmaptexturenum];
		RECT *rect = &gl_lms.lightrect[surf->lightmaptexturenum];

		smax = (surf->extents[0] >> 4) + 1;
		tmax = (surf->extents[1] >> 4) + 1;
		base += (surf->light_t * LIGHTMAP_SIZE) + surf->light_s;

		R_BuildLightMap (surf, base, LIGHTMAP_SIZE);
		R_SetCacheState (surf);

		gl_lms.modified[surf->lightmaptexturenum] = true;

		if (surf->lightrect.left < rect->left) rect->left = surf->lightrect.left;
		if (surf->lightrect.right > rect->right) rect->right = surf->lightrect.right;
		if (surf->lightrect.top < rect->top) rect->top = surf->lightrect.top;
		if (surf->lightrect.bottom > rect->bottom) rect->bottom = surf->lightrect.bottom;

Rebuild:

Code: Select all

	GL_BindTexture (GL_TEXTURE2, GL_TEXTURE_2D_ARRAY, r_lightmapsampler, gl_state.lightmap_textures);

	// upload any lightmaps that were modified
	for (i = 0; i < gl_lms.current_lightmap_texture; i++)
	{
		if (gl_lms.modified[i])
		{
			if (!stateset)
			{
				glPixelStorei (GL_UNPACK_ROW_LENGTH, LIGHTMAP_SIZE);
				stateset = true;
			}

			// texture is already bound so just modify it
			glTexSubImage3D (GL_TEXTURE_2D_ARRAY, 0,
				gl_lms.lightrect[i].left, gl_lms.lightrect[i].top, i,
				(gl_lms.lightrect[i].right - gl_lms.lightrect[i].left), (gl_lms.lightrect[i].bottom - gl_lms.lightrect[i].top), 1,
				GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV,
				gl_lms.lightmap_data[i] + (gl_lms.lightrect[i].top * LIGHTMAP_SIZE) + gl_lms.lightrect[i].left);

			gl_lms.lightrect[i].left = LIGHTMAP_SIZE;
			gl_lms.lightrect[i].right = 0;
			gl_lms.lightrect[i].top = LIGHTMAP_SIZE;
			gl_lms.lightrect[i].bottom = 0;

			gl_lms.modified[i] = false;
		}
	}

	if (stateset)
		glPixelStorei (GL_UNPACK_ROW_LENGTH, 0);

Knightmare · Post by **Knightmare** » Sun Apr 22, 2012 5:45 pm

Unfortunately, Q2 doesn't use texturechains for lightmapped surfaces, but only for warp and transparent surfaces. Also, aren't texture arrays incompatible with using fixed pipeline functions? It will be a while before I'm doing everything with fragment programs.

But an array for updated copies of lightmaps added to the lightmap state would definately simplify things.

Are the values in the lightrect struct just ints derived from the surface's light_s/light_t values and smax/tmax? If so, the lightmap state struct would look like this:

Code: Select all

typedef struct
{
	int			left;
	int			right;
	int			top;
	int			bottom;
} rect_t;

typedef struct
{
	int			internal_format;
	int			current_lightmap_texture;

	msurface_t	*lightmap_surfaces[MAX_LIGHTMAPS];
	byte		lightmap_update[MAX_LIGHTMAPS][4*LM_BLOCK_WIDTH*LM_BLOCK_HEIGHT];
	rect_t		lightrect[MAX_LIGHTMAPS];
	qboolean	modified[MAX_LIGHTMAPS];

	int			allocated[LM_BLOCK_WIDTH];

	// the lightmap texture data needs to be kept in
	// main memory so texsubimage can update properly
	byte		lightmap_buffer[4*LM_BLOCK_WIDTH*LM_BLOCK_HEIGHT];
} gllightmapstate_t;

extern gllightmapstate_t gl_lms;

Spike · Post by **Spike** » Sun Apr 22, 2012 7:34 pm

yeah, texture arrays mandate csqc.

if you're storing a chain of the surfaces in the lightmap, then I really hope that each lightmap has only a single texture, otherwise its pointless as each batch should have a single texture, and its probably faster to just glSubTexImage all surfaces in one go.

mh · Post by mh » Sun Apr 22, 2012 9:02 pm

Here's the lightmap state struct I'm using:

Code: Select all

typedef struct gllightmapstate_s
{
	int	current_lightmap_texture;

	int			allocated[LIGHTMAP_SIZE];
	qboolean	modified[MAX_LIGHTMAPS];
	RECT		lightrect[MAX_LIGHTMAPS];

	// the lightmap texture data needs to be kept in
	// main memory so texsubimage can update properly
	int			lmhunkmark;
	byte		*lmhunkbase;
	unsigned	*lightmap_data[MAX_LIGHTMAPS];
} gllightmapstate_t;

The lightmap data is kept in it's own hunk which is VirtualAlloc'ed as required and I'm using the Windows RECT struct rather than defining my own, but otherwise it's pretty standard. For convenience I also keep a lightrect in each msurface_t but that's not absolutely required. The only thing I'm doing that might look funny is using an unsigned for data rather than a byte * - that just makes array indexing a little easier and more robust (no need to have to remember to multiply by 4 every time).

Code: Select all

		gl_lms.modified[surf->lightmaptexturenum] = true;

		if (surf->lightrect.left < rect->left) rect->left = surf->lightrect.left;
		if (surf->lightrect.right > rect->right) rect->right = surf->lightrect.right;
		if (surf->lightrect.top < rect->top) rect->top = surf->lightrect.top;
		if (surf->lightrect.bottom > rect->bottom) rect->bottom = surf->lightrect.bottom;

Code: Select all

	surf->lightrect.left = surf->light_s;
	surf->lightrect.top = surf->light_t;
	surf->lightrect.right = surf->light_s + smax;
	surf->lightrect.bottom = surf->light_t + tmax;

Knightmare · Post by **Knightmare** » Mon Apr 23, 2012 1:38 am

Spike: That lightmap_surfaces pointer array is just a leftover from the old, removed two-pass lightmap blending code. I might repurpose it, though.

mh: Do the existing checks handle updates for the first frame when a surface is no longer dynamically lit, so that dlights are removed?

InsideQC Forums

Optimizing

Optimizing

Re: Optimizing

Re: Optimizing

Re: Optimizing

Re: Optimizing

Re: Optimizing

Re: Optimizing

Re: Optimizing

Re: Optimizing

Re: Optimizing

Re: Optimizing

Re: Optimizing

Re: Optimizing

Re: Optimizing

Re: Optimizing