VBO performance :(

r00k
Posts: 1111
Joined: Sat Nov 13, 2004 10:39 pm

VBO performance :(

Post by r00k »

I've been adding VBOs around the vertex array code, but now that everything is up and running, I'm only getting 330 fps with timedemo demo2, where I get >1000 fps with immediate mode.

Any suggestions?

Code:

	glBindBufferARB (GL_ARRAY_BUFFER_ARB, vboId);
	glBufferDataARB (GL_ARRAY_BUFFER_ARB,  sizeof(VArrayVerts), VArrayVerts, GL_STATIC_DRAW_ARB);
	glBufferSubDataARB (GL_ARRAY_BUFFER_ARB, 0, sizeof(VArrayVerts), VArrayVerts);

	glDrawArrays (VA_PRIMITIVE, 0, va_numverts);	
mh
Posts: 2292
Joined: Sat Jan 12, 2008 1:38 am

Post by mh »

Try:

Code:

   glBufferDataARB (GL_ARRAY_BUFFER_ARB,  sizeof(VArrayVerts), NULL, GL_STATIC_DRAW_ARB); 
instead of your current glBufferData call. Otherwise the driver has to stall the pipeline because the GPU may still be using the data currently in the buffer; passing NULL here gives you a fresh block of memory instead.

Also consider using GL_STREAM_DRAW.

This kind of VBO usage pattern ain't great though. What you want to do is two passes through your data set; the first pass just fills up the buffer and records positions at which you need to change state and issue draw calls, the second pass is over your recorded positions and issues the actual draw calls. Otherwise you're doing lots and lots of tiny little updates per frame which is really going to hurt performance.
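
A minimal sketch of that two-pass idea in plain C, with the GL calls left as comments so it stands alone. Everything named here (vert_t, batch_t, Batch_AddSurface, Batch_Flush, the fixed array sizes) is made up for illustration, and it assumes the surfaces have already been turned into triangle lists so that consecutive surfaces with the same texture can be merged into one draw call:

```c
#include <string.h>

#define MAX_VERTS   4096
#define MAX_BATCHES 256

typedef struct { float xyz[3]; float st[2]; } vert_t;

/* one recorded draw call: a run of vertices that share the same state (here just a texture id) */
typedef struct { int texture; int first; int count; } batch_t;

/* hypothetical callback standing in for "bind state, then glDrawArrays" */
typedef void (*drawfunc_t) (int texture, int first, int count);

static vert_t  staging[MAX_VERTS];	/* CPU-side copy, uploaded once per frame */
static batch_t batches[MAX_BATCHES];
static int num_verts, num_batches;

/* pass 1: append vertices and record where state changes; no GL calls here */
static int Batch_AddSurface (int texture, const vert_t *verts, int count)
{
	batch_t *b;

	if (num_verts + count > MAX_VERTS) return 0;

	if (num_batches == 0 || batches[num_batches - 1].texture != texture)
	{
		/* state change: start a new batch at the current write position */
		if (num_batches == MAX_BATCHES) return 0;

		b = &batches[num_batches++];
		b->texture = texture;
		b->first = num_verts;
		b->count = 0;
	}
	else b = &batches[num_batches - 1];

	memcpy (&staging[num_verts], verts, count * sizeof (vert_t));
	num_verts += count;
	b->count += count;

	return 1;
}

/* pass 2: one upload, then one draw per recorded batch; returns the number of draws */
static int Batch_Flush (drawfunc_t draw)
{
	int i, issued = num_batches;

	/* real code would upload here, e.g. glBufferDataARB with NULL followed by */
	/* glBufferSubDataARB (GL_ARRAY_BUFFER_ARB, 0, num_verts * sizeof (vert_t), staging) */

	for (i = 0; i < num_batches; i++)
	{
		/* real code would bind batches[i].texture and call glDrawArrays (GL_TRIANGLES, first, count) */
		if (draw) draw (batches[i].texture, batches[i].first, batches[i].count);
	}

	num_verts = num_batches = 0;
	return issued;
}
```

Call Batch_AddSurface for every visible surface, then Batch_Flush once at the end of the frame; sorting surfaces by texture beforehand should cut the number of batches further.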

Even more ideally you'd put all of your surface data into one great big whopping static VBO and never touch it, then use shaders for animating water and sky. That will have zero overhead on data transfer from CPU -> GPU.

In practice though it doesn't really matter. For Quake there is more or less no difference between VBOs and client-side vertex arrays in terms of performance. You need to be hitting it with enough data to bring your driver to its knees before you'll start seeing performance gains, and bandwidth to the GPU needs to be an identifiable bottleneck. If either of those is false it will be a lot of work for no measurable gain.
We had the power, we had the space, we had a sense of time and place
We knew the words, we knew the score, we knew what we were fighting for
r00k
Posts: 1111
Joined: Sat Nov 13, 2004 10:39 pm

Post by r00k »

Doh! ;)
Thanks for the explanation.

I guess I'll put this on the back burner for now...
Spike
Posts: 2914
Joined: Fri Nov 05, 2004 3:12 am
Location: UK
Contact:

Post by Spike »

aye, if you're streaming data, then it's easier/faster to let the drivers do the streaming entirely themselves.

save the VBOs for large, truly static data (vertex programs can do the animation for you).
mh
Posts: 2292
Joined: Sat Jan 12, 2008 1:38 am

Post by mh »

Spike wrote: aye, if you're streaming data, then it's easier/faster to let the drivers do the streaming entirely themselves.

save the VBOs for large, truly static data (vertex programs can do the animation for you).
Even then bandwidth might not be your bottleneck. VBOs aren't a magic bullet that can suddenly make everything go faster, they're a handy tool that's useful under certain circumstances, and if those circumstances don't apply then you may as well not be using them.
Irritant
Posts: 250
Joined: Mon May 19, 2008 2:54 pm
Location: Maryland
Contact:

Post by Irritant »

mh wrote: In practice though it doesn't really matter. For Quake there is more or less no difference between VBOs and client-side vertex arrays in terms of performance. You need to be hitting it with enough data to bring your driver to its knees before you'll start seeing performance gains, and bandwidth to the GPU needs to be an identifiable bottleneck. If either of those is false it will be a lot of work for no measurable gain.
Agreed. In CRX we have an option to load the entire map into a VBO, and the performance gain is really pretty negligible at best.

Where you *might* see a gain is in a case where you have static alias meshes in a map that have multi-pass shaders, so that the arrays aren't being rebuilt each frame. However if you're using VBO, you're probably using GLSL, so everything is done in a single pass anyway.
http://red.planetarena.org - Alien Arena and the CRX engine
mh
Posts: 2292
Joined: Sat Jan 12, 2008 1:38 am

Post by mh »

Animating meshes in your vertex shader is the way to go. Combine that with static VBOs containing all of the xyz data (and better yet, if you have generic vertex attrib arrays you can use GL_UNSIGNED_BYTE for Quake formats; don't normalize but make sure that you pad them out to 4 bytes and set a w of 1). Bind your VBO, set a pointer to the xyz for current frame, set a pointer to the xyz for previous frame, send your blendweight as a uniform and let the GPU do all the rest of the heavy lifting. Depending how you do light you could potentially completely eliminate all CPU -> GPU data traffic. Hell, you could even encode the r_avertex_normal_dots table into a texture if you're still using that.
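
A rough CPU-side sketch of that layout, assuming the Quake MDL trivertx_t (3 position bytes plus a light normal index) and the per-model scale and scale_origin from the MDL header. packedvert_t, PackFrame, FrameOffset and LerpVertex are hypothetical names; LerpVertex is only a reference for the arithmetic the vertex program would do with the two frame pointers and the blend weight:

```c
#include <stddef.h>

/* Quake MDL vertex: 3 position bytes plus an index into the normal table */
typedef struct { unsigned char v[3]; unsigned char lightnormalindex; } trivertx_t;

/* hypothetical VBO layout: xyz as unnormalized bytes, padded to 4 bytes with w = 1 */
typedef struct { unsigned char xyzw[4]; } packedvert_t;

/* done once at load time; the result goes into a static VBO and is never touched again */
static void PackFrame (packedvert_t *dst, const trivertx_t *src, int numverts)
{
	int i;

	for (i = 0; i < numverts; i++)
	{
		dst[i].xyzw[0] = src[i].v[0];
		dst[i].xyzw[1] = src[i].v[1];
		dst[i].xyzw[2] = src[i].v[2];
		dst[i].xyzw[3] = 1;
	}
}

/* byte offset of a frame inside the VBO; this is what the two attrib pointers would be set to */
static size_t FrameOffset (int frame, int numverts)
{
	return (size_t) frame * (size_t) numverts * sizeof (packedvert_t);
}

/* reference for the per-vertex work: blend the two frames, then apply the model's scale and origin */
static void LerpVertex (float *out, const packedvert_t *prev, const packedvert_t *curr,
	float blend, const float *scale, const float *scale_origin)
{
	int i;

	for (i = 0; i < 3; i++)
	{
		float p = prev->xyzw[i] + (curr->xyzw[i] - prev->xyzw[i]) * blend;

		out[i] = p * scale[i] + scale_origin[i];
	}

	out[3] = 1.0f;
}
```

With this layout the only per-frame data the CPU sends for a model is the two offsets and the blend weight.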
Irritant
Posts: 250
Joined: Mon May 19, 2008 2:54 pm
Location: Maryland
Contact:

Re:

Post by Irritant »

mh wrote:Animating meshes in your vertex shader is the way to go. Combine that with static VBOs containing all of the xyz data (and better yet, if you have generic vertex attrib arrays you can use GL_UNSIGNED_BYTE for Quake formats; don't normalize but make sure that you pad them out to 4 bytes and set a w of 1). Bind your VBO, set a pointer to the xyz for current frame, set a pointer to the xyz for previous frame, send your blendweight as a uniform and let the GPU do all the rest of the heavy lifting. Depending how you do light you could potentially completely eliminate all CPU -> GPU data traffic. Hell, you could even encode the r_avertex_normal_dots table into a texture if you're still using that.
While this seems a bit daunting and I'm really unsure how to go about it with our animation process (I am not convinced that we can move our calculations into GLSL and still have compatibility with all hardware - yeah ATI, I am talking to you!), we did manage to get all of our static meshes rendered with VBO. The performance increase was significant.
mh
Posts: 2292
Joined: Sat Jan 12, 2008 1:38 am

Re: Re:

Post by mh »

Irritant wrote:
mh wrote:Animating meshes in your vertex shader is the way to go. Combine that with static VBOs containing all of the xyz data (and better yet, if you have generic vertex attrib arrays you can use GL_UNSIGNED_BYTE for Quake formats; don't normalize but make sure that you pad them out to 4 bytes and set a w of 1). Bind your VBO, set a pointer to the xyz for current frame, set a pointer to the xyz for previous frame, send your blendweight as a uniform and let the GPU do all the rest of the heavy lifting. Depending how you do light you could potentially completely eliminate all CPU -> GPU data traffic. Hell, you could even encode the r_avertex_normal_dots table into a texture if you're still using that.
While this seems a bit daunting and I'm really unsure how to go about it with our animation process (I am not convinced that we can move our calculations into GLSL and still have compatibility with all hardware - yeah ATI, I am talking to you!), we did manage to get all of our static meshes rendered with VBO. The performance increase was significant.
I got IQM animations working on the GPU using ARB ASM - including on ATI and Intel! :D

It was quite easy. I did the matrix calcs on the CPU then added 4 extra "texcoords" to the vertex specification. I also added some caching so if the frames don't change since last time the model was drawn I just reuse the previous calcs. That worked out very well.
revelator
Posts: 2621
Joined: Thu Jan 24, 2008 12:04 pm
Location: inside tha debugger

Re: VBO performance :(

Post by revelator »

Nice one :) I'd like a look at the code unless it's proprietary ;)
Productivity is a state of mind.
mh
Posts: 2292
Joined: Sat Jan 12, 2008 1:38 am

Re: VBO performance :(

Post by mh »

reckless wrote: Nice one :) I'd like a look at the code unless it's proprietary ;)
Don't say you weren't warned...! (OpenGL can get really ugly sometimes)

Code:

!!ARBvp1.0
TEMP transnorm, dot, dotlow, dothigh;
TEMP transvert;
DPH transvert.x, vertex.position, vertex.texcoord[1];
DPH transvert.y, vertex.position, vertex.texcoord[2];
DPH transvert.z, vertex.position, vertex.texcoord[3];
MOV transvert.w, vertex.position.w;
TEMP outtmp;
DP4 outtmp.x, state.matrix.mvp.row[0], transvert;
DP4 outtmp.y, state.matrix.mvp.row[1], transvert;
DP4 outtmp.z, state.matrix.mvp.row[2], transvert;
DP4 outtmp.w, state.matrix.mvp.row[3], transvert;
MOV result.position, outtmp;
MOV result.fogcoord, outtmp.z;
DP3 transnorm.x, vertex.normal, vertex.texcoord[1];
DP3 transnorm.y, vertex.normal, vertex.texcoord[2];
DP3 transnorm.z, vertex.normal, vertex.texcoord[3];
MUL result.texcoord[0], vertex.texcoord[0], program.local[0];
MUL result.texcoord[1], vertex.texcoord[0], program.local[1];
DP3 dot, transnorm, program.env[0];
ADD dothigh, dot, 1.0;
MAD dotlow, dot, 0.2954545, 1.0;
MAX result.texcoord[2], dotlow, dothigh;
END
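
For anyone who doesn't read ARB assembly, here is a plain C reference for the per-vertex maths in that program, as a sketch only. The matrix3x4_t layout (three rows a, b, c of four floats) is inferred from the pointer setup code, and TransformPosition, TransformNormal and ShadeDot are made-up names. The final multiply by state.matrix.mvp and the texcoord scaling are left out:

```c
/* assumed layout: three rows of four floats, one row per extra texcoord array */
typedef struct { float a[4], b[4], c[4]; } matrix3x4_t;

/* DPH: 3-component dot product plus the w of the second operand */
static float DPH (const float *v, const float *row)
{
	return v[0] * row[0] + v[1] * row[1] + v[2] * row[2] + row[3];
}

/* DP3: plain 3-component dot product */
static float DP3 (const float *v, const float *row)
{
	return v[0] * row[0] + v[1] * row[1] + v[2] * row[2];
}

/* the three DPH instructions plus the MOV of w: rotate and translate by the blended bone matrix */
static void TransformPosition (float *out, const float *pos, const matrix3x4_t *m)
{
	out[0] = DPH (pos, m->a);
	out[1] = DPH (pos, m->b);
	out[2] = DPH (pos, m->c);
	out[3] = pos[3];
}

/* the three DP3 instructions: rotate the normal only, ignoring the translation column */
static void TransformNormal (float *out, const float *normal, const matrix3x4_t *m)
{
	out[0] = DP3 (normal, m->a);
	out[1] = DP3 (normal, m->b);
	out[2] = DP3 (normal, m->c);
}

/* the ADD/MAD/MAX tail: 1 + dot when facing the light, 1 + dot * 0.2954545 when facing away */
static float ShadeDot (const float *normal, const float *lightdir)
{
	float dot = DP3 (normal, lightdir);
	float dothigh = dot + 1.0f;
	float dotlow = dot * 0.2954545f + 1.0f;

	return dotlow > dothigh ? dotlow : dothigh;
}
```

So texcoord[1], texcoord[2] and texcoord[3] each carry one row of a per-vertex 3x4 matrix, which is why there are three pointers.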

Code:

glClientActiveTexture (GL_TEXTURE1);
glEnableClientState (GL_TEXTURE_COORD_ARRAY);
glTexCoordPointer (4, GL_FLOAT, sizeof (matrix3x4_t), ((matrix3x4_t *) scratchbuf)->a);

glClientActiveTexture (GL_TEXTURE2);
glEnableClientState (GL_TEXTURE_COORD_ARRAY);
glTexCoordPointer (4, GL_FLOAT, sizeof (matrix3x4_t), ((matrix3x4_t *) scratchbuf)->b);

glClientActiveTexture (GL_TEXTURE3);
glEnableClientState (GL_TEXTURE_COORD_ARRAY);
glTexCoordPointer (4, GL_FLOAT, sizeof (matrix3x4_t), ((matrix3x4_t *) scratchbuf)->c);

Code:

// valiant attempt at clawing back some CPU from IQM animation #1
void Matrix3x4_ScaleAdd (matrix3x4_t *out, matrix3x4_t *base, float scale, matrix3x4_t *add)
{
	out->a[0] = base->a[0] * scale + add->a[0];
	out->a[1] = base->a[1] * scale + add->a[1];
	out->a[2] = base->a[2] * scale + add->a[2];
	out->a[3] = base->a[3] * scale + add->a[3];

	out->b[0] = base->b[0] * scale + add->b[0];
	out->b[1] = base->b[1] * scale + add->b[1];
	out->b[2] = base->b[2] * scale + add->b[2];
	out->b[3] = base->b[3] * scale + add->b[3];

	out->c[0] = base->c[0] * scale + add->c[0];
	out->c[1] = base->c[1] * scale + add->c[1];
	out->c[2] = base->c[2] * scale + add->c[2];
	out->c[3] = base->c[3] * scale + add->c[3];
}


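// valiant attempt at clawing back some CPU from IQM animation #2
// note: scale1 is unused; the result is base1 + (base2 - base1) * scale2, a lerp that assumes scale1 == 1 - scale2
void Matrix3x4_ScaleScaleAdd (matrix3x4_t *out, matrix3x4_t *base1, float scale1, matrix3x4_t *base2, float scale2)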
{
	out->a[0] = ((base2->a[0] - base1->a[0]) * scale2) + base1->a[0];
	out->a[1] = ((base2->a[1] - base1->a[1]) * scale2) + base1->a[1];
	out->a[2] = ((base2->a[2] - base1->a[2]) * scale2) + base1->a[2];
	out->a[3] = ((base2->a[3] - base1->a[3]) * scale2) + base1->a[3];

	out->b[0] = ((base2->b[0] - base1->b[0]) * scale2) + base1->b[0];
	out->b[1] = ((base2->b[1] - base1->b[1]) * scale2) + base1->b[1];
	out->b[2] = ((base2->b[2] - base1->b[2]) * scale2) + base1->b[2];
	out->b[3] = ((base2->b[3] - base1->b[3]) * scale2) + base1->b[3];

	out->c[0] = ((base2->c[0] - base1->c[0]) * scale2) + base1->c[0];
	out->c[1] = ((base2->c[1] - base1->c[1]) * scale2) + base1->c[1];
	out->c[2] = ((base2->c[2] - base1->c[2]) * scale2) + base1->c[2];
	out->c[3] = ((base2->c[3] - base1->c[3]) * scale2) + base1->c[3];
}


// valiant attempt at clawing back some CPU from IQM animation #3
void Matrix3x4_MultiplyByRef (matrix3x4_t *out, matrix3x4_t *mat1, matrix3x4_t *mat2)
{
	// this was slower than it should be; it's already slow, let's not make it worse
	out->a[0] = (mat2->a[0] * mat1->a[0]) + (mat2->b[0] * mat1->a[1]) + (mat2->c[0] * mat1->a[2]);
	out->a[1] = (mat2->a[1] * mat1->a[0]) + (mat2->b[1] * mat1->a[1]) + (mat2->c[1] * mat1->a[2]);
	out->a[2] = (mat2->a[2] * mat1->a[0]) + (mat2->b[2] * mat1->a[1]) + (mat2->c[2] * mat1->a[2]);
	out->a[3] = (mat2->a[3] * mat1->a[0]) + (mat2->b[3] * mat1->a[1]) + (mat2->c[3] * mat1->a[2]) + mat1->a[3];

	out->b[0] = (mat2->a[0] * mat1->b[0]) + (mat2->b[0] * mat1->b[1]) + (mat2->c[0] * mat1->b[2]);
	out->b[1] = (mat2->a[1] * mat1->b[0]) + (mat2->b[1] * mat1->b[1]) + (mat2->c[1] * mat1->b[2]);
	out->b[2] = (mat2->a[2] * mat1->b[0]) + (mat2->b[2] * mat1->b[1]) + (mat2->c[2] * mat1->b[2]);
	out->b[3] = (mat2->a[3] * mat1->b[0]) + (mat2->b[3] * mat1->b[1]) + (mat2->c[3] * mat1->b[2]) + mat1->b[3];

	out->c[0] = (mat2->a[0] * mat1->c[0]) + (mat2->b[0] * mat1->c[1]) + (mat2->c[0] * mat1->c[2]);
	out->c[1] = (mat2->a[1] * mat1->c[0]) + (mat2->b[1] * mat1->c[1]) + (mat2->c[1] * mat1->c[2]);
	out->c[2] = (mat2->a[2] * mat1->c[0]) + (mat2->b[2] * mat1->c[1]) + (mat2->c[2] * mat1->c[2]);
	out->c[3] = (mat2->a[3] * mat1->c[0]) + (mat2->b[3] * mat1->c[1]) + (mat2->c[3] * mat1->c[2]) + mat1->c[3];
}


void GL_AnimateIQMFrame (iqmdata_t *iqm, int currframe, int lastframe, float lerp)
{
	int i;

	int frame1 = lastframe;
	int frame2 = currframe;

	matrix3x4_t *destmatrix = (matrix3x4_t *) scratchbuf;

	// frame sanity - prevent from going out of bounds
	frame1 %= iqm->num_poses;
	frame2 %= iqm->num_poses;

	// test for flipping the frames
	if (frame1 == iqm->cachedcurrframe && frame2 == iqm->cachedlastframe)
	{
		int temp = frame1;

		frame1 = frame2;
		frame2 = temp;
		lerp = 1.0f - lerp;
	}

	// coarsen the lerp so that we don't need to recache too often
	// this animates at 50 FPS which should be enough for anyone
	lerp = (float) ((int) (lerp * 5)) / 5;

	// if the cached frames and lerp don't change there is no need to re-animate
	// these are all that it's safe to cache as the transformed bones are stored in the scratchbuf
	// we can only cache per model too as the data set is just too big to cache per entity
	if (frame1 != iqm->cachedlastframe || frame2 != iqm->cachedcurrframe || lerp != iqm->cachedlerp)
	{
		matrix3x4_t *mat1 = &iqm->frames[frame1 * iqm->num_joints];
		matrix3x4_t *mat2 = &iqm->frames[frame2 * iqm->num_joints];
		matrix3x4_t mat;

		if (iqm->version == IQM_VERSION1)
		{
			for (i = 0; i < iqm->num_joints; i++)
			{
				Matrix3x4_ScaleScaleAdd (&mat, &mat1[i], (1 - lerp), &mat2[i], lerp);

				if (iqm->jointsv1[i].parent >= 0)
					Matrix3x4_MultiplyByRef (&iqm->outframe[i], &iqm->outframe[iqm->jointsv1[i].parent], &mat);
				else Matrix3x4_Copy (&iqm->outframe[i], &mat);
			}
		}
		else
		{
			for (i = 0; i < iqm->num_joints; i++)
			{
				Matrix3x4_ScaleScaleAdd (&mat, &mat1[i], (1 - lerp), &mat2[i], lerp);

				if (iqm->jointsv2[i].parent >= 0)
					Matrix3x4_MultiplyByRef (&iqm->outframe[i], &iqm->outframe[iqm->jointsv2[i].parent], &mat);
				else Matrix3x4_Copy (&iqm->outframe[i], &mat);
			}
		}
	}

	// cache back the frames
	iqm->cachedlastframe = frame1;
	iqm->cachedcurrframe = frame2;
	iqm->cachedlerp = lerp;

	// The actual vertex generation based on the matrixes follows...
	{
		// blendweights were converted to float on load; this consumes maybe an extra 20k memory
		// for a big model but runs a coupla percent faster - that's the tradeoff
		const unsigned char *index = iqm->blendindexes;
		const float *weight = iqm->blendweights;

		for (i = 0; i < iqm->numvertexes; i++, index += 4, weight += 4)
		{
			Matrix3x4_Scale (&destmatrix[i], &iqm->outframe[index[0]], weight[0]);

			// yet another valiant attempt at clawing back some CPU from IQM animation
			if (weight[1])
			{
				Matrix3x4_ScaleAdd (&destmatrix[i], &iqm->outframe[index[1]], weight[1], &destmatrix[i]);

				if (weight[2])
				{
					Matrix3x4_ScaleAdd (&destmatrix[i], &iqm->outframe[index[2]], weight[2], &destmatrix[i]);

					if (weight[3])
					{
						Matrix3x4_ScaleAdd (&destmatrix[i], &iqm->outframe[index[3]], weight[3], &destmatrix[i]);
					}
				}
			}
		}
	}
}
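
The caching trick above, pulled out into a small standalone sketch. animcache_t, CoarsenLerp and Cache_NeedsUpdate are hypothetical names, not part of the engine code. The "50 FPS" figure in the comment works out if frames arrive at Quake's usual 10 per second, since 5 lerp steps per frame then gives 50 distinct poses per second:

```c
/* hypothetical per-model cache of the last pose that was animated */
typedef struct { int lastframe, currframe; float lerp; } animcache_t;

/* snap lerp down to a multiple of 1/steps so that nearby values compare equal */
static float CoarsenLerp (float lerp, int steps)
{
	return (float) ((int) (lerp * steps)) / steps;
}

/* returns 1 (and updates the cache) if the bones need re-animating, 0 if the old result can be reused */
static int Cache_NeedsUpdate (animcache_t *cache, int lastframe, int currframe, float lerp)
{
	lerp = CoarsenLerp (lerp, 5);

	if (cache->lastframe == lastframe && cache->currframe == currframe && cache->lerp == lerp)
		return 0;

	cache->lastframe = lastframe;
	cache->currframe = currframe;
	cache->lerp = lerp;

	return 1;
}
```

Two entities drawing the same model at lerp 0.45 and 0.55 both snap to 0.4, so the second one reuses the first one's bone matrices.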
revelator
Posts: 2621
Joined: Thu Jan 24, 2008 12:04 pm
Location: inside tha debugger

Re: VBO performance :(

Post by revelator »

Wouldn't exactly call it ugly :) but that's a damn load of matrix operations in that code.
3 texcoord pointers??? So is it doing the z axis too, or is there something I'm not catching?