
VBO performance :(

Posted: Tue Aug 23, 2011 5:19 pm
by r00k
I've been adding VBOs around the vertex array code, but after everything is up and running, I'm only getting 330fps with timedemo demo2, where I get >1000fps with immediate mode.

Any suggestions?

Code: Select all

	glBindBufferARB (GL_ARRAY_BUFFER_ARB, vboId);
	glBufferDataARB (GL_ARRAY_BUFFER_ARB,  sizeof(VArrayVerts), VArrayVerts, GL_STATIC_DRAW_ARB);
	glBufferSubDataARB (GL_ARRAY_BUFFER_ARB, 0, sizeof(VArrayVerts), VArrayVerts);

	glDrawArrays (VA_PRIMITIVE, 0, va_numverts);	

Posted: Tue Aug 23, 2011 9:18 pm
by mh
Try:

Code: Select all

   glBufferDataARB (GL_ARRAY_BUFFER_ARB,  sizeof(VArrayVerts), NULL, GL_STATIC_DRAW_ARB); 
instead of your current glBufferData call. Otherwise the driver needs to stall the pipeline because the GPU may still be using the data in the buffer; passing NULL here will give you a fresh block of memory instead.

Also consider using GL_STREAM_DRAW
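
For what it's worth, here is a rough sketch of that call order. So that it compiles without a GL context, the gl* functions below are local logging stand-ins, not the real entry points (the enum values are the ones from gl.h/glext.h). upload_and_draw, call_log and the tightly packed float layout are made up for the example, and vertex pointer setup is left out.

```c
#include <stddef.h>

/* Enum values as defined in gl.h / glext.h. */
#define GL_TRIANGLES        0x0004
#define GL_ARRAY_BUFFER_ARB 0x8892
#define GL_STREAM_DRAW_ARB  0x88E0

/* ASSUMPTION: logging stand-ins so the sketch builds without a GL context.
   In an engine these are the real ARB_vertex_buffer_object entry points. */
enum { CALL_BIND, CALL_DATA_NULL, CALL_DATA_FULL, CALL_SUBDATA, CALL_DRAW };

static int call_log[16];
static int num_calls;

static void log_call (int c)
{
	if (num_calls < 16) call_log[num_calls++] = c;
}

static void glBindBufferARB (unsigned target, unsigned id)
{
	(void) target; (void) id;
	log_call (CALL_BIND);
}

static void glBufferDataARB (unsigned target, ptrdiff_t size, const void *data, unsigned usage)
{
	(void) target; (void) size; (void) usage;
	log_call (data ? CALL_DATA_FULL : CALL_DATA_NULL);
}

static void glBufferSubDataARB (unsigned target, ptrdiff_t offset, ptrdiff_t size, const void *data)
{
	(void) target; (void) offset; (void) size; (void) data;
	log_call (CALL_SUBDATA);
}

static void glDrawArrays (unsigned mode, int first, int count)
{
	(void) mode; (void) first; (void) count;
	log_call (CALL_DRAW);
}

/* Hypothetical helper: respecify the buffer with NULL, then fill it and draw. */
void upload_and_draw (unsigned vbo, const float *verts, int numverts, int floats_per_vert)
{
	ptrdiff_t size = (ptrdiff_t) numverts * floats_per_vert * (ptrdiff_t) sizeof (float);

	glBindBufferARB (GL_ARRAY_BUFFER_ARB, vbo);

	/* NULL data: ask for fresh storage instead of waiting on the old contents */
	glBufferDataARB (GL_ARRAY_BUFFER_ARB, size, NULL, GL_STREAM_DRAW_ARB);

	/* now upload this frame's vertexes into the fresh storage */
	glBufferSubDataARB (GL_ARRAY_BUFFER_ARB, 0, size, verts);

	glDrawArrays (GL_TRIANGLES, 0, numverts);
}
```

The only difference from the code in the first post is the NULL data pointer and the STREAM usage hint; whether it helps at all will depend on the driver.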

This kind of VBO usage pattern ain't great though. What you want to do is two passes through your data set; the first pass just fills up the buffer and records positions at which you need to change state and issue draw calls, the second pass is over your recorded positions and issues the actual draw calls. Otherwise you're doing lots and lots of tiny little updates per frame which is really going to hurt performance.
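
A minimal sketch of the first pass, assuming surfaces arrive already sorted by texture. surf_t, batch_t, build_batches and the fixed-size arrays are hypothetical names for illustration, not anything from a real engine, and the second pass is only described in a comment because it needs a live GL context.

```c
#include <string.h>

#define MAX_VERTS   4096
#define MAX_BATCHES 64

/* ASSUMPTION: a cut-down surface; a real engine carries far more state. */
typedef struct
{
	int texture;
	int numverts;
	const float *verts;	/* xyz triples */
} surf_t;

/* One recorded draw call: the state to set plus a range of vertexes. */
typedef struct
{
	int texture;
	int first;
	int count;
} batch_t;

static float   staging[MAX_VERTS * 3];
static batch_t batches[MAX_BATCHES];

/* Pass 1: append every surface to the staging array and record a new batch
   each time the texture changes. Returns the number of batches recorded and
   stores the total vertex count in *total. */
int build_batches (const surf_t *surfs, int numsurfs, int *total)
{
	int i, numverts = 0, numbatches = 0;

	for (i = 0; i < numsurfs; i++)
	{
		const surf_t *s = &surfs[i];

		/* sketch only: stop when full rather than flushing and restarting */
		if (numverts + s->numverts > MAX_VERTS) break;

		if (numbatches == 0 || batches[numbatches - 1].texture != s->texture)
		{
			if (numbatches == MAX_BATCHES) break;

			batches[numbatches].texture = s->texture;
			batches[numbatches].first = numverts;
			batches[numbatches].count = 0;
			numbatches++;
		}

		memcpy (&staging[numverts * 3], s->verts, s->numverts * 3 * sizeof (float));
		batches[numbatches - 1].count += s->numverts;
		numverts += s->numverts;
	}

	*total = numverts;
	return numbatches;
}

/* Pass 2 (not shown): upload staging[] to the VBO once, then for each batch
   bind batch->texture and draw the range [first, first + count). */
```

That way the buffer gets one upload per frame instead of one per surface, and the number of draw calls drops to the number of state changes.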

Even more ideally you'd put all of your surface data into one great big whopping static VBO and never touch it, then use shaders for animating water and sky. That will have zero overhead on data transfer from CPU -> GPU.

In practice though it doesn't really matter. For Quake there is more or less no difference between VBOs and client-side vertex arrays in terms of performance. You need to be hitting it with enough data to bring your driver to its knees before you'll start seeing performance gains, and bandwidth to the GPU needs to be an identifiable bottleneck. If either of those is false it will be a lot of work for no measurable gain.

Posted: Tue Aug 23, 2011 11:15 pm
by r00k
Doh! ;)
Thanks for the explanation.

I guess I'll put this on the back burner for now...

Posted: Tue Aug 23, 2011 11:25 pm
by Spike
aye, if you're streaming data, then it's easier/faster to let the drivers do the streaming entirely themselves.

save the vbos for large, true static data (vertex programs can do the animation for you).

Posted: Wed Aug 24, 2011 1:42 am
by mh
Spike wrote:aye, if you're streaming data, then it's easier/faster to let the drivers do the streaming entirely themselves.

save the vbos for large, true static data (vertex programs can do the animation for you).
Even then bandwidth might not be your bottleneck. VBOs aren't a magic bullet that can suddenly make everything go faster, they're a handy tool that's useful under certain circumstances, and if those circumstances don't apply then you may as well not be using them.

Posted: Wed Aug 24, 2011 4:19 pm
by Irritant
mh wrote:In practice though it doesn't really matter. For Quake there is more or less no difference between VBOs and client-side vertex arrays in terms of performance. You need to be hitting it with enough data to bring your driver to its knees before you'll start seeing performance gains, and bandwidth to the GPU needs to be an identifiable bottleneck. If either of those is false it will be a lot of work for no measurable gain.
Agreed. In CRX we have an option to load the entire map into a VBO, and the performance gain is really pretty negligible at best.

Where you *might* see a gain is in a case where you have static alias meshes in a map that have multi-pass shaders, so that the arrays aren't being rebuilt each frame. However if you're using VBO, you're probably using GLSL, so everything is done in a single pass anyway.

Posted: Fri Aug 26, 2011 7:11 pm
by mh
Animating meshes in your vertex shader is the way to go. Combine that with static VBOs containing all of the xyz data (and better yet, if you have generic vertex attrib arrays you can use GL_UNSIGNED_BYTE for Quake formats; don't normalize but make sure that you pad them out to 4 bytes and set a w of 1). Bind your VBO, set a pointer to the xyz for current frame, set a pointer to the xyz for previous frame, send your blendweight as a uniform and let the GPU do all the rest of the heavy lifting. Depending how you do light you could potentially completely eliminate all CPU -> GPU data traffic. Hell, you could even encode the r_avertex_normal_dots table into a texture if you're still using that.
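
To make the per-vertex work concrete, here is a CPU stand-in for what such a vertex shader would compute, as a sketch only. It assumes Quake MDL style byte positions with a per-model scale and origin; packedvert_t, pack_vert and lerp_vert are hypothetical names, and in a real renderer the blend would run on the GPU with blend passed as a uniform.

```c
/* Byte position padded out to 4 bytes with w = 1, as it would sit in the VBO. */
typedef struct
{
	unsigned char v[4];
} packedvert_t;

/* Pad a 3-byte Quake style position to 4 bytes and set w to 1. */
packedvert_t pack_vert (const unsigned char xyz[3])
{
	packedvert_t p;

	p.v[0] = xyz[0];
	p.v[1] = xyz[1];
	p.v[2] = xyz[2];
	p.v[3] = 1;

	return p;
}

/* CPU stand-in for the shader: with a non-normalized GL_UNSIGNED_BYTE
   attribute the shader sees the bytes as floats in 0..255, blends the
   previous and current frame, then applies the model scale and origin. */
void lerp_vert (float out[4], const packedvert_t *prev, const packedvert_t *curr,
	float blend, const float scale[3], const float origin[3])
{
	int i;

	for (i = 0; i < 3; i++)
	{
		float p = (float) prev->v[i];
		float c = (float) curr->v[i];

		out[i] = (p + (c - p) * blend) * scale[i] + origin[i];
	}

	/* both frames carry w = 1, so the blended w is 1 as well */
	out[3] = 1.0f;
}
```

With blend = 0 this returns the previous frame and with blend = 1 the current one, so the only thing the CPU has to send per entity is the blend weight and which two frames to point at.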

Re:

Posted: Fri Nov 11, 2011 3:08 pm
by Irritant
mh wrote:Animating meshes in your vertex shader is the way to go. Combine that with static VBOs containing all of the xyz data (and better yet, if you have generic vertex attrib arrays you can use GL_UNSIGNED_BYTE for Quake formats; don't normalize but make sure that you pad them out to 4 bytes and set a w of 1). Bind your VBO, set a pointer to the xyz for current frame, set a pointer to the xyz for previous frame, send your blendweight as a uniform and let the GPU do all the rest of the heavy lifting. Depending how you do light you could potentially completely eliminate all CPU -> GPU data traffic. Hell, you could even encode the r_avertex_normal_dots table into a texture if you're still using that.
While this seems a bit daunting and I'm really unsure how to go about it with our animation process (I am not convinced that we can move our calculations into GLSL and still have compatibility with all hardware - yeah ATI I am talking to you!), we did manage to get all of our static meshes rendered with VBO. The performance increase was significant.

Re: Re:

Posted: Sun Nov 13, 2011 12:24 am
by mh
Irritant wrote:
mh wrote:Animating meshes in your vertex shader is the way to go. Combine that with static VBOs containing all of the xyz data (and better yet, if you have generic vertex attrib arrays you can use GL_UNSIGNED_BYTE for Quake formats; don't normalize but make sure that you pad them out to 4 bytes and set a w of 1). Bind your VBO, set a pointer to the xyz for current frame, set a pointer to the xyz for previous frame, send your blendweight as a uniform and let the GPU do all the rest of the heavy lifting. Depending how you do light you could potentially completely eliminate all CPU -> GPU data traffic. Hell, you could even encode the r_avertex_normal_dots table into a texture if you're still using that.
While this seems a bit daunting and I'm really unsure how to go about it with our animation process (I am not convinced that we can move our calculations into GLSL and still have compatibility with all hardware - yeah ATI I am talking to you!), we did manage to get all of our static meshes rendered with VBO. The performance increase was significant.
I got IQM animations working on the GPU using ARB ASM - including on ATI and Intel! :D

It was quite easy. I did the matrix calcs on the CPU then added 4 extra "texcoords" to the vertex specification. I also added some caching so if the frames don't change since last time the model was drawn I just reuse the previous calcs. That worked out very well.

Re: VBO performance :(

Posted: Sun Nov 13, 2011 3:38 pm
by revelator
Nice one :) I'd like a look at the code unless it's (proprietary) ;)

Re: VBO performance :(

Posted: Mon Nov 14, 2011 8:49 pm
by mh
reckless wrote:Nice one :) I'd like a look at the code unless it's (proprietary) ;)
Don't say you weren't warned...! (OpenGL can get really ugly sometimes)

Code: Select all

!!ARBvp1.0
TEMP transnorm, dot, dotlow, dothigh;
TEMP transvert;
DPH transvert.x, vertex.position, vertex.texcoord[1];
DPH transvert.y, vertex.position, vertex.texcoord[2];
DPH transvert.z, vertex.position, vertex.texcoord[3];
MOV transvert.w, vertex.position.w;
TEMP outtmp;
DP4 outtmp.x, state.matrix.mvp.row[0], transvert;
DP4 outtmp.y, state.matrix.mvp.row[1], transvert;
DP4 outtmp.z, state.matrix.mvp.row[2], transvert;
DP4 outtmp.w, state.matrix.mvp.row[3], transvert;
MOV result.position, outtmp;
MOV result.fogcoord, outtmp.z;
DP3 transnorm.x, vertex.normal, vertex.texcoord[1];
DP3 transnorm.y, vertex.normal, vertex.texcoord[2];
DP3 transnorm.z, vertex.normal, vertex.texcoord[3];
MUL result.texcoord[0], vertex.texcoord[0], program.local[0];
MUL result.texcoord[1], vertex.texcoord[0], program.local[1];
DP3 dot, transnorm, program.env[0];
ADD dothigh, dot, 1.0;
MAD dotlow, dot, 0.2954545, 1.0;
MAX result.texcoord[2], dotlow, dothigh;
END

Code: Select all

glClientActiveTexture (GL_TEXTURE1);
glEnableClientState (GL_TEXTURE_COORD_ARRAY);
glTexCoordPointer (4, GL_FLOAT, sizeof (matrix3x4_t), ((matrix3x4_t *) scratchbuf)->a);

glClientActiveTexture (GL_TEXTURE2);
glEnableClientState (GL_TEXTURE_COORD_ARRAY);
glTexCoordPointer (4, GL_FLOAT, sizeof (matrix3x4_t), ((matrix3x4_t *) scratchbuf)->b);

glClientActiveTexture (GL_TEXTURE3);
glEnableClientState (GL_TEXTURE_COORD_ARRAY);
glTexCoordPointer (4, GL_FLOAT, sizeof (matrix3x4_t), ((matrix3x4_t *) scratchbuf)->c);

Code: Select all

// valiant attempt at clawing back some CPU from IQM animation #1
void Matrix3x4_ScaleAdd (matrix3x4_t *out, matrix3x4_t *base, float scale, matrix3x4_t *add)
{
	out->a[0] = base->a[0] * scale + add->a[0];
	out->a[1] = base->a[1] * scale + add->a[1];
	out->a[2] = base->a[2] * scale + add->a[2];
	out->a[3] = base->a[3] * scale + add->a[3];

	out->b[0] = base->b[0] * scale + add->b[0];
	out->b[1] = base->b[1] * scale + add->b[1];
	out->b[2] = base->b[2] * scale + add->b[2];
	out->b[3] = base->b[3] * scale + add->b[3];

	out->c[0] = base->c[0] * scale + add->c[0];
	out->c[1] = base->c[1] * scale + add->c[1];
	out->c[2] = base->c[2] * scale + add->c[2];
	out->c[3] = base->c[3] * scale + add->c[3];
}


// valiant attempt at clawing back some CPU from IQM animation #2
void Matrix3x4_ScaleScaleAdd (matrix3x4_t *out, matrix3x4_t *base1, float scale1, matrix3x4_t *base2, float scale2)
{
	out->a[0] = ((base2->a[0] - base1->a[0]) * scale2) + base1->a[0];
	out->a[1] = ((base2->a[1] - base1->a[1]) * scale2) + base1->a[1];
	out->a[2] = ((base2->a[2] - base1->a[2]) * scale2) + base1->a[2];
	out->a[3] = ((base2->a[3] - base1->a[3]) * scale2) + base1->a[3];

	out->b[0] = ((base2->b[0] - base1->b[0]) * scale2) + base1->b[0];
	out->b[1] = ((base2->b[1] - base1->b[1]) * scale2) + base1->b[1];
	out->b[2] = ((base2->b[2] - base1->b[2]) * scale2) + base1->b[2];
	out->b[3] = ((base2->b[3] - base1->b[3]) * scale2) + base1->b[3];

	out->c[0] = ((base2->c[0] - base1->c[0]) * scale2) + base1->c[0];
	out->c[1] = ((base2->c[1] - base1->c[1]) * scale2) + base1->c[1];
	out->c[2] = ((base2->c[2] - base1->c[2]) * scale2) + base1->c[2];
	out->c[3] = ((base2->c[3] - base1->c[3]) * scale2) + base1->c[3];
}


// valiant attempt at clawing back some CPU from IQM animation #3
void Matrix3x4_MultiplyByRef (matrix3x4_t *out, matrix3x4_t *mat1, matrix3x4_t *mat2)
{
	// this was slower than it should be; it's already slow, let's not make it worse
	out->a[0] = (mat2->a[0] * mat1->a[0]) + (mat2->b[0] * mat1->a[1]) + (mat2->c[0] * mat1->a[2]);
	out->a[1] = (mat2->a[1] * mat1->a[0]) + (mat2->b[1] * mat1->a[1]) + (mat2->c[1] * mat1->a[2]);
	out->a[2] = (mat2->a[2] * mat1->a[0]) + (mat2->b[2] * mat1->a[1]) + (mat2->c[2] * mat1->a[2]);
	out->a[3] = (mat2->a[3] * mat1->a[0]) + (mat2->b[3] * mat1->a[1]) + (mat2->c[3] * mat1->a[2]) + mat1->a[3];

	out->b[0] = (mat2->a[0] * mat1->b[0]) + (mat2->b[0] * mat1->b[1]) + (mat2->c[0] * mat1->b[2]);
	out->b[1] = (mat2->a[1] * mat1->b[0]) + (mat2->b[1] * mat1->b[1]) + (mat2->c[1] * mat1->b[2]);
	out->b[2] = (mat2->a[2] * mat1->b[0]) + (mat2->b[2] * mat1->b[1]) + (mat2->c[2] * mat1->b[2]);
	out->b[3] = (mat2->a[3] * mat1->b[0]) + (mat2->b[3] * mat1->b[1]) + (mat2->c[3] * mat1->b[2]) + mat1->b[3];

	out->c[0] = (mat2->a[0] * mat1->c[0]) + (mat2->b[0] * mat1->c[1]) + (mat2->c[0] * mat1->c[2]);
	out->c[1] = (mat2->a[1] * mat1->c[0]) + (mat2->b[1] * mat1->c[1]) + (mat2->c[1] * mat1->c[2]);
	out->c[2] = (mat2->a[2] * mat1->c[0]) + (mat2->b[2] * mat1->c[1]) + (mat2->c[2] * mat1->c[2]);
	out->c[3] = (mat2->a[3] * mat1->c[0]) + (mat2->b[3] * mat1->c[1]) + (mat2->c[3] * mat1->c[2]) + mat1->c[3];
}


void GL_AnimateIQMFrame (iqmdata_t *iqm, int currframe, int lastframe, float lerp)
{
	int i;

	int frame1 = lastframe;
	int frame2 = currframe;

	matrix3x4_t *destmatrix = (matrix3x4_t *) scratchbuf;

	// frame sanity - prevent from going out of bounds
	frame1 %= iqm->num_poses;
	frame2 %= iqm->num_poses;

	// test for flipping the frames
	if (frame1 == iqm->cachedcurrframe && frame2 == iqm->cachedlastframe)
	{
		int temp = frame1;

		frame1 = frame2;
		frame2 = temp;
		lerp = 1.0f - lerp;
	}

	// coarsen the lerp so that we don't need to recache too often
	// this animates at 50 FPS which should be enough for anyone
	lerp = (float) ((int) (lerp * 5)) / 5;

	// if the cached frames and lerp don't change there is no need to re-animate
	// these are all that it's safe to cache as the transformed bones are stored in the scratchbuf
	// we can only cache per model too as the data set is just too big to cache per entity
	if (frame1 != iqm->cachedlastframe || frame2 != iqm->cachedcurrframe || lerp != iqm->cachedlerp)
	{
		matrix3x4_t *mat1 = &iqm->frames[frame1 * iqm->num_joints];
		matrix3x4_t *mat2 = &iqm->frames[frame2 * iqm->num_joints];
		matrix3x4_t mat;

		if (iqm->version == IQM_VERSION1)
		{
			for (i = 0; i < iqm->num_joints; i++)
			{
				Matrix3x4_ScaleScaleAdd (&mat, &mat1[i], (1 - lerp), &mat2[i], lerp);

				if (iqm->jointsv1[i].parent >= 0)
					Matrix3x4_MultiplyByRef (&iqm->outframe[i], &iqm->outframe[iqm->jointsv1[i].parent], &mat);
				else Matrix3x4_Copy (&iqm->outframe[i], &mat);
			}
		}
		else
		{
			for (i = 0; i < iqm->num_joints; i++)
			{
				Matrix3x4_ScaleScaleAdd (&mat, &mat1[i], (1 - lerp), &mat2[i], lerp);

				if (iqm->jointsv2[i].parent >= 0)
					Matrix3x4_MultiplyByRef (&iqm->outframe[i], &iqm->outframe[iqm->jointsv2[i].parent], &mat);
				else Matrix3x4_Copy (&iqm->outframe[i], &mat);
			}
		}
	}

	// cache back the frames
	iqm->cachedlastframe = frame1;
	iqm->cachedcurrframe = frame2;
	iqm->cachedlerp = lerp;

	// The actual vertex generation based on the matrixes follows...
	{
		// blendweights were converted to float on load; this consumes maybe an extra 20k memory
		// for a big model but runs a coupla percent faster - that's the tradeoff
		const unsigned char *index = iqm->blendindexes;
		const float *weight = iqm->blendweights;

		for (i = 0; i < iqm->numvertexes; i++, index += 4, weight += 4)
		{
			Matrix3x4_Scale (&destmatrix[i], &iqm->outframe[index[0]], weight[0]);

			// yet another valiant attempt at clawing back some CPU from IQM animation
			if (weight[1])
			{
				Matrix3x4_ScaleAdd (&destmatrix[i], &iqm->outframe[index[1]], weight[1], &destmatrix[i]);

				if (weight[2])
				{
					Matrix3x4_ScaleAdd (&destmatrix[i], &iqm->outframe[index[2]], weight[2], &destmatrix[i]);

					if (weight[3])
					{
						Matrix3x4_ScaleAdd (&destmatrix[i], &iqm->outframe[index[3]], weight[3], &destmatrix[i]);
					}
				}
			}
		}
	}
}
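
The code above uses matrix3x4_t, Matrix3x4_Copy and Matrix3x4_Scale without showing them. What follows is a guess at what they might look like, with the struct layout inferred from the a/b/c members used above; it is not the original source. Matrix3x4_TransformPoint is a hypothetical extra that does on the CPU what the three DPH instructions do in the vertex program: each of the three texcoord arrays carries one row of the per-vertex 3x4 matrix.

```c
/* ASSUMPTION: three rows of four floats, inferred from out->a[0..3] etc. */
typedef struct matrix3x4_s
{
	float a[4];
	float b[4];
	float c[4];
} matrix3x4_t;

/* Guess at the missing helper: plain struct copy. */
void Matrix3x4_Copy (matrix3x4_t *out, const matrix3x4_t *in)
{
	*out = *in;
}

/* Guess at the missing helper: scale every element, used to start the
   weighted sum of bone matrixes for a vertex. */
void Matrix3x4_Scale (matrix3x4_t *out, const matrix3x4_t *in, float scale)
{
	int i;

	for (i = 0; i < 4; i++)
	{
		out->a[i] = in->a[i] * scale;
		out->b[i] = in->b[i] * scale;
		out->c[i] = in->c[i] * scale;
	}
}

/* Hypothetical: CPU equivalent of the three DPH instructions. DPH is a
   3-component dot product plus the w of the second operand, so each row
   yields one output coordinate (rotation part plus translation). */
void Matrix3x4_TransformPoint (float out[3], const matrix3x4_t *m, const float p[3])
{
	out[0] = p[0] * m->a[0] + p[1] * m->a[1] + p[2] * m->a[2] + m->a[3];
	out[1] = p[0] * m->b[0] + p[1] * m->b[1] + p[2] * m->b[2] + m->b[3];
	out[2] = p[0] * m->c[0] + p[1] * m->c[1] + p[2] * m->c[2] + m->c[3];
}
```

So row a goes out as texcoord 1 and produces x, row b as texcoord 2 produces y, and row c as texcoord 3 produces z.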

Re: VBO performance :(

Posted: Tue Nov 15, 2011 8:06 pm
by revelator
Wouldn't exactly call it ugly :) but that's a damn load of matrix operations in that code.
3 texcoord pointers??? So is it doing the z axis too, or is there something I'm not catching?