VBO performance :(
Moderator: InsideQC Admins
12 posts
• Page 1 of 1
VBO performance :(
I've been adding VBOs around the vertex array code, but after everything is up and running I'm only getting 330 fps with timedemo demo2, where I get >1000 fps with immediate mode.
Any suggestions?
- Code: Select all
glBindBufferARB (GL_ARRAY_BUFFER_ARB, vboId);
glBufferDataARB (GL_ARRAY_BUFFER_ARB, sizeof(VArrayVerts), VArrayVerts, GL_STATIC_DRAW_ARB);
glBufferSubDataARB (GL_ARRAY_BUFFER_ARB, 0, sizeof(VArrayVerts), VArrayVerts);
glDrawArrays (VA_PRIMITIVE, 0, va_numverts);
- r00k
- Posts: 1110
- Joined: Sat Nov 13, 2004 10:39 pm
Try:
- Code: Select all
glBufferDataARB (GL_ARRAY_BUFFER_ARB, sizeof(VArrayVerts), NULL, GL_STATIC_DRAW_ARB);
Also consider using GL_STREAM_DRAW
This kind of VBO usage pattern ain't great though. What you want to do is two passes through your data set: the first pass just fills up the buffer and records the positions at which you need to change state and issue draw calls; the second pass goes over your recorded positions and issues the actual draw calls. Otherwise you're doing lots and lots of tiny little updates per frame, which is really going to hurt performance.
Even more ideally you'd put all of your surface data into one great big whopping static VBO and never touch it, then use shaders for animating water and sky. That will have zero overhead on data transfer from CPU -> GPU.
In practice though it doesn't really matter. For Quake there is more or less no difference between VBOs and client-side vertex arrays in terms of performance. You need to be hitting it with enough data to bring your driver to its knees before you'll start seeing performance gains, and bandwidth to the GPU needs to be an identifiable bottleneck. If either of those is false it will be a lot of work for no measurable gain.
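For illustration, here is a minimal sketch of the two-pass idea, written without any GL calls so it compiles on its own. The names (`surf_t`, `batch_t`, `BuildBatches`, `IssueBatches`) and the choice of a texture number as the only state key are assumptions made up for this example, not code from any engine.

```c
#include <string.h>

#define MAX_VERTS   1024
#define MAX_BATCHES 64

typedef struct { float xyz[3]; } vert_t;

/* hypothetical surface: a texture number (the state key) plus some vertices */
typedef struct { int texnum; int numverts; const vert_t *verts; } surf_t;

/* one recorded draw call: a range in the shared vertex buffer */
typedef struct { int texnum; int first; int count; } batch_t;

typedef void (*drawfunc_t) (int texnum, int first, int count);

static vert_t  vbuf[MAX_VERTS];       /* CPU-side staging copy of the VBO contents */
static batch_t batches[MAX_BATCHES];

/* Pass 1: append every surface to the staging buffer and record where state changes.
   Returns the number of batches; *numverts receives the total vertex count. */
int BuildBatches (const surf_t *surfs, int numsurfs, int *numverts)
{
    int i, nb = 0, nv = 0;

    for (i = 0; i < numsurfs; i++)
    {
        /* out of room: a real engine would flush and restart here */
        if (nv + surfs[i].numverts > MAX_VERTS) break;

        /* start a new batch whenever the state key changes */
        if (nb == 0 || batches[nb - 1].texnum != surfs[i].texnum)
        {
            if (nb == MAX_BATCHES) break;
            batches[nb].texnum = surfs[i].texnum;
            batches[nb].first = nv;
            batches[nb].count = 0;
            nb++;
        }

        memcpy (&vbuf[nv], surfs[i].verts, surfs[i].numverts * sizeof (vert_t));
        batches[nb - 1].count += surfs[i].numverts;
        nv += surfs[i].numverts;
    }

    *numverts = nv;
    return nb;
}

/* Pass 2: one state change and one draw call per recorded batch */
void IssueBatches (int numbatches, drawfunc_t draw)
{
    int i;

    for (i = 0; i < numbatches; i++)
        draw (batches[i].texnum, batches[i].first, batches[i].count);
}
```

In a real renderer the staging buffer would be uploaded once between the two passes, and the `draw` callback would bind the texture and call `glDrawArrays` with the recorded `first` and `count`. Surfaces would need to be sorted by texture beforehand for the batches to merge well.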
We had the power, we had the space, we had a sense of time and place
We knew the words, we knew the score, we knew what we were fighting for
- mh
- Posts: 2292
- Joined: Sat Jan 12, 2008 1:38 am
Aye, if you're streaming data, then it's easier/faster to let the drivers do the streaming entirely themselves.
Save the VBOs for large, truly static data (vertex programs can do the animation for you).
- Spike
- Posts: 2892
- Joined: Fri Nov 05, 2004 3:12 am
- Location: UK
Spike wrote: Aye, if you're streaming data, then it's easier/faster to let the drivers do the streaming entirely themselves.
Save the VBOs for large, truly static data (vertex programs can do the animation for you).
Even then bandwidth might not be your bottleneck. VBOs aren't a magic bullet that can suddenly make everything go faster; they're a handy tool that's useful under certain circumstances, and if those circumstances don't apply then you may as well not be using them.
- mh
- Posts: 2292
- Joined: Sat Jan 12, 2008 1:38 am
mh wrote: In practice though it doesn't really matter. For Quake there is more or less no difference between VBOs and client-side vertex arrays in terms of performance. You need to be hitting it with enough data to bring your driver to its knees before you'll start seeing performance gains, and bandwidth to the GPU needs to be an identifiable bottleneck. If either of those is false it will be a lot of work for no measurable gain.
Agreed. In CRX we have an option to load the entire map into a VBO, and the performance gain is really pretty negligible at best.
Where you *might* see a gain is in a case where you have static alias meshes in a map that have multi-pass shaders, so that the arrays aren't being rebuilt each frame. However if you're using VBO, you're probably using GLSL, so everything is done in a single pass anyway.
http://red.planetarena.org - Alien Arena and the CRX engine
- Irritant
- Posts: 250
- Joined: Mon May 19, 2008 2:54 pm
- Location: Maryland
Animating meshes in your vertex shader is the way to go. Combine that with static VBOs containing all of the xyz data (and better yet, if you have generic vertex attrib arrays you can use GL_UNSIGNED_BYTE for Quake formats; don't normalize but make sure that you pad them out to 4 bytes and set a w of 1). Bind your VBO, set a pointer to the xyz for current frame, set a pointer to the xyz for previous frame, send your blendweight as a uniform and let the GPU do all the rest of the heavy lifting. Depending how you do light you could potentially completely eliminate all CPU -> GPU data traffic. Hell, you could even encode the r_avertex_normal_dots table into a texture if you're still using that.
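As a rough sketch of the data side of this, the snippet below packs Quake's 3-byte alias vertices into 4-byte records with w set to 1, and gives a CPU reference for the blend a vertex shader would perform. `trivertx_t` mirrors Quake's on-disk vertex; `packedvert_t`, `PackFrame`, `FrameOffset` and `LerpVertex` are hypothetical names, and the scale/origin parameters are assumed to come from the model header.

```c
#include <stddef.h>

/* Quake's on-disk alias vertex: three position bytes plus a normal index */
typedef struct { unsigned char v[3]; unsigned char lightnormalindex; } trivertx_t;

/* hypothetical VBO layout: position padded out to 4 bytes, w fixed at 1 */
typedef struct { unsigned char xyzw[4]; } packedvert_t;

/* copy one frame's vertices into the padded layout */
void PackFrame (packedvert_t *out, const trivertx_t *in, int numverts)
{
    int i;

    for (i = 0; i < numverts; i++)
    {
        out[i].xyzw[0] = in[i].v[0];
        out[i].xyzw[1] = in[i].v[1];
        out[i].xyzw[2] = in[i].v[2];
        out[i].xyzw[3] = 1;
    }
}

/* byte offset of a frame inside one big static buffer holding every frame */
size_t FrameOffset (int frame, int numverts)
{
    return (size_t) frame * numverts * sizeof (packedvert_t);
}

/* CPU reference for the per-vertex work the shader would do:
   blend previous and current frame, then apply the model's scale and origin */
void LerpVertex (float out[3], const packedvert_t *prev, const packedvert_t *curr,
                 float blend, const float scale[3], const float origin[3])
{
    int i;

    for (i = 0; i < 3; i++)
    {
        float p = (float) prev->xyzw[i];
        float c = (float) curr->xyzw[i];

        out[i] = (p + (c - p) * blend) * scale[i] + origin[i];
    }
}
```

With generic attributes, the two frame pointers would be the `FrameOffset` values for the previous and current frames, passed to `glVertexAttribPointer` with `GL_UNSIGNED_BYTE` and normalization off, while `blend` goes in as a uniform.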
- mh
- Posts: 2292
- Joined: Sat Jan 12, 2008 1:38 am
Re:
mh wrote:Animating meshes in your vertex shader is the way to go. Combine that with static VBOs containing all of the xyz data (and better yet, if you have generic vertex attrib arrays you can use GL_UNSIGNED_BYTE for Quake formats; don't normalize but make sure that you pad them out to 4 bytes and set a w of 1). Bind your VBO, set a pointer to the xyz for current frame, set a pointer to the xyz for previous frame, send your blendweight as a uniform and let the GPU do all the rest of the heavy lifting. Depending how you do light you could potentially completely eliminate all CPU -> GPU data traffic. Hell, you could even encode the r_avertex_normal_dots table into a texture if you're still using that.
While this seems a bit daunting and I'm really unsure how to go about it with our animation process (I am not convinced that we can move our calculations into GLSL and still have compatibility with all hardware - yeah ATI, I am talking to you!), we did manage to get all of our static meshes rendered with VBOs. The performance increase was significant.
http://red.planetarena.org - Alien Arena and the CRX engine
- Irritant
- Posts: 250
- Joined: Mon May 19, 2008 2:54 pm
- Location: Maryland
Re: Re:
Irritant wrote:mh wrote:Animating meshes in your vertex shader is the way to go. Combine that with static VBOs containing all of the xyz data (and better yet, if you have generic vertex attrib arrays you can use GL_UNSIGNED_BYTE for Quake formats; don't normalize but make sure that you pad them out to 4 bytes and set a w of 1). Bind your VBO, set a pointer to the xyz for current frame, set a pointer to the xyz for previous frame, send your blendweight as a uniform and let the GPU do all the rest of the heavy lifting. Depending how you do light you could potentially completely eliminate all CPU -> GPU data traffic. Hell, you could even encode the r_avertex_normal_dots table into a texture if you're still using that.
While this seems a bit daunting and I'm really unsure how to go about it with our animation process (I am not convinced that we can move our calculations into GLSL and still have compatibility with all hardware - yeah ATI, I am talking to you!), we did manage to get all of our static meshes rendered with VBOs. The performance increase was significant.
I got IQM animations working on the GPU using ARB ASM - including on ATI and Intel!
It was quite easy. I did the matrix calcs on the CPU then added 4 extra "texcoords" to the vertex specification. I also added some caching so if the frames don't change since last time the model was drawn I just reuse the previous calcs. That worked out very well.
- mh
- Posts: 2292
- Joined: Sat Jan 12, 2008 1:38 am
Re: VBO performance :(
Nice one.
I'd like a look at the code, unless it's proprietary.
Productivity is a state of mind.
- revelator
- Posts: 2567
- Joined: Thu Jan 24, 2008 12:04 pm
- Location: inside tha debugger
Re: VBO performance :(
reckless wrote: Nice one. I'd like a look at the code, unless it's proprietary.
Don't say you weren't warned...! (OpenGL can get really ugly sometimes)
- Code: Select all
!!ARBvp1.0
# texcoord[1..3] carry the three rows of this vertex's blended 3x4 bone matrix
TEMP transnorm, dot, dotlow, dothigh;
TEMP transvert;
# skin the position: DPH is a 3-component dot product plus the row's w (the translation)
DPH transvert.x, vertex.position, vertex.texcoord[1];
DPH transvert.y, vertex.position, vertex.texcoord[2];
DPH transvert.z, vertex.position, vertex.texcoord[3];
MOV transvert.w, vertex.position.w;
# project the skinned position with the modelview-projection matrix
TEMP outtmp;
DP4 outtmp.x, state.matrix.mvp.row[0], transvert;
DP4 outtmp.y, state.matrix.mvp.row[1], transvert;
DP4 outtmp.z, state.matrix.mvp.row[2], transvert;
DP4 outtmp.w, state.matrix.mvp.row[3], transvert;
MOV result.position, outtmp;
MOV result.fogcoord, outtmp.z;
# rotate the normal by the 3x3 part of the same matrix
DP3 transnorm.x, vertex.normal, vertex.texcoord[1];
DP3 transnorm.y, vertex.normal, vertex.texcoord[2];
DP3 transnorm.z, vertex.normal, vertex.texcoord[3];
# two sets of output texcoords, each scaled by a per-program constant
MUL result.texcoord[0], vertex.texcoord[0], program.local[0];
MUL result.texcoord[1], vertex.texcoord[0], program.local[1];
# shading term from the skinned normal and program.env[0] (presumably the shade vector)
DP3 dot, transnorm, program.env[0];
ADD dothigh, dot, 1.0;
MAD dotlow, dot, 0.2954545, 1.0;
MAX result.texcoord[2], dotlow, dothigh;
END
- Code: Select all
glClientActiveTexture (GL_TEXTURE1);
glEnableClientState (GL_TEXTURE_COORD_ARRAY);
glTexCoordPointer (4, GL_FLOAT, sizeof (matrix3x4_t), ((matrix3x4_t *) scratchbuf)->a);
glClientActiveTexture (GL_TEXTURE2);
glEnableClientState (GL_TEXTURE_COORD_ARRAY);
glTexCoordPointer (4, GL_FLOAT, sizeof (matrix3x4_t), ((matrix3x4_t *) scratchbuf)->b);
glClientActiveTexture (GL_TEXTURE3);
glEnableClientState (GL_TEXTURE_COORD_ARRAY);
glTexCoordPointer (4, GL_FLOAT, sizeof (matrix3x4_t), ((matrix3x4_t *) scratchbuf)->c);
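To make the role of the three texcoord arrays concrete: each vertex gets its own blended 3x4 matrix, and texcoords 1 to 3 carry its rows `a`, `b` and `c`. The sketch below is a CPU reference for what the three DPH instructions compute. The `matrix3x4_t` layout is inferred from how the posted code indexes it, and `DPH` and `TransformVertex` are hypothetical helper names.

```c
/* layout inferred from the posted code: three rows of four floats each */
typedef struct { float a[4]; float b[4]; float c[4]; } matrix3x4_t;

/* same result as the ARB DPH instruction: xyz dot product plus the row's w */
static float DPH (const float pos[3], const float row[4])
{
    return pos[0] * row[0] + pos[1] * row[1] + pos[2] * row[2] + row[3];
}

/* one row per output component, so rotation and translation happen together */
void TransformVertex (float out[3], const float pos[3], const matrix3x4_t *m)
{
    out[0] = DPH (pos, m->a);
    out[1] = DPH (pos, m->b);
    out[2] = DPH (pos, m->c);
}
```

The fourth element of each row is the translation, which is why the arrays are specified with 4 components and a stride of `sizeof (matrix3x4_t)`.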
- Code: Select all
// valiant attempt at clawing back some CPU from IQM animation #1
void Matrix3x4_ScaleAdd (matrix3x4_t *out, matrix3x4_t *base, float scale, matrix3x4_t *add)
{
    out->a[0] = base->a[0] * scale + add->a[0];
    out->a[1] = base->a[1] * scale + add->a[1];
    out->a[2] = base->a[2] * scale + add->a[2];
    out->a[3] = base->a[3] * scale + add->a[3];
    out->b[0] = base->b[0] * scale + add->b[0];
    out->b[1] = base->b[1] * scale + add->b[1];
    out->b[2] = base->b[2] * scale + add->b[2];
    out->b[3] = base->b[3] * scale + add->b[3];
    out->c[0] = base->c[0] * scale + add->c[0];
    out->c[1] = base->c[1] * scale + add->c[1];
    out->c[2] = base->c[2] * scale + add->c[2];
    out->c[3] = base->c[3] * scale + add->c[3];
}

// valiant attempt at clawing back some CPU from IQM animation #2
// note: scale1 is unused; this is a plain lerp from base1 to base2, which
// matches base1 * scale1 + base2 * scale2 only when scale1 == 1 - scale2
void Matrix3x4_ScaleScaleAdd (matrix3x4_t *out, matrix3x4_t *base1, float scale1, matrix3x4_t *base2, float scale2)
{
    out->a[0] = ((base2->a[0] - base1->a[0]) * scale2) + base1->a[0];
    out->a[1] = ((base2->a[1] - base1->a[1]) * scale2) + base1->a[1];
    out->a[2] = ((base2->a[2] - base1->a[2]) * scale2) + base1->a[2];
    out->a[3] = ((base2->a[3] - base1->a[3]) * scale2) + base1->a[3];
    out->b[0] = ((base2->b[0] - base1->b[0]) * scale2) + base1->b[0];
    out->b[1] = ((base2->b[1] - base1->b[1]) * scale2) + base1->b[1];
    out->b[2] = ((base2->b[2] - base1->b[2]) * scale2) + base1->b[2];
    out->b[3] = ((base2->b[3] - base1->b[3]) * scale2) + base1->b[3];
    out->c[0] = ((base2->c[0] - base1->c[0]) * scale2) + base1->c[0];
    out->c[1] = ((base2->c[1] - base1->c[1]) * scale2) + base1->c[1];
    out->c[2] = ((base2->c[2] - base1->c[2]) * scale2) + base1->c[2];
    out->c[3] = ((base2->c[3] - base1->c[3]) * scale2) + base1->c[3];
}

// valiant attempt at clawing back some CPU from IQM animation #3
void Matrix3x4_MultiplyByRef (matrix3x4_t *out, matrix3x4_t *mat1, matrix3x4_t *mat2)
{
    // this was slower than it should be; it's already slow, let's not make it worse
    out->a[0] = (mat2->a[0] * mat1->a[0]) + (mat2->b[0] * mat1->a[1]) + (mat2->c[0] * mat1->a[2]);
    out->a[1] = (mat2->a[1] * mat1->a[0]) + (mat2->b[1] * mat1->a[1]) + (mat2->c[1] * mat1->a[2]);
    out->a[2] = (mat2->a[2] * mat1->a[0]) + (mat2->b[2] * mat1->a[1]) + (mat2->c[2] * mat1->a[2]);
    out->a[3] = (mat2->a[3] * mat1->a[0]) + (mat2->b[3] * mat1->a[1]) + (mat2->c[3] * mat1->a[2]) + mat1->a[3];
    out->b[0] = (mat2->a[0] * mat1->b[0]) + (mat2->b[0] * mat1->b[1]) + (mat2->c[0] * mat1->b[2]);
    out->b[1] = (mat2->a[1] * mat1->b[0]) + (mat2->b[1] * mat1->b[1]) + (mat2->c[1] * mat1->b[2]);
    out->b[2] = (mat2->a[2] * mat1->b[0]) + (mat2->b[2] * mat1->b[1]) + (mat2->c[2] * mat1->b[2]);
    out->b[3] = (mat2->a[3] * mat1->b[0]) + (mat2->b[3] * mat1->b[1]) + (mat2->c[3] * mat1->b[2]) + mat1->b[3];
    out->c[0] = (mat2->a[0] * mat1->c[0]) + (mat2->b[0] * mat1->c[1]) + (mat2->c[0] * mat1->c[2]);
    out->c[1] = (mat2->a[1] * mat1->c[0]) + (mat2->b[1] * mat1->c[1]) + (mat2->c[1] * mat1->c[2]);
    out->c[2] = (mat2->a[2] * mat1->c[0]) + (mat2->b[2] * mat1->c[1]) + (mat2->c[2] * mat1->c[2]);
    out->c[3] = (mat2->a[3] * mat1->c[0]) + (mat2->b[3] * mat1->c[1]) + (mat2->c[3] * mat1->c[2]) + mat1->c[3];
}

void GL_AnimateIQMFrame (iqmdata_t *iqm, int currframe, int lastframe, float lerp)
{
    int i;
    int frame1 = lastframe;
    int frame2 = currframe;
    matrix3x4_t *destmatrix = (matrix3x4_t *) scratchbuf;

    // frame sanity - prevent from going out of bounds
    frame1 %= iqm->num_poses;
    frame2 %= iqm->num_poses;

    // test for flipping the frames
    if (frame1 == iqm->cachedcurrframe && frame2 == iqm->cachedlastframe)
    {
        int temp = frame1;
        frame1 = frame2;
        frame2 = temp;
        lerp = 1.0f - lerp;
    }

    // coarsen the lerp so that we don't need to recache too often
    // this animates at 50 FPS which should be enough for anyone
    lerp = (float) ((int) (lerp * 5)) / 5;

    // if the cached frames and lerp don't change there is no need to re-animate
    // these are all that it's safe to cache as the transformed bones are stored in the scratchbuf
    // we can only cache per model too as the data set is just too big to cache per entity
    if (frame1 != iqm->cachedlastframe || frame2 != iqm->cachedcurrframe || lerp != iqm->cachedlerp)
    {
        matrix3x4_t *mat1 = &iqm->frames[frame1 * iqm->num_joints];
        matrix3x4_t *mat2 = &iqm->frames[frame2 * iqm->num_joints];
        matrix3x4_t mat;

        if (iqm->version == IQM_VERSION1)
        {
            for (i = 0; i < iqm->num_joints; i++)
            {
                Matrix3x4_ScaleScaleAdd (&mat, &mat1[i], (1 - lerp), &mat2[i], lerp);

                if (iqm->jointsv1[i].parent >= 0)
                    Matrix3x4_MultiplyByRef (&iqm->outframe[i], &iqm->outframe[iqm->jointsv1[i].parent], &mat);
                else Matrix3x4_Copy (&iqm->outframe[i], &mat);
            }
        }
        else
        {
            for (i = 0; i < iqm->num_joints; i++)
            {
                Matrix3x4_ScaleScaleAdd (&mat, &mat1[i], (1 - lerp), &mat2[i], lerp);

                if (iqm->jointsv2[i].parent >= 0)
                    Matrix3x4_MultiplyByRef (&iqm->outframe[i], &iqm->outframe[iqm->jointsv2[i].parent], &mat);
                else Matrix3x4_Copy (&iqm->outframe[i], &mat);
            }
        }
    }

    // cache back the frames
    iqm->cachedlastframe = frame1;
    iqm->cachedcurrframe = frame2;
    iqm->cachedlerp = lerp;

    // The actual vertex generation based on the matrixes follows...
    {
        // blendweights were converted to float on load; this consumes maybe an extra 20k memory
        // for a big model but runs a coupla percent faster - that's the tradeoff
        const unsigned char *index = iqm->blendindexes;
        const float *weight = iqm->blendweights;

        for (i = 0; i < iqm->numvertexes; i++, index += 4, weight += 4)
        {
            Matrix3x4_Scale (&destmatrix[i], &iqm->outframe[index[0]], weight[0]);

            // yet another valiant attempt at clawing back some CPU from IQM animation
            if (weight[1])
            {
                Matrix3x4_ScaleAdd (&destmatrix[i], &iqm->outframe[index[1]], weight[1], &destmatrix[i]);

                if (weight[2])
                {
                    Matrix3x4_ScaleAdd (&destmatrix[i], &iqm->outframe[index[2]], weight[2], &destmatrix[i]);

                    if (weight[3])
                    {
                        Matrix3x4_ScaleAdd (&destmatrix[i], &iqm->outframe[index[3]], weight[3], &destmatrix[i]);
                    }
                }
            }
        }
    }
}
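The lerp coarsening is what makes the per-model cache pay off, so here it is pulled out on its own as a small sketch. `CoarsenLerp` is a hypothetical helper, not part of the posted code. It snaps the blend factor down to a multiple of 1/steps, so with 5 steps and Quake's 10 Hz animation frames there are 5 distinct poses per frame interval, which is presumably where the "50 FPS" in the comment comes from.

```c
/* snap a 0..1 blend factor down to a multiple of 1/steps */
float CoarsenLerp (float lerp, int steps)
{
    return (float) ((int) (lerp * steps)) / steps;
}
```

For example, with 5 steps any lerp from 0.2 up to just under 0.4 collapses to 0.2, so consecutive draws of the same model within that window hit the cache instead of rebuilding the bone matrices.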
- mh
- Posts: 2292
- Joined: Sat Jan 12, 2008 1:38 am
Re: VBO performance :(
Wouldn't exactly call it ugly,
but that's a damn load of matrix operations in that code.
3 texcoord pointers??? So it's doing the z axis too, or is there something I'm not catching?
- revelator
- Posts: 2567
- Joined: Thu Jan 24, 2008 12:04 pm
- Location: inside tha debugger