VBO performance :(

Postby r00k » Tue Aug 23, 2011 5:19 pm

I've been adding VBOs around the vertex array code, but now that everything is up and running I'm only getting 330 fps with timedemo demo2, where I get >1000 fps with immediate mode.

Any suggestions?

Code:
   glBindBufferARB (GL_ARRAY_BUFFER_ARB, vboId);
   glBufferDataARB (GL_ARRAY_BUFFER_ARB,  sizeof(VArrayVerts), VArrayVerts, GL_STATIC_DRAW_ARB);
   glBufferSubDataARB (GL_ARRAY_BUFFER_ARB, 0, sizeof(VArrayVerts), VArrayVerts);

   glDrawArrays (VA_PRIMITIVE, 0, va_numverts);   
r00k
 
Posts: 1108
Joined: Sat Nov 13, 2004 10:39 pm

Postby mh » Tue Aug 23, 2011 9:18 pm

Try:
Code:
   glBufferDataARB (GL_ARRAY_BUFFER_ARB,  sizeof(VArrayVerts), NULL, GL_STATIC_DRAW_ARB);
instead of your current glBufferDataARB call. Otherwise the driver needs to stall the pipeline because the GPU is still using the data in the buffer; passing NULL here will give you a fresh block of memory instead, which your glBufferSubDataARB call then fills.

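Also consider using GL_STREAM_DRAW_ARB instead of GL_STATIC_DRAW_ARB as the usage hint, since you're respecifying the data every frame.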
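
Put together, the upload could look something like the sketch below. It is only an illustration of the call order (orphan with NULL, then fill), not anyone's engine code. The `vbo_api_t` table, `VBO_StreamUpload` and the `Stub_*` recorders are hypothetical names; a real engine would fill the table with the `glBufferDataARB` / `glBufferSubDataARB` entry points from its extension loader, and the stubs are only there so the sequence can be checked without a GL context.

```c
#include <stddef.h>

/* Enum values as defined by ARB_vertex_buffer_object. */
#define GL_ARRAY_BUFFER_ARB 0x8892
#define GL_STREAM_DRAW_ARB  0x88E0

/* Hypothetical function-pointer table standing in for the real entry points
   (glBufferDataARB / glBufferSubDataARB) that an extension loader provides. */
typedef struct
{
   void (*BufferData) (unsigned target, ptrdiff_t size, const void *data, unsigned usage);
   void (*BufferSubData) (unsigned target, ptrdiff_t offset, ptrdiff_t size, const void *data);
} vbo_api_t;

/* Orphan the old store by respecifying it with NULL, then upload this
   frame's vertices into the fresh store. Assumes the VBO is already bound. */
static void VBO_StreamUpload (const vbo_api_t *gl, const void *verts, ptrdiff_t bytes)
{
   gl->BufferData (GL_ARRAY_BUFFER_ARB, bytes, NULL, GL_STREAM_DRAW_ARB);
   gl->BufferSubData (GL_ARRAY_BUFFER_ARB, 0, bytes, verts);
}

/* Recording stubs (assumption: test scaffolding only, not part of any driver). */
static int stub_order[2], stub_count;      /* 1 = BufferData, 2 = BufferSubData */
static const void *stub_bufferdata_ptr, *stub_subdata_ptr;
static unsigned stub_usage;

static void Stub_BufferData (unsigned target, ptrdiff_t size, const void *data, unsigned usage)
{
   (void) target; (void) size;
   if (stub_count < 2) stub_order[stub_count++] = 1;
   stub_bufferdata_ptr = data;
   stub_usage = usage;
}

static void Stub_BufferSubData (unsigned target, ptrdiff_t offset, ptrdiff_t size, const void *data)
{
   (void) target; (void) offset; (void) size;
   if (stub_count < 2) stub_order[stub_count++] = 2;
   stub_subdata_ptr = data;
}
```

Compared with the code in the first post, the vertex data is only handed to the driver once per frame instead of twice.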

This kind of VBO usage pattern ain't great though. What you want to do is two passes through your data set; the first pass just fills up the buffer and records positions at which you need to change state and issue draw calls, the second pass is over your recorded positions and issues the actual draw calls. Otherwise you're doing lots and lots of tiny little updates per frame which is really going to hurt performance.
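
A rough sketch of that two-pass idea, under some loud assumptions: `vert_t`, `surf_t`, `batch_t` and `BuildBatches` are hypothetical names, the surfaces are assumed to be triangle lists already sorted by texture, and texture is the only state change being tracked. Pass 2 is only shown as a comment because it needs a live GL context.

```c
#include <stddef.h>
#include <string.h>

#define MAX_BATCHES 64

typedef struct { float xyz[3]; float st[2]; } vert_t;                       /* hypothetical vertex layout */
typedef struct { int texture; const vert_t *verts; int numverts; } surf_t;  /* hypothetical surface */
typedef struct { int texture; int first; int count; } batch_t;              /* one recorded draw call */

/* Pass 1: copy every surface into one staging array and record the
   positions where the texture changes. Returns the number of batches;
   *total receives the number of vertices written to staging. */
static int BuildBatches (const surf_t *surfs, int numsurfs,
                         vert_t *staging, int maxverts,
                         batch_t *batches, int *total)
{
   int i, nv = 0, nb = 0;

   for (i = 0; i < numsurfs; i++)
   {
      const surf_t *s = &surfs[i];

      /* staging is full; a real engine would flush and start over here */
      if (nv + s->numverts > maxverts) break;

      /* start a new batch whenever the texture differs from the previous one */
      if (nb == 0 || batches[nb - 1].texture != s->texture)
      {
         if (nb == MAX_BATCHES) break;
         batches[nb].texture = s->texture;
         batches[nb].first = nv;
         batches[nb].count = 0;
         nb++;
      }

      memcpy (&staging[nv], s->verts, (size_t) s->numverts * sizeof (vert_t));
      batches[nb - 1].count += s->numverts;
      nv += s->numverts;
   }

   *total = nv;
   return nb;
}

/* Pass 2 (comment only, needs a GL context):
     upload staging[0 .. total) to the VBO in a single call, then for each batch b
       glBindTexture (GL_TEXTURE_2D, batches[b].texture);
       glDrawArrays (GL_TRIANGLES, batches[b].first, batches[b].count);
*/
```

The point is that the buffer gets one big update per frame rather than one small update per surface.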

Even more ideally you'd put all of your surface data into one great big whopping static VBO and never touch it, then use shaders for animating water and sky. That will have zero overhead on data transfer from CPU -> GPU.

In practice though it doesn't really matter. For Quake there is more or less no difference between VBOs and client-side vertex arrays in terms of performance. You need to be hitting it with enough data to bring your driver to its knees before you'll start seeing performance gains, and bandwidth to the GPU needs to be an identifiable bottleneck. If either of those is false it will be a lot of work for no measurable gain.
We had the power, we had the space, we had a sense of time and place
We knew the words, we knew the score, we knew what we were fighting for
mh
 
Posts: 2287
Joined: Sat Jan 12, 2008 1:38 am

Postby r00k » Tue Aug 23, 2011 11:15 pm

Doh! ;)
Thanks for the explanation.

I guess I'll put this on the back burner for now...

Postby Spike » Tue Aug 23, 2011 11:25 pm

aye, if you're streaming data, then it's easier/faster to let the drivers do the streaming entirely themselves.

save the vbos for large, true static data (vertex programs can do the animation for you).
Spike
 
Posts: 2883
Joined: Fri Nov 05, 2004 3:12 am
Location: UK

Postby mh » Wed Aug 24, 2011 1:42 am

Spike wrote:aye, if you're streaming data, then it's easier/faster to let the drivers do the streaming entirely themselves.

save the vbos for large, true static data (vertex programs can do the animation for you).


Even then bandwidth might not be your bottleneck. VBOs aren't a magic bullet that can suddenly make everything go faster; they're a handy tool that's useful under certain circumstances, and if those circumstances don't apply then you may as well not be using them.

Postby Irritant » Wed Aug 24, 2011 4:19 pm

mh wrote:In practice though it doesn't really matter. For Quake there is more or less no difference between VBOs and client-side vertex arrays in terms of performance. You need to be hitting it with enough data to bring your driver to its knees before you'll start seeing performance gains, and bandwidth to the GPU needs to be an identifiable bottleneck. If either of those is false it will be a lot of work for no measurable gain.


Agreed. In CRX we have an option to load the entire map into a VBO, and the performance gain is really pretty negligible at best.

Where you *might* see a gain is in a case where you have static alias meshes in a map that have multi-pass shaders, so that the arrays aren't being rebuilt each frame. However, if you're using VBOs, you're probably using GLSL, so everything is done in a single pass anyway.
http://red.planetarena.org - Alien Arena and the CRX engine
Irritant
 
Posts: 250
Joined: Mon May 19, 2008 2:54 pm
Location: Maryland

Postby mh » Fri Aug 26, 2011 7:11 pm

Animating meshes in your vertex shader is the way to go. Combine that with static VBOs containing all of the xyz data (and better yet, if you have generic vertex attrib arrays you can use GL_UNSIGNED_BYTE for Quake formats; don't normalize but make sure that you pad them out to 4 bytes and set a w of 1). Bind your VBO, set a pointer to the xyz for current frame, set a pointer to the xyz for previous frame, send your blendweight as a uniform and let the GPU do all the rest of the heavy lifting. Depending how you do light you could potentially completely eliminate all CPU -> GPU data traffic. Hell, you could even encode the r_avertex_normal_dots table into a texture if you're still using that.
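
As a rough illustration of the data side of this, here is a minimal sketch. `trivertx_t` follows the Quake MDL on-disk vertex; `PackFrame`, `FrameOffset` and `LerpVertex` are hypothetical helper names, and `LerpVertex` is just a CPU reference for what the vertex shader would be computing from the two position attributes and the blendweight uniform, assuming the usual MDL scale and origin are applied after the blend.

```c
#include <stddef.h>

/* Quake MDL vertex as stored on disk: three position bytes plus a normal index. */
typedef struct { unsigned char v[3]; unsigned char lightnormalindex; } trivertx_t;

/* Pack one frame into 4-byte attributes (x, y, z, w = 1), unnormalized.
   The normal index is dropped here; a real loader would keep it in a
   separate array (assumption). */
static void PackFrame (const trivertx_t *in, int numverts, unsigned char *out)
{
   int i;

   for (i = 0; i < numverts; i++, out += 4)
   {
      out[0] = in[i].v[0];
      out[1] = in[i].v[1];
      out[2] = in[i].v[2];
      out[3] = 1;
   }
}

/* Byte offset of a frame inside the static VBO. With the VBO bound this is
   what gets passed as the "pointer" for the current or previous frame. */
static size_t FrameOffset (int frame, int numverts)
{
   return (size_t) frame * (size_t) numverts * 4;
}

/* CPU reference for the per-vertex shader work: blend the two byte
   positions by the blendweight, then apply the MDL scale and origin. */
static void LerpVertex (const unsigned char *prev, const unsigned char *curr, float blend,
                        const float scale[3], const float origin[3], float out[3])
{
   int i;

   for (i = 0; i < 3; i++)
   {
      float p = (float) prev[i] + ((float) curr[i] - (float) prev[i]) * blend;
      out[i] = p * scale[i] + origin[i];
   }
}
```

Since all frames sit in the VBO up front, the only per-frame traffic left is two offsets and one float.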

Re:

Postby Irritant » Fri Nov 11, 2011 3:08 pm

mh wrote:Animating meshes in your vertex shader is the way to go. Combine that with static VBOs containing all of the xyz data (and better yet, if you have generic vertex attrib arrays you can use GL_UNSIGNED_BYTE for Quake formats; don't normalize but make sure that you pad them out to 4 bytes and set a w of 1). Bind your VBO, set a pointer to the xyz for current frame, set a pointer to the xyz for previous frame, send your blendweight as a uniform and let the GPU do all the rest of the heavy lifting. Depending how you do light you could potentially completely eliminate all CPU -> GPU data traffic. Hell, you could even encode the r_avertex_normal_dots table into a texture if you're still using that.


While this seems a bit daunting and I'm really unsure how to go about it with our animation process (I'm not convinced that we can move our calculations into GLSL and still have compatibility with all hardware - yeah ATI I am talking to you!), we did manage to get all of our static meshes rendered with VBO. The performance increase was significant.

Re: Re:

Postby mh » Sun Nov 13, 2011 12:24 am

Irritant wrote:
mh wrote:Animating meshes in your vertex shader is the way to go. Combine that with static VBOs containing all of the xyz data (and better yet, if you have generic vertex attrib arrays you can use GL_UNSIGNED_BYTE for Quake formats; don't normalize but make sure that you pad them out to 4 bytes and set a w of 1). Bind your VBO, set a pointer to the xyz for current frame, set a pointer to the xyz for previous frame, send your blendweight as a uniform and let the GPU do all the rest of the heavy lifting. Depending how you do light you could potentially completely eliminate all CPU -> GPU data traffic. Hell, you could even encode the r_avertex_normal_dots table into a texture if you're still using that.


While this seems a bit daunting and I'm really unsure how to go about it with our animation process (I'm not convinced that we can move our calculations into GLSL and still have compatibility with all hardware - yeah ATI I am talking to you!), we did manage to get all of our static meshes rendered with VBO. The performance increase was significant.


I got IQM animations working on the GPU using ARB ASM - including on ATI and Intel! :D

It was quite easy. I did the matrix calcs on the CPU then added 4 extra "texcoords" to the vertex specification. I also added some caching so if the frames haven't changed since the last time the model was drawn I just reuse the previous calcs. That worked out very well.

Re: VBO performance :(

Postby revelator » Sun Nov 13, 2011 3:38 pm

Nice one :) I'd like a look at the code unless it's (proprietary) ;)
Productivity is a state of mind.
revelator
 
Posts: 2533
Joined: Thu Jan 24, 2008 12:04 pm
Location: inside tha debugger

Re: VBO performance :(

Postby mh » Mon Nov 14, 2011 8:49 pm

reckless wrote:Nice one :) I'd like a look at the code unless it's (proprietary) ;)


Don't say you weren't warned...! (OpenGL can get really ugly sometimes)

Code:
!!ARBvp1.0
TEMP transnorm, dot, dotlow, dothigh;
TEMP transvert;
DPH transvert.x, vertex.position, vertex.texcoord[1];
DPH transvert.y, vertex.position, vertex.texcoord[2];
DPH transvert.z, vertex.position, vertex.texcoord[3];
MOV transvert.w, vertex.position.w;
TEMP outtmp;
DP4 outtmp.x, state.matrix.mvp.row[0], transvert;
DP4 outtmp.y, state.matrix.mvp.row[1], transvert;
DP4 outtmp.z, state.matrix.mvp.row[2], transvert;
DP4 outtmp.w, state.matrix.mvp.row[3], transvert;
MOV result.position, outtmp;
MOV result.fogcoord, outtmp.z;
DP3 transnorm.x, vertex.normal, vertex.texcoord[1];
DP3 transnorm.y, vertex.normal, vertex.texcoord[2];
DP3 transnorm.z, vertex.normal, vertex.texcoord[3];
MUL result.texcoord[0], vertex.texcoord[0], program.local[0];
MUL result.texcoord[1], vertex.texcoord[0], program.local[1];
DP3 dot, transnorm, program.env[0];
ADD dothigh, dot, 1.0;
MAD dotlow, dot, 0.2954545, 1.0;
MAX result.texcoord[2], dotlow, dothigh;
END


Code:
glClientActiveTexture (GL_TEXTURE1);
glEnableClientState (GL_TEXTURE_COORD_ARRAY);
glTexCoordPointer (4, GL_FLOAT, sizeof (matrix3x4_t), ((matrix3x4_t *) scratchbuf)->a);

glClientActiveTexture (GL_TEXTURE2);
glEnableClientState (GL_TEXTURE_COORD_ARRAY);
glTexCoordPointer (4, GL_FLOAT, sizeof (matrix3x4_t), ((matrix3x4_t *) scratchbuf)->b);

glClientActiveTexture (GL_TEXTURE3);
glEnableClientState (GL_TEXTURE_COORD_ARRAY);
glTexCoordPointer (4, GL_FLOAT, sizeof (matrix3x4_t), ((matrix3x4_t *) scratchbuf)->c);
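
For anyone reading along: each of the three texcoord arrays carries one row (a, b, c) of the per-vertex blended 3x4 matrix, with a stride of one whole `matrix3x4_t`, and the three `DPH` instructions in the vertex program apply those rows to the position. A CPU reference for that step might look like the sketch below. The `matrix3x4_t` layout is inferred from the posted code, and `DPH` / `TransformVertex` are hypothetical helper names, not part of the engine.

```c
/* Layout inferred from the posted code: three rows of four floats. */
typedef struct { float a[4], b[4], c[4]; } matrix3x4_t;

/* ARB vertex program DPH: dot (v.xyz, row.xyz) + row.w,
   i.e. the position is treated as (x, y, z, 1). */
static float DPH (const float v[3], const float row[4])
{
   return v[0] * row[0] + v[1] * row[1] + v[2] * row[2] + row[3];
}

/* CPU equivalent of the three DPH instructions that build transvert.xyz. */
static void TransformVertex (const matrix3x4_t *m, const float in[3], float out[3])
{
   out[0] = DPH (in, m->a);   /* row from texcoord[1] */
   out[1] = DPH (in, m->b);   /* row from texcoord[2] */
   out[2] = DPH (in, m->c);   /* row from texcoord[3] */
}
```

So the three pointers are three matrix rows per vertex, one for each output component.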


Code:
// valiant attempt at clawing back some CPU from IQM animation #1
void Matrix3x4_ScaleAdd (matrix3x4_t *out, matrix3x4_t *base, float scale, matrix3x4_t *add)
{
   out->a[0] = base->a[0] * scale + add->a[0];
   out->a[1] = base->a[1] * scale + add->a[1];
   out->a[2] = base->a[2] * scale + add->a[2];
   out->a[3] = base->a[3] * scale + add->a[3];

   out->b[0] = base->b[0] * scale + add->b[0];
   out->b[1] = base->b[1] * scale + add->b[1];
   out->b[2] = base->b[2] * scale + add->b[2];
   out->b[3] = base->b[3] * scale + add->b[3];

   out->c[0] = base->c[0] * scale + add->c[0];
   out->c[1] = base->c[1] * scale + add->c[1];
   out->c[2] = base->c[2] * scale + add->c[2];
   out->c[3] = base->c[3] * scale + add->c[3];
}


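// valiant attempt at clawing back some CPU from IQM animation #2
// linear interpolation from base1 to base2 by scale2; scale1 is unused and assumed to be (1 - scale2)
void Matrix3x4_ScaleScaleAdd (matrix3x4_t *out, matrix3x4_t *base1, float scale1, matrix3x4_t *base2, float scale2)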
{
   out->a[0] = ((base2->a[0] - base1->a[0]) * scale2) + base1->a[0];
   out->a[1] = ((base2->a[1] - base1->a[1]) * scale2) + base1->a[1];
   out->a[2] = ((base2->a[2] - base1->a[2]) * scale2) + base1->a[2];
   out->a[3] = ((base2->a[3] - base1->a[3]) * scale2) + base1->a[3];

   out->b[0] = ((base2->b[0] - base1->b[0]) * scale2) + base1->b[0];
   out->b[1] = ((base2->b[1] - base1->b[1]) * scale2) + base1->b[1];
   out->b[2] = ((base2->b[2] - base1->b[2]) * scale2) + base1->b[2];
   out->b[3] = ((base2->b[3] - base1->b[3]) * scale2) + base1->b[3];

   out->c[0] = ((base2->c[0] - base1->c[0]) * scale2) + base1->c[0];
   out->c[1] = ((base2->c[1] - base1->c[1]) * scale2) + base1->c[1];
   out->c[2] = ((base2->c[2] - base1->c[2]) * scale2) + base1->c[2];
   out->c[3] = ((base2->c[3] - base1->c[3]) * scale2) + base1->c[3];
}


// valiant attempt at clawing back some CPU from IQM animation #3
void Matrix3x4_MultiplyByRef (matrix3x4_t *out, matrix3x4_t *mat1, matrix3x4_t *mat2)
{
   // this was slower than it should be; it's already slow, let's not make it worse
   out->a[0] = (mat2->a[0] * mat1->a[0]) + (mat2->b[0] * mat1->a[1]) + (mat2->c[0] * mat1->a[2]);
   out->a[1] = (mat2->a[1] * mat1->a[0]) + (mat2->b[1] * mat1->a[1]) + (mat2->c[1] * mat1->a[2]);
   out->a[2] = (mat2->a[2] * mat1->a[0]) + (mat2->b[2] * mat1->a[1]) + (mat2->c[2] * mat1->a[2]);
   out->a[3] = (mat2->a[3] * mat1->a[0]) + (mat2->b[3] * mat1->a[1]) + (mat2->c[3] * mat1->a[2]) + mat1->a[3];

   out->b[0] = (mat2->a[0] * mat1->b[0]) + (mat2->b[0] * mat1->b[1]) + (mat2->c[0] * mat1->b[2]);
   out->b[1] = (mat2->a[1] * mat1->b[0]) + (mat2->b[1] * mat1->b[1]) + (mat2->c[1] * mat1->b[2]);
   out->b[2] = (mat2->a[2] * mat1->b[0]) + (mat2->b[2] * mat1->b[1]) + (mat2->c[2] * mat1->b[2]);
   out->b[3] = (mat2->a[3] * mat1->b[0]) + (mat2->b[3] * mat1->b[1]) + (mat2->c[3] * mat1->b[2]) + mat1->b[3];

   out->c[0] = (mat2->a[0] * mat1->c[0]) + (mat2->b[0] * mat1->c[1]) + (mat2->c[0] * mat1->c[2]);
   out->c[1] = (mat2->a[1] * mat1->c[0]) + (mat2->b[1] * mat1->c[1]) + (mat2->c[1] * mat1->c[2]);
   out->c[2] = (mat2->a[2] * mat1->c[0]) + (mat2->b[2] * mat1->c[1]) + (mat2->c[2] * mat1->c[2]);
   out->c[3] = (mat2->a[3] * mat1->c[0]) + (mat2->b[3] * mat1->c[1]) + (mat2->c[3] * mat1->c[2]) + mat1->c[3];
}


void GL_AnimateIQMFrame (iqmdata_t *iqm, int currframe, int lastframe, float lerp)
{
   int i;

   int frame1 = lastframe;
   int frame2 = currframe;

   matrix3x4_t *destmatrix = (matrix3x4_t *) scratchbuf;

   // frame sanity - prevent from going out of bounds
   frame1 %= iqm->num_poses;
   frame2 %= iqm->num_poses;

   // test for flipping the frames
   if (frame1 == iqm->cachedcurrframe && frame2 == iqm->cachedlastframe)
   {
      int temp = frame1;

      frame1 = frame2;
      frame2 = temp;
      lerp = 1.0f - lerp;
   }

   // coarsen the lerp so that we don't need to recache too often
   // this animates at 50 FPS which should be enough for anyone
   lerp = (float) ((int) (lerp * 5)) / 5;

   // if the cached frames and lerp don't change there is no need to re-animate
   // these are all that it's safe to cache as the transformed bones are stored in the scratchbuf
   // we can only cache per model too as the data set is just too big to cache per entity
   if (frame1 != iqm->cachedlastframe || frame2 != iqm->cachedcurrframe || lerp != iqm->cachedlerp)
   {
      matrix3x4_t *mat1 = &iqm->frames[frame1 * iqm->num_joints];
      matrix3x4_t *mat2 = &iqm->frames[frame2 * iqm->num_joints];
      matrix3x4_t mat;

      if (iqm->version == IQM_VERSION1)
      {
         for (i = 0; i < iqm->num_joints; i++)
         {
            Matrix3x4_ScaleScaleAdd (&mat, &mat1[i], (1 - lerp), &mat2[i], lerp);

            if (iqm->jointsv1[i].parent >= 0)
               Matrix3x4_MultiplyByRef (&iqm->outframe[i], &iqm->outframe[iqm->jointsv1[i].parent], &mat);
            else Matrix3x4_Copy (&iqm->outframe[i], &mat);
         }
      }
      else
      {
         for (i = 0; i < iqm->num_joints; i++)
         {
            Matrix3x4_ScaleScaleAdd (&mat, &mat1[i], (1 - lerp), &mat2[i], lerp);

            if (iqm->jointsv2[i].parent >= 0)
               Matrix3x4_MultiplyByRef (&iqm->outframe[i], &iqm->outframe[iqm->jointsv2[i].parent], &mat);
            else Matrix3x4_Copy (&iqm->outframe[i], &mat);
         }
      }
   }

   // cache back the frames
   iqm->cachedlastframe = frame1;
   iqm->cachedcurrframe = frame2;
   iqm->cachedlerp = lerp;

   // The actual vertex generation based on the matrixes follows...
   {
      // blendweights were converted to float on load; this consumes maybe an extra 20k memory
      // for a big model but runs a coupla percent faster - that's the tradeoff
      const unsigned char *index = iqm->blendindexes;
      const float *weight = iqm->blendweights;

      for (i = 0; i < iqm->numvertexes; i++, index += 4, weight += 4)
      {
         Matrix3x4_Scale (&destmatrix[i], &iqm->outframe[index[0]], weight[0]);

         // yet another valiant attempt at clawing back some CPU from IQM animation
         if (weight[1])
         {
            Matrix3x4_ScaleAdd (&destmatrix[i], &iqm->outframe[index[1]], weight[1], &destmatrix[i]);

            if (weight[2])
            {
               Matrix3x4_ScaleAdd (&destmatrix[i], &iqm->outframe[index[2]], weight[2], &destmatrix[i]);

               if (weight[3])
               {
                  Matrix3x4_ScaleAdd (&destmatrix[i], &iqm->outframe[index[3]], weight[3], &destmatrix[i]);
               }
            }
         }
      }
   }
}
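
The lerp coarsening and the cache test in `GL_AnimateIQMFrame` can be looked at in isolation. The sketch below pulls them out into two small helpers; `QuantizeLerp`, `posecache_t` and `PoseChanged` are hypothetical names used only for illustration, and the "50 FPS" figure in the original comment presumably assumes Quake's usual 10 Hz animation frames times 5 lerp steps.

```c
/* Snap lerp down to one of `steps` evenly spaced values in [0, 1),
   so that nearby lerp values compare equal and hit the cache. */
static float QuantizeLerp (float lerp, int steps)
{
   return (float) ((int) (lerp * steps)) / (float) steps;
}

/* Hypothetical stand-in for the cached fields kept on the model. */
typedef struct { int lastframe, currframe; float lerp; } posecache_t;

/* Returns 1 if the bones need to be recomputed, 0 if the cached pose
   can be reused; the cache is updated either way. */
static int PoseChanged (posecache_t *c, int frame1, int frame2, float lerp)
{
   int changed = (frame1 != c->lastframe || frame2 != c->currframe || lerp != c->lerp);

   c->lastframe = frame1;
   c->currframe = frame2;
   c->lerp = lerp;

   return changed;
}
```

With 5 steps, lerp values of 0.41 and 0.59 both snap to the same step, so a second entity using the same model a little later in the same frame pair reuses the cached bones instead of re-animating.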

Re: VBO performance :(

Postby revelator » Tue Nov 15, 2011 8:06 pm

Wouldn't exactly call it ugly :) but that's a damn load of matrix operations in that code.
3 texcoord pointers??? So it's doing the z axis too, or is there something I'm not catching?

