drawspans optimizations in c?
Moderator: InsideQC Admins
20 posts
• Page 1 of 2 • 1, 2
drawspans optimizations in c?
Just wondered if anyone has tried improving performance of drawspans in c code (D_DrawSpans8). I'm going to try D_DrawSpans16, looks easy to do.
Profiling WinQuake w/ gfprof after compling with mingw, d_drawspans comes in at about 40% of CPU time. This is using the c version, not asm. Looking at several source bases (bjp's WinQuake, ToChris,.... ) the function is generally untouched.
Profiling WinQuake w/ gfprof after compling with mingw, d_drawspans comes in at about 40% of CPU time. This is using the c version, not asm. Looking at several source bases (bjp's WinQuake, ToChris,.... ) the function is generally untouched.
-
qbism - Posts: 1236
- Joined: Thu Nov 04, 2004 5:51 am
This gets maybe 5% extra from D_DrawSpans8:
Replace the innermost do...while with it.
Generally with software Quake I find that array indexing is faster than "*variable++" so if you can unroll your loops this way it can go much faster than ID's code - plenty of scope for that all over the place. You could probably precalculate a lot of the pbase indexing here too.
(Edited: indexes should start at -8, not -7).
- Code: Select all
pdest += spancount;
switch (spancount)
{
case 8: pdest[-8] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 7: pdest[-7] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 6: pdest[-6] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 5: pdest[-5] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 4: pdest[-4] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 3: pdest[-3] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 2: pdest[-2] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 1: pdest[-1] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
}
Replace the innermost do...while with it.
Generally with software Quake I find that array indexing is faster than "*variable++" so if you can unroll your loops this way it can go much faster than ID's code - plenty of scope for that all over the place. You could probably precalculate a lot of the pbase indexing here too.
(Edited: indexes should start at -8, not -7).
We had the power, we had the space, we had a sense of time and place
We knew the words, we knew the score, we knew what we were fighting for
We knew the words, we knew the score, we knew what we were fighting for
-

mh - Posts: 2292
- Joined: Sat Jan 12, 2008 1:38 am
DrawSpan16 was only in asm and only for i386. Here's c version, seems to work
- Code: Select all
/*
=============
D_DrawSpans16
=============
*/
void D_DrawSpans16 (espan_t *pspan) //qbism- up it from 8 to 16
{
int count, spancount;
unsigned char *pbase, *pdest;
fixed16_t s, t, snext, tnext, sstep, tstep;
float sdivz, tdivz, zi, z, du, dv, spancountminus1;
float sdivzstepu, tdivzstepu, zistepu;
sstep = 0; // keep compiler happy
tstep = 0; // ditto
pbase = (unsigned char *)cacheblock;
sdivzstepu = d_sdivzstepu * 16;
tdivzstepu = d_tdivzstepu * 16;
zistepu = d_zistepu * 16;
do
{
pdest = (unsigned char *)((byte *)d_viewbuffer +
(screenwidth * pspan->v) + pspan->u);
count = pspan->count;
// calculate the initial s/z, t/z, 1/z, s, and t and clamp
du = (float)pspan->u;
dv = (float)pspan->v;
sdivz = d_sdivzorigin + dv*d_sdivzstepv + du*d_sdivzstepu;
tdivz = d_tdivzorigin + dv*d_tdivzstepv + du*d_tdivzstepu;
zi = d_ziorigin + dv*d_zistepv + du*d_zistepu;
z = (float)0x10000 / zi; // prescale to 16.16 fixed-point
s = (int)(sdivz * z) + sadjust;
if (s > bbextents)
s = bbextents;
else if (s < 0)
s = 0;
t = (int)(tdivz * z) + tadjust;
if (t > bbextentt)
t = bbextentt;
else if (t < 0)
t = 0;
do
{
// calculate s and t at the far end of the span
if (count >= 16)
spancount = 16;
else
spancount = count;
count -= spancount;
if (count)
{
// calculate s/z, t/z, zi->fixed s and t at far end of span,
// calculate s and t steps across span by shifting
sdivz += sdivzstepu;
tdivz += tdivzstepu;
zi += zistepu;
z = (float)0x10000 / zi; // prescale to 16.16 fixed-point
snext = (int)(sdivz * z) + sadjust;
if (snext > bbextents)
snext = bbextents;
else if (snext <= 16)
snext = 16; // prevent round-off error on <0 steps from
// from causing overstepping & running off the
// edge of the texture
tnext = (int)(tdivz * z) + tadjust;
if (tnext > bbextentt)
tnext = bbextentt;
else if (tnext < 16)
tnext = 16; // guard against round-off error on <0 steps
sstep = (snext - s) >> 4;
tstep = (tnext - t) >> 4;
}
else
{
// calculate s/z, t/z, zi->fixed s and t at last pixel in span (so
// can't step off polygon), clamp, calculate s and t steps across
// span by division, biasing steps low so we don't run off the
// texture
spancountminus1 = (float)(spancount - 1);
sdivz += d_sdivzstepu * spancountminus1;
tdivz += d_tdivzstepu * spancountminus1;
zi += d_zistepu * spancountminus1;
z = (float)0x10000 / zi; // prescale to 16.16 fixed-point
snext = (int)(sdivz * z) + sadjust;
if (snext > bbextents)
snext = bbextents;
else if (snext < 16)
snext = 16; // prevent round-off error on <0 steps from
// from causing overstepping & running off the
// edge of the texture
tnext = (int)(tdivz * z) + tadjust;
if (tnext > bbextentt)
tnext = bbextentt;
else if (tnext < 16)
tnext = 16; // guard against round-off error on <0 steps
if (spancount > 1)
{
sstep = (snext - s) / (spancount - 1);
tstep = (tnext - t) / (spancount - 1);
}
}
do
{
*pdest++ = *(pbase + (s >> 16) + (t >> 16) * cachewidth);
s += sstep;
t += tstep;
} while (--spancount > 0);
s = snext;
t = tnext;
} while (count > 0);
} while ((pspan = pspan->pnext) != NULL);
}
-
qbism - Posts: 1236
- Joined: Thu Nov 04, 2004 5:51 am
Spike wrote:yay duff's device!
The old ones never die.
I'm especially proud of the negative array indexes.
Edit: goes maybe 10% faster with 16.
We had the power, we had the space, we had a sense of time and place
We knew the words, we knew the score, we knew what we were fighting for
We knew the words, we knew the score, we knew what we were fighting for
-

mh - Posts: 2292
- Joined: Sat Jan 12, 2008 1:38 am
Negative indices? Duff? Google...
I didn't grasp that it was "climbing down" until after commenting out //s += sstep; t += tstep;
Performance: In a roughly ideal test*, FPS increased from 106.5 to 112 going from span8 to span16. FPS increased again to 118 after unrolling span16. So that unroll is netting 5%+ in the test case. Combined effect is about 10%
*Facing the blank wall at DM6 dead-end. In contrast, facing complex geometry dilutes the FPS gain from this optimization. Server was throttled at 72 fps, and there wasn't much for the server to do.
I didn't grasp that it was "climbing down" until after commenting out //s += sstep; t += tstep;
Performance: In a roughly ideal test*, FPS increased from 106.5 to 112 going from span8 to span16. FPS increased again to 118 after unrolling span16. So that unroll is netting 5%+ in the test case. Combined effect is about 10%
*Facing the blank wall at DM6 dead-end. In contrast, facing complex geometry dilutes the FPS gain from this optimization. Server was throttled at 72 fps, and there wasn't much for the server to do.
-
qbism - Posts: 1236
- Joined: Thu Nov 04, 2004 5:51 am
486 benchmarks!
DISCLAIMER: i'm not using the cleanest quake source in the world.
timedemo demo1, 320x200, no sound, pure DOS
id386 asm implementation:
span8: 16.0FPS
span16: 17.0FPS
C implementation:
span8: 11.5fps
span16 qbism: 12.4FPS
span16 qbism with mh's unrolls: 12.8FPS
Totally off-topic desperation:
span16 qbism with mh's unrolls and d_mipcap 4: 14.6FPS
span16 qbism with mh's unrolls and d_mipcap 4 with that awesome r_lowdetail feature in my awesome engine: 21.4FPS
I made mh's span stuff a seperate function turned on by the 2 value for comparison.
also off-topic - i need to try hexen2's span asm code to see if rjmonroe's optimisations are bull (EDIT: all that trouble to inline some NASM leads to no speed gain!)
more offtopic - I tried making a c drawspans32 and 64 out of it and it turned out crappy looking (to do with the cacheblock i think) and not even much faster
DISCLAIMER: i'm not using the cleanest quake source in the world.
timedemo demo1, 320x200, no sound, pure DOS
id386 asm implementation:
span8: 16.0FPS
span16: 17.0FPS
C implementation:
span8: 11.5fps
span16 qbism: 12.4FPS
span16 qbism with mh's unrolls: 12.8FPS
Totally off-topic desperation:
span16 qbism with mh's unrolls and d_mipcap 4: 14.6FPS
span16 qbism with mh's unrolls and d_mipcap 4 with that awesome r_lowdetail feature in my awesome engine: 21.4FPS
I made mh's span stuff a seperate function turned on by the 2 value for comparison.
also off-topic - i need to try hexen2's span asm code to see if rjmonroe's optimisations are bull (EDIT: all that trouble to inline some NASM leads to no speed gain!)
more offtopic - I tried making a c drawspans32 and 64 out of it and it turned out crappy looking (to do with the cacheblock i think) and not even much faster
i should not be here
- leileilol
- Posts: 2783
- Joined: Fri Oct 15, 2004 3:23 am
I think it's just the 486; I've run Quake on one of those before and it's a miracle to be getting FPS above single digits.
My benchmarks came from a timedemo demo1 by the way.
It's not really Duff's either as it doesn't have a do...while embedded in the switch, but then again since we already know there's an absolute maximum on spancount it doesn't need to.
My benchmarks came from a timedemo demo1 by the way.
It's not really Duff's either as it doesn't have a do...while embedded in the switch, but then again since we already know there's an absolute maximum on spancount it doesn't need to.
We had the power, we had the space, we had a sense of time and place
We knew the words, we knew the score, we knew what we were fighting for
We knew the words, we knew the score, we knew what we were fighting for
-

mh - Posts: 2292
- Joined: Sat Jan 12, 2008 1:38 am
leileilol wrote:also off-topic - i need to try hexen2's span asm code to see if rjmonroe's optimisations are bull (EDIT: all that trouble to inline some NASM leads to no speed gain!)
Inline asm into tight C loops are unlikely to bring any real gain in performance due the intensive stack handling overhead.
I know FrikaC made a cgi-bin version of the quakec interpreter once and wrote part of his website in QuakeC
(LordHavoc)
-

frag.machine - Posts: 2090
- Joined: Sat Nov 25, 2006 1:49 pm
Just an idea occured me. I wonder if we could squeeze any performance by skipping the right shifts with the use of some sort of union:
mh code:
my idea:
Can't check it right now, but seems feasible to me.
EDIT: of course, I am assuming big endians here, but the right thing to do would be to check for it out of the loop and use the correct index value for the high short in the union.
mh code:
- Code: Select all
// typedef int fixed16_t;
fixed16_t s, t, snext, tnext, sstep, tstep;
(....)
switch (spancount)
{
case 16: pdest[-16] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
my idea:
- Code: Select all
// union fixed32 {
// int i;
// short s[2];
// }fixed32_t;
fixed32_t s, t, snext, tnext, sstep, tstep;
(....)
switch (spancount)
{
case 16: pdest[-16] = pbase[(s.s[1]) + (t.t[1]) * cachewidth]; s.i += sstep.i; t.i += tstep.i;
Can't check it right now, but seems feasible to me.
EDIT: of course, I am assuming big endians here, but the right thing to do would be to check for it out of the loop and use the correct index value for the high short in the union.
I know FrikaC made a cgi-bin version of the quakec interpreter once and wrote part of his website in QuakeC
(LordHavoc)
-

frag.machine - Posts: 2090
- Joined: Sat Nov 25, 2006 1:49 pm
There's some floating point division further up in the loop that I'd like to have a pop at doing something about too. I suspect that might be causing more performance issues.
We had the power, we had the space, we had a sense of time and place
We knew the words, we knew the score, we knew what we were fighting for
We knew the words, we knew the score, we knew what we were fighting for
-

mh - Posts: 2292
- Joined: Sat Jan 12, 2008 1:38 am
20 posts
• Page 1 of 2 • 1, 2
Who is online
Users browsing this forum: No registered users and 1 guest