R_DrawSurfaceBlock8_mip0 cleaned up and unrolled
Moderator: InsideQC Admins
15 posts
• Page 1 of 1
R_DrawSurfaceBlock8_mip0 cleaned up and unrolled
- Code: Select all
static byte *vidcolmap;
...
vidcolmap = (byte *)vid.colormap;
...
void R_DrawSurfaceBlock8_mip0 (void)
{
int v, i, lightstep, light;
byte *psource, *prowdest;
psource = pbasesource;
prowdest = prowdestbase;
for (v=0 ; v<r_numvblocks ; v++)
{
lightleft = r_lightptr[0];
lightright = r_lightptr[1];
r_lightptr += r_lightwidth;
lightleftstep = (r_lightptr[0] - lightleft) >> 4;
lightrightstep = (r_lightptr[1] - lightright) >> 4;
for (i=0 ; i<16 ; i++)
{
lightstep = (lightleft - lightright) >> 4;
light = lightright;
prowdest[15] = vidcolmap[((light += lightstep) & 0xFF00) + psource[15]];
prowdest[14] = vidcolmap[((light += lightstep) & 0xFF00) + psource[14]];
prowdest[13] = vidcolmap[((light += lightstep) & 0xFF00) + psource[13]];
prowdest[12] = vidcolmap[((light += lightstep) & 0xFF00) + psource[12]];
prowdest[11] = vidcolmap[((light += lightstep) & 0xFF00) + psource[11]];
prowdest[10] = vidcolmap[((light += lightstep) & 0xFF00) + psource[10]];
prowdest[9] = vidcolmap[((light += lightstep) & 0xFF00) + psource[9]];
prowdest[8] = vidcolmap[((light += lightstep) & 0xFF00) + psource[8]];
prowdest[7] = vidcolmap[((light += lightstep) & 0xFF00) + psource[7]];
prowdest[6] = vidcolmap[((light += lightstep) & 0xFF00) + psource[6]];
prowdest[5] = vidcolmap[((light += lightstep) & 0xFF00) + psource[5]];
prowdest[4] = vidcolmap[((light += lightstep) & 0xFF00) + psource[4]];
prowdest[3] = vidcolmap[((light += lightstep) & 0xFF00) + psource[3]];
prowdest[2] = vidcolmap[((light += lightstep) & 0xFF00) + psource[2]];
prowdest[1] = vidcolmap[((light += lightstep) & 0xFF00) + psource[1]];
prowdest[0] = vidcolmap[(light & 0xFF00) + psource[0]];
psource += sourcetstep;
lightright += lightrightstep;
lightleft += lightleftstep;
prowdest += surfrowbytes;
}
if (psource >= r_sourcemax)
psource -= r_stepback;
}
}
[edit2: declare vidcolmap 'static' ]
Last edited by qbism on Fri Feb 15, 2013 5:32 pm, edited 4 times in total.
-
qbism - Posts: 1236
- Joined: Thu Nov 04, 2004 5:51 am
Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled
Just a comment on unrolling: about twelve years ago, I got into an argument about loop unrolling. I insisted a certain loop should be hand unrolled, he insisted it would be faster to let gcc do it (using dead simple code like "*dst++ = *src++;"). Well, there was only one thing to do: try both and see what happened. I "lost". The version with stupid code and gcc was allowed to do the unrolling was noticeably faster. When I investigated, it turned out gcc was unrolling both versions, but the hand-unrolled version wound up with N increments (N = number of times gcc unrolled the loop) vs the one increment. Also, there was N times as much code (nasty for cache), and I seem to remember there being some nasty register pressure.
In short, you are generally far better off using a decent compiler (remember, that was gcc as of 12 years ago), writing clean, simple code, and letting the compiler choose the optimal code, especially if you can tune the compiler to your intended target CPU. If the compiler you're using can't do a decent job of loop unrolling, burn it and get a new one.
That said, other than loop unrolling, your code does look clean
.
In short, you are generally far better off using a decent compiler (remember, that was gcc as of 12 years ago), writing clean, simple code, and letting the compiler choose the optimal code, especially if you can tune the compiler to your intended target CPU. If the compiler you're using can't do a decent job of loop unrolling, burn it and get a new one.
That said, other than loop unrolling, your code does look clean
Leave others their otherness.
http://quakeforge.net/
http://quakeforge.net/
- taniwha
- Posts: 399
- Joined: Thu Jan 14, 2010 7:11 am
Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled
A good point, and I unrolled this without a clue regarding performance. But it helped me to visualize. Next I may post the rolled-back-up version c/ assembly. Unrolled gcc assembly:
- Code: Select all
_R_DrawSurfaceBlock8_mip0:
pushl %ebp
xorl %ebp, %ebp
pushl %edi
pushl %esi
pushl %ebx
subl $8, %esp
movl _r_numvblocks, %eax
movl _pbasesource, %edx
movl _prowdestbase, %ebx
movl %ebp, 4(%esp)
testl %eax, %eax
jle L38
.p2align 4,,15
L44:
movl _r_lightptr, %ecx
movl _r_lightwidth, %edi
movl $16, (%esp)
movl (%ecx), %eax
movl 4(%ecx), %esi
leal (%ecx,%edi,4), %edi
movl %edi, _r_lightptr
movl %eax, _lightleft
movl %esi, _lightright
movl (%edi), %ecx
subl %eax, %ecx
shrl $4, %ecx
movl %ecx, _lightleftstep
movl 4(%edi), %ecx
subl %esi, %ecx
shrl $4, %ecx
movl %ecx, _lightrightstep
.p2align 4,,15
L40:
subl %esi, %eax
movl _vidcolmap, %ecx
sarl $4, %eax
movzbl 15(%edx), %edi
addl %eax, %esi
movl %esi, %ebp
addl %eax, %esi
andl $65280, %ebp
addl %ecx, %ebp
movzbl 0(%ebp,%edi), %ecx
movl %esi, %ebp
addl %eax, %esi
andl $65280, %ebp
movb %cl, 15(%ebx)
movl _vidcolmap, %ecx
movzbl 14(%edx), %edi
addl %ecx, %ebp
movzbl 0(%ebp,%edi), %ecx
movl %esi, %ebp
addl %eax, %esi
andl $65280, %ebp
movb %cl, 14(%ebx)
movl _vidcolmap, %ecx
movzbl 13(%edx), %edi
addl %ecx, %ebp
movzbl 0(%ebp,%edi), %ecx
movl %esi, %ebp
addl %eax, %esi
andl $65280, %ebp
movb %cl, 13(%ebx)
movl _vidcolmap, %ecx
movzbl 12(%edx), %edi
addl %ecx, %ebp
movzbl 0(%ebp,%edi), %ecx
movl %esi, %ebp
addl %eax, %esi
andl $65280, %ebp
movb %cl, 12(%ebx)
movl _vidcolmap, %ecx
movzbl 11(%edx), %edi
addl %ecx, %ebp
movzbl 0(%ebp,%edi), %ecx
movl %esi, %ebp
addl %eax, %esi
andl $65280, %ebp
movb %cl, 11(%ebx)
movl _vidcolmap, %ecx
movzbl 10(%edx), %edi
addl %ecx, %ebp
movzbl 0(%ebp,%edi), %ecx
movl %esi, %ebp
addl %eax, %esi
andl $65280, %ebp
movb %cl, 10(%ebx)
movl _vidcolmap, %ecx
movzbl 9(%edx), %edi
addl %ecx, %ebp
movzbl 0(%ebp,%edi), %ecx
movl %esi, %ebp
addl %eax, %esi
andl $65280, %ebp
movb %cl, 9(%ebx)
movl _vidcolmap, %ecx
movzbl 8(%edx), %edi
addl %ecx, %ebp
movzbl 0(%ebp,%edi), %ecx
movl %esi, %ebp
addl %eax, %esi
andl $65280, %ebp
movb %cl, 8(%ebx)
movl _vidcolmap, %ecx
movzbl 7(%edx), %edi
addl %ecx, %ebp
movzbl 0(%ebp,%edi), %ecx
movl %esi, %ebp
addl %eax, %esi
andl $65280, %ebp
movb %cl, 7(%ebx)
movl _vidcolmap, %ecx
movzbl 6(%edx), %edi
addl %ecx, %ebp
movzbl 0(%ebp,%edi), %ecx
movl %esi, %ebp
andl $65280, %ebp
addl %eax, %esi
movb %cl, 6(%ebx)
movl _vidcolmap, %ecx
movzbl 5(%edx), %edi
addl %ecx, %ebp
movzbl 0(%ebp,%edi), %ecx
movl %esi, %ebp
addl %eax, %esi
andl $65280, %ebp
movb %cl, 5(%ebx)
movl _vidcolmap, %ecx
movzbl 4(%edx), %edi
addl %ecx, %ebp
movzbl 0(%ebp,%edi), %ecx
movl %esi, %ebp
addl %eax, %esi
andl $65280, %ebp
movb %cl, 4(%ebx)
movl _vidcolmap, %ecx
movzbl 3(%edx), %edi
addl %ebp, %ecx
movzbl (%ecx,%edi), %ecx
movb %cl, 3(%ebx)
movl %esi, %ecx
addl %eax, %esi
movl _vidcolmap, %ebp
andl $65280, %ecx
addl %esi, %eax
movzbl 2(%edx), %edi
andl $65280, %eax
addl %ebp, %ecx
movzbl (%ecx,%edi), %ecx
movb %cl, 2(%ebx)
movl %esi, %ecx
movl _vidcolmap, %ebp
andl $65280, %ecx
movzbl 1(%edx), %edi
addl %ebp, %ecx
movzbl (%ecx,%edi), %ecx
movb %cl, 1(%ebx)
movl _vidcolmap, %esi
movzbl (%edx), %ecx
addl %esi, %eax
movzbl (%eax,%ecx), %eax
movb %al, (%ebx)
movl _lightright, %esi
movl _lightrightstep, %ebp
movl _lightleft, %eax
movl _lightleftstep, %ecx
movl _sourcetstep, %edi
addl %ebp, %esi
movl %esi, _lightright
addl %ecx, %eax
addl %edi, %edx
movl %eax, _lightleft
movl _surfrowbytes, %edi
addl %edi, %ebx
decl (%esp)
jne L40
movl _r_stepback, %edi
movl %edx, %eax
subl %edi, %eax
cmpl %edx, _r_sourcemax
cmovbe %eax, %edx
incl 4(%esp)
movl 4(%esp), %eax
cmpl %eax, _r_numvblocks
jg L44
L38:
addl $8, %esp
popl %ebx
popl %esi
popl %edi
popl %ebp
ret
-
qbism - Posts: 1236
- Joined: Thu Nov 04, 2004 5:51 am
Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled
This is clean and rolled
- Code: Select all
void R_DrawSurfaceBlock8_mip0 (void)
{
int v, i, b, lightstep, light;
byte *psource, *prowdest;
psource = pbasesource;
prowdest = prowdestbase;
for (v=0 ; v<r_numvblocks ; v++)
{
lightleft = r_lightptr[0];
lightright = r_lightptr[1];
r_lightptr += r_lightwidth;
lightleftstep = (r_lightptr[0] - lightleft) >> 4;
lightrightstep = (r_lightptr[1] - lightright) >> 4;
for (i=0 ; i<16 ; i++)
{
lightstep = (lightleft - lightright) >> 4;
light = lightright;
for (b=15; b>=0; b--)
{
prowdest[b] = vidcolmap[(light & 0xFF00) + psource[b]];
light += lightstep;
}
psource += sourcetstep;
lightright += lightrightstep;
lightleft += lightleftstep;
prowdest += surfrowbytes;
}
if (psource >= r_sourcemax)
psource -= r_stepback;
}
}
-
qbism - Posts: 1236
- Joined: Thu Nov 04, 2004 5:51 am
Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled
This is clean rolled-up assembly output, optimized. The compiler decided how to unroll the loop.
Loop comparison-
unrolled by hand:
unrolled by compiler:
So the compiler took 9 instructions to unroll vs. 8 by hand. I did not look at the code outside the loop. Boiling down the differences assuming all registers are equal speed:
By hand:
movl _vidcolmap, %ecx
Machine:
movl %ebx, (%esp)
movl (%esp), %ebx
- Code: Select all
_R_DrawSurfaceBlock8_mip0:
pushl %ebp
xorl %ebp, %ebp
pushl %edi
pushl %esi
pushl %ebx
subl $12, %esp
movl _r_numvblocks, %edx
movl _pbasesource, %eax
movl _prowdestbase, %esi
movl %ebp, 8(%esp)
testl %edx, %edx
jle L38
.p2align 4,,15
L44:
movl _r_lightptr, %ecx
movl _r_lightwidth, %edi
movl (%ecx), %edx
movl 4(%ecx), %ebx
leal (%ecx,%edi,4), %edi
movl %edi, _r_lightptr
movl %edx, _lightleft
movl %ebx, _lightright
movl (%edi), %ecx
movl %ebx, (%esp)
subl %edx, %ecx
shrl $4, %ecx
movl %ecx, _lightleftstep
movl 4(%edi), %ecx
movl $16, %edi
movl %edi, 4(%esp)
subl %ebx, %ecx
shrl $4, %ecx
movl %ecx, _lightrightstep
.p2align 4,,15
L40:
movl (%esp), %ebp
movl _vidcolmap, %ecx
movzbl 15(%eax), %edi
subl %ebp, %edx
movl (%esp), %ebp
sarl $4, %edx
andl $65280, %ebp
addl %ecx, %ebp
movzbl 0(%ebp,%edi), %ebx
movb %bl, 15(%esi)
movl (%esp), %ebx
movzbl 14(%eax), %edi
addl %edx, %ebx
movl %ebx, %ebp
andl $65280, %ebp
addl %ecx, %ebp
movl %ebx, (%esp)
movzbl 0(%ebp,%edi), %ebx
movb %bl, 14(%esi)
movl (%esp), %ebx
movzbl 13(%eax), %edi
addl %edx, %ebx
movl %ebx, %ebp
andl $65280, %ebp
addl %ecx, %ebp
movl %ebx, (%esp)
movzbl 0(%ebp,%edi), %ebx
movb %bl, 13(%esi)
movl (%esp), %ebx
movzbl 12(%eax), %edi
addl %edx, %ebx
movl %ebx, %ebp
andl $65280, %ebp
addl %ecx, %ebp
movl %ebx, (%esp)
movzbl 0(%ebp,%edi), %ebx
movb %bl, 12(%esi)
movl (%esp), %ebx
movzbl 11(%eax), %edi
addl %edx, %ebx
movl %ebx, %ebp
andl $65280, %ebp
addl %ecx, %ebp
movl %ebx, (%esp)
movzbl 0(%ebp,%edi), %ebx
movb %bl, 11(%esi)
movl (%esp), %ebx
movzbl 10(%eax), %edi
addl %edx, %ebx
movl %ebx, %ebp
andl $65280, %ebp
addl %ecx, %ebp
movl %ebx, (%esp)
movzbl 0(%ebp,%edi), %ebx
movb %bl, 10(%esi)
movl (%esp), %ebx
movzbl 9(%eax), %edi
addl %edx, %ebx
movl %ebx, %ebp
andl $65280, %ebp
addl %ecx, %ebp
movl %ebx, (%esp)
movzbl 0(%ebp,%edi), %ebx
movb %bl, 9(%esi)
movl (%esp), %ebx
movzbl 8(%eax), %edi
addl %edx, %ebx
movl %ebx, %ebp
andl $65280, %ebp
addl %ecx, %ebp
movl %ebx, (%esp)
movzbl 0(%ebp,%edi), %ebx
movb %bl, 8(%esi)
movl (%esp), %ebx
movzbl 7(%eax), %edi
addl %edx, %ebx
movl %ebx, %ebp
andl $65280, %ebp
addl %ecx, %ebp
movl %ebx, (%esp)
movzbl 0(%ebp,%edi), %ebx
movb %bl, 7(%esi)
movl (%esp), %ebx
movzbl 6(%eax), %edi
addl %edx, %ebx
movl %ebx, %ebp
andl $65280, %ebp
addl %ecx, %ebp
movl %ebx, (%esp)
movzbl 0(%ebp,%edi), %ebx
movb %bl, 6(%esi)
movl (%esp), %ebx
movzbl 5(%eax), %edi
addl %edx, %ebx
movl %ebx, %ebp
andl $65280, %ebp
addl %ecx, %ebp
movl %ebx, (%esp)
movzbl 0(%ebp,%edi), %ebx
movb %bl, 5(%esi)
movl (%esp), %ebx
movzbl 4(%eax), %edi
addl %edx, %ebx
movl %ebx, %ebp
andl $65280, %ebp
addl %ecx, %ebp
movl %ebx, (%esp)
movzbl 0(%ebp,%edi), %ebx
movb %bl, 4(%esi)
movl (%esp), %ebx
movzbl 3(%eax), %edi
addl %edx, %ebx
movl %ebx, %ebp
addl %edx, %ebx
andl $65280, %ebp
addl %ebp, %ecx
movzbl (%ecx,%edi), %ecx
movb %cl, 3(%esi)
movl %ebx, %ecx
addl %edx, %ebx
movl _vidcolmap, %ebp
andl $65280, %ecx
addl %ebx, %edx
movzbl 2(%eax), %edi
andl $65280, %edx
addl %ebp, %ecx
movzbl (%ecx,%edi), %ecx
movb %cl, 2(%esi)
movl %ebx, %ecx
movl _vidcolmap, %ebp
andl $65280, %ecx
movzbl 1(%eax), %edi
addl %ebp, %ecx
movzbl (%ecx,%edi), %ecx
movb %cl, 1(%esi)
movl _vidcolmap, %ebx
movzbl (%eax), %ecx
addl %ebx, %edx
movzbl (%edx,%ecx), %edx
movb %dl, (%esi)
movl _lightright, %edx
movl _lightrightstep, %ebx
movl _sourcetstep, %edi
movl _lightleftstep, %ebp
movl _surfrowbytes, %ecx
addl %ebx, %edx
movl %edx, (%esp)
addl %edi, %eax
movl %edx, _lightright
movl _lightleft, %edx
addl %ecx, %esi
addl %ebp, %edx
decl 4(%esp)
movl %edx, _lightleft
jne L40
movl _r_stepback, %edi
movl %eax, %edx
subl %edi, %edx
cmpl _r_sourcemax, %eax
cmovae %edx, %eax
incl 8(%esp)
movl 8(%esp), %edx
cmpl %edx, _r_numvblocks
jg L44
L38:
addl $12, %esp
popl %ebx
popl %esi
popl %edi
popl %ebp
ret
Loop comparison-
unrolled by hand:
- Code: Select all
movl _vidcolmap, %ecx
movzbl 14(%edx), %edi
addl %ecx, %ebp
movzbl 0(%ebp,%edi), %ecx
movl %esi, %ebp
addl %eax, %esi
andl $65280, %ebp
movb %cl, 14(%ebx)
unrolled by compiler:
- Code: Select all
movzbl 14(%eax), %edi
addl %edx, %ebx
movl %ebx, %ebp
andl $65280, %ebp
addl %ecx, %ebp
movl %ebx, (%esp)
movzbl 0(%ebp,%edi), %ebx
movb %bl, 14(%esi)
movl (%esp), %ebx
So the compiler took 9 instructions to unroll vs. 8 by hand. I did not look at the code outside the loop. Boiling down the differences assuming all registers are equal speed:
By hand:
movl _vidcolmap, %ecx
Machine:
movl %ebx, (%esp)
movl (%esp), %ebx
-
qbism - Posts: 1236
- Joined: Thu Nov 04, 2004 5:51 am
Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled
if vidcolmap is declared static, "movl _vidcolmap, %ecx" becomes "movl (%esp), %ebx"
- Code: Select all
movzbl 14(%eax), %edi
addl %ebx, %ebp
movzbl 0(%ebp,%edi), %ebx
movl %esi, %ebp
addl %ecx, %esi
andl $65280, %ebp
movb %bl, 14(%edx)
movl (%esp), %ebx
-
qbism - Posts: 1236
- Joined: Thu Nov 04, 2004 5:51 am
Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled
I'd say there's a "push" somewhere (or maybe "mov whatever,(%esp)"). However, if that is outside the loop, then it will be a gain as it removes (code) cache pressure from the loop.
Leave others their otherness.
http://quakeforge.net/
http://quakeforge.net/
- taniwha
- Posts: 399
- Joined: Thu Jan 14, 2010 7:11 am
Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled
Yes, "movl _vidcolmap, %ebx" is outside the loop. Here's the whole thing:
- Code: Select all
.p2align 4,,15
.globl _R_DrawSurfaceBlock8_mip0
.def _R_DrawSurfaceBlock8_mip0; .scl 2; .type 32; .endef
_R_DrawSurfaceBlock8_mip0:
movl _r_numvblocks, %ecx
movl _pbasesource, %eax
movl _prowdestbase, %edx
testl %ecx, %ecx
jle L43
pushl %ebp
xorl %ecx, %ecx
pushl %edi
pushl %esi
pushl %ebx
subl $12, %esp
movl _vidcolmap, %ebx
movl %ecx, 8(%esp)
.p2align 4,,15
L38:
movl _r_lightptr, %edi
movl _r_lightwidth, %ebp
movl %ebx, (%esp)
movl (%edi), %ecx
movl 4(%edi), %esi
leal (%edi,%ebp,4), %ebp
movl %ebp, _r_lightptr
movl %ecx, _lightleft
movl %esi, _lightright
movl 0(%ebp), %edi
subl %ecx, %edi
shrl $4, %edi
movl %edi, _lightleftstep
movl 4(%ebp), %edi
movl $16, %ebp
movl %ebp, 4(%esp)
subl %esi, %edi
shrl $4, %edi
movl %edi, _lightrightstep
.p2align 4,,15
L36:
subl %esi, %ecx
movl (%esp), %ebx
sarl $4, %ecx
movzbl 15(%eax), %edi
addl %ecx, %esi
movl %esi, %ebp
addl %ecx, %esi
andl $65280, %ebp
addl %ebx, %ebp
movzbl 0(%ebp,%edi), %ebx
movl %esi, %ebp
addl %ecx, %esi
andl $65280, %ebp
movb %bl, 15(%edx)
movl (%esp), %ebx
movzbl 14(%eax), %edi
addl %ebx, %ebp
movzbl 0(%ebp,%edi), %ebx
movl %esi, %ebp
addl %ecx, %esi
andl $65280, %ebp
movb %bl, 14(%edx)
movl (%esp), %ebx
movzbl 13(%eax), %edi
addl %ebx, %ebp
movzbl 0(%ebp,%edi), %ebx
movl %esi, %ebp
addl %ecx, %esi
andl $65280, %ebp
movb %bl, 13(%edx)
movl (%esp), %ebx
movzbl 12(%eax), %edi
addl %ebx, %ebp
movzbl 0(%ebp,%edi), %ebx
movl %esi, %ebp
addl %ecx, %esi
andl $65280, %ebp
movb %bl, 12(%edx)
movl (%esp), %ebx
movzbl 11(%eax), %edi
addl %ebx, %ebp
movzbl 0(%ebp,%edi), %ebx
movl %esi, %ebp
addl %ecx, %esi
andl $65280, %ebp
movb %bl, 11(%edx)
movl (%esp), %ebx
movzbl 10(%eax), %edi
addl %ebx, %ebp
movzbl 0(%ebp,%edi), %ebx
movl %esi, %ebp
addl %ecx, %esi
andl $65280, %ebp
movb %bl, 10(%edx)
movl (%esp), %ebx
movzbl 9(%eax), %edi
addl %ebx, %ebp
movzbl 0(%ebp,%edi), %ebx
movl %esi, %ebp
addl %ecx, %esi
andl $65280, %ebp
movb %bl, 9(%edx)
movl (%esp), %ebx
movzbl 8(%eax), %edi
addl %ebx, %ebp
movzbl 0(%ebp,%edi), %ebx
movl %esi, %ebp
addl %ecx, %esi
andl $65280, %ebp
movb %bl, 8(%edx)
movl (%esp), %ebx
movzbl 7(%eax), %edi
addl %ebx, %ebp
movzbl 0(%ebp,%edi), %ebx
movl %esi, %ebp
addl %ecx, %esi
andl $65280, %ebp
movb %bl, 7(%edx)
movl (%esp), %ebx
movzbl 6(%eax), %edi
addl %ebx, %ebp
movzbl 0(%ebp,%edi), %ebx
movl %esi, %ebp
andl $65280, %ebp
addl %ecx, %esi
movb %bl, 6(%edx)
movl (%esp), %ebx
movzbl 5(%eax), %edi
addl %ebx, %ebp
movzbl 0(%ebp,%edi), %ebx
movl %esi, %ebp
addl %ecx, %esi
andl $65280, %ebp
movb %bl, 5(%edx)
movl (%esp), %ebx
movzbl 4(%eax), %edi
addl %ebx, %ebp
movzbl 0(%ebp,%edi), %ebx
movl %esi, %ebp
addl %ecx, %esi
andl $65280, %ebp
addl %esi, %ecx
andl $65280, %ecx
movb %bl, 4(%edx)
movl (%esp), %ebx
movzbl 3(%eax), %edi
addl %ebx, %ebp
movzbl 0(%ebp,%edi), %ebx
movl %esi, %ebp
andl $65280, %ebp
movb %bl, 3(%edx)
movl (%esp), %ebx
movzbl 2(%eax), %edi
addl %ebx, %ebp
movzbl 0(%ebp,%edi), %ebx
movl (%esp), %edi
movb %bl, 2(%edx)
movzbl 1(%eax), %esi
addl %edi, %ecx
movzbl (%ecx,%esi), %ebx
movb %bl, 1(%edx)
movzbl (%eax), %esi
movzbl (%ecx,%esi), %ecx
movb %cl, (%edx)
movl _lightright, %esi
movl _lightrightstep, %ecx
movl _lightleftstep, %ebx
movl _sourcetstep, %ebp
addl %ecx, %esi
movl _lightleft, %ecx
movl %esi, _lightright
addl %ebp, %eax
addl %ebx, %ecx
movl %ecx, _lightleft
movl _surfrowbytes, %edi
addl %edi, %edx
decl 4(%esp)
jne L36
movl _r_stepback, %edi
movl %eax, %ecx
movl (%esp), %ebx
subl %edi, %ecx
cmpl %eax, _r_sourcemax
cmovbe %ecx, %eax
incl 8(%esp)
movl 8(%esp), %ecx
cmpl %ecx, _r_numvblocks
jg L38
addl $12, %esp
popl %ebx
popl %esi
popl %edi
popl %ebp
L43:
ret
-
qbism - Posts: 1236
- Joined: Thu Nov 04, 2004 5:51 am
Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled
What would duff devicing the inner most loop do?
i should not be here
- leileilol
- Posts: 2783
- Joined: Fri Oct 15, 2004 3:23 am
Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled
Yeah, with vidcolormap static, the compiler knows nothing should be able to change it unexpectedly, and thus can do such optimizations.
My general rule (for other reasons, actually) is if it doesn't need to be accessible to other files, make it static (whether variable or function). Also, as a related rule: the "extern" keyword should not appear in a .c file (limit its use to .h files). The reason for this is there was a bug (possibly in only QF, but I don't know) where a variable was declared as int in one .c file and short in another .c file. The compiler didn't catch the problem because one of the .c files was using extern, and for some reason the linker didn't complain.
While places like R_DrawSurfaceBlock8_mip0 are obvious places to optimize (or at least likely places), always follow Abrash's rule: profile, profile, reconsider the algorithm, profile, profile, then optimize. There's little point in optimizing code that has little impact on performance, and there's no point optimizing an O(N**2) algorithm when there's an O(N) (or even the holy grail, O(1)*) algorithm available.
* Yes, for some problem spaces, O(1) exists, but they're rare. Sure, they're sometimes more expensive for small N, but since they scale so nicely, who cares?
eg, using a hash table will be slower for small N (hash string, compare a string or two vs just comparing a few strings in a simple linked list), but get to a few thousand (eg, compiling frikbot with qcc) and the hash table wins hands down.
Anyway, if the code is fast enough (thus the need for profiling), clean, easily read (and modified!) code is much more important than "fast" code. Also, compiler writers put their efforts into getting the compiler to produce optimial code for the clean, easily read code. Hand optimizing the code can wind up fighting the compiler's optimization algorithms and wind up producing sub-optimal code (probably still better than the clean code with compiler optimizations turned off).
Now, the best thing to do for "hand optimizing" code is actually to give the compiler hints (and as a fringe benefit, improve the readability of the code and let the compiler help catch errors). Things like "static" and "const" wherever possible can make a big difference. Just making vec3_t params "const" where possible made a noticeable difference!
My general rule (for other reasons, actually) is if it doesn't need to be accessible to other files, make it static (whether variable or function). Also, as a related rule: the "extern" keyword should not appear in a .c file (limit its use to .h files). The reason for this is there was a bug (possibly in only QF, but I don't know) where a variable was declared as int in one .c file and short in another .c file. The compiler didn't catch the problem because one of the .c files was using extern, and for some reason the linker didn't complain.
While places like R_DrawSurfaceBlock8_mip0 are obvious places to optimize (or at least likely places), always follow Abrash's rule: profile, profile, reconsider the algorithm, profile, profile, then optimize. There's little point in optimizing code that has little impact on performance, and there's no point optimizing an O(N**2) algorithm when there's an O(N) (or even the holy grail, O(1)*) algorithm available.
* Yes, for some problem spaces, O(1) exists, but they're rare. Sure, they're sometimes more expensive for small N, but since they scale so nicely, who cares?
Anyway, if the code is fast enough (thus the need for profiling), clean, easily read (and modified!) code is much more important than "fast" code. Also, compiler writers put their efforts into getting the compiler to produce optimial code for the clean, easily read code. Hand optimizing the code can wind up fighting the compiler's optimization algorithms and wind up producing sub-optimal code (probably still better than the clean code with compiler optimizations turned off).
Now, the best thing to do for "hand optimizing" code is actually to give the compiler hints (and as a fringe benefit, improve the readability of the code and let the compiler help catch errors). Things like "static" and "const" wherever possible can make a big difference. Just making vec3_t params "const" where possible made a noticeable difference!
Leave others their otherness.
http://quakeforge.net/
http://quakeforge.net/
- taniwha
- Posts: 399
- Joined: Thu Jan 14, 2010 7:11 am
Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled
leileilol: Duff's device is useless when the number of iterations is fixed (which, unless I'm mistaken, it appears to be*). Odds are high that Duff's device will produce slower code for fixed iterations (depends on the compiler (aggressiveness of constant folding, common subexpression elimination and dead code removal)).
* In fact, it seems only the outermost loop is variable.
Duff's device is cool, but it's no hammer and the problem doesn't even look like a nail
.
* In fact, it seems only the outermost loop is variable.
Duff's device is cool, but it's no hammer and the problem doesn't even look like a nail
Leave others their otherness.
http://quakeforge.net/
http://quakeforge.net/
- taniwha
- Posts: 399
- Joined: Thu Jan 14, 2010 7:11 am
Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled
okay.
For some odd reason, in vc6, using "Prefer small code" in the optimizations for r_surf.c makes it faster for me
For some odd reason, in vc6, using "Prefer small code" in the optimizations for r_surf.c makes it faster for me
i should not be here
- leileilol
- Posts: 2783
- Joined: Fri Oct 15, 2004 3:23 am
Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled
My guess is cache usage for the code itself. ie, keep the code small and a lot of r_surf.c might fit in L1 cache.
Leave others their otherness.
http://quakeforge.net/
http://quakeforge.net/
- taniwha
- Posts: 399
- Joined: Thu Jan 14, 2010 7:11 am
Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled
leileilol wrote:okay.
For some odd reason, in vc6, using "Prefer small code" in the optimizations for r_surf.c makes it faster for me
smaller code == (usually) more L1/L2 cache hits ?
EDIT: Gahh, ninja'd by taniwha
I know FrikaC made a cgi-bin version of the quakec interpreter once and wrote part of his website in QuakeC
(LordHavoc)
-

frag.machine - Posts: 2090
- Joined: Sat Nov 25, 2006 1:49 pm
Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled
Probably, though i've only tried that small code thing with my duff deviced rgb surfaceblocks. I've yet to test it on the plain unrolled ones i've had before.
I should compile some build that allows to choose the duff blocks, unrolled blocks, and combined duff and unrolled blocks (to one function), then test that build on a Pentium II, trying small code and fast code optimizations on two different builds
EDIT: Unrolled > Duff in a surfaceblock. Working out a duff was a waste of my time
I should compile some build that allows to choose the duff blocks, unrolled blocks, and combined duff and unrolled blocks (to one function), then test that build on a Pentium II, trying small code and fast code optimizations on two different builds
EDIT: Unrolled > Duff in a surfaceblock. Working out a duff was a waste of my time
i should not be here
- leileilol
- Posts: 2783
- Joined: Fri Oct 15, 2004 3:23 am
15 posts
• Page 1 of 1
Who is online
Users browsing this forum: No registered users and 1 guest