qbismSuper8 builds

qbism · Post by **qbism** » Tue May 08, 2012 4:48 am

Not really a release, but a test build: http://super8.qbism.com/wp-content/uplo ... 3-test.zip
The exe is optimized for MMX, SSE, SSE2, and SSE3 (Pentium 4 / Athlon 64). Hopefully it will run on a wider range of older machines, and I didn't notice any slowdown compared to the icore-optimized build.
Framerate is improved in fog due to fewer depth samples. It tests every other pixel, but still blends them all.
Also added SSE vector math posted by Reckless here. I don't know how much faster it is, but it's probably not slower.
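
As a rough illustration of "tests every other pixel, but still blends them all", here is a minimal sketch of a fog pass over one scanline. It is not the qbismSuper8 code: `fog_scanline`, `FOG_LEVELS`, the flat `fogmap` lookup table, and the depth-to-level mapping are all hypothetical, and it assumes that larger stored depth values mean nearer surfaces.

```c
#include <assert.h>
#include <stdint.h>

#define FOG_LEVELS 4   /* hypothetical number of fog intensity steps */

/* Hypothetical fog pass over one scanline of an 8-bit frame.
   fogmap is a flat FOG_LEVELS x 256 table: fogmap[level * 256 + color]
   gives the fogged palette index for that color at that fog level. */
static void fog_scanline(uint8_t *pixels, const int16_t *zbuf, int width,
                         const uint8_t *fogmap)
{
    int x;

    assert(width >= 0);
    for (x = 0; x < width; x += 2)
    {
        /* One depth read per pixel pair. The mapping from depth to fog
           level is made up; it assumes larger values are nearer. */
        int level = (FOG_LEVELS - 1) - (zbuf[x] >> 13);

        if (level < 0)
            level = 0;
        if (level > FOG_LEVELS - 1)
            level = FOG_LEVELS - 1;

        /* Both pixels of the pair are blended with the same level. */
        pixels[x] = fogmap[level * 256 + pixels[x]];
        if (x + 1 < width)
            pixels[x + 1] = fogmap[level * 256 + pixels[x + 1]];
    }
}
```

The saving is only in the depth reads and the level computation; every pixel still goes through the lookup table.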

mh · Post by mh » Tue May 08, 2012 6:00 pm

I doubt if you're going to get much from SSE-izing those operations - where SSE works best is on long chains of float4 data. Using SSE instructions on single float3s seems highly unlikely to give anything extra.

revelator · Post by **revelator** » Tue May 08, 2012 6:42 pm

This might be handy then.

Code:

void VectorConstruct(float vec1, float vec2, float vec3, float vec4, vec4_t out)
{
#if __SSE__
    // raynorpat: sse optimization (needs <xmmintrin.h> for __m128 and the _mm_*_ss intrinsics)
    __m128 xmm_in;

    xmm_in = _mm_load_ss(&vec1);
    _mm_store_ss(&out[0], xmm_in);

    xmm_in = _mm_load_ss(&vec2);
    _mm_store_ss(&out[1], xmm_in);

    xmm_in = _mm_load_ss(&vec3);
    _mm_store_ss(&out[2], xmm_in);

    xmm_in = _mm_load_ss(&vec4);
    _mm_store_ss(&out[3], xmm_in);

#else
    out[0] = vec1;
    out[1] = vec2;
    out[2] = vec3;
    out[3] = vec4;
#endif
}

I use it for rgba operations but it can be used for other stuff too.

Spike · Post by **Spike** » Tue May 08, 2012 7:10 pm

#define makevec4(a,b,c,d,out) (((long long *)(out))[0] = *(long long *)&(a), ((long long *)(out))[1] = *(long long *)&(c))
mwahaha. evil though.
the function call overhead will kill you with your function. the sse inside is less likely to be directly optimised (that is, it'll never no-op elements that can be trivially overwritten due to known values).

sse2 at least struggles with dotproducts.
enabling auto-sse2 optimisations actually reduced framerates for me with fte - quake has a lot of dotproducts for things like culling etc.

as mh says, sse is good for bulk stuff like 'multiply this vec4 by 4, now add 2'. It's also quite good for memcpys, if you can avoid alignment issues and non-multiple-of-16 data blocks.

However, while 'multiply each element in this vec3' is fast enough, adding those 3 elements together is more painful than doing the whole thing with just x87 instructions, and that's your basic dotproduct that is used all over the place in quake!

maybe sse3 fixes something? I've not checked, just that optimising for sse2 was a loss for me, but then I don't have a software renderer. Benchmark it!
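
To make the dotproduct point concrete, here is a minimal sketch comparing a scalar 3-component dot product with an SSE1 one. The function names are hypothetical and this is not code from any of the engines mentioned. The multiply is a single instruction, but the sum across lanes needs extra shuffles and adds. SSE3 does add a horizontal add (`_mm_hadd_ps` in `<pmmintrin.h>`); whether it helps here would need benchmarking.

```c
#include <assert.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Plain scalar version: three multiplies, two adds. */
static float dot3_scalar(const float a[3], const float b[3])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

/* SSE1 version (hypothetical helper). Falls back to scalar without SSE. */
static float dot3_sse(const float a[3], const float b[3])
{
#if defined(__SSE__)
    /* _mm_set_ps takes lanes from high to low; the unused lane is 0. */
    __m128 va = _mm_set_ps(0.0f, a[2], a[1], a[0]);
    __m128 vb = _mm_set_ps(0.0f, b[2], b[1], b[0]);
    __m128 m  = _mm_mul_ps(va, vb);       /* { x, y, z, 0 } products   */

    /* The painful part: summing across lanes. */
    __m128 hi = _mm_movehl_ps(m, m);      /* { z, 0, z, 0 }            */
    __m128 s  = _mm_add_ps(m, hi);        /* { x+z, y, 2z, 0 }         */
    __m128 y  = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1));
    s = _mm_add_ss(s, y);                 /* lane 0 = x + z + y        */
    return _mm_cvtss_f32(s);
#else
    return dot3_scalar(a, b);
#endif
}
```

Getting three separate floats into one register and the result back out is overhead the scalar version never pays, which is consistent with the slowdowns reported above.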

mh · Post by mh » Tue May 08, 2012 8:32 pm

Auto-SSE made DirectQ slower too. The only place I still use SSE is in my matrix operations, which D3D does automatically for me anyway (yayyy! no work!). Those are called so rarely compared to so much other, more important stuff that it doesn't really count for much in the general case (useful for IQM though, even if I do have to pad them to 4x4 - no big deal).

I believe that the DarkPlaces matrix library is set up to be friendly for compiler auto-vectorization, but I've already got the D3D library and it works natively with the rest of the code, so why bother? (I even use this library with OpenGL where possible - it works, I know it, I'm comfortable with it, why not?)

Everywhere else I get much more performance by putting this kind of calculation on the GPU where it counts. Not an option for a software engine of course.......

(Random mad thought - use CUDA/OpenCL/DirectCompute to accelerate these ops but still render in software).

Spike · Post by **Spike** » Tue May 08, 2012 9:04 pm

gcc actually has a built in vector type. You can do your c = a*b; etc stuff with it and it'll automatically be converted to sse or altivec (depending on cpu, obviously) for you without any of the unreadable gibberish intrinsic names.
explicit vector types allow the compiler to align things properly, etc, which will help save a few cycles even with auto-vectorisation, and because it's not instruction-set-specific, you can always change the cpu target and get a working build with the same code for an entirely different cpu.
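
A minimal sketch of that built-in vector type, using the `vector_size` attribute that gcc (and clang) provide. The `v4f` typedef and `scale_bias` helper are made-up names for illustration; the loop is the 'multiply this vec4 by 4, now add 2' case from earlier in the thread.

```c
#include <assert.h>

/* GCC/Clang vector extension: four packed floats, 16 bytes wide. */
typedef float v4f __attribute__((vector_size(16)));

/* Scale and bias an array of vec4s. The arithmetic is written with
   ordinary operators; the compiler picks SSE, AltiVec, NEON or plain
   scalar code depending on the target. */
static void scale_bias(v4f *data, int count, v4f scale, v4f bias)
{
    int i;

    assert(count >= 0);
    for (i = 0; i < count; i++)
        data[i] = data[i] * scale + bias;
}
```

Individual elements can still be read with normal subscripts, for example `data[0][2]`.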

revelator · Post by **revelator** » Wed May 09, 2012 3:12 am

there's an asm version of memcpy in quake2xp; unfortunately it does not assemble with gcc's assembler (it uses masm syntax).
one switch you can try with gcc is -ftree-vectorize to enable the vectorizing compiler. i had a few problems with some sources though when using this.

qbism · Post by **qbism** » Wed May 09, 2012 5:31 am

CB Advanced code profiling shows that time spent drawing fog (it's within R_RenderView) is still large, about the same time consumed as D_DrawSpans8. When fog is off, R_RenderView drops way down the list.

FlipScreen in vid_win.c is high on the list. (BTW, discard _mcount_private, that's the profiler itself.) FlipScreen is where the 8-bit fantasy ends: it's ridden as far as it goes, and the frame now has to be converted to 32-bit for ddraw.
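
For readers unfamiliar with that step, here is a minimal sketch of an 8-bit to 32-bit conversion through a palette table. It is not the actual FlipScreen code; `expand_8_to_32`, its parameters, and the pitch conventions are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Expand an 8-bit indexed frame into 32-bit pixels.
   palette32 holds 256 ready-made 32-bit colors; src_pitch is in bytes,
   dst_pitch in pixels (both hypothetical conventions). */
static void expand_8_to_32(const uint8_t *src, uint32_t *dst,
                           int width, int height,
                           int src_pitch, int dst_pitch,
                           const uint32_t *palette32)
{
    int x, y;

    assert(width >= 0 && height >= 0);
    for (y = 0; y < height; y++)
    {
        const uint8_t *s = src + y * src_pitch;
        uint32_t *d = dst + y * dst_pitch;

        /* One table lookup and one 4-byte write per pixel, every frame. */
        for (x = 0; x < width; x++)
            d[x] = palette32[s[x]];
    }
}
```

Every pixel of every frame goes through this loop, which would explain why a function like this shows up high in a profile.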

sv_physics, on the other hand, takes almost no time at 0.07%. So it could afford some improvements! It's already got a few basic updates like bounce_missile, movetype_follow, brush rotation fix, and gravity interpolation fix.

A partial report is below. It is cut at the most expensive mathlib.c function: TransformVector. It's only 0.18%. So further vector math optimization won't get much bang. I compiled this without the manual SSE code but with -ftree-vectorize and with MMX/SSE/SSE2/SSE3 compiler options. In a build without -ftree-vectorize, TransformVector was at 0.25%, but that might only be statistical variation in its favor between runs.

This build is using asm for some of the span functions, indicated with missing calls and s/call data. D_PolysetDrawSpans8 and certain others are not asm because they've been modified for transparency and so forth.

Code:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 17.81     43.72    43.72                             D_DrawSpans8
 17.21     85.96    42.24     6505     0.01     0.01  R_RenderView
 15.72    124.54    38.58                             _mcount_private
 12.78    155.92    31.38     6515     0.00     0.00  FlipScreen
  4.43    166.79    10.87                             D_DrawZSpans
  4.27    177.27    10.48  2511031     0.00     0.00  D_PolysetDrawSpans8
  3.94    186.95     9.68  9875785     0.00     0.00  R_RecursiveClipBPoly
  2.77    193.75     6.80 343848953     0.00     0.00  R_LeadingEdge
  2.25    199.28     5.53 2088482963     0.00     0.00  Q_strcmp
  2.01    204.22     4.95  7279200     0.00     0.00  R_GenerateSpans
  1.59    208.13     3.91   407306     0.00     0.00  R_DrawSurfaceBlock8_mip0
  1.36    211.48     3.35  7377603     0.00     0.00  GetEdictFieldValue
  1.23    214.49     3.01 29338307     0.00     0.00  R_EmitEdge
  1.17    217.35     2.86  1579143     0.00     0.00  SV_FindTouchedLeafs
  0.80    219.32     1.97  6745591     0.00     0.00  R_StepActiveU
  0.73    221.12     1.80    26460     0.00     0.00  Draw_Fill
  0.70    222.85     1.73     8088     0.00     0.00  R_RecursiveWorldNode
  0.65    224.46     1.61 10036193     0.00     0.00  R_RenderFace
  0.53    225.76     1.31 59825048     0.00     0.00  R_ClipEdge
  0.53    227.07     1.31  1951413     0.00     0.00  D_PolysetScanLeftEdge_C
  0.49    228.28     1.20        1     1.20     1.20  R_BuildLightmaps
  0.47    229.44     1.16     8088     0.00     0.00  R_ScanEdges
  0.42    230.47     1.04    66130     0.00     0.00  R_DrawSolidClippedSubmodelPolygons
  0.36    231.37     0.89 41118167     0.00     0.00  BoxOnPlaneSide
  0.36    232.26     0.89   188979     0.00     0.00  PR_ExecuteProgram
  0.32    233.05     0.80                             Entry8_8
  0.24    233.65     0.60      251     0.00     0.00  Draw_ConsoleBackground
  0.24    234.23     0.58                             _fentry__
  0.22    234.76     0.53     6505     0.00     0.00  SV_WriteEntitiesToClient
  0.20    235.26     0.50    24258     0.00     0.00  D_DrawSurfaces
  0.19    235.72     0.46    93641     0.00     0.00  SV_PushMove
  0.18    236.17     0.45 65651197     0.00     0.00  TransformVector

revelator · Post by **revelator** » Wed May 09, 2012 12:15 pm

seems the absolute winner as for calls is strcmp

Baker · Post by **Baker** » Wed May 09, 2012 12:24 pm

Hmmmm ....

qbism · Post by **qbism** » Wed May 09, 2012 5:08 pm

Q_strcmp: 2 billion calls / 6500 frames = 300,000+ calls/frame

Maybe FlipScreen is the place to post-process fog in 32-bit color.

revelator · Post by **revelator** » Wed May 09, 2012 5:39 pm

comparing strings is probably not the biggest hit on performance normally but 2 billion calls... ouch.

mankrip · Post by **mankrip** » Wed May 09, 2012 8:26 pm

I've looked at D_DrawZSpans a few weeks ago, and couldn't really find a way to optimize it.

R_DrawSurfaceBlock8_mip* obviously needs to be optimized, and that should be easy. Limiting the number of lightmap updates per second could also help.

Q_strcmp usage should be easy to optimize in some way, maybe by finding ways to reduce the number of calls for it.

GetEdictFieldValue seems to be the villain in QuakeC's performance. Good to know, it should be the starting point if I try to optimize the QC VM.

Draw_Fill has a surprisingly high hit on performance.

Spike · Post by **Spike** » Thu May 10, 2012 1:38 am

the whole point of D_DrawZSpans is that a 386 doesn't have enough registers to interpolate/write depth at the same time as colours.
with the sse or amd64 instruction sets, you might get enough spare registers (possibly mmx too if you don't use floats).
Modern CPUs have more aggressive instruction pipelines. If you don't use the value from mem reads instantly, you may well be able to get away with using a little memory (stack?) instead of registers. Memory writes should at least be cacheable also.
Back when FTE still had a software renderer (much of which used C instead of asm) I found that combining D_DrawSpans8 and D_DrawZSpans gave a couple of percent speedup. But like I say, no asm so it's more a case of the compiler not managing to use all registers efficiently.
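
A heavily simplified sketch of what a combined colour-plus-depth span loop could look like in C. This is not FTE's or Quake's code: `draw_span_combined` and its parameters are hypothetical, texture coordinates are stepped affinely in 16.16 fixed point with no perspective correction, and depth is written as the top 16 bits of a stepped value.

```c
#include <assert.h>
#include <stdint.h>

/* Write colour and depth for one span in a single pass instead of
   two separate passes over the same pixels. */
static void draw_span_combined(uint8_t *dest, int16_t *zdest, int count,
                               const uint8_t *tex, int tex_width,
                               int32_t s, int32_t t,
                               int32_t sstep, int32_t tstep,
                               int32_t izi, int32_t izistep)
{
    assert(count >= 0);
    while (count-- > 0)
    {
        /* Colour: texel at the integer part of (s, t). */
        *dest++ = tex[(t >> 16) * tex_width + (s >> 16)];
        /* Depth: written in the same iteration. */
        *zdest++ = (int16_t)(izi >> 16);

        s += sstep;
        t += tstep;
        izi += izistep;
    }
}
```

The loop keeps six interpolants and two destination pointers live at once, which is the register pressure described above.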

GetEdictFieldValue contains a loop and strcmps. It's not pretty. The offsets should be directly cacheable anyway.
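
One way the offset caching could work, as a minimal sketch. The types and functions here (`fielddef_t`, `cached_field_t`, `find_field_offset`, `cached_field_offset`) are hypothetical stand-ins, not the engine's real structures: the name is resolved with the strcmp loop once, and later lookups reuse the stored offset.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical field definition: a name and its offset in the entity. */
typedef struct { const char *name; int ofs; } fielddef_t;

/* Hypothetical per-field cache slot. */
typedef struct { const char *name; int ofs; int resolved; } cached_field_t;

/* The slow path: linear search with a strcmp per definition. */
static int find_field_offset(const fielddef_t *defs, int numdefs,
                             const char *name)
{
    int i;

    assert(numdefs >= 0);
    for (i = 0; i < numdefs; i++)
        if (strcmp(defs[i].name, name) == 0)
            return defs[i].ofs;
    return -1;   /* field not present in this progs */
}

/* The fast path: search on first use only, then return the stored result. */
static int cached_field_offset(cached_field_t *cf,
                               const fielddef_t *defs, int numdefs)
{
    if (!cf->resolved)
    {
        cf->ofs = find_field_offset(defs, numdefs, cf->name);
        cf->resolved = 1;
    }
    return cf->ofs;
}
```

The cache would need to be reset whenever a new progs is loaded, since offsets can differ between mods.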

qbism · Post by **qbism** » Thu May 10, 2012 5:07 pm

Spike wrote: Back when FTE still had a software renderer (much of which used C instead of asm) I found that combining D_DrawSpans8 and D_DrawZSpans gave a couple of percent speedup. But like I say, no asm so it's more a case of the compiler not managing to use all registers efficiently.

I wonder if this idea combined with fog would win out in speed? It might eventually support volumetric fog, unlike the post-process global fog effect.

InsideQC Forums
