Laaca
Czech republic, 02.03.2009, 21:15
Incredibly slow MMX? (Developers)
Now I tried to write a transparent PutSprite routine. Surprisingly it wasn't so hard, but the result is much slower than normal 386 code. If I draw into VRAM it is even about 80 times slower!!!
In my test routines I measured how many PutSprite calls complete in 50 x 55 ms (about 2.75 s). I switched the CPU_info_mmx variable on and off and drew into either RAM or VRAM:
noMMX into RAM: 8645
MMX into RAM: 5928
noMMX into VRAM: 10210
MMX into VRAM: 152 (!!!)
Hell, am I doing something fundamentally wrong?
My computer: AMD K6 Thunderbird 1.33 GHz with a GeForce4 MX card
Measured in FreeDOS and in a DOS session in Windows 98; the results are the same.
PROCEDURE PutHCSprite(var Dest,Sprite:VirtualWindow;x,y:LongInt;HideColor:Word);assembler;
Asm
PUSH ES
mov edi,Dest
mov ax,ds:[edi+ 0] {selector of destination...}
Mov es,ax {...into ES}
mov esi,Sprite
mov ecx,ds:[esi+26] {Sprite - bytes per line (I use 16bpp modes)}
mov eax,y {Compute the offset of Y-th line}
mov ebx,ds:[edi+26] {Destination - bytes per line into EBX}
mul ebx
Mov edx,ds:[esi+30] {Height of sprite into EDX }
Add eax,x {add X position to offset}
add eax,x {and even more because I use 16bpp mode}
Add eax,ds:[edi+2] {Add the Destination basic offset to the computed
position offset}
mov edi,eax {and store into EDI}
Mov esi,ds:[esi+2] {Sprite selector is always DS, and we do not
adjust the offset because this is a non-clipping
routine}
cmp cpu_info_mmx,0 {do I have a MMX processor?}
jnz @with_mmx {If yes, jump}
@wo_mmx: {--------------------------------------------------}
@wo_mmx_lines:
push ecx
push edi
@wo_mmx_dots:
Mov ax,[esi]
Cmp ax,HideColor
Je @wo_mmx_skip
Mov es:[edi],ax
@wo_mmx_skip:
Add esi,2
Add edi,2
sub ecx,2
jnz @wo_mmx_dots
pop edi
pop ecx
Add edi,ebx
dec edx
jnz @wo_mmx_lines
JMP @Finished
@with_mmx:{--------------------------------------------------}
cmp ecx,8 {too short lines?}
jle @wo_mmx {If yes, jump}
{Now we know we are on an MMX processor and the sprite line is longer than 8 bytes
(4 pixels)}
{Replicate the HideColor word into all four words of mm5}
mov ax,HideColor
shl eax,16
mov ax,HideColor
movd mm5,eax
movd mm6,eax
psllq mm5,32
paddusw mm5,mm6
{...ready in mm5------------------------}
@mmx_lines:
push ecx
push edi
@mmx_dots:
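movq mm1,ds:[esi] {4 pixels from sprite}
movq mm2,mm1 {keep a copy of the sprite pixels}
pcmpeqw mm1,mm5 {mask: FFFFh in every word where sprite = HideColor}
movq mm3,es:[edi] {4 pixels from destination (note: this READS the destination)}
pand mm3,mm1 {keep destination only where the sprite is transparent}
pandn mm1,mm2 {mm1 = (not mask) and sprite: keep sprite only where it is opaque}
por mm1,mm3 {merge; OR-ing the unmasked sprite would leak HideColor bits}
movq es:[edi],mm1 {finished 4 pixels into destination}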
add esi,8
add edi,8
sub ecx,8
jz @mmx_endline {ECX zero? So this line is finished}
cmp ecx,8
jge @mmx_dots {Do I have to process the rest of line?}
{Now do the rest what didn't fit into MMX registers}
@mmx_rest:
Mov ax,[esi]
Cmp ax,HideColor
Je @mmx_skip
Mov es:[edi],ax
@mmx_skip:
add edi,2
add esi,2
sub ecx,2
jnz @mmx_rest
@mmx_endline:
pop edi
pop ecx
Add edi,ebx
dec edx
jnz @mmx_lines
EMMS
@Finished:
POP ES
End;
--- DOS-u-akbar!
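The select step of the MMX loop (destination where the sprite equals HideColor, sprite everywhere else) can be modelled in portable C. This is only a minimal sketch: `blend4_keyed` is a hypothetical name, not part of the routine above, and it processes four 16bpp pixels one by one instead of in a single 64-bit register. The point it illustrates is that the sprite has to be masked with the inverted mask before the OR, otherwise the HideColor bits end up in the transparent pixels whenever HideColor is not 0.

```c
#include <stdint.h>

/* Model of pcmpeqw / pand / pandn / por on four 16-bit lanes.
   mask = 0xFFFF where the sprite pixel equals the hide color, else 0. */
void blend4_keyed(uint16_t dest[4], const uint16_t sprite[4], uint16_t hide)
{
    for (int i = 0; i < 4; i++) {
        uint16_t mask = (sprite[i] == hide) ? 0xFFFFu : 0x0000u; /* pcmpeqw */
        uint16_t keep = dest[i] & mask;                          /* pand    */
        uint16_t draw = sprite[i] & (uint16_t)~mask;             /* pandn   */
        dest[i] = (uint16_t)(keep | draw);                       /* por     */
    }
}
```

Note that this approach still has to read `dest`, which matters a lot when `dest` is video memory.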
Rugxulo
Usono, 04.03.2009, 03:01
@ Laaca
> Now I tried to write a transparent PutSprite routine. Surprisingly it wasn't
> so hard, but the result is much slower than normal 386 code. If I draw into
> VRAM it is even about 80 times slower!!!
>
> In my test routines I measured how many PutSprite calls complete in
> 50 x 55 ms. I switched the CPU_info_mmx variable on and off and drew into
> either RAM or VRAM:
>
> noMMX into RAM: 8645
> MMX into RAM: 5928
> noMMX into VRAM: 10210
> MMX into VRAM: 152 (!!!)
>
> Hell, am I doing something fundamentally wrong?
>
> My computer: AMD K6 Thunderbird 1.33 GHz with a GeForce4 MX card
> Measured in FreeDOS and in a DOS session in Windows 98; the results are the
> same.
N.B. I'm no expert, take this with a bucket of salt!
MTRR enabled? Aligned properly? Double buffered?
BTW, Wikipedia calls Thunderbird the improved K7 Athlon, so I can't think of any inherent reason its MMX would be weaker. (It even includes the extended MMX subset of SSE, right?) I assume it uses register renaming, but maybe that's not aggressive enough, since you have a few dependency issues (e.g. using the same register as output a few times in a row). Actually, it looks like your non-MMX code uses lots of MOV and ADD and a single MUL. That stuff is very basic ALU work, so it's probably very, very fast (i.e. very common, so heavily optimized in the CPU itself).
Laaca
Czech republic, 04.03.2009, 13:26
@ Rugxulo
> MTRR enabled? Aligned properly? Double buffered?
Yes, MTRR is enabled. What do you mean by double buffering? In my test routine I drew either into RAM (without any later copying to the screen) or into the visible part of VRAM.
It is aligned, because VRAM is always aligned and Free Pascal aligns all arrays and buffers on 16 bytes.
--- DOS-u-akbar!
Japheth
Germany (South), 04.03.2009, 17:55
@ Laaca
> Hell, am I doing something fundamentally wrong?
Probably yes. AFAICS your MMX code reads from VRAM, while your 386 code avoids this. Reading VRAM is a bad idea because it's VERY slow, even with modern cards.
--- MS-DOS forever!
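To make the difference concrete, here is a minimal C sketch of what the 386 path does for one sprite line: it reads only the sprite (system RAM) and writes the destination, never reading it back. The function name `put_line_keyed` and its return value are assumptions made for this illustration; they do not appear in the Pascal routine.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy one 16bpp sprite line, skipping pixels equal to `hide`.
   `dest` is only written, never read, so this stays write-only
   even when `dest` points into video memory.
   Returns the number of pixels actually written. */
size_t put_line_keyed(volatile uint16_t *dest, const uint16_t *sprite,
                      size_t pixels, uint16_t hide)
{
    size_t written = 0;
    for (size_t i = 0; i < pixels; i++) {
        uint16_t px = sprite[i];   /* read from the sprite only */
        if (px != hide) {
            dest[i] = px;          /* write, no read-modify-write */
            written++;
        }
    }
    return written;
}
```

A mask-based MMX loop, by contrast, has to load the destination quadword before it can merge and store it, so every store is preceded by a VRAM read.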
Rugxulo
Usono, 04.03.2009, 23:50
@ Laaca
> > MTRR enabled? Aligned properly? Double buffered?
>
> Yes, MTRR is enabled. What do you mean by double buffering? In my test
> routine I drew either into RAM (without any later copying to the screen) or
> into the visible part of VRAM.
double buffering (Wikipedia)
> It is aligned, because VRAM is always aligned and Free Pascal aligns all
> arrays and buffers on 16 bytes.
I meant the code should probably be aligned, but then again, I kinda doubt your Athlon is as sensitive as some older CPUs, so that may not make much of a difference.
Laaca
Czech republic, 08.03.2009, 22:04 (edited by Laaca, 08.03.2009, 22:46)
@ Japheth
I wrote another routine for testing MMX. It is a darkening procedure for a sprite (again for 16 bpp modes). Without MMX it is quite complicated, so here MMX gives a better speed gain.
I figured out that MMX access to VRAM is much slower than 386 access to VRAM.
So, never use MMX for direct VRAM output. It is always better to draw into a RAM buffer and then copy the result to the screen.
Procedure DecreaseMMXSpriteLightness(var sprite:virtualwindow;r,g,b:longint);assembler;
asm
push es
mov esi,sprite
mov ecx,[esi+6] {size}
mov ax,[esi+0]
mov esi,[esi+2]
mov es,ax
cmp ecx,8
jl @zbytek
cmp cpu_info_mmx,0
jz @zbytek
movd mm5,r
punpcklwd mm5,mm5
punpcklwd mm5,mm5 {R is replicated across all of mm5}
movd mm6,g
punpcklwd mm6,mm6
punpcklwd mm6,mm6 {G is replicated across all of mm6}
movd mm7,b
punpcklwd mm7,mm7
punpcklwd mm7,mm7 {B is replicated across all of mm7}
@smycka:
movq mm1,es:[esi] { for the R component }
movq mm2,mm1 { for the G component }
movq mm3,mm1 { for the B component }
psrlw mm1,11 { R component isolated }
psllw mm2,5
psrlw mm2,5+5 { G component isolated }
psllw mm3,11
psrlw mm3,11 { B component isolated }
psubusw mm1,mm5 { saturating subtract from the R component }
psubusw mm2,mm6 { ...the G component }
psubusw mm3,mm7 { ...the B component }
psllw mm1,11 { R back to its place }
psllw mm2,5 { G back to its place }
{ B is already in its place }
por mm1,mm2
por mm1,mm3
movq es:[esi],mm1
add esi,8
sub ecx,8
cmp ecx,8
jge @smycka
emms
jecxz @konec
{-}@zbytek: {---------------------------}
shr ecx,1
{}@cykl_zbytku:
movzx eax,word ptr es:[esi] {R in EAX}
mov ebx,eax {G in EBX}
mov edx,eax {B in EDX}
shr eax,11
shr ebx,5
and ebx,63
and edx,31
sub eax,r
sub ebx,g
sub edx,b
cmp eax,0
jge @r_ok
xor eax,eax
{}@r_ok:
cmp ebx,0
jge @g_ok
xor ebx,ebx
{}@g_ok:
cmp edx,0
jge @b_ok
xor edx,edx
{}@b_ok:
shl eax,11
shl ebx,5
or eax,ebx
or eax,edx
mov es:[esi],ax
add esi,2
dec ecx
jnz @cykl_zbytku
{---------------------------------------}
@konec:
pop es
end;
--- DOS-u-akbar!
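For readers who do not want to trace the shifts, the per-pixel arithmetic of the darkening routine can be sketched in C as follows. It assumes an RGB 5-6-5 layout (as the shifts by 11 and 5 suggest) and non-negative r, g, b amounts; `darken565` and `sub_sat` are hypothetical names used only for this illustration.

```c
#include <stdint.h>

/* Saturating subtract: clamps at 0 like psubusw does. */
static unsigned sub_sat(unsigned v, unsigned d)
{
    return v > d ? v - d : 0u;
}

/* Darken one RGB565 pixel by subtracting r, g, b from its components. */
uint16_t darken565(uint16_t px, unsigned r, unsigned g, unsigned b)
{
    unsigned pr = sub_sat((unsigned)px >> 11, r);          /* 5 bits of red   */
    unsigned pg = sub_sat(((unsigned)px >> 5) & 63u, g);   /* 6 bits of green */
    unsigned pb = sub_sat((unsigned)px & 31u, b);          /* 5 bits of blue  */
    return (uint16_t)((pr << 11) | (pg << 5) | pb);
}
```

For example, `darken565(0xFFFF, 1, 2, 3)` yields components 30, 61 and 28, which pack to 0xF7BC.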
Rugxulo
Usono, 09.03.2009, 00:33
@ Laaca
> I wrote another routine for testing MMX. It is a darkening procedure for a
> sprite (again for 16 bpp modes). Without MMX it is quite complicated, so
> here MMX gives a better speed gain.
>
> I figured out that MMX access to VRAM is much slower than 386 access to VRAM.
>
> So, never use MMX for direct VRAM output. It is always better to draw into
> a RAM buffer and then copy the result to the screen.
Is it possible that you can only issue a handful of MMX operations per cycle? Or that they take a long time to complete? I mean, you're basically using the FPU decoder unit (or whatever). I assume AMD's FPU finally caught up to Intel's pipelined 587 with the Athlon. Still, there is always a bottleneck, so maybe you hit it.
And maybe your issue is the EMMS (although FEMMS is not available on Intel and is possibly a no-op on Athlon or newer). Or it could be the overhead from FNSAVE / FRSTOR, perhaps? Are you or the compiler doing other FPU-related stuff? Have you tried in pure DOS without the context switches of multitasking?
mht
Wroclaw, Poland, 21.03.2009, 12:57
@ Laaca
While looking for something else, I found the following in http://www.microsoft.com/whdc/archive/GDInext.mspx:
Writes by the CPU to video memory surfaces are also acceptably fast--thanks to write combining--and throughput is typically 200 MB/s on the latest AGP (accelerated graphics port) systems. Read speeds, however, are terrible, typically maxing out at 12 MB/s on the latest AGP systems. This read performance is anathema to most MMX routines, which are typically read-modify-write by nature of their vector processing. It is also a problem for any routines that must explicitly do read or read-modify-write operations, such as is the case with almost all image processing filters or Microsoft DirectX Transform plug-ins.
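To get a feeling for those two figures, here is a small back-of-the-envelope sketch in C. The 640x480, 16bpp frame (614,400 bytes) and the convention 1 MB = 10^6 bytes are assumptions chosen for illustration; `transfer_ms` is a hypothetical helper, and the 200 MB/s and 12 MB/s rates are simply the ones quoted above, not measurements.

```c
/* Time in milliseconds to move `bytes` at `mb_per_s` (1 MB = 10^6 bytes). */
double transfer_ms(double bytes, double mb_per_s)
{
    return bytes / (mb_per_s * 1e6) * 1000.0;
}
```

With these assumptions, writing one such frame takes roughly 3 ms (`transfer_ms(614400.0, 200.0)`), while reading it back takes roughly 51 ms (`transfer_ms(614400.0, 12.0)`), i.e. reads are about 17 times slower than writes.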
DOS386
22.03.2009, 06:11
@ Laaca
> If I draw into VRAM it is even about 80 times slower!!!
Old GUI programmers' theorem: never read from VRAM. If you ever need to read back some screen data, set up a backbuffer.
--- This is a LOGITECH mouse driver, but some software expect here
the following string:*** This is Copyright 1983 Microsoft ***
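The backbuffer idea can be sketched in C like this: all read-modify-write work happens in system RAM, and the screen only ever receives one write-only copy. In this sketch `screen` merely stands in for mapped video memory, `darken_then_present` is a hypothetical name, and the halving trick `(p >> 1) & 0x7BEF` assumes RGB 5-6-5 pixels.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Halve the brightness of every pixel in the RAM backbuffer
   (read-modify-write, cheap in system RAM), then push the result
   to the screen with a single write-only copy. */
void darken_then_present(uint16_t *screen, uint16_t *backbuffer, size_t pixels)
{
    for (size_t i = 0; i < pixels; i++) {
        /* shift all three components right by one and clear the bits
           that crossed a component boundary */
        backbuffer[i] = (uint16_t)((backbuffer[i] >> 1) & 0x7BEFu);
    }
    memcpy(screen, backbuffer, pixels * sizeof(uint16_t));
}
```

The screen is never read here, so the slow VRAM read path is avoided entirely.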