DOS ain't dead - Incredibly slow MMX?

Laaca

Czech republic,
02.03.2009, 21:15

Incredibly slow MMX? (Developers)

Now I tried to write a transparent PutSprite routine. Surprisingly it wasn't so hard, but the result is much slower than normal 386 code. If I draw into VRAM it is even about 80 times slower!!!

In my test routines I measured how many PutSprite calls complete in 50 x 55 ms. I switched the CPU_info_mmx variable on and off and drew into either RAM or VRAM:

noMMX into RAM:  8645
MMX into RAM:    5928
noMMX into VRAM: 10210
MMX into VRAM:   152 (!!!)

Hell, am I doing something fundamentally wrong?

My computer: AMD K6 Thunderbird 1.33 GHz with GeForce4 MX card
Measured in FreeDOS and in a DOS session in Windows 98; the results are the same.

PROCEDURE PutHCSprite(var Dest,Sprite:VirtualWindow;x,y:LongInt;HideColor:Word);assembler;
Asm
  PUSH ES
  mov edi,Dest
  mov ax,ds:[edi+ 0]      {selector of destination...}
  Mov es,ax               {...into ES}
  mov esi,Sprite
  mov ecx,ds:[esi+26]     {Sprite - bytes per line (I use 16bpp modes)}
  mov eax,y               {Compute the offset of Y-th line}
  mov ebx,ds:[edi+26]     {Destination - bytes per line into EBX}
  mul ebx
  Mov edx,ds:[esi+30]     {Height of sprite into EDX}
  Add eax,x               {add X position to offset}
  add eax,x               {and even more because I use 16bpp mode}
  Add eax,ds:[edi+2]      {Add the Destination basic offset to the computed position offset}
  mov edi,eax             {and store into EDI}
  Mov esi,ds:[esi+2]      {Sprite selector is always DS and we will not compute the offset because this is a non-clipping routine}
  cmp cpu_info_mmx,0      {do I have a MMX processor?}
  jnz @with_mmx           {If yes, jump}
@wo_mmx: {--------------------------------------------------}
@wo_mmx_lines:
  push ecx
  push edi
@wo_mmx_dots:
  Mov ax,[esi]
  Cmp ax,HideColor
  Je @wo_mmx_skip
  Mov es:[edi],ax
@wo_mmx_skip:
  Add esi,2
  Add edi,2
  sub ecx,2
  jnz @wo_mmx_dots
  pop edi
  pop ecx
  Add edi,ebx
  dec edx
  jnz @wo_mmx_lines
  JMP @Finished
@with_mmx: {--------------------------------------------------}
  cmp ecx,8               {too short lines?}
  jle @wo_mmx             {If yes, jump}
  {Now we know we are on MMX processor and sprite is at least 8 bytes (4 pixels) wide}
  {Put HideColor word in all MMX registers}
  mov ax,HideColor
  shl eax,16
  mov ax,HideColor
  movd mm5,eax
  movd mm6,eax
  psllq mm5,32
  paddusw mm5,mm6         {...ready in mm5------------------------}
@mmx_lines:
  push ecx
  push edi
@mmx_dots:
  movq mm1,ds:[esi]       {4 pixels from sprite}
  movq mm2,mm1            {make a backup}
  pcmpeqw mm1,mm5         {make mask for AND operation}
  movq mm3,es:[edi]       {4 pixels from destination}
  pand mm1,mm3            {AND operation between mask and dest}
  por mm1,mm2             {and now I can place the backed-up sprite}
  movq es:[edi],mm1       {finished 4 pixels into destination}
  add esi,8
  add edi,8
  sub ecx,8
  jz @mmx_endline         {ECX zero? So this line is finished}
  cmp ecx,8
  jge @mmx_dots           {Do I have to process the rest of line?}
  {Now do the rest that didn't fit into MMX registers}
@mmx_rest:
  Mov ax,[esi]
  Cmp ax,HideColor
  Je @mmx_skip
  Mov es:[edi],ax
@mmx_skip:
  add edi,2
  add esi,2
  sub ecx,2
  jnz @mmx_rest
@mmx_endline:
  pop edi
  pop ecx
  Add edi,ebx
  dec edx
  jnz @mmx_lines
  EMMS
@Finished:
  POP ES
End;
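
For readers who want to check the mask logic outside of assembler, here is a minimal C model of the 4-pixel select step. It is only a sketch, not the poster's code: `eq_mask16`, `blend4`, `blend4_unmasked` and `blit_row` are hypothetical names, and one 64-bit integer stands in for an MMX register. The model suggests that ORing the unmasked sprite copy back in, as `por mm1,mm2` appears to do, leaves HideColor bits in the transparent pixels unless HideColor is 0. Masking the sprite with the inverted mask first (what PANDN does) avoids that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of PCMPEQW against a broadcast key: each 16-bit lane becomes
   0xFFFF where it equals key and 0x0000 elsewhere. */
static uint64_t eq_mask16(uint64_t v, uint16_t key)
{
    uint64_t mask = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t px = (uint16_t)(v >> (16 * lane));
        if (px == key)
            mask |= (uint64_t)0xFFFF << (16 * lane);
    }
    return mask;
}

/* Per lane: keep the destination where the sprite is transparent,
   take the sprite elsewhere (PAND, PANDN, POR). */
static uint64_t blend4(uint64_t dst, uint64_t src, uint16_t hide_color)
{
    uint64_t mask = eq_mask16(src, hide_color);
    return (dst & mask) | (src & ~mask);
}

/* Same, but with the sprite ORed in unmasked (PAND, POR only). */
static uint64_t blend4_unmasked(uint64_t dst, uint64_t src, uint16_t hide_color)
{
    uint64_t mask = eq_mask16(src, hide_color);
    return (dst & mask) | src;
}

/* Blend one row, four pixels at a time; n must be a multiple of 4. */
static void blit_row(uint16_t *dst, const uint16_t *src, size_t n,
                     uint16_t hide_color)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        uint64_t d, s;
        memcpy(&d, dst + i, sizeof d);
        memcpy(&s, src + i, sizeof s);
        d = blend4(d, s, hide_color);
        memcpy(dst + i, &d, sizeof d);
    }
}
```

Note that either variant still has to load the destination word, which is the part that matters when the destination is VRAM.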

---
DOS-u-akbar!

Rugxulo

Usono,
04.03.2009, 03:01

@ Laaca

> Now I tried to write a transparent PutSprite routine. Surprisingly it wasn't
> so hard, but the result is much slower than normal 386 code. If I draw into
> VRAM it is even about 80 times slower!!!
>
> In my test routines I measured how many PutSprite calls complete in
> 50 x 55 ms. I switched the CPU_info_mmx variable on and off and drew into
> either RAM or VRAM:
>
> noMMX into RAM:  8645
> MMX into RAM:    5928
> noMMX into VRAM: 10210
> MMX into VRAM:   152 (!!!)
>
> Hell, am I doing something fundamentally wrong?
>
> My computer: AMD K6 Thunderbird 1.33 GHz with GeForce4 MX card
> Measured in FreeDOS and in a DOS session in Windows 98; the results are the
> same.

N.B. I'm no expert, take this with a bucket of salt!

MTRR enabled? Aligned properly? Double buffered?

BTW, Wikipedia calls Thunderbird the improved K7 Athlon, so I can't think of any inherent reason its MMX would be weaker. (Even includes the extended MMX subset of SSE, right?) I assume it uses register renaming, but maybe that's not aggressive enough (since you have a few dependency issues, e.g. using same register as output a few times in a row). Actually it looks like your non-MMX code uses lots of MOV, ADD, and a MUL. That stuff is very basic ALU, so it's probably very very fast (i.e. very common, so heavily optimized in the cpu itself).

Laaca

Czech republic,
04.03.2009, 13:26

@ Rugxulo

> MTRR enabled? Aligned properly? Double buffered?

Yes, MTRR is enabled. What do you mean by double buffering? In my test routine I drew into RAM (without any later copying to the screen) or into the visible part of VRAM.

It is aligned because VRAM is always aligned and Free Pascal aligns all arrays and buffers on 16 bytes.

---
DOS-u-akbar!

Rugxulo

Usono,
04.03.2009, 23:50

@ Laaca

> > MTRR enabled? Aligned properly? Double buffered?
>
> Yes, MTRR is enabled. What do you mean by double buffering? In my test
> routine I drew into RAM (without any later copying to the screen) or into
> the visible part of VRAM.

double buffering (Wikipedia)

> It is aligned because VRAM is always aligned and Free Pascal aligns all
> arrays and buffers on 16 bytes.

I meant the code should probably be aligned, but then again, I kinda doubt your Athlon is as sensitive as some older cpus, so that may not make much of a difference.

Japheth

Germany (South),
04.03.2009, 17:55

@ Laaca

> Hell, do I something fundamentaly wrong?

Probably yes. AFAICS your mmx code does read from vram, while your 386 code avoids this. Reading vram is a bad idea, because it's VERY slow, even with modern cards.
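
To make the difference concrete, here is a small C sketch that counts destination accesses for the two styles of loop. It assumes a hypothetical `Surface` wrapper standing in for video memory; `surf_read`, `surf_write`, `blit_write_only` and `blit_read_modify_write` are illustrative names, not anything from the routine above. It counts per pixel, whereas the MMX loop reads and writes 8 bytes at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a video surface that counts its accesses. */
typedef struct {
    uint16_t *pixels;
    size_t reads;
    size_t writes;
} Surface;

static uint16_t surf_read(Surface *s, size_t i)
{
    s->reads++;
    return s->pixels[i];
}

static void surf_write(Surface *s, size_t i, uint16_t v)
{
    s->writes++;
    s->pixels[i] = v;
}

/* Style of the 386 loop: test the sprite pixel, store only opaque ones.
   The destination is never read. */
static void blit_write_only(Surface *dst, const uint16_t *src, size_t n,
                            uint16_t hide)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        if (src[i] != hide)
            surf_write(dst, i, src[i]);
}

/* Style of the MMX loop: fetch the destination, merge, store back,
   for every pixel whether it is transparent or not. */
static void blit_read_modify_write(Surface *dst, const uint16_t *src,
                                   size_t n, uint16_t hide)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++) {
        uint16_t d = surf_read(dst, i);
        surf_write(dst, i, src[i] == hide ? d : src[i]);
    }
}
```

Both functions produce the same pixels; only the second one ever reads the destination, and that read is what becomes expensive when the destination is VRAM.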

---
MS-DOS forever!

Laaca

Czech republic,
08.03.2009, 22:04
(edited by Laaca, 08.03.2009, 22:46)

@ Japheth

I wrote another routine for testing MMX. It is a darkening procedure for a sprite (again for 16 bpp mode). Without MMX it is quite complicated, so MMX gives a better speed gain here.

I figured out that MMX access to VRAM is much slower than 386 access to VRAM.
So, never use MMX for direct VRAM output. It is always better to draw into a RAM buffer and then copy the result to the screen.

Procedure DecreaseMMXSpriteLightness(var sprite:virtualwindow;r,g,b:longint);assembler;
asm
  push es
  mov esi,sprite
  mov ecx,[esi+6]         {size}
  mov ax,[esi+0]
  mov esi,[esi+2]
  mov es,ax
  cmp ecx,8
  jl @zbytek
  cmp cpu_info_mmx,0
  jz @zbytek
  movd mm5,r
  punpcklwd mm5,mm5
  punpcklwd mm5,mm5       {R is broadcast across all of mm5}
  movd mm6,g
  punpcklwd mm6,mm6
  punpcklwd mm6,mm6       {G is broadcast across all of mm6}
  movd mm7,b
  punpcklwd mm7,mm7
  punpcklwd mm7,mm7       {B is broadcast across all of mm7}
@smycka:
  movq mm1,es:[esi]       {for the R component}
  movq mm2,mm1            {for the G component}
  movq mm3,mm1            {for the B component}
  psrlw mm1,11            {R component isolated}
  psllw mm2,5
  psrlw mm2,5+5           {G component isolated}
  psllw mm3,11
  psrlw mm3,11            {B component isolated}
  psubusw mm1,mm5         {subtract from the R component}
  psubusw mm2,mm6         {...from the G component}
  psubusw mm3,mm7         {...from the B component}
  psllw mm1,11            {R back into its place}
  psllw mm2,5             {G back into its place}
                          {B is already in its place}
  por mm1,mm2
  por mm1,mm3
  movq es:[esi],mm1
  add esi,8
  sub ecx,8
  cmp ecx,8
  jge @smycka
  emms
  jecxz @konec
@zbytek: {---------------------------}
  shr ecx,1
@cykl_zbytku:
  movzx eax,word ptr es:[esi]   {R in EAX}
  mov ebx,eax                   {G in EBX}
  mov edx,eax                   {B in EDX}
  shr eax,11
  shr ebx,5
  and ebx,63
  and edx,31
  sub eax,r
  sub ebx,g
  sub edx,b
  cmp eax,0
  jge @r_ok
  xor eax,eax
@r_ok:
  cmp ebx,0
  jge @g_ok
  xor ebx,ebx
@g_ok:
  cmp edx,0
  jge @b_ok
  xor edx,edx
@b_ok:
  shl eax,11
  shl ebx,5
  or eax,ebx
  or eax,edx
  mov es:[esi],ax
  add esi,2
  dec ecx
  jnz @cykl_zbytku
  {---------------------------------------}
@konec:
  pop es
end;
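
As a cross-check for the routine above, here is a plain C sketch of the same per-pixel operation: split an RGB565 pixel into its 5-bit red, 6-bit green and 5-bit blue components, subtract with saturation at zero, and repack. The names `sat_sub`, `darken565` and `darken_sprite` are hypothetical, and the sketch assumes the r, g and b amounts are non-negative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saturating subtraction: never goes below zero (the scalar PSUBUSW idea). */
static unsigned sat_sub(unsigned value, unsigned amount)
{
    return value > amount ? value - amount : 0;
}

/* Darken one RGB565 pixel by the given per-component amounts. */
static uint16_t darken565(uint16_t px, unsigned r, unsigned g, unsigned b)
{
    unsigned pr = (unsigned)px >> 11;         /* 5 bits of red   */
    unsigned pg = ((unsigned)px >> 5) & 63;   /* 6 bits of green */
    unsigned pb = (unsigned)px & 31;          /* 5 bits of blue  */

    pr = sat_sub(pr, r);
    pg = sat_sub(pg, g);
    pb = sat_sub(pb, b);

    return (uint16_t)((pr << 11) | (pg << 5) | pb);
}

/* Darken a whole buffer of count pixels in place. */
static void darken_sprite(uint16_t *pixels, size_t count,
                          unsigned r, unsigned g, unsigned b)
{
    assert(pixels != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        pixels[i] = darken565(pixels[i], r, g, b);
}
```

Running this over a buffer in system RAM and comparing it with the assembler output is one way to check that the MMX path and the remainder loop agree.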

---
DOS-u-akbar!

Rugxulo

Usono,
09.03.2009, 00:33

@ Laaca

> I wrote another routine for testing MMX. It is a darkening procedure for
> a sprite (again for 16 bpp mode). Without MMX it is quite complicated, so
> MMX gives a better speed gain here.
>
> I figured out that MMX access to VRAM is much slower than 386 access to VRAM.
>
> So, never use MMX for direct VRAM output. It is always better to draw into
> a RAM buffer and then copy the result to the screen.

Is it possible that you can only issue a handful of MMX operations per cycle? Or that they take a long time to complete? I mean, you're basically using the FPU decoder unit (or whatever). I assume AMD's FPU finally caught up to Intel's pipelined 587 with the Athlon. Still, there is always a bottleneck, so maybe you hit it.

And maybe your issue is the EMMS (although FEMMS is disallowed on Intel and possibly a no-op on Athlon or newer). Or it could be the overhead from FNSAVE / FRSTOR, perhaps? Are you or the compiler doing other FPU-related stuff? Have you tried in pure DOS without the context switches of multitasking?

mht

Wroclaw, Poland,
21.03.2009, 12:57

@ Laaca

While looking for something else, I found the following in http://www.microsoft.com/whdc/archive/GDInext.mspx:

Writes by the CPU to video memory surfaces are also acceptably fast--thanks to write combining--and throughput is typically 200 MB/s on the latest AGP (accelerated graphics port) systems. Read speeds, however, are terrible, typically maxing out at 12 MB/s on the latest AGP systems. This read performance is anathema to most MMX routines, which are typically read-modify-write by nature of their vector processing. It is also a problem for any routines that must explicitly do read or read-modify-write operations, such as is the case with almost all image processing filters or Microsoft DirectX Transform plug-ins.
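
To put those two figures side by side, here is a small back-of-the-envelope calculation in C. The 640x480 frame at 16 bpp is only an assumed example size, 1 MB is taken as 10^6 bytes, and `frame_bytes` and `transfer_ms` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* Size of one frame in bytes. */
static size_t frame_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Time in milliseconds to move the given number of bytes at the
   given throughput in MB/s (1 MB = 10^6 bytes). */
static double transfer_ms(double bytes, double mb_per_s)
{
    assert(mb_per_s > 0.0);
    return bytes / (mb_per_s * 1e6) * 1000.0;
}
```

With the quoted rates, writing one such 614400-byte frame at 200 MB/s takes about 3 ms, while reading it back at 12 MB/s takes about 51 ms, so a read-modify-write pass is dominated almost entirely by the read.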

DOS386

22.03.2009, 06:11

@ Laaca

> If I draw into VRAM it is even cca 80 times slower!!!

Old GUI programmers' theorem: never read from VRAM. If you ever need to read back some screen data, set up a backbuffer.
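
A minimal sketch of that idea in C, assuming a hypothetical `Screen` struct in which `front` stands in for the visible VRAM and `back` is an ordinary buffer in system RAM. All names here are illustrative. Reads and writes go to `back`, and `front` is only ever written, in one bulk copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint16_t *back;    /* system RAM: read and write freely */
    uint16_t *front;   /* stands in for visible VRAM: write only */
    size_t count;      /* pixels per buffer */
} Screen;

/* Draw a pixel: touches only the backbuffer. */
static void screen_put(Screen *s, size_t i, uint16_t v)
{
    assert(i < s->count);
    s->back[i] = v;
}

/* Read a pixel back: served from the backbuffer, never from VRAM. */
static uint16_t screen_get(const Screen *s, size_t i)
{
    assert(i < s->count);
    return s->back[i];
}

/* Show the finished frame: one sequential write pass into VRAM. */
static void screen_present(const Screen *s)
{
    memcpy(s->front, s->back, s->count * sizeof *s->back);
}
```

Any read-modify-write effect (transparency masks, darkening, blending) then works on `back`, and the only VRAM traffic is the write in `screen_present`.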

---
This is a LOGITECH mouse driver, but some software expect here
the following string:*** This is Copyright 1983 Microsoft ***