DOS ain't dead

Laaca

Czech republic,
02.03.2009, 21:15
 

Incredibly slow MMX? (Developers)

I tried to write a transparent PutSprite routine using MMX. Surprisingly it wasn't that hard, but the result is much slower than plain 386 code. When I draw into VRAM it is even about 80 times slower!

My test routine counts how many PutSprite calls complete in 50 x 55 ms (50 timer ticks, so higher is better). I switched the CPU_info_mmx variable on and off and drew into either RAM or VRAM:

noMMX into RAM:   8645
MMX into RAM:     5928
noMMX into VRAM: 10210
MMX into VRAM:     152 (!!!)


Hell, am I doing something fundamentally wrong?

My computer: AMD K6 Thunderbird 1.33 GHz with GeForce4 MX card
Measured in FreeDOS and in a DOS session under Windows 98; the results are the same.

PROCEDURE PutHCSprite(var Dest,Sprite:VirtualWindow;x,y:LongInt;HideColor:Word);assembler;
 Asm
PUSH ES
  mov  edi,Dest
  mov  ax,ds:[edi+ 0]       {selector of destination...}
  Mov  es,ax                {...into ES}

  mov  esi,Sprite

  mov  ecx,ds:[esi+26]      {Sprite - bytes per line (I use 16bpp modes)}
  mov  eax,y                {Compute the offset of Y-th line}
  mov  ebx,ds:[edi+26]      {Destination - bytes per line into EBX}
  mul  ebx
  Mov  edx,ds:[esi+30]      {Height of sprite into EDX }

  Add  eax,x                {add X position to offset}
  add  eax,x                {add X again: 2 bytes per pixel in 16bpp}

  Add  eax,ds:[edi+2]       {Add the Destination base offset to the computed
                             position offset}

  mov  edi,eax              {and store into EDI}

  Mov  esi,ds:[esi+2]       {Sprite selector is always DS and we do not
                             compute an offset because this is a non-clipping
                             routine}


cmp cpu_info_mmx,0          {do I have a MMX processor?}
jnz @with_mmx                  {If yes, jump}


@wo_mmx:  {--------------------------------------------------}
          @wo_mmx_lines:
            push ecx
            push edi
          @wo_mmx_dots:
               Mov ax,[esi]
               Cmp ax,HideColor
                Je @wo_mmx_skip
               Mov es:[edi],ax
          @wo_mmx_skip:
               Add esi,2
               Add edi,2
               sub ecx,2
               jnz @wo_mmx_dots

            pop edi
            pop ecx
            Add edi,ebx
            dec edx
            jnz @wo_mmx_lines
            JMP @Finished

@with_mmx:{--------------------------------------------------}
cmp ecx,8                   {line shorter than 8 bytes?}
jl @wo_mmx                  {If yes, use the plain 386 loop}

{Now we know we are on an MMX processor and the sprite is at least 8 bytes
 (4 pixels) wide}

       {Put Hidecolor word in all MMX registers}
            mov ax,HideColor
            shl eax,16
            mov ax,HideColor
            movd mm5,eax
            movd mm6,eax
            psllq mm5,32
            paddusw mm5,mm6
       {...ready in mm5------------------------}

          @mmx_lines:

            push ecx
            push edi
          @mmx_dots:
               movq    mm1,ds:[esi] {4 pixels from sprite}
               movq    mm2,mm1      {keep a copy of the sprite pixels}
               pcmpeqw mm1,mm5      {mask: FFFFh where pixel = HideColor}
               movq    mm4,mm1      {second copy of the mask}

               movq    mm3,es:[edi] {4 pixels from destination (a read, slow if Dest is VRAM)}
               pand    mm1,mm3      {keep destination where sprite is transparent}
               pandn   mm4,mm2      {keep sprite where it is opaque}
               por     mm1,mm4      {merge both halves}

               movq    es:[edi],mm1 {store 4 finished pixels into destination}

               add     esi,8
               add     edi,8
               sub     ecx,8
               jz  @mmx_endline     {ECX zero? So this line is finished}
               cmp     ecx,8
               jge @mmx_dots        {Do I have to process the rest of line?}

       {Now do the remainder that did not fill a whole MMX register}
          @mmx_rest:
               Mov ax,[esi]
               Cmp ax,HideColor
                Je @mmx_skip
               Mov es:[edi],ax
          @mmx_skip:
               add edi,2
               add esi,2
               sub ecx,2
               jnz @mmx_rest

          @mmx_endline:
               pop edi
               pop ecx
               Add edi,ebx
               dec edx
               jnz @mmx_lines
          EMMS

@Finished:
POP ES
End;
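
For anyone who wants to check the masking logic outside of assembler, here is a minimal portable C model of the two loops. It is only a sketch: the function names `blit_keyed_scalar` and `blit_keyed_select` are made up for illustration, and it works one pixel at a time instead of four per MMX register. The select form is (dest AND mask) OR (sprite AND NOT mask), where mask is what PCMPEQW yields.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar reference (the 386 path): copy every sprite pixel that is not
   the key colour. The destination is only ever written. */
static void blit_keyed_scalar(uint16_t *dst, const uint16_t *src,
                              size_t n, uint16_t key)
{
    for (size_t i = 0; i < n; i++)
        if (src[i] != key)
            dst[i] = src[i];
}

/* Branch-free select (the MMX idea, one word at a time).
   mask is 0xFFFF where the sprite pixel equals the key colour,
   which is what PCMPEQW produces per 16-bit lane. */
static void blit_keyed_select(uint16_t *dst, const uint16_t *src,
                              size_t n, uint16_t key)
{
    for (size_t i = 0; i < n; i++) {
        uint16_t mask = (uint16_t)-(src[i] == key);
        /* Note the read of dst[i]: the select needs the old destination. */
        dst[i] = (uint16_t)((dst[i] & mask) | (src[i] & (uint16_t)~mask));
    }
}
```

Both functions should give identical results for any key colour. OR-ing the unmasked sprite into the destination instead would only be correct when the key colour is 0.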

---
DOS-u-akbar!

Rugxulo

Usono,
04.03.2009, 03:01

@ Laaca

Incredibly slow MMX?

> I tried to write a transparent PutSprite routine using MMX. Surprisingly it
> wasn't that hard, but the result is much slower than plain 386 code. When I
> draw into VRAM it is even about 80 times slower!
>
> My test routine counts how many PutSprite calls complete in 50 x 55 ms
> (50 timer ticks, so higher is better). I switched the CPU_info_mmx variable
> on and off and drew into either RAM or VRAM:
>
> noMMX into RAM:   8645
> MMX into RAM:     5928
> noMMX into VRAM: 10210
> MMX into VRAM:     152 (!!!)
>
> Hell, am I doing something fundamentally wrong?
>
> My computer: AMD K6 Thunderbird 1.33 GHz with GeForce4 MX card
> Measured in FreeDOS and in a DOS session under Windows 98; the results are
> the same.

N.B. I'm no expert, take this with a bucket of salt!

MTRR enabled? Aligned properly? Double buffered?

BTW, Wikipedia calls Thunderbird the improved K7 Athlon, so I can't think of any inherent reason its MMX would be weaker. (Even includes the extended MMX subset of SSE, right?) I assume it uses register renaming, but maybe that's not aggressive enough (since you have a few dependency issues, e.g. using same register as output a few times in a row). Actually it looks like your non-MMX code uses lots of MOV, ADD, and a MUL. That stuff is very basic ALU, so it's probably very very fast (i.e. very common, so heavily optimized in the cpu itself).

Laaca

Czech republic,
04.03.2009, 13:26

@ Rugxulo

Incredibly slow MMX?

> MTRR enabled? Aligned properly? Double buffered?

Yes, MTRR is enabled. What do you mean by double buffering? In my test routine I drew either into RAM (without copying to the screen afterwards) or into the visible part of VRAM.

The data is aligned: VRAM is always aligned, and Free Pascal aligns all arrays and buffers on 16 bytes.

---
DOS-u-akbar!

Japheth

Germany (South),
04.03.2009, 17:55

@ Laaca

Incredibly slow MMX?

> Hell, am I doing something fundamentally wrong?

Probably yes. AFAICS your MMX code reads from VRAM, while your 386 code avoids this. Reading VRAM is a bad idea because it's VERY slow, even with modern cards.
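
To make the difference visible without timing anything, here is a rough C sketch that counts accesses to the destination. The `Surface` struct and its counters are hypothetical instrumentation, not anything from Laaca's unit; the two routines mirror the 386 loop and the MMX loop in spirit only.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical instrumented surface: counts accesses instead of timing them. */
typedef struct {
    uint16_t *pixels;
    size_t reads;
    size_t writes;
} Surface;

static uint16_t surf_read(Surface *s, size_t i)
{
    s->reads++;
    return s->pixels[i];
}

static void surf_write(Surface *s, size_t i, uint16_t v)
{
    s->writes++;
    s->pixels[i] = v;
}

/* 386 style: compare in a register, write opaque pixels only.
   The surface is never read. */
static void put_write_only(Surface *d, const uint16_t *src,
                           size_t n, uint16_t key)
{
    for (size_t i = 0; i < n; i++)
        if (src[i] != key)
            surf_write(d, i, src[i]);
}

/* MMX style: read the destination, merge, write every pixel back. */
static void put_read_modify_write(Surface *d, const uint16_t *src,
                                  size_t n, uint16_t key)
{
    for (size_t i = 0; i < n; i++) {
        uint16_t old = surf_read(d, i);
        surf_write(d, i, src[i] == key ? old : src[i]);
    }
}
```

The pixels come out the same either way, but the second routine touches the destination with one read per pixel, and those reads are the expensive part when the destination is VRAM.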

---
MS-DOS forever!

Rugxulo

Usono,
04.03.2009, 23:50

@ Laaca

Incredibly slow MMX?

> > MTRR enabled? Aligned properly? Double buffered?
>
> Yes, MTRR is enabled. What do you mean by double buffering? In my test
> routine I drew either into RAM (without copying to the screen afterwards)
> or into the visible part of VRAM.

double buffering (Wikipedia)

> The data is aligned: VRAM is always aligned, and Free Pascal aligns all
> arrays and buffers on 16 bytes.

I meant the code should probably be aligned, but then again, I kinda doubt your Athlon is as sensitive as some older cpus, so that may not make much of a difference.

Laaca

Czech republic,
08.03.2009, 22:04
(edited by Laaca, 08.03.2009, 22:46)

@ Japheth

Incredibly slow MMX?

I wrote another routine to test MMX: a darkening procedure for a sprite (again for 16 bpp modes). Without MMX it is fairly complicated, so MMX gives a better speed gain here.

I found that MMX access to VRAM is much slower than 386 access to VRAM.
So never use MMX for direct VRAM output. It is always better to draw into a RAM buffer and then copy the result to the screen.
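
The "draw into RAM, then copy" step could look roughly like this in C. This is a sketch under assumptions: `present` and its pitch parameters (in pixels) are invented names, and the screen is modelled as an ordinary array; with a real linear framebuffer the pointer would come from the VESA mode setup.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Copy a finished frame from a back buffer in system RAM to the screen,
   row by row. Only the back buffer is read; the screen is written and
   never read, so all read-modify-write work stays in fast RAM. */
static void present(uint16_t *screen, size_t screen_pitch_px,
                    const uint16_t *back, size_t back_pitch_px,
                    size_t width_px, size_t height)
{
    for (size_t y = 0; y < height; y++)
        memcpy(screen + y * screen_pitch_px,
               back + y * back_pitch_px,
               width_px * sizeof(uint16_t));
}
```

All sprite blending and darkening would then target the back buffer, with one `present` call per frame.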

Procedure DecreaseMMXSpriteLightness(var sprite:virtualwindow;r,g,b:longint);assembler;
asm
push es
mov esi,sprite
mov ecx,[esi+6]                 {size}
mov ax,[esi+0]
mov esi,[esi+2]
mov es,ax
cmp ecx,8
jl @zbytek
cmp cpu_info_mmx,0

jz @zbytek

movd mm5,r
punpcklwd mm5,mm5
punpcklwd mm5,mm5               {R is broadcast to all four words of mm5}

movd mm6,g
punpcklwd mm6,mm6
punpcklwd mm6,mm6               {G is broadcast to all four words of mm6}


movd mm7,b
punpcklwd mm7,mm7
punpcklwd mm7,mm7               {B is broadcast to all four words of mm7}


@smycka:
movq mm1,es:[esi]             { for the R component }
movq mm2,mm1                  { for the G component }
movq mm3,mm1                  { for the B component }

psrlw mm1,11                  { R component isolated }
psllw mm2,5
psrlw mm2,5+5                 { G component isolated }
psllw mm3,11
psrlw mm3,11                  { B component isolated }


psubusw mm1,mm5               { saturating subtract from the R component }
psubusw mm2,mm6               { ...the G component }
psubusw mm3,mm7               { ...the B component }

psllw mm1,11                  { R back into place }
psllw mm2,5                   { G back into place }
                              { B is already in place }

por mm1,mm2
por mm1,mm3

movq es:[esi],mm1
add esi,8

sub ecx,8
cmp ecx,8
jge @smycka

emms
jecxz @konec

{-}@zbytek: {--- remainder, one pixel at a time ---}
shr ecx,1                    {bytes to pixels}
jz @konec                    {nothing left (also guards a zero-sized sprite)}
{}@cykl_zbytku:
movzx eax,word ptr es:[esi]  {R in EAX}
mov ebx,eax                  {G in EBX}
mov edx,eax                  {B in EDX}
shr eax,11
shr ebx,5
and ebx,63
and edx,31
sub eax,r
sub ebx,g
sub edx,b
cmp eax,0
jge @r_ok
xor eax,eax
{}@r_ok:
cmp ebx,0
jge @g_ok
xor ebx,ebx
{}@g_ok:
cmp edx,0
jge @b_ok
xor edx,edx
{}@b_ok:
shl eax,11
shl ebx,5
or eax,ebx
or eax,edx
mov es:[esi],ax
add esi,2
dec ecx
jnz @cykl_zbytku


{---------------------------------------}
@konec:
pop es
end;
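
For reference, the per-pixel arithmetic of the darkening routine can be written in plain C as below. It is a sketch, not a translation of the unit: `darken565` and `sat_sub` are illustrative names, the pixel is assumed to be RGB 5-6-5, and r, g, b are assumed to be small non-negative amounts. `sat_sub` plays the role of PSUBUSW (subtract, clamping at zero).

```c
#include <stdint.h>

/* Unsigned saturating subtract, the scalar equivalent of PSUBUSW. */
static uint16_t sat_sub(uint16_t a, uint16_t b)
{
    return a > b ? (uint16_t)(a - b) : 0;
}

/* Darken one RGB565 pixel by subtracting r, g, b from its components. */
static uint16_t darken565(uint16_t px, uint16_t r, uint16_t g, uint16_t b)
{
    uint16_t pr = sat_sub((uint16_t)(px >> 11), r);        /* 5-bit red   */
    uint16_t pg = sat_sub((uint16_t)((px >> 5) & 63), g);  /* 6-bit green */
    uint16_t pb = sat_sub((uint16_t)(px & 31), b);         /* 5-bit blue  */
    return (uint16_t)((pr << 11) | (pg << 5) | pb);
}
```

For example, `darken565(0xFFFF, 1, 2, 3)` gives 0xF7BC (R 30, G 61, B 28), and a nearly black pixel simply clamps to 0.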

---
DOS-u-akbar!

Rugxulo

Usono,
09.03.2009, 00:33

@ Laaca

Incredibly slow MMX?

> I wrote another routine to test MMX: a darkening procedure for a sprite
> (again for 16 bpp modes). Without MMX it is fairly complicated, so MMX
> gives a better speed gain here.
>
> I found that MMX access to VRAM is much slower than 386 access to VRAM.
>
> So never use MMX for direct VRAM output. It is always better to draw into
> a RAM buffer and then copy the result to the screen.

Is it possible that you can only issue a handful of MMX operations per cycle? Or that they take a long time to complete? I mean, you're basically using the FPU decoder unit (or whatever). I assume AMD's FPU finally caught up to Intel's pipelined 587 with the Athlon. Still, there is always a bottleneck, so maybe you hit it.

And maybe your issue is the EMMS (although FEMMS is not supported on Intel and possibly a no-op on Athlon or newer). Or it could be the overhead from FNSAVE / FRSTOR, perhaps? Are you or the compiler doing other FPU-related stuff? Have you tried in pure DOS without the context switches of multitasking?

mht

Wroclaw, Poland,
21.03.2009, 12:57

@ Laaca

Incredibly slow MMX?

While looking for something else, I found the following in http://www.microsoft.com/whdc/archive/GDInext.mspx:

Writes by the CPU to video memory surfaces are also acceptably fast--thanks to write combining--and throughput is typically 200 MB/s on the latest AGP (accelerated graphics port) systems. Read speeds, however, are terrible, typically maxing out at 12 MB/s on the latest AGP systems. This read performance is anathema to most MMX routines, which are typically read-modify-write by nature of their vector processing. It is also a problem for any routines that must explicitly do read or read-modify-write operations, such as is the case with almost all image processing filters or Microsoft DirectX Transform plug-ins.
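
A back-of-the-envelope check of those two figures, as a small C sketch. The function names are made up, and the model assumes reads and writes do not overlap and that both loops touch every byte, which is a simplification (the 386 loop in this thread skips transparent pixels entirely).

```c
/* Seconds to write `mb` megabytes at `write_mbps` MB/s. */
static double write_only_seconds(double mb, double write_mbps)
{
    return mb / write_mbps;
}

/* Seconds to read and then write back `mb` megabytes. */
static double rmw_seconds(double mb, double read_mbps, double write_mbps)
{
    return mb / read_mbps + mb / write_mbps;
}
```

With the quoted 12 MB/s read and 200 MB/s write rates, `rmw_seconds(1, 12, 200) / write_only_seconds(1, 200)` comes out at roughly 17.7, so under this model a read-modify-write pass over video memory is well over an order of magnitude slower than a write-only pass.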

DOS386

22.03.2009, 06:11

@ Laaca

Incredibly slow MMX?

> When I draw into VRAM it is even about 80 times slower!

Old GUI programmer's rule: never read from VRAM. If you ever need to read back some screen data, set up a back buffer.

---
This is a LOGITECH mouse driver, but some software expect here
the following string:*** This is Copyright 1983 Microsoft ***
