DOS ain't dead - Assembler optimisation

Laaca

Czech republic,
13.05.2012, 20:38

Assembler optimisation - how to avoid a jump? (Developers)

I am writing a kind of RLE sprite routine with clipping.
I would like to write nice code. I know I could somehow remove the ugly jump "jnc @RLE_CLIP_MOVE_CONT", but I don't know how.

EDX holds the length of the displayable scanline (computed by the clipping routine).
ECX holds the number of pixels that could be copied. It is always > 0.
After clipping I sometimes can't copy all ECX pixels, only EDX of them.
That is what this code does.

@RLE_CLIP_MOVE:
sub edx,ecx
jnc @RLE_CLIP_MOVE_CONT

add ecx,edx

@RLE_CLIP_MOVE_CONT:
shr ecx,1;rep movsd;adc ecx,ecx;rep movsw {fast 32bit write}

cmp edx,0
jle @RLE_CLIP_END_LOOP
jmp @RLE_CLIP_SCANLINE_LOOP

---
DOS-u-akbar!

Rugxulo

Usono,
13.05.2012, 21:10

@ Laaca

Assembler optimisation - how to avoid a jump?

> I write some kind of RLE sprite routine with clipping.
> I would like to write nice code. I know I could somehow remove the ugly
> jump "jnc @RLE_CLIP_MOVE_CONT" but I don't know how.

Is the jump hurting performance or just "feels" bad? I wouldn't sweat it, honestly.

> @RLE_CLIP_MOVE:
> sub edx,ecx
> jnc @RLE_CLIP_MOVE_CONT
>
> add ecx,edx

Seems redundant. "cmp" is better here (esp. since it's basically "sub" without saving result but changes flags). If your target is 686+, try "cmovle" or similar.

> @RLE_CLIP_MOVE_CONT:
> shr ecx,1;rep movsd;adc ecx,ecx;rep movsw {fast 32bit write}

This still assumes an even number. Anyways, adc ecx,ecx is pointless as it's always one or zero after rep. Hmmm, maybe that's the point? Otherwise I'd say drop the second "rep".

Dunno, I'm not good thinking outside of a debugger, but ...

shr ecx,1 ; div by 2? should this be "shr ecx,2" (div by 4)?
rep movsd ; should this be "rep movsw"?
adc ecx,ecx ; would "adc cl,cl" be smaller?
rep movsw ; should this be "rep movsb"?

> cmp edx,0
> jle @RLE_CLIP_END_LOOP
> jmp @RLE_CLIP_SCANLINE_LOOP

Use "test edx,edx" here. Very very minor difference, I know, but still ....

Laaca

Czech republic,
13.05.2012, 22:15

@ Rugxulo

Assembler optimisation - how to avoid a jump?

>shr ecx,1;rep movsd;adc ecx,ecx;rep movsw {fast 32bit write}

It is a nice piece of code, I think.
ECX is the number of pixels to plot. I forgot to mention that I draw in 16bpp mode,
so one pixel is two bytes.

If I want to transfer 2 pixels, then I move 4 bytes. So:
2 shr 1 = 1 (and CF is set to zero)

REP MOVSD with ECX=1 does one pass of dword transfer, so two pixels are moved.
After this ECX is guaranteed to be zero. And because CF is zero too, ECX is still zero after ADC ECX,ECX. Then...
REP MOVSW with ECX=0 SKIPS THE TRANSFER

That is how this piece works.
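
For readers following along, the same sequence can be modelled in C. This is only a sketch: `copy_pixels16` and its parameters are hypothetical names, not taken from the original routine, and the two loops stand in for REP MOVSD and REP MOVSW. It also covers the odd case, where SHR leaves CF=1 and ADC ECX,ECX turns that into one final word move.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C model of: shr ecx,1 ; rep movsd ; adc ecx,ecx ; rep movsw
   n plays the role of ECX (pixel count); one pixel is one 16-bit word. */
static void copy_pixels16(uint16_t *dst, const uint16_t *src, uint32_t n)
{
    assert(dst != 0 && src != 0);

    uint32_t cf  = n & 1u;   /* shr ecx,1: the bit shifted out goes to CF */
    uint32_t ecx = n >> 1;   /* ECX = number of pixel pairs (dwords)      */

    while (ecx != 0) {       /* rep movsd: two pixels per pass, ECX -> 0  */
        dst[0] = src[0];
        dst[1] = src[1];
        dst += 2;
        src += 2;
        ecx--;
    }

    ecx = ecx + ecx + cf;    /* adc ecx,ecx: 0 + 0 + CF, so 0 or 1        */

    while (ecx != 0) {       /* rep movsw: moves the odd pixel, if any    */
        *dst++ = *src++;
        ecx--;
    }
}
```

With n = 5 the first loop moves four pixels and the second loop moves the fifth; with n = 2 the second loop is skipped, as described above.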

---
DOS-u-akbar!

marcov

13.05.2012, 23:22

@ Laaca

Assembler optimisation - how to avoid a jump?

> >shr ecx,1;rep movsd;adc ecx,ecx;rep movsw {fast 32bit write}
> If I want to transfer 2 pixels, then I move 4 bytes. So:
> 2 shr 1 = 1 (and CF is set to zero)
>
> REP MOVSD with ECX=1 does one pass of dword transfer, so two pixels are
> moved.
> After this ECX is guaranteed to be zero. And because CF is zero too, ECX is
> still zero after ADC ECX,ECX. Then...
> REP MOVSW with ECX=0 SKIPS THE TRANSFER
>
> That is how this piece works.

The fast move routine that I use is the SSE (move_JOH_SSE_10) routine in the archive on this site:

http://fastcode.sourceforge.net/

(the results of some runtime optimization contests for Delphi; most of them flowed into D2006, but for more specialised uses I still browse through them, even if they are slightly outdated)

I also use it for images (but my row lengths are typically long, 2048 1-byte pixels etc.). We have an Athlon 64/Core 2 minimum, though.

Note that, afaik, it is common to first align with a few movsb, then do the bulk with the largest-granularity aligned move, and then maybe another few odd ones.
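
The head / bulk / tail pattern mentioned above could look roughly like this in C. It is a sketch under assumptions: `copy_aligned_bulk` is a hypothetical name, the bulk granularity is 4 bytes (standing in for MOVSD; an SSE routine would use 16), and only the destination is aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: byte moves until dst is aligned, then 4-byte moves,
   then the leftover bytes. Source and destination must not overlap. */
static void copy_aligned_bulk(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    /* Head: a few single-byte moves (like movsb) until dst is 4-byte aligned. */
    while (n > 0 && ((uintptr_t)dst & 3u) != 0) {
        *dst++ = *src++;
        n--;
    }

    /* Bulk: 4 bytes per step (like rep movsd). The fixed-size memcpy calls
       keep the possibly unaligned read from src well defined in C. */
    while (n >= 4) {
        uint32_t w;
        memcpy(&w, src, 4);
        memcpy(dst, &w, 4);
        dst += 4;
        src += 4;
        n -= 4;
    }

    /* Tail: the remaining 0..3 odd bytes. */
    while (n > 0) {
        *dst++ = *src++;
        n--;
    }
}
```

Whether the extra head and tail handling pays off depends on the row length; for short runs the setup cost may outweigh the aligned bulk moves.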

ecm

Düsseldorf, Germany,
13.05.2012, 23:32

@ Laaca

Assembler optimisation - how to avoid a jump?

> I would like to write nice code. I know I could somehow remove the ugly
> jump "jnc @RLE_CLIP_MOVE_CONT" but I don't know how.

Hmm, I'm not very good at this. (Usually, I try to rather optimise for code size, and in pure 8086-compatible code.) But let me try.

> sub edx,ecx
> jnc @RLE_CLIP_MOVE_CONT
>
> add ecx,edx
>
> @RLE_CLIP_MOVE_CONT:

The following assumes that eax is available:

sub edx, ecx
rcr eax, 1
sar eax, 31
and eax, edx
add ecx, eax

This does avoid the jump, but I don't know whether it is actually much better speed-wise.
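
To check the sequence, here is a rough C model of it. It is a sketch only: `clip_rcr_sar` and `struct regs` are made-up names, and unsigned 32-bit wraparound stands in for the register arithmetic. The borrow from SUB is rotated into bit 31 of EAX, SAR spreads it into a mask of all ones or all zeros, and the mask decides whether the (wrapped) difference is added back to ECX, so ECX ends up as the smaller of the two original values.

```c
#include <assert.h>
#include <stdint.h>

struct regs { uint32_t edx, ecx; };  /* hypothetical: values after the sequence */

static struct regs clip_rcr_sar(uint32_t edx, uint32_t ecx, uint32_t eax)
{
    assert(ecx > 0);                       /* the original post says ECX is always > 0  */

    uint32_t cf = (edx < ecx) ? 1u : 0u;   /* sub edx,ecx : CF = unsigned borrow        */
    edx -= ecx;                            /*               EDX wraps if CF is set      */
    eax = (eax >> 1) | (cf << 31);         /* rcr eax,1   : CF enters bit 31            */
    eax = (eax >> 31) ? 0xFFFFFFFFu : 0u;  /* sar eax,31  : sign bit fills the register */
    eax &= edx;                            /* and eax,edx : the difference, or zero     */
    ecx += eax;                            /* add ecx,eax : ECX = old EDX if clipped    */

    struct regs r = { edx, ecx };
    return r;
}
```

The incoming value of EAX does not matter, because SAR by 31 discards everything except the bit that came from CF.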

---
l

Laaca

Czech republic,
14.05.2012, 18:15

@ ecm

Assembler optimisation - how to avoid a jump?

I measured my original code, CM's code and another variant found on the internet:

VARIANT 1:
sub edx,ecx
jnc @RLE_CLIP_MOVE_CONT
add ecx,edx
@RLE_CLIP_MOVE_CONT:

VARIANT 2:
sub edx, ecx
rcr eax, 1
sar eax, 31
and eax, edx
add ecx, eax

VARIANT 3:
sub edx,ecx
sbb eax,eax
and eax,edx
add ecx,eax

The differences are small. Variants 2 and 3 are slightly faster than V1. The difference between V2 and V3 is only borderline significant; maybe V3 is slightly faster, but the measurement would have to be done in pure DOS, not in the Win98 I am running just now.
I tested only on a Pentium 4 machine; due to lack of time I haven't tested on my Pentium III.

---
DOS-u-akbar!

ecm

Düsseldorf, Germany,
14.05.2012, 18:22

@ Laaca

Assembler optimisation - variant 3, sbb

> VARIANT 3:
> sub edx,ecx
> sbb eax,eax
> and eax,edx
> add ecx,eax

Oh yeah, I forgot about sbb for that.

> The difference between V2 and V3 is only borderline significant; maybe V3 is
> slightly faster, but the measurement would have to be done in pure DOS, not
> in the Win98 I am running just now.

Hmm, even if variant 3 is only a little faster or about equally fast, it's less code and makes the source easier to understand.
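
As a sanity check, variants 1 and 3 can be modelled side by side in C. This is a sketch with invented names (`clip_jump`, `clip_sbb`, `variants_agree`, `struct regs`), not the measured code. It only illustrates that SBB EAX,EAX produces the all-ones or all-zeros mask directly from the carry flag, so both variants leave the same values in EDX and ECX.

```c
#include <assert.h>
#include <stdint.h>

struct regs { uint32_t edx, ecx; };  /* hypothetical: values after the sequence */

/* Variant 1: sub edx,ecx ; jnc skip ; add ecx,edx ; skip: */
static struct regs clip_jump(uint32_t edx, uint32_t ecx)
{
    assert(ecx > 0);
    uint32_t cf = (edx < ecx) ? 1u : 0u;  /* sub edx,ecx : CF = unsigned borrow */
    edx -= ecx;
    if (cf)                               /* jnc skips the add when CF = 0      */
        ecx += edx;                       /* add ecx,edx                        */
    struct regs r = { edx, ecx };
    return r;
}

/* Variant 3: sub edx,ecx ; sbb eax,eax ; and eax,edx ; add ecx,eax */
static struct regs clip_sbb(uint32_t edx, uint32_t ecx)
{
    assert(ecx > 0);
    uint32_t cf = (edx < ecx) ? 1u : 0u;  /* sub edx,ecx                        */
    edx -= ecx;
    uint32_t eax = 0u - cf;               /* sbb eax,eax : eax - eax - CF       */
    eax &= edx;                           /* and eax,edx                        */
    ecx += eax;                           /* add ecx,eax                        */
    struct regs r = { edx, ecx };
    return r;
}

/* Returns 1 when both variants produce identical register values. */
static int variants_agree(uint32_t edx, uint32_t ecx)
{
    struct regs a = clip_jump(edx, ecx);
    struct regs b = clip_sbb(edx, ecx);
    return a.edx == b.edx && a.ecx == b.ecx;
}
```

Agreement in this model says nothing about speed; that still has to be measured on the target CPU.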

---
l

Rugxulo

Usono,
16.05.2012, 10:34

@ Laaca

Assembler optimisation - how to avoid a jump?

> I measured my original code, CM's code and another variant found on the
> internet:

Could depend on many factors.

> VARIANT 1:
> sub edx,ecx
> jnc @RLE_CLIP_MOVE_CONT
> add ecx,edx
> @RLE_CLIP_MOVE_CONT:

6 bytes

> VARIANT 2:
> sub edx, ecx
> rcr eax, 1
> sar eax, 31
> and eax, edx
> add ecx, eax

11 bytes

> VARIANT 3:
> sub edx,ecx
> sbb eax,eax
> and eax,edx
> add ecx,eax

8 bytes

> The differences are small. Variants 2 and 3 are slightly faster than V1.

Jumps usually aren't that expensive, and branch prediction makes them reasonable. Of course, Darek Mihocka says avoid them where possible (e.g. his BOCHS optimizations), but it's such a common thing for x86 that I've never even bothered worrying about it.

I think even correctly taken jumps cost 2 cycles on a 486. Pentium assumed all backwards jumps were taken and forwards weren't. On a P4, you could also use jump hints, but I'd doubt it would help (much, if at all, might even hurt, who knows).

It gets more complicated because of EFLAGS and register dependencies, which may or may not cause AGIs (esp. on 486). And some of the more CISC-y instructions (RCR, I assume) will probably not be pairable on a Pentium. PPro/686 has the whole 4-1-1 microcode bullcrap, and don't forget that pipelines fill up faster on older machines, hence sometimes smaller code is better.

> The difference between V2 and V3 is only borderline significant; maybe V3 is
> slightly faster, but the measurement would have to be done in pure DOS, not
> in the Win98 I am running just now.
> I tested only on a Pentium 4 machine; due to lack of time I haven't tested
> on my Pentium III.

Pentium 4 has no barrel shifter, so VARIANT 2 will always be slower there (I think?).

There's honestly nothing horrible about any of these versions, they all work more or less the same. The difference is very very minor. You also have to worry about on-chip cache size, instruction timings, latency / throughput, code and data alignment, and avoid nearby self-modifying code. You're probably more limited by OS calls or HD or RAM access speeds.

It's fun to "pretend" to even barely (0.0001%) understand this stuff, but it's so incredibly arcane and (almost) useless, impossible, etc. (in my unprofessional opinion). There is no easy answer, and I'd doubt anybody really does it well across various x86 subarchitectures. (Some newer ones broke old optimizations, so that really sucks.) I wouldn't worry about it (or choose the safer path of cm and myself, optimize for size!).

bretjohn

Rio Rancho, NM,
16.05.2012, 16:42

@ Rugxulo

Assembler optimisation - how to avoid a jump?

I'll just add another two cents' worth of opinion.

There are at least four general categories of things to keep in mind when writing code: speed, size, compatibility, and maintainability. To me, those are listed in reverse order of priority, though there is a constant balancing act.

The most important is maintainability, which at a minimum means lots of comments and "logical" organization of the code.

Compatibility with other programs and standards (even if de facto), even if it means a larger program size, is generally more important than the size.

Size matters A LOT, especially when you are dealing with DOS programs, and even more when you are dealing with TSR's. To me, this is not necessarily the size of the executable program file itself, but the amount of memory it ultimately requires.

Optimizing for speed is an almost endless chasing of your tail, and is not usually very productive. Something that is optimized for one particular manufacturer or model or stepping (or even cache size) of CPU doesn't mean it is optimized for every scenario, even if you avoid using any special CPU-specific instructions. You can do certain things in your code that can help the speed if the CPU uses pipelines (and that don't hurt performance if the CPU doesn't have pipelines), but even those can make the code more confusing (harder to understand) and therefore less maintainable. In most (but not all) situations, speed is a relatively minor concern. Even then, though, the speed issues don't usually involve specific CPU instructions, but more a "philosophy" of how to write the program (e.g., minimizing the number of switches between PM and V86/RM).

ecm

Düsseldorf, Germany,
16.05.2012, 16:59

@ bretjohn

Assembler optimisation - speed, size, etc

> Size matters A LOT, especially when you are dealing with DOS programs, and
> even more when you are dealing with TSR's. To me, this is not necessarily
> the size of the executable program file itself, but the amount of memory it
> ultimately requires.

So actually we could draw a distinction between "executable size on disk" and "process size in memory". The latter can even, if applicable, further be split into "transient memory usage" and "resident memory usage".

Speaking of which, I'm just now writing an experimental program where I expect the core loop's speed might noticeably affect "user" experience, so to say. Hence, while I could have optimised it for size as usual, I instead put most of that loop into a macro and then use that to write several different variants of the loop code to be used in various circumstances. (For example, the default pair of the loop's variants only handles continuous single file regions up to 8 KiB. In 99% of usage cases, that will suffice, but I added an alternative pair that is able to handle regions up to 512 MiB instead - this is only executed when needed.) This program only ever executes its code in a transient/"foreground" process, so a bit of "wasted" code size isn't very relevant.

> In most
> (but not all) situations, speed is a relatively minor concern.

I'd say the speed of most code generally is rather unimportant. To improve timing, one should focus on the parts of their code that the program actually spends a lot of time in. Putting a lot of effort into optimising all of a program's code for speed is unnecessary.

> Even then,
> though, the speed issues don't usually involve specific CPU instructions,
> but more a "philosophy" of how to write the program (e.g., minimizing the
> number of switches between PM and V86/RM).

True. Just recently, I noticed that a program I was using/developing employed a rather suboptimal algorithm for table lookup. It wasn't a critical problem (at least not on modern CPUs), it just wasn't designed very well. Purely for amusement, I set up a sort of benchmark for the affected algorithm, and determined that my (very simplistic) improvement consistently sped that section up by more than ten times.

---
l