Rugxulo
Usono, 14.03.2020, 08:13 |
rebuilding NASM 0.98.39 (2005) for 16-bit 8086 host (Developers) |
> > (Remember I rebuilt old NASM 0.98.39 with Turbo C++ 1.01?
> > I assume NASM's preprocessor would make a great overlay
> > since you only need it at the beginning.
>
> Actually, the NASM preprocessor is run on each pass of the assembler.
> That's why my sources won't work with nasm -E (preprocess-only mode).
As you've noticed, the official build of 0.98.39 for 16-bit DOS was 186 only (it uses the "ENTER" instruction). Not only is that pointless (little size or memory savings) and 4x slower on modern machines, but some emulators hate it (Fake86, 8086tinyplus). Your 8086tiny fork fixes that bug, among other things.
Newer NASM 2.xx versions since 2007 have x64 support and require a C99 compiler, so they don't build for anything less than (386+?) pmode. I do have some .BATs to build some NASM 2.x versions in native DOS with DJGPP (GCC 3.2.3 or newer), but I haven't finalized the 2.13.03 one due to other priorities. I dislike the idea of only cross-compiling it.
Here are my two .EXEs and makefiles for 8086 (0.98.39), one built with freeware Turbo C++ 1.01 and one with (OSI) OpenWatcom 1.9. The former needed to switch from Large to Huge model just to build correctly (no /Ff switch?), which I assume makes it waste more precious RAM than Large does. (Or is their malloc inefficient?) Since these are using conventional memory only, they probably need further fixes (overlays, optional EMS, or more trimming). I only enabled bin and obj output support, but that wasn't quite enough. Probably I need to go in (insns.pl??) and disable anything past 586. (Why bother with SSE2 in such a limited environment?) It does reassemble PSR Invaders, but that's very small and simple, so it's not a good test.
Rebuilding the FreeDOS kernel ran out of memory (Turbo C++ build only), but apparently it works when not run under WmakeR. So the bloody Make program is eating too much (scarce) conventional RAM! I definitely didn't want to be forced to rely on an obsolete 16-bit NASM to build the kernel, due to other problems. So I was using WmakeR just to avoid "extender" mixups (DOS4GW vs. CWSDPMI used by NASM or UPX). But you can use CWSTUB renamed as DOS4GW and then use WMake (386 pmode version), and then apparently it doesn't waste as much conventional memory, so it can successfully reassemble those files needed by the kernel (tested only patched 2041, set NASMENV=-O9 first).
So I don't necessarily need to delve further, especially since you can just use the 386 DJGPP (or older 386 OpenWatcom) build of NASM. Hey, OpenWatcom itself needs 386, so who cares? Still, I feel like a 16-bit build shouldn't be so hampered. Also, it's annoying that it doesn't even tell you how much memory was free, how much was used, nor how much was unsuccessfully (mis)allocated. I haven't checked closely to see what debug hooks or instrumentation they included in the (old, outdated) 0.98.39 sources.
> As noted there, this made the assembly take more than 2 minutes.
> Therefore I disabled this option
Even FASM has noticeable slowdowns with complex macros. But fasmg is even slower since it's all interpreted (although his recent CALM additions have sped it up a lot). |
Rugxulo
Usono, 16.03.2020, 01:24
@ Rugxulo
|
rebuilding NASM 0.98.39 (2005) for 16-bit 8086 host |
> Probably I need to go in (insns.pl??) and disable anything past 586.
> (Why bother with SSE2 in such a limited environment?)
Assuming that was intentional, by design, and not bugged, I used AWK to modify insns.dat to avoid anything past 586. Then I used (DJGPP Perl) insns.pl to rebuild some sources (see makefile.in). Then NASM still compiled okay (with TC++) but crashed when run (invalid opcode) without warning, even when used with no args. So either I'm misunderstanding or it needs even heavier fixing.
Should I rebuild 0.97 instead? Not quite as good (no size optimizations via -O3), but at least it's smaller (586/MMX only). But there are probably all kinds of old bugs. (IIRC, the prebuilt DOS binary included all output formats, ugh, such a waste of RAM.) N.B. I already used AWK to read the .LST and auto-fix the source of PSR Invaders so that NASM 0.97 will 100% match in size optimizations, but that's quite a kludge.
> Rebuilding the FreeDOS kernel ran out of memory (Turbo C++ build only), but
> apparently it works when not run under WmakeR. So the bloody Make program
> is eating too much (scarce) conventional RAM!
A quick look at the (OW 1.9) Tools manual didn't show anything about Make swapping. Probably it isn't supported or wasn't a huge priority. I haven't looked into 2.0-pre, but it's probably no improvement.
Turbo C++ 1.01 (Make 3.0?) has .SWAP (or -S option) to swap to disk (saving 50k of low RAM). DMake (4.00pl2?) lets you put % before commands to do the same (saving about 100k of low RAM).
> Also, it's annoying that it doesn't even tell you how much
> memory was free, how much was used, nor how much was unsuccessfully
> (mis)allocated. I haven't checked closely to see what debug hooks or
> instrumentation they included
-DLOGALLOC writes to malloc.log (big!), but I'm not sure that's quite what I wanted to know either. |
rr
Berlin, Germany, 16.03.2020, 15:22
@ Rugxulo
|
rebuilding NASM 0.98.39 (2005) for 16-bit 8086 host |
> > Probably I need to go in (insns.pl??) and disable anything past 586.
> > (Why bother with SSE2 in such a limited environment?)
>
> Assuming that was intentional, by design, and not bugged, I used AWK to
> modify insns.dat to avoid anything past 586. Then I used (DJGPP Perl)
> insns.pl to rebuild some sources (see makefile.in). Then NASM still
> compiled okay (with TC++) but crashed when run (invalid opcode) without
> warning, even when used with no args. So either I'm misunderstanding or it
> needs even heavier fixing.
"Invalid opcode" comes from the CPU treating data as code, as you probably know.
Does it happen under "all" memory configurations, i.e., w/o JEMM386?
Does it occur if you only remove anything after line 1166 ("Katmai"...)? --- Forum admin |
Rugxulo
Usono, 17.03.2020, 05:20
@ rr
|
rebuilding NASM 0.98.39 (without MMX/3DNOW/686/SSE) |
> "Invalid opcode" comes from the CPU treating data as code, as you probably
> know.
> Does it happen under "all" memory configurations, i.e., w/o JEMM386?
I always run XMS only these days, so it's not due to EMM386.
> Does it occur if you only remove anything after line 1166 ("Katmai"...)?
Actually, the newer, bloated SIMD stuff starts earlier than that in the file. It's not grouped properly, apparently unsorted.
But, no, I found the problem, a simple error on my part. I forgot to also include "ignore" (pseudo-instructions, e.g. EQU or DW) in my trimmed version.
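A minimal sketch of that kind of trimming pass, including the pseudo-instruction pitfall just described. It assumes every non-comment line of insns.dat has four whitespace-separated fields ending in a comma-separated flag list. The sample lines, the ALLOWED_CPU and BLOCKED_EXT sets and the function names are illustrative only; they are not copied from the real file or from the AWK script used here.

```python
# Hypothetical trimming filter for an insns.dat-style table (assumed layout:
# mnemonic, operands, code string, comma-separated flags).
ALLOWED_CPU = {"8086", "186", "286", "386", "486", "PENT"}
BLOCKED_EXT = {"MMX", "3DNOW", "SSE", "SSE2"}

SAMPLE = r"""
; illustrative lines only, not copied from the real insns.dat
DB      ignore          ignore                  ignore
AAA     void            \1\x37                  8086
CPUID   void            \2\x0F\xA2              PENT
EMMS    void            \2\x0F\x77              PENT,MMX
ADDPS   xmmreg,xmmreg   \301\331\2\x0F\x58\110  KATMAI,SSE
""".strip().splitlines()


def keep(line, keep_pseudo=True):
    """Decide whether one table line survives the trim."""
    text = line.strip()
    if not text or text.startswith(";"):
        return True                      # blank lines and comments pass through
    fields = text.split()
    if len(fields) != 4:
        return True                      # unknown layout: leave it alone
    flags = set(fields[3].split(","))
    if flags == {"ignore"}:
        return keep_pseudo               # pseudo-instructions such as DB or EQU
    # Keep only old CPU levels that carry no blocked extension flag.
    return bool(flags & ALLOWED_CPU) and not (flags & BLOCKED_EXT)


def trim(lines, keep_pseudo=True):
    return [line for line in lines if keep(line, keep_pseudo)]


def mnemonics(lines):
    """First field of every non-comment, non-blank line."""
    return [l.split()[0] for l in lines if l.strip() and not l.startswith(";")]


print(mnemonics(trim(SAMPLE)))
print(mnemonics(trim(SAMPLE, keep_pseudo=False)))
```

With keep_pseudo=False the DB line vanishes from the output, which mirrors the mistake of forgetting the "ignore" entries.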
So it works (after running DJGPP Perl four times to regenerate some sources). I doubt it saved too much actual runtime memory, but the .EXE itself saved 30k (or 6 kb UPX'd), which is seemingly not much. And it runs okay (reassembles INVADERS.COM correctly).
It still didn't fit into Large model (like OpenWatcom), though, but if I exclude "PENT" entirely (which also includes MMX and 3DNOW for later 586s), then it fits into Large model okay. So then it's roughly 50 kb smaller than the "full" .EXE. But it only saves 9 kb UPX'd.
In other words, I'm sure it helps some, but it's maybe not as much as I expected. Saving 50 kb while WmakeR uses 150 kb is probably not enough (for that nonsensical problem, already worked around elsewhere).
So, long story short, I need to rebuild twice and upload those for us: 1). 586/MMX/3DNOW only (no 686 or SSE), 2). 486 (or early 586, e.g. CPUID, but no MMX/3DNOW) only (Large model, is it noticeably faster??). Hmmm, maybe 286 only would be interesting, for comparison (like A86).
But I want to avoid relying on Perl at all, so I need to find a shortcut to trim the pre-existing generated (included) insns*.[ch] files (awk? sed? homegrown tool again?).
P.S. Does anyone know if -O9 will forcibly run nine passes or only those passes needed (e.g. three)? I could be wrong, but I think it does all nine! Inefficient, but maybe it doesn't (or can't) know. Though some assemblers can figure it out (if nothing changed in previous pass then exit?). Maybe that's why -O9v tells you (kinda) how many it took, so you can only run as many as needed next time? Not sure. |
ecm
Düsseldorf, Germany, 17.03.2020, 06:58
@ Rugxulo
|
rebuilding NASM 0.98.39 (without MMX/3DNOW/686/SSE) |
> P.S. Does anyone know if -O9 will forcibly run nine passes or only those
> passes needed (e.g. three)? I could be wrong, but I think it does all nine!
> Inefficient, but maybe it doesn't (or can't) know. Though some assemblers
> can figure it out (if nothing changed in previous pass then exit?). Maybe
> that's why -O9v tells you (kinda) how many it took, so you can only run as
> many as needed next time? Not sure.
At least in recent versions, -O9 or -O2 are entirely the same as -Ox, i.e., they enable multi-pass optimisation. Here's the manual text about it: https://www.nasm.us/xdoc/2.14.02/html/nasmdoc2.html#section-2.1.23
> -Ox (where x is the actual letter x): Multipass optimization. Minimize branch offsets and signed immediate bytes, overriding size specification unless the strict keyword has been used (see section 3.7). For compatibility with earlier releases, the letter x may also be any number greater than one. This number has no effect on the actual number of passes. --- l |
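The "stop as soon as nothing changed in the previous pass" idea from the question can be sketched as a fixed-point loop. This is a toy model, not NASM's actual code: it assumes 16-bit code where a short jump takes 2 bytes and a near jump 3, and the program layout, names and the max_passes limit are all made up for illustration.

```python
# Toy multipass branch sizing: every jump starts short and is widened only
# when its target is out of rel8 range. Iterate until a pass changes nothing.
SHORT, NEAR = 2, 3   # assumed sizes of "jmp short" and "jmp near" in 16-bit code


def size_jumps(program, max_passes=9):
    """program is a list of ("label", name), ("jmp", name) or ("op", nbytes)."""
    sizes = {i: SHORT for i, item in enumerate(program) if item[0] == "jmp"}
    for passes in range(1, max_passes + 1):
        # Lay out the program with the current jump sizes.
        labels, starts, addr = {}, {}, 0
        for i, item in enumerate(program):
            if item[0] == "label":
                labels[item[1]] = addr
            elif item[0] == "jmp":
                starts[i] = addr
                addr += sizes[i]
            else:
                addr += item[1]
        # Widen any short jump whose displacement no longer fits in a byte.
        changed = False
        for i, start in starts.items():
            if sizes[i] == SHORT:
                disp = labels[program[i][1]] - (start + SHORT)
                if not -128 <= disp <= 127:
                    sizes[i] = NEAR
                    changed = True
        if not changed:
            return sizes, passes          # stable: no further pass is needed
    raise RuntimeError("jump sizes did not settle within max_passes")


PROGRAM = [
    ("jmp", "A"), ("op", 125), ("jmp", "B"),
    ("label", "A"), ("op", 200), ("label", "B"),
]
sizes, passes = size_jumps(PROGRAM)
print(sizes, passes)
```

In this example the second jump is widened first, which pushes label A one byte further away and forces the first jump to widen on the next pass; the loop then stops after a third pass that changes nothing.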
Rugxulo
Usono, 22.03.2020, 00:42
@ Rugxulo
|
rebuilding NASM 0.98.39 (without MMX/3DNOW/686/SSE) |
> It still didn't fit into Large model (like OpenWatcom), though, but if I
> exclude "PENT" entirely (which also includes MMX and 3DNOW for later 586s),
> then it fits into Large model okay. So then it's roughly 50 kb smaller than
> the "full" .EXE. But it only saves 9 kb UPX'd.
The 486 and 586 didn't add much (not counting MMX, 3DNow!), so I felt I should keep those. But everything else (Cyrix, MMX, 3DNow!, P6, IA64 [JMPE], SSE[123]) I deleted. I also had to use "-d" (merge duplicate strings). I call this NASM lite. (FASM also has a "lite" version for DOS nowadays, back to using his unreal hack by omitting all of the bloated AVX instructions. He claims AVX only works in 16-bit pmode anyways.)
BTW, I heard that newer AMD cpus don't even have 3DNow! anymore. And even Intel was rumored to be tweaking compilers (GCC?) to target SSE2 with MMX intrinsics (so they could remove them from chips later?? dunno). Something weird like that.
> But I want to avoid relying on Perl at all, so I need to find a shortcut to
> trim the pre-existing generated (included) insns*.[ch] files (awk? sed?
> homegrown tool again?).
I just manually cobbled together some Sed scripts from Diff'd output. They're all inlined into the makefile, so no bloated DJGPP Perl (5.8.8 from 2007) needed to rebuild this. Latest Perl upstream seems to be 5.30.2.
NASM 0.97 did have INSNS.BAS (QBASIC), but maybe it wasn't kept working, so it was dropped. I didn't try that here. I think AWK would be more reasonable. |
Rugxulo
Usono, 22.03.2020, 00:48
@ ecm
|
rebuilding NASM 0.98.39 (without MMX/3DNOW/686/SSE) |
ecm, do you remember, ten years ago (2010), saying you reduced NASM 2.09 down to get it to build for 16-bit DOS?
> (NASM version 2.09 available)
>
> I today got it to compile for an 8086 DOS target after stripping
> the supported instruction set down to 686 - both executables
> (NASM and NDISASM) appear to work, but NASM.EXE is a 460 KiB file
> which means that it barely runs at all and usually crashes if I try
> to assemble anything at all (at least I hope the crash is only due
> to running out of memory). NDISASM.EXE works fine though, and is
> just about 110 KiB.
>
> So NASM still can be compiled for 8086 systems. If I have time for that,
> I might look into stripping it down better to get the executable's size
> down. Maybe write some code to let it use disk or XMS swapping? Hmm. |
ecm
Düsseldorf, Germany, 22.03.2020, 10:13
@ Rugxulo
|
rebuilding NASM 0.98.39 (without MMX/3DNOW/686/SSE) |
> ecm, do you remember, ten years ago (2010), saying you reduced NASM
> 2.09 down to get it to build for 16-bit DOS?
No, I don't actually. Been a long time. May look around on my (now Debian testing amd64) development machine to see if I find any traces of that. Will have to set up OpenWatcom on that machine to attempt this again, should I decide to.
For my current projects I usually need at least 2.10, I did some testing recently but don't remember which exact versions worked. (And Google's Hangouts is difficult to search.) I think some specific 2.09.xx versions worked too.
The memory problem I'd noted back then is truer than ever. The current (symbolic branch) lDebug source is simply way too large. Without *extensive* work to do some kind of swapping this won't ever be able to be assembled in 86 Mode. Here's a current count (not including the symsnip and lmacros files):
ldebug/source$ for file in *.asm *.mac; do echo "$file"; done >> filelist.txt
ldebug/source$ cntlines @filelist.txt
List file: filelist.txt
Files: 20
Bytes: 922787
Total lines: 41120
Blanks: 4607
Comment only: 7886
Actual code: 28627
--- l |
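For readers without the cntlines tool, a rough equivalent of the three-way split shown above could look like the following. It is only a sketch under the assumption that a "comment only" line is one whose first non-blank character is ';' (NASM's comment marker); the real tool may classify lines differently, and the sample source is invented.

```python
# Rough line classifier for NASM-style sources: blank, comment-only, or code.
def classify_lines(lines):
    counts = {"total": 0, "blank": 0, "comment": 0, "code": 0}
    for line in lines:
        counts["total"] += 1
        text = line.strip()
        if not text:
            counts["blank"] += 1
        elif text.startswith(";"):
            counts["comment"] += 1
        else:
            counts["code"] += 1
    return counts


SAMPLE = [
    "; entry point",
    "start:",
    "",
    "        mov ax, 4C00h   ; terminate",
    "        int 21h",
]
print(classify_lines(SAMPLE))

# Sanity check on the figures quoted in the post: the three classes add up.
print(4607 + 7886 + 28627 == 41120)
```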
ecm
Düsseldorf, Germany, 22.03.2020, 11:45
@ ecm
|
rebuilding NASM 2.09 and NASM compatibility |
> No, I don't actually. Been a long time. May look around on my (now Debian
> testing amd64) development machine to see if I find any traces of that.
> Will have to set up OpenWatcom on that machine to attempt this again,
> should I decide to.
I didn't find any traces of that build. The oldest NASM files I found are in the directory Projects/obsolete/NASM/nasm-2.09.01-20100908/ and were written 2010-09-08, just a bit after the 2010-08-25 date of that message.
> For my current projects I usually need at least 2.10, I did some testing
> recently but don't remember which exact versions worked. (And Google's
> Hangouts is difficult to search.) I think some specific 2.09.xx versions
> worked too.
Here's a log from 2019-11-16:
ecm: NASM version 2.10.09 ist nötig um lDebug zu assemblieren
ecm: bei 2.09.10 funktioniert "%assign %$foo%[bar] quux" nicht richtig
ecm: binsrch.asm:802: warning: (__ldup_count:1) context-local macro expansion fall-through (automatic searching of outer contexts) will be deprecated starting in NASM 2.10, please see the NASM Manual for more information
binsrch.asm:802: warning: (__ldup_count:1) `count_newparlist': context-local macro expansion fall-through
binsrch.asm:802: error: (__ldup_count:1) `%assign' expects a macro identifier
ecm: bei 2.07 passiert dies hier (duplikate rausgeschnitten)
proj/lDebug/build.206/source$ ./mak.sh -D_BOOTLDR
Creating debug.com
debugtbl.inc:10: error: (opl:10) unknown preprocessor directive `%deftok'
debugtbl.inc:10: error: label or instruction expected at start of line
debugtbl.inc:10: error: (opl:14) `%assign' expects a macro identifier
ecm: der 2.10.09 build ist identisch mit dem 2.15rc0
ecm: d.h. binary ist gleich
Translation:
ecm: NASM version 2.10.09 is needed to assemble lDebug
ecm: with 2.09.10 "%assign %$foo%[bar] quux" doesn't work correctly
ecm: binsrch.asm:802: warning: (__ldup_count:1) context-local macro expansion fall-through (automatic searching of outer contexts) will be deprecated starting in NASM 2.10, please see the NASM Manual for more information
binsrch.asm:802: warning: (__ldup_count:1) `count_newparlist': context-local macro expansion fall-through
binsrch.asm:802: error: (__ldup_count:1) `%assign' expects a macro identifier
ecm: with 2.07 this happens (duplicates cut)
proj/lDebug/build.206/source$ ./mak.sh -D_BOOTLDR
Creating debug.com
debugtbl.inc:10: error: (opl:10) unknown preprocessor directive `%deftok'
debugtbl.inc:10: error: label or instruction expected at start of line
debugtbl.inc:10: error: (opl:14) `%assign' expects a macro identifier
ecm: the 2.10.09 build is identical to the 2.15rc0
ecm: ie, binary is the same
Here are a few more messages I found in GoogleMail/Hangouts (but not in Pidgin, which is easier to copy and search through):
https://stackoverflow.com/questions/46549671/doesnt-perl-include-current-directory-in-inc-by-default ran into this for macros.pl
https://www.perlmonks.org/bare/?node_id=375341 due to this I added BEGIN { unshift @INC, "."; }
which worked
https://repo.or.cz/nasm.git/commitdiff/bc8522e3a08ae3124bdf60d27dd0a24baee535f0 I had to cherry pick this commit
to allow building the older nasm versions
in one of these (must be 2.06) I manually had to resolve a conflict by doing git rm -f directiv.pl
testopt etc (newly moved to lmacros3) need correct %deftok
the bugged one doesn't work for that
ie, it must be 2.09.02
huh. 2.09.10 has the error that I showed earlier
2.09.02 doesn't have that error and seems to function
same binary output
2.09.02 works
2.09.03 to 2.09.10 all don't work
https://repo.or.cz/nasm.git?a=commit&h=6cdc900d8d56106e9ac62247968e618355b08938
starting with this one it doesn't work anymore --- l |
marcov
22.03.2020, 13:08
@ Rugxulo
|
rebuilding NASM 0.98.39 (without MMX/3DNOW/686/SSE) |
> The 486 and 586 didn't add much (not counting MMX, 3DNow!), so I felt I
> should keep those. But everything else (Cyrix, MMX, 3DNow!, P6, IA64
> [JMPE], SSE[123]) I deleted. I also had to use "-d" (merge duplicate
(P6 has a highly useful cmov that reduces branching, and is a substitute for a branch instruction and a mov. IIRC it is 3 bytes and replaces a one byte branch and two bytes mov. (in both cases + size of imm), equal size, better performance)
> BTW, I heard that newer AMD cpus don't even have 3DNow! anymore. And even
> Intel was rumored to be tweaking compilers (GCC?) to target SSE2 with MMX
> intrinsics (so they could remove them from chips later?? dunno). Something
> weird like that.
More likely the other way around. SSE2 is a core (non-optional) aspect of the x86_64 architecture; the older SIMD implementations are not, and they aren't really used much anymore.
Note that SSE is the first 128-bit one, while earlier ones are 64-bit. |
ecm
Düsseldorf, Germany, 22.03.2020, 14:26
@ marcov
|
CMOV |
> (P6 has a highly useful cmov that reduces branching, and is a substitute
> for a branch instruction and a mov. IIRC it is 3 bytes and replaces a one
> byte branch and two bytes mov. (in both cases + size of imm), equal size,
> better performance)
I recently discussed whether CMOVcc is available in all AMD64 processors: https://stackoverflow.com/questions/60760138/do-al...mplementations-support-the-cmovcc-instructions/
The other answer of mine linked from there includes this statement about CMOVcc:
> (I came across this thread involving Linus Torvalds which indicates that the conditional jump solution may actually be better or no worse than cmov. Make of that what you will.)
Note that CMOVcc has its own CPUID support bit. A processor being a 686+ class does not necessarily mean it has CMOVcc. --- l |
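As a small illustration of that feature bit: CPUID leaf 1 reports CMOV in EDX bit 15 and the x87 FPU in EDX bit 0, and FCMOVcc needs both. The sketch below only decodes an EDX value that has already been obtained somehow; the two sample values are hypothetical, not readings from real processors.

```python
# Decode the CMOV-related feature bits from CPUID leaf 1, register EDX.
CPUID1_EDX_FPU = 1 << 0     # x87 FPU on chip
CPUID1_EDX_CMOV = 1 << 15   # CMOVcc supported


def has_cmov(edx):
    return bool(edx & CPUID1_EDX_CMOV)


def has_fcmov(edx):
    # FCMOVcc is available only when both the FPU and CMOV bits are set.
    needed = CPUID1_EDX_FPU | CPUID1_EDX_CMOV
    return (edx & needed) == needed


# Hypothetical EDX values, for illustration only.
EDX_WITH_CMOV = CPUID1_EDX_FPU | CPUID1_EDX_CMOV | (1 << 4)
EDX_WITHOUT_CMOV = CPUID1_EDX_FPU | (1 << 4)

print(has_cmov(EDX_WITH_CMOV), has_fcmov(EDX_WITH_CMOV))
print(has_cmov(EDX_WITHOUT_CMOV), has_fcmov(EDX_WITHOUT_CMOV))
```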
marcov
22.03.2020, 18:32
@ ecm
|
CMOV |
>
> > (I came across this
> thread involving Linus Torvalds which indicates that the conditional
> jump solution may actually be better or no worse than cmov. Make of that
> what you will.)
Still, one of the other links says that VS2010 in 64-bit mode does generate cmov more. And that is from after Linus Torvalds' analysis.
My own experiences (as a non-compiler dev) are a bit less clear-cut. When cmov generation was added, FPC self-compiled faster on 32-bit Windows, afaik in Core2 times. (Before that there were few processor-specific optimizations.)
On the one hand that is a real benchmark, not a micro benchmark; on the other hand it is possible that the result reflects one code path vs. the other, that other factors (like Pascal boolean handling with 1=true, 0=false) factor in, and that the whole code path with cmov turned out to be faster, not just cmov itself.
I've also seen references saying that the preference for CMOV is to avoid polluting the BTB with unneeded branch addresses and to let the register rename stage handle it. (And with the uop caches of Sandy Bridge, the same branch miss latency now hurts more.)
Maybe it is safest to conclude that it is undecided, since all hunches and feelings are old (Core2 period), and we are now 10 generations of "Core" architecture and an AMD resurgence further along.
> Note that CMOVcc has its own CPUID support bit. A processor being a 686+
> class does not necessarily mean it has CMOVcc.
I know, to my own detriment. I had a Via based firewall that didn't, and when the Smooth Wall went to a newer kernel version, it suddenly failed because the kernel was compiled with P6 compat. But that was like 8 years ago.
The oldest PC I still use (home+work) is a low power E-350 work server, but that is up for replacement this year (probably by a Ryzen based Athlon 200).
So then a mix of Ivy Bridges and one Sandy Bridge i5-2500 will be the eldest. Though that is partially bad luck, since we had quite a big batch of 2nd generation Core2's, but the mainboard used for that generation had caps that weren't very durable. So they basically retired themselves.
Many linux distros are already phasing out 32-bit versions.... |
Rugxulo
Usono, 23.03.2020, 04:27
@ marcov
|
rebuilding NASM 0.98.39 (without MMX/3DNOW/686/SSE) |
> > The 486 and 586 didn't add much (not counting MMX, 3DNow!), so I felt I
> > should keep those. But everything else (Cyrix, MMX, 3DNow!, P6, IA64
> > [JMPE], SSE[123]) I deleted. I also had to use "-d" (merge duplicate
>
> (P6 has a highly useful cmov that reduces branching, and is a substitute
> for a branch instruction and a mov. IIRC it is 3 bytes and replaces a one
> byte branch and two bytes mov. (in both cases + size of imm), equal size,
> better performance)
I don't remember the encoding details. I thought CMOV was smaller?? But it's never faster (AFAIK), and it's harder to patch out. I did exchange a few emails with CWS (years ago) about maybe adding [F]CMOV* emulation to DJGPP's EMU387, but he only barely gave me a hint (SIGILL?) and didn't worry about it (among many other things). I don't blame him, but it was barely frustrating for me (at the time, less so nowadays). DOSBox doesn't support anything beyond Pentium (default is fast 486 DX2), so no CMOV there. (Not usually a problem there specifically, though.) It is annoying when DJGPP/GCC builds are 686+ only for no obvious advantage.
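On the encoding question, a quick sketch that builds the register-to-register forms byte by byte. It assumes 32-bit code (no operand-size prefix) and covers only the reg,reg variants; the helper names are made up for illustration. CMOVcc r32, r/m32 is 0F 40+cc /r, Jcc rel8 is 70+cc followed by the displacement, and MOV r32, r/m32 is 8B /r.

```python
# Build the raw bytes for "cmovz eax, ebx" and for the equivalent
# "jnz skip / mov eax, ebx" pair, then compare their lengths.
CC = {"z": 0x4, "nz": 0x5}                       # condition-code nibbles
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3}


def modrm_reg_reg(reg, rm):
    # mod=11 (register direct), then the reg and r/m fields.
    return 0xC0 | (reg << 3) | rm


def cmov(cc, dst, src):
    return bytes([0x0F, 0x40 | CC[cc], modrm_reg_reg(REG32[dst], REG32[src])])


def jcc_short(cc, rel8):
    return bytes([0x70 | CC[cc], rel8 & 0xFF])


def mov(dst, src):
    return bytes([0x8B, modrm_reg_reg(REG32[dst], REG32[src])])


cmov_seq = cmov("z", "eax", "ebx")
branch_seq = jcc_short("nz", 2) + mov("eax", "ebx")   # jnz skips the 2-byte mov

print(cmov_seq.hex(" "), len(cmov_seq))
print(branch_seq.hex(" "), len(branch_seq))
```

So for this form the CMOV version is 3 bytes against 4 for the short branch plus MOV; memory operands and 16-bit code with 32-bit registers add bytes to both variants.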
The whole point of recompiling was to have an 8086-hosted build of NASM since their old build was 186 (and didn't work under 8086tinyplus). Even the TinyAsm guy lamented that. (His is 8086-only target! ... in NASM syntax.) Old NASM 0.97 worked but ran out of memory easily (too many output formats compiled in). OpenWatcom compiled 0.98.39 okay as Large model, but TC++ wouldn't fit (Huge only), DGROUP overflow, until I trimmed a bunch of instructions. Then it would fit in Large, but I'm not aware of any obvious speedups or improvements there otherwise. (Those two previous .EXEs had all instructions, up through SSE3. So only this "lite" one omits some stuff.)
I did double-check for you. Not counting Cyrix (irrelevant to me, for this particular .EXE), P6 only is 35 instructions. So it's only an extra 2 kb of code space used, so it probably wouldn't have hurt to include here, especially with "cpu 8086" directives to warn unwary users. Heck, this wouldn't fit at all in Large without "-d" (merge duplicate strings), even without P6. But I don't have TDUMP in this TC++ 1.01 freeware. OW's WDUMP didn't work, and WDIS didn't show anything obvious from quick glance. Presumably the OW build is better overall. |
marcov
23.03.2020, 12:52
@ marcov
|
CMOV |
(yeah, probably all irrelevant to the thread since the subject was mostly 16-bit only CPUs. But I originally started to answer from a size perspective, so just conclude this subthread)
I researched a bit more, and there seem several separate issues:
- cmov has a latency of 1 (AMD) or 2 (Intel) cycles, so it can form a dependency chain with the instructions coming after it. In that case the branched form might be more worthwhile if it is correctly predicted and the opcodes are sufficiently fused.
- I found some references that using branches might confuse the branch predictor, without many details, except the general advice to minimize branches.
- the older the cpu (superscalar ones, so p6+), the fewer inputs a single uop can have. Since cmov also depends on flags (it has two arguments + flags), in older CPUs (before Sandy Bridge) many combinations couldn't be a single uop. Even now there are more problems with e.g. indexed versions (which take another input register). Probably Sandy Bridge raised that to 3 inputs because of the three-address AVX instructions.
- the exact dependencies depend also on the form (the used flags). Carry and other flags combined are separate dependencies, so if you need carry and another flag, uop fusion probably won't happen.
The current opinion seems to be to use cmov instructions unless there is a very clear dependency chain. Cmov seems to be put in the same group as the adc instruction. |
Rugxulo
Usono, 24.03.2020, 17:54
@ marcov
|
deprecated MMX and obsolete 3DNow! |
> > BTW, I heard that newer AMD cpus don't even have 3DNow! anymore.
"Not supported in Bulldozer, Bobcat and Zen" (so nothing after 2010, omitted in family 14h and newer). XBoxOne and PS4 both use Jaguar (family 16h from 2013). EDIT: First implemented in K6-2 in 1998 (AMD's competitor to Pentium 2).
> And even Intel was rumored to be tweaking compilers (GCC?)
> to target SSE2 with MMX intrinsics (so they could remove them
> from chips later?? dunno).
>
> More likely the other way around. SSE2 is a core (non optional) aspect of
> the x86_64 architecture, the older SIMD implementations not, and aren't
> really used much anymore.
>
> Note that SSE is the first 128-bit one, while earlier ones are 64-bit.
I know all of that (barely), I meant what you said. Here's the 2019 Phoronix article that I was thinking of.
"Implement MMX intrinsics with SSE" is what H.J. Lu named his patches. However, GCC 10 hasn't been released yet, and the Changes page shows no mention of it yet.
>> Intel open-source compiler toolchain expert H.J. Lu sent out
>> a set of 46 patches for GCC that implement MMX intrinsics
>> with SSE instructions instead. Of course, in modern code-bases
>> hopefully you are utilizing modern versions of AVX.
"Modern versions"?? I double-checked, AVX2 is from Intel Haswell (2013) or AMD Excavator (2015).
AVX-512 is from 2016 (first supported in GCC 4.9). But AMD doesn't support any AVX-512 (yet??). "AVX-512 consists of multiple extensions not all meant to be supported by all processors implementing them. Only the core extension AVX-512F (AVX-512 Foundation) is required by all AVX-512 implementations."
(Fun fun fun!) |
Rugxulo
Usono, 24.03.2020, 20:36
@ Rugxulo
|
NASM 0.98.39 (MSC 7, "286", not full instruction support) |
> The whole point of recompiling was to have an 8086-hosted build of NASM
> since their old build was 186 (and didn't work under 8086tinyplus).
> OpenWatcom compiled 0.98.39 okay as Large model, but TC++ wouldn't fit
> (Huge only), DGROUP overflow, until I trimmed a bunch of instructions.
> Then it would fit in Large. (Those two previous .EXEs had all
> instructions, up through SSE3. So only this "lite" one omits some stuff.)
Actually, their pre-existing prebuilt .EXE for 16-bit DOS must've been compiled with Mkfiles/Makefile.ms7, i.e. MS C 7 from 1992 (according to .EXE's internal runtime copyright).
># Compile for a 286, ain't nobody using an 8086 anymore
>CC = cl /c /Oz /AL /Gt256 /G2 /I.. # MSC 7.00
Not sure why 286 [effectively 186] via /G2, it's probably very minimal improvement. "/AL" means Large model (according to this for MSC 6).
Also, note this:
> # GNU software compiled by DJGPP is also required:
># grep 2.4
># perl 5.6.1
You only need Perl to regenerate the insns*.[ch] files, though, and they're already included.
>insns16.dat: insns.dat
> grep -v WILLAMETTE insns.dat | grep -v KATMAI | grep -v SSE | \
> grep -v MMX | grep -v 3DNOW | grep -v UNDOC >insns16.dat
Inefficient, ERE has "|" alternation operator or whatever.
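To show what the alternation buys, here is a sketch comparing the chained filters with a single-pass pattern (the shell equivalent would be one grep -E -v with the words joined by "|"). The sample lines are illustrative stand-ins for insns.dat entries, not copied from the real file, and the function names are invented.

```python
import re

# The six words excluded by the makefile rule quoted above.
EXCLUDE = ["WILLAMETTE", "KATMAI", "SSE", "MMX", "3DNOW", "UNDOC"]


def chained(lines):
    """One filtering pass per word, like six 'grep -v' processes in a pipe."""
    for word in EXCLUDE:
        lines = [line for line in lines if word not in line]
    return lines


PATTERN = re.compile("|".join(re.escape(word) for word in EXCLUDE))


def single_pass(lines):
    """One pass with an alternation pattern."""
    return [line for line in lines if not PATTERN.search(line)]


SAMPLE = [
    r"AAA      void  \1\x37          8086",
    r"SALC     void  \1\xD6          8086,UNDOC",
    r"EMMS     void  \2\x0F\x77      PENT,MMX",
    r"MONITOR  void  \3\x0F\x01\xC8  PRESCOTT",
]

print(chained(SAMPLE) == single_pass(SAMPLE))
print([line.split()[0] for line in single_pass(SAMPLE)])
```

Both variants give the same result; the difference is only the number of passes over the file.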
So they didn't even include all instructions! I'm actually surprised (but, as mentioned, it's wiser this way). Then again, they did include outputs: bin, obj, as86, win32 (latter two are irrelevant in such limited memory conditions).
"Better" (??) would've presumably been using MSVC 1.52c (info) from 1995. Or something with 286 pmode target support. Or Digital Mars. (Or overlays??) |
Rugxulo
Usono, 24.03.2020, 21:16
@ Rugxulo
|
NASM 0.98.39 (not full instruction support) |
> >insns16.dat: insns.dat
> > grep -v WILLAMETTE insns.dat | grep -v KATMAI | grep -v SSE | \
> > grep -v MMX | grep -v 3DNOW | grep -v UNDOC >insns16.dat
Admittedly, it's hard to properly support so many instructions, and they aren't grouped 100% accurately, IMHO. (CPUID as PENT should be 486. Also, PENT didn't always support MMX, they are partially distinct.) Still, this piecemeal approach is error-prone, for many reasons.
Omitting UNDOC is "mostly" fine for oddballs like IBTS or XBTS or UMOV (386 only) which (almost literally) nobody ever used. But omitting CMPXCHG486, LOADALL, LOADALL286, SALC is bad because those were much more common! And it won't warn, only assemble (but ignore) your use of such an "unsupported" instruction, treating it as a label. So you won't even know until it bites you (hopefully, you have reproducible checksums or test suites)! And yet the .EXE still included a few unnecessary instructions (IA64 [JMPE] and PRESCOTT non-SSE [MONITOR,MWAIT,FISTTP]).
This may also be because the makefile was old and not updated for recent releases. So maybe it was accurate for previous releases ... but no longer. And this was the only prebuilt 16-bit binary we had! |
ecm
Düsseldorf, Germany, 24.03.2020, 22:55
@ Rugxulo
|
NASM 0.98.39 (not full instruction support) |
> And it won't warn, only assemble (but ignore) your use of such
> "unsupported" instruction, treating it as a label. So you won't even know
> until it bites you (hopefully, you have reproducible checksums or test
> suites)!
Doesn't that version have the orphan labels warning? In the versions that have that, it will warn by default if an unknown instruction is treated as a label without a colon. --- l |
Rugxulo
Usono, 24.03.2020, 23:17
@ ecm
|
NASM 0.98.39 (not full instruction support) |
> Doesn't that version have the orphan labels warning? In the versions that
> have that, it will warn by default if an unknown instruction is treated as
> a label without a colon.
Yes, it has that warning, but "default off".
Trying "movd eax,mm0" does give "error: parser: instruction expected". So it's only instructions with no operands that have this problem (e.g. salc). I always disliked orphan labels, but I guess it's meant to avoid directives from (semi-)compatible code?? NASM seemingly always disliked red-tape "bureaucracy" (from MASM/TASM). |
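The difference between the two cases can be modelled with a toy classifier. This is only a sketch of the behaviour described in this post, not NASM's real parser; the mnemonic set, the function name and the returned strings are invented for illustration.

```python
# Toy model: an unknown first word with nothing after it becomes an orphan
# label, while an unknown first word followed by operands is an error.
KNOWN = {"MOV", "NOP", "DB", "EQU"}   # illustrative subset of mnemonics


def classify(line, warn_orphan=False):
    tokens = line.split(None, 1)
    if not tokens:
        return "empty"
    first = tokens[0]
    if first.upper() in KNOWN:
        return "instruction"
    if first.endswith(":"):
        return "label"
    if len(tokens) == 1:
        # Unknown word alone on the line: accepted as a label without a colon.
        return "orphan label (warning)" if warn_orphan else "orphan label (silent)"
    # Something follows the would-be label, so it must be an instruction.
    following = tokens[1].split()[0].upper()
    if following in KNOWN:
        return "label + instruction"
    return "error: instruction expected"


print(classify("salc"))                    # trimmed-out instruction, no operands
print(classify("salc", warn_orphan=True))
print(classify("movd eax,mm0"))            # trimmed-out instruction with operands
print(classify("count equ 5"))
```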
Rugxulo
Usono, 31.03.2020, 20:07 (edited by Rugxulo, 31.03.2020, 20:43)
@ Rugxulo
|
NASM 0.98.39 (not full instruction support) ... LOADALL |
> But omitting CMPXCHG486, LOADALL, LOADALL286, SALC is bad because those were much more common!
I'm not a systems programmer, so I'm not familiar with all the gory details.
IIRC, it was claimed at one time that all 486 BIOSes emulated LOADALL, presumably for IBM's OS/2.
Wikipedia implies other software using it (MS Himem, Emm386, Smartdrv, Windows), or at least certain older versions (pre-XMS?). But no mention of it in Himemx nor JEMM386 sources.
Apparently it was removed in later processors, replaced by Pentium's RSM instruction (different encoding) for SMM (System Management Mode, aka ring -2). The actual encodings were reused instead for SYSCALL and SYSRET (both of which old NASM calls "P6" although Wikipedia says AMD64).
OS/2 Museum had a few things to say, too:
> On Intel processors, SYSCALL must be explicitly enabled and even then is
> only recognized in 64-bit mode, so it cannot possibly conflict with software
> written for the 286. On AMD processors, SYSCALL is not limited to 64-bit
> mode but must still be enabled via EFER.SCE, which again won’t be the case
> with software designed to run on 286s. Therefore SYSCALL does not conflict
> with 286 LOADALL emulation.
Strange stuff. |
Rugxulo
Usono, 13.04.2020, 07:16
@ Rugxulo
|
rebuilding NASM 0.98.39 (without MMX/3DNOW/686/SSE) |
> > > The 486 and 586 didn't add much (not counting MMX, 3DNow!), so I felt
> > > I should keep those. But everything else (Cyrix, MMX, 3DNow!, P6, IA64
> > > [JMPE], SSE[123]) I deleted.
> >
> > (P6 has a highly useful cmov that reduces branching, and is a substitute
> > for a branch instruction and a mov. IIRC it is 3 bytes and replaces a
> > one byte branch and two bytes mov. (in both cases + size of imm),
> > equal size, better performance)
>
> I did double-check for you. Not counting Cyrix, P6 only is 35 instructions.
> So it's only an extra 2 kb of space used, so it probably wouldn't hurt
> to include here
So I refreshed "nasmlite". But sometimes it's hard to know what really helps or not (especially for old cpus I don't have on hand to test). I did enable support for P6 instructions, though (but easily disabled via makefile, if needed).
"-a" enables word alignment and didn't hurt size at all, so that's a no-brainer, but it's still of questionable benefit (without further testing). "-G" speed optimization (vs. size) would've taken 6 kb extra, and I didn't even bother because I can't prove it actually helps at all. (I need to compare "-S" assembly output, but even that's probably targeting truly ancient cpus. "-S" also ran out of memory on preproc.c, so that's fun.)
I still need to read the compiler manuals in more detail. There's plenty of other options and miscellaneous trivia. |
ecm
Düsseldorf, Germany, 06.09.2020, 23:09
@ Rugxulo
|
rebuilding NASM 0.98.39 (without MMX/3DNOW/686/SSE) |
Made a new thread about the 8086 build for NASM 2.09 in 2010 August. --- l |