> I would like to make my programs (mTCP) more friendly for non-US users, but
> being from the US I have tunnel vision. ;-0 So I am asking for
> enlightenment ...
I think Unicode is a bit overrated, esp. since most users are just Europeans anyways. But of course I don't (can't!) frequent Asian forums, so that's my own bias. "Latin-[1234] should be good enough for most DOS users!" :P
> My understanding of DOS internationalization support is that I can tell DOS
> what country I am in, which affects the sorting order, date and time
> display, etc. I can also use CPI files to select alternate code pages for
> display and printing. Most systems have code page 437 built in, and in the
> case of CGA and MDA that is all that is available. EGA and better cards
> have loadable font support allowing for switchable code pages. Switching a
> code page changes the upper 128 of the 256 characters that are available;
> the lower 128 never change.
If you need to support CGA and MDA, you may have to tweak and recompile GRAFTABL. If you don't mind supporting this only on EGA and up (and you need the code pages only for display), DISPLAY and KEYB (or similar) are enough. If you want more than that, you'll have to use other methods (e.g. a graphics-mode program like Blocek or FoxType).
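To sense which code page the user has loaded, DOS 3.3+ exposes it through INT 21h function 6601h. A minimal sketch, assuming a Borland/Watcom-style <dos.h> (the intdos() call and REGS union); on older DOS the call fails with carry set, so fall back to assuming 437:

  /* Query the active code page via INT 21h, AX=6601h (DOS 3.3+). */
  #include <dos.h>
  #include <stdio.h>

  int get_code_page(void)
  {
      union REGS r;
      r.x.ax = 0x6601;           /* Get Global Code Page */
      intdos(&r, &r);
      if (r.x.cflag)             /* carry set: call not supported */
          return -1;
      return r.x.bx;             /* BX = active page (DX = system page) */
  }

  int main(void)
  {
      int cp = get_code_page();
      if (cp < 0)
          printf("Code page query failed; assuming 437\n");
      else
          printf("Active code page: %d\n", cp);
      return 0;
  }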
> As the programs are now, if the user selects a different code page the
> program will use it. The only time this becomes a problem is if I used one
> of the graphics characters in CP437 and those characters are replaced.
> This is often a problem when programs assume CP437 and something different,
> like CP850 is used instead. But except for that glitch, this method works
> - the programs generally don't care how the 8 bits are drawn, it is just an
> 8 bit value.
Don't use box chars. Or make them optional and fall back to (quirkier) plain 7-bit ASCII if needed.
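Something along these lines, say; the use_ascii flag here is hypothetical and would be wired to a command-line switch or to the code page query above:

  /* Sketch of the 7-bit fallback: CP437 box characters when we
     believe CP437 is active, plain ASCII stand-ins otherwise. */
  static int use_ascii = 0;   /* set when active code page != 437 */

  /* horiz, vert, four corners -- CP437 values vs. ASCII stand-ins */
  static const unsigned char box_437[]   = { 0xC4, 0xB3, 0xDA, 0xBF, 0xC0, 0xD9 };
  static const unsigned char box_ascii[] = { '-',  '|',  '+',  '+',  '+',  '+'  };

  unsigned char box_char(int i)
  {
      return use_ascii ? box_ascii[i] : box_437[i];
  }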
> For a program like IRC where you are exchanging data with other systems in
> real-time the character encoding matters; two systems with different
> character encodings may not agree on the same representation of a value.
> The standard solution for supporting additional characters in this case is
> Unicode, which gets embedded in the data stream as UTF-8.
Okay, but usually the chatting parties are friendly enough to compromise on an "inferior" encoding such as Latin-[1234] if Unicode isn't available. I mean, why chat with someone who refuses to switch encodings on principle ("just use Linux", yeah, we know they'll say that, ugh)?
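If you do go the UTF-8 route, the decoding side is small. A hedged sketch of pulling one code point out of an incoming byte stream; it doesn't police overlong forms or surrogates, which a real client should:

  /* Decode one UTF-8 code point; advances *p past it. Returns U+FFFD
     on malformed input. Assumes the buffer is NUL-terminated, so a
     truncated sequence trips the continuation-byte check below. */
  unsigned long utf8_next(const unsigned char **p)
  {
      const unsigned char *s = *p;
      unsigned long cp;
      int extra, i;

      if (s[0] < 0x80)                { cp = s[0];        extra = 0; }
      else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; extra = 1; }
      else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; extra = 2; }
      else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; extra = 3; }
      else { *p = s + 1; return 0xFFFDUL; }  /* stray continuation byte */

      for (i = 1; i <= extra; i++) {
          if ((s[i] & 0xC0) != 0x80) { *p = s + i; return 0xFFFDUL; }
          cp = (cp << 6) | (unsigned long)(s[i] & 0x3F);
      }
      *p = s + extra + 1;
      return cp;
  }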
> When Unicode is used the problem then becomes a matter of mapping the
> Unicode character to a character that can be displayed on the current
> machine. This involves sensing the currently active display code page and
> then having a mapping table from Unicode to the characters available in the
> code page.
You could just leave the character as-is and have a footnote somewhere on the screen that explains exactly what it is. I think such a descriptive text file is only a few MB, e.g. what 'C-x =' shows in GNU Emacs. (I forget exactly, but even Vim's 'g8' or similar supports this too. So does Mined.)
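Putting the two ideas together, a sketch of the mapping table plus the "footnote" fallback; the table entries here are illustrative samples, nowhere near a complete map:

  #include <stdio.h>

  struct uni_map { unsigned long cp; unsigned char ch; };

  /* Tiny Unicode -> CP437 sample table for characters you expect. */
  static const struct uni_map to_cp437[] = {
      { 0x00E9, 0x82 },   /* e with acute */
      { 0x00FC, 0x81 },   /* u with diaeresis */
      { 0x2500, 0xC4 },   /* box drawings light horizontal */
  };

  /* Returns a displayable byte; prints a U+XXXX note and returns '?'
     for anything unmapped (the on-screen "footnote"). */
  unsigned char map_cp(unsigned long cp)
  {
      int i;
      if (cp < 0x80)
          return (unsigned char)cp;   /* ASCII passes through */
      for (i = 0; i < (int)(sizeof to_cp437 / sizeof to_cp437[0]); i++)
          if (to_cp437[i].cp == cp)
              return to_cp437[i].ch;
      printf("[U+%04lX]\n", cp);
      return '?';
  }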
> Not everything can be mapped, so characters that can not be
> displayed have to be shown in some encoded format. (In a limited
> environment like on a real CGA or MDA card only CP437 is available.)
>
> Is my understanding correct? If so, are there libraries that make the
> mapping of Unicode to code pages easier?
iconv (licv*b.zip on the DJGPP mirrors) is probably your best bet. It's 386-only, but surely the code page data itself can be reused with impunity. With it you can convert on the fly between a small subset of Unicode (UTF-8, that is) and the active code page, if desired. Though I still say Latin-[1234] would be the easiest improvement for now. (The FreeDOS code pages aren't quite ISO-8859-x compatible, but close enough. You can use Kosta Kostis' ISOLATIN.CPI if otherwise desired. Either set can be made to work on most DOS compatibles, so you're not stuck with FreeDOS exclusively.)
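Usage is roughly as below, assuming your libiconv build knows the CP850 target (the //TRANSLIT suffix asks it to approximate unmappable characters instead of failing on them):

  #include <iconv.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      char in[] = "na\xC3\xAFve";    /* UTF-8 for "naive" with diaeresis */
      char out[64];
      char *ip = in, *op = out;
      size_t il = strlen(in), ol = sizeof out - 1;

      iconv_t cd = iconv_open("CP850//TRANSLIT", "UTF-8");
      if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }
      if (iconv(cd, &ip, &il, &op, &ol) == (size_t)-1)
          perror("iconv");
      *op = '\0';
      iconv_close(cd);
      printf("%s\n", out);           /* CP850 bytes, ready to display */
      return 0;
  }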