mbbrutman


Washington, USA,
25.11.2012, 01:06
 

DOS internationalization support (Developers)

I would like to make my programs (mTCP) more friendly for non-US users, but being from the US I have tunnel vision. ;-) So I am asking for enlightenment ...

My understanding of DOS internationalization support is that I can tell DOS what country I am in, which affects the sorting order, date and time display, etc. I can also use CPI files to select alternate code pages for display and printing. Most systems have code page 437 built in, and in the case of CGA and MDA that is all that is available. EGA and better cards have loadable font support allowing for switchable code pages. Switching a code page changes the upper 128 of the 256 characters that are available; the lower 128 never change.
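(For example, I believe the country-dependent formats come back from INT 21h, AH=38h; a sketch in Borland-style C, with the buffer layout as RBIL describes it, so double-check it:)

#include <dos.h>
#include <stdio.h>

/* Sketch: fetch the current country info (INT 21h, AH=38h, AL=0).
   Per RBIL the 34-byte buffer starts with a WORD date format:
   0 = USA (m/d/y), 1 = Europe (d/m/y), 2 = Japan (y/m/d). */
int main( void ) {
  unsigned char buf[34];
  union REGS r;
  struct SREGS s;
  void far *p = buf;

  r.h.ah = 0x38;
  r.h.al = 0;                  /* 0 = current country */
  segread( &s );
  s.ds = FP_SEG( p );
  r.x.dx = FP_OFF( p );
  intdosx( &r, &r, &s );
  if ( !r.x.cflag ) {
    printf( "Date format: %u\n", buf[0] | (buf[1] << 8) );
  }
  return 0;
}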

As the programs are now, if the user selects a different code page the program will use it. The only time this becomes a problem is if I used one of the graphics characters in CP437 and those characters are replaced. This is often a problem when programs assume CP437 and something different, like CP850, is used instead. But except for that glitch, this method works - the programs generally don't care how the 8 bits are drawn; it is just an 8-bit value.

For a program like IRC where you are exchanging data with other systems in real-time the character encoding matters; two systems with different character encodings may not agree on the same representation of a value. The standard solution for supporting additional characters in this case is Unicode, which gets embedded in the data stream as UTF-8.

When Unicode is used the problem then becomes a matter of mapping the Unicode character to a character that can be displayed on the current machine. This involves sensing the currently active display code page and then having a mapping table from Unicode to the characters available in the code page. Not everything can be mapped, so characters that can not be displayed have to be shown in some encoded format. (In a limited environment like on a real CGA or MDA card only CP437 is available.)
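(In rough C, I picture that mapping step like this; all names are invented and malformed sequences are not checked:)

typedef struct { unsigned short ucs; unsigned char cp; } MapEntry;

/* Invented sample entries for CP437; real tables would be loaded
   per code page. */
static MapEntry cpmap[] = {
  { 0x00E9, 0x82 },   /* e-acute             */
  { 0x00FC, 0x81 },   /* u-umlaut            */
  { 0x2500, 0xC4 },   /* box drawing, horiz. */
};

static unsigned char toCodepage( unsigned short ucs ) {
  int i;
  if ( ucs < 0x80 ) return (unsigned char)ucs;      /* ASCII is 1:1 */
  for ( i = 0; i < (int)(sizeof(cpmap)/sizeof(cpmap[0])); i++ ) {
    if ( cpmap[i].ucs == ucs ) return cpmap[i].cp;
  }
  return '?';                                       /* unmappable   */
}

/* Decode one UTF-8 sequence (1..3 bytes covers the BMP) and map it.
   A real decoder must also validate the continuation bytes. */
static unsigned char utf8Next( const unsigned char **p ) {
  const unsigned char *s = *p;
  unsigned short ucs;
  if ( s[0] < 0x80 )      { ucs = s[0];                                 *p += 1; }
  else if ( s[0] < 0xE0 ) { ucs = ((s[0] & 0x1F) << 6) | (s[1] & 0x3F); *p += 2; }
  else                    { ucs = ((s[0] & 0x0F) << 12)
                                | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F); *p += 3; }
  return toCodepage( ucs );
}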

Is my understanding correct? If so, are there libraries that make the mapping of Unicode to code pages easier?


Regards,
Mike

---
mTCP - TCP/IP apps for vintage DOS machines!
http://www.brutman.com/mTCP

Rugxulo


Usono,
25.11.2012, 04:19

@ mbbrutman
 

DOS internationalization support

> I would like to make my programs (mTCP) more friendly for non-US users, but
> being from the US I have tunnel vision. ;-0 So I am asking for
> enlightenment ...

I think Unicode is a bit overrated, esp. since most users are just Europeans anyways. But of course I don't (can't!) frequent Asian forums, so that's my own bias. "Latin-[1234] should be good enough for most DOS users!" :P

> My understanding of DOS internationalization support is that I can tell DOS
> what country I am in, which affects the sorting order, date and time
> display, etc. I can also use CPI files to select alternate code pages for
> display and printing. Most systems have code page 437 built in, and in the
> case of CGA and MDA that is all that is available. EGA and better cards
> have loadable font support allowing for switchable code pages. Switching a
> code page changes the upper 128 of the 256 characters that are available;
> the lower 128 never change.

If you need to support CGA and MDA, you may have to tweak and recompile GRAFTABL. If you don't mind supporting this only on EGA and up (and you "only" need code page "display"), just use DISPLAY and KEYB (or similar). If you want more than that, you'll have to use other methods (e.g. a graphics mode, like Blocek and FoxType use).
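For the DISPLAY/KEYB route, the classic MS-DOS style incantation looks roughly like this (paths are examples, and the FreeDOS DISPLAY/FD KEYB options differ slightly, so treat it as a sketch):

DEVICE=C:\DOS\DISPLAY.SYS CON=(EGA,437,1)        (in CONFIG.SYS)
MODE CON CODEPAGE PREPARE=((852) C:\DOS\EGA.CPI) (these three in AUTOEXEC.BAT)
MODE CON CODEPAGE SELECT=852
KEYB CZ,852,C:\DOS\KEYBOARD.SYS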

> As the programs are now, if the user selects a different code page the
> program will use it. The only time this becomes a problem is if I used one
> of the graphics characters in CP437 and those characters are replaced.
> This is often a problem when programs assume CP437 and something different,
> like CP850 is used instead. But except for that glitch, this method works
> - the programs generally don't care how the 8 bits are drawn, it is just an
> 8 bit value.

Don't use box chars. Or make it optional and fall back to (quirkier) 7-bit if needed.

> For a program like IRC where you are exchanging data with other systems in
> real-time the character encoding matters; two systems with different
> character encodings may not agree on the same representation of a value.
> The standard solution for supporting additional characters in this case is
> Unicode, which gets embedded in the data stream as UTF-8.

Okay, but usually the chatting parties are friendly enough to be willing to compromise to an "inferior" encoding such as Latin-[1234] if Unicode isn't available. I mean, why chat with someone who refuses to switch encodings out of principle ("just use Linux", yeah, we know they'll say that, ugh)?

> When Unicode is used the problem then becomes a matter of mapping the
> Unicode character to a character that can be displayed on the current
> machine. This involves sensing the currently active display code page and
> then having a mapping table from Unicode to the characters available in the
> code page.

You could just leave it as-is and have a footnote somewhere on the screen explaining exactly what character it is. I think such a descriptive text file is "only" a few MB, e.g. 'C-x =' or whatever in GNU Emacs. (I forget exactly, but even VIM 'g8' or similar supports this too. So does Mined.)

> Not everything can be mapped, so characters that can not be
> displayed have to be shown in some encoded format. (In a limited
> environment like on a real CGA or MDA card only CP437 is available.)
>
> Is my understanding correct? If so, are there libraries that make the
> mapping of Unicode to code pages easier?

iconv (licv*b.zip on DJGPP mirrors) is probably your best bet. It's 386 only, but surely the code page data itself can be reused with impunity. So you can manually convert on the fly between a small subset of Unicode (UTF-8, right?) if desired. Though I still say Latin-[1234] would be the easiest improvement, for now. (FreeDOS code pages aren't quite ISO-8859-x compatible, but somewhat close enough. You can use Kosta Kostis' ISOLATIN.CPI if otherwise desired. Either set can be made to work on most DOS compatibles, so you're not stuck with FreeDOS exclusively.)
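The C API is tiny; a sketch (note: depending on the build, the second argument of iconv() may be declared const char **):

#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main( void ) {
  iconv_t cd = iconv_open( "CP850", "UTF-8" );  /* to, from */
  char in[] = "na\xC3\xAFve";                   /* i-diaeresis in UTF-8 */
  char out[32];
  char *ip = in, *op = out;
  size_t il = strlen( in ), ol = sizeof(out) - 1;

  if ( cd == (iconv_t)-1 ) { perror( "iconv_open" ); return 1; }
  if ( iconv( cd, &ip, &il, &op, &ol ) == (size_t)-1 ) perror( "iconv" );
  *op = '\0';
  printf( "%s\n", out );                        /* CP850 bytes for the console */
  iconv_close( cd );
  return 0;
}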

mbbrutman


Washington, USA,
25.11.2012, 19:52

@ Rugxulo
 

DOS internationalization support

You might think Unicode is overrated, but the standard for data transfer is UTF-8, so there is not really a choice. If you want your IRC client to correctly interchange characters with other clients then you need to use UTF-8.

This is a text mode application only - GRAFTABL is not relevant here. Using a graphics mode would cure other problems (like not being able to underline properly), but that is for another day.

The box and line-drawing characters are part of the screen display. I could use something crude like "-" or "=" to form lines if I have to, but that is pretty ugly. If the user's selected code page doesn't have those characters then there is not really much else I can do.
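(Something this dumb, probably - a sketch, with the byte values being the usual CP437 line-drawing set:)

/* Sketch: crude ASCII stand-ins for the CP437 line-drawing bytes,
   used only when the active code page lacks those glyphs. */
static unsigned char asciiFallback( unsigned char c ) {
  switch ( c ) {
    case 0xC4: case 0xCD: return '-';   /* horizontal lines    */
    case 0xB3: case 0xBA: return '|';   /* vertical lines      */
    case 0xDA: case 0xBF: case 0xC0:
    case 0xD9: case 0xC5: return '+';   /* corners and crosses */
    default:   return c;                /* leave the rest      */
  }
}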

I think I'm supposed to sense the current code page, and then map the incoming UTF-8 to characters on that code page. (Or map the keyboard input to the correct UTF-8 character.) Does iconv have those mapping tables?

---
mTCP - TCP/IP apps for vintage DOS machines!
http://www.brutman.com/mTCP

marcov

25.11.2012, 13:15

@ mbbrutman
 

DOS internationalization support

> As the programs are now, if the user selects a different code page the
> program will use it.

Correct, and a code page is 256 code points max (some of them unprintable).

> The standard solution for supporting additional characters in this case is
> Unicode, which gets embedded in the data stream as UTF-8.

> When Unicode is used the problem then becomes a matter of mapping the
> Unicode character to a character that can be displayed on the current
> machine.

Yes, or any character set larger than what a VGA can display in hardware.

> This involves sensing the currently active display code page and
> then having a mapping table from Unicode to the characters available in the
> code page. Not everything can be mapped, so characters that can not be
> displayed have to be shown in some encoded format. (In a limited
> environment like on a real CGA or MDA card only CP437 is available.)

Correct, unless you go to a framebuffer textmode solution and forgo the hardware textmode, as many *nixes do.

> Is my understanding correct? If so, are there libraries that make the
> mapping of Unicode to code pages easier?

libiconv is the standard answer, but if you can only display 256 chars, you don't need that much library support. (Since even if you render certain accented characters to their non-accented equivalents, there are still only 500-something mappings for a given code page.)
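(A sketch of the idea - the pairs here are invented samples; real tables are generated from the Unicode data files:)

/* Sketch: best-fit pairs (Unicode code point, plain ASCII stand-in)
   for stripping accents when the code page has no exact glyph. */
static const unsigned short bestfit[][2] = {
  { 0x00E0, 'a' }, { 0x00E1, 'a' }, { 0x00E2, 'a' },  /* a + grave/acute/circumflex */
  { 0x00E8, 'e' }, { 0x00E9, 'e' }, { 0x00EA, 'e' },
  { 0x00F2, 'o' }, { 0x00F3, 'o' }, { 0x00F4, 'o' },
};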

We use such solutions to provide some minimal unicode awareness for little programs and programs involved in bootstrapping the system. The tables are postloadable, so new codepages can be added without recompiling the binaries.

For fuller unicode support we usually try to tap into whatever system unicode library there is, to avoid

mbbrutman


Washington, USA,
25.11.2012, 20:05

@ marcov
 

DOS internationalization support

> Correct, unless you go to a framebuffer textmode solution and forgo the
> hardware textmode, as many *nixes do.

Graphics would solve other problems (underlining, bold, etc.) that I can not handle using the text modes. But the text modes have the advantage of being the same across all of the adapters - MDA to VGA. And text modes are fast enough for the most primitive machines. (The 8088s are still very important to me.)


> libiconv is the standard answer, but if you can only display 256 chars, you
> don't need that much library support. (since even if you render certain
> accented characters to their non accented equivalent, there are still only
> 500 something mappings for a given codepage)
>
> We use such solutions to provide some minimal unicode awareness for little
> programs and programs involved in bootstrapping the system. The tables are
> postloadable, so new codepages can be added without recompiling the
> binaries.
>
> For fuller unicode support we usually try to tap into whatever system
> unicode library there is, to avoid

The support I need is basically just the mapping tables - given a Unicode character, what is the correct character in the current code page to use. (Or a close approximation.) (And the reverse is needed for keyboard input.)

Your approach for minimal Unicode awareness sounds similar to what I am thinking of. I want to make it more Unicode aware, not perfect. (I am never going to have the Asian character sets covered ...)

---
mTCP - TCP/IP apps for vintage DOS machines!
http://www.brutman.com/mTCP

Rugxulo


Usono,
27.11.2012, 07:55

@ mbbrutman
 

DOS internationalization support

Don't mess with the keyboard at all. Rely on the native KEYB (or a suitable free/libre replacement, e.g. FD KEYB, if needed).

The glyphs from code pages (more or less) corresponding to ISO 8859 can be re-used, covering a large batch of European languages.

For mapping tables, any number of resources supply them, e.g. GNU Emacs (em2303[bs].zip) or Mined sources or iconv (licv112[bs].zip).

"The UCS characters U+0000 to U+007F are identical to those in US-ASCII (ISO 646 IRV) and the range U+0000 to U+00FF is identical to ISO 8859-1 (Latin-1)." (source; i.e., same "code points", different encoding)

cp850 - MS-DOS Codepage 850 (Multilingual Latin 1)
cp852 - MS-DOS Codepage 852 (Multilingual Latin 2)
cp853 - MS-DOS Codepage 853 (Multilingual Latin 3; not in MS-DOS, probably should say IBM OS/2 ??)

iso8859.1 - ISOIEC 8859-1:1998 Latin Alphabet No. 1
iso8859.2 - ISOIEC 8859-2:1999 Latin Alphabet No. 2
iso8859.3 - ISOIEC 8859-3:1999 Latin Alphabet No. 3
iso8859.4 - ISOIEC 8859-4:1998 Latin Alphabet No. 4

Kosta Kostis also has various freeware (fonts, converters, TSRs), esp. isocp101.zip, which includes ISOLATIN.CPI; that has cp819 ("true" Latin-1), cp912, cp913 (undocumented in DR-DOS 7.03; Windows calls this cp28593), cp914, etc. As mentioned, FreeDOS does not natively (yet) support these, only the weird variants (same glyphs, different encodings); e.g. cp853 has box chars and the same language support as cp913, but text needs manual translation between them.

See Dos2Unix or Mined (or old patch here) for querying the code page correctly between MS-DOS, DR-DOS, and FreeDOS. (Basically, try int 21h 440Ch 6Ah first [MS-DOS, DR-DOS], and if it fails [FreeDOS!], try undocumented int 2Fh, 0AD02h [MS-DOS, FreeDOS].)
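In C that boils down to something like this (a sketch straight from the RBIL descriptions; verify the IOCTL parameter block layout yourself):

#include <dos.h>

/* Sketch: query the active display code page.  First the documented
   IOCTL (int 21h, AX=440Ch, CX=036Ah) for MS-DOS/DR-DOS, then the
   undocumented DISPLAY.SYS call (int 2Fh, AX=AD02h) for FreeDOS. */
static unsigned getCodepage( void ) {
  union REGS r;
  struct SREGS s;
  unsigned buf[2];              /* WORD length, WORD code page (per RBIL) */
  void far *p = buf;

  r.x.ax = 0x440C;              /* IOCTL for character devices     */
  r.x.bx = 1;                   /* handle 1 = stdout (the console) */
  r.x.cx = 0x036A;              /* CH=03h screen, CL=6Ah query     */
  buf[0] = 2;
  buf[1] = 0;
  segread( &s );
  s.ds = FP_SEG( p );
  r.x.dx = FP_OFF( p );
  intdosx( &r, &r, &s );
  if ( !r.x.cflag && buf[1] ) return buf[1];

  r.x.ax = 0xAD02;              /* DISPLAY.SYS: get code page      */
  r.x.bx = 0xFFFF;              /* sentinel: unchanged = no driver */
  int86( 0x2F, &r, &r );
  if ( r.x.bx != 0xFFFF && r.x.bx != 0 ) return r.x.bx;

  return 437;                   /* CGA/MDA or nothing loaded       */
}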

marcov

27.11.2012, 09:35

@ marcov
 

DOS internationalization support

> For fuller unicode support we usually try to tap into whatever system
> unicode library there is, to avoid

... distribution of many tables (either in or out of the binary) with serious applications.

In general I avoid our own tables as much as possible, except for small-potato stuff and where they are involved in the bootstrap/build system.

Laaca


Czech Republic,
25.11.2012, 14:41
(edited by Laaca, 25.11.2012, 15:11)

@ mbbrutman
 

DOS internationalization support

For simple conversions between Unicode and a DOS code page, most DOS programs use plain .TBL files called cp850uni.tbl, cp852uni.tbl, and so on.
Look at the DOSLFN device driver; it is a good example of how they are used, and it also comes with a description of this (extremely primitive) format.
You only have to decide which table to use.
It can be selected via a CFG file, or you can derive it from the result of DOS function INT 21h/AX=6501h.
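Loading is trivial - a sketch below (I assume here the table body is just 128 little-endian words giving the Unicode value for characters 80h-FFh; check the DOSLFN documentation for the exact header, if any):

#include <stdio.h>

static unsigned short hiToUni[128];   /* Unicode for bytes 80h..FFh */

static int loadTbl( const char *name ) {
  FILE *f = fopen( name, "rb" );
  int i, lo, hi;
  if ( !f ) return 0;
  for ( i = 0; i < 128; i++ ) {
    lo = getc( f );
    hi = getc( f );
    if ( hi == EOF ) { fclose( f ); return 0; }
    hiToUni[i] = (unsigned short)(lo | (hi << 8));
  }
  fclose( f );
  return 1;  /* hiToUni[c - 0x80] now gives Unicode for byte c */
}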

---
DOS-u-akbar!
