DOS ain't dead

mbbrutman

Homepage

Washington, USA,
21.01.2023, 21:44
 

International keyboard support (Developers)

I'm working on enabling some of my programs for Unicode. Specifically, being able to decode UTF-8 and map Unicode codepoints to the machine's 256 available characters. My machine uses CP437, which is limited, but with UTF-8 decoding in place things already look much better.
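The decode-and-map step described above can be sketched as follows. This is an illustrative sketch only: Python's built-in cp437 codec stands in for the hand-built 256-entry table a real DOS program would carry, and the `utf8_to_cp437` name and fallback character are my own choices, not anything from mTCP.

```python
# Sketch of the UTF-8 -> codepage step described above (illustrative only;
# a real DOS program would carry its own 256-entry mapping table).
def utf8_to_cp437(data: bytes, fallback: str = "?") -> str:
    """Decode UTF-8, then keep only characters representable in CP437."""
    text = data.decode("utf-8", errors="replace")
    out = []
    for ch in text:
        try:
            ch.encode("cp437")      # is this codepoint in the 256-char set?
            out.append(ch)
        except UnicodeEncodeError:
            out.append(fallback)    # not representable -> substitute
    return "".join(out)

print(utf8_to_cp437("Grüße".encode("utf-8")))   # ü and ß exist in CP437
print(utf8_to_cp437("日本".encode("utf-8")))     # CJK does not -> "??"
```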

I'd like to check my understanding of keyboard input handling.

At the lowest level, the keyboard just returns scan codes to the BIOS. The BIOS is responsible for mapping the keyboard scan codes to the built-in character set, or to special keys (F1, Alt-X, etc.)

If a machine strayed from the standard US keyboard layout or built-in character set, then the BIOS had to handle the differences in its mapping. That also becomes problematic for software that expects one character set and gets a different one: you just get the wrong thing displayed.

Around the DOS 3.x timeframe the hardware (EGA) allowed for loadable character sets (code pages), and DOS provided a mechanism to specify which should be resident. That takes care of the fixed code page limitation that earlier adapters (MDA/CGA) had.

On the keyboard side there is the KEYB.COM program. From what I can tell, when you use KEYB it just patches or replaces the standard BIOS handling, converting scan codes to something different from the default. It also allows some shift/meta-type operations to generate additional characters. But ultimately the output is still going to be a value from 0 to 255 representing a character, and hopefully the matching code page for the video display is resident. Otherwise you get the correct character but an incorrect display.
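The KEYB behaviour described here can be pictured as a table swap: the same scan codes go in, different characters come out. In the toy model below, 0x15 and 0x2C are the scan code set 1 values for the keys labeled Y and Z on a US keyboard (which a German QWERTZ layout swaps); the two-entry "override" is a deliberately tiny illustration, not a real layout file.

```python
# Toy model of KEYB-style remapping: same scan codes in, different
# characters out.  0x15/0x2C are the set-1 scan codes for the keys
# labeled Y/Z on a US keyboard; a German (QWERTZ) layout swaps them.
US_LAYOUT = {0x15: "y", 0x2C: "z"}
GERMAN_OVERRIDE = {0x15: "z", 0x2C: "y"}

def translate(scan_code: int, layout: dict) -> str:
    """Look up the character a layout assigns to a scan code."""
    return layout.get(scan_code, "?")

# The same physical key press yields different characters per layout.
print(translate(0x15, US_LAYOUT))                         # "y"
print(translate(0x15, {**US_LAYOUT, **GERMAN_OVERRIDE}))  # "z"
```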

Does this sound about right?

---
mTCP - TCP/IP apps for vintage DOS machines!
http://www.brutman.com/mTCP

marcov

22.01.2023, 14:19

@ mbbrutman
 

International keyboard support

> On the keyboard side there is the KEYB.COM program. From what I can tell
> when you use KEYB it just fixes or replaces the standard BIOS handling,
> converting scan codes to something different from the default. It also
> allows for some shift/meta type operations to generate additional
> characters. But ultimately, the output is still going to be a value from 0
> to 255 representing a character, and hopefully the matching code page for
> the video display is resident. Otherwise, you get the correct character
> but an incorrect display.
>
> Does this sound about right?

That all sounds about right. The keyboard side must generate multi-character sequences, and the display system must support writing those.

Additionally, the display system quite often requires such sequences to be written in one sys/int call (so if the intermediate programming breaks them up, that can also cause artefacts).

And of course anything that does character positioning must be fixed, because the new position is no longer the old position plus the number of bytes written. Etc., etc.

tom

Homepage

Germany (West),
22.01.2023, 18:06
(edited by tom, 22.01.2023, 19:51)

@ mbbrutman
 

International keyboard support

> I'm working on enabling some of my programs for Unicode.
>...
>
> On the keyboard side there is the KEYB.COM program. From what I can tell
> when you use KEYB it just fixes or replaces the standard BIOS handling,
> converting scan codes to something different from the default. It also
> allows for some shift/meta type operations to generate additional
> characters. But ultimately, the output is still going to be a value from 0
> to 255 representing a character, and hopefully the matching code page for
> the video display is resident. Otherwise, you get the correct character
> but an incorrect display.
>
> Does this sound about right?

that sounds mostly right.

however, there used to be a thing called DBCS (double-byte character set) that was used to support languages with more than 256 characters, like Chinese. however, as a (German) European I'm only used to languages with far fewer than 256 letters. It would be cool to have a Chinese/Japanese (or similar) developer explain how this worked - in the context of DOS and how MS-DOS helped support it.

of course, a DOS application like the editor Blocek can find a way to input and output DBCS all on its own...

Laaca

Homepage

Czech republic,
22.01.2023, 21:55

@ mbbrutman
 

International keyboard support

The loadable keyboard driver creates the ASCII codes from the scan codes.
The scan codes are usually not changed, but not always. For example, KEYB changes even the scan codes of the letters Z/Y (using function INT 15h/AH=4Fh).

What causes problems are the "dead keys" (the prefix keys). If you press such a key, the driver usually does not generate a "key pressed" event on INT 16h or INT 21h.

---
DOS-u-akbar!

mbbrutman

Homepage

Washington, USA,
22.01.2023, 22:46

@ Laaca
 

International keyboard support

So based on my experiments and what I'm reading here, I think I can summarize it as follows:

* Regardless of the physical keyboard layout, an application program reading the keyboard using the BIOS routines will see 8 bit values. (This can't change or the world would be broken.)

* KEYB.COM is used to provide alternate mappings of keys to values returned by the BIOS, but ultimately you still get the same type of output - an 8 bit value.

* Special key sequences such as F1, Alt-X, etc. still generate a 0x0 and an 8 bit value that remains unchanged regardless of the KEYB.COM changes.

* To convert an 8 bit character value to Unicode you need to know what codepage was specified with KEYB.COM. Then you can try to map the 8 bit value received from the BIOS call to a Unicode codepoint.


The test version of IRCjr I am writing is already mapping incoming Unicode to arbitrary values when it receives a message. I have a mapping table for CP437, but any mapping table can be provided. (I'll need to make up additional tables.) Once I figure it out for IRCjr I'll reuse the code in Telnet and finally have proper UTF-8 support, and hopefully my European friends will forgive me for being so late with this.
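The pluggable mapping table described above can be sketched like this. For illustration the table is built here from Python's cp437 codec; a DOS build would ship it as static data, and the names (`CP437_FROM_UNICODE`, `codepoint_to_byte`) are hypothetical, not IRCjr's actual code.

```python
# Sketch of a pluggable Unicode -> codepage mapping table.  Built from
# Python's cp437 codec purely for illustration; any other table (cp850,
# cp852, ...) could be swapped in, which is the point made above.
CP437_FROM_UNICODE = {ord(bytes([b]).decode("cp437")): b for b in range(256)}

def codepoint_to_byte(cp: int, table: dict, substitute: int = 0x3F) -> int:
    """Map a Unicode codepoint to an 8-bit value; 0x3F ('?') if absent."""
    return table.get(cp, substitute)

print(hex(codepoint_to_byte(ord("é"), CP437_FROM_UNICODE)))   # 0x82 in CP437
print(hex(codepoint_to_byte(0x1F600, CP437_FROM_UNICODE)))    # emoji -> 0x3f
```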

Laaca

Homepage

Czech republic,
23.01.2023, 00:20

@ mbbrutman
 

International keyboard support

Yes, the 8-bit (extended ASCII) to Unicode (and vice versa) conversion is a problem, as DOS does not provide any table for it.
There exist 8-bit uppercase/lowercase and collation tables, but no 8-bit/Unicode table.
My programs (and also DOSLFN and a few other programs) use the .TBL files (for examples and a format description, look in the DOSLFN archive).
The principle is: ask DOS which code page we are actually using.
Let's say that DOS answers 852.
Then we load the proper TBL file named "cp852uni.tbl" and use the values.
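The load-and-use step can be sketched as below. The on-disk layout assumed here (128 little-endian 16-bit codepoints for bytes 0x80-0xFF, with the low half being plain ASCII) is my simplification; the real DOSLFN .TBL format may differ, so check the DOSLFN archive before relying on it.

```python
import struct

# Illustrative loader for a codepage -> Unicode table in the spirit of the
# cpXXXuni.tbl files.  Assumed layout: 128 little-endian 16-bit codepoints
# for bytes 0x80..0xFF (0x00..0x7F are plain ASCII).  The real DOSLFN
# .TBL layout may differ -- see the DOSLFN archive.
def load_table(raw: bytes) -> list:
    high = struct.unpack("<128H", raw[:256])
    return list(range(128)) + list(high)

def to_unicode(byte_value: int, table: list) -> str:
    return chr(table[byte_value])

# Fake the first entry of a CP852-style table: byte 0x80 -> U+00C7 ("Ç").
raw = bytearray(256)
raw[0:2] = struct.pack("<H", 0x00C7)
table = load_table(bytes(raw))
print(to_unicode(0x80, table))   # "Ç"
```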

On systems with LFN functions you could also, in theory, try this (in theory of course, because nobody is so insane):
Using LFNCreateFile, create a file in the root directory (like C:\) named _insane_#128#129#130... and so on up to #255.
It will be internally translated to Unicode.
But on reading it would be translated back from Unicode to ASCII.
To keep the filename in Unicode you have to read it directly - not using LFNFindFirst("_insane*.*") but using the INT 25h-style absolute disk read (INT 21h/AX=7305h).

---
DOS-u-akbar!

bretjohn

Homepage E-mail

Rio Rancho, NM,
23.01.2023, 18:12

@ mbbrutman
 

International keyboard support

Unfortunately, it's not as simple as all that (even though that's far from simple in itself).

There are also some keyboards that have multiple "modes" they can be in. For example, at least some versions of Cyrillic keyboards have a "Cyrillic Mode" (for typing in Russian or a similar language) and a "Latin Mode" (for typing in English). Even if you can identify which keyboard driver is loaded (some keyboard drivers have a way to do that and others don't), you may also need to identify which mode it's in. Again, some keyboard drivers provide a way to do that and others don't.

And there's also the problem of "custom" keyboard layouts that don't follow any particular widely recognized "standard". There are also keyboard layouts that are available in, e.g., a Windows command prompt that have never had a DOS equivalent made.

It gets really complicated and really ugly very quickly unless you're going to limit yourself to certain "common" scenarios. I had to go through this as I was writing my SCANCODE program (which can "type" scancodes for you automatically).

mbbrutman

Homepage

Washington, USA,
23.01.2023, 23:51

@ bretjohn
 

International keyboard support

Understood - I can't support everything, nor am I going to try. I think a good starting point is assuming that there is only one keyboard mode and one codepage in effect at a time, and then using a mapping table to convert from whatever value is provided by the BIOS to Unicode.

As an escape hatch, I'll also define a key sequence for entering Unicode codepoints directly by number.

I suspect this covers most people's needs. If anybody contacts me with something more complex, then I can dig into the specifics of their use case. But right now I'd even be happy with just mapping incoming UTF-8 to a codepage ... it's better than seeing the encoded gibberish.

Do you think my plan works for the 80% case? (Single keyboard layout, single display codepage?)

bretjohn

Homepage E-mail

Rio Rancho, NM,
24.01.2023, 21:10

@ mbbrutman
 

International keyboard support

> Do you think my plan works for the 80% case? (Single keyboard layout,
> single display codepage?)

I think it would work. You would just need to warn the user about what's going on and the limitations.

tom

Homepage

Germany (West),
12.02.2023, 18:39

@ bretjohn
 

International keyboard support

> There are also some keyboard that have multiple "modes" they can be in.
it's not 'multiple'. it's 2: 'translated to local' and 'completely transparent or no translation at all'.

and this mode is queryable via INT whatever.

> Again, some keyboard drivers
> provide a way to do that and others don't.
then 'others' are buggy. what's the problem?



> And there's also the problem of "custom" keyboard layouts that don't follow
> any particular widely recognized "standard". There are also keyboard
> layouts that are available in, e.g., a Windows command prompt that have
> never had a DOS equivalent made.

it's not such a big deal to create your own keyboard layout.


> It gets really complicated

?

bretjohn

Homepage E-mail

Rio Rancho, NM,
13.02.2023, 15:35

@ tom
 

International keyboard support

> > There are also some keyboard that have multiple "modes" they can be in.
> it's not 'multiple'. it's 2: 'translated to local' and 'completely
> transparent or no translation at all'.
>
> and this mode is questionable via INT whatever.

It is INT 2F.AD85, and is supported by later versions of the PC-DOS KEYB program. AFAIK it was never supported by MS-DOS KEYB. The FD-KEYB program by Aitor does support it, but it works a little differently than the PC-DOS version. The FD-KEYB program has way more than 2 "modes" (sub-mappings). The KEYB programs from other DOS clones (like DR-DOS or PTS-DOS) may support it also, but I'm not sure.

> > Again, some keyboard drivers
> > provide a way to do that and others don't.
> then 'others' are buggy. what's the problem?

The problem is that sometimes other programs (not KEYB) need to know what the current keyboard mapping and/or sub-mapping is in order to work properly (automatically), without requiring manual intervention from the user. When a KEYB program doesn't identify itself to other programs, the user must do it themselves - and they shouldn't need to do that.

> > And there's also the problem of "custom" keyboard layouts that don't
> follow
> > any particular widely recognized "standard". There are also keyboard
> > layouts that are available in, e.g., a Windows command prompt that have
> > never had a DOS equivalent made.
>
> it's not such a big deal to create your own keyboard layout.

Creating a custom keyboard layout is not a big deal, but having it automatically be able to notify other programs that may need to know what it's doing can be a very big deal.

tom

Homepage

Germany (West),
13.02.2023, 18:36

@ bretjohn
 

International keyboard support

> Creating a custom keyboard layout is not a big deal, but having it
> automatically be able to notify other programs that may need to know what
> it's doing can be a very big deal.

I think we discussed that before, and my feeling is still the same.

a) there is no API for that. forget it as no one else would be watching a newly invented API.

b) there is no need for this. no one else should bother by which scancode 'Y' or 'Z' is produced. just process 'Z' or 'Y'. and this API IS available.

bretjohn

Homepage E-mail

Rio Rancho, NM,
14.02.2023, 04:51

@ tom
 

International keyboard support

>> Creating a custom keyboard layout is not a big deal, but having it
>> automatically be able to notify other programs that may need to know
>> what it's doing can be a very big deal.

> I think we discussed that before, and my feeling is still the same.

Indeed we did, and my opinion hasn't changed, either.

> a) there is no API for that. forget it as no one else would be watching a
> newly invented API.

This is not a newly invented API. It's been around for decades and has been documented in RBIL (both the MS-DOS KEYB identification API and the PC-DOS KEYB sub-mapping API).

> b) there is no need for this. no one else should bother by which scancode
> 'Y' or 'Z' is produced. just process 'Z' or 'Y'. and this API IS
> available.

You're correct that programs _usually_ don't care how the ASCII character got entered, at least for the letters. But some programs care very much about how non-letter characters got entered. For example, some programs do care whether the number "2" got entered via the top row of the keyboard or via the number pad with some particular combination of the NumLock and/or Shift keys. Similarly, some programs care very much whether Enter (ASCII 13) was entered via the regular Enter key, the Enter key on the number pad, Alt-013, or Ctrl-M. With your statement that "no one else should bother", you are being pretty bold with your opinion.

tkchia

Homepage

13.02.2023, 20:13

@ bretjohn
 

International keyboard support

Hello bretjohn,

> Creating a custom keyboard layout is not a big deal, but having it
> automatically be able to notify other programs that may need to know what
> it's doing can be a very big deal.

I think this is not quite relevant for mbbrutman's particular problem(s). Namely, we want to know how to map, say, some 8-bit ASCII character value such as 0x8a to, say, "á". We do not need to know which particular key positions the 0x8a came from, but we do need to somehow figure out that 0x8a maps to Unicode U+00e1 "á". Then the program can display a "á" and save it as U+00e1 (maybe in UTF-8 form) in a text file.
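The pipeline tkchia describes (8-bit value, to codepoint, to UTF-8 in a file) is tiny when sketched out. Python's bundled codepage codecs stand in for a DOS-side table here; note that the concrete byte for "á" depends on the codepage (under CP437 it is 0xA0), so treat the values below as examples, not constants.

```python
# The byte -> codepoint -> UTF-8 round described above, in miniature.
# Python's cp437 codec stands in for the mapping table a DOS program
# would carry; under CP437 the byte 0xA0 is "á" (U+00E1).
def byte_to_utf8(byte_value: int, codepage: str = "cp437") -> bytes:
    ch = bytes([byte_value]).decode(codepage)   # 8-bit value -> codepoint
    return ch.encode("utf-8")                   # codepoint -> UTF-8 bytes

print(byte_to_utf8(0xA0))    # b'\xc3\xa1'  ("á" as UTF-8)
```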

Thank you!

---
https://gitlab.com/tkchia · https://codeberg.org/tkchia · 😴 "MOV AX,0D500H+CMOS_REG_D+NMI"

bretjohn

Homepage E-mail

Rio Rancho, NM,
14.02.2023, 05:05

@ tkchia
 

International keyboard support

> Hello bretjohn,
>
> > Creating a custom keyboard layout is not a big deal, but having it
> > automatically be able to notify other programs that may need to know
> what
> > it's doing can be a very big deal.
>
> I think this is not quite relevant for mbbrutman's particular problem(s).
> Namely, we want to know how to map, say, some 8-bit ASCII character value
> such as 0x8a to, say, "á". We do not need to know which
> particular key positions the 0x8a came from, but we do need to
> somehow figure out that 0x8a maps to Unicode
> U+00e1 "á". Then the program can display a "á" and
> save it as U+00e1 (maybe in UTF-8 form) in a text file.

The general problem with this is that there is not a one-to-one mapping between a DOS Code Page and Unicode.

For example, if your input is Unicode and you're trying to display what you receive in DOS, there are multiple issues. One is that there may be some characters that simply can't be displayed because the Unicode characters don't exist in the current Code Page. In that case, what do you put on the screen, since you can't display a "legitimate" character? What I did in my UNI2ASCI program is try to display _something_ on the screen that somewhat resembles the Unicode character (received from a USB device), even though it may not be the "correct" character to display. In UNI2ASCI, if I can't display it I just write the Unicode number (something like "{U+092C}"). Based on what Michael wrote, I think he's trying to do basically the same thing (but I could be wrong).
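The best-effort display strategy described here can be sketched as below. This is only a sketch in the spirit of UNI2ASCI, not its actual code: the real program's lookalike tables are far more elaborate, and the `displayable` name is mine.

```python
# Best-effort display mapping in the spirit of UNI2ASCI as described
# above: show the character if the codepage has it, otherwise fall back
# to a "{U+XXXX}" marker rather than dropping it silently.
def displayable(ch: str, codepage: str = "cp437") -> str:
    try:
        ch.encode(codepage)      # does the current codepage have a glyph?
        return ch
    except UnicodeEncodeError:
        return "{U+%04X}" % ord(ch)

print(displayable("é"))        # exists in CP437 -> "é"
print(displayable("\u092c"))   # Devanagari BA, not in CP437 -> "{U+092C}"
```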

The other problem when going from Unicode to DOS is that there are several Unicode characters that are effectively duplicates of each other (even though that's not supposed to happen in Unicode). For example, there are more than 20 Unicode characters that are classified as "spaces", with official names such as "No-Break Space", "Zero-Width Space", and "Three-Per-Em Space". I think those can all be _displayed_ as a "regular" space in DOS, though technically they probably shouldn't be, because they have additional characteristics besides the fact that they "look like a space" (they have some "metacharacteristics").

You have similar problems when going the other direction: converting a character from a DOS Code Page to Unicode. Again, we can talk about spaces. There are three DOS characters that "look like" spaces (ASCII 0 or NUL, ASCII 32 or a "normal" space, and ASCII 255, which is normally translated to Unicode as a No-Break Space or NBSP). Some DOS Code Pages also have additional characters that are displayed on the screen as a "space" (e.g., Code Page 869, which is used for Greek, has several "space" characters). On a DOS Code Page they all look exactly the same, but if you were to save them to Unicode, which of the Unicode "spaces" should you use?

Now, imagine trying to go back and forth between a DOS Code Page and Unicode multiple times, and try to foresee how screwed up the characters can get if you're not 100% consistent in how you do the mapping in both directions, or if you do it differently than the next programmer. Or trying to import/export something you saved as "Unicode" into another program that natively uses Unicode and actually understands the metacharacteristics of the different spaces.

How you enter something even as simple as a space can make a difference in which Unicode character to save it as. For example, you probably should differentiate between a No-Break Space and a "regular" space in a word processor, since it affects the output formatting.

I realize it seems like it should be pretty simple to do, but it's not, at least if you want to do it correctly and interact with other programs.

tom

Homepage

Germany (West),
14.02.2023, 12:17

@ mbbrutman
 

International keyboard support

> So based on my experiments and what I'm reading here, I think I can
> summarize it as follows:
>
> * Regardless of the physical keyboard layout, an application program
> reading the keyboard using the BIOS routines will see 8 bit values. (This
> can't change or the world would be broken.)

Right.

> * KEYB.COM is used to provide alternate mappings of keys to values returned
> by the BIOS, but ultimately you still get the same type of output - an 8
> bit value.

Right.

> * Special key sequences such as F1, Alt-X, etc. still generate a 0x0 and an
> 8 bit value that remains unchanged regardless of the KEYB.COM changes.

Right.

> * To convert an 8 bit character value to Unicode you need to know what
> codepage was specified

Right.

> with KEYB.COM.

wrong. codepage is a DISPLAY/VGA Bios thing, not related to the KEYB option.

example: as a German, I always used CP 437 (actually 850 now, because of €).
however, I use the standard BIOS when using a US ASCII keyboard, but KEYB GR when using a German keyboard.

Similarly, when hitting Ctrl-Alt-F1/Ctrl-Alt-F2 to enable/disable KEYB, the codepage remains the same.

> Then you can try to map the 8 bit
> value received from the BIOS call to a Unicode codepoint.

Right. Only in the context of the current codepage can you know whether 0x9A is Ü (CP437) or ³ (CP869), and translate it to the correct UTF value.

> The test version of IRCjr I am writing is already mapping incoming Unicode
> to arbitrary values when it receives a message.

btw, UNI2ASCI by Bret has made a really heroic effort to map dozens of Unicode values that 'look similar to T' to the available 'T' (and the rest of a-zA-Z).

it's probably better to display 'A' when 'Ä' is not available in the current codepage than to display '?'.
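One cheap way to get this 'A'-for-'Ä' behaviour is Unicode decomposition: split a character into its base letter plus combining marks, then drop the marks. This is only an approximation of what lookalike tables like UNI2ASCI's do; characters without a decomposition (e.g. 'Ø') still need a hand-made table.

```python
import unicodedata

# A cheap approximation of "display the closest base letter": decompose
# with NFKD, then drop combining marks.  Covers Ä->A, é->e, etc., but not
# lookalikes without a decomposition (e.g. Ø), which need a custom table.
def fold_to_base(ch: str) -> str:
    decomposed = unicodedata.normalize("NFKD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base or "?"

print(fold_to_base("Ä"))   # "A"
print(fold_to_base("é"))   # "e"
```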

tom

Homepage

Germany (West),
15.02.2023, 13:19

@ tom
 

International keyboard support

> it's probably better to display 'A' when 'Ä' is not available in the
> current codepage then to display '?'.

why are you using text mode anyway?

the availability of loadable codepages implies VGA, and a machine fast enough to run IRCjr in graphics mode with practically unlimited displayable characters.

according to WGL4 all European/American character sets sum to just about 700 different symbols; certainly doable.

marcov

15.02.2023, 18:14

@ tom
 

International keyboard support

> according to WGL4 all European/American character sets sum to just about
> 700 different symbols; certainly doable.

IIRC VGA has some 512-char capability (sacrificing a bit of the color/attribute byte), so if you can reduce the set to that, text mode should be ok :-)
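The trick marcov mentions can be sketched as arithmetic on the text-mode cell: with two 256-glyph fonts loaded, one attribute bit (normally foreground intensity) is repurposed to select the font bank, which is why half the foreground colours are sacrificed. The bit position used below (bit 3 of the attribute byte) matches the usual VGA setup, but treat the details as an assumption, not a hardware reference.

```python
# Sketch of the VGA 512-character trick: a text cell is (char byte,
# attribute byte); with two 256-glyph fonts loaded, attribute bit 3
# (normally foreground intensity) selects the font bank instead.
def make_cell(glyph_index: int, attribute: int) -> tuple:
    assert 0 <= glyph_index < 512
    if glyph_index >= 256:
        attribute |= 0x08            # bank 1: set the repurposed bit
    else:
        attribute &= ~0x08           # bank 0: clear it
    return (glyph_index & 0xFF, attribute)

print(make_cell(300, 0x07))   # glyph 44 in bank 1, attribute 0x0F
```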

tom

Homepage

Germany (West),
15.02.2023, 21:31

@ marcov
 

International keyboard support

> > according to WGL4 all European/American character sets sum to just about
> > 700 different symbols; certainly doable.
>
> IIRC VGA has some 512 char capability (sacrificing a bit of the
> color/attribute byte), so if you can reduce that set, textmode should be
> ok:-)

that's right. however, displaying 512 chars is about as complex as displaying 16x8 graphical chars. the latter scales better.

mbbrutman

Homepage

Washington, USA,
18.02.2023, 03:17

@ tom
 

International keyboard support

Understood about the difference between the display code page and the keyboard layout. I know this, but not everything comes across in a web forum posting.

I'm using text mode because I am targeting CGA text mode on slow machines, and the graphics modes for those machines would never be usable. CGA and MDA users will be limited to CP437. The same code has to run everywhere because I don't have the time to do multiple versions. EGA and VGA users might have a different code page loaded, so I don't want to make assumptions about what they have loaded, hence the need for a flexible mapping from whatever codepage is loaded to Unicode.

---
mTCP - TCP/IP apps for vintage DOS machines!
http://www.brutman.com/mTCP

Aitor

06.03.2023, 01:09

@ mbbrutman
 

International keyboard support

> Understood about the difference between the display code page and the
> keyboard layout. I know this, but not everything comes across in a web
> forum posting.
>
> I'm using text mode because I am targeting CGA text mode on slow machines,
> and the graphics modes for those machines would never be usable. CGA and
> MDA users will be limited to CP437. The same code has to run everywhere
> because I don't have the time to do multiple versions. EGA and VGA users
> might have a different code page loaded, so I don't want to make
> assumptions about what they have loaded, hence the need for a flexible
> mapping from whatever codepage is loaded to Unicode.

Hi,

just to add a couple of things: I think it has been clearly stated that KEYB just drops characters into the keyboard buffer - simply a list of characters to be "read" by the console or whatever program is waiting for keyboard input.

The fact that KEYB understands "codepages" means that if you press the key to produce À, that character may be at a different position in different codepages, hence KEYB drops different numbers into the buffer.

If it were just a question of the number of bytes posted by KEYB, there would be a shaky and painful way of doing it: in FD-KEYB you can define "strings", and therefore make the keyboard drop more than one byte per key-press.

However, you still need someone (usually the screen BIOS) to translate characters into glyphs. If you had a program that could read UTF-8 or other multi-byte characters from the keyboard buffer, it could translate them appropriately, but you would still need a graphics mode if you wanted to output more than 256 (or 512) characters.

If you planned to enlarge the output to 512 characters (instead of 256-character codepage tables), IIRC you would lose the ability of the screen BIOS to use light colours as background colours. Some programs may misbehave, as they may expect to use light background colours.

Anyway in CGA you can just use 8x8 character glyphs.

There aren't many chances to go Unicode with a CGA card, I'm afraid.

Aitor
