DOS ain't dead

bencollver

27.06.2024, 00:59

webdump 2024-05-23 (Announce)

webdump is (yet another) HTML to plain-text converter tool.

It reads HTML in UTF-8 from stdin and writes plain-text to stdout.

This DJGPP build uses the original source code unchanged.

Note: webdump requires more memory than will fit within real-mode constraints. No 16-bit build here.

Download:
gopher://tilde.pink/1/~bencollver/files/dos386/util/webdump/

Source:
https://codemadness.org/webdump.html

Example:

Format the GNU Privacy Handbook to plain text:

C:\>curl -o manual.htm https://www.gnupg.org/gph/en/manual.html
C:\>webdump -dilr -w 72 <manual.htm >manual.txt

mbbrutman Washington, USA, 27.06.2024, 16:56 @ bencollver	webdump 2024-05-23

Just curious: why does this need so much memory? It seems like a UTF-8 to ASCII converter. The HTML tags are probably in straight ASCII, so there is no need to make the code aware of the document structure. Maybe you could count opening '<' and closing '>' characters if you had to. A translation table from Unicode to ASCII doesn't require a lot of memory. So what am I missing here?
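
For illustration only, the "count '<' and '>'" idea could be sketched roughly as below. This is an assumption about what such a filter might look like, not how webdump or unhtml actually work; the function name strip_tags is hypothetical. It drops everything between '<' and the next '>' and does no entity decoding, so it illustrates how little state that approach needs compared with a parser that builds a document tree.

```c
#include <stddef.h>

/* Hypothetical helper (not from webdump or unhtml): copy `in` to `out`,
   dropping everything between '<' and the next '>'.
   No entity decoding, no special handling of comments or scripts. */
size_t strip_tags(const char *in, char *out, size_t outsz)
{
    size_t n = 0;
    int in_tag = 0;

    if (outsz == 0)
        return 0;
    for (; *in; in++) {
        if (*in == '<') {
            in_tag = 1;                 /* start skipping */
        } else if (*in == '>' && in_tag) {
            in_tag = 0;                 /* tag closed, resume copying */
        } else if (!in_tag && n < outsz - 1) {
            out[n++] = *in;             /* ordinary text, bounded copy */
        }
    }
    out[n] = '\0';
    return n;
}
```

Because a newline inside a tag is skipped like any other character, a tag split across lines (such as "<P" followed by ">" on the next line) is dropped whole in this sketch.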

Rugxulo Usono, 27.06.2024, 20:01 @ mbbrutman	UnHTML

> UnHTML (24k) removes HTML and some SGML from text files, leaving the file
> good condition for formatting/viewing with a word processor or JUSTIFY.
> I wrote it to get public domain text files from Web pages and other
> sources where the markup is copyrighted. Comes with C source. Includes
> Linux executable! (Tom Almy)

bencollver 28.06.2024, 04:34 @ Rugxulo	UnHTML

> > UnHTML (24k) removes HTML and some SGML from text files, leaving the file

Using the GNU Privacy Handbook example mentioned in my original post, unhtml gave the following error.

Null pointer assignment.

Unhtml apparently discarded most of the content and output only 7k of text. For comparison, webdump output 105k of text.

jadoxa

Queensland, Australia,
28.06.2024, 08:43
(edited by jadoxa, 28.06.2024, 09:49)

@ bencollver

UnHTML

> unhtml gave the following error.
>
> Null pointer assignment.

Logic error when it tests entities (while (a[i].in) should be while (*a[i].in)).
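
A minimal sketch of why the original test never terminates the loop, assuming a table whose name field is a char array stored inside the struct (as `char in[7]` is in unhtml.c). The table contents and the lookup_entity name here are hypothetical, for illustration only.

```c
#include <string.h>

/* Simplified, hypothetical version of an entity table. */
struct entity {
    char in[7];          /* entity name, stored inline in the struct */
    const char *out;     /* replacement text */
};

static const struct entity table[] = {
    {"amp", "&"},
    {"lt",  "<"},
    {"gt",  ">"},
    {{0}, 0}             /* sentinel: empty name */
};

const char *lookup_entity(const char *name)
{
    int i = 0;

    /* `table[i].in` is an array, so used as a condition it decays to a
       non-null pointer and is always true. Testing its first character
       is what actually detects the empty-name sentinel. */
    while (*table[i].in) {
        if (strcmp(table[i].in, name) == 0)
            return table[i].out;
        i++;
    }
    return NULL;         /* unknown entity */
}
```

With `while (table[i].in)` instead, an unknown name would make the loop run past the sentinel and read beyond the end of the table.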

> For comparison, webdump output 105k of text.

Many lines of which end with >, but maybe that's just me? Although it happened with my Windows build, your djgpp build and a WSL Ubuntu build, on the file downloaded from the curl command you provided. Seems it doesn't like that file's <tag\n> format.

Edit: more specifically, it doesn't like the split end tag. Sent an email to the original dev.

bencollver 30.06.2024, 16:07 @ jadoxa	UnHTML

> Logic error when it tests entities (while (a[i].in) should be
> while (*a[i].in)).

I made this change, and now unhtml produces about 23k of text and then gives the following error.

*** NULL assignment detected

jadoxa Queensland, Australia, 01.07.2024, 02:05 @ bencollver	UnHTML

> > Logic error when it tests entities (while (a[i].in) should be
> > while (*a[i].in)).
>
> I made this change, and now unhtml produces about 23k of text and then
> gives the following error.
>
> *** NULL assignment detected

Strange, works fine in Windows (VC6), producing 100k of text (although I wouldn't say it worked well, could do with more newlines at the end at least).

bencollver

14.07.2024, 01:30

@ jadoxa

UnHTML

> > > Logic error when it tests entities (while (a[i].in) should be
> > > while (*a[i].in)).
> >
> > I made this change, and now unhtml produces about 23k of text and then
> > gives the following error.
> >
> > *** NULL assignment detected
>
> Strange, works fine in Windows (VC6), producing 100k of text (although I
> wouldn't say it worked well, could do with more newlines at the end at
> least).

I built unhtml with DJGPP and it ran without any error messages. However, the resulting text file is not complete: Chapter 4 and onward are missing.

jadoxa

Queensland, Australia,
14.07.2024, 03:22

@ bencollver

UnHTML

> I built unhtml with DJGPP and it ran without any error messages. However
> the resulting text file is not complete. Chapter 4 and onward are missing.

I built it with djgpp 2.03 (June 2002 refresh) and gcc 4.6.2 (under Win7); the resulting text file was identical with the VC6 version (100262 bytes).

C:\unhtml>unhtml < manual.htm > manual.txt
HTML removing filter Version 1.0 Copyright 1996 by Tom Almy

C:\unhtml>tail manual.htm
><P >In this section, GnuPG refers to the GnuPG implementation of OpenPGP as well as other implementations such as NAI's PGP product.</P ></TD ></TR ></TABLE ></BODY ></HTML >

C:\unhtml>tail manual.txt
This can be confusing. Sometimes trust in an owner is referred to as owner-trust to distinguish it from trust in a key. Throughout this manual, however, ``trust'' is used to mean trust in a key's owner, and ``validity'' is used to mean trust that a key belongs to the human associated with the key ID.[5]In this section, GnuPG refers to the GnuPG implementation of OpenPGP as well as other implementations such as NAI's PGP product.

bencollver

14.07.2024, 05:16
(edited by bencollver, 14.07.2024, 05:32)

@ jadoxa

UnHTML

> > I built unhtml with DJGPP and it ran without any error messages. However
> > the resulting text file is not complete. Chapter 4 and onward are missing.
>
> I built it with djgpp 2.03 (June 2002 refresh) and gcc 4.6.2 (under Win7);
> the resulting text file was identical with the VC6 version (100262 bytes).

Interesting that we got different results. I had built it with djgpp 2.05 and gcc 7.2.0 under FreeDOS 1.3.

I rebuilt it using djgpp 2.03A and gcc 4.7.1 (the ones bundled with the FreeDOS 1.3 bonus CD) and this time around the resulting text file is complete.

Rugxulo Usono, 14.07.2024, 08:11 @ jadoxa	DJGPP 2.03p2 (June 2002)

> I built it with djgpp 2.03 (June 2002 refresh)

I believe this is called "2.03p2", aka patchlevel 2, with the Win2k/XP fixes. This was "/current/" until 2015.

bencollver

15.07.2024, 02:14
(edited by bencollver, 15.07.2024, 02:26)

@ jadoxa

UnHTML

I found that the unhtml problem was the result of a buffer overflow. With more recent versions of GCC, the stack corruption changes the intitle variable to a garbage value, omitting all output after the overflow.

    char cmdbuf[20];
    ...
    while (ch != ' ' && ch != '>') {
        cmdbuf[i++] = ch;
        mygetchar();
    }

I used the following workaround:

    while (ch != ' ' && ch != '>') {
        if (i < 19) cmdbuf[i++] = ch;
        mygetchar();
    }
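
The same idea as a self-contained sketch, under the assumption that the tag name is read from a string rather than through unhtml's global ch and mygetchar(). The function name read_tag_name is hypothetical. It keeps consuming input past the buffer limit, as the workaround does, but only stores what fits, and it also stops on any whitespace so a newline inside a tag does not become part of the name.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: copy a tag name from `src` into `dst`, truncating
   instead of overflowing. Stops at '>', whitespace, or end of string and
   returns a pointer to the character that stopped the scan. */
const char *read_tag_name(const char *src, char *dst, size_t dstsz)
{
    size_t i = 0;

    if (dstsz == 0)
        return src;
    while (*src && *src != '>' && !isspace((unsigned char)*src)) {
        if (i < dstsz - 1)      /* leave room for the terminator */
            dst[i++] = *src;
        src++;                  /* still consume over-long names */
    }
    dst[i] = '\0';
    return src;
}
```

Deriving the limit from the buffer size avoids the hard-coded 19 going stale if the buffer is later resized.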

Here's a link to download binaries.

gopher://tilde.pink/1/~bencollver/files/dos/util/unhtml/

jadoxa

Queensland, Australia,
16.07.2024, 04:54

@ bencollver

UnHTML

> I found that the unhtml problem was the result of a buffer overflow.

Nice find. That also led me to the newline problem - it never sees </pre because the newline is literal, thus the tag is "pre\n". Here's my complete diff.

--- ../unhtml/unhtml.c 1996-02-18 07:06:06 +1000
+++ unhtml.c 2024-07-16 12:43:58 +1000
@@ -20,8 +20,8 @@
 
 typedef struct {
     char in[7];
-    char out1d; /* DOS character (USA codepage) */
-    char out1w; /* Windows character */
+    unsigned char out1d; /* DOS character (USA codepage) */
+    unsigned char out1w; /* Windows character */
     char out2[4]; /* ASCII substitute */
     char use2; /* 1- use out2 instead of out1d for dos2flag
                   2- diacritical marked character
@@ -58,7 +58,8 @@
     {"#167", 21, 167, "%"},
     {"uml", '"', 168, "\""},
     {"#168", '"', 168, "\""},
-    {"cright", 'C', 169, "(C)",1},
+    {"COPY", 'C', 169, "(C)",1},
+    {"copy", 'C', 169, "(C)",1},
     {"#169", 'C', 169, "(C)",1},
     {"ordf", 166, 170, "a"},
     {"#170", 166, 170, "a"},
@@ -173,6 +174,8 @@
     {{0},0,0,{0}}
 };
 
+/* the longest name above */
+#define MAX_SUB 6
 
 
 void newline(void) {
@@ -208,13 +211,24 @@
 }
 
 void mygetchar(void) {
+    int space = 0;
     for (;;) {
         ch = getchar();
-        if (ch == '\n' && !quoting) ch = ' '; /* convert to whitespace */
         if (ch == EOF) {
             cnewline();
             exit(0);
         }
+        if (!quoting) {
+            if (ch == '\n' || ch == '\t') ch = ' '; /* convert to whitespace */
+            if (ch == ' ') {
+                space = 1; /* consolidate multiple spaces */
+                continue;
+            }
+            if (space) {
+                ungetc(ch, stdin);
+                ch = ' ';
+            }
+        }
         return;
     }
 }
@@ -253,7 +267,8 @@
 
 void main(int argc, char **argv) {
     int notflag=0, intitle=0;
-    char cmdbuf[20];
+    #define CMDBUF_SIZE 32
+    char cmdbuf[CMDBUF_SIZE];
     int listlevel = -1; /* not in a list */
     int listcount[10]; /* current counter value at each list level */
     int i;
@@ -296,30 +311,37 @@
             /* special character processing */
             mygetchar();
             i=0;
-            while (ch != ';' && i < 12) {
+            while (ch != ';' && !isspace(ch) && i < CMDBUF_SIZE - 1) {
                 cmdbuf[i++] = ch;
                 mygetchar();
             }
+            if (intitle) continue;
             cmdbuf[i] = 0;
-            if (i > 10) {
-                /* bad &; field, should not occur, but I've seen them! */
-                if (!intitle) {
-                    printf("&%s%c", cmdbuf, ch);
+            if (*cmdbuf == '#') {
+                if (cmdbuf[1] == 'x') {
+                    i = (int)strtol(cmdbuf + 2, 0, 16);
+                } else {
+                    i = (int)strtol(cmdbuf + 1, 0, 10);
+                }
+                if (i < 128) {
+                    putchar(i);
                     startline = 0;
+                    continue;
                 }
-                continue;
             }
-            i = 0;
-            while (a[i].in) {
-                if (strcmp(a[i].in,cmdbuf)==0) {
-                    if (!intitle) {
+            if (i <= MAX_SUB) {
+                i = 0;
+                while (*a[i].in) {
+                    if (strcmp(a[i].in,cmdbuf)==0) {
                         putTableChar(i);
-                        startline = 0;
+                        i = 0;
+                        break;
                     }
-                    break;
+                    i++;
                 }
-                i++;
             }
+            if (i) printf("&%s%c", cmdbuf, ch);
+            startline = 0;
             continue;
         }
         /* process <> command */
@@ -330,7 +352,7 @@
             mygetchar();
         }
         i=0;
-        while (ch != ' ' && ch != '>') {
+        while (!isspace(ch) && ch != '>' && i < CMDBUF_SIZE - 1) {
             cmdbuf[i++] = ch;
             mygetchar();
         }
@@ -391,7 +413,9 @@
         }
         if (strcmp("pre", cmdbuf)==0) {
             /* preformatted */
-            if (!notflag) cnewline();
+            cnewline();
+            newline();
+            if (notflag) skipws = 1;
             quoting = !notflag;
             continue;
         }
@@ -534,4 +558,4 @@
             continue;
         }
     }
-}
\ No newline at end of file
+}

I've used unsigned char simply because VC6 complains about int truncation.
Consolidate multiple spaces (outside pre) to one space.
Convert &#N; and &#xN; to a character when N is under 128.
Replace entity cright with COPY & copy.
Allow entities to be stopped by space, should the semicolon be absent.
Preserve all entities that aren't matched.
Add additional newlines around pre.
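
As a rough illustration of the "&#N; and &#xN;" item, the numeric decoding could be isolated into a helper like the one below. This is a sketch, not the code from the diff: the name decode_numeric_entity is hypothetical, and unlike the diff it also accepts an uppercase 'X' and rejects references with no digits or trailing junk.

```c
#include <stdlib.h>

/* Hypothetical helper: `ref` is the text between '&' and ';',
   for example "#65" or "#x41". Returns the character code when it is
   a valid reference below 128, or -1 otherwise. */
int decode_numeric_entity(const char *ref)
{
    const char *digits;
    char *end;
    long v;
    int base;

    if (ref[0] != '#')
        return -1;                       /* named entity, not numeric */
    if (ref[1] == 'x' || ref[1] == 'X') {
        digits = ref + 2;
        base = 16;
    } else {
        digits = ref + 1;
        base = 10;
    }
    v = strtol(digits, &end, base);
    if (end == digits || *end != '\0')
        return -1;                       /* no digits, or trailing junk */
    if (v < 0 || v > 127)
        return -1;                       /* outside plain ASCII */
    return (int)v;
}
```

A caller would fall back to the entity table, or print the reference unchanged, whenever this returns -1.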

bencollver 16.07.2024, 16:12 @ jadoxa	UnHTML

> > I found that the unhtml problem was the result of a buffer overflow.
>
> Nice find. That also led me to the newline problem - it never sees </pre
> because the newline is literal, thus the tag is "pre\n". Here's my
> complete diff.

Thanks for the diff. I built and posted it as unhtm10c.zip.

bencollver 28.06.2024, 04:26 @ mbbrutman	webdump 2024-05-23

Webdump is aware of the document structure. It parses and crawls the document tree. During this process it allocates a bunch of memory and uses a lot of stack space. It might be more comparable to Mozilla Readability than to unhtml.

mbbrutman Washington, USA, 28.06.2024, 05:39 (edited by mbbrutman, 28.06.2024, 06:07) @ bencollver	webdump 2024-05-23

> Webdump is aware of the document structure. It parses and crawls the
> document tree. During this process it allocates a bunch of memory and uses
> a lot of stack space. It might be more comparable to Mozilla readability
> than to unhtml.

Sorry, it just seems shocking that this can't be compiled for 16-bit machines. It doesn't seem that complicated.

bencollver 28.06.2024, 17:09 @ mbbrutman	webdump 2024-05-23

> Sorry, it just seems shocking that this can't be compiled for 16 bit
> machines. It doesn't seem that complicated.

I was able to compile it with OpenWatcom, I just couldn't get it to run in a useful manner. The code is small enough, but the memory usage at runtime is not.

bocke

30.06.2024, 00:13

@ bencollver

webdump 2024-05-23

Just for reference, you can also use the Links web browser to dump a formatted text version of a site.

links -dump file.htm > file.txt

Caveats: No UTF8 support. DOS also outputs stderr to the redirected file.

For example, this is the dump of Michael Brutman's DOS TCP/IP networking guide.

I compared it with the dump from Linux version of Links and, at first glance, everything looks the same.

Links also has some additional rendering options that might be useful sometimes. For example you can force it to make a numbered and referenced list of all links on the page. Or you can disable HTML tables or frame rendering.

P.S. I'll also test webdump to see how well it works.

P.P.S. Links dumps UTF8 on Linux and Windows.

jadoxa Queensland, Australia, 30.06.2024, 03:00 @ bencollver	webdump 2024-05-23

It's been updated to fix the split close tag issue.

bencollver 30.06.2024, 16:08 @ jadoxa	webdump 2024-05-23

> It's been updated to fix the split close tag issue.

Thanks! I posted an updated DJGPP build.
