DOS ain't dead

bencollver

Homepage

27.06.2024, 00:59
 

webdump 2024-05-23 (Announce)

webdump is (yet another) HTML to plain-text converter tool.

It reads HTML in UTF-8 from stdin and writes plain-text to stdout.

This DJGPP build uses the original source code unchanged.

Note: webdump requires more memory than will fit within real-mode constraints. No 16-bit build here.

Download:
gopher://tilde.pink/1/~bencollver/files/dos386/util/webdump/

Source:
https://codemadness.org/webdump.html

Example:

Format the GNU Privacy Handbook to plain text:

C:\>curl -o manual.htm https://www.gnupg.org/gph/en/manual.html
C:\>webdump -dilr -w 72 <manual.htm >manual.txt

mbbrutman

Homepage

Washington, USA,
27.06.2024, 16:56

@ bencollver

webdump 2024-05-23

Just curious - why does this need so much memory?

It seems like a UTF-8 to ASCII converter. The HTML tags are probably in straight ASCII so there is no need to make the code aware of the document structure. Maybe you could count open '<' and closing '>' characters if you had to.

A translation table from Unicode to ASCII doesn't require a lot of memory. So what am I missing here?
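
As a rough illustration of the counting approach described above, here is a minimal sketch in C. The function name strip_tags is made up for this example and does not come from webdump or unhtml. It only drops text between '<' and '>'; it does not handle entities, comments, scripts, or a '>' inside a quoted attribute value.

```c
#include <stddef.h>

/* Hypothetical helper: copy src to dst, dropping everything between a
   '<' and its closing '>'. depth counts '<' characters that have not
   been closed yet. dst must have room for strlen(src) + 1 bytes.
   Returns the length of the result. */
size_t strip_tags(const char *src, char *dst)
{
    size_t out = 0;
    int depth = 0;

    for (; *src != '\0'; src++) {
        if (*src == '<') {
            depth++;                /* entering a tag */
        } else if (*src == '>' && depth > 0) {
            depth--;                /* leaving a tag */
        } else if (depth == 0) {
            dst[out++] = *src;      /* plain text outside any tag */
        }
    }
    dst[out] = '\0';
    return out;
}
```

Because a newline inside a tag is skipped like any other character, a tag split across lines is removed as well.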

Rugxulo

Homepage

Usono,
27.06.2024, 20:01

@ mbbrutman

UnHTML

> UnHTML (24k) removes HTML and some SGML from text files, leaving the file
> in good condition for formatting/viewing with a word processor or JUSTIFY.
> I wrote it to get public domain text files from Web pages and other
> sources where the markup is copyrighted. Comes with C source. Includes
> Linux executable! (Tom Almy)

bencollver

Homepage

28.06.2024, 04:26

@ mbbrutman

webdump 2024-05-23

Webdump is aware of the document structure. It parses and crawls the document tree. During this process it allocates a bunch of memory and uses a lot of stack space. It might be more comparable to Mozilla readability than to unhtml.

bencollver

Homepage

28.06.2024, 04:34

@ Rugxulo

UnHTML

> > UnHTML (24k) removes HTML and some SGML from text files, leaving the file

Using the GNU Privacy Handbook example mentioned in my original post, unhtml gave the following error.

Null pointer assignment.

Unhtml apparently discarded most of the content, producing only 7k of text. For comparison, webdump output 105k of text.

mbbrutman

Homepage

Washington, USA,
28.06.2024, 05:39
(edited by mbbrutman, 28.06.2024, 06:07)

@ bencollver

webdump 2024-05-23

> Webdump is aware of the document structure. It parses and crawls the
> document tree. During this process it allocates a bunch of memory and uses
> a lot of stack space. It might be more comparable to Mozilla readability
> than to unhtml.

Sorry, it just seems shocking that this can't be compiled for 16 bit machines. It doesn't seem that complicated.

jadoxa

Homepage E-mail

Queensland, Australia,
28.06.2024, 08:43
(edited by jadoxa, 28.06.2024, 09:49)

@ bencollver

UnHTML

> unhtml gave the following error.
>
> Null pointer assignment.

Logic error when it tests entities (while (a[i].in) should be while (*a[i].in)).
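
To illustrate why the original test never stops at the end marker: in is an array member, so a[i].in is a non-null address for every entry, including the final empty one. Testing the first character instead does stop there. Below is a minimal sketch with a cut-down table; the names entity, table and lookup_entity are made up for this example and are not taken from unhtml.c.

```c
#include <stddef.h>
#include <string.h>

/* Cut-down, hypothetical entity table. The last entry has an empty
   name and acts as the end marker. */
typedef struct {
    char in[7];      /* entity name, e.g. "copy" */
    char out2[4];    /* ASCII substitute */
} entity;

static const entity table[] = {
    {"uml",  "\""},
    {"copy", "(C)"},
    {"ordf", "a"},
    {{0}, {0}}       /* sentinel: empty name */
};

/* Return the substitute for name, or NULL if it is not in the table.
   The loop tests the first character of the name (*table[i].in).
   Testing table[i].in instead would always be true, because an array
   member is never a null pointer, so the loop would run past the end
   of the table. */
const char *lookup_entity(const char *name)
{
    size_t i;

    for (i = 0; *table[i].in; i++) {
        if (strcmp(table[i].in, name) == 0)
            return table[i].out2;
    }
    return NULL;
}
```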

> For comparison, webdump output 105k of text.

Many lines of which end with >, but maybe that's just me? Although it happened with my Windows build, your djgpp build and a WSL Ubuntu build, on the file downloaded from the curl command you provided. Seems it doesn't like that file's <tag\n> format.

Edit: more specifically, it doesn't like the split end tag. Sent an email to the original dev.

bencollver

Homepage

28.06.2024, 17:09

@ mbbrutman

webdump 2024-05-23

> Sorry, it just seems shocking that this can't be compiled for 16 bit
> machines. It doesn't seem that complicated.

I was able to compile it with OpenWatcom, I just couldn't get it to run in a useful manner. The code is small enough, but not the memory usage at runtime.

bocke

30.06.2024, 00:13

@ bencollver

webdump 2024-05-23

Just for reference, you can also use the Links web browser to dump a formatted text version of the site.

links -dump file.htm > file.txt

Caveats: No UTF8 support. DOS also outputs stderr to the redirected file.

For example, this is the dump of Michael Brutman's DOS TCP/IP networking guide.

I compared it with the dump from the Linux version of Links and, at first glance, everything looks the same.

Links also has some additional rendering options that might be useful sometimes. For example you can force it to make a numbered and referenced list of all links on the page. Or you can disable HTML tables or frame rendering.

P.S. I'll also test webdump to see how well it works.

P.P.S. Links dumps UTF8 on Linux and Windows.

jadoxa

Homepage E-mail

Queensland, Australia,
30.06.2024, 03:00

@ bencollver

webdump 2024-05-23

It's been updated to fix the split close tag issue.

bencollver

Homepage

30.06.2024, 16:07

@ jadoxa

UnHTML

> Logic error when it tests entities (while (a[i].in) should be
> while (*a[i].in)).

I made this change, and now unhtml produces about 23k of text and then gives the following error.

*** NULL assignment detected

bencollver

Homepage

30.06.2024, 16:08

@ jadoxa

webdump 2024-05-23

> It's been updated to fix the split close tag issue.

Thanks! I posted an updated DJGPP build.

jadoxa

Homepage E-mail

Queensland, Australia,
01.07.2024, 02:05

@ bencollver

UnHTML

> > Logic error when it tests entities (while (a[i].in) should be
> > while (*a[i].in)).
>
> I made this change, and now unhtml produces about 23k of text and then
> gives the following error.
>
> *** NULL assignment detected

Strange, works fine in Windows (VC6), producing 100k of text (although I wouldn't say it worked well, could do with more newlines at the end at least).

bencollver

Homepage

14.07.2024, 01:30

@ jadoxa

UnHTML

> > > Logic error when it tests entities (while (a[i].in) should be
> > > while (*a[i].in)).
> >
> > I made this change, and now unhtml produces about 23k of text and then
> > gives the following error.
> >
> > *** NULL assignment detected
>
> Strange, works fine in Windows (VC6), producing 100k of text (although I
> wouldn't say it worked well, could do with more newlines at the end at
> least).

I built unhtml with DJGPP and it ran without any error messages. However the resulting text file is not complete. Chapter 4 and onward are missing.

jadoxa

Homepage E-mail

Queensland, Australia,
14.07.2024, 03:22

@ bencollver

UnHTML

> I built unhtml with DJGPP and it ran without any error messages. However
> the resulting text file is not complete. Chapter 4 and onward are missing.

I built it with djgpp 2.03 (June 2002 refresh) and gcc 4.6.2 (under Win7); the resulting text file was identical with the VC6 version (100262 bytes).

C:\unhtml>unhtml < manual.htm > manual.txt
HTML removing filter Version 1.0
Copyright 1996 by Tom Almy

C:\unhtml>tail manual.htm
><P
>In this section, GnuPG refers to the
GnuPG implementation of OpenPGP as well as other implementations
such as NAI's PGP product.</P
></TD
></TR
></TABLE
></BODY
></HTML
>
C:\unhtml>tail manual.txt
This can be confusing.
Sometimes trust in an owner is referred to as
owner-trust to
distinguish it from trust in a key.
Throughout this manual, however, ``trust'' is used to
mean trust in a key's
owner, and ``validity'' is used to mean trust that a key
belongs to the human associated with the key ID.[5]In this section, GnuPG refers to the
GnuPG implementation of OpenPGP as well as other implementations
such as NAI's PGP product.

bencollver

Homepage

14.07.2024, 05:16
(edited by bencollver, 14.07.2024, 05:32)

@ jadoxa

UnHTML

> > I built unhtml with DJGPP and it ran without any error messages. However
> > the resulting text file is not complete. Chapter 4 and onward are missing.
>
> I built it with djgpp 2.03 (June 2002 refresh) and gcc 4.6.2 (under Win7);
> the resulting text file was identical with the VC6 version (100262 bytes).

Interesting that we got different results. I had built it with djgpp 2.05 and gcc 7.2.0 under FreeDOS 1.3.

I rebuilt it using djgpp 2.03A and gcc 4.7.1 (the ones bundled with the FreeDOS 1.3 bonus CD) and this time around the resulting text file is complete.

Rugxulo

Homepage

Usono,
14.07.2024, 08:11

@ jadoxa

DJGPP 2.03p2 (June 2002)

> I built it with djgpp 2.03 (June 2002 refresh)

I believe this is called "2.03p2", aka patchlevel 2, with the Win2k/XP fixes. This was "/current/" until 2015.

bencollver

Homepage

15.07.2024, 02:14
(edited by bencollver, 15.07.2024, 02:26)

@ jadoxa

UnHTML

I found that the unhtml problem was the result of a buffer overflow. With more recent versions of GCC, the stack corruption overwrites the intitle variable with a garbage value, which suppresses all output after the overflow.

        char cmdbuf[20];
...
                while (ch != ' ' && ch != '>') {
                        cmdbuf[i++] = ch;
                        mygetchar();
                }


I used the following workaround:

                while (ch != ' ' && ch != '>') {
                        if (i < 19) cmdbuf[i++] = ch;
                        mygetchar();
                }
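
The same guard can be sketched in isolation. This is only an illustration under a few assumptions: it reads from a string rather than stdin so it can be tried on its own, and the name read_tag_name and its parameters are made up for this example. As in the workaround above, characters that do not fit are still consumed but are not stored.

```c
#include <stddef.h>

/* Hypothetical helper: copy a tag name from src into buf (capacity
   bufsize, including the terminating NUL). Stops at a space, '>' or
   the end of the string. Characters that do not fit are read and
   dropped rather than written past the end of buf. Returns the number
   of characters consumed from src. */
size_t read_tag_name(const char *src, char *buf, size_t bufsize)
{
    size_t n = 0;   /* characters consumed from src */
    size_t i = 0;   /* characters stored in buf */

    if (bufsize == 0)
        return 0;
    while (src[n] != '\0' && src[n] != ' ' && src[n] != '>') {
        if (i < bufsize - 1)
            buf[i++] = src[n];
        n++;
    }
    buf[i] = '\0';
    return n;
}
```

With a 20-byte buffer, at most 19 characters are stored, which matches the i < 19 test above and leaves room for the terminator.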


Here's a link to download binaries.

gopher://tilde.pink/1/~bencollver/files/dos/util/unhtml/

jadoxa

Homepage E-mail

Queensland, Australia,
16.07.2024, 04:54

@ bencollver

UnHTML

> I found that the unhtml problem was the result of a buffer overflow.

Nice find. That also led me to the newline problem - it never sees </pre because the newline is literal, thus the tag is "pre\n". Here's my complete diff.


--- ../unhtml/unhtml.c  1996-02-18 07:06:06 +1000
+++ unhtml.c    2024-07-16 12:43:58 +1000
@@ -20,8 +20,8 @@
 
 typedef struct {
        char in[7];
-       char out1d;             /* DOS character (USA codepage) */
-       char out1w;             /* Windows character */
+       unsigned char out1d; /* DOS character (USA codepage) */
+       unsigned char out1w; /* Windows character */
        char out2[4];   /* ASCII substitute */
        char use2;              /* 1- use out2 instead of out1d for dos2flag
                        2- diacritical marked character
@@ -58,7 +58,8 @@
        {"#167", 21, 167, "%"},
        {"uml", '"', 168, "\""},
        {"#168", '"', 168, "\""},
-       {"cright", 'C', 169, "(C)",1},
+       {"COPY", 'C', 169, "(C)",1},
+       {"copy", 'C', 169, "(C)",1},
        {"#169", 'C', 169, "(C)",1},
        {"ordf", 166, 170, "a"},
        {"#170", 166, 170, "a"},
@@ -173,6 +174,8 @@
        {{0},0,0,{0}}
 };
 
+/* the longest name above */
+#define MAX_SUB 6
 
 
 void newline(void) {
@@ -208,13 +211,24 @@
 }
 
 void mygetchar(void) {
+       int space = 0;
        for (;;) {
                ch = getchar();
-               if (ch == '\n' && !quoting) ch = ' ';       /* convert to whitespace */
                if (ch == EOF) {
                        cnewline();
                        exit(0);
                }
+               if (!quoting) {
+                       if (ch == '\n' || ch == '\t') ch = ' ';    /* convert to whitespace */
+                       if (ch == ' ') {
+                               space = 1;              /* consolidate multiple spaces */
+                               continue;
+                       }
+                       if (space) {
+                               ungetc(ch, stdin);
+                               ch = ' ';
+                       }
+               }
                return;
        }
 }
@@ -253,7 +267,8 @@
 
 void main(int argc, char **argv) {
        int notflag=0, intitle=0;
-       char cmdbuf[20];
+       #define CMDBUF_SIZE 32
+       char cmdbuf[CMDBUF_SIZE];
        int listlevel = -1; /* not in a list */
        int listcount[10];      /* current counter value at each list level */
        int i;
@@ -296,30 +311,37 @@
                        /* special character processing */
                        mygetchar();
                        i=0;
-                       while (ch != ';' && i < 12) {
+                       while (ch != ';' && !isspace(ch) && i < CMDBUF_SIZE - 1) {
                                cmdbuf[i++] = ch;
                                mygetchar();
                        }
+                       if (intitle) continue;
                        cmdbuf[i] = 0;
-                       if (i > 10) {
-                               /* bad &; field, should not occur, but I've seen them! */
-                               if (!intitle) {
-                                       printf("&%s%c", cmdbuf, ch);
+                       if (*cmdbuf == '#') {
+                               if (cmdbuf[1] == 'x') {
+                                       i = (int)strtol(cmdbuf + 2, 0, 16);
+                               } else {
+                                       i = (int)strtol(cmdbuf + 1, 0, 10);
+                               }
+                               if (i < 128) {
+                                       putchar(i);
                                        startline = 0;
+                                       continue;
                                }
-                               continue;
                        }
-                       i = 0;
-                       while (a[i].in) {
-                               if (strcmp(a[i].in,cmdbuf)==0) {
-                                       if (!intitle) {
+                       if (i <= MAX_SUB) {
+                               i = 0;
+                               while (*a[i].in) {
+                                       if (strcmp(a[i].in,cmdbuf)==0) {
                                                putTableChar(i);
-                                               startline = 0;
+                                               i = 0;
+                                               break;
                                        }
-                                       break;
+                                       i++;
                                }
-                               i++;
                        }
+                       if (i) printf("&%s%c", cmdbuf, ch);
+                       startline = 0;
                        continue;
                }
                /* process <> command */
@@ -330,7 +352,7 @@
                        mygetchar();
                }
                i=0;
-               while (ch != ' ' && ch != '>') {
+               while (!isspace(ch) && ch != '>' && i < CMDBUF_SIZE - 1) {
                        cmdbuf[i++] = ch;
                        mygetchar();
                }
@@ -391,7 +413,9 @@
                }
                if (strcmp("pre", cmdbuf)==0) {
                        /* preformatted */
-                       if (!notflag) cnewline();
+                       cnewline();
+                       newline();
+                       if (notflag) skipws = 1;
                        quoting = !notflag;
                        continue;
                }
@@ -534,4 +558,4 @@
                        continue;
                }
        }
-}
\ No newline at end of file
+}


I've used unsigned char simply because VC6 complains about int truncation.
Consolidate multiple spaces (outside pre) to one space.
Convert &#N; and &#xN; to a character when N is under 128.
Replace entity cright with COPY & copy.
Allow entities to be stopped by space, should the semicolon be absent.
Preserve all entities that aren't matched.
Add additional newlines around pre.
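
The numeric entity step can be sketched on its own like this. The function name decode_numeric_entity is made up for the example, and unlike the diff it also rejects negative values and trailing characters, so treat it as an illustration rather than the code that went into unhtml.

```c
#include <stdlib.h>

/* Hypothetical helper: name is the text between '&' and ';', for
   example "#65" or "#x41". Returns the character code if it is a
   numeric reference below 128, otherwise -1. */
int decode_numeric_entity(const char *name)
{
    const char *digits;
    char *end;
    long value;
    int base;

    if (name[0] != '#')
        return -1;                  /* named entity, not numeric */
    if (name[1] == 'x' || name[1] == 'X') {
        digits = name + 2;          /* hexadecimal: &#xN; */
        base = 16;
    } else {
        digits = name + 1;          /* decimal: &#N; */
        base = 10;
    }
    if (*digits == '\0')
        return -1;                  /* nothing after the prefix */
    value = strtol(digits, &end, base);
    if (*end != '\0' || value < 0 || value >= 128)
        return -1;                  /* trailing junk or outside ASCII */
    return (int)value;
}
```

A caller would print the returned character when the result is not -1 and otherwise fall back to the entity table or pass the text through unchanged.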

bencollver

Homepage

16.07.2024, 16:12

@ jadoxa

UnHTML

> > I found that the unhtml problem was the result of a buffer overflow.
>
> Nice find. That also led me to the newline problem - it never sees </pre
> because the newline is literal, thus the tag is "pre\n". Here's my
> complete diff.

Thanks for the diff. I built and posted it as unhtm10c.zip
