[Skip to the validation tools!]
On February 7, 1997, I posted a message to Web4Lib reporting statistics on the degree to which library home pages validated against an HTML DTD. I had selected the HTML 3.2 DTD, which had been formally recommended by the World Wide Web Consortium the month before (a document criticized in some circles for being too permissive). Out of 624 North American libraries' home pages, the disappointing statistics were:
It is easy to present such statistics unfairly: many knowledgeable web authors can justify instances in which they violate HTML rules, and obviously some violations are more serious than others. For that matter, the consequences of violating HTML rules are usually not too serious; the differences between a badly formatted HTML document and a badly formatted tax return are pretty clear.
Nonetheless, "bad" HTML, by definition, can have unpredictable consequences for users of some browsers. HTML is one implementation of SGML, the Standard Generalized Markup Language. In SGML, the rules and requirements for a given variety of document are listed in the Document Type Definition, or DTD. There have been several HTML DTDs, representing various versions of HTML: the most notable are probably HTML 2.0, still the only version to be standardized as an Internet RFC; the original draft for HTML 3.0, which never evolved into a complete recommendation and—advertising claims notwithstanding—never received broad support from browser makers; HTML 3.2, originally code named "Wilbur"; and the three flavors of HTML 4.0: Strict, Transitional, and Frames. In addition to these more-or-less official DTDs, there have been DTDs created to describe Netscape Navigator's behavior post facto, an Internet Explorer DTD from Microsoft, and the DTDs used by several HTML editors.
Since browsers render HTML pages based on a DTD, or at least tagging rules that can be described in a DTD, it follows that documents which break the rules of that DTD may be rendered with unpredictable results. For a description of a worst-case scenario, read about the many documents that disappeared when Netscape 2.0 came out.
Validation is the process in which a document is compared to a set of document rules, in this context a DTD, and then a report of rule violations is created. Typically, a document author will submit a draft document to a validation service, use the report to identify errors, correct those errors, and resubmit the document for validation.
Note that validation compares a document only to the syntactical rules of a DTD. There are more general "checker" programs that provide services like spell-checking, link verification, and stylistic tips. None of these programs, however, can check for abuses of the intent of HTML syntax; using <BLOCKQUOTE> in an attempt to force indentation, for example, is something validators cannot check; whether this should actually be considered invalid HTML is subject to much debate.
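To illustrate that point with a small fragment of my own (not taken from any of the pages surveyed above), the following markup is syntactically valid, so every validator will pass it, even though BLOCKQUOTE is being used purely for visual indentation rather than to mark a quotation:

```html
<!-- Syntactically valid, but abuses the intent of BLOCKQUOTE: -->
<blockquote>
This paragraph is not quoted from anywhere; the author simply
wanted it indented. No validator can flag this as an error.
</blockquote>
```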
It is only fair to provide one caution about HTML validation services. Since validating documents against a DTD is an SGML operation, and since SGML is a very complex standard with a jargon of its own, some validation services provide error messages which are difficult to understand. Also, one small error can create a cascade of dire error messages. For example, this code...
<html>
<head><title>Test Page</title>
</head>
<body>
<h1>Test</h1>
<dl>
   <li>1
   <li>2
   <li>3
   <li>4
</ul>
...creates this list of error messages:
test.html:9:4:E: element `LI' not allowed here
test.html:10:4:E: element `LI' not allowed here
test.html:11:4:E: element `LI' not allowed here
test.html:12:4:E: element `LI' not allowed here
test.html:13:5:E: end tag for element `UL' which is not open
test.html:13:7:E: `DL' not finished but document ended
test.html:13:7:E: end tag for `DL' omitted, but its declaration does not permit this
test.html:8:1: start tag was here
The only problem with the original source was that it opened the list with a DL tag instead of a UL.
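For comparison, here is the same fragment with that single fix applied: the list is opened with UL, so the closing tag matches. All eight error messages disappear; the remaining omitted end tags (for LI, BODY, and HTML) are permitted by the HTML 3.2 DTD.

```html
<html>
<head><title>Test Page</title>
</head>
<body>
<h1>Test</h1>
<ul>
   <li>1
   <li>2
   <li>3
   <li>4
</ul>
```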
The best way to determine why (or whether) to validate your HTML is to ask what could possibly happen if you do not validate. As described above, the way invalid HTML is rendered across a wide variety of browsers is unpredictable. In some environments, such as corporate or campus intranets, there may be no wide variety of browsers.
Bear in mind that the browser market may be about to fragment considerably. Browsers like pwWebSpeak and WebTV have already taken the browser metaphor out of the traditional desktop computer to the realm of voice synthesizers and televisions. A new generation of hand-held computers is emerging, including Internet connectivity and web browsers. And beyond that, browsers should be appearing this year for telephones and pagers. The only way to write documents now with a reasonable chance they will render as expected on these disparate platforms is to write to some standard on the assumption the browsers themselves will conform to that standard or a superset of that standard.
Many web authors avoid validation in part because it is presented to them as an authoritarian process to be carried out robotically across every single document on the server.
In fact, if validation serves to provide maximum predictability over time and over a range of browsers, authors can safely choose not to validate if such predictability is not terribly valuable. While it may never hurt to validate a document--you never know what surprises will turn up--some documents will serve their purpose even with numerous errors. As a case in point, the OhioLINK web site contains committee minutes, usage statistics and charts, and internal reports which have been generated by an HTML conversion utility known to include some pretty crazy tagging. I do not ask authors to validate these, mainly because I do not believe the benefit we would receive from validation would be worth the added time and frustration to inexperienced web authors.
Instead, I reserve validation efforts for higher-level pages on our web server that are likely to be viewed by a wider audience.
This is not offered as a complete list of HTML validators and checkers. It is a list of programs and services I have worked with; if you want a recommendation, I suggest WebTechs.
A number of services on the web allow you to enter a URL or some HTML source and have it validated. Some allow you to paste the text of a document into a form input, but many seem to work most conveniently with a URL for a document already available on the web. One way to use these services from your desktop is to run a light-duty web server on your own computer; if you are running Windows 95, one option is to install Microsoft's Personal Web Server, even if you only want to run it while validating your documents.
And Then There Were Two: As of December 1, 1998, WebTechs has discontinued its validation service. To my knowledge, it was the first such service on the net, but unfortunately it had fallen behind similar services.
August 21, 1998: The good people at the Web Design Group have just announced a new validation service. This service is designed to be very rigorous, but provides more informative feedback on validation errors than many other validators. Note that it counts as errors the lack of a document type declaration (although it then assumes HTML 4.0 Transitional) and unescaped ampersands in URLs. It is debatable whether either of these is actually wrong in any harmful way, but it's helpful to see messages about these.
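The ampersand issue is easy to reproduce. Strictly speaking, a raw ampersand in an attribute value begins an entity reference, so a URL carrying CGI parameters should have its ampersands escaped. In this small illustration of my own (the URL and parameter names are invented), the first link draws an error message and the second does not:

```html
<!-- Flagged: the raw & looks like the start of an entity reference -->
<a href="search.cgi?term=validation&page=2">results</a>

<!-- Accepted: the ampersand is written as an entity -->
<a href="search.cgi?term=validation&amp;page=2">results</a>
```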
With the release of HTML 4.0, the W3C established its own validation service. It builds upon the KGV (the Kinder, Gentler Validator, described below), and was developed by the KGV's original creator.
The W3C validator also requires a document type declaration in the HTML document.
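A document type declaration is a single line before the <html> tag identifying which DTD the document claims to follow. The public identifiers below are the ones published with each recommendation; use the one matching the DTD you actually validate against:

```html
<!-- HTML 2.0 -->
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">

<!-- HTML 3.2 -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">

<!-- HTML 4.0 Strict -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">

<!-- HTML 4.0 Transitional -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<!-- HTML 4.0 Frameset -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN">
```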
The WebTechs validation service (sometimes still referred to by its
old name, the HALSoft validation service) is widely considered as
authoritative as you can get. It is a strict SGML validator (it uses
nsgmls, below), and it admittedly does a better job of identifying problems
than explaining them. However, it is one of the few services that lets you
select the specific DTD against which to validate.
August 21, 1998: the WebTechs service is still using draft versions of the HTML 4.0 DTDs, roughly nine months after the official HTML 4.0 release. They have not replied to e-mail asking when they will begin using the official DTDs.
December 2, 1998: the WebTechs service has been discontinued. Requiescat in Pace.
The intent behind the KGV is to provide a service as
authoritative as WebTechs while providing friendlier, more informative
feedback. To a large extent, it succeeds. Unfortunately, in my opinion,
KGV expects all documents to contain a document type declaration, which is
neither common nor required on the web; if it finds no declaration, it
defaults to HTML 2.0, which is probably a more conservative choice than
most authors will expect.
August 21, 1998: the KGV has been decommissioned. Its URL now refers the user to the W3C Validation Service. There is no date indicating when KGV went away, but it was probably sometime in summer, 1998.
Doctor HTML points out in its FAQ that it is not an SGML validator, so users have to assume that it is not using any DTD, but is trying to create a set of rules drawn from a mixture of standard DTDs, Netscape and Microsoft documentation, and perhaps other sources. Its real strength lies in the services it provides in addition to checking HTML syntax: spelling checks, link verification, etc.
Theoretically, Yahoo should track additions and changes to these services more frequently than I expect to.
These services run locally on your system and can check files without requiring you to put them under a web server. Many are Unix command-line utilities, but there are at least two Windows 95 programs. If you can point me to a rigorous Mac validator (not just a checker, please), let me know and I will be happy to include it.
nsgmls is the most widely used SGML validator on the Internet. Be aware that it is exhaustively comprehensive, beyond the point of user friendliness. However, it is capable of using any DTD you can provide, and all the HTML DTDs can be found on the net.
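A typical invocation looks like the following. The catalog path is an assumption about where your SGML catalog (the file mapping a DOCTYPE's public identifier to a local copy of the DTD) happens to live; adjust it for your own installation.

```shell
# Validate test.html quietly: -s suppresses normal output so that
# only error messages appear, and -c names the SGML catalog that
# maps public identifiers to local DTD files.
nsgmls -s -c /usr/local/lib/sgml/catalog test.html
```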
At one point, there were a handful of Unix command-line utilities that checked local HTML files; most of these were not rigorous validators, but did detect most tagging problems and also offered some other syntax help. Most of these have expired; one that still has a little life is Weblint. As of August 21, 1998, it is still on version 1.020 from September 1997, and only supports HTML through version 3.2. It also supports non-standard extensions for Netscape 4.x and Microsoft Internet Explorer 4.x.
The Spyglass HTML Validator is a free Windows 95 utility. It allows users to select from four DTDs and has a minimal editing facility for fixing errors within the program. Unfortunately, it provides no way to add new DTDs and three of the four precompiled DTDs need updating.
The Spyglass validator is no longer being supported, and will not add support for anything beyond HTML 3.2.
This is a shareware program that currently costs US$24.95 to register. It does not seem to communicate which DTD, if any, it uses, and is reported not to be a rigorous validator, but a less formal syntax checker.