gnu.javax.swing.text.html.parser.support

Class Parser

Implemented Interfaces:
DTDConstants
Known Direct Subclasses:
DomHTMLParser

public class Parser
extends ReaderTokenizer
implements DTDConstants

A simple error-tolerant HTML parser that uses a DTD document to access data on the possible tokens, arguments and syntax.

The parser reads HTML content from a Reader and calls various notification methods (which should be overridden in a subclass) when tags or data are encountered.

Some HTML elements need no opening or closing tags. The task of this parser is to invoke the tag handling methods even when the tags are not explicitly specified and must be supposed using the information stored in the DTD. For example, parsing the document

<table><tr><td>a<td>b<td>c</tr>
will invoke the handling methods in exactly the same order (and with the same parameters) as if parsing the document:
<html><head></head><body><table><tbody><tr><td>a</td><td>b</td><td>c</td></tr></tbody></table></body></html> (the tags missing from the first document are the ones supposed by the parser). The parser also supports obsolete elements of HTML syntax.
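
The callback pattern described above can be tried out with the standard javax.swing.text.html.parser.ParserDelegator entry point. This is only a sketch under assumptions: it does not call the GNU support class directly, the callback signatures belong to HTMLEditorKit.ParserCallback rather than to this Parser, and the exact set of supposed tags reported depends on the DTD shipped with the runtime. The class name ImpliedTagDemo is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class: logs parser callbacks in the order they arrive.
class ImpliedTagDemo {

    /** Parses the given HTML and returns the callback events in order. */
    static List<String> events(String html) {
        final List<String> log = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                log.add("start " + t);
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                log.add("end " + t);
            }

            @Override
            public void handleText(char[] data, int pos) {
                log.add("text " + new String(data));
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        // Start and end events also appear for tags the input never spelled out.
        for (String event : events("<table><tr><td>a<td>b<td>c</tr>")) {
            System.out.println(event);
        }
    }
}
```

Printing the log is a quick way to compare the events for the abbreviated document with those for the fully tagged one.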

Field Summary

protected DTD
dtd
The document type definition (DTD) that will be used to parse the documents.
Token
hTag
The current html tag.
protected int
preformatted
This field has a positive value inside preformatted tags.
protected boolean
strict
The value of this field determines whether or not the Parser will be strict in enforcing SGML compatibility.

Fields inherited from class gnu.javax.swing.text.html.parser.support.low.ReaderTokenizer

advanced, backupMode

Fields inherited from class gnu.javax.swing.text.html.parser.support.low.Constants

AP, BEGIN, COMMENT_END, COMMENT_OPEN, COMMENT_TRIPLEDASH_END, DOUBLE_DASH, END, ENTITY, ENTITY_NAMED, ENTITY_NUMERIC, EOF, EQ, EXCLAMATION, NUMTOKEN, OTHER, QUOT, SCRIPT, SCRIPT_CLOSE, SCRIPT_OPEN, SGML, SLASH, STYLE, STYLE_CLOSE, STYLE_OPEN, TAG, TAG_CLOSE, WS, bDIGIT, bLETTER, bLINEBREAK, bNAME, bQUOTING, bSINGLE_CHAR_TOKEN, bSPECIAL, bWHITESPACE

Fields inherited from interface javax.swing.text.html.parser.DTDConstants

ANY, CDATA, CONREF, CURRENT, DEFAULT, EMPTY, ENDTAG, ENTITIES, ENTITY, FIXED, GENERAL, ID, IDREF, IDREFS, IMPLIED, MD, MODEL, MS, NAME, NAMES, NMTOKEN, NMTOKENS, NOTATION, NUMBER, NUMBERS, NUTOKEN, NUTOKENS, PARAMETER, PI, PUBLIC, RCDATA, REQUIRED, SDATA, STARTTAG, SYSTEM

Constructor Summary

Parser(DTD a_dtd)
Creates a new Parser that uses the given DTD.

Method Summary

protected void
CDATA(boolean clearBuffer)
Read parseable character data, add to buffer.
protected void
Comment()
Process Comment.
protected void
Script()
Read a script.
protected void
Sgml()
Process SGML insertion that is not a comment.
protected void
Style()
Read a style definition.
protected void
Tag()
Read an HTML tag.
protected void
_handleText()
A hook for operations preceding the call to handleText.
protected void
append(Token t)
Add the image of this token to the buffer.
protected void
consume(pattern p)
Consume pattern that must match.
protected void
endTag(boolean omitted)
The method is called when the HTML end (closing) tag is found or if the parser concludes that one should be present in the current position.
void
error(String msg)
Invokes the error handler.
void
error(String msg, Token atToken)
Invokes the error handler.
void
error(String msg, String invalid)
Invokes the error handler.
void
error(String parm1, String parm2, String parm3)
Invokes the error handler.
void
error(String parm1, String parm2, String parm3, String parm4)
Invokes the error handler.
void
flushAttributes()
SimpleAttributeSet
getAttributes()
Get the attributes of the current tag.
protected int
getCurrentLine()
Get the first line of the last parsed token.
protected void
handleComment(char[] comment)
Handle HTML comment.
protected void
handleEOFInComment()
This is additionally called when the HTML content terminates without closing the HTML comment.
protected void
handleEmptyTag(TagElement tag)
Handle the tag with no content, like <br>.
protected void
handleEndTag(TagElement tag)
The method is called when the HTML closing tag (like </table>) is found or if the parser concludes that one should be present in the current position.
protected void
handleError(int line, String message)
protected void
handleStartTag(TagElement tag)
The method is called when the HTML opening tag (like <table>) is found or if the parser concludes that one should be present in the current position.
protected void
handleText(char[] text)
Handle the text section.
protected void
handleTitle(char[] title)
Handle HTML <title> tag.
protected TagElement
makeTag(Element element)
Constructs the tag from the given element.
protected TagElement
makeTag(Element element, boolean isSupposed)
Constructs the tag from the given element.
protected void
markFirstTime(Element element)
This is called when the tag representing the given element occurs for the first time in the document.
protected Token
mustBe(int kind)
Consume the token that was checked before and hence MUST be present.
protected void
noValueAttribute(String element, String attribute)
Handle attribute without value.
protected Token
optional(int kind)
Consume the optional token, if present.
void
parse(Reader reader)
Parse the HTML text, calling various methods in response to the occurrence of the corresponding HTML constructions.
String
parseDTDMarkup()
Parses DTD markup declaration.
protected void
parseDocument()
Parse the html document.
boolean
parseMarkupDeclarations(StringBuffer strBuff)
Parse SGML insertion ( <! ... > ).
protected void
readAttributes(String element)
Read the element attributes, adding them into attribute set.
protected String
resolveNamedEntity(String a_tag)
Return the string corresponding to the given named entity.
protected char
resolveNumericEntity(String a_tag)
Return the char corresponding to the given numeric entity.
protected void
restart()
Reset all fields to the initial default state, preparing the parser for parsing the next document.
protected void
startTag(TagElement tag)
The method is called when the HTML opening tag (like <table>) is found or if the parser concludes that one should be present in the current position.

Methods inherited from class gnu.javax.swing.text.html.parser.support.low.ReaderTokenizer

error, getEndOfLineSequence, getNextToken, getTokenAhead, getTokenAhead, mark, reset, reset

Methods inherited from class gnu.javax.swing.text.html.parser.support.low.Constants

endMatches

Methods inherited from class java.lang.Object

clone, equals, getClass, finalize, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details

dtd

protected DTD dtd
The document type definition (DTD) that will be used to parse the documents.

hTag

public Token hTag
The current html tag.

preformatted

protected int preformatted
This field has a positive value inside preformatted tags.

strict

protected boolean strict
The value of this field determines whether or not the Parser will be strict in enforcing SGML compatibility. The default value is false, stating that the parser should do everything to parse and get at least some information even from the incorrectly written HTML input.

Constructor Details

Parser

public Parser(DTD a_dtd)
Creates a new Parser that uses the given DTD. The only standard way to get an instance of DTD is to construct it manually, filling in all required fields.
Parameters:
a_dtd - The DTD to use. The parser behaviour after passing null as an argument is not documented and may vary between implementations.

Method Details

CDATA

protected void CDATA(boolean clearBuffer)
            throws ParseException
Read parseable character data, add to buffer.
Parameters:
clearBuffer - If true, the buffer is filled by the CDATA section; otherwise the section is appended to the existing content of the buffer.
Throws:
ParseException -

Comment

protected void Comment()
            throws ParseException
Process Comment. This method skips till --> without taking SGML constructs into consideration. The supported SGML constructs are handled separately.

Script

protected void Script()
            throws ParseException
Read a script. The text, returned without any changes, is terminated only by the closing tag SCRIPT.

Sgml

protected void Sgml()
            throws ParseException
Process SGML insertion that is not a comment.

Style

protected void Style()
            throws ParseException
Read a style definition. The text, returned without any changes, is terminated only by the closing tag STYLE.

Tag

protected void Tag()
            throws ParseException
Read an HTML tag.

_handleText

protected void _handleText()
A hook for operations preceding the call to handleText. Handles the text in a string buffer. In non-preformatted mode, all line breaks immediately following a start tag and immediately before an end tag are discarded, \r, \n and \t are replaced by spaces, multiple spaces are replaced by a single one, and the result is moved into an array that is passed to handleText().

append

protected final void append(Token t)
Add the image of this token to the buffer.
Parameters:
t - A token to append.

consume

protected final void consume(pattern p)
Consume pattern that must match.
Parameters:
p - A pattern to consume.

endTag

protected void endTag(boolean omitted)
The method is called when the HTML end (closing) tag is found or if the parser concludes that one should be present in the current position. The method is called immediately before calling handleEndTag().
Parameters:
omitted - True if the tag is not actually present in the document, but is supposed by the parser (like </html> at the end of the document).

error

public void error(String msg)
Invokes the error handler. The default method in this implementation delegates the call to handleError, also providing the current line.

error

public void error(String msg,
                  Token atToken)
Invokes the error handler.
Overrides:
error in class ReaderTokenizer

error

public void error(String msg,
                  String invalid)
Invokes the error handler. The default method in this implementation delegates the call to error (msg+": '"+invalid+"'").

error

public void error(String parm1,
                  String parm2,
                  String parm3)
Invokes the error handler. The default method in this implementation delegates the call to error (parm1+" "+ parm2+" "+ parm3).

error

public void error(String parm1,
                  String parm2,
                  String parm3,
                  String parm4)
Invokes the error handler. The default method in this implementation delegates the call to error (parm1+" "+ parm2+" "+ parm3+" "+ parm4).
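
The delegation chain described for the error overloads can be pictured roughly as follows. This is a standalone sketch, not the Classpath source: the class name ErrorChainSketch, the currentLine field (a stand-in for getCurrentLine()) and the reported list are hypothetical, and the overload taking a Token is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mirrors the documented delegation between
// the error(...) overloads and handleError(int, String).
class ErrorChainSketch {
    /** Collected "line: message" strings, so the result can be inspected. */
    final List<String> reported = new ArrayList<>();

    /** Assumed stand-in for the value returned by getCurrentLine(). */
    int currentLine = 1;

    /** End of the chain; a subclass would override this to report errors. */
    void handleError(int line, String message) {
        reported.add(line + ": " + message);
    }

    /** Delegates to handleError, also providing the current line. */
    void error(String msg) {
        handleError(currentLine, msg);
    }

    /** Quotes the invalid fragment and delegates to error(String). */
    void error(String msg, String invalid) {
        error(msg + ": '" + invalid + "'");
    }

    /** Joins the parts with spaces and delegates to error(String). */
    void error(String parm1, String parm2, String parm3) {
        error(parm1 + " " + parm2 + " " + parm3);
    }

    /** Joins the parts with spaces and delegates to error(String). */
    void error(String parm1, String parm2, String parm3, String parm4) {
        error(parm1 + " " + parm2 + " " + parm3 + " " + parm4);
    }
}
```

With this shape, overriding handleError alone is enough to capture every message, whichever overload the parser calls.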

flushAttributes

public void flushAttributes()

getAttributes

public SimpleAttributeSet getAttributes()
Get the attributes of the current tag.
Returns:
The attribute set, representing the attributes of the current tag.

getCurrentLine

protected int getCurrentLine()
Get the first line of the last parsed token.

handleComment

protected void handleComment(char[] comment)
Handle HTML comment. The default method returns without action.
Parameters:
comment -

handleEOFInComment

protected void handleEOFInComment()
This is additionally called when the HTML content terminates without closing the HTML comment. This can only happen if the HTML document contains errors (for example, the closing --> is missing).

handleEmptyTag

protected void handleEmptyTag(TagElement tag)
            throws ChangedCharSetException
Handle the tag with no content, like <br>. The method is called for the elements that, in accordance with the current DTD, have empty content.
Parameters:
tag - The tag being handled.

handleEndTag

protected void handleEndTag(TagElement tag)
The method is called when the HTML closing tag (like </table>) is found or if the parser concludes that one should be present in the current position.
Parameters:
tag - The tag

handleError

protected void handleError(int line,
                           String message)

handleStartTag

protected void handleStartTag(TagElement tag)
The method is called when the HTML opening tag (like <table>) is found or if the parser concludes that one should be present in the current position.
Parameters:
tag - The tag

handleText

protected void handleText(char[] text)
Handle the text section.

For a non-preformatted section, the parser replaces \t, \r and \n by spaces and then multiple spaces by a single space. Additionally, all whitespace around tags is discarded.

For preformatted text (inside TEXTAREA and PRE), the parser preserves all tabs and spaces, but removes one bounding \r, \n or \r\n, if it is present. Additionally, it replaces each occurrence of \r or \r\n by a single \n.

Parameters:
text - A section text.
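
The two whitespace rules above can be approximated in a few lines. This is a sketch under assumptions, not the parser's own code: the class and method names are hypothetical, the discarding of whitespace around tags is not modelled, and "one bounding line break" is assumed to mean at most one at each end of the section.

```java
// Hypothetical helper illustrating the documented whitespace handling.
class TextNormalizationSketch {

    /** Non-preformatted text: \t, \r and \n become spaces, then runs of spaces collapse. */
    static String normal(String text) {
        return text.replaceAll("[\\t\\r\\n]", " ").replaceAll(" {2,}", " ");
    }

    /**
     * Preformatted text: tabs and spaces are preserved, every \r or \r\n
     * becomes \n, and (by assumption) one bounding line break is dropped
     * at each end.
     */
    static String preformatted(String text) {
        // Unify line breaks first so the bounding break is always a single \n.
        String s = text.replace("\r\n", "\n").replace('\r', '\n');
        if (s.startsWith("\n")) {
            s = s.substring(1);
        }
        if (s.endsWith("\n")) {
            s = s.substring(0, s.length() - 1);
        }
        return s;
    }
}
```

For example, normal("a\t\tb\r\nc") gives "a b c", while preformatted("\r\nx\ty\rz\n") gives "x\ty\nz".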

handleTitle

protected void handleTitle(char[] title)
Handle HTML <title> tag. This method is invoked after both the opening and the closing title tags have been processed. The passed argument contains the concatenation of all title text sections.
Parameters:
title - The title text.

makeTag

protected TagElement makeTag(Element element)
Constructs the tag from the given element. In this implementation, this is defined, but never called.
Returns:
the tag

makeTag

protected TagElement makeTag(Element element,
                             boolean isSupposed)
Constructs the tag from the given element.
Parameters:
isSupposed - true if the tag is not actually present in the HTML input, but the parser supposes that it should occur in the current location.
Returns:
the tag

markFirstTime

protected void markFirstTime(Element element)
This is called when the tag representing the given element occurs for the first time in the document.
Parameters:
element -

mustBe

protected Token mustBe(int kind)
Consume the token that was checked before and hence MUST be present.
Parameters:
kind - The kind of token to consume.

noValueAttribute

protected void noValueAttribute(String element,
                                String attribute)
Handle attribute without value. The default method uses the only allowed attribute value from DTD. If the attribute is unknown or allows several values, the HTML.NULL_ATTRIBUTE_VALUE is used. The attribute with this value is added to the attribute set.
Parameters:
element - The name of element.
attribute - The name of attribute without value.
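
The rule for choosing the value of a valueless attribute can be sketched as below. The class name, the method name and the idea of passing the allowed values as a list are assumptions made for illustration; the real method looks the attribute up in the DTD. HTML.NULL_ATTRIBUTE_VALUE is the constant from javax.swing.text.html.HTML.

```java
import java.util.List;
import javax.swing.text.html.HTML;

// Hypothetical helper: picks the value stored for an attribute written without one.
class NoValueAttributeSketch {

    /**
     * @param allowedValues the values the DTD permits for the attribute,
     *                      or null when the attribute is unknown (assumed encoding)
     * @return the single permitted value, otherwise HTML.NULL_ATTRIBUTE_VALUE
     */
    static String valueFor(List<String> allowedValues) {
        if (allowedValues != null && allowedValues.size() == 1) {
            return allowedValues.get(0);
        }
        return HTML.NULL_ATTRIBUTE_VALUE;
    }
}
```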

optional

protected Token optional(int kind)
Consume the optional token, if present.
Parameters:
kind - The kind of token to consume.

parse

public void parse(Reader reader)
            throws IOException
Parse the HTML text, calling various methods in response to the occurence of the corresponding HTML constructions.
Parameters:
reader - The reader to read the source HTML from.
Throws:
IOException - If the reader throws one.

parseDTDMarkup

public String parseDTDMarkup()
            throws IOException
Parses DTD markup declaration. Currently returns null without action.
Returns:
null.
Throws:
IOException -

parseDocument

protected void parseDocument()
            throws ParseException
Parse the html document.

parseMarkupDeclarations

public boolean parseMarkupDeclarations(StringBuffer strBuff)
            throws IOException
Parse SGML insertion ( <! ... > ). When an SGML insertion is found, this method is called, passing the SGML in the string buffer as a parameter. The default method returns false without action and can be overridden to implement user-defined SGML support.

For more information about SGML insertions in HTML documents, the author suggests reading the SGML tutorial at http://www.w3.org/TR/WD-html40-970708/intro/sgmltut.html. We also recommend Goldfarb C.F. (1991) The SGML Handbook, Oxford University Press, 688 p., ISBN: 0198537379.

Parameters:
strBuff -
Returns:
true if this is a valid DTD markup declaration.
Throws:
IOException -

readAttributes

protected void readAttributes(String element)
Read the element attributes, adding them into attribute set.
Parameters:
element - The element name (needed to access attribute information in dtd).

resolveNamedEntity

protected String resolveNamedEntity(String a_tag)
Return the string corresponding to the given named entity. The name is passed with the preceding &, but without the ending semicolon.

resolveNumericEntity

protected char resolveNumericEntity(String a_tag)
Return the char corresponding to the given numeric entity. The name is passed with the preceding &#, but without the ending semicolon.
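
The argument format (leading &#, no trailing semicolon) can be illustrated with a small standalone resolver. This is a sketch, not the Classpath implementation: the class name is hypothetical, only decimal references are handled, and rejecting malformed input with an exception is an assumption, since the error behaviour is not documented here.

```java
// Hypothetical helper: resolves a decimal numeric entity such as "&#65".
class NumericEntitySketch {

    /**
     * @param entity the reference with the leading "&#" and without the
     *               trailing semicolon, for example "&#169"
     * @return the character with that code
     */
    static char resolve(String entity) {
        if (!entity.startsWith("&#")) {
            throw new IllegalArgumentException("Not a numeric entity: " + entity);
        }
        // Integer.parseInt throws NumberFormatException for non-decimal input.
        int code = Integer.parseInt(entity.substring(2));
        if (code < Character.MIN_VALUE || code > Character.MAX_VALUE) {
            throw new IllegalArgumentException("Code does not fit into a char: " + code);
        }
        return (char) code;
    }
}
```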

restart

protected void restart()
Reset all fields to the initial default state, preparing the parser for parsing the next document.

startTag

protected void startTag(TagElement tag)
            throws ChangedCharSetException
The method is called when the HTML opening tag (like <table>) is found or if the parser concludes that one should be present in the current position. The method is called immediately before calling handleStartTag.
Parameters:
tag - The tag

Parser.java -- HTML parser. Copyright (C) 2005 Free Software Foundation, Inc. This file is part of GNU Classpath. GNU Classpath is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2, or (at your option) any later version. GNU Classpath is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with GNU Classpath; see the file COPYING. If not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. Linking this library statically or dynamically with other modules is making a combined work based on this library. Thus, the terms and conditions of the GNU General Public License cover the whole combination. As a special exception, the copyright holders of this library give you permission to link this library with independent modules to produce an executable, regardless of the license terms of these independent modules, and to copy and distribute the resulting executable under terms of your choice, provided that you also meet, for each linked independent module, the terms and conditions of the license of that module. An independent module is a module which is not derived from or based on this library. If you modify this library, you may extend this exception to your version of the library, but you are not obligated to do so. If you do not wish to do so, delete this exception statement from your version.