gnu.javax.swing.text.html.parser.support
Class Parser
All Implemented Interfaces: DTDConstants
A simple error-tolerant HTML parser that uses a DTD document
to look up the possible tokens, arguments and syntax.
The parser reads HTML content from a Reader and calls various
notification methods (which should be overridden in a subclass)
when tags or data are encountered.
Some HTML elements need no opening or closing tags. This
parser also invokes the tag handling methods when
the tags are not explicitly specified and must be inferred from
information stored in the DTD.
For example, parsing the document
<table><tr><td>a<td>b<td>c</tr>
will invoke the handling methods in exactly the same order
(and with the same parameters) as parsing the document
<html><head></head><body><table><tbody><tr><td>a</td><td>b</td><td>c</td></tr></tbody></table></body></html>
Every tag in the second document that is absent from the first is
supposed by the parser. The parser also supports
obsolete elements of HTML syntax.
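The effect of supposed tags can be observed with a small experiment. The sketch below does not use this class; it assumes the standard JDK counterpart, javax.swing.text.html.parser.ParserDelegator with an HTMLEditorKit.ParserCallback, which also reports tags that are not present in the input. The class name ImpliedTagDemo and the log format are illustrative only, and the exact sequence of supposed tags may differ between the JDK parser and this one.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class ImpliedTagDemo {
    /** Parses the HTML and returns the callback events in the order they were reported. */
    static List<String> events(String html) {
        List<String> log = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                log.add("<" + tag + ">");
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                log.add("</" + tag + ">");
            }

            @Override
            public void handleText(char[] text, int pos) {
                log.add("text:" + new String(text));
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        // Prints every reported event, including tags that never appear in the input.
        events("<table><tr><td>a<td>b<td>c</tr>").forEach(System.out::println);
    }
}
```

Running the example prints start tags such as html and body even though the input begins with table, which is the behaviour the class description refers to.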
protected DTD | dtd - The document type definition (DTD) that will be used to parse the documents.
Token | hTag - The current HTML tag.
protected int | preformatted - This field has a positive value inside preformatted tags.
protected boolean | strict - Determines whether the Parser is strict in enforcing SGML compatibility.
Inherited fields: AP, BEGIN, COMMENT_END, COMMENT_OPEN, COMMENT_TRIPLEDASH_END, DOUBLE_DASH, END, ENTITY, ENTITY_NAMED, ENTITY_NUMERIC, EOF, EQ, EXCLAMATION, NUMTOKEN, OTHER, QUOT, SCRIPT, SCRIPT_CLOSE, SCRIPT_OPEN, SGML, SLASH, STYLE, STYLE_CLOSE, STYLE_OPEN, TAG, TAG_CLOSE, WS, bDIGIT, bLETTER, bLINEBREAK, bNAME, bQUOTING, bSINGLE_CHAR_TOKEN, bSPECIAL, bWHITESPACE
Fields inherited from DTDConstants: ANY, CDATA, CONREF, CURRENT, DEFAULT, EMPTY, ENDTAG, ENTITIES, ENTITY, FIXED, GENERAL, ID, IDREF, IDREFS, IMPLIED, MD, MODEL, MS, NAME, NAMES, NMTOKEN, NMTOKENS, NOTATION, NUMBER, NUMBERS, NUTOKEN, NUTOKENS, PARAMETER, PI, PUBLIC, RCDATA, REQUIRED, SDATA, STARTTAG, SYSTEM
Parser(DTD a_dtd) - Creates a new Parser that uses the given DTD.
protected void | CDATA(boolean clearBuffer) - Read parseable character data and add it to the buffer.
protected void | Comment() - Process a comment.
protected void | Script() - Read a script.
protected void | Sgml() - Process an SGML insertion that is not a comment.
protected void | Style() - Read a style definition.
protected void | Tag() - Read an HTML tag.
protected void | _handleText() - A hook for operations that precede the call to handleText.
protected void | append(Token t) - Add the image of this token to the buffer.
protected void | consume(pattern p) - Consume a pattern that must match.
protected void | endTag(boolean omitted) - Called when an HTML end (closing) tag is found, or when the parser concludes that one should be present at the current position.
void | error(String msg) - Invokes the error handler.
void | error(String msg, Token atToken) - Invokes the error handler.
void | error(String msg, String invalid) - Invokes the error handler.
void | error(String parm1, String parm2, String parm3) - Invokes the error handler.
void | error(String parm1, String parm2, String parm3, String parm4) - Invokes the error handler.
void | flushAttributes()
SimpleAttributeSet | getAttributes() - Get the attributes of the current tag.
protected int | getCurrentLine() - Get the first line of the last parsed token.
protected void | handleComment(char[] comment) - Handle an HTML comment.
protected void | handleEOFInComment() - Called additionally when the HTML content terminates without closing an HTML comment.
protected void | handleEmptyTag(TagElement tag) - Handle a tag with no content, like <br>.
protected void | handleEndTag(TagElement tag) - Called when an HTML closing tag (like </table>) is found, or when the parser concludes that one should be present at the current position.
protected void | handleError(int line, String message)
protected void | handleStartTag(TagElement tag) - Called when an HTML opening tag (like <table>) is found, or when the parser concludes that one should be present at the current position.
protected void | handleText(char[] text) - Handle the text section.
protected void | handleTitle(char[] title) - Handle the HTML <title> tag.
protected TagElement | makeTag(Element element) - Constructs the tag from the given element.
protected TagElement | makeTag(Element element, boolean isSupposed) - Constructs the tag from the given element.
protected void | markFirstTime(Element element) - Called when the tag representing the given element occurs for the first time in the document.
protected Token | mustBe(int kind) - Consume the token that was checked before and hence MUST be present.
protected void | noValueAttribute(String element, String attribute) - Handle an attribute without a value.
protected Token | optional(int kind) - Consume the optional token, if present.
void | parse(Reader reader) - Parse the HTML text, calling various methods in response to the occurrence of the corresponding HTML constructions.
String | parseDTDMarkup() - Parses a DTD markup declaration.
protected void | parseDocument() - Parse the HTML document.
boolean | parseMarkupDeclarations(StringBuffer strBuff) - Parse an SGML insertion ( <! ... > ).
protected void | readAttributes(String element) - Read the element attributes, adding them to the attribute set.
protected String | resolveNamedEntity(String a_tag) - Return the string corresponding to the given named entity.
protected char | resolveNumericEntity(String a_tag) - Return the char corresponding to the given numeric entity.
protected void | restart() - Reset all fields to the initial default state, preparing the parser for parsing the next document.
protected void | startTag(TagElement tag) - Called when an HTML opening tag (like <table>) is found, or when the parser concludes that one should be present at the current position.
Methods inherited from java.lang.Object: clone, equals, getClass, finalize, hashCode, notify, notifyAll, toString, wait, wait, wait
dtd
protected DTD dtd
The document type definition (DTD) that will be used to parse the documents.
preformatted
protected int preformatted
This field has a positive value inside preformatted tags.
strict
protected boolean strict
The value of this field determines whether the Parser is
strict in enforcing SGML compatibility. The default value is false,
meaning that the parser does everything it can to extract at least
some information even from incorrectly written HTML input.
Parser
public Parser(DTD a_dtd)
Creates a new Parser that uses the given DTD. The only standard way
to get an instance of DTD is to construct it manually, filling in
all required fields.
a_dtd
- The DTD to use. The parser behaviour after passing null
as an argument is not documented and may vary between implementations.
CDATA
protected void CDATA(boolean clearBuffer)
throws ParseException
Read parseable character data and add it to the buffer.
clearBuffer
- If true, the buffer is filled with the CDATA section, replacing
its previous content; otherwise the section is appended to the
existing content of the buffer.
Comment
protected void Comment()
throws ParseException
Process Comment. This method skips till --> without
taking SGML constructs into consideration. The supported SGML
constructs are handled separately.
Script
protected void Script()
throws ParseException
Read a script. The text, returned without any changes,
is terminated only by the closing tag SCRIPT.
Sgml
protected void Sgml()
throws ParseException
Process SGML insertion that is not a comment.
Style
protected void Style()
throws ParseException
Read a style definition. The text, returned without any changes,
is terminated only by the closing tag STYLE.
_handleText
protected void _handleText()
A hook for operations that precede the call to handleText.
Handles the text in the string buffer.
In non-preformatted mode, all line breaks immediately following the
start tag and immediately before an end tag are discarded,
\r, \n and \t are replaced by spaces, multiple spaces are replaced
by a single one, and the result is moved into an array
that is passed to handleText().
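The non-preformatted rules above can be sketched in plain Java. This is an illustration of the described behaviour, not the code of this class; the class name WhitespaceCollapse and the method collapse are hypothetical, and the sketch treats its argument as the text found between one start tag and the following end tag.

```java
class WhitespaceCollapse {
    /** Applies the non-preformatted rules to the text between two tags. */
    static char[] collapse(String raw) {
        // Discard line breaks directly after the start tag and directly before the end tag.
        int start = 0;
        int end = raw.length();
        while (start < end && isLineBreak(raw.charAt(start))) {
            start++;
        }
        while (end > start && isLineBreak(raw.charAt(end - 1))) {
            end--;
        }

        StringBuilder out = new StringBuilder();
        boolean previousWasSpace = false;
        for (int i = start; i < end; i++) {
            char c = raw.charAt(i);
            // \r, \n and \t count as spaces.
            if (c == '\r' || c == '\n' || c == '\t') {
                c = ' ';
            }
            if (c == ' ') {
                // Emit only the first space of a run.
                if (!previousWasSpace) {
                    out.append(' ');
                }
                previousWasSpace = true;
            } else {
                out.append(c);
                previousWasSpace = false;
            }
        }
        // The result is handed over as a char array, as handleText expects.
        return out.toString().toCharArray();
    }

    private static boolean isLineBreak(char c) {
        return c == '\r' || c == '\n';
    }

    public static void main(String[] args) {
        System.out.println(new String(collapse("\nHello\t\tbig\r\n  world\n")));
    }
}
```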
append
protected final void append(Token t)
Add the image of this token to the buffer.
consume
protected final void consume(pattern p)
Consume pattern that must match.
p
- A pattern to consume.
endTag
protected void endTag(boolean omitted)
This method is called when an HTML end (closing) tag is found, or when
the parser concludes that one should be present at the
current position. It is called immediately
before handleEndTag().
omitted
- True if the tag is not actually present in the document,
but is supposed by the parser (like </html> at the end of the
document).
error
public void error(String msg)
Invokes the error handler. The default method in this implementation
delegates the call to handleError, also providing the current line.
error
public void error(String msg,
String invalid)
Invokes the error handler. The default method in this implementation
delegates the call to error (msg+": '"+invalid+"'").
error
public void error(String parm1,
String parm2,
String parm3)
Invokes the error handler. The default method in this implementation
delegates the call to error (parm1+" "+ parm2+" "+ parm3).
error
public void error(String parm1,
String parm2,
String parm3,
String parm4)
Invokes the error handler. The default method in this implementation
delegates the call to error (parm1+" "+ parm2+" "+ parm3+" "+ parm4).
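The delegation between the error overloads can be pictured with a small stand-alone class. This is a sketch of the documented default behaviour only: ErrorChainSketch, its reported list and the currentLine field are hypothetical stand-ins (the real class obtains the line from the parser state), and the overload taking a Token is left out because its behaviour is not described here.

```java
import java.util.ArrayList;
import java.util.List;

class ErrorChainSketch {
    /** Collected messages, standing in for whatever a real handleError would do. */
    final List<String> reported = new ArrayList<>();

    /** Hypothetical stand-in for the current line of the parser. */
    int currentLine = 1;

    /** The final receiver of every error report. */
    void handleError(int line, String message) {
        reported.add(line + ": " + message);
    }

    /** Adds the current line and forwards to handleError. */
    void error(String msg) {
        handleError(currentLine, msg);
    }

    /** Quotes the invalid fragment and forwards to the one-argument form. */
    void error(String msg, String invalid) {
        error(msg + ": '" + invalid + "'");
    }

    /** Joins the parts with spaces and forwards to the one-argument form. */
    void error(String parm1, String parm2, String parm3) {
        error(parm1 + " " + parm2 + " " + parm3);
    }

    /** Joins the parts with spaces and forwards to the one-argument form. */
    void error(String parm1, String parm2, String parm3, String parm4) {
        error(parm1 + " " + parm2 + " " + parm3 + " " + parm4);
    }

    public static void main(String[] args) {
        ErrorChainSketch sketch = new ErrorChainSketch();
        sketch.error("Unexpected tag", "blink");
        sketch.reported.forEach(System.out::println);
    }
}
```

Because every overload ends in handleError, a subclass that overrides only handleError would see all reports in one place.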
getAttributes
public SimpleAttributeSet getAttributes()
Get the attributes of the current tag.
Returns the attribute set representing the attributes of the current tag.
getCurrentLine
protected int getCurrentLine()
Get the first line of the last parsed token.
handleComment
protected void handleComment(char[] comment)
Handle HTML comment. The default method returns without action.
handleEOFInComment
protected void handleEOFInComment()
This method is additionally called when the HTML content terminates
without closing an HTML comment. This can only happen if the
HTML document contains errors (for example, the closing --> is
missing).
handleEmptyTag
protected void handleEmptyTag(TagElement tag)
throws ChangedCharSetException
Handle a tag with no content, like <br>. The method is
called for elements that, in accordance with the current DTD,
have empty content.
tag
- The tag being handled.
handleEndTag
protected void handleEndTag(TagElement tag)
This method is called when an HTML closing tag (like </table>)
is found, or when the parser concludes that one should be present
at the current position.
handleStartTag
protected void handleStartTag(TagElement tag)
This method is called when an HTML opening tag (like <table>)
is found, or when the parser concludes that one should be present
at the current position.
handleText
protected void handleText(char[] text)
Handle the text section.
For a non-preformatted section, the parser replaces
\t, \r and \n by spaces and then multiple spaces
by a single space. Additionally, all whitespace around
tags is discarded.
For preformatted text (inside TEXTAREA and PRE), the parser preserves
all tabs and spaces, but removes
one bounding \r, \n or \r\n,
if it is present. Additionally, it replaces each occurrence of \r or \r\n
by a single \n.
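The preformatted rules can be sketched as follows. This is an illustration, not the code of this class: PreformattedText and normalize are hypothetical names, and the sketch assumes that "one bounding" line break means at most one at the start and at most one at the end of the section.

```java
class PreformattedText {
    /** Normalizes a preformatted section while keeping tabs and spaces intact. */
    static String normalize(String raw) {
        // Replace \r\n first so that it becomes one \n rather than two.
        String text = raw.replace("\r\n", "\n").replace('\r', '\n');

        // Assumption: drop at most one line break at each end of the section.
        if (text.startsWith("\n")) {
            text = text.substring(1);
        }
        if (text.endsWith("\n")) {
            text = text.substring(0, text.length() - 1);
        }
        return text;
    }

    public static void main(String[] args) {
        // Tabs and the leading spaces survive; only line endings are touched.
        System.out.println(normalize("\r\n  a\tb\r\n c\r"));
    }
}
```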
handleTitle
protected void handleTitle(char[] title)
Handle the HTML <title> tag. This method is invoked after
both the opening and the closing title tags have been processed.
The passed argument contains the concatenation of all
title text sections.
makeTag
protected TagElement makeTag(Element element)
Constructs the tag from the given element. In this implementation,
this is defined, but never called.
makeTag
protected TagElement makeTag(Element element,
boolean isSupposed)
Constructs the tag from the given element.
isSupposed
- true if the tag is not actually present in the
HTML input, but the parser supposes that it should occur at
the current location.
markFirstTime
protected void markFirstTime(Element element)
This method is called when the tag representing the given element
occurs for the first time in the document.
mustBe
protected Token mustBe(int kind)
Consume the token that was checked before and hence MUST be present.
kind
- The kind of token to consume.
noValueAttribute
protected void noValueAttribute(String element,
String attribute)
Handle an attribute without a value. The default method uses
the only allowed attribute value from the DTD.
If the attribute is unknown or allows several values,
HTML.NULL_ATTRIBUTE_VALUE is used. The attribute with
this value is added to the attribute set.
element
- The name of the element.
attribute
- The name of the attribute without a value.
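The selection rule for a valueless attribute can be sketched without a DTD. In the illustration below a plain map from attribute name to its allowed values stands in for the DTD lookup; NoValueAttributeSketch, valueFor and the local NULL_ATTRIBUTE_VALUE constant are hypothetical, and the literal assigned to the constant is only a placeholder for HTML.NULL_ATTRIBUTE_VALUE.

```java
import java.util.List;
import java.util.Map;

class NoValueAttributeSketch {
    /** Placeholder for HTML.NULL_ATTRIBUTE_VALUE; the literal is an assumption. */
    static final String NULL_ATTRIBUTE_VALUE = "#DEFAULT";

    /**
     * Picks the value for an attribute written without one.
     * The map stands in for the allowed-value lists a DTD would provide.
     */
    static String valueFor(Map<String, List<String>> allowedValues, String attribute) {
        List<String> allowed = allowedValues.get(attribute);
        // Exactly one allowed value: use it.
        if (allowed != null && allowed.size() == 1) {
            return allowed.get(0);
        }
        // Unknown attribute or several candidates: fall back to the null marker.
        return NULL_ATTRIBUTE_VALUE;
    }

    public static void main(String[] args) {
        Map<String, List<String>> allowed = Map.of(
            "checked", List.of("checked"),
            "align", List.of("left", "right"));
        System.out.println(valueFor(allowed, "checked"));
        System.out.println(valueFor(allowed, "align"));
    }
}
```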
optional
protected Token optional(int kind)
Consume the optional token, if present.
kind
- The kind of token to consume.
parse
public void parse(Reader reader)
throws IOException
Parse the HTML text, calling various methods in response to the
occurrence of the corresponding HTML constructions.
reader
- The reader to read the source HTML from.
parseMarkupDeclarations
public boolean parseMarkupDeclarations(StringBuffer strBuff)
throws IOException
Parse an SGML insertion ( <! ... > ). When an
SGML insertion is found, this method is called with the
SGML passed in the string buffer. The default method
returns false without action and can be overridden to
implement user-defined SGML support.
For more information about SGML insertions in HTML documents,
see the SGML tutorial at
http://www.w3.org/TR/WD-html40-970708/intro/sgmltut.html
See also Goldfarb, C.F. (1991),
The SGML Handbook,
Oxford University Press, 688 p., ISBN 0198537379.
Returns true if this is a valid DTD markup declaration.
readAttributes
protected void readAttributes(String element)
Read the element attributes, adding them into attribute set.
element
- The element name (needed to access attribute
information in dtd).
resolveNamedEntity
protected String resolveNamedEntity(String a_tag)
Return the string corresponding to the given named entity. The name is passed
with the preceding &, but without the ending semicolon.
resolveNumericEntity
protected char resolveNumericEntity(String a_tag)
Return the char corresponding to the given numeric entity.
The name is passed with the preceding &#, but without
the ending semicolon.
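A numeric entity can be resolved with a few lines of plain Java. The sketch below is an illustration rather than the code of this class: NumericEntitySketch and resolve are hypothetical names, the argument is assumed to arrive in a form such as &#65, and both the hexadecimal form (&#x41) and the '?' fallback for malformed input are assumptions of this sketch.

```java
class NumericEntitySketch {
    /** Converts an entity such as "&#65" or "&#x41" (no trailing semicolon) to a char. */
    static char resolve(String entity) {
        // Strip the leading "&#" when it is present.
        String digits = entity.startsWith("&#") ? entity.substring(2) : entity;
        try {
            // Assumption: a leading x or X marks a hexadecimal code.
            if (digits.startsWith("x") || digits.startsWith("X")) {
                return (char) Integer.parseInt(digits.substring(1), 16);
            }
            return (char) Integer.parseInt(digits);
        } catch (NumberFormatException e) {
            // Hypothetical fallback for malformed input.
            return '?';
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("&#65"));
        System.out.println(resolve("&#x41"));
    }
}
```

Since the return type is char, code points above U+FFFF cannot be represented; the cast simply truncates them in this sketch.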
restart
protected void restart()
Reset all fields to the initial default state, preparing the
parser for parsing the next document.
startTag
protected void startTag(TagElement tag)
throws ChangedCharSetException
This method is called when an HTML opening tag (like <table>)
is found, or when the parser concludes that one should be present
at the current position. It is called immediately before
handleStartTag.
Parser.java -- HTML parser.
Copyright (C) 2005 Free Software Foundation, Inc.
This file is part of GNU Classpath.
GNU Classpath is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2, or (at your option)
any later version.
GNU Classpath is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
You should have received a copy of the GNU General Public License
along with GNU Classpath; see the file COPYING. If not, write to the
Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301 USA.
Linking this library statically or dynamically with other modules is
making a combined work based on this library. Thus, the terms and
conditions of the GNU General Public License cover the whole
combination.
As a special exception, the copyright holders of this library give you
permission to link this library with independent modules to produce an
executable, regardless of the license terms of these independent
modules, and to copy and distribute the resulting executable under
terms of your choice, provided that you also meet, for each linked
independent module, the terms and conditions of the license of that
module. An independent module is a module which is not derived from
or based on this library. If you modify this library, you may extend
this exception to your version of the library, but you are not
obligated to do so. If you do not wish to do so, delete this
exception statement from your version.