gnu.xml.pipeline

Class LinkFilter

Implemented Interfaces:
ContentHandler, DeclHandler, DTDHandler, EventConsumer, LexicalHandler

public class LinkFilter
extends EventFilter

Pipeline filter to remember XHTML links found in a document, so they can later be crawled. Fragments are not counted, and duplicates are ignored. Callers are responsible for filtering out URLs they aren't interested in. Events are passed through unmodified.

Input MUST include a setDocumentLocator() call, as it's used to resolve relative links in the absence of a "base" element. Input MUST also include namespace identifiers, since it is the XHTML namespace identifier which is used to identify the relevant elements.

FIXME: handle xml:base attribute ... in association with a stack of base URIs. Similarly, recognize/support XLink data.
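
The behavior described above can be approximated with plain SAX. The following is a minimal sketch, not the GNU Classpath implementation: the class name LinkCollectorSketch, its collect() helper, and the choice of elements examined (a, link, base, img) are assumptions made for illustration. It also reads "fragments are not counted" as "strip the fragment and skip links that are empty afterwards", and like the class itself it ignores xml:base.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for illustration only; NOT gnu.xml.pipeline.LinkFilter.
class LinkCollectorSketch extends DefaultHandler {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // A set silently ignores duplicates; insertion order is kept for readability.
    private final Set<String> links = new LinkedHashSet<>();
    private Locator locator;
    private URI base;

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startDocument() throws SAXException {
        // Without a locator there is nothing to resolve relative links against.
        if (locator == null || locator.getSystemId() == null) {
            throw new SAXParseException("no Locator with a system ID was provided", locator);
        }
        base = URI.create(locator.getSystemId());
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Only elements in the XHTML namespace are examined.
        if (!XHTML_NS.equals(uri)) return;
        String attr;
        if ("a".equals(localName) || "link".equals(localName) || "base".equals(localName)) {
            attr = "href";
        } else if ("img".equals(localName)) {
            attr = "src";
        } else {
            return;
        }
        String value = atts.getValue("", attr);
        if (value == null) return;
        // Assumption: drop the fragment, then skip links that are empty.
        int hash = value.indexOf('#');
        if (hash >= 0) value = value.substring(0, hash);
        value = value.trim();
        if (value.isEmpty()) return;
        try {
            URI resolved = base.resolve(value);
            if ("base".equals(localName)) {
                base = resolved;              // a "base" element overrides the locator
            } else {
                links.add(resolved.toString());
            }
        } catch (IllegalArgumentException e) {
            // Malformed URI references are skipped in this sketch.
        }
    }

    @Override
    public void endDocument() {
        base = null;                          // forget base URI information
    }

    // Parses the given XML text with a namespace-aware parser and returns the links.
    static List<String> collect(String xml, String systemId) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            LinkCollectorSketch handler = new LinkCollectorSketch();
            InputSource in = new InputSource(new StringReader(xml));
            in.setSystemId(systemId);
            factory.newSAXParser().parse(in, handler);
            return new ArrayList<>(handler.links);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html xmlns='http://www.w3.org/1999/xhtml'><body>"
                + "<a href='other.html#top'>x</a><a href='other.html'>y</a>"
                + "<a href='#local'>z</a><img src='/logo.png'/></body></html>";
        for (String link : collect(xml, "http://example.org/dir/page.html")) {
            System.out.println(link);
        }
    }
}
```

The system ID set on the InputSource is what the locator reports, which is why input without a locator cannot have its relative links resolved.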

Field Summary

Fields inherited from class gnu.xml.pipeline.EventFilter

DECL_HANDLER, FEATURE_URI, LEXICAL_HANDLER, PROPERTY_URI

Constructor Summary

LinkFilter()
Constructs a new event filter, which collects links in a private data structure for later enumeration.
LinkFilter(EventConsumer next)
Constructs a new event filter, which collects links in a private data structure for later enumeration and passes all events, unmodified, to the next consumer.

Method Summary

void
endDocument()
Forgets about any base URI information that may be recorded.
Enumeration
getLinks()
Returns an enumeration of the links found since the filter was constructed, or since removeAllLinks() was called.
void
removeAllLinks()
Removes records about all links reported to the event stream, as if the filter were newly created.
void
startDocument()
Reports an error if no Locator has been made available.
void
startElement(String uri, String localName, String qName, Attributes atts)
Collects URIs for (X)HTML content from elements which hold them.

Methods inherited from class gnu.xml.pipeline.EventFilter

attributeDecl, bind, chainTo, characters, comment, elementDecl, endCDATA, endDTD, endDocument, endElement, endEntity, endPrefixMapping, externalEntityDecl, getContentHandler, getDTDHandler, getDocumentLocator, getErrorHandler, getNext, getProperty, ignorableWhitespace, internalEntityDecl, notationDecl, processingInstruction, setContentHandler, setDTDHandler, setDocumentLocator, setErrorHandler, setProperty, skippedEntity, startCDATA, startDTD, startDocument, startElement, startEntity, startPrefixMapping, unparsedEntityDecl

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details

LinkFilter

public LinkFilter()
Constructs a new event filter, which collects links in a private data structure for later enumeration.

LinkFilter

public LinkFilter(EventConsumer next)
Constructs a new event filter, which collects links in a private data structure for later enumeration and passes all events, unmodified, to the next consumer.

Method Details

endDocument

public void endDocument()
            throws SAXException
Forgets about any base URI information that may be recorded. Applications will often want to call removeAllLinks(), likely after examining the links which were reported.
Specified by:
endDocument in interface ContentHandler
Overrides:
endDocument in class EventFilter

getLinks

public Enumeration getLinks()
Returns an enumeration of the links found since the filter was constructed, or since removeAllLinks() was called.
Returns:
enumeration of strings.

removeAllLinks

public void removeAllLinks()
Removes records about all links reported to the event stream, as if the filter were newly created.
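
Since applications typically examine the links and then call removeAllLinks(), it can help to copy the enumeration into a list first. The sketch below assumes, without confirmation from this page, that the enumeration may be backed by the filter's live storage; a java.util.Vector stands in for that storage, and the names LinkSnapshotSketch and snapshot() are hypothetical.

```java
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Vector;

// Hypothetical helper for illustration; not part of gnu.xml.pipeline.
class LinkSnapshotSketch {
    // Copies every element of a legacy Enumeration of strings into a List,
    // so the caller keeps the links even if the source is cleared afterwards.
    static List<String> snapshot(Enumeration<?> links) {
        List<String> copy = new ArrayList<>();
        while (links.hasMoreElements()) {
            copy.add((String) links.nextElement());
        }
        return copy;
    }

    public static void main(String[] args) {
        // Stand-in for the filter's private storage (an assumption).
        Vector<String> store = new Vector<>();
        store.add("http://example.org/a.html");
        store.add("http://example.org/b.html");

        List<String> copy = snapshot(store.elements()); // analogous to getLinks()
        store.clear();                                  // analogous to removeAllLinks()
        System.out.println(copy);
    }
}
```

With the real class, the same helper would be applied to the result of getLinks() before calling removeAllLinks().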

startDocument

public void startDocument()
            throws SAXException
Reports an error if no Locator has been made available.
Specified by:
startDocument in interface ContentHandler
Overrides:
startDocument in class EventFilter

startElement

public void startElement(String uri,
                         String localName,
                         String qName,
                         Attributes atts)
            throws SAXException
Collects URIs for (X)HTML content from elements which hold them.
Specified by:
startElement in interface ContentHandler
Overrides:
startElement in class EventFilter

LinkFilter.java -- Copyright (C) 1999,2000,2001 Free Software Foundation, Inc.

This file is part of GNU Classpath.

GNU Classpath is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2, or (at your option) any later version.

GNU Classpath is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with GNU Classpath; see the file COPYING. If not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

Linking this library statically or dynamically with other modules is making a combined work based on this library. Thus, the terms and conditions of the GNU General Public License cover the whole combination.

As a special exception, the copyright holders of this library give you permission to link this library with independent modules to produce an executable, regardless of the license terms of these independent modules, and to copy and distribute the resulting executable under terms of your choice, provided that you also meet, for each linked independent module, the terms and conditions of the license of that module. An independent module is a module which is not derived from or based on this library. If you modify this library, you may extend this exception to your version of the library, but you are not obligated to do so. If you do not wish to do so, delete this exception statement from your version.