Groovy XmlSlurper for HTML Parsing

Very common task: you need to parse XML. In Groovy there is groovy.util.XmlSlurper for that. HTML looks a lot like XML, but markup fetched from online resources is rarely well-formed, so a strict XML parser will choke on it. To avoid parse errors and still be able to use XmlSlurper's great node-traversing functionality, the underlying parser has to tolerate sloppy markup.

Don't try to build your own SAXParser or SAXParserFactory with special features that skip loading DTDs, use a dummy entity resolver, turn off validation, or ignore missing end tags. That is far too much effort, and not groovy.

// Abandoned attempts, kept for reference only
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.EntityResolver
import org.xml.sax.InputSource

// Attempt 1: a hand-configured, non-validating SAXParser that does not load external DTDs
SAXParserFactory saxParserFactory = SAXParserFactory.newInstance()
saxParserFactory.validating = false
saxParserFactory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)
// saxParserFactory.setFeature("http://xml.org/sax/features/validation", false)
// saxParserFactory.setFeature("http://apache.org/xml/features/validation/schema", false)
def slurperFromParser = new XmlSlurper(saxParserFactory.newSAXParser())

// Attempt 2: a non-validating XmlSlurper with a dummy EntityResolver
// http://groovy.329449.n5.nabble.com/XmlSlurper-without-Validation-and-DTD-access-td334171.html
class DummyDTD {
    // Return an empty InputSource; returning null falls back to the default lookup, which still fetches the DTD
    static EntityResolver entityResolver = [
        resolveEntity: { publicId, systemId -> new InputSource(new StringReader("")) }
    ] as EntityResolver
}
// The resolver is set on the XmlSlurper; javax.xml.parsers.SAXParser has no setEntityResolver method
def slurperWithResolver = new XmlSlurper(false, false)
slurperWithResolver.setEntityResolver(DummyDTD.entityResolver)

When you use the Groovy HTTPBuilder to fetch your markup, you can simply use NekoHTML, because it is a direct dependency of HTTPBuilder.

<dependency>
	<!-- Only needed for HTML parsing -->
	<groupId>net.sourceforge.nekohtml</groupId>
	<artifactId>nekohtml</artifactId>
	<version>1.9.9</version>
</dependency>

Because NekoHTML provides a ready-made SAXParser for HTML, creating a Groovy XmlSlurper for HTML parsing is as easy as passing NekoHTML's SAXParser to the constructor:

  groovy.util.XmlSlurper xmlSlurper = new groovy.util.XmlSlurper(new org.cyberneko.html.parsers.SAXParser())
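
To see what this buys you, here is a minimal sketch that feeds deliberately sloppy markup to such a slurper and collects all link targets. The sample markup and the variable names are made up for illustration, and the snippet assumes that nekohtml (together with its Xerces dependency) is on the classpath and that your Groovy version still ships groovy.util.XmlSlurper.

```groovy
// Assumption: nekohtml and its Xerces dependency are on the classpath
import org.cyberneko.html.parsers.SAXParser

// Hypothetical sample markup: unclosed <p> tags and no closing </html>
String markup = '''<html><body>
<p>First paragraph <a href="/a">A</a>
<p>Second paragraph <a href="/b">B</a>
</body>'''

def slurper = new groovy.util.XmlSlurper(new SAXParser())
def html = slurper.parseText(markup)

// Compare element names case-insensitively, because the HTML parser
// may normalize their case (NekoHTML's default is upper case)
def links = html.depthFirst().findAll { it.name().equalsIgnoreCase('a') }
def hrefs = links.collect { it.@href.text() }

println hrefs
```

For a live page you would call slurper.parse(...) with the URL instead of parseText; the traversal stays the same.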