Imediava's Blog


Web Scraping with Groovy 2 of 3 – XPath

In the previous article, Web Scraping with Groovy 1/3, we talked about how Groovy's features make web scraping easy. In this one we'll exploit Java/Groovy interoperability, bringing in some additional Java libraries to simplify the process even further using XPath.

We are going to keep using the same practical example as in the previous article: fetching ( ) and extracting the result titles that match the jQuery selector $('#results h3 a').

Web Scraping with XPath

URL fetching can be done exactly as in the previous article; parsing, however, needs to change completely. The reason is that Java's XPath support operates on DOM documents, and I still haven't found an HTML parser that produces a DOM usable with Java's XPath. On the other hand, there are many HTML SAX parsers available, like the popular TagSoup, which we already used in the first post.

After considerable effort, the only solution I have found is the one described in Building a DOM with TagSoup. Adapted to our example, the code looks like the following:

@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2')
import org.ccil.cowan.tagsoup.Parser

import org.xml.sax.XMLReader
import org.xml.sax.InputSource
import javax.xml.transform.*
import javax.xml.transform.dom.DOMResult
import javax.xml.transform.sax.SAXSource
import javax.xml.xpath.*

def urlString = ""
URL url = new URL(urlString)

// TagSoup provides a SAX parser that tolerates real-world HTML
XMLReader reader = new Parser()
reader.setFeature(Parser.namespacesFeature, false)
reader.setFeature(Parser.namespacePrefixesFeature, false)

// Transform SAX to DOM via an identity transform
Transformer transformer = TransformerFactory.newInstance().newTransformer()
DOMResult result = new DOMResult()
transformer.transform(new SAXSource(reader, new InputSource(url.openStream())), result)

With the HTML parsed into a DOM, we can now use XPath's expressiveness to filter elements. XPath allows more precise selection than GPath, in a declarative way, and has the benefit of being a standard that ports easily to other programming languages. To select the same elements as in the first example we just need:

def xpath = XPathFactory.newInstance().newXPath()

//JQuery selector: $('#results h3 a')
def results = xpath.evaluate( '//*[@id=\'results\']//h3/a', result.getNode(), XPathConstants.NODESET )
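The evaluate call above returns a standard org.w3c.dom.NodeList. As a self-contained illustration of the same evaluation in plain Java, here is a sketch that runs the expression against a small well-formed fragment instead of TagSoup's output (the markup, class name, and helper method are invented for the example):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathTitles {

    // Evaluate an XPath expression against a markup fragment and
    // return the text content of every matched node.
    static List<String> select(String markup, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // A tiny well-formed stand-in for the fetched results page
        String page = "<html><body><div id='results'>"
                + "<h3><a href='a'>First result</a></h3>"
                + "<h3><a href='b'>Second result</a></h3>"
                + "</div></body></html>";
        // Same expression as in the Groovy snippet above
        System.out.println(select(page, "//*[@id='results']//h3/a"));
        // prints [First result, Second result]
    }
}
```

On a real page you would feed the node from TagSoup's DOMResult to evaluate instead of parsing a string, exactly as the Groovy code does.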

Simulating the '#' operator with XPath is verbose compared with the simplicity of jQuery selectors. However, XPath is powerful enough to express anything they can, and it brings its own advantages, such as selecting all elements that have a child of a specific type. For example:

//p[a]   // selects all "p" elements that have an "a" child

That is something that is impossible to do with CSS selectors.


Pros:

  - Very powerful, capable of covering any filtering need
  - Less verbose than GPath

Cons:

  - Needs a hack to use HTML parsing with the Java SDK's XPath support
  - Less tailored to HTML, which makes it more verbose than CSS selectors for operators like '#' or '.'


In the next article, the last of this series, I will talk about JSoup, a library I have only recently discovered but which, in my opinion, offers the best alternative. We will see not only how it simplifies element filtering but also how its additional features make web scraping even easier.

Edited 22/10/2011: Grab with multiple named parameters has been replaced by the more concise version with only one parameter as suggested by Guillaume Laforge.


