
Web Scraping with Groovy (2 of 3) – XPath

In the previous article, Web Scraping with Groovy (1 of 3), we talked about how Groovy's features make web scraping easy. In this one, we'll exploit Java/Groovy interoperability and some additional Java libraries to simplify the process even further using XPath.

We are going to keep using the practical example from the previous article: fetching http://www.bing.com/search?q=web+scraping and extracting the result titles matched by the jQuery selector $('#results h3 a').

Web Scraping with XPath

URL fetching can be done exactly as in the previous article; parsing, however, needs to change completely. The reason is that Java's XPath support expects DOM documents, and I still haven't found an HTML DOM parser that can be used directly with Java's XPath. On the other hand, there are many HTML SAX parsers available, like the popular TagSoup, which we already used in the first post.

After considerable effort, the only solution I have found is the one described in Building a DOM with TagSoup. Adapted to our example, the code looks like the following:


import org.ccil.cowan.tagsoup.Parser
import org.xml.sax.XMLReader
import org.xml.sax.InputSource
import javax.xml.transform.*
import javax.xml.transform.sax.SAXSource
import javax.xml.transform.dom.DOMResult
import javax.xml.xpath.*

def urlString = "http://www.bing.com/search?q=web+scraping"
URL url = new URL(urlString)

@Grapes( @Grab('org.ccil.cowan.tagsoup:tagsoup:1.2') )
XMLReader reader = new Parser()
// Disable namespace handling so XPath expressions don't need prefixes
reader.setFeature(Parser.namespacesFeature, false)
reader.setFeature(Parser.namespacePrefixesFeature, false)
// An identity transformation turns TagSoup's SAX events into a DOM tree
Transformer transformer = TransformerFactory.newInstance().newTransformer()
DOMResult result = new DOMResult()
transformer.transform(new SAXSource(reader, new InputSource(url.openStream())), result)

With the HTML parsed, we can now use XPath's expressiveness to filter elements in the page's DOM. XPath allows more precise selection than GPath in a declarative way, with the added benefit of being a standard that ports easily to other programming languages. To select the same elements as in the first article, we just need:

def xpath = XPathFactory.newInstance().newXPath()

//jQuery selector: $('#results h3 a')
def results = xpath.evaluate("//*[@id='results']//h3/a", result.getNode(), XPathConstants.NODESET)
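
The evaluate call returns a standard org.w3c.dom.NodeList, so printing the matched titles takes only a couple more lines. A minimal sketch, reusing the results variable from the snippet above:

// Walk the NodeList returned by the XPath evaluation and
// print the text of each matched link (the result titles)
for (int i = 0; i < results.length; i++) {
    println results.item(i).textContent
}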

Simulating the '#' operator with XPath is quite complex compared with the simplicity of jQuery selectors. However, XPath is powerful enough to express anything they can, and it comes with its own advantages, such as the ability to select all elements that have a child of a specific type. For example:

//p[a]  // selects all "p" elements that have an "a" child

That is something that is impossible to do with CSS selectors.
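
Applied to our parsed Bing page, the same child-filter pattern could look like this. This is a sketch reusing the xpath and result variables from the snippet above; the //li[h3] expression is my own illustration, not part of the original example:

// Select all "li" elements that have an "h3" child, mirroring the //p[a] pattern
def itemsWithTitles = xpath.evaluate("//li[h3]", result.getNode(), XPathConstants.NODESET)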

Summary

Pros:
- Very powerful and capable of covering any filtering need
- Less verbose than GPath

Cons:
- Needs a hack to use HTML parsing with the Java SDK's XPath support
- Less tailored to HTML, which makes it more verbose than CSS selectors for operators like '#' or '.'

Next

In the next article, the last of this series, I will talk about JSoup, a library I discovered only recently but which offers, in my opinion, the best alternative. We will see not only how it simplifies element filtering but also how it comes with additional features that make web scraping even easier.

Edited 22/10/2011: Grab with multiple named parameters has been replaced by the more concise version with only one parameter as suggested by Guillaume Laforge.

Web Scraping with Groovy (1 of 3)

Web Scraping

Web scraping consists of extracting information from a webpage automatically. It combines URL fetching with HTML parsing. As an example for this article, we are going to extract the main titles from the results of searching for "web scraping" on Microsoft's Bing.

As a reference for the article, searching for "web scraping" on Bing is equivalent to accessing the following URL: http://www.bing.com/search?q=web+scraping
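
The query-to-URL mapping is just standard URL encoding, where the space becomes a '+'. A minimal sketch (the query and searchUrl names are mine):

// URLEncoder turns the space in the query into '+'
def query = URLEncoder.encode("web scraping", "UTF-8")
def searchUrl = "http://www.bing.com/search?q=${query}"
assert searchUrl == "http://www.bing.com/search?q=web+scraping"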

And the results' titles are selected by applying the following jQuery selector to the webpage's DOM:

$('#results h3 a')

Scraping with Groovy

Groovy's features make screen scraping easy. URL fetching in Groovy relies on Java classes like java.net.URL, but it is made easier by Groovy's additional methods, in this case withReader.

import org.ccil.cowan.tagsoup.Parser

String ENCODING = "UTF-8"

@Grapes( @Grab('org.ccil.cowan.tagsoup:tagsoup:1.2') )
def PARSER = new XmlSlurper(new Parser())

def url = "http://www.bing.com/search?q=web+scraping"

// withReader opens the connection with the given encoding
// and closes the reader when the closure returns
new URL(url).withReader(ENCODING) { reader ->
    def document = PARSER.parse(reader)
    // Extracting information
}

HTML parsing can be done with any of the many HTML-parsing Java tools available, like TagSoup or CyberNeko. In this example we have used TagSoup, and we can see how easily we can declare our dependency on the library thanks to Grapes.
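
For reference, the compact one-parameter form used above is equivalent to the longer named-parameter version mentioned in the edit note at the end of this post:

// Same dependency declared with named parameters instead of the compact form
@Grab(group = 'org.ccil.cowan.tagsoup', module = 'tagsoup', version = '1.2')
import org.ccil.cowan.tagsoup.Parser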

On top of that, Groovy's XmlSlurper and GPath let us access specific parts of the parsed HTML in a convenient way. For our example, we just need one line of code to extract the titles of the search results:

//jQuery selector: $('#results h3 a')
//Example 1
document.'**'.find{ it['@id'] == 'results'}.ul.li.div.div.h3.a.each { println it.text() }
//Example 2
document.'**'.find{ it['@id'] == 'results'}.'**'.findAll{ it.name() == 'h3'}.a.each { println it.text() }

In the snippet I have provided two different ways of achieving the same goal.

In both examples we first use Groovy's '**' to search all the document's descendants in depth, which lets us find the element whose id is 'results'.

Then, in the first example, we spell out the full element path from the results element down to the links that represent the titles. As we can see, this is less handy than just saying "I want all h3 descendants" the way it is done with jQuery.

The second example does exactly that: using the '**' operator, it asks for all elements of type h3. Still, compared with the jQuery version, the solution remains quite complex.

Summary

Pros:
- Easy URL fetching thanks to withReader
- Parsing simplified thanks to XmlSlurper, with Grapes for declaring dependencies

Cons:
- Verbose when filtering descendants at lower levels
- Filtering by id, class or attribute is complex compared with jQuery's #, . or [attribute=] selectors

To sum up, we have seen that Groovy makes web scraping easier. However, it comes with some inconveniences, above all when we compare it with how easy it is to select elements with jQuery selectors.

In my next post I'm going to explore other libraries that simplify element filtering by providing support for things like XPath or even CSS selectors.

PS: This example's code is really simple, but if you still want to access it, it is available at this gist.

PS2: This series is now going to be three articles long, with the first dedicated to GPath, the second to XPath, and the last to the most interesting of them all, in my opinion: JSoup.

Edited 22/10/2011: Grab with multiple named parameters has been replaced by the more concise version with only one parameter as suggested by Guillaume Laforge.
