Imediava's Blog

Web Scraping with Groovy (1 of 3)

Web Scraping

Web scraping consists of automatically extracting information from a web page. It combines URL fetching with HTML parsing. As an example for this article, we are going to extract the main titles of the results returned by searching for “web scraping” in Microsoft’s Bing.

For reference, searching for “web scraping” with Bing is equivalent to accessing the following URL: http://www.bing.com/search?q=web+scraping
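The q parameter is simply the search phrase in URL-encoded form. As a small sketch (the buildSearchUrl helper is hypothetical and not part of the original example), the standard java.net.URLEncoder class can produce it:

```groovy
// Hypothetical helper: builds the Bing search URL for an arbitrary query.
String buildSearchUrl(String query) {
    // URLEncoder applies form encoding, so a space becomes '+'
    "http://www.bing.com/search?q=" + URLEncoder.encode(query, "UTF-8")
}

println buildSearchUrl("web scraping")
// prints http://www.bing.com/search?q=web+scraping
```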

The results’ titles are selected by applying the following jQuery selector to the web page’s DOM:

$('#results h3 a')

Scraping with Groovy

Groovy’s features make screen scraping easy. URL fetching in Groovy relies on Java classes such as java.net.URL, but it is made simpler by the extra methods Groovy adds to them, in this case withReader.

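// Grape downloads TagSoup and adds it to the classpath when the script is compiled
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2')
import org.ccil.cowan.tagsoup.Parser

String ENCODING = "UTF-8"

// XmlSlurper backed by TagSoup, so HTML that is not well-formed XML can still be parsed
def PARSER = new XmlSlurper(new Parser())

def url = "http://www.bing.com/search?q=web+scraping"

// withReader opens the URL with the given encoding and closes the reader afterwards
new URL(url).withReader(ENCODING) { reader ->
    def document = PARSER.parse(reader)
    // Extracting information
}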
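To see what withReader does in isolation, here is a minimal sketch that needs no network access. It assumes a temporary local file standing in for the remote page; the file name and its content are made up for illustration.

```groovy
// A temporary file stands in for the remote page (assumption for this sketch)
def page = File.createTempFile("sample", ".html")
page.deleteOnExit()
page.setText("<html><body><h3>Hello</h3></body></html>", "UTF-8")

// withReader opens the stream, hands a Reader to the closure,
// closes it afterwards and returns the closure's result
def content = page.toURI().toURL().withReader("UTF-8") { reader ->
    reader.text
}

println content
```

The same call works on an http URL, which is why the listing above never has to close the connection explicitly.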

HTML parsing can be done with any of the many HTML parsers available for Java, such as TagSoup or CyberNeko. In this example we have used TagSoup, and we can see how easily the dependency on the library is declared thanks to Grape’s @Grab annotation.
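The reason a dedicated HTML parser is needed can be shown with a short sketch. It assumes Groovy 3 or later, where XmlSlurper lives in groovy.xml (older versions import groovy.util.XmlSlurper by default), and the HTML fragment is invented for illustration.

```groovy
import groovy.xml.XmlSlurper
import org.xml.sax.SAXParseException

// <br> is never closed: fine for a browser, but not well-formed XML
def html = "<html><body><p>one<br>two</p></body></html>"

def rejected = false
try {
    // Without TagSoup, XmlSlurper falls back to a strict XML parser
    new XmlSlurper().parseText(html)
} catch (SAXParseException e) {
    rejected = true
    println "Strict parser gave up: ${e.message}"
}
```

Passing a TagSoup Parser to the XmlSlurper constructor, as in the listing above, is what lets the same kind of markup be read as a tree.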

On top of that, Groovy’s XmlSlurper and GPath make it convenient to access specific parts of the parsed HTML. For the example in this article, we need just one line of code to extract the titles of the search results:

// jQuery selector: $('#results h3 a')

// Example 1: spell out the full path from #results down to the links
document.'**'.find { it['@id'] == 'results' }.ul.li.div.div.h3.a.each { println it.text() }

// Example 2: look for h3 elements anywhere below #results
document.'**'.find { it['@id'] == 'results' }.'**'.findAll { it.name() == 'h3' }.a.each { println it.text() }

In the snippet I have provided two different ways of achieving the same goal.

For both examples we first use Groovy’s ‘**’ to traverse all of the document’s descendants depth-first, so that we can find the one whose id is results.
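As a rough illustration of that traversal, the sketch below runs ‘**’ over a tiny well-formed document. The markup is invented for this example, and Groovy 3 or later is assumed for the groovy.xml.XmlSlurper import.

```groovy
import groovy.xml.XmlSlurper

// Assumed sample markup, not Bing's actual page
def document = new XmlSlurper().parseText('''
<html>
  <body>
    <div id="header"><h1>Search</h1></div>
    <div id="results"><ul><li>first</li></ul></div>
  </body>
</html>''')

// '**' is shorthand for depthFirst(): each branch is walked to the bottom
// before the next sibling is visited
def visited = document.'**'.collect { it.name() }
println visited

// find stops at the first element whose id attribute matches
def results = document.'**'.find { it.@id == 'results' }
println results.name()
```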

Then, in the first example, we specify the full element path from the results element to the links that represent the titles. As we can see, this is less handy than just saying “I want all h3 descendants”, the way it is done with jQuery.

The second example does exactly that: using the ‘**’ operator, it asks for all descendant elements of type h3. However, compared with the jQuery equivalent, the solution is still quite complex.
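Both approaches can be tried side by side without hitting Bing. The following sketch assumes a simplified, well-formed stand-in for the results page (the real markup is more involved) and Groovy 3 or later for the import.

```groovy
import groovy.xml.XmlSlurper

// Assumed markup: a simplified stand-in for the results page
def document = new XmlSlurper().parseText('''
<html><body>
  <div id="results"><ul>
    <li><div><div><h3><a href="/a">First title</a></h3></div></div></li>
    <li><div><div><h3><a href="/b">Second title</a></h3></div></div></li>
  </ul></div>
</body></html>''')

def results = document.'**'.find { it.@id == 'results' }

// Example 1: full path from #results to the links
def byPath = results.ul.li.div.div.h3.a.collect { it.text() }

// Example 2: any h3 below #results, then its links
def byDepthFirst = results.'**'.findAll { it.name() == 'h3' }.a.collect { it.text() }

println byPath
println byDepthFirst
```

The first version breaks as soon as a wrapper element is added or removed, while the second only depends on the h3 elements being somewhere below the results container.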

Summary

Pros | Cons
Easy URL fetching thanks to withReader | Verbose when filtering descendants at lower levels
Parsing simplified thanks to XmlSlurper, with Grape for declaring dependencies | Filtering by id, class or attribute is complex compared with the #, . and [attribute=] selectors in jQuery
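To give an idea of the second drawback, here is a sketch of what a class filter might look like with GPath. The markup is invented for illustration, and Groovy 3 or later is assumed for the import.

```groovy
import groovy.xml.XmlSlurper

// Assumed markup for illustration only
def document = new XmlSlurper().parseText('''
<div>
  <p class="title main">Groovy</p>
  <p class="subtitle">Scraping</p>
  <p class="main">GPath</p>
</div>''')

// jQuery equivalent: $('.main')
// The class attribute is a space-separated list, so it has to be split by hand
def mainItems = document.'**'.findAll { node ->
    node.@class.text().tokenize().contains('main')
}.collect { it.text() }

println mainItems
```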

To sum up, we have seen that Groovy makes web scraping easier. However, it comes with some inconveniences, above all when compared with how easy it is to select elements with jQuery selectors.

In my next post I’m going to explore other libraries that simplify element filtering by providing support for things like XPath or even CSS selectors.

PS: This example’s code is really simple, but if you still want to access it, it is available at this gist.

PS2: This set of articles is now going to be three articles long, with the first dedicated to GPath, the second to XPath and the last to JSoup, which in my opinion is the most interesting of them all.

Edited 22/10/2011: Grab with multiple named parameters has been replaced by the more concise version with only one parameter as suggested by Guillaume Laforge.

5 responses to “Web Scraping with Groovy (1 of 3)”

  1. Crazy4Groovy August 19, 2011 at 7:03 pm

    Excellent! I’ve done a similar post for scraping with Groovy at: http://crazy4groovy.blogspot.com/2011/03/scrape-html.html
