Imediava's Blog

Web Scraping with Groovy (3 of 3) – JSoup

In previous articles we looked at how to use Groovy [4] and Groovy with XPath [5] for scraping web pages. In this one we will see how the Jsoup library can make it even easier.

Jsoup

Jsoup is a very powerful Java library that I have only recently discovered. As a Java library, it can be used from any JVM language, so we are going to use it with Groovy and benefit from the features of both.

With Jsoup it is really easy to fetch and parse a URL: we just need one convenient method. The code to fetch the URL used as the example in the previous articles is as simple as this:

@Grapes(@Grab('org.jsoup:jsoup:1.6.1'))
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
Document doc = Jsoup.connect("http://www.bing.com/search?q=web+scraping").get()

We just declare our dependency on the Jsoup library (thanks to Grape) and then call the connect method of the Jsoup class. This creates a Connection object whose parameters can be modified, for example to set cookies on it. Once the Connection object has been created, calling its get method actually retrieves the web page, parses it into a DOM and returns a Document object.
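As a minimal sketch of how the Connection can be configured before the request is sent, the snippet below sets a user agent, a cookie and a timeout. The cookie name and value, the user agent string and the timeout are made-up values for illustration only. The call to get is left commented out so the script can be run without network access.

```groovy
@Grab('org.jsoup:jsoup:1.6.1')
import org.jsoup.Connection
import org.jsoup.Jsoup

// connect() only builds the Connection; nothing is sent yet.
Connection connection = Jsoup.connect("http://www.bing.com/search?q=web+scraping")
        .userAgent("Mozilla/5.0")        // assumed value, any user agent string works
        .cookie("sessionid", "abc123")   // hypothetical cookie name and value
        .timeout(5000)                   // milliseconds, assumed value

// The pending request can be inspected before it is executed.
println connection.request().cookies()
println connection.request().timeout()

// Uncomment to actually fetch and parse the page:
// def doc = connection.get()
```

Each of these methods returns the Connection itself, which is what makes the chained style possible.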

CSS selectors

Jsoup’s most important feature is that it lets us use CSS selectors, a way to identify parts of a web page that should be familiar to any jQuery or CSS user. CSS selectors are, in my opinion, the best existing way to filter elements in a web page.

With the Document object we got before, the full code for filtering the links of interest for our example would be:

def results = doc.select("#results h3 a")

As you can see, by calling the select method we can use the same selector we would use with jQuery, which makes the query really easy.
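To see what select gives back, here is a small sketch that runs the same selector against an inline HTML string instead of a live page. The markup and the example.com URLs are invented for illustration and only assume that results live under an element with id "results". The real page structure may differ.

```groovy
@Grab('org.jsoup:jsoup:1.6.1')
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// Hypothetical markup standing in for a search results page.
String html = '''
<div id="results">
  <ul>
    <li><h3><a href="http://example.com/a">First result</a></h3></li>
    <li><h3><a href="http://example.com/b">Second result</a></h3></li>
  </ul>
</div>
<div id="sidebar"><h3><a href="http://example.com/ad">Unrelated link</a></h3></div>
'''

// Jsoup.parse builds a Document from a string, no network needed.
Document doc = Jsoup.parse(html)

// Same selector as above: links inside h3 headings under the element with id "results".
def results = doc.select("#results h3 a")

// The result is a list of Element objects, so Groovy's collection methods work on it.
def links = results.collect { [title: it.text(), url: it.attr("href")] }
links.each { println "${it.title} -> ${it.url}" }
```

The sidebar link is skipped because it is not a descendant of the "results" element, which is exactly the kind of filtering the selector is meant to express.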

To summarize, here are the advantages and disadvantages of Jsoup:

Pros:
- Simplifies URL fetching to the extreme (just one method).
- Facilitates the use of cookies.
- Allows the use of CSS selectors, which are known to any jQuery user. In my opinion they are the best way to select an element or a list of elements in a web page (for similar opinions see references [1], [2] and [3]).

Cons:
- XPath filtering is more standardized.

Summary

To sum up, Jsoup is fairly recent but comes with features that make it, in my opinion, the best Java library for web scraping. I recommend that anyone interested in scraping with Java visit Jsoup’s page, which is full of good examples of how to use the library.

Nonetheless, I encourage everyone to share their opinion about which Java library they think is the best for web scraping.

Links

Links to comparisons of XPath and CSS selectors:

[1] http://ejohn.org/blog/xpath-css-selectors/
[2] http://chrisfjay.blogspot.com/2007/08/css-and-xpath-selectors.html
[3] http://saucelabs.com/blog/index.php/2011/05/why-css-locators-are-the-way-to-go-vs-xpath/

Previous articles about web scraping with Groovy:

[4] http://imediava.wordpress.com/2011/08/18/web-scraping-with-groovy-1-of-3/
[5] http://imediava.wordpress.com/2011/08/30/web-scraping-with-groovy-2-of-3/

Edited 22/10/2011: Grab with multiple named parameters has been replaced by the more concise version with only one parameter as suggested by Guillaume Laforge.
