Imediava's Blog


Web Scraping with Groovy (3 of 3) – JSoup

In previous articles we've had a look at how to use Groovy [4] and Groovy + XPath [5] for scraping web pages. In this final one we are going to see how the JSoup library can make it even easier.


Jsoup is a very powerful Java library that I have only recently discovered. As a Java library it can be used from any JVM language, so we are going to use it from Groovy, benefiting from the features of both.

With Jsoup it is really easy to fetch and parse a URL: we just need one convenient method. The code to fetch the URL of the example we've been using in the previous articles is as simple as this:

@Grapes( @Grab('org.jsoup:jsoup:1.6.1'))
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

Document doc = Jsoup.connect("").get();

We just declare our dependency on the Jsoup library (thanks to Grape) and then call the connect method of the Jsoup class. This creates a Connection object whose parameters can be modified, allowing things like setting cookies on it. After the Connection object has been created, calling its get method will actually retrieve the web page, parse it as a DOM and return a Document object.
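As a hypothetical sketch of that Connection configuration (the URL, user agent and cookie values here are placeholders, not from the original example):

```groovy
@Grab('org.jsoup:jsoup:1.6.1')
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// Placeholder URL and values, for illustration only
Document doc = Jsoup.connect("http://example.com/search?q=groovy")
                    .userAgent("Mozilla/5.0")      // some sites reject the default agent
                    .cookie("sessionid", "abc123") // set a cookie on the request
                    .timeout(10000)                // fail after 10 seconds
                    .get()                         // fetch and parse into a Document
```

The calls chain because each configuration method returns the same Connection object.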

CSS selectors

JSoup's most important feature is that it lets you use CSS selectors, a way of identifying parts of a web page that should be familiar to any jQuery or CSS user. CSS selectors are in my opinion the best existing way to filter elements in a web page.

With the Document object we got before, the full code for filtering the links of interest for our example would be:

def results ="#results h3 a")

As you can see, by calling the select method we can use the same selector we would use with jQuery, which makes the query really easy.
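The select method returns an Elements collection, so each matched link can then be queried for its text and attributes. A small sketch (what it prints depends entirely on the page being scraped):

```groovy
// results is the Elements collection returned by select()
results.each { link ->
    println link.text()           // the visible text of the <a> element
    println link.absUrl("href")   // the link target, resolved to an absolute URL
}
```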

To summarize, these are in my opinion the pros and cons of Jsoup:

Pros:
- Simplifies URL fetching to the extreme (just one method).
- Facilitates the use of cookies.
- Allows the use of the "CSS" selectors known by any jQuery user, in my opinion the best way to select an element or a list of elements in a web page (for similar opinions see references [1] [2] [3]).

Cons:
- XPath filtering is more standardized.

To sum up, Jsoup is fairly recent but comes with features that make it, in my opinion, the best Java library for web scraping. I recommend anyone interested in scraping with Java to visit Jsoup's page, which is full of good examples of how to use the library.

Nonetheless, I encourage everyone to share their opinion about which Java library they think is best for web scraping.

Links to comparisons of XPath and CSS selectors:


Previous articles about web scraping with Groovy:


Edited 22/10/2011: Grab with multiple named parameters has been replaced by the more concise version with only one parameter as suggested by Guillaume Laforge.


4 responses to “Web Scraping with Groovy (3 of 3) – JSoup”

  1. Guillaume Laforge October 23, 2011 at 9:33 am

    You should use the shorter @Grab variant:
    @Grab(‘org.jsoup:jsoup:1.6.1’ )

  2. Guillaume Laforge October 24, 2011 at 11:51 am

    You don’t need the surrounding @Grapes() at all either.
    So instead of:

    @Grapes( @Grab(‘org.jsoup:jsoup:1.6.1’))

    You can write just:

    @Grab(‘org.jsoup:jsoup:1.6.1’)

  3. Lordy December 2, 2011 at 1:03 pm

    Thanks for pointing me in the direction of JSoup!
