June 27, 2014

Simple Web Crawler

Filed under: automation — SiKing @ 3:19 pm

In some Selenium discussion fora, I often see the question: how do you build a web crawler / link checker in Selenium? The short answer is: you don’t! The longer answer is: you pick a different tool / library that is better suited for the job.

First, let me cover why Selenium is a (very) poor choice. Selenium is a tool that interacts with web applications; specifically, it interacts with DOM elements in a web browser. Some links on any given website are going to lead to non-DOM pages: download links, directory listings, etc. In all these cases Selenium just throws up its hands and gives up!


One excellent library is HTTPBuilder. My implementation of an HTTPBuilder crawler is available on SourceForge. (It’s documented inline, so I am not going to repeat myself here.) In fact, many so-called Selenium web crawlers only use Selenium to open a page, but use HTTPBuilder to check the page status … which makes Selenium just unnecessary overhead.
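To illustrate why a browser is not needed for the status check, here is a minimal sketch. It is not the SourceForge crawler and it does not use HTTPBuilder; it assumes Java 11+ and swaps in the JDK’s own java.net.http.HttpClient. The class name LinkStatus, the HEAD-only request, and the "400 and up means broken" rule are all assumptions made for the example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper for illustration only; not part of the author's crawler.
final class LinkStatus {

    // Sends a HEAD request, so only the status line and headers come back
    // and download links are never actually downloaded.
    // Caveat: some servers reject HEAD; a fallback to GET may be needed.
    static int fetchStatus(HttpClient client, String url)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<Void> response =
                client.send(request, HttpResponse.BodyHandlers.discarding());
        return response.statusCode();
    }

    // Assumption: any 4xx or 5xx status counts as a broken link.
    static boolean isBroken(int status) {
        return status >= 400;
    }

    // Usage: pass one or more URLs on the command line.
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        for (String url : args) {
            int status = fetchStatus(client, url);
            System.out.println(status + (isBroken(status) ? " BROKEN " : " ok ") + url);
        }
    }
}
```

No DOM is involved anywhere, so a download link or a directory listing is checked in exactly the same way as an ordinary page.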

There are a few things that my example crawler does not handle; the exact solution for these edge cases is left as an exercise for the reader. 😉


Just for kicks, I tried to do this in SoapUI. It took a bit of convincing, but it can be done.

If you look at the links on most websites, you will find a mixture of complete URLs (server and everything) and paths local to that server. The biggest challenge is to dynamically overwrite the endpoint of a SoapUI REST request. Below is only one possible solution.

The first step is to create a new project, a new REST service, a new Resource, and a new GET Method. In all the prompts, the only thing to set is the endpoint: ${#TestCase#endpoint}.

[Image: crawler service]

Now create a new testsuite, and a new testcase. The testcase has two properties: baseURL and endpoint. baseURL is going to hold the starting URL of the page you want to check; endpoint will eventually hold the URL of the link you are currently checking.

The first test step is going to be a REST call to the baseURL. If you remember, we set our Endpoint in the REST service to the literal ${#TestCase#endpoint}, so we need a testcase Setup Script that copies baseURL into endpoint:

testCase.setPropertyValue("endpoint", context.expand( '${#TestCase#baseURL}' ))

Run the first test step, and create an assertion for Valid HTTP Status Codes to be 200, to make sure that we get something back.

Next we want to extract all the links. This is done with a DataSource step; set the type to XML, Source Step to your previous step, and Source Property to ResponseAsXml. The Row XPath to select all the elements will be //*:a[exists(@href)]; this selects only the anchors that actually have an href attribute. The column will be @href and the property name can be anything; I used “href”. Run it and make sure you are getting the links from your previous step.
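If you want to try the row selection outside SoapUI, the sketch below does the equivalent with the JDK’s built-in XPath engine. That engine only speaks XPath 1.0, so the *:a and exists() syntax above is not available; the expression is rewritten with local-name() and a plain [@href] predicate. The class name AnchorExtractor and the sample markup are made up for the example, and the input is assumed to be well-formed XML, as ResponseAsXml would be.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
final class AnchorExtractor {

    // XPath 1.0 equivalent of //*:a[exists(@href)] followed by the @href column.
    private static final String HREFS = "//*[local-name()='a'][@href]/@href";

    // Returns the href values of all anchors, in document order.
    static List<String> hrefs(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // so local-name() ignores any XHTML namespace
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList attrs = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(HREFS, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                result.add(attrs.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<a href=\"/about\">About</a>"
                + "<a name=\"top\">no href here</a>"
                + "<a href=\"http://example.org/\">Elsewhere</a>"
                + "</body></html>";
        System.out.println(hrefs(page)); // prints [/about, http://example.org/]
    }
}
```

The anchor without an href is skipped, which is the same filtering the Row XPath does in the DataSource step.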

After that will be a Groovy step to parse and transform what we just retrieved:

def href = context.expand( '${anchors#href}' )
def location = new URI(href)

if(location.scheme == null) {
	// no scheme: a path local to the server, so prepend baseURL
	testRunner.testCase.setPropertyValue("endpoint", context.expand( '${#TestCase#baseURL}' ) + href)
} else {
	// complete URL: take it as is
	testRunner.testCase.setPropertyValue("endpoint", href)
}

Note that my DataSource step was called “anchors”. I am using URI(), which breaks the string up into individual components that I can refer to as I need without doing any fancy String manipulations.

At this point it might be necessary to review exactly what is a URI and what is a URL. The big deal, tested in the if statement above, is whether the link starts with a scheme such as “http” or not. If not (scheme == null), then we have a local path and we have to prepend the server name in front of it from our baseURL. If yes, then it’s a complete link to some server, so we take it as is. The result is assigned to the testcase property endpoint.
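The Groovy step above prepends baseURL by plain string concatenation, which works when baseURL is just the server and the href starts with a slash. As an alternative sketch (not what the step above does), the same java.net.URI class can do the joining itself through resolve(), which also copes with hrefs relative to the current page. The class name LinkResolver and the example.com URLs are assumptions for the example.

```java
import java.net.URI;

// Hypothetical helper for illustration only.
final class LinkResolver {

    // Resolves href against the URL of the page it was found on.
    // A complete URL is returned unchanged; a local path is joined to the
    // page's server, with "../" segments normalized away.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/docs/index.html";
        System.out.println(resolve(page, "/about"));              // http://example.com/about
        System.out.println(resolve(page, "page2.html"));          // http://example.com/docs/page2.html
        System.out.println(resolve(page, "../img/logo.png"));     // http://example.com/img/logo.png
        System.out.println(resolve(page, "https://other.org/x")); // https://other.org/x
    }
}
```

Because both calls are plain java.net.URI methods, the one-liner inside resolve() could in principle replace the whole if/else, but treat that as a suggestion to test rather than a drop-in fix.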

Note that, just as in the case of HTTPBuilder above, this Groovy step does not account for some other edge cases.
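The post does not spell out which edge cases those are. As a guess at a few common ones (fragment-only links such as #top, non-HTTP schemes such as mailto: or javascript:, and hrefs that do not parse as a URI at all), here is a minimal filter sketch. The class name LinkFilter and the decision to skip, rather than report, unparseable hrefs are assumptions.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration only; the list of edge cases is assumed.
final class LinkFilter {

    // Returns true if the href looks like something an HTTP request can check.
    static boolean isCheckable(String href) {
        String trimmed = href.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return false; // empty, or a fragment inside the current page
        }
        try {
            String scheme = new URI(trimmed).getScheme();
            // no scheme means a local path; otherwise only http(s) is checkable
            return scheme == null
                    || scheme.equalsIgnoreCase("http")
                    || scheme.equalsIgnoreCase("https");
        } catch (URISyntaxException e) {
            return false; // e.g. an unencoded space; a real crawler might log this
        }
    }

    public static void main(String[] args) {
        for (String href : new String[] {"/about", "#top", "mailto:me@example.com",
                "http://example.com/", "my page.html"}) {
            System.out.println(isCheckable(href) + "  " + href);
        }
    }
}
```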

The next testcase step is another REST call, this time to our modified endpoint. Note again that our REST service points to the literal ${#TestCase#endpoint}, so the new value will be picked up automatically. You can set an assertion for a list of Valid HTTP Status Codes to be whatever you need. Note that SoapUI follows redirects by default; these are normally the 3xx status codes. If you explicitly want to see those, you will need to turn off following redirects for this step in the test step properties.
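The same follow-or-report choice exists in most HTTP clients. As an analogy only (this is not SoapUI configuration), here is how it looks in the JDK’s java.net.http.HttpClient, assuming Java 11+. The class and method names are made up for the example.

```java
import java.net.http.HttpClient;

// Hypothetical helper for illustration only; an analogy to the SoapUI setting.
final class RedirectClients {

    // Follows redirects silently: the caller only sees the final status.
    static HttpClient following() {
        return HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    // Never follows redirects: a 3xx status is handed back to the caller.
    static HttpClient reporting() {
        return HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NEVER)
                .build();
    }

    // True for any status in the 3xx range.
    static boolean isRedirect(int status) {
        return status >= 300 && status < 400;
    }

    public static void main(String[] args) {
        System.out.println(following().followRedirects()); // NORMAL
        System.out.println(reporting().followRedirects()); // NEVER
    }
}
```

With the reporting client, a list of valid status codes would have to include whichever 3xx codes you are willing to accept; with the following client, it would not.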

[Image: crawler redirects]

Lastly, add a DataSource Loop step at the end of the testcase, so that the steps after the DataSource run once for every link it returned.

