Simply Testable Updates #107: Improvements to: Broken Link Checking, URL Discovery, Background Process Management

September 17, 2014
This is the online archive for the Simply Testable weekly behind-the-scenes newsletter.


This is the 107th weekly progress update on the development of Simply Testable, your professional automated web frontend testing service providing one-click testing for your entire site.

This week I dealt with improving broken link checking, URL discovery and the management of background processes. Podcast ads are finalised but are yet to run.

Broken Link Checking
I included the phrase 'broken link checking improvements' in the subject of last week's newsletter and then promptly neglected to write about the matter in the newsletter itself. I'll try again now.

I noticed a site-wide broken link test for redislabs.com was taking a long time. A very long time. A very, very long time.

Most of the time there's nothing that can be done about this: if a given page being tested contains a large number of links and/or contains links to slow-to-respond websites or services, a broken link test for the page is going to take some time.

Sometimes a very long-running test indicates a problem and so in this case I investigated a little further. But before I get into that, we need a bit of background.

If I know a given link in page X was working one minute ago and I discover the same link in page Y, I can, with confidence, mark that link as 'working' without needing to test it again for page Y. As a full-site broken link test progresses, each as-yet untested page is compared to those already tested to find links that have already been checked. This speeds up the broken link checking process by quite a bit.
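As a rough illustration of the idea above (not the actual Simply Testable code), here is a minimal Python sketch. The names `check_site_links`, `fake_check` and the `pages` mapping are hypothetical, and the sketch assumes the links on each page have already been extracted.

```python
from typing import Callable, Dict, List


def check_site_links(
    pages: Dict[str, List[str]],
    check_link: Callable[[str], bool],
) -> Dict[str, Dict[str, bool]]:
    """Check the links on every page, testing each distinct link only once."""
    known: Dict[str, bool] = {}  # link URL -> result from earlier in this test
    results: Dict[str, Dict[str, bool]] = {}
    for page_url, links in pages.items():
        page_result: Dict[str, bool] = {}
        for link in links:
            if link not in known:
                # Only links not seen earlier in the test cost a real check.
                known[link] = check_link(link)
            page_result[link] = known[link]
        results[page_url] = page_result
    return results


# Stand-in checker that records how often it is called (no network access).
calls: List[str] = []


def fake_check(url: str) -> bool:
    calls.append(url)
    return not url.endswith("/missing")


pages = {
    "http://example.com/": [
        "http://example.com/about",
        "http://example.com/missing",
    ],
    "http://example.com/about": [
        "http://example.com/",
        "http://example.com/missing",
    ],
}
report = check_site_links(pages, fake_check)
print(len(calls))  # number of real checks made for the four links above
```

The two pages contain four links between them, but the link to /missing is shared, so the stand-in checker is only called three times. A real implementation would presumably also expire old results, since a link that worked a minute ago is a safer bet than one that worked an hour ago.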

In the case of the test for redislabs.com, past test results were having no effect on the speed of the test. Each page to be tested paid no attention to the results of the pages already tested. This could happen if all pages of a site contained a unique collection of links not present on any other page but that wasn't the case here.

In this case, redislabs.com was not willing to give our broken link pre-processor any web pages. As a result, just before a page was tested, we couldn't compare its content to past results within the same test.

A quick fix for how we retrieve web pages sorted that out, bringing the time required to carry out a full-site broken link test for redislabs.com down from about 30 minutes to 4 minutes.


URL Discovery
For sites that don't have an XML sitemap, we first crawl the site to find pages to test. The site is crawled and URLs are collected until the collection of unique URLs reaches a given limit.
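The crawl-until-limit idea can be sketched in Python as follows. This is only an illustration under stated assumptions: `discover_urls`, `get_links` and the in-memory `site` map are hypothetical names, and a real crawler would fetch and parse pages rather than look them up in a dictionary.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def discover_urls(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    limit: int,
) -> List[str]:
    """Breadth-first crawl that stops once `limit` unique URLs are collected."""
    found = [start_url]
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(found) < limit:
        page = queue.popleft()
        for url in get_links(page):
            if url in seen:
                continue  # already collected, does not count again
            seen.add(url)
            found.append(url)
            queue.append(url)
            if len(found) >= limit:
                break
    return found


# A tiny in-memory "site": each URL maps to the links found on that page.
site: Dict[str, List[str]] = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/d"],
}
urls = discover_urls("http://example.com/", lambda u: site.get(u, []), limit=4)
print(urls)
```

With a limit of 4, the crawl collects the start page, /a, /b and /c, then stops before /d is ever discovered.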

I recently made some changes to speed this process up and to find a broader range of pages.

URLs that differ only in their fragment (that's the technical name for the part of a URL from the # onwards, as in #something) were previously considered unique. This is not technically true. A web client does not normally pass the URL fragment to a web server and, even if it did, the web server would ignore the fragment when responding with a page, as URL fragments are purely a client-side feature.

In other words, visiting http://example.com/#foo and http://example.com/#bar will return the same content from the server.

The URL discovery process now ignores the fragment part of a URL. Fragment-only variants of the same page no longer count towards the URL limit, allowing a broader range of URLs to be collected. This also means that, once the crawling process has finished, each page will be tested only once.
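For the curious, here is a minimal Python sketch of ignoring fragments when collecting URLs. It uses `urldefrag` from Python's standard library; the `unique_pages` helper is a hypothetical name for illustration and is not the service's actual code.

```python
from typing import Iterable, List
from urllib.parse import urldefrag


def unique_pages(urls: Iterable[str]) -> List[str]:
    """Drop the fragment from each URL and keep the first copy of each page."""
    seen = set()
    pages = []
    for url in urls:
        page, _fragment = urldefrag(url)  # splits off the "#..." part
        if page not in seen:
            seen.add(page)
            pages.append(page)
    return pages


found = [
    "http://example.com/#foo",
    "http://example.com/#bar",
    "http://example.com/about#team",
]
print(unique_pages(found))
```

The three discovered URLs collapse to two pages, http://example.com/ and http://example.com/about, so only two slots of the URL limit are used and neither page is tested twice.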


Background Process Management
Almost all the work that goes on when a test is carried out occurs in background processes. If this weren't the case, you'd click the [start] button for a new test, stare blankly at a spinning browser that displays no feedback and then get some results after however long it may take to run the test.

Chances are you'd not stick around long enough and you would presume that Simply Testable is broken. So that's not good.

Also, the chances of running a full-site test in one continuous process without something breaking and causing the whole test to fail are next to nothing. Almost all full-site tests would offer no feedback and would, most likely, fail after an indeterminate amount of time. That's not so good either.

This is why we carry out anything but the most simple actions within background processes.

We use two pieces of technology for this to work: a background job storage and processing system called resque and something to manage the resque workers as they go about their work.

Until recently we used the first PHP port of resque that I happened to come across about two years ago and our own set of background process worker management tools to keep things ticking over.

For a while this appeared to work. Over time it became apparent that, for various reasons, it didn't.

I'm now switching over to the most popular and most actively maintained PHP port of resque that currently exists and am using supervisor to manage the background processes.
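To give a flavour of what managing workers with supervisor looks like, here is a hypothetical program section. It is a sketch only: the program name, script path, queue name and worker count are all placeholders, and passing the queue via a QUEUE environment variable is an assumption about how the chosen resque port is started.

```ini
; Hypothetical supervisor program section for a pool of resque workers.
; Program name, script path, queue name and process count are placeholders.
[program:resque-worker]
command=php /path/to/resque-worker.php
; Assumption: the worker script reads its queue name from QUEUE.
environment=QUEUE="default"
numprocs=4
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true
```

The useful parts are `numprocs`, which keeps a fixed number of workers running, and `autorestart`, which brings a worker back if it exits unexpectedly.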

This better addresses some problems I've seen over the past two years and should result in a much more stable system.


Upcoming Work
Over the next week or so I'll be hunting down further podcasts in which I can advertise and will ponder various other ways to promote the service.

As always, if you'd like to see web testing you find boring handled automatically for you, add a suggestion or vote up those that interest you. This really helps.

Feedback, thoughts or ideas: email jon@simplytestable.com, follow @simplytestable or keep an eye on the Simply Testable blog.

Cheers!

 
Copyright © 2014 Simply Testable, All rights reserved.