Simply Testable Updates #51: 200 More Help Pages, Origins Of The "Variable" Element, Crawler Plans


August 7, 2013

This is the 51st weekly progress update on the development of Simply Testable, a brand new automated web frontend testing service providing one-click testing for your entire site.

Highlights this week:
  • 200 more help pages generated, covering the top 25 types of HTML validation error
  • Origins of the "Variable" element discovered
  • Crawler Plans

200 More Help Pages
Armed once more with a list of the most commonly-occurring types of HTML validation error, I continued the task of providing a single help page for each specific error, giving an explanation and a likely fix.

The list of the most common types of HTML validation error has increased from 13 to 25 with the number of individual help pages for specific errors increasing from 170 to 370.

With these help pages directly linked to from HTML validation results you can now more easily resolve issues that you encounter.

Origins of the "Variable" Element Discovered
In writing help pages for HTML validation errors I try to research all specific instances of a given type of error to find out what is really going on.

One such error led me on an interesting journey: Element "Variable" Undefined.

If you're moderately familiar with HTML you'll know that there is no element named "Variable" and as such this seems like an open-and-shut case.

On the other hand, if you have over 1.8 million HTML validation errors to dig through and you happen to know that Element "Variable" Undefined is the most common occurrence of Element "X" Undefined, you might start wondering why it is such a common occurrence.

It's possible that someone might include <Variable> in their markup. Whilst meaningless it's a lot less odd than some of the markup I've seen recently.

But why are there so many occurrences of <Variable> that it is the most commonly used invalid element? And why always <Variable> and not <variable> or <VARIABLE>?

I decided to look into this a little further, turning to Google to see what I could find. I scrolled through search results, finding nothing specific.

The most relevant information I discovered was a help page from my main competitor which suggested that as the element <Variable> doesn't exist it should be removed to fix the error.

I instead turned to examining all the web pages that generate this error looking for common patterns or common indications of what was going on.

Bingo!

There is a template for Blogger named Typography Blogger that contains a whole bunch of invalid XHTML. This I did not discover by chance; every single web page that generates Element "Variable" Undefined has a comment referencing this template. That was fortunate.

Interestingly, the occurrences of <Variable> within the template are found inside an HTML comment inside a section of CSS. The content of comments shouldn't be treated as markup.

Perhaps the HTML validator was behaving badly? No. That's not it. Take the exact HTML comment containing <Variable> and pop it into a valid HTML4 or HTML5 document and all is good.

Pop the comment containing <Variable> into an XHTML document and ... tada! ... errors are raised regarding Element "Variable" Undefined.

The root of this problem is how CDATA sections are treated by a conforming XML parser.

Character data, referred to as CDATA, is a section of a document that is to be treated literally and that is not to be interpreted as markup. Two of the most relevant sections of a document that should be treated as CDATA are the contents of <script> and <style> elements as these are quite literally in a different language and are most certainly not markup.

HTML parsers will, in general, completely ignore the content of <script> and <style> elements. XML parsers will not, and will instead try to make sense of such content if it looks generally a lot like markup.

And that's exactly what was happening. Since the string <Variable> looks quite a bit like markup, the HTML validator's XML parser treats it as markup because nothing told it not to. This is the correct behaviour.
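The difference between the two kinds of parser can be seen in a minimal Python sketch. It uses Python's standard library parsers as stand-ins and not the HTML validator itself, and the markup is a made-up, self-closing stand-in for the template's content (the attribute name "bgcolor" is an assumption for illustration only).

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical markup: a <Variable> tag sitting inside a comment in a <style> block.
MARKUP = (
    "<html><head><title>Demo</title>"
    '<style type="text/css">/* <Variable name="bgcolor" /> */</style>'
    "</head><body></body></html>"
)


class TagCollector(HTMLParser):
    """Records the name of every start tag the HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tags_seen_by_html_parser(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def tags_seen_by_xml_parser(markup):
    # An XML parser has no special rule for <style>, so anything that
    # looks like a tag inside it becomes an element.
    return [element.tag for element in ET.fromstring(markup).iter()]


html_tags = tags_seen_by_html_parser(MARKUP)
xml_tags = tags_seen_by_xml_parser(MARKUP)

print(html_tags)  # the content of <style> is passed over as raw text
print(xml_tags)   # "Variable" shows up as an element
```

The HTML parser reports html, head, title, style and body only, while the XML parser additionally reports Variable as a child of style.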

The fix is to ensure such content is correctly marked as CDATA, specifically in cases where the content could be interpreted as markup.
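As a rough illustration of that fix, the Python sketch below parses two versions of a small XHTML-like document with the standard library XML parser. The documents are hypothetical and are not taken from the actual template; the CDATA markers are wrapped in CSS comments, a common convention that keeps the stylesheet itself valid.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical stylesheet content that happens to look like markup.
UNPROTECTED = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Demo</title>
<style type="text/css">
/* <Variable name="bgcolor" /> */
body { color: #000; }
</style>
</head><body/></html>"""

# Same content, explicitly marked as CDATA.
PROTECTED = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Demo</title>
<style type="text/css">
/*<![CDATA[*/
/* <Variable name="bgcolor" /> */
body { color: #000; }
/*]]>*/
</style>
</head><body/></html>"""


def elements_inside_style(document):
    """Return the local names of any elements the XML parser finds inside <style>."""
    root = ET.fromstring(document)
    style = root.find(".//" + XHTML_NS + "style")
    return [child.tag.split("}")[-1] for child in style]


print(elements_inside_style(UNPROTECTED))  # ['Variable']
print(elements_inside_style(PROTECTED))    # []
```

In the protected version the parser keeps the whole stylesheet, including the text <Variable name="bgcolor" />, as plain character data.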

Since the Internet was lacking a fix to this problem, I wrote up a page that covers the cause of this error and gives a specific solution.

The Internet is now marginally better than before.

Crawler Plans
Since the dawn of time (that is, since about a year ago), the Simply Testable service has discovered the URLs to test for a full-site test by retrieving a web site's XML sitemap and seeing what's inside.

A little later on, I added support for discovering URLs from RSS and Atom feeds.
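To give a feel for why sitemap-based discovery is quick and easy to verify, here is a minimal Python sketch that pulls the URLs out of an XML sitemap. It is not the code Simply Testable runs; the function name and the example.com sitemap are made up for illustration, and it reads the sitemap from a string so that no network access is needed.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical sitemap content.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc> http://example.com/about </loc></url>
  <url><loc></loc></url>
</urlset>"""


def urls_from_sitemap(xml_text):
    """Return every non-empty <loc> value in the sitemap, in document order."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.iter(SITEMAP_NS + "loc"):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


print(urls_from_sitemap(EXAMPLE_SITEMAP))
# ['http://example.com/', 'http://example.com/about']
```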

The reasoning behind this was quite straightforward: it was quick, it was easy to verify and it allowed me to focus on testing URLs for various issues instead of focusing on finding the URLs to test.

Of the entire web that Simply Testable could handle, I was intentionally narrowing it down to a subset that had XML sitemaps. This narrowed down the audience somewhat but that didn't matter.

And from examining some patterns of activity, I've seen that some people would try to test a site, see that it can't be tested as it has no sitemap, add a sitemap and try again. I indirectly made the Internet better which is nice.

I recently revisited the notion of crawling a site to discover URLs to test to see if anything had changed, approaching this matter using Science by seeing how many full-site tests failed to happen due to there not being a sitemap.

Armed with a database of potential evidence and some nifty SQL, I set about this difficult task. And 0.0005 seconds later I discovered that about 30% of full-site tests fail due to there being no sitemap.

I'm not a particularly proficient statistician but even I can say that 30% is significant.

And so I've decided to add in features to crawl a site to find URLs to test if the site has no sitemap. I've realised it is necessary and that's about as far as the plans have got.
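Nothing about the crawler is decided yet, so the following Python sketch only shows the general idea of finding more URLs to test from a page that has already been retrieved. Every name in it is hypothetical, it works on an HTML string and does no fetching, and it leaves out everything a real crawler needs (robots.txt handling, rate limiting, a queue of pages still to visit).

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the absolute form of every <a href="..."> found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links and drop any #fragment.
            absolute, _fragment = urldefrag(urljoin(self.base_url, href))
            self.links.append(absolute)


def same_site_links(base_url, html_text):
    """Return unique links on the same host as base_url, in first-seen order."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    collector.close()
    host = urlparse(base_url).netloc
    unique = []
    for link in collector.links:
        if urlparse(link).netloc == host and link not in unique:
            unique.append(link)
    return unique


# Hypothetical page content.
PAGE = (
    '<a href="/about">About</a>'
    '<a href="http://other.example/x">Elsewhere</a>'
    '<a href="contact#top">Contact</a>'
    '<a href="/about">About again</a>'
)

print(same_site_links("http://example.com/", PAGE))
# ['http://example.com/about', 'http://example.com/contact']
```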

Upcoming development
Having covered the top 25 most commonly-occurring HTML validation errors I'll continue to add more but at a slower pace as this covers the majority.

Over the coming two weeks I will be automating the process of handling account downgrades after the free trial period expires and automating the process of sending emails when card payments succeed or fail. I will also look into how best to crawl sites to discover URLs to test.

As always, if you'd like to see web testing you find boring handled automatically for you, add a suggestion or vote up those that interest you. This really helps.

Feedback, thoughts or ideas: email jon@simplytestable.com, follow @simplytestable or keep an eye on the Simply Testable blog.

Cheers!


Copyright © 2013 Simply Testable, All rights reserved.