Simply Testable Updates #25: Massive Service Failure, JSLint Improvements on the Way
January 16, 2013
You're receiving this email because you joined Simply Testable's updates list.
This is the 25th in a series of weekly progress updates on the development of Simply Testable, a brand new automated web frontend testing service providing one-click testing for your entire site.
Yesterday I noticed that the service wasn't quite responding correctly and was being a bit sluggish.
After a quick investigation I saw that MySQL, the database server, was stuck trying to either read from or write to its data files. Following up on this, and whilst trying to restart MySQL, I noticed the system log contained more than a few references to hard drive read failures.
This was not good.
It turned out that of the two hard disks in the production server, one had failed entirely (it wasn't being seen by the operating system at all) and the one that was still working was quite seriously faulty.
Backups would have been handy.
The code for all the Simply Testable applications is safely stored in a distributed version control system and is backed up in duplicate in many places.
The databases were another matter entirely. There were no backups.
A couple of months ago I noticed that the main MySQL data file was about 200GB in size and not something that could easily be copied as-is. I also thought that regular database dumps would be infeasible with respect to how long they would take and the impact performing them would have on the service.
I assumed that with a relatively new server with relatively new hardware, and with the chance of both hard drives failing being somewhat unlikely in the immediate future, I could make do without database backups until I had a chance to invest in a much more durable cluster of database servers.
And then both hard drives failed.
Thankfully the data center support team were very quick in first re-asserting that, yes, both hard drives would need to be replaced and then in actually replacing them.
The Simply Testable applications are now up and running again, but I can't turn the service back on until data has finished syncing between the two new hard drives. I expect to be able to turn the service back on in about 2 to 3 hours.
As well as completing the JSLint configuration control changes, I will over the next week focus entirely on a workable backup strategy.
As always, if you'd like to see web testing you find boring handled automatically for you, add a suggestion or vote up those that interest you. This really helps.
Feedback, thoughts or ideas: email email@example.com, follow @simplytestable or keep an eye on the Simply Testable blog.