On Yesterday's Outage

Folks,

First, thank you for using Scholaric for your homeschool planning.  Again, my apologies that you were without access to Scholaric for such a long time.

As you've no doubt heard, this outage had an impact well beyond Scholaric, or even Heroku, our hosting provider.  Some of the services continue to be down as I write this out.  In fact, it was covered by CNN: http://money.cnn.com/2011/04/21/technology/amazon_server_outage/index.htm?iid=RNM

I don't mean to make any excuses for the outage; when you run a business, you don't get that luxury.  I can only explain what I know, fix what I can, and try to help you however possible.

What Happened (from what I can tell):

Amazon's Cloud Services (called EC2, for Elastic Compute Cloud) had serious network issues yesterday.  The Elastic part of EC2 makes servers replicate themselves when they get too busy, to handle demand, which can spike very quickly on the internet.  Networking issues such as these can make servers appear extremely busy as they handle the error messages sent between them.  This busyness caused EC2 to kick in, and caused a number of servers to replicate themselves.  Under normal circumstances, only a few servers will replicate themselves, but this volume of replication was too much load for the off-server disk systems (EBS, or Elastic Block Store), and they became backed up, making the problem a wider one.

Should Scholaric Use a Different (non cloud) Hosting Service:

In my opinion, no.  There are a few models to choose from when shopping for hosting: a Shared Host, where you deploy to a single machine, along with other web programs; a Virtual Private Server, where you share a server with other programs, but some software makes it appear that you have your own (and you manage it yourself); a Dedicated Server, where you truly have your own server (and again manage it yourself); and finally a Cloud Service, where your software is deployed to an entire set of servers.  Note that experts often talk about moving services "to the cloud", which does not necessarily mean to a cloud-based hosting service.

Of these, clouds are the most complex, but they protect against (1) sudden spikes in traffic, (2) continued use of a service beyond its capacity, and (3) server failure.  The server load issues of (1) and (2) should not be overlooked: if a service cannot meet demand, it can be very difficult to add capacity, and all development may halt for extended periods of time in order to rearchitect the system to make it scale better.  In a cloud service, the service can scale up to meet your demand.  For Heroku, this is as simple as logging in to their control panel and cranking a dial up.

For server failure (3), the worry about losing a web server or a database server is non-existent in a cloud service.  The data and code are running in more than one place and, should one go down, the other is available, and more instances can be deployed easily.  Of course, things can still go wrong and cause a service to be unavailable, as they did yesterday.  Our service was still running, but nobody could get to it, due to other issues.  This kind of problem can happen regardless of which of the above hosting models you use.  The difference is (1) how widespread the outage is and (2) that with a cloud infrastructure (or shared hosting) the hosting provider is made aware of the issue more quickly.  Yesterday, I didn't have to tell Heroku (or Amazon) I was having a problem.  They were working on it before I had a problem.

Heroku has been a large part of my ability to build Scholaric, and I am still extremely happy I chose Heroku.  In the year since our Beta launch, I know of only one other outage, which lasted a few minutes, and for which I received no customer complaints.

More To The Point - What Are Our Risks:

When you rely on an online service like Scholaric, you want to know the following:

(A) Are backups done?

Yes, Heroku does backups, and I back up the database independently of Heroku.

(B) What happens if a server crashes?

Explained above.

(C) Can I get to my data if I stop using the service?

We have not had this happen yet, but our plan is to make your data available to you as a CSV file, which you can bring into a spreadsheet, should you quit Scholaric.
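To give a sense of what such an export might look like, here is a minimal sketch in Python.  It is not Scholaric's actual code: the function name export_lessons_csv and the column names are made up for illustration, and the real export may well contain different fields.

```python
import csv
import io

# Hypothetical column names; the real export may use different fields.
FIELDNAMES = ["student", "course", "date", "lesson", "completed"]

def export_lessons_csv(lessons):
    """Return CSV text for a list of lesson dicts keyed by FIELDNAMES."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    for lesson in lessons:
        # Values containing commas are quoted automatically by the csv module.
        writer.writerow(lesson)
    return buffer.getvalue()

# Made-up sample data, for illustration only.
sample = [
    {
        "student": "Anna",
        "course": "Math",
        "date": "2011-04-22",
        "lesson": "Fractions, part 1",
        "completed": "yes",
    }
]

print(export_lessons_csv(sample))
```

A file in this shape opens directly in most spreadsheet programs, with one row per lesson and one column per field.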

(D) How is it tested?

We test Scholaric extensively and automatically.  When I work on Scholaric, I turn on my automated tests.  They run while I am coding, every time I save a file.  As of this writing, I have 686 tests, which make 1673 checks.  I have written 2.2 lines of test code for every line of program code.  This does not mean that problems can't get into production, but that we make every effort to prevent them.

Finally - What Are We Going To Do About It:

I'll continue to track the response of Amazon and Heroku and let you know of any changes they make in response to this problem.

Again, I don't mean to sound like I'm blaming them; I want to do what I can for you.  That is why we are giving a free month of Scholaric to all customers in response to this outage.  For those in their free 15-day trial, we will move back your first payment date by a month.  For those in the beta program, we are setting your first payment date to July 1st.  If you have yet to see our pricing, please go to http://scholaric.com/marketing/pricing

I apologize again for this outage.

jeff