First off, we’re sorry about this weekend’s SmartZone outage. We know
that email is more and more important to people these days (I have been
known to check my email on four different devices simultaneously) and we
understand the frustration you felt. We’re truly sorry.
Now, the details: SmartZone suffered an outage on Saturday, April 4th,
starting at 7:30am EDT and lasting until 5pm EDT. This outage was caused by a
power issue in one of our datacenters that impacted some of our email
servers. The outage was compounded by the fact that several databases
needed to be restored from backup (that process was successful, and no
data was lost, which is a good thing).
Over the course of the outage, mail that was sent to SmartZone customers
was stored in a number of queues across our email system. Once the
entire system stabilized, we were able to start delivering that queued
mail. Our engineers worked around the clock (literally) to identify
email that was delayed and to get it to the proper recipients. This
process was completed earlier this morning, and all email messages
queued during the outage have now been delivered. One important thing to
note, however, is that these messages will show up in your inbox
stamped with the date they were delivered to your inbox, not the date they
were sent. For example, someone might have sent you an email on Saturday
at noon; because it was queued over the weekend, it shows up in your inbox
as a new email today. If you look at your email headers, though, you’ll
see the actual sent and received dates.
During the outage there was a window of time (from 11am to 2:50pm EDT on
Saturday) when our email directory was in a state of flux because
various databases were being rebuilt. In a small number of cases (we think a
fraction of emails sent), someone may have sent you an email during
these hours and our system was unable to recognize you as a user
(because servers were coming back online). In each of these instances, Comcast sent
the sender of the email a ‘user not found’ message to alert them that
their email message should be resent.
I know what you’re thinking: what are you doing to make sure this doesn’t
happen again? Our engineers are examining every piece of our email
system to make sure that we are doing everything in our power to prevent
an issue like this from happening again. We are reviewing all of
our systems (hardware, software and processes) to mitigate the
possibility of this occurring again. Email is a critical
means of communication for us, as well as our customers, so we know that
it just needs to work.
Update 4/15/09 3:24pm: Lots of folks in the comments are wondering whether our datacenters have redundant power, so I thought it would make sense to add this update. Our datacenters do have fully redundant systems, including power. That said, while we didn't lose power in our datacenter, an element of our email platform failed (some more details can be found here), resulting in the need to restore data from backup. I hope that answers some people's questions.