
Wednesday, July 26, 2006

An Expatriate View of the Israeli-Lebanon Conflict
This conflict is truly horrific from my point of view. Civilians on both sides are being killed at a rate that makes you wonder where the morality of both parties really lies. As an American living in London, people ask me how I feel about this stuff from time to time.
 
I do not understand why the United States even attempts to launch diplomatic missions like the one Condi Rice just went on. In the meantime, Beirut is being flattened and innocent civilians are getting killed. Israeli civilians are being killed by random rocket attacks. And the story of the UN observers that were killed is just heart-breaking.
 
I guess the thing that bothers me is that there was (and still is, but barely) an opportunity here for a 'diplomatic layup' that could have given the U.S. a little boost in the region. We'd have to convince Israel that a cease-fire, any cease-fire, was a necessary step towards resolving the issue. Given the close ties between our countries, we should be able to pull that off. Even if we didn't have an approach for resolving the conflict, wouldn't even a mediocre cease-fire with some dialog be better than an escalating conflict whose victims are overwhelmingly civilian? Why doesn't the U.S. support an immediate cease-fire? It's incomprehensible to me, particularly when, aside from the U.S., Israel and the UK, it seems like the rest of the world (and Hezbollah) has indicated a willingness to back a cease-fire framework.
 
So what's the expatriate piece of this? Well, I live in a very confused world. People who haven't been in contact with Americans before tend to be very curious. Even people who are opposed to U.S. foreign policy often admire the U.S. as a place to live and as a culture.
 
Living in London is living in a melting pot. It's a major world capital and it's teeming with people from a vast number of different cultures. People from all over the world come to live and work in London. I'm also in a much smaller country. The news that a New Yorker hears about Boston is like me hearing about Paris. I am exposed to a much broader swath of world opinion as a result. World opinion is not good, although to be fair most countries have their problems and do things that others are critical of. They just don't always involve so many people dying.
 
Anyway, here, exposed to a broader swath of world opinion, I find that the actions of the United States often don't make any sense at all to me. Is it because I'm generally a liberal democrat (the U.S. kind)? Maybe. But I think it's more that I get a clearer understanding here about what's really going on, and I am very troubled by what I see. My country is way out of step with the world. Frankly, I wonder if other countries will just stop taking us seriously until after the next president takes office.
5:08 pm est

Tuesday, July 25, 2006

Myspace Has An Outage
Popular social network myspace had an outage on Sunday. The Street reports that even prior to this outage there were unreliable bits of the site. The InformationWeek article is vague about how long the outage was and says it was between 90 minutes and 12 hours. They also mention that myspace has two data centres - one in California and one in Arizona. Both reports speculate that power shortages in California attributed to the heat are probably the cause of the outage.
 
So here's the deal. I've spent a lot of time in data centres with web applications. This is a story that I'm actually very qualified to comment on. I will tell you that I know almost nothing about myspace - I haven't studied them or anything. I'm aware generally of who they are, what they do and how big they are.
 
There are at least three obvious things the story could have said or asked but did not:
 
1) A properly designed and operated data centre does not lose power for even 90 minutes without a catastrophic incident as the cause.
 
A properly designed data centre has multiple redundant systems to prevent a power failure. If the stars are aligned and you are very, very lucky, there are even places in the US where you could get on multiple energy grids. But even if that's not the case, there are power distribution units (PDUs) inside the facility that allow for redundant power down to the circuit level.
 
Supporting that is a room, somewhere, usually in the bowels of the building, where there is an extremely large physical volume of batteries. That's right, for that brief duration between a grid failure and the diesel generators kicking in, the entire facility runs on one very large rechargeable battery.
 
A properly designed data centre has, of course, two banks of batteries and two generators, although some people skip those steps because they are ridiculously expensive, millions of pounds.
 
A properly operated data centre runs a live drill of the cutover to battery and diesel on a regular basis (at least monthly, but often weekly) and has in-place arrangements with fuel providers to begin regular deliveries in situations where the generators are needed.
 
Let me just summarize it this way - providing continuous power is a core competency of a data centre. It's unacceptable to lose power. Unfortunately, I have no idea what actually happened, so it wouldn't be fair to draw conclusions about all this. But it is hugely troubling.
 
2) What's the value of having two data centre locations if they clearly don't provide any redundancy?
 
There are a number of reasons that you use two data centres. The best reason is so that you can run your application in a distributed way from both at the same time. That way if you lose a centre, you can still run off the other one. myspace is clearly not set up this way. Are they planning on it, or is the other facility for disaster recovery? (Disaster recovery, DR, is when you can replicate your entire application in another location quickly. It takes a lot of work to do this right.)
 
3) There is discussion about how after the site came back up, certain features were unavailable until later.
 
This illustrates an interesting problem in a data centre. It can actually be quite difficult to turn everything back on, both physically and logically.
 
The logical challenges are mostly based around the order in which various systems come back online. You want your network up first, then the primary servers (domain servers, email, file servers, etc.), followed by the actual web sites or applications that are running. But you can't actually control it because server-class equipment doesn't toggle on and off like a PC. It is designed to boot into a proper run state in this kind of situation.
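The ideal order described above can be pictured as a dependency graph. Here is a minimal sketch in Python, assuming a made-up set of services and dependencies (the names are purely illustrative and have nothing to do with how myspace is actually built). It uses the standard library's graphlib module (Python 3.9 or later) to work out an order in which nothing starts before the things it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical services; each maps to the set of services it needs running first.
DEPENDS_ON = {
    "network": set(),
    "domain_servers": {"network"},
    "file_servers": {"network", "domain_servers"},
    "database": {"network", "file_servers"},
    "web_app": {"database", "domain_servers"},
}

def start_order(depends_on):
    """Return a start order in which every service follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())

order = start_order(DEPENDS_ON)
print(order)
```

This is the order you would want on paper. The rest of the problem is that, after a power loss, nothing enforces it.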
 
When the power comes on, everything tries to boot at once. In a data centre, that could be 15,000 systems. In some cases, things will come up in the wrong order and not handle it correctly. For example, an application starts to run before the database that powers it is fully started. The application tries to connect to the database and fails. Maybe it's a bit creaky, and it doesn't handle a connection failure well and just stops running. Someone has to go look at that and fix it. Multiply that problem by a few hundred and you start to understand what happens in a data centre when the power goes off and on.
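To make the database example concrete, here is a rough sketch of what a less creaky application might do: keep retrying the connection with a growing delay instead of giving up on the first failure. Everything here is an assumption for illustration. connect_with_retry and fake_connect are hypothetical names, and the stand-in database simply refuses the first two attempts.

```python
import time

def connect_with_retry(connect, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call connect() until it succeeds, waiting longer after each failure."""
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to an operator
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ... between tries

# Stand-in database that only accepts the third connection attempt.
calls = {"count": 0}

def fake_connect():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("database not ready yet")
    return "connected"

# The sleep is replaced with a no-op so the example runs instantly.
result = connect_with_retry(fake_connect, sleep=lambda seconds: None)
print(result, "after", calls["count"], "attempts")
```

An application written this way rides out a slow database start on its own. One that isn't needs a person to notice and restart it.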
 
The physical situation and statistics come into play now. A data centre person will tell you that any time you turn a server off and on, there is a chance it won't come up correctly. They know this to be true because with 10,000 servers in a controlled environment, if the power were to go off and on, there might be a few hundred that fail to boot, and a good number of them will have bad hard drives or corrupted images, etc. They also see the regular failure of tens of servers on a weekly basis and understand mean time between failures (MTBF) in a way that most of us do not.
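The statistics are simple to sketch. The 3% per-server failure rate below is an assumption chosen only because it lands on "a few hundred" out of 10,000. It is not a measured figure, and real failures are not perfectly independent.

```python
def expected_boot_failures(servers, failure_rate):
    """Average number of servers that fail to come back after a power cycle."""
    return servers * failure_rate

def chance_of_any_failure(servers, failure_rate):
    """Probability that at least one server fails, assuming independent failures."""
    return 1 - (1 - failure_rate) ** servers

ASSUMED_FAILURE_RATE = 0.03  # illustrative assumption, not a measured value

expected = expected_boot_failures(10_000, ASSUMED_FAILURE_RATE)
small_site = chance_of_any_failure(100, ASSUMED_FAILURE_RATE)
print(f"expected failures across 10,000 servers: {expected:.0f}")
print(f"chance of at least one failure among 100 servers: {small_site:.0%}")
```

Under that assumption, even a site with only 100 servers should expect at least one casualty from a power cycle.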
 
But that's actually a long digression. The comments about bringing services online later support the possibility that doing this is hard for myspace. That would raise some very troubling issues about how they manage their infrastructure. A web-based company like myspace should have, as a core competency, the ability to reliably and repeatably configure all of the software required to run the business. This is not an uncommon situation to be in, especially for an exploding business like myspace. It's really a credit to them to have gotten this far before being hit by something like this.
 
By the way, how bad is 90 minutes of downtime? Well, a data centre service level agreement (SLA) is usually five nines, that is, 99.999% uptime. So how long is that? In a 30 day month, it's about 26 seconds a month of downtime. Put a different way, if the problem was a data centre problem, they just blew their SLA for the entire year and more.
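The arithmetic is easy to check. A small sketch, assuming a 30 day month as above (the function name is just for illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60

def allowed_downtime_seconds(uptime_percent, days=30):
    """Downtime budget, in seconds, for an uptime target over a period of days."""
    return (1 - uptime_percent / 100) * days * SECONDS_PER_DAY

budget = allowed_downtime_seconds(99.999)   # five nines over a 30 day month
months_used = (90 * 60) / budget            # monthly budgets eaten by a 90 minute outage
print(f"{budget:.2f} seconds of downtime allowed per month")
print(f"a 90 minute outage uses about {months_used:.0f} months of that budget")
```

The budget works out to just under 26 seconds a month, so a 90 minute outage burns through something over 200 months of it, which is a good deal more than a year.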
 
Three nines (99.9%) is a common SLA in the web arena for an application. (Web sites typically have a lower uptime requirement than the data centre environment that hosts them, unless they are a major player.) A 99.9% SLA is about 43 minutes a month of downtime. Each nine cuts the amount of downtime by an order of magnitude. Each nine also costs increasingly more to support. The cost doesn't increase by an order of magnitude with each nine, but it does rise dramatically.
 
A 90 minute outage equates to something like 99.8% uptime for the month. That's not good. If it were in fact 12 hours, the figure would be more like 98.3%.
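Those percentages come from dividing the outage by the length of the month. A quick sketch, again assuming a 30 day month and a single outage (the function name is illustrative):

```python
def monthly_uptime_percent(outage_minutes, days=30):
    """Uptime percentage for a month containing one outage of the given length."""
    total_minutes = days * 24 * 60
    return 100 * (1 - outage_minutes / total_minutes)

short_outage = monthly_uptime_percent(90)       # the 90 minute estimate
long_outage = monthly_uptime_percent(12 * 60)   # the 12 hour estimate
print(f"{short_outage:.1f}% and {long_outage:.1f}%")
```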
 
Many organisations have an 'outage' moment that changes the business. I suspect myspace is having one right now. That means that it's an absolute madhouse there as they scramble to examine every aspect of their configuration and data centre operations.
 
A company like myspace makes money by being up. A top executive could probably tell you how much money they make per hour. As is often the case, the damage caused by the outage is not the lost revenue - that's usually immaterial. It's the loss of confidence and prestige in the marketplace. If you are going to be a big player, you've got to be like a dial tone - you are always there. This incident illustrates that myspace is not there yet. They can't change that, but they will want the fastest path to getting there possible in light of what happened.
 
If they are lucky, they'll make some fast changes to resolve these issues. If they are unlucky, they'll suffer another outage. But you have to expect that at this point. The reality is that this stuff is pretty complicated. It takes time and planning. A company like myspace cannot control or predict the usage that will occur on any given day. They've risen incredibly quickly to a massive level of traffic. The fact that there haven't been more problems is a credit to their organisation.
3:26 pm est

Sunday, July 23, 2006

A Rare Break
Poor Alex was so tired he fell asleep at half-four and there was nothing I could do about it, although I tried. Kara is still very sick, although she's up and around now for only the third time today. She thought she was feeling good this morning, so she got up with the kids and let me sleep in a bit. That was awesome and sorely needed, as it turns out, because she crashed soon after that and I've been handling the kids since.
 
It is going OK today. I talked to my father and sister. My sister lives in Illinois, by the way, and was without power for over thirty hours. They've been having a lot of rough weather over there. And it's really, really hot. Much worse than London, although London's summer is pretty impressive this year.
 
By the way, I was over on Marylebone High Street earlier today and there was a large power outage. It was really unusual. A lot of stores just had to close because they had no way to transact business. The larger outfits were up, though. I was in Little Stinkies looking for some costumes for a party tomorrow with Alex and Katherine and ended up changing a twenty for two tens to help the store make change for a customer. And then there was a bunch of other stuff I wanted to do, but I realised that the first thing I needed was a working cash point and that seemed a risky thing to find during a power outage. I ended up just leaving and going elsewhere.
12:00 pm est


Copyright © 2001 - 2008 | David Owczarek | All Rights Reserved