Look closely at these two cruise ships. Except for the name and the funnel, they are almost identical. In fact, they are sister ships, both owned by Carnival Corporation. They have something else in common: both suffered major incidents that better engineering could have prevented, and both incidents damaged customers, the company and the cruise industry as a whole.
The Costa Concordia sank off the coast of Italy on January 13th, 2012, with tragic loss of life, after colliding with a rock. The rock was well known, had been on nautical charts for generations, and lay on a route that this particular ship had traversed over 100 times. Somehow they still hit it. More significantly, the ship could not survive the impact. Why?
The Carnival Splendor made international news last year as the SPAM cruise. After a crankshaft broke, a fire started in one of the engine rooms and caused a failure: a single failure that also disabled all the other engines* and power generators on the ship. It was left immobile and powerless in the Pacific Ocean, where food (Spam, as it requires no refrigeration) was air-dropped to the 3,000+ passengers while the ship was towed for days back to San Diego. Why?
Stuff happens. It happens on cruise ships and it happens in data centers. More than 80% of all data center service outages are caused by human error; in the other 20%, something breaks or fails to perform as expected. Good architecture, whether naval architecture or IT architecture, anticipates issues, prevents them where possible and, perhaps most importantly, has a recovery process that does not sink you or leave you powerless and adrift.
Despite having multiple engines, the Carnival Splendor had one or more single points of failure in its mechanical design that completely disabled the ship. The Costa Concordia had at least two fatal design flaws: one in its navigation (control) systems and one in its hull (resiliency) design. Should the architects of these two ships have:
anticipated the loss of an engine? Yes.
anticipated the possibility of human error in navigation? Yes.
anticipated that the ship could hit a rock and require flood control chambers? Yes.
Now look at your data center.
Does your IT architecture anticipate and compensate for the loss of critical components?
Does your IT architecture account for the possibility of human error in operations, and work to prevent it, reduce it and, in the worst case, compensate for it?
Is your IT architecture resilient? Can it bounce back from a hit quickly and keep your business running?
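One way to start answering the first of those questions is to write the architecture down and test it for single points of failure. The sketch below is a minimal illustration, not a real audit tool. It assumes a hypothetical inventory in which every role (for example "database") lists the components that can each fulfil it, and every component lists what it depends on. The names `roles`, `depends_on`, `switch-a` and the rest are invented for the example.

```python
from typing import Dict, List, Set


def failed_closure(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every component that is down once `failed` fails.

    Assumption: a component goes down if ANY of its dependencies is down.
    """
    down = {failed}
    changed = True
    while changed:  # keep propagating until nothing new fails
        changed = False
        for component, deps in depends_on.items():
            if component not in down and any(d in down for d in deps):
                down.add(component)
                changed = True
    return down


def single_points_of_failure(
    roles: Dict[str, List[str]], depends_on: Dict[str, List[str]]
) -> List[str]:
    """List components whose failure alone leaves some role with no survivor."""
    components: Set[str] = set(depends_on)
    for members in roles.values():
        components.update(members)
    for deps in depends_on.values():
        components.update(deps)

    spofs = []
    for component in sorted(components):
        down = failed_closure(component, depends_on)
        # A role is lost when every component that could fulfil it is down.
        if any(all(m in down for m in members) for members in roles.values()):
            spofs.append(component)
    return spofs


# Hypothetical inventory: two web servers and two database servers.
roles = {
    "web": ["web-1", "web-2"],
    "database": ["db-primary", "db-replica"],
}
# Both database servers hang off the same network switch.
depends_on = {
    "web-1": ["switch-a"],
    "web-2": ["switch-b"],
    "db-primary": ["switch-a"],
    "db-replica": ["switch-a"],
}

print(single_points_of_failure(roles, depends_on))  # ['switch-a']
```

In this made-up inventory the database looks redundant on paper, yet one switch takes out both copies, much as one engine failure took out the other five. Moving `db-replica` to `switch-b` makes the list come back empty.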
Your data center has one more thing in common with cruise ships: IT architecture design decisions are best made at the point of initial construction. It’s going to be very expensive to add those watertight doors to Deck 6 later, just as it is to add resilience features to your architecture once it is already in production.
What’s the real cost of bad IT architecture? Part of it is the business interruption and disruption of having a system down, along with the associated “cruise refunds” for your customers, but the real cost is in trust.
When people don’t trust the service to be there when they need it and become concerned that the service may impact their safety or the safety of their business, they stay away. Today, how many people will be booking their Mediterranean cruise? How long will it take for that industry to recover?
The true cost of bad IT architecture is immense. Don’t cut corners. Hire great architects to design something that won’t sink your business.
(* The Carnival Splendor's diesel-electric propulsion system consisted of two engine rooms with three banks of Wärtsilä 1.3 MW diesel engines each. One engine failure took out the other five engines. Engineering Report)