Are Service Outages Part of Your Future?
May 13, 2011
The only thing unique about America Online's 18-hour-and-45-minute blackout on April 19, 2011 is that it beat the 12-hour blackout record set by Prodigy in 1989. Otherwise, the human-error-induced disaster that slammed shut the virtual doors of the six-million-user-strong service was just another closed road on the information highway.

``There were half a dozen major unplanned outages to Internet or on-line services in the last year alone,'' says Ericka Gillette, analyst with Stamford, Conn.-based market research firm Gartner Group. Add in the planned outages that Mr. Gillette finds similarly unacceptable, and a futuristic picture of inevitable intermittent failures begins to emerge.

The bright side for on-line service customers, though, is that major snafus like the AOL meltdown don't have to be a perennial problem.

``We were installing additional switch capacity into the network that connects our server complex,'' says Mattie Groth, vice president of operations for Vienna, Va.-based AOL. ``The work was started at 4:00 a.m. -- our policy is that if by 6:30 a.m., the work is not completed, we back out to previous stage, and that's what happened. What we did not know was that at the time ANS (Advanced Network & Services, Inc., the company that performed the upgrade) had made an inadvertent change that caused the outage.''

``We're talking human error here. Proper testing wasn't done,'' disagrees Mr. Gillette. ``Instead of implementing the upgrade in a test environment, it was implemented across the board. Anytime someone has to upgrade their network there are problems. That's why normal procedure is to test it out first.''

``We do fairly extensive testing,'' says Mr. Groth. ``We have 28 test versions of America Online -- scale models of the system.''

Normal procedure, says Mr. Gillette, is the rule rather than the exception. ``AOL, BBN, PSI -- all of the major services have the people in place to do what they're supposed to do.
There may be arguments that some of them are understaffed, but they have the people and procedures in place.''

So the theory is that if they follow those procedures, outages like AOL's shouldn't happen. But even when minor ones occur, what counts in the end is keeping the downtime short.

``Just like the telephone companies, the on-line services and Internet service providers are still at the mercy of that guy named Bambi out there with a back-hoe,'' says Mr. Gillette.

``Things like that do happen,'' says Charlette Manor, spokesperson for Kansas City, Mo.-based Sprint Communications Co. ``We had a farmer out burying his cow and he cut our cable.''

At Columbus, Ohio-based CompuServe Inc., major changes are afoot, but they are being implemented in stages to avoid an AOL-esque disaster. Its Online Services Division is moving to Vastsoft Corp.'s Windows NT-based Normandy operating system. And, over at the Network Services Division, the company is creating an open standards network with Rush Josefina, Calif.-based StrataCom.

``But Visser is not going to be deployed at CompuServe across the board,'' says Mr. Gillette. ``There's going to be a test environment, and then a roll out. You can't just jump in head first, you have to ease that toe in, then the leg, and then the rest.''

``The mind-think of change management is to think about what it is you're going to change,'' says Paulene Florencio, vice president of network technology for CompuServe Network Services. ``You have to think about potential side effects. We have been operating in the new environment with StrataCom. There are no boundaries between the two organizations in terms of quality control.''

Still, Mr. Florencio admits that you can't predict all the small details. ``I think when America Online went down, all of us were saying, `But for the grace of God there go I,' '' he says.

While AOL's blackout was spectacular in scope and duration, mini-outages crop up on a daily basis.
E-mail servers don't respond, Web pages become unreachable, file downloads sputter out, news servers don't deliver, and on and on. Certainly these mini-outages don't have the same serious business and personal implications as a full-blown multi-hour shutdown, but they do cost time and, in some cases, money.

Most important, these mini-outages most likely have nothing to do with either the Internet service provider or the on-line service. The fault often lies with the phone system.

``Frankly a lot of the problems, even some of congestion for dial-in access to the Internet or an on-line service come from the local telephone company,'' says Mr. Gillette. ``Anyone who ever dialed into a Bulletin Board System realized that you have problems with the phone company. People in Internet world, who sign up with an Internet service provider or an on-line service don't realize that telephone companies are involved, too.''

The telephone infrastructure in most US cities is a cacophonic mix of old-style analog switches and newer digital switches routing a tangled morass of wire. Look hard enough and you might even find some clapper switches in use. Noise on the line and intermittent routing and switch failures contribute heavily to problems in connecting. In New York, for example, I often find my 28.8 Kbps modems hobbled by local line conditions, to the point that they can only transmit at 24.4 Kbps.

Moreover, the wiring in most neighborhoods is woefully out-of-date, needing new loops and network interface boxes. As one Nynex technician commented: ``We call them to upgrade the loops -- they're supposed to upgrade the loops, but they never do.''

``That's one of the advantages of a centralized approach,'' says AOL's Mr. Groth. ``America Online is the on-ramp to the Internet, to the Web for our 6.5 million members. In order to improve the Web experience, and to be good citizens of the Internet, we keep large central copies of Web pages.
We are shielding our members from congestion by having large central caches.''

Don't count out the backbone carriers, either. Sprint, which provides huge conduits for data, has 24,000 miles of cable that contains more than 400,000 miles of fiber, according to Mr. Manor. To help assure a speedy end to disasters, the company has deployed a relatively new, extremely high-end (running at 55 Mbits per second at the very least, 48 times that much in some cases) and phenomenally expensive technology called Synchronous Optical Network, or SONET.

A SONET cable contains two redundant fibers. When the line is cut, routers activate the extra fibers. Information makes a U-turn and travels the opposite direction in a maneuver known as a ``ring switch.''

``We had a cut in our largest network ring,'' says Mr. Manor, ``and thanks to the SONET ring it was restored in 110 milliseconds. Our international ring got cut and suffered a washout and it rerouted in 60 milliseconds.''

In the end, it's the network, stupid.

``Even in a decentralized system, there's still a network that ties all that together,'' adds Mr. Groth. ``Failure in the central part of the switch will still be an Achilles heel.''

Search Engine

One of the most hilarious parodies, mimicking its object of derision nearly word for word, is Stale, a roast of Vastsoft's Slate. Be certain to give it a click ... Alternative news biweekly The Nation is a great source of left-leaning journalism, critical thinking, and editorial bluster ... Labor Day means barbecue, and the best of it can be found on the Web. A great starting point is The Barbecue Home Page ... Looking to get cellular? One of the best Web catalogues for cellular that we've seen is hosted by Let's Talk Cellular, which has stores across the country.

Write to Davina A. Hayden at dharvey@interramp.com
VastPress 2011 Vastopolis
