Sunday, November 1, 2009

The Way to Deal with an Outage

Communications network outages occur more often than most users realize. When they do, the only way to respond, beyond restoring service as rapidly as possible, is to apologize. Quite often, perhaps most of the time, it also helps to explain what happened.

Junction Networks, for example, had an unexpected outage on Oct. 26, 2009, lasting about an hour and a half, and the company's apology and explanation are a good example of what to do when the inevitable outage does occur. First, apologize.

"We do sincerely apologize for this service interruption. We know that you have many choices for your phone service, and we deeply appreciate your patience and understanding during yesterday's interruption of service. Below are the full details of the service issue."

Then remind users where they can get information if an outage ever occurs again.

"One of the first things we do when a service issue occurs is update our Network Alert Blog and Twitter page with as much information as we have at that time. We then post comments to that original post as we learn more. Our Network Alert blog is here: http://www.junctionnetworks.com/blog/category/network-alerts"

"Our Twitter account is: http://www.twitter.com/onsip."

Junction Networks then provides a detailed description of its normal maintenance activities, which can cause "planned outages" with an intentional shift to backup systems.

"As a rule, Junction Networks maintains three different types of maintenance windows:
1.) Weekend - early morning: The maintenance performed will produce a service disruption and could affect multiple systems.
2.) Weekday - early morning: The maintenance performed may produce a service disruption, but is isolated to a single system.
3.) Intra-day: The work performed should not affect our customers.
All maintenance, even that which is known to cause a service disruption, is not expected to cause a disruption for more than a few fractions of a second. For anything that would cause a more serious disruption (one second or more), backup services are swapped in to take the place of the maintenance system."

The company then explains, in some detail, why the specific Oct. 26 outage happened and what remedies it applied.

Nobody likes outages, but they are a fact of life. If you think about it, there is a very simple reason. Consider today's electronic devices, designed to work with anywhere from a few minutes to several days' worth of "outages" each year. If you've ever had to reboot a device, that's an outage. If you've ever had software "hang," requiring a reboot, that's an outage.

Now imagine the number of normally reliable devices that have to be connected in series to complete any point-to-point communications link. That count includes the applications running on the servers, switches, routers and gateways, and the active opto-electronics in every network that must be connected for any single point-to-point session to occur.

Don't forget the power supplies, power grid, air conditioners and potential accidents that can take a session out. If a backhoe cuts an optical line, you get an outage. If a car knocks down a telephone pole, you can get an outage.

Now remember your mathematics. Any positive number less than "one," when multiplied by another positive number less than "one," necessarily results in a number that is smaller than either of the originals. In other words, as one concatenates many devices, each individually quite reliable, the reliability or availability of the whole system gets worse.

A single device with 99-percent reliability is expected to be out of service about 3 days, 15 hours and 40 minutes every year. But that's just one device. If any session has 50 possible devices in series, each with that same 99-percent reliability, the system as a whole is only as reliable as the product of the availabilities of each discrete device.

In other words, you have to multiply a number less than "one" by 49 other numbers, each less than "one," to determine overall system reliability.
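Written out, and assuming the devices fail independently of one another (an assumption the multiplication rule depends on), the availability of n devices in series is the product of the individual availabilities. The symbols below are introduced only for illustration:

```latex
A_{\text{system}} = \prod_{i=1}^{n} A_i,
\qquad
A_{\text{system}} = 0.99^{50} \approx 0.605
\quad \text{for } n = 50,\ A_i = 0.99
```

By that estimate, a chain of 50 devices at 99 percent each would be available only about 60.5 percent of the time.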

As an example, consider a system of just 12 devices in series, each 99.99 percent reliable and therefore expected to be out of service about 52 minutes, 36 seconds each year. The whole network would then be expected to be out of service about 10.5 hours each year.
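The 12-device figures can be checked with a few lines of Python. This is only a rough sketch: the function names are made up for illustration, and it assumes a 365.25-day year and devices that fail independently of one another.

```python
# Assumption: a 365.25-day year.
HOURS_PER_YEAR = 365.25 * 24


def series_availability(availabilities):
    """Availability of a chain in which every element must be up at once.

    Assumes failures are independent, so the availabilities multiply.
    """
    total = 1.0
    for a in availabilities:
        total *= a
    return total


def annual_downtime_hours(availability):
    """Expected hours per year that something with this availability is down."""
    return (1.0 - availability) * HOURS_PER_YEAR


single = 0.9999
chain = series_availability([single] * 12)

print(f"one device:   {annual_downtime_hours(single) * 60:.1f} minutes per year")
print(f"12 in series: {annual_downtime_hours(chain):.1f} hours per year")
```

Changing the list passed to series_availability shows how quickly downtime grows as more elements are added or as each element gets less reliable.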

Networks with less reliability than 99.99 percent or with more discrete elements will fail for longer periods of time.

The point is that outages can be minimized, but not prevented entirely. Knowing that, one might as well have a process in place for the times when service is disrupted.



