Paradigm Shift in Network Monitoring
Dr. Parag Pruthi
Monday, November 17, 2008
Network management of traditional telecom networks has evolved over several decades.
While many advancements have been made in traditional methods to measure, alarm on, report, plan and dimension the telecom networks of the past, it is becoming increasingly obvious to those responsible for managing, operating and planning today’s integrated voice and data IP networks that these traditional methods do not scale to modern networks and services. Network management challenges today go well beyond measuring simple network availability and relying on a “sneaker-net” approach to diagnostics and troubleshooting.

Issues such as Quality of Service (QoS), Service Level Agreements (SLAs), ubiquitous access to services, network growth and deployment and, most importantly, network security are critical areas that require advanced capabilities for alarming, troubleshooting, diagnosis and management. New applications are continually being deployed, such as Voice over IP (VoIP), Internet banking, trading and mission-critical eBusiness services with 7x24x365 access from any device, fixed or remote. The demand from today’s businesses is driving the need for proactive and remote capabilities for operating, managing and engineering modern networks and services. Networking professionals face more pressure than at any time in their careers: they have fewer technical staff, “n” times more users and, most importantly, an operational environment with zero tolerance for downtime.

Today, the network is mission critical and is the cornerstone of any business. With so many concurrent business requirements, triage situations are the norm in the operation of any business, and resolution times on the order of hours are no longer tolerated. Businesses stand to lose millions of dollars for every hour of downtime.
In this article, we will discuss a revolutionary approach to the monitoring and management of modern data communications networks.

A Paradigm Shift
One of the critical functions in operating and managing any network is the ability to measure the past and current state of the network and to make predictions about the future. Traditional measurements have been based mainly on three methods: (1) sampling traffic statistics, typically every 5 to 15 minutes, (2) analyzing flow data, and (3) analyzing event (or log) data from servers (such as web servers, firewalls, routers, etc.). These sources typically lack the information needed to operate and manage current-generation networks and services. High-resolution, packet-by-packet measurements from a variety of networks have shown that these traditional sampling methods lack the information necessary to accurately model, engineer, predict and deal with triage situations in large-scale IP networks. One example below helps illustrate the need for higher-resolution information from the network:

A multicast storm causes a loss of service through a domino effect on the trading floor of a large investment-banking firm. How does the bank quickly resolve the problem, remembering that downtime can cost millions of dollars per hour? Analysis of such events has shown that the root cause is most likely a combination of events coinciding in time. A “micro-burst” (lasting only seconds or fractions of a second) during a period of heavy congestion, together with over-subscription of service by one or more traders overlapping in time for only a few seconds, can lead to a domino or ripple effect resulting in chaotic oscillations lasting hours. These are three independent events that need to be measured on the order of seconds or sub-seconds, and one or more of these driving factors needs to be throttled back in order for the chaotic oscillations to subside. Networking professionals typically cannot identify such micro-phenomena using current sampling, flow and/or log analysis techniques.
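
As a rough illustration of what such sub-second correlation involves, the sketch below (written in Python, with hypothetical thresholds, window size and packet-record format; it is not drawn from any particular product) flags 100 ms windows in which an aggregate micro-burst coincides with a single source exceeding an assumed subscription limit:

# Minimal sketch (hypothetical thresholds and record format): flag 100 ms windows
# in which an aggregate micro-burst coincides with a single source exceeding its
# subscribed rate, i.e. the kind of overlapping sub-second events described above.
from collections import defaultdict

WINDOW_S = 0.1                      # 100 ms resolution
BURST_THRESHOLD_BPS = 800_000_000   # assumed congestion threshold on a 1 Gbps link
PER_SOURCE_LIMIT_BPS = 200_000_000  # assumed per-trader subscription limit

def coinciding_events(packets):
    """packets: iterable of (timestamp_seconds, size_bytes, source_ip)."""
    totals = defaultdict(int)                            # window -> total bits
    per_source = defaultdict(lambda: defaultdict(int))   # window -> source -> bits
    for ts, size, src in packets:
        window = int(ts / WINDOW_S)
        bits = size * 8
        totals[window] += bits
        per_source[window][src] += bits

    alerts = []
    for window, total_bits in totals.items():
        if total_bits / WINDOW_S < BURST_THRESHOLD_BPS:
            continue                                     # no micro-burst in this window
        offenders = [src for src, b in per_source[window].items()
                     if b / WINDOW_S > PER_SOURCE_LIMIT_BPS]
        if offenders:                                    # burst and over-subscription overlap
            alerts.append((window * WINDOW_S, offenders))
    return alerts

The point is not the particular thresholds but the resolution: this kind of check is only possible if per-packet timestamps are available, not 5-minute counters.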

From the above example, it should be clear that coarse-grained measurements are not effective in providing the level of service demanded by current-generation applications and services. Measurements at larger time scales (5 or 15 minutes) are still required for longer-term planning, but sub-second measurements are also needed to ensure high service availability. Likewise, measurements that provide only utilization or top-N host information are of limited use, because many operational issues involve other metrics and do not necessarily involve the 10 or 20 most active sources. With the rapid proliferation of IP technology, it is also impossible to predict a priori what will be important, as a new service or application is invented practically every day.
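
A simple back-of-the-envelope sketch (hypothetical numbers, written in Python) shows how a short burst that saturates a link can disappear almost entirely into a 5-minute average:

# Illustrative sketch (hypothetical numbers): a short burst that saturates a
# 1 Gbps link is nearly invisible when traffic is averaged over 5 minutes.

LINK_CAPACITY_BPS = 1_000_000_000          # 1 Gbps link
BACKGROUND_RATE_BPS = 100_000_000          # steady 100 Mbps background load
BURST_RATE_BPS = LINK_CAPACITY_BPS         # a burst that briefly fills the link
BURST_DURATION_S = 2                       # the burst lasts only 2 seconds
INTERVAL_S = 300                           # a traditional 5-minute polling interval

# Bits transferred during one 5-minute interval containing the burst.
bits = (BACKGROUND_RATE_BPS * (INTERVAL_S - BURST_DURATION_S)
        + BURST_RATE_BPS * BURST_DURATION_S)

avg_utilization = bits / (LINK_CAPACITY_BPS * INTERVAL_S)
peak_utilization = BURST_RATE_BPS / LINK_CAPACITY_BPS

print(f"5-minute average utilization: {avg_utilization:.1%}")   # roughly 10.6%, looks healthy
print(f"Actual peak utilization:      {peak_utilization:.1%}")  # 100%, packets were dropped

A 5-minute counter reports a link running at about 10% utilization even though, for two seconds, the link was full and traffic was being discarded.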

It seems that the only way to build a robust and secure service is by measuring everything, without bias! However, in a large network with thousands of clients and servers running thousands of applications and services, measuring everything seems like a daunting, almost impossible task.

Birth of a New Paradigm
A new paradigm was born: measure everything in great detail, and many of the problems in managing and securing IP networks can be readily addressed. What may have seemed impossible has indeed become possible!

In 1997, a few of my colleagues and I set out to address exactly this need for high-resolution, real-time measurements and real-time analytics from IP networks. Several emerging disciplines and technologies, such as data warehousing and mining, parallel distributed processing, hierarchical management, Java and web-based remote computing, and high-volume SAN storage, could be leveraged to meet the needs of the new industry. The idea was simple: warehouse hours of continuous raw transactions directly from the network (filtered to comply with corporate or public policies) while analyzing each and every flow by each and every user, server and application in real time. A specialized appliance was developed to continuously stream all the network traffic at line rate (up to a few gigabits per second) directly to a data warehouse while analyzing every PDU (packet, frame, cell, etc., depending on the network monitored) and indexing every packet, session, flow, host and application, including content, in a robust data warehouse.
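
The following minimal sketch conveys the indexing idea only; it is not the appliance described above. It assumes a hypothetical capture() source yielding decoded packets and uses an ordinary SQLite table as the “warehouse”:

# Minimal sketch of the indexing idea (not the actual appliance): warehouse raw
# packets continuously while maintaining a queryable index of who talked to whom.
# "capture" is a hypothetical packet source yielding
# (timestamp, src, dst, protocol, size_bytes, raw_bytes) tuples.
import sqlite3

def build_warehouse(capture, db_path="warehouse.db", raw_path="packets.raw"):
    db = sqlite3.connect(db_path)
    db.execute("""CREATE TABLE IF NOT EXISTS packet_index (
                      ts REAL, src TEXT, dst TEXT, proto TEXT,
                      size INTEGER, offset INTEGER)""")
    with open(raw_path, "ab") as raw:
        for ts, src, dst, proto, size, payload in capture:
            offset = raw.tell()        # remember where this packet lives on disk
            raw.write(payload)         # warehouse the raw packet itself
            db.execute("INSERT INTO packet_index VALUES (?, ?, ?, ?, ?, ?)",
                       (ts, src, dst, proto, size, offset))
    db.commit()
    return db

In a production system the capture, storage and indexing layers are of course far more sophisticated, but the principle is the same: keep the raw data and keep it indexed.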

Think of it as having Google for the network! The data warehouse can be as small or as large as necessary and typically depends on the application. Typical warehouse sizes range from 500GB to 1TB for network and service monitoring and from 1TB to 20TB for security applications. The warehouse holds all of the information, from raw packets to content, and incident analysis or business analysis is accomplished simply by “data-mining” it. Java and web technologies can be leveraged to mine the warehouse for specific network and application monitoring, performance analysis, quality of service and so on.
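
Continuing the hypothetical schema from the previous sketch, “data-mining” the warehouse can be as simple as a query over the packet index, for example pulling every packet that touched a given host during an incident window:

# Sketch of mining the warehouse (uses the hypothetical packet_index schema
# from the previous sketch): retrieve every indexed packet touching a given
# host inside a time window, long after the traffic crossed the wire.
def packets_for_host(db, host, start_ts, end_ts):
    return db.execute(
        """SELECT ts, src, dst, proto, size, offset
           FROM packet_index
           WHERE (src = ? OR dst = ?) AND ts BETWEEN ? AND ?
           ORDER BY ts""",
        (host, host, start_ts, end_ts)).fetchall()

# e.g. packets_for_host(db, "10.1.2.3", incident_start, incident_end) recovers
# the exact packets involved in an incident for root-cause analysis.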

With a strategic deployment of such distributed appliances, network managers and operators can build a self-adapting, customized and robust knowledge warehouse. This knowledge warehouse provides immediate access to data, including historical data, needed to quickly handle triage situations, root-cause analytics, application profiling, quality of service, network and application optimization, planning and trending, and more.

Other data-mining applications for VoIP, accounting, billing, security, content analytics, expert analytics, enterprise reporting and more provide a scalable and integrated solution that addresses all of the shortcomings raised earlier.

Conclusion
This article has described a successful and revolutionary approach to network management and monitoring. By leveraging methods from disciplines such as data warehousing, data mining, distributed parallel processing and hierarchical aggregation, a novel approach was developed. Until a few years ago, most network operators were “flying their modern jets using World War II instrumentation!” The approach described here is no longer theoretical; it has been successfully deployed at hundreds of enterprise and service provider networks, finally giving them the instrumentation necessary to safely “fly and navigate their jet aircraft” through an ever-changing horizon.


Dr. Parag Pruthi is the founder, Chairman and CEO of Niksun. Based upon his doctoral research on the use of chaos theory to model high-variability phenomena in networking, Dr. Pruthi, along with notable colleagues in the industry, developed unique methods of analyzing network traffic, enabling a scalable and integrated approach to network security and performance.
