The Much Needed “Easy Button” for Modern Network Operations: Our Investment In Augtera Networks

06.29.2021

Let's get real about the current state of network operations. It's a total mess. It's 1,000s of daily alerts and alarms based on static thresholds, largely set by trial and error in response to some problem that occurred in the past. Remember the disclaimer on financial products, "past performance is not indicative of future results"? It should apply to network operations too. And worst of all, it's the beeper buzzing at 2 A.M. for a poor SRE or NetOps team member who has to come in and spend hours debugging a grey failure (i.e., a network performance degradation), all while the application running on top delivers a poor experience to end users. This leads to unhappy customers, lost business value, endless fire drills and, occasionally, lost jobs. Fundamentally, it's all reactive: the problem has already occurred, and we need to figure out how to stop it ASAP before it gets worse.

Now let's look more closely at how we got here before we talk about what to do about it. We've been tracking a number of critical trends over the past couple of years:

Enterprise Networks are Heterogeneous & Large. Their complexity is beyond "human scale" to comprehend. The modern enterprise network is made up of many vendors' products, both virtualized and appliance-based. An untold number of devices run in production, each with a different combination of model and firmware. Layer on top of that the distinct, independent operating systems running in these environments (some vendor-proprietary, some open source). A typical Fortune 500 datacenter network runs 1,000s to 10,000s+ of appliance-based and virtualized switches, routers, firewalls and load balancers, not to mention the corresponding servers. These devices, in turn, carry 2 to 3 billion lines of configuration text, with 100s to 1,000s of rules per device. On top of that, the NetOps team is making 100s of changes per week in this environment. And that describes a single datacenter. New, dynamic and cloud-native network architectures, driven by microservices and containers, hybrid networking, and cloud-to-edge migration, further exacerbate the issue. Fundamentally, network operators within large enterprises and service providers run a gargantuan hybrid network stack that is utterly complex.
They are Managed with Inadequate Tooling. To put it bluntly, management is extremely difficult, highly manual and largely reactive once problems occur. One culprit is the tooling. Many of the Network Performance Monitoring & Diagnostics (NPMD) tools still in production at NetOps teams were never built to handle the scale and complexity of the networks they run today. Meanwhile, the telemetry collection these legacy NPMD tools provide is often insufficient.
This Results in Fragility. When management is extremely difficult, highly manual and largely reactive... well... problems occur. Identifying problems is hard, and fixing them without creating yet another problem is harder. But fragility is not just about failure; it is also about confidence. How can you have confidence in the performance of your systems when you are receiving 100s to 1,000s of alerts a day, the vast majority of which are neither relevant nor actionable?

To sum it up, scale + complexity + many changes + bad tooling = degradations, outages and, ultimately, unhappy customers.

The first wave of NPMD tools sought to solve these problems, but those tools were fundamentally not built for the way modern enterprise networks have evolved. For instance, Gartner's Market Guide for Network Performance Monitoring and Diagnostics (March 2020) predicted that by 2024, 50% of network operations teams would be required to re-architect their network monitoring stack due to the impact of hybrid networking. Beyond the fundamental architectural shift to multi-cloud and hybrid networks, these tools were also largely rules- and threshold-based, which ultimately leads to alert fatigue and reactive troubleshooting and debugging.

With this backdrop in mind, we've been on the hunt for the right team and product to take a holistic approach to changing the paradigm of modern network operations. Enter Augtera Networks, which we're thrilled to see launch out of stealth today with a $13M Series A round led by Intel Capital, with participation from Bain Capital Ventures, Dell Technologies Capital and Acrew Capital.

Rahul Aggarwal, Founder and CEO (left), and Bhupesh Kothari, Co-Founder and VP of Engineering (right)

When we first met Rahul Aggarwal in the fall of 2019, it was an immediate meeting of the minds. Rahul walked through his views on the challenges facing network operations, and they matched our observations entirely. As Rahul described it, the existing tooling simply collected signals from cloud, WAN and datacenter networks and dumped them into a massive (and ever-growing) data "haystack". Network operators then had to plow through this "haystack" manually to find the "needles" of data that could actually help remediate a problem.

Fortunately, Rahul and his co-founder, Bhupesh Kothari, have the deep understanding of network architecture needed to rethink the approach to operations. Having first met at Juniper, Rahul and Bhupesh were instrumental in developing numerous modern networking protocols and methodologies. At Augtera, they built a Network AI Platform from the ground up, bringing the benefits of AI-driven operations, planning and orchestration to physical, virtual and cloud network environments.

Data is at the core of the platform, which ingests standard and vendor-specific data at large scale. On top of that data, unsupervised machine learning models enable automated and proactive management, identification and remediation of grey failures and other degradations before they impact customers. The signal-to-noise ratio improves dramatically, and the resulting alerts are highly actionable. Use cases that Augtera's Network AI Platform covers today, or will cover in the near future, include:

Automated Network Topology – Providing network operators with zero touch, multi-layer topology with deep predictive visibility
Automated Grey Failure Detection – Early and predictive detection of grey failures in network hardware and the control plane
Automated Multi-Layer Correlation – Correlation of network events and Augtera-generated AI insights to reduce root cause analysis times by an order of magnitude or more
Recommended Remediation – The platform will soon be extended to learn, on an unsupervised basis, from past fixes and automatically recommend fixes for new network incidents
Automated Remediation – The platform currently integrates directly with ticketing systems and will soon be extended to let operators apply recommended fixes automatically
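To make the contrast between static thresholds and learned baselines a little more concrete, here is a minimal sketch. It is not Augtera's method, which this post does not describe; it simply compares a fixed, hand-set threshold with a rolling mean and standard deviation, used here as a simple stand-in for a per-link learned baseline. The function names, the window size, the 3-sigma cutoff, the 40 ms threshold and the latency samples are all hypothetical.

```python
from statistics import mean, stdev


def static_alerts(samples, threshold):
    """Indices where a fixed, hand-tuned threshold is exceeded."""
    return [i for i, x in enumerate(samples) if x > threshold]


def baseline_alerts(samples, window=5, k=3.0):
    """Indices that deviate more than k standard deviations from the
    rolling mean of the preceding `window` samples.

    Hypothetical stand-in for a learned baseline: each link is judged
    against its own recent behavior rather than a shared threshold.
    """
    alerts = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if abs(samples[i] - mu) > k * sigma:
            alerts.append(i)
    return alerts


# Hypothetical latency samples (ms) for two links sharing one threshold.
quiet = [5.0, 5.2, 4.9, 5.1, 5.0, 5.1, 4.9, 15.0, 5.0, 5.1]  # brief 3x spike
busy = [45, 47, 44, 46, 45, 46, 44, 47, 45, 46]  # normal for this link

print(static_alerts(quiet, 40), baseline_alerts(quiet))  # → [] [7]
print(static_alerts(busy, 40), baseline_alerts(busy))
# → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] []
```

On the quiet link, the shared 40 ms threshold never fires even though latency briefly triples, the kind of grey failure described above, while the rolling baseline flags that sample. On the busy link, the same threshold fires on every sample of perfectly normal behavior, while the baseline stays silent. Real platforms use far richer models across many correlated signals; this only illustrates why per-link baselines can cut noise and surface degradations at the same time.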

Augtera's Network AI Platform is driving toward offering customers an "easy button" for managing networks in a proactive and automated way. Rahul and Bhupesh have leveraged their deep domain experience to build a product that automatically identifies and correlates grey failures and other challenging networking problems that have long plagued large enterprises and service providers. Soon the platform will be extended to provide recommended and automated remediation of these problems as well.

The feedback from our customer calls on the ROI they received from the platform validated our thesis that the market needed a wholesale new approach, which Augtera is delivering. Augtera is currently deployed in large-scale production environments with several major enterprise and service provider customers, spanning hybrid datacenter, WAN and backbone networks. Each of these customers was originally focused on detecting and fixing grey failures and other hidden network misbehaviors before they caused customer-facing issues or outages. As we dug into these deployments, it was clear that Augtera's Network AI Platform was central to their networking strategy and provided significant business value, including identifying numerous network anomalies that would have impacted service had they not been proactively detected. One customer noted that they were seeing a 25% to 75% daily reduction in tickets compared to traditional NPMD tools. Most importantly, all of these systems are running at large scale, and Augtera's unsupervised machine learning platform continues to learn from these networks and become more effective.

We're thrilled to have established our partnership with the Augtera team at the seed stage and to now lean in and lead the Series A. Together with our co-investors, we're looking forward to a new future for network operations with Augtera, one that moves from manual and reactive to automated and proactive.

The materials available at this web site are for informational purposes only. The opinions expressed at or through this site are the opinions of the individual author and may not reflect the opinions of Intel Capital, Intel Corporation or any of their affiliates or individual employees.
