Observability: Modern tools for modern challenges in multi-cloud environments

Dynatrace federal CTO discusses how observability — a holistic system applying automation and intelligence in a single platform — can help agencies proactively improve performance and security through deep visibility and insight.

While federal agencies are making significant headway moving IT workloads to multiple cloud service platforms, they continue to face a variety of challenges in gaining a complete picture of their total operating environment. A new generation of automated solutions — designed to provide end-to-end observability of assets, applications and performance across legacy and cloud systems — makes that job easier, says Willie Hicks, federal chief technology officer at Dynatrace.

Hicks, who led IT operations in the financial sector for nearly two decades before joining Dynatrace, describes in this exclusive FedScoop interview how automated software intelligence platforms — and “deterministic AI” — can give agencies a faster and more powerful way to optimize application performance and security in dynamic IT environments. 

FedScoop: What are you seeing as the key pain points federal IT leaders are wrestling with now to manage operations in today’s increasingly dynamic and distributed IT environments?

Willie Hicks, Federal CTO, Dynatrace

Willie Hicks: Oftentimes, they don’t know what they don’t know. They don’t have visibility or “observability” into their systems. Observability means taking application performance management (APM) to the next level. With traditional APM you gathered metrics, logs, or even some transactional data. Five or six years ago, that manual approach may have been okay because you were often dealing with one runtime, like Java.

In today’s landscape, applications are built cloud-native, taking advantage of microservices, distributed across multiple clouds and often utilizing multiple runtimes. Instrumenting and collecting data using the old approach is not sustainable in this new environment, and the data fidelity of older APM technologies is not fine-grained enough to adequately monitor and analyze these highly complex systems. Instead, an observability platform can automate the capture of high-fidelity data, analyze it, and make sense of systems that are now too complex for a single person or even a team of people to analyze and find root causes when problems occur.
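The automated capture Hicks describes can be approximated in miniature: instead of developers hand-writing timing code at every call site, an agent wraps functions and records telemetry uniformly. The following is a toy Python sketch of that idea only; the function names and the `TELEMETRY` store are hypothetical and not Dynatrace's actual agent.

```python
import functools
import time

# Hypothetical in-memory sink; a real agent would export to a backend.
TELEMETRY: list[dict] = []

def instrument(func):
    """Toy automatic instrumentation: wrap a function so latency is
    recorded without manual timing code at every call site."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            TELEMETRY.append({"fn": func.__name__, "ms": elapsed_ms})
    return wrapper

@instrument
def lookup_order(order_id: str) -> str:
    # Stand-in for real application work.
    return f"order:{order_id}"

lookup_order("A-100")
print(TELEMETRY[0]["fn"])  # lookup_order
```

The point of the sketch is that instrumentation is applied once, by the platform, rather than rewritten per runtime and per service.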

Many of our government customers have these extremely complex systems — legacy data centers or hybrid environments — where they are crossing multiple clouds, and they start to lose control and visibility into these apps. And that leads to a lack of understanding when there is a problem: Where is the problem originating? What’s the root cause? And how do I quickly resolve it? But these systems must always be on and highly available. Think of the military with critical weapon systems. If these systems aren’t online or performing at full capacity, it could compromise our national security. The same is true for so many other agencies. Observability provides full situational awareness of all of your assets at any given time and an understanding of how they’re performing. And when they’re not, what is the root cause? 

FedScoop: There are a lot of solutions that claim to provide a single-pane-of-glass view of agency networks and applications. What is keeping agencies from adopting these more comprehensive solutions to capture system-wide visibility?

Hicks: Unfortunately, this type of visibility is oftentimes an afterthought. When these systems are designed and built, there’s often a lack of knowledge about platforms like Dynatrace that can provide this type of visibility. A lot of agencies still think in terms of needing to scrub every log to correlate events so they can figure out what’s causing a problem. That’s an antiquated approach. 

What they’re missing is true causal analysis. And to get to that point, you have to have a platform that’s going to be easy to deploy. It has to be automated. And it has to start sending those key telemetry points and do the analysis automatically. 

FedScoop: How mature are the AI and automation capabilities in these types of platforms, and can they really accommodate the mix of legacy and cloud systems agencies have to manage?

Hicks: Everyone’s talking about AI Ops and machine learning. There is a lot of confusion about what AI really can do today and what AI is really good at. Let me be clear. AI is extremely important. In most large organizations in the private sector, but especially in the public sector, you’re looking at these massive systems that have billions of dependencies that could go wrong. You couldn’t analyze all of those without AI.

At Dynatrace, we use more of a deterministic AI model, which is actually quicker in addressing these types of problems. Systems are very dynamic. You have lots of containers spinning up and down. You’re bringing more servers online in the cloud. AI that relies on machine learning requires a slower process to train its algorithms, and it would be tripped up by so many dynamic changes in the system. Deterministic AI can handle that type of dynamic environment and give you good answers, even as the system shifts and changes underneath it.
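The deterministic contrast can be illustrated with a sketch: if the platform already knows the live dependency topology, root-cause candidates fall out of a repeatable graph walk rather than a trained statistical model. Everything below (service names, health states, the walk itself) is hypothetical and illustrative, not Dynatrace's actual engine.

```python
# Each service maps to the services it depends on (assumed topology).
DEPENDENCIES = {
    "web-frontend": ["auth-service", "order-service"],
    "order-service": ["inventory-db", "payment-gateway"],
    "auth-service": ["user-db"],
    "inventory-db": [],
    "payment-gateway": [],
    "user-db": [],
}

# Current health signal per service, as reported by telemetry.
HEALTH = {
    "web-frontend": "degraded",
    "order-service": "degraded",
    "auth-service": "healthy",
    "inventory-db": "failing",
    "payment-gateway": "healthy",
    "user-db": "healthy",
}

def root_causes(service: str) -> list[str]:
    """Return the deepest unhealthy dependencies of a degraded service.

    Because the dependency topology is known, the walk is deterministic:
    the same inputs always yield the same root-cause candidates, with no
    training phase that a shifting environment could invalidate.
    """
    unhealthy = [d for d in DEPENDENCIES[service] if HEALTH[d] != "healthy"]
    if not unhealthy:
        # No unhealthy dependency: this service itself is the candidate.
        return [service]
    causes: list[str] = []
    for dep in unhealthy:
        causes.extend(root_causes(dep))
    return sorted(set(causes))

print(root_causes("web-frontend"))  # ['inventory-db']
```

When a container spins up or down, only the topology dictionaries change; the analysis itself needs no retraining.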

Another key problem when we start talking about AI and AI Ops is explainability. We have to build into the system the ability to provide an explainable view of what’s happening for the customer and for program owners — one that not only tells me what the problem is but shows me why it is the problem, and why the system implemented whatever fix was deployed.

Finally, agencies need to ask: Are we going to build our own AI monitoring system? Are we going to partner with industry and get something off the shelf? Or work with a contractor to build an AI system to identify problems on our network? Dynatrace partners with agencies to deploy better observability into their most critical systems that have grown too complex — or where they no longer have visibility out to the edge, like they did when they controlled everything in their data center. Agencies need something that can parse billions of dependencies; something that can scale to handle tens of thousands of devices; something that can easily perform root-cause analysis. I see RFPs almost every week from CIOs and agencies looking for these types of monitoring systems.

FedScoop: What does observability mean in all of this?

Hicks: Observability extends beyond the analysis of metrics, logs and traces to also integrate user experience data, as well as data from the latest open-source standards (e.g., OpenTelemetry), all in context, at massive scale and with very low overhead.
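The phrase "in context" is what makes the signals more than the sum of their parts: a trace, a log line and a metric that carry shared context can be joined into one incident view. The following is a stdlib-only toy sketch of that correlation idea using a common trace ID; it stands in for, and does not use, the real OpenTelemetry SDK, and all record fields are hypothetical.

```python
import json
import uuid

def emit(record: dict) -> str:
    """Serialize one telemetry record (stand-in for an exporter)."""
    return json.dumps(record)

trace_id = uuid.uuid4().hex  # shared context linking all three signals

span = {"signal": "trace", "trace_id": trace_id,
        "name": "GET /orders", "duration_ms": 412}
log = {"signal": "log", "trace_id": trace_id,
       "level": "ERROR", "message": "inventory-db timeout"}
metric = {"signal": "metric", "trace_id": trace_id,
          "name": "http.server.duration", "value_ms": 412}

records = [emit(r) for r in (span, log, metric)]

# Because every record carries the same trace_id, a backend can join the
# slow span, its error log and the latency metric into one incident view.
joined = [json.loads(r) for r in records
          if json.loads(r)["trace_id"] == trace_id]
print(len(joined))  # 3
```

Without the shared ID, each signal would land in a separate silo and the correlation Hicks describes would have to be reconstructed manually from logs.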

It’s important to note that observability is not just a beefed-up term for monitoring. Monitoring is something you do; observability is something you have. Typically, monitoring requires insight into exactly what data to monitor. It’s great for tasks like tracking performance, identifying problems and anomalies, finding the root causes of issues, and gaining insights into physical and cloud environments. If or when something goes wrong, a good monitoring tool alerts you to what is happening based on what you’re monitoring. 

You need monitoring to have observability. Monitoring views a system from the outside, and logging records its outputs over time; combined, they let you infer a system’s internal state from its historical behavior. Think of observability as the insight that tells you exactly what to monitor. It’s proactive, using logs, AI and causal analysis to create a holistic system that is visible to everyone.

FedScoop: What do you say to agencies to give them greater assurance that these automation and intelligent tools can and will perform reliably?

Hicks: Before I joined Dynatrace, I worked for some large banks for about 17 years. And I’ve been in the application performance space for quite some time. I lived through trying to configure systems to generate reports and trying to isolate root cause. Oftentimes, the purpose-built software we bought for monitoring became shelfware because it was simply too difficult to deploy, too difficult to configure and too difficult to understand. And that was well before AWS, Azure and cloud. The level of complexity has grown by an order of magnitude. Now there are systems with hundreds of different physical systems, containers, cloud instances, databases and back-end supporting systems running in the cloud. How do you instrument all of that when you don’t even know what’s there?

I guarantee you, if I went to five different people in the organization — a developer, a system admin, a network person, the business owner, the application owner — and said “Show me a simple diagram of the application,” every one of them would give me a different diagram, if they could do it at all. No one knows how the whole system was created. No one knows what is there at any given time. 

You need to have an automated platform to keep track of the true breadth of details, or you will not get complete observability. It has to be intelligent, because there can be millions, if not billions of dependencies, making it virtually impossible to understand manually. So this idea of intelligent, automated observability is a necessity; you really can’t do it any other way.

Learn more about how Dynatrace can help your organization transform faster through advanced observability, automation and intelligence in one platform.
