Lockheed Martin's Enterprise Operations chose OTel for better observability
Challenge
Lockheed Martin’s Enterprise Operations, located in Bethesda, Maryland, provides business services and information technology support to the company’s approximately 114,000 employees worldwide. The security and aerospace company is heavily regulated, and the Enterprise Operations team is required to store security and audit logs separately from its main systems. The problem? None of the existing solutions could scale to the size of their Kubernetes clusters.
Solution
Lockheed Martin decided to try OpenTelemetry while it was still an alpha release, and was very quickly able to get all audit logs, security logs, and performance metrics for the clusters and for all the applications running on them. No data was dropped, and everything was visible.
Impact
OTel gave Lockheed Martin the ability to do something it couldn’t do before: gain visibility into all of its environments and scale as much as necessary. The team is responsible for maintaining and operating roughly 70 Kubernetes clusters, and it now has insight across all of them. Engineers no longer need to log into a cluster to look at a problem, and they have common data despite running different flavors of Kubernetes. With clusters running in multiple clouds and on-prem, getting consistent data from each cluster is crucial.
By the numbers
70 Kubernetes clusters
3 flavors of Kubernetes
1 collector sends data to 2 different platforms
Getting OTel metrics flowing in minutes (literally)
Three years ago, Lockheed Martin’s Enterprise Operations team knew something had to change. To keep the security and audit logs off the main systems, they needed scalability, but all of their options at the time either wouldn’t scale or didn’t follow cloud native best practices, explained James Sevener, Staff Full Stack Engineer at Lockheed Martin.
“The OpenShift logging project at the time – it’s iterated since then – couldn’t scale to the size of our clusters,” Sevener explained. “It would get to around 3,000 events per second and would just start dropping data. We needed data for all of the pods but we were only collecting data from this subset of pods or this subset of nodes. All the solutions at the time could not scale out to our cluster sizes.”
OpenTelemetry had just come out in alpha, so the team went to the GitHub repo, deployed it, configured it, followed all the documentation, and… it just didn’t work. Sevener and team weren’t getting any data, so they reached out to their Splunk account rep, who put them in touch with the developers on the OTel project. It turned out Lockheed Martin was using the wrong repo – a testing repo that wasn’t supposed to be public.
“Within a 5 minute session on the call we were able to switch out to the current release and get it configured,” Sevener said. “Because it’s a handful of options for the base install we had logs and metrics flowing from one of our test clusters. It was literally minutes for us to be able to get data flowing. It is really spectacular. There are a handful of options needed for the default collector and it works. That’s it.”
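The base install Sevener describes really is small. As a rough illustration only – not Lockheed Martin’s actual configuration – a minimal OpenTelemetry Collector config needs little more than a receiver, an exporter, and the pipelines that connect them:

```yaml
# Minimal illustrative Collector configuration: one receiver, one exporter,
# and the pipelines wiring them together. Names and endpoints are generic.
receivers:
  otlp:                 # accept OTLP logs and metrics over gRPC and HTTP
    protocols:
      grpc:
      http:

exporters:
  debug:                # print received telemetry to stdout to verify data is flowing

service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      exporters: [debug]
```

In practice the team was collecting cluster logs rather than OTLP traffic, but the shape of the configuration – receivers, exporters, pipelines – stays the same, which is why swapping in a real backend is a similarly small change.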
After that, it was smooth sailing. The team went deeper, adding custom requirements of their own, such as restricting logs from certain namespaces; that was straightforward and required only a simple change to the deployment. Then Lockheed Martin started deploying other flavors of Kubernetes.
“Now we’re using more or less the same deployment across three different flavors of Kubernetes today,” he said.
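That namespace restriction can live entirely in the Collector configuration. A hedged sketch using the contrib filter processor – the namespace names, receiver, and exporter here are placeholders rather than Lockheed Martin’s actual settings – might look like this:

```yaml
receivers:
  filelog:
    include: [ /var/log/pods/*/*/*.log ]    # container log files on each node

processors:
  # Drop log records from namespaces that shouldn't be forwarded.
  # Assumes k8s.namespace.name is present as a resource attribute
  # (e.g. added by the k8sattributes processor or the collection agent).
  filter/exclude-namespaces:
    logs:
      log_record:
        - 'resource.attributes["k8s.namespace.name"] == "kube-system"'
        - 'resource.attributes["k8s.namespace.name"] == "kube-public"'

exporters:
  debug: {}    # stand-in for the real backend exporter

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [filter/exclude-namespaces]
      exporters: [debug]
```

Because the restriction is just another processor in the pipeline, the same deployment can be carried across clusters largely unchanged – which fits with reusing one configuration across multiple flavors of Kubernetes.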
Sending telemetry to two different platforms
Lockheed Martin’s starting point was to send all of the security and audit log data – including data from VMs and Windows servers – to Splunk, along with metrics for all of the Kubernetes clusters. But another team within the organization wanted to see that data in Dynatrace. Dynatrace has its own agent deployed on Kubernetes but also has its own OTel Collector, so the team extended the Splunk OTel Collector to include the Dynatrace exporter.
“We did it and it works,” Sevener said. “It was really nice to send data to two entirely different platforms from the same collector because we have this compliance software that just stacks up in our environments. Using multiple agents for each target, we end up with half of our cluster resources being dedicated to compliance tasks like logging, which is unfortunate.”
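Fanning the same data out to two platforms is, in Collector terms, just a second exporter on the same pipeline. A rough sketch of that wiring – the endpoints, tokens, and the choice of an OTLP/HTTP exporter for Dynatrace are illustrative placeholders, not Lockheed Martin’s actual configuration – could look like this:

```yaml
exporters:
  # Splunk HEC exporter from the contrib / Splunk OTel Collector distribution.
  splunk_hec:
    token: ${env:SPLUNK_HEC_TOKEN}
    endpoint: https://splunk.example.internal:8088/services/collector

  # Dynatrace accepts OTLP over HTTP; tenant URL and API token are placeholders.
  otlphttp/dynatrace:
    endpoint: https://example.live.dynatrace.com/api/v2/otlp
    headers:
      Authorization: "Api-Token ${env:DT_API_TOKEN}"

service:
  pipelines:
    logs:
      receivers: [filelog]                          # as configured above
      exporters: [splunk_hec, otlphttp/dynatrace]   # same data, two platforms
    metrics:
      receivers: [otlp]                             # as configured above
      exporters: [splunk_hec, otlphttp/dynatrace]
```

Running one collector with two exporters, rather than one agent per backend, is what keeps yet another compliance agent from stacking up on already busy clusters.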