Breaking the multi-cloud barrier in a regulated industry
How Kubernetes and Linkerd became Lunar’s multi-cloud communication backbone
At Lunar, a Scandinavian online bank, we embraced cloud native tech early on. We’ve been running Kubernetes since 2017 and today have over 250 microservices distributed across three clouds.
This blog post explores how we set out to centralize all platform services. The gains were substantial, from being better prepared to absorb newly acquired companies to improved developer productivity.
Founded in 2015, Lunar set out to challenge the banking status quo by reinventing how people interact with their finances. Lunar is for those who want everything money-related in one place — 100% digital, right in their hands.
For us, that meant offering customers a smarter way to manage their money with more control, faster savings, easier investments, and no meaningless fees. That’s how we envision the future of banking.
In 2021, Lunar acquired Lendify, a Swedish lending company, and PayLike, a Danish fintech startup. Both acquisitions are part of Lunar’s broader strategy to grow and scale. They also meant we had to integrate all these systems so that they work together smoothly.
Lunar’s commitment to cloud native principles
Lunar’s team of 150+ full-time engineers pushes about 40 releases to production on any given day. Ten of those engineers are platform engineers, and that’s the team I lead.
We operate nine Kubernetes clusters across three cloud providers (AWS, Microsoft Azure, and Google Cloud Platform) and multiple availability zones. We also run 250+ microservices plus a range of platform services that are part of our self-service developer platform. We want our teams (or Squads, as we call them) to be autonomous and self-driven. To support this “shift left” mindset, a group of platform Squads builds abstractions and tooling so that developers can ship their features quickly, securely, compliantly, and efficiently.
The Lendify acquisition means we now have an Azure-based platform we have to integrate and adapt, so it complies with the same cloud native principles Lunar is built on. We are currently working on seamlessly connecting our AWS and Azure environments.
There are multiple reasons why we chose the cloud native path. First, we needed a platform that allowed our teams to manage their services and be fully autonomous. Second, as a fintech company pioneering cloud-based banking, we had to provide a clear exit strategy for cloud providers, which is a regulatory requirement of the Danish FSA.
Kubernetes was a perfect fit. Functioning as an abstraction on top of a cloud provider, it helped us achieve both goals.
This autonomy allowed us to scale easily as most dependencies were removed. Squads are also supported by a mix of open source tooling, including Backstage, Prometheus, and Jaeger, and some custom-built solutions, which we have open-sourced, such as shuttle and release-manager.
This multi-cloud strategy and work style support the company’s goal of scaling, both in terms of the number of employees and mergers and acquisitions. It also allows us to stay technology agnostic and choose the technologies that best fit our needs.
Oh no, where are our production logs?
The idea of centralizing platform services started with our log management system, Humio. At the time, we were developing failover processes for our production Kubernetes clusters. Because Humio ran inside the production cluster, failing over led to missing logs. That’s when we realized we had to remove the system from our production cluster and centralize it before performing any failover in production.
From logs to centralizing all platform services
After successfully centralizing our log management system, we decided to embark on a platform services centralization journey prior to any corporate acquisitions. While we had multiple environments, many of our platform services, such as our observability stack, were replicated in each environment. These services require a vast amount of resources and are fairly complex. Services such as Humio, Prometheus, and Jaeger (with Elasticsearch) are stateful. Having stateful services in “workload” clusters makes failover and disaster recovery much harder. For this reason, we decided to minimize the number of stateful services in these environments. Additionally, running nine replicated setups simply didn’t scale, so we needed a centralized solution.
Moreover, having multiple endpoints for accessing things like Grafana led to lots of duplicated users, dashboards, and so on. This confused our developers, and because changes had to be made in multiple places, the environments drifted apart. Managing users in one system is a lot more efficient than doing so in nine (or more).
That’s why we decided to create a centralized cluster owned by the platform team that would eventually run the entire observability stack, release management, developer tooling, and cluster-API.
Today, our log and release management run as centralized services provided by the platform team. Backstage is also served from the centralized environment, along with a handful of other tools. Next in line is our monitoring setup, a mix of Buoyant Cloud and Prometheus/Grafana.
The quest to connect our clusters
Once we started centralizing platform services, we needed to connect our clusters. At the time, we were only running clusters in AWS and considered VPC peering across our accounts. Doing that was somewhat painful due to clashing CIDR ranges. We also evaluated VPNs but aren’t big fans of technologies that rely on static boxes at each end. Besides, we wanted to move towards zero trust networking, following the principles of BeyondProd by Google.
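To illustrate why clashing CIDR ranges get in the way of VPC peering, here is a minimal Python sketch that checks a set of networks for overlaps. The account names and address ranges are hypothetical and chosen only for the example; they are not Lunar’s actual network layout.

```python
import ipaddress
from itertools import combinations

# Hypothetical VPC ranges for illustration only (not Lunar's real layout).
VPC_CIDRS = {
    "account-a": "10.0.0.0/16",
    "account-b": "10.0.128.0/17",  # sits inside account-a's range
    "account-c": "10.1.0.0/16",
}


def find_clashes(cidrs):
    """Return the sorted pairs of names whose CIDR ranges overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return sorted(
        (a, b)
        for a, b in combinations(sorted(networks), 2)
        if networks[a].overlaps(networks[b])
    )


if __name__ == "__main__":
    for a, b in find_clashes(VPC_CIDRS):
        print(f"{a} and {b} overlap and cannot be peered without re-addressing")
```

Any pair reported here would have to be re-addressed before peering, which is the kind of work a mesh-based connection between clusters avoids.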
Service meshes finally caught up with our needs!
We continuously evaluated service meshes during our 5+ years of running Kubernetes in production. In 2017, we had Linkerd running as a PoC but decided against it. It was still the JVM-based Linkerd 1 and quite complex. We kept following the development and evolution of service meshes and, when we saw the Linkerd 2.8 release and its multi-cluster capabilities, we realized it was time to give service meshes another shot.
Our decision was further reinforced by some problems we were experiencing with gRPC load balancing (which is not natively supported by Kubernetes) and the need to switch to mTLS for all internal communication. A service mesh made a lot more sense now.
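The gRPC problem comes from how connections are reused: gRPC multiplexes many requests over one long-lived HTTP/2 connection, and a plain Kubernetes Service balances per connection, not per request. The Python sketch below is a simplified simulation of that difference, not a description of how Kubernetes or Linkerd are implemented. The pod names are hypothetical, and round-robin stands in for whatever strategy a real mesh proxy uses.

```python
from collections import Counter
from itertools import cycle

# Hypothetical pod names, for illustration only.
BACKENDS = ["pod-a", "pod-b", "pod-c"]


def connection_level(backends, num_requests):
    """Balance per connection: the backend is picked once, when the single
    long-lived connection is opened, and every request reuses it."""
    chosen = backends[0]  # whichever pod the connection happened to land on
    return Counter(chosen for _ in range(num_requests))


def request_level(backends, num_requests):
    """Balance per request: a proxy picks a backend for each request
    (round-robin here, as a stand-in for a real proxy's strategy)."""
    picker = cycle(backends)
    return Counter(next(picker) for _ in range(num_requests))


if __name__ == "__main__":
    print("per connection:", dict(connection_level(BACKENDS, 9)))
    print("per request:   ", dict(request_level(BACKENDS, 9)))
```

With per-connection balancing, one pod receives all the traffic from a given client while the others sit idle; with per-request balancing, the load spreads evenly.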
While we evaluated both Linkerd and Istio, we have always been big fans of the approach Linkerd took: start with the basics and make that work well. We gave ourselves a week and two engineers, one playing with Istio and the other with Linkerd.
We had the Linkerd multi-cluster up and running within an hour! After a few days of struggling with Istio, we gave up on it. Linkerd did the job quickly and easily, which made it the perfect mesh for us. It had all the features we needed at the time, was easy to operate, and had a great community and solid documentation.
Since going live, we also started using Buoyant Cloud for better visibility across all our environments.
Lunar is committed to the CNCF stack
At Lunar, we are big fans of CNCF projects and use many of them (in fact, I’m a CNCF Ambassador and love educating the community on these awesome projects!). Lunar is also a CNCF End User Member.
Our stack includes Kubernetes, Prometheus, cert-manager, Jaeger, CoreDNS, Fluent Bit, Flux, Open Policy Agent, Backstage, gRPC, and Envoy, among others. We’ve built an Envoy-based ingress/egress gateway in all clusters to provide a nice abstraction for developers to expose services in different clouds.
Prepared to scale our business and shake up the European banking market
From a technology perspective, we have now achieved a fairly simple way to provide and connect clusters across clouds. Kubernetes allows us to run anywhere, Linkerd enables us to seamlessly connect our clusters, and GitOps provides an audited way to manage our environments across multiple clouds with the same tooling and process. And from a developer perspective, whether you deploy on GCP or AWS, the process is identical.
Seamless integration with newly acquired startups
The business impact has been substantial. With our new multi-cloud communication backbone, we are better positioned to support upcoming mergers and acquisitions, a key part of our business strategy. Having a cloud-agnostic way to extend the Lunar platform, regardless of where the acquired systems run, is incredibly powerful. It also allows us to select the provider that best fits our needs for each use case.
Fully prepared for DR while compliant with government regulations
The fact that we are no longer losing logs during failover is huge. We’ll soon implement quarterly failovers for our production clusters. We need to ensure we know exactly how our system behaves in case of a failure and how to bring it back up. It’s important both from a regulatory perspective and a business perspective. If our customers were to lose access to their account information, it would have disastrous consequences for our business. That’s why we proactively train for the worst-case scenario. If something were to happen, we would know exactly what to do and how to avert an issue.
We are big believers in the pets vs. cattle idea but go a step further. We don’t want to have pet servers or pet clusters either. Imagine losing logs each time we perform a failover. Without audit logs, we’d fall out of regulatory compliance right there and then.
Centralized services and streamlined processes increased developer productivity
Centralizing most of our platform services has already streamlined many processes and improved developer productivity. We ensure that all releases, metrics, logs, traces, etc., are properly tagged with fields such as Squad names, environments, and so on, making it easy for developers to find what they are looking for. It also ensures clear ownership of each piece.
Managing the team is also a lot simpler. For me, that means I don’t have to set up dashboards, help search through logs, and so on, because our Squads are truly independent. Because our platform is based on self-service, it is decoupled from the organization, allowing our team to focus on implementing the next thing that will help our developers move faster, be more secure, or ensure better quality.
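As a rough illustration of that kind of tagging, the Python sketch below attaches a Squad name and environment to every log line using the standard `logging` module. The field names, the logger name, and the `tagged_logger` helper are hypothetical and only show the idea; they do not describe Lunar’s actual logging setup.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that includes ownership tags."""

    def format(self, record):
        return json.dumps(
            {
                "level": record.levelname,
                "message": record.getMessage(),
                # Tags are injected by the LoggerAdapter below.
                "squad": getattr(record, "squad", "unknown"),
                "environment": getattr(record, "environment", "unknown"),
            },
            sort_keys=True,
        )


def tagged_logger(name, squad, environment):
    """Hypothetical helper: a logger that stamps every line with its owner."""
    return logging.LoggerAdapter(
        logging.getLogger(name), {"squad": squad, "environment": environment}
    )


# Demo: write to an in-memory stream instead of a real log shipper.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
base = logging.getLogger("payments-demo")
base.addHandler(handler)
base.setLevel(logging.INFO)
base.propagate = False

log = tagged_logger("payments-demo", squad="payments", environment="prod")
log.info("release deployed")
entry = json.loads(stream.getvalue())
print(entry)
```

Because every line carries the same two fields, a central log system can filter by Squad or environment without knowing anything else about the service.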
Easy audits and peace of mind for management
Then there are the easy audits. Since everything is centralized, we can run audit reports for all clouds and services across groups and environments. That is good for us and provides peace of mind in the highly regulated financial services industry.
While we aren’t there yet, we expect to save significant engineering time by not having to operate and maintain nine versions of the soon-to-be fully centralized stack.
Well-positioned to scale fast and smoothly
Overall, we feel well-positioned for upcoming acquisitions and organic growth. With a platform able to extend anywhere, we’ve become a truly elastic organization.