1. Introduction

While looking at the infrastructure during my first few days at Xoxoday, I quickly realized that it was relatively nascent. The loose ends I noticed were potential indicators of technical debt in the infrastructure setup itself. They included the design, architecture, tooling, monitoring, and logging, along with various processes & practices.

Considering the nature of business, Xoxoday is a user-facing web service that could typically be categorized under Software as a Service (SaaS). That puts a lot of pressure on specific technical requirements such as availability, uptime, reliability, robustness & scalability.

A typical e-commerce website such as Xoxoday with high global traffic necessitates that we adopt the latest design patterns, architecture, tools & practices. This is to ensure that we keep up with the quality of service expected by the user base and ideally exceed it.

Finally, another critical requirement that arises for privacy and security compliance reasons is that most customers care about the location of their data. The usual expectation is that the data resides in, and never leaves, their requested country or region.

This created the need to consider deploying multiple production environments in parallel, as atomic deployments independent of each other. We call this a polycloud, although the term may be debatable. This requirement meant that we needed state-of-the-art automation in place, as we now had to replicate our infrastructure management & maintenance in multiple regions.

Hence, after some careful consideration, we quickly realized that we needed to have the best of the infrastructure automation processes & practices along with a portfolio of modern cloud-based technologies to go cloud-native and provide the best service and experience to our users.

The silver lining for us was that the developers had already broken their services down into microservices. That helped us move quickly in terms of adopting the latest trends & practices of infrastructure.

2. Need for automation

Service automation in the cloud has been a hot topic in the last few years, leading to the rise of many excellent tools & technologies. Our needs were typically driven by the nature of the business & its demands. For us, the following factors were crucial:

  • High Availability
  • Auto-Scaling
  • Self-Healing

Additionally, automation brings in the opportunity to achieve infrastructure as code, allowing us to interact with the infrastructure while leveraging the fantastic benefits of Version Control Systems.

This was a big paradigm shift as it allowed us to look at the infrastructure proactively and reduce the reactive (imperative) interactions. It made rollbacks easier and helped us efficiently keep track of changes across the infrastructure. Also, automation meant that our DevOps teams could rest easy at odd hours.

Another added benefit of automation is that it opens up various avenues for modernizing and improving developer productivity. We shall take a deeper look into this particular topic in the following section on continuous delivery.

3. Continuous Integration, Delivery & Deployments

Releases could be a painful nightmare if the proper pipelines, tools, and guidelines are not in place. This meant that the critical backbone of DevOps was the CI/CD pipelines which helped us test, integrate, build & deploy our code into production.

We aimed to be able to do multiple production deployments in a day and do these deployments seamlessly. If it broke, we wanted to roll back the updates seamlessly and quickly.

As mentioned earlier, our developers had done a fantastic job of breaking down the monolithic services into microservices, which opened the doors to the cloud-native world. Still, it came with its own set of challenges. The biggest challenge we faced was integration testing.

Additionally, our architecture and some excellent cloud-native tools allowed us to approach software development innovatively. We were disrupting the traditional conventions to adapt to the new and exciting world of microservices.

We started with the well-established practice of having multiple environments, which allowed us to test and scrutinize our portfolio of features before deploying them to production. Once done working on their particular features, bugs, or improvements, our developers opened up a pull request to the “dev” branch on git.

Once the PR was approved, the commits were merged onto the dev branch. We had provided the developers with a manual trigger mechanism to deploy the dev branch (which we will explore later).

The codebase on this branch was deployed to the developer environment, a dedicated environment for developers to test their bleeding-edge features. There were two developer environments, one for the developers to test their features and microservices and the other for integration testing.

Once the codebase was tested and the QA team had approved it, the commits were merged further into the “staging” (or “uat”) branch. From here, the codebase was deployed (manually triggered) to the Staging environment, which we strove to keep as close to production as possible.

There was another set of eyes on Staging before we decided which features and microservices would be shipped to production. We also had a separate environment, the demo environment, reserved solely for demonstrating our services to potential customers.

We surveyed and analyzed the best possible tools for CI/CD, and finally, we settled on GitHub Actions. In short, GitHub Actions provided us with an integrated environment that played well with the GitHub workflow and required minimal onboarding for the developer teams.

Along with the tight integration, it also provided us with state-of-the-art CI/CD mechanisms. It was fast, affordable, and easy to configure and customize. It also had an easy YAML-based syntax for defining jobs, which allowed us to commit the CI/CD definition with the codebase and track it with the awesomeness of git.

a. Continuous Integration

We explored a few ways to trigger builds automatically but quickly realized that it's better to let the developers control when to build. There could be many reasons in favor of or against automated deployment and manual deployment.

It is better to align with the teams’ expectations and base this decision on convenience, as too much automation has potential pitfalls.

For our non-production environments, we had a manual trigger to build and deploy the latest builds into the respective environments. In our production environments, builds were triggered automatically on a git tag push. Git tags allowed us to stick to a versioning scheme and make it convenient to roll back the versions in production if required.
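To illustrate how version tags make rollbacks convenient, here is a minimal sketch. It assumes a hypothetical tag format of `v<major>.<minor>.<patch>`; the article does not state the actual versioning scheme, and the function names are illustrative only.

```python
import re
from typing import List, Optional, Tuple

# Assumed tag format: "v<major>.<minor>.<patch>", e.g. "v1.4.2" (hypothetical scheme).
TAG_PATTERN = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def parse_release_tag(tag: str) -> Optional[Tuple[int, int, int]]:
    """Return (major, minor, patch) for a release tag, or None for any other tag."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch)


def rollback_target(tags: List[str], current: str) -> Optional[str]:
    """Return the newest release tag that is older than `current`, if any."""
    current_version = parse_release_tag(current)
    if current_version is None:
        return None
    older = []
    for tag in tags:
        version = parse_release_tag(tag)
        # Compare numerically so that v1.10.0 sorts after v1.9.0.
        if version is not None and version < current_version:
            older.append((version, tag))
    return max(older)[1] if older else None
```

Comparing the parsed tuples rather than the raw strings matters here: a plain string sort would place `v1.10.0` before `v1.9.0` and pick the wrong rollback target.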

Leaving aside the unit tests and the functional tests that the developers ran on their commits, let’s focus on the infrastructure side of things in our CI/CD. The build job, once triggered, basically authenticates with ECR, manages the image cache, and runs the docker build command to create a container image.

This container image is pushed to the respective ECR repository. It is a simple setup but extremely powerful and flexible as it allows us to hand over the trigger to the developers while automating the entire process.
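The build steps described above can be sketched as a short list of shell commands. This is an assumption-laden illustration, not the actual job: the registry, repository and tag values are placeholders, and seeding the cache by pulling a `latest` image and passing it to `--cache-from` is one possible way to manage the image cache, which the article does not spell out.

```python
from typing import List


def build_and_push_commands(registry: str, repository: str, tag: str,
                            region: str, cache_tag: str = "latest") -> List[str]:
    """Return the shell commands for an ECR login, a cached build, and a push."""
    image = f"{registry}/{repository}:{tag}"
    cache_image = f"{registry}/{repository}:{cache_tag}"  # assumed cache source
    return [
        # Authenticate the Docker client against the ECR registry.
        f"aws ecr get-login-password --region {region} "
        f"| docker login --username AWS --password-stdin {registry}",
        # Pull an earlier image so its layers can seed the build cache;
        # "|| true" keeps the job going when no earlier image exists.
        f"docker pull {cache_image} || true",
        f"docker build --cache-from {cache_image} --tag {image} .",
        f"docker push {image}",
    ]
```

A CI job could print or execute these commands in order; keeping them in one function makes it easy to review exactly what a manual trigger will run.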

b. Continuous Deployments

We had a manual trigger provided to the developers to deploy the latest builds onto the respective environments. We used Docker image tags to discriminate between the container images for the individual git branch and the environment.
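As an illustration of tagging images per branch and environment, consider the sketch below. The branch names come from the article, but the environment labels, the `<environment>-<short sha>` layout and the function name are assumptions made for this example.

```python
# Hypothetical branch-to-environment mapping; only the branch names are from
# the article, the environment labels and tag layout are illustrative.
BRANCH_ENVIRONMENTS = {"dev": "dev", "staging": "staging", "uat": "staging"}


def image_tag(branch: str, commit_sha: str) -> str:
    """Build an image tag such as 'staging-1a2b3c4' for a branch and commit."""
    try:
        environment = BRANCH_ENVIRONMENTS[branch]
    except KeyError:
        # Refuse to tag images for branches that have no target environment.
        raise ValueError(f"no environment is mapped to branch {branch!r}") from None
    return f"{environment}-{commit_sha[:7]}"
```

Including the short commit hash keeps every build distinguishable, so a deploy job can pick exactly the image it was asked for.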

Our microservices ran on Kubernetes. A typical deploy job does the following:

  1. Update the configuration files & variables: We delete the existing configmaps and recreate them, as creating a configmap fails when one with the same name already exists. Our configuration files are committed to a private git repository, and the sensitive information is stored on AWS Secrets Manager. Because git tracks the configuration files, it is easy for us to delete and re-create the configmaps in Kubernetes.
  2. Update the deployment & service on Kubernetes (in case there are changes to the YAML files).
  3. Finally, trigger a new rollout of the deployment, because Kubernetes does not automatically restart the deployments/pods just for changes to the configuration files (configmaps).
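The three steps above can be sketched as a sequence of `kubectl` commands. This is a minimal illustration under stated assumptions: the namespace, resource names and directory paths are hypothetical parameters, and the real job may differ in detail.

```python
from typing import List


def deploy_commands(namespace: str, deployment: str, configmap: str,
                    config_dir: str, manifest_dir: str) -> List[str]:
    """Return the kubectl commands for one deploy, in execution order."""
    kubectl = f"kubectl --namespace {namespace}"
    return [
        # Step 1: drop the old configmap (if present) and recreate it from files.
        f"{kubectl} delete configmap {configmap} --ignore-not-found",
        f"{kubectl} create configmap {configmap} --from-file={config_dir}",
        # Step 2: apply any changes to the deployment and service manifests.
        f"{kubectl} apply -f {manifest_dir}",
        # Step 3: restart the pods so they pick up the new configuration,
        # then wait until the rollout has finished.
        f"{kubectl} rollout restart deployment/{deployment}",
        f"{kubectl} rollout status deployment/{deployment}",
    ]
```

The final `rollout status` call is an addition for this sketch: it makes the job fail visibly if the new pods never become ready.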

We used resources from the GitHub Actions community to help us easily connect, authenticate, and communicate with EKS/ECR & Docker.

4. Architecture

We used AWS for our infrastructure needs. AWS provides us with a rich ecosystem of services that efficiently runs and manages our fleet of microservices and related backend and middleware.

These microservices run atop Kubernetes managed by AWS EKS. The AWS EKS offering saves us a lot of time and effort that Xoxoday would otherwise spend on managing the lifecycle of the Kubernetes cluster itself. That allows us to take advantage of the cloud-native ecosystem while focusing on our services and working on them.

Kubernetes brings in about a decade's worth of experience from running the global infrastructure of Google. It provides us with various features, to name a few: self-healing, autoscaling, and high availability, while consolidating and increasing the efficiency with which we use our underlying infrastructure. We use the ingress to expose our services to the outside world via a mixture of Application and Classic Load Balancers and Route 53 DNS entry automation.

These services connect to our backend, which is hosted on a mixture of AWS managed services, such as managed Kafka and RDS, and self-hosted EC2 instances. We rely heavily on Terraform for the automated setup of EKS, and on a blend of Terraform and AWS Launch Templates to spawn the EC2 instances and manage them automatically.

Additionally, we use SaltStack for the more complex work of managing the automated setup of the fleet of EC2 instances. It allows us to automatically manage, update & maintain the Operating System and the services running on top. SaltStack has a robust feature set and a brilliantly designed, flexible & pluggable architecture. We then automatically provision the services, take the newly configured backend settings (IP addresses etc.), populate the Kubernetes configuration files, update the configmaps, and deploy them into the environment.

This allows the stateless services running in the Kubernetes environment to be provisioned and reconfigured automatically whenever the backend setup changes. Our container images are stored on AWS ECR, another service that integrates seamlessly with our set of technologies and architecture.
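The step of populating configuration files with freshly provisioned backend settings can be sketched with a simple template. The keys, file layout and addresses below are hypothetical; the article does not describe the actual configuration format.

```python
from string import Template
from typing import Mapping

# Hypothetical settings file for one microservice; the keys are illustrative.
CONFIG_TEMPLATE = Template(
    "DATABASE_HOST=$database_host\n"
    "KAFKA_BROKERS=$kafka_brokers\n"
)


def render_config(backend: Mapping[str, str]) -> str:
    """Fill the template; substitute() raises KeyError if a value is missing."""
    return CONFIG_TEMPLATE.substitute(backend)
```

Using `substitute()` rather than `safe_substitute()` is a deliberate choice in this sketch: a missing backend value stops the job instead of shipping a half-filled configuration file into a configmap.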

5. Logging, Monitoring, Alerting & APM

We would be blind if we did not have the appropriate feedback mechanisms to understand what’s going on in our infrastructure, especially in a polycloud scenario.

a. Elasticsearch/Fluentd/Kibana

We utilize the EFK stack (Elasticsearch, Fluentd, Kibana) for our logging-related concerns. This allows us to create and maintain a dynamic set of microservices running across containers and virtual machines while retaining the logs in a centralized pipeline.

This allows us to access logs for instances/containers that have been destroyed for various reasons, which makes it easier for us to investigate and address concerns in production.

b. Grafana/Prometheus

Grafana dashboards, along with the Prometheus time-series database, allow us to retain various details about our infrastructure, such as CPU, RAM, and storage usage, and keep us updated on the infrastructure’s current status in near real-time.

That allows us to implement alerting mechanisms, making it easier for our DevOps team to handle incidents in our infrastructure. Despite all preparations and the best of the best architecture & design, things can still break. The Grafana Prometheus systems, along with Alertmanager, allow mitigating incidents in production.
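To make this feedback loop concrete, the sketch below shows how a script could ask Prometheus which instances exceed a threshold. It is an illustration only: the `/api/v1/query` endpoint and the response layout belong to the Prometheus HTTP API, but the server address, expression and threshold are assumptions, and production alerting would normally live in Prometheus alert rules routed through Alertmanager rather than in a script like this.

```python
from typing import Any, Dict, List
from urllib.parse import urlencode


def instant_query_url(base_url: str, expression: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expression})}"


def instances_above(response: Dict[str, Any], threshold: float) -> List[str]:
    """Return the 'instance' labels whose sample value exceeds the threshold."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    breaching = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # the sample value arrives as a string
        if float(value) > threshold:
            breaching.append(series["metric"].get("instance", "unknown"))
    return breaching
```

The parsing function takes the already-decoded JSON response, so it can be exercised against a saved sample without a running Prometheus server.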

Additionally, this stack is complemented by Application Performance Monitoring (APM), which provides us with a rich set of metrics that give us critical insights into the products, their usage, and more.

6. Disrupting the developer environments

Since we were walking the path of microservices, we had to manage over 40 microservices which, when orchestrated together, form our web service. This implies a lot of communication between these microservices, and the sheer number of them makes it impractical to run them all locally on a developer’s machine.

Hence, we had to get creative and go back into the deep waters of the cloud-native world to find a possible solution. We saw many, to name a few:

- Telepresence

- Skaffold

- Kubefwd

At present, our favorite, and the one that best fulfills our needs, is Kubefwd, but we may explore the other tools in the future. Kubefwd allows us to make Kubernetes services accessible on the local workstation using the same DNS names, as if the developer’s local machine were located inside the Kubernetes cluster!

This means an amazing productivity boost that makes developers more efficient while enhancing the overall experience and reducing the mean time to get a feature out there into the world.

7. Conclusion

After months of scrutinizing our infrastructure needs and getting it aligned with the business requirements, future growth, and current trends in the infrastructure and cloud-native world, we have finally come to a point where we can confidently look forward to some amazing days ahead.

Not only has our infrastructure taken a decade’s worth of evolutionary leaps, but it also managed to reduce the strain on DevOps teams while drastically increasing the developer productivity. Finally, our setup is cheaper and does more, and does it better!

This setup allows us to easily replicate our entire set of services, frontend and backend, and have it set itself up automatically within a few minutes. That enables us to create multiple atomic, independent production environments, fit them with our CI/CD pipelines, and automatically manage and maintain the complex setup with ease.

Additionally, the most crucial benefit of all this is that our DevOps teams can remain small and scale linearly while providing exponential global scale for our services. And the cherry on the top is that the DevOps team gets to have peaceful evenings & great weekends without having to fire-fight production issues on a reactive basis as our design and architecture allows us to address these challenges proactively.

One of the fantastic things about our setup is that its automated nature allows us to utilize spot instances for our non-production and non-critical systems. This helps us optimize and consolidate costs further while having ample computing power at our disposal. The result is a highly cost-effective setup, and if AWS decides to shut down our spot instances, no problem: within a few minutes, we autoscale our dedicated instances and make the best of both worlds!

The current trends in the infrastructure world, especially looking at the Cloud Native developments, make it an exciting decade ahead for DevOps; it feels like we are near to achieving nirvana. The sheer number of tools, solutions, services & ecosystems that are popping up could be overwhelming at times. Still, in retrospect, we are heading towards an exciting era of rock-solid web services that are robust and always available while providing a fantastic experience for the users.

Pranav Salunke