Highlight 2025

Reflecting on a year of infrastructure work, incident management, and the transformative impact of AI on software engineering.

Company

This year, I continued my work at Vertesia, an AI startup focused on content management. I really enjoyed working here. I had the privilege of working with highly talented individuals, regardless of age or location. I also had the privilege of working with other developers in person in Paris. We shipped many features over the last year. Here are some of the changes I contributed to.

Kubernetes

This year, there were many changes on the Kubernetes side.

  • Service expansion: We deployed multiple services using Kubernetes, including system-worker, image-worker, github-worker, and some custom workers for customers.
  • Isolation: We now have two clusters: one for development and one for production, which isolate workloads in different environments. The workloads are also attached to different Kubernetes Service Accounts, which allows us to identify them better.
  • Infrastructure as code: We also defined the Kubernetes resources directly in the source and integrated them into the automation. These manifests are applied when changes are merged into the shared branch. Keeping them in the repository also lets developers inspect the system’s definitions quickly.
  • Autoscaling: We also used KEDA autoscaling for the Kubernetes workloads. Our workloads are primarily Temporal workers, so using KEDA enables us to scale Deployments based on the number of tasks in Temporal’s task queues.
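To make the autoscaling idea concrete, here is a rough sketch of the arithmetic behind queue-based scaling: the desired replica count is the backlog divided by a per-replica target, clamped between a minimum and a maximum. This is not our actual KEDA configuration; the function name, parameters, and default values are hypothetical and only illustrate the principle.

```python
import math


def desired_replicas(
    backlog: int,
    target_per_replica: int,
    min_replicas: int = 0,
    max_replicas: int = 10,
) -> int:
    """Estimate how many workers are needed for a task-queue backlog.

    Hypothetical helper: all names and defaults are assumptions made
    for illustration, not values taken from a real cluster.
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    # One replica per `target_per_replica` pending tasks, rounded up.
    wanted = math.ceil(backlog / target_per_replica)
    # Never go below the floor or above the ceiling.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    for backlog in (0, 12, 1000):
        print(backlog, "->", desired_replicas(backlog, target_per_replica=5))
```

With a minimum of zero, an empty task queue lets the Deployment scale down completely, which is one reason queue-based scaling suits worker-style workloads.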

Incident Management

There were more than 30 incidents this year, caused by different factors: some were related to APIs, some to authentication, some to the database, and so on. To manage incidents better, we introduced an incident management process.

  • Incident channel. Every time an incident occurs, we create a dedicated Slack channel to group the discussion. This channel lets us find the timeline and quickly search for relevant information.
  • Post-mortem. We always conduct a post-incident analysis to identify the root cause and take action. The actions can relate to the application source code, deployment automation, observability, and more. Most importantly, we never ignore an incident. We always seek improvements and ensure that our system is more robust than it was before. Maybe this year, we can introduce an incident template for our company based on Google’s SRE book: https://sre.google/sre-book/example-postmortem/.
  • Collective work. We introduced an SRE rotation to encourage everyone to participate in the SRE activities. On one side, engineers should be aware of the impact of their source code and perform a first-level analysis when an incident occurs. On the other side, SRE ensures the process is manageable, i.e., the information is discoverable, the process is not too complex, and so on.
  • Observability. Incidents can’t be detected easily without sufficient information. We relied primarily on Datadog for this, using logs, APM, metrics, and monitors. We also shipped business-level information, such as tenant ID and LLM model, to quickly understand the scope of the impact.
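As an illustration of shipping business-level information with logs, here is a minimal sketch using Python's standard logging module to emit JSON lines that carry a tenant ID and an LLM model name. The field names (tenant_id, llm_model), the logger name, and the sample values are assumptions for this example and do not describe our actual log schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line with business-level fields.

    The field names `tenant_id` and `llm_model` are hypothetical.
    """

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Values passed via `extra=` become attributes on the record.
            "tenant_id": getattr(record, "tenant_id", None),
            "llm_model": getattr(record, "llm_model", None),
        }
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("worker")  # hypothetical logger name
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.error(
        "completion request failed",
        extra={"tenant_id": "tenant-123", "llm_model": "example-model"},
    )
```

Once these fields are indexed by the logging backend, filtering by tenant or model during an incident becomes a simple query instead of a text search.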

After working at Datadog for years, I realized that having incidents doesn’t simply mean that something is broken. Beyond mitigating the issues, it’s also an opportunity to improve: improve the process, enforce architectural decisions, and strengthen observability. It’s also an opportunity to build trust with the customers. When you consistently operate with high standards and put your customers first, people know that they can rely on you. When a company grows and develops new features, unexpected things will happen. What matters is how significant the impact is, how long it lasts, and how you deal with it.

Authentication

This year, I also contributed to the authentication and authorization service to help the team improve the flow.

  • Improve the signing mechanism. We moved from an in-house signing mechanism to Google Cloud’s Key Management Service (KMS). Doing so improves our security posture and reduces operational risk. The key material cannot be read or exported, so an attacker cannot copy it out of our systems.
  • Move authorization to a dedicated service. We introduced a new Secure Token Service (STS) to generate JSON Web Tokens (JWTs) for the platform, a task previously handled by the monolith.
  • Firebase Tenant. We adopted the tenant-based architecture so that we can store the customers in different tenants. This approach provides better isolation between customers and allows us to provide additional capabilities to customers, such as System for Cross-domain Identity Management (SCIM).
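To show why delegating signing to a key service keeps the key out of the application, here is a minimal sketch of issuing a JWT where the signature comes from an injected signer object. The code only builds the signing input and never sees the key. Everything here is an assumption for illustration: the Signer interface, the issue_jwt function, and the claim names are hypothetical, and LocalHmacSigner is a local stand-in so the example runs. A KMS-backed signer would implement the same interface with an asymmetric algorithm and is not shown.

```python
import base64
import hashlib
import hmac
import json
from typing import Protocol


class Signer(Protocol):
    """Hypothetical interface: anything that can sign bytes."""

    alg: str

    def sign(self, data: bytes) -> bytes: ...


class LocalHmacSigner:
    """Stand-in signer for this example only (HS256 with a local secret)."""

    alg = "HS256"

    def __init__(self, secret: bytes) -> None:
        self._secret = secret

    def sign(self, data: bytes) -> bytes:
        return hmac.new(self._secret, data, hashlib.sha256).digest()


def b64url(raw: bytes) -> str:
    # JWTs use URL-safe base64 without padding.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def issue_jwt(claims: dict, signer: Signer) -> str:
    """Build header.payload, then ask the signer for the signature."""
    header = {"alg": signer.alg, "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = signer.sign(signing_input.encode("ascii"))
    return f"{signing_input}.{b64url(signature)}"


if __name__ == "__main__":
    token = issue_jwt({"sub": "user-1", "tenant": "tenant-123"}, LocalHmacSigner(b"demo"))
    print(token)
```

Because the token service depends only on the signer interface, swapping the signing backend does not change the token-building code.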

CI/CD

Continuous Integration and Continuous Delivery (CI/CD) is key for a software company to innovate at a fast pace. With robust CI/CD, developers can quickly build the features they want, share them with the team in a development environment, run automated tests to validate them, and finally deploy them to the target environment. Our CI/CD is based on GitHub Actions.

  • Deployment pipeline. The deployment pipeline became a multi-stage pipeline over the last year. It consists of build, test, and deployment stages. This pipeline can be triggered automatically when a commit is pushed, or manually by a human. Many parts are configurable, and all of them are operational. This means that most team members can handle a deployment, regardless of their expertise in CI/CD. It simply requires filling in a form, and the full automation is triggered. It orchestrates multiple services used by the platform, hosted in Google Cloud (Cloud Run, GKE), AWS (App Runner), and Vercel.
  • Cross-repository and cross-branch updates. Most of our customer-facing components are open source, but the service implementations are mainly closed source. In practice, we use Git submodules to include the open-source repositories in the closed-source ones. We rely heavily on GitHub Actions to ensure that the Git submodules are up to date. We also have multiple shared branches, which require additional automation to keep in sync.

ESIGELEC

This was the second year that I went to ESIGELEC, an engineering school in Normandy, to teach my course “Kubernetes”. It’s a 20-hour module that covers the basics of Docker and Kubernetes. This year, I adopted the microservices version of “Spring PetClinic” to bring more realistic examples to the students: they learned how to define and deploy Java applications to Kubernetes and set up inter-service communication. This choice is based on their curriculum, which already includes other courses on Java and REST APIs, so this course introduced them to the operational side. For more details, see my blog post: ESIGELEC Kubernetes 2025 Recap.

Thoughts

  • AI is rapidly changing the software industry. Like it or not, AI is quickly and profoundly shifting the software landscape. Developers use AI to generate code. People in functional roles, such as Product Owner, Program Manager, and Quality Assurance (QA), use AI for vibe coding. As a Software Engineer, I think we should leverage AI to help ourselves and our teams deliver more value. This is the only way to benefit from this game. It means using AI to learn new concepts, assist with system design, develop software, handle operations, and expand your scope. I like to think of it as a personal assistant that helps you achieve more in your job.
  • Think more about the problems and the product. As software engineers, we used to spend a significant portion of our time developing software. But nowadays, development is becoming easier. We can use AI to assist with development. We can rely on frameworks and tools to accomplish different tasks. It means that the “HOW” (how to implement a specific feature) is less important. We need to spend more time on the “WHY” and “WHAT”: why this feature is important, why it brings value to the users, why not choose another approach, etc. What the expectations are, what the constraints are, what users may expect, what the company’s strategy is, etc. Knowing those “WHY”s and “WHAT”s allows us to focus on the more important problems to solve and therefore ensure we deliver more value.

Thank you for spending time reading this article. I wish you a happy new year and see you next time!