
Authored by: Igor Sokolov, Jessica Hickey, and Rongxin Du
Elasticsearch is a distributed, free and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured.
Elasticsearch powers the Recommendations system (referred to below as Recs), Trust & Safety capabilities, and logging systems at Tinder. Over 90% of all our Recs come from a single Elasticsearch cluster.
At Tinder Recs, Elasticsearch (abbreviated below as ES) provides two core capabilities:
The Tinder ES plugin is a crucial piece of in-house technology that allows us to run complex, observable, efficient, and testable scoring algorithms written in Java. The blog posts “How We Improved Our Performance Using ElasticSearch Plugins: Part 1” and “How We Improved Our Performance Using ElasticSearch Plugins: Part 2” shed light on some major aspects of the Tinder ES plugin.
Since Tinder’s launch, Elasticsearch has been a cornerstone technology powering our Recommendation system. It served us well during the early days, but by 2021, a combination of technical debt accrued over time and a focus on other critical parts of Tinder’s infrastructure left our Elasticsearch setup increasingly strained. These challenges not only impacted the engineering team’s efficiency but also hindered Tinder’s ability to leverage cutting-edge advancements in search technology.
By 2021, the core Elasticsearch clusters were still running version 6, which was quickly approaching its end-of-life. This created substantial risks for maintaining stability. However, the more pressing issue was that Elasticsearch 6, along with its legacy customizations, had become a bottleneck in Tinder’s tech stack. For example, debugging a specific MLeap model loading issue consumed nearly two months of engineering time, a symptom of the fragility of our setup and the inefficiency of dealing with an aging infrastructure.
Adding to the complexity were the original cluster deployments, which relied heavily on Puppet, custom scripts, and manual EC2 instance provisioning. This meant that any operational incidents or scaling efforts drained significant Cloud Ops resources, turning what should have been routine tasks into time-consuming, error-prone processes. There was also no integration with our in-house Infrastructure as Code (IaC) framework, Scaffold, forcing all changes to go through manual JIRA-based workflows. This approach not only slowed iteration, but also introduced inconsistencies between production and non-production environments.
Beyond maintainability, we saw an urgent need to modernize our search capabilities to stay competitive in the dating app space. Advancements in Elasticsearch 7 and 8, such as embedding-based retrieval and first-class vector search support, represented a leap forward in enabling more data-driven, personalized recommendations. These features were critical for Tinder to deliver the best possible user experience and keep pace with other apps serving similar audiences.
Additionally, staying modern meant embracing performance optimization opportunities. The newer Elasticsearch versions, built on advancements in Lucene, brought significant improvements developed by some of the best minds in the Java and search community. These enhancements were designed to boost cluster performance, improve response times, and increase stability, which in turn could lower infrastructure costs. This modernization was particularly important at Tinder’s scale, where efficiency and performance directly impact user experience and operational costs.
During our analysis, one key question emerged: should we continue our journey with Elasticsearch, or should we consider alternatives like OpenSearch or other search technologies?
After thorough evaluation, we decided to stay with Elasticsearch for several reasons:
By addressing the operability challenges and modernizing our infrastructure, we could elevate Elasticsearch to a platform-like experience, reducing the operational burden and unlocking new opportunities for innovation. This migration wasn’t just about staying current; it was about enabling Tinder’s engineering teams to move faster, iterate more efficiently, and ultimately deliver better experiences to our users.
Kubernetes support has been a cornerstone of Tinder’s infrastructure strategy, making it a natural choice to manage Elasticsearch clusters using the same ecosystem. Integrating Elasticsearch into our Kubernetes ecosystem also allowed us to align with our in-house Infrastructure as Code framework, Scaffold, ensuring a seamless and unified provisioning experience. By configuring clusters and indexes with simple YAML files, we aimed to empower owning teams with self-service capabilities.
Elastic Cloud on Kubernetes (ECK) provided the ideal solution as a Kubernetes operator developed by Elastic to manage components like Elasticsearch, Logstash, and Kibana. With ECK, we could declare Elasticsearch clusters as Kubernetes objects, and the operator ensured safe execution of operations such as bootstrapping, configuration changes, and scaling.
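As an illustration, a minimal ECK manifest for such a cluster might look like the sketch below. The cluster name, version, and node counts are placeholders, not Tinder's actual configuration:

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: recs-example          # placeholder cluster name
spec:
  version: 8.12.0             # target ES version
  nodeSets:
    - name: data
      count: 3                # the operator applies changes to this count safely
      config:
        node.roles: ["master", "data"]
```

With a manifest like this checked into an IaC repository, a scaling change becomes a one-line diff that the operator rolls out without manual node provisioning.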
In 2021, we began managing Elasticsearch indexes through Scaffold, which became the first step in tackling the larger challenge of cluster migration. While ECK offered robust support for basic cluster operations, we had to address several challenges to align it with Tinder’s requirements. This included:
By 2022, we successfully deployed our first ECK-managed cluster into production. This marked the foundational step in building Tinder’s Elasticsearch platform and set the stage for broader adoption. It also established a collaborative effort between Cloud Ops and the Recs Infra team to enhance the platform further, including adding support for the Tinder ES plugin and initiating the Recs Elasticsearch 8 migration.
The major challenges related to the Recs clusters migration involved both infrastructure and application-level complexities:
We approached these challenges with the mindset of turning them into opportunities:
One of the critical goals we defined was to establish a reusable ES migration framework. We recognized that this wouldn’t be a one-time migration effort. While major version upgrades are expected every 2–3 years, even minor ES version updates — expected to occur multiple times a year — carry a significant risk of business impact if issues arise. To mitigate this, we aimed to create a repeatable process that would instill confidence and ensure smooth application of changes across all upgrade scenarios.
Given the diversity in cluster size and purpose, a “one-size-fits-all” approach proved ineffective. For instance, one of our smallest clusters, used for name lookups (e.g., colleague names), does not warrant the same rigor as our primary Recs cluster. To address this, we developed a migration template with recommended stages and steps. Teams could tailor the process based on the cluster’s specific nature and use cases, opting to follow all stages or focusing on a minimum viable path. Additionally, we created a suite of tools to streamline these stages and steps. The next sections delve deeper into these stages and the tools we developed.
The first stage focuses on the write path, aiming to establish eventual consistency between the new and old Elasticsearch clusters.
The key prerequisite is having a single source of truth containing all updates related to the Elasticsearch cluster. In our case, this was a Kafka topic used as a stream, in the sense defined in the well-known Confluent “stream-table duality” primer. The process follows four main steps:

A new Kafka worker is assigned a fresh consumer group starting from the latest offset, initially configured to log messages without sending them to the new cluster. Once the logs are verified for correctness, the worker’s consumption is paused.
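The mode transitions of such a worker (log-only, paused during backfill, then sending) can be sketched as follows. The class and method names are hypothetical; the real worker sits behind a Kafka consumer group:

```python
from enum import Enum

class Mode(Enum):
    LOG_ONLY = "log_only"   # verify messages without writing anywhere
    PAUSED = "paused"       # hold consumption while the backfill runs
    SENDING = "sending"     # apply updates to the new cluster

class DualWriteWorker:
    """Consumes update events and, depending on its mode, logs them,
    holds them back, or applies them to the new cluster."""

    def __init__(self, new_cluster_writer):
        self.mode = Mode.LOG_ONLY
        self.logged = []
        self.writer = new_cluster_writer

    def consume(self, event):
        if self.mode is Mode.PAUSED:
            return False            # consumption paused; offset not advanced
        if self.mode is Mode.LOG_ONLY:
            self.logged.append(event)
        else:
            self.writer(event)
        return True

# Usage: walk through the three phases with a stand-in writer.
applied = []
worker = DualWriteWorker(applied.append)
worker.consume({"user": 1, "op": "update"})   # logged only
worker.mode = Mode.PAUSED
worker.consume({"user": 2, "op": "update"})   # held back during backfill
worker.mode = Mode.SENDING
worker.consume({"user": 3, "op": "update"})   # written to the new cluster
```

Because the consumer group's offset does not advance while paused, the held-back updates are redelivered once the worker resumes.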

A custom backfill job (called esreindexjob) is triggered to copy data from the old cluster to the new one. This job leverages the Elasticsearch reindex API, enhanced to handle multiple indices (e.g., geo-sharded clusters; see Geosharded Recommendations Part 1: Sharding Approach for details) and to apply slicing for parallelized reindexing even in the cross-cluster case, where it is not supported out of the box.
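A minimal sketch of how such sliced cross-cluster reindex requests might be constructed. The reindex API supports manual slicing for remote sources, so each slice can be submitted as an independent, parallel reindex task; index names and the host URL here are placeholders:

```python
def build_reindex_request(source_index, dest_index, remote_host, slice_id, max_slices):
    """Build one slice of a remote reindex request body for the ES reindex API.
    Each slice covers a disjoint portion of the source index."""
    return {
        "source": {
            "remote": {"host": remote_host},              # old cluster
            "index": source_index,
            "slice": {"id": slice_id, "max": max_slices}, # manual slicing
        },
        "dest": {"index": dest_index},
    }

# Usage: fan out one geo-shard's copy into 4 parallel reindex tasks.
requests = [
    build_reindex_request("users_geo_7", "users_geo_7", "https://old-cluster:9200", i, 4)
    for i in range(4)
]
```

Submitting each body as a separate `_reindex` call parallelizes the copy even though automatic slicing is unavailable for remote sources.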

After the backfill job completes, the Kafka worker is unpaused and begins sending updates to the new cluster. This ensures that all changes missed during the backfill are applied. Once the Kafka worker catches up, the new cluster achieves eventual consistency with the old one.

The last step is to perform two types of checks:
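Without reproducing the exact checks used, consistency validation of this kind often combines a coarse count comparison with a fine-grained sampled-document comparison; the sketch below is illustrative only, and all names and tolerances are assumptions:

```python
def count_check(old_count, new_count, tolerance=0.002):
    """Coarse check: total document counts should agree within a tolerance.
    Exact equality is unrealistic while live writes continue."""
    return abs(old_count - new_count) <= tolerance * max(old_count, 1)

def sample_check(old_docs, new_docs, fields):
    """Fine-grained check: compare selected fields of sampled documents by id.
    Returns the ids whose documents are missing or differ."""
    mismatches = []
    for doc_id, old in old_docs.items():
        new = new_docs.get(doc_id)
        if new is None or any(old.get(f) != new.get(f) for f in fields):
            mismatches.append(doc_id)
    return mismatches
```

A tolerance on the order of the observed 0.2% discrepancy keeps the coarse check from flagging ordinary write lag as data loss.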
As highlighted in the Migration Challenges section, we encountered several complexities:
To address these challenges, we decided to leverage production traffic. This approach ensured that we focused only on actively used ES functionalities while enabling realistic performance evaluation and tuning, avoiding reliance on potentially misleading synthetic tests.
Given the high impact of real-time logic injection, we started by relaying a portion of production traffic to the new ES cluster.
We designed the solution with the following goals:
We identified three theoretical points where traffic could be intercepted and relayed to the test Tinder ES cluster:
The picture below illustrates these options:

At the same time, we explored two major approaches for relaying traffic:
After careful evaluation, we opted for the event-based replay approach from within the Search Service. While HTTP mirroring via a service mesh is widely used in the industry (e.g., Shadow mirroring with Envoy by Mark Vincze, Traffic Shadowing With Istio: Reducing the Risk of Code Release, Advanced Traffic-shadowing Patterns for Microservices With Istio Service Mesh by Christian Posta), we chose the more controlled and flexible method of sending and receiving events. Our key reasons included:
As a result, we adopted the approach illustrated in the diagram below:

The diagram illustrates two critical aspects that offline evaluation must address:
During the offline correctness evaluation, a small percentage of production traffic was captured, storing both the Elasticsearch request and its response in each event. The ES offline evaluation tool processes the captured Elasticsearch request, modifies it as needed, sends it to the new ES cluster, and compares the received response against the original using two approaches:
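One building block of such a comparison is diffing the ranked hit lists of the two responses. The sketch below checks both the result ids and the scores; the function name and tolerance are illustrative, not the tool's actual implementation:

```python
def compare_responses(old_resp, new_resp, score_tol=1e-6):
    """Compare two ES search responses: same ranked ids, and per-position
    scores within a tolerance (scoring changed subtly across major versions)."""
    old_hits = old_resp["hits"]["hits"]
    new_hits = new_resp["hits"]["hits"]
    ids_equal = [h["_id"] for h in old_hits] == [h["_id"] for h in new_hits]
    scores_equal = len(old_hits) == len(new_hits) and all(
        abs(o["_score"] - n["_score"]) <= score_tol
        for o, n in zip(old_hits, new_hits)
    )
    return {"ids_equal": ids_equal, "scores_equal": scores_equal}
```

Separating id equality from score equality helps distinguish harmless scoring drift from genuine ranking regressions.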
For offline performance evaluation, a significant portion of sampled production search traffic was routed to the ES evaluation tool, capturing only the search request. The same request modifications used in correctness evaluation were applied here. The primary focus was on ES cluster CPU utilization, response times, and overall stability. With the test ES cluster running at a controlled fraction of the main Recs cluster’s load, we performed both load and stress tests, yielding several key insights:
As part of our offline performance optimization, we reassessed our Elasticsearch geo-sharding approach. Originally implemented six years ago to address scaling challenges, geo-sharding provided clear benefits but also introduced long-term maintenance costs, including:
After extensive testing, we concluded:
These findings helped refine our ES migration strategy, ensuring both performance improvements and operational scalability for Tinder Recs moving forward.
At this stage, we had identified and resolved all discrepancies stemming from ES6 to ES8 behavior changes and finalized the optimal ES cluster configuration to meet our performance expectations with high confidence. While offline evaluation allowed us to efficiently detect and address functional and non-functional issues, it was conducted in a controlled environment rather than the actual services generating production search traffic. To bridge this gap, we introduced an additional “online” evaluation step to validate that the integration of ES8 within our services functioned correctly in real-world conditions.
Correctness was assessed using the same key metrics from offline evaluation; however, this phase was designed to eliminate the iterative explore-fix-explore cycle. Instead, the expectation was that all major issues had already been addressed, making this a validation step rather than a discovery phase.
Given the potential impact of any migration-related issue on Tinder’s search and recommendation systems, we prioritized minimizing risk to business KPIs. To ensure a smooth transition, we adopted a phased rollout approach, implementing the migration through three waves of A/B testing. This strategy allowed us to closely monitor system behavior and confirm that the migration had no negative impact on key business metrics.
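Wave-based rollouts of this kind need stable user assignment so that a given user stays on the same cluster across requests. A minimal sketch of deterministic hash-based bucketing follows; the function and parameters are assumptions, not Tinder's actual A/B framework:

```python
import hashlib

def rollout_bucket(user_id: str, percent_on_new_cluster: int) -> str:
    """Deterministically assign a user to the old or new cluster.
    A stable hash keeps assignment consistent across restarts and services."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # uniform bucket in [0, 100)
    return "new" if bucket < percent_on_new_cluster else "old"
```

Raising `percent_on_new_cluster` wave by wave only ever moves users from the old cluster to the new one, so each A/B wave strictly extends the previous cohort.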
After an extensive and meticulously planned migration process, all Recs Elasticsearch clusters have successfully transitioned to ES8. This was accomplished with zero outages and a data validation discrepancy of less than 0.2%, ensuring continuity and reliability for Tinder’s recommendation system.
The migration to ES8 has enabled Tinder to leverage cutting-edge search and recommendation features, including:
The migration paved the way for the Tinder Elasticsearch platform, which is now primed for company-wide adoption. This standardized platform reduces operational complexity and facilitates scalable growth across teams.
Significant efficiency improvements were realized through:
While the migration delivered numerous benefits, it also introduced additional cross-AZ network costs, due to the following:
These migration outcomes not only improved Tinder’s recommendation capabilities but also established a scalable, high-performance search platform, ensuring continued innovation and optimization for the future.