Tinder’s migration to Elasticsearch 8

Authored by: Igor Sokolov, Jessica Hickey, and Rongxin Du

Elasticsearch is a distributed, free and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured.

Elasticsearch powers the Recommendations (further referred to as Recs), Trust & Safety capabilities and logging systems at Tinder. Over 90% of all our Recs comes from a single Elasticsearch cluster.

At Tinder Recs, Elasticsearch (further abbreviated as ES) provides two core capabilities:

Filter-based search: an ability to sift through millions of user profiles to find candidates based on location, age, interests and other preferences.
Advanced ranking using our custom Tinder ES plugin: an ability to apply custom scoring algorithms using ML models for personalized and engaging user experiences.

Tinder ES plugin is a crucial in-house technology that allows us to run complex, observable, efficient and testable scoring algorithms written in Java. The blog posts “How We Improved Our Performance Using ElasticSearch Plugins: Part 1” and “How We Improved Our Performance Using ElasticSearch Plugins: Part 2” shed light on some major aspects of the Tinder ES plugin.

‍

Drivers Behind the Migration Effort

Since Tinder’s launch, Elasticsearch has been a cornerstone technology powering our Recommendation system. It served us well during the early days, but by 2021, a combination of technical debt accrued over time and a focus on other critical parts of Tinder’s infrastructure left our Elasticsearch setup increasingly strained. These challenges not only impacted the engineering team’s efficiency but also hindered Tinder’s ability to leverage cutting-edge advancements in search technology.

‍

Operability and Maintainability Challenges

By 2021, the core Elasticsearch clusters were still running a version 6, that was quickly approaching its end-of-life. This created substantial risks for maintaining stability. However, the more pressing issue was that Elasticsearch 6, along with its legacy customizations, had become a bottleneck in Tinder’s tech stack. For example, debugging a specific MLeap model loading issue consumed nearly two months of engineering time, which was a symptom of the fragility of our setup and the inefficiency of dealing with an aging infrastructure.

Adding to the complexity was the original cluster deployments that relied heavily on Puppet, custom scripts, and manual EC2 instance provisioning. This meant that any operational incidents or scaling efforts drained significant Cloud Ops resources, turning what should have been routine tasks into time-consuming, error-prone processes. There was also no integration with our in-house Infrastructure as Code (IaC) framework, Scaffold, forcing all changes to go through manual JIRA-based workflows. This approach not only slowed iteration, but also introduced inconsistencies between production and non-production environments.

‍

Staying Modern and Competitive

Beyond maintainability, we saw an urgent need to modernize our search capabilities to stay competitive in the dating app space. Advancements in Elasticsearch 7 and 8, such as embedding-based retrieval and first-class vector search support, represented a leap forward in enabling more data-driven, personalized recommendations. These features were critical for Tinder to deliver the best possible user experience and keep pace with other apps serving similar audiences.

Additionally, staying modern meant embracing performance optimization opportunities. The newer Elasticsearch versions, built on advancements in Lucene, brought significant improvements developed by some of the best minds in the Java and search community. These enhancements were designed to boost cluster performance, improve response times, and increase stability, which in turn could lower infrastructure costs. This modernization was particularly important at Tinder’s scale, where efficiency and performance directly impact user experience and operational costs.

‍

Strategic Considerations for Staying with Elasticsearch

During our analysis, one key question emerged: should we continue our journey with Elasticsearch, or should we consider alternatives like OpenSearch or other search technologies?

After thorough evaluation, we decided to stay with Elasticsearch for several reasons:

Custom Plugin Advantage: A core part of Tinder’s Recs system relies on a highly customized Elasticsearch plugin and the ability to run custom Java scoring scripts. Migrating to AWS OpenSearch managed service would have meant losing this flexibility, as custom plugins were not supported at the time.
Kubernetes Support: Building a self-service platform for Elasticsearch, akin to a managed cloud service, was a key goal for this migration. While OpenSearch offered Kubernetes support, it was relatively immature at the time, making Elasticsearch the better choice for achieving our operational objectives.
Institutional Knowledge: Tinder had already invested significantly in developing expertise around Elasticsearch. This institutional knowledge, combined with Elasticsearch’s position as one of the leading open-source search engines, made it the logical choice for us to continue building on.

By addressing the operability challenges and modernizing our infrastructure, we could elevate Elasticsearch to a platform-like experience, reducing the operational burden and unlocking new opportunities for innovation. This migration wasn’t just about staying current; it was about enabling Tinder’s engineering teams to move faster, iterate more efficiently, and ultimately deliver better experiences to our users.

‍

Elastic Cloud for Kubernetes at Tinder

Kubernetes support has been a cornerstone of Tinder’s infrastructure strategy, making it a natural choice to manage Elasticsearch clusters using the same ecosystem. Integrating Elasticsearch into our Kubernetes ecosystem also allowed us to align with our in-house Infrastructure as Code framework, Scaffold, ensuring a seamless and unified provisioning experience. By configuring clusters and indexes with simple YAML files, we aimed to empower owning teams with self-service capabilities.

Elastic Cloud on Kubernetes (ECK) provided the ideal solution as a Kubernetes operator developed by Elastic to manage components like Elasticsearch, Logstash, and Kibana. With ECK, we could declare Elasticsearch clusters as Kubernetes objects, and the operator ensured safe execution of operations such as bootstrapping, configuration changes, and scaling.

In 2021, we began managing Elasticsearch indexes through Scaffold, which became the first step in tackling the larger challenge of cluster migration. While ECK offered robust support for basic cluster operations, we had to address several challenges to align it with Tinder’s requirements. This included:

Integrating ECK with Scaffold for automated and dynamic Kubernetes deployments.
Implementing transit encryption within our service mesh.
Enabling autoscaling mechanisms to optimize resource usage.
Managing ephemeral storage safely in a Kubernetes environment.

By 2022, we successfully deployed our first ECK-managed cluster into production. This marked the foundational step in building Tinder’s Elasticsearch platform and set the stage for broader adoption. It also established a collaborative effort between Cloud Ops and the Recs Infra team to enhance the platform further, including adding support for the Tinder ES plugin and initiating the Recs Elasticsearch 8 migration.

‍

Recs ES 8 Migration challenges

The major challenges related to the Recs clusters migration involved both infrastructure and application-level complexities:

Breaking changes: One of the biggest challenges was adapting to the prohibition of negative scoring in ES8, which required a significant overhaul to preserve the existing ranking order and maintain ecosystem stability.
Ecosystem KPI risks: Tinder’s complex sender-receiver ecosystem (swiper and swipee) meant that any ranking changes could significantly impact user experience. Minimizing these risks and ensuring neutral or positive ecosystem KPI changes was a top priority.
Customization and technical debt: Over the last six years, significant customizations had been developed, including 100+ scoring scripts with minimal test coverage. The loss of institutional knowledge compounded the difficulty of maintaining these customizations during the migration.
Cluster diversity and management: ES clusters were distributed across multiple Rec teams, each with unique usage patterns and applications, making a uniform migration strategy impractical.

We approached these challenges with the mindset of turning them into opportunities:

Delivering new capabilities: We leveraged the migration to introduce business-critical features, such as Elasticsearch Approximate Nearest Neighbor (ANN)/K-Nearest Neighbors (KNN) search to enable advanced deep learning capabilities.
Platformization: The migration was an opportunity to democratize the Tinder ES plugin, which had previously been limited to the Recs Infra team. This effort enabled other teams to leverage the earned knowledge and capabilities independently, reducing reliance on Recs Infra for new initiatives. By creating a scalable and reusable foundation, we aimed to empower teams to self-serve their Elasticsearch needs and accelerate innovation.

‍

ES migration framework

One of the critical goals we defined was to establish a reusable ES migration framework. We recognized that this wouldn’t be a one-time migration effort. While major version upgrades are expected every 2–3 years, even minor ES version updates — expected to occur multiple times a year — carry a significant risk of business impact if issues arise. To mitigate this, we aimed to create a repeatable process that would instill confidence and ensure smooth application of changes across all upgrade scenarios.

Given the diversity in cluster size and purpose, a “one-size-fits-all” approach proved ineffective. For instance, one of our smallest clusters, used for name lookups (e.g., colleague names), does not warrant the same rigor as our primary Recs cluster. To address this, we developed a migration template with recommended stages and steps. Teams could tailor the process based on the cluster’s specific nature and use cases, opting to follow all stages or focusing on a minimum viable path. Additionally, we created a suite of tools to streamline these stages and steps. The next sections delve deeper into these stages and the tools we developed.

‍

Stage 1: Achieve Data Consistency (Write Path)

The first stage focuses on the write path, aiming to establish eventual consistency between the new and old Elasticsearch clusters.

The key prerequisite is having a single source of truth containing all updates related to the Elasticsearch cluster. In our case this was a Kafka topic used as a Stream as it is defined in the well known Confluent “Stream-table duality” primer. The process follows four main steps:

‍

Create the new cluster and the associated worker.

A new Kafka worker gets a new consumer group with the latest position, initially configured to log messages without sending them to the new cluster. Once the logs are verified for correctness, the worker’s consumption is paused.

‍

Backfill data.

A custom backfill job (called esreindexjob) is triggered to copy data from the old cluster to the new one. This job leverages the Elasticsearch reindex API, enhanced to handle multiple indices (e.g., geo-sharded clusters — see details Geosharded Recommendations Part 1: Sharding Approach) and apply slicing for parallelized reindexing even in case of cross-cluster replication (not supported out the box).

‍

Resume real-time updates.

After the backfill job completes, the Kafka worker is unpaused and begins sending updates to the new cluster. This ensures that all changes missed during the backfill are applied. Once the Kafka worker catches up, the new cluster achieves eventual consistency with the old one.

‍

Verification.

The last step is to perform two type of checks:

Automated checks: A custom data consistency verification job retrieves random document IDs from one cluster, fetches those documents from both clusters, and compares them field by field. Metrics track mismatches, which should only occur for frequently updated fields (e.g., last active time).
Manual checks: Additional comparisons are conducted by analyzing document slices based on criteria like cohorts or last active dates. This provides confidence that the two clusters are aligned or nearly identical.

‍

Stage 2: Offline Evaluation

As highlighted in the Migration Challenges section, we encountered several complexities:

Selective ES usage: While Elasticsearch offers extensive capabilities, Tinder Recs relies on only a subset.
Knowledge gaps: Over time, maintaining comprehensive documentation of all search functionalities became challenging due to accumulated tech debt and organizational changes as engineers transitioned to new opportunities, resulting in some gaps in knowledge.Over time, tracing all utilized search functionalities became difficult due to engineers leaving the company, leading to partial knowledge loss.
Evolving usage patterns: Recs capabilities and approaches have evolved, influenced by changes in user behavior and system requirements.
Customizations at risk: Our custom components (e.g., loader plugin, scoring scripts) depend on internal ES APIs and external libraries, which are prone to silent changes that could cause breakages during migration.

To address these challenges, we decided to leverage production traffic. This approach ensured that we focused only on actively used ES functionalities while enabling realistic performance evaluation and tuning, avoiding reliance on potentially misleading synthetic tests.

Given the high impact of real-time logic injection, we started by relaying a portion of production traffic to the new ES cluster.

We designed the solution with the following goals:

Minimal production impact: Using a non-blocking, best-effort transmission approach to prevent disruptions.
Non-intrusiveness: Avoiding code changes or redeployment of running services.
Flexibility: Allowing traffic relay to start/stop dynamically and process only a controlled fraction of requests.
Scalability: Ensuring compatibility with a clustered and highly distributed environment.

We identified three theoretical points where traffic could be intercepted and relayed to the test Tinder ES cluster:

Before reaching the search service — intercepting incoming search traffic.
Inside the search service — capturing search events at the application level.
After reaching the ES cluster — intercepting raw HTTP requests to Elasticsearch.

The picture below illustrates that:

At the same time, we explored two major approaches for relaying traffic:

Event-based replay: Sending events to a messaging system (e.g., Kafka, Redis Streams) and replaying them through an external service.
HTTP mirroring: Copying original HTTP packets to another socket in memory with optional modifications and no intermediate data storage.

After careful evaluation, we opted for the event-based replay approach from within the Search Service. While HTTP mirroring via a service mesh is widely used in the industry (e.g., Shadow mirroring with Envoy by Mark Vincze, Traffic Shadowing With Istio: Reducing the Risk of Code Release, Advanced Traffic-shadowing Patterns for Microservices With Istio Service Mesh by Christian Posta), we chose the more controlled and flexible method of sending and receiving events. Our key reasons included:

Greater control and reusability: Persisted production traffic allows for controlled replay and debugging.
Infrastructure complexity: HTTP mirroring required significant investment in Tinder’s scaffold/mesh capabilities (e.g., Envoy request mirror policy with percentage control via etcd).
Request modifications: We needed the ability to modify queries before sending them to the new cluster to validate scoring logic changes and iterate quickly. We had low confidence that envoy Lua scripting would provide sufficient flexibility, making a dedicated relay application the more viable option.

As a result, we adopted the approach illustrated in the diagram below:

The diagram illustrates two critical aspects that offline evaluation must address:

Correctness: Ensuring that the custom scoring logic functions as expected in the new ES cluster. The objective is to verify that replacing the underlying scoring platform does not alter the recommendation outcomes.
Performance: Verifying that the Recs system remains compliant with the internal Service Level Agreement (SLA), maintaining a seamless user experience. While the new ES version was expected to offer performance improvements, we needed to confirm that our legacy scoring logic leveraged up-to-date mechanisms rather than inefficient workarounds that could degrade performance.

‍

Offline correctness evaluation

During the offline correctness evaluation, a small percentage of production traffic was captured, storing both the Elasticsearch request and its response in each event. The ES offline evaluation tool processes the captured Elasticsearch request, modifies it as needed, sends it to the new ES cluster, and compares the received response against the original using two approaches:

Set-based comparison: The search results (hits) are treated as an unordered set of user numbers. The algorithm measures how many users appear in the response from the old ES cluster versus the new version. While this metric provides a broad sense of discrepancy, it has limitations, as recommendation candidates typically number in the thousands, but users primarily engage with only the top-ranked results.
List-based comparison using Levenshtein distance: Search results are compared with their respective positions considered. The Levenshtein distance algorithm helps prevent skewed metrics due to minor shifts early in the list. To emphasize the importance of top-ranked candidates, we applied bucketing techniques — comparing only the first 20 results, the first 100, etc. This ensured that discrepancies in the most visible results carried more weight than those buried deeper in the ranking.

‍

Offline performance evaluation

For offline performance evaluation, a significant portion of sampled production search traffic was routed to the ES evaluation tool, capturing only the search request. The same request modifications used in correctness evaluation were applied here. The primary focus was on ES cluster CPU utilization, response times, and overall stability. With the test ES cluster running at a controlled fraction of the main Recs cluster’s load, we performed both load and stress tests, yielding several key insights:

Performance improvements: The Elasticsearch and Lucene communities have made substantial optimizations since Elasticsearch 6. Across our primary search queries, response times improved by 8% to 33%. While a few much simpler queries with typical small hit size showed a significant increase (more than 50%), their overall impact remained negligible and well within SLA compliance.
Optimal node sizing: Larger but fewer data nodes performed better. We started with c6id.8xlarge and later scaled up to c6id.16xlarge. Given our CPU-intensive recommendation algorithms, larger nodes improved resource allocation efficiency while reducing overhead from redundant data copies and coordination tasks.
Merge policy tuning: Since our geo-sharded indices are significantly smaller than typical log or metric datasets, adjusting merge.policy.segments_per_tier from 10 to 5 reduced CPU usage and improved response times.
Parallel query execution stability: Initially, we disabled parallel search execution due to indexing load-induced exceptions, suggesting race conditions. The parallel search query execution was as a long-standing community request (#80693 — Concurrent Searching) and was enabled by default in ES 8.12. Testing later versions (8.15 and beyond) confirmed the resolution, allowing us to safely re-enable parallel execution.

‍

Geo-Sharding Re-Evaluation

As part of our offline performance optimization, we reassessed our Elasticsearch geo-sharding approach. Originally implemented to address scaling challenges six years ago, geo-sharding provided clear benefits but also introduced long-term maintenance costs, including:

Shard imbalance over time: We have to invest engineering resources into a custom tool to periodically rebalance geo-shard distribution and allocations.
Limited compatibility with standard connectors: Technologies like Kafka Connect Elasticsearch connector and Apache Camel Elasticsearch component obviously have no support for our custom geo-sharded setup.
Constraints on exploring alternative search technologies: Geo-sharding made evaluating alternative search engines and vector databases more complex.

After extensive testing, we concluded:

Geo-sharding remains advantageous: It continues to deliver significant CPU savings compared to a single-index approach (12% CPU utilization vs. 50%).
Increasing shards per node: Raising the number of shards per node from 3 to 14 had minimal impact on resource consumption while ensuring smoother load distribution and reducing the risk of overload during traffic spikes.
Doubling primary shards for compute-heavy workloads: Increasing primary shards from 1 to 2 nearly doubled CPU usage but improved response times for scoring-heavy queries by 25–30%. This adjustment enables future adoption of more computationally demanding models while maintaining p99 SLA compliance by parallelizing search tasks.

These findings helped refine our ES migration strategy, ensuring both performance improvements and operational scalability for Tinder Recs moving forward.

‍

Stage 3: Online Evaluation and Cutover

At this stage, we had identified and resolved all discrepancies stemming from ES6 to ES8 behavior changes and finalized the optimal ES cluster configuration to meet our performance expectations with high confidence. While offline evaluation allowed us to efficiently detect and address functional and non-functional issues, it was conducted in a controlled environment rather than the actual services generating production search traffic. To bridge this gap, we introduced an additional “online” evaluation step to validate that the integration of ES8 within our services functioned correctly in real-world conditions.

Correctness was assessed using the same key metrics from offline evaluation; however, this phase was designed to eliminate the iterative explore-fix-explore cycle. Instead, the expectation was that all major issues had already been addressed, making this a validation step rather than a discovery phase.

Given the potential impact of any migration-related issue on Tinder’s search and recommendation systems, we prioritized minimizing risk to business KPIs. To ensure a smooth transition, we adopted a phased rollout approach, implementing the migration through three waves of A/B testing. This strategy allowed us to closely monitor system behavior and confirm that the migration had no negative impact on key business metrics.

‍

Outcomes

‍

Complete Migration of Recs ES Clusters

After an extensive and meticulously planned migration process, all Recs Elasticsearch clusters have successfully transitioned to ES8. This was accomplished with zero outages and a data validation discrepancy of less than 0.2%, ensuring continuity and reliability for Tinder’s recommendation system.

‍

Unlocking Advanced Capabilities

The migration to ES8 has enabled Tinder to leverage cutting-edge search and recommendation features, including:

k-nearest neighbor (kNN) vector search, unlocking new potential for AI-driven recommendations.
Two-Tower (2T) Deep Learning models experimentation, enhancing the personalization of user recommendations:
The 2T P(Match) experiment resulted in a +6.5% match rate increase and a +22% match volume lift.
The 2T P(Like) experiment led to a +3.8% Swipe Right Rate (SRR) increase.
Profile embeddings similarity search, allowing real-time “MoreLikeThis” functionality to recommend similar profiles dynamically.

‍

Establishing the Tinder Elasticsearch Platform

The migration paved the way for the Tinder Elasticsearch platform, which is now primed for company-wide adoption. This standardized platform reduces operational complexity and facilitates scalable growth across teams.

‍

Operational Efficiency Gains

Significant efficiency improvements were realized through:

Million plus dollar annual cost savings, achieved through optimized infrastructure and resource allocation.
Faster iteration cycles, enabling 13 iterations of the Tinder ES plugin and 4 Elasticsearch version upgrades, ensuring that we remain at the forefront of search technology.

‍

Performance and Stability Enhancements

Achieved a 12% — 56% reduction in p99 search latency, greatly improving response times for users.
Enhanced cluster stability, reducing operational burden and minimizing incident response times.

‍

Increased Network Costs (Cross-AZ Traffic)

While the migration delivered numerous benefits, it also introduced additional cross-AZ network costs due to the following fact:

The ECK migration expanded from 3 to 4 availability zones, ensuring higher availability and resilience.
Increased replica shards distribution to improve load balancing.
Elasticsearch deprecated shard allocation-aware search routing in favor of adaptive replica selection (see Elastic issue #60236).

These migration outcomes not only improved Tinder’s recommendation capabilities but also established a scalable, high-performance search platform, ensuring continued innovation and optimization for the future.

Tags for this post:

No items found.