<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://harshaavardaan.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://harshaavardaan.com/" rel="alternate" type="text/html" /><updated>2026-02-26T08:26:36+00:00</updated><id>https://harshaavardaan.com/feed.xml</id><title type="html">Harshaa Vardaan</title><subtitle>Harshaa&apos;s site</subtitle><author><name>Harshaa Vardaan</name></author><entry><title type="html">The Hidden Race in Our Data Export Pipeline</title><link href="https://harshaavardaan.com/distributed%20systems/the-hidden-race-in-our-data-export-pipeline/" rel="alternate" type="text/html" title="The Hidden Race in Our Data Export Pipeline" /><published>2026-02-22T14:00:00+00:00</published><updated>2026-02-22T14:00:00+00:00</updated><id>https://harshaavardaan.com/distributed%20systems/the-hidden-race-in-our-data-export-pipeline</id><content type="html" xml:base="https://harshaavardaan.com/distributed%20systems/the-hidden-race-in-our-data-export-pipeline/"><![CDATA[<p>It started with a file that was slightly larger than it should have been.</p>

<p>No crashes. No alerts. Just an export that didn’t look quite right. When we opened it, we found duplicated sections and partially repeated content. At first it felt like bad data or a serialization issue. But the deeper we dug, the stranger it became.</p>

<p>Our platform supports exporting large datasets on demand. A client requests an export, we generate a job identifier, process it asynchronously, and write the final file to a network-attached storage path. Once completed, we update the status and return the file location. The pipeline was built on a Redis-based worker model: jobs were pushed to a queue, and workers polled and processed them in the background. Simple, practical, and stable under normal load.</p>

<p>But after correlating worker logs and job timestamps in production, a pattern emerged. The same job was being processed more than once. The system wasn’t crashing, it was racing. What we uncovered was a subtle race condition hidden inside what looked like a perfectly safe Redis worker model.</p>

<hr />

<h3 id="how-the-race-happened-in-the-redis-worker-model">How the Race Happened in the Redis Worker Model</h3>

<p><img src="/assets/images/component_diagram_v2.svg" alt="Figure 1: Two workers observing the same job before acknowledgment" class="align-center" /></p>

<p>The export process consists of multiple logical stages. A worker fetches data from internal services, transforms and aggregates the dataset, streams the output into a file, and finally updates the job status to mark it as complete. Each stage functions correctly in isolation. The failure did not originate inside a single stage but at the boundary between worker executions.</p>

<p>Under production load, two scenarios triggered the issue. In the first, a job could time out and be retried while the original execution was still running. In the second, a worker could crash after writing part of the file but before updating the job status.</p>

<p>The Redis worker implementation followed a pattern where a worker would read a job identifier from the queue, begin processing it, and only remove it from the queue after completion. The gap between reading and acknowledging created a window where another worker could observe and process the same job. If the first worker crashed before removing the job, it would remain visible in the queue and another worker would pick it up.</p>

<p><img src="/assets/images/timing_window_v2.svg" alt="Figure 2: Overlapping execution windows of two workers" class="align-center" /></p>

<p>Because both executions targeted the same file path, their writes interleaved. The filesystem allowed concurrent writes, and from its perspective nothing illegal happened. From the application’s perspective, however, the output was corrupted: it contained duplicated or truncated data.</p>

<p>The core issue was the lack of atomic ownership. Observing a job was treated as equivalent to claiming a job, but the system never enforced exclusivity.</p>
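
<p>To make the distinction between observing and claiming concrete, here is a minimal in-memory sketch. The <code class="language-plaintext highlighter-rouge">JobClaims</code> class and its method names are hypothetical and not part of our pipeline; a real deployment would need the same atomic check-and-set against a store shared by all workers.</p>

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical in-memory claim registry (illustration only, not the production store). */
final class JobClaims {
    private final ConcurrentMap<String, String> ownerByJob = new ConcurrentHashMap<>();

    /** Returns true only for the single worker whose claim was recorded first. */
    boolean tryClaim(String jobId, String workerId) {
        // putIfAbsent is atomic: it returns null only for the caller that inserted the entry.
        return ownerByJob.putIfAbsent(jobId, workerId) == null;
    }

    /** Returns the worker that currently owns the job, or null if it is unclaimed. */
    String ownerOf(String jobId) {
        return ownerByJob.get(jobId);
    }
}
```

<p>A worker that gets <code class="language-plaintext highlighter-rouge">false</code> back has merely observed the job and must not start processing it.</p>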

<hr />

<h3 id="moving-to-kafka-for-deterministic-ownership">Moving to Kafka for Deterministic Ownership</h3>

<p><img src="/assets/images/kafka_ownership_diagram.svg" alt="Figure 3: Partition-level ownership in Kafka consumer groups" class="align-center" /></p>

<p>To eliminate the issue we redesigned the pipeline using Kafka consumer groups. Jobs are now published to a Kafka topic and consumed by workers that belong to a consumer group. Kafka guarantees that each partition is assigned to exactly one consumer within the group at a time. Message delivery becomes the atomic claim operation. Workers no longer compete through polling and visibility gaps.</p>

<p>When a worker begins processing a job, it updates the job state in persistent storage to RUNNING along with metadata such as worker identifier, start timestamp, and execution attempt number. This metadata becomes critical for crash detection and recovery.</p>

<p>If a worker crashes mid-execution, Kafka will eventually trigger a rebalance and reassign the partition to another worker. When the new worker receives the same job message, it does not blindly execute it. Instead, it first checks the persisted job state. If the job is marked as RUNNING but the associated worker has not sent a heartbeat within the expected interval, the system considers that execution abandoned.</p>

<p>We implemented a lightweight heartbeat mechanism where workers periodically update a timestamp while processing long-running exports. If that timestamp becomes stale beyond a configured threshold, the job is eligible for recovery. The new worker then transitions the job to a RETRY state using a conditional update, so that only one recovering worker can win, and begins a new execution attempt.</p>
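
<p>The staleness check and the conditional update can be sketched roughly as follows. This is an illustration under assumptions: the <code class="language-plaintext highlighter-rouge">JobState</code> fields are hypothetical, and an <code class="language-plaintext highlighter-rouge">AtomicReference</code> stands in for the persistent store, whose conditional update would play the role of <code class="language-plaintext highlighter-rouge">compareAndSet</code>.</p>

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical recovery logic; names and fields are illustrative, not the real schema. */
final class JobRecovery {
    enum State { RUNNING, RETRY }

    record JobState(State state, String workerId, int attempt, Instant lastHeartbeat) {}

    /** A heartbeat is stale once it is older than the configured threshold. */
    static boolean isStale(JobState job, Instant now, Duration threshold) {
        return Duration.between(job.lastHeartbeat(), now).compareTo(threshold) > 0;
    }

    /**
     * Attempts to take over an abandoned job. compareAndSet stands in for a
     * conditional update in the persistent store: it succeeds only if nobody
     * replaced the row since we read it.
     */
    static boolean tryRecover(AtomicReference<JobState> row, String newWorker,
                              Instant now, Duration threshold) {
        JobState seen = row.get();
        if (seen.state() != State.RUNNING || !isStale(seen, now, threshold)) {
            return false; // still owned by a live worker, or already being recovered
        }
        JobState next = new JobState(State.RETRY, newWorker, seen.attempt() + 1, now);
        return row.compareAndSet(seen, next);
    }
}
```

<p>If two workers race to recover the same job, only the one whose conditional update succeeds proceeds; the other sees the changed state and backs off.</p>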

<p>This prevents duplicate active executions while still allowing automatic recovery from crashes.</p>

<hr />

<h3 id="dead-letter-queue-and-intelligent-failure-handling">Dead Letter Queue and Intelligent Failure Handling</h3>

<p><img src="/assets/images/heartbeat_recovery_diagram.svg" alt="Figure 4: Dead Letter Queue and Intelligent Failure Handling" class="align-center" /></p>

<p>We also introduced a Dead Letter Queue to handle irrecoverable failures. If a job fails repeatedly beyond a configured retry threshold, it is no longer reprocessed indefinitely. Instead, the job is published to a dedicated DLQ topic along with diagnostic metadata including the failure reason and attempt count.</p>

<p>This prevents poison messages from blocking partitions or causing infinite retry loops. It also allows operators to inspect failed exports separately without impacting healthy traffic.</p>
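
<p>The retry-versus-DLQ decision can be sketched as a small policy object. The class name, the record fields, and the threshold value are assumptions for illustration; publishing the result to the DLQ topic is left out.</p>

```java
import java.util.Optional;

/** Hypothetical retry policy; the threshold and field names are assumptions. */
final class RetryPolicy {
    /** Diagnostic payload that would accompany a job sent to the DLQ topic. */
    record DeadLetter(String jobId, String failureReason, int attemptCount) {}

    private final int maxAttempts;

    RetryPolicy(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /** Returns a dead letter once the attempt count reaches the threshold; empty means retry. */
    Optional<DeadLetter> onFailure(String jobId, int attemptCount, String reason) {
        if (attemptCount >= maxAttempts) {
            return Optional.of(new DeadLetter(jobId, reason, attemptCount));
        }
        return Optional.empty();
    }
}
```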

<p>Because job state is persisted independently of message delivery, workers can intelligently determine whether a message represents a fresh execution, a retry after crash, or a previously failed attempt. The system no longer assumes that message redelivery implies safe re-execution. It verifies ownership and state before performing side effects.</p>
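
<p>One way to picture that check is as a mapping from persisted state to the action a worker takes on delivery. RUNNING and RETRY are the states described above; the other state names and the action names are hypothetical placeholders for this sketch.</p>

```java
/** Hypothetical classifier; only RUNNING and RETRY come from the pipeline described here. */
final class DeliveryClassifier {
    enum PersistedState { NONE, RUNNING, RETRY, COMPLETED, FAILED }

    enum Action { START_FRESH, CHECK_HEARTBEAT, START_NEW_ATTEMPT, SKIP }

    /** Decides what to do with a delivered message before any side effect happens. */
    static Action onDelivery(PersistedState state) {
        return switch (state) {
            case NONE -> Action.START_FRESH;          // first time this job is seen
            case RUNNING -> Action.CHECK_HEARTBEAT;   // may be alive or abandoned
            case RETRY -> Action.START_NEW_ATTEMPT;   // recovered after a crash
            case COMPLETED, FAILED -> Action.SKIP;    // redelivery of a finished job
        };
    }
}
```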

<hr />

<h3 id="safe-file-writing-under-recovery">Safe File Writing Under Recovery</h3>

<p>Even with deterministic ownership and crash detection, we redesigned the file writing strategy to eliminate partial corruption risks.</p>

<p>Workers write to a temporary file that includes a unique execution identifier rather than writing directly to the final network path. If a worker crashes mid-write, the temporary file remains isolated and does not affect previously completed exports. When a new worker recovers the job, it either resumes from a clean state or overwrites the temporary artifact associated with the abandoned execution.</p>

<p>Only after the export completes successfully is the temporary file renamed to the final path. A rename within a single filesystem is atomic, so as long as the temporary file sits alongside the final path, the final file becomes visible only when fully written.</p>
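
<p>A minimal sketch of the write-then-rename step, using the standard <code class="language-plaintext highlighter-rouge">java.nio.file</code> API. The class name, the temp-file naming scheme, and writing the whole content in one call are simplifications for illustration; a real export would stream into the temporary file.</p>

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper showing temp-file-plus-rename publishing. */
final class AtomicExportWriter {
    /** Writes content to a per-execution temp file, then renames it into place. */
    static Path publish(Path finalPath, String executionId, String content) throws IOException {
        // Keep the temp file in the same directory so the rename stays on one filesystem.
        Path temp = finalPath.resolveSibling(finalPath.getFileName() + "." + executionId + ".tmp");
        Files.write(temp, content.getBytes(StandardCharsets.UTF_8));
        // ATOMIC_MOVE fails with AtomicMoveNotSupportedException rather than
        // silently falling back to a non-atomic copy.
        return Files.move(temp, finalPath, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

<p>Because the execution identifier is part of the temporary name, two attempts at the same job never write to the same temporary file.</p>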

<p>This combination of atomic message ownership, persistent execution state, heartbeat-based crash detection, DLQ isolation, and atomic file writes eliminated the hidden race that existed in the original Redis model and made the export pipeline resilient under real production failure conditions.</p>]]></content><author><name>Harshaa Vardaan</name></author><category term="Distributed Systems" /><category term="Distributed Systems" /><category term="Kafka" /><category term="Concurrency" /><category term="System Design" /><category term="Redis" /><summary type="html"><![CDATA[It started with a file that was slightly larger than it should have been.]]></summary></entry><entry><title type="html">Escaping the Kafka Rebalance Spiral</title><link href="https://harshaavardaan.com/kafka/escaping-the-kafka-rebalance-spiral/" rel="alternate" type="text/html" title="Escaping the Kafka Rebalance Spiral" /><published>2025-07-01T02:51:30+00:00</published><updated>2025-07-01T02:51:30+00:00</updated><id>https://harshaavardaan.com/kafka/escaping-the-kafka-rebalance-spiral</id><content type="html" xml:base="https://harshaavardaan.com/kafka/escaping-the-kafka-rebalance-spiral/"><![CDATA[<p>If you’ve ever worked with Kafka in production, you’ve probably dealt with consumer group rebalancing at some point. Most of the time, it’s just background noise. But every now and then, it turns into a full-blown operational headache, especially when your consumers run long tasks.</p>

<p>We ran into this exact scenario when building out a feature that exports large datasets into files on demand. At first glance, it sounded simple: a consumer reads a message, gathers some data, writes it to a file, and stores it on a network path for download. The issue? Some of these tasks could run for close to thirty minutes.</p>

<h2 id="where-things-went-wrong">Where Things Went Wrong</h2>
<p>In our Kafka setup, each export request lands on a topic. A group of consumers picks them up and starts processing. Nothing special, until a task drags on longer than expected. Kafka expects each consumer to poll the broker at regular intervals. By default, that interval (<code class="language-plaintext highlighter-rouge">max.poll.interval.ms</code>) is set to five minutes.</p>

<p><img src="/assets/images/figure1.drawio.png" alt="Initial state" /></p>

<p>You can probably guess what happened next.</p>

<p>A consumer begins writing a large file. Because it is busy with the export, it doesn’t get around to polling in time, and Kafka assumes the worst: the consumer must have died. A rebalance is triggered, and its partition is handed off to another consumer.</p>

<p><img src="/assets/images/figure2.drawio.png" alt="First rebalance" /></p>

<p>This second consumer picks up the same unfinished export, but it runs into the exact same delay. Another rebalance is triggered. And the cycle repeats.</p>

<p><img src="/assets/images/figure3.drawio.png" alt="Rebalance spiral" /></p>

<p>Over time, this created a <em>rebalance spiral</em> where no task ever completed, and every consumer in the group kept getting kicked out.</p>

<p><strong>Why Not Just Increase the Timeout?</strong> <br />
We considered bumping up the poll interval. But when you’re dealing with unpredictable file sizes and multiple downstream dependencies, it’s hard to pick a safe value. Set it too low and you’re back in the spiral. Set it too high and real consumer failures go unnoticed for too long.</p>
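
<p>For reference, raising the limit is a single consumer setting. The sketch below only builds the configuration; the helper class is hypothetical and the 35-minute figure in the usage note is an arbitrary example, not a value we recommend.</p>

```java
import java.util.Properties;

/** Hypothetical helper that builds consumer properties with a custom poll interval. */
final class ConsumerTimeoutConfig {
    static Properties withPollInterval(long maxPollIntervalMs) {
        Properties props = new Properties();
        // Kafka consumer setting: maximum allowed delay between poll() calls.
        props.setProperty("max.poll.interval.ms", Long.toString(maxPollIntervalMs));
        return props;
    }
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">ConsumerTimeoutConfig.withPollInterval(35 * 60 * 1000L)</code> would cover a thirty-minute export with some headroom, at the cost of taking just as long to notice a consumer that has genuinely hung.</p>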

<p>We also explored the usual suspects (code optimization, splitting files, fire-and-forget models), but they either added too much complexity or failed to guarantee data consistency.</p>

<p>Eventually, we landed on something that worked: <em>pause the partition, delegate the heavy lifting to another thread, and resume when done.</em></p>

<h2 id="how-the-fix-works">How the Fix Works</h2>
<p>Instead of letting the Kafka listener get stuck on a long task, we do the following:</p>

<p>As soon as we receive a message that triggers a heavy export, we pause the partition it’s coming from.</p>

<p><img src="/assets/images/figure4.drawio.png" alt="Fix step1" /></p>

<p>The listener continues polling Kafka at regular intervals, but it won’t receive any new records from the paused partition.</p>

<p>Meanwhile, the actual export logic is handed off to a separate thread, freeing up the main listener. We can still enforce a hard timeout by scheduling a cancellation that takes effect only if the task has not finished by then.</p>

<p><img src="/assets/images/figure5.drawio.png" alt="Fix step 2" /></p>

<p>Once the thread finishes its task, whether successfully or due to timeout, the partition is resumed so the consumer can continue processing the next message.</p>

<p><img src="/assets/images/figure6.drawio.png" alt="Fix step 3" /></p>

<p>By doing this, we keep Kafka happy (because we’re still polling) and avoid duplicate work (because the partition is no longer reassigned mid-task).</p>

<p>Here’s a simplified version of what that looks like:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kt">void</span> <span class="nf">handleHeavyRecord</span><span class="o">(</span><span class="nc">ConsumerRecord</span><span class="o">&lt;</span><span class="nc">String</span><span class="o">,</span> <span class="nc">String</span><span class="o">&gt;</span> <span class="n">record</span><span class="o">)</span> <span class="o">{</span>

    <span class="c1">// Pause the specific partition</span>
    <span class="kt">var</span> <span class="n">container</span> <span class="o">=</span> <span class="n">kafkaListenerEndpointRegistry</span><span class="o">.</span><span class="na">getListenerContainer</span><span class="o">(</span><span class="s">"exportListener"</span><span class="o">);</span>
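    <span class="n">container</span><span class="o">.</span><span class="na">pausePartition</span><span class="o">(</span><span class="k">new</span> <span class="nc">TopicPartition</span><span class="o">(</span><span class="n">record</span><span class="o">.</span><span class="na">topic</span><span class="o">(),</span> <span class="n">record</span><span class="o">.</span><span class="na">partition</span><span class="o">()));</span>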

    <span class="c1">// Create thread pool (inlined for brevity; in practice create it once and reuse it across records)</span>
    <span class="nc">ScheduledExecutorService</span> <span class="n">executorService</span> <span class="o">=</span> <span class="nc">Executors</span><span class="o">.</span><span class="na">newScheduledThreadPool</span><span class="o">(</span><span class="no">THREAD_COUNT</span><span class="o">);</span>

    <span class="c1">// Create export task</span>
    <span class="nc">FutureTask</span><span class="o">&lt;</span><span class="nc">Void</span><span class="o">&gt;</span> <span class="n">exportTask</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">FutureTask</span><span class="o">&lt;&gt;(</span>
        <span class="o">()</span> <span class="o">-&gt;</span> <span class="o">{</span>
            <span class="k">try</span> <span class="o">{</span>
                <span class="n">processTask</span><span class="o">(</span><span class="n">record</span><span class="o">);</span>
            <span class="o">}</span>
            <span class="k">catch</span> <span class="o">(</span><span class="nc">Exception</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
                <span class="n">handleFailure</span><span class="o">(</span><span class="n">record</span><span class="o">,</span> <span class="s">"Exception occurred while task was running: "</span> <span class="o">+</span> <span class="n">e</span><span class="o">.</span><span class="na">toString</span><span class="o">());</span>
            <span class="o">}</span>
            <span class="k">finally</span> <span class="o">{</span>
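                <span class="c1">// Resume the paused partition after the task has completed or failed</span>
                <span class="n">container</span><span class="o">.</span><span class="na">resumePartition</span><span class="o">(</span><span class="k">new</span> <span class="nc">TopicPartition</span><span class="o">(</span><span class="n">record</span><span class="o">.</span><span class="na">topic</span><span class="o">(),</span> <span class="n">record</span><span class="o">.</span><span class="na">partition</span><span class="o">()));</span>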
            <span class="o">}</span>
            <span class="k">return</span> <span class="kc">null</span><span class="o">;</span>
        <span class="o">}</span>
    <span class="o">);</span>

    <span class="c1">// Start export task</span>
    <span class="n">executorService</span><span class="o">.</span><span class="na">submit</span><span class="o">(</span><span class="n">exportTask</span><span class="o">);</span>

    <span class="c1">// Timeout after 30 minutes</span>
    <span class="n">executorService</span><span class="o">.</span><span class="na">schedule</span><span class="o">(</span>
        <span class="o">()</span> <span class="o">-&gt;</span> <span class="o">{</span>

            <span class="c1">// Skip if task is already cancelled or has completed</span>
            <span class="k">if</span> <span class="o">(</span><span class="n">exportTask</span><span class="o">.</span><span class="na">isCancelled</span><span class="o">()</span> <span class="o">||</span> <span class="n">exportTask</span><span class="o">.</span><span class="na">isDone</span><span class="o">())</span> 
                <span class="k">return</span><span class="o">;</span>
            
            <span class="n">exportTask</span><span class="o">.</span><span class="na">cancel</span><span class="o">(</span><span class="kc">true</span><span class="o">);</span>

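            <span class="c1">// Resume the paused partition after timeout</span>
            <span class="n">container</span><span class="o">.</span><span class="na">resumePartition</span><span class="o">(</span><span class="k">new</span> <span class="nc">TopicPartition</span><span class="o">(</span><span class="n">record</span><span class="o">.</span><span class="na">topic</span><span class="o">(),</span> <span class="n">record</span><span class="o">.</span><span class="na">partition</span><span class="o">()));</span>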
            <span class="n">handleFailure</span><span class="o">(</span><span class="n">record</span><span class="o">,</span> <span class="s">"Export task exceeded time limit"</span><span class="o">);</span>
        <span class="o">},</span> 
        <span class="mi">30</span><span class="o">,</span> 
        <span class="nc">TimeUnit</span><span class="o">.</span><span class="na">MINUTES</span>
    <span class="o">);</span>
<span class="o">}</span>
</code></pre></div></div>]]></content><author><name>Harshaa Vardaan</name></author><category term="Kafka" /><category term="Kafka" /><category term="Java" /><category term="Scalability" /><category term="System Design" /><category term="Distributed Systems" /><summary type="html"><![CDATA[If you’ve ever worked with Kafka in production, you’ve probably dealt with consumer group rebalancing at some point. Most of the time, it’s just background noise. But every now and then, it turns into a full-blown operational headache, especially when your consumers run long tasks.]]></summary></entry></feed>