Git Diff Is Just LCS
Here's something wild: you've probably solved the Longest Common Subsequence problem on LeetCode, cursed at the DP table, finally got it accepted, and moved on. Maybe you even argued on Reddit about whether companies should ask DSA questions in interviews.
But have you ever stopped to think about what runs every single time you type `git diff`?
Yeah. It's literally LCS. That "useless DSA problem" you solved is running millions of times a day across every developer's machine on the planet.

The Irony
Every other CS student these days is grinding DSA. LeetCode subscriptions, Striver's sheet, 450 questions, you name it. And there's this endless debate online: "Is DSA relevant for SWE jobs?" "Do companies ask too much DP?" "When will I ever use this in real life?"
Meanwhile, every `git diff`. Every GitHub PR. Every merge. That's LCS running under the hood.
Let me show you something. Open your terminal right now and run (in a git repo):
```bash
git diff HEAD~1
```
That's it. You just executed an instance of the Longest Common Subsequence problem.
The thing is, `diff` isn't just some utility command. It's everywhere:
- Every `git diff`, `git merge`, `git cherry-pick`
- GitHub's pull request UI showing what changed
- Your IDE's merge conflict resolver
- Code review tools
- Even `npm diff` to see package changes
And the algorithm behind it? That's where things get interesting.
TIP
What You'll Learn: By the end of this post, you'll understand the exact algorithm running when you type `git diff`, why it's faster than the obvious solution, and how a "useless" LeetCode problem powers one of the most-used developer tools on the planet.
Wait, How Does Diff Even Work?
Think about it: you change one line in a 1000-line file. Git instantly shows you exactly what changed.
How?
Let's start with two simple files:
```python
# File A (old)
def calculate_sum(numbers):
    total = 0
    for num in numbers:
        total += num
    return total
```

```python
# File B (new)
def calculate_sum(numbers):
    total = 0
    for num in numbers:
        if num > 0:
            total += num
    return total
```
As a human, you immediately see: "Oh, they added an `if num > 0:` check." But how does a computer figure that out?
The naive approach: "Just tell me what lines are different!"
```text
Line 4: Changed "total += num" => "if num > 0: total += num"
```
Terrible. This doesn't communicate what changed. It just says "lines are different" which is obvious and useless.
What we actually want:
```diff
  def calculate_sum(numbers):
      total = 0
      for num in numbers:
-         total += num
+         if num > 0:
+             total += num
      return total
```
Now that's useful. It shows exactly what was added, in context. But how do we compute this?
The Insight That Changes Everything
Here's the breakthrough: finding what changed is the same as finding what stayed the same.
If you know which lines are common to both files (in order), everything else is either deleted or added.
Wait. That's just... Longest Common Subsequence.
Let me make this concrete. Consider these two strings:
A = "ABCDEFG"B = "ACDXFG"
The LCS is "ACDFG" (length 5). Everything not in the LCS is either a deletion from A or an insertion into B:
- Delete 'B' (not in LCS)
- Keep 'A', 'C', 'D'
- Delete 'E' (not in LCS)
- Insert 'X'
- Keep 'F', 'G'
That's your diff! The LCS tells you what stayed the same, and everything else is what changed.
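You can sanity-check this with Python's standard library. `difflib.SequenceMatcher` uses a different matching algorithm under the hood, but its opcodes tell the same keep/delete/insert story:

```python
from difflib import SequenceMatcher

sm = SequenceMatcher(a="ABCDEFG", b="ACDXFG")
for op, i1, i2, j1, j2 in sm.get_opcodes():
    print(op, "ABCDEFG"[i1:i2], "->", "ACDXFG"[j1:j2])
# equal   A  -> A
# delete  B  ->
# equal   CD -> CD
# replace E  -> X
# equal   FG -> FG
```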
[Interactive LCS Alignment Visualizer: highlights the characters of the LCS ("ACDFG", length 5) in both strings, showing that they keep their original order.]
The Classic Solution: The DP Table You Know
If you've done LeetCode, you've seen this a hundred times:
```python
def lcs_length(A, B):
    m, n = len(A), len(B)
    # dp[i][j] = length of LCS of A[0..i-1] and B[0..j-1]
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if A[i-1] == B[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])

    return dp[m][n]
```
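Running it on the strings from the previous section confirms the answer:

```python
print(lcs_length("ABCDEFG", "ACDXFG"))  # 5 -> "ACDFG"
```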
This is the standard O(mn) time, O(mn) space algorithm. You fill a table, compare characters, take maximums. Easy.
But here's the problem:
- Files with 10,000 lines each → need a 10,000 × 10,000 table
- That's 100 million cells to compute
- For a simple diff that takes milliseconds
Real-world reality check:
- Linux kernel: 30+ million lines of code
- A 50MB JSON file: easily hundreds of thousands of lines
- Your typical pull request: dozens to hundreds of files
Yeah, we need something better.
The Problem: It's Too Slow
Classic DP is O(mn). For two 10,000-line files, that's 100 million operations.
In 1976, Doug McIlroy (the Unix pipes guy) and James Hunt had a better idea.
Their insight: most lines in code are unique.
Think about it: In a typical code file, how many lines are literally just `{`? Maybe a few. But lines like `def calculate_fibonacci(n):` or `return total + calculate_tax(amount)` are probably unique.
NOTE
Key Insight: Instead of comparing every line with every other line (expensive!), we can:
- Hash each line once (cheap!)
- Quickly look up where each line appears (instant!)
- Only do detailed work on the ambiguous parts
Hunt-McIlroy's three-step process:
- Build a hash table of where each unique line appears
- Use that to quickly find matching segments
- Only do detailed DP in ambiguous regions
The Algorithm: Step by Step
Let's trace through a small example:
```python
# File A
def foo():
    x = 1
    y = 2
    return x + y
```

```python
# File B
def foo():
    x = 1
    z = 3
    y = 2
    return x + y
```
Step 1: Build the occurrence table
For each line in B, record where it appears in A:
```python
occurrence_map = {
    "def foo():":       [0],  # Line 0 in A
    "    x = 1":        [1],  # Line 1 in A
    "    z = 3":        [],   # Not in A
    "    y = 2":        [2],  # Line 2 in A
    "    return x + y": [3],  # Line 3 in A
}
```
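Building that table takes a couple of lines of Python. A minimal sketch (the helper name `build_occurrence_map` is mine, not from any real diff implementation):

```python
from collections import defaultdict

def build_occurrence_map(A, B):
    # Index every line of A by the positions where it appears...
    positions = defaultdict(list)
    for i, line in enumerate(A):
        positions[line].append(i)
    # ...then look up each line of B in that index
    return {line: positions.get(line, []) for line in B}
```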
Step 2: Find candidate matches
Now walk through B and note potential matches:
B[0] "def foo():" -> matches A[0] ✓B[1] " x = 1" -> matches A[1] ✓B[2] " z = 3" -> no match ✗B[3] " y = 2" -> matches A[2] ✓B[4] " return x+y" -> matches A[3] ✓
Step 3: Find longest increasing subsequence
We need matches where line numbers in A are increasing:
- A[0] → B[0] ✓
- A[1] → B[1] ✓
- A[2] → B[3] ✓ (2 > 1, so valid)
- A[3] → B[4] ✓ (3 > 2, so valid)
This gives us the longest common subsequence! Everything else is insertions/deletions.
Here's the beautiful part: finding the longest increasing subsequence can be done in O(M log M) with binary search, not DP!
```python
from bisect import bisect_left

def longest_increasing_subsequence(matches):
    """
    matches: list of (line_in_A, line_in_B) tuples.
    Returns the length of the LIS based on line_in_A values.
    """
    # tails[i] = smallest tail element of all increasing subsequences of length i+1
    tails = []

    for a_line, b_line in sorted(matches, key=lambda x: x[1]):
        # Find where a_line would fit in our current best subsequences
        pos = bisect_left(tails, a_line)

        if pos == len(tails):
            tails.append(a_line)
        else:
            tails[pos] = a_line

    return len(tails)
```
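Feeding in the matches from Step 2 (line indices as in the trace above):

```python
matches = [(0, 0), (1, 1), (2, 3), (3, 4)]  # (line_in_A, line_in_B)
print(longest_increasing_subsequence(matches))  # 4 - all four matches chain in order
```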
Why This Is Faster
Classic DP approach:
- Time: O(mn)
- Space: O(mn)
Hunt-McIlroy:
- Building the hash table: O(m + n)
- Finding matches: O(n + M) with hash lookups
- Computing the LIS: O(M log M), where M is the number of matches
- Total: O(m + n + M log M)
Why this matters: For typical files where most lines are unique, M ≈ n. The algorithm is effectively O(n log n) in practice!
Bonus: Modern implementations use space-efficient LIS algorithms, bringing space down to O(n) instead of O(mn).
Real numbers: Comparing two 10,000-line files:
- Classic DP: ~100 million operations
- Hunt-McIlroy: ~20,000 operations (if most lines unique)
- That's 5,000× faster!
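A quick back-of-envelope check of those numbers (my own arithmetic, under the same "mostly unique lines" assumption):

```python
m = n = 10_000
classic_dp = m * n      # 100,000,000 DP cells to fill
hunt_mcilroy = m + n    # ~20,000 hash inserts + lookups; the LIS step is negligible
print(classic_dp // hunt_mcilroy)  # 5000
```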
What Git Actually Uses: Myers' Algorithm
Hunt-McIlroy was great. But in 1986, Eugene Myers published something better.
Instead of asking "what's the longest common part?", he asked: "what's the shortest edit sequence?"
Myers' algorithm uses a concept called the "edit graph":
Think of it as a grid:
- Horizontal moves → delete from A (move right)
- Vertical moves → insert to B (move down)
- Diagonal moves → match (free! no cost)
Your goal: Get from top-left to bottom-right using the fewest horizontal/vertical moves.
```python
def myers_diff(A, B):
    """
    Simplified Myers' algorithm - finds the shortest edit sequence.
    Runs in O(ND) where N = len(A) + len(B) and D is the edit distance.
    """
    N, M = len(A), len(B)
    MAX = N + M

    # v[k] represents the furthest-reaching x value on diagonal k
    v = {1: 0}
    trace = []

    for d in range(MAX + 1):
        trace.append(v.copy())

        for k in range(-d, d + 1, 2):
            # Should we go down (insert) or right (delete)?
            if k == -d or (k != d and v.get(k - 1, -1) < v.get(k + 1, -1)):
                x = v.get(k + 1, 0)      # Came from diagonal k+1 (vertical/insert)
            else:
                x = v.get(k - 1, 0) + 1  # Came from diagonal k-1 (horizontal/delete)

            y = x - k

            # Follow any diagonal matches (no cost!)
            while x < N and y < M and A[x] == B[y]:
                x, y = x + 1, y + 1

            v[k] = x

            # Found the end?
            if x >= N and y >= M:
                return d, trace

    return MAX, trace
```
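A quick sanity check using the worked example from Myers' 1986 paper, where the shortest edit script between "ABCABBA" and "CBABAC" has length 5:

```python
d, trace = myers_diff("ABCABBA", "CBABAC")
print(d)  # 5 - five insertions/deletions, matching the paper's example
```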
The genius of Myers' algorithm:
- Explores in "waves" of increasing edit distance (like BFS)
- Matches are free - follow diagonals as far as possible
- Only "spend" operations on actual changes
Complexity: O(ND), where:
- N = total length of both files (len(A) + len(B))
- D = edit distance (how different they are)
TIP
Why this is brilliant: For files that are mostly similar (which is most diffs!), D is small.
Example: Two 10,000-line files with only 10 changed lines:
- Classic DP: 10,000 × 10,000 = 100 million ops
- Myers: 20,000 × 10 = 200,000 ops
- 500× faster!
[Interactive Edit Graph Visualizer: Myers' algorithm uses this grid to find the shortest edit path.]
Why Git Uses This
Let's bring this back to reality. When you run `git diff`, here's what actually happens:
- Git retrieves the two versions of the file from its object database
- It runs Myers' algorithm (or a variant) to compute the edit script
- It formats the output as a unified diff
```bash
$ git diff HEAD~1 myfile.py
```
Git is smart about this:
- It diffs by lines, not characters (usually)
- It can use heuristics to improve output quality
- It supports different diff algorithms (`--diff-algorithm=myers|minimal|patience|histogram`)
The `patience` algorithm is particularly interesting - it's designed to produce more "human-readable" diffs by prioritizing unique lines as anchors.
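The anchor-selection idea is easy to sketch (a simplification of the first step, not Git's actual code; `patience_anchors` is my own name). Lines that occur exactly once in each file become anchors; the full algorithm then takes the longest increasing subsequence of these anchors and recurses on the gaps between them:

```python
from collections import Counter

def patience_anchors(A, B):
    # Lines that appear exactly once in each file make unambiguous anchors
    count_a, count_b = Counter(A), Counter(B)
    unique_a = {line: i for i, line in enumerate(A) if count_a[line] == 1}
    return [(unique_a[line], j) for j, line in enumerate(B)
            if count_b[line] == 1 and line in unique_a]
```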
Actually Implementing a Simple Diff
Let's build a minimal but functional diff tool to cement these ideas:
```python
def simple_diff(A, B):
    """
    Compute a simple diff between two lists of lines.
    Returns a list of (op, line) pairs: ' ' = unchanged, '+' = added, '-' = removed.
    """
    # Compute LCS using DP (for simplicity, not Hunt-McIlroy)
    m, n = len(A), len(B)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if A[i-1] == B[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])

    # Backtrack to find the actual LCS and generate the diff
    diff = []
    i, j = m, n

    while i > 0 or j > 0:
        if i > 0 and j > 0 and A[i-1] == B[j-1]:
            # Match - part of the LCS
            diff.append((' ', A[i-1]))
            i -= 1
            j -= 1
        elif j > 0 and (i == 0 or dp[i][j-1] >= dp[i-1][j]):
            # Insertion
            diff.append(('+', B[j-1]))
            j -= 1
        else:
            # Deletion
            diff.append(('-', A[i-1]))
            i -= 1

    return list(reversed(diff))
```
```python
# Example usage
old_file = [
    "def calculate_sum(numbers):",
    "    total = 0",
    "    for num in numbers:",
    "        total += num",
    "    return total",
]

new_file = [
    "def calculate_sum(numbers):",
    "    total = 0",
    "    for num in numbers:",
    "        if num > 0:",
    "            total += num",
    "    return total",
]

diff = simple_diff(old_file, new_file)
for op, line in diff:
    print(f"{op} {line}")
```
Output:

```diff
  def calculate_sum(numbers):
      total = 0
      for num in numbers:
-         total += num
+         if num > 0:
+             total += num
      return total
```
Beautiful! We've implemented the core of `diff` in ~40 lines of Python.
NOTE
Learning Checkpoint: At this point, you understand:
- Diff is just LCS applied to lines
- The classic DP solution
- Hunt-McIlroy's hash table optimization
- Myers' edit graph approach
- How to implement a basic diff in Python
Try it yourself: Edit the code in the visualizer below to see diff in action!
[Interactive Diff Simulator: edit the code to see real-time diff output, with + additions and - deletions, same as git diff.]
The Real Lesson: DSA Is Everywhere
You literally cannot use Git without running an algorithm that appears in every DSA curriculum. Every time you look at a pull request diff on GitHub, you're staring at the output of the Hunt-McIlroy or Myers algorithm.
But it goes deeper than just diff:
Algorithm | "Interview Question" | Real-World Usage |
---|---|---|
LCS | That DP problem everyone hates | git diff , version control, DNA sequencing |
Edit Distance | Levenshtein distance | Spell checkers, fuzzy search, plagiarism detection |
Dijkstra | Shortest path in graph | Google Maps, network routing, compilers |
Union-Find | Disjoint sets | Network connectivity, image processing |
Trie | Prefix tree | Autocomplete, IP routing, spell checkers |
Segment Trees | Range queries | Databases, computational geometry |
NOTE
The Pattern: Every single one of these "useless interview questions" is actually used in production systems you use every single day.
The problem isn't that DSA is irrelevant. The problem is that we teach it badly, divorced from real applications.
When you're grinding LeetCode, you're not just preparing for interviews. You're learning the fundamental building blocks of systems you use every single day.
The next time someone on Twitter says "nobody uses DP in real life," just run `git diff` and smile.
Going Deeper: Resources and Rabbit Holes
If you've made it this far and want to dive deeper, here are the canonical resources:
The Original Papers
Hunt & McIlroy (1976): "An Algorithm for Differential File Comparison" - The original Unix diff algorithm. Surprisingly readable for a 1970s paper.
Myers (1986): "An O(ND) Difference Algorithm and Its Variations" - This is what Git uses. The paper is dense but worth it.
Heckel (1978): "A Technique for Isolating Differences Between Files" - Another influential approach, used in some merge tools.
Implementations to Study
- Git's diff implementation: Check out `xdiff/xdiffi.c` in the Git source. It's in C and surprisingly readable.
- diff-match-patch: Google's diff-match-patch library implements Myers' algorithm in multiple languages. Great for learning.
- Python's difflib: The standard library's `difflib` module implements a variant of these algorithms. Check the source!
Tools to Explore
Try different diff algorithms and see how they behave:
```bash
# Standard Myers
git diff HEAD~1

# Patience diff (often better for code)
git diff --patience HEAD~1

# Histogram diff (a faster refinement of patience)
git diff --histogram HEAD~1

# Minimal diff (slower but smallest output)
git diff --minimal HEAD~1
```
Each one makes different trade-offs between speed, diff size, and "human readability."
Books Worth Reading
- "The Algorithm Design Manual" by Skiena - Chapter on DP includes diff as a motivating example
- "Algorithms" by Dasgupta, Papadimitriou, Vazirani - Clean explanation of edit distance
- "Version Control with Git" by Loeliger & McCullough - Chapter 8 dives into diff and merge
The Uncomfortable Truth
Here's what I really think about the DSA debate:
The problem isn't that companies ask too much DSA. The problem is that:
- We teach DSA as puzzle-solving, not as tools for building systems
- Interviews test recognition, not understanding or application
- The industry cargo-cults Google, even when the context is completely different
But the solution isn't to abandon DSA. It's to teach it better, contextualize it, show where it appears in real systems.
Because here's the thing: you can have opinions about whether LeetCode-style interviews are good or bad. That's fine. Debate that all day.
But you cannot have opinions about whether LCS is useful. It objectively is. You used it today, multiple times, whether you realized it or not.
That LeetCode problem you thought was useless? You used it today. Probably dozens of times.
Next time you run `git diff`, remember: you're looking at the output of the exact algorithm you solved at 2 AM and forgot about.
The difference between "toy problem" and "production system" is often just context.
It's all connected. And that's kind of beautiful.
References
Hunt, J. W., & McIlroy, M. D. (1976). An Algorithm for Differential File Comparison. Bell Laboratories Computing Science Technical Report #41.
Myers, E. W. (1986). An O(ND) difference algorithm and its variations. Algorithmica, 1(1-4), 251-266.
Ukkonen, E. (1985). Algorithms for approximate string matching. Information and Control, 64(1-3), 100-118.
Tichy, W. F. (1984). The string-to-string correction problem with block moves. ACM Transactions on Computer Systems, 2(4), 309-321.
Git Documentation: git-diff and gitdiffcore
Bergroth, L., Hakonen, H., & Raita, T. (2000). A survey of longest common subsequence algorithms. String Processing and Information Retrieval, 39-48.
Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2009). Introduction to Algorithms (3rd ed.). MIT Press. Chapter 15.4: Longest Common Subsequence.