Git Diff Is Just LCS
Here's something wild: you've probably solved the Longest Common Subsequence problem on LeetCode, cursed at the DP table, finally got it accepted, and moved on. Maybe you even argued on Reddit about whether companies should ask DSA questions in interviews.
But have you ever stopped to think about what runs every single time you type `git diff`?
Yeah. It's literally LCS. That "useless DSA problem" you solved is running millions of times a day across every developer's machine on the planet.

The Irony
Every other CS student these days is grinding DSA. LeetCode subscriptions, Striver's sheet, 450 questions, you name it. And there's this endless debate online: "Is DSA relevant for SWE jobs?" "Do companies ask too much DP?" "When will I ever use this in real life?"
Meanwhile, every `git diff`. Every GitHub PR. Every merge. That's LCS running under the hood.
Let me show you something. Open your terminal right now and run (in a git repo):
```bash
git diff HEAD~1
```
That's it. You just executed an instance of the Longest Common Subsequence problem.
The thing is, `diff` isn't just some utility command. It's everywhere:
- Every `git diff`, `git merge`, `git cherry-pick`
- GitHub's pull request UI showing what changed
- Your IDE's merge conflict resolver
- Code review tools
- Even `npm diff` to see package changes
And the algorithm behind it? That's where things get interesting.
TIP
What You'll Learn: By the end of this post, you'll understand the exact algorithm running when you type `git diff`, why it's faster than the obvious solution, and how a "useless" LeetCode problem powers one of the most-used developer tools on the planet.
Wait, How Does Diff Even Work?
Think about it: you change one line in a 1000-line file. Git instantly shows you exactly what changed.
How?
Let's start with two simple files:
```python
# File A (old)
def calculate_sum(numbers):
    total = 0
    for num in numbers:
        total += num
    return total
```

```python
# File B (new)
def calculate_sum(numbers):
    total = 0
    for num in numbers:
        if num > 0:
            total += num
    return total
```
As a human, you immediately see: "Oh, they added an `if num > 0:` check." But how does a computer figure that out?
The naive approach: "Just tell me what lines are different!"
```text
Line 4: Changed "total += num" => "if num > 0: total += num"
```
Terrible. This doesn't communicate what changed. It just says "lines are different" which is obvious and useless.
What we actually want:
```diff
  def calculate_sum(numbers):
      total = 0
      for num in numbers:
-         total += num
+         if num > 0:
+             total += num
      return total
```
Now that's useful. It shows exactly what was added, in context. But how do we compute this?
The Insight That Changes Everything
Here's the breakthrough: finding what changed is the same as finding what stayed the same.
If you know which lines are common to both files (in order), everything else is either deleted or added.
Wait. That's just... Longest Common Subsequence.
Let me make this concrete. Consider these two strings:
A = "ABCDEFG"B = "ACDXFG"
The LCS is "ACDFG" (length 5). Everything not in the LCS is either a deletion from A or an insertion into B:
- Delete 'B' (not in LCS)
- Keep 'A', 'C', 'D'
- Delete 'E' (not in LCS)
- Insert 'X'
- Keep 'F', 'G'
That's your diff! The LCS tells you what stayed the same, and everything else is what changed.
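You can sanity-check this with Python's standard library. `difflib.SequenceMatcher` uses a different matching algorithm under the hood, but its opcodes tell the same keep/delete/insert story:

```python
from difflib import SequenceMatcher

sm = SequenceMatcher(a="ABCDEFG", b="ACDXFG")
for op, i1, i2, j1, j2 in sm.get_opcodes():
    print(op, "ABCDEFG"[i1:i2], "->", "ACDXFG"[j1:j2])
# equal   A  -> A
# delete  B  ->
# equal   CD -> CD
# replace E  -> X
# equal   FG -> FG
```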
[Interactive LCS Alignment Visualizer: highlights the characters of the LCS ("ACDFG", length 5) in both strings, showing that they keep their original order.]
The Classic Solution: The DP Table You Know
If you've done LeetCode, you've seen this a hundred times:
```python
def lcs_length(A, B):
    m, n = len(A), len(B)
    # dp[i][j] = length of LCS of A[0..i-1] and B[0..j-1]
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if A[i-1] == B[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])

    return dp[m][n]
```
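Running it on the strings from the previous section confirms the answer:

```python
print(lcs_length("ABCDEFG", "ACDXFG"))  # 5 -> "ACDFG"
```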
This is the standard O(mn) time, O(mn) space algorithm. You fill a table, compare characters, take maximums. Easy.
But here's the problem:
- Files with 10,000 lines each → need a 10,000 × 10,000 table
- That's 100 million cells to compute
- For a simple diff that takes milliseconds
Real-world reality check:
- Linux kernel: 30+ million lines of code
- A 50MB JSON file: easily hundreds of thousands of lines
- Your typical pull request: dozens to hundreds of files
Yeah, we need something better.
The Problem: It's Too Slow
Classic DP is O(mn). For two 10,000-line files, that's 100 million operations.
In 1976, Doug McIlroy (the Unix pipes guy) and James Hunt had a better idea.
Their insight: most lines in code are unique.
Think about it: In a typical code file, how many lines are literally just `{`? Maybe a few. But lines like `def calculate_fibonacci(n):` or `return total + calculate_tax(amount)` are probably unique.
NOTE
Key Insight: Instead of comparing every line with every other line (expensive!), we can:
- Hash each line once (cheap!)
- Quickly look up where each line appears (instant!)
- Only do detailed work on the ambiguous parts
Hunt-McIlroy's three-step process:
- Build a hash table of where each unique line appears
- Use that to quickly find matching segments
- Only do detailed DP in ambiguous regions
The Algorithm: Step by Step
Let's trace through a small example:
```python
# File A
def foo():
    x = 1
    y = 2
    return x + y
```

```python
# File B
def foo():
    x = 1
    z = 3
    y = 2
    return x + y
```
Step 1: Build the occurrence table
For each line in B, record where it appears in A:
```python
occurrence_map = {
    "def foo():":       [0],  # Line 0 in A
    "    x = 1":        [1],  # Line 1 in A
    "    z = 3":        [],   # Not in A
    "    y = 2":        [2],  # Line 2 in A
    "    return x + y": [3],  # Line 3 in A
}
```
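Building that table takes a couple of lines of Python. A minimal sketch (the helper name `build_occurrence_map` is mine, not from any real diff implementation):

```python
from collections import defaultdict

def build_occurrence_map(A, B):
    # Index every line of A by the positions where it appears...
    positions = defaultdict(list)
    for i, line in enumerate(A):
        positions[line].append(i)
    # ...then look up each line of B in that index
    return {line: positions.get(line, []) for line in B}
```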
Step 2: Find candidate matches
Now walk through B and note potential matches:
B[0] "def foo():" -> matches A[0] ✓B[1] " x = 1" -> matches A[1] ✓B[2] " z = 3" -> no match ✗B[3] " y = 2" -> matches A[2] ✓B[4] " return x+y" -> matches A[3] ✓
Step 3: Find longest increasing subsequence
We need matches where line numbers in A are increasing:
- A[0] → B[0] ✓
- A[1] → B[1] ✓
- A[2] → B[3] ✓ (2 > 1, so valid)
- A[3] → B[4] ✓ (3 > 2, so valid)
This gives us the longest common subsequence! Everything else is insertions/deletions.
Here's the beautiful part: finding the longest increasing subsequence can be done in O(M log M) with binary search, not DP!
```python
from bisect import bisect_left

def longest_increasing_subsequence(matches):
    """
    matches: list of (line_in_A, line_in_B) tuples.
    Returns the length of the LIS based on line_in_A values.
    """
    # tails[i] = smallest tail element of all increasing subsequences of length i+1
    tails = []

    for a_line, b_line in sorted(matches, key=lambda x: x[1]):
        # Find where a_line would fit in our current best subsequences
        pos = bisect_left(tails, a_line)

        if pos == len(tails):
            tails.append(a_line)
        else:
            tails[pos] = a_line

    return len(tails)
```
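Feeding in the matches from Step 2 (line indices as in the trace above):

```python
matches = [(0, 0), (1, 1), (2, 3), (3, 4)]  # (line_in_A, line_in_B)
print(longest_increasing_subsequence(matches))  # 4 - all four matches chain in order
```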
Why This Is Faster
Classic DP approach:
- Time: O(mn)
- Space: O(mn)
Hunt-McIlroy:
- Building the hash table: O(m + n)
- Finding matches: O(n + M) with hash lookups
- Computing the LIS: O(M log M), where M is the number of matches
- Total: O(m + n + M log M)
Why this matters: For typical files where most lines are unique, M ≈ n. The algorithm is effectively O(n log n) in practice!
Bonus: Modern implementations use space-efficient LIS algorithms, bringing space down to O(n) instead of O(mn).
Real numbers: Comparing two 10,000-line files:
- Classic DP: ~100 million operations
- Hunt-McIlroy: ~20,000 operations (if most lines unique)
- That's 5,000× faster!
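A quick back-of-envelope check of those numbers (my own arithmetic, under the same "mostly unique lines" assumption):

```python
m = n = 10_000
classic_dp = m * n      # 100,000,000 DP cells to fill
hunt_mcilroy = m + n    # ~20,000 hash inserts + lookups; the LIS step is negligible
print(classic_dp // hunt_mcilroy)  # 5000
```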
What Git Actually Uses: Myers' Algorithm
Hunt-McIlroy was great. But in 1986, Eugene Myers published something better.
Instead of asking "what's the longest common part?", he asked: "what's the shortest edit sequence?"
Myers' algorithm uses a concept called the "edit graph":
Think of it as a grid:
- Horizontal moves → delete from A (move right)
- Vertical moves → insert to B (move down)
- Diagonal moves → match (free! no cost)
Your goal: Get from top-left to bottom-right using the fewest horizontal/vertical moves.
```python
def myers_diff(A, B):
    """
    Simplified Myers' algorithm - finds the shortest edit sequence.
    Runs in O(ND) where N = len(A) + len(B) and D is the edit distance.
    """
    N, M = len(A), len(B)
    MAX = N + M

    # v[k] represents the furthest-reaching x value on diagonal k
    v = {1: 0}
    trace = []

    for d in range(MAX + 1):
        trace.append(v.copy())

        for k in range(-d, d + 1, 2):
            # Should we go down (insert) or right (delete)?
            if k == -d or (k != d and v.get(k - 1, -1) < v.get(k + 1, -1)):
                x = v.get(k + 1, 0)      # Came from diagonal k+1 (vertical/insert)
            else:
                x = v.get(k - 1, 0) + 1  # Came from diagonal k-1 (horizontal/delete)

            y = x - k

            # Follow any diagonal matches (no cost!)
            while x < N and y < M and A[x] == B[y]:
                x, y = x + 1, y + 1

            v[k] = x

            # Found the end?
            if x >= N and y >= M:
                return d, trace

    return MAX, trace
```
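A quick sanity check using the worked example from Myers' 1986 paper, where the shortest edit script between "ABCABBA" and "CBABAC" has length 5:

```python
d, trace = myers_diff("ABCABBA", "CBABAC")
print(d)  # 5 - five insertions/deletions, matching the paper's example
```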
The genius of Myers' algorithm:
- Explores in "waves" of increasing edit distance (like BFS)
- Matches are free - follow diagonals as far as possible
- Only "spend" operations on actual changes
Complexity: O(ND), where:
- N = total length of both files (len(A) + len(B))
- D = edit distance (how different they are)
TIP
Why this is brilliant: For files that are mostly similar (which is most diffs!), D is small.
Example: Two 10,000-line files with only 10 changed lines:
- Classic DP: 10,000 × 10,000 = 100 million ops
- Myers: 20,000 × 10 = 200,000 ops
- 500× faster!
[Interactive Edit Graph Visualizer: Myers' algorithm uses this grid to find the shortest edit path.]
Why Git Uses This
Let's bring this back to reality. When you run `git diff`, here's what actually happens:
- Git retrieves the two versions of the file from its object database
- It runs Myers' algorithm (or a variant) to compute the edit script
- It formats the output as a unified diff
```bash
$ git diff HEAD~1 myfile.py
```
Git is smart about this:
- It diffs by lines, not characters (usually)
- It can use heuristics to improve output quality
- It supports different diff algorithms (`--diff-algorithm=myers|minimal|patience|histogram`)
The `patience` algorithm is particularly interesting - it's designed to produce more "human-readable" diffs by prioritizing unique lines as anchors.
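The anchor-selection idea is easy to sketch (a simplification of the first step, not Git's actual code; `patience_anchors` is my own name). Lines that occur exactly once in each file become anchors; the full algorithm then takes the longest increasing subsequence of these anchors and recurses on the gaps between them:

```python
from collections import Counter

def patience_anchors(A, B):
    # Lines that appear exactly once in each file make unambiguous anchors
    count_a, count_b = Counter(A), Counter(B)
    unique_a = {line: i for i, line in enumerate(A) if count_a[line] == 1}
    return [(unique_a[line], j) for j, line in enumerate(B)
            if count_b[line] == 1 and line in unique_a]
```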
Actually Implementing a Simple Diff
Let's build a minimal but functional diff tool to cement these ideas:
```python
def simple_diff(A, B):
    """
    Compute a simple diff between two lists of lines.
    Returns a list of (op, line) pairs: ' ' = unchanged, '+' = added, '-' = removed.
    """
    # Compute LCS using DP (for simplicity, not Hunt-McIlroy)
    m, n = len(A), len(B)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if A[i-1] == B[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])

    # Backtrack to find the actual LCS and generate the diff
    diff = []
    i, j = m, n

    while i > 0 or j > 0:
        if i > 0 and j > 0 and A[i-1] == B[j-1]:
            # Match - part of the LCS
            diff.append((' ', A[i-1]))
            i -= 1
            j -= 1
        elif j > 0 and (i == 0 or dp[i][j-1] >= dp[i-1][j]):
            # Insertion
            diff.append(('+', B[j-1]))
            j -= 1
        else:
            # Deletion
            diff.append(('-', A[i-1]))
            i -= 1

    return list(reversed(diff))
```
```python
# Example usage
old_file = [
    "def calculate_sum(numbers):",
    "    total = 0",
    "    for num in numbers:",
    "        total += num",
    "    return total",
]

new_file = [
    "def calculate_sum(numbers):",
    "    total = 0",
    "    for num in numbers:",
    "        if num > 0:",
    "            total += num",
    "    return total",
]

diff = simple_diff(old_file, new_file)
for op, line in diff:
    print(f"{op} {line}")
```
Output:

```diff
  def calculate_sum(numbers):
      total = 0
      for num in numbers:
-         total += num
+         if num > 0:
+             total += num
      return total
```
Beautiful! We've implemented the core of `diff` in ~40 lines of Python.
NOTE
Learning Checkpoint: At this point, you understand:
- Diff is just LCS applied to lines
- The classic DP solution
- Hunt-McIlroy's hash table optimization
- Myers' edit graph approach
- How to implement a basic diff in Python
Try it yourself: Edit the code in the visualizer below to see diff in action!
[Interactive Diff Simulator: edit the code to see real-time diff output, with + additions and - deletions, same as git diff.]
The Real Lesson: DSA Is Everywhere
You literally cannot use Git without running an algorithm that appears in every DSA curriculum. Every time you look at a pull request diff on GitHub, you're staring at the output of the Hunt-McIlroy or Myers algorithm.
But it goes deeper than just diff:
Algorithm | "Interview Question" | Real-World Usage |
---|---|---|
LCS | That DP problem everyone hates | git diff , version control, DNA sequencing |
Edit Distance | Levenshtein distance | Spell checkers, fuzzy search, plagiarism detection |
Dijkstra | Shortest path in graph | Google Maps, network routing, compilers |
Union-Find | Disjoint sets | Network connectivity, image processing |
Trie | Prefix tree | Autocomplete, IP routing, spell checkers |
Segment Trees | Range queries | Databases, computational geometry |
NOTE
The Pattern: Every single one of these "useless interview questions" is actually used in production systems you use every single day.
The problem isn't that DSA is irrelevant. The problem is that we teach it badly, divorced from real applications.
When you're grinding LeetCode, you're not just preparing for interviews. You're learning the fundamental building blocks of systems you use every single day.
The next time someone on Twitter says "nobody uses DP in real life," just run `git diff` and smile.
Going Deeper: Resources and Rabbit Holes
If you've made it this far and want to dive deeper, here are the canonical resources:
The Original Papers
Hunt & McIlroy (1976): "An Algorithm for Differential File Comparison" - The original Unix diff algorithm. Surprisingly readable for a 1970s paper.
Myers (1986): "An O(ND) Difference Algorithm and Its Variations" - This is what Git uses. The paper is dense but worth it.
Heckel (1978): "A Technique for Isolating Differences Between Files" - Another influential approach, used in some merge tools.
Implementations to Study
- Git's diff implementation: Check out `xdiff/xdiffi.c` in the Git source. It's in C and surprisingly readable.
- diff-match-patch: Google's diff-match-patch library implements Myers' algorithm in multiple languages. Great for learning.
- Python's difflib: The standard library's `difflib` module implements a variant of these algorithms. Check the source!
Tools to Explore
Try different diff algorithms and see how they behave:
```bash
# Standard Myers
git diff HEAD~1

# Patience diff (often better for code)
git diff --patience HEAD~1

# Histogram diff (a faster refinement of patience)
git diff --histogram HEAD~1

# Minimal diff (slower but smallest output)
git diff --minimal HEAD~1
```
Each one makes different trade-offs between speed, diff size, and "human readability."
Books Worth Reading
- "The Algorithm Design Manual" by Skiena - Chapter on DP includes diff as a motivating example
- "Algorithms" by Dasgupta, Papadimitriou, Vazirani - Clean explanation of edit distance
- "Version Control with Git" by Loeliger & McCullough - Chapter 8 dives into diff and merge
The Uncomfortable Truth
Here's what I really think about the DSA debate:
The problem isn't that companies ask too much DSA. The problem is that:
- We teach DSA as puzzle-solving, not as tools for building systems
- Interviews test recognition, not understanding or application
- The industry cargo-cults Google, even when the context is completely different
But the solution isn't to abandon DSA. It's to teach it better, contextualize it, show where it appears in real systems.
Because here's the thing: you can have opinions about whether LeetCode-style interviews are good or bad. That's fine. Debate that all day.
But you cannot have opinions about whether LCS is useful. It objectively is. You used it today, multiple times, whether you realized it or not.
That LeetCode problem you thought was useless? You used it today. Probably dozens of times.
Next time you run `git diff`, remember: you're looking at the output of the exact algorithm you solved at 2 AM and forgot about.
The difference between "toy problem" and "production system" is often just context.
It's all connected. And that's kind of beautiful.
References
Hunt, J. W., & McIlroy, M. D. (1976). An Algorithm for Differential File Comparison. Bell Laboratories Computing Science Technical Report #41.
Myers, E. W. (1986). An O(ND) difference algorithm and its variations. Algorithmica, 1(1-4), 251-266.
Ukkonen, E. (1985). Algorithms for approximate string matching. Information and Control, 64(1-3), 100-118.
Tichy, W. F. (1984). The string-to-string correction problem with block moves. ACM Transactions on Computer Systems, 2(4), 309-321.
Git Documentation: git-diff and gitdiffcore
Bergroth, L., Hakonen, H., & Raita, T. (2000). A survey of longest common subsequence algorithms. String Processing and Information Retrieval, 39-48.
Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2009). Introduction to Algorithms (3rd ed.). MIT Press. Chapter 15.4: Longest Common Subsequence.