How Diff Algorithms Work: From Git to Modern Code Review Tools
The first time I tried to understand how git diff works, I assumed it was simple line-by-line string comparison. It's not. It's a carefully optimized implementation of a 40-year-old algorithm that solves a non-trivial computer science problem: given two sequences, find the minimum number of edits to transform one into the other.
This matters because the algorithm you choose changes what your diff looks like -- and a diff that's technically correct but visually confusing wastes your time. Understanding how these algorithms work helps you read diffs faster, configure your tools better, and recognize when the algorithm is misleading you.
Let's start with the core problem and work through the solutions.
The Core Problem: Longest Common Subsequence
A diff algorithm answers one question: what's the minimal set of changes needed to transform file A into file B?
The mathematical foundation is the Longest Common Subsequence (LCS) problem. Given two sequences of tokens (usually lines of text), find the longest subsequence that appears in both, in the same order, without being contiguous.
Example:
A: a b c d e f g
B: a b x d e y g
The LCS is a b d e g. Everything in A outside the LCS was "removed" (c, f). Everything in B outside the LCS was "added" (x, y). The diff output shows:
a
b
- c
+ x
d
e
- f
+ y
g
This is also called the shortest edit script (SES) -- the minimal number of insertions and deletions to transform A into B. The diffing problem reduces to computing the SES efficiently.
The Myers Diff Algorithm
Eugene Myers published "An O(ND) Difference Algorithm and Its Variations" in 1986. Git uses this algorithm. It's the default in most version control systems and the one you interact with every time you run git diff.
How It Works
Myers models the diff as a grid search problem. If file A has M lines and file B has N lines, you construct an (M+1) × (N+1) grid. Moving right means "delete a line from A." Moving down means "insert a line from B." Moving diagonally means "keep the line (it matches)."
The algorithm finds the shortest path from the top-left corner (both files empty) to the bottom-right corner (both files fully consumed). Diagonal moves are "free" (zero cost) because they represent matching lines. Horizontal and vertical moves cost one edit each.
B: a b x d e y g
A: +------------------------
| 0 1 2 3 4 5 6 7
a | 1 \ 2 3 4 5 6 7
b | 2 2 \ 3 4 5 6 7
c | 3 3 3 4 5 6 7 8
d | 4 4 4 4 \ 5 6 7
e | 5 5 5 5 5 \ 6 7
f | 6 6 6 6 6 6 7 8
g | 7 7 7 7 7 7 7 \
The backslash character \ represents a diagonal move (matching line). The number at each cell is D, the edit distance. The algorithm searches outward from the start, expanding the frontier of reachable positions as D increases, until the bottom-right corner is reached.
Why It's Fast
Myers runs in O(ND) time, where N is the number of lines and D is the edit distance. For files with few differences (small D), it's nearly linear. For files with many differences, it degrades to O(N²) in the worst case.
Git's implementation adds several optimizations:
- It processes lines as hash values, not raw strings, so string comparison is constant-time after hashing.
- It uses a sliding window to handle large files -- diffs beyond a configurable size threshold use a faster but less precise heuristic.
- It breaks the diff into "hunks" -- blocks of changes surrounded by N lines of unchanged context.
When Myers Produces Confusing Output
Myers finds a shortest edit script, but not necessarily the most human-readable one. Consider this scenario:
# Original
function greet() {
console.log("Hello");
console.log("World");
}
# Modified (swapped the two log lines)
function greet() {
console.log("World");
console.log("Hello");
}
Myers might produce:
function greet() {
- console.log("Hello");
console.log("World");
+ console.log("Hello");
}
This misleads you into thinking one line was deleted and added lower down, when in reality the two lines were simply swapped. A human would prefer:
function greet() {
- console.log("Hello");
+ console.log("World");
- console.log("World");
+ console.log("Hello");
}
The first output is technically a valid shortest edit script (2 operations: delete "Hello" and add "Hello" below "World"). The second is also an SES (2 operations: change the first line, change the second). Myers doesn't distinguish between them -- both have the same edit distance. This is where more modern algorithms improve the result.
Patience Diff
Bram Cohen (creator of BitTorrent) introduced the patience diff algorithm to solve exactly this readability problem. Git supports it via git diff --patience.
How It Works
Patience diff focuses on unique matching lines -- lines that appear exactly once in both files. These are high-confidence anchors.
- Find all lines that appear exactly once in both file A and file B.
- From these, compute the longest common subsequence (using patience sorting, hence the name).
- These matching lines become "anchors" that partition the files into sections.
- Recursively diff each section using the same algorithm (or fall back to Myers for small sections).
# Example with swapped functions
# File A: # File B:
function greet() { function greet() {
console.log("Hello"); console.log("World");
} }
function farewell() { function farewell() {
console.log("Bye"); console.log("Bye");
} }
function notify() {
console.log("Ping");
}
Myers might try to match console.log lines across functions, producing a confusing diff. Patience recognizes function farewell() and function greet() as unique anchors, partitions the file around them, and produces a cleaner, function-aligned diff.
When to Use Patience Diff
git diff --patience
git merge --strategy-option=patience
Use patience diff when:
- Functions or code blocks have been reordered.
- You want the diff to align structurally rather than textually.
- The default diff output looks correct but confusing.
Patience is slower than Myers (O(N²) worst case for the patience sorting step), but the difference is negligible on files under 10,000 lines.
Histogram Diff
Git also supports histogram diff (git diff --histogram), which refines patience diff further. Instead of only using unique lines as anchors, histogram diff considers lines that appear multiple times but are still useful for partitioning.
How It Works
- Build a histogram of line occurrences in file A.
- For each line in the histogram, check how many times it appears in file B.
- Select lines with the lowest combined occurrence count as anchors.
- Partition and recursively diff.
This bridges the gap between Myers (uses all lines, including noisy duplicates) and patience (only uses unique lines, potentially missing useful anchors in repetitive code).
git diff --histogram
Histogram is often the best default for code review. It produces readable output for reordered code while handling repetitive code (loops, template boilerplate, import statements) more gracefully than pure patience diff.
How Git Chooses and Configures Diff Algorithms
Git's default is Myers, but you can change it globally:
git config --global diff.algorithm histogram
Or per-repository in .gitattributes:
*.js diff=histogram
*.py diff=patience
The .gitattributes file can also define custom diff drivers for generated files:
*.min.js -diff # Don't diff minified files
package-lock.json -diff
For comparing build outputs or generated files where diffing is pointless, this saves rendering time and keeps the PR focused on meaningful changes.
Indentation Heuristic
Git 2.9+ includes an indentation heuristic that improves diff readability for code. When a diff block contains a blank line or a line that starts a new indentation block, Git prefers to split the hunk at that point. This produces diffs that align with code structure:
git config --global diff.indentHeuristic true
This is enabled by default in newer Git versions.
Diff in Modern Code Review Tools
GitHub and GitLab
Both platforms use Git's diff engine on the backend with custom rendering on the frontend. They add:
- Syntax highlighting per file type.
- Word-level diffing within changed lines (the darker highlighting inside a changed line shows which specific word or token changed).
- Hunk-level commenting so reviewers can attach feedback to specific blocks.
- "Load diff" on-demand for large files to avoid overwhelming the browser.
VS Code
VS Code's diff editor uses its own JavaScript-based diff algorithm optimized for interactive editing. It supports:
- Word-level diffing with character-level precision.
- Whitespace-ignore toggles.
- Minimap integration showing where changes cluster.
- Live updates as you edit -- the diff recalculates on every keystroke.
Our Browser-Based Diff Tools
Our diff tool implements a word-level diff algorithm in JavaScript, processing everything locally in the browser. The text compare tool uses a side-by-side layout with the same word-level algorithm. For terminal-style output, the text diff tool produces unified diff format.
The core algorithm is an optimized LCS implementation that balances accuracy with rendering speed. For files under 5,000 lines, the word-level approach catches changes at the token level, not just the line level.
Performance Considerations
Time Complexity Summary
| Algorithm | Worst Case | Best Case | Typical |
|---|---|---|---|
| Myers | O(N²) | O(N+D) | O(N·D) |
| Patience | O(N²) | O(N log N) | O(N log N) |
| Histogram | O(N²) | O(N log N) | O(N log N) |
Where N is the number of lines and D is the edit distance (number of changes).
Practical Performance
On a typical source file (200-500 lines, 10-30 changes), all three algorithms complete in under a millisecond. The differences only become noticeable on:
- Very large files (10,000+ lines) with many changes → Myers is fastest.
- Files with reordered sections → Patience and histogram produce better output at similar speed.
- Generated files with repetitive content → Histogram handles repetition better than patience, Myers handles it fastest.
Memory Usage
All three algorithms require O(N) memory to store the edit graph and LCS. For a 100,000-line file, that's roughly 10-20 MB per comparison -- well within the capacity of any modern machine.
When to Skip Diffing Entirely
For binary files or generated artifacts where diffing is meaningless:
diff <(sha256sum build_v1/bin/) <(sha256sum build_v2/bin/)
Use a checksum tool or hash generator to compare file hashes instead. This is orders of magnitude faster than parsing a diff and doesn't risk misinterpretation.
The Hidden Complexity: What Diff Algorithms Don't Tell You
Diffing Is Not Symmetric
diff(A, B) and diff(B, A) produce different output in most algorithms. The edit distance is the same, but the edit script can differ because the algorithm makes arbitrary choices when multiple equally-short paths exist.
This matters in practice: a code review showing "deleted 20 lines, added 15 lines" reads differently from "modified 8 lines" even if both describe the same change. The algorithm's path choice affects how you perceive the change.
Tokenization Affects Everything
Before diffing, both files are tokenized. If you tokenize by line, the diff is line-oriented. If you tokenize by word, the diff shows word-level changes. If you tokenize by character, you get the most granular diff but the most noise.
Choosing the right tokenizer matters more than choosing the right algorithm for most use cases. This is why our text compare tool uses word-level tokenization -- it finds the sweet spot between line-level (too coarse) and character-level (too noisy).
Semantic Diffing Doesn't Exist (Yet)
No diff algorithm understands code semantics. It doesn't know that renaming a variable from userCount to totalUsers is a refactoring, not a functional change. It doesn't know that moving a function to a different file is a reorganization, not a deletion and addition.
This is why code review still requires human judgment. The diff shows you what changed; you supply the why.
FAQ
What diff algorithm does Git use by default?
Myers diff algorithm. You can switch to patience (--patience) or histogram (--histogram) for better readability on reordered code.
Why does my diff show a function as entirely deleted and re-added when I only moved it?
This is a common Myers diff artifact. When lines shift position, Myers may find a shorter edit script by deleting the old copy and adding the new one rather than identifying the move. Switch to git diff --patience for function-aligned output.
Which algorithm is fastest?
Myers is fastest for files with few changes (small edit distance). For very similar files, it runs in near-linear time. All three algorithms are fast enough for any source file you'll encounter in practice.
How does GitHub highlight specific words within a changed line?
GitHub applies a secondary word-level diff on each changed line. The line diff (Myers) identifies which lines changed, then a character-level algorithm finds the specific tokens within those lines that differ. This is why you see darker highlighting inside changed lines.
Can diff algorithms handle binary files?
No. Diff algorithms are designed for text. For binary comparison, use a hex diff tool (Beyond Compare's hex view), a checksum comparison to verify identity, or bsdiff for binary patching.
Why does VS Code sometimes show a different diff than git diff?
VS Code uses its own diff engine (not Git's) optimized for interactive editing. The algorithms both compute shortest edit scripts, but they may choose different paths when multiple equally-short scripts exist. Both are correct; they just render the same change differently.
When should I use patience vs histogram?
Use patience when functions or structured blocks are reordered and Myers produces confusing output. Use histogram when you have repetitive code (loops, templates) and patience misses useful anchors. For most projects, git diff --histogram is a good default.
How do diff algorithms relate to merge algorithms?
Merge algorithms (used by git merge) build on diff algorithms. A three-way merge computes two diffs: (ancestor → ours) and (ancestor → theirs). The merge algorithm then combines these two edit scripts. If both scripts modify the same lines, you get a conflict. The quality of the diff algorithm directly affects how many false conflicts you see.
Curious to see how different diff algorithms actually look on your code? Our diff tool renders word-level differences directly in the browser with no setup. For side-by-side visual comparison, try the text compare tool. Both process files entirely on your machine -- no uploads, no server round-trips, and no privacy concerns.