Text Diff & Compare Tool
Diff Algorithms
The problem of finding differences between two sequences is one of the most studied problems in computer science. At its core, a diff algorithm must identify which elements are common to both sequences and which are unique to each. The output is a minimal set of operations (insertions and deletions) that transform the first sequence into the second.
The foundational algorithm for this problem is the Longest Common Subsequence (LCS) algorithm, first formalized in the 1970s. LCS finds the longest sequence of elements that appears in both inputs in the same order, though not necessarily consecutively. For example, given the strings "ABCDE" and "ACDEB", the LCS is "ACDE" (length 4). The elements not in the LCS are the differences: "B" was removed from position 2 in the original, and "B" was added at position 5 in the modified text.
The classic LCS algorithm uses dynamic programming with a two-dimensional table. For two sequences of length m and n, it requires O(mn) time and space. This is efficient enough for most text comparison tasks, but it becomes prohibitive for very large inputs: comparing two files of 100,000 lines each would need a table of 10^10 cells.
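The table-based approach can be sketched as follows. This is a minimal illustration under assumed names (lcs_table and lcs_diff are hypothetical, not this tool's code): it fills the O(mn) table from the end of both sequences, then walks it from the start to emit kept, removed, and added elements.

```python
def lcs_table(a, b):
    """table[i][j] = length of the LCS of a[i:] and b[j:]."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def lcs_diff(a, b):
    """Return a list of (tag, element) pairs; tag is ' ', '-' or '+'."""
    table = lcs_table(a, b)
    ops, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            ops.append((" ", a[i]))  # element is part of the LCS
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            ops.append(("-", a[i]))  # dropping a[i] keeps the LCS as long
            i += 1
        else:
            ops.append(("+", b[j]))  # otherwise take b[j] as an insertion
            j += 1
    ops.extend(("-", x) for x in a[i:])  # leftover originals were removed
    ops.extend(("+", y) for y in b[j:])  # leftover new elements were added
    return ops


for tag, element in lcs_diff("ABCDE", "ACDEB"):
    print(tag, element)
```

Run on the "ABCDE" / "ACDEB" example, the elements tagged as kept spell out "ACDE", with one removed "B" and one added "B". Passing lists of lines instead of strings gives a line-level diff.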
In 1986, Eugene Myers published a landmark paper "An O(ND) Difference Algorithm and Its Variations" that introduced a more efficient approach. Myers' algorithm finds the shortest edit script (the minimal number of insertions and deletions) in O(ND) time, where N is the sum of the sequence lengths and D is the size of the minimum edit script. For sequences that are very similar (small D), this is dramatically faster than the classic O(mn) approach. Myers' algorithm is the foundation of the diff implementations in Git, GNU diff, and most modern version control systems.
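A rough sketch of the greedy forward search described in Myers' paper is shown below. It is simplified to return only D, the length of the shortest edit script, not the script itself. The function name is an assumption made for illustration, and production implementations add refinements (such as a linear-space variant) that are omitted here.

```python
def myers_edit_distance(a, b):
    """Length D of the shortest insert/delete script turning a into b."""
    n, m = len(a), len(b)
    # furthest[k] = largest x reached so far on diagonal k, where k = x - y
    furthest = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and furthest[k - 1] < furthest[k + 1]):
                x = furthest[k + 1]      # step down: insert an element of b
            else:
                x = furthest[k - 1] + 1  # step right: delete an element of a
            y = x - k
            # follow the diagonal ("snake") while elements match, at no cost
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            furthest[k] = x
            if x >= n and y >= m:
                return d
    return n + m


print(myers_edit_distance("ABCDE", "ACDEB"))
```

The outer loop runs only D + 1 times, which is why nearly identical inputs are handled quickly regardless of their length.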
Later improvements include the patience diff algorithm (available in Git via --diff-algorithm=patience), which produces more human-readable diffs by first matching lines that appear exactly once in each file, keeping the longest in-order run of those matches as anchors, and then diffing the regions between the anchors. The histogram diff algorithm, also available in Git, extends patience diff to anchor on low-occurrence lines as well as strictly unique ones, which helps on repetitive content.
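The unique-line matching step of patience diff can be sketched roughly as below. This is an illustration under assumptions, not Git's implementation: the name patience_anchors is hypothetical, and a complete diff would go on to compare the regions between the returned anchors.

```python
from bisect import bisect_left
from collections import Counter


def patience_anchors(a, b):
    """Return (index_in_a, index_in_b) pairs for lines unique to both inputs,
    keeping only the longest run whose order agrees in a and b."""
    count_a, count_b = Counter(a), Counter(b)
    pos_in_b = {line: j for j, line in enumerate(b) if count_b[line] == 1}
    # Candidate matches, already ordered by their position in a
    pairs = [(i, pos_in_b[line]) for i, line in enumerate(a)
             if count_a[line] == 1 and line in pos_in_b]

    # Longest increasing subsequence of the b positions (patience sorting)
    tails = []      # tails[k] = smallest final b position of a run of length k + 1
    tail_idx = []   # index into pairs of the element stored in tails[k]
    prev = [None] * len(pairs)
    for idx, (_, j) in enumerate(pairs):
        k = bisect_left(tails, j)
        if k == len(tails):
            tails.append(j)
            tail_idx.append(idx)
        else:
            tails[k] = j
            tail_idx[k] = idx
        prev[idx] = tail_idx[k - 1] if k > 0 else None

    # Walk the back-pointers from the end of the longest run
    anchors = []
    idx = tail_idx[-1] if tail_idx else None
    while idx is not None:
        anchors.append(pairs[idx])
        idx = prev[idx]
    return anchors[::-1]


print(patience_anchors(["a", "b", "c", "d"], ["a", "c", "b", "d"]))
```

Repeated lines such as lone braces or blank lines never become anchors, which is what keeps the resulting diff aligned on distinctive lines.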
Version Control History
File comparison tools date back to the early years of Unix. The diff command, written by Douglas McIlroy at Bell Labs in the early 1970s, was one of the first tools to compute and display differences between text files. Its algorithm, developed with James Hunt and described in their 1976 technical report, is now known as the Hunt-McIlroy algorithm; it is a practical LCS method closely related to the Hunt-Szymanski algorithm.
The diff command became a fundamental building block of Unix-based development. The patch command, created by Larry Wall (later the creator of Perl), could take the output of diff and apply it to transform one file into another. This diff-and-patch workflow enabled collaborative software development over email, Usenet, and bulletin board systems long before web-based code hosting existed.
The first widely used version control systems, SCCS (1972), RCS (1982), and CVS (1990), all stored file histories as deltas rather than as complete copies of every version. SCCS interleaved the deltas of all versions in a single file, while RCS (whose file format CVS reused) kept the newest main-line revision in full and stored reverse diffs to reconstruct older ones. Either way, any version could be rebuilt from one history file, and storage requirements dropped dramatically.
Modern distributed version control systems like Git (created by Linus Torvalds in 2005) and Mercurial use more sophisticated storage strategies, but diff remains central to their operation. Every git diff, every pull request review, and every merge conflict resolution involves computing and displaying differences between text versions.
Unified vs Context Diff Format
The output of a diff comparison can be displayed in several standardized formats. The two most common are unified diff format and context diff format. Understanding these formats is important because they are used throughout the software development ecosystem.
Unified diff format (produced by diff -u) is the most widely used format today. It shows changed lines with a single-character prefix: + for added lines, - for removed lines, and a space for unchanged context lines. A header line beginning with @@ indicates the line numbers in both files. Unified diff is the default format for git diff, GitHub pull request reviews, and most code review tools.
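To make the format concrete, the sketch below generates a unified diff with Python's standard difflib module. The file names and sample lines are made up for illustration, and difflib's matcher is not the algorithm this tool uses; only the output format is the point here.

```python
import difflib

old = ["alpha", "beta", "gamma"]
new = ["alpha", "gamma", "delta"]

# lineterm="" because the sample lines carry no trailing newlines
lines = list(difflib.unified_diff(
    old, new, fromfile="old.txt", tofile="new.txt", lineterm=""))
print("\n".join(lines))
```

This prints:

```
--- old.txt
+++ new.txt
@@ -1,3 +1,3 @@
 alpha
-beta
 gamma
+delta
```

The @@ header reads as "3 lines starting at line 1 of the original, 3 lines starting at line 1 of the modified file".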
Context diff format (produced by diff -c) predates unified diff and is more verbose. Changed lines are prefixed with !, added lines with +, and removed lines with -. Each change group (called a "hunk") shows lines from both files separately, preceded by *** for the original and --- for the modified version, so unchanged context lines are usually printed twice. Both formats show three lines of context by default; context diff is less compact than unified diff but can be easier to read when a hunk rewrites many lines.
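For comparison, a context diff can be produced with difflib.context_diff from Python's standard library. As before, the file names and sample lines are invented for illustration; the example includes one changed line and one added line so that both the ! and + prefixes appear.

```python
import difflib

old = ["alpha", "beta", "gamma"]
new = ["alpha", "BETA", "gamma", "delta"]

# lineterm="" because the sample lines carry no trailing newlines
lines = list(difflib.context_diff(
    old, new, fromfile="old.txt", tofile="new.txt", lineterm=""))
print("\n".join(lines))
```

In the output, the section headed by *** lists the original lines (with "! beta"), and the section headed by --- lists the modified lines (with "! BETA" and "+ delta"). Note that this format puts a space after the prefix character, unlike unified diff.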
This tool uses a visual diff format optimized for browser display, with color coding (green for additions, red for deletions, gray for unchanged) and line numbers. While not identical to unified or context diff, the visual format conveys the same information in a more immediately readable form.
Practical Applications
Text comparison tools serve a wide range of practical purposes beyond software development. In code review, developers examine diffs to understand what changed in a pull request, identify potential bugs, and ensure code quality. Effective code review depends on clear, readable diff output that highlights meaningful changes while minimizing noise from formatting differences.
In content management, writers and editors compare document versions to track revisions, review suggested edits, and merge contributions from multiple authors. The ability to see exactly what was added, removed, or modified between drafts is essential for maintaining editorial quality and accountability.
In the legal field, contract comparison (also called redlining or blacklining) is a critical workflow. Lawyers must identify every change between contract versions to ensure that no terms have been altered without agreement. Automated text comparison tools have largely replaced the manual process of reading two documents side by side, reducing the risk of missed changes.
In data quality, comparing expected output against actual output is fundamental to testing and validation. Database migrations, API responses, report generation, and data pipeline outputs all require comparison against expected results. Diff tools make it easy to identify discrepancies and diagnose their causes.
Configuration management relies heavily on diffing to track changes to configuration files across environments. Comparing production, staging, and development configurations helps identify discrepancies that could cause bugs or security vulnerabilities. Beyond diff itself, tools such as Ansible (with its --diff option) and Terraform (in its plan output) present pending changes in a diff-like form so that operators can review infrastructure changes before applying them.
Frequently Asked Questions
What algorithm does this diff tool use?
The tool uses a Longest Common Subsequence (LCS) algorithm. LCS is the problem underlying most diff implementations, including the Unix diff command and Git's diff engine, although those tools solve it with faster algorithms such as Myers'. LCS finds the longest sequence of lines common to both texts and marks everything else as additions or deletions.
What is the difference between line-by-line and word-by-word comparison?
Line-by-line comparison treats each line as a unit and marks entire lines as added, removed, or unchanged. Word-by-word comparison breaks lines into individual words and highlights specific words that differ within each line. Word-by-word mode is more granular and useful when changes are small edits within lines.
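A word-by-word comparison of a single line might look like the sketch below. It is not this tool's implementation: it substitutes difflib.SequenceMatcher from Python's standard library (which uses its own matching heuristic, not a strict LCS), the function name is hypothetical, and the [-removed-] and {+added+} markers are borrowed from the plain style of git diff --word-diff.

```python
import difflib


def word_diff(old_line, new_line):
    """Mark removed words as [-word-] and added words as {+word+}."""
    old_words, new_words = old_line.split(), new_line.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(old_words[i1:i2])
        else:
            # 'delete', 'insert' and 'replace' all reduce to removed + added words
            out.extend(f"[-{w}-]" for w in old_words[i1:i2])
            out.extend(f"{{+{w}+}}" for w in new_words[j1:j2])
    return " ".join(out)


print(word_diff("the quick brown fox", "the slow brown fox"))
# the [-quick-] {+slow+} brown fox
```

A line-by-line diff of the same input would flag the whole line as removed and re-added; splitting on whitespace first is what narrows the highlight to the one changed word.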
Can I ignore whitespace differences?
Yes. Enable the Ignore Whitespace option to normalize spaces, tabs, and trailing whitespace before comparison. Lines that differ only in whitespace will be treated as identical. This is especially useful when comparing code with different indentation styles or line-ending conventions.
Can I ignore case when comparing?
Yes. Enable the Ignore Case option to perform a case-insensitive comparison. The original case is preserved in the output display, but lines that differ only in capitalization will be treated as identical.
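Both the Ignore Whitespace and Ignore Case options can be thought of as comparing a normalized key for each line while displaying the original text. The sketch below is one way to do that; the function names and the exact normalization rules (collapsing whitespace runs to a single space, trimming both ends, case-folding) are assumptions and may differ from what this tool does.

```python
def comparison_key(line, ignore_whitespace=False, ignore_case=False):
    """Return the text actually compared; the original line is left untouched."""
    key = line
    if ignore_whitespace:
        # split() drops leading/trailing whitespace (including \r) and
        # treats any run of spaces or tabs as a single separator
        key = " ".join(key.split())
    if ignore_case:
        key = key.casefold()
    return key


def lines_match(a, b, **options):
    return comparison_key(a, **options) == comparison_key(b, **options)


print(lines_match("\tfoo  bar \r", "foo bar", ignore_whitespace=True))  # True
print(lines_match("Hello", "hello", ignore_case=True))                  # True
print(lines_match("Hello", "hello"))                                    # False
```

Because only the keys are normalized, the diff output can still show each line exactly as it was entered.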
How is similarity percentage calculated?
Similarity is calculated as the number of unchanged lines divided by the total number of unique lines in both texts, expressed as a percentage. A score of 100% means the texts are identical. A score of 0% means the texts share no common lines.
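One plausible reading of this formula, shown here only as an assumption, counts the total as unchanged + removed + added lines. Under that reading the score can be sketched as follows; the function names are hypothetical, and treating two empty texts as 100% similar is a choice made for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, using two rows of the table."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similarity_percent(old_lines, new_lines):
    unchanged = lcs_length(old_lines, new_lines)
    # unchanged + removed + added, with unchanged lines counted once
    total = len(old_lines) + len(new_lines) - unchanged
    return 100.0 if total == 0 else 100.0 * unchanged / total


print(similarity_percent(["a", "b", "c", "d"], ["a", "b", "x", "d"]))  # 60.0
```

In the example, three lines are unchanged, one is removed, and one is added, giving 3 / 5 = 60%. Identical texts score 100% and texts with no common lines score 0%, matching the description above.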
Is my text sent to a server?
No. All comparison happens entirely in your browser using JavaScript. No data is transmitted to any server. You can verify this by disconnecting from the internet and confirming the tool still works.
Can I upload files for comparison?
Yes. Click the Upload File link below each textarea to load a text file from your computer. The file is read locally using the browser's FileReader API. Supported file types include .txt, .md, .html, .css, .js, .json, .xml, .csv, and common programming language source files.