Hi,I have noticed that raw output of Myers diff algorithm may be a bit confusing in some cases, most notably when highlighting inline changes. The issue may be easily reproduced and I am attaching a simple test case (see e.g. first g in gnomeglade) and a screenshot showing the results. The included patch tries to address the issue by adding a post-processing clean-up step and reducing the number of small matching chunks.
Comments and testing welcome. Cheers, Piotr
>From d6feec0227b3b3ae67c23aba6650edd26a64b0f7 Mon Sep 17 00:00:00 2001
From: Piotr Piastucki <the_leech users berlios de>
Date: Fri, 28 Jan 2011 10:57:05 +0100
Subject: [PATCH] Add post-processing cleanup algoritm to MyersSequenceMatcher
While there may exist several optimal edit scripts some of them are
more suitable for human use. The goal of a post-processing cleanup
algorithm is to provide more meaningful results by reducing 'chaff' in
diff algorithm output. This commit adds a simple cleanup algorithm
that combines small commonalities together in order to yield more
human-friendly yet optimal results.
---
meld/matchers.py | 27 +++++++++++++++++++++++++++
1 files changed, 27 insertions(+), 0 deletions(-)
diff --git a/meld/matchers.py b/meld/matchers.py
index 5393de0..9540b90 100644
--- a/meld/matchers.py
+++ b/meld/matchers.py
@@ -133,6 +133,32 @@ class MyersSequenceMatcher(difflib.SequenceMatcher):
b = b2
return (a, b)
+ def postprocess(self):
+ mb = []
+ i = len(self.matching_blocks) - 2
+ mb.append(self.matching_blocks[i + 1])
+ while i >= 0:
+ cur_block = self.matching_blocks[i]
+ cur_a = cur_block[0]
+ cur_b = cur_block[1]
+ cur_len = cur_block[2]
+ i -= 1
+ while i >= 0:
+ prev_block = self.matching_blocks[i]
+ prev_a = prev_block[0]
+ prev_b = prev_block[1]
+ prev_len = prev_block[2]
+ if prev_b + prev_len == cur_b or prev_a + prev_len == cur_a:
+ if self.a[cur_a - prev_len : cur_a] == self.b[cur_b - prev_len : cur_b]:
+ cur_b -= prev_len
+ cur_a -= prev_len
+ cur_len += prev_len
+ i -= 1
+ continue
+ break
+ mb.insert(0, (cur_a,cur_b,cur_len))
+ self.matching_blocks = mb
+
def build_matching_blocks(self, lastsnake, snakes):
"""
Build list of matching blocks based on snakes taking into consideration all preprocessing
@@ -263,4 +289,5 @@ class MyersSequenceMatcher(difflib.SequenceMatcher):
lastsnake = node
break
self.build_matching_blocks(lastsnake, snakes)
+ self.postprocess()
yield 1
--
1.7.1
Attachment:
diff_chaff.tar.gz
Description: GNU Zip compressed data