Improve the base diffing algorithm #160

@gdavoianb

Hi, folks. I was so inspired by the jsdiff library that I decided to port it to Python; see pydiff (@kpdecker, I hope I am not violating anything, but if I am, please let me know).

The current implementation of the algorithm tends to perform additions first and then removals. Swapping the added and removed parts afterwards sometimes doesn't help much, because by then we may already have ended up with a "bad" solution.

Let me show you an example. Consider the following two strings:

TERMS AND CONDITIONS
A. TERMS OF SALE
B. ITUNES STORE TERMS AND CONDITIONS
C. MAC APP STORE, APP STORE, APP STORE FOR APPLE TV AND IBOOKS STORE TERMS AND CONDITIONS
D. APPLE MUSIC TERMS AND CONDITIONS
THE LEGAL AGREEMENTS SET OUT BELOW GOVERN YOUR USE OF THE ITUNES STORE, MAC APP STORE, APP STORE, APP STORE FOR APPLE TV, IBOOKS STORE AND APPLE MUSIC SERVICES ("SERVICES").

and

TERMS OR CONDITIONS
A. TERMS OF SALE
B. ITUNES STORE TERMS AND CONDITIONS
C. MAC APP STORE, APP STORE, APP STORE FOR APPLE TV TERMS AND CONDITIONS, AND SOMETHING ELSE
D. APPLE MUSIC TERMS AND CONDITIONS
THE LEGAL CONTRACTS SET OUT BELOW GOVERN YOUR USE OF THE ITUNES STORE, MAC APP STORE, APP STORE, APP STORE FOR APPLE TV, IBOOKS STORE AND APPLE MUSIC SERVICES ("SERVICES").

The content of the strings doesn't matter. The most interesting part is clause C, which has a deleted part ("AND IBOOKS STORE") and an inserted part (", AND SOMETHING ELSE").

Using jsdiff.diffWords to compare these strings, we obtain the following diff:

[ { count: 2, value: 'TERMS ' },
  { count: 1, added: undefined, removed: true, value: 'AND' },
  { count: 1, added: true, removed: undefined, value: 'OR' },
  { count: 50,
    value: ' CONDITIONS\nA. TERMS OF SALE\nB. ITUNES STORE TERMS AND CONDITIONS\nC. MAC APP STORE, APP STORE, APP STORE FOR APPLE TV ' },
  { count: 2, added: true, removed: undefined, value: 'TERMS ' },
  { count: 2, value: 'AND ' },
  { count: 1, added: undefined, removed: true, value: 'IBOOKS' },
  { count: 2,
    added: true,
    removed: undefined,
    value: 'CONDITIONS,' },
  { count: 1, value: ' ' },
  { count: 1, added: undefined, removed: true, value: 'STORE' },
  { count: 1, added: true, removed: undefined, value: 'AND' },
  { count: 1, value: ' ' },
  { count: 1, added: undefined, removed: true, value: 'TERMS' },
  { count: 1, added: true, removed: undefined, value: 'SOMETHING' },
  { count: 1, value: ' ' },
  { count: 1, added: undefined, removed: true, value: 'AND' },
  { count: 1, added: true, removed: undefined, value: 'ELSE' },
  { count: 1, value: '\n' },
  { count: 2,
    added: undefined,
    removed: true,
    value: 'CONDITIONS\n' },
  { count: 17,
    value: 'D. APPLE MUSIC TERMS AND CONDITIONS\nTHE LEGAL ' },
  { count: 1, added: undefined, removed: true, value: 'AGREEMENTS' },
  { count: 1, added: true, removed: undefined, value: 'CONTRACTS' },
  { count: 60,
    value: ' SET OUT BELOW GOVERN YOUR USE OF THE ITUNES STORE, MAC APP STORE, APP STORE, APP STORE FOR APPLE TV, IBOOKS STORE AND APPLE MUSIC SERVICES ("SERVICES").' } ]

Though this diff is optimal in terms of edit distance (17 removed or added tokens), it is not good enough: it is too complicated and has too many components.

But there is another optimal diff, also with 17 changes (obtained in Python):
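To make "optimal but too many components" concrete, here is a small sketch that counts both quantities for a change list of this shape. The helper name `summarize` and the compact `(kind, count)` transcription are assumptions made for illustration; the counts are copied from the jsdiff output above, with the `value` fields left out for brevity.

```python
def summarize(diff):
    """Return the edit distance and component counts of a jsdiff-style change list."""
    changed = [part for part in diff if part.get("added") or part.get("removed")]
    return {
        "edit_distance": sum(part["count"] for part in changed),
        "changed_components": len(changed),
        "total_components": len(diff),
    }


# Transcription of the jsdiff output above: "=" unchanged, "-" removed, "+" added.
JSDIFF_PARTS = [
    ("=", 2), ("-", 1), ("+", 1), ("=", 50), ("+", 2), ("=", 2), ("-", 1),
    ("+", 2), ("=", 1), ("-", 1), ("+", 1), ("=", 1), ("-", 1), ("+", 1),
    ("=", 1), ("-", 1), ("+", 1), ("=", 1), ("-", 2), ("=", 17), ("-", 1),
    ("+", 1), ("=", 60),
]

# Expand the compact pairs into dicts shaped like the objects jsdiff returns.
jsdiff_output = [
    {"count": count, "added": kind == "+", "removed": kind == "-"}
    for kind, count in JSDIFF_PARTS
]

print(summarize(jsdiff_output))
# {'edit_distance': 17, 'changed_components': 14, 'total_components': 23}
```

The edit distance is the sum of `count` over the added and removed parts, so two diffs can share the same distance while differing a lot in how many separate components a reader has to follow.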

[{'count': 2, 'value': 'TERMS '},
 {'count': 1, 'removed': True, 'added': None, 'value': 'AND'},
 {'count': 1, 'removed': None, 'added': True, 'value': 'OR'},
 {'count': 50, 'value': ' CONDITIONS\nA. TERMS OF SALE\nB. ITUNES STORE TERMS AND CONDITIONS\nC. MAC APP STORE, APP STORE, APP STORE FOR APPLE TV '},
 {'count': 6, 'removed': True, 'value': 'AND IBOOKS STORE ', 'added': None},
 {'count': 5, 'value': 'TERMS AND CONDITIONS'},
 {'count': 1, 'removed': None, 'added': True, 'value': ','},
 {'count': 1, 'value': ' '},
 {'count': 6, 'removed': None, 'value': 'AND SOMETHING ELSE\n', 'added': True},
 {'count': 17, 'value': 'D. APPLE MUSIC TERMS AND CONDITIONS\nTHE LEGAL '},
 {'count': 1, 'removed': True, 'added': None, 'value': 'AGREEMENTS'},
 {'count': 1, 'removed': None, 'added': True, 'value': 'CONTRACTS'},
 {'count': 60, 'value': ' SET OUT BELOW GOVERN YOUR USE OF THE ITUNES STORE, MAC APP STORE, APP STORE, APP STORE FOR APPLE TV, IBOOKS STORE AND APPLE MUSIC SERVICES ("SERVICES").'}]

IMHO, this diff looks much better, because it has fewer components and is thus more human-readable. It also tries to remove as many tokens as possible and only then to add tokens, so we get bigger disjoint groups of removed/added tokens instead of many one-word changes like removed/added, removed/added, etc.

What do you think about this change? Do you find it reasonable? When I started porting the JS code to Python, I tried to keep the Python output as close as possible to the JS output, but then I decided to go one step further and improve the readability of the diffs. Now I am suggesting back-porting this change to jsdiff. I don't have a JS implementation yet, but my Python implementation can be used as a baseline.

Any suggestions, comments, alternative opinions, and criticism are welcome :)

Thank you for your attention.
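The "removals first" preference can be illustrated with a minimal sketch. This is not the pydiff or jsdiff implementation: it uses a plain quadratic LCS table instead of whatever search either library performs, and the names `lcs_table`, `diff_tokens` and `prefer_removal` are hypothetical. The only point is that, among diffs with the same edit distance, the tie-break decides whether a removal or an addition comes first.

```python
def lcs_table(old, new):
    """table[i][j] = length of the longest common subsequence of old[i:] and new[j:]."""
    n, m = len(old), len(new)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def diff_tokens(old, new, prefer_removal=True):
    """Return a minimal edit script as ("=" | "-" | "+", token) pairs.

    When removing old[i] and adding new[j] keep the same LCS length,
    prefer_removal decides which of the two is emitted first.
    """
    table = lcs_table(old, new)
    ops = []
    i = j = 0
    while i < len(old) and j < len(new):
        if old[i] == new[j]:
            ops.append(("=", old[i]))
            i += 1
            j += 1
            continue
        after_removal = table[i + 1][j]
        after_addition = table[i][j + 1]
        if after_removal > after_addition or (
            after_removal == after_addition and prefer_removal
        ):
            ops.append(("-", old[i]))
            i += 1
        else:
            ops.append(("+", new[j]))
            j += 1
    # Whatever is left on either side can only be removed or added.
    ops.extend(("-", token) for token in old[i:])
    ops.extend(("+", token) for token in new[j:])
    return ops


print(diff_tokens(["a", "b"], ["b", "a"]))
# [('-', 'a'), ('=', 'b'), ('+', 'a')]
print(diff_tokens(["a", "b"], ["b", "a"], prefer_removal=False))
# [('+', 'b'), ('=', 'a'), ('-', 'b')]
```

Both results contain two edits, so they are equally optimal; only the ordering differs. On longer inputs with many repeated tokens (such as the whitespace tokens in the example strings), this kind of tie-break can determine whether changes cluster into a few large groups or scatter into many small ones.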
