Skip to content

Conversation

goojba
Copy link
Contributor

@goojba goojba commented Jul 3, 2014

While doing some line diffs, I came across this bug. It was very hard to come up with a local fix: the assumption that one index == one character seems baked into the DiffBisect algorithm. Instead I converted part of the diff machinery to []rune.

I actually think all of the diff code should use []rune instead of string. The public interface like DiffMain can remain string, but the strings should be converted to []rune immediately and used throughout. That way there will be no mismatch between indexing and characters.

I'm sure there are many more utf8 bugs lurking in the code. For example, look at the loop in DiffCommonOverlap. It tries to extract the last character from the string, but actually extracts the last byte. And look at the way diffHalfMatchI indexes its first argument:

seed := l[i : i+len(l)/4]

There is no reason to believe that that slice will be valid utf8.

These bugs don't show up the in the tests because the tests use ASCII mostly. But line diffs involve converting the lines into runes, and that produces a lot of non-ASCII utf8. That is how I found the bug.

So what do you think? Will you consider a pull request that rewrites the diff internals to use []rune?

@sergi
Copy link
Owner

sergi commented Jul 4, 2014

This sounds very reasonable. I'd say go ahead with the pull request as long as you keep in mind these concerns (which I'm sure you would anyway):

  • API should stay the same
  • Not change existing tests and add new ones for runes, just to make sure we don't introduce regressions.
  • I am not sure if it applies, but I'd love to have some simple performance benchmarks. I have no idea whether using runes degrades performance significantly, but it would be nice anyway. That's just more of a thought than a task to do :)

Thoughts?

@goojba
Copy link
Contributor Author

goojba commented Jul 15, 2014

I did this work in a separate pull request. Any chance you could merge these in soon? Anything I can do to make these more acceptable?

@sergi
Copy link
Owner

sergi commented Jul 25, 2014

Checked and tested. Looks good!

sergi added a commit that referenced this pull request Jul 25, 2014
Use []rune in diff internals
@sergi sergi merged commit 085e8b9 into sergi:master Jul 25, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants