-
-
Notifications
You must be signed in to change notification settings - Fork 19.2k
CLN: prepare unifying hashtable.factorize and .unique; add doc-strings #22986
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 11 commits
640162f
31d0dc5
c5e5147
9918d52
52ae84e
dbe4e0e
8481e19
27ceb4d
f5cd5e9
b1705a9
a6ed5dd
17752ce
19eaf32
ce7626f
7b9014f
471c4da
9d45378
00b2ccb
8687315
a267d4a
7f1bb40
9593992
08d7f50
d91be98
e27ec9a
d825be0
1a342d0
28e0441
d65e4fd
bca615c
3438727
facc111
6d0e86b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||
|---|---|---|---|---|
|
|
@@ -355,19 +355,15 @@ cdef class {{name}}HashTable(HashTable): | |||
|
|
||||
| return np.asarray(locs) | ||||
|
|
||||
| def factorize(self, {{dtype}}_t values): | ||||
| uniques = {{name}}Vector() | ||||
| labels = self.get_labels(values, uniques, 0, 0) | ||||
| return uniques.to_array(), labels | ||||
|
|
||||
| @cython.boundscheck(False) | ||||
| def get_labels(self, const {{dtype}}_t[:] values, {{name}}Vector uniques, | ||||
| Py_ssize_t count_prior, Py_ssize_t na_sentinel, | ||||
| object na_value=None): | ||||
| @cython.wraparound(False) | ||||
| def _unique_with_inverse(self, const {{dtype}}_t[:] values, | ||||
| {{name}}Vector uniques, bint ignore_na=False, | ||||
| Py_ssize_t count_prior=0, | ||||
| Py_ssize_t na_sentinel=-1, object na_value=None): | ||||
| cdef: | ||||
| Py_ssize_t i, n = len(values) | ||||
| Py_ssize_t i, idx, count = count_prior, n = len(values) | ||||
| int64_t[:] labels | ||||
| Py_ssize_t idx, count = count_prior | ||||
| int ret = 0 | ||||
| {{dtype}}_t val, na_value2 | ||||
| khiter_t k | ||||
|
|
@@ -392,16 +388,19 @@ cdef class {{name}}HashTable(HashTable): | |||
| for i in range(n): | ||||
| val = values[i] | ||||
|
|
||||
| if val != val or (use_na_value and val == na_value2): | ||||
| if ignore_na and (val != val | ||||
| or (use_na_value and val == na_value2)): | ||||
| labels[i] = na_sentinel | ||||
| continue | ||||
|
|
||||
| k = kh_get_{{dtype}}(self.table, val) | ||||
|
|
||||
| if k != self.table.n_buckets: | ||||
| # k falls into a previous bucket | ||||
| idx = self.table.vals[k] | ||||
| labels[i] = idx | ||||
| else: | ||||
| # k hasn't been seen yet | ||||
| k = kh_put_{{dtype}}(self.table, val, &ret) | ||||
| self.table.vals[k] = count | ||||
|
|
||||
|
|
@@ -416,7 +415,26 @@ cdef class {{name}}HashTable(HashTable): | |||
| labels[i] = count | ||||
| count += 1 | ||||
|
|
||||
| return np.asarray(labels) | ||||
| return uniques.to_array(), np.asarray(labels) | ||||
|
|
||||
| def unique(self, const {{dtype}}_t[:] values, bint return_inverse=False): | ||||
| if return_inverse: | ||||
| return self._unique_with_inverse(values, uniques={{name}}Vector(), | ||||
| ignore_na=False) | ||||
| return self._unique_no_inverse(values) | ||||
|
|
||||
| def factorize(self, {{dtype}}_t[:] values): | ||||
| return self._unique_with_inverse(values, uniques={{name}}Vector(), | ||||
| ignore_na=True) | ||||
|
|
||||
| def get_labels(self, const {{dtype}}_t[:] values, {{name}}Vector uniques, | ||||
| Py_ssize_t count_prior=0, Py_ssize_t na_sentinel=-1, | ||||
| object na_value=None): | ||||
| _, labels = self._unique_with_inverse(values, uniques, ignore_na=True, | ||||
| count_prior=count_prior, | ||||
| na_sentinel=na_sentinel, | ||||
| na_value=na_value) | ||||
|
||||
| labels = table.get_labels(values, uniques, 0, na_sentinel, |
Which can directly use the hashtable.factorize method. So I think this would be easy to remove, and would actually simplify further the diff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche
I'd prefer not loading up this PR with more things. The removal of get_labels in favour of factorize could be a separate PR, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it could. But it is here that you are changing get_labels, and actually doing this would make the diff a lot smaller. So I think that is an indication that it naturally fits here.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am bothered by the naming, call these just _unique, you can leave _unique_with_inverse, though add a comment that makes it clear this is less performant as its doing more ops. bonus for doc-strings on these.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this keyword is completely untested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would argue that it is (at least for ignore_na=True) because all the tests for factorize need to run through this.
Do we explicitly test cython code? I can if necessary. This PR does the required preparations on cython level to continue with return_inverse later on. It will be excessively tested there (because unique(return_inverse=True) calls _unique_with_inverse(ignore_na=False)).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes of course, you added a keyword.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am talking about the ignore_na keyword
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that I added this keyword to switch between .unique/.factorize behaviour.
Can you point me to where the cython hashtable code is tested? I could add some tests there. That's why I asked "Do we explicitly test cython code?", because I haven't come across it yet.
Otherwise, I would add copious tests for the behaviour as soon as the keyword makes its way into the pandas API, but this is just the preparation for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@h-vetinari : IMO tests/generic/test_generic.py or tests/generic/test_frame.py could be good places to put any kinds of tests for this refactoring for now.
If it is possible to hit those keyword arguments from Python land, then do add some tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's drop the ignore_na for now. and add it when I see that you actually need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback
This aspect is the whole point of the PR - to not duplicate the code between .factorize (=hashtable._unique_with_inverse(ignore_na=True)) and the not-yet-implemented-because-this-PR-is-starting-the-process .unique(return_inverse=True) (=hashtable._unique_with_inverse(ignore_na=False))
hashtable._unique_with_inverse(ignore_na=False)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comment: since what this algo does is what we typically call "factorize" in pandas, I would call this function "_factorize"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche
Can we keep this for the follow-up please? It will make sense there, because then
factorize = unique_with_inverse(ignore_na=True).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand your point here.
hashtable.factorizealready exists?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO it makes sense to do that here, because it is here you are changing
get_labelsThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I thought this was a reply to the other comment :-)
Yes, and otherwise the other way around
unique(return_inverse=True) = factorize(ignore_na=False). The implemenation of that function is what factorize does in pandas (and not unique), so it makes sense to me to use that name.