Deprecate SparseDataFrame and SparseSeries #26137
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -6,27 +6,28 @@ | |||||
| Sparse data structures | ||||||
| ********************** | ||||||
|
|
||||||
| We have implemented "sparse" versions of ``Series`` and ``DataFrame``. These are not sparse | ||||||
| in the typical "mostly 0". Rather, you can view these objects as being "compressed" | ||||||
| where any data matching a specific value (``NaN`` / missing value, though any value | ||||||
| can be chosen) is omitted. A special ``SparseIndex`` object tracks where data has been | ||||||
| "sparsified". This will make much more sense with an example. All of the standard pandas | ||||||
| data structures have a ``to_sparse`` method: | ||||||
| .. note:: | ||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| ts = pd.Series(np.random.randn(10)) | ||||||
| ts[2:-2] = np.nan | ||||||
| sts = ts.to_sparse() | ||||||
| sts | ||||||
| ``SparseSeries`` and ``SparseDataFrame`` have been deprecated. Their purpose | ||||||
| is served equally well by a :class:`Series` or :class:`DataFrame` with | ||||||
| sparse values. See :ref:`sparse.migration` for tips on migrating. | ||||||
|
|
||||||
| The ``to_sparse`` method takes a ``kind`` argument (for the sparse index, see | ||||||
| below) and a ``fill_value``. So if we had a mostly zero ``Series``, we could | ||||||
| convert it to sparse with ``fill_value=0``: | ||||||
| Pandas provides data structures for efficiently storing sparse data. | ||||||
| These are not necessarily sparse in the typical "mostly 0" sense. Rather, you can view these | ||||||
| objects as being "compressed" where any data matching a specific value (``NaN`` / missing value, though any value | ||||||
| can be chosen, including 0) is omitted. A special ``SparseIndex`` object tracks where data has been | ||||||
| "sparsified". For example, | ||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| ts.fillna(0).to_sparse(fill_value=0) | ||||||
| arr = np.random.randn(10) | ||||||
| arr[2:-2] = np.nan | ||||||
| ts = pd.Series(pd.SparseArray(arr)) | ||||||
| ts | ||||||
|
|
||||||
| Notice the dtype, ``Sparse[float64, nan]``. The ``nan`` means that elements in the | ||||||
| array that are ``nan`` aren't actually stored, only the non-``nan`` elements are. | ||||||
| Those non-``nan`` elements have a ``float64`` dtype. | ||||||
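The two parts of the dtype described above can be inspected directly. This is a minimal sketch, assuming a newer pandas in which the array class is spelled `pd.arrays.SparseArray` (the diff uses the top-level `pd.SparseArray`); the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Assumption: pd.arrays.SparseArray stands in for the diff's pd.SparseArray.
arr = np.array([1.0, np.nan, np.nan, np.nan, 2.0])
ts = pd.Series(pd.arrays.SparseArray(arr))

dtype = ts.dtype
stored_dtype = dtype.subtype    # dtype of the values that are actually stored
fill_value = dtype.fill_value   # the value that is omitted from storage
n_stored = ts.sparse.npoints    # number of stored (non-fill) values

print(dtype, stored_dtype, fill_value, n_stored)
```

Only the two non-``nan`` elements are stored; the three ``nan`` entries are represented implicitly by the fill value.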
|
|
||||||
| The sparse objects exist for memory efficiency reasons. Suppose you had a | ||||||
| large, mostly NA ``DataFrame``: | ||||||
|
|
@@ -35,21 +36,64 @@ large, mostly NA ``DataFrame``: | |||||
|
|
||||||
| df = pd.DataFrame(np.random.randn(10000, 4)) | ||||||
| df.iloc[:9998] = np.nan | ||||||
| sdf = df.to_sparse() | ||||||
| sdf = df.astype(pd.SparseDtype("float", np.nan)) | ||||||
|
For such a purpose, I was thinking we could also provide

Makes sense to me, though perhaps as a followup? I don't plan to put more time into sparse personally.

this would have to be a

@TomAugspurger can you respond to this

Kinda. If you just do
In [6]: pd.DataFrame({"A": [1, 2]}).sparse
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-6-ab0fb67ed650> in <module>
----> 1 pd.DataFrame({"A": [1, 2]}).sparse
...
~/sandbox/pandas/pandas/core/arrays/sparse.py in _validate(self, data)
2119 dtypes = data.dtypes
2120 if not all(isinstance(t, SparseDtype) for t in dtypes):
-> 2121 raise AttributeError(self._validation_msg)
2122
2123 @classmethod
AttributeError: Can only use the '.sparse' accessor with Sparse data.

But we also allow for

It would be

ok this is all fine; is there a test for using .sparse on non-any-sparse df? |
||||||
| sdf | ||||||
|
||||||
| sdf.density | ||||||
| sdf.sparse.density | ||||||
|
||||||
|
|
||||||
| As you can see, the density (% of values that have not been "compressed") is | ||||||
| extremely low. This sparse object takes up much less memory on disk (pickled) | ||||||
| and in the Python interpreter. Functionally, their behavior should be nearly | ||||||
| identical to their dense counterparts. | ||||||
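The memory claim can be checked with a rough comparison. The sketch below reuses the frame from the diff and compares `memory_usage` of the dense and sparse versions; exact byte counts depend on the platform and pandas version, so none are asserted here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 4))
df.iloc[:9998] = np.nan  # only the last two rows keep real values

# Convert every column to a sparse float dtype with NaN as the fill value.
sdf = df.astype(pd.SparseDtype("float", np.nan))

density = sdf.sparse.density                      # fraction of values actually stored
dense_bytes = df.memory_usage(deep=True).sum()
sparse_bytes = sdf.memory_usage(deep=True).sum()

print(f"density={density:.4%}, dense={dense_bytes} B, sparse={sparse_bytes} B")
```

With 8 stored values out of 40,000, the density is 0.0002 and the sparse frame should be far smaller than the dense one.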
|
|
||||||
| Any sparse object can be converted back to the standard dense form by calling | ||||||
| ``to_dense``: | ||||||
| .. _sparse.array: | ||||||
|
|
||||||
| SparseArray | ||||||
| ----------- | ||||||
|
|
||||||
| :class:`SparseArray` is a :class:`~pandas.api.extensions.ExtensionArray` | ||||||
| for storing an array of sparse values (see :ref:`basics.dtypes` for more | ||||||
| on extension arrays). It is a 1-dimensional ndarray-like object storing | ||||||
| only values distinct from the ``fill_value``: | ||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| arr = np.random.randn(10) | ||||||
| arr[2:5] = np.nan | ||||||
| arr[7:8] = np.nan | ||||||
| sparr = pd.SparseArray(arr) | ||||||
| sparr | ||||||
|
Not important for this PR, but we should actually improve the repr of SparseArray. Currently the example gives (so way too wide, and showing too much detail of the random numbers) |
||||||
|
|
||||||
| A sparse array can be converted to a regular (dense) ndarray with :meth:`numpy.asarray` | ||||||
|
||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| np.asarray(sparr) | ||||||
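A small sketch of the round trip, assuming `pd.arrays.SparseArray` as the constructor (the diff uses `pd.SparseArray`): `np.asarray` materializes the fill values again, while `sp_values` exposes only what is stored.

```python
import numpy as np
import pandas as pd

# Assumption: pd.arrays.SparseArray stands in for the diff's pd.SparseArray.
sparr = pd.arrays.SparseArray([1.0, np.nan, np.nan, 4.0])

dense = np.asarray(sparr)   # regular ndarray, NaNs filled back in
stored = sparr.sp_values    # only the stored (non-fill) values

print(type(dense).__name__, dense.shape, stored)
```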
|
||||||
|
|
||||||
| The :attr:`SparseArray.dtype` property stores two pieces of information: | ||||||
|
||||||
|
|
||||||
| 1. The dtype of the non-sparse values | ||||||
| 2. The scalar fill value | ||||||
|
|
||||||
|
||||||
| A :class:`SparseDtype` may be constructed by passing each of these: | ||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| pd.SparseDtype(np.dtype('datetime64[ns]')) | ||||||
|
|
||||||
| The default fill value for a given NumPy dtype is the "missing" value for that dtype, | ||||||
| though it may be overridden. | ||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| pd.SparseDtype(np.dtype('datetime64[ns]'), | ||||||
| fill_value=pd.Timestamp('2017-01-01')) | ||||||
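As a sketch of how the default depends on the subtype: floats default to ``nan``, while integers, which have no missing-value sentinel, default to ``0``. The variable names below are illustrative and not part of the PR.

```python
import numpy as np
import pandas as pd

float_default = pd.SparseDtype("float64")  # fill value defaults to nan
int_default = pd.SparseDtype("int64")      # fill value defaults to 0
float_zero = pd.SparseDtype("float64", fill_value=0.0)  # default overridden

for dt in (float_default, int_default, float_zero):
    print(dt, "->", dt.subtype, dt.fill_value)
```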
|
|
||||||
| Finally, the string alias ``'Sparse[dtype]'`` may be used to specify a sparse dtype | ||||||
| in many places: | ||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| sts.to_dense() | ||||||
| pd.array([1, 0, 0, 2], dtype='Sparse[int]') | ||||||
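The same alias also works in `astype`. A brief sketch under that assumption, showing the alias in both `pd.array` and `Series.astype`:

```python
import numpy as np
import pandas as pd

# The string alias resolves to a SparseDtype with the default fill value.
sparse_ints = pd.array([1, 0, 0, 2], dtype="Sparse[int]")

# It can also be used to convert an existing dense Series.
ser = pd.Series([1.0, np.nan, 3.0]).astype("Sparse[float64]")

print(sparse_ints.dtype, sparse_ints.sp_values)
print(ser.dtype, ser.sparse.npoints)
```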
|
|
||||||
| .. _sparse.accessor: | ||||||
|
|
||||||
|
|
@@ -71,30 +115,11 @@ attributes and methods that are specific to sparse data. | |||||
| This accessor is available only on data with ``SparseDtype``, and on the :class:`Series` | ||||||
| class itself for creating a Series with sparse data from a scipy COO matrix. | ||||||
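A minimal sketch of that restriction: the accessor works on a Series with a sparse dtype and raises `AttributeError` on dense data, matching the traceback quoted in the review thread. The helper variable `error_message` is illustrative only.

```python
import pandas as pd

sparse_ser = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")
density = sparse_ser.sparse.density        # fraction of values that are stored
fill_value = sparse_ser.sparse.fill_value  # the omitted value

dense_ser = pd.Series([0, 0, 1, 2])
try:
    dense_ser.sparse  # accessor validation fails on non-sparse data
    error_message = None
except AttributeError as err:
    error_message = str(err)

print(density, fill_value, error_message)
```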
|
|
||||||
| .. _sparse.array: | ||||||
|
|
||||||
| SparseArray | ||||||
| ----------- | ||||||
|
|
||||||
| ``SparseArray`` is the base layer for all of the sparse indexed data | ||||||
| structures. It is a 1-dimensional ndarray-like object storing only values | ||||||
| distinct from the ``fill_value``: | ||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| arr = np.random.randn(10) | ||||||
| arr[2:5] = np.nan | ||||||
| arr[7:8] = np.nan | ||||||
| sparr = pd.SparseArray(arr) | ||||||
| sparr | ||||||
|
|
||||||
| Like the indexed objects (SparseSeries, SparseDataFrame), a ``SparseArray`` | ||||||
| can be converted back to a regular ndarray by calling ``to_dense``: | ||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| sparr.to_dense() | ||||||
| .. versionadded:: 0.25.0 | ||||||
|
|
||||||
| A ``.sparse`` accessor has been added for :class:`DataFrame` as well. | ||||||
| See :ref:`api.dataframe.sparse` for more. | ||||||
|
||||||
| See :ref:`api.dataframe.sparse` for more. | |
| See :ref:`api.frame.sparse` for more. |
use *Previous* and *New*
Will change old to previous. I think I'll keep them as comments, rather than **-style headings, since we're using ** for the subtopic (e.g. construction).
accssor typo.
I don't think the diff here is all that informative. I'd recommend just viewing the new file. The basic flow is