So, I thought I'd start opening up issues to enable discussion of individual dataframe features we'd like to see. I'd like to start with 'heterogeneous column types': the ability to have a dataframe with columns of different types.
In looking through existing WIPs in #4, I came across a few different methods of implementing this:
- Using an `enum` for either a column or for individual values. utah (and really any arbitrarily-typed dataframe library) can house enums as values, which allows you to mix types however you want (even within the same column), at the cost of compile-time type safety (types are only checked at run time) and some performance. I didn't see any library currently using column-based enums, but I could see having something like
    enum Column {
        Float(Vec<f64>),
        Int(Vec<i64>),
        Text(Vec<String>),
    }
  and in fact did it this way in an early version of agnes.
- Using `Any`-based storage, along with some metadata for relating columns to data types at run-time. Used by rust-dataframe and black-jack.
- Using cons-lists to provide compile-time type safety. Used by agnes and frames.
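
To make the two run-time-checked options concrete, here is a minimal sketch of each. `EnumFrame`, `AnyFrame`, and their methods are hypothetical names made up for illustration; they are not the API of utah, rust-dataframe, or black-jack, and a real library would likely store more metadata than a name-to-column map.

```rust
use std::any::Any;
use std::collections::HashMap;

/// Column-level enum: each column is homogeneous, but a frame can mix
/// columns of different types.
#[derive(Debug, PartialEq)]
enum Column {
    Float(Vec<f64>),
    Int(Vec<i64>),
    Text(Vec<String>),
}

/// Hypothetical frame keyed by column name (an assumption of this sketch).
struct EnumFrame {
    columns: HashMap<String, Column>,
}

impl EnumFrame {
    /// Returns the column as floats, or `None` if the column is missing
    /// or holds another type. The type check happens at run time.
    fn floats(&self, name: &str) -> Option<&[f64]> {
        match self.columns.get(name)? {
            Column::Float(values) => Some(values.as_slice()),
            _ => None,
        }
    }
}

/// `Any`-based storage: columns are type-erased, and the caller names the
/// element type it expects when reading.
struct AnyFrame {
    columns: HashMap<String, Box<dyn Any>>,
}

impl AnyFrame {
    fn new() -> Self {
        AnyFrame { columns: HashMap::new() }
    }

    fn insert<T: 'static>(&mut self, name: &str, values: Vec<T>) {
        self.columns.insert(name.to_string(), Box::new(values));
    }

    /// `None` if the column is missing or was stored with another type.
    fn get<T: 'static>(&self, name: &str) -> Option<&Vec<T>> {
        self.columns.get(name)?.downcast_ref::<Vec<T>>()
    }
}

fn sample_enum_frame() -> EnumFrame {
    let mut columns = HashMap::new();
    columns.insert("price".to_string(), Column::Float(vec![1.5, 2.5]));
    columns.insert("qty".to_string(), Column::Int(vec![1, 2]));
    columns.insert(
        "name".to_string(),
        Column::Text(vec!["a".to_string(), "b".to_string()]),
    );
    EnumFrame { columns }
}

fn sample_any_frame() -> AnyFrame {
    let mut frame = AnyFrame::new();
    frame.insert("price", vec![1.5_f64, 2.5]);
    frame.insert("qty", vec![1_i64, 2]);
    frame
}
```

In both cases, asking for the wrong type (say, `floats("name")` or `get::<f64>("qty")`) compiles fine and only shows up as a `None` at run time.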
Each of these has its own advantages and disadvantages. For example, the `enum` and `Any` approaches lack compile-time type-checking, but have much cleaner type signatures and potentially cleaner error messages than the cons-list approach (where you have something like `DataFrame<Cons<usize, Cons<f64, Cons<String, Nil>>>>` for a relatively simple three-column dataframe).
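
For comparison, here is a rough sketch of how a cons-list can carry the column types at compile time. This is not how agnes or frames actually implement it; `Select`, `Here`, `There`, and `column` are hypothetical names, and columns are addressed by type-level position rather than by label to keep the example short.

```rust
use std::marker::PhantomData;

/// End of the column list.
struct Nil;

/// One column of `H` values followed by the remaining columns.
struct Cons<H, T> {
    head: Vec<H>,
    tail: T,
}

/// Type-level index: the column is at the head of the list.
struct Here;
/// Type-level index: the column is somewhere in the tail.
struct There<I>(PhantomData<I>);

/// Resolves a type-level index to a concretely typed column.
trait Select<I> {
    type Elem;
    fn select(&self) -> &Vec<Self::Elem>;
}

impl<H, T> Select<Here> for Cons<H, T> {
    type Elem = H;
    fn select(&self) -> &Vec<H> {
        &self.head
    }
}

impl<H, T, I> Select<There<I>> for Cons<H, T>
where
    T: Select<I>,
{
    type Elem = <T as Select<I>>::Elem;
    fn select(&self) -> &Vec<<T as Select<I>>::Elem> {
        <T as Select<I>>::select(&self.tail)
    }
}

struct DataFrame<C> {
    columns: C,
}

impl<C> DataFrame<C> {
    /// The element type of the result is known to the compiler.
    fn column<I>(&self) -> &Vec<<C as Select<I>>::Elem>
    where
        C: Select<I>,
    {
        <C as Select<I>>::select(&self.columns)
    }
}

/// The three-column signature mentioned above.
type Sample = DataFrame<Cons<usize, Cons<f64, Cons<String, Nil>>>>;

fn sample() -> Sample {
    DataFrame {
        columns: Cons {
            head: vec![1, 2],
            tail: Cons {
                head: vec![0.5, 1.5],
                tail: Cons {
                    head: vec!["a".to_string(), "b".to_string()],
                    tail: Nil,
                },
            },
        },
    }
}
```

With this, `sample().column::<There<Here>>()` is a `&Vec<f64>` as far as the compiler is concerned, and asking for a fourth column is a compile error because `Nil` does not implement `Select`. The flip side is exactly the long signatures (and trait-bound error messages) described above.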
You could also have a combination of the above techniques -- I could see something like a cons-list metadata structure that type-checks the columns, while the data itself is stored in some sort of `Any`-based structure.
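
One way such a combination might look, purely as a sketch: the cons-list exists only at the type level as a schema, and the data sits in type-erased boxes. Every name here (`ColumnAt`, `with_column`, `column`, the positional `Here`/`There` indices) is hypothetical and not taken from any of the libraries mentioned.

```rust
use std::any::Any;
use std::marker::PhantomData;

/// Type-level schema. These types are never instantiated.
struct Nil;
struct Cons<H, T>(PhantomData<(H, T)>);

/// Type-level positions within the schema.
struct Here;
struct There<I>(PhantomData<I>);

/// Maps a type-level index to an element type and a run-time position.
trait ColumnAt<I> {
    type Elem: 'static;
    const POS: usize;
}

impl<H: 'static, T> ColumnAt<Here> for Cons<H, T> {
    type Elem = H;
    const POS: usize = 0;
}

impl<H, T: ColumnAt<I>, I> ColumnAt<There<I>> for Cons<H, T> {
    type Elem = <T as ColumnAt<I>>::Elem;
    const POS: usize = 1 + <T as ColumnAt<I>>::POS;
}

/// Data is stored type-erased; `S` records the column types.
struct DataFrame<S> {
    columns: Vec<Box<dyn Any>>,
    schema: PhantomData<S>,
}

impl DataFrame<Nil> {
    fn new() -> Self {
        DataFrame { columns: Vec::new(), schema: PhantomData }
    }
}

impl<S> DataFrame<S> {
    /// Prepends a column and records its element type in the schema, so
    /// storage order always matches the type-level list.
    fn with_column<T: 'static>(mut self, values: Vec<T>) -> DataFrame<Cons<T, S>> {
        self.columns.insert(0, Box::new(values));
        DataFrame { columns: self.columns, schema: PhantomData }
    }

    /// The schema picks both the position and the type to downcast to.
    fn column<I>(&self) -> &Vec<<S as ColumnAt<I>>::Elem>
    where
        S: ColumnAt<I>,
    {
        self.columns[<S as ColumnAt<I>>::POS]
            .downcast_ref::<Vec<<S as ColumnAt<I>>::Elem>>()
            .expect("with_column keeps schema and storage in sync")
    }
}

fn sample() -> DataFrame<Cons<i64, Cons<String, Nil>>> {
    DataFrame::<Nil>::new()
        .with_column(vec!["a".to_string(), "b".to_string()])
        .with_column(vec![10_i64, 20])
}
```

The downcast still happens at run time, but because `with_column` is the only way to grow the frame in this sketch, the schema and the stored types cannot drift apart, and callers get a concretely typed `&Vec<_>` back.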
I'm personally a fan of the compile-time type-checking that cons-lists provide, but they can be hard to work with for those unfamiliar with them. I've started work on a labeled cons-list library (which will replace the one I'm using in agnes) to hopefully help out with some of these issues.
What are everyone's thoughts / opinions? Are there other options we should consider? I'd love to hear what people think the best approach is!