Heterogeneous column types #6

@jblondin

So, I thought I'd start opening up issues to enable discussion of individual dataframe features we'd like to see. I'd like to start with 'heterogeneous column types': the ability to have a dataframe with columns of different types.

In looking through existing WIPs in #4, I came across a few different methods of implementing this:

  1. Using an enum for either a column or for individual values. utah (and really any arbitrarily-typed dataframe library) can house enums as values, which allows you to mix types however you want (even within the same column), at the cost of compile-time type safety (type checks move to run time) and some performance. I didn't see any library currently using column-based enums, but I could see having something like
// One variant per supported element type; each variant owns a whole column.
enum Column {
    Float(Vec<f64>),
    Int(Vec<i64>),
    Text(Vec<String>),
}

and in fact did it this way in an early version of agnes.

  2. Using Any-based storage, along with some metadata for relating columns to data types at run-time. Used by rust-dataframe and black-jack.
  3. Using cons-lists to provide compile-time type safety. Used by agnes and frames.
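
To make the Any-based option concrete, here is a minimal sketch of type-erased column storage with run-time metadata. It is not taken from rust-dataframe or black-jack; `AnyFrame`, `AnyColumn`, and `ColumnError` are hypothetical names invented for illustration, and only `std::any` from the standard library is assumed.

```rust
use std::any::{type_name, Any};
use std::collections::HashMap;

/// Hypothetical type-erased column: the data is a boxed `Vec<T>`,
/// and `type_name` is run-time metadata recording what `T` was.
pub struct AnyColumn {
    type_name: &'static str,
    data: Box<dyn Any>,
}

#[derive(Debug, PartialEq)]
pub enum ColumnError {
    /// No column with the requested name exists.
    Missing,
    /// The column exists but holds a different element type.
    WrongType { stored: &'static str },
}

/// Hypothetical frame keyed by column name.
pub struct AnyFrame {
    columns: HashMap<String, AnyColumn>,
}

impl AnyFrame {
    pub fn new() -> Self {
        AnyFrame { columns: HashMap::new() }
    }

    /// Store a column of any `'static` element type.
    pub fn insert<T: 'static>(&mut self, name: &str, data: Vec<T>) {
        let column = AnyColumn {
            type_name: type_name::<T>(),
            data: Box::new(data),
        };
        self.columns.insert(name.to_string(), column);
    }

    /// Retrieve a column; the element type is only checked at run time.
    pub fn get<T: 'static>(&self, name: &str) -> Result<&[T], ColumnError> {
        let column = self.columns.get(name).ok_or(ColumnError::Missing)?;
        column
            .data
            .downcast_ref::<Vec<T>>()
            .map(|v| v.as_slice())
            .ok_or(ColumnError::WrongType { stored: column.type_name })
    }
}
```

Under this sketch, `frame.get::<i64>("price")` on an `f64` column compiles fine and only fails when it runs, which is the trade-off discussed below.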

Each of these has its own advantages and disadvantages. For example, 1 and 2 lack compile-time type-checking, but have much cleaner type signatures and potentially cleaner error messages than 3 (where you have something like DataFrame<Cons<usize, Cons<f64, Cons<String, Nil>>>> for a relatively simple three-column dataframe).
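
For comparison, a bare-bones cons-list frame might look like the sketch below. This is not the agnes or frames API; `Nil`, `Cons`, `ColumnList`, `ThreeCols`, and the helper functions are assumptions made up for illustration. It shows where the long type signature comes from and what the compiler checks in exchange.

```rust
/// End of the column list.
pub struct Nil;

/// One column of element type `H`, followed by the rest of the list `T`.
pub struct Cons<H, T> {
    pub head: Vec<H>,
    pub tail: T,
}

/// Hypothetical frame whose column types are spelled out in `Cols`.
pub struct DataFrame<Cols> {
    pub cols: Cols,
}

/// Number of columns, computed from the type alone.
pub trait ColumnList {
    const WIDTH: usize;
}

impl ColumnList for Nil {
    const WIDTH: usize = 0;
}

impl<H, T: ColumnList> ColumnList for Cons<H, T> {
    const WIDTH: usize = 1 + T::WIDTH;
}

/// The three-column shape from the text above.
pub type ThreeCols = Cons<usize, Cons<f64, Cons<String, Nil>>>;

/// Build a small sample frame with two rows.
pub fn sample_frame() -> DataFrame<ThreeCols> {
    DataFrame {
        cols: Cons {
            head: vec![1, 2],
            tail: Cons {
                head: vec![1.0, 3.0],
                tail: Cons {
                    head: vec!["a".to_string(), "b".to_string()],
                    tail: Nil,
                },
            },
        },
    }
}

/// The second column is statically known to be `Vec<f64>`;
/// annotating it as `&Vec<String>` here would be a compile error.
pub fn mean_of_second(df: &DataFrame<ThreeCols>) -> f64 {
    let col: &Vec<f64> = &df.cols.tail.head;
    col.iter().sum::<f64>() / col.len() as f64
}
```

A real library would hide the `.tail.head` chains behind labels or index types, but the nested `Cons<...>` signature would still surface in error messages.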

You could also combine the above techniques: for example, a cons-list-typed column metadata structure could provide compile-time type checking while the data itself is stored in some sort of Any-based structure.
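
One way such a combination could look is sketched below, purely as an assumption about how the pieces might fit together. The cons-list exists only at the type level (via `PhantomData`), the data lives in `Box<dyn Any>`, and type-level indices `Here` / `There<I>` pick a column. All names (`HybridFrame`, `ColumnAt`, `Here`, `There`, `build_demo`) are hypothetical.

```rust
use std::any::Any;
use std::marker::PhantomData;

// Type-level schema: carries no data, only column element types.
pub struct Nil;
pub struct Cons<H, T>(PhantomData<(H, T)>);

// Type-level indices into the schema: `Here` is column 0,
// `There<Here>` is column 1, and so on.
pub struct Here;
pub struct There<I>(PhantomData<I>);

/// Resolves an index type to an element type and a storage position.
pub trait ColumnAt<I> {
    type Elem: 'static;
    const INDEX: usize;
}

impl<H: 'static, T> ColumnAt<Here> for Cons<H, T> {
    type Elem = H;
    const INDEX: usize = 0;
}

impl<H, T, I> ColumnAt<There<I>> for Cons<H, T>
where
    T: ColumnAt<I>,
{
    type Elem = <T as ColumnAt<I>>::Elem;
    const INDEX: usize = 1 + <T as ColumnAt<I>>::INDEX;
}

/// Data is type-erased; the schema `S` is checked at compile time.
pub struct HybridFrame<S> {
    columns: Vec<Box<dyn Any>>,
    _schema: PhantomData<S>,
}

impl HybridFrame<Nil> {
    pub fn new() -> Self {
        HybridFrame { columns: Vec::new(), _schema: PhantomData }
    }
}

impl<S> HybridFrame<S> {
    /// Prepend a column, so storage order matches `Cons<T, S>`.
    pub fn with_column<T: 'static>(mut self, data: Vec<T>) -> HybridFrame<Cons<T, S>> {
        self.columns.insert(0, Box::new(data));
        HybridFrame { columns: self.columns, _schema: PhantomData }
    }

    /// Typed access: the return type is derived from the schema.
    pub fn column<I>(&self) -> &Vec<<S as ColumnAt<I>>::Elem>
    where
        S: ColumnAt<I>,
    {
        self.columns[<S as ColumnAt<I>>::INDEX]
            .downcast_ref::<Vec<<S as ColumnAt<I>>::Elem>>()
            .expect("storage matches the type-level schema by construction")
    }
}

/// Builds a frame with schema Cons<usize, Cons<f64, Cons<String, Nil>>>.
pub fn build_demo() -> HybridFrame<Cons<usize, Cons<f64, Cons<String, Nil>>>> {
    HybridFrame::new()
        .with_column(vec!["a".to_string(), "b".to_string()])
        .with_column(vec![1.5f64, 2.5])
        .with_column(vec![10usize, 20])
}
```

In this sketch the downcast can only fail if the private `columns` vector is built inconsistently, so the run-time check acts as an internal invariant rather than a user-facing error.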

I'm personally a fan of the compile-time type-checking that cons-lists provide, but they can be hard to work with for those unfamiliar with them. I've started work on a labeled cons-list library (which will replace the one I'm using in agnes) to hopefully help out with some of these issues.

What are everyone's thoughts / opinions? Are there other options we should consider? I'd love to hear what people think the best approach is!
