Redundant whitespace in the demo data #233

@AndresAlgaba

Hi everyone! First of all, thanks for all the work on this fantastic library and the Synthetic Data Vault in general :). I believe I found a minor bug in loading the demo data set and propose a quick fix for which I will submit a PR.

Environment Details

  • CTGAN version: latest (0.5.2.dev1)
  • Python version: 3.9.7
  • Operating System: Windows

Error Description

When running the usage example for the CTGANSynthesizer with conditional sampling via the condition_column and condition_value arguments in the sample method:

samples = ctgan.sample(1000, condition_column='native-country', condition_value='United-States')

I get the following warning:
rdt\transformers\categorical.py:374: UserWarning: The data contains 1 new categories that were not seen in the original data (examples: {'United-States'}). Creating a vector of all 0s. If you want to model new categories, please fit the transformer again with the new data.

After looking into it, I found that the values in the discrete columns have a leading space. Using ' United-States' (with the leading space) works fine:

samples = ctgan.sample(1000, condition_column='native-country', condition_value=' United-States')
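A quick way to check which columns are affected is sketched below. It runs on a made-up two-row sample that only imitates the comma-plus-space layout described above, not on the real demo file, and `columns_with_padding` is a hypothetical helper, not part of CTGAN or pandas.

```python
import io

import pandas as pd

# Made-up sample: values follow the delimiter with a space, headers do not.
raw = "age,native-country,income\n39, United-States, <=50K\n28, Cuba, >50K\n"
df = pd.read_csv(io.StringIO(raw))


def columns_with_padding(frame):
    """Return the names of columns holding strings with leading or trailing whitespace."""
    padded = []
    for name in frame.columns:
        for value in frame[name].dropna().unique():
            if isinstance(value, str) and value != value.strip():
                padded.append(name)
                break  # one padded value is enough to flag the column
    return padded


print(columns_with_padding(df))
```

On this sample the numeric `age` column is not flagged, while both string columns are.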

Solution

I propose setting the skipinitialspace argument to True in the pd.read_csv call inside the load_demo function:

def load_demo():
    """Load the demo."""
    return pd.read_csv(DEMO_URL, compression='gzip', skipinitialspace=True)

This seems to solve the issue.
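To illustrate the effect of `skipinitialspace`, here is a minimal sketch that parses the same in-memory CSV twice. The sample text is an assumption that mimics the demo file's layout; it is not the demo data itself.

```python
import io

import pandas as pd

# Made-up sample with a space after each delimiter, as in the demo data.
raw = "age,native-country\n39, United-States\n28, Cuba\n"

# Default parsing keeps the space as part of each string value.
default = pd.read_csv(io.StringIO(raw))

# skipinitialspace=True drops spaces that directly follow the delimiter.
fixed = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(default['native-country'].tolist())  # [' United-States', ' Cuba']
print(fixed['native-country'].tolist())    # ['United-States', 'Cuba']
```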

Steps to reproduce

from ctgan import CTGANSynthesizer
from ctgan import load_demo

data = load_demo()

# Names of the columns that are discrete
discrete_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]

ctgan = CTGANSynthesizer(epochs=1)
ctgan.fit(data, discrete_columns)

# Synthetic copy
samples = ctgan.sample(1000, condition_column='native-country', condition_value='United-States')
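For anyone who hits this before a fix is released, one possible workaround is to strip the discrete columns before calling `fit`. The sketch below uses a small made-up DataFrame in place of `load_demo()` so it runs without CTGAN installed; `strip_discrete` is a hypothetical helper, not a CTGAN function.

```python
import pandas as pd


def strip_discrete(data, discrete_columns):
    """Return a copy of data with surrounding whitespace removed from the given string columns."""
    cleaned = data.copy()
    for name in discrete_columns:
        cleaned[name] = cleaned[name].str.strip()
    return cleaned


# Stand-in for the frame returned by load_demo(); the values are made up.
data = pd.DataFrame({
    'age': [39, 28],
    'native-country': [' United-States', ' Cuba'],
    'income': [' <=50K', ' >50K'],
})

cleaned = strip_discrete(data, ['native-country', 'income'])
print(cleaned['native-country'].tolist())
```

With the real data, the cleaned frame would be passed to `ctgan.fit(cleaned, discrete_columns)`, after which `condition_value='United-States'` should match a known category.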
