Hi everyone! First of all, thanks for all the work on this fantastic library and the Synthetic Data Vault in general :). I believe I found a minor bug in loading the demo data set and propose a quick fix for which I will submit a PR.
Environment Details
- CTGAN version: latest (0.5.2.dev1)
- Python version: 3.9.7
- Operating System: Windows
Error Description
When running the usage example for the CTGANSynthesizer with conditional sampling via the condition_column and condition_value arguments in the sample method:
samples = ctgan.sample(1000, condition_column='native-country', condition_value='United-States')
I get the following warning:
rdt\transformers\categorical.py:374: UserWarning: The data contains 1 new categories that were not seen in the original data (examples: {'United-States'}). Creating a vector of all 0s. If you want to model new categories, please fit the transformer again with the new data.
After looking into it, I found that the category values in the discrete columns of the demo data start with a redundant space. Using ' United-States' (with the leading space) works fine:
samples = ctgan.sample(1000, condition_column='native-country', condition_value=' United-States')
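To check which columns are affected, something like the following sketch could be used. The helper name `columns_with_leading_whitespace` and the toy frame are made up for illustration; they are not part of CTGAN or of the real demo data.

```python
import pandas as pd


def columns_with_leading_whitespace(df):
    """Return the names of columns holding any string that starts with whitespace."""
    flagged = []
    for name in df.columns:
        # Only inspect string values; numeric columns are skipped naturally.
        strings = [value for value in df[name] if isinstance(value, str)]
        if any(value != value.lstrip() for value in strings):
            flagged.append(name)
    return flagged


# Toy frame standing in for the demo data (values are illustrative only).
toy = pd.DataFrame({
    'age': [39, 50],
    'native-country': [' United-States', ' Cuba'],
    'sex': ['Male', 'Female'],
})

flagged = columns_with_leading_whitespace(toy)
print(flagged)
```

Running the same helper on the frame returned by load_demo() should list the discrete columns whose categories carry the extra space.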
Solution
I propose setting the skipinitialspace argument of pd.read_csv to True in the load_demo function:
def load_demo():
    """Load the demo."""
    return pd.read_csv(DEMO_URL, compression='gzip', skipinitialspace=True)
This seems to solve the issue.
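The effect of skipinitialspace can be seen on a small in-memory CSV. This is only a sketch: the CSV text below is invented for the example and merely mimics the ", " delimiter style that seems to cause the problem in the demo file.

```python
import io

import pandas as pd

# Hypothetical CSV snippet with a space after each comma.
raw = (
    "age,native-country,income\n"
    "39, United-States, <=50K\n"
    "50, Cuba, >50K\n"
)

# Default parsing keeps the space as part of each string value.
default = pd.read_csv(io.StringIO(raw))

# skipinitialspace=True drops whitespace that directly follows the delimiter.
stripped = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

default_values = default['native-country'].tolist()
stripped_values = stripped['native-country'].tolist()
print(default_values)
print(stripped_values)
```

With the default settings the categories keep the leading space, which is why condition_value='United-States' is reported as an unseen category.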
Steps to reproduce
from ctgan import CTGANSynthesizer
from ctgan import load_demo
data = load_demo()
# Names of the columns that are discrete
discrete_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]
ctgan = CTGANSynthesizer(epochs=1)
ctgan.fit(data, discrete_columns)
# Synthetic copy
samples = ctgan.sample(1000, condition_column='native-country', condition_value='United-States')