Tabular data


Helper functions to get data into a DataLoaders for the tabular application, and the higher-level class TabularDataLoaders


The main class to get your data ready for model training is TabularDataLoaders, together with its factory methods. Check out the tabular tutorial for usage examples.

class TabularDataLoaders

TabularDataLoaders(*loaders, path='.', device=None) :: DataLoaders

Basic wrapper around several DataLoaders with factory methods for tabular data

This class should not be used directly; prefer one of the factory methods instead. All of those factory methods accept the following arguments:

  • cat_names: the names of the categorical variables
  • cont_names: the names of the continuous variables
  • y_names: the names of the dependent variables
  • y_block: the TransformBlock to use for the target
  • valid_idx: the indices to use for the validation set (defaults to a random split otherwise)
  • bs: the batch size
  • val_bs: the batch size for the validation DataLoader (defaults to bs)
  • shuffle_train: whether to shuffle the training DataLoader
  • n: overrides the number of elements in the dataset
  • device: the PyTorch device to use (defaults to default_device())
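
As a rough illustration of how valid_idx partitions the rows, the sketch below uses only the standard library. It is not fastai's implementation: the helper name split_indices, the 20% fallback fraction, and the seed are assumptions made for this example.

```python
import random

def split_indices(n_rows, valid_idx=None, valid_pct=0.2, seed=42):
    """Return (train_idx, valid_idx) as lists of row positions.

    Hypothetical helper: valid_pct and seed are illustrative defaults,
    not values taken from the library.
    """
    if valid_idx is None:
        # No explicit indices given: fall back to a random split.
        rng = random.Random(seed)
        shuffled = list(range(n_rows))
        rng.shuffle(shuffled)
        cut = int(n_rows * valid_pct)
        valid_idx = sorted(shuffled[:cut])
    valid_set = set(valid_idx)
    # Every row that is not in the validation set goes to training.
    train_idx = [i for i in range(n_rows) if i not in valid_set]
    return train_idx, list(valid_idx)

# Explicit validation rows: positions 800..999 of a 1000-row table.
train_idx, valid_idx = split_indices(1000, valid_idx=list(range(800, 1000)))
print(len(train_idx), len(valid_idx))  # 800 200
```

Passing the indices explicitly keeps the validation set identical across runs, which a random split only does when its seed is fixed.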

TabularDataLoaders.from_df

TabularDataLoaders.from_df(df, path='.', procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, valid_idx=None, bs=64, shuffle_train=None, shuffle=True, val_shuffle=False, n=None, device=None, drop_last=None, val_bs=None)

Create from df in path using procs

Let’s look at an example with the adult dataset:

    from fastai.tabular.all import *

    path = untar_data(URLs.ADULT_SAMPLE)
    df = pd.read_csv(path/'adult.csv', skipinitialspace=True)
    df.head()

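|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
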
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
    cont_names = ['age', 'fnlwgt', 'education-num']
    procs = [Categorify, FillMissing, Normalize]

    dls = TabularDataLoaders.from_df(df, path, procs=procs, cat_names=cat_names, cont_names=cont_names,
                                     y_names="salary", valid_idx=list(range(800,1000)), bs=64)

    dls.show_batch()

|   | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | 11th | Separated | Adm-clerical | Unmarried | Black | False | 55.0 | 213894.000562 | 7.0 | <50k |
| 1 | Private | HS-grad | Married-civ-spouse | Machine-op-inspct | Husband | White | False | 53.0 | 228500.001385 | 9.0 | >=50k |
| 2 | Private | HS-grad | Married-civ-spouse | Tech-support | Husband | White | False | 38.0 | 256864.000909 | 9.0 | >=50k |
| 3 | Private | Bachelors | Married-civ-spouse | Tech-support | Husband | White | False | 40.0 | 247879.997190 | 13.0 | >=50k |
| 4 | Private | Some-college | Divorced | Craft-repair | Not-in-family | White | False | 41.0 | 40151.001925 | 10.0 | >=50k |
| 5 | Private | HS-grad | Married-civ-spouse | Sales | Husband | White | False | 37.0 | 110713.001599 | 9.0 | >=50k |
| 6 | Private | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | False | 38.0 | 278924.000902 | 13.0 | >=50k |
| 7 | Self-emp-not-inc | 11th | Married-civ-spouse | Farming-fishing | Husband | White | False | 60.0 | 220341.999356 | 7.0 | <50k |
| 8 | ? | 9th | Never-married | ? | Not-in-family | White | False | 30.0 | 104965.001013 | 5.0 | <50k |
| 9 | ? | HS-grad | Never-married | ? | Not-in-family | White | False | 21.0 | 105311.997415 | 9.0 | <50k |
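
The education-num_na column in the batch above is not part of the original data. It comes from the FillMissing proc, which by default replaces missing continuous values with the column median and records where it did so. The sketch below mimics that idea with the standard library only. The helper fill_missing is hypothetical, and it computes the median over the five education-num values shown by df.head() rather than over the full training set, so the numbers are illustrative.

```python
import math
from statistics import median

def fill_missing(values):
    """Return (filled_values, na_flags) for a list of floats that may contain NaN.

    Hypothetical stand-in for a median-fill step; not fastai's code.
    """
    present = [v for v in values if not math.isnan(v)]
    fill = median(present)
    # Remember which positions were missing before filling them.
    na_flags = [math.isnan(v) for v in values]
    filled = [fill if flag else v for v, flag in zip(values, na_flags)]
    return filled, na_flags

# education-num for the five rows displayed by df.head()
education_num = [12.0, 14.0, float("nan"), 15.0, float("nan")]
filled, na_flags = fill_missing(education_num)
print(filled)    # [12.0, 14.0, 14.0, 15.0, 14.0]
print(na_flags)  # [False, False, True, False, True]
```

Keeping the flag column lets a model distinguish a genuine value from one that was filled in.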

TabularDataLoaders.from_csv

TabularDataLoaders.from_csv(csv, skipinitialspace=True, path='.', procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, valid_idx=None, bs=64, shuffle_train=None, shuffle=True, val_shuffle=False, n=None, device=None, drop_last=None, val_bs=None)

Create from csv file in path using procs

    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
    cont_names = ['age', 'fnlwgt', 'education-num']
    procs = [Categorify, FillMissing, Normalize]
    dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, procs=procs, cat_names=cat_names, cont_names=cont_names,
                                      y_names="salary", valid_idx=list(range(800,1000)), bs=64)

External structured data files can contain unexpected spaces, e.g. after a comma. The first row of adult.csv shows this: "49, Private,101320, ...". Such values usually need trimming. Pandas has a convenient parameter for this, skipinitialspace, which TabularDataLoaders.from_csv() exposes. Without it, a category label used later for inference, such as workclass:Private, would be wrongly encoded as 0 ("#na#") if the training label had been read as " Private". Let’s test this feature.
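
The failure mode can be reproduced without fastai or pandas. The sketch below uses the standard library's csv module, whose reader accepts an option with the same name, skipinitialspace, and a toy vocabulary in place of the real categorical encoding. The helpers read_rows, build_vocab, and encode are hypothetical names for this example. Only the convention that index 0 stands for "#na#" is taken from the text above.

```python
import csv
import io

# Same shape as the first row of adult.csv: a space follows some commas.
RAW = "age, workclass,fnlwgt\n49, Private,101320\n"

def read_rows(text, skipinitialspace):
    """Parse CSV text into a list of dicts, one per data row."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=skipinitialspace)
    return list(reader)

def build_vocab(labels):
    """Toy category vocabulary: index 0 is reserved for '#na#'."""
    return {label: i for i, label in enumerate(["#na#"] + sorted(set(labels)))}

def encode(vocab, label):
    """Labels that were never seen fall back to index 0 ('#na#')."""
    return vocab.get(label, 0)

untrimmed = read_rows(RAW, skipinitialspace=False)
trimmed = read_rows(RAW, skipinitialspace=True)

# Without trimming, both the header and the value keep their leading space.
vocab_untrimmed = build_vocab(row[" workclass"] for row in untrimmed)
vocab_trimmed = build_vocab(row["workclass"] for row in trimmed)

print(encode(vocab_untrimmed, "Private"))  # 0, because only " Private" was learned
print(encode(vocab_trimmed, "Private"))    # 1
```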

    from fastcore.test import test_ne

    test_data = {
        'age': [49],
        'workclass': ['Private'],
        'fnlwgt': [101320],
        'education': ['Assoc-acdm'],
        'education-num': [12.0],
        'marital-status': ['Married-civ-spouse'],
        'occupation': [''],
        'relationship': ['Wife'],
        'race': ['White'],
    }
    test_df = pd.DataFrame(test_data)  # renamed from `input` to avoid shadowing the built-in
    tdl = dls.test_dl(test_df)
    # The encoded workclass must not be 0, i.e. 'Private' was found in the vocabulary
    test_ne(0, tdl.dataset.iloc[0]['workclass'])


©2021 fast.ai. All rights reserved.
Site last generated: Mar 31, 2021