Tabular core


Basic functions to preprocess tabular data before assembling it in a DataLoaders.


Initial preprocessing

make_date[source]

make_date(df, date_field)

Make sure df[date_field] is of the right date type.

```python
df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']})
make_date(df, 'date')
test_eq(df['date'].dtype, np.dtype('datetime64[ns]'))
```

add_datepart[source]

add_datepart(df, field_name, prefix=None, drop=True, time=False)

Helper function that adds columns relevant to a date in the column field_name of df.

For example, if we have a series of dates, we can generate features such as Year, Month, Day, Dayofweek, Is_month_start, etc., as shown below:

```python
df = pd.DataFrame({'date': ['2019-12-04', None, '2019-11-15', '2019-10-24']})
df = add_datepart(df, 'date')
df.head()
```
|   | Year | Month | Week | Day | Dayofweek | Dayofyear | Is_month_end | Is_month_start | Is_quarter_end | Is_quarter_start | Is_year_end | Is_year_start | Elapsed |
|---|------|-------|------|-----|-----------|-----------|--------------|----------------|----------------|------------------|-------------|---------------|---------|
| 0 | 2019.0 | 12.0 | 49.0 | 4.0 | 2.0 | 338.0 | False | False | False | False | False | False | 1.575418e+09 |
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | False | False | False | False | False | False | NaN |
| 2 | 2019.0 | 11.0 | 46.0 | 15.0 | 4.0 | 319.0 | False | False | False | False | False | False | 1.573776e+09 |
| 3 | 2019.0 | 10.0 | 43.0 | 24.0 | 3.0 | 297.0 | False | False | False | False | False | False | 1.571875e+09 |
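If the column also carries time-of-day information, passing time=True extracts it as well. A brief sketch with hypothetical timestamps, assuming the standard Hour, Minute and Second attributes fastai adds in that case:

```python
# Sketch: time=True adds Hour, Minute and Second on top of the date features above
df = pd.DataFrame({'date': ['2019-12-04 14:30:00', '2019-11-29 09:15:00']})
df = add_datepart(df, 'date', time=True)
assert {'Hour', 'Minute', 'Second'}.issubset(df.columns)
```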

add_elapsed_times[source]

add_elapsed_times(df, field_names, date_field, base_field)

For each event in field_names, add to df the elapsed time according to date_field, grouped by base_field.

```python
df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24'],
                   'event': [False, True, False, True], 'base': [1,1,2,2]})
df = add_elapsed_times(df, ['event'], 'date', 'base')
df.head()
```
|   | date | event | base | Afterevent | Beforeevent | event_bw | event_fw |
|---|------|-------|------|------------|-------------|----------|----------|
| 0 | 2019-12-04 | False | 1 | 5 | 0 | 1.0 | 0.0 |
| 1 | 2019-11-29 | True | 1 | 0 | 0 | 1.0 | 1.0 |
| 2 | 2019-11-15 | False | 2 | 22 | 0 | 1.0 | 0.0 |
| 3 | 2019-10-24 | True | 2 | 0 | 0 | 1.0 | 1.0 |

cont_cat_split[source]

cont_cat_split(df, max_card=20, dep_var=None)

Helper function that returns column names of cont and cat variables from given df.

This function works by determining if a column is continuous or categorical based on the cardinality of its values. If the cardinality is above the max_card parameter (or the column has a float datatype), it will be added to cont_names, otherwise to cat_names. An example is below:

```python
df = pd.DataFrame({'cat1': [1, 2, 3, 4], 'cont1': [1., 2., 3., 2.], 'cat2': ['a', 'b', 'b', 'a'],
                   'i8': pd.Series([1, 2, 3, 4], dtype='int8'),
                   'u8': pd.Series([1, 2, 3, 4], dtype='uint8'),
                   'f16': pd.Series([1, 2, 3, 4], dtype='float16'),
                   'y1': [1, 0, 1, 0], 'y2': [2, 1, 1, 0]})
cont_names, cat_names = cont_cat_split(df)
```

```
cont_names: ['cont1', 'f16']
cat_names: ['cat1', 'cat2', 'i8', 'u8', 'y1', 'y2']
```
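The dep_var parameter excludes the dependent variable(s) from both lists. A small sketch reusing the DataFrame above:

```python
# y2 is treated as the dependent variable and left out of both lists
cont_names, cat_names = cont_cat_split(df, dep_var='y2')
# cont_names: ['cont1', 'f16']
# cat_names: ['cat1', 'cat2', 'i8', 'u8', 'y1']
```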
Here is another example, using pandas nullable dtypes and an ordered categorical column, with max_card=0:

```python
df = pd.DataFrame({'cat1': pd.Series(['l','xs','xl','s'], dtype='category'),
                   'ui32': pd.Series([1, 2, 3, 4], dtype='UInt32'),
                   'i64': pd.Series([1, 2, 3, 4], dtype='Int64'),
                   'f16': pd.Series([1, 2, 3, 4], dtype='Float64'),
                   'd1_date': ['2021-02-09', None, '2020-05-12', '2020-08-14'],
                  })
df = add_datepart(df, 'd1_date', drop=False)
df['cat1'] = df['cat1'].cat.set_categories(['xl','l','m','s','xs'], ordered=True)
cont_names, cat_names = cont_cat_split(df, max_card=0)
```

```
cont_names: ['ui32', 'i64', 'f16', 'd1_Year', 'd1_Month', 'd1_Week', 'd1_Day', 'd1_Dayofweek', 'd1_Dayofyear', 'd1_Elapsed']
cat_names: ['cat1', 'd1_date', 'd1_Is_month_end', 'd1_Is_month_start', 'd1_Is_quarter_end', 'd1_Is_quarter_start', 'd1_Is_year_end', 'd1_Is_year_start']
```

df_shrink_dtypes[source]

df_shrink_dtypes(df, skip=[], obj2cat=True, int2uint=False)

Return any possible smaller data types for DataFrame columns. Allows object->category, int->uint, and exclusion.

For example, we will make a sample DataFrame with int, float, bool, and object datatypes:

```python
df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'e': [True, False, True],
                   'date': ['2019-12-04','2019-11-29','2019-11-15',]})
df.dtypes
```

```
i        int64
f       float64
e          bool
date     object
dtype: object
```

We can then call df_shrink_dtypes to find the smallest possible datatype that can support the data:

```python
dt = df_shrink_dtypes(df)
dt
```

```
{'i': dtype('int8'), 'f': dtype('float32'), 'date': 'category'}
```
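Since df_shrink_dtypes only returns the mapping without modifying anything, you can apply it yourself with plain pandas (not a fastai API):

```python
# Cast the DataFrame to the suggested dtypes; astype accepts a partial column->dtype dict
df2 = df.astype(dt)
test_eq(df2['i'].dtype, np.dtype('int8'))
```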

df_shrink[source]

df_shrink(df, skip=[], obj2cat=True, int2uint=False)

Reduce DataFrame memory usage by casting to the smaller types returned by df_shrink_dtypes().

df_shrink(df) attempts to make a DataFrame use less memory by fitting numeric columns into their smallest possible datatypes. In addition:

  • boolean, category, and datetime64[ns] dtype columns are ignored.
  • 'object' type columns are categorified, which can save a lot of memory in large datasets. This can be turned off with obj2cat=False.
  • int2uint=True fits int types into uint types, if all data in the column is >= 0.
  • columns can be excluded by name using skip=['col1','col2'].

To get only the new column data types, without actually casting the DataFrame, use df_shrink_dtypes() with the same parameters as df_shrink().

```python
df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'u': [0, 10, 254],
                   'date': ['2019-12-04','2019-11-29','2019-11-15']})
df2 = df_shrink(df, skip=['date'])
```

Let’s compare the two:

```python
df.dtypes
```

```
i        int64
f       float64
u        int64
date     object
dtype: object
```

```python
df2.dtypes
```

```
i        int8
f       float32
u       int16
date    object
dtype: object
```

We can see that the datatypes changed, and even further we can look at their relative memory usages:
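The byte counts below can be reproduced with pandas' DataFrame.memory_usage; a minimal sketch, not necessarily the notebook's exact code:

```python
print(f"Initial Dataframe: {df.memory_usage().sum()} bytes")
print(f"Reduced Dataframe: {df2.memory_usage().sum()} bytes")
```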

```
Initial Dataframe: 224 bytes
Reduced Dataframe: 173 bytes
```

Here’s another example using the ADULT_SAMPLE dataset:

```python
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
new_df = df_shrink(df, int2uint=True)
```
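As before, the figures below can be reproduced with DataFrame.memory_usage (a sketch, not the notebook's exact code):

```python
print(f"Initial Dataframe: {df.memory_usage().sum() / 1e6} megabytes")
print(f"Reduced Dataframe: {new_df.memory_usage().sum() / 1e6} megabytes")
```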
```
Initial Dataframe: 3.907448 megabytes
Reduced Dataframe: 0.818329 megabytes
```

We reduced the overall memory used by 79%!

class Tabular[source]

Tabular(df, procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, splits=None, do_setup=True, device=None, inplace=False, reduce_memory=True) :: CollBase

A DataFrame wrapper that knows which cols are cont/cat/y, and returns rows in __getitem__

  • df: A DataFrame of your data
  • cat_names: Your categorical x variables
  • cont_names: Your continuous x variables
  • y_names: Your dependent y variables
    • Note: Mixed y's, such as regression and classification together, are not currently supported; however, multiple regression or classification outputs are
  • y_block: How to sub-categorize the type of y_names (CategoryBlock or RegressionBlock)
  • splits: How to split your data
  • do_setup: A parameter for if Tabular will run the data through the procs upon initialization
  • device: cuda or cpu
  • inplace: If True, Tabular will not keep a separate copy of your original DataFrame in memory. You should ensure pd.options.mode.chained_assignment is None before setting this
  • reduce_memory: fastai will attempt to reduce the overall memory usage of the input DataFrame with df_shrink

class TabularPandas[source]

TabularPandas(df, procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, splits=None, do_setup=True, device=None, inplace=False, reduce_memory=True) :: Tabular

A Tabular object with transforms

class TabularProc[source]

TabularProc(enc=None, dec=None, split_idx=None, order=None) :: InplaceTransform

Base class to write a non-lazy tabular processor for dataframes

These transforms are applied as soon as the data is available, rather than as the data is drawn from the DataLoader.
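Custom processors follow the same pattern as the built-in procs such as Normalize: a setups method that fits state on the training split, and an encodes method applied eagerly, in place. A minimal sketch of a hypothetical LogTransform (illustrative only, not part of the library):

```python
class LogTransform(TabularProc):
    order = 1
    def setups(self, to):
        # Fit state on the training split only (falling back to the whole table)
        self.mins = dict(getattr(to, 'train', to).conts.min())
    def encodes(self, to):
        # Applied eagerly and in place to all continuous columns
        to.conts = np.log1p(to.conts - pd.Series(self.mins))
```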

class Categorify[source]

Categorify(enc=None, dec=None, split_idx=None, order=None) :: TabularProc

Transform the categorical variables to something similar to pd.Categorical

While visually in the DataFrame you will not see a change, the classes are stored in to.procs.categorify as we can see below on a dummy DataFrame:

```python
df = pd.DataFrame({'a': [0, 1, 2, 0, 2]})
to = TabularPandas(df, Categorify, 'a')
to.show()
```
|   | a |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |
| 3 | 0 |
| 4 | 2 |

Each column’s unique values are stored in a dictionary of column:[values]:

```python
cat = to.procs.categorify
cat.classes
```

```
{'a': ['#na#', 0, 1, 2]}
```
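When the same proc is later applied to new data (via to.new, shown in the integration example below), values that were never seen during setup map to '#na#'. A quick sketch:

```python
df_new = pd.DataFrame({'a': [0, 1, 5]})  # 5 never appeared in the original column
to_new = to.new(df_new)
to_new.process()
# 0 -> 1, 1 -> 2, and the unseen 5 -> 0, the '#na#' index
to_new.items['a'].tolist()
```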

class FillStrategy[source]

FillStrategy()

Namespace containing the various filling strategies.

Currently, filling with the median, a constant, and the mode are supported.

class FillMissing[source]

FillMissing(fill_strategy=median, add_col=True, fill_vals=None) :: TabularProc

Fill the missing values in continuous columns.
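There is no example above, so here is a small sketch on dummy data: with the default median strategy the NaN is replaced by the column median, and because add_col=True a boolean a_na column records which values were missing.

```python
df = pd.DataFrame({'a': [0.5, 1.5, np.nan, 2.5]})
to = TabularPandas(df, FillMissing(), cont_names='a')
to.items  # the NaN becomes 1.5 (the median) and an `a_na` flag column is added
```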

class ReadTabBatch[source]

ReadTabBatch(to) :: ItemTransform

Transform TabularPandas values into a Tensor with the ability to decode

class TabDataLoader[source]

TabDataLoader(dataset, bs=16, shuffle=False, after_batch=None, num_workers=0, verbose=False, do_setup=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, persistent_workers=False, wif=None, before_iter=None, after_item=None, before_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None) :: TfmdDL

A transformed DataLoader for Tabular data
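You will normally get these from to.dataloaders(), but they can also be built directly from a split of a TabularPandas object. A sketch, assuming the to object from the integration example below:

```python
# Hypothetical direct construction; `to` is a TabularPandas with splits
trn_dl = TabDataLoader(to.train, bs=64, shuffle=True, drop_last=True)
val_dl = TabDataLoader(to.valid, bs=128)
```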

Integration example

For a more in-depth explanation, see the tabular tutorial.

```python
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main, df_test = df.iloc[:10000].copy(), df.iloc[10000:].copy()
df_test.drop('salary', axis=1, inplace=True)
df_main.head()
```
|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|-----|-----------|--------|-----------|---------------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|--------|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
```python
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
```

```python
to = TabularPandas(df_main, procs, cat_names, cont_names, y_names="salary", splits=splits)
```

```python
dls = to.dataloaders()
dls.valid.show_batch()
```
|   | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|-----------|-----------|----------------|------------|--------------|------|------------------|-----|--------|---------------|--------|
| 0 | Private | HS-grad | Never-married | Handlers-cleaners | Own-child | Black | False | 28.000000 | 335356.999710 | 9.0 | <50k |
| 1 | ? | HS-grad | Married-civ-spouse | ? | Husband | White | False | 65.999999 | 37330.998172 | 9.0 | <50k |
| 2 | Private | Masters | Never-married | #na# | Not-in-family | Asian-Pac-Islander | False | 32.000000 | 116137.997932 | 14.0 | <50k |
| 3 | Private | HS-grad | Married-civ-spouse | Craft-repair | Husband | White | False | 45.000000 | 273434.998017 | 9.0 | <50k |
| 4 | Private | HS-grad | Married-civ-spouse | Craft-repair | Husband | White | False | 51.000000 | 101431.996842 | 9.0 | <50k |
| 5 | Private | Bachelors | Married-civ-spouse | Prof-specialty | Husband | White | False | 48.000000 | 332465.003428 | 13.0 | <50k |
| 6 | Private | Some-college | Never-married | Sales | Own-child | White | False | 17.999999 | 192409.000024 | 10.0 | <50k |
| 7 | Private | HS-grad | Divorced | Machine-op-inspct | Unmarried | Black | True | 37.000000 | 175390.000108 | 10.0 | <50k |
| 8 | Private | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | False | 38.000000 | 192337.000006 | 13.0 | >=50k |
| 9 | Federal-gov | HS-grad | Married-civ-spouse | Adm-clerical | Husband | White | False | 37.000000 | 32528.006470 | 9.0 | >=50k |
```python
to.show()
```
|      | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|------|-----------|-----------|----------------|------------|--------------|------|------------------|-----|--------|---------------|--------|
| 279  | Private | HS-grad | Never-married | #na# | Own-child | White | True | 20.0 | 155775.0 | 10.0 | <50k |
| 6459 | Private | HS-grad | Divorced | Craft-repair | Not-in-family | White | False | 55.0 | 35551.0 | 9.0 | <50k |
| 5544 | Private | Assoc-voc | Divorced | Tech-support | Not-in-family | Black | False | 53.0 | 479621.0 | 11.0 | <50k |
| 3500 | ? | 10th | Never-married | ? | Not-in-family | White | False | 19.0 | 182590.0 | 6.0 | <50k |
| 3788 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Sales | Husband | White | False | 31.0 | 340880.0 | 13.0 | <50k |
| 4002 | Self-emp-not-inc | Some-college | Never-married | Sales | Own-child | White | False | 30.0 | 196342.0 | 10.0 | <50k |
| 204  | ? | HS-grad | Married-civ-spouse | #na# | Husband | White | True | 60.0 | 174073.0 | 10.0 | <50k |
| 9097 | Private | HS-grad | Married-civ-spouse | Adm-clerical | Husband | White | False | 39.0 | 83893.0 | 9.0 | >=50k |
| 5972 | Private | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | False | 48.0 | 105838.0 | 13.0 | >=50k |
| 5661 | Private | HS-grad | Never-married | Adm-clerical | Own-child | White | False | 26.0 | 262656.0 | 9.0 | <50k |

We can decode any set of transformed data by calling to.decode_row with our raw data:

```python
row = to.items.iloc[0]
to.decode_row(row)
```

```
age                          20.0
workclass                 Private
fnlwgt                   155775.0
education                 HS-grad
education-num                10.0
marital-status      Never-married
occupation                   #na#
relationship            Own-child
race                        White
sex                          Male
capital-gain                    0
capital-loss                    0
hours-per-week                 30
native-country      United-States
salary                       <50k
education-num_na             True
Name: 279, dtype: object
```

We can make new test datasets based on the training data with to.new().

Note: Since machine learning models can't magically understand categories they were never trained on, the data should reflect this. If there are different missing values in your test data, you should address this before training.

```python
to_tst = to.new(df_test)
to_tst.process()
to_tst.items.head()
```
|       | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | education-num_na |
|-------|-----|-----------|--------|-----------|---------------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|------------------|
| 10000 | 0.455476 | 5 | 1.326789 | 10 | 1.178200 | 3 | 2 | 1 | 2 | Male | 0 | 0 | 40 | Philippines | 1 |
| 10001 | -0.936297 | 5 | 1.240484 | 12 | -0.420714 | 3 | 15 | 1 | 4 | Male | 0 | 0 | 40 | United-States | 1 |
| 10002 | 1.041486 | 5 | 0.146895 | 2 | -1.220171 | 1 | 9 | 2 | 5 | Female | 0 | 0 | 37 | United-States | 1 |
| 10003 | 0.528727 | 5 | -0.282639 | 12 | -0.420714 | 7 | 2 | 5 | 5 | Female | 0 | 0 | 43 | United-States | 1 |
| 10004 | 0.748481 | 6 | 1.428478 | 9 | 0.378743 | 3 | 5 | 1 | 5 | Male | 0 | 0 | 60 | United-States | 1 |

We can then convert it to a DataLoader:

```python
tst_dl = dls.valid.new(to_tst)
tst_dl.show_batch()
```
|   | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
|---|-----------|-----------|----------------|------------|--------------|------|------------------|-----|--------|---------------|
| 0 | Private | Bachelors | Married-civ-spouse | Adm-clerical | Husband | Asian-Pac-Islander | False | 45.000000 | 338105.001967 | 13.0 |
| 1 | Private | HS-grad | Married-civ-spouse | Transport-moving | Husband | Other | False | 26.000000 | 328663.005601 | 9.0 |
| 2 | Private | 11th | Divorced | Other-service | Not-in-family | White | False | 53.000000 | 209021.999795 | 7.0 |
| 3 | Private | HS-grad | Widowed | Adm-clerical | Unmarried | White | False | 46.000000 | 162029.999497 | 9.0 |
| 4 | Self-emp-inc | Assoc-voc | Married-civ-spouse | Exec-managerial | Husband | White | False | 49.000000 | 349229.997780 | 11.0 |
| 5 | Local-gov | Some-college | Married-civ-spouse | Exec-managerial | Husband | White | False | 34.000000 | 124827.002450 | 10.0 |
| 6 | Self-emp-inc | Some-college | Married-civ-spouse | Sales | Husband | White | False | 53.000000 | 290640.001644 | 10.0 |
| 7 | Private | Some-college | Never-married | Sales | Own-child | White | False | 19.000000 | 106272.998740 | 10.0 |
| 8 | Private | Some-college | Married-civ-spouse | Protective-serv | Husband | Black | False | 72.000001 | 53684.003462 | 10.0 |
| 9 | Private | Some-college | Never-married | Sales | Own-child | White | False | 20.000000 | 505980.007069 | 10.0 |
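From here, a typical next step would be to train a model and predict on this test DataLoader. A sketch (training itself is covered in the tabular tutorial, not here):

```python
# Assumes the `dls` built above; tabular_learner comes from fastai.tabular.all
learn = tabular_learner(dls, metrics=accuracy)
learn.fit_one_cycle(1)
preds, _ = learn.get_preds(dl=tst_dl)
```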

Other target types

Multi-label categories

One-hot encoded label

```python
def _mock_multi_label(df):
    sal, sex, white = [], [], []
    for row in df.itertuples():
        sal.append(row.salary == '>=50k')
        sex.append(row.sex == ' Male')
        white.append(row.race == ' White')
    df['salary'] = np.array(sal)
    df['male'] = np.array(sex)
    df['white'] = np.array(white)
    return df
```
```python
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main, df_test = df.iloc[:10000].copy(), df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
```

```python
df_main.head()
```
|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | male | white |
|---|-----|-----------|--------|-----------|---------------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|--------|------|-------|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | True | False | True |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | True | True | True |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | False | False | False |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | True | True | False |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | False | False | False |
```python
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
y_names = ["salary", "male", "white"]
```

```python
%time to = TabularPandas(df_main, procs, cat_names, cont_names, y_names=y_names, y_block=MultiCategoryBlock(encoded=True, vocab=y_names), splits=splits)
```

```
CPU times: user 60 ms, sys: 0 ns, total: 60 ms
Wall time: 59.4 ms
```

```python
dls = to.dataloaders()
dls.valid.show_batch()
```
|   | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary | male | white |
|---|-----------|-----------|----------------|------------|--------------|------|------------------|-----|--------|---------------|--------|------|-------|
| 0 | Private | HS-grad | Married-civ-spouse | Sales | Husband | White | False | 47.000000 | 186533.999848 | 9.0 | True | True | True |
| 1 | Private | Some-college | Never-married | Adm-clerical | Not-in-family | White | False | 32.000000 | 115631.001216 | 10.0 | False | False | True |
| 2 | Federal-gov | Some-college | Widowed | Exec-managerial | Not-in-family | White | False | 60.000001 | 27466.003873 | 10.0 | False | False | True |
| 3 | Private | HS-grad | Never-married | Other-service | Not-in-family | White | False | 49.000000 | 129639.997602 | 9.0 | False | False | True |
| 4 | Local-gov | Prof-school | Married-civ-spouse | Prof-specialty | Husband | White | False | 37.000000 | 265038.001582 | 15.0 | True | True | True |
| 5 | Private | Bachelors | Never-married | Handlers-cleaners | Other-relative | White | False | 23.000001 | 256755.002929 | 13.0 | False | False | True |
| 6 | Private | HS-grad | Never-married | Machine-op-inspct | Not-in-family | White | False | 39.000000 | 185052.999958 | 9.0 | False | False | True |
| 7 | Private | HS-grad | Never-married | Handlers-cleaners | Own-child | White | False | 28.000000 | 189346.000139 | 9.0 | False | True | True |
| 8 | Private | 10th | Married-civ-spouse | Other-service | Husband | Asian-Pac-Islander | False | 35.000000 | 176122.999494 | 6.0 | False | True | False |
| 9 | Private | 5th-6th | Never-married | Machine-op-inspct | Other-relative | White | False | 25.000000 | 521399.996882 | 3.0 | False | True | True |

Not one-hot encoded

```python
def _mock_multi_label(df):
    targ = []
    for row in df.itertuples():
        labels = []
        if row.salary == '>=50k': labels.append('>50k')
        if row.sex == ' Male': labels.append('male')
        if row.race == ' White': labels.append('white')
        targ.append(' '.join(labels))
    df['target'] = np.array(targ)
    return df
```
```python
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main, df_test = df.iloc[:10000].copy(), df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
```

```python
df_main.head()
```
|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | target |
|---|-----|-----------|--------|-----------|---------------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|--------|--------|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k | >50k white |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k | >50k male white |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k | |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k | >50k male |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k | |
```python
@MultiCategorize
def encodes(self, to: Tabular):
    #to.transform(to.y_names, partial(_apply_cats, {n: self.vocab for n in to.y_names}, 0))
    return to

@MultiCategorize
def decodes(self, to: Tabular):
    #to.transform(to.y_names, partial(_decode_cats, {n: self.vocab for n in to.y_names}))
    return to
```
```python
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
```

```python
%time to = TabularPandas(df_main, procs, cat_names, cont_names, y_names="target", y_block=MultiCategoryBlock(), splits=splits)
```

```
CPU times: user 68 ms, sys: 0 ns, total: 68 ms
Wall time: 65 ms
```

```python
to.procs[2].vocab
```

```
['-', '_', 'a', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y']
```

Regression

```python
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main, df_test = df.iloc[:10000].copy(), df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
```

```python
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
```

```python
%time to = TabularPandas(df_main, procs, cat_names, cont_names, y_names='age', splits=splits)
```

```
CPU times: user 60 ms, sys: 4 ms, total: 64 ms
Wall time: 63.3 ms
```

```python
to.procs[-1].means
```

```
{'fnlwgt': 192492.332875, 'education-num': 10.075499534606934}
```

```python
dls = to.dataloaders()
dls.valid.show_batch()
```
|   | workclass | education | marital-status | occupation | relationship | race | education-num_na | fnlwgt | education-num | age |
|---|-----------|-----------|----------------|------------|--------------|------|------------------|--------|---------------|-----|
| 0 | Private | 9th | Married-civ-spouse | Machine-op-inspct | Husband | White | False | 288185.002301 | 5.0 | 25.0 |
| 1 | Self-emp-inc | HS-grad | Married-civ-spouse | Craft-repair | Husband | White | False | 383492.997753 | 9.0 | 44.0 |
| 2 | Private | HS-grad | Married-civ-spouse | Craft-repair | Husband | White | False | 84136.001920 | 9.0 | 40.0 |
| 3 | Private | Bachelors | Never-married | Handlers-cleaners | Own-child | White | True | 31778.002656 | 10.0 | 28.0 |
| 4 | Private | Some-college | Married-civ-spouse | Adm-clerical | Husband | Black | False | 193036.000001 | 10.0 | 34.0 |
| 5 | Private | 10th | Divorced | Machine-op-inspct | Not-in-family | Black | False | 131713.998819 | 6.0 | 29.0 |
| 6 | Private | HS-grad | Married-civ-spouse | Machine-op-inspct | Husband | White | False | 275632.002074 | 9.0 | 30.0 |
| 7 | Private | HS-grad | Married-civ-spouse | Other-service | Husband | White | False | 107236.003015 | 9.0 | 27.0 |
| 8 | Private | HS-grad | Married-civ-spouse | Machine-op-inspct | Husband | Black | False | 83878.997816 | 9.0 | 28.0 |
| 9 | Private | 7th-8th | Never-married | Handlers-cleaners | Own-child | White | False | 255476.000025 | 4.0 | 29.0 |

Not being used now - for multi-modal

```python
class TensorTabular(fastuple):
    def get_ctxs(self, max_n=10, **kwargs):
        n_samples = min(self[0].shape[0], max_n)
        df = pd.DataFrame(index=range(n_samples))
        return [df.iloc[i] for i in range(n_samples)]
    def display(self, ctxs): display_df(pd.DataFrame(ctxs))

class TabularLine(pd.Series):
    "A line of a dataframe that knows how to show itself"
    def show(self, ctx=None, **kwargs): return self if ctx is None else ctx.append(self)

class ReadTabLine(ItemTransform):
    def __init__(self, proc): self.proc = proc
    def encodes(self, row):
        cats,conts = (o.map(row.__getitem__) for o in (self.proc.cat_names, self.proc.cont_names))
        return TensorTabular(tensor(cats).long(), tensor(conts).float())
    def decodes(self, o):
        to = TabularPandas(o, self.proc.cat_names, self.proc.cont_names, self.proc.y_names)
        to = self.proc.decode(to)
        return TabularLine(pd.Series({c: v for v,c in zip(to.items[0]+to.items[1], self.proc.cat_names+self.proc.cont_names)}))

class ReadTabTarget(ItemTransform):
    def __init__(self, proc): self.proc = proc
    def encodes(self, row): return row[self.proc.y_names].astype(np.int64)
    def decodes(self, o): return Category(self.proc.classes[self.proc.y_names][o])
```
```python
# enc = tds[1]
# test_eq(enc[0][0], tensor([2,1]))
# test_close(enc[0][1], tensor([-0.628828]))
# test_eq(enc[1], 1)

# dec = tds.decode(enc)
# assert isinstance(dec[0], TabularLine)
# test_close(dec[0], pd.Series({'a': 1, 'b_na': False, 'b': 1}))
# test_eq(dec[1], 'a')

# test_stdout(lambda: print(show_at(tds, 1)), """a               1
# b_na        False
# b               1
# category        a
# dtype: object""")
```
