The Dataset

The dataset we use in this chapter is from the Blue Book for Bulldozers Kaggle competition, which has the following description: “The goal of the contest is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration. The data is sourced from auction result postings and includes information on usage and equipment configurations.”

This is a very common type of dataset and prediction problem, similar to what you may see in your project or workplace. The dataset is available for download on Kaggle, a website that hosts data science competitions.

Kaggle Competitions

Kaggle is an awesome resource for aspiring data scientists or anyone looking to improve their machine learning skills. There is nothing like getting hands-on practice and receiving real-time feedback to help you improve your skills.

Kaggle provides:

  • Interesting datasets
  • Feedback on how you’re doing
  • A leaderboard to see what’s good, what’s possible, and what’s state-of-the-art
  • Blog posts by winning contestants sharing useful tips and techniques

Until now all our datasets have been available to download through fastai’s integrated dataset system. However, the dataset we will be using in this chapter is only available from Kaggle. Therefore, you will need to register on the site, then go to the page for the competition. On that page click “Rules,” then “I Understand and Accept.” (Although the competition has finished, and you will not be entering it, you still have to agree to the rules to be allowed to download the data.)

The easiest way to download Kaggle datasets is to use the Kaggle API. You can install it with pip by running the following in a notebook cell:

    !pip install kaggle

You need an API key to use the Kaggle API; to get one, click on your profile picture on the Kaggle website, and choose My Account, then click Create New API Token. This will save a file called kaggle.json to your PC. You need to copy this key to your GPU server. To do so, open the file you downloaded, copy the contents, and paste them into the following cell in the notebook associated with this chapter (e.g., creds = '{"username":"xxx","key":"xxx"}'):

In [3]:

    creds = ''

Then execute this cell (this only needs to be run once):

In [4]:

    cred_path = Path('~/.kaggle/kaggle.json').expanduser()
    if not cred_path.exists():
        cred_path.parent.mkdir(exist_ok=True)
        cred_path.write_text(creds)
        cred_path.chmod(0o600)

Now you can download datasets from Kaggle! Pick a path to download the dataset to:

In [5]:

    path = URLs.path('bluebook')
    path

Out[5]:

    Path('/home/jhoward/.fastai/archive/bluebook')

In [6]:

    #hide
    Path.BASE_PATH = path

And use the Kaggle API to download the dataset to that path, and extract it:

In [7]:

    # `api` is provided by the kaggle package once kaggle.json is in place
    from kaggle import api

    if not path.exists():
        path.mkdir(parents=True)
        api.competition_download_cli('bluebook-for-bulldozers', path=path)
        file_extract(path/'bluebook-for-bulldozers.zip')

    path.ls(file_type='text')

Out[7]:

    (#7) [Path('TrainAndValid.csv'),Path('Machine_Appendix.csv'),Path('random_forest_benchmark_test.csv'),Path('Test.csv'),Path('median_benchmark.csv'),Path('ValidSolution.csv'),Path('Valid.csv')]

Now that we have downloaded our dataset, let’s take a look at it!

Look at the Data

Kaggle provides information about some of the fields of our dataset. The Data page explains that the key fields in train.csv are:

  • SalesID: The unique identifier of the sale.
  • MachineID: The unique identifier of a machine. A machine can be sold multiple times.
  • SalePrice: What the machine sold for at auction (only provided in train.csv).
  • saledate: The date of the sale.

In any sort of data science work, it’s important to look at your data directly to make sure you understand the format, how it’s stored, what types of values it holds, etc. Even if you’ve read a description of the data, the actual data may not be what you expect. We’ll start by reading the training set into a Pandas DataFrame. Generally it’s a good idea to specify low_memory=False unless Pandas actually runs out of memory and returns an error. The low_memory parameter, which is True by default, tells Pandas to only look at a few rows of data at a time to figure out what type of data is in each column. This means that Pandas can actually end up using different data types for different rows, which generally leads to data processing errors or model training problems later.

Let’s load our data and have a look at the columns:

In [8]:

    df = pd.read_csv(path/'TrainAndValid.csv', low_memory=False)
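
Even with low_memory=False, it’s worth confirming what types Pandas actually inferred. Here is a small, hypothetical check (not part of the original notebook; mixed_cols is just an illustrative name) that lists any columns whose non-missing values still contain more than one Python type:

    # Hypothetical sanity check: list columns whose non-missing values contain
    # more than one Python type (a symptom of inconsistent type inference)
    mixed_cols = [c for c in df.columns
                  if df[c].dropna().map(type).nunique() > 1]
    print(mixed_cols)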

In [9]:

    df.columns

Out[9]:

    Index(['SalesID', 'SalePrice', 'MachineID', 'ModelID', 'datasource',
           'auctioneerID', 'YearMade', 'MachineHoursCurrentMeter', 'UsageBand',
           'saledate', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc',
           'fiModelSeries', 'fiModelDescriptor', 'ProductSize',
           'fiProductClassDesc', 'state', 'ProductGroup', 'ProductGroupDesc',
           'Drive_System', 'Enclosure', 'Forks', 'Pad_Type', 'Ride_Control',
           'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension',
           'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics',
           'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size',
           'Coupler', 'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow',
           'Track_Type', 'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb',
           'Pattern_Changer', 'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type',
           'Travel_Controls', 'Differential_Type', 'Steering_Controls'],
          dtype='object')

That’s a lot of columns for us to look at! Try looking through the dataset to get a sense of what kind of information is in each one. We’ll shortly see how to “zero in” on the most interesting bits.
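
If you’re not sure where to begin, a few standard Pandas calls go a long way. These are suggestions rather than part of the original notebook, and each is best run in its own cell so you can see its output:

    # A few hypothetical starting points for browsing the data
    df.head()                        # first few rows
    df.describe(include='all').T     # per-column summary statistics
    df['state'].value_counts()       # distribution of one categorical column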

At this point, a good next step is to handle ordinal columns. This refers to columns containing strings or similar, but where those strings have a natural ordering. For instance, here are the levels of ProductSize:

In [10]:

    df['ProductSize'].unique()

Out[10]:

    array([nan, 'Medium', 'Small', 'Large / Medium', 'Mini', 'Large', 'Compact'], dtype=object)

We can tell Pandas about a suitable ordering of these levels like so:

In [11]:

    sizes = 'Large','Large / Medium','Medium','Small','Mini','Compact'

In [12]:

    df['ProductSize'] = df['ProductSize'].astype('category')
    df['ProductSize'] = df['ProductSize'].cat.set_categories(sizes, ordered=True)
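
As a quick check (hypothetical, not in the original notebook), you can confirm that Pandas now treats the column as an ordered categorical, with the categories in the order we supplied:

    # Hypothetical check: 'ProductSize' is now an ordered categorical,
    # with categories ranked in the order we listed in `sizes`
    print(df['ProductSize'].cat.ordered)      # True
    print(df['ProductSize'].cat.categories)   # Large, Large / Medium, ..., Compact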

The most important data column is the dependent variable—that is, the one we want to predict. Recall that a model’s metric is a function that reflects how good the predictions are. It’s important to note what metric is being used for a project. Generally, selecting the metric is an important part of the project setup. In many cases, choosing a good metric will require more than just selecting a variable that already exists. It is more like a design process. You should think carefully about which metric, or set of metrics, actually measures the notion of model quality that matters to you. If no variable represents that metric, you should see if you can build the metric from the variables that are available.

However, in this case Kaggle tells us what metric to use: root mean squared log error (RMSLE) between the actual and predicted auction prices. We need to do only a small amount of processing to use this: we take the log of the prices, so that the RMSE of that value will give us what we ultimately need:

In [13]:

    dep_var = 'SalePrice'

In [14]:

    df[dep_var] = np.log(df[dep_var])
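
To see why this is enough, here is a tiny sketch with made-up prices (hypothetical values, not from the dataset). The RMSLE here uses a plain log of the prices, matching the cell above, so it is by definition the RMSE of the log prices; once SalePrice has been log-transformed, an ordinary RMSE is exactly the competition metric:

    import numpy as np

    # Hypothetical prices, purely for illustration
    actual = np.array([10_000., 25_000., 60_000.])
    pred   = np.array([12_000., 20_000., 55_000.])

    # RMSLE computed directly on the raw prices (plain log, as above)
    rmsle = np.sqrt(np.mean((np.log(pred) - np.log(actual))**2))

    # Transform first (as we just did to SalePrice), then use plain RMSE
    log_actual, log_pred = np.log(actual), np.log(pred)
    rmse_of_logs = np.sqrt(np.mean((log_pred - log_actual)**2))

    print(np.isclose(rmsle, rmse_of_logs))  # True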

We are now ready to explore our first machine learning algorithm for tabular data: decision trees.