
To download images with Bing Image Search, sign up at Microsoft Azure for a free account. You will be given a key, which you can copy and enter in a cell as follows (replacing ‘XXX’ with your key and executing it):

In [ ]:

    key = os.environ.get('AZURE_SEARCH_KEY', 'XXX')

Or, if you’re comfortable at the command line, you can set it in your terminal with:

    export AZURE_SEARCH_KEY=your_key_here

and then restart Jupyter Notebook, and use the above line without editing it.
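As a small safeguard, a sketch like the following can fail early when the key is still the placeholder. The helper name `load_search_key` is hypothetical and is not part of fastai or fastbook; only the `AZURE_SEARCH_KEY` variable name and the 'XXX' placeholder come from the text above.

```python
import os

PLACEHOLDER = 'XXX'  # the placeholder value used in the cell above

def load_search_key(env=None, name='AZURE_SEARCH_KEY'):
    """Return the key stored under `name`, or raise if it is missing or still the placeholder.

    `load_search_key` is a hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    key = env.get(name, PLACEHOLDER)
    if not key or key == PLACEHOLDER:
        raise ValueError(f"Set {name} (or paste your key) before searching")
    return key

# Example with an explicit mapping instead of the real environment
print(load_search_key({'AZURE_SEARCH_KEY': 'abc123'}))  # prints abc123
```

Passing a plain dictionary, as in the example, makes the check easy to try without touching your real environment; calling `load_search_key()` with no arguments reads from `os.environ`.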

Once you’ve set key, you can use search_images_bing. This function is provided by the small utilities module included with the notebooks online (the output below shows that it lives in fastbook). If you’re not sure where a function is defined, you can just type its name in your notebook to find out:

In [ ]:

    search_images_bing

Out[ ]:

    <function fastbook.search_images_bing(key, term, min_sz=128, max_images=150)>

In [ ]:

    results = search_images_bing(key, 'grizzly bear')
    # Use the same field name as the download loop later in this section
    ims = results.attrgot('contentUrl')
    len(ims)

Out[ ]:

    150

We’ve successfully retrieved the URLs of 150 grizzly bear images (or, at least, of the images that Bing Image Search finds for that search term).
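To make the attrgot step concrete: each search result can be thought of as a dictionary-like record, and attrgot collects one field from every record. The following plain-Python sketch mimics that idea on hand-made records. The helper `pluck`, the example URLs, and the extra `name` field are all made up for illustration; the field name `contentUrl` follows the download loop later in this section, and real results contain whatever fields the search API returns.

```python
# Hand-made stand-ins for search results; the real records come from the API.
results = [
    {'name': 'bear one', 'contentUrl': 'https://example.com/bear1.jpg'},
    {'name': 'bear two', 'contentUrl': 'https://example.com/bear2.jpg'},
    {'name': 'no url here'},  # a record that is missing the field
]

def pluck(records, field, default=None):
    """Collect `field` from every record, using `default` where it is absent.

    `pluck` is a hypothetical helper, not a fastai function.
    """
    return [r.get(field, default) for r in records]

urls = pluck(results, 'contentUrl')
print(len(urls))   # 3
print(urls[0])     # https://example.com/bear1.jpg
```

Note that the length of the list only tells you how many records came back, not how many usable URLs they contain, which is one reason to look at a sample before downloading everything.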

NB: There’s no way to be sure exactly what images a search like this will find. The results can change over time. We’ve heard of at least one case of a community member who found some unpleasant pictures of dead bears in their search results. You’ll receive whatever images are found by the web search engine. If you’re running this at work or around children, be cautious before you display the downloaded images.

Let’s look at one:

In [ ]:

    #hide
    ims = ['http://3.bp.blogspot.com/-S1scRCkI3vY/UHzV2kucsPI/AAAAAAAAA-k/YQ5UzHEm9Ss/s1600/Grizzly%2BBear%2BWildlife.jpg']

In [ ]:

    dest = 'images/grizzly.jpg'
    download_url(ims[0], dest)

In [ ]:

    im = Image.open(dest)
    im.to_thumb(128,128)

Out[ ]:

[Figure 1: thumbnail of the downloaded image]

This seems to have worked nicely, so let’s use fastai’s download_images to download all the URLs for each of our search terms. We’ll put each in a separate folder:

In [ ]:

    bear_types = 'grizzly','black','teddy'
    path = Path('bears')

In [ ]:

    if not path.exists():
        path.mkdir()
        for o in bear_types:
            dest = (path/o)
            dest.mkdir(exist_ok=True)
            results = search_images_bing(key, f'{o} bear')
            download_images(dest, urls=results.attrgot('contentUrl'))

Our folder has image files, as we’d expect:

In [ ]:

    fns = get_image_files(path)
    fns

Out[ ]:

    (#406) [Path('bears/black/00000149.jpg'),Path('bears/black/00000095.jpg'),Path('bears/black/00000133.jpg'),Path('bears/black/00000062.jpg'),Path('bears/black/00000023.jpg'),Path('bears/black/00000029.jpg'),Path('bears/black/00000094.jpg'),Path('bears/black/00000124.jpg'),Path('bears/black/00000105.jpg'),Path('bears/black/00000046.jpg')...]
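Because each search term was saved to its own folder, the folder name doubles as the class label. A quick way to see how many files landed in each class is to count the parent folder names. The sketch below uses only the standard library and a short hand-written list of paths shaped like the output above, not a real directory listing.

```python
from collections import Counter
from pathlib import Path

# A few paths shaped like the output above (hand-written, not a real listing)
fns = [Path('bears/black/00000149.jpg'),
       Path('bears/black/00000095.jpg'),
       Path('bears/grizzly/00000001.jpg'),
       Path('bears/teddy/00000006.jpg')]

# The class label is the name of the folder that holds each file
counts = Counter(p.parent.name for p in fns)
print(counts['black'], counts['grizzly'], counts['teddy'])  # 2 1 1
```

The same expression should work on the real list of files, since the paths it contains are Path objects.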

j: I just love this about working in Jupyter notebooks! It’s so easy to gradually build what I want, and check my work every step of the way. I make a lot of mistakes, so this is really helpful to me…

Often when we download files from the internet, there are a few that are corrupt. Let’s check:

In [ ]:

    failed = verify_images(fns)
    failed

Out[ ]:

    (#11) [Path('bears/black/00000147.jpg'),Path('bears/black/00000057.jpg'),Path('bears/black/00000140.jpg'),Path('bears/black/00000129.jpg'),Path('bears/teddy/00000006.jpg'),Path('bears/teddy/00000048.jpg'),Path('bears/teddy/00000076.jpg'),Path('bears/teddy/00000125.jpg'),Path('bears/teddy/00000090.jpg'),Path('bears/teddy/00000075.jpg')...]

To remove all the failed images, you can use unlink on each of them. Note that, like most fastai functions that return a collection, verify_images returns an object of type L, which includes the map method. This calls the passed function on each element of the collection:

In [ ]:

    failed.map(Path.unlink);
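If the map call feels opaque, it may help to see the same idea written as an ordinary loop. The following standard-library sketch creates a few throwaway files in a temporary folder, pretends two of them failed verification, and unlinks those two. The file names and contents are invented for the demonstration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Create three throwaway files to stand in for downloaded images
    for name in ('a.jpg', 'b.jpg', 'c.jpg'):
        (folder/name).write_bytes(b'not really an image')

    failed = [folder/'a.jpg', folder/'c.jpg']   # pretend these failed verification

    # Call Path.unlink on each element, one at a time
    for p in failed:
        p.unlink()

    remaining = sorted(p.name for p in folder.iterdir())
    print(remaining)  # ['b.jpg']
```

Unlinking is permanent, so it is worth inspecting the list of failed files before running the deletion on real data.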

Sidebar: Getting Help in Jupyter Notebooks

Jupyter notebooks are great for experimenting and immediately seeing the results of each function, but there is also a lot of functionality to help you figure out how to use different functions, or even directly look at their source code. For instance, if you type in a cell:

    ??verify_images

a window will pop up with:

    Signature: verify_images(fns)
    Source:
    def verify_images(fns):
        "Find images in `fns` that can't be opened"
        return L(fns[i] for i,o in
                 enumerate(parallel(verify_image, fns)) if not o)
    File:      ~/git/fastai/fastai/vision/utils.py
    Type:      function

This tells us what argument the function accepts (fns), then shows us the source code and the file it comes from. Looking at that source code, we can see it applies the function verify_image in parallel and only keeps the image files for which the result of that function is False, which is consistent with the doc string: it finds the images in fns that can’t be opened.
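The pattern in that source (run a check on every item in parallel, then keep the items whose check came back falsy) can be sketched with the standard library alone. In the sketch below, `ThreadPoolExecutor` stands in for fastai’s parallel, and `looks_like_jpeg` is a crude, hypothetical stand-in for verify_image that only inspects the first two bytes; neither is how fastai actually implements these functions.

```python
from concurrent.futures import ThreadPoolExecutor

def looks_like_jpeg(data: bytes) -> bool:
    """Crude stand-in check: does the data start with the JPEG magic bytes?"""
    return data[:2] == b'\xff\xd8'

def find_failures(items, check):
    """Return the items for which `check` returned a falsy value."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(check, items))   # results keep the input order
    return [items[i] for i, ok in enumerate(results) if not ok]

blobs = [b'\xff\xd8\xff\xe0 fake jpeg', b'<html>error page</html>', b'']
failed = find_failures(blobs, looks_like_jpeg)
print(len(failed))  # 2
```

Because the results come back in the same order as the inputs, the index from enumerate can be used to look up the original item, which is the same trick the one-line fastai source relies on.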

Here are some other features that are very useful in Jupyter notebooks:

  • At any point, if you don’t remember the exact spelling of a function or argument name, you can press Tab to get autocompletion suggestions.
  • When inside the parentheses of a function, pressing Shift and Tab simultaneously will display a window with the signature of the function and a short description. Pressing these keys twice will expand the documentation, and pressing them three times will open a full window with the same information at the bottom of your screen.
  • In a cell, typing ?func_name and executing will open a window with the signature of the function and a short description.
  • In a cell, typing ??func_name and executing will open a window with the signature of the function, a short description, and the source code.
  • If you are using the fastai library, we added a doc function for you: executing doc(func_name) in a cell will open a window with the signature of the function, a short description and links to the source code on GitHub and the full documentation of the function in the library docs.
  • Unrelated to the documentation but still very useful: to get help at any point if you get an error, type %debug in the next cell and execute to open the Python debugger, which will let you inspect the content of every variable.
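Outside Jupyter, the standard library’s inspect module exposes similar information programmatically. The short sketch below defines a toy function, `area`, purely for illustration and prints its signature and docstring, roughly what the ? shortcut shows; `inspect.getsource` plays a similar role to ?? for functions whose source file is available.

```python
import inspect

def area(width, height=1):
    "Return the area of a width-by-height rectangle"
    return width * height

print(inspect.signature(area))   # (width, height=1)
print(inspect.getdoc(area))      # Return the area of a width-by-height rectangle
```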

End sidebar

One thing to be aware of in this process: as we discussed earlier, models can only reflect the data used to train them. And the world is full of biased data, which ends up reflected in, for example, Bing Image Search (which we used to create our dataset). For instance, let’s say you were interested in creating an app that could help users figure out whether they had healthy skin, so you trained a model on the results of searches for (say) “healthy skin.” The figure below shows you the kinds of results you would get.

[Figure 2: search results for “healthy skin”]

With this as your training data, you would end up not with a healthy skin detector, but with a “young white woman touching her face” detector! Be sure to think carefully about the types of data that you might expect to see in practice in your application, and check carefully to ensure that all these types are reflected in your model’s source data. (Thanks to Deb Raji, who came up with the “healthy skin” example. See her paper “Actionable Auditing: Investigating the Impact of Publicly Naming Biased Performance Results of Commercial AI Products” for more fascinating insights into model bias.)

Now that we have downloaded some data, we need to assemble it in a format suitable for model training. In fastai, that means creating an object called DataLoaders.