14.4 USDA食品数据库

美国农业部(USDA)制作了一份有关食物营养信息的数据库。Ashley Williams制作了该数据的JSON版(http://ashleyw.co.uk/project/food-nutrient-database)。其中的记录如下所示:

  1. {
  2. "id": 21441,
  3. "description": "KENTUCKY FRIED CHICKEN, Fried Chicken, EXTRA CRISPY,
  4. Wing, meat and skin with breading",
  5. "tags": ["KFC"],
  6. "manufacturer": "Kentucky Fried Chicken",
  7. "group": "Fast Foods",
  8. "portions": [
  9. {
  10. "amount": 1,
  11. "unit": "wing, with skin",
  12. "grams": 68.0
  13. },
  14. ...
  15. ],
  16. "nutrients": [
  17. {
  18. "value": 20.8,
  19. "units": "g",
  20. "description": "Protein",
  21. "group": "Composition"
  22. },
  23. ...
  24. ]
  25. }

每种食物都带有若干标识性属性以及两个有关营养成分和分量的列表。这种形式的数据不是很适合分析工作,因此我们需要做一些规整化以使其具有更好用的形式。

从上面列举的那个网址下载并解压数据之后,你可以用任何喜欢的JSON库将其加载到Python中。我用的是Python内置的json模块:

  1. In [154]: import json
  2. In [155]: db = json.load(open('datasets/usda_food/database.json'))
  3. In [156]: len(db)
  4. Out[156]: 6636

db中的每个条目都是一个含有某种食物全部数据的字典。nutrients字段是一个字典列表,其中的每个字典对应一种营养成分:

  1. In [157]: db[0].keys()
  2. Out[157]: dict_keys(['id', 'description', 'tags', 'manufacturer', 'group', 'porti
  3. ons', 'nutrients'])
  4. In [158]: db[0]['nutrients'][0]
  5. Out[158]:
  6. {'description': 'Protein',
  7. 'group': 'Composition',
  8. 'units': 'g',
  9. 'value': 25.18}
  10. In [159]: nutrients = pd.DataFrame(db[0]['nutrients'])
  11. In [160]: nutrients[:7]
  12. Out[160]:
  13. description group units value
  14. 0 Protein Composition g 25.18
  15. 1 Total lipid (fat) Composition g 29.20
  16. 2 Carbohydrate, by difference Composition g 3.06
  17. 3 Ash Other g 3.28
  18. 4 Energy Energy kcal 376.00
  19. 5 Water Composition g 39.28
  20. 6 Energy Energy kJ 1573.00

在将字典列表转换为DataFrame时,可以只抽取其中的一部分字段。这里,我们将取出食物的名称、分类、编号以及制造商等信息:

  1. In [161]: info_keys = ['description', 'group', 'id', 'manufacturer']
  2. In [162]: info = pd.DataFrame(db, columns=info_keys)
  3. In [163]: info[:5]
  4. Out[163]:
  5. description group id \
  6. 0 Cheese, caraway Dairy and Egg Products 1008
  7. 1 Cheese, cheddar Dairy and Egg Products 1009
  8. 2 Cheese, edam Dairy and Egg Products 1018
  9. 3 Cheese, feta Dairy and Egg Products 1019
  10. 4 Cheese, mozzarella, part skim milk Dairy and Egg Products 1028
  11. manufacturer
  12. 0
  13. 1
  14. 2
  15. 3
  16. 4
  17. In [164]: info.info()
  18. <class 'pandas.core.frame.DataFrame'>
  19. RangeIndex: 6636 entries, 0 to 6635
  20. Data columns (total 4 columns):
  21. description 6636 non-null object
  22. group 6636 non-null object
  23. id 6636 non-null int64
  24. manufacturer 5195 non-null object
  25. dtypes: int64(1), object(3)
  26. memory usage: 207.5+ KB

通过value_counts,你可以查看食物类别的分布情况:

  1. In [165]: pd.value_counts(info.group)[:10]
  2. Out[165]:
  3. Vegetables and Vegetable Products 812
  4. Beef Products 618
  5. Baked Products 496
  6. Breakfast Cereals 403
  7. Fast Foods 365
  8. Legumes and Legume Products 365
  9. Lamb, Veal, and Game Products 345
  10. Sweets 341
  11. Pork Products 328
  12. Fruits and Fruit Juices 328
  13. Name: group, dtype: int64

现在,为了对全部营养数据做一些分析,最简单的办法是将所有食物的营养成分整合到一个大表中。我们分几个步骤来实现该目的。首先,将各食物的营养成分列表转换为一个DataFrame,并添加一个表示编号的列,然后将该DataFrame添加到一个列表中。最后通过concat将这些东西连接起来就可以了:

顺利的话,nutrients的结果是:

  1. In [167]: nutrients
  2. Out[167]:
  3. description group units value id
  4. 0 Protein Composition g 25.180 1008
  5. 1 Total lipid (fat) Composition g 29.200 1008
  6. 2 Carbohydrate, by difference Composition g 3.060 1008
  7. 3 Ash Other g 3.280 1008
  8. 4 Energy Energy kcal 376.000 1008
  9. ... ... ...
  10. ... ... ...
  11. 389350 Vitamin B-12, added Vitamins mcg 0.000 43546
  12. 389351 Cholesterol Other mg 0.000 43546
  13. 389352 Fatty acids, total saturated Other g 0.072 43546
  14. 389353 Fatty acids, total monounsaturated Other g 0.028 43546
  15. 389354 Fatty acids, total polyunsaturated Other g 0.041 43546
  16. [389355 rows x 5 columns]

我发现这个DataFrame中无论如何都会有一些重复项,所以直接丢弃就可以了:

  1. In [168]: nutrients.duplicated().sum() # number of duplicates
  2. Out[168]: 14179
  3. In [169]: nutrients = nutrients.drop_duplicates()

由于两个DataFrame对象中都有”group”和”description”,所以为了明确到底谁是谁,我们需要对它们进行重命名:

  1. In [170]: col_mapping = {'description' : 'food',
  2. .....: 'group' : 'fgroup'}
  3. In [171]: info = info.rename(columns=col_mapping, copy=False)
  4. In [172]: info.info()
  5. <class 'pandas.core.frame.DataFrame'>
  6. RangeIndex: 6636 entries, 0 to 6635
  7. Data columns (total 4 columns):
  8. food 6636 non-null object
  9. fgroup 6636 non-null object
  10. id 6636 non-null int64
  11. manufacturer 5195 non-null object
  12. dtypes: int64(1), object(3)
  13. memory usage: 207.5+ KB
  14. In [173]: col_mapping = {'description' : 'nutrient',
  15. .....: 'group' : 'nutgroup'}
  16. In [174]: nutrients = nutrients.rename(columns=col_mapping, copy=False)
  17. In [175]: nutrients
  18. Out[175]:
  19. nutrient nutgroup units value id
  20. 0 Protein Composition g 25.180 1008
  21. 1 Total lipid (fat) Composition g 29.200 1008
  22. 2 Carbohydrate, by difference Composition g 3.060 1008
  23. 3 Ash Other g 3.280 1008
  24. 4 Energy Energy kcal 376.000 1008
  25. ... ... ... ... ... ...
  26. 389350 Vitamin B-12, added Vitamins mcg 0.000 43546
  27. 389351 Cholesterol Other mg 0.000 43546
  28. 389352 Fatty acids, total saturated Other g 0.072 43546
  29. 389353 Fatty acids, total monounsaturated Other g 0.028 43546
  30. 389354 Fatty acids, total polyunsaturated Other g 0.041 43546
  31. [375176 rows x 5 columns]

做完这些,就可以将info跟nutrients合并起来:

  1. In [176]: ndata = pd.merge(nutrients, info, on='id', how='outer')
  2. In [177]: ndata.info()
  3. <class 'pandas.core.frame.DataFrame'>
  4. Int64Index: 375176 entries, 0 to 375175
  5. Data columns (total 8 columns):
  6. nutrient 375176 non-null object
  7. nutgroup 375176 non-null object
  8. units 375176 non-null object
  9. value 375176 non-null float64
  10. id 375176 non-null int64
  11. food 375176 non-null object
  12. fgroup 375176 non-null object
  13. manufacturer 293054 non-null object
  14. dtypes: float64(1), int64(1), object(6)
  15. memory usage: 25.8+ MB
  16. In [178]: ndata.iloc[30000]
  17. Out[178]:
  18. nutrient Glycine
  19. nutgroup Amino Acids
  20. units g
  21. value 0.04
  22. id 6158
  23. food Soup, tomato bisque, canned, condensed
  24. fgroup Soups, Sauces, and Gravies
  25. manufacturer
  26. Name: 30000, dtype: object

我们现在可以根据食物分类和营养类型画出一张中位值图(如图14-11所示):

  1. In [180]: result = ndata.groupby(['nutrient', 'fgroup'])['value'].quantile(0.5)
  2. In [181]: result['Zinc, Zn'].sort_values().plot(kind='barh')

图片14-11 根据营养分类得出的锌中位值

只要稍微动一动脑子,就可以发现各营养成分最为丰富的食物是什么了:

  1. by_nutrient = ndata.groupby(['nutgroup', 'nutrient'])
  2. get_maximum = lambda x: x.loc[x.value.idxmax()]
  3. get_minimum = lambda x: x.loc[x.value.idxmin()]
  4. max_foods = by_nutrient.apply(get_maximum)[['value', 'food']]
  5. # make the food a little smaller
  6. max_foods.food = max_foods.food.str[:50]

由于得到的DataFrame很大,所以不方便在书里面全部打印出来。这里只给出”Amino Acids”营养分组:

  1. In [183]: max_foods.loc['Amino Acids']['food']
  2. Out[183]:
  3. nutrient
  4. Alanine Gelatins, dry powder, unsweetened
  5. Arginine Seeds, sesame flour, low-fat
  6. Aspartic acid Soy protein isolate
  7. Cystine Seeds, cottonseed flour, low fat (glandless)
  8. Glutamic acid Soy protein isolate
  9. ...
  10. Serine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
  11. Threonine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
  12. Tryptophan Sea lion, Steller, meat with fat (Alaska Native)
  13. Tyrosine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
  14. Valine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
  15. Name: food, Length: 19, dtype: object