Recipes Dataset

The RecipeNLG dataset is available for download here. It contains 2.2 million recipes, and its size is slightly less than 1 GB.

Download and unpack the dataset

Accept the Terms and Conditions and download the dataset here. Unpack the zip file with unzip. You will get the full_dataset.csv file.

Create a table

Run clickhouse-client and execute the following CREATE query:

  CREATE TABLE recipes
  (
      title String,
      ingredients Array(String),
      directions Array(String),
      link String,
      source LowCardinality(String),
      NER Array(String)
  ) ENGINE = MergeTree ORDER BY title;

Insert the data

Run the following command:

  clickhouse-client --query "
      INSERT INTO recipes
      SELECT
          title,
          JSONExtract(ingredients, 'Array(String)'),
          JSONExtract(directions, 'Array(String)'),
          link,
          source,
          JSONExtract(NER, 'Array(String)')
      FROM input('num UInt32, title String, ingredients String, directions String, link String, source LowCardinality(String), NER String')
      FORMAT CSVWithNames
  " --input_format_with_names_use_header 0 --format_csv_allow_single_quote 0 --input_format_allow_errors_num 10 < full_dataset.csv

This example shows how to parse a non-standard CSV file, which requires several settings to be tuned.

Explanation:
- the dataset is in CSV format, but it requires some preprocessing on insertion; we use the table function input to perform the preprocessing;
- the structure of the CSV file is specified in the argument of the table function input;
- the field num (row number) is not needed: we parse it from the file and ignore it;
- we use FORMAT CSVWithNames, but the header in the CSV file is ignored (via the command-line parameter --input_format_with_names_use_header 0), because the header does not contain a name for the first field;
- the file uses only double quotes to enclose CSV strings; some strings are not enclosed in double quotes, and a single quote must not be parsed as a string delimiter, which is why we also add the --format_csv_allow_single_quote 0 parameter;
- some strings from the CSV file cannot be parsed, because they contain the \M/ sequence at the beginning of the value; the only value that may start with a backslash in CSV is \N, which is parsed as SQL NULL. We add the --input_format_allow_errors_num 10 parameter, so that up to ten malformed records can be skipped;
- the ingredients, directions and NER fields are arrays represented in an unusual form: they are serialized into strings as JSON and then placed in the CSV file; we parse them as String and then use the JSONExtract function to transform them into Array.
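
The preprocessing described above can also be sketched outside ClickHouse. The following Python snippet is an illustration only and not part of the original walkthrough: the sample rows, the SAMPLE constant and the parse_recipes helper are made up. It assumes the column layout listed above, drops the leading row-number column, and decodes the JSON-encoded array fields, roughly mirroring what input and JSONExtract do in the INSERT query.

```python
import csv
import io
import json

# Hypothetical two-row sample in the layout described above: an unnamed
# row-number column, then title, ingredients, directions, link, source, NER.
SAMPLE = '''\
,title,ingredients,directions,link,source,NER
0,Toast,"[""1 slice bread"", ""butter""]","[""Toast the bread."", ""Spread butter.""]",example.com/toast,demo,"[""bread"", ""butter""]"
1,Bob's Tea,"[""1 tea bag"", ""water""]","[""Boil water."", ""Steep the tea.""]",example.com/tea,demo,"[""tea"", ""water""]"
'''

COLUMNS = ["num", "title", "ingredients", "directions", "link", "source", "NER"]
ARRAY_COLUMNS = ("ingredients", "directions", "NER")


def parse_recipes(text):
    """Yield one dict per recipe, with the array columns decoded from JSON."""
    # Only the double quote encloses strings; a single quote is ordinary data.
    reader = csv.reader(io.StringIO(text), quotechar='"')
    next(reader)  # skip the header; column names come from COLUMNS instead
    for row in reader:
        record = dict(zip(COLUMNS, row))
        del record["num"]  # the row number is parsed and then ignored
        for name in ARRAY_COLUMNS:
            record[name] = json.loads(record[name])
        yield record


recipes = list(parse_recipes(SAMPLE))
print(recipes[1]["title"], recipes[1]["NER"])
```

The title "Bob's Tea" contains a single quote on purpose: because only the double quote is treated as a quoting character, the value is read unchanged.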

Validate the inserted data

Check the row count:

  SELECT count() FROM recipes

  ┌─count()─┐
  │ 2231141 │
  └─────────┘

Example queries

Top components by the number of recipes:

  SELECT
      arrayJoin(NER) AS k,
      count() AS c
  FROM recipes
  GROUP BY k
  ORDER BY c DESC
  LIMIT 50

  ┌─k────────────────────┬──────c─┐
  │ salt                 │ 890741 │
  │ sugar                │ 620027 │
  │ butter               │ 493823 │
  │ flour                │ 466110 │
  │ eggs                 │ 401276 │
  │ onion                │ 372469 │
  │ garlic               │ 358364 │
  │ milk                 │ 346769 │
  │ water                │ 326092 │
  │ vanilla              │ 270381 │
  │ olive oil            │ 197877 │
  │ pepper               │ 179305 │
  │ brown sugar          │ 174447 │
  │ tomatoes             │ 163933 │
  │ egg                  │ 160507 │
  │ baking powder        │ 148277 │
  │ lemon juice          │ 146414 │
  │ Salt                 │ 122557 │
  │ cinnamon             │ 117927 │
  │ sour cream           │ 116682 │
  │ cream cheese         │ 114423 │
  │ margarine            │ 112742 │
  │ celery               │ 112676 │
  │ baking soda          │ 110690 │
  │ parsley              │ 102151 │
  │ chicken              │ 101505 │
  │ onions               │  98903 │
  │ vegetable oil        │  91395 │
  │ oil                  │  85600 │
  │ mayonnaise           │  84822 │
  │ pecans               │  79741 │
  │ nuts                 │  78471 │
  │ potatoes             │  75820 │
  │ carrots              │  75458 │
  │ pineapple            │  74345 │
  │ soy sauce            │  70355 │
  │ black pepper         │  69064 │
  │ thyme                │  68429 │
  │ mustard              │  65948 │
  │ chicken broth        │  65112 │
  │ bacon                │  64956 │
  │ honey                │  64626 │
  │ oregano              │  64077 │
  │ ground beef          │  64068 │
  │ unsalted butter      │  63848 │
  │ mushrooms            │  61465 │
  │ Worcestershire sauce │  59328 │
  │ cornstarch           │  58476 │
  │ green pepper         │  58388 │
  │ Cheddar cheese       │  58354 │
  └──────────────────────┴────────┘

  50 rows in set. Elapsed: 0.112 sec. Processed 2.23 million rows, 361.57 MB (19.99 million rows/s., 3.24 GB/s.)

This example shows how to use the arrayJoin function to expand an array into rows: each source row is repeated once for every element of its array.
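
To make the effect of arrayJoin concrete, here is a rough Python analogue of the query above. It is a sketch on made-up data, not the way ClickHouse executes the query; the recipes list and the top_components name are illustrative only. Each row is repeated once per element of its NER array, and the resulting values are then grouped and counted.

```python
from collections import Counter

# Made-up rows standing in for the recipes table: (title, NER).
recipes = [
    ("Toast", ["bread", "butter"]),
    ("Tea", ["tea", "water"]),
    ("Buttered noodles", ["noodles", "butter", "salt"]),
]


def top_components(rows, limit):
    """Rough analogue of SELECT arrayJoin(NER) AS k, count() AS c
    ... GROUP BY k ORDER BY c DESC LIMIT limit."""
    # arrayJoin step: one output value per (recipe, array element) pair.
    unfolded = [k for _title, ner in rows for k in ner]
    # GROUP BY k, count(), ORDER BY c DESC and LIMIT steps.
    return Counter(unfolded).most_common(limit)


print(top_components(recipes, 1))  # [('butter', 2)]
```

Three input rows become seven unfolded values here, which is why the counts in the real query can add up to far more than the number of recipes.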

The most complex recipes with strawberry

  SELECT
      title,
      length(NER),
      length(directions)
  FROM recipes
  WHERE has(NER, 'strawberry')
  ORDER BY length(directions) DESC
  LIMIT 10

  ┌─title────────────────────────────────────────────────────────────┬─length(NER)─┬─length(directions)─┐
  │ Chocolate-Strawberry-Orange Wedding Cake                         │          24 │                126 │
  │ Strawberry Cream Cheese Crumble Tart                             │          19 │                 47 │
  │ Charlotte-Style Ice Cream                                        │          11 │                 45 │
  │ Sinfully Good a Million Layers Chocolate Layer Cake, With Strawb │          31 │                 45 │
  │ Sweetened Berries With Elderflower Sherbet                       │          24 │                 44 │
  │ Chocolate-Strawberry Mousse Cake                                 │          15 │                 42 │
  │ Rhubarb Charlotte with Strawberries and Rum                      │          20 │                 42 │
  │ Chef Joey's Strawberry Vanilla Tart                              │           7 │                 37 │
  │ Old-Fashioned Ice Cream Sundae Cake                              │          17 │                 37 │
  │ Watermelon Cake                                                  │          16 │                 36 │
  └──────────────────────────────────────────────────────────────────┴─────────────┴────────────────────┘

  10 rows in set. Elapsed: 0.215 sec. Processed 2.23 million rows, 1.48 GB (10.35 million rows/s., 6.86 GB/s.)

This example uses the has function to filter by array elements and sorts the result by the number of directions.
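
The filter and sort in that query can be sketched in Python as follows. This is an analogue on made-up rows, under the assumption that has matches whole array elements by equality; the most_complex_with name and the sample data are hypothetical.

```python
# Made-up rows standing in for the recipes table: (title, NER, directions).
recipes = [
    ("Strawberry toast", ["bread", "strawberry"], ["Toast.", "Top with fruit."]),
    ("Berry bowl", ["strawberries", "yogurt"], ["Mix."]),
    ("Strawberry cake", ["flour", "strawberry", "sugar"],
     ["Mix.", "Bake.", "Cool.", "Decorate."]),
]


def most_complex_with(rows, component, limit):
    """Rough analogue of WHERE has(NER, component)
    ORDER BY length(directions) DESC LIMIT limit."""
    # has() step: keep rows whose array contains an element equal to component.
    matching = [row for row in rows if component in row[1]]
    # ORDER BY length(directions) DESC, then LIMIT.
    matching.sort(key=lambda row: len(row[2]), reverse=True)
    return [(title, len(ner), len(directions))
            for title, ner, directions in matching[:limit]]


print(most_complex_with(recipes, "strawberry", 10))
```

In this sketch "Berry bowl" is not returned, because its array holds "strawberries" rather than an element equal to "strawberry".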

There is a wedding cake that takes a whole 126 steps to make!

Show those directions:

  SELECT arrayJoin(directions)
  FROM recipes
  WHERE title = 'Chocolate-Strawberry-Orange Wedding Cake'

  ┌─arrayJoin(directions)───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
  │ Position 1 rack in center and 1 rack in bottom third of oven and preheat to 350F. │
  │ Butter one 5-inch-diameter cake pan with 2-inch-high sides, one 8-inch-diameter cake pan with 2-inch-high sides and one 12-inch-diameter cake pan with 2-inch-high sides. │
  │ Dust pans with flour; line bottoms with parchment. │
  │ Combine 1/3 cup orange juice and 2 ounces unsweetened chocolate in heavy small saucepan. │
  │ Stir mixture over medium-low heat until chocolate melts. │
  │ Remove from heat. │
  │ Gradually mix in 1 2/3 cups orange juice. │
  │ Sift 3 cups flour, 2/3 cup cocoa, 2 teaspoons baking soda, 1 teaspoon salt and 1/2 teaspoon baking powder into medium bowl. │
  │ using electric mixer, beat 1 cup (2 sticks) butter and 3 cups sugar in large bowl until blended (mixture will look grainy). │
  │ Add 4 eggs, 1 at a time, beating to blend after each. │
  │ Beat in 1 tablespoon orange peel and 1 tablespoon vanilla extract. │
  │ Add dry ingredients alternately with orange juice mixture in 3 additions each, beating well after each addition. │
  │ Mix in 1 cup chocolate chips. │
  │ Transfer 1 cup plus 2 tablespoons batter to prepared 5-inch pan, 3 cups batter to prepared 8-inch pan and remaining batter (about 6 cups) to 12-inch pan. │
  │ Place 5-inch and 8-inch pans on center rack of oven. │
  │ Place 12-inch pan on lower rack of oven. │
  │ Bake cakes until tester inserted into center comes out clean, about 35 minutes. │
  │ Transfer cakes in pans to racks and cool completely. │
  │ Mark 4-inch diameter circle on one 6-inch-diameter cardboard cake round. │
  │ Cut out marked circle. │
  │ Mark 7-inch-diameter circle on one 8-inch-diameter cardboard cake round. │
  │ Cut out marked circle. │
  │ Mark 11-inch-diameter circle on one 12-inch-diameter cardboard cake round. │
  │ Cut out marked circle. │
  │ Cut around sides of 5-inch-cake to loosen. │
  │ Place 4-inch cardboard over pan. │
  │ Hold cardboard and pan together; turn cake out onto cardboard. │
  │ Peel off parchment.Wrap cakes on its cardboard in foil. │
  │ Repeat turning out, peeling off parchment and wrapping cakes in foil, using 7-inch cardboard for 8-inch cake and 11-inch cardboard for 12-inch cake. │
  │ Using remaining ingredients, make 1 more batch of cake batter and bake 3 more cake layers as described above. │
  │ Cool cakes in pans. │
  │ Cover cakes in pans tightly with foil. │
  │ (Can be prepared ahead. │
  │ Let stand at room temperature up to 1 day or double-wrap all cake layers and freeze up to 1 week. │
  │ Bring cake layers to room temperature before using.) │
  │ Place first 12-inch cake on its cardboard on work surface. │
  │ Spread 2 3/4 cups ganache over top of cake and all the way to edge. │
  │ Spread 2/3 cup jam over ganache, leaving 1/2-inch chocolate border at edge. │
  │ Drop 1 3/4 cups white chocolate frosting by spoonfuls over jam. │
  │ Gently spread frosting over jam, leaving 1/2-inch chocolate border at edge. │
  │ Rub some cocoa powder over second 12-inch cardboard. │
  │ Cut around sides of second 12-inch cake to loosen. │
  │ Place cardboard, cocoa side down, over pan. │
  │ Turn cake out onto cardboard. │
  │ Peel off parchment. │
  │ Carefully slide cake off cardboard and onto filling on first 12-inch cake. │
  │ Refrigerate. │
  │ Place first 8-inch cake on its cardboard on work surface. │
  │ Spread 1 cup ganache over top all the way to edge. │
  │ Spread 1/4 cup jam over, leaving 1/2-inch chocolate border at edge. │
  │ Drop 1 cup white chocolate frosting by spoonfuls over jam. │
  │ Gently spread frosting over jam, leaving 1/2-inch chocolate border at edge. │
  │ Rub some cocoa over second 8-inch cardboard. │
  │ Cut around sides of second 8-inch cake to loosen. │
  │ Place cardboard, cocoa side down, over pan. │
  │ Turn cake out onto cardboard. │
  │ Peel off parchment. │
  │ Slide cake off cardboard and onto filling on first 8-inch cake. │
  │ Refrigerate. │
  │ Place first 5-inch cake on its cardboard on work surface. │
  │ Spread 1/2 cup ganache over top of cake and all the way to edge. │
  │ Spread 2 tablespoons jam over, leaving 1/2-inch chocolate border at edge. │
  │ Drop 1/3 cup white chocolate frosting by spoonfuls over jam. │
  │ Gently spread frosting over jam, leaving 1/2-inch chocolate border at edge. │
  │ Rub cocoa over second 6-inch cardboard. │
  │ Cut around sides of second 5-inch cake to loosen. │
  │ Place cardboard, cocoa side down, over pan. │
  │ Turn cake out onto cardboard. │
  │ Peel off parchment. │
  │ Slide cake off cardboard and onto filling on first 5-inch cake. │
  │ Chill all cakes 1 hour to set filling. │
  │ Place 12-inch tiered cake on its cardboard on revolving cake stand. │
  │ Spread 2 2/3 cups frosting over top and sides of cake as a first coat. │
  │ Refrigerate cake. │
  │ Place 8-inch tiered cake on its cardboard on cake stand. │
  │ Spread 1 1/4 cups frosting over top and sides of cake as a first coat. │
  │ Refrigerate cake. │
  │ Place 5-inch tiered cake on its cardboard on cake stand. │
  │ Spread 3/4 cup frosting over top and sides of cake as a first coat. │
  │ Refrigerate all cakes until first coats of frosting set, about 1 hour. │
  │ (Cakes can be made to this point up to 1 day ahead; cover and keep refrigerate.) │
  │ Prepare second batch of frosting, using remaining frosting ingredients and following directions for first batch. │
  │ Spoon 2 cups frosting into pastry bag fitted with small star tip. │
  │ Place 12-inch cake on its cardboard on large flat platter. │
  │ Place platter on cake stand. │
  │ Using icing spatula, spread 2 1/2 cups frosting over top and sides of cake; smooth top. │
  │ Using filled pastry bag, pipe decorative border around top edge of cake. │
  │ Refrigerate cake on platter. │
  │ Place 8-inch cake on its cardboard on cake stand. │
  │ Using icing spatula, spread 1 1/2 cups frosting over top and sides of cake; smooth top. │
  │ Using pastry bag, pipe decorative border around top edge of cake. │
  │ Refrigerate cake on its cardboard. │
  │ Place 5-inch cake on its cardboard on cake stand. │
  │ Using icing spatula, spread 3/4 cup frosting over top and sides of cake; smooth top. │
  │ Using pastry bag, pipe decorative border around top edge of cake, spooning more frosting into bag if necessary. │
  │ Refrigerate cake on its cardboard. │
  │ Keep all cakes refrigerated until frosting sets, about 2 hours. │
  │ (Can be prepared 2 days ahead. │
  │ Cover loosely; keep refrigerated.) │
  │ Place 12-inch cake on platter on work surface. │
  │ Press 1 wooden dowel straight down into and completely through center of cake. │
  │ Mark dowel 1/4 inch above top of frosting. │
  │ Remove dowel and cut with serrated knife at marked point. │
  │ Cut 4 more dowels to same length. │
  │ Press 1 cut dowel back into center of cake. │
  │ Press remaining 4 cut dowels into cake, positioning 3 1/2 inches inward from cake edges and spacing evenly. │
  │ Place 8-inch cake on its cardboard on work surface. │
  │ Press 1 dowel straight down into and completely through center of cake. │
  │ Mark dowel 1/4 inch above top of frosting. │
  │ Remove dowel and cut with serrated knife at marked point. │
  │ Cut 3 more dowels to same length. │
  │ Press 1 cut dowel back into center of cake. │
  │ Press remaining 3 cut dowels into cake, positioning 2 1/2 inches inward from edges and spacing evenly. │
  │ Using large metal spatula as aid, place 8-inch cake on its cardboard atop dowels in 12-inch cake, centering carefully. │
  │ Gently place 5-inch cake on its cardboard atop dowels in 8-inch cake, centering carefully. │
  │ Using citrus stripper, cut long strips of orange peel from oranges. │
  │ Cut strips into long segments. │
  │ To make orange peel coils, wrap peel segment around handle of wooden spoon; gently slide peel off handle so that peel keeps coiled shape. │
  │ Garnish cake with orange peel coils, ivy or mint sprigs, and some berries. │
  │ (Assembled cake can be made up to 8 hours ahead. │
  │ Let stand at cool room temperature.) │
  │ Remove top and middle cake tiers. │
  │ Remove dowels from cakes. │
  │ Cut top and middle cakes into slices. │
  │ To cut 12-inch cake: Starting 3 inches inward from edge and inserting knife straight down, cut through from top to bottom to make 6-inch-diameter circle in center of cake. │
  │ Cut outer portion of cake into slices; cut inner portion into slices and serve with strawberries. │
  └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

  126 rows in set. Elapsed: 0.011 sec. Processed 8.19 thousand rows, 5.34 MB (737.75 thousand rows/s., 480.59 MB/s.)

Online playground

The dataset is also available in the Playground.