DataLoader

class paddle.fluid.io.DataLoader

Methods

from_generator(feed_list=None, capacity=None, use_double_buffer=True, iterable=True, return_list=False, use_multiprocess=False, drop_last=True)

Note

The framework guarantees that the DataLoader loads data in the same order in which the user-provided data source yields it.

Creates a DataLoader object that loads data produced by a Python generator. The data is prefetched by a Python thread and pushed asynchronously into a queue.
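The prefetching behaviour described above can be pictured with a small plain-Python sketch. This is not Paddle's implementation: `prefetch`, `_END`, and `batches` are hypothetical names used only for illustration, and the bounded `queue.Queue` merely plays the role that the `capacity` argument plays in the real DataLoader.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the data source


def prefetch(generator_fn, capacity):
    """Yield items from generator_fn() while a background thread reads ahead."""
    buffer = queue.Queue(maxsize=capacity)

    def _reader():
        for item in generator_fn():
            buffer.put(item)  # blocks while the queue already holds `capacity` items
        buffer.put(_END)

    threading.Thread(target=_reader, daemon=True).start()

    while True:
        item = buffer.get()  # blocks while the queue is empty
        if item is _END:
            return
        yield item


def batches():
    # Toy data source: five small "batches".
    for i in range(5):
        yield [i, i + 1]


result = list(prefetch(batches, capacity=2))
```

Because a single reader thread feeds a FIFO queue, the consumer sees the items in the order the generator produced them, which is the ordering guarantee stated in the note.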

The DataLoader object created by this method provides three methods for setting the data source: set_sample_generator, set_sample_list_generator, and set_batch_generator. See the code examples below for how to use them.
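The three data source formats differ only in how much data the generator yields per step: one sample, a list of samples, or one already-assembled batch. The following plain-Python sketch shows how the formats relate; `to_sample_list_generator` and `to_batch_generator` are hypothetical helpers written for this illustration, not Paddle APIs, and plain lists stand in for numpy arrays.

```python
def sample_generator():
    # One sample per step: a (feature, label) pair.
    for i in range(6):
        yield [float(i)], i % 2


def to_sample_list_generator(sample_gen, batch_size):
    # Group consecutive samples into lists of `batch_size` samples.
    def reader():
        samples = []
        for sample in sample_gen():
            samples.append(sample)
            if len(samples) == batch_size:
                yield samples
                samples = []
        if samples:  # keep a trailing, smaller list in this sketch
            yield samples
    return reader


def to_batch_generator(sample_list_gen):
    # Transpose each sample list into one sequence per input field.
    def reader():
        for samples in sample_list_gen():
            features, labels = zip(*samples)
            yield list(features), list(labels)
    return reader


sample_list_reader = to_sample_list_generator(sample_generator, batch_size=3)
sample_lists = list(sample_list_reader())
batches = list(to_batch_generator(sample_list_reader)())
```

Here `sample_generator` has the shape expected by set_sample_generator, `sample_list_reader` the shape expected by set_sample_list_generator, and the result of `to_batch_generator` the shape expected by set_batch_generator.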

If iterable = True, the DataLoader object created by this method is a Python generator and can be iterated over with a for loop.

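If iterable = False, the DataLoader object created by this method provides the start() and reset() methods to control the data reading process. This mode exists for compatibility with fluid.layers.py_reader: with iterable = False, code written for fluid.layers.py_reader can be migrated to fluid.io.DataLoader with little effort.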
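The start()/reset() control flow of the non-iterable mode can be sketched without Paddle. In the sketch below, `ToyLoader` and `EndOfData` are hypothetical stand-ins for a DataLoader created with iterable = False and for the end-of-data exception raised by the executor; `read()` plays the role of running the program for one step.

```python
class EndOfData(Exception):
    """Hypothetical stand-in for the end-of-data exception."""


class ToyLoader:
    """Hypothetical stand-in for a DataLoader created with iterable=False."""

    def __init__(self, generator_fn):
        self._generator_fn = generator_fn
        self._iterator = None

    def start(self):
        # Begin reading from the data source for a new epoch.
        self._iterator = iter(self._generator_fn())

    def reset(self):
        # Release the exhausted reader so that start() can be called again.
        self._iterator = None

    def read(self):
        # Consume one batch, or signal that the epoch is over.
        try:
            return next(self._iterator)
        except StopIteration:
            raise EndOfData() from None


def run_epochs(loader, num_epochs):
    seen = []
    for _ in range(num_epochs):
        loader.start()          # call start() before each epoch
        try:
            while True:
                seen.append(loader.read())
        except EndOfData:
            loader.reset()      # call reset() after the data runs out
    return seen


seen = run_epochs(ToyLoader(lambda: [1, 2, 3]), num_epochs=2)
```

The point of the pattern is that the loop body never receives the data explicitly; the end of an epoch is detected only through the exception, after which the loader has to be reset.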

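Parameters

  • feed_list (list(Variable)|tuple(Variable)) - List of feed variables, created by fluid.layers.data().
  • capacity (int) - Capacity of the queue maintained inside the DataLoader object, measured in batches. If the reader is fast, a larger capacity is recommended.
  • use_double_buffer (bool) - Whether to use double_buffer_reader. If use_double_buffer=True, the DataLoader asynchronously prefetches the next batch, which speeds up data reading at the cost of a small amount of CPU/GPU memory, namely the storage for one batch of input data.
  • iterable (bool) - Whether the created DataLoader object is iterable.
  • return_list (bool) - Whether the data on each device is returned as a list. Only takes effect when iterable = True. If return_list = False, the data returned on each device is a str -> LoDTensor map whose keys are the names of the input variables. If return_list = True, the data returned on each device is a list(LoDTensor). return_list = False is recommended in static graph mode, and return_list = True in dygraph mode.
  • use_multiprocess (bool) - Whether to use multiple processes to speed up data loading in dygraph mode. Note: this parameter only takes effect in dygraph mode; in static graph mode it has no effect either way. Default: False.
  • drop_last (bool) - Whether to drop the final batches when fewer batches remain than there are CPU/GPU devices. Default: True. During training, drop_last must not be set to False, because every CPU/GPU device must read data from the DataLoader. During inference, drop_last=False may be set so that the final batches, fewer than the number of devices, are still used for prediction.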
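The effect of drop_last across devices can be illustrated with a plain-Python sketch. `group_by_device` is a hypothetical helper, not part of Paddle; it assumes each step hands one batch to every device and shows what happens to the leftover batches.

```python
def group_by_device(batches, num_devices, drop_last=True):
    """Split a list of batches into steps of one batch per device."""
    steps = []
    for start in range(0, len(batches), num_devices):
        step = batches[start:start + num_devices]
        if len(step) < num_devices and drop_last:
            break  # not enough batches left to feed every device
        steps.append(step)
    return steps


batches = ["b0", "b1", "b2"]
kept = group_by_device(batches, num_devices=2, drop_last=True)
all_steps = group_by_device(batches, num_devices=2, drop_last=False)
```

With three batches and two devices, `kept` contains a single full step and the third batch is discarded, while `all_steps` ends with a step that feeds only one device.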

Returns

The created DataLoader object.

Return type

loader (DataLoader)

Code example 1

    import paddle.fluid as fluid
    import numpy as np

    BATCH_NUM = 10
    BATCH_SIZE = 16
    EPOCH_NUM = 4

    CLASS_NUM = 10

    ITERABLE = True  # whether the created DataLoader object is iterable
    USE_GPU = False  # whether to use GPU

    DATA_FORMAT = 'batch_generator'  # data format of the user-provided data source

    def simple_net(image, label):
        fc_tmp = fluid.layers.fc(image, size=CLASS_NUM)
        # Compute the loss on the fc output (the logits), not on the raw image.
        cross_entropy = fluid.layers.softmax_with_cross_entropy(fc_tmp, label)
        loss = fluid.layers.reduce_mean(cross_entropy)
        sgd = fluid.optimizer.SGD(learning_rate=1e-3)
        sgd.minimize(loss)
        return loss

    def get_random_images_and_labels(image_shape, label_shape):
        image = np.random.random(size=image_shape).astype('float32')
        # Draw integer class ids in [0, CLASS_NUM) instead of casting floats in [0, 1) to 0.
        label = np.random.randint(low=0, high=CLASS_NUM, size=label_shape).astype('int64')
        return image, label

    # If the data generator yields one sample each time,
    # use DataLoader.set_sample_generator to set the data source.
    def sample_generator_creator():
        def __reader__():
            for _ in range(BATCH_NUM * BATCH_SIZE):
                image, label = get_random_images_and_labels([784], [1])
                yield image, label

        return __reader__

    # If the data generator yields a list of samples each time,
    # use DataLoader.set_sample_list_generator to set the data source.
    def sample_list_generator_creator():
        def __reader__():
            for _ in range(BATCH_NUM):
                sample_list = []
                for _ in range(BATCH_SIZE):
                    image, label = get_random_images_and_labels([784], [1])
                    sample_list.append([image, label])

                yield sample_list

        return __reader__

    # If the data generator yields a batch each time,
    # use DataLoader.set_batch_generator to set the data source.
    def batch_generator_creator():
        def __reader__():
            for _ in range(BATCH_NUM):
                batch_image, batch_label = get_random_images_and_labels([BATCH_SIZE, 784], [BATCH_SIZE, 1])
                yield batch_image, batch_label

        return __reader__

    # If the DataLoader is iterable, use a for loop to train the network.
    def train_iterable(exe, prog, loss, loader):
        for _ in range(EPOCH_NUM):
            for data in loader():
                exe.run(prog, feed=data, fetch_list=[loss])

    # If the DataLoader is not iterable, use start() and reset() to control the process.
    def train_non_iterable(exe, prog, loss, loader):
        for _ in range(EPOCH_NUM):
            loader.start()  # call DataLoader.start() before each epoch starts
            try:
                while True:
                    exe.run(prog, fetch_list=[loss])
            except fluid.core.EOFException:
                loader.reset()  # call DataLoader.reset() after catching EOFException

    def set_data_source(loader, places):
        if DATA_FORMAT == 'sample_generator':
            loader.set_sample_generator(sample_generator_creator(), batch_size=BATCH_SIZE, drop_last=True, places=places)
        elif DATA_FORMAT == 'sample_list_generator':
            loader.set_sample_list_generator(sample_list_generator_creator(), places=places)
        elif DATA_FORMAT == 'batch_generator':
            loader.set_batch_generator(batch_generator_creator(), places=places)
        else:
            raise ValueError('Unsupported data format')

    image = fluid.layers.data(name='image', shape=[784], dtype='float32')
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')

    # Define DataLoader
    loader = fluid.io.DataLoader.from_generator(feed_list=[image, label], capacity=16, iterable=ITERABLE)

    # Define network
    loss = simple_net(image, label)

    # Set data source of DataLoader
    #
    # If the DataLoader is iterable, places must be given and the number of places
    # must equal the number of devices.
    #  - If you are using GPU, call `fluid.cuda_places()` to get all GPU places.
    #  - If you are using CPU, call `fluid.cpu_places()` to get all CPU places.
    #
    # If the DataLoader is not iterable, places can be None.
    places = fluid.cuda_places() if USE_GPU else fluid.cpu_places()
    set_data_source(loader, places)

    exe = fluid.Executor(places[0])
    exe.run(fluid.default_startup_program())

    prog = fluid.CompiledProgram(fluid.default_main_program()).with_data_parallel(loss_name=loss.name)

    if loader.iterable:
        train_iterable(exe, prog, loss, loader)
    else:
        train_non_iterable(exe, prog, loss, loader)

    # Users can use return_list = True in dygraph mode.
    with fluid.dygraph.guard(places[0]):
        loader = fluid.io.DataLoader.from_generator(capacity=2, return_list=True)
        set_data_source(loader, places[0])
        for image, label in loader():
            relu = fluid.layers.relu(image)
            assert image.shape == [BATCH_SIZE, 784]
            assert label.shape == [BATCH_SIZE, 1]
            assert relu.shape == [BATCH_SIZE, 784]

Code example 2

    import paddle.fluid as fluid
    import numpy as np
    import os

    # We use 2 CPU cores to run the inference network
    os.environ['CPU_NUM'] = '2'

    # The data source has only 3 batches, which cannot be
    # divided evenly among the CPU cores
    def batch_generator():
        for i in range(3):
            yield np.array([i+1]).astype('float32'),

    x = fluid.data(name='x', shape=[None], dtype='float32')
    y = x * x

    def run_inference(drop_last):
        loader = fluid.io.DataLoader.from_generator(feed_list=[x],
                capacity=8, drop_last=drop_last)
        loader.set_batch_generator(batch_generator, fluid.cpu_places())

        exe = fluid.Executor(fluid.CPUPlace())
        prog = fluid.CompiledProgram(fluid.default_main_program())
        prog = prog.with_data_parallel()

        result = []
        for data in loader():
            each_ret, = exe.run(prog, feed=data, fetch_list=[y])
            result.extend(each_ret)
        return result

    # Set drop_last to True, so that the last batches, fewer than
    # the number of CPU cores, are discarded.
    print(run_inference(drop_last=True))  # [1.0, 4.0]

    # Set drop_last to False, so that the last batches, fewer than
    # the number of CPU cores, are still run.
    print(run_inference(drop_last=False))  # [1.0, 4.0, 9.0]

from_dataset(dataset, places, drop_last=True)

Creates a DataLoader object that loads data produced by a Dataset. Currently, Dataset is only supported on Linux.

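Parameters

  • dataset (InMemoryDataset|QueueDataset) - The Dataset object.
  • places (list(CUDAPlace)|list(CPUPlace)) - The places on which the data returned by the DataLoader object resides.
  • drop_last (bool) - Whether to drop the last batch when it contains fewer samples than the batch size. It is dropped if drop_last = True and kept if drop_last = False.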
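For from_dataset, drop_last acts on the sample count of the last batch rather than on the number of devices. The arithmetic can be written out as a short sketch; `count_batches` is a hypothetical helper for illustration only, and the sample counts are made-up values.

```python
def count_batches(num_samples, batch_size, drop_last):
    """Number of batches produced from num_samples samples."""
    full, remainder = divmod(num_samples, batch_size)
    if remainder and not drop_last:
        return full + 1  # the smaller trailing batch is kept
    return full          # only full batches are produced


with_drop = count_batches(100, batch_size=32, drop_last=True)
without_drop = count_batches(100, batch_size=32, drop_last=False)
```

With 100 samples and a batch size of 32, three full batches are produced in either case; the four leftover samples form a fourth batch only when drop_last = False.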

Returns

The created DataLoader object, which can be iterated over with a for loop.

Return type

loader (DataLoader)

Code example

    import paddle.fluid as fluid

    image = fluid.layers.data(name='image', shape=[784], dtype='float32')
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')

    dataset = fluid.DatasetFactory().create_dataset("QueueDataset")
    dataset.set_batch_size(32)
    dataset.set_filelist(['a.txt', 'b.txt', 'c.txt'])
    dataset.set_use_var([image, label])
    dataset.set_pipe_command('cat')

    loader = fluid.io.DataLoader.from_dataset(dataset, fluid.cpu_places())