Dataset Specification

Datasets are a core part of machine learning: every model is built from one, so they need to be managed consistently. This section defines the standard format in which Pipcook should save a dataset after the data has been collected through the DataCollectType plugin. The DataAccessType layer assumes that the data already meets this specification.

Source datasets come in different formats; the DataCollectType plugin is used to smooth out those differences.

Image

Image datasets use the PascalVOC format. The directory layout is as follows:

  📂dataset
    📂annotations
      📂train
        📜...
        📜${image_name}.xml
      📂test
      📂validation
    📂images
      📜...
      📜${image_name}.jpg
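
As a rough illustration of how this layout can be navigated, the following Python sketch builds the path of an annotation directory and of the image that belongs to an annotation file. It assumes that `train`, `test`, and `validation` sit under `annotations`, that all images sit directly under `images`, and that an image shares its annotation's base name. The helper names `annotation_dir` and `image_path_for` are hypothetical and not part of Pipcook.

```python
from pathlib import Path

# Split names taken from the directory listing above.
SPLITS = ("train", "test", "validation")


def annotation_dir(root: Path, split: str) -> Path:
    """Return the directory holding the XML annotations of one split."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return root / "annotations" / split


def image_path_for(root: Path, annotation_file: Path) -> Path:
    """Return the image path matching an annotation file.

    Assumption: ${image_name}.xml pairs with images/${image_name}.jpg.
    """
    return root / "images" / (annotation_file.stem + ".jpg")
```

For example, under these assumptions `image_path_for(Path("dataset"), Path("dataset/annotations/train/cat01.xml"))` yields `dataset/images/cat01.jpg`.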

Each annotation file is an XML document with the following structure:

  <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
  <annotation>
    <folder>folder path</folder>
    <filename>image name</filename>
    <size>
      <width>width</width>
      <height>height</height>
    </size>
    <object>
      <name>category name</name>
      <bndbox> <!-- not necessary for image classification problems -->
        <xmin>left</xmin>
        <ymin>top</ymin>
        <xmax>right</xmax>
        <ymax>bottom</ymax>
      </bndbox>
    </object>
  </annotation>
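
To make the annotation structure concrete, here is a minimal Python sketch that reads one annotation with the standard library's `xml.etree.ElementTree`. It is not Pipcook's own loader; the function name `parse_annotation` and the shape of the returned dictionary are assumptions made for illustration. A missing `<bndbox>` is treated as the classification case and returned as `None`.

```python
import xml.etree.ElementTree as ET


def parse_annotation(xml_text: str) -> dict:
    """Parse one PascalVOC-style annotation into a plain dictionary."""
    root = ET.fromstring(xml_text)
    size = root.find("size")

    objects = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        bbox = None
        if box is not None:
            # Order follows the spec: left, top, right, bottom.
            bbox = tuple(
                int(float(box.findtext(tag)))
                for tag in ("xmin", "ymin", "xmax", "ymax")
            )
        objects.append({"name": obj.findtext("name"), "bbox": bbox})

    return {
        "folder": root.findtext("folder"),
        "filename": root.findtext("filename"),
        "width": int(size.findtext("width")),
        "height": int(size.findtext("height")),
        "objects": objects,
    }
```

Returning `None` for an absent bounding box lets the same parser serve both object detection and image classification data.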

Text

A text dataset should be a CSV file. The first column is the text content and the second column is the category name. The delimiter is a comma (,) and the file has no header row.

  name, category
  prod1, type1
  prod2, type2
  prod3, type2
  prod4, type1
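
A file in this format could be read as in the following Python sketch, which uses the standard `csv` module. The function name `read_text_dataset` is hypothetical. Because the specification says there is no header, every row is treated as data. Stripping the whitespace after each comma is an assumption based on the spacing in the sample rows.

```python
import csv
import io
from typing import List, Tuple


def read_text_dataset(csv_text: str) -> List[Tuple[str, str]]:
    """Read (text, category) pairs from header-less, comma-delimited CSV."""
    # skipinitialspace drops the blank that follows each comma.
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    rows = []
    for row in reader:
        if not row:
            continue  # ignore empty lines
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row!r}")
        text, category = row
        rows.append((text.strip(), category.strip()))
    return rows
```

Raising on rows that do not have exactly two columns surfaces malformed files early instead of silently dropping data.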