@@ -9,11 +9,24 @@ The template is [here](./datasheet-for-dataset-template.md).
 [Datasheets for datasets](https://arxiv.org/abs/1803.09010) were created to increase transparency
 of datasets.
 
+> [Datasheets for datasets] document [the dataset's] motivation, composition, collection process,
+> recommended uses, and so on. [They] have the potential to increase transparency and accountability
+> within the machine learning community, mitigate unwanted biases in machine learning systems, facilitate
+> greater reproducibility of machine learning results, and help researchers and practitioners select more
+> appropriate datasets for their chosen tasks.
+
 The problem it is trying to solve:
 
 > Despite the importance of data to machine learning, there is no standardized process for
 > documenting machine learning datasets. To address this gap, we propose _datasheets for datasets_.
 
+The datasheet is not a passive, after-the-fact document. Dataset creators are expected to read the
+questions in the _motivation_, _composition_, and _collection process_ sections **before** they start
+collecting data for the dataset. The questions in these sections raise considerations that, if not taken
+into account before data is gathered, cannot be easily rectified later. Similarly, the dataset creators
+are expected to read the questions in the _preprocessing/cleaning/labeling_ section before they preprocess
+the raw data.
+
 ## Why use a markdown file for the datasheet?
 
 The short explanation: using a markdown file allows us to easily compare (diff) one version