Commit a46e134

Improve description and use
1 parent 10f1ccd commit a46e134

1 file changed

Lines changed: 13 additions & 0 deletions

README.md

@@ -9,11 +9,24 @@ The template is [here](./datasheet-for-dataset-template.md).
 [Datasheets for datasets](https://arxiv.org/abs/1803.09010) were created to increase transparency
 of datasets.

+> [Datasheets for datasets] document [the dataset] motivation, composition, collection process,
+> recommended uses, and so on. [They] have the potential to increase transparency and accountability
+> within the machine learning community, mitigate unwanted biases in machine learning systems, facilitate
+> greater reproducibility of machine learning results, and help researchers and practitioners select more
+> appropriate datasets for their chosen tasks.
+
 The problem it is trying to solve:

 > Despite the importance of data to machine learning, there is no standardized process for
 > documenting machine learning datasets. To address this gap, we propose _datasheets for datasets_.

+The datasheet is not a passive, after-the-fact document. Dataset creators are expected to read the
+questions in the _motivation_, _composition_, and _collection process_ sections **before** they start
+collecting data for the dataset. The questions in these sections raise considerations that, if not taken
+into account before data is gathered, cannot be easily rectified later. Similarly, the dataset creators
+are expected to read the questions in the _preprocessing/cleaning/labeling_ section before they preprocess
+the raw data.
+
 ## Why use a markdown file for the datasheet?

 The short explanation: using a markdown file allows us to easily compare (diff) one version
