# Choice of API
Grain offers two different ways of defining data processing pipelines:
`DataLoader` and `Dataset`.

TL;DR: If you need to do one of the following:

- mix multiple data sources
- pack variable-length elements
- split dataset elements and globally shuffle the splits

then you should use `Dataset`; otherwise use the simpler `DataLoader`.
## DataLoader
`DataLoader` is a high-level API that uses the following abstractions to define
data processing:

- A `RandomAccessDataSource` that reads raw input data.
- A `Sampler` that defines the order in which the raw data should be read.
- A flat sequence of `Transformation`s to apply to the raw data.
You can specify other execution parameters for asynchronous data processing,
sharding, and shuffling, and `DataLoader` will automatically insert them in the
right places between the data processing steps.

These abstractions are simple and usually general enough to cover most data
processing use cases. Prefer `DataLoader` if your workflow can be described
using the abstractions above. See the tutorial for more details.
## Dataset
`Dataset` is a lower-level API that uses chaining syntax to define data
transformation steps. It allows more general types of processing (e.g. dataset
mixing) and more control over the execution (e.g. a different order of data
sharding and shuffling). `Dataset` transformations are composed in a way that
preserves the random access property past the source and some of the
transformations. This, among other things, can be used for debugging by
evaluating dataset elements at specific positions without processing the entire
dataset.
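The debugging benefit of preserved random access can be sketched with a toy lazily-evaluated dataset (illustrative only, not Grain's `MapDataset`):

```python
class ToyLazyDataset:
    """Toy dataset with random access and lazy evaluation: an element is
    only computed when its index is requested."""

    def __init__(self, getter, length: int):
        self._getter = getter
        self._length = length

    def __len__(self) -> int:
        return self._length

    def __getitem__(self, index: int):
        return self._getter(index)

    def map(self, fn):
        # Composing a transformation preserves random access: element i of
        # the mapped dataset is fn applied to element i of the parent.
        return ToyLazyDataset(lambda i: fn(self._getter(i)), self._length)


ds = ToyLazyDataset(lambda i: i, 1000).map(lambda x: x * x)
# Only element 5 is computed; the other 999 elements are never touched.
element = ds[5]  # 25
```

This is what makes it possible to inspect the element at a specific position in a pipeline without running the pipeline end to end.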
There are 3 main classes comprising the `Dataset` API:

- `MapDataset` defines a dataset that supports efficient random access. Think
  of it as a (possibly infinite) `Sequence` that computes values lazily.
- `IterDataset` defines a dataset that does not support efficient random
  access and only supports iterating over it. It's an `Iterable`. Any
  `MapDataset` can be turned into an `IterDataset` by calling
  `to_iter_dataset()`.
- `DatasetIterator` defines a stateful iterator of an `IterDataset`. The state
  of the iterator can be saved and restored.
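The save/restore behaviour of a stateful iterator can be sketched in plain Python. This is a toy illustration of the concept, not Grain's `DatasetIterator`:

```python
class ToyStatefulIterator:
    """Toy iterator whose state (the current position) can be saved as a
    plain dict and later restored to resume iteration from a checkpoint."""

    def __init__(self, elements):
        self._elements = list(elements)
        self._position = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._position >= len(self._elements):
            raise StopIteration
        element = self._elements[self._position]
        self._position += 1
        return element

    def get_state(self) -> dict:
        return {"position": self._position}

    def set_state(self, state: dict) -> None:
        self._position = state["position"]


it = ToyStatefulIterator([10, 20, 30])
first = next(it)        # 10
state = it.get_state()  # checkpoint after one element
second = next(it)       # 20
it.set_state(state)     # restore: iteration resumes from the checkpoint
resumed = next(it)      # 20 again
```

In a real pipeline this pattern is what allows a training job to checkpoint its input iterator and resume reading data from exactly where it left off.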
Most data pipelines will start with one or more `MapDataset`s (often derived
from a `RandomAccessDataSource`) and switch to `IterDataset` late or not at
all. See the tutorial for more details.