How can you ensure reproducibility?
By seeding a random number generator, either in NumPy or in PyTorch:
import numpy as np
import torch

rng = np.random.default_rng(seed=0)  # NumPy: create a seeded generator object
a = rng.uniform(size=(3,))
torch.random.manual_seed(0)  # PyTorch: seed the global RNG before drawing
a = torch.rand(size=(3,))
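As a quick check, a minimal sketch: generators created with the same seed produce identical draws.
rng_a = np.random.default_rng(seed=0)
rng_b = np.random.default_rng(seed=0)
assert np.allclose(rng_a.uniform(size=(3,)), rng_b.uniform(size=(3,)))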
How can you create a Dataset?
import numpy as np
from torch.utils.data import Dataset

class Simple1DRandomDataset(Dataset):
    def __init__(self, samples: np.ndarray):
        # Initialize the dataset with the given samples
        self.samples = samples

    def __getitem__(self, index):
        # Return one sample together with its index
        return self.samples[index], index

    def __len__(self):
        # Return the total number of samples in the dataset
        return len(self.samples)
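A minimal usage sketch (random 1D samples; the values are illustrative):
our_dataset = Simple1DRandomDataset(np.random.default_rng(0).uniform(size=(20,)))
sample, index = our_dataset[3]  # fourth sample plus its index
n = len(our_dataset)            # 20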
How can we load the data?
from torch.utils.data import DataLoader
our_dataloader = DataLoader(our_dataset, shuffle=True,
                            batch_size=4, num_workers=0)
The batch size is a hyperparameter we need to set.
num_workers enables loading data with multiple worker processes; values greater than 0 only work reliably if the script entry point is guarded by an if __name__ == "__main__": block (required on platforms that spawn worker processes), as sketched below.
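A minimal sketch of such a guard (num_workers=2 is an illustrative value):
if __name__ == "__main__":
    # Worker processes are only spawned inside the main guard
    our_dataloader = DataLoader(our_dataset, shuffle=True,
                                batch_size=4, num_workers=2)
    for batch in our_dataloader:
        pass  # process each minibatch here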
How can you split the data into training, validation, and test sets?
from torch.utils.data import Subset
n_samples = len(our_dataset)
shuffled_indices = rng.permutation(n_samples)
test_set_indices = shuffled_indices[:int(n_samples / 5)]
validation_set_indices = shuffled_indices[int(n_samples / 5):int(n_samples / 5) * 2]
training_set_indices = shuffled_indices[int(n_samples / 5) * 2:]
You need to create separate Subset instances and DataLoaders afterwards, as sketched below.
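A minimal sketch of that step, using Subset to wrap the dataset with each index set (loader settings are illustrative):
training_set = Subset(our_dataset, training_set_indices)
validation_set = Subset(our_dataset, validation_set_indices)
test_set = Subset(our_dataset, test_set_indices)
training_loader = DataLoader(training_set, shuffle=True, batch_size=4)
validation_loader = DataLoader(validation_set, shuffle=False, batch_size=4)
test_loader = DataLoader(test_set, shuffle=False, batch_size=4)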
How can you customize stacking, e.g. if you do not want to stack the samples into minibatch tensors?
With the argument collate_fn
training_loader = DataLoader(training_set, shuffle=False,
batch_size=4, collate_fn=no_stack_collate_fn)
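no_stack_collate_fn is not defined in the original; a minimal sketch of such a function, which simply returns the list of samples without stacking:
def no_stack_collate_fn(batch: list):
    # batch is a list of samples; return it as-is instead of
    # stacking the samples into minibatch tensors
    return batch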
What should the __getitem__() method typically return?
The corresponding sample, specified by the index
What is the argument collate_fn for?
It specifies how the samples are combined into minibatches
What is the purpose of an instance of torch.utils.data.DataLoader?
It creates minibatches from the samples returned by a torch.utils.data.Dataset instance