A Detailed Guide to TensorFlow's tf.data.Dataset.from_tensor_slices Function: Creating a Dataset from Tensors

March 30, 2023, 7:29 PM • tensorflow-function

The tf.data.Dataset.from_tensor_slices function in TensorFlow

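TensorFlow's tf.data.Dataset.from_tensor_slices function builds a dataset from data that is already in memory and yields it one entry at a time. This is useful for tasks such as image classification when the data is small enough to hold in memory all at once. The function accepts NumPy arrays, tensors, and tuples or dicts of them; data from a pandas DataFrame is commonly passed as a dict of columns. The input is sliced along its first dimension, and the result is a tf.data.Dataset (not a Python generator) whose elements are read in order and passed to TensorFlow in the structure you specified. Transformations added afterwards, such as map, run lazily as elements are consumed, so the preprocessed results do not have to be saved in memory or on disk, although the source data itself stays in memory. So if your data fits comfortably in memory, Dataset.from_tensor_slices may become your new favorite!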

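Below we cover the function in detail from the following angles:

How to call the function and its input parameters
The function returns a Dataset
How to use the resulting dataset
An example using from_tensor_slices: creating a dataset from NumPy arrays
Another example using from_tensor_slices: creating a dataset from a pandas DataFrame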

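How to call the function and its input parameters

tf.data.Dataset.from_tensor_slices(tensors)

The tensors argument can be one of the following:

A single tensor or NumPy array
A tuple or dict whose elements are one or more TensorFlow tensors or NumPy arrays

This means you can pass a single tensor without any problem, but you can also pass several tensors at once as a tuple and/or a dict. All components must have the same size in their first dimension, because that is the dimension along which the data is sliced.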
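As a minimal sketch of the three input forms described above, the following builds a dataset from a single array, from a tuple, and from a dict. The arrays and the names single, pairs, and named are made up for illustration; element_spec is used only to inspect the structure of each element.

```python
import numpy as np
import tensorflow as tf

# Illustrative data (an assumption, not from the article): 3 examples, 2 features.
features = np.arange(6, dtype=np.float32).reshape(3, 2)
labels = np.array([0, 1, 0])

# A single array: each element is one slice along the first dimension.
single = tf.data.Dataset.from_tensor_slices(features)

# A tuple of arrays: each element is a (feature, label) pair.
pairs = tf.data.Dataset.from_tensor_slices((features, labels))

# A dict of arrays: each element is a dict with the same keys.
named = tf.data.Dataset.from_tensor_slices({"x": features, "y": labels})

# element_spec describes the structure, shape, and dtype of one element.
print(single.element_spec)
print(pairs.element_spec)
print(named.element_spec)
```

The tuple form is the one Keras expects for (inputs, targets) pairs, while the dict form is convenient when features should be addressed by name.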

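The function returns a Dataset

Dataset.from_tensor_slices returns a tf.data.Dataset, which is an iterable rather than an iterator. Each element corresponds to one slice of the input along its first dimension, so an element has the shape of the input with the first dimension removed. A batch dimension appears only after you call batch(). For example, if the input tensor has shape num_examples x data_rows x data_cols x num_channels, each element has shape (data_rows, data_cols, num_channels), and after batch(batch_size) each element has shape (batch_size, data_rows, data_cols, num_channels), with a possibly smaller final batch.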
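To make the shapes concrete, here is a small sketch that inspects one element before and after batching. The array of zeros, its dimensions, and the batch size of 32 are assumptions chosen purely for illustration.

```python
import numpy as np
import tensorflow as tf

# Hypothetical input: 100 examples of 28x28 single-channel images.
images = np.zeros((100, 28, 28, 1), dtype=np.float32)
ds = tf.data.Dataset.from_tensor_slices(images)

# One element is one slice along the first dimension.
first = next(iter(ds))
print(first.shape)  # (28, 28, 1)

# The batch dimension is added by batch(), not by from_tensor_slices.
batched = ds.batch(32)
first_batch = next(iter(batched))
print(first_batch.shape)  # (32, 28, 28, 1)
```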

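How to use the resulting dataset

The resulting dataset can be passed directly to a Keras model's fit() method for training. One of the main advantages of the Dataset API is that it turns your data into a pipeline in which shuffling, batching, and preprocessing are applied on the fly as elements are consumed. Note that from_tensor_slices itself keeps the source data in memory; for data that does not fit in memory, the tf.data API offers file-based sources instead. Written this way, TensorFlow input code tends to be clearer and more direct, and the pipeline can be tuned for efficiency.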
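The following sketch shows both ways of consuming the dataset: iterating over it directly and passing it to fit(). The random toy data, the single Dense layer, and the optimizer and loss choices are illustrative assumptions, not part of the original article.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 64 examples with 4 features and a scalar target.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 4)).astype(np.float32)
y = x.sum(axis=1, keepdims=True)

# Build the pipeline: slice, shuffle, batch, and prefetch.
ds = (
    tf.data.Dataset.from_tensor_slices((x, y))
    .shuffle(buffer_size=64)
    .batch(16)
    .prefetch(tf.data.AUTOTUNE)
)

# In eager mode the dataset can be iterated like any Python iterable.
for batch_x, batch_y in ds.take(1):
    print(batch_x.shape, batch_y.shape)

# A deliberately tiny model (an assumption), just to show fit() accepting the dataset.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")
history = model.fit(ds, epochs=2, verbose=0)
print(history.history["loss"])
```

Because the dataset already yields batches, fit() is called without a batch_size argument.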

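Creating a dataset from NumPy arrays

We now show an example of creating a dataset from NumPy arrays. Datasets such as MNIST or CIFAR-10 are a good fit for this use case because they are small enough to hold in memory; the code below uses MNIST.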

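import numpy as np
import tensorflow as tf

# Load the data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Normalize pixel values to [0, 1] and cast to float32
x_train = (x_train / 255.0).astype(np.float32)
x_test = (x_test / 255.0).astype(np.float32)

# Create the datasets; each element is an (image, label) pair
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test))

# Shuffle and batch the dataset
train_ds = train_ds.shuffle(buffer_size=1024).batch(32)
test_ds = test_ds.batch(32)

# The original example assumes an existing `model`;
# this minimal classifier is an illustrative assumption
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Train the model using the dataset
model.fit(train_ds, validation_data=test_ds, epochs=10)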

Creating a dataset from a pandas DataFrame

In this example we show how to slice the data in a pandas DataFrame and turn it into a dataset that TensorFlow can consume.

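import tensorflow as tf
import pandas as pd

# Load the data (assumes all columns are numeric and the last column is the label)
df = pd.read_csv('my_data.csv')

# Split the data into training and testing splits.
# iloc is end-exclusive, so row 10000 lands only in the test split
# (label-based loc is end-inclusive and would put it in both).
train_df = df.iloc[:10000, :]
test_df = df.iloc[10000:, :]

# Convert the data to TensorFlow dataset format:
# features as a dict of columns, label as the last column
train_ds = tf.data.Dataset.from_tensor_slices((dict(train_df.iloc[:, :-1]), train_df.iloc[:, -1]))
test_ds = tf.data.Dataset.from_tensor_slices((dict(test_df.iloc[:, :-1]), test_df.iloc[:, -1]))

# Shuffle and batch the dataset
train_ds = train_ds.shuffle(buffer_size=1024).batch(32)
test_ds = test_ds.batch(32)

# Train the model using the dataset.
# `model` is assumed to be a compiled Keras model that accepts a dict of feature columns.
model.fit(train_ds, validation_data=test_ds, epochs=10)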

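Finally, it is worth noting that when using from_tensor_slices you should shuffle the data, split it into training and test sets, and then batch both to a suitable size before passing them to the model. Do the split before applying Dataset.shuffle, which by default reshuffles on every iteration, so that test examples never leak into training. These are steps that any machine learning application should go through.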
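One way to arrange these steps is sketched below, assuming the data is shuffled once in NumPy before the split. The random data, the 80/20 ratio, and all variable names are hypothetical choices for illustration.

```python
import numpy as np
import tensorflow as tf

# Hypothetical data: 1000 examples with 8 features and binary labels.
rng = np.random.default_rng(42)
x = rng.normal(size=(1000, 8)).astype(np.float32)
y = rng.integers(0, 2, size=1000)

# Shuffle the indices once in NumPy so the train/test split is fixed.
order = rng.permutation(len(x))
split = int(0.8 * len(x))
train_idx, test_idx = order[:split], order[split:]

# Training data: reshuffled each epoch by Dataset.shuffle, then batched.
train_ds = (
    tf.data.Dataset.from_tensor_slices((x[train_idx], y[train_idx]))
    .shuffle(buffer_size=len(train_idx))
    .batch(32)
)

# Test data: batched only, since order does not matter for evaluation.
test_ds = tf.data.Dataset.from_tensor_slices((x[test_idx], y[test_idx])).batch(32)
```

Setting the shuffle buffer to the full training set size gives a uniform shuffle, which is affordable here because the data already lives in memory.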
