PyTorch Study Notes (1): From NumPy to Tensor, Mastering the Basics of Data Handling


2025-08-09

DeepLearning

pytorch / tensor / numpy / data / preprocessing

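Hi, and welcome to my PyTorch study notes! 🚀

Before we dive headfirst into neural networks, model training, and other exciting topics, there is one essential but often overlooked foundation: data. At the core of every deep learning model are mathematical operations on data, and the container that holds that data is today's protagonist: the tensor.

If you have used Python's data science libraries, NumPy will be familiar. The good news is that PyTorch's Tensor closely resembles NumPy's ndarray, which makes the learning curve much gentler.

These notes cover:

How do Pandas, NumPy, and Tensor relate to each other?
What are scalars, vectors, matrices, and tensors?
How do you create and manipulate tensors in PyTorch?
Why is data preprocessing necessary?

Ready? Let's get started!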

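Core Data Structures: From NumPy to Tensor#

In data science, NumPy's ndarray is the workhorse of numerical computing. PyTorch borrowed the idea and created the Tensor, which you can think of as a NumPy ndarray that can run accelerated on a GPU.

Why Not Just Use NumPy?#

GPU acceleration: This is the biggest advantage of PyTorch tensors. Neural networks involve large amounts of matrix arithmetic, and running it on a GPU can speed up training by tens or even hundreds of times, depending on the model and the hardware.
Automatic differentiation: PyTorch has a built-in autograd engine (torch.autograd), which is central to training neural networks. A tensor created with requires_grad=True records the operations applied to it, so gradients can be computed automatically.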
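The two points above can be sketched in a few lines. This is a minimal illustration, assuming a toy function chosen only for the demo; the names `device`, `w`, and `loss` are hypothetical and not part of any particular model.

```python
import torch

# Use a GPU if PyTorch can see one; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# requires_grad=True asks autograd to record operations on this tensor.
w = torch.tensor([1.0, 2.0, 3.0], device=device, requires_grad=True)

# A toy scalar function: loss = w1^2 + w2^2 + w3^2
loss = (w ** 2).sum()

# Backpropagation fills w.grad with d(loss)/dw, which is 2 * w here.
loss.backward()

print(f"Device: {w.device}")
print(f"Gradient: {w.grad}")
```

The same code runs unchanged on CPU and GPU, which is one reason the `device` pattern is so common in PyTorch scripts.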

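Tensor vs. NumPy: When to Use Which?#

If Tensor is so powerful, why do we still need NumPy?

The answer lies in the ecosystem. NumPy is the cornerstone of scientific computing in Python, and countless libraries (such as pandas for data analysis, scikit-learn for machine learning, and OpenCV for image processing) are built on top of NumPy or integrate deeply with it.

A common and efficient workflow therefore looks like this:

Use tools such as Pandas or OpenCV to load the data and do the initial cleaning.
Convert the data to a NumPy ndarray for heavier numerical work and preprocessing (for example, feature engineering).
As the last step, convert the prepared NumPy array into a torch.Tensor, ready to be fed into a PyTorch model for training.

Let's see how to create a Tensor and how simple the conversion to and from NumPy is.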

import torch
import numpy as np

# Create a Tensor from a NumPy array (the two share the same memory)
numpy_array = np.array([1, 2, 3, 4])
torch_tensor = torch.from_numpy(numpy_array)
print(f"NumPy Array: {numpy_array}")
print(f"Torch Tensor: {torch_tensor}")

# Convert a (CPU) Tensor back to a NumPy array
new_numpy_array = torch_tensor.numpy()
print(f"Back to NumPy: {new_numpy_array}")

# Create a Tensor directly from a Python list
data = [[1, 2], [3, 4]]
tensor_from_data = torch.tensor(data)
print(f"Tensor from list:\n {tensor_from_data}")
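One detail worth seeing in action: `torch.from_numpy` shares memory with the source array, while `torch.tensor` copies the data. The sketch below contrasts the two; the variable names `shared` and `copied` are illustrative only.

```python
import numpy as np
import torch

arr = np.array([1, 2, 3, 4])

shared = torch.from_numpy(arr)  # shares memory with arr
copied = torch.tensor(arr)      # makes an independent copy

# Modify the NumPy array in place.
arr[0] = 100

print(f"shared: {shared}")  # reflects the change made through arr
print(f"copied: {copied}")  # still holds the original values
```

If you need a tensor that is safe from later changes to the array, prefer the copying form.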

Common Ways to Create a PyTorch Tensor#

Besides creating a Tensor from existing data, PyTorch offers many convenient functions for creating specific kinds of tensors:

import torch

# All-zeros tensor
zeros_tensor = torch.zeros(3, 4)
print(f"Zeros tensor (3x4):\n {zeros_tensor}")

# All-ones tensor
ones_tensor = torch.ones(2, 3)
print(f"Ones tensor (2x3):\n {ones_tensor}")

# Identity matrix
eye_tensor = torch.eye(3)
print(f"Identity matrix (3x3):\n {eye_tensor}")

# Random tensor (uniform distribution on [0, 1))
rand_tensor = torch.rand(2, 3)
print(f"Random tensor (uniform 0-1):\n {rand_tensor}")

# Random tensor from the standard normal distribution
randn_tensor = torch.randn(2, 3)
print(f"Random tensor (normal distribution):\n {randn_tensor}")

# Arithmetic sequence
arange_tensor = torch.arange(0, 10, 2)  # from 0 up to (but not including) 10, step 2
print(f"Arange tensor: {arange_tensor}")

# Evenly spaced values
linspace_tensor = torch.linspace(0, 1, 5)  # 5 evenly spaced numbers from 0 to 1, inclusive
print(f"Linspace tensor: {linspace_tensor}")

# Tensors with the same shape as an existing tensor
existing_tensor = torch.tensor([[1, 2], [3, 4]])
zeros_like = torch.zeros_like(existing_tensor)
ones_like = torch.ones_like(existing_tensor)
print(f"Zeros like existing tensor:\n {zeros_like}")
print(f"Ones like existing tensor:\n {ones_like}")
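The creation functions above also decide a tensor's data type, which matters later because many operations require floating-point input. The following sketch assumes PyTorch's default settings (default dtype `float32`); the variable names are illustrative.

```python
import torch

floats = torch.zeros(2, 2)                        # factory functions default to float32
ints = torch.tensor([[1, 2], [3, 4]])             # Python ints are inferred as int64
like = torch.zeros_like(ints)                     # *_like keeps both shape and dtype
explicit = torch.ones(2, 2, dtype=torch.float64)  # dtype can be set explicitly
converted = ints.to(torch.float32)                # convert an existing tensor

for name, t in [("floats", floats), ("ints", ints), ("like", like),
                ("explicit", explicit), ("converted", converted)]:
    print(f"{name}: dtype={t.dtype}, shape={tuple(t.shape)}")
```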

Common NumPy Operations#

Since we switch between NumPy and PyTorch so often, it also helps to know NumPy's common operations:

import numpy as np

# NumPy creation functions (very similar to PyTorch; note that shapes are passed as tuples)
np_zeros = np.zeros((3, 4))
np_ones = np.ones((2, 3))
np_eye = np.eye(3)
np_random = np.random.rand(2, 3)
np_randn = np.random.randn(2, 3)
np_arange = np.arange(0, 10, 2)
np_linspace = np.linspace(0, 1, 5)

print(f"NumPy zeros:\n {np_zeros}")
print(f"NumPy arange: {np_arange}")

# Frequently used array operations
arr = np.array([[1, 2, 3], [4, 5, 6]])

# Shape manipulation
print(f"Original shape: {arr.shape}")
reshaped = arr.reshape(3, 2)
print(f"Reshaped (3x2):\n {reshaped}")

# Flatten
flattened = arr.flatten()
print(f"Flattened: {flattened}")

# Transpose
transposed = arr.T
print(f"Transposed:\n {transposed}")

# Statistics
print(f"Sum of all elements: {arr.sum()}")
print(f"Sum along axis 0: {arr.sum(axis=0)}")  # one sum per column
print(f"Sum along axis 1: {arr.sum(axis=1)}")  # one sum per row
print(f"Mean: {arr.mean()}")
print(f"Max: {arr.max()}")
print(f"Min: {arr.min()}")

# Conditional operations
condition_result = arr > 3
print(f"Elements > 3:\n {condition_result}")
filtered = arr[arr > 3]
print(f"Filtered values > 3: {filtered}")
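For comparison, here is a rough PyTorch counterpart of the NumPy operations above. It is a sketch rather than an exhaustive mapping; the main differences it shows are that PyTorch names the axis argument `dim` and that `mean()` needs a floating-point tensor.

```python
import torch

t = torch.tensor([[1, 2, 3], [4, 5, 6]])

reshaped = t.reshape(3, 2)   # same call as NumPy
flattened = t.flatten()      # 1-D tensor with 6 elements
transposed = t.T             # shape (3, 2)

col_sums = t.sum(dim=0)      # one sum per column: [5, 7, 9]
row_sums = t.sum(dim=1)      # one sum per row: [6, 15]

# mean() is not defined for integer tensors, so convert first.
mean = t.float().mean()

filtered = t[t > 3]          # boolean-mask indexing works as in NumPy

print(col_sums, row_sums, mean, filtered)
```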

Tensor Basics: Scalars, Vectors, and Matrices#

"Tensor" may sound intimidating, but it is simply a general term for a multidimensional array. We can understand a tensor by its number of dimensions, also called its rank:

Scalar: a single number, such as 5. It is a 0-dimensional tensor.
Vector: a sequence of numbers, such as [1, 2, 3]. It is a 1-dimensional tensor.
Matrix: a two-dimensional grid of numbers, like a table. It is a 2-dimensional tensor.
Tensor: can have any number of dimensions. For example, a color image can be represented as a 3-dimensional tensor (height x width x color channels).

# 0D Tensor (Scalar)
scalar = torch.tensor(42)
print(f"Scalar: {scalar}")
print(f"Dimension: {scalar.ndim}")

# 1D Tensor (Vector)
vector = torch.tensor([1, 2, 3])
print(f"Vector: {vector}")
print(f"Dimension: {vector.ndim}")

# 2D Tensor (Matrix)
matrix = torch.tensor([[1, 2], [3, 4]])
print(f"Matrix:\n {matrix}")
print(f"Dimension: {matrix.ndim}")

# 3D Tensor
tensor_3d = torch.randn(2, 3, 4)  # a random tensor of shape 2x3x4
print(f"3D Tensor shape: {tensor_3d.shape}")
print(f"Dimension: {tensor_3d.ndim}")
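To make the color-image example concrete, the sketch below builds a stand-in image in (height, width, channels) layout and rearranges it. The 32x48 size and the variable names are assumptions for illustration. The rearrangement is shown because PyTorch's convolution layers such as `torch.nn.Conv2d` expect input in (batch, channels, height, width) order.

```python
import torch

# A stand-in for a 32x48 RGB image in (height, width, channels) layout,
# which is how libraries such as OpenCV typically return images.
image_hwc = torch.rand(32, 48, 3)

# Reorder the dimensions to (channels, height, width).
image_chw = image_hwc.permute(2, 0, 1)

# Add a leading batch dimension: (batch, channels, height, width).
batch = image_chw.unsqueeze(0)

print(f"HWC: {tuple(image_hwc.shape)}, ndim={image_hwc.ndim}")
print(f"CHW: {tuple(image_chw.shape)}, ndim={image_chw.ndim}")
print(f"Batch: {tuple(batch.shape)}, ndim={batch.ndim}")
```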

Working with Data: Basic Tensor Operations#

Once you can create tensors, the next step is operating on them. This too is very similar to NumPy.

Indexing & Slicing#

tensor = torch.tensor([[1, 2, 3], [4, 5, 6]])

# Get the first row
print(f"First row: {tensor[0]}")

# Get the first column
print(f"First column: {tensor[:, 0]}")

# Get the bottom-right element, 6
print(f"Element at (1, 2): {tensor[1, 2]}")
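A few more indexing patterns are worth knowing. The sketch below, with illustrative variable names, shows range slicing, negative indices, boolean masks, and the fact that basic slicing returns a view onto the same data rather than a copy.

```python
import torch

tensor = torch.tensor([[1, 2, 3], [4, 5, 6]])

last_two_cols = tensor[:, 1:]        # all rows, columns 1 and 2
last_row = tensor[-1]                # negative indices count from the end
greater_than_3 = tensor[tensor > 3]  # boolean mask selects matching elements

print(f"Last two columns:\n {last_two_cols}")
print(f"Last row: {last_row}")
print(f"Values > 3: {greater_than_3}")

# Basic slicing returns a view: writing through it changes the original.
first_row = tensor[0]
first_row[0] = 99
print(f"After writing through the view:\n {tensor}")
```

When an independent copy is needed, call `.clone()` on the slice.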

Arithmetic Operations#

Let's look at how these operations work mathematically. Suppose we have two 2×2 matrices:

$$X = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad Y = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}$$

x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)
y = torch.tensor([[5, 6], [7, 8]], dtype=torch.float32)

Element-wise Addition#

Mathematically, element-wise addition is defined as:

$$Z = X + Y = \begin{bmatrix} x_{11} + y_{11} & x_{12} + y_{12} \\ x_{21} + y_{21} & x_{22} + y_{22} \end{bmatrix}$$

For our example:

$$Z = \begin{bmatrix} 1+5 & 2+6 \\ 3+7 & 4+8 \end{bmatrix} = \begin{bmatrix} 6 & 8 \\ 10 & 12 \end{bmatrix}$$

# Element-wise addition (both forms are equivalent)
z1 = x + y
z2 = torch.add(x, y)
print(f"Addition result:\n {z1}")

Element-wise Multiplication (Hadamard Product)#

Element-wise multiplication (also called the Hadamard product) is defined as:

$$Z = X \odot Y = \begin{bmatrix} x_{11} \times y_{11} & x_{12} \times y_{12} \\ x_{21} \times y_{21} & x_{22} \times y_{22} \end{bmatrix}$$

For our example:

$$Z = \begin{bmatrix} 1 \times 5 & 2 \times 6 \\ 3 \times 7 & 4 \times 8 \end{bmatrix} = \begin{bmatrix} 5 & 12 \\ 21 & 32 \end{bmatrix}$$

# Element-wise multiplication (both forms are equivalent)
z3 = x * y
z4 = torch.multiply(x, y)
print(f"Multiplication result:\n {z3}")

Matrix Multiplication#

Matrix multiplication is defined as follows: for two matrices $A_{m \times n}$ and $B_{n \times p}$, the product $C = AB$ is an $m \times p$ matrix where:

$$c_{ij} = \sum_{k=1}^{n} a_{ik} \times b_{kj}$$

For our 2×2 example:

$$Z = XY = \begin{bmatrix} \sum_{k} x_{1k} y_{k1} & \sum_{k} x_{1k} y_{k2} \\ \sum_{k} x_{2k} y_{k1} & \sum_{k} x_{2k} y_{k2} \end{bmatrix}$$

Worked out:

$$Z = \begin{bmatrix} 1 \times 5 + 2 \times 7 & 1 \times 6 + 2 \times 8 \\ 3 \times 5 + 4 \times 7 & 3 \times 6 + 4 \times 8 \end{bmatrix} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}$$

# Matrix multiplication (both forms are equivalent)
z5 = x.matmul(y)
z6 = torch.matmul(x, y)
print(f"Matrix multiplication result:\n {z5}")

Important: matrix multiplication requires the number of columns of the first matrix to equal the number of rows of the second. In the example above both matrices are 2×2, so they can be multiplied, and the result is also 2×2.
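The shape rule is easier to remember with non-square matrices. The following sketch uses the `@` operator, which is shorthand for `torch.matmul`, and shows what happens when the inner dimensions do not match. The names `a`, `b`, and `mismatch_raised` are illustrative.

```python
import torch

a = torch.ones(2, 3)  # 2 rows, 3 columns
b = torch.ones(3, 4)  # 3 rows, 4 columns

# Inner dimensions match (3 == 3), so the result has shape (2, 4).
c = a @ b
print(f"(2, 3) @ (3, 4) -> {tuple(c.shape)}")

# Inner dimensions differ (3 != 2), so PyTorch raises a RuntimeError.
try:
    a @ a
    mismatch_raised = False
except RuntimeError as err:
    mismatch_raised = True
    print(f"Shape mismatch: {err}")
```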

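Broadcasting#

So far, every operation we have discussed was between tensors of the same shape. In practice we often need to combine tensors of different shapes. This is where broadcasting comes in.

Broadcasting lets PyTorch (and NumPy) automatically expand the smaller tensor so it can take part in element-wise operations with the larger one, without explicitly copying data.

Basic Broadcasting Rules#

Broadcasting follows these rules:

Compare the two shapes dimension by dimension, starting from the last dimension.
Two dimensions are compatible if they are equal or if one of them is 1; otherwise the operation fails with an error.
If one tensor has fewer dimensions, its missing leading dimensions are treated as size 1.

Let's work through some examples:

# Broadcasting a scalar with a tensor
tensor = torch.tensor([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
scalar = 10

# The scalar is broadcast to the same shape as the tensor
result = tensor + scalar
print(f"Original tensor:\n {tensor}")
print(f"Result (tensor + scalar):\n {result}")
# Equivalent to tensor + [[10, 10, 10], [10, 10, 10]]

# Broadcasting a vector with a matrix
matrix = torch.tensor([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
vector = torch.tensor([10, 20, 30])            # shape (3,)

# vector is broadcast to shape (2, 3)
result = matrix + vector
print(f"Matrix:\n {matrix}")
print(f"Vector: {vector}")
print(f"Result (matrix + vector):\n {result}")
# Equivalent to matrix + [[10, 20, 30], [10, 20, 30]]

# A more involved broadcasting example
a = torch.tensor([[1], [2], [3]])  # shape (3, 1)
b = torch.tensor([10, 20])         # shape (2,)

# The broadcast result has shape (3, 2)
result = a + b
print(f"Tensor a (3, 1):\n {a}")
print(f"Tensor b (2,): {b}")
print(f"Result shape: {result.shape}")
print(f"Result:\n {result}")
# a is broadcast to [[1, 1], [2, 2], [3, 3]]
# b is broadcast to [[10, 20], [10, 20], [10, 20]]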
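The broadcasting rules can also be checked without creating any data. As a small sketch, `torch.broadcast_shapes` (available in PyTorch 1.8 and later) returns the resulting shape for compatible inputs and raises an error otherwise; the variable names here are illustrative.

```python
import torch

# (2, 3) with (3,): the missing leading dimension is treated as 1.
shape_a = torch.broadcast_shapes((2, 3), (3,))
print(f"(2, 3) and (3,)   -> {tuple(shape_a)}")

# (3, 1) with (2,): the 1 stretches to 2, and (2,) gains a leading 3.
shape_b = torch.broadcast_shapes((3, 1), (2,))
print(f"(3, 1) and (2,)   -> {tuple(shape_b)}")

# (2, 3) with (2,): the last dimensions are 3 and 2, neither is 1, so it fails.
try:
    torch.broadcast_shapes((2, 3), (2,))
    incompatible = False
except RuntimeError as err:
    incompatible = True
    print(f"(2, 3) and (2,)   -> error: {err}")
```

The last case is a common source of bugs: a vector of length 2 cannot be added to each column of a (2, 3) matrix unless it is first reshaped to (2, 1).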

Broadcasting in Practice#

Broadcasting is very useful in data preprocessing, for example:

# Standardization (z-score normalization)
data = torch.randn(100, 5)  # 100 samples, 5 features each

# Compute the mean and standard deviation of each feature
mean = data.mean(dim=0, keepdim=True)  # shape (1, 5)
std = data.std(dim=0, keepdim=True)    # shape (1, 5)

# Standardize via broadcasting: (100, 5) with (1, 5)
normalized_data = (data - mean) / std
print(f"Original data shape: {data.shape}")
print(f"Mean shape: {mean.shape}")
print(f"Normalized data shape: {normalized_data.shape}")
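A quick way to confirm that the standardization did what we expect is to check that every feature ends up with mean close to 0 and standard deviation close to 1. The sketch below does that on synthetic data; the seed, the shift of 10, and the scale of 3 are arbitrary assumptions for the demo.

```python
import torch

torch.manual_seed(0)  # fixed seed so the run is repeatable

# Synthetic features deliberately shifted and scaled away from N(0, 1).
data = torch.randn(100, 5) * 3.0 + 10.0

mean = data.mean(dim=0, keepdim=True)  # shape (1, 5)
std = data.std(dim=0, keepdim=True)    # shape (1, 5)
normalized = (data - mean) / std       # broadcast over the 100 rows

col_mean = normalized.mean(dim=0)
col_std = normalized.std(dim=0)

print(f"Per-feature mean after standardization: {col_mean}")
print(f"Per-feature std after standardization:  {col_std}")
```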

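Advantages of broadcasting:

Memory efficiency: the smaller tensor does not have to be physically copied.
Computational efficiency: it takes full advantage of optimized, vectorized low-level routines.
Concise code: mathematical expressions read more naturally.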
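The memory-efficiency point can be illustrated with `Tensor.expand`, which produces a broadcast-style view explicitly. This is a sketch meant to build intuition, not a description of how every broadcast operation is implemented internally; the variable names are illustrative.

```python
import torch

vector = torch.tensor([10, 20, 30])  # shape (3,)

# expand() returns a view with the requested shape; no data is copied.
expanded = vector.expand(2, 3)

# A stride of 0 along the first dimension means both "rows"
# read the same three numbers in memory.
print(f"Expanded shape:  {tuple(expanded.shape)}")
print(f"Expanded stride: {expanded.stride()}")

# Both tensors point at the same underlying storage.
same_memory = expanded.data_ptr() == vector.data_ptr()
print(f"Shares memory with the original: {same_memory}")
```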

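Getting Started with Data Preprocessing#

In real-world projects, raw data is rarely clean enough to feed straight into a model. It is usually full of missing values, non-numeric text, and assorted formatting problems. We need to preprocess it: clean this dirty data and convert it into the purely numeric form a model can work with.

pandas is the most powerful tool for this stage. Let's look at a more realistic example.

Simulating a Real Scenario#

Suppose we have a housing.csv file with the following contents:

SquareFeet,Bedrooms,City,Price
1500,3,Taipei,5000000
2000,,New York,8000000
,2,Tokyo,6500000
1800,3,NaN,5800000
2200,4,Taipei,7200000

This data has several problems:

The Bedrooms column has a missing value (second data row).
The SquareFeet column has a missing value (third data row).
The City column is text and also has a missing value (NaN, fourth data row).

Our goal is to turn this data into a purely numeric PyTorch Tensor.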

Step 1: Load and Clean with Pandas#

First, we read the data with pandas and deal with the most obvious problems.

import pandas as pd
import numpy as np
import torch

# Simulate reading a CSV file
from io import StringIO

csv_data = """SquareFeet,Bedrooms,City,Price
1500,3,Taipei,5000000
2000,,New York,8000000
,2,Tokyo,6500000
1800,3,NaN,5800000
2200,4,Taipei,7200000
"""

# read_csv treats both empty fields and the string "NaN" as missing values
df = pd.read_csv(StringIO(csv_data))
print("Original DataFrame:")
print(df)
print("\nData info:")
df.info()

The output of df.info() shows that SquareFeet, Bedrooms, and City each have only 4 non-null entries out of 5, so all three contain a missing value.
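If you only want the number of missing values per column, `df.isna().sum()` gives a more direct answer than reading the `df.info()` report. The sketch below rebuilds the same sample data so it can run on its own; `missing_per_column` is an illustrative name.

```python
from io import StringIO

import pandas as pd

csv_data = """SquareFeet,Bedrooms,City,Price
1500,3,Taipei,5000000
2000,,New York,8000000
,2,Tokyo,6500000
1800,3,,5800000
2200,4,Taipei,7200000
"""

df = pd.read_csv(StringIO(csv_data))

# isna() marks each missing cell as True; sum() counts them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```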

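Step 2: Fill Missing Numeric Values#

For missing numeric values, a common strategy is to fill them with the column's mean or median. Here we use the mean.

# Fill the missing value in SquareFeet.
# Assigning the result back avoids the chained in-place fillna,
# which is deprecated in recent pandas versions.
mean_sqft = df['SquareFeet'].mean()
df['SquareFeet'] = df['SquareFeet'].fillna(mean_sqft)

# Fill the missing value in Bedrooms
mean_bedrooms = df['Bedrooms'].mean()
df['Bedrooms'] = df['Bedrooms'].fillna(mean_bedrooms)

print("\nDataFrame after filling numeric missing values:")
print(df)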
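The median mentioned above is a drop-in alternative that is less sensitive to extreme values. A minimal sketch, using a small hand-built DataFrame with the same numeric columns as the example (the construction and variable names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "SquareFeet": [1500, 2000, np.nan, 1800, 2200],
    "Bedrooms": [3, np.nan, 2, 3, 4],
})

# median() skips missing values by default and returns one value per column.
medians = df.median()

# Passing a Series to fillna fills each column with its matching value.
filled = df.fillna(medians)

print(f"Medians:\n{medians}")
print(f"Filled DataFrame:\n{filled}")
```

Which statistic to use depends on the data: the mean is pulled toward outliers, while the median is not.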

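Step 3: Handle Categorical Features (One-Hot Encoding)#

A model cannot understand text such as "Taipei" or "New York", so we need to convert it into numbers. A common approach for unordered categories like these is one-hot encoding.

It creates a new column for each city. If a row belongs to that city, the column's value is 1; otherwise it is 0.

# First fill the missing categorical value, for example with 'Unknown'
df['City'] = df['City'].fillna('Unknown')

# One-hot encode with pandas get_dummies.
# dtype=int yields 1/0; recent pandas versions return True/False by default.
city_dummies = pd.get_dummies(df['City'], prefix='City', dtype=int)
print("\nOne-Hot Encoding result:")
print(city_dummies)

# Add the one-hot columns back to the DataFrame and drop the original City column
df = pd.concat([df, city_dummies], axis=1)
df.drop('City', axis=1, inplace=True)

print("\nDataFrame after processing all features:")
print(df)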

Step 4: Convert to a PyTorch Tensor#

Our DataFrame is now clean: purely numeric with no missing values. The last step is converting it into PyTorch tensors. We usually separate the features (X) from the target (y, the Price we want to predict).

# Separate features (X) and target (y)
features = df.drop('Price', axis=1)
target = df['Price']

# Convert the pandas objects to float32 NumPy arrays
X_np = features.to_numpy(dtype=np.float32)
y_np = target.to_numpy(dtype=np.float32)

# Convert the NumPy arrays to PyTorch tensors
X_tensor = torch.from_numpy(X_np)
y_tensor = torch.from_numpy(y_np)

# Reshape y from (N,) to (N, 1) so it matches the shape a model typically outputs
y_tensor = y_tensor.view(-1, 1)

print("\nFinal feature Tensor (X):")
print(X_tensor)
print(f"Shape: {X_tensor.shape}")

print("\nFinal target Tensor (y):")
print(y_tensor)
print(f"Shape: {y_tensor.shape}")
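Putting the four steps together, the whole pipeline can be wrapped in one function. The following is a sketch under the same assumptions as the walkthrough (mean imputation, an 'Unknown' category, one-hot encoding); the function name `preprocess` and the constant `CSV_DATA` are hypothetical names introduced here.

```python
from io import StringIO

import numpy as np
import pandas as pd
import torch

CSV_DATA = """SquareFeet,Bedrooms,City,Price
1500,3,Taipei,5000000
2000,,New York,8000000
,2,Tokyo,6500000
1800,3,,5800000
2200,4,Taipei,7200000
"""


def preprocess(df):
    """Return (X, y) float32 tensors built from the housing DataFrame."""
    df = df.copy()  # leave the caller's DataFrame untouched

    # Step 2: fill numeric gaps with each column's mean.
    for col in ["SquareFeet", "Bedrooms"]:
        df[col] = df[col].fillna(df[col].mean())

    # Step 3: fill the categorical gap, then one-hot encode City.
    df["City"] = df["City"].fillna("Unknown")
    df = pd.get_dummies(df, columns=["City"], prefix="City", dtype=np.float32)

    # Step 4: split features and target, then convert to tensors.
    X = torch.from_numpy(df.drop(columns="Price").to_numpy(dtype=np.float32))
    y = torch.from_numpy(df["Price"].to_numpy(dtype=np.float32)).view(-1, 1)
    return X, y


X, y = preprocess(pd.read_csv(StringIO(CSV_DATA)))
print(f"X shape: {tuple(X.shape)}, dtype: {X.dtype}")
print(f"y shape: {tuple(y.shape)}, dtype: {y.dtype}")
```

Here X has six columns: SquareFeet, Bedrooms, and one column for each of the four City values (including 'Unknown').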