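For simplicity, only matrix addition and broadcasting were tested; other operations were not.
Conclusions so far:
- Converting NumPy arrays to PyTorch tensors speeds up the addition (0.22 s -> 0.12 s).
- Loading the tensors onto the GPU speeds things up much further (0.45 s -> 0.0005 s for the broadcast addition, with the data already on the GPU), but the copy time between host memory and GPU memory is not negligible: with the copies included, the GPU runs took about 0.44 s.
The conclusions held in both environments tested:
- Windows 10, 64-bit
- Ubuntu 18.04, 64-bit
However, a colleague who tried this under the Windows Subsystem for Linux on Windows 10 reports that converting NumPy arrays to PyTorch tensors was actually slower than plain NumPy. We suspect the subsystem's implementation is the cause.
The verification steps follow.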
import time
import numpy as np
import torch
print(torch.__version__)
1.4.0
def check_time(func, run_times=10):
    # Call func run_times times and print the average wall-clock time per call.
    t = time.time()
    for _ in range(run_times):
        func()
    print('avg time = %s sec' % ((time.time()-t)/run_times))
shape = (5000,5000)
# np.float was an alias for the builtin float (float64) and was removed in NumPy 1.24.
a = np.ones(shape, dtype=np.float64)
b = np.ones(shape, dtype=np.float64)
k = np.ones((shape[0],1), dtype=np.float64)
# - simple numpy ndarray addition
def test_np_1():
    c = a+b
    return c
check_time(test_np_1)
avg time = 0.21692438125610353 sec
# - simple numpy ndarray addition with broadcast
def test_np_2():
    c = a+b+k
    return c
check_time(test_np_2)
avg time = 0.45278918743133545 sec
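The broadcast in `a+b+k` stretches the `(5000, 1)` column `k` across every column of the `(5000, 5000)` sum. A minimal sketch on a much smaller array shows what that means; the names `small_a` and `small_k` are illustrative and not part of the original benchmark.

```python
import numpy as np

# Small stand-ins for the 5000x5000 arrays used in the benchmark (illustrative only).
small_a = np.ones((3, 4), dtype=np.float64)
small_k = np.arange(3, dtype=np.float64).reshape(3, 1)  # column vector, shape (3, 1)

# Broadcasting repeats the single column of small_k across all 4 columns.
result = small_a + small_k

# The same result built explicitly, without relying on broadcasting.
explicit = small_a + np.tile(small_k, (1, 4))

print(result.shape)
print(np.array_equal(result, explicit))
```

Broadcasting avoids materializing the tiled copy of `k`, so the extra cost of `test_np_2` over `test_np_1` comes from the second addition rather than from building a full-size copy of `k`.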
# - use pytorch tensors on the cpu
def test_torch_1():
    ta = torch.from_numpy(a)
    tb = torch.from_numpy(b)
    tc = ta+tb
    c = tc.numpy()
    return c
check_time(test_torch_1)
avg time = 0.11778402328491211 sec
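The timed function above includes the NumPy-to-tensor conversion and the conversion back. Both should cost very little, because `torch.from_numpy` and `Tensor.numpy()` (for CPU tensors) share memory with the array instead of copying it. A small sketch to check this, using made-up names (`arr`, `t`, `out`):

```python
import numpy as np
import torch

arr = np.ones((2, 3), dtype=np.float64)
t = torch.from_numpy(arr)       # no copy: the tensor wraps the array's buffer

t[0, 0] = 5.0                   # write through the tensor...
shares_input = bool(arr[0, 0] == 5.0)  # ...and the NumPy array sees the change

out = (t + t).numpy()           # .numpy() on a CPU tensor is also zero-copy
print(shares_input, t.dtype, out.dtype)
```

If this holds, the 0.12 s measured for `test_torch_1` is probably dominated by the addition itself rather than by the conversions.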
# - use pytorch tensors on the cpu with broadcast
def test_torch_2():
    ta = torch.from_numpy(a)
    tb = torch.from_numpy(b)
    tk = torch.from_numpy(k)
    tc = ta+tb+tk
    c = tc.numpy()
    return c
check_time(test_torch_2)
avg time = 0.2651021957397461 sec
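One plausible explanation for the CPU speedup, which the original experiment does not verify, is that PyTorch splits large element-wise operations across CPU threads while NumPy's addition runs on a single thread. The sketch below is one way to probe that assumption by timing the same addition with the default thread count and with a single thread. The helper `avg_add_time` and the smaller 2000x2000 shape are illustrative choices.

```python
import time
import torch

def avg_add_time(x, y, run_times=5):
    # Average seconds per x + y on the CPU.
    start = time.perf_counter()
    for _ in range(run_times):
        x + y
    return (time.perf_counter() - start) / run_times

x = torch.ones((2000, 2000), dtype=torch.float64)
y = torch.ones((2000, 2000), dtype=torch.float64)

default_threads = torch.get_num_threads()
multi = avg_add_time(x, y)

torch.set_num_threads(1)                # force single-threaded intra-op execution
single = avg_add_time(x, y)
torch.set_num_threads(default_threads)  # restore the original setting

print('threads=%d: %s sec, threads=1: %s sec' % (default_threads, multi, single))
```

If the single-threaded time lands close to the NumPy time, threading is a likely source of the difference. If not, the cause lies elsewhere.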
# - check gpu
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
ga = torch.from_numpy(a).float().to(device)
gb = torch.from_numpy(b).float().to(device)
gk = torch.from_numpy(k).float().to(device)
cuda:0
# - tensors on the gpu, including the copies to the gpu and back
def test_torch_cuda_1():
    ca = torch.from_numpy(a).float().to(device)
    cb = torch.from_numpy(b).float().to(device)
    cc = ca+cb
    c = cc.cpu().numpy()
    return c
check_time(test_torch_cuda_1)
avg time = 0.44039239883422854 sec
# - tensors on the gpu with broadcast, including the copies to the gpu and back
def test_torch_cuda_2():
    ca = torch.from_numpy(a).float().to(device)
    cb = torch.from_numpy(b).float().to(device)
    ck = torch.from_numpy(k).float().to(device)
    cc = ca+cb+ck
    c = cc.cpu().numpy()
    return c
check_time(test_torch_cuda_2)
avg time = 0.4477779150009155 sec
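The two GPU tests above time the float32 conversion, the host-to-device copy, the addition, and the copy back to the host as a single number. To see where the time goes, each stage can be timed on its own. This is a rough sketch under assumptions: the `sync` helper, the `timings` dict, and the smaller 1000x1000 shape are all illustrative, and it falls back to the CPU when no GPU is present.

```python
import time
import numpy as np
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def sync():
    # Only needed on CUDA, where work is queued asynchronously.
    if device.type == "cuda":
        torch.cuda.synchronize()

host = np.ones((1000, 1000), dtype=np.float64)  # smaller than the benchmark's 5000x5000
timings = {}

sync()
t0 = time.perf_counter()
dev = torch.from_numpy(host).float().to(device)  # convert to float32, copy host -> device
sync()
timings["to_device"] = time.perf_counter() - t0

t0 = time.perf_counter()
res = dev + dev
sync()
timings["add"] = time.perf_counter() - t0

t0 = time.perf_counter()
back = res.cpu().numpy()                         # copy device -> host
timings["to_host"] = time.perf_counter() - t0

for stage, sec in timings.items():
    print('%s = %s sec' % (stage, sec))
```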
# - tensors on the gpu with broadcast, preloaded onto the gpu before the call and not copied back to the cpu afterwards
def test_torch_cuda_3():
    cc = ga+gb+gk
    return cc
check_time(test_torch_cuda_3)
avg time = 0.0004986286163330078 sec
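Two caveats apply to the 0.0005 s figure. First, CUDA kernels are launched asynchronously, so a timer that stops without waiting for the device may mostly measure the launch overhead rather than the completed addition. Second, the GPU tensors here are float32 (because of `.float()`), while the CPU tests use float64. A hedged sketch of a timing helper that waits for the GPU before reading the clock follows; `timed_gpu_add` and the 1000x1000 shape are illustrative, and the code falls back to the CPU when CUDA is unavailable.

```python
import time
import torch

def timed_gpu_add(x, y, k, run_times=10):
    """Average seconds per x + y + k, waiting for the device to finish."""
    on_cuda = x.is_cuda
    if on_cuda:
        torch.cuda.synchronize()  # drain pending work before starting the clock
    start = time.perf_counter()
    for _ in range(run_times):
        out = x + y + k
    if on_cuda:
        torch.cuda.synchronize()  # wait for the queued kernels to complete
    return (time.perf_counter() - start) / run_times, out

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
x = torch.ones((1000, 1000), device=device)
y = torch.ones((1000, 1000), device=device)
k = torch.ones((1000, 1), device=device)

avg, out = timed_gpu_add(x, y, k)
print('avg time = %s sec' % avg)
```

Rerunning the preloaded test with a helper like this would show how much of the measured speedup survives once the device has actually finished the work.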