Clone the MLPerf training results repository locally
This walkthrough uses training_results_v0.6 rather than the reference implementations in the mlperf/training repository. Note that the reference implementations are intended as starting points for benchmark implementations: they are not fully optimized and are not meant for "real" performance evaluation of software frameworks or hardware.
git clone https://github.com/Caiyishuai/training_results_v0.6
The repository contains one directory per vendor submission (Google, Intel, NVIDIA, etc.), each holding the code and scripts used to produce that vendor's results. Here we run the benchmark on NVIDIA GPUs.
[root@2 ~]# cd training_results_v0.6/
[root@2 training_results_v0.6]# ls
Alibaba  CONTRIBUTING.md  Fujitsu  Google  Intel  LICENSE  NVIDIA  README.md
[root@2 training_results_v0.6]# cd NVIDIA/; ls
benchmarks  LICENSE.md  README.md  results  systems
[root@2 NVIDIA]# cd benchmarks/; ls
gnmt  maskrcnn  minigo  resnet  ssd  transformer
Download and verify the dataset
[root@2 implementations]# pwd
/data/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations
[root@2 implementations]# ls
data  download_dataset2.sh  download_dataset3.sh  download_dataset.sh  pytorch  verify_dataset.sh  wget-log
[root@2 implementations]# bash download_dataset.sh
Inspect download_dataset.sh to see the exact download URLs. If your network is slow, you can copy the links into another downloader, fetch the files separately, and then modify download_dataset.sh accordingly.
[root@2 implementations]# cat download_dataset.sh
#! /usr/bin/env bash
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -e

export LANG=C.UTF-8
export LC_ALL=C.UTF-8

OUTPUT_DIR=${1:-"data"}
echo "Writing to ${OUTPUT_DIR}. To change this, set the OUTPUT_DIR environment variable."

OUTPUT_DIR_DATA="${OUTPUT_DIR}/data"

mkdir -p $OUTPUT_DIR_DATA

echo "Downloading Europarl v7. This may take a while..."
wget -nc -nv -O ${OUTPUT_DIR_DATA}/europarl-v7-de-en.tgz \
  http://www.statmt.org/europarl/v7/de-en.tgz

echo "Downloading Common Crawl corpus. This may take a while..."
wget -nc -nv -O ${OUTPUT_DIR_DATA}/common-crawl.tgz \
  http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz

echo "Downloading News Commentary v11. This may take a while..."
wget -nc -nv -O ${OUTPUT_DIR_DATA}/nc-v11.tgz \
  http://data.statmt.org/wmt16/translation-task/training-parallel-nc-v11.tgz

echo "Downloading dev/test sets"
wget -nc -nv -O ${OUTPUT_DIR_DATA}/dev.tgz \
  http://data.statmt.org/wmt16/translation-task/dev.tgz
wget -nc -nv -O ${OUTPUT_DIR_DATA}/test.tgz \
  http://data.statmt.org/wmt16/translation-task/test.tgz

………………

done
echo "All done."
If the files have already been downloaded into this directory by other means, the wget commands above can be replaced with mv commands:
echo "Downloading Europarl v7. This may take a while..."
mv -i data/de-en.tgz ${OUTPUT_DIR_DATA}/europarl-v7-de-en.tgz

echo "Downloading Common Crawl corpus. This may take a while..."
mv -i data/training-parallel-commoncrawl.tgz ${OUTPUT_DIR_DATA}/common-crawl.tgz

echo "Downloading News Commentary v11. This may take a while..."
mv -i data/training-parallel-nc-v11.tgz ${OUTPUT_DIR_DATA}/nc-v11.tgz

echo "Downloading dev/test sets"
mv -i data/dev.tgz ${OUTPUT_DIR_DATA}/dev.tgz
mv -i data/test.tgz ${OUTPUT_DIR_DATA}/test.tgz

(Note: the trailing backslashes left over from the original wget continuation lines must be removed; mv takes its arguments on one line.)
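A slightly safer variant of the edit above guards each mv so a missing or truncated download fails loudly instead of silently producing an empty archive. This is a minimal sketch; `move_archive` is a hypothetical helper, not part of the repository's scripts:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo scripts): move a pre-downloaded
# archive into place only if it exists and is non-empty; never overwrite.
move_archive() {
  local src=$1 dst=$2
  if [ ! -s "$src" ]; then
    echo "missing or empty: $src" >&2
    return 1
  fi
  mkdir -p "$(dirname "$dst")"
  mv -n "$src" "$dst"
  echo "placed $(basename "$dst")"
}

# Example (mirrors the first mv above):
# move_archive data/de-en.tgz "${OUTPUT_DIR_DATA}/europarl-v7-de-en.tgz"
```

With this helper, a failed download is caught at staging time rather than later during preprocessing.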
Run verify_dataset.sh to confirm that the dataset was downloaded correctly.
[root@2 implementations]# du -sh data/
13G     data/
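The verification script works by comparing checksums of the downloaded files against known values. The same kind of spot check can be done by hand; below is an illustrative sketch (the `check_md5` helper is hypothetical, and the expected hash for any real archive must come from verify_dataset.sh itself):

```shell
#!/usr/bin/env bash
# Illustrative checksum spot check (not the repo's verify_dataset.sh):
# compare a file's md5 digest against an expected value and report pass/fail.
check_md5() {
  local file=$1 expected=$2
  local actual
  actual=$(md5sum "$file" | awk '{print $1}')
  if [ "$actual" = "$expected" ]; then
    echo "OK   $file"
  else
    echo "FAIL $file (got $actual, want $expected)" >&2
    return 1
  fi
}
```

A non-zero exit status from a check like this is a signal to re-download the corresponding archive before preprocessing.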
Edit the configuration files and prepare for training
The scripts and code used to run the training job are located in the pytorch directory.
[root@2 implementations]# cd pytorch/
[root@2 pytorch]# ll
total 124
-rw-r--r-- 1 root root  5047 Jan 22 15:45 bind_launch.py
-rwxr-xr-x 1 root root  1419 Jan 22 15:45 config_DGX1_multi.sh
-rwxr-xr-x 1 root root   718 Jan 25 10:50 config_DGX1.sh
-rwxr-xr-x 1 root root  1951 Jan 22 15:45 config_DGX2_multi_16x16x32.sh
-rwxr-xr-x 1 root root  1950 Jan 22 15:45 config_DGX2_multi.sh
-rwxr-xr-x 1 root root   718 Jan 22 15:45 config_DGX2.sh
-rw-r--r-- 1 root root  1372 Jan 22 15:45 Dockerfile
-rw-r--r-- 1 root root  1129 Jan 22 15:45 LICENSE
-rw-r--r-- 1 root root  6494 Jan 22 15:45 mlperf_log_utils.py
-rw-r--r-- 1 root root  4145 Jan 22 15:45 preprocess_data.py
-rw-r--r-- 1 root root 12665 Jan 22 15:45 README.md
-rw-r--r-- 1 root root    43 Jan 22 15:45 requirements.txt
-rwxr-xr-x 1 root root  2220 Jan 22 15:45 run_and_time.sh
-rwxr-xr-x 1 root root  7173 Jan 25 10:56 run.sub
drwxr-xr-x 3 root root    45 Jan 22 15:45 scripts
drwxr-xr-x 7 root root    90 Jan 22 15:45 seq2seq
-rw-r--r-- 1 root root  1082 Jan 22 15:45 setup.py
-rw-r--r-- 1 root root 25927 Jan 22 15:45 train.py
-rw-r--r-- 1 root root  8056 Jan 22 15:45 translate.py
You need to edit config_<system>.sh to reflect your system configuration. If your system has 8 or 16 GPUs, you can use the existing config_DGX1.sh or config_DGX2.sh configuration file to launch the training job.
Parameters to edit:

DGXNGPU=8
DGXSOCKETCORES=18
DGXNSOCKET=2
You can get GPU information with the nvidia-smi command and CPU information with the lscpu command; the relevant lscpu fields are:

Core(s) per socket:    18
Socket(s):             2
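Putting those values together, the system-size section of config_DGX1.sh for this machine would read as follows (these values match the machine above; substitute your own nvidia-smi and lscpu output):

```shell
# System-size knobs in config_DGX1.sh (values for the machine above):
DGXNGPU=8          # number of GPUs reported by nvidia-smi
DGXSOCKETCORES=18  # "Core(s) per socket" from lscpu
DGXNSOCKET=2       # "Socket(s)" from lscpu
```

Getting DGXSOCKETCORES and DGXNSOCKET right matters because the launch scripts use them to pin worker processes to CPU cores.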
Build the Docker image
docker build -t mlperf-nvidia:rnn_translator .
This takes a while, since the base image must first be pulled from NGC.
[root@2 pytorch]# docker build -t mlperf-nvidia:rnn_translator .
Sending build context to Docker daemon  279kB
Step 1/12 : ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.05-py3
Step 2/12 : FROM ${FROM_IMAGE_NAME}
19.05-py3: Pulling from nvidia/pytorch
7e6591854262: Pulling fs layer
………………
8c3f313defdf: Pull complete
Digest: sha256:6614fa29720fc253bcb0e99c29af2f93caff16976661f241ec5ed5cf08e7c010
Status: Downloaded newer image for nvcr.io/nvidia/pytorch:19.05-py3
 ---> 7e98758d4777
Step 3/12 : RUN apt-get update && apt-get install -y --no-install-recommends infiniband-diags pciutils && rm -rf /var/lib/apt/lists/*
 ---> Running in 7b374edf0b57
Get:1 http://archive.ubuntu.com/ubuntu xenial InRelease [247 kB]
………………
Fetched 19.1 MB in 30s (621 kB/s)
Reading package lists...
Building dependency tree...
Reading state information...
The following NEW packages will be installed:
  infiniband-diags libibmad5 libibnetdisc5 libibumad3 libosmcomp3 libpci3 pciutils
0 upgraded, 7 newly installed, 0 to remove and 120 not upgraded.
………………
Setting up infiniband-diags (1.6.6-1) ...
Processing triggers for libc-bin (2.23-0ubuntu11) ...
Removing intermediate container 7b374edf0b57
 ---> 91942ef1e039
Step 4/12 : WORKDIR /workspace/rnn_translator
 ---> 17720ab57857
Step 5/12 : COPY requirements.txt .
 ---> fc25fbdf0006
Step 6/12 : RUN pip install --no-cache-dir https://github.com/mlperf/training/archive/6289993e1e9f0f5c4534336df83ff199bd0cdb75.zip#subdirectory=compliance && pip install --no-cache-dir -r requirements.txt
 ---> Running in 88b21caded36
………………
Successfully installed mlperf-compliance-0.6.0
 ---> 346646500f0f
Step 7/12 : COPY seq2seq/csrc seq2seq/csrc
 ---> 936e5bc1a41e
Step 8/12 : COPY setup.py .
 ---> 090cc90c4cb5
Step 9/12 : RUN pip install .
 ---> Running in 0547065d6492
Building wheels for collected packages: gnmt
  Building wheel for gnmt (setup.py): finished with status 'done'
Successfully installed gnmt-0.6.0
 ---> 7a7bb07a7855
Step 10/12 : COPY . .
 ---> dfa84645d44d
Step 11/12 : ENV LANG C.UTF-8
 ---> d1e6862fe916
Step 12/12 : ENV LC_ALL C.UTF-8
 ---> 2d4231f91c86
Successfully built 2d4231f91c86
Successfully tagged mlperf-nvidia:rnn_translator
Unless otherwise noted, articles on this site are original; when reprinting, please credit the source: MLPerf 机器学习基准测试实战入门(一)NAVIDA-GNMT - Python技术站