TensorFlow Linux GPU + jupyterlab environment installation (Docker) (Ubuntu Deepin Manjaro)

    Copyright statement: This article is neucrack's original article and follows the CC 4.0 BY-SA copyright agreement. Please attach the original source link and this statement for reprinting.
    Original link: https://neucrack.com/p/116

    Using Docker makes the installation simpler and more stable: you only need to install the NVIDIA driver on the host, you do not need to install CUDA, and you do not have to worry about the CUDA version.
    You can also run several containers at the same time, for example several jupyterlab instances for different people to use.

    Install docker

    Install Docker. The version must be 19.03 or later (check it with docker --version). With an older version, the nvidia-docker setup used later will fail and Docker will report that it does not recognize the --gpus all parameter.
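
    The version requirement can also be checked from a script. The following is a minimal sketch, assuming docker --version prints a line of the form "Docker version 19.03.8, build afacb8b"; the function names are made up for this example.

```python
import re
import shutil
import subprocess

MIN_DOCKER_VERSION = (19, 3)  # the article requires Docker 19.03 or later


def parse_docker_version(text):
    """Extract (major, minor) from text such as 'Docker version 19.03.8, build afacb8b'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def supports_gpus_flag(text):
    """Return True if the reported Docker version is at least 19.03."""
    return parse_docker_version(text) >= MIN_DOCKER_VERSION


def installed_docker_version_text():
    """Run `docker --version`; return its output, or None if docker is not on PATH."""
    if shutil.which("docker") is None:
        return None
    result = subprocess.run(
        ["docker", "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

    Calling supports_gpus_flag(installed_docker_version_text()) on a machine with Docker installed tells you whether --gpus all can be expected to work.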

    Installation

    • On Manjaro, simply run yay -S docker
    • Other distributions:

    See the official tutorial: https://docs.docker.com/install/linux/docker-ce/debian/

    deepin is based on Debian 9.0.
    On deepin, edit /usr/share/python-apt/templates/Deepin.info (for example with sudo vim) and change unstable to stable,
    then add the repository with the command sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian stretch stable"

    Set up proxy

    If the download is slow, you may need to set up a proxy, or use a mirror inside China instead of the official registry, such as the daocloud mirror accelerator.

    Docker proxy setting reference: https://neucrack.com/p/286

    Setting the proxy while pulling images makes the pull faster. It is recommended to remove the proxy again when creating containers.

    Allow the current (non-root) user to access docker

    Reference here: https://docs.docker.com/install/linux/linux-postinstall/

    sudo groupadd docker
    sudo usermod -aG docker $USER
    newgrp docker # or reopen the terminal; if it still does not take effect, restart the system
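
    Whether the group change has taken effect can be checked with a short script. This is only a sketch using Python's standard grp and pwd modules (Unix only); the helper names are hypothetical. The first function looks at the group database, the second at the groups of the running session, which is what changes after newgrp or a new login.

```python
import grp
import os
import pwd


def user_in_group(user_name, group_name):
    """Return True if the group database lists user_name as a member of group_name."""
    try:
        group = grp.getgrnam(group_name)
    except KeyError:
        return False  # the group does not exist (yet)
    if user_name in group.gr_mem:
        return True
    # The group may also be the user's primary group, which gr_mem does not list.
    try:
        return pwd.getpwnam(user_name).pw_gid == group.gr_gid
    except KeyError:
        return False  # unknown user


def session_has_group(group_name):
    """Return True if the current process already runs with group_name active."""
    try:
        gid = grp.getgrnam(group_name).gr_gid
    except KeyError:
        return False
    return gid == os.getegid() or gid in os.getgroups()
```

    If user_in_group(name, "docker") is True but session_has_group("docker") is False, the session was started before the usermod command and needs newgrp docker or a new login.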
    

    Common commands

    docker images: list local images
    docker run [options] image_name [command]: create and start a new container from an image
    docker ps: list running containers
    docker ps -a: list all containers, including those that are not running
    docker rm container_name: delete a container
    docker rmi image_name: delete an image
    docker start container_name: start an existing container
    docker attach container_name: attach to a running container
    docker exec container_name [command]: execute a command in a running container
    docker logs container_name: view the container's log output

    docker build -t image_name .: build an image from a Dockerfile

    Common docker run parameters

    -it: enable an interactive terminal
    --rm: remove the container automatically when it exits, so it is not kept
    --gpus all: enable all GPUs
    -p port1:port2: map a host port to a container port; port1 is the host port
    -v volume1:volume2: map a host directory into the container; volume1 is the host directory, for example mapping /home/${USER}/notes to /tf/notes
    --name name: give the container a name; without this parameter the name is randomly generated
    --device device:container_device: pass a host device into the container, such as /dev/ttyUSB0:/dev/ttyUSB0
    --network=host: use the host's network
    --restart: restart policy, which can be used to start the container automatically; if you forgot it when running, you can update it later with docker update --restart=always container_name. Possible values:

    no: do not restart the container automatically (default)
    on-failure: restart the container if it exits due to an error (non-zero exit status)
    unless-stopped: like always, except that a container that was stopped manually is not restarted, even after the Docker daemon restarts
    always: always restart the container when it stops; a manually stopped container is restarted only when the Docker daemon restarts or the container is started manually
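
    To see how these parameters combine, here is a small sketch that assembles a docker run command line from them. The function and its argument names are invented for this example and are not part of Docker; it only builds the argument list and does not run anything. It assumes that Docker rejects --rm combined with a restart policy other than no.

```python
import shlex


def build_docker_run(image, command=(), *, name=None, interactive=True, remove=False,
                     gpus=None, ports=(), volumes=(), devices=(),
                     host_network=False, restart=None):
    """Assemble a `docker run` argument list from the options listed above."""
    if remove and restart not in (None, "no"):
        # A container that is deleted on exit cannot also be restarted.
        raise ValueError("--rm conflicts with --restart")
    args = ["docker", "run"]
    if interactive:
        args.append("-it")
    if remove:
        args.append("--rm")
    if gpus:
        args += ["--gpus", gpus]
    if name:
        args += ["--name", name]
    for host_port, container_port in ports:
        args += ["-p", f"{host_port}:{container_port}"]
    for host_dir, container_dir in volumes:
        args += ["-v", f"{host_dir}:{container_dir}"]
    for host_dev, container_dev in devices:
        args += ["--device", f"{host_dev}:{container_dev}"]
    if host_network:
        args.append("--network=host")
    if restart:
        args += ["--restart", restart]
    args.append(image)
    args += list(command)
    return args


def as_shell(args):
    """Render the argument list as a copy-pastable shell command."""
    return shlex.join(args)
```

    For example, as_shell(build_docker_run("nvidia/cuda", ["nvidia-smi"], interactive=False, remove=True, gpus="all")) yields the test command used later in this article.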
    

    Install graphics card driver

    Installing the graphics card driver is covered in a separate article; refer to Linux Nvidia graphics card installation

    Install the image

    Refer to the official document: https://www.tensorflow.org/install/docker

    Just follow the installation guide in the readme. For example, on my Ubuntu system (be sure to read the documentation, since the steps may have been updated):

    # Add the package repositories
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    
    sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
    sudo systemctl restart docker
    

    On deepin, the distribution string must be set by hand to an Ubuntu release:

    distribution="ubuntu18.04"
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    
    sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
    sudo systemctl restart docker
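
    The distribution string used in both variants is simply ID followed by VERSION_ID from /etc/os-release. As an illustration, the sketch below computes it in Python and applies the deepin override from this article. The ID value Deepin in the override table is an assumption about what deepin writes into /etc/os-release, and the function names are hypothetical.

```python
def parse_os_release(text):
    """Parse the KEY=value lines of /etc/os-release into a dict, stripping quotes."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key] = value.strip().strip('"').strip("'")
    return values


# Override taken from this article: deepin uses the ubuntu18.04 package list.
# The lowercase key "deepin" assumes the ID field is some spelling of "Deepin".
OVERRIDES = {"deepin": "ubuntu18.04"}


def nvidia_docker_distribution(os_release_text):
    """Return the string the shell snippet stores in $distribution."""
    info = parse_os_release(os_release_text)
    dist_id = info.get("ID", "")
    if dist_id.lower() in OVERRIDES:
        return OVERRIDES[dist_id.lower()]
    return dist_id + info.get("VERSION_ID", "")


def list_url(distribution):
    """Build the package list URL used in the curl command."""
    return f"https://nvidia.github.io/nvidia-docker/{distribution}/nvidia-docker.list"
```

    On a real system the input would come from open("/etc/os-release").read().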
    

    On Manjaro, simply run yay -S nvidia-docker (if the download is slow, you can use poipo to set up a global proxy; refer to Terminal proxy setting method)

    • Test whether nvidia-docker and cuda can be used

    Use the nvidia/cuda image. It is only needed for this test and can be deleted afterwards. If you have no proxy set up and do not want to spend time pulling an extra image, you can skip it and test directly with the tensorflow/tensorflow:latest-gpu-py3 image, or with the neucrack/tensorflow-gpu-py3-jupyterlab image (or daocloud.io/neucrack/tensorflow-gpu-py3-jupyterlab), which is recommended: it installs jupyterlab on top of the former and handles user permissions better.

    lspci | grep -i nvidia
    docker run --gpus all --rm nvidia/cuda nvidia-smi
    

    For example:

    ➜ ~ sudo docker run --gpus all --rm nvidia/cuda nvidia-smi
    Tue Mar 10 15:57:12 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 106...  Off  | 00000000:01:00.0  On |                  N/A |
    | 33%   39C    P0    27W / 120W |    310MiB /  6075MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+
    
    
    Wed Mar 11 02:04:26 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 430.40       Driver Version: 430.40       CUDA Version: 10.1     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
    | 35%   41C    P5    25W / 250W |      0MiB / 11178MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  GeForce GTX 108...  Off  | 00000000:81:00.0 Off |                  N/A |
    | 39%   36C    P5    19W / 250W |      0MiB / 11178MiB |      2%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    

    If the driver version is too old, you will be prompted to update the driver.

    Also note that the CUDA version shown here is 10.2, while TensorFlow may only support 10.1. If TensorFlow were installed directly on the host, it would report an error about the unsupported version. This is where Docker pays off: you do not need to care about this, as long as the driver is installed correctly.
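
    The two version numbers can be read from the nvidia-smi output programmatically. The sketch below parses the header line shown above and compares versions. It relies on the fact that the CUDA version printed by nvidia-smi is the highest version the installed driver supports, so an image built for an older CUDA release still runs. The function names are made up for this example.

```python
import re

HEADER_RE = re.compile(
    r"Driver Version:\s*(?P<driver>[\d.]+)\s+CUDA Version:\s*(?P<cuda>[\d.]+)"
)


def parse_smi_header(output):
    """Return (driver_version, cuda_version) strings from nvidia-smi output."""
    match = HEADER_RE.search(output)
    if match is None:
        raise ValueError("no 'Driver Version ... CUDA Version' line found")
    return match.group("driver"), match.group("cuda")


def version_tuple(version):
    """Turn '10.2' into (10, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_supports(driver_cuda, image_cuda):
    """True if the driver's maximum CUDA version covers the version an image needs."""
    return version_tuple(driver_cuda) >= version_tuple(image_cuda)
```

    With the first output above, parse_smi_header gives a CUDA version of 10.2, and driver_supports("10.2", "10.1") confirms that an image built for CUDA 10.1 is covered.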

    On deepin, the following error may occur:

    docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout:, stderr: nvidia-container-cli: ldcache error: open failed: /sbin/ldconfig.real: no such file or directory\\\\n\\\"\"": unknown.
    

    It can be fixed by following the solution at https://github.com/NVIDIA/nvidia-docker/issues/614:

    ln -s /sbin/ldconfig /sbin/ldconfig.real
    

    If docker reports the error nvidia-container-cli: initialization error: cuda error: unknown error,
    restarting the system resolves it.

    Run tensorflow with GPU

    Pull the image directly:

    docker pull neucrack/tensorflow-gpu-py3-jupyterlab
    # docker pull tensorflow/tensorflow:latest-gpu-py3-jupyter
    # docker pull tensorflow/tensorflow
    # docker pull tensorflow/tensorflow:latest-gpu
    

    In China, the image on daocloud can be used instead, which is faster:

    docker pull daocloud.io/neucrack/tensorflow-gpu-py3-jupyterlab
    

    Run the test command:

    docker run --gpus all -it --rm neucrack/tensorflow-gpu-py3-jupyterlab python -c "import tensorflow as tf; print('-----version:{}, gpu:{}, 1+2={}'.format(tf.__version__, tf.test.is_gpu_available(), tf.add(1, 2).numpy()))"
    

    If daocloud is used, the image name needs to be changed to daocloud.io/neucrack/tensorflow-gpu-py3-jupyterlab

    If everything works, the following output appears. It is accompanied by a lot of debugging information and possibly some warnings, which are worth reading carefully:

    -----version:2.1.0, gpu:True, 1+2=3
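
    The same check can be written as a small script, which is easier to read and extend than the one-liner. This is a sketch, not the article's method: it uses tf.config.list_physical_devices, which is available from TensorFlow 2.1 on, instead of tf.test.is_gpu_available, and it returns an empty report when TensorFlow is not installed so that it can also be run outside the container. The function name gpu_report is hypothetical.

```python
def gpu_report():
    """Return the TensorFlow version and the names of the GPUs TensorFlow can see."""
    try:
        import tensorflow as tf  # imported here so the script also runs without TensorFlow
    except ImportError:
        return {"tensorflow": None, "gpus": []}
    gpus = tf.config.list_physical_devices("GPU")
    return {"tensorflow": tf.__version__, "gpus": [gpu.name for gpu in gpus]}


if __name__ == "__main__":
    report = gpu_report()
    print("version:", report["tensorflow"])
    print("gpus:", report["gpus"] or "none found")
```

    Inside the container, an empty GPU list usually means the container was started without --gpus all.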
    

    Jupyterlab

    docker run --gpus all --name jupyterlab_gpu -it -p 8889:8889 -e USER_NAME=$USER -e USER_ID=`id -u $USER` -e GROUP_NAME=`id -gn $USER` -e GROUP_ID=`id -g $USER` -v /home/${USER}:/tf neucrack/tensorflow-gpu-py3-jupyterlab
    

    If daocloud is used, the image name needs to be changed to daocloud.io/neucrack/tensorflow-gpu-py3-jupyterlab

    Then open jupyterlab in a browser at http://127.0.0.1:8889/. Its working directory corresponds to the mapped /home/${USER} directory.
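
    The user and group values passed with -e come from the id command. As an illustration, the sketch below looks up the same values with Python's pwd and grp modules (Unix only) and assembles the run command. The function names and default arguments are made up for this example, and nothing is executed.

```python
import grp
import pwd


def user_env(user_name):
    """Look up what `id -u`, `id -gn` and `id -g` report for user_name."""
    entry = pwd.getpwnam(user_name)
    group = grp.getgrgid(entry.pw_gid)  # primary group of the user
    return {
        "USER_NAME": entry.pw_name,
        "USER_ID": str(entry.pw_uid),
        "GROUP_NAME": group.gr_name,
        "GROUP_ID": str(entry.pw_gid),
    }


def jupyterlab_command(env, home_dir, port=8889, name="jupyterlab_gpu",
                       image="neucrack/tensorflow-gpu-py3-jupyterlab"):
    """Assemble the docker run argument list for the jupyterlab container."""
    args = ["docker", "run", "--gpus", "all", "--name", name, "-it",
            "-p", f"{port}:{port}"]
    for key, value in env.items():
        args += ["-e", f"{key}={value}"]  # no spaces around '=', or docker misparses it
    args += ["-v", f"{home_dir}:/tf", image]
    return args
```

    For the daocloud image, pass image="daocloud.io/neucrack/tensorflow-gpu-py3-jupyterlab".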


    Press Ctrl+C to exit.
    Once created, the container stays on the computer. You can view it with docker ps -a and start it again the next time you need it:

    docker start jupyterlab_gpu
    

    You can also attach to the container:

    docker attach jupyterlab_gpu
    

    Stop the container:

    docker stop jupyterlab_gpu
    

    Delete the container:

    docker rm jupyterlab_gpu
    

    Modify the user and root passwords so that you can use the sudo command

    docker exec -it jupyterlab_gpu /bin/bash
    passwd $USER
    passwd root
    

    If you want to create a new container every time and have it deleted when you are done, just add the --rm parameter to the run command

    Other problems

    • The program reports: ResourceExhaustedError: OOM when allocating tensor with shape[784,128]

    Use nvidia-smi to check the GPU memory usage.

    TensorFlow requests (almost) all of the GPU memory at once:

    ➜ ~ nvidia-smi
    Fri Mar 20 09:18:48 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 435.21       Driver Version: 435.21       CUDA Version: 10.1     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 108...  Off  | 00000000:04:00.0  On |                  N/A |
    |  0%   48C    P2    60W / 250W |  10726MiB / 11178MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  GeForce GTX 108...  Off  | 00000000:81:00.0 Off |                  N/A |
    |  0%   47C    P2    58W / 250W |    197MiB / 11178MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0      3099      G   /usr/lib/xorg/Xorg                            21MiB |
    |    0     40037      C   /usr/bin/python3                           10693MiB |
    |    1     40037      C   /usr/bin/python3                             185MiB |
    +-----------------------------------------------------------------------------+
    
    

    Too many processes may be using the GPU memory; exit some of them as appropriate.
    The memory may also have been requested repeatedly; try restarting the container to fix it.
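
    For a quick overview of which card is full, nvidia-smi can print the memory figures as CSV, which is easier to process than the table. The sketch below parses the output of nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits; the function names are made up for this example, and the sample numbers in the usage note are the ones from the table above.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Parse lines such as '0, 10726, 11178' into per-GPU memory figures in MiB."""
    usage = []
    for line in text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        usage.append({"index": index, "used_mib": used,
                      "total_mib": total, "free_mib": total - used})
    return usage


def query_gpu_memory():
    """Run nvidia-smi and return the parsed figures (requires the NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)


def gpu_with_most_free_memory(usage):
    """Return the index of the GPU with the most free memory."""
    return max(usage, key=lambda gpu: gpu["free_mib"])["index"]
```

    With the figures above, parse_memory_csv("0, 10726, 11178\n1, 197, 11178") shows that GPU 0 has only 452 MiB free while GPU 1 is almost empty.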

    • The program keeps running but never produces a result

    Restarting the docker container fixes this. When in doubt, restart.

    • The program reports: could not retrieve CUDA device count: CUDA_ERROR_NOT_INITIALIZED

    This can happen when multiprocessing is used: the new process directly copies the state of the current process, which causes the error. The solution is to not import TensorFlow in the parent process; import it separately inside the child process when it is needed, instead of at global (module) level. Reference: https://abcdabcd987.com/python-multiprocessing/
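
    A minimal sketch of importing a library inside the child process rather than at global scope is shown below. To keep it runnable without a GPU, the module to import is passed by name, so any module can stand in for tensorflow; in real code the worker would contain import tensorflow as tf followed by the GPU work. The function names are hypothetical, and the fork start method is only available on Unix.

```python
import importlib
import multiprocessing as mp
import queue as queue_module


def worker(module_name, results):
    """Runs in the child process: the library is imported here, not at module level."""
    try:
        module = importlib.import_module(module_name)
        results.put(("ok", module.__name__))
    except Exception as exc:  # report failures instead of leaving the parent waiting
        results.put(("error", repr(exc)))


def run_in_child(module_name="tensorflow", timeout=120):
    """Start a forked child that imports module_name itself and reports back."""
    ctx = mp.get_context("fork")  # fork copies the parent's state into the child
    results = ctx.Queue()
    child = ctx.Process(target=worker, args=(module_name, results))
    child.start()
    try:
        outcome = results.get(timeout=timeout)
    except queue_module.Empty:
        child.terminate()
        outcome = ("error", "timed out")
    child.join()
    return outcome
```

    Because the parent never imports the library, there is no initialized CUDA state for the forked child to inherit.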

    • ImportError: libGL.so.1: cannot open shared object file: No such file or directory
    apt install libgl1-mesa-glx
    
    • Failed to get convolution algorithm. This is probably because cuDNN failed to initialize

    The GPU memory is insufficient. Check whether it is occupied by other programs. If there are several graphics cards, you can choose which one to use with the environment variable CUDA_VISIBLE_DEVICES. For example, with three cards indexed 0, 1 and 2, set it to 2 to select the third card:

    import os
    
    # Set this before TensorFlow initializes the GPU (safest: before importing tensorflow)
    os.environ["CUDA_VISIBLE_DEVICES"] = '2'
    
    

    Reference