I get a cuDNN error when running TensorFlow in Docker

I want to use TensorFlow 2.0 (or later) in Docker to run SRGAN ( https://github.com/tensorlayer/srgan ).

My Dockerfile:

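FROM tensorflow/tensorflow:latest-gpu-py3

ENV HOME=/home
ENV user=hogehoge

WORKDIR $HOME

# Create a non-root user (UID 1000) with its own home directory
RUN useradd -u 1000 -m -d /home/${user} ${user} \
    && chown -R ${user} /home/${user}

RUN pip install tensorlayer easydict

# Run the container as that user (the variable defined above is ${user})
USER ${user}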

I build the image and start the container with:

docker build -t tensorflow .
sudo docker run --rm --gpus all -it -v /media/hikarukondo/Workspace/BLUE_TAG/workspace/:/home/ tensorflow

Inside the container I run:

python train.py

and get the following output:

2020-01-14 05:39:56.390997: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-01-14 05:39:56.392064: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2020-01-14 05:40:00.523011: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-01-14 05:40:00.542402: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-14 05:40:00.542772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 computeCapability: 7.5
coreClock: 1.62GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-01-14 05:40:00.542794: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-01-14 05:40:00.542831: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-01-14 05:40:00.543925: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-01-14 05:40:00.544139: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-01-14 05:40:00.545110: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-01-14 05:40:00.545615: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-01-14 05:40:00.545639: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-14 05:40:00.545738: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-14 05:40:00.546108: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-14 05:40:00.546413: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-01-14 05:40:00.546665: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-01-14 05:40:00.567683: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3696000000 Hz
2020-01-14 05:40:00.567909: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5795ae0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-01-14 05:40:00.567922: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-01-14 05:40:00.626426: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-14 05:40:00.626828: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5776b10 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-01-14 05:40:00.626856: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
2020-01-14 05:40:00.627044: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-14 05:40:00.627339: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 computeCapability: 7.5
coreClock: 1.62GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-01-14 05:40:00.627360: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-01-14 05:40:00.627368: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-01-14 05:40:00.627382: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-01-14 05:40:00.627392: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-01-14 05:40:00.627402: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-01-14 05:40:00.627412: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-01-14 05:40:00.627419: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-14 05:40:00.627460: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-14 05:40:00.627732: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-14 05:40:00.628005: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-01-14 05:40:00.628040: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-01-14 05:40:00.801827: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-01-14 05:40:00.801853: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 
2020-01-14 05:40:00.801858: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N 
2020-01-14 05:40:00.802029: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-14 05:40:00.802406: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-14 05:40:00.802727: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6664 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-01-14 05:40:01.135124: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-14 05:40:01.604467: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-01-14 05:40:01.609256: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "train.py", line 204, in <module>
    evaluate()
  File "train.py", line 171, in evaluate
    G = get_G([1, None, None, 3])
  File "/home/srgan/model.py", line 14, in get_G
    n = Conv2d(64, (3, 3), (1, 1), act=tf.nn.relu, padding='SAME', W_init=w_init)(nin)
  File "/usr/local/lib/python3.6/dist-packages/tensorlayer/layers/core.py", line 225, in __call__
    outputs = self.forward(input_tensors, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorlayer/layers/convolution/simplified_conv.py", line 271, in forward
    name=self.name,
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 1914, in conv2d_v2
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 2011, in conv2d
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 937, in conv2d
    _ops.raise_from_not_ok_status(e, name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 6606, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [Op:Conv2D] name: conv2d_1
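To check whether the failure comes from the GPU/cuDNN setup rather than from the SRGAN code, one option is to run a single convolution on its own. The sketch below assumes TensorFlow 2.x; the function name conv2d_smoke_test is hypothetical, and the shapes only mirror the first Conv2d layer in model.py (3 input channels, 64 filters, 3x3 kernel, 'SAME' padding). If it raises the same "Failed to get convolution algorithm" error inside the container, the problem is most likely environmental, not in train.py.

```python
# Hypothetical cuDNN smoke test: one Conv2D, the same op that fails in the traceback.
# Assumes TensorFlow 2.x; returns None if TensorFlow is not installed.


def conv2d_smoke_test():
    """Run a single 3x3 convolution and return the output shape as a tuple."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # NHWC input with 3 channels, like get_G([1, None, None, 3]) but with a fixed size
    x = tf.random.normal([1, 32, 32, 3])
    # 3x3 kernel mapping 3 -> 64 channels, like Conv2d(64, (3, 3), ...) in model.py
    kernel = tf.random.normal([3, 3, 3, 64])
    # On a GPU this call goes through cuDNN; a broken setup fails here
    y = tf.nn.conv2d(x, kernel, strides=1, padding="SAME")
    return tuple(y.shape.as_list())


if __name__ == "__main__":
    print(conv2d_smoke_test())
```

With 'SAME' padding and stride 1 the spatial size is preserved, so a working setup should report a (1, 32, 32, 64) output.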

Docker version 19.03.5. I have one GeForce RTX 2070 installed and available on my machine. My current driver version is 440.33.01.

Am I doing something wrong, or is there a problem with my Docker build?

1 Answer


Can you try enabling allow_growth in the session config:

config = tf.compat.v1.ConfigProto()  # TF1-style session configuration
config.gpu_options.allow_growth = True
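ConfigProto options belong to the TF1 session API, so in TensorFlow 2.x eager code they may not take effect unless a session is actually created with that config. A possible TF 2.x equivalent is sketched below, assuming the goal is the same (let TensorFlow allocate GPU memory on demand instead of reserving almost all of it up front, which is one suspected cause of CUDNN_STATUS_INTERNAL_ERROR, not confirmed here). It uses the TF_FORCE_GPU_ALLOW_GROWTH environment variable and tf.config.experimental.set_memory_growth; the helper name enable_memory_growth is hypothetical, and placing it at the top of train.py is an assumption.

```python
import os

# Read by TensorFlow's GPU allocator; it must be set before TensorFlow
# initializes the GPU, so set it before importing tensorflow.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"


def enable_memory_growth():
    """Hypothetical helper: TF 2.x counterpart of allow_growth.

    Returns the number of GPUs configured (0 if TensorFlow is not
    installed or no GPU is visible).
    """
    try:
        import tensorflow as tf
    except ImportError:
        return 0
    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        # Raises RuntimeError if the GPU was already initialized,
        # so this must run before any model or tensor is created.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


if __name__ == "__main__":
    print("GPUs with memory growth enabled:", enable_memory_growth())
```

Either mechanism alone should be enough; the environment variable can also be passed from the host with docker run -e TF_FORCE_GPU_ALLOW_GROWTH=true, which avoids editing train.py.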


Any ideas?
