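Symptom

The project uses Flask + Keras + TensorFlow.

The same code runs fine on machines A and B, but on machine C it raises the exception below. The environments on A and B were set up first; I only tried running on C after everything had been run and debugged successfully on those two.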

  File "/Users/qhl/anaconda3/lib/python3.6/site-packages/keras/models.py", line 1025, in predict
    steps=steps)
  File "/Users/qhl/anaconda3/lib/python3.6/site-packages/keras/engine/training.py", line 1832, in predict
    self._make_predict_function()
  File "/Users/qhl/anaconda3/lib/python3.6/site-packages/keras/engine/training.py", line 1029, in _make_predict_function
    **kwargs)
  File "/Users/qhl/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2502, in function
    return Function(inputs, outputs, updates=updates, **kwargs)
  File "/Users/qhl/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2445, in __init__
    with tf.control_dependencies(self.outputs):
  File "/Users/qhl/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4863, in control_dependencies
    return get_default_graph().control_dependencies(control_inputs)
  File "/Users/qhl/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4481, in control_dependencies
    c = self.as_graph_element(c)
  File "/Users/qhl/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3478, in as_graph_element
    return self._as_graph_element_locked(obj, allow_tensor, allow_operation)
  File "/Users/qhl/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3557, in _as_graph_element_locked
    raise ValueError("Tensor %s is not an element of this graph." % obj)
ValueError: Tensor Tensor("Output/Softmax:0", shape=(?, 3062), dtype=float32) is not an element of this graph.

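Finding the cause

Because TensorFlow may use the GPU, my first suspicion was the graphics card or its driver. I tried the following on machine C:

  • Reinstalled both the CPU and GPU builds of TensorFlow
  • Reinstalled the graphics driver and CUDA
  • Switched the operating system between Windows, Deepin, and Ubuntu
  • Re-synced the application code several times to make sure the code on A, B, and C was identical
  • Checked the version numbers of TensorFlow, Keras, and h5py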
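When comparing environments across machines, it can help to print the versions of every package involved in one go, including the web framework and not only the deep-learning stack. The snippet below is a minimal sketch of that idea. It assumes Python 3.8 or newer (for `importlib.metadata`), and `installed_versions` is a helper name I made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print one line per package so the output can be diffed between machines.
for name, ver in installed_versions(["tensorflow", "keras", "h5py", "flask"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

Running this on each machine and diffing the output makes a version mismatch in any of the listed packages easy to spot.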

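None of this helped; the same error came back every time. After tracing through the code and searching online, I found that it is caused by a TensorFlow bug in multithreaded mode. The latest Flask release at the time (1.0.2) runs in multithreaded mode by default, whereas earlier versions defaulted to single-threaded. As luck would have it, Flask 1.0 was released just before I set up the environment on machine C; when I set up A and B, Flask was still at 0.12.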
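As far as I understand it, the underlying reason is that in TensorFlow 1.x the default graph is a property of the current thread, so a model loaded in the main thread lives in a graph that a request-handling thread does not see as its default. The toy below only imitates that behaviour with `threading.local`; `ToyGraph` and the other names are stand-ins I made up, not TensorFlow's actual implementation.

```python
import threading


class ToyGraph:
    """Stand-in for a computation graph; it only tracks tensor names."""

    def __init__(self):
        self.tensors = set()


class _DefaultGraphState(threading.local):
    """Per-thread state: every thread lazily gets its own default ToyGraph."""

    def __init__(self):
        self.graph = ToyGraph()


_state = _DefaultGraphState()


def get_default_graph():
    return _state.graph


# "Load the model" in the main thread: its output lands in the main thread's graph.
main_graph = get_default_graph()
main_graph.tensors.add("Output/Softmax:0")


def predict(results):
    # Runs in a worker thread, much like a threaded Flask request handler.
    graph = get_default_graph()
    results["same_graph"] = graph is main_graph
    results["found"] = "Output/Softmax:0" in graph.tensors


results = {}
worker = threading.Thread(target=predict, args=(results,))
worker.start()
worker.join()
print(results)  # {'same_graph': False, 'found': False}
```

The worker thread gets a fresh, empty default graph, which mirrors the "is not an element of this graph" error in the traceback.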

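Solutions

1. Switching the Keras backend to Theano reportedly fixes it. I have not tried this.

2. Change the current default graph. There is a long discussion worth reading here: https://github.com/keras-team/keras/issues/2397 . The approach:

  After loading or building your model, add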

graph = tf.get_default_graph()

  Then, right before calling model.predict(), add

global graph
with graph.as_default():
    (... do inference here ...)

With this change, multithreaded mode works.

3. Alternatively, force Flask into single-threaded mode.
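One way to package the graph-capturing approach without a global variable is a small closure that remembers both the model and the graph it was loaded into. This is only a sketch: `make_predictor` is a name I made up, and it assumes TensorFlow 1.x, where `tf.get_default_graph()` is available at the top level and the graph object offers `as_default()` as a context manager.

```python
def make_predictor(model, graph):
    """Return a function that always runs model.predict inside `graph`."""

    def predict(batch):
        # Make the captured graph the default for the calling thread,
        # so the model's tensors are found no matter which thread runs this.
        with graph.as_default():
            return model.predict(batch)

    return predict


# Intended usage with TensorFlow 1.x and Keras (not executed here;
# "model.h5" is a hypothetical path):
#
#   model = keras.models.load_model("model.h5")
#   predict = make_predictor(model, tf.get_default_graph())
#
# A Flask view function can then call predict(x) from any worker thread.
```

The important detail is that the graph is captured once, in the same thread and right after the model is loaded, and then reused for every request.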

if __name__ == '__main__':
    app.run(host="0.0.0.0", port=8080, threaded=False)