Using multiple CPU cores in TensorFlow

This post looks at getting a TensorFlow program to use multiple CPU cores. The code example shows that even with `device_count` and `inter_op_parallelism_threads` set, the program still uses only a single core. Inspecting `htop` and running a trace shows that TensorFlow appears to use two CPU devices in some cases, yet does not actually execute in parallel. The root cause is sequential execution introduced by the constant-folding optimization pass. The fix is to create non-trivial inputs with `tf.placeholder`, which prevents constant folding and allows genuine multi-core parallelism.


I have extensively studied other answers on TensorFlow and I just cannot seem to get it to use multiple cores on my CPU.

According to htop, the following program only uses a single CPU core:

import tensorflow as tf

n_cpus = 20

sess = tf.Session(config=tf.ConfigProto(
    device_count={"CPU": n_cpus},
    inter_op_parallelism_threads=n_cpus,
    intra_op_parallelism_threads=1,
))

size = 100000

A = tf.ones([size, size], name="A")
B = tf.ones([size, size], name="B")
C = tf.ones([size, size], name="C")

with tf.device("/cpu:0"):
    x = tf.matmul(A, B)
with tf.device("/cpu:1"):
    y = tf.matmul(A, C)

sess.run([x, y])

# run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
# run_metadata = tf.RunMetadata()
# sess.run([x, y], options=run_options, run_metadata=run_metadata)
# for device in run_metadata.step_stats.dev_stats:
#     device_name = device.device
#     print(device.device)
#     for node in device.node_stats:
#         print("    ", node.node_name)

However, when I uncomment the lines at the bottom, and change size so that the computation actually finishes in a reasonable amount of time, I see that TensorFlow seems to think it's using at least 2 CPU devices:

/job:localhost/replica:0/task:0/device:CPU:0
    _SOURCE
    MatMul
    _retval_MatMul_0_0
    _retval_MatMul_1_0_1
/job:localhost/replica:0/task:0/device:CPU:1
    _SOURCE
    MatMul_1

Fundamentally, what I want to do here is execute different ops on different cores in parallel. I don't want to split a single op over multiple cores, though I know that happens to work in this contrived example. Both device_count and inter_op_parallelism_threads sound like what I want, but neither seems to actually result in using multiple cores. I've tried all combinations I can think of, including setting one or the other to 1 in case they conflict with each other, and nothing seems to work.

I can also confirm with taskset that I'm not doing anything strange with my CPU affinity:

$ taskset -p $$

pid 21395's current affinity mask: ffffffffff

What exactly do I have to do to this code to get it to use multiple CPU cores?

Note:

From this answer among others I'm setting the device_count and inter_op_parallelism_threads.

The tracing command comes from this answer.

I can remove the tf.device calls and it doesn't seem to make any difference to my CPU utilization.

I'm using TensorFlow 1.10.0 installed from conda.

Solution

After some back and forth on the TensorFlow issue, we determined that the program was being "optimized" by a constant-folding pass, because the inputs were all trivial. It turns out this constant-folding pass runs sequentially. Therefore, if you want to observe parallel execution, you need to make the inputs non-trivial so that constant folding does not apply to them. The method suggested in the issue was to use tf.placeholder, and I have written an example program that makes use of this there:

See the original issue for sample output from the program: https://github.com/tensorflow/tensorflow/issues/22619
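A minimal sketch of the placeholder approach follows. It is not the author's exact program from the issue; the matrix size, thread count, and the try/except import shim (so the TF 1.x graph-mode API also works on newer TensorFlow installs) are illustrative choices of mine:

```python
import numpy as np

# TF 1.x graph-mode API; on TF 2.x fall back to the compat shim.
try:
    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()
except ImportError:
    import tensorflow as tf

n_cpus = 2
sess = tf.Session(config=tf.ConfigProto(
    device_count={"CPU": n_cpus},
    inter_op_parallelism_threads=n_cpus,
    intra_op_parallelism_threads=1,
))

size = 500

# Placeholders make the inputs non-trivial, so the constant-folding
# pass cannot pre-compute the matmuls sequentially at graph-optimization time.
A = tf.placeholder(tf.float32, [size, size], name="A")
B = tf.placeholder(tf.float32, [size, size], name="B")
C = tf.placeholder(tf.float32, [size, size], name="C")

with tf.device("/cpu:0"):
    x = tf.matmul(A, B)
with tf.device("/cpu:1"):
    y = tf.matmul(A, C)

# Feed concrete values only at run time, one random matrix per placeholder.
feed = {p: np.random.rand(size, size).astype(np.float32) for p in (A, B, C)}
xv, yv = sess.run([x, y], feed_dict=feed)
print(xv.shape, yv.shape)
```

With the placeholders in place, `htop` should show both matmuls running concurrently on separate cores (increase `size` if the run is too short to observe).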
