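performance_hardware (documentation)

This article compares the training and testing performance of the NVIDIA K40, Titan, K20, and GTX 770 GPUs on CaffeNet, including run times under different configurations, and gives configuration tips for the K40.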

---
title: Performance and Hardware Configuration
---

# Performance and Hardware Configuration

To measure performance on different NVIDIA GPUs we use CaffeNet, the Caffe reference ImageNet model.

For training, each timing covers 20 iterations (minibatches) of 256 images, 5,120 images in total. For testing, each timing covers classification of the full 50,000-image validation set.
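The timings reported on this page can be read as throughput by dividing the image count by the elapsed seconds. A minimal sketch of that arithmetic, assuming `awk` is available; the helper name `images_per_sec` is made up for illustration and is not part of Caffe:

```shell
#!/bin/sh
# images_per_sec IMAGES SECONDS
# Prints throughput in images per second, rounded to one decimal place.
# Hypothetical helper for reading the timings on this page; not a Caffe tool.
images_per_sec() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.1f\n", n / t }'
}

# Training timing: 20 iterations x 256 images = 5,120 images in 26.5 secs.
images_per_sec 5120 26.5     # prints 193.2
# Testing timing: the 50,000-image validation set in 100 secs.
images_per_sec 50000 100     # prints 500.0
```

The two sample calls use the best standard-Caffe K40 timings listed in the K40 section.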

**Acknowledgements**: BVLC members are very grateful to NVIDIA for providing several GPUs to conduct this research.

## NVIDIA K40

Performance is best with ECC off and boost clock enabled. While ECC makes a negligible difference in speed, disabling it frees ~1 GB of GPU memory.

Best settings with ECC off and maximum clock speed in standard Caffe:

* Training is 26.5 secs / 20 iterations (5,120 images)
* Testing is 100 secs / validation set (50,000 images)

Best settings with Caffe + [cuDNN acceleration](http://nvidia.com/cudnn):

* Training is 19.2 secs / 20 iterations (5,120 images)
* Testing is 60.7 secs / validation set (50,000 images)

Other settings:

* ECC on, max speed: training 26.7 secs / 20 iterations, test 101 secs / validation set
* ECC on, default speed: training 31 secs / 20 iterations, test 117 secs / validation set
* ECC off, default speed: training 31 secs / 20 iterations, test 118 secs / validation set
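The relative effect of each K40 setting follows from the ratios of the timings above. The sketch below computes the percentage of time saved between two timings; the function name `percent_saved` is an illustrative assumption, and the inputs are the figures listed in this section:

```shell
#!/bin/sh
# percent_saved BASELINE_SECS NEW_SECS
# Prints how much shorter NEW_SECS is than BASELINE_SECS, as a percentage
# of the baseline, rounded to one decimal place. Hypothetical helper.
percent_saved() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) / a * 100 }'
}

percent_saved 26.7 26.5   # ECC on -> off at max clock, training: prints 0.7
percent_saved 31 26.5     # default -> max clock with ECC off, training: prints 14.5
percent_saved 26.5 19.2   # standard Caffe -> cuDNN, training: prints 27.5
percent_saved 100 60.7    # standard Caffe -> cuDNN, testing: prints 39.3
```

On these numbers the clock setting and cuDNN account for nearly all of the gain, while ECC changes the training time by under one percent.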

### K40 configuration tips

For maximum K40 performance, turn off ECC and boost the clock speed (at your own risk).

To turn off ECC, do

    sudo nvidia-smi -i 0 --ecc-config=0    # repeat with -i x for each GPU ID

then reboot.

Enable persistence mode on the GPUs with

    sudo nvidia-smi -pm 1

and then set the clock speed with

    sudo nvidia-smi -i 0 -ac 3004,875    # repeat with -i x for each GPU ID

but note that this configuration resets whenever the driver is reloaded or the machine reboots. To reapply the settings automatically, include these commands in a boot script; on Ubuntu, a simple fix is to add them to `/etc/rc.local`.
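One way such a boot script could look is sketched below. It is not an official script: the variable names `GPU_IDS` and `DRY_RUN` and the function names are assumptions made for this example, and the script prints the commands by default instead of running them. The ECC command is left out on the assumption that the ECC setting survives the reboot it requires. Because boot scripts such as `/etc/rc.local` normally run as root, `sudo` is omitted.

```shell
#!/bin/sh
# Sketch of a boot-time script that reapplies the K40 settings.
# Assumptions (not from the Caffe docs): GPU_IDS holds the space-separated
# GPU indices to configure, and DRY_RUN=1 (the default here) only prints
# each command instead of executing it.
GPU_IDS="${GPU_IDS:-0}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

apply_k40_settings() {
    # Enable persistence mode first, as in the steps above.
    run nvidia-smi -pm 1
    for id in $GPU_IDS; do
        # Application clocks (memory,graphics in MHz), same values as above.
        run nvidia-smi -i "$id" -ac 3004,875
    done
}

apply_k40_settings
```

To try it on a machine with two K40s, set `GPU_IDS="0 1"` and check the printed commands, then set `DRY_RUN=0` to execute them.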

## NVIDIA Titan

* Training: 26.26 secs / 20 iterations (5,120 images)
* Testing: 100 secs / validation set (50,000 images)
* cuDNN training: 20.25 secs / 20 iterations (5,120 images)
* cuDNN testing: 66.3 secs / validation set (50,000 images)

## NVIDIA K20

* Training: 36.0 secs / 20 iterations (5,120 images)
* Testing: 133 secs / validation set (50,000 images)

## NVIDIA GTX 770

* Training: 33.0 secs / 20 iterations (5,120 images)
* Testing: 129 secs / validation set (50,000 images)
* cuDNN training: 24.3 secs / 20 iterations (5,120 images)
* cuDNN testing: 104 secs / validation set (50,000 images)
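For a side-by-side view, the standard-Caffe (non-cuDNN) timings from each section can be tabulated as throughput. This is only a sketch: the function name `gpu_table` and the pipe-separated layout are assumptions for this example, and the K40 row uses the ECC-off, maximum-clock figures.

```shell
#!/bin/sh
# Tabulate standard-Caffe throughput per GPU from the timings on this page.
# Input columns: GPU name | training secs per 5,120 images | testing secs
# per 50,000 images. The function name and layout are illustrative only.
gpu_table() {
    awk -F'|' '{ printf "%-8s %6.1f train img/s %6.1f test img/s\n", $1, 5120 / $2, 50000 / $3 }' <<'EOF'
K40|26.5|100
Titan|26.26|100
K20|36.0|133
GTX 770|33.0|129
EOF
}

gpu_table
```

Each output row lists the GPU followed by its training and testing throughput in images per second.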

