Phoenix SDK Development Guide: Extending Custom LLM Evaluation Metrics and Tracing Logic

[Free download link] phoenix: AI Observability & Evaluation. Project address: https://gitcode.com/gh_mirrors/phoenix13/phoenix

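Introduction: Pushing the Boundaries of LLM Observability

Are quality fluctuations in your LLM application still giving you trouble? When users complain that answers have gotten worse, do you spend hours tracing the root cause? Phoenix SDK gives AI engineers a capable observability toolchain, but its built-in features rarely cover every complex business scenario. This article explains how to extend evaluation metrics and tracing logic to build an LLM quality monitoring system of your own, with the aim of cutting root-cause analysis from hours to minutes.

After reading this article you will know:

Design patterns and implementation methods for custom evaluation metrics
Techniques for extending end-to-end tracing logic
Engineering practices for balancing performance and accuracy
Best practices for production deployment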

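Architecture Overview: Core Evaluation and Tracing Components in Phoenix

Phoenix's observability system rests on two pillars: the evaluation system and the tracing system. The evaluation system quantifies the quality of LLM output, while the tracing system records the complete execution path. The two exchange data through an event bus, forming a closed feedback loop.

[Mermaid diagram: evaluation and tracing architecture; not preserved in this copy]

Core class relationships

The evaluation system of Phoenix SDK is built around the Evaluator abstract class, and every evaluation metric must implement this interface. The examples in this guide use two of its subclasses, CodeEvaluator and LLMEvaluator.

[Mermaid diagram: Evaluator class hierarchy; not preserved in this copy]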

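Hands-On: Building Custom Evaluation Metrics

Design principles and evaluation dimensions

Custom evaluation metrics should follow these principles:

Business relevance: the metric should directly reflect a business goal
Interpretability: a change in score should have a clear attribution
Computational efficiency: evaluation latency should stay below 100 ms (synchronous) or 500 ms (asynchronous)
Robustness: the metric should be insensitive to small perturbations of the input

Common evaluation dimensions include:

Relevance: how closely the output relates to the query
Factual consistency: how accurate the output content is
Safety: the risk level of harmful content
Fluency: how natural the language reads
Structure: how well the format meets the requirements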
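
To make the "structure" dimension concrete, here is a minimal sketch of a rule-based scorer in plain Python, with no Phoenix dependency. The function name `score_json_structure` and the idea of a required-key list are assumptions made for illustration; the score is simply the fraction of required keys present in a JSON object.

```python
import json
from typing import Iterable


def score_json_structure(output: str, required_keys: Iterable[str]) -> float:
    """Return the fraction of required keys found in a JSON object output.

    Hypothetical helper for illustration: 0.0 means the output is not a JSON
    object (or contains none of the keys), 1.0 means every key is present.
    """
    keys = list(required_keys)
    try:
        parsed = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if not isinstance(parsed, dict) or not keys:
        return 0.0
    present = sum(1 for key in keys if key in parsed)
    return present / len(keys)
```

A score like this is cheap, deterministic, and easy to explain, which matches the efficiency and interpretability principles listed above.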

Implementing a custom code evaluator

Code evaluators suit scenarios where the rules are explicit and the computation is fast. The following implements a "sentiment polarity evaluator" that quantifies how positive a text is:

from types import MappingProxyType
from typing import Any, Optional

from phoenix.experiments.evaluators.base import CodeEvaluator
from phoenix.experiments.types import (
    EvaluationResult,
    ExampleInput,
    ExampleMetadata,
    ExampleOutput,
    TaskOutput,
)

class SentimentPolarityEvaluator(CodeEvaluator):
    """
    Scores the sentiment polarity of a text, from -1.0 (fully negative) to
    1.0 (fully positive), using the VADER analyzer from NLTK.
    """

    def __init__(self, threshold: float = 0.0):
        """
        Args:
            threshold: classification threshold; a score at or above it is
                labeled positive. Defaults to 0.0.
        """
        self.threshold = threshold
        # Load the sentiment analyzer lazily to keep imports fast
        self._analyzer = None

    @property
    def name(self) -> str:
        return "SentimentPolarityEvaluator"

    def evaluate(
        self,
        output: Optional[TaskOutput] = None,
        expected: Optional[ExampleOutput] = None,
        metadata: ExampleMetadata = MappingProxyType({}),
        input: ExampleInput = MappingProxyType({}),
        **kwargs: Any,
    ) -> EvaluationResult:
        # Make sure the output exists and is a string
        if not output or not isinstance(output, str):
            return EvaluationResult(
                score=0.0,
                explanation="Invalid output format",
                label="invalid",
            )

        # Initialize the analyzer on first use
        # (requires a one-time nltk.download("vader_lexicon"))
        if self._analyzer is None:
            from nltk.sentiment import SentimentIntensityAnalyzer
            self._analyzer = SentimentIntensityAnalyzer()

        # Compute the sentiment scores; "compound" lies in [-1.0, 1.0]
        sentiment = self._analyzer.polarity_scores(output)
        compound_score = sentiment["compound"]

        # Build the evaluation result
        label = "positive" if compound_score >= self.threshold else "negative"

        return EvaluationResult(
            score=compound_score,
            explanation=f"Sentiment analysis result: {sentiment}",
            label=label,
        )

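Implementing an LLM-assisted evaluator

For scenarios that need deeper semantic understanding, you can implement an LLM-based evaluator. The following example scores the "friendliness" of an answer:

from types import MappingProxyType
from typing import Any, Optional

from phoenix.experiments.evaluators.base import LLMEvaluator
from phoenix.experiments.types import (
    EvaluationResult,
    ExampleInput,
    ExampleMetadata,
    ExampleOutput,
    TaskOutput,
)
# Model interface as used in this guide; confirm the import path and the
# generate() method against your installed Phoenix version
from phoenix.experiments.llm import LLMBaseModel

class FriendlinessEvaluator(LLMEvaluator):
    """
    Uses an LLM to rate how friendly an answer is, returning a score in [0, 1].
    """

    def __init__(self, model: LLMBaseModel, name: str = "Friendliness"):
        """
        Args:
            model: LLM model instance
            name: evaluator name
        """
        self.model = model
        self._name = name

    @property
    def name(self) -> str:
        return self._name

    async def async_evaluate(
        self,
        output: Optional[TaskOutput] = None,
        expected: Optional[ExampleOutput] = None,
        metadata: ExampleMetadata = MappingProxyType({}),
        input: ExampleInput = MappingProxyType({}),
        **kwargs: Any,
    ) -> EvaluationResult:
        if not output or not isinstance(output, str):
            return EvaluationResult(
                score=0.0,
                explanation="Invalid output format",
                label="invalid",
            )

        # Build the evaluation prompt
        prompt = f"""
        Task: rate the friendliness of the following text on a scale of 0 to 10.
        Definition of friendliness: the degree to which the text is friendly,
        polite and respectful, without sarcasm or negativity.

        Text: {output}

        First analyze how the text shows friendliness, then give a score.
        Output format: the score (a number) and the explanation, separated by a comma.
        Example: 8.5, The text uses polite words such as "please" and "thank you" and shows respect.
        """

        # Call the LLM
        response = await self.model.generate(prompt)

        # Parse the result
        try:
            score_str, explanation = response.split(",", 1)
            # Normalize to the 0-1 range and clamp out-of-range answers
            score = min(max(float(score_str.strip()) / 10.0, 0.0), 1.0)
            label = "friendly" if score >= 0.6 else "unfriendly"

            return EvaluationResult(
                score=score,
                explanation=explanation.strip(),
                label=label,
            )
        except (ValueError, IndexError):
            return EvaluationResult(
                score=0.0,
                explanation=f"LLM evaluation failed: {response}",
                label="error",
            )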
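
LLM judges do not always follow the requested "score, explanation" format exactly, so the parsing step is a common failure point. The sketch below is one possible stand-alone parser in plain Python; `parse_judge_response` is a hypothetical helper, not part of Phoenix. It accepts a leading number followed by an optional comma or colon and returns None when no score can be found.

```python
import re
from typing import Optional, Tuple

# A leading number such as "8" or "8.5", an optional separator, then free text.
_SCORE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*[,:，]?\s*(.*)$", re.DOTALL)


def parse_judge_response(
    response: str, max_score: float = 10.0
) -> Optional[Tuple[float, str]]:
    """Parse "score, explanation" text into (normalized_score, explanation).

    Returns None when the text does not start with a number, so the caller can
    report an error instead of a misleading 0.0 score.
    """
    match = _SCORE_PATTERN.match(response)
    if match is None:
        return None
    raw_score = float(match.group(1))
    # Clamp before normalizing, since a model may answer outside the scale.
    clamped = min(max(raw_score, 0.0), max_score)
    return clamped / max_score, match.group(2).strip()
```

Returning None rather than raising keeps the evaluator's error branch explicit: the caller decides which label to attach to an unparseable response.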

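Registering and using evaluators

Once a custom evaluator is implemented, it has to be registered with Phoenix before it shows up in the UI:

from phoenix.session import Session

# Method names as used in this guide; confirm them against your Phoenix version

# Initialize the Phoenix session
session = Session()

# Register the evaluators
# (my_llm_model stands for an LLM model instance created elsewhere)
session.register_evaluator(SentimentPolarityEvaluator(threshold=0.2))
session.register_evaluator(FriendlinessEvaluator(model=my_llm_model))

# Run the evaluators
results = session.evaluate(
    dataset_name="customer_support_queries",
    evaluators=["SentimentPolarityEvaluator", "Friendliness"],
)

# Inspect the evaluation results
for result in results:
    print(f"Example ID: {result.example_id}")
    print(f"Sentiment score: {result.scores['SentimentPolarityEvaluator']}")
    print(f"Friendliness score: {result.scores['Friendliness']}")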

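Extending Tracing Logic: Custom Spans and Attributes

OpenTelemetry integration basics

Phoenix builds its tracing system on OpenTelemetry (OTel), and custom tracing logic can be added by extending OTel's SpanProcessor. The core concepts are:

Span: a single unit of work, with a name, timestamps, attributes, and so on
Trace: a call chain made up of related spans
Attribute: metadata in key-value form
Event: a notable moment in a span's lifetime

Implementing custom attribute injection

The following example shows how to add custom cost attributes to LLM call spans: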

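import logging

from opentelemetry import trace
from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor, TracerProvider

logger = logging.getLogger(__name__)

# Pricing model: price in USD per token, by model name
PRICING_MODEL = {
    "gpt-4": 0.03 / 1000,
    "gpt-3.5-turbo": 0.002 / 1000,
    "claude-2": 0.03 / 1000,
}

def record_llm_cost(span, model_name, prompt_tokens, completion_tokens,
                    pricing_model=PRICING_MODEL):
    """
    Compute the cost of an LLM call and inject it into the span attributes.
    Call this inside the span, once the token counts are known.
    """
    # Compute the cost
    cost_per_token = pricing_model.get(model_name, 0.0)
    total_cost = (prompt_tokens + completion_tokens) * cost_per_token

    # Add the custom attributes
    span.set_attribute("llm.total_cost", total_cost)
    span.set_attribute("llm.cost_currency", "USD")

    # Record a cost event
    span.add_event(
        "llm_cost_calculated",
        attributes={
            "prompt_cost": prompt_tokens * cost_per_token,
            "completion_cost": completion_tokens * cost_per_token,
            "total_cost": total_cost,
        },
    )

class LLMCostProcessor(SpanProcessor):
    """
    Aggregates the cost of finished LLM call spans per model
    """

    def __init__(self):
        self.total_cost_by_model: dict[str, float] = {}

    def on_end(self, span: ReadableSpan) -> None:
        # Only handle LLM call spans
        if span.name != "llm_inference":
            return

        # A span passed to on_end has already ended and is read-only:
        # set_attribute() and add_event() have no effect at this point,
        # which is why the attributes are written by record_llm_cost().
        attributes = span.attributes or {}
        model_name = attributes.get("llm.model", "unknown")
        total_cost = attributes.get("llm.total_cost", 0.0)
        self.total_cost_by_model[model_name] = (
            self.total_cost_by_model.get(model_name, 0.0) + total_cost
        )
        logger.debug("LLM cost for %s: %.6f USD", model_name, total_cost)

# Register the custom processor on the tracer provider; the span exporter
# that sends data to Phoenix is added to the same provider
provider = TracerProvider()
provider.add_span_processor(LLMCostProcessor())
trace.set_tracer_provider(provider)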
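
The cost example above applies one per-token price to both prompt and completion tokens. Providers often price the two differently, so a slightly richer pricing structure can be useful. The sketch below is plain Python; `TokenPrice`, `compute_llm_cost`, and the numbers in `PRICES` are illustrative assumptions, not real price lists or Phoenix APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    """Price in USD per 1,000 tokens (illustrative values only)."""

    prompt_per_1k: float
    completion_per_1k: float


def compute_llm_cost(price: TokenPrice, prompt_tokens: int, completion_tokens: int) -> float:
    """Return the USD cost of one call, pricing prompt and completion separately."""
    if prompt_tokens < 0 or completion_tokens < 0:
        raise ValueError("token counts must be non-negative")
    prompt_cost = prompt_tokens / 1000 * price.prompt_per_1k
    completion_cost = completion_tokens / 1000 * price.completion_per_1k
    return prompt_cost + completion_cost


# Hypothetical price table keyed by model name.
PRICES = {"example-model": TokenPrice(prompt_per_1k=0.03, completion_per_1k=0.06)}
cost = compute_llm_cost(PRICES["example-model"], prompt_tokens=1000, completion_tokens=500)
print(f"{cost:.4f} USD")
```

Keeping the arithmetic in a pure function makes it easy to unit test and to reuse from whichever hook ends up writing the span attribute.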

Implementing business-logic spans

For complex business flows, you can create custom spans to track the key steps:

from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer(__name__)

# classify_query, extract_intent, retrieve_documents and generate_answer
# are application functions defined elsewhere

def process_customer_query(query: str, user_id: str) -> str:
    """Business flow for handling a customer query"""
    with tracer.start_as_current_span(
        "customer_query_processing",
        kind=SpanKind.SERVER,
        attributes={
            "user.id": user_id,
            "query.length": len(query),
            "query.category": classify_query(query)
        }
    ) as span:
        try:
            # Extract the query intent
            with tracer.start_as_current_span("intent_extraction") as intent_span:
                intent = extract_intent(query)
                intent_span.set_attribute("intent.label", intent)

            # Retrieve relevant documents
            with tracer.start_as_current_span("document_retrieval") as retrieval_span:
                documents = retrieve_documents(intent)
                retrieval_span.set_attribute("retrieval.count", len(documents))
                retrieval_span.set_attribute("retrieval.strategy", "hybrid")

            # Generate the answer
            with tracer.start_as_current_span("answer_generation") as generation_span:
                answer = generate_answer(query, documents)
                generation_span.set_attribute("answer.length", len(answer))

                # Add a custom event
                generation_span.add_event(
                    "moderation_check",
                    attributes={
                        "passed": True,
                        "category": "safe"
                    }
                )

            span.set_attribute("processing.success", True)
            return answer

        except Exception as e:
            span.set_attribute("processing.success", False)
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            return "Sorry, something went wrong while processing your request"

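Storing and querying trace data

Custom trace data is stored automatically in Phoenix's database and can be queried with a SQL-style filter or through the SDK:

from datetime import datetime

# Query custom attributes with a SQL-style filter
# (session is the Phoenix session created earlier; the filter assumes
# "llm.total_cost" was recorded on the span before it ended)
results = session.query_spans(
    "llm.model = 'gpt-4' AND llm.total_cost > 0.01",
    start_time=datetime(2023, 10, 1),
    end_time=datetime(2023, 10, 2)
)

# Find the most expensive queries
expensive_queries = sorted(
    results,
    key=lambda x: x.attributes.get("llm.total_cost", 0),
    reverse=True
)[:10]

for span in expensive_queries:
    print(f"Cost: ${span.attributes['llm.total_cost']:.4f}")
    print(f"Duration: {span.end_time - span.start_time}")
    print(f"Tokens: {span.attributes['llm.prompt_tokens'] + span.attributes['llm.completion_tokens']}")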

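Performance Optimization: Engineering Practices for Evaluation and Tracing

Tuning evaluation performance

Custom evaluators can become a performance bottleneck. The table below lists optimization options; the effect figures are rough indications and depend on the workload.

Optimization strategy	Applicable scenario	Indicative effect	Implementation complexity
Asynchronous evaluation	Network calls or LLM evaluation	Lowers P99 latency by 50%+	Medium
Batch evaluation	Bulk data processing	Raises throughput by 10x+	Low
Caching evaluation results	Repeated inputs	Cuts computation by 90%+	Low
Precomputed features	Complex feature engineering	Cuts evaluation time by 70%+	Medium

Asynchronous evaluation:

from concurrent.futures import ThreadPoolExecutor

class AsyncEvaluatorWrapper:
    """Runs a synchronous evaluator on a thread pool"""

    def __init__(self, evaluator, max_workers=4):
        self.evaluator = evaluator
        self.executor = ThreadPoolExecutor(max_workers=max_workers)

    def evaluate_async(self, output):
        """Submit one output for evaluation and return a Future"""
        return self.executor.submit(
            self.evaluator.evaluate,
            output=output
        )

    def evaluate_batch_async(self, outputs):
        """Evaluate a batch concurrently and wait for all results"""
        futures = [self.evaluate_async(output) for output in outputs]
        return [future.result() for future in futures]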
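
The table above also lists result caching for repeated inputs. A minimal sketch of that idea follows, using only the standard library; `CachedEvaluator` and its `score_fn` argument are hypothetical names, and the approach assumes the wrapped scoring function is deterministic (it is not appropriate for an LLM judge whose answers vary between calls).

```python
import hashlib
from collections import OrderedDict
from typing import Any, Callable


class CachedEvaluator:
    """Wraps a scoring function with a bounded least-recently-used cache."""

    def __init__(self, score_fn: Callable[[str], Any], max_size: int = 10_000):
        self._score_fn = score_fn
        self._max_size = max_size
        self._cache: "OrderedDict[str, Any]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def evaluate(self, output: str) -> Any:
        # Hash the text so that long outputs do not become long dictionary keys.
        key = hashlib.sha256(output.encode("utf-8")).hexdigest()
        if key in self._cache:
            self.hits += 1
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        self.misses += 1
        result = self._score_fn(output)
        self._cache[key] = result
        if len(self._cache) > self._max_size:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return result
```

The `hits` and `misses` counters give a quick way to check whether the workload really contains enough repeated inputs for caching to pay off.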

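Sampling strategy: balancing accuracy and performance

Under high traffic, tracing every request causes performance problems, so a sampling strategy can be put in place:

import random

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult

class SmartSampler(Sampler):
    """
    Smart sampler:
    - samples 100% of requests flagged as errors
    - samples 100% of high-cost LLM calls
    - samples ordinary requests at a fixed rate
    The decision is made when a span starts, so only attributes passed at
    span creation are visible here.
    """

    def __init__(self, default_rate=0.1):
        self.default_rate = default_rate

    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        attributes = attributes or {}

        # Always sample error requests
        if attributes.get("error", False):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        # Always sample high-cost LLM calls
        if name == "llm_inference":
            cost = attributes.get("llm.total_cost", 0.0)
            if cost > 0.01:  # cost above one cent
                return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        # Random sampling
        if random.random() < self.default_rate:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        return SamplingResult(Decision.DROP)

    def get_description(self) -> str:
        return f"SmartSampler(default_rate={self.default_rate})"

# Use the smart sampler: the sampler is fixed when the provider is created
provider = TracerProvider(sampler=SmartSampler(default_rate=0.05))
provider.add_span_processor(
    BatchSpanProcessor(
        # Assumed Phoenix OTLP/HTTP endpoint; adjust to your deployment
        OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces"),
        max_queue_size=1024,
        schedule_delay_millis=5000,
    )
)
trace.set_tracer_provider(provider)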
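
A sampler runs when a span starts, so values such as the final error status or the computed cost are usually not known yet. One way to honor the "always keep errors and expensive calls" rules is to decide after the span has finished, for example inside a custom exporter or processor. The function below sketches only that decision in plain Python; `keep_finished_span` and its parameters are hypothetical, and wiring it into an export pipeline is left out.

```python
import hashlib
from typing import Any, Mapping


def keep_finished_span(
    trace_id: int,
    attributes: Mapping[str, Any],
    is_error: bool,
    default_rate: float = 0.1,
    cost_threshold: float = 0.01,
) -> bool:
    """Decide whether a finished span should be exported.

    Errors and expensive LLM calls are always kept. Everything else is kept
    for a stable fraction of traces derived from the trace ID, so repeated
    calls for the same trace give the same answer.
    """
    if is_error:
        return True
    if float(attributes.get("llm.total_cost", 0.0)) > cost_threshold:
        return True
    # Map the 128-bit trace ID to a reproducible number in [0, 1).
    digest = hashlib.sha256(trace_id.to_bytes(16, "big")).digest()
    bucket = int.from_bytes(digest[:6], "big") / 2**48
    return bucket < default_rate
```

Deriving the ordinary-traffic decision from the trace ID keeps it consistent across the spans of one trace. Spans kept only by the error or cost rule can still end up without their parents, which is a trade-off of filtering span by span.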

Production Deployment and Monitoring

Benchmarking evaluator performance

Before deployment, benchmark each custom evaluator to make sure it meets the performance requirements:

import timeit
import numpy as np

def benchmark_evaluator(evaluator, test_cases):
    """Benchmark the latency and throughput of an evaluator"""
    times = []

    for output in test_cases:
        start = timeit.default_timer()
        evaluator.evaluate(output=output)
        end = timeit.default_timer()
        times.append(end - start)

    return {
        "p50": np.percentile(times, 50),
        "p90": np.percentile(times, 90),
        "p99": np.percentile(times, 99),
        "mean": np.mean(times),
        "throughput": len(test_cases) / sum(times)
    }

# Generate test cases (generate_test_output is a helper you provide)
test_outputs = [generate_test_output() for _ in range(1000)]

# Benchmark the sentiment evaluator
sentiment_evaluator = SentimentPolarityEvaluator()
results = benchmark_evaluator(sentiment_evaluator, test_outputs)
print(f"Sentiment evaluator performance: {results}")

# Benchmarking the LLM evaluator (note: asynchronous evaluation needs a different benchmarking method)
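
For asynchronous evaluators, summing per-call times would understate throughput because calls overlap. A minimal sketch of an async benchmark follows, using only `asyncio`; `benchmark_async_evaluator` and the stand-in `fake_evaluate` coroutine are assumptions for illustration, and a real run would pass a coroutine that wraps the evaluator's `async_evaluate`.

```python
import asyncio
import statistics
import time
from typing import Awaitable, Callable, Dict, List, Sequence


async def benchmark_async_evaluator(
    evaluate: Callable[[str], Awaitable[object]],
    test_cases: Sequence[str],
    concurrency: int = 8,
) -> Dict[str, float]:
    """Measure per-call latency and overall throughput of an async evaluator."""
    semaphore = asyncio.Semaphore(concurrency)
    latencies: List[float] = []

    async def timed_call(output: str) -> None:
        async with semaphore:  # cap the number of in-flight calls
            start = time.perf_counter()
            await evaluate(output)
            latencies.append(time.perf_counter() - start)

    wall_start = time.perf_counter()
    await asyncio.gather(*(timed_call(case) for case in test_cases))
    wall_time = time.perf_counter() - wall_start

    return {
        "p50": statistics.median(latencies),
        "max": max(latencies),
        "mean": statistics.fmean(latencies),
        # Throughput uses wall-clock time because the calls overlap.
        "throughput": len(test_cases) / wall_time,
    }


async def fake_evaluate(output: str) -> float:
    """Stand-in for an LLM evaluator: sleeps briefly instead of calling a model."""
    await asyncio.sleep(0.01)
    return 1.0


stats = asyncio.run(benchmark_async_evaluator(fake_evaluate, ["text"] * 20, concurrency=5))
print(stats)
```

The `concurrency` limit matters for LLM-backed evaluators, where the provider's rate limits usually cap how many requests can be in flight.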

Monitoring custom metrics

Expose custom evaluation metrics through Prometheus for continuous monitoring:

from prometheus_client import Gauge, Counter

# Define the metrics
EVALUATION_SCORE = Gauge(
    'phoenix_evaluation_score',
    'Custom evaluation scores',
    ['evaluator', 'model', 'environment']
)

EVALUATION_COUNT = Counter(
    'phoenix_evaluation_total',
    'Total number of evaluations performed',
    ['evaluator', 'result', 'model']
)

def monitor_evaluation(evaluator_name, model_name, score, success):
    """Record evaluation metrics"""
    EVALUATION_SCORE.labels(
        evaluator=evaluator_name,
        model=model_name,
        environment="production"
    ).set(score)

    EVALUATION_COUNT.labels(
        evaluator=evaluator_name,
        result="success" if success else "failure",
        model=model_name
    ).inc()

# Integrate monitoring into an evaluator
class MonitoredEvaluator(SentimentPolarityEvaluator):
    def evaluate(self, output=None, **kwargs):
        result = super().evaluate(output=output, **kwargs)
        monitor_evaluation(
            evaluator_name=self.name,
            model_name=kwargs.get("model_name", "unknown"),
            score=result.score,
            # Count invalid outputs as failures
            success=result.label != "invalid"
        )
        return result

High-availability deployment of extensions

For business-critical scenarios, consider deploying the evaluation service as a high-availability cluster:

# docker-compose.yml example
version: '3.8'
services:
  phoenix-evaluator:
    build: ./evaluator
    environment:
      - PHOENIX_HOST=phoenix-server:6006
      - EVALUATOR_THREADS=4
      - CACHE_SIZE=10000
    # Replica count and resource limits belong under "deploy"
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: '2'
          memory: 2G
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  phoenix-server:
    image: arizephoenix/phoenix:latest
    ports:
      - "6006:6006"
    volumes:
      - phoenix-data:/data
    environment:
      - PHOENIX_PORT=6006
      - LOG_LEVEL=info

volumes:
  phoenix-data:
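
The healthcheck in the compose file expects the evaluator container to answer on http://localhost:8080/health. How the evaluator service exposes that endpoint is not specified in this guide; the following is one minimal sketch using only the Python standard library. `HealthHandler` and `start_health_server` are hypothetical names.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /health with 200 and a small JSON body; 404 otherwise."""

    def do_GET(self) -> None:
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format: str, *args) -> None:
        """Silence per-request logging so health probes do not flood the logs."""


def start_health_server(port: int = 8080, host: str = "0.0.0.0") -> ThreadingHTTPServer:
    """Start the health endpoint in a background thread and return the server."""
    server = ThreadingHTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_health_server()` once at service start-up is enough; the returned server object can be shut down with `server.shutdown()` during a graceful stop. The compose healthcheck itself also requires `curl` to be installed in the evaluator image.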

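Best Practices and Common Problems

Design patterns for evaluation metrics

Choose the evaluation pattern that fits the business need:

Threshold pattern: set explicit thresholds to turn a continuous score into a categorical label

def threshold_based_evaluation(score: float, thresholds: list[float]) -> str:
    """Map a score to a label; thresholds = [acceptable_min, excellent_min]"""
    if score >= thresholds[1]:
        return "excellent"
    elif score >= thresholds[0]:
        return "acceptable"
    else:
        return "poor"

Comparison pattern: compare the relative performance of different models or versions

import numpy as np
from scipy import stats

def comparative_evaluation(baseline_scores: list[float], new_scores: list[float]) -> dict:
    """Compare a new model against the baseline model"""
    return {
        "mean_improvement": np.mean(new_scores) - np.mean(baseline_scores),
        "p_value": stats.ttest_ind(baseline_scores, new_scores).pvalue,
        # Assumes both lists score the same examples in the same order
        "better_rate": sum(n > b for n, b in zip(new_scores, baseline_scores)) / len(new_scores)
    }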
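
The comparison above pairs scores by position for better_rate but uses an independent-samples t-test for the p-value. When both models are scored on the same examples, a paired analysis is a natural complement. The sketch below estimates a 95% bootstrap interval for the mean per-example improvement using only the standard library; `paired_bootstrap` is a hypothetical helper, and the resample count and seed are arbitrary defaults.

```python
import random
from statistics import fmean
from typing import Dict, Sequence


def paired_bootstrap(
    baseline_scores: Sequence[float],
    new_scores: Sequence[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Dict[str, float]:
    """Estimate a 95% interval for the mean per-example improvement."""
    if len(baseline_scores) != len(new_scores) or not baseline_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [n - b for n, b in zip(new_scores, baseline_scores)]
    rng = random.Random(seed)  # fixed seed keeps the report reproducible
    # Resample the paired differences with replacement and record each mean.
    means = sorted(
        fmean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    return {
        "mean_improvement": fmean(diffs),
        "ci_low": means[int(0.025 * n_resamples)],
        "ci_high": means[int(0.975 * n_resamples) - 1],
    }
```

If the whole interval lies above zero, the new model improved on these examples by more than resampling noise would explain; an interval that straddles zero suggests collecting more examples before drawing a conclusion.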

Multi-dimensional pattern: combine several metrics into an overall assessment

def weighted_evaluation(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Compute the weighted overall score; metrics without a weight are ignored"""
    total = 0.0
    weight_sum = 0.0

    for name, score in scores.items():
        if name in weights:
            total += score * weights[name]
            weight_sum += weights[name]

    return total / weight_sum if weight_sum > 0 else 0.0

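Common problems and solutions

Problem	Solution	Complexity
Evaluator too slow	1. Add caching 2. Evaluate asynchronously 3. Profile and optimize the evaluator code	Medium
Inconsistent evaluation results	1. Increase the sample size 2. Refine the prompt 3. Use ensemble evaluation	High
Excessive storage use	1. Apply a data retention policy 2. Compress rarely accessed data 3. Optimize indexes	Medium
Tracing overhead too high	1. Smart sampling 2. Drop unnecessary attributes 3. Batch processing	Low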
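
Ensemble evaluation, mentioned in the table as a remedy for inconsistent results, can be as simple as taking the median of several judges and flagging cases where they disagree. The sketch below is plain Python; `ensemble_score`, the judge callables, and the 0.2 disagreement limit are illustrative assumptions.

```python
from statistics import median, pstdev
from typing import Callable, Dict, Sequence


def ensemble_score(
    judges: Sequence[Callable[[str], float]],
    output: str,
    max_spread: float = 0.2,
) -> Dict[str, object]:
    """Combine several judge scores into a median and flag disagreement."""
    if not judges:
        raise ValueError("at least one judge is required")
    scores = [judge(output) for judge in judges]
    spread = pstdev(scores)  # population standard deviation; 0.0 for one judge
    return {
        "score": median(scores),
        "spread": spread,
        # Large disagreement suggests the example needs human review.
        "needs_review": spread > max_spread,
    }
```

The median is less sensitive than the mean to a single outlier judge, and the `needs_review` flag turns disagreement into a signal rather than hiding it inside an average.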

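Debugging tips

When custom evaluation or tracing logic misbehaves, the following techniques help:

Local reproduction: start debug mode with the phoenix debug command
Span inspection: check span attributes with session.query_spans()
More logging: add detailed logs at key points
Unit tests: write dedicated tests for the evaluation logic

def test_sentiment_evaluator():
    """Test the sentiment evaluator (VADER's lexicon is English, so the samples are too)"""
    evaluator = SentimentPolarityEvaluator()

    # Positive text
    positive_result = evaluator.evaluate(output="Thank you so much for your great help, the problem is solved!")
    assert positive_result.score > 0.5
    assert positive_result.label == "positive"

    # Negative text
    negative_result = evaluator.evaluate(output="This is the worst service I have ever had, completely unacceptable!")
    assert negative_result.score < -0.5
    assert negative_result.label == "negative"

    # Neutral text
    neutral_result = evaluator.evaluate(output="Order number #12345")
    assert abs(neutral_result.score) < 0.2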

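Outlook and Advanced Directions

The extensibility of Phoenix SDK opens up a lot of room for LLM observability. Directions worth exploring include:

Adaptive evaluation: adjust the evaluation strategy dynamically based on model performance
Causal analysis: use trace data to identify the key factors that affect performance
Automated optimization: adjust model parameters or prompts automatically based on evaluation results
Multimodal evaluation: extend evaluation to images, audio, and other modalities

As LLM applications grow more complex, custom observability becomes a key capability for safeguarding system quality. The extension mechanisms in Phoenix SDK let AI engineers build an observability system that fits their business needs.

Summary

This article covered the extension capabilities of Phoenix SDK: with custom evaluation metrics and tracing logic, developers can build an LLM observability system that matches their business needs. The key points are:

Evaluation metrics should be designed for business relevance, interpretability, and efficiency
Custom evaluators are implemented by inheriting from the Evaluator class, in either code-based or LLM-based form
Tracing extensions build on OpenTelemetry and are implemented through custom spans and attributes
Performance and storage optimization are key considerations for production deployment
Several evaluation patterns and best practices are available for different business scenarios

With the methods described here, you can make fuller use of Phoenix SDK to build quality assurance around an LLM application, improving reliability while speeding up model iteration.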


Author's statement: parts of this article were generated with AI assistance (AIGC) and are for reference only