trajectory 평가로 에이전트를 평가하는 방법

많은 에이전트 동작은 실제 LLM을 사용할 때만 나타납니다. 예를 들어 에이전트가 어떤 도구를 호출할지, 응답을 어떻게 포맷할지, 또는 프롬프트 수정이 전체 실행 trajectory에 영향을 미치는지 등입니다. LangChain의 agentevals 패키지는 실제 모델로 에이전트 trajectory를 테스트하기 위해 특별히 설계된 evaluator를 제공합니다.

이 가이드는 오픈 소스 LangChain agentevals 패키지를 다루며, 이는 trajectory 평가를 위해 LangSmith와 통합됩니다.

AgentEvals를 사용하면 trajectory match 또는 _LLM judge_를 사용하여 에이전트의 trajectory(도구 호출을 포함한 정확한 메시지 시퀀스)를 평가할 수 있습니다:

Trajectory match

주어진 입력에 대한 참조 trajectory를 하드코딩하고 단계별 비교를 통해 실행을 검증합니다.예상되는 동작을 알고 있는 잘 정의된 워크플로우를 테스트하는 데 이상적입니다. 어떤 도구가 어떤 순서로 호출되어야 하는지에 대한 구체적인 기대가 있을 때 사용하세요. 이 접근 방식은 추가 LLM 호출이 필요하지 않으므로 결정론적이고 빠르며 비용 효율적입니다.

LLM-as-judge

LLM을 사용하여 에이전트의 실행 trajectory를 정성적으로 검증합니다. “judge” LLM은 프롬프트 루브릭(참조 trajectory를 포함할 수 있음)에 대해 에이전트의 결정을 검토합니다.더 유연하며 효율성과 적절성과 같은 미묘한 측면을 평가할 수 있지만, LLM 호출이 필요하고 덜 결정론적입니다. 엄격한 도구 호출이나 순서 요구 사항 없이 에이전트 trajectory의 전반적인 품질과 합리성을 평가하고자 할 때 사용하세요.

AgentEvals 설치하기

pip install agentevals

또는 AgentEvals repository를 직접 클론하세요.

Trajectory match evaluator

AgentEvals는 Python에서 create_trajectory_match_evaluator 함수를, TypeScript에서 createTrajectoryMatchEvaluator를 제공하여 에이전트의 trajectory를 참조 trajectory와 매칭합니다. 다음 모드를 사용할 수 있습니다:

모드	설명	사용 사례
`strict`	동일한 순서로 메시지와 도구 호출의 정확한 일치	특정 시퀀스 테스트 (예: 권한 부여 전 정책 조회)
`unordered`	동일한 도구 호출이 임의의 순서로 허용됨	순서가 중요하지 않을 때 정보 검색 검증
`subset`	에이전트가 참조의 도구만 호출 (추가 없음)	에이전트가 예상 범위를 초과하지 않도록 보장
`superset`	에이전트가 최소한 참조 도구를 호출 (추가 허용)	최소 필수 작업이 수행되는지 검증

Strict match

strict 모드는 trajectory가 동일한 순서로 동일한 도구 호출과 함께 동일한 메시지를 포함하도록 보장하지만, 메시지 내용의 차이는 허용합니다. 이는 작업 권한 부여 전에 정책 조회를 요구하는 것과 같이 특정 작업 시퀀스를 강제해야 할 때 유용합니다.

from langchain.agents import create_agent
from langchain.tools import tool
from langchain.messages import HumanMessage, AIMessage, ToolMessage
from agentevals.trajectory.match import create_trajectory_match_evaluator


@tool
def get_weather(city: str):
    """Get weather information for a city."""
    return f"It's 75 degrees and sunny in {city}."

agent = create_agent("openai:gpt-4o", tools=[get_weather])

evaluator = create_trajectory_match_evaluator(  
    trajectory_match_mode="strict",  
)  

def test_weather_tool_called_strict():
    result = agent.invoke({
        "messages": [HumanMessage(content="What's the weather in San Francisco?")]
    })

    reference_trajectory = [
        HumanMessage(content="What's the weather in San Francisco?"),
        AIMessage(content="", tool_calls=[
            {"id": "call_1", "name": "get_weather", "args": {"city": "San Francisco"}}
        ]),
        ToolMessage(content="It's 75 degrees and sunny in San Francisco.", tool_call_id="call_1"),
        AIMessage(content="The weather in San Francisco is 75 degrees and sunny."),
    ]

    evaluation = evaluator(
        outputs=result["messages"],
        reference_outputs=reference_trajectory
    )
    # {
    #     'key': 'trajectory_strict_match',
    #     'score': True,
    #     'comment': None,
    # }
    assert evaluation["score"] is True

Unordered match

unordered 모드는 동일한 도구 호출을 임의의 순서로 허용하며, 올바른 도구 세트가 호출되는지 확인하고 싶지만 시퀀스는 중요하지 않을 때 유용합니다. 예를 들어, 에이전트가 도시의 날씨와 이벤트를 모두 확인해야 할 수 있지만 순서는 중요하지 않습니다.

from langchain.agents import create_agent
from langchain.tools import tool
from langchain.messages import HumanMessage, AIMessage, ToolMessage
from agentevals.trajectory.match import create_trajectory_match_evaluator


@tool
def get_weather(city: str):
    """Get weather information for a city."""
    return f"It's 75 degrees and sunny in {city}."

@tool
def get_events(city: str):
    """Get events happening in a city."""
    return f"Concert at the park in {city} tonight."

agent = create_agent("openai:gpt-4o", tools=[get_weather, get_events])

evaluator = create_trajectory_match_evaluator(  
    trajectory_match_mode="unordered",  
)  

def test_multiple_tools_any_order():
    result = agent.invoke({
        "messages": [HumanMessage(content="What's happening in SF today?")]
    })

    # Reference shows tools called in different order than actual execution
    reference_trajectory = [
        HumanMessage(content="What's happening in SF today?"),
        AIMessage(content="", tool_calls=[
            {"id": "call_1", "name": "get_events", "args": {"city": "SF"}},
            {"id": "call_2", "name": "get_weather", "args": {"city": "SF"}},
        ]),
        ToolMessage(content="Concert at the park in SF tonight.", tool_call_id="call_1"),
        ToolMessage(content="It's 75 degrees and sunny in SF.", tool_call_id="call_2"),
        AIMessage(content="Today in SF: 75 degrees and sunny with a concert at the park tonight."),
    ]

    evaluation = evaluator(
        outputs=result["messages"],
        reference_outputs=reference_trajectory,
    )
    # {
    #     'key': 'trajectory_unordered_match',
    #     'score': True,
    # }
    assert evaluation["score"] is True

Subset과 superset match

superset과 subset 모드는 도구 호출의 순서보다는 어떤 도구가 호출되는지에 초점을 맞추며, 에이전트의 도구 호출이 참조와 얼마나 엄격하게 일치해야 하는지를 제어할 수 있습니다.

superset 모드는 실행에서 몇 가지 핵심 도구가 호출되는지 확인하고 싶지만 에이전트가 추가 도구를 호출하는 것은 괜찮을 때 사용하세요. 에이전트의 trajectory는 참조 trajectory의 모든 도구 호출을 최소한 포함해야 하며, 참조를 넘어서는 추가 도구 호출을 포함할 수 있습니다.
subset 모드는 에이전트가 참조를 넘어서는 관련 없거나 불필요한 도구를 호출하지 않았는지 확인하여 에이전트 효율성을 보장하는 데 사용하세요. 에이전트의 trajectory는 참조 trajectory에 나타나는 도구 호출만 포함해야 합니다.

다음 예제는 superset 모드를 보여주며, 참조 trajectory는 get_weather 도구만 필요하지만 에이전트는 추가 도구를 호출할 수 있습니다:

from langchain.agents import create_agent
from langchain.tools import tool
from langchain.messages import HumanMessage, AIMessage, ToolMessage
from agentevals.trajectory.match import create_trajectory_match_evaluator


@tool
def get_weather(city: str):
    """Get weather information for a city."""
    return f"It's 75 degrees and sunny in {city}."

@tool
def get_detailed_forecast(city: str):
    """Get detailed weather forecast for a city."""
    return f"Detailed forecast for {city}: sunny all week."

agent = create_agent("openai:gpt-4o", tools=[get_weather, get_detailed_forecast])

evaluator = create_trajectory_match_evaluator(  
    trajectory_match_mode="superset",  
)  

def test_agent_calls_required_tools_plus_extra():
    result = agent.invoke({
        "messages": [HumanMessage(content="What's the weather in Boston?")]
    })

    # Reference only requires get_weather, but agent may call additional tools
    reference_trajectory = [
        HumanMessage(content="What's the weather in Boston?"),
        AIMessage(content="", tool_calls=[
            {"id": "call_1", "name": "get_weather", "args": {"city": "Boston"}},
        ]),
        ToolMessage(content="It's 75 degrees and sunny in Boston.", tool_call_id="call_1"),
        AIMessage(content="The weather in Boston is 75 degrees and sunny."),
    ]

    evaluation = evaluator(
        outputs=result["messages"],
        reference_outputs=reference_trajectory,
    )
    # {
    #     'key': 'trajectory_superset_match',
    #     'score': True,
    #     'comment': None,
    # }
    assert evaluation["score"] is True

tool_args_match_mode (Python) 또는 toolArgsMatchMode (TypeScript) 속성과 tool_args_match_overrides (Python) 또는 toolArgsMatchOverrides (TypeScript) 속성을 설정하여 실제 trajectory와 참조 간의 도구 호출 간 동등성을 evaluator가 고려하는 방식을 사용자 정의할 수도 있습니다. 기본적으로 동일한 도구에 대한 동일한 인수를 가진 도구 호출만 동등한 것으로 간주됩니다. 자세한 내용은 repository를 참조하세요.

LLM-as-judge evaluator

이 섹션은 agentevals 패키지의 trajectory 전용 LLM-as-a-judge evaluator를 다룹니다. LangSmith의 범용 LLM-as-a-judge evaluator에 대해서는 LLM-as-a-judge evaluator를 참조하세요.

LLM을 사용하여 에이전트의 실행 경로를 평가할 수도 있습니다. trajectory match evaluator와 달리 참조 trajectory가 필요하지 않지만, 사용 가능한 경우 제공할 수 있습니다.

참조 trajectory 없이

from langchain.agents import create_agent
from langchain.tools import tool
from langchain.messages import HumanMessage, AIMessage, ToolMessage
from agentevals.trajectory.llm import create_trajectory_llm_as_judge, TRAJECTORY_ACCURACY_PROMPT


@tool
def get_weather(city: str):
    """Get weather information for a city."""
    return f"It's 75 degrees and sunny in {city}."

agent = create_agent("openai:gpt-4o", tools=[get_weather])

evaluator = create_trajectory_llm_as_judge(  
    model="openai:o3-mini",  
    prompt=TRAJECTORY_ACCURACY_PROMPT,  
)  

def test_trajectory_quality():
    result = agent.invoke({
        "messages": [HumanMessage(content="What's the weather in Seattle?")]
    })

    evaluation = evaluator(
        outputs=result["messages"],
    )
    # {
    #     'key': 'trajectory_accuracy',
    #     'score': True,
    #     'comment': 'The provided agent trajectory is reasonable...'
    # }
    assert evaluation["score"] is True

참조 trajectory와 함께

참조 trajectory가 있는 경우 프롬프트에 추가 변수를 추가하고 참조 trajectory를 전달할 수 있습니다. 아래에서는 사전 구축된 TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE 프롬프트를 사용하고 reference_outputs 변수를 구성합니다:

evaluator = create_trajectory_llm_as_judge(
    model="openai:o3-mini",
    prompt=TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
)
evaluation = judge_with_reference(
    outputs=result["messages"],
    reference_outputs=reference_trajectory,
)

LLM이 trajectory를 평가하는 방식에 대한 더 많은 구성 가능성을 원하시면 repository를 참조하세요.

Async 지원 (Python)

모든 agentevals evaluator는 Python asyncio를 지원합니다. factory 함수를 사용하는 evaluator의 경우, 함수 이름의 create_ 뒤에 async를 추가하여 async 버전을 사용할 수 있습니다. 다음은 async judge와 evaluator를 사용하는 예제입니다:

from agentevals.trajectory.llm import create_async_trajectory_llm_as_judge, TRAJECTORY_ACCURACY_PROMPT
from agentevals.trajectory.match import create_async_trajectory_match_evaluator

async_judge = create_async_trajectory_llm_as_judge(
    model="openai:o3-mini",
    prompt=TRAJECTORY_ACCURACY_PROMPT,
)

async_evaluator = create_async_trajectory_match_evaluator(
    trajectory_match_mode="strict",
)

async def test_async_evaluation():
    result = await agent.ainvoke({
        "messages": [HumanMessage(content="What's the weather?")]
    })

    evaluation = await async_judge(outputs=result["messages"])
    assert evaluation["score"] is True

Edit the source of this page on GitHub.

Connect these docs programmatically to Claude, VSCode, and more via MCP for real-time answers.

Datasets

Set up evaluations

Analyze experiment results

Annotation & human feedback

Common data types

trajectory 평가로 에이전트를 평가하는 방법

Trajectory match

LLM-as-judge

AgentEvals 설치하기

Trajectory match evaluator

Strict match

Unordered match

Subset과 superset match

LLM-as-judge evaluator

참조 trajectory 없이

참조 trajectory와 함께

Async 지원 (Python)

Datasets

Set up evaluations

Analyze experiment results

Annotation & human feedback

Common data types

Trajectory match

LLM-as-judge

​AgentEvals 설치하기

​Trajectory match evaluator

​Strict match

​Unordered match

​Subset과 superset match

​LLM-as-judge evaluator

​참조 trajectory 없이

​참조 trajectory와 함께

​Async 지원 (Python)

AgentEvals 설치하기

Trajectory match evaluator

Strict match

Unordered match

Subset과 superset match

LLM-as-judge evaluator

참조 trajectory 없이

참조 trajectory와 함께

Async 지원 (Python)