langgraph is a library for building stateful, multi-actor applications with LLMs, used to create agent and multi-agent workflows. Evaluating langgraph graphs can be challenging because a single invocation can involve many LLM calls, and which LLM calls are made can depend on the outputs of preceding calls. In this guide we will focus on the mechanics of how to pass graphs and graph nodes to evaluate() / aevaluate(). For evaluation techniques and best practices when building agents, see the langgraph docs.
End-to-end evaluations
The most common type of evaluation is an end-to-end evaluation, in which you want to evaluate the final graph output for each example input.
Define a graph
Let's construct a simple ReACT agent to start:
from typing import Annotated, Literal, TypedDict

from langchain.chat_models import init_chat_model
from langchain.tools import tool
from langgraph.prebuilt import ToolNode
from langgraph.graph import END, START, StateGraph
from langgraph.graph.message import add_messages


class State(TypedDict):
    # Messages have the type "list". The 'add_messages' function
    # in the annotation defines how this state key should be updated
    # (in this case, it appends messages to the list, rather than overwriting them)
    messages: Annotated[list, add_messages]


# Define the tools for the agent to use
@tool
def search(query: str) -> str:
    """Call to surf the web."""
    # This is a placeholder, but don't tell the LLM that...
    if "sf" in query.lower() or "san francisco" in query.lower():
        return "It's 60 degrees and foggy."
    return "It's 90 degrees and sunny."


tools = [search]
tool_node = ToolNode(tools)
model = init_chat_model("claude-3-5-sonnet-latest").bind_tools(tools)


# Define the function that determines whether to continue or not
def should_continue(state: State) -> Literal["tools", END]:
    messages = state['messages']
    last_message = messages[-1]
    # If the LLM makes a tool call, then we route to the "tools" node
    if last_message.tool_calls:
        return "tools"
    # Otherwise, we stop (reply to the user)
    return END


# Define the function that calls the model
def call_model(state: State):
    messages = state['messages']
    response = model.invoke(messages)
    # We return a list, because this will get added to the existing list
    return {"messages": [response]}


# Define a new graph
workflow = StateGraph(State)

# Define the two nodes we will cycle between
workflow.add_node("agent", call_model)
workflow.add_node("tools", tool_node)

# Set the entrypoint as 'agent'
# This means that this node is the first one called
workflow.add_edge(START, "agent")

# We now add a conditional edge
workflow.add_conditional_edges(
    # First, we define the start node. We use 'agent'.
    # This means these are the edges taken after the 'agent' node is called.
    "agent",
    # Next, we pass in the function that will determine which node is called next.
    should_continue,
)

# We now add a normal edge from 'tools' to 'agent'.
# This means that after 'tools' is called, 'agent' node is called next.
workflow.add_edge("tools", 'agent')

# Finally, we compile it!
# This compiles it into a LangChain Runnable,
# meaning you can use it as you would any other runnable.
app = workflow.compile()
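If you'd like to sanity-check the compiled graph before wiring it into an evaluation, you can invoke it directly. A minimal sketch (the question string here is arbitrary):

# Optional smoke test: invoke the compiled graph and inspect the final state.
result = app.invoke({"messages": [{"role": "user", "content": "what's the weather in sf"}]})
print(result["messages"][-1].content)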
Create a dataset
Let's create a simple dataset of questions and expected responses:
from langsmith import Client

questions = [
    "what's the weather in sf",
    "whats the weather in san fran",
    "whats the weather in tangier"
]
answers = [
    "It's 60 degrees and foggy.",
    "It's 60 degrees and foggy.",
    "It's 90 degrees and sunny.",
]

ls_client = Client()

dataset = ls_client.create_dataset("weather agent")
ls_client.create_examples(
    inputs=[{"question": q} for q in questions],
    outputs=[{"answer": a} for a in answers],
    dataset_id=dataset.id,
)
Create an evaluator
And let's create a simple evaluator. Requires langsmith>=0.2.0:
judge_llm = init_chat_model("gpt-4o")

async def correct(outputs: dict, reference_outputs: dict) -> bool:
    instructions = (
        "Given an actual answer and an expected answer, determine whether"
        " the actual answer contains all of the information in the"
        " expected answer. Respond with 'CORRECT' if the actual answer"
        " does contain all of the expected information and 'INCORRECT'"
        " otherwise. Do not include anything else in your response."
    )
    # Our graph outputs a State dictionary, which in this case means
    # we'll have a 'messages' key and the final message should
    # be our actual answer.
    actual_answer = outputs["messages"][-1].content
    expected_answer = reference_outputs["answer"]
    user_msg = (
        f"ACTUAL ANSWER: {actual_answer}"
        f"\n\nEXPECTED ANSWER: {expected_answer}"
    )
    response = await judge_llm.ainvoke(
        [
            {"role": "system", "content": instructions},
            {"role": "user", "content": user_msg}
        ]
    )
    return response.content.upper() == "CORRECT"
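If you want to spot-check the judge on a hand-written pair before running a full experiment, you can call the evaluator directly. A minimal sketch, where an AIMessage simply stands in for the graph's final message:

from langchain_core.messages import AIMessage

# Optional: try the evaluator on a toy output/reference pair.
await correct(
    outputs={"messages": [AIMessage(content="It's 60 degrees and foggy.")]},
    reference_outputs={"answer": "It's 60 degrees and foggy."},
)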
Run evaluations
Now we can run our evaluation and explore the results. We just need to wrap our graph function so that it can take inputs in the format they're stored in our examples. If all of your graph nodes are defined as sync functions, you can use either evaluate or aevaluate; if any of the nodes are defined as async, you'll need to use aevaluate. Requires langsmith>=0.2.0:
from langsmith import aevaluate

def example_to_state(inputs: dict) -> dict:
    return {"messages": [{"role": "user", "content": inputs['question']}]}

# We use LCEL declarative syntax here.
# Remember that langgraph graphs are also langchain runnables.
target = example_to_state | app

experiment_results = await aevaluate(
    target,
    data="weather agent",
    evaluators=[correct],
    max_concurrency=4,  # optional
    experiment_prefix="claude-3.5-baseline",  # optional
)
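Since every node in this graph is synchronous, the same experiment could also be run with the synchronous evaluate. A minimal sketch, assuming the same target and dataset as above; the answered evaluator here is just an illustrative placeholder, since the correct judge above is async:

from langsmith import evaluate

def answered(outputs: dict) -> bool:
    # A trivial sync check: did the agent produce a non-empty final message?
    return bool(outputs["messages"][-1].content)

experiment_results = evaluate(
    target,
    data="weather agent",
    evaluators=[answered],
    max_concurrency=4,  # optional
)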
Evaluate intermediate steps
Often it is valuable to evaluate not only the final output of the agent but also its intermediate steps. What's nice about langgraph is that the graph's output is a state object that already keeps track of the intermediate steps taken. Usually we can evaluate whatever we're interested in just by looking at the messages in that state. For example, we can look at the messages to check that the model's first step was to call the 'search' tool. Requires langsmith>=0.2.0:
def right_tool(outputs: dict) -> bool:
    tool_calls = outputs["messages"][1].tool_calls
    return bool(tool_calls and tool_calls[0]["name"] == "search")

experiment_results = await aevaluate(
    target,
    data="weather agent",
    evaluators=[correct, right_tool],
    max_concurrency=4,  # optional
    experiment_prefix="claude-3.5-baseline",  # optional
)
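Any other property of the output state can be checked the same way. For instance, here is a sketch of a simple (hypothetical) num_steps evaluator that scores how long the agent's trajectory was; it could be passed to aevaluate alongside correct and right_tool:

def num_steps(outputs: dict) -> int:
    # The returned state holds the full message history
    # (user question, AI tool call, tool result, final answer, ...),
    # so its length is a rough proxy for how many steps the agent took.
    return len(outputs["messages"])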
See this how-to guide for more details on the arguments you can pass to custom evaluators. For example, an evaluator can also accept the full Run and Example objects and read intermediate steps straight from the run tree:
from langsmith.schemas import Run, Example

def right_tool_from_run(run: Run, example: Example) -> dict:
    # Find the first model ("agent") run among the root run's children
    # and check which tool it called.
    first_model_run = next(r for r in run.child_runs if r.name == "agent")
    tool_calls = first_model_run.outputs["messages"][-1].tool_calls
    right_tool = bool(tool_calls and tool_calls[0]["name"] == "search")
    return {"key": "right_tool", "value": right_tool}

experiment_results = await aevaluate(
    target,
    data="weather agent",
    evaluators=[correct, right_tool_from_run],
    max_concurrency=4,  # optional
    experiment_prefix="claude-3.5-baseline",  # optional
)
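Run-based evaluators can also emit several pieces of feedback at once by returning a "results" list. A minimal sketch under that assumption (the num_model_calls key is hypothetical):

def trajectory_from_run(run: Run, example: Example) -> dict:
    # Collect all model ("agent") child runs from the trace.
    model_runs = [r for r in run.child_runs if r.name == "agent"]
    tool_calls = model_runs[0].outputs["messages"][-1].tool_calls if model_runs else []
    return {
        "results": [
            {"key": "right_tool", "value": bool(tool_calls and tool_calls[0]["name"] == "search")},
            {"key": "num_model_calls", "score": len(model_runs)},
        ]
    }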
Run and evaluate individual nodes
Sometimes you want to evaluate a single node directly, to save time and cost. langgraph makes it easy to do this, and in this case we can even keep using the evaluators we've been using.
node_target = example_to_state | app.nodes["agent"]

node_experiment_results = await aevaluate(
    node_target,
    data="weather agent",
    evaluators=[right_tool_from_run],
    max_concurrency=4,  # optional
    experiment_prefix="claude-3.5-model-node",  # optional
)
Reference code
A consolidated version of all the code above:
from typing import Annotated, Literal, TypedDict

from langchain.chat_models import init_chat_model
from langchain.tools import tool
from langgraph.prebuilt import ToolNode
from langgraph.graph import END, START, StateGraph
from langgraph.graph.message import add_messages
from langsmith import Client, aevaluate


# Define a graph
class State(TypedDict):
    # Messages have the type "list". The 'add_messages' function
    # in the annotation defines how this state key should be updated
    # (in this case, it appends messages to the list, rather than overwriting them)
    messages: Annotated[list, add_messages]


# Define the tools for the agent to use
@tool
def search(query: str) -> str:
    """Call to surf the web."""
    # This is a placeholder, but don't tell the LLM that...
    if "sf" in query.lower() or "san francisco" in query.lower():
        return "It's 60 degrees and foggy."
    return "It's 90 degrees and sunny."


tools = [search]
tool_node = ToolNode(tools)
model = init_chat_model("claude-3-5-sonnet-latest").bind_tools(tools)


# Define the function that determines whether to continue or not
def should_continue(state: State) -> Literal["tools", END]:
    messages = state['messages']
    last_message = messages[-1]
    # If the LLM makes a tool call, then we route to the "tools" node
    if last_message.tool_calls:
        return "tools"
    # Otherwise, we stop (reply to the user)
    return END


# Define the function that calls the model
def call_model(state: State):
    messages = state['messages']
    response = model.invoke(messages)
    # We return a list, because this will get added to the existing list
    return {"messages": [response]}


# Define a new graph
workflow = StateGraph(State)

# Define the two nodes we will cycle between
workflow.add_node("agent", call_model)
workflow.add_node("tools", tool_node)

# Set the entrypoint as 'agent'
# This means that this node is the first one called
workflow.add_edge(START, "agent")

# We now add a conditional edge
workflow.add_conditional_edges(
    # First, we define the start node. We use 'agent'.
    # This means these are the edges taken after the 'agent' node is called.
    "agent",
    # Next, we pass in the function that will determine which node is called next.
    should_continue,
)

# We now add a normal edge from 'tools' to 'agent'.
# This means that after 'tools' is called, 'agent' node is called next.
workflow.add_edge("tools", 'agent')

# Finally, we compile it!
# This compiles it into a LangChain Runnable,
# meaning you can use it as you would any other runnable.
app = workflow.compile()

questions = [
    "what's the weather in sf",
    "whats the weather in san fran",
    "whats the weather in tangier"
]
answers = [
    "It's 60 degrees and foggy.",
    "It's 60 degrees and foggy.",
    "It's 90 degrees and sunny.",
]

# Create a dataset
ls_client = Client()

dataset = ls_client.create_dataset("weather agent")
ls_client.create_examples(
    inputs=[{"question": q} for q in questions],
    outputs=[{"answer": a} for a in answers],
    dataset_id=dataset.id,
)

# Define evaluators
judge_llm = init_chat_model("gpt-4o")

async def correct(outputs: dict, reference_outputs: dict) -> bool:
    instructions = (
        "Given an actual answer and an expected answer, determine whether"
        " the actual answer contains all of the information in the"
        " expected answer. Respond with 'CORRECT' if the actual answer"
        " does contain all of the expected information and 'INCORRECT'"
        " otherwise. Do not include anything else in your response."
    )
    # Our graph outputs a State dictionary, which in this case means
    # we'll have a 'messages' key and the final message should
    # be our actual answer.
    actual_answer = outputs["messages"][-1].content
    expected_answer = reference_outputs["answer"]
    user_msg = (
        f"ACTUAL ANSWER: {actual_answer}"
        f"\n\nEXPECTED ANSWER: {expected_answer}"
    )
    response = await judge_llm.ainvoke(
        [
            {"role": "system", "content": instructions},
            {"role": "user", "content": user_msg}
        ]
    )
    return response.content.upper() == "CORRECT"


def right_tool(outputs: dict) -> bool:
    tool_calls = outputs["messages"][1].tool_calls
    return bool(tool_calls and tool_calls[0]["name"] == "search")


# Wrap the graph so it accepts inputs in the format stored in our examples
def example_to_state(inputs: dict) -> dict:
    return {"messages": [{"role": "user", "content": inputs['question']}]}

target = example_to_state | app

# Run evaluation
experiment_results = await aevaluate(
    target,
    data="weather agent",
    evaluators=[correct, right_tool],
    max_concurrency=4,  # optional
    experiment_prefix="claude-3.5-baseline",  # optional
)