This tutorial shows you how to evaluate an LLM application using LangSmith's integrations with popular testing tools (Pytest, Vitest, and Jest). We'll build a ReAct agent that answers questions about publicly traded stocks and write a comprehensive test suite for it.

Setup

This tutorial uses LangGraph for agent orchestration, OpenAI's GPT-4o, Tavily for search, E2B's code interpreter, and Polygon for retrieving stock data, but it can be adapted to other frameworks, models, and tools with only minor modifications. Tavily, E2B, and Polygon are all free to sign up for.

Installation

First, install the packages required to build the agent:
pip install -U langgraph "langchain[openai]" langchain-community e2b-code-interpreter
Next, install the testing framework:
# Make sure you have langsmith>=0.3.1
pip install -U "langsmith[pytest]"

Environment Variables

Set the following environment variables:
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY=<YOUR_LANGSMITH_API_KEY>
export OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>
export TAVILY_API_KEY=<YOUR_TAVILY_API_KEY>
export E2B_API_KEY=<YOUR_E2B_API_KEY>
export POLYGON_API_KEY=<YOUR_POLYGON_API_KEY>
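If you'd rather keep these variables in a .env file than export them in your shell, a minimal sketch using python-dotenv (an extra dependency, not part of the install steps above) looks like this:
# Optional: load environment variables from a .env file.
# Assumes python-dotenv is installed: pip install python-dotenv
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current working directory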

Create your app

To define our ReAct agent, we'll use LangGraph/LangGraph.js for orchestration and LangChain for the LLM and tools.

Define tools

First, let's define the tools our agent will use. There are three tools:
  • A search tool, using Tavily
  • A code interpreter tool, using E2B
  • A stock information tool, using Polygon
from langchain_community.tools import TavilySearchResults
from e2b_code_interpreter import Sandbox
from langchain_community.tools.polygon.aggregates import PolygonAggregates
from langchain_community.utilities.polygon import PolygonAPIWrapper
from typing_extensions import Annotated, Literal, TypedDict

# Define search tool
search_tool = TavilySearchResults(
  max_results=5,
  include_raw_content=True,
)

# Define code tool
def code_tool(code: str) -> str:
  """Execute python code and return the result."""
  sbx = Sandbox()
  execution = sbx.run_code(code)

  if execution.error:
      return f"Error: {execution.error}"
  return f"Results: {execution.results}, Logs: {execution.logs}"

# Define input schema for stock ticker tool
class TickerToolInput(TypedDict):
  """Input format for the ticker tool.
    The tool will pull data in aggregate blocks (timespan_multiplier * timespan) from the from_date to the to_date
  """
  ticker: Annotated[str, ..., "The ticker symbol of the stock"]
  timespan: Annotated[Literal["minute", "hour", "day", "week", "month", "quarter", "year"], ..., "The size of the time window."]
  timespan_multiplier: Annotated[int, ..., "The multiplier for the time window"]
  from_date: Annotated[str, ..., "The date to start pulling data from, YYYY-MM-DD format - ONLY include the year month and day"]
  to_date: Annotated[str, ..., "The date to stop pulling data, YYYY-MM-DD format - ONLY include the year month and day"]

api_wrapper = PolygonAPIWrapper()
polygon_aggregates = PolygonAggregates(api_wrapper=api_wrapper)

# Define stock ticker tool
def ticker_tool(query: TickerToolInput) -> str:
  """Pull data for the ticker."""
  return polygon_aggregates.invoke(query)
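Before wiring these tools into the agent, it can help to sanity-check one of them in isolation. The snippet below is an optional sketch (not part of the app itself) that calls code_tool directly; it assumes E2B_API_KEY is set, since Sandbox() reads it from the environment:
# Optional sanity check: run the code tool on its own.
print(code_tool("sum([66.76, 25.24, 37.11, 47.52, 78.31]) / 5"))
# The returned string should include 50.988 somewhere in the results/logs.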

Define agent

Now that we have all of our tools defined, we can create the agent using create_agent.
from typing_extensions import Annotated, TypedDict
from langchain.agents import create_agent


class AgentOutputFormat(TypedDict):
    numeric_answer: Annotated[float | None, ..., "The numeric answer, if the user asked for one"]
    text_answer: Annotated[str | None, ..., "The text answer, if the user asked for one"]
    reasoning: Annotated[str, ..., "The reasoning behind the answer"]

agent = create_agent(
    model="openai:gpt-4o-mini",
    tools=[code_tool, search_tool, polygon_aggregates],
    response_format=AgentOutputFormat,
    system_prompt="You are a financial expert. Respond to the users query accurately",
)
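Before writing formal tests, you can smoke-test the agent interactively. This is a rough sketch, assuming the parsed output is returned under the structured_response key (which is also what the tests below rely on):
# Ad-hoc smoke test (not part of the test suite).
result = agent.invoke(
    {"messages": [{"role": "user", "content": "What is the price of Apple?"}]}
)
print(result["structured_response"]["text_answer"])
print(result["structured_response"]["reasoning"])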

Write tests

Now that we've defined our agent, let's write some tests to ensure its basic functionality. In this tutorial we're going to test whether the agent's tool-calling abilities are working, whether the agent knows to ignore irrelevant questions, and whether it can answer complex questions that involve using all of its tools. First, we need to set up a test file and add the imports needed at the top of the file.
Create a `tests/test_agent.py` file.

from app import agent, polygon_aggregates, search_tool # import from wherever your agent is defined
import pytest
from langsmith import testing as t

Test 1: Handle off-topic questions

The first test is a simple check that the agent doesn't use tools on irrelevant queries.
@pytest.mark.langsmith
@pytest.mark.parametrize(
  # <-- Can still use all normal pytest markers
  "query",
  ["Hello!", "How are you doing?"],
)
def test_no_tools_on_offtopic_query(query: str) -> None:
  """Test that the agent does not use tools on offtopic queries."""
  # Log the test example
  t.log_inputs({"query": query})
  expected = []
  t.log_reference_outputs({"tool_calls": expected})
  # Call the agent's model node directly instead of running the ReACT loop.
  result = agent.nodes["agent"].invoke(
      {"messages": [{"role": "user", "content": query}]}
  )
  actual = result["messages"][0].tool_calls
  t.log_outputs({"tool_calls": actual})
  # Check that no tool calls were made.
  assert actual == expected

Test 2: Simple tool calling

For tool calling, we'll verify that the agent calls the correct tool with the correct parameters.
@pytest.mark.langsmith
def test_searches_for_correct_ticker() -> None:
  """Test that the model looks up the correct ticker on simple query."""
  # Log the test example
  query = "What is the price of Apple?"
  t.log_inputs({"query": query})
  expected = "AAPL"
  t.log_reference_outputs({"ticker": expected})
  # Call the agent's model node directly instead of running the full ReACT loop.
  result = agent.nodes["agent"].invoke(
      {"messages": [{"role": "user", "content": query}]}
  )
  tool_calls = result["messages"][0].tool_calls
  if tool_calls[0]["name"] == polygon_aggregates.name:
      actual = tool_calls[0]["args"]["ticker"]
  else:
      actual = None
  t.log_outputs({"ticker": actual})
  # Check that the right ticker was queried
  assert actual == expected

Test 3: Complex tool calling

Some tool calls are easier to test than others. For the ticker lookup, we can assert that the correct ticker was searched. For the coding tool, the inputs and outputs are much less constrained, and there are many ways to get to the right answer. In this case it's simpler to test that the tool is used correctly by running the full agent and asserting both that it calls the coding tool and that it arrives at the correct answer.
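The expected value used below is just the arithmetic mean of the five returns given in the prompt: (66.76 + 25.24 + 37.11 + 47.52 + 78.31) / 5 = 50.988.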
@pytest.mark.langsmith
def test_executes_code_when_needed() -> None:
  query = (
      "In the past year Facebook stock went up by 66.76%, "
      "Apple by 25.24%, Google by 37.11%, Amazon by 47.52%, "
      "Netflix by 78.31%. Whats the avg return in the past "
      "year of the FAANG stocks, expressed as a percentage?"
  )
  t.log_inputs({"query": query})
  expected = 50.988
  t.log_reference_outputs({"response": expected})
  # Test that the agent executes code when needed
  result = agent.invoke({"messages": [{"role": "user", "content": query}]})
  t.log_outputs({"result": result["structured_response"].get("numeric_answer")})
  # Grab all the tool calls made by the LLM
  tool_calls = [
      tc["name"]
      for msg in result["messages"]
      for tc in getattr(msg, "tool_calls", [])
  ]
  # This will log the number of steps taken by the agent, which is useful for
  # determining how efficiently the agent gets to an answer.
  t.log_feedback(key="num_steps", score=len(result["messages"]) - 1)
  # Assert that the code tool was used
  assert "code_tool" in tool_calls
  # Assert that a numeric answer was provided:
  assert result["structured_response"].get("numeric_answer") is not None
  # Assert that the answer is correct
  assert abs(result["structured_response"]["numeric_answer"] - expected) <= 0.01

Test 4: LLM-as-a-judge

We'll run an LLM-as-a-judge evaluation to check that the agent's answer is grounded in the search results. To trace the LLM-as-a-judge call separately from the rest of the agent, we'll use the trace_feedback context manager provided by LangSmith in Python, and the wrapEvaluator function in JS/TS.
from typing_extensions import Annotated, TypedDict
from langchain.chat_models import init_chat_model

class Grade(TypedDict):
  """Evaluate the groundedness of an answer in source documents."""
  score: Annotated[
      bool,
      ...,
      "Return True if the answer is fully grounded in the source documents, otherwise False.",
  ]

judge_llm = init_chat_model("gpt-4o").with_structured_output(Grade)

@pytest.mark.langsmith
def test_grounded_in_source_info() -> None:
  """Test that response is grounded in the tool outputs."""
  query = "How did Nvidia stock do in 2024 according to analysts?"
  t.log_inputs({"query": query})
  result = agent.invoke({"messages": [{"role": "user", "content": query}]})
  # Grab all the search calls made by the LLM
  search_results = "\n\n".join(
      msg.content
      for msg in result["messages"]
      if msg.type == "tool" and msg.name == search_tool.name
  )
  t.log_outputs(
      {
          "response": result["structured_response"].get("text_answer"),
          "search_results": search_results,
      }
  )
  # Trace the feedback LLM run separately from the agent run.
  with t.trace_feedback():
      # Instructions for the LLM judge
      instructions = (
          "Grade the following ANSWER. "
          "The ANSWER should be fully grounded in (i.e. supported by) the source DOCUMENTS. "
          "Return True if the ANSWER is fully grounded in the DOCUMENTS. "
          "Return False if the ANSWER is not grounded in the DOCUMENTS."
      )
      answer_and_docs = (
          f"ANSWER: {result['structured_response'].get('text_answer', '')}\n"
          f"DOCUMENTS:\n{search_results}"
      )
      # Run the judge LLM
      grade = judge_llm.invoke(
          [
              {"role": "system", "content": instructions},
              {"role": "user", "content": answer_and_docs},
          ]
      )
      t.log_feedback(key="groundedness", score=grade["score"])
  assert grade['score']

Run tests

Once you've set up your config file (if you're using Vitest or Jest), you can run your tests with the following commands:
If you're using Vitest, create an `ls.vitest.config.ts` file:

import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    include: ["**/*.eval.?(c|m)[jt]s"],
    reporters: ["langsmith/vitest/reporter"],
    setupFiles: ["dotenv/config"],
  },
});
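You can then point Vitest at this config; the command below is a standard Vitest invocation and assumes your test files match the include pattern above (e.g. *.eval.ts):
vitest run --config ls.vitest.config.ts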
If you're using Pytest, run:
pytest --langsmith-output tests

Reference code

If you're using Vitest or Jest, don't forget to also add the config file to your project.

Agent

from e2b_code_interpreter import Sandbox
from langchain_community.tools import PolygonAggregates, TavilySearchResults
from langchain_community.utilities.polygon import PolygonAPIWrapper
from langchain.agents import create_agent
from typing_extensions import Annotated, TypedDict


search_tool = TavilySearchResults(
    max_results=5,
    include_raw_content=True,
)

def code_tool(code: str) -> str:
    """Execute python code and return the result."""
    sbx = Sandbox()
    execution = sbx.run_code(code)

    if execution.error:
        return f"Error: {execution.error}"
    return f"Results: {execution.results}, Logs: {execution.logs}"

polygon_aggregates = PolygonAggregates(api_wrapper=PolygonAPIWrapper())

class AgentOutputFormat(TypedDict):
    numeric_answer: Annotated[
        float | None, ..., "The numeric answer, if the user asked for one"
    ]
    text_answer: Annotated[
        str | None, ..., "The text answer, if the user asked for one"
    ]
    reasoning: Annotated[str, ..., "The reasoning behind the answer"]

agent = create_agent(
    model="openai:gpt-4o-mini",
    tools=[code_tool, search_tool, polygon_aggregates],
    response_format=AgentOutputFormat,
    system_prompt="You are a financial expert. Respond to the users query accurately",
)

Tests

# from app import agent, polygon_aggregates, search_tool # import from wherever your agent is defined
import pytest
from langchain.chat_models import init_chat_model
from langsmith import testing as t
from typing_extensions import Annotated, TypedDict

@pytest.mark.langsmith
@pytest.mark.parametrize(
  # <-- Can still use all normal pytest markers
  "query",
  ["Hello!", "How are you doing?"],
)
def test_no_tools_on_offtopic_query(query: str) -> None:
  """Test that the agent does not use tools on offtopic queries."""
  # Log the test example
  t.log_inputs({"query": query})
  expected = []
  t.log_reference_outputs({"tool_calls": expected})
  # Call the agent's model node directly instead of running the ReACT loop.
  result = agent.nodes["agent"].invoke(
      {"messages": [{"role": "user", "content": query}]}
  )
  actual = result["messages"][0].tool_calls
  t.log_outputs({"tool_calls": actual})
  # Check that no tool calls were made.
  assert actual == expected

@pytest.mark.langsmith
def test_searches_for_correct_ticker() -> None:
  """Test that the model looks up the correct ticker on simple query."""
  # Log the test example
  query = "What is the price of Apple?"
  t.log_inputs({"query": query})
  expected = "AAPL"
  t.log_reference_outputs({"ticker": expected})
  # Call the agent's model node directly instead of running the full ReACT loop.
  result = agent.nodes["agent"].invoke(
      {"messages": [{"role": "user", "content": query}]}
  )
  tool_calls = result["messages"][0].tool_calls
  if tool_calls[0]["name"] == polygon_aggregates.name:
      actual = tool_calls[0]["args"]["ticker"]
  else:
      actual = None
  t.log_outputs({"ticker": actual})
  # Check that the right ticker was queried
  assert actual == expected

@pytest.mark.langsmith
def test_executes_code_when_needed() -> None:
  query = (
      "In the past year Facebook stock went up by 66.76%, "
      "Apple by 25.24%, Google by 37.11%, Amazon by 47.52%, "
      "Netflix by 78.31%. Whats the avg return in the past "
      "year of the FAANG stocks, expressed as a percentage?"
  )
  t.log_inputs({"query": query})
  expected = 50.988
  t.log_reference_outputs({"response": expected})
  # Test that the agent executes code when needed
  result = agent.invoke({"messages": [{"role": "user", "content": query}]})
  t.log_outputs({"result": result["structured_response"].get("numeric_answer")})
  # Grab all the tool calls made by the LLM
  tool_calls = [
      tc["name"]
      for msg in result["messages"]
      for tc in getattr(msg, "tool_calls", [])
  ]
  # This will log the number of steps taken by the agent, which is useful for
  # determining how efficiently the agent gets to an answer.
  t.log_feedback(key="num_steps", score=len(result["messages"]) - 1)
  # Assert that the code tool was used
  assert "code_tool" in tool_calls
  # Assert that a numeric answer was provided:
  assert result["structured_response"].get("numeric_answer") is not None
  # Assert that the answer is correct
  assert abs(result["structured_response"]["numeric_answer"] - expected) <= 0.01

class Grade(TypedDict):
  """Evaluate the groundedness of an answer in source documents."""
  score: Annotated[
      bool,
      ...,
      "Return True if the answer is fully grounded in the source documents, otherwise False.",
  ]

judge_llm = init_chat_model("gpt-4o").with_structured_output(Grade)

@pytest.mark.langsmith
def test_grounded_in_source_info() -> None:
  """Test that response is grounded in the tool outputs."""
  query = "How did Nvidia stock do in 2024 according to analysts?"
  t.log_inputs({"query": query})
  result = agent.invoke({"messages": [{"role": "user", "content": query}]})
  # Grab all the search calls made by the LLM
  search_results = "\n\n".join(
      msg.content
      for msg in result["messages"]
      if msg.type == "tool" and msg.name == search_tool.name
  )
  t.log_outputs(
      {
          "response": result["structured_response"].get("text_answer"),
          "search_results": search_results,
      }
  )
  # Trace the feedback LLM run separately from the agent run.
  with t.trace_feedback():
      # Instructions for the LLM judge
      instructions = (
          "Grade the following ANSWER. "
          "The ANSWER should be fully grounded in (i.e. supported by) the source DOCUMENTS. "
          "Return True if the ANSWER is fully grounded in the DOCUMENTS. "
          "Return False if the ANSWER is not grounded in the DOCUMENTS."
      )
      answer_and_docs = (
          f"ANSWER: {result['structured_response'].get('text_answer', '')}\n"
          f"DOCUMENTS:\n{search_results}"
      )
      # Run the judge LLM
      grade = judge_llm.invoke(
          [
              {"role": "system", "content": instructions},
              {"role": "user", "content": answer_and_docs},
          ]
      )
      t.log_feedback(key="groundedness", score=grade["score"])
  assert grade["score"]
