Setup
This tutorial uses LangGraph for agent orchestration, OpenAI's GPT-4o as the model, Tavily for search, E2B's code interpreter, and Polygon for retrieving stock data, but it can be adapted to other frameworks, models, and tools with only minor modifications. Tavily, E2B, and Polygon are all free to sign up for.
Installation
First, install the packages needed to build the agent:
pip install -U langgraph langchain[openai] langchain-community e2b-code-interpreter
# Make sure you have langsmith>=0.3.1
pip install -U "langsmith[pytest]"
Environment Variables
Set the following environment variables:
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY=<YOUR_LANGSMITH_API_KEY>
export OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>
export TAVILY_API_KEY=<YOUR_TAVILY_API_KEY>
export E2B_API_KEY=<YOUR_E2B_API_KEY>
export POLYGON_API_KEY=<YOUR_POLYGON_API_KEY>
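If you are working in a notebook rather than a shell, you can set the same variables from Python instead. A minimal sketch that prompts only for keys that are not already set:
import getpass
import os

# Prompt for any API key that is not already set in the environment.
for var in [
    "LANGSMITH_API_KEY",
    "OPENAI_API_KEY",
    "TAVILY_API_KEY",
    "E2B_API_KEY",
    "POLYGON_API_KEY",
]:
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"{var}: ")

# Enable LangSmith tracing.
os.environ["LANGSMITH_TRACING"] = "true"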
Create your app
To define our ReAct agent, we will use LangGraph/LangGraph.js for orchestration and LangChain for the LLM and tools.
Define tools
First, let's define the tools we will use in our agent. There are three tools:
- A search tool using Tavily
- A code interpreter tool using E2B
- A stock data tool using Polygon
from langchain_community.tools import TavilySearchResults
from e2b_code_interpreter import Sandbox
from langchain_community.tools.polygon.aggregates import PolygonAggregates
from langchain_community.utilities.polygon import PolygonAPIWrapper
from typing_extensions import Annotated, Literal, TypedDict

# Define search tool
search_tool = TavilySearchResults(
    max_results=5,
    include_raw_content=True,
)

# Define code tool
def code_tool(code: str) -> str:
    """Execute python code and return the result."""
    sbx = Sandbox()
    execution = sbx.run_code(code)
    if execution.error:
        return f"Error: {execution.error}"
    return f"Results: {execution.results}, Logs: {execution.logs}"

# Define input schema for stock ticker tool
class TickerToolInput(TypedDict):
    """Input format for the ticker tool.

    The tool will pull data in aggregate blocks (timespan_multiplier * timespan) from the from_date to the to_date
    """

    ticker: Annotated[str, ..., "The ticker symbol of the stock"]
    timespan: Annotated[Literal["minute", "hour", "day", "week", "month", "quarter", "year"], ..., "The size of the time window."]
    timespan_multiplier: Annotated[int, ..., "The multiplier for the time window"]
    from_date: Annotated[str, ..., "The date to start pulling data from, YYYY-MM-DD format - ONLY include the year month and day"]
    to_date: Annotated[str, ..., "The date to stop pulling data, YYYY-MM-DD format - ONLY include the year month and day"]

api_wrapper = PolygonAPIWrapper()
polygon_aggregates = PolygonAggregates(api_wrapper=api_wrapper)

# Define stock ticker tool
def ticker_tool(query: TickerToolInput) -> str:
    """Pull data for the ticker."""
    return polygon_aggregates.invoke(query)
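As a quick sanity check, you can call the stock tool directly with an input that matches TickerToolInput. This assumes your POLYGON_API_KEY is set; the ticker and date range below are only illustrative:
# Illustrative only -- the ticker and dates are placeholder values.
example_query: TickerToolInput = {
    "ticker": "AAPL",
    "timespan": "day",
    "timespan_multiplier": 1,
    "from_date": "2024-01-02",
    "to_date": "2024-01-31",
}
print(ticker_tool(example_query))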
Define agent
Now that we have defined all of our tools, we can create the agent using create_agent.
from typing_extensions import Annotated, TypedDict

from langchain.agents import create_agent

class AgentOutputFormat(TypedDict):
    numeric_answer: Annotated[float | None, ..., "The numeric answer, if the user asked for one"]
    text_answer: Annotated[str | None, ..., "The text answer, if the user asked for one"]
    reasoning: Annotated[str, ..., "The reasoning behind the answer"]

agent = create_agent(
    model="openai:gpt-4o-mini",
    tools=[code_tool, search_tool, polygon_aggregates],
    response_format=AgentOutputFormat,
    system_prompt="You are a financial expert. Respond to the users query accurately",
)
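Before moving on to tests, you can give the agent a quick manual run to confirm the wiring. The question below is just an example; the structured_response key comes from the response_format defined above:
# A quick smoke test -- the query is an arbitrary example.
result = agent.invoke(
    {"messages": [{"role": "user", "content": "What was the average closing price of AAPL in January 2024?"}]}
)
print(result["structured_response"])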
Write tests
Now that we have defined our agent, let's write a few tests to ensure basic functionality. In this tutorial we will test whether the agent's tool-calling abilities work, whether the agent knows to ignore irrelevant questions, and whether it can answer complex questions that involve using all of the tools. First, set up a test file and add the necessary imports at the top of the file.
Create a `tests/test_agent.py` file.
from app import agent, polygon_aggregates, search_tool # import from wherever your agent is defined
import pytest
from langsmith import testing as t
Test 1: Handle off-topic questions
The first test is a simple check that the agent does not call any tools on an off-topic query.
@pytest.mark.langsmith
@pytest.mark.parametrize(
    # <-- Can still use all normal pytest markers
    "query",
    ["Hello!", "How are you doing?"],
)
def test_no_tools_on_offtopic_query(query: str) -> None:
    """Test that the agent does not use tools on offtopic queries."""
    # Log the test example
    t.log_inputs({"query": query})
    expected = []
    t.log_reference_outputs({"tool_calls": expected})
    # Call the agent's model node directly instead of running the ReACT loop.
    result = agent.nodes["agent"].invoke(
        {"messages": [{"role": "user", "content": query}]}
    )
    actual = result["messages"][0].tool_calls
    t.log_outputs({"tool_calls": actual})
    # Check that no tool calls were made.
    assert actual == expected
Test 2: Simple tool calling
For tool calling, we will check that the agent calls the correct tool with the correct parameters.
@pytest.mark.langsmith
def test_searches_for_correct_ticker() -> None:
    """Test that the model looks up the correct ticker on simple query."""
    # Log the test example
    query = "What is the price of Apple?"
    t.log_inputs({"query": query})
    expected = "AAPL"
    t.log_reference_outputs({"ticker": expected})
    # Call the agent's model node directly instead of running the full ReACT loop.
    result = agent.nodes["agent"].invoke(
        {"messages": [{"role": "user", "content": query}]}
    )
    tool_calls = result["messages"][0].tool_calls
    if tool_calls and tool_calls[0]["name"] == polygon_aggregates.name:
        actual = tool_calls[0]["args"]["ticker"]
    else:
        actual = None
    t.log_outputs({"ticker": actual})
    # Check that the right ticker was queried
    assert actual == expected
Test 3: Complex tool calling
Some tool calls are easier to test than others. For the ticker lookup we can assert that the correct ticker was searched. For the coding tool, the inputs and outputs of the tool are much less constrained, and there are many ways to get to the right answer. In this case it is simpler to test that the tool is used correctly by running the full agent and asserting both that it calls the coding tool and that it arrives at the correct answer.
@pytest.mark.langsmith
def test_executes_code_when_needed() -> None:
    query = (
        "In the past year Facebook stock went up by 66.76%, "
        "Apple by 25.24%, Google by 37.11%, Amazon by 47.52%, "
        "Netflix by 78.31%. Whats the avg return in the past "
        "year of the FAANG stocks, expressed as a percentage?"
    )
    t.log_inputs({"query": query})
    expected = 50.988
    t.log_reference_outputs({"response": expected})
    # Test that the agent executes code when needed
    result = agent.invoke({"messages": [{"role": "user", "content": query}]})
    t.log_outputs({"result": result["structured_response"].get("numeric_answer")})
    # Grab all the tool calls made by the LLM
    tool_calls = [
        tc["name"]
        for msg in result["messages"]
        for tc in getattr(msg, "tool_calls", [])
    ]
    # This will log the number of steps taken by the agent, which is useful for
    # determining how efficiently the agent gets to an answer.
    t.log_feedback(key="num_steps", score=len(result["messages"]) - 1)
    # Assert that the code tool was used
    assert "code_tool" in tool_calls
    # Assert that a numeric answer was provided:
    assert result["structured_response"].get("numeric_answer") is not None
    # Assert that the answer is correct
    assert abs(result["structured_response"]["numeric_answer"] - expected) <= 0.01
Test 4: LLM-as-a-judge
To make sure the agent's answer is grounded in the search results, we will run an LLM-as-a-judge evaluation. To trace the LLM-as-a-judge call separately from the rest of the agent, we will use the trace_feedback context manager that LangSmith provides in Python (and the wrapEvaluator function in JS/TS).
from typing_extensions import Annotated, TypedDict
from langchain.chat_models import init_chat_model
class Grade(TypedDict):
    """Evaluate the groundedness of an answer in source documents."""

    score: Annotated[
        bool,
        ...,
        "Return True if the answer is fully grounded in the source documents, otherwise False.",
    ]
judge_llm = init_chat_model("gpt-4o").with_structured_output(Grade)
@pytest.mark.langsmith
def test_grounded_in_source_info() -> None:
    """Test that response is grounded in the tool outputs."""
    query = "How did Nvidia stock do in 2024 according to analysts?"
    t.log_inputs({"query": query})
    result = agent.invoke({"messages": [{"role": "user", "content": query}]})
    # Grab all the search calls made by the LLM
    search_results = "\n\n".join(
        msg.content
        for msg in result["messages"]
        if msg.type == "tool" and msg.name == search_tool.name
    )
    t.log_outputs(
        {
            "response": result["structured_response"].get("text_answer"),
            "search_results": search_results,
        }
    )
    # Trace the feedback LLM run separately from the agent run.
    with t.trace_feedback():
        # Instructions for the LLM judge
        instructions = (
            "Grade the following ANSWER. "
            "The ANSWER should be fully grounded in (i.e. supported by) the source DOCUMENTS. "
            "Return True if the ANSWER is fully grounded in the DOCUMENTS. "
            "Return False if the ANSWER is not grounded in the DOCUMENTS."
        )
        answer_and_docs = (
            f"ANSWER: {result['structured_response'].get('text_answer', '')}\n"
            f"DOCUMENTS:\n{search_results}"
        )
        # Run the judge LLM
        grade = judge_llm.invoke(
            [
                {"role": "system", "content": instructions},
                {"role": "user", "content": answer_and_docs},
            ]
        )
        t.log_feedback(key="groundedness", score=grade["score"])

    assert grade["score"]
Run tests
Once you have set up your config file (if you are using Vitest or Jest), you can run your tests with the command below.
Config file for Vitest/Jest
Create a `ls.vitest.config.ts` file:
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    include: ["**/*.eval.?(c|m)[jt]s"],
    reporters: ["langsmith/vitest/reporter"],
    setupFiles: ["dotenv/config"],
  },
});
Run the tests with pytest:
pytest --langsmith-output tests
Reference code
Don't forget to also add the config file for Vitest/Jest to your project.
Agent
Agent code
from e2b_code_interpreter import Sandbox
from langchain_community.tools import PolygonAggregates, TavilySearchResults
from langchain_community.utilities.polygon import PolygonAPIWrapper
from langchain.agents import create_agent
from typing_extensions import Annotated, TypedDict

search_tool = TavilySearchResults(
    max_results=5,
    include_raw_content=True,
)

def code_tool(code: str) -> str:
    """Execute python code and return the result."""
    sbx = Sandbox()
    execution = sbx.run_code(code)
    if execution.error:
        return f"Error: {execution.error}"
    return f"Results: {execution.results}, Logs: {execution.logs}"

polygon_aggregates = PolygonAggregates(api_wrapper=PolygonAPIWrapper())

class AgentOutputFormat(TypedDict):
    numeric_answer: Annotated[
        float | None, ..., "The numeric answer, if the user asked for one"
    ]
    text_answer: Annotated[
        str | None, ..., "The text answer, if the user asked for one"
    ]
    reasoning: Annotated[str, ..., "The reasoning behind the answer"]

agent = create_agent(
    model="openai:gpt-4o-mini",
    tools=[code_tool, search_tool, polygon_aggregates],
    response_format=AgentOutputFormat,
    system_prompt="You are a financial expert. Respond to the users query accurately",
)
Tests
Test code
from app import agent, polygon_aggregates, search_tool  # import from wherever your agent is defined
import pytest
from langchain.chat_models import init_chat_model
from langsmith import testing as t
from typing_extensions import Annotated, TypedDict

@pytest.mark.langsmith
@pytest.mark.parametrize(
    # <-- Can still use all normal pytest markers
    "query",
    ["Hello!", "How are you doing?"],
)
def test_no_tools_on_offtopic_query(query: str) -> None:
    """Test that the agent does not use tools on offtopic queries."""
    # Log the test example
    t.log_inputs({"query": query})
    expected = []
    t.log_reference_outputs({"tool_calls": expected})
    # Call the agent's model node directly instead of running the ReACT loop.
    result = agent.nodes["agent"].invoke(
        {"messages": [{"role": "user", "content": query}]}
    )
    actual = result["messages"][0].tool_calls
    t.log_outputs({"tool_calls": actual})
    # Check that no tool calls were made.
    assert actual == expected

@pytest.mark.langsmith
def test_searches_for_correct_ticker() -> None:
    """Test that the model looks up the correct ticker on simple query."""
    # Log the test example
    query = "What is the price of Apple?"
    t.log_inputs({"query": query})
    expected = "AAPL"
    t.log_reference_outputs({"ticker": expected})
    # Call the agent's model node directly instead of running the full ReACT loop.
    result = agent.nodes["agent"].invoke(
        {"messages": [{"role": "user", "content": query}]}
    )
    tool_calls = result["messages"][0].tool_calls
    if tool_calls and tool_calls[0]["name"] == polygon_aggregates.name:
        actual = tool_calls[0]["args"]["ticker"]
    else:
        actual = None
    t.log_outputs({"ticker": actual})
    # Check that the right ticker was queried
    assert actual == expected

@pytest.mark.langsmith
def test_executes_code_when_needed() -> None:
    query = (
        "In the past year Facebook stock went up by 66.76%, "
        "Apple by 25.24%, Google by 37.11%, Amazon by 47.52%, "
        "Netflix by 78.31%. Whats the avg return in the past "
        "year of the FAANG stocks, expressed as a percentage?"
    )
    t.log_inputs({"query": query})
    expected = 50.988
    t.log_reference_outputs({"response": expected})
    # Test that the agent executes code when needed
    result = agent.invoke({"messages": [{"role": "user", "content": query}]})
    t.log_outputs({"result": result["structured_response"].get("numeric_answer")})
    # Grab all the tool calls made by the LLM
    tool_calls = [
        tc["name"]
        for msg in result["messages"]
        for tc in getattr(msg, "tool_calls", [])
    ]
    # This will log the number of steps taken by the agent, which is useful for
    # determining how efficiently the agent gets to an answer.
    t.log_feedback(key="num_steps", score=len(result["messages"]) - 1)
    # Assert that the code tool was used
    assert "code_tool" in tool_calls
    # Assert that a numeric answer was provided:
    assert result["structured_response"].get("numeric_answer") is not None
    # Assert that the answer is correct
    assert abs(result["structured_response"]["numeric_answer"] - expected) <= 0.01

class Grade(TypedDict):
    """Evaluate the groundedness of an answer in source documents."""

    score: Annotated[
        bool,
        ...,
        "Return True if the answer is fully grounded in the source documents, otherwise False.",
    ]

judge_llm = init_chat_model("gpt-4o").with_structured_output(Grade)

@pytest.mark.langsmith
def test_grounded_in_source_info() -> None:
    """Test that response is grounded in the tool outputs."""
    query = "How did Nvidia stock do in 2024 according to analysts?"
    t.log_inputs({"query": query})
    result = agent.invoke({"messages": [{"role": "user", "content": query}]})
    # Grab all the search calls made by the LLM
    search_results = "\n\n".join(
        msg.content
        for msg in result["messages"]
        if msg.type == "tool" and msg.name == search_tool.name
    )
    t.log_outputs(
        {
            "response": result["structured_response"].get("text_answer"),
            "search_results": search_results,
        }
    )
    # Trace the feedback LLM run separately from the agent run.
    with t.trace_feedback():
        # Instructions for the LLM judge
        instructions = (
            "Grade the following ANSWER. "
            "The ANSWER should be fully grounded in (i.e. supported by) the source DOCUMENTS. "
            "Return True if the ANSWER is fully grounded in the DOCUMENTS. "
            "Return False if the ANSWER is not grounded in the DOCUMENTS."
        )
        answer_and_docs = (
            f"ANSWER: {result['structured_response'].get('text_answer', '')}\n"
            f"DOCUMENTS:\n{search_results}"
        )
        # Run the judge LLM
        grade = judge_llm.invoke(
            [
                {"role": "system", "content": instructions},
                {"role": "user", "content": answer_and_docs},
            ]
        )
        t.log_feedback(key="groundedness", score=grade["score"])

    assert grade["score"]