How to evaluate an application's intermediate steps

While evaluating the final output of a task is sufficient in many scenarios, in some cases you may want to evaluate the intermediate steps of your pipeline. For example, for retrieval-augmented generation (RAG), you may want to:

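Evaluate the retrieval step to ensure that the correct documents are retrieved for the input query.
Evaluate the generation step to ensure that the correct answer is generated from the retrieved documents.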

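To highlight both scenarios, this guide uses a simple, fully custom evaluator for the first criterion (retrieval) and an LLM-based evaluator for the second criterion (generation). To evaluate the intermediate steps of your pipeline, your evaluator function must traverse and process the run/rootRun argument, which is a Run object that contains the pipeline's intermediate steps.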
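As a minimal sketch of what "traversing the run" means, the snippet below walks a nested run tree by name. FakeRun and find_run are hypothetical names introduced for illustration only; FakeRun merely imitates the name, inputs, outputs, and child_runs attributes that the evaluators in this guide read, so the example runs without a LangSmith account.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRun:
    """Hypothetical stand-in for a traced run (illustration only)."""
    name: str
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)
    child_runs: list = field(default_factory=list)


def find_run(root, name: str):
    """Depth-first search for the first run called `name`; None if absent."""
    if root.name == name:
        return root
    for child in root.child_runs or []:
        found = find_run(child, name)
        if found is not None:
            return found
    return None


# Hand-built tree mirroring the nesting that this guide's evaluators walk:
# root run -> qa_pipeline -> retrieve.
retrieve_run = FakeRun(
    name="retrieve",
    inputs={"query": "LangChain"},
    outputs={"output": [{"page_content": "LangChain is a framework."}]},
)
pipeline_run = FakeRun(name="qa_pipeline", child_runs=[retrieve_run])
root_run = FakeRun(name="qa_wrapper", child_runs=[pipeline_run])

found = find_run(root_run, "retrieve")
print(found.inputs["query"])
```

A recursive search like this is one possible design; the evaluators later in this guide instead step through child_runs one level at a time, which fails loudly if the expected nesting changes.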

1. Define your LLM pipeline

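The RAG pipeline below consists of 1) generating a Wikipedia query for the input question, 2) retrieving relevant documents from Wikipedia, and 3) generating an answer based on the retrieved documents.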

pip install -U langsmith langchain[openai] wikipedia

Requires langsmith>=0.3.13

import wikipedia as wp
from openai import OpenAI
from langsmith import traceable, wrappers

oai_client = wrappers.wrap_openai(OpenAI())

@traceable
def generate_wiki_search(question: str) -> str:
    """Generate the query to search in wikipedia."""
    instructions = (
        "Generate a search query to pass into wikipedia to answer the user's question. "
        "Return only the search query and nothing more. "
        "This will be passed in directly to the wikipedia search engine."
    )
    messages = [
        {"role": "system", "content": instructions},
        {"role": "user", "content": question}
    ]
    result = oai_client.chat.completions.create(
        messages=messages,
        model="gpt-4o-mini",
        temperature=0,
    )
    return result.choices[0].message.content

@traceable(run_type="retriever")
def retrieve(query: str) -> list:
    """Get up to two Wikipedia search results."""
    results = []
    for term in wp.search(query, results=10):
        try:
            page = wp.page(term, auto_suggest=False)
            results.append({
                "page_content": page.summary,
                "type": "Document",
                "metadata": {"url": page.url}
            })
        except wp.DisambiguationError:
            # Skip ambiguous titles and try the next search hit.
            pass
        if len(results) >= 2:
            break
    # Always return a list, even when fewer than two pages were found.
    return results

@traceable
def generate_answer(question: str, context: str) -> str:
    """Answer the question based on the retrieved information."""
    instructions = f"Answer the user's question based ONLY on the content below:\n\n{context}"
    messages = [
        {"role": "system", "content": instructions},
        {"role": "user", "content": question}
    ]
    result = oai_client.chat.completions.create(
        messages=messages,
        model="gpt-4o-mini",
        temperature=0
    )
    return result.choices[0].message.content

@traceable
def qa_pipeline(question: str) -> str:
    """The full pipeline."""
    query = generate_wiki_search(question)
    context = "\n\n".join([doc["page_content"] for doc in retrieve(query)])
    return generate_answer(question, context)

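This pipeline produces a trace in which qa_pipeline is the parent run and generate_wiki_search, retrieve, and generate_answer appear as its child runs.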

2. Create a dataset and examples to evaluate the pipeline

We build a very simple dataset with a couple of examples to evaluate the pipeline.

Requires langsmith>=0.3.13

from langsmith import Client

ls_client = Client()
dataset_name = "Wikipedia RAG"

if not ls_client.has_dataset(dataset_name=dataset_name):
    dataset = ls_client.create_dataset(dataset_name=dataset_name)
    examples = [
        {"inputs": {"question": "What is LangChain?"}},
        {"inputs": {"question": "What is LangSmith?"}},
    ]
    ls_client.create_examples(
        dataset_id=dataset.id,
        examples=examples,
    )

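3. Define the custom evaluators

As mentioned above, we define two evaluators: one that evaluates the relevance of the retrieved documents with respect to the input query, and another that evaluates whether the generated answer is hallucinated with respect to the retrieved documents. To define the hallucination evaluator, we use a LangChain LLM wrapper together with with_structured_output. The key point is that the evaluator function must traverse the run / rootRun argument to access the pipeline's intermediate steps. The evaluator can then process the inputs and outputs of an intermediate step and evaluate them against the desired criteria. The example uses langchain for convenience; it is not required.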

from langchain.chat_models import init_chat_model
from langsmith.schemas import Run
from pydantic import BaseModel, Field

def document_relevance(run: Run) -> bool:
    """Checks if retriever input exists in the retrieved docs."""
    # The root run wraps qa_pipeline, which in turn contains the retrieve step.
    qa_pipeline_run = next(
        r for r in run.child_runs if r.name == "qa_pipeline"
    )
    retrieve_run = next(
        r for r in qa_pipeline_run.child_runs if r.name == "retrieve"
    )
    page_contents = "\n\n".join(
        doc["page_content"] for doc in retrieve_run.outputs["output"]
    )
    return retrieve_run.inputs["query"] in page_contents

# Data model
class GradeHallucinations(BaseModel):
    """Binary score for hallucination present in generation answer."""
    is_grounded: bool = Field(..., description="True if the answer is grounded in the facts, False otherwise.")

# LLM with structured outputs for grading hallucinations
# For more see: https://python.langchain.com/docs/how_to/structured_output/
grader_llm = init_chat_model("gpt-4o-mini", temperature=0).with_structured_output(
    GradeHallucinations,
    method="json_schema",
    strict=True,
)

def no_hallucination(run: Run) -> bool:
    """Check if the answer is grounded in the documents.
    Return True if there is no hallucination, False otherwise.
    """
    # Get documents and answer
    qa_pipeline_run = next(
        r for r in run.child_runs if r.name == "qa_pipeline"
    )
    retrieve_run = next(
        r for r in qa_pipeline_run.child_runs if r.name == "retrieve"
    )
    retrieved_content = "\n\n".join(
        doc["page_content"] for doc in retrieve_run.outputs["output"]
    )

    # Construct prompt
    instructions = (
        "You are a grader assessing whether an LLM generation is grounded in / "
        "supported by a set of retrieved facts. Give a binary score 1 or 0, "
        "where 1 means that the answer is grounded in / supported by the set of facts."
    )
    messages = [
        {"role": "system", "content": instructions},
        {"role": "user", "content": f"Set of facts:\n{retrieved_content}\n\nLLM generation: {run.outputs['answer']}"},
    ]
    grade = grader_llm.invoke(messages)
    return grade.is_grounded
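To make the behavior of the relevance check concrete, here is a small offline sketch that applies the same substring test to hand-built stand-ins for the retrieve run. The names query_in_docs and make_retrieve_run are hypothetical helpers, and the page text is made up for illustration; no LangSmith or OpenAI calls are involved.

```python
from types import SimpleNamespace


def query_in_docs(retrieve_run) -> bool:
    """Same substring check as document_relevance, applied to a retrieve run."""
    page_contents = "\n\n".join(
        doc["page_content"] for doc in retrieve_run.outputs["output"]
    )
    return retrieve_run.inputs["query"] in page_contents


def make_retrieve_run(query, pages):
    """Build a hypothetical stand-in for the traced `retrieve` run."""
    return SimpleNamespace(
        name="retrieve",
        inputs={"query": query},
        outputs={"output": [{"page_content": p} for p in pages]},
    )


pages = ["LangChain is a software framework."]
hit = make_retrieve_run("LangChain", pages)
miss = make_retrieve_run("langchain framework", pages)

print(query_in_docs(hit))   # True: the query appears verbatim in the text
print(query_in_docs(miss))  # False: casing and wording differ
```

The second case shows that this evaluator is a strict, case-sensitive containment test, so a generated query that paraphrases the page wording scores False even when the retrieved document is on topic.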

4. Evaluate the pipeline

Finally, we run evaluate with the custom evaluators defined above.

def qa_wrapper(inputs: dict) -> dict:
    """Wrap the qa_pipeline so it can accept the Example.inputs dict as input."""
    return {"answer": qa_pipeline(inputs["question"])}

experiment_results = ls_client.evaluate(
    qa_wrapper,
    data=dataset_name,
    evaluators=[document_relevance, no_hallucination],
    experiment_prefix="rag-wiki-oai"
)

The experiment contains the evaluation results, including the evaluators' scores and comments.
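If you want a quick per-evaluator summary, one option is to compute pass rates from the result rows. The sketch below is an assumption-laden illustration: it assumes each row is a dict whose "evaluation_results" entry holds a "results" list of objects with key and score attributes, a shape you should verify against your installed langsmith version. The pass_rates helper is a hypothetical name, and the rows are invented stand-ins rather than real experiment output.

```python
from collections import defaultdict
from types import SimpleNamespace


def pass_rates(rows):
    """Return {evaluator key: fraction of truthy scores} over all rows."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        for result in row["evaluation_results"]["results"]:
            if result.score is None:
                continue  # skip evaluators that produced no score
            total[result.key] += 1
            passed[result.key] += int(bool(result.score))
    return {key: passed[key] / total[key] for key in total}


# Hypothetical rows, made up for illustration; not real experiment output.
rows = [
    {"evaluation_results": {"results": [
        SimpleNamespace(key="document_relevance", score=True),
        SimpleNamespace(key="no_hallucination", score=True),
    ]}},
    {"evaluation_results": {"results": [
        SimpleNamespace(key="document_relevance", score=False),
        SimpleNamespace(key="no_hallucination", score=True),
    ]}},
]

rates = pass_rates(rows)
print(rates)  # {'document_relevance': 0.5, 'no_hallucination': 1.0}
```

Under the stated assumption, the same function could be called on the rows obtained by iterating over experiment_results.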
