How to define a code evaluator

A code evaluator is a function that takes a dataset example and the resulting application output, and returns one or more metrics. These functions can be passed directly to evaluate() / aevaluate().

Basic example

from langsmith import evaluate

def correct(outputs: dict, reference_outputs: dict) -> bool:
    """Check if the answer exactly matches the expected answer."""
    return outputs["answer"] == reference_outputs["answer"]

def dummy_app(inputs: dict) -> dict:
    return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

results = evaluate(
    dummy_app,
    data="dataset_name",
    evaluators=[correct]
)

Evaluator args

Code evaluator functions must use specific argument names. They can take any subset of the following arguments:

run: Run: The full Run object generated by the application on the given example.
example: Example: The full dataset Example, including the example inputs, outputs (if available), and metadata (if available).
inputs: dict: A dictionary of the inputs corresponding to a single example in a dataset.
outputs: dict: A dictionary of the outputs generated by the application on the given inputs.
reference_outputs/referenceOutputs: dict: A dictionary of the reference outputs associated with the example, if available.

For most use cases you only need inputs, outputs, and reference_outputs. run and example are useful only when you need extra trace or example metadata beyond the application's actual inputs and outputs. When using JS/TS, all of these must be passed as part of a single object argument.
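As a rough illustration of when run and example become useful, the sketch below reads trace timing from the run and a per-example setting from the example metadata. The evaluator name `within_latency_budget` and the `max_latency_s` metadata key are hypothetical. The sketch assumes that the Run object exposes `start_time` and `end_time` and that the Example exposes `metadata`; type hints are omitted so the block has no imports.

```python
def within_latency_budget(run, example) -> bool:
    """Check whether the application answered within a per-example time budget.

    `run` is expected to behave like langsmith.schemas.Run and `example`
    like langsmith.schemas.Example (assumption: start_time, end_time and
    metadata attributes are available).
    """
    # Hypothetical metadata key; fall back to 5 seconds when it is absent.
    budget_s = (example.metadata or {}).get("max_latency_s", 5.0)

    # A run without an end time cannot be timed, so count it as a failure.
    if run.end_time is None:
        return False

    elapsed_s = (run.end_time - run.start_time).total_seconds()
    return elapsed_s <= budget_s
```

An evaluator like this could be added to the evaluators list alongside evaluators that only take inputs, outputs, and reference_outputs.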

Evaluator output

Code evaluators are expected to return one of the following types:

Python and JS/TS

dict: dicts of the form {"score" | "value": ..., "key": ...} let you customize the metric type ("score" for numerical, "value" for categorical) and the metric name. This is useful if, for example, you want to log an integer as a categorical metric.
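A minimal sketch of the dict form, logging an integer as a categorical metric. The evaluator name `answer_word_count` and the metric key are illustrative choices, not names from the library.

```python
def answer_word_count(outputs: dict) -> dict:
    """Record the answer's word count as a categorical metric."""
    n_words = len(outputs["answer"].split())
    # Using "value" instead of "score" marks the metric as categorical,
    # so the integer is treated as a label rather than a number to average.
    # "key" sets the metric name explicitly.
    return {"key": "answer_word_count", "value": n_words}
```

Returning {"key": "answer_word_count", "score": n_words} instead would record the same integer as a numerical metric.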

Python only

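int | float | bool: Interpreted as a continuous metric that can be averaged, sorted, and so on. The function name is used as the name of the metric.
str: Interpreted as a categorical metric. The function name is used as the name of the metric.
list[dict]: Return multiple metrics from a single function.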
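The Python-only return types can be sketched as follows. The function names and metric keys are hypothetical, and the list example assumes each dict in the list follows the same {"key": ..., "score": ...} form described above.

```python
def answer_tone(outputs: dict) -> str:
    """Categorical metric; the function name 'answer_tone' becomes the metric name."""
    return "hedging" if "not sure" in outputs["answer"].lower() else "direct"


def answer_checks(outputs: dict, reference_outputs: dict) -> list[dict]:
    """Return two metrics from one evaluator, each with its own key."""
    answer = outputs["answer"]
    expected = reference_outputs["answer"]
    return [
        {"key": "exact_match", "score": answer == expected},
        {"key": "case_insensitive_match", "score": answer.lower() == expected.lower()},
    ]
```

Returning a list is convenient when several metrics share the same intermediate computation, since the work is done once per example.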

Additional examples

Requires langsmith>=0.2.0

from langsmith import evaluate, wrappers
from langsmith.schemas import Run, Example
from openai import AsyncOpenAI
# Assumes you've installed pydantic.
from pydantic import BaseModel

# We can still pass in Run and Example objects if we'd like
def correct_old_signature(run: Run, example: Example) -> dict:
    """Check if the answer exactly matches the expected answer."""
    return {"key": "correct", "score": run.outputs["answer"] == example.outputs["answer"]}

# Just evaluate actual outputs
def concision(outputs: dict) -> int:
    """Score how concise the answer is. 1 is the most concise, 5 is the least concise."""
    return min(len(outputs["answer"]) // 1000, 4) + 1

# Use an LLM-as-a-judge
oai_client = wrappers.wrap_openai(AsyncOpenAI())

async def valid_reasoning(inputs: dict, outputs: dict) -> bool:
    """Use an LLM to judge if the reasoning and the answer are consistent."""
    instructions = """
Given the following question, answer, and reasoning, determine if the reasoning for the
answer is logically valid and consistent with question and the answer."""

    class Response(BaseModel):
        reasoning_is_valid: bool

    msg = f"Question: {inputs['question']}\nAnswer: {outputs['answer']}\nReasoning: {outputs['reasoning']}"
    response = await oai_client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": instructions,}, {"role": "user", "content": msg}],
        response_format=Response
    )
    return response.choices[0].message.parsed.reasoning_is_valid

def dummy_app(inputs: dict) -> dict:
    return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

results = evaluate(
    dummy_app,
    data="dataset_name",
    evaluators=[correct_old_signature, concision, valid_reasoning]
)
