Facebook Messenger

이 노트북은 Facebook에서 데이터를 로드하여 파인튜닝할 수 있는 형식으로 변환하는 방법을 보여줍니다. 전체 단계는 다음과 같습니다:

메신저 데이터를 디스크에 다운로드합니다.
Chat Loader를 생성하고 loader.load() (또는 loader.lazy_load())를 호출하여 변환을 수행합니다.
선택적으로 merge_chat_runs를 사용하여 동일한 발신자의 연속된 메시지를 결합하거나, map_ai_messages를 사용하여 지정된 발신자의 메시지를 “AIMessage” 클래스로 변환합니다. 이 작업을 완료한 후 convert_messages_for_finetuning을 호출하여 파인튜닝을 위한 데이터를 준비합니다.

이 작업이 완료되면 모델을 파인튜닝할 수 있습니다. 이를 위해 다음 단계를 완료합니다:

메시지를 OpenAI에 업로드하고 파인튜닝 작업을 실행합니다.
결과 모델을 LangChain 앱에서 사용합니다!

시작해봅시다.

1. 데이터 다운로드

자신의 메신저 데이터를 다운로드하려면 여기의 지침을 따르세요. 중요 - JSON 형식으로 다운로드해야 합니다(HTML이 아님). 이 워크스루에서 사용할 예제 덤프를 이 구글 드라이브 링크에서 호스팅하고 있습니다.

# This uses some example data
import zipfile

import requests


def download_and_unzip(url: str, output_path: str = "file.zip") -> None:
    file_id = url.split("/")[-2]
    download_url = f"https://drive.google.com/uc?export=download&id={file_id}"

    response = requests.get(download_url)
    if response.status_code != 200:
        print("Failed to download the file.")
        return

    with open(output_path, "wb") as file:
        file.write(response.content)
        print(f"File {output_path} downloaded.")

    with zipfile.ZipFile(output_path, "r") as zip_ref:
        zip_ref.extractall()
        print(f"File {output_path} has been unzipped.")


# URL of the file to download
url = (
    "https://drive.google.com/file/d/1rh1s1o2i7B-Sk1v9o8KNgivLVGwJ-osV/view?usp=sharing"
)

# Download and unzip
download_and_unzip(url)

File file.zip downloaded.
File file.zip has been unzipped.

2. Chat Loader 생성

전체 채팅 디렉토리용과 개별 파일 로드용, 두 가지 다른 FacebookMessengerChatLoader 클래스가 있습니다.

directory_path = "./hogwarts"

from langchain_community.chat_loaders.facebook_messenger import (
    FolderFacebookMessengerChatLoader,
    SingleFileFacebookMessengerChatLoader,
)

loader = SingleFileFacebookMessengerChatLoader(
    path="./hogwarts/inbox/HermioneGranger/messages_Hermione_Granger.json",
)

chat_session = loader.load()[0]
chat_session["messages"][:3]

[HumanMessage(content="Hi Hermione! How's your summer going so far?", additional_kwargs={'sender': 'Harry Potter'}),
 HumanMessage(content="Harry! Lovely to hear from you. My summer is going well, though I do miss everyone. I'm spending most of my time going through my books and researching fascinating new topics. How about you?", additional_kwargs={'sender': 'Hermione Granger'}),
 HumanMessage(content="I miss you all too. The Dursleys are being their usual unpleasant selves but I'm getting by. At least I can practice some spells in my room without them knowing. Let me know if you find anything good in your researching!", additional_kwargs={'sender': 'Harry Potter'})]

loader = FolderFacebookMessengerChatLoader(
    path="./hogwarts",
)

chat_sessions = loader.load()
len(chat_sessions)

3. 파인튜닝 준비

load()를 호출하면 추출할 수 있는 모든 채팅 메시지가 human message로 반환됩니다. 챗봇과 대화할 때 대화는 일반적으로 실제 대화에 비해 더 엄격한 교대 대화 패턴을 따릅니다. 메시지 “runs”(동일한 발신자의 연속된 메시지)를 병합하고 “AI”를 나타낼 발신자를 선택할 수 있습니다. 파인튜닝된 LLM은 이러한 AI 메시지를 생성하는 방법을 학습합니다.

from langchain_community.chat_loaders.utils import (
    map_ai_messages,
    merge_chat_runs,
)

merged_sessions = merge_chat_runs(chat_sessions)
alternating_sessions = list(map_ai_messages(merged_sessions, "Harry Potter"))

# Now all of Harry Potter's messages will take the AIMessage class
# which maps to the 'assistant' role in OpenAI's training format
alternating_sessions[0]["messages"][:3]

[AIMessage(content="Professor Snape, I was hoping I could speak with you for a moment about something that's been concerning me lately.", additional_kwargs={'sender': 'Harry Potter'}),
 HumanMessage(content="What is it, Potter? I'm quite busy at the moment.", additional_kwargs={'sender': 'Severus Snape'}),
 AIMessage(content="I apologize for the interruption, sir. I'll be brief. I've noticed some strange activity around the school grounds at night. I saw a cloaked figure lurking near the Forbidden Forest last night. I'm worried someone may be plotting something sinister.", additional_kwargs={'sender': 'Harry Potter'})]

이제 OpenAI 형식 dictionary로 변환할 수 있습니다

from langchain_community.adapters.openai import convert_messages_for_finetuning

training_data = convert_messages_for_finetuning(alternating_sessions)
print(f"Prepared {len(training_data)} dialogues for training")

Prepared 9 dialogues for training

training_data[0][:3]

[{'role': 'assistant',
  'content': "Professor Snape, I was hoping I could speak with you for a moment about something that's been concerning me lately."},
 {'role': 'user',
  'content': "What is it, Potter? I'm quite busy at the moment."},
 {'role': 'assistant',
  'content': "I apologize for the interruption, sir. I'll be brief. I've noticed some strange activity around the school grounds at night. I saw a cloaked figure lurking near the Forbidden Forest last night. I'm worried someone may be plotting something sinister."}]

OpenAI는 현재 파인튜닝 작업에 최소 10개의 학습 예제를 요구하지만, 대부분의 작업에는 50-100개를 권장합니다. 9개의 채팅 세션만 있으므로, 각 학습 예제가 전체 대화의 일부로 구성되도록 (선택적으로 일부 중복을 포함하여) 세분화할 수 있습니다. Facebook 채팅 세션(1인당 1개)은 종종 여러 날과 대화에 걸쳐 있으므로, 장거리 종속성은 어쨌든 모델링하는 것이 그다지 중요하지 않을 수 있습니다.

# Our chat is alternating, we will make each datapoint a group of 8 messages,
# with 2 messages overlapping
chunk_size = 8
overlap = 2

training_examples = [
    conversation_messages[i : i + chunk_size]
    for conversation_messages in training_data
    for i in range(0, len(conversation_messages) - chunk_size + 1, chunk_size - overlap)
]

len(training_examples)

4. 모델 파인튜닝

이제 모델을 파인튜닝할 시간입니다. openai가 설치되어 있고 OPENAI_API_KEY가 적절히 설정되어 있는지 확인하세요

pip install -qU  langchain-openai

import json
import time
from io import BytesIO

import openai

# We will write the jsonl file in memory
my_file = BytesIO()
for m in training_examples:
    my_file.write((json.dumps({"messages": m}) + "\n").encode("utf-8"))

my_file.seek(0)
training_file = openai.files.create(file=my_file, purpose="fine-tune")

# OpenAI audits each training file for compliance reasons.
# This make take a few minutes
status = openai.files.retrieve(training_file.id).status
start_time = time.time()
while status != "processed":
    print(f"Status=[{status}]... {time.time() - start_time:.2f}s", end="\r", flush=True)
    time.sleep(5)
    status = openai.files.retrieve(training_file.id).status
print(f"File {training_file.id} ready after {time.time() - start_time:.2f} seconds.")

File file-ULumAXLEFw3vB6bb9uy6DNVC ready after 0.00 seconds.

파일이 준비되면 학습 작업을 시작할 시간입니다.

job = openai.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)

모델이 준비되는 동안 차 한 잔 하세요. 시간이 좀 걸릴 수 있습니다!

status = openai.fine_tuning.jobs.retrieve(job.id).status
start_time = time.time()
while status != "succeeded":
    print(f"Status=[{status}]... {time.time() - start_time:.2f}s", end="\r", flush=True)
    time.sleep(5)
    job = openai.fine_tuning.jobs.retrieve(job.id)
    status = job.status

Status=[running]... 874.29s. 56.93s

print(job.fine_tuned_model)

ft:gpt-3.5-turbo-0613:personal::8QnAzWMr

5. LangChain에서 사용

결과 모델 ID를 ChatOpenAI 모델 클래스에서 직접 사용할 수 있습니다.

from langchain_openai import ChatOpenAI

model = ChatOpenAI(
    model=job.fine_tuned_model,
    temperature=1,
)

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [
        ("human", "{input}"),
    ]
)

chain = prompt | model | StrOutputParser()

for tok in chain.stream({"input": "What classes are you taking?"}):
    print(tok, end="", flush=True)

I'm taking Charms, Defense Against the Dark Arts, Herbology, Potions, Transfiguration, and Ancient Runes. How about you?

Edit the source of this page on GitHub.

Connect these docs programmatically to Claude, VSCode, and more via MCP for real-time answers.

Popular Providers

Integrations by component

1. 데이터 다운로드

2. Chat Loader 생성

3. 파인튜닝 준비

이제 OpenAI 형식 dictionary로 변환할 수 있습니다

4. 모델 파인튜닝

5. LangChain에서 사용

Popular Providers

Integrations by component

​1. 데이터 다운로드

​2. Chat Loader 생성

​3. 파인튜닝 준비

​이제 OpenAI 형식 dictionary로 변환할 수 있습니다

​4. 모델 파인튜닝

​5. LangChain에서 사용

1. 데이터 다운로드

2. Chat Loader 생성

3. 파인튜닝 준비

이제 OpenAI 형식 dictionary로 변환할 수 있습니다

4. 모델 파인튜닝

5. LangChain에서 사용