Hugging Face Local Pipelines

Hugging Face models can be run locally through the HuggingFacePipeline class. The Hugging Face Model Hub is an online platform that hosts over 120,000 models, 20,000 datasets, and 50,000 demo apps (Spaces), all open source and publicly available, where people can collaborate and build ML together. From LangChain, these models can be called either through this local pipeline wrapper or through their hosted inference endpoints via the HuggingFaceHub class. To use the local wrapper, you need the transformers Python package installed, as well as PyTorch. You can also install xformers for a more memory-efficient attention implementation.

pip install -qU transformers langchain-huggingface

Model Loading

Models can be loaded by specifying the model parameters with the from_model_id method.

from langchain_huggingface.llms import HuggingFacePipeline

hf = HuggingFacePipeline.from_model_id(
    model_id="gpt2",
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 10},
)

They can also be loaded by passing in an existing transformers pipeline directly.

from langchain_huggingface.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=10)
hf = HuggingFacePipeline(pipeline=pipe)

Create Chain

Once the model is loaded into memory, you can compose it with a prompt to form a chain.

from langchain_core.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)

chain = prompt | hf

question = "What is electroencephalography?"

print(chain.invoke({"question": question}))
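
To see what text the model receives, the template substitution can be reproduced without LangChain. This is a minimal sketch using plain str.format, on the assumption that PromptTemplate.from_template applies f-string-style substitution by default; render_prompt is a hypothetical helper written for this illustration, not part of LangChain.

```python
# Hypothetical helper: mimics f-string-style template substitution with str.format.
TEMPLATE = """Question: {question}

Answer: Let's think step by step."""


def render_prompt(template: str, **variables: str) -> str:
    """Fill the template placeholders with the given variables."""
    return template.format(**variables)


rendered = render_prompt(TEMPLATE, question="What is electroencephalography?")
print(rendered)
```

The rendered string begins with the question and ends with the "Let's think step by step." cue, which is the text the LLM is asked to continue.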

To get the response without the prompt, you can bind skip_prompt=True to the LLM.

chain = prompt | hf.bind(skip_prompt=True)

question = "What is electroencephalography?"

print(chain.invoke({"question": question}))
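
A text-generation pipeline typically returns the prompt followed by the completion, which is why skipping the prompt is useful. The sketch below illustrates the idea with plain string slicing. strip_prompt is a hypothetical helper and the sample output is made up; the actual logic inside HuggingFacePipeline may differ.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the completion when the output begins with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # Leave the text untouched if the prompt is not a prefix of the output.
    return generated_text


prompt_text = "Question: What is 2 + 2?\n\nAnswer: Let's think step by step."
full_output = prompt_text + " Two plus two is four."  # made-up model output
completion = strip_prompt(full_output, prompt_text)
print(completion)
```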

The response can also be streamed chunk by chunk.

for chunk in chain.stream({"question": question}):
    print(chunk, end="", flush=True)

GPU Inference

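When running on a machine with a GPU, you can specify the device=n parameter to place the model on the given device. The default is -1, which selects CPU inference. If you have multiple GPUs, or the model is too large for a single GPU, you can specify device_map="auto", which requires the Accelerate library and uses it to determine automatically how to load the model weights. Note: device and device_map should not be specified together, as doing so can lead to unexpected behavior.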

gpu_llm = HuggingFacePipeline.from_model_id(
    model_id="gpt2",
    task="text-generation",
    device=0,  # replace with device_map="auto" to use the accelerate library.
    pipeline_kwargs={"max_new_tokens": 10},
)

gpu_chain = prompt | gpu_llm

question = "What is electroencephalography?"

print(gpu_chain.invoke({"question": question}))
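
Because device and device_map must not be combined, it can help to build the placement arguments in one place and reject the invalid combination early. The following is a minimal sketch; device_kwargs is a hypothetical helper and is not part of LangChain or transformers.

```python
from typing import Any, Dict, Optional


def device_kwargs(
    device: Optional[int] = None, device_map: Optional[str] = None
) -> Dict[str, Any]:
    """Build placement keyword arguments, rejecting device and device_map together."""
    if device is not None and device_map is not None:
        raise ValueError("Specify either device or device_map, not both.")
    if device_map is not None:
        return {"device_map": device_map}
    # Fall back to -1 (CPU) when neither argument is given.
    return {"device": -1 if device is None else device}


print(device_kwargs(device=0))           # {'device': 0}
print(device_kwargs(device_map="auto"))  # {'device_map': 'auto'}
print(device_kwargs())                   # {'device': -1}
```

The resulting dictionary could then be unpacked into the call, for example HuggingFacePipeline.from_model_id(..., **device_kwargs(device=0)).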

Batch GPU Inference

If you are running on a device with a GPU, you can also run inference on the GPU in batch mode.

gpu_llm = HuggingFacePipeline.from_model_id(
    model_id="bigscience/bloom-1b7",
    task="text-generation",
    device=0,  # -1 for CPU
    batch_size=2,  # adjust as needed based on GPU memory and model size.
    model_kwargs={"temperature": 0, "max_length": 64},
)

gpu_chain = prompt | gpu_llm.bind(stop=["\n\n"])

questions = []
for i in range(4):
    questions.append({"question": f"What is the number {i} in french?"})

answers = gpu_chain.batch(questions)
for answer in answers:
    print(answer)
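
As a rough illustration of what batch_size means, the sketch below groups the same list of questions into consecutive batches of two. It is a conceptual example only: chunked is a hypothetical helper, and the way the library actually schedules batches on the GPU may differ.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive groups of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


questions = [{"question": f"What is the number {i} in french?"} for i in range(4)]
batches = list(chunked(questions, batch_size=2))
for batch in batches:
    print([item["question"] for item in batch])
```

With four questions and a batch size of two, the inputs fall into two batches of two prompts each.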

Inference with OpenVINO backend

To deploy a model with OpenVINO, specify the backend="openvino" parameter to use OpenVINO as the backend inference framework. If you have an Intel GPU, you can specify model_kwargs={"device": "GPU"} to run inference on it.

pip install --upgrade-strategy eager "optimum[openvino,nncf]" --quiet

ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""}

ov_llm = HuggingFacePipeline.from_model_id(
    model_id="gpt2",
    task="text-generation",
    backend="openvino",
    model_kwargs={"device": "CPU", "ov_config": ov_config},
    pipeline_kwargs={"max_new_tokens": 10},
)

ov_chain = prompt | ov_llm

question = "What is electroencephalography?"

print(ov_chain.invoke({"question": question}))

Inference with local OpenVINO model

You can export your model to the OpenVINO IR format with the CLI and then load it from a local folder.

!optimum-cli export openvino --model gpt2 ov_model_dir

To reduce inference latency and model footprint, it is recommended to apply 8-bit or 4-bit weight quantization with --weight-format:

!optimum-cli export openvino --model gpt2  --weight-format int8 ov_model_dir # for 8-bit quantization

!optimum-cli export openvino --model gpt2  --weight-format int4 ov_model_dir # for 4-bit quantization

ov_llm = HuggingFacePipeline.from_model_id(
    model_id="ov_model_dir",
    task="text-generation",
    backend="openvino",
    model_kwargs={"device": "CPU", "ov_config": ov_config},
    pipeline_kwargs={"max_new_tokens": 10},
)

ov_chain = prompt | ov_llm

question = "What is electroencephalography?"

print(ov_chain.invoke({"question": question}))

You can get additional inference speed improvements from Dynamic Quantization of activations and KV-cache quantization. These options can be enabled through ov_config as follows:

ov_config = {
    "KV_CACHE_PRECISION": "u8",
    "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32",
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "",
}
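
The quantization settings can also be layered on top of the base configuration used earlier instead of retyping every key. The following is a small sketch using a plain dictionary merge; the names base_ov_config and quantization_options are chosen for this illustration only.

```python
# Base options used in the earlier OpenVINO examples.
base_ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""}

# Extra options that enable KV-cache and dynamic quantization.
quantization_options = {
    "KV_CACHE_PRECISION": "u8",
    "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32",
}

# Merge into a new dict so the base configuration stays unchanged.
ov_config = {**base_ov_config, **quantization_options}
model_kwargs = {"device": "CPU", "ov_config": ov_config}
print(model_kwargs)
```

The merged model_kwargs has the same shape as the value passed to from_model_id in the examples above.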

For more information, refer to the OpenVINO LLM guide and the OpenVINO Local Pipelines notebook.
