Huggingface Endpoints

The Hugging Face Hub is an online platform hosting over 120,000 models, 20,000 datasets, and 50,000 demo apps (Spaces), all open source and publicly available, where people can easily collaborate and build ML together.

The Hugging Face Hub also offers various endpoints for building ML applications. This example shows how to connect to different endpoint types. In particular, text generation inference is powered by Text Generation Inference (TGI): a custom-built Rust, Python, and gRPC server for blazing-fast text generation inference.

from langchain_huggingface import HuggingFaceEndpoint

Installation and Setup

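To use it, you need the huggingface_hub Python package installed, along with langchain-huggingface, which provides the HuggingFaceEndpoint class imported above.

pip install -qU huggingface_hub langchain-huggingface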

# get a token: https://huggingface.co/docs/api-inference/quicktour#get-your-api-token

from getpass import getpass

HUGGINGFACEHUB_API_TOKEN = getpass()

import os

os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN

Prepare Examples

from langchain_huggingface import HuggingFaceEndpoint

from langchain_core.prompts import PromptTemplate

question = "Who won the FIFA World Cup in the year 1994? "

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate.from_template(template)
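`PromptTemplate.from_template` defaults to f-string-style placeholders. For a simple template like this one, the text the model receives should therefore match plain `str.format` substitution. The following sketch renders the prompt without LangChain so you can see exactly what is sent. In LangChain itself, `prompt.format(question=question)` should return the same string.

```python
question = "Who won the FIFA World Cup in the year 1994? "

template = """Question: {question}

Answer: Let's think step by step."""

# Substitute the {question} placeholder the same way an f-string template would.
rendered = template.format(question=question)
print(rendered)
```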

Examples

Here is an example of how to access the HuggingFaceEndpoint integration through the serverless Inference Providers API.

repo_id = "deepseek-ai/DeepSeek-R1-0528"

llm = HuggingFaceEndpoint(
    repo_id=repo_id,
    max_new_tokens=128,
    temperature=0.5,
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN,
    provider="auto",  # set your provider here hf.co/settings/inference-providers
    # provider="hyperbolic",
    # provider="nebius",
    # provider="together",
)
llm_chain = prompt | llm
print(llm_chain.invoke({"question": question}))

Dedicated Endpoint

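The free serverless API lets you implement solutions and iterate quickly. Because the load is shared with other requests, however, heavy use cases may hit rate limits. For enterprise workloads, Inference Endpoints - Dedicated is the best choice. It gives you access to fully managed infrastructure with more flexibility and speed, and these resources come with continuous support, uptime guarantees, and options such as AutoScaling.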
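When a shared serverless endpoint rate-limits you, a common client-side mitigation is to retry with exponential backoff. The sketch below is a generic, hypothetical helper (`call_with_backoff` is not part of `langchain_huggingface`). Deciding which errors are retryable is left to the caller, because the exception raised for an HTTP 429 depends on your client version.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(fn: Callable[[], T],
                      is_retryable: Callable[[Exception], bool],
                      max_attempts: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying retryable errors with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Give up on non-retryable errors or once attempts are exhausted.
            if not is_retryable(exc) or attempt == max_attempts - 1:
                raise
            # Wait base_delay * 2**attempt seconds, plus up to 10% random jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")
```

For example, you could wrap the chain as `call_with_backoff(lambda: llm_chain.invoke({"question": question}), is_retryable=lambda e: "429" in str(e))`. Matching on the message text is only a crude placeholder for a proper status-code check.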

# Set the url to your Inference Endpoint below
your_endpoint_url = "https://fayjubiy2xqn36z0.us-east-1.aws.endpoints.huggingface.cloud"

llm = HuggingFaceEndpoint(
    endpoint_url=f"{your_endpoint_url}",
    max_new_tokens=512,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
)
llm("What did foo say about bar?")

Streaming

from langchain_core.callbacks import StreamingStdOutCallbackHandler
from langchain_huggingface import HuggingFaceEndpoint

llm = HuggingFaceEndpoint(
    endpoint_url=your_endpoint_url,
    max_new_tokens=512,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
    streaming=True,
)
# Callbacks are passed through the run config; this handler prints each token as it arrives.
llm.invoke("What did foo say about bar?", config={"callbacks": [StreamingStdOutCallbackHandler()]})

The same HuggingFaceEndpoint class can be used with a local Hugging Face TGI instance serving the LLM. See the TGI repository for details on support for various hardware (GPU, TPU, Gaudi, …).
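A local TGI server needs time to download and load weights before it can answer requests. The sketch below is a hypothetical readiness check. It assumes TGI's `GET /health` route returns HTTP 200 once the model is ready. The probe, sleep, and clock functions are injectable so the polling logic can be tried without a server. Once the check returns `True`, you could pass the local URL (for example `http://localhost:8080`, assuming that port mapping) as `endpoint_url` to `HuggingFaceEndpoint`, as in the examples above.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if GET url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


def wait_until_ready(base_url: str, deadline_s: float = 300.0,
                     interval_s: float = 5.0,
                     probe: Callable[[str], bool] = http_ok,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll <base_url>/health until it succeeds or the deadline passes."""
    health_url = base_url.rstrip("/") + "/health"
    start = clock()
    while clock() - start < deadline_s:
        if probe(health_url):
            return True
        sleep(interval_s)
    return False
```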
