๋ฐ˜์‘ํ˜•

๋Œ€๊ทœ๋ชจ ์ œ์กฐ ๋ฐ์ดํ„ฐ ๋ถ„์„ ๋ฐ ๋ณด๊ณ ์„œ ์ƒ์„ฑ์„ ์œ„ํ•œ Agentic AI ์‹œ์Šคํ…œ์—์„œ
LLM ์ปดํ“จํŒ… ๋น„์šฉ๊ณผ ์‘๋‹ต ์ง€์—ฐ(latency) ์€ ์ฃผ์š” ๋ฌธ์ œ์ด๋‹ค.

๋‹ค์Œ ์‚ฌํ•ญ์„ ํฌํ•จํ•˜์—ฌ ํšจ์œจํ™” ์ „๋žต์„ ์ œ์‹œํ•˜์‹œ์˜ค.

 

1. ๋ชจ๋ธ ์„œ๋น™ ๋ฐ ์บ์‹ฑ ์ „๋žต (vLLM, Triton, TensorRT ๋“ฑ)

2. ํ† ํฐ ๋‹จ์œ„ ์ตœ์ ํ™” (Prompt/Response Caching, Prefix Tuning ๋“ฑ)

3. ๋ชจ๋ธ ์••์ถ• ๋ฐ ๋ถ„์‚ฐ ์„œ๋น™ ์ „๋žต (Quantization, Sharding, Mixture-of-Experts ๋“ฑ)

 

 

โ‘  ๋ฌธ์ œ ์ธ์‹

  • ์ œ์กฐ ํ˜„์žฅ์€ ๋Œ€๋Ÿ‰ ๋ณด๊ณ ์„œ(์ˆ˜์ฒœ๊ฑด/์ผ) ์ƒ์„ฑ ์š”๊ตฌ → LLM ํ˜ธ์ถœ ๋น„์šฉ·์ง€์—ฐ์ด ๊ธ‰์ฆ.
  • ๋”ฐ๋ผ์„œ LLM์˜ ์ปดํ“จํŒ… ํšจ์œจํ™”(Serving + Token + Storage) ๊ฐ€ ํ•ต์‹ฌ์ด๋‹ค.

โ‘ก ๋ชจ๋ธ ์„œ๋น™ ์ตœ์ ํ™”

์ „๋žต๊ธฐ์ˆ ์„ค๋ช…
vLLM Continuous batching + PagedAttention ์—ฌ๋Ÿฌ ์š”์ฒญ์„ ํ•œ ๋ฒˆ์— ์ฒ˜๋ฆฌํ•˜์—ฌ GPU ํ™œ์šฉ๋ฅ  ๊ทน๋Œ€ํ™”
Triton Server Multi-model serving LLM + ML๋ชจ๋ธ + RAG ์ธํผ๋Ÿฐ์Šค ํ†ตํ•ฉ ์„œ๋น™
TensorRT-LLM FP8 quant + graph fusion GPU inference latency 30~40% ๋‹จ์ถ•
Async Queue Redis + asyncio ๋™์‹œ ์š”์ฒญ์„ ๋น„๋™๊ธฐ๋กœ ํ์ž‰

์˜ˆ์‹œ ๊ตฌ์กฐ:

 
Client → API Gateway → vLLM → Cache → Report Agent

โ‘ข ํ† ํฐ ํšจ์œจํ™” ์ „๋žต

๋ฐฉ๋ฒ•์„ค๋ช…๊ธฐ๋Œ€ํšจ๊ณผ
Prompt Caching ๋™์ผ ์งˆ์˜ ํ”„๋กฌํ”„ํŠธ ํ•ด์‹œ ์ €์žฅ ๋ฐ˜๋ณต ๋ณด๊ณ ์„œ ์žฌ์‚ฌ์šฉ
Prefix Tuning ๊ณต์ •๋ณ„ ํŠนํ™” prefix๋งŒ ๋ฏธ์„ธ์กฐ์ • ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜ ๊ฐ์†Œ(0.3~1%)
Response Caching “query+context hash” ์บ์‹œ ํ‚ค๋กœ ์ €์žฅ RAG ๋ฐ˜๋ณต ํ˜ธ์ถœ ๊ฐ์†Œ
Streaming Output ์ฆ‰์‹œ ์‘๋‹ต ์ŠคํŠธ๋ฆผ ์ „๋‹ฌ UX ๊ฐœ์„ , ์ง€์—ฐ ์ฒด๊ฐ ๊ฐ์†Œ

โ‘ฃ ๋ชจ๋ธ ์••์ถ• ๋ฐ ๋ถ„์‚ฐ ์„œ๋น™

๊ธฐ์ˆ ๋‚ด์šฉ์žฅ์ 
Quantization (4bit/8bit) ์ •๋ฐ€๋„ ๋‚ฎ์ถฐ ๋ฉ”๋ชจ๋ฆฌ ์ ˆ์•ฝ 70% GPU VRAM ์ ˆ๊ฐ
Sharding / ZeRO ๋Œ€ํ˜• ๋ชจ๋ธ์„ GPU๊ฐ„ ๋ถ„ํ•  ๋Œ€๊ทœ๋ชจ LLM ์„œ๋น™ ๊ฐ€๋Šฅ
MoE (Mixture of Experts) ์š”์ฒญ๋ณ„๋กœ ์ผ๋ถ€ ์ „๋ฌธ๊ฐ€ ๋ ˆ์ด์–ด๋งŒ ํ™œ์„ฑ ํ‰๊ท  ์—ฐ์‚ฐ๋Ÿ‰ 20~40% ๊ฐ์†Œ

โ‘ค ์‹ค๋ฌด ์‹œ๋‚˜๋ฆฌ์˜ค

  • 13B ๋ชจ๋ธ(vLLM) 3๊ฐœ → GPU 4์žฅ(48GB)
  • PromptCache ํ™œ์„ฑํ™” → ๋ฐ˜๋ณต ์งˆ์˜ ์‘๋‹ต ์†๋„ 3๋ฐฐ ๊ฐœ์„ 
  • FP8 TensorRT ๋ณ€ํ™˜ → ๋‹จ์ผ ๋ณด๊ณ ์„œ ์‘๋‹ต์‹œ๊ฐ„ 9.8s → 4.3s
  • ๋น„์šฉ ์ ˆ๊ฐ: GPU ์‚ฌ์šฉ๋ฅ  35% ↓, ์›” $3,000 ์ ˆ์•ฝ

โ‘ฅ ํ‰๊ฐ€ ํฌ์ธํŠธ

  • ๋ชจ๋ธ ์„œ๋น™ ๊ตฌ์กฐ(vLLM/Triton)์™€ ํ† ํฐ ์ตœ์ ํ™”๋ฅผ ๊ตฌ์ฒด์ ์œผ๋กœ ์–ธ๊ธ‰ํ–ˆ๋Š”๊ฐ€
  • Quantization/MoE ๊ฐ™์€ ์ปดํ“จํŒ… ์ ˆ๊ฐ ๊ธฐ์ˆ ์˜ ์›๋ฆฌ๋ฅผ ์„ค๋ช…ํ–ˆ๋Š”๊ฐ€
  • ์‹ค์ œ ์šด์˜ ํšจ๊ณผ(์†๋„·๋น„์šฉ ๊ฐœ์„ )๋ฅผ ์ˆ˜์น˜๋กœ ์ œ์‹œํ–ˆ๋Š”๊ฐ€
๋ฐ˜์‘ํ˜•

+ Recent posts