Rapid-MLX: A Local AI Engine for Apple Silicon Starts to Meet Agent Workflow Needs

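This Threads post introduces Rapid-MLX, a local AI engine built for Apple Silicon. The post leads with "4.2x faster than Ollama", but the part worth keeping is not any single speed figure. It is that Rapid-MLX moves local LLMs on the Mac from "it runs" toward "it can plug into an agent workflow": an OpenAI-compatible API, tool calling, prompt cache, reasoning separation, MCP / agent harness compatibility, and model selection by Mac RAM.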

The official GitHub README states its positioning plainly: "Run AI on your Mac. Faster than anything else." It builds on Apple's MLX framework, using Metal compute kernels and unified memory, so that OpenAI-compatible apps such as Cursor, Claude Code, Hermes Agent, PydanticAI, LangChain, and Aider can point directly at the local endpoint http://localhost:8000/v1.

Quick start

The project recommends installing with Homebrew:

brew install raullenchai/rapid-mlx/rapid-mlx

pip also works:

pip install rapid-mlx

Or use the one-line installer:

curl -fsSL https://raullenchai.github.io/Rapid-MLX/install.sh | bash

Start a model:

rapid-mlx serve gemma-4-26b

Then call it through the OpenAI-compatible API:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Say hello"}]}'
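The same request can be sketched in Python using only the standard library. This is a minimal sketch, assuming the server follows the standard OpenAI chat-completions response shape (`choices[0].message.content`); the helper names `build_chat_request` and `chat` are mine, not part of Rapid-MLX.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # local endpoint quoted in the article


def build_chat_request(prompt, model="default", base_url=BASE_URL):
    """Build (but do not send) a chat-completions request, mirroring the curl call."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=60):
    """Send the request and return the assistant text.

    Needs a running server. The response layout is an assumption based on
    the OpenAI chat-completions format.
    """
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With `rapid-mlx serve` running, calling `chat("Say hello")` should return the model's reply as a string. Building the request separately from sending it makes the payload easy to inspect without a live server.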

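If you hit "No matching distribution", the usual cause is that the Python bundled with macOS is too old. The README specifically notes that macOS ships 3.9 while the pip install requires Python 3.10+, so you can install Python 3.12 through Homebrew first.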

Choosing a model by Mac RAM

The recommendations in the official README are practical:

16GB MacBook Air / Pro: Qwen3.5-4B 4bit, about 2.4GB RAM, 160 tok/s.
24GB MacBook Pro: Qwen3.5-9B 4bit, about 5.1GB RAM, 108 tok/s.
32GB Mac Mini / Studio: Qwen3.5-27B 4bit, about 15.3GB RAM, 39 tok/s.
32GB Mac Mini / Studio: Nemotron-Nano 30B 4bit, about 18GB RAM, 141 tok/s, pitched as the fastest 30B with 100% tool calling.
48GB / 64GB Mac Mini / Studio: Qwen3.5-35B-A3B 8bit, about 37GB RAM, 83 tok/s, which the README calls the "smart + fast sweet spot".
96GB Mac Studio / Pro: Qwen3.5-122B mxfp4, about 65GB RAM, 57 tok/s.
128GB+ Mac Studio / Pro: DeepSeek V4 Flash 2-bit DQ, about 91GB RAM, 56 tok/s.
192GB / 256GB: room for higher-quality or larger models, such as Qwen3.5-122B 8bit or DeepSeek V4 Flash 8-bit.

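This is more useful than simply claiming that "a Mac can run large models", because the real limit on local models is usually not whether they start. It is RAM, KV cache, memory pressure at long context, time to first token, and sustainable speed.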
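The tiers above can be sketched as a small lookup that also reports how much RAM is left for KV cache and everything else. This is an illustrative sketch: the numbers are copied from the README list, the function names are hypothetical, and the Nemotron-Nano alternative at 32GB and the 192GB / 256GB options are left out to keep one entry per tier.

```python
# (machine_ram_gb, model, approx_model_ram_gb, tok_per_s), copied from the list above.
RECOMMENDATIONS = [
    (16, "Qwen3.5-4B 4bit", 2.4, 160),
    (24, "Qwen3.5-9B 4bit", 5.1, 108),
    (32, "Qwen3.5-27B 4bit", 15.3, 39),
    (48, "Qwen3.5-35B-A3B 8bit", 37, 83),
    (96, "Qwen3.5-122B mxfp4", 65, 57),
    (128, "DeepSeek V4 Flash 2-bit DQ", 91, 56),
]


def pick_model(ram_gb):
    """Return the largest listed tier whose machine RAM fits, or None below 16GB."""
    best = None
    for tier in RECOMMENDATIONS:
        if ram_gb >= tier[0]:
            best = tier
    return best


def headroom_gb(ram_gb):
    """RAM left after loading the recommended model (for KV cache, OS, other apps)."""
    tier = pick_model(ram_gb)
    return None if tier is None else ram_gb - tier[2]
```

For example, `pick_model(64)` lands on the Qwen3.5-35B-A3B tier, and `headroom_gb(48)` shows that a 48GB machine keeps only 11GB free with that model loaded, which is where long-context memory pressure starts to matter.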

The features that matter most for agents

1. OpenAI-compatible server

Once Rapid-MLX is running, it is a local OpenAI-compatible endpoint. In principle, that means tools such as Cursor, Continue.dev, Open WebUI, LibreChat, PydanticAI, LangChain, Aider, Goose, Claw Code, and Hermes Agent can all connect to it.

Example Hermes config:

model:
  provider: "custom"
  default: "default"
  base_url: "http://localhost:8000/v1"
  context_length: 32768

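For BigIntTech, this means a local Mac mini / Mac Studio can serve as a low-cost agent inference node, handling tasks that do not need the highest intelligence but do involve frequent tool calls or privacy-sensitive data.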

2. Tool calling and parser recovery

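The README says Rapid-MLX supports OpenAI-compatible tool calling, with 17 parser formats and automatic recovery. A common problem with 4-bit quantized models after several rounds of tool calls is format collapse: the output looks like a tool call, but the JSON or schema is unstable. Rapid-MLX is designed to detect broken output automatically and convert it back into structured tool_calls.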

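This matters a lot for agents. A local model that can only chat has limited value; only with stable tool calling does it have a chance of plugging into a workflow.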
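To make the idea of recovery concrete, here is a toy sketch of pulling a tool call out of noisy model text. It is not Rapid-MLX's parser, whose 17 formats I have not read; it only assumes the model emitted a JSON object with `name` and `arguments` keys somewhere in its output, and `recover_tool_call` is a hypothetical helper name.

```python
import json


def recover_tool_call(text):
    """Scan `text` for the first JSON object that has a "name" key.

    Returns a simplified OpenAI-style tool_call dict, or None if nothing
    usable is found. Illustrative only, not Rapid-MLX's implementation.
    """
    decoder = json.JSONDecoder()
    pos = text.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at `pos` and ignores trailing text.
            obj, _ = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and "name" in obj:
            args = obj.get("arguments", {})
            if not isinstance(args, str):
                args = json.dumps(args)  # OpenAI-style arguments are a JSON string
            return {
                "type": "function",
                "function": {"name": obj["name"], "arguments": args},
            }
        pos = text.find("{", pos + 1)
    return None
```

Given output such as `Sure: <tool_call>{"name": "get_weather", "arguments": {"city": "Taipei"}}</tool_call>`, the function skips the surrounding chatter and wrapper tags and returns a structured call. A real parser also has to handle truncated JSON, multiple calls, and model-specific wrappers, which is presumably where the per-format parsers come in.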

3. Prompt cache / hybrid model snapshot

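The post specifically mentions cache acceleration for Qwen3.5. The README also says Rapid-MLX has a persistent prompt cache: standard Transformers use KV cache trimming, while hybrid-architecture models such as Qwen3.5 DeltaNet use RNN state snapshots to restore the non-trimmable layers, so a fixed system prompt does not have to be recomputed. The official claim is 2–5x faster TTFT.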

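This kind of optimization suits agents well, because nearly every agent request carries a long system prompt, tool schemas, and leftover context. If the fixed prefix can be cached, time to first token and the feel of the interaction improve noticeably.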
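Why a fixed prefix helps can be shown with a toy model of prefix reuse. This sketch only counts how many tokens of a new prompt match the previous one; it stores no KV tensors or RNN state and is not how Rapid-MLX implements its cache. `PrefixCache` and `common_prefix_len` are hypothetical names.

```python
def common_prefix_len(a, b):
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixCache:
    """Toy cache: remembers the last prompt and reports how much of the next one is reusable."""

    def __init__(self):
        self.tokens = []

    def plan(self, new_tokens):
        """Return (reused_tokens, tokens_to_prefill) for the incoming prompt."""
        reused = common_prefix_len(self.tokens, new_tokens)
        self.tokens = list(new_tokens)
        return reused, len(new_tokens) - reused
```

With a 1,000-token system prompt followed by a short user turn, the first request prefills everything, while the second reuses the 1,000 shared tokens and prefills only the new tail. For a pure Transformer the cached KV entries can simply be trimmed back to the shared length; for hybrid models, the recurrent layers cannot be rewound that way, which is why the README describes state snapshots instead.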

4. Reasoning separation

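For reasoning formats such as Qwen3, DeepSeek-R1, MiniMax, and GPT-OSS, Rapid-MLX separates reasoning_content from content. This is valuable for UIs, logs, and agent planners: the internal reasoning can be kept for analysis or debugging, while the outward-facing content stays clean.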
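As a rough illustration of what that separation involves, the sketch below splits `<think>...</think>` blocks, the tag style used by Qwen3 and DeepSeek-R1, out of raw text. Other formats in the list use different markers, and this is my simplification rather than Rapid-MLX's code; `split_reasoning` is a hypothetical helper.

```python
import re

# Matches <think>...</think> blocks, including ones that span multiple lines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw):
    """Split raw model text into reasoning_content and user-facing content."""
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(raw))
    content = THINK_RE.sub("", raw).strip()
    return {"reasoning_content": reasoning or None, "content": content}
```

A planner or logger can then store `reasoning_content` for debugging while only `content` is shown to the user or passed to the next tool.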

Benchmarks and caveats

The README claims that, in tests on a Mac Studio M3 Ultra 256GB, most models beat Ollama / mlx-lm. Examples include:

Phi-4 Mini 14B: 180 tok/s versus 56 tok/s on Ollama, about 3.2x.
Qwen3.5-9B: 108 tok/s versus 41 tok/s on Ollama, about 2.6x.
Qwen3.5-35B-A3B: 83 tok/s, 100% tools.
Qwen3-Coder 80B: 74 tok/s, 100% tools.
Qwen3.5-122B: 44 tok/s, 100% tools.

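A caveat, though: speed figures depend heavily on hardware, model format, quantization, prompt length, batch size, context length, and test method. This post is worth filing as tool intelligence, but "4.2x" should not be treated as an unconditional guarantee. Before adopting it, run benchmarks on your own Mac with your own workload.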
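When running your own comparison, the two numbers worth recording are TTFT and decode throughput. The sketch below is a minimal timing harness over any iterable of streamed tokens, plus the speedup arithmetic used in the list above; it makes no assumption about how the stream is produced, and `measure` and `speedup` are hypothetical names.

```python
import time


def speedup(candidate_tok_s, baseline_tok_s):
    """Ratio of two throughput figures, rounded to one decimal place."""
    return round(candidate_tok_s / baseline_tok_s, 1)


def measure(stream):
    """Consume an iterable of tokens and return (ttft_seconds, decode_tokens_per_second).

    TTFT is the wait for the first token; throughput counts only the tokens
    that arrive after it, so prefill time does not distort the rate.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream:
        if first is None:
            first = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first is None:
        return None, 0.0
    decode_time = end - first
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return first - start, rate
```

Here `speedup(180, 56)` gives 3.2 and `speedup(108, 41)` gives 2.6, matching the README examples. Feeding `measure` the same prompts against both servers, at the context lengths your agents actually use, gives a fairer comparison than a headline multiplier.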

My take

The point of Rapid-MLX is not to replace every cloud model, but to fill in one piece of local inference on Apple Silicon:

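Running small models offline during development.
Running high-frequency, low-cost, privacy-sensitive agent subtasks.
Serving as a local OpenAI-compatible endpoint for Cursor / Claude Code / Hermes.
Testing local tool calling, structured output, and reasoning separation.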

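For Allen, I would put it on the candidate list for the Mac mini / Mac Studio local AI stack, especially since we already have Hermes Agent and several automation workflows. If we test it, the first step is not to jump to 122B. Start with Qwen3.5 / Nemotron-Nano models that run stably in 16–64GB, and measure tool calling success rate, TTFT, the cache benefit for long system prompts, and real-world stability once Hermes is connected.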

Original source: https://www.threads.com/@largitdata/post/DX67_u9D9tj

GitHub: https://github.com/raullenchai/Rapid-MLX