vLLM MLX

v1.0.0 · Data Science & ML · stable

OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support. Native MLX backend, 400+ tok/s. Works with Claude Code.

Tags: anthropic, apple-silicon, audio-processing, claude-code, computer-vision
Stars: 1,232 · Downloads: 0 · Weekly: 0 · Rating: 0/5

What is vLLM MLX?

vLLM MLX is an OpenAI- and Anthropic-compatible inference server for Apple Silicon, listed on MCPgee as a Model Context Protocol (MCP) server for AI assistants like Claude, Cursor, and VS Code. It runs LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support on a native MLX backend.

This server falls under the Data Science & ML category on MCPgee, the world's largest MCP server directory with 33,000+ servers.

Features

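  • OpenAI and Anthropic compatible server for Apple Silicon
  • Runs LLMs and vision-language models (Llama, Qwen-VL, LLaVA)
  • Continuous batching
  • MCP tool calling
  • Multimodal support
  • Native MLX backend, 400+ tok/s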

Use Cases

Run LLMs on Apple Silicon natively
Vision-language models with continuous batching

Maintainer: waybarrios
License: Apache-2.0
Language: Python
Version: v1.0.0
Updated: May 22, 2026
Status: healthy
Maintenance: active

Works with

Claude, OpenAI · Windows, macOS, Linux (the server itself requires a Mac with Apple Silicon; see Prerequisites)

Installation

Manual Installation

uv tool install vllm-mlx

Configuration

Configuration Details

Config File: claude_desktop_config.json

Performance

Response Metrics

Response Time: < 200 ms
Throughput: Medium

Resource Usage

Memory Usage: Low
CPU Usage: Low

How to Set Up and Use vLLM MLX

vllm-mlx is an OpenAI- and Anthropic-API-compatible inference server built on Apple's MLX framework for Apple Silicon Macs. It runs large language models and vision-language models (including Llama, Qwen-VL, and LLaVA) with continuous batching and native Metal GPU acceleration, at a stated throughput of over 400 tokens per second. It exposes standard /v1/chat/completions, /v1/completions, /v1/embeddings, and /v1/messages endpoints and includes 12 MCP tool-calling parsers, structured JSON output, multimodal support for text, images, video, and audio, plus built-in TTS and Whisper-based STT. Together these make it a complete local AI inference backend for Claude Code and other MCP clients.

Prerequisites

  • A Mac with Apple Silicon (M1, M2, M3, or M4 series chip)
  • macOS 13 Ventura or later
  • Python 3.10 or higher with pip or the uv package manager
  • At least 8 GB unified memory (16 GB or more recommended for 7B+ models)
  • An MLX-compatible model from Hugging Face (e.g. mlx-community/Llama-3.2-3B-Instruct-4bit)

1. Install vllm-mlx

Install via uv (recommended for isolation) or pip. For audio/TTS support, install with the optional audio extra and the espeak-ng system dependency.

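# Install as an isolated tool with uv (recommended)
uv tool install vllm-mlx
# Or install with pip, including the optional audio/TTS extra
# (quoted so that zsh does not treat the brackets as a glob pattern)
pip install "vllm-mlx[audio]"
# System dependency required for audio/TTS support
brew install espeak-ng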

2. Start the inference server

Launch the server with a model from mlx-community on Hugging Face. The model is downloaded automatically on first use. The --continuous-batching flag improves throughput when the server handles several requests at once.

vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching

3. Test the OpenAI-compatible endpoint

Verify the server is running by sending a test request with the OpenAI Python SDK or curl. No real API key is needed — any string works.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello!"}]}'
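The same request can be sent from Python. The following is a minimal sketch using only the standard library, mirroring the curl call above. The helper names (build_chat_request, extract_reply, send_chat) are hypothetical, and the response shape is assumed to follow the OpenAI chat completions format (choices[0].message.content).

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000", model="default"):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,  # "default" is the model name used in the curl example
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return the assistant text from an OpenAI-style chat completion response."""
    return response_body["choices"][0]["message"]["content"]


def send_chat(prompt, timeout=60):
    """Send the prompt to a running local server and return the reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

With the server from step 2 running, print(send_chat("Hello!")) should print the model's reply. A connection error usually means the server is not listening on port 8000.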

4. Point Claude Code at the local server

Set the ANTHROPIC_BASE_URL and ANTHROPIC_API_KEY environment variables to redirect Claude Code to your local vllm-mlx server instead of Anthropic's cloud API.

export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude
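To check the Anthropic-compatible endpoint without launching Claude Code, a request can be built by hand. This sketch assumes the /v1/messages endpoint accepts the public Anthropic Messages API request shape (model, max_tokens, messages) and the usual x-api-key and anthropic-version headers. Whether vllm-mlx requires or ignores those headers is not confirmed here. The function name build_messages_request is hypothetical.

```python
import json
import os
import urllib.request


def build_messages_request(prompt, env=None, model="default", max_tokens=256):
    """Build a POST request in the Anthropic Messages API shape."""
    env = os.environ if env is None else env
    # Reuse the variables Claude Code reads; fall back to the local defaults.
    base_url = env.get("ANTHROPIC_BASE_URL", "http://localhost:8000")
    api_key = env.get("ANTHROPIC_API_KEY", "not-needed")
    payload = {
        "model": model,  # assumption: same placeholder model name as the curl test
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/messages",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
        },
        method="POST",
    )
```

Passing the result to urllib.request.urlopen sends the request. If the URL printed by build_messages_request("Hi").full_url does not start with http://localhost:8000, the environment variable is not set in the current process.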

5. Configure as an MCP server in Claude Desktop

To expose vllm-mlx as an MCP server that provides local model inference as a tool, add it to your claude_desktop_config.json.

{
  "mcpServers": {
    "vllm-mlx": {
      "command": "vllm-mlx",
      "args": ["serve", "mlx-community/Llama-3.2-3B-Instruct-4bit", "--port", "8000"],
      "env": {
        "ANTHROPIC_BASE_URL": "http://localhost:8000",
        "ANTHROPIC_API_KEY": "not-needed"
      }
    }
  }
}
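Editing claude_desktop_config.json by hand risks overwriting other servers already listed in it. The sketch below merges a single entry under "mcpServers" and leaves everything else untouched. The function name add_mcp_server is hypothetical, and the location of the config file is left as a parameter because it depends on the client and platform.

```python
import json
from pathlib import Path


def add_mcp_server(config_path, name, command, args, env=None):
    """Add or update one entry under "mcpServers", keeping all other settings."""
    path = Path(config_path)
    # Start from the existing config if there is one, otherwise from scratch.
    config = json.loads(path.read_text()) if path.exists() else {}
    entry = {"command": command, "args": list(args)}
    if env:
        entry["env"] = dict(env)
    config.setdefault("mcpServers", {})[name] = entry
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

For example, add_mcp_server(path, "vllm-mlx", "vllm-mlx", ["serve", "mlx-community/Llama-3.2-3B-Instruct-4bit", "--port", "8000"]) writes the entry used in this step. Restart the client afterwards so it rereads the file.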

6. Enable advanced features (optional)

For long-context models, add --ssd-cache-dir to spill the KV cache to disk. For reasoning models, add --reasoning-parser qwen3 or --reasoning-parser deepseek. Enable Prometheus metrics with --metrics.

vllm-mlx serve mlx-community/Qwen2.5-7B-Instruct-4bit \
  --port 8000 \
  --continuous-batching \
  --ssd-cache-dir /tmp/kvcache \
  --reasoning-parser qwen3 \
  --metrics
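When the server is launched from a script, the optional flags can be assembled programmatically. This sketch uses only the flags named in this guide. The function build_serve_command is a hypothetical helper, not part of vllm-mlx.

```python
import shlex


def build_serve_command(model, port=8000, continuous_batching=True,
                        ssd_cache_dir=None, reasoning_parser=None, metrics=False):
    """Assemble the argument list for `vllm-mlx serve` from optional features."""
    command = ["vllm-mlx", "serve", model, "--port", str(port)]
    if continuous_batching:
        command.append("--continuous-batching")
    if ssd_cache_dir:
        command += ["--ssd-cache-dir", ssd_cache_dir]
    if reasoning_parser:
        command += ["--reasoning-parser", reasoning_parser]
    if metrics:
        command.append("--metrics")
    return command


def format_command(command):
    """Render the argument list as a shell-safe string for logging."""
    return shlex.join(command)
```

The returned list can be passed directly to subprocess.Popen, which avoids shell quoting problems with paths that contain spaces.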

vLLM MLX Examples

Client configuration

Claude Desktop configuration using vllm-mlx as a local inference backend via the Anthropic-compatible endpoint.

{
  "mcpServers": {
    "vllm-mlx": {
      "command": "vllm-mlx",
      "args": [
        "serve",
        "mlx-community/Llama-3.2-3B-Instruct-4bit",
        "--port",
        "8000",
        "--continuous-batching"
      ],
      "env": {
        "ANTHROPIC_BASE_URL": "http://localhost:8000",
        "ANTHROPIC_API_KEY": "not-needed"
      }
    }
  }
}

Prompts to try

Example uses once vllm-mlx is running locally and Claude Code is pointed at it.

- "Summarize this document locally without sending it to the cloud" (paste large text)
- "Generate embeddings for these code snippets and find the most similar ones"
- "Transcribe this audio file using the local Whisper model"
- "Describe what's in this screenshot" (using a vision-language model like LLaVA)
- "Run a benchmark with 5 concurrent requests and report the tokens-per-second throughput"
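The benchmark prompt in the last item can also be run as a script. This is a rough sketch, assuming each response carries an OpenAI-style usage.completion_tokens field. The names run_benchmark and send_request are hypothetical, and send_request is any callable that posts one prompt to the server and returns the parsed JSON response.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_benchmark(send_request, prompts, concurrency=5):
    """Send prompts concurrently and report aggregate completion tokens per second.

    send_request(prompt) must return an OpenAI-style response dict that
    includes usage.completion_tokens.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        responses = list(pool.map(send_request, prompts))
    elapsed = time.perf_counter() - start
    total_tokens = sum(r["usage"]["completion_tokens"] for r in responses)
    return {
        "requests": len(responses),
        "completion_tokens": total_tokens,
        "seconds": elapsed,
        "tokens_per_second": total_tokens / elapsed if elapsed > 0 else 0.0,
    }
```

The figure is aggregate throughput across all in-flight requests, so it is expected to be higher with --continuous-batching enabled than the speed seen by any single request.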

Troubleshooting vLLM MLX

Model download fails or is very slow on first run

Models are downloaded from Hugging Face on first use. Ensure you have enough free disk space (4-bit quantized 7B models need about 4 GB). Set HF_HOME to a directory on a fast SSD to cache models for reuse.
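The "about 4 GB" figure follows from simple arithmetic: parameters times bits per weight, divided by 8 bits per byte. The sketch below computes that lower bound. It ignores quantization scales, any layers kept at higher precision, and tokenizer files, which is why real downloads are somewhat larger. The function name estimate_weights_gb is hypothetical.

```python
def estimate_weights_gb(params_billions, bits_per_weight):
    """Approximate size of the quantized weights in gigabytes (10**9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9
```

For a 7B model, estimate_weights_gb(7, 4) gives 3.5 and estimate_weights_gb(7, 8) gives 7.0, so an 8-bit variant needs roughly twice the disk space and memory of a 4-bit one.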

Server crashes with 'Out of memory' error

Switch to a smaller or more aggressively quantized model (e.g. 4-bit instead of 8-bit). Add --ssd-cache-dir to spill the KV cache to disk. Send fewer concurrent requests, or lower the --max-tokens limit so each request holds less KV cache.

Claude Code still connects to Anthropic cloud instead of local server

Verify ANTHROPIC_BASE_URL is exported in the same shell session where you run 'claude'. Run 'echo $ANTHROPIC_BASE_URL' to confirm it shows 'http://localhost:8000'. Restart claude after setting the variable.

Frequently Asked Questions about vLLM MLX

What is vLLM MLX?

vLLM MLX is an OpenAI- and Anthropic-compatible inference server for Apple Silicon, listed on MCPgee as a Model Context Protocol (MCP) server. It runs LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support on a native MLX backend at 400+ tok/s, and it works with Claude Code. Like other MCP servers, it connects AI assistants to external tools and data sources through a standardized interface.

How do I install vLLM MLX?

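Install it with uv tool install vllm-mlx (or pip), start the server with vllm-mlx serve and a model name, and add the server config to your AI client. The setup steps above cover each stage. See the vLLM MLX GitHub repository for the full installation instructions.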

Which AI clients work with vLLM MLX?

vLLM MLX works with all major MCP-compatible AI clients including Claude Desktop, Claude Code, Cursor, VS Code (GitHub Copilot), Windsurf, and Cline.

Is vLLM MLX free to use?

Yes, vLLM MLX is open source and available under the Apache-2.0 license. You can use it freely in both personal and commercial projects.


Quick Config Preview

{ "mcpServers": { "vllm-mlx": { "command": "vllm-mlx", "args": ["serve", "mlx-community/Llama-3.2-3B-Instruct-4bit", "--port", "8000"] } } }

Add this to your claude_desktop_config.json or .cursor/mcp.json
