vLLM MLX

Name: vLLM MLX MCP Server
Author: waybarrios

v1.0.0 • Data Science & ML • stable

OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support. Native MLX backend, 400+ tok/s. Works with Claude Code.

Tags: anthropic, apple-silicon, audio-processing, claude-code, computer-vision

Stars: 1,232


What is vLLM MLX?

vLLM MLX is an OpenAI- and Anthropic-compatible inference server for Apple Silicon, listed on MCPgee as a Model Context Protocol (MCP) server for AI assistants such as Claude, Cursor, and VS Code. It runs LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support on a native MLX backend.

This server falls under the Data Science & ML category on MCPgee, the world's largest MCP server directory with 33,000+ servers.

Features

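OpenAI- and Anthropic-compatible server for Apple Silicon. Runs LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support.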

Use Cases

Run LLMs on Apple Silicon natively

Vision-language models with continuous batching

Maintainer: waybarrios
License: Apache-2.0
Language: Python
Version: v1.0.0
Updated: May 22, 2026
Status: healthy
Maintenance: active

Works with: Claude, OpenAI, Windows, macOS, Linux


Installation

Manual Installation

pip install vllm-mlx

Configuration

Configuration Details

Config File: claude_desktop_config.json

Performance

Response Metrics

Response Time: < 200 ms
Throughput: Medium

Resource Usage

Memory Usage: Low
CPU Usage: Low

How to Set Up and Use vLLM MLX

vllm-mlx is an OpenAI- and Anthropic-compatible inference server built on Apple's MLX framework for Apple Silicon Macs. It runs large language models and vision-language models (including Llama, Qwen-VL, and LLaVA) with continuous batching and native Metal GPU acceleration, at a stated throughput of over 400 tokens per second. It exposes the standard /v1/chat/completions, /v1/completions, /v1/embeddings, and /v1/messages endpoints and includes 12 MCP tool-calling parsers, structured JSON output, multimodal support for text, images, video, and audio, and built-in TTS and Whisper-based STT, which makes it a complete local inference backend for Claude Code and other MCP clients.

Prerequisites

- A Mac with Apple Silicon (M1, M2, M3, or M4 series chip)
- macOS 13 Ventura or later
- Python 3.10 or higher with pip or the uv package manager
- At least 8 GB unified memory (16 GB or more recommended for 7B+ models)
- An MLX-compatible model from Hugging Face (e.g. mlx-community/Llama-3.2-3B-Instruct-4bit)

Install vllm-mlx

Install via uv (recommended for isolation) or pip. For audio/TTS support, install with the optional audio extra and the espeak-ng system dependency.

uv tool install vllm-mlx
# Or with pip, including audio/TTS support
# (quote the extra so that zsh does not treat the brackets as a glob pattern):
pip install 'vllm-mlx[audio]'
brew install espeak-ng

Start the inference server

Launch the server with a model from mlx-community on Hugging Face. The model is downloaded automatically on first use. --continuous-batching improves throughput significantly.

vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching
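
Because the first launch downloads the model, the server may take a while before it accepts connections. The following is a minimal sketch of a readiness check, assuming the server listens on localhost:8000 as in the command above. The helper name wait_for_port is hypothetical and is not part of vllm-mlx.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect only shows that the port is open,
            # not that the model has finished loading.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

With the server starting in another terminal, wait_for_port("localhost", 8000, timeout=600) blocks until the port opens or ten minutes pass.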

Test the OpenAI-compatible endpoint

Verify the server is running by sending a test request with the OpenAI Python SDK or curl. No real API key is needed — any string works.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello!"}]}'
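
The same request can be sent from Python. The sketch below uses only the standard library rather than the OpenAI SDK, and it assumes the server returns the usual OpenAI response shape (choices[0].message.content). The function names are illustrative.

```python
import json
import urllib.request


def build_chat_request(prompt: str, base_url: str = "http://localhost:8000",
                       model: str = "default") -> urllib.request.Request:
    """Build the same POST request as the curl command above."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(prompt: str) -> str:
    """Send the request and return the reply text; requires a running server."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    # Assumption: OpenAI-style response body.
    return body["choices"][0]["message"]["content"]
```

Calling send_chat("Hello!") while the server is running should return the model's reply as a string.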

Point Claude Code at the local server

Set the ANTHROPIC_BASE_URL and ANTHROPIC_API_KEY environment variables to redirect Claude Code to your local vllm-mlx server instead of Anthropic's cloud API.

export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude
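
To check the Anthropic-compatible endpoint without launching Claude Code, a request to /v1/messages can be built by hand. This is a sketch under assumptions: the body follows the public Anthropic Messages format (model, max_tokens, messages), the x-api-key and anthropic-version headers are the ones Anthropic clients normally send, and the "default" model name is borrowed from the curl example. Whether vllm-mlx requires or ignores any of these is not confirmed here.

```python
import json
import os
import urllib.request


def build_messages_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST to /v1/messages using the same variables Claude Code reads."""
    base_url = os.environ.get("ANTHROPIC_BASE_URL", "http://localhost:8000")
    api_key = os.environ.get("ANTHROPIC_API_KEY", "not-needed")
    payload = {
        "model": "default",  # assumption: same placeholder as the curl example
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/messages",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
        },
        method="POST",
    )
```

Passing the result to urllib.request.urlopen sends the request. A JSON reply indicates that the local server is answering on the Anthropic-style route.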

Configure as an MCP server in Claude Desktop

To expose vllm-mlx as an MCP server that provides local model inference as a tool, add it to your claude_desktop_config.json.

{
  "mcpServers": {
    "vllm-mlx": {
      "command": "vllm-mlx",
      "args": ["serve", "mlx-community/Llama-3.2-3B-Instruct-4bit", "--port", "8000"],
      "env": {
        "ANTHROPIC_BASE_URL": "http://localhost:8000",
        "ANTHROPIC_API_KEY": "not-needed"
      }
    }
  }
}
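
If claude_desktop_config.json already lists other servers, the entry should be merged rather than pasted over the file. A minimal sketch, assuming the default macOS location of the Claude Desktop config. The helper names are hypothetical.

```python
import json
from pathlib import Path

# Default Claude Desktop config location on macOS (adjust if yours differs).
CONFIG_PATH = Path.home() / "Library/Application Support/Claude/claude_desktop_config.json"

# The entry shown above, as a Python dict.
VLLM_MLX_ENTRY = {
    "command": "vllm-mlx",
    "args": ["serve", "mlx-community/Llama-3.2-3B-Instruct-4bit", "--port", "8000"],
    "env": {
        "ANTHROPIC_BASE_URL": "http://localhost:8000",
        "ANTHROPIC_API_KEY": "not-needed",
    },
}


def add_server(config: dict, name: str, entry: dict) -> dict:
    """Return a copy of config with entry registered under mcpServers[name]."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = entry
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path = CONFIG_PATH) -> None:
    """Read the config if it exists, merge the entry, and write it back."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(add_server(existing, "vllm-mlx", VLLM_MLX_ENTRY), indent=2))
```

Running update_config_file() keeps existing servers and other settings intact. Restart Claude Desktop afterwards so that it rereads the file.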

Enable advanced features (optional)

For long-context models, add --ssd-cache-dir to spill the KV cache to disk. For reasoning models, add --reasoning-parser qwen3 or --reasoning-parser deepseek. Enable Prometheus metrics with --metrics.

vllm-mlx serve mlx-community/Qwen2.5-7B-Instruct-4bit \
  --port 8000 \
  --continuous-batching \
  --ssd-cache-dir /tmp/kvcache \
  --reasoning-parser qwen3 \
  --metrics
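
When the server is launched from a script, the optional flags can be assembled programmatically. The sketch below uses only the flags named in this section. build_serve_command is a hypothetical helper, not part of vllm-mlx.

```python
import shlex
from typing import Optional


def build_serve_command(model: str, port: int = 8000, continuous_batching: bool = True,
                        ssd_cache_dir: Optional[str] = None,
                        reasoning_parser: Optional[str] = None,
                        metrics: bool = False) -> list[str]:
    """Assemble the argument list for `vllm-mlx serve`, adding only the requested flags."""
    cmd = ["vllm-mlx", "serve", model, "--port", str(port)]
    if continuous_batching:
        cmd.append("--continuous-batching")
    if ssd_cache_dir:
        cmd += ["--ssd-cache-dir", ssd_cache_dir]
    if reasoning_parser:
        cmd += ["--reasoning-parser", reasoning_parser]
    if metrics:
        cmd.append("--metrics")
    return cmd


# Reproduces the command shown above as a single shell line.
command_line = shlex.join(build_serve_command(
    "mlx-community/Qwen2.5-7B-Instruct-4bit",
    ssd_cache_dir="/tmp/kvcache",
    reasoning_parser="qwen3",
    metrics=True,
))
```

The returned list can be passed directly to subprocess.run, and everything after the first element matches the "args" array that a Claude Desktop config entry expects.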

vLLM MLX Examples

Client configuration

Claude Desktop configuration using vllm-mlx as a local inference backend via the Anthropic-compatible endpoint.

{
  "mcpServers": {
    "vllm-mlx": {
      "command": "vllm-mlx",
      "args": [
        "serve",
        "mlx-community/Llama-3.2-3B-Instruct-4bit",
        "--port",
        "8000",
        "--continuous-batching"
      ],
      "env": {
        "ANTHROPIC_BASE_URL": "http://localhost:8000",
        "ANTHROPIC_API_KEY": "not-needed"
      }
    }
  }
}

Prompts to try

Example uses once vllm-mlx is running locally and Claude Code is pointed at it.

- "Summarize this document locally without sending it to the cloud" (paste large text)
- "Generate embeddings for these code snippets and find the most similar ones"
- "Transcribe this audio file using the local Whisper model"
- "Describe what's in this screenshot" (using a vision-language model like LLaVA)
- "Run a benchmark with 5 concurrent requests and report the tokens-per-second throughput"
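
The last prompt can also be approximated with a small script instead of asking the model. This is a rough sketch: the send callable is a placeholder that you would implement to issue one request and return the number of generated tokens, for example by reading usage.completion_tokens from the response, assuming the server reports OpenAI-style usage.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def benchmark(send: Callable[[str], int], prompt: str, concurrency: int = 5) -> float:
    """Run `concurrency` requests in parallel and return aggregate tokens per second.

    `send` issues one request and returns the number of completion tokens.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(send, [prompt] * concurrency))
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed
```

The figure is aggregate throughput across all requests, so it should rise with concurrency when the server runs with --continuous-batching.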

Troubleshooting vLLM MLX

Model download fails or is very slow on first run

Models are downloaded from Hugging Face on first use. Ensure you have enough free disk space (4-bit quantized 7B models need about 4 GB). Set HF_HOME to a directory on a fast SSD to cache models for reuse.
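
The 4 GB figure follows from simple arithmetic on the weight count. The sketch below estimates weight size only. The 4.5 effective bits per weight used in the second call is an assumption about group-wise quantization, where each group of weights also stores a scale and bias, and actual repositories vary.

```python
def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


raw_7b_4bit = weight_size_gb(7e9, 4)       # 3.5 GB of raw 4-bit weights
with_overhead = weight_size_gb(7e9, 4.5)   # about 3.94 GB with per-group metadata
raw_7b_8bit = weight_size_gb(7e9, 8)       # 7.0 GB, roughly double the 4-bit size
```

Leave additional headroom for the tokenizer files and for the Hugging Face cache's temporary files during download.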

Server crashes with 'Out of memory' error

Switch to a smaller or more aggressively quantized model (e.g. 4bit instead of 8bit). Add --ssd-cache-dir to spill the KV cache to disk. Send fewer concurrent requests, and lower the --max-tokens limit so that each request holds less KV cache.

Claude Code still connects to Anthropic cloud instead of local server

Verify ANTHROPIC_BASE_URL is exported in the same shell session where you run 'claude'. Run 'echo $ANTHROPIC_BASE_URL' to confirm it shows 'http://localhost:8000'. Restart claude after setting the variable.
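
The same check can be scripted. A minimal sketch, assuming the two variables and the URL used in the setup step above. check_claude_env is a hypothetical helper.

```python
import os
from typing import Mapping, Optional


def check_claude_env(env: Optional[Mapping[str, str]] = None,
                     expected_url: str = "http://localhost:8000") -> list[str]:
    """Return a list of problems with the Claude Code variables; empty means OK."""
    env = os.environ if env is None else env
    problems = []
    url = env.get("ANTHROPIC_BASE_URL")
    if url is None:
        problems.append("ANTHROPIC_BASE_URL is not set in this environment")
    elif url.rstrip("/") != expected_url:
        problems.append(f"ANTHROPIC_BASE_URL is {url!r}, expected {expected_url!r}")
    if not env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is not set")
    return problems
```

Run it from the same shell that launches claude. A variable set in a different terminal tab or without export will show up here as missing.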

Frequently Asked Questions about vLLM MLX

What is vLLM MLX?

vLLM MLX is an OpenAI- and Anthropic-compatible inference server for Apple Silicon, listed here as a Model Context Protocol (MCP) server. It runs LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support on a native MLX backend, with a stated throughput of 400+ tok/s, and it works with Claude Code. Like other MCP servers, it connects AI assistants to external tools and data sources through a standardized interface.

How do I install vLLM MLX?

Install the package with uv or pip as described in the setup steps above, or follow the installation instructions on the vLLM MLX GitHub repository. Then start the server and add the server config to your AI client.

Which AI clients work with vLLM MLX?

vLLM MLX works with all major MCP-compatible AI clients including Claude Desktop, Claude Code, Cursor, VS Code (GitHub Copilot), Windsurf, and Cline.

Is vLLM MLX free to use?

Yes, vLLM MLX is open source and available under the Apache-2.0 license. You can use it freely in both personal and commercial projects.


vLLM MLX Alternatives — Similar Data Science & ML Servers

Looking for alternatives to vLLM MLX? Here are other popular Data Science & ML servers you can use with Claude, Cursor, and VS Code.

Ultrarag

★ 5.6k

A Low-Code MCP Framework for Building Complex and Innovative RAG Pipelines

RocketRide

★ 3.1k

MCP server that exposes RocketRide AI pipelines as t...

Aix Db

★ 2.1k

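Aix-DB is built on the LangChain/LangGraph framework and combines it with an MCP Skills multi-agent collaboration architecture to provide end-to-end conversion from natural language to data insights.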

NeMo Data Designer

★ 1.9k

🎨 NeMo Data Designer: Generate high-quality synthetic data from scratch or from seed data.

PaperBanana

★ 1.7k

Open source implementation and extension of Google Research’s PaperBanana for automated academic figures, diagrams, and research visuals, expanded to new domains like slide generation.

MiniMax

★ 1.5k

Bridges MiniMax AI capabilities to the Model Context Protocol, enabling AI agents to perform image understanding, text-to-image generation, and speech synthesis. It provides a standardized interface for accessing MiniMax's core tools via JSON-RPC.


Set Up vLLM MLX in Your Editor

Choose your AI client for step-by-step setup instructions.

- Claude Desktop: macOS & Windows app
- Claude Code: CLI & terminal
- Cursor: AI-first code editor
- VS Code: GitHub Copilot MCP
- Windsurf: Codeium AI editor
- Cline: VS Code extension

Quick Config Preview

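{
  "mcpServers": {
    "vllm-mlx": {
      "command": "vllm-mlx",
      "args": ["serve", "mlx-community/Llama-3.2-3B-Instruct-4bit", "--port", "8000"]
    }
  }
}

Add this to your claude_desktop_config.json or .cursor/mcp.json. vllm-mlx is a Python package installed with uv or pip, so the entry calls the vllm-mlx command directly rather than going through npx.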
