vMLX

Name: vMLX MCP Server
Author: jjang-ai

v1.0.0 • Data Science & ML • stable

vMLX - JANGTQ Uber Compressed MLX Models - L2 Disk Cache (survives restart) + L1 Paged (super fast ttft) + Hybrid SSM Scheduler + Cont Batching + etc!

Tags: anthropic-api, kvcache-compression, kvcache-optimization, kvcache-reuse, llm

Stars: 518

What is vMLX?

vMLX is a Model Context Protocol (MCP) server that lets AI assistants such as Claude, Cursor, and VS Code use compressed MLX models served locally. It provides JANGTQ-compressed MLX models, an L2 disk cache that survives restarts, an L1 paged cache for fast time to first token (TTFT), a hybrid SSM scheduler, and continuous batching.

This server falls under the Data Science & ML category on MCPgee, the world's largest MCP server directory with 33,000+ servers.

Features

- JANGTQ compressed MLX models
- L2 disk cache (survives restart)
- L1 paged cache (fast TTFT)
- Hybrid SSM scheduler
- Continuous batching

Use Cases

MLX model optimization

KV-cache compression and reuse

Fast inference on macOS

Maintainer: jjang-ai
License: Apache-2.0
Language: Python
Version: v1.0.0
Updated: May 22, 2026
Status: healthy
Maintenance: active

Works with: Claude, OpenAI, Windows, macOS, Linux

Installation

Manual Installation

uv tool install vmlx

Configuration

Config file: claude_desktop_config.json

Performance

Response Metrics

Response time: < 200 ms
Throughput: Medium

Resource Usage

Memory usage: Low
CPU usage: Low

How to Set Up and Use vMLX

vMLX is a self-hosted MLX inference server for Apple Silicon Macs that runs LLMs, vision models, image generation, embedding, and reranking models locally with no third-party API keys required. It exposes an OpenAI- and Anthropic-compatible HTTP API, so any tool that calls OpenAI or Anthropic endpoints can point at vMLX instead. Advanced features include a paged KV cache with SSD-persisted L2 disk cache that survives restarts, KV cache quantization for 2-4x memory savings, speculative decoding, continuous batching, and distributed inference across multiple Apple Silicon Macs. You would use it to run models from Hugging Face's mlx-community locally and connect them to Claude Desktop or other MCP clients as a local LLM backend.
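
To give a sense of where the 2-4x figure for KV cache quantization comes from, here is a rough sketch of the usual KV cache size estimate. The model dimensions are hypothetical placeholders, not values taken from vMLX or from any specific model, and the estimate ignores the small overhead of quantization scales.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bits: int) -> float:
    """Approximate KV cache size: keys and values for every layer and token."""
    elements = 2 * layers * kv_heads * head_dim * tokens  # 2 = keys + values
    return elements * bits / 8


# Hypothetical model shape, chosen only for illustration.
shape = dict(layers=32, kv_heads=8, head_dim=128, tokens=32_768)

fp16 = kv_cache_bytes(**shape, bits=16)
q8 = kv_cache_bytes(**shape, bits=8)
q4 = kv_cache_bytes(**shape, bits=4)

print(f"16-bit: {fp16 / 2**30:.2f} GiB")
print(f" 8-bit: {q8 / 2**30:.2f} GiB ({fp16 / q8:.0f}x smaller)")
print(f" 4-bit: {q4 / 2**30:.2f} GiB ({fp16 / q4:.0f}x smaller)")
```

Going from 16-bit to 8-bit entries halves the cache and 4-bit entries quarter it, which matches the 2-4x range quoted above.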

Prerequisites

- Apple Silicon Mac (M1, M2, M3, or M4); Intel Macs are not supported
- macOS 14 (Sonoma) or later recommended
- Python 3.10+ with uv, pipx, or a virtual environment
- Sufficient unified memory for the model you want to run (e.g. 8 GB for 8B-parameter 4-bit models)
- No external API keys required for inference; keys are only needed if using vMLX as a gateway to cloud providers
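
As a rough guide to the memory requirement, the weights of a quantized model take about parameters times bits per weight divided by 8 bytes. The helper below is a back-of-the-envelope sketch, not something vMLX provides. It ignores quantization metadata, the KV cache, and runtime overhead, so real usage is higher.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in gigabytes (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


for name, params, bits in [("8B at 4-bit", 8, 4), ("8B at 16-bit", 8, 16)]:
    print(f"{name}: ~{weight_memory_gb(params, bits):.0f} GB of weights")
```

An 8B model at 4 bits needs roughly 4 GB for weights, which is why 8 GB of unified memory is a workable minimum, while the same model at 16 bits would not fit.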

Install vMLX

Install with uv tool install for the cleanest setup; uv creates and manages the isolated environment automatically. Alternatively, use pipx or a plain venv.

# Recommended (uv):
brew install uv
uv tool install vmlx

# Or with pipx:
brew install pipx
pipx install vmlx

# Or in a venv:
python3 -m venv ~/.vmlx-env && source ~/.vmlx-env/bin/activate
pip install vmlx

Start the inference server with a model

Point vmlx serve at any model repo from mlx-community on Hugging Face. The model is downloaded automatically on first run and cached locally. The server listens on http://0.0.0.0:8000 (all network interfaces), so it is reachable at http://localhost:8000 from the same machine.

vmlx serve mlx-community/Qwen3-8B-4bit
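
Because the first run downloads the model, the port may not accept connections right away. A script that depends on the server can poll for it first. This is a minimal sketch using only the standard library; the function name and the timeout values are illustrative, not part of vMLX.

```python
import socket
import time


def wait_for_port(host: str = "127.0.0.1", port: int = 8000,
                  timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, else False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet; wait and retry until the deadline.
            time.sleep(interval)
    return False
```

Call wait_for_port() after launching vmlx serve, and raise the timeout if the model still has to be downloaded.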

Verify the server is running

Test the OpenAI-compatible endpoint with curl to confirm the server is accepting requests before connecting an MCP client.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "messages": [{"role": "user", "content": "Hello!"}], "stream": false}'
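
The same check can be made from Python with the standard library. The sketch below mirrors the curl command above. It assumes the response follows the standard OpenAI chat completion shape (choices[0].message.content), which this page does not confirm for vMLX.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8000/v1",
                       model: str = "local") -> urllib.request.Request:
    """Build a POST request equivalent to the curl command above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    # Assumption: standard OpenAI chat completion response shape.
    return body["choices"][0]["message"]["content"]
```

With the server running, ask("Hello!") should return the model's reply as a string.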

Enable advanced caching for faster follow-ups

Start the server with prefix cache, paged KV cache, and SSD disk cache enabled to reuse KV states across requests and survive restarts. This is especially useful for long system prompts.

vmlx serve mlx-community/Qwen3-8B-4bit \
  --enable-prefix-cache \
  --enable-paged-cache \
  --enable-disk-cache
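
To illustrate why prefix caching helps with long system prompts, here is a toy model of the idea. It is not vMLX's implementation. Tokens are grouped into fixed-size blocks, each block is keyed by a hash of everything up to and including it, and a new request reuses the leading blocks it shares with earlier ones. The block size and class name are made up for the demo.

```python
from hashlib import sha256

BLOCK_SIZE = 4  # tokens per cache block; hypothetical value for the demo


class ToyPrefixCache:
    """Tracks which token-block prefixes have been seen before."""

    def __init__(self) -> None:
        self._blocks: set[str] = set()

    def lookup_and_insert(self, tokens: list[int]) -> int:
        """Return how many leading tokens were already cached, then cache all full blocks."""
        reused = 0
        still_matching = True
        digest = sha256()
        full = len(tokens) - len(tokens) % BLOCK_SIZE
        for start in range(0, full, BLOCK_SIZE):
            block = tokens[start:start + BLOCK_SIZE]
            # Chain the hash so a block's key depends on everything before it.
            digest.update(repr(block).encode())
            key = digest.hexdigest()
            if still_matching and key in self._blocks:
                reused += BLOCK_SIZE
            else:
                still_matching = False
                self._blocks.add(key)
        return reused


cache = ToyPrefixCache()
system_prompt = list(range(100, 112))  # 12 tokens shared by both requests
first = cache.lookup_and_insert(system_prompt + [1, 2, 3, 4])
second = cache.lookup_and_insert(system_prompt + [5, 6, 7, 8])
print(first, second)
```

The first request finds nothing to reuse, while the second reuses all 12 system-prompt tokens and only needs fresh computation for its own suffix. A disk-backed cache applies the same lookup to blocks stored on SSD, which is what lets reuse survive a restart.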

Connect to Claude Desktop or another MCP client

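Because vMLX speaks the OpenAI protocol, any OpenAI-compatible client can use it as a local provider: set the base URL to http://localhost:8000/v1 and supply a dummy API key. For MCP clients that read an mcpServers block, such as Claude Desktop, the listing gives the entry below. It launches vmlx through npx, whereas the install steps above use a Python package, so confirm the command against the vMLX GitHub repository.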

{
  "mcpServers": {
    "vmlx": {
      "command": "npx",
      "args": ["vmlx"],
      "env": {
        "VMLX_MODEL": "mlx-community/Qwen3-8B-4bit"
      }
    }
  }
}
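
If claude_desktop_config.json already lists other servers, the new entry should be merged in rather than pasted over the file. The sketch below is one way to do that. The helper names are hypothetical, the entry is copied from the listing above, and the path to the config file is left for you to supply because it differs by platform.

```python
import json
from pathlib import Path


def add_mcp_server(config: dict, name: str, entry: dict) -> dict:
    """Return a copy of config with entry registered under mcpServers[name]."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = entry
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path, name: str, entry: dict) -> None:
    """Read the JSON config at path (if any), add the entry, and write it back."""
    config = json.loads(path.read_text()) if path.exists() else {}
    path.write_text(json.dumps(add_mcp_server(config, name, entry), indent=2))


# Entry copied from the listing above.
vmlx_entry = {
    "command": "npx",
    "args": ["vmlx"],
    "env": {"VMLX_MODEL": "mlx-community/Qwen3-8B-4bit"},
}
```

Calling update_config_file(Path("path/to/claude_desktop_config.json"), "vmlx", vmlx_entry) keeps every existing server and top-level setting and adds or replaces only the vmlx entry.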

vMLX Examples

Client configuration

Minimal Claude Desktop configuration that registers vMLX under mcpServers, without the VMLX_MODEL environment variable used in the setup section.

{
  "mcpServers": {
    "vmlx": {
      "command": "npx",
      "args": ["vmlx"]
    }
  }
}

Prompts to try

Once vMLX is running and connected, use it exactly like a cloud LLM. These prompts work with any chat-capable mlx-community model.

- "Explain the difference between MoE and dense transformer architectures."
- "Write a Python function to parse a CSV file with error handling."
- "Summarize this document: [paste text here]"
- "Translate the following paragraph from English to Spanish: [text]"
- "What are the pros and cons of using KV cache quantization for inference?"

Troubleshooting vMLX

pip install fails with 'externally-managed-environment' on macOS

Use uv tool install vmlx or pipx install vmlx instead of bare pip. Alternatively create a virtual environment: python3 -m venv ~/.vmlx-env && source ~/.vmlx-env/bin/activate && pip install vmlx.

Model download is slow or fails on first run

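Models are downloaded from Hugging Face's mlx-community. Ensure you have a stable internet connection and enough disk space (4-bit 8B models are roughly 5-8 GB). If you are behind a proxy, set HTTPS_PROXY; to download through a mirror, set HF_ENDPOINT. To store models on a different disk, set HF_HOME or HF_HUB_CACHE (HUGGINGFACE_HUB_CACHE is the older name for the same setting).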
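
A quick way to rule out the disk space cause before starting a download is to check free space on the target volume. This is a small sketch with a hypothetical helper name. It assumes you know which directory the model cache lives in; the Hugging Face default is under your home directory.

```python
import shutil
from pathlib import Path


def has_free_space(path: Path, required_gb: float) -> bool:
    """True if the filesystem holding path has at least required_gb free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 10**9
```

For example, has_free_space(Path.home(), 8) checks for the upper end of the 5-8 GB range quoted above.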

Out-of-memory crash when loading a large model

Choose a smaller or more aggressively quantized model. For example, use a 4-bit quantized model (mlx-community/Qwen3-8B-4bit) on 16 GB RAM Macs, or a 2-bit JANG-quantized model for very large models. Close other RAM-intensive apps before starting vmlx serve.

Frequently Asked Questions about vMLX

What is vMLX?

vMLX is a Model Context Protocol (MCP) server that serves compressed MLX models with an L2 disk cache that survives restarts, an L1 paged cache for fast time to first token, a hybrid SSM scheduler, and continuous batching. It connects AI assistants to external tools and data sources through a standardized interface.

How do I install vMLX?

Install the vmlx package with uv tool install vmlx, pipx install vmlx, or pip install vmlx inside a virtual environment, then add the server config to your AI client. The vMLX GitHub repository has the full installation instructions.

Which AI clients work with vMLX?

vMLX works with all major MCP-compatible AI clients including Claude Desktop, Claude Code, Cursor, VS Code (GitHub Copilot), Windsurf, and Cline.

Is vMLX free to use?

Yes, vMLX is open source and available under the Apache-2.0 license. You can use it freely in both personal and commercial projects.



Quick Config Preview

{
  "mcpServers": {
    "vmlx": {
      "command": "npx",
      "args": ["-y", "vmlx"]
    }
  }
}

Add this to your claude_desktop_config.json or .cursor/mcp.json
