API Documentation

Base URL: http://localhost:8000/v1 (internal, vLLM)

Model: Qwen3-30B-A3B-Thinking-AWQ (MoE architecture)

Compatibility: OpenAI SDK compatible

Quick Start

Install the OpenAI SDK and set your API key:

pip install openai

export MUNIN_API_KEY="sk-munin-your-key"

Python Example

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["MUNIN_API_KEY"],  # set in the Quick Start step above
    base_url="http://localhost:8000/v1"
)

response = client.chat.completions.create(
    model="qwen3-30b-a3b-thinking",
    messages=[
        {"role": "user", "content": "What is a transformer model?"}
    ]
)

print(response.choices[0].message.content)

Endpoints

POST /v1/chat/completions

Generate a chat completion. This is the main endpoint for conversations.

Request Body

Parameter    Type     Description
model        string   Required. Model ID: qwen3-30b-a3b-thinking
messages     array    Required. List of messages, each with a role and content
temperature  float    Sampling temperature (0-2). Default: 0.7
top_p        float    Nucleus sampling probability mass (0-1), used in the example below
max_tokens   integer  Maximum tokens to generate. Default: 4096
stream       boolean  Enable streaming responses. Default: false
tools        array    List of tools for function calling

Example

{
  "model": "qwen3-30b-a3b-thinking",
  "messages": [
    {"role": "system", "content": "You are a helpful research assistant."},
    {"role": "user", "content": "Explain attention mechanisms in transformers."}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "max_tokens": 4096
}
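
The same request can be sent as plain HTTP without the SDK. A minimal sketch using the requests library, assuming the API key is passed as a standard Bearer token:

import requests

payload = {
    "model": "qwen3-30b-a3b-thinking",
    "messages": [
        {"role": "system", "content": "You are a helpful research assistant."},
        {"role": "user", "content": "Explain attention mechanisms in transformers."}
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 4096
}

# POST the payload to the chat completions endpoint
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Authorization": "Bearer sk-munin-your-key"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])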

GET /v1/models

List all available models.

Response

{
  "object": "list",
  "data": [
    {"id": "qwen3-30b-a3b-thinking", "object": "model", "owned_by": "qwen"}
  ]
}
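
Through the SDK, the same check is a one-liner; a minimal sketch reusing the client from the Quick Start:

# List the model IDs served by the endpoint
for model in client.models.list().data:
    print(model.id)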

Note: Personas (Chatgeti, Codegeti, Writegeti) are configured in Open WebUI, not the vLLM API.

Tool Calling (Function Calling)

The API supports OpenAI-compatible tool calling for RAG workflows.

Available Tools

Tool                 Description
search_papers        Semantic search over scientific papers
find_citing_papers   Find papers that cite a given DOI
find_author_papers   Find papers by author name
get_paper_details    Get full metadata for a paper
web_search           Search the web for current information

Example with Tools

from openai import OpenAI
import json
import os

client = OpenAI(api_key=os.environ["MUNIN_API_KEY"], base_url="http://localhost:8000/v1")

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_papers",
            "description": "Search the scientific paper database",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "top_k": {"type": "integer", "default": 5}
                },
                "required": ["query"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="qwen3-30b-a3b-thinking",
    messages=[{"role": "user", "content": "Find papers about attention mechanisms"}],
    tools=tools,
    tool_choice="auto"
)

# Check whether the model wants to call a tool
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)  # arguments arrive as a JSON string
    print(f"Tool: {tool_call.function.name}")
    print(f"Args: {args}")

Streaming

Enable streaming for real-time response generation:

stream = client.chat.completions.create(
    model="qwen3-30b-a3b-thinking",
    messages=[{"role": "user", "content": "Write a haiku about AI"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Thinking Mode

Always-Thinking Model

This model always outputs its reasoning in <think>...</think> tags before the final response. For best results, use the sampling settings shown in the request example above (temperature 0.6, top_p 0.95) and strip the <think> block before displaying the answer, as in the sketch below.
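
A minimal sketch of stripping the reasoning block, reusing the response object from the first example:

import re

raw = response.choices[0].message.content
# Drop the <think>...</think> reasoning block and keep only the final answer
answer = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
print(answer)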

Rate Limits

Concurrent requests are limited to 3 by the vLLM configuration. Contact an administrator for details. Clients that send many requests should cap their own concurrency, as in the sketch below.
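
A minimal sketch of capping client-side concurrency with the async SDK and a semaphore of 3 (the prompts are illustrative):

import asyncio
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key=os.environ["MUNIN_API_KEY"],
    base_url="http://localhost:8000/v1",
)
semaphore = asyncio.Semaphore(3)  # match the server-side limit of 3 concurrent requests

async def ask(prompt: str) -> str:
    async with semaphore:
        response = await client.chat.completions.create(
            model="qwen3-30b-a3b-thinking",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

async def main():
    prompts = ["What is attention?", "What is a MoE model?", "What is AWQ quantization?"]
    for answer in await asyncio.gather(*(ask(p) for p in prompts)):
        print(answer)

asyncio.run(main())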

Error Codes

Code  Meaning
400   Bad request - check your parameters
401   Unauthorized - invalid API key
429   Rate limited - slow down requests
500   Server error - try again later
503   Service unavailable - LLM may be offline
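
A minimal sketch of handling these errors with the SDK, reusing the client from the Quick Start; the retry policy is illustrative:

import time
import openai

for attempt in range(3):
    try:
        response = client.chat.completions.create(
            model="qwen3-30b-a3b-thinking",
            messages=[{"role": "user", "content": "What is a transformer model?"}],
        )
        print(response.choices[0].message.content)
        break
    except openai.RateLimitError:
        # 429: too many requests, back off and retry
        time.sleep(2 ** attempt)
    except openai.APIConnectionError:
        # Server unreachable, e.g. the 503 case when the LLM is offline
        print("Could not reach the API")
        break
    except openai.APIStatusError as exc:
        # Other 4xx/5xx responses
        print(f"Request failed with status {exc.status_code}")
        break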