API Documentation

Base URL: http://localhost:8000/v1 (internal, vLLM)

Model: Qwen3-30B-A3B-Thinking-AWQ (MoE architecture)

Compatibility: OpenAI SDK compatible

Quick Start

Install the OpenAI SDK and set your API key:

pip install openai

export MUNIN_API_KEY="sk-munin-your-key"

Python Example

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["MUNIN_API_KEY"],  # set in the Quick Start step above
    base_url="http://localhost:8000/v1"
)

response = client.chat.completions.create(
    model="qwen3-30b-a3b-thinking",
    messages=[
        {"role": "user", "content": "What is a transformer model?"}
    ]
)

print(response.choices[0].message.content)

Endpoints

POST /v1/chat/completions

Generate a chat completion. This is the main endpoint for conversations.

Request Body

Parameter    Type     Description
model        string   Required. Model ID: qwen3-30b-a3b-thinking
messages     array    Required. List of messages, each with a role and content
temperature  float    Sampling temperature (0-2). Default: 0.7
top_p        float    Nucleus sampling probability mass (0-1), used in the example below
max_tokens   integer  Maximum tokens to generate. Default: 4096
stream       boolean  Enable streaming responses. Default: false
tools        array    List of tools for function calling

Example

{
  "model": "qwen3-30b-a3b-thinking",
  "messages": [
    {"role": "system", "content": "You are a helpful research assistant."},
    {"role": "user", "content": "Explain attention mechanisms in transformers."}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "max_tokens": 4096
}
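
The same request can be sent as plain HTTP without the SDK. A minimal sketch using the requests library, assuming the API key is passed as a standard Bearer token:

import requests

payload = {
    "model": "qwen3-30b-a3b-thinking",
    "messages": [
        {"role": "system", "content": "You are a helpful research assistant."},
        {"role": "user", "content": "Explain attention mechanisms in transformers."}
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 4096
}

# POST the payload to the chat completions endpoint
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Authorization": "Bearer sk-munin-your-key"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])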

GET /v1/models

List all available models.

Response

{
  "object": "list",
  "data": [
    {"id": "qwen3-30b-a3b-thinking", "object": "model", "owned_by": "qwen"}
  ]
}
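
Through the SDK, the same check is a one-liner; a minimal sketch reusing the client from the Quick Start:

# List the model IDs served by the endpoint
for model in client.models.list().data:
    print(model.id)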

Note: Personas (Chatgeti, Codegeti, Writegeti) are configured in Open WebUI, not the vLLM API.

Tool Calling (Function Calling)

The API supports OpenAI-compatible tool calling for RAG workflows.

Available Tools

Tool                 Description
search_papers        Semantic search over scientific papers
find_citing_papers   Find papers that cite a given DOI
find_author_papers   Find papers by author name
get_paper_details    Get full metadata for a paper
web_search           Search the web for current information

Example with Tools

from openai import OpenAI
import json
import os

client = OpenAI(api_key=os.environ["MUNIN_API_KEY"], base_url="http://localhost:8000/v1")

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_papers",
            "description": "Search the scientific paper database",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "top_k": {"type": "integer", "default": 5}
                },
                "required": ["query"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="qwen3-30b-a3b-thinking",
    messages=[{"role": "user", "content": "Find papers about attention mechanisms"}],
    tools=tools,
    tool_choice="auto"
)

# Check whether the model wants to call a tool
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)  # arguments arrive as a JSON string
    print(f"Tool: {tool_call.function.name}")
    print(f"Args: {args}")

Streaming

Enable streaming for real-time response generation:

stream = client.chat.completions.create(
    model="qwen3-30b-a3b-thinking",
    messages=[{"role": "user", "content": "Write a haiku about AI"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Thinking Mode

Always-Thinking Model

This model always outputs its reasoning in <think>...</think> tags before the final response. For best results, use the sampling settings shown in the request example above (temperature 0.6, top_p 0.95) and strip the <think> block before displaying the answer, as in the sketch below.
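
A minimal sketch of stripping the reasoning block, reusing the response object from the first example:

import re

raw = response.choices[0].message.content
# Drop the <think>...</think> reasoning block and keep only the final answer
answer = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
print(answer)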

Rate Limits

Concurrent requests are limited to 3 by the vLLM configuration. Contact an administrator for details. Clients that send many requests should cap their own concurrency, as in the sketch below.
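
A minimal sketch of capping client-side concurrency with the async SDK and a semaphore of 3 (the prompts are illustrative):

import asyncio
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key=os.environ["MUNIN_API_KEY"],
    base_url="http://localhost:8000/v1",
)
semaphore = asyncio.Semaphore(3)  # match the server-side limit of 3 concurrent requests

async def ask(prompt: str) -> str:
    async with semaphore:
        response = await client.chat.completions.create(
            model="qwen3-30b-a3b-thinking",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

async def main():
    prompts = ["What is attention?", "What is a MoE model?", "What is AWQ quantization?"]
    for answer in await asyncio.gather(*(ask(p) for p in prompts)):
        print(answer)

asyncio.run(main())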

Error Codes

Code  Meaning
400   Bad request - check your parameters
401   Unauthorized - invalid API key
429   Rate limited - slow down requests
500   Server error - try again later
503   Service unavailable - LLM may be offline
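
A minimal sketch of handling these errors with the SDK, reusing the client from the Quick Start; the retry policy is illustrative:

import time
import openai

for attempt in range(3):
    try:
        response = client.chat.completions.create(
            model="qwen3-30b-a3b-thinking",
            messages=[{"role": "user", "content": "What is a transformer model?"}],
        )
        print(response.choices[0].message.content)
        break
    except openai.RateLimitError:
        # 429: too many requests, back off and retry
        time.sleep(2 ** attempt)
    except openai.APIConnectionError:
        # Server unreachable, e.g. the 503 case when the LLM is offline
        print("Could not reach the API")
        break
    except openai.APIStatusError as exc:
        # Other 4xx/5xx responses
        print(f"Request failed with status {exc.status_code}")
        break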