API Documentation
Base URL: http://localhost:8000/v1 (internal, vLLM)
Model: Qwen3-30B-A3B-Thinking-AWQ (MoE architecture)
Compatibility: OpenAI-compatible; works with the official OpenAI SDK
Quick Start
Install the OpenAI SDK and set your API key:
pip install openai
export MUNIN_API_KEY="sk-munin-your-key"
Python Example
from openai import OpenAI

client = OpenAI(
    api_key="sk-munin-your-key",
    base_url="http://localhost:8000/v1"
)

response = client.chat.completions.create(
    model="qwen3-30b-a3b-thinking",
    messages=[
        {"role": "user", "content": "What is a transformer model?"}
    ]
)

print(response.choices[0].message.content)
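The response object also reports token usage, which is handy when tuning max_tokens (vLLM normally includes a usage block on non-streaming responses):

usage = response.usage
print(f"prompt={usage.prompt_tokens} completion={usage.completion_tokens} total={usage.total_tokens}")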
Endpoints
POST /v1/chat/completions
Generate a chat completion. This is the main endpoint for conversations.
Request Body
| Parameter | Type | Description |
|---|---|---|
| model (required) | string | Model ID: qwen3-30b-a3b-thinking |
| messages (required) | array | List of messages with role and content |
| temperature | float | Sampling temperature (0-2). Default: 0.7 |
| top_p | float | Nucleus sampling threshold (0-1); see recommended settings under Thinking Mode |
| max_tokens | integer | Maximum tokens to generate. Default: 4096 |
| stream | boolean | Enable streaming responses. Default: false |
| tools | array | List of tools for function calling |
Example
{
  "model": "qwen3-30b-a3b-thinking",
  "messages": [
    {"role": "system", "content": "You are a helpful research assistant."},
    {"role": "user", "content": "Explain attention mechanisms in transformers."}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "max_tokens": 4096
}
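The same request works without the SDK; a minimal sketch using the requests package (the Authorization header is only needed if the server was started with an API key):

import requests

payload = {
    "model": "qwen3-30b-a3b-thinking",
    "messages": [{"role": "user", "content": "Explain attention mechanisms in transformers."}],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 4096
}
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Authorization": "Bearer sk-munin-your-key"},
    json=payload,
    timeout=120
)
resp.raise_for_status()  # surfaces 4xx/5xx errors (see Error Codes below)
print(resp.json()["choices"][0]["message"]["content"])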
GET /v1/models
List all available models.
Response
{
  "object": "list",
  "data": [
    {"id": "qwen3-30b-a3b-thinking", "object": "model", "owned_by": "qwen"}
  ]
}
Note: Personas (Chatgeti, Codegeti, Writegeti) are configured in Open WebUI, not the vLLM API.
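With the SDK, the same listing is a single call:

models = client.models.list()
for m in models.data:
    print(m.id)  # e.g. qwen3-30b-a3b-thinking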
Tool Calling (Function Calling)
The API supports OpenAI-compatible tool calling for RAG workflows.
Available Tools
| Tool | Description |
|---|---|
| search_papers | Semantic search over scientific papers |
| find_citing_papers | Find papers that cite a given DOI |
| find_author_papers | Find papers by author name |
| get_paper_details | Get full metadata for a paper |
| web_search | Search the web for current information |
Example with Tools
from openai import OpenAI
import json

client = OpenAI(api_key="not-needed", base_url="http://localhost:8000/v1")

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_papers",
            "description": "Search the scientific paper database",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "top_k": {"type": "integer", "default": 5}
                },
                "required": ["query"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="qwen3-30b-a3b-thinking",
    messages=[{"role": "user", "content": "Find papers about attention mechanisms"}],
    tools=tools,
    tool_choice="auto"
)

# Check whether the model wants to call a tool
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)  # arguments arrive as a JSON string
    print(f"Tool: {tool_call.function.name}")
    print(f"Args: {args}")
Streaming
Enable streaming for real-time response generation:
stream = client.chat.completions.create(
    model="qwen3-30b-a3b-thinking",
    messages=[{"role": "user", "content": "Write a haiku about AI"}],
    stream=True
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
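Recent vLLM builds can also report token usage on the final chunk; this sketch assumes the deployed version supports stream_options:

stream = client.chat.completions.create(
    model="qwen3-30b-a3b-thinking",
    messages=[{"role": "user", "content": "Write a haiku about AI"}],
    stream=True,
    stream_options={"include_usage": True}  # final chunk has empty choices and a usage block
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage:
        print(f"\n[{chunk.usage.total_tokens} tokens]")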
Thinking Mode
Always-Thinking Model
This model always outputs reasoning in <think> tags before the response. For best results:
- Temperature: 0.6, Top P: 0.95 (recommended for thinking mode)
- Presence penalty: 1.5 (helps prevent repetition in quantized models)
- In multi-turn conversations, exclude thinking content from history (see the sketch below)
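A minimal sketch of dropping the reasoning block before a reply goes back into history (this assumes a single <think>...</think> block at the start of the reply, which is how this model responds):

import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(text: str) -> str:
    # Remove the <think> block so it is not resent as conversation history
    return THINK_RE.sub("", text)

history = [{"role": "user", "content": "What is a transformer model?"}]
reply = response.choices[0].message.content  # response from the Quick Start example
history.append({"role": "assistant", "content": strip_thinking(reply)})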
Rate Limits
Concurrent requests are limited to 3 by vLLM configuration. Contact an administrator for details.
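Because the limit is enforced server-side, batch clients should throttle themselves; a sketch using a semaphore sized to the limit above (client is the one from Quick Start):

import threading
from concurrent.futures import ThreadPoolExecutor

slots = threading.Semaphore(3)  # matches the server's concurrency limit

def ask(prompt: str) -> str:
    with slots:  # never hold more than 3 in-flight requests
        r = client.chat.completions.create(
            model="qwen3-30b-a3b-thinking",
            messages=[{"role": "user", "content": prompt}]
        )
        return r.choices[0].message.content

with ThreadPoolExecutor(max_workers=8) as pool:
    answers = list(pool.map(ask, ["q1", "q2", "q3", "q4"]))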
Error Codes
| Code | Meaning |
|---|---|
| 400 | Bad request - check your parameters |
| 401 | Unauthorized - invalid API key |
| 429 | Rate limited - slow down requests |
| 500 | Server error - try again later |
| 503 | Service unavailable - LLM may be offline |
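A sketch of mapping these codes to the SDK's typed exceptions with simple exponential backoff (exception names are from the OpenAI Python SDK v1; client is the one from Quick Start):

import time
import openai

def ask_with_retry(prompt: str, retries: int = 3) -> str:
    for attempt in range(retries):
        try:
            r = client.chat.completions.create(
                model="qwen3-30b-a3b-thinking",
                messages=[{"role": "user", "content": prompt}]
            )
            return r.choices[0].message.content
        except openai.RateLimitError:  # 429: back off, then retry
            time.sleep(2 ** attempt)
        except openai.AuthenticationError:  # 401: retrying will not help
            raise
        except openai.APIStatusError as e:  # 400, 500, 503, ...
            if e.status_code >= 500 and attempt < retries - 1:
                time.sleep(2 ** attempt)  # transient server error
            else:
                raise
    raise RuntimeError("retries exhausted")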