System Architecture

Technical overview of the Munin research assistant system.

High-Level Overview

                                    +---------------------+
                                    |       Users         |
                                    +----------+----------+
                                               |
                                               v
                                    +---------------------+
                                    |  Cloudflare Tunnel  |
                                    |   (Zero Trust)      |
                                    |   Email OTP Auth    |
                                    +----------+----------+
                                               |
                                               v
+-----------------------------------------------------------------------------+
|                              MUNIN CLUSTER                                   |
|                                                                              |
|  +-------------------+     +-------------------+     +-------------------+   |
|  |    Open WebUI     |---->|  Retrieval API    |---->|      Qdrant       |   |
|  |    (Chat UI)      |     |   (RAG Layer)     |     |   (Vector DB)     |   |
|  |    Port 3000      |     |   Port 8080       |     |   Port 6333       |   |
|  +---------+---------+     +---------+---------+     +-------------------+   |
|            |                         |                                       |
|            |                         v                                       |
|            |               +-------------------+     +-------------------+   |
|            |               |     SearXNG       |     |      Neo4j        |   |
|            |               |  (Web Search)     |     |   (Citations)     |   |
|            |               |   Port 8888       |     |   Port 7474       |   |
|            |               +-------------------+     +-------------------+   |
|            v                                                                 |
|  +-------------------+                                                       |
|  |       vLLM        |     Personas:                                        |
|  |     Port 8000     |       - Chatgeti (research)                          |
|  |                   |       - Codegeti (code)                              |
|  | Qwen3-30B-A3B-AWQ |       - Writegeti (writing)                          |
|  | (MoE 30B/3B)      |                                                       |
|  +-------------------+                                                       |
|                                                                              |
|                         GPU 1 (RTX 5090) - 90% VRAM                         |
+-----------------------------------------------------------------------------+

Persona Comparison

Persona   | Purpose                                                  | Thinking Mode | Tools Available
Chatgeti  | Research assistant, paper analysis, literature discovery | Enabled       | Paper Search, Citations, Academic Search (Semantic Scholar, PubMed), Web Search
Codegeti  | Code assistant, scientific Python, SLURM scripts         | Enabled       | Web Search, Date/Time
Writegeti | Academic writing, scientific communication, editing      | Enabled       | Web Search, Date/Time

All personas use the same underlying model: Qwen3-30B-A3B-Thinking-AWQ (MoE architecture, always-thinking).

Workflow: Use Chatgeti to discover and analyze papers -> Codegeti to implement methods -> Writegeti to write about results.
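Since every persona is the same vLLM model behind an OpenAI-compatible endpoint, a persona is effectively a system prompt plus a tool configuration applied by Open WebUI. The request below is an illustrative sketch: the served model name and the persona system prompt shown here are assumptions, not the actual prompts Open WebUI injects.

```shell
# Hypothetical request to vLLM's OpenAI-compatible API on internal port 8000.
# In practice Open WebUI adds the real persona prompt and tool settings.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-30B-A3B-Thinking-AWQ",
        "messages": [
          {"role": "system", "content": "You are Chatgeti, a research assistant."},
          {"role": "user", "content": "Summarize recent work on MoE inference."}
        ]
      }'
```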

RAG (Retrieval-Augmented Generation) Flow

When you ask a question with RAG enabled, here's what happens:

1. Query Received: Open WebUI receives your question and checks which knowledge bases are enabled.

2. Parallel Retrieval: The retrieval service queries all selected sources (papers, web) simultaneously.

3. Semantic Search: Your query is embedded using SPECTER (papers) and compared against the stored vectors in Qdrant.

4. Context Assembly: Top-matching documents are ranked and assembled into context for the LLM, with source attribution.

5. LLM Generation: The LLM receives your question plus the retrieved context and generates a grounded response with citations.
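The flow above can be sketched as a minimal pipeline. This is an illustrative outline only, not the actual retrieval service: the function names, the toy in-memory corpus, and the three-dimensional vectors are all hypothetical stand-ins, where the real system uses 768-dimensional SPECTER embeddings and Qdrant.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Toy in-memory corpora standing in for Qdrant collections; vectors are made up.
PAPERS = [
    {"id": "paper-1", "text": "Transformer attention mechanisms", "vec": [0.9, 0.1, 0.0]},
    {"id": "paper-2", "text": "SLURM job scheduling on GPU clusters", "vec": [0.1, 0.9, 0.2]},
]
WEB = [
    {"id": "web-1", "text": "Recent survey of attention variants", "vec": [0.8, 0.2, 0.1]},
]

def embed(query: str) -> list[float]:
    # Stand-in for SPECTER; real embeddings are 768-dimensional.
    return [0.85, 0.15, 0.05]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def search(source: list[dict], qvec: list[float], k: int = 2) -> list[dict]:
    # Step 3: semantic search -- rank stored vectors by similarity to the query.
    return sorted(source, key=lambda d: cosine(qvec, d["vec"]), reverse=True)[:k]

def retrieve(query: str) -> list[dict]:
    qvec = embed(query)  # Step 1 is handled upstream by Open WebUI.
    # Step 2: query all enabled sources in parallel.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(search, src, qvec) for src in (PAPERS, WEB)]
        hits = [doc for f in futures for doc in f.result()]
    # Step 4: rank the merged hits across sources.
    hits.sort(key=lambda d: cosine(qvec, d["vec"]), reverse=True)
    return hits

def assemble_context(hits: list[dict]) -> str:
    # Step 4 (cont.): build the prompt context with source attribution.
    return "\n".join(f"[{d['id']}] {d['text']}" for d in hits)

context = assemble_context(retrieve("How does attention work?"))
# Step 5 would send `context` plus the question to vLLM for generation.
```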

Knowledge Bases

Source    | Content                                     | Embeddings         | Update Frequency
Papers    | Scientific PDFs (title, abstract, metadata) | SPECTER (768 dim)  | Manual upload
Web       | Live web search results                     | N/A (real-time)    | Real-time
User Docs | Your uploaded files                         | Open WebUI default | On upload

Model Configuration

Parameter           | Value                           | Notes
Model               | Qwen3-30B-A3B-Thinking-AWQ-4bit | MoE architecture, always-thinking
Total Parameters    | 30B                             | ~3B active per forward pass
Quantization        | AWQ 4-bit                       | ~17 GB model size
Context Window      | 32,768 tokens                   | Constrained by RTX 5090 VRAM
Concurrent Requests | 3                               | Balance between throughput and latency
Thinking Mode       | Always enabled                  | Outputs reasoning in a collapsible section
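A launch command consistent with this table might look like the sketch below. The flags are standard vLLM options, but the exact model path and invocation used in this deployment are assumptions.

```shell
# Hypothetical vLLM launch matching the configuration table; the model
# path is illustrative and should point at the local AWQ checkpoint.
vllm serve Qwen3-30B-A3B-Thinking-AWQ \
  --quantization awq \
  --max-model-len 32768 \
  --max-num-seqs 3 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```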

Service Ports

Port        | Service              | Access
3000        | Open WebUI           | Via Cloudflare Tunnel
6333        | Qdrant Vector DB     | Internal only
7474 / 7687 | Neo4j Graph DB       | Internal only
8000        | vLLM (all personas)  | Internal only
8070        | GROBID (PDF parsing) | Internal only
8080        | Retrieval Service    | Internal only
8888        | SearXNG              | Internal only
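In Docker Compose terms, "internal only" means the service is reachable on the Compose network but publishes no host port. The fragment below is a sketch under that assumption; the service names and exact port mappings here are illustrative, not the deployment's actual compose file.

```yaml
# Illustrative docker-compose fragment; service names are assumptions.
services:
  open-webui:
    ports:
      - "3000:3000"   # consumed by cloudflared, not exposed publicly
  qdrant:
    expose:
      - "6333"        # internal only: reachable as qdrant:6333
  vllm:
    expose:
      - "8000"        # internal only
```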

Technology Stack

Layer      | Technology                 | Purpose
Edge       | Cloudflare Tunnel + Access | Secure access, authentication, SSL
UI         | Open WebUI                 | Chat interface, document upload
RAG        | Custom FastAPI service     | Multi-source retrieval
Search     | SearXNG                    | Privacy-respecting web search
Inference  | vLLM                       | Fast LLM serving (OpenAI-compatible)
Model      | Qwen3-30B-A3B-Thinking-AWQ | MoE architecture (30B total / 3B active)
Vectors    | Qdrant                     | Semantic search
Graph      | Neo4j                      | Citation relationships
Parsing    | GROBID                     | Scientific PDF extraction
Embeddings | SPECTER, BGE-base          | Text to vectors
Compute    | SLURM                      | Job scheduling, GPU allocation
Containers | Docker Compose             | Service orchestration

User Types

       | WebUI Users (Remote Researchers)              | Cluster Users (SLURM/SSH Access)
Access | Cloudflare -> WebUI                           | VPN -> SSH -> SLURM
Auth   | Email OTP (30-day sessions)                   | Linux accounts + SSH keys
GPU    | Via vLLM service                              | GPU 0 via SLURM jobs
Can    | Chat with personas, use RAG, upload documents | Submit SLURM jobs, run training/inference, SSH access
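For cluster users, a GPU job on GPU 0 might be submitted with a batch script along these lines. This is a sketch: the job name, resource sizes, and time limit are assumptions about the local SLURM configuration, not documented defaults.

```shell
#!/bin/bash
# Hypothetical batch script for cluster users; resource values are
# assumptions about the local SLURM setup.
#SBATCH --job-name=train-demo
#SBATCH --gres=gpu:1          # draws from the GPU pool reserved for SLURM (GPU 0)
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=04:00:00

python train.py
```

Submit with `sbatch train.sh` and monitor with `squeue`.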