System Architecture

Technical overview of the Munin research assistant system.

High-Level Overview

                                    +---------------------+
                                    |       Users         |
                                    +----------+----------+
                                               |
                                               v
                                    +---------------------+
                                    |  Cloudflare Tunnel  |
                                    |   (Zero Trust)      |
                                    |   Email OTP Auth    |
                                    +----------+----------+
                                               |
                                               v
+-----------------------------------------------------------------------------+
|                              MUNIN CLUSTER                                   |
|                                                                              |
|  +-------------------+     +-------------------+     +-------------------+   |
|  |    Open WebUI     |---->|  Retrieval API    |---->|      Qdrant       |   |
|  |    (Chat UI)      |     |   (RAG Layer)     |     |   (Vector DB)     |   |
|  |    Port 3000      |     |   Port 8080       |     |   Port 6333       |   |
|  +---------+---------+     +---------+---------+     +-------------------+   |
|            |                         |                                       |
|            |                         v                                       |
|            |               +-------------------+     +-------------------+   |
|            |               |     SearXNG       |     |      Neo4j        |   |
|            |               |  (Web Search)     |     |   (Citations)     |   |
|            |               |   Port 8888       |     |   Port 7474       |   |
|            |               +-------------------+     +-------------------+   |
|            v                                                                 |
|  +-------------------+                                                       |
|  |       vLLM        |     Personas:                                        |
|  |     Port 8000     |       - Chatgeti (research)                          |
|  |                   |       - Codegeti (code)                              |
|  | Qwen3-30B-A3B-AWQ |       - Writegeti (writing)                          |
|  | (MoE 30B/3B)      |                                                       |
|  +-------------------+                                                       |
|                                                                              |
|                         GPU 1 (RTX 5090) - 90% VRAM                         |
+-----------------------------------------------------------------------------+

Persona Comparison

Persona   | Purpose                                                  | Thinking Mode | Tools Available
Chatgeti  | Research assistant, paper analysis, literature discovery | Enabled       | Paper Search, Citations, Academic Search (Semantic Scholar, PubMed), Web Search
Codegeti  | Code assistant, scientific Python, SLURM scripts         | Enabled       | Web Search, Date/Time
Writegeti | Academic writing, scientific communication, editing      | Enabled       | Web Search, Date/Time

All personas use the same underlying model: Qwen3-30B-A3B-Thinking-AWQ (MoE architecture, always-thinking).

Workflow: Use Chatgeti to discover and analyze papers -> Codegeti to implement methods -> Writegeti to write about results.
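Since every persona is the same vLLM model behind an OpenAI-compatible endpoint, a persona is effectively a system prompt plus a tool configuration applied by Open WebUI. The request below is an illustrative sketch: the served model name and the persona system prompt shown here are assumptions, not the actual prompts Open WebUI injects.

```shell
# Hypothetical request to vLLM's OpenAI-compatible API on internal port 8000.
# In practice Open WebUI adds the real persona prompt and tool settings.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-30B-A3B-Thinking-AWQ",
        "messages": [
          {"role": "system", "content": "You are Chatgeti, a research assistant."},
          {"role": "user", "content": "Summarize recent work on MoE inference."}
        ]
      }'
```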

RAG (Retrieval-Augmented Generation) Flow

When you ask a question with RAG enabled, here's what happens:

1. Query Received: Open WebUI receives your question and checks which knowledge bases are enabled.

2. Parallel Retrieval: The retrieval service queries all selected sources (papers, web) simultaneously.

3. Semantic Search: Your query is embedded using SPECTER (papers) and compared against the stored vectors in Qdrant.

4. Context Assembly: Top-matching documents are ranked and assembled into context for the LLM, with source attribution.

5. LLM Generation: The LLM receives your question plus the retrieved context and generates a grounded response with citations.
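The flow above can be sketched as a minimal pipeline. This is an illustrative outline only, not the actual retrieval service: the function names, the toy in-memory corpus, and the three-dimensional vectors are all hypothetical stand-ins, where the real system uses 768-dimensional SPECTER embeddings and Qdrant.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Toy in-memory corpora standing in for Qdrant collections; vectors are made up.
PAPERS = [
    {"id": "paper-1", "text": "Transformer attention mechanisms", "vec": [0.9, 0.1, 0.0]},
    {"id": "paper-2", "text": "SLURM job scheduling on GPU clusters", "vec": [0.1, 0.9, 0.2]},
]
WEB = [
    {"id": "web-1", "text": "Recent survey of attention variants", "vec": [0.8, 0.2, 0.1]},
]

def embed(query: str) -> list[float]:
    # Stand-in for SPECTER; real embeddings are 768-dimensional.
    return [0.85, 0.15, 0.05]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def search(source: list[dict], qvec: list[float], k: int = 2) -> list[dict]:
    # Step 3: semantic search -- rank stored vectors by similarity to the query.
    return sorted(source, key=lambda d: cosine(qvec, d["vec"]), reverse=True)[:k]

def retrieve(query: str) -> list[dict]:
    qvec = embed(query)  # Step 1 is handled upstream by Open WebUI.
    # Step 2: query all enabled sources in parallel.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(search, src, qvec) for src in (PAPERS, WEB)]
        hits = [doc for f in futures for doc in f.result()]
    # Step 4: rank the merged hits across sources.
    hits.sort(key=lambda d: cosine(qvec, d["vec"]), reverse=True)
    return hits

def assemble_context(hits: list[dict]) -> str:
    # Step 4 (cont.): build the prompt context with source attribution.
    return "\n".join(f"[{d['id']}] {d['text']}" for d in hits)

context = assemble_context(retrieve("How does attention work?"))
# Step 5 would send `context` plus the question to vLLM for generation.
```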

Knowledge Bases

Source    | Content                                     | Embeddings         | Update Frequency
Papers    | Scientific PDFs (title, abstract, metadata) | SPECTER (768 dim)  | Manual upload
Web       | Live web search results                     | N/A (real-time)    | Real-time
User Docs | Your uploaded files                         | Open WebUI default | On upload

Model Configuration

Parameter           | Value                           | Notes
Model               | Qwen3-30B-A3B-Thinking-AWQ-4bit | MoE architecture, always-thinking
Total Parameters    | 30B                             | ~3B active per forward pass
Quantization        | AWQ 4-bit                       | ~17 GB model size
Context Window      | 32,768 tokens                   | Constrained by RTX 5090 VRAM
Concurrent Requests | 3                               | Balance between throughput and latency
Thinking Mode       | Always enabled                  | Outputs reasoning in a collapsible section
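A launch command consistent with this table might look like the sketch below. The flags are standard vLLM options, but the exact model path and invocation used in this deployment are assumptions.

```shell
# Hypothetical vLLM launch matching the configuration table; the model
# path is illustrative and should point at the local AWQ checkpoint.
vllm serve Qwen3-30B-A3B-Thinking-AWQ \
  --quantization awq \
  --max-model-len 32768 \
  --max-num-seqs 3 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```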

Service Ports

Port        | Service              | Access
3000        | Open WebUI           | Via Cloudflare Tunnel
6333        | Qdrant Vector DB     | Internal only
7474 / 7687 | Neo4j Graph DB       | Internal only
8000        | vLLM (all personas)  | Internal only
8070        | GROBID (PDF parsing) | Internal only
8080        | Retrieval Service    | Internal only
8888        | SearXNG              | Internal only
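In Docker Compose terms, "internal only" means the service is reachable on the Compose network but publishes no host port. The fragment below is a sketch under that assumption; the service names and exact port mappings here are illustrative, not the deployment's actual compose file.

```yaml
# Illustrative docker-compose fragment; service names are assumptions.
services:
  open-webui:
    ports:
      - "3000:3000"   # consumed by cloudflared, not exposed publicly
  qdrant:
    expose:
      - "6333"        # internal only: reachable as qdrant:6333
  vllm:
    expose:
      - "8000"        # internal only
```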

Technology Stack

Layer      | Technology                 | Purpose
Edge       | Cloudflare Tunnel + Access | Secure access, authentication, SSL
UI         | Open WebUI                 | Chat interface, document upload
RAG        | Custom FastAPI service     | Multi-source retrieval
Search     | SearXNG                    | Privacy-respecting web search
Inference  | vLLM                       | Fast LLM serving (OpenAI-compatible)
Model      | Qwen3-30B-A3B-Thinking-AWQ | MoE architecture (30B total / 3B active)
Vectors    | Qdrant                     | Semantic search
Graph      | Neo4j                      | Citation relationships
Parsing    | GROBID                     | Scientific PDF extraction
Embeddings | SPECTER, BGE-base          | Text to vectors
Compute    | SLURM                      | Job scheduling, GPU allocation
Containers | Docker Compose             | Service orchestration

User Types

       | WebUI Users (Remote Researchers)              | Cluster Users (SLURM/SSH Access)
Access | Cloudflare -> WebUI                           | VPN -> SSH -> SLURM
Auth   | Email OTP (30-day sessions)                   | Linux accounts + SSH keys
GPU    | Via vLLM service                              | GPU 0 via SLURM jobs
Can    | Chat with personas, use RAG, upload documents | Submit SLURM jobs, run training/inference, SSH access
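For cluster users, a GPU job on GPU 0 might be submitted with a batch script along these lines. This is a sketch: the job name, resource sizes, and time limit are assumptions about the local SLURM configuration, not documented defaults.

```shell
#!/bin/bash
# Hypothetical batch script for cluster users; resource values are
# assumptions about the local SLURM setup.
#SBATCH --job-name=train-demo
#SBATCH --gres=gpu:1          # draws from the GPU pool reserved for SLURM (GPU 0)
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=04:00:00

python train.py
```

Submit with `sbatch train.sh` and monitor with `squeue`.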