DarkNebula — Production-Grade Homelab Infrastructure

The Challenge
Architecture & Solution
Tech Stack
Key Engineering Decisions
Service Inventory
Automation & Scripts
Bot Systems
Self-Healing Agent
Security Hardening
Deployment
Roadmap

The Challenge

Running a homelab with more than 40 Docker containers, multiple network access methods, hybrid storage (SSD and HDD), and several service categories (media, development, monitoring, security) is more than manual management can handle reliably.

Before this system, service management required executing individual docker compose commands for each container, with no unified state tracking or mode-based resource optimization. Monitoring required manually checking multiple dashboards, with no consolidated view of system health. Backup operations were executed manually with no automated verification. Security monitoring relied on separate tools with no unified threat response. The lack of remote access capabilities limited administration to local network connections or complex SSH tunnel setups.

The fundamental challenge was building an infrastructure that could operate autonomously with minimal manual intervention while providing comprehensive remote control through multiple interfaces (CLI, Telegram, Discord). The system needed to be intelligent enough to detect failures and recover automatically, resource-efficient enough to operate on consumer hardware, and secure enough to expose via Cloudflare Tunnel without compromising the host system.

The requirement: Build a production-grade, self-healing homelab infrastructure with centralized configuration management, multi-interface remote control (CLI + Telegram + Discord), automated resource optimization through mode switching (Work/Media/Full), hardware-level power monitoring via Intel RAPL, and enterprise-grade security hardening with CrowdSec integration.

Architecture & Solution

The architecture follows a modular, metadata-driven design where every service contains a .homelab-meta file defining its tier classification (core, work, media, extra). This metadata drives all automation logic including mode switching, self-healing scope, and status reporting.

The configuration architecture implements a single-source-of-truth model where all service configurations reference a centralized global.env file through symlinks. The update-env.sh script recursively verifies and recreates these symlinks, ensuring configuration changes propagate consistently across all services without manual intervention.

Tech Stack

Layer	Technology	Role
Host OS	Ubuntu 24.04 LTS	Primary operating system
Container Runtime	Docker 27.x + Compose V2	Container orchestration
Reverse Proxy	Caddy 2.x	Automatic HTTPS, routing
Virtual Private Network	Tailscale	Mesh VPN for remote access
Public Access	Cloudflare Tunnel	Zero-trust public exposure
Time-Series Database	Prometheus	Metrics collection (15s scrape, 365-day retention)
Visualization	Grafana	Dashboards with INR costing
Security Monitoring	CrowdSec	Community-driven threat detection
Bot Framework	Discord.js v14	Discord bot with slash commands
Telegram API	Long Polling	Pure bash Telegram command processor
Power Monitoring	Scaphandre	Intel RAPL per-process attribution
Automation	Bash scripts (64)	Service control, backups, healing
File Sharing	Samba	7 authenticated network shares
Remote Desktop	RustDesk	P2P remote access

Key Engineering Decisions

1. Metadata-Driven Service Classification

Every service directory contains a .homelab-meta file with tier classification. This metadata drives all automation logic without hardcoding service lists.

bash

# ~/docker/applications/media/jellyfin/.homelab-meta
NAME=Jellyfin
TIER=media
CATEGORY=media

The homelab-ctl.sh script reads these metadata files to determine which services should run in each mode, so adding or removing a service means adding or removing one metadata file rather than editing the control scripts.
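
A minimal sketch of how a script could read these files, assuming each .homelab-meta holds plain KEY=value lines as in the example above. The function names meta_value and services_in_tier are hypothetical and not taken from homelab-ctl.sh, and the demo builds a throwaway directory tree instead of touching ~/docker.

```shell
# Print the value of a key from a .homelab-meta file (plain KEY=value lines assumed).
meta_value() {
    # $1 = path to .homelab-meta, $2 = key name
    sed -n "s/^$2=//p" "$1" | head -n 1
}

# List service directories under a root whose TIER matches the requested tier.
services_in_tier() {
    # $1 = root directory, $2 = tier (core, work, media, extra)
    find "$1" -name .homelab-meta | sort | while IFS= read -r meta; do
        if [ "$(meta_value "$meta" TIER)" = "$2" ]; then
            dirname "$meta"
        fi
    done
}

# Demo against a temporary tree (hypothetical layout, for illustration only)
demo_root=$(mktemp -d)
mkdir -p "$demo_root/media/jellyfin" "$demo_root/base/caddy"
printf 'NAME=Jellyfin\nTIER=media\nCATEGORY=media\n' > "$demo_root/media/jellyfin/.homelab-meta"
printf 'NAME=Caddy\nTIER=core\nCATEGORY=infrastructure\n' > "$demo_root/base/caddy/.homelab-meta"

services_in_tier "$demo_root" media   # prints the jellyfin directory only
```

Because the tier lives next to the service, the lookup needs no central list; a new service directory is picked up the next time the tree is scanned.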

2. Centralized Configuration with Symlink Propagation

Rather than maintaining individual .env files for each service, all configurations reference a single global.env through symlinks maintained by update-env.sh.

bash

# update-env.sh excerpt
SERVICES=(
    "base/caddy"
    "base/portainer"
    "base/cloudflared"
    "applications/media/jellyfin"
    "applications/monitoring/telemetry"
    ...
)

for service in "${SERVICES[@]}"; do
    ln -sf "$GLOBAL_ENV" "$DOCKER_DIR/$service/.env"
done

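With this design, a password or API key is updated in one place and every service reads the new value from the same file, which eliminates configuration drift between containers. Because Docker Compose reads .env when a container is created, running containers pick up a changed value only after they are recreated (for example with docker compose up -d).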
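
The verification side of this could look like the following sketch. It assumes a link counts as healthy only when it resolves to the global file. The function name ensure_env_link, the demo key, and the temporary paths are hypothetical. It checks one service directory at a time and relies on readlink -f as provided by GNU coreutils.

```shell
# Ensure <service_dir>/.env is a symlink to the global env file; repair it if not.
# Prints "ok" when nothing needed changing and "fixed" when the link was (re)created.
ensure_env_link() {
    # $1 = path to global.env, $2 = service directory
    local target link
    target=$(readlink -f "$1")
    link="$2/.env"
    if [ -L "$link" ] && [ "$(readlink -f "$link")" = "$target" ]; then
        echo "ok"
    else
        ln -sfn "$target" "$link"   # replaces a stale link or a stray regular file
        echo "fixed"
    fi
}

# Demo in a temporary directory (hypothetical layout)
demo=$(mktemp -d)
mkdir -p "$demo/base/caddy"
echo "EXAMPLE_KEY=value" > "$demo/global.env"

ensure_env_link "$demo/global.env" "$demo/base/caddy"   # first run creates the link
ensure_env_link "$demo/global.env" "$demo/base/caddy"   # second run finds it healthy
```

Reporting "ok" versus "fixed" makes the check idempotent and gives an audit script something to count.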

3. Mode-Based Resource Management

The system implements three operational modes that control which service tiers are active, enabling resource optimization based on usage patterns:

Mode	Active Tiers	RAM Usage	Use Case
`work`	core + work	~800MB	Development, file sync
`media`	core + work + media	~2.5GB	Photo library, media streaming
`full`	all tiers	~5GB+	Full automation, *arr stack

The homelab-ctl.sh script switches modes by stopping the services that should not run in the target mode and moving their compose files to ~/docker/archived/, which frees the memory and CPU they were using.

bash

# homelab-ctl.sh mode switching
case "$MODE" in
    work)
        archive_tier "media"
        archive_tier "extra"
        ;;
    media)
        restore_tier "media"
        archive_tier "extra"
        ;;
    full)
        restore_all
        ;;
esac
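
The mode table can also be expressed as a lookup, which is one way to keep the tier rules in a single place. This is a sketch only: tiers_for_mode and tier_active_in_mode are hypothetical helper names, and the loop prints decisions instead of calling archive_tier or restore_tier.

```shell
# Map an operational mode to the tiers that stay active, per the mode table.
tiers_for_mode() {
    case "$1" in
        work)  echo "core work" ;;
        media) echo "core work media" ;;
        full)  echo "core work media extra" ;;
        *)     echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Succeeds when a tier should be running in the given mode.
tier_active_in_mode() {
    # $1 = tier, $2 = mode
    local active
    active=$(tiers_for_mode "$2") || return 1
    case " $active " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Show what a switch to media mode would do with each tier
for tier in core work media extra; do
    if tier_active_in_mode "$tier" media; then
        echo "$tier: keep running"
    else
        echo "$tier: archive"
    fi
done
```

For media mode this prints "keep running" for core, work, and media, and "archive" for extra.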

4. Intel RAPL Power Monitoring

The system uses Scaphandre to read CPU power data directly from Intel RAPL (Running Average Power Limit) hardware counters, providing per-second power attribution at the process level.

bash

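# power-monitor.sh calculation
# energy_uj is a cumulative energy counter in microjoules, so power is the
# difference between two readings divided by the interval (counter wraparound not handled here)
RAPL=/sys/class/powercap/intel-rapl:0/energy_uj
E1=$(cat "$RAPL"); sleep 1; E2=$(cat "$RAPL")
CPU_POWER=$(awk -v a="$E1" -v b="$E2" 'BEGIN { print (b - a) / 1000000 }')  # watts over 1 s
GPU_UW=$(cat /sys/class/hwmon/hwmon*/power1_average 2>/dev/null | head -n 1)  # hwmon reports microwatts
GPU_POWER=$(awk -v p="${GPU_UW:-0}" 'BEGIN { print p / 1000000 }')
OVERHEAD=15          # Motherboard, RAM, SSD, NIC (watts)
PSU_EFFICIENCY=0.85  # 85% efficiency

# Bash arithmetic is integer-only, so the division by 0.85 is done in awk
TOTAL_POWER=$(awk -v c="$CPU_POWER" -v g="$GPU_POWER" -v o="$OVERHEAD" -v e="$PSU_EFFICIENCY" \
    'BEGIN { printf "%.1f", (c + g + o) / e }')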

This data is scraped by Prometheus every 15 seconds and visualized in Grafana with electricity cost calculations based on ₹8/kWh (Assam APDCL rates).
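
The cost arithmetic itself is simple: kilowatts times hours times the tariff. The sketch below works through it for a constant 60 W draw, which is an illustrative figure and not a measurement from this system. The helper name cost_inr is hypothetical.

```shell
RATE_INR_PER_KWH=8   # tariff quoted in the text

# Cost in rupees of drawing a constant <watts> for <hours>.
cost_inr() {
    # $1 = watts, $2 = hours
    awk -v w="$1" -v h="$2" -v r="$RATE_INR_PER_KWH" \
        'BEGIN { printf "%.2f\n", (w / 1000) * h * r }'
}

cost_inr 60 1      # one hour at 60 W    -> 0.48
cost_inr 60 24     # one day             -> 11.52
cost_inr 60 720    # a 30-day month      -> 345.60
```

In Grafana the same formula would be applied to the measured wattage series instead of a constant.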

5. Exponential Backoff with Chronic Failure Suppression

The self-healing agent tracks consecutive failures per container and implements progressive backoff to prevent restart loops while still allowing recovery attempts.

bash

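# Failure tracking
TRACK_FILE=~/docker/status/heal-tracker/${CONTAINER}.count
FAILURE_COUNT=$(cat "$TRACK_FILE" 2>/dev/null || echo 0)
FAILURE_COUNT=$((FAILURE_COUNT + 1))
echo "$FAILURE_COUNT" > "$TRACK_FILE"  # persist so the next run sees the updated count

if [ "$FAILURE_COUNT" -ge 7 ]; then
    echo "⛔ CHRONIC: Skipping $CONTAINER (suppressed)"
    exit 0
fi

# Backoff intervals: failures 1-2 restart immediately, 3-4 wait 15 minutes, 5-6 wait 1 hour
if [ "$FAILURE_COUNT" -le 2 ]; then
    DELAY=0
elif [ "$FAILURE_COUNT" -le 4 ]; then
    DELAY=900   # 15 minutes
else
    DELAY=3600  # 1 hour
fi

sleep "$DELAY"
docker restart "$CONTAINER"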
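
Separating the schedule from the side effects makes the thresholds easy to check in isolation. The sketch below restates the same thresholds as a pure function; backoff_delay is a hypothetical name, and the "suppress" return value is an illustrative convention.

```shell
# Return the wait in seconds before the next restart attempt,
# or "suppress" once the container counts as chronically failing.
backoff_delay() {
    # $1 = consecutive failure count (1-based)
    if [ "$1" -ge 7 ]; then
        echo "suppress"
    elif [ "$1" -le 2 ]; then
        echo 0
    elif [ "$1" -le 4 ]; then
        echo 900     # 15 minutes
    else
        echo 3600    # 1 hour
    fi
}

for n in 1 2 3 4 5 6 7; do
    echo "failure $n -> $(backoff_delay "$n")"
done
```

The loop prints 0 for failures 1 and 2, 900 for 3 and 4, 3600 for 5 and 6, and "suppress" for 7.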

Service Inventory

Infrastructure (7 containers)

Service	Port	Description
Caddy	80/443/8085	Reverse proxy with automatic HTTPS
Cloudflared	—	Cloudflare Tunnel daemon
Portainer	9000	Container management GUI
CrowdSec	—	Threat detection and IP banning
Watchtower	—	Automatic image updates
Sablier	—	Scale-to-zero for on-demand services
Authentik	9900	SSO identity provider (4 containers)

Media (16 containers)

Service	Port	Description
Jellyfin	8096	4K HDR media streaming
Immich	2283	Google Photos alternative (5 containers)
qBittorrent	8081	Torrent client with VueTorrent theme
Syncthing	8384	P2P file synchronization
Sonarr	8989	TV show automation
Radarr	7878	Movie automation
Prowlarr	9696	Indexer management
Jellyseerr	5055	Media request platform
autobrr	7474	Torrent filter automation
Youtarr	3087	YouTube channel downloader

Development (3 containers)

Service	Port	Description
Gitea	3000	Self-hosted Git server
Gitea Actions	—	CI/CD runner

Telemetry (9 containers)

Service	Port	Description
Grafana	3001	Analytics dashboard
Prometheus	9091	Time-series database
Alertmanager	9093	Alert routing
Node Exporter	—	Host metrics
cAdvisor	—	Container metrics
Scaphandre	8086	Power monitoring
Smartctl Exporter	9633	Disk health
Netdata	19999	Real-time metrics
Scraparr	7100	*arr metrics exporter

Automation & Scripts

The system includes 64 automation scripts organized by function:

Category	Scripts	Purpose
Core Control	`dark-nebula.sh`, `homelab-ctl.sh`	CLI and mode management
Bot Handlers	`dark-bot.sh`, `dark-discord.js`	Telegram and Discord bots
Healing	`dark-heal.sh`, `fix-system.sh`	Self-healing and repair
Backup	`backup-minimal.sh`, `backup-full.sh`	Configuration and data backup
Storage	`mount-hdd.sh`, `unmount-hdd.sh`	Safe HDD lifecycle
Audit	`audit/01-system.sh`...`20-overall-score.sh`	20 modular audit checks
Telemetry	`power-monitor.sh`, `healthcheck.sh`	Hardware monitoring

The dark CLI provides unified access to all automation:

bash

dark status         # Full system health check
dark mode work      # Switch to work mode
dark power          # Live power and thermal data
dark audit          # Run modular system audit
dark heal           # Trigger self-healing
dark backup         # Configuration backup
dark hdd mount      # Mount media HDD
dark update         # Update all containers

Shell aliases provide quick access to common operations:

Alias	Expands To
`home`, `dev`, `ws`	Directory navigation
`dcup`, `dcdown`	Docker compose up/down
`dps`, `dstats`	Container status
`hdd-mount`, `hdd-unmount`	HDD lifecycle
`health`	Full health report

Bot Systems

Telegram Bot (44 Commands)

The Telegram bot provides remote control through pure bash processing of long-polling updates. All commands are processed through dark-bot.sh with responses sent via the Telegram Bot API.

Command	Function
`/status`	System health check
`/audit`	Deep modular audit
`/heal`	Trigger self-healing
`/power`	CPU/GPU/thermal data
`/storage`	Disk usage (SSD, HDD, Docker)
`/mode`	Current mode display
`/backup`	Trigger config backup
`/hdd_mount`	Mount media HDD
`/hdd_unmount`	Safely unmount HDD
`/on <service>`	Start any service
`/off <service>`	Stop any service
`/tunnel`	Cloudflare Tunnel status
`/security`	CrowdSec decisions
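
Command routing in pure bash can be done with parameter expansion and a case statement. The following is a sketch under assumptions, not the logic of dark-bot.sh: route_command is a hypothetical name, the mapping from bot commands to dark subcommands is assumed, and the function only prints the action it would take. The service-name check illustrates one way to keep chat input away from the shell.

```shell
# Split an incoming message into command and argument, then route it.
# Echoes the action that would be taken instead of executing anything.
route_command() {
    # $1 = raw message text, e.g. "/on jellyfin"
    local text cmd arg
    text=$1
    cmd=${text%% *}              # first word
    cmd=${cmd%%@*}               # drop an optional @BotName suffix
    arg=${text#"${text%% *}"}    # everything after the first word
    arg=${arg# }
    case "$cmd" in
        /status) echo "run: dark status" ;;
        /heal)   echo "run: dark heal" ;;
        /on|/off)
            # Accept only simple names so user input is never interpreted by the shell
            case "$arg" in
                ""|*[!a-zA-Z0-9_-]*) echo "reject: invalid service name" ;;
                *) echo "run: service ${cmd#/} $arg" ;;
            esac ;;
        *) echo "reject: unknown command" ;;
    esac
}

route_command "/status"
route_command "/on jellyfin"
route_command "/on jellyfin; rm -rf /"
```

The three calls print "run: dark status", "run: service on jellyfin", and "reject: invalid service name".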

Discord Bot (49 Commands)

The Discord bot runs as a Docker container using Discord.js v14, providing richer interactive features including modals for config editing, trend charts via QuickChart API, and a live Pulse embed that updates every 30 seconds.

javascript

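// Discord bot power command
const { execSync } = require('child_process');
const { EmbedBuilder } = require('discord.js');

// Fetch Scaphandre's Prometheus-format metrics and keep the per-socket power lines
const powerData = execSync('docker exec scaphandre wget -qO- http://localhost:8086/metrics')
    .toString()
    .split('\n')
    .filter(line => line.startsWith('scaphandre_power_socket_watts'));

// cpuWatts, gpuWatts, and costPerHour are derived from powerData and other
// sensors elsewhere in the bot; that parsing is not shown in this excerpt.
const embed = new EmbedBuilder()
    .setTitle('⚡ Power Draw')
    .addFields(
        { name: 'CPU', value: `${cpuWatts}W`, inline: true },
        { name: 'GPU', value: `${gpuWatts}W`, inline: true },
        { name: 'Cost/hr', value: `₹${costPerHour}`, inline: true }
    );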

Discord-exclusive features include:

The Pulse: Auto-updating pinned embed with live metrics
Trend Charts: 24-hour historical data via QuickChart
Remote Studio: Interactive .env editor via Discord Modals

Self-Healing Agent

The dark heal command implements an 11-phase autonomous recovery system that respects the current operational mode:

Phase	Detection	Recovery
1	Lock file staleness	Clear stale lock
2	Unhealthy containers	Docker restart
3	Crashed containers	Docker start (mode-aware)
4	Failed systemd	systemctl restart
5	Internet unreachable	Restart NetworkManager
5	Tailscale offline	Restart tailscaled
6	HTTP probe failures	Restart failed services
7	Tunnel disconnection	Restart cloudflared
8	Disk >90%	Emergency prune
9	HDD unmounted	Auto-mount if in media/full
10	Permission drift	fix-system.sh
11	Recovery complete	Send notifications

The agent is mode-aware, meaning it only attempts to heal services whose tier matches the current operational mode. It will not try to start media services when in work mode, avoiding unnecessary resource consumption and error logs.
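
As an illustration of a single detection step, the disk check in phase 8 could be sketched as below. The helper names are hypothetical, the check covers only the filesystem holding the given path, and the prune itself is left as a message because the actual cleanup commands are not shown in this write-up.

```shell
# Print the used-space percentage (integer, no % sign) of the filesystem holding a path.
disk_usage_pct() {
    # df -P gives the portable one-line-per-filesystem layout; column 5 is the capacity
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeeds (exit 0) when usage is above the threshold.
disk_over_threshold() {
    # $1 = path, $2 = threshold percentage
    [ "$(disk_usage_pct "$1")" -gt "$2" ]
}

if disk_over_threshold / 90; then
    echo "disk above 90%: emergency prune would run here"
else
    echo "disk usage OK: $(disk_usage_pct /)%"
fi
```

Keeping detection as a function that only returns a status lets the recovery action stay separate and mode-aware.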

Security Hardening

Network Perimeter

The system implements a defense-in-depth approach with multiple security layers:

Layer	Technology	Configuration
Public Access	Cloudflare Tunnel	Zero exposed ports, tunnel-to-container
Reverse Proxy	Caddy	Automatic HTTPS, access logging
IDS	CrowdSec	Community threats, auto-bans
Firewall	UFW	Default DENY, only 80/443/22
VPN	Tailscale	Encrypted mesh, 100.x.x.x
Intrusion Prevention	Fail2ban	SSH, Gitea, Jellyfin jails

Host Hardening

Setting	Value	File
Root login	Disabled	sshd_config.d/hardening.conf
Password auth	Disabled	sshd_config.d/hardening.conf
SSH key type	Ed25519 only	authorized_keys
Docker socket	User in docker group	/var/run/docker.sock
Samba shares	Authenticated only	smb.conf
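
The two SSH settings correspond to the standard sshd directives PermitRootLogin and PasswordAuthentication. The sketch below checks a single config file for them. sshd_setting_is is a hypothetical helper, the demo file is temporary, and the parser handles only whitespace-separated "Directive value" lines, without Match blocks or include files.

```shell
# Succeeds when a sshd config file sets <directive> to <value>.
# Directive names are matched case-insensitively, as sshd does.
sshd_setting_is() {
    # $1 = config file, $2 = directive, $3 = expected value
    awk -v d="$2" -v v="$3" '
        /^[[:space:]]*#/ { next }                              # skip comments
        tolower($1) == tolower(d) { found = ($2 == v); exit }  # sshd uses the first occurrence
        END { exit !found }
    ' "$1"
}

# Demo against a temporary file standing in for sshd_config.d/hardening.conf
demo_conf=$(mktemp)
printf '# hardening\nPermitRootLogin no\nPasswordAuthentication no\n' > "$demo_conf"

for directive in PermitRootLogin PasswordAuthentication; do
    if sshd_setting_is "$demo_conf" "$directive" no; then
        echo "$directive: hardened"
    else
        echo "$directive: CHECK"
    fi
done
```

On a live host, sshd -T prints the effective configuration after all files are merged, which is a more reliable source than any single file.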

Audit Schedule

The automated audit runs 20 modular checks via cron:

Check	Scope
01-03	System, Docker, Network
04-06	Security, Telegram, Vulnerabilities
07-10	Backup, Certificates, Docker Security
11-16	Resource leaks, Mode consistency, HDD health
17-20	Git config, Cron integrity, Firewall, Overall score

Deployment

Quick Start

bash

# Clone and configure
git clone https://github.com/bhargav-pratim-sarma/Darknebula-Homelab.git
cd Darknebula-Homelab/homelab-installer

# Run installer
chmod +x install.sh
sudo ./install.sh

# Access services
dark status             # System health
dark power              # Live power data

Automated Installer

The install.sh script automates the complete setup in ~10 minutes:

Network configuration (static IP)
Docker engine installation
Modern CLI tools (Starship, Lazygit, Lazydocker)
Tailscale VPN with zero-touch auth
40+ Docker containers deployment
Samba file shares configuration
Systemd services (power monitor, bot, boot notification)
ZSH with Oh-My-Zsh and plugins

Configuration

The installer supports non-interactive mode via install.conf:

bash

# install.conf
INSTALL_ANTIGRAVITY=true
INSTALL_DESKTOP=false
INITIAL_MODE=full
TAILSCALE_AUTH_KEY=your_key_here
GITHUB_PAT=your_pat_here
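
A non-interactive installer typically sources such a file and validates the values before acting on them. The sketch below shows that pattern. It is not the logic of install.sh: load_install_conf is a hypothetical name and the fallback defaults are assumptions. Sourcing executes the file as shell code, so this only suits a config file the operator controls.

```shell
# Load an install.conf-style file and validate INITIAL_MODE.
load_install_conf() {
    # $1 = path to the config file
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 1; }
    . "$1"
    INITIAL_MODE=${INITIAL_MODE:-work}          # assumed default, not from the installer
    INSTALL_DESKTOP=${INSTALL_DESKTOP:-false}   # assumed default
    case "$INITIAL_MODE" in
        work|media|full) ;;
        *) echo "invalid INITIAL_MODE: $INITIAL_MODE" >&2; return 1 ;;
    esac
}

# Demo with a temporary file using two of the keys shown above
demo_install_conf=$(mktemp)
printf 'INSTALL_DESKTOP=false\nINITIAL_MODE=full\n' > "$demo_install_conf"

if load_install_conf "$demo_install_conf"; then
    echo "mode=$INITIAL_MODE desktop=$INSTALL_DESKTOP"
fi
```

Rejecting an unknown mode up front avoids a half-finished installation that fails at the mode-switching step.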

Roadmap

Phase 1 — Core Infrastructure (Complete) - Docker, Caddy, Portainer, basic monitoring
Phase 2 — Service Catalog (Complete) - 40+ containers across all categories
Phase 3 — Automation Scripts (Complete) - 64 scripts for all operations
Phase 4 — Telegram Bot (Complete) - 44 commands for remote control
Phase 5 — Discord Bot (Complete) - 49 slash commands with Pulse, Trends
Phase 6 — Power Monitoring (Complete) - Intel RAPL via Scaphandre, INR costing
Phase 7 — Self-Healing (Complete) - 11-phase autonomous recovery
Phase 8 — Mode Switching (Complete) - Work/Media/Full resource optimization
Phase 9 — IPv6 Support (In Progress) - Full IPv6 dual-stack
Phase 10 — Kubernetes Migration (Planned) - K3s for container orchestration

Conclusion

DarkNebula represents a production-grade homelab infrastructure that achieves near-autonomous operation through comprehensive automation. The 40+ services are managed through a metadata-driven system that eliminates maintenance overhead while providing multiple control interfaces (CLI, Telegram, Discord) for remote administration.

The Intel RAPL power monitoring provides hardware-level accuracy at 15-second resolution, enabling precise electricity cost tracking in Grafana. The self-healing agent automatically detects and recovers from 11 different failure modes, with exponential backoff preventing restart loops on chronically failing services.

Mode-based resource management makes it practical to run the infrastructure on consumer hardware by activating only the service tiers currently needed: work mode uses ~800MB of RAM, while full mode runs the complete automation stack at ~5GB+.

Security hardening through CrowdSec, Fail2ban, UFW, and Cloudflare Tunnel provides enterprise-grade protection while maintaining accessibility through the public domain. The centralized configuration through global.env with symlink propagation ensures consistency across all services without manual synchronization.
