Table of Contents
- The Challenge
- Architecture & Solution
- Tech Stack
- Key Engineering Decisions
- Service Inventory
- Automation & Scripts
- Bot Systems
- Self-Healing Agent
- Security Hardening
- Deployment
- Roadmap
The Challenge
Managing a homelab with more than 40 Docker containers, multiple network access methods, hybrid storage (SSD and HDD), and several service categories (media, development, monitoring, security) creates operational overhead that manual, per-container management does not scale to.
Before this system, service management required executing individual docker compose commands for each container, with no unified state tracking or mode-based resource optimization. Monitoring required manually checking multiple dashboards, with no consolidated view of system health. Backup operations were executed manually with no automated verification. Security monitoring relied on separate tools with no unified threat response. The lack of remote access capabilities limited administration to local network connections or complex SSH tunnel setups.
The fundamental challenge was building an infrastructure that could operate autonomously with minimal manual intervention while providing comprehensive remote control through multiple interfaces (CLI, Telegram, Discord). The system needed to be intelligent enough to detect failures and recover automatically, resource-efficient enough to operate on consumer hardware, and secure enough to expose via Cloudflare Tunnel without compromising the host system.
The requirement: Build a production-grade, self-healing homelab infrastructure with centralized configuration management, multi-interface remote control (CLI + Telegram + Discord), automated resource optimization through mode switching (Work/Media/Full), hardware-level power monitoring via Intel RAPL, and enterprise-grade security hardening with CrowdSec integration.
Architecture & Solution
The architecture follows a modular, metadata-driven design in which every service directory contains a .homelab-meta file defining its tier classification (core, work, media, extra). This metadata drives all automation logic, including mode switching, self-healing scope, and status reporting.
The configuration architecture implements a single-source-of-truth model where all service configurations reference a centralized global.env file through symlinks. The update-env.sh script recursively verifies and recreates these symlinks, ensuring configuration changes propagate consistently across all services without manual intervention.
Tech Stack
| Layer | Technology | Role |
|---|---|---|
| Host OS | Ubuntu 24.04 LTS | Primary operating system |
| Container Runtime | Docker 27.x + Compose V2 | Container orchestration |
| Reverse Proxy | Caddy 2.x | Automatic HTTPS, routing |
| Virtual Private Network | Tailscale | Mesh VPN for remote access |
| Public Access | Cloudflare Tunnel | Zero-trust public exposure |
| Time-Series Database | Prometheus | Metrics collection (15s scrape, 365-day retention) |
| Visualization | Grafana | Dashboards with INR costing |
| Security Monitoring | CrowdSec | Community-driven threat detection |
| Bot Framework | Discord.js v14 | Discord bot with slash commands |
| Telegram API | Long Polling | Pure bash Telegram command processor |
| Power Monitoring | Scaphandre | Intel RAPL per-process attribution |
| Automation | Bash scripts (64) | Service control, backups, healing |
| File Sharing | Samba | 7 authenticated network shares |
| Remote Desktop | RustDesk | P2P remote access |
Key Engineering Decisions
1. Metadata-Driven Service Classification
Every service directory contains a .homelab-meta file with tier classification. This metadata drives all automation logic without hardcoding service lists.
# ~/docker/applications/media/jellyfin/.homelab-meta
NAME=Jellyfin
TIER=media
CATEGORY=media
The homelab-ctl.sh script reads these metadata files to determine which services should run in each mode. This approach eliminates maintenance overhead when adding or removing services.
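As a rough illustration of how a script could consume these files, the sketch below walks a directory tree and prints the NAME of every service whose .homelab-meta declares a given tier. The function name `services_in_tier` and the grep/cut parsing are assumptions made for this example; the actual logic in homelab-ctl.sh is not shown in this document.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from homelab-ctl.sh): list services in a given tier.
# Usage: services_in_tier <root-dir> <tier>
services_in_tier() {
  local root="$1" tier="$2" meta name
  # Sort the paths so the output order is stable between runs
  find "$root" -name .homelab-meta -type f | sort | while IFS= read -r meta; do
    # Parse KEY=VALUE lines with grep/cut instead of sourcing the file,
    # so a metadata file can never execute code
    if [ "$(grep -m1 '^TIER=' "$meta" | cut -d= -f2-)" = "$tier" ]; then
      name=$(grep -m1 '^NAME=' "$meta" | cut -d= -f2-)
      printf '%s\n' "$name"
    fi
  done
}
```

Calling `services_in_tier ~/docker media` would then print one service name per line, which a mode switch could iterate over instead of a hardcoded list.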
2. Centralized Configuration with Symlink Propagation
Rather than maintaining individual .env files for each service, all configurations reference a single global.env through symlinks maintained by update-env.sh.
# update-env.sh excerpt
SERVICES=(
"base/caddy"
"base/portainer"
"base/cloudflared"
"applications/media/jellyfin"
"applications/monitoring/telemetry"
...
)
for service in "${SERVICES[@]}"; do
ln -sf "$GLOBAL_ENV" "$DOCKER_DIR/$service/.env"
done
This design ensures that updating a password or API key in one place automatically propagates to all services, eliminating configuration drift between containers.
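The verification half of that workflow might look like the sketch below: check whether a service's .env is already a symlink to the global file, and relink only when it is not. The function name `ensure_env_link` and its "ok"/"relinked" output are hypothetical and are not taken from update-env.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from update-env.sh): ensure <service-dir>/.env
# is a symlink to the global env file.
# Prints "ok" if the link was already correct, "relinked" if it was replaced.
ensure_env_link() {
  local global_env="$1" service_dir="$2"
  local link="$service_dir/.env"
  if [ -L "$link" ] && [ "$(readlink "$link")" = "$global_env" ]; then
    echo "ok"
  else
    # -f replaces a stale link or regular file; -n treats an existing
    # symlink as a file rather than following it
    ln -sfn "$global_env" "$link"
    echo "relinked"
  fi
}
```

Note that this sketch silently replaces a regular .env file; a more cautious script might back it up first.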
3. Mode-Based Resource Management
The system implements three operational modes that control which service tiers are active, enabling resource optimization based on usage patterns:
| Mode | Active Tiers | RAM Usage | Use Case |
|---|---|---|---|
| work | core + work | ~800MB | Development, file sync |
| media | core + work + media | ~2.5GB | Photo library, media streaming |
| full | all tiers | ~5GB+ | Full automation, *arr stack |
The homelab-ctl.sh script implements mode switching by stopping and archiving services that should not run in the target mode, moving their compose files to ~/docker/archived/ to free memory and CPU resources.
# homelab-ctl.sh mode switching
case "$MODE" in
work)
archive_tier "media"
archive_tier "extra"
;;
media)
restore_tier "media"
archive_tier "extra"
;;
full)
restore_all
;;
esac
4. Intel RAPL Power Monitoring
The system uses Scaphandre to read CPU power data directly from Intel RAPL (Running Average Power Limit) hardware counters, providing per-second power attribution at the process level.
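# power-monitor.sh calculation
RAPL=/sys/class/powercap/intel-rapl:0/energy_uj
# energy_uj is a cumulative counter in microjoules, so sample it twice to get watts
E1=$(cat "$RAPL"); sleep 1; E2=$(cat "$RAPL")
CPU_POWER=$(( (E2 - E1) / 1000000 ))      # average watts over 1 second (integer)
GPU_UW=$(cat /sys/class/hwmon/hwmon*/power1_average 2>/dev/null | head -n1)
GPU_POWER=$(( ${GPU_UW:-0} / 1000000 ))   # hwmon reports microwatts
OVERHEAD=15         # Motherboard, RAM, SSD, NIC (watts)
PSU_EFFICIENCY=85   # percent; bash arithmetic is integer-only
TOTAL_POWER=$(( (CPU_POWER + GPU_POWER + OVERHEAD) * 100 / PSU_EFFICIENCY ))
This data is scraped by Prometheus every 15 seconds and visualized in Grafana with electricity cost calculations based on ₹8/kWh (Assam APDCL rates).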
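The cost figure follows from simple arithmetic: watts divided by 1000 gives kilowatts, and kilowatts sustained for one hour, multiplied by the tariff, gives the hourly cost. The sketch below shows that conversion. The function name `cost_per_hour` is hypothetical, and awk is used only because bash arithmetic has no floating point; the actual Grafana query is not shown in this document.

```shell
#!/usr/bin/env bash
# Hypothetical helper: hourly electricity cost for a given power draw.
# cost = (watts / 1000) kW * 1 h * rate per kWh
cost_per_hour() {
  local watts="$1" rate="${2:-8}"   # rate defaults to the 8 INR/kWh tariff
  awk -v w="$watts" -v r="$rate" 'BEGIN { printf "%.2f\n", w / 1000 * r }'
}
```

For example, `cost_per_hour 65` prints `0.52`: a steady 65 W draw costs about ₹0.52 per hour at ₹8/kWh.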
5. Exponential Backoff with Chronic Failure Suppression
The self-healing agent tracks consecutive failures per container and implements progressive backoff to prevent restart loops while still allowing recovery attempts.
# Failure tracking
TRACKER=~/docker/status/heal-tracker/${CONTAINER}.count
FAILURE_COUNT=$(cat "$TRACKER" 2>/dev/null || echo 0)
FAILURE_COUNT=$((FAILURE_COUNT + 1))
echo "$FAILURE_COUNT" > "$TRACKER"   # persist the count for the next run
if [ "$FAILURE_COUNT" -ge 7 ]; then
  echo "⛔ CHRONIC: Skipping $CONTAINER (suppressed)"
  exit 0
fi
# Backoff intervals
if [ "$FAILURE_COUNT" -le 2 ]; then
  :             # restart immediately
elif [ "$FAILURE_COUNT" -le 4 ]; then
  sleep 900     # 15 minutes
else
  sleep 3600    # 1 hour
fi
docker restart "$CONTAINER"
Service Inventory
Infrastructure (7 containers)
| Service | Port | Description |
|---|---|---|
| Caddy | 80/443/8085 | Reverse proxy with automatic HTTPS |
| Cloudflared | — | Cloudflare Tunnel daemon |
| Portainer | 9000 | Container management GUI |
| CrowdSec | — | Threat detection and IP banning |
| Watchtower | — | Automatic image updates |
| Sablier | — | Scale-to-zero for on-demand services |
| Authentik | 9900 | SSO identity provider (4 containers) |
Media (16 containers)
| Service | Port | Description |
|---|---|---|
| Jellyfin | 8096 | 4K HDR media streaming |
| Immich | 2283 | Google Photos alternative (5 containers) |
| qBittorrent | 8081 | Torrent client with VueTorrent theme |
| Syncthing | 8384 | P2P file synchronization |
| Sonarr | 8989 | TV show automation |
| Radarr | 7878 | Movie automation |
| Prowlarr | 9696 | Indexer management |
| Jellyseerr | 5055 | Media request platform |
| autobrr | 7474 | Torrent filter automation |
| Youtarr | 3087 | YouTube channel downloader |
Development (3 containers)
| Service | Port | Description |
|---|---|---|
| Gitea | 3000 | Self-hosted Git server |
| Gitea Actions | — | CI/CD runner |
Telemetry (9 containers)
| Service | Port | Description |
|---|---|---|
| Grafana | 3001 | Analytics dashboard |
| Prometheus | 9091 | Time-series database |
| Alertmanager | 9093 | Alert routing |
| Node Exporter | — | Host metrics |
| cAdvisor | — | Container metrics |
| Scaphandre | 8086 | Power monitoring |
| Smartctl Exporter | 9633 | Disk health |
| Netdata | 19999 | Real-time metrics |
| Scraparr | 7100 | *arr metrics exporter |
Automation & Scripts
The system includes 64 automation scripts organized by function:
| Category | Scripts | Purpose |
|---|---|---|
| Core Control | dark-nebula.sh, homelab-ctl.sh | CLI and mode management |
| Bot Handlers | dark-bot.sh, dark-discord.js | Telegram and Discord bots |
| Healing | dark-heal.sh, fix-system.sh | Self-healing and repair |
| Backup | backup-minimal.sh, backup-full.sh | Configuration and data backup |
| Storage | mount-hdd.sh, unmount-hdd.sh | Safe HDD lifecycle |
| Audit | audit/01-system.sh...20-overall-score.sh | 20 modular audit checks |
| Telemetry | power-monitor.sh, healthcheck.sh | Hardware monitoring |
The dark CLI provides unified access to all automation:
dark status # Full system health check
dark mode work # Switch to work mode
dark power # Live power and thermal data
dark audit # Run modular system audit
dark heal # Trigger self-healing
dark backup # Configuration backup
dark hdd mount # Mount media HDD
dark update # Update all containers
Shell aliases provide quick access to common operations:
| Alias | Expands To |
|---|---|
| home, dev, ws | Directory navigation |
| dcup, dcdown | Docker compose up/down |
| dps, dstats | Container status |
| hdd-mount, hdd-unmount | HDD lifecycle |
| health | Full health report |
Bot Systems
Telegram Bot (44 Commands)
The Telegram bot provides remote control through pure bash processing of long-polling updates. All commands are processed through dark-bot.sh with responses sent via the Telegram Bot API.
| Command | Function |
|---|---|
| /status | System health check |
| /audit | Deep modular audit |
| /heal | Trigger self-healing |
| /power | CPU/GPU/thermal data |
| /storage | Disk usage (SSD, HDD, Docker) |
| /mode | Current mode display |
| /backup | Trigger config backup |
| /hdd_mount | Mount media HDD |
| /hdd_unmount | Safely unmount HDD |
| /on <service> | Start any service |
| /off <service> | Stop any service |
| /tunnel | Cloudflare Tunnel status |
| /security | CrowdSec decisions |
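Because the bot accepts free-form text from a chat, the command routing step is worth sketching. The example below maps one message text to an action string and rejects service names that contain anything other than letters, digits, underscores, or hyphens. The function name `route_command` and the `run:`/`error:` output format are assumptions for illustration; the real dispatch logic in dark-bot.sh is not shown in this document.

```shell
#!/usr/bin/env bash
# Hypothetical dispatcher: map one Telegram message text to an action string.
# A real handler would execute the action and reply through the Bot API.
route_command() {
  local text="$1" cmd arg
  cmd="${text%% *}"                            # first word, e.g. "/on"
  arg=""
  [ "$cmd" != "$text" ] && arg="${text#* }"    # remainder after the first space
  cmd="${cmd%%@*}"                             # drop the "@botname" suffix used in group chats
  case "$cmd" in
    /status) echo "run: dark status" ;;
    /heal)   echo "run: dark heal" ;;
    /on|/off)
      # Allow only simple names so chat input never reaches a shell unchecked
      case "$arg" in
        ""|*[!a-zA-Z0-9_-]*) echo "error: invalid service name" ;;
        *) echo "run: service ${cmd#/} $arg" ;;
      esac ;;
    *) echo "error: unknown command" ;;
  esac
}
```

With this shape, `route_command '/on jellyfin'` yields `run: service on jellyfin`, while `route_command '/on jellyfin; rm -rf /'` is rejected as an invalid service name.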
Discord Bot (49 Commands)
The Discord bot runs as a Docker container using Discord.js v14, providing richer interactive features including modals for config editing, trend charts via QuickChart API, and a live Pulse embed that updates every 30 seconds.
// Discord bot power command
const { execSync } = require('child_process');
const { EmbedBuilder } = require('discord.js');
// Pull the raw Prometheus metrics from the Scaphandre container
const powerData = execSync('docker exec scaphandre wget -qO- http://localhost:8086/metrics')
  .toString()
  .split('\n')
  .filter(line => line.startsWith('scaphandre_power_socket_watts'));
// cpuWatts, gpuWatts and costPerHour are derived from the parsed metrics (derivation omitted in this excerpt)
const embed = new EmbedBuilder()
  .setTitle('⚡ Power Draw')
  .addFields(
    { name: 'CPU', value: `${cpuWatts}W`, inline: true },
    { name: 'GPU', value: `${gpuWatts}W`, inline: true },
    { name: 'Cost/hr', value: `₹${costPerHour}`, inline: true }
  );
Discord-exclusive features include:
- The Pulse: Auto-updating pinned embed with live metrics
- Trend Charts: 24-hour historical data via QuickChart
- Remote Studio: Interactive .env editor via Discord Modals
Self-Healing Agent
The dark heal command implements an 11-phase autonomous recovery system that respects the current operational mode:
| Phase | Detection | Recovery |
|---|---|---|
| 1 | Lock file staleness | Clear stale lock |
| 2 | Unhealthy containers | Docker restart |
| 3 | Crashed containers | Docker start (mode-aware) |
| 4 | Failed systemd | systemctl restart |
| 5 | Internet unreachable | Restart NetworkManager |
| 5 | Tailscale offline | Restart tailscaled |
| 6 | HTTP probe failures | Restart failed services |
| 7 | Tunnel disconnection | Restart cloudflared |
| 8 | Disk >90% | Emergency prune |
| 9 | HDD unmounted | Auto-mount if in media/full |
| 10 | Permission drift | fix-system.sh |
| 11 | Recovery complete | Send notifications |
The agent is mode-aware, meaning it only attempts to heal services whose tier matches the current operational mode. It will not try to start media services when in work mode, avoiding unnecessary resource consumption and error logs.
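One way to express that mode check is a small predicate that the healing loop consults before touching a container. The sketch below encodes the mode-to-tier mapping from the mode table (work = core + work, media = core + work + media, full = all tiers). The function name `tier_active_in_mode` is hypothetical and is not taken from dark-heal.sh.

```shell
#!/usr/bin/env bash
# Hypothetical predicate: exit 0 when a tier should be running in the given mode.
# Usage: tier_active_in_mode <tier> <mode>
tier_active_in_mode() {
  local tier="$1" mode="$2"
  case "$mode:$tier" in
    full:*)                             return 0 ;;  # full mode runs every tier
    media:core|media:work|media:media)  return 0 ;;
    work:core|work:work)                return 0 ;;
    *)                                  return 1 ;;  # unknown mode or inactive tier
  esac
}
```

A healing loop could then skip out-of-scope containers with something like `tier_active_in_mode "$TIER" "$MODE" || continue`, where `TIER` comes from the service's .homelab-meta file.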
Security Hardening
Network Perimeter
The system implements a defense-in-depth approach with multiple security layers:
| Layer | Technology | Configuration |
|---|---|---|
| Public Access | Cloudflare Tunnel | Zero exposed ports, tunnel-to-container |
| Reverse Proxy | Caddy | Automatic HTTPS, access logging |
| IDS | CrowdSec | Community threats, auto-bans |
| Firewall | UFW | Default DENY, only 80/443/22 |
| VPN | Tailscale | Encrypted mesh, 100.x.x.x |
| Intrusion | Fail2ban | SSH, Gitea, Jellyfin jails |
Host Hardening
| Setting | Value | File |
|---|---|---|
| Root login | Disabled | sshd_config.d/hardening.conf |
| Password auth | Disabled | sshd_config.d/hardening.conf |
| SSH key type | Ed25519 only | authorized_keys |
| Docker socket | User in docker group | /var/run/docker.sock |
| Samba shares | Authenticated only | smb.conf |
Audit Schedule
The automated audit runs 20 modular checks via cron:
| Check | Scope |
|---|---|
| 01-03 | System, Docker, Network |
| 04-06 | Security, Telegram, Vulnerabilities |
| 07-10 | Backup, Certificates, Docker Security |
| 11-16 | Resource leaks, Mode consistency, HDD health |
| 17-20 | Git config, Cron integrity, Firewall, Overall score |
Deployment
Quick Start
# Clone and configure
git clone https://github.com/bhargav-pratim-sarma/Darknebula-Homelab.git
cd Darknebula-Homelab/homelab-installer
# Run installer
chmod +x install.sh
sudo ./install.sh
# Access services
dark status # System health
dark power # Live power data
Automated Installer
The install.sh script automates the complete setup in ~10 minutes:
- Network configuration (static IP)
- Docker engine installation
- Modern CLI tools (Starship, Lazygit, Lazydocker)
- Tailscale VPN with zero-touch auth
- 40+ Docker containers deployment
- Samba file shares configuration
- Systemd services (power monitor, bot, boot notification)
- ZSH with Oh-My-Zsh and plugins
Configuration
The installer supports non-interactive mode via install.conf:
# install.conf
INSTALL_ANTIGRAVITY=true
INSTALL_DESKTOP=false
INITIAL_MODE=full
TAILSCALE_AUTH_KEY=your_key_here
GITHUB_PAT=your_pat_here
Roadmap
- Phase 1 — Core Infrastructure (Complete) - Docker, Caddy, Portainer, basic monitoring
- Phase 2 — Service Catalog (Complete) - 40+ containers across all categories
- Phase 3 — Automation Scripts (Complete) - 64 scripts for all operations
- Phase 4 — Telegram Bot (Complete) - 44 commands for remote control
- Phase 5 — Discord Bot (Complete) - 49 slash commands with Pulse, Trends
- Phase 6 — Power Monitoring (Complete) - Intel RAPL via Scaphandre, INR costing
- Phase 7 — Self-Healing (Complete) - 11-phase autonomous recovery
- Phase 8 — Mode Switching (Complete) - Work/Media/Full resource optimization
- Phase 9 — IPv6 Support (In Progress) - Full IPv6 dual-stack
- Phase 10 — Kubernetes Migration (Planned) - K3s for container orchestration
Conclusion
DarkNebula represents a production-grade homelab infrastructure that achieves near-autonomous operation through comprehensive automation. The 40+ services are managed through a metadata-driven system that eliminates maintenance overhead while providing multiple control interfaces (CLI, Telegram, Discord) for remote administration.
The Intel RAPL power monitoring provides hardware-level accuracy at 15-second resolution, enabling precise electricity cost tracking in Grafana. The self-healing agent automatically detects and recovers from 11 different failure modes, with exponential backoff preventing restart loops on chronically failing services.
The mode-based resource management enables running the infrastructure on consumer hardware by selectively activating service tiers based on current needs. The work mode consumes ~800MB RAM while full mode enables the complete automation stack at ~5GB.
Security hardening through CrowdSec, Fail2ban, UFW, and Cloudflare Tunnel provides enterprise-grade protection while maintaining accessibility through the public domain. The centralized configuration through global.env with symlink propagation ensures consistency across all services without manual synchronization.