Year: 2024
Domain: DevOps
Access: Open Source
Complexity: 8.7 / 10
Technologies: Docker, Docker Compose, Bash, Ubuntu, Cloudflare Tunnel, Tailscale, Prometheus, Grafana, CrowdSec, Node.js, Discord.js, Telegram Bot, Self-Healing, Infrastructure as Code, Power Monitoring
Tags: DevOps, Production

DarkNebula — Production-Grade Homelab Infrastructure

Fully automated homelab with 40+ containers, 64 automation scripts, Telegram/Discord bots, Intel RAPL power monitoring, self-healing agents, and mode-based resource management.



The Challenge

Managing a homelab infrastructure with over 40 Docker containers, multiple network access methods, hybrid storage (SSD and HDD), and diverse service categories (media, development, monitoring, security) presents significant operational challenges that traditional manual management approaches cannot address effectively.

Before this system, service management required executing individual docker compose commands for each container, with no unified state tracking or mode-based resource optimization. Monitoring required manually checking multiple dashboards, with no consolidated view of system health. Backup operations were executed manually with no automated verification. Security monitoring relied on separate tools with no unified threat response. The lack of remote access capabilities limited administration to local network connections or complex SSH tunnel setups.

The fundamental challenge was building an infrastructure that could operate autonomously with minimal manual intervention while providing comprehensive remote control through multiple interfaces (CLI, Telegram, Discord). The system needed to be intelligent enough to detect failures and recover automatically, resource-efficient enough to operate on consumer hardware, and secure enough to expose via Cloudflare Tunnel without compromising the host system.

The requirement: Build a production-grade, self-healing homelab infrastructure with centralized configuration management, multi-interface remote control (CLI + Telegram + Discord), automated resource optimization through mode switching (Work/Media/Full), hardware-level power monitoring via Intel RAPL, and enterprise-grade security hardening with CrowdSec integration.


Architecture & Solution

The architecture follows a modular, metadata-driven design where every service contains a .homelab-meta file defining its tier classification (core, work, media, extra). This metadata drives all automation logic including mode switching, self-healing scope, and status reporting.


The configuration architecture implements a single-source-of-truth model where all service configurations reference a centralized global.env file through symlinks. The update-env.sh script recursively verifies and recreates these symlinks, ensuring configuration changes propagate consistently across all services without manual intervention.


Tech Stack

| Layer | Technology | Role |
| --- | --- | --- |
| Host OS | Ubuntu 24.04 LTS | Primary operating system |
| Container Runtime | Docker 27.x + Compose V2 | Container orchestration |
| Reverse Proxy | Caddy 2.x | Automatic HTTPS, routing |
| Virtual Private Network | Tailscale | Mesh VPN for remote access |
| Public Access | Cloudflare Tunnel | Zero-trust public exposure |
| Time-Series Database | Prometheus | Metrics collection (15s scrape, 365-day retention) |
| Visualization | Grafana | Dashboards with INR costing |
| Security Monitoring | CrowdSec | Community-driven threat detection |
| Bot Framework | Discord.js v14 | Discord bot with slash commands |
| Telegram API | Long Polling | Pure bash Telegram command processor |
| Power Monitoring | Scaphandre | Intel RAPL per-process attribution |
| Automation | Bash scripts (64) | Service control, backups, healing |
| File Sharing | Samba | 7 authenticated network shares |
| Remote Desktop | RustDesk | P2P remote access |

Key Engineering Decisions

1. Metadata-Driven Service Classification

Every service directory contains a .homelab-meta file with tier classification. This metadata drives all automation logic without hardcoding service lists.

bash
# ~/docker/applications/media/jellyfin/.homelab-meta
NAME=Jellyfin
TIER=media
CATEGORY=media

The homelab-ctl.sh script reads these metadata files to determine which services should run in each mode. This approach eliminates maintenance overhead when adding or removing services.
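As a rough illustration of how tier metadata could be discovered, the sketch below scans a directory tree for .homelab-meta files and prints the service directories that declare a given tier. The function name `services_in_tier` and its arguments are hypothetical and are not taken from homelab-ctl.sh; only the file name and the `TIER=` key come from the example above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list service directories whose .homelab-meta declares a tier.
# Usage: services_in_tier <root-dir> <tier>
services_in_tier() {
    local root="$1" tier="$2"
    local meta value
    # Walk every metadata file under the root in a stable order
    find "$root" -type f -name '.homelab-meta' | sort | while IFS= read -r meta; do
        # Extract the TIER= value with sed instead of sourcing the file,
        # so nothing inside the metadata file is ever executed
        value=$(sed -n 's/^TIER=//p' "$meta" | head -n 1)
        if [ "$value" = "$tier" ]; then
            dirname "$meta"
        fi
    done
}
```

Calling `services_in_tier ~/docker media` would print one directory per line, which a mode switch could then iterate over. Parsing rather than sourcing the file is a deliberate choice here, since it keeps a malformed metadata file from running arbitrary commands.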

2. Centralized Configuration via Symlinks

Rather than maintaining individual .env files for each service, all configurations reference a single global.env through symlinks maintained by update-env.sh.

bash
# update-env.sh excerpt
SERVICES=(
    "base/caddy"
    "base/portainer"
    "base/cloudflared"
    "applications/media/jellyfin"
    "applications/monitoring/telemetry"
    # ... (remaining services omitted in this excerpt)
)

for service in "${SERVICES[@]}"; do
    ln -sf "$GLOBAL_ENV" "$DOCKER_DIR/$service/.env"
done

This design ensures that updating a password or API key in one place automatically propagates to all services, eliminating configuration drift between containers.
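The verification step described for update-env.sh might look roughly like the sketch below, which checks whether a service's .env is a symlink to the shared file and recreates it only when it is not. The function names `check_env_link` and `repair_env_link` are assumptions made for illustration; the real script may structure this differently.

```bash
#!/usr/bin/env bash
# Hypothetical check: report the state of one service's .env symlink.
# Prints "ok", "missing", or "drift".
check_env_link() {
    local service_dir="$1" global_env="$2"
    local link="$service_dir/.env"
    if [ ! -L "$link" ]; then
        echo "missing"   # absent, or a regular file shadowing the shared config
    elif [ "$(readlink "$link")" = "$global_env" ]; then
        echo "ok"
    else
        echo "drift"     # a symlink exists but points somewhere else
    fi
}

# Recreate the link only when the check does not report "ok"
repair_env_link() {
    local service_dir="$1" global_env="$2"
    if [ "$(check_env_link "$service_dir" "$global_env")" != "ok" ]; then
        ln -sfn "$global_env" "$service_dir/.env"
    fi
}
```

Separating the check from the repair makes it possible to report drift (for example in an audit) without changing anything on disk.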

3. Mode-Based Resource Management

The system implements three operational modes that control which service tiers are active, enabling resource optimization based on usage patterns:

| Mode | Active Tiers | RAM Usage | Use Case |
| --- | --- | --- | --- |
| work | core + work | ~800MB | Development, file sync |
| media | core + work + media | ~2.5GB | Photo library, media streaming |
| full | all tiers | ~5GB+ | Full automation, *arr stack |

The homelab-ctl.sh script implements mode switching by stopping and archiving services that should not run in the target mode, moving their compose files to ~/docker/archived/ to free memory and CPU resources.

bash
# homelab-ctl.sh mode switching
case "$MODE" in
    work)
        archive_tier "media"
        archive_tier "extra"
        ;;
    media)
        restore_tier "media"
        archive_tier "extra"
        ;;
    full)
        restore_all
        ;;
esac

4. Intel RAPL Power Monitoring

The system uses Scaphandre to read CPU power data directly from Intel RAPL (Running Average Power Limit) hardware counters, providing per-second power attribution at the process level.

bash
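# power-monitor.sh calculation
RAPL=/sys/class/powercap/intel-rapl:0/energy_uj
# energy_uj is a cumulative counter in microjoules, so sample it twice, 1 second apart
E1=$(cat "$RAPL"); sleep 1; E2=$(cat "$RAPL")
# hwmon reports power1_average in microwatts; empty if no GPU sensor is exposed
GPU_UW=$(cat /sys/class/hwmon/hwmon*/power1_average 2>/dev/null | head -n 1)
OVERHEAD=15          # Watts: motherboard, RAM, SSD, NIC
PSU_EFFICIENCY=0.85  # 85% efficiency

# Shell arithmetic is integer-only, so awk does the floating-point math (result in watts)
TOTAL_POWER=$(awk -v e1="$E1" -v e2="$E2" -v gpu="${GPU_UW:-0}" -v oh="$OVERHEAD" -v eff="$PSU_EFFICIENCY" \
    'BEGIN { printf "%.1f", ((e2 - e1) / 1e6 + gpu / 1e6 + oh) / eff }')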

This data is scraped by Prometheus every 15 seconds and visualized in Grafana with electricity cost calculations based on ₹8/kWh (Assam APDCL rates).
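The cost calculation itself is simple arithmetic: energy in kWh is watts divided by 1000 times hours, multiplied by the tariff. A minimal sketch follows, using the ₹8/kWh rate quoted above; the function name `cost_inr` and the 45 W figure are illustrative assumptions, not measurements from this system.

```bash
#!/usr/bin/env bash
# Tariff in rupees per kWh, as quoted in the text above
RATE_INR_PER_KWH=8

# Hypothetical helper: cost in rupees for a constant draw of <watts> over <hours>
cost_inr() {
    local watts="$1" hours="$2"
    # awk handles the floating-point math; shell arithmetic is integer-only
    awk -v w="$watts" -v h="$hours" -v r="$RATE_INR_PER_KWH" \
        'BEGIN { printf "%.2f\n", w / 1000 * h * r }'
}

cost_inr 45 1     # one hour at 45 W: prints 0.36
cost_inr 45 720   # a 30-day month at a constant 45 W: prints 259.20
```

In Grafana the same formula would normally be expressed as a query over the Prometheus power series rather than computed in a script.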

5. Exponential Backoff with Chronic Failure Suppression

The self-healing agent tracks consecutive failures per container and implements progressive backoff to prevent restart loops while still allowing recovery attempts.

bash
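# Failure tracking
TRACK_FILE="$HOME/docker/status/heal-tracker/${CONTAINER}.count"
FAILURE_COUNT=$(cat "$TRACK_FILE" 2>/dev/null || echo 0)
FAILURE_COUNT=$((FAILURE_COUNT + 1))
echo "$FAILURE_COUNT" > "$TRACK_FILE"   # persist so the next run sees the updated count

if [ "$FAILURE_COUNT" -ge 7 ]; then
    echo "⛔ CHRONIC: Skipping $CONTAINER (suppressed)"
    exit 0
fi

# Backoff intervals
if [ "$FAILURE_COUNT" -le 2 ]; then
    DELAY=0      # failures 1-2: restart immediately
elif [ "$FAILURE_COUNT" -le 4 ]; then
    DELAY=900    # failures 3-4: 15 minutes
else
    DELAY=3600   # failures 5-6: 1 hour
fi

sleep "$DELAY"
docker restart "$CONTAINER"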

Service Inventory

Infrastructure (7 containers)

| Service | Port | Description |
| --- | --- | --- |
| Caddy | 80/443/8085 | Reverse proxy with automatic HTTPS |
| Cloudflared | - | Cloudflare Tunnel daemon |
| Portainer | 9000 | Container management GUI |
| CrowdSec | - | Threat detection and IP banning |
| Watchtower | - | Automatic image updates |
| Sablier | - | Scale-to-zero for on-demand services |
| Authentik | 9900 | SSO identity provider (4 containers) |

Media (16 containers)

| Service | Port | Description |
| --- | --- | --- |
| Jellyfin | 8096 | 4K HDR media streaming |
| Immich | 2283 | Google Photos alternative (5 containers) |
| qBittorrent | 8081 | Torrent client with VueTorrent theme |
| Syncthing | 8384 | P2P file synchronization |
| Sonarr | 8989 | TV show automation |
| Radarr | 7878 | Movie automation |
| Prowlarr | 9696 | Indexer management |
| Jellyseerr | 5055 | Media request platform |
| autobrr | 7474 | Torrent filter automation |
| Youtarr | 3087 | YouTube channel downloader |

Development (3 containers)

| Service | Port | Description |
| --- | --- | --- |
| Gitea | 3000 | Self-hosted Git server |
| Gitea Actions | - | CI/CD runner |

Telemetry (9 containers)

| Service | Port | Description |
| --- | --- | --- |
| Grafana | 3001 | Analytics dashboard |
| Prometheus | 9091 | Time-series database |
| Alertmanager | 9093 | Alert routing |
| Node Exporter | - | Host metrics |
| cAdvisor | - | Container metrics |
| Scaphandre | 8086 | Power monitoring |
| Smartctl Exporter | 9633 | Disk health |
| Netdata | 19999 | Real-time metrics |
| Scraparr | 7100 | *arr metrics exporter |

Automation & Scripts

The system includes 64 automation scripts organized by function:

| Category | Scripts | Purpose |
| --- | --- | --- |
| Core Control | dark-nebula.sh, homelab-ctl.sh | CLI and mode management |
| Bot Handlers | dark-bot.sh, dark-discord.js | Telegram and Discord bots |
| Healing | dark-heal.sh, fix-system.sh | Self-healing and repair |
| Backup | backup-minimal.sh, backup-full.sh | Configuration and data backup |
| Storage | mount-hdd.sh, unmount-hdd.sh | Safe HDD lifecycle |
| Audit | audit/01-system.sh ... 20-overall-score.sh | 20 modular audit checks |
| Telemetry | power-monitor.sh, healthcheck.sh | Hardware monitoring |

The dark CLI provides unified access to all automation:

bash
dark status         # Full system health check
dark mode work      # Switch to work mode
dark power          # Live power and thermal data
dark audit          # Run modular system audit
dark heal           # Trigger self-healing
dark backup         # Configuration backup
dark hdd mount      # Mount media HDD
dark update         # Update all containers

Shell aliases provide quick access to common operations:

| Alias | Expands To |
| --- | --- |
| home, dev, ws | Directory navigation |
| dcup, dcdown | Docker compose up/down |
| dps, dstats | Container status |
| hdd-mount, hdd-unmount | HDD lifecycle |
| health | Full health report |

Bot Systems

Telegram Bot (44 Commands)

The Telegram bot provides remote control through pure bash processing of long-polling updates. All commands are processed through dark-bot.sh with responses sent via the Telegram Bot API.

| Command | Function |
| --- | --- |
| /status | System health check |
| /audit | Deep modular audit |
| /heal | Trigger self-healing |
| /power | CPU/GPU/thermal data |
| /storage | Disk usage (SSD, HDD, Docker) |
| /mode | Current mode display |
| /backup | Trigger config backup |
| /hdd_mount | Mount media HDD |
| /hdd_unmount | Safely unmount HDD |
| /on <service> | Start any service |
| /off <service> | Stop any service |
| /tunnel | Cloudflare Tunnel status |
| /security | CrowdSec decisions |
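The dispatch step of a pure-bash command processor could be sketched as below. This is not the code from dark-bot.sh: the function name `route_command`, the mapping to `dark` subcommands, and the `service on|off <name>` output format are assumptions chosen for illustration. The function prints the action it would take instead of executing it, so the routing can be checked without a bot token or network access.

```bash
#!/usr/bin/env bash
# Hypothetical router: map an incoming message text to the action it would trigger.
route_command() {
    local text="$1" cmd arg=""
    cmd=${text%% *}                                  # first word, e.g. "/on"
    case "$text" in *" "*) arg=${text#* } ;; esac    # remainder of the message, if any
    cmd=${cmd%%@*}                                   # drop the "@botname" suffix used in group chats
    case "$cmd" in
        /status)    echo "dark status" ;;
        /audit)     echo "dark audit" ;;
        /heal)      echo "dark heal" ;;
        /power)     echo "dark power" ;;
        /backup)    echo "dark backup" ;;
        /hdd_mount) echo "dark hdd mount" ;;
        /on|/off)
            # Accept only simple service names so chat input never reaches a shell unchecked
            case "$arg" in
                ""|*[!A-Za-z0-9_-]*) echo "error: invalid service name"; return 1 ;;
            esac
            echo "service ${cmd#/} $arg" ;;
        *)  echo "error: unknown command"; return 1 ;;
    esac
}
```

Validating the argument before use matters more in a bash bot than in most frameworks, because any unquoted or evaluated string from a chat message would otherwise run on the host.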

Discord Bot (49 Commands)

The Discord bot runs as a Docker container using Discord.js v14, providing richer interactive features including modals for config editing, trend charts via QuickChart API, and a live Pulse embed that updates every 30 seconds.

javascript
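// Discord bot power command
const { execSync } = require('child_process');
const { EmbedBuilder } = require('discord.js');

const powerData = execSync('docker exec scaphandre wget -qO- http://localhost:8086/metrics')
    .toString()
    .split('\n')
    .filter(line => line.startsWith('scaphandre_power_socket_watts'));

// Each metric line ends with its value; sum the readings across CPU sockets
const cpuWatts = powerData
    .reduce((sum, line) => sum + parseFloat(line.trim().split(/\s+/).pop()), 0)
    .toFixed(1);
// gpuWatts and costPerHour are computed elsewhere in the handler (not shown in this excerpt)

const embed = new EmbedBuilder()
    .setTitle('⚡ Power Draw')
    .addFields(
        { name: 'CPU', value: `${cpuWatts}W`, inline: true },
        { name: 'GPU', value: `${gpuWatts}W`, inline: true },
        { name: 'Cost/hr', value: `${costPerHour}`, inline: true }
    );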

Discord-exclusive features include:

  • The Pulse: Auto-updating pinned embed with live metrics
  • Trend Charts: 24-hour historical data via QuickChart
  • Remote Studio: Interactive .env editor via Discord Modals

Self-Healing Agent

The dark heal command implements an 11-phase autonomous recovery system that respects the current operational mode:

| Phase | Detection | Recovery |
| --- | --- | --- |
| 1 | Lock file staleness | Clear stale lock |
| 2 | Unhealthy containers | Docker restart |
| 3 | Crashed containers | Docker start (mode-aware) |
| 4 | Failed systemd | systemctl restart |
| 5 | Internet unreachable | Restart NetworkManager |
| 5 | Tailscale offline | Restart tailscaled |
| 6 | HTTP probe failures | Restart failed services |
| 7 | Tunnel disconnection | Restart cloudflared |
| 8 | Disk >90% | Emergency prune |
| 9 | HDD unmounted | Auto-mount if in media/full |
| 10 | Permission drift | fix-system.sh |
| 11 | Recovery complete | Send notifications |

The agent is mode-aware, meaning it only attempts to heal services whose tier matches the current operational mode. It will not try to start media services when in work mode, avoiding unnecessary resource consumption and error logs.
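One way to express that mode-aware filter is sketched below. The tier sets per mode follow the mode table earlier in this write-up; the function names `tiers_for_mode` and `should_heal` are hypothetical and not taken from dark-heal.sh.

```bash
#!/usr/bin/env bash
# Tiers that are active in each operational mode
tiers_for_mode() {
    case "$1" in
        work)  echo "core work" ;;
        media) echo "core work media" ;;
        full)  echo "core work media extra" ;;
        *)     return 1 ;;   # unknown mode
    esac
}

# Succeeds (exit status 0) when a container of <tier> should be healed in <mode>
should_heal() {
    local tier="$1" mode="$2" active t
    active=$(tiers_for_mode "$mode") || return 1
    for t in $active; do
        [ "$t" = "$tier" ] && return 0
    done
    return 1
}
```

A healing loop could then guard each restart with `should_heal "$TIER" "$MODE" || continue`, so a stopped media container in work mode is treated as expected rather than as a failure.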


Security Hardening

Network Perimeter

The system implements a defense-in-depth approach with multiple security layers:

| Layer | Technology | Configuration |
| --- | --- | --- |
| Public Access | Cloudflare Tunnel | Zero exposed ports, tunnel-to-container |
| Reverse Proxy | Caddy | Automatic HTTPS, access logging |
| IDS | CrowdSec | Community threats, auto-bans |
| Firewall | UFW | Default DENY, only 80/443/22 |
| VPN | Tailscale | Encrypted mesh, 100.x.x.x |
| Intrusion | Fail2ban | SSH, Gitea, Jellyfin jails |

Host Hardening

| Setting | Value | File |
| --- | --- | --- |
| Root login | Disabled | sshd_config.d/hardening.conf |
| Password auth | Disabled | sshd_config.d/hardening.conf |
| SSH key type | Ed25519 only | authorized_keys |
| Docker socket | User in docker group | /var/run/docker.sock |
| Samba shares | Authenticated only | smb.conf |

Audit Schedule

The automated audit runs 20 modular checks via cron:

| Check | Scope |
| --- | --- |
| 01-03 | System, Docker, Network |
| 04-06 | Security, Telegram, Vulnerabilities |
| 07-10 | Backup, Certificates, Docker Security |
| 11-16 | Resource leaks, Mode consistency, HDD health |
| 17-20 | Git config, Cron integrity, Firewall, Overall score |

Deployment

Quick Start

bash
# Clone and configure
git clone https://github.com/bhargav-pratim-sarma/Darknebula-Homelab.git
cd Darknebula-Homelab/homelab-installer

# Run installer
chmod +x install.sh
sudo ./install.sh

# Access services
dark status              # System health
dark power              # Live power data

Automated Installer

The install.sh script automates the complete setup in ~10 minutes:

  1. Network configuration (static IP)
  2. Docker engine installation
  3. Modern CLI tools (Starship, Lazygit, Lazydocker)
  4. Tailscale VPN with zero-touch auth
  5. 40+ Docker containers deployment
  6. Samba file shares configuration
  7. Systemd services (power monitor, bot, boot notification)
  8. ZSH with Oh-My-Zsh and plugins

Configuration

The installer supports non-interactive mode via install.conf:

bash
# install.conf
INSTALL_ANTIGRAVITY=true
INSTALL_DESKTOP=false
INITIAL_MODE=full
TAILSCALE_AUTH_KEY=your_key_here
GITHUB_PAT=your_pat_here

Roadmap

  • Phase 1 — Core Infrastructure (Complete) - Docker, Caddy, Portainer, basic monitoring
  • Phase 2 — Service Catalog (Complete) - 40+ containers across all categories
  • Phase 3 — Automation Scripts (Complete) - 64 scripts for all operations
  • Phase 4 — Telegram Bot (Complete) - 44 commands for remote control
  • Phase 5 — Discord Bot (Complete) - 49 slash commands with Pulse, Trends
  • Phase 6 — Power Monitoring (Complete) - Intel RAPL via Scaphandre, INR costing
  • Phase 7 — Self-Healing (Complete) - 11-phase autonomous recovery
  • Phase 8 — Mode Switching (Complete) - Work/Media/Full resource optimization
  • Phase 9 — IPv6 Support (In Progress) - Full IPv6 dual-stack
  • Phase 10 — Kubernetes Migration (Planned) - K3s for container orchestration

Conclusion

DarkNebula represents a production-grade homelab infrastructure that achieves near-autonomous operation through comprehensive automation. The 40+ services are managed through a metadata-driven system that eliminates maintenance overhead while providing multiple control interfaces (CLI, Telegram, Discord) for remote administration.

The Intel RAPL power monitoring provides hardware-level accuracy at 15-second resolution, enabling precise electricity cost tracking in Grafana. The self-healing agent automatically detects and recovers from 11 different failure modes, with exponential backoff preventing restart loops on chronically failing services.

The mode-based resource management enables running the infrastructure on consumer hardware by selectively activating service tiers based on current needs. The work mode consumes ~800MB RAM while full mode enables the complete automation stack at ~5GB.

Security hardening through CrowdSec, Fail2ban, UFW, and Cloudflare Tunnel provides enterprise-grade protection while maintaining accessibility through the public domain. The centralized configuration through global.env with symlink propagation ensures consistency across all services without manual synchronization.
