feat(voice): add persona-driven tts and qq local media upload

This commit is contained in:
Hua
2026-03-23 11:10:27 +08:00
parent b1a08f3bb9
commit d838a12b56
9 changed files with 882 additions and 96 deletions

View File

@@ -33,10 +33,13 @@ Do not commit real API keys, tokens, chat logs, or workspace data. Keep local se
- `/skill` currently supports `search`, `install`, `uninstall`, `list`, and `update`. Keep subcommand dispatch in `nanobot/agent/loop.py`.
- `/mcp` supports the default `list` behavior (and explicit `/mcp list`) to show configured MCP servers and registered MCP tools.
- `/status` should return plain-text runtime info for the active session and stay wired into `/help` plus Telegram's command menu/localization coverage.
- Agent runtime config should be hot-reloaded from the active `config.json` for safe in-process fields such as `tools.mcpServers`, `tools.web.*`, `tools.exec.*`, `tools.restrictToWorkspace`, `agents.defaults.model`, `agents.defaults.maxToolIterations`, `agents.defaults.contextWindowTokens`, `agents.defaults.maxTokens`, `agents.defaults.temperature`, `agents.defaults.reasoningEffort`, `channels.sendProgress`, `channels.sendToolHints`, and `channels.voiceReply.*`. Channel connection settings and provider credentials still require a restart.
- nanobot does not expose local files over HTTP. If a feature needs a public URL for local files, provide your own static file server and point config such as `mediaBaseUrl` at it.
- Generated screenshots, downloads, and other temporary user-delivery artifacts should be written under `workspace/out`, not the workspace root. Treat that as the generic delivery-artifact root for tools, MCP servers, and skills.
- QQ outbound messaging can send remote rich-media URLs directly. For local QQ media under `workspace/out`, use direct `file_data` upload only; do not rely on URL fallback for local files. Supported local QQ rich media are images, `.mp4` video, and `.silk` voice.
- `channels.voiceReply` currently adds TTS attachments on supported outbound channels such as Telegram, and QQ when the configured TTS endpoint returns `silk`. Preserve plain-text fallback when QQ voice requirements are not met.
- Voice replies should follow the active session persona. Build TTS style instructions from the resolved persona's prompt files, and allow optional persona-local overrides from `VOICE.json` under the persona workspace (`<workspace>/VOICE.json` for default, `<workspace>/personas/<name>/VOICE.json` for custom personas).
- `channels.voiceReply.url` may override the TTS endpoint independently of the chat model provider. When omitted, fall back to the active conversation provider URL. Keep `apiBase` accepted as a compatibility alias.
- `/skill` shells out to `npx clawhub@latest`; it requires Node.js/`npx` at runtime.
- `/skill uninstall` runs in a non-interactive context, so keep passing `--yes` when shelling out to ClawHub.
- Treat empty `/skill search` output as a user-visible "no results" case rather than a silent success. Surface npm/registry failures directly to the user.

View File

@@ -264,6 +264,57 @@ That's it! You have a working AI assistant in 2 minutes.
`baseUrl` can point either to the SearXNG root (for example `http://localhost:8080`) or directly to `/search`.
### Optional: Voice Replies
Enable `channels.voiceReply` when you want nanobot to attach a synthesized voice reply on
supported outbound channels such as Telegram. QQ voice replies are also supported when your TTS
endpoint can return `silk`.
```json
{
"channels": {
"voiceReply": {
"enabled": true,
"channels": ["telegram"],
"url": "https://your-tts-endpoint.example.com/v1",
"model": "gpt-4o-mini-tts",
"voice": "alloy",
"instructions": "keep the delivery calm and clear",
"speed": 1.0,
"responseFormat": "opus"
}
}
}
```
`voiceReply` currently adds a voice attachment while keeping the normal text reply. For QQ voice
delivery, use `responseFormat: "silk"`, because QQ local voice upload expects `.silk` files. If
`apiKey` and `apiBase` are omitted, nanobot falls back to the active provider credentials; in that
case the provider must expose an OpenAI-compatible TTS endpoint.
`voiceReply.url` is optional and can point either to a provider base URL such as
`https://api.openai.com/v1` or directly to an `/audio/speech` endpoint. If omitted, nanobot uses
the current conversation provider URL. `apiBase` remains supported as a legacy alias.
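As a rough sketch of how the two accepted `url` forms end up at the same speech endpoint (the function name here is illustrative, not nanobot's actual API):

```python
def resolve_speech_url(api_base: str) -> str:
    """Normalize a provider base URL or a direct endpoint URL to the final speech URL."""
    base = api_base.rstrip("/")
    # A URL already pointing at /audio/speech is used as-is.
    if base.endswith("/audio/speech"):
        return base
    # Otherwise treat it as an OpenAI-compatible base and append the speech route.
    return f"{base}/audio/speech"

print(resolve_speech_url("https://api.openai.com/v1"))
# → https://api.openai.com/v1/audio/speech
```

Either way, requests land on an OpenAI-compatible `/audio/speech` route.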
Voice replies automatically follow the active session persona. nanobot builds TTS style
instructions from that persona's `SOUL.md` and `USER.md`, so switching `/persona` changes both the
text response style and the generated speech style together.
If a specific persona needs a fixed voice or speaking pattern, add `VOICE.json` under the persona
workspace:
- Default persona: `<workspace>/VOICE.json`
- Custom persona: `<workspace>/personas/<name>/VOICE.json`
Example:
```json
{
"voice": "nova",
"instructions": "sound crisp, confident, and slightly faster than normal",
"speed": 1.15
}
```
## 💬 Chat Apps
Connect nanobot to your favorite chat platform. Want to build your own? See the [Channel Plugin Guide](./docs/CHANNEL_PLUGIN_GUIDE.md).
@@ -708,10 +759,10 @@ Uses **botpy SDK** with WebSocket — no public IP required. Currently supports
}
```
For local QQ media, nanobot uploads files directly with `file_data` from generated delivery
artifacts under `workspace/out`. Local uploads do not require `mediaBaseUrl`, and nanobot no longer
falls back to URL-based upload for local files. Supported local QQ rich media are images,
`.mp4` video, and `.silk` voice.
Multi-bot example:
@@ -1245,7 +1296,7 @@ Use `toolTimeout` to override the default 30s per-call timeout for slow servers:
```
MCP tools are automatically discovered and registered on startup. The LLM can use them alongside built-in tools — no extra configuration needed.
nanobot hot-reloads agent runtime config from the active `config.json` on the next message, including `tools.mcpServers`, `tools.web.*`, `tools.exec.*`, `tools.restrictToWorkspace`, `agents.defaults.model`, `agents.defaults.maxToolIterations`, `agents.defaults.contextWindowTokens`, `agents.defaults.maxTokens`, `agents.defaults.temperature`, `agents.defaults.reasoningEffort`, `channels.sendProgress`, `channels.sendToolHints`, and `channels.voiceReply.*`. Channel connection settings and provider credentials still require a restart.

View File

@@ -28,6 +28,7 @@ from nanobot.agent.i18n import (
    text,
)
from nanobot.agent.memory import MemoryConsolidator
from nanobot.agent.personas import build_persona_voice_instructions, load_persona_voice_settings
from nanobot.agent.skills import BUILTIN_SKILLS_DIR
from nanobot.agent.subagent import SubagentManager
from nanobot.agent.tools.cron import CronTool

@@ -40,8 +41,9 @@ from nanobot.agent.tools.web import WebFetchTool, WebSearchTool
from nanobot.bus.events import InboundMessage, OutboundMessage
from nanobot.bus.queue import MessageBus
from nanobot.providers.base import LLMProvider
from nanobot.providers.speech import OpenAISpeechProvider
from nanobot.session.manager import Session, SessionManager
from nanobot.utils.helpers import build_status_content, ensure_dir, safe_filename

if TYPE_CHECKING:
    from nanobot.config.schema import ChannelsConfig, ExecToolConfig

@@ -675,6 +677,137 @@ class AgentLoop:
                metadata={"render_as": "text"},
            )
    @staticmethod
    def _voice_reply_extension(response_format: str) -> str:
        """Map TTS response formats to delivery file extensions."""
        return {
            "opus": ".ogg",
            "mp3": ".mp3",
            "aac": ".aac",
            "flac": ".flac",
            "wav": ".wav",
            "pcm": ".pcm",
            "silk": ".silk",
        }.get(response_format, f".{response_format}")

    @staticmethod
    def _channel_base_name(channel: str) -> str:
        """Normalize multi-instance channel routes such as telegram/main."""
        return channel.split("/", 1)[0].lower()

    def _voice_reply_enabled_for_channel(self, channel: str) -> bool:
        """Return True when voice replies are enabled for the given channel."""
        cfg = getattr(self.channels_config, "voice_reply", None)
        if not cfg or not getattr(cfg, "enabled", False):
            return False
        route_name = channel.lower()
        base_name = self._channel_base_name(channel)
        enabled_channels = {
            name.lower() for name in getattr(cfg, "channels", []) if isinstance(name, str)
        }
        if route_name not in enabled_channels and base_name not in enabled_channels:
            return False
        if base_name == "qq":
            return getattr(cfg, "response_format", "opus") == "silk"
        return base_name in {"telegram", "qq"}

    def _voice_reply_profile(
        self,
        persona: str | None,
    ) -> tuple[str, str | None, float | None]:
        """Resolve voice, instructions, and speed for the active persona."""
        cfg = getattr(self.channels_config, "voice_reply", None)
        persona_voice = load_persona_voice_settings(self.workspace, persona)
        extra_instructions = [
            value.strip()
            for value in (
                getattr(cfg, "instructions", "") if cfg is not None else "",
                persona_voice.instructions or "",
            )
            if isinstance(value, str) and value.strip()
        ]
        instructions = build_persona_voice_instructions(
            self.workspace,
            persona,
            extra_instructions=" ".join(extra_instructions) if extra_instructions else None,
        )
        voice = persona_voice.voice or getattr(cfg, "voice", "alloy")
        speed = (
            persona_voice.speed
            if persona_voice.speed is not None
            else getattr(cfg, "speed", None) if cfg is not None else None
        )
        return voice, instructions, speed

    async def _maybe_attach_voice_reply(
        self,
        outbound: OutboundMessage | None,
        *,
        persona: str | None = None,
    ) -> OutboundMessage | None:
        """Optionally synthesize the final text reply into a voice attachment."""
        if (
            outbound is None
            or not outbound.content
            or not self._voice_reply_enabled_for_channel(outbound.channel)
        ):
            return outbound
        cfg = getattr(self.channels_config, "voice_reply", None)
        if cfg is None:
            return outbound
        api_key = (getattr(cfg, "api_key", "") or getattr(self.provider, "api_key", "") or "").strip()
        if not api_key:
            logger.warning(
                "Voice reply enabled for {}, but no TTS api_key is configured",
                outbound.channel,
            )
            return outbound
        api_base = (
            getattr(cfg, "api_base", "")
            or getattr(self.provider, "api_base", "")
            or "https://api.openai.com/v1"
        ).strip()
        response_format = getattr(cfg, "response_format", "opus")
        model = getattr(cfg, "model", "gpt-4o-mini-tts")
        voice, instructions, speed = self._voice_reply_profile(persona)
        media_dir = ensure_dir(self.workspace / "out" / "voice")
        filename = safe_filename(
            f"{outbound.channel}_{outbound.chat_id}_{int(time.time() * 1000)}"
        ) + self._voice_reply_extension(response_format)
        output_path = media_dir / filename
        try:
            provider = OpenAISpeechProvider(api_key=api_key, api_base=api_base)
            await provider.synthesize_to_file(
                outbound.content,
                model=model,
                voice=voice,
                instructions=instructions,
                speed=speed,
                response_format=response_format,
                output_path=output_path,
            )
        except Exception:
            logger.exception(
                "Failed to synthesize voice reply for {}:{}",
                outbound.channel,
                outbound.chat_id,
            )
            return outbound
        return OutboundMessage(
            channel=outbound.channel,
            chat_id=outbound.chat_id,
            content=outbound.content,
            reply_to=outbound.reply_to,
            media=[*(outbound.media or []), str(output_path)],
            metadata=dict(outbound.metadata or {}),
        )
    async def _run_agent_loop(
        self,
        initial_messages: list[dict],

@@ -1072,8 +1205,14 @@ class AgentLoop:
            self._save_turn(session, all_msgs, 1 + len(history))
            self.sessions.save(session)
            self._ensure_background_token_consolidation(session)
            return await self._maybe_attach_voice_reply(
                OutboundMessage(
                    channel=channel,
                    chat_id=chat_id,
                    content=final_content or "Background task completed.",
                ),
                persona=persona,
            )

        preview = msg.content[:80] + "..." if len(msg.content) > 80 else msg.content
        logger.info("Processing message from {}:{}: {}", msg.channel, msg.sender_id, preview)

@@ -1156,9 +1295,14 @@ class AgentLoop:
        preview = final_content[:120] + "..." if len(final_content) > 120 else final_content
        logger.info("Response to {}:{}: {}", msg.channel, msg.sender_id, preview)
        return await self._maybe_attach_voice_reply(
            OutboundMessage(
                channel=msg.channel,
                chat_id=msg.chat_id,
                content=final_content,
                metadata=msg.metadata or {},
            ),
            persona=persona,
        )

    @staticmethod

View File

@@ -2,12 +2,29 @@
from __future__ import annotations

import json
import re
from dataclasses import dataclass
from pathlib import Path

from loguru import logger

DEFAULT_PERSONA = "default"
PERSONAS_DIRNAME = "personas"
PERSONA_VOICE_FILENAME = "VOICE.json"

_VALID_PERSONA_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_-]{0,63}$")
_VOICE_MARKDOWN_RE = re.compile(r"(```[\s\S]*?```|`[^`]*`|!\[[^\]]*\]\([^)]+\)|[#>*_~-]+)")
_VOICE_WHITESPACE_RE = re.compile(r"\s+")
_VOICE_MAX_GUIDANCE_CHARS = 1200


@dataclass(frozen=True)
class PersonaVoiceSettings:
    """Optional persona-level voice synthesis overrides."""

    voice: str | None = None
    instructions: str | None = None
    speed: float | None = None


def normalize_persona_name(name: str | None) -> str | None:

@@ -64,3 +81,88 @@ def persona_workspace(workspace: Path, persona: str | None) -> Path:
    if resolved in (None, DEFAULT_PERSONA):
        return workspace
    return personas_root(workspace) / resolved
def load_persona_voice_settings(workspace: Path, persona: str | None) -> PersonaVoiceSettings:
    """Load optional persona voice overrides from VOICE.json."""
    path = persona_workspace(workspace, persona) / PERSONA_VOICE_FILENAME
    if not path.exists():
        return PersonaVoiceSettings()
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        logger.warning("Failed to load persona voice config {}: {}", path, exc)
        return PersonaVoiceSettings()
    if not isinstance(data, dict):
        logger.warning("Ignoring persona voice config {} because it is not a JSON object", path)
        return PersonaVoiceSettings()
    voice = data.get("voice")
    if isinstance(voice, str):
        voice = voice.strip() or None
    else:
        voice = None
    instructions = data.get("instructions")
    if isinstance(instructions, str):
        instructions = instructions.strip() or None
    else:
        instructions = None
    speed = data.get("speed")
    if isinstance(speed, (int, float)):
        speed = float(speed)
        if not 0.25 <= speed <= 4.0:
            logger.warning(
                "Ignoring persona voice speed from {} because it is outside 0.25-4.0",
                path,
            )
            speed = None
    else:
        speed = None
    return PersonaVoiceSettings(voice=voice, instructions=instructions, speed=speed)


def build_persona_voice_instructions(
    workspace: Path,
    persona: str | None,
    *,
    extra_instructions: str | None = None,
) -> str:
    """Build voice-style instructions from the active persona prompt files."""
    resolved = resolve_persona_name(workspace, persona) or DEFAULT_PERSONA
    persona_dir = None if resolved == DEFAULT_PERSONA else personas_root(workspace) / resolved
    guidance_parts: list[str] = []
    for filename in ("SOUL.md", "USER.md"):
        file_path = workspace / filename
        if persona_dir:
            persona_file = persona_dir / filename
            if persona_file.exists():
                file_path = persona_file
        if not file_path.exists():
            continue
        try:
            raw = file_path.read_text(encoding="utf-8")
        except OSError as exc:
            logger.warning("Failed to read persona voice source {}: {}", file_path, exc)
            continue
        clean = _VOICE_WHITESPACE_RE.sub(" ", _VOICE_MARKDOWN_RE.sub(" ", raw)).strip()
        if clean:
            guidance_parts.append(clean)
    guidance = " ".join(guidance_parts).strip()
    if len(guidance) > _VOICE_MAX_GUIDANCE_CHARS:
        guidance = guidance[:_VOICE_MAX_GUIDANCE_CHARS].rstrip()
    segments = [
        f"Speak as the active persona '{resolved}'. Match that persona's tone, attitude, pacing, and emotional style while keeping the reply natural and conversational.",
    ]
    if extra_instructions:
        segments.append(extra_instructions.strip())
    if guidance:
        segments.append(f"Persona guidance: {guidance}")
    return " ".join(segment for segment in segments if segment)

View File

@@ -5,6 +5,7 @@ import base64
from collections import deque
from pathlib import Path
from typing import TYPE_CHECKING
from urllib.parse import urlparse

from loguru import logger

@@ -13,7 +14,7 @@ from nanobot.bus.queue import MessageBus
from nanobot.channels.base import BaseChannel
from nanobot.config.schema import QQConfig, QQInstanceConfig
from nanobot.security.network import validate_url_target
from nanobot.utils.delivery import delivery_artifacts_root, is_image_file

try:
    import botpy

@@ -97,17 +98,50 @@ class QQChannel(BaseChannel):
        """Return the active workspace root used by QQ publishing."""
        return (self._workspace or Path.cwd()).resolve(strict=False)
    def _resolve_local_media(
        self,
        media_path: str,
    ) -> tuple[Path | None, int | None, str | None]:
        """Resolve a local delivery artifact and infer the QQ rich-media file type."""
        source = Path(media_path).expanduser()
        try:
            resolved = source.resolve(strict=True)
        except FileNotFoundError:
            return None, None, "local file not found"
        except OSError as e:
            logger.warning("Failed to resolve local QQ media path {}: {}", media_path, e)
            return None, None, "local file unavailable"
        if not resolved.is_file():
            return None, None, "local file not found"
        artifacts_root = delivery_artifacts_root(self._workspace_root())
        try:
            resolved.relative_to(artifacts_root)
        except ValueError:
            return None, None, f"local delivery media must stay under {artifacts_root}"
        suffix = resolved.suffix.lower()
        if is_image_file(resolved):
            return resolved, 1, None
        if suffix == ".mp4":
            return resolved, 2, None
        if suffix == ".silk":
            return resolved, 3, None
        return None, None, "local delivery media must be an image, .mp4 video, or .silk voice"

    @staticmethod
    def _remote_media_file_type(media_url: str) -> int | None:
        """Infer a QQ rich-media file type from a remote URL."""
        path = urlparse(media_url).path.lower()
        if path.endswith(".mp4"):
            return 2
        if path.endswith(".silk"):
            return 3
        image_exts = (".jpg", ".jpeg", ".png", ".gif", ".webp")
        if path.endswith(image_exts):
            return 1
        return None
    def _next_msg_seq(self) -> int:
        """Return the next QQ message sequence number."""

@@ -136,15 +170,16 @@ class QQChannel(BaseChannel):
        self,
        chat_id: str,
        msg_type: str,
        file_type: int,
        media_url: str,
        content: str | None,
        msg_id: str | None,
    ) -> None:
        """Send one QQ remote rich-media URL as a rich-media message."""
        if msg_type == "group":
            media = await self._client.api.post_group_file(
                group_openid=chat_id,
                file_type=file_type,
                url=media_url,
                srv_send_msg=False,
            )

@@ -159,7 +194,7 @@
        else:
            media = await self._client.api.post_c2c_file(
                openid=chat_id,
                file_type=file_type,
                url=media_url,
                srv_send_msg=False,
            )

@@ -176,22 +211,20 @@
        self,
        chat_id: str,
        msg_type: str,
        file_type: int,
        local_path: Path,
        content: str | None,
        msg_id: str | None,
    ) -> None:
        """Upload a local QQ rich-media file using file_data."""
        if not self._client or Route is None:
            raise RuntimeError("QQ client not initialized")
        payload = {
            "file_type": file_type,
            "file_data": self._encode_file_data(local_path),
            "srv_send_msg": False,
        }
        if msg_type == "group":
            route = Route("POST", "/v2/groups/{group_openid}/files", group_openid=chat_id)
            media = await self._client.api._http.request(route, json=payload)
@@ -265,15 +298,13 @@ class QQChannel(BaseChannel):
        fallback_lines: list[str] = []

        for media_path in msg.media:
            local_media_path: Path | None = None
            local_file_type: int | None = None
            if not self._is_remote_media(media_path):
                local_media_path, local_file_type, publish_error = self._resolve_local_media(media_path)
                if local_media_path is None:
                    logger.warning(
                        "QQ outbound local media could not be uploaded directly: {} ({})",
                        media_path,
                        publish_error,
                    )

@@ -281,65 +312,51 @@
                    fallback_lines.append(
                        self._failed_media_notice(media_path, publish_error)
                    )
                    continue
            else:
                ok, error = validate_url_target(media_path)
                if not ok:
                    logger.warning("QQ outbound media blocked by URL validation: {}", error)
                    fallback_lines.append(self._failed_media_notice(media_path, error))
                    continue
                remote_file_type = self._remote_media_file_type(media_path)
                if remote_file_type is None:
                    fallback_lines.append(
                        self._failed_media_notice(
                            media_path,
                            "remote QQ media must be an image URL, .mp4 video, or .silk voice",
                        )
                    )
                    continue
            try:
                if local_media_path is not None:
                    await self._post_local_media_message(
                        msg.chat_id,
                        msg_type,
                        local_file_type or 1,
                        local_media_path.resolve(strict=True),
                        msg.content if msg.content and not content_sent else None,
                        msg_id,
                    )
                else:
                    await self._post_remote_media_message(
                        msg.chat_id,
                        msg_type,
                        remote_file_type,
                        media_path,
                        msg.content if msg.content and not content_sent else None,
                        msg_id,
                    )
                if msg.content and not content_sent:
                    content_sent = True
            except Exception as media_error:
                logger.error("Error sending QQ media {}: {}", media_path, media_error)
                if local_media_path is not None:
                    fallback_lines.append(
                        self._failed_media_notice(media_path, "QQ local file_data upload failed")
                    )
                else:
                    fallback_lines.append(self._failed_media_notice(media_path))

        text_parts: list[str] = []
        if msg.content and not content_sent:

View File

@@ -3,7 +3,7 @@
from pathlib import Path from pathlib import Path
from typing import Any, Literal from typing import Any, Literal
from pydantic import BaseModel, ConfigDict, Field, ValidationInfo, field_validator from pydantic import AliasChoices, BaseModel, ConfigDict, Field, ValidationInfo, field_validator
from pydantic.alias_generators import to_camel from pydantic.alias_generators import to_camel
from pydantic_settings import BaseSettings from pydantic_settings import BaseSettings
@@ -13,6 +13,7 @@ class Base(BaseModel):
model_config = ConfigDict(alias_generator=to_camel, populate_by_name=True) model_config = ConfigDict(alias_generator=to_camel, populate_by_name=True)
class WhatsAppConfig(Base): class WhatsAppConfig(Base):
"""WhatsApp channel configuration.""" """WhatsApp channel configuration."""
@@ -356,6 +357,20 @@ class WecomMultiConfig(Base):
instances: list[WecomInstanceConfig] = Field(default_factory=list) instances: list[WecomInstanceConfig] = Field(default_factory=list)
class VoiceReplyConfig(Base):
"""Optional text-to-speech replies for supported outbound channels."""
enabled: bool = False
channels: list[str] = Field(default_factory=lambda: ["telegram"])
model: str = "gpt-4o-mini-tts"
voice: str = "alloy"
instructions: str = ""
speed: float | None = None
response_format: Literal["mp3", "opus", "aac", "flac", "wav", "pcm", "silk"] = "opus"
api_key: str = ""
api_base: str = Field(default="", validation_alias=AliasChoices("apiBase", "url"))
def _coerce_multi_channel_config( def _coerce_multi_channel_config(
value: Any, value: Any,
single_cls: type[BaseModel], single_cls: type[BaseModel],
@@ -369,11 +384,14 @@ def _coerce_multi_channel_config(
if isinstance(value, dict) and "instances" in value: if isinstance(value, dict) and "instances" in value:
return multi_cls.model_validate(value) return multi_cls.model_validate(value)
return single_cls.model_validate(value) return single_cls.model_validate(value)
class ChannelsConfig(Base): class ChannelsConfig(Base):
"""Configuration for chat channels.""" """Configuration for chat channels."""
send_progress: bool = True # stream agent's text progress to the channel send_progress: bool = True # stream agent's text progress to the channel
send_tool_hints: bool = False # stream tool-call hints (e.g. read_file("…")) send_tool_hints: bool = False # stream tool-call hints (e.g. read_file("…"))
voice_reply: VoiceReplyConfig = Field(default_factory=VoiceReplyConfig)
whatsapp: WhatsAppConfig | WhatsAppMultiConfig = Field(default_factory=WhatsAppConfig) whatsapp: WhatsAppConfig | WhatsAppMultiConfig = Field(default_factory=WhatsAppConfig)
telegram: TelegramConfig | TelegramMultiConfig = Field(default_factory=TelegramConfig) telegram: TelegramConfig | TelegramMultiConfig = Field(default_factory=TelegramConfig)
discord: DiscordConfig | DiscordMultiConfig = Field(default_factory=DiscordConfig) discord: DiscordConfig | DiscordMultiConfig = Field(default_factory=DiscordConfig)
@@ -515,6 +533,7 @@ class ExecToolConfig(Base):
timeout: int = 60 timeout: int = 60
path_append: str = "" path_append: str = ""
class MCPServerConfig(Base): class MCPServerConfig(Base):
"""MCP server connection configuration (stdio or HTTP).""" """MCP server connection configuration (stdio or HTTP)."""


@@ -0,0 +1,88 @@
+"""OpenAI-compatible text-to-speech provider."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import httpx
+
+
+class OpenAISpeechProvider:
+    """Minimal OpenAI-compatible TTS client."""
+
+    _NO_INSTRUCTIONS_MODELS = {"tts-1", "tts-1-hd"}
+
+    def __init__(self, api_key: str, api_base: str = "https://api.openai.com/v1"):
+        self.api_key = api_key
+        self.api_base = api_base.rstrip("/")
+
+    def _speech_url(self) -> str:
+        """Return the final speech endpoint URL from a base URL or direct endpoint URL."""
+        if self.api_base.endswith("/audio/speech"):
+            return self.api_base
+        return f"{self.api_base}/audio/speech"
+
+    @classmethod
+    def _supports_instructions(cls, model: str) -> bool:
+        """Return True when the target TTS model accepts style instructions."""
+        return model not in cls._NO_INSTRUCTIONS_MODELS
+
+    async def synthesize(
+        self,
+        text: str,
+        *,
+        model: str,
+        voice: str,
+        instructions: str | None = None,
+        speed: float | None = None,
+        response_format: str,
+    ) -> bytes:
+        """Synthesize text into audio bytes."""
+        headers = {
+            "Authorization": f"Bearer {self.api_key}",
+            "Content-Type": "application/json",
+        }
+        payload = {
+            "model": model,
+            "voice": voice,
+            "input": text,
+            "response_format": response_format,
+        }
+        if instructions and self._supports_instructions(model):
+            payload["instructions"] = instructions
+        if speed is not None:
+            payload["speed"] = speed
+        async with httpx.AsyncClient(timeout=60.0) as client:
+            response = await client.post(
+                self._speech_url(),
+                headers=headers,
+                json=payload,
+            )
+            response.raise_for_status()
+            return response.content
+
+    async def synthesize_to_file(
+        self,
+        text: str,
+        *,
+        model: str,
+        voice: str,
+        instructions: str | None = None,
+        speed: float | None = None,
+        response_format: str,
+        output_path: str | Path,
+    ) -> Path:
+        """Synthesize text and write the audio payload to disk."""
+        path = Path(output_path)
+        path.parent.mkdir(parents=True, exist_ok=True)
+        path.write_bytes(
+            await self.synthesize(
+                text,
+                model=model,
+                voice=voice,
+                instructions=instructions,
+                speed=speed,
+                response_format=response_format,
+            )
+        )
+        return path


@@ -276,7 +276,6 @@ async def test_send_local_media_under_out_dir_uses_c2c_file_api(
             "params": {"openid": "user123"},
             "json": {
                 "file_type": 1,
-                "url": "https://files.example.com/out/demo.png",
                 "file_data": b64encode(b"\x89PNG\r\n\x1a\nfake-png").decode("ascii"),
                 "srv_send_msg": False,
             },
@@ -338,7 +337,6 @@ async def test_send_local_media_in_nested_out_path_uses_relative_url(
             "params": {"openid": "user123"},
             "json": {
                 "file_type": 1,
-                "url": "https://files.example.com/qq-media/shots/github.png",
                 "file_data": b64encode(b"\x89PNG\r\n\x1a\nfake-png").decode("ascii"),
                 "srv_send_msg": False,
             },
@@ -408,8 +406,7 @@
 @pytest.mark.asyncio
-async def test_send_local_media_falls_back_to_url_only_upload_when_file_data_upload_fails(
-    monkeypatch,
+async def test_send_local_media_with_media_base_url_still_falls_back_to_text_notice_when_file_data_upload_fails(
     tmp_path,
 ) -> None:
     workspace = tmp_path / "workspace"
@@ -431,7 +428,6 @@ async def test_send_local_media_falls_back_to_url_only_upload_when_file_data_upl
     )
     channel._client = _FakeClient()
     channel._client.api.raise_on_raw_file_upload = True
-    monkeypatch.setattr("nanobot.channels.qq.validate_url_target", lambda url: (True, ""))

     await channel.send(
         OutboundMessage(
@@ -443,20 +439,12 @@
         )
     )

-    assert channel._client.api.c2c_file_calls == [
-        {
-            "openid": "user123",
-            "file_type": 1,
-            "url": "https://files.example.com/out/demo.png",
-            "srv_send_msg": False,
-        }
-    ]
+    assert channel._client.api.c2c_file_calls == []
     assert channel._client.api.c2c_calls == [
         {
             "openid": "user123",
-            "msg_type": 7,
-            "content": "hello",
-            "media": {"file_info": "c2c-file-info", "file_uuid": "c2c-file", "ttl": 60},
+            "msg_type": 0,
+            "content": "hello\n[Failed to send: demo.png - QQ local file_data upload failed]",
             "msg_id": "msg1",
             "msg_seq": 2,
         }
@@ -596,7 +584,60 @@ async def test_send_non_image_media_from_out_falls_back_to_text_notice(
         {
             "openid": "user123",
             "msg_type": 0,
-            "content": "hello\n[Failed to send: note.txt - local delivery media must be an image]",
+            "content": (
+                "hello\n[Failed to send: note.txt - local delivery media must be an image, .mp4 video, "
+                "or .silk voice]"
+            ),
+            "msg_id": "msg1",
+            "msg_seq": 2,
+        }
+    ]
+
+
+@pytest.mark.asyncio
+async def test_send_local_silk_voice_uses_file_type_three_direct_upload(tmp_path) -> None:
+    workspace = tmp_path / "workspace"
+    workspace.mkdir()
+    out_dir = workspace / "out"
+    out_dir.mkdir()
+    source = out_dir / "reply.silk"
+    source.write_bytes(b"fake-silk")
+
+    channel = QQChannel(
+        QQConfig(app_id="app", secret="secret", allow_from=["*"]),
+        MessageBus(),
+        workspace=workspace,
+    )
+    channel._client = _FakeClient()
+
+    await channel.send(
+        OutboundMessage(
+            channel="qq",
+            chat_id="user123",
+            content="hello",
+            media=[str(source)],
+            metadata={"message_id": "msg1"},
+        )
+    )
+
+    assert channel._client.api.raw_file_upload_calls == [
+        {
+            "method": "POST",
+            "path": "/v2/users/{openid}/files",
+            "params": {"openid": "user123"},
+            "json": {
+                "file_type": 3,
+                "file_data": b64encode(b"fake-silk").decode("ascii"),
+                "srv_send_msg": False,
+            },
+        }
+    ]
+    assert channel._client.api.c2c_calls == [
+        {
+            "openid": "user123",
+            "msg_type": 7,
+            "content": "hello",
+            "media": {"file_info": "c2c-file-info", "file_uuid": "c2c-file", "ttl": 60},
             "msg_id": "msg1",
             "msg_seq": 2,
         }

tests/test_voice_reply.py Normal file (+321 lines)

@@ -0,0 +1,321 @@
+"""Tests for optional outbound voice replies."""
+
+from __future__ import annotations
+
+from pathlib import Path
+from unittest.mock import AsyncMock, MagicMock, patch
+
+import pytest
+
+from nanobot.bus.events import InboundMessage
+from nanobot.config.schema import Config
+from nanobot.providers.base import LLMResponse
+from nanobot.providers.speech import OpenAISpeechProvider
+
+
+def _make_loop(workspace: Path, *, channels_payload: dict | None = None):
+    """Create an AgentLoop with lightweight mocks and configurable channels."""
+    from nanobot.agent.loop import AgentLoop
+    from nanobot.bus.queue import MessageBus
+
+    bus = MessageBus()
+    provider = MagicMock()
+    provider.get_default_model.return_value = "test-model"
+    provider.chat_with_retry = AsyncMock(return_value=LLMResponse(content="hello", tool_calls=[]))
+    provider.api_key = ""
+    provider.api_base = None
+    config = Config.model_validate({"channels": channels_payload or {}})
+    with patch("nanobot.agent.loop.SubagentManager"):
+        loop = AgentLoop(
+            bus=bus,
+            provider=provider,
+            workspace=workspace,
+            channels_config=config.channels,
+        )
+    return loop, provider
+
+
+def test_voice_reply_config_parses_camel_case() -> None:
+    config = Config.model_validate(
+        {
+            "channels": {
+                "voiceReply": {
+                    "enabled": True,
+                    "channels": ["telegram/main"],
+                    "model": "gpt-4o-mini-tts",
+                    "voice": "alloy",
+                    "instructions": "sound calm",
+                    "speed": 1.1,
+                    "responseFormat": "mp3",
+                    "apiKey": "tts-key",
+                    "url": "https://tts.example.com/v1",
+                }
+            }
+        }
+    )
+
+    voice_reply = config.channels.voice_reply
+    assert voice_reply.enabled is True
+    assert voice_reply.channels == ["telegram/main"]
+    assert voice_reply.instructions == "sound calm"
+    assert voice_reply.speed == 1.1
+    assert voice_reply.response_format == "mp3"
+    assert voice_reply.api_key == "tts-key"
+    assert voice_reply.api_base == "https://tts.example.com/v1"
+
+
+def test_openai_speech_provider_accepts_direct_endpoint_url() -> None:
+    provider = OpenAISpeechProvider(
+        api_key="tts-key",
+        api_base="https://tts.example.com/v1/audio/speech",
+    )
+
+    assert provider._speech_url() == "https://tts.example.com/v1/audio/speech"
+
+
+@pytest.mark.asyncio
+async def test_telegram_voice_reply_attaches_audio_for_multi_instance_route(
+    tmp_path: Path,
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    (tmp_path / "SOUL.md").write_text("default soul voice", encoding="utf-8")
+    loop, provider = _make_loop(
+        tmp_path,
+        channels_payload={
+            "voiceReply": {
+                "enabled": True,
+                "channels": ["telegram"],
+                "instructions": "keep the delivery warm",
+                "speed": 1.05,
+                "responseFormat": "opus",
+            }
+        },
+    )
+    provider.api_key = "provider-tts-key"
+    provider.api_base = "https://provider.example.com/v1"
+
+    captured: dict[str, str | float | None] = {}
+
+    async def fake_synthesize_to_file(
+        self,
+        text: str,
+        *,
+        model: str,
+        voice: str,
+        instructions: str | None,
+        speed: float | None,
+        response_format: str,
+        output_path: str | Path,
+    ) -> Path:
+        path = Path(output_path)
+        path.write_bytes(b"voice-bytes")
+        captured["api_key"] = self.api_key
+        captured["api_base"] = self.api_base
+        captured["text"] = text
+        captured["model"] = model
+        captured["voice"] = voice
+        captured["instructions"] = instructions
+        captured["speed"] = speed
+        captured["response_format"] = response_format
+        return path
+
+    monkeypatch.setattr(OpenAISpeechProvider, "synthesize_to_file", fake_synthesize_to_file)
+
+    response = await loop._process_message(
+        InboundMessage(
+            channel="telegram/main",
+            sender_id="user-1",
+            chat_id="chat-1",
+            content="hello",
+        )
+    )
+
+    assert response is not None
+    assert response.content == "hello"
+    assert len(response.media) == 1
+    media_path = Path(response.media[0])
+    assert media_path.parent == tmp_path / "out" / "voice"
+    assert media_path.suffix == ".ogg"
+    assert media_path.read_bytes() == b"voice-bytes"
+    assert captured == {
+        "api_key": "provider-tts-key",
+        "api_base": "https://provider.example.com/v1",
+        "text": "hello",
+        "model": "gpt-4o-mini-tts",
+        "voice": "alloy",
+        "instructions": (
+            "Speak as the active persona 'default'. Match that persona's tone, attitude, pacing, "
+            "and emotional style while keeping the reply natural and conversational. keep the "
+            "delivery warm Persona guidance: default soul voice"
+        ),
+        "speed": 1.05,
+        "response_format": "opus",
+    }
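The expected `instructions` string in the assertion above implies a fixed composition order: persona preamble, then the operator's configured instructions, then the persona's SOUL guidance. A standalone sketch consistent with that string (the function name is illustrative; the real loop may build it differently):

```python
def compose_voice_instructions(persona: str, extra: str, guidance: str) -> str:
    """Rebuild the persona-aware TTS instruction string asserted above."""
    base = (
        f"Speak as the active persona '{persona}'. Match that persona's tone, attitude, "
        "pacing, and emotional style while keeping the reply natural and conversational."
    )
    parts = [base]
    if extra:  # operator-configured instructions from voiceReply config
        parts.append(extra)
    if guidance:  # persona SOUL.md content
        parts.append(f"Persona guidance: {guidance}")
    return " ".join(parts)


print(compose_voice_instructions("default", "keep the delivery warm", "default soul voice"))
```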
+
+
+@pytest.mark.asyncio
+async def test_persona_voice_settings_override_global_voice_profile(
+    tmp_path: Path,
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    (tmp_path / "SOUL.md").write_text("default soul", encoding="utf-8")
+    persona_dir = tmp_path / "personas" / "coder"
+    persona_dir.mkdir(parents=True)
+    (persona_dir / "SOUL.md").write_text("speak like a sharp engineer", encoding="utf-8")
+    (persona_dir / "USER.md").write_text("be concise and technical", encoding="utf-8")
+    (persona_dir / "VOICE.json").write_text(
+        '{"voice":"nova","instructions":"use a crisp and confident delivery","speed":1.2}',
+        encoding="utf-8",
+    )
+    loop, provider = _make_loop(
+        tmp_path,
+        channels_payload={
+            "voiceReply": {
+                "enabled": True,
+                "channels": ["telegram"],
+                "voice": "alloy",
+                "instructions": "keep the pacing steady",
+            }
+        },
+    )
+    provider.api_key = "provider-tts-key"
+    session = loop.sessions.get_or_create("telegram:chat-1")
+    session.metadata["persona"] = "coder"
+    loop.sessions.save(session)
+
+    captured: dict[str, str | float | None] = {}
+
+    async def fake_synthesize_to_file(
+        self,
+        text: str,
+        *,
+        model: str,
+        voice: str,
+        instructions: str | None,
+        speed: float | None,
+        response_format: str,
+        output_path: str | Path,
+    ) -> Path:
+        path = Path(output_path)
+        path.write_bytes(b"voice-bytes")
+        captured["voice"] = voice
+        captured["instructions"] = instructions
+        captured["speed"] = speed
+        return path
+
+    monkeypatch.setattr(OpenAISpeechProvider, "synthesize_to_file", fake_synthesize_to_file)
+
+    response = await loop._process_message(
+        InboundMessage(
+            channel="telegram",
+            sender_id="user-1",
+            chat_id="chat-1",
+            content="hello",
+        )
+    )
+
+    assert response is not None
+    assert len(response.media) == 1
+    assert captured["voice"] == "nova"
+    assert captured["speed"] == 1.2
+    assert isinstance(captured["instructions"], str)
+    assert "active persona 'coder'" in captured["instructions"]
+    assert "keep the pacing steady" in captured["instructions"]
+    assert "use a crisp and confident delivery" in captured["instructions"]
+    assert "speak like a sharp engineer" in captured["instructions"]
+    assert "be concise and technical" in captured["instructions"]
+
+
+@pytest.mark.asyncio
+async def test_qq_voice_reply_config_keeps_text_only(
+    tmp_path: Path,
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    loop, provider = _make_loop(
+        tmp_path,
+        channels_payload={
+            "voiceReply": {
+                "enabled": True,
+                "channels": ["qq"],
+                "apiKey": "tts-key",
+            }
+        },
+    )
+    provider.api_key = "provider-tts-key"
+    synthesize = AsyncMock()
+    monkeypatch.setattr(OpenAISpeechProvider, "synthesize_to_file", synthesize)
+
+    response = await loop._process_message(
+        InboundMessage(
+            channel="qq",
+            sender_id="user-1",
+            chat_id="chat-1",
+            content="hello",
+        )
+    )
+
+    assert response is not None
+    assert response.content == "hello"
+    assert response.media == []
+    synthesize.assert_not_awaited()
+
+
+@pytest.mark.asyncio
+async def test_qq_voice_reply_uses_silk_when_configured(
+    tmp_path: Path,
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    loop, provider = _make_loop(
+        tmp_path,
+        channels_payload={
+            "voiceReply": {
+                "enabled": True,
+                "channels": ["qq"],
+                "apiKey": "tts-key",
+                "responseFormat": "silk",
+            }
+        },
+    )
+    provider.api_key = "provider-tts-key"
+
+    captured: dict[str, str | None] = {}
+
+    async def fake_synthesize_to_file(
+        self,
+        text: str,
+        *,
+        model: str,
+        voice: str,
+        instructions: str | None,
+        speed: float | None,
+        response_format: str,
+        output_path: str | Path,
+    ) -> Path:
+        path = Path(output_path)
+        path.write_bytes(b"fake-silk")
+        captured["response_format"] = response_format
+        return path
+
+    monkeypatch.setattr(OpenAISpeechProvider, "synthesize_to_file", fake_synthesize_to_file)
+
+    response = await loop._process_message(
+        InboundMessage(
+            channel="qq",
+            sender_id="user-1",
+            chat_id="chat-1",
+            content="hello",
+        )
+    )
+
+    assert response is not None
+    assert response.content == "hello"
+    assert len(response.media) == 1
+    assert Path(response.media[0]).suffix == ".silk"
+    assert captured["response_format"] == "silk"
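Across these tests the output-file suffix tracks the configured `responseFormat`, with one exception: `opus` audio lands in a `.ogg` file (Telegram voice notes expect an Ogg container), while `silk` keeps `.silk`. A hedged sketch consistent with those suffix assertions; the real mapping in the loop may handle more cases:

```python
def voice_suffix(response_format: str) -> str:
    """Pick a file suffix for a synthesized reply, as the tests above imply."""
    # Opus audio is written into an Ogg container, hence the .ogg suffix.
    if response_format == "opus":
        return ".ogg"
    return f".{response_format}"


print(voice_suffix("opus"), voice_suffix("silk"), voice_suffix("mp3"))
```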