Speech & Audio - ReadRealm

ReadRealm converts book text into audio using Azure OpenAI TTS (the tts deployment). Two HTTP endpoints generate audio: GET /book/tts/stream/:title streams audio for a book fetched from Gutendex, and POST /book/ebook synthesises text supplied in the request. A separate real-time speech feature, backed by Azure Cognitive Services Realtime, runs over WebSocket.

Configuration

The TTS and real-time speech services are configured through environment variables:

Variable	Used by	Description
`AZURE_API_TTS_KEY`	TTS service	API key for the Azure OpenAI TTS deployment
`AZURE_API_TTS_ENDPOINT`	TTS service	Endpoint URL for the Azure OpenAI TTS deployment
`AZURE_API_TTS_MODEL`	TTS service	Deployment name (e.g. `tts`)
`AZURE_API_REALTIME_KEY`	Real-time speech	API key for the Azure Realtime speech deployment
`AZURE_API_REALTIME_ENDPOINT`	Real-time speech	Endpoint URL for the Azure Realtime speech gateway
`AZURE_API_REALTIME_MODEL`	Real-time speech	Deployment name for real-time voice

The API reads these variables through the NestJS ConfigService. Set them in your environment or .env file before starting the API.
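A startup check can catch a missing variable before the first TTS request fails. The following is a minimal sketch, not ReadRealm's own code: the helper name `missingSpeechEnv` and the `Env` type are hypothetical, and only the variable names come from the table above.

```typescript
// Hypothetical startup check; not taken from the ReadRealm source.
// Variable names are the ones listed in the configuration table.
const SPEECH_ENV_VARS = [
  "AZURE_API_TTS_KEY",
  "AZURE_API_TTS_ENDPOINT",
  "AZURE_API_TTS_MODEL",
  "AZURE_API_REALTIME_KEY",
  "AZURE_API_REALTIME_ENDPOINT",
  "AZURE_API_REALTIME_MODEL",
] as const;

type Env = Record<string, string | undefined>;

// Returns the names of variables that are unset or blank, in table order.
function missingSpeechEnv(env: Env): string[] {
  return SPEECH_ENV_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}
```

In a Node process you would typically pass `process.env` and refuse to start if the returned list is not empty.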

GET /book/tts/stream/:title

Looks up a book by title on Gutendex, fetches its plain-text content, and streams the TTS audio as a chunked audio/mpeg response. The connection stays open until the full audio is piped through.

Only the first 4,096 characters of the book text are sent to Azure OpenAI for conversion. Longer texts are truncated, so the audio covers only the opening of the book.
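To see what a 4,096-character limit means for a given text, the cut can be reproduced client-side. This is a sketch under one assumption: that "characters" are counted as UTF-16 code units, which is what `String.prototype.slice` counts. The server may count differently, and `previewTtsInput` is a hypothetical name.

```typescript
// Limit stated in this section; the helper itself is illustrative only.
const TTS_MAX_CHARS = 4096;

// Returns the part of the text that would be synthesised and whether
// anything was cut off. Assumes the limit is measured in UTF-16 code units.
function previewTtsInput(text: string): { input: string; truncated: boolean } {
  return {
    input: text.slice(0, TTS_MAX_CHARS),
    truncated: text.length > TTS_MAX_CHARS,
  };
}
```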

Path parameters

title

string

required

URL-encoded book title. The server decodes it before querying Gutendex (e.g. alice%20in%20wonderland).
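Titles that contain spaces, slashes, or other reserved characters need to be encoded before they go into the path. A minimal sketch using the standard `encodeURIComponent`; the function name `ttsStreamUrl` and the base URL are illustrative.

```typescript
// Builds the streaming URL for a title. Trailing slashes on the base URL
// are removed so the result never contains "//book".
function ttsStreamUrl(baseUrl: string, title: string): string {
  const base = baseUrl.replace(/\/+$/, "");
  return `${base}/book/tts/stream/${encodeURIComponent(title)}`;
}
```

Encoding the whole segment keeps a title such as `Romeo/Juliet` from being read as two path segments.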

Response headers

Header	Value
`Content-Type`	`audio/mpeg`
`Transfer-Encoding`	`chunked`
`Cache-Control`	`no-cache`
`Content-Disposition`	`inline`

Response body

A binary MP3 audio stream piped directly from the Azure OpenAI response. Write it to a file or pipe it to a media player.

# Save to a file
curl -o alice.mp3 \
  'http://localhost:3000/book/tts/stream/alice%20in%20wonderland'

# Play inline with mpv (or any media player that reads stdin)
curl -s 'http://localhost:3000/book/tts/stream/alice%20in%20wonderland' | mpv -

Error responses

Status	Condition
`404`	No book with the given title found on Gutendex, or book has no plain-text content
`500`	Azure TTS call failed or stream error occurred
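A client may want to separate "no such book" from "synthesis failed" before it tries to play the body. The sketch below maps the documented status codes to outcomes. The outcome labels and the function name are hypothetical, and any status not listed in the table is reported as unexpected rather than guessed at.

```typescript
type TtsOutcome = "audio" | "not_found" | "tts_failed" | "unexpected";

// Maps an HTTP status from the TTS endpoints to a coarse outcome.
function classifyTtsStatus(status: number): TtsOutcome {
  if (status >= 200 && status < 300) return "audio"; // body is MP3 data
  if (status === 404) return "not_found"; // no matching book or no plain text
  if (status === 500) return "tts_failed"; // Azure TTS call or stream error
  return "unexpected"; // not documented in this section
}
```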

POST /book/ebook

Generates TTS audio from a book body supplied directly in the request. Use this when you already have the book text and do not need the server to fetch it from Gutendex. If the textData field is an empty string, the server substitutes a default CreateBookDto instance before invoking Azure TTS.

Request body

id

number

required

Numeric book ID.

author

string

required

Author name.

title

string

required

Book title.

publicationDate

number

required

Publication year.

numOfPages

number

required

Page count.

coverImage

string

required

Cover image URL.

genre

string

required

Genre.

textData

string

Plain-text content to synthesise. Only the first 4,096 characters are sent to Azure. If empty, a default book object is used.

curl -X POST http://localhost:3000/book/ebook \
  -H 'Content-Type: application/json' \
  -d '{
    "id": 28520,
    "author": "Lewis Carroll",
    "title": "Alice in Wonderland",
    "publicationDate": 1865,
    "numOfPages": 96,
    "coverImage": "https://covers.openlibrary.org/b/id/8739161-M.jpg",
    "genre": "Fantasy",
    "textData": "Alice was beginning to get very tired of sitting by her sister on the bank..."
  }'
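The request body above can be described with a TypeScript type and checked before it is sent. This is a sketch, not the server's `CreateBookDto`: the field names and types follow the field list and the curl example, `textData` is treated as optional because it carries no "required" marker, and `ebookRequestProblems` is a hypothetical helper.

```typescript
// Client-side view of the request body; the server's DTO may differ.
interface EbookRequest {
  id: number;
  author: string;
  title: string;
  publicationDate: number;
  numOfPages: number;
  coverImage: string;
  genre: string;
  textData?: string; // assumed optional
}

const REQUIRED_FIELDS: Array<[keyof EbookRequest, "number" | "string"]> = [
  ["id", "number"],
  ["author", "string"],
  ["title", "string"],
  ["publicationDate", "number"],
  ["numOfPages", "number"],
  ["coverImage", "string"],
  ["genre", "string"],
];

// Returns a list of human-readable problems; an empty list means the
// body matches the documented shape.
function ebookRequestProblems(body: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const [field, expected] of REQUIRED_FIELDS) {
    if (typeof body[field] !== expected) {
      problems.push(`${field} must be a ${expected}`);
    }
  }
  const text = body["textData"];
  if (text !== undefined && typeof text !== "string") {
    problems.push("textData must be a string");
  }
  return problems;
}
```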

Response

A binary audio response body (audio/mpeg) returned by the Azure OpenAI TTS API. The voice used is alloy and the output format is mp3.

Error responses

Status	Condition
`404`	`textData` is absent and the fallback `CreateBookDto` also has no text
`500`	Azure TTS call failed

Real-time speech (WebSocket)

ReadRealm also supports two-way real-time voice powered by the Azure Cognitive Services Realtime API. This uses a WebSocket connection managed by the SpeechRealtimeService:

The client streams raw PCM audio buffers to the server.
The server forwards them to Azure in fixed-size chunks (4,800 bytes at 24 kHz, mono).
Azure performs server-side VAD (voice activity detection) and transcribes the input audio with whisper-1.
Audio responses (response.audio.delta) and transcript deltas are forwarded back to the client in real time via the socket.
On stop, the session audio is saved to MP3 using FFmpeg (libmp3lame, 128 kbps).
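The fixed-size chunking step can be sketched as follows. The 4,800-byte chunk size and the 24 kHz mono format come from the list above; the 16-bit sample width is an assumption (under it, one chunk holds 100 ms of audio), and so is the handling of leftover bytes, which this sketch returns to the caller instead of sending. `splitPcm` and `chunkDurationMs` are hypothetical names, not methods of SpeechRealtimeService.

```typescript
const CHUNK_BYTES = 4800; // from this section
const SAMPLE_RATE_HZ = 24000; // from this section (mono)
const BYTES_PER_SAMPLE = 2; // assumption: 16-bit PCM

// Splits a PCM buffer into full 4,800-byte chunks. Bytes that do not fill
// a whole chunk are returned as `remainder` so the caller can prepend them
// to the next incoming buffer.
function splitPcm(buffer: Uint8Array): { chunks: Uint8Array[]; remainder: Uint8Array } {
  const chunks: Uint8Array[] = [];
  let offset = 0;
  while (offset + CHUNK_BYTES <= buffer.length) {
    chunks.push(buffer.subarray(offset, offset + CHUNK_BYTES));
    offset += CHUNK_BYTES;
  }
  return { chunks, remainder: buffer.subarray(offset) };
}

// Duration of one chunk in milliseconds under the assumptions above.
function chunkDurationMs(): number {
  return (CHUNK_BYTES * 1000) / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE);
}
```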

For connection setup, event names, and payload formats, see WebSockets.

WebSocket events overview

Direction	Event	Description
Server → Client	`audio`	Base64-encoded audio delta or `"Session start"` / `"clear"` control strings
Server → Client	`transcript`	Incremental transcript text or status markers
Server → Client	`state`	Input state change: `0` (Working), `1` (ReadyToStart), `2` (ReadyToStop)
Server → Client	`error`	Error message string
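On the client, the `state` codes and the two kinds of `audio` payload from the table can be told apart with small helpers. This is a sketch: the event names, state codes, and control strings come from the table, while the type and function names are hypothetical and the full payload schemas live in the WebSocket reference.

```typescript
// State codes as listed for the `state` event.
const INPUT_STATES: Record<number, string> = {
  0: "Working",
  1: "ReadyToStart",
  2: "ReadyToStop",
};

// Returns the documented label for a state code, or "Unknown".
function stateLabel(code: number): string {
  const name: string | undefined = INPUT_STATES[code];
  return name === undefined ? "Unknown" : name;
}

type AudioEvent =
  | { kind: "control"; value: "Session start" | "clear" }
  | { kind: "delta"; base64: string };

// The `audio` event carries either a control string or a base64 audio delta.
function parseAudioPayload(payload: string): AudioEvent {
  if (payload === "Session start" || payload === "clear") {
    return { kind: "control", value: payload };
  }
  return { kind: "delta", base64: payload };
}
```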

WebSocket reference

Full connection flow, client events, and payload schemas for both chat and real-time speech.
