Live, Topic-Threaded Conversations

Muskan Sahetai
Developer
Ramblytics turns any two-speaker conversation into structured, actionable notes in real time. It transcribes speech, diarizes who's talking (Speaker A/B), segments the dialogue into topics, and builds "threads" with short summaries, key phrases, and auto-detected action items.
Participants can interact with each thread during or after the call—confirm decisions, assign tasks, answer open questions—and export everything to Markdown/Notion/CSV.
Ideal for meetings, interviews, lectures, and sales calls where "who said what about which topic" matters—without sending audio to the cloud.
Balancing transcription accuracy with latency is critical. Whisper models need to process audio chunks fast enough for a "live" feel (~2s of lag at most), which means optimizing model size, GPU utilization, and chunking strategy without sacrificing quality. We're exploring quantized models and streaming inference to hit our latency targets.
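A minimal sketch of the chunked-inference idea, assuming the faster-whisper library with an int8-quantized model (the chunk length, model size, and decoding settings here are placeholders, not our final configuration):

```python
import numpy as np
from faster_whisper import WhisperModel

SAMPLE_RATE = 16_000   # faster-whisper expects 16 kHz mono float32 audio
CHUNK_SECONDS = 2      # ~2 s chunks keep perceived lag near our target

# int8 quantization trades a little accuracy for much lower CPU latency.
model = WhisperModel("small.en", device="cpu", compute_type="int8")

def transcribe_stream(audio: np.ndarray):
    """Yield text for each fixed-length chunk of a mono float32 buffer."""
    step = SAMPLE_RATE * CHUNK_SECONDS
    for start in range(0, len(audio), step):
        chunk = audio[start:start + step]
        # beam_size=1 (greedy decoding) is another latency knob we can turn.
        segments, _info = model.transcribe(chunk, language="en", beam_size=1)
        yield " ".join(seg.text.strip() for seg in segments)
```

Fixed-length chunking is the simplest baseline; the streaming-inference experiments mentioned above would replace this loop with overlap-aware buffering.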
Distinguishing between two speakers in real-world conditions (background noise, overlapping speech, varying distances from the mic) is non-trivial. PyAnnote's pretrained models work well in ideal conditions, but we're building custom fine-tuning pipelines and voice embedding strategies to handle edge cases like similar-sounding voices or cross-talk.
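For reference, a two-speaker run through pyannote's pretrained pipeline looks roughly like the sketch below (assuming pyannote.audio 3.x and a Hugging Face access token for the gated model; the file name and token are placeholders):

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN",  # placeholder; a real gated-model token is required
)

# Constraining num_speakers=2 helps in our two-party setting.
diarization = pipeline("call.wav", num_speakers=2)

for turn, _track, speaker in diarization.itertracks(yield_label=True):
    # speaker is a label like "SPEAKER_00"; we map labels to A/B downstream.
    print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {speaker}")
```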
Conversations don't have clear chapter breaks. Deciding when to split dialogue into a new "thread" requires semantic understanding of context shifts. We're experimenting with sentence embeddings, sliding-window topic coherence scores, and lightweight LLMs to detect topic boundaries without sending full transcripts to the cloud.
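A rough sketch of the sliding-window coherence idea, assuming the sentence-transformers library (the model name, window size, and threshold are illustrative values we're still tuning):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def topic_boundaries(sentences: list[str], window: int = 3,
                     threshold: float = 0.4) -> list[int]:
    """Return indices where a new thread likely starts."""
    emb = model.encode(sentences, normalize_embeddings=True)
    boundaries = []
    for i in range(window, len(sentences) - window + 1):
        left = emb[i - window:i].mean(axis=0)
        right = emb[i:i + window].mean(axis=0)
        # Cosine similarity between the windows before and after position i;
        # a dip below the threshold suggests the topic has shifted.
        sim = float(np.dot(left, right) /
                    (np.linalg.norm(left) * np.linalg.norm(right)))
        if sim < threshold:
            boundaries.append(i)
    return boundaries
```

Because the embedding model runs locally, boundary detection never ships transcript text off-device.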
Running everything on-device means no cloud APIs, which is great for privacy but challenging on resource-constrained machines. We need efficient model serving, smart caching, and graceful degradation on lower-end hardware. We're evaluating WebAssembly and ONNX Runtime for browser-based processing.
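A small sketch of the graceful-degradation idea with ONNX Runtime: prefer a GPU execution provider when one is available, otherwise fall back to CPU (the model path is a hypothetical placeholder):

```python
import onnxruntime as ort

# Try providers in order of preference; keep only the ones this machine has.
preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
available = ort.get_available_providers()
providers = [p for p in preferred if p in available] or ["CPUExecutionProvider"]

session = ort.InferenceSession("models/segmenter.onnx", providers=providers)

def run(inputs: dict):
    """inputs maps input names to numpy arrays; returns all model outputs."""
    return session.run(None, inputs)
```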
Automatically detecting tasks, decisions, and questions from natural dialogue requires understanding intent and context. We're combining NER (named entity recognition) pipelines with rule-based patterns to catch phrases like "let's follow up on..." or "can you send me..." and attribute them to the right speaker.
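A toy version of the rule-based side, with hypothetical patterns (not our production rule set) and simple speaker attribution:

```python
import re

# Illustrative trigger phrases for each item type.
PATTERNS = {
    "action":   re.compile(r"\b(let's follow up on|can you send me|i'll take care of)\b", re.I),
    "decision": re.compile(r"\b(we decided to|let's go with|we agreed)\b", re.I),
    "question": re.compile(r"\b(what about|do we know|should we)\b", re.I),
}

def detect_items(utterances: list[tuple[str, str]]) -> list[dict]:
    """utterances: (speaker, text) pairs; returns tagged items with attribution."""
    items = []
    for speaker, text in utterances:
        for kind, pattern in PATTERNS.items():
            if pattern.search(text):
                items.append({"speaker": speaker, "kind": kind, "text": text})
    return items

print(detect_items([("A", "Can you send me the deck?"),
                    ("B", "We agreed on Friday.")]))
```

Rules like these catch the obvious phrasings cheaply; the NER pipeline picks up the entities (people, dates, artifacts) that the matched items refer to.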