From Static Finance Content to a Searchable AI Learning Platform
The client had high-quality courses, Synthesia video lessons, and a 500+ page financial modeling book. Members still had to search manually. Here is how we built a retrieval and voice AI layer across all three content sources.
Investment Analyst
2,800+
500+
~6 Weeks

| Before | After |
| --- | --- |
| Members browsed courses manually with no semantic understanding | Hybrid keyword and semantic search across platform content |
| Video knowledge stayed locked inside Synthesia transcripts | Transcripts pulled, cleaned, and indexed automatically |
| A 500+ page book existed only as a static PDF | Book content extracted, chunked, embedded, and stored in Qdrant |
| No AI assistant existed for member Q&A | ElevenLabs voice agent answers from proprietary training material |
| Course and video updates required manual operational work | Automated LearnWorlds and Synthesia sync pipelines keep content current |
| No financial advice controls were needed because there was no AI layer | Explicit guardrails keep the agent educational, not advisory |
Services Delivered
Across this project, we delivered:
Discovery:
Understanding the Content Before Designing the AI Layer
We mapped how members were expected to learn today, where the content lived, and what LearnWorlds, Synthesia, Algolia, ElevenLabs, and Qdrant could realistically support.
What we learned:
Three useful content sources existed, but none of them worked together. LearnWorlds held course structure, Synthesia held video scripts, and the 500-page PDF held the deepest reference material.
Course content was structured, but not semantically searchable. Members could browse pages and pathways, but they could not ask questions in natural language and get directed to the right lesson.
Video transcripts were valuable, but isolated. Synthesia contained spoken instructional content, but LearnWorlds had no native transcript layer or reliable mapping between courses and videos.
The PDF needed its own retrieval pipeline. The Financial Modelling Mastery book was too important to treat as a static file. It needed extraction, cleaning, chunking, embeddings, and vector storage before an AI agent could use it.

Two decisions shaped the entire system:
Build a decoupled content architecture
Pull content from each source independently
Avoid manual course-to-video mapping
Use automated sync pipelines instead of an admin curation layer
Reduce maintenance overhead for future course and video updates
Treat the PDF as the foundation for book intelligence
Build CR-1 as a reusable PDF ingestion pipeline
Clean noisy textbook content before embedding
Store semantic chunks in Qdrant
Validate retrieval through real finance-related test queries
This approach let us move fast without forcing LearnWorlds, Synthesia, and the PDF book into one brittle content model. Each source kept its structure. The intelligence layer made them searchable, retrievable, and usable by the AI agent.
The three knowledge layers
LearnWorlds Course Layer
Course titles
Descriptions
Learning pathways
LearnWorlds API
Synced into Algolia for semantic search and recommendations.
Synthesia Video Layer
Video metadata
Scripts
Clean transcripts
Synthesia API
Converts video scripts into searchable learning content, so spoken lessons can be discovered through semantic search and surfaced by the AI agent.
Financial Modelling Book Layer
500-page PDF
Cleaned text chunks
Embedded book knowledge
Financial Modelling Mastery PDF
Turns the 500+ page PDF into a retrievable knowledge base, so the AI agent can answer detailed finance questions from the book instead of relying on generic model knowledge.
WHAT WE BUILT
A unified AI intelligence layer across three content sources.
We built four connected components that turned The Investment Analyst's static learning content into a searchable, AI-assisted member experience.
PDF Ingestion Pipeline
Built a reusable Python pipeline to extract, clean, chunk, embed, and store the 500+ page Financial Modelling Mastery book in Qdrant.
Page-level text extraction
Noise filtering for headers, TOC entries, captions, blank pages, and copyright text
Semantic chunking for retrieval quality
OpenAI embedding generation
Qdrant storage
Test-query validation through the agent
The book became retrievable by the AI assistant instead of sitting as a static PDF.
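To make the noise-filtering step concrete, here is a minimal sketch of how page-level cleaning could look. It is an illustration, not the project's code: the `CleaningConfig` fields, the thresholds, and the regular expressions are all assumptions, and a real book would need its own patterns.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CleaningConfig:
    """Hypothetical per-book settings; another book would supply its own values."""
    header_min_share: float = 0.5   # a line on at least this share of pages is a running header
    min_page_chars: int = 20        # pages with less remaining text count as blank
    drop_patterns: tuple = (
        r"^\d+$",                                  # bare page numbers
        r"^.*\.{4,}\s*\d+$",                       # TOC entries such as "Valuation ..... 12"
        r"^(figure|table)\s+\d+",                  # captions
        r"^.*(©|copyright|all rights reserved)",   # copyright text
    )


def clean_pages(pages, cfg=CleaningConfig()):
    """Return the cleaned text of every page that still has content after filtering."""
    page_lines = [[ln.strip() for ln in page.splitlines() if ln.strip()] for page in pages]
    # Count on how many pages each distinct line appears, to spot repeated headers.
    seen_on = Counter(line for lines in page_lines for line in set(lines))
    repeated = {line for line, n in seen_on.items()
                if n >= 2 and n / len(pages) >= cfg.header_min_share}
    patterns = [re.compile(p, re.IGNORECASE) for p in cfg.drop_patterns]
    cleaned = []
    for lines in page_lines:
        kept = [ln for ln in lines
                if ln not in repeated and not any(p.match(ln) for p in patterns)]
        text = "\n".join(kept)
        if len(text) >= cfg.min_page_chars:   # drop blank or near-blank pages
            cleaned.append(text)
    return cleaned
```

Keeping the patterns in a config object rather than in the function body is one way to let a future book reuse the same pipeline with different settings.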
ElevenLabs AI Voice Agent
Built a member-facing AI tutor using ElevenLabs Agents v2.0 and the client's professional voice clone.
Voice-enabled member Q&A
Answers grounded in proprietary content
References and links back to courses, videos, and book material
Guardrails against personalised financial advice
LearnWorlds widget / iframe embedding
Paywall-compatible access
Members could ask questions naturally and get guided to the right learning material.
Algolia Semantic Search & Retrieval
Configured Algolia as the discovery layer for LearnWorlds courses and Synthesia video content.
Hybrid keyword + semantic search
Course title, description, and pathway indexing
Video transcript indexing
Investment-domain index structure
Algolia Recommendation API for course suggestions
Smart search inside LearnWorlds
Courses and video knowledge became searchable from one place.
Automated Content Sync Pipelines
Built automated pipelines to keep LearnWorlds and Synthesia content current inside the search and AI layer.
LearnWorlds course sync
Synthesia video metadata sync
Transcript extraction and timestamp cleanup
Scheduled and event-driven indexing
Update and deletion handling in Algolia
New, updated, or removed content flowed into the system without manual operational work.
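The update and deletion handling can be pictured as a diff between what the source platform currently holds and what the index already contains. The sketch below is a simplified illustration under assumptions: records are plain dictionaries, the index side is represented by a stored fingerprint per `objectID` (the identifier field Algolia records use), and the actual write calls to the search index are left out.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Stable content hash, so unchanged records can be skipped."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def plan_sync(source_records: dict, indexed_fingerprints: dict):
    """Compare source content with the index and return (to_upsert, to_delete).

    source_records:        {object_id: record} as currently held by the source platform
    indexed_fingerprints:  {object_id: fingerprint} stored at the previous sync
    """
    to_upsert = [
        {"objectID": object_id, **record}
        for object_id, record in source_records.items()
        if indexed_fingerprints.get(object_id) != fingerprint(record)  # new or changed
    ]
    # Anything indexed that no longer exists at the source should be removed.
    to_delete = sorted(set(indexed_fingerprints) - set(source_records))
    return to_upsert, to_delete
```

A scheduled job would run this comparison over the full catalogue, while an event-driven trigger could pass in only the records that changed.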
The Details That Made It Production-Ready
Chunking Strategy for the PDF
Raw PDF extraction created too much noise for reliable retrieval. The book included tables, formulae, repeated headers, captions, footnotes, and copyright text.
We solved this with:
Multi-stage text cleaning
Removal of non-informational content
Semantic chunks instead of fixed character windows
Finance-specific retrieval tests
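As a rough illustration of chunking on content boundaries rather than fixed character windows, the sketch below packs whole paragraphs into a size budget and only falls back to sentence boundaries when a paragraph is too long. This is a simple structure-aware stand-in, not the project's chunker; the `max_chars` budget and the splitting rules are assumptions.

```python
import re


def chunk_text(text: str, max_chars: int = 800) -> list:
    """Split cleaned text into chunks that never cut through a sentence."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    units = []
    for para in paragraphs:
        if len(para) <= max_chars:
            units.append(para)
        else:
            # Oversized paragraph: fall back to sentence boundaries.
            units.extend(s for s in re.split(r"(?<=[.!?])\s+", para) if s)
    chunks, current = [], ""
    for unit in units:
        candidate = f"{current}\n{unit}" if current else unit
        if current and len(candidate) > max_chars:
            chunks.append(current)   # budget reached: close the chunk at a boundary
            current = unit
        else:
            current = candidate
    if current:
        chunks.append(current)
    # A single sentence longer than max_chars is kept whole as its own chunk.
    return chunks
```

A fixed window of the same size could cut a definition or a formula explanation in half; closing chunks only at paragraph or sentence boundaries keeps each retrieved passage readable on its own.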
Decoupled Architecture Over Manual Linking
LearnWorlds and Synthesia had no native connection.
Instead of building a manual linking layer with a custom DB and admin UI, we used independent sync pipelines feeding a shared Algolia index.
This meant:
No manual course-to-video mapping
No ongoing admin curation
Faster delivery
Easier content updates
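One way to picture the decoupled approach: each source is normalised into the same record shape on its own, with a source-prefixed identifier, and nothing tries to link a course unit to a video. The sketch below is an assumption-heavy illustration. The input field names (`id`, `title`, `description`, `transcript`), the record shape, and the timestamp formats are hypothetical and do not come from the LearnWorlds or Synthesia APIs.

```python
import re

# Assumed timestamp shapes: "[00:12]", "00:01:23" or "00:01:23.500".
TIMESTAMP = re.compile(r"\[?\b\d{1,2}:\d{2}(?::\d{2})?(?:[.,]\d{1,3})?\b\]?")


def clean_transcript(raw: str) -> str:
    """Strip timestamps and collapse whitespace so only spoken text is indexed."""
    without_times = TIMESTAMP.sub(" ", raw)
    return re.sub(r"\s+", " ", without_times).strip()


def course_record(course: dict) -> dict:
    """Normalise a (hypothetical) LearnWorlds course payload."""
    return {
        "objectID": f"learnworlds:{course['id']}",
        "source": "learnworlds",
        "title": course["title"],
        "body": course.get("description", ""),
    }


def video_record(video: dict) -> dict:
    """Normalise a (hypothetical) Synthesia video payload."""
    return {
        "objectID": f"synthesia:{video['id']}",
        "source": "synthesia",
        "title": video["title"],
        "body": clean_transcript(video.get("transcript", "")),
    }
```

Because both functions emit the same fields, the two pipelines can feed one shared index without a mapping table between them, and either pipeline can change without touching the other.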
Embedding Model Choice
Investment content is dense and vocabulary-heavy. Queries like EBITDA bridge, terminal value growth rate, or DCF assumptions need financial context, not surface similarity.
We used OpenAI embeddings for the PDF pipeline to improve retrieval quality on finance-specific content.
Better embeddings meant better answers from day one.
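A finance-specific retrieval test can be as simple as a list of queries paired with the chunk each one should surface. The sketch below shows that shape only. To stay self-contained it swaps in a toy bag-of-words vector and an in-memory list where the project used OpenAI embeddings and Qdrant, so `toy_embed`, the sample chunks, and the chunk ids are all stand-ins.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query, chunks, embed=toy_embed, k=1):
    """Rank chunks by similarity to the query and keep the best k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]


def run_retrieval_tests(chunks, cases, embed=toy_embed, k=1):
    """Return the queries whose expected chunk id is missing from the top-k results."""
    failures = []
    for query, expected_id in cases:
        hit_ids = [c["id"] for c in top_k(query, chunks, embed, k)]
        if expected_id not in hit_ids:
            failures.append(query)
    return failures


# Illustrative data only; not taken from the book.
CHUNKS = [
    {"id": "dcf-1", "text": "A DCF model rests on assumptions about growth, margins and the discount rate."},
    {"id": "tv-1", "text": "The terminal value growth rate should not exceed long-run economic growth."},
    {"id": "bridge-1", "text": "An EBITDA bridge explains the change in EBITDA between two periods."},
]
CASES = [
    ("EBITDA bridge", "bridge-1"),
    ("terminal value growth rate", "tv-1"),
    ("DCF assumptions", "dcf-1"),
]
```

Passing a real embedding function as `embed` and swapping the list for a vector store query would keep the same test cases reusable whenever chunking or the embedding model changes.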
Keeping the Agent Grounded
The agent had to teach financial concepts, not give investment advice.
We configured it with:
Grounding against The Investment Analyst's proprietary content
Guardrails against personalised financial advice
Clear fallback behavior when content is not found
References and links back to courses, videos, or book material
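The three behaviours in that list (refuse personalised advice, fall back when nothing relevant is found, otherwise answer with references) can be expressed as a small routing function. This is purely an illustration of the logic: the agent's real guardrails live in its platform configuration, and the patterns, the `MIN_SCORE` threshold, the reply wording, and the hit fields here are all hypothetical.

```python
import re
from dataclasses import dataclass, field

# Hypothetical signals that a question asks for personalised advice.
ADVICE_PATTERNS = [
    r"\bshould i (buy|sell|invest|hold)\b",
    r"\bmy (portfolio|savings|pension)\b",
]
MIN_SCORE = 0.35  # hypothetical relevance threshold for retrieved content


@dataclass
class Reply:
    kind: str                      # "advice_refusal", "fallback" or "answer"
    text: str
    references: list = field(default_factory=list)


def route(question: str, hits: list) -> Reply:
    """Decide how to respond given the question and the retrieved content hits."""
    q = question.lower()
    if any(re.search(p, q) for p in ADVICE_PATTERNS):
        return Reply("advice_refusal",
                     "I can explain the concepts involved, but I can't give personal investment advice.")
    grounded = [h for h in hits if h["score"] >= MIN_SCORE]
    if not grounded:
        return Reply("fallback",
                     "I couldn't find this in the course material, so I won't guess.")
    # Answer from the best hit and link back to every sufficiently relevant source.
    return Reply("answer", grounded[0]["text"], [h["url"] for h in grounded])
```

Checking for advice before looking at retrieval results means a well-matched passage can never be used to justify a personal recommendation.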
Platform Embedding Constraints
LearnWorlds limited how deeply the AI layer could be embedded.
We avoided unsupported platform customization and delivered the experience through widget and iframe embedding.
This kept the integration:
Stable across LearnWorlds updates
Compatible with member-only access
Easy to place across course pages
Independent from LearnWorlds core code
Real World Challenges
PDF noise would have polluted retrieval
Without filtering repeated headers, captions, blank-page noise, and TOC fragments, the agent would retrieve junk.
Reusability changed the pipeline design
The ingestion module was built configuration-first, so future books would not require a rewrite.
No native LearnWorlds-Synthesia mapping
There was no reliable way to say "this course unit equals this Synthesia video." That forced the decoupled architecture decision early.
Paywall embedding had to be validated
The ElevenLabs widget had to work inside authenticated LearnWorlds pages without unsupported platform changes.
What changed for members and platform operations
| Area | Before | After |
| --- | --- | --- |
| Content search | Manual page browsing, no semantic understanding | Hybrid keyword and semantic search via Algolia NeuralSearch |
| Cross-source discovery | No way to search LearnWorlds and Synthesia together | Unified index across courses, transcripts, and book content |
| Book knowledge | 500-page PDF unused by any system | Fully embedded in Qdrant - queryable by the AI agent |
| Member Q&A | No AI assistant on the platform | ElevenLabs voice agent grounded in proprietary content |
| Video transcripts | Locked on Synthesia, not connected to search | Auto-synced and indexed into Algolia via pipeline |
| Content freshness | Manual updates required for every content change | Automated pipelines handle sync, updates, and deletions |
| Course recommendations | Static navigation; no intelligent recommendation | Algolia Recommendation API surfaces courses contextually |
| Financial advice risk | No AI on platform | Explicit guardrails in place; agent teaches, not advises |
The Team and Timeline
A single engineer delivered the core CR-1 work over approximately six weeks, while the broader platform engagement ran in parallel.
| Timing | Workstream | Scope |
| --- | --- | --- |
| 1–2 weeks | Discovery Phase | Architectural analysis, platform constraint mapping, approach evaluation, decision and sign-off |
| ~6 weeks | CR-1: PDF Pipeline | datapipeline_pdf module, PDF extraction and cleaning, semantic chunking, OpenAI embedding, Qdrant ingestion, test-query validation, README |
| Parallel | AI Agent | ElevenLabs v2.0 widget integration, voice clone configuration, prompt grounding, guardrails, CTA placement, paywall embedding |
| Parallel | Algolia Layer | NeuralSearch configuration, index design, Recommendation API, LearnWorlds and Synthesia pipeline builds, automated sync |
| Ongoing | Go Live | End-to-end validation, production deployment, monitoring, AMC support |

SoluteLabs © 2014-2026



