Voxept.
Bondsynth AI LLC.
TODO: Insert some nifty audio/TTS etc graphic here.
What is it?.
Long-form text-to-speech.
Audiobooks from ebooks.
Websites.
PDFs.
Anything longer than a few sentences.
User Journey (once on site).
Listener uploads an ebook or submits a URL.
Listener is redirected to a progress bar page; in the background text-to-speech synthesis begins.
Progress bar updates every 5 percentage points; listener is informed if an error occurs.
Synthesis completes and the listener either listens in the browser or downloads a file to listen to later.
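A minimal sketch of the progress-reporting step in the journey above. Everything named here is hypothetical and for illustration only: `synthesize_with_progress`, the `synthesize_chunk` callable standing in for the text-to-speech model, and the `report` callback standing in for whatever pushes updates to the progress bar page. It assumes the text has already been split into chunks and that progress is measured as the share of chunks finished.

```python
from typing import Callable, List


def synthesize_with_progress(
    chunks: List[str],
    synthesize_chunk: Callable[[str], bytes],
    report: Callable[[str, int], None],
    step: int = 5,
) -> List[bytes]:
    """Synthesize chunks in order, reporting progress in `step`-point increments."""
    audio: List[bytes] = []
    last_reported = 0
    report("progress", 0)  # show an empty bar as soon as the page loads
    total = len(chunks)
    for index, chunk in enumerate(chunks, start=1):
        try:
            audio.append(synthesize_chunk(chunk))
        except Exception:
            # Tell the listener something went wrong, then let the caller handle it.
            report("error", last_reported)
            raise
        percent = index * 100 // total
        bucket = percent - percent % step  # round down to a multiple of `step`
        if bucket > last_reported:
            last_reported = bucket
            report("progress", bucket)
    return audio
```

Rounding down to a multiple of `step` keeps the bar from claiming more progress than has actually been made, and short documents simply skip the intermediate values.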
Long-Form Text-to-Speech.
Long-form text-to-speech has a higher quality bar.
The voices on digital assistants are fine for about 1 minute of strictly informational audio.
Any longer and they become annoying.
Long-form text-to-speech benefits from many speakers.
High-level content structure starts to matter.
If a character in a book describes a harrowing situation over the course of several sentences, a consistent tone across those sentences is important.
Individual sentences will not always convey enough information to produce the appropriate tone.
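One way to picture the point about sentence-level context is sketched below: each sentence is paired with its neighbors so that a model could condition tone on more than the sentence alone. This is an illustrative assumption, not a description of how Voxept works; the function name `with_context` and the `before` / `after` window sizes are hypothetical.

```python
from typing import List, Tuple


def with_context(
    sentences: List[str], before: int = 2, after: int = 1
) -> List[Tuple[str, List[str]]]:
    """Pair each sentence with up to `before` preceding and `after` following sentences."""
    windows: List[Tuple[str, List[str]]] = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - before)
        end = min(len(sentences), i + after + 1)
        # Neighbors on both sides, excluding the sentence itself.
        context = sentences[start:i] + sentences[i + 1:end]
        windows.append((sentence, context))
    return windows
```

A sentence such as "He did not answer." carries no tonal cue on its own; paired with the sentences around it, the surrounding harrowing description is available to whatever decides the tone.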
Why? (Who cares?).
Audiobook enthusiasts have strong opinions about their audiobooks.
Many audiobooks are of poor production quality.
Some listeners strongly dislike one reader or another.
Other listeners might love that same reader.
Some listeners like sound effects; others despise them.
Audiobooks are expensive.
Full cast audiobooks are very expensive.
Public-domain books are popular, but their audiobooks are just as costly.
Audiobook listeners . TO DO: SEARCH FOR STATS ABOUT HOW OFTEN AUDIOBOOK LISTENERS LISTEN AND/OR HOW MANY BOOKS THEY LISTEN TO A YEAR, COMPARE TO TEXT BOOK READERS.
Why Now?.
Audiobooks are popular. TO DO: INSERT MENTION OF AUDIOBOOK SALES PER YEAR WITH YoY GROWTH.
Web content is increasingly consumed in audio form. TO DO: INSERT SOME STATS ABOUT THAT.
TO DO: INSERT SCREENSHOTS OF NYTIMES, WSJ, ETC.
There are too many websites for humans to read them all.
TO DO: INSERT STATS ABOUT LONG-TAIL OF WEBSITE CONSUMPTION? IS MOST LONG-FORM CONTENT CONSUMED ON THE WEB FROM A FEW BIG SITES OR MANY, MANY SMALL ONES?.
Why Now?.
Podcast popularity.
TO DO: INSERT STATS ABOUT PODCASTS.
TO DO: INSERT BLURB ABOUT PODCASTS vs WEB AUDIO.
Headphone use is becoming ubiquitous.
TO DO: INSERT STATS ABOUT HEADPHONE USE.
Headphones are really the first augmented reality device.
I expect listening time to keep increasing as more and more people listen throughout the day.
Why? (Who cares?).
Even with the audiobook industry booming, some very popular books have problems with their audiobooks.
Dawnshard by Brandon Sanderson - 2+ years for an audiobook. TO DO: DOUBLE CHECK THE 2+ YEARS NUMBER.
Excession by Iain M. Banks (part of the Culture series) - audiobook can ONLY be purchased if you live in the UK.
Dune - no unabridged audiobook.
Sorry if you listened to it; I did too. I thought the story lacked cohesion. Apparently it actually did.
www.duneaudiobook.com if you’d like to listen to the unabridged version . TO DO: MAKE THIS WORK!.
How?.
Working on long-form text-to-speech as a vertical.
Data.
Model.
Product.
Iterate across the stack.
Investing in tools.
Tools for high quality dataset generation.
Tools for manual dataset quality evaluation.
Tools for automatic dataset quality evaluation.
Tools for data augmentation.
Where’s Voxept now?.
Natural, consistent text-to-speech on web pages and ebooks.
Higher quality than any other long-form text-to-speech I’ve seen.
Single speaker.
Unscaled website up and running.
No long-range context.
Physical-book to audiobook. TO DO: DROP?.
Mostly a marketing gimmick.
Rough Roadmap.
December 2022.
More emotive speech.
Emotive speech using context across several sentences at a time.
Scalable website.
Begin acquiring users.
Audiobook enthusiasts.
Tech enthusiasts (particularly for web page text-to-speech).
More target audience research.
What are the pain points?.
Solicit enthusiastic users to give feedback.
March 2023.
Multiple user-configurable voices available, with only one voice per piece of content.
Iterate on user feedback.
June 2023.
Pricing model worked out.
Ads vs tiered access.
Important to keep some sort of free tier.
Vision.
(very long-range roadmap).
Multiple voices with no legal questions.
Automatic voice-to-character mapping for ebooks.
Full-cast audiobook generation.
Using higher-level emotional context for determining speech emotiveness.
E.g., a character giving an impassioned speech should sound impassioned throughout, even if an individual sentence carries no cues.
This Entire Presentation Read by Voxept.