Generating artificial podcasts from articles with OpenAI's TTS model
Introduction
Who doesn’t like podcasts? Well, certainly not me! That’s why in today’s article I’ll explore how you can use OpenAI’s TTS API to generate podcasts from your articles. Before we start, let’s briefly break down the problem.
- I write my articles in MDX, so I can't just pass them to the TTS API. They need to be transformed into "tts-compatible" input, which doesn't have frontmatter, JSX elements like code snippets, JavaScript imports, etc.
- OpenAI's TTS API accepts input strings of at most 4096 characters, so for longer articles we need to break them into sections, generate an audio file for each, and join the files together to get the final podcast.
- The process of generating podcasts should be quick and simple. We want a CLI tool that lists modified or new articles and lets us pick the ones for which we want to generate a podcast.
Note: while this post contains code snippets, it's not a tutorial. If you want to implement a similar feature on your site and find something confusing, please take a look at the full source code of the podcast package in my site's repository!
Transforming raw MDX files into TTS input
Node.js has a really good toolkit for working with content called unified, which "compiles content to syntax trees and syntax trees to content". We'll use it to parse our MDX articles and process their ASTs. We can get started by defining the necessary types and a compileArticle function.
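Here's a minimal sketch of what that could look like. The type shapes, the way frontmatter ends up on file.data, and the way sections are read off the processed file are assumptions made for illustration; the real implementation lives in the repository.

```typescript
import { unified } from 'unified';
import remarkParse from 'remark-parse';
import remarkMdx from 'remark-mdx';
import remarkFrontmatter from 'remark-frontmatter';
import remarkUnlink from 'remark-unlink';

// assumed shapes, the real types live in the podcast package
interface ArticleFrontmatter {
  title: string;
}

interface CompiledArticle {
  frontmatter: ArticleFrontmatter;
  sections: string[];
}

export async function compileArticle(rawMDX: string): Promise<CompiledArticle> {
  const file = await unified()
    .use(remarkParse)
    .use(remarkMdx)
    .use(remarkFrontmatter)
    .use(remarkUnlink)
    .use(ttsTransformer) // our custom transformer, described below
    .use(ttsCompiler) // our custom compiler, described below
    .process(rawMDX);

  return {
    // the transformer stores the parsed frontmatter on file.data
    frontmatter: file.data.frontmatter as ArticleFrontmatter,
    // the compiler returns a non-string result, which unified exposes as file.result
    sections: file.result as string[],
  };
}
```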
This function takes a raw MDX string and initializes a unified processor that later uses a few plugins to extract frontmatter and an array of sections.
We need to write two unified plugins: a transformer, which will modify the AST produced by remark-parse, remark-mdx, remark-frontmatter, and remarkUnlink, and a compiler, which will take the AST and extract an array of sections. Let's start by writing the transformer.
Removing unreadable nodes
ttsTransformer is our custom plugin that traverses the AST to extract the article's frontmatter, wrap headings in quotation marks, render lists with the toMarkdown utility, and remove unreadable nodes.
Hopefully, this pre-processing of headings and lists will make the TTS input less ambiguous for the model.
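A sketch of what the transformer might look like. The exact node handling is simplified, parsing the frontmatter with the yaml package is an assumption, and isTTS is the assertion described next.

```typescript
import type { Plugin } from 'unified';
import type { Root } from 'mdast';
import { visit, SKIP } from 'unist-util-visit';
import { remove } from 'unist-util-remove';
import { toMarkdown } from 'mdast-util-to-markdown';
import { toString } from 'mdast-util-to-string';
import { parse as parseYaml } from 'yaml'; // assumed YAML parser

import { isTTS } from './is-tts'; // hypothetical path for the assertion below

const ttsTransformer: Plugin<[], Root> = () => (tree, file) => {
  visit(tree, (node, index, parent) => {
    // keep the parsed frontmatter on the virtual file for later use
    if (node.type === 'yaml') {
      file.data.frontmatter = parseYaml(node.value);
      return;
    }

    // wrap headings in quotation marks so the model reads them as titles
    if (node.type === 'heading') {
      node.children = [{ type: 'text', value: `"${toString(node)}"` }];
      return SKIP;
    }

    // render lists back to markdown text so enumerations stay readable
    if (node.type === 'list' && parent && typeof index === 'number') {
      parent.children[index] = {
        type: 'paragraph',
        children: [{ type: 'text', value: toMarkdown(node) }],
      };
      return SKIP;
    }
  });

  // finally, drop every node that isn't "tts-compatible"
  remove(tree, (node) => !isTTS(node));
};
```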
At the end, we remove all nodes that aren't considered "tts-compatible". How do we know whether a node is "tts-compatible"? Well, I've made a list of node types that represent text content and created an isTTS assertion we can use when calling the remove utility. This assures us that the AST will contain only nodes that can be read aloud.
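The assertion itself can be as simple as checking the node type against a whitelist. The list below is only a guess at what counts as readable text; the real one is in the repository.

```typescript
import type { Node } from 'unist';

// node types that represent readable text content (illustrative list)
const TTS_NODE_TYPES = [
  'root',
  'paragraph',
  'heading',
  'text',
  'emphasis',
  'strong',
  'inlineCode',
];

export const isTTS = (node: Node): boolean => TTS_NODE_TYPES.includes(node.type);
```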
Compiling the AST into an array of sections
Now, we can look at the ttsCompiler plugin. We'll attach our own custom compiler to the unified processor, which will visit all of the top-level nodes, render them to strings, and group them into sections.
As I store my article titles in frontmatter, I need to prepend them to the first section. Then, we just loop through the children of the root node and append them to the current section as long as its total length stays below 4096 characters; otherwise, we start a new section. If a single node's text content exceeds 4096 characters, I throw an error. As I plan to write short blog posts, it's almost impossible this will ever happen.
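Sketched out, the compiler could look roughly like this. How nodes are rendered to text and how the result is returned are simplifications, not the exact code from the package.

```typescript
import type { Plugin } from 'unified';
import type { Root } from 'mdast';
import { toString } from 'mdast-util-to-string';

// the hard limit of the TTS API input
const MAX_SECTION_LENGTH = 4096;

const ttsCompiler: Plugin<[], Root, string[]> = function () {
  // in older unified versions this property is `this.Compiler`
  this.compiler = (tree, file) => {
    const { title } = file.data.frontmatter as { title: string };
    const sections: string[] = [];

    // the title lives in frontmatter, so it opens the first section
    let current = `"${title}"\n\n`;

    for (const node of (tree as Root).children) {
      const text = toString(node);

      if (text.length > MAX_SECTION_LENGTH) {
        throw new Error('A single node exceeds the TTS input limit');
      }

      // start a new section when appending would cross the limit
      if (current.length + text.length > MAX_SECTION_LENGTH) {
        sections.push(current);
        current = '';
      }

      current += `${text}\n\n`;
    }

    if (current.trim().length > 0) {
      sections.push(current);
    }

    return sections;
  };
};
```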
Example
To show how our compiler works, let’s say we have the following article for which we want to generate a podcast.
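For example, a short, made-up MDX file like this one (not an actual article from my blog):

```mdx
---
title: Hello world
---

import { Chart } from '../components/chart';

# Why I like podcasts

I like podcasts because I can listen to them while doing other things.

<Chart data={[1, 2, 3]} />
```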
After calling compileArticle with the contents of this file as the rawMDX argument, we receive the following object.
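Based on the sketches above, the result would look roughly like this; the exact whitespace depends on how nodes are rendered to text.

```typescript
const compiled = {
  frontmatter: { title: 'Hello world' },
  sections: [
    '"Hello world"\n\n"Why I like podcasts"\n\nI like podcasts because I can listen to them while doing other things.\n\n',
  ],
};
```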
The heading and the first paragraph get merged into a single section because together they don't exceed the length limit. If we had a really long article, like this one, we would probably get more entries in the sections array.
Using OpenAI’s TTS model to generate podcasts
We're now ready to compile an article and turn it into a podcast. We'll use the official OpenAI client for Node.js to make things a bit simpler. Let's abstract this process into a generatePodcast function.
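Here's a sketch of generatePodcast under a few assumptions: the model and voice are placeholders, the output path is passed in by the caller, and ffmpeg is invoked through its concat protocol. The real code may differ.

```typescript
import { execFile } from 'node:child_process';
import { writeFile, rm } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import { promisify } from 'node:util';
import path from 'node:path';
import OpenAI from 'openai';

const execFileAsync = promisify(execFile);
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function generatePodcast(
  article: CompiledArticle,
  articleSlug: string,
  outputPath: string,
): Promise<void> {
  // each section is independent, so we can request all of them concurrently
  const sectionFiles = await Promise.all(
    article.sections.map(async (section, i) => {
      const response = await openai.audio.speech.create({
        model: 'tts-1',
        voice: 'onyx', // arbitrary choice for the sketch
        input: section,
      });

      const file = path.join(tmpdir(), `${articleSlug}-section${i}.mp3`);
      await writeFile(file, Buffer.from(await response.arrayBuffer()));
      return file;
    }),
  );

  // concatenate the per-section files into the final podcast with ffmpeg
  await execFileAsync('ffmpeg', [
    '-y',
    '-i',
    `concat:${sectionFiles.join('|')}`,
    '-acodec',
    'copy',
    outputPath,
  ]);

  // clean up the temporary files
  await Promise.all(sectionFiles.map((file) => rm(file)));
}
```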
This function loops through the compiled article's sections and runs the model for each one, storing the output in a temporary file called {articleSlug}-section{i}.mp3. After generating audio files for every section, we concatenate them into the final podcast with ffmpeg.
We store the temporary files in the OS temporary directory and, to keep things clean, remove them once we're done.
What's nice about this snippet is that, since generating an audio file for one section is independent of the other sections, we can map the sections to an array of Promises and await them all together. Running the model can take a while, and making these requests concurrent makes our code significantly faster.
With this function, we can generate a podcast for an article.
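For instance, something along these lines (the paths here are made up):

```typescript
import { readFile } from 'node:fs/promises';

const rawMDX = await readFile('articles/hello-world.mdx', 'utf8');
const article = await compileArticle(rawMDX);

await generatePodcast(article, 'hello-world', 'public/podcasts/hello-world.mp3');
```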
Caching
Last but not least, I think it's worth mentioning how I figure out whether an article has changed. My CLI tool generates a .podcasts directory in which it creates a cache.json file. In that cache file, I store md5 hashes of the articles for which I've generated podcasts. When I list files from my articles directory, I check whether the md5 hash of their current TTS content matches the one stored in the cache file. If it does, I just ignore the file. Otherwise, I keep the file together with any new articles. When I run the CLI, I'm prompted to choose the articles from which I want to generate a podcast. It looks like this.
At the time of writing, I update the cached md5 hash for every changed article, even if I end up not generating a podcast from it. Maybe I'll change this behavior in the future, but for now it's fine.
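To give an idea of the mechanics, the hash check could look something like this; the cache file layout and helper names are my own invention, not necessarily what the CLI does.

```typescript
import { createHash } from 'node:crypto';
import { readFile } from 'node:fs/promises';

// cache.json maps article slugs to md5 hashes of their TTS content
type PodcastCache = Record<string, string>;

const md5 = (content: string): string =>
  createHash('md5').update(content).digest('hex');

async function hasChanged(slug: string, ttsContent: string): Promise<boolean> {
  const cache: PodcastCache = JSON.parse(
    await readFile('.podcasts/cache.json', 'utf8'),
  );

  return cache[slug] !== md5(ttsContent);
}
```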
Final thoughts
Does it even make sense to do this? I don't know! I've spent a long time on this article, as I had to figure out how unified works, but I'm really happy with the outcome. I think the podcast feature makes a bit more sense for non-technical posts, as they don't have a lot of important code snippets.
If you're still reading, or listening, I'm really impressed! If you'd like to support me in some way, feel free to leave some bucks for a cup of coffee ☕️.