How to do Topic Modeling on Podcasts and Videos

Yash Vardhan
NLP and LLMs

What is Topic Modeling?

Topic Modeling is like finding the main themes or subjects in a large collection of written documents without knowing what those themes are ahead of time. Imagine you have a big pile of books or articles, and you want to understand what they're generally about without reading each one in detail. Topic Modeling uses smart computer algorithms to do this.

It is a type of statistical modeling that leverages unsupervised machine learning to analyze and identify clusters or groups of similar words within a body of text, thereby discovering hidden patterns and automatically identifying topics that exist within a text corpus.

Common algorithms for identifying and extracting topics from a collection of documents include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), Non-Negative Matrix Factorization (NMF), Correlated Topic Model (CTM), Top2Vec, and BERTopic.

Each topic is represented as a distribution over words, and each document is represented as a distribution over topics. This allows for the extraction of the main themes from large sets of text data, making it easier to organize, search, and summarize the information.

Topic modeling has numerous applications, including document clustering, organizing large document collections, improving information retrieval, and enhancing recommendations in various domains like academia, business, and social media. It provides insights that are not apparent through simple keyword searches, enabling a more nuanced understanding of the content.

Topic Modeling vs Topic Classification

Topic Modeling utilizes unsupervised machine learning techniques. It does not require a pre-labeled dataset and can autonomously identify patterns within your text data. By analyzing word co-occurrences and distributions, Topic Modeling can reveal hidden themes and topics without any prior knowledge of the content. This process is particularly effective with high-quality, large datasets; the more data you feed into the model, the better it becomes at detecting and delineating the underlying topics. Larger datasets provide more context and variation, allowing the model to capture more subtle patterns and nuances in the text.

Conversely, Topic Classification employs supervised machine learning. It relies on datasets that have been manually labeled to train the model. This method can work effectively with smaller datasets, provided that they are well-labeled and representative of the topics of interest. The manual labeling process creates a structured dataset that the algorithm can use to learn to classify new texts into predefined categories accurately.

In terms of long-term effectiveness, teaching a machine to identify high-value words through labeled text, as Topic Classification does, can be the more strategic choice compared to the unsupervised approach of Topic Modeling. If you have a predefined list of topics and want to label sets of texts such as reviews or surveys quickly, a Topic Classification algorithm is the better fit. It automatically extracts valuable insights from texts based on predefined categories, making it a practical solution for tasks that require quick and accurate topic identification.

Why should we do Topic Modeling?

Topic Modeling allows you to examine multiple topics and organize, understand, and summarize them on a large scale. It enables you to swiftly uncover hidden patterns within the data, providing insights that can inform data-driven decisions.

Document Classification

Topic modeling helps classify documents into predefined categories based on the topics they contain. For instance, a collection of news articles can be categorized into topics like politics, sports, technology, and health. The algorithm identifies the predominant themes in each document and assigns it to the most relevant category. This process automates and accelerates the classification of large volumes of text, making it easier to organize and retrieve information.

Additionally, topic modeling enhances the accuracy and efficiency of document classification by reducing human error and bias. It can handle multilingual text and adapt to different domains without needing extensive manual adjustments. By continuously learning from new data, topic modeling algorithms can evolve and improve over time, ensuring that the classification remains relevant and up-to-date.

Effortlessly Tag Customer Support Requests

Customer support teams receive numerous queries daily. Topic modeling can analyze these requests and automatically tag them with relevant topics such as billing issues, technical support, product inquiries, or service feedback. By categorizing requests, support teams can prioritize and route them to the appropriate departments or specialists, improving response times and customer satisfaction.

Topic modeling enhances the accuracy of tagging by minimizing manual errors and ensuring consistency in categorization. This automated tagging system can also identify emerging trends or recurring issues, enabling support teams to proactively address common problems.

Scaling Customer Feedback Analysis

Companies often collect vast amounts of feedback from customers through surveys, reviews, social media, and other channels. Topic modeling can process this feedback to identify recurring themes and sentiments, such as common complaints, product suggestions, or praise. This analysis helps businesses understand customer needs and preferences at scale, allowing them to make data-driven decisions to enhance products and services.

Topic modeling allows businesses to detect shifts in customer sentiment over time, providing early warnings about potential issues or emerging trends. It can segment feedback by demographic or geographic factors, offering more nuanced insights into customer behavior. By automating the analysis of large-scale feedback, topic modeling frees up resources, enabling teams to focus on strategic initiatives rather than manual data processing.

Crafting Content That Resonates

Content creators, marketers, and writers aim to produce content that engages their audience. Topic modeling can analyze existing content and audience interactions to identify trending topics and themes. By understanding what resonates with their audience, creators can tailor their content to match these interests, increasing engagement, and relevance. This approach ensures that the content is aligned with the audience’s preferences and needs.

Topic modeling allows content creators to discover gaps in current content offerings, revealing opportunities for new and unique topics that have not yet been explored. It can also track changes in audience interests over time, helping creators to adapt their strategies accordingly. By analyzing feedback and interactions, such as comments, likes, and shares, topic modeling provides insights into the types of content that generate the most engagement.

Understanding Employee Sentiments

Organizations often conduct employee surveys and collect feedback through various channels to gauge employee sentiment and workplace satisfaction. Topic modeling can analyze this data to uncover underlying themes and sentiments, such as concerns about work-life balance, management practices, or workplace culture. By understanding these sentiments, organizations can address issues, improve employee morale, and create a more positive work environment.

Topic modeling enables organizations to track changes in employee sentiment over time, allowing them to measure the impact of implemented policies and initiatives. This analysis can segment feedback by department, tenure, or other relevant factors, providing a detailed understanding of specific groups' experiences within the organization.

How Does Topic Modeling Work?

Topic modeling is a method used in natural language processing (NLP) and text mining to uncover hidden patterns within a large collection of texts. It is an unsupervised machine-learning technique, meaning it does not require labeled data. Here’s a detailed explanation of how topic modeling works, particularly focusing on the most commonly used algorithm, Latent Dirichlet Allocation (LDA). The steps involved in Topic Modeling are:

Text Preprocessing

Text Preprocessing is a vital step in preparing text data for modeling, particularly for algorithms like Latent Dirichlet Allocation (LDA), which is used for topic modeling. This process involves several steps aimed at cleaning and standardizing the text data to enhance the performance of the model. 

Tokenization is the process of breaking down text into individual units, usually words or phrases, called tokens. This step simplifies the text and makes it easier to analyze. For the sentence "The cat sat on the mat," tokenization would result in ["The", "cat", "sat", "on", "the", "mat"]. Stop words are common words that carry little semantic meaning and are often removed to focus on the more significant words. Removing these words helps reduce the noise in the data. Words like "the," "is," and "and" are typically removed from the token list. For the tokenized example above, removing stop words might result in ["cat", "sat", "mat"].

Lemmatization and stemming are both techniques used to reduce words to their base forms. Lemmatization reduces words to their base or dictionary form by considering the word's meaning and context, resulting in more accurate reductions such as "better" becoming "good" and "ran" becoming "run." On the other hand, stemming removes suffixes to reduce words to their root form, without considering context, which can lead to less accurate reductions. For example, "running," "runner," and "ran" might all be reduced to "run." While lemmatization ensures the base form is a proper word, stemming focuses on simplifying text for analysis.
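
As an illustration, here is a minimal preprocessing sketch using NLTK (assuming the required NLTK data packages are available; spaCy or Gensim offer equivalent functionality):

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

# One-time downloads of the required NLTK resources
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

# Tokenization: break the sentence into individual word tokens
tokens = nltk.word_tokenize("The cat sat on the mat")

# Stop-word removal: drop low-information words like "the" and "on"
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # ['cat', 'sat', 'mat']

# Lemmatization and stemming reduce words to their base forms
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
print([lemmatizer.lemmatize(t) for t in filtered])
print([stemmer.stem(t) for t in filtered])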

Model Initialization

The model initialization step in Latent Dirichlet Allocation (LDA) sets the stage for the iterative refinement process that follows. Initially, each word in each document is randomly assigned to one of a predefined number of topics. This random assignment provides a starting point for the iterative refinement process. Since LDA is a probabilistic model, starting with a random distribution allows the algorithm to explore the topic space effectively. Consider a document with the words ["cat", "sat", "mat"]. If we have three topics, the initial assignment might randomly assign "cat" to Topic 1, "sat" to Topic 2, and "mat" to Topic 3.

This random assignment is purely an initial guess. The LDA algorithm will iteratively refine these assignments to discover the true underlying topic structure in the data. The random initialization helps ensure that the algorithm does not start with any biases and can explore a wide range of possible topic distributions.
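
As a toy illustration (hypothetical code, not part of any LDA library), the random initialization amounts to something like this:

import random

# Assign each word occurrence to one of the topics uniformly at random
words = ["cat", "sat", "mat"]
num_topics = 3
assignments = [(word, random.randrange(num_topics)) for word in words]
print(assignments)  # e.g. [('cat', 0), ('sat', 2), ('mat', 1)]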

LDA not only adjusts which topic a word belongs to, but it also updates its understanding of what each topic is about. This dual updating process—adjusting word assignments and refining topic descriptions—helps LDA to accurately capture the main themes or topics within a collection of documents.

Word-Topic Assignment

The word-topic assignment step in Latent Dirichlet Allocation (LDA) is a key component of the iterative process that refines the model to discover the underlying topics in a corpus (collection of documents). For each word in each document, the LDA algorithm reassigns the word to a topic based on the probability distribution that considers two factors:

Topic-Word Distribution

The probability of the word given the topic: it indicates how strongly a word is associated with a particular topic. This is computed as the number of times the word is assigned to the topic across all documents, divided by the total number of words assigned to that topic. It is denoted by the symbol Φ.

Document-Topic Distribution

The probability of the topic given the document: it reflects how strongly a topic is associated with a particular document. This is computed as the number of words in the document assigned to the topic, divided by the total number of words in the document. It is denoted by the symbol θ.
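
Combining the two, the probability of reassigning a word w in document d to topic t is proportional to the product of these distributions (simplified notation that omits the Dirichlet smoothing priors α and β the full model adds to these counts):

P(topic = t | word = w, document = d) ∝ Φ(w | t) × θ(t | d)

The word is then assigned to a topic sampled from this distribution, and the counts behind Φ and θ are updated accordingly.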

After several iterations, the algorithm converges, and the final topic-word and document-topic distributions are calculated.
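
To make the whole procedure concrete, here is a minimal LDA sketch using Gensim (the toy corpus and parameters such as num_topics and passes are illustrative):

from gensim import corpora
from gensim.models import LdaModel

# Toy corpus: each document is a list of preprocessed tokens
documents = [
    ["cat", "sat", "mat"],
    ["dog", "ate", "bone"],
    ["cat", "chased", "dog"],
]

# Map each unique token to an integer id and build bag-of-words vectors
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Fit LDA: each topic becomes a distribution over words,
# and each document a distribution over topics
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)

for topic_id, topic_words in lda.print_topics():
    print(topic_id, topic_words)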

Topic Modeling for YouTube Videos using BERTopic

What is BERTopic?

BERTopic is a sophisticated topic modeling technique that leverages the power of transformers and a variant of Term Frequency-Inverse Document Frequency (TF-IDF) called Class-based TF-IDF (c-TF-IDF). This combination allows BERTopic to create dense clusters of topics that are easily interpretable while maintaining the relevance of important words in the topic descriptions. Here is a detailed explanation of BERTopic and its key components:

A Transformer Embedding Model

A transformer embedding model is a type of neural network architecture designed to generate high-quality, contextual representations of text. BERTopic supports several libraries for encoding text into dense vector embeddings that capture the contextual relationships between words in a document. We can use a suitable embedding model from one of the supported libraries, which include Sentence Transformers, Flair, spaCy, and Gensim.
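
For instance, with the Sentence Transformers library (using the same all-MiniLM-L6-v2 model we use later in this post):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Each sentence is encoded into a dense vector; this model produces 384 dimensions
embeddings = model.encode(["Binary search halves the search space at every step."])
print(embeddings.shape)  # (1, 384)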

Dimensionality Reduction

Dimensionality reduction is a process in data analysis and machine learning that reduces the number of random variables under consideration. It involves transforming data from a high-dimensional space into a lower-dimensional space while preserving as much relevant information as possible. This technique is crucial for simplifying models, improving computational efficiency, and overcoming issues associated with high-dimensional data, such as the "curse of dimensionality."

After building our embeddings, BERTopic compresses them into a lower-dimensional space to perform the clustering step effectively and visualize our data. BERTopic employs UMAP to perform the dimensionality reduction step.

UMAP (Uniform Manifold Approximation and Projection) is a non-linear dimensionality reduction technique that preserves the local and global structure of the data, making it suitable for clustering.

Clustering

Clustering is a technique used in data analysis and machine learning to group a set of objects in such a way that objects in the same group (or cluster) are more similar to each other than to those in other groups. It is an unsupervised learning method, meaning that it does not require labeled data to perform the grouping.

In BERTopic, transformer models like BERT generate high-quality, contextual embeddings of text. These embeddings capture the semantic meaning of words and sentences. HDBSCAN is then applied to these embeddings to cluster documents into meaningful topics. By leveraging HDBSCAN’s strengths in handling varying densities, identifying noise, and not requiring a predefined number of clusters, BERTopic can produce robust and interpretable topics.

For instance, when analyzing a large corpus of text such as customer reviews, BERTopic with HDBSCAN can effectively group similar reviews into topics like "product quality," "customer service," and "delivery experience," while filtering out irrelevant or noisy data. This results in a more nuanced and actionable understanding of the data.
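
Here is a minimal sketch of how these two steps fit together on precomputed embeddings (the array shape and parameter values are illustrative, not BERTopic's internals):

import numpy as np
from umap import UMAP
from hdbscan import HDBSCAN

# Stand-in for real sentence embeddings: 1,000 documents x 384 dimensions
embeddings = np.random.rand(1000, 384)

# Compress the embeddings while preserving local and global structure
reduced = UMAP(n_components=5, metric="cosine").fit_transform(embeddings)

# Group documents by density; a label of -1 marks noise/outlier documents
labels = HDBSCAN(min_cluster_size=15).fit_predict(reduced)
print(set(labels))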

Topic Representation

Once the clusters (topics) are formed, BERTopic uses c-TF-IDF to extract the most representative words for each topic. c-TF-IDF treats all documents in a cluster as a single class and scores words based on their frequency within that class relative to their frequency across the entire corpus, highlighting the terms that are unique to each topic.
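
In the formulation from the BERTopic paper, the c-TF-IDF weight of a term x in a topic class c is, roughly:

W(x, c) = tf(x, c) × log(1 + A / f(x))

where tf(x, c) is the frequency of term x within class c, f(x) is its frequency across all classes, and A is the average number of words per class. The logarithmic factor plays the role of inverse document frequency, computed over topic classes rather than individual documents.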

Why should we use BERTopic over traditional Topic Modeling algorithms?

BERTopic offers several advantages over traditional topic modeling algorithms like Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF).

Contextual Understanding

BERTopic captures the context of words using transformer-based models, unlike traditional methods such as LDA and NMF, which do not account for context. Transformers, like BERT, provide deep contextual embeddings by considering the position and relationship of words within a sentence. This leads to more accurate and meaningful topic extraction, as the model understands the nuanced meanings and relationships between words, resulting in higher-quality topics.

High-Quality Embeddings

BERTopic uses dense, high-quality embeddings from BERT, offering better semantic representation than traditional sparse representations like TF-IDF. BERT embeddings capture the nuanced meaning and relationships between words, resulting in more coherent and semantically rich topics. This enhances the clustering of topics, making them more meaningful and easier to interpret.

Flexibility and Customization

BERTopic allows integration with different embedding models and customization of dimensionality reduction and clustering techniques. This flexibility enables users to fine-tune the topic modeling process based on specific datasets and use cases. For example, one can choose between various transformers and dimensionality reduction methods, adapting the model to best fit the data's characteristics.

Dynamic Topic Modeling

BERTopic can dynamically update topics as new data arrives, eliminating the need to retrain the entire model from scratch. This capability is especially beneficial for applications with continuously evolving data, such as real-time social media analysis, where topics need to remain current and reflect the latest trends without extensive computational overhead.

Handling Short Texts

BERTopic excels in handling short texts, effectively capturing their meaning, while traditional methods like LDA and NMF often struggle with limited text. BERT’s contextual embeddings allow BERTopic to understand and cluster short texts, such as tweets or reviews, more accurately, ensuring that even brief documents are meaningfully categorized.

Visualization and Insights

BERTopic offers integrated and powerful visualization tools, such as intertopic distance maps and bar charts, which enhance the interpretability and analysis of generated topics. These visualizations help users explore and understand the relationships between topics, making it easier to derive actionable insights and communicate findings effectively. Traditional methods often lack these advanced visualization capabilities.

Using Whisper AI and BERTopic to model YouTube Videos

Installation

Before diving into the actual code, it is important to install a few essential packages, namely Whisper, BERTopic, and Pytube. These packages provide crucial functionalities for our project and ensure the smooth implementation of various tasks.

Whisper AI is a state-of-the-art speech recognition system developed by OpenAI. This advanced tool is designed to accurately convert spoken language into written text. It excels in providing high accuracy and robustness across different audio qualities, accents, and languages. Whisper AI’s capabilities make it an invaluable tool for tasks requiring precise transcription of audio content into text, accommodating diverse speech patterns and environments.

The Pytube library is a Python package designed to facilitate the downloading of videos from YouTube. It provides a simple and intuitive interface for accessing YouTube content, extracting metadata from videos, and downloading video or audio streams in various formats. Pytube makes it easy to handle YouTube videos programmatically, offering functionalities like video search, resolution selection, and format conversion, which are essential for managing and utilizing online video content efficiently.


!pip install bertopic
!pip install pytube
!pip install --upgrade git+https://github.com/openai/whisper.git

Data Ingestion

We are performing topic modeling on YouTube videos from the 'Take U Forward' channel, specifically focusing on the Binary Search Playlist. To proceed with this task, we need to gather the video URLs from this playlist. This can be achieved using the YouTube Data API with an API key, which allows programmatic access to YouTube content, or by manually inputting some video URLs.


video_urls = [
    'https://www.youtube.com/watch?v=MHf6awe89xw',
    'https://www.youtube.com/watch?v=6zhGS79oQ4k',
    'https://www.youtube.com/watch?v=hjR1IYVx9lY',
    'https://www.youtube.com/watch?v=5qGrJbHhqFs',
    'https://www.youtube.com/watch?v=w2G2W8l__pc',
]

Once we have our URLs, we can start downloading the videos and extracting the transcripts. To create those transcripts, we make use of Whisper.

Below, we import our Whisper model:


import whisper

whisper_model = whisper.load_model("tiny")

Then, we iterate over our YouTube URLs, download the audio, and finally pass them through our Whisper model in order to generate the transcriptions.

YouTube(url) creates an instance of the YouTube class from the pytube library with the specified URL, granting access to the video's streams and metadata. The streams.filter(only_audio=True) method filters to only audio streams, and [0] selects the first audio stream from the filtered list. The download(filename="audio.mp4") method downloads this selected audio stream as "audio.mp4", with the path to the downloaded file stored in the variable path.


from pytube import YouTube

docs = []

# Loop through the video URLs, transcribe them, and store the text in the docs list
for url in video_urls:
    # Download the audio stream of the video
    path = YouTube(url).streams.filter(only_audio=True)[0].download(filename="audio.mp4")
    # Transcribe the audio with Whisper
    transcription = whisper_model.transcribe(path)
    docs.append(transcription["text"])


Text Preprocessing

We split each transcript into sentences on full stops and question marks. This increases the number of documents in our dataset, which improves the clustering for topic modeling.


import re

def split_text_into_sentences(text):
    # Split text into sentences after full stops, question marks, and newlines
    return re.split(r'(?<=[.?\n])\s+', text.strip())


texts = []
for doc in docs:
    texts += split_text_into_sentences(doc)


print(len(texts))
# Output - 2162



BERTopic Pipeline

These components form a comprehensive topic modeling pipeline. First, text documents are converted into dense embeddings and then reduced to a lower-dimensional space for easier clustering. Next, the reduced embeddings are clustered into topics, and the text is tokenized to create a matrix of token counts. Finally, the token counts are transformed into a class-based TF-IDF matrix to clearly represent topics.



from umap import UMAP
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer


First, we initialize a sentence embedding model using the SentenceTransformer library with the all-MiniLM-L6-v2 model. This model is used to convert sentences into numerical vectors. These embeddings capture the semantic meanings of the sentences, which can be used for comparison and clustering in later steps.

Here, a UMAP (Uniform Manifold Approximation and Projection) model is set up to reduce the dimensionality of the high-dimensional embeddings from the previous step. This reduction makes the data easier to handle and visualize. The parameters define how the UMAP model behaves:

  • n_neighbors=5: The number of neighboring points used in manifold approximation.
  • n_components=3: The number of dimensions to reduce the data to.
  • min_dist=0.0: The minimum distance between points in low-dimensional space.
  • metric='cosine': The metric used to measure distance in high-dimensional space, focusing on angles instead of Euclidean distance.

The clustering step involves clustering the dimensionally reduced embeddings using HDBSCAN, a density-based clustering algorithm. It helps to group the data into clusters based on density, with the following parameters:

  • min_cluster_size=5: The smallest size a cluster can be.
  • metric='euclidean': The distance metric for clustering (in this case, standard Euclidean distance).
  • cluster_selection_method='eom': The method for selecting clusters from the cluster hierarchy.
  • prediction_data=True: Allows the model to predict which cluster new data points belong to.

A CountVectorizer is initialized to convert text data into a token count matrix, effectively tokenizing the text. It removes common English stop words to focus on more meaningful words. ClassTfidfTransformer is a modification of the TF-IDF approach to be more suited for topic modeling, emphasizing words that are more unique to each topic.



# Step 1 - Extract embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=5, n_components=3, min_dist=0.0, metric='cosine')

# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=5, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(stop_words="english")

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer()

# Step 6 - (Optional) Fine-tune topic representations with
# a `bertopic.representation` model
representation_model = KeyBERTInspired()


Next, we initialize a BERTopic model with the specified embedding, dimensionality reduction, clustering, tokenization, and representation models. We then fit this model to our set of texts, producing topics and their associated probabilities.



topic_model = BERTopic(
  embedding_model=embedding_model,          
  umap_model=umap_model,                    
  hdbscan_model=hdbscan_model,              
  vectorizer_model=vectorizer_model,        
  ctfidf_model=ctfidf_model,                
  representation_model=representation_model
)


# Fit the model and assign a topic to each document
topics, probs = topic_model.fit_transform(texts)

# Overview of all discovered topics and their sizes
topic_model.get_topic_info()

# Top words for topic 0 with their c-TF-IDF scores
topic_model.get_topic(0)



import pandas as pd

# Pair each document with its assigned topic for inspection
df = pd.DataFrame({"topic": topics, "document": texts})
df



Visualization

There are many visualization techniques, but some of the most important are:

Intertopic Distance Map:

An Intertopic Distance Map is a visualization tool used in topic modeling to represent the relationships and distances between different topics in a two-dimensional space. This map helps to understand how similar or different the topics are to each other, providing valuable insights into the structure of the data.



topic_model.visualize_topics()


Bar Chart:

The bar chart tool in BERTopic is used to visualize the most frequent words within each topic, providing a clear representation of the key terms that define each topic. This visualization helps in understanding the essence of each topic by highlighting the top words associated with it.



topic_model.visualize_barchart()


By analyzing the topic modeling results and identifying the most frequent words for Topic 0, we can confidently predict that the topic is centered around binary search algorithms in coding.

Do you want to add Topic Modeling to your Application?

If you are looking to integrate Topic Modeling into your application, Mercity.ai can help. We specialize in developing highly tailored NLP solutions for various industries and business domains. Contact us today and let us create a Topic Modeling solution for your application.
