Tutorial on Analyzing Group Chats

Reading every message in a group chat can be tedious, especially when you return to more than 100 unread texts. messageanalyzer lightens that load by providing an accessible gateway to Natural Language Processing (NLP) through fast, efficient analysis tools. This short tutorial shows how to perform several NLP tasks on group chats using messageanalyzer.

For this tutorial, we will introduce and guide you through four popular NLP tasks leveraging the functions in messageanalyzer:

  • Extract keywords

  • Extract topics

  • Sentiment analysis

  • Detect language patterns

Import

First, let’s import messageanalyzer as below. We can check the version of messageanalyzer by using the attribute .__version__.

import messageanalyzer

print(messageanalyzer.__version__)
2.0.3

Below is a list of sample texts that cover a diverse range of conversations among peers. We will work with these sample texts for the majority of the tutorial.

sample_text = [
    "Hey, has anyone watched the latest episode of that new series?",
    "Not yet! Is it worth watching?",
    "Totally! The twists this season are insane.",
    "Okay, now I’m intrigued. Adding it to my watchlist.",
    "I’m thinking of baking cookies today. Any flavor suggestions?",
    "Chocolate chip, always a classic!",
    "Good choice! I’ll try adding some sea salt on top for extra flavor.",
    "Ooh, sea salt on cookies? That sounds amazing. Let us know how it turns out."
    "What’s the best vacation spot you’ve been to?",
    "Bali, hands down. The beaches are incredible.",
    "Oh, I’ve always wanted to visit Bali. Did you go to any of the waterfalls there?",
    "Yes! Tegenungan Waterfall was breathtaking.",
    "Does anyone else feel like AI is moving so fast?",
    "For sure. It’s exciting but also a little scary sometimes.",
    "Agreed. I mean, just look at how tools like ChatGPT are changing how we work.",
    "True, but it’s also helping with so many tasks I used to find tedious.",
    "I just got a new gaming headset, and it’s a game-changer!",
    "Nice! Which one did you get?",
    "The HyperX Cloud II. The sound quality is amazing, and it’s super comfortable.",
    "I’ve heard good things about that one. Great pick!"
]

Keyword Extraction

Keyword extraction is a technique for identifying the most relevant words or phrases in a given set of text messages. The extract_keywords function performs keyword extraction using Term Frequency-Inverse Document Frequency (TF-IDF). It is helpful for summarizing text, identifying key terms, or preprocessing data for further text analysis tasks.

To use the keyword extraction function, import extract_keywords from messageanalyzer.extract_keywords.

from messageanalyzer.extract_keywords import extract_keywords

The call below extracts the top three keywords from each message using TF-IDF, so each message's keywords are ranked by their importance relative to the entire group chat. This is done by setting the parameters method="tfidf" and num_keywords=3.

keywords = extract_keywords(sample_text, method="tfidf", num_keywords=3)

print(keywords[:5])
[['watched', 'series', 'latest'], ['worth', 'watching', 'yes'], ['twists', 'totally', 'season'], ['watchlist', 'okay', 'intrigued'], ['today', 'thinking', 'suggestions']]

From the first five results shown above, we can see that the conversation opens with a discussion about watching a TV series.
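To make the idea of TF-IDF more concrete, here is a minimal standard-library sketch of per-message keyword ranking. It is not the package's implementation: the helper names tokenize and top_keywords_tfidf are hypothetical, and the tokenization, the unsmoothed IDF formula, and the alphabetical tie-breaking are all assumptions made for illustration. The package's own results may differ, for example because of stop-word removal or smoothing.

```python
import math
import re
from collections import Counter


def tokenize(message):
    """Lowercase a message and split it into alphabetic word tokens (assumed scheme)."""
    return re.findall(r"[a-z]+", message.lower())


def top_keywords_tfidf(messages, num_keywords=3):
    """Return the highest-scoring TF-IDF terms for each message (illustrative only)."""
    tokenized = [tokenize(m) for m in messages]
    n_docs = len(tokenized)
    # Document frequency: the number of messages that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # TF is the term's share of the message; IDF is log(N / document frequency).
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Rank by score (descending), then alphabetically so ties are stable.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:num_keywords])
    return results


toy_chat = ["the cat sat", "the dog sat", "the bird flew"]
print(top_keywords_tfidf(toy_chat, num_keywords=2))
```

In this toy chat, "the" appears in every message and therefore scores zero, which is why words that are rare across the chat tend to surface as keywords.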

Topic Modeling

Another way to summarize texts is topic modeling. Topic modeling identifies the different topics mentioned in a text and represents each topic with a group of words or phrases taken from that text. Our topic modeling function uses Non-negative Matrix Factorization to reduce the text corpus to multiple topics. This is helpful for summarizing long texts and identifying their common themes.

To use the topic modeling function in our package, import topic_modeling from messageanalyzer.topic_modeling as below.

from messageanalyzer.topic_modeling import topic_modeling

Now we can apply topic modeling to our sample texts.

The call below returns 10 topics, each represented by 3 words selected from the sample texts. This is done by setting the parameters n_topics = 10 and n_words = 3.

Note: A runtime warning might be raised when the number of topics requested exceeds the maximum number of topics that Non-negative Matrix Factorization can extract. This is expected, and the function still returns as many topics as requested.

topic_modeling(sample_text, n_topics = 10, n_words = 3)
{'Topic 1': ['sound', 'super', 'comfortable'],
 'Topic 2': ['yes', 'waterfall', 'tegenungan'],
 'Topic 3': ['beaches', 'hands', 'incredible'],
 'Topic 4': ['like', 'does', 'fast'],
 'Topic 5': ['twists', 'totally', 'season'],
 'Topic 6': ['flavor', 'adding', 'sea'],
 'Topic 7': ['worth', 'watching', 'thinking'],
 'Topic 8': ['sure', 'little', 'scary'],
 'Topic 9': ['did', 'nice', 'waterfalls'],
 'Topic 10': ['new', 'series', 'watched']}

Using only 3 representative words for each topic is not very insightful. Some topics are easy to interpret, such as Topic 5 (insane twists in a TV series season), but others, such as Topic 9, do not tell us anything useful. Let’s try adding more words for each topic.

This time we use 5 words from the sample texts to represent each of the 10 topics. The random_state parameter is also set to ensure reproducibility when rerunning the function. Its default value is 123.

topic_modeling(sample_text, n_topics = 10, n_words = 5, random_state = 456)
/home/docs/checkouts/readthedocs.org/user_builds/dsci524-text-analyzer-19/envs/latest/lib/python3.9/site-packages/sklearn/decomposition/_nmf.py:1742: ConvergenceWarning: Maximum number of iterations 200 reached. Increase it to improve convergence.
  warnings.warn(
{'Topic 1': ['beaches', 'hands', 'incredible', 'bali', 'sure'],
 'Topic 2': ['new', 'series', 'watched', 'latest', 'hey'],
 'Topic 3': ['twists', 'totally', 'season', 'insane', 'chip'],
 'Topic 4': ['watchlist', 'intrigued', 'okay', 'adding', 'll'],
 'Topic 5': ['yes', 'waterfall', 'tegenungan', 'breathtaking', 'tedious'],
 'Topic 6': ['like', 'does', 'fast', 'feel', 'ai'],
 'Topic 7': ['cookies', 'flavor', 'sea', 'salt', 'suggestions'],
 'Topic 8': ['things', 'heard', 'pick', 'great', 'good'],
 'Topic 9': ['did', 'nice', 'waterfalls', 'visit', 'oh'],
 'Topic 10': ['sound', 'super', 'comfortable', 'cloud', 'ii']}

This is more informative and detailed than the previous result. For instance, we can now deduce that the group talked about the incredible beaches in Bali from Topic 1.
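Warnings like the one printed above can clutter output when the function is called repeatedly. The sketch below shows one way to collect warnings instead of printing them, using Python's standard warnings module. The wrapper call_and_collect_warnings and the stand-in function noisy_stand_in are hypothetical names introduced here for illustration; the stand-in only imitates a call that warns and still returns a result.

```python
import warnings


def call_and_collect_warnings(func, *args, **kwargs):
    """Call func and return (result, warning messages) instead of printing warnings."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning raised inside this block, even repeated ones.
        warnings.simplefilter("always")
        result = func(*args, **kwargs)
    return result, [str(w.message) for w in caught]


def noisy_stand_in(n_topics):
    """Hypothetical stand-in for a call that warns but still returns a result."""
    warnings.warn("Maximum number of iterations reached.", RuntimeWarning)
    return {f"Topic {i + 1}": [] for i in range(n_topics)}


topics, warning_messages = call_and_collect_warnings(noisy_stand_in, 2)
print(topics)
print(warning_messages)
```

With the package installed, the same wrapper could presumably be applied to the real function, for example call_and_collect_warnings(topic_modeling, sample_text, n_topics=10, n_words=5).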

Sentiment Analysis

The analyze_sentiment function evaluates the sentiment of one or more input messages using TextBlob. It calculates a polarity score for each message and labels the sentiment as positive, negative, or neutral. If the sentiment is highly negative (a score below -0.2), it triggers an alert. The function returns a list of dictionaries, each containing the message, its sentiment score, its label, and an alert flag if applicable.
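The labeling rule described above can be sketched in a few lines. This is an illustration rather than the package's code: the function name label_polarity is hypothetical, and the cut-off at zero between positive, neutral, and negative is an assumption (it matches the labels in the output shown later in this tutorial, but the package may use a different rule). Only the -0.2 alert threshold comes from the description above.

```python
def label_polarity(message, score, alert_threshold=-0.2):
    """Build one result entry from a polarity score (assumed labeling rule)."""
    entry = {"message": message, "score": score}
    # Flag highly negative messages, i.e. scores below the alert threshold.
    if score < alert_threshold:
        entry["alert"] = True
    # Assumption: any score above zero is positive, below zero negative.
    if score > 0:
        entry["label"] = "positive"
    elif score < 0:
        entry["label"] = "negative"
    else:
        entry["label"] = "neutral"
    return entry


print(label_polarity("Totally! The twists this season are insane.", -0.5))
print(label_polarity("Not yet! Is it worth watching?", 0.3))
```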

To use the sentiment analysis function, import analyze_sentiment from messageanalyzer.sentiment_analysis.

from messageanalyzer.sentiment_analysis import analyze_sentiment

The example below shows how to use the analyze_sentiment function to perform sentiment analysis on our sample_text. The "Default" model uses the pretrained model from TextBlob. The sentiment of each message is evaluated through the model’s polarity score and categorized as positive, negative, or neutral according to its value. Because messages with highly negative content deserve extra attention, an alert is triggered whenever the score is below the threshold of -0.2.

analyze_sentiment(sample_text, "Default")
ALERT: Message is highly negative - Totally! The twists this season are insane.
ALERT: Message is highly negative - Agreed. I mean, just look at how tools like ChatGPT are changing how we work.
[{'message': 'Hey, has anyone watched the latest episode of that new series?',
  'score': 0.3181818181818182,
  'label': 'positive'},
 {'message': 'Not yet! Is it worth watching?',
  'score': 0.3,
  'label': 'positive'},
 {'message': 'Totally! The twists this season are insane.',
  'score': -0.5,
  'alert': True,
  'label': 'negative'},
 {'message': 'Okay, now I’m intrigued. Adding it to my watchlist.',
  'score': 0.5,
  'label': 'positive'},
 {'message': 'I’m thinking of baking cookies today. Any flavor suggestions?',
  'score': 0.0,
  'label': 'neutral'},
 {'message': 'Chocolate chip, always a classic!',
  'score': 0.20833333333333331,
  'label': 'positive'},
 {'message': 'Good choice! I’ll try adding some sea salt on top for extra flavor.',
  'score': 0.4583333333333333,
  'label': 'positive'},
 {'message': 'Ooh, sea salt on cookies? That sounds amazing. Let us know how it turns out.What’s the best vacation spot you’ve been to?',
  'score': 0.8,
  'label': 'positive'},
 {'message': 'Bali, hands down. The beaches are incredible.',
  'score': 0.37222222222222223,
  'label': 'positive'},
 {'message': 'Oh, I’ve always wanted to visit Bali. Did you go to any of the waterfalls there?',
  'score': 0.0,
  'label': 'neutral'},
 {'message': 'Yes! Tegenungan Waterfall was breathtaking.',
  'score': 1.0,
  'label': 'positive'},
 {'message': 'Does anyone else feel like AI is moving so fast?',
  'score': 0.2,
  'label': 'positive'},
 {'message': 'For sure. It’s exciting but also a little scary sometimes.',
  'score': 0.02812500000000001,
  'label': 'positive'},
 {'message': 'Agreed. I mean, just look at how tools like ChatGPT are changing how we work.',
  'score': -0.3125,
  'alert': True,
  'label': 'negative'},
 {'message': 'True, but it’s also helping with so many tasks I used to find tedious.',
  'score': 0.11666666666666665,
  'label': 'positive'},
 {'message': 'I just got a new gaming headset, and it’s a game-changer!',
  'score': 0.17045454545454544,
  'label': 'positive'},
 {'message': 'Nice! Which one did you get?',
  'score': 0.75,
  'label': 'positive'},
 {'message': 'The HyperX Cloud II. The sound quality is amazing, and it’s super comfortable.',
  'score': 0.43333333333333335,
  'label': 'positive'},
 {'message': 'I’ve heard good things about that one. Great pick!',
  'score': 0.85,
  'label': 'positive'}]

The results above show that most messages in our sample text are positive, two are neutral, and two are negative. Alerts are printed for the two negative messages because their polarity scores fall below the -0.2 threshold.
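Because the function returns a list of dictionaries, the results are easy to summarize with plain Python. The sketch below tallies the labels and pulls out the messages that triggered an alert. The helper name summarize_sentiment is hypothetical, and example_results is a three-entry excerpt copied from the output above so that the snippet runs without the package installed.

```python
from collections import Counter


def summarize_sentiment(results):
    """Return (label counts, alerted messages) for a list of result dictionaries."""
    label_counts = Counter(entry["label"] for entry in results)
    # Entries without an 'alert' key are treated as not alerted.
    alerts = [entry["message"] for entry in results if entry.get("alert")]
    return dict(label_counts), alerts


# Excerpt of the output shown above.
example_results = [
    {"message": "Not yet! Is it worth watching?", "score": 0.3, "label": "positive"},
    {"message": "Totally! The twists this season are insane.", "score": -0.5,
     "alert": True, "label": "negative"},
    {"message": "I’m thinking of baking cookies today. Any flavor suggestions?",
     "score": 0.0, "label": "neutral"},
]

counts, alerts = summarize_sentiment(example_results)
print(counts)
print(alerts)
```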

Detect Language Patterns

Detecting language patterns is another great way to get a better understanding of text messages, especially when you have an international group of peers that speak in different languages.

The detect_language_patterns function spots patterns such as common n-grams (word combinations), frequently used characters, or the mix of languages in a dataset. These patterns can help you see key trends and details in the text, such as often-mentioned terms, writing styles, or the overall language makeup.

To use the language pattern detection function, import detect_language_patterns from messageanalyzer.detect_language_patterns as shown below.

from messageanalyzer.detect_language_patterns import detect_language_patterns

We are using a different sample text below to test the function’s ability to read a mix of languages. The sample covers a mix of themes, including artificial intelligence and meditation, and a mix of languages, including English, French, and Chinese.

mix_text = [
    "Artificial intelligence and machine learning are transforming industries around the globe.",
    "The basketball team secured a thrilling victory in the final seconds of the game.",
    "Yoga and meditation are excellent for reducing stress and improving mental health.",
    "Exploring the hidden beaches of Bali is an unforgettable experience for any traveler.",
    "Quantum computing is expected to revolutionize data processing and cryptography.",
    "L'intelligence artificielle et l'apprentissage automatique transforment les industries du monde entier.",
    "L'équipe de basket-ball a remporté une victoire passionnante dans les dernières secondes du match.",
    "Le yoga et la méditation sont excellents pour réduire le stress et améliorer la santé mentale.",
    "L'exploration des plages cachées de Bali est une expérience inoubliable pour tout voyageur.",
    "L'informatique quantique devrait révolutionner le traitement des données et la cryptographie.",
    "人工智能和机器学习正在改变全球各行各业。",
    "篮球队在比赛的最后几秒钟取得了激动人心的胜利。",
    "瑜伽和冥想是减压和改善心理健康的绝佳方式。",
    "探索巴厘岛隐秘的海滩对任何旅行者来说都是一次难忘的经历。",
    "量子计算有望彻底改变数据处理和密码学。"
]

The example below demonstrates how to use the detect_language_patterns function to analyze the sample text.

  1. Language Detection
    The first part detects the language of each message in the sample text by setting the parameter method="language". The result is a list of detected languages, where each entry corresponds to the language of a sentence in the sample text.

# Detect the language of each message in the sample text
result = detect_language_patterns(mix_text, method="language")
print(result)
['en', 'en', 'en', 'en', 'en', 'fr', 'fr', 'fr', 'fr', 'fr', 'zh-cn', 'zh-cn', 'zh-tw', 'zh-cn', 'zh-cn']

Each entry in the result (en for English, fr for French, and zh-cn for Chinese) corresponds to one sentence in mix_text. Note that one of the Chinese sentences is labeled zh-tw rather than zh-cn.
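To get the overall language makeup of the chat, the list of detected codes can be tallied with collections.Counter. The sketch below uses the output shown above as a literal list, so it runs without the package. The helper name language_counts is hypothetical.

```python
from collections import Counter


def language_counts(codes):
    """Return (language code, count) pairs, most frequent first."""
    return Counter(codes).most_common()


# Output of the language detection call shown above.
detected = ['en', 'en', 'en', 'en', 'en', 'fr', 'fr', 'fr', 'fr', 'fr',
            'zh-cn', 'zh-cn', 'zh-tw', 'zh-cn', 'zh-cn']

print(language_counts(detected))
```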

  2. Bigram Extraction
    The second part identifies the top 5 most common bigrams (two-word combinations) in the sample text by setting the parameters method="ngrams", n=2, and top_n=5. The output shows the bigrams along with their frequencies.

# Extract the top 5 most common bigrams (two-word combinations)
result = detect_language_patterns(mix_text, method="ngrams", n=2, top_n=5)
print(result)
[('et la', np.int64(2)), ('artificial intelligence', np.int64(1)), ('intelligence and', np.int64(1)), ('and machine', np.int64(1)), ('machine learning', np.int64(1))]

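The bigram et la appears twice, both times in the French sentences, while the other four bigrams each occur once and all come from the first English sentence. The counts are displayed as np.int64 values, which is NumPy's integer type.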
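For readers curious about how n-gram counting works, here is a minimal standard-library sketch. It is not the package's implementation: the function name top_ngrams is hypothetical, and the tokenization (lowercased runs of two or more word characters) is an assumption, so the results may not match the package's output exactly.

```python
import re
from collections import Counter


def top_ngrams(messages, n=2, top_n=5):
    """Count word n-grams within each message and return the most common ones."""
    counts = Counter()
    for message in messages:
        # Assumed tokenization: lowercase runs of at least two word characters.
        tokens = re.findall(r"\w\w+", message.lower())
        # Slide a window of n tokens across the message; n-grams never span messages.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts.most_common(top_n)


print(top_ngrams(["et la vie", "et la mer"], n=2, top_n=2))
```

Counting within each message separately means that the last word of one message is never paired with the first word of the next.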