неділя, 5 липня 2020 р.

Telegram Chats History Analysis with R

In this article we describe an enhanced version of the analysis first presented in http://logic-explained.blogspot.com/2020/03/telegram-chats-history-analysis.html

People live digital lives and generate a lot of data. Group chats is one place where such traces of activity are preserved and can reveal interesting facts about the society structure and the individuals in it. We use Telegram as a messenger for some company-wide chats. In this article we demonstrate how to easily pull a Telegram chat message history and perform some basic analysis on it.

Disclaimer: all the personal identifiable information s.a. individual names and chat names are intentionally pseudonymized to not be revealed in the final report. This is configurable and done on the report forming level, so that whoever owns the data could attribute the stats to the real individuals. More on pseudonymization below.

Result: here is the anonymized report you can checkout - this is a result of executing the script which fetches and analyses several company wide chats.

Code (but of course not data) is available here: https://github.com/justadreamer/tghistory


Tools

Unlike the previous article where we used only SQL queries - the main tool for analysis used here is R. R is an excellent choice of a language and environment for data manipulation, analysis and graphing. Especially relatively simple exploratory data analysis which mainly consists in grouping by various criteria and obtaining summary statistics. R is highly extensible and comes with packages which enable literate programming and nice report generation - all done from a single RMarkdown document.
To fetch Telegram message history (ETL step = extract, transform, load) we used a Python script with a dependency on python-telegram - a wrapper over TDLib ("Telegram Database") - a cross-platform library for building Telegram clients. TDLib is an asynchronous library which handles networking, caching and data structures to provide an application with everything necessary to be a functional Telegram client.

PostgreSQL DB schema is below. Just 3 simple tables which contain just a tiny subset of the data fields which TDLib makes available for each entity.




Solution Architecture

The flow is 2-step.

First we extract chat message history using a python ETL script and load it into the Postgres DB, TDLib manages its own local message cache and we interact with it thru python-telegram wrapper.

Second we launch the knitr tool and feed it an R Markdown document which is a text report in Markdown interspersed with R snippets which are actually executed and perform the data manipulation, analysis and graphing.

R loads the data from the same Postgres DB we saved the messages into and is using the same config.yml file the ETL script uses - thus there is a guaranteed single source of truth for both parts of the system.

The dependency scheme is shown below:




Anonymization

It deserves a couple words. We anonymize or rather pseudonymize the data at an ingestion step - when it is loaded from a DB into R environment:


We borrow a list of animals used in Google Drive and make a reversible map of original usernames / chat titles to the anonymized ones made up of an animal name and a suffix. For a chat the suffix is "Chat", for the user the suffix is "Anonymous" - just to distinguish the entities. Once the data is ingested - it contains only anonymized entities and all further analysis is done on the anonymized entities. This means that you can not accidentally mention any real name in the produced report unless only by hardcoding it, because the ingested data you work with contains only anonymized names.

I mentioned that the map is bi-directional - that is we can deanonymize any name. With a small caveat that we have around 80 animals in the animals.txt file while there might be more users than animals - this means we need to reuse some animals if we have a list of anonymized entities longer than the list of animals - probably adding an index to a anonymized name if it is already present in a list. As a result of anonymization we get stats of this kind:


Stats

In the report we compute and graph the following stats:

1. The total number of messages in each chat of interest. We choose the chat with the highest number of messages as the one to analyze.

2. The distribution of messages by user.

3. The distribution of messages by day of week.

4. The distribution of messages by user and day of week.

5. The distribution of messages by day of week and hour of day.

6. The distribution of messages by person and by hour of day in the heaviest day of week.

7. The distribution of messages by content type.

8. The message length - min, max, IQR, median, mean.

9. Conversation analysis - who are the conversation starters and supporters.

Here are the most interesting distributions:
Day of week and hour of day:
Hour of day and user on Friday:
Text message lengths:

Checkout the full report here.

четвер, 12 березня 2020 р.

Telegram Chats History Analysis

People generate data. Living a digital life (at least to some extent) usually means leaving a lot of traces. One of the places to obtain such data is chat message history. Especially group chats. This is where traces of activity can tell us some facts about the society structure and the individuals who form it. 

As a weekend project for amusement and also to remove some rust from my coding muscles - I decided to pull message history from a couple of Telegram group chats I am a member of and then run some quick aggregations on this history in order to maybe learn some facts about my peers, chat buddies and myself.

I describe the methodology and give query examples and some depersonalized results. All work is open source https://github.com/justadreamer/tghistory.

Disclaimer: in this (and future) articles I try not to leak any personally identifiable information (at least intentionally), all names should be anonymized, so the data is depersonalized and does not lead to particular people.

Approach

We will pull the history of several selected Telegram chats into a PostgreSQL DB and then run aggregation queries on it to find out interesting facts. The ETL (Extract Transform Load) script is written in Python and is using python-telegram - a wrapper over TDLib ("Telegram Database") - a cross-platform library for building Telegram clients. Here is the diagram illustrating the setup:

Telegram Database Library

TDLib is an asynchronous library which handles networking, caching and data structures to provide an application with everything necessary to be a functional Telegram client. Asynchronous operation means that you issue a request and each method usually returns an AsynchronousResult object which is essentially a future - you then can block and wait for a response on this AsynchronousResult object. 
 

Postgres DB Schema


3 simple tables which contain just a tiny subset of the data fields which TDLib makes available for each entity

 

Queries

Chats 

We have exported messages for several chats, let's explore what chats and how many messages each contains and the proportion of total messages:

WITH
   SELECTION(chat_title, mid) AS (
       SELECT
           c.title,
           m.id
       FROM
           messages m
       INNER JOIN
           chats c
           ON c.id = m.chat_id
   ),
   TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
   chat_title,
   COUNT(mid) message_count,
   round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
  TOTAL, SELECTION
GROUP BY
   (chat_title, total_count)
ORDER BY
   message_count DESC;



User activity within a single chat

The cross activity of users between chats can be explored, but it is more interesting to explore individual chats and user behaviors within them. In all subsequent requests we limit the initial selection scope to messages of a single chat, thus we construct the queries in such a way that there is just one parameter: the title of the chat, and it is stated in the beginning of the query - so replace 'Telegram' title with a chat of interest title.

How many messages and what proportion of total do people send in a given chat sorted in a descending order by the number of messages? We create a view - we will need to reuse it for later queries. Also if you need just to look up f.e. the top 10 most active users - add a LIMIT 10 clause.

DROP VIEW IF EXISTS users_by_message_count;
CREATE VIEW users_by_message_count AS
WITH
   CHAT_TITLE(chat_title) AS (SELECT 'Telegram'),
   SELECTION(uid, username, first_name, last_name, mid) AS (
       SELECT
           u.id,
           u.username,
           u.first_name,
           u.last_name,
           m.id
       FROM
           CHAT_TITLE,
           users u
       INNER JOIN
           messages m
           ON u.id = m.sender_user_id
       INNER JOIN
           chats c
           ON c.id = m.chat_id
       WHERE
           c.title = chat_title
   ),
   TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
   uid,
   username,
   first_name,
   last_name,
   COUNT(mid) message_count,
   round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
  TOTAL, SELECTION
GROUP BY
   (uid, username, first_name, last_name, total_count)
ORDER BY
   message_count DESC;
SELECT * FROM users_by_message_count;

In the example result set below I replaced the Telegram chat with one of the group chats I am in:



Activity by day of week 

What is the most active day of the week in a given chat?
WITH
   CHAT_TITLE(chat_title) AS (SELECT 'Telegram'),
   SELECTION(day_of_week, mid) AS (
       SELECT
           to_char(m.send_date, 'Day'),
           m.id
       FROM
           CHAT_TITLE,
           messages m
       INNER JOIN
           chats c
           ON c.id = m.chat_id
       WHERE
           c.title = chat_title
   ),
   TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
   day_of_week,
   COUNT(mid) message_count,
   round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
  TOTAL, SELECTION
GROUP BY
   (day_of_week, total_count)
ORDER BY
   message_count DESC;



Not surprisingly Friday is the most active day of the week.

User activity by day of week

So now let's group by user and day_of_week and see the most active users and corresponding days of week. The trick here is the custom ordering. We first order by the user but we take the user from a view we previously created and order them by the number of messages they send in descending order - thus the most active person is first. The within that order for each person - we order days by the message_count in descending order.

WITH
   CHAT_TITLE(chat_title) AS (SELECT 'Telegram'),
   SELECTION(umessage_count, username, first_name, last_name, day_of_week, mid) AS (
       SELECT
           u.message_count,
           u.username,
           u.first_name,
           u.last_name,
           to_char(m.send_date, 'Day'),
           m.id
       FROM
           CHAT_TITLE,
           messages m
       INNER JOIN
           users_by_message_count u
           ON u.uid = m.sender_user_id
       INNER JOIN
           chats c
           ON c.id = m.chat_id
       WHERE
           c.title = chat_title
   ),
   TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
   username,
   first_name,
   last_name,
   day_of_week,
   COUNT(mid) message_count,
   round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
  TOTAL, SELECTION
GROUP BY
   (umessage_count, username, first_name, last_name, day_of_week, total_count)
ORDER BY
   umessage_count DESC, message_count DESC;




Activity by month of year

Let's research the activity by months of the year: what is the most active month of the year in a given chat?

WITH
   CHAT_TITLE(chat_title) AS (SELECT 'Telegram'),
   SELECTION(month, mid) AS (
       SELECT
           to_char(m.send_date, 'Month'),
           m.id
       FROM
           CHAT_TITLE,
           messages m
       INNER JOIN
           chats c
           ON c.id = m.chat_id
       WHERE
           c.title = chat_title
   ),
   TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
   month,
   COUNT(mid) message_count,
   round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
  TOTAL, SELECTION
GROUP BY
   (month, total_count)
ORDER BY
   message_count DESC;



However the result can be severely skewed due to the fact that some people have been added just recently.

User activity by month of year

Let's group by people similar to how we did for day of week:

WITH
   CHAT_TITLE(chat_title) AS (SELECT 'Telegram'),
   SELECTION(umessage_count, username, first_name, last_name, month, mid) AS (
       SELECT
           u.message_count,
           u.username,
           u.first_name,
           u.last_name,
           to_char(m.send_date, 'Month'),
           m.id
       FROM
           CHAT_TITLE,
           messages m
       INNER JOIN
           users_by_message_count u
           ON u.uid = m.sender_user_id
       INNER JOIN
           chats c
           ON c.id = m.chat_id
       WHERE
           c.title = chat_title
   ),
   TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
   username,
   first_name,
   last_name,
   month,
   COUNT(mid) message_count,
   round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
  TOTAL, SELECTION
GROUP BY
   (umessage_count, username, first_name, last_name, month, total_count)
ORDER BY
   umessage_count DESC, message_count DESC;


The learnings are not that valuable actually.

Activity by type of message

Let's research the activity by type of message in a given chat:

WITH
   CHAT_TITLE(chat_title) AS (SELECT 'Telegram'),
   SELECTION(content_type, mid) AS (
       SELECT
           m.content_type,
           m.id
       FROM
           CHAT_TITLE,
           messages m
       INNER JOIN
           chats c
           ON c.id = m.chat_id
       WHERE
           c.title = chat_title
   ),
   TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
   content_type,
   COUNT(mid) message_count,
   round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
  TOTAL, SELECTION
GROUP BY
   (content_type, total_count)
ORDER BY
   message_count DESC;



Not surprisingly most messages are text, and surprisingly the second biggest type is photo, not sticker.

User activity by type of message

Now let's order by user - similar to day of week and month of year:

WITH
   CHAT_TITLE(chat_title) AS (SELECT 'Telegram'),
   SELECTION(umessage_count, username, first_name, last_name, content_type, mid) AS (
       SELECT
           u.message_count,
           u.username,
           u.first_name,
           u.last_name,
           m.content_type,
           m.id
       FROM
           CHAT_TITLE,
           messages m
       INNER JOIN
           users_by_message_count u
           ON u.uid = m.sender_user_id
       INNER JOIN
           chats c
           ON c.id = m.chat_id
       WHERE
           c.title = chat_title
   ),
   TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
   username,
   first_name,
   last_name,
   content_type,
   COUNT(mid) message_count,
   round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
  TOTAL, SELECTION
GROUP BY
   (umessage_count, username, first_name, last_name, content_type, total_count)
ORDER BY
   umessage_count DESC, message_count DESC;


Useful Links