четвер, 12 березня 2020 р.

Telegram Chats History Analysis

People generate data. Living a digital life (at least to some extent) usually means leaving a lot of traces. One of the places to obtain such data is chat message history. Especially group chats. This is where traces of activity can tell us some facts about the society structure and the individuals who form it. 

As a weekend project for amusement and also to remove some rust from my coding muscles - I decided to pull message history from a couple of Telegram group chats I am a member of and then run some quick aggregations on this history in order to maybe learn some facts about my peers, chat buddies and myself.

I describe the methodology and give query examples and some depersonalized results. All work is open source https://github.com/justadreamer/tghistory.

Disclaimer: in this (and future) articles I try not to leak any personally identifiable information (at least intentionally), all names should be anonymized, so the data is depersonalized and does not lead to particular people.

Approach

We will pull the history of several selected Telegram chats into a PostgreSQL DB and then run aggregation queries on it to find out interesting facts. The ETL (Extract Transform Load) script is written in Python and is using python-telegram - a wrapper over TDLib ("Telegram Database") - a cross-platform library for building Telegram clients. Here is the diagram illustrating the setup:

Telegram Database Library

TDLib is an asynchronous library which handles networking, caching and data structures to provide an application with everything necessary to be a functional Telegram client. Asynchronous operation means that you issue a request and each method usually returns an AsynchronousResult object which is essentially a future - you then can block and wait for a response on this AsynchronousResult object. 
 

Postgres DB Schema


3 simple tables which contain just a tiny subset of the data fields which TDLib makes available for each entity

 

Queries

Chats 

We have exported messages for several chats, let's explore what chats and how many messages each contains and the proportion of total messages:

WITH
   SELECTION(chat_title, mid) AS (
       SELECT
           c.title,
           m.id
       FROM
           messages m
       INNER JOIN
           chats c
           ON c.id = m.chat_id
   ),
   TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
   chat_title,
   COUNT(mid) message_count,
   round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
  TOTAL, SELECTION
GROUP BY
   (chat_title, total_count)
ORDER BY
   message_count DESC;



User activity within a single chat

The cross activity of users between chats can be explored, but it is more interesting to explore individual chats and user behaviors within them. In all subsequent requests we limit the initial selection scope to messages of a single chat, thus we construct the queries in such a way that there is just one parameter: the title of the chat, and it is stated in the beginning of the query - so replace 'Telegram' title with a chat of interest title.

How many messages and what proportion of total do people send in a given chat sorted in a descending order by the number of messages? We create a view - we will need to reuse it for later queries. Also if you need just to look up f.e. the top 10 most active users - add a LIMIT 10 clause.

DROP VIEW IF EXISTS users_by_message_count;
CREATE VIEW users_by_message_count AS
WITH
   CHAT_TITLE(chat_title) AS (SELECT 'Telegram'),
   SELECTION(uid, username, first_name, last_name, mid) AS (
       SELECT
           u.id,
           u.username,
           u.first_name,
           u.last_name,
           m.id
       FROM
           CHAT_TITLE,
           users u
       INNER JOIN
           messages m
           ON u.id = m.sender_user_id
       INNER JOIN
           chats c
           ON c.id = m.chat_id
       WHERE
           c.title = chat_title
   ),
   TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
   uid,
   username,
   first_name,
   last_name,
   COUNT(mid) message_count,
   round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
  TOTAL, SELECTION
GROUP BY
   (uid, username, first_name, last_name, total_count)
ORDER BY
   message_count DESC;
SELECT * FROM users_by_message_count;

In the example result set below I replaced the Telegram chat with one of the group chats I am in:



Activity by day of week 

What is the most active day of the week in a given chat?
WITH
   CHAT_TITLE(chat_title) AS (SELECT 'Telegram'),
   SELECTION(day_of_week, mid) AS (
       SELECT
           to_char(m.send_date, 'Day'),
           m.id
       FROM
           CHAT_TITLE,
           messages m
       INNER JOIN
           chats c
           ON c.id = m.chat_id
       WHERE
           c.title = chat_title
   ),
   TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
   day_of_week,
   COUNT(mid) message_count,
   round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
  TOTAL, SELECTION
GROUP BY
   (day_of_week, total_count)
ORDER BY
   message_count DESC;



Not surprisingly Friday is the most active day of the week.

User activity by day of week

So now let's group by user and day_of_week and see the most active users and corresponding days of week. The trick here is the custom ordering. We first order by the user but we take the user from a view we previously created and order them by the number of messages they send in descending order - thus the most active person is first. The within that order for each person - we order days by the message_count in descending order.

WITH
   CHAT_TITLE(chat_title) AS (SELECT 'Telegram'),
   SELECTION(umessage_count, username, first_name, last_name, day_of_week, mid) AS (
       SELECT
           u.message_count,
           u.username,
           u.first_name,
           u.last_name,
           to_char(m.send_date, 'Day'),
           m.id
       FROM
           CHAT_TITLE,
           messages m
       INNER JOIN
           users_by_message_count u
           ON u.uid = m.sender_user_id
       INNER JOIN
           chats c
           ON c.id = m.chat_id
       WHERE
           c.title = chat_title
   ),
   TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
   username,
   first_name,
   last_name,
   day_of_week,
   COUNT(mid) message_count,
   round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
  TOTAL, SELECTION
GROUP BY
   (umessage_count, username, first_name, last_name, day_of_week, total_count)
ORDER BY
   umessage_count DESC, message_count DESC;




Activity by month of year

Let's research the activity by months of the year: what is the most active month of the year in a given chat?

WITH
   CHAT_TITLE(chat_title) AS (SELECT 'Telegram'),
   SELECTION(month, mid) AS (
       SELECT
           to_char(m.send_date, 'Month'),
           m.id
       FROM
           CHAT_TITLE,
           messages m
       INNER JOIN
           chats c
           ON c.id = m.chat_id
       WHERE
           c.title = chat_title
   ),
   TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
   month,
   COUNT(mid) message_count,
   round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
  TOTAL, SELECTION
GROUP BY
   (month, total_count)
ORDER BY
   message_count DESC;



However the result can be severely skewed due to the fact that some people have been added just recently.

User activity by month of year

Let's group by people similar to how we did for day of week:

WITH
   CHAT_TITLE(chat_title) AS (SELECT 'Telegram'),
   SELECTION(umessage_count, username, first_name, last_name, month, mid) AS (
       SELECT
           u.message_count,
           u.username,
           u.first_name,
           u.last_name,
           to_char(m.send_date, 'Month'),
           m.id
       FROM
           CHAT_TITLE,
           messages m
       INNER JOIN
           users_by_message_count u
           ON u.uid = m.sender_user_id
       INNER JOIN
           chats c
           ON c.id = m.chat_id
       WHERE
           c.title = chat_title
   ),
   TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
   username,
   first_name,
   last_name,
   month,
   COUNT(mid) message_count,
   round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
  TOTAL, SELECTION
GROUP BY
   (umessage_count, username, first_name, last_name, month, total_count)
ORDER BY
   umessage_count DESC, message_count DESC;


The learnings are not that valuable actually.

Activity by type of message

Let's research the activity by type of message in a given chat:

WITH
   CHAT_TITLE(chat_title) AS (SELECT 'Telegram'),
   SELECTION(content_type, mid) AS (
       SELECT
           m.content_type,
           m.id
       FROM
           CHAT_TITLE,
           messages m
       INNER JOIN
           chats c
           ON c.id = m.chat_id
       WHERE
           c.title = chat_title
   ),
   TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
   content_type,
   COUNT(mid) message_count,
   round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
  TOTAL, SELECTION
GROUP BY
   (content_type, total_count)
ORDER BY
   message_count DESC;



Not surprisingly most messages are text, and surprisingly the second biggest type is photo, not sticker.

User activity by type of message

Now let's order by user - similar to day of week and month of year:

WITH
   CHAT_TITLE(chat_title) AS (SELECT 'Telegram'),
   SELECTION(umessage_count, username, first_name, last_name, content_type, mid) AS (
       SELECT
           u.message_count,
           u.username,
           u.first_name,
           u.last_name,
           m.content_type,
           m.id
       FROM
           CHAT_TITLE,
           messages m
       INNER JOIN
           users_by_message_count u
           ON u.uid = m.sender_user_id
       INNER JOIN
           chats c
           ON c.id = m.chat_id
       WHERE
           c.title = chat_title
   ),
   TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
   username,
   first_name,
   last_name,
   content_type,
   COUNT(mid) message_count,
   round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
  TOTAL, SELECTION
GROUP BY
   (umessage_count, username, first_name, last_name, content_type, total_count)
ORDER BY
   umessage_count DESC, message_count DESC;


Useful Links