People generate data. Living a digital life (at least to some extent) usually means leaving a lot of traces. One of the places to obtain such data is chat message history. Especially group chats. This is where traces of activity can tell us some facts about the society structure and the individuals who form it.
As a weekend project for amusement and also to remove some rust from my coding muscles - I decided to pull message history from a couple of Telegram group chats I am a member of and then run some quick aggregations on this history in order to maybe learn some facts about my peers, chat buddies and myself.
I describe the methodology and give query examples and some depersonalized results. All work is open source https://github.com/justadreamer/tghistory.
Disclaimer: in this (and future) articles I try not to leak any personally identifiable information (at least intentionally), all names should be anonymized, so the data is depersonalized and does not lead to particular people.
Approach
We will pull the history of several selected Telegram chats into a PostgreSQL DB and then run aggregation queries on it to find out interesting facts. The ETL (Extract Transform Load) script is written in Python and is using python-telegram - a wrapper over TDLib ("Telegram Database") - a cross-platform library for building Telegram clients. Here is the diagram illustrating the setup:Telegram Database Library
TDLib is an asynchronous library which handles networking, caching and data structures to provide an application with everything necessary to be a functional Telegram client. Asynchronous operation means that you issue a request and each method usually returns an AsynchronousResult object which is essentially a future - you then can block and wait for a response on this AsynchronousResult object.
Postgres DB Schema
Queries
Chats
We have exported messages for several chats, let's explore what chats and how many messages each contains and the proportion of total messages:
WITH
SELECTION(chat_title, mid) AS (
SELECT
c.title,
m.id
FROM
messages m
INNER JOIN
chats c
ON c.id = m.chat_id
),
TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
chat_title,
COUNT(mid) message_count,
round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
TOTAL, SELECTION
GROUP BY
(chat_title, total_count)
ORDER BY
message_count DESC;
User activity within a single chat
The cross activity of users between chats can be explored, but it is more interesting to explore individual chats and user behaviors within them. In all subsequent requests we limit the initial selection scope to messages of a single chat, thus we construct the queries in such a way that there is just one parameter: the title of the chat, and it is stated in the beginning of the query - so replace 'Telegram' title with a chat of interest title.How many messages and what proportion of total do people send in a given chat sorted in a descending order by the number of messages? We create a view - we will need to reuse it for later queries. Also if you need just to look up f.e. the top 10 most active users - add a LIMIT 10 clause.
DROP VIEW IF EXISTS users_by_message_count;
CREATE VIEW users_by_message_count AS
WITH
CHAT_TITLE(chat_title) AS (SELECT 'Telegram'),
SELECTION(uid, username, first_name, last_name, mid) AS (
SELECT
u.id,
u.username,
u.first_name,
u.last_name,
m.id
FROM
CHAT_TITLE,
users u
INNER JOIN
messages m
ON u.id = m.sender_user_id
INNER JOIN
chats c
ON c.id = m.chat_id
WHERE
c.title = chat_title
),
TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
uid,
username,
first_name,
last_name,
COUNT(mid) message_count,
round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
TOTAL, SELECTION
GROUP BY
(uid, username, first_name, last_name, total_count)
ORDER BY
message_count DESC;
SELECT * FROM users_by_message_count;
In the example result set below I replaced the Telegram chat with one of the group chats I am in:
Activity by day of week
What is the most active day of the week in a given chat?
WITH
CHAT_TITLE(chat_title) AS (SELECT 'Telegram'),
SELECTION(day_of_week, mid) AS (
SELECT
to_char(m.send_date, 'Day'),
m.id
FROM
CHAT_TITLE,
messages m
INNER JOIN
chats c
ON c.id = m.chat_id
WHERE
c.title = chat_title
),
TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
day_of_week,
COUNT(mid) message_count,
round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
TOTAL, SELECTION
GROUP BY
(day_of_week, total_count)
ORDER BY
message_count DESC;
Not surprisingly Friday is the most active day of the week.
User activity by day of week
So now let's group by user and day_of_week and see the most active users and corresponding days of week. The trick here is the custom ordering. We first order by the user but we take the user from a view we previously created and order them by the number of messages they send in descending order - thus the most active person is first. The within that order for each person - we order days by the message_count in descending order.
WITH
WITH
CHAT_TITLE(chat_title) AS (SELECT 'Telegram'),
SELECTION(umessage_count, username, first_name, last_name, day_of_week, mid) AS (
SELECT
u.message_count,
u.username,
u.first_name,
u.last_name,
to_char(m.send_date, 'Day'),
m.id
FROM
CHAT_TITLE,
messages m
INNER JOIN
users_by_message_count u
ON u.uid = m.sender_user_id
INNER JOIN
chats c
ON c.id = m.chat_id
WHERE
c.title = chat_title
),
TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
username,
first_name,
last_name,
day_of_week,
COUNT(mid) message_count,
round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
TOTAL, SELECTION
GROUP BY
(umessage_count, username, first_name, last_name, day_of_week, total_count)
ORDER BY
umessage_count DESC, message_count DESC;
Activity by month of year
Let's research the activity by months of the year: what is the most active month of the year in a given chat?
WITH
CHAT_TITLE(chat_title) AS (SELECT 'Telegram'),
SELECTION(month, mid) AS (
SELECT
to_char(m.send_date, 'Month'),
m.id
FROM
CHAT_TITLE,
messages m
INNER JOIN
chats c
ON c.id = m.chat_id
WHERE
c.title = chat_title
),
TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
month,
COUNT(mid) message_count,
round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
TOTAL, SELECTION
GROUP BY
(month, total_count)
ORDER BY
message_count DESC;
However the result can be severely skewed due to the fact that some people have been added just recently.
User activity by month of year
Let's group by people similar to how we did for day of week:
WITH
CHAT_TITLE(chat_title) AS (SELECT 'Telegram'),
SELECTION(umessage_count, username, first_name, last_name, month, mid) AS (
SELECT
u.message_count,
u.username,
u.first_name,
u.last_name,
to_char(m.send_date, 'Month'),
m.id
FROM
CHAT_TITLE,
messages m
INNER JOIN
users_by_message_count u
ON u.uid = m.sender_user_id
INNER JOIN
chats c
ON c.id = m.chat_id
WHERE
c.title = chat_title
),
TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
username,
first_name,
last_name,
month,
COUNT(mid) message_count,
round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
TOTAL, SELECTION
GROUP BY
(umessage_count, username, first_name, last_name, month, total_count)
ORDER BY
umessage_count DESC, message_count DESC;
The learnings are not that valuable actually.
Activity by type of message
Let's research the activity by type of message in a given chat:
WITH
CHAT_TITLE(chat_title) AS (SELECT 'Telegram'),
SELECTION(content_type, mid) AS (
SELECT
m.content_type,
m.id
FROM
CHAT_TITLE,
messages m
INNER JOIN
chats c
ON c.id = m.chat_id
WHERE
c.title = chat_title
),
TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
content_type,
COUNT(mid) message_count,
round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
TOTAL, SELECTION
GROUP BY
(content_type, total_count)
ORDER BY
message_count DESC;
Not surprisingly most messages are text, and surprisingly the second biggest type is photo, not sticker.
User activity by type of message
Now let's order by user - similar to day of week and month of year:
WITH
CHAT_TITLE(chat_title) AS (SELECT 'Telegram'),
SELECTION(umessage_count, username, first_name, last_name, content_type, mid) AS (
SELECT
u.message_count,
u.username,
u.first_name,
u.last_name,
m.content_type,
m.id
FROM
CHAT_TITLE,
messages m
INNER JOIN
users_by_message_count u
ON u.uid = m.sender_user_id
INNER JOIN
chats c
ON c.id = m.chat_id
WHERE
c.title = chat_title
),
TOTAL(total_count) AS (SELECT COUNT(mid) FROM SELECTION)
SELECT
username,
first_name,
last_name,
content_type,
COUNT(mid) message_count,
round(CAST(COUNT(mid) AS numeric) / total_count * 100, 2) prop
FROM
TOTAL, SELECTION
GROUP BY
(umessage_count, username, first_name, last_name, content_type, total_count)
ORDER BY
umessage_count DESC, message_count DESC;
Useful Links
- https://blog.jiayu.co/2018/09/telegrammetry-stats-for-telegram/ - an interesting blog article on a Telegram chat stats
- https://core.telegram.org/tdlib/getting-started
Немає коментарів:
Дописати коментар