leibniz-hbi / tegracli


A Telegram/telethon research convenience wrapper for the terminal.

Home Page: https://pypi.org/project/tegracli/

License: MIT License

Python 100.00%

research-tool telegram telegram-api


tegracli

A convenience wrapper around Telethon and the Telegram Client API for research purposes.

Installation Instructions

tegracli is built with Poetry and requires Python >= 3.9 and < 4.0.

To install using pipx, run the following command:

pipx install tegracli

How to get API keys

If you don't have API keys for Telegram, head over to my.telegram.org. Click on API development tools, fill in the form to create an app, and put the keys into tegracli.conf.yml. The session name can be arbitrary.

api_id: 1234567
api_hash: some12321hashthatmustbehere123
session_name: somesessionyo

This template file is provided with the repository.
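
For scripted setups, the same three keys can also be written from Python. The following is a minimal sketch that uses only the standard library; the helper name `write_config` is hypothetical and not part of tegracli, and the values are written unquoted, exactly as in the template above.

```python
from pathlib import Path


def write_config(directory, api_id, api_hash, session_name):
    """Write tegracli.conf.yml with the three keys from the template.

    Hypothetical helper for illustration, not part of tegracli.
    """
    path = Path(directory) / "tegracli.conf.yml"
    lines = [
        f"api_id: {int(api_id)}",  # the API id is numeric
        f"api_hash: {api_hash}",
        f"session_name: {session_name}",
    ]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

Calling `write_config(".", 1234567, "some12321hashthatmustbehere123", "somesessionyo")` would reproduce the template shown above in the current directory.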

Usage

tegracli is a terminal application to access the Telegram API for research purposes. To retrieve messages, the configuration file from the previous section must be present in the directory from which you start tegracli.

Usage: tegracli [OPTIONS] COMMAND [ARGS]...

  Tegracli!! Retrieve messages from *Te*le*gra*m with a *CLI*!

Options:
  -d, --debug              Enable legacy debugging, is overwritten by the
                           other options. Defaults to False.
  -v, --verbose            Logging verbosity.
  -l, --log-file FILENAME  File to log to. Defaults to STDOUT.
  -s, --serialize          Serialize output to JSON.
  --help                   Show this message and exit.

Commands:
  configure  Configure tegracli.
  get        Get messages for the specified channels by either ID or...
  group      Manage account groups.
  hydrate    Hydrate a file with messages-ids.
  search     Searches Telegram content that is available to your account.

Logging

tegracli lets you configure what is logged and how. Logging is disabled by default and is enabled by passing --verbose or -v; repeating the flag (for example -vvvv) increases the logging level. The default logging target is STDOUT, which can be redirected to a file with --log-file yourfile.log. Setting --serialize writes the entire logging information in JSON-encoded form. --debug is the legacy option used by tegracli <= 0.2.5; it writes serialized logs to tegracli.log.jsonl at the DEBUG level and is overridden by the --verbose option.

Commands

The following commands are available:

configure

Opens an interactive prompt for configuring API access. It asks you to input your API ID, API hash and session name, and requests a 2FA code from Telegram.

Usage: tegracli configure [OPTIONS]

  Configure tegracli.

Options:
  --help  Show this message and exit.

get

To get messages from a number of channels, use this command.

Usage: tegracli get [OPTIONS] [CHANNELS]...

  Get messages for the specified channels by either ID or username.

Options:
  -l, --limit INTEGER           Number of messages to retrieve.
  -O, --offset_date [%Y-%m-%d]  Offset retrieval to specific date in YYYY-MM-
                                DD format.
  -o, --offset_id INTEGER       Offset retrieval to a specific post number.
  -m, --min_id INTEGER          Minimal post number.
  -M, --max_id INTEGER          Maximal post number
  -a, --add_offset INTEGER      Add an offset to the post numbers to be
                                retrieved.
  -f, --from_user TEXT          Only messages from this user.
  --reverse / --forward         Should post numbers count upward or downward.
                                Defaults to forward.
  -r, --reply_to TEXT           Only messages replied to specific post id.
  --help                        Show this message and exit.

| parameter | description |
| --- | --- |
| channels | a list of Telegram usernames, channel or group URLs, or user IDs. |
| limit | number of messages to retrieve, positive integer. If set to `-1`, retrieves all messages in the channel. Defaults to `-1`. |
| offset_date | start point of retrieval by date; retrieval direction is controlled by `reverse/forward`. Format must be YYYY-MM-DD. |
| offset_id | start point of retrieval by post number; retrieval direction is controlled by `reverse/forward`. |
| min_id | sets the minimum post number |
| max_id | sets the maximum post number |
| add_offset | adds an offset to the post numbers to be retrieved |
| from_user | limit messages to posts from a specific user |
| reply_to | limit messages to replies to a specific post |
| reverse/forward | flag to indicate whether messages should be retrieved in chronological or reverse chronological order. |

Basic Examples

To retrieve the last fifty messages from a Telegram channel:

tegracli get --limit 50 corona_infokanal_bmg

To retrieve the entire history of a channel starting with post #1, set the limit to -1 and pass --reverse:

tegracli get --reverse --limit -1 corona_infokanal_bmg

To retrieve messages sent after January 1st, 2022:

tegracli get --offset_date 2022-01-01 corona_infokanal_bmg

To retrieve messages sent before January 1st, 2022:

tegracli get --reverse --offset_date 2022-01-01 corona_infokanal_bmg
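
When the same calls are issued from a script or notebook, the command line can be assembled programmatically. The sketch below only uses options listed in the help text above; the function name `build_get_command` is hypothetical and not part of tegracli.

```python
def build_get_command(channels, limit=None, offset_date=None, reverse=False):
    """Assemble the argument list for a `tegracli get` call.

    Hypothetical helper; it only covers the options used in the examples.
    """
    cmd = ["tegracli", "get"]
    if reverse:
        cmd.append("--reverse")
    if limit is not None:
        cmd += ["--limit", str(limit)]
    if offset_date is not None:
        cmd += ["--offset_date", offset_date]  # expected as YYYY-MM-DD
    cmd += list(channels)
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains tegracli.conf.yml.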

search

To search messages in your chats and in the groups and channels you are subscribed to, use this command.

Usage: tegracli search [OPTIONS] [QUERIES]...

  This function searches Telegram content that is available to your account for the specified search term(s).

Options:
  --help  Show this message and exit.

hydrate

To rehydrate messages from the API, this command accepts a file with message IDs in the format $channel_name/$post_number. Both the input and the output file are optional; if they are not given, stdin and stdout are used.

Output data is JSONL, one message per line.

Usage: tegracli hydrate [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]

  Hydrate a file with messages-ids.

Options:
  --help  Show this message and exit.

For example, to rehydrate message IDs:

echo test_channel/1234 | tegracli hydrate
>> {"_":"Message","id": 1234, ... , "restriction_reason":[],"ttl_period":null}
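
The input and output formats of hydrate are simple enough to produce and consume from Python. A minimal sketch, assuming only the `$channel_name/$post_number` input format and the one-JSON-object-per-line output described above; both function names are hypothetical.

```python
import json


def format_message_ids(pairs):
    """Turn (channel_name, post_number) pairs into hydrate input lines."""
    return "".join(f"{channel}/{int(post)}\n" for channel, post in pairs)


def parse_jsonl(text):
    """Parse JSONL text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

The string returned by `format_message_ids` can be written to the input file or piped to stdin, and `parse_jsonl` can read the output file back into dictionaries.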

groups

In order to support updatable and long-running collections, tegracli offers an account group feature, which retrieves the history of a given set of accounts and can retrieve updates for each of these accounts.

Groups are initialized by calling tegracli group init; the accounts to track are either passed as arguments or read from a file.

Account Group File Format

Account files are expected to follow these requirements:

UTF-8 text document,
one account per line, given as either username, channel URL or ID,
no header and no additional columns.
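
A file that meets these requirements can be generated from any list of accounts. The sketch below is one way to do it; the function name `write_account_file` is hypothetical, and dropping duplicates and blank entries is an extra convenience, not something the format requires.

```python
from pathlib import Path


def write_account_file(path, accounts):
    """Write one account per line as UTF-8, without header or extra columns.

    Blank entries and duplicates are dropped; the first occurrence wins.
    """
    cleaned = (str(account).strip() for account in accounts)
    unique = list(dict.fromkeys(a for a in cleaned if a))
    Path(path).write_text("".join(f"{a}\n" for a in unique), encoding="utf-8")
    return unique
```

The resulting file can then be passed to `tegracli group init` with the `--read_file` option.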

Usage: tegracli group init [OPTIONS] NAME [ACCOUNTS]...

  initialize a new account group

Options:
  -f, --read_file PATH         read an account list from a file, one
                               handle/id/url per line.
  -s, --start_date [%Y-%m-%d]  Start date for the collection. Must be in YYYY-
                               MM-DD format.
  -l, --limit INTEGER          number of posts fo retrieve in one run
  --help                       Show this message and exit.

A group is essentially a directory in your tegracli project folder that holds a group configuration file, a profiles.jsonl file, which collects all user objects returned by Telegram (these are reused to save API requests), and the JSONL files containing the messages. Each message file holds the messages of one account and is named after the account's ID.

An example project could look like this:

tegracli-project/
 |- tegracli.conf.yml
 |- mysession.session
 |- my_group/
    |- tegracli_group.conf.yml
    |- profiles.jsonl
    |- 10000001.jsonl
    |- 10000002.jsonl
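
Given this layout, the collected messages of a group can be read back with a few lines of Python. This is a minimal sketch that assumes exactly the structure shown above (one `<account id>.jsonl` file per account next to profiles.jsonl); the function name `iter_group_messages` is hypothetical, and the layout itself may change in later versions.

```python
import json
from pathlib import Path


def iter_group_messages(group_dir):
    """Yield (account_id, message) for every message file of a group.

    profiles.jsonl holds user objects rather than messages and is skipped.
    """
    for path in sorted(Path(group_dir).glob("*.jsonl")):
        if path.name == "profiles.jsonl":
            continue
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield path.stem, json.loads(line)
```

Because the function is a generator, files are read line by line and a large group does not have to fit into memory at once.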

To run the project, run tegracli group run my_group in your terminal to collect the latest posts of the accounts you want to track. If you have multiple groups configured, you can run all of them with tegracli group run all. This interprets all subdirectories as groups; tegracli will fail if a subdirectory is not a valid group.

Usage: tegracli group run [OPTIONS] [GROUPS]...

  Load a group configuration and run the groups operations.
    
  GROUPS are subdirectories with a valid group configuration.
    If the special keyword all is given, all subdirectories are considered.

Result File Format

Messages are stored in JSONL files, one per channel or query. For channels, the filename is the channel's or user's ID; for searches, it is the query. BEWARE: how directories and files are structured is subject to active development and prone to change in the near future.

Developer Installation

Install Poetry,
Clone the repository (and unzip it, if necessary),
In the directory, run poetry install,
Run poetry shell to start the development virtualenv,
Run pytest to run the tests; run pytest --run_api to include tests against the Telegram API (these require a valid configuration). The coverage report can be found under tests/coverage.


tegracli's Issues

Document basic usage/help function

pytest should be in dev dependencies

refactor notifications via Telegram

Make it configurable and respect the message length limit (4096).
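
One way to respect the limit is to split a long notification into several messages before sending. The sketch below is an illustration only: the function name `split_message` is hypothetical, and it counts the limit in Python characters, which is an assumption about how the limit is measured.

```python
TELEGRAM_MESSAGE_LIMIT = 4096  # limit named in this issue


def split_message(text, limit=TELEGRAM_MESSAGE_LIMIT):
    """Split text into chunks of at most `limit` characters.

    Breaks at the last newline inside the window when there is one and
    falls back to a hard cut otherwise.
    """
    if limit < 1:
        raise ValueError("limit must be positive")
    chunks = []
    while len(text) > limit:
        cut = text.rfind("\n", 0, limit)
        if cut <= 0:
            # No usable newline: cut exactly at the limit.
            chunks.append(text[:limit])
            text = text[limit:]
        else:
            # Break at the newline and drop it from the output.
            chunks.append(text[:cut])
            text = text[cut + 1:]
    if text:
        chunks.append(text)
    return chunks
```

Each returned chunk can then be sent as a separate notification.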

python3.8 compatibility

Group-based collection à la twacapic

For ongoing collections, a mechanism for collecting account groups would be good.
For example, a call to tegracli get account1 account2 could accept an additional option to save the resulting list to a subfolder from which we can update and/or rerun the collection: tegracli get --name_group my_accountlist account1 account2 ....
That would allow for a call like tegracli update my_accountlist. Such a setup could have a folder structure similar to this:

folder/
  tegracli.conf.yml
  my_accountlist/
    12932xxx.jsonl
    14204xxx.jsonl

Optionally, we should introduce some verbs to manipulate these account lists, e.g. tegracli remove $account1 to remove accounts from the list, and add to add accounts to a list.

feat: allow non-interactive configuration for Jupyter notebooks

fix: Unicode characters in the reaction field

The characters in the reaction fields are replaced with a Unicode placeholder. This is not intended behavior.

feat: add strategy for file rollover

Single JSONL files per user are not really feasible for downstream database persistence (since we would have to read the entire file and let SQLAlchemy's ORM system figure out which things to persist).

Possible solutions:

keep user files in a separate directory (e.g. `.ponyuexpress`) and, after a rollover period, move their contents into a single file for each rollover period.
do rollover in place (one file per user and rollover period).
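
The second option could be sketched as a function that derives the target file from the user ID and the rollover period. Everything here is an assumption for illustration: the function name `rollover_path`, the monthly default period and the `<user id>_<period>.jsonl` naming scheme are not part of tegracli.

```python
from datetime import datetime, timezone
from pathlib import Path


def rollover_path(base_dir, user_id, when: datetime, period_format="%Y-%m"):
    """Return the file for `user_id` in the rollover period containing `when`.

    `when` should be timezone-aware; the period is computed in UTC so that
    files do not depend on the machine's local time zone.
    """
    period = when.astimezone(timezone.utc).strftime(period_format)
    return Path(base_dir) / f"{user_id}_{period}.jsonl"
```

With the default format, all messages of one user from the same calendar month end up in the same file; passing `"%Y-%m-%d"` would roll over daily instead.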

fix: fresh tegracli installation does not connect to Telegram when running a newly initialized group

fix: catch async errors

Traceback (most recent call last):
  File "~/.local/bin/tegracli", line 8, in <module>
    sys.exit(cli())
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/tegracli/main.py", line 224, in run
    run_group(ctx.obj["client"], groups)
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/tegracli/main.py", line 316, in run_group
    client.loop.run_until_complete(
  File "~/.pyenv/versions/3.9.12/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/tegracli/dispatch.py", line 30, in dispatch_iter_messages
    async for message in client.iter_messages(wait_time=10, **params):
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/telethon/requestiter.py", line 74, in __anext__
    if await self._load_next_chunk():
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/telethon/client/messages.py", line 184, in _load_next_chunk
    r = await self.client(self.request)
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/telethon/client/users.py", line 30, in __call__
    return await self._call(self._sender, request, ordered=ordered)
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/telethon/client/users.py", line 84, in _call
    result = await future
telethon.errors.rpcerrorlist.UserDeactivatedError: The user has been deleted/deactivated (caused by GetHistoryRequest)
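
A possible shape for the fix is to catch the error per account, so that one unavailable account does not end the whole run. The sketch below is self-contained and therefore uses a stand-in exception class; in tegracli the except clause would name the Telethon error from the traceback instead. The names `AccountUnavailableError`, `collect_account` and `collect_group` are hypothetical.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class AccountUnavailableError(Exception):
    """Stand-in for API errors such as Telethon's UserDeactivatedError."""


async def collect_account(fetch, account):
    """Collect all messages of one account; return None if the API refuses.

    `fetch` is any callable that returns an async iterator of messages.
    """
    try:
        return [message async for message in fetch(account)]
    except AccountUnavailableError as error:
        logger.warning("skipping %s: %s", account, error)
        return None


async def collect_group(fetch, accounts):
    """Collect accounts one by one so a single failure does not end the run."""
    results = {}
    for account in accounts:
        messages = await collect_account(fetch, account)
        if messages is not None:
            results[account] = messages
    return results
```

Note that this simple version discards messages already received from an account when the error occurs in the middle of the iteration.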

setup Actions to run code quality control programs

To ensure consistent code that aligns with our code guidelines, it is helpful to implement quality gates.
Naturally, the opportunity for such checks is the pull request (not pushes, as our main branch should always be locked).

The GitHub documentation gives some examples: building and testing Python.

Document API key procurement and config

fix(hydrate): memory leak, somewhere

After (only) 135,313 messages, Ubuntu's OOM killer came and took out the process:

[Sa Apr 22 00:20:52 2023] Out of memory: Killed process 18967 (tegracli) total-vm:532276kB, anon-rss:291708kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:872kB oom_score_adj:0

Document dev installation

Add CITATION.bib

Add DOI and a bib.

Tegracli group retrieval does not catch ChannelInvalidErrors

Traceback (most recent call last):
  File "~/.local/bin/tegracli", line 8, in <module>
    sys.exit(cli())
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/tegracli/main.py", line 209, in run
    run_group(ctx.obj["client"], groups)
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/tegracli/main.py", line 231, in run_group
    profile = client.loop.run_until_complete(
  File "~/.pyenv/versions/3.9.12/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/tegracli/dispatch.py", line 140, in get_profile
    profile = await client.get_entity(_member)
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/telethon/client/users.py", line 319, in get_entity
    channels = (await self(
  File "/home/philippkessling/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/telethon/client/users.py", line 30, in __call__
    return await self._call(self._sender, request, ordered=ordered)
  File "~/.local/pipx/venvs/tegracli/lib/python3.9/site-packages/telethon/client/users.py", line 84, in _call
    result = await future
telethon.errors.rpcerrorlist.ChannelInvalidError: Invalid channel object. Make sure to pass the right types, for instance making sure that the request is designed for channels or otherwise look for a different one more suited (caused by GetChannelsRequest)

fix: pre-commit is missing from deps

make public and publish on pypi

feat: add special keyword to `group run` to run all configured groups

Document user installation

add options to adjust outfile variable grouping

Right now, tegracli writes posts with the following file pattern: ${user.id}.jsonl, but other formats are more desirable, e.g. ${created_at:"%Y-%m-%d"}.

lock file in repo

shouldn't this be at least OS specific?

(should also be in .gitignore for the new cookiecutter template)

increase test coverage

feat: function to crawl replies (comments) to each post of channel?

Hi, is there a function built into tegracli to crawl not only the text of the posts of a channel (tegracli does a great job in this respect) but also all replies (comments) to each post of a Telegram channel?

tegracli yields the number of replies to a post (in the variable called replies). I have not found a way to actually crawl the text of those individual replies.

Thanks in advance for any insights.

bug: catch UsernameNotOccupiedErrors while rehydrating

Traceback (most recent call last):
  File "/home/ec2-user/.local/pipx/venvs/tegracli/lib64/python3.8/site-packages/telethon/client/users.py", line 548, in _get_entity_from_string
    result = await self(
  File "/home/ec2-user/.local/pipx/venvs/tegracli/lib64/python3.8/site-packages/telethon/client/users.py", line 30, in __call__
    return await self._call(self._sender, request, ordered=ordered)
  File "/home/ec2-user/.local/pipx/venvs/tegracli/lib64/python3.8/site-packages/telethon/client/users.py", line 83, in _call
    result = await future
telethon.errors.rpcerrorlist.UsernameNotOccupiedError: The username is not in use by anyone else yet (caused by ResolveUsernameRequest)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ec2-user/.local/bin/tegracli", line 8, in <module>
    sys.exit(cli())
  File "/home/ec2-user/.local/pipx/venvs/tegracli/lib64/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/ec2-user/.local/pipx/venvs/tegracli/lib64/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/ec2-user/.local/pipx/venvs/tegracli/lib64/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ec2-user/.local/pipx/venvs/tegracli/lib64/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ec2-user/.local/pipx/venvs/tegracli/lib64/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/ec2-user/.local/pipx/venvs/tegracli/lib64/python3.8/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/ec2-user/.local/pipx/venvs/tegracli/lib64/python3.8/site-packages/tegracli/main.py", line 105, in hydrate
    client.loop.run_until_complete(
  File "/usr/lib64/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/home/ec2-user/.local/pipx/venvs/tegracli/lib64/python3.8/site-packages/tegracli/dispatch.py", line 80, in dispatch_hydrate
    await dispatch_iter_messages(
  File "/home/ec2-user/.local/pipx/venvs/tegracli/lib64/python3.8/site-packages/tegracli/dispatch.py", line 34, in dispatch_iter_messages
    async for message in client.iter_messages(wait_time=10, **params):
  File "/home/ec2-user/.local/pipx/venvs/tegracli/lib64/python3.8/site-packages/telethon/requestiter.py", line 58, in __anext__
    if await self._init(**self.kwargs):
  File "/home/ec2-user/.local/pipx/venvs/tegracli/lib64/python3.8/site-packages/telethon/client/messages.py", line 289, in _init
    self._entity = (await self.client.get_input_entity(entity)) if entity else None
  File "/home/ec2-user/.local/pipx/venvs/tegracli/lib64/python3.8/site-packages/telethon/client/users.py", line 432, in get_input_entity
    await self._get_entity_from_string(peer))
  File "/home/ec2-user/.local/pipx/venvs/tegracli/lib64/python3.8/site-packages/telethon/client/users.py", line 551, in _get_entity_from_string
    raise ValueError('No user has "{}" as username'

Group account list should deduplicate itself.

Interactive configuration in pip(x) installed version

The template file for the config is not accessible if tegracli is installed via pip or pipx. Therefore, there should be either an interactive option via a click command or a simple copy of the file in the working directory with a message/instructions on how to fill it in.

handle unexpected dropouts from Telethon's `iter_messages`

[Problem] Unit tests for interface code to Telethon

Problem: Telethon requires the client to be registered and authenticated with Telegram, which makes running tests quite difficult.
As Telethon is an implementation of Telegram's MTProto protocol, which uses TCP via the low-level socket package, mocking is not easily done with readily available packages.

Solution 1: Mocking telethon.TelegramClient with prerecorded/mock data. This would require a sophisticated implementation and introduces quite some complexity.
Additionally, it will require some learning about asyncio's event loops and how to disable actual requests to socket.socket.

Solution 2: Marking the tests that require interactions with Telethon/Telegram as optional and running them only if valid credentials are supplied and pytest is told by the user to do so. This would mean that running tests in GH Actions on push/pull request may be quite expensive, as rate limit errors may cause the tests to pause for ~24h.
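
Solution 2 maps onto pytest's standard hooks for skipping tests unless a command line option is given. The following conftest.py sketch uses the `--run_api` flag mentioned in the developer installation notes; the marker name `api` is an assumption made for this example and may differ from what the repository uses.

```python
import pytest


def pytest_addoption(parser):
    """Register the opt-in flag for tests that talk to the Telegram API."""
    parser.addoption(
        "--run_api",
        action="store_true",
        default=False,
        help="also run tests that require valid Telegram credentials",
    )


def pytest_configure(config):
    """Declare the (assumed) `api` marker so pytest does not warn about it."""
    config.addinivalue_line("markers", "api: test needs the Telegram API")


def pytest_collection_modifyitems(config, items):
    """Skip every test marked `api` unless --run_api was passed."""
    if config.getoption("--run_api"):
        return
    skip_api = pytest.mark.skip(reason="needs --run_api and valid credentials")
    for item in items:
        if "api" in item.keywords:
            item.add_marker(skip_api)
```

Tests that need the API would then be decorated with `@pytest.mark.api`, so a plain `pytest` run stays offline and never hits rate limits.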

Further alternatives and input are more than welcome.

Style guideline

Our smormlpy repository is the main knowledge base for code quality and management. It employs pre-commit for code quality checks in the pre-commit hook.
This repository should do the same.