mincka / dmarchiver

A tool to archive the direct messages, images and videos from your private conversations on Twitter

License: GNU General Public License v3.0

Python 100.00%
twitter conversation direct-message archive tweets downloader backup dm

dmarchiver's Introduction

DMArchiver is currently broken

2020-08-16

Due to recent changes on Twitter, the method originally used by DMArchiver will no longer work. There won't be a quick fix as it requires a major rewrite.

The issue is tracked here: #83

DMArchiver

A tool to archive all the direct messages from your private conversations on Twitter.

Introduction

Have you ever needed to retrieve old information from a chat with your friends on Twitter? Or maybe you would just like to back up all these cheerful moments and keep them safe.

I made this tool to retrieve all the tweets from my private conversations and transform them into an IRC-like log for archiving.

Output sample:

[2016-09-07 10:35:55] <Michael> [Media-image] https://ton.twitter.com/1.1/ton/data/dm/773125478562429059/773401254876366208/mfeDmXXj.jpg I am so a Dexter fan...
[2016-09-07 10:36:12] <Michael> [Media-sticker] [Grinning face] https://ton.twimg.com/stickers/stickers/10001_raw.png
[2016-09-07 10:37:12] <Kathy> He is so sexy. 😳 I love him. ❤️
[2016-09-07 10:38:10] <Steve> You guys are ridiculous! 😂
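
For anyone who wants to post-process the archive, here is a minimal sketch of a parser for the line layout shown in the sample. The pattern is inferred from the sample only; `LogEntry` and `parse_log_line` are illustrative names, not part of DMArchiver, and display names containing `>` or messages spanning several lines are not handled.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# "[YYYY-MM-DD hh:mm:ss] <Author> text", as inferred from the output sample.
LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] <([^>]+)> (.*)$")


class LogEntry(NamedTuple):
    timestamp: datetime
    author: str
    text: str


def parse_log_line(line: str) -> Optional[LogEntry]:
    """Return a LogEntry, or None when the line does not match the layout."""
    match = LINE_RE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    stamp, author, text = match.groups()
    return LogEntry(datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), author, text)


entry = parse_log_line("[2016-09-07 10:38:10] <Steve> You guys are ridiculous!")
print(entry.author, entry.timestamp.isoformat())
```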

This tool is also able to download all the uploaded images and videos in their original resolution and, as a bonus, also retrieve the GIFs you used in your conversations as MP4 files (the format used by Twitter to optimize them and save space).

You may have found suggestions to use Twitter's archive feature to do the same, but Direct Messages are not included in the generated archive.

The script does not use the Twitter API because of its very restrictive limits on Direct Messages: through the API, it is currently possible to retrieve only the latest 200 messages of a private conversation.

Because it is still possible to retrieve older messages from a conversation by scrolling up in the web interface, this script simply simulates that behavior to fetch the messages automatically.
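
The "scroll up until nothing older is left" idea can be sketched as a cursor-following loop. Everything below is hypothetical: `Page`, `fetch_page` and the cursor values stand in for whatever the web interface actually returns, and the fake pages replace real network requests.

```python
from typing import Callable, Dict, List, NamedTuple, Optional


class Page(NamedTuple):
    messages: List[str]          # newest first within one batch (assumption)
    older_cursor: Optional[str]  # None once the first message is reached


def crawl_conversation(fetch_page: Callable[[Optional[str]], Page]) -> List[str]:
    """Follow cursors from the newest batch back to the oldest one."""
    collected: List[str] = []
    cursor: Optional[str] = None  # None asks for the most recent batch
    while True:
        page = fetch_page(cursor)
        collected.extend(page.messages)
        if page.older_cursor is None:
            break                 # nothing older to load
        cursor = page.older_cursor
    collected.reverse()           # oldest first, like a chat log
    return collected


# Fake data standing in for the responses of the real site.
FAKE_PAGES: Dict[Optional[str], Page] = {
    None: Page(["m6", "m5"], "c1"),
    "c1": Page(["m4", "m3"], "c2"),
    "c2": Page(["m2", "m1"], None),
}
print(crawl_conversation(FAKE_PAGES.__getitem__))
```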

Warning: possible account lockout

A few users have reported account lockouts because of the use of this tool. Twitter seems to lock accounts more aggressively if a new login context is detected. Even though locking can be reverted, you should be aware of this risk when using this tool. An additional attempt after unlocking can allow the tool to perform better on the second run.

If you need to run the tool multiple times, it is also recommended to use the -s parameter to reuse cookies from a previous session. You will not receive a new login warning by e-mail since the tool will reuse an existing session.

Disclaimer:

This tool behaves just as you would when using the Twitter website in your browser, so there is nothing illegal about using it to retrieve your own data. However, depending on the length of your conversations, it may send a large number of requests to the site, which could look suspicious to Twitter. In that case, Twitter could preemptively lock the account.

Because this script leverages an unsupported method to retrieve the tweets, it may break at any time. Indeed, Twitter may change the output code without warning. If you get errors you did not have previously, please check if new releases of the tool are available.

Installation & Quick start

If you run the tool without any argument, you will only be prompted for your username and password. The script will retrieve all the messages from all the conversations, without the images or the GIFs.

Windows

Download a Windows build from the project releases.

Unzip the archive in a temporary folder and double-click the executable or run it in a Command Prompt (mandatory if you want to use parameters to download images and videos):

> C:\Temp\DMArchiver.exe

Note: If you run the tool directly from the zip archive window, it may fail when writing the log file. Instead, copy DMArchiver.exe to any directory and run it from there.

Mac OS X / macOS

Download a macOS build from the project releases.

Then click on the executable, or run Terminal and execute the following commands (mandatory if you want to use parameters to download images and videos):

$ cd Downloads
$ ./dmarchiver

Note: If you run the tool by clicking on it, the result files will be available in your /Users/username folder.

Ubuntu

$ pip3 install dmarchiver
$ dmarchiver

Installation & upgrade with pip (any platform)

$ pip3 install dmarchiver
$ dmarchiver
$ pip3 install dmarchiver --upgrade

Advanced usage

Command line tool

$ dmarchiver [-h] [-id CONVERSATION_ID] [-u] [-p] [-di] [-dg] [-dv]

$ dmarchiver --help
	usage: cmdline.py [-h] [-id CONVERSATION_ID] [-u] [-p] [-di] [-dg] [-dv]
	
	optional arguments:
	  -h, --help            show this help message and exit
	  -id CONVERSATION_ID, --conversation_id CONVERSATION_ID
	                        Conversation ID
	  -u,  --username       Username (e-mail or handle)
	  -p,  --password       Password
	  -d,  --delay          Delay between requests (seconds)
	  -s,  --save-session   Save the session locally
	  -di, --download-images
	                        Download images
	  -dg, --download-gifs  Download GIFs (as MP4)
	  -dv, --download-videos
	                        Download videos (as MP4)
	  -th,  --twitter-handle     
	                        Use the Twitter handles instead of the display names						
	  -r, --raw-output      Write the raw HTML to a file

Examples

Archive all conversations with images and videos:

$ dmarchiver -di -dv

The script output will be one text file per conversation, named after the conversation ID (for example 645754097571131337.txt), with the conversation formatted in an IRC-like style.

The image and video files can be found in the 645754097571131337/images and 645754097571131337/mp4-* folders, respectively.

Archive a specific conversation, and use the Twitter handles for the usernames:

To retrieve only one conversation with the ID 645754097571131337:

$ dmarchiver -id "645754097571131337" -th

The script output will be the 645754097571131337.txt file with the conversation formatted in an IRC-like style, using the Twitter handles instead of the display names.

Schedule a task to perform incremental backups of a conversation

You can also specify the username and the password in the options. Because DMArchiver is able to perform incremental updates, you can schedule a task or create a shortcut with the following arguments:

$ dmarchiver -id "conversation_id" -di -dg -dv -u your_username -p your_password -s

Note the usage of the -s flag to use an existing session, instead of creating a new one.

Development

Ubuntu / Windows

$ git clone https://github.com/Mincka/DMArchiver.git
$ cd DMArchiver
$ virtualenv venv
$ source venv/bin/activate # "venv/Scripts/Activate.bat" on Windows
$ pip install -r requirements.txt
$ python -m dmarchiver.cmdline

Mac OS X / macOS

To build and run the pip3 package, you need to have Xcode (≈ 130 MB), Homebrew and Python 3 (≈ 20 MB):

$ xcode-select --install
$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
$ brew install python3

Binary build with pyinstaller

The Python 3.4 (32-bit) branch is recommended to build the binaries. It will allow the best compatibility with all the platforms.

On Windows

> pip3 install pyinstaller
> pyinstaller --onefile dmarchiver\cmdline.py -n dmarchiver.exe
or, as an alternative in case of an import error:
> pyinstaller --onefile dmarchiver\cmdline.py --paths=dmarchiver -n dmarchiver.exe --hidden-import queue
> cd dist
> dmarchiver.exe

On Mac OS / macOS

$ pip3 install pyinstaller
$ pyinstaller --onefile dmarchiver/cmdline.py -n dmarchiver
or alternative for macOS Sierra with handling of external imports
$ /Library/Frameworks/Python.framework/Versions/3.4/bin/pyinstaller --onefile dmarchiver/cmdline.py -n dmarchiver --hidden-import cssselect --hidden-import lxml --hidden-import urllib3 --hidden-import requests --hidden-import queue 
$ cd dist
$ ./dmarchiver

Package upload to PyPI Live

python setup.py sdist upload -r pypi

Known issues

Missing messages in conversations

Sometimes, generally due to a connection error, the script will write the messages of the conversations before retrieving all the messages. In this case, you should try to run the script again.

Error message: "Unknown element type" / "Unknown media type" / "Unknown media"

Twitter may introduce new features or change the HTML output at any time. When it happens, DMArchiver may generate empty, broken logs or even crash. This kind of error message means the tool must be updated to handle the new output. Feel free to create a new issue when you encounter one of these messages.

Troubleshooting

Error building lxml

You may encounter build issues with the lxml library on Windows (error: Unable to find vcvarsall.bat). The simplest fix is to download a precompiled binary from this site and install the package locally:

$ pip install lxml-3.8.0-cp34-cp34m-win32.whl

dmarchiver script not found after pip3 install

If the Python bin path is not in your PATH environment variable, the program will not be found. Just run it with the complete path (the location may vary):

$ /Library/Frameworks/Python.framework/Versions/3.4/bin/dmarchiver

FAQ

What happens to my password and my messages? Are they sent to a third-party service?

Not at all. Unlike other online backup services, everything happens here on your computer. Your username and your password are only sent once to Twitter using a secured connection. Your messages are downloaded from your connection, and are written on your computer at the end of the script execution, so are the images and the GIFs if you chose to download them.

I received an e-mail from Twitter saying a suspicious connection occurred on Twitter, should I be worried about it?

Not at all. The tool simulates a Chrome (Windows or Linux) or Safari (macOS) browser on your current operating system. Unless you use the -s option to save the session, the tool does not keep any cookies locally, so Twitter will warn you each time you use it. You can safely ignore this message if you received it at the same time the tool was used.

macOS says the application is blocked because it is not from an identified developer, what should I do?

I am not able to sign the macOS executable. You will have to unblock the application if you want to use it. Go to the "Security & Privacy" settings and click on the "Open Anyway" button.

License

Copyright (C) 2016-2017 Julien EHRHART

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

dmarchiver's People

Contributors

cajuncooks, dependabot-preview[bot], dependabot[bot], mincka, trwnh

dmarchiver's Issues

How to get the out text file in UTF-8 encoding?

This works like a charm but im seeing characters like ПолСгÑ‑С.. which means that it dont support other languages... is there any other way to do it?
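
One possible explanation, offered as an assumption rather than a diagnosis of this report: the file is valid UTF-8 but is opened by a program that assumes a legacy single-byte encoding. The sketch below shows the effect with standard library calls only; `write_log` and `read_log` are illustrative helpers, not DMArchiver functions.

```python
import os
import tempfile


def write_log(path: str, text: str) -> None:
    """Write the log explicitly as UTF-8, whatever the platform default is."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def read_log(path: str, encoding: str) -> str:
    """Read the log back with the given decoder."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


message = "[2016-09-07 10:38:10] <Steve> Привет 😂"
log_path = os.path.join(tempfile.mkdtemp(), "conversation.txt")
write_log(log_path, message)

print(read_log(log_path, "utf-8") == message)    # same text comes back
print(read_log(log_path, "latin-1") == message)  # wrong decoder garbles it
```

If the first comparison holds for your file, the data is intact and only the viewer's encoding setting needs changing.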

Getting DMArchiver to work with phone verification and login codes

Whenever I want to use DMA I have to disable phone verification. It's a minimal risk as I can turn it back on again right after. But I imagine if you'd use it more often than me, it becomes a hassle. And even worse, some people might forget to turn it back on or leave it off on purpose because of that.

Now I tried using the login code sent to me on my phone as a password once, and it obviously didn't work. Also the 1 hour temporary app password doesn't work. Do you think you can add support for proper app authentication or look into why the temporary password doesn't work? And then add a command line switch -pv (phone verification) or something?

Certificate Verify Fail when using -u -p switch

Hi,

I tried using dmarchiver -di with the username and password prompts and everything worked fine. However, when I use the -u and -p arguments to pass the username and password on the command line, I get the following error:

C:\dmarchiver>dmarchiver.exe -u [email protected] -p yyyyyyyy -di
DMArchiver 0.1.7
Running on Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (Intel)]

Traceback (most recent call last):
  File "site-packages\requests\packages\urllib3\connectionpool.py", line 595, in urlopen
  File "site-packages\requests\packages\urllib3\connectionpool.py", line 352, in _make_request
  File "site-packages\requests\packages\urllib3\connectionpool.py", line 831, in _validate_conn
  File "site-packages\requests\packages\urllib3\connection.py", line 289, in connect
  File "site-packages\requests\packages\urllib3\util\ssl_.py", line 308, in ssl_wrap_socket
  File "ssl.py", line 362, in wrap_socket
  File "ssl.py", line 580, in __init__
  File "ssl.py", line 807, in do_handshake
ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:600)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "site-packages\requests\adapters.py", line 423, in send
  File "site-packages\requests\packages\urllib3\connectionpool.py", line 621, in urlopen
requests.packages.urllib3.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:600)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "dmarchiver\cmdline.py", line 107, in <module>
  File "dmarchiver\cmdline.py", line 75, in main
  File "dmarchiver\core.py", line 270, in authenticate
  File "site-packages\requests\sessions.py", line 488, in get
  File "site-packages\requests\sessions.py", line 475, in request
  File "site-packages\requests\sessions.py", line 596, in send
  File "site-packages\requests\adapters.py", line 497, in send
requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:600)
Failed to execute script cmdline

Request: csv output

Hi, I just used your DMArchiver and now have 143 txt files with no indication of order or tweep. It would be of great help if you could send the output to one file in csv format: Date (ANSI) tweep name text, each conversation separated by an empty line or something like "======" .
For your information:
I got mails from Twitter that someone logged into my account. Which is fine of course.
It didn't work well in W10 outside a command prompt. But also could be the fact that it was then run from a network share.
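
Until such an option exists, the per-conversation text files could be merged into a single CSV after the fact. This is a minimal sketch assuming the "[date] <name> text" line layout from the README sample; `append_conversation` and the column names are illustrative choices, and lines that do not match the layout (for example continuation lines of a multi-line message) are skipped.

```python
import csv
import io
import re
from typing import Iterable

LINE_RE = re.compile(r"^\[([^\]]+)\] <([^>]+)> (.*)$")


def append_conversation(writer, conversation_id: str, log_lines: Iterable[str]) -> int:
    """Write one CSV row per recognised log line; return the number of rows."""
    rows = 0
    for line in log_lines:
        match = LINE_RE.match(line.rstrip("\r\n"))
        if match is None:
            continue  # not a message line in the assumed layout
        date, author, text = match.groups()
        writer.writerow([conversation_id, date, author, text])
        rows += 1
    return rows


buffer = io.StringIO()  # a real script would open a file with newline=""
writer = csv.writer(buffer)
writer.writerow(["conversation_id", "date", "author", "text"])
append_conversation(writer, "645754097571131337", [
    "[2016-09-07 10:37:12] <Kathy> He is so sexy. I love him.",
    "[2016-09-07 10:38:10] <Steve> You guys are ridiculous!",
])
print(buffer.getvalue())
```

Keeping the conversation ID as the first column avoids the need for separator lines between conversations, since any spreadsheet tool can then filter or sort on it.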

Add an new option to add the date to the images filename

When I want to search a date in the archive, I can use any text tool to find [YYYY-MM-DD hh:mm:ss]. However, the images filename use a less intuitive format, converting https://ton.twitter.com/i/ton/data/dm/firsthash/secondhash/thirdhash.jpg to firsthash-secondhash-thirdhash.jpg.

Can you add a new option to use an intuitive or "human" format?. Something like this: YYYYMMDD-hhmmss-thirdhash.jpg. The thirdhash would avoid any collision in the filename. Probably the original format is useful for someone, that's why I'm asking for a new option instead of change the default.
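
The proposed naming scheme could look like the sketch below. `dated_image_name` is a hypothetical helper, not an existing DMArchiver function, and it assumes the message timestamp is already available as a `datetime`.

```python
import posixpath
from datetime import datetime
from urllib.parse import urlparse


def dated_image_name(url: str, sent_at: datetime) -> str:
    """Build 'YYYYMMDD-hhmmss-thirdhash.ext' from a media URL and a timestamp."""
    last_segment = posixpath.basename(urlparse(url).path)  # e.g. 'thirdhash.jpg'
    return f"{sent_at:%Y%m%d-%H%M%S}-{last_segment}"


print(dated_image_name(
    "https://ton.twitter.com/i/ton/data/dm/firsthash/secondhash/thirdhash.jpg",
    datetime(2016, 9, 7, 10, 35, 55),
))
```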

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Edited by Mincka on August 10th 2017:

The error message was due in this case to invalid json data. It seemed to be related to a connection issue and it was not possible to reproduce it. Other causes can be found here: https://stackoverflow.com/a/18460958

Original post:
New ticket created from
#1 (comment)

$ /Users/xxx/Downloads/dmarchiver -id "YYY" -di -dg

Enter your username or email: zzz

Enter your password (characters will not be displayed):

Authentication succeedeed.

Conversation ID specified (YYY). Retrieving only one thread.

Starting crawl of 'YYY'

Failed to execute script cmdline

Traceback (most recent call last):
  File "dmarchiver/cmdline.py", line 70, in <module>
  File "dmarchiver/cmdline.py", line 62, in main
  File "dmarchiver/core.py", line 468, in crawl
  File "requests/models.py", line 826, in json
  File "json/__init__.py", line 319, in loads
  File "json/decoder.py", line 339, in decode
  File "json/decoder.py", line 357, in raw_decode
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Request: Save twitter user names?

Would it be possible to have the program save the @ twitter id of the people I have conversations with, perhaps at the top of the log? It would help me jump to those conversations with searches and id those that deactivate.

Allow differential backup to complete a previous backup

From @williammmiller1's idea.

It's currently not possible to make a differential archive based on a previous extraction or a time-based option. Consequently, the script will have to download again a complete thread up to the first message.

Multiple implementations could be done:

  1. Allow the possibility to specify a previous backup to complete only the delta since the last message
  2. Allow the possibility to specify min / max tweet IDs with the arguments
  3. Allow the possibility to specify min / max date with the arguments
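
As an illustration of the first and third ideas combined, the sketch below finds the newest timestamp in a previous backup and keeps only messages sent after it. It assumes the "[YYYY-MM-DD hh:mm:ss]" prefix from the README sample; `last_timestamp` and `only_newer` are hypothetical helpers. Timestamps have one-second resolution, so messages sent within the same second as the last archived one would be missed, which is why tweet IDs (the second idea) would be the more precise key.

```python
import re
from datetime import datetime
from typing import Iterable, List, Optional, Tuple

STAMP_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]")


def last_timestamp(log_lines: Iterable[str]) -> Optional[datetime]:
    """Return the newest timestamp found in an existing log, if any."""
    latest: Optional[datetime] = None
    for line in log_lines:
        match = STAMP_RE.match(line)
        if match is None:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if latest is None or stamp > latest:
            latest = stamp
    return latest


def only_newer(messages: Iterable[Tuple[datetime, str]],
               since: Optional[datetime]) -> List[Tuple[datetime, str]]:
    """Keep the (timestamp, text) pairs that are strictly newer than `since`."""
    if since is None:
        return list(messages)  # no previous backup: keep everything
    return [message for message in messages if message[0] > since]


previous = ["[2016-09-07 10:35:55] <Michael> hello", "[2016-09-07 10:37:12] <Kathy> hi"]
fetched = [(datetime(2016, 9, 7, 10, 37, 12), "hi"),
           (datetime(2016, 9, 7, 10, 38, 10), "You guys are ridiculous!")]
print(only_newer(fetched, last_timestamp(previous)))
```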

Error when using the latest release - not all messages downloaded

Hi there.
I downloaded the latest Mac release. When I ran the archiver, I noticed it only went back to July 2017 in some of the threads and not all the way to the beginning. One of my largest DM messages is about 22MB txt file once downloaded and this time it was only 1.5MB. I did one screen shot of what seems to be different than what I normally see. Hope this helps in the explanation
As a reminder I'm using the Mac version.

Thanks
Ronnie
screenshot 2017-10-06 19 25 34
screenshot 2017-10-06 19 56 01
screenshot 2017-10-06 19 55 37

KeyError: 'trusted'

Conversation ID not specified. Retrieving all the threads.
Traceback (most recent call last):
  File "dmarchiver\cmdline.py", line 107, in <module>
  File "dmarchiver\cmdline.py", line 96, in main
  File "dmarchiver\core.py", line 302, in get_threads
KeyError: 'trusted'
Failed to execute script cmdline

How can I set a proxy?

Hello,
When I run the tool, it gives me the error below.
I think I need to set a proxy server for the HTTPS connection (in Iran, HTTPS is blocked).
Is it possible to define a new command-line parameter to set the proxy settings?
Thanks
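
A possible shape for such a parameter is sketched below. The `--proxy` option and `build_proxies` are hypothetical and do not exist in DMArchiver; the returned mapping follows the scheme-to-URL format that the requests library accepts for its `proxies` setting. Since the tracebacks show the tool relies on requests, setting the HTTPS_PROXY environment variable before launching it may also work without any code change, because requests reads proxy settings from the environment by default.

```python
import argparse
from typing import Dict, Optional


def build_proxies(proxy_url: Optional[str]) -> Dict[str, str]:
    """Return a scheme-to-proxy mapping, or an empty dict when no proxy is set."""
    if not proxy_url:
        return {}
    return {"http": proxy_url, "https": proxy_url}


parser = argparse.ArgumentParser()
# Hypothetical option, shown for illustration only.
parser.add_argument("--proxy", default=None, help="proxy URL, e.g. http://127.0.0.1:8080")

args = parser.parse_args(["--proxy", "http://127.0.0.1:8080"])
print(build_proxies(args.proxy))
```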

E:\tww>dmarchiver.exe
DMArchiver 0.1.7
Running on Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (Intel)]

Enter your username or email: zoghal
Enter your password (characters will not be displayed):
Traceback (most recent call last):
  File "site-packages\requests\packages\urllib3\connection.py", line 142, in _new_conn
  File "site-packages\requests\packages\urllib3\util\connection.py", line 98, in create_connection
  File "site-packages\requests\packages\urllib3\util\connection.py", line 88, in create_connection
OSError: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "site-packages\requests\packages\urllib3\connectionpool.py", line 595, in urlopen
  File "site-packages\requests\packages\urllib3\connectionpool.py", line 352, in _make_request
  File "site-packages\requests\packages\urllib3\connectionpool.py", line 831, in _validate_conn
  File "site-packages\requests\packages\urllib3\connection.py", line 254, in connect
  File "site-packages\requests\packages\urllib3\connection.py", line 151, in _new_conn
requests.packages.urllib3.exceptions.NewConnectionError: <requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x02DB23D0>: Failed to establish a new connection: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "site-packages\requests\adapters.py", line 423, in send
  File "site-packages\requests\packages\urllib3\connectionpool.py", line 640, in urlopen
  File "site-packages\requests\packages\urllib3\util\retry.py", line 287, in increment
requests.packages.urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='twitter.com', port=443): Max retries exceeded with url: /login (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x02DB23D0>: Failed to establish a new connection: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "dmarchiver\cmdline.py", line 107, in <module>
  File "dmarchiver\cmdline.py", line 75, in main
  File "dmarchiver\core.py", line 270, in authenticate
  File "site-packages\requests\sessions.py", line 488, in get
  File "site-packages\requests\sessions.py", line 475, in request
  File "site-packages\requests\sessions.py", line 596, in send
  File "site-packages\requests\adapters.py", line 487, in send
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='twitter.com', port=443): Max retries exceeded with url: /login (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x02DB23D0>: Failed to establish a new connection: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions',))
Failed to execute script cmdline

Code 131: Internal error

I am trying this tool out but I get the same error each time:

Previous conversation not found. Creating a new one with incremental support.
An error occured during the parsing of the tweets.

Twitter error details below:
Code 131: Internal error

Download videos

Neither the argument -di nor -dg currently downloads videos. Would be cool to add a -dv argument.

DMArchiver stopped working: "Unknown element type"

For the past few days, downloading the direct messages has not worked anymore. Something in the HTML must have changed on Twitter's side.

I get the error Unknown element type. In the txt file the date, time and username is there but the actual text message is missing.

Where do you find the output files on macOS?

First of all, I want to thank you so much for creating this tool. No one on the internet did this but you. Its my 1 year anniversary with my girlfriend and i really want to retrieve a string of messages because i'm planning to gift her something and it is important that i have all the dms archived. Unfortunately i'm very bad at coding and i couldn't understand the instructions. I think my convo id is 853339611514507267, do i have to type my user name & pass then enter this $ dmarchiver -id "853339611514507267" or is there something i am missing. also where do the messages end up? i do realize how stupid this whole question might be but i'd be grateful if you could assist me. Thank you so much.

Handle errors in requests (locked account)

Connections from new IP addresses, with a new browser (Firefox user-agent), or after invalid authentications may trigger an account block. The script will then be unable to parse the response and will return error messages such as "KeyError: 'threads'" or "KeyError: 'inner'".

Intermittent timeout errors

this won't work. once i enter my username and password, the application crashes and doesn't provide any further information.

Not Finding Files Of DM's

So I've been having trouble to display any of the DM's it's downloaded. I have a Mac if that helps. Anyone, can you help?

XMLSyntaxError: switching encoding: encoder error

Edited by Mincka on August 10th 2017:
For anybody Googling for this error message XMLSyntaxError: switching encoding: encoder error:

  • It may be related to the parsing in lxml of emojis or specific ranges of Unicode characters (like 𝜋) which are four-byte characters
  • The issue is specific to macOS and Python 3.5
  • A bug ticket is open but nobody seems to be working on it (https://bugs.launchpad.net/lxml/+bug/1538213)

Possible workarounds:

  1. Strip the emojis on macOS before the parsing, see this implementation in 073a358
  2. Downgrade to Python 3.4 if you can. I attempted to upgrade to Python 3.6 but had other compatibility issues, this time with pyinstaller, so I was unable to move forward. Downgrading to Python 3.4 allowed my tool to work perfectly on all platforms.
  3. Remove the lxml package and reinstall it using STATIC_DEPS=true (lorien/grab#199 (comment)). However, I cannot guarantee this will work. Using multiple Python versions on macOS is such a huge pain. 😞

Original message:
My setup:

  • Python 3.5.2
  • macOS Sierra 10.12
$ dmarchiver
Enter your username or email: myusername
Enter your password (characters will not be displayed): 
Authentication succeedeed.
Conversation ID not specified. Retrieving all the threads.
Starting crawl of '################'
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/bin/dmarchiver", line 9, in <module>
    load_entry_point('dmarchiver==0.0.5', 'console_scripts', 'dmarchiver')()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/cmdline.py", line 67, in main
    crawler.crawl(thread_id, args.download_images, args.download_gifs)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/core.py", line 443, in crawl
    tweets, download_images, download_gif)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/core.py", line 357, in _process_tweets
    document = lxml.html.fragment_fromstring(value)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 825, in fragment_fromstring
    base_url=base_url, **kw)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 786, in fragments_fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 752, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77737)
  File "src/lxml/parser.pxi", line 1830, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116674)
  File "src/lxml/parser.pxi", line 1711, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115220)
  File "src/lxml/parser.pxi", line 1051, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109345)
  File "src/lxml/parser.pxi", line 584, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103584)
  File "src/lxml/parser.pxi", line 694, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105238)
  File "src/lxml/parser.pxi", line 624, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:104147)
lxml.etree.XMLSyntaxError: switching encoding: encoder error, line 1, column 1

Conversation not extracted

Hello,
I tried to extract a twitter DM conversation with the macOS version, but I encountered two issues:

  1. the screen says "Process completed", I can find the .txt files in Finder but the conversation I need has not been extracted (I maybe have two dozens conversations, but I only need that one)
  2. the extracted conversations look like this: no DM visible. Did I do something wrong?
    screen shot 2017-06-25 at 15 35 46

My boyfriend died in April. I'm trying to save our twitter DMs... Thanks for any help.
Celine

Option file

Any chance a option file could be integrated? (Windows)

Possible options:
Username and password for scheduled archives
Browser emulation selection

Only 50 latest conversations are downloaded in "All conversations" mode

From #7:

I think I may have an idea why all the threads were not downloaded the first time. I've counted all the threads in your previous message and found exactly 50 conversations. Currently, to find "all" the conversation IDs, the script loads the conversations available on the "first" "Messages" page but does not simulate scrolling to load more. I thought that all the conversations were listed directly.

My guess is that when you scroll down through all the conversations, at the bottom, Twitter loads the next 50 conversations. I did not identify this case because I have a lot less than 50 conversations on Twitter! But it's an interesting case and I'm going to open a new ticket to improve the "all threads" mode which is in fact a "latest 50 conversations" mode it seems.

Twitter account being locked due to suspected Robot

Hi, twice now when using DMArchiver I've had my Twitter account locked because they think I'm using a robot (which I am, technically) to do something bad (which I'm not).

Both times it's easy to unlock using ReCaptcha or SMS code - I'm assuming it's because of the speed of the requests being made.

Would it be possible to add an optional argument to introduce a delay between requests, to more closely resemble normal browser action?

eg. -td 10 (10 second delay between requests)
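
The idea can be sketched as a fixed pause between consecutive requests. `fetch_all` and its parameters are illustrative names, and the `fetch` callable stands in for a real HTTP request; note that the README's help output lists a -d / --delay option for this purpose.

```python
import time
from typing import Callable, Iterable, List


def fetch_all(urls: Iterable[str],
              fetch: Callable[[str], str],
              delay_seconds: float,
              sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Call fetch(url) for each URL, pausing delay_seconds between calls."""
    results: List[str] = []
    for index, url in enumerate(urls):
        if index > 0:              # no pause before the very first request
            sleep(delay_seconds)
        results.append(fetch(url))
    return results


pauses: List[float] = []
pages = fetch_all(["page-1", "page-2", "page-3"],
                  fetch=str.upper,          # stand-in for a real HTTP request
                  delay_seconds=10,
                  sleep=pauses.append)      # record pauses instead of waiting
print(pages, pauses)
```

Passing the sleep function as a parameter keeps the loop testable without actually waiting.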

Thanks for an excellent program.

How do i find the requests in the Twitter DM developer page

Hi there! Thanks so much for replying to my message. I found my token ID using the developer tab in Chrome but how do i find the requests to get the DM Message ID once i've clicked on the conversation i'm looking at. Do i do it the same way on same screen i found the token ID? Under elements?
Here is a screen shot of my window. I can switch to safari if that is easier to find it.
screenshot 2016-10-23 15 04 33

Sorry to be such a novice, i really wish i understood all of this. So please excuse my ignorance, but i have a willingness and ability to take direction when it comes to tech stuff.

Thanks,
Ronnie
Here is the developer tool in Safari but i'm not sure where to look for conversation iD for the conversation shown on screen as #TeamErin

screenshot 2016-10-23 15 27 20

Text added to cards may be incomplete

When a link is shared and user adds additional text, the added text may not be included in the log.

In the following generated sample, "This is a test." is not included.

  <p class="TweetTextSize  js-tweet-text tweet-text" lang="" data-aria-label-part="0">How I lost my 25-year battle against corporate claptrap <a href="https://t.co/gIrbtXuRSv" rel="nofollow noopener" dir="ltr" data-expanded-url="https://www.ft.com/lucycolumn" class="twitter-timeline-link" target="_blank" title="https://www.ft.com/lucycolumn" >
        <span class="tco-ellipsis"/>
        <span class="invisible">https://www.</span>
        <span class="js-display-url">ft.com/lucycolumn</span>
        <span class="invisible"/>
        <span class="tco-ellipsis">
            <span class="invisible">&nbsp;</span>
        </span>
    </a> This is a test.</p>

This is because cssselect extracts only the text node before the <a> element. A workaround could be to use text_content():

def _parse_dm_text(self, element):
    # text_content() joins every descendant text node, including text after </a>
    text_tweet = element.cssselect("p.tweet-text")[0]
    dm_text = text_tweet.text_content()
    return DirectMessageText(dm_text)

The output would be:
[2017-08-16 13:37:49] <Julien Ehrhart> [Card-summary_large_image] https://www.ft.com/lucycolumn How I lost my 25-year battle against corporate claptrap https://www.ft.com/lucycolumn This is a test.

Two issues here:

  1. The link appears twice (once during the parsing of the card, once during the parsing of the text) -> Acceptable
  2. The emojis are not in the text so they are stripped from the output -> Not acceptable
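One way around the second issue might be to walk the markup in document order and substitute each image's alt text for the image itself. The sketch below rests on an assumption not shown in the sample above: that emojis are rendered as `<img>` elements whose `alt` attribute holds the emoji character. It also swaps in the standard library's `html.parser` instead of the lxml-style calls used in the workaround, purely to stay self-contained. `TextWithEmoji` and `extract_text` are hypothetical names.

```python
from html.parser import HTMLParser


class TextWithEmoji(HTMLParser):
    """Collect text nodes in document order, using <img alt="..."> for images."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Assumption: an emoji is an <img> whose alt attribute is the character.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt)

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(html_fragment):
    parser = TextWithEmoji()
    parser.feed(html_fragment)
    parser.close()
    # Collapse the runs of whitespace left behind by the markup's indentation.
    return " ".join("".join(parser.parts).split())
```

With this approach the text after the link and the emoji characters both survive, at the cost of treating every image with an `alt` attribute as text.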

KeyError: Threads, Failed to execute script cmdline

New error when running on Windows, with or without -di -dg:

DMArchiver 0.1.6
Running on Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:18:55) [MSC v.1900 64 bit (AMD64)]

Authentication succeedeed.

Conversation ID not specified. Retrieving all the threads.
Traceback (most recent call last):
  File "dmarchiver\cmdline.py", line 95, in <module>
  File "dmarchiver\cmdline.py", line 86, in main
  File "dmarchiver\core.py", line 302, in get_threads
KeyError: 'threads'
Failed to execute script cmdline

DMArchiver 0.1.7
Running on Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (Intel)]

Authentication succeedeed.

Press Ctrl+C at anytime to write the current conversation and skip to the next one.
Keep it pressed to exit the script.

Conversation ID not specified. Retrieving all the threads.
Traceback (most recent call last):
  File "dmarchiver\cmdline.py", line 107, in <module>
  File "dmarchiver\cmdline.py", line 96, in main
  File "dmarchiver\core.py", line 302, in get_threads
KeyError: 'threads'
Failed to execute script cmdline
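Both tracebacks end in a bare KeyError, which says nothing about what the response actually contained. A defensive lookup along the following lines could make such a failure easier to diagnose. It is a sketch that assumes the response has already been decoded into a dict. Only the 'threads' key comes from the traceback; `extract_threads` is a hypothetical helper, not the code in core.py.

```python
def extract_threads(payload):
    """Return payload['threads'], or raise an error naming what was received."""
    try:
        return payload["threads"]
    except KeyError:
        keys = ", ".join(sorted(payload)) or "(none)"
        raise RuntimeError(
            "Response has no 'threads' key; keys present: " + keys
        ) from None
```

Listing the keys that did arrive would show at a glance whether the response was an error payload or a changed format.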

Crashes after login

After login, DMArchiver crashed several times. Using my @_name eventually worked but then it began crawling from the very beginning.
This is a very long DM conversation (2 years (!) now), so Twitter did not like that and locked my account.

I only needed an incremental update to get the last two weeks of this conversation. Hope this will be possible soon. (Windows 7)
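An incremental update of this kind could, in principle, stop crawling as soon as it reaches a message that is already archived. The sketch below assumes pages arrive newest first and that each message carries a unique ID. `messages_since` and the tuple layout are hypothetical, not DMArchiver's data model.

```python
def messages_since(pages, last_seen_id):
    """Yield messages newer than last_seen_id, stopping at the first known one.

    pages: iterable of lists of (message_id, text) tuples, newest first.
    """
    for page in pages:
        for message_id, text in page:
            if message_id == last_seen_id:
                return  # everything from here on is already archived
            yield message_id, text


pages = [[(5, "e"), (4, "d")], [(3, "c"), (2, "b")], [(1, "a")]]
new_messages = list(messages_since(pages, last_seen_id=3))
```

If `pages` is itself a lazy iterator, later pages are never requested, which would also keep the number of requests low for long conversations.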

Scalability (# of messages)?

Are there limits on the number of messages? I successfully tested the script with roughly 13k messages / 1.3 MB in one conversation.

The script seems to cache the messages in memory. Would it be more scalable if it wrote the messages to a file incrementally instead of caching them?
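The incremental approach asked about here might look roughly like this: each batch of log lines is appended to the file as soon as it is available, so memory use depends on the batch size, not on the length of the conversation. `write_batches` is a hypothetical helper, and the batch structure is an assumption.

```python
def write_batches(batches, path):
    """Append each batch of log lines to path as soon as it is available."""
    total = 0
    with open(path, "a", encoding="utf-8") as log_file:
        for batch in batches:
            log_file.writelines(line + "\n" for line in batch)
            log_file.flush()  # keep what is already written if the run is interrupted
            total += len(batch)
    return total
```

One caveat with this design: if the batches arrive newest first, the resulting file is in reverse chronological order and would need to be reversed afterwards.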
