digitalmethodsinitiative / zeeschuimer

A browser extension to collect social media data with.

License: Other

JavaScript 95.17% HTML 4.28% CSS 0.38% Shell 0.18%


zeeschuimer's Issues

Twitter doesn't work

Hello.
I've tried Twitter and it works when browsing multiple accounts, but nothing is collected from search results.
Is there a solution?

While this is an incredible tool and I'm highly grateful to the developers, it does not seem to collect the links that people share in their LinkedIn posts. It collects the text and the images, but not links to PDFs or other websites.

Doesn't Seem to Collect Replies to a Tweet

If you view the web page of a single tweet and scroll through the replies, it appears that no data is collected. Looking at the network history in the browser developer console, it looks as if the data comes via a URL of the form https://twitter.com/i/api/graphql/<random string>/TweetDetail. Looking at the source code for the plugin, TweetDetail doesn't seem to be one of the URL checks. I'm not sure whether just adding that check would be enough, or whether the format of the response is too different. I will try to check myself, but thought I'd raise it now, given that I currently only have a released version installed and so am not immediately set up to debug or develop.
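As a rough illustration of what such a URL check could look like, here is a minimal sketch. It is not Zeeschuimer's actual matching code: the helper name `makeGraphqlMatcher` is hypothetical, and the only operation name taken from this issue is `TweetDetail`. Whether the response format can then be parsed the same way is a separate question.

```javascript
// Hypothetical helper, not part of Zeeschuimer: builds a predicate that
// recognises Twitter GraphQL URLs of the form
//   https://twitter.com/i/api/graphql/<random string>/<OperationName>
function makeGraphqlMatcher(operationNames) {
    const wanted = new Set(operationNames);
    return function matches(rawUrl) {
        let url;
        try {
            url = new URL(rawUrl);
        } catch {
            return false; // not a parseable absolute URL
        }
        // The query string is ignored; only the path is inspected.
        const match = url.pathname.match(/^\/i\/api\/graphql\/[^/]+\/([^/]+)$/);
        return match !== null && wanted.has(match[1]);
    };
}

// Example: a matcher that accepts the reply endpoint named in this issue.
const isReplyResponse = makeGraphqlMatcher(['TweetDetail']);
```

Taking the operation names as a parameter keeps the sketch independent of whichever endpoints the extension already checks.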

Generated object URLs are never revoked

Currently, whenever the user clicks a button to download ndjson, Zeeschuimer creates a new object URL:

zeeschuimer/popup/interface.js, lines 273 to 277 in 6d305ff:

    await browser.downloads.download({
        url: window.URL.createObjectURL(blob),
        filename: filename,
        conflictAction: 'uniquify'
    });

These URLs are never released, so in a long-running session even if you periodically download the ndjson and then "reset" to discard the downloaded data, the ndjson blobs are still referenced and cannot be freed from memory.

The code should track when the ndjson download is complete and then call revokeObjectURL appropriately.
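A minimal sketch of that fix, assuming the WebExtensions `downloads.onChanged` event is used to detect when a download finishes. The function name `downloadAndRevoke` and the injectable `api` and `urlApi` parameters are hypothetical and only make the sketch testable outside the extension; this is not the project's actual code.

```javascript
// Hypothetical helper: starts a download for a blob and revokes the
// object URL once the download completes or is interrupted.
async function downloadAndRevoke(blob, filename, api = globalThis.browser, urlApi = globalThis.URL) {
    const objectUrl = urlApi.createObjectURL(blob);
    let downloadId = null;

    const release = () => {
        api.downloads.onChanged.removeListener(onChanged);
        urlApi.revokeObjectURL(objectUrl);
    };

    function onChanged(delta) {
        // Ignore other downloads and changes that are not state changes.
        if (delta.id !== downloadId || !delta.state) {
            return;
        }
        const state = delta.state.current;
        if (state === 'complete' || state === 'interrupted') {
            release();
        }
    }

    // Register before starting so later state changes are observed.
    api.downloads.onChanged.addListener(onChanged);
    try {
        downloadId = await api.downloads.download({
            url: objectUrl,
            filename: filename,
            conflictAction: 'uniquify'
        });
    } catch (error) {
        // The download never started, so the URL can be released right away.
        release();
        throw error;
    }
    return downloadId;
}
```

One caveat of this sketch: a state change that arrives before `download()` resolves would not be matched, because the download ID is not yet known at that point.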

TikTok Thumbnails Output "D", no URL

After scraping TikTok, exporting to 4CAT, and downloading the CSV from 4CAT, I see that the Thumbnail URL column is populated with just "D". Restarting Firefox and recrawling did not fix the issue. This has happened before, and I was able to fix the issue by uninstalling Zeeschuimer and reinstalling it.

Any idea why this keeps happening, or how I can fix it more sustainably (rather than having to wipe and reinstall Zeeschuimer continually)?

Other data types for various platforms

Right now Zeeschuimer collects "posts" or the local equivalent (messages, tweets, etc). But platforms often offer lists of other types of content, e.g. accounts. Zeeschuimer is currently not set up to capture more than one type of object per platform. But it could be...

Instagram datasource does not collect "Explore" page

Page	Status	Notes
Home	✔
User pages	✔
Hashtag pages	✔
Explore	❌	A couple of posts appear to be collected, but not the vast majority
Reels	❌	Reels are a slightly different datatype...

I noticed the Explore page could not collect results and wanted to compile a quick list for reference.

Extension does not provide correct X-Zeeschuimer-Platform header for 9gag.com

Steps to reproduce:

1. Install Zeeschuimer
2. Enable collection for 9gag.com
3. Visit 9gag.com, ingest posts into Zeeschuimer (item count increases)
4. Click "to 4CAT"
5. Error: "The 4CAT server does not accept 9gag.com datasets. The 4CAT administrator may need to enable the data source or upgrade 4CAT."

The response from the /api/import-dataset is 404, "Unknown platform or source format"

Possible cause:

I have located this issue in the value provided in the X-Zeeschuimer-Platform header, which is currently 9gag.com.

This is not a recognised data source on the 4CAT side: https://github.com/digitalmethodsinitiative/4cat/blob/master/datasources/ninegag/__init__.py#L11.
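One possible shape for a fix is to translate the extension's module identifier before it is sent as the header. The sketch below is illustrative only: the helper names are hypothetical, and the target value `ninegag` is an assumption inferred from the directory name in the link above, not a value confirmed against the 4CAT source.

```javascript
// ASSUMPTION: 4CAT identifies this data source as 'ninegag' (inferred from
// the datasources/ninegag/ path). Verify against 4CAT before relying on it.
const PLATFORM_HEADER_OVERRIDES = {
    '9gag.com': 'ninegag'
};

// Hypothetical helper: returns the header value for a module identifier,
// falling back to the identifier itself when no override is defined.
function platformHeaderFor(moduleId) {
    if (Object.prototype.hasOwnProperty.call(PLATFORM_HEADER_OVERRIDES, moduleId)) {
        return PLATFORM_HEADER_OVERRIDES[moduleId];
    }
    return moduleId;
}

// Hypothetical helper: builds the headers for the upload request.
function buildUploadHeaders(moduleId) {
    return { 'X-Zeeschuimer-Platform': platformHeaderFor(moduleId) };
}
```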

Port 5000 in result/dataset link

After gathering data with Zeeschuimer, a result link is generated combining the given 4CAT server URL with a public port number (inserting :5000).

On our servers the usage of port 5000 within the container is possible, but we cannot open this port publicly. As a result, this leads to a connection timeout error, as the server cannot be accessed through this port.

Would any of the following be possible?

Making the public port number in result links optional.
Making the public port setting adjustable, so that we could for example use port 443.
Leaving the port number out of the result links (resulting in: https://my-4cat.link/results/).
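The options above could be combined by treating the public port as an optional setting. This is a minimal sketch using the standard `URL` class; the function name `buildResultUrl` and the `publicPort` parameter are hypothetical and do not reflect how Zeeschuimer or 4CAT currently build the link.

```javascript
// Hypothetical helper: builds the results link from the configured 4CAT
// server URL, adding a port only when one is explicitly configured.
function buildResultUrl(serverUrl, publicPort = null) {
    const url = new URL('/results/', serverUrl);
    if (publicPort !== null && publicPort !== '') {
        // The URL class drops the port again if it is the scheme's default
        // (for example 443 for https).
        url.port = String(publicPort);
    }
    return url.href;
}

// Example calls for the three cases discussed in this issue.
const withoutPort = buildResultUrl('https://my-4cat.link');
const withCustomPort = buildResultUrl('https://my-4cat.link', 5000);
const withDefaultPort = buildResultUrl('https://my-4cat.link', 443);
```

With no port configured, the link falls back to the server URL as entered, which corresponds to the third option.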

TikTok video_url maybe need token

I'm trying to download the links output in the video_url field and am having trouble with the majority of them. It seems that links in this format work:
https://v16m-webapp.tiktokcdn-us.com/a5f9e55857d5db62a3ebba484739bed4/63924705/video/tos/alisg/tos-alisg-pve-0037c001/o4bTCdKE8jmYQgDxlVPeZEfnBEU8hoRPoPQAgB/?a=1988&ch=0&cr=0&dr=0&lr=tiktok_m&cd=0%7C0%7C1%7C0&cv=1&br=3904&bt=1952&cs=0&ds=3&ft=ebtHKH-qMyq8ZpuyShe2Nom~fl7Gb&mime_type=video_mp4&qs=0&rc=NTk6OTU1aGllOmc0Nzw7OUBpM2hkZjo6Zjd2aDMzODczNEAxXmI2YDIvNTMxNGNhLjUwYSNzY21qcjRvaDBgLS1kMS1zcw%3D%3D&l=2022120814193899017DE6214470059B7C
While links in this format return a 403 Forbidden error:
https://v16-webapp-prime.us.tiktok.com/video/tos/useast2a/tos-useast2a-pve-0068/owRNRjLvjuVDEDYnBAUEeNwbVxbQmeInQPJBVQ/?a=1988&ch=0&cr=0&dr=0&lr=tiktok_m&cd=0%7C0%7C1%7C0&cv=1&br=2812&bt=1406&cs=0&ds=3&ft=ebLH6H-qMyq8Zn.yShe2N03ufl7Gb&mime_type=video_mp4&qs=0&rc=PDk3ZTxkOWc7OWQ0ZTZoNkBpang1dDk6ZjhqaDMzNzczM0AyMGJgMGNjXjIxYWJhY140YSNeY2tvcjRfbDNgLS1kMTZzcw%3D%3D&expire=1670530563&l=20221208141458BB1C197C6FEECD05291B&policy=2&signature=42c81996a4e72c5814e104b8cf1d8624&tk=tt_chain_token

I'm guessing that tt_chain_token means something. I'm not sure that there is anything in the response that Zeeschuimer could lift and provide, but right now I don't know how to download the majority of videos (and I'm also not sure what causes the different formats). I'm currently not logged in, if that makes a difference.
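To help sort collected links, the failing format above can be told apart by its extra query parameters. The sketch below reads the `expire` parameter seen in the second URL. It assumes that value is a Unix timestamp in seconds, which is a guess based on its shape and not something confirmed by TikTok. The function names are hypothetical, and the sketch does not make a 403 link downloadable.

```javascript
// ASSUMPTION: 'expire' is a Unix timestamp in seconds.
// Hypothetical helper: returns the expiry as a Date, or null when the URL
// carries no numeric 'expire' parameter.
function getSignedUrlExpiry(rawUrl) {
    const value = new URL(rawUrl).searchParams.get('expire');
    if (value === null || !/^\d+$/.test(value)) {
        return null;
    }
    return new Date(Number(value) * 1000);
}

// Hypothetical helper: true when the URL has an expiry that has passed.
function isExpired(rawUrl, now = new Date()) {
    const expiry = getSignedUrlExpiry(rawUrl);
    return expiry !== null && expiry.getTime() <= now.getTime();
}
```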

not collecting tiktok data

Hi!
No matter how long I scroll, or on whichever TikTok page, the number of items in Zeeschuimer stays at 0. Yesterday it would only collect from the For You page, but not from hashtag pages. Today it won't collect at all. I'd really appreciate any help you can provide.

Refreshing image URLs?

I guess this is more of a feature request, and maybe zeeschuimer isn't the right place for it.

I didn't realize that TikTok preview URLs have a limited lifetime, and now my dataset has a lot of stale links in it. Is there an easy way to take the TikTok URLs and rerun Zeeschuimer to get fresh image URLs?

Twitter: JSON does not distinguish between genuine search results and "promoted" tweets

When you perform a search or view a user or home timeline on Twitter, the tweets you are actually searching for are peppered with "promoted" advertising tweets that are not related to your search terms. These are currently not distinguished in the exported JSON so it is impossible to filter them out if you want to work on just the tweets that are actually relevant to your search terms.

Missing TikTok data

Hello there!
I am collecting TikTok data, and I keep getting datasets that are missing the first 30-40 TikToks. I re-ran the collection a few times, and once in a while I can actually collect everything, but it is rather unpredictable. Is there a way to fix this? Thank you in advance!

From what I can see, the issue of duplicates happens with manual scrolling, and not if I use an extension to automate the process.
When present, duplicates consist of the same TikTok videos appearing twice, with the same metadata. In the datasets I collected (around 4,000-5,000 posts), deleting the duplicates reduced each dataset by about 1,000 posts on average.
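Until the cause is found, duplicates like these can be dropped from an exported ndjson file after the fact. The sketch below keeps the first occurrence of each item. The function name `dedupeNdjson` is hypothetical, and the default key field `item_id` is an assumption about the export format; pass whichever field uniquely identifies a post in your data.

```javascript
// Hypothetical helper: removes repeated items from NDJSON text, keeping
// the first occurrence of each key. ASSUMPTION: 'item_id' identifies a post.
function dedupeNdjson(ndjsonText, keyField = 'item_id') {
    const seen = new Set();
    const kept = [];
    for (const line of ndjsonText.split('\n')) {
        if (line.trim() === '') {
            continue; // skip blank lines, including the trailing one
        }
        const item = JSON.parse(line);
        const key = item[keyField];
        if (key !== undefined) {
            const normalised = String(key);
            if (seen.has(normalised)) {
                continue; // duplicate of an earlier item
            }
            seen.add(normalised);
        }
        // Items without the key field are kept unchanged.
        kept.push(line);
    }
    return kept.length > 0 ? kept.join('\n') + '\n' : '';
}
```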
