sketchpad's Issues

Posts collect only up to 2022-11-03

When I put an end date of 01/01/2021, the script collects all of the posts in a subreddit up to November 3, 2022, and then stops. It then collects all of the comments until 01/01/2021 as requested.

Any idea why the script is not behaving the same for posts and comments?

I only added the end date, and changed the output files so that they include the name of the subreddit.

Would it be possible for you to let me know what I am missing? Thank you in advance for your help.
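To help narrow down whether the cutoff comes from my changes or from the API itself, here is a minimal check (a separate sketch, reusing the same endpoint, parameters and User-Agent as the script; the subreddit and cutoff date are just the values from this report). If it prints 0, pushshift itself is not returning submissions older than 2022-11-03.

# Quick check, separate from the script: query the submission endpoint directly
# with a "before" timestamp earlier than the cutoff and see whether anything comes back.
import requests
from datetime import datetime

cutoff = int(datetime.strptime("11/03/2022", "%m/%d/%Y").timestamp())
resp = requests.get(
	"https://api.pushshift.io/reddit/submission/search",
	params={"subreddit": "php", "limit": 100, "order": "desc", "before": cutoff},
	headers={"User-Agent": "Post downloader by /u/Watchful1"},
)
data = resp.json().get("data", [])
print(len(data), "submissions returned before the cutoff")
for obj in data[:3]:
	print(datetime.utcfromtimestamp(obj["created_utc"]).strftime("%Y-%m-%d"), obj.get("permalink"))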

Just in case, here is the script.

import requests
from datetime import datetime
import traceback
import time
import json
import sys
import csv

username = ""  # put the username you want to download in the quotes
subreddit = "php"  # put the subreddit you want to download in the quotes
thread_id = ""  # put the id of the thread you want to download in the quotes, it's the first 5 to 7 character string of letters and numbers from the url, like 107xayi
# leave either one blank to download an entire user's or subreddit's history
# or fill in both to download a specific user's history from a specific subreddit

post_file = "posts_" + subreddit + ".txt"
comment_file = "comments_" + subreddit + ".txt"

# change this to one of "human", "csv" or "json"
# - human: the score, creation date, author, link and then the comment/submission body on a second line. Objects are separated by lines of dashes
# - csv: a comma separated value file with the fields score, date, title, author, link and then body or url
# - json: the full json object
output_format = "csv"

# default start time is the current time and default end time is all history
# you can change out the below lines to set a custom start and end date. The script works backwards, so the end date has to be before the start date
start_time = datetime.utcnow()  #datetime.strptime("10/05/2021", "%m/%d/%Y")
#end_time = None  #datetime.strptime("09/25/2021", "%m/%d/%Y")
end_time = datetime.strptime("01/01/2021", "%m/%d/%Y")  #datetime.strptime("09/25/2021", "%m/%d/%Y")

convert_to_ascii = False  # don't touch this unless you know what you're doing
convert_thread_id_to_base_ten = True  # don't touch this unless you know what you're doing

def write_human_line(handle, obj, is_submission, convert_to_ascii):
	handle.write(str(obj['score']))
	handle.write(" : ")
	handle.write(datetime.fromtimestamp(obj['created_utc']).strftime("%Y-%m-%d"))
	if is_submission:
		handle.write(" : ")
		if convert_to_ascii:
			handle.write(obj['title'].encode(encoding='ascii', errors='ignore').decode())
		else:
			handle.write(obj['title'])
	handle.write(" : u/")
	handle.write(obj['author'])
	handle.write(" : ")
	handle.write(f"https://www.reddit.com{obj['permalink']}")
	handle.write("\n")
	if is_submission:
		if obj['is_self']:
			if 'selftext' in obj:
				if convert_to_ascii:
					handle.write(obj['selftext'].encode(encoding='ascii', errors='ignore').decode())
				else:
					handle.write(obj['selftext'])
		else:
			handle.write(obj['url'])
	else:
		if convert_to_ascii:
			handle.write(obj['body'].encode(encoding='ascii', errors='ignore').decode())
		else:
			handle.write(obj['body'])
	handle.write("\n-------------------------------\n")

def write_csv_line(writer, obj, is_submission):
	output_list = []
	output_list.append(str(obj['score']))
	output_list.append(datetime.fromtimestamp(obj['created_utc']).strftime("%Y-%m-%d"))
	if is_submission:
		output_list.append(obj['title'])
	output_list.append(f"u/{obj['author']}")
	output_list.append(f"https://www.reddit.com{obj['permalink']}")
	if is_submission:
		if obj['is_self']:
			if 'selftext' in obj:
				output_list.append(obj['selftext'])
			else:
				output_list.append("")
		else:
			output_list.append(obj['url'])
	else:
		output_list.append(obj['body'])
	writer.writerow(output_list)


def write_json_line(handle, obj):
	handle.write(json.dumps(obj))
	handle.write("\n")

def download_from_url(filename, url_base, output_format, start_datetime, end_datetime, is_submission, convert_to_ascii):
	print(f"Saving to {filename}")

	count = 0
	if output_format == "human" or output_format == "json":
		if convert_to_ascii:
			handle = open(filename, 'w', encoding='ascii')
		else:
			handle = open(filename, 'w', encoding='UTF-8')
	else:
		handle = open(filename, 'w', encoding='UTF-8', newline='')
		writer = csv.writer(handle)

	previous_epoch = int(start_datetime.timestamp())
	break_out = False
	while True:
		new_url = url_base + str(previous_epoch)
		json_text = requests.get(new_url, headers={'User-Agent': "Post downloader by /u/Watchful1"})
		time.sleep(1)  # pushshift has a rate limit, if we send requests too fast it will start returning error messages
		try:
			json_data = json_text.json()
		except json.decoder.JSONDecodeError:
			time.sleep(1)
			continue

		if 'data' not in json_data:
			break
		objects = json_data['data']
		if len(objects) == 0:
			break

		for obj in objects:
			previous_epoch = obj['created_utc'] - 1
			if end_datetime is not None and datetime.utcfromtimestamp(previous_epoch) < end_datetime:
				break_out = True
				break
			count += 1
			try:
				if output_format == "human":
					write_human_line(handle, obj, is_submission, convert_to_ascii)
				elif output_format == "csv":
					write_csv_line(writer, obj, is_submission)
				elif output_format == "json":
					write_json_line(handle, obj)
			except Exception as err:
				if 'permalink' in obj:
					print(f"Couldn't print object: https://www.reddit.com{obj['permalink']}")
				else:
					print(f"Couldn't print object, missing permalink: {obj['id']}")
				print(err)
				print(traceback.format_exc())

		if break_out:
			break

		print(f"Saved {count} through {datetime.fromtimestamp(previous_epoch).strftime('%Y-%m-%d')}")

	print(f"Saved {count}")
	handle.close()

if __name__ == "__main__":
	filter_string = None
	if username == "" and subreddit == "" and thread_id == "":
		print("Fill in username, subreddit or thread id")
		sys.exit(0)
	if output_format not in ("human", "csv", "json"):
		print("Output format must be one of human, csv, json")
		sys.exit(0)

	filters = []
	if username:
		filters.append(f"author={username}")
	if subreddit:
		filters.append(f"subreddit={subreddit}")
	if thread_id:
		if convert_thread_id_to_base_ten:
			filters.append(f"link_id={int(thread_id, 36)}")
		else:
			filters.append(f"link_id=t3_{thread_id}")
	filter_string = '&'.join(filters)

	url_template = "https://api.pushshift.io/reddit/{}/search?limit=1000&order=desc&{}&before="

	if not thread_id:
		download_from_url(post_file, url_template.format("submission", filter_string), output_format, start_time, end_time, True, convert_to_ascii)
	download_from_url(comment_file, url_template.format("comment", filter_string), output_format, start_time, end_time, False, convert_to_ascii)

Little improvement so the bot does not repeat work already done if things go wrong

Hi, first of all: thanks for creating this script. It has been super useful for me these last few days. I wanted to suggest that the example script you post when people have doubts about the lack of code for pushshift could be written in a way that saves the user's work. That way you avoid repeating calls every time the script hits an error, or when you simply want to pause it for a while so your computer (and the API) can rest. I understand this is not meant to be a serious script, but people probably use it a lot, as it works wonders and is super easy to use.

I've tried to improve it by changing when the file opens and closes, so progress is not lost if errors are encountered.
Also, the main reason why I am writing this: I added the ability to continue previous work, so no redundant calls are made.

The downside is that comment files are not supported, as I have not yet studied how they work. However, if they share an id the way submissions do, the change should be very similar; a rough sketch of that is included after the script below.

Thanks for your time.

Script:

'''
Improvement on https://github.com/Watchful1/Sketchpad/blob/master/postDownloader.py
Now capable of continuing previous work, avoiding repetition of previous searches and probably avoiding
repeated calls to the server api in case your download did not work properly.

Please make sure, before using it, that your .txt has fields in the following order:
time_stamp : id : <others>

If it is not laid out like this, the bot will detect that the file exists but will not be able to continue your work.
'''
import requests
from datetime import datetime
import traceback
import time
import json
import sys
import os

username = ""  # put the username you want to download in the quotes
subreddit = ""  # put the subreddit you want to download in the quotes
# leave either one blank to download an entire user's or subreddit's history
# or fill in both to download a specific users history from a specific subreddit

filter_string = None
if username == "" and subreddit == "":
    print("Fill in either username or subreddit")
    sys.exit(0)
elif username == "" and subreddit != "":
    filter_string = f"subreddit={subreddit}"
elif username != "" and subreddit == "":
    filter_string = f"author={username}"
else:
    filter_string = f"author={username}&subreddit={subreddit}"

url = "https://api.pushshift.io/reddit/{}/search?limit=1000&sort=desc&{}&before="

start_time = datetime.utcnow()

def get_count():
    # Look for the most recent output file (named like <name>.<count>.txt) and
    # return how many objects were already saved, plus that file's name.
    numbers = []
    files = os.listdir()
    old_filename = ''  # place first used file name
    try:
        for file in files:
            if file.endswith('.txt'):
                file_sliced = file.split('.')
                if len(file_sliced) > 2:
                    number = file_sliced[1]
                    numbers.append(int(number))
        highest_n = int(max(numbers))
        print(max(numbers))
        old_filename = 'spcrusaders' + '.' + str(highest_n) + '.txt'
        n_lines = 0
        print(numbers)
        print(old_filename)
        with open(old_filename, 'r') as f:
            for lines in f:
                n_lines += 1
        highest_n += n_lines
    except Exception:
        highest_n = 0
    return highest_n, old_filename


# https://stackoverflow.com/questions/7167008/efficiently-finding-the-last-line-in-a-text-file
# Note: ls = last_submission
def get_last_ts(path):
    # This method works well enough. Tested with 8*10^6 lines and took about 2.9 seconds on i5-10351G1
    try:
        with open(path, 'r') as f:
            for line in f:
                pass
            last_line = line
        ls_id = last_line.split(' : ')[1]    # place where the id is located in my .txt
        ls_url = 'https://api.pushshift.io/reddit/submission/search/?ids=' + ls_id + '&filter=created_utc'
        ls_json_text = requests.get(ls_url, headers={'User-Agent': "Post downloader by /u/Watchful1"})
        ls_json_data = ls_json_text.json()
        ls_json_object = ls_json_data['data']
        ls_time_stamp = ls_json_object[0]['created_utc']
        print('Last submission time: ' + str(ls_time_stamp) + ' from id: ' + str(ls_id) + ' and url: ' + ls_url)
        print('Continuing from there')
    except Exception as e:
        print('For some reason ls_time_stamp has not been found.\nA new file will not be created; the search will start from the beginning but keep appending to the last file')
        print('Error: ' + str(e))
        ls_time_stamp = False

    return ls_time_stamp

def downloadFromUrl(filename, object_type):
    print(f"Saving {object_type}s to {filename}")

    count, filename = get_count()
    last_ts = get_last_ts(filename)

    previous_epoch = last_ts if last_ts is not False else int(start_time.timestamp())
    
    while True:
        handle = open(filename, 'a')
        new_url = url.format(object_type, filter_string)+str(previous_epoch)
        json_text = requests.get(new_url, headers={'User-Agent': "Post downloader by /u/Watchful1"})
        time.sleep(1)  # pushshift has a rate limit, if we send requests too fast it will start returning error messages
        try:
            json_data = json_text.json()
        except ValueError:
            # ValueError covers both json.decoder.JSONDecodeError and simplejson's
            # JSONDecodeError, so simplejson does not need to be imported here
            time.sleep(1)
            continue

        if 'data' not in json_data:
            break
        objects = json_data['data']
        if len(objects) == 0:
            break

        for object in objects:
            previous_epoch = object['created_utc'] - 1
            count += 1
            if object_type == 'comment':
                try:
                    handle.write(str(object['score']))
                    handle.write(" : ")
                    handle.write(datetime.fromtimestamp(object['created_utc']).strftime("%Y-%m-%d"))
                    handle.write("\n")
                    handle.write(object['body'].encode(encoding='ascii', errors='ignore').decode())
                    handle.write("\n-------------------------------\n")
                except Exception as err:
                    print(f"Couldn't print comment: https://www.reddit.com{object['permalink']}")
                    print(traceback.format_exc())
            elif object_type == 'submission':
                if 'url' not in object:
                    continue
                if object['url']:
                    try:
                        #Remember not to change created_utc and id positions or the bot will not work!!
                        handle.write(datetime.fromtimestamp(object['created_utc']).strftime("%Y-%m-%d"))
                        handle.write(" : ")
                        handle.write(str(object['id']).encode(encoding='ascii', errors='ignore').decode())
                        handle.write(" : ")

                        handle.write(str(object['url']).encode(encoding='ascii', errors='ignore').decode())
                        handle.write(" : ")
                        handle.write(str(object['score']).encode(encoding='ascii', errors='ignore').decode())
                        handle.write(" : ")
                        handle.write(object['title'].encode(encoding='ascii', errors='ignore').decode())
                        handle.write("\n")

                    except Exception as err:
                        print(f"Couldn't print post: {object['url']}")
                        print(traceback.format_exc())

        print("Saved {} {}s through {}".format(count, object_type, datetime.fromtimestamp(previous_epoch).strftime("%Y-%m-%d")))
        handle.close()
        if count % CHANGE_FILE_NUMBER == 0:
            print('switching to a new file')
            filename_list = filename.split('.')
            filename = filename_list[0] + '.' + str(count) + '.txt'
    print(f"Saved {count} {object_type}s")



downloadFromUrl("submissions.txt", "submission")
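
For reference, here is a rough, untested sketch of how the same resume logic might be extended to comments. It assumes the comment writer is first changed so that comment lines are written as "date : id : ..." like submissions (the current script does not do this); the only other difference is pointing the lookup at pushshift's comment search endpoint instead of the submission one.

import requests

# Sketch only: comment counterpart of get_last_ts, assuming comment lines are
# written as "date : id : ..." (the current script does not write them this way).
def get_last_comment_ts(path):
    try:
        with open(path, 'r') as f:
            for line in f:
                pass
            last_line = line
        lc_id = last_line.split(' : ')[1]
        lc_url = 'https://api.pushshift.io/reddit/comment/search/?ids=' + lc_id + '&filter=created_utc'
        lc_json_data = requests.get(lc_url, headers={'User-Agent': "Post downloader by /u/Watchful1"}).json()
        return lc_json_data['data'][0]['created_utc']
    except Exception as e:
        print('Could not recover a last comment timestamp: ' + str(e))
        return False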
