sketchpad's Issues

Posts collect only up to 2022-11-03

When I put an end date of 01/01/2021, the script collects all of the posts in a subreddit up to November 3, 2022, and then stops. It then collects all of the comments until 01/01/2021 as requested.

Any idea why the script is not behaving the same for posts and comments?

I only added the end date, and changed the output files so that they include the name of the subreddit.

Would it be possible for you to let me know what I am missing? Thank you in advance for your help.
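To help narrow down whether the cutoff comes from my changes or from the API itself, here is a minimal check (a separate sketch, reusing the same endpoint, parameters and User-Agent as the script; the subreddit and cutoff date are just the values from this report). If it prints 0, pushshift itself is not returning submissions older than 2022-11-03.

# Quick check, separate from the script: query the submission endpoint directly
# with a "before" timestamp earlier than the cutoff and see whether anything comes back.
import requests
from datetime import datetime

cutoff = int(datetime.strptime("11/03/2022", "%m/%d/%Y").timestamp())
resp = requests.get(
	"https://api.pushshift.io/reddit/submission/search",
	params={"subreddit": "php", "limit": 100, "order": "desc", "before": cutoff},
	headers={"User-Agent": "Post downloader by /u/Watchful1"},
)
data = resp.json().get("data", [])
print(len(data), "submissions returned before the cutoff")
for obj in data[:3]:
	print(datetime.utcfromtimestamp(obj["created_utc"]).strftime("%Y-%m-%d"), obj.get("permalink"))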

Just in case, here is the script.

import requests
from datetime import datetime
import traceback
import time
import json
import sys
import csv

username = ""  # put the username you want to download in the quotes
subreddit = "php"  # put the subreddit you want to download in the quotes
thread_id = ""  # put the id of the thread you want to download in the quotes, it's the first 5 to 7 character string of letters and numbers from the url, like 107xayi
# leave either one blank to download an entire user's or subreddit's history
# or fill in both to download a specific user's history from a specific subreddit

post_file = "posts_" + subreddit + ".txt"
comment_file = "comments_" + subreddit + ".txt"

# change this to one of "human", "csv" or "json"
# - human: the score, creation date, author, link and then the comment/submission body on a second line. Objects are separated by lines of dashes
# - csv: a comma separated value file with the fields score, date, title, author, link and then body or url
# - json: the full json object
output_format = "csv"

# default start time is the current time and default end time is all history
# you can change out the below lines to set a custom start and end date. The script works backwards, so the end date has to be before the start date
start_time = datetime.utcnow()  #datetime.strptime("10/05/2021", "%m/%d/%Y")
#end_time = None  #datetime.strptime("09/25/2021", "%m/%d/%Y")
end_time = datetime.strptime("01/01/2021", "%m/%d/%Y")  #datetime.strptime("09/25/2021", "%m/%d/%Y")

convert_to_ascii = False  # don't touch this unless you know what you're doing
convert_thread_id_to_base_ten = True  # don't touch this unless you know what you're doing

def write_human_line(handle, obj, is_submission, convert_to_ascii):
	handle.write(str(obj['score']))
	handle.write(" : ")
	handle.write(datetime.fromtimestamp(obj['created_utc']).strftime("%Y-%m-%d"))
	if is_submission:
		handle.write(" : ")
		if convert_to_ascii:
			handle.write(obj['title'].encode(encoding='ascii', errors='ignore').decode())
		else:
			handle.write(obj['title'])
	handle.write(" : u/")
	handle.write(obj['author'])
	handle.write(" : ")
	handle.write(f"https://www.reddit.com{obj['permalink']}")
	handle.write("\n")
	if is_submission:
		if obj['is_self']:
			if 'selftext' in obj:
				if convert_to_ascii:
					handle.write(obj['selftext'].encode(encoding='ascii', errors='ignore').decode())
				else:
					handle.write(obj['selftext'])
		else:
			handle.write(obj['url'])
	else:
		if convert_to_ascii:
			handle.write(obj['body'].encode(encoding='ascii', errors='ignore').decode())
		else:
			handle.write(obj['body'])
	handle.write("\n-------------------------------\n")

def write_csv_line(writer, obj, is_submission):
	output_list = []
	output_list.append(str(obj['score']))
	output_list.append(datetime.fromtimestamp(obj['created_utc']).strftime("%Y-%m-%d"))
	if is_submission:
		output_list.append(obj['title'])
	output_list.append(f"u/{obj['author']}")
	output_list.append(f"https://www.reddit.com{obj['permalink']}")
	if is_submission:
		if obj['is_self']:
			if 'selftext' in obj:
				output_list.append(obj['selftext'])
			else:
				output_list.append("")
		else:
			output_list.append(obj['url'])
	else:
		output_list.append(obj['body'])
	writer.writerow(output_list)


def write_json_line(handle, obj):
	handle.write(json.dumps(obj))
	handle.write("\n")

def download_from_url(filename, url_base, output_format, start_datetime, end_datetime, is_submission, convert_to_ascii):
	print(f"Saving to {filename}")

	count = 0
	if output_format == "human" or output_format == "json":
		if convert_to_ascii:
			handle = open(filename, 'w', encoding='ascii')
		else:
			handle = open(filename, 'w', encoding='UTF-8')
	else:
		handle = open(filename, 'w', encoding='UTF-8', newline='')
		writer = csv.writer(handle)

	previous_epoch = int(start_datetime.timestamp())
	break_out = False
	while True:
		new_url = url_base + str(previous_epoch)
		json_text = requests.get(new_url, headers={'User-Agent': "Post downloader by /u/Watchful1"})
		time.sleep(1)  # pushshift has a rate limit, if we send requests too fast it will start returning error messages
		try:
			json_data = json_text.json()
		except json.decoder.JSONDecodeError:
			time.sleep(1)
			continue

		if 'data' not in json_data:
			break
		objects = json_data['data']
		if len(objects) == 0:
			break

		for obj in objects:
			previous_epoch = obj['created_utc'] - 1
			if end_datetime is not None and datetime.utcfromtimestamp(previous_epoch) < end_datetime:
				break_out = True
				break
			count += 1
			try:
				if output_format == "human":
					write_human_line(handle, obj, is_submission, convert_to_ascii)
				elif output_format == "csv":
					write_csv_line(writer, obj, is_submission)
				elif output_format == "json":
					write_json_line(handle, obj)
			except Exception as err:
				if 'permalink' in obj:
					print(f"Couldn't print object: https://www.reddit.com{obj['permalink']}")
				else:
					print(f"Couldn't print object, missing permalink: {obj['id']}")
				print(err)
				print(traceback.format_exc())

		if break_out:
			break

		print(f"Saved {count} through {datetime.fromtimestamp(previous_epoch).strftime('%Y-%m-%d')}")

	print(f"Saved {count}")
	handle.close()

if __name__ == "__main__":
	filter_string = None
	if username == "" and subreddit == "" and thread_id == "":
		print("Fill in username, subreddit or thread id")
		sys.exit(0)
	if output_format not in ("human", "csv", "json"):
		print("Output format must be one of human, csv, json")
		sys.exit(0)

	filters = []
	if username:
		filters.append(f"author={username}")
	if subreddit:
		filters.append(f"subreddit={subreddit}")
	if thread_id:
		if convert_thread_id_to_base_ten:
			filters.append(f"link_id={int(thread_id, 36)}")
		else:
			filters.append(f"link_id=t3_{thread_id}")
	filter_string = '&'.join(filters)

	url_template = "https://api.pushshift.io/reddit/{}/search?limit=1000&order=desc&{}&before="

	if not thread_id:
		download_from_url(post_file, url_template.format("submission", filter_string), output_format, start_time, end_time, True, convert_to_ascii)
	download_from_url(comment_file, url_template.format("comment", filter_string), output_format, start_time, end_time, False, convert_to_ascii)

Little improvement so the bot does not repeat work already done if things go wrong

Hi, first of all: thanks for creating this script. It has been super useful for me these last few days. I wanted to suggest that the example script you post when people have doubts about the lack of code for pushshift could be written in a way that saves the user's work. That way you avoid repeating calls every time the script hits an error, or when you simply want to pause it for a while so your computer (and the API) can rest. I understand this is not meant to be a serious script, but people probably use it a lot, as it works wonders and is super easy to use.

I've tried to improve it by changing when the file opens and closes, so progress is not lost if errors are encountered.
Also, the main reason why I am writing this: I added the ability to continue previous work, so no redundant calls are made.

The downside is that comment files are not supported, as I have not yet studied how they work. However, if they share an id the way submissions do, the change should be very similar; a rough sketch of that is included after the script below.

Thanks for your time.

Script:

'''
Improvement on https://github.com/Watchful1/Sketchpad/blob/master/postDownloader.py
Now capable of continuing previous work, avoiding repetition of previous searches and probably avoiding
repeated calls to the server api in case your download did not work properly.

Please make sure, before using it, that your .txt has fields in the following order:
time_stamp : id : <others>

If it is not laid out like this, the bot will detect that the file exists but will not be able to continue your work.
'''
import requests
from datetime import datetime
import traceback
import time
import json
import sys
import os

username = ""  # put the username you want to download in the quotes
subreddit = ""  # put the subreddit you want to download in the quotes
# leave either one blank to download an entire user's or subreddit's history
# or fill in both to download a specific users history from a specific subreddit

filter_string = None
if username == "" and subreddit == "":
    print("Fill in either username or subreddit")
    sys.exit(0)
elif username == "" and subreddit != "":
    filter_string = f"subreddit={subreddit}"
elif username != "" and subreddit == "":
    filter_string = f"author={username}"
else:
    filter_string = f"author={username}&subreddit={subreddit}"

url = "https://api.pushshift.io/reddit/{}/search?limit=1000&sort=desc&{}&before="

start_time = datetime.utcnow()

def get_count():
    # Look for the most recent output file (named like <name>.<count>.txt) and
    # return how many objects were already saved, plus that file's name.
    numbers = []
    files = os.listdir()
    old_filename = ''  # place first used file name
    try:
        for file in files:
            if file.endswith('.txt'):
                file_sliced = file.split('.')
                if len(file_sliced) > 2:
                    number = file_sliced[1]
                    numbers.append(int(number))
        highest_n = int(max(numbers))
        print(max(numbers))
        old_filename = 'spcrusaders' + '.' + str(highest_n) + '.txt'
        n_lines = 0
        print(numbers)
        print(old_filename)
        with open(old_filename, 'r') as f:
            for lines in f:
                n_lines += 1
        highest_n += n_lines
    except Exception:
        highest_n = 0
    return highest_n, old_filename


# https://stackoverflow.com/questions/7167008/efficiently-finding-the-last-line-in-a-text-file
# Note: ls = last_submission
def get_last_ts(path):
    # This method works well enough. Tested with 8*10^6 lines and took about 2.9 seconds on i5-10351G1
    try:
        with open(path, 'r') as f:
            for line in f:
                pass
            last_line = line
        ls_id = last_line.split(' : ')[1]    # place where the id is located in my .txt
        ls_url = 'https://api.pushshift.io/reddit/submission/search/?ids=' + ls_id + '&filter=created_utc'
        ls_json_text = requests.get(ls_url, headers={'User-Agent': "Post downloader by /u/Watchful1"})
        ls_json_data = ls_json_text.json()
        ls_json_object = ls_json_data['data']
        ls_time_stamp = ls_json_object[0]['created_utc']
        print('Last submission time: ' + str(ls_time_stamp) + ' from id: ' + str(ls_id) + ' and url: ' + ls_url)
        print('Continuing from there')
    except Exception as e:
        print('For some reason ls_time_stamp has not been found.\nA new file will not be created; the search will start from the beginning but keep appending to the last file')
        print('Error: ' + str(e))
        ls_time_stamp = False

    return ls_time_stamp

def downloadFromUrl(filename, object_type):
    print(f"Saving {object_type}s to {filename}")

    count, filename = get_count()
    last_ts = get_last_ts(filename)

    previous_epoch = last_ts if last_ts is not False else int(start_time.timestamp())
    
    while True:
        handle = open(filename, 'a')
        new_url = url.format(object_type, filter_string)+str(previous_epoch)
        json_text = requests.get(new_url, headers={'User-Agent': "Post downloader by /u/Watchful1"})
        time.sleep(1)  # pushshift has a rate limit, if we send requests too fast it will start returning error messages
        try:
            json_data = json_text.json()
        except ValueError:
            # ValueError covers both json.decoder.JSONDecodeError and simplejson's
            # JSONDecodeError, so simplejson does not need to be imported here
            time.sleep(1)
            continue

        if 'data' not in json_data:
            break
        objects = json_data['data']
        if len(objects) == 0:
            break

        for object in objects:
            previous_epoch = object['created_utc'] - 1
            count += 1
            if object_type == 'comment':
                try:
                    handle.write(str(object['score']))
                    handle.write(" : ")
                    handle.write(datetime.fromtimestamp(object['created_utc']).strftime("%Y-%m-%d"))
                    handle.write("\n")
                    handle.write(object['body'].encode(encoding='ascii', errors='ignore').decode())
                    handle.write("\n-------------------------------\n")
                except Exception as err:
                    print(f"Couldn't print comment: https://www.reddit.com{object['permalink']}")
                    print(traceback.format_exc())
            elif object_type == 'submission':
                if 'url' not in object:
                    continue
                if object['url']:
                    try:
                        #Remember not to change created_utc and id positions or the bot will not work!!
                        handle.write(datetime.fromtimestamp(object['created_utc']).strftime("%Y-%m-%d"))
                        handle.write(" : ")
                        handle.write(str(object['id']).encode(encoding='ascii', errors='ignore').decode())
                        handle.write(" : ")

                        handle.write(str(object['url']).encode(encoding='ascii', errors='ignore').decode())
                        handle.write(" : ")
                        handle.write(str(object['score']).encode(encoding='ascii', errors='ignore').decode())
                        handle.write(" : ")
                        handle.write(object['title'].encode(encoding='ascii', errors='ignore').decode())
                        handle.write("\n")

                    except Exception as err:
                        print(f"Couldn't print post: {object['url']}")
                        print(traceback.format_exc())

        print("Saved {} {}s through {}".format(count, object_type, datetime.fromtimestamp(previous_epoch).strftime("%Y-%m-%d")))
        handle.close()
        if count % CHANGE_FILE_NUMBER == 0:
            print('switching to a new file')
            filename_list = filename.split('.')
            filename = filename_list[0] + '.' + str(count) + '.txt'
    print(f"Saved {count} {object_type}s")



downloadFromUrl("submissions.txt", "submission")
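
For reference, here is a rough, untested sketch of how the same resume logic might be extended to comments. It assumes the comment writer is first changed so that comment lines are written as "date : id : ..." like submissions (the current script does not do this); the only other difference is pointing the lookup at pushshift's comment search endpoint instead of the submission one.

import requests

# Sketch only: comment counterpart of get_last_ts, assuming comment lines are
# written as "date : id : ..." (the current script does not write them this way).
def get_last_comment_ts(path):
    try:
        with open(path, 'r') as f:
            for line in f:
                pass
            last_line = line
        lc_id = last_line.split(' : ')[1]
        lc_url = 'https://api.pushshift.io/reddit/comment/search/?ids=' + lc_id + '&filter=created_utc'
        lc_json_data = requests.get(lc_url, headers={'User-Agent': "Post downloader by /u/Watchful1"}).json()
        return lc_json_data['data'][0]['created_utc']
    except Exception as e:
        print('Could not recover a last comment timestamp: ' + str(e))
        return False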
