
stashbox's Introduction

stashbox


Stashbox is your personal Internet Archive

The goal of stashbox is to help you create your own personal copy of websites that you wish to archive.

The initial way to do this will be to run a simple command, but in the future it could be extended to a web interface, a plugin, or other means.

Having a local "static" copy of a website can help with research, change tracking, and many other purposes.

Roadmap

  • Initial command-line client to add URLs to a personal archive with text, HTML, and PDF copies of the website
  • Ability to save new versions of the same URL
  • Version "diffing" and browsing
  • User friendly interface (web, etc)
  • Searching and other functions

Usage

Usage: stashbox <command> <options>

  Where command is one of:
    add   --  add a url to the archive
      -b string
            stashbox archive directory (defaults to ./stashDb) (default "./stashDb")
      -u string
            url to download
    list  --  list all archives
      -b string
            stashbox archive directory (defaults to ./stashDb) (default "./stashDb")
    open  --  open an archive
      -b string
            stashbox archive directory (defaults to ./stashDb) (default "./stashDb")
      -n int
            archive number to open (from list command)

  To see help text, you can run:
    stashbox <command> -h

Contributing

New issues and pull requests are welcomed. Please see CONTRIBUTING.md

Contributors

Made with contributors-img.

stashbox's People

Contributors

aesherman, benprew, chopinsky, darkangeel-hd, dependabot-preview[bot], erizkiatama, ilmanzo, krishkayc, lucasturci, orama254, renovate-bot, shivamanipatil, szliao, theykk, zpeters


stashbox's Issues

Dependabot can't resolve your Go dependency files

Dependabot can't resolve your Go dependency files.

As a result, Dependabot couldn't update your dependencies.

The error Dependabot encountered was:

Cannot detect VCS for jaytaylor.com/html2text. Attempted to detect VCS because the version looks like a git revision: v0.0.0-20200412013138-3577fbdbcff7

If you think the above is an error on Dependabot's side please don't hesitate to get in touch - we'll do whatever we can to fix it.

View the update logs.

Review and take some inspiration from Joplin

I recently started using Joplin. There are some things about it that I really like that could be borrowed for this app. For stashbox I'm really thinking more of an archiving and "seeing changes" model, but there are a lot of cool things here. One thing I really like is the Chrome plugin, which saves HTML pages as Markdown.

Action Required: Fix Renovate Configuration

There is an error with this repository's Renovate configuration that needs to be fixed. As a precaution, Renovate will stop PRs until it is resolved.

Error type: undefined. Note: this is a nested preset so please contact the preset author if you are unable to fix it yourself.

stashDb structure/layout

I found your project through DigitalOcean's Hacktoberfest program. Thanks for submitting your repo; there aren't a lot of golang repos listed (at least when I looked yesterday).

I was looking at your project and have been thinking about a couple of things:

  1. Doing diffs between webpage versions

Downloading a web page as a PDF is nice, but it makes it difficult to store just the changes between versions. It also makes it difficult to identify what has changed between versions. ImageMagick can diff PDFs, but only a single page at a time.

I'd propose using something like monolith to save the page as a single HTML file and then using diff to show differences.

  2. Storing multiple versions of a page

If you use the HTML representation of a page, you could store the delta between versions instead of storing the entire HTML file each time.

  3. Export instead of storing multiple versions of the same webpage

Instead of having a .html, .txt, and .pdf version of a page, have an export command that transforms the format on disk into the requested format. If you're storing a monolith .html file, you can call wkhtmltopdf on it to generate a PDF on demand.

  4. File layout

Instead of using the title of the page as the filename (which can change), you probably want to have a map of url -> page_id and store the file as <page_id>.html. This insulates you from changes to the page title and means you don't have to represent the URL structure as a file structure on disk.

If you wanted to be fancy, this could be a SQLite database stored in the stashDb directory, or it could be as simple as a JSON/YAML file in the stashDb directory.

  5. Misc: Other commands

OK, this isn't related to the structure or layout, but defining the structure/layout of the archive helps enable features like:

-list -- list all the pages I've saved and # of versions
-export -- export a webpage in html, pdf, text, markdown, etc...
-update -- update all webpages I've saved to the latest version, could also take a domain argument to limit to pages from a specific domain

Thanks and let me know what you think.

thin out the main function

Currently there is a lot going on in the main function. Try to thin this out by spreading it into packages, or find other ways to simplify things.

create -export flag

-export -- export a webpage in html, pdf, text, markdown, etc...

Not exactly sure how this should work. Would a bare export list the possible formats? Would it export into the archive, into a different location, etc.?

`AddUrl` needs a sanity check for URLs

Currently, when adding a URL to crawl from the command line, we don't check whether the URL input is valid. We rely on the http client (which is used as the foundation of the crawler) to tell us whether we have a valid URL string or not.

This can be inconvenient when the user has already added a number of URLs but one of them was entered incorrectly due to a typo; the crawling job would then fail for the rest of the URLs. A common mistake is to omit the http:// or https:// protocol prefix of the URL string.

I think we can at least add some simple sanity checks to reject obviously illegal URL entries, and let the http package determine whether it can handle the rest.
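A minimal pre-flight check along these lines could sit in front of the add path, using the standard net/url package. validateURL is a hypothetical helper name, not the project's API:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// validateURL rejects obviously malformed entries before the http client
// ever sees them, including the common "missing protocol prefix" typo.
func validateURL(raw string) error {
	if !strings.HasPrefix(raw, "http://") && !strings.HasPrefix(raw, "https://") {
		return fmt.Errorf("%q: missing http:// or https:// prefix", raw)
	}
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("%q: %w", raw, err)
	}
	if u.Host == "" {
		return fmt.Errorf("%q: missing host", raw)
	}
	return nil
}

func main() {
	for _, raw := range []string{"https://example.com", "example.com", "https://"} {
		if err := validateURL(raw); err != nil {
			fmt.Println("reject:", err)
		} else {
			fmt.Println("accept:", raw)
		}
	}
}
```

Running this check on the whole input list up front means one typo is reported immediately instead of failing the crawl for the remaining URLs.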


update layout of the stashdb file structure

This idea might need a little more thought before implementation

Instead of using the title of the page as the filename (which can change), you probably want to have a map of url -> page_id and store the file as <page_id>.html. This insulates you from changes to the page title and means you don't have to represent the URL structure as a file structure on disk.
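The url -> page_id map could be kept as a small JSON index in the archive directory. A sketch under assumed names (index.json, pageID, and the ID scheme are all illustrative, not the project's layout):

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// index maps a URL to a stable page ID, so the on-disk file name never
// has to change when the page title does.
type index struct {
	Pages map[string]string `json:"pages"` // url -> page_id
}

func loadIndex(dir string) (*index, error) {
	idx := &index{Pages: map[string]string{}}
	data, err := os.ReadFile(filepath.Join(dir, "index.json"))
	if os.IsNotExist(err) {
		return idx, nil // first run: empty index
	}
	if err != nil {
		return nil, err
	}
	return idx, json.Unmarshal(data, idx)
}

// pageID returns the existing ID for a URL, or derives and records a new one
// from a hash of the URL, so the same URL always maps to the same file.
func (idx *index) pageID(url string) string {
	if id, ok := idx.Pages[url]; ok {
		return id
	}
	id := fmt.Sprintf("%x", sha256.Sum256([]byte(url)))[:12]
	idx.Pages[url] = id
	return id
}

func (idx *index) save(dir string) error {
	data, err := json.MarshalIndent(idx, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, "index.json"), data, 0600)
}

func main() {
	dir, _ := os.MkdirTemp("", "stashDb")
	idx, _ := loadIndex(dir)
	id := idx.pageID("https://example.com/post")
	_ = idx.save(dir)
	fmt.Println(id) // stable ID; the page would be stored as <page_id>.html
}
```

Swapping the JSON file for a SQLite table later would only change loadIndex/save, not the callers.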

Current file naming scheme does not work on Windows.

t := time.Now()
dateTime := fmt.Sprintf("%d-%02d-%02dT%02d:%02d:%02d",
	t.Year(), t.Month(), t.Day(),
	t.Hour(), t.Minute(), t.Second())

domainSubPath := path.Join(c.Archive, d, dateTime)
err = os.MkdirAll(domainSubPath, 0700)
if err != nil {
	return err
}

This code generates paths in the following format: github.com/2020-10-02T20:28:51. The colon is an invalid character for file names on Windows.

in crawler_test TestSave figure out how to run in github test environment

I have wkhtmltopdf installed as part of the test process, but there is the error below about a display.

Figure out either how to get this to run in GitHub Actions, or how to exclude just this one test there.

  crawler_test.go:37: 
        	Error Trace:	crawler_test.go:37
        	Error:      	Received unexpected error:
        	            	QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-runner'
        	            	qt.qpa.screen: QXcbConnection: Could not connect to display 
        	            	Could not connect to any X display.
        	Test:       	TestSave

Error listing archives Error building archive

If you add a domain and then another path from that domain, it fails.

Steps to reproduce

./stashbox add -u https://thehelpfulhacker.net                                                   
Crawling https://thehelpfulhacker.net...
Saving https://thehelpfulhacker.net...

./stashbox list
Archive listing...
1. thehelpfulhacker.net [1 image(s)]

./stashbox add -u https://thehelpfulhacker.net/posts/2020-10-13-golang-testify-table-tests/                                                 
Crawling https://thehelpfulhacker.net/posts/2020-10-13-golang-testify-table-tests/...
Saving https://thehelpfulhacker.net/posts/2020-10-13-golang-testify-table-tests/...

./stashbox list
Error listing archives Error building archive

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Open

These updates have all been created already. Click a checkbox below to force a retry/rebase of any.

Ignored or Blocked

These are blocked by an existing closed PR and will not be recreated unless you click a checkbox below.

Detected dependencies

dockerfile
Dockerfile
  • golang 1.15-alpine
  • alpine 3.13
github-actions
.github/workflows/codeql-analysis.yml
  • actions/checkout v2
  • github/codeql-action v1
  • github/codeql-action v1
  • github/codeql-action v1
.github/workflows/go-build.yml
  • actions/setup-go v2
  • actions/checkout v2
.github/workflows/go-test.yml
  • actions/setup-go v2
  • actions/checkout v2
.github/workflows/golangci-lint.yml
  • actions/checkout v2
  • golangci/golangci-lint-action v2
.github/workflows/gosec.yml
  • actions/checkout v2
gomod
go.mod
  • go 1.14
  • github.com/PuerkitoBio/goquery v1.6.1
  • github.com/SebastiaanKlippert/go-wkhtmltopdf v1.6.1
  • github.com/bmatcuk/doublestar/v2 v2.0.4
  • github.com/stretchr/testify v1.7.0
  • jaytaylor.com/html2text v0.0.0-20200412013138-3577fbdbcff7@3577fbdbcff7

  • Check this box to trigger a request for Renovate to run again on this repository

Abstract away filesystem writes.

The app directly writes files and creates directories. For example, saveSite() calls os.MkdirAll and ioutil.WriteFile.

We need to abstract this file writing and directory creation behind an interface (e.g. Writer, FSHandler, etc.).

This will make unit tests independent of the filesystem and faster in general (unit tests can use a mock fswriter).

Fix Gosec errors

[/github/workspace/pkg/crawler/crawler.go:4] - G505 (CWE-327): Blocklisted import crypto/sha1: weak cryptographic primitive (Confidence: HIGH, Severity: MEDIUM)

3: import (

4: "crypto/sha1"
5: "fmt"

[/github/workspace/pkg/crawler/crawler.go:47] - G401 (CWE-326): Use of weak cryptographic primitive (Confidence: HIGH, Severity: MEDIUM)

46: 		// generate a file title

47: h := sha1.New()
48: io.WriteString(h, s.Url)

[/github/workspace/pkg/crawler/crawler.go:59] - G301 (CWE-276): Expect directory permissions to be 0750 or less (Confidence: HIGH, Severity: MEDIUM)
58: domainSubPath := path.Join(c.Archive, d, dateTime)

59: err = os.MkdirAll(domainSubPath, 0777)
60: if err != nil {

[/github/workspace/pkg/crawler/crawler.go:67] - G306 (CWE-276): Expect WriteFile permissions to be 0600 or less (Confidence: HIGH, Severity: MEDIUM)
66: htmlSavePath := path.Join(domainSubPath, htmlFileName)

67: err = ioutil.WriteFile(htmlSavePath, s.HtmlBody, 0777)
68: if err != nil {

[/github/workspace/pkg/crawler/crawler.go:75] - G306 (CWE-276): Expect WriteFile permissions to be 0600 or less (Confidence: HIGH, Severity: MEDIUM)
74: textSavePath := path.Join(domainSubPath, textFileName)

75: err = ioutil.WriteFile(textSavePath, s.TextBody, 0777)
76: if err != nil {

[/github/workspace/pkg/crawler/crawler.go:113] - G107 (CWE-88): Potential HTTP request made with variable url (Confidence: MEDIUM, Severity: MEDIUM)
112: func getHtmlBody(url string) (body []byte, err error) {

113: resp, err := http.Get(url)
114: if err != nil {

[/github/workspace/pkg/crawler/crawler.go:136] - G301 (CWE-276): Expect directory permissions to be 0750 or less (Confidence: HIGH, Severity: MEDIUM)
135: func ensureArchive(p string) {

136: err := os.MkdirAll(p, 0777)
137: if err != nil {

[/github/workspace/pkg/crawler/crawler.go:48] - G104 (CWE-703): Errors unhandled. (Confidence: HIGH, Severity: LOW)
47: h := sha1.New()

48: io.WriteString(h, s.Url)
49: s.Title = fmt.Sprintf("%x", h.Sum(nil))

implement -update

update all webpages I've saved to the latest version, could also take a domain argument to limit to pages from a specific domain

Adding support for concurrent crawling

Hey,

I think it would be a great feature if we supported taking multiple URLs at the same time and then spawning multiple crawlers using goroutines. I don't think this will require any changes to our crawler package.

Can be divided into :

  • Support for taking multiple inputs
  • Concurrency support to spawn multiple crawlers

@zpeters What do you think about this? And can you assign me this issue?
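The fan-out could be sketched with a WaitGroup and a buffered error channel. crawl here is a placeholder for the existing single-URL path, not the repo's actual function:

```go
package main

import (
	"fmt"
	"sync"
)

// crawl stands in for the existing single-URL crawl-and-save path.
func crawl(url string) error {
	fmt.Println("crawling", url)
	return nil
}

// crawlAll spawns one goroutine per URL and collects any errors.
func crawlAll(urls []string) []error {
	var wg sync.WaitGroup
	errs := make(chan error, len(urls)) // buffered so no goroutine blocks
	for _, u := range urls {
		wg.Add(1)
		go func(u string) { // pass u as a parameter: loop-variable capture
			defer wg.Done()
			if err := crawl(u); err != nil {
				errs <- err
			}
		}(u)
	}
	wg.Wait()
	close(errs)
	var out []error
	for err := range errs {
		out = append(out, err)
	}
	return out
}

func main() {
	errs := crawlAll([]string{"https://a.example", "https://b.example"})
	fmt.Println("errors:", len(errs))
}
```

A real implementation would likely cap concurrency (a worker pool or semaphore) so a long input list doesn't open hundreds of simultaneous connections.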


Come up with ideas on how to handle "updates"

Currently, if you save the same URL again it will overwrite the existing file. It would be nice to have some sort of versioning system so that nothing is overwritten and you can eventually browse through changes.

