txtdot / txtdot

An HTTP proxy that parses only text, links and pictures from pages reducing internet bandwidth usage, removing ads and heavy scripts

Home Page: https://txt.dc09.ru

License: MIT License

TypeScript 68.47% EJS 17.58% CSS 11.98% Dockerfile 0.64% Shell 1.32%
nodejs proxy readability minify self-hosted text

txtdot's Introduction

HTTP proxy that parses only text, links and pictures from pages reducing internet traffic, removing ads and heavy scripts. Mozilla's Readability library is used under the hood.

Features

  • Server-side page simplification
  • Media proxy
  • Image compression with Sharp
  • Rendering client-side apps (Vanilla, React, Vue, etc) with webder
  • Search with SearXNG
  • Custom parsers for StackOverflow and SearXNG
  • Handy API endpoints
  • No client JavaScript
  • Some kind of Material Design 3
  • Customization with plugins, see @txtdot/sdk and @txtdot/plugins

Running

Development

npm install
npm run dev

Production

npm install
npm run build
npm run start

Docker

docker compose up -d

Screenshots

[Screenshot: main page with URL input field]
[Screenshot: SearXNG search results page]

Performance tests

txtdot is a great tool for slow internet connections or a weak signal. Here is a comparison of performance metrics from pagespeed.web.dev between the original page and the proxied one.

"Mobile" test includes "Slow 4G" artificial network throttling.

Page                  Original page   Proxied through txtdot
Habr (desktop)        56%             99%
Habr (mobile)         21%             100%
Medium (desktop)      44%             100%
Medium (mobile)       36%             100%
Nginx Blog (desktop)  53%             100%
Nginx Blog (mobile)   26%             100%

Credits

txtdot - HTTP proxy that saves bandwidth, removing ads and scripts. | Product Hunt

txtdot's People

Contributors

artegoser, darkcat09, dependabot[bot]

txtdot's Issues

Parsing JS apps

Execute scripts that are not related to analytics or ads in isolated-vm, passing a JSDOM object into their context, and only after that process the DOM with Readability or anything else.

Should be configurable, EXEC_JS=false|true in .env, false by default.

Can be implemented as a separate engine named "readability with js".
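
A minimal sketch of how such a flag could be read, assuming the environment is handed in as a plain object. The helper name `readBoolFlag` is hypothetical and not part of txtdot; only the `EXEC_JS` name and its `false` default come from the proposal above.

```typescript
// Hypothetical helper: reads a boolean flag such as EXEC_JS from an
// environment-like object, falling back to a default when it is unset.
type Env = Record<string, string | undefined>;

function readBoolFlag(env: Env, name: string, fallback: boolean): boolean {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") {
    return fallback;
  }
  const value = raw.trim().toLowerCase();
  if (value === "true") return true;
  if (value === "false") return false;
  // Reject anything else so a typo in .env does not silently change behavior.
  throw new Error(`${name} must be "true" or "false", got "${raw}"`);
}

// EXEC_JS is false by default, as proposed above.
const execJs = readBoolFlag({ EXEC_JS: "true" }, "EXEC_JS", false);
```

Throwing on unknown values is a design choice of this sketch; treating them as the default would also be defensible.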

Plugins

Plugins are divided into different types:

  • Engines (#146) - allow you to create custom engines to process sites by domain and routes
  • Extenders - add code to an engine's output. For example, an extender can add a syntax highlighter.

Plugin code (in TypeScript) is located in the plugins monorepo. Plugins are published in the @txtdot npm organization.

To activate a plugin, the instance admin has to install it with npm and then add it to the plugins list in pluginConfig.ts

Performance: string methods instead of `new URL`

We'll get a really good performance boost if we replace new URL, which as far as I can see is used too often, with string methods. On the other hand, manually parsing a URL is in some cases neither simple nor concise.

Benchmark code
const Benchmarkify = require("benchmarkify")

const benchmark = new Benchmarkify("Benchmark", { chartImage: true }).printHeader()

benchmark.createSuite("URL convert", { time: 10000 })

  // Parse with the URL object, strip the hash, then re-append it unencoded
  .add("URL object", () => {
    const url = new URL("https://dc09.ru/posts/fediverse#comments")
    const hash = url.hash
    url.hash = ""
    return encodeURIComponent(url.toString()) + hash
  })

  // Reference case: split at the first "#" using plain string methods
  .ref("String methods", () => {
    const url = "https://dc09.ru/posts/fediverse#comments"
    const hashIdx = url.indexOf("#")
    if (hashIdx !== -1) {
      return encodeURIComponent(url.substring(0, hashIdx)) + url.substring(hashIdx)
    }
    return encodeURIComponent(url)
  })

benchmark.run()
Benchmark results
Platform info:
==============
   Linux 6.7.4-artix1-1 x64
   Node.JS: 21.6.1
   V8: 11.8.172.17-node.19
   CPU: Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz × 12
   Memory: 30 GB

Suite: URL convert
==================

✔ URL object         1 176 230 ops/sec
✔ String methods     4 825 419 ops/sec

   URL object           -75,62%   (1 176 230 ops/sec)   (avg: 850ns)
   String methods (#)        0%   (4 825 419 ops/sec)   (avg: 207ns)

┌────────────────┬────────────────────────────────────────────────────┐
│ URL object     │ ████████████                                       │
├────────────────┼────────────────────────────────────────────────────┤
│ String methods │ ██████████████████████████████████████████████████ │
└────────────────┴────────────────────────────────────────────────────┘

[?format=text] No newlines around headings

Example: MDN page

  • When the parser output format is set to text, I often get headings concatenated with a paragraph body:

1

Try itSyntaxfind(callbackFn)
find(callbackFn, thisArg)

2

SpecificationsSpecificationECMAScript Language Specification # sec-array.prototype.findBrowser compatibilityBCD tables only load in the browserSee also

  • While it should be:

1

[Image: txtdot1]

2

[Image: txtdot2]
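
One way to avoid the fused headings is to wrap every block-level element in newlines while flattening the tree. The sketch below is only an illustration under assumptions: it walks a tiny stand-in node type rather than real JSDOM nodes, and the `BLOCK_TAGS` list and function names are hypothetical, not txtdot's code.

```typescript
// Minimal stand-in for a DOM tree; real code would walk JSDOM nodes instead.
type TextNode = { text: string };
type ElementNode = { tag: string; children: DocNode[] };
type DocNode = TextNode | ElementNode;

// Tags treated as block-level for this sketch (an assumption, not a full list).
const BLOCK_TAGS = new Set(["h1", "h2", "h3", "h4", "h5", "h6", "p", "div", "pre", "li"]);

function toPlainText(node: DocNode): string {
  if ("text" in node) {
    return node.text;
  }
  const inner = node.children.map((child) => toPlainText(child)).join("");
  // Surround block-level content with newlines so it never fuses with neighbours.
  return BLOCK_TAGS.has(node.tag) ? `\n${inner}\n` : inner;
}

// Collapse the runs of newlines produced by adjacent or nested blocks.
function render(root: DocNode): string {
  return toPlainText(root).replace(/\n{2,}/g, "\n\n").trim();
}

const page: DocNode = {
  tag: "div",
  children: [
    { tag: "h2", children: [{ text: "Try it" }] },
    { tag: "h2", children: [{ text: "Syntax" }] },
    { tag: "pre", children: [{ text: "find(callbackFn)" }] },
  ],
};
const text = render(page);
```

Here `text` holds the two headings and the code line as separate paragraphs instead of `Try itSyntaxfind(callbackFn)`.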

Links with #hashes

  1. For example, open: Nginx wiki article
  2. Each heading has a ¶ (paragraph sign) with a link to this heading, e.g. #fastcgi-params
  3. txtdot converts these links to something like /get?url=...%2Fphpfcgi%2F%23fastcgi-params (%23 is a urlencoded hash), but txtdot should separate the hash-link so the browser can understand it, like ?url=...%2Fphpfcgi#fastcgi-params

This also makes sense in a "Table of Contents", where jumping to a heading is done with hash-links.
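
The desired behavior can be sketched as a small link builder that encodes everything before the `#` and appends the fragment untouched. The function name `toProxyLink` and the example.com URL are placeholders for illustration; only the `/get?url=` shape comes from the description above.

```typescript
// Hypothetical helper: builds a txtdot link for `href`, keeping the
// fragment outside the encoded `url` parameter so the browser can use it.
function toProxyLink(href: string, endpoint: string = "/get"): string {
  const hashIdx = href.indexOf("#");
  const page = hashIdx === -1 ? href : href.substring(0, hashIdx);
  const hash = hashIdx === -1 ? "" : href.substring(hashIdx);
  return `${endpoint}?url=${encodeURIComponent(page)}${hash}`;
}

const link = toProxyLink("https://example.com/phpfcgi/#fastcgi-params");
// link === "/get?url=https%3A%2F%2Fexample.com%2Fphpfcgi%2F#fastcgi-params"
```

Because the fragment never reaches the server, the browser can scroll to the heading on the proxied page, provided the heading IDs survive parsing.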

Add more settings

Settings should be placed either below the URL bar or on a separate page. There may be no point in the latter.

Either pass all the settings (listed below) in the /get txtdot query, which will make txtdot URLs longer, or save the settings in cookies, so they will be user-specific, which I think is the intended and handy behavior.

  • Proxy images: Remove from page / No proxy / Compress / Original
    see #98 #96
  • (??) Proxy media, i.e. video & audio
  • Replace iframes: Remove from page / No replace / Parse with txtdot

And, of course:

  • Dark theme: Off / Auto / On

Add alternative frontend redirection page

Useful for iframes. For example, instead of loading a YouTube embed, load a txtdot page where "Open in Invidious" / "Original page" can be chosen. Alternative frontends' URLs are configured in settings, #99

[Security bug] Possibility of executing JS on client

From Readability Readme:

If you're going to use Readability with untrusted input (whether in HTML or DOM form), we strongly recommend you use a sanitizer library like DOMPurify to avoid script injection when you use the output of Readability. We would also recommend using CSP to add further defense-in-depth restrictions to what you allow the resulting content to do. The Firefox integration of reader mode uses both of these techniques itself.

Reproducing

I've hosted an XSS vulnerability showcase code from DOMPurify's page at xss.dc09.ru/dompurify. If you open it in your browser without txtdot, all JS code will be executed.

txtdot should remove javascript, but... look, some code with alert is still executed!

The Problem Itself

XSS attacks are usually aimed at stealing some sensitive data, for example, auth token from cookies.

txtdot does not store any data. But I said "usually". An attacker could embed JS code that sends requests to Google Analytics or even loads malicious JS / WASM from a remote server.

Because of the lack of DOM sanitization, txtdot can potentially pass JS to the client, while the description says the opposite (...removing scripts...).

The Solution

As mentioned above, DOMPurify.

I think that shifting the task of disabling JS to the user is against txtdot principles.
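
Alongside sanitization, the Readability readme quoted above also recommends CSP as a second layer. A minimal sketch of building such a header value is shown here; `buildCsp` is a hypothetical helper and the directive choices are assumptions for illustration, not txtdot's actual policy.

```typescript
// Builds a Content-Security-Policy header value from a directive map.
function buildCsp(directives: Record<string, string[]>): string {
  return Object.entries(directives)
    .map(([name, sources]) => `${name} ${sources.join(" ")}`)
    .join("; ");
}

// Assumed policy for illustration: everything is denied by default (which
// includes scripts), then same-origin images, styles and form posts are allowed.
const csp = buildCsp({
  "default-src": ["'none'"],
  "img-src": ["'self'"],
  "style-src": ["'self'"],
  "form-action": ["'self'"],
});
// The string would be sent as the Content-Security-Policy response header.
```

With `default-src 'none'` and no `script-src`, a supporting browser refuses to run inline or remote scripts even if one slips through sanitization. Media that is not proxied through the same origin would also be blocked under this assumed policy.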

Restrict proxy to media MIMEs, split routes

For now, /proxy can pass any type of content without any checks. We don't want the users to download a 3 GB ISO through our txtdot servers.

  • Restrict proxying to images, video (and audio??) MIME types
  • Create separate API route /imgproxy, and then #96
  • Leave /proxy for video and audio, or create a separate route
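
A minimal sketch of the MIME check, assuming the upstream `Content-Type` header is available before the body is streamed. The function name and the prefix list are hypothetical; which types end up allowed is exactly what this issue leaves open.

```typescript
// Media type prefixes the proxy would accept (an assumption for this sketch).
const ALLOWED_PREFIXES = ["image/", "video/", "audio/"];

// Returns true when a Content-Type header value names an allowed media type.
function isProxyableMime(contentType: string | undefined): boolean {
  if (!contentType) {
    return false; // no Content-Type: refuse rather than guess
  }
  // Drop parameters such as "; charset=utf-8" and normalise case.
  const mime = (contentType.split(";")[0] ?? "").trim().toLowerCase();
  return ALLOWED_PREFIXES.some((prefix) => mime.startsWith(prefix));
}

const allowsPng = isProxyableMime("image/png");
const allowsIso = isProxyableMime("application/x-iso9660-image");
```

A prefix check is deliberately coarse. For example, it lets `image/svg+xml` through, and SVG can carry scripts, so a stricter allowlist may be preferable for the image route.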

Improve readme

There are no illustrative pictures or feature descriptions in our readme. The feature descriptions can be taken from the documentation, for example. Also add a comparison of a site page with and without the proxy.

  • Pictures, comparison
  • Features

Proxy resources

One of the users asked me about proxying images, videos, etc., because privacy is really important to him.
I support this idea, and it does not look hard to implement.

Should be configurable, PROXY_RES=false|true in .env, true by default.

Templates in components/ are not found when built for prod

Error: /home/darkcat09/code/txtdot/dist/templates/index.ejs:25
    23| <p><%= publicConfig.description %></p>
    24| </header>
 >> 25| <%- include('./components/form-main.ejs') %>
    26| </main>
    27| </body>
    28| </html>

Could not find the include file "./components/form-main.ejs"

Separate API/browser endpoints, errors and docs

If we also want to provide an API (only /raw-html for now), all endpoints should be under /api/ path, not at the root.

  1. In that case, we can configure Swagger docs for /api only, which is more convenient.
  2. We can display a beautiful error page if it's a browser endpoint, and a JSON error object for API endpoints.
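
The second point could be sketched as a function that picks the reply format from the request path. Everything here is an assumption for illustration: the `ErrorReply` shape, the function names and the HTML snippet are hypothetical, and a real implementation would plug into the server's error handler and render a template.

```typescript
type ErrorReply = { status: number; contentType: string; body: string };

// Escapes text before placing it into HTML so the error page itself
// cannot be used for markup injection.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Picks the reply format from the request path: JSON under /api/, HTML elsewhere.
function buildErrorReply(path: string, status: number, message: string): ErrorReply {
  if (path.startsWith("/api/")) {
    return {
      status,
      contentType: "application/json",
      body: JSON.stringify({ error: { code: status, message } }),
    };
  }
  return {
    status,
    contentType: "text/html; charset=utf-8",
    body: `<h1>Error ${status}</h1><p>${escapeHtml(message)}</p>`,
  };
}

const apiReply = buildErrorReply("/api/raw-html", 404, "Page not found");
const pageReply = buildErrorReply("/get", 404, "Page <x> not found");
```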

Huge HTML hangs whole server

Example: https://www.win.tue.nl/~aeb/natlang/ie/tochB.html

This page is really big, so okay, the txtdot proxy can't parse it in a reasonable time. But it hangs the whole server, and after the client gets a timeout error, txtdot still does not accept any requests, which makes DoS attacks as simple as possible: an attacker needs only one request.

Using the Node.js profiler, I found out that JSDOM takes almost all the time.

What we can do:

  • Set a timeout for the server response, so Fastify won't wait for page processing to complete; connectionTimeout seems to be the solution, but I'm not sure.
  • Switch to another HTML processing library that provides a DOM API (sorted by last commit date: linkedom, happy-dom, cheerio).
  • Rewrite Readability so it won't need DOM API; and/or rewrite it in Rust.

Encoding in Axios response

For example, 4PDA gives the content in CP-1251 encoding. Ahhh, who really uses 1251 except several web sites written by... However, the bug exists.
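
A minimal sketch of charset-aware decoding, assuming the response body is fetched as raw bytes (Axios supports `responseType: "arraybuffer"` for this) and that the runtime's `TextDecoder` knows the `windows-1251` label. The helper names are hypothetical, and pages that declare their charset only in a `<meta>` tag are not handled here.

```typescript
// Extracts the charset from a Content-Type header value, defaulting to UTF-8.
function charsetFrom(contentType: string | undefined): string {
  const match = /charset\s*=\s*"?([^";\s]+)/i.exec(contentType ?? "");
  return match?.[1]?.toLowerCase() ?? "utf-8";
}

// Decodes raw response bytes using the declared charset.
// TextDecoder throws a RangeError for labels the runtime does not know.
function decodeBody(bytes: Uint8Array, contentType: string | undefined): string {
  return new TextDecoder(charsetFrom(contentType)).decode(bytes);
}

// The Russian word "Привет" encoded as CP-1251 bytes.
const cp1251Bytes = new Uint8Array([0xcf, 0xf0, 0xe8, 0xe2, 0xe5, 0xf2]);
const decoded = decodeBody(cp1251Bytes, "text/html; charset=windows-1251");
```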

Question about the main instance

txt.dc09.ru is located in Russia. How does it open sites such as BBC News, Meduza, SoundCloud and other sites blocked in Russia? Are you using zapret or what?

Error pages and API objects

Related to #30

  1. Reply with the corresponding error codes instead of 500.
  2. Show a beautiful error page for browser endpoints instead of Fastify's default JSON.
  3. Show a meaningful JSON for /parse API endpoint (maybe, just add a field to IHandlerOutput).
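
The first and third points could be sketched with errors that carry their own HTTP status. The class names, status choices and JSON shape below are hypothetical and only illustrate the idea; they are not txtdot's actual error types.

```typescript
// Hypothetical base class: an error that knows which HTTP status it maps to.
class HttpError extends Error {
  readonly statusCode: number;

  constructor(message: string, statusCode: number) {
    super(message);
    this.statusCode = statusCode;
    // Keeps instanceof working even when compiled to ES5.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

class InvalidUrlError extends HttpError {
  constructor(url: string) {
    super(`Invalid URL: ${url}`, 400);
  }
}

class UpstreamTimeoutError extends HttpError {
  constructor() {
    super("The requested site did not respond in time", 504);
  }
}

// Known errors keep their own status; anything unexpected stays a 500.
function statusFor(err: unknown): number {
  return err instanceof HttpError ? err.statusCode : 500;
}

// Shape of a JSON error body for an API endpoint such as /parse.
function toErrorJson(err: unknown): { status: number; message: string } {
  const message = err instanceof Error ? err.message : String(err);
  return { status: statusFor(err), message };
}
```

An error handler would then call `statusFor` once instead of replying with 500 for every failure.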

Create engine routing system

Engine development would be more convenient with an interface through which engines interact, for example to match the route requested by the user.

Something like Fastify or Express.

engine.route("/search", (ro) => {
  const query = ro.query.q;
  // ...
});
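
To make the idea concrete, here is a minimal sketch of such a routing interface. The `Engine` class, the `RouteRequest` shape and exact-path matching are all assumptions for illustration; the proposal above does not define them, and domain matching or path parameters are left out.

```typescript
// Hypothetical route object handed to handlers, loosely modelled on the
// `ro` parameter in the snippet above.
type RouteRequest = { path: string; query: Record<string, string> };
type RouteHandler = (ro: RouteRequest) => string;

class Engine {
  private routes = new Map<string, RouteHandler>();

  // Registers a handler for an exact path.
  route(path: string, handler: RouteHandler): this {
    this.routes.set(path, handler);
    return this;
  }

  // Dispatches a path-and-query string to the matching handler,
  // or returns undefined when no route matches.
  handle(target: string): string | undefined {
    const qIdx = target.indexOf("?");
    const path = qIdx === -1 ? target : target.substring(0, qIdx);
    const search = qIdx === -1 ? "" : target.substring(qIdx + 1);
    const query: Record<string, string> = {};
    for (const pair of search.split("&")) {
      if (pair === "") continue;
      const eq = pair.indexOf("=");
      const key = eq === -1 ? pair : pair.substring(0, eq);
      const value = eq === -1 ? "" : pair.substring(eq + 1);
      // "+" means a space in query strings.
      query[decodeURIComponent(key)] = decodeURIComponent(value.replace(/\+/g, " "));
    }
    const handler = this.routes.get(path);
    return handler ? handler({ path, query }) : undefined;
  }
}

const engine = new Engine();
engine.route("/search", (ro) => `Results for ${ro.query["q"] ?? ""}`);
const output = engine.handle("/search?q=txtdot+proxy");
```

Returning `undefined` for unmatched paths would let a caller fall back to a default engine such as Readability.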
