txtdot,txtdot

Parsing JS apps

Execute scripts that are not related to analytics or ads in isolated-vm, passing the JSDOM object into their context, and only then process the DOM with Readability or anything else.

Should be configurable, EXEC_JS=false|true in .env, false by default.

Can be implemented as a separate engine named "readability with js".
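
A minimal sketch of how such a flag could be read, assuming the .env values end up in a plain key-value object (process.env in the real app). The helper name parseBoolEnv is hypothetical, not something that exists in txtdot.

```typescript
// Hypothetical helper: reads a boolean flag such as EXEC_JS from an
// environment-like object and falls back to a default when it is unset.
function parseBoolEnv(
  env: Record<string, string | undefined>,
  name: string,
  fallback: boolean,
): boolean {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = raw.trim().toLowerCase();
  if (value === "true") return true;
  if (value === "false") return false;
  // Unknown values use the default instead of silently enabling the feature.
  return fallback;
}

// EXEC_JS is false by default, as proposed above.
const execJs = parseBoolEnv({ EXEC_JS: "true" }, "EXEC_JS", false);
```

Falling back to the default on unknown values keeps a typo in .env from turning script execution on by accident.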

Plugins

Plugins are divided into different types:

Engines (#146) - allow you to create custom engines to process sites by domain and routes
Extenders - add code to the engines' output. For example, you can add a syntax highlighter.

Plugin code (in TypeScript) is located in the plugins monorepo. Plugins are published in the @txtdot npm organization.

To activate a plugin, the instance admin has to install it with npm and then add it to the plugin list in pluginConfig.ts

Create custom engine instead of Readability

Performance: string methods instead of `new URL`

We'll get a really good performance boost if we replace new URL, which as far as I can see is used far too often, with string methods. On the other hand, parsing URLs manually is not always simple or concise.

Benchmark code

const Benchmarkify = require("benchmarkify")

const benchmark = new Benchmarkify("Benchmark", { chartImage: true }).printHeader()

benchmark.createSuite("URL convert", { time: 10000 })

  // Variant 1: parse with the URL class, strip the hash, re-serialize.
  .add("URL object", () => {
    const url = new URL("https://dc09.ru/posts/fediverse#comments")
    const hash = url.hash
    url.hash = ""
    return encodeURIComponent(url.toString()) + hash
  })

  // Variant 2 (reference): split at the first "#" with plain string methods.
  .ref("String methods", () => {
    const url = "https://dc09.ru/posts/fediverse#comments"
    const hashIdx = url.indexOf("#")
    if (hashIdx !== -1) {
      return encodeURIComponent(url.substring(0, hashIdx)) + url.substring(hashIdx)
    }
    return encodeURIComponent(url)
  })

benchmark.run()

Benchmark results

Platform info:
==============
   Linux 6.7.4-artix1-1 x64
   Node.JS: 21.6.1
   V8: 11.8.172.17-node.19
   CPU: Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz × 12
   Memory: 30 GB

Suite: URL convert
==================

✔ URL object         1 176 230 ops/sec
✔ String methods     4 825 419 ops/sec

   URL object           -75,62%   (1 176 230 ops/sec)   (avg: 850ns)
   String methods (#)        0%   (4 825 419 ops/sec)   (avg: 207ns)

┌────────────────┬────────────────────────────────────────────────────┐
│ URL object     │ ████████████                                       │
├────────────────┼────────────────────────────────────────────────────┤
│ String methods │ ██████████████████████████████████████████████████ │
└────────────────┴────────────────────────────────────────────────────┘

[?format=text] No newlines around headings

Example: MDN page

When the parser output format is set to text, I often get headings concatenated with the paragraph body:

1

Try itSyntaxfind(callbackFn)
find(callbackFn, thisArg)

2

SpecificationsSpecificationECMAScript Language Specification # sec-array.prototype.findBrowser compatibilityBCD tables only load in the browserSee also

While it must be:

1

2

Links with #hashes

For example, open: Nginx wiki article
Each heading has a ¶ (paragraph sign) with a link to that heading, e.g. #fastcgi-params
txtdot converts these links to something like /get?url=...%2Fphpfcgi%2F%23fastcgi-params (%23 is a URL-encoded hash), but txtdot should separate the hash from the link so the browser can understand it, like ?url=...%2Fphpfcgi#fastcgi-params

This also makes sense in a "Table of Contents", where jumping to a heading is done with hash links.
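
The behavior described above could be sketched roughly like this. The function name toProxyLink and the default "/get?url=" prefix are assumptions for illustration, and example.com stands in for the real site.

```typescript
// Hypothetical helper: builds a txtdot link for a remote URL while keeping
// the "#fragment" outside of the encoded "url" query value.
function toProxyLink(remoteUrl: string, prefix = "/get?url="): string {
  const hashIdx = remoteUrl.indexOf("#");
  if (hashIdx === -1) {
    return prefix + encodeURIComponent(remoteUrl);
  }
  // Encode only the part before "#"; the fragment is appended as-is so the
  // browser treats it as an anchor on the proxied page.
  const page = remoteUrl.substring(0, hashIdx);
  const fragment = remoteUrl.substring(hashIdx);
  return prefix + encodeURIComponent(page) + fragment;
}

// "https://example.com/phpfcgi/#fastcgi-params" becomes
// "/get?url=https%3A%2F%2Fexample.com%2Fphpfcgi%2F#fastcgi-params"
const link = toProxyLink("https://example.com/phpfcgi/#fastcgi-params");
```

Since the fragment is never sent to the server, the proxied page only needs to preserve the heading ids for the jump to work.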

Add more settings

Settings should be placed either below the URL bar or on a separate page. Maybe there's no point in the latter.

Either pass all the settings (listed below) in the /get query, which will make txtdot URLs longer, or save the settings in cookies so that they are user-specific, which I think is the intended and handy behavior.

Proxy images: Remove from page / No proxy / Compress / Original
see #98 #96
(??) Proxy media, i.e. video & audio
Replace iframes: Remove from page / No replace / Parse with txtdot

And, of course:

Dark theme: Off / Auto / On
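
For the cookie variant, a rough sketch of reading the settings from a Cookie header could look like this. The cookie names, the value spellings and the defaults are all assumptions that merely mirror the list above.

```typescript
// Assumed value sets, mirroring the options listed above.
type ImageMode = "remove" | "none" | "compress" | "original";
type IframeMode = "remove" | "none" | "parse";
type ThemeMode = "off" | "auto" | "on";

interface UserSettings {
  images: ImageMode;
  iframes: IframeMode;
  darkTheme: ThemeMode;
}

// Assumed defaults; the issue does not specify any.
const DEFAULTS: UserSettings = { images: "original", iframes: "none", darkTheme: "auto" };

// Splits a raw Cookie header ("a=1; b=2") into a name -> value object.
function parseCookies(header: string): Record<string, string> {
  const cookies: Record<string, string> = {};
  for (const part of header.split(";")) {
    const eq = part.indexOf("=");
    if (eq === -1) continue;
    cookies[part.slice(0, eq).trim()] = part.slice(eq + 1).trim();
  }
  return cookies;
}

// Returns `value` if it is one of `allowed`, otherwise the fallback.
function pick<T extends string>(value: string | undefined, allowed: readonly T[], fallback: T): T {
  return allowed.indexOf(value as T) !== -1 ? (value as T) : fallback;
}

function settingsFromCookies(header: string): UserSettings {
  const c = parseCookies(header);
  return {
    images: pick<ImageMode>(c["images"], ["remove", "none", "compress", "original"], DEFAULTS.images),
    iframes: pick<IframeMode>(c["iframes"], ["remove", "none", "parse"], DEFAULTS.iframes),
    darkTheme: pick<ThemeMode>(c["darkTheme"], ["off", "auto", "on"], DEFAULTS.darkTheme),
  };
}
```

Validating every value against a fixed list means a stale or hand-edited cookie falls back to the default instead of breaking the page.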

Parsers partially broken

Google

TypeError: Invalid URL
Exception thrown at google.ts:22, new URL(a.href)

Stackoverflow

TypeError: Cannot read properties of undefined (reading 'href')
Exception thrown at stackoverflow/main.ts:11, most probably because of switching to LinkeDOM which does not have window.location
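
For the Google error, one possible cause (an assumption, not confirmed by the trace) is a relative or malformed href being passed to new URL without a base. A defensive sketch, with the hypothetical helper name resolveHref, could resolve hrefs against the page URL and skip the ones that still fail:

```typescript
// Hypothetical helper: resolves an href found in a page against the page URL.
// Returns null instead of throwing when the href is missing or unparsable.
function resolveHref(href: string | undefined, pageUrl: string): string | null {
  if (!href) return null;
  try {
    // The second argument is only used when href is relative.
    return new URL(href, pageUrl).toString();
  } catch {
    return null;
  }
}

// A relative link is resolved against the page it was found on.
const resolved = resolveHref("/search?q=txtdot", "https://www.google.com/");
```

Passing the page URL explicitly also avoids depending on window.location, which is what the Stackoverflow parser appears to be missing.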

Add alternative frontend redirection page

Useful in iframes. For example, instead of loading a YouTube embed, load a txtdot page where "Open in Invidious" / "Original page" can be chosen. Alternative frontends' URLs are configured in settings, #99

Add image compression

For even better traffic reduction, we can compress images as done here: https://github.com/ayastreb/bandwidth-hero-proxy.

We can add handling of this to the /proxy route.

Add support for webder

https://github.com/TxtDot/webder

Links and version on the main page

There should be links to GitHub and Matrix on the main page, and maybe the txtdot version too.

[Security bug] Possibility of executing JS on client

From Readability Readme:

If you're going to use Readability with untrusted input (whether in HTML or DOM form), we strongly recommend you use a sanitizer library like DOMPurify to avoid script injection when you use the output of Readability. We would also recommend using CSP to add further defense-in-depth restrictions to what you allow the resulting content to do. The Firefox integration of reader mode uses both of these techniques itself.

Reproducing

I've hosted an XSS vulnerability showcase code from DOMPurify's page at xss.dc09.ru/dompurify. If you open it in your browser without txtdot, all JS code will be executed.

txtdot should remove javascript, but... look, some code with alert is still executed!

The Problem Itself

XSS attacks are usually aimed at stealing some sensitive data, for example, auth token from cookies.

txtdot does not store any data. But I said "usually". An attacker could embed JS code that sends requests to Google Analytics or even loads some malicious JS / WASM from a remote server.

Because of the lack of DOM sanitization, txtdot can potentially pass JS to the client, while the description says the opposite (...removing scripts...)

The Solution

As mentioned above, DOMPurify.

I think that shifting the task of disabling JS to the user is against txtdot principles.

Restrict proxy to media MIMEs, split routes

For now, /proxy can pass any type of content without any checks. We don't want the users to download a 3 GB ISO through our txtdot servers.

Restrict proxying to images, video (and audio??) MIME types
Create separate API route /imgproxy, and then #96
Leave /proxy for video and audio, ~~or create a separate route~~
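
The MIME check itself could be as small as the sketch below. The function name and the allow-list are assumptions, and a real route would also want a size limit, which is not shown here.

```typescript
// Assumed allow-list: top-level media types the proxy would pass through.
const ALLOWED_TYPES = ["image", "video", "audio"];

// Returns true if a Content-Type header value names an allowed media type.
function isProxyableContentType(contentType: string | undefined): boolean {
  if (!contentType) return false;
  // Drop parameters such as "; charset=utf-8" and normalize case.
  const mime = contentType.split(";")[0].trim().toLowerCase();
  const slash = mime.indexOf("/");
  // Reject values without both a type and a subtype.
  if (slash <= 0 || slash === mime.length - 1) return false;
  return ALLOWED_TYPES.indexOf(mime.slice(0, slash)) !== -1;
}
```

With this check, a response with Content-Type application/x-iso9660-image would be refused, while image/png or video/mp4 would pass.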

Improve readme

There are no illustrative pictures or feature descriptions in our readme. The feature descriptions can be taken from the documentation, for example. We should also add a comparison of a site page with and without the proxy.

Pictures, comparison
Features

Proxy resources

One of the users asked me about proxying images, videos, etc., because privacy is really important to him.
I support this idea, and it doesn't look hard to implement.

Should be configurable, PROXY_RES=false|true in .env, true by default.

Templates in components/ are not found when built for prod

Error: /home/darkcat09/code/txtdot/dist/templates/index.ejs:25
    23| <p><%= publicConfig.description %></p>
    24| </header>
 >> 25| <%- include('./components/form-main.ejs') %>
    26| </main>
    27| </body>
    28| </html>

Could not find the include file "./components/form-main.ejs"

Separate API/browser endpoints, errors and docs

If we also want to provide an API (only /raw-html for now), all endpoints should be under /api/ path, not at the root.

In that case, we can configure Swagger docs for /api only, which is more convenient.
We can display a beautiful error page if it's a browser endpoint, and a JSON error object for API endpoints.
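
A framework-independent sketch of that split, assuming API routes live under /api/ as proposed. The names ErrorReply and buildErrorReply are hypothetical, and the HTML is only a placeholder for a real error template.

```typescript
interface ErrorReply {
  statusCode: number;
  contentType: string;
  body: string;
}

// Escapes text before it is placed into the HTML error page.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Assumption: everything under "/api/" is an API endpoint,
// every other path is meant for browsers.
function buildErrorReply(path: string, statusCode: number, message: string): ErrorReply {
  if (path === "/api" || path.startsWith("/api/")) {
    return {
      statusCode,
      contentType: "application/json",
      body: JSON.stringify({ statusCode, message }),
    };
  }
  return {
    statusCode,
    contentType: "text/html; charset=utf-8",
    body: `<!doctype html><title>Error ${statusCode}</title><h1>Error ${statusCode}</h1><p>${escapeHtml(message)}</p>`,
  };
}
```

In the server, a function like this could be called from a global error handler, with the request path deciding which representation the client gets.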

Huge HTML hangs whole server

Example: https://www.win.tue.nl/~aeb/natlang/ie/tochB.html

This page is really big, so okay, the txtdot proxy can't parse it in a reasonable time. But it hangs the whole server, and after the client gets a timeout error, txtdot still does not accept any requests, which makes DoS attacks as simple as possible: an attacker needs only one request.

Using NodeJS profiler, I found out that JSDOM takes almost all the time.

What we can do:

Set a timeout for the server response, so Fastify won't wait for page processing to complete; connectionTimeout seems to be the solution, but I'm not sure.
Switch to other HTML processing library which provides DOM API (sorted by last commit date: linkedom, happy-dom, cheerio).
Rewrite Readability so it won't need DOM API; and/or rewrite it in Rust.

Encoding in Axios response

For example, 4PDA gives the content in CP-1251 encoding. Ahhh, who really uses 1251 except several web sites written by... However, the bug exists.
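
A possible approach, sketched under the assumption that the response body is available as raw bytes (with Axios that would mean requesting it with responseType: "arraybuffer") and that the server declares the charset in Content-Type. The helper names are hypothetical. TextDecoder only knows legacy encodings such as windows-1251 when Node is built with full ICU, which I believe is the case for the official builds.

```typescript
// Extracts the charset parameter from a Content-Type header value,
// defaulting to UTF-8 when none is declared.
function charsetFromContentType(contentType: string | undefined): string {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType ?? "");
  return match ? match[1].toLowerCase() : "utf-8";
}

// Decodes a raw response body using the declared charset.
function decodeBody(body: Uint8Array, contentType: string | undefined): string {
  const charset = charsetFromContentType(contentType);
  try {
    return new TextDecoder(charset).decode(body);
  } catch {
    // Unknown encoding label: fall back to UTF-8 rather than failing the request.
    return new TextDecoder("utf-8").decode(body);
  }
}

// The bytes below spell "Привет" in CP-1251.
const sample = new Uint8Array([0xcf, 0xf0, 0xe8, 0xe2, 0xe5, 0xf2]);
const text = decodeBody(sample, "text/html; charset=windows-1251");
```

Pages that declare the encoding only in a meta tag would need an extra step that this sketch does not cover.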

Question about the main instance

txt.dc09.ru is located in Russia. How does it open sites such as BBC News, Meduza, SoundCloud and other sites blocked in Russia? Are you using zapret or what?

Error pages and API objects

Related to #30

Reply with the corresponding error codes instead of 500.
Show a beautiful error page for browser endpoints instead of Fastify's default JSON.
Show a meaningful JSON for /parse API endpoint (maybe, just add a field to IHandlerOutput).

Create engine routing system

Engine development would be more convenient with an interface through which engines interact with requests, for example to match the route requested by the user.

Something like Fastify or Express:

engine.route("/search", (ro) => {
  const query = ro.query.q;
  // ...
});
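
To make the idea more concrete, here is a minimal sketch of what such an interface might look like. Every name in it (Engine, RouteRequest, route, handle) is hypothetical and only exact pathname matching is shown, so path parameters and wildcards are left out.

```typescript
// All names in this sketch are hypothetical, not txtdot's actual API.
interface RouteRequest {
  url: URL;
  query: Record<string, string>;
}

type RouteHandler = (ro: RouteRequest) => string;

class Engine {
  private readonly domain: string;
  private routes: { path: string; handler: RouteHandler }[] = [];

  constructor(domain: string) {
    this.domain = domain;
  }

  // Registers a handler for an exact pathname on this engine's domain.
  route(path: string, handler: RouteHandler): this {
    this.routes.push({ path, handler });
    return this;
  }

  // Runs the first matching handler, or returns null if nothing matches.
  // Throws if rawUrl is not a valid absolute URL.
  handle(rawUrl: string): string | null {
    const url = new URL(rawUrl);
    if (url.hostname !== this.domain) return null;
    for (const r of this.routes) {
      if (r.path === url.pathname) {
        const query: Record<string, string> = {};
        url.searchParams.forEach((value, key) => {
          query[key] = value;
        });
        return r.handler({ url, query });
      }
    }
    return null;
  }
}

const engine = new Engine("example.com");
engine.route("/search", (ro) => `Results for ${ro.query.q}`);
```

Returning null for unmatched URLs would let the caller fall through to the default Readability engine.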

403 on Android Developers

Example: https://developer.android.com/reference/android/widget/TextView

It gives 200 OK when requesting with cURL and even without User-Agent, but 403 Forbidden when proxying through txtdot.
