kagisearch / smallweb



Kagi Small Web

Home Page: https://kagi.com/smallweb

License: MIT License

CSS 16.29% Python 45.80% HTML 30.93% Dockerfile 2.75% Shell 4.23%

smallweb's People

Contributors


carl-ki loki890 raybb janikvonrotz lsomm piranha tekknolagi mhitza andrew-monroe mohamedelashri warpspin slyngshede mtlynch alvanrahimli chrisodva jackeyjack1 ryanb58 101313 commandblockguy tuinslak vkeerthivikram jwldk dkarter thejeshgn carwynnelson philippe2803 sponge5 lutzh cdransf freeagent85 matizzy einenlum ordoviz gingerbeardman cuphx thek3nger ericmortensen flamedfury jonasvautherin cloudrac3r sdball adamchandler1986 lucasgonze agentcooper simonesilvestroni adactio fgiasson chriscoyier otherjoel thejanmanshow kevinspencer castarco zhiwei-g-luo kblestarge belst-n samwho mzagaja vackosar tinloaf rghedin manicx100 v411e acheam0 m1hax hergaiety kevquirk rotmoded chotrin jeremysarber amitmerchant1990 peterdavehello akirk groovenectar alexture dmitrika mothdotmonster aidangilmore kangminsukdotcom josh-g91 muddi900 scootray matmat j2kun mbwgh daltonwillard ohscee packetcat thelmuxkriovar simoneorg openjck bneijt taylorjadin lee-perry bn-l benhr richardoc dex3703 jimmielightner andersju dustydean

smallweb's Issues

Guidance about microblogging and social platforms

I'd like to ask for guidance about the eligibility of microblogs and other content feeds that sit close to social networks.

Specifically, most Fediverse software (Mastodon, Pleroma, Lemmy, etc.) publishes syndication feeds. Some of the content is explicitly defined as (micro)blogging (e.g. Mastodon), while others (like Lemmy) aren't, and it's safe to say that all content can also be considered as social media (as in Facebook) postings thanks to the connections provided by the ActivityPub protocol.

I would definitely see value in indexing microblogs and other types of content (e.g. Lemmy posts can be useful resources), especially since discoverability on the Fediverse is notoriously bad. On the other hand, I expect (but wouldn't necessarily agree with) social media to be excluded.

Please clarify your stance on this! Thank you!

Stripping out MathML from the RSS feed

It looks like you're stripping MathML out of the RSS feed -- is that intentional, or are you using an older sanitizer that doesn't recognize it? For example, my RSS feed for my most recent post has:

<math display="block">
<msup><mi>e</mi>
      <mrow><mi>k</mi><mi>t</mi></mrow></msup>
</math>

Which shows up in your RSS feed as:

e
      kt
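
One way this kind of output can arise is an allowlist-based sanitizer that drops unknown tags but keeps their text. The sketch below illustrates that behaviour with Python's standard `html.parser`. It is not Kagi's actual sanitizer, and both tag sets are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical allowlists, for illustration only.
BASIC_TAGS = {"p", "a", "em", "strong", "code", "pre"}
MATHML_TAGS = {"math", "mrow", "mi", "mn", "mo", "msup", "msub", "mfrac"}


class AllowlistSanitizer(HTMLParser):
    """Keep tags on the allowlist; drop other tags but keep their text."""

    def __init__(self, allowed):
        super().__init__(convert_charrefs=True)
        self.allowed = allowed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed:
            # Attributes are dropped here to keep the sketch short.
            self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)


def sanitize(markup, allowed):
    parser = AllowlistSanitizer(allowed)
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


snippet = "<math><msup><mi>e</mi><mrow><mi>k</mi><mi>t</mi></mrow></msup></math>"
print(sanitize(snippet, BASIC_TAGS))                # tags stripped, text kept
print(sanitize(snippet, BASIC_TAGS | MATHML_TAGS))  # MathML tags preserved
```

If something like this is the cause, adding the MathML element names to the allowlist would be enough to keep the markup in the feed.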

Guidelines: Websites that do not work at all without JavaScript allowed?

I feel like even though their content might be enjoyable, some websites do not work at all (i.e. show no text) without JS. Isn't that against the small web spirit?

Search & Index Nostr - the small (organic) web social media

user story

As a Kagi user who appreciates the small web, I would like Kagi to index nostr notes and nostr profiles, so that I can discover small web nostr notes and profiles in my Kagi search results.

acceptance criteria

there is a method to search for only, or first/primarily nostr notes (e.g. !nostr insert search query here) on Kagi
nostr notes are added to the small web Kagi list of organic results

context & discussion

Nostr is an organic, public domain information sharing protocol (notes and other stuff transmitted by relays).

The first popular nostr application is social media, although there are many "other stuff" developments happening, which should make things increasingly interesting.

Unlike Facebook, Twitter, Instagram, Wartscaster there is no company behind the protocol, there are no pre-mined tokens, and there are no advertisements built in. As a beachhead, devs building on nostr aim to fix the rage-bait gulag tech advertising problematic model of the status quo social media.

Current nostr-only search engines include nostr.band and nos.today. Clients like Damus, Iris, and Primal have some basic level of search built in.

https://njump.me/ is a nostr gateway that allows browser users to see a preview of notes, profiles, and other events.

Let me know if you have any questions on nostr - happy to help. I am the product contributor for Damus, and help with a few other nostr things. You can tag me here or on nostr https://iris.to/npub1zafcms4xya5ap9zr7xxr0jlrtrattwlesytn2s42030lzu0dwlzqpd26k5.

guidelines for where to insert site?

Do you want people to insert their sites so the list is kept alphabetical? If so, you probably should add that to the guidelines.

One plus would be that it should help avoid merge conflicts from many people adding to the top or bottom.

Wrong links in README for repo files

In the info section you link to files smallweb.txt etc.
The links are for editing those files, not viewing them. Clicking the link prompts to fork the repo.

What's currently linked
https://github.com/kagisearch/smallweb/edit/main/smallweb.txt
vs
What should be linked
https://github.com/kagisearch/smallweb/blob/main/smallweb.txt

Feeds in other languages?

First, thanks for starting this initiative! Love it.

I have started curating a list of urls for feeds meeting the same criteria as the main ones but in Swedish (my native language) and in a file called smallweb-sv.txt. Would that be something that would be accepted in the main repo?

My WIP is here: https://github.com/matmat/smallweb/blob/smallweb-sv/smallweb-sv.txt

Limit number of posts per source in RSS feed

I think it would be a good idea to limit the number of posts per website in the RSS feed. For example, there are currently 9 posts from the same website (samim.io) at the top of the RSS feed, and this has been going on for a few days already. The website and posts are fine, but there are so many of them at the top of the feed that other posts get lost.
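
A minimal sketch of such a cap, assuming the feed is available as an ordered list of post URLs. The function name and the default limit of 3 are hypothetical and not taken from the Kagi code.

```python
from collections import Counter
from urllib.parse import urlparse


def cap_per_source(post_urls, max_per_source=3):
    """Return post_urls in order, keeping at most max_per_source per host."""
    seen = Counter()
    kept = []
    for url in post_urls:
        host = urlparse(url).netloc.lower()
        if seen[host] < max_per_source:
            seen[host] += 1
            kept.append(url)
    return kept


# Placeholder URLs: nine posts from one host followed by one from another.
posts = [f"https://example.com/post/{i}" for i in range(9)]
posts.append("https://example.org/hello")
print(cap_per_source(posts))
```

Because the original order is preserved, the newest posts from a busy site still appear and only the overflow is dropped.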

De-duplicate feed

Would anyone care to deduplicate the list?

Here is what I got with

awk -F/ '{print $3}' smallweb.txt | sort | uniq -d | while read domain; do echo "$domain"; grep "$domain" smallweb.txt; echo ""; done

Where feeds are duplicated, we should keep just the one that is more complete.

antonz.org
https://antonz.org/feed.xml
https://antonz.org/index.xml

ariadne.space
https://ariadne.space/atom.xml
https://ariadne.space/index.xml

bengtan.com
https://bengtan.com/index.xml
https://bengtan.com/interesting-things/index.xml

benjamincongdon.me
https://benjamincongdon.me/blog/feed.xml
https://benjamincongdon.me/feed.xml

bhavukjain.com
https://bhavukjain.com/feed.xml
https://bhavukjain.com/rss.xml

bitsofco.de
https://bitsofco.de/feed/feed.xml
https://bitsofco.de/rss

bitspook.in
https://bitspook.in/archive/feed.xml
https://bitspook.in/blog/feed.xml

blog.bofh.it
https://blog.bofh.it/?format=atom
https://blog.bofh.it/debian/?format=atom

blog.computationalcomplexity.org
https://blog.computationalcomplexity.org/feeds/posts/default
https://blog.computationalcomplexity.org/feeds/posts/default?alt=rss

blog.ctis.me
https://blog.ctis.me/blog/index.xml
https://blog.ctis.me/index.xml

blog.ncase.me
https://blog.ncase.me/feed.xml
https://blog.ncase.me/rss/

blogs.agu.org
https://blogs.agu.org/fromaglaciersperspective/
https://blogs.agu.org/geoedtrek/feed/
https://blogs.agu.org/georneys/
https://blogs.agu.org/landslideblog/feed/
https://blogs.agu.org/mountainbeltway/feed/
https://blogs.agu.org/wildwildscience/feed/

blogs.bl.uk
https://blogs.bl.uk/digitisedmanuscripts/atom.xml
https://blogs.bl.uk/european/atom.xml

blogs.gnome.org
https://blogs.gnome.org/alexl/feed/
https://blogs.gnome.org/hughsie/feed/
https://blogs.gnome.org/tbernard/feed/
https://blogs.gnome.org/wjjt/feed/

boffosocko.com
https://boffosocko.com/category/mathematics/feed/
https://boffosocko.com/feed/?cat=-484
https://boffosocko.com/kind/article/feed/

brandur.org
https://brandur.org/articles.atom
https://brandur.org/nanoglyphs.atom

buttondown.email
https://buttondown.email/alby/rss
https://buttondown.email/linotypebook/rss
https://buttondown.email/meticulous.serendipity/rss
https://buttondown.email/touchgrass/rss

chrisx.xyz
https://chrisx.xyz/blog/index.xml
https://chrisx.xyz/index.xml

cohost.org
https://cohost.org/Cadey-Bunny/rss/public.atom
https://cohost.org/tomforsyth/rss/public.atom

colinwalker.blog
https://colinwalker.blog/dailyfeed.xml
https://colinwalker.blog/livefeed.xml

cweiske.de
https://cweiske.de/feeds/news.xml
https://cweiske.de/tagebuch/feed/

danq.me
https://danq.me/feed/
https://danq.me/kind/article/feed/

daveon.design
https://daveon.design/atom.xml
https://daveon.design/rss.xml

declanbyrd.co.uk
https://declanbyrd.co.uk/journal/articles.xml
https://declanbyrd.co.uk/journal/feed.xml

embedded.fm
https://embedded.fm/blog?format=RSS
https://embedded.fm/blog?format=rss

entangledlogs.com
https://entangledlogs.com/archives/index.xml
https://entangledlogs.com/index.xml

gilmi.me
https://gilmi.me/blog/atom.xml
https://gilmi.me/blog/rss

granary.io
https://granary.io/url?input=html&output=rss&url=https://jamesg.blog
https://granary.io/url?url=https://werd.io/content/all/&input=html&output=atom&hub=https://bridgy-fed.superfeedr.com/

hacdias.com
https://hacdias.com/articles/feed.xml
https://hacdias.com/feed.xml

honzadvorsky.com
https://honzadvorsky.com/feed
https://honzadvorsky.com/feed.xml

hyperborea.org
https://hyperborea.org/feed.xml
https://hyperborea.org/journal/feed/

jasm1nii.xyz
https://jasm1nii.xyz/blog/articles/articles.xml
https://jasm1nii.xyz/blog/notes/notes.xml

jmtd.net
https://jmtd.net/log/feed
https://jmtd.net/log/index.atom

junkcharts.typepad.com
https://junkcharts.typepad.com/junk_charts/atom.xml
https://junkcharts.typepad.com/numbersruleyourworld/atom.xml

kwon.nyc
https://kwon.nyc/index.xml
https://kwon.nyc/notes/index.xml

lukesmith.xyz
https://lukesmith.xyz/index.xml
https://lukesmith.xyz/rss.xml

mathspp.com
https://mathspp.com/blog.atom
https://mathspp.com/blog.rss

mattbruenig.com
https://mattbruenig.com/comments/feed/
https://mattbruenig.com/feed/

matthewrocklin.com
https://matthewrocklin.com/atom.xml
https://matthewrocklin.com/blog/atom.xml

mccormick.cx
https://mccormick.cx/news/index.rss
https://mccormick.cx/news/tags/workblog.rss

nadh.in
https://nadh.in/blog/index.xml
https://nadh.in/index.xml

podviaznikov.com
https://podviaznikov.com/feed.xml
https://podviaznikov.com/writings/feed.xml

pxtl.ca
https://pxtl.ca/feed.xml
https://pxtl.ca/rss.xml

quentin.delcourt.be
https://quentin.delcourt.be/feeds/blog.xml
https://quentin.delcourt.be/feeds/links.xml

research.swtch.com
http://research.swtch.com/feed.atom
https://research.swtch.com/feed.atom

sobolevn.me
https://sobolevn.me/feed.xml
https://sobolevn.me/git-secret/feed.xml

stefan-marr.de
https://stefan-marr.de/feed/
https://stefan-marr.de/feed/index.xml

susam.net
https://susam.net/blog/feed.xml
https://susam.net/maze/feed.xml

theadhocracy.co.uk
https://theadhocracy.co.uk/rss-articles.xml
https://theadhocracy.co.uk/rss.xml

theprivacydad.com
https://theprivacydad.com/feed/
https://theprivacydad.com/feed/?type=rss

thoughts.hnr.fyi
https://thoughts.hnr.fyi/feed.xml
https://thoughts.hnr.fyi/index.xml

tomaugspurger.net
https://tomaugspurger.net/feeds/all.atom.xml
https://tomaugspurger.net/index.xml

tomgamon.com
https://tomgamon.com/atom.xml
https://tomgamon.com/index.xml

tomstu.art
https://tomstu.art/articles.atom
https://tomstu.art/feed.atom

tratt.net
https://tratt.net/laurie/blog/blog.rss
https://tratt.net/laurie/news.rss

utf9k.net
https://utf9k.net/blog/rss.xml
https://utf9k.net/rss.xml

veerle.duoh.com
https://veerle.duoh.com/design-feed.rss
https://veerle.duoh.com/inspiration-feed.rss

veronique.ink
https://veronique.ink/feed/
https://veronique.ink/feed/?type=rss

werd.io
https://granary.io/url?url=https://werd.io/content/all/&input=html&output=atom&hub=https://bridgy-fed.superfeedr.com/
https://werd.io/?_t=rss
https://werd.io/content/posts?_t=rss

wizardzines.com
https://wizardzines.com/comics/index.xml
https://wizardzines.com/index.xml

write.as
https://write.as/davepolaschek/feed/
https://write.as/matt/feed/
https://write.as/zampano/feed/

www.allendowney.com
https://www.allendowney.com/blog/feed/
https://www.allendowney.com/wp/feed/

www.briandeconinck.com
https://www.briandeconinck.com/feed/
https://www.briandeconinck.com/notes/feed/

www.ekran.org
https://www.ekran.org/ben/wp/feed/
https://www.ekran.org/ben/wp/feed/?_=8177

www.ellyloel.com
https://www.ellyloel.com/articles.atom
https://www.ellyloel.com/feed.atom

www.gyford.com
https://www.gyford.com/phil/feeds/everything/rss/
https://www.gyford.com/phil/writing/feeds/posts/rss/

www.jeremycherfas.net
https://www.jeremycherfas.net/blog.atom
https://www.jeremycherfas.net/blog.rss

www.math.columbia.edu
https://www.math.columbia.edu/~dejong/wordpress/?feed=rss2
https://www.math.columbia.edu/~woit/wordpress/?feed=rss2

www.preposterousuniverse.com
https://www.preposterousuniverse.com/blog/feed/
https://www.preposterousuniverse.com/feed/

www.siennaeggler.com
https://www.siennaeggler.com/author/sienna/rss/
https://www.siennaeggler.com/rss/

www.sierz.co.uk
https://www.sierz.co.uk/blog/feed/
https://www.sierz.co.uk/reviews/feed/

www.somebits.com
https://www.somebits.com/linkblog/index.atom
https://www.somebits.com/weblog/index.atom

www.tekovic.com
https://www.tekovic.com/atom.xml
https://www.tekovic.com/rss.xml

xdg.me
https://xdg.me/index.xml
https://xdg.me/writing/index.xml
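
For anyone who prefers a script to the one-liner, here is a rough Python equivalent. It assumes smallweb.txt holds one feed URL per line, and the function names are made up for this sketch.

```python
from collections import defaultdict
from urllib.parse import urlparse


def feeds_by_host(feed_urls):
    """Map each host to the sorted list of feed URLs listed under it."""
    groups = defaultdict(list)
    for url in feed_urls:
        url = url.strip()
        if url:
            groups[urlparse(url).netloc.lower()].append(url)
    return {host: sorted(urls) for host, urls in groups.items()}


def hosts_with_multiple_feeds(feed_urls):
    """Keep only hosts that appear more than once, like `uniq -d`."""
    return {h: u for h, u in feeds_by_host(feed_urls).items() if len(u) > 1}


# Three URLs taken from the list above; in practice the lines would be
# read from smallweb.txt instead.
sample = [
    "https://antonz.org/feed.xml",
    "https://antonz.org/index.xml",
    "https://xdg.me/index.xml",
]
for host, urls in hosts_with_multiple_feeds(sample).items():
    print(host)
    print("\n".join(urls))
```

Unlike the grep in the one-liner, this matches on the host only, so the granary.io URL would not be listed under werd.io. A shared host is still only a hint: hosts such as write.as or blogs.gnome.org carry feeds from different authors, so each group needs a manual look.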

Consider leaving Microsoft GitHub

If we want to push for a smaller web not under US corporate control, shouldn’t this code be hosted on a libre platform, ideally with federated patches? Y’all link sources like https://smallweb.page/why

Modern Gatekeepers

Today, most of the time spent on the web is either on a small number of very dominant platforms like Facebook and LinkedIn, or mediated through them.

Why stop at blogs? Microsoft GitHub is trying to monopolize the coding space as much as Facebook, et al. are trying to do with web content. Note the way users can’t participate without an account, the platform’s Code search is disabled unless you are authenticated, there are upsell advertisements like the one for Microsoft GitHub Copilot in the UI, etc. DVCSs (distributed version control systems) like Git don’t need to be kept behind a platform any more than the small web does. Why not apply the same rules/goals to this code forge (and then follow it with the chatroom)?

joelx.com does not meet quality requirements

Browsing The Small Web, I was served this: https://joelx.com/dead-children/18136/

Not sure how that meets anywhere near the quality criteria for the site. It's just grotesque.

High volume feeds

There are some feeds with a very high volume of posts currently in the list. Should there be a limit for this in the criteria for inclusion?

Here are the ones with recent items that have an average of >5 posts per day.

Average posts per day	Average posts per week	Feed URL	Newest post date	Oldest post date	Number of posts
15.0	105.0	https://soft-wa.re/podcast.rss	2023-10-19T01:35:39+00:00Z	2023-10-19T01:35:38+00:00Z	15
12.5	87.5	https://maique.eu/feed.xml	2023-10-19T16:09:29+00:00Z	2023-10-17T14:12:04+00:00Z	25
10.0	70.0	https://bottledaux.com/feed/	2023-10-18T22:22:26+00:00Z	2023-10-17T02:56:13+00:00Z	10
10.0	70.0	https://feeds.feedburner.com/thisisnthappiness	2023-10-19T16:02:47+00:00Z	2023-10-16T19:08:11+00:00Z	20
10.0	70.0	https://jabberwocking.com/feed/	2023-10-19T17:11:30+00:00Z	2023-10-18T16:42:35+00:00Z	10
10.0	70.0	https://jacklimpert.com/feed/	2023-10-19T15:30:16+00:00Z	2023-10-18T20:27:47+00:00Z	10
10.0	70.0	https://johnmenadue.com/feed/	2023-10-18T17:58:41+00:00Z	2023-10-17T17:57:46+00:00Z	10
10.0	70.0	https://putah-creek.tumblr.com/rss	2023-10-19T12:54:35+00:00Z	2023-10-17T04:06:24+00:00Z	20
10.0	70.0	https://www.fullmoonfiberart.com/feed/	2023-10-19T15:34:33+00:00Z	2023-10-17T21:53:52+00:00Z	10
10.0	70.0	https://www.lesswrong.com/feed.xml	2023-10-19T16:48:00+00:00Z	2023-10-18T20:39:34+00:00Z	10
10.0	70.0	https://www.thepassivevoice.com/feed/	2023-10-19T17:03:29+00:00Z	2023-10-18T04:02:56+00:00Z	10
9.1	63.6	https://www.jvt.me/feed.xml	2023-10-19T13:10:04+00:00Z	2023-10-07T15:23:00+00:00Z	100
8.6	60.0	https://feeds.kottke.org/main	2023-10-19T17:12:18+00:00Z	2023-10-11T19:29:48+00:00Z	60
7.5	52.5	https://feeds.feedblitz.com/marginalrevolution&x=1	2023-10-19T15:32:23+00:00Z	2023-10-16T18:03:52+00:00Z	15
6.3	43.8	https://itstartswithabirthstone.blogspot.com/feeds/posts/default	2023-10-19T16:42:00+00:00Z	2023-10-15T13:44:00+00:00Z	25
6.3	43.8	https://samim.io/rss.xml	2023-10-17T21:24:58+00:00Z	2023-10-13T19:31:24+00:00Z	25
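
The averages in the table appear consistent with dividing the number of posts by the whole days between the oldest and newest post, with a floor of one day. That is an inference from the rows, not something stated in the report, and the function names below are hypothetical.

```python
from datetime import datetime


def parse_ts(value):
    """Parse timestamps like those in the table, dropping the trailing 'Z'."""
    return datetime.fromisoformat(value.rstrip("Z"))


def average_posts_per_day(newest, oldest, count):
    """Posts divided by whole elapsed days, treating under a day as one day."""
    whole_days = (parse_ts(newest) - parse_ts(oldest)).days
    return count / max(whole_days, 1)


# Row for https://maique.eu/feed.xml from the table above.
per_day = average_posts_per_day(
    "2023-10-19T16:09:29+00:00Z", "2023-10-17T14:12:04+00:00Z", 25
)
print(per_day, per_day * 7)  # daily and weekly averages
```

With this method a feed that exposes only its last 10 items, all newer than a day, reports exactly 10.0 per day, so the figures for short feeds are probably lower bounds rather than true rates.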

Possible candidates for removal from the index

Hi,

I spent some time creating a small web dataset command line tool to start analyzing the index and to help others create a dataset from it.

The first task was to analyze the feeds to check if they were all English sites as per the guidelines. After an initial processing, it determined that some feeds are not English feeds.

To identify the language of a feed, it proceeds as follows:

use langdetect on the title and description of the feed
use langdetect on the title and content of the articles of each feed
reassign a different language to the feed if the majority of its articles are in another language

The result of this processing is attached to this issue. Feeds tagged with a specific language are most likely in that language and could be candidates for removal. Feeds with an empty language are those where there is not enough data to detect the language (concat(title + description) < 64 characters) for most of their articles, so the current heuristic can't determine the appropriate language for the feed.
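
The heuristic described above could look roughly like the sketch below. It is a reconstruction from the description, not the tool's actual code: the function names are hypothetical, and the detector is passed in as a parameter so the logic can be tried without installing anything.

```python
from collections import Counter

MIN_CHARS = 64  # threshold mentioned in the description above


def detect_or_none(text, detect):
    """Return a language code, or None when the text is too short to judge."""
    text = text.strip()
    if len(text) < MIN_CHARS:
        return None
    try:
        return detect(text)
    except Exception:  # detectors may raise on text with no usable features
        return None


def feed_language(feed_text, article_texts, detect):
    """Guess a feed's language from its own text and from its articles."""
    feed_lang = detect_or_none(feed_text, detect)
    article_langs = [detect_or_none(t, detect) for t in article_texts]
    if not article_langs:
        return feed_lang
    top_lang, top_count = Counter(article_langs).most_common(1)[0]
    # Reassign when more than half of the articles agree on one outcome;
    # a None majority means most articles were too short to classify.
    if top_count > len(article_langs) / 2:
        return top_lang
    return feed_lang
```

With the langdetect package installed, its `detect` function can be passed as the third argument, for example `feed_language(title + " " + description, article_texts, langdetect.detect)`.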

Attached: small_wed_non_en_feeds.csv

What data source is used by RSS feed?

Hello! I added the Kagi Small Web RSS feed (https://kagi.com/api/v1/smallweb/feed) to my reader and noticed that it contains a few non-English sites (e.g. I have a lot of Chinese posts in my reader right now). I wanted to create a PR to remove those sites, but the site in question (hktv1.vip in this case) doesn't seem to be present in the smallweb list in this repo.

Inoreader won’t load rss (worked for first day or two on launch)

Hi,
Great initiative here! Keen to use the rss feed.
Getting this error on Inoreader:

A feed could not be found at https://kagi.com/api/v1/smallweb/feed/. This does not appear to be a valid RSS or Atom feed

It was working for a day or two after launch, but it has been broken since.
I tried both with and without the trailing slash in the url.

Any ideas?
Much thanks!

Error when flagging a website that couldn't be opened

I was served here but got a "Firefox can't open this page" warning.

So I clicked flag but I'm getting a not found error here.

Add support for Wiby, the search engine for Classic Websites

Wiby is a boutique search engine specializing in "older style pages, lightweight and based on a subject of interest," websites "reminiscent of the early internet." Pages are manually submitted by anonymous users. Its mission seems very in line with Kagi Small Web. Would a partnership to integrate Wiby somehow be in order?

There's a Wiby.me and a Wiby.org, which are probably identical.

Previous post navigation

I am greatly enjoying this, thank you for making it! It reminds me of StumbleUpon in the best of ways.

With SU, the URL changed for each site you visited, so you could press the back button if you wanted to see something from the session again. With Kagi it doesn't, so I wonder if something could be added to the top bar to help you navigate back to sites you've already read in this session, e.g. a "Prev Post" button or a dropdown next to the URL listing all of the URLs you've opened in this session.

Reject posts without post date

There is this feed: https://franklin.dyer.me/rss/en
Some posts in it have a time in <published> but no date. In such cases the Kagi parser seems to use today's date, which makes those posts appear in the feed every day.

        <entry>
            <title type="html">The Summer Language Institute</title>
            <link href="https://franklin.dyer.me/post/121">
            <published>T00:00:00-05:00</published>
        </entry>
        <entry>
            <title type="html">Other blogs</title>
            <link href="https://franklin.dyer.me/post/216">
            <published>T00:00:00-05:00</published>
        </entry>
        <entry>
            <title type="html">Contact me</title>
            <link href="https://franklin.dyer.me/post/2">
            <published>T00:00:00-05:00</published>
        </entry>
        <entry>
            <title type="html">CV</title>
            <link href="https://franklin.dyer.me/post/1">
            <published>T00:00:00-05:00</published>
        </entry>
        <entry>
            <title type="html">None</title>
            <link href="https://franklin.dyer.me/post/None">
            <published>T00:00:00-05:00</published>
        </entry>
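
A minimal sketch of the kind of check that would reject such entries, assuming the parser has the raw <published> text available. The function names are hypothetical, and `datetime.fromisoformat` is used only as a simple strictness test, not as a full RFC 3339 parser.

```python
from datetime import datetime


def parse_published(value):
    """Return a datetime, or None if the value is not a full date-time."""
    value = value.strip()
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"  # older Pythons reject a bare 'Z'
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        return None


def entries_with_dates(entries):
    """Keep only (title, published) pairs whose published value parses."""
    return [(t, p) for t, p in entries if parse_published(p) is not None]


entries = [
    ("The Summer Language Institute", "T00:00:00-05:00"),  # from the feed above
    ("A dated post", "2023-10-19T00:00:00-05:00"),         # made-up example
]
print(entries_with_dates(entries))
```

Returning None instead of falling back to the current date lets the caller decide whether to skip the entry or look for another date field.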

Remove Cloudflare Blog from Small Web list

Cloudflare is a billion-dollar company and controls all traffic on a significant fraction of websites globally. Regardless of how interesting the Cloudflare Blog is, there is no possibility of it being considered part of the "Small Web". It should be removed from Kagi's Small Web list.