kagisearch / smallweb
Kagi Small Web
Home Page: https://kagi.com/smallweb
License: MIT License
I'd like to ask for guidance about the eligibility of microblogs and other content feeds close to social-networks.
Specifically, most Fediverse software (Mastodon, Pleroma, Lemmy, etc.) publishes syndication feeds. Some of the content is explicitly defined as (micro)blogging (e.g. Mastodon), while others (like Lemmy) aren't, and it's safe to say that all content can also be considered as social media (as in Facebook) postings thanks to the connections provided by the ActivityPub protocol.
I would definitely see value in indexing microblogs and other types of content (e.g. Lemmy posts can be useful resources), especially since discoverability on the Fediverse is notoriously bad. On the other hand, I expect (but wouldn't necessarily agree with) social media to be excluded.
Please clarify your stance on this! Thank you!
It looks like you're stripping MathML out of the RSS feed -- is that intentional, or are you using an older sanitizer that doesn't recognize it? For example, my RSS feed for my most recent post has:
<math display="block">
<msup><mi>e</mi>
<mrow><mi>k</mi><mi>t</mi></mrow></msup>
</math>
Which shows up in your RSS feed as:
e
kt
I feel like even though their content might be enjoyable, some websites do not work at all (i.e. show no text) without JS. Isn't that against the small web spirit?
As a Kagi user who appreciates the small web, I would like Kagi to index nostr notes and nostr profiles, so that I can discover small web nostr notes and profiles in my Kagi search results.
Nostr is an organic, public domain information sharing protocol (notes and other stuff transmitted by relays).
The first popular nostr application is social media, although there are many "other stuff" developments happening, which should make things increasingly interesting.
Unlike Facebook, Twitter, Instagram, Wartscaster there is no company behind the protocol, there are no pre-mined tokens, and there are no advertisements built in. As a beachhead, devs building on nostr aim to fix the rage-bait gulag tech advertising problematic model of the status quo social media.
Current nostr-only search engines include nostr.band and nos.today. Clients like Damus, Iris, and Primal have basic levels of search built in.
https://njump.me/ is a nostr gateway that allows browser users to see a preview of notes, profiles, and other events.
Let me know if you have any questions on nostr - happy to help. I am the product contributor for Damus, and help with a few other nostr things. You can tag me here or on nostr https://iris.to/npub1zafcms4xya5ap9zr7xxr0jlrtrattwlesytn2s42030lzu0dwlzqpd26k5.
Do you want people to insert their sites so the list is kept alphabetical? If so, you probably should add that to the guidelines.
One plus would be that it should help avoid merge conflicts from many people adding to the top or bottom.
In the info section you link to the files smallweb.txt etc.
The links are for editing those files, not viewing them. Clicking a link prompts you to fork the repo.
What's currently linked
https://github.com/kagisearch/smallweb/edit/main/smallweb.txt
vs
What should be linked
https://github.com/kagisearch/smallweb/blob/main/smallweb.txt
First, thanks for starting this initiative! Love it.
I have started curating a list of feed URLs that meet the same criteria as the main list but are in Swedish (my native language), in a file called smallweb-sv.txt. Would that be accepted in the main repo?
My WIP is here: https://github.com/matmat/smallweb/blob/smallweb-sv/smallweb-sv.txt
I think it would be a good idea to limit the number of posts per website in the RSS feed. For example, there are currently 9 posts from the same website (samim.io) at the top of the RSS feed, and this has been going on for a few days. The website and posts are fine, but there are so many of them at the top of the feed that other posts get lost.
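One way the proposed limit could work is sketched below. It is only an illustration of the idea, not how the Kagi feed is built: the function name `cap_posts_per_site`, the `(url, title)` entry shape, and the default cap of 2 are all assumptions made for this example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cap_posts_per_site(entries, max_per_site=2):
    """Keep at most `max_per_site` entries per host, preserving feed order.

    `entries` is a list of (url, title) tuples, newest first. The tuple
    shape and the default cap are assumptions made for this sketch.
    """
    kept_per_host = defaultdict(int)  # host -> entries kept so far
    kept = []
    for url, title in entries:
        host = urlsplit(url).hostname or ""
        if kept_per_host[host] < max_per_site:
            kept_per_host[host] += 1
            kept.append((url, title))
    return kept
```

Because the input order is preserved, the newest posts from a busy site still appear, and only the overflow is dropped.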
Would anyone care to deduplicate the list?
Here is what I got with:
awk -F/ '{print $3}' smallweb.txt | sort | uniq -d | while read domain; do echo "$domain"; grep "$domain" smallweb.txt; echo ""; done
When there are duplicate feeds, we should keep just the one that is more complete.
antonz.org
https://antonz.org/feed.xml
https://antonz.org/index.xml
ariadne.space
https://ariadne.space/atom.xml
https://ariadne.space/index.xml
bengtan.com
https://bengtan.com/index.xml
https://bengtan.com/interesting-things/index.xml
benjamincongdon.me
https://benjamincongdon.me/blog/feed.xml
https://benjamincongdon.me/feed.xml
bhavukjain.com
https://bhavukjain.com/feed.xml
https://bhavukjain.com/rss.xml
bitsofco.de
https://bitsofco.de/feed/feed.xml
https://bitsofco.de/rss
bitspook.in
https://bitspook.in/archive/feed.xml
https://bitspook.in/blog/feed.xml
blog.bofh.it
https://blog.bofh.it/?format=atom
https://blog.bofh.it/debian/?format=atom
blog.computationalcomplexity.org
https://blog.computationalcomplexity.org/feeds/posts/default
https://blog.computationalcomplexity.org/feeds/posts/default?alt=rss
blog.ctis.me
https://blog.ctis.me/blog/index.xml
https://blog.ctis.me/index.xml
blog.ncase.me
https://blog.ncase.me/feed.xml
https://blog.ncase.me/rss/
blogs.agu.org
https://blogs.agu.org/fromaglaciersperspective/
https://blogs.agu.org/geoedtrek/feed/
https://blogs.agu.org/georneys/
https://blogs.agu.org/landslideblog/feed/
https://blogs.agu.org/mountainbeltway/feed/
https://blogs.agu.org/wildwildscience/feed/
blogs.bl.uk
https://blogs.bl.uk/digitisedmanuscripts/atom.xml
https://blogs.bl.uk/european/atom.xml
blogs.gnome.org
https://blogs.gnome.org/alexl/feed/
https://blogs.gnome.org/hughsie/feed/
https://blogs.gnome.org/tbernard/feed/
https://blogs.gnome.org/wjjt/feed/
boffosocko.com
https://boffosocko.com/category/mathematics/feed/
https://boffosocko.com/feed/?cat=-484
https://boffosocko.com/kind/article/feed/
brandur.org
https://brandur.org/articles.atom
https://brandur.org/nanoglyphs.atom
buttondown.email
https://buttondown.email/alby/rss
https://buttondown.email/linotypebook/rss
https://buttondown.email/meticulous.serendipity/rss
https://buttondown.email/touchgrass/rss
chrisx.xyz
https://chrisx.xyz/blog/index.xml
https://chrisx.xyz/index.xml
cohost.org
https://cohost.org/Cadey-Bunny/rss/public.atom
https://cohost.org/tomforsyth/rss/public.atom
colinwalker.blog
https://colinwalker.blog/dailyfeed.xml
https://colinwalker.blog/livefeed.xml
cweiske.de
https://cweiske.de/feeds/news.xml
https://cweiske.de/tagebuch/feed/
danq.me
https://danq.me/feed/
https://danq.me/kind/article/feed/
daveon.design
https://daveon.design/atom.xml
https://daveon.design/rss.xml
declanbyrd.co.uk
https://declanbyrd.co.uk/journal/articles.xml
https://declanbyrd.co.uk/journal/feed.xml
embedded.fm
https://embedded.fm/blog?format=RSS
https://embedded.fm/blog?format=rss
entangledlogs.com
https://entangledlogs.com/archives/index.xml
https://entangledlogs.com/index.xml
gilmi.me
https://gilmi.me/blog/atom.xml
https://gilmi.me/blog/rss
granary.io
https://granary.io/url?input=html&output=rss&url=https://jamesg.blog
https://granary.io/url?url=https://werd.io/content/all/&input=html&output=atom&hub=https://bridgy-fed.superfeedr.com/
hacdias.com
https://hacdias.com/articles/feed.xml
https://hacdias.com/feed.xml
honzadvorsky.com
https://honzadvorsky.com/feed
https://honzadvorsky.com/feed.xml
hyperborea.org
https://hyperborea.org/feed.xml
https://hyperborea.org/journal/feed/
jasm1nii.xyz
https://jasm1nii.xyz/blog/articles/articles.xml
https://jasm1nii.xyz/blog/notes/notes.xml
jmtd.net
https://jmtd.net/log/feed
https://jmtd.net/log/index.atom
junkcharts.typepad.com
https://junkcharts.typepad.com/junk_charts/atom.xml
https://junkcharts.typepad.com/numbersruleyourworld/atom.xml
kwon.nyc
https://kwon.nyc/index.xml
https://kwon.nyc/notes/index.xml
lukesmith.xyz
https://lukesmith.xyz/index.xml
https://lukesmith.xyz/rss.xml
mathspp.com
https://mathspp.com/blog.atom
https://mathspp.com/blog.rss
mattbruenig.com
https://mattbruenig.com/comments/feed/
https://mattbruenig.com/feed/
matthewrocklin.com
https://matthewrocklin.com/atom.xml
https://matthewrocklin.com/blog/atom.xml
mccormick.cx
https://mccormick.cx/news/index.rss
https://mccormick.cx/news/tags/workblog.rss
nadh.in
https://nadh.in/blog/index.xml
https://nadh.in/index.xml
podviaznikov.com
https://podviaznikov.com/feed.xml
https://podviaznikov.com/writings/feed.xml
pxtl.ca
https://pxtl.ca/feed.xml
https://pxtl.ca/rss.xml
quentin.delcourt.be
https://quentin.delcourt.be/feeds/blog.xml
https://quentin.delcourt.be/feeds/links.xml
research.swtch.com
http://research.swtch.com/feed.atom
https://research.swtch.com/feed.atom
sobolevn.me
https://sobolevn.me/feed.xml
https://sobolevn.me/git-secret/feed.xml
stefan-marr.de
https://stefan-marr.de/feed/
https://stefan-marr.de/feed/index.xml
susam.net
https://susam.net/blog/feed.xml
https://susam.net/maze/feed.xml
theadhocracy.co.uk
https://theadhocracy.co.uk/rss-articles.xml
https://theadhocracy.co.uk/rss.xml
theprivacydad.com
https://theprivacydad.com/feed/
https://theprivacydad.com/feed/?type=rss
thoughts.hnr.fyi
https://thoughts.hnr.fyi/feed.xml
https://thoughts.hnr.fyi/index.xml
tomaugspurger.net
https://tomaugspurger.net/feeds/all.atom.xml
https://tomaugspurger.net/index.xml
tomgamon.com
https://tomgamon.com/atom.xml
https://tomgamon.com/index.xml
tomstu.art
https://tomstu.art/articles.atom
https://tomstu.art/feed.atom
tratt.net
https://tratt.net/laurie/blog/blog.rss
https://tratt.net/laurie/news.rss
utf9k.net
https://utf9k.net/blog/rss.xml
https://utf9k.net/rss.xml
veerle.duoh.com
https://veerle.duoh.com/design-feed.rss
https://veerle.duoh.com/inspiration-feed.rss
veronique.ink
https://veronique.ink/feed/
https://veronique.ink/feed/?type=rss
werd.io
https://granary.io/url?url=https://werd.io/content/all/&input=html&output=atom&hub=https://bridgy-fed.superfeedr.com/
https://werd.io/?_t=rss
https://werd.io/content/posts?_t=rss
wizardzines.com
https://wizardzines.com/comics/index.xml
https://wizardzines.com/index.xml
write.as
https://write.as/davepolaschek/feed/
https://write.as/matt/feed/
https://write.as/zampano/feed/
www.allendowney.com
https://www.allendowney.com/blog/feed/
https://www.allendowney.com/wp/feed/
www.briandeconinck.com
https://www.briandeconinck.com/feed/
https://www.briandeconinck.com/notes/feed/
www.ekran.org
https://www.ekran.org/ben/wp/feed/
https://www.ekran.org/ben/wp/feed/?_=8177
www.ellyloel.com
https://www.ellyloel.com/articles.atom
https://www.ellyloel.com/feed.atom
www.gyford.com
https://www.gyford.com/phil/feeds/everything/rss/
https://www.gyford.com/phil/writing/feeds/posts/rss/
www.jeremycherfas.net
https://www.jeremycherfas.net/blog.atom
https://www.jeremycherfas.net/blog.rss
www.math.columbia.edu
https://www.math.columbia.edu/~dejong/wordpress/?feed=rss2
https://www.math.columbia.edu/~woit/wordpress/?feed=rss2
www.preposterousuniverse.com
https://www.preposterousuniverse.com/blog/feed/
https://www.preposterousuniverse.com/feed/
www.siennaeggler.com
https://www.siennaeggler.com/author/sienna/rss/
https://www.siennaeggler.com/rss/
www.sierz.co.uk
https://www.sierz.co.uk/blog/feed/
https://www.sierz.co.uk/reviews/feed/
www.somebits.com
https://www.somebits.com/linkblog/index.atom
https://www.somebits.com/weblog/index.atom
www.tekovic.com
https://www.tekovic.com/atom.xml
https://www.tekovic.com/rss.xml
xdg.me
https://xdg.me/index.xml
https://xdg.me/writing/index.xml
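The same grouping can be sketched in Python for anyone who wants to script the cleanup. This is a rough equivalent of the shell pipeline above, not an exact one: the function name `feeds_by_host` is hypothetical, and it groups by the parsed hostname rather than by substring match, so a URL that only mentions another domain in its query string (as the granary.io entry does for werd.io) is not listed under that domain.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def feeds_by_host(lines):
    """Return {host: sorted feed URLs} for hosts listed more than once.

    `lines` is any iterable of feed URLs, one per line, such as an open
    smallweb.txt file. Blank lines and lines without a host are skipped.
    """
    groups = defaultdict(list)
    for line in lines:
        url = line.strip()
        if not url:
            continue
        host = urlsplit(url).hostname
        if host:
            groups[host].append(url)
    return {
        host: sorted(urls)
        for host, urls in sorted(groups.items())
        if len(urls) > 1
    }
```

Passing the open file object for smallweb.txt to `feeds_by_host` would give one entry per repeated host, which can then be reviewed by hand to decide which feed is the more complete one.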
If we want to push for a smaller web not under US corporate control, shouldn’t this code be hosted on a libre platform, ideally with federated patches? Y’all link sources like https://smallweb.page/why
Modern Gatekeepers
Today, most of the time spent on the web is either on a small number of very dominant platforms like Facebook and LinkedIn, or mediated through them.
Why stop at blogs? Microsoft GitHub is trying to monopolize the coding space as much as Facebook, et al. are trying to do with web content. Note that users can’t participate without an account, the platform’s code search is disabled unless you are authenticated, there are upsell advertisements like the one for Microsoft GitHub Copilot in the UI, etc. DVCSs (distributed version control systems) like Git don’t need to be kept behind a platform any more than the small web does. Why not apply the same rules/goals to this code forge (and then follow it with the chatroom)?
Browsing The Small Web, I was served this: https://joelx.com/dead-children/18136/
Not sure how that meets anywhere near the quality criteria for the site. It's just grotesque.
There are some feeds with a very high volume of posts currently in the list. Should there be a limit for this in the criteria for inclusion?
Here are the ones with recent items that have an average of >5 posts per day.
Average posts per day | Average posts per week | Feed URL | Newest post date | Oldest post date | Number of posts |
---|---|---|---|---|---|
15.0 | 105.0 | https://soft-wa.re/podcast.rss | 2023-10-19T01:35:39+00:00Z | 2023-10-19T01:35:38+00:00Z | 15 |
12.5 | 87.5 | https://maique.eu/feed.xml | 2023-10-19T16:09:29+00:00Z | 2023-10-17T14:12:04+00:00Z | 25 |
10.0 | 70.0 | https://bottledaux.com/feed/ | 2023-10-18T22:22:26+00:00Z | 2023-10-17T02:56:13+00:00Z | 10 |
10.0 | 70.0 | https://feeds.feedburner.com/thisisnthappiness | 2023-10-19T16:02:47+00:00Z | 2023-10-16T19:08:11+00:00Z | 20 |
10.0 | 70.0 | https://jabberwocking.com/feed/ | 2023-10-19T17:11:30+00:00Z | 2023-10-18T16:42:35+00:00Z | 10 |
10.0 | 70.0 | https://jacklimpert.com/feed/ | 2023-10-19T15:30:16+00:00Z | 2023-10-18T20:27:47+00:00Z | 10 |
10.0 | 70.0 | https://johnmenadue.com/feed/ | 2023-10-18T17:58:41+00:00Z | 2023-10-17T17:57:46+00:00Z | 10 |
10.0 | 70.0 | https://putah-creek.tumblr.com/rss | 2023-10-19T12:54:35+00:00Z | 2023-10-17T04:06:24+00:00Z | 20 |
10.0 | 70.0 | https://www.fullmoonfiberart.com/feed/ | 2023-10-19T15:34:33+00:00Z | 2023-10-17T21:53:52+00:00Z | 10 |
10.0 | 70.0 | https://www.lesswrong.com/feed.xml | 2023-10-19T16:48:00+00:00Z | 2023-10-18T20:39:34+00:00Z | 10 |
10.0 | 70.0 | https://www.thepassivevoice.com/feed/ | 2023-10-19T17:03:29+00:00Z | 2023-10-18T04:02:56+00:00Z | 10 |
9.1 | 63.6 | https://www.jvt.me/feed.xml | 2023-10-19T13:10:04+00:00Z | 2023-10-07T15:23:00+00:00Z | 100 |
8.6 | 60.0 | https://feeds.kottke.org/main | 2023-10-19T17:12:18+00:00Z | 2023-10-11T19:29:48+00:00Z | 60 |
7.5 | 52.5 | https://feeds.feedblitz.com/marginalrevolution&x=1 | 2023-10-19T15:32:23+00:00Z | 2023-10-16T18:03:52+00:00Z | 15 |
6.3 | 43.8 | https://itstartswithabirthstone.blogspot.com/feeds/posts/default | 2023-10-19T16:42:00+00:00Z | 2023-10-15T13:44:00+00:00Z | 25 |
6.3 | 43.8 | https://samim.io/rss.xml | 2023-10-17T21:24:58+00:00Z | 2023-10-13T19:31:24+00:00Z | 25 |
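The issue does not say how the averages were computed. The figures in the table look consistent with dividing the number of posts by the whole days elapsed between the oldest and newest post, with a floor of one day, but that is an inference from the numbers rather than the author's stated method. A minimal sketch under that assumption, with hypothetical function names:

```python
from datetime import datetime


def parse_timestamp(value):
    """Parse timestamps written like '2023-10-19T16:09:29+00:00Z'.

    The trailing 'Z' repeats the '+00:00' offset, so it is dropped
    before the string is handed to datetime.fromisoformat.
    """
    if value.endswith("Z"):
        value = value[:-1]
    return datetime.fromisoformat(value)


def posts_per_day(newest, oldest, count):
    """Average posts per day over whole elapsed days (assumed method).

    Spans shorter than one day are treated as one day so that a burst
    of posts does not divide by zero.
    """
    elapsed_days = (parse_timestamp(newest) - parse_timestamp(oldest)).days
    return count / max(elapsed_days, 1)
```

Multiplying the result by 7 gives the weekly figure. A feed that exposes only its latest handful of items will look busier than it is over a longer period, which is worth keeping in mind before using such a number as an inclusion criterion.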
Hi,
I spent some time creating a small web dataset command-line tool to start analyzing the index and to help others create a dataset from it.
The first task was to analyze the feeds to check whether they are all English sites, as per the guidelines. After initial processing, it determined that some feeds are not English.
To identify the language of a feed, it proceeds as follows:
langdetect on the title and description of the feed
langdetect on the title and content of the articles of each feed
The result of this processing is attached to this issue. Feeds tagged with a specific language are most likely in that language and could be candidates for removal. An empty language means there was not enough data to detect it (concat(title + description) < 64 characters) for most of the feed's articles, so the current heuristic can't determine the appropriate language for the feed.
Attached: small_wed_non_en_feeds.csv
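The heuristic described above might look roughly like the sketch below. The 64-character threshold comes from the issue; everything else is an assumption for illustration, including the function names, the `(title, content)` article shape, and the majority-vote rule used to combine results. The detector is passed in as a callable so that `langdetect.detect` (or any other detector) can be supplied by the caller.

```python
from collections import Counter

MIN_CHARS = 64  # threshold quoted in the issue


def detect_text(title, body, detect):
    """Return a language code, or "" when there is too little text."""
    if len(title) + len(body) < MIN_CHARS:
        return ""
    return detect(f"{title} {body}")


def feed_language(feed_title, feed_description, articles, detect):
    """Guess a feed's language by majority vote (an assumed rule).

    `articles` is a list of (title, content) pairs and `detect` is any
    callable mapping text to a language code, e.g. langdetect.detect.
    An empty string means most inputs were too short to classify.
    """
    votes = Counter()
    votes[detect_text(feed_title, feed_description, detect)] += 1
    for title, content in articles:
        votes[detect_text(title, content, detect)] += 1
    language, _ = votes.most_common(1)[0]
    return language
```

A feed whose result is neither `"en"` nor empty would then be a candidate for manual review rather than automatic removal, since short or mixed-language posts can mislead a statistical detector.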
Hello! I added the Kagi Small Web RSS data feed (https://kagi.com/api/v1/smallweb/feed) to my reader and noticed that there are a few non-English sites (e.g. I have a lot of Chinese posts in my reader right now). I wanted to create a PR to remove those sites, but it seems the site in question (hktv1.vip in this case) isn't present in the smallweb list in this repo.
Hi,
Great initiative here! Keen to use the rss feed.
Getting this error on Inoreader:
A feed could not be found at https://kagi.com/api/v1/smallweb/feed/. This does not appear to be a valid RSS or Atom feed
It worked for a day or two after launch but has been broken since.
I tried both with and without the trailing slash in the url.
Any ideas?
Much thanks!
Wiby is a boutique search engine specializing in "older style pages, lightweight and based on a subject of interest," websites "reminiscent of the early internet." Pages are manually submitted by anonymous users. Its mission seems very in line with Kagi Small Web. Would a partnership to integrate Wiby somehow be in order?
There's a Wiby.me and a Wiby.org, which are probably identical.
I am greatly enjoying this, thank you for making it! It reminds me of StumbleUpon in the best of ways.
With SU, the URL changed for each site you visited, so you could press the back button to see something from the session again. With Kagi it doesn't, so I wonder if something could be added to the top bar to help you navigate back to sites you've already read in this session, e.g. a "Prev Post" button or a dropdown next to the URL listing all of the URLs you've opened in this session.
There is this feed: https://franklin.dyer.me/rss/en
Some posts in it have a time in <published> but no date. In such cases the Kagi parser seems to use today's date, which makes those posts appear in the feed every day.
<entry>
<title type="html">The Summer Language Institute</title>
<link href="https://franklin.dyer.me/post/121">
<published>T00:00:00-05:00</published>
</entry>
<entry>
<title type="html">Other blogs</title>
<link href="https://franklin.dyer.me/post/216">
<published>T00:00:00-05:00</published>
</entry>
<entry>
<title type="html">Contact me</title>
<link href="https://franklin.dyer.me/post/2">
<published>T00:00:00-05:00</published>
</entry>
<entry>
<title type="html">CV</title>
<link href="https://franklin.dyer.me/post/1">
<published>T00:00:00-05:00</published>
</entry>
<entry>
<title type="html">None</title>
<link href="https://franklin.dyer.me/post/None">
<published>T00:00:00-05:00</published>
</entry>
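How Kagi's parser actually handles these values is not known from this report. As a sketch of one stricter alternative, a parser could refuse to guess a date and return nothing when `<published>` lacks one, so the entry can be skipped instead of being stamped with today's date. The function name `parse_published` is hypothetical.

```python
from datetime import datetime


def parse_published(value):
    """Return a datetime for a full ISO 8601 timestamp, else None.

    A value such as 'T00:00:00-05:00' has a time and an offset but no
    date, so it is rejected rather than completed with today's date.
    """
    try:
        return datetime.fromisoformat(value.strip())
    except ValueError:
        return None
```

An entry for which this returns `None` could be dropped or sorted to the end of the feed, which would stop it from resurfacing every day.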
Cloudflare is a billion-dollar company and controls all traffic on a significant fraction of websites globally. Regardless of how interesting the Cloudflare Blog is, there is no possibility of it being considered part of the "Small Web". It should be removed from Kagi's Small Web list.