smallweb's People

Contributors

adactio, agentcooper, arthurmelton, dependabot[bot], dkarter, dlnorman, fgiasson, funvill, highb, hoffination, lsomm, marccoquand, matmat, misterbiggs, mms-zen, mozedz, mtlynch, olegwock, prmadev, rkingett, rorylawless, rosnovsky, samwho, sblaplace, steelcutoatmeal, the80srobot, tuinslak, vprelovac, waozi-dev, warpspin

smallweb's Issues

Guidance about microblogging and social platforms

I'd like to ask for guidance about the eligibility of microblogs and other content feeds that are close to social networks.

Specifically, most Fediverse software (Mastodon, Pleroma, Lemmy, etc.) publishes syndication feeds. Some of the content is explicitly defined as (micro)blogging (e.g. Mastodon), while others (like Lemmy) aren't, and it's safe to say that all content can also be considered as social media (as in Facebook) postings thanks to the connections provided by the ActivityPub protocol.

I would definitely see value in indexing microblogs and other types of content (e.g. Lemmy posts can be useful resources), especially since discoverability on the Fediverse is notoriously bad. On the other hand, I expect (but wouldn't necessarily agree with) social media to be excluded.

Please clarify your stance on this! Thank you!

Stripping out MathML from the RSS feed

It looks like you're stripping MathML out of the RSS feed -- is that intentional, or are you using an older sanitizer that doesn't recognize it? For example, my RSS feed for my most recent post has:

<math display="block">
<msup><mi>e</mi>
      <mrow><mi>k</mi><mi>t</mi></mrow></msup>
</math>

Which shows up in your RSS feed as:

e
      kt
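
The symptom above (tags gone, text kept) is what an allowlist-based sanitizer produces when MathML elements are missing from its allowlist. How Kagi's pipeline actually sanitizes feeds is not known from this issue; the following is only a minimal Python sketch of that hypothesis, and the names `MATHML_TAGS`, `SAFE_ATTRS`, `AllowlistSanitizer`, and `sanitize` are all hypothetical.

```python
import html
from html.parser import HTMLParser

# Hypothetical allowlists; a real one would cover far more of MathML.
MATHML_TAGS = {"math", "mrow", "mi", "mn", "mo", "msup", "msub", "mfrac"}
SAFE_ATTRS = {"display"}


class AllowlistSanitizer(HTMLParser):
    """Keeps allowlisted tags, drops all other tags, always keeps text."""

    def __init__(self, allowed_tags):
        super().__init__(convert_charrefs=True)
        self.allowed_tags = set(allowed_tags)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed_tags:
            # Re-emit only attributes that are explicitly considered safe.
            kept = "".join(
                f' {name}="{html.escape(value or "", quote=True)}"'
                for name, value in attrs
                if name in SAFE_ATTRS
            )
            self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in self.allowed_tags:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text survives even when its enclosing tag is dropped.
        self.parts.append(html.escape(data, quote=False))


def sanitize(markup, allowed_tags):
    parser = AllowlistSanitizer(allowed_tags)
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)
```

With an allowlist that lacks MathML, `sanitize` returns only the text nodes of the snippet; passing `MATHML_TAGS` preserves the markup.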

Search & Index Nostr - the small (organic) web social media

user story

As a Kagi user who appreciates the small web, I would like Kagi to index nostr notes and nostr profiles, so that I can discover small web nostr notes and profiles in my Kagi search results.

acceptance criteria

  1. there is a method to search for only, or first/primarily nostr notes (e.g. !nostr insert search query here) on Kagi
  2. nostr notes are added to the small web Kagi list of organic results

context & discussion

Nostr is an organic, public-domain information-sharing protocol (notes and other stuff transmitted by relays).

The first popular nostr application is social media, although there are many "other stuff" developments happening, which should make things increasingly interesting.

Unlike Facebook, Twitter, Instagram, Wartscaster there is no company behind the protocol, there are no pre-mined tokens, and there are no advertisements built in. As a beachhead, devs building on nostr aim to fix the rage-bait gulag tech advertising problematic model of the status quo social media.

Current nostr-only search engines include nostr.band and nos.today. Clients like Damus, Iris, and Primal have some basic level of search built in.

https://njump.me/ is a nostr gateway that allows browser users to see a preview of notes, profiles, and other events.

Let me know if you have any questions on nostr - happy to help. I am the product contributor for Damus, and help with a few other nostr things. You can tag me here or on nostr https://iris.to/npub1zafcms4xya5ap9zr7xxr0jlrtrattwlesytn2s42030lzu0dwlzqpd26k5.

guidelines for where to insert site?

Do you want people to insert their sites so the list is kept alphabetical? If so, you probably should add that to the guidelines.

One plus would be that it should help avoid merge conflicts from many people adding to the top or bottom.

Limit number of posts per source in RSS feed

I think it would be a good idea to limit the number of posts per website in the RSS feed. For example, there are currently 9 posts from the same website (samim.io) at the top of the RSS feed, and this has been going on for a few days. The website and posts are fine, but there are so many of them at the top of the feed that other posts get lost.
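
One possible shape for such a limit, as a rough Python sketch: walk the feed in order and keep at most a fixed number of posts per host. The function name `cap_per_source` and the default of 3 are hypothetical, and this is not a description of how the Kagi feed is built.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cap_per_source(post_urls, max_per_source=3):
    """Keep at most max_per_source posts per host, preserving feed order."""
    seen = defaultdict(int)
    kept = []
    for url in post_urls:
        host = urlsplit(url).hostname or ""
        if seen[host] < max_per_source:
            seen[host] += 1
            kept.append(url)
    return kept
```

Because the input order is preserved, the newest posts from a prolific source still appear; only the overflow is dropped.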

De-duplicate feed

Would anyone care to deduplicate the list?

Here is what I got with:

awk -F/ '{print $3}' smallweb.txt | sort | uniq -d | while read domain; do echo "$domain"; grep "$domain" smallweb.txt; echo ""; done

Where there are duplicate feeds, we should keep just the one that is more complete.

antonz.org
https://antonz.org/feed.xml
https://antonz.org/index.xml

ariadne.space
https://ariadne.space/atom.xml
https://ariadne.space/index.xml

bengtan.com
https://bengtan.com/index.xml
https://bengtan.com/interesting-things/index.xml

benjamincongdon.me
https://benjamincongdon.me/blog/feed.xml
https://benjamincongdon.me/feed.xml

bhavukjain.com
https://bhavukjain.com/feed.xml
https://bhavukjain.com/rss.xml

bitsofco.de
https://bitsofco.de/feed/feed.xml
https://bitsofco.de/rss

bitspook.in
https://bitspook.in/archive/feed.xml
https://bitspook.in/blog/feed.xml

blog.bofh.it
https://blog.bofh.it/?format=atom
https://blog.bofh.it/debian/?format=atom

blog.computationalcomplexity.org
https://blog.computationalcomplexity.org/feeds/posts/default
https://blog.computationalcomplexity.org/feeds/posts/default?alt=rss

blog.ctis.me
https://blog.ctis.me/blog/index.xml
https://blog.ctis.me/index.xml

blog.ncase.me
https://blog.ncase.me/feed.xml
https://blog.ncase.me/rss/

blogs.agu.org
https://blogs.agu.org/fromaglaciersperspective/
https://blogs.agu.org/geoedtrek/feed/
https://blogs.agu.org/georneys/
https://blogs.agu.org/landslideblog/feed/
https://blogs.agu.org/mountainbeltway/feed/
https://blogs.agu.org/wildwildscience/feed/

blogs.bl.uk
https://blogs.bl.uk/digitisedmanuscripts/atom.xml
https://blogs.bl.uk/european/atom.xml

blogs.gnome.org
https://blogs.gnome.org/alexl/feed/
https://blogs.gnome.org/hughsie/feed/
https://blogs.gnome.org/tbernard/feed/
https://blogs.gnome.org/wjjt/feed/

boffosocko.com
https://boffosocko.com/category/mathematics/feed/
https://boffosocko.com/feed/?cat=-484
https://boffosocko.com/kind/article/feed/

brandur.org
https://brandur.org/articles.atom
https://brandur.org/nanoglyphs.atom

buttondown.email
https://buttondown.email/alby/rss
https://buttondown.email/linotypebook/rss
https://buttondown.email/meticulous.serendipity/rss
https://buttondown.email/touchgrass/rss

chrisx.xyz
https://chrisx.xyz/blog/index.xml
https://chrisx.xyz/index.xml

cohost.org
https://cohost.org/Cadey-Bunny/rss/public.atom
https://cohost.org/tomforsyth/rss/public.atom

colinwalker.blog
https://colinwalker.blog/dailyfeed.xml
https://colinwalker.blog/livefeed.xml

cweiske.de
https://cweiske.de/feeds/news.xml
https://cweiske.de/tagebuch/feed/

danq.me
https://danq.me/feed/
https://danq.me/kind/article/feed/

daveon.design
https://daveon.design/atom.xml
https://daveon.design/rss.xml

declanbyrd.co.uk
https://declanbyrd.co.uk/journal/articles.xml
https://declanbyrd.co.uk/journal/feed.xml

embedded.fm
https://embedded.fm/blog?format=RSS
https://embedded.fm/blog?format=rss

entangledlogs.com
https://entangledlogs.com/archives/index.xml
https://entangledlogs.com/index.xml

gilmi.me
https://gilmi.me/blog/atom.xml
https://gilmi.me/blog/rss

granary.io
https://granary.io/url?input=html&output=rss&url=https://jamesg.blog
https://granary.io/url?url=https://werd.io/content/all/&input=html&output=atom&hub=https://bridgy-fed.superfeedr.com/

hacdias.com
https://hacdias.com/articles/feed.xml
https://hacdias.com/feed.xml

honzadvorsky.com
https://honzadvorsky.com/feed
https://honzadvorsky.com/feed.xml

hyperborea.org
https://hyperborea.org/feed.xml
https://hyperborea.org/journal/feed/

jasm1nii.xyz
https://jasm1nii.xyz/blog/articles/articles.xml
https://jasm1nii.xyz/blog/notes/notes.xml

jmtd.net
https://jmtd.net/log/feed
https://jmtd.net/log/index.atom

junkcharts.typepad.com
https://junkcharts.typepad.com/junk_charts/atom.xml
https://junkcharts.typepad.com/numbersruleyourworld/atom.xml

kwon.nyc
https://kwon.nyc/index.xml
https://kwon.nyc/notes/index.xml

lukesmith.xyz
https://lukesmith.xyz/index.xml
https://lukesmith.xyz/rss.xml

mathspp.com
https://mathspp.com/blog.atom
https://mathspp.com/blog.rss

mattbruenig.com
https://mattbruenig.com/comments/feed/
https://mattbruenig.com/feed/

matthewrocklin.com
https://matthewrocklin.com/atom.xml
https://matthewrocklin.com/blog/atom.xml

mccormick.cx
https://mccormick.cx/news/index.rss
https://mccormick.cx/news/tags/workblog.rss

nadh.in
https://nadh.in/blog/index.xml
https://nadh.in/index.xml

podviaznikov.com
https://podviaznikov.com/feed.xml
https://podviaznikov.com/writings/feed.xml

pxtl.ca
https://pxtl.ca/feed.xml
https://pxtl.ca/rss.xml

quentin.delcourt.be
https://quentin.delcourt.be/feeds/blog.xml
https://quentin.delcourt.be/feeds/links.xml

research.swtch.com
http://research.swtch.com/feed.atom
https://research.swtch.com/feed.atom

sobolevn.me
https://sobolevn.me/feed.xml
https://sobolevn.me/git-secret/feed.xml

stefan-marr.de
https://stefan-marr.de/feed/
https://stefan-marr.de/feed/index.xml

susam.net
https://susam.net/blog/feed.xml
https://susam.net/maze/feed.xml

theadhocracy.co.uk
https://theadhocracy.co.uk/rss-articles.xml
https://theadhocracy.co.uk/rss.xml

theprivacydad.com
https://theprivacydad.com/feed/
https://theprivacydad.com/feed/?type=rss

thoughts.hnr.fyi
https://thoughts.hnr.fyi/feed.xml
https://thoughts.hnr.fyi/index.xml

tomaugspurger.net
https://tomaugspurger.net/feeds/all.atom.xml
https://tomaugspurger.net/index.xml

tomgamon.com
https://tomgamon.com/atom.xml
https://tomgamon.com/index.xml

tomstu.art
https://tomstu.art/articles.atom
https://tomstu.art/feed.atom

tratt.net
https://tratt.net/laurie/blog/blog.rss
https://tratt.net/laurie/news.rss

utf9k.net
https://utf9k.net/blog/rss.xml
https://utf9k.net/rss.xml

veerle.duoh.com
https://veerle.duoh.com/design-feed.rss
https://veerle.duoh.com/inspiration-feed.rss

veronique.ink
https://veronique.ink/feed/
https://veronique.ink/feed/?type=rss

werd.io
https://granary.io/url?url=https://werd.io/content/all/&input=html&output=atom&hub=https://bridgy-fed.superfeedr.com/
https://werd.io/?_t=rss
https://werd.io/content/posts?_t=rss

wizardzines.com
https://wizardzines.com/comics/index.xml
https://wizardzines.com/index.xml

write.as
https://write.as/davepolaschek/feed/
https://write.as/matt/feed/
https://write.as/zampano/feed/

www.allendowney.com
https://www.allendowney.com/blog/feed/
https://www.allendowney.com/wp/feed/

www.briandeconinck.com
https://www.briandeconinck.com/feed/
https://www.briandeconinck.com/notes/feed/

www.ekran.org
https://www.ekran.org/ben/wp/feed/
https://www.ekran.org/ben/wp/feed/?_=8177

www.ellyloel.com
https://www.ellyloel.com/articles.atom
https://www.ellyloel.com/feed.atom

www.gyford.com
https://www.gyford.com/phil/feeds/everything/rss/
https://www.gyford.com/phil/writing/feeds/posts/rss/

www.jeremycherfas.net
https://www.jeremycherfas.net/blog.atom
https://www.jeremycherfas.net/blog.rss

www.math.columbia.edu
https://www.math.columbia.edu/~dejong/wordpress/?feed=rss2
https://www.math.columbia.edu/~woit/wordpress/?feed=rss2

www.preposterousuniverse.com
https://www.preposterousuniverse.com/blog/feed/
https://www.preposterousuniverse.com/feed/

www.siennaeggler.com
https://www.siennaeggler.com/author/sienna/rss/
https://www.siennaeggler.com/rss/

www.sierz.co.uk
https://www.sierz.co.uk/blog/feed/
https://www.sierz.co.uk/reviews/feed/

www.somebits.com
https://www.somebits.com/linkblog/index.atom
https://www.somebits.com/weblog/index.atom

www.tekovic.com
https://www.tekovic.com/atom.xml
https://www.tekovic.com/rss.xml

xdg.me
https://xdg.me/index.xml
https://xdg.me/writing/index.xml
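
For anyone who prefers to reproduce the grouping above without the shell pipeline, here is a small Python sketch under the same idea: group feed URLs by host and report hosts that have more than one distinct URL. The function name `group_by_host` is hypothetical. Unlike the `grep` step above, it matches on the parsed host only, so a URL that merely contains another domain in its query string (such as the granary.io entry listed under werd.io) is not attributed to that domain.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(feed_urls):
    """Return {host: sorted URLs} for hosts with more than one distinct URL."""
    groups = defaultdict(set)
    for line in feed_urls:
        url = line.strip()
        host = urlsplit(url).hostname
        if host:
            groups[host].add(url)
    return {
        host: sorted(urls)
        for host, urls in sorted(groups.items())
        if len(urls) > 1
    }
```

Usage would be something like `group_by_host(open("smallweb.txt"))`. Deciding which of the grouped URLs is "more complete" still needs a human or a further check, since many groups (write.as, buttondown.email) are different authors on a shared host.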

Consider leaving Microsoft GitHub

If we want to push for a smaller web not under US corporate control, shouldn’t this code be hosted on a libre platform, ideally with federated patches? Y’all link sources like https://smallweb.page/why

Modern Gatekeepers

Today, most of the time spent on the web is either on a small number of very dominant platforms like Facebook and LinkedIn, or mediated through them.

Why stop at blogs? Microsoft GitHub is trying to monopolize the coding space as much as Facebook, et al. are trying to do with web content. Note the way users can’t participate without an account, the platform’s Code search is disabled unless you are authenticated, there are upsell advertisements like the one for Microsoft GitHub Copilot in the UI, etc. DVCS (distributed version control systems) like Git don’t need to be kept behind a platform any more than the small web does. Why not apply the same rules/goals to this code forge (and then follow it with the chatroom)?

High volume feeds

There are some feeds with a very high volume of posts currently in the list. Should there be a limit for this in the criteria for inclusion?

Here are the ones with recent items that have an average of >5 posts per day.

| Average posts per day | Average posts per week | Feed URL | Newest post date | Oldest post date | Number of posts |
|---|---|---|---|---|---|
| 15.0 | 105.0 | https://soft-wa.re/podcast.rss | 2023-10-19T01:35:39+00:00Z | 2023-10-19T01:35:38+00:00Z | 15 |
| 12.5 | 87.5 | https://maique.eu/feed.xml | 2023-10-19T16:09:29+00:00Z | 2023-10-17T14:12:04+00:00Z | 25 |
| 10.0 | 70.0 | https://bottledaux.com/feed/ | 2023-10-18T22:22:26+00:00Z | 2023-10-17T02:56:13+00:00Z | 10 |
| 10.0 | 70.0 | https://feeds.feedburner.com/thisisnthappiness | 2023-10-19T16:02:47+00:00Z | 2023-10-16T19:08:11+00:00Z | 20 |
| 10.0 | 70.0 | https://jabberwocking.com/feed/ | 2023-10-19T17:11:30+00:00Z | 2023-10-18T16:42:35+00:00Z | 10 |
| 10.0 | 70.0 | https://jacklimpert.com/feed/ | 2023-10-19T15:30:16+00:00Z | 2023-10-18T20:27:47+00:00Z | 10 |
| 10.0 | 70.0 | https://johnmenadue.com/feed/ | 2023-10-18T17:58:41+00:00Z | 2023-10-17T17:57:46+00:00Z | 10 |
| 10.0 | 70.0 | https://putah-creek.tumblr.com/rss | 2023-10-19T12:54:35+00:00Z | 2023-10-17T04:06:24+00:00Z | 20 |
| 10.0 | 70.0 | https://www.fullmoonfiberart.com/feed/ | 2023-10-19T15:34:33+00:00Z | 2023-10-17T21:53:52+00:00Z | 10 |
| 10.0 | 70.0 | https://www.lesswrong.com/feed.xml | 2023-10-19T16:48:00+00:00Z | 2023-10-18T20:39:34+00:00Z | 10 |
| 10.0 | 70.0 | https://www.thepassivevoice.com/feed/ | 2023-10-19T17:03:29+00:00Z | 2023-10-18T04:02:56+00:00Z | 10 |
| 9.1 | 63.6 | https://www.jvt.me/feed.xml | 2023-10-19T13:10:04+00:00Z | 2023-10-07T15:23:00+00:00Z | 100 |
| 8.6 | 60.0 | https://feeds.kottke.org/main | 2023-10-19T17:12:18+00:00Z | 2023-10-11T19:29:48+00:00Z | 60 |
| 7.5 | 52.5 | https://feeds.feedblitz.com/marginalrevolution&x=1 | 2023-10-19T15:32:23+00:00Z | 2023-10-16T18:03:52+00:00Z | 15 |
| 6.3 | 43.8 | https://itstartswithabirthstone.blogspot.com/feeds/posts/default | 2023-10-19T16:42:00+00:00Z | 2023-10-15T13:44:00+00:00Z | 25 |
| 6.3 | 43.8 | https://samim.io/rss.xml | 2023-10-17T21:24:58+00:00Z | 2023-10-13T19:31:24+00:00Z | 25 |
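
The issue does not say how the averages were computed. The figures above are consistent with dividing the number of posts by the number of whole days between the oldest and newest post, with a floor of one day; the Python sketch below works under that assumption. The function names are hypothetical, and the parsing step only tolerates the `+00:00Z` suffix that appears in this table.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse a timestamp from the table, dropping the redundant trailing 'Z'."""
    if ts.endswith("+00:00Z"):
        ts = ts[:-1]
    return datetime.fromisoformat(ts)


def posts_per_day(newest, oldest, count):
    """Posts divided by whole days spanned (assumed floor of one day)."""
    span_days = (parse_ts(newest) - parse_ts(oldest)).days
    return count / max(span_days, 1)


def posts_per_week(newest, oldest, count):
    return 7 * posts_per_day(newest, oldest, count)
```

Note that the one-day floor makes feeds that expose only a short window of items (for example 15 posts one second apart) look very active, so an inclusion threshold based on this number would probably want a minimum observation window as well.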

Possible candidates for removal from the index

Hi,

I spent some time creating a small web dataset command-line tool to start analyzing the index and to help others create a dataset from it.

The first task was to analyze the feeds to check whether they are all English sites, as per the guidelines. After an initial pass, the tool determined that some feeds are not in English.

To identify the language of a feed, it proceeds as follows:

  1. use langdetect on the title and description of the feed
  2. use langdetect on the title and content of the articles of each feed
  3. reassign a different language to the feed if the majority of its articles are in another language

The result of this processing is attached to this issue. Feeds tagged with a specific language are most likely in that language and could be candidates for removal. Feeds with an empty language are those where there is not enough data to detect the language (concat(title + description) < 64 characters) for most of their articles, so the current heuristic can't determine the appropriate language for the feed.
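
The three steps and the 64-character rule can be sketched in Python roughly as follows. This is not the tool's actual code: the dictionary layout of `feed`, the function names, and the choice to apply the length threshold to the feed-level text as well are assumptions. The detector is passed in as a parameter so that `langdetect.detect` (or any function mapping text to a language code) can be plugged in.

```python
from collections import Counter

MIN_CHARS = 64  # threshold quoted in the issue


def detect_or_none(text, detect):
    """Return a language code, or None when the text is too short or undetectable."""
    text = text.strip()
    if len(text) < MIN_CHARS:
        return None
    try:
        return detect(text)
    except Exception:  # detectors may raise on text with no usable features
        return None


def feed_language(feed, detect):
    # Step 1: language of the feed's own title and description.
    feed_lang = detect_or_none(f"{feed['title']} {feed['description']}", detect)
    # Step 2: language of each article's title and content.
    article_langs = [
        detect_or_none(f"{a['title']} {a['content']}", detect)
        for a in feed["articles"]
    ]
    if not article_langs:
        return feed_lang
    # Step 3: a strict majority among the articles overrides the feed-level
    # guess. If most articles are undetectable, the majority is None, which
    # corresponds to the "empty language" rows in the attachment.
    top, votes = Counter(article_langs).most_common(1)[0]
    if votes > len(article_langs) / 2:
        return top
    return feed_lang
```

With the real library this would be called as `feed_language(feed, langdetect.detect)`.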

Attached: small_wed_non_en_feeds.csv

What data source is used by RSS feed?

Hello! I added the Kagi Small Web RSS feed (https://kagi.com/api/v1/smallweb/feed) to my reader and noticed that there are a few non-English sites (e.g. I have a lot of Chinese posts in my reader right now). I wanted to create a PR to remove those sites, but it seems the site in question (hktv1.vip in this case) isn't present in the smallweb list in this repo.

Add support for Wiby, the search engine for Classic Websites

Wiby is a boutique search engine specializing in "older style pages, lightweight and based on a subject of interest," websites "reminiscent of the early internet." Pages are manually submitted by anonymous users. Its mission seems very in line with Kagi Small Web. Would a partnership to integrate Wiby somehow be in order?

There's a Wiby.me and a Wiby.org, which are probably identical.

Previous post navigation

I am greatly enjoying this, thank you for making it! It reminds me of StumbleUpon in the best of ways.

With SU, the URL changed for each site you visited, so you could press the back button if you wanted to see something from the session again. With Kagi it doesn't, so I wonder if something could be added to the top bar to help you navigate back to sites you've read earlier in the session, e.g. a "Prev Post" button or a dropdown next to the URL listing all of the URLs you've opened in this session.

Reject posts without post date

There is this feed: https://franklin.dyer.me/rss/en
Some posts in it have a time in <published> but no date. In such cases the Kagi parser seems to use today's date, which makes those posts appear in the feed every day.

        <entry>
            <title type="html">The Summer Language Institute</title>
            <link href="https://franklin.dyer.me/post/121">
            <published>T00:00:00-05:00</published>
        </entry>
        <entry>
            <title type="html">Other blogs</title>
            <link href="https://franklin.dyer.me/post/216">
            <published>T00:00:00-05:00</published>
        </entry>
        <entry>
            <title type="html">Contact me</title>
            <link href="https://franklin.dyer.me/post/2">
            <published>T00:00:00-05:00</published>
        </entry>
        <entry>
            <title type="html">CV</title>
            <link href="https://franklin.dyer.me/post/1">
            <published>T00:00:00-05:00</published>
        </entry>
        <entry>
            <title type="html">None</title>
            <link href="https://franklin.dyer.me/post/None">
            <published>T00:00:00-05:00</published>
        </entry>
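
A strict check along the lines requested here could look like the Python sketch below: accept a `<published>` value only if it is a full date-time with an offset, and otherwise return `None` so the caller can skip the entry instead of substituting today's date. The function name `parse_published` is hypothetical, and nothing here reflects how Kagi's parser is actually written.

```python
import re
from datetime import datetime

# Full date, 'T', full time, optional fraction, then 'Z' or a numeric offset.
_FULL_TIMESTAMP = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$"
)


def parse_published(value):
    """Return an aware datetime, or None if the value lacks a complete date."""
    value = value.strip()
    if not _FULL_TIMESTAMP.match(value):
        return None
    if value.endswith("Z"):
        # Older Python versions do not accept 'Z' in fromisoformat.
        value = value[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        return None
```

The values in the sample above (`T00:00:00-05:00`) fail the pattern and yield `None`, so those entries would be rejected rather than re-dated on every fetch.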

Remove Cloudflare Blog from Small Web list

Cloudflare is a billion-dollar company and controls all traffic on a significant fraction of websites globally. Regardless of how interesting the Cloudflare Blog is, there is no possibility of it being considered part of the "Small Web". It should be removed from Kagi's Small Web list.
