searchmysite.net

This repository contains the complete codebase for https://searchmysite.net/, the independent open source search engine and search as a service for personal and independent websites (see About searchmysite.net for further details).

You can use this repository to:

  • See exactly how searchmysite.net works, e.g. inspect the code for indexing, relevancy tuning, search queries etc.
  • Help improve the searchmysite.net service, e.g. by reporting issues, suggesting improvements, or implementing fixes or enhancements. See Contributing.
  • Set up your own instance to provide a fully open and independent search for another part of the internet.

Directory structure and docker-compose files

The application is split into 5 components, each deployed in its own Docker container:

  • db - Postgres database (for managing site and indexing configuration)
  • indexing - Scrapy web crawler and bulk import scripts (for indexing sites)
  • models - TorchServe container with the Large Language Model (LLM)
  • search - Apache Solr search server (for the actual search index)
  • web - Apache httpd web server with mod_wsgi, serving static assets (including the home page) and dynamic pages (including the API)

The project directory structure is as follows:

.
├── data                    # Data for Docker data volumes (not in git - to be set up locally)
│   ├── solrdata            # Mounted to /var/solr/solrdata in Solr Docker
│   ├── sqldata             # Mounted to /var/lib/postgresql/data in Postgres Docker
├── src                     # Source files
│   ├── db                  # Database scripts
│   │   ├── bulkimport      # Scripts to load sites into the database for the indexer to index
│   ├── indexing            # Indexing code
│   │   ├── bulkimport      # Scripts to load content directly into the search engine
│   │   ├── common          # Indexing code shared between bulk import and spider
│   │   ├── indexer         # Spidering code
│   ├── models              # Language models
│   ├── search              # Search engine configuration
│   ├── web                 # Files for deployment to web / app server
│   │   ├── config          # Web server configuration
│   │   ├── content/dynamic # Dynamic pages and API, for deployment to app server
│   │   ├── content/static  # Static assets, for deployment to static web server
├── tests                   # Test scripts
└── README.md               # This file

There are 3 docker-compose files, which are largely identical except:

  • docker-compose.yml (for local dev) configures the web server and indexing to read from the source, so code changes don't require a rebuild.
  • docker-compose.test.yml (for running test scripts) does not persist the database and search and doesn't run the scheduled indexing, so each test cycle can start with a clean and predictable environment.
  • docker-compose.prod.yml (for production) persists the database and search, and copies in web server and indexing code.

Setting up your development environment

Prerequisites

Ensure Docker is installed.

Get the source code with e.g.

cd ~/projects/
git clone https://github.com/searchmysite/searchmysite.net.git

Create the data directories for the database and search index:

cd ~/projects/searchmysite.net/
mkdir -p data/solrdata
mkdir -p data/sqldata
sudo chown 8983:8983 data/solrdata

Create a ~/projects/searchmysite.net/src/.env file for the docker-compose.yml containing at least the following:

POSTGRES_PASSWORD=<password>
SECRET_KEY=<secretkey>

The POSTGRES_PASSWORD and SECRET_KEY can be any values you choose for local dev. Note that although these are the only values required for the basic application to work, there are other values which will need to be set up for additional functionality - see the "Additional environment variables" section below.

And finally, build the docker images:

cd ~/projects/searchmysite.net/src
docker compose build

Note that the first build could take 20-30 minutes, and the models container downloads a 3GB model file.

Starting your development environment

With the prerequisites in place, you can start your development environment with:

cd ~/projects/searchmysite.net/src
docker compose up -d

The website will be available at http://localhost:8080/, and the Apache Solr admin interface at http://localhost:8983/solr/#/.

Setting up an admin login

If you want to be able to Approve or Reject sites added as a Basic listing, you will need to set up one or more Admin users. Only verified site owners, i.e. ones with a Full listing who are able to log in, can be permissioned as Admin users. You can use the web interface to add your own site as a Full listing via Add Site, or insert details directly into the database.

Once you have one or more verified site owners, you can permission them as Admins in the database, e.g.:

INSERT INTO tblPermissions (domain, role)
  VALUES ('michael-lewis.com', 'admin');

Adding other websites

You can use Add Site to add a site or sites as a Basic listing via the web interface. You will then need to log in as an Admin user, click Review, and select Approve for them to be queued for indexing.

There are also bulk import scripts in src/db/bulkimport. checkdomains.py takes a list of domains or home pages as input, checks that they are valid sites, and that they aren't already in the list or the database, and generates a file for insertdomains.py to insert.

See also the discussion at #91.

Additional environment variables

If you want to use functionality which sends emails (e.g. the Contact form) you will need to set the following values:

SMTP_SERVER=
SMTP_PORT=
SMTP_FROM_EMAIL=
SMTP_FROM_PASSWORD=
SMTP_TO_EMAIL=

If just testing, you can create a web-based email account and use the SMTP details for that.

If you want to enable the payment mechanism for verified submissions, you will need to set:

ENABLE_PAYMENT=True
STRIPE_SECRET_KEY=
STRIPE_PUBLISHABLE_KEY=
STRIPE_PRODUCT_ID=
STRIPE_ENDPOINT_SECRET=

If just testing, you can get a test account from Stripe.

Making changes on local dev

Web changes

The docker-compose.yml for dev configures the web server to read from the source, so changes can be made in the source and reloaded. The web server will typically have to be restarted to view changes:

docker exec -it web_dev apachectl restart

For frequent changes it is better to use a Flask development environment outside of Docker.

To do this, firstly, you will need to set up local host entries for "db", "models" and "search" in /etc/hosts, given the "web" container talks to the db, models and search containers via those hostnames:

127.0.0.1 search
127.0.0.1 db
127.0.0.1 models

Secondly, install Flask and dependencies locally (noting that apache2-dev is required for mod-wsgi and libpq-dev for psycopg2), and install the searchmysite package in editable mode (these steps just need to be performed once):

sudo apt install apache2-dev libpq-dev
cd ~/projects/searchmysite.net/src/web/
pip3 install -r requirements.txt
cd ~/projects/searchmysite.net/src/web/content/dynamic/
pip3 install -e .

Finally, at the start of every dev session, load environment variables and start Flask in development mode via:

set -a; source ~/projects/searchmysite.net/src/.env; set +a
export FLASK_ENV=development
export FLASK_APP=~/projects/searchmysite.net/src/web/content/dynamic/searchmysite
flask run --debug

Your local Flask website will be available at e.g. http://localhost:5000/search/ (note that the home page, i.e. http://localhost:5000/, isn't served dynamically so won't be available via Flask). Changes to the code will be reflected without a server restart, you will see debug log messages, and full stack traces will be more visible in case of errors.

Indexing changes

As with the web container, the indexing container on dev is configured to read directly from the source, so changes just need to be saved.

You would typically trigger a reindex by running SQL like:

UPDATE tblDomains
  SET full_indexing_status = 'PENDING'
  WHERE domain = 'michael-lewis.com';

and waiting for the next run of src/indexing/indexer/run.sh (up to 1 min on dev), or triggering it manually:

docker exec -it src-indexing-1 python /usr/src/app/search_my_site_scheduler.py 

There shouldn't be any issues with multiple schedulers running concurrently if you trigger it manually and the scheduled job then runs.

You can monitor the indexing logs via:

docker logs -f src-indexing-1

and can change the LOG_LEVEL to DEBUG in src/indexing/indexer/settings.py.

Search (Solr) changes

The dev Solr docker container copies in the config on build, so a docker compose build is required for each config change.

Note that the solr-precreate content /opt/solr/server/solr/configsets/content doesn't actually load the new config after a docker compose build, so the following steps are required to apply Solr config changes:

docker compose build
docker compose up -d
docker exec -it search_dev cp -r /opt/solr/server/solr/configsets/content/conf /var/solr/data/content/
docker restart search_dev

Depending on the changes, you may also need to delete some or all data in the index, e.g.

curl http://localhost:8983/solr/content/update?commit=true -H "Content-Type: text/xml" --data-binary '<delete><query>domain:michael-lewis.com</query></delete>'

and trigger reindexing as per above. Use <query>*:*</query> to delete all data in the index.

You can also delete and recreate the data/solrdata directory, then rebuild, for a fresh start.

Database (Postgres) changes

You can connect to the database via:

  "host": "127.0.0.1",
  "user": "postgres",
  "port": 5432,
  "ssl": false,
  "database": "searchmysitedb",
  "password": <password-from-dotenv-file>

Schema changes should be applied to the src/db/sql/init* files, so that if you delete and recreate the data/sqldata directory the latest schema is applied.

Relevancy tuning

For basic experimentation with relevancy tuning, you can manually add a few sites and experiment with those. Remember to ensure there are links between these sites, because indexed_inlink_domains_count is an important factor in the scoring. Remember also that indexed_inlink* values may require sites to be indexed twice to be correctly set - the indexing process sets indexed_inlink* values from the indexed_outlink* values, so a first pass is needed to ensure all sites have indexed_outlink* values set.
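
As a rough check that these values have been populated after the second pass, you could query Solr from Python with pysolr (which the indexing code already uses), e.g.:

import pysolr

# Quick check of indexed_inlink_domains_count for one of the manually added sites
# (the domain field and query syntax are as in the curl example above).
solr = pysolr.Solr("http://localhost:8983/solr/content/")
for doc in solr.search("domain:michael-lewis.com", rows=5):
    print(doc.get("indexed_inlink_domains_count"))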

However, for serious relevancy tuning, it is better to use a restore of the production Solr collection. If you are interested in doing this, let me know and I'll make an up-to-date one available.

Note that if you add new fields to the Solr schema which are to be used in the relevancy scoring, it is better to wait until all the sites have had these fields added before deploying the new relevancy scoring changes. There are two ways of doing this: force a reindex of all sites, or wait until all sites are naturally reindexed. It is easier and safer to wait for the natural reindex. A forced reindex of everything is likely to take over 24 hours, given reindexing happens in batches of 20 and some sites take over 1 hour to reindex, while a natural reindex takes 3.5 days to cover all the verified sites (28 days for unverified sites).

Testing

The tests are run with pytest on a local Flask instance, so you will need to install pytest and set up a local Flask instance as per the "Making changes on local dev" / "Web changes" section above. If you have ENABLE_PAYMENT=True, you will also need to set up Selenium and WebDriver, because the Stripe integration involves buttons which execute JavaScript, e.g.:

pip3 install selenium
pip3 install chromedriver-py

There are two test scripts:

  • clean_test_env.sh - shuts down any dev docker instances, rebuilds and starts the clean test docker instances.
  • run_tests.sh - sets up the environment variables, runs the pytest scripts and the indexing.

The pytest scripts:

  • submit and approve a Basic listing
  • submit a Full listing site, including making a test payment to the Stripe account specified with the STRIPE_* variables if ENABLE_PAYMENT=True
  • search the newly indexed sites
  • remove the test sites

To run:

cd ~/projects/searchmysite.net/tests
./clean_test_env.sh
./run_tests.sh

The indexing step will take a minute or two, given it is indexing real sites, and if ENABLE_PAYMENT=True you'll see a browser pop up which takes a few seconds to open and close.

If the tests succeed, they will leave the environment in the same state it was in at the start, i.e. they clean up after themselves, so you don't need to run clean_test_env.sh again before rerunning run_tests.sh. If the tests fail, however, you will need to rerun clean_test_env.sh. For the same reason, if you accidentally run run_tests.sh against the dev rather than the test environment, e.g. because you didn't run clean_test_env.sh first, the environment will still be fine as long as the tests succeed. It is better to use the test Docker environment though, because it provides a known clean starting point and ensures the scheduled reindexing doesn't interfere with the indexing in the tests.


searchmysite.net's Issues

Infrastructure: Multiple web and indexing servers on production

At the moment production has all 4 components (web, indexing, database and search) on the same server instance. As it grows it is going to need to scale these out. The first step would be to have multiple indexing servers and web servers. I've tested running multiple indexing servers concurrently and it is fine, and I've put a reverse proxy in to do the SSL termination, so that should make it easier to add multiple web/app servers.

However, at the moment, all the servers are started via a docker-compose file, which is the same as dev except that it copies in the web and indexer code. If splitting over multiple servers, it will probably need some container orchestration, like Docker Swarm or Kubernetes. It would be nice to keep the dev setup as close to prod as possible though, ideally without making dev setup prohibitively complicated.

Web: Authenticated API call to reindex

If you could trigger a reindex of your site with an API call, you could trigger a reindex as part of your CI pipeline for publishing, effectively giving you near-real-time search, which would be nice. Would need to put authentication on the API, and maybe even some kind of throttling. As of now, there are only two sites using the API, so not a priority.

Search: Upgrade from Solr 8.7.0 to 8.8.2

Update search/Dockerfile. Do a diff between default 8.7.0 config files and default 8.8.2 config files to see if there are any new config lines added, and add those to the custom config if necessary. Also see if solr-precreate content /opt/solr/server/solr/configsets/content now does actually load the new config, and update docs and script if it does.

Web: OpenAPI Spec generated documentation broken

The OpenAPI spec was at https://searchmysite.net/api/ but this is no longer working, and https://searchmysite.net/api/v1/ gives a "Failed to fetch http://127.0.0.1:8080/api/v1/swagger.json ... The page was loaded over https:// but a http:// URL was specified."

On local dev, http://localhost:8080/api/ gives the same response as prod, but http://localhost:8080/api/v1/ does look like it returns the docs, so maybe it can be fixed with some flask_restx config to correct the url "http://127.0.0.1:8080/api/v1/swagger.json" in the JavaScript it generates.

It was all working until the recent change to have all the dynamic sites served from one web app (via different aliases) and the addition of the reverse proxy in front to do the SSL termination.

General: Open source searchmysite.net

Pre-requisites:

  • A working Minimum Viable Product.
  • Some "battle testing" by real users to try and iron out the worst of the bugs.
  • Ensure there is no sensitive data in the repo, e.g. passwords, personal information.
  • Documentation, including (i) public issues list, (ii) contributor guidelines, and (iii) full instructions on how to setup a development environment.

Steps to action:

  • Make new repo public.
  • Link to source from footer.
  • Publish blog post that searchmysite.net is now open source, to draw attention to it.

Indexing: Automate site expiry

All indexed sites, whether submitted via Quick Add or Verified Add, have an expire_date field. This is initially set to 1 year after the validation_date (if validation_method is IndieAuth or DCV) or 1 year after the site is approved (if validation_method is QuickAdd). The idea is that sites are only indexed for a year unless further action is taken, to help stop too many stale and unmaintained sites building up in the system, which in turn wastes indexing resources and pollutes results.

The issue is that there's no automated code that does anything with the expire_date at the moment.

For sites submitted via Quick Add, the site should go back to the Review phase (i.e. move from tblIndexedDomains to tblPendingDomains), where the moderator(s) either reapprove for another year, or reject to move to the excluded sites.

Need to confirm the exact process for sites submitted via Verified Add. They could simply also be moved back to tblPendingDomains as per Quick Add, preserving owner_submitted, submission_method etc., with the owner having to enter the home page in Verified Add and click the final Verify button again. However, there may need to be some additional activity, e.g. an email reminder a week before.

As for where this code could be implemented, there is some maintenance related code already in src/indexer/search_my_site_scheduler.py, given that it is already run every 2 mins, although it might make more sense to pull all the maintenance related code into a new script (which doesn't need to be run so frequently).

Something needs to be implemented before July 2021 when the first expire_dates arrive.
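
Purely as an illustration (not existing code), the maintenance job for Quick Add sites might start from a query along these lines, using the table and column names mentioned above:

# Illustrative sketch only: find Quick Add sites whose expiry date has passed,
# so they can be moved from tblIndexedDomains back to tblPendingDomains for review.
expired_quick_add_sql = """
    SELECT * FROM tblIndexedDomains
    WHERE validation_method = 'QuickAdd'
      AND expire_date < NOW();
"""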

"SolrException: this IndexWriter is closed" preventing reindexing until Solr restarted

The indexing process sometimes gets a "this IndexWriter is closed" error from Solr when trying to submit a site to Solr. Indexing for that site stops, but indexing for other sites continues. However, given the error is Solr-side, none of the subsequent indexing processes successfully submit documents to Solr either. It requires a Solr restart to resolve.

Not entirely sure what causes the error because it isn't reproducible, but it may be something like Solr running out of file handles.

Ideally the source of this error could be identified and remediated, or failing that, some kind of alert could be raised to make it known that Solr needs to be restarted.

Not a top priority though given it has only happened twice so far.

Error on the indexing process:

2021-01-28 21:33:21 [pysolr] INFO: Finished 'http://search:8983/solr/content/update/' (post) with body '<delete><q' in 0.015 seconds, with status 500
2021-01-28 21:33:21 [pysolr] ERROR: Solr responded with an error (HTTP 500): [Reason: this IndexWriter is closed]
2021-01-28 21:33:21 [scrapy.core.engine] ERROR: Scraper close failure

Corresponding error in the Solr logs:

28/01/2021, 21:33:21
ERROR false
x:content
RequestHandlerBase
org.apache.solr.common.SolrException: this IndexWriter is closed
28/01/2021, 21:33:21
ERROR false
x:content
HttpSolrCall
null:org.apache.solr.common.SolrException: this IndexWriter is closed
28/01/2021, 21:33:26
ERROR false
x:content
UpdateLog
Error opening realtime searcher:org.apache.solr.common.SolrException: Error opening new searcher

Maybe related to https://issues.apache.org/jira/browse/SOLR-9830

Web: Check a site can be indexed before completing submission process

A surprising number of sites (including verified sites) have been submitted with a robots.txt containing:

User-agent: *
Disallow: /

This means searchmysite doesn't index them. It would be good to check robots.txt at the point of Quick Add or Verified Add so feedback can be given immediately if it isn't possible to index the site.

Note also that there are cases where robots.txt initially allows indexing, but is subsequently changed - see #11 for details of the handling of these.
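
One way such a check could be implemented at submission time is with Python's built-in urllib.robotparser, e.g. (a sketch only; the function name and user agent are illustrative, not what the codebase uses):

from urllib import robotparser
from urllib.parse import urljoin

def can_be_indexed(home_page, user_agent="*"):
    # Returns False if the site's robots.txt disallows crawling the home page,
    # e.g. the "User-agent: * / Disallow: /" case described above.
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(home_page, "/robots.txt"))
    rp.read()
    return rp.can_fetch(user_agent, home_page)

print(can_be_indexed("https://example.com/"))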

Error when resubmitting a site which has had indexing disabled

If a user submits a site via Quick Add which is already in the database, the system tries to show them a relevant message as to what state that site is in, e.g.

  • Domain has been owner verified and is already being indexed - click on Manage Site and login if you are the owner
  • Domain has not been owner verified but is already being indexed - click on Add via IndieAuth or Add via DCV if you want to verify
    etc.

If indexing is disabled, as per #11 Better handling of multiple failed reindexes, the idea had been to use this method to in effect communicate with the site owner, i.e. they could submit their site and see the message as to why indexing had been disabled.

Unfortunately this is giving an error for sites which have had indexing disabled, so this needs to be fixed.

Enable alternate payment methods in addition to credit cards

At the moment, payment of the verified listing fee is via Stripe, which only accepts credit cards. However, in many countries, e.g. Germany, credit card adoption isn't high (see https://stripe.com/en-gb/payments/payment-methods-guide ). Should investigate and if necessary enable alternate payment mechanisms.

Not a priority at the moment though - the main focus is on building the system out, increasing adoption, and testing whether a search engine can be sustained by anything other than advertising.

Testing: See if there's a better way of testing indexing

The run_tests.sh script just runs docker exec -it src_indexer_1 python /usr/src/app/search_my_site_scheduler.py at the moment. Need to figure out if there's a better way of testing the scraping process. There's a start in test_3_index.py but this throws a logging error and there isn't an assert to check for a return value or state. One workaround might be to implement the API call to trigger indexing for a site, i.e. #19

Indexing: Stop indexing some sites if they are taking too long

Some of the larger sites are taking nearly 2.5 hours to index, which is starting to risk knock-on effects.

There are a couple of ways this could be implemented:

  • Via a crawler stat, like the one used for indexing_page_limit. The advantage of this approach is that it can be configured on a per-site basis. Unfortunately the elapsed_time_seconds crawler stat is only set at the end, although you could get the start_time stat and subtract it from datetime.now().
  • The simplest way is just to use the CLOSESPIDER_TIMEOUT close spider extension. This will be a global setting, i.e. not configurable on a per site basis, but that is fine given it is more of a stability thing.

Thinking of setting to 1800 seconds (30 minutes) for now.
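
A minimal sketch of the second option (CLOSESPIDER_TIMEOUT is a standard Scrapy setting, in seconds; exactly where it would be set, e.g. src/indexing/indexer/settings.py, is a detail to confirm):

# Close any spider that has been running for more than 30 minutes (1800 seconds).
CLOSESPIDER_TIMEOUT = 1800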

Web: Internationalisation of menus

There are quite a few non-English language sites submitted, so it would be good to make the experience better in those languages, e.g. via internationalisation of menus.

Web: Language selector for search results

For the Search and the Newest Pages functions, all results are shown irrespective of language (for Browse there is a language filter). It would be good to have a language selector on the Search and Newest Pages so that users could only show results in the specified language.

Not sure if there are any HTTP headers that could be used to set sensible defaults (e.g. not sure how widely used Accept-Language is).

Given the privacy policy, we probably don't want to remember a user's setting with a cookie.

Purchase button isn't working

On Verified Add, everything works up to and including the Validate button, but the Purchase button doesn't work. This was reported by a user on 12 Apr 2022.

Tested on dev, and everything works as expected, so it is a prod-specific issue. Prod has the following error in the web_prod log:

stripe.error.AuthenticationError: Invalid API Key provided

Search: Upgrade from Solr 8.6.3 to 8.7.0

It is in the search/Dockerfile. Need to do a diff between default 8.6.3 config files and default 8.7.0 config files to see if there are any new config lines added, and add those to the custom config if necessary. Might also be worth seeing if solr-precreate content /opt/solr/server/solr/configsets/content now does actually load the new config, and update docs and script if it does.

Sites added via Verified Add (IndieAuth) lose their Category and Email

Every site added via the Verified Add (IndieAuth) option since 12 Dec 2020 has lost its site_category field, i.e. the following both return the same results:

select * from tblindexeddomains where site_category IS NULL;
select * from tblindexeddomains where validation_method = 'IndieAuth' AND validation_date > '12 Dec 2020';

I've manually fixed the data for now, and will do so for any new sites submitted until I've released a fix for the issue.

Web: Setup dedicated IndieAuth provider or implement IndieAuth client

The site is currently using https://indielogin.com/ as its IndieAuth provider. As per discussion with aaronpk (see https://chat.indieweb.org/dev/2020-07-12) "i don't want too many people to rely on indielogin.com itself, i would rather they implement an indieauth client directly or set up an instance of indielogin.com for themselves".

Need to work out the best way of doing this. Either it could be a fork of https://github.com/aaronpk/indielogin.com hosted in its own container at e.g. indielogin.searchmysite.net, or the IndieAuth client code could be directly integrated into src/web/content/dynamic/searchmysite/admin/auth.py.

See also https://indieauth.net/ for more information on IndieAuth.

Indexing: Some URLs are being indexed which aren't on the same domain

The crawler is configured with LinkExtractor(allow_domains=self.allowed_domains...) where self.allowed_domains is the current domain, e.g. michael-lewis.com, so in theory only pages on the current domain should be indexed and recorded with the domain field value set to the current domain. However, there are some domains which index URLs which are clearly not from that domain. Search e.g. for domain:iwebthings.com and you will see URLs like https://www.justus.ws/ and https://minitokyo3d.com/ which should not be there. See also #8

Web: Make site categories list database driven

At the moment the 2 values (personal-websites and independent-websites) are hardcoded into 3 templates. It is not likely to change any time soon so it really isn't a priority, but just making a note here. In future there may be other categories such as "special interest site", "independent online store", "independent small business".

Indexing: Index wikipedia

A user submitted https://en.wikipedia.org/ via Quick Add. As per my rejection note (which you can see by trying to resubmit) I would love to index wikipedia, but it would require custom dev and likely an infra upgrade.

The big advantage of including wikipedia would be that it would turn searchmysite.net from a niche search into a more general search, and therefore give the site more "stickiness". It wouldn't be a departure from the original philosophy, which is (among other things) to index just the "good stuff", to penalise pages with adverts, and to focus on personal and independent websites at first (I think wikipedia still falls under the category of "independent website").

However, given the 6M+ English pages and 20M+ pages in other languages, spidering it via the normal approach would not be a good idea. Indeed the page at https://en.wikipedia.org/wiki/Wikipedia:Database_download even says "Please do not use a web crawler to download large numbers of articles." A better idea would be to periodically download the database, and have a custom indexer for that database. The tblIndexedDomains table could have a column added for indexer type. It may require some Solr schema changes too in order to get the most out of it. It would have to be listed as not owner verified, and of course an exception made to increase the 50 page non-owner-verified page limit.

Not a trivial undertaking, and it would almost certainly require a CPU, memory, and disk upgrade for the production server, i.e. increased running costs. But not completely out of the question either.

Indexing of some sites is blocked by Cloudflare

As per #11 Better handling of multiple failed reindexes, sites which fail to index content two times in a row have their indexing disabled.

There are two sites which index fine on dev, but fail on prod, and so have had indexing disabled. I think this is because indexing is blocked by Cloudflare. A dig <domain.com> NS +short (replacing domain.com with the actual domain) shows they use cloudflare.com name servers, and one of the sites also has /cdn-cgi/challenge-platform/h/b/scripts/invisible.js in the source, which is related to Cloudflare's Bot Fight Mode.

To recreate, run the scrapy shell inside the docker container (replacing home_page with the actual site's home page), i.e.
docker exec -it src_indexing_1 scrapy shell 'home_page'
This returns
DEBUG: Crawled (200) (referer: None)
on dev, but
DEBUG: Crawled (503) (referer: None)
or
DEBUG: Crawled (403) (referer: None)
on prod.

Need to contact Cloudflare to see if they can address. According to "I run a good bot and want for it to be added to the allowlist (cf.bot_management.verified_bot). What should I do?" at https://support.cloudflare.com/hc/en-us/articles/360035387431#h_5itGQRBabQ51RwT5cNJX8u there is a form to fill in.

Indexing: Add an incremental reindex (only indexing new items)

As per the post searchmysite.net: The delicate matter of the bill, one of the less desirable "features" of the searchmysite.net model is that it burns up a lot of money indexing sites on a regular basis even if no-one is actually using the system. It would therefore be good to try to reduce indexing costs.

One idea is to only reindex sites and/or pages which have been updated. It doesn't look like there is a reliable way of doing this though, e.g. given only around 45% of pages in the system currently return a Last-Modified header, so there may need to be some "good enough" only-if-probably-modified approach.

For the only-if-probably-modified approach, one idea may be to store the entire home page in Solr, and at the start of reindexing that site compare the last home page with the new home page - if they are different, then proceed with reindexing that site, and if they are the same, do not reindex that site. There are some issues with this, e.g. if the page has some auto-generated text which changes on each page load, e.g. a timestamp, it will always register as different even if it isn't, and conversely there may be pages within the site which have been updated even if the home page hasn't changed at all. It might therefore be safest to have, e.g. a weekly only-if-probably-modified reindex and monthly reindex-everything-regardless (i.e. the current) approach as a fail-safe.
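
A minimal sketch of the comparison step, assuming the previously indexed home page HTML has been stored (e.g. in Solr as suggested above) and can be retrieved:

import hashlib

def home_page_probably_modified(previous_html, current_html):
    # Compare the stored home page with the newly fetched one; if they differ,
    # proceed with a full reindex of the site (subject to the caveats above, e.g.
    # auto-generated text such as timestamps will always register as different).
    previous_hash = hashlib.sha256(previous_html.encode("utf-8")).hexdigest()
    current_hash = hashlib.sha256(current_html.encode("utf-8")).hexdigest()
    return previous_hash != current_hash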

Indexing: A deny path starting with a . hangs all indexing

A user has entered a deny path .html.en.utf8 and this has hung all indexing. Last entries in the indexing log were:

 [searchmysitescript] INFO: Changing *.html.en.utf8 in deny path to *.html.en.utf8
 [searchmysitescript] INFO: Deny path ['*.html.en.utf8']

It looks like issue #32 "Deny path *. blocking all indexing for a site", which was the issue where e.g. *.xml (i.e. with the asterisk before the dot) would block indexing. The solution for that was a regex to change e.g. *.xml to *.xml$ prior to indexing (leaving the database untouched).

Temporary workaround in this case is to change the filter from .html.en.utf8 to *.utf8 in the database. This puts the following in the indexing log before indexing continues:

2021-04-29 09:06:47 [searchmysitescript] INFO: Changing *.utf8 in deny path to .utf8$
2021-04-29 09:06:47 [searchmysitescript] INFO: Deny path ['.utf8$']

This suggests the fix from last time isn't working exactly as expected, i.e. it triggers on a string starting with (or containing multiple) dot characters rather than one starting with asterisk dot, and in this case it also wasn't correctly adding the $ to the end of the string.
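
A sketch of what a corrected normalisation might look like (illustrative only, not the actual fix in the codebase): only rewrite deny paths which start with an asterisk followed by a dot, escape the dots, and anchor the pattern at the end of the string:

import re

def normalise_deny_path(path):
    # Convert shell-style "*.ext" deny paths into an end-anchored regex so they
    # only match the file extension, rather than matching everything.
    if path.startswith("*."):
        return re.escape(path[1:]) + "$"
    return path

print(normalise_deny_path("*.xml"))           # \.xml$
print(normalise_deny_path("*.html.en.utf8"))  # \.html\.en\.utf8$
print(normalise_deny_path("/private/"))       # /private/ (left unchanged)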

Web: Browse screen doesn't work well on mobile view

On the Browse screen I've a d-sm-table-cell on the Domain and Tags columns, so those aren't displayed on small screen devices. However, the filters and Site column are displayed. The issue is that some of the filters now have longer words, e.g. "independent website", that overlap with the Site column on small screens. Should consider moving the filters to a menu or something.

Indexing: Custom deduplicator not deduplicating urls with trailing slashes

There is custom deduplicator code in process_item in pipelines.py that deduplicates www and non-www links, e.g. so only one of https://michael-lewis.com/ and https://www.michael-lewis.com/ would be indexed. It also deduplicated trailing slashes, e.g. so only one of https://michael-lewis.com/ and https://michael-lewis.com would be indexed, but it looks like this is no longer working, e.g. there are entries for both https://kevq.uk/how-does-mastodon-work/ and https://kevq.uk/how-does-mastodon-work (with and without the trailing slash).
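
A sketch of the kind of canonicalisation the deduplication needs to perform (illustrative only, not the actual pipelines.py code):

from urllib.parse import urlsplit, urlunsplit

def canonicalise_url(url):
    # Strip a leading "www." from the host and a trailing "/" from the path,
    # so the www/non-www and trailing-slash variants reduce to the same key.
    scheme, netloc, path, query, fragment = urlsplit(url)
    if netloc.startswith("www."):
        netloc = netloc[4:]
    path = path.rstrip("/")
    return urlunsplit((scheme, netloc, path, query, fragment))

# Both of the kevq.uk URLs mentioned above reduce to the same canonical form:
print(canonicalise_url("https://kevq.uk/how-does-mastodon-work/") ==
      canonicalise_url("https://kevq.uk/how-does-mastodon-work"))  # True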

Testing: Add more tests for Verified Add

The test scripts perform a Verified Add (DCV) submission for blog.searchmysite.net (running some SQL to change the validation_key to match the actual one on searchmysite.net). This is a "happy path" submission though, so doesn't test any of the other paths, e.g. someone starts the submission but revisits at a later date, or something fails.

Also not sure how best to test Verified Add (IndieAuth).

Web: Add email box to forgot password page, and make user facing

At the moment, the forgot password page (/forgotten/password/ inside auth.py) is admin only for submit (although not for view). This is because it only asks the user for a domain, and I didn't want random people being able to spam verified site owners with password reset emails. The solution is to add a box to request the email address on the forgot password page, and say that the reset link will only be sent if the email and domain match. At that point the forgot password page can be made user facing.

Web: If you use Verified Add (IndieAuth) on an already verified site, and then try to use IndieAuth to login to Manage Site, the login will redirect to Verified Add (IndieAuth)

To recreate:

  1. With a new browser session, click Add Site / Verified Add (IndieAuth) and enter a site already validated with IndieAuth. It'll go through the auth process but then say "This domain has already been registered. Click Manage My Site to manage it"
  2. Go to Manage Site. You'll be taken to the login page. View source and you'll see a hidden redirect_uri set to https://searchmysite.net/admin/add/indieauth1home/ . It should be https://searchmysite.net/admin/login/ .
  3. Login with IndieAuth. You'll be redirected to the Verified Add (IndieAuth) page, with the "This domain has already been registered." message, rather than the "Manage Site" page.

Need to catch that sequence and make sure the redirect_uri is set correctly.

Short term workaround for affected users is to start a new browser session (or enter https://searchmysite.net/admin/logout/ within the same browser session to clear the session), and go straight to Manage Site.

Autofocus in search field on 'Browse Pages'/'Newest Pages'

When browsing 'Browse Pages'/'Newest Pages' the cursor is always placed in the search box. This breaks key-based navigation, e.g. hitting the space bar to scroll down.

After clicking on these links the user clearly intended not to use the search, so I'd say not auto-focussing the search should be the default.

Web: Logo and redesign

The current design was meant to be temporary just to get something working, so is pretty plain Bootstrap. I don't think the final design should be especially exciting given it is supposed to be a back-to-basics search engine, but it could look a bit more slick/professional.

It would also be nice to have a logo. I did experiment with layering Bootstrap Icons' bi-search on bi-person on bi-globe2 (quite literally "search" "my" "site") but it didn't look so good at 32x32. I also quite liked the concept of the silhouette of a bird foraging for food, but couldn't find a good free icon (in fact many of the ones with the bird's head bent down looked a bit odd, so maybe it's just a bad idea).

The favicon.ico has kindly been provided by @binyamin.

Web: Error if you login with IndieAuth but don't have a verified site

If a site has IndieAuth on it, but that site isn't in this system, there is currently nothing stopping that site owner logging in to this system. If they do, they'll log in and get a non-specific error, but a better user experience would be to show a specific error earlier in the process. The solution would be to check that a site is in this system and has IndieAuth enabled before beginning the IndieAuth login workflow. Not aware of any real users having encountered this, and it isn't a security risk or anything, so it's not a high priority.

Deny path *.<extension> blocking all indexing for a site

A user entered a deny path of *.xml, which seems a reasonable way to express "don't index any XML files". However, the deny parameter on LinkExtractor interprets this in such a way that nothing on the site is indexed at all. I think this is because the regex is interpreted as *. which means everything, although escaping it doesn't appear to resolve the issue.

Database: Add domain of person who performed Review action

Probably a new column in tblIndexedDomains, e.g. approver, only populated where validation method is 'QuickAdd', and similar in tblExcludeDomains, e.g. called reviewer, storing the domain (i.e. userid) of the person who approved/rejected the site. Not necessary now there's only the one admin, but will be useful when/if there is more than one.

Indexing: Better handling of multiple failed reindexes

At the moment, tblIndexingLog contains the following messages for robots.txt forbidden and site timeout respectively:

"WARNING: No documents found. Possibly robots.txt forbidden or site timeout: robotstxt/forbidden 1, retry/max_reached None"

"WARNING: No documents found. Possibly robots.txt forbidden or site timeout: robotstxt/forbidden None, retry/max_reached 2"

However, it will keep on trying every 3.5 days or 7 days indefinitely. Leaving it this way isn't ideal because (i) the indexing log fills up with warnings and resources are potentially wasted attempting reindexing, and perhaps more importantly (ii) if a site that was previously indexed subsequently times out consistently or blocks indexing via robots.txt, then stale content will be left in the search index, adversely impacting the quality of results.

At the moment, tblIndexingLog is checked manually for such issues, and one of two actions is taken manually:

  • If the site looks like it is permanently offline, or robots.txt blocks indexing and it isn't a verified site, it is moved from tblIndexedDomains to tblExcludeDomains.
  • If robots.txt blocks indexing and it is a verified site, it is left in tblIndexedDomains but the indexing_frequency increased from '3.5 days' to '30 days'

It would be good to automate this process. Indexing could keep a count of unsuccessful attempts and move a site to tblExcludeDomains after a certain number of unsuccessful indexes, plus conversely a maintenance job could (much less often) check that certain reasons on tblExcludeDomains are still true (e.g. robots.txt forbidden or site timeout) and move sites back to tblIndexedDomains.

See also #14 to try and prevent sites which block indexing via robots.txt from being submitted (although there have been cases, including with validated sites, where robots.txt allowed indexing on submission but was subsequently changed to block indexing, leaving the data in the index to become stale).
