
yggo's Introduction

Project archived. Please visit Yo!, the next generation of the YGGo project, based on Manticore Search.

YGGo! - Distributed Web Search Engine

StandWithUkraine

Written out of inspiration to explore the Yggdrasil ecosystem. The engine can also be useful for crawling regular websites, small business resources, and local networks.

The project goals are a simple interface, a clear architecture, and lightweight server requirements.

Overview

Home page

https://github.com/YGGverse/YGGo/tree/main/media

Online instances

  • http://[201:23b4:991a:634d:8359:4521:5576:15b7]/yggo/

Database snaps

  • 17-09-2023 http://[201:23b4:991a:634d:8359:4521:5576:15b7]/yggtracker/en/torrent/15

Requirements

php >= 8
php-dom
php-xml
php-pdo
php-curl
php-gd
php-mbstring
php-zip
php-mysql
php-memcached
memcached
sphinxsearch

Installation

  • git clone https://github.com/YGGverse/YGGo.git
  • cd YGGo
  • composer install

Setup

  • Server configuration examples are in /example/environment
  • The web root directory is /src/public
  • Deploy the database using the MySQL Workbench project in the /database folder
  • Install Sphinx Search Server
  • Configuration examples are in the /config folder
  • Make sure the /src/storage/cache, /src/storage/tmp and /src/storage/snap folders are writable
  • Set up the crontab by following the example in /src/crontab
  • To start the crawler, add at least one initial URL using the search form or the CLI
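
The writable-folder step can be checked before the first crawl. The sketch below is not part of YGGo: the function name `unwritable_dirs` and the idea of passing the project root as an argument are assumptions made for illustration. Only the three folder paths come from the list above.

```python
import os

# Folder paths taken from the setup list, relative to the project root.
STORAGE_DIRS = ("src/storage/cache", "src/storage/tmp", "src/storage/snap")


def unwritable_dirs(project_root, rel_dirs=STORAGE_DIRS):
    """Return the folders that are missing or not writable by the current user."""
    problems = []
    for rel in rel_dirs:
        path = os.path.join(project_root, rel)
        # A directory needs both write and execute permission to create files in it.
        if not os.path.isdir(path) or not os.access(path, os.W_OK | os.X_OK):
            problems.append(rel)
    return problems


if __name__ == "__main__":
    # "." assumes the script is started from the cloned YGGo directory.
    for rel in unwritable_dirs("."):
        print(f"not writable: {rel}")
```

Run it as the user the web server and the crontab tasks run under, since permissions are checked for the current user only.
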

JSON API

Use it to build third-party applications or to distribute the index.

Can be enabled or disabled with the API_ENABLED option.

Address

/api.php

Search

Returns search results.

Can be enabled or disabled with the API_SEARCH_ENABLED option.

Request attributes
GET action=search  - required
GET query={string} - optional, search request, empty if not provided
GET type={string}  - optional, filter by one of the available MIME types, empty if not provided
GET page={int}     - optional, search results page, 1 if not provided
GET mode=SphinxQL  - optional, enable extended SphinxQL syntax

Hosts distribution

Returns the collected hosts with the fields listed in the API_HOSTS_FIELDS option.

Can be enabled or disabled with the API_HOSTS_ENABLED option.

Request attributes
GET action=hosts - required

Application manifest

Returns node information for other nodes that have the same CRAWL_MANIFEST_API_VERSION and DEFAULT_HOST_URL_REGEXP conditions.

Can be enabled or disabled with the API_MANIFEST_ENABLED option.

Request attributes
GET action=manifest - required
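
The three actions above differ only in their GET attributes, so a client can build request URLs with one helper. This is a minimal sketch, not an official client: the base address `http://localhost/api.php` and the function name `build_api_url` are assumptions, and the response format is not described here, so the sketch stops at URL construction.

```python
from urllib.parse import urlencode

# Hypothetical base address; replace with the /api.php URL of a real instance.
API_URL = "http://localhost/api.php"

ACTIONS = ("search", "hosts", "manifest")


def build_api_url(action, base=API_URL, **params):
    """Build a GET URL for one of the documented API actions."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    query = {"action": action}
    # Leave out attributes that were not provided so the server defaults apply.
    query.update({k: v for k, v in params.items() if v is not None})
    return f"{base}?{urlencode(query)}"


if __name__ == "__main__":
    print(build_api_url("search", query="hello | world", page=2))
    print(build_api_url("hosts"))
    print(build_api_url("manifest"))
```

`urlencode` percent-encodes operators such as `|`, which matters once queries use the search syntax described in the next section.
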

Search textual filtering

Default constructions
word prefix:

yg*

operator OR:

hello | world

operator MAYBE:

hello MAYBE world

operator NOT:

hello -world

strict order operator (aka operator "before"):

aaa << bbb << ccc

exact form modifier:

raining =cats and =dogs

field-start and field-end modifier:

^hello world$

keyword IDF boost modifier:

boosted^1.234 boostedfieldend$^1.234
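
When queries are assembled by a program rather than typed into the form, a few of the constructions above can be wrapped in small helpers. The sketch below is illustrative only: the helper names are invented here, and the functions do no escaping, so terms that contain operator characters would need extra handling.

```python
def prefix(word):
    """Word prefix: matches any keyword that starts with `word`."""
    return f"{word}*"


def any_of(*terms):
    """Operator OR: matches documents containing at least one term."""
    return " | ".join(terms)


def exclude(query, *terms):
    """Operator NOT: appends each excluded term with a leading minus."""
    return " ".join([query, *(f"-{term}" for term in terms)])


def in_order(*terms):
    """Strict order operator: terms must appear in the given order."""
    return " << ".join(terms)


if __name__ == "__main__":
    print(prefix("yg"))
    print(any_of("hello", "world"))
    print(exclude("hello", "world"))
    print(in_order("aaa", "bbb", "ccc"))
```
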

Extended syntax

https://sphinxsearch.com/docs/current.html#extended-syntax

Can be enabled with the following attribute:

GET m=SphinxQL

Roadmap

Basic features
  • Web pages full text ranking search
    • Sphinx
  • Unlimited content MIME crawling
  • Flexible settings compatible with IPv4/IPv6 networks
  • Extended search syntax support
  • Compressed page history snaps with multi-provider storage sync
    • Local (unlimited locations)
    • Remote FTP (unlimited mirrors)
    • Privacy-oriented downloads counting, traffic controls
UI
  • CSS only, JS-less interface
  • Unique host ident icons
  • Content MIME tabs (#1)
  • Page index explorer
    • Meta
    • Snaps history
    • Referrers
  • Top hosts page
  • Safe media preview
  • Results with found matches highlight
  • The time machine feature by content snaps history
API
  • Index API
    • Manifest
    • Search
    • Hosts
    • Snaps
  • Context advertising API
Crawler
  • Auto crawl links by regular expression rules
    • Pages
    • Manifests
  • Robots.txt / robots meta tags support (#2)
  • Specific rules configuration for every host
  • Auto stop crawling on disk quota reached
  • Transactions support to prevent data loss on queue failures
  • Distributed index crawling between YGGo nodes through the manifest API
  • MIME Content-type settings
  • Ban links that do not match the conditions to prevent extra requests
  • Debug log
  • Index homepages and shorter URIs with higher priority
  • Collect target location links when a page redirect is available
  • Collect referrer pages (including redirects)
  • URL aliasing support on PR calculation
  • Host page DOM elements collecting by CSS selectors
    • Custom settings for each host
  • XML Feeds support
    • Sitemap
    • RSS
    • Atom
  • Palette image index / filter
  • Crawl queue balancer that depends on available CPU
  • Networks integration
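
Two of the crawler items above, link selection by regular expression rules and higher priority for homepages and shorter URIs, can be sketched together. This is not YGGo's crawler code: the rule `URL_RULE` and the function name `crawl_order` are hypothetical, and the rule simply accepts http:// URLs on bracketed IPv6 hosts as an example.

```python
import re
from urllib.parse import urlsplit

# Hypothetical rule: accept only http:// URLs whose host is a bracketed IPv6 address.
URL_RULE = re.compile(r"^http://\[[0-9a-f:]+\](/.*)?$", re.IGNORECASE)


def crawl_order(urls, rule=URL_RULE):
    """Drop URLs that fail the rule, then put homepages and shorter URIs first."""
    accepted = [url for url in urls if rule.match(url)]

    def priority(url):
        parts = urlsplit(url)
        uri = parts.path + (f"?{parts.query}" if parts.query else "")
        is_homepage = uri in ("", "/")
        # False sorts before True, so homepages lead; ties fall back to URI length.
        return (not is_homepage, len(uri), url)

    return sorted(accepted, key=priority)


if __name__ == "__main__":
    host = "http://[201:23b4:991a:634d:8359:4521:5576:15b7]"
    queue = [f"{host}/yggo/search.php?q=1", "https://example.com/", f"{host}/", f"{host}/yggo/"]
    for url in crawl_order(queue):
        print(url)
```
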
Cleaner
  • Banned pages reset by timeout
  • DB tables optimization
CLI

*The CLI is still under construction, use it at your own risk!

  • help
  • db
    • optimize
  • crontab
    • crawl
    • clean
  • hostSetting
    • get
    • set
    • list
    • delete
    • flush
  • hostPage
    • add
    • rank
      • reindex
  • hostPageSnap
    • repair
      • db
      • fs
    • reindex
    • truncate
Other
  • Administrative panel for useful index moderation
  • Deployment tools
  • Testing
  • Documentation

Contributions

Please create a new branch from the main|sqliteway tree for each patch in your fork before creating a PR.

git checkout main
git checkout -b my-pr-branch-name

See also: SQLite tree

Donate to contributors

License

Feedback

Feel free to share your ideas and bug reports!

Community

See also

yggo's Issues

White list / black list websites, robots.txt pre-sets

So, trackers with external seeders are shit inside the network

Nice start..

I mean this subject for the websites we need to crawl, and for some mirrors that we may need to block or limit by crawlPageLimit/CRAWL_HOST_DEFAULT_PAGES_LIMIT.

Ideas are here, just a few relevant relations:
#1 (comment)

And I would like to ask: do we need to enable the GitHub Discussions page, or are Issues for resolving, not talking?

Implement MySQL + Sphinx data driving model

Just tried to make a search request on 2.5M rows on SQLite / FTS5, and it seems that we are starting to have performance issues.

According to the following conversation, we need to rewrite the current DB driver model. I suppose MySQL is a nice, accessible alternative.

I have experience with the Sphinx engine: it stores compiled data in RAM and is able to process at least 8M rows in milliseconds with the same server resources, compared to the current result.

If someone has better ideas, you are welcome here.

Core upgrade

Thoughts on changing the DB structure and/or the current search model implementation.

As the project has no releases yet, the current repository could be separated into the sphinxway branch.

This way, the following changes could be wanted:

  • Rewrite the crawler
    • Make it more flexible for changes (currently it's a monolith)
    • Improve multimedia index / caching
    • Improve content semantics by parsing content internally into ranked keywords
  • Framework for web UI
  • Make installation simpler for distributed usage
  • Add more features related to snaps exploring

IMHO, as the foundation, a promising replacement for the Sphinx engine is
https://github.com/manticoresoftware/manticoresearch

The main goal for the future changes is to make search useful, based on collected experience.
