This project forked from safesploitorg/doogle

Doogle is a search engine and web crawler which can search indexed websites and images

Home Page: https://search.safesploit.com/

License: MIT License

JavaScript 4.36% PHP 89.44% CSS 5.34% Hack 0.87%

Doogle

Doogle is a search engine and web crawler. It indexes websites and images and stores their keywords so that they can be searched later.

This fork (AlteredCore) is under development. Many changes are being made, with the aim of a refined program ready to index the internet.

DoogleHomepage-Preview

Features

  • Search sites
    • Displays title, URL and description
  • Search images
    • Hover over images to preview description (alt tag)
    • Masonry layout for searched images
    • Image preview using Fancybox
    • Image search page responds dynamically
  • Redis Caching
  • AJAX Support
  • Clean homepage
  • Filters broken image results
  • Organises search results by clicks/visits
  • Pagination system at the bottom of the search page
  • Shows 'results found' for search term
  • Supports non-latin characters (UTF-8)

PHP Dependencies

mysql
pdo_mysql
(optional) redis

Setup and Usage

Two methods of setup are discussed.

  • Docker (Easiest)
  • Server Setup

Docker

Docker configuration files are available at doogle-docker.

The steps below presume you already have Docker v3.9 (or greater) installed and configured.

git clone https://github.com/safesploit/doogle-docker.git
cd doogle-docker
sh build.sh

Screenshot 2023-02-22 at 21 11 33 image

Doogle is now accessible via localhost:8000.

For debugging, phpMyAdmin is also included at localhost:8001.

Server Setup

v1.0.0-beta.1 is supported and tested in PHP 7.4, 8.0 and 8.1.

Please refer to XAMPP for the web server, PHP server and MySQL server configuration. XAMPP is the simplest method as several servers are required to use Doogle.

MySQL setup on XAMPP uses phpMyAdmin as a GUI for creating the database.

Once logged into phpMyAdmin, open the SQL tab and paste the contents of 'doogle-tables-no-data.sql' into the query field.

Image1-PHPMyAdmin

SQL User Creation

Replace PASSWORD_HERE with a strong random password.

mysql> CREATE USER IF NOT EXISTS 'doogle'@'localhost' IDENTIFIED BY 'PASSWORD_HERE';

SQL User Permissions

The SQL user 'doogle' must have SELECT, INSERT and UPDATE privileges:

mysql> GRANT SELECT, INSERT, UPDATE ON `doogle`.* TO 'doogle'@'localhost';
  • INSERT is used for crawling
  • SELECT is required for the search engine to return results
  • UPDATE is required to amend the clicks and broken results (see ./ajax/)

Connecting PHP to MySQL Server

In config.php, set the following values to match your database configuration. $dbpass should be the password chosen under SQL User Creation:

$dbname = "doogle";
$dbhost = "localhost";
$dbuser = "doogle";
$dbpass = "";

The script 'doogle-tables-no-data.sql' creates the database under the name 'doogle', which is why $dbname is set to "doogle".
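
With those values in place, a PDO connection (pdo_mysql is listed under PHP Dependencies) might be opened roughly as follows. This is a minimal sketch, not the project's actual config.php: the helper names buildDsn() and connectToDoogle() and the utf8mb4 charset are assumptions made for illustration.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: builds a PDO DSN from the config.php values.
// The utf8mb4 charset is an assumption, chosen to suit UTF-8 content.
function buildDsn(string $dbhost, string $dbname): string
{
    return "mysql:host={$dbhost};dbname={$dbname};charset=utf8mb4";
}

// Hypothetical helper: opens the connection and makes PDO throw on errors.
function connectToDoogle(string $dbhost, string $dbname, string $dbuser, string $dbpass): PDO
{
    return new PDO(buildDsn($dbhost, $dbname), $dbuser, $dbpass, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}

// Usage (requires a running MySQL server):
// $con = connectToDoogle("localhost", "doogle", "doogle", "PASSWORD_HERE");
```

Throwing exceptions on errors makes a wrong password or missing privilege visible immediately instead of failing silently during a crawl.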

Crawling Websites to Populate Images and Sites tables

Form-based crawl

In your browser, go to where the file is hosted, for example http://localhost/crawl.php

Paste the URL into the input field and press the Crawl button.

Manual crawl

At the bottom of crawl-manual.php, paste the URLs of the websites to be crawled into the $urls array:

$urls = [
    'urls' => [
        "https://google.com/",
        "http://udemy.com/topic/natural-language-processing/",
    ],
];

URLs are processed in random order unless the line shuffle($urls['urls']); is commented out.

Then, in your browser, go to where the file is hosted, for example http://localhost/crawl-manual.php

Explanation

The crawling process will take some time; how long depends entirely on the size of the website being crawled. The page will continue to load (without output) until the crawl.php script finishes.

Check the tables images and sites in the database to ensure they are being populated.

Image2-PHPMyAdmin

Once the tables are populated visit the Doogle homepage and search! See preview images.

Programming Logic

Pagination

Logic of pagination system

Pagination is implemented inside search.php.

image demonstrating pagination

In the example above, currentPage=11. The number of pages to show is always 10.

Results Per Page

Site search will return 20 results per page and image search will return 30 results per page.

The results per page can be changed inside search.php on lines 83 and 88 respectively, as indicated by the $pageSize variables:

Search-resultsPerPage

Handling an edge case

An edge case can occur when no more pages are available.

For example, 331 results at 20 per page give 17 pages. Without handling this case, the pagination UI would allow navigating to pages which don't exist, which would return an empty result.

To handle this edge case, the following logic is implemented around the while-loop:

if($currentPage + $pagesLeft > $numPages + 1)
    $currentPage = $numPages + 1 - $pagesLeft;

while($pagesLeft != 0 && $currentPage <= $numPages) 
{ ... }
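
To see how the check above behaves, the loop can be wrapped in a small function that returns the page numbers the pagination bar would list. This is a sketch under assumptions: pageWindow() is a hypothetical helper, and centring the window on the requested page is assumed rather than taken from search.php.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: returns the page numbers the pagination bar would list.
function pageWindow(int $numResults, int $pageSize, int $page, int $pagesToShow = 10): array
{
    // e.g. 331 results at 20 per page round up to 17 pages
    $numPages = (int) ceil($numResults / $pageSize);
    $pagesLeft = min($pagesToShow, $numPages);

    // Assumption: centre the window on the requested page, never starting below page 1.
    $currentPage = max(1, $page - intdiv($pagesToShow, 2));

    // Edge case: pull the window back so that it ends on the last real page.
    if ($currentPage + $pagesLeft > $numPages + 1) {
        $currentPage = $numPages + 1 - $pagesLeft;
    }

    $pages = [];
    while ($pagesLeft != 0 && $currentPage <= $numPages) {
        $pages[] = $currentPage;
        $currentPage++;
        $pagesLeft--;
    }
    return $pages;
}

// pageWindow(331, 20, 17) lists pages 8 to 17 rather than running past page 17.
```

Without the if-statement, requesting page 17 would start the window at page 12 and show only six links; with it, the window slides back so that ten links are still shown.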

Image Search

Image Captions

To make image searches more informative, an image's 'alt' text is matched against the search term, as shown in ./classes/ImageResultsProvider.php line 34:

ImageResultsProvider-query
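
Since the query itself is only available as a screenshot, the following is a hedged sketch of what such a query could look like, not a copy of line 34. The column names title and alt and the helper buildImageSearch() are assumptions; the broken and clicks columns and the page size of 30 are described elsewhere in this README.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: builds a prepared-statement query for an image search.
// Assumed column names: title, alt.
function buildImageSearch(string $term, int $page, int $pageSize = 30): array
{
    $sql = "SELECT * FROM images
            WHERE (title LIKE :titleTerm OR alt LIKE :altTerm) AND broken = 0
            ORDER BY clicks DESC
            LIMIT :fromLimit, :pageSize";

    $like = '%' . $term . '%'; // match the term anywhere in the text

    return [
        'sql' => $sql,
        'params' => [
            ':titleTerm' => $like,
            ':altTerm' => $like,
            ':fromLimit' => ($page - 1) * $pageSize, // rows to skip for this page
            ':pageSize' => $pageSize,
        ],
    ];
}
```

When executing a query like this with PDO, the two LIMIT values should be bound with PDO::PARAM_INT, because a quoted string is not valid in a MySQL LIMIT clause.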

Loading Images with JavaScript

In the 'images' table, there is a column 'broken' which tracks images that return an error.

Because a purely server-side solution cannot tell whether an image actually loads in the browser, images are loaded dynamically with JavaScript, and AJAX is used to flag the ones that fail. This is shown in ./assets/js/script.js:

script js-loadImage-broken

Masonry

Image search results are laid out with Masonry, a cascading grid layout library.

Masonry gives the images a grid layout, which is made responsive using jQuery. The image below shows an example layout:

Masonry-item-layout

Site Search - Trimming Results

As shown in the preview images, a site search returns the title, URL and description for each result.

However, to make some results easier to read, a trimming process is performed. Inside ./classes/SiteResultsProvider.php the function truncate_hl() is called.

Titles are trimmed at 100 characters and descriptions are trimmed at 230 characters.
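
The trimming step can be illustrated with a short sketch. This is not the project's truncate_hl(), whose signature is not shown here; trimField(), trimTitle() and trimDescription() are hypothetical helpers, and ending trimmed text with "..." is an assumption. The mbstring extension is used so that multi-byte UTF-8 characters are never cut in half.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: trims text to at most $limit characters,
// ending with "..." when something was cut off.
function trimField(string $text, int $limit): string
{
    $dots = "...";
    if (mb_strlen($text, 'UTF-8') <= $limit) {
        return $text; // short enough, leave untouched
    }
    // Reserve room for the dots so the result is exactly $limit characters.
    return mb_substr($text, 0, $limit - mb_strlen($dots, 'UTF-8'), 'UTF-8') . $dots;
}

// Limits taken from the text: 100 characters for titles, 230 for descriptions.
function trimTitle(string $title): string
{
    return trimField($title, 100);
}

function trimDescription(string $description): string
{
    return trimField($description, 230);
}
```

Counting characters rather than bytes matters here because Doogle supports non-Latin text, where a single character can occupy several bytes.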

Telemetry

Both the 'images' and 'sites' tables in the database have a 'clicks' column, which holds a click count for each row.

The 'clicks' field is incremented each time a site is visited or an image is previewed.

When performing a search, the results returned are organised in descending order of clicks. This behaviour is shown by the $query inside the getResultsHtml() function in ./classes/SiteResultsProvider.php.

SiteResultsProvider-getResultsHtml
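
The click counter described above could be incremented with a statement along these lines. This is a sketch, not the code in ./ajax/: buildClickUpdate() is a hypothetical helper and the id column name is an assumption.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: returns the UPDATE statement that bumps a click counter.
// Table names cannot be bound as parameters, so they are checked against a whitelist.
function buildClickUpdate(string $table): string
{
    $allowed = ['sites', 'images'];
    if (!in_array($table, $allowed, true)) {
        throw new InvalidArgumentException("Unknown table: {$table}");
    }
    // Assumed primary key column: id
    return "UPDATE {$table} SET clicks = clicks + 1 WHERE id = :id";
}

// Usage with an existing PDO connection $con (needs the UPDATE privilege):
// $stmt = $con->prepare(buildClickUpdate('sites'));
// $stmt->execute([':id' => 42]);
```

Incrementing inside the SQL statement (clicks = clicks + 1) avoids a separate read and write, so two simultaneous clicks are both counted.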

User-Agent

The user-agent string sent during crawling is set inside ./classes/DomDocumentParser.php.

DomDocumentParser-bot
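
One common way to send a custom user-agent from PHP is through a stream context, sketched below. Whether DomDocumentParser.php does exactly this is not confirmed by the text; buildCrawlContext() is a hypothetical helper, and the bot name and version are placeholders.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: creates a stream context that sends a custom User-Agent.
function buildCrawlContext(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'method' => 'GET',
            'user_agent' => $userAgent,
        ],
    ]);
}

// Placeholder value, not necessarily the string the project uses.
$context = buildCrawlContext('doogleBot/0.1');

// Usage: pass the context as the third argument when fetching a page.
// $html = file_get_contents('https://example.com/', false, $context);
```

Identifying the crawler by name lets site owners recognise its requests in their logs.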

Preview Images

Doogle Homepage

Image3-DoogleHomepage-Edge

Doogle Search - Sites

Image4-DoogleSearch-PoC

Expanded Doogle Search Results

Image4-DoogleSearch-PoC

Doogle Search - Images

Image5-DoogleSearch-PoC-images

Image Preview

Image preview is done using Fancybox.

The title, image URL and site URL are available on the bottom left corner.

Image9-DoogleSearch-imagePreview

Pagination System

Naturally, certain search terms, such as 'bbc', may return many results.

Doogle displays only 20 sites per page. At the bottom of the page, links to the next 10 pages are shown.

Results Shown

Image6-DoogleSearch-pagination-ResultsShown

Bottom of Page

Image7-DoogleSearch-pagination-Bottom

Bottom of Page 20

Image8-DoogleSearch-pagination-scrollingThrough

doogleBot Crawl Form

An HTML form to submit a URL for crawling

Image10-doogleBot-Crawler-formpng

Preview Video

Doogle Search demo - YouTube
