ahmia / search
Ahmia - Search Engine for onion services.
License: BSD 3-Clause "New" or "Revised" License
We should start a conversation about which statistics should be added.
What about tracking the searches themselves (top keywords), i.e. a "trends" project?
We should store a screenshot of each indexed page.
I'm not sure whether we should put the image path in the index or embed the picture itself as a base64 string.
Icey talked about this. I'm writing it here so we don't forget.
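To compare the two options, here is a minimal sketch. The field names (`screenshot_b64`, `screenshot_path`) and the directory layout are assumptions for illustration, not the real Ahmia schema:

```python
import base64

def build_doc(onion_url, screenshot_png, inline=True):
    """Build an index document for one page.

    inline=True embeds the image as a base64 string (self-contained
    document, but ~33% larger payload and a bloated index);
    inline=False stores only a hypothetical filesystem path.
    """
    doc = {"url": onion_url}
    if inline:
        doc["screenshot_b64"] = base64.b64encode(screenshot_png).decode("ascii")
    else:
        # Keeps the index small; the image files live elsewhere on disk.
        doc["screenshot_path"] = "screenshots/%s.png" % onion_url.split(".")[0]
    return doc

# Fake PNG header standing in for a real capture
png_bytes = b"\x89PNG\r\n\x1a\n"
inline_doc = build_doc("msydqstlz2kzerdg.onion", png_bytes, inline=True)
path_doc = build_doc("msydqstlz2kzerdg.onion", png_bytes, inline=False)
```

The base64 variant makes every search hit carry its image; the path variant needs a separate static file server but keeps index queries fast.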
I think we can take inspiration from http://www.searchcommands.com/google/
The most useful operators are site:, inurl:, and intitle:, plus date filtering.
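A sketch of how such operators could be split out of a raw query before it hits the index; the operator list and the `parse_query` helper are assumptions for illustration:

```python
import re

# Operators discussed above; date filtering would need its own syntax.
OPERATORS = ("site", "inurl", "intitle")

def parse_query(query):
    """Split a raw query into free-text terms and operator filters."""
    filters, terms = {}, []
    pattern = r"^(%s):(.+)$" % "|".join(OPERATORS)
    for token in query.split():
        m = re.match(pattern, token)
        if m:
            filters[m.group(1)] = m.group(2)
        else:
            terms.append(token)
    return " ".join(terms), filters
```

The free-text part would go to the normal full-text query while each filter maps to a field restriction in the index.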
Thank you for the project! Does the ahmia.fi project offer a list of working onions you've scraped?
How often do these onions change?
Items we display should be individual pages instead of the top-level onion domain.
But that will increase the required server resources, and I'm not sure how scalable the current setup is.
Any idea what should be done on the automation front?
This is not a major issue, but the documentation is outdated relative to the latest version, and there is very little guide-style information available. It would be great to have something more detailed.
We could analyze a page's content to understand what it is about. Indexation would take much longer, but the results would be better. The Natural Language Toolkit (NLTK) is the reference library for this, though it is slow.
I'm not sure whether it's useful for non-compiled code. Maybe static code analysis is enough.
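As a rough stand-in for what such analysis could produce, here is a plain term-frequency sketch; in a real implementation NLTK would replace the naive tokenizer and the hand-written stopword list with proper linguistic models:

```python
import re
from collections import Counter

# Tiny hand-written stopword list; NLTK ships full lists per language.
STOPWORDS = {"the", "a", "an", "is", "and", "of", "to", "for", "in", "on"}

def top_keywords(text, n=5):
    """Crude topic extraction by term frequency over a page's text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words
                     if w not in STOPWORDS and len(w) > 2)
    return [w for w, _ in counts.most_common(n)]
```

Even this cheap version could feed a "topics" field in the index; swapping in NLTK tokenization, stemming, and POS tagging is where the indexing-time cost mentioned above comes from.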
The !bang syntax of DuckDuckGo is a great feature. It lets you search another website that has its own search engine. We should let people propose bangs for searching other hidden services.
Should we make a list of initially supported bangs? For instance, DuckDuckGo has an .onion service, so why not expose it with a !ddg command.
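A minimal sketch of how bang resolution could work; the bang table is an assumption (only a clearnet DuckDuckGo URL is used here, since the exact onion address would need to be verified):

```python
# Hypothetical bang table mapping a bang name to a URL template.
BANGS = {
    "ddg": "https://duckduckgo.com/?q={query}",
}

def resolve_bang(raw_query):
    """If the query starts with a known !bang, return the redirect URL;
    otherwise return None so the normal Ahmia search runs."""
    if not raw_query.startswith("!"):
        return None
    bang, _, rest = raw_query[1:].partition(" ")
    template = BANGS.get(bang)
    if template is None or not rest:
        return None
    return template.format(query=rest.replace(" ", "+"))
```

User-proposed bangs would just be rows added to the table after review, which keeps the feature cheap to maintain.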
The right padding could be reduced, I think.
Please offer an API so it would be easy to retrieve results from IRC (or XMPP) chat bots and other applications.
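To make the request concrete, here is a sketch of what a bot-friendly JSON API could look like. The endpoint URL and the response fields (`title`, `url`) are entirely hypothetical, just an illustration of what an IRC bot would want:

```python
# Hypothetical endpoint -- not a real Ahmia URL.
API_URL = "https://ahmia.example/search/?q={query}&format=json"

def format_for_irc(results, limit=3):
    """Render parsed API results as short one-line strings an IRC bot
    can emit into a channel (hit dicts assumed to have title and url)."""
    lines = []
    for hit in results[:limit]:
        lines.append("%s - %s" % (hit["title"], hit["url"]))
    return lines

# What a bot would get back after json-decoding the (hypothetical) response:
sample = [{"title": "Example service", "url": "http://abcdefgh.onion/"}]
```

Anything that returns plain JSON over HTTP would cover IRC, XMPP, and scripting use cases in one go.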
There seems to be one big cloner (and some smaller ones).
Luckily, the big cloner also clones link directories in which the real onion links are replaced with his portfolio of cloned onions.
Simply diffing the two lists will give a list of clones.
I have forked your site (just trying to learn my way around Python and Django) and was planning to implement an automated script that regularly checks cloned link directories and marks them as "clone", in the same fashion as the "banned" sites, if time permits.
The example below is by Daniel Winzen, operator of the real link directory used in the script:
<?php
// Fetch the real link directory and its clone over Tor and diff them.
$ch = curl_init(); // missing in the original snippet; required before curl_setopt()
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_PROXY, '127.0.0.1:9050'); // local Tor SOCKS port
curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5_HOSTNAME); // value 7: resolve .onion via the proxy
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 25);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_URL,
    'http://tt3j2x4k5ycaa5zt.onion/onions.php?format=scamtest');
$links = explode("\n", curl_exec($ch));
curl_setopt($ch, CURLOPT_URL,
    'http://tt3j277rncfaqmj7.onion/onions.php?format=scamtest');
$scam_links = explode("\n", curl_exec($ch));
$i = $scam = 0;
if (count($links) === count($scam_links)) {
    foreach ($links as $link) {
        // Lines that differ between the two lists are cloned addresses.
        if ($link !== $scam_links[$i]) {
            preg_match('~(^(https?://)?([a-z2-7]{16})(\.onion(/.*)?)?$)~i',
                $link, $addr);
            $address = strtolower($addr[3]); // real address
            preg_match('~(^(https?://)?([a-z2-7]{16})(\.onion(/.*)?)?$)~i',
                $scam_links[$i], $addr);
            $scam_address = strtolower($addr[3]); // clone
            // add clone to database
            ++$scam;
        }
        ++$i;
    }
}
echo "$i onions checked\n";
echo "$scam onions were scam\n";
?>
It is related to #22.
This issue concerns only building the interface. Any ideas for a good chart library?
Since this tool should be dynamic, what about making it a one-page web app?
The idea is to be able to compute a score for an indexed onion site/page.
I'm not sure whether the score should be computed at indexing time or at search time. I need to do some research first.
The stats we have (popularity, backlinks, number of clicks) can be useful for computing that score.
Also, icey proposed to work on the uptime stat, which could be useful too.
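As a starting point for that research, here is a toy scoring function over the stats listed above; the weights are arbitrary placeholders and a real formula would need tuning (and probably log-scaling) against actual data:

```python
def onion_score(popularity, backlinks, clicks, uptime_ratio=1.0):
    """Toy ranking score combining the available stats.

    The 0.4/0.3/0.2 weights are placeholders, not tuned values.
    uptime_ratio (0.0-1.0) multiplies the whole score, so a site that
    is down half the time loses half its rank.
    """
    return (0.4 * popularity + 0.3 * backlinks + 0.2 * clicks) * uptime_ratio
```

Computing this at indexing time makes searches cheap but the score goes stale between crawls; computing it at search time keeps it fresh at the cost of per-query work.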
We should add a button that returns the user to the top of the search results.
We should update the following install guide: https://github.com/juhanurmi/ahmia/blob/master/README.md
Should we target the latest Ubuntu LTS and Fedora?
It also depends on #22
I'm thinking of the following structure:
ahmia/ <- org name
    ahmia <- repository containing the Django app of ahmia.fi, plus documentation on how to install it and run it (with Apache and nginx config samples)
    onion-elastic-bot <- crawler and its install guide
I'm open to suggestions. Is the tools directory still used?
Tests need to be written for the Django app and the crawler.
The coverage should be displayed on each project's index page.
This improves code quality and avoids internal server errors when pushing a quick fix to production.
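As a sketch of the kind of unit test to add, here is a plain unittest example; the `is_valid_onion` helper is hypothetical (in the real app this would be a Django TestCase exercising actual views or model code):

```python
import re
import unittest

def is_valid_onion(address):
    """Hypothetical helper: accept 16-character v2 onion hostnames."""
    return bool(re.match(r"^[a-z2-7]{16}\.onion$", address))

class OnionAddressTest(unittest.TestCase):
    def test_accepts_valid_address(self):
        self.assertTrue(is_valid_onion("tt3j2x4k5ycaa5zt.onion"))

    def test_rejects_non_onion_address(self):
        self.assertFalse(is_valid_onion("not-an-onion.com"))
```

Running the suite in CI before every deploy is what catches the "quick fix breaks production" case, and a coverage tool can then generate the percentage to display.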
Hi there,
I have forked your site and am checking it out.
All seems to be working fine except for one issue.
I'm not sure if it's the code here on GitHub, which may not be up to date since it seems to work on ahmia.fi, or whether something is wrong with my Ubuntu 14.04 Python setup.
test_hidden_services.py updates the official description.json perfectly,
but json_html seems to be problematic.
The log indicates that the description is updated, yet the only field filled in the PostgreSQL description table is the "title" column, which contains http://blahblah.onion instead of the title extracted from the HTML; the other fields are blank instead of NULL after the update.
I am trying to find out what is wrong (I have limited skills but I'm enjoying the exercise), but I've been on a wild goose chase for a while now and can't find the source of the problem.
When I test "def analyze_front_page(raw_html):" manually, it seems to produce the correct JSON output.
Again, not sure if this is a known problem or simply my setup.
I'm thinking we should remove all references to Solr to avoid confusing people.
I'm not sure whether any other tool or library is affected.