ahmia / search
Ahmia - Search Engine for onion services.
License: BSD 3-Clause "New" or "Revised" License
We should start a conversation about which statistics should be added.
What about tracking the searches themselves (top keywords), i.e. a "trends" project?
We should store a screenshot of each indexed page.
I'm not sure whether we should put the image path in the index or embed the picture itself as a base64 string.
Icey talked about this. I'm writing it here so we don't forget.
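To compare the two options, here is a minimal sketch. The field names (`screenshot_b64`, `screenshot_path`) and the directory layout are assumptions for illustration, not the real Ahmia schema:

```python
import base64

def build_doc(onion_url, screenshot_png, inline=True):
    """Build an index document for one page.

    inline=True embeds the image as a base64 string (self-contained
    document, but ~33% larger payload and a bloated index);
    inline=False stores only a hypothetical filesystem path.
    """
    doc = {"url": onion_url}
    if inline:
        doc["screenshot_b64"] = base64.b64encode(screenshot_png).decode("ascii")
    else:
        # Keeps the index small; the image files live elsewhere on disk.
        doc["screenshot_path"] = "screenshots/%s.png" % onion_url.split(".")[0]
    return doc

# Fake PNG header standing in for a real capture
png_bytes = b"\x89PNG\r\n\x1a\n"
inline_doc = build_doc("msydqstlz2kzerdg.onion", png_bytes, inline=True)
path_doc = build_doc("msydqstlz2kzerdg.onion", png_bytes, inline=False)
```

The base64 variant makes every search hit carry its image; the path variant needs a separate static file server but keeps index queries fast.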
I think we can take inspiration from http://www.searchcommands.com/google/
The most useful operators are site:, inurl:, and intitle:, plus date filtering.
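A sketch of how such operators could be split out of a raw query before it hits the index; the operator list and the `parse_query` helper are assumptions for illustration:

```python
import re

# Operators discussed above; date filtering would need its own syntax.
OPERATORS = ("site", "inurl", "intitle")

def parse_query(query):
    """Split a raw query into free-text terms and operator filters."""
    filters, terms = {}, []
    pattern = r"^(%s):(.+)$" % "|".join(OPERATORS)
    for token in query.split():
        m = re.match(pattern, token)
        if m:
            filters[m.group(1)] = m.group(2)
        else:
            terms.append(token)
    return " ".join(terms), filters
```

The free-text part would go to the normal full-text query while each filter maps to a field restriction in the index.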
Thank you for the project! Does the ahmia.fi project offer a list of working onions you've scraped?
How often do these onions change?
Items we display should be individual pages instead of the top-level onion domain.
But that will increase the required server resources, and I'm not sure how scalable the current setup is.
Any idea what should be done on the automation front?
This is not a major issue, but the documentation is outdated relative to the latest version, and there is very little guide-style information available. It would be great to have something more detailed.
We could analyze a page's content to understand what it is about. Indexation would take much longer, but the results would be better. The Natural Language Toolkit (NLTK) is the reference library for this, though it is slow.
I'm not sure whether it's useful for non-compiled code. Maybe static code analysis is enough.
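As a rough stand-in for what such analysis could produce, here is a plain term-frequency sketch; in a real implementation NLTK would replace the naive tokenizer and the hand-written stopword list with proper linguistic models:

```python
import re
from collections import Counter

# Tiny hand-written stopword list; NLTK ships full lists per language.
STOPWORDS = {"the", "a", "an", "is", "and", "of", "to", "for", "in", "on"}

def top_keywords(text, n=5):
    """Crude topic extraction by term frequency over a page's text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words
                     if w not in STOPWORDS and len(w) > 2)
    return [w for w, _ in counts.most_common(n)]
```

Even this cheap version could feed a "topics" field in the index; swapping in NLTK tokenization, stemming, and POS tagging is where the indexing-time cost mentioned above comes from.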
The !bang syntax of DuckDuckGo is a great feature. It lets you search another website that has its own search engine. We should let people propose bangs for searching other hidden services.
Should we make a list of initially supported bangs? For instance, DuckDuckGo has an .onion service, so why not expose it with a !ddg command.
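A minimal sketch of how bang resolution could work; the bang table is an assumption (only a clearnet DuckDuckGo URL is used here, since the exact onion address would need to be verified):

```python
# Hypothetical bang table mapping a bang name to a URL template.
BANGS = {
    "ddg": "https://duckduckgo.com/?q={query}",
}

def resolve_bang(raw_query):
    """If the query starts with a known !bang, return the redirect URL;
    otherwise return None so the normal Ahmia search runs."""
    if not raw_query.startswith("!"):
        return None
    bang, _, rest = raw_query[1:].partition(" ")
    template = BANGS.get(bang)
    if template is None or not rest:
        return None
    return template.format(query=rest.replace(" ", "+"))
```

User-proposed bangs would just be rows added to the table after review, which keeps the feature cheap to maintain.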
The right padding could be reduced, I think.
Please offer an API so it would be easy to retrieve results from IRC (or XMPP) chat bots and other applications.
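To make the request concrete, here is a sketch of what a bot-friendly JSON API could look like. The endpoint URL and the response fields (`title`, `url`) are entirely hypothetical, just an illustration of what an IRC bot would want:

```python
# Hypothetical endpoint -- not a real Ahmia URL.
API_URL = "https://ahmia.example/search/?q={query}&format=json"

def format_for_irc(results, limit=3):
    """Render parsed API results as short one-line strings an IRC bot
    can emit into a channel (hit dicts assumed to have title and url)."""
    lines = []
    for hit in results[:limit]:
        lines.append("%s - %s" % (hit["title"], hit["url"]))
    return lines

# What a bot would get back after json-decoding the (hypothetical) response:
sample = [{"title": "Example service", "url": "http://abcdefgh.onion/"}]
```

Anything that returns plain JSON over HTTP would cover IRC, XMPP, and scripting use cases in one go.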
There seems to be one big cloner (and some smaller ones).
Luckily, the big cloner also clones link directories in which the real onion links are replaced with his portfolio of cloned onions.
Simply diffing the two lists will give a list of clones.
I have forked your site (just trying to learn my way around Python and Django) and was planning to implement an automated script that regularly checks cloned link directories and marks them as "clone", in the same fashion as the "banned" sites, if time permits.
The example below is by Daniel Winzen, operator of the real link directory used in the script:
<?php
// Fetch the real link directory and its clone over Tor and diff them.
$ch = curl_init(); // missing in the original snippet; required before curl_setopt()
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_PROXY, '127.0.0.1:9050'); // local Tor SOCKS port
curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5_HOSTNAME); // value 7: resolve .onion via the proxy
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 25);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_URL,
    'http://tt3j2x4k5ycaa5zt.onion/onions.php?format=scamtest');
$links = explode("\n", curl_exec($ch));
curl_setopt($ch, CURLOPT_URL,
    'http://tt3j277rncfaqmj7.onion/onions.php?format=scamtest');
$scam_links = explode("\n", curl_exec($ch));
$i = $scam = 0;
if (count($links) === count($scam_links)) {
    foreach ($links as $link) {
        // Lines that differ between the two lists are cloned addresses.
        if ($link !== $scam_links[$i]) {
            preg_match('~(^(https?://)?([a-z2-7]{16})(\.onion(/.*)?)?$)~i',
                $link, $addr);
            $address = strtolower($addr[3]); // real address
            preg_match('~(^(https?://)?([a-z2-7]{16})(\.onion(/.*)?)?$)~i',
                $scam_links[$i], $addr);
            $scam_address = strtolower($addr[3]); // clone
            // add clone to database
            ++$scam;
        }
        ++$i;
    }
}
echo "$i onions checked\n";
echo "$scam onions were scam\n";
?>
It is related to #22.
This issue concerns only building the interface. Any ideas for a good chart library?
Since this tool should be dynamic, what about making it a one-page web app?
The idea is to be able to compute a score for an indexed onion site/page.
I'm not sure whether the score should be computed at indexing time or at search time. I need to do some research first.
The stats we have (popularity, backlinks, number of clicks) can be useful for computing that score.
Also, icey proposed to work on the uptime stat, which could be useful too.
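As a starting point for that research, here is a toy scoring function over the stats listed above; the weights are arbitrary placeholders and a real formula would need tuning (and probably log-scaling) against actual data:

```python
def onion_score(popularity, backlinks, clicks, uptime_ratio=1.0):
    """Toy ranking score combining the available stats.

    The 0.4/0.3/0.2 weights are placeholders, not tuned values.
    uptime_ratio (0.0-1.0) multiplies the whole score, so a site that
    is down half the time loses half its rank.
    """
    return (0.4 * popularity + 0.3 * backlinks + 0.2 * clicks) * uptime_ratio
```

Computing this at indexing time makes searches cheap but the score goes stale between crawls; computing it at search time keeps it fresh at the cost of per-query work.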
We should add a button that returns the user to the top of the search results.
We should update the following install guide: https://github.com/juhanurmi/ahmia/blob/master/README.md
Should we target the latest Ubuntu LTS and Fedora?
It also depends on #22
I'm thinking of the following structure:
ahmia/ <- org name
    ahmia <- repository containing the Django app of ahmia.fi, plus documentation on how to install it and run it (with Apache and nginx config samples)
    onion-elastic-bot <- crawler and its install guide
I'm open to suggestions. Is the tools directory still used?
Tests need to be written for the Django app and the crawler.
The coverage should be displayed on each project's index page.
This improves code quality and avoids internal server errors when pushing a quick fix to production.
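As a sketch of the kind of unit test to add, here is a plain unittest example; the `is_valid_onion` helper is hypothetical (in the real app this would be a Django TestCase exercising actual views or model code):

```python
import re
import unittest

def is_valid_onion(address):
    """Hypothetical helper: accept 16-character v2 onion hostnames."""
    return bool(re.match(r"^[a-z2-7]{16}\.onion$", address))

class OnionAddressTest(unittest.TestCase):
    def test_accepts_valid_address(self):
        self.assertTrue(is_valid_onion("tt3j2x4k5ycaa5zt.onion"))

    def test_rejects_non_onion_address(self):
        self.assertFalse(is_valid_onion("not-an-onion.com"))
```

Running the suite in CI before every deploy is what catches the "quick fix breaks production" case, and a coverage tool can then generate the percentage to display.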
Hi there,
I have forked your site and am checking it out.
All seems to be working fine except for one issue.
I'm not sure if it's the code here on GitHub, which may not be up to date since it seems to work on ahmia.fi, or whether something is wrong with my Ubuntu 14.04 Python setup.
test_hidden_services.py updates the official description.json perfectly,
but json_html seems to be problematic.
The log indicates that the description is updated, yet the only field filled in the PostgreSQL description table is the "title" column, which contains http://blahblah.onion instead of the title extracted from the HTML; the other fields are blank instead of NULL after the update.
I am trying to find out what is wrong (I have limited skills but I'm enjoying the exercise), but I've been on a wild goose chase for a while now and can't find the source of the problem.
When I test "def analyze_front_page(raw_html):" manually, it seems to produce the correct JSON output.
Again, not sure if this is a known problem or simply my setup.
I'm thinking we should remove all references to Solr to avoid confusing people.
I'm not sure whether any other tool or library is affected.