o0111 / ruralcafe
Automatically exported from code.google.com/p/ruralcafe
Currently these redirects are causing duplicates. This is really caused by the way the
local proxy interacts with the client browser.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 8:00
Not sure if this is in there yet, but this method should work through a proxy.
The information should be in the RCRequest and either be passed in or simply accessed
from the requestHandler information.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 8:36
The status codes are not very useful across the source files. A bunch of
different classes, and at least the two handlers, are using them for different
purposes, so there isn't really any consistency and it's not clear what the
different codes mean, either for the code logic or in the logging for analysis.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 7:34
The ELF extension is mostly a logging and reporting Firefox extension that does
prefetching and caching entirely within the browser, but one other thing it
does is highlight the links that are cached, so that users don't end up
clicking on a bunch of dead links and getting frustrated.
The "defect" here is that the CIP/RuralCafe code doesn't do any kind of
cached-object link highlighting. Either rewriting the page (or adding CSS for
those pages) so that cached links are highlighted this way, or adding hooks so
the ELF extension can ask RuralCafe whether pages are cached, would fix this
usability problem.
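A minimal sketch of the rewriting approach, assuming a hypothetical cache lookup callback (the class and CSS names here are illustrative, not part of the codebase):

    using System;
    using System.Text.RegularExpressions;

    // Hypothetical rewriter: tags anchors whose target is cached with a CSS
    // class, so an injected rule like a.rc-cached { color: #060; } can make
    // them visually distinct. Simplified: assumes double-quoted hrefs and no
    // pre-existing class attribute on the anchor.
    public static class LinkHighlighter
    {
        private static readonly Regex Anchor =
            new Regex("<a\\s+([^>]*?)href=\"([^\"]+)\"", RegexOptions.IgnoreCase);

        public static string Highlight(string html, Func<string, bool> isCached)
        {
            return Anchor.Replace(html, m =>
                isCached(m.Groups[2].Value)
                    ? "<a class=\"rc-cached\" " + m.Groups[1].Value +
                      "href=\"" + m.Groups[2].Value + "\""
                    : m.Value);
        }
    }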
Original issue reported on code.google.com by [email protected]
on 3 Oct 2010 at 8:18
TroTro doesn't allow the richness to be changed. There needs to be a default
setting, a per-user setting, and the possibility to change the current richness
on the TroTro page.
Original issue reported on code.google.com by [email protected]
on 21 Apr 2013 at 10:34
When pages are not found in the wiki dump the handler should fall back to the
caching logic.
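A sketch of the intended control flow (both callees are hypothetical names, not the real APIs):

    // Hypothetical fallback: serve from the wiki dump when possible,
    // otherwise run the existing caching logic.
    public string ServeWikiOrCache(string uri)
    {
        string page = LookupInWikiDump(uri);        // returns null on a miss
        return page ?? ServeFromCacheOrRemote(uri); // the caching path
    }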
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 7:57
Blacklist reporting in the UI, so that we can filter a site out for the current
user immediately, log it, and check later for removal by the administrator.
So there'd be two pieces to this.
First, the user clicks on something that tells RuralCafe that the domain or
page is bad, and they're allowed to enter a reason or use a form interface for
this. RuralCafe will record the client IP address and add the domain/URL to a
data structure that's soft-state for the IP address until shutdown or some timeout.
Second, the blacklist requests are all aggregated together for perusal by the
administrator, either in a file or something similar, and also in logs so we
can improve our crawler.
It would also be nice if the admin had a special version of the user interface
that does basically the same thing; we could specify the admin's UI in
config.txt and then allow the admin to edit/view the existing blacklist and the
blacklist pages suggested by other clients.
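A minimal sketch of the per-client soft-state structure, with timestamped entries that expire after a timeout (all names are hypothetical):

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;

    // Hypothetical soft-state store for blacklist suggestions, keyed by
    // client IP. Entries expire after a timeout; everything is lost on
    // shutdown by design.
    public class BlacklistSuggestions
    {
        private class Entry { public string Url; public string Reason; public DateTime When; }

        private readonly TimeSpan _timeout;
        private readonly ConcurrentDictionary<string, List<Entry>> _byClient =
            new ConcurrentDictionary<string, List<Entry>>();

        public BlacklistSuggestions(TimeSpan timeout) { _timeout = timeout; }

        public void Report(string clientIp, string url, string reason)
        {
            var list = _byClient.GetOrAdd(clientIp, _ => new List<Entry>());
            lock (list)
            {
                list.Add(new Entry { Url = url, Reason = reason, When = DateTime.UtcNow });
            }
        }

        // True if this client reported the URL and the report has not expired.
        public bool IsSuggested(string clientIp, string url)
        {
            List<Entry> list;
            if (!_byClient.TryGetValue(clientIp, out list)) return false;
            lock (list)
            {
                list.RemoveAll(e => DateTime.UtcNow - e.When > _timeout);
                return list.Exists(e => e.Url == url);
            }
        }
    }

Periodically dumping all entries to a file or the logs would cover the administrator-facing aggregation piece.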
Original issue reported on code.google.com by [email protected]
on 3 Oct 2010 at 8:27
The number of search results on the search results page is too high; it slows
down rendering and will get worse when the content snippets are added.
Once the number of results returned is reduced, there should also be buttons
for the nth page and for the next and previous pages.
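The paging itself is simple skip/take arithmetic; a sketch assuming a flat result list (the class name and page size are placeholders):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical pager: returns one page of results plus the page count
    // needed to render "previous", "next" and numbered page buttons.
    public static class ResultPager
    {
        public const int PageSize = 10; // 10-50 keeps rendering fast

        public static IList<T> Page<T>(IList<T> results, int page, out int pageCount)
        {
            pageCount = Math.Max(1, (results.Count + PageSize - 1) / PageSize);
            page = Math.Min(Math.Max(page, 1), pageCount); // clamp to valid range
            return results.Skip((page - 1) * PageSize).Take(PageSize).ToList();
        }
    }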
Original issue reported on code.google.com by [email protected]
on 3 Oct 2010 at 7:53
There are two unused variables for counting and limiting the number of active
requests in the local proxy. A connection manager needs to be implemented to do
this; whether it is a separate class or an internal set of methods is up to the
implementer.
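One way to implement the limiting side, sketched with a counting semaphore (the class name is hypothetical):

    using System.Threading;

    // Hypothetical connection manager: counts active requests and blocks
    // new ones once the configured maximum is reached.
    public class ConnectionManager
    {
        private readonly SemaphoreSlim _slots;
        private int _active;

        public ConnectionManager(int maxActiveRequests)
        {
            _slots = new SemaphoreSlim(maxActiveRequests, maxActiveRequests);
        }

        public int ActiveRequests { get { return _active; } }

        // Call before handling a request; blocks until a slot is free.
        public void Acquire()
        {
            _slots.Wait();
            Interlocked.Increment(ref _active);
        }

        // Call when the request finishes (e.g. in a finally block).
        public void Release()
        {
            Interlocked.Decrement(ref _active);
            _slots.Release();
        }
    }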
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 7:13
Previously this code was meant simply to use RuralCafe to perform some
measurements and gather statistics on Google result pages: counting the number
of objects and links, and getting the download times of various things.
This code needs to be almost completely rewritten to gather interesting metrics
for system performance and client usage, e.g. the number of embedded links on a
page, number of links, size, etc.
Search for comments with "benchmarking" in the source to remove the old code
and to get an idea of what kinds of metrics would be interesting. These
statistics could be useful later down the road.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 5:46
This method seems overly complex and long.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 8:44
The URI "http://192.168.178.20:9080/Disasters/" led to the following error in
Lucene.Net. Running the same server on port 80 did not produce any error.
Local Proxy: 8651 02.05.2013 15:38:22 error handling request:
http://192.168.178.20:9080/Disasters/ Lucene.Net.QueryParsers.ParseException:
Cannot parse '(http://192.168.178.20:9080/Disasters/)': Encountered " ":" ": "" at line 1, column 22.
Was expecting one of:
<AND> ...
<OR> ...
<NOT> ...
"+" ...
"-" ...
"(" ...
")" ...
"*" ...
"^" ...
<QUOTED> ...
<TERM> ...
<FUZZY_SLOP> ...
<PREFIXTERM> ...
<WILDTERM> ...
"[" ...
"{" ...
<NUMBER> ...
at Lucene.Net.QueryParsers.QueryParser.Parse(String query) in c:\Users\Satia\Documents\Visual Studio 2012\Projects\Rural Cafe\trunk\Lucene.Net\QueryParser\QueryParser.cs:line 238.
at RuralCafe.IndexWrapper.Query(String indexPath, String queryString) in c:\Users\Satia\Documents\Visual Studio 2012\Projects\Rural Cafe\trunk\RuralCafe\IndexWrapper.cs:line 109.
at RuralCafe.LocalRequestHandler.GetMimeType(String uri) in c:\Users\Satia\Documents\Visual Studio 2012\Projects\Rural Cafe\trunk\RuralCafe\LocalRequestHandler.cs:line 1502.
at RuralCafe.LocalRequestHandler.HandleRequest() in c:\Users\Satia\Documents\Visual Studio 2012\Projects\Rural Cafe\trunk\RuralCafe\LocalRequestHandler.cs:line 149.
at RuralCafe.RequestHandler.Go() in c:\Users\Satia\Documents\Visual Studio 2012\Projects\Rural Cafe\trunk\RuralCafe\RequestHandler.cs:line 304.
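The colon is a Lucene field separator, so raw URIs need escaping before they reach the parser. Lucene.Net's QueryParser has a static Escape helper for exactly this; a sketch (constructor details vary across Lucene.Net versions):

    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Search;

    // Escape Lucene metacharacters (':', '/', '(', ')', ...) so a raw URI
    // is parsed as a literal term instead of throwing a ParseException.
    public static Query ParseUriQuery(string field, string rawUri)
    {
        var parser = new QueryParser(field, new StandardAnalyzer());
        return parser.Parse(QueryParser.Escape(rawUri));
    }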
Original issue reported on code.google.com by [email protected]
on 2 May 2013 at 1:41
The Gzip wrapper is being used to compress and decompress between the proxies;
bzip2 is being used to compress the compressible files in the cache. I believe
the wiki renderer is also using bzip2, but with its own library?
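For reference, a round-trip through .NET's built-in GZipStream; a sketch of what the inter-proxy wrapper presumably does, not the actual wrapper code:

    using System.IO;
    using System.IO.Compression;

    // Round-trip helpers mirroring the inter-proxy compression step.
    public static class GzipWrapper
    {
        public static byte[] Compress(byte[] data)
        {
            using (var output = new MemoryStream())
            {
                using (var gzip = new GZipStream(output, CompressionMode.Compress))
                {
                    gzip.Write(data, 0, data.Length);
                }
                return output.ToArray(); // gzip is closed, so data is flushed
            }
        }

        public static byte[] Decompress(byte[] data)
        {
            using (var input = new MemoryStream(data))
            using (var gzip = new GZipStream(input, CompressionMode.Decompress))
            using (var output = new MemoryStream())
            {
                gzip.CopyTo(output);
                return output.ToArray();
            }
        }
    }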
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 8:47
Currently the returned 404 error page does not allow for this; see
SendErrorPage() in RequestHandler.cs.
Original issue reported on code.google.com by [email protected]
on 5 May 2012 at 9:04
There is no installer tool for RuralCafe. One would help significantly in
easing the setup of RuralCafe. As part of the installer, some help with
setting up the configuration file would be good.
Original issue reported on code.google.com by [email protected]
on 3 Oct 2010 at 8:36
Currently other requests are not being handled properly. They are just streamed
through without any filtering or connection management. At a minimum, things
like POST should work.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 7:47
IndexWrapper.cs contains some deprecated squid index loading code. This should
be split into its own class.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 6:24
There need to be eviction strategies on both the local and remote proxy.
A first implementation could simply use FIFO or something similar on both
proxies, but later we will have to think about hierarchical caching and
therefore different strategies for the two proxies.
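A first cut at the FIFO variant (the type is hypothetical; it tracks sizes so eviction works on a byte budget rather than an item count):

    using System.Collections.Generic;

    // Hypothetical FIFO eviction: track insertion order and evict the
    // oldest cache entries until the cache fits under its size budget.
    // Simplified: assumes each key is inserted once.
    public class FifoEviction
    {
        private readonly Queue<string> _order = new Queue<string>(); // oldest first
        private readonly Dictionary<string, long> _sizes = new Dictionary<string, long>();
        private long _totalBytes;

        public void OnInsert(string key, long sizeBytes)
        {
            _order.Enqueue(key);
            _sizes[key] = sizeBytes;
            _totalBytes += sizeBytes;
        }

        // Yields the keys to delete from the cache; enumerate fully to
        // actually bring the total back under the budget.
        public IEnumerable<string> EvictUntil(long maxBytes)
        {
            while (_totalBytes > maxBytes && _order.Count > 0)
            {
                string oldest = _order.Dequeue();
                _totalBytes -= _sizes[oldest];
                _sizes.Remove(oldest);
                yield return oldest;
            }
        }
    }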
Original issue reported on code.google.com by [email protected]
on 21 Apr 2013 at 10:39
Need to rethink and re-implement the user interface for all the various
configurations. Something integrated that works with gracefully degrading
performance based on the network.
Really the hooks are what need to be in place. Compatibility with a Firefox
extension would also be important to think about if that's going to be one of
the UI options.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 8:32
An (at first rudimentary) detection of the network status (online, slow,
offline) is needed.
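A rudimentary probe could time a connection attempt to the remote proxy and bucket the result; a sketch (the class name and thresholds are arbitrary placeholders):

    using System;
    using System.Diagnostics;
    using System.Net.Sockets;

    public enum NetworkStatus { Online, Slow, Offline }

    // Hypothetical detector: try a TCP connect to the remote proxy and
    // classify by how long it takes.
    public static class NetworkProbe
    {
        public static NetworkStatus Detect(string remoteProxyHost, int remoteProxyPort)
        {
            var watch = Stopwatch.StartNew();
            try
            {
                using (var client = new TcpClient())
                {
                    var result = client.BeginConnect(remoteProxyHost, remoteProxyPort, null, null);
                    if (!result.AsyncWaitHandle.WaitOne(TimeSpan.FromSeconds(10)))
                        return NetworkStatus.Offline; // no answer at all
                    client.EndConnect(result);
                }
            }
            catch (SocketException)
            {
                return NetworkStatus.Offline;
            }
            return watch.ElapsedMilliseconds < 2000
                ? NetworkStatus.Online
                : NetworkStatus.Slow;
        }
    }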
Original issue reported on code.google.com by [email protected]
on 26 Apr 2013 at 9:30
What steps will reproduce the problem?
1. Run another server on localhost, on another port
2. Request it in the browser (explicitly type localhost!)
3. Crash!
For localhost requests, for whatever reason, the remote proxy is not triggered.
The local proxy gets the result directly from the other server, bypassing the
remote proxy. It then tries to unpack a package that was never sent by the
remote proxy => crash.
You can still test the local server by typing your local IP address instead of
localhost. Nevertheless, the program shouldn't crash just because someone types
localhost in the browser.
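A guard on the local proxy could detect loopback targets up front and skip the unpack path for them; a sketch (the helper is hypothetical):

    using System;
    using System.Linq;
    using System.Net;

    // Hypothetical guard: requests whose host resolves to a loopback
    // address never went through the remote proxy, so the local proxy
    // must not try to unpack a remote-proxy package for them.
    public static class LoopbackGuard
    {
        public static bool IsLoopbackRequest(Uri uri)
        {
            if (uri.IsLoopback) return true; // "localhost", 127.0.0.1, ::1
            try
            {
                return Dns.GetHostAddresses(uri.Host).Any(IPAddress.IsLoopback);
            }
            catch (System.Net.Sockets.SocketException)
            {
                return false; // unresolvable; let the normal path handle it
            }
        }
    }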
Original issue reported on code.google.com by [email protected]
on 26 Apr 2013 at 5:25
This method is used to decide whether to add a page to a package based on the
richness filtering rules.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 8:42
Currently each item (image, script, HTML page) is considered independent. The
cache size can be limited, and an LRU strategy is then applied to evict items
from the cache.
This can destroy a website's appearance if embedded objects get deleted but the
website doesn't. Or, in the reverse situation, space is wasted until the
embedded objects get deleted, too.
We would need a new database table for websites, with a many-to-many
association (one website can have zero to several embedded objects, which
should include the HTML page here; one object is embedded in at least one page).
Then, when evicting, whole websites should get deleted, but only those embedded
objects that are not currently embedded in another page.
For packages from the remote proxy it is easy to see which items are embedded,
as all items in the package are embedded in the page. But for streaming, or
when downloading at the remote side, this is rather difficult.
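A sketch of the association and the eviction rule, kept in memory here for clarity (in practice this would be the proposed database table; all names hypothetical):

    using System.Collections.Generic;

    // Hypothetical many-to-many index: which objects (including the HTML
    // page itself) belong to which website, and on how many websites each
    // object is currently embedded.
    public class EmbeddingIndex
    {
        private readonly Dictionary<string, HashSet<string>> _siteToObjects =
            new Dictionary<string, HashSet<string>>();
        private readonly Dictionary<string, int> _refCount =
            new Dictionary<string, int>();

        public void AddEmbedding(string site, string objectUrl)
        {
            HashSet<string> objs;
            if (!_siteToObjects.TryGetValue(site, out objs))
                _siteToObjects[site] = objs = new HashSet<string>();
            int n;
            if (objs.Add(objectUrl))
                _refCount[objectUrl] = _refCount.TryGetValue(objectUrl, out n) ? n + 1 : 1;
        }

        // Evicting a whole website: returns the objects that are now
        // orphaned (embedded on no other page) and may be deleted.
        public List<string> EvictSite(string site)
        {
            var orphans = new List<string>();
            HashSet<string> objs;
            if (!_siteToObjects.TryGetValue(site, out objs)) return orphans;
            foreach (var obj in objs)
            {
                if (--_refCount[obj] == 0)
                {
                    _refCount.Remove(obj);
                    orphans.Add(obj);
                }
            }
            _siteToObjects.Remove(site);
            return orphans;
        }
    }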
Original issue reported on code.google.com by [email protected]
on 24 Sep 2013 at 9:23
The search results page from RuralCafe sometimes returns 404 pages; these
should be filtered out and removed from the Lucene index.
The 404 pages could possibly also be blacklisted (in the crawler only), since
they're obviously unable to be cached. The only concern about blacklisting them
completely is that if the dynamic caching improves later, the pages could be
re-included, so they should probably be differentiated somehow if this is to be
done.
Original issue reported on code.google.com by [email protected]
on 3 Oct 2010 at 8:53
Caching dynamic content is going to be a bit tricky. There are several research
papers about this on the server side, aimed at reducing load.
The basic purpose is for cases where the local proxy is behind an intermittent
or very slow connection and pages aren't rendered properly; having the entire
page cached, including (possibly stale) dynamic content, means the page will be
displayed more closely to how the web designer intended. Currently, a lot of
pages are garbled because the dynamic objects aren't in the cache.
This is going to be a bit challenging to implement, and ideas are welcome.
Original issue reported on code.google.com by [email protected]
on 3 Oct 2010 at 8:05
This requires tunneling requests through a gateway proxy from the local proxy
to the remote proxy, which means rewriting the interface between the local and
remote proxies.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 7:16
TroTro needs a signup page and the proxies need routines to handle such a
request.
Original issue reported on code.google.com by [email protected]
on 24 Apr 2013 at 7:40
Some kind of automatic update mechanism would be nice. An installer that
uninstalls the old version and installs the new one would be one solution.
A solution that lets you do this remotely, with automatic SVN updates, etc.,
would be nicer.
Original issue reported on code.google.com by [email protected]
on 8 Jul 2013 at 6:36
Make the requests from the local proxy to the remote proxy asynchronous. This
means the remote proxy will answer immediately with 200 OK after checking that
the request is valid. The remote proxy will then download the page and
afterwards make its own HTTP POST request to the local proxy that issued the
original request, "uploading" the page as POST data.
If both proxies retry until they get an HTTP response (the local proxy only
while the status is not offline, and the remote proxy only a couple of times
and/or at longer intervals), this ensures that pages reach the local proxy even
across short network outages.
This solution would need a listener for responses from the remote proxy on the
local side, and some kind of request/response management on the remote side so
that the HTTP POST requests can be made, and made to the correct local proxy.
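A sketch of the remote-proxy side of this, using era-appropriate HttpWebRequest and hypothetical names; the retry loops described above would wrap this call:

    using System;
    using System.IO;
    using System.Net;

    // Hypothetical callback sender on the remote proxy: after the page has
    // been downloaded, POST it back to the local proxy that requested it.
    public static class AsyncResultUploader
    {
        public static void PostResult(string localProxyCallbackUrl, string requestId, byte[] page)
        {
            var request = (HttpWebRequest)WebRequest.Create(
                localProxyCallbackUrl + "?id=" + Uri.EscapeDataString(requestId));
            request.Method = "POST";
            request.ContentType = "application/octet-stream";
            request.ContentLength = page.Length;
            using (Stream body = request.GetRequestStream())
            {
                body.Write(page, 0, page.Length);
            }
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                // 200 OK from the local proxy means the page was delivered;
                // anything else would be retried a couple of times.
            }
        }
    }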
Original issue reported on code.google.com by [email protected]
on 24 Sep 2013 at 7:44
The original idea for this was to save disk space, but now it is just consuming
processing at the proxy. Trading CPU for disk space is not the direction we
want to go. The difficulty, though, is that the existing cache will become
obsolete. We will need to either convert existing proxies to support backward
compatibility or just break everything.
Also, whatever is done needs to be resynced with what's in the CIP crawler
code. This will be less of a problem once that code is integrated into this
project.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 8:27
There needs to be some amount of profiling for search and page requests in
RuralCafe to determine where the bottlenecks are. The overall speed for those
requests is pretty terrible at the moment.
Original issue reported on code.google.com by [email protected]
on 3 Oct 2010 at 8:33
Currently there is no such support. I believe some of the dummy functions are
implemented in the remote proxy code, but there is no way for the local proxy
to tell the remote proxy to do so once the request has already been removed at
the local proxy.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 7:19
Revamp of all logging and debug messaging. The messages, and the places where
things are logged, are not necessarily consistent or useful. A more general
problem, but related to Issue 18.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 9:00
Not super important immediately, but for deployment purposes: if one entity
would like to run a remote proxy and support multiple clients, that should be
possible given sufficient resources.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 7:21
At the moment only the page content is stored in the cache. The response
headers should be included, too.
This has to play nicely with the clustering code.
It is also going to be challenging for dynamic content: e.g. the headers and/or
content might vary per user once POST requests work properly.
POST response headers will also have to be treated differently; e.g. a
Set-Cookie header cannot be put in the cache (at least not made available to
everyone).
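A sketch of the filtering step, dropping hop-by-hop and user-specific headers before the rest goes into the cache (the exclusion list is an assumption, not a spec):

    using System;
    using System.Collections.Generic;
    using System.Net;

    // Hypothetical filter: copy response headers into a cacheable
    // dictionary, skipping hop-by-hop headers and user-specific ones
    // such as Set-Cookie.
    public static class CacheableHeaders
    {
        private static readonly HashSet<string> Excluded = new HashSet<string>(
            StringComparer.OrdinalIgnoreCase)
        {
            "Set-Cookie", "Connection", "Keep-Alive", "Transfer-Encoding",
            "Proxy-Authenticate", "Upgrade", "Trailer", "TE"
        };

        public static Dictionary<string, string> Filter(WebHeaderCollection headers)
        {
            var cacheable = new Dictionary<string, string>();
            foreach (string name in headers.AllKeys)
            {
                if (!Excluded.Contains(name))
                    cacheable[name] = headers[name];
            }
            return cacheable;
        }
    }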
Original issue reported on code.google.com by [email protected]
on 8 Jul 2013 at 6:41
A snippet of each result on the search results page would help a lot to avoid
dead-end links. Currently the search results just show the result URL and the
page title.
Since the Lucene index doesn't store the content, implementing this would
probably require opening the actual page in the cache and returning the snippet
containing the search terms. This would be pretty slow for a lot of results; if
we also reduce the number of results returned to at most 10 or 50, it might be
OK.
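The extraction over the cached file could look like this sketch (names are hypothetical; a real version would strip HTML tags first):

    using System;

    // Hypothetical snippet extractor: find the first search term in the
    // cached page text and return a window of characters around it.
    public static class Snippets
    {
        public static string Around(string pageText, string term, int radius = 80)
        {
            int hit = pageText.IndexOf(term, StringComparison.OrdinalIgnoreCase);
            if (hit < 0)
                return pageText.Substring(0, Math.Min(2 * radius, pageText.Length));
            int start = Math.Max(0, hit - radius);
            int end = Math.Min(pageText.Length, hit + term.Length + radius);
            return "..." + pageText.Substring(start, end - start) + "...";
        }
    }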
Original issue reported on code.google.com by [email protected]
on 3 Oct 2010 at 6:44
What steps will reproduce the problem?
1. Copying the config file to C:\cygwin\home\Aaron Lynch\ruralcafe\RuralCafe\obj\Release
2. Starting up the .exe file
3. Debugging
What is the expected output? What do you see instead?
Hoping to test out the program; instead the program stops launching after
"Loading LDC Data 1 Grams".
Please use labels and text to provide additional information.
Debugger says FileNotFoundException:
C:\cygwin\home\Aaron Lynch\ruralcafe-read-only\RuralCafe\obj\Release\config.txt
Original issue reported on code.google.com by [email protected]
on 26 Apr 2013 at 7:48
1. Searching for wiki pages that are in the dump does not seem to work.
2. For some pages only a redirect page is found. E.g. when you enter
http://en.wikipedia.org/wiki/Germany it finds "GerMany", which only redirects
to "Germany", but "Germany" itself is not found.
3. The HTML for the dump pages is awful. There is no formatting, and code
pieces that clearly shouldn't be visible are. MzReader (which builds upon
BzReader) could be an option.
Original issue reported on code.google.com by [email protected]
on 6 Jul 2013 at 10:17
The current implementation is not very clean or accurate. It is not accurate
because the uncompressed size is being used; the compressed sizes should be
used instead.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 8:38
Since we rolled our own, this method is not very robust. We probably want to
borrow an open source library for this. It somewhat ties into the caching of
dynamic pages (Issue 3).
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 8:43
Package.cs - needs to be rewritten somewhat with a new local/remote proxy
implementation. Related to Issue 12 and Issue 13.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 8:57