o0111 / ruralcafe
Automatically exported from code.google.com/p/ruralcafe
Currently these redirects are causing duplicates. This is really caused by the way the
local proxy interacts with the client browser.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 8:00
Not sure if this is in there yet, but this method should work through a proxy.
The information should be in the RCRequest and either be passed in or simply accessed
from the requestHandler information.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 8:36
The status codes are not very useful across the source files. A bunch of
different classes, and at least the two handlers, are using them for different
purposes, so there isn't really any consistency and it's not clear what the
different codes mean, either for the code logic or in the logging for analysis.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 7:34
The ELF extension is mostly a logging and reporting Firefox extension that does
prefetching and caching entirely within the browser, but one other thing it
does is highlight the links that are cached, so that users don't end up
clicking on a bunch of dead links and getting frustrated.
The "defect" here is that the CIP/RuralCafe code doesn't do any kind of
cached-object link highlighting. Either rewriting the page (or adding CSS for
those pages) so that cached links are highlighted this way, or adding hooks so
the ELF extension can ask RuralCafe whether pages are cached, would fix this
usability problem.
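A minimal sketch of the rewriting approach, assuming a hypothetical cache lookup callback (the class and CSS names here are illustrative, not part of the codebase):

    using System;
    using System.Text.RegularExpressions;

    // Hypothetical rewriter: tags anchors whose target is cached with a CSS
    // class, so an injected rule like a.rc-cached { color: #060; } can make
    // them visually distinct. Simplified: assumes double-quoted hrefs and no
    // pre-existing class attribute on the anchor.
    public static class LinkHighlighter
    {
        private static readonly Regex Anchor =
            new Regex("<a\\s+([^>]*?)href=\"([^\"]+)\"", RegexOptions.IgnoreCase);

        public static string Highlight(string html, Func<string, bool> isCached)
        {
            return Anchor.Replace(html, m =>
                isCached(m.Groups[2].Value)
                    ? "<a class=\"rc-cached\" " + m.Groups[1].Value +
                      "href=\"" + m.Groups[2].Value + "\""
                    : m.Value);
        }
    }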
Original issue reported on code.google.com by [email protected]
on 3 Oct 2010 at 8:18
TroTro doesn't allow the richness to be changed. There needs to be a default
setting, a per-user setting, and the possibility to change the current richness
on the TroTro page.
Original issue reported on code.google.com by [email protected]
on 21 Apr 2013 at 10:34
When pages are not found in the wiki dump the handler should fall back to the
caching logic.
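A sketch of the intended control flow (both callees are hypothetical names, not the real APIs):

    // Hypothetical fallback: serve from the wiki dump when possible,
    // otherwise run the existing caching logic.
    public string ServeWikiOrCache(string uri)
    {
        string page = LookupInWikiDump(uri);        // returns null on a miss
        return page ?? ServeFromCacheOrRemote(uri); // the caching path
    }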
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 7:57
Blacklist reporting in the UI, so that we can filter a site out for the current
user immediately, log it, and check later for removal by the administrator.
So there'd be two pieces to this.
First, the user clicks on something that tells RuralCafe that the domain or
page is bad, and they're allowed to enter a reason or use a form interface for
this. RuralCafe will record the client IP address and add the domain/URL to a
data structure that's soft-state for the IP address until shutdown or some timeout.
Second, the blacklist requests are all aggregated together for perusal by the
administrator, either in a file or something similar, and also in logs so we
can improve our crawler.
It would also be nice if the admin had a special version of the user interface
that does basically the same thing; we could specify the admin's UI in
config.txt and then allow the admin to edit/view the existing blacklist and the
blacklist pages suggested by other clients.
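A minimal sketch of the per-client soft-state structure, with timestamped entries that expire after a timeout (all names are hypothetical):

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;

    // Hypothetical soft-state store for blacklist suggestions, keyed by
    // client IP. Entries expire after a timeout; everything is lost on
    // shutdown by design.
    public class BlacklistSuggestions
    {
        private class Entry { public string Url; public string Reason; public DateTime When; }

        private readonly TimeSpan _timeout;
        private readonly ConcurrentDictionary<string, List<Entry>> _byClient =
            new ConcurrentDictionary<string, List<Entry>>();

        public BlacklistSuggestions(TimeSpan timeout) { _timeout = timeout; }

        public void Report(string clientIp, string url, string reason)
        {
            var list = _byClient.GetOrAdd(clientIp, _ => new List<Entry>());
            lock (list)
            {
                list.Add(new Entry { Url = url, Reason = reason, When = DateTime.UtcNow });
            }
        }

        // True if this client reported the URL and the report has not expired.
        public bool IsSuggested(string clientIp, string url)
        {
            List<Entry> list;
            if (!_byClient.TryGetValue(clientIp, out list)) return false;
            lock (list)
            {
                list.RemoveAll(e => DateTime.UtcNow - e.When > _timeout);
                return list.Exists(e => e.Url == url);
            }
        }
    }

Periodically dumping all entries to a file or the logs would cover the administrator-facing aggregation piece.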
Original issue reported on code.google.com by [email protected]
on 3 Oct 2010 at 8:27
The number of search results on the search results page is too high; it slows
down rendering and will get worse when the content snippets are added.
Once the number of results returned is reduced, there should also be buttons
for the nth page and for the next and previous pages.
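The paging itself is simple skip/take arithmetic; a sketch assuming a flat result list (the class name and page size are placeholders):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical pager: returns one page of results plus the page count
    // needed to render "previous", "next" and numbered page buttons.
    public static class ResultPager
    {
        public const int PageSize = 10; // 10-50 keeps rendering fast

        public static IList<T> Page<T>(IList<T> results, int page, out int pageCount)
        {
            pageCount = Math.Max(1, (results.Count + PageSize - 1) / PageSize);
            page = Math.Min(Math.Max(page, 1), pageCount); // clamp to valid range
            return results.Skip((page - 1) * PageSize).Take(PageSize).ToList();
        }
    }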
Original issue reported on code.google.com by [email protected]
on 3 Oct 2010 at 7:53
There are two unused variables for counting and limiting the number of active
requests in the local proxy. A connection manager needs to be implemented to do
this; whether it is a separate class or an internal set of methods is up to the
implementer.
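One way to implement the limiting side, sketched with a counting semaphore (the class name is hypothetical):

    using System.Threading;

    // Hypothetical connection manager: counts active requests and blocks
    // new ones once the configured maximum is reached.
    public class ConnectionManager
    {
        private readonly SemaphoreSlim _slots;
        private int _active;

        public ConnectionManager(int maxActiveRequests)
        {
            _slots = new SemaphoreSlim(maxActiveRequests, maxActiveRequests);
        }

        public int ActiveRequests { get { return _active; } }

        // Call before handling a request; blocks until a slot is free.
        public void Acquire()
        {
            _slots.Wait();
            Interlocked.Increment(ref _active);
        }

        // Call when the request finishes (e.g. in a finally block).
        public void Release()
        {
            Interlocked.Decrement(ref _active);
            _slots.Release();
        }
    }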
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 7:13
Previously this code was meant simply to use RuralCafe to perform some
measurements and gather statistics on Google result pages: counting the number
of objects and links, and getting the download times of various things.
This code needs to be almost completely rewritten to gather interesting metrics
for system performance and client usage, e.g. the number of embedded links on a
page, number of links, size, etc.
Search for comments with "benchmarking" in the source to remove the old code
and to get an idea of what kinds of metrics would be interesting. These
statistics could be useful later down the road.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 5:46
This method seems overly complex and long.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 8:44
The URI "http://192.168.178.20:9080/Disasters/" led to the following error in
Lucene.Net. Running the same server on port 80 did not produce any error.
Local Proxy: 8651 02.05.2013 15:38:22 error handling request:
http://192.168.178.20:9080/Disasters/ Lucene.Net.QueryParsers.ParseException:
Cannot parse '(http://192.168.178.20:9080/Disasters/)': Encountered " ":" ": "" at line 1, column 22.
Was expecting one of:
<AND> ...
<OR> ...
<NOT> ...
"+" ...
"-" ...
"(" ...
")" ...
"*" ...
"^" ...
<QUOTED> ...
<TERM> ...
<FUZZY_SLOP> ...
<PREFIXTERM> ...
<WILDTERM> ...
"[" ...
"{" ...
<NUMBER> ...
at Lucene.Net.QueryParsers.QueryParser.Parse(String query) in c:\Users\Satia\Documents\Visual Studio 2012\Projects\Rural Cafe\trunk\Lucene.Net\QueryParser\QueryParser.cs:line 238.
at RuralCafe.IndexWrapper.Query(String indexPath, String queryString) in c:\Users\Satia\Documents\Visual Studio 2012\Projects\Rural Cafe\trunk\RuralCafe\IndexWrapper.cs:line 109.
at RuralCafe.LocalRequestHandler.GetMimeType(String uri) in c:\Users\Satia\Documents\Visual Studio 2012\Projects\Rural Cafe\trunk\RuralCafe\LocalRequestHandler.cs:line 1502.
at RuralCafe.LocalRequestHandler.HandleRequest() in c:\Users\Satia\Documents\Visual Studio 2012\Projects\Rural Cafe\trunk\RuralCafe\LocalRequestHandler.cs:line 149.
at RuralCafe.RequestHandler.Go() in c:\Users\Satia\Documents\Visual Studio 2012\Projects\Rural Cafe\trunk\RuralCafe\RequestHandler.cs:line 304.
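The colon is a Lucene field separator, so raw URIs need escaping before they reach the parser. Lucene.Net's QueryParser has a static Escape helper for exactly this; a sketch (constructor details vary across Lucene.Net versions):

    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Search;

    // Escape Lucene metacharacters (':', '/', '(', ')', ...) so a raw URI
    // is parsed as a literal term instead of throwing a ParseException.
    public static Query ParseUriQuery(string field, string rawUri)
    {
        var parser = new QueryParser(field, new StandardAnalyzer());
        return parser.Parse(QueryParser.Escape(rawUri));
    }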
Original issue reported on code.google.com by [email protected]
on 2 May 2013 at 1:41
The Gzip wrapper is being used to compress and decompress between the proxies;
bzip2 is being used to compress the compressible files in the cache. I believe
the wiki renderer is also using bzip2, but with its own library?
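For reference, a round-trip through .NET's built-in GZipStream; a sketch of what the inter-proxy wrapper presumably does, not the actual wrapper code:

    using System.IO;
    using System.IO.Compression;

    // Round-trip helpers mirroring the inter-proxy compression step.
    public static class GzipWrapper
    {
        public static byte[] Compress(byte[] data)
        {
            using (var output = new MemoryStream())
            {
                using (var gzip = new GZipStream(output, CompressionMode.Compress))
                {
                    gzip.Write(data, 0, data.Length);
                }
                return output.ToArray(); // gzip is closed, so data is flushed
            }
        }

        public static byte[] Decompress(byte[] data)
        {
            using (var input = new MemoryStream(data))
            using (var gzip = new GZipStream(input, CompressionMode.Decompress))
            using (var output = new MemoryStream())
            {
                gzip.CopyTo(output);
                return output.ToArray();
            }
        }
    }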
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 8:47
Currently the returned 404 error page does not allow for this; see
SendErrorPage() in RequestHandler.cs.
Original issue reported on code.google.com by [email protected]
on 5 May 2012 at 9:04
There is no installer tool for RuralCafe. One would help significantly in
easing the setup of RuralCafe. As part of the installer, some help with
setting up the configuration file would be good.
Original issue reported on code.google.com by [email protected]
on 3 Oct 2010 at 8:36
Currently other requests are not being handled properly. They are just streamed
through without any filtering or connection management. At a minimum, things
like POST should work.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 7:47
IndexWrapper.cs contains some deprecated squid index loading code. This should
be split into its own class.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 6:24
There need to be eviction strategies on both the local and remote proxy.
A first implementation could simply use FIFO or something similar on both
proxies, but later we will have to think about hierarchical caching and
therefore different strategies for the two proxies.
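A first cut at the FIFO variant (the type is hypothetical; it tracks sizes so eviction works on a byte budget rather than an item count):

    using System.Collections.Generic;

    // Hypothetical FIFO eviction: track insertion order and evict the
    // oldest cache entries until the cache fits under its size budget.
    // Simplified: assumes each key is inserted once.
    public class FifoEviction
    {
        private readonly Queue<string> _order = new Queue<string>(); // oldest first
        private readonly Dictionary<string, long> _sizes = new Dictionary<string, long>();
        private long _totalBytes;

        public void OnInsert(string key, long sizeBytes)
        {
            _order.Enqueue(key);
            _sizes[key] = sizeBytes;
            _totalBytes += sizeBytes;
        }

        // Yields the keys to delete from the cache; enumerate fully to
        // actually bring the total back under the budget.
        public IEnumerable<string> EvictUntil(long maxBytes)
        {
            while (_totalBytes > maxBytes && _order.Count > 0)
            {
                string oldest = _order.Dequeue();
                _totalBytes -= _sizes[oldest];
                _sizes.Remove(oldest);
                yield return oldest;
            }
        }
    }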
Original issue reported on code.google.com by [email protected]
on 21 Apr 2013 at 10:39
Need to rethink and re-implement the user interface for all the various
configurations. Something integrated that works with gracefully degrading
performance based on the network.
Really the hooks are what need to be in place. Compatibility with a Firefox
extension would also be important to think about if that's going to be one of
the UI options.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 8:32
An (at first rudimentary) detection of the network status (online, slow,
offline) is needed.
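A rudimentary probe could time a connection attempt to the remote proxy and bucket the result; a sketch (the class name and thresholds are arbitrary placeholders):

    using System;
    using System.Diagnostics;
    using System.Net.Sockets;

    public enum NetworkStatus { Online, Slow, Offline }

    // Hypothetical detector: try a TCP connect to the remote proxy and
    // classify by how long it takes.
    public static class NetworkProbe
    {
        public static NetworkStatus Detect(string remoteProxyHost, int remoteProxyPort)
        {
            var watch = Stopwatch.StartNew();
            try
            {
                using (var client = new TcpClient())
                {
                    var result = client.BeginConnect(remoteProxyHost, remoteProxyPort, null, null);
                    if (!result.AsyncWaitHandle.WaitOne(TimeSpan.FromSeconds(10)))
                        return NetworkStatus.Offline; // no answer at all
                    client.EndConnect(result);
                }
            }
            catch (SocketException)
            {
                return NetworkStatus.Offline;
            }
            return watch.ElapsedMilliseconds < 2000
                ? NetworkStatus.Online
                : NetworkStatus.Slow;
        }
    }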
Original issue reported on code.google.com by [email protected]
on 26 Apr 2013 at 9:30
What steps will reproduce the problem?
1. Run another server on localhost, on another port
2. Request it in the browser (explicitly type localhost!)
3. Crash!
For localhost requests, for whatever reason, the remote proxy is not triggered.
The local proxy gets the result directly from the other server, bypassing the
remote proxy. It then tries to unpack a package that was never sent by the
remote proxy => crash.
You can still test the local server by typing your local IP address instead of
localhost. Nevertheless, the program shouldn't crash just because someone types
localhost in the browser.
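A guard on the local proxy could detect loopback targets up front and skip the unpack path for them; a sketch (the helper is hypothetical):

    using System;
    using System.Linq;
    using System.Net;

    // Hypothetical guard: requests whose host resolves to a loopback
    // address never went through the remote proxy, so the local proxy
    // must not try to unpack a remote-proxy package for them.
    public static class LoopbackGuard
    {
        public static bool IsLoopbackRequest(Uri uri)
        {
            if (uri.IsLoopback) return true; // "localhost", 127.0.0.1, ::1
            try
            {
                return Dns.GetHostAddresses(uri.Host).Any(IPAddress.IsLoopback);
            }
            catch (System.Net.Sockets.SocketException)
            {
                return false; // unresolvable; let the normal path handle it
            }
        }
    }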
Original issue reported on code.google.com by [email protected]
on 26 Apr 2013 at 5:25
This method is used to decide whether to add a page to a package based on the
richness filtering rules.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 8:42
Currently each item (image, script, HTML page) is considered independent. The
cache size can be limited, and an LRU strategy is then applied to evict items
from the cache.
This can destroy a website's appearance if embedded objects get deleted but the
website doesn't. Or, in the reverse situation, space is wasted until the
embedded objects get deleted, too.
We would need a new database table for websites, with a many-to-many
association (one website can have zero to several embedded objects, which
should include the HTML page here; one object is embedded in at least one page).
Then, when evicting, whole websites should get deleted, but only those embedded
objects that are not currently embedded in another page.
For packages from the remote proxy it is easy to see which items are embedded,
as all items in the package are embedded in the page. But for streaming, or
when downloading at the remote side, this is rather difficult.
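A sketch of the association and the eviction rule, kept in memory here for clarity (in practice this would be the proposed database table; all names hypothetical):

    using System.Collections.Generic;

    // Hypothetical many-to-many index: which objects (including the HTML
    // page itself) belong to which website, and on how many websites each
    // object is currently embedded.
    public class EmbeddingIndex
    {
        private readonly Dictionary<string, HashSet<string>> _siteToObjects =
            new Dictionary<string, HashSet<string>>();
        private readonly Dictionary<string, int> _refCount =
            new Dictionary<string, int>();

        public void AddEmbedding(string site, string objectUrl)
        {
            HashSet<string> objs;
            if (!_siteToObjects.TryGetValue(site, out objs))
                _siteToObjects[site] = objs = new HashSet<string>();
            int n;
            if (objs.Add(objectUrl))
                _refCount[objectUrl] = _refCount.TryGetValue(objectUrl, out n) ? n + 1 : 1;
        }

        // Evicting a whole website: returns the objects that are now
        // orphaned (embedded on no other page) and may be deleted.
        public List<string> EvictSite(string site)
        {
            var orphans = new List<string>();
            HashSet<string> objs;
            if (!_siteToObjects.TryGetValue(site, out objs)) return orphans;
            foreach (var obj in objs)
            {
                if (--_refCount[obj] == 0)
                {
                    _refCount.Remove(obj);
                    orphans.Add(obj);
                }
            }
            _siteToObjects.Remove(site);
            return orphans;
        }
    }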
Original issue reported on code.google.com by [email protected]
on 24 Sep 2013 at 9:23
The search results page from RuralCafe sometimes returns 404 pages; these
should be filtered out and removed from the Lucene index.
The 404 pages could possibly also be blacklisted (in the crawler only), since
they're obviously unable to be cached. The only concern about blacklisting them
completely is that if the dynamic caching improves later, the pages could be
re-included, so they should probably be differentiated somehow if this is to be
done.
Original issue reported on code.google.com by [email protected]
on 3 Oct 2010 at 8:53
Caching dynamic content is going to be a bit tricky. There are several research
papers about this on the server side, aimed at reducing load.
The basic purpose is for cases where the local proxy is behind an intermittent
or very slow connection and pages aren't rendered properly; having the entire
page cached, including (possibly stale) dynamic content, means the page will be
displayed more closely to how the web designer intended. Currently, a lot of
pages are garbled because the dynamic objects aren't in the cache.
This is going to be a bit challenging to implement, and ideas are welcome.
Original issue reported on code.google.com by [email protected]
on 3 Oct 2010 at 8:05
This requires tunneling requests through a gateway proxy from the local proxy
to the remote proxy, which means rewriting the interface between the local and
remote proxies.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 7:16
TroTro needs a signup page and the proxies need routines to handle such a
request.
Original issue reported on code.google.com by [email protected]
on 24 Apr 2013 at 7:40
Some kind of automatic update mechanism would be nice. An installer that
uninstalls the old version and installs the new one would be one solution.
A solution that lets you do this remotely, with automatic SVN updates, etc.,
would be nicer.
Original issue reported on code.google.com by [email protected]
on 8 Jul 2013 at 6:36
Make the requests from the local proxy to the remote proxy asynchronous. This
means the remote proxy will answer immediately with 200 OK after checking that
the request is valid. The remote proxy will then download the page and
afterwards make its own HTTP POST request to the local proxy that issued the
original request, "uploading" the page as POST data.
If both proxies retry until they get an HTTP response (the local proxy only
while the status is not offline, and the remote proxy only a couple of times
and/or at longer intervals), this ensures that pages reach the local proxy even
across short network outages.
This solution would need a listener for responses from the remote proxy on the
local side, and some kind of request/response management on the remote side so
that the HTTP POST requests can be made, and made to the correct local proxy.
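A sketch of the remote-proxy side of this, using era-appropriate HttpWebRequest and hypothetical names; the retry loops described above would wrap this call:

    using System;
    using System.IO;
    using System.Net;

    // Hypothetical callback sender on the remote proxy: after the page has
    // been downloaded, POST it back to the local proxy that requested it.
    public static class AsyncResultUploader
    {
        public static void PostResult(string localProxyCallbackUrl, string requestId, byte[] page)
        {
            var request = (HttpWebRequest)WebRequest.Create(
                localProxyCallbackUrl + "?id=" + Uri.EscapeDataString(requestId));
            request.Method = "POST";
            request.ContentType = "application/octet-stream";
            request.ContentLength = page.Length;
            using (Stream body = request.GetRequestStream())
            {
                body.Write(page, 0, page.Length);
            }
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                // 200 OK from the local proxy means the page was delivered;
                // anything else would be retried a couple of times.
            }
        }
    }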
Original issue reported on code.google.com by [email protected]
on 24 Sep 2013 at 7:44
The original idea for this was to save disk space, but now it is just consuming
processing at the proxy. Trading CPU for disk space is not the direction we
want to go. The difficulty, though, is that the existing cache will become
obsolete. We will need to either convert existing proxies to support backward
compatibility or just break everything.
Also, whatever is done needs to be resynced with what's in the CIP crawler
code. This will be less of a problem once that code is integrated into this
project.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 8:27
There needs to be some amount of profiling for search and page requests in
RuralCafe to determine where the bottlenecks are. The overall speed for those
requests is pretty terrible at the moment.
Original issue reported on code.google.com by [email protected]
on 3 Oct 2010 at 8:33
Currently there is no such support. I believe some of the dummy functions are
implemented in the remote proxy code, but there is no way for the local proxy
to tell the remote proxy to do so once the request has already been removed at
the local proxy.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 7:19
Revamp of all logging and debug messaging. The messages, and the places where
things are logged, are not necessarily consistent or useful. A more general
problem, but related to Issue 18.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 9:00
Not super important immediately, but for deployment purposes: if one entity
would like to run a remote proxy and support multiple clients, that should be
possible given sufficient resources.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 7:21
At the moment only the page content is stored in the cache. The response
headers should be included, too.
This has to play nicely with the clustering code.
It is also going to be challenging for dynamic content: e.g. the headers and/or
content might vary per user once POST requests work properly.
POST response headers will also have to be treated differently; e.g. a
Set-Cookie header cannot be put in the cache (at least not made available to
everyone).
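A sketch of the filtering step, dropping hop-by-hop and user-specific headers before the rest goes into the cache (the exclusion list is an assumption, not a spec):

    using System;
    using System.Collections.Generic;
    using System.Net;

    // Hypothetical filter: copy response headers into a cacheable
    // dictionary, skipping hop-by-hop headers and user-specific ones
    // such as Set-Cookie.
    public static class CacheableHeaders
    {
        private static readonly HashSet<string> Excluded = new HashSet<string>(
            StringComparer.OrdinalIgnoreCase)
        {
            "Set-Cookie", "Connection", "Keep-Alive", "Transfer-Encoding",
            "Proxy-Authenticate", "Upgrade", "Trailer", "TE"
        };

        public static Dictionary<string, string> Filter(WebHeaderCollection headers)
        {
            var cacheable = new Dictionary<string, string>();
            foreach (string name in headers.AllKeys)
            {
                if (!Excluded.Contains(name))
                    cacheable[name] = headers[name];
            }
            return cacheable;
        }
    }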
Original issue reported on code.google.com by [email protected]
on 8 Jul 2013 at 6:41
A snippet of each result on the search results page would help a lot to avoid
dead-end links. Currently the search results just show the result URL and the
page title.
Since the Lucene index doesn't store the content, implementing this would
probably require opening the actual page in the cache and returning the snippet
containing the search terms. This would be pretty slow for a lot of results; if
we also reduce the number of results returned to at most 10 or 50, it might be
OK.
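The extraction over the cached file could look like this sketch (names are hypothetical; a real version would strip HTML tags first):

    using System;

    // Hypothetical snippet extractor: find the first search term in the
    // cached page text and return a window of characters around it.
    public static class Snippets
    {
        public static string Around(string pageText, string term, int radius = 80)
        {
            int hit = pageText.IndexOf(term, StringComparison.OrdinalIgnoreCase);
            if (hit < 0)
                return pageText.Substring(0, Math.Min(2 * radius, pageText.Length));
            int start = Math.Max(0, hit - radius);
            int end = Math.Min(pageText.Length, hit + term.Length + radius);
            return "..." + pageText.Substring(start, end - start) + "...";
        }
    }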
Original issue reported on code.google.com by [email protected]
on 3 Oct 2010 at 6:44
What steps will reproduce the problem?
1. Copying the config file to C:\cygwin\home\Aaron Lynch\ruralcafe\RuralCafe\obj\Release
2. Starting up the .exe file
3. Debugging
What is the expected output? What do you see instead?
Hoping to test out the program; instead the program stops launching after
"Loading LDC Data 1 Grams".
Please use labels and text to provide additional information.
Debugger says FileNotFoundException:
C:\cygwin\home\Aaron Lynch\ruralcafe-read-only\RuralCafe\obj\Release\config.txt
Original issue reported on code.google.com by [email protected]
on 26 Apr 2013 at 7:48
1. Searching for wiki pages that are in the dump does not seem to work.
2. For some pages only a redirect page is found. E.g. when you enter
http://en.wikipedia.org/wiki/Germany it finds "GerMany", which only redirects
to "Germany", but "Germany" itself is not found.
3. The HTML for the dump pages is awful. There is no formatting, and code
pieces that clearly shouldn't be visible are. MzReader (which builds upon
BzReader) could be an option.
Original issue reported on code.google.com by [email protected]
on 6 Jul 2013 at 10:17
The current implementation is not very clean or accurate. It is not accurate
because the uncompressed size is being used; the compressed sizes should be
used instead.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 8:38
Since we rolled our own, this method is not very robust. We probably want to
borrow an open source library for this. It somewhat ties into the caching of
dynamic pages (Issue 3).
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 8:43
Package.cs - needs to be rewritten somewhat with a new local/remote proxy
implementation. Related to Issue 12 and Issue 13.
Original issue reported on code.google.com by [email protected]
on 10 Oct 2010 at 8:57