crawlserv / crawlservpp
crawlserv++: Application for crawling and analyzing textual content of websites.
License: Other
absolute archived links with multiple protocol mentions use the last protocol mention instead of the second one
-> the wrong domain is detected (see the sketch below)
example link: http://web.archive.org/web/20090808201715/http://myweb2.search.yahoo.com/myresults/bookmarklet?u=http://www.rt.com/Best_Videos/2009-07-20/investing_in_networking__dst_ceo__yuri_milner.html&t=Investing%20in%20networking%3A%20DST%20CEO%2C%20Yuri%20Milner in http://web.archive.org/web/20090808201715/https://rt.com/Best_Videos.html
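A minimal sketch of the intended rule, assuming an archive.org-style prefix; the helper names and the simplified example URL are mine, not the project's:

```cpp
#include <algorithm>
#include <iostream>
#include <string>

// find the start of the next protocol mention ("http://" or "https://")
// at or after the given position (std::string::npos if there is none)
std::size_t findProtocol(const std::string& url, std::size_t from) {
	return std::min(url.find("http://", from), url.find("https://", from));
}

// take the SECOND protocol mention, not the last one: the original URL
// may itself contain further protocol mentions (e.g. in query parameters)
std::string extractOriginalUrl(const std::string& archivedUrl) {
	const std::size_t first = findProtocol(archivedUrl, 0);

	if(first == std::string::npos)
		return archivedUrl;

	const std::size_t second = findProtocol(archivedUrl, first + 1);

	if(second == std::string::npos)
		return archivedUrl;

	return archivedUrl.substr(second);
}

int main() {
	std::cout << extractOriginalUrl(
			"http://web.archive.org/web/20090808201715/"
			"http://example.com/page?u=http://third.example/path"
	) << '\n'; // http://example.com/page?u=http://third.example/path
}
```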
[ERROR] log() SQL Error #0 (SQLState HY000) - Value not set for all parameters [DEBUG: INSERT INTO crawlserv_log(module, entry) VALUES (?, ?)]
-> occurs after a while, probably while logging "accepted client"
-> maybe there was a connection reset and something went wrong with recovering the prepared statements? (see the sketch below)
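If the prepared-statement hypothesis holds, re-preparing all statements after a reconnect might fix it. A hedged sketch assuming MySQL Connector/C++ (sql::Connection::isValid()/reconnect()); the container and function names are hypothetical, not the project's actual interface:

```cpp
#include <memory>
#include <string>
#include <vector>

#include <cppconn/connection.h>
#include <cppconn/prepared_statement.h>

// hypothetical bookkeeping: keep the SQL text next to each handle so the
// statement can be re-created on a fresh connection
struct PreparedEntry {
	std::string sql;
	std::unique_ptr<sql::PreparedStatement> statement;
};

bool checkConnection(sql::Connection& connection,
		std::vector<PreparedEntry>& prepared) {
	if(connection.isValid())
		return true;

	if(!connection.reconnect())
		return false;

	// re-prepare ALL statements on the new connection; stale handles are
	// a plausible cause of "Value not set for all parameters"
	for(auto& entry : prepared)
		entry.statement.reset(connection.prepareStatement(entry.sql));

	return true;
}
```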
IMPORTANT: the .sql file needs to be able to use a variable like $prefix -> replaced with the new custom prefix on execution (see the sketch below)
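A small sketch of the substitution itself; the placeholder name $prefix is from the note above, the function is illustrative:

```cpp
#include <string>

// replace every occurrence of the $prefix placeholder in the loaded
// .sql file with the custom table prefix before executing it
std::string replacePrefix(std::string sqlText, const std::string& prefix) {
	const std::string placeholder("$prefix");

	std::size_t pos = 0;

	while((pos = sqlText.find(placeholder, pos)) != std::string::npos) {
		sqlText.replace(pos, placeholder.length(), prefix);

		pos += prefix.length();
	}

	return sqlText;
}
```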
change id.query and id.source in the parsing configuration to arrays and rename them to id.queries and id.sources
check URL queries first, content queries second
use any result as the ID
Both the server (ConfigParser class) and the frontend (parser.json) need to be changed; the ThreadParser class needs to be written accordingly (see the sketch below).
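A hypothetical sketch of the server-side change, assuming the configuration is parsed with rapidjson; the struct and function names are illustrative, not the actual ConfigParser interface:

```cpp
#include <cstdint>
#include <vector>

#include "rapidjson/document.h"

struct ParserConfig {
	std::vector<std::uint64_t> idQueries; // was the single id.query
	std::vector<std::uint64_t> idSources; // was the single id.source
};

// read the renamed, plural entries; every array element is one query/source
void parseIdEntries(const rapidjson::Value& json, ParserConfig& config) {
	if(json.HasMember("id.queries") && json["id.queries"].IsArray())
		for(const auto& entry : json["id.queries"].GetArray())
			if(entry.IsUint64())
				config.idQueries.push_back(entry.GetUint64());

	if(json.HasMember("id.sources") && json["id.sources"].IsArray())
		for(const auto& entry : json["id.sources"].GetArray())
			if(entry.IsUint64())
				config.idSources.push_back(entry.GetUint64());
}
```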
used by both the crawler and the extractor module, accessed by the Networking class
tbd: child class of ConfigModule?
add config/change config -> saves a lot of scrolling
extend crawlserv_analyzedtables by a format column
implement filter on server-side (including configuration .json files)
implement filter on frontend
e.g. when running out of memory or on connection loss
the thread should be halted, but not destroyed (see the sketch below)
to check: does this also apply when checkConnection fails?
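A hedged sketch of the intended behavior; the exception type and function names are stand-ins, not the project's actual classes:

```cpp
#include <iostream>
#include <new>
#include <stdexcept>

// hypothetical stand-in for the project's connection error type
struct ConnectionLost : std::runtime_error {
	using std::runtime_error::runtime_error;
};

void tick() { throw ConnectionLost("lost connection to mysqld"); }

// mark the thread as paused (in memory and in the database); the thread
// object itself stays alive and can be resumed later
void haltThread(const char* reason) {
	std::cerr << "thread halted: " << reason << '\n';
}

// recoverable errors halt the thread instead of destroying it
void guardedTick() {
	try {
		tick();
	}
	catch(const std::bad_alloc&) {
		haltThread("out of memory");
	}
	catch(const ConnectionLost& e) {
		haltThread(e.what());
	}
}

int main() { guardedTick(); }
```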
see e.g. "Field result as JSON" / "field.json" property of parsing configuration
are retries archive-only? (added a new "extended" logging entry to test this hypothesis)
either in the already existing member functions for checking content and checking XML content, or in new member functions of the Module::Crawler::Thread class
The pause state of threads sometimes does not get saved properly (threads start running again after a restart)
(paused=0 in the database after shutdown, although the threads are paused)
MarkovTweet and MarkovText are not optimized (multi-threading, memory usage) and should be removed.
just in case: * can be used in the config file, too!
to be implemented in Networking::setCrawlingConfigCurrent!
A table lock timeout should not lead to thread termination, but to a warning and a retry.
Downside: all database commands would need to be rewritten :(
Workaround for now: set the table lock timeout to 10 minutes in Database::connect() (see the sketch below).
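A sketch of the interim workaround, assuming MySQL Connector/C++ as elsewhere in the project; which of the two timeout variables applies depends on whether table locks or InnoDB row locks are being hit:

```cpp
#include <memory>

#include <cppconn/connection.h>
#include <cppconn/statement.h>

// to be called from Database::connect(): raise the lock timeouts for
// this session to 10 minutes so short contention no longer kills threads
void setLockTimeouts(sql::Connection& connection) {
	std::unique_ptr<sql::Statement> statement(connection.createStatement());

	// row locks (InnoDB) and metadata/table locks, respectively
	statement->execute("SET SESSION innodb_lock_wait_timeout = 600");
	statement->execute("SET SESSION lock_wait_timeout = 600");
}
```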
faster, less code
downside: the algorithm has full control over the database (but otherwise no custom prepared statements can be implemented)
similar to parsed data
change datetime.* in the parsing configuration to arrays and rename them accordingly (plural!)
check URL queries first, content queries second
use any result as the datetime
Both the server (ConfigParser class) and the frontend (parser.json) need to be changed; the ThreadParser class needs to be written accordingly.
I think the XML output does not work correctly yet (not tested).
Best bet: test XMLDocument::getContent(...) externally first (see the sketch below).
do not forget to remove #include <tuple>
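A minimal external test, assuming the underlying XML library is pugixml; XMLDocument::getContent(...) itself is not reproduced here:

```cpp
#include <iostream>
#include <sstream>

#include <pugixml.hpp>

int main() {
	pugi::xml_document document;

	// parse a tiny document and stringify it again: if this round trip
	// works, the problem is likely in XMLDocument::getContent(...)
	if(!document.load_string("<root><entry id=\"1\">text</entry></root>")) {
		std::cerr << "parsing failed\n";

		return 1;
	}

	std::ostringstream out;

	document.print(out);

	std::cout << out.str();
}
```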
do not end the tick to re-crawl a memento (avoids going through all already crawled mementos again)
use the parsed ID to avoid duplicates inside the generated text corpus
implement parsing functionality
can be implemented by using rapidjson: http://rapidjson.org/md_doc_pointer.html
Boolean result: Pointer(query).Get(document) != nullptr
Single or multiple results: GetValueByPointer(document, Pointer(query)) (a free function - Pointer itself has no GetValueByPointer member)
For single results, if the result is an array, get only the first entry of the array
Convert the resulting value to a string; note that GetString() asserts on non-string values, so numbers, null, and arrays need to be serialized instead (e.g. via a Writer) - see the sketch below
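A minimal sketch of the whole chain with rapidjson; the sample JSON and names are mine:

```cpp
#include <iostream>
#include <string>

#include "rapidjson/document.h"
#include "rapidjson/pointer.h"
#include "rapidjson/stringbuffer.h"
#include "rapidjson/writer.h"

// serialize any JSON value to a string - GetString() would assert on
// non-string values (numbers, null, arrays, objects)
std::string toString(const rapidjson::Value& value) {
	if(value.IsString())
		return value.GetString();

	rapidjson::StringBuffer buffer;
	rapidjson::Writer<rapidjson::StringBuffer> writer(buffer);

	value.Accept(writer);

	return buffer.GetString();
}

int main() {
	rapidjson::Document document;

	document.Parse(R"({"items":[{"id":"abc"},{"id":42}],"total":2})");

	const rapidjson::Pointer pointer("/items/0/id");

	// boolean result: does the pointer resolve at all?
	const rapidjson::Value* result = pointer.Get(document);

	std::cout << "found: " << (result != nullptr) << '\n';

	if(result != nullptr) {
		// single result: if it is an array, use only its first entry
		if(result->IsArray() && !result->Empty())
			result = &(*result)[0];

		std::cout << toString(*result) << '\n'; // prints: abc
	}
}
```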
e.g. by saving additional config entries in the Analyzer and accessing them via the Algo class
-> separate JSONs and update the frontend (load all algo JSONs)
target table creation may run into problems when parallel threads check for the existence of the same table and create it at the same time
e.g. in Server::Server() (see the sketch below)
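One possible fix, sketched under the assumption that MySQL Connector/C++ is used: let the database decide existence atomically instead of a separate check-then-create:

```cpp
#include <memory>
#include <string>

#include <cppconn/connection.h>
#include <cppconn/statement.h>

// CREATE TABLE IF NOT EXISTS is atomic on the server, so two threads
// creating the same target table no longer race a separate existence check
void createTargetTable(sql::Connection& connection, const std::string& name) {
	std::unique_ptr<sql::Statement> statement(connection.createStatement());

	statement->execute(
			"CREATE TABLE IF NOT EXISTS `" + name + "`"
			" (id SERIAL PRIMARY KEY)" // illustrative columns only
	);
}
```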
add import of URL lists to the Import/Export category in the frontend
add a server command allowing to import a URL list
in the server: parse one line as one entry, CSV, ... (allow specifying a column number or name if necessary)
in the frontend: (advanced) parse CSV etc. to allow column selection [double parsing :(]
alternative: two server commands (A. upload/parse and respond with the column results - the command structure needs to be changed for an advanced JSON reply; B. add the uploaded file to the URL list)
file parsing in a worker thread (see the sketch below)!
use a callback in the frontend to allow adding after parsing
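A sketch of the server-side part (one line = one entry) running in a worker thread, with a callback once parsing is done; CSV/column handling is omitted and all names are illustrative:

```cpp
#include <functional>
#include <sstream>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// parse an uploaded URL list: one line = one entry (CSV handling omitted)
std::vector<std::string> parseUrlList(const std::string& fileContent) {
	std::vector<std::string> urls;
	std::istringstream stream(fileContent);
	std::string line;

	while(std::getline(stream, line)) {
		if(!line.empty() && line.back() == '\r')
			line.pop_back(); // tolerate CRLF uploads

		if(!line.empty())
			urls.emplace_back(line);
	}

	return urls;
}

// run the parsing in a worker thread and hand the result to a callback,
// so the main server loop is not blocked by large uploads
void importInWorker(std::string fileContent,
		std::function<void(std::vector<std::string>)> onParsed) {
	std::thread(
			[content = std::move(fileContent),
					callback = std::move(onParsed)]() {
				callback(parseUrlList(content));
			}
	).detach(); // detached for brevity; a real server would track the thread
}
```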