crawlserv / crawlservpp
crawlserv++: Application for crawling and analyzing textual content of websites.
License: Other
absolute archived links with multiple protocol mentions use the last protocol mention instead of the second one
-> the wrong domain is detected (see the sketch below)
example link: http://web.archive.org/web/20090808201715/http://myweb2.search.yahoo.com/myresults/bookmarklet?u=http://www.rt.com/Best_Videos/2009-07-20/investing_in_networking__dst_ceo__yuri_milner.html&t=Investing%20in%20networking%3A%20DST%20CEO%2C%20Yuri%20Milner in http://web.archive.org/web/20090808201715/https://rt.com/Best_Videos.html
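A minimal sketch of the intended rule, assuming an archive.org-style prefix; the helper names and the simplified example URL are mine, not the project's:

```cpp
#include <algorithm>
#include <iostream>
#include <string>

// find the start of the next protocol mention ("http://" or "https://")
// at or after the given position (std::string::npos if there is none)
std::size_t findProtocol(const std::string& url, std::size_t from) {
	return std::min(url.find("http://", from), url.find("https://", from));
}

// take the SECOND protocol mention, not the last one: the original URL
// may itself contain further protocol mentions (e.g. in query parameters)
std::string extractOriginalUrl(const std::string& archivedUrl) {
	const std::size_t first = findProtocol(archivedUrl, 0);

	if(first == std::string::npos)
		return archivedUrl;

	const std::size_t second = findProtocol(archivedUrl, first + 1);

	if(second == std::string::npos)
		return archivedUrl;

	return archivedUrl.substr(second);
}

int main() {
	std::cout << extractOriginalUrl(
			"http://web.archive.org/web/20090808201715/"
			"http://example.com/page?u=http://third.example/path"
	) << '\n'; // http://example.com/page?u=http://third.example/path
}
```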
[ERROR] log() SQL Error #0 (SQLState HY000) - Value not set for all parameters [DEBUG: INSERT INTO crawlserv_log(module, entry) VALUES (?, ?)]
-> occurs after a while, probably while logging "accepted client"
-> maybe there was a connection reset and something went wrong with recovering the prepared statements? (see the sketch below)
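If the prepared-statement hypothesis holds, re-preparing all statements after a reconnect might fix it. A hedged sketch assuming MySQL Connector/C++ (sql::Connection::isValid()/reconnect()); the container and function names are hypothetical, not the project's actual interface:

```cpp
#include <memory>
#include <string>
#include <vector>

#include <cppconn/connection.h>
#include <cppconn/prepared_statement.h>

// hypothetical bookkeeping: keep the SQL text next to each handle so the
// statement can be re-created on a fresh connection
struct PreparedEntry {
	std::string sql;
	std::unique_ptr<sql::PreparedStatement> statement;
};

bool checkConnection(sql::Connection& connection,
		std::vector<PreparedEntry>& prepared) {
	if(connection.isValid())
		return true;

	if(!connection.reconnect())
		return false;

	// re-prepare ALL statements on the new connection; stale handles are
	// a plausible cause of "Value not set for all parameters"
	for(auto& entry : prepared)
		entry.statement.reset(connection.prepareStatement(entry.sql));

	return true;
}
```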
IMPORTANT: the .sql file needs to be able to use a variable like $prefix -> replaced with the new custom prefix on execution (see the sketch below)
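A small sketch of the substitution itself; the placeholder name $prefix is from the note above, the function is illustrative:

```cpp
#include <string>

// replace every occurrence of the $prefix placeholder in the loaded
// .sql file with the custom table prefix before executing it
std::string replacePrefix(std::string sqlText, const std::string& prefix) {
	const std::string placeholder("$prefix");

	std::size_t pos = 0;

	while((pos = sqlText.find(placeholder, pos)) != std::string::npos) {
		sqlText.replace(pos, placeholder.length(), prefix);

		pos += prefix.length();
	}

	return sqlText;
}
```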
change id.query and id.source in the parsing configuration to arrays and rename them to id.queries and id.sources
check URL queries first, content queries second
use any result as the ID
Both the server (ConfigParser class) and the frontend (parser.json) need to be changed; the ThreadParser class needs to be written accordingly (see the sketch below).
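A hypothetical sketch of the server-side change, assuming the configuration is parsed with rapidjson; the struct and function names are illustrative, not the actual ConfigParser interface:

```cpp
#include <cstdint>
#include <vector>

#include "rapidjson/document.h"

struct ParserConfig {
	std::vector<std::uint64_t> idQueries; // was the single id.query
	std::vector<std::uint64_t> idSources; // was the single id.source
};

// read the renamed, plural entries; every array element is one query/source
void parseIdEntries(const rapidjson::Value& json, ParserConfig& config) {
	if(json.HasMember("id.queries") && json["id.queries"].IsArray())
		for(const auto& entry : json["id.queries"].GetArray())
			if(entry.IsUint64())
				config.idQueries.push_back(entry.GetUint64());

	if(json.HasMember("id.sources") && json["id.sources"].IsArray())
		for(const auto& entry : json["id.sources"].GetArray())
			if(entry.IsUint64())
				config.idSources.push_back(entry.GetUint64());
}
```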
used by both the crawler and the extractor module, accessed by the Networking class
tbd: child class of ConfigModule?
add config/change config -> saves a lot of scrolling
extend crawlserv_analyzedtables by a format column
implement filter on server-side (including configuration .json files)
implement filter on frontend
e.g. when running out of memory or on connection loss
the thread should be halted, but not destroyed (see the sketch below)
to check: does this also apply when checkConnection fails?
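A hedged sketch of the intended behavior; the exception type and function names are stand-ins, not the project's actual classes:

```cpp
#include <iostream>
#include <new>
#include <stdexcept>

// hypothetical stand-in for the project's connection error type
struct ConnectionLost : std::runtime_error {
	using std::runtime_error::runtime_error;
};

void tick() { throw ConnectionLost("lost connection to mysqld"); }

// mark the thread as paused (in memory and in the database); the thread
// object itself stays alive and can be resumed later
void haltThread(const char* reason) {
	std::cerr << "thread halted: " << reason << '\n';
}

// recoverable errors halt the thread instead of destroying it
void guardedTick() {
	try {
		tick();
	}
	catch(const std::bad_alloc&) {
		haltThread("out of memory");
	}
	catch(const ConnectionLost& e) {
		haltThread(e.what());
	}
}

int main() { guardedTick(); }
```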
see e.g. "Field result as JSON" / "field.json" property of parsing configuration
are retries archive-only? (added a new "extended" logging entry to test this hypothesis)
either in the already existing member functions for checking content and checking XML content, or in new member functions of the Module::Crawler::Thread class
The pause state of threads sometimes does not get saved properly (threads start running again after a restart)
(paused=0 in the database after shutdown, although the threads are paused)
MarkovTweet and MarkovText are not optimized (multi-threading, memory usage) and should be removed.
just in case: * can be used in the config file, too!
to be implemented in Networking::setCrawlingConfigCurrent!
A table lock timeout should not lead to thread termination, but to a warning and a retry.
Downside: all database commands would need to be rewritten :(
Workaround for now: set the table lock timeout to 10 minutes in Database::connect() (see the sketch below).
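A sketch of the interim workaround, assuming MySQL Connector/C++ as elsewhere in the project; which of the two timeout variables applies depends on whether table locks or InnoDB row locks are being hit:

```cpp
#include <memory>

#include <cppconn/connection.h>
#include <cppconn/statement.h>

// to be called from Database::connect(): raise the lock timeouts for
// this session to 10 minutes so short contention no longer kills threads
void setLockTimeouts(sql::Connection& connection) {
	std::unique_ptr<sql::Statement> statement(connection.createStatement());

	// row locks (InnoDB) and metadata/table locks, respectively
	statement->execute("SET SESSION innodb_lock_wait_timeout = 600");
	statement->execute("SET SESSION lock_wait_timeout = 600");
}
```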
faster, less code
downside: the algorithm has full control over the database (but otherwise no custom prepared statements can be implemented)
similar to parsed data
change datetime.* in the parsing configuration to arrays and rename them accordingly (plural!)
check URL queries first, content queries second
use any result as the datetime
Both the server (ConfigParser class) and the frontend (parser.json) need to be changed; the ThreadParser class needs to be written accordingly.
I think the XML output does not work correctly yet (not tested).
Best bet: test XMLDocument::getContent(...) externally first (see the sketch below).
do not forget to remove #include <tuple>
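A minimal external test, assuming the underlying XML library is pugixml; XMLDocument::getContent(...) itself is not reproduced here:

```cpp
#include <iostream>
#include <sstream>

#include <pugixml.hpp>

int main() {
	pugi::xml_document document;

	// parse a tiny document and stringify it again: if this round trip
	// works, the problem is likely in XMLDocument::getContent(...)
	if(!document.load_string("<root><entry id=\"1\">text</entry></root>")) {
		std::cerr << "parsing failed\n";

		return 1;
	}

	std::ostringstream out;

	document.print(out);

	std::cout << out.str();
}
```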
do not end the tick to re-crawl a memento (avoids going through all already crawled mementos again)
use the parsed ID to avoid duplicates inside the generated text corpus
implement parsing functionality
can be implemented by using rapidjson: http://rapidjson.org/md_doc_pointer.html
Boolean result: Pointer(query).Get(document) != nullptr
Single or multiple results: GetValueByPointer(document, Pointer(query)) (a free function - Pointer itself has no GetValueByPointer member)
For single results, if the result is an array, get only the first entry of the array
Convert the resulting value to a string; note that GetString() asserts on non-string values, so numbers, null, and arrays need to be serialized instead (e.g. via a Writer) - see the sketch below
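A minimal sketch of the whole chain with rapidjson; the sample JSON and names are mine:

```cpp
#include <iostream>
#include <string>

#include "rapidjson/document.h"
#include "rapidjson/pointer.h"
#include "rapidjson/stringbuffer.h"
#include "rapidjson/writer.h"

// serialize any JSON value to a string - GetString() would assert on
// non-string values (numbers, null, arrays, objects)
std::string toString(const rapidjson::Value& value) {
	if(value.IsString())
		return value.GetString();

	rapidjson::StringBuffer buffer;
	rapidjson::Writer<rapidjson::StringBuffer> writer(buffer);

	value.Accept(writer);

	return buffer.GetString();
}

int main() {
	rapidjson::Document document;

	document.Parse(R"({"items":[{"id":"abc"},{"id":42}],"total":2})");

	const rapidjson::Pointer pointer("/items/0/id");

	// boolean result: does the pointer resolve at all?
	const rapidjson::Value* result = pointer.Get(document);

	std::cout << "found: " << (result != nullptr) << '\n';

	if(result != nullptr) {
		// single result: if it is an array, use only its first entry
		if(result->IsArray() && !result->Empty())
			result = &(*result)[0];

		std::cout << toString(*result) << '\n'; // prints: abc
	}
}
```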
e.g. by saving additional config entries in the Analyzer and accessing them via the Algo class
-> separate JSONs and update the frontend (load all algo JSONs)
target table creation may run into problems when parallel threads check for the existence of the same table and create it at the same time
e.g. in Server::Server() (see the sketch below)
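One possible fix, sketched under the assumption that MySQL Connector/C++ is used: let the database decide existence atomically instead of a separate check-then-create:

```cpp
#include <memory>
#include <string>

#include <cppconn/connection.h>
#include <cppconn/statement.h>

// CREATE TABLE IF NOT EXISTS is atomic on the server, so two threads
// creating the same target table no longer race a separate existence check
void createTargetTable(sql::Connection& connection, const std::string& name) {
	std::unique_ptr<sql::Statement> statement(connection.createStatement());

	statement->execute(
			"CREATE TABLE IF NOT EXISTS `" + name + "`"
			" (id SERIAL PRIMARY KEY)" // illustrative columns only
	);
}
```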
add import of URL lists to the Import/Export category in the frontend
add a server command allowing to import a URL list
in the server: parse one line as one entry, CSV, ... (allow specifying a column number or name if necessary)
in the frontend: (advanced) parse CSV etc. to allow column selection [double parsing :(]
alternative: two server commands (A. upload/parse and respond with the column results - the command structure needs to be changed for an advanced JSON reply; B. add the uploaded file to the URL list)
file parsing in a worker thread (see the sketch below)!
use a callback in the frontend to allow adding after parsing
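A sketch of the server-side part (one line = one entry) running in a worker thread, with a callback once parsing is done; CSV/column handling is omitted and all names are illustrative:

```cpp
#include <functional>
#include <sstream>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// parse an uploaded URL list: one line = one entry (CSV handling omitted)
std::vector<std::string> parseUrlList(const std::string& fileContent) {
	std::vector<std::string> urls;
	std::istringstream stream(fileContent);
	std::string line;

	while(std::getline(stream, line)) {
		if(!line.empty() && line.back() == '\r')
			line.pop_back(); // tolerate CRLF uploads

		if(!line.empty())
			urls.emplace_back(line);
	}

	return urls;
}

// run the parsing in a worker thread and hand the result to a callback,
// so the main server loop is not blocked by large uploads
void importInWorker(std::string fileContent,
		std::function<void(std::vector<std::string>)> onParsed) {
	std::thread(
			[content = std::move(fileContent),
					callback = std::move(onParsed)]() {
				callback(parseUrlList(content));
			}
	).detach(); // detached for brevity; a real server would track the thread
}
```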