
crawlserv / crawlservpp


crawlserv++: Application for crawling and analyzing textual content of websites.

License: Other

C++ 86.04% PHP 8.48% CSS 0.47% JavaScript 4.23% CMake 0.46% C 0.32% Shell 0.01%


crawlservpp's Issues

wrong domain detected in certain archived links

Mysterious SQL error in Main::Database::log()

[ERROR] log() SQL Error #0 (SQLState HY000) - Value not set for all parameters [DEBUG: INSERT INTO crawlserv_log(module, entry) VALUES (?, ?)]
-> after a while, probably while logging "accepted client"
-> maybe there was a connection reset and something went wrong with recovering the prepared statements?
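
One plausible culprit, given the hypothesis above: with MySQL Connector/C++, prepared statements do not survive a connection reset, so reconnecting without re-preparing them can leave a statement whose parameters can no longer be bound. A minimal sketch of a defensive check before logging (connection and logStatement are placeholders, not the actual members of Main::Database):

    #include <memory>
    #include <string>

    #include <cppconn/connection.h>
    #include <cppconn/prepared_statement.h>

    // sketch: re-prepare the logging statement if the connection was reset
    void log(sql::Connection* connection,
            std::unique_ptr<sql::PreparedStatement>& logStatement,
            const std::string& module, const std::string& entry) {
        if(!connection->isValid()) {
            connection->reconnect();

            // statements prepared before the reset are now invalid; re-creating
            // them avoids "Value not set for all parameters" on execute()
            logStatement.reset(connection->prepareStatement(
                    "INSERT INTO crawlserv_log(module, entry) VALUES (?, ?)"));
        }

        logStatement->setString(1, module);
        logStatement->setString(2, entry);
        logStatement->execute();
    }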

parse ID from URL or content using multiple queries

  • change id.query and id.source in the parsing configuration to arrays, rename them to id.queries and id.sources
  • check URL queries first, content queries second
  • use the first result found as the ID

Both server (ConfigParser class) and frontend (parser.json) need to be changed. The ThreadParser class needs to be written accordingly.
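
A rough sketch of the intended lookup order in the parser thread (runQuery() is a hypothetical stand-in for the project's actual query execution):

    #include <string>
    #include <vector>

    // hypothetical stand-in for running a single configured query
    std::string runQuery(const std::string& query, const std::string& target);

    // sketch: resolve the ID from multiple queries, URL queries first
    std::string parseId(const std::vector<std::string>& idQueriesOnUrl,
            const std::vector<std::string>& idQueriesOnContent,
            const std::string& url, const std::string& content) {
        std::string parsedId;

        // URL queries take precedence
        for(const auto& query : idQueriesOnUrl) {
            parsedId = runQuery(query, url);
            if(!parsedId.empty())
                return parsedId;
        }

        // content queries are only run if no URL query produced a result
        for(const auto& query : idQueriesOnContent) {
            parsedId = runQuery(query, content);
            if(!parsedId.empty())
                return parsedId;
        }

        return parsedId; // empty if no query produced a result
    }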

move parsing, extracting and analyzing status and lock to separate tables

  • remove parsed, parselock, extracted, extractlock, analyzed, analyzelock from URL list table in Main::Database class [DONE!]
  • change resetAnalyzingStatus, resetExtractingStatus, resetParsingStatus in Main::Database class
  • create, change, reset, delete crawlserv_[website]_[urllist]_parsing, _extracting and _analyzing tables with columns target (id), url (id), lock (datetime), success (bool, default FALSE) in Main::Database class
  • save result table id in-class (Module::Analyzer::Database and Module::Parser::Database) [DONE!]
  • parser: add LEFT OUTER JOIN crawlserv_[website]_[urllist]_parsing AS c ON a.id = c.url (and c.target = [result table] in the WHERE clause) to the SQL statement for getNextUrl (see the sketch after this list)
  • parser: change SQL statements for isUrlParsed, isUrlLockable, getUrlLock, checkUrlLock, lockUrl, unLockUrl, setUrlFinished accordingly
  • parser: remove reset on finish config option
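
What the adjusted statement for getNextUrl could look like, following the naming scheme above (a aliases the URL list table; the exact conditions are a sketch, not the final statement):

    #include <string>

    // sketch: getNextUrl with the parsing state in its own table
    const std::string sqlGetNextUrl(
            "SELECT a.id, a.url"
            " FROM crawlserv_[website]_[urllist] AS a"
            " LEFT OUTER JOIN crawlserv_[website]_[urllist]_parsing AS c"
            "  ON a.id = c.url"
            " WHERE a.id > ?"                               // resume after the last URL
            " AND (c.target IS NULL OR c.target = ?)"       // rows of this result table only
            " AND (c.success IS NULL OR c.success = FALSE)" // not parsed yet
            " AND (c.`lock` IS NULL OR c.`lock` < NOW())"   // not locked by another thread
            " ORDER BY a.id LIMIT 1");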

metadata analysis: implement count over time algorithm

  • count entries, values and json values
  • different time span resolutions (hour, day, week, month, year)
  • distinguish continuous from summarized time spans (summarized: hours of the day, days of the week, days of the month, weeks of the month, months of the year); see the sketch after this list
  • implement table formats accordingly (after implementation of #26)
  • data presentation? (maybe "Statistics" category in frontend)
  • data export ("Import/Export" category in frontend, which file formats?)
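
For the day/weekday case, the distinction boils down to the bucket key. A minimal sketch, assuming entries arrive as std::tm (other resolutions would just use different format strings):

    #include <ctime>
    #include <map>
    #include <string>

    // one counter per time bucket
    std::map<std::string, unsigned long> counts;

    // sketch: continuous vs. summarized time buckets (day resolution)
    void countEntry(const std::tm& time, bool summarized) {
        char key[16] = {};

        // summarized: one bucket per weekday ("0".."6"), all dates merged;
        // continuous: one bucket per calendar day ("2019-03-14")
        std::strftime(key, sizeof(key), summarized ? "%w" : "%Y-%m-%d", &time);

        ++counts[key];
    }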

remove testing analyzers

MarkovTweet and MarkovText are not optimized (multi-threading, memory usage) and should be removed.

Thread gets killed on table lock timeout

Table lock timeout should not lead to thread termination, but to warning and retry.

Downside: All database commands need to be rewritten :(

Workaround for now: Set the table lock timeout to 10min in Database::connect().
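
What the retry could look like once the database commands are wrapped, assuming MySQL Connector/C++ exceptions (error code 1205 is MySQL's "Lock wait timeout exceeded"; the wrapper is a placeholder, not existing code):

    #include <iostream>

    #include <cppconn/exception.h>

    // sketch: warn and retry on lock timeout instead of killing the thread
    template<typename Command>
    void executeWithRetry(Command&& command, unsigned long maxRetries = 10) {
        for(unsigned long n = 0; ; ++n) {
            try {
                command();
                return;
            }
            catch(const sql::SQLException& e) {
                // 1205 = ER_LOCK_WAIT_TIMEOUT; anything else escalates as before
                if(e.getErrorCode() != 1205 || n >= maxRetries)
                    throw;

                std::cerr << "WARNING: table lock timeout, retrying..." << std::endl;
            }
        }
    }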

parse datetime using multiple queries

  • change datetime.* in the parsing configuration to arrays and rename accordingly (plural!)
  • check URL queries first, content queries second
  • use the first result found as the datetime (same lookup order as for the ID above)

Both server (ConfigParser class) and frontend (parser.json) need to be changed. The ThreadParser class needs to be written accordingly.

XML output

I think XML output does not work correctly yet (not tested).

Best bet: test XMLDocument::getContent(...) externally first.
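
A throwaway harness along these lines should be enough (the XMLDocument interface is guessed here, since the exact signature of getContent(...) is not given above):

    // sketch: standalone test of XML-to-text conversion
    // (XMLDocument interface guessed; adjust to the real class and header)
    #include <iostream>
    #include <string>

    int main() {
        XMLDocument document;
        std::string content;

        document.parse("<html><body><p>Hello <b>world</b>!</p></body></html>");
        document.getContent(content);

        // expected: the text content without markup, e.g. "Hello world!"
        std::cout << content << std::endl;
    }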

parser

implement parsing functionality

feature: add JSON Pointer as query type

can be implemented by using rapidjson: http://rapidjson.org/md_doc_pointer.html

Boolean result: Pointer(query).Get(document) != nullptr
Single or multiple results: GetValueByPointer(document, Pointer(query))
For single results, if the result is an array, get only the first entry of the array.
Convert the resulting value to a string (QUESTION: can GetString() be used on arrays and numeric/null values as well?)
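
Regarding the question: GetString() can only be called on actual string values (rapidjson asserts on IsString()), so non-string results have to be serialized instead, e.g. through a Writer. A sketch against the documented Pointer API, with the array handling described above:

    // sketch: JSON Pointer lookup with rapidjson, result converted to a string
    #include "rapidjson/document.h"
    #include "rapidjson/pointer.h"
    #include "rapidjson/stringbuffer.h"
    #include "rapidjson/writer.h"

    #include <string>

    std::string queryJsonPointer(const rapidjson::Document& document,
            const std::string& query) {
        const rapidjson::Value* result =
                rapidjson::Pointer(query.c_str()).Get(document);

        if(!result)
            return ""; // the boolean result is simply: result != nullptr

        // single result: if it is an array, use only its first entry
        if(result->IsArray() && !result->Empty())
            result = &(*result)[0];

        if(result->IsString())
            return result->GetString();

        // numbers, bools, null, objects: serialize to their JSON representation
        rapidjson::StringBuffer buffer;
        rapidjson::Writer<rapidjson::StringBuffer> writer(buffer);
        result->Accept(writer);

        return buffer.GetString();
    }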

import, export, merge URL lists

  • add import of URL lists to the Import/Export category in the frontend
  • add a server command allowing to import a URL list
  • in the server: parse one line per entry, CSV, ... (allow specifying a column number or name if necessary)
  • in the frontend: (advanced) parse CSV etc. to allow column selection [double parsing 👎]
  • alternative: two server commands (A. upload/parse and respond with the column results, which requires changing the command structure for an advanced JSON reply; B. add the uploaded file to the URL list)

parse the file in a worker thread!
use a callback in the frontend to allow adding the URLs after parsing (see the sketch below)
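
A rough sketch of the server-side parsing step, reading either one URL per line or one CSV column (the delimiter/column parameters are assumptions, not the final command interface):

    #include <sstream>
    #include <string>
    #include <vector>

    // sketch: extract URLs from an uploaded file, one per line or from a CSV column
    std::vector<std::string> parseUrlList(const std::string& fileContent,
            char delimiter = '\0',     // '\0' = plain list, one URL per line
            std::size_t column = 0) {  // CSV column to extract (zero-based)
        std::vector<std::string> urls;
        std::istringstream lines(fileContent);
        std::string line;

        while(std::getline(lines, line)) {
            if(line.empty())
                continue;

            if(!delimiter) {
                urls.push_back(line); // one line, one entry
                continue;
            }

            // naive CSV: split on the delimiter, keep the requested column
            // (no quoting/escaping; a real importer needs a proper CSV parser)
            std::istringstream fields(line);
            std::string field;

            for(std::size_t i = 0; std::getline(fields, field, delimiter); ++i)
                if(i == column) {
                    urls.push_back(field);
                    break;
                }
        }

        return urls;
    }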
