letlink-crawler's Issues

Web Crawler

Web Server: Tomcat
OS: Ubuntu Linux server
Techs: jQuery, JavaScript, Ajax, CSS, monitoring tools
Additional Struts action classes should also be developed to respond to 
requests from the web client.

Original issue reported on code.google.com by [email protected] on 5 Jul 2011 at 3:22

TODO

This issue serves as my TODO list. It will contain web-crawler-related tasks 
and other technical tasks, together with their schedules.

Original issue reported on code.google.com by [email protected] on 20 Jul 2011 at 2:33

DB Design

Initial idea for the DB design:

+---------------------------------+
| domains                         |
| frontierReportMonitor           |
| fullharvests                    |
| global_crawler_trap_expressions |
| global_crawler_trap_lists       |
| harvestdefinitions              |
| historyinfo                     |
| jobs                            |
| partialharvests                 |
| runningJobsHistory              |
| runningJobsMonitor              |
| schedules                       |
| schemaversions                  |
| seedlists                       |
+---------------------------------+

domains:      id, name, alias, desc, isCrawling (a domain cannot be modified 
              while it is being crawled)
excluded_url: id, name, domain_id, desc, isActive
seedslist:    id, name, desc, seeds (a list of domains)
cron:
crawler:      id, name, crawler_id, domain_id, excluded_url, cron, 
              max_bytes_download, max_image_download, max_time_seconds, 
              max_threads, activated, desc

job:          id, name, crawler_id, start_time, end_time, submit_time, 
              priority, seedlist_id, status, crawling_error, 
              crawling_error_details, image_upload_error, 
              image_upload_error_details, bak_col01, bak_col02, bak_col03

image_info (raw image information):
              id, filename (same as the image name on the web page where it 
              was found), storage_path, image_url, page_url, pixels, size, 
              format
image_stat (statistics data):
              id, crawler_id, job_id, seedlist_id, domain_id, crawled_time
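
The note on the domains table (a domain cannot be modified while it is being crawled) could be enforced in the Java layer roughly as follows. This is only a sketch under assumptions: the class name Domain, its methods, and the demo class DbDesignSketch are hypothetical and do not come from the issue. Only the field names id, name, alias, desc and isCrawling are taken from the list above.

```java
import java.util.Objects;

// Hypothetical in-memory mirror of one "domains" row.
// Field names follow the issue text; everything else is an assumption.
final class Domain {
    private final long id;
    private String name;
    private String alias;
    private String desc;
    private boolean crawling; // corresponds to the isCrawling column

    Domain(long id, String name) {
        this.id = id;
        this.name = Objects.requireNonNull(name, "name");
    }

    long getId() { return id; }
    String getName() { return name; }
    String getAlias() { return alias; }
    String getDesc() { return desc; }
    boolean isCrawling() { return crawling; }

    // Would be flipped by whatever component starts and finishes a crawl.
    void setCrawling(boolean crawling) { this.crawling = crawling; }

    void rename(String newName) {
        ensureNotCrawling();
        this.name = Objects.requireNonNull(newName, "newName");
    }

    void setAlias(String alias) {
        ensureNotCrawling();
        this.alias = alias;
    }

    void setDesc(String desc) {
        ensureNotCrawling();
        this.desc = desc;
    }

    // Rejects every modification while a crawl is in progress.
    private void ensureNotCrawling() {
        if (crawling) {
            throw new IllegalStateException(
                "domain '" + name + "' is being crawled and cannot be modified");
        }
    }
}

final class DbDesignSketch {
    public static void main(String[] args) {
        Domain d = new Domain(1L, "example.org");
        d.setAlias("example");
        d.setCrawling(true);
        try {
            d.rename("example.com"); // refused: crawl in progress
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        d.setCrawling(false);
        d.rename("example.com"); // accepted once the crawl has ended
        System.out.println(d.getName());
    }
}
```

Putting the check in one private method keeps the rule in a single place. A database-level guard (for example, an UPDATE that only matches rows where isCrawling is false) would still be needed if several processes write to the same table.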

Original issue reported on code.google.com by [email protected] on 8 Jul 2011 at 8:59
