ausdto / disco_layer
Code, outputs and information relevant to the discovery layer.
The callback chain is causing issues. This function effectively needs to wait until we get an answer; the promises should resolve before the chain continues.
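The wait-for-an-answer problem can be sketched by wrapping the callback API in a promise. Everything here is hypothetical (fetchPage stands in for whatever callback-style call the crawler actually makes); it only illustrates the pattern.

```javascript
// Hypothetical stand-in for a callback-style fetch in the crawler.
function fetchPage(url, callback) {
  // Simulate an answer arriving asynchronously.
  setTimeout(function () {
    callback(null, { url: url, status: 200 });
  }, 10);
}

// Promise wrapper: resolves with the answer, rejects on error, so
// downstream code can simply await the result.
function fetchPageAsync(url) {
  return new Promise(function (resolve, reject) {
    fetchPage(url, function (err, result) {
      if (err) reject(err);
      else resolve(result);
    });
  });
}
```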
For example, if going to this website:
http://www.acnc.gov.au/findacharity
it redirects to:
http://www.acnc.gov.au/ACNC/FindCharity/QuickSearch/ACNC/OnlineProcessors/Online_register/Search_the_Register.aspx?noleft=1
But in the crawler I get a 599 and the redirected URL is never fetched.
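One possible shape for handling this: when the response is a 3xx, resolve the Location header against the original URL and queue that, rather than letting the request surface as a 599. This is a sketch, not the crawler's actual code; it assumes Node-style lowercased header names.

```javascript
// Given a response, work out the redirect target (or null if there is none).
// Node's http module lowercases header names, hence headers.location.
function redirectTarget(originalUrl, statusCode, headers) {
  if (statusCode >= 300 && statusCode < 400 && headers.location) {
    // Resolve relative Location values against the requested URL.
    return new URL(headers.location, originalUrl).href;
  }
  return null; // not a redirect
}
```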
In disco_service/spiderbucket/management/commands/sync_docs_from_orientdb.py I have hardcoded values for OrientDB. These should be drawn from environment variables indirectly, through settings.py.
Some query params seem to be getting incorrectly defined.
Most likely this is an encode/decode issue.
Examples:
info: Url was 404: http://ahl.gov.au/%3Fq=partnerships
info: Url was 404: http://ahl.gov.au/%3Fq=our-organisation
info: Url was 404: http://ahl.gov.au/%3Fq=ahl-board
info: Url was 404: http://ahl.gov.au/%3Fq=customer-service-charter
info: Url was 404: http://ahl.gov.au/%3Fq=contact
info: Url was 404: http://ahl.gov.au/%3Fq=employment
info: Url was 404: http://ahl.gov.au/%3Fq=support-services
info: Url was 404: http://ahl.gov.au/%3Fq=node%2F222
http://lmip.gov.au/default.aspx%3FLMIP%2FContactUs
Related: #32
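The 404s above all share the same symptom: a "?" that has been over-encoded as "%3F", so the server sees it as part of the path. A minimal sketch of a fix is to decode just that first "%3F" back into "?", leaving the rest of the URL (e.g. "%2F" inside the query) untouched:

```javascript
// Restore an over-encoded query separator. Only the first "%3F" is
// decoded; anything after it is left exactly as-is.
function fixEncodedQuery(url) {
  var i = url.indexOf('%3F');
  if (i === -1) return url;
  return url.slice(0, i) + '?' + url.slice(i + 3);
}
```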
At the moment I am just leaving a fixed wait for database commands to finish. We need to close the database only after all the queries are done.
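One way to replace the fixed wait: track every outstanding DB promise, then close the connection only once they have all settled. This is a sketch (track/closeWhenDone are made-up names, and db is any object with a close method).

```javascript
// Collect every in-flight database promise.
var pending = [];

function track(promise) {
  pending.push(promise);
  return promise;
}

// Close the DB only after all tracked work has settled.
// Each promise gets a catch so one failure can't leave the DB open.
function closeWhenDone(db) {
  return Promise.all(pending.map(function (p) {
    return p.catch(function () {});
  })).then(function () {
    db.close();
  });
}
```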
/home/ec2-user/crawler/logs/greenpower.gov.au_investigate.log
There are a whole bunch from greenpower that say completed but never make it to the DB. I can manually insert the URL using Studio. It seems to be mostly PDFs.
Need to get log files rotating daily.
Need to move the DB config/params out of the primary functions.
depends on #29
A "pages like this one" page and/or API. Given a URL (e.g. the current page hosting a widget), return a list of pages "like this one".
The actual database password is not being passed.
Also look to see if the server password is actually needed any more.
At the least, create another account that is limited to just listing the DBs etc.
Potential fix is in local git.
At the moment, errors (404, 500, timeout) are not stored. They should be, but at some stage we should stop trying to refresh them. We need to decide the rules.
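As a starting point for those rules, one common shape is: store the error status with an attempt count and last-tried timestamp, back off exponentially, and give up after a maximum number of attempts. The numbers below are placeholders, not decided policy.

```javascript
// Sketch of a retry rule for stored error records.
// record: { attempts, lastTriedAt } with attempts >= 1.
function shouldRefetch(record, now) {
  var maxAttempts = 5;                 // placeholder: give up after 5 tries
  var baseDelayMs = 60 * 60 * 1000;    // placeholder: 1 hour base delay
  if (record.attempts >= maxAttempts) return false;
  // Exponential backoff: 1h, 2h, 4h, 8h ...
  var delay = baseDelayMs * Math.pow(2, record.attempts - 1);
  return now - record.lastTriedAt >= delay;
}
```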
In haste, a hack was added to bypass the domain checking and allow the crawler to go across all of gov.au. This changed the domainValid function to accept any domain. There is also crawler.filterByDomain = true, though, so maybe that can be changed instead. Regardless, the hack needs to be removed because it lives in an external module; then node_modules can be removed from the repo.
https://github.com/AusDTO/discoveryLayer/blob/master/node/node_modules/simplecrawler/lib/crawler.js, around line 536:

var crawler = this,
    crawlerHost = crawler.host;

// If we're ignoring the WWW domain, remove the WWW for comparisons...
if (crawler.ignoreWWWDomain)
    host = host.replace(/^www./i, "");

// TODO: HACKED. This is hacked to let it go outside this domain.
// Should then get caught by my conditions.
return true;
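A possible way to drop the hack without patching node_modules: keep the domain check logic in our own code as a fetch condition. The host test below is real and runnable; the crawler wiring is an untested sketch (simplecrawler versions of that era exposed filterByDomain and addFetchCondition, but verify against the vendored copy before relying on it).

```javascript
// Allow any host under gov.au (including gov.au itself and subdomains).
function isGovAu(host) {
  return /(^|\.)gov\.au$/i.test(host);
}

// Untested wiring sketch — check the vendored simplecrawler's API first:
//
//   crawler.filterByDomain = false;
//   crawler.addFetchCondition(function (queueItem) {
//     return isGovAu(queueItem.host);
//   });
```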
Currently, disco_service/govservices/management/commands/.. is a messy contraption that interfaces with a local git clone of the service catalogue repository.
It would be much better if it accessed an API on the node.js side for things like fetching lists of things that need to be synced in the DB. Better to have only one codebase for processing/managing that JSON graph.
This is WIP at the moment.
following #45, make some views so the service catalogue can be browsed.
seed "disco service" for the spider - these pages should be indexed (and boosted!).
error: OrientDB.RequestError: Cannot index record webDocumentContainer{protocol:http,host:agriculture.gov.au,port:80,path:/ScriptResource.axd?d=c6-x5B8upWa01zaoYvIPGWKXX9r_j-wWUWV9FCPAuE2AJ0QqVe_xNXoqJ-ENMUA7BaVzaiSnT3p4CMPDOOVQ9VeiR6kFneRW3EA_el-OFUPXZkB0h9SKofzsQa9sElSIPeaoWOV631b-L71nDxxlHWyHzDP2z5_xU02TsidI_m6JdZKrnP_PjSEm6FyPSo9S0&t=ffffffff805766b3,depth:3,pathname:/ScriptResource.axd?d=c6-x5B8upWa01zaoYvIPGWKXX9r_j-wWUWV9FCPAuE2AJ0QqVe_xNXoqJ-ENMUA7BaVzaiSnT3p4CMPDOOVQ9VeiR6kFneRW3EA_el-OFUPXZkB0h9SKofzsQa9sElSIPeaoWOV631b-L71nDxxlHWyHzDP2z5_xU02TsidI_m6JdZKrnP_PjSEm6FyPSo9S0&t=ffffffff805766b3,url:http://agriculture.gov.au/ScriptResource.axd%3Fd=c6-x5B8upWa01zaoYvIPGWKXX9r_j-wWUWV9FCPAuE2AJ0QqVe_xNXoqJ-ENMUA7BaVzaiSnT3p4CMPDOOVQ9VeiR6kFneRW3EA_el-OFUPXZkB0h9SKofzsQa9sElSIPeaoWOV631b-L71nDxxlHWyHzDP2z5_xU02TsidI_m6JdZKrnP_PjSEm6FyPSo9S0&t=ffffffff805766b3}: found duplicated key 'http://agriculture.gov.au/ScriptResource.axd%3Fd=c6-x5B8upWa01zaoYvIPGWKXX9r_j-wWUWV9FCPAuE2AJ0QqVe_xNXoqJ-ENMUA7BaVzaiSnT3p4CMPDOOVQ9VeiR6kFneRW3EA_el-OFUPXZkB0h9SKofzsQa9sElSIPeaoWOV631b-L71nDxxlHWyHzDP2z5_xU02TsidI_m6JdZKrnP_PjSEm6FyPSo9S0&t=ffffffff805766b3' in index 'webDocumentContainer.url' previously assigned to the record #13:25476
at Operation.parseError (/home/nokout/projects/discoveryLayer/crawler/node_modules/oriento/lib/transport/binary/protocol28/operation.js:832:13)
at Operation.consume (/home/nokout/projects/discoveryLayer/crawler/node_modules/oriento/lib/transport/binary/protocol28/operation.js:422:35)
at Connection.process (/home/nokout/projects/discoveryLayer/crawler/node_modules/oriento/lib/transport/binary/connection.js:360:17)
at Connection.handleSocketData (/home/nokout/projects/discoveryLayer/crawler/node_modules/oriento/lib/transport/binary/connection.js:279:17)
at Socket.emit (events.js:107:17)
at readableAddChunk (_stream_readable.js:163:16)
at Socket.Readable.push (_stream_readable.js:126:10)
at TCP.onread (net.js:538:20)
related to #30
long overdue, spiderbucket was always a stupid name
faceted browsing site with search features and recommendation engine
Kickstart semantic information extraction / enrichment. Reason over:
MVP might be "life event" facets based on service catalogue + content cagefight clustering (#30)
The current dockerised Solr container is Solr 5, but things might be easier with Solr 4 (until haystack support for Solr 5 gets a bit more mature).
Either shake the issues out, or downgrade to Solr 4 and upgrade later.
connected to #30.
following from #45, need to configure a search index
When queuing a lot of results we can get duplicate inserts after the select count. Not a big issue, because the record has in fact been stored, which is all we want anyhow.
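If the duplicate-key noise ever becomes worth fixing, one cheap guard is an in-flight set: remember URLs whose insert has been issued but not yet confirmed, so the select-count / insert race can't fire twice. A sketch with made-up names:

```javascript
// URLs whose insert is currently in flight.
var inFlight = new Set();

// insertFn(url) must return a promise for the actual DB insert.
function insertOnce(url, insertFn) {
  if (inFlight.has(url)) return Promise.resolve('duplicate-skipped');
  inFlight.add(url);
  return insertFn(url).then(function (result) {
    inFlight.delete(url);
    return result;
  }, function (err) {
    inFlight.delete(url);
    throw err;
  });
}
```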
I had a test_update_dimension that I'm sure used to pass, but I think I broke it when I refactored the tests such that agencies were stashed in an "ORM cache" class member (as a performance hack, to reduce traffic between the test suite and the test DB).
I don't understand why this bug occurs now, but want to get on with a major refactor that might end up removing it anyhow. So, current plan is to switch the test off (rename to "buggy_test_update_dimension").
Enhance the Exclude Domains fetch condition to exclude state domains.
Add hooks and a function to handle it.
http://orientdb.com/docs/last/Dynamic-Hooks.html
http://orientdb.com/docs/last/Functions.html
Dummy and then full
Should be trivial; 'python manage.py test' already works.
following from #27, we have jenkins testing the disco_service but not the node stuff yet.
Currently just using console logs. We need to move to a logging library, maybe winston.
This will allow easier determination of changes for downstream processing.
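A minimal winston configuration sketch, assuming winston 3.x is installed (npm i winston); this is untested here and the daily rotation mentioned above would need a separate rotating-file transport on top of it:

```javascript
// Configuration sketch only — check the winston docs for exact options.
const winston = require('winston');

const logger = winston.createLogger({
  level: 'info',
  format: winston.format.combine(
    winston.format.timestamp(),   // structured timestamps for downstream tools
    winston.format.json()
  ),
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: 'logs/crawler.log' })
  ]
});

// Usage, replacing the current console logs:
// logger.info('Url was 404: ' + url);
```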
Some URLs are giving 404 when the query params are left in place, but I assume some will not work if they are stripped.
Example where it fails with the query param (400):
acnc.gov.au/ACNC/Manage/Reporting/ReportTransitional/ACNC/Report/ReportTransitional.aspx%3Fhkey=61d173e0-cabb-4be4-82d5-0bf37fa55c7c
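Since neither blanket stripping nor blanket keeping works, one option is to strip by default and keep the query only for sites known to need it. The keep-list pattern below is a made-up placeholder; the real list would be built from observed failures.

```javascript
// Placeholder keep-list: sites whose pages need their query strings.
var keepQueryFor = [/some-site-that-needs-params\.gov\.au/i];

// Strip the query string unless the URL matches the keep-list.
function normaliseUrl(url) {
  var q = url.indexOf('?');
  if (q === -1) return url;
  var needsQuery = keepQueryFor.some(function (re) { return re.test(url); });
  return needsQuery ? url : url.slice(0, q);
}
```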
Need to add incremental commits
Parse the cleaned content (use something like lxml) into some kind of temporary structure, then traverse that structure to create a corresponding OrientDB graph.
note: it should be possible to "roundtrip" test this: from cleaned content to ContentAST, and from ContentAST back to equivalent (if not identical) clean content.
following from #45, create a jenkins job that maintains RDBMS content when json changes in github.
Python or node module to generate the enhanced document to be added to solr
Some of the domains are not redirecting. We need to retry the www equivalent.
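The www retry can be sketched as a small URL transform: if the bare domain fails, compute the www-prefixed equivalent and try that (returning null when the host is already www-prefixed, so we don't loop):

```javascript
// Return the www-prefixed variant of a URL, or null if it already has one.
function wwwVariant(urlString) {
  var u = new URL(urlString);
  if (/^www\./i.test(u.hostname)) return null; // already www — nothing to retry
  u.hostname = 'www.' + u.hostname;
  return u.href;
}
```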
starting with raw content (spidered from web sites), create a shiny clean version of the content that's free of cruft.
Need to handle database errors. Should just require attaching a catch hander.
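The catch handler could look like this sketch, where safeQuery is a made-up wrapper name and db is any object whose query method returns a promise; failures get logged and turned into a null result instead of crashing the crawler:

```javascript
// Wrap a DB call so a rejection is logged rather than left unhandled.
function safeQuery(db, sql) {
  return db.query(sql).catch(function (err) {
    console.error('DB error for query:', sql, err.message);
    return null; // caller checks for null instead of crashing
  });
}
```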
I deleted all the documents that were for ACNC because they were all returning 400 because of the query strings.
Querystring stripping is now enabled, but there is a set of URLs still stored that will cause issues when they come up for re-fetching.
It looks like something with aspx pages is the root cause.
If query stripping is all that needs to be done, we can just update those URLs before they are due again.
The max limit is always being applied, leaving no way to request unlimited results.
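A common convention for this is to treat a missing or non-positive max as "unlimited" rather than always capping. Sketch (applyLimit is a hypothetical helper, not existing code):

```javascript
// Apply a result cap only when max is a positive number;
// 0, negative, or undefined means unlimited.
function applyLimit(results, max) {
  if (!max || max <= 0) return results;
  return results.slice(0, max);
}
```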