ausdto / disco_layer
Code, outputs and information relevant to the discovery layer.
The callback chain is causing issues. This function effectively needs to wait until we get an answer; the promises should resolve before the chain continues.
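The wait-for-an-answer problem can be sketched by wrapping the callback API in a promise. Everything here is hypothetical (fetchPage stands in for whatever callback-style call the crawler actually makes); it only illustrates the pattern.

```javascript
// Hypothetical stand-in for a callback-style fetch in the crawler.
function fetchPage(url, callback) {
  // Simulate an answer arriving asynchronously.
  setTimeout(function () {
    callback(null, { url: url, status: 200 });
  }, 10);
}

// Promise wrapper: resolves with the answer, rejects on error, so
// downstream code can simply await the result.
function fetchPageAsync(url) {
  return new Promise(function (resolve, reject) {
    fetchPage(url, function (err, result) {
      if (err) reject(err);
      else resolve(result);
    });
  });
}
```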
For example, if going to this website:
http://www.acnc.gov.au/findacharity
it redirects to:
http://www.acnc.gov.au/ACNC/FindCharity/QuickSearch/ACNC/OnlineProcessors/Online_register/Search_the_Register.aspx?noleft=1
But in the crawler I get a 599 and the redirected URL is never fetched.
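One possible shape for handling this: when the response is a 3xx, resolve the Location header against the original URL and queue that, rather than letting the request surface as a 599. This is a sketch, not the crawler's actual code; it assumes Node-style lowercased header names.

```javascript
// Given a response, work out the redirect target (or null if there is none).
// Node's http module lowercases header names, hence headers.location.
function redirectTarget(originalUrl, statusCode, headers) {
  if (statusCode >= 300 && statusCode < 400 && headers.location) {
    // Resolve relative Location values against the requested URL.
    return new URL(headers.location, originalUrl).href;
  }
  return null; // not a redirect
}
```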
In disco_service/spiderbucket/management/commands/sync_docs_from_orientdb.py I have hardcoded values for OrientDB. These should be drawn from environment variables indirectly, through settings.py.
Some query params seem to be getting incorrectly defined.
Most likely this is an encode/decode issue.
Examples:
info: Url was 404: http://ahl.gov.au/%3Fq=partnerships
info: Url was 404: http://ahl.gov.au/%3Fq=our-organisation
info: Url was 404: http://ahl.gov.au/%3Fq=ahl-board
info: Url was 404: http://ahl.gov.au/%3Fq=customer-service-charter
info: Url was 404: http://ahl.gov.au/%3Fq=contact
info: Url was 404: http://ahl.gov.au/%3Fq=employment
info: Url was 404: http://ahl.gov.au/%3Fq=support-services
info: Url was 404: http://ahl.gov.au/%3Fq=node%2F222
http://lmip.gov.au/default.aspx%3FLMIP%2FContactUs
Related: #32
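The 404s above all share the same symptom: a "?" that has been over-encoded as "%3F", so the server sees it as part of the path. A minimal sketch of a fix is to decode just that first "%3F" back into "?", leaving the rest of the URL (e.g. "%2F" inside the query) untouched:

```javascript
// Restore an over-encoded query separator. Only the first "%3F" is
// decoded; anything after it is left exactly as-is.
function fixEncodedQuery(url) {
  var i = url.indexOf('%3F');
  if (i === -1) return url;
  return url.slice(0, i) + '?' + url.slice(i + 3);
}
```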
At the moment I am just leaving a fixed wait for database commands to finish. We need to close the database only after all the queries are done.
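One way to replace the fixed wait: track every outstanding DB promise, then close the connection only once they have all settled. This is a sketch (track/closeWhenDone are made-up names, and db is any object with a close method).

```javascript
// Collect every in-flight database promise.
var pending = [];

function track(promise) {
  pending.push(promise);
  return promise;
}

// Close the DB only after all tracked work has settled.
// Each promise gets a catch so one failure can't leave the DB open.
function closeWhenDone(db) {
  return Promise.all(pending.map(function (p) {
    return p.catch(function () {});
  })).then(function () {
    db.close();
  });
}
```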
/home/ec2-user/crawler/logs/greenpower.gov.au_investigate.log
There are a whole bunch from greenpower that say completed but never make it to the DB. I can manually insert the URL using Studio. It seems to be mostly PDFs.
Need to get log files rotating daily.
Need to move the DB config/params out of the primary functions.
depends on #29
A "pages like this one" page and/or API. Given a URL (e.g. the current page hosting a widget), return a list of pages "like this one".
The actual database password is not being passed.
Also look to see if the server password is actually needed any more.
At the least, create another account that is limited to just listing the DBs etc.
Potential fix is in local git.
At the moment, errors (404, 500, timeout) are not stored. They should be, but at some stage we should stop trying to refresh them. We need to decide the rules.
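As a starting point for those rules, one common shape is: store the error status with an attempt count and last-tried timestamp, back off exponentially, and give up after a maximum number of attempts. The numbers below are placeholders, not decided policy.

```javascript
// Sketch of a retry rule for stored error records.
// record: { attempts, lastTriedAt } with attempts >= 1.
function shouldRefetch(record, now) {
  var maxAttempts = 5;                 // placeholder: give up after 5 tries
  var baseDelayMs = 60 * 60 * 1000;    // placeholder: 1 hour base delay
  if (record.attempts >= maxAttempts) return false;
  // Exponential backoff: 1h, 2h, 4h, 8h ...
  var delay = baseDelayMs * Math.pow(2, record.attempts - 1);
  return now - record.lastTriedAt >= delay;
}
```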
In haste, a hack was added to bypass the domain checking and allow the crawler to go across all of gov.au. This changed the domainValid function to accept any domain. There is also crawler.filterByDomain = true, though, so maybe that can be changed instead. Regardless, the hack needs to be removed because it lives in an external module; then node_modules can be removed from the repo.
https://github.com/AusDTO/discoveryLayer/blob/master/node/node_modules/simplecrawler/lib/crawler.js, around line 536:

var crawler = this,
    crawlerHost = crawler.host;

// If we're ignoring the WWW domain, remove the WWW for comparisons...
if (crawler.ignoreWWWDomain)
    host = host.replace(/^www./i, "");

// TODO: HACKED. This is hacked to let it go outside this domain.
// Should then get caught by my conditions.
return true;
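A possible way to drop the hack without patching node_modules: keep the domain check logic in our own code as a fetch condition. The host test below is real and runnable; the crawler wiring is an untested sketch (simplecrawler versions of that era exposed filterByDomain and addFetchCondition, but verify against the vendored copy before relying on it).

```javascript
// Allow any host under gov.au (including gov.au itself and subdomains).
function isGovAu(host) {
  return /(^|\.)gov\.au$/i.test(host);
}

// Untested wiring sketch — check the vendored simplecrawler's API first:
//
//   crawler.filterByDomain = false;
//   crawler.addFetchCondition(function (queueItem) {
//     return isGovAu(queueItem.host);
//   });
```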
Currently, disco_service/govservices/management/commands/.. is a messy contraption that interfaces with a local git clone of the service catalogue repository.
It would be much better if it accessed an API on the node.js side for things like fetching lists of things that need to be synced in the DB. Better to have only one codebase for processing/managing that JSON graph.
This is WIP at the moment.
following #45, make some views so the service catalogue can be browsed.
seed "disco service" for the spider - these pages should be indexed (and boosted!).
error: OrientDB.RequestError: Cannot index record webDocumentContainer{protocol:http,host:agriculture.gov.au,port:80,path:/ScriptResource.axd?d=c6-x5B8upWa01zaoYvIPGWKXX9r_j-wWUWV9FCPAuE2AJ0QqVe_xNXoqJ-ENMUA7BaVzaiSnT3p4CMPDOOVQ9VeiR6kFneRW3EA_el-OFUPXZkB0h9SKofzsQa9sElSIPeaoWOV631b-L71nDxxlHWyHzDP2z5_xU02TsidI_m6JdZKrnP_PjSEm6FyPSo9S0&t=ffffffff805766b3,depth:3,pathname:/ScriptResource.axd?d=c6-x5B8upWa01zaoYvIPGWKXX9r_j-wWUWV9FCPAuE2AJ0QqVe_xNXoqJ-ENMUA7BaVzaiSnT3p4CMPDOOVQ9VeiR6kFneRW3EA_el-OFUPXZkB0h9SKofzsQa9sElSIPeaoWOV631b-L71nDxxlHWyHzDP2z5_xU02TsidI_m6JdZKrnP_PjSEm6FyPSo9S0&t=ffffffff805766b3,url:http://agriculture.gov.au/ScriptResource.axd%3Fd=c6-x5B8upWa01zaoYvIPGWKXX9r_j-wWUWV9FCPAuE2AJ0QqVe_xNXoqJ-ENMUA7BaVzaiSnT3p4CMPDOOVQ9VeiR6kFneRW3EA_el-OFUPXZkB0h9SKofzsQa9sElSIPeaoWOV631b-L71nDxxlHWyHzDP2z5_xU02TsidI_m6JdZKrnP_PjSEm6FyPSo9S0&t=ffffffff805766b3}: found duplicated key 'http://agriculture.gov.au/ScriptResource.axd%3Fd=c6-x5B8upWa01zaoYvIPGWKXX9r_j-wWUWV9FCPAuE2AJ0QqVe_xNXoqJ-ENMUA7BaVzaiSnT3p4CMPDOOVQ9VeiR6kFneRW3EA_el-OFUPXZkB0h9SKofzsQa9sElSIPeaoWOV631b-L71nDxxlHWyHzDP2z5_xU02TsidI_m6JdZKrnP_PjSEm6FyPSo9S0&t=ffffffff805766b3' in index 'webDocumentContainer.url' previously assigned to the record #13:25476
at Operation.parseError (/home/nokout/projects/discoveryLayer/crawler/node_modules/oriento/lib/transport/binary/protocol28/operation.js:832:13)
at Operation.consume (/home/nokout/projects/discoveryLayer/crawler/node_modules/oriento/lib/transport/binary/protocol28/operation.js:422:35)
at Connection.process (/home/nokout/projects/discoveryLayer/crawler/node_modules/oriento/lib/transport/binary/connection.js:360:17)
at Connection.handleSocketData (/home/nokout/projects/discoveryLayer/crawler/node_modules/oriento/lib/transport/binary/connection.js:279:17)
at Socket.emit (events.js:107:17)
at readableAddChunk (_stream_readable.js:163:16)
at Socket.Readable.push (_stream_readable.js:126:10)
at TCP.onread (net.js:538:20)
related to #30
long overdue, spiderbucket was always a stupid name
faceted browsing site with search features and recommendation engine
Kickstart semantic information extraction / enrichment. Reason over:
MVP might be "life event" facets based on service catalogue + content cagefight clustering (#30)
The current dockerised Solr container is Solr 5, but things might be easier with Solr 4 (until haystack support for Solr 5 gets a bit more mature).
Either shake the issues out, or downgrade to Solr 4 and upgrade later.
connected to #30.
following from #45, need to configure a search index
When queuing a lot of results we can get duplicate inserts after the select count. Not a big issue, because the record has in fact been stored, which is all we want anyhow.
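If the duplicate-key noise ever becomes worth fixing, one cheap guard is an in-flight set: remember URLs whose insert has been issued but not yet confirmed, so the select-count / insert race can't fire twice. A sketch with made-up names:

```javascript
// URLs whose insert is currently in flight.
var inFlight = new Set();

// insertFn(url) must return a promise for the actual DB insert.
function insertOnce(url, insertFn) {
  if (inFlight.has(url)) return Promise.resolve('duplicate-skipped');
  inFlight.add(url);
  return insertFn(url).then(function (result) {
    inFlight.delete(url);
    return result;
  }, function (err) {
    inFlight.delete(url);
    throw err;
  });
}
```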
I had a test_update_dimension that I'm sure used to pass, but I think I broke it when I refactored the tests such that agencies were stashed in an "ORM cache" class member (as a performance hack, to reduce traffic between the test suite and the test DB).
I don't understand why this bug occurs now, but want to get on with a major refactor that might end up removing it anyhow. So, current plan is to switch the test off (rename to "buggy_test_update_dimension").
Enhance the Exclude Domains fetch condition to exclude state domains.
Add hooks and a function to handle it.
http://orientdb.com/docs/last/Dynamic-Hooks.html
http://orientdb.com/docs/last/Functions.html
Dummy and then full
Should be trivial; 'python manage.py test' already works.
following from #27, we have jenkins testing the disco_service but not the node stuff yet.
Currently just using console logs. We need to move to a logging library, maybe winston.
This will allow easier determination of changes for downstream processing.
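A minimal winston configuration sketch, assuming winston 3.x is installed (npm i winston); this is untested here and the daily rotation mentioned above would need a separate rotating-file transport on top of it:

```javascript
// Configuration sketch only — check the winston docs for exact options.
const winston = require('winston');

const logger = winston.createLogger({
  level: 'info',
  format: winston.format.combine(
    winston.format.timestamp(),   // structured timestamps for downstream tools
    winston.format.json()
  ),
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: 'logs/crawler.log' })
  ]
});

// Usage, replacing the current console logs:
// logger.info('Url was 404: ' + url);
```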
Some URLs are giving 404 when the query params are left in place, but I assume some will not work if they are stripped.
Example where it fails with the query param (400):
acnc.gov.au/ACNC/Manage/Reporting/ReportTransitional/ACNC/Report/ReportTransitional.aspx%3Fhkey=61d173e0-cabb-4be4-82d5-0bf37fa55c7c
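Since neither blanket stripping nor blanket keeping works, one option is to strip by default and keep the query only for sites known to need it. The keep-list pattern below is a made-up placeholder; the real list would be built from observed failures.

```javascript
// Placeholder keep-list: sites whose pages need their query strings.
var keepQueryFor = [/some-site-that-needs-params\.gov\.au/i];

// Strip the query string unless the URL matches the keep-list.
function normaliseUrl(url) {
  var q = url.indexOf('?');
  if (q === -1) return url;
  var needsQuery = keepQueryFor.some(function (re) { return re.test(url); });
  return needsQuery ? url : url.slice(0, q);
}
```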
Need to add incremental commits
Parse the cleaned content (use something like lxml) into some kind of temporary structure, then traverse that structure to create a corresponding OrientDB graph.
note: it should be possible to "roundtrip" test this: from cleaned content to ContentAST, and from ContentAST back to equivalent (if not identical) clean content.
following from #45, create a jenkins job that maintains RDBMS content when json changes in github.
Python or node module to generate the enhanced document to be added to solr
Some of the domains are not redirecting. We need to retry the www equivalent.
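The www retry can be sketched as a small URL transform: if the bare domain fails, compute the www-prefixed equivalent and try that (returning null when the host is already www-prefixed, so we don't loop):

```javascript
// Return the www-prefixed variant of a URL, or null if it already has one.
function wwwVariant(urlString) {
  var u = new URL(urlString);
  if (/^www\./i.test(u.hostname)) return null; // already www — nothing to retry
  u.hostname = 'www.' + u.hostname;
  return u.href;
}
```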
starting with raw content (spidered from web sites), create a shiny clean version of the content that's free of cruft.
Need to handle database errors. Should just require attaching a catch hander.
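The catch handler could look like this sketch, where safeQuery is a made-up wrapper name and db is any object whose query method returns a promise; failures get logged and turned into a null result instead of crashing the crawler:

```javascript
// Wrap a DB call so a rejection is logged rather than left unhandled.
function safeQuery(db, sql) {
  return db.query(sql).catch(function (err) {
    console.error('DB error for query:', sql, err.message);
    return null; // caller checks for null instead of crashing
  });
}
```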
I deleted all the documents that were for ACNC because they were all returning 400 because of the query strings.
Querystring stripping is now enabled, but there is a set of URLs still stored that will cause issues when they come up for re-fetching.
It looks like something with aspx pages is the root cause.
If query stripping is all that needs to be done, we can just update those URLs before they are due again.
The max limit is always being applied, leaving no way to request unlimited results.
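A common convention for this is to treat a missing or non-positive max as "unlimited" rather than always capping. Sketch (applyLimit is a hypothetical helper, not existing code):

```javascript
// Apply a result cap only when max is a positive number;
// 0, negative, or undefined means unlimited.
function applyLimit(results, max) {
  if (!max || max <= 0) return results;
  return results.slice(0, max);
}
```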