
disco_layer's People

Contributors

maxious, monkeypants, nokout


disco_layer's Issues

purge orientdb references

In disco_service/spiderbucket/management/commands/sync_docs_from_orientdb.py, I have hardcoded connection values for OrientDB.

These should be drawn from environment variables indirectly, through settings.py.
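For illustration, the indirection could look like this in settings.py (the variable names and defaults here are assumptions, not the repo's actual ones):

```python
# settings.py sketch: OrientDB connection details come from the
# environment, so the management command reads django.conf.settings
# instead of hardcoded literals. All names/defaults are illustrative.
import os

ORIENTDB_HOST = os.environ.get("ORIENTDB_HOST", "localhost")
ORIENTDB_PORT = int(os.environ.get("ORIENTDB_PORT", "2424"))
ORIENTDB_USER = os.environ.get("ORIENTDB_USER", "admin")
ORIENTDB_PASSWORD = os.environ.get("ORIENTDB_PASSWORD", "")
ORIENTDB_DB = os.environ.get("ORIENTDB_DB", "disco")
```

The command would then use `from django.conf import settings` and `settings.ORIENTDB_HOST` etc., so deployment config never touches the code.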

crawler: url encoding of query parameters

Some query params seem to be getting incorrectly encoded.
Most likely this is an encode/decode issue: the '?' is being percent-encoded as %3F, so it becomes part of the path instead of starting the query string.

Examples:
info: Url was 404: http://ahl.gov.au/%3Fq=partnerships
info: Url was 404: http://ahl.gov.au/%3Fq=our-organisation
info: Url was 404: http://ahl.gov.au/%3Fq=ahl-board
info: Url was 404: http://ahl.gov.au/%3Fq=customer-service-charter
info: Url was 404: http://ahl.gov.au/%3Fq=contact
info: Url was 404: http://ahl.gov.au/%3Fq=employment
info: Url was 404: http://ahl.gov.au/%3Fq=support-services
info: Url was 404: http://ahl.gov.au/%3Fq=node%2F222
http://lmip.gov.au/default.aspx%3FLMIP%2FContactUs
Related: #32
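The examples suggest the '?' has been percent-encoded into the path, so the server sees a literal path segment and no query string at all. A minimal Python sketch of the symptom and one possible repair (the helper name is hypothetical):

```python
from urllib.parse import urlsplit, unquote

def repair_encoded_query(url):
    """Hypothetical fix: decode the percent-encoded path so the '?'
    becomes a real query-string separator again."""
    parts = urlsplit(url)
    return urlsplit(parts.scheme + "://" + parts.netloc + unquote(parts.path))

bad = "http://ahl.gov.au/%3Fq=partnerships"
# The symptom: urlsplit sees no query at all -- the '?' is hiding in the path.
assert urlsplit(bad).query == ""
fixed = repair_encoded_query(bad)
assert fixed.path == "/" and fixed.query == "q=partnerships"
```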

crawler: http://greenpower.gov.au/

/home/ec2-user/crawler/logs/greenpower.gov.au_investigate.log

There are a whole bunch of URLs from greenpower that the crawler marks completed but that never make it to the DB.
I can manually insert the URLs using Studio.
Most of them seem to be PDFs.

crawler: database password not being passed

The actual database password is not being passed.
Also check whether the server password is still needed; at the least, create another account that is limited to just listing the DBs etc.

A potential fix is in local git.

Crawl Server Errors

At the moment, errors (404, 500, timeout) are not stored. They should be, but at some stage we should stop trying to refresh them. We need to decide the rules.
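One candidate rule, sketched purely as a starting point for discussion (the constants and names are assumptions, not anything in the crawler): store the error with a failure count and back off exponentially, giving up after a cap.

```python
# Illustrative retry policy: exponential backoff with a hard cap.
MAX_RETRIES = 5        # assumed cap, to be decided
BASE_DELAY_HOURS = 6   # assumed base interval

def next_fetch_delay_hours(failure_count):
    """Hours to wait before refetching a failed URL,
    or None once we give up on it for good."""
    if failure_count >= MAX_RETRIES:
        return None  # stop refreshing this URL
    return BASE_DELAY_HOURS * (2 ** failure_count)
```

A permanent 404 would hit the cap and drop out of the refresh cycle, while a transient 500 or timeout gets a few progressively spaced retries first.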

simplecrawler (Crawler.prototype.domainValid) Hack

In haste, a hack was added to bypass the domain checking so the crawler could go across all of gov.au. It changes the domainValid function to accept any domain. There is also crawler.filterByDomain = true, though, so maybe that could be changed instead.

Regardless, the hack needs to be removed because it patches an external module. Then node_modules can be removed from the repo.

https://github.com/AusDTO/discoveryLayer/blob/master/node/node_modules/simplecrawler/lib/crawler.js line 536

var crawler = this,
    crawlerHost = crawler.host;
//console.log("crawlerHost: " + crawlerHost);

// If we're ignoring the WWW domain, remove the WWW for comparisons...
if (crawler.ignoreWWWDomain)
    host = host.replace(/^www./i, "");

//console.log("in domainValid");
///TODO: HACKED This is hacked to let it go outside this domain. Should then get caught by my conditions.
return true;
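For reference, the intended rule — accept any gov.au host rather than any host at all — could be sketched like this (Python used only for illustration; the real fix belongs in the crawler's JavaScript, and the function name is made up):

```python
import re

# Matches "gov.au" itself or any subdomain of it, and nothing else.
GOV_AU = re.compile(r"(^|\.)gov\.au$", re.IGNORECASE)

def domain_valid(host, ignore_www=True):
    # Mirrors simplecrawler's ignoreWWWDomain behaviour before comparing.
    if ignore_www:
        host = re.sub(r"^www\.", "", host, flags=re.IGNORECASE)
    return bool(GOV_AU.search(host))
```

Note the anchored pattern rejects lookalikes such as "evilgov.au", which a plain substring test would let through.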

use REST api for service catalogue data

Currently, disco_service/govservices/management/commands/.. is a messy contraption that interfaces with a local git clone of the service catalogue repository.

It would be much better if it accessed an API on the node.js service for things like fetching lists of records that need to be synced into the DB. Better to have only one codebase for processing/managing that JSON graph.
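For example, the Django side could become a thin consumer of that API, with the sync decision kept as a pure function (the dict shape and field name below are guesses, not the real catalogue schema):

```python
def services_to_sync(api_services, db_service_ids):
    """Given service dicts fetched from the (hypothetical) node.js API
    and the set of ids already in the DB, return those still needing
    a sync. Assumes each service dict carries an "id" key."""
    return [s for s in api_services if s["id"] not in db_service_ids]
```

Keeping the HTTP fetch separate from this function also makes the management command trivially testable without a running node.js service.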

duplicate keys in url index

error: OrientDB.RequestError: Cannot index record webDocumentContainer{protocol:http,host:agriculture.gov.au,port:80,path:/ScriptResource.axd?d=c6-x5B8upWa01zaoYvIPGWKXX9r_j-wWUWV9FCPAuE2AJ0QqVe_xNXoqJ-ENMUA7BaVzaiSnT3p4CMPDOOVQ9VeiR6kFneRW3EA_el-OFUPXZkB0h9SKofzsQa9sElSIPeaoWOV631b-L71nDxxlHWyHzDP2z5_xU02TsidI_m6JdZKrnP_PjSEm6FyPSo9S0&t=ffffffff805766b3,depth:3,pathname:/ScriptResource.axd?d=c6-x5B8upWa01zaoYvIPGWKXX9r_j-wWUWV9FCPAuE2AJ0QqVe_xNXoqJ-ENMUA7BaVzaiSnT3p4CMPDOOVQ9VeiR6kFneRW3EA_el-OFUPXZkB0h9SKofzsQa9sElSIPeaoWOV631b-L71nDxxlHWyHzDP2z5_xU02TsidI_m6JdZKrnP_PjSEm6FyPSo9S0&t=ffffffff805766b3,url:http://agriculture.gov.au/ScriptResource.axd%3Fd=c6-x5B8upWa01zaoYvIPGWKXX9r_j-wWUWV9FCPAuE2AJ0QqVe_xNXoqJ-ENMUA7BaVzaiSnT3p4CMPDOOVQ9VeiR6kFneRW3EA_el-OFUPXZkB0h9SKofzsQa9sElSIPeaoWOV631b-L71nDxxlHWyHzDP2z5_xU02TsidI_m6JdZKrnP_PjSEm6FyPSo9S0&t=ffffffff805766b3}: found duplicated key 'http://agriculture.gov.au/ScriptResource.axd%3Fd=c6-x5B8upWa01zaoYvIPGWKXX9r_j-wWUWV9FCPAuE2AJ0QqVe_xNXoqJ-ENMUA7BaVzaiSnT3p4CMPDOOVQ9VeiR6kFneRW3EA_el-OFUPXZkB0h9SKofzsQa9sElSIPeaoWOV631b-L71nDxxlHWyHzDP2z5_xU02TsidI_m6JdZKrnP_PjSEm6FyPSo9S0&t=ffffffff805766b3' in index 'webDocumentContainer.url' previously assigned to the record #13:25476
at Operation.parseError (/home/nokout/projects/discoveryLayer/crawler/node_modules/oriento/lib/transport/binary/protocol28/operation.js:832:13)
at Operation.consume (/home/nokout/projects/discoveryLayer/crawler/node_modules/oriento/lib/transport/binary/protocol28/operation.js:422:35)
at Connection.process (/home/nokout/projects/discoveryLayer/crawler/node_modules/oriento/lib/transport/binary/connection.js:360:17)
at Connection.handleSocketData (/home/nokout/projects/discoveryLayer/crawler/node_modules/oriento/lib/transport/binary/connection.js:279:17)
at Socket.emit (events.js:107:17)
at readableAddChunk (_stream_readable.js:163:16)
at Socket.Readable.push (_stream_readable.js:126:10)
at TCP.onread (net.js:538:20)
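One plausible cause is the same URL being queued in both its percent-encoded and decoded spellings, so the second insert collides on the unique url index. If so, a pre-insert guard on a canonical key would avoid the failing write (a sketch only, not the crawler's actual code):

```python
def canonical_key(url):
    """Illustrative normalisation: decode the percent-encoded '?'
    (%3F) so both spellings of a URL share one index key."""
    return url.replace("%3F", "?").replace("%3f", "?")

seen = set()  # stand-in for a lookup against the OrientDB index

def should_insert(url):
    key = canonical_key(url)
    if key in seen:
        return False  # inserting would raise the duplicate-key error
    seen.add(key)
    return True
```

A fuller fix would normalise once at queue time, so only one spelling ever reaches the index.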

disco_service as a gateway to all the content

A faceted browsing site with search features and a recommendation engine.

Kickstart semantic information extraction / enrichment. Reason over:

  • 10M resources found
  • content extraction/description (NLP etc)
  • service catalogue data
  • AGIF metadata ontology + alternative terms taxonomy (AN Archives).
  • assertions binding service catalogue to AGIF metadata
  • assertions binding MOG chart to AGIF metadata

MVP might be "life event" facets based on service catalogue + content cagefight clustering (#30)

haystack/solr version compatibility

The current dockerised Solr container is Solr 5, but things might be easier with Solr 4 (until haystack support for Solr 5 gets a bit more mature).

Either shake the issues out, or downgrade to Solr 4 and upgrade later.
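If we downgrade for now, the haystack side would just pin the connection at the Solr 4 core, something like the following (the URL and core name are assumptions):

```python
# settings.py fragment: point django-haystack at a Solr 4 core until
# Solr 5 support matures. "disco" is an assumed core name.
HAYSTACK_CONNECTIONS = {
    "default": {
        "ENGINE": "haystack.backends.solr_backend.SolrEngine",
        "URL": "http://127.0.0.1:8983/solr/disco",
    },
}
```

The later upgrade then becomes a URL/container change rather than a code change.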

mysterious bug in govservices test_update_dimension

I had a test_update_dimension that I'm sure used to pass, but I think I broke it when I refactored the tests such that agencies were stashed in an "ORM cache" class member (as a performance hack, to reduce traffic between the test suite and the test DB).

I don't understand why this bug occurs now, but I want to get on with a major refactor that might end up removing it anyway. So, the current plan is to switch the test off (rename it to "buggy_test_update_dimension").

Fix Logging

Currently we are just using console logs.
We need to move to a proper logging library.

Maybe winston.

crawler: Stripping of query params

Some URLs give a 404 when the query params are left on, but I assume some will not work if they are stripped.
Example where it fails with the query param (400):
acnc.gov.au/ACNC/Manage/Reporting/ReportTransitional/ACNC/Report/ReportTransitional.aspx%3Fhkey=61d173e0-cabb-4be4-82d5-0bf37fa55c7c
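Stripping itself is simple with the standard library, but as noted it can't be unconditional, since some pages rely on their params (sketch only; the function name is made up):

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Drop the query string (and fragment) from a URL."""
    s = urlsplit(url)
    return urlunsplit((s.scheme, s.netloc, s.path, "", ""))
```

The open question is the predicate deciding *which* URLs get this treatment, e.g. a per-domain or per-extension rule.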

make AST (graph) of cleaned content

Parse the cleaned content (use something like lxml) into some kind of temporary structure, then traverse that structure to create a corresponding OrientDB graph.

note: it should be possible to "roundtrip" test this: from cleaned content to ContentAST, and from ContentAST back to equivalent (if not identical) clean content.
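The round-trip property can be demonstrated with the standard library's ElementTree standing in for lxml and for the eventual ContentAST structure (both stand-ins are assumptions, not the real design):

```python
# Sketch of the round-trip test: clean content -> tree -> clean content.
import xml.etree.ElementTree as ET

def to_ast(clean_content):
    """Stand-in for the real ContentAST builder."""
    return ET.fromstring(clean_content)

def from_ast(tree):
    """Stand-in for serialising the ContentAST back to clean content."""
    return ET.tostring(tree, encoding="unicode")

original = "<div><h1>Title</h1><p>Body text.</p></div>"
# Equivalent, if not identical -- with ElementTree and well-formed
# input this happens to round-trip exactly.
assert from_ast(to_ast(original)) == original
```

With real-world HTML the comparison would need to be canonicalising (whitespace, attribute order) rather than byte-exact.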

normalise/decruftify content

Starting with raw content (spidered from web sites), create a shiny clean version of the content that's free of cruft.

acnc.gov.au is currently excluded

I deleted all the documents that were for ACNC because they were all returning 400 because of the query strings.
Query-string stripping is now enabled, but there is a set of URLs still in the DB that will cause issues when they come up for re-fetching.
The root cause looks to be something to do with aspx pages.
If query stripping is all that needs doing, we can just update those URLs before they are due again.
