ldspider's People

Contributors

aharth, jumbrich, kaefer3000, robertisele


ldspider's Issues

Request for JavaDoc

The next release should include an Ant task to create the Javadoc, plus some more 
documentation about the methods.

Original issue reported on code.google.com by [email protected] on 9 Dec 2012 at 2:17

ldspider does not build

1. SVN checkout 
2. mvn package

produces this:
[ERROR] COMPILATION ERROR : 
[INFO] -------------------------------------------------------------
[ERROR] /D:/svn/ldspider-read-only/src/test/java/com/ontologycentral/ldspider/hooks/sink/SinkSparulTest.java:[71,6] no suitable method found for evaluateBreadthFirst(com.ontologycentral.ldspider.frontier.Frontier,int,int,int,com.ontologycentral.ldspider.Crawler.Mode)
    method com.ontologycentral.ldspider.Crawler.evaluateBreadthFirst(com.ontologycentral.ldspider.frontier.Frontier,com.ontologycentral.ldspider.seen.Seen,com.ontologycentral.ldspider.queue.Redirects,int,int,int,int,boolean) is not applicable
      (actual and formal argument lists differ in length)
    method com.ontologycentral.ldspider.Crawler.evaluateBreadthFirst(com.ontologycentral.ldspider.frontier.Frontier,com.ontologycentral.ldspider.seen.Seen,com.ontologycentral.ldspider.queue.Redirects,int,int,int,int,boolean,com.ontologycentral.ldspider.Crawler.Mode) is not applicable
      (actual and formal argument lists differ in length)
[ERROR] /D:/svn/ldspider-read-only/src/test/java/com/ontologycentral/ldspider/SeedReadTest.java:[14,55] incompatible types
  required: java.util.Set<java.net.URI>
  found:    java.lang.Iterable<java.net.URI>
[ERROR] /D:/svn/ldspider-read-only/src/test/java/com/ontologycentral/ldspider/queue/ThreadingPollTest.java:[24,34] constructor BreadthFirstQueue in class com.ontologycentral.ldspider.queue.BreadthFirstQueue cannot be applied to given types;
  required: org.semanticweb.yars.tld.TldManager,com.ontologycentral.ldspider.queue.Redirects,com.ontologycentral.ldspider.seen.Seen,int,int,int,boolean
  found: com.ontologycentral.ldspider.tld.TldManager,int,int
  reason: actual and formal argument lists differ in length
[ERROR] /D:/svn/ldspider-read-only/src/test/java/com/ontologycentral/ldspider/queue/PerformanceTest.java:[24,40] constructor BreadthFirstQueue in class com.ontologycentral.ldspider.queue.BreadthFirstQueue cannot be applied to given types;
  required: org.semanticweb.yars.tld.TldManager,com.ontologycentral.ldspider.queue.Redirects,com.ontologycentral.ldspider.seen.Seen,int,int,int,boolean
  found: com.ontologycentral.ldspider.tld.TldManager,int,int
  reason: actual and formal argument lists differ in length
[ERROR] /D:/svn/ldspider-read-only/src/test/java/com/ontologycentral/ldspider/queue/PerformanceTest.java:[62,35] cannot find symbol
  symbol:   method addDirectly(java.net.URI)
  location: variable fq of type com.ontologycentral.ldspider.queue.BreadthFirstQueue

Original issue reported on code.google.com by [email protected] on 5 Jan 2015 at 7:31

common.log buggy

The common log output should include (i) the HTTP version and (ii) the length
of the content received.

Regarding (ii): we currently use -1 for streaming content or unknown content
length, but analog (a log-file analyser) complains about that. 0 is not correct
either, because the content length was not 0 but something else.
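A minimal sketch of the idea behind (ii), with a hypothetical class and a fixed timestamp (this is not ldspider's actual logging code): the Common Log Format convention is to print `-` for an unknown value, which analysers such as analog accept, and the request field already carries the HTTP version for (i).

```java
// Hypothetical sketch, not ldspider code: a Common Log Format line where
// the content length may be unknown. "-" is the CLF convention for a
// missing field; -1 or 0 would trip up log analysers such as analog.
class CommonLogLine {
    static String format(String host, String request, int status, long contentLength) {
        // negative length means "unknown" -> emit "-" per CLF convention
        String len = contentLength >= 0 ? Long.toString(contentLength) : "-";
        return host + " - - [01/Jan/2010:00:00:00 +0000] \"" + request + "\" "
                + status + " " + len;
    }
}
```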

Original issue reported on code.google.com by [email protected] on 5 Jan 2010 at 6:14

Exception

A user reported the following exception (not 100% sure which version he's on):

INFO: lookup on http://dbpedia.org/data/John_Henry_Bremridge.xml status 200
java.lang.NullPointerException
at com.ontologycentral.ldspider.hooks.links.LinkFilterDefault.addUri(LinkFilterDefault.java:106)
at com.ontologycentral.ldspider.hooks.links.LinkFilterDefault.addABox(LinkFilterDefault.java:87)
at com.ontologycentral.ldspider.hooks.links.LinkFilterDefault.processStatement(LinkFilterDefault.java:63)
at org.semanticweb.yars.util.Callbacks.processStatement(Unknown Source)
at org.semanticweb.yars2.rdfxml.RDFXMLParserBase.processStatement(Unknown Source)
at org.semanticweb.yars2.rdfxml.RDFXMLParserBase.handleStatement(Unknown Source)
at org.semanticweb.yars2.rdfxml.RDFXMLParserBase.handlePropertyAttributePair(Unknown Source)
at org.semanticweb.yars2.rdfxml.RDFXMLParserBase.initialiseCurrentProperty(Unknown Source)
at org.semanticweb.yars2.rdfxml.RDFXMLParserBase.startElement(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:501)
at com.sun.org.apache.xerces.internal.parsers.AbstractXMLDocumentParser.emptyElement(AbstractXMLDocumentParser.java:179)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:377)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2755)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
at org.semanticweb.yars2.rdfxml.RDFXMLParser.<init>(Unknown Source)
at com.ontologycentral.ldspider.hooks.content.ContentHandlerRdfXml.handle(ContentHandlerRdfXml.java:37)
at com.ontologycentral.ldspider.hooks.content.ContentHandlers.handle(ContentHandlers.java:35)
at com.ontologycentral.ldspider.http.LookupThread.run(LookupThread.java:120)
at java.lang.Thread.run(Thread.java:680)

Original issue reported on code.google.com by [email protected] on 28 Mar 2011 at 1:18

Proxy configuration

It is not clear how to set up a proxy (via ConnectionManager).

The crawler should use the OS-wide proxy setting (if available), and perhaps
also offer options for overriding it.
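One conventional workaround, sketched below with a hypothetical helper class (ldspider may or may not honour these, depending on which HTTP client it uses internally): the standard JVM proxy system properties, which act as a JVM-wide override independent of the OS setting.

```java
// Sketch, not ldspider's actual API: configure an HTTP proxy via the
// standard JVM system properties, which java.net-based clients honour.
class ProxySetup {
    static void setHttpProxy(String host, int port) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", Integer.toString(port));
    }
}
```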


Original issue reported on code.google.com by [email protected] on 23 Dec 2009 at 4:23

404 during initialization of TldManager

What steps will reproduce the problem?
1. create Crawler object

Sep 24, 2010 10:12:20 AM com.ontologycentral.ldspider.tld.TldManager <init>
INFO: status 404 for http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/src/effective_tld_names.dat?raw=1
Sep 24, 2010 10:12:20 AM com.ontologycentral.ldspider.Crawler <init>
INFO: cannot get tld file online cannot access http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/src/effective_tld_names.dat?raw=1: 404

The URL is incorrect, it should be: 

http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1

Original issue reported on code.google.com by [email protected] on 24 Sep 2010 at 8:20

separate callback for header information

What steps will reproduce the problem?
1. Use regular callback
2. Header-Triples show up in results

Solution: have a separate Callback to handle the header information.

Original issue reported on code.google.com by [email protected] on 18 May 2010 at 2:05

Weird behaviour with Content-Location header field

What steps will reproduce the problem?

1. Crawl "http://www.w3.org/2002/07/owl"

What is the expected output? What do you see instead?

 - "http://www.w3.org/2002/07/owl" has Content-Location of "owl.rdf"
 - context for quads from this document uses <http://www.w3.org/2002/07/owl>
 - a redirect is output from <http://www.w3.org/2002/07/owl> to <http://www.w3.org/2002/07/owl.rdf>

Please use labels and text to provide additional information.

 - behaviour is strange since we now have contexts which are the source of a redirect... there are various dangling redirects now.

(Found through problems ranking BTC11 where links are rewritten according to 
redirects, causing mis-alignment with contexts.)

Original issue reported on code.google.com by [email protected] on 1 Nov 2011 at 3:23

ContentHandlerRdfXml should also accept application/xml

What steps will reproduce the problem?
1. Retrieve an RDF/XML source that returns application/xml (instead of
rdf+xml) as MIME type (for example http://www.hyphen.info/rdf/47.xml)

What is the expected output? What do you see instead?
Data should be downloaded and parsed, instead it is ignored with the
message "disallowed via fetch filter", which is caused by the content
handler, as far as I can tell.

What version of the product are you using? On what operating system?
r213

Please provide any additional information below.
FetchFilterRdfXml accepts both application/rdf+xml and application/xml, so
the content handler should probably accept both, too.
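The suggested fix could look roughly like this; the class and method names here are hypothetical, not the actual ContentHandlerRdfXml API:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of the proposed behaviour: accept both MIME types that
// FetchFilterRdfXml already allows, ignoring any charset parameter.
class RdfXmlMimeTypes {
    static final Set<String> ACCEPTED = new HashSet<>(Arrays.asList(
            "application/rdf+xml", "application/xml"));

    static boolean canHandle(String contentType) {
        if (contentType == null) return false;
        // strip a parameter such as "; charset=utf-8" before comparing
        int semi = contentType.indexOf(';');
        String mime = (semi >= 0 ? contentType.substring(0, semi) : contentType).trim();
        return ACCEPTED.contains(mime.toLowerCase());
    }
}
```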

Original issue reported on code.google.com by [email protected] on 7 Jun 2010 at 2:37

Redirects with hash frags

What steps will reproduce the problem?
1.  Crawl http://purl.org/dc/terms/title


What is the expected output? What do you see instead?

The redirects file reports the target of the redirect as:

http://dublincore.org/2010/10/11/dcterms.rdf#title

...should that hash frag really be there?

What version of the product are you using? On what operating system?

From the BTC11 crawl... again causing problems with ranking.
Please provide any additional information below.
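If the fragment should indeed be dropped, one way to do it is sketched below (a hypothetical helper, not ldspider code); per RFC 3986 the fragment is not sent to the server and is not part of the retrievable resource's identity:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Sketch: drop the fragment from a redirect target before recording it.
class RedirectTarget {
    static URI stripFragment(URI u) {
        if (u.getFragment() == null) return u;
        try {
            // rebuild the URI from scheme + scheme-specific part, no fragment
            return new URI(u.getScheme(), u.getSchemeSpecificPart(), null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```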


Original issue reported on code.google.com by [email protected] on 3 Nov 2011 at 2:54

change the source code

Hi,
I want to change some parts of the code, so I used the TortoiseSVN client to
check out the code, but it gave me all of trunk except the src directory! :(
Who can help me?

How can I get the source code without SVN? Can I decompile the version 1.1
jar file? How?

What is the expected output? What do you see instead?
I want to be able to change the src code in trunk using the SVN facilities.



Original issue reported on code.google.com by [email protected] on 9 Jun 2012 at 6:55

Information about -c and -b parameters and empty output file

What steps will reproduce the problem?
1. java -jar ldspider-1.1e.jar -o data.nq -s dump.rdf -b 1 1 1

What is the expected output? What do you see instead?
Instead of getting some triples, there is an empty 'data.nq' and this message:
INFO: Stopping CloseIdleConnectionThread 
com.ontologycentral.ldspider.http.internal.CloseIdleConnectionThread.run


What version of the product are you using? On what operating system?
1.1e ; Windows 7 Enterprise


Please provide any additional information below.
I publish a MySQL database as linked data on localhost:2020 with d2r-server.
The dump.rdf file is d2r-server's output.
I'd like to have more information about the -c and -b parameters.

Original issue reported on code.google.com by [email protected] on 24 Oct 2012 at 8:19

mvn verify fails, as aduna software repository has not been available for several months

What steps will reproduce the problem?
1. Check out clean copy of current ldspider trunk
2. Remove all cached maven packages on your system ("rm -fr ~/.m2" on Linux)
3. Run "mvn verify"

What is the expected output?

I expect the project to download all of the dependencies and build successfully.

What do you see instead?

The project fails to build because it cannot pull in dependencies that it 
believes are hosted only by Aduna, because the Aduna site has been down since 
April 2014.

[ERROR] Failed to execute goal on project ldspider: Could not resolve dependencies for project com.ontologycentral:ldspider:jar:1.2: Failed to collect dependencies at org.deri.any23:any23-core:jar:0.6.1 -> org.openrdf.sesame:sesame-model:jar:2.4.0: Failed to read artifact descriptor for org.openrdf.sesame:sesame-model:jar:2.4.0: Could not transfer artifact org.openrdf.sesame:sesame-model:pom:2.4.0 from/to aduna-software-release-repo (http://repo.aduna-software.org/maven2/releases): Connection to http://repo.aduna-software.org refused: Connection timed out -> [Help 1]

What version of the product are you using? On what operating system?

SVN trunk as of 2014-06-02 on Fedora 20.

Please provide any additional information below.

https://groups.google.com/forum/#!msg/fedora-tech/JWAAkvp6mBk/DiCeaN1SN4EJ 
suggests that the Aduna repository has been down since at least April.

Original issue reported on code.google.com by [email protected] on 2 Jun 2014 at 4:35

RDF Parser produces <nodexxy> URIs

What steps will reproduce the problem?
1. Crawl <http://www.lassila.org/ora.rdf#me>
2. Witness <http://www.lassila.org/ora.rdf#me> 
<http://xmlns.com/foaf/0.1/knows> <node1757n7capx1> in output

What is the expected output? What do you see instead?

I'd like to not see <node1757n7capx1> 

Original issue reported on code.google.com by [email protected] on 16 Feb 2014 at 12:24

Microdata support

Hello, 

I've been testing ldspider with different types of pages to check if it's 
extracting data correctly. RDF/XML page seems to be straightforward, but 
crawling an HTML page that contains microdata/rdfa markup doesn't seem to yield 
any data.

I'm using ldspider CLI support, here is the command:
java -jar ldspider.jar -any23 -c 2 -s seed.txt -o data.txt -a access-log.txt -v 
file-log.txt

access-log.txt content:
1347913722 1110 127.0.0.1 TCP_MISS/200 1909 GET http://www.guardian.co.uk/robots.txt - NONE/- text/plain
1347913722 0 127.0.0.1 TCP_MISS/499 -1 GET http://www.guardian.co.uk/commentisfree/2012/sep/17/cameron-goes-where-thatcher-never-dared - NONE/- null

data.txt file is empty.

Here is a test page that contains some microdata markup:
http://www.guardian.co.uk/commentisfree/2012/sep/17/cameron-goes-where-thatcher-never-dared
Trying to extract the embedded data with the any23 service does yield some data:
http://any23.org/any23/best/http:/www.guardian.co.uk/commentisfree/2012/sep/17/cameron-goes-where-thatcher-never-dared

Any clues?
Thanks in advance.

Original issue reported on code.google.com by [email protected] on 17 Sep 2012 at 8:33

Cannot get any triple :(

I'm trying to fetch data starting from that resource: 
http://linkeddata.few.vu.nl/googleart/index.rdf

What steps will reproduce the problem?
1. java -jar ldspider-1.1d.jar -o test.nq -u http://linkeddata.few.vu.nl/googleart/index.rdf -y

What is the expected output? What do you see instead?
Instead of getting some triples, there is an empty 'test.nq' and this message:
INFO: Stopping CloseIdleConnectionThread
java.lang.InterruptedException: sleep interrupted
    at java.lang.Thread.sleep(Native Method)
    at com.ontologycentral.ldspider.http.internal.CloseIdleConnectionThread.run(Unknown Source)


What version of the product are you using? On what operating system?
1.1d


Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 16 Mar 2011 at 1:54

Enable the LinkSelectFilter (hooks/links) from CLI

Add a new command line option to enable the link-select filter:
- a flag (-a|-d) followed by a list of predicates
- perhaps use the prefix.cc service to support prefixes such as foaf:knows;
  translate the prefixes with the prefix.cc db list to map the input to URIs
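The prefix-expansion step could be sketched as below; the class is hypothetical, and a real implementation would load the namespace map from a downloaded prefix.cc dump rather than hard-coding entries:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of expanding CURIE-style input (e.g. "foaf:knows") to full URIs.
class PrefixExpander {
    static final Map<String, String> PREFIXES = new HashMap<>();
    static {
        // in practice: populated from a prefix.cc namespace dump
        PREFIXES.put("foaf", "http://xmlns.com/foaf/0.1/");
        PREFIXES.put("rdfs", "http://www.w3.org/2000/01/rdf-schema#");
    }

    static String expand(String curie) {
        int colon = curie.indexOf(':');
        if (colon < 0) return curie;                     // not a CURIE
        String ns = PREFIXES.get(curie.substring(0, colon));
        return ns == null ? curie : ns + curie.substring(colon + 1);
    }
}
```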

Original issue reported on code.google.com by [email protected] on 16 Oct 2009 at 11:00

Redirect counts as hop during crawling

What steps will reproduce the problem?
1. evaluateBreadthFirst with only 1 MaxHop
2. Frontier contains URI that redirects to RDF source
3. Redirected RDF source is not crawled

It would be nicer if redirects did not count as hops.
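The suggested behaviour amounts to depth bookkeeping along these lines (a hypothetical sketch, not ldspider's actual queue code): a redirect target inherits the source's depth, while a followed link increments it.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Sketch: redirects inherit depth, links increment it, so a seed URI that
// redirects to an RDF document can still be fetched with MaxHop = 1.
class HopCounter {
    final Map<URI, Integer> depth = new HashMap<>();

    void addSeed(URI u)               { depth.put(u, 0); }
    void addLink(URI from, URI to)    { depth.put(to, depth.get(from) + 1); }
    void addRedirect(URI from, URI to){ depth.put(to, depth.get(from)); }
}
```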

Original issue reported on code.google.com by [email protected] on 14 Sep 2010 at 3:44

  • Merged into: #3

Relative redirects not handled properly

What steps will reproduce the problem?
1. lookup http://semantictweet.com/fitango
2. sends local redirects 
3. problem

What is the expected output? What do you see instead?

Local redirects should be followed (provided that the local redirects conform
to the HTTP spec).
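Resolving such a relative Location value could look like this (hypothetical helper, not ldspider's actual redirect handling); RFC 7231 explicitly permits relative references in Location, to be resolved against the request URI:

```java
import java.net.URI;

// Sketch: resolve a (possibly relative) Location header against the
// request URI, as java.net.URI#resolve implements RFC 3986 resolution.
class RedirectResolver {
    static URI resolveLocation(URI requestUri, String location) {
        return requestUri.resolve(location);
    }
}
```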

Original issue reported on code.google.com by [email protected] on 19 Jun 2010 at 12:22
