cif2cif / ldspider
Automatically exported from code.google.com/p/ldspider
The next release should include an Ant task to create the javadoc and some more
documentation about the method.
Original issue reported on code.google.com by [email protected]
on 9 Dec 2012 at 2:17
1. SVN checkout
2. mvn package
produces this
[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR]
/D:/svn/ldspider-read-only/src/test/java/com/ontologycentral/ldspider/hooks/sink/SinkSparulTest.java:[71,6] no suitable method found for evaluateBreadthFirst(com.ontologycentral.ldspider.frontier.Frontier,int,int,int,com.ontologycentral.ldspider.Crawler.Mode)
method com.ontologycentral.ldspider.Crawler.evaluateBreadthFirst(com.ontologycentral.ldspider.frontier.Frontier,com.ontologycentral.ldspider.seen.Seen,com.ontologycentral.ldspider.queue.Redirects,int,int,int,int,boolean) is not applicable
(actual and formal argument lists differ in length)
method com.ontologycentral.ldspider.Crawler.evaluateBreadthFirst(com.ontologycentral.ldspider.frontier.Frontier,com.ontologycentral.ldspider.seen.Seen,com.ontologycentral.ldspider.queue.Redirects,int,int,int,int,boolean,com.ontologycentral.ldspider.Crawler.Mode) is not applicable
(actual and formal argument lists differ in length)
[ERROR]
/D:/svn/ldspider-read-only/src/test/java/com/ontologycentral/ldspider/SeedReadTest.java:[14,55] incompatible types
required: java.util.Set<java.net.URI>
found: java.lang.Iterable<java.net.URI>
[ERROR]
/D:/svn/ldspider-read-only/src/test/java/com/ontologycentral/ldspider/queue/ThreadingPollTest.java:[24,34] constructor BreadthFirstQueue in class com.ontologycentral.ldspider.queue.BreadthFirstQueue cannot be applied to given types;
required: org.semanticweb.yars.tld.TldManager,com.ontologycentral.ldspider.queue.Redirects,com.ontologycentral.ldspider.seen.Seen,int,int,int,boolean
found: com.ontologycentral.ldspider.tld.TldManager,int,int
reason: actual and formal argument lists differ in length
[ERROR]
/D:/svn/ldspider-read-only/src/test/java/com/ontologycentral/ldspider/queue/PerformanceTest.java:[24,40] constructor BreadthFirstQueue in class com.ontologycentral.ldspider.queue.BreadthFirstQueue cannot be applied to given types;
required: org.semanticweb.yars.tld.TldManager,com.ontologycentral.ldspider.queue.Redirects,com.ontologycentral.ldspider.seen.Seen,int,int,int,boolean
found: com.ontologycentral.ldspider.tld.TldManager,int,int
reason: actual and formal argument lists differ in length
[ERROR]
/D:/svn/ldspider-read-only/src/test/java/com/ontologycentral/ldspider/queue/PerformanceTest.java:[62,35] cannot find symbol
symbol: method addDirectly(java.net.URI)
location: variable fq of type com.ontologycentral.ldspider.queue.BreadthFirstQueue
Original issue reported on code.google.com by [email protected]
on 5 Jan 2015 at 7:31
The common log should include i) the HTTP version and ii) the length of the content received.
Regarding ii): we currently use -1 for streaming content or unknown content length, but analog (a log file analyser) complains about that. 0 is not correct either, because the content length was not 0 but something else.
Original issue reported on code.google.com by [email protected]
on 5 Jan 2010 at 6:14
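For ii), the Common Log Format convention is to log "-" rather than -1 or 0 when the body size is unknown, which analysers such as analog accept. A minimal sketch (a hypothetical formatter class, not ldspider's actual logging code):

```java
// Minimal sketch, NOT ldspider's actual logger: a Common Log Format line
// with the HTTP version carried in the request field, and "-" (the CLF
// convention) instead of -1 when the content length is unknown.
public class ClfFormatter {
    // timestamp fixed for illustration; a real logger would format the current time
    static final String TS = "[05/Jan/2010:18:14:00 +0000]";

    public static String format(String host, String request, int status, long bytes) {
        String size = bytes < 0 ? "-" : Long.toString(bytes); // "-" = unknown/streaming
        return host + " - - " + TS + " \"" + request + "\" " + status + " " + size;
    }

    public static void main(String[] args) {
        System.out.println(format("127.0.0.1", "GET /robots.txt HTTP/1.1", 200, 1909));
        System.out.println(format("127.0.0.1", "GET /stream HTTP/1.1", 200, -1));
    }
}
```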
It would be great if the bot could follow the Crawl-delay extension to the
robots.txt protocol to avoid overloading a server.
Original issue reported on code.google.com by [email protected]
on 16 Apr 2010 at 1:44
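Crawl-delay is a plain "Crawl-delay: <seconds>" line in robots.txt. A minimal sketch of extracting it, ignoring per-user-agent grouping (which a real implementation would need to honour):

```java
import java.util.Locale;

// Hedged sketch: pull the Crawl-delay value out of a robots.txt body.
// Simplification: this ignores User-agent grouping and takes the first
// Crawl-delay line it finds. Returns the delay in seconds, or -1 if absent.
public class CrawlDelayParser {
    public static double crawlDelay(String robotsTxt) {
        for (String line : robotsTxt.split("\n")) {
            int hash = line.indexOf('#');              // strip trailing comments
            if (hash >= 0) line = line.substring(0, hash);
            line = line.trim().toLowerCase(Locale.ROOT);
            if (line.startsWith("crawl-delay:")) {
                try {
                    return Double.parseDouble(line.substring("crawl-delay:".length()).trim());
                } catch (NumberFormatException e) {
                    return -1; // malformed value: treat as absent
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nCrawl-delay: 5\nDisallow: /private\n";
        System.out.println(crawlDelay(robots)); // 5.0
    }
}
```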
Exclude header info by default (optionally switchable on).
Original issue reported on code.google.com by [email protected]
on 10 Jun 2010 at 12:27
What steps will reproduce the problem?
1. when a program using ldspider finishes, all threads are interrupted
2. resulting InterruptedException appears on the console instead of the log
Original issue reported on code.google.com by [email protected]
on 24 Sep 2010 at 8:22
A user reported the following exception (not 100% sure which version he's on):
INFO: lookup on http://dbpedia.org/data/John_Henry_Bremridge.xml status 200
java.lang.NullPointerException
at com.ontologycentral.ldspider.hooks.links.LinkFilterDefault.addUri(LinkFilterDefault.java:106)
at com.ontologycentral.ldspider.hooks.links.LinkFilterDefault.addABox(LinkFilterDefault.java:87)
at com.ontologycentral.ldspider.hooks.links.LinkFilterDefault.processStatement(LinkFilterDefault.java:63)
at org.semanticweb.yars.util.Callbacks.processStatement(Unknown Source)
at org.semanticweb.yars2.rdfxml.RDFXMLParserBase.processStatement(Unknown Source)
at org.semanticweb.yars2.rdfxml.RDFXMLParserBase.handleStatement(Unknown Source)
at org.semanticweb.yars2.rdfxml.RDFXMLParserBase.handlePropertyAttributePair(Unknown Source)
at org.semanticweb.yars2.rdfxml.RDFXMLParserBase.initialiseCurrentProperty(Unknown Source)
at org.semanticweb.yars2.rdfxml.RDFXMLParserBase.startElement(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:501)
at com.sun.org.apache.xerces.internal.parsers.AbstractXMLDocumentParser.emptyElement(AbstractXMLDocumentParser.java:179)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:377)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2755)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
at org.semanticweb.yars2.rdfxml.RDFXMLParser.<init>(Unknown Source)
at com.ontologycentral.ldspider.hooks.content.ContentHandlerRdfXml.handle(ContentHandlerRdfXml.java:37)
at com.ontologycentral.ldspider.hooks.content.ContentHandlers.handle(ContentHandlers.java:35)
at com.ontologycentral.ldspider.http.LookupThread.run(LookupThread.java:120)
at java.lang.Thread.run(Thread.java:680)
Original issue reported on code.google.com by [email protected]
on 28 Mar 2011 at 1:18
Not clear how to set up a proxy (via ConnectionManager).
Should use the OS-wide proxy setting (if available); maybe also offer an option for
overriding the OS proxy setting.
Original issue reported on code.google.com by [email protected]
on 23 Dec 2009 at 4:23
Crawler does not detect infinite redirects.
Original issue reported on code.google.com by [email protected]
on 3 Jan 2010 at 11:55
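A hedged sketch of loop-safe redirect resolution, independent of ldspider's internals: a seen-set plus a hop cap catches both A -> B -> A cycles and unbounded chains. The map-based resolver below is illustrative only.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch, not ldspider API: resolve a chain of redirects
// while detecting cycles (via a seen-set) and overly long chains (via a cap).
public class RedirectResolver {
    static final int MAX_REDIRECTS = 10;

    /** Follows redirects in the given map; returns null on a loop or a too-long chain. */
    public static String resolve(String uri, Map<String, String> redirects) {
        Set<String> seen = new HashSet<>();
        int hops = 0;
        while (redirects.containsKey(uri)) {
            if (!seen.add(uri) || ++hops > MAX_REDIRECTS) {
                return null; // cycle detected or too many redirects
            }
            uri = redirects.get(uri);
        }
        return uri;
    }

    public static void main(String[] args) {
        Map<String, String> loop = Map.of("a", "b", "b", "a");
        System.out.println(resolve("a", loop)); // null: a <-> b cycle
    }
}
```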
Add a disclaimer to the project page that the source code might differ from the
shipped binary code.
Original issue reported on code.google.com by [email protected]
on 9 Dec 2012 at 2:18
What steps will reproduce the problem?
1. create Crawler object
Sep 24, 2010 10:12:20 AM com.ontologycentral.ldspider.tld.TldManager <init>
INFO: status 404 for http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/src/effective_tld_names.dat?raw=1
Sep 24, 2010 10:12:20 AM com.ontologycentral.ldspider.Crawler <init>
INFO: cannot get tld file online cannot access http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/src/effective_tld_names.dat?raw=1: 404
The URL is incorrect, it should be:
http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
Original issue reported on code.google.com by [email protected]
on 24 Sep 2010 at 8:20
What steps will reproduce the problem?
1. Use regular callback
2. Header triples show up in the results
Solution: have a separate Callback to handle the header information
Original issue reported on code.google.com by [email protected]
on 18 May 2010 at 2:05
Load a seed list with one large PLD and many smaller ones.
Make sure that the large PLD's queue gets hit from the beginning (so that access
to the large PLD is spread out over the crawl).
Original issue reported on code.google.com by [email protected]
on 6 May 2012 at 2:07
What steps will reproduce the problem?
1. Crawl "http://www.w3.org/2002/07/owl"
What is the expected output? What do you see instead?
- "http://www.w3.org/2002/07/owl" has Content-Location of "owl.rdf"
- context for quads from this document uses <http://www.w3.org/2002/07/owl>
- a redirect is output from <http://www.w3.org/2002/07/owl> to <http://www.w3.org/2002/07/owl.rdf>
Please use labels and text to provide additional information.
- behaviour is strange since we now have contexts which are the source of a redirect... there are various dangling redirects now.
(Found through problems ranking BTC11 where links are rewritten according to
redirects, causing mis-alignment with contexts.)
Original issue reported on code.google.com by [email protected]
on 1 Nov 2011 at 3:23
What steps will reproduce the problem?
1. Retrieve an RDF/XML source that returns application/xml (instead of
application/rdf+xml) as the MIME type (for example http://www.hyphen.info/rdf/47.xml)
What is the expected output? What do you see instead?
Data should be downloaded and parsed; instead it is ignored with the
message "disallowed via fetch filter", which is caused by the content
handler, as far as I can tell.
What version of the product are you using? On what operating system?
r213
Please provide any additional information below.
FetchFilterRdfXml accepts both application/rdf+xml and application/xml, so
the content handler should probably accept both too.
Original issue reported on code.google.com by [email protected]
on 7 Jun 2010 at 2:37
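A sketch of a content-handler check aligned with what FetchFilterRdfXml reportedly accepts; the class and method names below are hypothetical, not ldspider's API.

```java
// Hypothetical sketch: a MIME-type check that accepts both the types the
// fetch filter reportedly allows, so the content handler stays consistent
// with the fetch filter.
public class RdfXmlTypes {
    public static boolean canHandle(String contentType) {
        if (contentType == null) return false;
        // strip parameters such as "; charset=utf-8"
        String mime = contentType.split(";")[0].trim().toLowerCase();
        return mime.equals("application/rdf+xml")
            || mime.equals("application/xml");
    }

    public static void main(String[] args) {
        System.out.println(canHandle("application/xml; charset=utf-8")); // true
    }
}
```

A real implementation might also want to consider text/xml, which some servers use for the same payloads.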
e.g. http://github.com/shellac/java-rdfa
Original issue reported on code.google.com by [email protected]
on 10 Jun 2010 at 12:27
What steps will reproduce the problem?
1. Crawl http://purl.org/dc/terms/title
What is the expected output? What do you see instead?
The redirects file reports the target of the redirect as:
http://dublincore.org/2010/10/11/dcterms.rdf#title
...should that hash fragment really be there?
What version of the product are you using? On what operating system?
From the BTC11 crawl... again causing problems with ranking.
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 3 Nov 2011 at 2:54
The robots.txt handling in ldspider is unique per authority (host), which is not how
it should be done; see e.g. [1]. This results in IllegalArgumentExceptions if
https URIs are checked for robots.txt allowance.
[1] https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
Original issue reported on code.google.com by [email protected]
on 18 Jun 2012 at 2:16
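A hedged sketch of scoping a robots.txt cache key to scheme + host + port, as [1] describes, instead of host alone; the class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Sketch only: robots.txt is scoped to scheme + host + port, not just host.
// Keying a robots cache on this triple keeps http and https apart (and
// avoids choking on https URIs). Returns null for unparseable URIs.
public class RobotsKey {
    public static String robotsKey(String uri) {
        try {
            URI u = new URI(uri);
            int port = u.getPort();
            if (port == -1) { // fill in the scheme's default port
                port = "https".equalsIgnoreCase(u.getScheme()) ? 443 : 80;
            }
            return u.getScheme().toLowerCase() + "://" + u.getHost().toLowerCase() + ":" + port;
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(robotsKey("https://example.org/foo")); // https://example.org:443
    }
}
```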
Hi,
I want to change some parts of the code, so I used the TortoiseSVN software to
check out the code, but it gave me all of trunk except the src directory!
Who can help me?
How can I get the source code without SVN? Can I decompile the jar file, version 1.1? How?
What is the expected output? What do you see instead?
I want to change the src code in trunk with the SVN facilities.
Original issue reported on code.google.com by [email protected]
on 9 Jun 2012 at 6:55
What steps will reproduce the problem?
1. java -jar ldspider-1.1e.jar -o data.nq -s dump.rdf -b 1 1 1
What is the expected output? What do you see instead?
Instead of getting some triples, there is an empty 'data.nq' and this message:
INFO: Stopping CloseIdleConnectionThread
com.ontologycentral.ldspider.http.internal.CloseIdleConnectionThread.run
What version of the product are you using? On what operating system?
1.1e ; Windows 7 Enterprise
Please provide any additional information below.
I publish mysql database as linked data on localhost:2020 with d2r-server. The
dump.rdf file is d2r-server's output.
I'd like to have more information about -c and -b parameters.
Original issue reported on code.google.com by [email protected]
on 24 Oct 2012 at 8:19
What steps will reproduce the problem?
1. Check out clean copy of current ldspider trunk
2. Remove all cached maven packages on your system ("rm -fr ~/.m2" on Linux)
3. Run "mvn verify"
What is the expected output?
I expect the project to download all of the dependencies and build successfully.
What do you see instead?
The project fails to build because it cannot pull in dependencies that it
believes are hosted only by Aduna, because the Aduna site has been down since
April 2014.
[ERROR] Failed to execute goal on project ldspider: Could not resolve dependencies for project com.ontologycentral:ldspider:jar:1.2: Failed to collect dependencies at org.deri.any23:any23-core:jar:0.6.1 -> org.openrdf.sesame:sesame-model:jar:2.4.0: Failed to read artifact descriptor for org.openrdf.sesame:sesame-model:jar:2.4.0: Could not transfer artifact org.openrdf.sesame:sesame-model:pom:2.4.0 from/to aduna-software-release-repo (http://repo.aduna-software.org/maven2/releases): Connection to http://repo.aduna-software.org refused: Connection timed out -> [Help 1]
What version of the product are you using? On what operating system?
SVN trunk as of 2014-06-02 on Fedora 20.
Please provide any additional information below.
https://groups.google.com/forum/#!msg/fedora-tech/JWAAkvp6mBk/DiCeaN1SN4EJ
suggests that the Aduna repository has been down since at least April.
Original issue reported on code.google.com by [email protected]
on 2 Jun 2014 at 4:35
Check whether timeouts and hostname not found etc. errors are recorded in
access.log.
Original issue reported on code.google.com by [email protected]
on 6 May 2012 at 1:43
What steps will reproduce the problem?
1. crawl with -n
2. the seen list is not checked for whether a URI was previously seen
The same URI in the seed file should only be crawled once.
Original issue reported on code.google.com by [email protected]
on 2 May 2012 at 9:19
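A minimal sketch of the intended behaviour, with a plain HashSet standing in for ldspider's Seen abstraction: consult the seen-set before enqueuing, so a URI repeated in the seed file is crawled only once.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only (ldspider has its own Seen abstraction; this HashSet stands
// in for it): deduplicate seed URIs before they enter the crawl queue.
public class SeedDedup {
    public static List<String> dedupe(List<String> seeds) {
        Set<String> seen = new HashSet<>();
        List<String> queue = new ArrayList<>();
        for (String uri : seeds) {
            if (seen.add(uri)) { // add() returns false if the URI was already seen
                queue.add(uri);
            }
        }
        return queue;
    }

    public static void main(String[] args) {
        System.out.println(dedupe(List.of("a", "b", "a"))); // [a, b]
    }
}
```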
What steps will reproduce the problem?
1. Crawl <http://www.lassila.org/ora.rdf#me>
2. Witness <http://www.lassila.org/ora.rdf#me>
<http://xmlns.com/foaf/0.1/knows> <node1757n7capx1> in output
What is the expected output? What do you see instead?
I'd like to not see <node1757n7capx1>
Original issue reported on code.google.com by [email protected]
on 16 Feb 2014 at 12:24
Hello,
I've been testing ldspider with different types of pages to check whether it's
extracting data correctly. An RDF/XML page seems to be straightforward, but
crawling an HTML page that contains microdata/RDFa markup doesn't seem to yield
any data.
I'm using ldspider CLI support, here is the command:
java -jar ldspider.jar -any23 -c 2 -s seed.txt -o data.txt -a access-log.txt -v
file-log.txt
access-log.txt content:
1347913722 1110 127.0.0.1 TCP_MISS/200 1909 GET
http://www.guardian.co.uk/robots.txt - NONE/- text/plain
1347913722 0 127.0.0.1 TCP_MISS/499 -1 GET http://www.guardian.co.uk/commentisfree/2012/sep/17/cameron-goes-where-thatcher-never-dared - NONE/- null
data.txt file is empty.
Here is a test page that contains some microdata markup:
http://www.guardian.co.uk/commentisfree/2012/sep/17/cameron-goes-where-thatcher-never-dared
Trying to extract the embedded data with the any23 service does yield some data:
http://any23.org/any23/best/http:/www.guardian.co.uk/commentisfree/2012/sep/17/cameron-goes-where-thatcher-never-dared
Any clues?
Thanks in advance.
Original issue reported on code.google.com by [email protected]
on 17 Sep 2012 at 8:33
I'm trying to fetch data starting from that resource:
http://linkeddata.few.vu.nl/googleart/index.rdf
What steps will reproduce the problem?
1. java -jar ldspider-1.1d.jar -o test.nq -u http://linkeddata.few.vu.nl/googleart/index.rdf -y
What is the expected output? What do you see instead?
Instead of getting some triples, there is an empty 'test.nq' and this message:
INFO: Stopping CloseIdleConnectionThread
java.lang.InterruptedException: sleep interrupted
at java.lang.Thread.sleep(Native Method)
at com.ontologycentral.ldspider.http.internal.CloseIdleConnectionThread.run(Unknown Source)
What version of the product are you using? On what operating system?
1.1d
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 16 Mar 2011 at 1:54
Add a new command-line option to enable the linkselect filter.
- a flag (-a|-d) followed by a list of predicates
- perhaps use the prefix.cc service to allow prefixes, such as foaf:knows;
translate the prefixes with the prefix.cc db list to map the input to URIs
Original issue reported on code.google.com by [email protected]
on 16 Oct 2009 at 11:00
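The proposed prefix translation could look like the sketch below. The two namespace entries are real, well-known namespaces, but the class itself is illustrative; a fuller table could be fetched from the prefix.cc db list as the issue suggests.

```java
import java.util.Map;

// Illustrative sketch: expand CURIEs such as foaf:knows into absolute URIs
// using a local prefix table, so the linkselect filter can take short names.
public class PrefixExpander {
    static final Map<String, String> PREFIXES = Map.of(
        "foaf", "http://xmlns.com/foaf/0.1/",
        "rdfs", "http://www.w3.org/2000/01/rdf-schema#"
    );

    public static String expand(String curie) {
        int colon = curie.indexOf(':');
        if (colon < 0) return curie; // not a CURIE; pass through
        String ns = PREFIXES.get(curie.substring(0, colon));
        // unknown prefix (e.g. "http"): leave the input unchanged
        return ns == null ? curie : ns + curie.substring(colon + 1);
    }

    public static void main(String[] args) {
        System.out.println(expand("foaf:knows"));
    }
}
```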
What steps will reproduce the problem?
1. evaluateBreadthFirst with only 1 MaxHop
2. Frontier contains a URI that redirects to an RDF source
3. The redirected RDF source is not crawled
It would be nicer if redirects did not count as hops.
Original issue reported on code.google.com by [email protected]
on 14 Sep 2010 at 3:44
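One illustrative way to make redirects not count as hops (names hypothetical, not ldspider's API): track depth per URI and let a redirect target inherit its source's depth, while outlinks go one level deeper.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: per-URI hop depths where a redirect is "free" with respect
// to the hop limit, while an outlink costs one hop.
public class HopTracker {
    private final Map<String, Integer> depth = new HashMap<>();

    public void seed(String uri) { depth.put(uri, 0); }

    /** An outlink discovered in a document: one hop deeper than its source. */
    public void outlink(String source, String target) {
        depth.putIfAbsent(target, depth.getOrDefault(source, 0) + 1);
    }

    /** A redirect: the target inherits the source's depth unchanged. */
    public void redirect(String source, String target) {
        depth.putIfAbsent(target, depth.getOrDefault(source, 0));
    }

    public int depthOf(String uri) { return depth.getOrDefault(uri, -1); }

    public static void main(String[] args) {
        HopTracker ht = new HopTracker();
        ht.seed("http://a");
        ht.redirect("http://a", "http://b");  // still depth 0
        ht.outlink("http://b", "http://c");   // depth 1
        System.out.println(ht.depthOf("http://b") + " " + ht.depthOf("http://c"));
    }
}
```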
What steps will reproduce the problem?
1. lookup http://semantictweet.com/fitango
2. sends local redirects
3. problem
What is the expected output? What do you see instead?
Local redirects should be followed (provided that the local redirects conform to
the HTTP spec).
Original issue reported on code.google.com by [email protected]
on 19 Jun 2010 at 12:22