cif2cif / ldspider
Automatically exported from code.google.com/p/ldspider
The next release should include an Ant task to create the javadoc and some more
documentation about the method.
Original issue reported on code.google.com by [email protected]
on 9 Dec 2012 at 2:17
1. SVN checkout
2. mvn package
produces this
[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR]
/D:/svn/ldspider-read-only/src/test/java/com/ontologycentral/ldspider/hooks/sink/SinkSparulTest.java:[71,6] no suitable method found for evaluateBreadthFirst(com.ontologycentral.ldspider.frontier.Frontier,int,int,int,com.ontologycentral.ldspider.Crawler.Mode)
method com.ontologycentral.ldspider.Crawler.evaluateBreadthFirst(com.ontologycentral.ldspider.frontier.Frontier,com.ontologycentral.ldspider.seen.Seen,com.ontologycentral.ldspider.queue.Redirects,int,int,int,int,boolean) is not applicable
(actual and formal argument lists differ in length)
method com.ontologycentral.ldspider.Crawler.evaluateBreadthFirst(com.ontologycentral.ldspider.frontier.Frontier,com.ontologycentral.ldspider.seen.Seen,com.ontologycentral.ldspider.queue.Redirects,int,int,int,int,boolean,com.ontologycentral.ldspider.Crawler.Mode) is not applicable
(actual and formal argument lists differ in length)
[ERROR]
/D:/svn/ldspider-read-only/src/test/java/com/ontologycentral/ldspider/SeedReadTest.java:[14,55] incompatible types
required: java.util.Set<java.net.URI>
found: java.lang.Iterable<java.net.URI>
[ERROR]
/D:/svn/ldspider-read-only/src/test/java/com/ontologycentral/ldspider/queue/ThreadingPollTest.java:[24,34] constructor BreadthFirstQueue in class com.ontologycentral.ldspider.queue.BreadthFirstQueue cannot be applied to given types;
required: org.semanticweb.yars.tld.TldManager,com.ontologycentral.ldspider.queue.Redirects,com.ontologycentral.ldspider.seen.Seen,int,int,int,boolean
found: com.ontologycentral.ldspider.tld.TldManager,int,int
reason: actual and formal argument lists differ in length
[ERROR]
/D:/svn/ldspider-read-only/src/test/java/com/ontologycentral/ldspider/queue/PerformanceTest.java:[24,40] constructor BreadthFirstQueue in class com.ontologycentral.ldspider.queue.BreadthFirstQueue cannot be applied to given types;
required: org.semanticweb.yars.tld.TldManager,com.ontologycentral.ldspider.queue.Redirects,com.ontologycentral.ldspider.seen.Seen,int,int,int,boolean
found: com.ontologycentral.ldspider.tld.TldManager,int,int
reason: actual and formal argument lists differ in length
[ERROR]
/D:/svn/ldspider-read-only/src/test/java/com/ontologycentral/ldspider/queue/PerformanceTest.java:[62,35] cannot find symbol
symbol: method addDirectly(java.net.URI)
location: variable fq of type com.ontologycentral.ldspider.queue.BreadthFirstQueue
Original issue reported on code.google.com by [email protected]
on 5 Jan 2015 at 7:31
The common log should include i) the HTTP version and ii) the length of the content received.
Regarding ii): we currently use -1 for streaming content or unknown content length, but analog (a log file analyser) complains about that. 0 is not correct either, because the content length was not 0 but something else.
Original issue reported on code.google.com by [email protected]
on 5 Jan 2010 at 6:14
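For ii), the Common Log Format convention is to log "-" rather than -1 or 0 when the body size is unknown, which analysers such as analog accept. A minimal sketch (a hypothetical formatter class, not ldspider's actual logging code):

```java
// Minimal sketch, NOT ldspider's actual logger: a Common Log Format line
// with the HTTP version carried in the request field, and "-" (the CLF
// convention) instead of -1 when the content length is unknown.
public class ClfFormatter {
    // timestamp fixed for illustration; a real logger would format the current time
    static final String TS = "[05/Jan/2010:18:14:00 +0000]";

    public static String format(String host, String request, int status, long bytes) {
        String size = bytes < 0 ? "-" : Long.toString(bytes); // "-" = unknown/streaming
        return host + " - - " + TS + " \"" + request + "\" " + status + " " + size;
    }

    public static void main(String[] args) {
        System.out.println(format("127.0.0.1", "GET /robots.txt HTTP/1.1", 200, 1909));
        System.out.println(format("127.0.0.1", "GET /stream HTTP/1.1", 200, -1));
    }
}
```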
It would be great if the bot could follow the Crawl-delay extension to the
robots.txt protocol to avoid overloading a server.
Original issue reported on code.google.com by [email protected]
on 16 Apr 2010 at 1:44
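Crawl-delay is a plain "Crawl-delay: <seconds>" line in robots.txt. A minimal sketch of extracting it, ignoring per-user-agent grouping (which a real implementation would need to honour):

```java
import java.util.Locale;

// Hedged sketch: pull the Crawl-delay value out of a robots.txt body.
// Simplification: this ignores User-agent grouping and takes the first
// Crawl-delay line it finds. Returns the delay in seconds, or -1 if absent.
public class CrawlDelayParser {
    public static double crawlDelay(String robotsTxt) {
        for (String line : robotsTxt.split("\n")) {
            int hash = line.indexOf('#');              // strip trailing comments
            if (hash >= 0) line = line.substring(0, hash);
            line = line.trim().toLowerCase(Locale.ROOT);
            if (line.startsWith("crawl-delay:")) {
                try {
                    return Double.parseDouble(line.substring("crawl-delay:".length()).trim());
                } catch (NumberFormatException e) {
                    return -1; // malformed value: treat as absent
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nCrawl-delay: 5\nDisallow: /private\n";
        System.out.println(crawlDelay(robots)); // 5.0
    }
}
```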
Exclude header info by default (optionally switchable on).
Original issue reported on code.google.com by [email protected]
on 10 Jun 2010 at 12:27
What steps will reproduce the problem?
1. when a program using ldspider finishes, all threads are interrupted
2. resulting InterruptedException appears on the console instead of the log
Original issue reported on code.google.com by [email protected]
on 24 Sep 2010 at 8:22
A user reported the following exception (not 100% sure which version he's on):
INFO: lookup on http://dbpedia.org/data/John_Henry_Bremridge.xml status 200
java.lang.NullPointerException
at com.ontologycentral.ldspider.hooks.links.LinkFilterDefault.addUri(LinkFilterDefault.java:106)
at com.ontologycentral.ldspider.hooks.links.LinkFilterDefault.addABox(LinkFilterDefault.java:87)
at com.ontologycentral.ldspider.hooks.links.LinkFilterDefault.processStatement(LinkFilterDefault.java:63)
at org.semanticweb.yars.util.Callbacks.processStatement(Unknown Source)
at org.semanticweb.yars2.rdfxml.RDFXMLParserBase.processStatement(Unknown Source)
at org.semanticweb.yars2.rdfxml.RDFXMLParserBase.handleStatement(Unknown Source)
at org.semanticweb.yars2.rdfxml.RDFXMLParserBase.handlePropertyAttributePair(Unknown Source)
at org.semanticweb.yars2.rdfxml.RDFXMLParserBase.initialiseCurrentProperty(Unknown Source)
at org.semanticweb.yars2.rdfxml.RDFXMLParserBase.startElement(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:501)
at com.sun.org.apache.xerces.internal.parsers.AbstractXMLDocumentParser.emptyElement(AbstractXMLDocumentParser.java:179)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:377)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2755)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
at org.semanticweb.yars2.rdfxml.RDFXMLParser.<init>(Unknown Source)
at com.ontologycentral.ldspider.hooks.content.ContentHandlerRdfXml.handle(ContentHandlerRdfXml.java:37)
at com.ontologycentral.ldspider.hooks.content.ContentHandlers.handle(ContentHandlers.java:35)
at com.ontologycentral.ldspider.http.LookupThread.run(LookupThread.java:120)
at java.lang.Thread.run(Thread.java:680)
Original issue reported on code.google.com by [email protected]
on 28 Mar 2011 at 1:18
Not clear how to set up a proxy (via ConnectionManager).
Should use the OS-wide proxy setting (if available); maybe also offer an option for
overriding the OS proxy setting.
Original issue reported on code.google.com by [email protected]
on 23 Dec 2009 at 4:23
Crawler does not detect infinite redirects.
Original issue reported on code.google.com by [email protected]
on 3 Jan 2010 at 11:55
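A hedged sketch of loop-safe redirect resolution, independent of ldspider's internals: a seen-set plus a hop cap catches both A -> B -> A cycles and unbounded chains. The map-based resolver below is illustrative only.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch, not ldspider API: resolve a chain of redirects
// while detecting cycles (via a seen-set) and overly long chains (via a cap).
public class RedirectResolver {
    static final int MAX_REDIRECTS = 10;

    /** Follows redirects in the given map; returns null on a loop or a too-long chain. */
    public static String resolve(String uri, Map<String, String> redirects) {
        Set<String> seen = new HashSet<>();
        int hops = 0;
        while (redirects.containsKey(uri)) {
            if (!seen.add(uri) || ++hops > MAX_REDIRECTS) {
                return null; // cycle detected or too many redirects
            }
            uri = redirects.get(uri);
        }
        return uri;
    }

    public static void main(String[] args) {
        Map<String, String> loop = Map.of("a", "b", "b", "a");
        System.out.println(resolve("a", loop)); // null: a <-> b cycle
    }
}
```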
Add a disclaimer to the project page that the source code might differ from the
shipped binary code.
Original issue reported on code.google.com by [email protected]
on 9 Dec 2012 at 2:18
What steps will reproduce the problem?
1. create Crawler object
Sep 24, 2010 10:12:20 AM com.ontologycentral.ldspider.tld.TldManager <init>
INFO: status 404 for http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/src/effective_tld_names.dat?raw=1
Sep 24, 2010 10:12:20 AM com.ontologycentral.ldspider.Crawler <init>
INFO: cannot get tld file online cannot access http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/src/effective_tld_names.dat?raw=1: 404
The URL is incorrect, it should be:
http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
Original issue reported on code.google.com by [email protected]
on 24 Sep 2010 at 8:20
What steps will reproduce the problem?
1. Use regular callback
2. Header triples show up in the results
Solution: have a separate Callback to handle the header information
Original issue reported on code.google.com by [email protected]
on 18 May 2010 at 2:05
Load a seed list with one large PLD and many smaller ones.
Make sure that the large PLD's queue gets hit from the beginning (so that access
to the large PLD is spread out over the crawl).
Original issue reported on code.google.com by [email protected]
on 6 May 2012 at 2:07
What steps will reproduce the problem?
1. Crawl "http://www.w3.org/2002/07/owl"
What is the expected output? What do you see instead?
- "http://www.w3.org/2002/07/owl" has Content-Location of "owl.rdf"
- context for quads from this document uses <http://www.w3.org/2002/07/owl>
- a redirect is output from <http://www.w3.org/2002/07/owl> to <http://www.w3.org/2002/07/owl.rdf>
Please use labels and text to provide additional information.
- behaviour is strange since we now have contexts which are the source of a redirect... there are various dangling redirects now.
(Found through problems ranking BTC11 where links are rewritten according to
redirects, causing mis-alignment with contexts.)
Original issue reported on code.google.com by [email protected]
on 1 Nov 2011 at 3:23
What steps will reproduce the problem?
1. Retrieve an RDF/XML source that returns application/xml (instead of
application/rdf+xml) as the MIME type (for example http://www.hyphen.info/rdf/47.xml)
What is the expected output? What do you see instead?
Data should be downloaded and parsed; instead it is ignored with the
message "disallowed via fetch filter", which is caused by the content
handler, as far as I can tell.
What version of the product are you using? On what operating system?
r213
Please provide any additional information below.
FetchFilterRdfXml accepts both application/rdf+xml and application/xml, so
the content handler should probably accept both too.
Original issue reported on code.google.com by [email protected]
on 7 Jun 2010 at 2:37
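A sketch of a content-handler check aligned with what FetchFilterRdfXml reportedly accepts; the class and method names below are hypothetical, not ldspider's API.

```java
// Hypothetical sketch: a MIME-type check that accepts both the types the
// fetch filter reportedly allows, so the content handler stays consistent
// with the fetch filter.
public class RdfXmlTypes {
    public static boolean canHandle(String contentType) {
        if (contentType == null) return false;
        // strip parameters such as "; charset=utf-8"
        String mime = contentType.split(";")[0].trim().toLowerCase();
        return mime.equals("application/rdf+xml")
            || mime.equals("application/xml");
    }

    public static void main(String[] args) {
        System.out.println(canHandle("application/xml; charset=utf-8")); // true
    }
}
```

A real implementation might also want to consider text/xml, which some servers use for the same payloads.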
e.g. http://github.com/shellac/java-rdfa
Original issue reported on code.google.com by [email protected]
on 10 Jun 2010 at 12:27
What steps will reproduce the problem?
1. Crawl http://purl.org/dc/terms/title
What is the expected output? What do you see instead?
The redirects file reports the target of the redirect as:
http://dublincore.org/2010/10/11/dcterms.rdf#title
...should that hash fragment really be there?
What version of the product are you using? On what operating system?
From the BTC11 crawl... again causing problems with ranking.
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 3 Nov 2011 at 2:54
The robots.txt handling in ldspider is unique per authority (host), which is not how
it should be done; see e.g. [1]. This results in IllegalArgumentExceptions if
https URIs are checked for robots.txt allowance.
[1] https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
Original issue reported on code.google.com by [email protected]
on 18 Jun 2012 at 2:16
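A hedged sketch of scoping a robots.txt cache key to scheme + host + port, as [1] describes, instead of host alone; the class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Sketch only: robots.txt is scoped to scheme + host + port, not just host.
// Keying a robots cache on this triple keeps http and https apart (and
// avoids choking on https URIs). Returns null for unparseable URIs.
public class RobotsKey {
    public static String robotsKey(String uri) {
        try {
            URI u = new URI(uri);
            int port = u.getPort();
            if (port == -1) { // fill in the scheme's default port
                port = "https".equalsIgnoreCase(u.getScheme()) ? 443 : 80;
            }
            return u.getScheme().toLowerCase() + "://" + u.getHost().toLowerCase() + ":" + port;
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(robotsKey("https://example.org/foo")); // https://example.org:443
    }
}
```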
Hi,
I want to change some parts of the code, so I used the TortoiseSVN software to
check out the code, but it gave me all of trunk except the src directory!
Who can help me?
How can I get the source code without SVN? Can I decompile the jar file, version 1.1? How?
What is the expected output? What do you see instead?
I want to change the src code in trunk with the SVN facilities.
Original issue reported on code.google.com by [email protected]
on 9 Jun 2012 at 6:55
What steps will reproduce the problem?
1. java -jar ldspider-1.1e.jar -o data.nq -s dump.rdf -b 1 1 1
What is the expected output? What do you see instead?
Instead of getting some triples, there is an empty 'data.nq' and this message:
INFO: Stopping CloseIdleConnectionThread
com.ontologycentral.ldspider.http.internal.CloseIdleConnectionThread.run
What version of the product are you using? On what operating system?
1.1e ; Windows 7 Enterprise
Please provide any additional information below.
I publish mysql database as linked data on localhost:2020 with d2r-server. The
dump.rdf file is d2r-server's output.
I'd like to have more information about -c and -b parameters.
Original issue reported on code.google.com by [email protected]
on 24 Oct 2012 at 8:19
What steps will reproduce the problem?
1. Check out clean copy of current ldspider trunk
2. Remove all cached maven packages on your system ("rm -fr ~/.m2" on Linux)
3. Run "mvn verify"
What is the expected output?
I expect the project to download all of the dependencies and build successfully.
What do you see instead?
The project fails to build because it cannot pull in dependencies that it
believes are hosted only by Aduna, because the Aduna site has been down since
April 2014.
[ERROR] Failed to execute goal on project ldspider: Could not resolve dependencies for project com.ontologycentral:ldspider:jar:1.2: Failed to collect dependencies at org.deri.any23:any23-core:jar:0.6.1 -> org.openrdf.sesame:sesame-model:jar:2.4.0: Failed to read artifact descriptor for org.openrdf.sesame:sesame-model:jar:2.4.0: Could not transfer artifact org.openrdf.sesame:sesame-model:pom:2.4.0 from/to aduna-software-release-repo (http://repo.aduna-software.org/maven2/releases): Connection to http://repo.aduna-software.org refused: Connection timed out -> [Help 1]
What version of the product are you using? On what operating system?
SVN trunk as of 2014-06-02 on Fedora 20.
Please provide any additional information below.
https://groups.google.com/forum/#!msg/fedora-tech/JWAAkvp6mBk/DiCeaN1SN4EJ
suggests that the Aduna repository has been down since at least April.
Original issue reported on code.google.com by [email protected]
on 2 Jun 2014 at 4:35
Check whether timeouts and hostname not found etc. errors are recorded in
access.log.
Original issue reported on code.google.com by [email protected]
on 6 May 2012 at 1:43
What steps will reproduce the problem?
1. crawl with -n
2. the seen list is not checked for whether a URI was previously seen
The same URI in the seed file should only be crawled once.
Original issue reported on code.google.com by [email protected]
on 2 May 2012 at 9:19
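A minimal sketch of the intended behaviour, with a plain HashSet standing in for ldspider's Seen abstraction: consult the seen-set before enqueuing, so a URI repeated in the seed file is crawled only once.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only (ldspider has its own Seen abstraction; this HashSet stands
// in for it): deduplicate seed URIs before they enter the crawl queue.
public class SeedDedup {
    public static List<String> dedupe(List<String> seeds) {
        Set<String> seen = new HashSet<>();
        List<String> queue = new ArrayList<>();
        for (String uri : seeds) {
            if (seen.add(uri)) { // add() returns false if the URI was already seen
                queue.add(uri);
            }
        }
        return queue;
    }

    public static void main(String[] args) {
        System.out.println(dedupe(List.of("a", "b", "a"))); // [a, b]
    }
}
```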
What steps will reproduce the problem?
1. Crawl <http://www.lassila.org/ora.rdf#me>
2. Witness <http://www.lassila.org/ora.rdf#me>
<http://xmlns.com/foaf/0.1/knows> <node1757n7capx1> in output
What is the expected output? What do you see instead?
I'd like to not see <node1757n7capx1>
Original issue reported on code.google.com by [email protected]
on 16 Feb 2014 at 12:24
Hello,
I've been testing ldspider with different types of pages to check whether it's
extracting data correctly. An RDF/XML page seems to be straightforward, but
crawling an HTML page that contains microdata/RDFa markup doesn't seem to yield
any data.
I'm using ldspider CLI support, here is the command:
java -jar ldspider.jar -any23 -c 2 -s seed.txt -o data.txt -a access-log.txt -v
file-log.txt
access-log.txt content:
1347913722 1110 127.0.0.1 TCP_MISS/200 1909 GET
http://www.guardian.co.uk/robots.txt - NONE/- text/plain
1347913722 0 127.0.0.1 TCP_MISS/499 -1 GET http://www.guardian.co.uk/commentisfree/2012/sep/17/cameron-goes-where-thatcher-never-dared - NONE/- null
data.txt file is empty.
Here is a test page that contains some microdata markup:
http://www.guardian.co.uk/commentisfree/2012/sep/17/cameron-goes-where-thatcher-never-dared
Trying to extract the embedded data with the any23 service does yield some data:
http://any23.org/any23/best/http:/www.guardian.co.uk/commentisfree/2012/sep/17/cameron-goes-where-thatcher-never-dared
Any clues?
Thanks in advance.
Original issue reported on code.google.com by [email protected]
on 17 Sep 2012 at 8:33
I'm trying to fetch data starting from that resource:
http://linkeddata.few.vu.nl/googleart/index.rdf
What steps will reproduce the problem?
1. java -jar ldspider-1.1d.jar -o test.nq -u http://linkeddata.few.vu.nl/googleart/index.rdf -y
What is the expected output? What do you see instead?
Instead of getting some triples, there is an empty 'test.nq' and this message:
INFO: Stopping CloseIdleConnectionThread
java.lang.InterruptedException: sleep interrupted
at java.lang.Thread.sleep(Native Method)
at com.ontologycentral.ldspider.http.internal.CloseIdleConnectionThread.run(Unknown Source)
What version of the product are you using? On what operating system?
1.1d
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 16 Mar 2011 at 1:54
Add a new command-line option to enable the linkselect filter.
- a flag (-a|-d) followed by a list of predicates
- perhaps use the prefix.cc service to allow prefixes, such as foaf:knows;
translate the prefixes with the prefix.cc db list to map the input to URIs
Original issue reported on code.google.com by [email protected]
on 16 Oct 2009 at 11:00
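The proposed prefix translation could look like the sketch below. The two namespace entries are real, well-known namespaces, but the class itself is illustrative; a fuller table could be fetched from the prefix.cc db list as the issue suggests.

```java
import java.util.Map;

// Illustrative sketch: expand CURIEs such as foaf:knows into absolute URIs
// using a local prefix table, so the linkselect filter can take short names.
public class PrefixExpander {
    static final Map<String, String> PREFIXES = Map.of(
        "foaf", "http://xmlns.com/foaf/0.1/",
        "rdfs", "http://www.w3.org/2000/01/rdf-schema#"
    );

    public static String expand(String curie) {
        int colon = curie.indexOf(':');
        if (colon < 0) return curie; // not a CURIE; pass through
        String ns = PREFIXES.get(curie.substring(0, colon));
        // unknown prefix (e.g. "http"): leave the input unchanged
        return ns == null ? curie : ns + curie.substring(colon + 1);
    }

    public static void main(String[] args) {
        System.out.println(expand("foaf:knows"));
    }
}
```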
What steps will reproduce the problem?
1. evaluateBreadthFirst with only 1 MaxHop
2. Frontier contains a URI that redirects to an RDF source
3. The redirected RDF source is not crawled
It would be nicer if redirects did not count as hops.
Original issue reported on code.google.com by [email protected]
on 14 Sep 2010 at 3:44
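One illustrative way to make redirects not count as hops (names hypothetical, not ldspider's API): track depth per URI and let a redirect target inherit its source's depth, while outlinks go one level deeper.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: per-URI hop depths where a redirect is "free" with respect
// to the hop limit, while an outlink costs one hop.
public class HopTracker {
    private final Map<String, Integer> depth = new HashMap<>();

    public void seed(String uri) { depth.put(uri, 0); }

    /** An outlink discovered in a document: one hop deeper than its source. */
    public void outlink(String source, String target) {
        depth.putIfAbsent(target, depth.getOrDefault(source, 0) + 1);
    }

    /** A redirect: the target inherits the source's depth unchanged. */
    public void redirect(String source, String target) {
        depth.putIfAbsent(target, depth.getOrDefault(source, 0));
    }

    public int depthOf(String uri) { return depth.getOrDefault(uri, -1); }

    public static void main(String[] args) {
        HopTracker ht = new HopTracker();
        ht.seed("http://a");
        ht.redirect("http://a", "http://b");  // still depth 0
        ht.outlink("http://b", "http://c");   // depth 1
        System.out.println(ht.depthOf("http://b") + " " + ht.depthOf("http://c"));
    }
}
```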
What steps will reproduce the problem?
1. lookup http://semantictweet.com/fitango
2. sends local redirects
3. problem
What is the expected output? What do you see instead?
Local redirects should be followed (provided that the local redirects conform to
the HTTP spec).
Original issue reported on code.google.com by [email protected]
on 19 Jun 2010 at 12:22