crawler-commons / url-frontier

API definition, resources and reference implementation of URL Frontiers

License: Apache License 2.0

Topics: grpc, url-frontier, urlfrontier, web-crawlers, webcrawling

url-frontier's Introduction


Overview

Crawler-Commons is a set of reusable Java components that implement functionality common to any web crawler.
These components benefit from collaboration among various existing web crawler projects, and reduce duplication of effort.

Table of Contents

User Documentation

Javadocs

Mailing List

There is a mailing list on Google Groups.

Installation

Using Maven, add the following dependency to your pom.xml:

<dependency>
    <groupId>com.github.crawler-commons</groupId>
    <artifactId>crawler-commons</artifactId>
    <version>1.4</version>
</dependency>

Using Gradle, add the following to your build file:

dependencies {
    implementation group: 'com.github.crawler-commons', name: 'crawler-commons', version: '1.4'
}

News

18th July 2023 - crawler-commons 1.4 released

We are pleased to announce the 1.4 release of Crawler-Commons.

The new release includes many improvements and bug fixes, several dependency upgrades and improvements to the automatic build system. The following are the most notable improvements and changes:

  • Java 11 is now required to run or build crawler-commons
  • the robots.txt parser (SimpleRobotRulesParser) is now compliant with RFC 9309 and provides a new API entry point that accepts a collection of single-word user-agent product tokens, which allows faster, RFC-compliant matching of robots.txt user-agent lines (a usage sketch follows below). Please note that user-agent product tokens must be lower-case.
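Below is a minimal usage sketch of the new entry point, assuming the Collection-based parseContent signature described above; the product token and robots.txt content are made up for illustration.

import java.nio.charset.StandardCharsets;
import java.util.List;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsExample {
    public static void main(String[] args) {
        // illustrative robots.txt content
        byte[] content = ("User-agent: mycrawler\nDisallow: /private/\n")
                .getBytes(StandardCharsets.UTF_8);

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        // product tokens must be lower-case, as noted in the release announcement
        BaseRobotRules rules = parser.parseContent("https://www.example.com/robots.txt",
                content, "text/plain", List.of("mycrawler"));

        System.out.println(rules.isAllowed("https://www.example.com/private/page.html")); // false
        System.out.println(rules.isAllowed("https://www.example.com/public.html"));       // true
    }
}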

See the CHANGES.txt file included with the release for the detailed list of changes.

28th July 2022 - crawler-commons 1.3 released

We are glad to announce the 1.3 release of Crawler-Commons. See the CHANGES.txt file included with the release for a complete list of details. The new release includes multiple dependency upgrades, improvements to the automatic builds, and tighter protection against XXE vulnerabilities in the Sitemap parser.

14th October 2021 - crawler-commons 1.2 released

We are glad to announce the 1.2 release of Crawler-Commons. See the CHANGES.txt file included with the release for a complete list of details. This version fixes an XXE vulnerability issue in the Sitemap parser and includes several improvements to the URL normalizer and the Sitemaps parser.

29th June 2020 - crawler-commons 1.1 released

We are glad to announce the 1.1 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details.

21st March 2019 - crawler-commons 1.0 released

We are glad to announce the 1.0 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. Among other bug fixes and improvements this version adds support for parsing sitemap extensions (image, video, news, alternate links).

7th June 2018 - crawler-commons 0.10 released

We are glad to announce the 0.10 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. This version contains among other things improvements to the Sitemap parsing and the removal of the Tika dependency.

31st October 2017 - crawler-commons 0.9 released

We are glad to announce the 0.9 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. The main change is the removal of the DOM-based sitemap parser, as the SAX equivalent introduced in the previous version has better performance and is also more robust. You might need to change your code to replace SiteMapParserSAX with SiteMapParser. The parser is now aware of namespaces, and by default does not force the namespace to be the one recommended in the specification (http://www.sitemaps.org/schemas/sitemap/0.9), as variants can be found in the wild. You can set the behaviour using the method setStrictNamespace(boolean).
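A minimal sketch of how the namespace handling can be used, assuming the SiteMapParser API described above; the sitemap content and URLs are made up for illustration.

import java.net.URL;
import java.nio.charset.StandardCharsets;

import crawlercommons.sitemaps.AbstractSiteMap;
import crawlercommons.sitemaps.SiteMapParser;

public class SitemapExample {
    public static void main(String[] args) throws Exception {
        byte[] content = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n"
                + "  <url><loc>https://www.example.com/</loc></url>\n"
                + "</urlset>").getBytes(StandardCharsets.UTF_8);

        SiteMapParser parser = new SiteMapParser();
        // only accept the namespace recommended by the specification
        parser.setStrictNamespace(true);

        AbstractSiteMap sitemap = parser.parseSiteMap(content,
                new URL("https://www.example.com/sitemap.xml"));
        System.out.println(sitemap.isIndex()); // false for a plain urlset
    }
}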

As usual, version 0.9 contains numerous improvements and bug fixes, and all users are invited to upgrade to this version.

9th June 2017 - crawler-commons 0.8 released

We are glad to announce the 0.8 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. The main changes are the removal of the HTTP fetcher support, which has been put in a separate project. We also added a SAX-based parser for processing sitemaps, which requires less memory and is more robust to malformed documents than its DOM-based counterpart. The latter has been kept for now but might be removed in the future.

24th November 2016 - crawler-commons 0.7 released

We are glad to announce the 0.7 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. The main changes are that Crawler-Commons now requires Java 8 and that the package crawlercommons.url has been replaced with crawlercommons.domains. If your project uses crawler-commons, you might want to run the following command on it:

find . -type f -print0 | xargs -0 sed -i 's/import crawlercommons\.url\./import crawlercommons\.domains\./'

Please note also that this is the last release containing the HTTP fetcher support, which is deprecated and will be removed from the next version.

Version 0.7 contains numerous improvements and bug fixes, and all users are invited to upgrade to this version.

11th June 2015 - crawler-commons 0.6 is released

We are glad to announce the 0.6 release of Crawler Commons. See the CHANGES.txt file included with the release for a full list of details.

We suggest that all users upgrade to this version. Details of how to do so can be found on Maven Central. Please note that the groupId has changed to com.github.crawler-commons.

The Java documentation can be found here.

22nd April 2015 - crawler-commons has moved

The crawler-commons project is now being hosted at GitHub, due to the demise of Google code hosting.

15th October 2014 - crawler-commons 0.5 is released

We are glad to announce the 0.5 release of Crawler Commons. This release mainly improves Sitemap parsing and upgrades Apache Tika to 1.6.

See the CHANGES.txt file included with the release for a full list of details. Additionally the Java documentation can be found here.

We suggest that all users upgrade to this version. The Crawler Commons project artifacts are released as Maven artifacts and can be found at Maven Central.

11th April 2014 - crawler-commons 0.4 is released

We are glad to announce the 0.4 release of Crawler Commons. Amongst other improvements, this release includes support for Googlebot-compatible regular expressions in URL specifications, further improvements to robots.txt parsing and an upgrade of httpclient to v4.2.6.

See the CHANGES.txt file included with the release for a full list of details.

We suggest that all users upgrade to this version. Details of how to do so can be found on Maven Central.

11 Oct 2013 - crawler-commons 0.3 is released

This release improves robots.txt and sitemap parsing support, updates Tika to the latest released version (1.4), and removes some left-over cruft from the pre-Maven build setup.

See the CHANGES.txt file included with the release for a full list of details.

24 Jun 2013 - Nutch 1.7 now uses crawler-commons for robots.txt parsing

Similar to the previous note about Nutch 2.2, there's now a version of Nutch in the 1.x tree that also uses crawler-commons. See Apache Nutch v1.7 Released for more details.

08 Jun 2013 - Nutch 2.2 now uses crawler-commons for robots.txt parsing

See Apache Nutch v2.2 Released for more details.

02 Feb 2013 - crawler-commons 0.2 is released

This release improves robots.txt and sitemap parsing support.

See the CHANGES.txt file included with the release for a full list of details.

License

Published under Apache License 2.0, see LICENSE

url-frontier's People

Contributors

dependabot[bot], jnioche, jniochecf, michaeldinzinger


url-frontier's Issues

Remove lock in putURLs for RocksDB service

putURLs is vital in terms of performance: updates and additions to the frontier are done continuously, in a streaming fashion, and because of the nature of crawling they happen a very large number of times.

The implementation in 0.2 uses a monitor on the queues map, originally with the intent of preventing additions to a queue while it is being deleted. Queue deletions happen very infrequently (at least compared to putting URLs), but having this lock means that, even when no queue deletion is happening, multiple threads block each other within putURLs, which is completely unnecessary.

The profiler I am using on a crawl is showing that this is happening millions of times for each thread and wasting hundreds of seconds.

Instead of using a monitor, the deleteQueue method could simply record the queue being deleted in a map, with the added benefit that the operation would be done just once even if the method is called twice. putURLs would then only have to check whether the current queue is being deleted and skip the URL if that is the case. This also makes more sense as far as the logic is concerned.
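A minimal sketch of the proposed approach; class and method names are illustrative and not the actual url-frontier code.

import java.util.concurrent.ConcurrentHashMap;

public class DeletionMarkers {
    // queue key -> marker; presence means the queue is currently being deleted
    private final ConcurrentHashMap<String, Boolean> beingDeleted = new ConcurrentHashMap<>();

    /** Called by deleteQueue(); returns false if a deletion is already in progress. */
    public boolean markForDeletion(String queueKey) {
        return beingDeleted.putIfAbsent(queueKey, Boolean.TRUE) == null;
    }

    /** Called by putURLs() for each incoming URL; no global lock needed. */
    public boolean shouldSkip(String queueKey) {
        return beingDeleted.containsKey(queueKey);
    }

    /** Called once the deletion has completed. */
    public void deletionDone(String queueKey) {
        beingDeleted.remove(queueKey);
    }
}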

Odd queue names

I ran a brief crawl starting from https://example.com/ and after a few moments inspected the queues with ListQueues. This included the following:

so.icann.org#content
community.icann.org#title-heading
gac.icann.org#nav
lacnic.net
community.icann.org#breadcrumbs
community.icann.org#rw_category_menu
community.icann.org#navigation
community.icann.org#rw_search_query
community.icann.org#dashboard-recently-updated
gac.icann.org#main
ripe.net
www.flickr.com
aso.icann.org#search-modal
aso.icann.org#hero-slider
gac.icann.org?language_id=9
gac.icann.org?language_id=11
gac.icann.org?language_id=1
gac.icann.org?language_id=3
gac.icann.org?language_id=7
gac.icann.org?language_id=2
ardesign.us

i.e. it seems the URL query and fragment parts are being included in the queue key generation? I was expecting the key to just be the host.
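For illustration, a minimal sketch of deriving a queue key from the host only, using java.net.URI; this is not the url-frontier implementation, just the behaviour the reporter expected.

import java.net.URI;
import java.net.URISyntaxException;

public class QueueKeys {
    public static String hostKey(String url) {
        try {
            String host = new URI(url).getHost();
            return host != null ? host.toLowerCase() : url;
        } catch (URISyntaxException e) {
            // fall back to the raw URL if it cannot be parsed
            return url;
        }
    }

    public static void main(String[] args) {
        // prints "gac.icann.org" rather than "gac.icann.org?language_id=9"
        System.out.println(hostKey("https://gac.icann.org/?language_id=9"));
        // prints "community.icann.org" rather than "community.icann.org#navigation"
        System.out.println(hostKey("https://community.icann.org/#navigation"));
    }
}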

Use multiple threads for putting URLs

Similar to #41 but on the writing side.
A stream of incoming updates is processed by a single thread. To make things parallel, a client needs to create multiple connections to the server. It would be more flexible to have an ExecutorService with n threads in AbstractFrontierService and use it in the method putURLs, possibly changing https://github.com/crawler-commons/url-frontier/blob/master/service/src/main/java/crawlercommons/urlfrontier/service/AbstractFrontierService.java#L767 to return a Future.
This should result in a better use of the CPU and increased overall performance.
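A rough sketch of the suggested change; class and method names are assumptions, not the actual url-frontier code. Incoming URL updates are handed to a fixed thread pool so that a single gRPC stream no longer serialises all writes on one thread.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelWriter {
    private final ExecutorService writeExecutor;

    public ParallelWriter(int nThreads) {
        this.writeExecutor = Executors.newFixedThreadPool(nThreads);
    }

    /** Submits the processing of one incoming URL and returns a Future, as suggested above. */
    public Future<Boolean> putURL(String queueKey, String url) {
        return writeExecutor.submit(() -> {
            // placeholder for the real per-URL work (deduplication, scheduling, storage)
            return store(queueKey, url);
        });
    }

    private boolean store(String queueKey, String url) {
        // hypothetical storage call
        return true;
    }

    public void shutdown() {
        writeExecutor.shutdown();
    }
}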

Dependency updates

mvn versions:display-dependency-updates | grep ">"

[INFO] ---------------< com.github.crawler-commons:urlfrontier >---------------
[INFO] -------------< com.github.crawler-commons:urlfrontier-API >-------------

[INFO] io.grpc:grpc-netty-shaded ........................... 1.50.0 -> 1.50.2
[INFO] io.grpc:grpc-protobuf ............................... 1.50.0 -> 1.50.2
[INFO] io.grpc:grpc-stub ................................... 1.50.0 -> 1.50.2
[INFO] -----------< com.github.crawler-commons:urlfrontier-service >-----------
[INFO] ch.qos.logback:logback-classic ....................... 1.2.10 -> 1.4.4
[INFO] io.prometheus:simpleclient .......................... 0.15.0 -> 0.16.0
[INFO] io.prometheus:simpleclient_hotspot .................. 0.15.0 -> 0.16.0
[INFO] io.prometheus:simpleclient_httpserver ............... 0.15.0 -> 0.16.0
[INFO] org.apache.ignite:ignite-core ................. 2.13.0 -> 3.0.0-alpha5
[INFO] org.apache.lucene:lucene-core ......................... 9.1.0 -> 9.4.0
[INFO] org.rocksdb:rocksdbjni ................................ 7.2.2 -> 7.6.0
[INFO] -----------< com.github.crawler-commons:urlfrontier-client >------------
[INFO] com.google.protobuf:protobuf-java-util .......... 3.20.3 -> 4.0.0-rc-2
[INFO] ---------< com.github.crawler-commons:urlfrontier-test-suite >----------
[INFO] org.slf4j:slf4j-api .................................. 1.7.35 -> 2.0.3
[INFO] org.slf4j:slf4j-simple ............................... 1.7.35 -> 2.0.3

DistributedFrontierService to use threadpool for all the writes

including the ones to another Frontier instance, and not just its own local writes

Before

Sent: 25948931
Acked: 25948931
OK: 20320327
Skipped: 5628604
Failed: 0
Total time: 2547400 msec
Average OPS: 10188

After

CRASHED WITH

14:39:29.457 [grpc-default-worker-ELG-3-3] WARN  i.g.n.s.i.n.util.ReferenceCountUtil - Failed to release a message: UnpooledSlicedByteBuf(freed)
io.grpc.netty.shaded.io.netty.util.IllegalReferenceCountException: refCnt: 0, decrement: 1
	at io.grpc.netty.shaded.io.netty.util.internal.ReferenceCountUpdater.toLiveRealRefCnt(ReferenceCountUpdater.java:83)
	at io.grpc.netty.shaded.io.netty.util.internal.ReferenceCountUpdater.release(ReferenceCountUpdater.java:147)
	at io.grpc.netty.shaded.io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:101)
	at io.grpc.netty.shaded.io.netty.buffer.CompositeByteBuf$Component.free(CompositeByteBuf.java:1959)
	at io.grpc.netty.shaded.io.netty.buffer.CompositeByteBuf.deallocate(CompositeByteBuf.java:2264)
	at io.grpc.netty.shaded.io.netty.buffer.AbstractReferenceCountedByteBuf.handleRelease(AbstractReferenceCountedByteBuf.java:111)
	at io.grpc.netty.shaded.io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:101)
	at io.grpc.netty.shaded.io.netty.buffer.AbstractDerivedByteBuf.release0(AbstractDerivedByteBuf.java:98)
	at io.grpc.netty.shaded.io.netty.buffer.AbstractDerivedByteBuf.release(AbstractDerivedByteBuf.java:94)
	at io.grpc.netty.shaded.io.netty.util.ReferenceCountUtil.release(ReferenceCountUtil.java:90)
	at io.grpc.netty.shaded.io.netty.util.ReferenceCountUtil.safeRelease(ReferenceCountUtil.java:116)
	at io.grpc.netty.shaded.io.netty.channel.ChannelOutboundBuffer.remove(ChannelOutboundBuffer.java:271)
	at io.grpc.netty.shaded.io.netty.channel.ChannelOutboundBuffer.removeBytes(ChannelOutboundBuffer.java:352)
	at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollStreamChannel.writeBytesMultiple(AbstractEpollStreamChannel.java:305)
	at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollStreamChannel.doWriteMultiple(AbstractEpollStreamChannel.java:510)
	at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollStreamChannel.doWrite(AbstractEpollStreamChannel.java:422)
	at io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:931)
	at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.flush0(AbstractEpollChannel.java:557)
	at io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:895)
	at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1372)
	at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:750)
	at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:742)
	at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:728)
	at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2ConnectionHandler.flush(Http2ConnectionHandler.java:197)
	at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:750)
	at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:742)
	at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:728)
	at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline.flush(DefaultChannelPipeline.java:967)
	at io.grpc.netty.shaded.io.netty.channel.AbstractChannel.flush(AbstractChannel.java:254)
	at io.grpc.netty.shaded.io.grpc.netty.WriteQueue.flush(WriteQueue.java:147)
	at io.grpc.netty.shaded.io.grpc.netty.WriteQueue.access$000(WriteQueue.java:34)
	at io.grpc.netty.shaded.io.grpc.netty.WriteQueue$1.run(WriteQueue.java:46)
	at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
	at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
	at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:391)
	at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)

Scan of table failing for queues which have been completed

When resurrecting an existing RocksDB

Caused by: java.lang.RuntimeException: Scan of table found missing queue -.alienvault
at crawlercommons.urlfrontier.service.rocksdb.RocksDBService.recoveryQscan(RocksDBService.java:151)
at crawlercommons.urlfrontier.service.rocksdb.RocksDBService.(RocksDBService.java:103)
... 15 more

The reason is that the queue object wasn't created if nothing was scheduled, i.e. the queue was finished. This is a bug.
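A sketch of the idea behind a fix, with illustrative names only: the recovery scan creates the queue object as soon as a queue name is seen, even when nothing is scheduled for it, so that completed queues no longer trigger a "missing queue" error.

import java.util.LinkedHashMap;
import java.util.Map;

public class RecoveryScan {
    static class QueueCounts {
        int active;
        int completed;
    }

    private final Map<String, QueueCounts> queues = new LinkedHashMap<>();

    void onScannedEntry(String queueName, boolean scheduled) {
        // computeIfAbsent guarantees the queue object exists even for finished queues
        QueueCounts counts = queues.computeIfAbsent(queueName, k -> new QueueCounts());
        if (scheduled) {
            counts.active++;
        } else {
            counts.completed++;
        }
    }
}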

java.lang.NoClassDefFoundError: io/grpc/netty/shaded/io/netty/util/concurrent/DefaultPromise$1

Exception in thread "grpc-default-worker-ELG-3-2" java.lang.NoClassDefFoundError: io/grpc/netty/shaded/io/netty/util/concurrent/DefaultPromise$1
at io.grpc.netty.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:499)
at io.grpc.netty.shaded.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
at io.grpc.netty.shaded.io.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:605)
at io.grpc.netty.shaded.io.netty.util.concurrent.DefaultPromise.setSuccess(DefaultPromise.java:96)
at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:1048)
at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: io.grpc.netty.shaded.io.netty.util.concurrent.DefaultPromise$1
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
... 8 more

exception thrown when no rocksdb.path set

java.lang.InstantiationException: crawlercommons.urlfrontier.service.rocksdb.RocksDBService
	at java.lang.Class.newInstance(Class.java:427)
	at crawlercommons.urlfrontier.service.URLFrontierServer.start(URLFrontierServer.java:138)
	at crawlercommons.urlfrontier.service.URLFrontierServer.call(URLFrontierServer.java:73)
	at crawlercommons.urlfrontier.service.URLFrontierServer.call(URLFrontierServer.java:44)
	at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
	at picocli.CommandLine.access$1300(CommandLine.java:145)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
	at picocli.CommandLine.execute(CommandLine.java:2078)
	at crawlercommons.urlfrontier.service.URLFrontierServer.main(URLFrontierServer.java:66)
Caused by: java.lang.NoSuchMethodException: crawlercommons.urlfrontier.service.rocksdb.RocksDBService.<init>()
	at java.lang.Class.getConstructor0(Class.java:3082)
	at java.lang.Class.newInstance(Class.java:412)
	... 11 more

The service should just use a default value instead.
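A minimal sketch of the suggested behaviour; the default path and helper method are assumptions, only the rocksdb.path key comes from the issue.

import java.util.Map;

public class RocksDBConfig {
    static final String DEFAULT_PATH = "./rocksdb"; // hypothetical default location

    static String resolvePath(Map<String, String> config) {
        // fall back to a default instead of failing with an InstantiationException
        return config.getOrDefault("rocksdb.path", DEFAULT_PATH);
    }
}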

getURLs will lock the whole queue when no key is assigned

For example: I sent 5 URLs to QueueA, then used this code to get URLs. Running it 5 times returns 5 different URLs from QueueA; the sixth time returns an empty result.

    Urlfrontier.GetParams request =
            Urlfrontier.GetParams.newBuilder()
                    .setMaxUrlsPerQueue(1)
                    .setMaxQueues(0)
                    .setKey("QueueA")
                    .setDelayRequestable(600)
                    .setCrawlID(crawlId)
                    .build();

But if you do not specify the key, as in this code:

    Urlfrontier.GetParams request =
            Urlfrontier.GetParams.newBuilder()
                    .setMaxUrlsPerQueue(1)
                    .setMaxQueues(0)
                    .setDelayRequestable(600)
                    .setCrawlID(crawlId)
                    .build();

you can only run it once, because the second time it will return an empty result.

I checked the class AbstractFrontierService.java and it looks like

        if (currentQueue.getInProcess(now) >= maxURLsPerQueue) {
            continue;
        }

this code will refuse the request. Is this correct?

URLFrontier AckMessage does not contain IDs of ACKed URLInfo

IgniteHeartbeat Thread throws a NullPointerException during start of Ignite service

Hi all, thanks for your great work!
It appeared to me that the IgniteHeartbeat thread is killed due to a NullPointerException as soon as the IgniteService is started.
This happens due to a race condition during the instantiation of the frontier service (class URLFrontierServer line 172). At the end of the instantiation, the IgniteHeartbeat thread is started (class IgniteService line 392). If the hostname and the port are not yet set (class URLFrontierServer line 179) when the IgniteHeartbeat thread is started and calls the sendHeartbeat() method for the first time, the thread throws a NullPointerException.

[16:05:36] Ignite node started OK (id=3d3198aa)
[16:05:36] >>> Ignite cluster is in INACTIVE state (limited functionality available). Use control.(sh|bat) script or IgniteCluster.state(ClusterState.ACTIVE) to change the state.
[16:05:36] Topology snapshot [ver=1, locNode=3d3198aa, servers=1, clients=0, state=INACTIVE, CPUs=12, offheap=6.3GB, heap=2.0GB]
16:05:36.916 [main] INFO  c.u.service.ignite.IgniteService - Ignite loaded in 3371 msec
16:05:36.918 [main] INFO  c.u.service.ignite.IgniteService - Scanning tables to rebuild queues... (can take a long time)
16:05:36.919 [main] INFO  c.u.service.ignite.IgniteService - 0 queues discovered in 3 msec
Exception in thread "IgniteHeartbeat" java.lang.NullPointerException: Ouch! Argument cannot be null: key
	at org.apache.ignite.internal.util.GridArgumentCheck.notNull(GridArgumentCheck.java:49)
	at org.apache.ignite.internal.util.GridArgumentCheck.notNull(GridArgumentCheck.java:61)
	at org.apache.ignite.internal.processors.cache.GridCacheAdapter.put(GridCacheAdapter.java:2485)
	at org.apache.ignite.internal.processors.cache.GridCacheAdapter.put(GridCacheAdapter.java:2466)
	at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.put(IgniteCacheProxyImpl.java:1332)
	at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.put(GatewayProtectedCacheProxy.java:867)
	at crawlercommons.urlfrontier.service.ignite.IgniteHeartbeat.sendHeartBeat(IgniteHeartbeat.java:40)
	at crawlercommons.urlfrontier.service.cluster.Hearbeat.run(Hearbeat.java:75)
16:05:37.123 [IgniteQueueChecker] INFO  c.u.service.ignite.IgniteService - Found 0 queues, removed 0, total 0 in 8
16:05:37.470 [main] INFO  c.u.service.URLFrontierServer - Started URLFrontierServer [IgniteService] on port 7071 as localhost:7071
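One possible way to avoid the race, sketched with illustrative names rather than the actual IgniteHeartbeat code: the heartbeat loop waits until the hostname and port are available before sending the first heartbeat.

import java.util.concurrent.CountDownLatch;

public class HeartbeatThread extends Thread {
    private final CountDownLatch addressReady = new CountDownLatch(1);
    private volatile String hostAndPort;

    /** Called by the server once the hostname and port are known. */
    public void setHostAndPort(String hostAndPort) {
        this.hostAndPort = hostAndPort;
        addressReady.countDown();
    }

    @Override
    public void run() {
        try {
            addressReady.await(); // do not send a heartbeat before the key is available
            while (!isInterrupted()) {
                sendHeartbeat(hostAndPort);
                Thread.sleep(1000L);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void sendHeartbeat(String key) {
        // placeholder for the actual cache.put(key, ...) call
    }
}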

GRPC error when attempting to use from Python/Scrapy

Given that I've not used gRPC before, I'm likely doing something wrong, but when attempting to use PushURLs from a Python client I get a protocol-level error. See https://github.com/anjackson/scrapy-url-frontier#scrapy-url-frontier for full details of the setup.

SEVERE: Exception while executing runnable io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable@352d3e9
io.grpc.StatusRuntimeException: INTERNAL: Invalid protobuf byte sequence
	at io.grpc.Status.asRuntimeException(Status.java:526)
	at io.grpc.protobuf.lite.ProtoLiteUtils$MessageMarshaller.parse(ProtoLiteUtils.java:218)
	at io.grpc.protobuf.lite.ProtoLiteUtils$MessageMarshaller.parse(ProtoLiteUtils.java:118)
	at io.grpc.MethodDescriptor.parseRequest(MethodDescriptor.java:307)
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:309)
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:292)
	at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:765)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.
	at com.google.protobuf.InvalidProtocolBufferException.invalidEndTag(InvalidProtocolBufferException.java:129)

The basic GetURLs connection seems to be fine, although at this point there are no URLs in the Frontier to get.

Any ideas on how I managed to break it?

Docker container running as root

Here is a short Red Hat workshop that touches the USER topic: https://redhatgov.io/workshops/security_containers/exercise1.2/

The pattern consists of a RUN that uses Linux standard tools to add and configure a new user, optionally with a special group, home directory, folder permissions and so on. The second step is then to direct Docker to do subsequent calls with this user identity.

In short, most containers implicitly run everything as USER root unless told otherwise. The main exceptions are cases where the parent container already configured a user and switched the USER to it, but you would likely have noticed if that were the case due to the limits on apt install and other commands.

Notably, there is a case to be made for running both the mvn clean package step and the actual service ENTRYPOINT under low-privileged users. The former would make it harder for attackers to escalate if they manage to compromise the build system at build time, while the latter makes it harder for attackers to escalate from a compromised URL Frontier service at runtime.

Configs read.thread.num and write.thread.num are ignored for IgniteService

When starting the IgniteService, the two configuration arguments read.thread.num and write.thread.num are ignored.
In the IgniteService constructor (line 130), the configuration map is not forwarded to the underlying DistributedFrontierService and, subsequently, not forwarded to the underlying AbstractFrontierService. So specifying read.thread.num and write.thread.num makes no difference for the Ignite implementation.

michael@pc:~/Desktop/Git/url-frontier/service$ java -Xmx2G -cp target/urlfrontier-service-*.jar crawlercommons.urlfrontier.service.URLFrontierServer implementation=crawlercommons.urlfrontier.service.ignite.IgniteService ignite.path=/some_path/data_frontier read.thread.num=1 write.thread.num=1
16:38:33.319 [main] INFO  c.u.service.AbstractFrontierService - Available processor(s) 12
16:38:33.321 [main] INFO  c.u.service.AbstractFrontierService - Using 3 threads for reading from queues
16:38:33.321 [main] INFO  c.u.service.AbstractFrontierService - Using 3 threads for writing to queues
...
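A sketch of the fix being suggested; the constructor signatures and default values are assumptions, not the actual url-frontier classes. The point is simply that the configuration map received by IgniteService is forwarded up the chain so that read.thread.num and write.thread.num reach AbstractFrontierService.

import java.util.Map;

abstract class AbstractFrontierServiceSketch {
    protected final int readThreads;
    protected final int writeThreads;

    protected AbstractFrontierServiceSketch(Map<String, String> config) {
        // the default of a quarter of the available processors is an assumption based on the log above
        String fallback = String.valueOf(Math.max(1, Runtime.getRuntime().availableProcessors() / 4));
        readThreads = Integer.parseInt(config.getOrDefault("read.thread.num", fallback));
        writeThreads = Integer.parseInt(config.getOrDefault("write.thread.num", fallback));
    }
}

abstract class DistributedFrontierServiceSketch extends AbstractFrontierServiceSketch {
    protected DistributedFrontierServiceSketch(Map<String, String> config) {
        super(config); // forward instead of dropping the map
    }
}

class IgniteServiceSketch extends DistributedFrontierServiceSketch {
    IgniteServiceSketch(Map<String, String> config) {
        super(config); // previously the map was not passed on, so the defaults always applied
    }
}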

Multithread reading from queues

It can currently take a bit of time for the service to submit URLs from queues when the number of URLs gets large. This is due partly to the fact that a call to the getURLs endpoint iterates sequentially over the queues; retrieving URLs for a queue takes longer and longer. This is not noticeable early in a crawl but becomes more of an issue as the frontier grows.
Looking at the CPU usage of the Frontier, it has only one or two cores busy. If we have a pool of threads getting candidates from the queues in parallel, we'd be able to mobilise more of the CPUs and make the operation faster.
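A rough sketch of the read-side idea, with illustrative names: per-queue retrieval tasks are submitted to a pool and results collected as they complete, instead of iterating sequentially.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelReader {
    private final ExecutorService readExecutor = Executors.newFixedThreadPool(4);

    public List<String> getURLs(List<String> queueKeys, int maxPerQueue) throws InterruptedException {
        CompletionService<List<String>> cs = new ExecutorCompletionService<>(readExecutor);
        for (String key : queueKeys) {
            cs.submit(() -> candidatesForQueue(key, maxPerQueue));
        }
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < queueKeys.size(); i++) {
            try {
                urls.addAll(cs.take().get()); // collect results in completion order
            } catch (ExecutionException e) {
                // a failing queue should not abort the whole call
            }
        }
        return urls;
    }

    private List<String> candidatesForQueue(String key, int max) {
        // placeholder for the real per-queue retrieval
        return List.of();
    }
}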

Pagination of ListQueues

ListQueues can lead to gRPC error when the number of queues is large

java -jar ./target/urlfrontier-client*.jar ListQueues
io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: gRPC message exceeds maximum size 4194304: 10945786
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
	at crawlercommons.urlfrontier.URLFrontierGrpc$URLFrontierBlockingStub.listQueues(URLFrontierGrpc.java:604)
	at crawlercommons.urlfrontier.client.ListQueues.run(ListQueues.java:52)
	at picocli.CommandLine.executeUserObject(CommandLine.java:1939)
	at picocli.CommandLine.access$1300(CommandLine.java:145)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
	at picocli.CommandLine.execute(CommandLine.java:2078)
	at crawlercommons.urlfrontier.client.Client.main(Client.java:40)
Jun 23, 2021 12:58:07 PM io.grpc.internal.AbstractClientStream$TransportState inboundDataReceived
INFO: Received data on closed stream

Limiting the size of the returned list, e.g. via java -jar ./target/urlfrontier-client*.jar ListQueues -n 1000 avoids the exception. However, since results are apparently returned in a consistent order and the -n only controls the number of items from the start of the list, this makes it difficult for a client to obtain the tail part of the list.

We should paginate the results and return a richer output with the total number of queues, the start and end offsets, etc.

Set an explicit limit to the size of a key to 255 chars

This is the max length that a fully qualified domain name can have. Any URL sent to the Frontier with an explicit or computed key greater than the limit should be ignored.
This would prevent a situation where resources could get exhausted if very large keys were used.
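A tiny sketch of the proposed check; the constant and method are illustrative.

public class KeyLimit {
    static final int MAX_KEY_LENGTH = 255; // max length of a fully qualified domain name

    /** URLs whose explicit or computed queue key exceeds the limit are simply ignored. */
    static boolean acceptKey(String queueKey) {
        return queueKey != null && queueKey.length() <= MAX_KEY_LENGTH;
    }
}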

PutURLs calls failing with java.lang.IllegalStateException

This issue is related to #71. When uploading URLs using e.g. the command java -jar target/urlfrontier-client-2.4-SNAPSHOT.jar PutURLs -f <some_path>/2000urls, the server throws a series of IllegalStateExceptions.
Analogously to #71, I managed to fix it by putting the code line unacked.incrementAndGet(); (class DistributedFrontierService line 512) right in front of the execute call of the writeExecutorService (line 510).

Exception in thread "pool-2-thread-1" java.lang.IllegalStateException: Stream is already completed, no further calls are allowed
        at com.google.common.base.Preconditions.checkState(Preconditions.java:502)
        at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:375)
        at crawlercommons.urlfrontier.service.SynchronizedStreamObserver.onNext(SynchronizedStreamObserver.java:59)
        at crawlercommons.urlfrontier.service.cluster.DistributedFrontierService$6.lambda$onNext$0(DistributedFrontierService.java:524)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Exception in thread "pool-2-thread-2" java.lang.IllegalStateException: Stream is already completed, no further calls are allowed
        at com.google.common.base.Preconditions.checkState(Preconditions.java:502)
        at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:375)
        at crawlercommons.urlfrontier.service.SynchronizedStreamObserver.onNext(SynchronizedStreamObserver.java:59)
        at crawlercommons.urlfrontier.service.cluster.DistributedFrontierService$6.lambda$onNext$0(DistributedFrontierService.java:524)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Exception in thread "pool-2-thread-3" java.lang.IllegalStateException: Stream is already completed, no further calls are allowed
        at com.google.common.base.Preconditions.checkState(Preconditions.java:502)
        at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:375)
        at crawlercommons.urlfrontier.service.SynchronizedStreamObserver.onNext(SynchronizedStreamObserver.java:59)
        at crawlercommons.urlfrontier.service.cluster.DistributedFrontierService$6.lambda$onNext$0(DistributedFrontierService.java:524)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Exception in thread "pool-2-thread-4" java.lang.IllegalStateException: Stream is already completed, no further calls are allowed
        at com.google.common.base.Preconditions.checkState(Preconditions.java:502)
        at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:375)
        at crawlercommons.urlfrontier.service.SynchronizedStreamObserver.onNext(SynchronizedStreamObserver.java:59)
        at crawlercommons.urlfrontier.service.cluster.DistributedFrontierService$6.lambda$onNext$0(DistributedFrontierService.java:524)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

ignite.purge answers with 'couldn't delete workdir'

When using the ignite.purge config argument, the corresponding working directory couldn't be deleted. As a workaround, I modified lines 156-159 and lines 169-172 and used FileUtils.cleanDirectory() instead, which did the job.

michael@pc:~/Desktop/Git/url-frontier/service$ java -Xmx2G -cp target/urlfrontier-service-*.jar crawlercommons.urlfrontier.service.URLFrontierServer implementation=crawlercommons.urlfrontier.service.ignite.IgniteService ignite.path=/home/michaeld/Desktop/Git/owler/frontier/data_frontier ignite.purge
17:46:54.065 [main] INFO  c.u.service.AbstractFrontierService - Available processor(s) 12
17:46:54.067 [main] INFO  c.u.service.AbstractFrontierService - Using 3 threads for reading from queues
17:46:54.067 [main] INFO  c.u.service.AbstractFrontierService - Using 3 threads for writing to queues
17:46:54.251 [main] ERROR c.u.service.ignite.IgniteService - Couldn't delete workdir /home/michaeld/Desktop/Git/owler/frontier/data_frontier

java.lang.NoClassDefFoundError: com/google/common/cache/RemovalNotification when stopping ShardedRocksDBService

jnioche@node1:/data$ java -Xmx5G -XX:+UseG1GC -Djava.net.preferIPv4Stack=true -cp urlfrontier-service-*.jar crawlercommons.urlfrontier.service.URLFrontierServer -h $Node3 -s 9100 rocksdb.path=/data/rocksdb implementation=crawlercommons.urlfrontier.service.rocksdb.ShardedRocksDBService nodes=$Node1:7071,$Node2:7071,$Node3:7071 -c config.ini > frontier.log
^CException in thread "Thread-1" java.lang.NoClassDefFoundError: com/google/common/cache/RemovalNotification
at com.google.common.cache.LocalCache$Segment.enqueueNotification(LocalCache.java:2527)
at com.google.common.cache.LocalCache$Segment.removeValueFromChain(LocalCache.java:3147)
at com.google.common.cache.LocalCache$Segment.removeEntry(LocalCache.java:3314)
at com.google.common.cache.LocalCache$Segment.expireEntries(LocalCache.java:2511)
at com.google.common.cache.LocalCache$Segment.runLockedCleanup(LocalCache.java:3368)
at com.google.common.cache.LocalCache$Segment.preWriteCleanup(LocalCache.java:3350)
at com.google.common.cache.LocalCache$Segment.clear(LocalCache.java:3104)
at com.google.common.cache.LocalCache.clear(LocalCache.java:4137)
at com.google.common.cache.LocalCache$LocalManualCache.invalidateAll(LocalCache.java:4725)
at crawlercommons.urlfrontier.service.cluster.DistributedFrontierService.close(DistributedFrontierService.java:357)
at crawlercommons.urlfrontier.service.rocksdb.ShardedRocksDBService.close(ShardedRocksDBService.java:80)
at crawlercommons.urlfrontier.service.URLFrontierServer.stop(URLFrontierServer.java:215)
at crawlercommons.urlfrontier.service.URLFrontierServer.lambda$registerShutdownHook$0(URLFrontierServer.java:200)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: com.google.common.cache.RemovalNotification
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
... 14 more

Publish service jar on Maven

This way anyone wanting to have a custom implementation of the service could access and leverage the existing abstract classes.

Getting a stack dump when closing RocksDB at the end of the service

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fa253424f47, pid=568380, tid=0x00007fa1ff7fa700
#
# JRE version: OpenJDK Runtime Environment (8.0_292-b10) (build 1.8.0_292-8u292-b10-0ubuntu1~20.04-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.292-b10 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [librocksdbjni2589559341903054454.so+0x326f47]  rocksdb::DBImpl::NewIterator(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*)+0x47
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /users/jnioche/url-frontier/service/hs_err_pid568380.log
[thread 140333751965440 also had an error]
[thread 140333838980864 also had an error]
[thread 140333883029248 also had an error]
[thread 140335108507392 also had an error]
[thread 140333745649408 also had an error]
[thread 140333750912768 also had an error]
[thread 140333754070784 also had an error]
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#

Batch write operations

This will limit the risk of discrepancies between the status and the schedule tables, e.g. when the service is stopped or crashes.
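A sketch of what batching the two writes could look like with the RocksDB Java API; the column family handles and key encoding are placeholders, not the actual url-frontier schema.

import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteBatch;
import org.rocksdb.WriteOptions;

public class BatchedWrite {
    static void putAtomically(RocksDB db, ColumnFamilyHandle statusCF, ColumnFamilyHandle scheduleCF,
                              byte[] urlKey, byte[] statusValue, byte[] scheduleKey, byte[] scheduleValue)
            throws RocksDBException {
        try (WriteBatch batch = new WriteBatch(); WriteOptions options = new WriteOptions()) {
            batch.put(statusCF, urlKey, statusValue);
            batch.put(scheduleCF, scheduleKey, scheduleValue);
            db.write(options, batch); // both entries are committed together or not at all
        }
    }
}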

Exception caught when deleting queue

When injecting apple.com and apple.com.br and then deleting apple.com:

10:22:01.745 [grpc-default-executor-2] ERROR c.u.service.rocksdb.RocksDBService - Exception caught when deleting ranges - DEFAULT_apple.com_ - DEFAULT_apple.com.br_
org.rocksdb.RocksDBException: end key comes before start key
	at org.rocksdb.RocksDB.deleteRange(Native Method)
	at org.rocksdb.RocksDB.deleteRange(RocksDB.java:1415)
	at crawlercommons.urlfrontier.service.rocksdb.RocksDBService.deleteRanges(RocksDBService.java:613)

The dot has a Unicode value of U+002E whereas the underscore has U+005F. One option would be to choose a different separator, with a byte value lower than anything else. Alternatively, when working out the ranges, the sorting of the queue names could take the separator into account.
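A small demonstration of the ordering problem and of the suggested fix; the low-valued separator is an assumption. With '_' (U+005F) as separator, "DEFAULT_apple.com_" sorts after "DEFAULT_apple.com.br_" because '_' > '.', hence the "end key comes before start key" error; a separator lower than any character that can appear in a queue name keeps the ranges consistent.

public class SeparatorDemo {
    public static void main(String[] args) {
        String underscoreStart = "DEFAULT_apple.com_";
        String underscoreEnd = "DEFAULT_apple.com.br_";
        // positive: the start key sorts after the end key, as reported by RocksDB
        System.out.println(underscoreStart.compareTo(underscoreEnd));

        char lowSeparator = '\u0001'; // hypothetical separator below '.', '-', digits and letters
        String fixedStart = "DEFAULT" + lowSeparator + "apple.com" + lowSeparator;
        String fixedEnd = "DEFAULT" + lowSeparator + "apple.com.br" + lowSeparator;
        // negative: the start key now correctly sorts before the end key
        System.out.println(fixedStart.compareTo(fixedEnd));
    }
}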

Add GetActive endpoint to API

rpc SetActive(Boolean) returns (Empty) {} can activate / deactivate the GetURLs() activity, but clients apparently have no way to ask for the current state AFAIK

RocksDB backend - faster restarts

When the RocksDB-based service gets restarted, it can take a substantial amount of time as it needs to go through its table to rebuild the information about the queues, namely the number of active URLs they contain and number of URLs already processed.
What we could do instead (in case of a polite and clean termination) would be to populate a table containing the queue names as well as these counts. When restarting, if such a table exists, it would be only a matter of reading the data from it instead of going through the whole URL table. Once read, the table would be deleted.
In case of a crash, such a table would not be written at all and we would rely on the existing mechanism.

zero-padding the timestamp used in the keys for RocksDB

see #12
Even though in most cases the dates will be represented with 10 chars, if a date in the distant past was chosen to prioritize a URL, the order would not be correct because that date could have a shorter string representation. By padding the String representation of the timestamp we would always get the keys sorted correctly.
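A quick illustration of the padding idea; the format width of 10 is an assumption based on the description above. Without padding, the lexicographic order of timestamp strings disagrees with their numeric order once the number of digits differs; a fixed-width, zero-padded representation keeps byte-wise key ordering consistent with time.

public class TimestampPadding {
    public static void main(String[] args) {
        long recent = 1700000000L;   // 10 digits
        long distantPast = 9999999L; // 7 digits, e.g. a date chosen far in the past

        // unpadded: "9999999" > "1700000000" lexicographically, which is wrong
        System.out.println(String.valueOf(distantPast).compareTo(String.valueOf(recent)) > 0);

        // zero-padded to 10 characters: "0009999999" < "1700000000", as expected
        String padded1 = String.format("%010d", distantPast);
        String padded2 = String.format("%010d", recent);
        System.out.println(padded1.compareTo(padded2) < 0);
    }
}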

Some calls to GetURLs failing with java.lang.IllegalStateException

I'm using this client to talk to a URLFrontier v2.3 service running via Docker on an Ubuntu OS. I'm attempting to write a command line tool that queries the frontier to list URLs, but without interfering too much with the crawl by setting delay_requestable=1. Sometimes it works as expected, other times no results come back, and I see this in the URLFrontier service logs:

scrapy-url-frontier-urlfrontier-1  | 17:33:54.999 [grpc-default-executor-4] INFO  c.u.service.AbstractFrontierService - Received request to get fetchable URLs [max queues 2147483647, max URLs 2147483647, delay 5] 53a5f051-bb28-4e7b-954a-43bb5e626e80
scrapy-url-frontier-urlfrontier-1  | 17:33:55.000 [grpc-default-executor-4] INFO  c.u.service.AbstractFrontierService - Sent 0 from 0 queue(s) in 0 msec; tried 0 queues. 53a5f051-bb28-4e7b-954a-43bb5e626e80
scrapy-url-frontier-urlfrontier-1  | 17:33:56.032 [grpc-default-executor-4] INFO  c.u.service.AbstractFrontierService - Received request to get fetchable URLs [max queues 2147483647, max URLs 2147483647, delay 5] 52b9dec0-5df4-40e0-ac7a-0e54c00dd5bc
scrapy-url-frontier-urlfrontier-1  | 17:33:56.032 [grpc-default-executor-4] INFO  c.u.service.AbstractFrontierService - Sent 0 from 0 queue(s) in 0 msec; tried 0 queues. 52b9dec0-5df4-40e0-ac7a-0e54c00dd5bc
^[[A
scrapy-url-frontier-urlfrontier-1  | 17:34:05.119 [grpc-default-executor-4] INFO  c.u.service.AbstractFrontierService - Received request to get fetchable URLs [max queues 2147483647, max URLs 2147483647, delay 5] 0a5d48e3-d4e6-49d4-ba86-83cba5e59fda
scrapy-url-frontier-urlfrontier-1  | 17:34:05.120 [grpc-default-executor-4] INFO  c.u.service.AbstractFrontierService - Sent 0 from 0 queue(s) in 0 msec; tried 0 queues. 0a5d48e3-d4e6-49d4-ba86-83cba5e59fda
scrapy-url-frontier-urlfrontier-1  | 17:34:07.414 [grpc-default-executor-4] INFO  c.u.service.AbstractFrontierService - Received request to get fetchable URLs [max queues 2147483647, max URLs 2147483647, delay 5] 22c7f96a-dec4-43f6-ae99-d85dec75013a
scrapy-url-frontier-urlfrontier-1  | 17:34:07.415 [grpc-default-executor-4] INFO  c.u.service.AbstractFrontierService - Sent 0 from 0 queue(s) in 0 msec; tried 0 queues. 22c7f96a-dec4-43f6-ae99-d85dec75013a
scrapy-url-frontier-urlfrontier-1  | 17:34:10.814 [grpc-default-executor-4] INFO  c.u.service.AbstractFrontierService - Received request to get fetchable URLs [max queues 2147483647, max URLs 2147483647, delay 5] 0dc31afb-e60c-4f37-82c4-6d0c8663f297
scrapy-url-frontier-urlfrontier-1  | 17:34:10.815 [grpc-default-executor-4] INFO  c.u.service.AbstractFrontierService - Sent 0 from 0 queue(s) in 1 msec; tried 0 queues. 0dc31afb-e60c-4f37-82c4-6d0c8663f297
scrapy-url-frontier-urlfrontier-1  | 17:34:10.816 [pool-2-thread-1] ERROR c.u.service.rocksdb.RocksDBService - Caught unlikely error 
scrapy-url-frontier-urlfrontier-1  | java.lang.IllegalStateException: Stream is already completed, no further calls are allowed
scrapy-url-frontier-urlfrontier-1  | 	at com.google.common.base.Preconditions.checkState(Preconditions.java:502)
scrapy-url-frontier-urlfrontier-1  | 	at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:375)
scrapy-url-frontier-urlfrontier-1  | 	at crawlercommons.urlfrontier.service.SynchronizedStreamObserver.onNext(SynchronizedStreamObserver.java:59)
scrapy-url-frontier-urlfrontier-1  | 	at crawlercommons.urlfrontier.service.rocksdb.RocksDBService.sendURLsForQueue(RocksDBService.java:338)
scrapy-url-frontier-urlfrontier-1  | 	at crawlercommons.urlfrontier.service.AbstractFrontierService.lambda$getURLs$0(AbstractFrontierService.java:681)
scrapy-url-frontier-urlfrontier-1  | 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
scrapy-url-frontier-urlfrontier-1  | 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
scrapy-url-frontier-urlfrontier-1  | 	at java.base/java.lang.Thread.run(Thread.java:829)
scrapy-url-frontier-urlfrontier-1  | 17:34:10.816 [pool-2-thread-1] ERROR c.u.service.rocksdb.RocksDBService - Caught unlikely error 
scrapy-url-frontier-urlfrontier-1  | java.lang.IllegalStateException: Stream is already completed, no further calls are allowed
scrapy-url-frontier-urlfrontier-1  | 	at com.google.common.base.Preconditions.checkState(Preconditions.java:502)
scrapy-url-frontier-urlfrontier-1  | 	at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:375)
scrapy-url-frontier-urlfrontier-1  | 	at crawlercommons.urlfrontier.service.SynchronizedStreamObserver.onNext(SynchronizedStreamObserver.java:59)
scrapy-url-frontier-urlfrontier-1  | 	at crawlercommons.urlfrontier.service.rocksdb.RocksDBService.sendURLsForQueue(RocksDBService.java:338)
scrapy-url-frontier-urlfrontier-1  | 	at crawlercommons.urlfrontier.service.AbstractFrontierService.lambda$getURLs$0(AbstractFrontierService.java:681)
scrapy-url-frontier-urlfrontier-1  | 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
scrapy-url-frontier-urlfrontier-1  | 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
scrapy-url-frontier-urlfrontier-1  | 	at java.base/java.lang.Thread.run(Thread.java:829)

The latter exception is repeated many times.

Note that in the first few lines, the "Sent 0 from 0 queue(s) in 0 msec; tried 0 queues." message is also not accurate, as multiple URLs were viewed.

Reduce thread contention when the DB gets large

The queues object we use is a LinkedHashMap wrapped by a synchronized map; this is so that we can control the iteration order while benefiting from a simple locking mechanism. The only time we need to explicitly lock the access to the map is when we iterate on its content in order to avoid a ConcurrentModificationException.

When the DB gets large (e.g. 1 billion URLs in total in a single instance), the threads deadlock when trying to access the queues; in particular, AbstractFrontierService.getURLs() hogs the queues for a long time and prevents the addition of new URLs (which needs to run computeIfAbsent() on the queues). As a result, the URLs take a long time to get added, which can cause timeout problems on the crawler side.

Given that the getURLs process only needs to get an iterator to get the head of the queue and then pop it back at the end, there is no need to lock on the queues while calling the costly sendURLsForQueue.
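A sketch of the contention-reducing pattern described above, with illustrative names: the synchronized block is only held long enough to take the head of the rotation, and the expensive per-queue work runs outside the lock.

import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

public class QueueRotation {
    private final Map<String, Object> queues = Collections.synchronizedMap(new LinkedHashMap<>());

    public void getURLs() {
        String headKey;
        Object headQueue;
        synchronized (queues) {
            // the lock is only needed while iterating, to avoid ConcurrentModificationException
            Iterator<Map.Entry<String, Object>> it = queues.entrySet().iterator();
            if (!it.hasNext()) {
                return;
            }
            Map.Entry<String, Object> head = it.next();
            headKey = head.getKey();
            headQueue = head.getValue();
            it.remove(); // take the head of the rotation while holding the lock
        }

        // costly work happens without holding the lock, so putURLs can still
        // run computeIfAbsent() on the map concurrently
        sendURLsForQueue(headKey, headQueue);

        queues.putIfAbsent(headKey, headQueue); // put the queue back at the tail of the iteration order
    }

    private void sendURLsForQueue(String key, Object queue) {
        // placeholder for the real retrieval and streaming of candidate URLs
    }
}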

Tuning RocksDB configuration

RocksDB has many configuration options, some of which might impact performance.
Luckily, Rocksdb-Adviser can be used to suggest changes based on RocksDB's log.

Here are the suggestions I got when running

python3 -m advisor.rule_parser_example --rules_spec=rules.ini --rocksdb_options=OPTIONS-000007 --log_files_path_prefix=LOG --stats_dump_period_sec=20

WARNING(TimeSeriesData) check_and_trigger: float division by zero

Rule: stall-too-many-L0
LogCondition: stall-too-many-L0 regex: Stalling writes because we have \d+ level-0 files
Suggestion: inc-max-subcompactions option : DBOptions.max_subcompactions action : increase
Suggestion: inc-max-bg-compactions option : DBOptions.max_background_compactions action : increase suggested_values : ['2']
Suggestion: inc-write-buffer-size option : CFOptions.write_buffer_size action : increase
Suggestion: dec-max-bytes-for-level-base option : CFOptions.max_bytes_for_level_base action : decrease
Suggestion: inc-l0-slowdown-writes-trigger option : CFOptions.level0_slowdown_writes_trigger action : increase
scope: col_fam:
{'queues'}

Rule: level0-level1-ratio
OptionCondition: level0-level1-ratio options: ['CFOptions.level0_file_num_compaction_trigger', 'CFOptions.write_buffer_size', 'CFOptions.max_bytes_for_level_base'] expression: int(options[0])*int(options[1])-int(options[2])>=1 trigger: {'default': ['4', '134217728', '268435456'], 'queues': ['4', '134217728', '268435456']}
Suggestion: inc-base-max-bytes option : CFOptions.max_bytes_for_level_base action : increase
scope: col_fam:
{'queues', 'default'}

Rule: tuning-iostat-burst
TimeSeriesCondition: large-db-get-p99 statistics: ['[]rocksdb.db.get.micros.p50', '[]rocksdb.db.get.micros.p99'] behavior: evaluate_expression expression: (keys[1]/keys[0])>5 trigger: {'ENTITY_PLACEHOLDER': {1632913979: [9.09997, 55.291267], 1632914579: [14.765463, 108.578789], 1632915179: [16.563531, 104.759209]}}
Suggestion: bytes-per-sync-non0 option : DBOptions.bytes_per_sync action : set suggested_values : ['1048576']
Suggestion: wal-bytes-per-sync-non0 option : DBOptions.wal_bytes_per_sync action : set suggested_values : ['1048576']
Suggestion: set-rate-limiter option : rate_limiter_bytes_per_sec action : set suggested_values : ['1024000']
scope: entities:
{'ENTITY_PLACEHOLDER'}
scope: col_fam:
{'queues', 'default'}

Faster recovery for RocksDB service implementation

In versions < 2.1, the recovery of data when restarting RocksDB is pretty slow. This is due to the fact that it reads all the data from both tables in order to check that the count of active URLs in the queues table matches what is found in the default table without a value (i.e. no refetching planned for it).

This is not strictly necessary and it is possible to regenerate the info from the queues only by reading the default table. The check is now optional and triggered by the config rocksdb.recovery.check.

A test on a small crawl showed a reduction in recovery time from 2.8s to 1.6s.
