
Comments (14)

fulaphex commented on June 3, 2024

@smola /pga2 works better; I was able to process everything there.

smola commented on June 3, 2024

spark.tech.sourced.engine.skip.read.errors was a compromise solution that we agreed on because the final PGA release contains some repositories with missing objects. We just needed a way to work with our own released dataset.

On the other hand, if we have further corruption in one of our downloaded copies, I do not see much value in working with that corrupted copy. I would rather direct our efforts at fixing the pga tool and ensuring that we have a proper copy in our cluster.

@fulaphex In the meantime, you can try the PGA copy in the /pga2 directory, which was a second download made with an improved pga tool.

ajnavarro commented on June 3, 2024

You can skip siva read errors: https://github.com/src-d/engine/blob/master/python/sourced/engine/engine.py#L30

Just change

engine = Engine(spark, "hdfs://hdfs-namenode/pga/siva/latest/55", "siva")

to

engine = Engine(spark, "hdfs://hdfs-namenode/pga/siva/latest/55", "siva", False, True)
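
The same call with keyword arguments, for readability. This is a minimal sketch assuming the two trailing booleans in the constructor linked above are named skip_cleanup and skip_read_errors:

engine = Engine(spark, "hdfs://hdfs-namenode/pga/siva/latest/55", "siva",
                skip_cleanup=False,     # assumed parameter name: keep default cleanup behaviour
                skip_read_errors=True)  # assumed parameter name: skip unreadable siva entries instead of failing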

fulaphex commented on June 3, 2024

When I encountered the error I was already using this option, and it still fails in the same way with:

from sourced.engine import Engine
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Examples") \
    .getOrCreate()

engine = Engine(spark, "hdfs://hdfs-namenode/pga/siva/latest/55", "siva", False, True)

print("%d repositories successfully loaded" % (engine.repositories.count()/2))

Updated logs.

erizocosmico commented on June 3, 2024

This must be a siva-java issue; go-siva unpacks it just fine:

$ echo '5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva' | pga get -i
$ cd siva/latest/55/
$ siva unpack 5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva
$ ls
5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva
HEAD
config
objects
refs

fulaphex commented on June 3, 2024

@erizocosmico I downloaded all the siva files starting with 55 with the pga tool and they work fine. The problem seems to be with the files on HDFS on the pipeline-staging cluster.

btw: I also encountered this:

tech.sourced.siva.SivaException: Exception at file 9b1a3c1efc7e3c9ee0625159b909e6e7f98d2963.siva: Error reading index of file.

erizocosmico commented on June 3, 2024

You mean that if you download the file locally, the engine works just fine with it?

erizocosmico commented on June 3, 2024

It turns out that opening the file with siva-java does not fail either, so the problem is not there.

fulaphex commented on June 3, 2024

Yes, when I download the single file it works fine.
I also tried recreating the same scenario with files downloaded with the pga tool, and it also works fine.
It only crashes when I run on the cluster reading files from HDFS, although I don't know whether the files on HDFS are the same ones that the pga tool downloads.

erizocosmico commented on June 3, 2024

You gave me an idea. I downloaded the same siva file, this time from HDFS instead of via pga. go-siva and siva-java cannot read it now, so the problem is that the files are corrupted in HDFS.
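
One quick way to confirm this is to compare checksums of the two copies once both are on the local filesystem. A minimal Python sketch (the local paths are illustrative):

import hashlib

def md5sum(path, chunk_size=1 << 20):
    # Hash the file in chunks so large siva files are not loaded into memory whole.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Illustrative local copies: one fetched from HDFS, one downloaded with pga.
print(md5sum("hdfs-copy/5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva"))
print(md5sum("siva/latest/55/5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva"))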

@ajnavarro what should we do about that?

erizocosmico commented on June 3, 2024

@fulaphex Whoever copied the original siva files to the folder you're using did it wrong, or some files were corrupted during that copy: the siva file there is corrupted, but the original (in /apps/borges/root-repositories/) is not.

You can either use /apps/borges/root-repositories/ or try to do the copy again.

bzz commented on June 3, 2024

@erizocosmico @ajnavarro this is indeed a case of the .siva file being corrupted.

Whoever copied the original siva files to the folder you're using did it wrong

Although this suggestion does not look very constructive, this most probably happened on pga get, before it got MD5 verification on download in src-d/datasets#69 (which, BTW, is still not merged yet, so this means ALL pga users have a high chance of stumbling upon this).

You guys did a great job introducing the option spark.tech.sourced.engine.skip.read.errors=true in https://github.com/src-d/engine/pull/395 so that the Engine does not break when reading corrupted repositories, but it only covers things reachable with iterators.
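
For reference, since this is a plain Spark configuration key, it can presumably also be set when building the session rather than through the Engine constructor; the builder API below is standard Spark, but whether the Engine picks the flag up from there is my assumption based on its name:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Examples") \
    .config("spark.tech.sourced.engine.skip.read.errors", "true") \
    .getOrCreate()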

Right now, we can see the Engine canceling the job on a read error in a different place: RepositoryProvider.genSivaRepository.

Here is the relevant log:

tech.sourced.siva.SivaException: Exception at file 022c7272f0c1333a536cb319beadc4171cc8ff6a.siva: At Index footer, index size: Java implementation of siva doesn't support values greater than 9223372036854775807

And this is the .siva file: https://drive.google.com/open?id=1yyDAfzFiDgK8YwjOuByc6zfifPCkGw84

What do you think about allowing RepositoryProvider, under the same configuration option, to also skip broken .siva files (ideally with some metric counter)?

This way, it can be treated as part of the original https://github.com/src-d/engine/issues/393.

ajnavarro commented on June 3, 2024

We cannot expect the engine to work correctly if the files themselves are not even readable. If some Parquet file is corrupted, what do you expect Spark to do? Continue, or fail?

In my opinion, the bug here is in the tool or process used to download the siva files.

bzz commented on June 3, 2024

I see! I'm not saying it's a bug in the Engine: the engine works, and the siva files are corrupted for sure.

Rather, what I was trying to say is that, as a user, I would love an option not to fail the whole job on such files, and I was asking whether you would be open to extending the spark.tech.sourced.engine.skip.read.errors behaviour to cover this failure mode as well.

In that case, we could close this in favor of https://github.com/src-d/engine/issues/393 (or a similar new feature request), and then I would be happy to look deeper into this and submit a patch. WDYT?
