
Comments (14)

fulaphex commented on June 3, 2024

@smola /pga2 works better; I was able to process everything there.

smola commented on June 3, 2024

spark.tech.sourced.engine.skip.read.errors was a compromise solution that we agreed on because the final PGA release contains some repositories with missing objects. We just needed a way to work with our own released dataset.

On the other hand, if we have further corruption in one of our downloaded copies, I do not see much value in working with that corrupted copy. I would rather direct our efforts at fixing the pga tool and ensuring that we have a proper copy in our cluster.

@fulaphex In the meantime, you can try the PGA copy in the /pga2 directory, which was a second download made with an improved pga tool.

ajnavarro commented on June 3, 2024

You can skip siva read errors: https://github.com/src-d/engine/blob/master/python/sourced/engine/engine.py#L30

Just change

engine = Engine(spark, "hdfs://hdfs-namenode/pga/siva/latest/55", "siva")

to

engine = Engine(spark, "hdfs://hdfs-namenode/pga/siva/latest/55", "siva", False, True)
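
The same call with keyword arguments, for readability. This is a minimal sketch assuming the two trailing booleans in the constructor linked above are named skip_cleanup and skip_read_errors:

engine = Engine(spark, "hdfs://hdfs-namenode/pga/siva/latest/55", "siva",
                skip_cleanup=False,     # assumed parameter name: keep default cleanup behaviour
                skip_read_errors=True)  # assumed parameter name: skip unreadable siva entries instead of failing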

fulaphex commented on June 3, 2024

When I encountered the error I was already using this option, and it still fails in the same way with:

from sourced.engine import Engine
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Examples") \
    .getOrCreate()

engine = Engine(spark, "hdfs://hdfs-namenode/pga/siva/latest/55", "siva", False, True)

print("%d repositories successfully loaded" % (engine.repositories.count()/2))

Updated logs.

erizocosmico commented on June 3, 2024

This must be a siva-java issue; go-siva unpacks it just fine:

$ echo '5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva' | pga get -i
$ cd siva/latest/55/
$ siva unpack 5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva
$ ls
5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva
HEAD
config
objects
refs

fulaphex commented on June 3, 2024

@erizocosmico I downloaded all the siva files starting with 55 with the pga tool and they work fine. The problem seems to be with the files on HDFS on the pipeline-staging cluster.

btw: I also encountered this:

tech.sourced.siva.SivaException: Exception at file 9b1a3c1efc7e3c9ee0625159b909e6e7f98d2963.siva: Error reading index of file.

erizocosmico commented on June 3, 2024

You mean that if you download the file locally, the engine works just fine with it?

erizocosmico commented on June 3, 2024

It turns out that opening the file with siva-java does not fail either, so the problem is not there.

fulaphex commented on June 3, 2024

Yes, when I download the single file it works fine.
I also tried recreating the same scenario with files downloaded with the pga tool, and it also works fine.
It only crashes when I run on the cluster reading files from HDFS, although I don't know whether the files on HDFS are the same ones that the pga tool downloads.

erizocosmico commented on June 3, 2024

You gave me an idea. I downloaded the same siva file, this time from HDFS instead of via pga. go-siva and siva-java cannot read it now, so the problem is that the files are corrupted in HDFS.
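
One quick way to confirm this is to compare checksums of the two copies once both are on the local filesystem. A minimal Python sketch (the local paths are illustrative):

import hashlib

def md5sum(path, chunk_size=1 << 20):
    # Hash the file in chunks so large siva files are not loaded into memory whole.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Illustrative local copies: one fetched from HDFS, one downloaded with pga.
print(md5sum("hdfs-copy/5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva"))
print(md5sum("siva/latest/55/5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva"))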

@ajnavarro what should we do about that?

erizocosmico commented on June 3, 2024

@fulaphex Whoever copied the original siva files to the folder you're using did it wrong, or some files were corrupted during that copy: the siva file there is corrupted, but the original (in /apps/borges/root-repositories/) is not.

You can either use /apps/borges/root-repositories/ or try to do the copy again.

bzz commented on June 3, 2024

@erizocosmico @ajnavarro this is indeed a case of the .siva file being corrupted.

Whoever copied the original siva files to the folder you're using did it wrong

Although this suggestion does not look very constructive, this most probably happened on pga get, before it got MD5 verification on download in src-d/datasets#69 (which, BTW, is still not merged yet, so this means ALL pga users have a high chance of stumbling upon this).

You guys did a great job introducing the option spark.tech.sourced.engine.skip.read.errors=true in https://github.com/src-d/engine/pull/395 so that the Engine does not break when reading corrupted repositories, but it only covers things reachable with iterators.
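
For reference, since this is a plain Spark configuration key, it can presumably also be set when building the session rather than through the Engine constructor; the builder API below is standard Spark, but whether the Engine picks the flag up from there is my assumption based on its name:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Examples") \
    .config("spark.tech.sourced.engine.skip.read.errors", "true") \
    .getOrCreate()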

Right now, we can see the Engine canceling the job on a read error in a different place: RepositoryProvider.genSivaRepository.

Here is the relevant log:

tech.sourced.siva.SivaException: Exception at file 022c7272f0c1333a536cb319beadc4171cc8ff6a.siva: At Index footer, index size: Java implementation of siva doesn't support values greater than 9223372036854775807

And this is the .siva file: https://drive.google.com/open?id=1yyDAfzFiDgK8YwjOuByc6zfifPCkGw84

What do you think about allowing RepositoryProvider, under the same configuration option, to also skip broken .siva files (ideally with some metric counter)?

This way, it can be treated as part of the original https://github.com/src-d/engine/issues/393.

ajnavarro commented on June 3, 2024

We cannot expect the engine to work correctly if the files themselves are not even readable. If some Parquet file is corrupted, what do you expect Spark to do? Continue, or fail?

In my opinion, the bug here is in the tool or process used to download the siva files.

bzz commented on June 3, 2024

I see! I'm not saying it's a bug in the Engine: the engine works, and the siva files are corrupted for sure.

Rather, what I was trying to say is that, as a user, I would love an option not to fail the whole job on such files, and I was asking whether you would be open to extending the spark.tech.sourced.engine.skip.read.errors behaviour to cover this failure mode as well.

In that case, we could close this in favor of https://github.com/src-d/engine/issues/393 (or a similar new feature request), and then I would be happy to look deeper into this and submit a patch. WDYT?
