Comments (14)
@smola /pga2 works better; I was able to process everything there.
from jgit-spark-connector.
spark.tech.sourced.engine.skip.read.errors
was a compromise solution that we agreed on because the final PGA release contains some repositories with missing objects. We just needed to be able to work with our own released dataset.
On the other hand, if we have further corruption in one of our downloaded copies, I do not see much value in working with that corrupted copy. I would rather direct our efforts at fixing the pga tool and at ensuring that we have a proper copy in our cluster.
@fulaphex In the meantime, you can try the PGA copy in the /pga2 directory, which was a second download made with an improved pga tool.
You can skip siva read errors: https://github.com/src-d/engine/blob/master/python/sourced/engine/engine.py#L30
Just change
engine = Engine(spark, "hdfs://hdfs-namenode/pga/siva/latest/55", "siva")
to
engine = Engine(spark, "hdfs://hdfs-namenode/pga/siva/latest/55", "siva", False, True)
When I encountered the error I was already using this option, and it still fails in the same way:
from sourced.engine import Engine
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder \
    .master("local[*]").appName("Examples") \
    .getOrCreate()
engine = Engine(spark, "hdfs://hdfs-namenode/pga/siva/latest/55", "siva", False, True)
print("%d repositories successfully loaded" % (engine.repositories.count()/2))
Updated logs.
This must be a siva-java issue; go-siva unpacks it just fine.
$ echo '5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva' | pga get -i
$ cd siva/latest/55/
$ siva unpack 5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva
$ ls
5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva
HEAD
config
objects
refs
@erizocosmico I downloaded all siva files starting with 55 with the pga tool and they work fine. The problem seems to be with the files on HDFS in the pipeline-staging cluster.
btw: I also encountered this:
tech.sourced.siva.SivaException: Exception at file 9b1a3c1efc7e3c9ee0625159b909e6e7f98d2963.siva: Error reading index of file.
You mean that if you download the file locally, the engine works just fine with it?
Turns out opening the file with siva-java does not fail either, so the problem is not there.
Yes, when I download the single file, it works fine.
I also tried recreating the same scenario with files downloaded with the pga tool, and it also works fine.
It only crashes when I run on the cluster reading files from HDFS, although I don't know if the files on HDFS are the same ones that the pga tool downloads.
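One way to settle whether the HDFS copies match the pga downloads is to compare checksums of the same siva file from both sources. A minimal sketch (the comparison paths are hypothetical; the HDFS copy would first be fetched locally, e.g. with `hdfs dfs -get`):

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Stream a file and return its MD5 hex digest."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# Hypothetical paths: one copy from `pga get`, one fetched from HDFS.
# md5sum("pga/siva/latest/55/5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva") \
#     == md5sum("hdfs-copy/5559a23b77e8cbf94961deaa3c3473ad3734fbe1.siva")
```

If the digests differ, the HDFS copy (or the download it came from) is corrupted.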
You gave me an idea. I downloaded the same siva file, this time from HDFS instead of pga. Neither go-siva nor siva-java can read it now. So the problem is that the files are corrupted in HDFS.
@ajnavarro what should we do about that?
@fulaphex whoever copied the original siva files to the folder you're using did it wrong, or some files were corrupted during that copy, because the siva file there is corrupted but the original (in /apps/borges/root-repositories/) is not.
You can either use /apps/borges/root-repositories/ or try to do the copy again.
@erizocosmico @ajnavarro this is indeed a case of the .siva file being corrupted.
"whoever did the copy of the original siva files to that folder you're using did it wrong"
Although this suggestion does not look very constructive, this most probably happened on pga get, before it got md5 verification on download in src-d/datasets#69 (which BTW is still not merged, so ALL pga users have a high chance of stumbling upon this).
You guys did a great job introducing the option spark.tech.sourced.engine.skip.read.errors=true
to keep the Engine from breaking when reading corrupted repositories in https://github.com/src-d/engine/pull/395, but it only covers things reachable with iterators.
Right now, we can see the Engine canceling the job on a read error in a different place, in RepositoryProvider.genSivaRepository.
Here is the relevant log:
tech.sourced.siva.SivaException: Exception at file 022c7272f0c1333a536cb319beadc4171cc8ff6a.siva: At Index footer, index size: Java implementation of siva doesn't support values greater than 9223372036854775807
And this is the .siva file https://drive.google.com/open?id=1yyDAfzFiDgK8YwjOuByc6zfifPCkGw84
What do you think about letting RepositoryProvider, under the same configuration option, also skip broken .siva files (ideally with some metric counter)?
This way, it could be treated as part of the original https://github.com/src-d/engine/issues/393
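As a user-side stopgap until something like that lands, files like the one above can often be detected before handing the path to the Engine, because the reported index size is implausibly large. A rough sketch, assuming the 24-byte siva block footer layout from the go-siva format description (big-endian uint32 entry count, uint64 index size, uint64 block size, uint32 CRC32); this is a heuristic for pre-filtering, not a full validator of the siva format:

```python
import struct

FOOTER_SIZE = 24  # assumed footer: uint32 entries, uint64 index size, uint64 block size, uint32 CRC32


def looks_corrupted(path):
    """Heuristically flag a siva file whose footer index size is implausible."""
    with open(path, "rb") as f:
        f.seek(0, 2)
        size = f.tell()
        if size < FOOTER_SIZE:
            return True  # too small to even hold a footer
        f.seek(size - FOOTER_SIZE)
        entries, index_size, block_size, crc = struct.unpack(">IQQI", f.read(FOOTER_SIZE))
    # An index larger than the file itself, or larger than 2**63 - 1 (which
    # overflows Java's signed long, as in the log above), cannot be valid.
    return index_size > size or index_size > (1 << 63) - 1
```

Paths that pass this check could then be fed to the Engine, with the rest logged and counted.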
We cannot expect the engine to work correctly if the files themselves are not even readable. If some Parquet file is corrupted, what would you expect Spark to do: continue, or fail?
The bug here is, in my opinion, in the tool or process used to download the siva files.
I see! I'm not saying it's a bug in the Engine - the Engine works, and the siva files are corrupted for sure.
Rather, I was trying to say that, as a user, I would love an option not to fail the whole job on such files, and I was asking whether you would be open to extending the spark.tech.sourced.engine.skip.read.errors
behaviour to cover this failure mode as well.
In that case, we could close this in favor of https://github.com/src-d/engine/issues/393 (or a new, similar feature request), and then I would be happy to look deeper into this and submit a patch. WDYT?