lucacanali / miscellaneous

Includes notes on using Apache Spark in general, notes on using Spark for Physics, how to run TPCDS on PySpark, how to create histograms with Spark, tools for performance testing CPUs, Jupyter notebook examples for Spark, and examples for Oracle and other DB systems.

License: Apache License 2.0

PLSQL 0.06% Jupyter Notebook 99.39% Python 0.34% Scala 0.08% Rust 0.04% Shell 0.01% HTML 0.10% Dockerfile 0.01% Roff 0.01%

apache-spark database jupyter-notebooks performance-analysis performance-monitoring performance-testing

miscellaneous's Introduction

Miscellaneous projects and scripts.

Author and contact: [email protected]

Spark and Performance Engineering

Folder	Description
Spark Dashboard	A tool for Apache Spark monitoring; use it to build a performance dashboard and troubleshoot Spark jobs.
Spark Notes	Miscellaneous tips and code snippets about Apache Spark.
Spark for Physics	Examples, with code and data, of how Apache Spark can be used in the domain of High Energy Physics data analysis.
Performance Testing	Code and examples, including: a tool to run TPCDS at scale with PySpark and collect execution metrics; tools for load-testing CPUs, written in Python and Rust; notes on how to use tooling for performance measurements.

Data Engineering and Data Science

Folder	Description
Deep Learning Notes	Notes and examples on Deep Learning tools and related data pipelines.
Pyspark_SQL_Magic_Jupyter	How to write Jupyter SQL magic functions for PySpark and Spark SQL.
Trino and Presto on Jupyter	Example of using Trino or Presto on a Jupyter notebook.
PostgreSQL and YugabyteDB on Jupyter	Example of using PostgreSQL or YugabyteDB on a Jupyter notebook.
Oracle_Jupyter	Examples of how to query Oracle using Jupyter/IPython notebooks.
Impala_SQL_Jupyter	Examples of how to run SQL on Apache Impala using Jupyter/IPython notebooks.
SQL_color_Mandelbrot	How to use SQL to compute and display the Mandelbrot set with colors. Examples for Oracle and PostgreSQL.
PLSQL_Neural_Network	An example of how to deploy a DL serving engine for Oracle using PL/SQL.

miscellaneous's Issues

Scaling questions

Hey Luca!

Thanks again for your spark dashboarding work. It gave me a great leg up on implementing our own metrics solution.

One thing I'm noticing, though, is that the Spark metrics, being per app-id, have really high cardinality, and our metrics receiver (Prometheus & VictoriaMetrics) seems to be struggling as the number of series grows (seeing up to 30MM series per cluster in some cases).

Have you seen anything like this on your installation? Does influx maybe just handle it better?
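
To make the cardinality concern concrete, here is a rough back-of-the-envelope sketch. All counts are hypothetical numbers chosen for illustration, not measurements from any installation. It assumes metrics are named under Spark's default namespace (the application ID) and compares that with setting `spark.metrics.namespace` to `${spark.app.name}`, an option described in Spark's monitoring documentation. Whether this helps in a given Prometheus or VictoriaMetrics setup depends on how the metrics are exported and labeled.

```python
# Back-of-the-envelope estimate of time-series cardinality.
# All counts below are hypothetical, for illustration only.

def estimated_series(namespaces, executors_per_namespace, metrics_per_executor):
    """Rough upper bound on distinct executor-metric time series."""
    return namespaces * executors_per_namespace * metrics_per_executor

# Default behavior: the metrics namespace is spark.app.id,
# so every application run adds a brand-new namespace.
per_app_id = estimated_series(
    namespaces=10_000, executors_per_namespace=50, metrics_per_executor=60
)

# Alternative: namespace by application name, so repeated runs
# of the same job reuse one namespace (assumes 200 distinct job names).
per_app_name = estimated_series(
    namespaces=200, executors_per_namespace=50, metrics_per_executor=60
)

# Configuration that switches the namespace, rendered as spark-submit arguments.
namespace_conf = {"spark.metrics.namespace": "${spark.app.name}"}
spark_submit_args = [
    arg for key, value in namespace_conf.items() for arg in ("--conf", f"{key}={value}")
]

print(per_app_id, per_app_name, spark_submit_args)
```

The trade-off in this sketch is that runs of the same job are no longer distinguishable by namespace alone, so per-run analysis would need another dimension such as a timestamp range.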

worth adding

Hi, I came across your repository and thought it may be worth adding another method of acquiring flame graphs in a k8s environment, for any Java application and many more.

The easiest way is to set up an account on profiler.granulate, then download the ready-to-use YAML template and deploy it on k8s. Special workers profile the cluster and dump the data to the web, with ready-to-use flame graphs.

gprofiler-k8s-deploy-yaml

Regards,
Patryk.

Spark 3.2 support?

Hi @LucaCanali - hope this is an okay channel to reach out on. Love your spark dashboards! 🧡

I was trying to set this up for work on AWS EMR and it seems like the metrics listed on https://spark.apache.org/docs/latest/monitoring.html#component-instance--executor don't get produced from applications on spark prior to 3.3

But I see 3.2 listed in the tags of apache/spark@1ffe03d

Do you happen to know if there's a way to get your dashboards working on 3.2? Perhaps the metrics.properties just needs to be different?

I'm especially curious about the active jobs and executor run time per process graphs.

These work fine for me on 3.3

But if I set up an EMR running spark 3.2 to publish to the same influx, it's just blank.

add jars to hbase server side

I added the jars on the HBase server side according to https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_HBase_Connector.md, but it doesn't work for me. I get the error below. Please help me.

java.lang.NoSuchMethodError: org.apache.hadoop.hbase.spark.protobuf.generated.SparkFilterProtos$SQLPredicatePushDownFilter$Builder.addValueFromQueryArray(Lorg/apache/hbase/thirdparty/com/google/protobuf/ByteString;)Lorg/apache/hadoop/hbase/spark/protobuf/generated/SparkFilterProtos$SQLPredicatePushDownFilter$Builder;
	at org.apache.hadoop.hbase.spark.SparkSQLPushDownFilter.toByteArray(SparkSQLPushDownFilter.java:257)
	at org.apache.hadoop.hbase.spark.datasources.SerializedFilter$.$anonfun$toSerializedTypedFilter$1(HBaseTableScanRDD.scala:273)
	at scala.Option.map(Option.scala:230)
	at org.apache.hadoop.hbase.spark.datasources.SerializedFilter$.toSerializedTypedFilter(HBaseTableScanRDD.scala:273)
	at org.apache.hadoop.hbase.spark.datasources.HBaseTableScanRDD.$anonfun$getPartitions$2(HBaseTableScanRDD.scala:85)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
	at org.apache.hadoop.hbase.spark.datasources.HBaseTableScanRDD.getPartitions(HBaseTableScanRDD.scala:77)

Aggregate all tasks metrics in every stage

Thanks for your help in advance. I have read your notes in ( https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_EventLog.md )
I want to aggregate all task metrics in every stage. For example, to sum the metrics of all tasks (e.g. Disk Bytes Spilled, Executor CPU Time, Executor Deserialize CPU Time, and Executor Deserialize Time).

How can I do that using the Spark logs?
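
One possible approach, sketched below with only the Python standard library, is to read the event log as JSON lines and sum the selected metrics per stage. It assumes the uncompressed JSON event log format, in which each `SparkListenerTaskEnd` event carries `Stage ID`, `Stage Attempt ID`, and a `Task Metrics` object. The function name `aggregate_task_metrics` and the sample events are hypothetical, written by hand for illustration, and are not taken from the linked notes.

```python
import json
from collections import defaultdict

# Metrics named in the question; keys as they appear under "Task Metrics".
METRICS = (
    "Disk Bytes Spilled",
    "Executor CPU Time",
    "Executor Deserialize CPU Time",
    "Executor Deserialize Time",
)

def aggregate_task_metrics(lines, metrics=METRICS):
    """Sum the selected task metrics per (stage ID, stage attempt ID)."""
    totals = defaultdict(lambda: dict.fromkeys(metrics, 0))
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        task_metrics = event.get("Task Metrics")
        if not task_metrics:
            # Some task-end events may carry no metrics; skip them.
            continue
        key = (event["Stage ID"], event["Stage Attempt ID"])
        for name in metrics:
            totals[key][name] += task_metrics.get(name, 0)
    return dict(totals)

def _task_end(stage_id, spilled, cpu, deser_cpu, deser):
    """Build a minimal, hand-written TaskEnd event for the demo below."""
    return json.dumps({
        "Event": "SparkListenerTaskEnd",
        "Stage ID": stage_id,
        "Stage Attempt ID": 0,
        "Task Metrics": {
            "Disk Bytes Spilled": spilled,
            "Executor CPU Time": cpu,
            "Executor Deserialize CPU Time": deser_cpu,
            "Executor Deserialize Time": deser,
        },
    })

# Hypothetical sample: two tasks in stage 0, one in stage 1, one unrelated event.
sample_log = [
    json.dumps({"Event": "SparkListenerJobStart", "Job ID": 0}),
    _task_end(0, 100, 2000, 10, 5),
    _task_end(0, 50, 3000, 20, 7),
    _task_end(1, 0, 1000, 5, 2),
]

result = aggregate_task_metrics(sample_log)
for stage_key in sorted(result):
    print(stage_key, result[stage_key])
```

With a real log, the same function can be called on an open file handle, for example `with open(path) as f: aggregate_task_metrics(f)`, where `path` points to an event log file. Compressed logs would need to be decompressed first, and the metrics do not all share the same time unit, so check the units before comparing the sums.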
