twosigma / flint
A Time Series Library for Apache Spark
License: Apache License 2.0
Hi,
I'm trying to use flint by submitting a PySpark job on YARN:
./bin/pyspark --master yarn --deploy-mode client --jars /opt/flint-assembly-0.2.0-SNAPSHOT.jar --py-files /opt/flint-assembly-0.2.0-SNAPSHOT.jar
[..]
SparkSession available as 'spark'.
>>> import ts.flint
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'ts'
>>> import sys
>>> sys.path
['', '/tmp/spark-1d453f8f-379a-4f22-a7e4-cabe9dad15c5/userFiles-0b6ed883-15b1-4893-acb8-977f74b09913/flint-assembly-0.2.0-SNAPSHOT.jar', '/tmp/spark-1d453f8f-379a-4f22-a7e4-cabe9dad15c5/userFiles-0b6ed883-15b1-4893-acb8-977f74b09913', '/usr/hdp/2.6.1.0-129/spark2/python/lib/py4j-0.10.4-src.zip', '/usr/hdp/2.6.1.0-129/spark2/python', '/usr/hdp/2.6.1.0-129/spark2', '/opt/miniconda3/envs/lhd_spark/lib/python36.zip', '/opt/miniconda3/envs/lhd_spark/lib/python3.6', '/opt/miniconda3/envs/lhd_spark/lib/python3.6/lib-dynload', '/opt/miniconda3/envs/lhd_spark/lib/python3.6/site-packages', '/opt/miniconda3/envs/lhd_spark/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg']
Using the same approach with --master local works properly, while on YARN it seems to refer to an invalid path and the import fails.
I was able to use the library, following a similar topic, by extracting the Python code from the jar and copying it into my working directory.
Is that OK, or is there a better way to proceed?
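For reference, here is a minimal sketch of that workaround (the jar path is taken from the command above; everything else is an assumption). The assembly jar is just a zip archive, so the embedded ts/ package can be unpacked next to the driver script:
import zipfile

FLINT_JAR = "/opt/flint-assembly-0.2.0-SNAPSHOT.jar"  # path from the command above

# Extract only the bundled Python package (ts/flint/...) into the working
# directory so that `import ts.flint` resolves locally on the driver.
with zipfile.ZipFile(FLINT_JAR) as jar:
    members = [m for m in jar.namelist() if m.startswith("ts/")]
    jar.extractall(path=".", members=members)

import ts.flint  # noqa: E402  -- now importable from the working directory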
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: org.slf4j#slf4j-api;1.7.7: several problems occurred while resolving dependency: org.slf4j#slf4j-api;1.7.7 {compile=[compile(), master(compile)], runtime=[runtime()]}:
[warn] several problems occurred while resolving dependency: org.slf4j#slf4j-parent;1.7.7 {}:
[warn] URI has an authority component
[warn] URI has an authority component
[warn]
[warn] URI has an authority component
[warn] :: org.apache.spark#spark-core_2.11;2.1.0: several problems occurred while resolving dependency: org.apache.spark#spark-core_2.11;2.1.0 {provided=[default(compile)]}:
[warn] several problems occurred while resolving dependency: org.apache.spark#spark-parent_2.11;2.1.0 {}:
[warn] URI has an authority component
[warn] URI has an authority component
[warn]
[warn] URI has an authority component
[warn] :: org.apache.spark#spark-mllib_2.11;2.1.0: several problems occurred while resolving dependency: org.apache.spark#spark-mllib_2.11;2.1.0 {provided=[default(compile)]}:
[warn] several problems occurred while resolving dependency: org.apache.spark#spark-parent_2.11;2.1.0 {}:
[warn] URI has an authority component
[warn] URI has an authority component
[warn]
[warn] URI has an authority component
[warn] :: org.apache.spark#spark-sql_2.11;2.1.0: several problems occurred while resolving dependency: org.apache.spark#spark-sql_2.11;2.1.0 {provided=[default(compile)]}:
[warn] several problems occurred while resolving dependency: org.apache.spark#spark-parent_2.11;2.1.0 {}:
[warn] URI has an authority component
[warn] URI has an authority component
[warn]
[warn] URI has an authority component
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn]
[warn] Note: Unresolved dependencies path:
[warn] org.slf4j:slf4j-api:1.7.7
[warn] +- org.clapper:grizzled-slf4j_2.11:1.3.0 (D:\flint-master\build.sbt#L98)
[warn] +- com.twosigma:flint_2.11:0.2.0-SNAPSHOT
[warn] org.apache.spark:spark-core_2.11:2.1.0 (D:\flint-master\build.sbt#L98)
[warn] +- com.twosigma:flint_2.11:0.2.0-SNAPSHOT
[warn] org.apache.spark:spark-mllib_2.11:2.1.0 (D:\flint-master\build.sbt#L98)
[warn] +- com.twosigma:flint_2.11:0.2.0-SNAPSHOT
[warn] org.apache.spark:spark-sql_2.11:2.1.0 (D:\flint-master\build.sbt#L98)
[warn] +- com.twosigma:flint_2.11:0.2.0-SNAPSHOT
sbt.ResolveException: unresolved dependency: org.slf4j#slf4j-api;1.7.7: several problems occurred while resolving dependency: org.slf4j#slf4j-api;1.7.7 {compile=[compile(), master(compile)], runtime=[runtime()]}:
several problems occurred while resolving dependency: org.slf4j#slf4j-parent;1.7.7 {}:
URI has an authority component
URI has an authority component
URI has an authority component
unresolved dependency: org.apache.spark#spark-core_2.11;2.1.0: several problems occurred while resolving dependency: org.apache.spark#spark-core_2.11;2.1.0 {provided=[default(compile)]}:
several problems occurred while resolving dependency: org.apache.spark#spark-parent_2.11;2.1.0 {}:
URI has an authority component
URI has an authority component
URI has an authority component
unresolved dependency: org.apache.spark#spark-mllib_2.11;2.1.0: several problems occurred while resolving dependency: org.apache.spark#spark-mllib_2.11;2.1.0 {provided=[default(compile)]}:
several problems occurred while resolving dependency: org.apache.spark#spark-parent_2.11;2.1.0 {}:
URI has an authority component
URI has an authority component
URI has an authority component
unresolved dependency: org.apache.spark#spark-sql_2.11;2.1.0: several problems occurred while resolving dependency: org.apache.spark#spark-sql_2.11;2.1.0 {provided=[default(compile)]}:
several problems occurred while resolving dependency: org.apache.spark#spark-parent_2.11;2.1.0 {}:
URI has an authority component
URI has an authority component
URI has an authority component
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:313)
at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:191)
at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:168)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156)
at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:133)
at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57)
at sbt.IvySbt$$anon$4.call(Ivy.scala:65)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:95)
at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:80)
at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:99)
at xsbt.boot.Using$.withResource(Using.scala:10)
at xsbt.boot.Using$.apply(Using.scala:9)
at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:60)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:50)
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at sbt.IvySbt.withDefaultLogger(Ivy.scala:65)
at sbt.IvySbt.withIvy(Ivy.scala:128)
at sbt.IvySbt.withIvy(Ivy.scala:125)
at sbt.IvySbt$Module.withModule(Ivy.scala:156)
at sbt.IvyActions$.updateEither(IvyActions.scala:168)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1439)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1435)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$90.apply(Defaults.scala:1470)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$90.apply(Defaults.scala:1468)
at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:37)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1473)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1467)
at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:60)
at sbt.Classpaths$.cachedUpdate(Defaults.scala:1490)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1417)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1369)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
at sbt.std.Transform$$anon$4.work(System.scala:63)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
at sbt.Execute.work(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[error] (:update) sbt.ResolveException: unresolved dependency: org.slf4j#slf4j-api;1.7.7: several problems occurred while resolving dependency: org.slf4j#slf4j-api;1.7.7 {compile=[compile(), master(compile)], runtime=[runtime(*)]}:
[error] several problems occurred while resolving dependency: org.slf4j#slf4j-parent;1.7.7 {}:
[error] URI has an authority component
[error] URI has an authority component
[error]
[error] URI has an authority component
[error]
[error] unresolved dependency: org.apache.spark#spark-core_2.11;2.1.0: several problems occurred while resolving dependency: org.apache.spark#spark-core_2.11;2.1.0 {provided=[default(compile)]}:
[error] several problems occurred while resolving dependency: org.apache.spark#spark-parent_2.11;2.1.0 {}:
[error] URI has an authority component
[error] URI has an authority component
[error]
[error] URI has an authority component
[error]
[error] unresolved dependency: org.apache.spark#spark-mllib_2.11;2.1.0: several problems occurred while resolving dependency: org.apache.spark#spark-mllib_2.11;2.1.0 {provided=[default(compile)]}:
[error] several problems occurred while resolving dependency: org.apache.spark#spark-parent_2.11;2.1.0 {}:
[error] URI has an authority component
[error] URI has an authority component
[error]
[error] URI has an authority component
[error]
[error] unresolved dependency: org.apache.spark#spark-sql_2.11;2.1.0: several problems occurred while resolving dependency: org.apache.spark#spark-sql_2.11;2.1.0 {provided=[default(compile)]}:
[error] several problems occurred while resolving dependency: org.apache.spark#spark-parent_2.11;2.1.0 {}:
[error] URI has an authority component
[error] URI has an authority component
[error]
[error] URI has an authority component
[error] Total time: 32 s, completed Feb 9, 2018 3:53:50 PM
Can you help me resolve this issue when building the sbt project?
Hi,
If I save a TimeSeriesRDD to parquet as described in the documentation and then try to reread it without sorting, it will often fail with an error of the form:
"Partitions are not sorted. The partition 0 has the first key 1541810691000000000 and the partition 1 has the first key 1541780937000000000."
Doing a bit more research, I can see you have PR #16, which I think describes the issue in more detail and says that you are working around it using a custom patched version of Spark. I can see that the PR is a year or so old, so I presume it doesn't solve the issue?
If this is the case, is there any other workaround that can be used to save a TimeSeriesRDD to disk and subsequently reload it without paying the cost of a sort?
I have a dataset whose time index is daily, and I want to sum all the values for the same day (same index). I can do this with the summarize method, but it doesn't work when specifying the key 'time' (my flint time index).
Using:
flintdf.summarize(summarizers.sum('values'), key='time')
or
flintdf.summarizeCycles(summarizers.sum('values'), key='time')
gives me this error:
Py4JJavaError: An error occurred while calling o449.summarize.
: com.twosigma.flint.timeseries.row.DuplicateColumnsException: Found duplicate columns List(time) in schema...
The point is that I also tried using summarizeCycles without a key, but it sums the time too and gives me the total sum of absolutely everything:
flintdf.summarizeCycles(summarizers.count())
returns something like this:
+-------------------+-----+
|               time|count|
+-------------------+-----+
|9223372036854775807| 3030|
+-------------------+-----+
I think this could be another possible malfunction, because there are many different timestamps in my dataset, and the documentation says:
"Computes aggregate statistics of rows that share a timestamp."
The last thing I tried was to use another date field, which lets me use summarize and summarizeCycles with that date as the key, but it looks like the summarize method drops the time index, setting all its values to 0 and leaving the resulting dataframe unsorted.
Using summarizeCycles with a key returns the same dataframe, keeping only the first element with a timestamp; the rows with repeated indices are deleted, and the same index is used for every value, namely the sum of all the times, in my case 9223372036854775807.
Python version: 3.5
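For reference, a minimal sketch of the documented per-timestamp aggregation discussed above, assuming the DataFrame is read through a FlintContext so that the canonical time column is named 'time' (sqlContext, df and the 'values' column are assumptions taken from the snippets above):
from ts.flint import FlintContext, summarizers

fc = FlintContext(sqlContext)  # assumes an existing sqlContext
flintdf = fc.read.option('isSorted', False).dataframe(df)  # df is assumed

# summarizeCycles aggregates rows that share a timestamp; the time column is
# implicit, so it is not passed as a key (passing key='time' is what raises
# DuplicateColumnsException).
daily_sums = flintdf.summarizeCycles(summarizers.sum('values'))
daily_sums.show()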
The clocks function for Flint in Python is returning incorrect intervals.
The time intervals are far larger than what I am specifying in the function.
For example:
from ts.flint import clocks
clock = clocks.uniform(sqlContext, frequency="1s", offset="0ns")
clock.show()
returns
time:timestamp
+-------------------+
| time|
+-------------------+
|1970-01-01 00:00:00|
|1970-01-01 00:16:40|
|1970-01-01 00:33:20|
|1970-01-01 00:50:00|
|1970-01-01 01:06:40|
|1970-01-01 01:23:20|
|1970-01-01 01:40:00|
|1970-01-01 01:56:40|
|1970-01-01 02:13:20|
|1970-01-01 02:30:00|
|1970-01-01 02:46:40|
|1970-01-01 03:03:20|
|1970-01-01 03:20:00|
|1970-01-01 03:36:40|
|1970-01-01 03:53:20|
|1970-01-01 04:10:00|
|1970-01-01 04:26:40|
|1970-01-01 04:43:20|
|1970-01-01 05:00:00|
|1970-01-01 05:16:40|
+-------------------+
only showing top 20 rows
It should produce 1-second intervals but returns intervals of 16 minutes 40 seconds.
Similarly, an interval of 1 day returns intervals of roughly 2 years and 9 months.
from ts.flint import clocks
clock = clocks.uniform(sqlContext, frequency="1d", offset="0ns", )
clock.show()
+-------------------+
| time|
+-------------------+
|1970-01-01 00:00:00|
|1972-09-27 00:00:00|
|1975-06-24 00:00:00|
|1978-03-20 00:00:00|
|1980-12-14 00:00:00|
|1983-09-10 00:00:00|
|1986-06-06 00:00:00|
|1989-03-02 00:00:00|
|1991-11-27 00:00:00|
|1994-08-23 00:00:00|
|1997-05-19 00:00:00|
|2000-02-13 00:00:00|
|2002-11-09 00:00:00|
|2005-08-05 00:00:00|
|2008-05-01 00:00:00|
|2011-01-26 00:00:00|
|2013-10-22 00:00:00|
|2016-07-18 00:00:00|
|2019-04-14 00:00:00|
|2022-01-08 00:00:00|
+-------------------+
only showing top 20 rows
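A quick arithmetic check of the spacing shown above (pure Python, no Flint involved) suggests both clocks are ticking at exactly 1000 times the requested frequency:
from datetime import date

# 1-second clock: observed gap of 16 min 40 s
print(16 * 60 + 40)                                  # 1000 seconds
# 1-day clock: gap between the first two ticks shown above
print((date(1972, 9, 27) - date(1970, 1, 1)).days)   # 1000 days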
Also, when I supply custom start and end times, the years returned are way out of range.
from ts.flint import clocks
clock = clocks.uniform(sqlContext, frequency="1d", offset="0ns", begin_date_time="2014-04-23", end_date_time="2015-04-23")
clock.show()
time:timestamp
+--------------------+
| time|
+--------------------+
|46277-07-20 00:00...|
|46280-04-15 00:00...|
|46283-01-10 00:00...|
|46285-10-06 00:00...|
|46288-07-02 00:00...|
|46291-03-29 00:00...|
|46293-12-23 00:00...|
|46296-09-18 00:00...|
|46299-06-15 00:00...|
|46302-03-12 00:00...|
|46304-12-06 00:00...|
|46307-09-02 00:00...|
|46310-05-29 00:00...|
|46313-02-22 00:00...|
|46315-11-19 00:00...|
|46318-08-15 00:00...|
|46321-05-11 00:00...|
|46324-02-05 00:00...|
|46326-11-01 00:00...|
|46329-07-28 00:00...|
+--------------------+
only showing top 20 rows
I can successfully open the PySpark shell with the command provided on your python/README.md file
pyspark --master=local --jars /path/to/assembly/flint-assembly-0.6.0-SNAPSHOT.jar --py-files /path/to/assembly/flint-assembly-0.2.0-SNAPSHOT.jar
However, when I try to import the ts.flint module with the command
import ts.flint
I always get the error
cannot import name 'rankers'
Am I missing something? The file python/ts/flint/dataframe.py contains the statement
from . import rankers
but I cannot find any such file in the repository.
Any help would be appreciated.
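For what it's worth, since the assembly jar is just a zip archive, one quick way to check whether the jar passed via --py-files actually ships rankers.py is a sketch like this (the path is an assumption):
import zipfile

# Print every entry of the bundled Python package that mentions "rankers";
# if nothing is listed, the `from . import rankers` in dataframe.py will fail.
with zipfile.ZipFile("/path/to/assembly/flint-assembly-0.2.0-SNAPSHOT.jar") as jar:
    print([name for name in jar.namelist() if "rankers" in name])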
When the time column is not the first column, it gives a confusing exception:
ValueError: object of IntegerType out of range, got: 1357084800000000000
Hi,
Is there a way to specify the install directory for sbt assembly?
Thanks,
Martin
Hello, I am using many regressions in parallel over a single call to summarize. I've noticed that if I run ~20 regressions on a dataset with 5M rows, it seems to take 45-60 minutes to summarize. If I run a single regression on a similarly-sized dataset, however, it only takes a minute or two to summarize. What kinds of performance characteristics should I expect, and how can I avoid this kind of performance collapse?
Thank you!
Hi - I am getting the following error when trying to run the Python example:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/Users/__/flint-master/python/ts/flint/dataframe.py", line 592, in summarize
tsrdd = self.timeSeriesRDD.summarize(composed_summarizer._jsummarizer(self._sc), scala_key)
File "/Users/__/flint-master/python/ts/flint/dataframe.py", line 133, in timeSeriesRDD
self._jdf, self._is_sorted, self._junit, self._time_column)
File "/usr/local/lib/python3.6/site-packages/py4j/java_gateway.py", line 1160, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/Users/__/spark-2.3.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/local/lib/python3.6/site-packages/py4j/protocol.py", line 320, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o284.fromDF.
java.lang.NoSuchMethodError: org.apache.spark.sql.SQLContext.internalCreateDataFrame(Lorg/apache/spark/rdd/RDD;Lorg/apache/spark/sql/types/StructType;)Lorg/apache/spark/sql/Dataset;
at org.apache.spark.sql.DFConverter$.toDataFrame(DFConverter.scala:37)
at com.twosigma.flint.timeseries.TimeSeriesStore$.apply(TimeSeriesStore.scala:72)
at com.twosigma.flint.timeseries.TimeSeriesStore$.apply(TimeSeriesStore.scala:59)
at com.twosigma.flint.timeseries.TimeSeriesRDD$.fromDFWithPartInfo(TimeSeriesRDD.scala:388)
at com.twosigma.flint.timeseries.TimeSeriesRDD$.fromDF(TimeSeriesRDD.scala:271)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Is this due to Spark support being limited to 2.0?
Thank you!!
I just dove into the source code of flint, and it looks very good. Now I have a new requirement and I'm trying to create a custom summarizer. It would be great if you could give me some feedback on the usage.
So, say we have a time series of events: some events are snapshots, but most events are deltas, which are applied to the previous snapshot. The snapshot might be very large, say, the whole order book of levels. This is why we cannot store the snapshot (the actual state) in the time series data.
So I started with a simplified version. Say the snapshot is just a Long, and the delta is also a Long, like this:
+----+--------+-----+
|time|snapshot|delta|
+----+--------+-----+
|1000| null| 1|
|1010| null| 1|
|1050| 8| null|
|1100| null| 1|
|1200| null| 1|
|1250| null| 1|
|1350| 11| null|
|1550| null| 1|
+----+--------+-----+
From there, I want to create a summarizer that can reconstruct the snapshots and then grab some information from the snapshot at each event. Say I want to generate a final data frame like:
+----+--------+-----+----------------------+
|time|snapshot|delta|reconstructed_snapshot|
+----+--------+-----+----------------------+
|1000| null| 1| null|
|1010| null| 1| null|
|1050| 8| null| 8|
|1100| null| 1| 9|
|1200| null| 1| 10|
|1250| null| 1| 11|
|1350| 11| null| 11|
|1550| null| 1| 12|
+----+--------+-----+----------------------+
The first two rows are still null because there is no start snapshot. But after the first snapshot, we should be able to reconstruct the snapshot for each event.
Now I want to write a custom summarizer, and I saw that most of the summarizers extend LeftSubtractableSummarizer. I also found there is a property test, LeftSubtractableProperty, which says: "Check if (a + b) + c - a = b + c".
In logic, the snapshot reconstruction is left subtractable. But in practice, for a window, we must know the snapshot of the first event in that window, otherwise the state cannot be reconstructed. For this reason I started to define my summarizer as: case class Experiment1Summarizer() extends LeftSubtractableSummarizer[Option[Long], Option[Long], Option[Long]]
The three types T, U and V are all Option[Long] for this reason. So obviously it will not pass the LeftSubtractableProperty check. For example:
a = |1050| 8| null|
b = |1100| null| 1|
c = |1200| null| 1|
a + b + c = Some(10) (a is a snapshot)
a + b + c - a = Some(10), because removing an old event will not affect the state
but b + c = None, because there is no starting snapshot.
So, does it make sense to use LeftSubtractableSummarizer to implement this?
Regards,
Xiang.
In this interface: https://github.com/twosigma/flint/blob/master/src/main/scala/com/twosigma/flint/timeseries/summarize/Summarizer.scala#L211
Sometimes it is also good to know which row it is rendering. Is it possible to have something like: def fromV(v: V, t: T): InternalRow
Are there any plans to update Flint to run on Spark 2.3? I would love to be able to use pandas_udf's on my Time Series DataFrames.
Thanks!
Is there a reason why the convenience object Schema is private?
private[timeseries] object Schema
For instance:
// preferred but not working because Schema private
val tsRdd = TimeSeriesRDD.fromRDD(sc.parallelize(data, defaultNumPartitions), Schema("time" -> LongType, "id" -> IntegerType, "price" -> DoubleType))(isSorted = true, timeUnit = TimeUnit.NANOSECONDS)
val schema = StructType(
StructField("time", LongType) ::
StructField("id", IntegerType) ::
StructField("price", DoubleType) :: Nil)
val tsRdd1 = TimeSeriesRDD.fromRDD(sc.parallelize(data, defaultNumPartitions), schema)(isSorted = true, timeUnit = TimeUnit.NANOSECONDS)
Also, some TimeSeriesRDD constructors are private, but they would be useful:
private[timeseries] def fromSeq(
sc: SparkContext,
rows: Seq[InternalRow],
schema: StructType,
isSorted: Boolean,
numSlices: Int = 1
): TimeSeriesRDD
private[flint] def fromOrderedRDD(
rdd: OrderedRDD[Long, Row],
schema: StructType
): TimeSeriesRDD = {
val converter = CatalystTypeConvertersWrapper.toCatalystRowConverter(schema)
TimeSeriesRDD.fromInternalOrderedRDD(rdd.mapValues {
case (_, row) => converter(row)
}, schema)
}
Also, for testing, access to the OrderedRDD is valuable, but that is also private:
private[flint] def orderedRdd: OrderedRDD[Long, InternalRow]
This may open the implementation too much.
Hi,
I am trying to install flint on my mac but when I run 'make dist', I get this error:
Himanshus-MacBook-Pro:flint-master himanshugupta$ make dist
find . -name '*.pyc' -exec rm -f {} +
find . -name '*.pyo' -exec rm -f {} +
find . -name '*~' -exec rm -f {} +
find . -name '__pycache__' -exec rm -fr {} +
sbt "set test in assembly := {}" clean assembly
[info] Loading project definition from /Users/himanshugupta/spark/flint-master/project
java.lang.NullPointerException
at java.base/java.util.regex.Matcher.getTextLength(Matcher.java:1769)
at java.base/java.util.regex.Matcher.reset(Matcher.java:416)
at java.base/java.util.regex.Matcher.<init>(Matcher.java:253)
at java.base/java.util.regex.Pattern.matcher(Pattern.java:1147)
at java.base/java.util.regex.Pattern.split(Pattern.java:1264)
at java.base/java.util.regex.Pattern.split(Pattern.java:1335)
at sbt.IO$.pathSplit(IO.scala:797)
at sbt.IO$.parseClasspath(IO.scala:912)
at sbt.compiler.CompilerArguments.extClasspath(CompilerArguments.scala:66)
at sbt.compiler.MixedAnalyzingCompiler$.withBootclasspath(MixedAnalyzingCompiler.scala:188)
at sbt.compiler.MixedAnalyzingCompiler$.searchClasspathAndLookup(MixedAnalyzingCompiler.scala:166)
at sbt.compiler.MixedAnalyzingCompiler$.apply(MixedAnalyzingCompiler.scala:176)
at sbt.compiler.IC$.incrementalCompile(IncrementalCompiler.scala:138)
at sbt.Compiler$.compile(Compiler.scala:152)
at sbt.Compiler$.compile(Compiler.scala:138)
at sbt.Defaults$.sbt$Defaults$$compileIncrementalTaskImpl(Defaults.scala:860)
at sbt.Defaults$$anonfun$compileIncrementalTask$1.apply(Defaults.scala:851)
at sbt.Defaults$$anonfun$compileIncrementalTask$1.apply(Defaults.scala:849)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
at sbt.std.Transform$$anon$4.work(System.scala:63)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
at sbt.Execute.work(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
at java.base/java.lang.Thread.run(Thread.java:844)
[error] (compile:compileIncremental) java.lang.NullPointerException
Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore? l
[info] Loading project definition from /Users/himanshugupta/spark/flint-master/project
[debug] Forcing garbage collection...
java.lang.NullPointerException
at java.base/java.util.regex.Matcher.getTextLength(Matcher.java:1769)
at java.base/java.util.regex.Matcher.reset(Matcher.java:416)
at java.base/java.util.regex.Matcher.<init>(Matcher.java:253)
at java.base/java.util.regex.Pattern.matcher(Pattern.java:1147)
at java.base/java.util.regex.Pattern.split(Pattern.java:1264)
at java.base/java.util.regex.Pattern.split(Pattern.java:1335)
at sbt.IO$.pathSplit(IO.scala:797)
at sbt.IO$.parseClasspath(IO.scala:912)
at sbt.compiler.CompilerArguments.extClasspath(CompilerArguments.scala:66)
at sbt.compiler.MixedAnalyzingCompiler$.withBootclasspath(MixedAnalyzingCompiler.scala:188)
at sbt.compiler.MixedAnalyzingCompiler$.searchClasspathAndLookup(MixedAnalyzingCompiler.scala:166)
at sbt.compiler.MixedAnalyzingCompiler$.apply(MixedAnalyzingCompiler.scala:176)
at sbt.compiler.IC$.incrementalCompile(IncrementalCompiler.scala:138)
at sbt.Compiler$.compile(Compiler.scala:152)
at sbt.Compiler$.compile(Compiler.scala:138)
at sbt.Defaults$.sbt$Defaults$$compileIncrementalTaskImpl(Defaults.scala:860)
at sbt.Defaults$$anonfun$compileIncrementalTask$1.apply(Defaults.scala:851)
at sbt.Defaults$$anonfun$compileIncrementalTask$1.apply(Defaults.scala:849)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
at sbt.std.Transform$$anon$4.work(System.scala:63)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
at sbt.Execute.work(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
at java.base/java.lang.Thread.run(Thread.java:844)
[error] (compile:compileIncremental) java.lang.NullPointerException
[debug] > load-failed
[debug] > last
Any idea what's causing this and how I can fix it?
Thanks
I no longer see cookbook.ipynb. Can we get it re-added?
Hi all:
I found that leftJoin generates a df that is smaller than the left df:
[In] [1]: joined_flint = left_flint.leftJoin(right_flint, tolerance=tolerance, key=by)
[In] [2]: print (joined_flint.count() < left_flint.count())
True
I consider this an incorrect result, since a left join should not drop any rows from the left table.
Any explanation or suggestions?
Whenever I try to use Flint here locally (no Hadoop/EMR involved), it keeps barfing at me with the error message in the subject. It's a setup on top of Python 3.7 with PySpark 2.4.4 and OpenJDK 8, on an Ubuntu 19.04 install.
Note: As I'm running locally only, I'm getting this log message from Spark, but everything does run perfectly using vanilla PySpark:
19/10/23 09:59:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
It happens when I try to read a PySpark dataframe into a ts.flint.TimeSeriesDataFrame. This example is adapted from the Flint Example.ipynb:
import pyspark
import ts.flint
from ts.flint import FlintContext
sc = pyspark.SparkContext('local', 'Flint Example')
spark = pyspark.sql.SparkSession(sc)
flint_context = FlintContext(spark)
sp500 = (
spark.read
.option('header', True)
.option('inferSchema', True)
.csv('sp500.csv')
.withColumnRenamed('Date', 'time')
)
sp500 = flint_context.read.dataframe(sp500)
The last line causes the "boom", with this (first part of the) stack trace:
TypeError Traceback (most recent call last)
~/.virtualenvs/pyspark-test/lib/python3.7/site-packages/ts/flint/java.py in new_reader(self)
37 try:
---> 38 return utils.jvm(self.sc).com.twosigma.flint.timeseries.io.read.TSReadBuilder()
39 except TypeError:
TypeError: 'JavaPackage' object is not callable
Any ideas what may be going wrong and how the problem could be solved?
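One common cause of TypeError: 'JavaPackage' object is not callable is that the Flint JVM classes are not on Spark's classpath. A hedged sketch (the jar path is an assumption) of one way to rule that out is to point spark.jars and spark.submit.pyFiles at the assembly jar before creating the context:
import pyspark

FLINT_JAR = "/path/to/flint-assembly-0.6.0-SNAPSHOT.jar"  # assumed location

conf = (
    pyspark.SparkConf()
    .setMaster("local")
    .setAppName("Flint Example")
    .set("spark.jars", FLINT_JAR)              # JVM side of Flint
    .set("spark.submit.pyFiles", FLINT_JAR)    # Python side of Flint
)
sc = pyspark.SparkContext(conf=conf)
spark = pyspark.sql.SparkSession(sc)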
Hi,
I've downloaded and built flint with sbt.
I then copied and pasted the .jar from the target folder into spark/jars.
But this creates a conflict with com.fasterxml when I try your first example code (CSV).
The error: com.fasterxml.jackson.databind.JsonMappingException: Jackson version is too old 2.3.2
Do you know why?
Thanks,
I created an experimental summarizer to try to understand how it works: https://github.com/soloman817/flint/blob/feature/experiment/src/test/scala/com/twosigma/flint/timeseries/experiment/AccumulateSummarizerSpec.scala
There I print information when the methods of a summarizer are called, such as add, merge, etc.
I found that if I call TimeSeriesRDD.summarize(), then subtract will not be called, and if I have multiple partitions, merge will be called. This means aggregation over all rows runs in parallel and is eventually merged into the final result.
But if I run the summarizer with TimeSeriesRDD.summarizeWindow(), then for the window aggregation, add and subtract will be called, but not merge, which means that inside one window it is not parallel.
Am I right? This knowledge will be very helpful for my implementation of our problem, which is an extension of that experimental code.
Thanks,
Xiang.
Hello, I'm working on Time Series Clustering. Is there any module in ts-flint for time series clustering?
I have tried the following ways:
df = fc.read.option("timeColumn","ds").option('isSorted', False).dataframe(spark.read.parquet('/test/SALECOUNT_OUT'))
df = fc.read.option("timeColumn","ds").option('isSorted', False).parquet('/test/SALECOUNT_OUT')
df = fc.read.option('isSorted', False).parquet('/test/SALECOUNT_OUT', time_column='ds')
They always throw this error:
Py4JJavaError: An error occurred while calling o91.canonizeTime.
: java.lang.IllegalArgumentException: Field "time" does not exist.
Available fields: store_id, product_id, store_product_id, sale_count, ds
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:273)
at com.twosigma.flint.timeseries.TimeSeriesRDD$.canonizeTime(TimeSeriesRDD.scala:123)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
During handling of the above exception, another exception occurred:
IllegalArgumentException Traceback (most recent call last)
<ipython-input-3-95bfec85ec04> in <module>
----> 1 df = fc.read.option("timeColumn", "ds").parquet('/test/SALECOUNT_OUT')
~/.conda/envs/py3/lib/python3.7/site-packages/ts/flint/readwriter.py in parquet(self, *paths)
399 """
400 df = self._sqlContext.read.parquet(*paths)
--> 401 return self.dataframe(df)
402
403 def _reconcile_reader_args(self, begin=None, end=None, timezone='UTC',
~/.conda/envs/py3/lib/python3.7/site-packages/ts/flint/readwriter.py in dataframe(self, df, begin, end, timezone, is_sorted, time_column, unit)
362 time_column=time_column,
363 is_sorted=is_sorted,
--> 364 unit=self._parameters.timeUnitString())
365
366 def parquet(self, *paths):
~/.conda/envs/py3/lib/python3.7/site-packages/ts/flint/dataframe.py in _from_df(df, time_column, is_sorted, unit)
248 time_column=time_column,
249 is_sorted=is_sorted,
--> 250 unit=unit)
251
252 @staticmethod
~/.conda/envs/py3/lib/python3.7/site-packages/ts/flint/dataframe.py in __init__(self, df, sql_ctx, time_column, is_sorted, unit, tsrdd_part_info)
133 # throw exception
134 if time_column in df.columns:
--> 135 self._jdf = self._jpkg.TimeSeriesRDD.canonizeTime(self._jdf, self._junit)
136
137 if tsrdd_part_info:
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
77 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
78 if s.startswith('java.lang.IllegalArgumentException: '):
---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
80 raise
81 return deco
IllegalArgumentException: 'Field "time" does not exist.\nAvailable fields: store_id, product_id, store_product_id, sale_count, ds'
ds is of timestamp type.
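As a hedged workaround sketch, one can sidestep the timeColumn option entirely by renaming ds to the canonical time column before handing the DataFrame to the FlintContext (the path and the fc/spark names are taken from the report above):
# Rename the timestamp column to the canonical 'time' name before reading it
# into Flint, instead of relying on the timeColumn option.
df = spark.read.parquet('/test/SALECOUNT_OUT').withColumnRenamed('ds', 'time')
flint_df = fc.read.option('isSorted', False).dataframe(df)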
Currently TimeSeriesDataFrame stores a time unit and a time column name internally. This is not consistent with the Scala implementation of TimeSeriesRDD and causes extra copying and complexity. We should just force the timeUnit to be nanoseconds and the timeColumn to be 'time'.
There is a trait for count-based windows, but it looks like it is not yet finished?
Could you provide an example of the SQLContext setup used in Weather.ipynb and Flint Example.ipynb?
I get TypeError: 'JavaPackage' object is not callable when trying some SQLContext setups (although I believe this is a configuration issue on my side).
Thanks!
I saw this code in overlapped RDD implementation: https://github.com/twosigma/flint/blob/master/src/main/scala/com/twosigma/flint/rdd/OverlappedOrderedRDD.scala#L50
Then I read about what a narrow dependency means: https://github.com/rohgar/scala-spark-4/wiki/Wide-vs-Narrow-Dependencies , where it says: "Narrow dependencies: Each partition of the parent RDD is used by at most one partition of the child RDD."
But in the flint implementation, the OverlappedOrderedRDD is the child RDD and the OrderedRDD is its parent, and one partition of the parent will be used by multiple partitions of the child OverlappedOrderedRDD, because of the overlapping.
So, what am I missing to understand this?
Thanks in advance, Xiang
Thank you for this amazing library!
I'm running Spark 2.2.0 and tried to initialize a clock:
clock = clocks.uniform(sqlContext, frequency="1day")
This threw an exception:
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.twosigma.flint.timeseries.Clocks.uniform.
: java.lang.NoSuchMethodError: org.apache.spark.sql.SparkSession.internalCreateDataFrame$default$3()Z
at org.apache.spark.sql.DFConverter$.toDataFrame(DFConverter.scala:42)
at com.twosigma.flint.timeseries.clock.Clock.asTimeSeriesRDD(Clock.scala:148)
at com.twosigma.flint.timeseries.Clocks$.uniform(Clocks.scala:54)
at com.twosigma.flint.timeseries.Clocks.uniform(Clocks.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
NoSuchMethodError: org.apache.spark.sql.SparkSession.internalCreateDataFrame$default$3()Z
I found this method in the docs: https://spark.apache.org/docs/preview/api/java/org/apache/spark/sql/SparkSession.html#internalCreateDataFrame(org.apache.spark.rdd.RDD,%20org.apache.spark.sql.types.StructType)
However, it seems it's not available in the version I'm running? Can you please provide a hint on how to resolve this issue? Thanks!
Running sbt assembly, compilation fails due to a missing slf4j dependency. It was resolved by adding:
"org.slf4j" % "slf4j-simple" % "1.6.4"
to libraryDependencies, which allowed it to compile.
Though I had no problems when I ran make dist.
@icexelloss hi there!
I'm glad the issues are being (pro)actively monitored and attended to, I wasn't expecting that.
Here is one issue I'm facing, it's not a big one, but an inconvenient one:
print( sc.version )
print( tm )
n = df.filter( df['Container'] == 'dbc94d4e3af6' ).select( tm, 'MemPercentG', 'CpuPercentG' )
n.show( truncate = False )
n.printSchema()
from ts.flint import FlintContext, clocks
from ts.flint import utils
fc = FlintContext( sqlContext )
r = fc.read \
.option('isSorted', False) \
.option('timeUnit', 's') \
.option('timeColumn', tm) \
.dataframe( n )
The output is:
2.3.1
TimeStamp
+-------------------+---------------+------------+
|TimeStamp |MemPercentG |CpuPercentG |
+-------------------+---------------+------------+
|2018-08-01 05:55:35|0.0030517578125|0.002331024 |
|2018-08-01 05:58:05|0.0030517578125|0.0031538776|
|2018-08-01 05:59:05|0.0030517578125|0.0030176123|
+-------------------+---------------+------------+
root
|-- TimeStamp: timestamp (nullable = true)
|-- MemPercentG: double (nullable = true)
|-- CpuPercentG: float (nullable = true)
IllegalArgumentException: 'Field "time" does not exist.\nAvailable fields: TimeStamp, MemPercentG, CpuPercentG'
---------------------------------------------------------------------------
IllegalArgumentException Traceback (most recent call last)
<command-911439891027714> in <module>()
14 fc = FlintContext( sqlContext )
15
---> 16 r = fc.read .option('isSorted', False) .option('timeUnit', 's') .option('timeColumn', tm) .dataframe( n )
17
18
/databricks/python/lib/python3.5/site-packages/ts/flint/readwriter.py in dataframe(self, df, begin, end, timezone, is_sorted, time_column, unit)
362 time_column=time_column,
363 is_sorted=is_sorted,
--> 364 unit=self._parameters.timeUnitString())
365
366 def parquet(self, *paths):
/databricks/python/lib/python3.5/site-packages/ts/flint/dataframe.py in _from_df(df, time_column, is_sorted, unit)
248 time_column=time_column,
249 is_sorted=is_sorted,
--> 250 unit=unit)
251
252 @staticmethod
/databricks/python/lib/python3.5/site-packages/ts/flint/dataframe.py in __init__(self, df, sql_ctx, time_column, is_sorted, unit, tsrdd_part_info)
133 # throw exception
134 if time_column in df.columns:
--> 135 self._jdf = self._jpkg.TimeSeriesRDD.canonizeTime(self._jdf, self._junit)
136
137 if tsrdd_part_info:
/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
77 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
78 if s.startswith('java.lang.IllegalArgumentException: '):
---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
80 raise
81 return deco
IllegalArgumentException: 'Field "time" does not exist.\nAvailable fields: TimeStamp, MemPercentG, CpuPercentG'
Is this reproducible for you?
Please advise if I'm not using/calling it correctly or if it's a bug.
I installed the flint libraries (the Scala and the Python ones) on Databricks via its UI (from the respective online repos, which might be dated); I can try installing the latest builds from the freshest source code if you think that will help.
Thanks.
(Even though the "timeColumn" argument error can be bypassed by renaming the column in question to time), the leftJoin is not working for me:
print( sc.version )
print( tm )
n = df.filter( df['Container'] == 'dbc94d4e3af6' ).select( tm, 'MemPercentG', 'CpuPercentG' )
from ts.flint import FlintContext, clocks
from ts.flint import utils
fc = FlintContext( sqlContext )
r = fc.read \
.option('isSorted', False) \
.option('timeUnit', 's') \
.dataframe( n )
r.show(truncate=False)
print( r )
r.printSchema()
l = clocks.uniform(fc, '30s', begin_date_time='2018-8-1 5:55:35', end_date_time='2018-08-01 05:59:05')
print( type( l ) )
print( l )
l.printSchema()
# l.show(truncate=False)
j = l.leftJoin( r )
With the output being:
2.3.1
time
+-------------------+---------------+------------+
|time |MemPercentG |CpuPercentG |
+-------------------+---------------+------------+
|2018-08-01 05:55:35|0.0030517578125|0.002331024 |
|2018-08-01 05:58:05|0.0030517578125|0.0031538776|
|2018-08-01 05:59:05|0.0030517578125|0.0030176123|
+-------------------+---------------+------------+
TimeSeriesDataFrame[time: timestamp, MemPercentG: double, CpuPercentG: float]
root
|-- time: timestamp (nullable = true)
|-- MemPercentG: double (nullable = true)
|-- CpuPercentG: float (nullable = true)
<class 'ts.flint.dataframe.TimeSeriesDataFrame'>
TimeSeriesDataFrame[time: timestamp]
root
|-- time: timestamp (nullable = true)
java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.codegen.ExprCode.value()Ljava/lang/String;
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<command-911439891027714> in <module>()
22 l.printSchema()
23 # l.show(truncate=False)
---> 24 j = l.leftJoin( r )
25
26
/databricks/python/lib/python3.5/site-packages/ts/flint/dataframe.py in leftJoin(self, right, tolerance, key, left_alias, right_alias)
606 tolerance = self._timedelta_ns('tolerance', tolerance, default='0ns')
607 scala_key = utils.list_to_seq(self._sc, key)
--> 608 tsrdd = self.timeSeriesRDD.leftJoin(right.timeSeriesRDD, tolerance, scala_key, left_alias, right_alias)
609 return TimeSeriesDataFrame._from_tsrdd(tsrdd, self.sql_ctx)
610
/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o3324.leftJoin.
: java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.codegen.ExprCode.value()Ljava/lang/String;
at org.apache.spark.sql.TimestampCast$class.doGenCode(TimestampCast.scala:77)
at org.apache.spark.sql.NanosToTimestamp.doGenCode(TimestampCast.scala:31)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:111)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:108)
at org.apache.spark.sql.catalyst.expressions.Alias.genCode(namedExpressions.scala:143)
at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext$$anonfun$24$$anonfun$apply$5.apply(CodeGenerator.scala:1367)
at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext$$anonfun$24$$anonfun$apply$5.apply(CodeGenerator.scala:1366)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext$$anonfun$24.apply(CodeGenerator.scala:1366)
at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext$$anonfun$24.apply(CodeGenerator.scala:1366)
at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.withSubExprEliminationExprs(CodeGenerator.scala:1227)
at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.generateExpressionsForWholeStageWithCSE(CodeGenerator.scala:1365)
at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:67)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:181)
at org.apache.spark.sql.execution.RDDScanExec.consume(ExistingRDD.scala:176)
at org.apache.spark.sql.execution.RDDScanExec.doProduce(ExistingRDD.scala:221)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:88)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$5.apply(SparkPlan.scala:190)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:187)
at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:83)
at org.apache.spark.sql.execution.RDDScanExec.produce(ExistingRDD.scala:176)
at org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:50)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:88)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$5.apply(SparkPlan.scala:190)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:187)
at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:83)
at org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:40)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:530)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:582)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:150)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:138)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$5.apply(SparkPlan.scala:190)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:187)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:108)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:108)
at com.twosigma.flint.timeseries.NormalizedDataFrameStore.toOrderedRdd(TimeSeriesStore.scala:252)
at com.twosigma.flint.timeseries.NormalizedDataFrameStore.orderedRdd(TimeSeriesStore.scala:237)
at com.twosigma.flint.timeseries.TimeSeriesRDDImpl.orderedRdd(TimeSeriesRDD.scala:1346)
at com.twosigma.flint.timeseries.TimeSeriesRDDImpl.leftJoin(TimeSeriesRDD.scala:1515)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
By the way, what is the way to even display the contents of the clocks DataFrame?
The second-to-last, commented-out line (with the .show command) errors out, so I don't understand how TimeSeriesDataFrame is inheriting from a regular DataFrame, for which that method is available...
The display method also fails...
Anyway, what is wrong with the leftJoin here? The clock is on the left, as you indicated it should be. Swapping the left and right data frames also does not help.
Is this reproducible for you?
Please advise if I'm not using/calling it correctly or if it's a bug.
I installed the flint libraries (the Scala and the Python ones) on Databricks via its UI (from the respective online repos, which might be dated); I can try installing the latest builds from the freshest source code if you think that will help.
Thanks.
Hello, and thanks for this very interesting project. I wonder if someone can use it with sparklyr? Any ideas?
Thanks!
Running sbt assembly yields the following:
[info] - should correctly convert SQL TimestampType with default format *** FAILED *** (244 milliseconds)
[info] 1199232000000000000 did not equal 1199253600000000000 (CSVSpec.scala:100)
Now, 1199232000000000000, the result of first.getAs[Long]("time"), where first contains the string "2008-01-02 00:00:00", equals 2008-01-01 18:00:00 in my local time, which is in turn equal to "2008-01-02 00:00:00" UTC.
1199253600000000000, on the other hand, equals "2008-01-02 00:00:00" in my local time, and since its timezone was extracted from format.setTimeZone(TimeZone.getDefault) (my computer resides in Mexico City, UTC-6), I believe it is being interpreted as "2008-01-02 00:00:00 UTC-6".
tl;dr: I believe the test is reading "2008-01-02 00:00:00" from the CSV assuming UTC, while the expected value is computed by simply applying my default timezone to the hard-coded string "2008-01-02 00:00:00".
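A quick check of the two nanosecond timestamps in the failing assertion (pure arithmetic, no Flint involved) backs this up:
from datetime import datetime, timezone

for ns in (1199232000000000000, 1199253600000000000):
    print(datetime.fromtimestamp(ns / 1e9, tz=timezone.utc))
# 2008-01-02 00:00:00+00:00  -> the value the test produced (string parsed as UTC)
# 2008-01-02 06:00:00+00:00  -> the expected value, i.e. midnight in UTC-6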
The Flint Python API doesn't check argument types currently, which leads to a bad user experience.
For instance, summarizer.linearRegression takes an arg "xcols" as a list of strings; if the user passes a string, it throws a confusing exception:
Py4JError: An error occurred while calling z:com.twosigma.huohua.timeseries.summarize.Summary.linearRegression. Trace:
py4j.Py4JException: Method linearRegression([class java.lang.String, class java.lang.String, class java.lang.String, class java.lang.Boolean]) does not exist
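A hedged sketch of the distinction (the function and argument spellings below are assumptions and may differ across versions; the point is only that the x columns must be passed as a list of strings, not a bare string; flint_df is an assumed TimeSeriesDataFrame):
from ts.flint import summarizers

# Passing a list of column names round-trips to the Java side correctly.
ok = flint_df.summarize(summarizers.linear_regression('y', ['x1', 'x2']))

# Passing a bare string makes py4j look for a Java overload that does not
# exist, which surfaces as the confusing Py4JError quoted above.
bad = flint_df.summarize(summarizers.linear_regression('y', 'x1'))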
Hi,
Would it be possible to update the documentation and other metadata for ts-flint published on PyPI?
https://pypi.org/project/ts-flint/
There are tools which parse this information from PyPI, e.g. automated license checkers.
Thanks,
Rohan
Ubuntu 18.04 with Java 8. Could you please help?
I get an NPE when I run:
git clone https://github.com/twosigma/flint.git
cd flint
sbt assemblyNoTest
stacktrace is:
(base) jcarlson@smaaetlinuxsrv04:~/ts_flint/flint$ sbt assemblyNoTest
Picked up _JAVA_OPTIONS: -Xmx64g
Picked up _JAVA_OPTIONS: -Xmx64g
[info] Loading project definition from /home/jcarlson/ts_flint/flint/project
[info] Updating {file:/home/jcarlson/ts_flint/flint/project/}flint-build...
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by sbt.ivyint.ErrorMessageAuthenticator$ (file:/home/jcarlson/.sbt/boot/scala-2.10.6/org.scala-sbt/sbt/0.13.11/ivy-0.13.11.jar) to field java.net.Authenticator.theAuthenticator
WARNING: Please consider reporting this to the maintainers of sbt.ivyint.ErrorMessageAuthenticator$
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
java.lang.NullPointerException
at java.base/java.util.regex.Matcher.getTextLength(Matcher.java:1770)
at java.base/java.util.regex.Matcher.reset(Matcher.java:416)
at java.base/java.util.regex.Matcher.<init>(Matcher.java:253)
at java.base/java.util.regex.Pattern.matcher(Pattern.java:1133)
at java.base/java.util.regex.Pattern.split(Pattern.java:1261)
at java.base/java.util.regex.Pattern.split(Pattern.java:1334)
at sbt.IO$.pathSplit(IO.scala:797)
at sbt.IO$.parseClasspath(IO.scala:912)
at sbt.compiler.CompilerArguments.extClasspath(CompilerArguments.scala:66)
at sbt.compiler.MixedAnalyzingCompiler$.withBootclasspath(MixedAnalyzingCompiler.scala:188)
at sbt.compiler.MixedAnalyzingCompiler$.searchClasspathAndLookup(MixedAnalyzingCompiler.scala:166)
at sbt.compiler.MixedAnalyzingCompiler$.apply(MixedAnalyzingCompiler.scala:176)
at sbt.compiler.IC$.incrementalCompile(IncrementalCompiler.scala:138)
at sbt.Compiler$.compile(Compiler.scala:152)
at sbt.Compiler$.compile(Compiler.scala:138)
at sbt.Defaults$.sbt$Defaults$$compileIncrementalTaskImpl(Defaults.scala:860)
at sbt.Defaults$$anonfun$compileIncrementalTask$1.apply(Defaults.scala:851)
at sbt.Defaults$$anonfun$compileIncrementalTask$1.apply(Defaults.scala:849)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
at sbt.std.Transform$$anon$4.work(System.scala:63)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
at sbt.Execute.work(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
[error] (compile:compileIncremental) java.lang.NullPointerException
Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore? q
val tsRdd = TimeSeriesRDD.fromDF(dataFrame = df)(isSorted = true, timeUnit = MILLISECONDS)
throws
scala> val tsRdd = TimeSeriesRDD.fromDF(dataFrame = cellFeed)(isSorted = true, timeUnit = MILLISECONDS)
java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.plans.physical.ClusteredDistribution$.apply$default$2()Lscala/Option;
at com.twosigma.flint.timeseries.TimeSeriesStore$.isClustered(TimeSeriesStore.scala:149)
at com.twosigma.flint.timeseries.TimeSeriesStore$.apply(TimeSeriesStore.scala:64)
at com.twosigma.flint.timeseries.TimeSeriesRDD$.fromDFWithPartInfo(TimeSeriesRDD.scala:509)
at com.twosigma.flint.timeseries.TimeSeriesRDD$.fromDF(TimeSeriesRDD.scala:304)
... 52 elided
on Spark 2.2 when trying to create the initial RDD.
Minimal reproducible sample:
import spark.implicits._
import com.twosigma.flint.timeseries.TimeSeriesRDD
import scala.concurrent.duration._
val df = Seq((1, 1, 1L), (2, 3, 1L), (1, 4, 2L), (2, 2, 2L)).toDF("id", "value", "time")
val tsRdd = TimeSeriesRDD.fromDF(dataFrame = df)(isSorted = true, timeUnit = MILLISECONDS)
on Spark 2.2 via HDP 2.6.4.
I would like to extend the window capabilities and would like to discuss how best to implement them. Considering the existing functionality, we can do:
val result = priceTSRdd.addWindows(Window.pastAbsoluteTime("1000ns"))
// time price window_past_1000ns
// ------------------------------------------------------
// 1000L 1.0 [[1000L, 1.0]]
// 1500L 2.0 [[1000L, 1.0], [1500L, 2.0]]
// 2000L 3.0 [[1000L, 1.0], [1500L, 2.0], [2000L, 3.0]]
// 2500L 4.0 [[1500L, 2.0], [2000L, 3.0], [2500L, 4.0]]
Windows at predefined timestamps only
This creates a window at each row, looking backward. For very "dense" time series with samples at nanosecond scale, we might not want a window at every observation, but instead run some statistic or other discovery method to find the points at which we want to create a window.
Windows of varying length
In the trading world we can imagine windows of varying time length, e.g. determined by a "volume clock".
Windows with a fixed number of observations. I saw a count window, but it is not clear to me how to use it (a rough workaround sketch follows below).
Two-segment windows.
The first segment (section) of a window could be used to calculate some online statistics, which are then consumed by a summary function applied over the second part of the window (adaptive summary stats, e.g. thresholds based on an online volatility estimator).
How would these more general features best be implemented? Any advice on how to add these extensions? Happy to contribute as well.
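For the fixed-count case, a rough workaround sketch that stays within the addWindows API shown above: keep the time-based window and post-process the resulting window column down to the last n entries. priceTSRdd, Window.pastAbsoluteTime and the window_past_1000ns column name are taken from the example above; n = 2 and the mean over the price field are purely illustrative assumptions, not flint API.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
// Approximate a fixed-count window by trimming the array produced by addWindows to the
// last `n` observations, then summarizing it (here: a mean of the assumed "price" field).
val n = 2
val meanOfLastN = udf { (window: Seq[Row]) =>
  val lastN = window.takeRight(n)
  if (lastN.isEmpty) Double.NaN
  else lastN.map(_.getAs[Double]("price")).sum / lastN.size
}
val windowed = priceTSRdd.addWindows(Window.pastAbsoluteTime("1000ns"))
val withCountWindowMean = windowed.toDF
  .withColumn("mean_price_last_2", meanOfLastN(col("window_past_1000ns")))
A built-in count window or a two-segment summarizer would avoid materializing the full window column, so this sketch is only meant to frame the discussion, not to replace proper support in the library.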
When running sbt assembly, I get an error that looks like it's picking up an ambient timezone somewhere in my environment:
[info] UniformClock
[info] - should generate clock ticks correctly (4 milliseconds)
[info] - should generate clock ticks in RDD correctly (90 milliseconds)
[info] - should generate clock ticks in TimeSeriesRDD correctly (164 milliseconds)
[info] - should generate clock ticks with offset in TimeSeriesRDD correctly (74 milliseconds)
[info] - should generate clock ticks with offset & time zone in TimeSeriesRDD correctly (75 milliseconds)
[info] - should generate clock ticks with default in TimeSeriesRDD correctly (172 milliseconds)
[info] - should generate timestamp correctly *** FAILED *** (160 milliseconds)
[info] 1989-12-31 18:00:00.0 did not equal 1990-01-01 00:00:00.0 (ClockSpec.scala:85)
Is this a known issue, or perhaps a known problem with my setup?
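For what it's worth, the six-hour offset above is consistent with a US Central local timezone. One hedged workaround sketch, assuming ClockSpec's expected timestamps are UTC-based, is to run the tests in a forked JVM pinned to UTC via build.sbt (fork and javaOptions are standard sbt 0.13 keys; whether this is the right fix for the spec itself is an assumption):
// build.sbt (sketch): fork the test JVM and pin its default timezone to UTC so the
// ambient OS timezone cannot influence java.sql.Timestamp formatting in the clock tests.
fork in Test := true
javaOptions in Test += "-Duser.timezone=UTC"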
I am getting a different error from issue #24.
Can you please let me know where I am going wrong? The error follows.
~/flint-master$ sbt assemblyNoTest
[ERROR] Failed to construct terminal; falling back to unsupported
java.lang.NumberFormatException: For input string: "0x100"
at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.base/java.lang.Integer.parseInt(Integer.java:652)
at java.base/java.lang.Integer.valueOf(Integer.java:983)
at jline.internal.InfoCmp.parseInfoCmp(InfoCmp.java:59)
at jline.UnixTerminal.parseInfoCmp(UnixTerminal.java:233)
at jline.UnixTerminal.<init>(UnixTerminal.java:64)
at jline.UnixTerminal.<init>(UnixTerminal.java:49)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:488)
at java.base/java.lang.Class.newInstance(Class.java:560)
at jline.TerminalFactory.getFlavor(TerminalFactory.java:209)
at jline.TerminalFactory.create(TerminalFactory.java:100)
at jline.TerminalFactory.get(TerminalFactory.java:184)
at jline.TerminalFactory.get(TerminalFactory.java:190)
at sbt.ConsoleLogger$.ansiSupported(ConsoleLogger.scala:123)
at sbt.ConsoleLogger$.<init>(ConsoleLogger.scala:117)
at sbt.ConsoleLogger$.<clinit>(ConsoleLogger.scala)
at sbt.GlobalLogging$.initial(GlobalLogging.scala:43)
at sbt.StandardMain$.initialGlobalLogging(Main.scala:61)
at sbt.StandardMain$.initialState(Main.scala:70)
at sbt.xMain.run(Main.scala:29)
at xsbt.boot.Launch$$anonfun$run$1.apply(Launch.scala:109)
at xsbt.boot.Launch$.withContextLoader(Launch.scala:128)
at xsbt.boot.Launch$.run(Launch.scala:109)
at xsbt.boot.Launch$$anonfun$apply$1.apply(Launch.scala:35)
at xsbt.boot.Launch$.launch(Launch.scala:117)
at xsbt.boot.Launch$.apply(Launch.scala:18)
at xsbt.boot.Boot$.runImpl(Boot.scala:56)
at xsbt.boot.Boot$.main(Boot.scala:18)
at xsbt.boot.Boot.main(Boot.scala)
I am able to convert a Spark DataFrame to a flint DataFrame, and applying the summarizeWindows function also works.
However, after applying any flint function I can no longer access the resulting DataFrame: I can't convert it back to a Spark DataFrame, and I can't save the result to a file.
Even flint_df.show() or spark_df.show() raises an error. The function itself appears to run correctly, but I can't access the result DataFrame.
See above.
It would be great if a version for Scala 2.10 were also available.
We are able to test the Python bindings using an internal library that creates a SparkContext. We need a way to run the flint tests with pyspark in OSS.
I have a TimeSeriesRDD object, and I know that internally it is sorted by timestamp. What is an efficient way to get the start time and the last time? There is a TimeSeriesRDD.first which returns the first row, so I can get the start time. But how do I get the last row efficiently?
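This is not a built-in flint helper, just a sketch under stated assumptions: tsRdd is a TimeSeriesRDD like the ones constructed above, with its timestamps in the time column. TimeSeriesRDD.first gives the start time; for the end time, a plain max aggregation scans the data once but avoids any shuffle or re-sort.
import org.apache.spark.sql.functions.max
// First timestamp via TimeSeriesRDD.first, as noted above; last timestamp via a max
// aggregation over the time column (single pass, no sort, since max needs no ordering).
val firstTime: Long = tsRdd.first().getAs[Long]("time")
val lastTime: Long = tsRdd.toDF.agg(max("time")).head().getLong(0)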
How can I join not only by time but also by a column?
Currently, I get: Found duplicate columns, but I would like to perform the time series join per group.
val left = Seq((1,1L, 0.1), (1, 2L,0.2), (3,1L,0.3), (3, 2L,0.4)).toDF("group", "time", "valueA")
val right = Seq((1,1L, 11), (1, 2L,12), (3,1L,13), (3, 2L,14)).toDF("group", "time", "valueB")
val leftTs = TimeSeriesRDD.fromDF(dataFrame = left)(isSorted = false, timeUnit = MILLISECONDS)
val rightTS = TimeSeriesRDD.fromDF(dataFrame = right)(isSorted = false, timeUnit = MILLISECONDS)
val mergedPerGroup = leftTs.leftJoin(rightTS, tolerance = "1s")
fails due to duplicate columns.
When renaming the columns:
val left = Seq((1,1L, 0.1), (1, 2L,0.2), (3,1L,0.3), (3, 2L,0.4)).toDF("groupA", "time", "valueA")
val right = Seq((1,1L, 11), (1, 2L,12), (3,1L,13), (3, 2L,14)).toDF("groupB", "time", "valueB")
val leftTs = TimeSeriesRDD.fromDF(dataFrame = left)(isSorted = false, timeUnit = MILLISECONDS)
val rightTS = TimeSeriesRDD.fromDF(dataFrame = right)(isSorted = false, timeUnit = MILLISECONDS)
val mergedPerGroup = leftTs.leftJoin(rightTS, tolerance = "1s")
mergedPerGroup.toDF.printSchema
mergedPerGroup.toDF.show
+-------+------+------+------+------+
| time|groupA|valueA|groupB|valueB|
+-------+------+------+------+------+
|1000000| 1| 0.1| 3| 13|
|1000000| 3| 0.3| 3| 13|
|2000000| 1| 0.2| 3| 14|
|2000000| 3| 0.4| 3| 14|
+-------+------+------+------+------+
a cross join is performed between each group and the time series, which then needs to be manually reduced:
mergedPerGroup.toDF.filter(col("groupA") === col("groupB")).show
+-------+------+------+------+------+
| time|groupA|valueA|groupB|valueB|
+-------+------+------+------+------+
|1000000| 3| 0.3| 3| 13|
|2000000| 3| 0.4| 3| 14|
+-------+------+------+------+------+
Is there any built-in functionality to perform this type of join more efficiently?
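A hedged sketch of what seems to be wanted here, using the original DataFrames in which both sides name the column group. The assumption (please verify against the TimeSeriesRDD API for your flint version) is that leftJoin also accepts a key argument listing columns that must match exactly in addition to the time tolerance; if that parameter is not available, the filter-based workaround above remains the fallback.
// Assumption: leftJoin takes a `key` parameter restricting matches to rows whose listed
// columns are equal, so the per-group temporal join is expressed directly.
val mergedPerGroup = leftTs.leftJoin(rightTS, tolerance = "1s", key = Seq("group"))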
The clock utility is very handy sometimes, but it is not public to the outside world.
Is there any reason to hide this?
Does this library currently work with Spark 2.4?
I have been using the Summarizers (mean, count, max, min) from the Python API in the last couple of weeks. I use them jointly with a clock, which gives me uniform timestamps for one-minute bars. Now I was wondering if it is possible to have a summarizer in Python which gives me the close value (the last value of the interval). I am dealing with price data and I saw that the Scala implementation of the summarizers has the "close" functionality. Is it possible to expose this summarizer in the Python API?
Thank you very much for your support
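A hedged plain-Spark workaround, sketched in Scala since that is what the other examples in this thread use (the same approach works from PySpark): bucket rows into one-minute bars and take, within each bar, the value carried by the row with the greatest timestamp, i.e. the close. The time column in nanoseconds and a price column are assumptions matching the earlier examples; this is not the Scala close summarizer itself.
import org.apache.spark.sql.functions.{col, max, struct}
// Group rows into one-minute bars and keep, per bar, the struct with the largest time;
// structs compare field by field, so putting `time` first makes max pick the latest row.
val nanosPerMinute = 60L * 1000L * 1000L * 1000L
val closes = priceTSRdd.toDF
  .groupBy((col("time") / nanosPerMinute).cast("long").as("bar"))
  .agg(max(struct(col("time"), col("price"))).as("lastRow"))
  .select(col("bar"), col("lastRow.price").as("close"))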