asardaes / dtwclust

R Package for Time Series Clustering Along with Optimizations for DTW

Home Page: https://cran.r-project.org/package=dtwclust

License: GNU General Public License v3.0

R 73.84% C 0.02% C++ 16.88% MATLAB 0.17% TeX 9.08%

clustering dtw time-series

dtwclust's Issues

issues with zoo data frames

Hi, I am testing dtwclust for clustering financial time series.
I find it very effective, but I am missing something about how it works.

I am using a multivariate dataset, which I have properly cleaned and preprocessed before using it as an input. In particular, I have:

replaced +/-Inf with NAs
interpolated the NAs that lie between two valid values (one before and one after the gap)
replaced the remaining NAs, which are not between two valid values, with zeros
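
A minimal base R sketch of the three preprocessing steps listed above, applied to one numeric series. The helper name `clean_series` is hypothetical and not part of dtwclust; it assumes linear interpolation is what was meant by "interpolated".

```r
# Hypothetical helper (not part of dtwclust): clean one numeric series.
clean_series <- function(x) {
  x[is.infinite(x)] <- NA                 # step 1: +/-Inf -> NA
  ok <- !is.na(x)
  if (sum(ok) >= 2L) {
    idx <- seq_along(x)
    # step 2: linear interpolation between valid values;
    # rule = 1 leaves leading and trailing NAs untouched
    x <- approx(idx[ok], x[ok], xout = idx, rule = 1)$y
  }
  x[is.na(x)] <- 0                        # step 3: remaining NAs -> 0
  x
}

cleaned <- clean_series(c(NA, 1, Inf, 3, NA, NA, 6, -Inf))
print(cleaned)
```

For a list of multivariate series, the same helper could be applied column by column to each element before normalization.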

The dataset consists of a list of 71 zoo data frames, each having 67 rows (time dimension, in quarters, equal for all 71 objects) and 13 columns (my variables, also equal for all 71 objects). Since the variables vary substantially in scale, I apply z-normalization to the dataset before using it as input to dtwclust.
My aim is to test the results of different algorithms, using different distance measures, clustering methods and centroid calculations. Here I show only the case study for GAK distances.
Below are my tests, the errors I get, and the questions I have for you.

distance = "gak", type = "hierarchical"
mvc_gak_h <- tsclust(mydata, k = 4L, type = "hierarchical", distance = "gak", seed = 390, args = tsclust_args(dist = list(sigma = 100)))
In this case I have no problems at all, and can easily plot the results in a dendrogram.

distance = "gak", type = "partitional", centroid = "pam"

mvc_gak_p_pam <- tsclust(mydata, k = 4L, type = "partitional", distance = "gak", seed = 392, centroid = "pam", args = tsclust_args(dist = list(sigma = 100)))
plot(mvc_gak_p_pam)
Error in vector(type, length) : 
  vector: cannot make a vector of mode 'NULL'.
In addition: There were 50 or more warnings (use warnings() to see the first 50)

In this case I get an error when I try to plot the clustering results, and I don't understand the error message. The clustering algorithm itself seems to work well. When I type mvc_gak_p_pam I obtain the following output:

partitional clustering with 4 clusters
Using gak distance
Using pam centroids
Time required for analysis:
   user  system elapsed 
   2.62    0.00    2.64 
Cluster sizes with average intra-cluster distance:
  size    av_dist
1   29 0.06585621
2   14 0.07254417
3   15 0.07099164
4   13 0.07744559

distance = "gak", type = "partitional", centroid = "dba"

 mvc_gak_p_dba <- tsclust(mydata, k = 4L, type = "partitional", distance = "gak", seed = 394, centroid = "dba", args = tsclust_args(dist = list(sigma = 100)))
plot(mvc_gak_p_dba)
Error in vector(type, length) : 
  vector: cannot make a vector of mode 'NULL'.
In addition: There were 50 or more warnings (use warnings() to see the first 50)

As in the previous case, I get an error when I try to plot the clustering results, and I don't understand the error message. Where exactly is the error? The clustering algorithm itself seems to work well. When I type mvc_gak_p_dba I obtain the following output:

partitional clustering with 4 clusters
Using gak distance
Using dba centroids
Time required for analysis:
   user  system elapsed 
   5.29    0.00    5.46 
Cluster sizes with average intra-cluster distance:
  size    av_dist
1   20 0.06968869
2    6 0.07500183
3   27 0.06434661
4   18 0.05622174

distance = "gak", type = "partitional", centroid = "shape"

mvc_gak_p_shape <- tsclust(mydata, k = 4L, type = "partitional", distance = "gak", seed = 396, centroid = "shape", args = tsclust_args(dist = list(sigma = 100)))
Error in { : task 1 failed - "task 1 failed - "indexes overlap""

In this case I can't even run the clustering algorithm, and I do not understand the error message.

Finally, I use the option args = tsclust_args(dist = list(sigma = 100)) only because I saw it in the "Multivariate time series" example in the dtwclust documentation (p. 46). I don't quite understand what it means, and I could not find a clear explanation either in online forums or in the documentation. I only noticed that the "av_dist" values of the clusters change depending on whether I use this option, but I don't see why. Would you please help me clarify these issues? Is there a better documented manual or website to consult?

Thanks,
Stefano

A question about 'pam' in dtwclust('fuzzy').

When 'pam' is used for centroid calculation, the center of each cluster is supposed to be one of the series in that cluster. However, when I tested this:

a<-list()
for (i in 1:5) {a[[i]]<-round(rnorm(10,5,5),digits = 0)}      
c<-dtwclust(a,'fuzzy',k=3,distance = 'dtw','pam',seed=100)
c@centers

The centers seem to be the means of the clusters rather than one of the series.
Additionally, when I tried k=2 in c<-dtwclust(a,'fuzzy',k=2,distance = 'dtw','pam',seed=100), the response is Error in apply(cluster[, -1L], 1L, sum) : dim(X) must have a positive length.
I don't understand why either of these happens.

A question about the Hierarchical Clustering in dtwclust()

When using hierarchical clustering in dtwclust(), as in result<-dtwclust(data,'hierarchical',distance = 'dtw'), does result@k always return '2', and does result@cluster always assign the time series to two clusters, even though the dendrogram has more than two bottom branches?
Or does it mean that the given time series data set is best divided into two clusters?
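
A short base R sketch of why the number of clusters is independent of the dendrogram's leaves: a hierarchical procedure builds one tree, and the flat clusters come from cutting that tree at a chosen k. This uses `stats::hclust()` and `stats::cutree()` as stand-ins on toy data; the statement that k defaults to 2 in dtwclust when it is not supplied is an assumption based on recollection of the package documentation, not something this page confirms.

```r
# Six toy one-dimensional points standing in for six series (illustrative only)
x <- matrix(c(0, 0.1, 5, 5.1, 10, 10.1), ncol = 1)

# One tree is built regardless of k
hc <- hclust(dist(x), method = "average")

# The flat partition depends only on where the tree is cut
two   <- cutree(hc, k = 2L)
three <- cutree(hc, k = 3L)

print(two)
print(three)
```

Under that assumption, a result with two clusters says nothing about two being the best partition; it only reflects the value of k used for the cut.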

How to add legend onto the dtwclust plot?

After running partitional clustering, I can't tell from the resulting plot which series belongs to which cluster. A legend or a table showing each cluster and the names of its members would make this much easier. How can I add such a legend or table?

Getting Inf as a result when using dtw_basic

Hello,

I want to cluster time series of different length, and this R package is an amazing way to do it! Thanks for it.

However, I am having difficulty finding a method to evaluate the optimal number of clusters for my partitional clustering. When I cluster more classical data with a more classical distance metric, I usually obtain it through the elbow method, silhouette, and so on. But I can't find how to do it in my current case.

This is my actual case :

pc <- tsclust(list_imp_lag, type = "partitional", k = c(3:15),
distance = "dtw_basic", centroid = "pam",
seed = 3247L, trace = TRUE,
args = tsclust_args(dist = list(window.size = 20L)))

where list_imp_lag is a list of 241 numeric vectors (extract below):

List of 241
$ : num [1:720] 99650 1860 0 0 0 ...
$ : num [1:254] 2830 0 0 0 0 0 0 0 0 0 ...
$ : num [1:687] 28510 75121 0 0 0 ...
$ : num [1:75] 5757 30288 0 0 0 ...
$ : num [1:720] 20437 14563 9451 0 0 ...
$ : num [1:84] 3430 0 0 0 0 0 0 0 0 0 ...
$ : num [1:696] 3495 3157 0 0 0 ...
$ : num [1:30] 13046 0 0 0 0 ...
$ : num [1:38] 71305 848300 477887 0 0 ...
$ : num [1:404] 179465 168423 144280 117150 5215 ...
$ : num [1:119] 2694 0 0 0 0 ...
$ : num [1:402] 32805 0 0 0 0 ...
$ : num [1:51] 6979 31930 23705 22625 31117 ...
$ : num [1:30] 24453 22145 16658 13891 12101 ...

My distance matrix is obviously not symmetric in that situation.

I tried to use the cvi function but got errors:

sapply(pc, cvi, type = "valid")
Error in silhouette.default(a@cluster, dmatrix = distmat) :
object 'sildist' not found
In addition: Warning messages:
1: In FUN(X[[i]], ...) :
Internal CVIs: series' cross-distance matrix is NOT symmetric, which can be problematic for:
Sil D COP
2: In FUN(X[[i]], ...) :
Internal CVIs: centroids' cross-distance matrix is NOT symmetric, which can be problematic for:
DB DB*

It would be very helpful if someone can help me with that problem.

Thanks in advance!

sigma error

When running the tsclust function on a multivariate model with varying time series lengths, I run into the error: Parameter 'sigma' must be positive. Do you have any idea what could be the issue? Thanks in advance!

How I know how many seed that I should fill

Dear @asardaes
In the code below, how do I know which seed value I should use?
Is the seed argument suitable for all clustering algorithms, such as hierarchical, partitional and fuzzy?
I only know that the seed is the random seed used for reproducibility.

pc_dtw <- tsclust(data_z, k = 6L,
                  distance = "dtw_basic", centroid = "dba",
                  trace = TRUE, seed = 8,
                  norm = "L2", window.size = 20L,
                  args = tsclust_args(cent = list(trace = TRUE)))

Using dtwclust with RODBC (database connection)

Hi, I'm trying to use RODBC to connect to my database and work with a very large dataset (>15,000,000 rows and 60 columns). As I expected, my computer does not have enough memory to download all the data.

I don't know whether it is possible to use this package with such a connection, or whether something else could replace the database connection.

I'd appreciate some help here ;)

Silhouette width for TADPole method

Hi,
I am trying to use TADPole (and other methods) for time series clustering. I want to use silhouette width to compare different solutions for varying cluster counts k. For SBD, GAK, etc. I can easily extract the silhouette width, but for TADPole, I get the following message:

A second set of cluster membership indices is required in 'b' for this/these CVI(s).

In the cvi function (which I am using to get the silhouette widths), you give for b:

b - If needed, a vector that can be coerced to integers which indicate the cluster
memberships. The ground truth (if known) should be provided here

but this makes little sense to me since providing the ground truth (which is unlikely to be known) somewhat defeats the purpose of using silhouette width to find a best value for k. Could you perhaps clarify what this means? Is it because of TADPole's pruning of distance calculations that the silhouette cannot be calculated? What should I provide here as second set of membership indices? I would be very grateful for your feedback. I have looked into the source code but as an R novice, I fear it's a bit beyond my comprehension.

Choosing optimum number of clusters with cvi

Hi Alexis and thanks for dtwclust package

I'm trying to cluster a set of 115 temperature series with dtwclust, but I'm not sure how to choose the optimal number of clusters and the clustering method. I have tried partitional clustering (as seen in an example) with a predefined range for the number of clusters.

  # Cluster analysis
  pc_dtw.max <- tsclust(tmax.estiu2, k = 10:20, preproc = zscore,
                  type="partitional",
                  distance = "dtw_basic", centroid = "dba",
                  trace = TRUE, seed = 100,
                  window.size = 10L,
                  args = tsclust_args(cent = list(trace = TRUE)))

As I am not an expert, I tried changing some parameters after reading the dtwclust documentation, but could not find big differences in the results. Now I'm trying to run cvi for a range of cluster counts to see which number of clusters is "better".

I tried sapply(pc_dtw.max, cvi, type="internal")

with this output

                [,1]        [,2]         [,3]        [,4]         [,5]        [,6]
  Sil     0.007919614 0.005149032 -0.005419144 0.008966472 0.0001571701 -0.02020825
  SF      0.000000000 0.000000000  0.000000000 0.000000000 0.0000000000  0.00000000
  CH     10.461114581 9.445447091  8.796729501 8.349410182 7.7393179829  7.13712605
  DB      1.996880488 1.687455700  1.570076739 1.710654571 1.8024923232  1.98925128
  DBstar  2.284530403 1.867781923  1.713612863 1.857947612 2.0986865794  2.23919398
  D       0.318891976 0.351217226  0.340582275 0.360524182 0.3591513913  0.38330590
  COP     0.496265373 0.487373747  0.469942617 0.464935361 0.4707556623  0.45977107
               [,7]        [,8]         [,9]       [,10]       [,11]
  Sil    -0.01923315 -0.01936116 -0.009273884 -0.02458494 -0.02990423
  SF      0.00000000  0.00000000  0.000000000  0.00000000  0.00000000
  CH      6.89926316  6.18977503  5.820197842  5.39745880  5.23057420
  DB      1.88083424  1.73262791  1.655580859  1.85372970  1.75415734
  DBstar  2.13514378  1.92340902  1.888174174  2.10086193  2.07282527
  D       0.29840968  0.35715701  0.326376637  0.34778048  0.31852244
  COP     0.46009509  0.45549048  0.451436062  0.44625196  0.44034457

but I can't work out how to interpret all these indices. Should I look for the absolute lowest value across all indices and choose the associated number of clusters? Are negative "Sil" values meaningless? Or should I choose the number of clusters that has the lowest value for the largest number of indices?
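
One possible way to combine the indices is a majority vote, sketched below in base R on the first three columns of the table above (k = 10, 11, 12). The direction of each index is an assumption taken from recollection of the cvi documentation (Sil, CH and D are to be maximized; DB, DBstar and COP are to be minimized) and should be checked there. SF is left out because it is constant (0) in this output and cannot discriminate between solutions.

```r
# First three columns of the posted cvi() output (k = 10, 11, 12)
cvis <- rbind(
  Sil    = c(0.007919614, 0.005149032, -0.005419144),
  CH     = c(10.461114581, 9.445447091, 8.796729501),
  DB     = c(1.996880488, 1.687455700, 1.570076739),
  DBstar = c(2.284530403, 1.867781923, 1.713612863),
  D      = c(0.318891976, 0.351217226, 0.340582275),
  COP    = c(0.496265373, 0.487373747, 0.469942617)
)
colnames(cvis) <- paste0("k_", 10:12)

# Assumed direction of each index (verify against the cvi documentation)
maximize <- c("Sil", "CH", "D")

# For every index, find the column that is best in its own direction
best_col <- vapply(rownames(cvis), function(idx) {
  row <- cvis[idx, ]
  if (idx %in% maximize) unname(which.max(row)) else unname(which.min(row))
}, integer(1))

# Count votes per candidate k and take the most voted one
votes  <- table(colnames(cvis)[best_col])
winner <- names(votes)[which.max(votes)]
print(votes)
print(winner)
```

A vote is only a heuristic: here the silhouette values are all close to zero, which suggests weak cluster structure for every k shown, whichever column wins.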

Thanks and best regards

Too long vector?

Hi, I'm trying to use the tsclust function on my database (496306 rows and 12 columns) and I get the following error:

mvc <- tsclust(data3[,15:26], k = nclust, distance = "dtw_basic", seed = 390,centroid="pam")
I've also tried with GAK.

Error in .Call("pairs", n, lower, PACKAGE = "dtwclust") :
cannot allocate vector of length 1506013458

I don't know whether this is due to a limitation in R, in the package or in my computer.

I'd be very glad if you could do something ;)
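
A back-of-the-envelope estimate in base R of how large a full cross-distance matrix would be here. It assumes that each of the 496306 rows is treated as one series and that distances are stored as 8-byte doubles; both are assumptions, not statements about what the package allocates internally.

```r
n_series <- 496306                       # rows in the posted data set

# Number of distinct pairs of series
n_pairs <- n_series * (n_series - 1) / 2

# Size of a dense n x n matrix of doubles, in GiB
full_matrix_gib <- n_series^2 * 8 / 2^30

print(n_pairs)
print(round(full_matrix_gib))
```

Under these assumptions the matrix would need well over a terabyte, so clustering a sample of the rows, or using a method that avoids the full matrix, is likely unavoidable.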

How to get the n best configs after `compare_clusterings()`?

Hi,

Following the examples in the vignette and manual pages, I'm using compare_clusterings_configs() plus compare_clusterings() to obtain the best cluster configuration for my dataset.

I wonder now if there's a way to get the 10 best configurations, for example, or the best for each distance that I evaluate, i.e., DTW, SBD, etc. How can I do the scoring myself and select the best configs from the huge table at comparison_part$results$partitional ?

# configs
cfg <- compare_clusterings_configs(
  types = "partitional",
  k = 2:5,
  controls = list(partitional = partitional_control(iter.max = 100L, 
                                                    nrep = 5L)),
  preprocs = pdc_configs("preproc",
                         none = list(),
                         zscore = list(center = c(FALSE, TRUE))),
  distances = pdc_configs("distance",
                          partitional = list(
                            dtw_basic = list(
                              window.size = seq(from = 1L, to = 5L, by = 1L),
                              norm = "L2"),
                            dtw_lb = list(
                              window.size = seq(from = 1L, to = 5L, by = 1L),
                              norm = "L2"),
                            sbd = list()
                            )
                          ),
  centroids = pdc_configs("centroid",
                          share.config = c("p"),
                          dba = list(
                            window.size = seq(from = 1L, to = 5L, by = 1L),
                            norm = "L2"),
                          shape = list(znorm = TRUE),
                          pam = list()
                          ),
  no.expand = "window.size"
)

# set score and pick functions
vi_evaluators <- cvi_evaluators("valid")
score_fun <- vi_evaluators$score
pick_fun <- vi_evaluators$pick 

# compare
comparison_part <- 
  compare_clusterings(data,
                      types = "partitional",
                      configs = cfg,
                      seed = 3L,
                      trace = TRUE,
                      score.clus = score_fun,
                      pick.clus = pick_fun,
                      shuffle.configs = TRUE,
                      return.objects = TRUE)

# info of the best rep
comparison_part$pick$config

Thanks in advance for any hints!

Multivariate clustering with different distances/prototyping functions for each variable v

Dear Alexis,

I am studying the possibility to perform multivariate time series clustering on variables of different natures. In that context, the best distance to consider is not the same for each variable (the same is valid for prototyping functions).

So my question: is there a straightforward way to create "custom" multivariate distances, for which the distance used for each variable v can change? Moreover, since the distances for each v may take very different values, I suppose that these should be normalized in a certain way (to be defined) during the computation of the global distance (i.e. the one resulting from the summation over all the v variables).
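
A rough base R sketch of the idea described above: one distance function per variable, combined as a weighted sum. All names (`mv_distance`, `euclid`, `manhattan`) are hypothetical, the two toy distances require series of equal length, and the weights stand in for whatever normalization is eventually chosen. Making such a function usable as a named distance in the package would presumably still require registering it with proxy, which is not shown here.

```r
# Hypothetical helper: x and y are matrices with one column per variable.
# dist_funs is a list with one distance function per column.
mv_distance <- function(x, y, dist_funs, weights = rep(1, length(dist_funs))) {
  stopifnot(ncol(x) == ncol(y), ncol(x) == length(dist_funs))
  per_var <- vapply(seq_along(dist_funs), function(v) {
    dist_funs[[v]](x[, v], y[, v])
  }, numeric(1))
  # Weighted sum over variables; the weights act as a crude normalization
  sum(weights * per_var)
}

# Two toy per-variable distances (equal-length series only)
euclid    <- function(a, b) sqrt(sum((a - b)^2))
manhattan <- function(a, b) sum(abs(a - b))

x <- cbind(c(0, 0, 0), c(1, 1, 1))
y <- cbind(c(3, 4, 0), c(2, 3, 1))

d <- mv_distance(x, y, list(euclid, manhattan), weights = c(0.5, 0.5))
print(d)
```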

If there is no simple way to do that, I would be glad to contribute (please just indicate which functions I should focus on).

Of course, the same question is valid for prototyping.

Thanks a lot in advance.

All the best,

Zacharie

Using PAM centroids with a sparse matrix is unlikely to be useful

Explanation will follow.

Incompatibility with saved object from previous version

Hi Alexis.
I have dtwclust-class objects that were produced under a previous release of the dtwclust package. After I updated the package to the latest version, predict(previous_object,series) no longer works.
So I downloaded the zip of the previous release and tried to install it.
However, when I use install.packages("d:/dtwclust-2.2.1.zip", repos = NULL, type = "source"), no error is reported, but the package is not installed successfully because it cannot be found in my package list.
It would be difficult to reproduce my dtwclust-class objects in the latest version because the process is extremely complicated.
Is there any other way to get the previous package version installed?

Prototype before clustering?

I'm analyzing time series of dominant frequency for an insect genus. The objectives are to (1) calculate acoustic distance matrices between pairs of species and populations and (2) use hierarchical cluster analysis to reveal the structure of the acoustic signals within the genus. I have a huge data set with several time series per individual and several individuals per species. I was thinking that the best way to do this is to calculate a time-series prototype for each individual using the DBA function, and then use these prototypes as input to calculate the distance matrices and run the cluster analysis. Is it correct to use prototypes this way?

Failing to install on Ubuntu (17.10)

Same issue with install.packages() and remotes::install_github(). After compiling (with g++), I get:

Warning: S3 methods ‘eigs.matrix’, ‘eigs.dgeMatrix’, ‘eigs.dgCMatrix’, ‘eigs.dgRMatrix’, ‘eigs.dsyMatrix’, ‘eigs.function’, ‘eigs_sym.matrix’, ‘eigs_sym.dgeMatrix’, ‘eigs_sym.dgCMatrix’, ‘eigs_sym.dgRMatrix’, ‘eigs_sym.function’, ‘svds.matrix’, ‘svds.dgeMatrix’, ‘svds.dgCMatrix’, ‘svds.dgRMatrix’, ‘svds.dsyMatrix’, ‘svds.function’ were declared in NAMESPACE but not found
Error in namespaceExport(ns, exports) : 
  undefined exports: eigs, eigs_sym, svds

Can dtwclust() deal with high dimension time series?

The time series X : x1, x2, x3, ..., xn;
each element consists of three variables: x1=(a1,b1,c1), x2=(a2,b2,c2), x3=(a3,b3,c3),...., xn=(an,bn,cn)
So are the series X', X'', X''' ,...
Can these three-dimensional time series be clustered with dtwclust()?
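
A small base R sketch of how such data could be laid out. Another issue on this page passes multivariate series "as a list of matrices", so the sketch builds one matrix per series with one column per variable (a, b, c) and one row per time point. The helper `make_series` and the random values are purely illustrative.

```r
set.seed(42)

# Hypothetical helper: one series of length n with three variables a, b, c
make_series <- function(n) {
  cbind(a = rnorm(n), b = rnorm(n), c = rnorm(n))
}

# X, X', X'' as a named list; lengths may differ between series
series <- list(X = make_series(50), X1 = make_series(60), X2 = make_series(55))

str(series)
```

Each element keeps the same three columns while the number of rows can vary from series to series.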

compare_clusterings pick - config and object differ

Hi,

I have been trying to use dtwclust as a basis to test different clustering algorithms against each other. For this, I have relied on the compare_clusterings() function, but when used I see that the object chosen by the picking function (a product of cvi_evaluators(type = "internal")) does not correspond to the model configuration described.

I have found that the CVI displayed corresponds to that of the attached object. However, the configuration registered is completely off.

installation problem

Hi, I have a problem with the installation when I run install.packages("dtwclust").
This is the output message, and I can't find a solution.

I updated R to version 3.5.2 and tried again:

install.packages("dtwclust")
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/dtwclust_5.5.1.tgz'
Content type 'application/x-gzip' length 4141751 bytes (3.9 MB)
==================================================
downloaded 3.9 MB

tar: Failed to set default locale

The downloaded binary packages are in
/var/folders/vg/8jx_2g9s1v3fz5w2x10y_qkh0000gn/T//Rtmpyr24Ub/downloaded_packages

multivariate series/centroids plots not working

With the included multivariate example, as well as with my own data and partitional clustering, the plotting function emits the warnings below, does not append the additional dimensions to the plots, and draws fill and gradients under the lines that are not there in the univariate case.

I'm including my code and output using the example from the reference manual. This has occurred for me using DTW, DTW2, and GAK as the distance metric. All of these work in the univariate case, and hierarchical cluster plotting seems to work in all cases.

# Multivariate series, provided as a list of matrices
mv <- CharTrajMV[1L:20L]
# Using GAK distance
mvc <- tsclust(mv, k = 4L, distance = "gak", seed = 390,
               args = tsclust_args(dist = list(sigma = 100)))
# Note how the variables of each series are appended one after the other in the plot
plot(mvc)
Warning messages:
1: In data.frame(dfm, do.call(rbind, dfm_tcc)) :
  row names were found from a short variable and have been discarded
2: In data.frame(dfcm, do.call(rbind, dfcm_tc)) :
  row names were found from a short variable and have been discarded

I am using R 3.3.2, RStudio 1.0.44, ggplot 2.2.0 and dtwclust 4.0.2

Can I define certain series to always be the clustering centers？

Hey guys! I have come across a question.
When using dtwclust with the 'partitional' method, can I force certain series to be the centroids throughout the whole process?
For example, I have 1000 series, among which are s1, s2 and s3.
I want to use dtwclust to do partitional clustering, and the outcome should be three clusters with s1, s2 and s3 as their respective centroids.
Is this possible?
Thank you for your attention! ^_^

How to calculate SSE from tsclust result of each cluster?

Dear @asardaes
How can I calculate the SSE (sum of squared errors) of each cluster from a tsclust result?
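
A base R sketch of one common definition: the per-cluster SSE is the sum of squared distances between each series and the centroid of its cluster. The two input vectors below are toy values; in practice they would have to be taken from the fitted object or recomputed with the chosen distance, and how to obtain them is an assumption left open here.

```r
# Toy inputs (illustrative only):
# cluster membership of five series and their distance to their own centroid
cluster          <- c(1L, 1L, 2L, 2L, 2L)
dist_to_centroid <- c(1, 2, 0, 3, 4)

# Sum of squared distances, grouped by cluster
sse <- tapply(dist_to_centroid^2, cluster, sum)
print(sse)

# Total SSE over all clusters
total_sse <- sum(sse)
print(total_sse)
```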

Wrong behaviour of plot command

Hello,

First of all, congratulations on your great work with dtwclust!

I do not know if this is the right place to ask my question, but I am just starting with dtwclust. I noticed incorrect behaviour of the plot command on my system (Linux Mint 17.1, RStudio 0.99.903, R version 3.3.1).

Let's say I use dtwclust to produce a partitional clustering of the data held in Patterns with the command:

ClustResults_kshape <- dtwclust(Patterns, type = "partitional", k=10L, distance = "sbd",centroid="shape",control=list(trace=TRUE))

Then, when I want to plot some results, I issue the following command:

plot(ClustResults_kshape,type="sc")

and get the following error:
Error in plot.hclust(ClustResults_kshape, type = "sc") :
invalid dendrogram

I get this systematically, even with the examples provided in the documentation. It seems that the "type" argument is not taken into account in my case.

Am I missing something or is it a bug ?

Thank you very much in advance.

Zacharie, Univ. of Mons

Difference in Silhouette results

Hi,
I have observed that I obtain different results for the average silhouette index when comparing the values obtained internally using cvi and "Sil" to the average value I get using the silhouette function and averaging the third column of the results. How is this possible?
I am performing a clustering using the Shape-based distance "SBD" and "shape_extraction" for the centroid.

Zero-size clusters and clusinfo$size

Sometimes the time series get split into fewer clusters than requested, which causes zeros in clusinfo$size. This is expected. However, when it is the last cluster that is empty, the clusinfo$size vector is shorter than the number of clusters because of how the tabulate() function works (I believe the call is missing the nbin argument).

E.g. for 3 clusters:

> k <- 3
> tabulate(c(1,1, 2,2, 3,3))
[1] 2 2 2
> tabulate(c(1,1, 2,2))
[1] 2 2 # but we need 2 2 0
> tabulate(c(1,1, 2,2), nbin = k)
[1] 2 2 0 # this is desired

My understanding is that the following line is used to calculate the clusinfo$size vector and it is this line that needs to be fixed:
https://github.com/cran/dtwclust/blob/c9c03ba5da463ebe14d03d46bb21bfb0f1b41181/R/partitional-fuzzy.R#L128
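
For comparison, a short base R sketch of two ways to obtain sizes that always have length k. Note that the formal argument of `tabulate()` is spelled `nbins`; `nbin = k` in the example above works through R's partial argument matching. The `factor()` variant is an alternative, not what the package does.

```r
k <- 3L
cluster <- c(1L, 1L, 2L, 2L)   # cluster 3 is empty

# Option 1: tabulate with an explicit number of bins
sizes_tabulate <- tabulate(cluster, nbins = k)

# Option 2: count over a factor whose levels cover every cluster id
sizes_factor <- as.integer(table(factor(cluster, levels = seq_len(k))))

print(sizes_tabulate)
print(sizes_factor)
```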

Warnings with non-symmetric distance

Hi,
I am trying to use the CDM distance from TSclust ("CDMdistance") and to compare CVIs for different numbers of clusters. After
proxy::dist(data, method = "CDMdis")
p1<-tsclust(data, type="hierarchical",k=2:5, distance="CDMdis", control=hierarchical_control(method="ward.D"))
I get the warning "Distance matrix is not symmetric, and hierarchical clustering assumes it is (it ignores the upper triangular)."
After sapply(p1, cvi, type = "internal") the indices are returned, but with:
Warning messages:
1: In FUN(X[[i]], ...) :
Internal CVIs: series' cross-distance matrix is NOT symmetric, which can be problematic for:
Sil D COP
I guess there is something I am doing wrong. I'd be grateful if you could help me.

De-normalizing after zscore preprocessing

Hello,

Is there a simple way with dtwclust to get the means and standard deviations used by zscore during preprocessing, in order to perform denormalization?
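
A base R sketch of the arithmetic involved, independent of the package: if the mean and standard deviation of each series are stored before normalizing, the original values can be recovered exactly. Another issue on this page calls zscore with keep.attributes = TRUE and reads the "scaled:scale" attribute, which suggests the package can attach these values to the result; the sketch below does not rely on that.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Store the parameters before normalizing
mu    <- mean(x)
sigma <- sd(x)

# z-normalization and its inverse
z        <- (x - mu) / sigma
restored <- z * sigma + mu

print(round(z, 3))
print(restored)
```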

Thank you in advance.

Regards,

Zacharie

[GAK][non-proxy] Sigma estimation is flawed

The sampling part was not repeated with different randomness each time.

Mean and SD not keeped when using DBA centroids in tsclust

If I run the following code using a list of multivariate time series:

data <- zscore(my_list, keep.attributes = TRUE)

pc_dtw_dba <- tsclust(data, k = 2L:10L,
					distance = "dtw_basic", centroid = "dba",
					trace = trace, seed = seed,
					norm = "L2",
					args = tsclust_args(cent = list(trace = trace)) )

names(pc_dtw_dba) <- paste0("k_", 2L:10L)

centroids don't keep the mean and sd attributes:

attr( pc_dtw_dba$k_10@centroids[[1]], "scaled:scale" )
# NULL

Follow-up on adding legends to DTWClust plots

How can I retain the name of each of my time series in the plots and dendrograms produced by DTWClust? So far I have found the package very useful but have to keep going back and forth between the order in @cluster and my data set to identify cluster membership. Also, the dendrograms produced in hierarchical clustering do not retain the unique names of my 24 time series, and instead label them 1-24. I know that two years ago there was no mechanism to do this (#2) but wanted to check and see if a new feature has been added. Thank you!

DBA documentation's ellipsis entry is wrong

It states that ... is ignored. It is not, it is still passed to dtw_basic.

Calculate distmat in parallel using tsclustFamily with a lot of data

Hello, I'm trying to calculate the distance matrix, but I don't get a matrix when I do this:

fam <- new("tsclustFamily", dist = "dtw",preproc=zscore,control=partitional_control(symmetric=FALSE))
fam@dist(muestra[1:10,])

I get this:

Error in attr(d, "dimnames") <- dim_names :
length of 'dimnames' [1] not equal to array extent

If I change the second line to fam@dist(muestra[1:10,],pairwise=TRUE)
I get:

[1] 0 0 0 0 0 0 0 0 0 0

I know that I can use proxy::dist(muestra[1:100,], method = "dtw"), but it cannot be parallelized. The 10 rows are just for testing; the real run would use 10k.

The data structure is the following:

head(muestra)
         X00000001        X00000002        X00000003        X00000004        X00000005        X00000006        X00000007
1              546              335              397                0              350              192                0
2             9482            11608             4692             2244             2114             5753             8482
3             4059             3500             4233             6023             3772             3936             4638
4             7474             4193             4122             7488             4768             4300             5302
5                0                2                0                2                0                9                0

> dim(muestra);class(muestra)
[1] 10000    90
[1] "data.table" "data.frame"

Thanks ;)

P.S.: Using v4.0.1

Lack of reproducibility in tsclust

Hi, I am using the tsclust function to cluster my 150 time series. Their lengths vary from 40 to 140.

I can't understand why the function does not always return the same results. If I run it twice with exactly the same parameters (distance="DTW", type="partitional" and k from 2 to 10), I don't obtain the same results.
Can you help me understand the theory behind this issue? Maybe a link to a paper could help me.
Is it the curse of dimensionality?

Thanks a lot!
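
A base R illustration of the usual cause: partitional algorithms start from randomly chosen initial centroids, so two runs can converge to different partitions unless the random seed is fixed. The sketch uses `stats::kmeans()` as a stand-in for a partitional procedure; it is not the algorithm used by tsclust, and the helper `run` is hypothetical.

```r
set.seed(123)
x <- matrix(rnorm(200), ncol = 2)   # toy data, 100 points

# Hypothetical helper: fix the seed, then run a partitional clustering
run <- function(seed) {
  set.seed(seed)
  kmeans(x, centers = 3)$cluster
}

a <- run(8)
b <- run(8)   # same seed, same random starting centroids
print(identical(a, b))
```

Other issues on this page pass a seed argument to tsclust for the same purpose.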

dtwclust - examples cannot be reproduced

Hello,

I have started using the dtwclust package. However, there are several examples that cannot be run. So far I have encountered the following problems:

create_dtwclust: the function cannot be found.
page 27: in reinterpolate(CharTraj, new.length = max(lengths(CharTraj))), new.length should be newLength, and R does not recognize lengths. After making some changes, I get:

series <- reinterpolate(CharTraj, newLength = max(length(CharTraj)))
Error in xy.coords(x, y) :
'x' is a list, but does not have components 'x' and 'y'
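
A base R sketch of the difference between `length()` and `lengths()` on a list, which matters for the modified call above: `length()` returns the number of series, while `lengths()` returns the length of each one. `lengths()` is only available in R 3.2.0 and later; the `sapply()` form works on older versions. The toy list is illustrative.

```r
series <- list(a = 1:3, b = 1:5, c = 1:4)

n_series    <- length(series)                 # number of elements in the list
longest     <- max(lengths(series))           # length of the longest series
longest_old <- max(sapply(series, length))    # equivalent for R < 3.2.0

print(n_series)
print(longest)
```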

Thank you,
Golnaz

Fuzzy mediods clustering produces probabilities that don't sum to 1

Hi,

I have been using tsclust with the following call:

hdl=dtwclust(datalist, type="fuzzy", k=2, centroid = "fcmdd", fuzzy_control(fuzziness = 2, iter.max = 100L, delta = 0.001,
                                                                          packages = character(0L), symmetric = FALSE, version = 2L,
                                                                          distmat = NULL))

I ran this with similar data and got sensible results, but this particular call resulted in the following probabilities (using @fclust). Note that I had 74 observations; I have truncated the table because the omitted rows are all 0.5, 0.5. The last row does not sum to 1. What could be the cause of such strange results? Also, I ran this with a fuzziness parameter of 1.3 and got identical results.

Thanks!

Cluster	Prob 1	Prob 2
1	0.5	0.5
1	0.5	0.5
1	0.5	0.5
1	0.5	0.5
1	0.5	0.5
1	0.5	0.5
1	0.5	0.5
1	0.5	0.5
1	0.5	0.5
1	0.5	0.5
1	0.5	0.5
1	0.5	0.5
1	1	1
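
A base R sketch of the textbook fuzzy c-means membership update, shown only to illustrate one way rows of 0.5, 0.5 can arise: when a series is equally distant from both centroids, its memberships are 0.5 each for any fuzziness m, which would also explain identical results for m = 2 and m = 1.3. Whether the package's fcmdd implementation uses exactly this formula is an assumption, and `fuzzy_membership` is a hypothetical helper.

```r
# Textbook update: u[i, j] = 1 / sum_k (d[i, j] / d[i, k])^(2 / (m - 1))
# d: matrix of distances, rows = series, columns = centroids.
# Zero distances are not handled here and would produce NaN or Inf.
fuzzy_membership <- function(d, m = 2) {
  expo <- 2 / (m - 1)
  1 / (d^expo * rowSums(1 / d^expo))
}

d <- rbind(c(3, 3),    # equidistant from both centroids
           c(1, 3))    # closer to the first centroid

u_m2  <- fuzzy_membership(d, m = 2)
u_m13 <- fuzzy_membership(d, m = 1.3)

print(u_m2)
print(rowSums(u_m2))
```

A row of 1, 1 cannot come out of this formula with positive distances, so a degenerate case such as a zero distance to both medoids would be worth checking.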

negative length vectors are not allowed - dtwclust with L2 distance works with 46341 TS but not 46342 TS

[1] 46342
Error in do.call(".External", c(list(CFUN, x, y, pairwise, if (!is.function(method)) get(method) else method),  : 
  negative length vectors are not allowed
Calls: tsclust ... <Anonymous> -> .proxy_external -> do.call -> .External
Execution halted

Reproduced with R version 3.4.4 and R version 3.6.1, using dtwclust both from CRAN and version 5.5.5 from Github.

The following code will reproduce the issue. Note that it takes considerable memory (> 16GB) and time to run the 46341 working test case, while 46342 throws the error immediately.

library(dtwclust)
for (num_ts in 46341:46342) {
  print(num_ts)
  cl_k_nrep <-
    tsclust(lapply(1:num_ts, function(x)
      return(0)),
      k = 2,
      distance = "l2")
}
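
A base R check of one plausible explanation for the sharp boundary, offered as an assumption rather than a confirmed diagnosis: the product n * (n - 1) still fits into a signed 32-bit integer for n = 46341 but not for n = 46342, so any internal size computed that way in 32-bit arithmetic would wrap to a negative number exactly at this point.

```r
int_max <- .Machine$integer.max   # 2147483647 on all current R platforms

# Use doubles so the products themselves cannot overflow in this check
n_ok  <- 46341
n_bad <- 46342

fits_ok  <- n_ok  * (n_ok  - 1) <= int_max
fits_bad <- n_bad * (n_bad - 1) <= int_max

print(c(n_ok * (n_ok - 1), n_bad * (n_bad - 1), int_max))
print(c(fits_ok, fits_bad))
```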

Question about number of clusters generated by hierarchical clustering

In the dtwclust package, does hierarchical clustering automatically create 2 clusters if k is not specified?

Series order in the interactive_clustering's Explore dashboard not preserved

Given my series in a list contained in the data variable, the series ordering in the Explore dashboard is the one given by sort( names(data) ) and not the original one given by names(data). This makes the analysis more difficult if you want to select your series by indexes based on your original ordering.

Getting the indices of clusters

Say 4 clusters are obtained, how can the indices of the vector from the input time series be obtained? So I can know which elements of this vector belong to which cluster?
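
A base R sketch of how membership indices can be derived from a vector of cluster labels, such as the one stored in the @cluster slot mentioned in other issues on this page. The label vector below is a toy example.

```r
# Toy cluster labels for six input series, in input order
cluster <- c(2L, 1L, 4L, 2L, 3L, 1L)

# Indices of the input series, grouped by cluster
members <- split(seq_along(cluster), cluster)
print(members)

# Indices of the series in one particular cluster
in_cluster_2 <- which(cluster == 2L)
print(in_cluster_2)
```

If the input list is named, `split(names(data), cluster)` would give the member names instead of positions, assuming `data` is the list that was clustered.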

Error when doing predict() with previous dtwclust object

Hi Alex. It has been a while since I last used your package. Today I found that the predict() function doesn't work with my previously saved dtwclust object. When I call predict(old_object, series), it returns Error in (function (x, centroids = NULL, ...) : could not find function "check_parallel", while it works well with a new dtwclust object that I created today. I think this problem is different from the issue I created before (#9). Is there any way to solve this without creating a new dtwclust object?

Compilation failed

Good afternoon,

I cannot install due to a compilation error. I'm running

install.packages("dtwclust")

And I get:

Installing package into ‘/home/my_user/Rpackages’
(as ‘lib’ is unspecified)
trying URL 'https://cloud.r-project.org/src/contrib/dtwclust_5.5.4.tar.gz'
Content type 'application/x-gzip' length 2375674 bytes (2.3 MB)
==================================================
downloaded 2.3 MB

* installing *source* package ‘dtwclust’ ...
** package ‘dtwclust’ successfully unpacked and MD5 sums checked
** libs
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG  -I"/home/my_user/Rpackages/Rcpp/include" -I"/home/my_user/Rpackages/RcppArmadillo/include" -I"/home/my_user/Rpackages/RcppParallel/include" -I"/home/my_user/Rpackages/RcppThread/include"   -DRCPP_USE_UNWIND_PROTECT -fpic  -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g -c init.cpp -o init.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG  -I"/home/my_user/Rpackages/Rcpp/include" -I"/home/my_user/Rpackages/RcppArmadillo/include" -I"/home/my_user/Rpackages/RcppParallel/include" -I"/home/my_user/Rpackages/RcppThread/include"   -DRCPP_USE_UNWIND_PROTECT -fpic  -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g -c centroids/sdtw-cent.cpp -o centroids/sdtw-cent.o
In file included from /home/my_user/Rpackages/RcppThread/include/RcppThread.h:11:0,
                 from centroids/../utils/ParallelWorker.h:8,
                 from centroids/sdtw-cent.cpp:13:
/home/my_user/Rpackages/RcppThread/include/RcppThread/Thread.hpp: In lambda function:
/home/my_user/Rpackages/RcppThread/include/RcppThread/Thread.hpp:42:19: error: parameter packs not expanded with ‘...’:
                 f(args...);
                   ^
/home/my_user/Rpackages/RcppThread/include/RcppThread/Thread.hpp:42:19: note:         ‘args’
/home/my_user/Rpackages/RcppThread/include/RcppThread/Thread.hpp:42:23: error: expansion pattern ‘args’ contains no argument packs
                 f(args...);
                       ^
In file included from /home/my_user/Rpackages/RcppThread/include/RcppThread.h:13:0,
                 from centroids/../utils/ParallelWorker.h:8,
                 from centroids/sdtw-cent.cpp:13:
/home/my_user/Rpackages/RcppThread/include/RcppThread/ThreadPool.hpp: In member function ‘void RcppThread::ThreadPool::push(F&&, Args&& ...)’:
/home/my_user/Rpackages/RcppThread/include/RcppThread/ThreadPool.hpp:127:31: error: expected ‘,’ before ‘...’ token
         jobs_.emplace([f, args...] { f(args...); });
                               ^
/home/my_user/Rpackages/RcppThread/include/RcppThread/ThreadPool.hpp:127:31: error: expected identifier before ‘...’ token
/home/my_user/Rpackages/RcppThread/include/RcppThread/ThreadPool.hpp:127:34: error: parameter packs not expanded with ‘...’:
         jobs_.emplace([f, args...] { f(args...); });
                                  ^
/home/my_user/Rpackages/RcppThread/include/RcppThread/ThreadPool.hpp:127:34: note:         ‘args’
/home/my_user/Rpackages/RcppThread/include/RcppThread/ThreadPool.hpp: In lambda function:
/home/my_user/Rpackages/RcppThread/include/RcppThread/ThreadPool.hpp:127:44: error: expansion pattern ‘args’ contains no argument packs
         jobs_.emplace([f, args...] { f(args...); });
                                            ^
/home/my_user/Rpackages/RcppThread/include/RcppThread/ThreadPool.hpp: In member function ‘std::future<decltype (f(args ...))> RcppThread::ThreadPool::pushReturn(F&&, Args&& ...)’:
/home/my_user/Rpackages/RcppThread/include/RcppThread/ThreadPool.hpp:144:54: error: expected ‘,’ before ‘...’ token
     auto job = std::make_shared<jobPackage>([&f, args...] {
                                                      ^
/home/my_user/Rpackages/RcppThread/include/RcppThread/ThreadPool.hpp:144:54: error: expected identifier before ‘...’ token
/home/my_user/Rpackages/RcppThread/include/RcppThread/ThreadPool.hpp:144:57: error: parameter packs not expanded with ‘...’:
     auto job = std::make_shared<jobPackage>([&f, args...] {
                                                         ^
/home/my_user/Rpackages/RcppThread/include/RcppThread/ThreadPool.hpp:144:57: note:         ‘args’
/home/my_user/Rpackages/RcppThread/include/RcppThread/ThreadPool.hpp: In lambda function:
/home/my_user/Rpackages/RcppThread/include/RcppThread/ThreadPool.hpp:145:22: error: expansion pattern ‘args’ contains no argument packs
         return f(args...);
                      ^
make: *** [centroids/sdtw-cent.o] Error 1
ERROR: compilation failed for package ‘dtwclust’
* removing ‘/home/my_user/Rpackages/dtwclust’

The downloaded source packages are in
        ‘/tmp/RtmpNqaE8l/downloaded_packages’
Warning message:
In install.packages("dtwclust") :
  installation of package ‘dtwclust’ had non-zero exit status

I also tried devtools::install_github("asardaes/dtwclust") with the same results, even after installing the package RSpectra (as suggested in #30). I'm running the following R:

$platform
[1] "x86_64-pc-linux-gnu"
$arch
[1] "x86_64"
$os
[1] "linux-gnu"
$system
[1] "x86_64, linux-gnu"
$status
[1] ""
$major
[1] "3"
$minor
[1] "5.1"
$year
[1] "2018"
$month
[1] "07"
$day
[1] "02"
$`svn rev`
[1] "74947"
$language
[1] "R"
$version.string
[1] "R version 3.5.1 (2018-07-02)"
$nickname
[1] "Feather Spray"

On this server:

Distributor ID: Ubuntu
Description: Ubuntu 14.04.5 LTS
Release: 14.04
Codename: trusty
Linux my_server 4.4.0-31-generic #50~14.04.1-Ubuntu SMP Wed Jul 13 01:07:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Thanks!

Clustering series with NA

Hi and thanks for building such a useful package.
I'd like to know how to deal with series containing NA values. I have a set of daily temperature series (one value per day), but because they contain intervals of NA values I could not use tsclust on the complete series and instead had to restrict all the series to a common time interval.
Can dtwclust deal with time series that have gaps? Do the series need to have a value for every day?
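One common preprocessing approach, sketched here under the assumption that linear interpolation is acceptable for the data at hand, is to trim leading and trailing NA values and interpolate the interior gaps before clustering. The helper name `fill_gaps` and the temperatures are made up; the sketch uses only base R.

```r
# Trim leading/trailing NA values and linearly interpolate interior gaps.
# Illustrative helper; not part of dtwclust.
fill_gaps <- function(x) {
  ok <- which(!is.na(x))
  stopifnot(length(ok) >= 2)
  idx <- seq(min(ok), max(ok))   # positions between first and last valid value
  approx(x = ok, y = x[ok], xout = idx)$y
}

temps <- c(NA, 10, NA, 14, 15, NA, NA)   # made-up daily temperatures
filled <- fill_gaps(temps)
print(filled)
```

Trimming leaves series of different lengths. DTW-based distances can generally compare series of unequal length, so a common interval may not be needed with them, but that depends on the distance chosen.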
Thanks again,
Paco

Error "invalid 'ncol' value (too large or NA)" when comparing clustering algorithms with cvi

I want to compare clustering algorithms, but I get an error saying that 'ncol' is too large.
My dataset has 155 columns and 1900 observations.
I also have another dataset with 300 columns and 1900 observations.

What should I do?

> sapply(list(HC=hc, DTW = pc_dtw, kShape = pc_sbd, FuzzyM = fcm),
+        cvi, b = lst_label, type = "VI") #Internal, Minimize
Error in matrix(0L, nrow(a), max(b)) : 
  invalid 'ncol' value (too large or NA)
In addition: There were 50 or more warnings (use warnings() to see the first 50)
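The failing expression in the message is matrix(0L, nrow(a), max(b)), and base R raises exactly this error when the ncol argument is NA. One way that can happen is an NA in the label vector passed as b. The contents of lst_label are not shown, so this is only a possibility; the vectors below are made up.

```r
# Made-up ground-truth labels with a missing value.
b_bad <- c(1L, 2L, NA, 2L)
print(max(b_bad))   # NA, because max() propagates missing values

# The same call shape as in the error message, wrapped so the script continues.
res <- tryCatch(matrix(0L, 3L, max(b_bad)), error = function(e) e)
print(conditionMessage(res))

# Labels without NA, converted to consecutive integers, work as expected.
labels <- c("up", "down", "up", "flat")
b_ok <- as.integer(factor(labels))
m <- matrix(0L, 3L, max(b_ok))
print(dim(m))
```

Checking anyNA(lst_label) and class(lst_label) before calling cvi would confirm or rule this out.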

Question - Prototyping of Cluster Series

I have a quick question (I hope this is the best place for it). If we use the DTW distance and then a clustering algorithm (PAM or hierarchical), what is the best way to present a prototypical series for each cluster? I have thousands of series, so showing all of them for a cluster is not feasible. Is it meaningful to look at the mean centroid of each cluster, i.e. the average value of the series within the cluster at each time point?
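Two simple candidates can be compared directly: the pointwise mean, and the medoid, i.e. the member series with the smallest total distance to the others (which is what PAM uses as its centroid). The sketch below uses toy random walks and a Euclidean distance matrix as a stand-in for a DTW one, purely so that it runs with base R.

```r
set.seed(42)
# 5 toy series of 20 points each; rows are series.
series <- t(replicate(5, cumsum(rnorm(20))))

# Euclidean distances as a stand-in for a precomputed DTW distance matrix.
d <- as.matrix(dist(series))

# Medoid: the actual series with the smallest total distance to the rest.
medoid_idx <- which.min(rowSums(d))
medoid <- series[medoid_idx, ]

# Pointwise mean: averages values at each time index and ignores warping.
pointwise_mean <- colMeans(series)

print(medoid_idx)
```

The medoid is always a real series from the cluster, whereas the pointwise mean can smear features that are shifted in time, which is the situation DTW is meant for. dtwclust also offers DTW barycenter averaging (centroid = "dba") as a warping-aware average.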

Drawing centroid with a solid line instead of a dashed line

I am following the worked example in the documentation (https://rdrr.io/cran/dtwclust/man/tsclusters-methods.html) to plot the time series along with the centroid. By default, the centroid is drawn with a dashed grey line in front of the time series. My goal is to change the linetype to solid (and maybe change the color too). The problem is that when I set linetype = "solid", the centroid is sent to the background, so the other time series obscure it. I know the solid line is plotted, because if I increase the size parameter to 3.5 I can see the centroid behind the time series (see the screenshot below).

The worked example in the documentation above shows how to change the linetype to solid (line 24 under Examples section) but therein the parameter type = "c" so there is no concept of foreground or background, whereas in my case I want both series and centroid to be plotted but with centroid in front of the time series. Something like this

Any ideas on how that can be done?

Partitional clustering with PAM centroids, pam.precompute=FALSE, and dtw_lb distance gives wrong results

The step that updates centroids should use a distance matrix that is calculated entirely with DTW.

How to properly declare distance from TSclust with option

I would like to check whether I am passing options to diss.CORT from TSclust correctly. I want deltamethod to be set to DTW.

There's no error nor warning, but I wonder whether there is some other way to do it? The code is as follows:

mydist <- function(x, y){
       diss.CORT(x, y, k = 2, deltamethod="DTW")
  }

proxy::pr_DB$set_entry(FUN = mydist, names = c("CORT++dtw"),
                       loop = TRUE, type = "metric", distance = TRUE,
                       description = "CORT with DTW")

tsclust(data, type="hierarchical", k = 2,
        distance = "CORT++dtw",
        control=hierarchical_control(method="ward.D"))

I am deeply grateful for the dtwclust package and your kind assistance.

bug in timestamp consistency

If I pass a time argument to plot, e.g. plot(~, time = 100), it always fails with "Length mismatch between values and timestamps".

Here is the relevant source code:

# timestamp consistency

if (!is.null(time) && length(time) < max(L1, L2))
    stop("Length mismatch between values and timestamps") # nocov
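Reading the check literally, it fires whenever time has fewer elements than the longer of the two series, so a single number such as 100 fails for any series longer than one point. The sketch below reproduces only the condition, with made-up lengths; `check_time` is an illustrative name, and what the plot method does with time afterwards is not covered.

```r
# Illustrative wrapper around the condition quoted above; returns TRUE
# when the original code would call stop().
check_time <- function(time, L1, L2) {
  !is.null(time) && length(time) < max(L1, L2)
}

L1 <- 50L   # made-up length of the first series
L2 <- 60L   # made-up length of the second series

scalar_fails <- check_time(100, L1, L2)           # one timestamp only
vector_ok    <- check_time(seq_len(60), L1, L2)   # one timestamp per point
null_ok      <- check_time(NULL, L1, L2)          # no timestamps supplied

print(c(scalar = scalar_fails, vector = vector_ok, null = null_ok))
```

Under this reading, time is expected to be a vector of timestamps with at least as many entries as the longest series, not a single value.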

install issue on windows

Hi there,

I am trying to install dtwclust on R version 3.5.0. It seems to install fine at first, using the command:

"C:\Program Files\R\R-3.5.0\bin\Rscript" -e "install.packages('dtwclust',repos='http://cran.r-project.org',dependencies=TRUE)"

I receive the output:

"trying URL 'http://cran.r-project.org/bin/windows/contrib/3.5/dtwclust_5.4.0.zip'
Content type 'application/zip' length 3723853 bytes (3.6 MB)

downloaded 3.6 MB

package 'dtwclust' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
C:\Users\Administrator.QUB\AppData\Local\Temp\RtmpOeAlKI\downloaded_packages"

When I try to load the package in R 3.5.0, I get the following error:

"> library("dtwclust")
Loading required package: proxy

Attaching package: ‘proxy’

The following objects are masked from ‘package:stats’:

as.dist, dist

The following object is masked from ‘package:base’:

as.matrix

Loading required package: dtw
Loaded dtw v1.18-1. See ?dtw for help, citation("dtw") for use in publication.

Error: package or namespace load failed for ‘dtwclust’ in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]):
there is no package called ‘stringi’"

Is there a resolution for this?
