kknn's People

Contributors

antoine-lizee, klausvigo, laurentgatto


kknn's Issues

Confidence/prediction intervals for predictions

Hi, thanks for developing this package first and foremost!

I'd like confidence intervals based on the weighted means of the k nearest neighbors for my continuous response variables; there are virtually no replicates in my training points, so bootstrapped CIs do not seem to be an option for me. I've augmented the predict.kknn method and the kknn function to get a table of CIs for continuous variables, but the method I'm using is probably not optimal yet. Would you be interested in merging something like this into your master branch? If so, I'd love to work it out more fully. My fork is here
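
For illustration, a minimal sketch of the kind of interval I have in mind (a hypothetical helper, not package code; it assumes fit$CL holds the neighbours' response values and fit$W their kernel weights, and it describes the spread of the neighbours rather than a formal prediction interval):

kknn_interval <- function(fit, level = 0.95) {
  w  <- fit$W / rowSums(fit$W)          # normalise kernel weights per row
  mu <- rowSums(w * fit$CL)             # weighted mean of neighbour responses
  s2 <- rowSums(w * (fit$CL - mu)^2)    # weighted variance of the neighbours
  z  <- qnorm(1 - (1 - level) / 2)
  data.frame(fit = mu, lwr = mu - z * sqrt(s2), upr = mu + z * sqrt(s2))
}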

ks too large

train.kknn protects against kmax being too large but not ks:

library(kknn)

train.kknn(mpg ~ ., data = mtcars, ks = 32)
#> Error in D[, 1:j]: subscript out of bounds

train.kknn(mpg ~ ., data = mtcars, ks = 1:32)
#> Error in D[, 1:j]: subscript out of bounds

Created on 2020-07-28 by the reprex package (v0.3.0)
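
For reference, a guard analogous to the existing kmax check could reject oversized ks values up front; a hypothetical sketch, not current package code:

check_ks <- function(ks, n) {
  if (any(ks >= n))
    stop("every value in 'ks' must be smaller than nrow(data)")
  invisible(ks)
}
check_ks(32, nrow(mtcars))   # errors: mtcars has only 32 rows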

Session info
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.0 (2020-04-24)
#>  os       macOS Catalina 10.15.5      
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       America/New_York            
#>  date     2020-07-28                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date       lib source                            
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.0.0)                    
#>  backports     1.1.8      2020-06-17 [1] CRAN (R 4.0.0)                    
#>  callr         3.4.3      2020-03-28 [1] CRAN (R 4.0.0)                    
#>  cli           2.0.2      2020-02-28 [1] CRAN (R 4.0.0)                    
#>  crayon        1.3.4.9000 2020-06-09 [1] Github (r-lib/crayon@dcf6d44)     
#>  desc          1.2.0      2018-05-01 [1] CRAN (R 4.0.0)                    
#>  devtools      2.3.0      2020-04-10 [1] CRAN (R 4.0.0)                    
#>  digest        0.6.25     2020-02-23 [1] CRAN (R 4.0.0)                    
#>  ellipsis      0.3.1      2020-05-15 [1] CRAN (R 4.0.0)                    
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.0.0)                    
#>  fansi         0.4.1      2020-01-08 [1] CRAN (R 4.0.0)                    
#>  fs            1.4.2      2020-06-30 [1] CRAN (R 4.0.0)                    
#>  glue          1.4.1      2020-05-13 [1] CRAN (R 4.0.0)                    
#>  highr         0.8        2019-03-20 [1] CRAN (R 4.0.0)                    
#>  htmltools     0.5.0      2020-06-16 [1] CRAN (R 4.0.0)                    
#>  igraph        1.2.5      2020-03-19 [1] CRAN (R 4.0.0)                    
#>  kknn        * 1.3.2.1    2020-07-28 [1] CRAN (R 4.0.0)                    
#>  knitr         1.29       2020-06-23 [1] CRAN (R 4.0.0)                    
#>  lattice       0.20-41    2020-04-02 [1] CRAN (R 4.0.0)                    
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 4.0.0)                    
#>  Matrix        1.2-18     2019-11-27 [1] CRAN (R 4.0.0)                    
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 4.0.0)                    
#>  pkgbuild      1.0.8      2020-05-07 [1] CRAN (R 4.0.0)                    
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.0.0)                    
#>  pkgload       1.1.0      2020-05-29 [1] CRAN (R 4.0.0)                    
#>  prettyunits   1.1.1      2020-01-24 [1] CRAN (R 4.0.0)                    
#>  processx      3.4.3      2020-07-05 [1] CRAN (R 4.0.0)                    
#>  ps            1.3.3      2020-05-08 [1] CRAN (R 4.0.0)                    
#>  R6            2.4.1      2019-11-12 [1] CRAN (R 4.0.0)                    
#>  remotes       2.1.1      2020-02-15 [1] CRAN (R 4.0.0)                    
#>  rlang         0.4.7      2020-07-09 [1] CRAN (R 4.0.0)                    
#>  rmarkdown     2.3.2      2020-07-07 [1] Github (rstudio/rmarkdown@ff1b279)
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 4.0.0)                    
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.0)                    
#>  stringi       1.4.6      2020-02-17 [1] CRAN (R 4.0.0)                    
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.0.0)                    
#>  testthat      2.3.2      2020-03-02 [1] CRAN (R 4.0.0)                    
#>  usethis       1.6.1      2020-04-29 [1] CRAN (R 4.0.2)                    
#>  withr         2.2.0      2020-04-20 [1] CRAN (R 4.0.0)                    
#>  xfun          0.15       2020-06-21 [1] CRAN (R 4.0.0)                    
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.0)                    
#> 
#> [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

Are you planning a CRAN release any time soon? If so, I might submit another issue or PR in the short term.

running when imported

When kknn is referenced by namespace (as opposed to being fully attached via library) an error occurs:

train.con <- 
  kknn::train.kknn(
    Sepal.Width ~ ., 
    data = iris, 
    kmax = 6, 
    kernel = c("rectangular", "triangular", "epanechnikov",
               "gaussian", "rank", "optimal")
  )
#> Error in get(ctr, mode = "function", envir = parent.frame()): object 'contr.dummy' of mode 'function' was not found
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] compiler_3.5.0  magrittr_1.5    Matrix_1.2-14   tools_3.5.0     igraph_1.2.2    yaml_2.2.0      grid_3.5.0     
 [8] pkgconfig_2.0.2 kknn_1.3.2      lattice_0.20-35

Created on 2018-10-02 by the reprex package (v0.2.1)

(the kknn version was installed from GH)

Perhaps defining the argument to be

contrasts = c(unordered = "kknn::contr.dummy", ordered = "kknn::contr.ordinal")

or using getFromNamespace() or getAnywhere() to reference it would work.

We'd rather import the package when using it in another package to avoid name conflicts.
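
In the meantime, a user-side workaround sketch (assumption: it is enough to make the contrast functions visible on the search path before the call):

# Make the contrast functions findable without attaching kknn:
contr.dummy   <- kknn::contr.dummy
contr.ordinal <- kknn::contr.ordinal

train.con <- kknn::train.kknn(Sepal.Width ~ ., data = iris, kmax = 6)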

cv.kknn fold sizes are different

The line val <- sample(kcv, size = l, replace = TRUE) has the potential to create folds of different sizes for the cross-validation.

A simple fix could be val <- sample(cut(seq(1, nrow(data)), breaks = kcv, labels = FALSE)), where the sample() wrapper keeps the assignment random while making the folds as similar in size as possible.
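
An equivalent shuffled variant, reusing cv.kknn()'s own variable names (sketch):

n   <- nrow(data)
val <- sample(rep(seq_len(kcv), length.out = n))  # balanced, then shuffled
table(val)   # fold sizes differ by at most one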

Error When Character Level In Train But Not Test

Hello! I think there is a bug: if the test set does not contain all levels of a character variable that are present in the training set, the model will not run.

library(kknn)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

#Character level mis-alignment
dta <- data.frame(
  state = as.character(sample(1:10, 100, replace = T)),
  x = rnorm(100), stringsAsFactors = F)
dta$y <- rnorm(100)

test_dta <- data.frame(state = as.character(1:3), x = rnorm(3),
          stringsAsFactors = F)

kknn(formula = y ~ x + state, train = dta,
     test = test_dta)
#> Error in valid[, ord, drop = FALSE]: subscript out of bounds


#Also fails
dta <- dta %>% mutate(state = factor(state))
kknn(formula = y ~ x + state, 
     train = dta,
     test = test_dta %>% mutate(state = factor(state)))
#> Error in valid[, ord, drop = FALSE]: subscript out of bounds
#Works
kknn(formula = y ~ x + state, 
     train = dta,
     test = test_dta %>% mutate(state = factor(state, levels = levels(dta$state))))
#> 
#> Call:
#> kknn(formula = y ~ x + state, train = dta, test = test_dta %>%     mutate(state = factor(state, levels = levels(dta$state))))
#> 
#> Response: "continuous"

Created on 2021-01-24 by the reprex package (v0.3.0)

cv.kknn error when response is a factor

It appears the misclassification rate is not being calculated correctly on line 713.

Here is my attempt at a reproducible example, and a possible fix.

library(kknn)
set.seed(1)
data <-
  data.frame(
    x1 = runif(100),
    x2 = runif(100),
    x3 = runif(100),
    r1 = as.factor(sample(0:1, 100, replace = TRUE))
  )

kknn_cv <- cv.kknn(r1 ~ x1 + x2 + x3,
                   data)


kknn_cv[2]
#> [[1]]
#> [1] 0.77
1 - mean(kknn_cv[[1]][, 1] == kknn_cv[[1]][, 2])
#> [1] 0.48


cv.kknn2 <- function(formula, data, kcv = 10, ...)
{
  mf <- model.frame(formula, data = data)
  # terms(formula, data = data): avoids copying the data?
  y <- model.response(mf)
  l <- length(y)    # nrow(data)
  val <- sample(kcv, size = l, replace = TRUE)
  yhat <- numeric(l)
  for (i in 1:kcv) {
    learn <- data[val != i, ]
    valid <- data[val == i, ]
    fit <- kknn(formula, learn, valid, ...)
    yhat[val == i] <- predict(fit)
  }
  if (is.factor(y))
    MISCLASS <- sum(as.numeric(y) != yhat) / l
  if (is.numeric(y) | is.ordered(y))
    MEAN.ABS <- sum(abs(as.numeric(y) - as.numeric(yhat))) / l
  if (is.numeric(y) | is.ordered(y))
    MEAN.SQU <- sum((as.numeric(y) - as.numeric(yhat))^2) / l
  if (is.numeric(y))
    result <- c(MEAN.ABS, MEAN.SQU)
  if (is.ordered(y))
    result <- c(MISCLASS, MEAN.ABS, MEAN.SQU)
  if (is.factor(y) & !is.ordered(y))
    result <- MISCLASS
  list(cbind(y = y, yhat = yhat), result)
}


set.seed(1)
data <-
  data.frame(
    x1 = runif(100),
    x2 = runif(100),
    x3 = runif(100),
    r1 = as.factor(sample(0:1, 100, replace = TRUE))
  )
kknn_cv2 <- cv.kknn2(r1 ~ x1 + x2 + x3,
                    data)

kknn_cv2[2]
#> [[1]]
#> [1] 0.48
1 - mean(kknn_cv2[[1]][, 1] == kknn_cv2[[1]][, 2])
#> [1] 0.48

Created on 2018-08-31 by the reprex package (v0.2.0).

error in predicting with predict.train.kknn

I ran into a bug when the number of neighbors is greater than the number of samples being predicted.

It only occurs on the GitHub version; the CRAN version is fine (see the note below).

library(kknn)

data(miete)

miete_tr <- miete[-(1:5),    ]
miete_te <- miete[  1:5 , -13]

train.con <- train.kknn(
  nmqm ~ wfl + bjkat + zh,
  data = miete_tr,
  ks = 8,
  kernel = "rectangular"
)

# Try to get the 8-nearest neighbors from the training set of 
# `nrow(miete_tr)` = 1077 households. 
predict(train.con, miete_te)
#> Error in kknn(formula(terms(object)), object$data, newdata, k = object$best.parameters$k, : k must be smaller or equal the number of rows of the training
#>                  set
sessionInfo()
#> R version 3.5.0 (2018-04-23)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.6
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] kknn_1.3.2
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.19.3  lattice_0.20-35 digest_0.6.18   rprojroot_1.3-2
#>  [5] grid_3.5.0      backports_1.1.2 magrittr_1.5    evaluate_0.12  
#>  [9] stringi_1.2.4   Matrix_1.2-14   rmarkdown_1.9   tools_3.5.0    
#> [13] stringr_1.3.1   igraph_1.2.2    yaml_2.2.0      compiler_3.5.0 
#> [17] pkgconfig_2.0.2 htmltools_0.3.6 knitr_1.20

Created on 2018-10-16 by the reprex package (v0.2.1)

I think that the line

if(k>p) stop('k must be smaller or equal the number of rows of the training
                 set')

should be

if(k>m) stop('k must be smaller or equal the number of rows of the training
                 set')

(Edit: corrected the suggested fix.)

predict.train.kknn() does not respect all parameters from train.kknn()

predict.train.kknn() does not respect all parameters passed to train.kknn(). An example is scale.

For example, predicting after training with scale = FALSE and with scale = TRUE in train.kknn() gives the same results:

library(tidymodels)
data("mtcars")
set.seed(1)
mtcars_split <- initial_split(mtcars, prop = 0.7)

## scale = FALSE
kknn::train.kknn(formula = mpg ~ disp + wt, data = training(mtcars_split), 
                 ks = 5, scale = FALSE) %>% 
  predict(testing(mtcars_split))
#> [1] 21.032 21.784 16.668 16.052 21.264 16.404 26.340 16.076 15.620

## scale = TRUE
kknn::train.kknn(formula = mpg ~ disp + wt, data = training(mtcars_split), 
                 ks = 5, scale = TRUE) %>% 
  predict(testing(mtcars_split))
#> [1] 21.032 21.784 16.668 16.052 21.264 16.404 26.340 16.076 15.620

But kknn() correctly shows a slight difference:

## scale = FALSE
kknn::kknn(formula = mpg ~ disp + wt, train = training(mtcars_split), 
           test = testing(mtcars_split), k = 5, scale = FALSE) %>% 
  predict(newdata = testing(mtcars_split))
#> [1] 21.276 21.276 16.860 16.276 21.276 16.404 29.680 15.700 16.020

## scale = TRUE
kknn::kknn(formula = mpg ~ disp + wt, train = training(mtcars_split), 
           test = testing(mtcars_split), k = 5, scale = TRUE) %>% 
  predict(newdata = testing(mtcars_split))
#> [1] 21.032 21.784 16.668 16.052 21.264 16.404 26.340 16.076 15.620

The issue is that kknn::predict.train.kknn() only respects some of the parameters originally passed to train.kknn(), but not all. scale, na.action, ykernel and contrasts aren't passed along to kknn() inside kknn::predict.train.kknn().

A fix would involve parsing the $call entry of the train.kknn object more carefully.
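
A minimal sketch of such a fix, shown for scale only (hypothetical function name; the same pattern would apply to na.action, ykernel and contrasts):

predict_train_kknn2 <- function(object, newdata, ...) {
  scale_arg <- object$call$scale
  if (is.null(scale_arg)) scale_arg <- TRUE   # kknn()'s default
  res <- kknn::kknn(formula(terms(object)), object$data, newdata,
                    k = object$best.parameters$k,
                    kernel = object$best.parameters$kernel,
                    distance = object$distance,
                    scale = eval(scale_arg, parent.frame()))
  predict(res, ...)
}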

See also this SO question.

New CRAN release?

Are you planning to update the kknn version on CRAN, or is the GitHub version in some way the official release?

Currently the invocation

replicate(100, kknn::kknn(speed ~ dist, cars[1:3, ], cars[1:3, ], k = 7)$fitted.values)

segfaults on the CRAN version, whereas this is fixed here on GitHub; it would be nice to have a version on CRAN that stop()s instead of segfaulting here...

na.action ?

Hello,

First of all, thank you very much for your work on this package.
I just tried to use it and wanted to apply an action to NA values, but it seems that this parameter has no effect on the output.

It also seems to produce some "bad" predictions. The "C" values look correct, but the fitted.values and "CL" do not match the target values (call them Y) of the neighbors that we can look up via "C" in the training data. Instead, they use the Y values of other rows...

The issue seems to come from the NA values in the data, which leads us to think that na.action has no effect.
Have you faced this kind of issue?

Could you help us with this problem? I read your code and did not find any line that seems to deal with na.action. Please don't hesitate to point out that part of the code so we can better understand what the function does.
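
In the meantime, a complete-case workaround sketch (hypothetical variable names):

# Drop incomplete rows explicitly before fitting, since na.action does
# not appear to be applied inside kknn():
train_cc <- train[complete.cases(train), ]
fit <- kknn::kknn(y ~ ., train = train_cc, test = test)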

Anyway, thank you again.

ALy

Citation link does not work anymore.

The link for the citation below does not work anymore:

Hechenbichler K. and Schliep K.P. (2004) Weighted k-Nearest-Neighbor Techniques and Ordinal Classification, Discussion Paper 399, SFB 386, Ludwig-Maximilians University Munich

train.kknn() fails on this data set

Here is a data set on which train.kknn() fails:

args <- list(formula = target ~ ., data = structure(list(V1 = c(270.093653981481, 
279.505245324074, 269.973504027778, 272.901858449074), V2 = c(272.251986666667, 
263.08405, 272.924066666667, 278.41284), V3 = c(271.785323333333, 
262.78406, 272.524073333333, 277.772833333333), V4 = c(272.018636666667, 
263.08405, 272.757393333333, 277.486173333333), V5 = c(272.25199, 
263.38406, 272.890733333333, 277.096166666667), V6 = c(271.785323333333, 
262.78406, 272.390733333333, 277.18616), V7 = c(271.91864, 262.98407, 
272.4574, 276.839506666667), V8 = c(272.051973333333, 263.28406, 
272.59074, 276.626166666667), V9 = c(271.55198, 262.38406, 272.024073333333, 
276.269506666667), V10 = c(271.68532, 262.68405, 272.157413333333, 
276.019493333333), V11 = c(271.91864, 263.08405, 272.290726666667, 
275.92616), V12 = c(272.651983333333, 263.48407, 273.257393333333, 
278.152833333333), V13 = c(272.885303333333, 264.18405, 273.49074, 
277.609506666667), V14 = c(272.151983333333, 263.08405, 272.890733333333, 
278.1795), V15 = c(272.485313333333, 263.68405, 273.157413333333, 
277.792833333333), V16 = c(272.818656666667, 263.78406, 273.49074, 
277.526166666667), V17 = c(272.05198, 263.18405, 272.824066666667, 
277.819493333333), V18 = c(272.318656666667, 263.78406, 273.057413333333, 
277.442833333333), V19 = c(272.718656666667, 263.78406, 273.290726666667, 
277.31284), V20 = c(272.7428, 268.4415, 273.31012, 279.466216666667
), V21 = c(272.316113333333, 266.9915, 272.953453333333, 278.729526666667
), V22 = c(272.626133333333, 267.26147, 273.210133333333, 278.232876666667
), V23 = c(272.726113333333, 267.76147, 273.28346, 277.552876666667
), V24 = c(272.2628, 266.7115, 272.920126666667, 277.812876666667
), V25 = c(272.492786666667, 266.9115, 273.086786666667, 277.316213333333
), V26 = c(272.579446666667, 267.3415, 273.110133333333, 276.89288
), V27 = c(272.099446666667, 266.36148, 272.676786666667, 276.769553333333
), V28 = c(272.296133333333, 266.55148, 272.880126666667, 276.31955
), V29 = c(272.39611, 266.85147, 272.906786666667, 276.13288), 
    V30 = c(273.06945, 268.7415, 273.610133333333, 279.082886666667
    ), V31 = c(273.202793333333, 268.82147, 273.783466666667, 
    278.126206666667), V32 = c(272.606106666667, 268.02148, 273.2368, 
    279.17288), V33 = c(272.81945, 268.23148, 273.470133333333, 
    278.402873333333), V34 = c(273.222786666667, 268.85147, 273.813466666667, 
    277.992866666667), V35 = c(272.526126666667, 267.60147, 273.180133333333, 
    278.68954), V36 = c(272.65278, 267.79147, 273.3268, 277.78286
    ), V37 = c(273.07945, 268.5315, 273.663466666667, 277.686206666667
    ), V38 = c(272.35208, 268.547208333333, 272.98351, 276.773517222222
    ), V39 = c(271.919287777778, 266.888871666667, 272.522400555556, 
    276.060173333333), V40 = c(272.198186666667, 267.080536666667, 
    272.846842222222, 275.789623888889), V41 = c(272.285394444444, 
    267.532205, 272.974061666667, 275.444076111111), V42 = c(271.83708, 
    266.605541666667, 272.473509444444, 275.472956111111), V43 = c(272.047065555556, 
    266.725548333333, 272.721287777778, 275.296294444444), V44 = c(272.14762, 
    267.09221, 272.783520555556, 275.118516666667), V45 = c(271.655945555556, 
    266.302203333333, 272.221287777778, 274.861854444444), V46 = c(271.849853333333, 
    266.41887, 272.456836111111, 274.800739444444), V47 = c(271.969842777778, 
    266.612201666667, 272.529621111111, 274.771848888889), V48 = c(272.667068333333, 
    268.820548333333, 273.275183888889, 276.432408888889), V49 = c(272.812626666667, 
    268.942196666667, 273.412401111111, 275.924067777778), V50 = c(272.255948888889, 
    268.047198333333, 272.902955, 276.460183333333), V51 = c(272.470393888889, 
    268.295543333333, 273.150182222222, 275.97352), V52 = c(272.808732222222, 
    268.768863333333, 273.414064444444, 275.861843333333), V53 = c(272.167622222222, 
    267.568873333333, 272.843507222222, 276.070181111111), V54 = c(272.287625555556, 
    267.79054, 273.027958333333, 275.573514444444), V55 = c(272.65373, 
    268.367206666667, 273.282409444444, 275.614634444444), V56 = c(271.765283333333, 
    263.975515, 272.534132777778, 276.21454), V57 = c(271.320848333333, 
    263.208858333333, 272.050807222222, 275.524530555556), V58 = c(271.526395, 
    263.475513333333, 272.289693888889, 275.404539444444), V59 = c(271.687515, 
    263.70886, 272.395252222222, 275.172864444444), V60 = c(271.270839444444, 
    263.058845, 271.945243888889, 275.098974444444), V61 = c(271.426391111111, 
    263.275516666667, 272.10636, 274.957872222222), V62 = c(271.537506666667, 
    263.542185, 272.161918333333, 274.922308888889), V63 = c(271.054173333333, 
    262.808846666667, 271.584137222222, 274.496197222222), V64 = c(271.215284444444, 
    263.058843333333, 271.756363888889, 274.504533333333), V65 = c(271.426391111111, 
    263.325511666667, 271.878584444444, 274.583413333333), V66 = c(272.115292777778, 
    264.175521666667, 272.839692222222, 275.871197222222), V67 = c(272.376390555556, 
    264.70885, 273.045248333333, 275.580645555556), V68 = c(271.715282777778, 
    263.775518333333, 272.478578888889, 275.903424444444), V69 = c(272.026396666667, 
    264.225521666667, 272.773032222222, 275.561202222222), V70 = c(272.237515, 
    264.308853333333, 273.011913333333, 275.555646666667), V71 = c(271.62084, 
    263.692186666667, 272.400802777778, 275.565636666667), V72 = c(271.881960555556, 
    264.158851666667, 272.639697222222, 275.308424444444), V73 = c(272.137515, 
    264.175518333333, 272.828584444444, 275.342313333333), V74 = c(61, 
    13, 37, 61), V75 = c(2, 0, 1, 2), target = c(5.407, 5.73, 
    5.407, 5.303), V77 = c(6.352, 6.388, 6.352, 6.339), V78 = c(-0.0909756944444445, 
    -0.107152777777778, -0.110204861111111, -0.111579861111111
    )), row.names = 3:6, class = "data.frame"), kmax = 3L, kernel = "rectangular")

library(kknn)
do.call(train.kknn, args)

# Error in best[1, 2] : subscript out of bounds

The error is happening because of the return value of dmEuclid in the C code. When I step through in the R debugger, I see this:

Browse[2]> dmtmp$cl
 [1]          0          1          2          2          2          0          0          0
 [9]          1          2          1          1      32676      32676      32676      32676
[17] 1001200320 1001200320 1001200320          3

I'm guessing this is either uninitialized memory or an overflow/underflow problem.

Memory management

Hey Klaus

I am working on spatial interpolation of German-speaking dialects, similar to what Josh Katz did in his research: http://www4.ncsu.edu/~jakatz2/files/dialectposter.png

What I need to do is interpolate a rectangular grid (a raster, basically) with q cells from n points, each of which contains the spoken dialect d at location lat/lon. d is nominal. The covariates are lat and lon, as simple as that. In theory everything works fine, and I end up with maps like the following (plotted with ggplot2).

(image: example interpolated dialect map, 50000_1000_w300_dark2)

A big problem is memory management, though.

  • q is in the order of millions (for a pixel resolution of 1000x1160, for example)
  • n is ideally in the order of several hundred thousands - the sample is really big, and the more I include in the interpolation the more detailed/beautiful the maps get (even with a high k). Also, the higher n, the higher I need to set k, to get the desired aggregation effect.

The above map has n = 50000, k = 2500 and q = 104400 (pixel width = 300, so a very "low-res" example), yet the computation

dialects.kknn <- kknn(dialect ~ ., 
                      dialects_train, 
                      dialects_test, 
                      kernel = "gaussian", 
                      k = 2500)

already crashes with a message like "cannot allocate vector of size 2.x GB". After upgrading from 8 GB to 16 GB of RAM, it works (and takes around 9 minutes to compute), but the same computation with q ≈ 1 million already fails with "cannot allocate vector of size 10.x GB".

My question is simple: Do you know of any mitigation strategies? Or do I have to set a different parameter, change the kernel? Could this be a memory leak? I am using 1.3.1.

One idea I came up with is to interpolate to a low-res grid in a first run and then resample the raster to a higher resolution with another method, for example raster::resample. I don't know whether nominal values are suitable for that, though.
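
Another mitigation sketch: since k-NN predictions are independent per test row, the grid can be scored in chunks so the full test-by-train distance matrix never has to fit in memory at once (the chunk size of 10000 is an assumption to tune against available RAM):

idx_chunks <- split(seq_len(nrow(dialects_test)),
                    ceiling(seq_len(nrow(dialects_test)) / 10000))
pred <- unlist(lapply(idx_chunks, function(idx) {
  fit <- kknn::kknn(dialect ~ ., dialects_train, dialects_test[idx, ],
                    kernel = "gaussian", k = 2500)
  as.character(fit$fitted.values)
}))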

contr.ordinal

I am unable to specify the ordinal "contrasts" argument within the train.kknn function.

What is the proper way to use "contr.ordinal" within the contrasts argument in the train.kknn function?

The code below results in a nominal response variable, but my response variable is ordinal. How do I change this so that the model knows I am working with an ordinal response?

fit <- train.kknn(response ~ var1 + var2, data = data, kmax = 10, distance = 1,
                  kernel = "triangular",
                  contrasts = c(unordered = "contr.dummy", ordered = "contr.ordinal"))
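
For what it's worth, the response type kknn reports follows the class of the response column rather than the contrasts argument (the contrasts transform the predictors); a sketch using the hypothetical column names from above:

# Declaring the response as an ordered factor should make the model
# treat it as ordinal (the summary then reports Response: "ordinal"):
data$response <- factor(data$response, ordered = TRUE)
fit <- train.kknn(response ~ var1 + var2, data = data, kmax = 10,
                  distance = 1, kernel = "triangular")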

my sessionInfo()

R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] kknn_1.3.1 caret_6.0-77 ggplot2_2.2.1 lattice_0.20-35 GGally_1.3.2

loaded via a namespace (and not attached):
[1] Rcpp_0.12.13 ddalpha_1.3.1 gower_0.1.2 bindr_0.1 pillar_1.1.0 compiler_3.4.1 DEoptimR_1.0-8 RColorBrewer_1.1-2
[9] plyr_1.8.4 iterators_1.0.8 class_7.3-14 tools_3.4.1 rpart_4.1-11 ipred_0.9-6 lubridate_1.7.1 tibble_1.4.1
[17] nlme_3.1-131 gtable_0.2.0 pkgconfig_2.0.1 rlang_0.1.6 igraph_1.1.2 Matrix_1.2-10 foreach_1.4.3 RcppRoll_0.2.2
[25] prodlim_1.6.1 bindrcpp_0.2 e1071_1.6-8 withr_2.0.0 stringr_1.2.0 dplyr_0.7.4 recipes_0.1.0 stats4_3.4.1
[33] nnet_7.3-12 CVST_0.2-1 grid_3.4.1 glue_1.2.0 reshape_0.8.7 robustbase_0.92-7 R6_2.2.2 survival_2.41-3
[41] lava_1.5.1 purrr_0.2.4 reshape2_1.4.2 kernlab_0.9-25 magrittr_1.5 DRR_0.0.2 splines_3.4.1 scales_0.5.0
[49] codetools_0.2-15 ModelMetrics_1.1.0 MASS_7.3-47 sfsmisc_1.1-1 assertthat_0.2.0 dimRed_0.1.0 timeDate_3012.100 colorspace_1.3-2
[57] stringi_1.1.5 lazyeval_0.2.0 munsell_0.4.3

ARPACK error in specClust

For datasets with outliers, kknn::specClust runs into an igraph/ARPACK eigenvalue convergence error.

igraph/igraph#512

On that thread, ntamas recommends removing isolated vertices from the graph, so one solution might be to remove outliers prior to clustering. But that's not working well for me so far.

Here's an MWE:

set.seed(1)
outlier_2clust = matrix(rnorm(1000), ncol = 2)
outlier_2clust[1:250, ] = outlier_2clust[1:250, ] + 40
outlier_2clust = rbind(outlier_2clust, c(40, 0))
plot(outlier_2clust)
library(kknn)
cluster_mod = kknn::specClust(outlier_2clust, centers = 2)

Here's my sessionInfo().

R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.6 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] kknn_1.3.1

loaded via a namespace (and not attached):
[1] magrittr_1.5    Matrix_1.2-8    tools_3.3.1     igraph_1.0.1    grid_3.3.1      lattice_0.20-35

Outdated examples and dangerous design choices

In your documentation for the kknn function you have some outdated examples which seem to be based on an older version of your package.

data(ionosphere)
ionosphere.learn <- ionosphere[1:200,]
ionosphere.valid <- ionosphere[-c(1:200),]
fit.kknn <- kknn(class ~ ., ionosphere.learn, ionosphere.valid)
table(ionosphere.valid$class, fit.kknn$fit)
(fit.train1 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15, 
	kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 1))
table(predict(fit.train1, ionosphere.valid), ionosphere.valid$class)
(fit.train2 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15, 
	kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 2))
table(predict(fit.train2, ionosphere.valid), ionosphere.valid$class)

For fit.train1 and fit.train2, no test argument is used. Perhaps the fact that they are wrapped in parentheses should somehow tip the user off that they are not meant to be used for prediction? I do not think those examples should be there at all.

Another dangerous aspect of the function's documentation is that it gives the impression that the predict() method for kknn can be used to predict on new data, i.e. that kknn follows the convention of almost every other package, where newdata can be supplied as an argument to predict() in order to make predictions on test data.

However, that is not the case with kknn, and I think you should warn users about this in the documentation for kknn(). Users will expect predict.kknn to work like the predict methods of most other packages and may not notice that it always returns predictions for the test set that was supplied when the model was fitted in kknn().

I know these distinctions may be crystal clear to you as the designer/maintainer of the package. But I'm currently in a machine learning class, and a significant portion of the class failed to understand that your predict method cannot output predictions on any other data than the test set that was already supplied.

Thanks for developing the package! This was a bit of feedback to guard against user error.

add type parameter in predict.kknn

In the current version the predict function returns only the fitted values for the test data. Would it be possible to add a type parameter so the function could return the probabilities for each class? This would make it possible to obtain the probabilities using predict.train.kknn.

For instance:

predict.kknn <- function(object, type = 'raw', ...) 
{ 
    pred <- switch(type, raw = object$fit,
                         prob = object$prob,
                         stop('invalid type for prediction'))
    pred
}

predict.train.kknn <- function (object, newdata, ...) 
{
    if (missing(newdata)) 
        return(object$fit)
    res = kknn(formula(terms(object)), object$data, newdata, 
        k = object$best.parameters$k, kernel = object$best.parameters$kernel, 
        distance = object$distance)
   return(predict(res, ...))
}
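
A usage sketch, assuming the redefined methods above have been sourced:

idx <- sample(nrow(iris), 100)
fit <- kknn::kknn(Species ~ ., train = iris[idx, ], test = iris[-idx, ])
head(predict(fit, type = "prob"))   # per-class probability matrix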

[Bug] `kknn` computes incorrect neighbours for k >= train_size - 1

Here is a reproducible example:

set.seed(126)
a <- data.frame(a = rnorm(5), b = rnorm(5), c = c("X0", "X0", "X0", "X1", "X1"))
r <- kknn::kknn(c ~ ., train = a, test = a, k = 4, kernel = "rectangular")
r$C  # The neighbours of the 3rd example are (3, 5, 2, 3), however the last 
     # 3 neighbours should be from (1, 2, 4, 5)

For k = 5, which should work fine since train has 5 examples, the algorithm goes haywire and computes some nonsensical indices. For k >= 6 I would expect the code to fail, but it still runs (and returns seemingly random indices).

specClust error

When I use the specClust function to do spectral clustering on a 9482 x 42 matrix, it always gives me the following error:

Error: C stack usage  7972064 is too close to the limit.

I checked Stack Overflow and then increased the C stack limit: error-c-stack-usage-is-too-close-to-the-limit

After that, I re-ran specClust and got another error:

Error: evaluation nested too deeply: infinite recursion / options(expressions=)?

I tried setting the nn parameter of specClust() to values from 7 to 12, but those errors still occurred. The running environment is an AWS Ubuntu m4.2xlarge instance (8 CPUs, 32 GB memory).

[Bug] `kknn` returns a vector instead of a matrix for D and C when k = 1

Here is a reproducible example:

set.seed(123)
data_train <- data.frame(a = rnorm(3), b = rnorm(3), c = c("X0", "X0", "X1"))
data_test  <- data.frame(a = rnorm(5), b = rnorm(5))

# For k >= 2, knn${CL, W, D, C, prob} are all matrices.
knn <- kknn::kknn(c ~ ., 
                  train = data_train, test = data_test, 
                  k = 2, kernel = "rectangular")
knn$C

# However, for k == 1, knn${CL, W, prob} are matrices while knn${D, C} are vectors.
knn <- kknn::kknn(c ~ ., 
                  train = data_train, test = data_test, 
                  k = 1, kernel = "rectangular")
knn$C
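
The symptom matches base R's default matrix subsetting, which drops a single remaining column to a vector; presumably (an assumption about the package internals) a drop = FALSE is missing somewhere:

m <- matrix(1:6, nrow = 3)
m[, 1]                 # drops to a numeric vector of length 3
m[, 1, drop = FALSE]   # stays a 3 x 1 matrix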
