business-science / anomalize

Tidy anomaly detection

Home Page: https://business-science.github.io/anomalize/


anomalize's Introduction

Anomalize is being Superseded by timetk:

anomalize

R-CMD-check Lifecycle Status Coverage status CRAN_Status_Badge

The anomalize package functionality has been superseded by timetk. We suggest you begin using timetk::anomalize() to benefit from the enhanced functionality and improvements going forward. Learn more about Anomaly Detection with timetk here.

The original anomalize package functionality will be maintained for previous code bases that use the legacy functionality.

To prevent the new timetk functionality from conflicting with old anomalize code, use these lines:

library(anomalize)

anomalize <- anomalize::anomalize
plot_anomalies <- anomalize::plot_anomalies

Tidy anomaly detection

anomalize enables a tidy workflow for detecting anomalies in data. The main functions are time_decompose(), anomalize(), and time_recompose(). When combined, it’s quite simple to decompose time series, detect anomalies, and create bands separating the “normal” data from the anomalous data.

Anomalize In 2 Minutes (YouTube)


Check out our entire Software Intro Series on YouTube!

Installation

You can install the development version with devtools or the most recent CRAN version with install.packages():

# devtools::install_github("business-science/anomalize")
install.packages("anomalize")

How It Works

anomalize has three main functions:

  • time_decompose(): Separates the time series into seasonal, trend, and remainder components.
  • anomalize(): Applies anomaly detection methods to the remainder component.
  • time_recompose(): Calculates limits that separate the “normal” data from the anomalies!

Getting Started

Load the anomalize package. Usually, you will also load the tidyverse!

library(anomalize)
library(tidyverse)
# NOTE: timetk now has anomaly detection built in, which 
#  will get the new functionality going forward.
#  Use this script to prevent overwriting legacy anomalize:

anomalize <- anomalize::anomalize
plot_anomalies <- anomalize::plot_anomalies

Next, let’s get some data. anomalize ships with a data set called tidyverse_cran_downloads that contains the daily CRAN download counts for 15 “tidy” packages from 2017-01-01 to 2018-03-01.

Suppose we want to determine which daily download “counts” are anomalous. It’s as easy as using the three main functions (time_decompose(), anomalize(), and time_recompose()) along with a visualization function, plot_anomalies().

tidyverse_cran_downloads %>%
    # Data Manipulation / Anomaly Detection
    time_decompose(count, method = "stl") %>%
    anomalize(remainder, method = "iqr") %>%
    time_recompose() %>%
    # Anomaly Visualization
    plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.25) +
    ggplot2::labs(title = "Tidyverse Anomalies", subtitle = "STL + IQR Methods") 

Check out the anomalize Quick Start Guide.

Reducing Forecast Error by 32%

Yes! Anomalize has a new function, clean_anomalies(), that can be used to repair time series prior to forecasting. We have a brand new vignette - Reduce Forecast Error (by 32%) with Cleaned Anomalies.

tidyverse_cran_downloads %>%
    dplyr::filter(package == "lubridate") %>%
    dplyr::ungroup() %>%
    time_decompose(count) %>%
    anomalize(remainder) %>%
  
    # New function that cleans & repairs anomalies!
    clean_anomalies() %>%
  
    dplyr::select(date, anomaly, observed, observed_cleaned) %>%
    dplyr::filter(anomaly == "Yes")
#> # A time tibble: 19 × 4
#> # Index:         date
#>    date       anomaly  observed observed_cleaned
#>    <date>     <chr>       <dbl>            <dbl>
#>  1 2017-01-12 Yes     -1.14e-13            3522.
#>  2 2017-04-19 Yes      8.55e+ 3            5202.
#>  3 2017-09-01 Yes      3.98e-13            4137.
#>  4 2017-09-07 Yes      9.49e+ 3            4871.
#>  5 2017-10-30 Yes      1.20e+ 4            6413.
#>  6 2017-11-13 Yes      1.03e+ 4            6641.
#>  7 2017-11-14 Yes      1.15e+ 4            7250.
#>  8 2017-12-04 Yes      1.03e+ 4            6519.
#>  9 2017-12-05 Yes      1.06e+ 4            7099.
#> 10 2017-12-27 Yes      3.69e+ 3            7073.
#> 11 2018-01-01 Yes      1.87e+ 3            6418.
#> 12 2018-01-05 Yes     -5.68e-14            6293.
#> 13 2018-01-13 Yes      7.64e+ 3            4141.
#> 14 2018-02-07 Yes      1.19e+ 4            8539.
#> 15 2018-02-08 Yes      1.17e+ 4            8237.
#> 16 2018-02-09 Yes     -5.68e-14            7780.
#> 17 2018-02-10 Yes      0                   5478.
#> 18 2018-02-23 Yes     -5.68e-14            8519.
#> 19 2018-02-24 Yes      0                   6218.
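The cleaned column can then feed a forecaster. A minimal sketch of that follow-on step (the ARIMA(1,0,0) model below is illustrative only, not the vignette's exact workflow):

```r
library(anomalize)
library(dplyr)

cleaned <- tidyverse_cran_downloads %>%
    dplyr::filter(package == "lubridate") %>%
    dplyr::ungroup() %>%
    time_decompose(count) %>%
    anomalize(remainder) %>%
    clean_anomalies()

# Fit the same simple model to the raw and cleaned series; the vignette's
# point is that the cleaned series yields lower forecast error.
fit_raw     <- arima(cleaned$observed,         order = c(1, 0, 0))
fit_cleaned <- arima(cleaned$observed_cleaned, order = c(1, 0, 0))

predict(fit_cleaned, n.ahead = 7)$pred  # 7-day-ahead forecast
```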

But Wait, There’s More!

There are several extra capabilities:

  • plot_anomaly_decomposition() for visualizing the inner workings of how the algorithm detects anomalies in the “remainder”.
tidyverse_cran_downloads %>%
    dplyr::filter(package == "lubridate") %>%
    dplyr::ungroup() %>%
    time_decompose(count) %>%
    anomalize(remainder) %>%
    plot_anomaly_decomposition() +
    ggplot2::labs(title = "Decomposition of Anomalized Lubridate Downloads")

For more information on the anomalize methods and the inner workings, please see “Anomalize Methods” Vignette.

References

Several other packages were instrumental in developing anomaly detection methods used in anomalize:

  • Twitter’s AnomalyDetection, which implements decomposition using median spans and the Generalized Extreme Studentized Deviation (GESD) test for anomalies.
  • forecast::tsoutliers() function, which implements the IQR method.

Interested in Learning Anomaly Detection?

Business Science offers two 1-hour courses on Anomaly Detection:

anomalize's People

Contributors

amrrs, beansrowning, martenmm, mdancho84, olivroy, tejaslodaya


anomalize's Issues

Question: How to identify anomalous trends?

Hi folks.

First off, I'd like to give a big thank you to the maintainers and contributors of this package. It works great!

However, I do have a question: how would I go about identifying anomalous trends as opposed to transactions ?

Say I have 5 companies and I'm interested in identifying if, amongst those, there is one exhibiting an anomalous trend. Or if I'm selling a group of products, and I'm interested in identifying if any of those have anomalous trends compared to each other. How would I do that?

Again, thank you for the package.

Cheers

'x' needs to be timeBased or xtsible

I'm getting an error when trying to use the time_decompose function on my data frame. The data frame has 3 columns - one is type POSIXct, one is type double and one is type chr:

Here is an example:

start_time (POSIXct)   duration  job_name
2017-09-21 11:09:02           2  analyteQueryRangeJob
2016-03-04 09:09:03           0  analyteQueryRangeJob
2016-07-16 04:09:03           1  analyteQueryRangeJob
2016-12-19 21:09:03           1  analyteQueryRangeJob
2019-01-29 04:09:01           0  analyteQueryRangeJob
2017-07-14 09:09:03           0  analyteQueryRangeJob

when I call time_decompose(batchData, duration) I get this error:

Error in try.xts(x, error = "'x' needs to be timeBased or xtsible") : 'x' needs to be timeBased or xtsible

A little cryptic, but I gather that the xts function is complaining that whatever it is being passed is not the right data type. I checked that function and it supports POSIXct as a type, so anomalize must be passing it the wrong column or something.

I wasn't able to find any information about what format the data frame has to be in for the function to work. I found it odd that there is no way to tell the function which column holds the dates/times and which the labels.
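One workaround sketch, assuming the tibbletime API that anomalize builds on: name the index column explicitly with as_tbl_time() and group by the label column. The toy batchData below stands in for the real data, and whether this resolves the xts error also depends on the series being regular:

```r
library(dplyr)
library(tibbletime)
library(anomalize)

# Toy stand-in for the data frame described above
batchData <- tibble::tibble(
    start_time = as.POSIXct("2017-09-21 00:00:00", tz = "UTC") + (0:199) * 3600,
    duration   = runif(200, 0, 5),
    job_name   = "analyteQueryRangeJob"
)

batchData %>%
    group_by(job_name) %>%                # one series per job
    as_tbl_time(index = start_time) %>%   # tell tibbletime which column is the index
    time_decompose(duration) %>%
    anomalize(remainder) %>%
    time_recompose()
```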

gesd marks second smallest value as an outlier, but not the smallest value

library(anomalize)
tmp <- c(5.458, 5.515, 5.504, 5.358, 5.522, 5.398, 5.531, 5.439, 5.348, 5.538)
cbind(tmp, gesd(tmp, alpha = 0.05, max_anoms = 0.2))

[1,] "5.458" "No"
[2,] "5.515" "No"
[3,] "5.504" "No"
[4,] "5.358" "Yes"
[5,] "5.522" "No"
[6,] "5.398" "No"
[7,] "5.531" "No"
[8,] "5.439" "No"
[9,] "5.348" "No"
[10,] "5.538" "No"

Counter-intuitive output: observation #4 that is marked as an outlier is not even one of the extremes (observation #9 is smaller).

gesd(tmp, alpha = 0.05, max_anoms = 0.2, verbose = TRUE)$outlier_report

# A tibble: 2 x 7
   rank index value limit_lower limit_upper outlier direction
1  1.00  9.00  5.35        5.32       -5.32 No      NA
2  2.00  4.00  5.36        5.39       -5.39 Yes     Up

Shouldn't the above suggest that there are 2 outliers: not only observation #4 (the second smallest value), but also all preceding candidates, namely observation #9 (the actual minimum)?

What is alpha measured in?

Hi there,

Great package. I've been reading the vignettes, and one thing I wasn't clear about was what alpha represents and what it is measured in.

The documentation says:

Controls the width of the "normal" range. Lower values are more conservative while higher values are less prone to incorrectly classifying "normal" observations

Is it measuring a percentile (as the Twitter package does), or some other measurement of the expected distribution?

Many thanks

Alan

method = "gesd" from the function anomalize() seems to not support very low variance data

Preconditions:
I am currently exploring for anomalies a grouped tibble with 3932 groups.
The following data may be used to reproduce the group within which I get an issue.

Code for reproducibility:

df <- tibble(poll_date = seq.Date(from = as.Date("2018-03-14"), 
                                  to = as.Date("2018-06-11"),
                                  by = "1 day"),
             mac_address = rep("c40415fe7968", 90),
             pathloss = rep(50.5, 90))

df[c(4,8),'pathloss'] <- 50.25

df %>% 
  time_decompose(pathloss, merge = TRUE, method = "twitter") %>%
  anomalize(remainder, method = "gesd")

Expected result:
A time tibble classifying each observation as either an anomaly or not.

Actual result:

Converting from tbl_df to tbl_time.
Auto-index message: index = poll_date
frequency = 7 days
median_span = 90 days

Error in if (any(vals_tbl$outlier == "No")) { : 
  missing value where TRUE/FALSE needed

I believe the error is the result of almost all remainders being equal to 0. Consequently, the mean absolute deviation is equal to 0, and the following line of the gesd function yields either Inf or NaN (division by 0).

z <- abs(x_new - median(x_new))/mad(x_new)

A way of handling this event should be included where outliers are being identified.
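A base-R sketch of such a guard (a hypothetical helper, not anomalize's internal code): when the MAD is zero there is no usable scale, so return NA scores instead of dividing by zero.

```r
# Sketch of a zero-MAD guard for a GESD-style scoring step
gesd_step_safe <- function(x_new) {
    m <- mad(x_new)
    if (m == 0) {
        # All remainders are (nearly) identical: no usable scale,
        # so flag nothing instead of producing Inf/NaN z-scores.
        return(rep(NA_real_, length(x_new)))
    }
    abs(x_new - median(x_new)) / m
}

gesd_step_safe(rep(0, 10))        # all NA: nothing can be scored
gesd_step_safe(c(1, 2, 3, 100))   # finite z-scores
```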

how to get only positive anomalies

I am trying to use anomalize for healthcare data. However, I am only interested in positive anomalies. I have looked in the anomalize package documentation, but I am still not sure how to get only the positive anomalies. My best guess is that it is related to "remainder_l2" (the upper limit for anomalies).

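A sketch of one way to keep only the higher-than-expected anomalies, assuming the recomposed_l2 (upper band) column that time_recompose() adds:

```r
library(anomalize)
library(dplyr)

tidyverse_cran_downloads %>%
    time_decompose(count) %>%
    anomalize(remainder) %>%
    time_recompose() %>%
    # An observation above the upper recomposed band is a "positive" anomaly
    filter(anomaly == "Yes", observed > recomposed_l2)
```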

Error in .f(.x[[i]], ...) : object 'X' not found

formatted_df <- master_df %>%
    mutate(dates = mdy(dates), # 4 dates fail to parse as they are provided as 'none'
           job.title = as.character(job.title),
           summary = as.character(summary),
           pros = as.character(pros),
           cons = as.character(cons),
           overall.ratings = as.factor(overall.ratings)) %>%
    rename(review_id = X) %>%
    separate(job.title, into = c("employee_status", "job_title"), sep = " - ", extra = "merge")

Error in .f(.x[[i]], ...) : object 'X' not found
In addition: Warning message:
10144 failed to parse.

Error when loading the library

When loading the library(anomalize) the following error was displayed

Error: package or namespace load failed for ‘anomalize’: .onAttach failed in attachNamespace() for 'anomalize', details: call: NULL, error: Function getThemeInfo not found in RStudio

Error in stats::stl(., s.window = "periodic", t.window = trnd, robust = TRUE) :

First of all, thanks for an awesome package -very useful!
I'm getting an error on some of the data sets I'm running and not sure what's the best way to handle it.
The input data is:
date observation
2019-01-02 35
2019-01-03 54
2019-01-04 48
2019-01-05 2
2019-01-06 3
2019-01-07 44
2019-01-08 67
2019-01-09 47
2019-01-10 53
2019-01-11 47
2019-01-12 0
2019-01-14 41
2019-01-15 67
2019-01-16 61
2019-01-17 58
2019-01-18 52
2019-01-19 3

I read the data and convert to a tibble:

cur_tbl <- as.tibble(myInput)

and then I run the following (which works perfectly on other sets):

cur_tbl %>%
    time_decompose(observation, message = TRUE) %>%
    anomalize(remainder, method = method, alpha = alpha) %>%
    time_recompose()
It returns error:
Converting from tbl_df to tbl_time.
Auto-index message: index = inq_day
frequency = 1 days
trend = 17 days
Error in stats::stl(., s.window = "periodic", t.window = trnd, robust = TRUE) :
series is not periodic or has less than two periods

I read this might be due to duplicated dates, but I don't have that issue. Any help is greatly appreciated.

Thank you!
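The auto-detected frequency of 1 day means STL sees no seasonal cycle in this short series. One sketch of a workaround: force a frequency explicitly so STL has at least two full periods to fit. The choice of "7 days" is an assumption about this data (a weekly cycle in daily observations), not a universal fix:

```r
library(anomalize)
library(dplyr)
library(tibble)

# The 17 observations from the issue, rebuilt as a tibble
cur_tbl <- tibble(
    date = as.Date(c("2019-01-02", "2019-01-03", "2019-01-04", "2019-01-05",
                     "2019-01-06", "2019-01-07", "2019-01-08", "2019-01-09",
                     "2019-01-10", "2019-01-11", "2019-01-12", "2019-01-14",
                     "2019-01-15", "2019-01-16", "2019-01-17", "2019-01-18",
                     "2019-01-19")),
    observation = c(35, 54, 48, 2, 3, 44, 67, 47, 53, 47, 0, 41, 67, 61, 58, 52, 3)
)

cur_tbl %>%
    # Override the auto-detected 1-day frequency with an assumed weekly cycle
    time_decompose(observation, frequency = "7 days", message = TRUE) %>%
    anomalize(remainder) %>%
    time_recompose()
```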


Anomalize fails on non-time-series grouped data

Dear All,
Hopefully the reprex is self-explanatory.
I plan to use anomalize on non-time series data.
It should still work according to the documentation (without the time series decomposition), and it does, but not on grouped non-time-series data.
Any ideas?

library(tidyverse)

library(anomalize)
#> ══ Use anomalize to improve your Forecasts by 50%! ═════════════════════════════
#> Business Science offers a 1-hour course - Lab #18: Time Series Anomaly Detection!
#> </> Learn more at: https://university.business-science.io/p/learning-labs-pro </>

test1 <- tidyverse_cran_downloads %>%
    time_decompose(count) %>%
    anomalize(remainder)
#> Registered S3 method overwritten by 'quantmod':
#>   method            from
#>   as.zoo.data.frame zoo

print(test1)  ##and this works fine
#> # A time tibble: 6,375 x 9
#> # Index:  date
#> # Groups: package [15]
#>    package date       observed season trend remainder remainder_l1 remainder_l2
#>    <chr>   <date>        <dbl>  <dbl> <dbl>     <dbl>        <dbl>        <dbl>
#>  1 broom   2017-01-01    1053. -1007. 1708.    352.         -1725.        1704.
#>  2 broom   2017-01-02    1481    340. 1731.   -589.         -1725.        1704.
#>  3 broom   2017-01-03    1851    563. 1753.   -465.         -1725.        1704.
#>  4 broom   2017-01-04    1947    526. 1775.   -354.         -1725.        1704.
#>  5 broom   2017-01-05    1927    430. 1798.   -301.         -1725.        1704.
#>  6 broom   2017-01-06    1948    136. 1820.     -8.11       -1725.        1704.
#>  7 broom   2017-01-07    1542   -988. 1842.    688.         -1725.        1704.
#>  8 broom   2017-01-08    1479. -1007. 1864.    622.         -1725.        1704.
#>  9 broom   2017-01-09    2057    340. 1887.   -169.         -1725.        1704.
#> 10 broom   2017-01-10    2278    563. 1909.   -194.         -1725.        1704.
#> # … with 6,365 more rows, and 1 more variable: anomaly <chr>




test2 <- tidyverse_cran_downloads %>%
    group_by(package) %>% 
    time_decompose(count) %>%
    anomalize(remainder)

print(test2)  ##and also this works fine
#> # A time tibble: 6,375 x 9
#> # Index:  date
#> # Groups: package [15]
#>    package date       observed season trend remainder remainder_l1 remainder_l2
#>    <chr>   <date>        <dbl>  <dbl> <dbl>     <dbl>        <dbl>        <dbl>
#>  1 broom   2017-01-01    1053. -1007. 1708.    352.         -1725.        1704.
#>  2 broom   2017-01-02    1481    340. 1731.   -589.         -1725.        1704.
#>  3 broom   2017-01-03    1851    563. 1753.   -465.         -1725.        1704.
#>  4 broom   2017-01-04    1947    526. 1775.   -354.         -1725.        1704.
#>  5 broom   2017-01-05    1927    430. 1798.   -301.         -1725.        1704.
#>  6 broom   2017-01-06    1948    136. 1820.     -8.11       -1725.        1704.
#>  7 broom   2017-01-07    1542   -988. 1842.    688.         -1725.        1704.
#>  8 broom   2017-01-08    1479. -1007. 1864.    622.         -1725.        1704.
#>  9 broom   2017-01-09    2057    340. 1887.   -169.         -1725.        1704.
#> 10 broom   2017-01-10    2278    563. 1909.   -194.         -1725.        1704.
#> # … with 6,365 more rows, and 1 more variable: anomaly <chr>


## From the documentation:
## For non-time series data (data without trend), the anomalize()
## function can be used without time
## series decomposition.





test3 <- tidyverse_cran_downloads %>%
    select(-date) %>%
    filter(package=="broom") %>% 
    anomalize(count)


print(test3) ## OK!
#> # A tibble: 425 x 5
#>    count package count_l1 count_l2 anomaly
#>    <dbl> <chr>      <dbl>    <dbl> <chr>  
#>  1  1053 broom     -2535.    7965. No     
#>  2  1481 broom     -2535.    7965. No     
#>  3  1851 broom     -2535.    7965. No     
#>  4  1947 broom     -2535.    7965. No     
#>  5  1927 broom     -2535.    7965. No     
#>  6  1948 broom     -2535.    7965. No     
#>  7  1542 broom     -2535.    7965. No     
#>  8  1479 broom     -2535.    7965. No     
#>  9  2057 broom     -2535.    7965. No     
#> 10  2278 broom     -2535.    7965. No     
#> # … with 415 more rows



### now let us try this on grouped data






test4 <- tidyverse_cran_downloads %>%
    select(-date) %>% 
    group_by(package) %>% 
    anomalize(count)
#> Error in value[[3L]](cond): Error in prep_tbl_time(): No date or datetime column found.

print(test4)  ##and now an error ## what to do?
#> Error in print(test4): object 'test4' not found

Created on 2020-07-30 by the reprex package (v0.3.0)
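One workaround sketch: hand each group to anomalize() as a plain tibble via dplyr::group_modify(), so prep_tbl_time() never sees the grouped frame and the missing date column is not an issue. That anomalize() accepts each piece this way is an assumption based on the ungrouped test3 above:

```r
library(anomalize)
library(dplyr)

test4 <- tidyverse_cran_downloads %>%
    as_tibble() %>%     # drop the tbl_time class
    select(-date) %>%
    group_by(package) %>%
    group_modify(~ anomalize(.x, count))   # each .x is an ungrouped tibble
```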

Evaluation error: invalid 'tz' value..

At executing this snippet:
tidyverse_cran_downloads %>% time_decompose(count, method = "stl", frequency = "auto", trend = "auto")

I am getting:
Error in mutate_impl(.data, dots) : Evaluation error: invalid 'tz' value..

Re-Installing tibbletime did not help.

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2     lubridate_1.6.0  anomalize_0.1.0  dplyr_0.7.4      purrr_0.2.3      readr_1.1.1     
 [7] tidyr_0.7.1      tibble_1.4.2     ggplot2_2.2.1    tidyverse_1.1.1  tibbletime_0.1.1

Use Anomalize with seconds granularity

Hi,

I am struggling using anomalize with my dataset where the time granularity is the second.

If I use the date (day) I can make it work as in the following example:

df <- tibble(date = seq.Date(from = as.Date("2018-03-14"), length.out = 100, by = 1), group = sample( LETTERS[1:4], 100, replace=TRUE), value = runif(100, 5.0, 7.5))

df %>% time_decompose(value, merge = TRUE, method = "twitter") %>% anomalize(remainder, method = "gesd")

Then it works:

Converting from tbl_df to tbl_time.
Auto-index message: index = date
frequency = 7 days
median_span = 50 days
A time tibble: 100 x 10
Index: date
date group value observed season median_spans remainder remainder_l1 remainder_l2 anomaly

1 2018-03-14 C 6.90 6.90 -0.144 6.04 1.01 -2.61 2.61 No
2 2018-03-15 A 5.04 5.04 -0.0231 6.04 -0.976 -2.61 2.61 No
3 2018-03-16 B 6.09 6.09 -0.0846 6.04 0.139 -2.61 2.61 No
4 2018-03-17 C 6.06 6.06 0.0353 6.04 -0.0157 -2.61 2.61 No
5 2018-03-18 D 6.77 6.77 0.0326 6.04 0.704 -2.61 2.61 No
6 2018-03-19 D 6.28 6.28 0.278 6.04 -0.0393 -2.61 2.61 No
7 2018-03-20 C 6.48 6.48 -0.0947 6.04 0.540 -2.61 2.61 No
8 2018-03-21 B 6.02 6.02 -0.144 6.04 0.124 -2.61 2.61 No
9 2018-03-22 D 5.93 5.93 -0.0231 6.04 -0.0866 -2.61 2.61 No
10 2018-03-23 C 5.78 5.78 -0.0846 6.04 -0.168 -2.61 2.61 No
... with 90 more rows

But if I try to have a dataset that combines days and time (to the second):

df <- tibble(date = seq(from = Sys.time(), length.out = 100, by = 1), group = sample( LETTERS[1:4], 100, replace=TRUE), value = runif(100, 5.0, 7.5))

df %>% time_decompose(value, merge = TRUE, method = "twitter") %>% anomalize(remainder, method = "gesd")

Then it doesn't work:

Error in filter_impl(.data, quo) : Result must have length 8, not 0

Could it be the way I prepare the second dataset?
Is it supposed to work with date and time?

Thank you
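One sketch of a workaround: give time_decompose() explicit spans instead of relying on auto-detection, which is tuned for daily/weekly data. The specific spans below ("10 seconds" / "50 seconds") are illustrative guesses for this 100-second toy series, not recommendations:

```r
library(anomalize)
library(dplyr)
library(tibble)

set.seed(1)
df <- tibble(
    date  = seq(from = Sys.time(), length.out = 100, by = 1),
    group = sample(LETTERS[1:4], 100, replace = TRUE),
    value = runif(100, 5.0, 7.5)
)

df %>%
    # Explicit second-scale spans instead of the day-scale auto-detection
    time_decompose(value, method = "stl",
                   frequency = "10 seconds", trend = "50 seconds") %>%
    anomalize(remainder, method = "gesd")
```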

anomalize errors

I have tried to run the code from your page https://business-science.github.io/anomalize/ but I get the following problems.

Initially, the first bit of the code that downloads data from CRAN and then creates the faceted graphs works, but when I run it a second time it tells me

"Error in mutate_impl(.data, dots) :
Evaluation error: 'get_index_col' is not an exported object from 'namespace:tibbletime'."

I started again and got this message this time:

"Error: expr must quote a symbol, scalar, or call"

When I run the next bit of code, # Data Manipulation / Anomaly Detection, I get this error message even though I have changed nothing in your code:

"Error: index attribute is NULL. Was it removed by a function call?"

When I ran # Anomaly Visualization, I got this error message:

"Error in eval(lhs, parent, parent) : object 'lubridate_dloads' not found"

My code follows:

# Using the anomalize package
# https://business-science.github.io/anomalize/

devtools::install_github("business-science/anomalize")
# install.packages("anomalize") ... this one did not work for my version of R

library(tidyverse)
library(anomalize)

# Next, let's get some data. anomalize ships with a data set called
# tidyverse_cran_downloads that contains the daily CRAN download counts
# for 15 "tidy" packages from 2017-01-01 to 2018-03-01.

tidyverse_cran_downloads %>%
    ggplot(aes(date, count)) +
    geom_point(color = "#2c3e50", alpha = 0.25) +
    facet_wrap(~ package, scale = "free_y", ncol = 2) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
    labs(title = "Tidyverse Package Daily Download Counts",
         subtitle = "Data from CRAN by way of cranlogs package")

# Suppose we want to determine which daily download "counts" are anomalous.
# It's as easy as using the three main functions (time_decompose(),
# anomalize(), and time_recompose()) along with a visualization function,
# plot_anomalies().

tidyverse_cran_downloads %>%
    # Data Manipulation / Anomaly Detection
    time_decompose(count, method = "stl") %>%
    anomalize(remainder, method = "iqr") %>%
    time_recompose() %>%
    # Anomaly Visualization
    plot_anomalies(time_recomposed = TRUE, ncol = 2, alpha_dots = 0.25) +
    labs(title = "Tidyverse Anomalies", subtitle = "STL + IQR Methods")

# If you're familiar with Twitter's AnomalyDetection package, you can
# implement that method by combining time_decompose(method = "twitter")
# with anomalize(method = "gesd"). Additionally, we'll adjust the
# trend = "2 months" to adjust the median spans, which is how Twitter's
# decomposition method works.

# Get only lubridate downloads
lubridate_dloads <- tidyverse_cran_downloads %>%
    filter(package == "lubridate") %>%
    ungroup()

# Anomalize!!
lubridate_dloads %>%
    # Twitter + GESD
    time_decompose(count, method = "twitter", trend = "2 months") %>%
    anomalize(remainder, method = "gesd") %>%
    time_recompose() %>%
    # Anomaly Visualization
    plot_anomalies(time_recomposed = TRUE) +
    labs(title = "Lubridate Anomalies", subtitle = "Twitter + GESD Methods")

# Last, we can compare to STL + IQR methods, which use different
# decomposition and anomaly detection approaches.

lubridate_dloads %>%
    # STL + IQR Anomaly Detection
    time_decompose(count, method = "stl", trend = "2 months") %>%
    anomalize(remainder, method = "iqr") %>%
    time_recompose() %>%
    # Anomaly Visualization
    plot_anomalies(time_recomposed = TRUE) +
    labs(title = "Lubridate Anomalies", subtitle = "STL + IQR Methods")

Sign error in limits of GESD method

I believe there is a sign error in the GESD method in anomalize.methods.R.

limit_upper = critical_value * mad - median

Specifically, the issue is concerning the derivation of the upper bound derived from the critical value from the GESD method. These bounds are stored in the outlier_report element of the output when using verbose = TRUE in the gesd function. The bounds are derived from the equation

\frac{\left|x_m - \mathrm{median}(X)\right|}{\mathrm{MAD}} \leq \lambda

which after rearranging gives

\mathrm{median}(X) - \lambda \cdot \mathrm{MAD}  \leq x_m \leq \lambda \cdot \mathrm{MAD} + \mathrm{median}(X)

In particular, the right hand side is \lambda \cdot \mathrm{MAD} + \mathrm{median}(X) and not \lambda \cdot \mathrm{MAD} - \mathrm{median}(X)

I believe changing line 188 to

limit_upper = critical_value * mad + median

would fix the issue.

Anomaly higher or lower than expected

Is there a way to easily pull out anomalies that are higher than expected compared with those that are lower than expected? I'm only interested in anomalies that are higher than expected and haven't figured out a way to filter out those that are lower than expected.

Thanks.
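A sketch of one way to label each anomaly's direction against the recomposed bands (assuming the recomposed_l1/recomposed_l2 column names that time_recompose() produces):

```r
library(anomalize)
library(dplyr)

tidyverse_cran_downloads %>%
    time_decompose(count) %>%
    anomalize(remainder) %>%
    time_recompose() %>%
    mutate(direction = case_when(
        anomaly == "Yes" & observed > recomposed_l2 ~ "higher",
        anomaly == "Yes" & observed < recomposed_l1 ~ "lower",
        TRUE                                        ~ "none"
    )) %>%
    filter(direction == "higher")   # keep only higher-than-expected anomalies
```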

gesd() does not implement the GESD test

The documentation for anomalize::gesd() states that it implements the GESD method, and references @raunakms's gesd() function. But whereas the GESD method and @raunakms's gesd() function compute the test statistic R_i as

|x_i - mean(x)| / sd(x)

anomalize::gesd() uses

|x_i - median(x)| / mad(x)

Whatever the pros and cons of this modification, the result is NOT the GESD method, and is NOT the same as @raunakms's gesd().
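The two test statistics can be compared side by side in base R (neither function below is anomalize's exported code; they just restate the formulas above):

```r
# Classical GESD/ESD statistic: mean/sd
r_classical <- function(x) abs(x - mean(x)) / sd(x)

# anomalize's robust variant: median/MAD
r_robust <- function(x) abs(x - median(x)) / mad(x)

x <- c(5.458, 5.515, 5.504, 5.358, 5.522, 5.398, 5.531, 5.439, 5.348, 5.538)
round(cbind(classical = r_classical(x), robust = r_robust(x)), 2)
```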

time recompose and negative lower bound

I have a statistical question:
In your CRAN downloads time series, after time_recompose() we see the grey area for the upper and lower bounds. As the number of downloads can never be negative, why do the lower bounds go negative?

Thus it could happen that the number of downloads is zero but still not an outlier, because the lower bound is negative! In general, how should this be handled? Maybe log-transform the observed time series, do the analysis, and then transform back?
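A sketch of the log-transform idea from the question. The recomposed_l1/recomposed_l2 column names are assumed from time_recompose(); log1p()/expm1() handle zero counts, and pmax() clamps the back-transformed bands at zero:

```r
library(anomalize)
library(dplyr)

tidyverse_cran_downloads %>%
    mutate(log_count = log1p(count)) %>%   # analyze on the log scale
    time_decompose(log_count) %>%
    anomalize(remainder) %>%
    time_recompose() %>%
    # Back-transform the bands and clamp at zero, since counts can't be negative
    mutate(across(c(recomposed_l1, recomposed_l2), ~ pmax(expm1(.x), 0)))
```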

timetk Dependency Failing

I have been working with this package for over a year now and just went to run the code again which includes the line
install.packages("anomalize")
At the very end of the installation returns I get

Error: package or namespace load failed for ‘timetk’:
object 'required_pkgs' not found whilst loading namespace 'timetk'
Error: loading failed
Execution halted
ERROR: loading failed

  • removing ‘/databricks/spark/R/lib/timetk’
    ERROR: dependency ‘timetk’ is not available for package ‘sweep’
  • removing ‘/databricks/spark/R/lib/sweep’
    ERROR: dependencies ‘timetk’, ‘sweep’ are not available for package ‘anomalize’
  • removing ‘/databricks/spark/R/lib/anomalize’

The downloaded source packages are in
‘/tmp/Rtmprfwu3X/downloaded_packages’

which looks like some of the package dependencies are failing. So when I try to run
library(anomalize) it fails as well.

Any idea what changed?

Anomalize a grouped tsibble

I am using anomalize with a tsibble object. Since the data is grouped using the index_by() function from tsibble, anomalize() cannot work. This is due to an unsupported indexClass of type yearmonth.

I bring in data that is daily with no gaps

Here is my code:

> # Lib Load ####
> install.load::install_load(
+   "tidyquant"
+   , "fable"
+   , "fabletools"
+   , "feasts"
+   , "tsibble"
+   , "timetk"
+   , "sweep"
+   , "anomalize"
+   , "xts"
+   # , "fpp"
+   # , "forecast"
+   , "lubridate"
+   , "dplyr"
+   , "urca"
+   # , "prophet"
+   , "ggplot2"
+ )
> # Get File ####
> fileToLoad <- file.choose(new = TRUE)
> arrivals <- read.csv(fileToLoad)
> View(arrivals)
> arrivals$Time <- mdy(arrivals$Time)
> # Coerce to tsibble ----
> df_tsbl <- arrivals %>%
+   as_tsibble(index = Time)
> df_tsbl
# A tsibble: 6,908 x 2 [1D]
   Time       DSCH_COUNT
   <date>          <int>
 1 2001-01-01         22
 2 2001-01-02         30
 3 2001-01-03         43
 4 2001-01-04         30
 5 2001-01-05         38
 6 2001-01-06         22
 7 2001-01-07         29
 8 2001-01-08         37
 9 2001-01-09         33
10 2001-01-10         52
# ... with 6,898 more rows
> interval(df_tsbl)
1D
> count_gaps(df_tsbl)
# A tibble: 0 x 3
# ... with 3 variables: .from <date>, .to <date>, .n <int>
> # Make Monthly ----
> df_monthly_tsbl <- df_tsbl %>%
+   index_by(Year_Month = ~ yearmonth(.)) %>%
+   summarise(Count = sum(DSCH_COUNT, na.rm = TRUE))
> df_monthly_tsbl           
# A tsibble: 227 x 2 [1M]
   Year_Month Count
        <mth> <int>
 1   2001 Jan  1067
 2   2001 Feb   919
 3   2001 Mar  1024
 4   2001 Apr  1010
 5   2001 May  1056
 6   2001 Jun   995
 7   2001 Jul  1002
 8   2001 Aug  1076
 9   2001 Sep   982
10   2001 Oct   971
# ... with 217 more rows

> # Anomalize ----
> df_monthly_tsbl %>%
+   time_decompose(Count, method = "twitter") %>%
+   anomalize(remainder, method = "gesd") %>%
+   clean_anomalies() %>%
+   time_recompose()
Converting from tbl_ts to tbl_time.
Auto-index message: index = Year_Month
Error in index.xts(x) : unsupported ‘indexClass’ indexing type: yearmonth

Session info:

> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggplot2_3.2.1              urca_1.3-0                 dplyr_0.8.3               
 [4] anomalize_0.2.0            sweep_0.2.2                timetk_0.1.2              
 [7] tsibble_0.8.5              feasts_0.1.1               fable_0.1.1               
[10] fabletools_0.1.1           tidyquant_0.5.9            quantmod_0.4-15           
[13] TTR_0.23-6                 PerformanceAnalytics_1.5.3 xts_0.11-2                
[16] zoo_1.8-6                  lubridate_1.7.4           

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.5   purrr_0.3.3        lattice_0.20-38    colorspace_1.4-1  
 [5] vctrs_0.2.1        generics_0.0.2     utf8_1.1.4         rlang_0.4.2       
 [9] pillar_1.4.3       tibbletime_0.1.3   glue_1.3.1         withr_2.1.2       
[13] lifecycle_0.1.0    stringr_1.4.0      Quandl_2.10.0      munsell_0.5.0     
[17] anytime_0.3.6      gtable_0.3.0       labeling_0.3       curl_4.3          
[21] fansi_0.4.1        broom_0.5.3        Rcpp_1.0.3         backports_1.1.5   
[25] scales_1.1.0       install.load_1.2.1 jsonlite_1.6       farver_2.0.2      
[29] gridExtra_2.3      digest_0.6.23      packrat_0.5.0      stringi_1.4.3     
[33] grid_3.5.3         quadprog_1.5-8     cli_2.0.1          tools_3.5.3       
[37] magrittr_1.5       lazyeval_0.2.2     tibble_2.1.3       crayon_1.3.4      
[41] tidyr_1.0.0        pkgconfig_2.0.3    zeallot_0.1.0      assertthat_0.2.1  
[45] httr_1.4.1         rstudioapi_0.10    R6_2.4.1           nlme_3.1-137      
[49] compiler_3.5.3    
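A hedged workaround for the yearmonth index error above (my own sketch, not from the maintainers): anomalize/tibbletime do not recognize tsibble's `yearmonth` index class, so converting the index to a plain Date before the pipeline may avoid the `unsupported 'indexClass'` error.

```r
# Sketch: convert the tsibble yearmonth index to Date, then run anomalize.
# Assumes df_monthly_tsbl from the reproduction above.
df_monthly_tbl <- df_monthly_tsbl %>%
  tibble::as_tibble() %>%
  dplyr::mutate(Year_Month = as.Date(Year_Month)) %>%   # yearmonth -> Date
  tibbletime::as_tbl_time(index = Year_Month)

df_monthly_tbl %>%
  time_decompose(Count, method = "twitter") %>%
  anomalize(remainder, method = "gesd") %>%
  clean_anomalies() %>%
  time_recompose()
```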

Warning message: `cols` is now required.

Hello,

I received the warning below, but searched your repo and didn't find any reference to it. Can you add documentation or clarification on this warning? Does it matter?

Warning message:
`cols` is now required.
Please use `cols = c(anomalies)`

Thanks,

Alfredo

Can the messages in time_decompose be suppressed in a tidy way?

Thanks for the work on the package, I've mostly been finding it really useful, apart from the following issue.

The following from the example prints a large volume of output to the console:

suppressMessages(library(tibbletime))
suppressMessages(library(anomalize))

tidyverse_cran_downloads_anomalized <- tidyverse_cran_downloads %>%
time_decompose(count, merge = TRUE, message = FALSE)

The output:

Registered S3 method overwritten by 'xts':
  method     from
  as.zoo.xts zoo 
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
Registered S3 methods overwritten by 'forecast':
  method             from    
  fitted.fracdiff    fracdiff
  residuals.fracdiff fracdiff
Warning message:
Detecting old grouped_df format, replacing `vars` attribute by `groups` 

The S3 overwrite warnings are generated at the execution of time_decompose, not at the library call. This makes the output frustrating: you are aware of these operations, but you can't turn them off.

What would be the recommendation to prevent these messages printing to console?
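One hedged workaround (my sketch, not an official answer): the S3-override notices are emitted the first time the xts/quantmod/forecast namespaces are loaded, which here happens lazily inside time_decompose(). Triggering those loads up front inside suppressMessages() should keep the later call quiet, and the grouped_df warning can be caught with suppressWarnings().

```r
# Load the offending namespaces once, silencing their registration messages.
suppressMessages(suppressWarnings({
  requireNamespace("xts")
  requireNamespace("quantmod")
  requireNamespace("forecast")
}))

# Subsequent calls no longer trigger the lazy loads, so they stay quiet.
tidyverse_cran_downloads_anomalized <- tidyverse_cran_downloads %>%
  time_decompose(count, merge = TRUE, message = FALSE)
```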

gesd() does not implement the GESD test

The documentation for anomalize::gesd() states that it implements the GESD method, and references @raunakms's gesd() function. But whereas the GESD method and @raunakms's gesd() function compute the test statistic R_i as

|x_i - mean(x)| / sd(x)

anomalize::gesd() uses

|x_i - median(x)| / mad(x)

Whatever the pros and cons of this modification, the result is NOT the GESD method, and is NOT the same as @raunakms's gesd().
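The difference is easy to see numerically. A small illustration (my own example vector, not from the package tests):

```r
set.seed(1)
x <- c(rnorm(50), 10)   # one planted outlier

# Classical GESD (Rosner) test statistic: centered on the mean, scaled by sd
R_classic <- max(abs(x - mean(x)) / sd(x))

# What anomalize::gesd() computes instead: centered on the median, scaled by MAD
R_robust <- max(abs(x - median(x)) / mad(x))

# The two statistics differ, so comparisons against the GESD critical
# values (and hence the set of flagged anomalies) differ as well.
```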

Critical Limits for IQR Method

In this code block that calculates the critical limits, I don't see a difference between the outlier and non-outlier branches, since limit_tbl$limit_lower and limit_tbl$limit_upper come from limits[1] and limits[2], which are the same for each row.

  if (any(vals_tbl$outlier == "No")) {
    # Non outliers identified, pick first limit
    limit_tbl <- vals_tbl %>%
      dplyr::filter(outlier == "No") %>%
      dplyr::slice(1)
    limits_vec <- c(
      limit_lower = limit_tbl$limit_lower,
      limit_upper = limit_tbl$limit_upper
    )
  } else {
    # All outliers, pick last limits
    limit_tbl <- vals_tbl %>%
      dplyr::slice(n())
    limits_vec <- c(
      limit_lower = limit_tbl$limit_lower,
      limit_upper = limit_tbl$limit_upper
    )
  }
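For context, here is a minimal sketch of how IQR-style limits are typically formed (my own illustration, not the package source); it is consistent with the observation above that every row carries the same limit_lower/limit_upper:

```r
x   <- c(rnorm(100), 8)                     # hypothetical remainder values
q   <- quantile(x, probs = c(0.25, 0.75))
iqr <- unname(q[2] - q[1])

# One global pair of limits, applied to every observation
limits_vec <- c(
  limit_lower = unname(q[1]) - 3 * iqr,     # multiplier 3 is illustrative
  limit_upper = unname(q[2]) + 3 * iqr
)
outlier <- x < limits_vec["limit_lower"] | x > limits_vec["limit_upper"]
```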

anomalize - shiny - plotly - interaction - help needed

Hi, really thanks for this package, it's quite amazing! To the thing:

objective
Use the package to explore some random data in an application:
You can check it at https://github.com/jas1/shiny_ts_anomalize

description
I made an interactive plot with shiny and plotly. It worked quite straightforwardly; nevertheless, I got an issue when trying to show the dates on the plotly tooltip.
The script is: app.r

issue
Trying to show the dates on the plotly tooltip.
I've seen that this issue is common with plotly and can be solved. A similar issue is:
https://community.plot.ly/t/date-format-in-tooltip-of-ggplotly/4766

and the suggested solution is:
https://stackoverflow.com/questions/44770799/date-format-in-tooltip-of-ggplotly

As I do not have access to the plot configuration in anomalize, I don't know how to apply the solution suggested on StackOverflow.

closing

I will really appreciate any suggestion.
Thanks in advance, and keep up the awesome job you're doing.

Add one-sided test capability

This is just a suggestion - there are times when one-sided tests are of interest in anomaly detection. It would be nice to have that capability added to anomalize.
Thanks,
Aaron
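Until such a capability exists, a hedged workaround (my sketch; `data` is a placeholder tbl_time with a count column) is to run the usual two-sided detection and then keep only one direction by the sign of the remainder:

```r
upper_only <- data %>%
  time_decompose(count) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  # reclassify: keep only upward anomalies
  dplyr::mutate(anomaly = ifelse(anomaly == "Yes" & remainder > 0, "Yes", "No"))
```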

Anomalies detected within bounds

I'm running anomalize on large datasets and occasionally come across instances where the anomalize() function finds outliers when the remainder is within the remainder_l1 and remainder_l2 bounds. Theoretically this should not be possible, but unless I'm interpreting the output incorrectly I can't understand this result. In the code below, gesd identifies rows 12 and 16 as anomalies, despite the remainder being greater than the lower bound.

library(tibbletime)
library(anomalize)

#Create data frame
df <- data.frame(date = c("2003-01-01", "2004-01-01", "2005-01-01", "2006-01-01", "2007-01-01", "2008-01-01", "2009-01-01", "2010-01-01",
"2011-01-01", "2012-01-01", "2013-01-01", "2014-01-01", "2015-01-01", "2016-01-01", "2017-01-01", "2018-01-01"),
val = c(13.54941, 13.57737, 13.61070, 13.62143, 13.64319, 13.64563, 13.66624, 13.68140, 13.69086, 13.70454,
13.70949, 13.73307, 13.77554, 13.81119, 13.83046, 13.83948))

df$date <- as.Date(df$date)

#Convert to tibbletime object
df_tbl <- as_tbl_time(df, index = date)

#Run anomalize
results <- df_tbl %>% time_decompose(val, frequency = "auto", trend = "auto", method = "stl") %>%
anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.2) %>%
time_recompose()

Error when Applying Code to a Different Dataset

Dear Developer,

Thank you so much for the wonderful package and beautiful demo! I can't wait to use it!

I was replicating the code and testing it on my own dataset when an error occurred.

library(data.table)
library(ggplot2)
library(scales)
library(zoo)   # needed for as.yearmon() below
old_signals = read.csv("~/Desktop/demo_signals.txt")
old_signals = na.omit(old_signals)
long_sig = melt(old_signals, id.vars = "FM")
long_sig$FM = as.Date(long_sig$FM, "%m/%d/%y")
long_sig$FM = as.Date(as.yearmon(long_sig$FM))

library(tidyverse)
library(anomalize)
long_sig %>%
  ggplot(aes(FM, value)) +
  geom_line(color = "#0000CD", alpha = 0.25) +
  facet_wrap(~ variable, scale = "free_y", ncol = 3) +
  theme_minimal() +
  scale_x_date(date_breaks = "12 month", date_labels = "%Y-%m", date_minor_breaks = "3 month")+
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
  labs(title = "Validate anomalize package",
       subtitle = "Using Merge and Acquisition* signals")

This code generates the plot below:
[plot: validate anomalize package]

I then converted it to tbl_time, instead of data.table using the code below:

names(long_sig)[1] = "FM"
long_sig$variable = as.character(long_sig$variable)
long_sig = prep_tbl_time(long_sig)

class(long_sig) <- c("grouped_tbl_time", class(long_sig))

long_sig %>%
  # Data Manipulation / Anomaly Detection
  time_decompose(value, method = "stl") %>%
  anomalize(remainder, method = "iqr") %>%
  time_recompose() %>%
  # Anomaly Visualization
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.25) +
  labs(title = "Tidyverse Anomalies", subtitle = "STL + IQR Methods") 

However, an error occurred:

[screenshot: error message, 2018-07-12]

I couldn't seem to figure out what went wrong. Help!
Thank you so much for your time and have a wonderful day!
All the best,
Kathy Gao
[email protected]

P.S. Below is the link to the demo dataset:
demo_signals.txt

Error: Only year, quarter, month, week, and day periods are allowed for an index of class Date

I have arranged my own data into a tibble as close to the "tidyverse_cran_downloads" demonstration data as possible:

class(tidyverse_cran_downloads)
[1] "grouped_tbl_time" "tbl_time" "grouped_df" "tbl_df" "tbl" "data.frame"

class(isw_simple)
[1] "grouped_tbl_time" "tbl_time" "grouped_df" "tbl_df" "tbl" "data.frame"

glimpse(tidyverse_cran_downloads)
Observations: 6,375
Variables: 3
Groups: package [15]
$ date 2017-01-01, 2017-01-02, 2017-01-03, 2017-01-04, 2017-01-05, 2017-01-06, 2017-01-07, 2017-01-08, 2017-01-09...
$ count 873, 1840, 2495, 2906, 2847, 2756, 1439, 1556, 3678, 7086, 7219, 0, 5960, 2904, 2854, 5428, 6358, 6973, 661...
$ package "tidyr", "tidyr", "tidyr", "tidyr", "tidyr", "tidyr", "tidyr", "tidyr", "tidyr", "tidyr", "tidyr", "tidyr",...

glimpse(isw_tss)
Observations: 15,744
Variables: 4
Groups: staff_id_last_updt [9]
$ sample_date 2011-06-15, 2011-06-15, 2011-06-22, 2011-06-22, 2011-08-16, 2011-08-29, 2011-08-29, 2011-09-20,...
$ reported_value 68.0, 62.0, 38.0, 3.0, 35.1, 147.0, 147.0, 32.4, 1.0, 0.0, 0.0, 13.0, 130.0, 25.9, 10.4, 10.4, 2...
$ parameter_name "Solids, Total Suspended (TSS)", "Solids, Total Suspended (TSS)", "Solids, Total Suspended (TSS)...
$ staff_id_last_updt "bolafso", "bolafso", "bolafso", "bolafso", "bolafso", "bolafso", "bolafso", "bolafso", "bolafso...

As you can see, both 'class()' and 'glimpse()' show very similar structures. I can replicate the results with the demonstration data just fine. However, when I try and apply the 'time_decompose()' function to my data (isw_tss), I get the "Only year, quarter, month, week, and day periods are allowed for an index of class Date" error message.

I am confused by this as my date data are in the ymd format (same as the demonstration data). Any thoughts would be much appreciated.

I have attached a sample data file
isw_tss.txt

Here is the code I have modified up to the error message bits:

# load libraries

library(tidyverse)
library(tidyquant)
library(lubridate)
library(ggplot2)
library(ggpubr)
library(anomalize)
library(tibbletime)

# read in the data

isw_dmr <- read_csv('C:/Users/kwyther/export_isw_dmr.csv')

# change to lower case and remove rows with no reported value

isw_dmr <- rename_all(isw_dmr, tolower) %>%
drop_na(reported_value)

# change sample_dates to date

isw_dmr$sample_date <- dmy(isw_dmr$sample_date)

# list of parameters

params <- isw_dmr %>% distinct(parameter_name)

# list of staff entering data

staff <- isw_dmr %>% distinct(staff_id_last_updt)

# simplify by parameter

# tss

isw_tss <- isw_dmr %>%
  select(sample_date, reported_value, parameter_name, staff_id_last_updt) %>%
  filter(parameter_name == 'Solids, Total Suspended (TSS)')

isw_tss <- isw_tss %>%
  group_by(staff_id_last_updt) %>%
  as_tbl_time(sample_date)

isw_tss <- isw_tss %>%
  arrange(sample_date, .by_group = TRUE)

isw_tss %>%
  ggplot(aes(sample_date, reported_value)) +
  geom_point(color = "#2c3e50", alpha = 0.25) +
  facet_wrap(staff_id_last_updt ~ .) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
  labs(title = "TSS reported values by staff",
       subtitle = "Data from ISW_DMRs")

isw_tss %>%
  # Data Manipulation / Anomaly Detection
  time_decompose(reported_value, method = "stl") %>%
  anomalize(remainder, method = "iqr") %>%
  time_recompose() %>%
  # Anomaly Visualization
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.25) +
  labs(title = "TSS Anomalies", subtitle = "STL + IQR Methods")

Error on Windows but not on Mac

anom <- usage %>%
  group_by(date_only) %>%
  arrange(date_only) %>%
  summarize(total = sum(amount)) %>%
  as_tibble()

anom_only <- anom %>%
  time_decompose(total) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  filter(anomaly == 'Yes')

This code works as intended on my Mac. However, when it's moved to a Windows server I get the following error:

Error in value[3L] :
Error in prep_tbl_time(): No date or datetime column found.

Any ideas what could be the cause? The date_only variable is, indeed, a Date.

plot_anomaly_decomposition() - Error in -x : invalid argument to unary operator

Looks like a problem with this particular function, or the ggplot ones?

> library(anomalize)
Warning messages:
1: R graphics engine version 12 is not supported by this version of RStudio. The Plots tab will be disabled until a newer version of RStudio is installed. 
2: package ‘anomalize’ was built under R version 3.4.4 
> library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Warning message:
package ‘dplyr’ was built under R version 3.4.2 
> library(ggplot2)
Warning message:
package ‘ggplot2’ was built under R version 3.4.2 
> tidyverse_cran_downloads %>%
+     filter(package == "lubridate") %>%
+     ungroup() %>%
+     time_decompose(count) %>%
+     anomalize(remainder) %>%
+     plot_anomaly_decomposition() +
+     labs(title = "Decomposition of Anomalized Lubridate Downloads")
Converting from tbl_df to tbl_time.
Auto-index message: index = date
frequency = 7 days
trend = 91 days
Error in -x : invalid argument to unary operator
In addition: Warning message:
package ‘bindrcpp’ was built under R version 3.4.1 

Error: Class 'character' is not a known index class.

Dear colleagues,

I am trying to wrap the Anomalize pipeline in a custom function that I want to apply interactively to compare the input with a classification method.

The function is as follows:

CleaningAnomalies <- function(df, alpha_val, max_anoms) {
  df_test <- df %>% time_decompose(global_demand, method = "twitter") %>%
    anomalize(remainder, method = "gesd", alpha=alpha_val, max_anoms = max_anom_percentage) %>%
    clean_anomalies()
  return(df_test)
}

Then I call the function as follows and get the error displayed below:

my_df <- CleaningAnomalies(my_df, 0.05, 0.2)
Error: Class 'character' is not a known index class.
In addition: Warning messages:
1: In to_posixct_numeric.default(index) : NAs introduced by coercion
2: In to_posixct_numeric.default(index) : NAs introduced by coercion
Called from: glue_stop("Class '{class(x)}' is not a known index class.")

Whenever I apply the pipe directly it works like a charm. When I try to debug, it appears to fail in the following step:

function (..., .sep = "") 
{
  stop(glue::glue(..., .sep, .envir = parent.frame()), call. = FALSE)
}

Does someone have any idea of why this is happening?
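One possible contributing slip, worth ruling out (my observation, not confirmed as the cause): the function signature names the argument `max_anoms`, but the body references `max_anom_percentage`, so that value is looked up in the calling environment rather than taken from the argument. A corrected sketch:

```r
CleaningAnomalies <- function(df, alpha_val, max_anoms) {
  df %>%
    time_decompose(global_demand, method = "twitter") %>%
    anomalize(remainder, method = "gesd",
              alpha = alpha_val, max_anoms = max_anoms) %>%  # use the argument
    clean_anomalies()
}
```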

native STL can detect the anomaly but time_decompose cannot

dfr <- tibble(ds = as.Date(c('2017-03-01', '2017-04-01', '2017-05-01', '2017-06-01', '2017-07-01', '2017-08-01', '2017-09-01',
                                  '2017-10-01', '2017-11-01', '2017-12-01', '2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01',
                                  '2018-07-01', '2018-08-01', '2018-09-01', '2018-10-01', '2018-11-01', '2018-12-01', '2019-01-01', '2019-02-01', '2019-03-01',
                                  '2019-04-01', '2019-05-01', '2019-06-01')),
  
                  y=c(12.95, 1.12, 4.48, 17.85, 0.14, 0.7, 1.75, 3.43, 5.18, 14.91, 4.27, 1.82, 4.83, 2.94, 3.22, 6.72, 3.36, 2.52,
                      5.88, 23.1, 13.44, 1244.22, 1.26, 9.66, 22.05, 2.94, 6.3, 1.26))
# outlier in 2018-12-01
decmp <- time_decompose(data = dfr, target = y, method = "stl", frequency = "12 months", trend = "auto", message = FALSE)
fit <- decmp %>% anomalize(remainder, method = "gesd", max_anoms = .3, verbose = FALSE)
# 1244 in 2018-12-01 is seasonality!!!!! and remainder = -0.2793775
fit %>% time_recompose() %>% plot_anomalies()
fit %>% time_recompose() %>% plot_anomaly_decomposition()

#try native STL

t <- ts(data = dfr$y,start = c(2017,3),frequency = 12)
library(highcharter)
hchart(stl(t,s.window = "periodic"))
# in 2018-12-01  remainder is 558 and in this case is anomaly!!!!

# try twitter with same results
decmp <- time_decompose(data = dfr, target = y, method = "twitter", frequency = "12 months", trend = "auto", message = FALSE)
fit <- decmp %>% anomalize(remainder, method = "gesd", max_anoms = .3, verbose = FALSE)
fit

# try original twitter
# devtools::install_github("twitter/AnomalyDetection")
library(AnomalyDetection)
AnomalyDetectionTs(x = dfr)
# timestamp               anoms
# 1 2018-10-01 03:00:00   23.10
# 2 2018-12-01 03:00:00 1244.22

Error in FUN(X[[i]], ...) : object '.group' not found

Hello,

Thanks for this outstanding package.

Following the code posted at https://business-science.github.io/anomalize/, I got this error:

tidyverse_cran_downloads %>%
  # Data Manipulation / Anomaly Detection
  time_decompose(count, method = "stl") %>%
  anomalize(remainder, method = "iqr") %>%
  time_recompose() %>%
  # Anomaly Visualization
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.25)
Error in FUN(X[[i]], ...) : object '.group' not found

Is there any way to overcome this error?

Thanks in advance.

Kind regards,

Guillermo

Using time_decompose(count,method='twitter')

Hi!

using: time_decompose(count, method = 'twitter')

gives me the error...
<simpleError in stats::stl(., s.window = "periodic", robust = TRUE): series is not periodic or has less than two periods>

Is this the right/intended error message?
I'm confused whether the twitter method is used instead of stl at all.

Thank you Matt.

Bug: Inconsistency with tsibble

Dear colleagues,

I had a code that it was working as follows:

# we set the tibbletime index for anomaly detection
GlobalDemand <- as_tbl_time(x = GlobalDemand, index = snsr_ts)
GlobalDemandCleaned <- GlobalDemand %>%
  time_decompose(target = global_demand, method = "twitter") %>%
  anomalize(target = remainder, method = "gesd", alpha = 0.2, max_anoms = 0.2) %>%
  clean_anomalies() %>%
  rename(snsr_ts = snsr_dt)

This code was using tibbletime, and it was detecting my sub-hourly frequency (30 mins) properly. Based on the status of tibbletime, I decided to migrate to tsibble, and now I am doing the following:

GlobalDemand <- as_tsibble(x = aggregated_data_df, index = snsr_ts, regular = TRUE)

# we set the tsibble index for anomaly detection

GlobalDemandCleaned <- GlobalDemand %>% 
  time_decompose(target = global_demand ,method = "twitter") %>%
  anomalize(target = remainder, method = "gesd", alpha=0.2, max_anoms = 0.2) %>%
  clean_anomalies() %>% 
  rename(snsr_ts = snsr_dt)

However, it seems that anomalize ignores the tsibble index and sets snsr_dt as the index:

GlobalDemand %>%  time_decompose(target = global_demand, method = "twitter") 
Converting from tbl_ts to tbl_time. 
Auto-index message: index = snsr_dt 
Error: Problem with `mutate()` input `snsr_dt`. x 
Only year, quarter, month, week, and day periods are allowed for an index of class DateInput `snsr_dt` is `collapse_index(...)`. 
Run `rlang::last_error()` to see where the error occurred.

Last error and trace as follows:

rlang::last_error()
<error/dplyr_error>
Problem with `mutate()` input `snsr_dt`.
x Only year, quarter, month, week, and day periods are allowed for an index of class DateInput `snsr_dt` is `collapse_index(...)`.
Backtrace:
 9. anomalize::time_decompose(., target = global_demand, method = "twitter")
12. anomalize:::time_decompose.tbl_time(...)
11. anomalize::decompose_twitter(...)
22. anomalize::time_frequency(data, period = frequency, message = message)
12. tibbletime::collapse_by(., period = periodicity_target)
36. dplyr:::mutate.data.frame(...)
37. dplyr:::mutate_cols(.data, ...)
Run `rlang::last_trace()` to see the full context.

rlang::last_trace() 
<error/dplyr_error> 
Problem with `mutate()` input `snsr_dt`. x 
Only year, quarter, month, week, and day periods are allowed for an index of class DateInput `snsr_dt` is `collapse_index(...)`. 
Backtrace:1. └─GlobalDemand %>% time_decompose(target = global_demand, method = "twitter")   
2.   ├─base::withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))  
3.   └─base::eval(quote(`_fseq`(`_lhs`)), env, env)   
4.     └─base::eval(quote(`_fseq`(`_lhs`)), env, env)   
5.       └─`_fseq`(`_lhs`)   
6.         └─magrittr::freduce(value, `_function_list`)   
7.           ├─base::withVisible(function_list[[k]](value))   
8.           └─function_list[[k]](value)   
9.             ├─anomalize::time_decompose(., target = global_demand, method = "twitter")  
10.             └─anomalize:::time_decompose.tbl_df(...)  
11.               ├─anomalize::time_decompose(...)  
12.               └─anomalize:::time_decompose.tbl_time(...)  
13.                 └─`%>%`(...)  
14.                   ├─base::withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))  
15.                   └─base::eval(quote(`_fseq`(`_lhs`)), env, env)  
16.                     └─base::eval(quote(`_fseq`(`_lhs`)), env, env)  
17.                       └─anomalize:::`_fseq`(`_lhs`)  
18.                         └─magrittr::freduce(value, `_function_list`)  
19.                           ├─base::withVisible(function_list[[k]](value))  
20.                           └─function_list[[k]](value)  
21.                             └─anomalize::decompose_twitter(...)  
22.                               └─anomalize::time_frequency(data, period = frequency, message = message)  
23.                                 └─`%>%`(...)  
24.                                   ├─base::withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))  
25.                                   └─base::eval(quote(`_fseq`(`_lhs`)), env, env)  
26.                                     └─base::eval(quote(`_fseq`(`_lhs`)), env, env)  
27.                                       └─anomalize:::`_fseq`(`_lhs`)  
28.                                         └─magrittr::freduce(value, `_function_list`)  
29.                                           └─function_list[[i]](value)  
30.                                             └─tibbletime::collapse_by(., period = periodicity_target)  
31.                                               ├─dplyr::mutate(...)  
32.                                               ├─tibbletime:::mutate.tbl_time(...)  
33.                                               │ ├─tibbletime::reconstruct(NextMethod(), copy_.data)  
34.                                               │ └─tibbletime:::reconstruct.tbl_time(NextMethod(), copy_.data)  
35.                                               ├─base::NextMethod()  
36.                                               └─dplyr:::mutate.data.frame(...)  
37.                                                 └─dplyr:::mutate_cols(.data, ...) 
<error/assertError> Only year, quarter, month, week, and day periods are allowed for an index of class Date



Anomalize error

Hi Matt

I am a newbie to R.

I have tried to run the code from your page https://business-science.github.io/anomalize/ but I just cannot get it to work.

I can install all the packages i need but keep getting an error when i want to view tidyverse_cran_downloads: Error: expr must quote a symbol, scalar, or call

Below are my initial steps:

##################################################

# Step 1: Install anomalize
devtools::install_github("business-science/anomalize", force = TRUE)
install.packages("tibbletime")

# Step 2: Load tidyverse and anomalize
library(tidyverse)
library(anomalize)
library(tibbletime)

# check the tidyverse data
tidyverse_cran_downloads

###############################################

I have tried to update tibbletime as suggested, but it does not work.
Can you please help me resolve this issue.

Kind regards

Heinrich

Error: `var` must evaluate to a single number or a column name, not a list

Hi,
I get the following error message when I execute the time_decompose part of the code (I am showing only the first step; adding all the steps as on http://www.business-science.io/code-tools/2018/04/08/introducing-anomalize.html makes no difference):

t %>%
  time_decompose(t) # %>% , method = "stl", frequency = "auto", trend = "auto", message = TRUE

Converting from tbl_df to tbl_time.
Auto-index message: index = dateTime
frequency = 288 minutes
trend = 2880 minutes
Error: var must evaluate to a single number or a column name, not a list

I am unable to attach a sample data file, so I'm showing the first few rows from the R terminal.

t
A tibble: 2,304 x 2
dateTime temp

1 2018-02-08 00:00:00 -11.9
2 2018-02-08 00:05:00 -11.9
3 2018-02-08 00:10:00 -11.9
4 2018-02-08 00:15:00 -11.9
5 2018-02-08 00:20:00 -11.9
6 2018-02-08 00:25:00 -11.9
7 2018-02-08 00:30:00 -11.9
8 2018-02-08 00:35:00 -11.9
9 2018-02-08 00:40:00 -11.9
10 2018-02-08 00:45:00 -11.9
... with 2,294 more rows

Kindly suggest how to proceed.
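A hedged guess at the cause (mine, not from the maintainers): time_decompose() expects the value column as its target, but the call above passes the tibble `t` itself. Passing the column name should avoid the "must evaluate to a single number or a column name" error:

```r
t %>%
  time_decompose(temp, method = "stl", frequency = "auto",
                 trend = "auto", message = TRUE)
```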

time_decompose with missing values

My dataset is much larger but this error is trivially reproduced with the following sample data.

tmp <- structure(list(ds = structure(c(16482, 16483, 16484, 16485, 16486, 
16487), class = "Date"), y = c(2.16618784530387, NA, NA, 1.95971962616822, 
NA, NA)), .Names = c("ds", "y"), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

Obviously there are NA values. I have been using the prophet package to build some forecast models, and it handles NA values without an issue. There are likely outliers in my datasets that I would like to find by leveraging anomalize. However, various NA values cause an error to be thrown by the first time_decompose() call. For example:

tmp %>%
  time_decompose(y)

Converting from tbl_df to tbl_time.
Auto-index message: index = ds
frequency = 1 days
trend = 6 days
Error in na.fail.default(as.ts(x)) : missing values in object

Thoughts on how to address this?
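One hedged way to proceed (my sketch, assuming the tidyverse and zoo are available): impute the gaps before decomposing, since the underlying stl() cannot handle missing values, e.g. with linear interpolation:

```r
library(zoo)

tmp %>%
  dplyr::mutate(y = na.approx(y, x = ds, na.rm = FALSE)) %>%  # interpolate interior NAs
  tidyr::fill(y, .direction = "downup") %>%                   # catch leading/trailing NAs
  time_decompose(y)
```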

Evaluation error: Only year, quarter, ...

I'm getting this error.

Error in mutate_impl(.data, dots) :
Evaluation error: Only year, quarter, month, week, and day periods are allowed for an index of class Date.

I checked the class for my date column and it returns "Date" similar to that of the date column in the example data set. An example of what's in the date column is 2014-12-04

This is the script that I run:
twitter.df %>%
  time_decompose(count, method = "twitter", trend = "2 months") %>%
  anomalize(remainder, method = "gesd") %>%
  time_recompose() %>%
  plot_anomalies(time_recomposed = TRUE)
I can't seem to figure out the problem with it. I've tried updating the packages and reloading them.

Not working with tidyverse 1.2.1

Just making a note for anyone else searching for help on the same problem I had. I upgraded my R version and a lot of packages, and suddenly I got errors when running the standard anomalize package examples:

tidyverse_cran_downloads_anomalized <- tidyverse_cran_downloads %>%
  time_decompose(count, merge = TRUE)

Result:

Error in !.key : invalid argument type

This seems to be already reflected in the information at https://github.com/tidyverse/tidyr/blob/master/revdep/problems.md.

To tide people over until it gets straightened out, I was able to work around the problem by downgrading tidyr:

require(devtools)
install_version("tidyr", version = "0.8.3", repos = "http://cran.us.r-project.org")

Error in mutate_impl(.data, dots): Class 'character' is not a known index class

Hello! I'm having problems using my own data with the package. Here's a little bit of the data for example:

date        installs
2014-10-01     23350
2014-10-02     23154
2014-10-03     22785
2014-10-20     23041
2014-10-21     24170
x <- structure(
  list(
    date = structure(c(16344, 16345, 16346, 16347, 
                       16348, 16349, 16350, 16351, 16352, 16353, 16354, 16355, 16356, 
                       16357, 16358, 16359, 16360, 16361, 16362, 16363, 16364),
                     class = "Date"),
    installs = c(23350L, 23154L, 22785L, 24356L, 24234L, 22774L, 
                 22978L, 23028L, 22708L, 23510L, 25631L, 24591L, 22854L, 22540L, 
                 24313L, 24717L, 24169L, 26092L, 25254L, 23041L, 24170L)
  ),
  row.names = c(NA, -21L), class = c("tbl_df", "tbl", "data.frame")
)

When I run:

x_anomalized <- x %>%
  as_tbl_time("date") %>%
  time_decompose(installs) %>%
  anomalize(remainder) %>%
  time_recompose()

I get:

Error in mutate_impl(.data, dots) : 
  Evaluation error: Class 'character' is not a known index class..
In addition: Warning messages:
1: In to_posixct_numeric.default(index) : NAs introduced by coercion
2: In to_posixct_numeric.default(index) : NAs introduced by coercion

The problem appears to be at the very first step with time_decompose(). This is my sessionInfo():

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14.1

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets 
[6] methods   base     

other attached packages:
[1] bindrcpp_0.2.2   anomalize_0.1.1  tibbletime_0.1.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0       rstudioapi_0.8   bindr_0.1.1     
 [4] magrittr_1.5     tidyselect_0.2.5 munsell_0.5.0   
 [7] lattice_0.20-38  colorspace_1.3-2 R6_2.3.0        
[10] rlang_0.3.0.1    stringr_1.3.1    plyr_1.8.4      
[13] dplyr_0.7.8      xts_0.11-2       tools_3.5.1     
[16] grid_3.5.1       nlme_3.1-137     broom_0.5.0     
[19] gtable_0.2.0     timetk_0.1.1.1   lazyeval_0.2.1  
[22] assertthat_0.2.0 tibble_1.4.2     crayon_1.3.4    
[25] purrr_0.2.5      ggplot2_3.1.0    tidyr_0.8.2     
[28] glue_1.3.0       stringi_1.2.4    compiler_3.5.1  
[31] pillar_1.3.0     backports_1.1.2  scales_1.0.0    
[34] lubridate_1.7.4  zoo_1.8-4        pkgconfig_2.0.2 

Please help. I compared my data with tidyverse_cran_downloads and I cannot figure out what I'm missing. Thank you!
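A hedged guess at the cause (mine, not confirmed): tibbletime expects a bare column name for the index, so the quoted "date" may end up being treated as a character index. Passing it unquoted should avoid the error:

```r
x_anomalized <- x %>%
  as_tbl_time(index = date) %>%   # bare column name, not "date"
  time_decompose(installs) %>%
  anomalize(remainder) %>%
  time_recompose()
```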

No date or datetime column found.

Hi! I'm still in trouble trying to anomalize a dataset called aqw2, and it doesn't work. This is what I get:

Error in value[3L] :
Error in prep_tbl_time(): No date or datetime column found.

Here is the dataset
aqw2.txt

I really need this, it's an important project. Thanks so much!
