mhahsler / arules
Mining Association Rules and Frequent Itemsets with R
Home Page: http://mhahsler.github.io/arules
License: GNU General Public License v3.0
I'm trying to do a market basket analysis using the arules package. However, when I run the apriori algorithm, R reports the following message.
dt <- split(deviceshowlist$prog_title, deviceshowlist$device_id)
dt2 <- as(dt,"transactions")
rules <- apriori(dt2, parameter = list(support = 0.01, confidence = 0.05, minlen=2))
Apriori
Parameter specification:
Error in print.default(parameter) : attempt to apply non-function
I looked at my transaction data structure and at the apriori function, and nothing seems wrong. I hope someone has run into a similar issue before and can help me with it.
Thanks
Can we have a feature that captures the output of inspect
into a data frame? Here is a particular use case: https://stackoverflow.com/questions/50554355/capture-the-output-of-arulesinspect-as-data-frame
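Until such a feature exists, a workaround is to coerce the rules directly to a data frame, which carries the same information inspect() prints (a sketch using the built-in Groceries data):

```r
library(arules)
data("Groceries")
rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5, minlen = 2))

# Coerce instead of capturing printed output: the 'rules' column holds
# the "lhs => rhs" string, the other columns hold the quality measures.
df <- as(rules, "data.frame")
head(df)
```

labels(lhs(rules)) and labels(rhs(rules)) can be used instead if separate LHS/RHS columns are wanted.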
Dear Michael,
May I know why arules generates rules with only one item on the right-hand side? I have seen that you have answered this before, but would you mind explaining a bit more? If only rules with one item on the right-hand side are generated, does that mean many rules (those with more than one item on the right-hand side), and the associations/patterns between items they capture, are neglected?
Is a rule with more than one item on the RHS less valuable (less useful) than rules with only one item on the RHS, so that it can be neglected?
Many thanks for your help.
Hi, My arulesModel model contains 260K rules and has a size of 17 MB. Upon applying the is.redundant method, memory limit is reached on a machine with 16 GB RAM:
is.redundant(arulesModel)
Error: cannot allocate vector of size 250.0 Gb
In addition: Warning messages:
1: In .local(x, y, proper, sparse, ...) : Reached total allocation of 16274Mb: see help(memory.size)
2: In .local(x, y, proper, sparse, ...) : Reached total allocation of 16274Mb: see help(memory.size)
3: In .local(x, y, proper, sparse, ...) : Reached total allocation of 16274Mb: see help(memory.size)
4: In .local(x, y, proper, sparse, ...) : Reached total allocation of 16274Mb: see help(memory.size)
memory.limit()
[1] 16274 # => 16 GB
How can I remove redundant rules without hitting the memory limit? Thanks!
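One possible workaround (a sketch, not an official fix; the thresholds below are arbitrary) is to prune the rule set before the redundancy check, since is.redundant() builds a subset matrix whose size grows quadratically with the number of rules:

```r
# Shrink the rule set first so the internal subset matrix stays small.
rulesPruned <- subset(arulesModel, subset = lift > 1.2 & confidence > 0.6)
rulesPruned <- head(sort(rulesPruned, by = "lift"), n = 50000)

# Note: redundancy is now judged only within the pruned set.
rulesFinal <- rulesPruned[!is.redundant(rulesPruned)]
```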
I noticed that the object size of the rules increases after performing any action on them, like subsetting the RHS or sorting. For example:
object.size(arulesModel)
#16908112 bytes # ~ 17 MB
arulesModelSubset <- subset(arulesModel, subset = rhs %in% cnames)
arulesModelSorted <- sort(arulesModel, by = "lift")
object.size(arulesModelSorted)
#50946536 bytes # 51 MB compared to 17 MB above.
Is this expected or is there possibly a memory leak?
Thanks!
Hi,
I'm trying to run the following script:
data <- list(
c("a","b","c"),
c("a","b"),
c("a","b","d"),
c("b","e"),
c("b","c","e"),
c("a","d","e"),
c("a","c"),
c("a","b","d"),
c("c","e"),
c("a","b","d","e")
)
as(data, "transactions")
via Rscript on the command line, and I get an error:
Error: could not find function "as"
Execution halted
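Rscript historically does not attach the methods package, which provides as() and the S4 machinery arules relies on; loading it explicitly at the top of the script should help (a sketch):

```r
library(methods)  # Rscript may not attach this by default
library(arules)

data <- list(
  c("a","b","c"),
  c("a","b")
)
trans <- as(data, "transactions")
inspect(trans)
```

Alternatively, run the script with Rscript --default-packages=methods,utils,stats.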
In some of my apriori runs, I get 0 rules. The trouble seems to be related to the number of observations, at least in the way I'm using it.
Here is my R script to reproduce the problem. smallRules is calculated properly. However, largeRules stays empty whenever largeObsCount is above 250. I'm actually not sure where the sweet spot is (200 is OK). I was narrowing it down, but unfortunately random.org won't let me run any more tests today. I had reported this on StackOverflow in a comment, but at the time I didn't realize that there were 0 rules.
if(! "arules" %in% installed.packages()) install.packages("arules", depend = TRUE)
library(arules)
if(! "random" %in% installed.packages()) install.packages("random", depend = TRUE)
library(random)
smallItemCount <- 24
smallSampleNames <- as.vector(randomStrings(n=smallItemCount, len=10, unique=TRUE))
shortSamplePaths <- rep("src/", smallItemCount)
smallTmpData <- data.frame(paths=shortSamplePaths,names = smallSampleNames)
smallSampleItems <- interaction(smallTmpData[head(names(smallTmpData))], sep= "")
smallObsCount = 500
smallSampleData <- data.frame(
X = sample(smallSampleItems, smallObsCount, replace = TRUE),
Y = sample(smallSampleItems, smallObsCount, replace = TRUE)
)
smallRules <- apriori(smallSampleData, parameter = list(supp = 0.005, conf = 0.1, minlen = 2))
largeItemCount = 578
largeSampleNames <- as.vector(randomStrings(n=largeItemCount, len=10, unique=TRUE))
#longSamplePaths <- rep("modules/junit4/src/test/java/org/powermock/modules/junit4/", largeItemCount)
longSamplePaths <- rep("junit4/", largeItemCount)
bigTmpData <- data.frame(paths=longSamplePaths,names = largeSampleNames)
bigSampleItems <- interaction(bigTmpData[head(names(bigTmpData))], sep= "")
largeObsCount = 250
bigSampleData <- data.frame(
X = sample(bigSampleItems, largeObsCount, replace = TRUE),
Y = sample(bigSampleItems, largeObsCount, replace = TRUE)
)
bigRules <- apriori(bigSampleData, parameter = list(supp = 0.005, conf = 0.1, minlen = 2))
I recently came across a rule with a huge negative Odds Ratio:
> rules[69804] %>% interestMeasure(transactions, "oddsRatio")
[1] -5.954607e+20
I guess it's hard to replicate without supplying my whole dataset, but I tried to track down the reason:
In the function .getCounts, the variable f01 becomes negative.
It is created by subtracting fx1 - f11, i.e.
.rhsSupport(x, transactions, reuse) * N - interestMeasure(x, "support", transactions, reuse) * N
or, as I (hopefully correctly?) interpret it, supp(Y) - supp(X=>Y), which should never become negative.
Here are the fully printed raw numbers, where you can see that f11 is in fact represented as larger than fx1:
> sprintf("%.66f",.rhsSupport(x, transactions, reuse))
[1] "0.000092340537126436346296656787480117145605618134140968322753906250"
> sprintf("%.66f",interestMeasure(x, "support", transactions, reuse))
[1] "0.000092340537126436359853520752238864588434807956218719482421875000"
I guess the error happens somewhere at a lower level, like arules::support() or even at the C level, but I didn't track it down any further. It is probably an R floating-point rounding error somewhere.
I just wanted to draw attention to this; maybe a simple check could be implemented here, replacing negative numbers with zero and issuing a warning.
Clarification edit: Of course, only the result of this subtraction should be rounded to zero, so the division yields Inf, which is within the definition range of an odds ratio.
In previous versions, discretize worked well on vectors containing NAs. This is not a big issue, but in general this function works differently from the old one: on the same data it does not allow me to use the same number of bins and tells me to decrease it. So I went back to using the old discretize and the cut2 function.
Arules outputs many diagnostic messages such as Parameter specification and Algorithmic control.
These messages cannot be suppressed with the suppressMessages()
function. The gist of the problem might be described in this stackoverflow entry.
As a side note, the warning "You chose a very low absolute support count of 0. You might run out of memory! Increase minimum support." can be suppressed with suppressWarnings().
Working suppressMessages() would be nice when apriori is invoked programmatically, from other packages.
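Until that works, two workarounds (sketches, assuming trans is a transactions object) are the verbose control option and capture.output():

```r
# Option 1: ask apriori itself to be quiet.
rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5),
                 control = list(verbose = FALSE))

# Option 2: swallow whatever is still printed to standard output.
invisible(capture.output(
  rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5))
))
```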
I was trying to remove redundant rules from my association rules (3xxx rules in total) using the is.redundant function:
inspect(rules[!is.redundant(rules)])
It turns out just 54 rules are not redundant. I was quite shocked, so I tested another example, which is here: http://www.rdatamining.com/examples/association-rules
The redundant rules, as shown there, should be rules [2], [4], [7], and [8], but when I used the is.redundant function it reported that only rules [4] and [8] are NOT redundant, which is apparently totally wrong.
Is there anything wrong with the function or have I misused it?
I performed association rule mining:
trans <- read.transactions("C:/Users/synthex/Downloads/Groceries.csv", format = 'basket', sep = ',')
rules <- apriori(trans, parameter = list(supp = 0.001, conf = 0.8))
rules <- apriori(trans, parameter = list(supp = 0.001, conf = 0.8,maxlen=3))
plot(rules,method="graph",engine='interactive',shading=NA)
Then I got this plot:
https://i.stack.imgur.com/3jr7x.jpg
What do the red circles mean, and is it possible to label them?
mydat
structure(list(chocolate = structure(c(9L, 13L, 1L, 8L, 16L,
2L, 14L, 11L, 7L, 15L, 17L, 5L, 10L, 4L, 3L, 6L, 2L, 18L, 12L
), .Label = c("bottled water", "canned beer", "chicken,citrus fruit,tropical fruit,root vegetables,whole milk,frozen fish,rollsbuns",
"chicken,pip fruit,other vegetables,whole milk,dessert,yogurt,whippedsour cream,rollsbuns,pasta,soda,waffles",
"citrus fruit,pip fruit,root vegetables,other vegetables,whole milk,cream cheese ,domestic eggs,brown bread,margarine,baking powder,waffles",
"frankfurter,citrus fruit,onions,other vegetables,whole milk,rollsbuns,sugar,soda",
"frankfurter,rollsbuns,bottled water,fruitvegetable juice,hygiene articles",
"frankfurter,sausage,butter,whippedsour cream,rollsbuns,margarine,spices",
"fruitvegetable juice", "hamburger meat,other vegetables,whole milk,curd,yogurt,rollsbuns,pastry,semi-finished bread,margarine,bottled water,fruitvegetable juice",
"meat,citrus fruit,berries,root vegetables,whole milk,soda",
"packaged fruitvegetables,whole milk,curd,yogurt,domestic eggs,brown bread,mustard,pickled vegetables,bottled water,misc. beverages",
"pickled vegetables,coffee", "root vegetables", "tropical fruit,margarine,rum",
"tropical fruit,pip fruit,onions,other vegetables,whole milk,domestic eggs,sugar,soups,tea,soda,hygiene articles,napkins",
"tropical fruit,root vegetables,herbs,whole milk,butter milk,whippedsour cream,flour,hygiene articles",
"turkey,pip fruit,salad dressing,pastry"), class = "factor")), .Names = "chocolate", class = "data.frame", row.names = c(NA,
-19L))
Hi, when I train the model as follows:
traintrans <- as(traindata.data.frame, "transactions")
rules <- apriori(traintrans, parameter = list(supp=0.001, minlen = 2, maxlen=5, conf = 0.5, target = "rules")
, appearance = list(rhs = cnames))
where cnames <- colnames(traindata.data.frame)[7:9]; that is, I would like to train rules only for recommendations from the list cnames. Why do some extra recommendations also trickle into the resulting output? Isn't the appearance parameter supposed to control which set of values one gets recommendations for?
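For what it's worth, in my experience the RHS restriction only behaves as expected when the appearance default for all unlisted items is set explicitly, e.g. default = "lhs" (a sketch based on the call above):

```r
# Items in cnames may appear on the RHS; all other items are
# restricted to the LHS via default = "lhs".
rules <- apriori(traintrans,
                 parameter  = list(supp = 0.001, minlen = 2, maxlen = 5,
                                   conf = 0.5, target = "rules"),
                 appearance = list(rhs = cnames, default = "lhs"))
```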
Thanks!
Hi:
Was wondering how I could look for a particular word in RHS of appearance class in the apriori code.
rulesD<-apriori(data = txn, parameter=list(supp=0.0005,conf = 0.01),
appearance = list(default="lhs",rhs = grep("DATE|DATES", rhs, value = TRUE)),
control = list(verbose=F))
Here I need a list of products on RHS that contains the word "DATE" or "DATES".
rhs = grep("DATE|DATES", rhs, value = TRUE) in the code results in an error.
Following the below example:
rules<-apriori(data=dt, parameter=list(supp=0.001,conf = 0.8), appearance = list(default="lhs",rhs="Banana"), control = list(verbose=F))
Only products whose label exactly matches "Banana" are returned. However, "ORGANIC BANANA" is not returned in the RHS.
Sorry if this is a simple question! :) Searched the cran doc, did not find a solution.
Thank you for your help.
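appearance expects a character vector of item labels, so the grep() call cannot reference an undefined rhs variable; one way (a sketch) is to grep over itemLabels() of the transactions first and pass the resulting vector:

```r
# All item labels containing "DATE" (this also matches "DATES").
dateItems <- grep("DATE", itemLabels(txn), value = TRUE)

rulesD <- apriori(txn,
                  parameter  = list(supp = 0.0005, conf = 0.01),
                  appearance = list(default = "lhs", rhs = dateItems),
                  control    = list(verbose = FALSE))
```

The same trick works for the banana example: rhs = grep("BANANA", itemLabels(dt), value = TRUE) would include "ORGANIC BANANA".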
My name is Feng. I am using your arules package 1.5.2 to find related products in my retail data. Now, I am confused about two measures in the function interestMeasure: kappa and leastContradiction.
In the package manual, there is a piece of code explaining how to use interestMeasure. I changed the code a little bit:
data("Income")
rules <- apriori(Income)
quality(rules)$kappa <- interestMeasure(rules, measure = 'kappa', transactions = Income)
quality(rules)$leastContradiction <- interestMeasure(rules, measure = 'leastContradiction', transactions = Income)
try <- as(rules, 'data.frame')
Then, we can see the ranges of these two measures are:
summary(try$leastContradiction)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.08794 0.13920 0.17000 0.18930 0.22170 0.90460
summary(try$kappa)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-43160000 -20510000 -19140000 -17660000 -12220000 -8042000
You can see that the range of kappa is very different from what the manual describes: [-1, 1].
When I use these two measures on my own data, I have:
summary(myData1$kappa)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-5767000000000 -5765000000000 -5756000000000 -5745000000000 -5728000000000 -5610000000000
summary(myData1$leastContradiction)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-218.9000 -5.4530 -2.0120 -4.9540 -1.1050 0.8824
Could you please explain to me how to use these two measures?
Thanks a lot
Feng
I use arulesSequences and I want to access/get values from the sequences at every level, for example:
x <- read_baskets(file.choose(), info = c("sequenceID","eventID","SIZE"))
as(x,"data.frame")
s0 <- cspade(x, parameter = list(support = 0, maxsize = 5, maxlen = 5))
seq = as(s0, "data.frame")
The variable seq returned this:
1 <{257}> 0.02777778
2 <{259},{305}> 0.02777778
3 <{259},{305}> 0.02777778
4 <{259},{305}> 0.02777778
5 <{259},{305}> 0.02777778
...
27 <{292},{305}> 0.02777778
28 <{293},{305}> 0.02777778
29 <{292},{293},{305}> 0.02777778
30 <{290},{293},{305}> 0.02777778
31 <{259},{293},{305}> 0.02777778
32 <{290},{292},{293},{305}> 0.02777778
33 <{259},{292},{293},{305}> 0.02777778
I want to get the first and last value of each sequence, something like this:
1 : first = last = 257
2 : first = 259, last = 305
29 : first = 292, last = 305
33 : first = 259, last = 305
Is that possible?
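One way (a sketch over the data.frame representation, assuming single-item elements as in the output above) is to strip the brackets and braces from the sequence strings and take the first and last element:

```r
# seq$sequence holds strings like "<{259},{293},{305}>"
s <- as.character(seq$sequence)

# Remove "<", ">", "{", "}" and split on "," to get the elements.
elems <- strsplit(gsub("[<>{}]", "", s), ",")

first <- sapply(elems, head, 1)
last  <- sapply(elems, tail, 1)
```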
I am not sure if there is an error in the appearance parameter (or whether I do not understand its meaning).
I execute:
load("./titanic.raw.rdata")
library(arules)
rules <- apriori(titanic.raw, control = list(verbose=F),
parameter = list(supp=0.002, conf=0.01),
appearance = list( rhs=c("Survived=Yes"),
lhs=c("Class=1st", "Class=2nd", "Class=3rd",
"Age=Child", "Age=Adult")))
inspect(rules)
rules2 <- apriori(titanic.raw, control = list(verbose=F),
parameter = list(supp=0.002, conf=0.01),
appearance = list( rhs=c("Survived=Yes"),lhs=c("Class=1st")))
inspect(rules2)
My concern is this: the first call produces 12 rules, with some of the attributes in:
---> lhs=c("Class=1st", "Class=2nd", "Class=3rd",
"Age=Child", "Age=Adult")
But in the second call to apriori I pass a less restrictive set of conditions:
--->,lhs=c("Class=1st")
My question is: why does the second call generate fewer rules?
For instance, in the first run the rule
[8] {Class=1st,Age=Child} => {Survived=Yes} 0.002726034 1.0000000
is generated.
But in the second run this rule is not generated, even though it has Class=1st on the left.
Could you help me?
Thank you very much in advance
ANGEL MORA BONILLA
University of Málaga, Spain
When applying is.closed to an itemsets object, I sometimes get an "invalid count" error. This seems to occur only when the itemset has a support count of 1, but it does not occur with every itemset with a support count of 1. I'm trying to find out why this happens.
Example from rwdvc returns all transactions!
require(arules)
data("Adult")
## Mine association rules.
rules <- apriori(Adult,
parameter = list(supp = 0.5, conf = 0.9, target = "rules", minlen = 2))
summary(rules)
sub_rules <- rules[1]
inspect(sub_rules)
sub_trans <- subset(Adult, items %in% lhs(sub_rules))
Hi, I get the following error from is.subset, example shown below:
is.subset(as(list(c("salt","water"),c("pepper")), "transactions"), as(list(c("salt","water")), "transactions"))
Error in .local(x, ...) :
All item labels in x must be contained in 'itemLabels' or 'match'.
The above error does not allow me to make predictions for NEW transactions when I use the is.subset method as:
rulesMatchLHS <- is.subset(rules@lhs, newtrans)
where "rules" is the model from "apriori" with target = "rules".
I get the above error when the new data ("newtrans") has columns missing from the training data.
Also, if I try using "subset" instead, I get the following error when transactions in newtrans contain newer items/values (columns/values) not seen in the training data:
Error in
subset(rules, subset = rules@lhs %in% LIST(newtrans[1])[[1]], :
Error in rules@lhs %in% LIST(newtrans[1])[[1]] : table contains an unknown item label
Could you please help fix the errors above? Or please point me to how I could make predictions using new transactions that may contain new items, or be missing items, relative to the training data.
Thanks!
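One possible workaround (a sketch; function signatures may differ between arules versions) is to drop unseen items from the new data and re-encode it against the training item labels, so both sides share the same item universe:

```r
trainLabels <- itemLabels(rules)

# Back to a list of item-label vectors, dropping items the model
# has never seen.
newList <- LIST(newtrans, decode = TRUE)
newList <- lapply(newList, intersect, trainLabels)

# Re-encode against the training labels, then match the LHS as before.
newItems <- encode(newList, itemLabels = trainLabels)
rulesMatchLHS <- is.subset(lhs(rules), newItems)
```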
I must perform association rule mining in R, and I found an example here:
http://www.salemmarafi.com/code/market-basket-analysis-with-r/
In this example they work with data(Groceries), but they provide the original dataset Groceries.csv:
structure(list(chocolate = structure(c(9L, 13L, 1L, 8L, 16L,
2L, 14L, 11L, 7L, 15L, 17L, 5L, 10L, 4L, 3L, 6L, 2L, 18L, 12L
), .Label = c("bottled water", "canned beer", "chicken,citrus fruit,tropical fruit,root vegetables,whole milk,frozen fish,rollsbuns",
"chicken,pip fruit,other vegetables,whole milk,dessert,yogurt,whippedsour cream,rollsbuns,pasta,soda,waffles",
"citrus fruit,pip fruit,root vegetables,other vegetables,whole milk,cream cheese ,domestic eggs,brown bread,margarine,baking powder,waffles",
"frankfurter,citrus fruit,onions,other vegetables,whole milk,rollsbuns,sugar,soda",
"frankfurter,rollsbuns,bottled water,fruitvegetable juice,hygiene articles",
"frankfurter,sausage,butter,whippedsour cream,rollsbuns,margarine,spices",
"fruitvegetable juice", "hamburger meat,other vegetables,whole milk,curd,yogurt,rollsbuns,pastry,semi-finished bread,margarine,bottled water,fruitvegetable juice",
"meat,citrus fruit,berries,root vegetables,whole milk,soda",
"packaged fruitvegetables,whole milk,curd,yogurt,domestic eggs,brown bread,mustard,pickled vegetables,bottled water,misc. beverages",
"pickled vegetables,coffee", "root vegetables", "tropical fruit,margarine,rum",
"tropical fruit,pip fruit,onions,other vegetables,whole milk,domestic eggs,sugar,soups,tea,soda,hygiene articles,napkins",
"tropical fruit,root vegetables,herbs,whole milk,butter milk,whippedsour cream,flour,hygiene articles",
"turkey,pip fruit,salad dressing,pastry"), class = "factor")), .Names = "chocolate", class = "data.frame", row.names = c(NA,
-19L))
I load this data:
g = read.csv("g.csv", sep = ";")
so I must convert it to transactions as arules requires:
trans = as(g, "transactions")
Let's examine data(Groceries):
> str(Groceries)
Formal class 'transactions' [package "arules"] with 3 slots
..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
.. .. ..@ i : int [1:43367] 13 60 69 78 14 29 98 24 15 29 ...
.. .. ..@ p : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
.. .. ..@ Dim : int [1:2] 169 9835
.. .. ..@ Dimnames:List of 2
.. .. .. ..$ : NULL
.. .. .. ..$ : NULL
.. .. ..@ factors : list()
..@ itemInfo :'data.frame': 169 obs. of 3 variables:
.. ..$ labels: chr [1:169] "frankfurter" "sausage" "liver loaf" "ham" ...
.. ..$ level2: Factor w/ 55 levels "baby food","bags",..: 44 44 44 44 44 44 44 42 42 41 ...
.. ..$ level1: Factor w/ 10 levels "canned food",..: 6 6 6 6 6 6 6 6 6 6 ...
..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
>
and my converted data from original csv
> str(trans)
Formal class 'transactions' [package "arules"] with 3 slots
..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
.. .. ..@ i : int [1:9835] 1265 6162 6377 4043 3585 6475 4431 3535 4401 6490 ...
.. .. ..@ p : int [1:9836] 0 1 2 3 4 5 6 7 8 9 ...
.. .. ..@ Dim : int [1:2] 7011 9835
.. .. ..@ Dimnames:List of 2
.. .. .. ..$ : NULL
.. .. .. ..$ : NULL
.. .. ..@ factors : list()
..@ itemInfo :'data.frame': 7011 obs. of 3 variables:
.. ..$ labels : chr [1:7011] "tr=abrasive cleaner" "tr=abrasive cleaner,napkins" "tr=artif. sweetener" "tr=artif. sweetener,coffee" ...
.. ..$ variables: Factor w/ 1 level "tr": 1 1 1 1 1 1 1 1 1 1 ...
.. ..$ levels : Factor w/ 7011 levels "abrasive cleaner",..: 1 2 3 4 5 6 7 8 9 10 ...
..@ itemsetInfo:'data.frame': 9835 obs. of 1 variable:
.. ..$ transactionID: chr [1:9835] "1" "2" "3" "4" ...
>
We see that data(Groceries) holds transactions in sparse format with
9835 transactions (rows) and
169 items (columns),
while my trans data has
9835 transactions (rows) and
7011 items (columns).
That is, I got 7011 columns from Groceries.csv, whereas the embedded example has 169 columns.
Why is that? How do I convert this file correctly?
I must understand this, because I can't work with my file as it is.
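The 7011 columns most likely come from reading the CSV into a data.frame: each distinct basket line becomes a single factor level, i.e. one "item", instead of being split into individual items. Reading the file directly as basket data should give the 169-item representation (a sketch, assuming one comma-separated basket per line in g.csv):

```r
library(arules)

# Each line is one transaction; items are separated by commas.
trans <- read.transactions("g.csv", format = "basket", sep = ",")
summary(trans)
```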
I'm trying to create a flexdashboard with results from the arules apriori function, to display association rules for specific items selected from the pull-down menu in the markdown dashboard. When I create the function outside the markdown environment, I can successfully pass the selected product item to the apriori subset without problems. However, when I replace the function variable with the reactive function name, I get the following error message: "Error in ==: comparison (1) is possible only for atomic and list types"
library(flexdashboard)
library(htmlwidgets)
library(htmltools)
library(knitr)
library(arules)
library(arulesViz)
library(datasets)
library(reshape2)
library(scales)
data(Groceries)
rules <- apriori (data=Groceries
,parameter=list (supp=0.001,conf = 0.15,minlen = 2,maxlen=5)
,control = list (verbose=F) )
selectInput("Product","Product", c("whole milk","sugar"), selected = "whole milk")
### Associated Product
Product <- reactive({input$Product})
renderDataTable({
#using this direct call works
#Product<-"whole milk"
#subrules<- subset(rules, subset=(lhs %in% as.character(Product) & !(rhs %in% as.character(Product))))
#This reactive function produces error: "Error in ==: comparison (1) is possible only for atomic and list types"
subrules <- rules[lhs(rules) %pin% as.character(Product()) & !rhs(rules) %pin% as.character(Product())]
rules_conf <- sort (subrules, by=c("confidence"), decreasing=TRUE) # 'high-confidence' rules.
redundant <- which (colSums (is.subset (rules_conf, rules_conf)) > 1) # get redundant rules in vector
rules_conf <- rules_conf[-redundant] # remove redundant rules
rules_conf2<-as(rules_conf,"data.frame")
# split lhs and rhs into two columns
rules_conf2<-transform(rules_conf2, rules = colsplit(rules, pattern = "=>", names = c("lhs","rhs")))
# convert to character
rules_conf2$rules$lhs <- as.character(rules_conf2$rules$lhs)
rules_conf2$rules$rhs <- as.character(rules_conf2$rules$rhs)
rules_conf3<-data.frame(LHS=rules_conf2$rules$lhs
,RHS=rules_conf2$rules$rhs
,Support=percent(rules_conf2$support)
,Confidence=percent(rules_conf2$confidence)
,Lift=round(rules_conf2$lift,2))
DT::datatable(rules_conf3,rownames=F
,options = list(pageLength = 10
,columnDefs = list(list(className = 'dt-center', targets = 2:4
,autoWidth = TRUE, searchable = FALSE))))
})
Hello,
The documentation (and function name) of is.redundant suggests that it should return TRUE for rules that are redundant, i.e. rules that have a negative improvement. Instead, it returns TRUE for rules with a positive improvement.
Thanks!
Hi, for the apriori algorithm, I use the following parameters:
support, minlen, maxlen, confidence, target (= "rules"). I am currently using this set both to tune my model and to limit its size (that is, the number of rules generated).
It would be immensely helpful to have a separate parameter to control the size of the model, for example something like "maxrules", so that one can fine-tune the model (for better performance) using the existing parameters while also capping the number of rules. Right now, if I fine-tune my model using the existing set of parameters, the number of rules can become too large (sometimes a few million), which leads to long model-building times and slow predictions. Limiting the size of the apriori object while tuning the model becomes quite an issue when automating thousands of models.
Is it possible to add such a parameter in the near future?
Thanks!
Supriya
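Until such a parameter exists, one workaround (a sketch; the cap and quality measure are arbitrary choices) is to mine with the current parameters and trim the model afterwards:

```r
maxrules <- 10000  # hypothetical cap

rules <- apriori(trans, parameter = list(supp = 0.001, conf = 0.5,
                                         minlen = 2, maxlen = 5))

# Keep only the top rules by a quality measure, e.g. lift.
rules <- head(sort(rules, by = "lift"), n = maxrules)
```

This does not shorten mining time, but it bounds the object that is stored and used for predictions.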
Hi,
Are there any plans for including the FP Growth algorithm?
We are currently using the arules package to extract frequent item pairs for usage in a PCA. We have approximately 50000 "sets" (which means there are (50000^2)/2 potential pairs). We wanted to convert the sparse matrix into a full matrix using the following code:
apri.test <- apriori(transactions,
parameter = list(target = "frequent itemsets", supp = support,
minlen = 2, maxlen = 2), control = list(verbose = TRUE))
pairs.matrix <- as.matrix(apri.test@items@data)
However, this gave us a memory error saying that 580 GB are needed to allocate the matrix. Our rough estimate (50000^2 * 40 / 1000000000 = 80 GB) was greatly below this value. Is there a more efficient way integrated in the package to extract this matrix, or are we attempting to extract the wrong matrix entirely?
All the best from the WU.
With the new version of arules (1.5-0) I am sometimes getting the error "reached CPU time limit".
In my evaluations on 27 UCI datasets, I get it on waveform-5000 (MDLP-discretized) with the following setting:
confidence = 0.5, support = 0, minlen = 2, maxlen = 4, maxtime = 102 (i.e. the new feature turned off). I always get it during 10-fold cross-validation, but each time on a different fold.
The same setup worked fine with the previous version of arules. This error is distinct from the new warning related to maxtime.
Using arules 1.5-3 on R 3.3.3, running apriori() causes a fatal crash when the 'appearance' list specifies lhs and rhs items while default is "none".
See my repository for reproducible example.
Code file baskets.R loads the data in assocs.RData, which consists of a data frame of market baskets and two character vectors specifying items to exclude from LHS and RHS when mining associations.
I provided three ready options for specifying the 'appearance' argument: constrain LHS, constrain RHS, constrain both. R crashes when both sides are constrained.
The measure referred to as "ralambrodrainy" seems to contain a spelling mistake in the name of one of the authors of the paper:
Diatta, J., Ralambondrainy, H., & Totohasina, A. (2007). Towards a Unifying Probabilistic Implicative Normalized Quality Measure for Association Rules. Studies in Computational Intelligence, 237–250. doi:10.1007/978-3-540-44918-8_10
While investigating the correlation between rankings of different measures, I found that this measure also has negative or no correlation with most other measures; confidence in particular is highly negatively correlated.
This makes me think that there might be something wrong with the implementation.
Hi,
I think there is a simple error in the documentation of "transactions-class" when it says:
coerce signature(from = "matrix", to = "transactions"); produces a transactions data set from a binary incidence matrix. The row names are used as item labels and the column names are stores as transaction IDs.
From what I can see in my experiments, rows represent transactions and columns represent items.
Regards,
Víctor
Imagine transactions with sizes of up to 15 items. With a low support, arules does not check itemsets at level 10 correctly: every itemset of length 10 or more is always reported as not frequent.
I compared the results with the output of the latest Borgelt apriori implementation. Every retrieved frequent itemset is identical up to level 9, after which they are simply missing.
Hi,
I've realized that using the function interestMeasure with a set of rules of size 1 does not return a proper data frame.
data("Income")
rules <- apriori(Income)
r1 <- rules[1]
r2 <- rules[1:2]
> interestMeasure(x = r2, measure = c("support", "confidence")) #OK
support confidence
1 0.9128854 0.9128854
2 0.1127109 0.9292566
> interestMeasure(x = r1, measure = c("support", "confidence")) #Wrong
sapply(measure, FUN = function(m) interestMeasure(x, m, transactions, reuse, ...))
support 0.9128854
confidence 0.9128854
Is this a bug in the function, or am I using it badly?
Thanks!
Víctor
I was using interestMeasure and decided to sample some symmetric measures along with asymmetric ones, and was surprised to see the mutual information (M) numbers come back negative, given the documentation in the arules PDF:
"mutualInformation", uncertainty, M (Tan et al., 2002)
Measures the information gain for Y provided by X.
Range: [0, 1] (0 for independence)
I assume it could be just a sign bug (i.e., removing the minus sign should fix it)... FYI.
Sample below:
measures7 <- interestMeasure(rules_Arrest,
  c("phi", "mutualInformation", "cosine", "jaccard", "lift", "hyperLift"), # "leverage", "support", "gini", "hyperConfidence"
  transactions = DPPD_trans1_Arrest)
inspect(head(rules_Arrest))
lhs rhs support confidence lift
1 {ZipCode=75240} => {Division=North Central} 0.02027592 1.0000000 9.235521
2 {Sector=640} => {Division=North Central} 0.02100753 1.0000000 9.235521
3 {Sector=110} => {Division=Central} 0.02132107 1.0000000 8.181274
4 {ZipCode=75212} => {Sector=420} 0.02132107 0.9855072 40.997101
5 {ZipCode=75212} => {Division=SouthWest} 0.02142559 0.9903382 5.879960
6 {ZipCode=75208} => {Division=SouthWest} 0.02158236 0.9951807 5.908712
head(measures7, n = 20)
phi mutualInformation cosine jaccard lift hyperLift
1 0.41284206 -0.307263353 0.4327340 0.18725869 9.235521 6.807018
2 0.42038123 -0.310908803 0.4404718 0.19401544 9.235521 6.931034
3 0.39553519 -0.285008230 0.4176524 0.17443352 8.181274 6.181818
4 0.93344676 -0.847975096 0.9349343 0.87553648 40.997101 22.666667
5 0.32658365 -0.214306327 0.3549388 0.12705299 5.879960 4.659091
6 0.32891311 -0.216503142 0.3571049 0.12806202 5.908712 4.693182
7 0.47310567 -0.353491059 0.4906198 0.24783684 11.486745 8.354167
8 0.09550435 -0.012162221 0.1752533 0.03071371 1.409442 1.315457
9 0.09428369 -0.011812036 0.1750450 0.03078283 1.402731 1.310345
10 0.43094164 -0.316088614 0.4512959 0.20366795 9.235521 6.918033
11 0.94651261 -0.869063673 0.9477856 0.89938398 39.246165 23.052632
12 0.94651261 -0.884808556 0.9477856 0.89938398 39.246165 23.052632
13 0.86713506 -0.752373996 0.8704477 0.75767918 32.655290 19.304348
14 0.44229206 -0.321755514 0.4629100 0.21428571 9.235521 6.937500
15 0.07797480 -0.007918385 0.1698410 0.03065275 1.323733 1.237389
16 0.09873251 -0.012415261 0.1810415 0.03277602 1.409442 1.320475
17 0.41578962 -0.295491763 0.4383589 0.19215855 8.244722 6.371429
18 0.68171238 -0.548458088 0.6910043 0.47748691 20.037696 13.411765
19 0.60163715 -0.471097334 0.6141427 0.37717122 15.827957 11.121951
20 0.34716814 -0.225250203 0.3761424 0.14148309 5.937325 4.750000
Maybe better error/warning output? See the relevant post at SO:
https://stackoverflow.com/q/53185553/680068
I am guessing the OP is passing a data.frame to arules::write, and base::write is dispatched because the input is not "transactions"; maybe check the input?
I attempted to install the dev version from GitHub (to see if the %pin% issue in issue #16 was fixed in 1.4-2-1), but I was unable to build it. The build is on an up-to-date Mac Pro with R version 3.3.1 and RStudio 0.99.902.
> devtools::install_github("mhahsler/arules")
Downloading GitHub repo mhahsler/arules@master
from URL https://api.github.com/repos/mhahsler/arules/zipball/master
Error: Could not find build tools necessary to build arules
> devtools::install_git("git://github.com/mhahsler/arules")
Downloading git repo git://github.com/mhahsler/arules
Error: Could not find build tools necessary to build arules
> devtools::install_git("git://github.com/mhahsler/arules", args="--recursive")
Downloading git repo git://github.com/mhahsler/arules
Error: Could not find build tools necessary to build arules
I get a different number of rules when I turn memopt on/off. When memopt is TRUE, I usually get fewer rules. I am using minlen and maxlen as well. Am I missing something?
Does the new maxTime feature have anything to do with it?
After getting a strange error when using random.transactions:
Error in validObject(.Object) :
invalid class “ngCMatrix” object: Not a valid 'Mnumeric' class object
I realised that the error is due to the fact that Rscript does not load methods by default.
Minimal Working Example:
R -e 'library("arules"); random.transactions(5, 5)'
works like a charm, while Rscript throws an error:
Rscript -e 'library("arules"); random.transactions(5, 5)'
...
Error in validObject(.Object) :
invalid class “ngCMatrix” object: Not a valid 'Mnumeric' class object
A temporary fix is to run Rscript as follows:
Rscript --default-packages=methods -e 'library("arules"); random.transactions(5, 5)'
Another solution is to add library("methods") in the code before importing arules.
Is there a better solution? Can the arules package be made robust against this issue?
Hi, the following example is available for coercing a data.frame into a transactions object when the TransactionId and Items are provided:
## example 4: creating transactions from a data.frame with
a_df3 <- data.frame(
  TID = c(1, 1, 2, 2, 2, 3),
  item = c("a", "b", "a", "b", "c", "b")
)
a_df3
trans4 <- as(split(a_df3[, "item"], a_df3[, "TID"]), "transactions")
trans4
inspect(trans4)
However, when I try to apply the same to a data.frame that has additional columns, I get error:
Error in asMethod(object) : can coerce list with atomic components only
My example is:
a_df3 <- data.frame(
  TID = c(1, 1, 2, 2, 2, 3),
  item = c("a", "b", "a", "b", "c", "b"),
  column2 = c("Mon", "Wed", "Mon", "Tue", "Fri", "Mon")
)
trans4 <- as(split(a_df3[, c("item", "column2")], a_df3[, "TID"]), "transactions")
Error in asMethod(object) : can coerce list with atomic components only
Is there a method available to do the above with multiple columns (factor/logical)? In my application, if I first use "dcast" to create a WIDE table with unique TID values per row and then use only the data.frame without the TID column to convert it into a "transactions" object following your example 3 (a_df), then my data.frame size becomes too large (up to 5-7 GB). So I was hoping to create the "transactions" object required for "apriori" directly from my multiple-columns LONG table as shown above.
Thanks for any feedback!
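For what it's worth, one possible workaround, a base-R sketch rather than an official arules API (the "item="/"day=" label prefixes below are made up for illustration): flatten the extra columns into composite item labels first, so each transaction becomes an atomic character vector, which is exactly what the "transactions" coercion accepts.

```r
# Sketch: encode each extra column as its own "item" by prefixing the
# column name, then split by transaction ID. Only the final coercion
# needs arules; everything before it is base R.
a_df3 <- data.frame(
  TID = c(1, 1, 2, 2, 2, 3),
  item = c("a", "b", "a", "b", "c", "b"),
  column2 = c("Mon", "Wed", "Mon", "Tue", "Fri", "Mon")
)

# Flat labels such as "item=a" and "day=Mon" (prefixes are arbitrary)
labels <- c(paste0("item=", a_df3$item), paste0("day=", a_df3$column2))
tids <- rep(a_df3$TID, 2)

# Each list element is now an atomic character vector
trans_list <- lapply(split(labels, tids), unique)

# library(arules)
# trans4 <- as(trans_list, "transactions")  # coercion should now succeed
```

This keeps the long table long, so it avoids the 5-7 GB wide dcast step; whether composite labels are appropriate depends on how column2 should enter the rules.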
There is a typo in one of the level 2 item labels in the Groceries data. I figured I should report it, but I almost hope you'll consider not fixing it - it gives me a little chuckle every time I see "meet and sausage".
I am sorry, I am new to the arules package, and this is not actually an issue but rather a question. I already posted this question on Stack Overflow but wanted to ask it here hoping to get a quick answer.
I have a data set with customer ID, event_date, and event_type looking like this:
cid event_date event_type
451 2017-01-05 VSLS
451 2017-01-08 VCRD
451 2017-02-04 COMM
451 2017-02-05 COMM
...
564 2017-01-05 VSVC
564 2017-01-06 COMM
564 2017-02-05 VCRD
...
and I want to analyze frequent patterns of events. The question is how to build a transactions object that could potentially include the customer id and time stamp in its @itemsetInfo?
Thanks
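Not an authoritative answer, but one sketch of the usual approach (the arules-specific lines are commented out and assume the standard as() coercion and the transactionInfo() replacement function): group the event types per customer with base R, then attach the customer ids as transaction info.

```r
# Sketch: one basket of (unique) event types per customer id.
events <- data.frame(
  cid = c(451, 451, 451, 451, 564, 564, 564),
  event_date = c("2017-01-05", "2017-01-08", "2017-02-04", "2017-02-05",
                 "2017-01-05", "2017-01-06", "2017-02-05"),
  event_type = c("VSLS", "VCRD", "COMM", "COMM", "VSVC", "COMM", "VCRD")
)

# Base R: list of event-type vectors, named by customer id
baskets <- lapply(split(events$event_type, events$cid), unique)

# library(arules)
# trans <- as(baskets, "transactions")
# transactionInfo(trans)$cid <- names(baskets)  # carry the customer ids along
```

Time stamps do not fit naturally into one row of transaction info per customer; treating each (cid, event_date) pair as its own transaction, e.g. splitting on interaction(events$cid, events$event_date), would give that variant.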
Is there a code change between v1.2.x and 1.4.x that requires R >= 3.2.0? Lots of users run older releases of R for some time between updating, so if there isn't a vital new feature or bugfix required in the newer R releases a lower version dependency is nice--not necessarily 2.14, but R >= 3.0 represents a hard break.
After running:
rules <- df %>%
apriori(appearance = list(lhs = c(x), default="rhs"),
parameter=list(support=0.0, confidence=0.25))
When running the following line:
rules <- subset(rules, subset = (lhs %pin% as.character(x)) & lift > 1)
I receive the error:
"unable to find an inherited method for function '%pin%' for signature '"standardGeneric", "character"'"
I believe the 'standardGeneric' refers to the lhs portion and is what causes the error. x is just a column name.
This is with version 1.4-2. I don't believe I received the error on version 1.4-1.
Hi, is there a "predict" method for apriori similar to predict.rpart, etc? Currently I have my own code built following the suggestion at:
http://stats.stackexchange.com/questions/21340/finding-suitable-rules-for-new-data-using-arules
basket <- Groceries[2]
rulesMatchLHS <- is.subset(rules@lhs, basket)
suitableRules <- rulesMatchLHS & !(is.subset(rules@rhs, basket))
inspect(rules[suitableRules])
recommendations <- strsplit(LIST(rules[suitableRules]@rhs)[[1]], split = " ")
recommendations <- lapply(recommendations, function(x) paste(x, collapse = " "))
recommendations <- as.character(recommendations)
recommendations <- recommendations[!sapply(recommendations, function(x) basket %in% x)]
print(recommendations)
but that takes enormously long (several hours) to process test data of 20,000 rows with an apriori model of about 300,000 rules. I wanted to check whether a method already exists that processes a table of test data much faster (ideally, a few seconds), or whether there is a plan to develop such a method in the near future?
Thanks!
Supriya
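As a side note, the main speedup over basket-by-basket loops is to match every rule LHS against every test basket in one matrix operation (is.subset should accept a whole transactions object as its second argument and return a rules-by-transactions matrix). A toy base-R sketch of the underlying idea, with made-up items and dense 0/1 matrices standing in for arules' sparse itemMatrix objects:

```r
# Sketch: subset matching via matrix algebra. lhs[i, j] = 1 if rule i's
# LHS contains item j; baskets[k, j] = 1 if basket k contains item j.
# Rule i's LHS is a subset of basket k iff no LHS item is missing from
# the basket, i.e. (lhs %*% (1 - t(baskets)))[i, k] == 0.
items <- c("milk", "bread", "butter")

lhs <- rbind(
  r1 = c(1, 1, 0), # {milk, bread}
  r2 = c(0, 0, 1)  # {butter}
)
baskets <- rbind(
  b1 = c(1, 1, 1), # {milk, bread, butter}
  b2 = c(1, 0, 1)  # {milk, butter}
)
colnames(lhs) <- colnames(baskets) <- items

# TRUE where a rule's LHS is fully contained in a basket
match_mat <- (lhs %*% (1 - t(baskets))) == 0
```

One matrix product replaces 300,000 x 20,000 individual subset tests, which is why the sparse-matrix version inside arules scales far better than an explicit loop.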
I can see there is a long list of metrics/scores for calculating additional interest measures; however, by default (i.e., without setting the "measure" argument of the interestMeasure function), only 3 or 4 are calculated.
Basically, I am interested in calculating "certainty" which is on the list, but the function returns an error
"Error in interestMeasure(r, measure = c("certainty"), trans = subscription.trans) :
Value 'certainty' is an invalid measure for itemsets."
This happens regardless of whether I use "apriori" or "eclat".
Thanks,
Hi everyone,
I'm trying to use the discretize function. Given the following vector:
nums <- c(rep(1,7),
rep(2,3),
rep(3,4),
rep(4,5),
rep(5,9),
rep(6,10),
rep(7,8),
rep(8,1),
rep(9,9),
rep(10,4))
> nums
[1] 1 1 1 1 1 1 1 2 2 2 3 3 3 3 4 4 4 4 4 5 5 5 5 5 5 5 5
[28] 5 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 8 9 9 9 9 9 9 9
[55] 9 9 10 10 10 10
When I use the code table(discretize(nums, "frequency", categories=6))
I get the output below
[ 1, 3) [ 3, 6) 6 7 [ 8,10) 10
10 18 10 8 10 4
Nevertheless, I was expecting 6 categories of frequency 10. Am I misunderstanding something?
Thank you in advance,
Noelia
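Not an official explanation of discretize's internals, but the unequal bins can be reproduced with base R alone under the usual assumption that frequency-based discretization cuts at sample quantiles: ties in the data pull breakpoints onto repeated values, so six bins of exactly 10 observations are impossible for this vector.

```r
# Sketch: equal-frequency breakpoints are (sample) quantiles. With many
# tied values the quantiles land on repeated numbers, so the resulting
# bins cannot all contain n/k observations.
nums <- rep(1:10, times = c(7, 3, 4, 5, 9, 10, 8, 1, 9, 4)) # same 60 values

brk <- quantile(nums, probs = seq(0, 1, length.out = 7)) # 6 bins
tab <- table(cut(nums, breaks = brk, include.lowest = TRUE))
# tab reproduces the unequal 10 18 10 8 10 4 counts from the question
```

The breakpoints come out as 1, 2.83, 5, 6, 7, 9, 10; the runs of tied 5s, 6s, 7s, and 9s are what drag the cut points onto single values and unbalance the bins.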
I am trying to do association rule clustering in R and I ran into a problem when trying to use the function "dissimilarity". I tried something like this:
dis<-dissimilarity(rules, method = "gupta", args = "trans")
I want to use the "gupta" method to calculate the dissimilarity/distances of the rules, the R help manual said to use "gupta": "The transactions used to mine the associations has to be passed on via args as element "transactions"."
"trans" is my transactions (sparse format) and "rules" is my association rules. An error message said "Error in args$$trans : $ operator is invalid for atomic vectors"
why is it doesn't work?
I have also tried:
dis<-dissimilarity(rules, method = "gupta", args = list("trans"))
Error: Error in .local(x, y, method, args, ...) : Transactions needed in args for this method!
Is there anything wrong with my syntax?
HELP PLEASE !!!!!!!!!!
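For what it's worth, the help text quoted above asks for the transactions as args element "transactions", i.e. a named list entry; a bare string reproduces the '$ operator' error, and an unnamed list triggers the 'Transactions needed' error. A sketch (the dissimilarity call itself is commented out and assumes that named-element form):

```r
# `$` on an atomic vector is exactly the first reported error:
args_bad <- "trans" # a character string, not a list
msg <- tryCatch(args_bad$transactions,
                error = function(e) conditionMessage(e))
# msg: "$ operator is invalid for atomic vectors"

# The list element must be *named* "transactions" (assumed from the quoted docs):
# library(arules)
# dis <- dissimilarity(rules, method = "gupta",
#                      args = list(transactions = trans))
```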
Executing head(x) or tail(x) on a rules object containing no rules results in the error message "Error in slot(x, s)[i] : subscript out of bounds".
The error does not occur with arules::sort().
Using arules 1.5-0 on R 3.3.3
Add code to produce rules with more than 1 item in the RHS in the function ruleInduction().