gertjanssenswillen / edear

!! repository moved to https://github.com/bupaverse/edeaR !! This repo is read-only from now on.

License: Other

R 100.00%

edear's Issues

`filter_throughput_time` fails when using `week` units

Hi.

I'm trying to filter my event log to discard those cases longer than 10 weeks, and I found this situation.

> eventlog_ %>% filter_throughput_time(interval = c(10, NA), units = "week")
Error in mutate_impl(.data, dots) : 
  Evaluation error: invalid units specified.

Otherwise, if I use a different time unit, it works seamlessly.

> eventlog %>% filter_throughput_time(interval = c(10, NA), units = "days")
Event log consisting of:
4480 events
52 traces
684 cases
12 activities
4480 activity instances

# A tibble: 4,480 x 14
[...]

Is there anything I'm doing wrong?
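
A possible explanation, offered as an assumption rather than a confirmed diagnosis: the error text matches the one base R raises when a difftime is converted with a unit name other than "secs", "mins", "hours", "days" or "weeks". The base R sketch below (no edeaR involved) shows that the plural "weeks" is accepted while the singular "week" is rejected, so units = "weeks" may be worth trying.

```r
# Base R only: which unit names does difftime conversion accept?
start <- as.POSIXct("2018-01-01 00:00:00", tz = "UTC")
end   <- as.POSIXct("2018-03-12 00:00:00", tz = "UTC")
d <- difftime(end, start, units = "days")  # 70 days

# Plural unit name: accepted (70 days is 10 weeks).
weeks_ok <- as.numeric(d, units = "weeks")

# Singular unit name: base R signals an error; capture its message.
week_msg <- tryCatch(
  as.numeric(d, units = "week"),
  error = function(e) conditionMessage(e)
)

print(weeks_ok)
print(week_msg)
```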

Add method "only" for filter_activity_presence()

Hi, I would like to suggest a new method option for the filter_activity_presence() function.
Currently you can only keep cases that pass through all activities in a list, or through at least one activity in a list (all or one_of).
A third option that keeps only the cases passing exclusively through the listed activities would be very helpful.
Using
filter_activity_presence(c("A","B","C","D"))
with this option would keep a case with trace "A","B","A","C","D" and drop a case with trace "A","B","C","D","E".
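
To make the requested semantics concrete, here is a minimal base R sketch. The helper keep_only_activities() and the column names case_id and activity are hypothetical and are not part of edeaR; the example data mirrors the two traces described above.

```r
# Hypothetical helper, not part of edeaR: keep the cases whose activities
# all belong to `allowed`. Column names are assumptions for illustration.
keep_only_activities <- function(log, allowed) {
  # Per case: TRUE when every activity is in the allowed set.
  ok <- tapply(log$activity %in% allowed, log$case_id, all)
  log[log$case_id %in% names(ok)[ok], , drop = FALSE]
}

log <- data.frame(
  case_id  = c(rep("c1", 5), rep("c2", 5)),
  activity = c("A", "B", "A", "C", "D",   # c1: only listed activities
               "A", "B", "C", "D", "E"),  # c2: also contains "E"
  stringsAsFactors = FALSE
)

kept <- keep_only_activities(log, c("A", "B", "C", "D"))
print(unique(kept$case_id))
```

Repeated activities do not matter here; a case is dropped only when it contains an activity outside the list.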

filter_precedence()

The filter_precedence() function puzzles me quite a bit (even after reading the available documentation).

When using this function, before creating a process_map(), I would expect to see the same number of events as is listed in the precedence_matrix(). However, that's not the case.

According to patients %>% precedence_matrix() %>% plot(), there are 492 cases/events between Discuss Results and Check-out.
However, running patients %>% filter_precedence("Discuss Results", "Check-out") %>% case_labels() shows only 3 cases.

After further analysis, it appears that one must use the argument filter_method = "none" to get the expected outcome: only those cases whose trace contains the given precedence relation. Omitting this argument gives a result that was, for me, unintended and unexpected: it actually shows the opposite, namely cases that do not match the provided filter arguments.

So my question is: could this be a bug or am I misunderstanding the purpose of this function?

Thx!

Different activity orderings in case of identical timestamps

Hi Gert,

happy new year.

I'm working on bupaR and I noted some inconsistencies.

One of my issues last year was about inconsistent information between the functions process_map() and precedence_matrix().

That is working well now, but the information generated by the function start_activity() is still not consistent with them.

Attached are some images illustrating the problem (activities labelled with letters: A, B, C).

Thanks for all.

Endpoint filter does not work

edeaR/R/filter_endpoints_percentile.R

Line 22 in 695c0fd

pull(1)

This does not provide a list of cases anymore. Maybe changes in dplyr or somewhere else in edeaR?

throughput_time with strange output. Bug?

The throughput_time function shows some strange behaviour. I would assume that the following three code examples should produce the same output. However, all resulting quartiles and the mean are very different, except the Min. and the Max.

sepsis %>% throughput_time(level = "case") %>% summary()


sepsis %>% throughput_time(level = "case", append = TRUE) %>% 
    select(throughput_time_case, force_df = TRUE) %>% summary()


sepsis %>% throughput_time(level = "log") %>% summary()

Could there be a bug or do I misunderstand the function?

Be consistent about row ordering for resource metrics

As an example of inconsistency, number_of_repetitions() returns values ordered by resource whereas resource_frequency() returns values ordered by count.

library(edeaR)
data(sepsis, package = "eventdataR")
number_of_repetitions(sepsis, level = "resource")
## Using default type: all
## # resource_metric [26 × 3]
##    first_resource absolute relative
##    <fct>             <dbl>    <dbl>
##  1 ?                     0  0
##  2 A                     0  0
##  3 B                  1536  0.189
##  4 C                     3  0.00285
##  5 D                     0  0
##  6 E                     0  0
##  7 F                    16  0.0741
##  8 G                    67  0.453
##  9 H                     6  0.109
## 10 I                    12  0.0952
resource_frequency(sepsis, level = "resource")
## # A tibble: 26 x 3
##    resource absolute relative
##    <fct>       <int>    <dbl>
##  1 B            8111  0.533
##  2 A            3462  0.228
##  3 C            1053  0.0692
##  4 E             782  0.0514
##  5 ?             294  0.0193
##  6 F             216  0.0142
##  7 L             213  0.0140
##  8 O             186  0.0122
##  9 G             148  0.00973
## 10 I             126  0.00828
## # ... with 16 more rows

I think it makes sense to have every resource metric return values in the same order. You could take the approach of dplyr::count() and have a sort argument that determines whether or not to sort the rows by count.
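
As a rough illustration of that suggestion, the base R sketch below uses a hypothetical helper order_metric() (not an edeaR function) on three of the resource_frequency() rows shown above. With sort = FALSE rows are ordered by resource; with sort = TRUE they are ordered by descending count.

```r
# Hypothetical helper, not edeaR code: one ordering rule for every metric.
order_metric <- function(metric, sort = FALSE) {
  if (sort) {
    # Largest count first, as dplyr::count(sort = TRUE) would do.
    metric[order(-metric$absolute), , drop = FALSE]
  } else {
    # Default: stable, alphabetical order by resource.
    metric[order(as.character(metric$resource)), , drop = FALSE]
  }
}

# Three rows taken from the resource_frequency() output in this issue.
freq <- data.frame(
  resource = c("B", "A", "C"),
  absolute = c(8111, 3462, 1053),
  stringsAsFactors = FALSE
)

by_name  <- order_metric(freq)
by_count <- order_metric(freq, sort = TRUE)
print(by_name$resource)
print(by_count$resource)
```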

error when setting trace_length @level='trace'

Hi, I found this error when using level = "trace" with trace_length:

ciao=edeaR::trace_length(eventlog = evl, level="trace")
Error in eval(lhs, parent, parent) :
argument "eventlog" is missing, with no default

However, I did supply the eventlog argument.

Sharper knives for filter_time_period in combination with trim

What I expected: when trimming to a specific time period, events that lie only partly inside the period are cut at the period boundaries, so that the part inside the period stays in the result.

What I got: Events that are only partly in the trimmed period are discarded.

Why is this a problem?: We use trim mainly to slice a larger period into even parts so that we can measure the total processing time per part. This is only possible when events are sliced as well and their processing time is attributed to the right part. That is why we need the sharper knife, which also cuts the raisins in the cake.

Note: for the other filter options: "contained", "intersecting", "start", "complete" there is no problem.
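
The requested behaviour can be sketched in base R as follows. The helper clip_to_period() and the columns start and complete are hypothetical and only illustrate the idea; this is not how edeaR implements trim.

```r
# Hypothetical helper, not part of edeaR: clip activity instances to a window.
clip_to_period <- function(instances, from, to) {
  # Keep every instance that overlaps the window at all.
  overlaps <- instances$start < to & instances$complete > from
  out <- instances[overlaps, , drop = FALSE]
  # Cut off the parts that stick out of the window.
  out$start    <- pmax(out$start, from)
  out$complete <- pmin(out$complete, to)
  out$processing_hours <-
    as.numeric(difftime(out$complete, out$start, units = "hours"))
  out
}

ts <- function(x) as.POSIXct(x, tz = "UTC")
instances <- data.frame(id = c("a1", "a2", "a3", "a4"))
instances$start    <- ts(c("2020-01-01 07:00:00", "2020-01-01 09:00:00",
                           "2020-01-01 11:30:00", "2020-01-01 13:00:00"))
instances$complete <- ts(c("2020-01-01 09:00:00", "2020-01-01 10:00:00",
                           "2020-01-01 13:00:00", "2020-01-01 14:00:00"))

# Window 08:00 to 12:00: a1 and a3 are cut, a2 is untouched, a4 is dropped.
clipped <- clip_to_period(instances,
                          ts("2020-01-01 08:00:00"),
                          ts("2020-01-01 12:00:00"))
print(clipped[, c("id", "processing_hours")])
```

Summing processing_hours per window then attributes each instance's time to the part in which it was actually spent.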

eventlog function

It seems like the eventlog function is missing from the edeaR package, is it? I don't see it in the package or in the documentation; there is only an eventlog_from_xes function. The eventlog function is referenced in the vignette on data preprocessing, in the section on importing from csv.

Filter events based on lifecycle_id logic

Thanks so much for this helpful R package! How would you suggest cleaning up events based on the lifecycle_id variable?

For example suppose I have an activity that should always have a "start" and "complete" event. However my event log is a bit messy and occasionally a case has only a "start" or only a "complete" event but not both. How would you suggest I filter activities to ensure that every activity has a "start" and "complete" event?

This seems similar to what the filter_precedence function does, but I want to filter events based on the ordering of the lifecycle_id within each activity.

I can create a reproducible example if that would be helpful.
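
One way to approach this, shown as a base R sketch on a plain data frame rather than an eventlog object: the helper keep_complete_instances() and the column names instance_id and lifecycle are hypothetical, and the sketch assumes that each activity instance has its own identifier.

```r
# Hypothetical helper, not part of edeaR: keep only the activity instances
# that have both a "start" and a "complete" event.
keep_complete_instances <- function(log) {
  id <- as.character(log$instance_id)
  # Both tapply() calls group by the same ids, so the results line up.
  has_start    <- tapply(log$lifecycle == "start", id, any)
  has_complete <- tapply(log$lifecycle == "complete", id, any)
  good <- names(has_start)[has_start & has_complete]
  log[id %in% good, , drop = FALSE]
}

log <- data.frame(
  instance_id = c("i1", "i1", "i2", "i3"),
  lifecycle   = c("start", "complete", "start", "complete"),
  stringsAsFactors = FALSE
)

clean <- keep_complete_instances(log)
print(clean)
```

Here i2 (start only) and i3 (complete only) are dropped. Checking that "start" comes before "complete" would need an additional timestamp comparison, which this sketch leaves out.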

Add a plot for number_of_repetitions when type is all and level is resource

This is currently not supported. It returns a function, which is confusing.

library(edeaR)
data(sepsis, package = "eventdataR")
n_reps <- number_of_repetitions(sepsis, level = "resource")
## Using default type: all
plot(n_reps)
## function (...) 
##   tags$p(...)
## <bytecode: 0x1022c4c50>
##   <environment: namespace:htmltools>

This ought to show a bar plot of the absolute number of repetitions by resource (to match the behavior when level = "activity").

start_activity and activity_frequency

Hi, thanks for the bupaR package, it is very useful!
I am having an issue using the start_activities() and activity_frequency() functions:

evl %>% start_activities(level = "case")
evl %>% activity_frequency("case")

where evl is an eventlog object. The output is:

the first call returns a data frame with only one row;
the second call never stops running.

Any suggestion?

Thanks.

start_activities() shows unexpected column label

Have a look at patients %>% start_activities(level = "case"). The output is as expected, it's just that the column label is a bit confusing (showing end_activity).

activity_frequency(level = "activity") vs n_events()

Hello Gert,

I'm using activity_frequency(level = "activity") and n_events().

What I see is that n_events() reports twice the number of events compared to activity_frequency(level = "activity") (perhaps due to the start/full state).

Thank you.