The dupree from russhyde

Add a system.file() example that the user could use

At present, the example code in the dupree README can't be rerun by a user who installed dupree from CRAN, say, since the examples require the dupree source code (and dupree isn't particularly bad for duplication within its own source code).

It would be useful to have a couple of dupree examples that use example files (from inst/) that CRAN users could access

- add two example files to inst.
- add example sections that use these system.files to the dupree() roxygen
- use the system.files within a vignette or in the README.
- also (but unrelated) show how to visualise a similarity network using tidygraph in the vignette
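
A minimal sketch of what such an example could look like. The subdirectory `extdata` and the file name `duplicated.R` are placeholders, since the issue does not name the example files. `system.file()` returns an empty string when the package or file is missing, so the call to `dupree()` is guarded.

```r
# Hypothetical location: inst/extdata/duplicated.R inside the installed package.
example_file <- system.file("extdata", "duplicated.R", package = "dupree")

if (nzchar(example_file)) {
  # dupree() takes a vector of file paths.
  dups <- dupree::dupree(example_file)
  print(dups)
} else {
  message("Example file not found; is dupree installed with example files?")
}
```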

rstudio addin for running dupree on projects / packages

Vignette: How to sensibly use {dupree} as part of your workflow

DON'T JUST DEDUPLICATE WITHOUT THINKING!

Write up a vignette of a sensible workflow:

The aim is to add useful functionality that reduces the floor-space for bugs without deduplicating so far that your code becomes unreadable

Two main approaches:

If you want to fix a bug
- Write a test that traps the bug, if you can
- Isolate the code that is affected by the bug
- run dupree and see if that code is duplicated elsewhere in the package
- [see if you need to write a test that captures bugs in the duplicated bits]
- [deduplicate the duplicated bits of code & then fix the bug]
- OR [fix the bug(s) and then deduplicate]
If you want to add a feature
- again, write tests for the new feature
- write the code to implement your feature
- run dupree; find if your feature-code duplicates some other part of the package
- refactor the duplicated bits of code

As such, {dupree} doesn't really fit into the goodpractice-type checks (where you analyse the current state of the whole package relative to general guidelines); you need to consider before & after and focus on the bit of code that you are currently modifying. [? does that make sense]

Fix any dplyr-based code / tests to work with dplyr=1.0

Dear Russ Hyde,

This is an automated email to let you know that:

A new version of dplyr is ready to go to CRAN. dplyr is
currently at version 0.8.99.9002 and will become 1.0.0 upon release.
dupree uses dplyr and has problems with the new version.
We plan to submit dplyr to CRAN on May 1.

This is a major release. See
https://www.tidyverse.org/blog/2020/03/dplyr-1-0-0-is-coming-soon/ for
a detailed article about what's changed.

I need your help to keep dupree and dplyr working together smoothly.
In the next weeks, can you please:

Read about the changes to dplyr at
https://github.com/tidyverse/dplyr/blob/master/NEWS.md.
This page includes a list of breaking changes, the reasoning behind
them, and how to update your code.
Carefully inspect the failing checks listed at the bottom of this email.
For each failing check, either update your package, or tell me
that I have a bug. If you have made changes to your package, please
submit an update to CRAN before May 1.

If you have discovered a bug in dplyr, please file an issue (ideally
with a small reprex that illustrates the problem) at
https://github.com/tidyverse/dplyr/issues. If you're not sure whether
or not you've found a bug, please file an issue at
https://github.com/tidyverse/dplyr/issues for discussion. Breaking
changes that are not listed qualify as bugs.

Please respond to this message if you have any questions.

Thanks,

Romain Francois

dupree doesn't work on R-markdown files with non-R code-blocks

Code blocks for python-engine are parsed out as-if R blocks
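
One way to spot the problem before parsing is to read the engine name from each chunk header. The sketch below is a base-R illustration only; `chunk_engines()` is a hypothetical helper, not part of dupree, knitr or lintr, and it only handles the common backtick-fenced header style.

```r
# Build the fence string programmatically to keep this example easy to embed.
fence <- strrep("`", 3)

# Hypothetical helper: return the engine named in each chunk header.
chunk_engines <- function(lines) {
  header_pattern <- paste0(
    "^[[:space:]]*", fence, "`*[[:space:]]*\\{[[:space:]]*([[:alnum:]_]+).*$"
  )
  headers <- grep(header_pattern, lines, value = TRUE)
  sub(header_pattern, "\\1", headers)
}

rmd <- c(
  paste0(fence, "{r setup}"), "x <- 1", fence,
  "",
  paste0(fence, "{python}"), "print(1)", fence
)

engines <- chunk_engines(rmd)
# Chunks whose engine is not R should not be parsed as R code.
non_r <- engines[tolower(engines) != "r"]
print(non_r)
```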

Fix warning (?) test_dupree_classes.R:33: warning: EnumeratedCodeTable: construction / validity `cols` is now required. Please use `cols = c(enumerated_code)`

dupree on a *.R file with no R blocks

Calling dupree on coxpresdbr/R/coxpresdbr_parse.R throws an error.

The only content in coxpresdbr_parse.R is some blank lines and a comment.

Error is:
Error in mutate_impl(.data, dots) : Evaluation error: unique() applies only to vectors.
traceback() shows the error occurs in a mutate_ call in enumerate_code_symbols()
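
A possible guard, sketched in base R: check whether a file contains any parseable expressions before it enters the pipeline. `has_r_code()` is a hypothetical helper name, and this is not how dupree currently handles the case.

```r
# Hypothetical helper: TRUE if the file contains at least one R expression.
has_r_code <- function(path) {
  exprs <- parse(file = path, keep.source = FALSE)
  length(exprs) > 0
}

# A file like the one described: blank lines and a comment only.
comment_only <- tempfile(fileext = ".R")
writeLines(c("", "# just a comment", ""), comment_only)

with_code <- tempfile(fileext = ".R")
writeLines("x <- 1", with_code)

empty_result <- has_r_code(comment_only)
code_result <- has_r_code(with_code)
```

Files for which the check is FALSE could be dropped before tokenisation, so that later steps never see an empty table.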

Replace pr-commands.yaml with pre-commit rules

Aim: to ensure that styler::style_pkg() and devtools::document() are run on every commit. This should keep the code clean, and is more hands-off than having PR discussion commands for styling and documenting the code.

Check that `dupree_package` works on windows

dupree_package uses a filter that works like
grep(pattern = "<my_package>/R/", x= initial_files)

... where initial_files is created by dir(). Under Windows, does dir() return file paths like "<my_package>\R\some_file.R" or like "<my_package>/R/some_file.R"? If the former, then the filter won't work on Windows.
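
Whatever dir() returns on Windows, the filter can be made independent of the separator by normalising both sides first. A minimal sketch, assuming the filter only needs to keep files under `<package>/R/`; `keep_r_dir_files()` is a hypothetical name and not the current implementation.

```r
# Hypothetical helper: keep only files that live under <package>/R/.
keep_r_dir_files <- function(files, package) {
  # Replace backslashes with forward slashes (fixed = TRUE: no regex escaping).
  files <- gsub("\\", "/", files, fixed = TRUE)
  package <- gsub("\\", "/", package, fixed = TRUE)
  # Drop trailing slashes so the prefix is built consistently.
  prefix <- paste0(sub("/+$", "", package), "/R/")
  files[startsWith(files, prefix)]
}

initial_files <- c("pkg\\R\\a.R", "pkg/R/b.R", "pkg/tests/R/c.R")
kept <- keep_r_dir_files(initial_files, "pkg")
print(kept)
```

Using a prefix match rather than grep() also avoids matching nested directories that merely contain "/R/" somewhere in their path.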

Length of code-block contents should be returned

dupree() filters out trivial symbols and then quantifies the similarity between the blocks that remain. But this means that:
library(dplyr)
gets converted to non-trivial symbols
"library dplyr"

so any pair of files containing library(dplyr) will match exactly for this specific block.
We either need a way of running dupree for blocks that are of at least this length or a way of returning results from dupree that contain the block-lengths for each compared pair of blocks.
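
To illustrate why such a block is so short, the sketch below counts the tokens that survive a filter, using base R's parse data. The set of "trivial" tokens here is an assumption made for the example and is not dupree's actual list; `count_nontrivial_tokens()` is a hypothetical helper.

```r
# Hypothetical helper: count terminal tokens that are not in a "trivial" set.
count_nontrivial_tokens <- function(
    code,
    trivial = c("'('", "')'", "','", "'{'", "'}'", "COMMENT")) {
  parsed <- utils::getParseData(parse(text = code, keep.source = TRUE))
  # Only terminal tokens correspond to text in the source code.
  sum(parsed$terminal & !parsed$token %in% trivial)
}

n_tokens <- count_nontrivial_tokens("library(dplyr)")
print(n_tokens)
```

Returning a count like this alongside each compared pair would let users discard matches between very short blocks after the fact.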

master -> main

rewrite all tidyverse `select_(~ ...)`-type code

Rewrite this stuff, preferably using base R constructs for stability.
When running tests for dupree, in R-3.5.1 with dplyr=0.8.3, I get comments like

select_ is deprecated use select() instead ...

Please try not to break backwards compatibility (so don't use {{ my_columns }} syntax) - maybe test the package with both R=3.4.1/dplyr0.7/rlang0.2 and R=3.5.1/dplyr0.8/rlang0.4
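
For simple column selection, a base R replacement avoids both the deprecated `select_()` and any dependence on newer rlang syntax. This is a sketch only; `select_columns()` is a hypothetical helper and the column names follow the dupree result table shown elsewhere in these issues.

```r
# Hypothetical helper: select columns by name using base R only.
select_columns <- function(df, columns) {
  missing_cols <- setdiff(columns, colnames(df))
  if (length(missing_cols) > 0) {
    stop("Unknown columns: ", paste(missing_cols, collapse = ", "))
  }
  # drop = FALSE keeps a data.frame even when one column is selected.
  df[, columns, drop = FALSE]
}

results <- data.frame(
  file_a = "a.R", file_b = "b.R", score = 0.5,
  stringsAsFactors = FALSE
)
subset_df <- select_columns(results, c("file_a", "score"))
```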

hedgehog tests for alignment speed

Code block alignments are done by calling stringdist::seq_sim
In a given iteration, for each code block, seq_sim is used to compare it against all other code blocks in the set of files (seq_sim(this_code_block, all_other_code_blocks)).

There may be faster ways of doing the alignments. Suggest making an efficiency test to ensure that the speed of dupree is not adversely affected by pull-requests etc.

Use hedgehog to generate random vectors of integers that can be used as test input.
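
A rough sketch of the timing harness. It generates the random integer vectors with base R rather than hedgehog (a deliberate simplification, so hedgehog's generators would need to be swapped in), and it only runs the timing when stringdist is installed. The sizes and the seed are arbitrary choices for the example.

```r
# Generate random "code blocks" as integer vectors of varying length.
random_blocks <- function(n_blocks, max_length = 50L, n_symbols = 30L, seed = 1L) {
  set.seed(seed)
  lapply(seq_len(n_blocks), function(i) {
    block_length <- sample.int(max_length, size = 1L)
    sample.int(n_symbols, size = block_length, replace = TRUE)
  })
}

blocks <- random_blocks(100L)

if (requireNamespace("stringdist", quietly = TRUE)) {
  # One-vs-all comparison for every block, mirroring the description above.
  timing <- system.time(
    for (i in seq_along(blocks)) {
      stringdist::seq_sim(blocks[i], blocks[-i])
    }
  )
  print(timing[["elapsed"]])
}
```

A test could then assert that the elapsed time stays under a threshold chosen from a baseline run.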

`dupree_package` fails for {logging}

In {logging} the file structure of the github repo looks like:

/home/ah327h/temp/dev-tools-analysis/logging/
├── cran-comments.md
├── handlers
│   └── pkg
│       ├── DESCRIPTION
│       ├── man
│       │   └── sentry.Rd
│       ├── NAMESPACE
│       └── R
│           └── sentry.R
├── pkg
│   ├── DESCRIPTION
│   ├── man
│   │   ├── // -- snip -- //
│   ├── NAMESPACE
│   ├── NEWS.md
│   ├── R
│   │   ├── logger.R
│   │   ├── // -- snip -- //
│   └── tests
│       ├── run_tests.R
│       ├── testthat
│       │   ├── // -- snip -- //
│       └── testthat.R
├── README.md
├── tutorial.rst
└── www
    - // snip //

But dupree_package fails on this package because it can't find a top-level ./R/ directory

TODO:

- (temp) stop dupree_package with informative error message when no top-level R directory is found
- determine how common the structure: (repo/-- pkg/-- R) is;
- rewrite dupree_package to handle this structure
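
As a starting point for the last item, nested packages could be located by looking for DESCRIPTION files that sit next to an R/ directory. This is a base R sketch under that assumption; `find_package_roots()` is a hypothetical helper and the demo tree is a stand-in for the {logging} layout.

```r
# Hypothetical helper: directories that hold both DESCRIPTION and R/.
find_package_roots <- function(repo) {
  description_files <- list.files(
    repo, pattern = "^DESCRIPTION$", recursive = TRUE, full.names = TRUE
  )
  candidates <- dirname(description_files)
  candidates[dir.exists(file.path(candidates, "R"))]
}

# Build a small stand-in for the repo structure shown above.
repo <- tempfile("logging-demo")
for (pkg_dir in c("pkg", file.path("handlers", "pkg"))) {
  dir.create(file.path(repo, pkg_dir, "R"), recursive = TRUE)
  file.create(file.path(repo, pkg_dir, "DESCRIPTION"))
}

roots <- find_package_roots(repo)
print(roots)
```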

Tag version 0.2.0

CRAN submission of dupree 0.2.0 on 2019-10-31 at 12.00 noon.

Commit: 9a0ae8b

ensure R CMD check passes without warnings

Current warnings:

* checking package subdirectories ... WARNING
Subdirectory ‘inst’ contains no files.

* checking Rd \usage sections ... WARNING
Undocumented arguments in documentation object 'annotate_parsed_content'
  ‘parsed_content’ ‘file’ ‘block’ ‘start_line’
Undocumented arguments in documentation object 'dupree'
  ‘files’ ‘min_block_size’ ‘...’
Undocumented arguments in documentation object 'get_localised_parsed_code_blocks'
  ‘source_exprs’

code coverage recipe for travis.ci

Also parse #' @examples in R documentation to find code duplicates

Hi, very useful package. I am wondering if there is any way to also parse the #' @examples from the roxygen documentation for duplications?

extend documentation

- can the user disregard specific directories (eg, tests/)
- how to add to / modify the disregarded symbols
- network diagram of duplication in a project / package
- why were the default disregarded symbols selected?
- limitation: only looks at top-level blocks
- ? alternatives

Continuous integration

Add travis integration

Plans for {dupree} 0.3.x

Major

- Provide a way to highlight, or print, the duplicated region corresponding to a match (#44 and #27)
- Introduce a lightweight class to store pairwise block-to-block duplication information (#60 ; see #63 )
- Visualisation of duplication within a project
- Provide some estimate of lines-of-code-that-could-be-saved (or similar) based on the dup-length and the frequency with which that dup is found across the files

Minor

- dupree_package() and dupree_dir() should work on current-working dir by default
- tests that directly use the top-level entry points, eg, make a package structure in a subdir of /tests and run dupree, dupree_dir and dupree_package on it (#45 )
- bump min_block_size to 40
- fix dupree_package so that it assesses
- - the R subdirectory of a given directory, rather than any subdirectory of the given directory that contains /R/ in its name (found while running dupree_package on unitizer)
- - the structure of the passed-in directory (are DESCRIPTION, ~~NAMESPACE~~ and R/ all present?) #57
- relative_path argument in dupree_dir and dupree_package to indicate whether the filepaths in the results should be written relative to the analysed directory (vs, as full paths) and also whether excluded directories/files are specified relative to the analysed directory #62

Add shiny app

Prototype:

App should run locally on user's computer
They can point the app to a directory, package or file
Dupree will run and a table of duplications will be created

Would like:

Graph-based visualisation of duplications
Nested-graph (blocks nested inside files) or circular (ordered blocks inside files on the perimeter with links between dups) visualisation

CRAN resubmission

Add @value tag to dupree()
Add system.file(...)-based examples (see #39 )

dplyr::n should be imported

From #21 : In a fresh R session, if dplyr is not explicitly loaded, this function gives a different error because it fails to find dplyr::n()

> dupree::dupree_package(".")
Error in n() : could not find function "n"
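
For illustration, a namespace-qualified call works without attaching dplyr; inside a package, the equivalent would be importing `n` in the NAMESPACE. The snippet uses `mtcars` as stand-in data and is guarded in case dplyr is not installed.

```r
if (requireNamespace("dplyr", quietly = TRUE)) {
  # dplyr is not attached here; every call is qualified, including n().
  counts <- dplyr::summarise(
    dplyr::group_by(mtcars, cyl),
    n_rows = dplyr::n()
  )
  print(counts)
}
```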

add appveyor integration

add presentation for edinbR

Presentation should cover:

Classes: `dup` and `dups`

PLAN: conversion of dupree output into Dups object

- rewrite all tests that use output from dupree*() functions to test on as.data.frame(dupree_*()) rather than on the actual return values.
- add Dups class (just wrap the data.frame)
- add as.data.frame.Dups()
- rewrite dupree() to return a Dups object
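
A minimal S3 sketch of the first stage of this plan, assuming the class does nothing more than wrap the data.frame. The constructor name `new_dups()` and the field name `dups_df` are placeholders chosen for the example.

```r
# Hypothetical constructor: wrap a data.frame of results in a Dups object.
new_dups <- function(df) {
  stopifnot(is.data.frame(df))
  structure(list(dups_df = df), class = "Dups")
}

# S3 method so that existing tests can work on as.data.frame(result).
as.data.frame.Dups <- function(x, ...) {
  x$dups_df
}

results <- data.frame(
  file_a = "R/a.R", file_b = "R/b.R",
  block_a = 2L, block_b = 4L,
  line_a = 71L, line_b = 139L,
  score = 0.25,
  stringsAsFactors = FALSE
)

dups <- new_dups(results)
round_trip <- as.data.frame(dups)
```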

Conversion of Dups[data.frame] into Dups[list(Dup)]:

TODO

Plans for {dupree} 0.3.1

Major

Minor

- fix dupree_package so that it assesses
  - the R subdirectory of a given directory, rather than any subdirectory of the given directory that contains /R/ in its name (found while running dupree_package on unitizer)
- relative_path argument in dupree_dir and dupree_package to indicate whether the filepaths in the results should be written relative to the analysed directory (vs, as appended paths: ie, should dupree_package("pkg") have "pkg/R/some_file" or "R/some_file" in its file column?) and also whether excluded directories/files are specified relative to the analysed directory #62
- README
  - add downloads & cran status badges
    - see https://cran.r-project.org/web/packages/badgecreatr/vignettes/all_badges.html
    - or https://usethis.r-lib.org/reference/badges.html
  - add link to pkgdown website
- Specify R-base >= 3.4

Allow multiple 'path' entries

eg.

dupree_dir(path = c("R", "inst"))

Report missingness (for files / dirs) better

With an empty directory as working-dir:

dupree_package("somePackage")
# Error: Column `text` not found in `.data`
# Run `rlang::last_error()` to see where the error occurred.

It should say "could not find package" or something similar.

Mention install from CRAN in README

Function for extracting a template of any duplicated code

Say these two code blocks are identified by dupree

my_code <- some_data %>%
   a_really() %>%
   long_pipeline() %>%
   bespoke_function1()
.
.
.
some_data %>%
    a_really() %>%
    long_pipeline() %>%
    bespoke_function2()

Is there some way that dupree could take the details of these two code blocks (file / line) and return a template for abstracting out the common code?

new_function <- function(x) {
  x %>% a_really() %>% long_pipeline()
}
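
One very simplified way to think about this: if each block is reduced to an ordered vector of pipeline steps, the shared part is their common prefix. The sketch below assumes the steps have already been extracted as strings, which is the hard part and is not shown; `common_prefix()` is a hypothetical helper.

```r
# Hypothetical helper: leading elements shared by two character vectors.
common_prefix <- function(a, b) {
  n <- min(length(a), length(b))
  same <- a[seq_len(n)] == b[seq_len(n)]
  # Number of matching elements before the first mismatch.
  n_common <- if (all(same)) n else which(!same)[1] - 1L
  a[seq_len(n_common)]
}

block_1 <- c("some_data", "a_really()", "long_pipeline()", "bespoke_function1()")
block_2 <- c("some_data", "a_really()", "long_pipeline()", "bespoke_function2()")

shared_steps <- common_prefix(block_1, block_2)
print(shared_steps)
```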

Allow user to add / drop token-types from the trivial-symbols list

Fix tidyselection warnings

Function for obtaining / printing the text for a pair of duplicated blocks

For example,
print_dup(dup_df[1, ])

Or, if we change dupree to return a list of class Dups, wherein each entry is of class Dup; then
print(dups[[1]]) might be better syntax

dupree_package fails if path specified like "~/the/path"

This fails because there is an attempt to filter to keep only files in /the/fully/specified/path/R/ using regexes. But if the package is specified like ~/specified/path, then its files look like ~/specified/path/R/some-file.R and do not match the fully-specified path.
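
A possible fix is to normalise both the package path and the file paths before comparing them. The sketch below uses base R only and a temporary directory as a stand-in package; `normalise_dir()` is a hypothetical helper name.

```r
# Hypothetical helper: expand "~" and resolve the path to a canonical form.
normalise_dir <- function(path) {
  normalizePath(path.expand(path), winslash = "/", mustWork = FALSE)
}

# Stand-in package directory with one file under R/.
package_dir <- tempfile("pkg")
dir.create(file.path(package_dir, "R"), recursive = TRUE)
file.create(file.path(package_dir, "R", "some-file.R"))

files <- list.files(package_dir, recursive = TRUE, full.names = TRUE)

# Compare like with like: both sides go through the same normalisation.
r_prefix <- paste0(normalise_dir(package_dir), "/R/")
in_r_dir <- startsWith(normalise_dir(files), r_prefix)
```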

dupree checks a->b and b->a

I noticed in the presentation that every result is presented twice; once as a->b and again as b->a (where -> denotes "is most similar to"). I don't think doing a one vs all for every block is the right approach, although when you reduce the problem to comparison of int vectors it's probably not too bad.

What I would do (see PR) is do only the unique combinations, then filter those results somehow. The way I've done in the PR is a bit off, so take it with a grain of salt.

btw the thought occurs - how well does this approach of tokenisation work when you have more than 10 symbols? Seems like that might mess with the similarity quite a bit as "1" "6" then becomes the same as "16"
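
Both points can be illustrated in base R. `utils::combn()` enumerates each unordered pair once, and if blocks are held as integer vectors (as the comment above suggests) rather than pasted digit strings, the symbols 1 and 6 stay distinct from symbol 16. The block contents here are made up for the example.

```r
# Made-up blocks, encoded as integer vectors of symbol ids.
blocks <- list(
  a = c(1L, 6L, 2L),
  b = c(16L, 2L),
  c = c(1L, 6L, 2L)
)

# Each unordered pair appears exactly once: a-b, a-c, b-c.
pairs <- utils::combn(names(blocks), 2)
print(pairs)

# As vectors, c(1, 6) and 16 differ; only pasted strings would collide.
vectors_differ <- !identical(blocks$a[1:2], blocks$b[1])
strings_collide <- paste(blocks$a[1:2], collapse = "") == as.character(blocks$b[1])
```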

Error in rep.int(NA_character_, max(ends - 1)) : invalid 'times' value

I am trying to use this package to see if it works for this package (https://github.com/IndrajeetPatil/ggstatsplot), but I keep getting the following error-

# in package directory?
getwd()
#> [1] "C:/Users/inp099/Documents/ggstatsplot"

# checking for duplicated code
dupree::dupree_package(".")
#> Error in rep.int(NA_character_, max(ends - 1)) : invalid 'times' value
#> In addition: Warning message:
#> In max(ends - 1) : no non-missing arguments to max; returning -Inf

Here is the traceback-

> traceback()
34: extract_r_source(source_file$filename, source_file$lines)
33: lintr::get_source_expressions(file)
32: get_source_expressions(.)
31: function_list[[i]](value)
30: freduce(value, `_function_list`)
29: `_fseq`(`_lhs`)
28: eval(quote(`_fseq`(`_lhs`)), env, env)
27: eval(quote(`_fseq`(`_lhs`)), env, env)
26: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
25: file %>% get_source_expressions() %>% get_localised_parsed_code_blocks() %>% 
        dplyr::filter_(~!token %in% "COMMENT")
24: .f(.x[[i]], ...)
23: purrr::map(., import_parsed_code_blocks_from_one_file)
22: function_list[[i]](value)
21: freduce(value, `_function_list`)
20: `_fseq`(`_lhs`)
19: eval(quote(`_fseq`(`_lhs`)), env, env)
18: eval(quote(`_fseq`(`_lhs`)), env, env)
17: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
16: files %>% purrr::map(import_parsed_code_blocks_from_one_file) %>% 
        dplyr::bind_rows()
15: import_parsed_code_blocks(.)
14: function_list[[i]](value)
13: freduce(value, `_function_list`)
12: `_fseq`(`_lhs`)
11: eval(quote(`_fseq`(`_lhs`)), env, env)
10: eval(quote(`_fseq`(`_lhs`)), env, env)
9: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
8: files %>% import_parsed_code_blocks() %>% tokenize_code_blocks() %>% 
       filter_(~block_size >= min_block_size)
7: preprocess_code_blocks(files, min_block_size)
6: eval(lhs, parent, parent)
5: eval(lhs, parent, parent)
4: preprocess_code_blocks(files, min_block_size) %>% find_best_matches()
3: dupree(keep_files, min_block_size)
2: dupree_dir(package, min_block_size, filter = paste0(package, 
       "/R/"))
1: dupree::dupree_package(".")

And session information-

sessioninfo::session_info()
#> - Session info ----------------------------------------------------------
#>  setting  value                                             
#>  version  R Under development (unstable) (2018-11-30 r75724)
#>  os       Windows 10 x64                                    
#>  system   x86_64, mingw32                                   
#>  ui       RTerm                                             
#>  language (EN)                                              
#>  collate  English_United States.1252                        
#>  ctype    English_United States.1252                        
#>  tz       America/New_York                                  
#>  date     2019-01-26                                        
#> 
#> - Packages --------------------------------------------------------------
#>  package     * version    date       lib source                    
#>  assertthat    0.2.0      2017-04-11 [1] CRAN (R 3.5.1)            
#>  cli           1.0.1.9000 2019-01-20 [1] Github (r-lib/cli@94e2fc5)
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 3.5.1)            
#>  digest        0.6.18     2018-10-10 [1] CRAN (R 3.5.1)            
#>  evaluate      0.12       2018-10-09 [1] CRAN (R 3.5.1)            
#>  highr         0.7        2018-06-09 [1] CRAN (R 3.5.1)            
#>  htmltools     0.3.6      2017-04-28 [1] CRAN (R 3.5.1)            
#>  knitr         1.21       2018-12-10 [1] CRAN (R 3.6.0)            
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 3.5.1)            
#>  Rcpp          1.0.0      2018-11-07 [1] CRAN (R 3.6.0)            
#>  rmarkdown     1.11       2018-12-08 [1] CRAN (R 3.6.0)            
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.6.0)            
#>  stringi       1.2.4      2018-07-20 [1] CRAN (R 3.6.0)            
#>  stringr       1.3.1      2018-05-10 [1] CRAN (R 3.5.1)            
#>  withr         2.1.2      2018-03-15 [1] CRAN (R 3.5.1)            
#>  xfun          0.4        2018-10-23 [1] CRAN (R 3.6.0)            
#>  yaml          2.2.0      2018-07-25 [1] CRAN (R 3.5.1)            
#> 
#> [1] C:/Users/inp099/Documents/R/win-library/3.6
#> [2] C:/Program Files/R/R-devel/library

Created on 2019-01-26 by the reprex package (v0.2.1)

dupree on a *.Rmd file with no R blocks

Ran dupree on polyply/vignettes/origins_of_the_datasets.Rmd which is an R-markdown document with no R blocks.

Error thrown is different from that in #1 , relates to knitr parsing of the Rmd document.

Error in rep.int(NA_character_, max(ends - 1)) : Invalid `times` value
In addition: Warning message:
In max(ends - 1) : no non-missing arguments to max; returning -Inf

Error appears to be thrown when running lintr:::get_source_expressions and hence lintr:::extract_r_source

Call to lintr:::extract_r_source within lintr::get_source_expressions looks (effectively) like this:

lintr:::extract_r_source( filename, base::readLines(filename) )
Within extract_r_source, get_knitr_pattern works fine, starts and ends are defined (but empty). Then definition of output fails. TODO: send PR to lintr re short-circuit of extract_r_source
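
The failure mode can be reproduced outside lintr: with an empty `ends`, `max(ends - 1)` is `-Inf`, which is not a valid `times` value. The sketch below shows the kind of short-circuit meant here; `safe_blank_lines()` is a hypothetical stand-in, not lintr's actual code.

```r
# Hypothetical stand-in for the step that fails when no chunks are found.
safe_blank_lines <- function(ends) {
  # Short-circuit: no chunk boundaries means there is nothing to extract.
  if (length(ends) == 0) {
    return(character(0))
  }
  rep.int(NA_character_, max(ends - 1))
}

no_chunks <- safe_blank_lines(integer(0))
some_chunks <- safe_blank_lines(c(3L, 5L))
```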

tag the 0.3 version now that it has been pushed to CRAN

Migrate to GHA

Currently using Travis for CI. Should update to using GHA.

`dupree_package` should assert R package structure

If I call dupree_package on the repo for {rscala}, dupree_package runs fine. But, the rscala package code resides in the subdirectory {repo_root}/R/rscala rather than in the repo-root.

When a directory is passed to dupree::dupree_package, it should check that NAMESPACE, DESCRIPTION and R/ are present in that directory.
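
A sketch of such a check in base R. `assert_package_dir()` is a hypothetical helper; the default list of required entries follows the sentence above, and the error wording is only a suggestion.

```r
# Hypothetical helper: stop with an informative message unless `path`
# looks like the root of an R package.
assert_package_dir <- function(path, required = c("DESCRIPTION", "NAMESPACE", "R")) {
  if (!dir.exists(path)) {
    stop("Could not find directory: ", path, call. = FALSE)
  }
  missing_entries <- required[!file.exists(file.path(path, required))]
  if (length(missing_entries) > 0) {
    stop(
      "'", path, "' does not look like an R package; missing: ",
      paste(missing_entries, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(path)
}

# An empty directory fails the check.
not_a_package <- tempfile("empty")
dir.create(not_a_package)
failure <- tryCatch(assert_package_dir(not_a_package), error = conditionMessage)

# A directory with the expected entries passes.
a_package <- tempfile("pkg")
dir.create(file.path(a_package, "R"), recursive = TRUE)
file.create(file.path(a_package, c("DESCRIPTION", "NAMESPACE")))
success <- assert_package_dir(a_package)
```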

change default `min_block_size` to 20

remove outdated `dupr`-related functions

This will render some imports unused, eg, Biostrings.
Biostrings is killing travis-CI at present

`relative_path = TRUE` argument in `dupree_[dir|package]`

Reason:

Running dupree_package on aoos during code_as_data returned a data.frame that looks like:

file_a  file_b  block_a block_b line_a  line_b  score
<MY_HOME>/temp/dev-tools-analysis/aoos/R/S4-expressions.R    <MY_HOME>/temp/dev-tools-analysis/aoos/R/S4-expressions.R  2       4       71      139     0.24880382775119614
<MY_HOME>/temp/dev-tools-analysis/aoos/R/RL-retList.R        <MY_HOME>/temp/dev-tools-analysis/aoos/R/S4RC-Accessor.R   98      16      112     32      0.2222222222222222

I would rather the file paths were relative to the package-path or dir-path that was passed into dupree_* (for this particular analysis), that is:

file_a  file_b  block_a block_b line_a  line_b  score
R/S4-expressions.R    R/S4-expressions.R  2       4       71      139     0.24880382775119614
R/RL-retList.R        R/S4RC-Accessor.R   98      16      112     32      0.2222222222222222

lintr provides an equivalent argument, that is TRUE by default.
See lint_dir / lint_package:

#' @param relative_path if \code{TRUE}, file paths are printed using their path
#' relative to the base directory.  If \code{FALSE}, use the full
#' absolute path.

Could the name relative_path be confusing? There are two things it might control:

- it is supposed to indicate that the paths in the results will be written relative to the user-specified directory;
- when the user provides a list of files or directories that should be ignored, should relative_path also dictate whether the ignored directories are specified relative to the analysed directory?
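
For the first meaning, the conversion itself is small. A base R sketch, assuming forward-slash paths; `make_relative()` is a hypothetical helper and the base directory is a placeholder.

```r
# Hypothetical helper: strip a base directory from the front of each path.
make_relative <- function(paths, base_dir) {
  prefix <- paste0(sub("/+$", "", base_dir), "/")
  has_prefix <- startsWith(paths, prefix)
  # Paths outside base_dir are left unchanged.
  paths[has_prefix] <- substring(paths[has_prefix], nchar(prefix) + 1L)
  paths
}

base_dir <- "/home/someone/aoos"
full_paths <- c(
  "/home/someone/aoos/R/S4-expressions.R",
  "/home/someone/aoos/R/RL-retList.R",
  "/elsewhere/R/other.R"
)

relative_paths <- make_relative(full_paths, base_dir)
print(relative_paths)
```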

dupree_dir function or dupree(..., dir = ".")

Need a function that can identify all R or .Rmd files in subdirectories of the working-directory.

Preferably, it should disregard files that match a regex (eg, drop_pattern = "testthat|Rcheck") or disregard stated subdirectories.
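
A base R sketch of the file discovery step. `find_r_files()` and its `drop_pattern` argument are hypothetical names taken from the wording above; the pattern is applied to paths relative to the searched directory so that the directory's own name cannot trigger a drop.

```r
# Hypothetical helper: find .R / .Rmd files below `path`, optionally
# dropping any whose relative path matches `drop_pattern`.
find_r_files <- function(path = ".", drop_pattern = NULL) {
  relative <- list.files(
    path, pattern = "\\.(R|Rmd)$", recursive = TRUE, ignore.case = TRUE
  )
  if (!is.null(drop_pattern)) {
    relative <- relative[!grepl(drop_pattern, relative)]
  }
  file.path(path, relative)
}

# Small stand-in project.
project <- tempfile("project")
dir.create(file.path(project, "R"), recursive = TRUE)
dir.create(file.path(project, "tests", "testthat"), recursive = TRUE)
dir.create(file.path(project, "vignettes"), recursive = TRUE)
file.create(file.path(project, "R", "a.R"))
file.create(file.path(project, "tests", "testthat", "test-a.R"))
file.create(file.path(project, "vignettes", "v.Rmd"))

found <- find_r_files(project, drop_pattern = "testthat|Rcheck")
print(basename(found))
```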

Default package should be "." when calling `dupree_package()`

Current:

dupree_package()
# Error in dupree_package(package, min_block_size, filter = paste0...)
#   argument "package" is missing, with no default

Speed ups

For large code bases, dupree may be pretty slow because it does block-by-block pairwise-sequence analysis.
Where there are lots of top-level expressions, it may be sped up by doing a k-mer similarity analysis over the tokens first, and then only running pairwise-sequence analysis on block-pairs that share some k-mers.

BUT: only do this if dupree is far too slow for typically-sized packages. E.g., do a package-size analysis over a range of different R packages and compare the number of blocks, and the block-size distributions, to the length of time that dupree takes.
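
A sketch of the k-mer prefilter idea in base R, assuming blocks are integer token vectors. The helper names and the choice of k = 3 are assumptions for illustration; tokens are joined with a separator so that multi-digit symbols stay unambiguous.

```r
# Hypothetical helper: the set of k-mers (as strings) in a token vector.
kmers <- function(tokens, k) {
  if (length(tokens) < k) {
    return(character(0))
  }
  starts <- seq_len(length(tokens) - k + 1L)
  unique(vapply(
    starts,
    function(i) paste(tokens[i:(i + k - 1L)], collapse = "-"),
    character(1)
  ))
}

# Hypothetical helper: do two blocks share at least one k-mer?
share_kmer <- function(a, b, k = 3L) {
  length(intersect(kmers(a, k), kmers(b, k))) > 0
}

overlapping <- share_kmer(c(1L, 2L, 3L, 4L), c(9L, 2L, 3L, 4L))
disjoint <- share_kmer(c(1L, 2L, 3L), c(4L, 5L, 6L))
```

Only pairs for which `share_kmer()` is TRUE would then go on to the more expensive pairwise alignment.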

Add rstudio integration

.. so the user can navigate to source-code lines that are flagged by dupree

See this gist: https://gist.github.com/moodymudskipper/a1d0344e4a8aeb93708ff44c5d9c01f7

And the twitter thread by Antoine Fabri: https://twitter.com/antoine_fabri/status/1506628013684891649
