
dupree's People

Contributors

alanocallaghan, olivroy, russhyde


dupree's Issues

Add a system.file() example that the user could use

At present, the example code in the dupree README can't be rerun by a user who installed dupree from CRAN, say, since the examples need the dupree source code (and dupree's own source code doesn't contain much duplication anyway).

It would be useful to have a couple of dupree examples that use example files (from inst/) that CRAN users could access.

  • add two example files to inst/
  • add example sections that use these system.file() paths to the dupree() roxygen
  • use the system.file() paths within a vignette or in the README
  • also (but unrelated) show how to visualise a similarity network using tidygraph in the vignette
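
A minimal sketch of what such an example could look like. The file name `duplicated.R` and the `extdata` subdirectory are assumptions for illustration; no such file is known to ship with dupree. Because `system.file()` returns an empty string when the package or file cannot be found, the call to `dupree()` is guarded.

```r
# Hypothetical example file; the name "duplicated.R" and the "extdata"
# subdirectory are assumptions, not files known to ship with dupree.
example_file <- system.file("extdata", "duplicated.R", package = "dupree")

# system.file() returns "" when the package or file cannot be found,
# so only call dupree() when the file is really there.
if (nzchar(example_file)) {
  dups <- dupree::dupree(example_file)
  print(dups)
} else {
  message("Example file not found; is dupree installed?")
}
```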

Vignette: How to sensibly use {dupree} as part of your workflow

DON'T JUST DEDUPLICATE WITHOUT THINKING!

Write up a vignette of a sensible workflow:

The aim is to add useful functionality that reduces the floor-space for bugs without deduplicating so far that your code becomes unreadable

Two main approaches:

  • If you want to fix a bug
    • Write a test that traps the bug, if you can
    • Isolate the code that is affected by the bug
    • run dupree and see if that code is duplicated elsewhere in the package
    • [see if you need to write a test that captures bugs in the duplicated bits]
    • [deduplicate the duplicated bits of code & then fix the bug]
    • OR [fix the bug(s) and then deduplicate]
  • If you want to add a feature
    • again, write tests for the new feature
    • write the code to implement your feature
    • run dupree; find if your feature-code duplicates some other part of the package
    • refactor the duplicated bits of code

As such, {dupree} doesn't really fit into the goodpractice-type checks (where you analyse the current state of the whole package relative to general guidelines); you need to consider before & after and focus on the bit of code that you are currently modifying. [? does that make sense]

Fix any dplyr-based code / tests to work with dplyr=1.0

Dear Russ Hyde,

This is an automated email to let you know that:

  • A new version of dplyr is ready to go to CRAN. dplyr is
    currently at version 0.8.99.9002 and will become 1.0.0 upon release.

  • dupree uses dplyr and has problems with the new version.

  • We plan to submit dplyr to CRAN on May 1.

This is a major release. See
https://www.tidyverse.org/blog/2020/03/dplyr-1-0-0-is-coming-soon/ for
a detailed article about what's changed.

I need your help to keep dupree and dplyr working together smoothly.
In the next weeks, can you please:

  1. Read about the changes to dplyr at
    https://github.com/tidyverse/dplyr/blob/master/NEWS.md.
    This page includes a list of breaking changes, the reasoning behind
    them, and how to update your code.

  2. Carefully inspect the failing checks listed at the bottom of this email.

  3. For each failing check, either update your package, or tell me
    that I have a bug. If you have made changes to your package, please
    submit an update to CRAN before May 1.

If you have discovered a bug in dplyr, please file an issue (ideally
with a small reprex that illustrates the problem) at
https://github.com/tidyverse/dplyr/issues. If you're not sure whether
or not you've found a bug, please file an issue at
https://github.com/tidyverse/dplyr/issues for discussion. Breaking
changes that are not listed qualify as bugs.

Please respond to this message if you have any questions.

Thanks,

Romain Francois

dupree on a *.R file with no R blocks

Calling dupree on coxpresdbr/R/coxpresdbr_parse.R throws an error.

The only content in coxpresdbr_parse.R is some blank lines and a comment.

Error is:
Error in mutate_impl(.data, dots) : Evaluation error: unique() applies only to vectors.
traceback() shows the error occurs in a mutate_ call in enumerate_code_symbols()
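
One way to avoid this could be to skip files that contain no top-level expressions before any symbols are enumerated. The sketch below uses only base R; `has_r_code` is a hypothetical helper and is not part of dupree.

```r
# Returns TRUE when the lines contain at least one top-level R expression.
# `has_r_code` is a hypothetical helper name.
has_r_code <- function(lines) {
  length(parse(text = lines, keep.source = FALSE)) > 0
}

comment_only <- c("", "# nothing but a comment", "")
has_r_code(comment_only)
has_r_code("x <- 1")
```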

Replace pr-commands.yaml with pre-commit rules

Aim: to ensure that styler::style_pkg() and devtools::document() are run on every commit. This should keep the code clean, and is more hands-off than having PR discussion commands for styling and documenting the code.

Check that `dupree_package` works on windows

dupree_package uses a filter that works like
grep(pattern = "<my_package>/R/", x= initial_files)

... where initial_files is created by dir(). Under Windows, does dir() return file paths like "<my_package>\R\some_file.R" or like "<my_package>/R/some_file.R"? If the former, then the filter won't work on Windows.
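
Whichever form dir() returns, the filter could be made independent of the separator. The following is a sketch under that assumption: `keep_r_dir_files` is a hypothetical helper, and it matches on a path prefix rather than with the grep() call shown above.

```r
# Keep only files under <package>/R/, whichever separator the paths use.
# `keep_r_dir_files` is a hypothetical helper, not part of dupree.
keep_r_dir_files <- function(files, package) {
  to_forward <- function(x) gsub("\\", "/", x, fixed = TRUE)
  prefix <- paste0(sub("/+$", "", to_forward(package)), "/R/")
  files[startsWith(to_forward(files), prefix)]
}

files <- c(
  "my_package\\R\\some_file.R",
  "my_package/R/other.R",
  "my_package/tests/testthat.R"
)
keep_r_dir_files(files, "my_package")
```

The returned paths keep their original separators; only the comparison is done on the normalised form.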

Length of code-block contents should be returned

dupree() filters out trivial symbols and then quantifies the similarity between the blocks that remain. But this means that:
library(dplyr)
gets converted to non-trivial symbols
"library dplyr"

so any pair of files containing library(dplyr) will match exactly for this specific block.
We either need a way of running dupree for blocks that are of at least this length or a way of returning results from dupree that contain the block-lengths for each compared pair of blocks.
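
To make the block-length idea concrete, here is a sketch that counts the symbols left in a block once brackets and commas are dropped. The set of "trivial" tokens is an assumption for illustration and is not necessarily the set that dupree disregards; `count_symbols` is a hypothetical helper.

```r
# Count the symbols left in a code block after dropping brackets and commas.
# The set of "trivial" tokens here is an assumption for illustration only.
count_symbols <- function(code, trivial = c("(", ")", "{", "}", ",")) {
  parsed <- utils::getParseData(parse(text = code, keep.source = TRUE))
  tokens <- parsed$text[parsed$terminal]
  sum(!tokens %in% trivial)
}

count_symbols("library(dplyr)")         # 2: "library" and "dplyr"
count_symbols("x <- mean(c(1, 2, 3))")  # a longer block yields a larger count
```

A count like this could either be returned alongside each compared pair or be used as a threshold for dropping very short blocks.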

rewrite all tidyverse `select_(~ ...)`-type code

Rewrite this stuff, preferably using base R constructs for stability.
When running tests for dupree, in R-3.5.1 with dplyr=0.8.3, I get comments like

select_ is deprecated use select() instead ...

Please try not to break backwards compatibility (so don't use {{ my_columns }} syntax) - maybe test the package with both R=3.4.1/dplyr0.7/rlang0.2 and R=3.5.1/dplyr0.8/rlang0.4
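
As an illustration of the base R route, the sketch below gives subsetting equivalents for the kind of `filter_()` and `select_()` calls in question. The filter conditions are modelled on calls that appear in a traceback elsewhere on this page; the data frame itself is made up.

```r
# Made-up token table; only the column names echo those used in dupree.
tokens <- data.frame(
  token = c("SYMBOL", "COMMENT", "SYMBOL_FUNCTION_CALL"),
  block_size = c(12L, 3L, 45L),
  stringsAsFactors = FALSE
)
min_block_size <- 10L

# Instead of filter_(~ !token %in% "COMMENT")
no_comments <- tokens[!tokens$token %in% "COMMENT", , drop = FALSE]

# Instead of filter_(~ block_size >= min_block_size)
big_enough <- tokens[tokens$block_size >= min_block_size, , drop = FALSE]

# Instead of select_(~ token)
only_token <- tokens[, "token", drop = FALSE]
```

One difference to keep in mind: an NA in the logical condition produces an NA row with `[`, whereas dplyr filters drop it, so wrapping the condition in which() may be needed if NAs are possible.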

hedgehog tests for alignment speed

Code block alignments are done by calling stringdist::seq_sim
In a given iteration, for each code block, seq_sim is used to compare it against all other code blocks in the set of files (seq_sim(this_code_block, all_other_code_blocks)).

There may be faster ways of doing the alignments. Suggest making an efficiency test to ensure that the speed of dupree is not adversely affected by pull-requests etc.

Use hedgehog to generate random vectors of integers that can be used as test input.

`dupree_package` fails for {logging}

In {logging} the file structure of the github repo looks like:

/home/ah327h/temp/dev-tools-analysis/logging/
├── cran-comments.md
├── handlers
│   └── pkg
│       ├── DESCRIPTION
│       ├── man
│       │   └── sentry.Rd
│       ├── NAMESPACE
│       └── R
│           └── sentry.R
├── pkg
│   ├── DESCRIPTION
│   ├── man
│   │   ├── // -- snip -- //
│   ├── NAMESPACE
│   ├── NEWS.md
│   ├── R
│   │   ├── logger.R
│   │   ├── // -- snip -- //
│   └── tests
│       ├── run_tests.R
│       ├── testthat
│       │   ├── // -- snip -- //
│       └── testthat.R
├── README.md
├── tutorial.rst
└── www
    - // snip //

But dupree_package fails on this package because it can't find a top-level ./R/ directory

TODO:

  • (temp) stop dupree_package with informative error message when no top-level R directory is found
  • determine how common the structure: (repo/-- pkg/-- R) is;
  • rewrite dupree_package to handle this structure

ensure R CMD check passes without warnings

Current warnings:

* checking package subdirectories ... WARNING
Subdirectory ‘inst’ contains no files.

* checking Rd \usage sections ... WARNING
Undocumented arguments in documentation object 'annotate_parsed_content'
  ‘parsed_content’ ‘file’ ‘block’ ‘start_line’
Undocumented arguments in documentation object 'dupree'
  ‘files’ ‘min_block_size’ ‘...’
Undocumented arguments in documentation object 'get_localised_parsed_code_blocks'
  ‘source_exprs’

extend documentation

  • can the user disregard specific directories (eg, tests/)
  • how to add to / modify the disregarded symbols
  • network diagram of duplication in a project / package
  • why were the default disregarded symbols selected?
  • limitation: only looks at top-level blocks
  • ? alternatives

Plans for {dupree} 0.3.x

Major

  • Provide a way to highlight, or print, the duplicated region corresponding to a match (#44 and #27)
  • Introduce a lightweight class to store pairwise block-to-block duplication information (#60; see #63)
  • Visualisation of duplication within a project
  • Provide some estimate of lines-of-code-that-could-be-saved (or similar) based on the dup-length and the frequency with which that dup is found across the files

Minor

  • dupree_package() and dupree_dir() should work on current-working dir by default
  • tests that directly use the top-level entry points, eg, make a package structure in a subdir of /tests and run dupree, dupree_dir and dupree_package on it (#45)
  • bump min_block_size to 40
  • fix dupree_package so that it assesses
    • the R subdirectory of a given directory, rather than any subdirectory of the given directory that contains /R/ in its name (found while running dupree_package on unitizer)
    • the structure of the passed-in directory (are DESCRIPTION, NAMESPACE and R/ all present?) #57
  • relative_path argument in dupree_dir and dupree_package to indicate whether the filepaths in the results should be written relative to the analysed directory (vs, as full paths) and also whether excluded directories/files are specified relative to the analysed directory #62

Add shiny app

Prototype:

  • App should run locally on user's computer
  • They can point the app to a directory, package or file
  • Dupree will run and a table of duplications will be created

Would like:

  • Graph-based visualisation of duplications
  • Nested-graph (blocks nested inside files) or circular (ordered blocks inside files on the perimeter with links between dups) visualisation

dplyr::n should be imported

From #21 : In a fresh R session, if dplyr is not explicitly loaded, this function gives a different error because it fails to find dplyr::n()

> dupree::dupree_package(".")
Error in n() : could not find function "n"
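
Two common ways to address this in a package that uses roxygen2 are sketched below: importing `n` into the package namespace, or qualifying the call. The object name `grouped_blocks` in the commented line is hypothetical.

```r
# In any roxygen block of the package (for example a package-level doc file),
# followed by regenerating NAMESPACE with devtools::document():
#' @importFrom dplyr n
NULL

# Alternatively, qualify the call wherever it is used, for example:
# dplyr::summarise(grouped_blocks, count = dplyr::n())
```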

add presentation for edinbR

Presentation should cover:

  • clean-ish code
  • code smells and architectural ideals
  • how style inconsistencies & duplication arises in scripts / packages
  • lintr
    • linting an example script
    • modifying which linters are used
    • advances in the dev version of lintr
    • writing a new linter
  • dupree
    • illustration of algorithm?
    • identification of duplicated code
    • analysis of duplicated code across packages
      • dupree on lintr
      • visualisation of dupree results on lintr
    • is dupree fast enough?
  • the myriad ways to clean up duplicated sections of code
    • how would we clean up the duplicated lintr code?

Classes: `dup` and `dups`

PLAN: conversion of dupree output into Dups object

  • rewrite all tests that use output from dupree*() functions to test on as.data.frame(dupree_*()) rather than on the actual return values.
  • add Dups class (just wrap the data.frame)
  • add as.data.frame.Dups()
  • rewrite dupree() to return a Dups object
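
The first steps of this plan can be sketched with a plain S3 class. This is an illustration under assumptions, not the class as implemented in dupree: the constructor name `new_dups` and the internal field `dups_df` are hypothetical, and the column names are borrowed from the results table shown elsewhere on this page.

```r
# Wrap a data.frame of dupree-style results in a minimal S3 class.
# The constructor name `new_dups` and field `dups_df` are hypothetical.
new_dups <- function(x) {
  stopifnot(is.data.frame(x))
  structure(list(dups_df = x), class = "Dups")
}

# Method so that as.data.frame(<Dups>) recovers the underlying table.
as.data.frame.Dups <- function(x, ...) {
  x$dups_df
}

results <- data.frame(
  file_a = "R/a.R", file_b = "R/b.R",
  block_a = 1L, block_b = 2L, score = 0.5
)
dups <- new_dups(results)
identical(as.data.frame(dups), results)
```

Tests written against as.data.frame(dupree_*()) would then keep passing when the return type changes from a data.frame to the wrapper.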

Conversion of Dups[data.frame] into Dups[list(Dup)]:

  • TODO

Plans for {dupree} 0.3.1

Major

  • Provide a way to highlight, or print, the duplicated region corresponding to a match (#44 and #27)
  • Visualisation of duplication within a project
  • Provide some estimate of lines-of-code-that-could-be-saved (or similar) based on the dup-length and the frequency with which that dup is found across the files
  • Migrate from travis to github actions (& add OS-X checks) (#73)
  • pkgdown website
  • Vignettes / Blogposts:
    • how to sensibly use dupree during development (#59)
    • example of PR or bug report aided by dupree
    • visualisation of duplication
  • add shiny app to illustrate running / visualising results from dupree (#74)

Minor

  • fix dupree_package so that it assesses
    • the R subdirectory of a given directory, rather than any subdirectory of the given directory that contains /R/ in its name (found while running dupree_package on unitizer)
  • relative_path argument in dupree_dir and dupree_package to indicate whether the filepaths in the results should be written relative to the analysed directory (vs, as appended paths: ie, should dupree_package("pkg") have "pkg/R/some_file" or "R/some_file" in its file column?) and also whether excluded directories/files are specified relative to the analysed directory #62
  • README
  • Specify R-base >= 3.4

Report missingness (for files / dirs) better

With an empty directory as working-dir:

dupree_package("somePackage")
# Error: Column `text` not found in `.data`
# Run `rlang::last_error()` to see where the error occurred.

It should say "could not find package " or something similar

Function for extracting a template of any duplicated code

Say these two code blocks are identified by dupree

my_code <- some_data %>%
   a_really() %>%
   long_pipeline() %>%
   bespoke_function1()
.
.
.
some_data %>%
    a_really() %>%
    long_pipeline() %>%
    bespoke_function2()

Is there some way that dupree could take the details of these two code blocks (file / line) and return a template for abstracting out the common code?

new_function <- function(x) {
  x %>% a_really() %>% long_pipeline()
}
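
A deliberately naive, text-based sketch of the idea: split both pipelines on `%>%` and keep the leading steps they have in common. `common_pipeline_steps` is a hypothetical helper; a real implementation would more likely work on the parsed expressions than on strings.

```r
# Split each pipeline into steps and keep the leading steps they share.
# `common_pipeline_steps` is a hypothetical helper, not a dupree function.
common_pipeline_steps <- function(code_a, code_b) {
  split_steps <- function(code) {
    # Collapse to one string and drop any leading assignment.
    code <- sub("^.*<-", "", paste(code, collapse = " "))
    trimws(strsplit(code, "%>%", fixed = TRUE)[[1]])
  }
  a <- split_steps(code_a)
  b <- split_steps(code_b)
  n <- min(length(a), length(b))
  same <- a[seq_len(n)] == b[seq_len(n)]
  # Number of leading steps that agree before the first mismatch.
  n_shared <- if (all(same)) n else which(!same)[1] - 1
  a[seq_len(n_shared)]
}

block_1 <- "my_code <- some_data %>% a_really() %>% long_pipeline() %>% bespoke_function1()"
block_2 <- "some_data %>% a_really() %>% long_pipeline() %>% bespoke_function2()"
common_pipeline_steps(block_1, block_2)
```

The shared steps after the input (`a_really()` and `long_pipeline()` here) are the candidates for the body of the extracted function.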

dupree_package fails if path specified like "~/the/path"

This fails because there is an attempt to filter to keep only files in /the/fully/specified/path/R/ using regexes. But if the package is specified like ~/specified/path, then its files look like ~/specified/path/R/some-file.R and do not match the fully-specified path.
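
One possible fix is to expand the tilde on both sides before comparing. The sketch below uses base R's path.expand(); `files_in_r_dir` is a hypothetical helper and matches on a path prefix rather than a regex.

```r
# Expand "~" in both the package path and the file paths before comparing.
# `files_in_r_dir` is a hypothetical helper, not part of dupree.
files_in_r_dir <- function(files, package) {
  r_dir <- paste0(sub("/+$", "", path.expand(package)), "/R/")
  files[startsWith(path.expand(files), r_dir)]
}

files <- c("~/the/path/R/some-file.R", "~/the/path/tests/testthat.R")

# Works whether the file paths are given with "~" or already expanded.
files_in_r_dir(files, "~/the/path")
files_in_r_dir(path.expand(files), "~/the/path")
```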

dupree checks a->b and b->a

I noticed in the presentation that every result is presented twice; once as a->b and again as b->a (where -> denotes "is most similar"). I don't think doing a one vs all for every block is the right approach, although when you reduce the problem to comparison of int vectors it's probably not too bad.

What I would do (see PR) is do only the unique combinations, then filter those results somehow. The way I've done in the PR is a bit off, so take it with a grain of salt.
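
For illustration, two base R ways of getting each unordered pair only once: enumerating unique combinations up front, or collapsing an existing a->b / b->a table. This is a sketch, not the approach taken in the PR; the block labels and scores are made up.

```r
# Each unordered pair of blocks appears once: choose(n, 2) comparisons.
blocks <- c("a", "b", "c", "d")
pairs <- t(combn(blocks, 2))
colnames(pairs) <- c("block_a", "block_b")
nrow(pairs)

# Alternatively, collapse an existing a->b / b->a table to unique pairs
# by building an order-independent key for each row.
results <- data.frame(
  block_a = c("a", "b", "c"),
  block_b = c("b", "a", "d"),
  score = c(0.9, 0.9, 0.4),
  stringsAsFactors = FALSE
)
key <- paste(
  pmin(results$block_a, results$block_b),
  pmax(results$block_a, results$block_b)
)
unique_results <- results[!duplicated(key), , drop = FALSE]
unique_results
```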

btw the thought occurs - how well does this approach of tokenisation work when you have more than 10 symbols? Seems like that might mess with the similarity quite a bit as "1" "6" then becomes the same as "16"
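
The concern can be shown in a few lines. It only arises if the integer codes are pasted together into a single string without a separator; whether dupree does that is not established here.

```r
# Two different symbol sequences: (1, 6) versus (16).
seq_a <- c(1L, 6L)
seq_b <- 16L

# Pasted into strings without a separator, the two become indistinguishable,
paste(seq_a, collapse = "") == paste(seq_b, collapse = "")   # TRUE

# whereas the integer vectors themselves stay distinct,
identical(seq_a, seq_b)                                      # FALSE

# and a separator keeps the string forms apart as well.
paste(seq_a, collapse = " ") == paste(seq_b, collapse = " ") # FALSE
```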

Error in rep.int(NA_character_, max(ends - 1)) : invalid 'times' value

I am trying to use this package to see if it works for this package (https://github.com/IndrajeetPatil/ggstatsplot), but I keep getting the following error-

# in package directory?
getwd()
#> [1] "C:/Users/inp099/Documents/ggstatsplot"

# checking for duplicated code
dupree::dupree_package(".")
#> Error in rep.int(NA_character_, max(ends - 1)) : invalid 'times' value
#> In addition: Warning message:
#> In max(ends - 1) : no non-missing arguments to max; returning -Inf

Here is the traceback-

> traceback()
34: extract_r_source(source_file$filename, source_file$lines)
33: lintr::get_source_expressions(file)
32: get_source_expressions(.)
31: function_list[[i]](value)
30: freduce(value, `_function_list`)
29: `_fseq`(`_lhs`)
28: eval(quote(`_fseq`(`_lhs`)), env, env)
27: eval(quote(`_fseq`(`_lhs`)), env, env)
26: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
25: file %>% get_source_expressions() %>% get_localised_parsed_code_blocks() %>% 
        dplyr::filter_(~!token %in% "COMMENT")
24: .f(.x[[i]], ...)
23: purrr::map(., import_parsed_code_blocks_from_one_file)
22: function_list[[i]](value)
21: freduce(value, `_function_list`)
20: `_fseq`(`_lhs`)
19: eval(quote(`_fseq`(`_lhs`)), env, env)
18: eval(quote(`_fseq`(`_lhs`)), env, env)
17: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
16: files %>% purrr::map(import_parsed_code_blocks_from_one_file) %>% 
        dplyr::bind_rows()
15: import_parsed_code_blocks(.)
14: function_list[[i]](value)
13: freduce(value, `_function_list`)
12: `_fseq`(`_lhs`)
11: eval(quote(`_fseq`(`_lhs`)), env, env)
10: eval(quote(`_fseq`(`_lhs`)), env, env)
9: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
8: files %>% import_parsed_code_blocks() %>% tokenize_code_blocks() %>% 
       filter_(~block_size >= min_block_size)
7: preprocess_code_blocks(files, min_block_size)
6: eval(lhs, parent, parent)
5: eval(lhs, parent, parent)
4: preprocess_code_blocks(files, min_block_size) %>% find_best_matches()
3: dupree(keep_files, min_block_size)
2: dupree_dir(package, min_block_size, filter = paste0(package, 
       "/R/"))
1: dupree::dupree_package(".")

And session information-

sessioninfo::session_info()
#> - Session info ----------------------------------------------------------
#>  setting  value                                             
#>  version  R Under development (unstable) (2018-11-30 r75724)
#>  os       Windows 10 x64                                    
#>  system   x86_64, mingw32                                   
#>  ui       RTerm                                             
#>  language (EN)                                              
#>  collate  English_United States.1252                        
#>  ctype    English_United States.1252                        
#>  tz       America/New_York                                  
#>  date     2019-01-26                                        
#> 
#> - Packages --------------------------------------------------------------
#>  package     * version    date       lib source                    
#>  assertthat    0.2.0      2017-04-11 [1] CRAN (R 3.5.1)            
#>  cli           1.0.1.9000 2019-01-20 [1] Github (r-lib/cli@94e2fc5)
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 3.5.1)            
#>  digest        0.6.18     2018-10-10 [1] CRAN (R 3.5.1)            
#>  evaluate      0.12       2018-10-09 [1] CRAN (R 3.5.1)            
#>  highr         0.7        2018-06-09 [1] CRAN (R 3.5.1)            
#>  htmltools     0.3.6      2017-04-28 [1] CRAN (R 3.5.1)            
#>  knitr         1.21       2018-12-10 [1] CRAN (R 3.6.0)            
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 3.5.1)            
#>  Rcpp          1.0.0      2018-11-07 [1] CRAN (R 3.6.0)            
#>  rmarkdown     1.11       2018-12-08 [1] CRAN (R 3.6.0)            
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.6.0)            
#>  stringi       1.2.4      2018-07-20 [1] CRAN (R 3.6.0)            
#>  stringr       1.3.1      2018-05-10 [1] CRAN (R 3.5.1)            
#>  withr         2.1.2      2018-03-15 [1] CRAN (R 3.5.1)            
#>  xfun          0.4        2018-10-23 [1] CRAN (R 3.6.0)            
#>  yaml          2.2.0      2018-07-25 [1] CRAN (R 3.5.1)            
#> 
#> [1] C:/Users/inp099/Documents/R/win-library/3.6
#> [2] C:/Program Files/R/R-devel/library

Created on 2019-01-26 by the reprex package (v0.2.1)

dupree on a *.Rmd file with no R blocks

Ran dupree on polyply/vignettes/origins_of_the_datasets.Rmd which is an R-markdown document with no R blocks.

Error thrown is different from that in #1; it relates to knitr parsing of the Rmd document.

Error in rep.int(NA_character_, max(ends - 1)) : Invalid `times` value In addition: Warning message: In max(ends - 1) : no non-missing arguments to max; returning -Inf

Error appears to be thrown when running lintr:::get_source_expressions and hence lintr:::extract_r_source

Call to lintr:::extract_r_source within lintr::get_source_expressions looks (effectively) like this:

lintr:::extract_r_source( filename, base::readLines(filename) )
Within extract_r_source, get_knitr_pattern works fine, starts and ends are defined (but empty). Then definition of output fails. TODO: send PR to lintr re short-circuit of extract_r_source
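
The failure can be reproduced in isolation, which also suggests the shape of a short-circuit. This is a sketch based only on the expression quoted in the error message; `blank_lines` is a hypothetical function and is not lintr's actual code.

```r
# With no code chunks, `ends` is empty and max() falls back to -Inf.
ends <- integer(0)
n <- suppressWarnings(max(ends - 1))
n

# rep.int() then rejects the negative-infinite count.
failed <- inherits(try(rep.int(NA_character_, n), silent = TRUE), "try-error")
failed

# A guard of this shape avoids the failure; `blank_lines` is hypothetical
# and is not lintr's actual code.
blank_lines <- function(ends) {
  if (length(ends) == 0L) {
    return(character(0))
  }
  rep.int(NA_character_, max(ends - 1))
}
blank_lines(ends)
```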

Migrate to GHA

Currently using Travis for CI. Should update to using GHA.

`dupree_package` should assert R package structure

If I call dupree_package on the repo for {rscala}, dupree_package runs fine. But, the rscala package code resides in the subdirectory {repo_root}/R/rscala rather than in the repo-root.

When a directory is passed to dupree::dupree_package, it should check that NAMESPACE, DESCRIPTION and R/ are present in that directory.
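
A minimal sketch of such a check in base R. `assert_package_structure` is a hypothetical helper name, and the error messages are only suggestions.

```r
# Stop early unless `path` looks like the root of an R package.
# `assert_package_structure` is a hypothetical helper name.
assert_package_structure <- function(path) {
  if (!dir.exists(path)) {
    stop("Could not find directory: ", path, call. = FALSE)
  }
  required_files <- file.path(path, c("DESCRIPTION", "NAMESPACE"))
  absent <- c(
    basename(required_files)[!file.exists(required_files)],
    if (!dir.exists(file.path(path, "R"))) "R/"
  )
  if (length(absent) > 0) {
    stop("Not an R package root, missing: ",
         paste(absent, collapse = ", "), call. = FALSE)
  }
  invisible(path)
}

# Usage on a throwaway package skeleton.
pkg <- file.path(tempdir(), "examplepkg")
dir.create(file.path(pkg, "R"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(pkg, c("DESCRIPTION", "NAMESPACE"))))
assert_package_structure(pkg)
```

A repository whose package lives in a subdirectory would fail this check at the repo root and pass when the subdirectory is given instead.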

`relative_path = TRUE` argument in `dupree_[dir|package]`

Reason:

Running dupree_package on aoos during code_as_data returned a data.frame that looks like:

file_a  file_b  block_a block_b line_a  line_b  score
<MY_HOME>/temp/dev-tools-analysis/aoos/R/S4-expressions.R    <MY_HOME>/temp/dev-tools-analysis/aoos/R/S4-expressions.R  2       4       71      139     0.24880382775119614
<MY_HOME>/temp/dev-tools-analysis/aoos/R/RL-retList.R        <MY_HOME>/temp/dev-tools-analysis/aoos/R/S4RC-Accessor.R   98      16      112     32      0.2222222222222222

I would rather the file paths were relative to the package-path or dir-path that was passed into dupree_* (for this particular analysis), that is:

file_a  file_b  block_a block_b line_a  line_b  score
R/S4-expressions.R    R/S4-expressions.R  2       4       71      139     0.24880382775119614
R/RL-retList.R        R/S4RC-Accessor.R   98      16      112     32      0.2222222222222222

lintr provides an equivalent argument, that is TRUE by default.
See lint_dir / lint_package:

#' @param relative_path if \code{TRUE}, file paths are printed using their path
#' relative to the base directory.  If \code{FALSE}, use the full
#' absolute path.

Could the name relative_path be confusing?

  • it is supposed to indicate that the paths in the results will be written relative to the user-specified directory;
  • when the user provides a list of files or directories that should be ignored, should relative_path also dictate whether the ignored directories are specified relative to the analysed directory?
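
The path-rewriting part could be as small as the sketch below. `make_relative` is a hypothetical helper, not part of dupree or lintr, and the example paths are shortened from those in the table above.

```r
# Rewrite paths relative to the analysed directory.
# `make_relative` is a hypothetical helper, not part of dupree or lintr.
make_relative <- function(paths, base_dir) {
  prefix <- paste0(sub("/+$", "", base_dir), "/")
  inside <- startsWith(paths, prefix)
  # Paths outside `base_dir` are left untouched.
  paths[inside] <- substring(paths[inside], nchar(prefix) + 1)
  paths
}

paths <- c("aoos/R/S4-expressions.R", "aoos/R/RL-retList.R")
make_relative(paths, "aoos")
```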

dupree_dir function or dupree(..., dir = ".")

Need a function that can identify all R or .Rmd files in subdirectories of the working-directory.

Preferably, it should disregard files that match a regex (eg, drop_pattern = "testthat|Rcheck") or disregard stated subdirectories.
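
A sketch of the file-finding step in base R. `find_r_files` is a hypothetical helper; the `drop_pattern` argument follows the name suggested above and is applied to paths relative to the searched directory.

```r
# List .R and .Rmd files under `path`, dropping any that match `drop_pattern`.
# `find_r_files` is a hypothetical helper name.
find_r_files <- function(path = ".", drop_pattern = NULL) {
  files <- list.files(
    path, pattern = "\\.(R|Rmd)$", recursive = TRUE, ignore.case = TRUE
  )
  if (!is.null(drop_pattern)) {
    # Match against relative paths so the location of `path` cannot interfere.
    files <- files[!grepl(drop_pattern, files)]
  }
  file.path(path, files)
}

# Usage on a throwaway directory tree.
root <- file.path(tempdir(), "find_r_files_demo")
dir.create(file.path(root, "R"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "tests", "testthat"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(
  file.path(root, c("R/a.R", "tests/testthat/test-a.R", "notes.txt"))
))
find_r_files(root, drop_pattern = "testthat|Rcheck")
```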

Speed ups

For large code bases, dupree may be pretty slow because it does block-by-block pairwise-sequence analysis.
Where there are lots of top-level expressions, it may be sped up by doing a k-mer similarity analysis over the tokens first, and then only running pairwise-sequence analysis on block-pairs that share some k-mers.

BUT: only do this if dupree is wayy too slow for some typically-sized packages. eg, do a package-size analysis over a range of different R packages and compare the number of blocks, and the block-size-distributions to the length of time that dupree takes
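
To make the k-mer pre-filter concrete, here is a sketch over integer token vectors. The function names and the choice of k = 3 are assumptions; note that such a filter is a heuristic and would skip similar pairs that happen to share no exact k-mer.

```r
# All contiguous k-mers of an integer token vector, as strings.
kmers <- function(tokens, k) {
  if (length(tokens) < k) {
    return(character(0))
  }
  starts <- seq_len(length(tokens) - k + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + k - 1)], collapse = "-"),
    character(1)
  )
}

# TRUE when two blocks share at least one k-mer; only such pairs would go
# on to the full pairwise alignment.
share_kmer <- function(block_a, block_b, k = 3) {
  length(intersect(kmers(block_a, k), kmers(block_b, k))) > 0
}

block_1 <- c(4L, 8L, 15L, 16L, 23L)
block_2 <- c(1L, 8L, 15L, 16L, 42L)
block_3 <- c(7L, 7L, 7L, 7L, 7L)
share_kmer(block_1, block_2)  # TRUE: both contain 8-15-16
share_kmer(block_1, block_3)  # FALSE: no 3-mer in common
```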
