krayzpipes / txt-ferret

Identify and classify data in your text files with Python.

License: Apache License 2.0

Language: Python (100%)
Topics: regex, security, sensitive-data-discovery, pci-compliance, data-loss-prevention, data-protection, gdpr, dlp, python, intellectual-property

txt-ferret's Introduction

txtferret

Identify and classify data in your text files with Python.

Description

Definition: txtferret

  • A weasel-like mammal that feasts on rodents... and apparently social security numbers, credit card numbers, or any other data that's in your text or gzipped text files.

Use custom regular expressions and sanity checks (e.g., the Luhn algorithm for account numbers) to find sensitive data in virtually any size file via your command line.

Why use txtferret? See the How/why did this come about? section below.

Table of Contents

  • Quick Start
  • Configuration
  • How/why did this come about?
  • Development

Quick Start

PyPi

  1. Install it

    $ pip3 install txtferret

Repo

  1. Clone it.
    $ git clone git@github.com:krayzpipes/txt-ferret.git
    $ cd txt-ferret
  2. Setup environment.
    $ python3.7 -m venv venv
    $ source venv/bin/activate
  3. Install it.
    (venv) $ python setup.py install

Run it

  • Example file size:

    # Decent sized file.
    
    $ ls -alh | grep my_test_file.dat
    -rw-r--r--  1 mrferret ferrets  19G May  7 11:15 my_test_file.dat
  • Scanning the file.

    # Scan the file.
    # Masking is off by default, so the matched string is shown in full (use -m to mask).
    
    $ txtferret scan my_test_file.dat
    2019:05:20-22:18:01:-0400 Beginning scan for /home/mrferret/Documents/test_file_1.dat
    2019:05:20-22:18:18:-0400 PASSED sanity and matched regex - /home/mrferret/Documents/test_file_1.dat - Filter: fake_ccn_account_filter, Line 712567, String: 100102030405060708094
    2019:05:20-22:19:09:-0400 Finished scan for /home/mrferret/Documents/test_file_1.dat
    2019:05:20-22:19:09:-0400 SUMMARY:
    2019:05:20-22:19:09:-0400   - Matched regex, failed sanity: 2
    2019:05:20-22:19:09:-0400   - Matched regex, passed sanity: 1
    2019:05:20-22:19:09:-0400 Finished in 78 seconds (~1 minutes).
  • Scanning the file with a delimiter.

    # Break up each line of a CSV into columns by declaring a comma for a delimiter.
    # Scan each field in the row and return column numbers as well as line numbers.
    
    $ txtferret scan --delimiter , test_file_1.csv
    2019:05:20-21:41:57:-0400 Beginning scan for /home/mrferret/Documents/test_file_1.csv
    2019:05:20-21:44:34:-0400 PASSED sanity and matched regex - /home/mrferret/Documents/test_file_1.csv - Filter: fake_ccn_account_filter, Line 712567, String: 100102030405060708094, Column: 171
    2019:05:20-21:49:16:-0400 Finished scan for /home/mrferret/Documents/test_file_1.csv
    2019:05:20-21:49:16:-0400 SUMMARY:
    2019:05:20-21:49:16:-0400   - Matched regex, failed sanity: 2
    2019:05:20-21:49:16:-0400   - Matched regex, passed sanity: 1
    2019:05:20-21:49:16:-0400 Finished in 439 seconds (~7 minutes).
  • Scan all files in a directory, writing results to a file and stdout.

    # Uses multiprocessing to speed up scans of a bulk group of files.
    
    $ txtferret scan -o bulk_testing.log --bulk ../test_files/
    2019:06:09-15:15:27:-0400 Detected non-text file '/home/mrferret/Documents/test_file_1.dat.gz'... attempting GZIP mode (slower).
    2019:06:09-15:15:27:-0400 Detected non-text file '/home/mrferret/Documents/test_file_2.dat.gz'... attempting GZIP mode (slower).
    2019:06:09-15:15:27:-0400 Beginning scan for /home/mrferret/Documents/test_file_1.dat.gz
    2019:06:09-15:15:27:-0400 Beginning scan for /home/mrferret/Documents/test_file_2.dat.gz
    2019:06:09-15:15:27:-0400 Beginning scan for /home/mrferret/Documents/test_file_3.dat
    2019:06:09-15:15:27:-0400 PASSED sanity and matched regex - /home/mrferret/Documents/test_file_2.dat.gz - Filter: fake_ccn_account_filter, Line 4, String: 100102030405060708094
    2019:06:09-15:15:27:-0400 Finished scan for /home/mrferret/Documents/test_file_2.dat.gz
    2019:06:09-15:16:04:-0400 PASSED sanity and matched regex - /home/mrferret/Documents/test_file_3.dat - Filter: fake_ccn_account_filter, Line 712567, String: 100102030405060708094
    2019:06:09-15:16:51:-0400 PASSED sanity and matched regex - /home/mrferret/Documents/test_file_1.dat.gz - Filter: fake_ccn_account_filter, Line 712567, String: 100102030405060708094
    2019:06:09-15:17:15:-0400 Finished scan for /home/mrferret/Documents/test_file_3.dat
    2019:06:09-15:19:24:-0400 Finished scan for /home/mrferret/Documents/test_file_1.dat.gz
    2019:06:09-15:19:24:-0400 SUMMARY:
    2019:06:09-15:19:24:-0400   - Scanned 3 file(s).
    2019:06:09-15:19:24:-0400   - Matched regex, failed sanity: 16
    2019:06:09-15:19:24:-0400   - Matched regex, passed sanity: 3
    2019:06:09-15:19:24:-0400   - Finished in 236 seconds (~3 minutes).
    2019:06:09-15:19:24:-0400 FILE SUMMARIES:
    2019:06:09-15:19:24:-0400 Matches: 1 passed sanity checks and 2 failed, Time Elapsed: 236 seconds / ~3 minutes - /home/mrferret/Documents/test_file_1.dat.gz
    2019:06:09-15:19:24:-0400 Matches: 1 passed sanity checks and 3 failed, Time Elapsed: 0 seconds / ~0 minutes - /home/mrferret/Documents/test_file_2.dat.gz
    2019:06:09-15:19:24:-0400 Matches: 1 passed sanity checks and 2 failed, Time Elapsed: 107 seconds / ~1 minutes - /home/mrferret/Documents/test_file_3.dat

Configuration

There are two ways to configure txt-ferret: create a custom configuration file (based on the default YAML file) to change settings or add filters, or set some options via CLI switches.

  • CLI switches always take precedence.
  • A user-defined configuration file beats the default configuration.
  • Any setting not defined in a user configuration file or by a CLI switch falls back to the default.

Txt-ferret comes with a default config which you can dump into any directory, either to change or to use as a reference. If you change the file, you have to specify it with the appropriate CLI switch in order for the script to use it. See the CLI section below.

(venv) $ txtferret dump-config /file/to/write/to.yaml

There are two sections of the config file: filters and settings.

Filters

Filters are regular expressions with some metadata. You can use this metadata to perform sanity checks on regex matches to sift out false positives (e.g., the Luhn algorithm for credit card numbers). You can also mask the matched string as it is logged to a file or displayed in a terminal.

filters:
- label: american_express_15_ccn
  pattern: '((?:34|37)\d{2}(?:(?:[\W_]\d{6}[\W_]\d{5})|\d{11}))'
  substitute: '[\W_]'
  exclude_patterns: ["dont_match_me", "dont_match_me_either"]
  sanity: luhn
  mask:
    index: 2
    value: XXXXXXXX
  type: Credit Card Number
  • label:
    • This will be displayed in the logs when the filter in question has matched a string.
  • pattern:
    • The regular expression which will be used to find data in the file.
    • Regular expression must be compatible with the python re module in the standard library.
    • Be sure that your regular expression contains ONE and ONLY ONE capture group. For example, if you are capturing a phone number:
      • Don't do this: '(555-(867|555)-5309)'
      • Do this: '(555-(?:867|555)-5309)'
      • The former has two capture groups, an inner and an outer.
      • The latter has one capture group (the outer); the inner is a non-capturing group, as indicated by the leading ?:.
    • Note: If you run into issues with loading a custom filter, try adding single-quotes around your regular expression.
  • exclude_patterns:
    • List of regular expressions.
    • If a pattern match also matches any of the exclude_patterns, it will not be included in the results.
  • substitute:
    • Allows you to define what characters are removed from a string before it is passed to the sanity check(s).
    • Must be a valid regular expression.
    • If missing or empty, the default substitute is [\W_].
  • sanity:
    • The algorithm used with this filter to validate that the data really is what you're looking for. For example, 16 digits might just be a random number and not a credit card number; running the digits through the Luhn algorithm validates that they could be an account number and reduces false positives.
    • You can also pass a list of strings naming algorithms, as long as each algorithm exists in the library. If one doesn't, please add it and make a pull request! (A sketch of how pattern, substitute, sanity, and mask fit together follows this list.)
  • mask:
    • index:
      • The position in the matched string at which masking begins.
    • value:
      • The string used to mask the matched string.
  • type:
    • A description of the 'type' of data you're looking for with this filter.
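
To make these pieces concrete, here is a minimal sketch of how a filter's pattern, exclude_patterns, substitute, sanity check, and mask might fit together. This is illustrative only, not txt-ferret's actual implementation; apply_filter is a hypothetical name, and the masking behavior shown is one plausible reading of index/value.

import re

def luhn_passes(digits):
    """Standard Luhn checksum: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def apply_filter(line):
    # The example filter's pattern: ONE capture group; inner groups are non-capturing.
    pattern = r"((?:34|37)\d{2}(?:(?:[\W_]\d{6}[\W_]\d{5})|\d{11}))"
    exclude_patterns = ["dont_match_me", "dont_match_me_either"]
    for match in re.finditer(pattern, line):
        candidate = match.group(1)
        if any(re.search(p, candidate) for p in exclude_patterns):
            continue  # hit an exclude_pattern; drop this match
        cleaned = re.sub(r"[\W_]", "", candidate)  # 'substitute' strips separators
        if luhn_passes(cleaned):
            # One plausible reading of mask: keep the first `index`
            # characters, then append `value`.
            print("match:", cleaned[:2] + "XXXXXXXX")

apply_filter("card 3412 345678 12345 ok")  # prints: match: 34XXXXXXXX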

Settings

settings:
  mask: No
  log_level: INFO
  summarize: No
  output_file:
  show_matches: Yes
  delimiter:
  ignore_columns: [1, 5, 6]
  file_encoding: 'utf-8'
  • bulk
    • This setting is accessible via CLI arguments -b or --bulk.
    • When those switches are used, pass the directory to scan instead of a single file name.
    • Example:
    $ txtferret scan --bulk /home/mrferret/Documents
  • mask
    • If set to true, the mask value defined in the filter will be used to mask the data during output.
    • If no mask is set for a filter, the program will mask with a default mask value.
    • Masking is off by default (note mask: No in the settings above).
    • CLI - The -m switch can be used to turn on masking.
    $ txtferret scan -m ../fake_ccn_data.txt
    2017:05:20-00:24:52:-0400 PASSED sanity and matched regex - Filter: fake_ccn_account_filter, Line 1, String: 10XXXXXXXXXXXXXXXXXXX
    
    $ txtferret scan ../fake_ccn_data.txt
    2017:05:20-00:26:18:-0400 PASSED sanity and matched regex - Filter: fake_ccn_account_filter, Line 1, String: 100102030405060708094
  • summarize
    • If set to true, the script will only output a list of three data points:
      • The number of matches that did not pass sanity checks.
      • The number of matches that did pass sanity checks.
      • The time it took to finish searching the file.
    • CLI - The -s switch enables the summary.
    $ txtferret scan ../fake_ccn_data.txt
    2019:05:20-00:36:00:-0400 PASSED sanity and matched regex - Filter: fake_ccn_account_filter, Line 1, String: 100102030405060708094
    
    $ txtferret scan -s ../fake_ccn_data.txt
    2019:05:20-01:05:29:-0400 SUMMARY:
    2019:05:20-01:05:29:-0400   - Matched regex, failed sanity: 1
    2019:05:20-01:05:29:-0400   - Matched regex, passed sanity: 1
    2019:05:20-01:05:29:-0400 Finished in 0 seconds (~0 minutes)
    
  • output_file
    • Add the absolute path to a file for script output.
    • Note - In bulk mode, results are written to individual log files, one per scanned file.
    • CLI - Use the -o switch to set an output file.
    $ txtferret scan -o my_output.log file_to_scan.txt
  • show_matches
    • If this is set to 'No', the matched string is redacted altogether in the output. Be careful not to override this setting via a CLI switch unless it is on purpose.
  • delimiter
    • Define a delimiter. This delimiter will be used to split each line from the txt file into columns. The output of the script will provide you with the column in which the regex matched in addition to the line number.
    • CAUTION: Searching by columns GREATLY slows down the script since you are applying a regular expression to each column instead of the entire line.
    • You can define a byte-code delimiter using b followed by the code. For example, b1 uses Start of Header (\x01 in hex) as the delimiter.
    • CLI - Use the -d switch to set a delimiter and scan per column instead of line.
    $ txtferret scan ../fake_ccn_data.txt
    2019:05:20-00:36:00:-0400 PASSED sanity and matched regex - Filter: fake_ccn_account_filter, Line 1, String: 100102030405060708094
    
    # Comma delimiter
    
    $ txtferret scan -d , ../fake_ccn_CSV_file.csv
    2019:05:20-01:12:18:-0400 PASSED sanity and matched regex - Filter: fake_ccn_account_filter, Line 1, String: 100102030405060708094, Column: 3
  • ignore_columns
    • This setting is ignored if the delimiter setting or switch is not set.
    • Add a list of integers and txtferret will skip those columns.
    • If ignore_columns: [2, 6] is configured and a CSV row is hello,world,how,are,you,doing,today, then world and doing will be skipped rather than scanned.
    • This is particularly useful in columnar datasets when you know there is a column that is full of false positives.
  • file_encoding
    • Two uses:
      • Encodes your delimiter value to match the encoding of your file.
      • Encodes the data matched in the file before it is passed to the sanity check.
    • Default value is 'utf-8'. (A sketch of how delimiter, ignore_columns, and file_encoding might interact follows this list.)
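
As a rough illustration of how the delimiter, ignore_columns, and file_encoding settings might interact, here is a minimal sketch. The function names are hypothetical, and the b1-style decoding is a plausible reading of the docs, not the actual implementation.

def resolve_delimiter(raw, encoding="utf-8"):
    """'b1'-style values name a byte code ('b1' -> b'\\x01');
    anything else is encoded with the file's encoding."""
    if raw.startswith("b") and raw[1:].isdigit():
        return bytes([int(raw[1:])])
    return raw.encode(encoding)

def columns_to_scan(line, delimiter, ignore_columns):
    """Yield (1-based column number, field) pairs, skipping ignored
    columns before any regex work is done."""
    for number, field in enumerate(line.split(delimiter), start=1):
        if number in ignore_columns:
            continue
        yield number, field

row = b"hello,world,how,are,you,doing,today"
for col, field in columns_to_scan(row, resolve_delimiter(","), {2, 6}):
    print(col, field)  # 'world' (2) and 'doing' (6) are skipped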

How/why did this come about?

There are a few shortcomings with commercial Data Loss Prevention (DLP) products:

  • They often rely on context. It would be too noisy for a commercial solution to alert every time it matched a sixteen-digit string, since not all of their customers handle credit card data.
  • Many vendors don't have a cheap way of handling large files over 1 GB, and even fewer can handle files that are many GBs in size.
  • Some DLP solutions classify data files based only on the first several megabytes.
  • Many DLP solutions are black boxes of magic that give you no say in what they look for in your data or how they know what they're looking at.

Txtferret was born after running into some of these limitations. It isn't perfect, but it's a great sanity check which can be paired with a DLP solution. Here are some things it was designed to do:

  • Virtually no size limitation
    • Can run against any size file as long as you can fit it on a drive.
    • It's Python... so... no speed guarantees on huge files, but it will eventually get the job done. We've found that txtferret can scan a ~20 GB file in 1 to 3 minutes on our systems.
  • Customizable
    • Define your own regular expressions and pair them with a sanity check. For example, the built-in Luhn algorithm sifts out many false positives for credit card numbers: each matched number is run through the Luhn algorithm first, and if it doesn't pass, it is discarded.
  • No context needed
    • Yeah, this can cause a lot of false positives with certain files; however, if you're dealing with a file that doesn't contain context like 'VISA' or 'CVV', you need to start somewhere.
  • Helpful output
    • Indicates the line on which the string was found.
    • Indicates the column if you've defined a delimiter (e.g., a comma for CSV files).
    • You can choose to mask your output data to make sure you're not putting sensitive data into your log files or outputting them to your terminal.
  • It's free
    • No contracts
    • No outrageous licensing per GB of data scanned.
  • You can contribute!

Releases

Version 0.3.0a - 2019-09-05

  • Removed log level switch. Only matches are shown now.
  • Added exclude_patterns to filters
  • Adjusted result format to tab-delimited output.
  • Bulk scans now write results to separate files instead of all in one file.

Version 0.2.1 - 2019-08-11

  • Fixed allowed keys bug introduced in v0.2.0

Version 0.2.0 - 2019-08-05

  • Changed tokenize to mask because tokenize was a lie... it's masking.
    • Replaced --no-tokenize switch with --mask switch.
  • Turned masking off by default.

Version 0.1.3 - 2019-08-05

  • Added file_encoding setting for multi-encoding support.
    • Reads in bytes and assumes 'utf-8' encoding by default.

Version 0.1.2 - 2019-08-01

  • Fixed bug with regex when reading gzipped files.

Version 0.1.1 - 2019-07-30

  • Added substitute option to filters.

Version 0.1.0 - 2019-07-30

  • Removed the config-override option.
  • Added ignore_columns setting.

Version 0.0.4 - 2019-06-09

  • Added bulk file scanning by the --bulk switch.
  • Added multiprocessing for bulk scanning.

Version 0.0.3 - 2019-06-01

  • Added gzip detection and support.

Development

Some info about development.

Running Tests

$ pytest txt-ferret/tests/

Contributing

Process

  1. Create an issue.
  2. Fork the repo.
  3. Do your work.
  4. WRITE TESTS
  5. Make a pull request.
    • Preferably, include the issue # in the pull request.

Style

  • Black for formatting.
  • Pylint for linting.

License

See License


txt-ferret's Issues

GZIP support broken by substitute function

Doh... GZIP support broken by the substitute function: substitute only replaces with a str object, and there is no provision for replacing with an empty b'' string.

This is why you should write integration tests...

Traceback (most recent call last):
  File "/home/some_user/venv/bin/txtferret", line 11, in <module>
    sys.exit(main())
  File "/home/some_user/venv/lib64/python3.6/site-packages/txtferret/__init__.py", line 7, in main
    exit(cli())
  File "/home/some_user/venv/lib64/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/some_user/venv/lib64/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/some_user/venv/lib64/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/some_user/venv/lib64/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/some_user/venv/lib64/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/some_user/venv/lib64/python3.6/site-packages/txtferret/cli.py", line 169, in scan
    result = bootstrap(config)
  File "/home/some_user/venv/lib64/python3.6/site-packages/txtferret/cli.py", line 69, in bootstrap
    ferret.scan_file()
  File "/home/some_user/venv/lib64/python3.6/site-packages/txtferret/core.py", line 326, in scan_file
    self._scan_non_delimited_line(line, index)
  File "/home/some_user/venv/lib64/python3.6/site-packages/txtferret/core.py", line 397, in _scan_non_delimited_line
    if not sanity_test(filter_, match):
  File "/home/some_user/venv/lib64/python3.6/site-packages/txtferret/core.py", line 468, in sanity_test
    _text = re.sub(filter_.substitute, "", text)
  File "/usr/lib64/python3.6/re.py", line 191, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: sequence item 0: expected a bytes-like object, str found

Note: the failing call is _text = re.sub(filter_.substitute, "", text)
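
A minimal sketch of one possible fix: choose the replacement (and pattern) type based on the type of the incoming text. strip_substitute is a hypothetical name, not txt-ferret's actual function.

import re

def strip_substitute(pattern, text):
    """Use b'' as the replacement (and an encoded pattern) when the line
    came from a gzipped byte stream, '' otherwise."""
    if isinstance(text, bytes):
        return re.sub(pattern.encode("utf-8"), b"", text)
    return re.sub(pattern, "", text)

strip_substitute(r"[\W_]", b"3412 345678 12345")  # works on bytes now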

API to further manipulate data after TxtFerret has performed analysis

If a user has post-analysis to be performed on results from a regex scan and sanity check, it would be good if TxtFerret returned a dictionary with all values required so that the user could perform further analysis.

High-level pseudocode to get a feel for what I'm talking about:

from txtferret import TxtFerret

my_ferret = TxtFerret(......)

my_ferret.run()

do_more_complex_stuff(my_ferret.result())
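
Purely hypothetical, but the returned structure might carry the same fields the logs already expose (filter label, line, column, matched string, sanity result). For example:

# A hypothetical shape for my_ferret.result(); no such API exists yet.
result = {
    "file_name": "/home/mrferret/Documents/test_file_1.dat",
    "matches": [
        {
            "filter": "fake_ccn_account_filter",
            "line": 712567,
            "column": None,          # populated in delimiter mode
            "string": "100102030405060708094",
            "passed_sanity": True,
        },
    ],
    "failed_sanity_count": 2,
    "elapsed_seconds": 78,
}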

Clean up 'no-tokenize' function... or replace with better implementation

The '--no-tokenize' switch is not intuitive. It was originally designed so that tokenization happens by default (set in the default configuration). It's also a lie... it's masking, not tokenization.

Make the following changes:

  • Remove the '--no-tokenize' switch and setting.
  • Make a '--mask' or '-m' switch with the matching setting, and do not mask by default. Require the switch to implement the mask.

Simplify config override

Using an extra switch to override the default config with a custom config causes confusion. It is easy to dump the existing config and add to it.

  • Remove the 'config override' option
  • If a user defines a custom configuration, then ONLY values in that custom config can be used.
  • For the time being, require the custom config to have all the same settings as the default config.

Bulk scan dies when one process raises an uncaught exception

When a subprocess throws an uncaught exception, the entire bulk scan stops instead of just skipping and letting the others finish.

Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/some_user/venv/lib64/python3.6/site-packages/txtferret/cli.py", line 69, in bootstrap
    ferret.scan_file()
  File "/home/some_user/venv/lib64/python3.6/site-packages/txtferret/core.py", line 315, in scan_file
    for index, line in enumerate(rf):
  File "/usr/lib64/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 1471: invalid continuation byte
"""

Traceback (most recent call last):
  File "/home/some_user/venv/bin/txtferret", line 11, in <module>
    sys.exit(main())
  File "/home/some_user/venv/lib64/python3.6/site-packages/txtferret/__init__.py", line 7, in main
    exit(cli())
  File "/home/some_user/venv/lib64/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/some_user/venv/lib64/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/some_user/venv/lib64/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/some_user/venv/lib64/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/some_user/venv/lib64/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/some_user/venv/lib64/python3.6/site-packages/txtferret/cli.py", line 190, in scan
    results = p.map(bootstrap, configs)
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 1471: invalid continuation byte

The underlying UnicodeDecodeError needs to be fixed in a separate issue.
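
A minimal sketch of one way to keep the pool alive: wrap the per-file worker so its exceptions are caught and returned as data instead of propagating through Pool.map. The bootstrap below is a stand-in stub for txtferret.cli.bootstrap.

from multiprocessing import Pool

def bootstrap(config):
    # Stand-in for the real per-file scan, which may hit a decoding error.
    raise UnicodeDecodeError("utf-8", b"\xe1", 0, 1, "invalid continuation byte")

def safe_bootstrap(config):
    """One bad file should not kill the whole bulk scan."""
    try:
        return ("ok", bootstrap(config))
    except Exception as err:
        return ("error", f"{config}: {err}")

if __name__ == "__main__":
    with Pool() as pool:
        results = pool.map(safe_bootstrap, ["file_1.dat", "file_2.dat"])
    print(results)  # errors are reported; the other scans still finish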

Reword part of Readme about sanity checks

This just sounds bad:

Sanity:
This is the algorithm to use with this filter in order to validate the data is really what you're looking for. For example, 16 digits might just be a random number and not a credit card. Putting the numbers through the luhn algorithm will validate they could potentially be an account number and reduce false positives.
You can also pass through a list of strings that represent algorithms as long as the algorithm exists in the library. If not, please add it and make a pull request!

Blacklist columns

As a user of TxtFerret in delimiter mode, it would be beneficial to be able to ignore certain columns to filter out things like table IDs or fields that are foreign keys / IDs of another table or database.

I'm thinking of adding a list of integers to the YAML file. For example,
ignored_columns: [1, 5, 10] would ignore the 1st, 5th, and 10th columns.

The ignore should happen before the regular expression is applied to that column, for efficiency.

Attribute is called before it is set in TxtFerret.__init__()

Attribute error due to attributes being set in the wrong order. We don't explicitly set some attributes, which means a dependent attribute may not be set before it is used. This appears to be what is happening with file_encoding and output_file / file-handler creation in TxtFerret.__init__().

Add bulk file scan.

Add a switch and/or config option that would allow all files in a directory to be scanned and then an aggregate report be generated.

Add gzip support

Would be nice to be able to stream in gzip files. Currently, gzipped files raise a UnicodeDecodeError because the utf-8 codec can't read the byte in position 1 of the gzipped file.
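
A minimal sketch of one way to do the detection: sniff the gzip magic bytes (\x1f\x8b) and fall back to a plain binary read. open_maybe_gzip is a hypothetical name.

import gzip

def open_maybe_gzip(path):
    """Open as gzip if the file starts with the gzip magic bytes,
    otherwise as a plain binary file."""
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rb")
    return open(path, "rb")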

Support other codecs than just 'utf-8'

The following stack trace shows what appears to be a non-utf-8 encoded file. Some options:

  • Let the user define the codec to use. <-- I like this one
  • Automatically run through a list of codecs to try.
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/some_user/venv/lib64/python3.6/site-packages/txtferret/cli.py", line 69, in bootstrap
    ferret.scan_file()
  File "/home/some_user/venv/lib64/python3.6/site-packages/txtferret/core.py", line 315, in scan_file
    for index, line in enumerate(rf):
  File "/usr/lib64/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 1471: invalid continuation byte
"""

Traceback (most recent call last):
  File "/home/some_user/venv/bin/txtferret", line 11, in <module>
    sys.exit(main())
  File "/home/some_user/venv/lib64/python3.6/site-packages/txtferret/__init__.py", line 7, in main
    exit(cli())
  File "/home/some_user/venv/lib64/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/some_user/venv/lib64/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/some_user/venv/lib64/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/some_user/venv/lib64/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/some_user/venv/lib64/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/some_user/venv/lib64/python3.6/site-packages/txtferret/cli.py", line 190, in scan
    results = p.map(bootstrap, configs)
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 1471: invalid continuation byte

Ability to define which delimiters to remove during Luhn algorithm

The Luhn algorithm in this script has a hardcoded value of [\W_] which is substituted away before performing the Luhn check. If you change your scanning regular expression to something that includes alphanumeric characters, the Luhn check will fail because those characters will not be removed.

Add a config file option to define which delimiters to substitute with an empty string before the Luhn algorithm runs.

Example:

substitute: '[a-zA-Z_\^\t]'
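
A small illustration of the problem, assuming a match that contains a letter: the default [\W_] substitute leaves the letter in, and int() inside the Luhn check would then raise.

import re

candidate = "34^12345678^12345A"  # hypothetical match containing a letter

# Default substitute: '^' is stripped (non-word), but 'A' survives,
# and int('A') inside the Luhn check raises ValueError.
print(re.sub(r"[\W_]", "", candidate))          # 341234567812345A

# Custom substitute from the example above strips letters too.
print(re.sub(r"[a-zA-Z_\^\t]", "", candidate))  # 341234567812345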
