
sweeper's Introduction

ugrc-sweeper

The data cleaning service.


Available Sweepers

Addresses

Checks that addresses have minimum required parts and optionally normalizes them.

Duplicates

Checks for duplicate features.

Empties

Checks for empty geometries.

Metadata

Checks to make sure that the metadata meets the Basic SGID Metadata Requirements.

Tags

Checks to make sure that existing tags are cased appropriately. This means that they are title-cased other than known abbreviations (e.g. UGRC, BLM) and articles (e.g. a, the, of).

This check also verifies that the data set contains a tag that matches the database name (e.g. SGID) and the schema (e.g. Cadastre).

--try-fix adds missing required tags and title-cases any existing tags.
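
The title-casing behaves roughly like the sketch below (the abbreviation and article lists here are illustrative examples, not sweeper's actual configuration):

# Illustrative sketch of the tag title-casing; the abbreviation and article
# lists are examples only, not sweeper's actual lists.
KNOWN_ABBREVIATIONS = {'UGRC', 'BLM', 'SGID'}
ARTICLES = {'a', 'an', 'the', 'of'}

def title_case_tag(tag):
    words = []
    for index, word in enumerate(tag.split()):
        if word.upper() in KNOWN_ABBREVIATIONS:
            words.append(word.upper())
        elif index > 0 and word.lower() in ARTICLES:
            words.append(word.lower())
        else:
            words.append(word.capitalize())

    return ' '.join(words)

print(title_case_tag('roads of utah'))    # Roads of Utah
print(title_case_tag('ugrc boundaries'))  # UGRC Boundaries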

Summary

Checks to make sure that the summary is less than 2048 characters (a limitation of AGOL) and that it is shorter than the description.

Description

Checks to make sure that the description contains a link to a data page on gis.utah.gov.

Use Limitations

Checks to make sure that the text in this section matches the official text for UGRC.

--try-fix updates the text to match the official text.

Parsing Addresses

This project contains a module that can be used as a standalone address parser, sweeper.address_parser. This allows developers to take advantage of sweeper's advanced address parsing and normalization without having to run the entire sweeper process.

Usage Example

from sweeper.address_parser import Address

address = Address('123 South Main Street')
print(address)

'''
--> Parsed Address:
{'address_number': '123',
 'normalized': '123 S MAIN ST',
 'prefix_direction': 'S',
 'street_name': 'MAIN',
 'street_type': 'ST'}
'''

Available Address class properties

All properties default to None if there is no parsed value.

address_number

address_number_suffix

prefix_direction

street_name

street_direction

street_type

unit_type

unit_id If no unit_type is found, this property is prefixed with # (e.g. # 3). If unit_type is found, # is stripped from this property.

city

zip_code

po_box The PO Box if a po-box-type address was entered (e.g. po_box would be 1 for p.o. box 1).

normalized A normalized string representing the entire address that was passed into the constructor. PO Boxes are normalized in this format PO BOX <number>.

Installation (requires Pro 2.7+)

  1. clone arcgis conda environment
    • conda create --name sweeper --clone arcgispro-py3
  2. activate environment
    • activate sweeper
  3. install sweeper
    • pip install ugrc-sweeper
  4. Optionally duplicate config.sample.json as config.json in the folder where you will run sweeper.

Caution

This is required for the following functions:

  • --scheduled argument (required for sending emails)
  • --change-detect argument
  • using user-specific connection files via the CONNECTIONS_FOLDER config value

Exclusions

Tables can be skipped by adding values to the EXCLUSIONS.<sweeper_key> config array. These values are matched against table names using fnmatch. Note that these do not apply when using the --table-name argument.
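
The matching behaves roughly like this sketch using the standard library's fnmatch (the patterns and table names below are made up):

from fnmatch import fnmatch

# Hypothetical exclusion patterns and table names, for illustration only.
exclusions = ['SGID.CADASTRE.*', '*_DRAFT']
table_names = ['SGID.CADASTRE.Parcels', 'SGID.BOUNDARIES.Counties', 'SGID.WATER.Lakes_DRAFT']

for table_name in table_names:
    skipped = any(fnmatch(table_name, pattern) for pattern in exclusions)
    print(table_name, 'skipped' if skipped else 'swept')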

Development

  1. clone arcgis conda environment
    • conda create --name sweeper --clone arcgispro-py3
  2. activate environment
    • activate sweeper
  3. install required dependencies to work on sweeper
    • pip install -e ".[tests]"
  4. test_metadata.py uses a SQL database that needs to be restored via src/sweeper/tests/data/Sweeper.bak to your local SQL Server.
  5. run sweeper: sweeper
  6. test: pytest
  7. lint: ruff check .
  8. format: ruff format .

sweeper's People

Contributors

dependabot[bot], eneemann, gregbunce, jacobdadams, rkelson, stdavis, steveoh, ugrc-release-bot[bot], zachbeck


sweeper's Issues

Numeric street names with E for a unit id

1050 E 470 S E will parse the address correctly but will not return E as the unit id.
1106 S OLD HWY 89 E will return E as the unit id correctly.

There are some cases where # is incorrectly being added to unit ids when there is no unit type.

Addresses with “#th” streets

Valuable feedback from @msilski at WFRC...

Ex. No match found for 2040 S 23RD E in Salt Lake City
Ex. No match found for 70 N 2ND E in American Fork
Solution? Replace #th characters with 00 if two explicit directionals are present.

Currently, 2040 S 23RD E parses incorrectly (the screenshot of the parsed result is omitted).

We could do a regex on street names looking for th and rd (are there any others?) and replace them with 00.
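
A rough sketch of that replacement (the suffix list and limiting it to otherwise-numeric street names are assumptions based on the discussion above):

import re

# Sketch only: strip ordinal suffixes from numeric street names and append
# '00', e.g. 23RD -> 2300. Applying this only when two explicit directionals
# are present, as suggested above, would need the other parsed parts.
ORDINAL_STREET = re.compile(r'^(\d+)(ST|ND|RD|TH)$', re.IGNORECASE)

def normalize_ordinal_street(street_name):
    match = ORDINAL_STREET.match(street_name)
    if match:
        return f'{match.group(1)}00'

    return street_name

print(normalize_ordinal_street('23RD'))  # 2300
print(normalize_ordinal_street('2ND'))   # 200
print(normalize_ordinal_street('MAIN'))  # MAIN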

Manage own last checked

Currently sweeper relies on the last checked file from the cloudb open-sgid synchronizing tool. This is problematic for many reasons.

  1. There isn't a connection between cloudb and sweeper's runs. If cloudb fails, sweeper should still run with the correct dates. If sweeper fails, it should try again from the last successful date.
  2. The cloudb tool is going cloud native, so the file will not exist for sweeper to use.
  3. cloudb isn't running in GCVE, so sweeper doesn't have a last checked file to work from and cannot be migrated.

The code from the open-sgid can be used to create and manage the file.

The sweeper code change needs to happen here and here

The file should only be created if the --change-detection flag is present.
It should only be updated if the run is successful.
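
A minimal sketch of sweeper managing its own file, assuming a JSON file keyed by table name (the path and format here are placeholders, not the cloudb layout):

import json
from datetime import date
from pathlib import Path

# Placeholder path and format for illustration only.
LAST_CHECKED = Path('.last_checked')

def read_last_checked():
    if LAST_CHECKED.exists():
        return json.loads(LAST_CHECKED.read_text())

    return {}

def update_last_checked(table_name, run_was_successful):
    # only touch the file after a successful run
    if not run_was_successful:
        return

    dates = read_last_checked()
    dates[table_name] = date.today().isoformat()
    LAST_CHECKED.write_text(json.dumps(dates, indent=2))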

Replace credentials.py

The credentials.py file is problematic when using the PyPI package, primarily because you do not have access to create or use a credentials.py file once the package is built. We need to evaluate some options to bring in this information another way.

The credentials file seems to support many things but is mainly used for automating sweeper, right?

import socket

DB = ''  #: Full path to sde connection file
CHANGE_DETECTION = '' #: Change detection table name with 'SGID.' prefix
LAST_CHECKED_PATH = '' #: Full path to .last_checked file
REPORT_BASE_PATH = '' #: File path for report CSVs of everything that was fixed; rotated on each run
LOG_FILE_PATH = '' #: File path to log that is rotated on each run
CONNECTIONS = '' #: Dictionary that holds SDE connection file paths
EMAIL_SETTINGS = {  #: Settings for EmailHandler
    'smtpServer': '',
    'smtpPort': 25,
    'from_address': '',
    'to_addresses': '',
    'prefix': f'Auditor on {socket.gethostname()}: ',
}

DB is the workspace to sweep.
CHANGE_DETECTION and LAST_CHECKED_PATH are used to know what has changed since the last scheduled run.
REPORT_BASE_PATH is where to write logs? This seems almost unnecessary; it could have a default value and maybe a CLI option to override it.
CONNECTIONS holds the owner connection files and is very specific to our schema and the automated fixing process.
EMAIL_SETTINGS is for emailing the logs, which is great for a set-it-and-forget-it automated process.

If in #84 we move to a convention-based location, CHANGE_DETECTION and LAST_CHECKED_PATH can be removed and replaced with conventions, with the option to override them through the CLI.

DB is passed in as the workspace and isn't required. Why is it in the credentials also?

CONNECTIONS and EMAIL_SETTINGS seem good for some sort of config file. Email should go through SendGrid and only require an API key, but I'll create another issue for that.

Should we move to a convention where the folder the CLI is run from contains a JSON config file that sweeper loads its settings from? Do we continue to shrink the need for the credentials file and look for a _connections folder as another convention?

Thoughts?

Use supervisor's SendGrid feature

Sweeper uses supervisor to notify people of errors and to send logs. Currently it's using the on-prem mail server, judging by the config template.

Supervisor has a way to send email with SendGrid, and we've been migrating to SendGrid for all mail sending.

Let's migrate sweeper to use SendGrid.

problem addresses

Addresses throwing exceptions:

  • 2430 N RIVER VIEW WAY
  • 135 S RIVER BEND WAY
  • 1384 S CANYON CREST
  • 1361 N 1075 WEST UNIT 12 BLDG B
  • 860 S 1625 EAST UNIT C BLDG 27
  • 728 S WATER MILL WAY
  • 1623 E POETS REST
  • 1623 E POETS RST
  • 2362 S 3340 WEST CIR

complete docopt CLI

The docopt CLI needs to be completed so the different functions can be called by the main script, depending on what the user wants to accomplish.

Metadata Summary

  • should be shorter than the description.
  • needs to be less than 2048 characters (if this is what maps to snippet in AGOL)

I'm not sure that we want to limit the number of sentences?

Street Name Misspellings

From @steveoh

get the unique street names from our roads data and address points. then parse their addresses to the parts and see if the road exists in our data or something similar with levenshtein to catch misspellings

From @ZachBeck

[Look] for compound word misspellings like Switchback Way vs Switch Back Way

Not sure on the best way to do this. Perhaps trying to compare concatenated multiple word street names to the known list of street names? Or maybe something like levenshtein could handle this.
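
As a sketch of the fuzzy-matching idea, using difflib from the standard library as a stand-in for a levenshtein library (the street names below are made up):

from difflib import get_close_matches

# Hypothetical known street names; in practice these would come from the
# roads data and address points.
known_street_names = {'SWITCHBACK', 'MAIN', 'CANYON CREST'}

def flag_possible_misspelling(street_name):
    if street_name in known_street_names:
        return None

    # compare with spaces removed to catch compound-word splits like SWITCH BACK
    collapsed = street_name.replace(' ', '')
    candidates = get_close_matches(collapsed, known_street_names, n=1, cutoff=0.85)

    return candidates[0] if candidates else None

print(flag_possible_misspelling('SWITCH BACK'))  # SWITCHBACK
print(flag_possible_misspelling('CANYON CRST'))  # CANYON CREST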

address sweeping

The WFRC and a whole bunch of other agencies try to clean their addressing data to produce high match rate geocoding results. We should try to come up with a sweeper that can flag records and possibly standardize addresses.

Minimum Address Parts

From @steveoh

making sure there are the minimum address parts would be helpful and reporting what might be missing

As part of this issue, the names of the different address parts need to be decided upon (e.g. suffixDirection, prefixDirection, etc).

Addresses with multiple unit numbers

Here are a few from this old issue:

1361 N 1075 WEST UNIT 12 BLDG B
860 S 1625 EAST UNIT C BLDG 27

Currently, these addresses throw exceptions:

>>> address_parser.Address('1361 N 1075 WEST UNIT 12 BLDG B')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\projects\sweeper\src\sweeper\address_parser.py", line 67, in __init__
    parts, parsed_as = usaddress.tag(address_text.replace('.', ''), TAG_MAPPING)
  File "C:\Users\agrc-arcgis\AppData\Local\ESRI\conda\envs\sweeper\lib\site-packages\usaddress\__init__.py", line 186, in tag
    label)
usaddress.RepeatedLabelError:
ERROR: Unable to tag this string because more than one area of the string has the same label

ORIGINAL STRING:  1361 N 1075 WEST UNIT 12 BLDG B
PARSED TOKENS:    [('1361', 'AddressNumber'), ('N', 'StreetNamePreDirectional'), ('1075', 'StreetName'), ('WEST', 'StreetNamePostDirectional'), ('UNIT', 'SubaddressType'), ('12', 'SubaddressIdentifier'), ('BLDG', 'SubaddressType'), ('B', 'SubaddressIdentifier')]
UNCERTAIN LABEL:  unit_type

When this error is raised, it's likely that either (1) the string is not a valid person/corporation name or (2) some tokens were labeled incorrectly

To report an error in labeling a valid name, open an issue at https://github.com/datamade/usaddress/issues/new - it'll help us continue to improve probablepeople!

For more information, see the documentation at https://usaddress.readthedocs.io/

@eneemann suggested that maybe these should be parsed as 'UNIT B12' and 'UNIT 27C'.

At the very least, we should not be passing on the exception from usaddress...

invalid report names

sweeper report naming needs to be improved to allow for feature service input.

@gregbunce I think it's because the table name is built into the filename of the report, which causes problems when the table name is a URL:

OSError: [Errno 22] Invalid argument: 'c:\\\\Temp\\sweeper_run_20201130_0842\\https://services.arcgis.com/ZzrwjTRez6FJiOq4/arcgis/rest/services/Oil_and_Gas_Fields/FeatureServer/0_DuplicateTest_0.txt'

Originally posted by @eneemann in agrc/porter#85 (comment)

standardize sweeper reporting

The reporting format needs to be determined and standardized across the functions so a consistent report is provided.

add functionality to check domains

Currently, we are moving away from coded value domains that do not match the domain description. This check would flag domains that are not in compliance. See this doc for more info.

Add check to ensure data exists before sweeping

We need to add a check (arcpy.Exists()) to ensure the data listed in the change detection table still exists in the workspace. We've had several crashes because Sweeper tries to 'sweep' a data layer that no longer exists in the workspace but still has a row in change detection.
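
A sketch of what that guard could look like (the workspace path and table names are placeholders):

import arcpy

# Placeholder values; the real table names come from the change detection table.
arcpy.env.workspace = r'C:\path\to\connection.sde'
tables_from_change_detection = ['SGID.BOUNDARIES.Counties', 'SGID.WATER.Lakes']

for table_name in tables_from_change_detection:
    if not arcpy.Exists(table_name):
        print(f'{table_name} is in change detection but no longer exists; skipping')
        continue

    # ... sweep the table ...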

State Routes SR

Addresses like 910 S SR 22 or 1910 N US HWY 89
will return 22 and 89 for the street_name

Test for stewarding agency tag

At a minimum, each SGID dataset’s tags should include the stewarding agency (ie: UGRC), “SGID”, and the appropriate category name.

Come up with a list of known stewards and create a sweeper test to warn if it doesn't find one.

Originally posted by @steveoh in #75 (comment)

test functions on common database

When ready, we need to test the different aspects of each function against a common SDE database to ensure everything is working correctly.

Greg's test database is probably a good place to do this

sweeper: global id field

We don't version or have a need for global IDs in the SGID. Should we create a check that looks for this field and removes it when loading data into the SGID? There are a handful in the internal DB right now.
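
A sketch of what the check could look like with arcpy (flagging only; how --try-fix would remove the field is left open):

import arcpy

def has_global_id_field(table):
    # flag tables that carry a GlobalID field since the SGID does not use them
    return any(field.type == 'GlobalID' for field in arcpy.ListFields(table))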

general data cleaning functions

Might be nice to add a simple function (or functions) that performs general data cleaning on feature classes or a database to ensure human-induced errors aren't propagated. This is probably most applicable for string fields. Could loop through rows/fields to (see the sketch after this list):

  • Remove internal whitespace or extra spaces
  • Remove leading and trailing whitespace
  • Others...
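
A rough sketch of that cleanup with an arcpy update cursor, assuming the cleanup applies to every string field in the table:

import re

import arcpy

def clean_string_fields(table):
    # collapse internal runs of whitespace and strip leading/trailing whitespace
    string_fields = [field.name for field in arcpy.ListFields(table, field_type='String')]
    if not string_fields:
        return

    with arcpy.da.UpdateCursor(table, string_fields) as cursor:
        for row in cursor:
            cleaned = [re.sub(r'\s+', ' ', value).strip() if value else value for value in row]
            if cleaned != row:
                cursor.updateRow(cleaned)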

991 vista east fork

{'address_number': '991',
 'normalized': '991 VISTA EAST FRK',
 'street_name': 'VISTA EAST',
 'street_type': 'FRK'}

The street name should be VISTA EAST FORK; @rkelson says FRK is not a valid street type in Utah.

Non-standard street directions are incorrectly parsed as part of the street name

More geocoding feedback from @msilski:

Ex. 166 E 14000 SO SUITE 200 in Draper matched to 166 E ST in Salt Lake City

This address currently parses as:

>>> Address('166 E 14000 SO SUITE 200')
Parsed Address:
{'address_number': '166',
 'normalized': '166 E 14000 SO SUITE 200',
 'prefix_direction': 'E',
 'street_name': '14000 SO',
 'street_type': None,
 'unit_id': '200',
 'unit_type': 'SUITE'}

It should really be parsed as:

>>> Address('166 E 14000 SO SUITE 200')
Parsed Address:
{'address_number': '166',
 'normalized': '166 E 14000 SO SUITE 200',
 'prefix_direction': 'E',
 'street_name': '14000',
 'street_type': None,
 'suffix_direction': 'S',
 'unit_id': '200',
 'unit_type': 'SUITE'}

I wonder if we could check the street names for last words with two-letter directions and then move them to suffix_direction.

166 E 14000 S SUITE 200 geocodes correctly...
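
A sketch of the idea above, moving a trailing direction word out of the street name (the direction alias table is illustrative, not sweeper's actual normalization data):

# Illustrative aliases only.
DIRECTION_ALIASES = {
    'N': 'N', 'NO': 'N', 'NORTH': 'N',
    'S': 'S', 'SO': 'S', 'SOUTH': 'S',
    'E': 'E', 'EA': 'E', 'EAST': 'E',
    'W': 'W', 'WE': 'W', 'WEST': 'W',
}

def split_trailing_direction(street_name):
    words = street_name.split()
    if len(words) > 1 and words[-1].upper() in DIRECTION_ALIASES:
        return ' '.join(words[:-1]), DIRECTION_ALIASES[words[-1].upper()]

    return street_name, None

print(split_trailing_direction('14000 SO'))  # ('14000', 'S')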

metadata sweeper

Was this ever discussed as an option? I'm seeing lots of missing metadata. We'd need to wait until Pro 2.5 is released to be able to have access to it via arcpy.

  • Should validate that summary is shorter in length than description. Moved to: #48
  • purpose needs to be less than 2048 characters to map to snippet in AGOL. Moved to #48.
  • Link to data page. Moved to #49

Address Parser Module Design

There obviously needs to be a module in src/sweepers that defines the sweep and try_fix methods. But if this is going to be a new replacement for https://github.com/agrc/agrc.python/blob/master/agrc/parse_address.py, there needs to be a way for users to use the parsing logic outside of the main sweeper process. My proposal would be to keep the core address parsing logic in a separate module (still within the sweeper project) that could be imported and used directly.

An example of how it could be used outside of the sweeper process in a custom script:

from sweeper import addressParser

parsed = addressParser.parse('123 S Main St')

Does anyone have issues with this or ideas on how it could be done better?

Ping @ZachBeck @rkelson

Bug: Unable to delete rows on standalone tables

Need to use a different tool for selecting/deleting rows in a standalone table in the duplicates and empties tests. Pseudo-code below:

if is_table:
    use MakeTableView, DeleteRows
else:
    use MakeFeatureLayer, DeleteFeatures
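
A rough implementation of the pseudo-code with arcpy (the view/layer names and where clause handling are assumptions):

import arcpy

def delete_matching_rows(dataset, where_clause, is_table):
    # standalone tables need a table view and DeleteRows; feature classes need
    # a feature layer and DeleteFeatures
    if is_table:
        view = arcpy.management.MakeTableView(dataset, 'sweeper_view', where_clause)
        arcpy.management.DeleteRows(view)
    else:
        layer = arcpy.management.MakeFeatureLayer(dataset, 'sweeper_layer', where_clause)
        arcpy.management.DeleteFeatures(layer)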

Old Midvale Address System Addresses

From @msilski:

Old Midvale addresses - found five cases of using legacy local address system
Ex. 112 S ALLEN ST in Midvale matched to 112 S ST in Salt Lake City (true location: 7832 S ALLEN ST in Midvale)
Ex. 19 E CENTER ST in Midvale matched to 19 W CENTER ST in Salt Lake City (true location: 684 W CENTER ST in Midvale)

It may not be worth it for only five records in this dataset (DWS employment data). I'm not sure how common these addresses really are. I do remember encountering them when I worked for Sandy City years ago.

Replace print statements with python logging

Python logging can have multiple handlers - console, file, whatever. I would suggest that the standard out (console) handler is always added. If the --save-report flag is passed, then add the file handler. This would be the same as using print with the added benefit that you can filter by severity level. This is an example from forklift.

Originally posted by @steveoh in #67 (comment)
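
A minimal sketch of that handler setup (the logger name, levels, and file name are assumptions):

import logging

def setup_logging(save_report=False):
    logger = logging.getLogger('sweeper')
    logger.setLevel(logging.DEBUG)

    # console output is always on
    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    logger.addHandler(console)

    # only write to disk when --save-report is passed (file name is a placeholder)
    if save_report:
        report = logging.FileHandler('sweeper.log')
        report.setLevel(logging.DEBUG)
        logger.addHandler(report)

    return logger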

Add Missing Street Types in Certain Circumstances

From @eneemann:

Ensure street type is used consistently (specifically, enforcing that 'ST' is used on Center/Main)
I don't explicitly mean there should never be a "Main Dr" (though there are probably very few). What I mean is that I know of some cities that don't put a street type on Main St or Center St. Their addresses are just "123 Main" or "456 Center." I just want to make sure we catch those, and think standardizing them to add "St" as a street type for Main or Center, when the street type is missing, would probably improve results.

Standardize Address Parts

The address parser should be able to format these address parts:

  • Street Types (ST, WAY, HWY, etc)
  • Cardinal Directions - Single letters are preferred (N, S, E, W)
  • Unit Types (Suite, Apt)
  • PO Boxes

The sweeper task would report non-standard values as issues and could attempt to fix them if the --try-fix flag is passed.

Are there any other types that I'm missing?

Where can I get a list of the values that are accepted as standards for each of these address parts? The lowest hanging fruit is the cardinal directions. Is the preference single capital letters (N, S, E, W)?

Ping @rkelson @ZachBeck @eneemann @gregbunce
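
A sketch of the kind of normalization table this could use (the mappings below are examples only; the accepted standards still need to be decided):

# Example mappings, not an authoritative standard.
DIRECTIONS = {'NORTH': 'N', 'SOUTH': 'S', 'EAST': 'E', 'WEST': 'W'}
STREET_TYPES = {'STREET': 'ST', 'AVENUE': 'AVE', 'HIGHWAY': 'HWY'}
UNIT_TYPES = {'SUITE': 'STE', 'APARTMENT': 'APT'}

def normalize_part(value, mapping):
    normalized = mapping.get(value.upper(), value.upper())

    # flag the value as an issue (and fix it with --try-fix) when it differs
    # from the standard form
    return normalized, normalized != value

print(normalize_part('South', DIRECTIONS))  # ('S', True)
print(normalize_part('ST', STREET_TYPES))   # ('ST', False)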
