
sweeper's Introduction

ugrc-sweeper

The data cleaning service.


Available Sweepers

Addresses

Checks that addresses have minimum required parts and optionally normalizes them.

Duplicates

Checks for duplicate features.

Empties

Checks for empty geometries.

Metadata

Checks to make sure that the metadata meets the Basic SGID Metadata Requirements.

Tags

Checks to make sure that existing tags are cased appropriately. This means that they are title-cased other than known abbreviations (e.g. UGRC, BLM) and articles (e.g. a, the, of).

This check also verifies that the data set contains a tag that matches the database name (e.g. SGID) and the schema (e.g. Cadastre).

--try-fix adds missing required tags and title-cases any existing tags.
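
The title-casing behaves roughly like the sketch below (the abbreviation and article lists here are illustrative examples, not sweeper's actual configuration):

# Illustrative sketch of the tag title-casing; the abbreviation and article
# lists are examples only, not sweeper's actual lists.
KNOWN_ABBREVIATIONS = {'UGRC', 'BLM', 'SGID'}
ARTICLES = {'a', 'an', 'the', 'of'}

def title_case_tag(tag):
    words = []
    for index, word in enumerate(tag.split()):
        if word.upper() in KNOWN_ABBREVIATIONS:
            words.append(word.upper())
        elif index > 0 and word.lower() in ARTICLES:
            words.append(word.lower())
        else:
            words.append(word.capitalize())

    return ' '.join(words)

print(title_case_tag('roads of utah'))    # Roads of Utah
print(title_case_tag('ugrc boundaries'))  # UGRC Boundaries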

Summary

Checks to make sure that the summary is less than 2048 characters (a limitation of AGOL) and that it is shorter than the description.

Description

Checks to make sure that the description contains a link to a data page on gis.utah.gov.

Use Limitations

Checks to make sure that the text in this section matches the official text for UGRC.

--try-fix updates the text to match the official text.

Parsing Addresses

This project contains a module that can be used as a standalone address parser, sweeper.address_parser. This allows developers to take advantage of sweeper's advanced address parsing and normalization without having to run the entire sweeper process.

Usage Example

from sweeper.address_parser import Address

address = Address('123 South Main Street')
print(address)

'''
--> Parsed Address:
{'address_number': '123',
 'normalized': '123 S MAIN ST',
 'prefix_direction': 'S',
 'street_name': 'MAIN',
 'street_type': 'ST'}
'''

Available Address class properties

All properties default to None if there is no parsed value.

address_number

address_number_suffix

prefix_direction

street_name

street_direction

street_type

unit_type

unit_id If no unit_type is found, this property is prefixed with # (e.g. # 3). If unit_type is found, # is stripped from this property.

city

zip_code

po_box The PO Box if a po-box-type address was entered (e.g. po_box would be 1 for p.o. box 1).

normalized A normalized string representing the entire address that was passed into the constructor. PO Boxes are normalized in this format PO BOX <number>.

Installation (requires Pro 2.7+)

  1. clone arcgis conda environment
    • conda create --name sweeper --clone arcgispro-py3
  2. activate environment
    • activate sweeper
  3. install sweeper
    • pip install ugrc-sweeper
  4. Optionally duplicate config.sample.json as config.json in the folder where you will run sweeper.

Caution

This is required for the following functions:

  • --scheduled argument (required for sending emails)
  • --change-detect argument
  • using user-specific connection files via the CONNECTIONS_FOLDER config value

Exclusions

Tables can be skipped by adding values to the EXCLUSIONS.<sweeper_key> config array. These values are matched against table names using fnmatch. Note that these do not apply when using the --table-name argument.
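
The matching behaves roughly like this sketch using the standard library's fnmatch (the patterns and table names below are made up):

from fnmatch import fnmatch

# Hypothetical exclusion patterns and table names, for illustration only.
exclusions = ['SGID.CADASTRE.*', '*_DRAFT']
table_names = ['SGID.CADASTRE.Parcels', 'SGID.BOUNDARIES.Counties', 'SGID.WATER.Lakes_DRAFT']

for table_name in table_names:
    skipped = any(fnmatch(table_name, pattern) for pattern in exclusions)
    print(table_name, 'skipped' if skipped else 'swept')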

Development

  1. clone arcgis conda environment
    • conda create --name sweeper --clone arcgispro-py3
  2. activate environment
    • activate sweeper
  3. install required dependencies to work on sweeper
    • pip install -e ".[tests]"
  4. test_metadata.py uses a SQL database that needs to be restored via src/sweeper/tests/data/Sweeper.bak to your local SQL Server.
  5. run sweeper: sweeper
  6. test: pytest
  7. lint: ruff check .
  8. format: ruff format .

sweeper's People

Contributors

dependabot[bot], eneemann, gregbunce, jacobdadams, rkelson, stdavis, steveoh, ugrc-release-bot[bot], zachbeck


sweeper's Issues

Numeric street names with E for a unit id

1050 E 470 S E will parse the address correctly but will not return E as the unit id.
1106 S OLD HWY 89 E will return E as the unit id correctly.

There are some cases where # is incorrectly being added to unit ids when there is no unit type.

Addresses with “#th” streets

Valuable feedback from @msilski at WFRC...

Ex. No match found for 2040 S 23RD E in Salt Lake City
Ex. No match found for 70 N 2ND E in American Fork
Solution? Replace #th characters with 00 if two explicit directionals are present.

Currently, 2040 S 23RD E parses incorrectly (the screenshot of the parsed result is omitted).

We could do a regex on street names looking for th and rd (are there any others?) and replace them with 00.
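
A rough sketch of that replacement (the suffix list and limiting it to otherwise-numeric street names are assumptions based on the discussion above):

import re

# Sketch only: strip ordinal suffixes from numeric street names and append
# '00', e.g. 23RD -> 2300. Applying this only when two explicit directionals
# are present, as suggested above, would need the other parsed parts.
ORDINAL_STREET = re.compile(r'^(\d+)(ST|ND|RD|TH)$', re.IGNORECASE)

def normalize_ordinal_street(street_name):
    match = ORDINAL_STREET.match(street_name)
    if match:
        return f'{match.group(1)}00'

    return street_name

print(normalize_ordinal_street('23RD'))  # 2300
print(normalize_ordinal_street('2ND'))   # 200
print(normalize_ordinal_street('MAIN'))  # MAIN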

Manage own last checked

Currently sweeper relies on the last checked file from the cloudb open-sgid synchronizing tool. This is problematic for many reasons.

  1. There isn't a connection between cloudb and sweeper's runs. If cloudb fails, sweeper should still run with the correct dates. If sweeper fails, it should try again from the last successful date.
  2. The cloudb tool is going cloud native, so the file will not exist for sweeper to use.
  3. cloudb isn't running in GCVE, so sweeper doesn't have a last checked file to work from and cannot be migrated.

The code from the open-sgid can be used to create and manage the file.

The sweeper code change needs to happen here and here

The file should only be created if the --change-detection flag is present.
It should only be updated if the run is successful.
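
A minimal sketch of sweeper managing its own file, assuming a JSON file keyed by table name (the path and format here are placeholders, not the cloudb layout):

import json
from datetime import date
from pathlib import Path

# Placeholder path and format for illustration only.
LAST_CHECKED = Path('.last_checked')

def read_last_checked():
    if LAST_CHECKED.exists():
        return json.loads(LAST_CHECKED.read_text())

    return {}

def update_last_checked(table_name, run_was_successful):
    # only touch the file after a successful run
    if not run_was_successful:
        return

    dates = read_last_checked()
    dates[table_name] = date.today().isoformat()
    LAST_CHECKED.write_text(json.dumps(dates, indent=2))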

Replace credentials.py

The credentials.py file is problematic when using the PyPI package, primarily because you do not have access to create or use a credentials.py file once the package is built. We need to evaluate some options to bring in this information another way.

The credentials file seems to support many things but is mainly used for automating sweeper, right?

import socket

DB = ''  #: Full path to sde connection file
CHANGE_DETECTION = '' #: Change detection table name with 'SGID.' prefix
LAST_CHECKED_PATH = '' #: Full path to .last_checked file
REPORT_BASE_PATH = '' #: File path for report CSVs of everything that was fixed; rotated on each run
LOG_FILE_PATH = '' #: File path to log that is rotated on each run
CONNECTIONS = '' #: Dictionary that holds SDE connection file paths
EMAIL_SETTINGS = {  #: Settings for EmailHandler
    'smtpServer': '',
    'smtpPort': 25,
    'from_address': '',
    'to_addresses': '',
    'prefix': f'Auditor on {socket.gethostname()}: ',
}

DB is the workspace to sweep.
CHANGE_DETECTION and LAST_CHECKED_PATH are used to know what has changed since the last scheduled run.
REPORT_BASE_PATH is where to write logs? This seems almost unnecessary; it could have a default value and maybe a CLI option to override it.
CONNECTIONS holds the owner connection files and is very specific to our schema and the automated fixing process.
EMAIL_SETTINGS is for emailing the logs, which is great for a set-it-and-forget-it automated process.

If in #84 we move to a convention-based location, CHANGE_DETECTION and LAST_CHECKED_PATH can be removed and replaced with conventions, with the option to override them through the CLI.

DB is passed in as the workspace and isn't required. Why is it in the credentials also?

CONNECTIONS and EMAIL_SETTINGS seem good for some sort of config file. Email should go through SendGrid and only require an API key, but I'll create another issue for that.

Should we move to a convention where the folder the CLI is run from contains a JSON config file that sweeper loads its settings from? Do we continue to shrink the need for the credentials file and look for a _connections folder as another convention?

Thoughts?

Use supervisor's SendGrid feature

Sweeper uses supervisor to notify people of errors and to send logs. Currently it's using the on-prem mail server, judging by the config template.

Supervisor has a way to send email with SendGrid, and we've been migrating to SendGrid for all mail sending.

Let's migrate sweeper to use SendGrid.

problem addresses

Addresses throwing exceptions:

  • 2430 N RIVER VIEW WAY
  • 135 S RIVER BEND WAY
  • 1384 S CANYON CREST
  • 1361 N 1075 WEST UNIT 12 BLDG B
  • 860 S 1625 EAST UNIT C BLDG 27
  • 728 S WATER MILL WAY
  • 1623 E POETS REST
  • 1623 E POETS RST
  • 2362 S 3340 WEST CIR

complete docopt CLI

The docopt CLI needs to be completed so the different functions can be called by the main script, depending on what the user wants to accomplish.

Metadata Summary

  • should be shorter than the description.
  • needs to be less than 2048 characters (if this is what maps to snippet in AGOL)

I'm not sure that we want to limit the number of sentences?

Street Name Misspellings

From @steveoh

get the unique street names from our roads data and address points. then parse their addresses to the parts and see if the road exists in our data or something similar with levenshtein to catch misspellings

From @ZachBeck

[Look] for compound word misspellings like Switchback Way vs Switch Back Way

Not sure on the best way to do this. Perhaps trying to compare concatenated multiple word street names to the known list of street names? Or maybe something like levenshtein could handle this.
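
As a sketch of the fuzzy-matching idea, using difflib from the standard library as a stand-in for a levenshtein library (the street names below are made up):

from difflib import get_close_matches

# Hypothetical known street names; in practice these would come from the
# roads data and address points.
known_street_names = {'SWITCHBACK', 'MAIN', 'CANYON CREST'}

def flag_possible_misspelling(street_name):
    if street_name in known_street_names:
        return None

    # compare with spaces removed to catch compound-word splits like SWITCH BACK
    collapsed = street_name.replace(' ', '')
    candidates = get_close_matches(collapsed, known_street_names, n=1, cutoff=0.85)

    return candidates[0] if candidates else None

print(flag_possible_misspelling('SWITCH BACK'))  # SWITCHBACK
print(flag_possible_misspelling('CANYON CRST'))  # CANYON CREST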

address sweeping

The WFRC and a whole bunch of other agencies try to clean their addressing data to produce high match rate geocoding results. We should try to come up with a sweeper that can flag records and possibly standardize addresses.

Minimum Address Parts

From @steveoh

making sure there are the minimum address parts would be helpful and reporting what might be missing

As part of this issue, the names of the different address parts need to be decided upon (e.g. suffixDirection, prefixDirection, etc).

Addresses with multiple unit numbers

Here are a few from this old issue:

1361 N 1075 WEST UNIT 12 BLDG B
860 S 1625 EAST UNIT C BLDG 27

Currently, these addresses throw exceptions:

>>> address_parser.Address('1361 N 1075 WEST UNIT 12 BLDG B')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\projects\sweeper\src\sweeper\address_parser.py", line 67, in __init__
    parts, parsed_as = usaddress.tag(address_text.replace('.', ''), TAG_MAPPING)
  File "C:\Users\agrc-arcgis\AppData\Local\ESRI\conda\envs\sweeper\lib\site-packages\usaddress\__init__.py", line 186, in tag
    label)
usaddress.RepeatedLabelError:
ERROR: Unable to tag this string because more than one area of the string has the same label

ORIGINAL STRING:  1361 N 1075 WEST UNIT 12 BLDG B
PARSED TOKENS:    [('1361', 'AddressNumber'), ('N', 'StreetNamePreDirectional'), ('1075', 'StreetName'), ('WEST', 'StreetNamePostDirectional'), ('UNIT', 'SubaddressType'), ('12', 'SubaddressIdentifier'), ('BLDG', 'SubaddressType'), ('B', 'SubaddressIdentifier')]
UNCERTAIN LABEL:  unit_type

When this error is raised, it's likely that either (1) the string is not a valid person/corporation name or (2) some tokens were labeled incorrectly

To report an error in labeling a valid name, open an issue at https://github.com/datamade/usaddress/issues/new - it'll help us continue to improve probablepeople!

For more information, see the documentation at https://usaddress.readthedocs.io/

@eneemann suggested that maybe these should be parsed as 'UNIT B12' and 'UNIT 27C'.

At the very least, we should not be passing on the exception from usaddress...

invalid report names

sweeper report naming needs to be improved to allow for feature service input.

@gregbunce I think it's because the table name is built into the filename of the report, which causes problems when the table name is a URL:

OSError: [Errno 22] Invalid argument: 'c:\\\\Temp\\sweeper_run_20201130_0842\\https://services.arcgis.com/ZzrwjTRez6FJiOq4/arcgis/rest/services/Oil_and_Gas_Fields/FeatureServer/0_DuplicateTest_0.txt'

Originally posted by @eneemann in agrc/porter#85 (comment)

standardize sweeper reporting

The reporting format needs to be determined and standardized across the functions so a consistent report is provided.

add functionality to check domains

Currently, we are moving away from coded value domains that do not match the domain description. This check would flag domains that are not in compliance. See this doc for more info.

Add check to ensure data exists before sweeping

We need to add a check (arcpy.Exists()) to ensure the data listed in the change detection table still exists in the workspace. We've had several crashes because Sweeper tries to 'sweep' a data layer that no longer exists in the workspace but still has a row in change detection.
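
A sketch of what that guard could look like (the workspace path and table names are placeholders):

import arcpy

# Placeholder values; the real table names come from the change detection table.
arcpy.env.workspace = r'C:\path\to\connection.sde'
tables_from_change_detection = ['SGID.BOUNDARIES.Counties', 'SGID.WATER.Lakes']

for table_name in tables_from_change_detection:
    if not arcpy.Exists(table_name):
        print(f'{table_name} is in change detection but no longer exists; skipping')
        continue

    # ... sweep the table ...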

State Routes SR

Addresses like 910 S SR 22 or 1910 N US HWY 89
will return 22 and 89 for the street_name

Test for stewarding agency tag

At a minimum, each SGID dataset’s tags should include the stewarding agency (ie: UGRC), “SGID”, and the appropriate category name.

Come up with a list of known stewards and create a sweeper test to warn if it doesn't find one.

Originally posted by @steveoh in #75 (comment)

test functions on common database

When ready, we need to test the different aspects of each function against a common SDE database to ensure everything is working correctly.

Greg's test database is probably a good place to do this

sweeper: global id field

We don't version or have a need for global IDs in the SGID. Should we create a check that looks for this field and removes it when loading data into the SGID? There are a handful in the internal DB right now.
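
A sketch of what the check could look like with arcpy (flagging only; how --try-fix would remove the field is left open):

import arcpy

def has_global_id_field(table):
    # flag tables that carry a GlobalID field since the SGID does not use them
    return any(field.type == 'GlobalID' for field in arcpy.ListFields(table))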

general data cleaning functions

Might be nice to add a simple function (or functions) that performs general data cleaning on feature classes or a database to ensure human-induced errors aren't propagated. This is probably most applicable for string fields. Could loop through rows/fields to (see the sketch after this list):

  • Remove internal whitespace or extra spaces
  • Remove leading and trailing whitespace
  • Others...
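
A rough sketch of that cleanup with an arcpy update cursor, assuming the cleanup applies to every string field in the table:

import re

import arcpy

def clean_string_fields(table):
    # collapse internal runs of whitespace and strip leading/trailing whitespace
    string_fields = [field.name for field in arcpy.ListFields(table, field_type='String')]
    if not string_fields:
        return

    with arcpy.da.UpdateCursor(table, string_fields) as cursor:
        for row in cursor:
            cleaned = [re.sub(r'\s+', ' ', value).strip() if value else value for value in row]
            if cleaned != row:
                cursor.updateRow(cleaned)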

991 vista east fork

{'address_number': '991',
 'normalized': '991 VISTA EAST FRK',
 'street_name': 'VISTA EAST',
 'street_type': 'FRK'}

The street name should be VISTA EAST FORK; @rkelson says FRK is not a valid street type in Utah.

Non-standard street directions are incorrectly parsed as part of the street name

More geocoding feedback from @msilski:

Ex. 166 E 14000 SO SUITE 200 in Draper matched to 166 E ST in Salt Lake City

This address currently parses as:

>>> Address('166 E 14000 SO SUITE 200')
Parsed Address:
{'address_number': '166',
 'normalized': '166 E 14000 SO SUITE 200',
 'prefix_direction': 'E',
 'street_name': '14000 SO',
 'street_type': None,
 'unit_id': '200',
 'unit_type': 'SUITE'}

It should really be parsed as:

>>> Address('166 E 14000 SO SUITE 200')
Parsed Address:
{'address_number': '166',
 'normalized': '166 E 14000 SO SUITE 200',
 'prefix_direction': 'E',
 'street_name': '14000',
 'street_type': None,
 'suffix_direction': 'S',
 'unit_id': '200',
 'unit_type': 'SUITE'}

I wonder if we could check the street names for last words with two-letter directions and then move them to suffix_direction.

166 E 14000 S SUITE 200 geocodes correctly...
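
A sketch of the idea above, moving a trailing direction word out of the street name (the direction alias table is illustrative, not sweeper's actual normalization data):

# Illustrative aliases only.
DIRECTION_ALIASES = {
    'N': 'N', 'NO': 'N', 'NORTH': 'N',
    'S': 'S', 'SO': 'S', 'SOUTH': 'S',
    'E': 'E', 'EA': 'E', 'EAST': 'E',
    'W': 'W', 'WE': 'W', 'WEST': 'W',
}

def split_trailing_direction(street_name):
    words = street_name.split()
    if len(words) > 1 and words[-1].upper() in DIRECTION_ALIASES:
        return ' '.join(words[:-1]), DIRECTION_ALIASES[words[-1].upper()]

    return street_name, None

print(split_trailing_direction('14000 SO'))  # ('14000', 'S')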

metadata sweeper

Was this ever discussed as an option? I'm seeing lots of missing metadata. We'd need to wait until Pro 2.5 is released to be able to have access to it via arcpy.

  • Should validate that summary is shorter in length than description. Moved to: #48
  • purpose needs to be less than 2048 characters to map to snippet in AGOL. Moved to #48.
  • Link to data page. Moved to #49

Address Parser Module Design

There obviously needs to be a module in src/sweepers that defines the sweep and try_fix methods. But if this is going to be a new replacement for https://github.com/agrc/agrc.python/blob/master/agrc/parse_address.py, there needs to be a way for users to use the parsing logic outside of the main sweeper process. My proposal would be to keep the core address parsing logic in a separate module (still within the sweeper project) that could be imported and used directly.

An example of how it could be used outside of the sweeper process in a custom script:

from sweeper import addressParser

parsed = addressParser.parse('123 S Main St')

Does anyone have issues with this or ideas on how it could be done better?

Ping @ZachBeck @rkelson

Bug: Unable to delete rows on standalone tables

Need to use a different tool for selecting/deleting rows in a standalone table in the duplicates and empties tests. Pseudo-code below:

if is_table:
    use MakeTableView, DeleteRows
else:
    use MakeFeatureLayer, DeleteFeatures
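
A rough implementation of the pseudo-code with arcpy (the view/layer names and where clause handling are assumptions):

import arcpy

def delete_matching_rows(dataset, where_clause, is_table):
    # standalone tables need a table view and DeleteRows; feature classes need
    # a feature layer and DeleteFeatures
    if is_table:
        view = arcpy.management.MakeTableView(dataset, 'sweeper_view', where_clause)
        arcpy.management.DeleteRows(view)
    else:
        layer = arcpy.management.MakeFeatureLayer(dataset, 'sweeper_layer', where_clause)
        arcpy.management.DeleteFeatures(layer)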

Old Midvale Address System Addresses

From @msilski:

Old Midvale addresses - found five cases of using legacy local address system
Ex. 112 S ALLEN ST in Midvale matched to 112 S ST in Salt Lake City (true location: 7832 S ALLEN ST in Midvale)
Ex. 19 E CENTER ST in Midvale matched to 19 W CENTER ST in Salt Lake City (true location: 684 W CENTER ST in Midvale)

It may not be worth it for only five records in this dataset (DWS employment data). I'm not sure how common these addresses really are. I do remember encountering them when I worked for Sandy City years ago.

Replace print statements with python logging

Python logging can have multiple handlers - console, file, whatever. I would suggest that the standard out (console) handler is always added. If the --save-report flag is passed, then add the file handler. This would be the same as using print with the added benefit that you can filter by severity level. This is an example from forklift.

Originally posted by @steveoh in #67 (comment)
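
A minimal sketch of that handler setup (the logger name, levels, and file name are assumptions):

import logging

def setup_logging(save_report=False):
    logger = logging.getLogger('sweeper')
    logger.setLevel(logging.DEBUG)

    # console output is always on
    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    logger.addHandler(console)

    # only write to disk when --save-report is passed (file name is a placeholder)
    if save_report:
        report = logging.FileHandler('sweeper.log')
        report.setLevel(logging.DEBUG)
        logger.addHandler(report)

    return logger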

Add Missing Street Types in Certain Circumstances

From @eneemann:

Ensure street type is used consistently (specifically, enforcing that 'ST' is used on Center/Main)
I don't explicitly mean there should never be a "Main Dr" (though there are probably very few). What I mean is that I know of some cities that don't put a street type on Main St or Center St. Their addresses are just "123 Main" or "456 Center." I just want to make sure we catch those, and think standardizing them to add "St" as a street type for Main or Center, when the street type is missing, would probably improve results.

Standardize Address Parts

The address parser should be able to format these address parts:

  • Street Types (ST, WAY, HWY, etc)
  • Cardinal Directions - Single letters are preferred (N, S, E, W)
  • Unit Types (Suite, Apt)
  • PO Boxes

The sweeper task would report non-standard values as issues and could attempt to fix them if the --try-fix flag is passed.

Are there any other types that I'm missing?

Where can I get a list of the values that are accepted as standards for each of these address parts? The lowest hanging fruit is the cardinal directions. Is the preference single capital letters (N, S, E, W)?

Ping @rkelson @ZachBeck @eneemann @gregbunce
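
A sketch of the kind of normalization table this could use (the mappings below are examples only; the accepted standards still need to be decided):

# Example mappings, not an authoritative standard.
DIRECTIONS = {'NORTH': 'N', 'SOUTH': 'S', 'EAST': 'E', 'WEST': 'W'}
STREET_TYPES = {'STREET': 'ST', 'AVENUE': 'AVE', 'HIGHWAY': 'HWY'}
UNIT_TYPES = {'SUITE': 'STE', 'APARTMENT': 'APT'}

def normalize_part(value, mapping):
    normalized = mapping.get(value.upper(), value.upper())

    # flag the value as an issue (and fix it with --try-fix) when it differs
    # from the standard form
    return normalized, normalized != value

print(normalize_part('South', DIRECTIONS))  # ('S', True)
print(normalize_part('ST', STREET_TYPES))   # ('ST', False)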
