agrc / forklift

:tractor::package::sparkles: Slinging data all over the place :tractor::package::sparkles:

License: MIT License
Most of the pallets I have been converting have different settings for dev/test/prod. Currently, this is only possible with forklift if we run the pallet individually. If we add a property to the config that holds the configuration name, or add optional args to the CLI, we can pass that value to the new build method that we discussed in #65. Optional args can still be passed to the __init__ method when a pallet is run individually.

We should be able to run all pallets as dev, stage, or prod without having to modify the pallets.
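A minimal sketch of how that could look, assuming the build method from #65 receives the configuration name (the configuration values and workspace paths here are hypothetical):

from forklift.models import Pallet


class ConfigurablePallet(Pallet):
    def build(self, configuration='Production'):
        #: configuration would come from the config property or a CLI arg (assumption)
        workspaces = {'Dev': 'Database Connections\\dev.sde',
                      'Staging': 'Database Connections\\stage.sde',
                      'Production': 'Database Connections\\prod.sde'}
        source_workspace = workspaces[configuration]

        self.add_crates(['SGID10.GEOSCIENCE.AvalanchePaths'],
                        {'source_workspace': source_workspace,
                         'destination_workspace': 'C:\\MapData\\output.gdb'})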
The logging situation is not ideal right now. The handler rotates on a sliding window based on the last write, and with our current setting it is easy for the log to get cut off. If we switch our handler to use the midnight setting and schedule forklift to run after midnight, our logs will be grouped properly.
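For reference, a sketch of the midnight setting using the standard library (the filename and backupCount are placeholders):

import logging
import logging.handlers

#: rotate the log at midnight instead of on a sliding window from the last write
handler = logging.handlers.TimedRotatingFileHandler(
    'forklift.log', when='midnight', backupCount=7)

log = logging.getLogger('forklift')
log.addHandler(handler)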
from arcpy import env
from forklift.models import Pallet


class SimplePallet(Pallet):
    def __init__(self):
        super(SimplePallet, self).__init__()

        destination_workspace = env.scratchGDB
        source_workspace = 'Database Connections\\[email protected]'

        #: source name, source workspace, destination workspace, destination name
        crate_info = ('SGID10.GEOSCIENCE.AvalanchePaths', source_workspace, destination_workspace, 'AvyPaths')

        self.add_crate(crate_info)
I would expect crate creation to be source name, source workspace, destination name, destination workspace. Instead we have a mirror of params; without looking at this tuple, it's hard to know how to create a crate properly.
It looks as if multiple pallets defined in the same module are executed in alphabetical order. This should be verified and pinned down with unit tests, since some projects will depend on a consistent execution order (e.g. DEQ).
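A test along these lines could pin that down (list_pallets and the test module path are assumptions about forklift's internals):

import unittest

from forklift import lift


class TestPalletOrder(unittest.TestCase):
    def test_pallets_execute_in_alphabetical_order(self):
        #: multiple_pallets.py would define several pallet classes (hypothetical fixture)
        pallets = lift.list_pallets('tests/data/multiple_pallets.py')
        names = [pallet.__class__.__name__ for pallet in pallets]

        self.assertEqual(names, sorted(names))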
I would expect to see the result of each crate.
Only the message from each crate is displayed.
It should be reported as running successfully, perhaps with a message that says "This pallet only runs on Fridays".
It shows up as an error with a message of "None".
Part of the issue is this change, which will always set the success value of the pallet to false if it's not Friday. @steveoh, why did we add the check on is_ready_to_ship to the report? I can't think of a good reason. I think that we should remove this extra check and set the success (boolean and message) in the pallet, something like this:
from time import strftime


def is_ready_to_ship(self):
    ready = strftime('%A') == 'Friday'

    if not ready:
        self.success = (True, 'This pallet only runs on Fridays')

    return ready
Run forklift init. I would expect config.json to be created in the folder from which forklift was invoked. Instead, config.json is created in site-packages. This is nice in that it is in the same place no matter where you run forklift from, but it's a pain to dig into that folder to take a look at the config file. It's also a bad place to be maintaining things in general. I think that it would be better to have it closer to where you are in the file system.
The crate should be ignored and a warning should be sent from forklift. Instead, we get nothing but exceptions from arcpy.
We need to decide exactly how we want the backups generated and where they should be stored.
Try to update a feature class that has a longer text field in the source than in the destination (same name). I would expect the tool to warn me about the difference in field length, but it doesn't; no schema change error is reported. We should look at field type, length, scale, and precision (see the sketch below).
Ported from agrc/agrc.python#3.
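A hedged sketch of that comparison using arcpy.ListFields (the function name and return shape are assumptions):

import arcpy


def _get_field_differences(source, destination):
    '''returns a list of (field name, property, source value, destination value) mismatches'''
    differences = []
    destination_fields = {field.name: field for field in arcpy.ListFields(destination)}

    for source_field in arcpy.ListFields(source):
        destination_field = destination_fields.get(source_field.name)
        if destination_field is None:
            continue

        #: compare the properties called out above
        for prop in ['type', 'length', 'scale', 'precision']:
            source_value = getattr(source_field, prop)
            destination_value = getattr(destination_field, prop)

            if source_value != destination_value:
                differences.append((source_field.name, prop, source_value, destination_value))

    return differences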
When attempting to run forklift for a specific pallet, it failed to copy the data to copyDestinations because a service associated with another pallet was locking the same database. I wonder if _hydrate_copy_structures could be adjusted to populate source_to_services for all pallets even when forklift is run on a single pallet. That way, when it goes to copy the data, it would stop all services pointing to the database, even those not associated with the pallet that was run.
From the Windows scheduler on .56.
All are daily unless otherwise noted
The rest of these are in SGID10.gdb...
Define a new property on Pallet called copy_data (a list) that tells forklift which workspaces you want copied to the production servers. This would default to [], in which case no data would be copied.

Add a new property to the config called copy_destinations (a list) that defines where the data (defined by copy_data in Pallet) should be copied after processing.

Implement a new private method in lift.py that loops through all of the pallets, generates a distinct list of all of the workspaces that need to be copied, and copies them to copy_destinations (see the sketch below).
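A minimal sketch of that private method, assuming file geodatabase workspaces that can be copied as folders (the method name is hypothetical):

from os import path
import shutil


def _copy_to_destinations(pallets, copy_destinations):
    #: build a distinct list of workspaces across all pallets
    workspaces = set()
    for pallet in pallets:
        workspaces.update(pallet.copy_data)

    for workspace in workspaces:
        for destination in copy_destinations:
            destination_workspace = path.join(destination, path.basename(workspace))

            if path.exists(destination_workspace):
                shutil.rmtree(destination_workspace)

            shutil.copytree(workspace, destination_workspace)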
Use a __ prefix for all pallet properties and private methods to avoid unexpected overwrites when creating your own pallet.
Collide all the things.

Create a crate like ('FCName', 'FGDB.gdb', 'SGID Stage.sde', 'SGID10.OWNER.FCName'). I would expect SGID Stage.sde/SGID10.OWNER.FCName to be updated or created. Instead, SGID Stage.sde/SGID10.OWNER.SGID10_OWNER_FCName is updated or created. :(

I propose that the . -> _ replacement only be done if destination_name is not passed into the constructor (see the sketch below).
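A sketch of the proposed guard in the Crate constructor (parameter order matches the tuple above; attribute names are assumptions):

class Crate(object):
    def __init__(self, source_name, source_workspace, destination_workspace, destination_name=None):
        #: only fall back to the . -> _ replacement when no explicit name is passed
        self.destination_name = destination_name or source_name.replace('.', '_')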
Copy a feature class that has a Float field (shows as Single in code) to an SDE database, where the field is stored as Double, and curse ESRI. The schema check for the crate should pass since it's the same feature class and no changes to the schema have been made. Instead, the schema check fails and reports something like: AVE_LENGTH: source type of Single does not match destination type of Double.
The issue is in this line of code. I propose that we run something like this before checking the field types:

if not isTable:
    arcpy.MakeFeatureLayer_management(sdeFC, layer, '1 = 2')
else:
    arcpy.MakeTableView_management(sdeFC, layer, '1 = 2')

try:
    arcpy.Append_management(layer, f, 'TEST')
    passed = True
except arcpy.ExecuteError:
    #: go on to checking the field types and lengths for the report
    pass
I've tested Append_management against the Single/Double issue and it runs successfully.
We should pull counties into the test SDE and see what is happening to make it always think its shape has changed.
I vote for docopt because it's awesome.
brainstorming ideas...
'''
forklift

Usage:
    forklift update [--config=<config>]
    forklift update-only <path> [--plugin=<plugin>]

Options:
    --config=<config>    the path to a cfg or text file where we keep paths to places with update plugins; defaults to some relative or static path
    --plugin=<plugin>    the name of the plugin used to filter execution; maybe a partial match, glob, or exact match?

Arguments:
    <path>    an optional path to pass in so you can run a certain directory
'''
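Wiring that docstring up with docopt would look something like this (run_all and run_single are hypothetical helpers):

from docopt import docopt


def main():
    #: docopt parses sys.argv against the Usage section of the docstring
    args = docopt(__doc__)

    if args['update']:
        run_all(config=args['--config'])
    elif args['update-only']:
        run_single(args['<path>'], plugin=args['--plugin'])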
I may be missing some, and some should be removed because they do not use FGDBs, etc.
I would love a flag that I can use to print debug logs to the console when running forklift from the command line. Something like -v or --debug.
It should be stored locally on each server. This will help with agrc/locate#94 and saves us from having to manage them in the source code.
We need to test ArcGIS Server to see what happens when we remove and copy data while it has a process spooled up. If that works, this issue can be closed. Otherwise, a Pallet needs a property with the path to the service so that it can be used to restart the service. I have a hunch that we can remove and copy the fresh data and then restart the service; I'm hoping we do not need to stop the service before removing and copying. The create schema lock setting might need to be set for all services. Just an idea.
Right now update.py has a lot of public methods. We should tighten those up when we create core.py from it. I think we should pull update.py out of agrc.python and rename it to core.py.
A crate can ask for sgid10.boundaries.counties or counties, and the county feature class should end up in a boundaries.gdb. Instead, everything ends up in SGID10.gdb.
I think the problem here is OBJECTID_1 and ORDER BY OBJECTID:
RuntimeError: An invalid SQL statement was used. [SELECT OBJECTID_1, CustomerInfo_FK, Program, ContactPerson, GUID, DateFromPurchasing, ContactPhone, OriginalContractAmount, PDFDocument, ConEffectiveDate, ContactEmail, Fund, ReasonForRejection, PracticeMapPDF, NRCSNum, CancelDate, ArchPDF, Notes, UDAFContractNum, Dept, NEPAPDF, Cancelled, Project_FK, ScheduleOfOperationsPDF, ContractAmount, ReasonForCancellation, GrantCategory, ContractStatus, GranteeStatus, ConExpirationDate, OrgUnit, ManagerNotes, DateReceived, CostShareRate, ContractType, AppUnit, DateToPurchasing FROM ContractInformation ORDER BY OBJECTID]
The destination table should be created on the first run with an OBJECTID field, since it's being created within a geodatabase. Then the destination table should validate on subsequent runs. On the first run, the destination table is created as expected. However, on subsequent runs, core.py reports OBJECTID as a missing field and throws an exception when trying to check for changes. See below for the console output...
INFO 06-20 07:48:34 lift: 49 crate: interactive_map_monitoring_sites
WARN 06-20 07:48:35 core: 174 Missing fields in \\tsclient\stdavis\Documents\Projects\deq-enviro\scripts\nightly\settings\..\databases\eqmairvisionp.sde\AVData.dbo.interactive_map_monitoring_sites: OBJECTID
ERRO 06-20 07:48:37 core: 63 unhandled exception: Attribute column not found[42S22:[Microsoft][SQL Server Native Client 11.0][SQL Server]Invalid column name 'OBJECTID'.] for crate { 'destination': 'C:\\Scheduled\\staging\\DEQEnviro\\TempPoints.gdb\\interactive_map_monitoring_sites',
'destination_coordinate_system': <SpatialReference object at 0xc3a2930[0xc3393f8]>,
'destination_name': 'interactive_map_monitoring_sites',
'destination_workspace': 'C:\\Scheduled\\staging\\DEQEnviro\\TempPoints.gdb',
'geographic_transformation': 'NAD_1983_To_WGS_1984_5',
'result': ( 'This crate was never processed.',
None),
'source': '\\\\tsclient\\stdavis\\Documents\\Projects\\deq-enviro\\scripts\\nightly\\settings\\..\\databases\\eqmairvisionp.sde\\AVData.dbo.interactive_map_monitoring_sites',
'source_name': 'AVData.dbo.interactive_map_monitoring_sites',
'source_workspace': '\\\\tsclient\\stdavis\\Documents\\Projects\\deq-enviro\\scripts\\nightly\\settings\\..\\databases\\eqmairvisionp.sde'}
Traceback (most recent call last):
File "C:\Python27\ArcGIS10.3\lib\site-packages\forklift\core.py", line 56, in update
if _has_changes(crate):
File "C:\Python27\ArcGIS10.3\lib\site-packages\forklift\core.py", line 269, in _has_changes
for destination_row, source_row in izip(f_cursor, sde_cursor):
RuntimeError: Attribute column not found[42S22:[Microsoft][SQL Server Native Client 11.0][SQL Server]Invalid column name 'OBJECTID'.]
INFO 06-20 07:48:37 lift: 56 result: ('Unhandled exception during update.', "Attribute column not found[42S22:[Microsoft][SQL Server Native Client 11.0][SQL Server]Invalid column name 'OBJECTID'.]")
We should be able to report on exceptions that happen within these methods.
I gave this a shot in the parcels application. Let's put some incantation of this into forklift.
Do we want to have to check out forklift into every project so the update script can be imported and inherited from? Or do we check out one version onto each server where we have update scripts?
williamscraigm: do you run compact on it after doing that? if you're doing lots of updates and don't compact, all the changes are in delta files and it's not optimal.
I wonder if we should only try to import potential pallets that have "pallet" in the file name. This would cut out all of the issues with running standalone scripts unintentionally. It would also make the import errors more relevant, since they are more likely to be real issues. For example, last night the DEQ pallet failed to import (because I forgot to include a file that's not in version control), but the main forklift report didn't report any issues.
Crate:

    'destination_name': 'DWQMercuryInFishTissue',
    'destination_workspace': 'SGID10 as ENVIRONMENT on stage.sde',
    'source_name': 'Mercury_in_Fish_Tissue',
    'source_workspace': '\\\\<...>\\GIS\\DWQGIS\\projects\\Interactive_Map\\DWQ_Data_Interactive_Map.gdb'
Run the pallet two times in a row. I would expect "Created table successfully." on the first run and "No changes found." on the second run. Instead, I get "Created table successfully." on the first run and "Data updated successfully." on the second run.

Here's an example of the rows that were compared (note that they differ only in float precision):
source row: (194.8000030517578, -111.941056, u'No Consumption Advisory', 41.501327, u'4900440 no fish advisory', u'Bluegill', 1, 0.1346, 189.1999969482422, u'BOX ELDER', 2013, u'4900440-Bluegill-2013', 10, 0.206, u'4900440', u'Reservoir/Lake', 0.08, 0.040219, u'MANTUA RES AB DAM 01', 4594839.0, 421458.0, (421458.0, 4594839.0))
destination row: (194.80000305, -111.941056, u'No Consumption Advisory', 41.501327, u'4900440 no fish advisory', u'Bluegill', 1, 0.1346, 189.19999695, u'BOX ELDER', 2013, u'4900440-Bluegill-2013', 10, 0.206, u'4900440', u'Reservoir/Lake', 0.08, 0.040219, u'MANTUA RES AB DAM 01', 4594839.0, 421458.0, (421458.0, 4594839.0))
Set source_workspace to an SDE database connection as a user that is not the owner (e.g. SGID10 with the agrc user), set source_name to either owner.name or name (e.g. GEOSCIENCE.AvalanchePaths or AvalanchePaths), and run forklift lift twice.

I would like to be able to specify the feature class name without the owner; forklift should only fail if there are duplicate names. Instead, with name, forklift fails on the first call to lift because it is not found in the source; with owner.name, the second call to lift fails because it is not found in the destination.

from arcpy import env
from forklift.models import Pallet


class SimplePallet(Pallet):
    def __init__(self):
        super(SimplePallet, self).__init__()

        destination_workspace = env.scratchGDB
        source_workspace = 'Database Connections\\[email protected]'

        self.add_crates(['SGID10.GEOSCIENCE.AvalanchePaths'],
                        {'source_workspace': source_workspace,
                         'destination_workspace': destination_workspace})
I think I'm noticing something odd: a pallet's source code can update, but if there is a .pyc file from a prior run, the old code is run. We may want to git clean -f to remove untracked files after our git update. That could get rid of secret files or things that are git-ignored, right? So maybe we should just delete all *.pyc files instead (see the sketch below)?
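If we go the delete-all-*.pyc route, a simple sketch (the folder path is a placeholder):

import os

pallet_folder = 'c:\\scheduled'  #: placeholder; wherever the pallets are checked out

for root, dirs, files in os.walk(pallet_folder):
    for name in files:
        if name.endswith('.pyc'):
            os.remove(os.path.join(root, name))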
Raise an exception in a pallet's __init__. I would expect forklift to complete successfully and show the error raised for that pallet. Instead, forklift chokes and crashes without processing any subsequent pallets.

I think that we should wrap this code in a try/except; on except we may need to create a fake pallet with the appropriate success tuple (see the sketch below).
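A sketch of that guard (the surrounding discovery loop is an assumption about lift.py):

from forklift.models import Pallet

pallet_classes = []  #: placeholder; whatever lift.py discovers

pallets = []
for PalletClass in pallet_classes:
    try:
        pallets.append(PalletClass())
    except Exception as e:
        #: stand-in pallet so the report still shows the failure
        broken_pallet = Pallet()
        broken_pallet.success = (False, str(e))
        pallets.append(broken_pallet)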
lift is using pallet outside of the loop, so the status is never set. We need a way to look up the pallet from the destination in order to update the status for the pallet.
The pallet should have a method defined that allows for easy email notifications. This will allow for situations where we want to notify someone when data was updated or a process occurred for a specific pallet. Maybe something like:

self.send_email('[email protected]', 'Raster data was updated (subject)', 'This would be the body of the email')
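A minimal sketch of that method using the standard library (the SMTP host, port, and from address are placeholders):

from email.mime.text import MIMEText
import smtplib


def send_email(self, to, subject, body):
    #: would live on Pallet; sends a plain-text message through a local relay
    message = MIMEText(body)
    message['Subject'] = subject
    message['To'] = to
    message['From'] = '[email protected]'

    smtp = smtplib.SMTP('smtp.example.com', 25)
    smtp.sendmail(message['From'], [to], message.as_string())
    smtp.quit()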
Run lift with logger: "file". I would expect a log file to be created while still seeing the output in the console. Instead, the log file is created but the console is blank.

I can't think of a good reason to have to choose one handler over the other. Why not just use both all of the time and get rid of the logger config?
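Using both all of the time would only be a few lines (formatter omitted for brevity):

import logging
import logging.handlers

log = logging.getLogger('forklift')
log.setLevel(logging.INFO)

log.addHandler(logging.StreamHandler())  #: console
log.addHandler(logging.handlers.TimedRotatingFileHandler('forklift.log', when='midnight'))  #: file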
The table structure should be copied as-is. Instead, OBJECTID_1 is created and causes issues on the second run of forklift. If you need an example, I can repro this with the ACTS pallet I am creating, where the data is coming from a 9.3 geodatabase. Should OBJECTID_1 be added to the naughty list?
Pallet should define default destination_coordinate_system and geographic_transformation properties that will be passed into the Crate constructor. These could obviously be overridden:

self.destination_coordinate_system = 3857
self.geographic_transformation = 'NAD_1983_To_WGS_1984_5'
Run list-pallets on a folder that has pallet classes defined in files within subdirectories. I would expect all of the pallets (including those within subfolders) to be listed. Instead, only direct children of the pallet folder are listed.
forklift.py can be the runner. This will be the tool that is run every so often. I think it should have a config of file paths (UNC, and most likely a default of c:\scheduled) so it can scan for update plugins. Then it can figure out which ones to run and which ones to skip.
How should we define/handle reprojecting data between source and destination? Maybe output_spatial_reference and default_transform properties on a pallet? Feel free to add to this; just trying to brainstorm.
I would expect Pallet:process to run. Instead, Pallet:process does not run, and the pallet and crates return successful.
Point a crate at a destination_workspace that does not exist. I would expect the destination workspace (e.g. a file geodatabase) to be created. Instead, the crate returns this error:
ERROR 000210: Cannot create output C:\ForkliftData\Broadband.gdb\BB_Service
Failed to execute (CopyFeatures).
The report from forklift, when printed to the console, is a bunch of HTML and is not helpful. We should do something easier to read and still color-coded; the nose-cover output would be a good example.
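One possible direction, sketched with colorama (an assumption; any ANSI coloring would do):

from colorama import Fore, init

init()  #: enables ANSI color codes on Windows consoles


def print_crate_result(crate_name, success, message):
    #: green for successful crates, red for failures
    color = Fore.GREEN if success else Fore.RED
    print('{}{}: {}{}'.format(color, crate_name, message, Fore.RESET))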