Coder Social home page Coder Social logo

etl's Introduction

PHP tools for ETL

This is a collection of PHP 7 classes useful for extracting, transforming, and loading data between sources.

Hierarchical XML and JSON can be automatically converted to relational SQL. Support includes extracting data documents from a file system or REST API, and then loading the data into a DBMS like Microsoft SQL Server.

Values are surveyed for datatypes, numeric cardinality, and unique natural key candidates. Then this information is used to create a normalized multi-table database structure suited to insert the data.

Branch Tests Code Coverage
master Build Status Codecov
dev Build Status Codecov

Requirements

PHP >= 7.0

Installation

This library is registered on packagist and can be easily installed into your project using composer.

composer require jpuck/etl

Getting Started

There are 3 basic groups of interrelated classes: Sources provide Data which have Schemata.

  1. Sources

    Sources extend the abstract Source class and transport Datum objects. In particular, the abstract DB class has concrete class implementations such as MicrosoftSQLServer.

  2. Data

    Data classes extend Datum and must implement a valid parser, satisfied by ParseValidator. It uses the Schematizer to construct the object from raw data, which can be overridden by passing an existing Schema.

  3. Schemata

    A Schema is a concrete class with a Validator to enforce structure. The Merger class is for combining Schemas to create super-set Schemas. The DDL trait is used by the DB class to generate SQL Data Definition Language which contains abstract methods to be implemented by a specific database management system.

Schematizer

The Schematizer class is for surveying the structure of the data. It includes node names, the count of distinct element groupings, numeric cardinality for relationships between subnodes, and descriptive statistics about the values including uniqueness. Categorically, it recognizes datetime, integer, and decimal datatypes. Decimals will include scale and precision measurements suitable for SQL.

Schematizer::getPrecision returns the scale and precision of numeric values suitable for the SQL DECIMAL(scale,precision) datatype. This function has notable behavior in that trailing zeros are discarded when passed as raw PHP float types. However, when passed as a string, then the trailing zeros are preserved in the precision. See SchematizerUtilitiesTest::precisionDataProvider for examples. Note that in the XML class, parsed values are represented as strings in PHP, so trailing zeros should be represented in the precision values.

node name
├── count
│   ├── max
│   │   ├── measure
│   │   └── value
│   └── min
│       ├── measure
│       └── value
├── unique (all values)
├── primaryKey
├── varchar          ────┐
│   ├── max              │
│   │   ├── measure      │
│   │   └── value        │
│   └── min              │
│       ├── measure      │
│       └── value        │
├── datetime             ├ datatypes
│   ├── max              │
│   │   └── value        │
│   └── min              │
│       └── value        │
├── int/decimal          │
│   ├── max              │
│   │   └── value        │
│   └── min              │
│       └── value    ────┘
├── scale            ────┐
│   ├── max              │
│   │   ├── measure      │
│   │   └── value        │
│   └── min              │
│       ├── measure      │
│       └── value        │
├── precision            ├ if decimal
│   ├── max              │
│   │   ├── measure      │
│   │   └── value        │
│   └── min              │
│       ├── measure      │
│       └── value    ────┘
├── children
│   ├── distinct (count of children)
│   └── count
│       ├── max
│       │   └── measure
│       └── min
│           └── measure
├── attributes
│   └── ... (excluding count, which must be 1)
└── elements
    └── ...

Database Connections

The DB class requires an instance of PDO in the constructor to connect, but it is possible to pass a null value if only utilizing the class for DDL.

SQL Data Definition Language

When one-to-many XML nodes are used to represent one-to-one relationships, then the Schematizer recognizes this and a DDL class flattens them as columns on a table. If a node has more than one of its name or grandchildren, then the one-to-many relationship is preserved in a separate normalized table. Surrogate keys are created to maintain the Primary/Foreign Key referential integrity.

If the Schema has a primaryKey set, then that field will be used for DDL generation instead of the surrogate. However, this Schema must also be passed to the Datum constructor prior to being used with DB::insert, otherwise the surrogate keys will be used by default and will result in a failed insertion if the surrogate columns don't exist.

Saving Schemas

Generating Schemas can take a long time and may require customization, such as adding primaryKey flags or removing unwanted fields to be ignored. Here are some examples for exporting and importing:

$xml = file_get_contents("sample.xml");
$xml = new XML($xml);
$schema = $xml->schema();

// normal JSON_PRETTY_PRINT
echo $schema;

By simply echoing the object, output can be redirected to a file from console:

php script.php > myschema.json

Use file_put_contents to write to disk. Schema::toJSON accepts all the json_encode options.

$string = $schema->toJSON(JSON_UNESCAPED_UNICODE);
file_put_contents('myschema.json', $string);

// native php array
$array = var_export($schema->toArray(), true);
$array = "<?php return $array;";
file_put_contents('myschema.php', $array);

Import any of those formats the same way by passing the filename, an array, or a JSON string to the constructor.

$schemas []= new Schema('myschema.php');
$schemas []= new Schema('myschema.json');

$schemas []= new Schema($schema->toArray());
$schemas []= new Schema($schema->toJSON());
$schemas []= new Schema($schema->toJSON(JSON_PRETTY_PRINT));

foreach($schemas as $s){
	var_dump($schema == $s);
}

Override the internal Schematizer by passing a Schema to the Datum constructor.

$schema = new Schema('myschema.json');
$xml = file_get_contents("sample.xml");
$xml = new XML($xml, $schema);

You can also pass the Schema override through Source::fetch

$credentials = [
	'url' => 'https://api.example.com',
	'username' => 'user',
	'password' => 'P@55w0rd',
];
$source = new REST($credentials);
$xml = $source->fetch('endpoint', XML::class, new Schema('myschema.json'));

Development

The development dependencies can be installed by running composer with or without the --dev option (enabled by default).

composer install --dev

Testing

Tests are written for PHPUnit which is included as a composer dev-dependency. To run the whole test suite, then execute this command from the shell console:

php vendor/bin/phpunit

You might also be interested in an easy to read checklist output:

php vendor/bin/phpunit --testdox

When stepping through breakpoints in an IDE, like Netbeans, it's helpful to see the current test name output by setting the run configuration to debug:

php vendor/bin/phpunit --debug

A code coverage report is available if you have the xdebug extension installed. In addition to the console text summary report, a full HTML GUI is generated to explore in the coverage folder. The easiest way to view this is to boot up a dev server:

php -S localhost:8080 -t coverage/

Database Testing

You must create the file (or symbolic link) tests/data/pdos/pdo.php in order to run the database tests. This should simply return a PDO instance, for example:

<?php
return (function(){
	$hostname = 'sql.example.com';
	$database = 'mydb';
	$username = 'user';
	$password = 'P@55w0rd';
	// https://www.microsoft.com/en-us/download/details.aspx?id=50419
	$driver   = 'ODBC Driver 13 for SQL Server';

	$pdo = new PDO("odbc:Driver=$driver;
		Server=$hostname;
		Database=$database",
		$username,
		$password
	);
	$pdo->setAttribute( PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION );

	return $pdo;
})();

etl's People

Contributors

jpuck avatar

Watchers

 avatar

etl's Issues

group insert values

A performance boost could be gained by grouping insert values when possible into a single SQL statement to execute.

refactor DB, DDL class architecture

There's something aloof about the relationship between these classes. Despite the perils of multi-level inheritance, it still might be best to merge DDL with DB combined as an abstract class extending the abstract class Source. Then concrete classes such as MicrosoftSQLServer make more sense.

Table name delimiter

When child tables are created in the DDL class, the names are glued together without delimiters to minimize length, but it would be nice to have the option for more readable names.

flattener

need a manual way to denormalize relationships up or down the chain

DB, DDL table prefix

Need to be able to apply a prefix to table names for cohabitation of same structures. This is a precursor for a multi-document replace operation whereby temp tables are used as a staging area for partial data transfers - in the event of a runtime interruption we don't want upsets to the existing production data.

detect database driver

there needs to be an intelligent way to check what type of database management system is connected for handling variations in returning last inserted id, quote types, and other assorted differences.

MySQL datatyper

This should also dictate the method for last-inserted-ID in the DB class

Schema Merger needs to set minimums

Currently if one Schema doesn't have a minimum, and the other one does, then the Merger will discard all minimums.

If both Schemata lack a minimum because min == max respectively, then a new min must be set to the value of the smaller max.

JSON Datum

It would be useful to be able to parse and schematize JSON data in the same fashion as the XML class.

Use natural keys for DDL relationships

Currently the Schematizer identifies unique values as candidate keys, but the application of this information into the DDL class is yet unimplemented. It would be nice to have the option to use these instead of the default surrogate jpetl_id keys.

This would be the first step in creating a truly integrated entity relationship between documents.

surrogate key generator

Because there's no transactional race conditions with distributed systems in this ETL, then instead of relying on the db to generate the IDs, just generate them server-side prior to insert. This would make batch insertions #11 reasonable.

JSON streamer for chunks

This would be more of an actual streamer than the current one, which is really more of an aggregator.

This would have the unique values disabled #15 because there's no efficient way to keep track of all those disparate values when merging discrete Schemas.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.