azure / azure-datafactory

License: Other

C# 53.95% PowerShell 23.33% PigLatin 0.35% R 0.58% Java 2.64% TSQL 1.76% HiveQL 2.00% Jupyter Notebook 6.61% HTML 1.99% Bicep 1.73% Python 1.10% Batchfile 3.96%

azure-datafactory's Introduction

Microsoft Azure Data Factory Samples

This folder contains samples for Azure Data Factory.

For more information about Azure Data Factory, see http://go.microsoft.com/fwlink/?LinkId=513883

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Data Integration in a box

Quick-start an end-to-end data engineering pipeline in just a few clicks!

Learn more about data integration in a box.

azure-datafactory's People

Contributors

akshay-dlpx, algattik, azuredatafactoryv2, daosul, dayhopezhang, djpmsft, fhljys, gauravmalhot, harishkragarwal, hirenshahms, kishlay-delphix, kromerm, linda33wj, mablumen, micro-yangzai, n0elleli, nabhishek, prasann, samantha-yu, sanjeev-k, shaween18, shengcmsft, soma-ms, syncret, udaykumar54, weehyongtok, wsugarman, xiaoftao, zhiliangliu, zhimadaren


azure-datafactory's Issues

How to increment a parameter value inside Until activity in Azure ADF V2

I am looping through a couple of Copy activities inside an Until activity in Azure ADF V2. As part of this, I am passing "top" as an input parameter to the dataset "DS_HTTP_School1" for the "COPY_HTTP_School" activity. I want to increment the "top" value by 100 on each pass through the loop and pass the updated value to my API for pagination purposes.

How can I increment this value for each iteration? (A sketch follows the pipeline definition below.)

{
"name": "PL_DATA_DS_HTTP_TO_BLOB_School2",
"properties": {
"description": "PL_DATA_DS_HTTP_TO_BLOB_School1",
"activities": [
{
"name": "Until1",
"type": "Until",
"typeProperties": {
"expression": {
"value": "@activity('COPY_HTTP_School').output",
"type": "Expression"
},
"activities": [
{
"name": "COPY_HTTP_School",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 3,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "HttpSource",
"httpRequestTimeout": "00:01:40"
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"dataIntegrationUnits": 0
},
"inputs": [
{
"referenceName": "DS_HTTP_School1",
"type": "DatasetReference",
"parameters": {
"top": 100
}
}
],
"outputs": [
{
"referenceName": "DS_BLOB_Output_School",
"type": "DatasetReference"
}
]
},
{
"name": "COPY_School",
"type": "Copy",
"dependsOn": [
{
"activity": "COPY_HTTP_School",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 3,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": false
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000
},
"enableStaging": false,
"dataIntegrationUnits": 0,
"translator": {
"type": "TabularTranslator",
"columnMappings": {
"schoolid": "SchoolId",
"schoolname": "SchoolName",
"schoolnumber": "SchoolNumber",
"region": "RefRegionId",
"schooltype": "SchoolTypeId",
"authoritytypedesc": "AuthorityTypeDesc",
"email": "Email",
"agreedmaxresourcingroll": "AgreedMaxResourcingRoll",
"schoolstatusdate": "SchoolStatusDate"
}
}
},
"inputs": [
{
"referenceName": "DS_BLOB_School",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "DS_ASQL_School",
"type": "DatasetReference"
}
]
}
],
"timeout": "7.00:00:00"
}
}
]
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
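One approach that should work (a minimal, untested sketch; the variable names are hypothetical) is to hold the value in a pipeline variable, reference that variable in the dataset parameter, and bump it with Set Variable activities at the end of each Until iteration. Because a Set Variable activity cannot reference the variable it is setting, a second temporary variable is needed:

"variables": {
    "topValue": { "type": "String", "defaultValue": "100" },
    "topValueTemp": { "type": "String" }
},
...
"parameters": {
    "top": { "value": "@variables('topValue')", "type": "Expression" }
}
...
{
    "name": "IncrementTopTemp",
    "type": "SetVariable",
    "dependsOn": [ { "activity": "COPY_School", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
        "variableName": "topValueTemp",
        "value": { "value": "@string(add(int(variables('topValue')), 100))", "type": "Expression" }
    }
},
{
    "name": "IncrementTop",
    "type": "SetVariable",
    "dependsOn": [ { "activity": "IncrementTopTemp", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
        "variableName": "topValue",
        "value": { "value": "@variables('topValueTemp')", "type": "Expression" }
    }
}

The Until expression would also need a real termination condition (for example, comparing the copied row count or the variable against an upper bound), since @activity('COPY_HTTP_School').output is not a boolean.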

Build failure

Perhaps this is a newbie question. When I copy/paste into Visual Studio Express 2015, restore the NuGet packages, and build, I get (among others): "The name 'datasets' does not exist in the current context" on line 74, which makes perfect sense since datasets is not visible inside the function "DeleteBlobFileFolder". Am I missing something? Does this actually compile?

Azure Data Factory generates UTF-8 blob with a BOM

When using Data Factory V2 with an output dataset that is JSON on an Azure Storage Blob V2 with a blank encodingName, the blob is encoded as UTF-8 with a BOM at the beginning, which is not conventional for UTF-8 and is not consistent with the output of other Azure services.
For instance, when output-binding an Azure Function to a blob and not specifying the encoding, it generates a UTF-8 blob without a BOM.

ADF pipeline parameters default values

Hi,
I'm not sure whether my understanding is incorrect or this is a bug. It's about passing (parameter) values to an ADF pipeline and collecting them. Here is my use case:
I have a pipeline (say P2) with an activity. The pipeline is triggered from another pipeline (say P1), which passes a value to this pipeline that is extracted using @pipeline().parameters.variablename. An example is as follows:

Scenario-1
Define the parameter with a default value (click on the P2 pipeline and select "New" under the "Parameters" tab):

Name Type Default Value
paramValue String @pipeline().parameters.inputValue

This results in an error when trying to publish the pipeline. [Error code: BadRequest
Inner error code: BadRequest
Message: Missing parameter definition for inputValue]

Scenario-2
Define the default value of the parameter with the same name as the collecting variable (click on the P2 pipeline and select "New" under the "Parameters" tab):

Name Type Default Value
paramValue String @pipeline().parameters.paramValue

This works.
Could somebody shed light on how the collecting variable name relates to the parameter value passed to the pipeline? Logically it doesn't make sense to me.
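For reference, a minimal sketch of the setup this seems to describe (names are hypothetical): P2 declares a plain parameter, and P1 passes the value into it with an Execute Pipeline activity. Expressions such as @pipeline().parameters.inputValue are only valid inside the pipeline that actually defines inputValue, which is why Scenario 1 fails to publish.

In P2:

"parameters": {
    "paramValue": { "type": "String", "defaultValue": "" }
}

In P1:

{
    "name": "RunP2",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": { "referenceName": "P2", "type": "PipelineReference" },
        "parameters": {
            "paramValue": { "value": "@pipeline().parameters.inputValue", "type": "Expression" }
        },
        "waitOnCompletion": true
    }
}

Inside P2's activities the value is then read as @pipeline().parameters.paramValue; the default value is just a fallback and does not need to repeat the expression.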

Cannot update a value in a table using UPDATE QUERY

I am using a Copy activity and want to update a value after the complete process is over. In order to update the value, I am using an UPDATE query as the source query. It actually updates the value in the table, but the activity fails with the following error:
"errorCode": "2200",
"message": "Failure happened on 'Source' side. ErrorCode=UserErrorInvalidDbQueryString,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException, Message=The specified SQL Query is not valid. It could be caused by that the query doesn't return any data. Invalid query: 'UPDATE tablename SET column1 = getdate() WHERE column2 = ' ' and column3= ' '.Source=Microsoft.DataTransfer.ClientLibrary,'",
"failureType": "UserError",
"target": "Copy Data 1"
Is it possible to run an UPDATE query that records a success status as part of a Copy activity?
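The Copy activity source query has to return rows, so a bare UPDATE is rejected. A common pattern (a sketch only; the linked service, procedure, and parameter names below are hypothetical) is to run the update in a separate Stored Procedure activity that depends on the copy succeeding:

{
    "name": "MarkCopySucceeded",
    "type": "SqlServerStoredProcedure",
    "dependsOn": [
        { "activity": "Copy Data 1", "dependencyConditions": [ "Succeeded" ] }
    ],
    "linkedServiceName": { "referenceName": "LS_AzureSql", "type": "LinkedServiceReference" },
    "typeProperties": {
        "storedProcedureName": "dbo.usp_MarkProcessed",
        "storedProcedureParameters": {
            "Status": { "value": "Succeeded", "type": "String" }
        }
    }
}

Alternatively, some people append a dummy SELECT after the UPDATE in the source query so that it returns a row, but the dedicated activity keeps the intent clearer.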

ADF v2 - Copy From REST API into a BLOB

I've been trying to get the REST API source connector working for the better part of a week now, but I keep getting stuck on how to drop the REST body response off as a JSON file into a blob as a sink. I've worked through passing the bearer token and getting a response back successfully, but then I get stuck trying to extract the response body JSON from the return call. There are so many options for how to set this up and really no concrete examples of how to pull out and format the JSON correctly so that the blob accepts it.

Here is the error I continue to get and need help working through:

age": "Failure happened on 'Source' side. ErrorCode=UserErrorInvalidValueInPayload,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Failed to convert the value in 'requestBody' property to 'System.String' type. Please make sure the payload structure and value are correct.,Source=Microsoft.DataTransfer.DataContracts,''Type=System.InvalidCastException,Message=Object must implement IConvertible.,Source=mscorlib,'",
"failureType": "UserError",
"target": "Copy Data1"
}

From what I can find, this is caused by the blob sink settings, and I think it's because it tries to convert the whole response into JSON rather than just the body where the targeted data resides. Any insights or help would be appreciated!
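The IConvertible cast error usually indicates that requestBody was supplied as a JSON object rather than a string; the connector expects the body as one escaped string. A hedged sketch of a Copy source using the REST connector (the body is made up, and exact property support depends on the connector version):

"source": {
    "type": "RestSource",
    "requestMethod": "POST",
    "requestBody": "{\"pageSize\": 100, \"page\": 1}"
},
"sink": {
    "type": "BlobSink"
}

With a JSON-format blob dataset on the sink side, the response should land as a JSON file rather than being mapped to columns, though I have not verified every combination.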

Copy data activity loses fractional seconds

We have a simple pipeline for incrementally copying data from one database to another. We use the approach described in the documentation: https://docs.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-multiple-tables-portal

Basically, we have a watermark table; we pull the old and new watermarks to find the delta. Then we call Copy Data, which calls a stored procedure in Azure SQL Server with the watermarks as parameters. Then we update the watermark table using a Stored Procedure activity.

Our watermarks have the DateTimeOffset type. The Copy Data activity loses fractional seconds, meaning:
Value like 2018-12-17 14:02:47.1696724 +00:00
becomes like this 2018-12-17 14:02:47.0000000 +00:00

This causes an issue where we miss some data if it was added in the same second but a bit later.
At the same time StoredProcedure activity works with DateTimeOffset properly without any loss.

It doesn't sound like a huge issue, but it was tricky to find this misbehavior. Actually, when our users reported this issue it sounded like fiction :)

We have two workarounds to avoid this issue:

  1. increment the watermark value - basically round it up (ceiling)
  2. instead of taking the last watermark value from the target database table, use the current datetimeoffset value as the watermark

But I believe this should either be fixed in the Copy Data activity or at least be mentioned in the documentation.

Cannot Execute Pyspark Scripts

To execute PySpark scripts using the spark-submit method, we need to pass --py-files instead of --jarfile, and there is no need for a main class when using PySpark scripts. However, I cannot modify the code to get it working with PySpark scripts.

Issue when logging in to Data Lake Store in a custom activity

Error in Activity: Unknown error in module: System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.Runtime.Serialization.SerializationException: Type 'Microsoft.Rest.Azure.CloudException' in Assembly 'Microsoft.Rest.ClientRuntime.Azure, Version=3.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35' is not marked as serializable. Server stack trace: at System.Runtime.Serialization.Formatters.Binary.WriteObjectInfo.InitSerialize(Object obj, ISurrogateSelector surrogateSelector, StreamingContext context, SerObjectInfoInit serObjectInfoInit, IFormatterConverter converter, ObjectWriter objectWriter, SerializationBinder binder) at System.Runtime.Serialization.Formatters.Binary.WriteObjectInfo.Serialize(Object obj, ISurrogateSelector surrogateSelector, StreamingContext context, SerObjectInfoInit serObjectInfoInit, IFormatterConverter converter, ObjectWriter objectWriter, SerializationBinder binder) at System.Runtime.Serialization.Formatters.Binary.ObjectWriter.Serialize(Object graph, Header[] inHeaders, __BinaryWriter serWriter, Boolean fCheck) at System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Serialize(Stream serializationStream, Object graph, Header[] headers, Boolean fCheck) at System.Runtime.Remoting.Channels.CrossAppDomainSerializer.SerializeMessageParts(ArrayList argsToSerialize) at System.Runtime.Remoting.Messaging.SmuggledMethodReturnMessage..ctor(IMethodReturnMessage mrm) at System.Runtime.Remoting.Messaging.SmuggledMethodReturnMessage.SmuggleIfPossible(IMessage msg) at System.Runtime.Remoting.Channels.CrossAppDomainSink.DoDispatch(Byte[] reqStmBuff, SmuggledMethodCallMessage smuggledMcm, SmuggledMethodReturnMessage& smuggledMrm) at System.Runtime.Remoting.Channels.CrossAppDomainSink.DoTransitionDispatchCallback(Object[] args) Exception rethrown at [0]: at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg) at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type) at Remed.Inflex.CA.ICrossAppDomainDotNetActivity1.Execute(TExecutionContext context, IActivityLogger logger) at Microsoft.Azure.Management.DataFactories.Runtime.ActivityExecutor.Execute(Object job, String configuration, Action1 logAction) --- End of inner exception stack trace --- at System.RuntimeMethodHandle.InvokeMethod(Object target, Object[] arguments, Signature sig, Boolean constructor) at System.Reflection.RuntimeMethodInfo.UnsafeInvokeInternal(Object obj, Object[] parameters, Object[] arguments) at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture) at Microsoft.DataPipeline.Compute.HDInsightJobExecution.ReflectingActivityWrapper.Execute() at Microsoft.DataPipeline.Compute.HDInsightJobExecution.JobWrapper.RunJob() at Microsoft.DataPipeline.Compute.HDInsightJobExecution.Launcher.Main(String[] args)..

DeleteFromBlobActivity.cs Contains Build Errors

In public void DeleteBlobFileFolder

The name 'datasets' does not exist in the current context.
The name 'logger' does not exist in the current context.
The name 'linkedServices' does not exist in the current context.

Invalid scope of variables

I have a pipeline with a variable foo, and I get the error

The output of variable 'foo' can't be referenced since it is not a variable of the current pipeline.

when I try to use it in an activity inside an Until activity.

MySQL Date Conversion Issue

This seems to be an issue with the integration runtime, so if this isn't the right place to post it, let me know.

Background:
I have a MySQL database that I'm pulling data from and copying over to an Azure SQL database.
The MySQL database has columns with the Date (or Timestamp) data type, and the NO_ZERO_DATE mode is disabled, so some of the dates are 0000-00-00. (I have no control over the MySQL data source structure.)

Issue:
When copying from tables with the Date data type, I get the following error:
{ "errorCode": "2200", "message": "Failure happened on 'Source' side. 'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Odbc Operation Failed.,Source=Microsoft.DataTransfer.ClientLibrary.Odbc.OdbcConnector,''Type=System.Data.Odbc.OdbcException,Message=ERROR [22018] [Microsoft][Support] (40550) Invalid character value for cast specification.,Source=MySQLODBC_sb64.dll,''Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Odbc Operation Failed.,Source=Microsoft.DataTransfer.ClientLibrary.Odbc.OdbcConnector,''Type=System.Data.Odbc.OdbcException,Message=ERROR [22018] [Microsoft][Support] (40550) Invalid character value for cast specification.,Source=MySQLODBC_sb64.dll,'", "failureType": "UserError", "target": "Copy_ogj" }

Things I've found when testing:
If the first few records have 0000-00-00, then the 0000-00-00 values get read in as NULL (which I prefer) and the data loads fine.
On tables where the records with 0000-00-00 start at (for example) line 8000, the above ODBC error gets thrown.
If I set the source ADF dataset data type to string, it looks like ODBC still reads the data as datetime and throws the same error.
I tried an ORDER BY in my SELECT statement, but that only worked for tables with just a single Date column.

My current solution is to modify my SELECT statement to CAST the Date columns as strings; the data then reads in just fine.
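For illustration, the cast can live directly in the copy source query so the ODBC driver only ever sees strings (the table and column names here are hypothetical, and the source type name varies by connector version, e.g. RelationalSource in older pipelines versus MySqlSource in newer ones):

"source": {
    "type": "RelationalSource",
    "query": "SELECT id, CAST(created_date AS CHAR(10)) AS created_date, CAST(updated_ts AS CHAR(19)) AS updated_ts FROM my_table"
}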

My Questions:
Is there a way to make the integration runtime either force the 0000-00-00 values to become NULL, or read the MySQL data using the data types defined on the ADF dataset (so I don't have to cast every Date column in my SELECT statement)?

Data Factory UI not showing using free subscription

Hello, I'm having trouble launching the Data Factory UI from Azure Portal.

I'm using a free subscription.

I can create the resource, but when I press "Author & Monitor" the UI doesn't show.

If I open the developer console, the following error is shown:

ERROR Error: Uncaught (in promise): AADSTS70002: Error validating credentials. AADSTS16000: Either multiple user identities are available for the current request or selected account is not supported for the scenario. Trace ID: 4acd0a65-c1da-414c-b334-b05170022200 Correlation ID: 4349ee60-82dd-4c74-9950-0caffe656237 Timestamp: 2019-01-24 13:21:48Z

According to this link, the free subscription includes free low-frequency activities with Azure Data Factory.

Thanks.

Azure Data Factory Data Lake Analytics linked service provisioning failed

I am creating a pipeline using Azure Data Factory. The pipeline has one Data Lake Analytics U-SQL activity. This used to work fine but stopped working, stating that I need to refresh credential tokens.

When I authorize again and deploy, I get an error saying "Provisioning failed. Internal server error. Request id: ....".

{
"name": "DataLakeAnalyticsLinkedService",
"properties": {
"description": "",
"hubName": "datafactory_hub",
"type": "AzureDataLakeAnalytics",
"typeProperties": {
"accountName": "datalakeanlytics",
"authorization": "_",
"sessionId": "_
",
"subscriptionId": "......",
"resourceGroupName": "DataLake"
}
}
}

I have tried restarting the browser session and creating a new service; it still fails.

Is there anything else I am missing?

Copy data from Blob fields in SQL server to Azure Blob Storage

I have a table in SQL Server with a structure similar to this:

CREATE TABLE dbo.Documents (
   Id INT IDENTITY
  ,Name VARCHAR(50) NOT NULL
  ,Data IMAGE NOT NULL
  ,CONSTRAINT PK_Documents_Id PRIMARY KEY CLUSTERED (Id)
)

I need to extract the data from this table and load it into Azure Blob Storage (as part of a migration from on-premises data to the cloud), preserving the following structure:

  • import-results
    • document-id
      • document-name

where

  • document-id is the id of a document (the Id field) from the Documents table
  • document-name is the name of a document (the Name field) from the Documents table
  • the content of the blob is the Data field from the Documents table

Is there a way to achieve this using Data Factory v2?

Linked Service (Azure Function) values hard coded in adf_publish ARM template

I have an ADF instance that is connected to an Azure DevOps Repos git repository and is used as part of a DevOps release pipeline.

As a part of our ADF CI/CD pipeline we overwrite the generated ARMTemplateForFactory.json file that is produced when we click the Publish button in the ADF UI - which commits the ARM templates to adf_publish.

I have created an Azure Function linked service inside ADF, which is set up with a URL and a function key. The problem is that when the ADF instance is published, the Azure Function linked service inside the generated ARM templates does not get a parameter for the functionAppUrl, only for the functionKey. This makes it really hard for us to integrate ADF into our Azure DevOps CI/CD pipeline.

This is not the case with other linked services such as Oracle/SQL Server/Data Lake. For those linked services, all the parameters needed to change the environment are in ARMTemplateForFactory.json.

Example of the ARM linked service resource created in the adf_publish branch.

{
	"name": "[concat(parameters('factoryName'), '/AzureFunction1')]",
	"type": "Microsoft.DataFactory/factories/linkedServices",
	"apiVersion": "2018-06-01",
	"properties": {
		"annotations": [],
		"type": "AzureFunction",
		"typeProperties": {
			"functionAppUrl": "https://somerandom.azurewebsites.net",
			"functionKey": {
				"type": "SecureString",
				"value": "[parameters('AzureFunction1_functionKey')]"
			}
		}
	},
	"dependsOn": []
}

What it should create

{
	"name": "[concat(parameters('factoryName'), '/AzureFunction1')]",
	"type": "Microsoft.DataFactory/factories/linkedServices",
	"apiVersion": "2018-06-01",
	"properties": {
		"annotations": [],
		"type": "AzureFunction",
		"typeProperties": {
			"functionAppUrl": "[parameters('AzureFunction1_functionAppUrl)]", <---
			"functionKey": {
				"type": "SecureString",
				"value": "[parameters('AzureFunction1_functionKey')]"
			}
		}
	},
	"dependsOn": []
}
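One possible workaround, assuming your factory supports custom parameterization templates: add an arm-template-parameters-definition.json file to the root of the collaboration branch and explicitly ask for functionAppUrl to be parameterized. A sketch (the "=" means "parameterize and keep the current value as the default"; "*" applies to all linked service types that have this property):

{
    "Microsoft.DataFactory/factories/linkedServices": {
        "*": {
            "properties": {
                "typeProperties": {
                    "functionAppUrl": "="
                }
            }
        }
    }
}

This only changes how the adf_publish templates are generated; it does not fix the default template itself.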

Dataset Parameters / Breaking change with pipeline parameters

Hi
It seems the change around dataset parameters is now mandatory.

You can no longer publish any change unless ALL your datasets stop using @pipeline().parameters, or you'll get the validation error "Pipeline parameters can only be used within their defining pipeline. Define dataset parameters to accept the pipeline parameter values instead".

Is this intended, or is it a bug that we can expect to be addressed soon?

Sample for custom activity in Batch Service does not show usage

Issue: Unrealistic sample for Batch Service

Comment: A sample showing how to process data in the Batch Service in cooperation with Data Factory should show actual data processing. There is no loading of data and no use of parallelism, and the sample is, all in all, not very realistic or useful as it stands.

Inconsistent ARM template schema for Triggers in ADFv2

I hereby request that the ARM template schema be aligned across the different trigger types.
For example, the ScheduleTrigger pipeline reference is:
"pipelines": [
{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "PipelineName"
},
"parameters": {}
}
]

For a tumbling window trigger, the reference is:
"pipeline": {
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "PipelineName"
},
"parameters": {}
}

Thank you.

keep original filenames

I want to copy files from an on-premises file share to Azure Blob Storage or Data Lake Store.
Since I want to copy all files from a directory, I do not specify a filename.
Is there a way to keep the original filenames and extensions? All the files currently end up on WASB or ADLS with a filename something like Data.9fdb19e7-1f8b-41a5-b77b-7086b0fe1151

Thanks!
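A sketch of what should preserve the names, assuming a plain folder-to-folder binary copy: leave fileName empty in both datasets and set copyBehavior on the sink so the original file names (and folder structure) are kept:

"typeProperties": {
    "source": {
        "type": "FileSystemSource",
        "recursive": true
    },
    "sink": {
        "type": "BlobSink",
        "copyBehavior": "PreserveHierarchy"
    }
}

With FlattenHierarchy, or with no copyBehavior and no file name specified, the service autogenerates names like the Data.<guid> you are seeing.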

Excel to CSV on adf v2

I would like to convert data from Excel to a CSV file with a Custom activity in Azure Data Factory V2. I can build an EXE that converts the file, but I don't know how to reference the Excel file in blob storage, or the destination blob container path for the resulting CSV file.
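In ADF v2, the Custom activity can hand these paths to the EXE through extendedProperties; the Batch node receives them in the activity.json file that Data Factory drops into the task's working folder, and the EXE can parse that file at startup. A hedged sketch (every name and path below is hypothetical):

{
    "name": "ConvertExcelToCsv",
    "type": "Custom",
    "linkedServiceName": { "referenceName": "AzureBatchLinkedService", "type": "LinkedServiceReference" },
    "typeProperties": {
        "command": "ExcelToCsv.exe",
        "resourceLinkedService": { "referenceName": "StorageLinkedService", "type": "LinkedServiceReference" },
        "folderPath": "customactivity/ExcelToCsv",
        "extendedProperties": {
            "inputBlobPath": "input-container/reports/report.xlsx",
            "outputContainerPath": "output-container/reports/"
        }
    }
}

The EXE itself still needs to use the storage SDK to download the Excel blob and upload the CSV; ADF only passes the paths along.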

Support help with ADF pipeline stuck in provisioning

Hello @spelluru ... a question for you ...

I have an ADF pipeline (GuardRexPipelineRawDataAndProcessing) that seems to be stuck in provisioning. It's been stuck in that state for a few hours now and doesn't look like it's going to recover on its own. I can't access it in the portal, and when I try to delete it, I get this exception:

[exception screenshot]

I don't have an Azure support agreement, so nobody will talk to me via Azure tech support. Can you tell me the best way to get a team member to help me with this pipeline?

ADF v2 unable to debug pipeline

Hi,

I get the following error when debugging a newly created pipeline:

{"name":"TimeoutError","stack":"TimeoutError: Timeout has occurred\n at new t (https://adf.azure.com/main.0254f8f467e8b5ca2d8c.bundle.js:1:3286122)\n at https://adf.azure.com/main.0254f8f467e8b5ca2d8c.bundle.js:1:3209053\n at e.U6yM.qe.a.timeout (https://adf.azure.com/main.0254f8f467e8b5ca2d8c.bundle.js:1:3287253)\n at e. (https://adf.azure.com/4.11215f27c8fb6d770ab6.chunk.js:1:1401040)\n at https://adf.azure.com/4.11215f27c8fb6d770ab6.chunk.js:1:1366636\n at Object.next (https://adf.azure.com/4.11215f27c8fb6d770ab6.chunk.js:1:1366741)\n at https://adf.azure.com/4.11215f27c8fb6d770ab6.chunk.js:1:1365891\n at new t (https://adf.azure.com/polyfills.9c34bdd62ad81631e216.bundle.js:1:125185)\n at CnKh.kd (https://adf.azure.com/4.11215f27c8fb6d770ab6.chunk.js:1:1365668)\n at e._constructSandboxPipeline (https://adf.azure.com/4.11215f27c8fb6d770ab6.chunk.js:1:1400467)","message":"Timeout has occurred","__zone_symbol__currentTask":{"type":"microTask","state":"notScheduled","source":"Promise.then","zone":"angular","cancelFn":null,"runCount":0}}

It's a simple pipeline, and it works when I click Trigger Now.

But the Debug option does nothing, and then the error above appears.

Thanks

--app_arguments doesn't seem to work in spark job

When I submit a job that specifies app_arguments like below to a Spark job:

"--app_arguments",
"$$Text.Format('wasb://[email protected]/data/{0:yyyy}/{0:MM}/{0:dd}/{0:HH}/', SliceStart)",

I get the error below. I don't see this argument being handled in SparkJob.java. Am I missing something?

WARNING: Use "yarn jar" to launch YARN applications.
Jul 01, 2016 10:33:36 PM com.adf.spark.SparkJob run
SEVERE: Error while parsing arguments
org.apache.commons.cli.UnrecognizedOptionException: Unrecognized option: --app_arguments
at org.apache.commons.cli.Parser.processOption(Parser.java:363)
at org.apache.commons.cli.Parser.parse(Parser.java:199)
at org.apache.commons.cli.Parser.parse(Parser.java:85)
at com.adf.spark.SparkJob.run(SparkJob.java:89)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at com.adf.spark.SparkJob.main(SparkJob.java:51)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Exception in thread "main" java.lang.RuntimeException: Error while parsing arguments:Unrecognized option: --app_arguments
at com.adf.spark.SparkJob.run(SparkJob.java:98)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at com.adf.spark.SparkJob.main(SparkJob.java:51)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Reading from empty Gzip file in Azure Data Lake Store hangs pipeline

Using Data Factory V2: when a Copy activity reads from an empty GZIP file in Azure Data Lake Store to write to an Azure SQL sink, the pipeline hangs (will not move forward). Updating the dataset to Compression=None allows the file to be recognized as empty and read without issue.

Confirmed this behavior when using GZIP, BZIP, and Deflate compression options. ZipDeflate fails the pipeline with the message "is not a valid Zip file with Deflate compression method".

.Net custom activity Sample for pulling data from Google Analytics

I have recently developed an ADF .NET custom activity that pulls data from GA and pushes it into Blob storage (and eventually to SQL DW). Though ADF supports an OData connector, OAuth isn't supported yet, which is why I had to come up with a custom activity. Resolving OAuth was a bit tricky because OAuth requires the user's consent. I am happy to contribute this along with some sample code, but I'd like to make sure this would be an appropriate addition to the repository before I start. Please comment.

Unable to access key vault key

I stored a private RSA key inside Key Vault, but when trying to retrieve it from Data Factory, it says Data Factory is unable to retrieve the secret. Is Data Factory limited to retrieving only secrets, or am I doing something wrong?

Thank you,
Cristian

Web Activity Linked Services not Updating

When publishing changes to linked service settings on web activities in the Azure Data Factory V2 online editor, the changes do not get published with the pipeline. In some instances, the linked service shown in the UI, even after refreshing the page, is not the one in the underlying pipeline JSON file.

Pipeline stuck in "PendingUpdate" provisioning status on deployment

I have been having a problem for the last couple of hours: upon deployment of a pipeline, it goes into the provisioning state and then gets stuck there.

It then eventually fails with one of these two errors: "Failed to reach service" or "Internal Server Error."
The state of the pipeline is stuck at "PendingUpdate".

I have even created a Data Factory from scratch, created new linked services and data sets, and created the pipeline within that data factory, and the same thing is happening.

What can be causing this issue?

I should add that I had been using ADF and deploying pipelines and everything was going well; this was a sudden issue.

Cannot update packages of ADFCustomActivityRunner

Hi,
Packages in ADFCustomActivityRunner cannot be updated; the update fails with this error message:

Failed to add reference. The package 'Microsoft.Azure.Common' tried to add a framework reference to 'System.Threading' which was not found in the GAC. This is possibly a bug in the package. Please contact the package owners for assistance.

ADF Onprem SQL Server to ADW Table copy issue

I am getting an error while copying a given on-premises table to Azure SQL Data Warehouse. I have checked the data types and everything looks good, but I don't know what is breaking or what is going on, so I can't figure it out. Please have a look at the error below. Also, I am not using PolyBase here: the on-premises SQL Server has text data types, so for those I am using the varchar(max) approach, as PolyBase does not work with max data types.

Error:-

Copy activity encountered a user error at Sink:adatawarehouse.database.windows.net side: 'Type=System.OverflowException,Message=Array dimensions exceeded supported range.,Source=Microsoft.DataTransfer.Common,'.

How to pass data out from azure batch

string inputFolderPath = activity.typeProperties.extendedProperties.InputFolderPath;
I see that you can read the value from extendedProperties.

Is it possible to write or update the extended properties so that downstream activities can read from them?
Or even write to the activity output, so that it can be extracted using expressions like this:
"@{activity('InputOutputTest').output.outpufromBatch}"

Slice Execution

Hi,

Everything works fine in Data Factory except the following (or maybe I just don't understand how it works):
Data Factory checks the availability of all slices without waiting for the slice start hour.


Here is the input dataset:

{ "name": "DVDFromAzureBlobInput", "properties": { "structure": [ { "name": "DVD_Title", "type": "String" }, { "name": "Studio", "type": "String" }, { "name": "Released", "type": "String" }, { "name": "Status", "type": "String" }, { "name": "Sound", "type": "String" }, { "name": "Versions", "type": "String" }, { "name": "Price", "type": "String" }, { "name": "Rating", "type": "String" }, { "name": "Year", "type": "String" }, { "name": "Genre", "type": "String" }, { "name": "Aspect", "type": "String" }, { "name": "UPC", "type": "String" }, { "name": "DVD_ReleaseDate", "type": "String" }, { "name": "ID", "type": "String" }, { "name": "Timestamp", "type": "String" } ], "published": false, "type": "AzureBlob", "linkedServiceName": "Dicom-Azure-Storage", "typeProperties": { "folderPath": "adf/inputdata/{Hour}/", "format": { "type": "TextFormat", "rowDelimiter": "\n", "columnDelimiter": ",", "nullValue": "N", "quoteChar": "\"" }, "partitionedBy": [ { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "%H" } } ] }, "availability": { "frequency": "Hour", "interval": 1 }, "external": true, "policy": {} } }

And here is my pipeline:

{ "name": "CsvToSQLAzurePipeline", "properties": { "description": "CSV to SQL Azure", "activities": [ { "type": "Copy", "typeProperties": { "source": { "type": "BlobSource", "treatEmptyAsNull": true, "skipHeaderLineCount": 1 }, "sink": { "type": "SqlSink", "writeBatchSize": 10000, "writeBatchTimeout": "00:00:00" }, "translator": { "type": "TabularTranslator", "columnMappings": "DVD_Title: TITLE" } }, "inputs": [ { "name": "DVDFromAzureBlobInput" } ], "outputs": [ { "name": "DVDToAzureSQLOutput" } ], "scheduler": { "frequency": "Hour", "interval": 1 }, "name": "CopyTEST", "description": "description" } ], "start": "2016-04-06T10:30:00Z", "end": "2016-04-07T12:00:00Z", "isPaused": false, "hubName": "dicomfactory_hub", "pipelineMode": "Scheduled" } }

So the first slice works fine because the data is there; we have an internal script that loads a file to blob storage every hour. But Data Factory checks whether the data is already there for every hour of the day, so execution doesn't work unless I rerun the validation manually after the script has uploaded the file to blob storage.

How do I tell Data Factory to validate only the slice for the current hour, instead of all hours at once?
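In ADF v1, the knobs I know of for this live on the external input dataset rather than the pipeline: an externalData policy that keeps re-validating the slice on a retry schedule instead of checking once, and optionally a dataDelay to push validation past the moment the upload script runs. A sketch with illustrative values:

"external": true,
"policy": {
    "externalData": {
        "dataDelay": "00:10:00",
        "retryInterval": "00:05:00",
        "retryTimeout": "01:00:00",
        "maximumRetry": 5
    }
}

This does not stop Data Factory from scheduling validation for every slice in the active period, but it keeps each slice retrying until the file for that hour actually arrives, instead of requiring a manual rerun.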

Failing SPN Authentication in AzureAnalysisServicesRefresh sample

Attempting to use SPN authentication in the AAS Refresh sample. The method "AzureAnalysisServicesProcessSample.ProcessAzureASActivity.GetTabularModel" is throwing NullReferenceExceptions. I debugged the code locally and I'm successfully retrieving a token from Azure AD, but it fails at the same position.

The stack trace log from ADF references the directory on my personal machine where the code is stored, not the expected directory on the Azure Batch cluster. Additionally, the activity works when using a regular username and password. This only happens when using SPN auth.
AASRefreshStackTrace.txt

Change Control

What is the typical development cycle for an Azure Data Factory pipeline, mainly from a change-control perspective? Suppose I create a pipeline and it runs for a while, but then it needs a change. How would I do a code review for the pipeline change?

Copy Data Activity for SQL azure errors on varbinary(max) field when staging is enabled

Error:
ErrorCode=FailedDbOperation,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=A database operation failed with the following error: &apos;Implicit conversion from data type nvarchar(max) to varbinary(max) is not allowed. Use the CONVERT function to run this query.&apos;,Source=,''Type=System.Data.SqlClient.SqlException,Message=Implicit conversion from data type nvarchar(max) to varbinary(max) is not allowed. Use the CONVERT function to run this query.,Source=.Net SqlClient Data Provider,SqlErrorNumber=257,Class=16,ErrorCode=-2146232060,State=3,Errors=[{Class=16,Number=257,State=3,Message=Implicit conversion from data type nvarchar(max) to varbinary(max) is not allowed. Use the CONVERT function to run this query.,},],'

I am using the Copy Data activity without explicitly specifying a schema.
This seems to work for most tables and columns, but it does not appear to work for varbinary fields.
All the tables with varbinary columns are failing for me.
I've checked the schema on both source and destination, and both appear to be the same: data types, sizes, and order of fields.

I have been able to isolate which step along the way appears to be failing.
I've set up a test where all steps are dynamic up until the sink's data service connection.
When I create a dataset destination with a service connection that has no schema-specific table, it fails.
However, once I manually import the schema, the entire pipeline succeeds and I get data in my table.
When I use a variable table name, the schema can't be imported for it, so that also fails with the same error.

If I disable blob staging, my test pipeline works for the test case that previously failed.
So the problem seems to be triggered by blob staging, though it can be worked around by manually specifying the schema and creating separate datasets for each target table.

Let me know if you need more information or specific pipeline definitions.

Documentation about dynamic folder paths/wild cards

Is there a feature for dynamic folder paths using wildcards, custom partitions, or some other mechanism?

For example "myContainer/myFiles/{Year}/{Month}/{Day}/{CustomerId}/customerdata.json" or
"myContainer/myFiles/{Year}/{Month}/{Day}/*/customerdata.json" while preserving filename/partitioning key values.

[DOCUMENTATION] Please add a README

It would be super helpful if a README could be added to the samples directory that provides an overview of the different samples so that folks do not need to click into each sample.

This would also mean that each sample could have a consistent short overview as the individual READMEs vary substantially in content and format.

ADFv2 stored procedure activity failure

I am currently using ADF v2. In a pipeline I have a series of activities, and in one of them we call a stored procedure. If the stored procedure returns an error, I basically don't want to proceed with the next activity in the pipeline. Currently, even though the stored procedure fails with an error, the activity shows success. Please let me know if I am missing anything, or how to fail the activity itself.

DatafactoryManagementClient Method not found

Hi,

As soon as I add the code snippet below into my code, I get the following error:

DataFactoryManagementClient client = new DataFactoryManagementClient(credentials) { SubscriptionId = subscriptionId };

Method not found: 'Void Microsoft.Azure.Management.DataFactory.DataFactoryManagementClient..ctor(Microsoft.Rest.ServiceClientCredentials, System.Net.Http.DelegatingHandler[])'.

The full code snippet for the method is below:

string tenantId = "my-tenant-id";
string clientId = "my-client-id";
string secretKey = "my-secret-key";
string subscriptionId = "my-subscription-id";

var context = new AuthenticationContext("https://login.windows.net/" + tenantId);
ClientCredential clientCredential = new ClientCredential(clientId, secretKey);
var tokenResponse = await context.AcquireTokenAsync("https://management.azure.com/", clientCredential);
var accessToken = tokenResponse.AccessToken;

TokenCredentials credentials = new TokenCredentials(accessToken);
DataFactoryManagementClient client = new DataFactoryManagementClient(credentials) { SubscriptionId = subscriptionId };

Is this a bug? Or am I doing something wrong?

Thanks
