
azure-data-lake-store-net's Issues

MockAdlsClient does not override async methods

Hi,

Unless I am missing something obvious, it appears that MockAdlsClient only overrides and implements the synchronous methods, not their async variants. Is this by design, or is it something that will be addressed in the future?
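
A minimal repro sketch of what I mean (assuming the async variant falls through to the base AdlsClient implementation, which then attempts a real network call; GetMockClient, CreateFile, and CreateFileAsync are existing SDK members):

var mock = MockAdlsClient.GetMockClient();
mock.CreateFile("/a.txt", IfExists.Overwrite);              // overridden: works against the in-memory mock
await mock.CreateFileAsync("/b.txt", IfExists.Overwrite);   // not overridden (assumption): hits the real code path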

Can't null check IEnumerable<DirectoryEntry> AdlsClient.EnumerateDirectory

When enumerating directories, if the file path is wrong I am not able to check whether the IEnumerable is empty or null. The code below crashes at the if (files.Any()) line. I believe it's because the internal class FileStatusOutput extends IEnumerator and is returning a null enumerator, but I can't check for that because of access restrictions. You can easily replicate this by running the code below and passing in an invalid file path. I could simply check whether the file path exists first, but that seems unnecessary.

IEnumerable<DirectoryEntry> files = dataLakeClient.EnumerateDirectory(filePath);

if (files != null)
{
    if (files.Any())
    {
        return files.ToList();
    }
}

return null;

Here is the error log relating to the variable 'files':

System.NullReferenceException: 'Object reference not set to an instance of an object.'
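
One possible workaround sketch, using the existing AdlsClient.CheckExists method to avoid enumerating an invalid path (at the cost of an extra round trip):

if (dataLakeClient.CheckExists(filePath))
{
    return dataLakeClient.EnumerateDirectory(filePath).ToList();
}
return null;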

DownloadFolder doesn't download full directory tree

DownloadFolder doesn't always download all files; it fails silently after downloading a fraction of them. This is quite alarming, and I have no idea how to debug it or provide more information (a verbose flag would be quite useful).

What is the difference between azure.storage.files.datalake and Microsoft.Azure.DataLake.Store

For newer projects, can we use this package (Microsoft.Azure.DataLake.Store) to connect to Azure Data Lake Storage? I am currently using the azure.storage.files.datalake NuGet package, but I have a scenario that requires downloading multiple files, and through internet searches I learned about the BulkDownload() method. Is it OK to use the Microsoft.Azure.DataLake.Store package? It is neither deprecated nor updated recently. Can I use both packages in the same project?

Moreover, the samples repo linked in the readme (https://github.com/Azure-Samples/data-lake-store-adls-dot-net-samples) has changed to the azure.storage.files.datalake library. Can we get some explicit differences between these libraries?

Inline documentation: Many functions do not include thread safety information.

This is a reference doc issue that requires a change to the underlying /// comments.

From @leftler (Azure/azure-docs-sdk-dotnet#364):

Many class definitions do not include thread safety information. For example, https://github.com/Azure/azure-docs-sdk-dotnet/blob/b39513f/xml/Microsoft.Azure.DataLake.Store/AdlsClient.xml does not mention anywhere whether the class's methods are safe to use from multiple threads (like WebClient is) or whether we need an instance per thread.

cc: @nitinme

CONCURRENTAPPEND failed with Unknown Error: Only one usage of each socket address (protocol/network address/port) is normally permitted

We are facing issues in our container app running on .NET Core. I don't know whether it is relevant, but I have set DefaultConnectionLimit to 200. This started happening a week or so ago, and it is not even intermittent. We try to write to different files using the same AdlsClient instance in parallel.

await Connection.ADLSClient
    .ConcurrentAppendAsync(filePath, true, textByteArray, 0, textByteArray.Length)
    .ConfigureAwait(false);

Not every call fails, but more than roughly 30% of calls are failing with the exception below. I even updated Microsoft.Azure.DataLake.Store to version 1.1.11.

Exception and stack:

Failed to push data to sink with exception Microsoft.Azure.DataLake.Store.AdlsException: 
Error in concurrent append for file **filename**.

Operation: CONCURRENTAPPEND failed with Unknown Error:
Only one usage of each socket address (protocol/network address/port) is normally permitted Only one usage of each socket address (protocol/network address/port) is normally permitted
Source: System.Net.Requests
StackTrace:
 at System.Net.HttpWebRequest.EndGetResponse(IAsyncResult asyncResult)

at System.Net.WebRequest.<>c.<GetResponseAsync>b__68_2(IAsyncResult iar)
at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---

at Microsoft.Azure.DataLake.Store.WebTransport.MakeSingleCallAsync(String opCode, String path, ByteBuffer requestData, ByteBuffer responseData, QueryParams qp, AdlsClient client, RequestOptions req, OperationResponse resp, CancellationToken cancelToken, IDictionary`2 customHeaders).
Last encountered exception thrown after 1 tries. [Only one usage of each socket address (protocol/network address/port) is normally permitted Only one usage of each socket address (protocol/network address/port) is normally permitted]
[ServerRequestId:]
 at Microsoft.Azure.DataLake.Store.AdlsClient.ConcurrentAppendAsync(String path, Boolean autoCreate, Byte[] dataBytes, Int32 offset, Int32 length, CancellationToken cancelToken)
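
For what it's worth, this Windows socket error (WSAEADDRINUSE) typically indicates ephemeral port exhaustion caused by opening many short-lived outbound connections. A hedged mitigation sketch, not a confirmed fix: lower the connection limit so requests queue onto a smaller pool of kept-alive connections instead of continually opening new sockets.

// requires: using System.Net;
ServicePointManager.DefaultConnectionLimit = 48;  // assumption: a smaller pool than 200 encourages connection reuse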

Support for gen2

Will support for Gen2 be included in this SDK?
Is there an SDK available for Gen2?

Thanks

Cancellation token is ignored in AdlsClient.EnumerateDirectory

The cancellation token cancelToken is ignored by the EnumerateDirectory implementation, although the underlying implementation, Core.ListStatusAsync, supports it.

I would propose passing the token down to the REST call wrapper.

internal IEnumerable<DirectoryEntry> EnumerateDirectory(string path, int maxEntries, string listAfter, string listBefore, UserGroupRepresentation userIdFormat = UserGroupRepresentation.ObjectID, CancellationToken cancelToken = default(CancellationToken))
{
    if (string.IsNullOrEmpty(path))
    {
        throw new ArgumentException("Path is null");
    }
    // Note: cancelToken is accepted above but never forwarded here.
    return new FileStatusOutput<DirectoryEntry>(listBefore, listAfter, maxEntries, userIdFormat, this, path);
}
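
Roughly the shape of the proposed fix, as a sketch (it assumes FileStatusOutput's constructor is extended to accept and forward the token, which it does not do today):

return new FileStatusOutput<DirectoryEntry>(listBefore, listAfter, maxEntries, userIdFormat, this, path, cancelToken);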

async directory content summary

Is there a way to call it asynchronously, without blocking the thread? I see async calls for open, create, read, etc., but no async call for listing contents or getting a content summary.
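
A stop-gap sketch while no async variant exists (GetContentSummary is the existing synchronous call; note Task.Run merely offloads the blocking work to a thread-pool thread, it does not make the underlying I/O asynchronous):

var summary = await Task.Run(() => client.GetContentSummary("/some/path"));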

Expected format for Token parameter in AdlsClient.CreateClient(string, string)

I am trying to connect to ADLS from an Azure Function using MSI.

I see that there is an overloaded method AdlsClient.CreateClient(string, string) where the second string is a token, and the documented format is
"token String Full authorization Token e.g. Bearing: abcddsfere....."
Is that a typo for Bearer: ?

Either way, I can't seem to find the right combination to make this work in a Function via MSI authentication:

var azureServiceTokenProvider = new AzureServiceTokenProvider();
var accessToken = await azureServiceTokenProvider.GetAccessTokenAsync("https://datalake.azure.net/");
var lakeClient = AdlsClient.CreateClient("<your-data-lake>.azuredatalakestore.net", $"Bearer: {accessToken}");
var access = lakeClient.CheckAccess("/path-to-check/", "--x");

Any pointers?
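
One thing worth trying, grounded only in the standard OAuth2 header format rather than this SDK's docs: the conventional Authorization value is "Bearer <token>" with a space and no colon, so both the "Bearing" and the colon in the documented example look like typos.

var lakeClient = AdlsClient.CreateClient("<your-data-lake>.azuredatalakestore.net", $"Bearer {accessToken}");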

How to get the creation time of a stream?

Hi, why is there no property in the DirectoryEntry class for the creation time of a stream? How can I get that value from this SDK? This is blocking me.

Unable to control retry logic on AdlsClient.EnumerateDirectory

There doesn't appear to be any way to control the retry behavior in AdlsClient.EnumerateDirectory (or in any of the other methods, from what I can tell).

Example:

AdlsClient client = AdlsClient.CreateClient("invalidAccount.azuredatalakestore.net", this.Credentials);
DirectoryEntry entry = client.EnumerateDirectory("/").First();

This takes 30 seconds to fail. It would be nice if you could specify the retry policy here, so that the call could fail immediately if you wanted.

how to use the update api to update a non-empty file in adls gen2?

I'm using the update API to try to update a file which is not empty, but I always get an error saying the position is incorrect.

Can you please give me an example of that? Thanks.

By the way, it works when using action=append, which returns 202; but the subsequent action=flush always returns a 400 error. I have tried many position values (the uploaded string's length, the length of the file already on Azure plus the uploaded string's length) with no luck.
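
For reference, a hedged sketch of how the position parameter is meant to be used, per the Gen2 Path Update semantics (the URL, auth setup, and method name here are illustrative assumptions; the key rule is that the flush position must equal the total file length after the append):

// requires: using System.Net.Http; using System.Text; using System.Threading.Tasks;
static async Task AppendToExistingFileAsync(HttpClient http, string fileUrl, string text)
{
    // 1) Read the current (flushed) length; new bytes are appended at this offset.
    var headResp = await http.SendAsync(new HttpRequestMessage(HttpMethod.Head, fileUrl));
    long position = headResp.Content.Headers.ContentLength ?? 0;

    // 2) action=append stages the bytes at that offset (the service returns 202).
    byte[] data = Encoding.UTF8.GetBytes(text);
    var append = new HttpRequestMessage(new HttpMethod("PATCH"), fileUrl + "?action=append&position=" + position)
    {
        Content = new ByteArrayContent(data)
    };
    (await http.SendAsync(append)).EnsureSuccessStatusCode();

    // 3) action=flush commits everything up to the new end of file; position must be
    //    old length + appended byte count, otherwise the service returns 400.
    var flush = new HttpRequestMessage(new HttpMethod("PATCH"), fileUrl + "?action=flush&position=" + (position + data.Length));
    (await http.SendAsync(flush)).EnsureSuccessStatusCode();
}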

File properties non-recursive and to memory (not file)

Why is there even a method that writes file properties to another file? Who needs them in another file?
I want the last access time and the size in the simplest way possible. How do I get them?

Something like GetFileProperties but async and returning a simple structure.
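
For a single path, a minimal alternative sketch (GetDirectoryEntryAsync is an existing SDK call and returns size and timestamps in memory, with no output file involved):

DirectoryEntry entry = await client.GetDirectoryEntryAsync("/some/path");
long size = entry.Length;
var lastAccess = entry.LastAccessTime;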

DirectoryEntry.Name returns "" instead of the entity name

I'm calling GetDirectoryEntryAsync() with a file path and with a directory path.
The returned DirectoryEntry.Name is always an empty string.
Note: DirectoryEntry.FullName does return the correct full path.

Based on the documentation, I would expect DirectoryEntry.Name to provide the directory or file name in the storage.

Here is sample code to reproduce:

DirectoryEntry directoryEntry = await client.GetDirectoryEntryAsync("/local/vivihung-testadlssdk/Test.csv");
// DirectoryEntry directoryEntry = await client.GetDirectoryEntryAsync("/local/vivihung-testadlssdk");
Console.WriteLine($"Name: {directoryEntry.Name}");
Console.WriteLine($"Full Name: {directoryEntry.FullName}");
Console.WriteLine($"Last Modified Time: {directoryEntry.LastModifiedTime}");
Console.WriteLine($"Size: {directoryEntry.Length}");

Result

Name:
Full Name: /local/vivihung-testadlssdk/Test.csv
Last Modified Time: 1/10/2018 9:47:41 PM
Size: 15725240

BulkUpload fails with error message "Offset and length were out of bounds for the array or count is greater than the number of elements from index to the end of the source collection."

The issue happens if the file is larger than 240 MB (so it needs to be uploaded as multiple chunks) and the line that crosses the 240 MB boundary is close to (but less than) 4 MB long (non-binary upload).

The issue is here

int totBytesRead = ReadDataIntoBuffer(readStream, buffer, bufferOffset, ReadForwardBuffSize);

Normally bufferOffset is 0 at the beginning; with a 4 MB BuffSize and an 8 KB ReadForwardBuffSize, bufferOffset stays a multiple of 8 KB (until the end of the stream), so bufferOffset + ReadForwardBuffSize is never larger than BuffSize.

However, in the given case the method is called from ReadForwardTillNewLine(readStream, readBytes, residualDataSize), and residualDataSize (which becomes bufferOffset) is not always a multiple of 8 KB, which means bufferOffset + ReadForwardBuffSize can end up larger than BuffSize.

Suggested fix:

int totBytesRead = ReadDataIntoBuffer(readStream, buffer, bufferOffset, Math.Min(ReadForwardBuffSize, BuffSize - bufferOffset));

AdlsClient.BulkUpload(...) does not report progress correctly

The AdlsClient.BulkUpload(...) method does not report progress correctly for single files. When a single file is being uploaded, the TransferStatus object never gets updated values for the ChunksTransfered, SizeTransfered, and TotalChunksToTransfer fields.

Progress works well when a folder is uploaded.

AdlsClient.CreateClient(FQDN, credentials) throws an exception if FQDN == "https://someName.dfs.core.windows.net/"

var adlsClient = AdlsClient.CreateClient("https://SOMENAME_FROM_PORTAL.dfs.core.windows.net", credentials);

where SOMENAME_FROM_PORTAL is taken from "Storage Account -> Properties -> Primary ADLS file system endpoint"

Exception:

System.ArgumentException
  HResult=0x80070057
  Message=Account name https://SOMENAME_FROM_PORTAL.dfs.core.windows.net is invalid. Specify the full account including the domain name.
  Source=Microsoft.Azure.DataLake.Store
  StackTrace:
   at Microsoft.Azure.DataLake.Store.AdlsClient..ctor(String accnt, Int64 clientId, Boolean skipAccntValidation)
   at Microsoft.Azure.DataLake.Store.AdlsClient.CreateClient(String accountFqdn, ServiceClientCredentials creds)
   at batchHasher.Program.<Main>d__0.MoveNext() in C:\Dev\batchHasher\Program.cs:line 53
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() in E:\A\_work\307\s\src\mscorlib\src\System\Runtime\ExceptionServices\ExceptionDispatchInfo.cs:line 132
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) in E:\A\_work\307\s\src\mscorlib\src\System\Runtime\CompilerServices\TaskAwaiter.cs:line 155
   at batchHasher.Program.<Main>(String[] args)
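
A likely workaround sketch, grounded in the account-name regex quoted in a later issue (which admits no scheme or trailing slash) but not confirmed against this endpoint type: pass the bare FQDN rather than the full URL.

var adlsClient = AdlsClient.CreateClient("SOMENAME_FROM_PORTAL.dfs.core.windows.net", credentials);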

Port Exhaustion when enumerating directories

We have a scenario where we need to enumerate all directories in the Data Lake Store and delete specific files. This results in a lot of HTTP requests to the DLS. During this operation, Application Insights reports the following dependency error:

"Only one usage of each socket address (protocol/network address/port) is normally permitted"

There is some sort of retry pattern built into the client, so eventually the operation succeeds, but it seems odd to run into port exhaustion.

Adls client freezing on "GetDirectoryEntry"

I have a long-lived AdlsClient that is alive the whole time the app is running. However, on a call to retrieve file properties it freezes and never returns. This part of the code only allows one thread at a time, no more. I do have multiple AdlsClients instantiated throughout the project, but each is used by only one thread at a time, and they are not used together.

(screenshot attachment: getfilepropertyerror)

socket exception using ConcurrentAppendAsync

I have a simple Azure Function on the change feed of a Cosmos DB which "exports" to a Data Lake. The DB has thousands of updates every second, so the Data Lake is structured so that multiple Cosmos "entities" go into one Data Lake file (partitioned by minute per entity type). This is done to follow the best practice of not creating too many small files in the lake.

The function generates thousands of errors every day:

Error in concurrent append for file edited-out/entity-type-edited-out/y=2018/m=8/d=6/h=16/m=13/data.json.
Operation: CONCURRENTAPPEND failed with Unknown Error: Only one usage of each socket address (protocol/network address/port) is normally permitted Only one usage of each socket address (protocol/network address/port) is normally permitted
Source: System.Net.Requests
StackTrace:
 at System.Net.HttpWebRequest.EndGetResponse(IAsyncResult asyncResult)
 at System.Net.WebRequest.<>c.<GetResponseAsync>b__68_2(IAsyncResult iar)
 at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
 at Microsoft.Azure.DataLake.Store.WebTransport.MakeSingleCallAsync(String opCode, String path, ByteBuffer requestData, ByteBuffer responseData, QueryParams qp, AdlsClient client, RequestOptions req, OperationResponse resp, CancellationToken cancelToken, IDictionary`2 customHeaders).
Last encountered exception thrown after 1 tries. [Only one usage of each socket address (protocol/network address/port) is normally permitted Only one usage of each socket address (protocol/network address/port) is normally permitted] [ServerRequestId:]

Recursive ACL application of Defaults should ignore Files

As discovered via the powershell wrapper: Azure/azure-powershell#6171 (comment)

It would be helpful if, when applying ACLs recursively with default ACL entries, files were ignored, as the current behavior causes problems when applying a new default to folders that already contain files.

I suspect around here: https://github.com/Azure/azure-powershell/blob/7fd657a2e53f7b4f1571acb75bdee2a376c3094d/src/ResourceManager/DataLakeStore/Commands.DataLakeStore/DataPlaneModels/DataLakeStoreItemAce.cs#L70

If the entry is a default, it doesn't get added to the changes, and that results in an "Acl Specification List is empty" error.

Directory watch and delete support

In our case we would like to watch a local directory and upload new files to Data Lake Storage.
After an upload, the file can be deleted.
First-class SDK support for this case would be great; I think it is a very common use case.
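
Roughly the user-side behavior being requested, as a sketch (FileSystemWatcher is standard .NET and BulkUpload is the existing SDK call; the paths are made up, and a real version would need retries and checks that the file is no longer being written):

// requires: using System.IO;
var watcher = new FileSystemWatcher(@"C:\outbox") { EnableRaisingEvents = true };
watcher.Created += (s, e) =>
{
    client.BulkUpload(e.FullPath, "/ingest/" + e.Name);  // upload the new file
    File.Delete(e.FullPath);                             // then delete the local copy
};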

Support for netcoreapp2.0

Package 'Microsoft.Azure.DataLake.Store' is compatible with a subset of the specified frameworks in project

Durable function activity invoking does not provide compile time safety

Invoking activities in a durable function does not provide type safety.

From the orchestrator we use CallActivityAsync, which takes object as its input parameter, and inside the activity we use the context.GetInput method.

This doesn't provide type safety between the caller and the consumer, and it has bitten us multiple times: devs change the type in one place and forget to change it in the other.
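
A sketch of the failure mode (the activity name and types here are hypothetical):

// Orchestrator side: passes an anonymous object.
await context.CallActivityAsync("ProcessFile", new { Path = "/a.txt" });

// Activity side: reads a different shape; this compiles fine but breaks at runtime.
var input = context.GetInput<string>();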

WebProxy for HttpWebRequest

Right now it's not possible to use custom WebProxy settings for the HttpWebRequest instances created in WebTransport.
Could you please provide an additional AdlsClient factory overload that creates a client instance with a WebProxy to be used in WebTransport?
Thanks.

AdlsOutputStream should handle empty source streams properly

We encountered an ArgumentOutOfRangeException when calling the CopyToAsync() API with an empty MemoryStream whose initial buffer size is 0.

We traced it to this line:

if (buffer != null && (offset >= buffer.Length || (offset < 0) || (count + offset > buffer.Length)))

When an empty memory stream is passed in, offset and buffer.Length are both zero, so the offset >= buffer.Length check fires. This case should be handled, and the write should succeed (as a no-op) instead of throwing an ArgumentOutOfRangeException.
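
One possible fix, as a sketch: drop the offset >= buffer.Length test, which is what rejects the legal empty-buffer case, while the remaining checks still catch genuinely out-of-range writes.

if (buffer != null && (offset < 0 || count < 0 || count + offset > buffer.Length))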

Error: Specified value has invalid Control characters

The exception occurs when using ConcurrentAppendAsync in an Azure Function listening to a Cosmos DB change feed.

Error in concurrent append for file files/edited-entity-type/y=2018/m=8/d=6/h=14/m=8/data.json.
Operation: CONCURRENTAPPEND failed with Unknown Error: Specified value has invalid Control characters.
Parameter name: value
Source: System.Net.WebHeaderCollection
StackTrace:
 at System.Net.HttpValidationHelpers.CheckBadHeaderValueChars(String value)
 at System.Net.WebHeaderCollection.Set(String name, String value)
 at Microsoft.Azure.DataLake.Store.WebTransport.AssignCommonHttpHeaders(HttpWebRequest webReq, AdlsClient client, RequestOptions req, String token, String opMethod, IDictionary`2 customHeaders)
 at Microsoft.Azure.DataLake.Store.WebTransport.MakeSingleCallAsync(String opCode, String path, ByteBuffer requestData, ByteBuffer responseData, QueryParams qp, AdlsClient client, RequestOptions req, OperationResponse resp, CancellationToken cancelToken, IDictionary`2 customHeaders).
Last encountered exception thrown after 1 tries. [Specified value has invalid Control characters.
Parameter name: value] [ServerRequestId:] ID:: b6c67b50-46e7-45ad-908c-a00ac93db54a

AzureDataLakeStorageClient.BulkDownload() fails when downloading large file

When downloading a file of size 2,396 MB, the BulkDownload() method fails with:

EntryName: /tmp/\7700484d-1cf4-4451-8589-5759f1ed7fde.csv8bcccfdc-e097-4d3b-98e8-ed6f4a533c13Segments, EntryType: Chunk, JobStatus: Failed, Error: Unable to load shared library 'Kernel32.dll' or one of its dependencies. In order to help diagnose loading problems, consider setting the LD_DEBUG environment variable: libKernel32.dll: cannot open shared object file: No such file or directory

When downloading a file of size 1,384 MB, the download is successful. Is there some known limitation in the ADLS client?
For context, we are running our service on Linux.

Can't copy large file Error "Operation: failed with Error: Offset of file is negative."

It happens when files are bigger than 2 GB, and it happens when using the WriteAsync or CopyAsync methods. With the synchronous Write and Copy methods it works fine.

It happens because the offset exceeds the 32-bit integer limit: Int32.MaxValue is 2,147,483,647, and 2,147,483,648 (2^31) wraps to a negative value in a signed 32-bit integer. Here is the stack trace:

Microsoft.Azure.DataLake.Store.AdlsException: Error in appending for file at offset 2147483648.
Operation: failed with Error: Offset of file is negative.
Last encountered exception thrown after 1 tries.
[ServerRequestId:]
at Microsoft.Azure.DataLake.Store.AdlsOutputStream.d__53.MoveNext()

AdlsClient.CreateClient throws ArgumentException when my account FQDN contains "-"

I'm trying to access our ADLS account with an AAD token via the ADLS SDK.
However, this line in my code gets an exception from the SDK:

AdlsClient.CreateClient(clientAccountName, clientCreds2);

Error

Unhandled Exception: System.ArgumentException: Account name test-ppe-c14.azuredatalakestore.net is invalid. Specify the full account including the domain name.
at Microsoft.Azure.DataLake.Store.AdlsClient..ctor(String accnt, Int64 clientId, ServiceClientCredentials creds, Boolean skipAccntValidation)
at Microsoft.Azure.DataLake.Store.AdlsClient.CreateClient(String accountFqdn, ServiceClientCredentials creds)
at AdlsSdkSamples.Program.Main(String[] args) in e:\GitHub\data-lake-store-adls-dot-net-samples\AdlsSdkSamples\Program.cs:line 40

Looking at the SDK source code, I believe the following regex is what determines that my ADLS FQDN is invalid:

private bool IsValidAccount(string accnt)
{
    return Regex.IsMatch(accnt, @"^[a-zA-Z0-9]+\.[a-zA-Z0-9\-][a-zA-Z0-9.\-]*$");
}

But I think my ADLS FQDN is a valid one in the Azure portal (otherwise I could not have created it, right?).
Here is my ADLS FQDN pattern:
test-ppe-c14.azuredatalakestore.net

The first label of the regex, [a-zA-Z0-9]+, admits no hyphen, so "test-ppe-c14" is rejected before the first dot. Looks like a bug in the SDK to me.

Edit note [updated 1/9/2018]:
After chatting with the team: the Azure portal doesn't allow '-' in the account name.
Our team worked with Azure to create those ADLS accounts, which may have bypassed the portal's validation, so this issue is a special case.
