
lytics / cloudstorage


Cloud & local storage unified API (S3, Google, Azure, SFTP, local)

License: MIT License

Go 98.62% Shell 1.38%
cloud-storage golang google-cloud-storage mock abstractions


Introduction

Cloudstorage is a library for working with cloud storage (Google, AWS, Azure), SFTP, and local files. It provides a unified API for local files, SFTP, and cloud objects, which aids testing and makes it easy to operate across multiple storage backends.


Features

  • Provides a single unified API for multiple clouds (Google, Azure, AWS) as well as local files.
  • Upload/download is unified in the API, so you don't have to download a file locally, work with it, and then upload it again yourself.
  • Files from the cloud are buffered/cached locally, so repeated access is very fast.


Example usage:

Note: for brevity, these examples ignore all errors by assigning them to _.

Creating a Store object:
// This is an example of a local storage object.
// See https://github.com/lytics/cloudstorage/blob/master/google/google_test.go for a GCS example.
config := &cloudstorage.Config{
	Type:            localfs.StoreType,
	AuthMethod:      localfs.AuthFileSystem,
	LocalFS:         "/tmp/mockcloud",
	TmpDir:          "/tmp/localcache",
}
store, _ := cloudstorage.NewStore(config)
Listing Objects:

See the Go iterator guidelines for the API design: https://github.com/GoogleCloudPlatform/google-cloud-go/wiki/Iterator-Guidelines

// From a store that has been created

// Create a query
q := cloudstorage.NewQuery("list-test/")
// Create an Iterator
iter, err := store.Objects(context.Background(), q)
if err != nil {
	// handle
}

for {
	o, err := iter.Next()
	if err == iterator.Done {
		break
	}
	if err != nil {
		// handle
		break
	}
	log.Println("found object ", o.Name())
}
Writing an object:
obj, _ := store.NewObject("prefix/test.csv")
// Open for reading and writing.  f is a file handle to the local filesystem.
f, _ := obj.Open(cloudstorage.ReadWrite)
w := bufio.NewWriter(f)
w.WriteString("Year,Make,Model\n")
w.WriteString("1997,Ford,E350\n")
w.Flush()

// Close syncs the local file to the remote store and removes the local tmp file.
obj.Close()
Reading an existing object:
// Calling Get on an existing object will return a cloudstorage object or the cloudstorage.ErrObjectNotFound error.
obj2, _ := store.Get(context.Background(), "prefix/test.csv")
// Note, the file is not yet open
f2, _ := obj2.Open(cloudstorage.ReadOnly)
bytes, _ := ioutil.ReadAll(f2)
fmt.Println(string(bytes)) // should print the CSV file from the block above...
Transferring an existing object:
var config = &storeutils.TransferConfig{
	Type:            google.StoreType,
	AuthMethod:      google.AuthGCEDefaultOAuthToken,
	ProjectID:       "my-project",
	DestBucket:      "my-destination-bucket",
	Src:             storeutils.NewGcsSource("my-source-bucket"),
	IncludePrefixes: []string{"these", "prefixes"},
}

// client is an authenticated Google storage client (construction omitted here).
transferer, _ := storeutils.NewTransferer(client)
resp, _ := transferer.NewTransfer(config)

See testsuite.go for more examples.

Testing

The integration tests run against a real cloud bucket and its objects, so run the tests without parallelization:

cd $GOPATH/src/github.com/lytics/cloudstorage
go test -p 1 ./...


cloudstorage's Issues

Tests failing after I modified Move/Copy test cases to use variable len payloads

In a previous PR I discovered that the sftp/localfs and Azure test cases were failing after I switched the tests to variable-length payloads.

SFTP/localfs failed because files weren't being truncated before being written to. So, if the new data was shorter than the previous data, the file would retain extra bytes and be corrupted.

Azure turned out to be a race in how we wrote to the backing store.

S3 Copy/Move performance optimizations

Copy and Move are not currently implemented for S3. They are performance optimizations: when copying/moving from S3 to S3, no bytes need to be transferred over the network; instead S3 performs the copy/move through its native API.

cloudstorage/store.go

Lines 73 to 85 in 2731960

// StoreCopy Optional interface to fast path copy. Many of the cloud providers
// don't actually copy bytes. Rather they allow a "pointer" that is a fast copy.
StoreCopy interface {
	// Copy from object, to object
	Copy(ctx context.Context, src, dst Object) error
}

// StoreMove Optional interface to fast path move. Many of the cloud providers
// don't actually copy bytes.
StoreMove interface {
	// Move from object location, to object location.
	Move(ctx context.Context, src, dst Object) error
}

Google implementation

// Copy from src to destination

Handle Files vs Directories

The current Object implementation assumes it is a file and has no affordance for directory-only types. https://golang.org/pkg/os/#FileMode

Google Storage has a Delimiter parameter for filtering a listing to certain types:

https://developers.google.com/apis-explorer/#p/storage/v1/storage.objects.list?bucket=lytics-dataux-tests&delimiter=%252F&maxResults=1000&prefix=tables%252F&_h=8&

Limiting to directories with Delimiter = "/"

{
 "kind": "storage#objects",
 "prefixes": [
  "tables/article/",
  "tables/user/"
 ]
}

Regular objects

https://developers.google.com/apis-explorer/#p/storage/v1/storage.objects.list?bucket=lytics-dataux-tests&maxResults=1000&prefix=tables%252F&_h=7&

{
 "kind": "storage#objects",
 "items": [
  {
   "kind": "storage#object",
   "id": "lytics-dataux-tests/tables/article/article1.csv/1457896488161000",
   "selfLink": "https://www.googleapis.com/storage/v1/b/lytics-dataux-tests/o/tables%2Farticle%2Farticle1.csv",
   "name": "tables/article/article1.csv",
   "bucket": "lytics-dataux-tests",
   "generation": "1457896488161000",
   "metageneration": "1",
   "contentType": "text/csv; charset=utf-8",
   "timeCreated": "2016-03-13T19:14:48.119Z",
   "updated": "2016-03-13T19:14:48.119Z",
   "storageClass": "STANDARD",
   "size": "398",
   "md5Hash": "+RTyIckctKnUmha0OaBBHA==",
   "mediaLink": "https://www.googleapis.com/download/storage/v1/b/lytics-dataux-tests/o/tables%2Farticle%2Farticle1.csv?generation=1457896488161000&alt=media",
   "metadata": {
    "content_type": "text/csv; charset=utf-8"
   },
   "owner": {
    "entity": "user-00b4903a9730f58d42103a7fd40a4d0f92e371ae57112475d3993f88ccf7e8d6",
    "entityId": "00b4903a9730f58d42103a7fd40a4d0f92e371ae57112475d3993f88ccf7e8d6"
   },
   "crc32c": "8oZBFA==",
   "etag": "COidr9KvvssCEAE="
  },
  {
   "kind": "storage#object",
   "id": "lytics-dataux-tests/tables/user/user1.csv/1457896494340000",
   "selfLink": "https://www.googleapis.com/storage/v1/b/lytics-dataux-tests/o/tables%2Fuser%2Fuser1.csv",
   "name": "tables/user/user1.csv",
   "bucket": "lytics-dataux-tests",
   "generation": "1457896494340000",
   "metageneration": "1",
   "contentType": "text/csv; charset=utf-8",
   "timeCreated": "2016-03-13T19:14:54.328Z",
   "updated": "2016-03-13T19:14:54.328Z",
   "storageClass": "STANDARD",
   "size": "299",
   "md5Hash": "p6GxtAFU3xu3q8ty852yxw==",
   "mediaLink": "https://www.googleapis.com/download/storage/v1/b/lytics-dataux-tests/o/tables%2Fuser%2Fuser1.csv?generation=1457896494340000&alt=media",
   "metadata": {
    "content_type": "text/csv; charset=utf-8"
   },
   "owner": {
    "entity": "user-00b4903a9730f58d42103a7fd40a4d0f92e371ae57112475d3993f88ccf7e8d6",
    "entityId": "00b4903a9730f58d42103a7fd40a4d0f92e371ae57112475d3993f88ccf7e8d6"
   },
   "crc32c": "1fpnNw==",
   "etag": "CKCvqNWvvssCEAE="
  }
 ]
}

Overwriting behavioral difference between store types when using store.NewWriter...

Some cloud providers overwrite a file as an atomic operation that takes place on a call to writer.Close(). But for localfs and sftp, we're currently removing the file when the writer is opened and then we overwrite the object as we stream bytes to it. This creates a gap of time when the file is in an inconsistent state for those stores that don't support atomic replacement on Close().

localfs NewWriter: should it use O_TRUNC?

Hi,

When I read the doc for NewWriter, it says:

		// NewWriter returns a io.Writer that writes to a Cloud object
		// associated with this backing Store object.
		//
		// A new object will be created unless an object with this name already exists.
		// Otherwise any previous object with the same name will be replaced.
		// The object will not be available (and any previous object will remain)
		// until Close has been called
		NewWriter(o string, metadata map[string]string) (io.WriteCloser, error)

However, when using the localfs type, it appears the file is not truncated.

What do you think?

awss3 List does not call query ApplyFilters

The documentation of Query.ApplyFilters says:

// ApplyFilters is called as the last step in store.List() to filter out the
// results before they are returned.

but this is not true for awss3 (and maybe for google).
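To illustrate the documented contract with hypothetical stand-in types (this is not the library's actual code), List is supposed to run each of the query's filters over the results before returning them:

```go
package main

import (
	"fmt"
	"strings"
)

// Stand-in types mirroring the documented Query/filter contract.
type object struct{ name string }

type filter func(objects []object) []object

type query struct{ filters []filter }

// applyFilters is what the docs promise happens as the last step of
// store.List(): every filter runs over the results before they are returned.
func (q query) applyFilters(objs []object) []object {
	for _, f := range q.filters {
		objs = f(objs)
	}
	return objs
}

func main() {
	q := query{filters: []filter{
		func(objs []object) []object {
			out := objs[:0]
			for _, o := range objs {
				if strings.HasSuffix(o.name, ".csv") {
					out = append(out, o)
				}
			}
			return out
		},
	}}
	listed := []object{{"a.csv"}, {"b.json"}, {"c.csv"}}
	for _, o := range q.applyFilters(listed) {
		fmt.Println(o.name)
	}
}
```

The awss3 fix would be to pass the listed page through this step before building the response.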

File name gets too long when local storage is used with TmpDir=LocalFS

When local file storage is used and TmpDir is set to the same path as LocalFS, it appears to cause a loop of cached files being created every time List() is called (this happens in LIO right now with the event store archive pointing at /tmp):

localfile: error occurred opening cachedcopy file. cachepath=/tmp/aid-12/v1/stream/.5dee81effdba11e5adfe0862664d56cf.cache.72580f9cfdbf11e598d00862664d56cf.cache.72580f9cfdbf11e598d00862664d56cf.cache.72580f9cfdbf11e598d00862664d56cf.cache.5dee81effdba11e5adfe0862664d56cf.cache.5dee81effdba11e5adfe0862664d56cf.cachefrom_s3.5dee81effdba11e5adfe0862664d56cf.cache err=open /tmp/aid-12/v1/stream/.5dee81effdba11e5adfe0862664d56cf.cache.72580f9cfdbf11e598d00862664d56cf.cache.72580f9cfdbf11e598d00862664d56cf.cache.72580f9cfdbf11e598d00862664d56cf.cache.5dee81effdba11e5adfe0862664d56cf.cache.5dee81effdba11e5adfe0862664d56cf.cachefrom_s3.5dee81effdba11e5adfe0862664d56cf.cache: file name too long

This causes reads to crash after a few iterations, since the filename grows too long.
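A possible guard, sketched with a hypothetical helper (not the library's API): reject configs where the cache directory equals or nests inside the store root, so cache files can never show up in listings and be re-cached:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// validateLocalFS rejects a config whose TmpDir is the same as, or contained
// within, the LocalFS root. Cache files written under the store root would be
// seen by List() and cached again, compounding the ".cache" suffixes until
// the filename exceeds the OS limit.
func validateLocalFS(localFS, tmpDir string) error {
	l := filepath.Clean(localFS)
	t := filepath.Clean(tmpDir)
	sep := string(filepath.Separator)
	if t == l || strings.HasPrefix(t, l+sep) {
		return fmt.Errorf("localfs: TmpDir %q must live outside LocalFS %q", tmpDir, localFS)
	}
	return nil
}

func main() {
	fmt.Println(validateLocalFS("/tmp/mockcloud", "/tmp/mockcloud"))
	fmt.Println(validateLocalFS("/tmp/mockcloud", "/tmp/localcache"))
}
```

Checking this once in cloudstorage.NewStore would turn a runaway caching loop into an immediate, explicit configuration error.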

Add support for SFTP

Add support for SFTP as a backend

  • SFTP uses traditional file-based folder structures, which means you can have a file a/b/file.csv: folder a contains folder b, which contains file.csv. The folder a is otherwise empty and would NOT show up the same way in traditional cloud-based systems, where a is ignored; only a/b would be a folder in Google, Azure, etc.
    • Update tests to validate this behavior across the existing backends; this is probably a weakness in the existing test suite.
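The difference can be sketched in a few lines: cloud stores have no real directories, only prefixes implied by object keys, so a directory that contains no objects at all has no representation in a cloud-style listing (hypothetical helper for illustration):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// impliedFolders derives the "folders" a cloud listing would show purely from
// object keys. There is no directory entity: a prefix exists only because
// some object's key passes through it.
func impliedFolders(objects []string) []string {
	set := map[string]bool{}
	for _, name := range objects {
		parts := strings.Split(name, "/")
		for i := 1; i < len(parts); i++ {
			set[strings.Join(parts[:i], "/")+"/"] = true
		}
	}
	out := make([]string, 0, len(set))
	for f := range set {
		out = append(out, f)
	}
	sort.Strings(out)
	return out
}

func main() {
	// a/b/file.csv implies prefixes derived from its key...
	fmt.Println(impliedFolders([]string{"a/b/file.csv"}))
	// ...but a directory created empty on sftp/localfs has no object key at
	// all, so nothing implies it and it vanishes from a cloud-style listing:
	fmt.Println(impliedFolders(nil))
}
```

On SFTP/localfs, by contrast, an empty directory is a real filesystem entity, which is exactly the behavioral gap the tests should pin down.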

Read File if already in LocalCache and Cleaner

It would be nice to read a file from the local cache if it already exists there, checking the md5 to make sure the content is the same. Use the md5 as the filename? This would require a couple of cleaner strategies: time-based as well as size-based.

awss3 List method ignores NextMarker

Hello

I had trouble listing the entire contents of an S3 bucket: List only returns up to PageSize objects and does not give me the next marker when the response is truncated.

q := cloudstorage.NewQuery(path)
or, err := s.StoreReader.List(context.Background(), q)
...
spew.Dump(or.NextMarker) // always empty

When I investigated, I found two problems here:

https://github.com/lytics/cloudstorage/blob/master/awss3/store.go#L309

	if resp.IsTruncated != nil && *resp.IsTruncated {
		q.Marker = *resp.Contents[len(resp.Contents)-1].Key
	}

First, the query q is not a reference/pointer to a Query; it is passed by value, so setting the marker on q has no effect.

Second, it does not fill in objResp.NextMarker (unlike Azure: https://github.com/lytics/cloudstorage/blob/master/azure/store.go#L265).

This may affect other backends. I have a workaround, but a proper fix would be great; I will try to submit a pull request.
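For reference, the pagination loop this enables once NextMarker is filled in, sketched with stand-in types rather than the real awss3 store:

```go
package main

import "fmt"

// page mimics the shape of a truncated S3 ListObjects response.
type page struct {
	Keys        []string
	IsTruncated bool
	NextMarker  string
}

// list returns up to pageSize keys after marker, setting IsTruncated and
// NextMarker when more keys remain. all must be sorted.
func list(all []string, marker string, pageSize int) page {
	var keys []string
	for _, k := range all {
		if k > marker {
			keys = append(keys, k)
		}
		if len(keys) == pageSize {
			break
		}
	}
	p := page{Keys: keys}
	if len(keys) == pageSize && keys[len(keys)-1] < all[len(all)-1] {
		p.IsTruncated = true
		p.NextMarker = keys[len(keys)-1]
	}
	return p
}

func main() {
	all := []string{"a", "b", "c", "d", "e"}
	marker := ""
	var got []string
	for {
		p := list(all, marker, 2)
		got = append(got, p.Keys...)
		if !p.IsTruncated {
			break
		}
		marker = p.NextMarker // without this, callers only ever see one page
	}
	fmt.Println(got)
}
```

When NextMarker is always empty, the loop above terminates after the first page, which is exactly the truncated-listing symptom reported.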

google storage: BucketLifecycleRuleCondition field Age changed type to *int64

With the latest versions of google.golang.org/api/storage/v1, the Age field on the BucketLifecycleRuleCondition struct changed type from int64 to *int64, which prevents updating that package when using cloudstorage:

$ ./go.test.sh 
?   	github.com/lytics/cloudstorage/testutils	[no test files]
ok  	github.com/lytics/cloudstorage/sftp	0.106s	coverage: 8.4% of statements
ok  	github.com/lytics/cloudstorage/localfs	0.625s	coverage: 82.7% of statements
# github.com/lytics/cloudstorage/google
google/apistore.go:84:58: cannot use days (variable of type int64) as type *int64 in struct literal
FAIL	github.com/lytics/cloudstorage/google/storeutils [build failed]
FAIL
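The conventional fix is a small pointer shim at the call site (sketched with a stand-in struct, not the actual google API type):

```go
package main

import "fmt"

// int64Ptr is the usual shim when an API field changes from int64 to *int64:
// take the address of a copy, so literals and loop variables can be passed safely.
func int64Ptr(v int64) *int64 { return &v }

// ruleCondition mimics the shape of BucketLifecycleRuleCondition after the
// upstream change.
type ruleCondition struct {
	Age *int64
}

func main() {
	days := int64(30)
	c := ruleCondition{Age: int64Ptr(days)} // was: Age: days
	fmt.Println(*c.Age)
}
```

The pointer lets the API distinguish "Age not set" (nil) from "Age set to zero", which is why upstream made the change.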

improve sftp support

Hi!

Thanks again for the lib, super useful!

About the SFTP part, I want to report that some options are hardcoded and should be configurable; the current values do not work very well, IMHO.

In these two methods,

// ConfigUserPass creates ssh config with user/password
// HostKeyCallback was added here
// https://github.com/golang/crypto/commit/e4e2799dd7aab89f583e1d898300d96367750991
// currently we don't check hostkey, but in the future (todo) we could store the hostkey
// and check on future logins if there is a match.
func ConfigUserPass(user, password string) *ssh.ClientConfig {
	return &ssh.ClientConfig{
		User: user,
		Auth: []ssh.AuthMethod{
			ssh.Password(password),
		},
		// Config: ssh.Config{
		// 	Ciphers: []string{
		// 		"aes128-ctr", "aes192-ctr", "aes256-ctr", "aes128-gcm@openssh.com",
		// 		"arcfour256", "arcfour128", "aes128-cbc",
		// 	},
		// },
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
		Timeout:         timeout,
	}
}

// ConfigUserKey creates ssh config with ssh/private rsa key
func ConfigUserKey(user, keyString string) (*ssh.ClientConfig, error) {
	// Decode the RSA private key

	key, err := ssh.ParsePrivateKey([]byte(keyString))
	if err != nil {
		return nil, fmt.Errorf("bad private key: %s", err)
	}

	return &ssh.ClientConfig{
		User: user,
		Auth: []ssh.AuthMethod{
			ssh.PublicKeys(key),
		},
		Config: ssh.Config{
			Ciphers: []string{
				"aes128-ctr", "aes192-ctr", "aes256-ctr", "aes128-gcm@openssh.com",
				"arcfour256", "arcfour128", "aes128-cbc",
			},
		},
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
		Timeout:         timeout,
	}, nil
}

As you can see, I had to comment out the cipher list; otherwise I'd end up with an SSH error such as "SSH_FX_PERMISSION_DENIED".

Also, I think the host-key verification callback and the timeout should be configurable.

For those three things, I'd like to suggest adding new configuration keys (like ConfKeyFolder = "folder") so the end user can set them via the gou.JsonHelper map.

What do you think?

Thanks

Using the Google auth for JWT files without a scope can lead to confusion.

I got a report of a user trying to use a JWT file who was confused about why it wasn't working, with an error about a lack of scopes. The reason was that they hadn't set the Scope, which the construction code allowed them to omit; then, on the first attempt to use the store, an error was thrown.

We should add better feedback during config/auth construction so it's clearer what went wrong.
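A sketch of the kind of fail-fast validation proposed, using hypothetical field names (the real check would live in the google auth construction path):

```go
package main

import (
	"errors"
	"fmt"
)

// jwtConf is a stand-in for the JWT portion of the store config.
type jwtConf struct {
	KeyFile string
	Scopes  []string
}

// validateJWTConf rejects an underspecified config at construction time,
// instead of surfacing an opaque scope error on the first store call.
func validateJWTConf(c jwtConf) error {
	if c.KeyFile == "" {
		return errors.New("google: JWT auth requires a key file")
	}
	if len(c.Scopes) == 0 {
		return errors.New("google: JWT auth requires at least one OAuth scope (e.g. devstorage.read_write)")
	}
	return nil
}

func main() {
	fmt.Println(validateJWTConf(jwtConf{KeyFile: "key.json"}))
	fmt.Println(validateJWTConf(jwtConf{
		KeyFile: "key.json",
		Scopes:  []string{"https://www.googleapis.com/auth/devstorage.read_write"},
	}))
}
```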

update Google Cloud API client import paths and more

The Google Cloud API client libraries for Go are making some breaking changes:

  • The import paths are changing from google.golang.org/cloud/... to
    cloud.google.com/go/.... For example, if your code imports the BigQuery client
    it currently reads
    import "google.golang.org/cloud/bigquery"
    It should be changed to
    import "cloud.google.com/go/bigquery"
  • Client options are also moving, from google.golang.org/cloud to
    google.golang.org/api/option. Two have also been renamed:
    • WithBaseGRPC is now WithGRPCConn
    • WithBaseHTTP is now WithHTTPClient
  • The cloud.WithContext and cloud.NewContext methods are gone, as are the
    deprecated pubsub and container functions that required them. Use the Client
    methods of these packages instead.

You should make these changes before September 12, 2016, when the packages at
google.golang.org/cloud will go away.

LocalFS Prefix Query Does Not Match Filename Prefixes

The LocalFS store will not match a filename prefix; it only matches folder prefixes. If a file like /tests/users-2021-10-12.csv exists in the base of a LocalFS store:

  • a query for /tests will return the file
  • a query for /tests/users- will not return the file

This differs from other store types, which would return the file in both cases, and can cause issues when building an application that operates across multiple cloudstorage types.

SFTP (which operates on a similar file system) has been built to behave the same as the other store providers.
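The expected behavior is plain string-prefix matching on the full object key, as sketched here; LocalFS today only handles the folder case because it walks directories:

```go
package main

import (
	"fmt"
	"strings"
)

// matchPrefix matches the query prefix against the full object key, so both
// a folder prefix ("/tests") and a partial filename prefix ("/tests/users-")
// select the object, matching the other store types' behavior.
func matchPrefix(keys []string, prefix string) []string {
	var out []string
	for _, k := range keys {
		if strings.HasPrefix(k, prefix) {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	keys := []string{"/tests/users-2021-10-12.csv"}
	fmt.Println(matchPrefix(keys, "/tests"))        // folder-style prefix
	fmt.Println(matchPrefix(keys, "/tests/users-")) // filename prefix
}
```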
