toglacier's Introduction

toglacier

Send data to the cloud periodically.

What?

Have you ever thought that your server could have a backup in the cloud to mitigate some crazy ransomware infection? Great! Here is a piece of software to help you do that, sending your data periodically to the cloud. For now it supports the Amazon Glacier and Google Cloud Storage services. It uses the AWS SDK and the Google Cloud SDK behind the scenes; all honors go to the Amazon and Google developers.

The program will first add all modified files (compared with the last sync) to a tarball and then, if a secret was defined, encrypt the archive. After that, if AWS was chosen, it will decide between sending the archive in one shot and using a multipart strategy for larger files. For now we follow the AWS suggestion and switch to multipart when the tarball gets bigger than 100MB. When using multipart, each part has 4MB (except for the last one, which can be smaller). The maximum archive size is 40GB (but we can increase this).
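
A minimal sketch of that decision in Go, under the assumption that a single size check drives the strategy (the helper name decideUpload and the package name are illustrative, not the project's actual API):

package cloud

import "fmt"

// decideUpload reports whether an archive of the given size should use the
// multipart strategy, applying the thresholds quoted above.
func decideUpload(size int64) (multipart bool, err error) {
	const (
		multipartThreshold = 100 << 20 // 100MB: AWS suggestion for multipart
		maxArchiveSize     = 40 << 30  // 40GB: current archive size limit
	)
	if size > maxArchiveSize {
		return false, fmt.Errorf("archive too big: %d bytes", size)
	}
	return size > multipartThreshold, nil
}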

Old backups are also removed automatically, to avoid keeping too many files in the cloud and, consequently, to save you some money. Periodically, the tool requests the list of remote backups to synchronize the local storage.

Some cool features that you will find in this tool:

  • Back up the desired directories periodically;
  • Upload only modified files (small backup parts);
  • Detect ransomware infection (too many modified files);
  • Ignore some files or directories in the backup path;
  • Encrypt backups before sending to the cloud;
  • Automatically download and rebuild backup parts;
  • Old backups are removed periodically to save you some money;
  • List all the versions of a file that was backed up;
  • Smart backup removal, replacing references for incremental backups;
  • Periodic reports sent by e-mail.

Install

To compile and run the program you will need to download the Go compiler, set $GOPATH, add $GOPATH/bin to your $PATH and run the following command:

go get -u github.com/rafaeljusto/toglacier/...

If you think it is a good idea to encrypt some sensitive parameters and want to improve security, you should replace the numbers in the slices of the passwordKey function in the encpass_key.go file with your own random numbers, or run the Python script (inside the internal/config package) with the command below. Remember to compile the tool again (go install).

encpass_key_generator.py -w

As this program can work as a service/daemon (start command), in that case you should run it in the background. It is a good practice to also add it to your system startup (you don't want your backup scheduler to stop working after a reboot).

Usage

The program works with environment variables and/or a YAML configuration file. You can find an example configuration file at cmd/toglacier/toglacier.yml; the environment variables are described below:

Environment Variable                      Description
TOGLACIER_AWS_ACCOUNT_ID                  AWS account ID
TOGLACIER_AWS_ACCESS_KEY_ID               AWS access key ID
TOGLACIER_AWS_SECRET_ACCESS_KEY           AWS secret access key
TOGLACIER_AWS_REGION                      AWS region
TOGLACIER_AWS_VAULT_NAME                  AWS vault name
TOGLACIER_GCS_PROJECT                     GCS project name
TOGLACIER_GCS_BUCKET                      GCS bucket name
TOGLACIER_GCS_ACCOUNT_FILE                GCS account file
TOGLACIER_PATHS                           Paths to backup (separated by comma)
TOGLACIER_DB_TYPE                         Local backup storage strategy
TOGLACIER_DB_FILE                         Path where we keep track of the backups
TOGLACIER_LOG_FILE                        File where all events are written
TOGLACIER_LOG_LEVEL                       Verbosity of the logger
TOGLACIER_KEEP_BACKUPS                    Number of backups to keep (default 10)
TOGLACIER_BACKUP_SECRET                   Encrypt backups with this secret
TOGLACIER_MODIFY_TOLERANCE                Maximum percentage of modified files
TOGLACIER_IGNORE_PATTERNS                 Regexps to ignore files in backup paths
TOGLACIER_SCHEDULER_BACKUP                Backup synchronization periodicity
TOGLACIER_SCHEDULER_REMOVE_OLD_BACKUPS    Remove old backups periodicity
TOGLACIER_SCHEDULER_LIST_REMOTE_BACKUPS   List remote backups periodicity
TOGLACIER_SCHEDULER_SEND_REPORT           Send report periodicity
TOGLACIER_EMAIL_SERVER                    SMTP server address
TOGLACIER_EMAIL_PORT                      SMTP server port
TOGLACIER_EMAIL_USERNAME                  Username for e-mail authentication
TOGLACIER_EMAIL_PASSWORD                  Password for e-mail authentication
TOGLACIER_EMAIL_FROM                      E-mail used when sending the reports
TOGLACIER_EMAIL_TO                        List of e-mails to send the report to
TOGLACIER_EMAIL_FORMAT                    E-mail content format (html or plain)

Amazon cloud credentials can be retrieved via the AWS Console (My Security Credentials and the Glacier service). You will find your AWS region identification here. For Google Cloud Storage credentials, check the Service Account Keys. If you choose Google Cloud Storage, you will need to create the project and the bucket manually.

By default the tool prints everything to the standard output. If you want to redirect it to a log file, you can define the location of the file with TOGLACIER_LOG_FILE. Even with the redirection, the messages are still written to the standard output. You can define the verbosity with the TOGLACIER_LOG_LEVEL parameter, which accepts the values debug, info, warning, error, fatal or panic. By default the error log level is used.

There are some commands in the tool to manage the backups:

  • sync: execute the backup task now
  • get: retrieve a backup from AWS Glacier service
  • list or ls: list the current backups in the local storage or remotely
  • remove or rm: remove a backup from AWS Glacier service
  • start: initialize the scheduler (will block forever)
  • report: test report notification
  • encrypt or enc: encrypt a password or secret to improve security

You can improve security by encrypting the values (use the encrypt command) of the variables TOGLACIER_AWS_ACCOUNT_ID, TOGLACIER_AWS_ACCESS_KEY_ID, TOGLACIER_AWS_SECRET_ACCESS_KEY, TOGLACIER_BACKUP_SECRET and TOGLACIER_EMAIL_PASSWORD, or of the respective parameters in the configuration file. The tool detects an encrypted value when it starts with the label encrypted:.
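
A minimal sketch of how such a value can be detected, assuming a hypothetical decrypt callback standing in for the tool's internal routine:

package config

import "strings"

// resolveValue returns the plain value, decrypting it first when it carries
// the encrypted: label described above.
func resolveValue(value string, decrypt func(string) (string, error)) (string, error) {
	const prefix = "encrypted:"
	if strings.HasPrefix(value, prefix) {
		return decrypt(strings.TrimPrefix(value, prefix))
	}
	return value, nil
}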

To keep track of the backups locally you can choose boltdb (BoltDB) or auditfile in the TOGLACIER_DB_TYPE variable. By default boltdb is used. If you choose the audit file, which is a human-readable and technology-free solution, the format is defined below. It's a good idea to periodically copy the audit file or the BoltDB file somewhere else, so if you lose your server you can recover the files faster from the cloud (no need to wait for the inventory). If you change your mind later about which local storage format you want, you can use the toglacier-storage program to convert it. Just remember that the boltdb format stores more information than the auditfile format.

[datetime] [vaultName] [archiveID] [checksum] [size] [location]

The [location] in the audit file could have the value aws or gcs depending on the cloud service used to store the backup.
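
A sketch of parsing one audit line into a record. The type name, the RFC 3339 datetime layout and the space-separated encoding are assumptions for illustration; the tool's actual layout may differ:

package storage

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// AuditEntry mirrors one line of the audit file format shown above.
type AuditEntry struct {
	Time      time.Time
	Vault     string
	ArchiveID string
	Checksum  string
	Size      int64
	Location  string // "aws" or "gcs"
}

func parseAuditLine(line string) (AuditEntry, error) {
	fields := strings.Fields(line)
	if len(fields) != 6 {
		return AuditEntry{}, fmt.Errorf("expected 6 fields, got %d", len(fields))
	}
	t, err := time.Parse(time.RFC3339, fields[0])
	if err != nil {
		return AuditEntry{}, err
	}
	size, err := strconv.ParseInt(fields[4], 10, 64)
	if err != nil {
		return AuditEntry{}, err
	}
	return AuditEntry{t, fields[1], fields[2], fields[3], size, fields[5]}, nil
}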

When running the scheduler (start command), the tool performs the actions below with the periodicity defined in the configuration file; if not informed, default values are used. A sketch of how the schedules can be interpreted follows the list.

  • back up the files and folders;
  • remove old backups (save storage and money);
  • synchronize the local storage;
  • report all the scheduler occurrences by e-mail.
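
The schedules in the example script further below use six-field cron expressions (seconds first). A minimal sketch of a scheduler accepting that format, assuming the robfig/cron library (the project's actual scheduler implementation may differ):

package main

import (
	"log"

	"github.com/robfig/cron/v3"
)

func main() {
	// cron.WithSeconds enables six-field expressions such as "0 0 0 * * *",
	// which runs the job every day at midnight.
	c := cron.New(cron.WithSeconds())
	if _, err := c.AddFunc("0 0 0 * * *", func() { log.Println("running backup") }); err != nil {
		log.Fatal(err)
	}
	c.Start()
	select {} // block forever, like the start command does
}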

A shell script that can help you run the program in Unix environments (using AWS):

#!/bin/bash

TOGLACIER_AWS_ACCOUNT_ID="encrypted:DueEGILYe8OoEp49Qt7Gymms2sPuk5weSPiG6w==" \
TOGLACIER_AWS_ACCESS_KEY_ID="encrypted:XesW4TPKzT3Cgw1SCXeMB9Pb2TssRPCdM4mrPwlf4zWpzSZQ" \
TOGLACIER_AWS_SECRET_ACCESS_KEY="encrypted:hHHZXW+Uuj+efOA7NR4QDAZh6tzLqoHFaUHkg/Yw1GE/3sJBi+4cn81LhR8OSVhNwv1rI6BR4fA=" \
TOGLACIER_AWS_REGION="us-east-1" \
TOGLACIER_AWS_VAULT_NAME="backup" \
TOGLACIER_PATHS="/usr/local/important-files-1,/usr/local/important-files-2" \
TOGLACIER_DB_TYPE="boltdb" \
TOGLACIER_DB_FILE="/var/log/toglacier/toglacier.db" \
TOGLACIER_LOG_FILE="/var/log/toglacier/toglacier.log" \
TOGLACIER_LOG_LEVEL="error" \
TOGLACIER_KEEP_BACKUPS="10" \
TOGLACIER_CLOUD="aws" \
TOGLACIER_BACKUP_SECRET="encrypted:/lFK9sxAXAL8CuM1GYwGsdj4UJQYEQ==" \
TOGLACIER_MODIFY_TOLERANCE="90%" \
TOGLACIER_IGNORE_PATTERNS="^.*\~\$.*$" \
TOGLACIER_SCHEDULER_BACKUP="0 0 0 * * *" \
TOGLACIER_SCHEDULER_REMOVE_OLD_BACKUPS="0 0 1 * * FRI" \
TOGLACIER_SCHEDULER_LIST_REMOTE_BACKUPS="0 0 12 1 * *" \
TOGLACIER_SCHEDULER_SEND_REPORT="0 0 6 * * FRI" \
TOGLACIER_EMAIL_SERVER="smtp.example.com" \
TOGLACIER_EMAIL_PORT="587" \
TOGLACIER_EMAIL_USERNAME="[email protected]" \
TOGLACIER_EMAIL_PASSWORD="encrypted:i9dw0HZPOzNiFgtEtrr0tiY0W+YYlA==" \
TOGLACIER_EMAIL_FROM="[email protected]" \
TOGLACIER_EMAIL_TO="[email protected],[email protected]" \
TOGLACIER_EMAIL_FORMAT="html" \
toglacier "$@"

With that you can just run the following command to start the scheduler:

./toglacier.sh start

Just remember to grant write permissions to wherever the stdout/stderr and audit files are going to be written (/var/log/toglacier).

Deployment

For developers who want to build a package, there are already two scripts to make your life easier. As Go supports cross-compilation, you can build the desired package from any OS or architecture.

Debian

To build a Debian package you will need the Effing Package Management tool. Then just run the script with the desired version and release of the program:

./package-deb.sh <version>-<release>

FreeBSD

You can also build a package for the FreeBSD pkgng repository. No external tools are needed to build this package.

./package-txz.sh <version>-<release>

Windows

To make your life easier you can use the NSSM tool to create a Windows service that runs toglacier in the background. The following commands install and start the service:

c:\> nssm.exe install toglacier

c:\> nssm.exe start toglacier

toglacier's Issues

Detect ransomware infection

When the differences in a backup path exceed a specific percentage, we should do something to avoid sending the backup to the cloud.
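
A minimal sketch of that check, assuming the limit comes from TOGLACIER_MODIFY_TOLERANCE as a percentage (names are illustrative):

package backup

import "fmt"

// checkTolerance aborts the backup when the percentage of modified files
// exceeds the configured tolerance.
func checkTolerance(modified, total int, tolerance float64) error {
	if total == 0 {
		return nil
	}
	percentage := float64(modified) / float64(total) * 100
	if percentage > tolerance {
		return fmt.Errorf("possible ransomware infection: %.0f%% of the files changed", percentage)
	}
	return nil
}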

Verify uploaded backup checksum

After uploading an archive to the AWS service we should compare the returned SHA256 checksum with what we expected, to keep things consistent.
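
A sketch of the comparison. Glacier actually computes a SHA256 tree hash; a plain SHA256 is shown here only to illustrate the check itself:

package cloud

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
)

// verifyChecksum hashes the archive locally and compares the result with the
// checksum returned by the upload call.
func verifyChecksum(archive io.Reader, expected string) error {
	hash := sha256.New()
	if _, err := io.Copy(hash, archive); err != nil {
		return err
	}
	if got := hex.EncodeToString(hash.Sum(nil)); got != expected {
		return fmt.Errorf("error comparing checksums: got %s, want %s", got, expected)
	}
	return nil
}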

Check audit file consistency

When retrieving the remote list of backups, we could compare it with the local audit file and report the differences. Maybe we could replace the local information with what we get from AWS Glacier.

Wrong multipart upload part size

When uploading big files with the multipart strategy we receive the following error message from AWS API:

InvalidParameterValueException: Invalid part size: 4096. Part size must be a power of two and be between 1048576 and 4294967296 bytes. status code: 400, request id: ...

The problem is that we wrongly thought the attribute InitiateMultipartUploadInput.PartSize stored the value in KB rather than in bytes. But looking at the documentation we can clearly see:

The size of each part except the last, in bytes. The last part can be smaller
than this part size.
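
The validation implied by the error message, expressed in Go (a sketch; the bounds are the ones quoted by the AWS API):

package cloud

// validPartSize reports whether a multipart part size, in bytes, is a power
// of two between 1MiB (1048576) and 4GiB (4294967296).
func validPartSize(size int64) bool {
	const (
		minPart = 1 << 20 // 1048576 bytes
		maxPart = 4 << 30 // 4294967296 bytes
	)
	return size >= minPart && size <= maxPart && size&(size-1) == 0
}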

Manage backups with sub commands

Create subcommands to perform some actions related to backups:

  • sync - back up the desired folder(s) now
  • remove - remove a specific backup or all older backups
  • list - show the backups from the AWS Glacier service

Error on file backup

Today the tool only accepts directories for backups, but the user could be interested in backing up just one file.

Keep consistency of archive information after backup removal

When removing a backup from the cloud and from the local storage, we should look for references to this backup in the archive information of all other backups. Then we could take two approaches:

  • Remove the reference
  • Replace it with the most recent version of that file

To decide between the approaches we could use a flag (fallback).

Remove many backups at once

The command line could receive many IDs at once. We could add some safeguard to avoid removing all backups, or add a yes/no confirmation.

Remove old archives based on audit file

Today we check the list of archives remotely in AWS Glacier to decide which ones will be removed. For that reason the command takes too long to run. Why don't we just look at the local audit file?

Invalid Content-Range in multipart upload

We are sending the Content-Range header without the bytes unit. This is causing the error:

InvalidParameterValueException: Invalid Content-Range: 0-4194304/1554187834
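
The fix is to include the bytes unit; note also that the end offset of an HTTP range is inclusive, so a 4MB first part covers 0-4194303. A sketch of the header value construction:

package cloud

import "fmt"

// contentRange builds a Content-Range value such as
// "bytes 0-4194303/1554187834" for the part starting at the given offset.
func contentRange(start, partSize, total int64) string {
	end := start + partSize - 1
	if end >= total {
		end = total - 1 // the last part can be smaller
	}
	return fmt.Sprintf("bytes %d-%d/%d", start, end, total)
}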

Support Google Cloud Platform

There's a service similar to AWS Glacier in the Google Cloud Platform called Nearline. It is good to have multiple cloud companies as options and let the user decide which one is better. The SDK for developing it is available on GitHub.

In the future we could automatically detect the cheapest cloud company to send the backups to.

Send e-mail reports

When running the scheduler it is a good idea to send reports to make sure that everything is working well.

Checksum error doesn't store the backup information

When the uploaded archive checksum doesn't match what we expected, we only report the error (error comparing checksums) and don't store the backup information of the uploaded archive. So we can't remove the archive from the cloud without listing the remote backups (which can take a while).

We should always store the information in this situation, or automatically remove the backup from the cloud when there's an error like this one.

Don't download unmodified files

To minimize the number of backup parts downloaded from the cloud, we should verify which files of the desired backup didn't change in the filesystem.

Update tool automatically

It would be great if the tool could check for updates and update itself. In Linux and FreeBSD environments we could use a package repository to guarantee this, but I'm not sure if there's something similar in Windows environments.

Attention: we should be very careful with security here. We need to make sure that we are downloading a genuine new version of the tool.

Tarball with repeated files

When building a tarball from multiple paths, if both paths contain the same filename, the entry will be repeated in the tarball. The example below shows the problem with file1:

$ tar -tvf /tmp/toglacier-189756224
drwx------ 1000/1000         0 2017-05-01 11:13 backup-20170501111339/
drwxrwxr-x 1000/1000         0 2017-05-01 11:13 backup-20170501111339/dir1/
-rwxrwxr-x 1000/1000        10 2017-05-01 11:13 backup-20170501111339/dir1/file3
-rwxrwxr-x 1000/1000        10 2017-05-01 11:13 backup-20170501111339/file1
-rwxrwxr-x 1000/1000        10 2017-05-01 11:13 backup-20170501111339/file2
drwx------ 1000/1000         0 2017-05-01 11:13 backup-20170501111339/
drwxrwxr-x 1000/1000         0 2017-05-01 11:13 backup-20170501111339/dir2/
-rwxrwxr-x 1000/1000        10 2017-05-01 11:13 backup-20170501111339/dir2/file6
-rwxrwxr-x 1000/1000        18 2017-05-01 11:13 backup-20170501111339/file1
-rwxrwxr-x 1000/1000        10 2017-05-01 11:13 backup-20170501111339/file4
-rwxrwxr-x 1000/1000        10 2017-05-01 11:13 backup-20170501111339/file5

The main problem is that we strip the path in the build method of tar_builder.go:

if path == source && !info.IsDir() {
	// when we are building an archive of a single file, we cannot remove the
	// source path
	header.Name = filepath.Join(baseDir, filepath.Base(path))

} else {
	header.Name = filepath.Join(baseDir, strings.TrimPrefix(path, source))
}
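
One possible fix, sketched under the assumption that keeping the base name of each source directory is acceptable: strip only the parent of the source, so entries from different sources land in distinct subdirectories:

if path == source && !info.IsDir() {
	// when we are building an archive of a single file, we cannot remove the
	// source path
	header.Name = filepath.Join(baseDir, filepath.Base(path))

} else {
	// keep the last element of the source path to avoid name clashes
	header.Name = filepath.Join(baseDir, strings.TrimPrefix(path, filepath.Dir(source)))
}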

Remove old backups

We could automatically remove old backups to make our AWS invoice cheaper. But we need to decide on the best algorithm: keep only the last 5 backups? Remove backups older than a year?
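
A sketch of the "keep only the last N" strategy, assuming a Backup type with a CreatedAt field (illustrative names, not the project's actual types):

package storage

import (
	"sort"
	"time"
)

// Backup is a minimal record for illustration.
type Backup struct {
	ID        string
	CreatedAt time.Time
}

// backupsToRemove returns the backups that fall outside the keepBackups
// newest entries.
func backupsToRemove(backups []Backup, keepBackups int) []Backup {
	sort.Slice(backups, func(i, j int) bool {
		return backups[i].CreatedAt.After(backups[j].CreatedAt) // newest first
	})
	if len(backups) <= keepBackups {
		return nil
	}
	return backups[keepBackups:]
}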

Config documentation

Add some inline documentation to the YAML configuration file. This will make life easier for new users.

Allow multiple backup paths

We should allow a list of paths in the TOGLACIER_PATH environment variable, separated by commas, for example:

AWS_ACCOUNT_ID="000000000000" \
AWS_ACCESS_KEY_ID="AAAAAAAAAAAAAAAAAAAA" \
AWS_SECRET_ACCESS_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
AWS_REGION="us-east-1" \
AWS_VAULT_NAME="backup" \
TOGLACIER_PATH="/usr/local/myimportantfiles1,/usr/local/myimportantfiles2" \
TOGLACIER_AUDIT="/var/log/toglacier/audit.log" \
TOGLACIER_KEEP_BACKUPS="10" \
toglacier &>> /var/log/toglacier/error.log

When building the tarball with multiple files, it is good practice to add a root folder; this avoids making a mess when extracting the tarball. We could use backup-<date> as the root name.

Allow using a YAML file for configuration

It's time to get some flexibility and allow the user to provide the configuration via a YAML file. There are some operating systems out there (I will not name them) that add extra complexity if you want to set environment variables for your program, so let's make everyone happy.

Error listing backups when the audit file doesn't exist

When the audit file doesn't exist, the list or ls command returns the error error opening the audit file. details: open ...: no such file or directory. When the file doesn't exist the tool should just say that there are no backups or, in the case of the remote fetch, create the audit file.

Move tool to cmd subdirectory

As we plan to add more binaries to the toglacier project, we need to reorganize everything so the main tool isn't in the project root.

Encrypt secrets

We could encrypt the input to improve security. A good idea is to detect encrypted content by the prefix encrypted:; this would maintain backward compatibility.

AWS_ACCOUNT_ID="encrypted:eeBont5Lzlre4cxDi8QT/M6EbAGxTerniqywbpLBVA=" \
AWS_ACCESS_KEY_ID="encrypted:u8n8ds0Ssf/AdJCxpOG7AQ==" \
AWS_SECRET_ACCESS_KEY="encrypted:YPbbO6nZbl992kPRFzTgCRszKNPtlk" \
AWS_REGION="us-east-1" \
AWS_VAULT_NAME="backup" \
TOGLACIER_PATH="/usr/local/important-files" \
TOGLACIER_AUDIT="/var/log/toglacier/audit.log" \
TOGLACIER_KEEP_BACKUPS="10" \
toglacier start &>> /var/log/toglacier/error.log

To encrypt the secret we could change the program to add subcommands, as follows:

toglacier encrypt <value>

  Encrypts a value with an internal master key. It wouldn't need any environment variables.


toglacier start

  The usual scheduler. You already know that this will need many environment variables.

In the future we could also add a subcommand to retrieve the latest backup file (useful when shit happens on your server). Also, after doing this we could start allowing a configuration file as input (to add another option for those who don't like environment variables).

Error creating TAR file

When creating an archive with multiple files, I got the error:

error writing header in tar fo file ... details: archive/tar: missed writing 165 bytes.

I don't know yet what causes this problem. It was found in a Windows environment.

Log inside library

Inject a logger into the library to add informational and debug-level messages. This is useful when debugging a problem, to understand what is going on.

For now we could use the following log interface:

type Log interface {
	Debug(args ...interface{})
	Debugf(format string, args ...interface{})
	Info(args ...interface{})
	Infof(format string, args ...interface{})
}

Windows setup

We need to simplify the Windows setup by building an easy installation package.

Some features that this installation should have:

  • Ask for AWS/GCS credentials and store them in a configuration file or in environment variables
  • Extract files and directories
  • Create and run a background service for toglacier start

Update local storage after download

We should also synchronize the archive information after downloading a backup part. This will make sure that our local storage is consistent with what we found in the cloud.

Remove TAR after download

After we download a backup part we extract the content but leave the tarball in the filesystem (temporary directory). We should remove it from disk once it has been extracted.

Reports in HTML format

Reports are pretty ugly today in plain format. We could improve this by using HTML, making the plain format optional.

Isolate project into a library

We could keep our tool in a cmd path in the project and try to make everything else work as a library.

Pros:

  • Test everything
  • Document everything
  • Better software design

Cons:

  • If you want a library, why don't you use the AWS library directly?

Store backup information in a local database

We could improve how we store the backup information: instead of keeping track in an audit file, we could use a simple database like Bolt. This would make our life easier when managing existing backups (you don't want to write an audit file parser, do you?).
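
A minimal sketch of what storing a record in Bolt could look like, assuming the github.com/boltdb/bolt package and a JSON-encoded record (the Backup fields are illustrative):

package storage

import (
	"encoding/json"
	"time"

	"github.com/boltdb/bolt"
)

// Backup is a minimal record for illustration; the real type stores more.
type Backup struct {
	ID        string
	CreatedAt time.Time
	Checksum  string
}

// saveBackup persists one backup record in a "backups" bucket, keyed by ID.
func saveBackup(db *bolt.DB, b Backup) error {
	return db.Update(func(tx *bolt.Tx) error {
		bucket, err := tx.CreateBucketIfNotExists([]byte("backups"))
		if err != nil {
			return err
		}
		encoded, err := json.Marshal(b)
		if err != nil {
			return err
		}
		return bucket.Put([]byte(b.ID), encoded)
	})
}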

Use a custom error type

Many parts of the internal library return errors using fmt.Errorf or errors.New. This is a bad practice, as an error message built this way is useful only for humans.
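
A sketch of what a custom error type could look like: a machine-checkable code plus the underlying error, instead of an opaque formatted string (names are illustrative):

package toglacier

import "fmt"

// ErrorCode identifies the class of failure so callers can react to it.
type ErrorCode string

const (
	// ErrorCodeOpeningFile is a hypothetical example code.
	ErrorCodeOpeningFile ErrorCode = "opening-file"
)

// Error wraps a low-level error with a code that programs can test.
type Error struct {
	Code ErrorCode
	Err  error
}

func (e *Error) Error() string {
	return fmt.Sprintf("toglacier: %s. details: %v", e.Code, e.Err)
}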

Add logs to reports

It would be useful to have all logs from an action performed by toglacier in the report.

Automatically increase the command line version

For the command line tool, we always need to remember to increase the version number when deploying. It would be great if this were automated by the package generation scripts.

The current command line version is wrong (2.0.0); it should be 3.0.0.
