FaunaDB Importer

FaunaDB Importer is a command line utility to help you import static data into FaunaDB. It can import data into FaunaDB Cloud or an on-premises FaunaDB Enterprise cluster.

Supported input file formats:

  • JSON
  • CSV
  • TSV

Requirements:

  • Java 8

Usage

Download the latest version and extract the zip file. Inside the extracted folder, run:

./bin/faunadb-importer \
  import-file \
  --secret <keys-secret> \
  --class <class-name> \
  <file-to-import>

NOTE: The command line arguments are the same on Windows, but you must use a different startup script. For example:

.\bin\faunadb-importer.bat import-file --secret <keys-secret> --class <class-name> <file-to-import>

For example:

./bin/faunadb-importer \
  import-file \
  --secret "abc" \
  --class users \
  data/users.json

The importer will load all data into the specified class, preserving the field names and types as described in the import file.
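
For instance, a minimal data/users.json for the example above might look like the following (the field names and values are purely illustrative, and each record is assumed to be its own JSON object; if your file's root element is an array, see the --skip-root option below):

{ "id": "1", "username": "alice", "vip": true }
{ "id": "2", "username": "bob", "vip": false }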

You can also type ./bin/faunadb-importer --help for more detailed information.

How it works

The importer is a stateful process separated into two phases: ID generation and data import.

First, the importer will parse all records and generate a unique ID for each one by calling the next_id function. Pre-generating IDs allows us to import schemas containing relational data while keeping foreign keys consistent. It also ensures that we can safely re-run the process without the risk of duplicating information.

In order to map legacy IDs to newly generated Fauna IDs, the importer will:

  • Check if there is a field configured with the ref type. The field's value will be used as the lookup term for the new Fauna ID.
  • If no field is configured with the ref type, the importer will assign a sequential number for each record as the lookup term for the new Fauna ID.

Once this phase completes, the pre-generated IDs are stored in the cache directory. On a re-run, the importer will load the IDs from disk and skip this phase.

Second, the importer will insert all records into FaunaDB, using the pre-generated IDs from the first step as their ref field.

During this phase, if the import fails due to data inconsistencies, it is:

  • SAFE to fix data inconsistencies in any field except fields configured with the ref type.
  • NOT SAFE to change fields configured with the ref type as they will be used as the lookup term for the pre-generated ID from the first phase.
  • NOT SAFE to remove entries from the import file if you don't have a field configured as a ref field; this will alter the sequential number assigned to the record.

As long as you keep the cache directory intact, it is safe to re-run the process until the import completes. If you want to use the importer again with a different input file, you must empty the cache directory first.

File structure

.
├── README.md                    # This file
├── bin                          #
│   ├── faunadb-importer         # Unix startup script
│   └── faunadb-importer.bat     # Windows startup script
├── cache                        # Where the importer saves its cache
├── data                         # Where you should copy the files you wish to import
├── lib                          #
│   └── faunadb-importer-1.0.jar # The importer library
└── logs                         # Logs for each execution

Advanced usage

Configuring fields

When importing JSON files, field names and types are optional; when importing text files, you must specify each field's name and type in order using the --format option:

./bin/faunadb-importer \
  import-file \
  --secret "<your-keys-secret-here>" \
  --class <your-class-name> \
  --format "<field-name>:<field-type>,..." \
  <file-to-import>

For example:

./bin/faunadb-importer \
  import-file \
  --secret "abc" \
  --class users \
  --format "id:ref, username:string, vip:bool" \
  data/users.csv
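
With that format string, a matching data/users.csv would have three columns in the same order (sample values are illustrative):

1,alice,true
2,bob,false

If the file starts with a header line, skip it with the --skip-root option described below.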

Supported types:

  • string: A string value.
  • long: A numeric (integer) value.
  • double: A double-precision numeric value.
  • bool: A boolean value.
  • ref: A ref value. It can be used to mark the field as a primary key or to reference another class when importing multiple files. For example: city:ref(cities)
  • ts: A numeric value representing the number of milliseconds elapsed since 1970-01-01 00:00:00. You can also specify your own format as a parameter. For example: ts("dd/MM/yyyyTHH:mm:ss.000Z")
  • date: A date value formatted as yyyy-MM-dd. You can also specify your own format as a parameter. For example: date("dd/MM/yyyy")
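
As an illustration (the class name, field names, and file are hypothetical), a format string combining several of these types might look like:

./bin/faunadb-importer \
  import-file \
  --secret "abc" \
  --class events \
  --format 'id:ref, city:ref(cities), created_at:ts("dd/MM/yyyyTHH:mm:ss.000Z"), birthday:date("dd/MM/yyyy")' \
  data/events.csv

The format value is wrapped in single quotes here so that the inner double quotes reach the importer unchanged.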

Renaming fields

You can rename fields from the input file as they are inserted into FaunaDB with the following syntax:

<field-name>-><new-field-name>:<field-type>

For example:

./bin/faunadb-importer \
  import-file \
  --secret "abc" \
  --class users \
  --format "id:ref, username->userName:string, vip->VIP:bool" \
  data/users.csv

Ignoring root element

When importing a JSON file where the root element of the file is an array, or when importing a text file where the first line is the file header, you can skip the root element with the --skip-root option. For example:

./bin/faunadb-importer \
  import-file \
  --secret "abc" \
  --class users \
  --skip-root true \
  data/users.csv

Ignoring fields

You can ignore fields with the --ignore-fields option. For example:

./bin/faunadb-importer \
  import-file \
  --secret "abc" \
  --class users \
  --format "id:ref, username->userName:string, vip->VIP:bool" \
  --ignore-fields "id" \
  data/users.csv

NOTE: In the above example, we omit the id field when importing the data into FaunaDB, but we still use the id field as the ref type so that the importer tool will properly map the newly-generated Fauna ID for that specific user.

How to maintain data in chronological order

You can maintain chronological order when importing data by using the --ts-field option. For example:

./bin/faunadb-importer \
  import-file \
  --secret "abc" \
  --class users \
  --ts-field "created_at" \
  data/users.csv

The value configured in the --ts-field option will be used as the ts field for the imported instance.

Importing to your own cluster

By default, the importer will load your data into FaunaDB Cloud. If you wish to import the data to your own cluster, you can use the --endpoints option. For example:

./bin/faunadb-importer \
  import-file \
  --secret "abc" \
  --class users \
  --endpoints "http://10.0.0.120:8443, http://10.0.0.121:8443" \
  data/users.csv

NOTE: The importer will load balance requests across all configured endpoints.

Importing multiple files

In order to import multiple files, you must run the importer with a schema definition file. For example:

./bin/faunadb-importer \
  import-schema \
  --secret "abc" \
  data/my-schema.yaml

Schema definition syntax

<file-address>:
  class: <class-name>
  skipRoot: <boolean>
  tsField: <field-name>
  fields:
    - name: <field-name>
      type: <field-type>
      rename: <new-field-name>
  ignoredFields:
    - <field-name>

For example:

data/users.json:
  class: users
  fields:
    - name: id
      type: ref

    - name: name
      type: string

  ignoredFields:
    - id

data/tweets.csv:
  class: tweets
  tsField: created_at
  fields:
    - name: id
      type: ref

    - name: user_id
      type: ref(users)
      rename: user_ref

    - name: text
      type: string
      rename: tweet

  ignoredFields:
    - id
    - created_at

Performance considerations

The importer's default settings should provide good performance in most cases. Still, a few things are worth mentioning:

Memory

You can set the maximum amount of memory available to the import tool with -J-Xmx. For example:

./bin/faunadb-importer \
  -J-Xmx10G \
  import-schema \
  --secret "abc" \
  data/my-schema.yaml

NOTE: Parameters prefixed with -J must be placed as the first parameters for the import tool.

Batch sizes

The size of each individual batch is controlled by the --batch-size parameter.

In general, individual requests will have a higher latency with a larger batch size. However, the overall throughput of the import process may increase by inserting more records in a single request.

Large batches can exceed the maximum size of an HTTP request, forcing the import tool to split the batch into smaller requests and thereby degrading overall performance.

Default: 50 records per batch.
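
For example, to try a larger batch size (the class and file names are placeholders):

./bin/faunadb-importer \
  import-file \
  --secret "abc" \
  --class users \
  --batch-size 100 \
  data/users.csv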

Managing concurrency

Concurrency is configured using the --concurrent-streams parameter.

A large number of concurrent streams can cause timeouts. When timeouts happen, the import tool will retry the failed requests, applying exponential backoff to each retry.

Default: the number of available processors * 2
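
For example, to limit the importer to four concurrent streams (the class and file names are placeholders):

./bin/faunadb-importer \
  import-file \
  --secret "abc" \
  --class users \
  --concurrent-streams 4 \
  data/users.csv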

Backoff configuration

Exponential backoff is a combination of the following parameters:

  • network-errors-backoff-time: The number of seconds to delay new requests when the network is unstable. Default: 1 second.
  • network-errors-backoff-factor: The factor by which network-errors-backoff-time is multiplied for each network issue detected; the resulting delay will not exceed max-network-errors-backoff-time. Default: 2.
  • max-network-errors-backoff-time: The maximum number of seconds to delay new requests when applying exponential backoff. Default: 60 seconds.
  • max-network-errors: The maximum number of network errors tolerated within the configured timeframe. Default: 50 errors.
  • reset-network-errors-period: The number of seconds the import tool will wait for a new network error before resetting the error count. Default: 120 seconds.
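
Assuming these parameters are passed as command-line options like the ones above (a sketch; check ./bin/faunadb-importer --help for the exact option names), a more tolerant backoff configuration might look like:

./bin/faunadb-importer \
  import-file \
  --secret "abc" \
  --class users \
  --network-errors-backoff-time 2 \
  --max-network-errors-backoff-time 120 \
  data/users.csv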

License

All projects in this repository are licensed under the Mozilla Public License

faunadb-importer's Issues

Import of simple JSON File to faunadb-developer leads to error

For testing purposes I tried to import a very simple JSON file into the developer version of faunadb:

.\bin\faunadb-importer.bat import-file --endpoints http://localhost:8443 --secret ABCDEF --class simple test.json

Here is my import file (test.json):

[{
	"name": "ABC",
	"address": {
		"city": "Wien",
		"country": "AT",
		"street": "Testgasse 4",
		"zip": "1210"
	}
},
{
	"name": "DEF",
	"address": {
		"city": "Wien",
		"country": "AT",
		"street": "Testgasse 10",
		"zip": "1010"
	}
}
]

The error:

[2017-09-21 22:02:24] validation failed: Instance data is not valid. at line: 1, column: 1: [{ "name": "ABC", "address": { "city": "Wien", "country": "AT", "street": "Testgasse 4", "zip": "1210" } }, { "name": "DEF", "address": { "city": "Wien", "country": "AT", "street": "Testgasse 10", "zip": "1010" } }]

Upgrade faunadb driver

Version 1.2.0 of the Scala driver will allow us to:

  • Upgrade Scala to 2.12.x
  • Upgrade the Jackson version to 1.8 (which has some performance improvements)

Handle possible errors from FaunaDB

  • Apply backpressure when service is unavailable
  • Handle errors when the request is too large due to huge batch sizes
  • Implement a split and retry strategy for batches in order to find the error for a single instance rather than marking the whole batch as failed

Cache not cleared when running import

If errors occurred during import, the next import attempt will fail too, because the temp working file (cache) is not cleared. I need to manually delete the file to make it work.

Importer needs a good readme

  • Usage instructions
  • Explain field syntax
  • Explain yaml file syntax
  • Explain constraints (e.g., do not remove a line from the import file if you're using row numbers as IDs)

Field marked as "ref" seems to be ignored

Hi,
I have a file with lines of JSON like:

{ "id": "3924", "name": "Mary" }

I can upload it with no problems, but the "id" field that I mark as ref using the parameter --format "id:ref" is ignored, and the record gets a different, automatically generated ref like "198480451673784842":

Ref:   q.Ref("classes/my-class/198480451673784842")
Class: q.Ref("classes/my-class")
TS:    1525544559283937
Data:  { "id": "3924", "name": "Mary" }

Am I missing something?
Thank you

Importing arrays of ref

Hi

We are looking to import CSV files with arrays of refs. Is there a way to do this with the importer?

Example schema with a ref to another file (the ref is currently a single ref, but we have records with multiple refs):

data/test1.csv:
  class: test1
  fields:
    - name: id
      type: ref

    - name: name
      type: string

  ignoredFields:
    - id

data/test2.csv:
  class: test2
  fields:
    - name: user_id
      type: ref(test1) # We would like this to be an array of brands

Thank you in advance!

Simplify status line

Right now it's confusing. I was thinking about something simpler and waiting for feedback. For example:

Completed: 42% Errors: 2 RPS: 123 Latency(client/server): 12.2/5.2
