Comments (9)
Clearly at least one of your rows has more than one value in the client_mac field.
While crate is very strict about single values vs. arrays (string vs. array(string)) on inserts and updates, we made a tradeoff between accuracy and performance, in favour of performance, when it comes to importing data. The usual field validation does not take place when using the copy from statement (e.g. if you give a string for a timestamp it won't be converted to long, and the same goes for arrays).
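Just for context, the import we're talking about is a plain copy from statement along these lines (the file path is only a placeholder):

-- path is just a placeholder; copy from reads one JSON document per line
-- and skips the per-field validation that insert/update would do
copy streams from '/tmp/streams_dump.json';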
So you have to make sure that the client_mac field from your mongodb collection only contains a single value, not an array.
How long does your nightly copy from import take? And your MongoDB export?
from crate.
We have v. small mongo machines collecting the data and not much control over what's imported. Having said that, it's the same sources so the macs shouldn't be anything other than strings.
It took 5 mins to export, a couple to gzip and I cancelled the job after 5 minutes importing about 250,000 rows (on a crappy dev. machine to test).
The problem is that scrubbing the data every day (around 20 million rows per day) would be too time consuming. But I might have to take a performance hit when importing the data - especially as we might want to clean other fields first.
We had problems with ES and mac addresses previously because of the colons - we even tried replacing them with hyphens, with no success. Mac addresses are fun...
from crate.
I think the problem with ES and mac addresses is that they analyze the column by default, stripping numbers and colons and stuff. crate by default analyzes its columns with the keyword analyzer, interpreting the whole content as one term. So colons and hence mac addresses shouldn't be a problem with crate.
Can you provide the CREATE TABLE statement you used for your import? Did you use one at all? If so, did you specify an analyzer on the client_mac column? Maybe that's the problem.
Could you even provide some sample rows from your mongodump (anonymized or randomized if it contains anything confidential or such)?
Looking forward to getting your mongo data correctly into crate! :)
from crate.
Yeah us too! Looks excellent though so far.
That sounds about right - for our production application we've had to create a multi-field mapping for the macs, one of them unindexed as it breaks stuff.
Sample JSON here:
https://gist.github.com/simonmorley/8dd81b1456d1d366dcff
The create command would have been something like this:
create table streams (
created_at timestamp,
ap_mac string INDEX using fulltext,
rssi integer,
client_mac string INDEX using fulltext,
secret string INDEX using fulltext,
id string primary key
);
Looking at it, I wonder if I could create with ap_mac as a non-indexed string?
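Something like this, maybe (just guessing at the syntax for switching indexing off per column):

create table streams (
created_at timestamp,
ap_mac string INDEX OFF, -- stored but not indexed
rssi integer,
client_mac string INDEX using fulltext,
secret string INDEX using fulltext,
id string primary key
);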
from crate.
I've deleted the table and recreated it with no index on that field and am just waiting for the rows to import. Don't want to kill it in case I truncate a row prematurely, so I'll sit it out.
On a virtual machine with 4GB RAM and an SSD, it's taken 15+ minutes to copy 1 million rows.
Can't currently test the query while it's importing though - waits and waits understandably.
from crate.
ah, i can see clearly now, the rain has gone!
We actually didn't make that as clear as it should be in the documentation.
You can only group by on indexed columns, as stated here: https://crate.io/docs/stable/sql/dml.html#group-by. By default every column is indexed, and you can disable indexing with INDEX OFF if you really need to.
What we forgot to say is that you cannot group by on columns that use a fulltext index, because internally such a column can contain more than one value (the content is split up into terms) - that's what raised your error.
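So a grouping query like the one below (just assuming that's roughly what you ran) fails on a fulltext-indexed client_mac column, but works once the column uses the default plain index:

-- fails while client_mac has a fulltext index, works with the default plain index
select client_mac, count(*) as readings
from streams
group by client_mac;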
You should not use a fulltext index on your mac address fields, as you shouldn't need to do full text search on them. Here is a table schema that should work for you:
create table streams (
created_at timestamp,
ap_mac string,
rssi integer,
client_mac string,
secret string,
id string primary key
) with (number_of_replicas=0);
Using the fulltext index will use the standard analyzer from elasticsearch - so you would carry over the same bad behaviour for mac addresses from elasticsearch. :(
One more tip: you should create your table with number_of_replicas=0 and increase it afterwards. Otherwise every row will also be copied (replicated) on your local machine during the import. This trick will significantly speed up your import.
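Once the import is done you can simply raise the replica count again, for example:

-- bring replication back up after the bulk import has finished
alter table streams set (number_of_replicas = 1);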
from crate.
haha. I can't see clearly, it's too sunny here!
Ok, that makes good sense. The docs are good, but they obviously don't have the backing of a load of community support (yet).
That's how I just imported it :) 4 million down.
Thanks for the tip re. replicas. Am not sure how that's going to fare importing this much every day.
Do you have any guidance for working with big data sets? Whilst I still don't really consider what we're doing 'big data' - and I hate that term - I think we're going to be around 6-8TB in a year.
from crate.
The documentation is updated in master, see #874.
Will be released soon.
Can you close the issue if the new schema works for you?
Let's discuss via IRC on all the other topics. :)
from crate.
Deal, see you there :)
Import worked fine, thanks.
from crate.