Comments (8)
@brancz @metalmatze we've managed to get a good performance boost from this parallelism. I wrote a benchmark program (see below) and found that the new code path is about 50% faster.
I found that the code in the table iterator was a bit of a bottleneck, so by parallelizing it we're able to take advantage of parallelism in the execution of the physical plan.
https://github.com/polarsignals/arcticdb/pull/90/files#diff-134c48f173480a9772719053c1b3e967756a71c75f84ed4f94f12c71cd91ed30R319
I also tested varying the level of parallelism. Adding more horizontal parallelism improves query performance up to a point: the improvements seemed to taper off once the number of workers reached approximately the number of CPU cores on my machine minus one (which is probably expected in a highly CPU-bound workload like this benchmark).
How does this all sound to y'all? Does that sound like a good performance improvement? Would we expect more improvement from this technique?
The code in the PR is still a WIP (I need to fix/write some unit tests).
The raw data and the detailed analysis results can be found here (for reference)
https://docs.google.com/spreadsheets/d/1782MYsiM3rHqb3VK77EJ9VpLxM3ET-WcxROIFQ686Vs/edit#gid=0
These numbers are from running this test on my MacBook, which has 12 CPU cores.
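As a minimal illustration of that observation, a worker pool could simply be capped at `runtime.NumCPU() - 1`. The helper names below (`numWorkers`, `runParallel`) are hypothetical, not from the PR; this is just a sketch of sizing workers to the core count:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// numWorkers caps horizontal parallelism at NumCPU-1, the point past which
// the benchmark showed diminishing returns. (Hypothetical helper, not from
// the PR.)
func numWorkers() int {
	n := runtime.NumCPU() - 1
	if n < 1 {
		n = 1
	}
	return n
}

// runParallel fans work items out to numWorkers() goroutines and sums the
// results, as a stand-in for a CPU-bound scan.
func runParallel(items []int, f func(int) int) int {
	in := make(chan int)
	out := make(chan int)
	var wg sync.WaitGroup
	for i := 0; i < numWorkers(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for v := range in {
				out <- f(v)
			}
		}()
	}
	go func() {
		for _, v := range items {
			in <- v
		}
		close(in)
	}()
	go func() {
		wg.Wait()
		close(out)
	}()
	total := 0
	for v := range out {
		total += v
	}
	return total
}

func main() {
	fmt.Println(runParallel([]int{1, 2, 3, 4}, func(v int) int { return v * v }))
	// prints 30
}
```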
Here's the benchmarking program I used (also for reference)
```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"

	"github.com/apache/arrow/go/v8/arrow"
	"github.com/apache/arrow/go/v8/arrow/memory"
	log2 "github.com/go-kit/log"
	"github.com/go-kit/log/level"
	"github.com/segmentio/parquet-go"

	"github.com/polarsignals/arcticdb"
	"github.com/polarsignals/arcticdb/dynparquet"
	"github.com/polarsignals/arcticdb/query"
	"github.com/polarsignals/arcticdb/query/logicalplan"
)

func main() {
	var err error
	columnstore := arcticdb.New(nil, 10, 512*1024*1024).WithIndexDegree(3)
	db, err := columnstore.DB("test_db")
	if err != nil {
		panic(err)
	}

	logger := log2.NewLogfmtLogger(log2.NewSyncWriter(os.Stderr))
	logger = level.NewFilter(logger, level.AllowDebug())

	schema := dynparquet.NewSchema("test_schema",
		[]dynparquet.ColumnDefinition{
			{
				Name:          "Column1",
				StorageLayout: parquet.Encoded(parquet.String(), &parquet.RLEDictionary),
				Dynamic:       false,
			},
			{
				Name:          "Column2",
				StorageLayout: parquet.Int(64),
				Dynamic:       false,
			},
		},
		[]dynparquet.SortingColumn{
			dynparquet.Ascending("Column1"),
		})

	tableConfig := arcticdb.NewTableConfig(schema)
	table, err := db.Table("test_table", tableConfig, logger)
	if err != nil {
		panic(err)
	}

	buffer, err := schema.NewBuffer(map[string][]string{})
	if err != nil {
		panic(err)
	}

	log.Printf("num cores = %d\n", runtime.NumCPU())
	log.Println("begin writing table")
	for i := 0; i < 250_000; i++ {
		row := make([]parquet.Value, 0)
		message := fmt.Sprintf("hello%d", i)
		row = append(row, parquet.ValueOf(message).Level(0, 0, 0))
		row = append(row, parquet.ValueOf(10).Level(0, 0, 1))
		_, err = buffer.WriteRows([]parquet.Row{row})
		if err != nil {
			panic(err)
		}
		// Flush the buffer into the table every 1000 rows.
		if i > 0 && i%1000 == 0 {
			_, err = table.InsertBuffer(context.Background(), buffer)
			if err != nil {
				panic(err)
			}
			buffer, err = schema.NewBuffer(map[string][]string{})
			if err != nil {
				panic(err)
			}
		}
	}
	_, err = table.InsertBuffer(context.Background(), buffer)
	if err != nil {
		panic(err)
	}
	table.Sync()
	log.Println("done writing table")

	queryEngine := query.NewEngine(memory.DefaultAllocator, db.TableProvider())

	f, perr := os.Create("cpu.pprof")
	if perr != nil {
		log.Fatal(perr)
	}
	if err = pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	for j := 0; j < 100; j++ {
		for i := 1; i < 20; i++ {
			os.Setenv("parallelism", fmt.Sprintf("%d", i)) // this is what I was using to vary the parallelism
			query := queryEngine.ScanTable("test_table").
				Filter(
					&logicalplan.BinaryExpr{
						Left:  logicalplan.Col("Column1"),
						Op:    logicalplan.RegExpOp,
						Right: logicalplan.Literal("hello.*"),
					},
				)
			start := time.Now().UnixMicro()
			err = query.Execute(context.Background(), func(ar arrow.Record) error {
				log.Printf("%v", ar)
				return nil
			})
			end := time.Now().UnixMicro()
			log.Printf("parallel = %d | trial = %d | time = %d us\n", i, j, end-start)
			if err != nil {
				log.Printf("query error: %v", err)
			}
		}
	}
	pprof.StopCPUProfile()
}
```
from frostdb.
I wasn't aware of this paper, thanks for sharing @yeya24! I've only cross-read the paper, but from what I can tell what @albertlockett proposed so far is actually in line with what's described in it, as it handles full arrow.Record frames, as opposed to one tuple at a time like you said.
About what Albert wrote:
Does that sound like it's the right approach? Is it too simplistic?
I think what you proposed is very elegant and should work quite well. Only practice will truly tell, but I think the direction looks promising!
Thanks @brancz, I'll move forward with this. I'll get the code cleaned up, add tests and error handling, and do some profiling to verify whether this approach improves performance and what level of parallelism makes sense.
Hey @brancz, I agree Volcano seems like a good fit.
To add parallelization to arcticdb, I think we could add an implementation of the exchange operator described in the paper. We could have optimizers in the logicalplan inject exchange operators at the appropriate places in the query tree.
I took a stab at implementing this. #90
The ExchangeOperator in this case uses channels instead of IPC as in the paper. The chan length is used to provide backpressure (whereas the paper describes using semaphores), and it spawns a number of goroutines listening on the channel to provide the horizontal parallelism (instead of using multiple queues as the paper describes).
Does that sound like it's the right approach? Is it too simplistic?
Of course, the code in the PR is still a WIP, but it gives us something to iterate on at least. Looking forward to your feedback!
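A minimal sketch of that idea, with strings standing in for record batches (none of these names are the actual types from PR #90): a buffered channel bounds how far the producer can run ahead of the workers, and N goroutines provide the horizontal parallelism.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// exchange fans a stream of batches out to `parallelism` worker goroutines
// and collects the results. The buffered channel's length bounds how far the
// producer can run ahead of the workers, playing the role the paper assigns
// to semaphores. (Illustrative sketch only.)
func exchange(batches []string, parallelism int, operate func(string) string) []string {
	in := make(chan string, parallelism) // chan length provides the backpressure
	out := make(chan string, parallelism)

	var wg sync.WaitGroup
	for i := 0; i < parallelism; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range in {
				out <- operate(b) // each worker runs the downstream operator
			}
		}()
	}

	// Producer: push batches into the exchange, then signal completion.
	go func() {
		for _, b := range batches {
			in <- b
		}
		close(in)
	}()
	// Close the output once every worker has drained its input.
	go func() {
		wg.Wait()
		close(out)
	}()

	var results []string
	for r := range out {
		results = append(results, r)
	}
	sort.Strings(results) // worker completion order is nondeterministic
	return results
}

func main() {
	fmt.Println(exchange([]string{"a", "b", "c"}, 2, func(s string) string { return s + "!" }))
	// prints [a! b! c!]
}
```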
Compared to Volcano's one-tuple-at-a-time iterator model, the vectorwise batching iterator model used by MonetDB seems better for columnar storage: https://www.cidrdb.org/cidr2005/papers/P19.pdf. It is used by TiDB and CRDB as well.
But since we are already using Arrow underneath, maybe Volcano is good enough.
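To illustrate the difference being discussed, here is a minimal sketch of the two iterator shapes, with `[]int` standing in for an arrow.Record (all types here are hypothetical):

```go
package main

import "fmt"

// Volcano pulls one value per Next call; a vectorized model pulls a whole
// batch per call, amortizing per-call overhead across many values.
type TupleIterator interface {
	Next() (int, bool) // one value per call
}

type BatchIterator interface {
	NextBatch() ([]int, bool) // a vector of values per call
}

// sliceBatches is a toy BatchIterator over pre-built batches.
type sliceBatches struct {
	batches [][]int
	i       int
}

func (s *sliceBatches) NextBatch() ([]int, bool) {
	if s.i >= len(s.batches) {
		return nil, false
	}
	b := s.batches[s.i]
	s.i++
	return b, true
}

// sumBatches consumes the iterator batch-at-a-time: one interface call per
// batch rather than one per value -- the vectorwise advantage.
func sumBatches(it BatchIterator) int {
	total := 0
	for {
		b, ok := it.NextBatch()
		if !ok {
			return total
		}
		for _, v := range b {
			total += v
		}
	}
}

func main() {
	it := &sliceBatches{batches: [][]int{{1, 2}, {3, 4, 5}}}
	fmt.Println(sumBatches(it))
	// prints 15
}
```

Since arrow.Record is already a batch of columns, passing records between operators gives much of this amortization for free, which supports the point that Volcano plus Arrow may be good enough.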
Here is a document with some more of my notes:
https://docs.google.com/document/d/1-9jX8XafLP3m5hLShSRh1hoMAEKrQTzqS0og2JCOlXg/edit#heading=h.vf1o5lbbldn6
That all sounds great and makes sense with the NumCPUs. I would have also been fine with just leaving it at that number and not subtracting 1, but since you've already figured it out, we can keep NumCPUs - 1.
There's one comment on that PR about errgroups and another nitpick. Any reason this is still a draft? Let's discuss the details over there.
Overall it sounds like the way forward!