
Comments (4)

matthiasfeist commented on August 19, 2024

Hi folks.
I finally had some downtime during the summer to do a small benchmark to compare performance.

First, my setup

I used the GTFS Sweden file from here: https://www.trafiklab.se/api/trafiklab-apis/gtfs-sweden/
The ZIP is ~900 MB and the individual files in the zip are as follows:

  12K  agency.txt
 205K  calendar.txt
 4.3M  calendar_dates.txt
 148B  feed_info.txt
 337K  routes.txt
 3.3G  shapes.txt
 689M  stop_times.txt
 9.7M  stops.txt
  18M  transfers.txt
  23M  trips.txt

Note: I had to delete the file attribution.txt because it was throwing an error for some reason. 🤷
I am running this very basic benchmark on my laptop: MacBook Pro, Apple M2 Pro, 32 GB RAM.

GTFS import with node-gtfs without modifications (baseline)

I am using version 4.13.1.
For the import I am running the following script:

import { importGtfs } from 'gtfs';
try {
  await importGtfs({
    "sqlitePath": "./gtfs.sqlite",
    "agencies": [{ "path": "./sweden/" }],
  });
} catch (error) {
  console.error(error);
}

Here's the output of that script, along with the timings from time node gtfs.js:

Starting GTFS import for 1 file using SQLite database at ./gtfs.sqlite
Importing GTFS from ./sweden/
Importing - agency.txt - 152 lines imported
Importing - areas.txt - No file found
Importing - attributions.txt - No file found
Importing - booking_rules.txt - No file found
Importing - calendar_dates.txt - 265808 lines imported
Importing - calendar.txt - 5559 lines imported
Importing - fare_attributes.txt - No file found
Importing - fare_leg_rules.txt - No file found
Importing - fare_media.txt - No file found
Importing - fare_products.txt - No file found
Importing - fare_rules.txt - No file found
Importing - fare_transfer_rules.txt - No file found
Importing - feed_info.txt - 1 lines imported
Importing - frequencies.txt - No file found
Importing - levels.txt - No file found
Importing - location_groups.txt - No file found
Importing - location_group_stops.txt - No file found
Importing - locations.geojson - No file found
Importing - networks.txt - No file found
Importing - pathways.txt - No file found
Importing - route_networks.txt - No file found
Importing - routes.txt - 6675 lines imported
Importing - shapes.txt - 67182455 lines imported
Importing - stop_areas.txt - No file found
Importing - stop_times.txt - 8517255 lines imported
Importing - stops.txt - 178665 lines imported
Importing - timeframes.txt - No file found
Importing - transfers.txt - 321250 lines imported
Importing - translations.txt - No file found
Importing - trips.txt - 379149 lines imported
Completed GTFS import for 1 agency
node gtfs.js  429.79s user 161.21s system 90% cpu 10:54.14 total

The resulting SQLite file is around 10 GB.

GTFS import with node-gtfs with changing PRAGMA journal_mode to WAL

 db.pragma("journal_mode = WAL");
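For context, here is a minimal sketch of where that pragma would go. It assumes direct access to the better-sqlite3 handle that node-gtfs uses internally; the exact insertion point inside node-gtfs is an assumption, not something I patched upstream:

```javascript
// Sketch: set WAL mode on the better-sqlite3 handle before the bulk inserts.
// (better-sqlite3 is node-gtfs's SQLite driver; where exactly node-gtfs
// opens the handle is an assumption here.)
import Database from 'better-sqlite3';

const db = new Database('./gtfs.sqlite');
db.pragma('journal_mode = WAL'); // write-ahead log instead of the rollback journal
```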

WAL mode is said to be much more performant; however, in my test I didn't see any significant improvement:

node gtfs.js  437.53s user 112.99s system 97% cpu 9:27.00 total

Import via DuckDB

Disclaimer: I have only imported the raw CSV files into the DB file using the following script. I know that the node-gtfs library is doing a bunch of stuff under the hood, so the comparison is absolutely not fair.

import { Database } from "duckdb-async";
import { resolve } from "node:path";

const db = await Database.create("./gtfs.duckdb");
const files = [
    "agency", "calendar", "calendar_dates", "feed_info",
    "routes", "shapes", "stop_times", "stops", "transfers",
    "trips",
];
for (const file of files) {
    console.time(file);
    const fullPath = resolve("./sweden/" + file + ".txt");
    // read_csv auto-detects the delimiter, header row and column types
    await db.exec(`CREATE TABLE ${file} AS FROM read_csv('${fullPath}');`);
    console.timeEnd(file);
}

and here's the output and the result of the "time" command:

agency: 1.834ms
calendar: 8.889ms
calendar_dates: 42.819ms
feed_info: 1.411ms
routes: 8.62ms
shapes: 3.716s
stop_times: 3.085s
stops: 124.628ms
transfers: 168.056ms
trips: 115.312ms
node duck.js  44.27s user 1.97s system 611% cpu 7.557 total

That's only about 10% of the CPU time of the baseline import (44.27 s vs. 429.79 s user time), and roughly 1% of the wall-clock time.
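The ratios work out as follows, with all numbers copied from the two time outputs above:

```javascript
// user-CPU-time ratio: DuckDB import vs. node-gtfs baseline
const baselineUser = 429.79; // seconds, "node gtfs.js"
const duckdbUser = 44.27;    // seconds, "node duck.js"
console.log(`${(duckdbUser / baselineUser * 100).toFixed(1)}% of CPU time`); // 10.3% of CPU time

// wall-clock ratio: 7.557 s vs. 10 min 54.14 s
const baselineWall = 10 * 60 + 54.14; // 654.14 seconds
console.log(`${(7.557 / baselineWall * 100).toFixed(1)}% of wall-clock time`); // 1.2% of wall-clock time
```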
The resulting DB file is only 1.2 GB; nevertheless, I have verified that the data is indeed correctly imported!
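A spot-check along these lines is one way to do that verification: it compares DuckDB's row counts against the "lines imported" numbers node-gtfs logged above. This is a sketch that needs the gtfs.duckdb file produced by the script above; the expected values are copied from that log:

```javascript
// Spot-check (sketch): row counts in gtfs.duckdb should match node-gtfs's
// "lines imported" log above.
import { Database } from "duckdb-async";

const db = await Database.create("./gtfs.duckdb");
for (const [table, expected] of [
    ["shapes", 67182455],
    ["stop_times", 8517255],
    ["trips", 379149],
]) {
    // cast to INT so the driver returns a plain JS number
    const [row] = await db.all(`SELECT count(*)::INT AS n FROM ${table};`);
    console.log(table, row.n === expected ? "OK" : `MISMATCH (${row.n})`);
}
```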

from node-gtfs.

matthiasfeist commented on August 19, 2024

I think we can close this ticket after the recent performance improvements in #149


brendannee commented on August 19, 2024

I'm still interested in this idea if DuckDB has performance improvements - so let me know if you have a branch or PR to check out.

Would using Prisma throughout this library help allow using a variety of databases (including DuckDB)?


laem commented on August 19, 2024

Just a quick message to say that the performance difference between node-GTFS and Motis is quite significant.

I guess the DB is the bottleneck, since I didn't notice any significant improvement when switching to Bun instead of Node.

(sorry, I have no numbers yet)


