Hi folks.
I finally had some downtime during the summer to do a small benchmark to compare performance.
First, my setup:
I used the GTFS Sweden file from here: https://www.trafiklab.se/api/trafiklab-apis/gtfs-sweden/
The ZIP is ~900 MB and the individual files in the zip are as follows:
12K agency.txt
205K calendar.txt
4.3M calendar_dates.txt
148B feed_info.txt
337K routes.txt
3.3G shapes.txt
689M stop_times.txt
9.7M stops.txt
18M transfers.txt
23M trips.txt
Note: I had to delete the file attribution.txt because it was throwing an error for some reason. 🤷
I am running this very basic benchmark on my laptop: MacBook Pro, Apple M2 Pro, 32 GB RAM.
GTFS import with node-gtfs without modifications (baseline)
I am using version 4.13.1.
For the import I am running the following script:

```js
import { importGtfs } from 'gtfs';

try {
  await importGtfs({
    sqlitePath: './gtfs.sqlite',
    agencies: [{ path: './sweden/' }],
  });
} catch (error) {
  console.error(error);
}
```
Here's the output of that script, along with the timings from `time node gtfs.js`:
```
Starting GTFS import for 1 file using SQLite database at ./gtfs.sqlite
Importing GTFS from ./sweden/
Importing - agency.txt - 152 lines imported
Importing - areas.txt - No file found
Importing - attributions.txt - No file found
Importing - booking_rules.txt - No file found
Importing - calendar_dates.txt - 265808 lines imported
Importing - calendar.txt - 5559 lines imported
Importing - fare_attributes.txt - No file found
Importing - fare_leg_rules.txt - No file found
Importing - fare_media.txt - No file found
Importing - fare_products.txt - No file found
Importing - fare_rules.txt - No file found
Importing - fare_transfer_rules.txt - No file found
Importing - feed_info.txt - 1 lines imported
Importing - frequencies.txt - No file found
Importing - levels.txt - No file found
Importing - location_groups.txt - No file found
Importing - location_group_stops.txt - No file found
Importing - locations.geojson - No file found
Importing - networks.txt - No file found
Importing - pathways.txt - No file found
Importing - route_networks.txt - No file found
Importing - routes.txt - 6675 lines imported
Importing - shapes.txt - 67182455 lines imported
Importing - stop_areas.txt - No file found
Importing - stop_times.txt - 8517255 lines imported
Importing - stops.txt - 178665 lines imported
Importing - timeframes.txt - No file found
Importing - transfers.txt - 321250 lines imported
Importing - translations.txt - No file found
Importing - trips.txt - 379149 lines imported
Completed GTFS import for 1 agency

node gtfs.js  429.79s user 161.21s system 90% cpu 10:54.14 total
```
The resulting SQLite file is around 10 GB.
GTFS import with node-gtfs, with PRAGMA journal_mode changed to WAL

```js
db.pragma('journal_mode = WAL');
```

WAL mode is said to be much more performant; in my test, however, I didn't see any significant improvement:

```
node gtfs.js  437.53s user 112.99s system 97% cpu 9:27.00 total
```
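For what it's worth, WAL on its own mainly helps concurrent readers and writers; for a single bulk import, the `synchronous` setting often matters more. A sketch of pragmas commonly combined for bulk inserts (the specific values here are my suggestions, not anything node-gtfs sets):

```sql
-- Commonly suggested for bulk imports; values are assumptions, not node-gtfs defaults.
PRAGMA journal_mode = WAL;    -- write-ahead log instead of a rollback journal
PRAGMA synchronous = NORMAL;  -- fewer fsyncs; documented as safe in WAL mode
PRAGMA cache_size = -64000;   -- ~64 MB page cache (negative value means KiB)
```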
Import via DuckDB
Disclaimer: I have only imported the raw CSV files into the DB file using the following script. I know that the node-gtfs library does a bunch of work under the hood, so the comparison is absolutely not fair.
```js
import { Database } from 'duckdb-async';
import { resolve } from 'node:path';

const db = await Database.create('./gtfs.duckdb');

const files = [
  'agency', 'calendar', 'calendar_dates', 'feed_info',
  'routes', 'shapes', 'stop_times', 'stops', 'transfers',
  'trips',
];

for (const file of files) {
  console.time(file);
  const fullPath = resolve(`./sweden/${file}.txt`);
  await db.exec(`CREATE TABLE ${file} AS FROM read_csv('${fullPath}');`);
  console.timeEnd(file);
}
```
And here's the output, along with the result of the `time` command:
```
agency: 1.834ms
calendar: 8.889ms
calendar_dates: 42.819ms
feed_info: 1.411ms
routes: 8.62ms
shapes: 3.716s
stop_times: 3.085s
stops: 124.628ms
transfers: 168.056ms
trips: 115.312ms

node duck.js  44.27s user 1.97s system 611% cpu 7.557 total
```
That's only about 10% of the CPU time of the baseline import (44 s vs. 430 s user time), and just over 1% of the wall-clock time (7.6 s vs. ~11 min).
The resulting DB file is only 1.2 GB; nevertheless, I have verified that the data is indeed correctly imported!
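One quick way to verify is to compare row counts against the node-gtfs import log. A sketch of the kind of checks I mean, run against the `gtfs.duckdb` file created by the script above (table names as created there):

```sql
-- Counts should line up with the "lines imported" figures in the node-gtfs log,
-- e.g. 67182455 for shapes and 8517255 for stop_times.
SELECT count(*) FROM shapes;
SELECT count(*) FROM stop_times;
```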
I think we can close this ticket after the recent performance improvements in #149.
I'm still interested in this idea if DuckDB offers performance improvements, so let me know if you have a branch or PR to check out.
Would using Prisma throughout this library allow supporting a variety of databases (including DuckDB)?
Just a quick message to say that the performance difference between node-gtfs and Motis is quite significant.
I guess the DB is the bottleneck, since I didn't notice any meaningful improvement from switching to Bun instead of Node.
(Sorry, I have no numbers yet.)