telexyz / bon
Optimizing Vietnamese corpus processing
Combine:
https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable
https://github.com/telexyz/engine/blob/main/.save/hash_count.zig
Note: writing an efficient hash function yourself is very hard, so use the hash functions that Zig already provides.
https://github.com/travisstaloch/art.zig
https://db.in.tum.de/~leis/papers/ART.pdf
This library provides a zig implementation of the Adaptive Radix Tree or ART. The ART operates similar to a traditional radix tree but avoids the wasted space of internal nodes by changing the node size. It makes use of 4 node sizes (4, 16, 48, 256), and can guarantee that the overhead is no more than 52 bytes per key, though in practice it is much lower. As a radix tree, it provides the following:
Note: overly long tokens are usually meaningless text and can be skipped, which simplifies the code.
Hello team.
I read through the repository but could not find any section explaining how to build and use telexify.
I installed zig (Windows x86) but cannot use telexify (the prebuilt binary), and I also tried rebuilding it with:
zig build or zig build -Drelease-fast=true in the bon\simdify folder, but without success.
error: no field or member function named 'standardReleaseOptions' in 'Build'
const mode = b.standardReleaseOptions();
~^~~~~~~~~~~~~~~~~~~~~~~
D:\setup\zig-windows-x86_64-0.11.0-dev.3704+729a051e9\lib\std\Build.zig:1:1: note: struct declared here
const std = @import("std.zig");
^~~~~
referenced by:
runBuild__anon_7220: D:\setup\zig-windows-x86_64-0.11.0-dev.3704+729a051e9\lib\std\Build.zig:1602:27
steps__anon_6992: D:\setup\zig-windows-x86_64-0.11.0-dev.3704+729a051e9\lib\build_runner.zig:914:20
remaining reference traces hidden; use '-freference-trace' to see all reference traces
Could you please share usage documentation for this part?
https://gist.github.com/loderunner/36724cc9ee8db66db305 measures branch misses, cache misses, ...
https://github.com/judofyr/zini
Given a set of n elements, with the only requirement being that you can hash them, it generates a hash function which maps each element to a distinct number between 0 and n - 1. The generated hash function is extremely small, typically consuming less than 4 bits per element, regardless of the size of the input type.
The algorithm provides multiple parameters to tune making it possible to optimize for (small) size, (short) construction time, or (short) lookup time.
To give a practical example:
In ~0.6 seconds Zini was able to create a hash function for /usr/share/dict/words containing 235886 words.
The resulting hash function required in total 865682 bits in memory. This corresponds to 108.2 kB in total or 3.67 bits per word.
In comparison, the original file was 2.49 MB and compressing it with gzip -9 only gets it down to 754 kB (which you can't use directly in memory without decompressing it).
It should of course be noted that they don't store the equivalent data as you can't use the generated hash function to determine if a word is present or not in the list. The comparison is mainly useful to get a feeling of the magnitudes.
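To make the idea of a minimal perfect hash concrete, here is a toy sketch in Python. It is not Zini's algorithm: it simply brute-forces a seed until a seeded hash maps every key to its own slot in 0 .. n - 1, which only works for tiny key sets. The names find_seed and _slot and the sample words are hypothetical.

```python
import hashlib

def _slot(key: bytes, seed: int, n: int) -> int:
    # Seeded 64-bit hash (BLAKE2b with the seed as salt), reduced to [0, n).
    digest = hashlib.blake2b(key, digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little") % n

def find_seed(keys, max_seed=1_000_000):
    """Return a seed for which every key lands in a distinct slot."""
    n = len(keys)
    for seed in range(max_seed):
        if len({_slot(k, seed, n) for k in keys}) == n:
            return seed
    raise ValueError("no seed found; brute force only suits tiny key sets")

words = [b"xin", b"chao", b"viet", b"nam", b"zig"]
seed = find_seed(words)
slots = sorted(_slot(w, seed, len(words)) for w in words)  # one slot per word
stranger = _slot(b"unknown", seed, len(words))             # still a valid slot
```

Only the seed has to be stored, not the words. That is also why a word outside the set still maps to some slot, and the function alone cannot tell members from non-members.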
In addition, Zini provides various functionality for dealing with arrays of numbers:
zini.CompactArray stores n-bit numbers tightly packed, leaving no bits unused. If the largest value in the array is m, then you actually only need n = log2(m) + 1 bits per element. With 9-bit elements, for example, this is roughly 7 times smaller than a []u64, as it stores each element using only 9 bits (and 64 divided by 9 is roughly 7).
zini.DictArray finds all distinct elements in the array, stores each once into a CompactArray (the dictionary), and creates a new CompactArray containing indexes into the dictionary.
zini.EliasFano stores increasing 64-bit numbers in a compact manner.
zini.darray provides constant-time support for the select1(i) operation which returns the i-th set bit in a std.DynamicBitSetUnmanaged.
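The tight packing described for CompactArray can be sketched as follows. This is a minimal Python illustration of the idea, not Zini's API: the class name PackedArray is hypothetical, values are assumed to be non-negative, and a single Python integer stands in for the underlying word array.

```python
class PackedArray:
    """Fixed-width unsigned integers packed back to back, no unused bits."""

    def __init__(self, values):
        values = list(values)
        # bit_length() equals floor(log2(m)) + 1 for m > 0; use at least 1 bit.
        self.width = max(1, max(values, default=0).bit_length())
        self.length = len(values)
        self.bits = 0
        for i, v in enumerate(values):
            self.bits |= v << (i * self.width)

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        return (self.bits >> (i * self.width)) & ((1 << self.width) - 1)

    def __len__(self):
        return self.length

    def size_in_bits(self):
        return self.length * self.width

packed = PackedArray([300, 0, 17, 255])
unpacked = [packed[i] for i in range(len(packed))]
```

Here the largest value is 300, so each element takes 9 bits and the four elements occupy 36 bits instead of 4 x 64 = 256.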
The ideal hashtable can be used by many threads without conflicts, is cache friendly (flat_map), and uses SIMD intrinsics for operations such as comparing, hashing, probing, ...
SOURCE: https://github.com/search?q=simd+hash+table
https://github.com/matmuher/hash_table_optimize experiments with many hash functions and many ways to optimize a hashtable, including SIMD
https://github.com/michaelvlach/ADbHash
The ADbHash is a hash table inspired by google's "Swiss table" presented at CppCon 2017 by Matt Kulukundis. It is based on open-addressing hash table storing extra byte per element (key-value pair). In this byte there are stored a control bit (controlling whether the element is empty or full) and the rest of the byte is taken from the hash of the key. When searching through the table 16 of these control bytes are loaded from the position the element we look for is supposed to be. Then they are compared to a byte constructed from the hash we search for. This is achieved by using Single Instruction Multiple Data (SIMD) and thus a typical search in this hash table will take exactly two instructions:
Compare 16 bytes with 1 byte.
Jump to the matching element.
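The control-byte lookup described above can be sketched in scalar Python. This is an illustration of the technique, not ADbHash's code: match_mask stands in for the 16-byte SIMD compare plus movemask, and the Table class, the use of zlib.crc32 as the hash, and the linear group probing are all assumptions made for the example. Deletion is left out.

```python
import zlib

EMPTY = 0x80   # control bit set: the slot is empty
GROUP = 16     # control bytes inspected per step (one SSE2 register)

def match_mask(control, start, tag):
    """Bit i is set when control[start + i] == tag."""
    mask = 0
    for i, byte in enumerate(control[start:start + GROUP]):
        if byte == tag:
            mask |= 1 << i
    return mask

class Table:
    def __init__(self, groups=4):
        self.control = bytearray([EMPTY]) * (groups * GROUP)
        self.keys = [None] * (groups * GROUP)

    def _probe(self, h):
        groups = len(self.control) // GROUP
        first = (h >> 7) % groups
        for step in range(groups):
            yield ((first + step) % groups) * GROUP

    def insert(self, key):
        h = zlib.crc32(key)
        for start in self._probe(h):
            empties = match_mask(self.control, start, EMPTY)
            if empties:
                slot = start + (empties & -empties).bit_length() - 1
                self.control[slot] = h & 0x7F   # 7 hash bits, control bit clear
                self.keys[slot] = key
                return slot
        raise RuntimeError("table full")

    def find(self, key):
        h = zlib.crc32(key)
        for start in self._probe(h):
            mask = match_mask(self.control, start, h & 0x7F)
            while mask:
                slot = start + (mask & -mask).bit_length() - 1
                if self.keys[slot] == key:
                    return slot
                mask &= mask - 1                # drop this candidate, try the next
            if match_mask(self.control, start, EMPTY):
                return -1                       # an empty slot ends the probe
        return -1

table = Table()
slot = table.insert(b"par")
found = table.find(b"par")
missing = table.find(b"Bel")
```

The full key comparison is still needed because different keys can share the same 7-bit tag. The control bytes only narrow down which slots are worth checking.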
https://github.com/telekons/42nd-at-threadmill a nearly lock-free* hash table based on Cliff Click's NonBlockingHashMap and Abseil's flat_hash_map. It uses SSE2 intrinsics for fast probing, and optionally AVX2 for faster byte broadcasting.
Count not found for nearby symbol 62010d:
ael
2:par
Count not found for nearby symbol 710102:par
Count not found for nearby symbol 710102:par
3:Bel
4:den
Count not found for nearby symbol 650105:den
5:uT
6:rit
7:Oc
Count not found for nearby symbol 730118:rit
=> The error is caused by hash count
TOTAL 1118271 entries, max_probs: 16, avg_probs: 1 (1321322 / 1118271).
Hash Count Validation: false
count[–]=70342
count[Internet]=42891
count[ km]=39915
count[km²]=39343
count[Geometridae]=38669
count[album]=37671
count[Internet]=36997
count[Internet]=1
count[ km]=1
count[km²]=1
Add a lock so that a single hash count can be shared, or use one hash count per thread and merge them afterwards!
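The second option (one count per thread, merged afterwards) might look like the sketch below. It is written in Python with collections.Counter standing in for the hash count, so it only shows the structure, not the performance of a real implementation. The function names and the chunking scheme are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(tokens):
    # Each worker fills its own Counter, so counting needs no lock.
    return Counter(tokens)

def parallel_count(tokens, workers=4):
    chunks = [tokens[i::workers] for i in range(workers)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)   # merging runs on the calling thread only
    return total

sample = ["Internet", "km", "Internet", "album", "km", "Internet"]
merged = parallel_count(sample)
```

Because the merge happens in one place, each key ends up with a single entry, unlike the repeated count[Internet] lines in the output above.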
https://github.com/Highload-fun/platform/wiki/How-to-use-MMAP-to-optimize-data-reading
STDIN is a file in RAM. That's why you can use STDIN_FILENO to get access to the file.
Now you can create a mapping to this file in the virtual address space of the calling process using the mmap function.
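C++:
#include <unistd.h>
#include <sys/mman.h>
...
// Seeking to the end of STDIN (file descriptor 0) returns its size in bytes.
off_t fsize = lseek(STDIN_FILENO, 0, SEEK_END);
// Map the whole input read-only; MAP_POPULATE (Linux-specific) pre-faults the pages.
// On failure mmap returns MAP_FAILED, not a null pointer.
char* buffer = (char*)mmap(0, fsize, PROT_READ, MAP_PRIVATE | MAP_POPULATE, STDIN_FILENO, 0);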
Also, the Huge Pages feature is available.
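For comparison, the same read-through-a-mapping idea can be sketched with Python's standard mmap module. This example maps a regular file given by path rather than STDIN, and the function name count_lines_mmap and the sample content are hypothetical.

```python
import mmap
import os
import tempfile

def count_lines_mmap(path):
    """Count newline bytes by scanning a read-only mapping of the file."""
    with open(path, "rb") as f:
        if os.fstat(f.fileno()).st_size == 0:
            return 0                      # mmap rejects empty files
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            count, pos = 0, m.find(b"\n")
            while pos != -1:
                count += 1
                pos = m.find(b"\n", pos + 1)
            return count

# Small demonstration on a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"xin\nchao\nviet nam\n")
n_lines = count_lines_mmap(tmp.name)
os.remove(tmp.name)
```

The data is never copied into a Python bytes object. The search runs directly over the mapped pages.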
To make the algorithm easier to implement, the enums above are grouped by attribute (length, closing sound, ...). When emitting an ID, their positions need to be rearranged into alphabetical order, and the rearrangement must be reverted when converting from an ID back to a Syllable.
Use 65kb bit filters to check the first 3 bytes or the last 3 bytes of a token and decide whether it could be Vietnamese. Only if it could is the token analyzed further.
For something more sophisticated, use an XorFilter to filter the whole token.
Alternatively, use a HashMap, DaTrie, ... to map a token string directly to a syllable_id
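The first option, a small bit filter over the first 3 bytes of a token, could be sketched as below. The details are assumptions rather than the project's actual design: the filter is taken to be 2^19 bits (65,536 bytes), the 24-bit prefix is folded into 19 bits with a simple XOR, short tokens are zero-padded, and the names PrefixFilter and maybe_vietnamese are hypothetical.

```python
FILTER_BITS = 1 << 19                 # 524,288 bits = 65,536 bytes (assumed size)

def _index(three: bytes) -> int:
    # Fold the 24-bit prefix into 19 bits by XOR-ing the top 5 bits back in.
    v = int.from_bytes(three, "big")
    return (v ^ (v >> 19)) & (FILTER_BITS - 1)

class PrefixFilter:
    def __init__(self, known_tokens):
        self.bits = bytearray(FILTER_BITS // 8)
        for token in known_tokens:
            i = _index(self._head(token))
            self.bits[i >> 3] |= 1 << (i & 7)

    @staticmethod
    def _head(token: bytes) -> bytes:
        return token[:3].ljust(3, b"\0")   # zero-pad tokens shorter than 3 bytes

    def maybe_vietnamese(self, token: bytes) -> bool:
        i = _index(self._head(token))
        return bool(self.bits[i >> 3] & (1 << (i & 7)))

syllables = [s.encode("utf-8") for s in ["hệ", "của", "cái", "gia", "đình"]]
prefix_filter = PrefixFilter(syllables)
rejected = prefix_filter.maybe_vietnamese(b"Internet")           # False
passed = prefix_filter.maybe_vietnamese("cát".encode("utf-8"))   # True
```

"cát" passes although it was never added, because it shares its first 3 bytes with "cái". The filter only rules tokens out cheaply, so a token that passes still needs the full analysis.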
sample data: "́ hệ của cái gia đình này và chắc chắn"