kimahriman / hdfs-native
License: Apache License 2.0
There are a few things we can do to speed up the build process and remove the native dependencies currently required for a build.
HDFS has a soft lease limit of 60 seconds, which I believe means any file being written to for longer than 60 seconds could be "taken" by another writer if the lease hasn't been renewed. We should add a lease renewal process like the Java client's to make sure any files actively being written to have their lease renewed.
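A minimal sketch of what that could look like, assuming a hypothetical NameNodeProxy::renew_lease wrapper and a shared set of open files (the names here are illustrative, not the actual crate API):

use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::time::Duration;

// Hypothetical handle to the NameNode RPC layer.
struct NameNodeProxy;

impl NameNodeProxy {
    async fn renew_lease(&self, _client_name: &str) -> Result<(), std::io::Error> {
        // Would issue the ClientProtocol renewLease RPC here.
        Ok(())
    }
}

// Spawn a task that renews the lease well inside the 60 second soft limit
// whenever at least one file is open for writing.
fn spawn_lease_renewer(
    proxy: Arc<NameNodeProxy>,
    client_name: String,
    open_files: Arc<Mutex<HashSet<String>>>,
) -> tokio::task::JoinHandle<()> {
    tokio::spawn(async move {
        // Renew at half the soft limit so a missed tick still lands in time.
        let mut interval = tokio::time::interval(Duration::from_secs(30));
        loop {
            interval.tick().await;
            let has_writers = !open_files.lock().unwrap().is_empty();
            if has_writers {
                if let Err(e) = proxy.renew_lease(&client_name).await {
                    eprintln!("lease renewal failed: {e}");
                }
            }
        }
    })
}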
Hello, thank you so much for this fantastic project! I'm a member of the OpenDAL community, and we've been closely following your project for quite some time: apache/opendal#3144.
The only barrier to integrating your project is the licensing issue. Which license does this project use? Please clarify by including a LICENSE file in the repository. I'm happy to create a PR for this if you'd like.
I noticed you've set the Apache 2.0 license for the Rust crate (line 11 in 6be6a72); does this cover the entire repository?
Currently we are limited to individual read calls returning a whole buffer of data, and chunking up a read requires creating a new datanode TCP connection for each chunk. We are already getting the data in batches (at the packet level), so we should simply stream these back up to the FileReader and let it decide whether to combine them into a single buffer or return a stream object directly.
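A minimal sketch of the idea, with a stand-in producer in place of the real packet loop (names are illustrative):

use bytes::{Bytes, BytesMut};
use futures::StreamExt;
use tokio::sync::mpsc;
use tokio_stream::wrappers::ReceiverStream;

// Stand-in for the datanode connection: each send is one packet's data.
fn spawn_block_reader() -> mpsc::Receiver<Bytes> {
    let (tx, rx) = mpsc::channel(16);
    tokio::spawn(async move {
        for chunk in [Bytes::from_static(b"packet1"), Bytes::from_static(b"packet2")] {
            if tx.send(chunk).await.is_err() {
                break; // reader was dropped
            }
        }
    });
    rx
}

// The FileReader can expose the packets as a stream directly...
fn as_stream(rx: mpsc::Receiver<Bytes>) -> impl futures::Stream<Item = Bytes> {
    ReceiverStream::new(rx)
}

// ...or collect them into one contiguous buffer when the caller wants that.
async fn collect_all(rx: mpsc::Receiver<Bytes>) -> Bytes {
    let mut stream = ReceiverStream::new(rx);
    let mut buf = BytesMut::new();
    while let Some(chunk) = stream.next().await {
        buf.extend_from_slice(&chunk);
    }
    buf.freeze()
}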
As a performance improvement, we should support caching and reusing datanode connections like the Java library does. This reduces the overhead of creating a new TCP connection for every individual block read to the same datanode
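Something like the sketch below could work, with a simple keyed pool (illustrative only; a real version would also want idle timeouts and a per-datanode cap, like the Java client's PeerCache):

use std::collections::HashMap;
use tokio::net::TcpStream;
use tokio::sync::Mutex;

// Idle datanode connections keyed by "host:port".
#[derive(Default)]
struct DatanodeConnectionCache {
    idle: Mutex<HashMap<String, Vec<TcpStream>>>,
}

impl DatanodeConnectionCache {
    // Reuse an idle connection if one exists, otherwise dial a new one.
    async fn get(&self, addr: &str) -> std::io::Result<TcpStream> {
        let reused = self.idle.lock().await.get_mut(addr).and_then(|conns| conns.pop());
        if let Some(conn) = reused {
            return Ok(conn);
        }
        TcpStream::connect(addr).await
    }

    // Return a healthy connection to the pool after a block read finishes.
    async fn release(&self, addr: String, conn: TcpStream) {
        self.idle.lock().await.entry(addr).or_default().push(conn);
    }
}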
Observer reads are technically supported, since we track and use the state ID if it's provided. This can be improved by keeping track of which NameNodes are observers, and by knowing which RPC calls are reads that can be sent to an observer and which are writes that have to go to the active NameNode.
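A sketch of the routing side, with illustrative types that aren't the actual crate API:

// Classify each ClientProtocol call; reads can go to an observer (with the
// client's last-seen state ID attached), writes must go to the active.
enum RpcKind {
    Read,  // e.g. getBlockLocations, getFileInfo, getListing
    Write, // e.g. create, mkdirs, rename, delete
}

enum NameNodeRole {
    Active,
    Observer,
    Standby,
}

struct TrackedNameNode {
    url: String,
    role: NameNodeRole,
}

// Prefer an observer for reads, falling back to the active; writes always
// target the active NameNode.
fn route(kind: RpcKind, namenodes: &[TrackedNameNode]) -> Option<&TrackedNameNode> {
    match kind {
        RpcKind::Read => namenodes
            .iter()
            .find(|nn| matches!(nn.role, NameNodeRole::Observer))
            .or_else(|| namenodes.iter().find(|nn| matches!(nn.role, NameNodeRole::Active))),
        RpcKind::Write => namenodes.iter().find(|nn| matches!(nn.role, NameNodeRole::Active)),
    }
}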
We currently are not checking data checksums on read (in fact, we don't even have the datanode send checksums). We should enable this to ensure we are not returning corrupt data.
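A sketch of per-chunk verification, assuming CRC32C via the crc32c crate (the actual checksum type and chunk size would come from the checksum proto the datanode sends):

// HDFS checksums cover fixed-size chunks (512 bytes by default), with one
// 4-byte big-endian checksum per chunk alongside the packet data.
fn verify_packet(data: &[u8], checksums: &[u8], bytes_per_checksum: usize) -> Result<(), String> {
    for (i, chunk) in data.chunks(bytes_per_checksum).enumerate() {
        let bytes = checksums
            .get(i * 4..i * 4 + 4)
            .ok_or_else(|| "checksum buffer too short".to_string())?;
        let expected = u32::from_be_bytes(bytes.try_into().unwrap());
        if crc32c::crc32c(chunk) != expected {
            return Err(format!("checksum mismatch in chunk {i}"));
        }
    }
    Ok(())
}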
Hi,
I have created a directory and listed its permissions as below:
let dir_path = "/hdfs-native-test/";
client.mkdirs(dir_path, 777, true).await.unwrap();
let file_info = client.get_file_info(dir_path).await.unwrap();
println!("file status : {:?}", file_info);
The output is: file status : FileStatus { path: "/hdfs-native-test/", length: 0, isdir: true, permission: 777, owner: "sraizada", group: "supergroup", modification_time: 1704538751868, access_time: 0 }
But hadoop fs -ls / shows:
$ hadoop fs -ls /
Found 6 items
dr----x--t - sraizada supergroup 0 2024-01-06 16:29 /hdfs-native-test
I am working on Hadoop 3.2.4
Can you please help with this?
Thank you!
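For what it's worth, the numbers point at an octal/decimal mix-up: decimal 777 is octal 1411, i.e. sticky bit plus r----x--x, which is exactly the dr----x--t that ls shows. Passing the mode as an octal literal should give the expected result:

// Decimal 777 == 0o1411 (sticky bit, user r--, group --x, other --x).
// Use an octal literal for the usual rwxrwxrwx mode:
client.mkdirs(dir_path, 0o777, true).await.unwrap();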
Thanks for your great work! Is append support on the roadmap? @Kimahriman
If we create an fsspec implementation in Python, it will make the client usable in a lot of other Python ecosystems, such as pyarrow.
This is an awesome project! I think it'll be more user-friendly if we add some example code.
Currently the object store implementation is behind a feature flag in the main crate. It would probably be cleaner to have a separate crate for it, since it should rely only on the public API of the library. A separate crate would also make the implementation easier to discover and make it easier to support feature flags specific to the object store, such as potentially allowing different versions of the upstream object_store crate.
We should add some basic Python integration tests, like the Rust ones, to sanity check the Python behavior.
Currently we run all the object store tests for every combination of hdfs features. We probably don't need to do this: the regular tests can cover all the functionality for each combination of hdfs features, and then we only need to run a single test of all the object store features against one set of hdfs features. This should speed things up a little bit.
Similar to #62, if a write pauses for more than 60 seconds, the datanode connection could time out. We need to add datanode heartbeating like DFSOutputStream does.
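A rough sketch of the sender side, with an illustrative Packet type (in the real protocol a heartbeat is an empty packet with a sentinel sequence number of -1):

use std::time::Duration;

struct Packet {
    seqno: i64,
    data: Vec<u8>,
}

impl Packet {
    // Heartbeats carry no data; they only keep the pipeline connection alive.
    fn heartbeat() -> Self {
        Packet { seqno: -1, data: Vec::new() }
    }
}

// Flush queued data packets, but if the writer is idle for half the timeout
// window, send a heartbeat instead.
async fn packet_sender(mut queue: tokio::sync::mpsc::Receiver<Packet>) {
    let idle_limit = Duration::from_secs(30);
    loop {
        match tokio::time::timeout(idle_limit, queue.recv()).await {
            Ok(Some(packet)) => send_to_datanode(packet).await,
            Ok(None) => break, // writer closed the stream
            Err(_) => send_to_datanode(Packet::heartbeat()).await, // idle too long
        }
    }
}

async fn send_to_datanode(_packet: Packet) {
    // Would serialize the packet onto the pipeline's TCP stream here.
}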
Vectorized IO is a very common thing to optionally support in IO utilities. We could benefit greatly from vectorized IO on the reading side by doing things like coalescing nearby ranges into a single request and fetching ranges that fall in different blocks in parallel.
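Range coalescing in particular is mostly bookkeeping; a self-contained sketch (max_gap is an illustrative tuning knob, not an existing parameter):

use std::ops::Range;

// Merge ranges whose gap is small enough that reading the bytes between them
// is cheaper than paying for a separate round trip.
fn coalesce(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
    ranges.sort_by_key(|r| r.start);
    let mut merged: Vec<Range<u64>> = Vec::new();
    for r in ranges {
        match merged.last_mut() {
            Some(last) if r.start <= last.end + max_gap => last.end = last.end.max(r.end),
            _ => merged.push(r),
        }
    }
    merged
}

fn main() {
    // 0..100 and 120..200 are close enough to fetch as a single request.
    let merged = coalesce(vec![0..100, 500..600, 120..200], 64);
    assert_eq!(merged, vec![0..200, 500..600]);
}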
Hi, I am using hdfs-native for a PoC, tried connecting to HDFS version 2.6, and am facing the following error:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: RPCError("org.apache.hadoop.ipc.RpcNoSuchMethodException", "Unknown method msync called on org.apache.hadoop.hdfs.protocol.ClientProtocol protocol.\n\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605)\n\tat org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:422)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)\n\tat org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)\n")', src/main.rs:22:27
stack backtrace:
0: rust_begin_unwind
My code looks something like this:
use std::env;
use hdfs_native::{Client, HdfsError};

#[tokio::main]
async fn main() -> Result<(), HdfsError> {
    env::set_var("HADOOP_CONF_DIR", "/Users/sraizada/Downloads/hadoop-conf");
    let client = Client::new("hdfs://mycluster")?;
    // Iterate eagerly; a bare `map` is lazy and would never print anything.
    let files = client.list_status("/", true).await.unwrap();
    for f in files {
        println!("file - {:?}", f);
    }
    Ok(())
}
Could you please help?
Thank you!
The erasure coding write test fails occasionally with some weird errors. There must be some non-deterministic timing issue making it inconsistent; we need to figure out what it is.
Examples:
https://github.com/Kimahriman/hdfs-native/actions/runs/7594249015/job/20685581012
https://github.com/Kimahriman/hdfs-native/actions/runs/7569498349/job/20612925706
Each failed for a different reason.
We already have the structure for the federated router state (just a Vec<u8>). We just need to implement the state merge function.
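A sketch of what the merge could look like. Assumption (based on how router-based federation works on the Java side): the opaque bytes decode to a map of nameservice to last-seen state ID, and merging keeps the maximum ID per nameservice:

use std::collections::HashMap;

// Keep the highest state ID seen for each nameservice.
fn merge_federated_state(current: &mut HashMap<String, i64>, incoming: HashMap<String, i64>) {
    for (nameservice, state_id) in incoming {
        current
            .entry(nameservice)
            .and_modify(|id| *id = (*id).max(state_id))
            .or_insert(state_id);
    }
}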
If rust-rse/reed-solomon-erasure#108 gets merged, we can use the library directly instead of relying on a fork for custom matrices
We currently use gsasl for DIGEST-MD5, but it doesn't support the integrity or confidentiality SASL modes. I haven't found another library that does, so we should just implement DIGEST-MD5 directly so we can fully support all security features. This would be required if we ever wanted to support data transit encryption.
We don't really need to run all the reading and writing edge cases for every combination of HDFS features. We only need to test the very basics with all the various features to make sure they work in the various security scenarios; the full set of reading and writing edge cases only needs to be tested once.