Working with Hadoop from Python

This article describes how to interact with the Hadoop file system (HDFS) from Python using the PyArrow library. It covers the basic syntax and configuration for connecting to HDFS through libhdfs (the JNI-based interface) and libhdfs3 (a third-party library from Pivotal Labs), with particular attention to the environment variables, such as HADOOP_HOME and JAVA_HOME, that must be set for everything to run correctly.


Reference: http://arrow.apache.org/docs/python/filesystems.html#hadoop-file-system-hdfs

Environment:

Python 2.7.14 + pyarrow + Hadoop 2.7

System configuration

File System Interfaces

In this section, we discuss filesystem-like interfaces in PyArrow.

Hadoop File System (HDFS): syntax

PyArrow comes with bindings to a C++-based interface to the Hadoop File System. You connect like so:

import pyarrow as pa

# host, port, user and ticket_cache_path are placeholders for your cluster.
fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path)
with fs.open(path, 'rb') as f:
    data = f.read()  # do something with f

By default, pyarrow.hdfs.HadoopFileSystem uses libhdfs, a JNI-based interface to the Java Hadoop client. This library is loaded at runtime (rather than at link / library load time, since the library may not be in your LD_LIBRARY_PATH), and relies on the following environment variables:

  • HADOOP_HOME: the root of your installed Hadoop distribution. Often has lib/native/libhdfs.so.
  • JAVA_HOME: the location of your Java SDK installation.
  • ARROW_LIBHDFS_DIR (optional): explicit location of libhdfs.so if it is installed somewhere other than $HADOOP_HOME/lib/native.
  • CLASSPATH: must contain the Hadoop jars. You can set these using:
export CLASSPATH=`$HADOOP_HOME/bin/hdfs classpath --glob`

If CLASSPATH is not set, then it will be set automatically if the hadoop executable is in your system path, or if HADOOP_HOME is set.
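Because libhdfs is loaded lazily, these variables can also be set from Python itself, as long as that happens before the first `pa.hdfs.connect()` call. The helper below is a minimal sketch: the function name and the installation paths in the commented usage line are placeholder assumptions, and the `classpath` argument simply lets you bypass invoking the `hdfs` CLI when you already know the classpath.

```python
import os
import subprocess


def configure_hadoop_env(hadoop_home, java_home, classpath=None):
    """Set the environment variables libhdfs needs.

    Must run before the first pa.hdfs.connect() call, because
    pyarrow loads libhdfs at connect time, not at import time.
    """
    os.environ["HADOOP_HOME"] = hadoop_home
    os.environ["JAVA_HOME"] = java_home
    if classpath is None:
        # Same effect as: export CLASSPATH=`$HADOOP_HOME/bin/hdfs classpath --glob`
        hdfs_bin = os.path.join(hadoop_home, "bin", "hdfs")
        classpath = subprocess.check_output(
            [hdfs_bin, "classpath", "--glob"]).decode().strip()
    os.environ["CLASSPATH"] = classpath


# Example usage (placeholder paths; adjust to your installation):
# configure_hadoop_env("/opt/hadoop", "/usr/lib/jvm/java-8-openjdk")
```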

You can also use libhdfs3, a third-party C++ library for HDFS from Pivotal Labs:

fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path,
                     driver='libhdfs3')


Interfaces:

HDFS API

  • hdfs.connect([host, port, user, …]): Connect to an HDFS cluster.
  • HadoopFileSystem.cat(path): Return the contents of a file as a bytes object.
  • HadoopFileSystem.chmod(self, path, mode): Change file permissions.
  • HadoopFileSystem.chown(self, path[, owner, …]): Change file owner and group.
  • HadoopFileSystem.delete(path[, recursive]): Delete the indicated file or directory.
  • HadoopFileSystem.df(self): Return free space on disk, like the UNIX df command.
  • HadoopFileSystem.disk_usage(path): Compute bytes used by all contents under the indicated path in the file tree.
  • HadoopFileSystem.download(self, path, stream): Download an HDFS file into a file-like stream.
  • HadoopFileSystem.exists(path): Return True if the path exists.
  • HadoopFileSystem.get_capacity(self): Get the reported total capacity of the file system.
  • HadoopFileSystem.get_space_used(self): Get the space used on the file system.
  • HadoopFileSystem.info(self, path): Return detailed HDFS information for a path.
  • HadoopFileSystem.ls(path[, detail]): Retrieve directory contents and metadata, if requested.
  • HadoopFileSystem.mkdir(path, **kwargs): Create a directory in HDFS.
  • HadoopFileSystem.open(self, path[, mode, …]): Open an HDFS file for reading or writing.
  • HadoopFileSystem.rename(path, new_path): Rename a file, like the UNIX mv command.
  • HadoopFileSystem.rm(path[, recursive]): Alias for FileSystem.delete.
  • HadoopFileSystem.upload(self, path, stream): Upload a file-like object to an HDFS path.
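To see several of these methods working together, here is a small round-trip sketch that creates a directory, writes a file, reads it back, lists the directory, and cleans up. The function only assumes the methods listed above on a connected filesystem object; the host, port, and paths in the commented usage lines are placeholders, not values from this article.

```python
def hdfs_roundtrip(fs, base="/tmp/pyarrow_demo"):
    """Exercise mkdir, open, cat, ls and delete on a connected filesystem.

    `fs` is anything exposing the methods above, e.g. the result of
    pa.hdfs.connect(...).
    """
    fs.mkdir(base)                         # create a scratch directory
    path = base + "/hello.txt"
    with fs.open(path, "wb") as f:         # open for writing
        f.write(b"hello hdfs\n")
    data = fs.cat(path)                    # full contents as bytes
    entries = fs.ls(base)                  # directory listing
    fs.delete(base, recursive=True)        # recursive cleanup
    return data, entries


# Usage against a real cluster (placeholder host/port):
# import pyarrow as pa
# fs = pa.hdfs.connect('namenode', 8020)
# print(hdfs_roundtrip(fs))
```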
