Source: https://ccp.cloudera.com/display/CDHDOC/HBase+Installation
Apache HBase provides large-scale tabular storage for Hadoop using the Hadoop Distributed File System (HDFS). Cloudera recommends installing HBase in a standalone mode before you try to run it on a whole cluster.
Upgrading HBase to the Latest CDH3 Release
The instructions that follow assume that you are upgrading HBase as part of an upgrade to the latest CDH3 release, and have already performed the steps under Upgrading CDH3.
To upgrade HBase to the latest CDH3 release, proceed as follows.
Step 1: Perform a Graceful Cluster Shutdown
To shut HBase down gracefully, stop the Thrift server and clients, then stop the cluster.
- Stop the Thrift server and clients
- Stop the cluster.
- Use the following command on the master node:
- Use the following command on each node hosting a region server:
This shuts down the master and the region servers gracefully.
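Under CDH3, these stop commands would look like the following (a sketch; the init script names assume the CDH3 hadoop-hbase packages):

```shell
# Stop the Thrift server and any clients first
sudo /etc/init.d/hadoop-hbase-thrift stop

# On the master node:
sudo /etc/init.d/hadoop-hbase-master stop

# On each node hosting a region server:
sudo /etc/init.d/hadoop-hbase-regionserver stop
```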
Step 2: Stop the ZooKeeper Server
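Assuming the CDH3 hadoop-zookeeper-server package is installed, the stop command would be:

```shell
sudo /etc/init.d/hadoop-zookeeper-server stop
```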
Step 3: Install the New Version of HBase
Follow directions in the next section, Installing HBase.
Installing HBase
To install HBase on Ubuntu and other Debian systems:
To install HBase On Red Hat-compatible systems:
To install HBase on SUSE systems:
To list the installed files on Ubuntu and other Debian systems:
To list the installed files on Red Hat and SUSE systems:
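The package-manager commands for these steps would look like the following (a sketch, assuming the CDH3 package name hadoop-hbase and that the Cloudera repository is already configured on each system):

```shell
# Install on Ubuntu and other Debian systems
sudo apt-get install hadoop-hbase

# Install on Red Hat-compatible systems
sudo yum install hadoop-hbase

# Install on SUSE systems
sudo zypper install hadoop-hbase

# List the installed files on Ubuntu and other Debian systems
dpkg -L hadoop-hbase

# List the installed files on Red Hat and SUSE systems
rpm -ql hadoop-hbase
```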
You can see that the HBase package has been configured to conform to the Linux Filesystem Hierarchy Standard. (To learn more, run man hier).
You are now ready to enable the server daemons you want to use with Hadoop. Java-based client access is also available by adding the jars in /usr/lib/hbase/ and /usr/lib/hbase/lib/ to your Java class path.
Host Configuration Settings for HBase
Configuring the REST Port
You can use an init.d script, /etc/init.d/hadoop-hbase-rest, to start the REST server; for example:
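For example (assuming the hadoop-hbase-rest package is installed):

```shell
sudo /etc/init.d/hadoop-hbase-rest start
```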
The script starts the server by default on port 8080. This is a commonly used port and so may conflict with other applications running on the same host.
If you need to change the port for the REST server, configure it in hbase-site.xml, for example:
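A sketch of the hbase-site.xml property (the port value 8081 here is only an example; choose any free port):

```xml
<property>
  <name>hbase.rest.port</name>
  <value>8081</value>
</property>
```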
Using DNS with HBase
HBase uses the local hostname to report its IP address. Both forward and reverse DNS resolution must work. If your machine has multiple interfaces, HBase uses the interface that the primary hostname resolves to. If this is insufficient, you can set hbase.regionserver.dns.interface in the hbase-site.xml file to indicate the primary interface. For this setting to work properly, your cluster configuration must be consistent and every host must have the same network interface configuration. As an alternative, you can set hbase.regionserver.dns.nameserver in the hbase-site.xml file to choose a name server other than the system-wide default.
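For example, a sketch of these properties in hbase-site.xml (the interface name eth0 and the name server address are hypothetical placeholders):

```xml
<property>
  <name>hbase.regionserver.dns.interface</name>
  <value>eth0</value>
</property>
<property>
  <name>hbase.regionserver.dns.nameserver</name>
  <value>192.168.1.1</value>
</property>
```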
Using the Network Time Protocol (NTP) with HBase
The clocks on cluster members should be in basic alignment. Some skew is tolerable, but excessive skew can cause odd behavior. Run NTP, or an equivalent, on your cluster. If you are having problems querying data or seeing unusual cluster operations, verify the system time.
Setting User Limits for HBase
Because HBase is a database, it uses many files at the same time. The default ulimit setting of 1024 for the maximum number of open files on Unix systems is insufficient. Any significant amount of loading will result in strange failures and cause the error message java.io.IOException...(Too many open files) to be logged in the HBase or HDFS log files. For more information about this issue, see the Apache HBase Book. You may also notice errors such as:
Configuring ulimit for HBase
Cloudera recommends increasing the maximum number of file handles to more than 10,000. Note that increasing the file handle limit for the user running the HBase process is an operating system configuration, not an HBase configuration. A common mistake is to increase the limit for a particular user while HBase is actually running as a different user. HBase prints the ulimit it is using on the first line of its logs; make sure that value is correct.
If you are using ulimit, you must make the following configuration changes:
- In the /etc/security/limits.conf file, add the following lines:
- To apply the changes in /etc/security/limits.conf on Ubuntu and other Debian systems, add the following line in the /etc/pam.d/common-session file:
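A sketch of these configuration lines (the limit of 32768 and the hdfs and hbase user names are typical values for a CDH3 install, not mandates; adjust to the users actually running your daemons):

```
# /etc/security/limits.conf
hdfs   -  nofile  32768
hbase  -  nofile  32768

# /etc/pam.d/common-session (Ubuntu and other Debian systems only)
session required pam_limits.so
```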
Using dfs.datanode.max.xcievers with HBase
A Hadoop HDFS DataNode has an upper bound on the number of files that it can serve at any one time. The upper bound is set by the dfs.datanode.max.xcievers property (the property is spelled in the code exactly as shown here). Before loading data, make sure you have configured the value for dfs.datanode.max.xcievers in the conf/hdfs-site.xml file to at least 4096, as shown below:
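For example:

```xml
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
```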
Be sure to restart HDFS after changing the value for dfs.datanode.max.xcievers. If you don't change that value as described, strange failures can occur and an error message about exceeding the number of xcievers will be added to the DataNode logs. Other error messages about missing blocks are also logged, such as:
Starting HBase in Standalone Mode
By default, HBase ships configured for standalone mode. In this mode of operation, a single JVM hosts the HBase Master, an HBase Region Server, and a ZooKeeper quorum peer. In order to run HBase in standalone mode, you must install the HBase Master package:
Installing the HBase Master for Standalone Operation
To install the HBase Master on Ubuntu and other Debian systems:
To install the HBase Master On Red Hat-compatible systems:
To install the HBase Master on SUSE systems:
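The install commands would look like the following (a sketch, assuming the CDH3 package name hadoop-hbase-master):

```shell
# Ubuntu and other Debian systems
sudo apt-get install hadoop-hbase-master

# Red Hat-compatible systems
sudo yum install hadoop-hbase-master

# SUSE systems
sudo zypper install hadoop-hbase-master
```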
Starting the HBase Master
On Red Hat and SUSE systems (using .rpm packages), you can now start the HBase Master by using the included service script:
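For example (a sketch, assuming the CDH3 hadoop-hbase-master package):

```shell
sudo /etc/init.d/hadoop-hbase-master start
```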
On Ubuntu systems (using Debian packages) the HBase Master starts when the HBase package is installed.
To verify that the standalone installation is operational, visit http://localhost:60010. The list of Region Servers at the bottom of the page should include one entry for your local machine.
If you see this message when you start the HBase standalone master:
you will need to stop the hadoop-zookeeper-server or uninstall the hadoop-zookeeper-server package.
Accessing HBase by using the HBase Shell
After you have started the standalone installation, you can access the database by using the HBase Shell:
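For example, you can start the shell and try a few basic commands (the table name t1 and column family f1 are arbitrary examples):

```shell
hbase shell

# Inside the shell:
#   status                                -- show cluster status
#   create 't1', 'f1'                     -- create a table with one column family
#   put 't1', 'row1', 'f1:c1', 'value1'   -- write a cell
#   scan 't1'                             -- read it back
#   quit
```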
Using MapReduce with HBase
To run MapReduce jobs that use HBase, you need to add the HBase and ZooKeeper JAR files to the Hadoop Java classpath. You can do this by adding the following statement to each job:
This distributes the JAR files to the cluster along with your job and adds them to the job's classpath, so that you do not need to edit the MapReduce configuration.
You can find more information about addDependencyJars in the TableMapReduceUtil API documentation.
When getting a Configuration object for an HBase MapReduce job, instantiate it using the HBaseConfiguration.create() method.
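Putting the two points together, job setup would look something like this sketch (the job name is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

// Create an HBase-aware Configuration rather than a plain Configuration
Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "my-hbase-job");

// Ship the HBase and ZooKeeper JARs with the job and add them to its classpath
TableMapReduceUtil.addDependencyJars(job);
```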
Configuring HBase in Pseudo-distributed Mode
Pseudo-distributed mode differs from standalone mode in that each of the component processes runs in a separate JVM.
Modifying the HBase Configuration
To enable pseudo-distributed mode, you must first make some configuration changes. Open /etc/hbase/conf/hbase-site.xml in your editor of choice, and insert the following XML properties between the <configuration> and </configuration> tags. Be sure to replace localhost with the hostname of your HDFS NameNode if it is not running locally.
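A sketch of the properties (the NameNode port 8020 is the CDH default; adjust the hbase.rootdir URI to match your HDFS configuration):

```xml
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://localhost:8020/hbase</value>
</property>
```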
Creating the /hbase Directory in HDFS
Before starting the HBase Master, you need to create the /hbase directory in HDFS. The HBase Master runs as hbase:hbase, so it does not have the required permissions to create a top-level directory.
To create the /hbase directory in HDFS:
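A sketch of the commands, run as the HDFS superuser (here assumed to be the hdfs user):

```shell
sudo -u hdfs hadoop fs -mkdir /hbase
sudo -u hdfs hadoop fs -chown hbase /hbase
```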
Enabling Servers for Pseudo-distributed Operation
After you have configured HBase, you must enable the various servers that make up a distributed HBase cluster. HBase uses three required types of servers:
Installing and Starting ZooKeeper Server
HBase uses ZooKeeper Server as a highly available, central location for cluster management. For example, it allows clients to locate the servers, and ensures that only one master is active at a time. For a small cluster, running a ZooKeeper node colocated with the NameNode is recommended. For larger clusters, contact Cloudera Support for configuration help.
Install and start the ZooKeeper Server in standalone mode by running the commands shown in the "Installing the ZooKeeper Server Package on a Single Server" section of ZooKeeper Installation.
Starting the HBase Master
After ZooKeeper is running, you can start the HBase master in standalone mode.
Starting an HBase Region Server
The Region Server is the part of HBase that actually hosts data and processes requests. The region server typically runs on all of the slave nodes in a cluster, but not the master node.
To enable the HBase Region Server on Ubuntu and other Debian systems:
To enable the HBase Region Server On Red Hat-compatible systems:
To enable the HBase Region Server on SUSE systems:
To start the Region Server:
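These steps would look like the following (a sketch, assuming the CDH3 package name hadoop-hbase-regionserver):

```shell
# Ubuntu and other Debian systems
sudo apt-get install hadoop-hbase-regionserver

# Red Hat-compatible systems
sudo yum install hadoop-hbase-regionserver

# SUSE systems
sudo zypper install hadoop-hbase-regionserver

# Start the Region Server
sudo /etc/init.d/hadoop-hbase-regionserver start
```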
Verifying the Pseudo-Distributed Operation
After you have started ZooKeeper, the Master, and a Region Server, the pseudo-distributed cluster should be up and running. You can verify that each of the daemons is running using the jps tool included in the Oracle JDK. If you are running a pseudo-distributed HDFS installation and a pseudo-distributed HBase installation on one machine, jps will show the following output:
You should also be able to navigate to http://localhost:60010 and verify that the local region server has registered with the master.
Installing the HBase Thrift Server
The HBase Thrift Server is an alternative gateway for accessing the HBase server. Thrift mirrors most of the HBase client APIs while enabling popular programming languages to interact with HBase. The Thrift Server is multi-platform and performs better than REST in many situations. Thrift can be run colocated with the region servers, but should not be colocated with the NameNode or the JobTracker. For more information about Thrift, visit http://incubator.apache.org/thrift/.
To enable the HBase Thrift Server on Ubuntu and other Debian systems:
To enable the HBase Thrift Server On Red Hat-compatible systems:
To enable the HBase Thrift Server on SUSE systems:
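The install commands would look like the following (a sketch, assuming the CDH3 package name hadoop-hbase-thrift):

```shell
# Ubuntu and other Debian systems
sudo apt-get install hadoop-hbase-thrift

# Red Hat-compatible systems
sudo yum install hadoop-hbase-thrift

# SUSE systems
sudo zypper install hadoop-hbase-thrift

# Start the Thrift Server
sudo /etc/init.d/hadoop-hbase-thrift start
```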
Deploying HBase in a Distributed Cluster
After you have HBase running in pseudo-distributed mode, the same configuration can be extended to running on a distributed cluster.
Choosing where to Deploy the Processes
For small clusters, Cloudera recommends designating one node in your cluster as the master node. On this node, you will typically run the HBase Master and a ZooKeeper quorum peer. These master processes may be collocated with the Hadoop NameNode and JobTracker for small clusters.
Designate the remaining nodes as slave nodes. On each node, Cloudera recommends running a Region Server, which may be collocated with a Hadoop TaskTracker and a DataNode. When collocating with TaskTrackers, be sure that the resources of the machine are not oversubscribed – it's safest to start with a small number of MapReduce slots and work up slowly.
Configuring for Distributed Operation
After you have decided which machines will run each process, you can edit the configuration so that the nodes may locate each other. In order to do so, you should make sure that the configuration files are synchronized across the cluster. Cloudera strongly recommends the use of a configuration management system to synchronize the configuration files, though you can use a simpler solution such as rsync to get started quickly.
The only configuration change necessary to move from pseudo-distributed operation to fully-distributed operation is the addition of the ZooKeeper Quorum address in hbase-site.xml. Insert the following XML property to configure the nodes with the address of the node where the ZooKeeper quorum peer is running:
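A sketch of the property (mymasternode is a placeholder for the hostname of the node running the ZooKeeper quorum peer):

```xml
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>mymasternode</value>
</property>
```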
To start the cluster, start the services in the following order:
- The ZooKeeper Quorum Peer
- The HBase Master
- Each of the HBase Region Servers
After the cluster is fully started, you can view the HBase Master web interface on port 60010 and verify that each of the slave nodes has registered properly with the master.
Troubleshooting
The Cloudera packages of HBase have been configured to place logs in /var/log/hbase. While getting started, Cloudera recommends tailing these logs to note any error messages or failures.
Viewing the HBase Documentation
For additional HBase documentation, see http://archive.cloudera.com/cdh/3/hbase/.