
Configure proxy for Hiveserver2/Impala

A proxy server is a server or application that acts as an intermediary for requests from clients seeking resources from other servers/applications.

A client connects to the proxy server and requests a service or resource hosted on a different server. The proxy server evaluates the request and forwards it to the intended server, service, or application. The server’s response goes back to the proxy server, which in turn returns it to the client.

Cloudera Manager currently does not provide proxy or load-balancing features, so we have to use external proxy software of our choice.

In this section, we are going to use HAProxy as our proxy software.

HIVESERVER2:

Before installing HAProxy and configuring it for HiveServer2, ensure that HiveServer2 is running on more than one host. If only one HS2 instance is available, add another HS2 role on a second server so that the proxy setup is put to full use.

Now we have HS2 instances running on two hosts, master and standby, on port 10000 (the default port).

Log in to the server on which you want to set up HAProxy. This server should not be running an HS2 instance, i.e. the HS2 instances and the proxy server should be on different hosts.

[root@server3 ~]# yum install haproxy
(If a specific version is required, install haproxy-<version> instead.)

Go to the HAProxy config file:

[root@server3 ~]# vi /etc/haproxy/haproxy.cfg

# Lines starting with the ‘#’ sign below are comments for your understanding.
# You don’t need to include them in the configuration file.

listen hiveserver2 :10000 
# haproxy will listen on port 10000 for hiveserver2 client requests.

mode tcp 
option tcplog
balance leastconn 

# tcp – connection mode between haproxy and the HiveServer2 hosts
# leastconn – each new connection is sent to the server with the fewest active connections

server server1 master:10000
server server2 standby:10000

# first field ‘server’ marks the line as a backend server definition
# second field is a name for the server; you can choose any name
# third field is the FQDN of the HiveServer2 host, followed by the port number
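
The ‘listen hiveserver2 :10000’ form above puts the listening address on the listen line itself. Newer HAProxy releases expect the address on a separate ‘bind’ line, so if your version rejects the one-line form, the same block can be written with bind. As a minimal sketch, the hypothetical shell helper below (render_listen_block is not part of HAProxy) prints such a block for a given service name, listening port, and list of backends. The ‘check’ keyword on the server lines is an addition that is not in the configuration above; it asks HAProxy to health-check each backend so that a dead HS2 instance stops receiving connections.

```shell
# Hypothetical helper (not part of HAProxy): print a TCP 'listen' block.
# usage: render_listen_block <name> <listen_port> <host:port>...
render_listen_block() {
    name="$1"
    port="$2"
    shift 2
    printf 'listen %s\n' "$name"
    # 'bind' carries the listening address instead of the 'listen' line
    printf '    bind :%s\n' "$port"
    printf '    mode tcp\n'
    printf '    option tcplog\n'
    printf '    balance leastconn\n'
    i=1
    for backend in "$@"; do
        # 'check' enables health checks; it is an addition to the original config
        printf '    server server%s %s check\n' "$i" "$backend"
        i=$((i + 1))
    done
}
```

Running render_listen_block hiveserver2 10000 master:10000 standby:10000 prints a block that can be pasted into /etc/haproxy/haproxy.cfg in place of the one-line form.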

Now that the configuration is done, start the haproxy service and enable it to start automatically at boot.

# service haproxy start

# chkconfig haproxy on
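
Before starting the service it can help to validate the edited file, since a typo in haproxy.cfg will keep HAProxy from starting. The sketch below wraps HAProxy’s syntax check (haproxy -c -f <file>) in a small hypothetical helper, check_haproxy_cfg, and assumes the haproxy binary is on the PATH. On systemd-based distributions, where service and chkconfig have been replaced by systemctl, the equivalents of the two commands above would be systemctl start haproxy and systemctl enable haproxy.

```shell
# Hypothetical helper: validate an HAProxy config file without starting the proxy.
# usage: check_haproxy_cfg [path]   (defaults to /etc/haproxy/haproxy.cfg)
check_haproxy_cfg() {
    cfg="${1:-/etc/haproxy/haproxy.cfg}"
    if [ ! -r "$cfg" ]; then
        echo "cannot read $cfg" >&2
        return 2
    fi
    # -c: check the configuration and exit, -f: configuration file to load
    haproxy -c -f "$cfg"
}
```

For example, check_haproxy_cfg && service haproxy start only starts the service when the file parses cleanly.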

 

Now go to CM – Hive – Configuration, search for ‘load balancer’, and enter the HAProxy server’s hostname and the port (10000) on which HAProxy is listening for HiveServer2.

Click Save Changes and deploy the client configuration.

To test the proxy connection, connect to HiveServer2 via JDBC using the HAProxy server as the host in the URI.

# beeline -u 'jdbc:hive2://server3:10000/default'

If the connection succeeds, the proxy setup for HiveServer2 is complete.
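
If beeline cannot connect, a plain TCP check helps to tell a proxy or network problem from a JDBC problem. The sketch below defines a hypothetical helper, port_open, that only tests whether a TCP connection to a host and port can be opened. It assumes bash (for its /dev/tcp feature) and the coreutils timeout command are installed; the 3-second limit is an arbitrary choice.

```shell
# Hypothetical helper: succeed if a TCP connection to host:port can be opened.
# usage: port_open <host> <port>
port_open() {
    host="$1"
    port="$2"
    # bash opens a TCP socket when redirecting to /dev/tcp/<host>/<port>;
    # timeout gives up after 3 seconds if the host does not answer.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

For example, port_open server3 10000 && echo "proxy reachable" checks the HAProxy listener, while port_open master 10000 and port_open standby 10000 check the two HS2 backends directly.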

 


 

IMPALA:

For Impala, we set up the proxy in front of the Impala daemons; the configuration is similar to the one for HS2.

Please ensure that Impala daemons are running on more than one host.

Add the lines below to the HAProxy config file:

[root@server3 ~]# vi /etc/haproxy/haproxy.cfg

listen impala :21000 
# haproxy will listen on port 21000 for impala client requests.

mode tcp 
option tcplog
balance leastconn 

server server1 master:21050 
server server2 standby:21050
server server3 slave1:21050

# 21050 is the default port on which the Impala daemon accepts JDBC/ODBC connections.
# You can change it to match your Impala configuration.
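
Both listen blocks use ‘balance leastconn’, which suits long-lived SQL sessions because each new connection goes to the backend that currently has the fewest open connections. The selection rule can be sketched in shell as below. pick_leastconn is a hypothetical illustration, not how HAProxy is implemented: it reads lines of the form ‘<server> <active connections>’ and prints the least-loaded server. On a tie this sketch keeps the first server listed, whereas HAProxy rotates among equally loaded servers.

```shell
# Hypothetical illustration of the leastconn rule.
# stdin: one "<server> <active_connections>" pair per line
pick_leastconn() {
    awk '
        # remember the server with the smallest connection count seen so far
        NR == 1 || ($2 + 0) < min { min = $2 + 0; name = $1 }
        END { if (NR > 0) print name }
    '
}
```

With made-up counts, printf 'master 4\nstandby 2\nslave1 7\n' | pick_leastconn selects standby, the server with the fewest connections.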

Once the HAProxy configuration is done, restart the haproxy service so that it picks up the new section, then update the Impala load balancer property.

CM – Impala – Configuration – search for ‘load balancer’

Impala Load Balancer: server3:21000

To test the proxy connection, use a JDBC connection string like the one below.

jdbc:impala://server3:21000
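
Which URL scheme works depends on the JDBC driver available on the client. The jdbc:impala:// scheme above needs the Cloudera Impala JDBC driver on the classpath. A client that only has the Hive JDBC driver, such as a stock beeline, can usually reach an unsecured Impala daemon with a jdbc:hive2:// URL and auth=noSasl instead. The hypothetical helper below builds either form; the noSasl setting assumes a cluster without Kerberos or LDAP authentication.

```shell
# Hypothetical helper: build a JDBC URL for Impala behind the proxy.
# usage: impala_jdbc_url <proxy_host> [port] [impala|hive2]
impala_jdbc_url() {
    host="$1"
    port="${2:-21000}"
    driver="${3:-impala}"
    case "$driver" in
        # Cloudera Impala JDBC driver
        impala) printf 'jdbc:impala://%s:%s\n' "$host" "$port" ;;
        # Hive JDBC driver against an unsecured cluster (assumption)
        hive2)  printf 'jdbc:hive2://%s:%s/;auth=noSasl\n' "$host" "$port" ;;
        *)      echo "unknown driver: $driver" >&2; return 1 ;;
    esac
}
```

For example, beeline -u "$(impala_jdbc_url server3 21000 hive2)" tries the connection through the proxy using only the Hive driver.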

 

Problem Scenarios:

  • Set up proxy load balancing for HiveServer2
  • Configure HAProxy to load-balance the Impala daemons

 

This covers the proxy setup for HS2 and Impala.

Use the comments section below to post your doubts, questions and feedback.

Please follow my blog to get notified of more certification related posts, exam tips, etc.

 


 

8 thoughts on “Configure proxy for Hiveserver2/Impala”

  1. Will the load balancer configuration be provided?

    listen impala :21000
    #haproxy will listen in port 21000 for impala client requests.

    mode tcp
    option tcplog
    balance leastconn

    server server1 master:21050
    server server2 standby:21050
    server server3 slave1:21050

  2. I got the proxy for hive to work, but for impala I get the error
    ERROR beeline.ClassNameCompleter: Fail to parse the class name from the Jar file due to the exception: java.io.FileNotFoundException:
    and then some jars are mentioned….
    ‘No known driver to handle jdbc:impala:node005.cluster.local:21000’

    I assume that on the exam cluster these drivers will be available (it is doing everything described here : https://www.cloudera.com/documentation/enterprise/5-13-x/topics/impala_jdbc.html )?

      1. Kannan,
        thanks, your blog is very useful.
        To check the Impala proxy: how can I run a simple test to know whether the JDBC connection works?

        1. If you have configured the proxy correctly for Impala, you can run a query from Hue to check. Make sure you update the load balancer property in the Impala configuration.
