http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Quick-Start/cdh4qs_topic_3_3.html
http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/Installing-and-Using-Impala.html
What is Impala?
Cloudera released Impala, an open-source real-time query project. Benchmarks against several products show it runs queries 3 to 90 times faster than the MapReduce-based Hive SQL. Impala is modeled on Google's Dremel, but its SQL support has outgrown the original.
1. Install the JDK
$ sudo yum install jdk-6u41-linux-amd64.rpm
2. Install CDH4 in pseudo-distributed mode
$ cd /etc/yum.repos.d/
$ sudo wget http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/cloudera-cdh4.repo
$ sudo yum install hadoop-conf-pseudo
Format the NameNode.
$ sudo -u hdfs hdfs namenode -format
Start HDFS
$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
Create the /tmp directory
$ sudo -u hdfs hadoop fs -rm -r /tmp
$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
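Mode 1777 grants everyone read/write/execute on /tmp, while the leading 1 is the sticky bit, which stops users from deleting files they do not own. A quick illustrative decode of that octal mode in plain Python (not a Hadoop API):

```python
import stat

mode = 0o1777  # the mode passed to hadoop fs -chmod above

# The leading 1 is the sticky bit; 777 grants rwx to user/group/other.
assert mode & stat.S_ISVTX        # sticky bit is set
assert mode & 0o777 == 0o777      # rwx for everyone

# Render it the way `ls -l` / `hadoop fs -ls` displays it.
print(stat.filemode(stat.S_IFDIR | mode))  # drwxrwxrwt
```

The trailing `t` in the listing below is exactly this sticky bit.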
Create the YARN staging and log directories
$ sudo -u hdfs hadoop fs -mkdir -p /tmp/hadoop-yarn/staging
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging
$ sudo -u hdfs hadoop fs -mkdir -p /tmp/hadoop-yarn/staging/history/done_intermediate
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging/history/done_intermediate
$ sudo -u hdfs hadoop fs -chown -R mapred:mapred /tmp/hadoop-yarn/staging
$ sudo -u hdfs hadoop fs -mkdir -p /var/log/hadoop-yarn
$ sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn
Check the HDFS file tree
$ sudo -u hdfs hadoop fs -ls -R /
drwxrwxrwt - hdfs supergroup 0 2012-05-31 15:31 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /tmp/hadoop-yarn
drwxrwxrwt - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging
drwxr-xr-x - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history
drwxrwxrwt - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var/log
drwxr-xr-x - yarn mapred 0 2012-05-31 15:31 /var/log/hadoop-yarn
Start YARN
$ sudo service hadoop-yarn-resourcemanager start
$ sudo service hadoop-yarn-nodemanager start
$ sudo service hadoop-mapreduce-historyserver start
Create a user directory (using dong.guo as an example):
$ sudo -u hdfs hadoop fs -mkdir /user/dong.guo
$ sudo -u hdfs hadoop fs -chown dong.guo /user/dong.guo
Test uploading files
$ hadoop fs -mkdir input
$ hadoop fs -put /etc/hadoop/conf/*.xml input
$ hadoop fs -ls input
Found 4 items
-rw-r--r-- 1 dong.guo supergroup 1461 2013-05-14 03:30 input/core-site.xml
-rw-r--r-- 1 dong.guo supergroup 1854 2013-05-14 03:30 input/hdfs-site.xml
-rw-r--r-- 1 dong.guo supergroup 1325 2013-05-14 03:30 input/mapred-site.xml
-rw-r--r-- 1 dong.guo supergroup 2262 2013-05-14 03:30 input/yarn-site.xml
Set the HADOOP_MAPRED_HOME environment variable
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
Run a test job
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'
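The grep example job counts occurrences of strings matching the regular expression 'dfs[a-z.]+' in the uploaded config files. The core matching logic can be sketched in plain Python; the sample text below is a made-up fragment for illustration, not the actual files:

```python
import re
from collections import Counter

pattern = re.compile(r"dfs[a-z.]+")  # same regex passed to the grep job

# Hypothetical snippets of what hdfs-site.xml might contain.
sample = """
<name>dfs.replication</name>
<name>dfs.namenode.name.dir</name>
"""

# Count every match, as the MapReduce grep job does across all input files.
counts = Counter(m.group(0) for m in pattern.finditer(sample))
print(counts)
```

The job's output (shown further below) is exactly such counts, one property name per line.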
After the job completes, the following directories can be seen
$ hadoop fs -ls
Found 2 items
drwxr-xr-x - dong.guo supergroup 0 2013-05-14 03:30 input
drwxr-xr-x - dong.guo supergroup 0 2013-05-14 03:32 output23
$ hadoop fs -ls output23
Found 2 items
-rw-r--r-- 1 dong.guo supergroup 0 2013-05-14 03:32 output23/_SUCCESS
-rw-r--r-- 1 dong.guo supergroup 150 2013-05-14 03:32 output23/part-r-00000
$ hadoop fs -cat output23/part-r-00000 | head
1 dfs.safemode.min.datanodes
1 dfs.safemode.extension
1 dfs.replication
1 dfs.namenode.name.dir
1 dfs.namenode.checkpoint.dir
1 dfs.datanode.data.dir
3. Install Hive
$ sudo yum install hive hive-metastore hive-server
$ sudo yum install mysql-server
$ sudo service mysqld start
$ cd ~
$ wget 'http://cdn.mysql.com/Downloads/Connector-J/mysql-connector-java-5.1.25.tar.gz'
$ tar xzf mysql-connector-java-5.1.25.tar.gz
$ sudo cp mysql-connector-java-5.1.25/mysql-connector-java-5.1.25-bin.jar /usr/lib/hive/lib/
$ sudo /usr/bin/mysql_secure_installation
[...]
Enter current password for root (enter for none):
OK, successfully used password, moving on...
[...]
Set root password? [Y/n] y
New password: hadoophive
Re-enter new password: hadoophive
Remove anonymous users? [Y/n] Y
[...]
Disallow root login remotely? [Y/n] N
[...]
Remove test database and access to it? [Y/n] Y
[...]
Reload privilege tables now? [Y/n] Y
All done!
$ mysql -u root -phadoophive
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.10.0.mysql.sql;

mysql> CREATE USER 'hive'@'%' IDENTIFIED BY 'hadoophive';
mysql> CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hadoophive';
mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'%';
mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'localhost';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'%';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'localhost';
mysql> FLUSH PRIVILEGES;
mysql> quit;
$ sudo mv /etc/hive/conf/hive-site.xml /etc/hive/conf/hive-site.xml.bak
$ sudo vim /etc/hive/conf/hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/metastore</value>
<description>the URL of the MySQL database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hadoophive</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>false</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://127.0.0.1:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
<property>
<name>hive.aux.jars.path</name>
<value>file:///usr/lib/hive/lib/zookeeper.jar,file:///usr/lib/hive/lib/hbase.jar,file:///usr/lib/hive/lib/hive-hbase-handler-0.10.0-cdh4.2.0.jar,file:///usr/lib/hive/lib/guava-11.0.2.jar</value>
</property>

</configuration>
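Since hive-site.xml uses Hadoop's generic <property> name/value layout, the settings are easy to sanity-check programmatically. A hedged sketch in Python, with a trimmed copy of the file above inlined as a string for illustration:

```python
import xml.etree.ElementTree as ET

# A trimmed, inlined copy of the hive-site.xml above (illustration only).
HIVE_SITE = """
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/metastore</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://127.0.0.1:9083</value>
  </property>
</configuration>
"""

# Build a {name: value} dict from every <property> element.
props = {p.findtext("name"): p.findtext("value")
         for p in ET.fromstring(HIVE_SITE).iter("property")}
print(props["javax.jdo.option.ConnectionURL"])  # jdbc:mysql://localhost/metastore
```

The same pattern works for core-site.xml and hdfs-site.xml, which use the identical layout.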
$ sudo service hive-metastore start
Starting (hive-metastore): [ OK ]
$ sudo service hive-server start
Starting (hive-server): [ OK ]
$ sudo -u hdfs hadoop fs -mkdir /user/hive
$ sudo -u hdfs hadoop fs -chown hive /user/hive
$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod 777 /tmp
$ sudo -u hdfs hadoop fs -chmod o+t /tmp
$ sudo -u hdfs hadoop fs -mkdir /data
$ sudo -u hdfs hadoop fs -chown hdfs /data
$ sudo -u hdfs hadoop fs -chmod 777 /data
$ sudo -u hdfs hadoop fs -chmod o+t /data
$ sudo chown -R hive:hive /var/lib/hive
$ sudo vim /tmp/kv1.txt
1 www.baidu.com
2 www.google.com
3 www.sina.com.cn
4 www.163.com
5 heylinx.com
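The two columns of kv1.txt must be separated by tabs so that the FIELDS TERMINATED BY "\t" clause of the table created next can split them. A small illustrative sketch (plain Python, not Hive) writing and re-parsing such a file in memory:

```python
import csv
import io

rows = [(1, "www.baidu.com"), (2, "www.google.com"), (3, "www.sina.com.cn"),
        (4, "www.163.com"), (5, "heylinx.com")]

# Write tab-separated lines, matching FIELDS TERMINATED BY "\t".
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
data = buf.getvalue()

# Re-parse the way Hive splits each line: newline rows, tab fields.
parsed = [line.split("\t") for line in data.strip().split("\n")]
print(parsed[0])  # ['1', 'www.baidu.com']
```

If the file were space-separated instead, Hive would load every whole line into foo, yielding NULLs, since the declared delimiter would never match.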
$ sudo -u hive hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
Hive history file=/tmp/root/hive_job_log_root_201305140801_825709760.txt

hive> CREATE TABLE IF NOT EXISTS pokes ( foo INT,bar STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" LINES TERMINATED BY "\n";

hive> show tables;
OK
pokes
Time taken: 0.415 seconds
hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' OVERWRITE INTO TABLE pokes;
Copying data from file:/tmp/kv1.txt
Copying file: file:/tmp/kv1.txt
Loading data to table default.pokes
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted /user/hive/warehouse/pokes
Table default.pokes stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 79, raw_data_size: 0]
OK
Time taken: 1.681 seconds
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
4. Install Impala
$ cd /etc/yum.repos.d/
$ sudo wget http://archive.cloudera.com/impala/redhat/6/x86_64/impala/cloudera-impala.repo
$ sudo yum install impala impala-shell
$ sudo yum install impala-server impala-state-store
$ sudo vim /etc/hadoop/conf/hdfs-site.xml
...
<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<property>
<name>dfs.domain.socket.path</name>
<value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
<property>
<name>dfs.client.file-block-storage-locations.timeout</name>
<value>3000</value>
</property>
<property>
<name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
<value>true</value>
</property>
$ sudo cp -rpa /etc/hadoop/conf/core-site.xml /etc/impala/conf/
$ sudo cp -rpa /etc/hadoop/conf/hdfs-site.xml /etc/impala/conf/
$ sudo service hadoop-hdfs-datanode restart
$ sudo service impala-state-store restart
$ sudo service impala-server restart
$ sudo /usr/java/default/bin/jps
5. Install HBase
$ sudo yum install hbase
$ sudo vim /etc/security/limits.conf
hdfs - nofile 32768
hbase - nofile 32768
$ sudo vim /etc/pam.d/common-session
session required pam_limits.so
$ sudo vim /etc/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.datanode.max.xcievers</name>
<value>4096</value>
</property>
$ sudo cp /usr/lib/impala/lib/hive-hbase-handler-0.10.0-cdh4.2.0.jar /usr/lib/hive/lib/hive-hbase-handler-0.10.0-cdh4.2.0.jar
$ sudo /etc/init.d/hadoop-hdfs-namenode restart
$ sudo /etc/init.d/hadoop-hdfs-datanode restart
$ sudo yum install hbase-master
$ sudo service hbase-master start
$ sudo -u hive hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
Hive history file=/tmp/hive/hive_job_log_hive_201305140905_2005531704.txt
hive> CREATE TABLE hbase_table_1(key int, value string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val") TBLPROPERTIES ("hbase.table.name" = "xyz");
OK
Time taken: 3.587 seconds

hive> INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo=5;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1368502088579_0004, Tracking URL = http://ip-10-197-10-4:8088/proxy/application_1368502088579_0004/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1368502088579_0004
Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
2013-05-14 09:12:45,340 Stage-0 map = 0%, reduce = 0%
2013-05-14 09:12:53,165 Stage-0 map = 100%, reduce = 0%, Cumulative CPU 2.63 sec
MapReduce Total cumulative CPU time: 2 seconds 630 msec
Ended Job = job_1368502088579_0004
1 Rows loaded to hbase_table_1
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 2.63 sec HDFS Read: 288 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 630 msec
OK
Time taken: 21.063 seconds

hive> select * from hbase_table_1;
OK
5 heylinx.com
Time taken: 0.685 seconds

hive> SELECT COUNT (*) FROM pokes;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1368502088579_0005, Tracking URL = http://ip-10-197-10-4:8088/proxy/application_1368502088579_0005/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1368502088579_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-05-14 10:32:04,711 Stage-1 map = 0%, reduce = 0%
2013-05-14 10:32:11,461 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
2013-05-14 10:32:12,554 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
2013-05-14 10:32:13,642 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
2013-05-14 10:32:14,760 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
2013-05-14 10:32:15,918 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
2013-05-14 10:32:16,991 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
2013-05-14 10:32:18,111 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
2013-05-14 10:32:19,188 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.04 sec
MapReduce Total cumulative CPU time: 4 seconds 40 msec
Ended Job = job_1368502088579_0005
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 4.04 sec HDFS Read: 288 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 40 msec
OK
5
Time taken: 28.195 seconds
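In the hbase.columns.mapping string used above, entries pair positionally with the Hive columns: ":key" marks the HBase row key and "cf1:val" a column in family cf1. A toy sketch of that positional pairing (illustrative only, not the handler's real code):

```python
hive_columns = ["key", "value"]  # columns of CREATE TABLE hbase_table_1
mapping = ":key,cf1:val"         # the hbase.columns.mapping property

# Pair each Hive column with the mapping entry at the same position.
pairs = dict(zip(hive_columns, mapping.split(",")))
print(pairs)  # {'key': ':key', 'value': 'cf1:val'}
```

A mismatch in the number of entries versus columns is a common cause of errors when creating HBase-backed tables.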
6. Test Impala performance
View parameters on http://ec2-204-236-182-78.us-west-1.compute.amazonaws.com:25000
$ impala-shell
[ip-10-197-10-4.us-west-1.compute.internal:21000] > CREATE TABLE IF NOT EXISTS pokes ( foo INT,bar STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" LINES TERMINATED BY "\n";
Query: create TABLE IF NOT EXISTS pokes ( foo INT,bar STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" LINES TERMINATED BY "\n"
[ip-10-197-10-4.us-west-1.compute.internal:21000] > show tables;
Query: show tables
Query finished, fetching results ...
+-------+
| name  |
+-------+
| pokes |
+-------+
Returned 1 row(s) in 0.00s

[ip-10-197-10-4.us-west-1.compute.internal:21000] > SELECT * from pokes;
Query: select * from pokes
Query finished, fetching results ...
+-----+-----------------+
| foo | bar             |
+-----+-----------------+
| 1   | www.baidu.com   |
| 2   | www.google.com  |
| 3   | www.sina.com.cn |
| 4   | www.163.com     |
| 5   | heylinx.com     |
+-----+-----------------+
Returned 5 row(s) in 0.28s

[ip-10-197-10-4.us-west-1.compute.internal:21000] > SELECT COUNT (*) from pokes;
Query: select COUNT (*) from pokes
Query finished, fetching results ...
+----------+
| count(*) |
+----------+
| 5        |
+----------+
Returned 1 row(s) in 0.34s
Comparing the two COUNT results: Hive took 28.195 seconds while Impala took only 0.34s, which shows that Impala's query performance really is well ahead of Hive's on this workload.
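For reference, the gap can be quantified directly from the two timings reported in the sessions above:

```python
hive_seconds = 28.195   # COUNT(*) in the Hive session above
impala_seconds = 0.34   # the same COUNT(*) in impala-shell

speedup = hive_seconds / impala_seconds
print(f"Impala was about {speedup:.0f}x faster on this query")  # about 83x
```

Of course, this is a five-row table on a single pseudo-distributed node; the ratio mostly reflects MapReduce's fixed job-startup overhead rather than steady-state scan throughput.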