Hadoop Conference 2014 Japan: Quick Notes

I attended Hadoop Conference 2014 Japan, held on 2014/7/8.
Registration reportedly reached about 1,300 people. Five or six years have passed since Hadoop started appearing around me as a large-scale data-analysis alternative to the RDBMS,
and this seminar gave me a fresh sense of how much the Hadoop community in Japan has matured.
I hope it keeps evolving as a business platform, coexisting with RDBMS and KVS systems.
As an aside, Recruit Technologies, who sponsor seminars like this one, are remarkably generous.


[ Big Data on Virtual Environment ]

1. Easy Provisioning
2. Easy Migration & Replacement
3. Reduce Facility Cost
4. Cost Reduction

1) Exploring BDD as a Service
Providing a big-data analysis platform on virtual servers using a HADOOP + SPARK + Sentry combination, with Spark standing in for MapReduce. Basic Search is currently served from physical machines, so offering an equivalent
service on a virtual environment also makes sense. (Adding SPARK apparently keeps the working set in memory; see the sketch after the links below.)

SPARK
SPARK is a technology worth keeping an eye on.
http://spark.apache.org/
http://www.cloudera.co.jp/blog/putting-spark-to-use-fast-in-memory-computing-for-your-big-data-applications.html
http://spark-summit.org/wp-content/uploads/2013/10/Feng-Andy-SparkSummit-Keynote.pdf
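
As a taste of the in-memory behavior, a spark-shell session along these lines caches an RDD once and then reuses it. This is illustrative only; the HDFS path and log contents are made up:

$ ./bin/spark-shell
scala> val logs = sc.textFile("hdfs://localhost:9000/logs/access.log")  // hypothetical path
scala> logs.cache()                                 // mark the RDD to be kept in memory
scala> logs.filter(_.contains("ERROR")).count()     // first action reads HDFS and fills the cache
scala> logs.filter(_.contains("WARN")).count()      // later actions run against the cached copy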

Sentry for Authentication (federated authentication) with Kerberos
For corporate account management and compliance requirements, this looks like the convenient option.
https://blogs.apache.org/sentry/

Splunk and the like are excellent, but if you are going to build log analysis yourself...
When a service with little budget needs log analysis, also offering a budget edition could give users more choices, even if service level and flexibility drop. That said, it is a non-starter if it increases OPEX.
Norikra + Fluentd + Kibana + Elasticsearch (a sample configuration is sketched after the links below)

https://github.com/norikra/fluent-plugin-norikra
http://www.slideshare.net/recruitcojp/20140709-lt 
http://dev.classmethod.jp/cloud/aws/block_dos_attack_by_norikra/
http://blog.excale.net/index.php/archives/1929
http://eure.jp/blog/fluentd_elasticsearch_kibana/
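
For reference, a minimal td-agent configuration wiring this stack together might look like the following. This is a sketch based on the plugin READMEs, not a tested config; the tag names, hosts, and ports are assumptions (26571 is Norikra's default RPC port, 9200 is Elasticsearch's default HTTP port):

# /etc/td-agent/td-agent.conf (illustrative)
<source>
  type forward                # receive events from upstream fluentd agents
  port 24224
</source>

<match app.**>
  type copy
  # index the raw events into Elasticsearch; Kibana visualizes them from there
  <store>
    type elasticsearch
    host localhost
    port 9200
    logstash_format true      # daily logstash-YYYY.MM.DD indices, as Kibana expects
  </store>
  # feed the same stream to Norikra for SQL-style stream queries
  <store>
    type norikra
    norikra localhost:26571
    remove_tag_prefix app
    target_map_tag true
  </store>
</match>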

※Other Interesting Information
Evolution of Impala – Hadoop
http://www.slideshare.net/Cloudera_jp/evolution-of-impala-hcj2014
Recommendation Engine
http://www.slideshare.net/MapR_Japan/mahoutsolr-hadoop-conference-japan-2014
Basic Information
http://www.slideshare.net/Cloudera_jp/hadoop-hcj2014
E-Book for Cloudera (FREE)
http://www.oreilly.co.jp/books/9784873116723/


As systems (servers included) keep getting faster and the big-data era arrives,
distributed processing is clearly attracting attention.

When HPCC was in vogue ten years ago it never felt close to home, but
now that open-source systems such as Hadoop and MongoDB make distributed processing
easy to adopt, the topic has drawn renewed attention over the past two or three years.
It is easy to forget that the network is also a likely bottleneck, so
scale-out needs to be designed in properly from the moment the system is introduced.

TCP/IP topics HPC users should know
ESnet: http://fasterdata.es.net/
———————————————————-
To make better use of its accumulated knowledge, ESnet has developed this Fasterdata Knowledge Base.
The knowledge base provides proven, operationally-sound methods for troubleshooting and
solving performance issues. Our solutions fall into five categories:

Network Architecture, including the Science DMZ model
Host Tuning
Network Tuning
Data Transfer Tools
Network Performance Testing
———————————————————-
According to the HPC material above, these areas are also worth tuning carefully.
Plenty of tools exist, so installing one to get a picture of the current state is a reasonable first step when you want to investigate.
nuttcp, for example, can apparently also surface retransmissions; a minimal session is sketched below.
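
A minimal nuttcp run, assuming the tool is installed on both hosts (the host names are placeholders; see Phil Dykstra's HOWTO linked below for the full option set):

[root@receiver ~]# nuttcp -S                       # receiving side: run once in server mode
[root@sender ~]# nuttcp -i1 receiver.example.com   # sending side: 10-second TCP test, 1-second interval reports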

■Data Transfer Tools
http://fasterdata.es.net/data-transfer-tools/

■Network Troubleshooting Tools
http://fasterdata.es.net/performance-testing/network-troubleshooting-tools/

■Phil Dykstra’s nuttcp quick start guide
http://wcisd.hpc.mil/nuttcp/Nuttcp-HOWTO.html

Example: checking the network path, MTU included, with scamper.
———————————————————————-
http://fasterdata.es.net/performance-testing/network-troubleshooting-tools/scamper/

To install scamper:
wget http://www.wand.net.nz/scamper/scamper-cvs-20110421.tar.gz
tar xvzf scamper-cvs-20110421.tar.gz
./configure; make; make install

[root@ip-xxx-xxx-xxx-xxx1 scamper-cvs-20110421]# ./configure; make; make install
checking for a BSD-compatible install… /usr/bin/install -c
checking whether build environment is sane… yes
checking for a thread-safe mkdir -p… /bin/mkdir -p
checking for gawk… gawk
checking whether make sets $(MAKE)… yes
checking build system type… x86_64-unknown-linux-gnu
checking host system type… x86_64-unknown-linux-gnu
checking how to print strings… printf
checking for style of include used by make… GNU
checking for gcc… gcc
checking whether the C compiler works… yes

[root@ip-xxx-xxx-xxx-xxx1 scamper-cvs-20110421]# dig yahoo.co.jp

; <<>> DiG 9.7.3-P3-RedHat-9.7.3-8.P3.15.amzn1 <<>> yahoo.co.jp
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 24120
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;yahoo.co.jp.           IN      A

;; ANSWER SECTION:
yahoo.co.jp.    287     IN      A       203.216.243.240
yahoo.co.jp.    287     IN      A       124.83.187.140

;; Query time: 1 msec
;; SERVER: 172.16.0.23#53(172.16.0.23)
;; WHEN: Sun May 27 08:24:32 2012
;; MSG SIZE  rcvd: 61

[root@ip-xxx-xxx-xxx-xxx1 scamper-cvs-20110421]#
[root@ip-xxx-xxx-xxx-xxx1 scamper-cvs-20110421]# /usr/local/bin/scamper -c "trace -M" -i 124.83.187.140
traceroute from 10.157.37.241 to 124.83.187.140
 1  10.157.36.2  4.163 ms [mtu: 1500]
 2  10.1.22.9  0.378 ms [mtu: 1500]
 3  175.41.192.21  0.397 ms [mtu: 1500]
 4  27.0.0.165  0.321 ms [mtu: 1500]
 5  27.0.0.205  7.595 ms [mtu: 1500]
 6  27.0.0.188  10.107 ms [mtu: 1500]
 7  61.200.80.201  7.698 ms [mtu: 1500]
 8  61.200.80.134  7.857 ms [mtu: 1500]
 9  61.200.82.138  7.942 ms [mtu: 1500]
10  124.83.128.26  12.923 ms [mtu: 1500]
11  124.83.128.146  9.725 ms [mtu: 1500]
12  124.83.128.146  9.852 ms !X [mtu: 1500]
[root@ip-xxx-xxx-xxx-xxx1 scamper-cvs-20110421]#

Beyond that, the server-side NIC and socket-memory settings also look tunable per environment, as the dumps below show.

[root@colinux ~]# /sbin/sysctl -a | grep mem
net.ipv4.udp_wmem_min = 4096
net.ipv4.udp_rmem_min = 4096
net.ipv4.udp_mem = 2324160 3098880 4648320
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 16384 4194304
net.ipv4.tcp_mem = 196608 262144 393216
net.ipv4.igmp_max_memberships = 20
net.core.optmem_max = 20480
net.core.rmem_default = 129024
net.core.wmem_default = 129024
net.core.rmem_max = 131071
net.core.wmem_max = 131071
vm.lowmem_reserve_ratio = 256 256 32
vm.overcommit_memory = 0
[root@colinux ~]#

[root@ip-xxx-xxx-xxx-xxx ec2-user]# /sbin/sysctl -a | grep mem
vm.overcommit_memory = 0
vm.lowmem_reserve_ratio = 256 256 32
net.core.wmem_max = 131071
net.core.rmem_max = 131071
net.core.wmem_default = 229376
net.core.rmem_default = 229376
net.core.optmem_max = 20480
net.ipv4.igmp_max_memberships = 20
net.ipv4.tcp_mem = 14679 19574 29358
net.ipv4.tcp_wmem = 4096 16384 626368
net.ipv4.tcp_rmem = 4096 87380 626368
net.ipv4.udp_mem = 14679 19574 29358
net.ipv4.udp_rmem_min = 4096
net.ipv4.udp_wmem_min = 4096
[root@ip-xxx-xxx-xxx-xxx ec2-user]#
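
The two dumps above (a local colinux box versus an EC2 instance) show how much the defaults differ. Following the spirit of ESnet's host-tuning page, a host doing large transfers over high-latency paths would typically raise the buffer ceilings. The 16 MB values below are purely illustrative; size the maximums to your path's bandwidth-delay product:

# appended to /etc/sysctl.conf (example values, not a recommendation)
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

[root@colinux ~]# /sbin/sysctl -p    # apply without rebooting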

Side note:
For Windows, there appear to be a few things worth knowing from Windows Server 2008 onward:

TCP receive window auto-tuning does not work correctly in Windows Server 2008 R2

All the TCP/IP ports that are in a TIME_WAIT status are not closed after 497 days

Do you know about the Scalable Networking Pack?


Hive lets you work with data on Hadoop using HiveQL, an SQL-like language.
HBase is the best-known database on Hadoop, but
Hive provides a more user-friendly interface on top of HDFS,
so its reason for being is fundamentally different from HBase's.

———————————————————————————
Excerpt from http://hive.apache.org/
———————————————————————————
Hive is a data warehouse system for Hadoop that facilitates easy data summarization,
ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.
Hive provides a mechanism to project structure onto this data and query the data using a
SQL-like language called HiveQL. At the same time this language also allows traditional
map/reduce programmers to plug in their custom mappers and reducers when it is
inconvenient or inefficient to express this logic in HiveQL.
———————————————————————————-
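
To make the "SQL-like" part concrete, a typical HiveQL aggregation looks like this. The table is hypothetical (assume something like access_log(host STRING, status INT) already exists); a real session against sample data follows later in this post:

hive> SELECT host, COUNT(*) AS hits
    > FROM access_log
    > WHERE status = 500
    > GROUP BY host
    > ORDER BY hits DESC
    > LIMIT 10;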

Building a Hadoop + Hive test environment
Working Hadoop Hive hard with SQL-like queries!
A Hadoop Hive and MySQL usage example (29-FEB-2012)

[root@colinux ~]# vi /home/hiveuser/.bashrc

【Appended to /home/hiveuser/.bashrc】
export PATH=$PATH:/usr/java/latest/bin
export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=/usr/local/hadoop
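
A quick sanity check that the variables are picked up on login (minimal sketch; the output assumes the /usr/java/latest symlink created during the JDK install described later in this post):

[root@colinux ~]# su - hiveuser
[hiveuser@colinux ~]$ echo $JAVA_HOME
/usr/java/latest
[hiveuser@colinux ~]$ $JAVA_HOME/bin/java -version
java version "1.6.0_32"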

Hive Download Site
http://www.apache.org/dyn/closer.cgi/hive/

Hive Install

[root@colinux ~]# su - hadoop
[hadoop@colinux ~]$ pwd
/home/hadoop
[hadoop@colinux ~]$ wget http://ftp.kddilabs.jp/infosystems/apache/hive/stable/hive-0.8.1.tar.gz
--2012-05-20 14:50:13-- http://ftp.kddilabs.jp/infosystems/apache/hive/stable/hive-0.8.1.tar.gz
ftp.kddilabs.jp をDNSに問いあわせています… 192.26.91.193, 2001:200:601:10:206:5bff:fef0:466c
ftp.kddilabs.jp|192.26.91.193|:80 に接続しています… 接続しました。
HTTP による接続要求を送信しました、応答を待っています… 200 OK
長さ: 31325840 (30M) [application/x-gzip]
`hive-0.8.1.tar.gz' に保存中

100%[================================================================================>] 31,325,840 2.90M/s 時間 11s

2012-05-20 14:50:24 (2.81 MB/s) - `hive-0.8.1.tar.gz' へ保存完了 [31325840/31325840]
[hadoop@colinux ~]$

[hadoop@colinux ~]$ tar xvfz hive-0.8.1.tar.gz

[root@colinux ~]# mv /home/hadoop/hive-0.8.1 /usr/local/
[root@colinux ~]# cd /usr/local/
[root@colinux local]# ln -s hive-0.8.1 hive

To make future version upgrades easier, create a symbolic link after extracting.
[root@colinux local]# ls -l
合計 88
drwxr-xr-x 2 root root 4096 2011-12-10 10:12 bin
drwxr-xr-x 2 root root 4096 2011-12-10 10:12 etc
drwxr-xr-x 2 root root 4096 2007-04-17 21:46 games
lrwxrwxrwx 1 root root 23 2012-05-12 12:17 hadoop -> /usr/local/hadoop-1.0.1
drwxr-xr-x 15 hadoop hadoop 4096 2012-05-12 13:06 hadoop-1.0.1
lrwxrwxrwx 1 root root 10 2012-05-20 18:37 hive -> hive-0.8.1
drwxr-xr-x 9 root root 4096 2012-05-20 18:36 hive-0.8.1

[root@colinux local]# chown -R hadoop:hadoop hive/
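
The steps below always invoke hive by its full path. If you would rather have plain `hive` on the PATH, an optional addition of my own along these lines works too:

[root@colinux local]# su - hadoop
[hadoop@colinux ~]$ echo 'export HIVE_HOME=/usr/local/hive' >> ~/.bashrc
[hadoop@colinux ~]$ echo 'export PATH=$PATH:$HIVE_HOME/bin' >> ~/.bashrc
[hadoop@colinux ~]$ source ~/.bashrc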

Start Hadoop and confirm the daemons with jps.

[hadoop@colinux ~]$ /usr/local/hadoop/bin/hadoop namenode -format
12/05/26 09:21:10 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = colinux/127.0.0.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.0.1
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1243785; compiled by 'hortonfo' on Tue Feb 14 08:15:38 UTC 2012
************************************************************/
Re-format filesystem in /tmp/hadoop-hadoop/dfs/name ? (Y or N) Y
12/05/26 09:21:16 INFO util.GSet: VM type = 32-bit
12/05/26 09:21:16 INFO util.GSet: 2% max memory = 19.33375 MB
12/05/26 09:21:16 INFO util.GSet: capacity = 2^22 = 4194304 entries
12/05/26 09:21:16 INFO util.GSet: recommended=4194304, actual=4194304
12/05/26 09:21:17 INFO namenode.FSNamesystem: fsOwner=hadoop
12/05/26 09:21:17 INFO namenode.FSNamesystem: supergroup=supergroup
12/05/26 09:21:17 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/05/26 09:21:17 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/05/26 09:21:17 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/05/26 09:21:17 INFO namenode.NameNode: Caching file names occuring more than 10 times
12/05/26 09:21:18 INFO common.Storage: Image file of size 112 saved in 0 seconds.
12/05/26 09:21:18 INFO common.Storage: Storage directory /tmp/hadoop-hadoop/dfs/name has been successfully formatted.
12/05/26 09:21:18 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at colinux/127.0.0.1
************************************************************/
[hadoop@colinux ~]$ /usr/local/hadoop/bin/start-all.sh
starting namenode, logging to /usr/local/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-namenode-colinux.out
localhost: starting datanode, logging to /usr/local/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-datanode-colinux.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-secondarynamenode-colinux.out
starting jobtracker, logging to /usr/local/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-jobtracker-colinux.out
localhost: starting tasktracker, logging to /usr/local/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-tasktracker-colinux.out
[hadoop@colinux ~]$

[hadoop@colinux ~]$ jps
3438 Jps
3270 JobTracker
3402 TaskTracker
3058 DataNode
3185 SecondaryNameNode
[hadoop@colinux ~]$

Download the test data file and verify that Hadoop and Hive work.

[hadoop@colinux ~]$ mkdir ~/localfiles
[hadoop@colinux ~]$ cd localfiles/
[hadoop@colinux localfiles]$ wget http://www.atmarkit.co.jp/fdb/single/s_hive/dl/data.tar.gz
--2012-05-26 08:55:18-- http://www.atmarkit.co.jp/fdb/single/s_hive/dl/data.tar.gz
www.atmarkit.co.jp をDNSに問いあわせています… 202.218.219.147
www.atmarkit.co.jp|202.218.219.147|:80 に接続しています… 接続しました。
HTTP による接続要求を送信しました、応答を待っています… 200 OK
長さ: 2071417 (2.0M) [application/x-tar]
`data.tar.gz' に保存中

100%[=========================================================================================>] 2,071,417 705K/s 時間 2.9s

2012-05-26 08:55:21 (705 KB/s) - `data.tar.gz' へ保存完了 [2071417/2071417]

[hadoop@colinux localfiles]$

Extract the test files and run the hive command.
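(The extraction itself was not captured above; it would have been something like the following. The file name pref.csv and its id,name layout are inferred from the LOAD DATA statement and the SELECT results below.)

[hadoop@colinux localfiles]$ tar xvzf data.tar.gz
[hadoop@colinux localfiles]$ head -3 pref.csv
1,北海道
2,青森県
3,岩手県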
[hadoop@colinux ~]$ /usr/local/hive/bin/hive

hive> CREATE TABLE pref (id int, pref STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

hive> desc pref;
OK
id int
pref string
Time taken: 0.61 seconds
hive>

hive> LOAD DATA LOCAL INPATH '/home/hadoop/localfiles/pref.csv' OVERWRITE INTO TABLE pref;

Run a SELECT to confirm the data.

hive> select A.pref from pref A;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201205260921_0003, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201205260921_0003
Kill Command = /usr/local/hadoop-1.0.1/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201205260921_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2012-05-26 09:45:36,731 Stage-1 map = 0%, reduce = 0%
2012-05-26 09:45:45,791 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2012-05-26 09:45:46,811 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2012-05-26 09:45:47,831 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2012-05-26 09:45:48,831 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2012-05-26 09:45:49,851 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2012-05-26 09:45:50,881 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2012-05-26 09:45:51,891 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2012-05-26 09:45:52,911 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2012-05-26 09:45:53,931 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2012-05-26 09:45:55,511 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.5 sec
MapReduce Total cumulative CPU time: 1 seconds 500 msec
Ended Job = job_201205260921_0003
MapReduce Jobs Launched:
Job 0: Map: 1 Accumulative CPU: 1.5 sec HDFS Read: 820 HDFS Write: 479 SUCESS
Total MapReduce CPU Time Spent: 1 seconds 500 msec
OK
北海道
青森県
岩手県
宮城県
秋田県

Note: I reused the hadoop user created earlier, so the hive account went unused.


【Original sites】
Welcome to Apache™ Hadoop™!
Hadoop Common Releases

Other Info
Distributed computing with Linux and Hadoop

【Installing Java】
■ Install the JDK: Java SE Development Kit (JDK, v1.6 or later recommended)
http://www.oracle.com/technetwork/java/javase/downloads/index.html
http://www.oracle.com/technetwork/java/javase/downloads/jdk-6u32-downloads-1594644.html

[root@colinux hadoop]# ls -l
合計 67060
-rw-rw-r-- 1 root root 68593311 2012-05-12 08:03 jdk-6u32-linux-i586-rpm.bin
[root@colinux hadoop]# chmod 755 jdk-6u32-linux-i586-rpm.bin

[root@colinux hadoop]# ./jdk-6u32-linux-i586-rpm.bin
Unpacking…
Checksumming…
Extracting…
UnZipSFX 5.50 of 17 February 2002, by Info-ZIP (Zip-Bugs@lists.wku.edu).
inflating: jdk-6u32-linux-i586.rpm
inflating: sun-javadb-common-10.6.2-1.1.i386.rpm
inflating: sun-javadb-core-10.6.2-1.1.i386.rpm
inflating: sun-javadb-client-10.6.2-1.1.i386.rpm
inflating: sun-javadb-demo-10.6.2-1.1.i386.rpm
inflating: sun-javadb-docs-10.6.2-1.1.i386.rpm
inflating: sun-javadb-javadoc-10.6.2-1.1.i386.rpm
準備中… ########################################### [100%]
1:jdk ########################################### [100%]
Unpacking JAR files…
rt.jar…
jsse.jar…
charsets.jar…
tools.jar…
localedata.jar…
plugin.jar…
javaws.jar…
deploy.jar…
Installing JavaDB
準備中… ########################################### [100%]
1:sun-javadb-common ########################################### [ 17%]
2:sun-javadb-core ########################################### [ 33%]
3:sun-javadb-client ########################################### [ 50%]
4:sun-javadb-demo ########################################### [ 67%]
5:sun-javadb-docs ########################################### [ 83%]
6:sun-javadb-javadoc ########################################### [100%]

Java(TM) SE Development Kit 6 successfully installed.

Product Registration is FREE and includes many benefits:
* Notification of new versions, patches, and updates
* Special offers on Oracle products, services and training
* Access to early releases and documentation

[root@colinux hadoop]# java -version
java version "1.6.0_32"
Java(TM) SE Runtime Environment (build 1.6.0_32-b05)
Java HotSpot(TM) Client VM (build 20.7-b02, mixed mode, sharing)
[root@colinux hadoop]#

【Installing Hadoop】
※As of May 2012
http://hadoop.apache.org/common/releases.html#Download
http://ftp.kddilabs.jp/infosystems/apache/hadoop/common/
http://ftp.kddilabs.jp/infosystems/apache/hadoop/common/stable/

1.0.X – current stable version, 1.0 release
1.1.X – current beta version, 1.1 release
0.23.X – current alpha version, MR2
0.22.X – does not include security
0.20.203.X – legacy stable version
0.20.X – legacy version

Release notes
http://ftp.kddilabs.jp/infosystems/apache/hadoop/common/stable/RELEASE_NOTES_HADOOP-1.0.1.html

[root@colinux hadoop]# wget http://ftp.kddilabs.jp/infosystems/apache/hadoop/common/stable/hadoop-1.0.1.tar.gz
--2012-05-12 12:11:03-- http://ftp.kddilabs.jp/infosystems/apache/hadoop/common/stable/hadoop-1.0.1.tar.gz
ftp.kddilabs.jp をDNSに問いあわせています… 192.26.91.193, 2001:200:601:10:206:5bff:fef0:466c
ftp.kddilabs.jp|192.26.91.193|:80 に接続しています… 接続しました。
HTTP による接続要求を送信しました、応答を待っています… 200 OK
長さ: 60811130 (58M) [application/x-gzip]
`hadoop-1.0.1.tar.gz' に保存中

100%[=============================================================================================================================>] 60,811,130 3.24M/s 時間 21s

2012-05-12 12:11:24 (2.72 MB/s) - `hadoop-1.0.1.tar.gz' へ保存完了 [60811130/60811130]

[root@colinux hadoop]#

[root@colinux hadoop]# mv hadoop-1.0.1.tar.gz /usr/local/
[root@colinux local]# pwd
/usr/local
[root@colinux local]# tar zxf hadoop-1.0.1.tar.gz
[root@colinux local]#

[root@colinux local]# ls -l
合計 84
drwxr-xr-x 2 root root 4096 2011-12-10 10:12 bin
drwxr-xr-x 2 root root 4096 2011-12-10 10:12 etc
drwxr-xr-x 2 root root 4096 2007-04-17 21:46 games
drwxr-xr-x 14 root root 4096 2012-02-14 17:18 hadoop-1.0.1
drwxr-xr-x 3 root root 4096 2011-11-05 09:00 include
drwxr-xr-x 2 root root 4096 2007-04-17 21:46 lib
drwxr-xr-x 2 root root 4096 2007-04-17 21:46 libexec
lrwxrwxrwx 1 root root 38 2009-12-26 01:40 mysql -> mysql-5.5.0-m2-linux-i686-icc-glibc23/
drwxr-xr-x 14 mysql mysql 4096 2009-12-22 00:23 mysql-5.1.41-linux-i686-icc-glibc23
drwxr-xr-x 14 mysql mysql 4096 2009-12-26 01:37 mysql-5.5.0-m2-linux-i686-icc-glibc23
drwxr-xr-x 2 root root 4096 2007-04-17 21:46 sbin
drwxr-xr-x 6 root root 4096 2011-12-10 10:12 share
drwxr-xr-x 2 root root 4096 2011-01-09 17:14 src
drwxrwxrwt 2 root root 40 2012-05-12 06:49 tmp
[root@colinux local]# ln -s /usr/local/hadoop-1.0.1 /usr/local/hadoop
[root@colinux local]# ls -l
合計 84
drwxr-xr-x 2 root root 4096 2011-12-10 10:12 bin
drwxr-xr-x 2 root root 4096 2011-12-10 10:12 etc
drwxr-xr-x 2 root root 4096 2007-04-17 21:46 games
lrwxrwxrwx 1 root root 23 2012-05-12 12:17 hadoop -> /usr/local/hadoop-1.0.1
drwxr-xr-x 14 root root 4096 2012-02-14 17:18 hadoop-1.0.1
drwxr-xr-x 3 root root 4096 2011-11-05 09:00 include
drwxr-xr-x 2 root root 4096 2007-04-17 21:46 lib
drwxr-xr-x 2 root root 4096 2007-04-17 21:46 libexec
lrwxrwxrwx 1 root root 38 2009-12-26 01:40 mysql -> mysql-5.5.0-m2-linux-i686-icc-glibc23/
drwxr-xr-x 14 mysql mysql 4096 2009-12-22 00:23 mysql-5.1.41-linux-i686-icc-glibc23
drwxr-xr-x 14 mysql mysql 4096 2009-12-26 01:37 mysql-5.5.0-m2-linux-i686-icc-glibc23
drwxr-xr-x 2 root root 4096 2007-04-17 21:46 sbin
drwxr-xr-x 6 root root 4096 2011-12-10 10:12 share
drwxr-xr-x 2 root root 4096 2011-01-09 17:14 src
drwxrwxrwt 2 root root 40 2012-05-12 06:49 tmp
[root@colinux local]#

【Hadoop service account setup (passwordless key authentication)】

[root@colinux local]# /usr/sbin/useradd hadoop
[root@colinux local]# chown -R hadoop:hadoop /usr/local/hadoop-1.0.1
[root@colinux local]#
[root@colinux local]# passwd hadoop
Changing password for user hadoop.
新しいUNIX パスワード:
新しいUNIX パスワードを再入力してください:
passwd: all authentication tokens updated successfully.
[root@colinux local]#
[root@colinux local]# id hadoop
uid=503(hadoop) gid=503(hadoop) 所属グループ=503(hadoop)
[root@colinux local]#

[root@colinux local]# su - hadoop
[hadoop@colinux ~]$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
Generating public/private dsa key pair.
Created directory '/home/hadoop/.ssh'.
Your identification has been saved in /home/hadoop/.ssh/id_dsa.
Your public key has been saved in /home/hadoop/.ssh/id_dsa.pub.
The key fingerprint is:
d0:5c:57:22:9b:8e:38:97:e4:47:0f:ac:08:13:4c:ae hadoop@colinux
[hadoop@colinux ~]$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
[hadoop@colinux ~]$ chmod 600 ~/.ssh/authorized_keys
[hadoop@colinux ~]$

[hadoop@colinux ~]$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is a2:b7:25:e3:78:61:15:2a:59:ed:fb:9f:1c:e7:94:db.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
[hadoop@colinux ~]$ exit
[hadoop@colinux ~]$ ssh localhost
Last login: Sat May 12 12:31:20 2012 from localhost.localdomain
[hadoop@colinux ~]$

[hadoop@colinux ~]$ ls -l /usr/java/
合計 4
lrwxrwxrwx 1 root root 16 2012-05-12 08:11 default -> /usr/java/latest
drwxr-xr-x 7 root root 4096 2012-05-12 08:11 jdk1.6.0_32
lrwxrwxrwx 1 root root 21 2012-05-12 08:11 latest -> /usr/java/jdk1.6.0_32
[hadoop@colinux ~]$

【Editing the Hadoop configuration files】

[hadoop@colinux ~]$ cd /usr/local/hadoop-1.0.1/conf/

[hadoop@colinux conf]$ vi hadoop-env.sh
# Set Hadoop-specific environment variables here.
—————————————————-
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/java/default
# Extra Java CLASSPATH elements. Optional.
# export HADOOP_CLASSPATH=
—————————————————-
[hadoop@colinux conf]$ vi core-site.xml
[hadoop@colinux conf]$ cat core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

[hadoop@colinux conf]$
[hadoop@colinux conf]$ vi hdfs-site.xml
[hadoop@colinux conf]$ cat hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

[hadoop@colinux conf]$
[hadoop@colinux conf]$ vi mapred-site.xml
[hadoop@colinux conf]$ cat mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
     <property>
         <name>mapred.job.tracker</name>
         <value>localhost:9001</value>
     </property>
</configuration>

[hadoop@colinux conf]$

【Initial setup and starting the services】

[hadoop@colinux conf]$ /usr/local/hadoop/bin/hadoop namenode -format
12/05/12 13:05:05 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = colinux/127.0.0.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.0.1
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1243785; compiled by 'hortonfo' on Tue Feb 14 08:15:38 UTC 2012
************************************************************/
12/05/12 13:05:06 INFO util.GSet: VM type = 32-bit
12/05/12 13:05:06 INFO util.GSet: 2% max memory = 19.33375 MB
12/05/12 13:05:06 INFO util.GSet: capacity = 2^22 = 4194304 entries
12/05/12 13:05:06 INFO util.GSet: recommended=4194304, actual=4194304
12/05/12 13:05:08 INFO namenode.FSNamesystem: fsOwner=hadoop
12/05/12 13:05:08 INFO namenode.FSNamesystem: supergroup=supergroup
12/05/12 13:05:08 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/05/12 13:05:08 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/05/12 13:05:08 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/05/12 13:05:08 INFO namenode.NameNode: Caching file names occuring more than 10 times
12/05/12 13:05:09 INFO common.Storage: Image file of size 112 saved in 0 seconds.
12/05/12 13:05:10 INFO common.Storage: Storage directory /tmp/hadoop-hadoop/dfs/name has been successfully formatted.
12/05/12 13:05:10 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at colinux/127.0.0.1
************************************************************/
[hadoop@colinux conf]$

[hadoop@colinux conf]$ /usr/local/hadoop/bin/start-all.sh
starting namenode, logging to /usr/local/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-namenode-colinux.out
localhost: starting datanode, logging to /usr/local/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-datanode-colinux.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-secondarynamenode-colinux.out
starting jobtracker, logging to /usr/local/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-jobtracker-colinux.out
localhost: starting tasktracker, logging to /usr/local/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-tasktracker-colinux.out
[hadoop@colinux conf]$

[hadoop@colinux conf]$ jps
4689 Jps
4313 SecondaryNameNode
4062 NameNode
4186 DataNode
4561 TaskTracker
4399 JobTracker
[hadoop@colinux conf]$
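
With all five daemons up, a quick check that HDFS answers requests (output omitted; both are standard Hadoop 1.x CLI commands):

[hadoop@colinux conf]$ /usr/local/hadoop/bin/hadoop fs -ls /
[hadoop@colinux conf]$ /usr/local/hadoop/bin/hadoop dfsadmin -report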

【Verifying the basic setup】
NameNode web UI
http://localhost:50070/
 e.g. http://192.168.0.2:50070/dfshealth.jsp

JobTracker web UI
http://localhost:50030/
 e.g. http://192.168.0.2:50030/jobtracker.jsp
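
If the box is headless, the same endpoints can be poked with curl; a 200 response suggests the UI is up (illustrative check, not from the original session):

$ curl -sI http://localhost:50070/dfshealth.jsp | head -1
HTTP/1.1 200 OK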

【Sample test】(the pi example's arguments are the number of maps and the number of samples per map)

[hadoop@colinux hadoop]$ ./bin/hadoop jar hadoop-examples-1.0.1.jar pi 4 1000
Number of Maps = 4
Samples per Map = 1000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Starting Job
12/05/12 13:24:36 INFO mapred.FileInputFormat: Total input paths to process : 4
12/05/12 13:24:37 INFO mapred.JobClient: Running job: job_201205121308_0001
12/05/12 13:24:38 INFO mapred.JobClient: map 0% reduce 0%
12/05/12 13:25:35 INFO mapred.JobClient: map 25% reduce 0%
12/05/12 13:25:50 INFO mapred.JobClient: map 50% reduce 0%
12/05/12 13:26:35 INFO mapred.JobClient: map 75% reduce 0%
12/05/12 13:26:54 INFO mapred.JobClient: map 75% reduce 16%
12/05/12 13:27:09 INFO mapred.JobClient: map 100% reduce 25%
12/05/12 13:27:16 INFO mapred.JobClient: map 100% reduce 33%
12/05/12 13:27:29 INFO mapred.JobClient: map 100% reduce 100%
12/05/12 13:27:43 INFO mapred.JobClient: Job complete: job_201205121308_0001
12/05/12 13:27:45 INFO mapred.JobClient: Counters: 30
12/05/12 13:27:45 INFO mapred.JobClient: Job Counters
12/05/12 13:27:45 INFO mapred.JobClient: Launched reduce tasks=1
12/05/12 13:27:45 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=245935
12/05/12 13:27:45 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/05/12 13:27:45 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/05/12 13:27:45 INFO mapred.JobClient: Launched map tasks=4
12/05/12 13:27:45 INFO mapred.JobClient: Data-local map tasks=4
12/05/12 13:27:45 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=110001
12/05/12 13:27:45 INFO mapred.JobClient: File Input Format Counters
12/05/12 13:27:45 INFO mapred.JobClient: Bytes Read=472
12/05/12 13:27:45 INFO mapred.JobClient: File Output Format Counters
12/05/12 13:27:45 INFO mapred.JobClient: Bytes Written=97
12/05/12 13:27:45 INFO mapred.JobClient: FileSystemCounters
12/05/12 13:27:45 INFO mapred.JobClient: FILE_BYTES_READ=94
12/05/12 13:27:45 INFO mapred.JobClient: HDFS_BYTES_READ=964
12/05/12 13:27:45 INFO mapred.JobClient: FILE_BYTES_WRITTEN=108240
12/05/12 13:27:45 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=215
12/05/12 13:27:45 INFO mapred.JobClient: Map-Reduce Framework
12/05/12 13:27:45 INFO mapred.JobClient: Map output materialized bytes=112
12/05/12 13:27:45 INFO mapred.JobClient: Map input records=4
12/05/12 13:27:45 INFO mapred.JobClient: Reduce shuffle bytes=112
12/05/12 13:27:45 INFO mapred.JobClient: Spilled Records=16
12/05/12 13:27:45 INFO mapred.JobClient: Map output bytes=72
12/05/12 13:27:45 INFO mapred.JobClient: Total committed heap usage (bytes)=816316416
12/05/12 13:27:45 INFO mapred.JobClient: CPU time spent (ms)=44840
12/05/12 13:27:45 INFO mapred.JobClient: Map input bytes=96
12/05/12 13:27:45 INFO mapred.JobClient: SPLIT_RAW_BYTES=492
12/05/12 13:27:45 INFO mapred.JobClient: Combine input records=0
12/05/12 13:27:45 INFO mapred.JobClient: Reduce input records=8
12/05/12 13:27:45 INFO mapred.JobClient: Reduce input groups=8
12/05/12 13:27:45 INFO mapred.JobClient: Combine output records=0
12/05/12 13:27:45 INFO mapred.JobClient: Physical memory (bytes) snapshot=451342336
12/05/12 13:27:45 INFO mapred.JobClient: Reduce output records=0
12/05/12 13:27:45 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1873936384
12/05/12 13:27:45 INFO mapred.JobClient: Map output records=8
Job Finished in 189.673 seconds
Estimated value of Pi is 3.14000000000000000000
[hadoop@colinux hadoop]$

[hadoop@colinux hadoop]$ /usr/local/hadoop/bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
[hadoop@colinux hadoop]$

Admin UI during the job run (1)

Admin UI during the job run (2)

Admin UI during the job run (3): execution times, etc.

Admin UI during the job run (4): other details

Admin UI during the job run (5)

Reference sites
What is the Apache Hadoop project?

Introduction to Hadoop: Hadoop and high availability (09-MAY-2012)

Installing Hadoop and trying it out (06-APR-2011)

Monitoring Hadoop with Nagios (21-APR-2011)

Monitoring Hadoop with Ganglia (22-APR-2011)