Hadoop Korean Encoding Problem

I ran into this problem about nine months ago while working on a mini project with Hadoop.

I looked everywhere for a solution (I even asked Doug Cutting…) and went through an enormous amount of trial and error, but the decisive clue came from 김형준 (Kim Hyung-jun).

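As you know, Java handles text internally as Unicode.
So, for the log processing and inverted-index build, I converted the input files to UTF-8 and loaded them onto a five-machine Hadoop cluster.
At first I ran a quick test with English text only, then started feeding Korean data to Hadoop in earnest, and the Korean came out garbled beyond recognition.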
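To illustrate the kind of garbling involved, here is a minimal sketch (not the code from the project). It assumes the cause is UTF-8 bytes being decoded with a different charset; ISO-8859-1 stands in for "some non-UTF-8 default", and the class and method names are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    /** Encodes text as UTF-8, then decodes those same bytes with the given charset. */
    static String roundTrip(String text, Charset decodeWith) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, decodeWith);
    }

    public static void main(String[] args) {
        // Two Hangul syllables, written as Unicode escapes so the source
        // file's own encoding cannot affect the demo.
        String original = "\uD55C\uAE00";

        // Decoding with the matching charset restores the text.
        System.out.println(roundTrip(original, StandardCharsets.UTF_8).equals(original)); // true

        // Decoding with a mismatched charset (assumed here: ISO-8859-1)
        // turns each byte into its own character, so the text is garbled.
        String garbled = roundTrip(original, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals(original)); // false
        System.out.println(garbled.length());         // 6 (one char per UTF-8 byte)
    }
}
```

Code that passes an explicit charset is immune to this, but any code path that falls back on the JVM default charset depends on how that particular JVM was started.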

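I tried modifying the writer and the reader and just about everything else, and nothing worked. While I was still stuck, I finally solved it by changing Hadoop's configuration file. (Personally, I think modifying the source of an open-source program should be avoided, because it amounts to creating yet another branch of the source code.)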

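The gist of the problem is as follows.
When Hadoop runs, its JVM spawns many child JVMs, and those child JVMs do not follow the parent's settings. That was the cause.
So the fix is to add the setting below to the file at this path: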

${HADOOP_HOME}/conf/hadoop-site.xml

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx200m -Dfile.encoding=utf-8</value>
  <description>Java opts for the task tracker child processes. Subsumes
  'mapred.child.heap.size' (If a mapred.child.heap.size value is found
  in a configuration, its maximum heap size will be used and a warning
  emitted that heap.size has been deprecated). Also, the following symbols,
  if present, will be interpolated: @taskid@ is replaced by current TaskID;
  and @port@ will be replaced by mapred.task.tracker.report.port + 1 (A second
  child will fail with a port-in-use if mapred.tasktracker.tasks.maximum is
  greater than one). Any other occurrences of '@' will go unchanged. For
  example, to enable verbose gc logging to a file named for the taskid in
  /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of:

  -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc
  </description>
</property>
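One way to check whether a given JVM actually picked up the option is to print its effective charset settings. The following is a minimal sketch using only the standard library; the class name CharsetReport and the idea of calling describe() from inside a task are my own assumptions for illustration, not part of Hadoop.

```java
import java.nio.charset.Charset;

public class CharsetReport {
    /** Returns the file.encoding system property and the JVM's default charset. */
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // Run with and without -Dfile.encoding=utf-8 and compare the output.
        System.out.println(describe());
    }
}
```

Running it as `java -Dfile.encoding=utf-8 CharsetReport` and again without the flag shows what each JVM sees. Logging the same string from inside a map or reduce task (a hypothetical usage) would reveal whether the child JVMs received the option from mapred.child.java.opts.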