This is just a short post about Hadoop Streaming which I hope will be helpful for people who stumble upon the same problem. For those who don’t know Hadoop Streaming: it lets you create MapReduce jobs with arbitrary executables or scripts as the mapper and/or reducer, so you don’t have to write a Java wrapper application to run them on your Hadoop cluster. This allows, for example, a file compression utility to be run in parallel over a large set of files with a single command line call.
Suppose you would like to process a set of files using Hadoop Streaming with one map task per input file. The official FAQ recommends:
“Generate a file containing the full HDFS path of the input files. Each map task would get one file name as input.” [1]
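Such a file list can be built with something like the following sketch. The directory /data/input and the target path /jobs/filelist.txt are hypothetical stand-ins:

```sh
# Hypothetical paths: collect the full HDFS paths, one per line
# ($8 is the path column of hdfs dfs -ls; NR > 1 skips the header),
# then upload the list so it can serve as the streaming job's input.
hdfs dfs -ls /data/input | awk 'NR > 1 {print $8}' > filelist.txt
hdfs dfs -put filelist.txt /jobs/filelist.txt
```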
I followed this advice, but for reasons I couldn’t figure out, Hadoop created tasks in a different fashion than I had expected. [2] For example, in one use case I had only two files listed in my input file on a 10-node test cluster. Depending on the value I set for mapreduce.map.tasks while debugging, it either (1) processed only one file at a time, but with two mappers each, or (2) processed both files at the same time, again with two mappers each. For the mapper I’m using this is a no-go, because the output file for the corresponding input file is then written to by both mapper processes, causing inconsistencies in the output.
The solution was quite straightforward. Instead of using the default TextInputFormat, I passed org.apache.hadoop.mapred.lib.NLineInputFormat as the parameter for the -inputformat switch. My mapper script now receives not just the file name as input, but a key/value pair separated by a tab, the key being the byte offset of the line in the file and the value being the line contents. My mapper script is written in Python, so after doing a
```python
import sys

# ... other code here ...
for task in sys.stdin:
    # key is the byte offset of the line, filename the line contents
    (key, filename) = task.rstrip().split('\t')
```
I have the input file name for my map job and can do the actual processing.
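For reference, the full streaming invocation then looks roughly like the sketch below. The jar location, input/output paths, and script name are assumptions for illustration; the essential parts are the -inputformat switch and using the file list as the job input:

```sh
# A map-only streaming job that feeds each mapper one line
# (i.e. one file name) from the file list. Paths are hypothetical.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D mapreduce.job.reduces=0 \
  -D mapreduce.input.lineinputformat.linespermap=1 \
  -files mapper.py \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
  -input /jobs/filelist.txt \
  -output /jobs/out \
  -mapper mapper.py
```

The linespermap setting is 1 by default, so it is shown here only to make the one-file-per-mapper intent explicit.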
[1] http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopStreaming.html#How_do_I_process_files_one_per_map
[2] Admittedly, one thing I haven’t checked is whether this happens because I’m not using hdfs:// style URLs but the corresponding NFS file paths instead, since I’m operating on the HDFS via the HDFS NFS Gateway. But even if the behavior were different for hdfs:// URLs, the solution should still be useful for other NFS gateway users.