dumbo cat /hdfs/path/part* silently fails to concatenate all part files
It appears that dumbo cat /hdfs/path/part* does not actually concatenate all of the parts in an HDFS directory -- instead, it silently emits only the key-value pairs from the first part.
Since the normal Dumbo syntax without the final star chokes on the _logs directory that Hadoop creates by default, people may be using this part* syntax frequently, and they may not realize that it yields incorrect results.
Current workarounds include using dumbo cat without the star by manually deleting the _logs directory or configuring Hadoop not to create it. It may be more convenient to use the HDFS ls command to iterate through the part files in a directory explicitly to ensure that each one is processed as expected.
Since the normal Dumbo syntax without the final star chokes on the _logs directory that Hadoop creates by default, people may be using this part* syntax frequently, and they may not realize that it yields incorrect results.
Current workarounds include using dumbo cat without the star by manually deleting the _logs directory or configuring Hadoop not to create it. It may be more convenient to use the HDFS ls command to iterate through the part files in a directory explicitly to ensure that each one is processed as expected.
Leave a comment