#66

dumbo cat /hdfs/path/part* silently fails to concatenate all part files

    • Created on: Thu, Dec 03 2009 (over 2 years ago)
    • Reported by: zstone
    • Assigned to: -
    • Milestone: -
    • Type: -
    • Status: New
    • Priority: High (2)
    • Component: -
    • Estimate: None

    It appears that dumbo cat /hdfs/path/part* does not actually concatenate all of the part files in an HDFS directory. Instead, it silently emits only the key-value pairs from the first part file.

    Since the normal Dumbo syntax without the final star chokes on the _logs directory that Hadoop creates by default, people may be using this part* syntax frequently, and they may not realize that it yields incorrect results.

    Current workarounds include using dumbo cat without the star, after either manually deleting the _logs directory or configuring Hadoop not to create it. It may be more convenient to use the HDFS ls command to iterate through the part files in a directory explicitly, to ensure that each one is processed as expected.
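
The ls-based workaround could be scripted roughly as follows. This is only a sketch, not part of Dumbo: the helper names list_part_files and cat_parts are hypothetical, it assumes that hadoop and dumbo are both on the PATH, that hadoop fs -ls prints one entry per line with the path as the last field, and that part files have names beginning with "part". Any extra options your dumbo cat invocation needs are passed through unchanged.

```python
import posixpath
import subprocess


def list_part_files(ls_output):
    """Extract part-file paths from the text printed by `hadoop fs -ls`.

    Assumes the usual listing layout: an optional "Found N items" header,
    then one line per entry whose last whitespace-separated field is the path.
    Paths containing spaces are not handled by this sketch.
    """
    paths = []
    for line in ls_output.splitlines():
        fields = line.split()
        # Skip the header, blank lines, and directories such as _logs.
        if len(fields) < 8 or fields[0].startswith("d"):
            continue
        path = fields[-1]
        if posixpath.basename(path).startswith("part"):
            paths.append(path)
    return sorted(paths)


def cat_parts(directory, dumbo_args=()):
    """Run `dumbo cat` once per part file so that no part is silently skipped."""
    ls_output = subprocess.check_output(["hadoop", "fs", "-ls", directory])
    for path in list_part_files(ls_output.decode("utf-8")):
        # dumbo_args: whatever extra options your setup requires (assumption).
        # Each child process writes to this process's stdout, in sorted order.
        subprocess.check_call(["dumbo", "cat", path] + list(dumbo_args))
```

Calling cat_parts("/hdfs/path") would then emit the contents of each part file in turn. Because check_call raises an error when any single dumbo cat invocation fails, a missing or unreadable part file is reported rather than silently dropped.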