Integrate Hadoop and Pig

Registered by Mathias Gug

Hadoop is a popular map/reduce implementation that should be packaged properly in Ubuntu. Packaging Pig, the data analysis platform built on Hadoop, should also be considered.

Blueprint information

Status:
Complete
Approver:
Jos Boumans
Priority:
High
Drafter:
Thierry Carrez
Direction:
Approved
Assignee:
Mathias Gug
Definition:
Approved
Series goal:
Accepted for maverick
Implementation:
Implemented
Milestone target:
ubuntu-10.10
Started by
Mathias Gug
Completed by
Mathias Gug

Sprints

Whiteboard

Status:
[20100811]:
 * What did you say you would do?
     [WI] Suggest packaging improvements to Cloudera: TODO
 * What did you actually do?
     Continued to test the Cloudera packages and sent suggestions. A public issue tracker has been set up and all suggestions have been entered. Deployed a Hadoop cluster on two systems.
 * What issues or problems are you having? What do you need help with?
      None.
 * What's next?
    Use the Hadoop cluster to process the ubuntu-server-bugs mailing list archive for statistics.
    [WI] Test cloudera packages (CDH3): INPROGRESS

[20100730]:
 * What did you say you would do?
      [WI] Drop Debian hadoop, hbase packages from the Ubuntu archive: TODO
 * What did you actually do?
      Tested the Cloudera packages and have a list of improvements to suggest. Have a contact within Cloudera to forward suggestions to.
 * What issues or problems are you having? What do you need help with?
      None.
 * What's next?
     [WI] Suggest packaging improvements to Cloudera: TODO

[20100705]:
 * What did you say you would do?
      [WI] Drop Debian hadoop, hbase packages from the Ubuntu archive: TODO
 * What did you actually do?
      Followed up with the Debian maintainer to ask his advice on dropping the packages from Ubuntu. He doesn't have a strong opinion on what Ubuntu should do.
 * What issues or problems are you having? What do you need help with?
      None.
 * What's next?
      [WI] Drop Debian hadoop, hbase packages from the Ubuntu archive: TODO

[20100630]
 * What happened?: Following discussions at the Hadoop summit, the Cloudera patch set is the best option for providing a good Hadoop experience in Ubuntu. The Hadoop packages currently in Debian are not production ready. We should drop them from the Ubuntu archive and review the Cloudera packages instead.
 * Any blockers?: Nope.
 * Next step: [WI] Drop Debian hadoop, hbase packages from the Ubuntu archive: TODO

[20100616]
 * What happened?:
 * Any blockers?:
 * Next step: Attend the hadoop summit (Tuesday June 29) to flesh out a hadoop strategy.

[20100614]
 * What happened?: Talked with Debian about the proposed solution. They are not enthusiastic, as it would encourage multiple flavors of Hadoop while upstream is working to avoid that.
 * Any blockers?:
 * Next step:

[20100607]
 * What happened?: Looked at the different patch sets. Rather than selecting one patch set, the idea is to provide one package (libhadoop-*-java) where the integration could be done.
 * Any blockers?: Nope.
 * Next step: contact the Debian maintainer about using libhadoop-*-java as the point of integration.

Complexity:
maverick-alpha-3: 2
ubuntu-10.10-beta: 1
ubuntu-10.10: 1

Work items for maverick-alpha-2:
Analysis of Yahoo codebase: DONE
Analysis of Cloudera codebase: DONE
Analysis of Debian/Cloudera packaging: DONE
Selection of codebases to package: DONE
Finalize spec design: DONE
Brainstorm potential improvements in Hadoop stack packaging (to be done at the Hadoop Summit): DONE

Work items for maverick-alpha-3:
Drop Debian hadoop, hbase packages from the Ubuntu archive: POSTPONED
Review Cloudera packages (CDH3 beta2): DONE
Suggest packaging improvements to Cloudera: DONE
Test Cloudera packages: DONE

Work items for ubuntu-10.10-beta:
Drop Debian hadoop, hbase packages from the Ubuntu archive: POSTPONED
Write puppet recipe to automate hadoop deployment: POSTPONED
Test cloudera packages (CDH3): DONE

Work items for ubuntu-10.10:
Write puppet recipe to automate namenode deployment: DONE
Write puppet recipe to automate jobtracker deployment: DONE
Write puppet recipe to automate worker (tasktracker + datanode) deployment: DONE
Drop Debian hadoop, hbase packages from the Ubuntu archive: DONE
Test cloudera hadoop packages (CDH3): DONE
Write blog post about hadoop+puppet on UEC/EC2 (https://ubuntumathiaz.wordpress.com/2010/09/27/deploying-a-hadoop-cluster-on-ec2uec-with-puppet-and-ubuntu-maverick/): DONE

Reviews:
[20100526] / jib:
 * Suggested priority: 2
 * Suggested Subcycle: Every
 * We should engage with Yahoo/Cloudera/Facebook/Upstream to converge on a single packaging effort
 * Attending Hadoop conference in June may be helpful to this; assignee should probably go
 * Are the packaging improvements understood already? If so, clarify in WI please


Work Items