Sahara

[EDP] Data discovery component

Registered by Alexander Kuznetsov on 2013-07-11

EDP can have several sources of data for processing. Data can be pulled from Swift, GlusterFS, or a NoSQL database such as Cassandra or HBase. To provide unified access to this data, we'll introduce a component responsible for discovering the data location and providing the right configuration for the Hadoop cluster.

This blueprint covers the engine implementation only; it does not require all plugins to be implemented.

Supported Data Source Types:
  * HDFS
  * HDFS on existing cluster
  * NoSQL databases
  * Distributed storage (Swift, Ceph, GlusterFS)
  * RDBMS with Apache Sqoop

Data Source Types are introduced via a plugin mechanism.

Data Source Object
  * Category (NoSQL database, RDBMS, Distributed file storage, existing Hadoop cluster)
  * Type
  * Description
  * Credentials
  * Possible representations, for example a Hive table
  * List of jobs using this data source
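
As an illustration only, the Data Source Object could be modelled roughly as follows. This is a minimal sketch, not the Sahara implementation: the class names `Category` and `DataSource`, the field names, and the example values are all assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List


class Category(Enum):
    # Categories taken from the list above; the enum itself is hypothetical.
    NOSQL = "NoSQL database"
    RDBMS = "RDBMS"
    DISTRIBUTED_STORAGE = "Distributed file storage"
    EXISTING_HADOOP = "existing Hadoop cluster"


@dataclass
class DataSource:
    # Field names are assumptions; the blueprint only lists the attributes.
    category: Category
    type: str
    description: str
    credentials: Dict[str, Any]
    # Possible representations, for example a Hive table.
    representations: List[str] = field(default_factory=list)
    # Identifiers of the jobs that use this data source.
    jobs: List[str] = field(default_factory=list)


# Example values are invented for illustration.
example = DataSource(
    category=Category.DISTRIBUTED_STORAGE,
    type="swift",
    description="Raw logs container",
    credentials={"user": "demo", "password": "secret"},
    representations=["Hive table"],
)
```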

Methods
  * Register data source
  * Get credentials for data source. Before the credentials are returned, the plugin responsible for this data type transforms the JSON object into the right format.
  * Assign to job
  * Detach from job
  * Get list of available components
  * Delete data source
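
The methods above can be sketched as a small in-memory registry. This is a hedged sketch under stated assumptions, not the actual engine: `DataSourceRegistry`, its method names, the plugin callable signature, and the `example.swift.*` configuration keys are all hypothetical. "Get list of available components" is read here as listing the data source types for which a plugin is registered, and refusing to delete a data source that is still attached to a job is a design assumption, not something the blueprint states.

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical plugin shape: takes the decoded JSON credentials and returns
# them in the format expected for that data source type.
CredentialTransform = Callable[[Dict[str, Any]], Dict[str, Any]]


class DataSourceRegistry:
    def __init__(self) -> None:
        self._plugins: Dict[str, CredentialTransform] = {}
        self._sources: Dict[str, Dict[str, Any]] = {}

    def add_plugin(self, type_name: str, transform: CredentialTransform) -> None:
        # Data source types are introduced via plugins.
        self._plugins[type_name] = transform

    def register(self, name: str, type_name: str, credentials_json: str) -> None:
        if type_name not in self._plugins:
            raise ValueError(f"no plugin for data source type {type_name!r}")
        self._sources[name] = {
            "type": type_name,
            "credentials": credentials_json,
            "jobs": [],
        }

    def get_credentials(self, name: str) -> Dict[str, Any]:
        # The plugin for this type transforms the stored JSON object
        # before the credentials are returned.
        source = self._sources[name]
        raw = json.loads(source["credentials"])
        return self._plugins[source["type"]](raw)

    def assign_to_job(self, name: str, job_id: str) -> None:
        jobs = self._sources[name]["jobs"]
        if job_id not in jobs:
            jobs.append(job_id)

    def detach_from_job(self, name: str, job_id: str) -> None:
        self._sources[name]["jobs"].remove(job_id)

    def list_types(self) -> List[str]:
        # One reading of "available components": the registered types.
        return sorted(self._plugins)

    def delete(self, name: str) -> None:
        # Assumption: a data source still used by a job cannot be deleted.
        if self._sources[name]["jobs"]:
            raise ValueError(f"data source {name!r} is still assigned to jobs")
        del self._sources[name]


registry = DataSourceRegistry()
registry.add_plugin(
    "swift",
    lambda c: {
        "example.swift.username": c["user"],
        "example.swift.password": c["password"],
    },
)
registry.register("logs", "swift", '{"user": "demo", "password": "secret"}')
registry.assign_to_job("logs", "job-1")
```

Keeping the transformation inside the plugin means the engine only stores an opaque JSON object and does not need to know the configuration format of any particular storage backend.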