Kedar Chitale, Sr. Solution Architect and Ashutosh Marathe, Sr. Technical Specialist on Apr 10, 2017
H-Scale, CitiusTech’s big data platform, helps healthcare organizations manage large structured and unstructured data sets. The platform enables organizations to accelerate their use of Hadoop and other big data technologies in healthcare. It can parse, store, manage and query large healthcare datasets through a SaaS-based, HIPAA-compliant Hadoop environment. H-Scale uses industry-standard transport mechanisms (REST, JDBC).
H-Scale expedites automatic healthcare message processing, data aggregation and Data Lake creation, configurable rules for streaming analysis, late binding with schema-on-read, and advanced analytics.
This article describes the ingestion components of the H-Scale framework in detail.
H-Scale supports two ingestion techniques:
1. REST/SOAP - Real time ingestion
2. Sqoop - DB Ingestion
1. REST/SOAP - Real time ingestion
REST/SOAP Endpoint - The H-Scale Ingestion REST/SOAP service is built for real-time ingestion of data (a file or a binary message) into the H-Scale Data Lake. The service exposes a single interface for storing messages, covering both big files (more than 10 MB) and small files (10 MB or less).
NOTE: The size threshold that separates big files from small files is configurable.
When a user calls the store method with the required parameters, the H-Scale Ingestion REST service publishes the message to the respective Kafka topic. The message is later processed by the Storage Manager, which stores the data in the H-Scale Data Lake.
Kafka - Healthcare organizations are showing interest in streaming data, and many applications need a producer-consumer model to ingest and process data in real time. Apache Kafka is a fast, scalable, durable and distributed messaging system. H-Scale uses Apache Kafka to carry the real-time data ingested through H-Scale’s REST/SOAP web services.
* Kafka Producer - The H-Scale REST/SOAP web service invokes the Kafka Producer to publish messages to a Kafka topic. The Kafka Producer consists of the following subcomponents:
o Message Router - The Message Router implements the content-based router (CBR) pattern. It selects the route to execute based on the file size threshold provided by the user.
For big files, the Message Router stores the content in an HDFS temp location and calls the Message Publisher to publish only the metadata to the Kafka topic, including the HDFS temp folder location. For small files, the Message Publisher is invoked to post the content together with its metadata to the Kafka topic.
o Message Publisher - The Message Publisher is a utility that publishes messages to a Kafka topic. If the topic does not exist, the Message Publisher creates it before posting the message.
* Kafka Consumer - The H-Scale Kafka Consumer consumes messages from Kafka topics and delegates their processing to the Storm Topology. One consumer per Kafka partition is recommended.
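The routing rule described above can be sketched in a few lines. This is a minimal illustration, not H-Scale code: `InMemoryBroker` and `InMemoryTempStore` are hypothetical stand-ins for Kafka and the HDFS temp location, and the function name, message layout and temp path are assumptions made for the example.

```python
from dataclasses import dataclass, field

# Assumed default; the article states that the big/small cut-off is configurable.
BIG_FILE_THRESHOLD_BYTES = 10 * 1024 * 1024


@dataclass
class InMemoryBroker:
    """Hypothetical stand-in for Kafka: topic name -> list of messages."""
    topics: dict = field(default_factory=dict)

    def publish(self, topic, message):
        # Like the Message Publisher, create the topic if it is missing.
        self.topics.setdefault(topic, []).append(message)


@dataclass
class InMemoryTempStore:
    """Hypothetical stand-in for the HDFS temp location: path -> bytes."""
    files: dict = field(default_factory=dict)

    def write(self, path, content):
        self.files[path] = content
        return path


def route_message(name, content, topic, broker, temp_store,
                  threshold=BIG_FILE_THRESHOLD_BYTES):
    """Content-based routing on payload size; returns the route taken."""
    metadata = {"name": name, "size": len(content)}
    if len(content) > threshold:
        # Big file: park the payload and publish only metadata plus location.
        path = temp_store.write(f"/tmp/hscale/{name}", content)
        broker.publish(topic, {"metadata": metadata, "location": path})
        return "big"
    # Small file: publish the content and its metadata together.
    broker.publish(topic, {"metadata": metadata, "content": content})
    return "small"
```

Keeping large payloads out of the topic means the broker only carries a small pointer message for big files, while the consumer fetches the actual content from the temp location.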
Storm Topology - The H-Scale Storm Topology contains a Storage Bolt, which manages the storage of the received content.
* Storage Bolt - Apache Storm is used for real-time message processing. All processing in Storm topologies is done in bolts. Bolts can be used for filtering, functions, aggregations, joins, talking to databases, and more. The H-Scale Storage Bolt is a processing unit that handles the real-time messages consumed by Storm’s Kafka Consumer.
* Storage Manager - Switches between HDFS and HBase based on message size
The Storage Bolt invokes the H-Scale Storage Manager API, which stores real-time messages in HBase or HDFS depending on the message size. If the message is larger than the threshold (10 MB by default), it is stored in HDFS; otherwise it is stored in HBase.
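The size switch can be expressed as a small dispatch function. The sketch below is illustrative only: `choose_store`, `StorageManager` and the writer callables are hypothetical names, and treating 10 MB as 10 * 1024 * 1024 bytes is an assumption, since the article does not define the unit precisely.

```python
# Assumed default; the article describes the threshold as configurable.
TEN_MB = 10 * 1024 * 1024


def choose_store(message_size_bytes, threshold_bytes=TEN_MB):
    """Return the target store name for a message of the given size."""
    # Strictly larger than the threshold goes to HDFS; the rest goes to HBase.
    return "HDFS" if message_size_bytes > threshold_bytes else "HBase"


class StorageManager:
    """Hypothetical Storage Manager that dispatches to pluggable writers."""

    def __init__(self, hdfs_writer, hbase_writer, threshold_bytes=TEN_MB):
        # Each writer is any callable taking (key, payload).
        self._writers = {"HDFS": hdfs_writer, "HBase": hbase_writer}
        self._threshold = threshold_bytes

    def store(self, key, payload):
        """Write the payload to the store chosen by size; return its name."""
        target = choose_store(len(payload), self._threshold)
        self._writers[target](key, payload)
        return target
```

Passing the writers in as callables keeps the size rule testable without a running HDFS or HBase cluster; in a real bolt they would wrap the actual client calls.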
2. Sqoop - DB Ingestion
Apache Sqoop - Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases. The H-Scale Database Import API provides a mechanism for ingesting relational data into Hadoop. The API can connect to any relational database over JDBC, read data from its tables, and insert that data into Hive or HBase. The import can be triggered from Apache Oozie jobs. The API also has built-in error handling and auditing, which are useful for tracking and dashboard management.
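As a rough illustration of what such an import involves, the sketch below assembles a `sqoop import` command line for a Hive or HBase target. It is an assumption that the H-Scale API ends up invoking Sqoop in this way; the function name, parameters and example values are hypothetical, while the flags themselves are standard Sqoop 1 import options.

```python
def build_sqoop_import(jdbc_url, username, password_file, table,
                       target, target_table, num_mappers=4,
                       column_family=None, row_key=None):
    """Build the argument list for a `sqoop import` into Hive or HBase."""
    args = ["sqoop", "import",
            "--connect", jdbc_url,
            "--username", username,
            "--password-file", password_file,
            "--table", table,
            "--num-mappers", str(num_mappers)]
    if target == "hive":
        args += ["--hive-import", "--hive-table", target_table]
    elif target == "hbase":
        # Sqoop needs a column family to place the imported columns in.
        if not column_family:
            raise ValueError("HBase imports need a column_family")
        args += ["--hbase-table", target_table,
                 "--column-family", column_family]
        if row_key:
            args += ["--hbase-row-key", row_key]
    else:
        raise ValueError(f"unsupported target: {target}")
    return args
```

The resulting list could be handed to `subprocess.run` on a node with Sqoop installed, or its arguments could be copied into the Sqoop action of an Oozie workflow.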
CitiusTech uses the H-Scale Accelerator Ingestion Component to:
* Support onboarding of multiple LoBs on a multi-tenant, shared service platform
* Provide users with a single source for analytics and reporting needs
* Demonstrate significant improvement in performance
* Reduce time-to-market for its clients when creating a Data Lake
* Create a robust big data infrastructure that can manage high volumes of data