H-Scale, CitiusTech’s big data platform, helps healthcare organizations manage large structured and unstructured data sets. The platform enables organizations to accelerate the adoption of Hadoop and other big data technologies in healthcare. It can parse, store, manage and query large healthcare datasets through a SaaS-based, HIPAA-compliant Hadoop environment, and it leverages industry-standard transport mechanisms (REST, JDBC).
H-Scale supports automated healthcare message processing, data aggregation into a Data Lake, configurable rules for streaming analysis, late binding with schema-on-read, and advanced analytics.
In this article, we discuss the ingestion components of the H-Scale framework in detail.
H-Scale supports two types of ingestion techniques as follows:
1. REST/SOAP – real-time ingestion
2. Sqoop – DB ingestion
REST/SOAP Endpoint - The H-Scale Ingestion REST/SOAP service enables real-time ingestion of data (files or binary messages) into the H-Scale Data Lake. The service exposes a single interface for storing both big (> 10 MB) and small (< 10 MB) files in the H-Scale Data Lake.
NOTE: The size threshold that distinguishes big from small files is configurable.
When a user calls the store method with the required parameters, the H-Scale Ingestion REST service publishes the message to the respective Kafka topic; the message is later processed by the Store Manager, which stores the data in the H-Scale Data Lake.
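As a sketch of this flow, the snippet below shows a minimal store method that maps a message type to a Kafka topic and hands the payload to a producer. The topic-naming scheme and the `producer` interface are illustrative assumptions, not H-Scale’s actual API; in practice the producer would be a Kafka client such as kafka-python’s `KafkaProducer`.

```python
import json

def select_topic(message_type):
    """Map a message type to a Kafka topic name (hypothetical naming scheme)."""
    return f"hscale.ingest.{message_type.lower()}"

def store(content, message_type, producer):
    """Publish an ingested message to its Kafka topic.

    `producer` is any object with a send(topic, value) method,
    e.g. a kafka-python KafkaProducer (assumed interface).
    """
    topic = select_topic(message_type)
    payload = json.dumps({"type": message_type, "content": content}).encode("utf-8")
    producer.send(topic, payload)
    return topic
```

Injecting the producer keeps the store logic testable without a running Kafka broker.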
Kafka - Healthcare organizations are increasingly interested in streaming data, and most applications use a producer-consumer model to ingest and process data in real time. Apache Kafka is a fast, scalable, durable, distributed messaging system. H-Scale uses Kafka to consume the real-time data ingested through its REST/SOAP web services.
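Kafka itself needs a running broker, so as a stand-in the snippet below illustrates the producer-consumer decoupling it provides, using an in-memory queue in place of a Kafka topic and a worker thread as the consumer. This demonstrates the pattern only; it is not Kafka’s API.

```python
import queue
import threading

def run_pipeline(messages):
    """Producer-consumer decoupling, with a queue standing in for a Kafka topic."""
    topic = queue.Queue()      # stand-in for a Kafka topic
    processed = []

    def consumer():
        while True:
            msg = topic.get()  # blocks, like a consumer poll
            if msg is None:    # sentinel: producer is done
                break
            processed.append(msg.upper())  # "process" the message

    worker = threading.Thread(target=consumer)
    worker.start()
    for m in messages:         # producer side: publish each message
        topic.put(m)
    topic.put(None)
    worker.join()
    return processed
```

The producer and consumer run concurrently and never call each other directly; the queue (like a Kafka topic) is the only coupling between them.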
o Message Router - Message Router implements the content-based router (CBR) pattern: it selects the route to execute based on the file-size threshold provided by the user.
For big files, Message Router stores the content in a temporary HDFS location and invokes Message Publisher to post only the metadata, including the HDFS temp folder location, to the Kafka topic. For small files, Message Publisher is invoked to post the content plus metadata to the Kafka topic.
o Message Publisher - Message Publisher is a utility that publishes messages to a Kafka topic. If the topic does not exist, the utility creates it before posting the message.
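The routing and publishing behavior described above can be sketched together as follows. The field names, HDFS path layout, and the `broker` interface are assumptions for illustration; a real implementation would use Kafka’s admin and producer clients (e.g. kafka-python’s `KafkaAdminClient` and `KafkaProducer`).

```python
def route(message, size_bytes, threshold_bytes=10 * 1024 * 1024):
    """Content-based router: decide how a message reaches the Kafka topic."""
    if size_bytes > threshold_bytes:
        # Big file: content goes to an HDFS temp location; only metadata
        # (carrying that location) is published to Kafka.
        hdfs_path = f"/hscale/tmp/{message['id']}"  # hypothetical layout
        return "metadata-only", {"id": message["id"], "hdfs_temp": hdfs_path}
    # Small file: content plus metadata are published directly.
    return "content+metadata", {"id": message["id"], "content": message["content"]}

def publish(topic, payload, broker):
    """Publish to a Kafka topic, creating the topic first if it is absent."""
    if not broker.topic_exists(topic):
        broker.create_topic(topic)
    broker.send(topic, payload)
```

Separating the routing decision from the publishing step mirrors the Message Router / Message Publisher split and lets each piece be tested in isolation.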
Storm Topology - The H-Scale Storm topology includes a Storage Manager Bolt, which manages storage of the received content.
The Storage Manager Bolt invokes the H-Scale Storage Manager API, which stores real-time messages in HBase or HDFS depending on message size: messages larger than 10 MB are stored in HDFS, while smaller messages are stored in HBase.
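The size-based dispatch performed by the Storage Manager can be sketched as below. The `put` interface on the injected HDFS and HBase clients is a hypothetical simplification of the real client APIs.

```python
def store_message(message_id, content, hdfs, hbase, threshold_bytes=10 * 1024 * 1024):
    """Dispatch a message to HDFS or HBase based on its size.

    `hdfs` and `hbase` are injected clients with a put(key, value)
    method (an assumed interface, standing in for real client APIs).
    """
    # Messages over the threshold go to HDFS; the rest go to HBase.
    target = hdfs if len(content) > threshold_bytes else hbase
    target.put(message_id, content)
    return "HDFS" if target is hdfs else "HBase"
```

Keeping the threshold a parameter matches the note above that the big/small cut-off is configurable.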
Apache Sqoop - Apache Sqoop is a tool designed to transfer bulk data efficiently between Apache Hadoop and structured data stores such as relational databases. The H-Scale Database Import API provides a mechanism for ingesting relational data into Hadoop: it can connect to any relational database over JDBC, read data from tables, and insert that data into Hive or HBase. The import can be triggered through Apache Oozie jobs. The API also has a built-in error-handling mechanism and auditing capability, useful for tracking and dashboard management.
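Under the hood, such an import boils down to a Sqoop invocation. The helper below assembles one: the flags (`--connect`, `--username`, `--table`, `--hive-import`, `--hive-table`, `--hbase-table`, `--column-family`) are standard Sqoop import options, while the wrapper itself and its parameters are illustrative, not part of the H-Scale API.

```python
def sqoop_import_cmd(jdbc_url, table, username, target, hive=True):
    """Assemble an Apache Sqoop import command line as a list of arguments."""
    cmd = ["sqoop", "import",
           "--connect", jdbc_url,
           "--username", username,
           "--table", table]
    if hive:
        # Load the imported rows into a Hive table.
        cmd += ["--hive-import", "--hive-table", target]
    else:
        # Load into HBase instead; Sqoop requires a column family here
        # ("d" is an arbitrary example value).
        cmd += ["--hbase-table", target, "--column-family", "d"]
    return cmd
```

In practice the password would come from a secure source (e.g. Sqoop’s `--password-file` option) rather than the command line, and the command would typically be launched by an Oozie Sqoop action as described above.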