The Lens platform is built on the principles of distributed architecture. Some of the highlights of the platform are:
Scribe is a Collector API that collects data from the trackers and then writes it to the Kinesis stream. The Collector API, developed in Go, acts as a listener and continuously checks for data from the trackers. The Collector API also performs minimal processing in the following cases:
For the above cases, Facebook and Google allow the trackers to measure only those key metrics that do not impact page-loading performance. The Collector API then processes the available key metrics into the standardized format accepted by the Lens platform. The Scribe module is containerized using AWS Elastic Container Service (ECS), and an Application Load Balancer (ALB) distributes the load across multiple EC2 instances hosted in different availability zones.
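Conceptually, the collector receives a tracker payload, normalizes it into the platform's standard format, and writes it to the stream. Below is a minimal Python sketch of that flow (the production service is written in Go); the stream name, field names, and normalization rules are illustrative assumptions rather than the platform's actual definitions.

import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def normalize(raw_event: dict) -> dict:
    """Map a raw tracker payload onto the standardized format (illustrative fields)."""
    return {
        "event_id": raw_event["id"],
        "metric": raw_event.get("metric_name", "unknown"),
        "value": float(raw_event.get("value", 0)),
        "ts": raw_event["timestamp"],
    }

def handle_tracker_event(raw_event: dict) -> None:
    record = normalize(raw_event)
    # Partition by event id so related records land on the same shard.
    kinesis.put_record(
        StreamName="lens-tracker-events",   # assumed stream name
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=record["event_id"],
    )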
Once the data passes through Scribe, it is written to Kinesis, which acts as a durable data queue. Kinesis can retain data for up to seven days. One of Kinesis's most significant advantages is that data can be retrieved from any point within the retention window, so you can get timely insights and react quickly to new information.
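Because records remain available for the full retention period, a consumer can replay the stream from an arbitrary point in time. A minimal sketch with boto3, assuming a stream named lens-tracker-events:

import datetime
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Start reading from a chosen point inside the retention window.
iterator = kinesis.get_shard_iterator(
    StreamName="lens-tracker-events",       # assumed stream name
    ShardId="shardId-000000000000",
    ShardIteratorType="AT_TIMESTAMP",
    Timestamp=datetime.datetime.utcnow() - datetime.timedelta(hours=6),
)["ShardIterator"]

response = kinesis.get_records(ShardIterator=iterator, Limit=100)
for record in response["Records"]:
    print(record["Data"])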
Accumulo reads the data from Kinesis and converts it into Avro files. This containerized module, developed in Go and Java, is designed to solve the small-file problem common in big data ecosystems: it waits until the buffered data reaches the desired file size, converts it into a single Avro file, and writes that file to the data lake on S3.
Additionally, Accumulo performs schema validation, schema evolution, and schema compatibility checks to ensure the raw data conforms to the standards defined by the platform.
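The core idea is to buffer incoming records, validate each one against the registered schema, and flush a single Avro file to S3 only once the buffer reaches the target size. The following simplified Python sketch uses fastavro and boto3; the schema, bucket, and flush threshold are assumptions for illustration.

import io
import uuid
import boto3
from fastavro import writer, parse_schema
from fastavro.validation import validate

SCHEMA = parse_schema({                      # assumed event schema
    "name": "LensEvent", "type": "record",
    "fields": [
        {"name": "event_id", "type": "string"},
        {"name": "metric", "type": "string"},
        {"name": "value", "type": "double"},
        {"name": "ts", "type": "long"},
    ],
})
TARGET_RECORDS = 50_000                      # flush threshold (assumed)
s3 = boto3.client("s3")
buffer = []

def ingest(record: dict) -> None:
    # Schema validation: raises an error for records that do not conform.
    validate(record, SCHEMA)
    buffer.append(record)
    if len(buffer) >= TARGET_RECORDS:
        flush()

def flush() -> None:
    out = io.BytesIO()
    writer(out, SCHEMA, buffer)              # one large Avro file instead of many small ones
    s3.put_object(
        Bucket="lens-data-lake",             # assumed bucket
        Key=f"raw/events/{uuid.uuid4()}.avro",
        Body=out.getvalue(),
    )
    buffer.clear()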
Prism is a unified data processing engine that cleanses, reformats, and enriches the data. The data (in the desired format) is then written to different data stores, which in turn power the Lens Dashboard. Prism supports both Lambda and Kappa architectures with the Databricks Delta Lake transactional storage layer.
Following are the workflows supported by Prism:
This workflow supports a constant flow of data from Kinesis, which updates with high frequency. Real-time streaming analytics is particularly useful for analyzing live data, answering questions such as “How is the key metric performing at this point in time?”, and for realigning the business strategy, for example, “How can we improve the key metric’s performance?”.
For real-time streaming analytics, we use a Spark Structured Streaming job, which runs on either AWS EMR or Databricks in the AWS environment.
After processing, the data is copied to the Elasticsearch Service, which powers the real-time streaming Lens Dashboards.
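A minimal PySpark Structured Streaming sketch of this flow is shown below. It assumes the Databricks Kinesis source and the elasticsearch-hadoop connector are available on the cluster; the stream name, event schema, checkpoint path, and index are illustrative.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, DoubleType, LongType

spark = SparkSession.builder.appName("lens-prism-realtime").getOrCreate()

event_schema = (StructType()
    .add("event_id", StringType())
    .add("metric", StringType())
    .add("value", DoubleType())
    .add("ts", LongType()))

# Read the raw events from Kinesis (Databricks Kinesis source assumed).
events = (spark.readStream
    .format("kinesis")
    .option("streamName", "lens-tracker-events")   # assumed stream name
    .option("region", "us-east-1")
    .load()
    .select(from_json(col("data").cast("string"), event_schema).alias("e"))
    .select("e.*"))

# Write each micro-batch to Elasticsearch, which powers the real-time dashboards.
query = (events.writeStream
    .format("org.elasticsearch.spark.sql")         # elasticsearch-hadoop connector assumed
    .option("checkpointLocation", "s3://lens-checkpoints/realtime/")
    .option("es.nodes", "https://lens-es.example.com")
    .start("lens-metrics-realtime"))               # target index (assumed)

query.awaitTermination()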
This workflow supports the processing of a large volume of data collected over a period of time. Batch processing is particularly useful for analyzing historical data, for example, “Compare the key metric performance for the current week with the previous week.”
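For example, a batch job can read cleansed Parquet data from the lake and compare the key metric's weekly averages, roughly as in the sketch below (paths, column names, and week numbers are assumptions):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lens-prism-batch").getOrCreate()

# Historical, cleansed events from the data lake (assumed path and columns).
events = spark.read.parquet("s3://lens-data-lake/cleansed/events/")

weekly = (events
    .withColumn("week", F.weekofyear(F.col("event_date")))
    .groupBy("metric", "week")
    .agg(F.avg("value").alias("avg_value")))

# Compare the current week against the previous one (illustrative week numbers).
current = weekly.filter(F.col("week") == 32).select("metric", F.col("avg_value").alias("this_week"))
previous = weekly.filter(F.col("week") == 31).select("metric", F.col("avg_value").alias("last_week"))
comparison = current.join(previous, "metric")
comparison.show()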
This workflow generates recommendations and personalization. The recommendation and personalization engine uses AWS SageMaker/Databricks MLflow for ML model lifecycle management on top of the raw and semi-processed data in the data lake.
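As an illustration of the model lifecycle side, an MLflow-tracked training run over features derived from the data lake might look like the following sketch; the feature table, model choice, and parameters are assumptions rather than the platform's actual pipeline.

import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Semi-processed features exported from the data lake (assumed path and columns).
df = pd.read_parquet("s3://lens-data-lake/features/user_events.parquet")
X, y = df.drop(columns=["converted"]), df["converted"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

with mlflow.start_run(run_name="lens-recommendation-model"):
    model = RandomForestClassifier(n_estimators=200)
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("auc", auc)
    mlflow.sklearn.log_model(model, "model")   # versioned for later deployment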
The data cleansing process ensures data quality and utility by catching and correcting errors before further processing. The data cleansing process includes the following steps:
Data reconciliation ensures that the business-critical conversion event data is in sync with the recorded OLTP transactions. It also recalculates the aggregated results to handle late-arriving events and improve overall data accuracy.
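A simplified reconciliation check in PySpark might compare tracked conversion counts against the OLTP records and flag the days whose aggregates need to be recomputed; the table names, columns, and connection details below are assumptions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lens-reconciliation").getOrCreate()

# Conversion events captured by the trackers vs. transactions recorded in OLTP (assumed sources).
tracked = spark.read.parquet("s3://lens-data-lake/cleansed/conversions/")
oltp = spark.read.jdbc(url="jdbc:postgresql://oltp-host:5432/orders",   # assumed OLTP endpoint
                       table="transactions",
                       properties={"user": "lens", "password": "***"})

tracked_daily = tracked.groupBy("order_date").agg(F.count("*").alias("tracked_count"))
oltp_daily = oltp.groupBy("order_date").agg(F.count("*").alias("oltp_count"))

# Flag days where the counts diverge; their aggregates are recomputed to absorb late events.
mismatches = (tracked_daily.join(oltp_daily, "order_date")
              .filter(F.col("tracked_count") != F.col("oltp_count")))
mismatches.show()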
Enriched data enables you to gain valuable insights about your audience segments and alter your marketing
and business strategies to suit the audience.
We employ the following methodologies to enrich your data:
Based on the business requirements, the application loads the cleansed and enriched data into different data stores. The enriched data is moved to the primary data store, AWS Redshift, and to the data lake as Parquet files. Aggregation is performed to minimize the processing time required for data comparison and to improve the user experience: cleansed data is aggregated over a given time period to provide statistics such as the average of a key metric for a quarter, and you can analyze this aggregated data to gain insights about specific key metrics. The aggregated data is stored in Amazon Aurora.
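For instance, a quarterly average of a key metric could be computed and written to Aurora over JDBC, roughly as in the sketch below (paths, table names, and the connection endpoint are assumptions):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lens-aggregation").getOrCreate()

cleansed = spark.read.parquet("s3://lens-data-lake/cleansed/events/")   # assumed path

quarterly = (cleansed
    .withColumn("year", F.year("event_date"))
    .withColumn("quarter", F.quarter("event_date"))
    .groupBy("metric", "year", "quarter")
    .agg(F.avg("value").alias("avg_value")))

# Pre-aggregated results go to Aurora so the dashboard avoids scanning raw data.
(quarterly.write
    .mode("overwrite")
    .jdbc(url="jdbc:mysql://lens-aurora.example.us-east-1.rds.amazonaws.com:3306/lens",  # assumed endpoint
          table="key_metric_quarterly",
          properties={"user": "lens", "password": "***"}))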
The data lake is a storage repository that holds a vast amount of raw and enriched data until it is needed. The raw, minimally processed data from Accumulo is copied to S3, which is the data lake of choice for the Lens platform. AWS SageMaker then uses this data for ML workflows.
Based on the level of refinement and cleansing, data in the data lake is classified into the following categories using Databricks Delta Lake:
Note: The Databricks Delta Lake also supports ACID transactions.
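As a small illustration, writing curated data into a Delta table with an ACID upsert might look like the following sketch (paths and the merge key are assumptions; the delta-spark package is required):

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lens-delta").getOrCreate()

updates = spark.read.parquet("s3://lens-data-lake/cleansed/events/")     # assumed input

# Upsert into the curated Delta table; the MERGE runs as a single ACID transaction.
target = DeltaTable.forPath(spark, "s3://lens-data-lake/delta/events_curated/")
(target.alias("t")
    .merge(updates.alias("u"), "t.event_id = u.event_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())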
Following are the available data sources used by the Lens platform:
Lens Dashboard is a powerful, in-house data visualization UI that can be customized based on your business requirements.
You can use the Lens Dashboard to:
Lens Dashboard is powered by any of the data sources, depending on the business question. The user’s cached page is copied to the nearest edge location and served using AWS CloudFront. The Lens Dashboard is also designed based on distributed architecture concepts to ensure scalability and reduce faults.
Some of the custom reports available are: