Understanding the Basics of Lambda Architecture

 

I first came across the term Lambda architecture while reading a post on highscalability.com. parall.ax used Lambda architecture to build an internationalized app for David Guetta’s online social campaign “This One’s For You”. The post described Lambda architecture as a ‘quick and simple’ way of achieving scalability.

Lambda architecture was introduced by Nathan Marz, who is well known in the big data community for his work on the Storm project. The book “Big Data – Principles and Best Practices of Scalable Realtime Data Systems”, written by Nathan Marz and James Warren, presents a much deeper understanding of the architecture. Lambda architecture is a data processing architecture, associated specifically with big data.

Data systems are an integral part of software design. They need to handle really HUGE amounts of data (well, most of the time, at least in web software solutions), which means both storing it and answering queries on it quickly. And sometimes these data systems are required to outlive the application itself! With this in mind, data systems have to be designed very meticulously so that they are not only reliable but also scalable.

The picture below shows what a lambda architecture is.

[Figure 1: Lambda Architecture]

Lambda architecture seeks to get the best out of both batch processing and stream (or near real-time) processing. Batch processing is simple, more accurate and not much affected by issues of consistency and locking. However, it can be annoyingly slow. This is where real-time computation comes in: it is much faster, often because it works on approximations.

Lambda architecture is made up of three layers: the batch layer, the serving layer and the speed layer. It works by copying incoming data to two of them, the batch layer and the speed layer, and processing it on both. The data must be immutable, as will be seen later: time-stamped data is simply appended as it arrives and never overwrites a previous record.
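To make the append-only idea concrete, here is a minimal Python sketch. It is only an illustration: the `Event` and `MasterDataset` names and the page-view fields are made up for this example, and a real batch layer would write to a distributed file system rather than a Python list.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)  # frozen: a stored event can never be modified
class Event:
    timestamp: int  # hypothetical field, e.g. seconds since epoch
    user: str
    page: str


class MasterDataset:
    """Append-only store: records are added, never updated or deleted."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def all_events(self) -> Tuple[Event, ...]:
        # Return a read-only snapshot so callers cannot edit the store.
        return tuple(self._events)


master = MasterDataset()
master.append(Event(1, "alice", "/home"))
master.append(Event(2, "alice", "/pricing"))  # a new fact, not an overwrite
```

Because nothing is ever overwritten, a mistake in later processing cannot destroy the raw data; the views can always be rebuilt from it.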

The architecture also requires queries to be pre-computed and stored as views, which is what makes answering them fast. These views are created by both real-time processing and batch processing. Results from the two types of computation are merged, and real-time views are overwritten by batch views once those are available, because the latter are more accurate. Any query can be answered by merging the two types of views.
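Here is a rough sketch of how a query could merge the two kinds of views. It assumes the views are simple page-view counts and that the real-time view only covers events that arrived after the last batch run, so the two can be added together. The function and variable names are hypothetical.

```python
from collections import Counter


def merge_views(batch_view, realtime_view):
    """Combine a batch view with the real-time view that covers newer data."""
    merged = Counter(batch_view)
    merged.update(realtime_view)  # Counter.update adds counts per key
    return dict(merged)


# Assumed example data: counts per page.
batch_view = {"/home": 100, "/pricing": 40}   # computed by the last batch run
realtime_view = {"/home": 3, "/signup": 1}    # events since that batch run


def page_views(page):
    """Answer a query from the merged views."""
    return merge_views(batch_view, realtime_view).get(page, 0)
```

With this data, `page_views("/home")` gives 103: 100 from the batch view plus 3 recent events from the real-time view.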

To understand the architecture in detail, let’s look at each of the three layers individually.

  1. Batch Layer

This layer has two basic functions. The first is to store raw data as it arrives, continuously growing the master data set, which is stored on HDFS (Hadoop Distributed File System). Note that the data is stored by appending!

The second function of this layer is to compute views using MapReduce. MapReduce iterations recompute the views again and again over the complete data set. Because each run uses the complete data set, it can fix errors and give highly accurate results.
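The recompute-from-scratch idea can be sketched in plain Python. This is not Hadoop code, just a toy map/reduce over an in-memory list, and the page-view events and function names are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(event):
    """Map step: emit one (page, 1) pair per page-view event."""
    yield event["page"], 1


def reduce_phase(page, counts):
    """Reduce step: sum all the 1s emitted for a page."""
    return page, sum(counts)


def recompute_batch_view(master_dataset):
    """Rebuild the whole view from every record, ignoring any previous view."""
    grouped = defaultdict(list)
    for event in master_dataset:
        for page, count in map_phase(event):
            grouped[page].append(count)  # the "shuffle": group values by key
    return dict(reduce_phase(page, counts) for page, counts in grouped.items())


# Assumed example data standing in for the master data set.
master_dataset = [
    {"timestamp": 1, "page": "/home"},
    {"timestamp": 2, "page": "/pricing"},
    {"timestamp": 3, "page": "/home"},
]
batch_view = recompute_batch_view(master_dataset)
```

Since every run starts from the raw records, a bug in `map_phase` or `reduce_phase` can be fixed and the next run simply produces a corrected view.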

  2. Serving Layer

This layer stores the computed views from the batch layer, indexes them and makes them available for queries. The serving layer is more like a QFD (Question Focused Database), as James Kinley calls it.

Cloudera Impala along with Hive Metastore could be utilized for queries from this layer.
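As a very rough illustration of what the serving layer does, here is a toy Python version that keeps the latest batch view in a dictionary keyed for fast lookup. It does not use Impala or Hive; the `ServingLayer` class and its methods are hypothetical names for this sketch only.

```python
class ServingLayer:
    """Holds the most recent batch view, indexed by key for fast reads."""

    def __init__(self):
        self._index = {}

    def load_batch_view(self, batch_view):
        # Build the new index first, then swap it in with one assignment,
        # so a reader never sees a half-loaded view.
        new_index = dict(batch_view)
        self._index = new_index

    def query(self, key, default=0):
        return self._index.get(key, default)


serving = ServingLayer()
serving.load_batch_view({"/home": 100, "/pricing": 40})  # assumed example data
```

Each new batch run replaces the whole view, so the serving layer only needs bulk loads and reads; it never handles individual updates.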

  3. Speed Layer

MapReduce has high latency by design and is thus not suited to real-time computations. This layer creates real-time views using Storm and also exposes these views for queries. The real-time views are discarded as soon as more accurate views from the batch layer are generated.
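The behaviour of the speed layer can be sketched in plain Python as well. This is not Storm code: the `SpeedLayer` class is a made-up stand-in that updates a count per event and throws away whatever the batch layer has already covered, assuming each event carries a timestamp.

```python
from collections import Counter


class SpeedLayer:
    """Keeps an incrementally updated view of events the batch layer has not seen."""

    def __init__(self):
        self._recent = []      # (timestamp, page) pairs not yet covered by a batch view
        self.view = Counter()  # real-time view: page -> count

    def on_event(self, timestamp, page):
        self._recent.append((timestamp, page))
        self.view[page] += 1   # incremental update, no full recomputation

    def discard_up_to(self, batch_timestamp):
        """Drop everything the latest batch view already includes."""
        kept = []
        for ts, page in self._recent:
            if ts <= batch_timestamp:
                self.view[page] -= 1
            else:
                kept.append((ts, page))
        self._recent = kept
        self.view = +self.view  # unary plus removes entries whose count is zero


speed = SpeedLayer()
speed.on_event(1, "/home")
speed.on_event(2, "/home")
speed.on_event(3, "/signup")
speed.discard_up_to(2)  # a batch run has now processed events up to timestamp 2
```

After the call to `discard_up_to(2)`, only the `/signup` event at timestamp 3 is left in the real-time view; the two `/home` events are now served from the batch view.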

As they say, Lambda architecture is the best of both worlds. With low latency, high throughput, high accuracy and fault tolerance, it meets the requirements of today’s web: reliable and scalable solutions.

__________________________________________________

For further reading on the topic, you can go through the following resources:

  1. lambda-architecture.net/
  2. Semantikoz
  3. Wikipedia
  4. InfoQ
  5. Highscalability post mentioned in the beginning
  6. James Kinley’s blog
  7. mapr.com
  8. datasalt.com
  9. Dr. Dobb's

#Off-topic now

Also, for anyone interested in reading Nathan Marz’s blog, here you go: nathanmarz.com. Two of his posts that really inspired me (as you can see) are: 1. You should write even if you have no Readers and 2. Break into Silicon Valley with a Blog

________________________________________________

I hope I have been able to spark some curiosity and build some understanding of Lambda architecture. If you enjoyed this post, feel free to comment or share it on social networks. If you found anything wrong, kindly let me know. If you want to share some useful insight on the topic, please do. I would love to know more!

Ok, bye!
