This is my first blog post, and I chose this topic because it is an extremely popular interview question. The problem statement is simple: when a user gives you a long URL, you need to return a unique short URL, and when a user gives you that short URL, you need to return the original long URL.
Design questions are extremely subjective in nature, so take your time, think about the problem, and come up with your own solution.
When you are asked a design question in an interview, the interviewer is not looking for the solution right away; they want to see how you approach the problem. There are a few areas you need to focus on for a design question:
- Firstly do a capacity estimation(for traffic, disk usage, bandwidth and memory).
- Figure out how your service will scale with load.
- What will your database design be.
- Do you need to use a cache anywhere.
- Where do you need to put the load balancer.
- What will your service APIs look like.
- How do you plan to establish monitoring and alerting for your service.
If you are able to answer all these questions, then believe me, no interviewer can reject you.
So before finding answers to these questions, let us see what URL shortening is:
URL shortening is used to create shorter links for long URLs, and when a user hits the short link, they are redirected to the original URL.
For example if we shorten this page URL through tinyurl :
https://www.ublog.in/url-shortener-system-design/
We would get : https://tinyurl.com/y8wyr4h4
Problem statement requirements
You need to ask the interviewer as many questions as necessary to clarify the problem requirements your design must take care of; no interviewer will give you the list right away. For this problem statement, you can ask questions like the following:
Do I need to take care of link expiration and can user pass the custom expiration time?
So here is the list of requirements I will take care of in this blog:
- When a user passes a long URL, our service should generate a unique short URL for it. The short link should be short enough to be easily copied and pasted into various applications.
- When a user tries to access the short link, our service should redirect them to the original link.
- Links will expire after a standard default timespan. Users should be able to specify the expiration time.
Above, I have listed all the functional requirements. There are also a few non-functional requirements that every new service should take care of; in our case, the following non-functional requirements are the main areas of focus for any interviewer:
- High availability
- Low latency
- URL being generated through our service should not be predictable (human guessable)
Now let's start answering all the key focus areas we discussed above:
Capacity Estimation
Looking at the nature of the service, it will be read heavy: there will be many more redirection requests than requests to create new short URLs.
Let us assume a 50:1 ratio between reads and writes, which means for every 50 redirection requests there will be one request to create a new short URL.
Traffic Estimation
Let's assume we will have 100M requests for the creation of new short URLs in a month; going by our 50:1 ratio above, we can fairly expect 5B redirection requests per month.
Requests for redirection per month : 5B
Requests for redirection per day : 5B/30 ~ 166M
Requests for redirection per hour : 6.9M
Requests for redirection per second ~ 2K
Requests for new short URL creation per month : 100M
Requests for new short URL creation per day : 100M/30 ~ 3.3M
Requests for new short URL creation per hour : 137.5k
Requests for new short URL creation per second ~ 38
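The per-day and per-second figures above follow from simple division; a quick sketch to verify them:

```python
# Back-of-the-envelope traffic numbers, assuming 100M new-URL
# requests per month and a 50:1 read-to-write ratio.
new_urls_per_month = 100_000_000
read_write_ratio = 50

redirects_per_month = new_urls_per_month * read_write_ratio  # 5B
redirects_per_day = redirects_per_month // 30                # ~166M
redirects_per_sec = redirects_per_day // (24 * 3600)         # ~2K

new_urls_per_day = new_urls_per_month // 30                  # ~3.3M
new_urls_per_sec = new_urls_per_day // (24 * 3600)           # ~38

print(redirects_per_sec, new_urls_per_sec)
```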
Disk Storage Estimation
Let’s assume we will store all the URL shortening requests and their corresponding unique short links for 3 years.
Per month we already assumed number of URL shortening requests to be 100M, so for 3 years we will get 3 * 12 * 100M = 3.6B
Now let's assume the space required to store the data for one request is around 500 bytes, so in total we would require 3.6B * 500 bytes = 1.8 TB of disk space.
Bandwidth Estimation
For new short URL creation, since we expect 38 requests per second, the total incoming data for our service will be: 38 * 500 bytes ~ 19 KB/sec
For redirection requests, since we expect 2K requests per second, the total outgoing data for our service will be: 2K * 500 bytes = 1 MB/sec
Total incoming data = 19 KB/sec
Total outgoing data = 1 MB/sec
Memory Estimation
Here comes another question: do we need to use a cache?
If we follow the 80-20 rule, meaning 80% of the traffic is generated by only 20% of the URLs, then caching those frequently used URLs can remove much of the latency introduced by DB lookups.
How much memory do we require for the cache:
Requests for redirection per day ~ 166M
Memory required per request = 500 bytes
How many requests we need to cache : 20% of 166M ~ 33M
So total memory required for cache = 33M * 500 bytes ~ 16.5 GB
One thing to note here is that there will be multiple duplicate requests (for the same URL), so the actual memory required will be less than the estimated 16.5 GB.
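The storage, bandwidth, and cache-memory estimates can be checked with a few lines of arithmetic (all figures are the assumptions stated above):

```python
# Storage, bandwidth and cache-memory estimates,
# assuming 500 bytes per stored record.
bytes_per_record = 500

# Storage: 100M new URLs per month, kept for 3 years.
records_3_years = 3 * 12 * 100_000_000                # 3.6B records
storage_bytes = records_3_years * bytes_per_record    # 1.8 TB

# Bandwidth: 38 writes/sec in, ~2K reads/sec out.
incoming = 38 * bytes_per_record      # ~19 KB/sec
outgoing = 2000 * bytes_per_record    # 1 MB/sec

# Cache: 80-20 rule, cache 20% of 166M daily redirect requests.
cached_requests = 166_000_000 * 20 // 100             # ~33M
cache_bytes = cached_requests * bytes_per_record      # ~16.5 GB
```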
Capacity Estimation Summary
- Request for new URLs = 38/sec
- Request for redirections = 2k/sec
- Storage for 3 years = 1.8 TB
- Incoming data = 19 KB/sec
- Outgoing data = 1 MB/sec
- Memory for cache = 16.5 GB
Database Design
Key points to consider while deciding the database design:
- How many records we need to store.
- What the size of each record is.
- What relationships exist between the data we plan to store.
- Whether our service is read intensive (which will help us decide which DB to use to reduce latency).
Let's answer these questions:
- We need to store billions of records
- Each record is around 500 bytes
- The only relationship in the data is which user created which URL; other than that, I don't see any relationship.
- Looking at the number of redirection calls we assumed above, we can clearly say our service will be read intensive.
Choice for Database
Based on our answers to these questions (billions of records, no real relationships between the data), a NoSQL database like Cosmos DB or Cassandra would be the better choice.
A NoSQL database would also be better at scaling.
Database Schema
We would need two tables: one for storing information about the URL mappings, and one for the user’s data who created the short link.
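The two tables can be sketched as Python dataclasses; all field names here are illustrative assumptions, not a fixed schema:

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative schema sketch; field names are assumptions.
@dataclass
class UrlMapping:
    short_key: str        # the 6-character key, primary key
    original_url: str
    created_at: datetime
    expires_at: datetime  # standard default, or user-specified
    user_id: int          # who created the link

@dataclass
class User:
    user_id: int          # primary key
    name: str
    email: str
    created_at: datetime
```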
System Design
The problem we are solving here is how to generate a short and unique key for a given URL.
Let’s have a look at the short URL TinyURL generated for this post : https://tinyurl.com/y8wyr4h4.
The last 8 characters of the URL are the key we want to generate.
Possible solutions to generate the key:
a) Encoding the given URL
We can generate a unique hash of the URL using well-known hashing algorithms like MD5 or SHA-256, and then encode the generated hash using Base64 encoding.
How long should our key be?
Using Base64 encoding, a 6-character string gives 64^6 ~ 68.7 billion possible strings, and an 8-character string gives 64^8 ~ 281 trillion possible strings.
As per our capacity planning, a 6-character string would be sufficient for our service.
The issue here is that if we use MD5 as the hash function, it will generate a 128-bit hash value, which when encoded using Base64 would produce a string of more than 21 characters (since each Base64 character encodes 6 bits of the hash value). A 21-character key is way longer than we want; our use case needs only a 6-character key, so we would have to truncate the encoded hash, and truncation brings back the possibility of collisions.
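A minimal sketch of this approach, using Python's standard hashlib and base64 modules:

```python
import base64
import hashlib

# Sketch of approach (a): hash the URL, Base64-encode the digest,
# then truncate to 6 characters. Note that truncation means two
# different URLs could still collide on the same key, which is one
# reason this approach falls short.
def short_key(url: str) -> str:
    digest = hashlib.md5(url.encode()).digest()           # 128-bit hash
    encoded = base64.urlsafe_b64encode(digest).decode()   # 20+ characters
    return encoded[:6]                                    # keep only 6

print(short_key("https://www.ublog.in/url-shortener-system-design/"))
```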
Another problem with this solution is that if multiple users request the same URL, they will all get the same short link, which is not correct.
b) Pre calculating keys
We can design another service whose only responsibility is to generate the keys of six characters and store them in DB beforehand. This service can work independently and have its own DB. We can call this service as Key Generation Service (KGS).
Whenever our service gets a request to shorten a URL, it will request a key from the Key Generation Service, and KGS will return one of the already computed keys. KGS will make sure all the keys inserted into the key DB are unique.
Advantages of this approach
- Simple and fast
- No Duplication or collision
Most candidates easily arrive at this solution, but the most important part is still not over: responding to the interviewer's questions about this approach.
Think for a while and see what all we have missed in the solution.
Following are possible questions any interviewer can ask from this approach:
- How would you handle the concurrency issue here, as there can be a case where you assign the same key to two different URLs?
- Key generation service is a single point of failure. How do you solve it?
- What would be the size of Key DB (the DB of Key generation Service)?
- How will key look up happen?
Take some time and try to answer these questions yourself first.
Let's discuss the answers to these questions:
- How would you handle the concurrency issue here, as there can be a case where you assign the same key to two different URLs?
We need to make sure as soon as a key is used, it should be marked in the database so that it is not reused again.
The problem comes when two or more servers try to read the same key from the DB. How do we solve this concurrency problem?
App servers can use KGS to read and mark keys in the database. KGS can use two tables to store keys: one for keys that are not used yet, and one for keys that are already used. As soon as KGS gives keys to one of the servers, it moves them to the used-keys table. KGS can always keep some keys in memory so that it can quickly provide them whenever a server needs one; as soon as keys are loaded into memory, they can be moved to the used-keys table. This ensures each server gets unique keys.
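The in-memory hand-out can be sketched as follows; class and method names are my own, and moving keys between the unused and used tables is assumed to happen elsewhere:

```python
import threading

# Sketch of KGS handing out preloaded keys from an in-memory pool.
# The lock guards the pool against concurrent app-server requests.
class KeyGenerationService:
    def __init__(self, preloaded_keys):
        # Keys loaded into memory (already moved to the used-keys table).
        self._keys = list(preloaded_keys)
        self._lock = threading.Lock()

    def get_key(self) -> str:
        with self._lock:            # synchronize concurrent app servers
            return self._keys.pop()
```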
But what happens if KGS crashes?
In this case, all the loaded keys that were not yet allocated to any server will be wasted, which is acceptable given the total number of keys we have.
With the above, we have removed the impact of concurrency while fetching keys from the DB. But we are still missing something here.
Take some time and think what is still missing here? (A major loophole)
We didn't take care of the case where multiple app servers request keys from KGS at the same time, which means KGS could allot the same key to two app servers. To prevent that, KGS must synchronize (put a lock on) the data structure holding the keys in memory before removing keys from it and handing them to a server.
- Key generation service is a single point of failure. How do you solve it?
To solve this, we will have a standby instance of KGS; as soon as the primary instance dies, the standby instance takes over generating and providing keys.
- What would be the size of the key DB (the DB of the Key Generation Service)?
With Base64 encoding, we can generate 68.7B unique 6-character keys. We need 1 byte to store one character, so the total size required will be:
6 (characters per key) * 68.7B keys = 412 GB.
This size is not a challenge, as all the major cloud providers now offer database as a service and you can choose the database size you want.
- How will key lookup happen?
Key lookup will happen in the database to fetch the original URL. If the key is present in the DB, we respond with an 'HTTP 302 Redirect' status with the original URL in the Location header; if the key is not present in the DB, which means the key was never generated by our service, we respond with 'HTTP 404 Not Found'.
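The lookup-and-redirect decision can be sketched in a few lines; `db` here is a plain dict standing in for the real datastore:

```python
# Sketch of redirection: 302 with a Location header on a hit,
# 404 on a miss. `db` is a stand-in for the real key-to-URL store.
def redirect(short_key: str, db: dict):
    original = db.get(short_key)
    if original is not None:
        return 302, {"Location": original}   # redirect to original URL
    return 404, {}                           # key never generated by us
```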
Load balancing
We can add a load balancer at two places: one between the client and the application servers, and another between the application servers and KGS.
Which load balancing algorithm?
Since our system is not complex, we can use the simplest algorithm, Round Robin, for load balancing the traffic. It is simple and doesn't add any overhead.
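Round robin simply cycles through the servers in order; a minimal sketch (the hostnames are made up):

```python
import itertools

# Round-robin load balancing: requests cycle through the server
# list in order. Server names are illustrative.
servers = ["app-1", "app-2", "app-3"]
rr = itertools.cycle(servers)

# The first 6 requests cycle through all 3 servers twice.
picks = [next(rr) for _ in range(6)]
print(picks)
```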
Cache
We should cache the URLs which are frequently accessed.
- Which cache we should use?
- Can we use local cache?
Since we are designing a scalable distributed system, we can't use a local in-memory cache: a redirection request can go to any application server, which would force us to keep an in-memory cache of most of the URLs on multiple servers, duplicating effort and adding overhead.
So we will have to use a distributed cache like Redis.
Which cache eviction policy to use?
As per our use case, Least Recently Used (LRU) would be a good choice.
How will cache be updated?
Whenever there is a cache miss, the application server will fetch the value from the DB and then update the cache with that value.
What happens if cache instance fails?
I suggested Redis as our distributed cache. Redis has a master-slave architecture where one instance runs as the master and other instances run as slaves; if the master fails, an election among the live nodes chooses the new master.
Monitoring
Other than system monitoring and alerting, which is a basic necessity for any service, here I am talking about application-level monitoring.
The following should be good candidates for monitoring:
- How many times a short link has been used.
- Location from where the URL was accessed
- Date and time of access
- Webpage from where the link was accessed.
Overall architecture
Homework
I have knowingly skipped a few sections here that are also very important and play a big role in the interview. I want you to come up with solutions for these areas and share them in the comments; if you post your solution, I will share mine with you.
The areas are :
1) How to scale the DB?
2) DB cleanup strategy?
3) Sequence diagram of the flow?
4) How to handle security and permissions?