How Digg is Built
2011-04-01 13:45
316 查看
文章来源:http://about.digg.com/blog/how-digg-is-built
At Digg we have substantially rebuilt our infrastructure over the last year in what we call "Digg V4". This blog post gives a high-level view of the systems and technologies involved and how we use them. Read on to find out the secrets of the Digg engineers!
Let us start by reviewing the public products that Digg provides to users:
a social news site for all users,
a personalized social news site for an individual,
an Ads platform,
an API service and
blog and documentation sites.
These sites are primarily accessed by people visiting in browsers or applications. Some people have Digg user accounts and are logged in; they get the personalized site My News. Everyone gets the all user site which we call Top News. These products are all seen on '
This post will mainly cover the high-level technology of the social news products.
Story Submission: The stories are submitted by logged-in users with some descriptive fields: a title, a paragraph, a media type, a topic and a possible thumbnail. These fields are extracted from the source document by a variety of metadata standards (such as the Facebook open graph protocol, OEmbed plus some filtering) but the submitter has the final edit on all of these. Ads are submitted by publishers to a separate system but if Dugg enough, may become stories.
Story Lists: All stories are shown in multiple "story lists" such as Most Recent (by date, newest first), by topic of the story, by media type and if you follow the submitted user, in the personalized social news product, MyNews.
Story Actions: Users can do actions on the stories and story lists such as read them, click on them, Digg them, Bury them, make comments, vote on the comments and more. A non-logged in user can only read or click stories.
Story Promotion: Several times per hour we determine stories to move from recent story lists to the Top News story list. The algorithm (our secret sauce!) picks the stories by looking at both user actions and content classification features.
![](http://developers.diggstatic.com/files/digg-user-facing-public.png)
The edge of our internal systems is simplified here but does show that the API Servers proxy requests to our internal back end services servers. The front end servers are virtually stateless (apart from some caching) and rely on the same service layer. The CMS and Ads systems will not be described further in this post.
Taking a look at the internal high level services in an abstract fashion, these can be generally divided into two system parts:
Online or Interactive or Synchronous
Serve user requests for a page or API directly or indirectly. These have to return the response in some number of milliseconds (for services) which in aggregate to the user cannot be more than 1 or 2 seconds to give a good page response. This includes AJAX requests which are asynchronous for the user in the browser but are request/response from the serving system point of view. Offline or Batch or Asynchronous
Serve requests that are not in the interactive request-response loop and are typically only indirectly initiated by a user. The work here can take from seconds, minutes or hours (rarely).
The two parts above are used in Digg as shown in this diagram:
![](http://developers.diggstatic.com/files/digg-high-level-services-public.png)
Looking deeper into the components.
Cassandra: The primary store for "Object-like" access patterns for such things as Items (stories), Users, Diggs and the indexes that surround them. Since the Cassandra 0.6 version we use does not support secondary indexes, these are computed by application logic and stored here. This allows the services to look up, for example, a user by their username or email address rather than the user ID. We use it via the Python Lazyboy wrapper.
HDFS: Logs from site and API events, user activity. Data source and destination for batch jobs run with Map-Reduce and Hive in Hadoop. Big Data and Big Compute!
MogileFS: Stores image binaries for user icons, screenshots and other static assets. This is the backend store for the CDN origin servers which are an aspect of the Front End systems and can be fronted by different CDN vendors.
MySQL: This is mainly the current store for the story promotion algorithm and calculations, because it requires lots of
Redis: The primary store for the personalized news data because it needs to be different for every user and quick to access and update. We use Redis to provide the Digg Streaming API and also for the real time view and click counts since it provides super low latency as a memory-based data storage system.
SOLR: Used as the search index for text queries along with some structured fields like date, topic.
Scribe: the log collecting service. Although this is a primary store, the logs are rotated out of this system regularly and summaries written to HDFS.
At Digg we have substantially rebuilt our infrastructure over the last year in what we call "Digg V4". This blog post gives a high-level view of the systems and technologies involved and how we use them. Read on to find out the secrets of the Digg engineers!
Let us start by reviewing the public products that Digg provides to users:
a social news site for all users,
a personalized social news site for an individual,
an Ads platform,
an API service and
blog and documentation sites.
These sites are primarily accessed by people visiting in browsers or applications. Some people have Digg user accounts and are logged in; they get the personalized site My News. Everyone gets the all user site which we call Top News. These products are all seen on '
digg.com' and the mobile site '
m.digg.com'. The API service is at '
services.digg.com'. Finally, there are the '
about.digg.com' (this one) and '
developers.digg.com' sites which together provide the company blog and documentation for users, publishers, advertisers and developers.
This post will mainly cover the high-level technology of the social news products.
What we are trying to do
We are trying to build social news sites based on user-submitted stories and advertiser-submitted ad content.Story Submission: The stories are submitted by logged-in users with some descriptive fields: a title, a paragraph, a media type, a topic and a possible thumbnail. These fields are extracted from the source document by a variety of metadata standards (such as the Facebook open graph protocol, OEmbed plus some filtering) but the submitter has the final edit on all of these. Ads are submitted by publishers to a separate system but if Dugg enough, may become stories.
Story Lists: All stories are shown in multiple "story lists" such as Most Recent (by date, newest first), by topic of the story, by media type and if you follow the submitted user, in the personalized social news product, MyNews.
Story Actions: Users can do actions on the stories and story lists such as read them, click on them, Digg them, Bury them, make comments, vote on the comments and more. A non-logged in user can only read or click stories.
Story Promotion: Several times per hour we determine stories to move from recent story lists to the Top News story list. The algorithm (our secret sauce!) picks the stories by looking at both user actions and content classification features.
How do we do it?
Let us take a look at a high level view of how somebody visiting one of the Digg sites gets served with content and can do actions. The following picture shows the public view and the boundary to the internal services that Digg uses to provide the Pages, Images or API requests.![](http://developers.diggstatic.com/files/digg-user-facing-public.png)
The edge of our internal systems is simplified here but does show that the API Servers proxy requests to our internal back end services servers. The front end servers are virtually stateless (apart from some caching) and rely on the same service layer. The CMS and Ads systems will not be described further in this post.
Taking a look at the internal high level services in an abstract fashion, these can be generally divided into two system parts:
Online or Interactive or Synchronous
Serve user requests for a page or API directly or indirectly. These have to return the response in some number of milliseconds (for services) which in aggregate to the user cannot be more than 1 or 2 seconds to give a good page response. This includes AJAX requests which are asynchronous for the user in the browser but are request/response from the serving system point of view. Offline or Batch or Asynchronous
Serve requests that are not in the interactive request-response loop and are typically only indirectly initiated by a user. The work here can take from seconds, minutes or hours (rarely).
The two parts above are used in Digg as shown in this diagram:
![](http://developers.diggstatic.com/files/digg-high-level-services-public.png)
Looking deeper into the components.
Online Systems
The applications serving pages or API requests are mainly written in PHP (Front End, Drupal CMS) and Python (API server) using Tornado. They call the back end services via the Thrift protocol to a set of services written in Python. Many things are cached in the online applications (FE and BE) using Memcached and Redis; some items are primarily stored in Redis too, described below.Messaging and Events
The online and offline worlds are connected in a synchronous fashion by calls to the primary data stores, transient / logging systems and in an asynchronous way using RabbitMQ to queue up events that have happened like "a user Dugg a story" or jobs to perform such as "please compute this thing".Batch and Asynchronous Systems
When a message is found in a queue, a job worker is called to perform the specific action. Some messages are triggered by a time based cron-like mechanism too. The workers then typically work on some of the data in the primary stores or offline stores e.g. Logs in HDFS and then usually write the results back into one of the primary stores so that the online services can use them. Examples here are things like indexing new stories, calculating the promotion algorithm, running analytics jobs over site activity.Data Stores
Digg stores data in multiple types of systems depending on the type of data and the access patterns, and also for historical reasons in some cases :)Cassandra: The primary store for "Object-like" access patterns for such things as Items (stories), Users, Diggs and the indexes that surround them. Since the Cassandra 0.6 version we use does not support secondary indexes, these are computed by application logic and stored here. This allows the services to look up, for example, a user by their username or email address rather than the user ID. We use it via the Python Lazyboy wrapper.
HDFS: Logs from site and API events, user activity. Data source and destination for batch jobs run with Map-Reduce and Hive in Hadoop. Big Data and Big Compute!
MogileFS: Stores image binaries for user icons, screenshots and other static assets. This is the backend store for the CDN origin servers which are an aspect of the Front End systems and can be fronted by different CDN vendors.
MySQL: This is mainly the current store for the story promotion algorithm and calculations, because it requires lots of
JOINheavy operations which is not a natural fit for the other data stores at this time. However... HBase looks interesting.
Redis: The primary store for the personalized news data because it needs to be different for every user and quick to access and update. We use Redis to provide the Digg Streaming API and also for the real time view and click counts since it provides super low latency as a memory-based data storage system.
SOLR: Used as the search index for text queries along with some structured fields like date, topic.
Scribe: the log collecting service. Although this is a primary store, the logs are rotated out of this system regularly and summaries written to HDFS.
Operating System and Configuration
Digg runs on Debian stable based GNU/Linux servers which we configure with Clusto, Puppet and using a configuration system over Zookeeper.More
In future blog posts we will describe in more detail some of the systems outlined here. Watch this space!相关文章推荐
- How Digg is Built:讲述Digg背后的技术,互联网营销
- How Digg is Built
- How Digg is Built:讲述Digg背后的技术(转)
- Android make: How to control which module is built
- How to control which module is built
- How to check a static library is built contain bitcode?
- 一段让你了解 Linux 的3分钟视频——How Linux is Built
- How to check if a static library is built for 64-bit?
- How to Check or Verify PC Motherboard BIOS SLIC Version is SLP OA 2.0 or 2.1 for OEM Activation
- How does a C compiler find that -lm is pointing to the file libm.a?
- Summary on 2008-06-12: How to check one session is valid
- VS2008工程移植到2010的问题'system.io.fileloadexception was unhandled message=mixed mode assembly is built ag
- How to Tell if the I/O of the Database is Slow - 1
- 【转】How to solve “add/remove operation is impossible, because the code element 'Cxxx' is read only”
- Sys.WebForms.PageRequestManagerParserErrorException - what it is and how to avoid it
- It is going well-Scene to how work is going
- What is xylitol and how does it help to prevent cavities?
- How and Why Unsafe is Used in Java---reference
- When the PC is obsolete, how will you do this, and this, and this?
- (Page 1 of 3 )A walking tour of JavaBeans What JavaBeans is, how it works, and why you want to use it