How we found the rudest cities in the world – Analytics @ foursquare
2011-03-31 15:57
Source: http://engineering.foursquare.com/2011/02/28/how-we-found-the-rudest-cities-in-the-world-analytics-foursquare/
With over 400 million check-ins in the last year, it’s safe to say that our servers log a lot of data. We use that data to do a lot of interesting analysis, from finding the most popular local bars in any city, to recommending people you might know, and even for drawing pretty pictures.
However, until recently, our data was only stored in production databases and log files. Most of the time this was fine, but whenever someone non-technical wanted to do some data exploration, they needed to know Scala and be able to query production databases.
This has become a bigger problem lately, as many of our business development managers, venue specialists, and upper management eggheads need access to the data in order to inform some important decisions. For example: which venues are fakes or duplicates (so we can delete them), which areas of the country are drawn to which kinds of venues (so we can help them promote themselves), and what are the demographics of our users in Belgium (so we can surface useful information)?
In short, without easy access to this data, we are not able to make smart decisions at any level of the company.
Thus we needed two things:
- A set of high-level scheduled reports to inform general business decisions.
- A way for anyone in the company to do data exploration without hurting our production systems or learning about Scala, sbt, ssh, and mongo.
The Solution
We decided to use Apache Hadoop and Apache Hive in combination with a custom data server (built in Ruby), all running in Amazon EC2.
For those who don’t know, Hadoop is an open-source MapReduce framework for parallel data processing, and Hive is a secondary service that allows you to interact with Hadoop by defining ‘virtual’ tables and using a familiar SQL-like syntax (HiveQL).
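As a rough illustration of what a ‘virtual’ table means here, the sketch below builds a Hive `CREATE EXTERNAL TABLE` statement that points at a directory of tab-delimited files. The `external_table_ddl` helper, the table name, the column list, and the bucket path are all hypothetical; the post does not show our actual table definitions.

```ruby
# Hypothetical helper: builds HiveQL for an external table over
# tab-delimited files. Names and paths are illustrative assumptions.
def external_table_ddl(table, columns, location)
  column_sql = columns.map { |name, type| "#{name} #{type}" }.join(", ")
  <<~SQL
    CREATE EXTERNAL TABLE IF NOT EXISTS #{table} (#{column_sql})
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '#{location}'
  SQL
end

ddl = external_table_ddl(
  "checkins",
  { "id" => "STRING", "venue" => "STRING", "shout" => "STRING" },
  "s3://example-bucket/checkins/" # hypothetical bucket
)
puts ddl
```

Because the table is external, dropping it only removes the definition and leaves the underlying files where they are.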
The data server is built using Rails, MongoDB, Redis, and Resque, and communicates with Hive using the Ruby Thrift client.
We all like pictures, so here is a diagram:
![](http://engineering.foursquare.com/wp-content/uploads/2011/02/InfographicBlogPost-1.png)
The idea is simple: we run our own ‘data server’ to act as a gateway to reports. This allows us to:
- Provide an easy-to-use endpoint for running data exploration queries (using SQL and simple web forms).
- Cache the results of queries (in a database) to power reports, so that the data is available to everyone, whenever it is needed.
- Allow our Hadoop cluster to be totally dynamic without having to move data around (we shut it down at night and on weekends).
- Add new data in a simple way (just put it in Amazon S3!).
- Analyse data from several data sources (MongoDB, Postgres, log files).
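The caching point in the list above can be sketched roughly as follows. This is a minimal in-memory illustration, assuming results are keyed by a digest of the whitespace-normalised query text. `QueryCache` and the `runner` callable are hypothetical stand-ins; the real data server keeps results in a database rather than a Hash.

```ruby
require "digest"

# Hypothetical sketch: cache query results keyed by a digest of the SQL.
class QueryCache
  def initialize(runner)
    @runner = runner # callable that actually executes the query (e.g. via Hive)
    @store = {}      # stand-in for a database collection
  end

  # Collapse whitespace so trivially different spellings share a key.
  def key_for(sql)
    Digest::SHA256.hexdigest(sql.strip.gsub(/\s+/, " "))
  end

  # Run the query only if no result is stored for it yet.
  def fetch(sql)
    key = key_for(sql)
    @store[key] = @runner.call(sql) unless @store.key?(key)
    @store[key]
  end
end

calls = 0
cache = QueryCache.new(->(sql) { calls += 1; "rows for: #{sql}" })
cache.fetch("SELECT count(1) FROM tips")
cache.fetch("SELECT  count(1)   FROM tips") # same query, different spacing
puts calls
```

Keeping results outside the cluster is what would let reports stay readable while Hadoop is shut down.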
Importing Data
The last two points are very important. In fact, a large portion of the data server’s code base is dedicated to data cleaning and importing.
We found it best to represent all data in tab-delimited flat files. To turn mongo/log/postgres/json data into this format, each ‘table’ has a specification written in Ruby. Here is a simple example:

    class Checkin < Foursquare::MappedClass
      include Foursquare::LocationLookup
      mapped_attributes :id, :venue, :shout, :lat, :long
      mapped_attributes :country, :state, :timezone
    end

    data = "{
      id: '123',
      venue: '456',
      shout: 'ayup mum!',
      ll: '24.5,-50.4',
      something_else: 'boo!'
    }"
    checkin = Checkin.new(data)

So now:

    puts checkin.to_tab_delimited
    => 123    456    ayup mum!    24.5    -50.4    us    New York    America/New_York

The initialize method provided by Foursquare::MappedClass can interpret several data types; in this example JSON is used. By including the LocationLookup module, country, state, and timezone can be automatically added if a lat/long field exists (using a local MongoDB database). For all such transformations, tabs, newlines, and excess whitespace are removed from field values to ensure that each record occupies only a single line.
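To make the cleaning step concrete, here is a minimal sketch of how a class along these lines could work. `MappedRecord` is a hypothetical stand-in, not the real Foursquare::MappedClass (which is not shown in this post); it only handles strict JSON, skips the location lookup, and the `Tip` record and its values are made up for the example.

```ruby
require "json"

# Hypothetical stand-in for Foursquare::MappedClass; location lookup omitted.
class MappedRecord
  # Each call appends to the ordered list of output columns.
  def self.mapped_attributes(*names)
    attribute_names.concat(names)
  end

  def self.attribute_names
    @attribute_names ||= []
  end

  def initialize(json)
    @values = JSON.parse(json)
  end

  # Turn tabs and newlines into spaces and squeeze repeated whitespace,
  # so every record stays on one line with one field per column.
  def clean(value)
    value.to_s.gsub(/\s+/, " ").strip
  end

  def to_tab_delimited
    self.class.attribute_names.map { |name| clean(@values[name.to_s]) }.join("\t")
  end
end

class Tip < MappedRecord
  mapped_attributes :id, :venue
  mapped_attributes :text
end

# Made-up input with an embedded tab, newline, and trailing space.
tip = Tip.new('{"id": "1", "venue": "456", "text": "great\tcoffee\n here "}')
puts tip.to_tab_delimited
```

Attributes that are missing from the input come out as empty fields, so the column count stays fixed for every record.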
We have rake tasks to run this either as a simple script or as part of a Hadoop streaming job.
Running Queries
Because we’re storing data away from the production system, people can run queries that generate 1,000,000,000,000 records if they want to (I’m looking at you @injust), and the system simply emails them a link to their results when the query has finished (so they don’t have to wait around). In fact, we can run all sorts of cool stats.
A Fun Example
Let’s say we want to find the city with the rudest citizens, judged by how often a tip left in that city contains a curse word. We could run this query:

    SELECT v.city,
           v.state,
           sum(curse) AS curses,
           sum(any) AS any_tip,
           sum(curse) / sum(any) AS percentage
    FROM (
      SELECT venueid,
             IF(text LIKE '%curseword_here%', 1, 0) AS curse,
             1 AS any
      FROM tips
    ) tips
    JOIN venues v ON tips.venueid = v.id
    GROUP BY v.city, v.state
    SORT BY percentage DESC

After 5 minutes of waiting, we have a list of the top 20 offenders (highest % of tips containing curse words):
![](http://engineering.foursquare.com/wp-content/uploads/2011/02/chart-1024x864.png)
(I’ve filtered out cities that had fewer than 1,000 tips in total.)
It’s good to see that the Mancunians truly are not only the rudest people in the UK, but the rudest people globally; only El Paso comes close. Although please keep in mind that this only evaluates the rudeness of English-speaking countries (like that would make a difference?).
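For readers who find the aggregation easier to follow outside SQL, here is the same calculation as a small Ruby sketch: flag each tip, join it to its venue’s city, sum per city, drop cities with too few tips, and rank by rate. The rows, the placeholder word list, and the threshold are made up for the example and are not real Foursquare data; `curse_rates` is a hypothetical helper.

```ruby
# Toy illustration of the aggregation in the Hive query above.
CURSE_WORDS = %w[curseword_here].freeze # placeholder, as in the query
MIN_TIPS = 2 # the chart used 1,000; lowered so the toy data survives

venues = {
  1 => ["City A", "AA"],
  2 => ["City B", "BB"],
  3 => ["City C", "CC"]
}
tips = [
  { venueid: 1, text: "curseword_here service" },
  { venueid: 1, text: "lovely tea" },
  { venueid: 2, text: "nice view" },
  { venueid: 2, text: "good chips" },
  { venueid: 3, text: "curseword_here" } # only one tip, so City C is filtered out
]

def curse_rates(tips, venues, min_tips)
  counts = Hash.new { |hash, key| hash[key] = { curses: 0, any: 0 } }
  tips.each do |tip|
    place = venues[tip[:venueid]]
    next if place.nil? # inner join: drop tips with no matching venue
    row = counts[place]
    row[:any] += 1
    row[:curses] += 1 if CURSE_WORDS.any? { |word| tip[:text].include?(word) }
  end
  counts
    .select { |_place, row| row[:any] >= min_tips }
    .map { |place, row| [place, row[:curses].to_f / row[:any]] }
    .sort_by { |_place, rate| -rate }
end

curse_rates(tips, venues, MIN_TIPS).each do |(city, state), rate|
  puts "#{city}, #{state}: #{(rate * 100).round(1)}%"
end
```

The minimum-tips filter matters: without it, a city with a single rude tip would top the list at 100%.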
In Summary
Amazon’s Elastic MapReduce plus a simple Ruby on Rails server can make a powerful (and cheap) data analysis tool. By reducing the barrier to data exploration we have been able to inform better business decisions, and even create a little fun.
- Matthew Rathbone, Foursquare Engineer (and a proud British midlander)