
How we found the rudest cities in the world – Analytics @ foursquare

Source: http://engineering.foursquare.com/2011/02/28/how-we-found-the-rudest-cities-in-the-world-analytics-foursquare/

Recommended reading: Foursquare outage post mortem

With over 400 million check-ins in the last year, it’s safe to say
that our servers log a lot of data. We use that data to do a lot of
interesting analysis, from finding the most popular local bars in any
city, to recommending people you might know, and even for drawing pretty pictures.
However, until recently, our data was only stored in production
databases and log files. Most of the time this was fine, but whenever someone non-technical
wanted to do some data exploration, it required them to know Scala
and to be able to query against production databases.

This has become a larger problem as of late, as many of our business
development managers, venue specialists, and upper management eggheads
need access to the data in order to inform some important decisions. For
example, which venues are fakes or duplicates (so we can delete them),
what areas of the country are drawn to which kinds of venues (so we can
help them promote themselves), and what are the demographics of our
users in Belgium (so we can surface useful information)?

In short, without easy access to this data, we are not able to make smart decisions at any level of the company.

Thus we needed two things:

A set of high-level scheduled reports to inform general business decisions.

A way for anyone in the company to do data-exploration without
hurting our production systems or learning about scala, sbt, ssh, and
mongo.

The Solution

We decided to use Apache Hadoop and Apache Hive in combination with a custom data server (built in Ruby), all running in Amazon EC2.

For those who don’t know, Hadoop is an open-source Map-Reduce framework
for parallel data processing, and Hive is a secondary service that
allows you to interact with Hadoop by defining ‘virtual’ tables and
using familiar SQL syntax.

The data server is built using Rails, MongoDB, Redis, and Resque, and communicates with Hive using the Ruby Thrift client.

We all like pictures, so here is a diagram:



The idea is simple: we run our own ‘data server’ to act as a gateway to reports. This allows us to:

Provide an easy-to-use end-point to run data exploration queries (using SQL and simple web-forms).

Cache the results of queries (in a database) to power reports, so that the data is available to everyone, whenever it is needed.

Allow our hadoop cluster to be totally dynamic without having to move data around (we shut it down at night and on weekends).

Add new data in a simple way (just put it in Amazon S3!).

Analyse data from several data sources (mongodb, postgres, log-files).
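
To make the gateway idea above a little more concrete, here is a minimal sketch of what such an exploration endpoint might look like. The original post doesn't show the controller code, so the class names (QueriesController, HiveQuery, RunHiveQuery) and fields are illustrative assumptions, not Foursquare's real implementation:

  class QueriesController < ApplicationController
    # Accept a SQL string from a simple web form, store it,
    # and queue it for execution against the Hive cluster.
    def create
      query = HiveQuery.create!(
        :sql        => params[:sql],         # the HiveQL typed into the form
        :user_email => current_user.email,   # where to send the results link
        :status     => 'queued'
      )
      Resque.enqueue(RunHiveQuery, query.id) # hand off to a background worker
      redirect_to query                      # show a "we'll email you" page
    end
  end

The important design point is that the web request only enqueues work; the query itself runs asynchronously so the cluster can be slow (or even switched off) without tying up the app.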

Importing Data

The last two points are very important. In fact a large portion of
the data-server’s code base is dedicated to data cleaning and importing.
We found it best to represent all data in tab-delimited flat files. To
turn mongo/log/postgres/json data into this format, each ‘table’ has a
specification written in ruby. Here is a simple example:

class Checkin < Foursquare::MappedClass
  include Foursquare::LocationLookup

  mapped_attributes :id, :venue, :shout, :lat, :long
  mapped_attributes :country, :state, :timezone
end

data = "{
  id: '123',
  venue: '456',
  shout: 'ayup mum!',
  ll: '24.5,-50.4',
  something_else: 'boo!'
}"

checkin = Checkin.new(data)

So now:

puts checkin.to_tab_delimited
=> 123   456   ayup mum!   24.5   -50.4   us   New York   America/New_York

The initialize method provided by Foursquare::MappedClass can
interpret several data types; in this example, JSON is used. By including
the LocationLookup module, country, state, and timezone can be
automatically added if a lat/long field exists (using a local MongoDB
database). For all such transformations, tabs, newlines, and excess
white-space are removed from field values to ensure that each record
occupies only a single line.
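
The post doesn't include the internals of Foursquare::MappedClass, but based on the description above, the cleaning and output logic presumably looks something like this rough sketch (the mapped_attribute_names helper and attribute readers are assumptions; only the cleaning rules come from the text):

  module Foursquare
    class MappedClass
      # Remove tabs, newlines, and excess whitespace so each record
      # stays on a single line of the tab-delimited output.
      def sanitize(value)
        value.to_s.gsub(/[\t\r\n]/, ' ').squeeze(' ').strip
      end

      # Join the mapped attributes, in declaration order, with tabs.
      def to_tab_delimited
        self.class.mapped_attribute_names.map { |name| sanitize(send(name)) }.join("\t")
      end
    end
  end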

We have rake tasks to run this as either a simple script, or as part of a Hadoop streaming job.
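
The streaming script itself isn't shown in the post, but a Hadoop streaming mapper is just a program that reads records on STDIN and writes results to STDOUT, so the script version of the import is presumably along these lines (a hedged sketch; the require path is an assumption):

  #!/usr/bin/env ruby
  # Hadoop streaming mapper: read one raw JSON check-in per line on STDIN,
  # emit one cleaned, tab-delimited record per line on STDOUT.
  require_relative 'foursquare/checkin'   # assumed location of the Checkin class

  STDIN.each_line do |line|
    begin
      puts Checkin.new(line).to_tab_delimited
    rescue StandardError
      # Skip records that fail to parse rather than killing the whole job.
    end
  end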

Running Queries

Because we’re storing data away from the production system, people can
run queries that generate 1,000,000,000,000 records if they want to (I’m
looking at you @injust), and the system simply emails them a link to their
results when the query has finished (so they don’t have to wait around).
In fact we can run all sorts of cool stats.
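
The post doesn't show how the Resque worker and the Hive Thrift client are glued together, so the following is only a rough sketch of the flow described above; HiveConnection and ResultsMailer are made-up stand-ins for whatever wrappers the real data server uses:

  class RunHiveQuery
    @queue = :hive_queries

    def self.perform(query_id)
      query = HiveQuery.find(query_id)

      # Run the HiveQL over Thrift and cache the rows in our own database,
      # so the report stays available even after the cluster is shut down.
      rows = HiveConnection.execute(query.sql)   # hypothetical Thrift wrapper
      query.update_attributes(:results => rows, :status => 'finished')

      # Email the requester a link instead of making them wait around.
      ResultsMailer.results_ready(query).deliver
    end
  end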

A Fun Example

Let's say we want to find the city with the rudest citizens, judged by
how often a tip left in that city contains a curse word. We could run
this query:

SELECT
  v.city,
  v.state,
  sum(curse) AS curses,
  sum(any) AS any_tip,
  sum(curse) / sum(any) AS percentage
FROM
  (
    SELECT
      venueid,
      IF(text LIKE '%curseword_here%', 1, 0) AS curse,
      1 AS any
    FROM tips
  ) tips
JOIN venues v ON tips.venueid = v.id
GROUP BY v.city, v.state
SORT BY percentage DESC
After 5 minutes of waiting, we have a list of the top 20 offenders (highest % of tips containing curse words):



(I’ve filtered out cities that had less than 1000 tips total.)

It's good to see that the Mancunians truly are not only the rudest people
in the UK, but the rudest people globally; only El Paso comes close.
Although please keep in mind that this only evaluates the rudeness of
English-speaking countries (like that would make a difference?).

In Summary

Amazon’s Elastic MapReduce plus a simple Ruby on Rails server can
make a powerful (and cheap) data analysis tool. By reducing the barrier
to data-exploration we have been able to inform better business
decisions, and even create a little fun.

- Matthew Rathbone, Foursquare Engineer (and a proud British midlander)