
Cassandra Quick Start

2015-09-29 12:21

Introduction to Cassandra

Cassandra

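Cassandra is an open-source distributed NoSQL database system. It was originally developed at Facebook to store simply structured data such as inbox messages, and it combines the data model of Google BigTable with the fully distributed architecture of Amazon Dynamo. Facebook open-sourced Cassandra in 2008. Since then, thanks to its good scalability, it has been adopted by well-known Web 2.0 sites such as Digg and Twitter and has become a popular solution for distributed structured data storage.

Cassandra is a hybrid non-relational database, similar to Google's BigTable. Its feature set is richer than that of Dynamo (a distributed key-value store), but its query support falls short of the document store MongoDB (an open-source product that sits between relational and non-relational databases; among non-relational databases it is the most feature-rich and the most like a relational database. Its data structures are very loose, using BSON, a JSON-like format, so it can store fairly complex data types).

Cassandra builds on Amazon's proprietary, fully distributed Dynamo and adds Google BigTable's column family data model on top of decentralized P2P storage, which makes it a good fit for social networking and cloud computing workloads. In many respects it can be called Dynamo 2.0.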

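High reliability

Cassandra uses gossip as the communication protocol between the nodes of a cluster. Under this protocol all nodes have equal status and there is no master/slave distinction, so the departure of any single node does not bring down the whole cluster.

Both Cassandra and HBase borrow ideas from Google BigTable, but another important innovation in Cassandra is that it brought P2P (peer to peer), which originated in file-sharing architectures, into NoSQL.

A key trait of P2P is decentralization: all nodes in the cluster are equal, which largely removes the risk that the whole cluster stops working when a single node leaves.

HBase, by contrast, uses a master/slave design, which leaves a potential single point of failure.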
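The way peer-to-peer gossip keeps working when a node is down can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the function name `gossip_rounds`, the push-pull exchange rule, and the node counts are hypothetical, and Cassandra's real gossip protocol (with versioned state and failure detection) is considerably more involved.

```python
import random

def gossip_rounds(node_count, failed, seed=0, max_rounds=1000):
    """Return the round in which every live node knows the update, or None."""
    rng = random.Random(seed)
    live = [n for n in range(node_count) if n not in failed]
    informed = {live[0]}  # one live node learns something new
    for round_no in range(1, max_rounds + 1):
        for node in live:
            # Each live node contacts one random peer; there is no coordinator.
            peer = rng.choice([p for p in range(node_count) if p != node])
            if peer in failed:
                continue  # a dead peer simply does not answer
            # Push-pull exchange: if either side knows the update, both do afterwards.
            if node in informed or peer in informed:
                informed.update((node, peer))
        if len(informed) == len(live):
            return round_no
    return None

rounds = gossip_rounds(node_count=8, failed={3})
print("all live nodes informed after", rounds, "round(s)")
```

Because every node runs the same loop, removing any one of them only removes one participant; the remaining nodes still converge.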

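High scalability

As time passes, the existing cluster may no longer be large enough to hold newly added data, and the system has to be expanded. Cassandra scales out horizontally: adding new nodes to an existing cluster is a simple operation.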
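One reason adding a node is cheap is that data is placed on a hash ring, so a new node takes over only part of the key space. The sketch below is an illustration of that idea, not Cassandra's implementation: the `Ring` class, the node names, the use of MD5 as a stand-in hash, and 64 tokens per node are all assumptions made for the example.

```python
import bisect
import hashlib

def token(value):
    """Map a string to a ring position (MD5 is only a stand-in hash here)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self.tokens = []  # sorted list of (token, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node owns several tokens spread around the ring.
        for i in range(self.vnodes):
            bisect.insort(self.tokens, (token(f"{node}#{i}"), node))

    def owner(self, key):
        # First token clockwise from the key's position, wrapping around.
        idx = bisect.bisect_left(self.tokens, (token(key), ""))
        return self.tokens[idx % len(self.tokens)][1]

keys = [f"user-{i}" for i in range(10000)]
ring = Ring(["node-a", "node-b", "node-c"])
before = {k: ring.owner(k) for k in keys}
ring.add_node("node-d")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved) / len(keys):.1%} of keys changed owner")
```

Only keys that now fall in front of one of the new node's tokens change owner; every other key stays where it was, which is why the existing nodes do not need to reshuffle data among themselves.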

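Eventual consistency

Every distributed storage system has to face the CAP theorem: no distributed storage system can guarantee consistency, availability, and partition tolerance all at the same time.

Cassandra gives priority to AP, that is, availability and partition tolerance.

Cassandra offers different consistency levels for write and read operations, so users can choose the level that suits their application scenario.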
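The trade-off between consistency levels can be reasoned about with simple arithmetic: if the number of replicas that acknowledge a write plus the number consulted by a read exceeds the replication factor, the two sets must share at least one replica. The sketch below works through that rule for the levels ONE, QUORUM, and ALL; the helper names and the replication factor of 3 are assumptions chosen for illustration, and the verdict strings simplify away details such as concurrent writes.

```python
def quorum(replication_factor):
    """Number of replicas in a QUORUM: a majority of the replicas."""
    return replication_factor // 2 + 1

def overlaps(replication_factor, write_acks, read_acks):
    """True when every read set shares at least one replica with every write set."""
    return write_acks + read_acks > replication_factor

rf = 3  # assumed replication factor
levels = {"ONE": 1, "QUORUM": quorum(rf), "ALL": rf}
for w_name, w in levels.items():
    for r_name, r in levels.items():
        verdict = "read overlaps latest write" if overlaps(rf, w, r) else "stale read possible"
        print(f"write={w_name:<6} read={r_name:<6} -> {verdict}")
```

Writing and reading at QUORUM is a common middle ground: it tolerates one unavailable replica out of three while still guaranteeing the overlap.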

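Efficient writes

Writes are very efficient, which is a clear advantage for applications that ingest large volumes of real-time data.

Read performance depends on the access pattern:

If a single row is read by its key, the result is returned quickly.

A range query may target data stored on several nodes, so several nodes have to be queried and the response is slow.

Reading an entire table is very inefficient.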

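Structured storage

Cassandra is a column-oriented database, and for developers coming from an RDBMS background the learning curve is relatively gentle.

Cassandra also provides CQL, a fairly friendly query language whose statements closely resemble SQL.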

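Easy maintenance

From an operations point of view, Cassandra's peer-to-peer architecture keeps maintenance simple. Adding nodes, removing nodes, and even adding a new data center all follow short, straightforward procedures.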

Quick installation and use on Ubuntu

Step 0: Download and install the JDK

1. JDK download: JDK 8

2. Install the JDK as follows:

root@zctt-Lenovo-Product:/home/zctt/下载# tar -xzf jdk-8u60-linux-x64.tar.gz
root@zctt-Lenovo-Product:/home/zctt/下载# cd jdk1.8.0_60
root@zctt-Lenovo-Product:/home/zctt/下载/jdk1.8.0_60# ls
bin        javafx-src.zip  man          THIRDPARTYLICENSEREADME-JAVAFX.txt
COPYRIGHT  jre             README.html  THIRDPARTYLICENSEREADME.txt
db         lib             release
include    LICENSE         src.zip
root@zctt-Lenovo-Product:/home/zctt/下载/jdk1.8.0_60# gedit /etc/profile
Add the following settings:
export JAVA_HOME=/usr/local/java/jdk1.8.0_60
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/tools.jar:$JAVA_HOME/lib/rt.jar
root@zctt-Lenovo-Product:/home/zctt/下载/jdk1.8.0_60# source /etc/profile
root@zctt-Lenovo-Product:/home/zctt/下载/jdk1.8.0_60# java -version
java version "1.7.0_79"
OpenJDK Runtime Environment (IcedTea 2.5.6) (7u79-2.5.6-0ubuntu1.14.04.1)
OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)

Note: the output above still reports the system OpenJDK 7. JAVA_HOME points to /usr/local/java/jdk1.8.0_60, so the unpacked jdk1.8.0_60 directory has to be moved there, and $JAVA_HOME/bin has to come before the existing PATH entries (export PATH=$JAVA_HOME/bin:$PATH) for java -version to report JDK 8.

Step 1: Download Cassandra

Download the latest stable release: apache-cassandra-2.1.9-bin.tar.gz.

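Step 2: Basic configuration

Cassandra's configuration files are in the conf directory of both the binary and the source distributions. If you installed Cassandra from a deb or rpm package, the configuration files are in /etc/cassandra.

Step 2.1: Create the directories Cassandra needs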

1. Configure the directories:

In conf/cassandra.yaml, set the paths for data_file_directories (/var/lib/cassandra/data), commitlog_directory (/var/lib/cassandra/commitlog), and saved_caches_directory (/var/lib/cassandra/saved_caches), and make sure the directories are readable and writable.

The steps are:

mkdir -p /var/lib/cassandra/data
mkdir -p /var/lib/cassandra/commitlog
mkdir -p /var/lib/cassandra/saved_caches
chmod 777 /var/lib/cassandra/data
chmod 777 /var/lib/cassandra/commitlog
chmod 777 /var/lib/cassandra/saved_caches
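The same directory preparation can be scripted. The following is a minimal sketch, assuming a hypothetical helper `prepare_dirs` and a tighter permission mode (0o750) than the 777 used above; it runs against a temporary directory so it can be tried without touching /var/lib/cassandra.

```python
import os
import tempfile

def prepare_dirs(base, names=("data", "commitlog", "saved_caches"), mode=0o750):
    """Create each directory under base (parents included) and set its permission bits."""
    created = []
    for name in names:
        path = os.path.join(base, name)
        os.makedirs(path, exist_ok=True)
        # Explicit chmod, because the mode passed to makedirs is subject to the umask.
        os.chmod(path, mode)
        created.append(path)
    return created

# Demonstration against a throwaway directory instead of /var/lib/cassandra.
with tempfile.TemporaryDirectory() as tmp:
    for path in prepare_dirs(os.path.join(tmp, "cassandra")):
        print(path, oct(os.stat(path).st_mode & 0o777))
```

For a real installation the base would be /var/lib/cassandra and the script would need to run as a user allowed to create directories there, with ownership given to the account that runs Cassandra.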

Set the log path in conf/logback.xml:

mkdir /var/log/cassandra
chmod 777 /var/log/cassandra
gedit conf/logback.xml
Change the log file path:
<file>/var/log/cassandra/system.log</file>

JVM-level settings such as heap size can be configured in conf/cassandra-env.sh.

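Step 3: Start Cassandra

1. To start in the foreground, run 'bin/cassandra -f'; press Control-C to stop it.

2. To start in the background, run 'bin/cassandra'; stop it with 'pkill -f CassandraDaemon'.

3. If startup fails, follow the hints in the log, search http://news.gmane.org/gmane.comp.db.cassandra.user for a solution, or check your configuration.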

Step 4: Using cqlsh

4.1 Basic cqlsh usage:

Connect to localhost by default:
bin/cqlsh
Connect to a specific node:
bin/cqlsh 192.168.123.3
After a successful connection cqlsh prints:
Connected to Test Cluster at localhost:9160.
[cqlsh 2.3.0 | Cassandra 1.2.2 | CQL spec 3.0.0 | Thrift protocol 19.35.0]
Use HELP for help.
First, create a keyspace, which is a namespace for tables:
CREATE KEYSPACE mykeyspace
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
Second, switch to the new keyspace:
USE mykeyspace;
Third, create a users table:
CREATE TABLE users (
    user_id int PRIMARY KEY,
    fname text,
    lname text
);
Now store some data in the users table:
INSERT INTO users (user_id,  fname, lname)
VALUES (1745, 'john', 'smith');
INSERT INTO users (user_id,  fname, lname)
VALUES (1744, 'john', 'doe');
INSERT INTO users (user_id,  fname, lname)
VALUES (1746, 'john', 'smith');
Now query the inserted data:
SELECT * FROM users;
The output is:
 user_id | fname | lname
---------+-------+-------
    1745 |  john | smith
    1744 |  john |   doe
    1746 |  john | smith
To retrieve only the rows you need, create an index and filter on the indexed column:
CREATE INDEX ON users (lname);

SELECT * FROM users WHERE lname = 'smith';

 user_id | fname | lname
---------+-------+-------
    1745 |  john | smith
    1746 |  john | smith

4.2 Summary

With the steps above you can quickly install Cassandra and start using it.

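5. Cassandra data model

5.1 Single-table queries

5.1.1 Querying a single table by primary key

Suppose we build a database of personal information keyed by each person's ID card number, and lookups also use only the ID number. The table can then be designed as:

CREATE TABLE person (
    userid text PRIMARY KEY,
    fname text,
    lname text,
    age int,
    gender int
);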

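The first column named in the primary key serves as the partition key: the hash of the partition key decides which partition a row is stored in. If a poorly chosen single partition key makes the hash results all fall into the same partition, that partition becomes overloaded.

The way around this is a composite partition key, which spreads the data as evenly as possible across the nodes.

For example, (userid, fname) can be used as the composite partition key. The table creation statement then becomes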

CREATE TABLE person (
    userid text,
    fname text,
    lname text,
    gender int,
    age int,
    PRIMARY KEY ((userid, fname), lname)
) WITH CLUSTERING ORDER BY (lname DESC);

A short explanation of primary key((userid, fname), lname):

(userid, fname) is the composite partition key

lname is the clustering column

((userid, fname), lname) together form the composite primary key
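The effect of widening the partition key can be seen in a small model. This is a sketch only: the partition count of 4, the MD5 hash, the helper `partition_for`, and the sample rows are assumptions for illustration, whereas Cassandra computes a token with its configured partitioner and maps it onto the ring.

```python
import hashlib
from collections import Counter

PARTITIONS = 4  # hypothetical number of partitions, for illustration only

def partition_for(*key_parts):
    """Hash all partition-key components together and map the result to a partition."""
    joined = "\x00".join(key_parts).encode("utf-8")
    return int(hashlib.md5(joined).hexdigest(), 16) % PARTITIONS

# Hypothetical rows that all share one userid but differ in fname.
rows = [("u42", f"name-{i}") for i in range(1000)]

single = Counter(partition_for(userid) for userid, _ in rows)
composite = Counter(partition_for(userid, fname) for userid, fname in rows)

print("partition key (userid):       ", dict(single))
print("partition key (userid, fname):", dict(composite))
```

With userid alone, all 1000 rows hash to one partition; with (userid, fname) they spread over every partition. The price is that a query must then supply both components to locate a partition.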

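5.1.2 Querying a single table by a non-key column

To find everyone in the person table with the same first name, create an index on fname. Without one, Cassandra rejects the query unless ALLOW FILTERING is added, and a filtered scan is very slow.

CREATE INDEX ON person (fname);

At present Cassandra can index only a single column of a table; composite indexes over several columns are not allowed.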

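5.2 Multi-table join queries

Cassandra does not support joins, and the 2.1 release used here also lacks GROUP BY and aggregate functions such as max and min.

Does that mean Cassandra only looks good on paper and cannot solve real problems? Clearly not, as long as you do not insist on approaching the problem the RDBMS way.

Suppose we have two tables: department records the company's departments and employee records its employees. Every employee belongs to a department, and we want to list all the employees of each department. With an RDBMS the SQL could be:

select * from employee e , department d where e.depId = d.depId;

To get the same result in Cassandra, an extra table (dept_empl) has to be created alongside employee and department to record the employees of each department.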

Create table dept_empl (
deptId text,

As you can see, Cassandra achieves efficient queries through data redundancy: the join is turned into an operation on a single table.
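The idea of replacing a join with a redundant, query-shaped table can be modeled with plain dictionaries. In this sketch the sample departments and employees, and the field names `emp_id`, `name`, and `dept_name`, are hypothetical; the dictionary keyed by department plays the role of a dept_empl table partitioned by department id.

```python
from collections import defaultdict

# Hypothetical source data.
departments = {"d1": "Engineering", "d2": "Sales"}
employees = [
    {"emp_id": "e1", "name": "Ann", "dep_id": "d1"},
    {"emp_id": "e2", "name": "Bob", "dep_id": "d1"},
    {"emp_id": "e3", "name": "Cid", "dep_id": "d2"},
]

# Write path: every employee is also copied into a table keyed by department,
# together with the department fields a reader would otherwise join in.
dept_empl = defaultdict(list)
for emp in employees:
    dept_empl[emp["dep_id"]].append(
        {"emp_id": emp["emp_id"], "name": emp["name"], "dept_name": departments[emp["dep_id"]]}
    )

# Read path: one lookup by partition key replaces the join.
for row in dept_empl["d1"]:
    print(row)
```

The cost moves to the write path: each change to an employee or a department name has to be applied to the redundant copy as well.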

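5.3 Grouping and aggregation

group by, max, and min, which are common in an RDBMS, are not available in the Cassandra release used here.

How should the data model be built if we want to group everyone by first name (the fname column)?

CREATE TABLE fname_person (
    fname text,
    userId text,
    -- fname is the partition key (the group); userId as clustering column
    -- keeps one row per person instead of overwriting rows with the same fname
    PRIMARY KEY (fname, userId)
);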
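Once rows are partitioned by the grouping column, the aggregation itself can be done by the client after reading each partition. The sketch below models that with dictionaries; the sample rows and the `age` values are hypothetical additions for the example and are not part of the fname_person table above.

```python
from collections import defaultdict

# Hypothetical rows: (fname, userid, age).
rows = [("john", "u1", 31), ("john", "u2", 45), ("mary", "u3", 28)]

# Model of a table partitioned by fname with userid as clustering column:
# rows with the same fname land in the same partition, which acts as the group.
by_fname = defaultdict(dict)
for fname, userid, age in rows:
    by_fname[fname][userid] = age  # a repeated (fname, userid) would overwrite, like an upsert

# GROUP BY fname with COUNT/MIN/MAX, computed on the client side.
summary = {
    fname: {"count": len(ages), "min_age": min(ages.values()), "max_age": max(ages.values())}
    for fname, ages in by_fname.items()
}
print(summary)
```

For large groups, an alternative is to maintain the aggregate in its own table at write time, in the same trade-space-for-time spirit as the rest of this section.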

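5.4 Subqueries

Cassandra does not support subqueries. The original figure here showed an example of a subquery in MySQL.

To do the same in Cassandra, an extra table holding redundant information has to be added.

CREATE TABLE office_empl (
    officeCode text,
    country text,
    lastname text,
    firstname text,
    PRIMARY KEY (officeCode, country)
);
CREATE INDEX ON office_empl (country);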

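5.5 Summary

In short, building a Cassandra data model requires the read requirements to be as clear as possible up front; denormalized design is then used to make reads fast. The principle is to trade space for time.

6. Writing applications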

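To connect to Cassandra you need a database driver for your programming language.

DataStax has led the development of CQL drivers; see https://github.com/datastax for details. The full list of CQL drivers is on the ClientOptions page.

To decide how to design your schema and lay out your data, it helps to review the resources on data modeling.

You may also want to read the full CQL documentation.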
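As one possible starting point, the sketch below uses the DataStax Python driver (installed with pip install cassandra-driver) to query the users table from step 4. The helper names `open_session` and `users_by_lname` are hypothetical, the contact point 127.0.0.1 and the keyspace mykeyspace are assumptions carried over from the earlier steps, and the query relies on the index on lname created there.

```python
def open_session(hosts=("127.0.0.1",), keyspace="mykeyspace"):
    """Connect to the cluster and return (cluster, session)."""
    # Imported lazily so that users_by_lname below has no hard dependency on the driver.
    from cassandra.cluster import Cluster
    cluster = Cluster(list(hosts))
    return cluster, cluster.connect(keyspace)

def users_by_lname(session, lname):
    """Return (user_id, fname, lname) tuples for one last name."""
    # The driver fills in %s placeholders for simple (non-prepared) statements.
    rows = session.execute(
        "SELECT user_id, fname, lname FROM users WHERE lname = %s", (lname,)
    )
    return [(row.user_id, row.fname, row.lname) for row in rows]
```

A typical use is to call open_session() once at startup, pass the session to users_by_lname(session, 'smith'), and call cluster.shutdown() when the application exits. Passing the value as a parameter rather than formatting it into the query string avoids quoting problems.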

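7. Configuring a multi-node cluster

You now have one working Cassandra node, which is a Cassandra cluster with a single node. By adding more nodes you can turn it into a multi-node cluster.

Setting up a Cassandra cluster is almost as simple as repeating the configuration above on every node, with a few small differences.

In conf/cassandra.yaml, the seeds option lists the seed nodes through which nodes exchange information with each other, listen_address is the IP address each node listens on for other nodes, and rpc_address is the IP address clients use to reach each node.

cluster_name sets the name of the cluster and must be the same on every node.

A reference cassandra.yaml configuration: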

# The name of the cluster. This is mainly used to prevent machines in
# one logical cluster from joining another.
cluster_name: 'zctt'
# Directories where Cassandra should store data on disk. Cassandra
# will spread data evenly across them, subject to the granularity of
# the configured compaction strategy.
# If not set, the default directory is $CASSANDRA_HOME/data/data.
data_file_directories:
    - /usr/local/apache-cassandra-2.1.9/data
# commit log. when running on magnetic HDD, this should be a
# separate spindle than the data directories.
# If not set, the default directory is $CASSANDRA_HOME/data/commitlog.
commitlog_directory: /usr/local/apache-cassandra-2.1.9/commitlog
# saved caches
# If not set, the default directory is $CASSANDRA_HOME/data/saved_caches.
saved_caches_directory: /usr/local/apache-cassandra-2.1.9/saved_caches
# any class that implements the SeedProvider interface and has a
# constructor that takes a Map<String, String> of parameters will do.
seed_provider:
    # Addresses of hosts that are deemed contact points.
    # Cassandra nodes use this list of hosts to find each other and learn
    # the topology of the ring. You must change this if you are running
    # multiple nodes!
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "192.168.123.3,127.0.0.1"
# If you choose to specify the interface by name and the interface has an ipv4 and an ipv6 address
# you can specify which should be chosen using rpc_interface_prefer_ipv6. If false the first ipv4
# address will be used. If true the first ipv6 address will be used. Defaults to false preferring
# ipv4. If there is only one address it will be selected regardless of ipv4/ipv6.
#rpc_interface: eth0
#rpc_interface_prefer_ipv6: false
rpc_address: 192.168.123.3

Once all nodes are configured and running, use the 'bin/nodetool status' utility to verify that the cluster is connected correctly. For example:

root@zctt-Lenovo-Product:/usr/local/apache-cassandra-2.1.9# ./bin/nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns    Host ID                               Rack
UN  192.168.123.3  215.32 KB  256     ?       e8880f57-b5f8-4e4f-8406-41b361ef109e  rack1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
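Checking this output by eye works for one node; for a larger cluster it can be handy to parse it. The sketch below is a minimal parser written against the layout shown above, with the hypothetical helper `parse_nodetool_status`; it assumes node lines begin with the two-letter status/state code and may need adjusting if the column layout differs in other versions.

```python
def parse_nodetool_status(text):
    """Extract one dict per node line from `nodetool status` output."""
    nodes = []
    for line in text.splitlines():
        parts = line.split()
        # Node lines start with a two-letter code: U/D (up/down) + N/L/J/M (state).
        if (len(parts) >= 2 and len(parts[0]) == 2
                and parts[0][0] in "UD" and parts[0][1] in "NLJM"):
            nodes.append({"up": parts[0][0] == "U", "state": parts[0][1], "address": parts[1]})
    return nodes

sample = """\
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns    Host ID                               Rack
UN  192.168.123.3  215.32 KB  256     ?       e8880f57-b5f8-4e4f-8406-41b361ef109e  rack1
"""

nodes = parse_nodetool_status(sample)
unhealthy = [n["address"] for n in nodes if not (n["up"] and n["state"] == "N")]
print(nodes)
print("unhealthy:", unhealthy)
```

In practice the text would come from running bin/nodetool status (for example through subprocess), and any address in the unhealthy list is a node that is down or not in the Normal state.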

References

1. http://wiki.apache.org/cassandra/
