Knowledge Map for Big Data Engineers
2014-01-27 14:22
Topic | Content | Key points | Reference |
DB/OLTP & DW/OLAP | Database/OLTP basics | The relational model, SQL, index/secondary index, inner join/left join/right join/full join, transaction/ACID | Ramakrishnan, Raghu, and Johannes Gehrke. Database Management Systems. |
Database internals & implementation | Architecture, memory management, storage/B+ tree, query parsing/optimization/execution, hash join/sort-merge join | ||
Distributed and parallel database | Sharding, database proxy | ||
Data warehouse/OLAP | Materialized views, ETL, column-oriented storage, reporting, BI tools | ||
Basic programming | Programming language | Java, Python (Pandas/NumPy/SciPy/scikit-learn), SQL, Functional programming, R/SAS/SPSS | Wes McKinney. Python for Data Analysis: Agile Tools for Real World Data. |
OS | Linux | ||
DB & DW systems | MySQL/Hive/Impala | ||
Text formats and processing | JSON/XML, regex | ||
Tools | Git/SVN, Maven | ||
Distributed system & Hadoop ecosystem & NoSQL | Distributed system principles and theory | CAP theorem, RPC (Protocol Buffers/Thrift/Avro), ZooKeeper, metadata management (HCatalog) | |
Distributed storage & computing framework & resource management | Hadoop/HDFS/MapReduce/YARN | Tom White. Hadoop: The Definitive Guide. Donald Miner, Adam Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. | |
SQL on Hadoop | Data (log) acquisition/integration/fusion, normalization, feature extraction | Sqoop, Flume/Scribe/Chukwa, SerDe | Edward Capriolo, Dean Wampler, Jason Rutherglen. Programming Hive. |
Query & In-database analytics | Hive, Impala, UDF/UDAF | ||
Large scale data mining & machine learning framework | Spark/MLbase, MR/Mahout | ||
Stream processing | Storm | ||
NoSQL | HBase/Cassandra (column-oriented database) | Lars George. HBase: The Definitive Guide. | |
MongoDB (document database) | |||
Neo4j (graph database) | |||
Redis (cache) | |||
Data mining & Machine learning | DM & ML basics | Numerical/categorical variables, training/test data, overfitting, bias/variance, precision/recall, tagging | |
Statistics | Data exploration (mean, median, range, standard deviation, variance, histogram), probability distributions (normal/Gaussian, Poisson), covariance, correlation coefficient, distance and similarity computation, Bayes' theorem, Monte Carlo method, hypothesis testing | ||
Supervised learning | Classifier, boosting, prediction, regression analysis | Han, Jiawei, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. | |
Unsupervised learning | Clustering, deep learning | ||
Collaborative filtering | Item-based CF, user-based CF | ||
Algorithm | Classifier | Decision trees, KNN (k-nearest neighbors), SVM (support vector machines), SVD (singular value decomposition), naïve Bayes classifiers, neural networks | |
Regression | Linear regression, logistic regression, ranking, perceptron | ||
Clustering | Hierarchical clustering, k-means clustering, spectral clustering | ||
Dimensionality reduction | PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis), MDS (Multidimensional Scaling) | ||
Text mining & Information retrieval | Corpus, term-document matrix, term frequency & weight, association rules, market basket analysis, vocabulary mapping, sentiment analysis, tagging, PageRank, VSM (Vector Space Model), inverted index | Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce. |
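
To make two of the text-mining entries above concrete, here is a minimal Python sketch of an inverted index and a term-document matrix. It is only an illustration: the three sample documents, the whitespace tokenizer, and the function names (`tokenize`, `build_inverted_index`, `term_document_matrix`) are assumptions made for this example and do not come from any of the referenced books.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def term_document_matrix(docs):
    """Return (vocabulary, rows); rows[doc_id][j] is the raw count of vocabulary[j]."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    vocabulary = sorted({term for c in counts.values() for term in c})
    rows = {doc_id: [c[term] for term in vocabulary] for doc_id, c in counts.items()}
    return vocabulary, rows


# Hypothetical sample corpus, used only for illustration.
docs = {
    "d1": "hadoop stores data in hdfs",
    "d2": "hive runs sql on hadoop",
    "d3": "hbase stores data on hdfs",
}

index = build_inverted_index(docs)
vocabulary, matrix = term_document_matrix(docs)

print(index["hadoop"])  # documents that mention "hadoop"
print(matrix["d1"])     # raw term counts for d1, ordered by vocabulary
```

The inverted index answers "which documents contain this term?", while each row of the term-document matrix is the raw term-frequency vector that a vector space model would then weight (for example with TF-IDF) and compare.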