Web Content Mining
2008-06-17 11:36
393 查看
Web mining is a rapid growing research area. It consists of Web usage mining, Web structure mining, and Web content mining. Web usage mining refers to the discovery of user access patterns from Web usage logs. Web structure mining tries to discover useful knowledge from the structure of hyperlinks. Web content mining aims to extract/mine useful information or knowledge from web page contents. This tutorial focuses on Web Content Mining.
Web content mining is related but different from data mining and text mining. It is related to data mining because many data mining techniques can be applied in Web content mining. It is related to text mining because much of the web contents are texts. However, it is also quite different from data mining because Web data are mainly semi-structured and/or unstructured, while data mining deals primarily with structured data. Web content mining is also different from text mining because of the semi-structure nature of the Web, while text mining focuses on unstructured texts. Web content mining thus requires creative applications of data mining and/or text mining techniques and also its own unique approaches. In the past few years, there was a rapid expansion of activities in the Web content mining area. This is not surprising because of the phenomenal growth of the Web contents and significant economic benefit of such mining. However, due to the heterogeneity and the lack of structure of Web data, automated discovery of targeted or unexpected knowledge information still present many challenging research problems. In this tutorial, we will examine the following important Web content mining problems and discuss existing techniques for solving these problems. Some other emerging problems will also be surveyed.
Data/information extraction: Our focus will be on extraction of structured data from Web pages, such as products and search results. Extracting such data allows one to provide services. Two main types of techniques, machine learning and automatic extraction are covered.
Web information integration and schema matching: Although the Web contains a huge amount of data, each web site (or even page) represents similar information differently. How to identify or match semantically similar data is a very important problem with many practical applications. Some existing techniques and problems are examined.
Opinion extraction from online sources: There are many online opinion sources, e.g., customer reviews of products, forums, blogs and chat rooms. Mining opinions (especially consumer opinions) is of great importance for marketing intelligence and product benchmarking. We will introduce a few tasks and techniques to mine such sources.
Knowledge synthesis: Concept hierarchies or ontology are useful in many applications. However, generating them manually is very time consuming. A few existing methods that explores the information redundancy of the Web will be presented. The main application is to synthesize and organize the pieces of information on the Web to give the user a coherent picture of the topic domain..
Segmenting Web pages and detecting noise: In many Web applications, one only wants the main content of the Web page without advertisements, navigation links, copyright notices. Automatically segmenting Web page to extract the main content of the pages is interesting problem. A number of interesting techniques have been proposed in the past few years.
All these tasks present major research challenges and their solutions also have immediate real-life applications. The tutorial will start with a short motivation of the Web content mining. We then discuss the difference between web content mining and text mining, and between Web content mining and data mining. This is followed by presenting the above problems and current state-of-the-art techniques. Various examples will also be given to help participants to better understand how this technology can be deployed and to help businesses. All parts of the tutorial will have a mix of research and industry flavor, addressing seminal research concepts and looking at the technology from an industry angle.
<?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" />
For more information, please visit our website: http://www.knowlesys.com
相关文章推荐
- Web Content Mining
- Web Content Mining
- Graph-Theoretic Techniques for Web Content Mining
- Web Content Mining
- Web Content Mining
- Web Content Mining
- Web Content Mining
- Web Content Mining
- Web Content Mining
- Web Content Mining
- Web Content Mining
- WebRoot与WebContent的区别
- JavaEE_Web文件下载_lesson2_ JavaEE下载文件,中文乱码解决方法(Content-Disposition)
- eclipse默认指向 WebContent 目录 修改为 webRoot
- http://webcache.googleusercontent.com/search?q=cache:Xh34UJ65wvcJ:http://ja.w3support.net/index.php%
- Web页面未完全渲染:content部分未渲染,与navbar渲染效果相同(关键词:flask/bug)
- Mining the Link Structure of the World Wide Web
- Content Query Webparts 返回页面
- web content 嵌入到应用上
- java web Content is not allowed in prolog-解决