Proceedings of 2011 International Conference on Advancements in Information Technology (ICAIT 2011), Chennai, India, 17-18 December 2011

An Algorithmic Schema Matching Technique for Information Extraction on the Web

Suresh Jain, C. S. Bhatia
SIA, Indore; KCBTA, Indore
Abstract. The world today faces the challenge of handling and managing information that is expanding at an exponential rate. Search engines are powerful at extracting HTML-based information stored on websites. However, search engines have limitations in extracting information stored in backend databases, due to schema matching problems or security restrictions. Many problems on the web remain unsolved, so searches cannot yet be made precise and relevant to context. These problems fall into different categories. The first category consists of technical problems, which are related to data rather than to meaning. The second category consists of semantic problems, which are related to the meaning of data. This paper deals with the solution of these semantic problems with the help of a novel schema matching technique.
Keywords: XML, RDF, Schema Matching, Global Schema Based Querying (GSQ)

1. Introduction
Nowadays, the number of information sources and websites on the web is increasing. There are 32 million active websites on the Internet, accessed by one billion Internet users [1, 2]. However, there are so many web pages that users cannot navigate all of them to gain the information they need. A system that allows users to query web pages like a database is becoming increasingly desirable. Information refers to an idea or fact and is used for making decisions [5]. In information science it is widely accepted that information = data + meaning. Data is often obtained as a result of recording or observation; a huge amount of raw input is data, and in that form it has no meaning. For example, the temperature of each day is data: a system or person monitors the daily temperature and records it. The data is converted into meaningful information when the patterns in the temperature are analyzed and a conclusion about the temperature is reached. Raw data does not answer any questions, but information answers questions such as where, what, who, and when. Data is used as input for processing, and the output of this processing is information. Observation and recording are done to obtain data, while analysis is done to obtain information. From a broader perspective, information extraction is part of the learning process through which humans increase their knowledge (which carries semantic meaning) and wisdom [7]. Fig. 1 shows the raw data-information-knowledge processing chain.
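As a small worked example of this data-to-information step (the temperature readings and the notion of a "rising" trend are hypothetical illustrations of ours, in Python):

import statistics

# Raw data: recorded daily temperatures; by themselves they answer no questions.
temperatures = [31, 33, 34, 36, 38]  # degrees Celsius, hypothetical readings

# Analysis turns the data into information: a pattern and a conclusion.
average = statistics.mean(temperatures)
trend = "rising" if temperatures[-1] > temperatures[0] else "not rising"
print(f"Average {average:.1f} C, trend: {trend}")  # answers 'what' about the week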
All types of data and information are stored on the Internet. The Internet supports every form of information that can be digitized, implements tools for information exchange such as mail, libraries, and television broadcasts, serves as a medium for business communication, and constantly introduces new forms. The Internet has thus overtaken all predecessor information-source technologies and has become a language of communication for the world. Extracting information in such an environment is only possible with the use of information extraction tools. The search engine, the most popular information extraction tool, has become essential for Internet users, although information extraction tools are not perfect and still have many problems to solve. These problems fall into two broad categories. Technical or syntactic problems: the syntactic problems are related to data rather than to meaning. They concern data management issues such as data representation, data storage, and data exchange protocols. Technical problems have been solved through the use of mature technologies, such as databases for storing and querying large volumes of data and common Internet protocols for data exchange. Thanks to these technologies,
today's Internet users are mostly unaware of technical problems. Semantic problems: these are related to the meaning of data. Semantic problems occur when there is disagreement about the meaning or interpretation of information; the computer system which manipulates the information cannot really understand its meaning. This type of problem is known as the semantic gap. Internet users are well aware that semantic problems exist. Attempts to solve the semantic problems on the Internet proceed in two complementary directions: (1) developing smarter tools, which try to capture and use the meaning of information and answer user queries exactly, and (2) representing (i.e., storing) and organizing information so that its meaning becomes explicit and machine readable. To illustrate these approaches, the following sections briefly survey information representations (Sec. 2) and information extraction tools (Sec. 3) on the Internet. Sec. 3.4 proposes global schema based querying as a new way of gathering useful information from the XML portion of the Internet. Definitions of schema matching given by different experts are discussed in Sec. 4, and schema matching in global schema based querying is presented in Sec. 4.1. Finally, Sec. 5 concludes this paper.
2. Information Representation
HTML: Hyper Text Markup Language is the language in which web pages are described. It provides a means to create structured documents by denoting structural semantics for text such as headings, paragraphs, lists, links, quotes, and other items. It allows images and objects to be embedded and can be used to create interactive forms. It is written in the form of HTML elements consisting of "tags" surrounded by angle brackets within the web page content. The purpose of a web browser (like Internet Explorer or Firefox) is to read HTML documents and display them as web pages [5]. The browser does not display the HTML tags, but uses them to interpret the content of the page [8]. XML: HTML quickly became a bottleneck in the effort to store and manage huge volumes of data on the Internet. In May 1996, Jon Bosak became the leader of the group responsible for adapting SGML for use on the Internet, and the result became a standard W3C recommendation in February 1998. Extensible Markup Language (XML) is a set of rules for encoding documents in machine-readable form. It was designed to carry data, not to display data. XML tags are not predefined, so authors must define their own tags. XML documents are further processed as needed; for example, to display information, technologies such as XSLT or XQuery [5] transform an XML document into a desired HTML document. RDF: In 1998, Tim Berners-Lee conceived the Semantic Web: an information representation technique geared towards enabling fully automatic reasoning over the represented information. The core components of the technique are the Resource Description Framework (RDF) and the Web Ontology Language (OWL). RDF models information as a set of subject-predicate-object triples, where the predicate is a directed relation between two resources: the subject and the object [8]. The three information representation techniques, HTML, XML, and RDF with OWL, presently coexist on the Internet. However, the research in this paper only considers XML as a format for representing information. Furthermore, as both XML and the Semantic Web tackle the problem of semantic heterogeneity, this problem is considered in this paper.
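As a minimal sketch of the XML representation discussed above, using user-defined tags for the Library example that runs through this paper (the element names are our own illustration, in Python):

import xml.etree.ElementTree as ET

# A small XML document: tags such as <publication> are user-defined, not fixed by XML.
doc = """
<lib>
  <publication>
    <author>Dickens</author>
    <title>Oliver Twist</title>
  </publication>
</lib>
"""

# XML carries data; a separate processing step (here a Python loop,
# in practice XSLT or XQuery) decides how to present it.
root = ET.fromstring(doc)
for pub in root.findall("publication"):
    print(pub.find("author").text, "-", pub.find("title").text)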
3. Information Extraction Tools

We envision the Internet as a massive library with no catalog and no staff to help. It is not easy to remember the addresses of all the required sites and sources, so developers have built a number of search tools and services for this purpose, which provide browse/search interfaces to retrieve and access the required information. In this section we first discuss the most general information extraction tools: directories, search engines, meta search engines, and people search tools, and then take a look at research efforts towards more powerful information extraction in an XML environment. Information extraction proceeds in three stages, sketched schematically below. User requesting information: this is the first stage, where the user types a query or keyword into the query interface and submits it to the information extraction tool. After receiving the query, the tool searches the Internet and discovers the information sources which contain the requested information. Understanding the information source: at this point the information extraction tool has finished the search and has discovered multiple sources containing the requested information. Due to the semantic gap, the tool cannot arrive at the perfect information sources, but it still solves many problems. Information processing: an information source usually contains a massive amount of information. In this stage the user analyzes or filters the whole content and extracts the exact information from the source. Different information extraction tools support these three stages differently and provide different interfaces for them. The first and second stages are common to nearly every information extraction tool; the third stage is supported only in some cases.
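The three stages can be pictured as a simple pipeline. The sketch below is only a schematic skeleton under assumptions of ours (the toy page index, the keyword-count relevance ranking, and all function names are hypothetical), not an implementation of any particular tool:

def request(query, index):
    """Stage 1: the user submits a query; the tool discovers candidate sources."""
    return [url for url, text in index.items() if query in text]

def understand(candidates, index, query):
    """Stage 2: understand/rank the discovered sources by apparent relevance."""
    return sorted(candidates, key=lambda url: -index[url].count(query))

def process(relevant, index, query):
    """Stage 3 (supported only by some tools): extract the exact fragments."""
    return {url: [line for line in index[url].splitlines() if query in line]
            for url in relevant}

index = {  # hypothetical crawled pages
    "site-a": "Dickens wrote Oliver Twist.\nAusten wrote Emma.",
    "site-b": "Weather report for Chennai.",
}
hits = request("Dickens", index)             # stage 1
ranked = understand(hits, index, "Dickens")  # stage 2
print(process(ranked, index, "Dickens"))     # stage 3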
Search Engine

A web search engine [11] is designed to search for information on the World Wide Web. A search engine is a kind of information retrieval (IR) system [9] which sees the Internet as a document collection. Information extraction with a search engine is illustrated in Fig. 4. The user expresses his information need with a query or keyword K, and the IR system provides the answer by pointing to the documents that are most relevant to the user's query. To help the user understand the proposed documents, the search engine takes him to the exact page on which the words and phrases he is looking for appear. A search engine, in general, does not provide support for information processing. The design of a search engine is determined by two requirements: (1) effectiveness (quality of results) and (2) efficiency (response time and throughput). In the case of web directories and meta search engines, the information processing stage is likewise not supported. The number of XML data sources on the Internet is increasing, and the Internet is converging from an unstructured towards a more structured state. Current research considers two approaches to querying this data: (1) the user is responsible for resolving representation mismatches between his query and the structure of the data, and (2) the system is responsible for resolving representation mismatches. Tools used for freely querying structured data on the Internet can be considered both as information extraction tools and as querying tools, so we use both terms to describe the purpose of these systems. We now introduce our technique for structured querying of unidentified XML data: global schema based querying (GSQ). The technique belongs to the second group, i.e., the system takes the responsibility for resolving representation mismatches, and is similar to the idea of semantic query processing. The following section describes the technique. As a running example, suppose the user is interested in Library information and builds a global schema similar to the one given in Fig. 5a. A global query over this schema may look like:

/lib/publication/(author, Dickens)/title
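A minimal sketch of evaluating such a global query against one concrete XML source, assuming for the moment that the source happens to use the same element names as the global schema; the translation of the (author, Dickens) selection step into an XPath predicate is our own illustration (in Python):

import xml.etree.ElementTree as ET

source = """
<lib>
  <publication><author>Dickens</author><title>Oliver Twist</title></publication>
  <publication><author>Austen</author><title>Emma</title></publication>
</lib>
"""

root = ET.fromstring(source)  # the root element corresponds to /lib
# /lib/publication/(author, Dickens)/title: select the title of every
# publication whose author element has the value 'Dickens'.
for title in root.findall("publication[author='Dickens']/title"):
    print(title.text)  # -> Oliver Twist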
These global schemas represent Library information that is private to the user; the actual structure of Library information on the Internet may be different, but the user need not be aware of it. The user is allowed to pose queries over his global schema; these queries are called global queries. From the user's point of view, the three stages of information extraction in a GSQ can be distinguished as in Fig. 7. In stage 1, the user submits a query over his global schema. The main responsibility of the GSQ is to find data sources on the Internet and match this query against their local schemas. The GSQ finds many schema mappings. A mapping might be in the form of a query, or it might be a set of expressions between items in each schema, as sketched below.
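A minimal sketch of the second kind of mapping, a set of correspondences between items of the two schemas; the local schema paths are hypothetical (in Python):

# A mapping expressed as item-to-item correspondences between the user's
# global schema and one local schema found on the Internet.
mapping = {
    "/lib/publication/author": "/library/book/writer",
    "/lib/publication/title": "/library/book/name",
}

def rewrite(global_path: str) -> str:
    """Rewrite one global schema item into the corresponding local item."""
    return mapping[global_path]

print(rewrite("/lib/publication/title"))  # -> /library/book/name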
In stage 2, a GSQ implementation must have at least two parts: a schema matcher and a query evaluator. The schema matcher is responsible for matching the global schema, supplied by the user, against the schemas of the Internet, which are presumably stored in a large schema repository; it delivers a list of possible mappings between the two. In stage 3, the query evaluator of the GSQ rewrites the global query q into a query over a concrete data source qi and, optionally, transforms the answer ai back into the structure corresponding to that of the global schema. The idea of global schema based querying is not entirely new, and it shares many properties and problems with the other techniques mentioned above. This paper addresses problems of interest to the broader area of structured querying of distributed XML information sources.
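A minimal sketch of the matcher side of this architecture, using plain string similarity between element names as the matching criterion; the repository contents, the threshold, and the name-similarity heuristic are simplifying assumptions of ours, not the paper's full technique (in Python):

from difflib import SequenceMatcher

def name_sim(a: str, b: str) -> float:
    """Similarity of the last path step of two schema items."""
    return SequenceMatcher(None, a.split("/")[-1].lower(),
                           b.split("/")[-1].lower()).ratio()

def match_schemas(global_paths, repository, threshold=0.5):
    """Schema matcher: propose a mapping from the global schema into each
    schema of the repository, keeping only sufficiently similar items."""
    results = []
    for schema_id, local_paths in repository.items():
        mapping = {}
        for g in global_paths:
            best = max(local_paths, key=lambda l: name_sim(g, l))
            if name_sim(g, best) >= threshold:
                mapping[g] = best
        if mapping:
            results.append((schema_id, mapping))
    return results

repository = {  # hypothetical schema repository
    "source1": ["/library/book/author", "/library/book/booktitle"],
}
print(match_schemas(["/lib/publication/author", "/lib/publication/title"],
                    repository))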