
Tuesday, August 6, 2019

Challenges In Web Information Retrieval Computer Science Essay

This chapter presents an overview of information retrieval (IR). It explains why information retrieval is needed, how the IR problem can be handled, and what a model for efficient and intelligent retrieval involves. It briefly defines the major issues in information retrieval, discusses the motivation for choosing this topic for the dissertation, and considers how IR can be used in web searching, including the user's involvement in the retrieval model. The chapter also outlines the approaches that have been proposed for the user, the system and the data in order to achieve efficient and intelligent retrieval; the different models focus on how data and documents are organized and stored. Finally, the chapter sets out the proposed study, with particular emphasis on the necessities of information retrieval.

The amount of information available in the world today is remarkable, and it is leading to an information explosion. The explosion is due to the availability of data and documents online, yet searching for and accessing a particular data item or document remains a problem. Digitization has allowed ordinary people to store huge amounts of electronic data. Electronic data can be transmitted easily by email and disseminated easily on the web, and stored text can be searched to retrieve relevant information on any topic and reuse it. The information explosion means that far more relevant information is readily available than our cognitive capacity can absorb, which makes it difficult to decide which documents are relevant.
It has therefore become necessary for information retrieval (IR) systems to employ intelligent techniques to provide effective access to this huge amount of information. With the emergence of the World Wide Web in particular, users have access to an enormous number of documents. More and more information services, such as news services, libraries and electronic mail, are readily available, and more and more material is going online to give users prompt access. As more textual information becomes available on the web, the growing size of the information sources makes it difficult for people to find relevant textual documents. The information that reaches users often does not match their interests and merely overloads them, so they have to pick out the relevant information manually from a huge bundle. This creates an urgent demand for more effective retrieval systems that retrieve data and documents efficiently and intelligently. This research effort aims to capture semantics and integrate them into IR systems. The study explores the idea in two directions. The first is the efficiency of search results, which can be addressed with statistical methods. The second is the need to improve relevance, both in the semantic sense and in the relevance technique used. This motivates an attempt to improve document storage and query representation. Natural language processing (NLP) techniques can also help to segregate and classify the data for best use. A relevance technique serves not only the efficiency of retrieval but also the intelligent capture of semantics in the representation and matching processes. Research in this area therefore focuses broadly on two directions.
The first is expanding the entered query into a better representation of the user's needs; the second is determining what is relevant in the document representation in order to improve the results. Information about a document that would otherwise be lost can be recovered using relevance assessment techniques. Relevance cannot be judged on the basis of term occurrence alone. Existing retrieval systems, however, rely on basic retrieval models such as the Boolean, standard vector and probabilistic models, which treat both documents and queries as sets of unrelated terms. These classical models have the advantage of being simple, scalable and computationally feasible, but they do not offer an accurate and complete representation. Because the classical models ignore it, semantic and related information about the document has an important role to play in the retrieval process. It is difficult to identify useful documents simply on the basis of the words used by the author, because words may mean different things in different contexts, as pointed out in [Zrehen S, 2000]. It is impossible to retrieve all documents pertaining to a particular subject, because such documents do not share a common set of keywords and because current search engines may or may not address semantics or context. This work focuses mainly on semantic techniques. However, building a complete semantic understanding of text requires human-like processing and is beyond the scope of this work. The objective of this work is to classify documents as relevant or non-relevant with respect to a standing query, with more accuracy and less overhead. A detailed and accurate semantic interpretation is not needed for this classification [Evans David A. Zhai C.,1996], a fact that distinguishes IR from other NLP applications. The semantic knowledge needed to determine the relevance of a document can be extracted easily from the text with respect to the author or user.
This can be implemented through an overlaying facility, which helps to deal with relationships, one of the most important factors in the design of information retrieval systems. These techniques allow search and retrieval systems to improve the document and/or query representation and to address document semantics. They not only improve the ranking of retrieved documents but also adapt queries on the basis of relevance feedback and so improve retrieval performance. Finally, because so much information is being produced, and at such a rate, that no single technique can remedy all problems, we propose a hybrid approach to information retrieval and evaluate one such model. The approach explores both directions, efficient retrieval and intelligent retrieval. Recognizing the inadequacy of current approaches to information retrieval, the work focuses on investigating intelligent techniques that help to retrieve information effectively. An IR system performs effectively when its representation, comparison and interaction methods are implemented well, so techniques that improve any of these aspects lead to intelligent retrieval. The overlaying facility captures the relationships between the different layers of data. This leads to a hybrid model that applies efficient and intelligent techniques using hierarchical and semantic approaches. To improve the efficacy of an IR system, we need a better understanding of the issues involved in information retrieval and of the problems associated with existing traditional IR systems. Algorithms and applications built on these techniques can provide significant benefit. This defines the scope of the work.
The rest of the chapter first discusses the issues involved in information retrieval and the problems associated with current approaches. It then discusses the motivation behind the work and examines the proposed work in detail. This overview also serves as a summary of the core technical contributions of the work. It briefly reviews some of the previous research that establishes the necessity of the work. Lastly, it describes the organization of the dissertation.

1.2. Major issues in information retrieval

A number of issues are involved in the design and evaluation of IR systems; some of them are discussed here. The first important issue is the choice of a representation for the document. Most human knowledge is coded in natural language, but natural language is difficult to use as a knowledge representation language for computer systems. Current retrieval models are based on searching by keyword or by author. Keyword representation creates problems during retrieval because of polysemy, homonymy and synonymy. Polysemy is the phenomenon of a lexeme having multiple meanings; keyword matching may not always include word sense matching [Justin Picard Jacques Savoy ,2000]. Homonymy is an ambiguity in which words that appear the same have unrelated meanings. Such ambiguity makes it difficult for a computer to determine the conceptual content of documents automatically. Synonymy creates a problem when a document is indexed with one term and the query contains a different term with the same meaning. Previous studies indicate that people tend to use different expressions to convey the same meaning [Blair D., Maron M., 1990]. Recent work on developing extensive lexicons is an attempt to improve the situation [Mittendorf E. et al., 2000]. Traditional retrieval models ignore semantic and contextual information in the retrieval process [Judith P. Dick, 1992], [Ounis I. Huibers T,W.C.
1997]. This information is lost when keywords are extracted from the text and cannot be recovered by the retrieval algorithms. Improving IR therefore demands an improved representation of text. A related issue is the characterization of queries by users. Queries are often inappropriate because they are vague and inaccurate, owing, for instance, to users' lack of knowledge of the subject or to the inherent vagueness of natural language itself. Users may fail to include relevant terms in the query or may include irrelevant ones. An inappropriate or inaccurate query leads to poor retrieval performance. The problem of ill-specified queries can be dealt with by modifying or expanding them; relevance feedback is an effective technique for doing so based on user interaction. Improving the representation of documents and/or queries is thus central to improving IR. To satisfy a user's request, an IR system matches document representations against the query representation. How to match the representation of a query with that of a document is another issue. A number of similarity measures have been proposed to quantify the similarity between a query and a document in order to produce a ranked list of results, and selecting an appropriate similarity measure is a crucial issue in IR system design. The evaluation of the performance of IR systems is also one of the major issues in IR. There are many aspects of evaluation, the most important being the effectiveness of an IR system. Recall and precision are the most widely used measures of effectiveness in the IR community. Improving effectiveness is the underlying theme in evaluating any technique and is one of the core issues in this work. The evaluation of IR system performance relies on the notion of relevance, which is subjective in nature [Saracevic T., 1991]. Only the user can tell the true relevance.
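Recall and precision, named above as the most widely used effectiveness measures, can be computed for a single query as in the following minimal sketch. It assumes binary relevance judgments and set-based (unranked) evaluation; the function name and the document ids are hypothetical and chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of document ids returned by the system
    relevant:  iterable of document ids judged relevant (binary judgments)
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved

    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical ids: four documents retrieved, three judged relevant.
    p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
    print(f"precision={p:.2f} recall={r:.2f}")
```

Because the judgments are binary, this sketch shares the limitation discussed in the text: it cannot express that one document is more relevant than another.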
Because it is based on the user's perception, this true relevance cannot be measured. One may, however, define a degree of relevance. Relevance has usually been treated as a binary concept, whereas it is really a continuous function (a document may be exactly what the user wants, or it may be closely related). Current evaluation techniques do not support this continuity. A number of relevance frameworks have been proposed in [Saracevic T., 1996], including the system, communication, psychological and situational frameworks. The most inclusive is the situational framework, which is based on the cognitive view of the information seeking process and considers the importance of situation, context, multi-dimensionality and time. A survey of relevance studies can be found in [Mizzaro S. ,1997]. Most evaluations of IR systems so far have been done on document test collections with known relevance judgments. The large size of document collections also complicates text retrieval. Further, users' needs vary: some users require answers of limited scope, while others require documents of wide scope. These different needs can require different, specialized retrieval methods. This work attempts to handle some of these problems by proposing techniques that improve the representation of documents and queries and by incorporating new similarity measures. Information retrieval models based on these representations and similarity measures are proposed and evaluated in this work. Another factor that decreases the usefulness of search engines is the dynamic nature of the Web, which results in many dead links and in out-of-date pages that have changed since they were indexed. Even setting these factors aside, attempts to find relevant information with Web search engines often fail. Document retrieval systems typically present search results in a ranked list, ordered by estimated relevance to the query.
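The ranked-list presentation described above can be illustrated with one simple similarity measure. The sketch below assumes raw term-frequency vectors and cosine similarity; the text does not prescribe either choice, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (doc_id, score) pairs ordered by decreasing similarity.

    documents: dict mapping doc_id -> document text
    """
    q = term_vector(query)
    scored = [(doc_id, cosine(q, term_vector(text)))
              for doc_id, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical mini-collection, for illustration only.
    docs = {
        "a": "apple pie recipe",
        "b": "apple music store",
        "c": "car repair",
    }
    for doc_id, score in rank("apple recipe", docs):
        print(doc_id, round(score, 3))
```

Note that this measure treats terms as unrelated, exactly the limitation of the classical models discussed earlier: the single-word query "apple" would score the recipe page and the music page alike.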
Relevance is estimated from the similarity between the text of a document and the query. Such ranking schemes work well when users can formulate a well-defined query. However, users of Web search engines often formulate very short queries (70% are single-word queries [Motro, 98]) that retrieve large numbers of documents. From such a condensed representation of the user's search interests, it is impossible for the search engine to identify the specific documents that are of interest. Moreover, many webmasters now actively work to influence rankings. These problems are intensified when users are unfamiliar with the topic they are querying, when they are novices at searching, or when the search engine's database contains a large number of documents. All of these conditions commonly hold for users of Web search engines. The vast majority of the retrieved documents are therefore often of no interest to the user; such searches are termed low-precision searches. The low precision of Web search engines, coupled with the ranked-list presentation, forces users to sift through a large number of documents and makes it hard for them to find the information they are looking for. Because low-precision Web searches are inevitable, tools must be provided to help users cope with, and make use of, these large document sets. Such tools should include means of browsing easily through large sets of retrieved documents.

1.3 Necessity of present work

The motivation for this research is to make search engine results easy to browse. Document classification algorithms attempt to group similar documents together, and classifying or grouping the results of Web search engines can provide a powerful browsing tool. The automatic grouping of similar documents (document groups) is a feasible method of presenting the results of Web search engines.
1.3.1 Classification

Document groups were initially investigated in information retrieval mainly as a means of improving the performance of search engines by pre-clustering the entire corpus [Jardine and van Rijsbergen, 71]. The cluster hypothesis [van Rijsbergen, 79] states that similar documents tend to be relevant to the same queries, so the automatic detection of clusters of similar documents can improve recall by effectively broadening a search request. We, however, are investigating classification as a means of browsing large sets of retrieved documents, and therefore need to modify the grouping slightly to suit this domain. Our user-class hypothesis is that users have a mental model of the topics and subtopics of the documents in the result set, and that similar documents tend to belong to the same category in the user's model. The automatic detection of clusters of similar documents can thus help the user to browse the result set. Classifying and grouping documents, for example with respect to the author, can help users in three ways: (1) it can allow them to find the information they are looking for more easily, (2) it can help them to realize sooner that a query is poorly formulated (e.g., too general) and to reformulate it, and (3) it can reduce the fraction of queries on which the user gives up before reaching the desired information. For example, if a user wishes to find apple recipes on the Web and performs a search using the query apple, only 10% of the returned documents will be related to apple recipes (the rest will relate to apple music, apple products that can be bought on the web and a software product called apple; many documents will have no apparent connection to apple at all). If we were to cluster the results, the user could find the group relating to apple recipes and thus save valuable browsing time.
We have identified some key requirements for clustering search engine results, and a support vector machine is used to implement this kind of grouping: 1) Coherent clusters: the clustering algorithm should group similar documents together. 2) Efficiently browsable: the user needs to determine at a glance whether the contents of a cluster are of interest, so the system has to provide concise and accurate cluster descriptions. 3) Speed: the system should not introduce a substantial delay before displaying the results.

In preliminary experiments carried out at the beginning of this study, we found Web documents, and especially search engine snippets, to be poor candidates for classification because they are short and often poorly formatted. This led us to consider the use of phrases in the classification of search engine results, as phrases contain more information than single words (information about the proximity and order of words). Phrases have the equally important advantage of higher descriptive power than single words, which matters when describing the contents of a group to the user concisely. Groups can be formed by keyword with respect to subject and sub-subject, or with respect to the author or user.

1.3.2 Relevancy in documents

With respect to the clustering of documents or users, the important considerations for retrieval are as follows. Search engines are extremely important in helping users to find relevant information on the World Wide Web. To serve users' needs as well as possible, a search engine must find and filter the most relevant information matching a user's query, and then present that information in a manner that makes it most readily accessible to the user.
The system applies these techniques and mediates between the user and the documents in order to retrieve relevant documents efficiently. Moreover, information retrieval and presentation must be done in a scalable fashion to serve the hundreds of millions of user queries that are issued every day to popular web search engines (Tomlin, 2003). Researchers addressing the problem of information retrieval (IR) on the web face a number of challenges. This section deals with some of these challenges and identifies additional problems that may motivate future work in the IR research community. It also describes some work in these areas that has been conducted at various search engines. It begins by briefly outlining some of the issues that arise in web information retrieval. The user interacts with the system directly for information retrieval, as shown in Figure 1.1.

Figure 1.1: IR System Components.

Database records are easy to search: their fields have well-defined semantics and can be compared directly with a query to find matches, as in a bank database query. The semantics of the keywords sent through the interface also play an important role. There are three major components: the data, the user and the system. These three components are linked to one another by two-way relationships. The system is the computer system and the software applications loaded on it; it includes the interface to the search engine servers, the databases and the indexing mechanism, which includes stemming techniques.
Similarly, the user defines the search strategy (Herbach, 2001) and states the search requirements. Subject indexing, ranking and clustering are applied to the documents available on the WWW (Kleinberg,1999). For records, relevant matches are easily found by comparison with field values, and relevance feedback can also be incorporated for efficient searching. The data may be as simple as documents in different formats. With a database, maintaining and retrieving records is simple, but with unstructured text documents it is difficult. Search engine development is based primarily on indexing, which assists WWW users in performing information retrieval tasks. Evaluation studies of efficient and intelligent retrieval have considered the impact of system features (Kunchukuttan,2006), in particular those with which the user interacts for search assistance. Because an information retrieval system is evaluated in a complex environment, measures of the utility and usability of its search results are required from the user's perspective. The proposed model for user-centered evaluation is based on a conceptual framework in which user satisfaction is characterized as a variable dependent on system features and system functions. Applying the same search criteria consistently gives better matches and better results. The dimensions of IR have become vast because of different media, different types of search applications and different tasks; IR now covers not only text but also web search as a central application. Whether IR approaches to search and evaluation are appropriate in all media is an emerging issue in IR.
Information retrieval involves the following tasks and sub-tasks: 1) Ad-hoc search finds all the relevant documents for an arbitrary text query; 2) Filtering identifies the relevant user profiles for a new document: a profile is maintained for each user, and relevant documents are categorized and displayed according to that profile; 3) Classification identifies the relevant labels for documents; 4) Question answering frames relevant questions automatically in order to focus on the individual's need and so supports better classification judgments. The tasks are shown in Figure 1.2.

Figure 1.2: Proposed Model of Search Engine.

The field of IR deals with relevance and evaluation and interacts with users in order to respond to their needs and queries. IR also involves effective ranking and testing, and measurement of the data available for retrieval. A relevant document contains the information that a person was looking for when they submitted a query to the search engine. Many factors influence a person's decision about relevance, such as task, context, novelty and style. Topical relevance (same topic) and user relevance (everything else) are the dimensions that help in IR modeling, and each retrieval model defines a view of relevance. In relevance feedback, the user provides information that the system can use to modify its next search or next display. Relevance feedback thus reflects how well the system understands the user's need and the concepts and terms related to that need.
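One way to turn relevance feedback into a modified next search is sketched below. The text does not name a specific method; this example swaps in the classical Rocchio formula as an illustration, and the function name, the weights alpha, beta and gamma, and the sample vectors are all assumptions.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an expanded query as a dict mapping term -> weight.

    query:       dict term -> weight for the original query
    relevant:    list of dicts (term -> weight) the user marked relevant
    nonrelevant: list of dicts (term -> weight) the user marked non-relevant
    alpha, beta, gamma: illustrative weights, not values from the text
    """
    terms = set(query)
    for doc in relevant + nonrelevant:
        terms.update(doc)

    def centroid(docs, term):
        # Mean weight of `term` over a group of documents (0 if the group is empty).
        return sum(d.get(term, 0.0) for d in docs) / len(docs) if docs else 0.0

    expanded = {}
    for term in terms:
        weight = (alpha * query.get(term, 0.0)
                  + beta * centroid(relevant, term)
                  - gamma * centroid(nonrelevant, term))
        if weight > 0:  # negative weights are dropped, a common simplification
            expanded[term] = weight
    return expanded


if __name__ == "__main__":
    # Hypothetical feedback: one document marked relevant, one non-relevant.
    new_query = rocchio(
        {"apple": 1.0},
        relevant=[{"apple": 1.0, "recipe": 2.0}],
        nonrelevant=[{"apple": 1.0, "music": 3.0}],
    )
    print(sorted(new_query.items()))
```

The expanded query gains terms from documents the user accepted and loses weight on terms from documents the user rejected, which is one concrete reading of "modifying the next search".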
Retrieval uses several techniques. Web pages contain links to other pages, and by analyzing this web graph structure it is possible to determine a more global notion of page quality. Remarkable successes in this area include the PageRank algorithm (Tomlin, 2003), which analyzes the entire web graph globally and provided the original basis for ranking in various search engines, and Kleinberg's hyperlink-induced topic search (HITS) algorithm (Herbach, 2001, Kleinberg,1999), which analyzes a local neighborhood of the web graph containing an initial set of web pages matching the user's query. Since then, several other link-based methods for ranking web pages have been proposed, including variants of both PageRank and HITS (Kleinberg, 1999, Joachims, 2003), and this remains an active research area with much fertile ground still to be explored. Recent work on hubs and authorities identifies a form of equilibrium among WWW sources on a common topic, with a model that explicitly accounts for the diversity of roles among different types of pages (Herbach,2001). Some pages are prominent sources of primary content and are considered the authorities on the topic; other pages, equally essential to the structure, assemble high-quality guides and resource lists that act as focused hubs, directing users to recommended authorities. The nature of the linkage in this framework is highly asymmetric. Hubs link heavily to authorities but may themselves have very few incoming links, and authorities may not link to other authorities. This suggested model (Herbach,2001) is completely natural; relatively anonymous individuals are creating many good hubs on the Web.
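The hub and authority roles described above can be made concrete with mutually reinforcing weights: a page's authority weight is taken proportional to the sum of the hub weights of the pages linking to it, and its hub weight proportional to the sum of the authority weights of the pages it links to. The sketch below iterates these two updates on a tiny link graph. The fixed iteration count, the Euclidean normalization and the page names are assumptions made for illustration.

```python
import math


def _normalize(weights):
    """Scale the weights in place so their Euclidean norm is 1."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    if norm > 0:
        for page in weights:
            weights[page] /= norm


def hits(links, iterations=50):
    """Return (hub, authority) weight dicts for a small link graph.

    links: dict mapping each page to the list of pages it links to
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub weights of pages that link to the page.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        _normalize(auth)

        # Hub update: sum of authority weights of pages the page links to.
        hub = {p: sum(auth[dst] for dst in links.get(p, [])) for p in pages}
        _normalize(hub)
    return hub, auth


if __name__ == "__main__":
    # Hypothetical graph: two hub pages pointing at two content pages.
    graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
    hub, auth = hits(graph)
    print("best authority:", max(auth, key=auth.get))
    print("best hub:", max(hub, key=hub.get))
```

In this toy graph the content pages receive all the authority weight and none of the hub weight, which mirrors the asymmetry between hubs and authorities noted in the text.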
A formal notion of equilibrium consistent with this model can be defined by assigning each page two numbers, a hub weight and an authority weight. The weights are assigned so that a page's authority weight is proportional to the sum of the hub weights of the pages that link to it, and a page's hub weight is proportional to the sum of the authority weights of the pages that it links to. Adversarial classification (Sahami et al.,1998) may be used to deal with spam on the Web. One particularly interesting problem in web IR arises from the attempts of some commercial interests to inflate the ranking of their web pages by engaging in various forms of spamming (Joachims, 2003). Spam methods can be effective against traditional IR ranking schemes that do not make use of link structure, but they have more limited utility in the context of global link analysis. Realizing this, spammers now also use link spam: they create large numbers of web pages containing links to the pages whose rankings they wish to raise. An interesting approach is to apply automatic filters continually; spam filtering is already very popular in email. Such filtering can be applied concurrently with the indexing of documents. The current study proposes a hybrid semantic model that combines algorithms and applications for efficient and intelligent retrieval. It involves different retrieval practices, in which the system plays an important role. Further, the three components under consideration, the system, the document and the user, are analyzed by applying the Analytic Hierarchy Process (AHP) model. The study helps to carry out the algorithms, applications and models associated with each of these components.

1.5. Organization of the thesis

The thesis is organized into seven chapters, including the present chapter, which introduces the IR problem, presents a brief review of the work done in the field and provides an overview of our work. An outline of the remaining chapters follows. Intelligent and efficient information retrieval requires an account of the data organization, the user's perspective and the user interface of the system, and of their importance. The theoretical investigations reported in the thesis are organized as follows. The theoretical analysis of the proposed methods, covering the various intelligent and efficient structural, algorithmic and application-based approaches, is discussed in the subsequent chapters. It is also appropriate to take a real scenario, since the interaction mechanism between the user and data layers is important in defining the model and its properties. The main results achieved with the present models are summarized briefly below. The basic parameters of efficient and intelligent retrieval, and the formulation of an effective and intelligent retrieval approach, are outlined in Chapter II. For an information retrieval study to succeed, effort must be prioritized across user-centric, system-centric and data-centric aspects, because their interactions are effective up to the second level of the hierarchy. Interactions occur within a layer and also between a layer and the layer above or below it within the system. A straightforward extension is possible, since these systems are open-ended and allow data and users to join them according to internal requirements and to form a complete collection of documents and data. The effective parameters, namely relevancy, ranking and layout, have been incorporated into the analysis using the analytic hierarchy process (AHP).
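To show how AHP can turn pairwise judgments about relevancy, ranking and layout into priority weights, here is a minimal sketch. The pairwise comparison values are hypothetical and are not taken from the thesis; the row geometric mean is used as a common approximation of the priority vector, and the random index values are Saaty's published figures for three to five criteria.

```python
import math

CRITERIA = ["relevancy", "ranking", "layout"]

# Hypothetical pairwise judgments on Saaty's 1-9 scale:
# PAIRWISE[i][j] says how much more important criterion i is than criterion j.
PAIRWISE = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 2.0],
    [1 / 5, 1 / 2, 1.0],
]


def ahp_weights(matrix):
    """Approximate the priority vector with the row geometric mean method."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def consistency_ratio(matrix, weights):
    """Consistency ratio (CR); values below about 0.1 are usually accepted."""
    n = len(matrix)
    # Estimate the principal eigenvalue as the mean of (A w)_i / w_i.
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    consistency_index = (lambda_max - n) / (n - 1)
    random_index = {3: 0.58, 4: 0.90, 5: 1.12}[n]  # supports 3 to 5 criteria only
    return consistency_index / random_index


if __name__ == "__main__":
    weights = ahp_weights(PAIRWISE)
    for name, w in zip(CRITERIA, weights):
        print(f"{name}: {w:.3f}")
    print(f"consistency ratio: {consistency_ratio(PAIRWISE, weights):.4f}")
```

With these illustrative judgments, relevancy receives the largest weight, followed by ranking and then layout; different judgments would of course give a different ordering.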
To make the proposed work more revealing, the applicability of these parameters is explored further in the proposed model, which describes the interaction and interrelation between the data and the user, as presented in Chapter II. The research study provides a theoretical background of IR techniques, which helps in designing the retrieval model. The detailed study first establishes the basic relationship between the system and the data. Different techniques based on this relationship, or link, that define efficient data retrieval have been investigated, and the results are presented in Chapter III. The later part of that chapter explores intelligent data processing and analysis for intelligent data retrieval, using the different techniques employed in designing the retrieval model. Here the detailed study establishes the basic relationship between the system, the user and the data, and different techniques based on this relationship define intelligent data retrieval. This depends very much on the semantics of the individual layer, according to the user's interest or taste. Links between two objects change the strength of the objects: an object's strength is based on its incoming and outgoing links, that is, on its popularity. On the basis of its strength, an object can be considered the highest ranked and also the most relevant. Effective interrelation succeeds in explaining the popularity of an object with consistent behavior. A semantic annotation framework supports intelligent retrieval by using natural semantics. The Vector Space Model and Latent Semantic Indexing techniques are analyzed theoretically in Chapter IV. The research used an effective inte
