Text Search Engine

Published in: Computer Science

2,969 Views

Final Year CSE project PPT

Mishal A / Noida

33 years of teaching experience

Qualification: B.Tech/B.E. (BIT MESRA - 2014)

Teaches: All Subjects, Chemistry, Computer Science, Mathematics, Physics

TEXT SEARCH ENGINE
AIM
To build a prototype of a search engine that works on millions of Wikipedia pages (stored in XML format) and retrieves the top 10 relevant Wikipedia documents matching the input query.
Introduction
- "Search engine" is a misnomer: it has to be a "sorting engine"
- How to rank the documents and their titles?
- Exact-value search vs. full-text search
  - Matching cannot be strict in a search engine
- Relevance-based search
- Machine learning approach
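The "sorting engine" idea can be sketched as choosing the k best-scoring documents from a set of candidates. This is only an illustration: the class `TopKRanker`, the method `topK`, and the score map are hypothetical names, and how the relevance scores are computed is left open here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper, not taken from the slides.
class TopKRanker {
    // Returns the ids of the k highest-scoring documents, best first.
    static List<String> topK(Map<String, Double> scores, int k) {
        Comparator<Map.Entry<String, Double>> byScore = Map.Entry.comparingByValue();
        // Min-heap on score: the weakest of the current top k sits at the head.
        PriorityQueue<Map.Entry<String, Double>> heap = new PriorityQueue<>(byScore);
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            heap.offer(entry);
            if (heap.size() > k) {
                heap.poll(); // drop the lowest score once more than k are held
            }
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey()); // ascending by score
        }
        Collections.reverse(result); // best first
        return result;
    }
}
```

Calling `TopKRanker.topK(scores, 10)` would correspond to the "top 10" stated in the aim. A bounded heap avoids sorting every candidate when only a few results are needed.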
Theory Involved
- Inverted Index
  - All unique words that appear in any document
  - For each word, a list of the documents in which it appears
  - Various levels of indexing can be applied
- Text Pre-processing
  - Tokenisation
    - White space tokeniser
    - Penn Treebank tokeniser
      - Standard defined by the LDC (Linguistic Data Consortium)
      - Separates out clitics
      - Separates all punctuation
      - Keeps hyphenated words together
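A minimal sketch of an inverted index built with a white space tokeniser follows. The class name `InvertedIndexSketch`, the integer document ids, and the sample documents are assumptions made for illustration. The Penn Treebank tokeniser is not reproduced here.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch, not the project's actual code.
class InvertedIndexSketch {
    // Maps each unique word to the sorted set of ids of the documents containing it.
    static Map<String, SortedSet<Integer>> build(Map<Integer, String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            // White space tokeniser: split on runs of whitespace.
            for (String token : doc.getValue().trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // an empty document yields one empty token
                }
                index.computeIfAbsent(token, t -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new TreeMap<>();
        docs.put(1, "search engine index");
        docs.put(2, "inverted index");
        // prints {engine=[1], index=[1, 2], inverted=[2], search=[1]}
        System.out.println(build(docs));
    }
}
```

Looking up a query word is then a single map access that returns the candidate documents directly, without scanning any document text.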
Theory (contd.)
- Sentence Segmentation
  - Detecting where a sentence ends
- Case Folding
  - Words that differ only in letter case are treated as the same word
- Removing stop words
  - Words that are not important for search
- Stemming
  - Reducing words to their root form
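The pre-processing steps can be sketched as a small pipeline of case folding, stop-word removal, and stemming. Everything named here is an assumption for illustration: the class `PreprocessSketch`, the very short stop-word list, and the naive suffix stripper, which merely stands in for a real stemmer such as Porter's.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not the project's actual code.
class PreprocessSketch {
    // Tiny illustrative stop-word list; a real system would use a much longer one.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "in", "is", "of", "the", "to"));

    // Naive suffix stripping (assumption): not a real stemming algorithm.
    static String stem(String word) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            // Keep at least three characters so short words are left alone.
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    static List<String> preprocess(String text) {
        List<String> terms = new ArrayList<>();
        // Case folding first, then split on anything that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            terms.add(stem(token));
        }
        return terms;
    }
}
```

For example, `PreprocessSketch.preprocess("The Engine is Searching indexed documents")` returns `[engine, search, index, document]`. The same pipeline has to be applied to queries so that query terms match the indexed terms.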
Theory (contd.)
- XML (Extensible Markup Language)
  - Used for exchanging data
  - XML tags identify the data and are used to store and organise it
- SAX Parser
  - Used to parse XML documents
  - A separate package is available in Java
  - Reads the document as a stream instead of loading it completely into memory
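A minimal sketch of stream-wise parsing with the SAX classes from the Java standard library follows. It assumes the page titles sit in `<title>` elements. The class `SaxTitleSketch` and the method `titles` are hypothetical names, and the input is a small string rather than a real dump file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; the <title> element name is an assumption about the input.
class SaxTitleSketch {
    // Collects the text of every <title> element as the parser streams through the input.
    static List<String> titles(String xml) {
        final List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inTitle = false;
            private final StringBuilder buffer = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (qName.equals("title")) {
                    inTitle = true;
                    buffer.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element, so append rather than assign.
                if (inTitle) {
                    buffer.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("title")) {
                    inTitle = false;
                    found.add(buffer.toString().trim());
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return found;
    }
}
```

Because the handler only keeps the text of the element it is currently inside, memory use stays small however large the XML input is, which is why a streaming parser suits a dump of millions of pages.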
High Level Design
2. Create Wikipedia object
   - Title
   - InfoBox (summary)
   - External Links
   - Text Contents
   - Split text into tokens
   - Remove stop words
   - Stem the words
   - Maintain a count of where each word occurred (maintain a HashMap)
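One possible reading of "maintain a count of where each word occurred" is a map from each term to its occurrence count per page field. The sketch below follows that reading. The class `FieldCountSketch`, the `Field` enum, and the nested-map layout are assumptions, not the project's actual data structure.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of per-field term counting for a single page.
class FieldCountSketch {
    // Fields named after the parts of the Wikipedia object listed above.
    enum Field { TITLE, INFOBOX, EXTERNAL_LINKS, TEXT }

    // Input: already tokenised, stop-word-filtered, stemmed terms per field.
    // Output: term -> (field -> number of occurrences in that field).
    static Map<String, Map<Field, Integer>> count(Map<Field, List<String>> termsByField) {
        Map<String, Map<Field, Integer>> counts = new HashMap<>();
        for (Map.Entry<Field, List<String>> entry : termsByField.entrySet()) {
            for (String term : entry.getValue()) {
                counts.computeIfAbsent(term, t -> new EnumMap<Field, Integer>(Field.class))
                      .merge(entry.getKey(), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Keeping the counts per field leaves room to weight a match in the title more heavily than a match in the body text when documents are ranked.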
High Level Design (contd.)
3. Inverted Index
   - Key (word) -> doc-list (set of all documents)
   - Doc-list entry: docids-setbit:tf(weight)
   - Ex: abc
   - Term Frequency: a numerical statistic intended to reflect how important a word is to a document in a collection (or corpus)
   - Weight: 1 + log(total weight)
4. Merge sub-index files
   - Using external sorting
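The log-scaled weight and the merge step can be sketched as follows. Several points are assumptions: "total weight" is read as the raw occurrence count, the logarithm is taken as natural, and each sub-index is modelled as an in-memory list of `term:postings` lines sorted by term instead of a file on disk. `IndexMergeSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the "term:postings" line format is an assumption.
class IndexMergeSketch {
    // Log-scaled term frequency: 1 + log(count), or 0 when the term is absent.
    static double tfWeight(int count) {
        return count > 0 ? 1.0 + Math.log(count) : 0.0;
    }

    // Two-way merge of sub-indexes whose lines are already sorted by term.
    // Only the current line of each input is inspected, as in an external merge.
    static List<String> merge(List<String> left, List<String> right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            String[] l = left.get(i).split(":", 2);
            String[] r = right.get(j).split(":", 2);
            int cmp = l[0].compareTo(r[0]);
            if (cmp < 0) {
                out.add(left.get(i++));
            } else if (cmp > 0) {
                out.add(right.get(j++));
            } else {
                // Same term in both inputs: concatenate the posting lists.
                out.add(l[0] + ":" + l[1] + "," + r[1]);
                i++;
                j++;
            }
        }
        while (i < left.size()) {
            out.add(left.get(i++));
        }
        while (j < right.size()) {
            out.add(right.get(j++));
        }
        return out;
    }
}
```

Merging `["apple:1,2", "cat:3"]` with `["apple:7", "dog:8"]` gives `["apple:1,2,7", "cat:3", "dog:8"]`. With files, the same loop would read one line at a time from each sorted sub-index, so the full index never has to fit in memory.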