Text Search Engine


Published in: Computer Science

Final Year CSE project PPT

Mishal A / Noida

33 years of teaching experience

Qualification: B.Tech/B.E. (BIT MESRA - 2014)

Teaches: All Subjects, Chemistry, Computer Science, Mathematics, Physics

  1. TEXT SEARCH ENGINE
  2. AIM: To build a prototype of a search engine that works on millions of Wikipedia pages (which are in XML format) and retrieves the top 10 relevant Wikipedia documents that match the input.
  3. Introduction. "Search engine" is a misnomer: it has to be a "sorting engine". How to rank the documents and their titles? Exact-value search vs. full-text search: matching cannot be strict in a search engine. Relevance-based search. Machine learning approach.
  4. Theory Involved. Inverted Index: all unique words that appear in any document; for each word, a list of the documents in which it appears; various levels of indexing can be applied. Text Pre-processing. Tokenisation: white-space tokeniser; Penn Treebank tokeniser (standard defined by the LDC, the Linguistic Data Consortium), which separates out clitics, separates all punctuation, and keeps hyphenated words together.
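The difference between the two tokenisers on this slide can be sketched in a few lines of Python. This is only an illustration: the regular expression below is a simplified, hypothetical approximation of Treebank-style behaviour (clitics split off, punctuation separated, hyphenated words kept whole), not the full LDC standard, and the slides do not say which implementation the project used.

```python
import re

def whitespace_tokenise(text):
    """Split on runs of white space only; punctuation stays glued to words."""
    return text.split()

# Simplified Treebank-style pattern (illustrative, not the LDC standard):
#   1. a word stem that is directly followed by the clitic n't ("is" in "isn't")
#   2. the clitic n't itself
#   3. other clitics such as 's, 're, 've, 'll, 'd, 'm
#   4. ordinary words, which may contain internal hyphens ("full-text")
#   5. any remaining non-space character (punctuation) as its own token
_TOKEN = re.compile(
    r"\w+(?=n't)|n't|'(?:s|re|ve|ll|d|m)\b|\w+(?:-\w+)*|[^\w\s]",
    re.IGNORECASE,
)

def treebank_like_tokenise(text):
    """Return tokens with clitics and punctuation separated out."""
    return _TOKEN.findall(text)

sentence = "It's a full-text engine, isn't it?"
print(whitespace_tokenise(sentence))
print(treebank_like_tokenise(sentence))
```

The white-space tokeniser leaves `engine,` and `it?` as single tokens, which would create spurious index entries; the Treebank-like version yields `engine` and `,` separately while `full-text` stays intact.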
  5. Theory (contd.). Sentence Segmentation: detecting the ending of a sentence. Case Folding: treating words that differ only in letter case as the same word. Removing stop words: words that are not important with respect to search. Stemming: taking out the root words.
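The pre-processing steps on this slide might be chained roughly as follows. This is a minimal sketch under stated assumptions: the stop-word list is a tiny made-up sample, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer (the slides do not name the stemming algorithm used).

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}

def split_sentences(text):
    """Sentence segmentation: break after '.', '!' or '?' followed by space."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def crude_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Case-fold, tokenise, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # lower() is the case folding
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(split_sentences("Index the pages. Rank them!"))
print(preprocess("Searching the indexed documents"))
```

Stop words are removed before stemming so that the stop-word list can hold plain surface forms rather than stems.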
  6. Theory (contd.). XML (Extensible Markup Language): used for communication purposes; XML tags identify the data and are used to store and organise data. SAX Parser: used to parse XML documents; a separate package is available in Java; reads the document as a stream instead of loading it completely into memory.
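Stream-wise parsing can be illustrated with Python's standard `xml.sax` module (the slides refer to the Java SAX package; Python is used here only to keep the sketch short). The handler class, the `on_page` callback, and the choice of the `page`, `title`, and `text` elements are assumptions made for the example.

```python
import xml.sax

class PageHandler(xml.sax.ContentHandler):
    """Collects <title> and <text> for each <page> as the stream is read."""

    def __init__(self, on_page):
        super().__init__()
        self.on_page = on_page   # hypothetical callback, called once per page
        self._page = {}
        self._field = None       # element whose character data is being captured
        self._buf = []

    def startElement(self, name, attrs):
        if name == "page":
            self._page = {}
        elif name in ("title", "text"):
            self._field = name
            self._buf = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so buffer them.
        if self._field is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == self._field:
            self._page[name] = "".join(self._buf)
            self._field = None
        elif name == "page":
            self.on_page(self._page)  # hand the page on; nothing else is retained

sample = (b"<mediawiki><page><title>Search engine</title>"
          b"<revision><text>Ranks documents.</text></revision></page></mediawiki>")
pages = []
xml.sax.parseString(sample, PageHandler(pages.append))
print(pages)
```

Because each page is handed to the callback as soon as its closing tag arrives, memory use depends on the size of one page rather than on the size of the whole dump. For a file on disk, `xml.sax.parse(path, handler)` works the same way.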
  7. High Level Design. Step 2: Create Wikipedia object with Title, InfoBox (summary), External Links, and Text Contents. Split the text into tokens; remove stop words; stem the words; maintain a count of where each word occurred (maintain a HashMap).
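One possible shape for the page object and its per-word counts is sketched below. The class name `WikiPage`, its field names, and the `term_counts` helper are hypothetical; a Python dict plays the role of the HashMap, and stemming is left out to keep the example short.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class WikiPage:
    """Hypothetical container for the four parts listed on the slide."""
    doc_id: int
    title: str = ""
    infobox: str = ""
    external_links: list = field(default_factory=list)
    text: str = ""

def term_counts(page, stop_words=frozenset({"the", "a", "of", "is"})):
    """Map each term to a Counter recording in which fields it occurred."""
    fields = {
        "title": page.title,
        "infobox": page.infobox,
        "links": " ".join(page.external_links),
        "text": page.text,
    }
    counts = {}
    for field_name, content in fields.items():
        for token in content.lower().split():
            token = token.strip(".,;:!?()[]\"'")  # trim surrounding punctuation
            if token and token not in stop_words:
                counts.setdefault(token, Counter())[field_name] += 1
    return counts

page = WikiPage(1, title="Search engine", text="A search engine ranks the pages.")
counts = term_counts(page)
print(counts["search"])
```

Keeping the counts per field makes it possible to weight a title hit differently from a body hit later on.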
  8. High Level Design (contd.). Step 3: Inverted Index. Key (word) -> doc-list (set of all documents), stored as docids-setbit:tf(weight). Ex: abc. Term Frequency: a numerical statistic intended to reflect how important a word is to a document in a collection (or corpus), computed as 1 + log(total weight). Step 4: Merge sub-index files using external sorting.
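The index build and the merge step could look roughly like this. The sketch assumes that "total weight" is simply the raw count of the word in the document and that the logarithm is natural; the slides specify neither. It also ignores the setbit part of the posting format, and the sub-indexes are small in-memory lists, whereas a real external sort would stream sorted runs from files on disk. The function names are made up for the example.

```python
import heapq
import math
from collections import Counter

def build_index(docs):
    """docs: {doc_id: [tokens]} -> {word: [(doc_id, weight), ...]}."""
    index = {}
    for doc_id in sorted(docs):
        for word, tf in Counter(docs[doc_id]).items():
            weight = 1 + math.log(tf)  # assumption: natural log of the raw count
            index.setdefault(word, []).append((doc_id, weight))
    return index

def merge_sub_indexes(*sorted_runs):
    """K-way merge of runs of (word, postings) pairs, each sorted by word."""
    merged = []
    for word, postings in heapq.merge(*sorted_runs, key=lambda pair: pair[0]):
        if merged and merged[-1][0] == word:
            merged[-1][1].extend(postings)   # same word seen in another run
        else:
            merged.append((word, list(postings)))
    return merged

index = build_index({1: ["search", "engine", "search"], 2: ["engine"]})
run_a = [("engine", [(1, 1.0)]), ("search", [(1, 1.5)])]
run_b = [("engine", [(2, 1.0)]), ("index", [(2, 1.0)])]
print(merge_sub_indexes(run_a, run_b))
```

`heapq.merge` reads its inputs lazily, so the same loop works when each run is a generator over a sorted sub-index file, which is what keeps the merge within a fixed memory budget.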