Search Engine

Timeline

June 2024

Skills

Python, REST API, JavaScript, HTML, CSS

MapReduce, Distributed Systems, AWS, Networking, Concurrency

Overview

This project involved building a scalable search engine in the style of Google or Bing. I built it on a MapReduce framework for parallel data processing and a segmented inverted index for efficient text retrieval. The final product includes a REST API that returns search results and a user interface for entering queries and displaying results.

Process

MapReduce Framework Implementation

The project began with implementing a MapReduce framework in Python, inspired by Google's MapReduce paper. The framework executes MapReduce programs across multiple computers and has two main components: a Manager that distributes jobs and multiple Workers that perform map and reduce tasks. My goal was to understand MapReduce program execution, basic distributed systems, fault tolerance, and networking. To demonstrate the framework, I ran a Word Count job over a set of input files using multiple Worker instances.
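The Word Count job can be illustrated with a minimal single-process sketch. This is not the framework itself: there is no Manager, no Workers, and no networking here, and the names map_word_count, reduce_word_count, and run_job are hypothetical helpers chosen for illustration. It only shows the map, group, and reduce stages that a distributed framework would spread across machines.

```python
from collections import defaultdict


def map_word_count(text):
    """Map stage: emit a (word, 1) pair for every word in the input text."""
    for word in text.lower().split():
        yield word, 1


def reduce_word_count(word, counts):
    """Reduce stage: sum all counts emitted for a single word."""
    return word, sum(counts)


def run_job(inputs, map_fn, reduce_fn):
    """Run map, group, and reduce in one process (hypothetical helper).

    A real framework would hand each input to a Worker and move the
    intermediate pairs over the network; here everything stays in memory.
    """
    # Group stage: collect every value emitted for the same key.
    groups = defaultdict(list)
    for text in inputs:
        for key, value in map_fn(text):
            groups[key].append(value)
    # Reduce stage: one reduce call per key, in sorted key order.
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


inputs = ["hello world", "hello mapreduce"]
result = run_job(inputs, map_word_count, reduce_word_count)
print(result)
```

Keeping the map and reduce functions free of any framework details is what lets the same job run unchanged whether the stages execute in one process or on many Workers.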

Search Engine Development

Building on the MapReduce framework, this stage involved developing the search engine itself. It focused on information retrieval concepts such as text analysis (tf-idf) and link analysis (PageRank), and used MapReduce for parallel data processing. I created a segmented inverted index of web pages using a pipeline of MapReduce programs. I also developed an Index server with a REST API that returns search results in JSON format, and a Search server that provides a user interface modeled on Google or Bing.
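To make the indexing idea concrete, here is a small in-memory sketch of a segmented inverted index with tf-idf scoring. Several details are assumptions made for illustration and are not taken from the project: the function names (build_index, segment_index, search), the idf formula log10(N / df), splitting segments by doc_id modulo the segment count, and scoring by a plain sum of tf * idf. PageRank is left out, and the real index was produced by a pipeline of MapReduce programs rather than one function.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build term -> (idf, {doc_id: term frequency}) from doc_id -> text."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            postings[term][doc_id] = tf
    n_docs = len(docs)
    # Assumed idf variant: log10(N / number of documents containing the term).
    return {
        term: (math.log10(n_docs / len(posting)), posting)
        for term, posting in postings.items()
    }


def segment_index(index, num_segments):
    """Split the index so each segment holds only its own documents.

    Assumption: a document belongs to segment doc_id % num_segments.
    """
    segments = [{} for _ in range(num_segments)]
    for term, (idf, posting) in index.items():
        for doc_id, tf in posting.items():
            segment = segments[doc_id % num_segments]
            segment.setdefault(term, (idf, {}))[1][doc_id] = tf
    return segments


def search(segment, query):
    """Score documents in one segment by summing tf * idf over query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        if term in segment:
            idf, posting = segment[term]
            for doc_id, tf in posting.items():
                scores[doc_id] += tf * idf
    # Highest score first; ties broken by doc_id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = {1: "search engine", 2: "search index", 3: "map reduce"}
segments = segment_index(build_index(docs), num_segments=2)
hits = search(segments[1], "search engine")
print(hits)
```

Because idf is computed over the whole collection before the split, each segment can score its own documents independently, which is what would allow separate Index servers to each answer a query for their own slice of the documents.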

Outcomes

Through this project, I gained experience implementing distributed systems and parallel processing with MapReduce, and using Python for large-scale data processing. I also developed skills in building scalable web applications, including REST APIs and user interfaces. The project provided hands-on practice with information retrieval concepts, networking, concurrency, and cloud services for distributed computing. The final product is a scalable search engine consisting of a MapReduce framework, a segmented inverted index, a REST API for search results, and a user interface for entering queries and displaying results.