Search Engine
Timeline
June 2024
Skills
Python, REST API, JavaScript, HTML, CSS
MapReduce, Distributed Systems, AWS, Networking, Concurrency
Overview
I built a scalable search engine modeled on Google and Bing. It uses a MapReduce framework for parallel data processing and a segmented inverted index for efficient text retrieval. The final product includes a REST API that returns search results and a user interface for entering queries and displaying results.
Process
MapReduce Framework Implementation
The project began with implementing a MapReduce framework in Python, inspired by Google's MapReduce paper. The framework executes MapReduce programs across multiple computers and has two main components: a Manager that distributes jobs, and multiple Workers that perform the map and reduce tasks. My goal was to understand MapReduce program execution, basic distributed systems, fault tolerance, and networking. To demonstrate the framework, I ran a Word Count job over a set of input files across several Worker instances.
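As a rough illustration of what a Word Count job computes, the sketch below runs the map, partition, and reduce stages in a single process. It is not the project's code: the function names (map_words, reduce_counts, run_word_count) and the crc32-based partitioning are assumptions chosen for illustration, and the Manager, Workers, networking, and fault tolerance are left out entirely.

```python
import zlib
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """Map stage: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce stage: sum all counts emitted for one word."""
    return word, sum(counts)


def run_word_count(lines, num_reducers=2):
    """Run map, partition, sort, and reduce in one process (illustrative only)."""
    # Map every line and route each pair to a partition by hashing its key.
    # crc32 is an assumed choice; it keeps the partitioning deterministic.
    partitions = defaultdict(list)
    for line in lines:
        for word, count in map_words(line):
            part = zlib.crc32(word.encode("utf-8")) % num_reducers
            partitions[part].append((word, count))

    # Each partition stands in for the input of one reduce task:
    # sort by key, group equal keys, and reduce each group.
    results = {}
    for pairs in partitions.values():
        pairs.sort(key=itemgetter(0))
        for word, group in groupby(pairs, key=itemgetter(0)):
            key, total = reduce_counts(word, (c for _, c in group))
            results[key] = total
    return results


if __name__ == "__main__":
    print(run_word_count(["the quick brown fox", "the lazy dog", "The fox"]))
```

In a distributed run, each partition would be handled by a separate reduce task on a Worker; here the loop over partitions plays that role.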
Search Engine Development
Building on the MapReduce framework, the second stage was the search engine itself. It draws on information retrieval concepts such as text analysis (tf-idf) and link analysis (PageRank), and uses MapReduce for parallel data processing. I created a segmented inverted index of web pages using a pipeline of MapReduce programs. I also developed an Index server with a REST API that returns search results in JSON format, and a Search server that provides a user interface similar to Google or Bing.
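To make the inverted index and tf-idf ideas concrete, here is a minimal in-memory sketch. It rests on several assumptions: the index is a single dictionary rather than a segmented one built by a MapReduce pipeline, idf is taken as log10(N / n_t), a document's score is the sum of tf * idf over the query terms, and PageRank is ignored. The function names build_inverted_index and search are hypothetical, and the project's actual weighting may differ.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Build term -> {"idf": float, "postings": {doc_id: term frequency}}.

    docs maps a document id to its text. Tokenization is a plain
    lowercase whitespace split, which is a simplifying assumption.
    """
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            postings[term][doc_id] = tf

    num_docs = len(docs)
    return {
        term: {"idf": math.log10(num_docs / len(docs_with_term)),
               "postings": docs_with_term}
        for term, docs_with_term in postings.items()
    }


def search(index, query):
    """Return (doc_id, score) pairs sorted by descending tf-idf score."""
    scores = defaultdict(float)
    for term in query.lower().split():
        entry = index.get(term)
        if entry is None:
            continue  # term does not occur in any document
        for doc_id, tf in entry["postings"].items():
            scores[doc_id] += tf * entry["idf"]
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    docs = {
        1: "mapreduce word count",
        2: "search engine index",
        3: "search engine mapreduce search",
    }
    index = build_inverted_index(docs)
    print(search(index, "search engine"))
```

Looking up postings by term is what makes retrieval efficient: a query only touches the documents that contain at least one of its terms, not the whole collection.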
Outcomes
Throughout the project, I gained experience implementing distributed systems and parallel processing with MapReduce, and using Python for large-scale data processing. I also developed skills in building scalable web applications, including REST APIs and user interfaces. The project gave me hands-on practice with information retrieval concepts, networking, concurrency, and cloud services for distributed computing. The final product is a scalable search engine with a MapReduce framework, a segmented inverted index, a REST API for search results, and a user interface for entering queries and displaying results.