NewsHound: Distributed RSS Indexing Engine

NewsHound is a high-performance C application that aggregates distributed RSS news feeds, parses live HTML content, and builds an efficient in-memory "Inverted Index" to enable real-time keyword search across hundreds of articles.

This project demonstrates core systems programming concepts, including network data ingestion, markup parsing, and the implementation of custom data structures: a HashSet for average-case O(1) keyword lookups and a Vector for per-keyword article lists.

Key Features

Automated Aggregation: Connects to remote servers to download and parse RSS 2.0 XML feeds, extracting article metadata and URLs.
Inverted Indexing: Constructs a mapping of {Keyword -> List of Articles} to allow for instant query resolution, similar to the architecture of large-scale search engines.
Relevance Ranking: Implements a frequency-based ranking algorithm (Term Frequency) to sort search results by relevance.
Noise Reduction: Uses a "Stop Word" filtering layer to strip out common function words (e.g., "the", "and") during indexing, reducing memory usage and improving search quality.
Deduplication: Detects and merges duplicate articles syndicated across multiple feeds to ensure result uniqueness.

Technical Architecture

1. The Ingestion Pipeline

The system connects to a list of provided RSS feed URLs. It utilizes a stream tokenizer to parse the incoming XML, identifying <item>, <title>, and <link> tags to extract potential articles.
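The extraction step can be sketched as follows. This is a minimal illustration that substitutes plain strstr searches for the project's stream tokenizer; the names extract_tag, next_item, and struct FeedItem are hypothetical, and the sketch assumes tags without attributes and no CDATA sections.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one feed entry; buffer sizes are arbitrary. */
struct FeedItem {
    char title[256];
    char link[512];
};

/* Copies the text between <tag> and </tag>, searching from `xml`, into `out`.
 * Returns a pointer just past the closing tag, or NULL if the element is
 * missing or its text does not fit in `out`. */
const char *extract_tag(const char *xml, const char *tag, char *out, size_t out_size)
{
    char open_tag[64], close_tag[64];
    assert(xml != NULL && tag != NULL && out != NULL);
    if (strlen(tag) + 4 > sizeof open_tag)
        return NULL;                      /* tag name too long for the buffers */
    snprintf(open_tag, sizeof open_tag, "<%s>", tag);
    snprintf(close_tag, sizeof close_tag, "</%s>", tag);

    const char *start = strstr(xml, open_tag);
    if (start == NULL)
        return NULL;
    start += strlen(open_tag);
    const char *end = strstr(start, close_tag);
    if (end == NULL)
        return NULL;

    size_t len = (size_t)(end - start);
    if (len >= out_size)
        return NULL;
    memcpy(out, start, len);
    out[len] = '\0';
    return end + strlen(close_tag);
}

/* Fills `item` from the first complete <item> element at or after `xml`.
 * Returns a pointer just past </item>, or NULL when no usable item remains. */
const char *next_item(const char *xml, struct FeedItem *item)
{
    const char *start = strstr(xml, "<item>");
    if (start == NULL)
        return NULL;
    const char *end = strstr(start, "</item>");
    if (end == NULL)
        return NULL;

    /* Both fields must close before </item>, otherwise they belong elsewhere. */
    const char *after_title = extract_tag(start, "title", item->title, sizeof item->title);
    const char *after_link = extract_tag(start, "link", item->link, sizeof item->link);
    if (after_title == NULL || after_title > end || after_link == NULL || after_link > end)
        return NULL;
    return end + strlen("</item>");
}
```

Calling next_item in a loop, feeding each returned pointer back in as the new starting position, walks every item in a downloaded feed. The channel-level <title> that precedes the first <item> is skipped because the search starts at the item boundary.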

2. The Indexing Engine

Once an article is identified, the engine downloads the raw HTML and processes it in four steps:

Tokenization: HTML tags are stripped, and the text is broken into individual tokens.
Normalization: Tokens are case-normalized and checked against a StopWord HashSet.
Hashing: Valid keywords are hashed and stored in a dynamic HashSet.
Vector Storage: Each keyword points to a Vector of Article structs, tracking the frequency of the word in that specific document.
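The tokenization and normalization steps above might look roughly like the sketch below. It is an illustration under stated assumptions, not the project's code: the stop list is a tiny hard-coded array standing in for the StopWord HashSet, and index_tokens, log_token, and struct TokenLog are hypothetical names. Tokens are runs of ASCII letters and digits, and HTML entities such as &amp; are not decoded.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Tiny illustrative stop list; a linear scan stands in for a HashSet lookup. */
static const char *const STOP_WORDS[] = { "a", "an", "and", "of", "the", "to" };

static int is_stop_word(const char *word)
{
    size_t n = sizeof STOP_WORDS / sizeof STOP_WORDS[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(word, STOP_WORDS[i]) == 0)
            return 1;
    return 0;
}

typedef void (*token_fn)(const char *token, void *ctx);

/* Walks `html`, ignores everything between '<' and '>', lowercases each run
 * of alphanumeric characters, and passes every non-stop-word to `emit`.
 * Tokens longer than 63 characters are truncated. Returns the number of
 * tokens emitted. */
size_t index_tokens(const char *html, token_fn emit, void *ctx)
{
    char token[64];
    size_t len = 0, emitted = 0;
    int in_tag = 0;
    assert(html != NULL && emit != NULL);

    for (const char *p = html; ; p++) {
        unsigned char c = (unsigned char)*p;
        if (!in_tag && isalnum(c)) {
            if (len < sizeof token - 1)
                token[len++] = (char)tolower(c);
            continue;
        }
        /* Any other character ends the token collected so far. */
        if (len > 0) {
            token[len] = '\0';
            if (!is_stop_word(token)) {
                emit(token, ctx);
                emitted++;
            }
            len = 0;
        }
        if (c == '\0')
            break;
        if (c == '<')
            in_tag = 1;
        else if (c == '>')
            in_tag = 0;
    }
    return emitted;
}

/* Example callback: appends each token plus a space to a fixed buffer. */
struct TokenLog {
    char text[256];
};

void log_token(const char *token, void *ctx)
{
    struct TokenLog *log = ctx;
    size_t used = strlen(log->text);
    size_t need = strlen(token) + 1;      /* token plus separating space */
    if (used + need < sizeof log->text) {
        memcpy(log->text + used, token, need - 1);
        log->text[used + need - 1] = ' ';
        log->text[used + need] = '\0';
    }
}
```

In a real indexer the callback would look the token up in the keyword table and increment the frequency counter for the current article, rather than logging it.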

3. The Query Interface

Users can perform boolean queries against the index. The system hashes each search term, retrieves the associated Vector of articles, and sorts them by descending frequency count before displaying the results.
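A single-term lookup followed by ranking can be sketched as below. The layout is an assumption made for illustration: chained buckets stand in for the project's HashSet, a fixed-size array stands in for its Vector, and every name (struct Index, index_add, index_query, and so on) is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NUM_BUCKETS 64
#define MAX_POSTINGS 8

/* One article in which a keyword occurs, with its occurrence count. */
struct Posting {
    const char *url;
    int frequency;
};

/* One keyword and the articles that contain it. */
struct IndexEntry {
    const char *keyword;
    struct Posting postings[MAX_POSTINGS];
    size_t count;
    struct IndexEntry *next;              /* chain of colliding keywords */
};

struct Index {
    struct IndexEntry *buckets[NUM_BUCKETS];
};

/* Simple multiplicative string hash (djb2 style). */
static unsigned long hash_keyword(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Links a caller-owned entry into its bucket. */
void index_add(struct Index *index, struct IndexEntry *entry)
{
    assert(index != NULL && entry != NULL);
    size_t b = hash_keyword(entry->keyword) % NUM_BUCKETS;
    entry->next = index->buckets[b];
    index->buckets[b] = entry;
}

/* Returns the entry for `keyword`, or NULL if it was never indexed. */
struct IndexEntry *index_find(const struct Index *index, const char *keyword)
{
    size_t b = hash_keyword(keyword) % NUM_BUCKETS;
    for (struct IndexEntry *e = index->buckets[b]; e != NULL; e = e->next)
        if (strcmp(e->keyword, keyword) == 0)
            return e;
    return NULL;
}

/* qsort comparator: higher frequency sorts first. */
static int by_frequency_desc(const void *a, const void *b)
{
    const struct Posting *pa = a, *pb = b;
    return (pb->frequency > pa->frequency) - (pb->frequency < pa->frequency);
}

/* Looks up `keyword` and ranks its postings in place, most frequent first. */
struct IndexEntry *index_query(struct Index *index, const char *keyword)
{
    struct IndexEntry *e = index_find(index, keyword);
    if (e != NULL && e->count > 1)
        qsort(e->postings, e->count, sizeof e->postings[0], by_frequency_desc);
    return e;
}
```

The comparator uses two comparisons instead of subtracting the counts, which avoids integer overflow for extreme values. Articles with equal counts may appear in any order, since qsort is not guaranteed to be stable.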

Challenges & Solutions

Memory Management

Indexing the web is memory-intensive. The system implements rigorous dynamic memory management, ensuring that thousands of string allocations for article titles, URLs, and keywords are properly freed upon shutdown.
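One common way to keep such allocations manageable is to pair every constructor with a matching destructor that owns all of the record's strings. The sketch below shows that pattern under assumed names (struct Article, article_new, article_free); the project's actual structs and cleanup functions may differ.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical article record that owns heap copies of its strings. */
struct Article {
    char *title;
    char *url;
};

/* Returns a heap-allocated copy of `s`, or NULL if allocation fails. */
static char *copy_string(const char *s)
{
    size_t size = strlen(s) + 1;
    char *copy = malloc(size);
    if (copy != NULL)
        memcpy(copy, s, size);
    return copy;
}

/* Releases an article and both strings it owns. Safe to call with NULL. */
void article_free(struct Article *article)
{
    if (article == NULL)
        return;
    free(article->title);
    free(article->url);
    free(article);
}

/* Allocates an article with private copies of `title` and `url`.
 * Returns NULL, leaking nothing, if any allocation fails. */
struct Article *article_new(const char *title, const char *url)
{
    assert(title != NULL && url != NULL);
    struct Article *article = malloc(sizeof *article);
    if (article == NULL)
        return NULL;
    article->title = copy_string(title);
    article->url = copy_string(url);
    if (article->title == NULL || article->url == NULL) {
        article_free(article);            /* free(NULL) is a no-op */
        return NULL;
    }
    return article;
}
```

At shutdown the indexer would call the destructor once per stored article. A leak checker such as Valgrind (for example, valgrind --leak-check=full ./rss-search) can confirm that nothing remains allocated.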

Data Deduplication

News wires often cross-post stories. The engine's comparator flags two articles as duplicates if they share the same URL, or the same title and the same server of origin, preventing index pollution.
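The duplicate rule can be written as a small predicate. This is a sketch of the rule as described here, with a hypothetical struct ArticleKey; it compares strings exactly and assumes the server field already holds the host portion of the article URL.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical identity fields of an article used for duplicate detection. */
struct ArticleKey {
    const char *url;
    const char *title;
    const char *server;                   /* host the article was served from */
};

/* Returns 1 if the two articles count as the same story: either the URLs
 * match exactly, or both the title and the server match. Returns 0 otherwise. */
int is_duplicate(const struct ArticleKey *a, const struct ArticleKey *b)
{
    assert(a != NULL && b != NULL);
    if (strcmp(a->url, b->url) == 0)
        return 1;
    return strcmp(a->title, b->title) == 0 && strcmp(a->server, b->server) == 0;
}
```

Under this rule, the same headline carried by two different servers is kept as two separate articles, while the same headline reposted under a new path on one server is merged.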

Installation & Usage

Prerequisites

GCC Compiler
Standard C Libraries

Build

make

Usage

Run the aggregator with the provided database of feeds:

./rss-search

> Welcome to NewsHound. Indexing feeds...
> Indexing complete. 1400 articles indexed.
> Enter search term: "Linux"

Project Structure

├── src/
│   ├── rss-search.c       # Main entry point and orchestration
│   ├── html-parser.c      # Tokenization logic
│   └── indexer.c          # HashSet and Vector implementation
├── data/
│   └── feeds.txt          # List of RSS streams
├── Makefile
└── README.md

Author: Luka Aladashvili
