Extracting Knowledge from Hadith Using Named Entity Recognition and LLMs

Published: 23 October 2024
Authors: Mohammad Galib Shams, Nabil Mosharraf Hossain, Riasat Islam
Introduction
At Greentech Apps Foundation (GTAF), we are always looking for new ways to enhance user engagement with Islamic content. In our latest research project, we focused on automating the extraction of structured information from hadith using Named Entity Recognition (NER) powered by Large Language Models (LLMs). The aim is to convert rich hadith texts into structured insights that can be searched, categorized, and explored contextually.
Problem Statement
We wanted to extract key semantic elements from each hadith, such as:
- Emotions conveyed by the hadith
- Sayings of the Prophet Muhammad (peace be upon him)
- Narrators in the chain
- Tags and key concepts
- Contextual topics
- Entities like people, time, tribes, and locations
- Prophet’s actions
- Possible questions a user might ask that would lead to this hadith
This structured metadata would allow us to build smarter search, recommendation, and educational tools for hadith-based learning.
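As a rough illustration, the elements listed above could be held in a single record per hadith. The sketch below is only an assumption about how such a record might look: the class name `HadithAnnotation` and all field names are hypothetical and are not taken from the actual pipeline.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class HadithAnnotation:
    """Hypothetical container for the metadata extracted from one hadith."""
    emotion: str = "Neutral"
    sayings: list = field(default_factory=list)     # sayings of the Prophet
    narrators: list = field(default_factory=list)   # narrators in the chain
    tags: list = field(default_factory=list)        # tags and key concepts
    topics: list = field(default_factory=list)      # contextual topics
    people: list = field(default_factory=list)
    times: list = field(default_factory=list)
    tribes: list = field(default_factory=list)
    locations: list = field(default_factory=list)
    actions: list = field(default_factory=list)     # the Prophet's actions
    questions: list = field(default_factory=list)   # questions leading to this hadith


# Placeholder values only; no real hadith content is shown here.
record = HadithAnnotation(emotion="Instruction", tags=["example-tag"])
print(json.dumps(asdict(record), ensure_ascii=False, indent=2))
```

A flat record like this maps directly onto one CSV row or one search-index document, which is the kind of downstream use described above.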
Our Approach
We developed a Python pipeline leveraging a powerful LLM (Meta’s Llama-3 70B Instruct) to extract this information from raw hadith text. Here’s how it worked:
Prompt Engineering
We crafted a custom prompt template that instructs the model to:
- Categorize the emotion as one of Instruction, Motivation, Comforting, Warning, or Neutral
- Extract Prophet Muhammad’s sayings
- Identify all narrators in the chain
- Isolate time, location, people (excluding the Prophet), and tribe entities
- List the Prophet’s actions in the hadith
- Suggest appropriate tags (based only on the hadith text)
- Generate topics, concepts, and questions to understand and apply the hadith
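A template of this kind might be built along the following lines. This is a minimal sketch, not the actual prompt: the wording, the JSON key names, and the `build_prompt` helper are all assumptions made for illustration. Only the five emotion labels come from the list above.

```python
# The five emotion categories named in the post.
EMOTIONS = ["Instruction", "Motivation", "Comforting", "Warning", "Neutral"]

# Hypothetical template text; key names are illustrative assumptions.
PROMPT_TEMPLATE = """You are annotating a hadith. Use only the text provided.
Return a single JSON object with these keys:
- "emotion": one of: {emotions}
- "sayings": list of the Prophet's sayings quoted in the text
- "narrators": list of all narrators in the chain
- "times", "locations", "people", "tribes": lists of entities
  (do not include the Prophet under "people")
- "actions": list of the Prophet's actions described in the text
- "tags": list of tags based only on the hadith text
- "topics", "concepts", "questions": lists that help a reader
  understand and apply the hadith
Return JSON only, with no commentary.

Hadith text:
<<<
{hadith}
>>>
"""


def build_prompt(hadith_text: str) -> str:
    """Fill the template with one hadith's text."""
    return PROMPT_TEMPLATE.format(
        emotions=", ".join(EMOTIONS),
        hadith=hadith_text.strip(),
    )


print(build_prompt("(hadith text goes here)"))
```

Listing the allowed emotion labels explicitly and asking for JSON only are common ways to keep the response easy to parse, though they do not guarantee well-formed output.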
Processing Pipeline
- Load & Clean Hadith Text: We preprocess each hadith to remove inconsistent characters and formatting issues.
- Send Request to LLM: We construct a JSON prompt using our template and send it to our Azure-hosted Llama-3 70B endpoint.
- Parse Response: The LLM returns a structured JSON object with all extracted elements.
- Save & Log: Results are saved to CSV, and new log entries are prepended to the log file by a custom reverse log file handler, which improves traceability.
- Error Handling: If the LLM fails to return valid JSON or the request hits a rate limit, the exception is logged and the request is retried up to 3 times.
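The clean, send, parse, and retry steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production code: `call_llm` is a hypothetical callable standing in for the Azure-hosted endpoint, the cleaning rules (Unicode NFC plus whitespace collapsing) and the exponential backoff are guesses, and "retried up to 3 times" is read here as one initial attempt plus up to three retries.

```python
import json
import logging
import time
import unicodedata
from typing import Callable, Optional

logger = logging.getLogger("hadith_pipeline")


def clean_text(text: str) -> str:
    """Assumed cleaning: normalise Unicode and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def extract_with_retry(
    hadith_text: str,
    call_llm: Callable[[str], str],   # hypothetical: takes a prompt, returns raw text
    retries: int = 3,
    backoff_seconds: float = 1.0,
) -> Optional[dict]:
    """Call the model and parse its JSON, retrying on any failure."""
    prompt = clean_text(hadith_text)
    for attempt in range(retries + 1):          # initial attempt + retries
        try:
            raw = call_llm(prompt)
            result = json.loads(raw)            # raises ValueError on bad JSON
            if not isinstance(result, dict):
                raise ValueError("expected a JSON object")
            return result
        except Exception as exc:                # broad on purpose: rate-limit
            # errors depend on the client library, which is not specified here
            logger.warning("attempt %d failed: %s", attempt + 1, exc)
            if attempt < retries:
                time.sleep(backoff_seconds * 2 ** attempt)
    return None                                 # caller decides how to record failure
```

Passing the endpoint call in as a function keeps the retry logic testable without network access; a stub that returns canned strings is enough to exercise both the success and the failure paths.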
Evaluation
We evaluated the system on a CSV of hadiths (hadith_collection_new_db.csv).
- Output was saved in llama_70B_Al-hadith_collection_new_db.csv
- Accuracy in trials was consistently above 90%
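For readers who want to run a similar check, one simple way to score extracted fields is exact match against hand-labelled reference annotations. The sketch below is an assumption for illustration only and is not necessarily the scoring method behind the figure above. The function names and the toy records are hypothetical.

```python
def normalise(value):
    """Lower-case and trim strings; sort lists so element order is ignored."""
    if isinstance(value, list):
        return sorted(str(item).strip().lower() for item in value)
    return "" if value is None else str(value).strip().lower()


def field_accuracy(predictions, references, field):
    """Share of records whose predicted `field` matches the reference exactly."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(
        normalise(pred.get(field)) == normalise(ref.get(field))
        for pred, ref in zip(predictions, references)
    )
    return matches / len(references)


# Toy records, not real results.
predicted = [{"emotion": "Warning"}, {"emotion": "Neutral"}]
reference = [{"emotion": "warning"}, {"emotion": "Instruction"}]
print(field_accuracy(predicted, reference, "emotion"))
```

Exact match is strict for free-text fields such as sayings or questions, so a looser comparison may be more appropriate there.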
Key Innovations
- Named Entity Extraction: This goes beyond simple keyword tagging by identifying specific relationships, people, tribes, and settings.
- Prophet’s Sayings & Actions: Our model isolates the direct words and deeds of the Prophet (PBUH), a valuable tool for both educators and learners.
- Emotion Classification: Adds a layer of affective tagging useful in UX and search design.
- Dynamic Questions: Suggested questions improve how we present hadith to users, especially for AI chatbots or quiz tools.
Challenges
- Parsing reference-only hadiths
- Handling null/irrelevant translations
- Maintaining output JSON structure from LLM responses
- Entity disambiguation (e.g., differentiating a tribe vs. a location)
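The JSON-structure challenge is a common one: models often wrap the object in extra commentary or omit keys. A defensive parser along the following lines is one possible mitigation. It is a sketch under assumptions: the key names, the defaults, and the fallback to "Neutral" for unrecognised emotion labels are illustrative choices, not the behaviour of the actual pipeline.

```python
import json

EMOTIONS = {"Instruction", "Motivation", "Comforting", "Warning", "Neutral"}
LIST_KEYS = ("sayings", "narrators", "tags")   # hypothetical subset of keys


def extract_json_object(raw: str) -> dict:
    """Return the first JSON object embedded in `raw`, ignoring surrounding text."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(raw, start)
            return obj                 # decoding from "{" always yields a dict
        except json.JSONDecodeError:
            start = raw.find("{", start + 1)
    raise ValueError("no JSON object found in model response")


def normalise_annotation(obj: dict) -> dict:
    """Fill missing keys and coerce values into the expected shapes."""
    emotion = obj.get("emotion")
    cleaned = {"emotion": emotion if emotion in EMOTIONS else "Neutral"}
    for key in LIST_KEYS:
        value = obj.get(key, [])
        if isinstance(value, str):     # a lone string becomes a one-item list
            value = [value]
        cleaned[key] = list(value) if isinstance(value, (list, tuple)) else []
    return cleaned


raw = 'Sure! {"emotion": "Warning", "tags": "example-tag"} Let me know.'
print(normalise_annotation(extract_json_object(raw)))
```

Silently defaulting an unknown label can hide model errors, so in practice it may be preferable to log such cases or send them back through the retry path.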
What’s Next?
- Validate across larger hadith datasets
- Train smaller, fine-tuned models for offline inference
- Integrate this into our Hadith app’s semantic search and recommendations
- Enable feedback loops from users to improve tagging accuracy
Conclusion
This project represents a step forward in structuring Islamic knowledge for modern interfaces. By combining LLMs, prompt engineering, and NER, we are unlocking new ways for Muslims around the world to learn from and engage with Hadith in a meaningful and personalised way.
Want to contribute to our Islamic AI efforts? Reach out at https://gtaf.org.