Akoma Ntoso Format Generation
Understanding Akoma Ntoso: The "Why"
Before diving into the "how," it's important to understand why Akoma Ntoso is the chosen format. Legal documents are more than just text; they have an intricate, hierarchical structure. Akoma Ntoso ("linked hearts" in the Akan language of Ghana) is an international standard designed to represent legal and legislative documents in a structured, machine-readable way.
By converting the BBMP Act into this format, we unlock several key capabilities:
Granular Version Tracking: Instead of just knowing that a file has changed, we can pinpoint changes to a specific section, sub-section, or even a single clause. This is crucial for tracking amendments with precision.
Semantic Understanding: The Akoma Ntoso structure provides semantic meaning to the text. A machine can understand the difference between a chapter, a section, and a preamble, which is impossible with plain text.
Interoperability: As a global standard, it allows for the exchange and comparison of legal documents across different systems and jurisdictions.
The Conversion Logic: From Raw Text to Structured JSON
The conversion process is orchestrated by the scripts in your bbmp_data_extractor directory, primarily extractor.js and extractor_gemini.js. The core of this process is leveraging a large language model (LLM) to act as an expert legal parser.
Here’s a step-by-step breakdown of the logic:
Step 1: Consolidating the Raw Text
First, the raw text, which is fragmented across multiple JSON objects (representing pages), needs to be combined into a single, coherent string for each chapter.
Logic: The getChapterContent function is responsible for this. It reads the JSON file for a specific chapter, iterates through the array of pages, and concatenates the markdown content from each page.
Code Snippet (extractor.js):
JavaScript
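// Assumes the promise-based fs API; the import style depends on the project's module setup.
const fs = require('fs').promises;

// Reads one chapter's JSON file (an array of page objects) and joins
// each page's markdown field into a single newline-separated string.
async function getChapterContent(filePath) {
  try {
    const data = await fs.readFile(filePath, 'utf-8');
    const pages = JSON.parse(data);
    const combinedMarkdown = pages.map(page => page.markdown).join('\n');
    return combinedMarkdown;
  } catch (error) {
    // Log with the file path for context, then rethrow so the caller can stop.
    console.error(`Error reading or parsing file at ${filePath}:`, error);
    throw error;
  }
}
Step 2: Instructing the AI with a System Prompt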
This is the most critical part of the process. We provide the LLM with a detailed set of instructions, or a "system prompt," that tells it exactly how to parse the text and what the output JSON structure should look like.
Logic: The systemPrompt in convertToAkomaNtoso is carefully crafted to:
Define the Role: It tells the AI to act as an "expert legal document parser."
Specify the Goal: It clearly states the objective is to create a hierarchical JSON structure based on Akoma Ntoso principles.
Provide a Schema: It gives a clear example of the desired JSON output, including the nesting of chapter, section, subsection, and clauses. This is crucial for ensuring the AI returns a consistent and predictable structure.
Set Constraints: It includes critical instructions like "do not change, rephrase, or manipulate any of the original legal text" to ensure the integrity of the legal document is maintained.
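As a rough illustration of how the four elements above could be assembled, here is a minimal sketch. It is not the prompt used in extractor_gemini.js: the buildSystemPrompt helper, the EXAMPLE_SCHEMA constant, and every field name inside it are assumptions made for this example. Only the role phrase and the quoted constraint come from the description above.

```javascript
// Hypothetical sketch: not the actual prompt from extractor_gemini.js.
// All field names in this example schema are assumptions for illustration.
const EXAMPLE_SCHEMA = {
  chapter: {
    number: 'I',
    title: 'Example chapter title',
    sections: [
      {
        number: '1',
        heading: 'Example section heading',
        subsections: [
          {
            number: '(1)',
            text: 'Verbatim subsection text.',
            clauses: [{ label: '(a)', text: 'Verbatim clause text.' }],
          },
        ],
      },
    ],
  },
};

// Builds the prompt from the four parts: role, goal, constraint, schema.
function buildSystemPrompt(schema = EXAMPLE_SCHEMA) {
  return [
    'You are an expert legal document parser.',
    'Convert the provided text into a hierarchical JSON structure based on Akoma Ntoso principles.',
    'Do not change, rephrase, or manipulate any of the original legal text.',
    'Return only JSON that follows this example structure:',
    JSON.stringify(schema, null, 2),
  ].join('\n');
}
```

Keeping the schema as a JavaScript object and serializing it into the prompt means the example the model sees is always valid JSON, which may reduce malformed replies.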
Code Snippet (extractor_gemini.js):
JavaScript
Step 3: Making the API Call and Receiving the JSON
With the raw text consolidated and the instructions defined, the script then makes an API call to the LLM.
Logic: The convertToAkomaNtoso function sends the system prompt and the raw text to the AI model. It specifies that the response should be a JSON object, which simplifies the parsing on our end.
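The request-and-parse flow described here might look roughly like the sketch below. It is not the code from extractor.js. The callModel parameter is a hypothetical stand-in for whichever LLM client the project uses, assumed to accept an object with system and user strings and to resolve to the model's reply as a string. The three-argument signature is also an assumption.

```javascript
// Hypothetical sketch, not the implementation in extractor.js.
// `callModel` is an assumed wrapper around the project's LLM client:
// it takes { system, user } and resolves to the reply text.
async function convertToAkomaNtoso(rawText, systemPrompt, callModel) {
  const reply = await callModel({ system: systemPrompt, user: rawText });

  let parsed;
  try {
    parsed = JSON.parse(reply);
  } catch (error) {
    throw new Error(`Model reply was not valid JSON: ${error.message}`);
  }

  // A JSON object is expected, so reject null, arrays, and primitives.
  if (parsed === null || typeof parsed !== 'object' || Array.isArray(parsed)) {
    throw new Error('Model reply was valid JSON but not an object');
  }
  return parsed;
}

// Usage with a stub in place of a real API client.
const stubModel = async ({ user }) =>
  JSON.stringify({ chapter: { sections: [], sourceLength: user.length } });

convertToAkomaNtoso('1. Short title.', 'You are a parser.', stubModel)
  .then((result) => console.log(result.chapter));
```

Even when the API is asked for a JSON response, validating the reply before saving it guards against truncated or unexpected output.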
Code Snippet (extractor.js):
JavaScript
Step 4: Storing the Structured Output
Finally, the structured JSON returned by the AI is saved to a file.
Logic: The main function orchestrates the entire process: it calls getChapterContent to get the text, passes it to convertToAkomaNtoso, and then writes the resulting structured JSON to a new file in the bbmp_data_extractor/akomo-ntoso/ directory.
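The read, convert, and write sequence can be sketched as follows. This is an illustration under stated assumptions, not the project's main function: processChapter and its option names are hypothetical, and the two extractor functions are passed in as getContent and convert so the flow can be exercised without an API key.

```javascript
const fs = require('node:fs/promises');
const path = require('node:path');

// Hypothetical sketch of the orchestration step; names are assumptions.
// `getContent` plays the role of getChapterContent and `convert` the role
// of convertToAkomaNtoso.
async function processChapter({ inputPath, outputDir, outputName, getContent, convert }) {
  const rawText = await getContent(inputPath);
  const structured = await convert(rawText);

  // Create the output directory if it does not exist yet.
  await fs.mkdir(outputDir, { recursive: true });

  const outputPath = path.join(outputDir, outputName);
  await fs.writeFile(outputPath, JSON.stringify(structured, null, 2), 'utf-8');
  return outputPath;
}
```

In this project the call would presumably pass bbmp_data_extractor/akomo-ntoso as outputDir and a name such as chapter1_akoma_ntoso.json as outputName.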
Code Snippet (extractor.js):
JavaScript
The Result: A Deeply Structured Legal Document
The output of this process, as seen in files like chapter1_akoma_ntoso.json, is a rich, hierarchical representation of the BBMP Act.
Here is an example from your data, showing how a section with nested clauses is structured:
JSON
This structured format is what enables the powerful versioning and diffing features of your application, turning a static, hard-to-track document into a dynamic, transparent, and accessible legal text.
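To make the versioning idea concrete, here is a minimal sketch of a section-level comparison between two versions of a chapter. It is not the application's diffing code. It assumes each chapter object has a sections array whose items carry a stable number and a text field, and the diffSections name is hypothetical. The real output structure may differ.

```javascript
// Hypothetical sketch: assumes { sections: [{ number, text }, ...] }.
// Reports which sections were added, modified, or removed between versions.
function diffSections(oldChapter, newChapter) {
  const oldByNumber = new Map(oldChapter.sections.map((s) => [s.number, s]));
  const newByNumber = new Map(newChapter.sections.map((s) => [s.number, s]));
  const changes = [];

  for (const [number, section] of newByNumber) {
    const previous = oldByNumber.get(number);
    if (!previous) {
      changes.push({ number, type: 'added' });
    } else if (previous.text !== section.text) {
      changes.push({ number, type: 'modified' });
    }
  }
  for (const number of oldByNumber.keys()) {
    if (!newByNumber.has(number)) {
      changes.push({ number, type: 'removed' });
    }
  }
  return changes;
}
```

Because sections are matched by number rather than by position, an amendment to one section shows up as a single change instead of shifting every line after it, which is the advantage over diffing plain text.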