Akoma Ntoso Format Generation

Understanding Akoma Ntoso: The "Why"

Before diving into the "how," it's important to understand why Akoma Ntoso is the chosen format. Legal documents are more than just text; they have an intricate, hierarchical structure. Akoma Ntoso ("linked hearts" in the Akan language of Ghana) is an international standard designed to represent legal and legislative documents in a structured, machine-readable way.

By converting the BBMP Act into this format, we unlock several key capabilities:

Granular Version Tracking: Instead of just knowing that a file has changed, we can pinpoint changes to a specific section, sub-section, or even a single clause. This is crucial for tracking amendments with precision.
Semantic Understanding: The Akoma Ntoso structure attaches semantic meaning to the text. A machine can distinguish a chapter from a section or a preamble, a distinction that plain text does not encode explicitly.
Interoperability: As a global standard, it allows for the exchange and comparison of legal documents across different systems and jurisdictions.
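
As a rough illustration of granular version tracking, the sketch below compares two versions of a chapter section by section, keyed on each section's "@eId". It assumes the JSON shape shown later on this page (a chapter object holding a section array). The helper names indexSections and diffSections, and the sample data, are made up for this example and are not part of the project's scripts.

```javascript
// Hypothetical helpers (not part of extractor.js): compare two chapter
// objects section by section, using each section's "@eId" as a stable key.

function indexSections(chapter) {
  const index = new Map();
  for (const section of chapter.section ?? []) {
    // Serialize the whole section so edits in nested subsections or clauses are detected.
    index.set(section['@eId'], JSON.stringify(section));
  }
  return index;
}

function diffSections(oldChapter, newChapter) {
  const before = indexSections(oldChapter);
  const after = indexSections(newChapter);
  const result = { added: [], removed: [], changed: [] };

  for (const [eId, serialized] of after) {
    if (!before.has(eId)) result.added.push(eId);
    else if (before.get(eId) !== serialized) result.changed.push(eId);
  }
  for (const eId of before.keys()) {
    if (!after.has(eId)) result.removed.push(eId);
  }
  return result;
}

// Made-up sample data for illustration only.
const v1 = { section: [
  { '@eId': 'sec_1', heading: 'Short title.' },
  { '@eId': 'sec_2', heading: 'Old heading.' },
] };
const v2 = { section: [
  { '@eId': 'sec_1', heading: 'Short title.' },
  { '@eId': 'sec_2', heading: 'New heading.' },
  { '@eId': 'sec_3', heading: 'Definitions.' },
] };

console.log(diffSections(v1, v2));
```

Comparing serialized sections is sensitive to key order, so a production diff would likely compare field by field. The sketch only shows why stable identifiers make section-level change detection straightforward.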

The Conversion Logic: From Raw Text to Structured JSON

The conversion process is orchestrated by the scripts in your bbmp_data_extractor directory, primarily extractor.js and extractor_gemini.js. The core of this process is leveraging a large language model (LLM) to act as an expert legal parser.

Here’s a step-by-step breakdown of the logic:

Step 1: Consolidating the Raw Text

First, the raw text, which is fragmented across multiple JSON objects (representing pages), needs to be combined into a single, coherent string for each chapter.

Logic: The getChapterContent function is responsible for this. It reads the JSON file for a specific chapter, iterates through the array of pages, and concatenates the markdown content from each page.

Code Snippet (extractor.js):

JavaScript

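const fs = require('fs/promises'); // Assumed import, not shown in the original excerpt (ES modules: import fs from 'fs/promises')

// Reads one chapter file and returns the markdown of all its pages as a single string.
async function getChapterContent(filePath) {
    try {
        const data = await fs.readFile(filePath, 'utf-8');
        const pages = JSON.parse(data); // expected: an array of page objects

        // Join the pages in file order, with one newline between consecutive pages.
        const combinedMarkdown = pages.map(page => page.markdown).join('\n');
        return combinedMarkdown;
    } catch (error) {
        console.error(`Error reading or parsing file at ${filePath}:`, error);
        throw error;
    }
}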
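
The joining logic can also be sketched on its own, separate from file I/O, which makes it easy to try on sample data. The version below assumes each page object carries a markdown string, as the snippet above implies. The name combinePages and the index field in the sample pages are hypothetical and used only for illustration.

```javascript
// Hypothetical pure helper: the joining step of getChapterContent without file I/O.
function combinePages(pages) {
  if (!Array.isArray(pages)) {
    throw new TypeError('Expected an array of page objects');
  }
  return pages
    // Skip pages with no usable markdown so they do not leave empty lines in the result.
    .filter(page => page && typeof page.markdown === 'string')
    .map(page => page.markdown)
    .join('\n');
}

// Made-up sample pages; the "index" field is illustrative only.
const samplePages = [
  { index: 0, markdown: 'CHAPTER I' },
  { index: 1 },
  { index: 2, markdown: 'PRELIMINARY' },
];

console.log(combinePages(samplePages));
```

Whether pages without markdown should be skipped or treated as an error depends on how the upstream extraction behaves; skipping is just one reasonable choice.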

Step 2: Instructing the AI with a System Prompt

This is the most critical part of the process. We provide the LLM with a detailed set of instructions, or a "system prompt," that tells it exactly how to parse the text and what the output JSON structure should look like.

Logic: The systemPrompt in convertToAkomaNtoso is carefully crafted to:

Define the Role: It tells the AI to act as an "expert legal document parser."
Specify the Goal: It clearly states the objective is to create a hierarchical JSON structure based on Akoma Ntoso principles.
Provide a Schema: It gives a clear example of the desired JSON output, including the nesting of chapter, section, subsection, and clauses. This is crucial for ensuring the AI returns a consistent and predictable structure.
Set Constraints: It includes critical instructions like "do not change, rephrase, or manipulate any of the original legal text" to ensure the integrity of the legal document is maintained.

Code Snippet (extractor_gemini.js):

JavaScript

const systemPrompt = `
You are an expert legal document parser. Your task is to convert the provided raw text from a legislative act into a structured JSON format that follows Akoma Ntoso principles...

**Desired JSON Structure Example (with nesting):**
{
  "akomaNtoso": {
    "act": {
      "meta": { "... meta info ..." },
      "preamble": { "... preamble info ..." },
      "body": {
        "chapter": {
          "@eId": "ch_I",
          "num": "CHAPTER I",
          "heading": "PRELIMINARY",
          "section": [
            {
              "@eId": "sec_3",
              "num": "3.",
              "heading": "Definitions.",
              "content": [
                 {
                   "p": "In this Act, unless the context otherwise requires,-"
                 },
                 {
                  "subsection": {
                    "num": "(7)",
                    "content": {
                      "p": "'building' includes,-",
                      "clauses": [
                        { "num": "(a)", "content": "a house, out-house, stable..." },
                        { "num": "(b)", "content": "a structure on wheels..." }
                      ]
                    }
                  }
                 }
              ]
            }
          ]
        }
      }
    }
  }
}
`;
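
A prompt instruction such as "do not change, rephrase, or manipulate any of the original legal text" cannot be enforced by the prompt alone, so it may be worth checking after the fact. The following is a minimal sketch of one possible check, assuming text lives under the keys p, content, and heading as in the schema example. The functions collectText and findAlteredText are hypothetical and do not appear in the project's scripts.

```javascript
// Hypothetical check: collect every string stored under "p", "content", or
// "heading" and report those that do not occur verbatim in the source text.
const TEXT_KEYS = new Set(['p', 'content', 'heading']);

function collectText(node, out = []) {
  if (Array.isArray(node)) {
    node.forEach(item => collectText(item, out));
  } else if (node && typeof node === 'object') {
    for (const [key, value] of Object.entries(node)) {
      if (TEXT_KEYS.has(key) && typeof value === 'string') out.push(value);
      else collectText(value, out); // recurse into nested objects and arrays
    }
  }
  return out;
}

// Collapse whitespace so line breaks in the source do not cause false alarms.
const normalize = text => text.replace(/\s+/g, ' ').trim();

function findAlteredText(structured, rawText) {
  const source = normalize(rawText);
  return collectText(structured).filter(t => !source.includes(normalize(t)));
}

// Made-up sample: the second paragraph has been rephrased by the model.
const raw = '3. Definitions.\nIn this Act, unless the context\notherwise requires,-';
const structured = {
  section: [{
    heading: 'Definitions.',
    content: [
      { p: 'In this Act, unless the context otherwise requires,-' },
      { p: 'In this Act, unless context requires' },
    ],
  }],
};

console.log(findAlteredText(structured, raw));
```

An empty result means every extracted string was found in the source; any entries returned are candidates for manual review.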

Step 3: Making the API Call and Receiving the JSON

With the raw text consolidated and the instructions defined, the script then makes an API call to the LLM.

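Logic: The convertToAkomaNtoso function sends the system prompt and the raw text to the AI model. Setting response_format to { type: "json_object" } asks the model to return syntactically valid JSON, which simplifies parsing on our end. It does not by itself guarantee that the output matches the schema described in the system prompt.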

Code Snippet (extractor.js):

JavaScript

async function convertToAkomaNtoso(textContent) {
    // ... systemPrompt is defined here ...

    try {
        console.log("Sending content to OpenAI for conversion...");
        const response = await openai.chat.completions.create({
            model: "gpt-4-turbo",
            messages: [
                { role: "system", content: systemPrompt },
                { role: "user", content: `Here is the text to convert:\n\n${textContent}` }
            ],
            response_format: { type: "json_object" },
        });

        const jsonOutput = JSON.parse(response.choices[0].message.content);
        console.log("Successfully received and parsed structured JSON from OpenAI.");
        return jsonOutput;
    } catch (error) {
        console.error("Error during OpenAI API call:", error);
        throw error;
    }
}
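
Long chapters can exceed the model's output limit, in which case the returned JSON may be cut off. The sketch below shows one way to guard the parsing step, assuming the standard Chat Completions response shape (choices[0].finish_reason and choices[0].message.content). The helper parseModelJson and the mock responses are hypothetical, and the check for a top-level akomaNtoso key is an assumption taken from the schema example in the system prompt.

```javascript
// Hypothetical helper: validate a Chat Completions response before trusting its JSON.
function parseModelJson(response) {
  const choice = response?.choices?.[0];
  if (!choice) throw new Error('Response contained no choices');

  // finish_reason "length" means the output hit the token limit, so the JSON is likely cut off.
  if (choice.finish_reason === 'length') {
    throw new Error('Model output was truncated; consider splitting the chapter');
  }

  let parsed;
  try {
    parsed = JSON.parse(choice.message.content);
  } catch (err) {
    throw new Error(`Model returned invalid JSON: ${err.message}`);
  }

  // Assumed top-level key, taken from the schema example in the system prompt.
  if (!parsed?.akomaNtoso) throw new Error('Missing top-level "akomaNtoso" key');
  return parsed;
}

// Mock responses for illustration; no network call is made.
const okResponse = {
  choices: [{ finish_reason: 'stop', message: { content: '{"akomaNtoso": {"act": {}}}' } }],
};
const cutOffResponse = {
  choices: [{ finish_reason: 'length', message: { content: '{"akomaNtoso": {"act"' } }],
};

console.log(parseModelJson(okResponse));
```

In convertToAkomaNtoso, a helper like this could stand in for the bare JSON.parse call so that truncated or malformed output fails with a clearer message.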

Step 4: Storing the Structured Output

Finally, the structured JSON returned by the AI is saved to a file.

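Logic: The main function orchestrates the entire process: it calls getChapterContent to get the text, passes it to convertToAkomaNtoso, and then writes the resulting structured JSON to a new file. The project keeps these outputs in the bbmp_data_extractor/akomo-ntoso/ directory; note that the snippet below resolves its output path against the current working directory.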

Code Snippet (extractor.js):

JavaScript

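// Assumes fs (promise API) and path are imported at the top of extractor.js.
async function main() {
    try {
        // Step 1: read the chapter file and merge its pages.
        const chapterFilePath = path.resolve('bbmp_data_extractor/chapter1.json');
        const rawText = await getChapterContent(chapterFilePath);

        // Steps 2 and 3: send the text to the LLM and get structured JSON back.
        const structuredJson = await convertToAkomaNtoso(rawText);

        // Step 4: write pretty-printed JSON; this path resolves against the current working directory.
        const outputFilePath = path.resolve('chapter1_akoma_ntoso.json');
        await fs.writeFile(outputFilePath, JSON.stringify(structuredJson, null, 2));

        console.log(`Conversion successful. Structured JSON saved to: ${outputFilePath}`);
    } catch (error) {
        console.error("The process failed:", error);
        process.exitCode = 1; // signal failure to the calling shell
    }
}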
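
The main function above handles a single hard-coded chapter. One possible way to extend the same flow to several chapters is sketched below. Everything here is an assumption for illustration: the helpers outputPathFor and convertChapters, the chapter2.json file name, and the out directory are hypothetical. The conversion and save steps are passed in as functions so the loop can be tried without network or disk access.

```javascript
const path = require('node:path');

// Hypothetical helper: map an input chapter file to an output path in a chosen directory.
function outputPathFor(chapterFilePath, outputDir) {
  const base = path.basename(chapterFilePath, '.json'); // e.g. "chapter1"
  return path.join(outputDir, `${base}_akoma_ntoso.json`);
}

// Process chapters one at a time; a failure in one chapter does not stop the rest.
// `convert` and `save` are injected (e.g. getChapterContent + convertToAkomaNtoso, fs.writeFile).
async function convertChapters(chapterFiles, outputDir, convert, save) {
  const results = [];
  for (const file of chapterFiles) {
    const target = outputPathFor(file, outputDir);
    try {
      await save(target, await convert(file));
      results.push({ file, target, ok: true });
    } catch (error) {
      results.push({ file, target, ok: false, error: error.message });
    }
  }
  return results;
}

// Demo with stubs: the second chapter simulates an API failure.
const saved = new Map();
const demo = convertChapters(
  ['bbmp_data_extractor/chapter1.json', 'bbmp_data_extractor/chapter2.json'],
  'out',
  async file => {
    if (file.includes('chapter2')) throw new Error('simulated API failure');
    return { akomaNtoso: {} };
  },
  async (target, json) => { saved.set(target, JSON.stringify(json, null, 2)); }
);

demo.then(results => console.log(results));
```

Processing chapters sequentially keeps API usage predictable and makes it clear which chapter failed; running them in parallel would be faster but could run into rate limits.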

The Result: A Deeply Structured Legal Document

The output of this process, as seen in files like chapter1_akoma_ntoso.json, is a rich, hierarchical representation of the BBMP Act.

Here is an example from your data, showing how a section with nested clauses is structured:

JSON

{
  "akomaNtoso": {
    "act": {
      "body": {
        "chapter": [
          {
            "@eId": "ch_I",
            "section": [
              {
                "@eId": "sec_3",
                "num": "3.",
                "heading": "Definitions.",
                "content": [
                  {
                    "p": "In this Act, unless the context otherwise requires,-"
                  },
                  {
                    "subsection": {
                      "num": "(7)",
                      "content": {
                        "p": "'building' includes,-",
                        "clauses": [
                          {
                            "num": "(a)",
                            "content": "a house, out-house, stable, privy, shed, hut, wall, verandah, fixed platform, plinth, door step and any other structure..."
                          },
                          {
                            "num": "(b)",
                            "content": "a structure on wheels simply resting in the ground without foundations;"
                          }
                        ]
                      }
                    }
                  }
                ]
              }
            ]
          }
        ]
      }
    }
  }
}
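
To show how this structure can be consumed, here is a minimal traversal sketch that lists every clause with a readable path. It assumes the nesting shown above (chapter, section, content, subsection, clauses) and accepts chapter as either a single object or an array, since both forms appear on this page. The function listClauses and the shortened sample document are hypothetical.

```javascript
// Hypothetical traversal: list every clause with a readable path such as "sec_3 (7)(a)".
function listClauses(doc) {
  // [].concat accepts both a single chapter object and an array of chapters.
  const chapters = [].concat(doc.akomaNtoso?.act?.body?.chapter ?? []);
  const rows = [];
  for (const chapter of chapters) {
    for (const section of chapter.section ?? []) {
      for (const item of section.content ?? []) {
        const sub = item.subsection;
        if (!sub) continue; // plain paragraph, no clauses to list
        for (const clause of sub.content?.clauses ?? []) {
          rows.push({
            path: `${section['@eId']} ${sub.num}${clause.num}`,
            text: clause.content,
          });
        }
      }
    }
  }
  return rows;
}

// Shortened sample in the same shape as the example above.
const sampleDoc = {
  akomaNtoso: { act: { body: { chapter: [{
    '@eId': 'ch_I',
    section: [{
      '@eId': 'sec_3',
      num: '3.',
      content: [
        { p: 'In this Act, unless the context otherwise requires,-' },
        { subsection: { num: '(7)', content: {
          p: "'building' includes,-",
          clauses: [
            { num: '(a)', content: 'a house, out-house, stable...' },
            { num: '(b)', content: 'a structure on wheels...' },
          ],
        } } },
      ],
    }],
  }] } } },
};

console.log(listClauses(sampleDoc).map(row => row.path));
```

Paths built this way give each clause an address that a diffing tool could use to report exactly where an amendment landed.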

This structured format is what enables the powerful versioning and diffing features of your application, turning a static, hard-to-track document into a dynamic, transparent, and accessible legal text.

