Akoma Ntoso Format Generation
Understanding Akoma Ntoso: The "Why"
Before diving into the "how," it's important to understand why Akoma Ntoso is the chosen format. Legal documents are more than just text; they have an intricate, hierarchical structure. Akoma Ntoso ("linked hearts" in the Akan language of Ghana) is an international standard designed to represent legal and legislative documents in a structured, machine-readable way.
By converting the BBMP Act into this format, we unlock several key capabilities:
Granular Version Tracking: Instead of just knowing that a file has changed, we can pinpoint changes to a specific section, sub-section, or even a single clause. This is crucial for tracking amendments with precision.
Semantic Understanding: The Akoma Ntoso structure provides semantic meaning to the text. A machine can understand the difference between a chapter, a section, and a preamble, which is impossible with plain text.
Interoperability: As a global standard, it allows for the exchange and comparison of legal documents across different systems and jurisdictions.
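To make the version-tracking point concrete, here is a minimal sketch of comparing two versions of a chapter by section identifier. It assumes each chapter object carries a section array whose items have an "@eId" field, as in the structure shown later in this document. The diffSections helper and the sample headings are hypothetical and are not part of the project's scripts.

```javascript
// Hypothetical helper: compare two versions of a chapter by section "@eId".
// Assumes chapter.section is an array of objects that each carry "@eId".
function diffSections(oldChapter, newChapter) {
  const index = (chapter) =>
    new Map((chapter.section || []).map((s) => [s["@eId"], s]));
  const before = index(oldChapter);
  const after = index(newChapter);

  const added = [...after.keys()].filter((id) => !before.has(id));
  const removed = [...before.keys()].filter((id) => !after.has(id));
  // A section counts as changed when its serialized form differs.
  // JSON.stringify is sensitive to key order, which is acceptable for a sketch.
  const changed = [...after.keys()].filter(
    (id) =>
      before.has(id) &&
      JSON.stringify(before.get(id)) !== JSON.stringify(after.get(id))
  );
  return { added, removed, changed };
}

// Made-up sample data, for illustration only.
const v1 = {
  "@eId": "ch_I",
  section: [
    { "@eId": "sec_1", heading: "Example heading A" },
    { "@eId": "sec_2", heading: "Example heading B" },
  ],
};
const v2 = {
  "@eId": "ch_I",
  section: [
    { "@eId": "sec_2", heading: "Example heading B (amended)" },
    { "@eId": "sec_3", heading: "Example heading C" },
  ],
};

console.log(diffSections(v1, v2));
```

The same idea could be applied one level down (subsections, clauses) once those elements carry stable identifiers of their own.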
The Conversion Logic: From Raw Text to Structured JSON
The conversion is orchestrated by the scripts in your bbmp_data_extractor directory, primarily extractor.js and extractor_gemini.js. At its core, the process uses a large language model (LLM) as an expert legal parser.
Here’s a step-by-step breakdown of the logic:
Step 1: Consolidating the Raw Text
First, the raw text, which is fragmented across multiple JSON objects (representing pages), needs to be combined into a single, coherent string for each chapter.
Logic: The getChapterContent function is responsible for this. It reads the JSON file for a specific chapter, iterates through the array of pages, and concatenates the markdown content from each page.
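As a small illustration of the input this step works on, the sketch below shows the assumed page shape (an array of objects with a markdown string) and a synchronous, slightly more defensive concatenation. The combinePages name, the index field, and the sample pages are assumptions made for this example.

```javascript
// Hypothetical helper: join the markdown of all pages, skipping pages without text.
function combinePages(pages) {
  if (!Array.isArray(pages)) {
    throw new TypeError("Expected an array of page objects");
  }
  return pages
    .map((page) => (page && typeof page.markdown === "string" ? page.markdown : ""))
    .filter((text) => text.length > 0) // drop pages with no markdown
    .join("\n");
}

// Made-up sample pages; the "index" field is an assumption.
const samplePages = [
  { index: 0, markdown: "CHAPTER I" },
  { index: 1 }, // page with no markdown field
  { index: 2, markdown: "PRELIMINARY" },
];

console.log(combinePages(samplePages));
```

Dropping empty pages avoids stray blank lines in the combined text, and the explicit array check gives a clearer error when a file does not have the expected shape.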
Code Snippet (extractor.js):
JavaScript
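// Promise-based fs API, needed for `await fs.readFile` (CommonJS import assumed).
const fs = require('fs/promises');

// Reads one chapter file and joins the markdown of all its pages into a single string.
async function getChapterContent(filePath) {
  try {
    const data = await fs.readFile(filePath, 'utf-8');
    const pages = JSON.parse(data); // array of page objects
    const combinedMarkdown = pages.map(page => page.markdown).join('\n');
    return combinedMarkdown;
  } catch (error) {
    console.error(`Error reading or parsing file at ${filePath}:`, error);
    throw error;
  }
}
Step 2: Instructing the AI with a System Prompt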
This is the most critical part of the process. We provide the LLM with a detailed set of instructions, or a "system prompt," that tells it exactly how to parse the text and what the output JSON structure should look like.
Logic: The systemPrompt in convertToAkomaNtoso is carefully crafted to:
Define the Role: It tells the AI to act as an "expert legal document parser."
Specify the Goal: It clearly states the objective is to create a hierarchical JSON structure based on Akoma Ntoso principles.
Provide a Schema: It gives a clear example of the desired JSON output, including the nesting of chapter, section, subsection, and clauses. This is crucial for ensuring the AI returns a consistent and predictable structure.
Set Constraints: It includes critical instructions like "do not change, rephrase, or manipulate any of the original legal text" to ensure the integrity of the legal document is maintained.
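The "do not change the text" constraint can also be checked after the fact rather than trusted. The sketch below is one possible check, not something the project's scripts are known to do: it collects every string value in the structured output (skipping keys that start with "@", on the assumption that identifiers such as "@eId" are generated) and reports any that do not appear in the source text once whitespace is normalized. The function names are hypothetical.

```javascript
// Collapse whitespace so line breaks in the source do not cause false alarms.
const normalize = (s) => s.replace(/\s+/g, " ").trim();

// Recursively collect string values, skipping keys that start with "@"
// (assumed to be generated identifiers rather than copied text).
function collectText(node, out = []) {
  if (typeof node === "string") {
    out.push(node);
  } else if (Array.isArray(node)) {
    node.forEach((item) => collectText(item, out));
  } else if (node && typeof node === "object") {
    for (const [key, value] of Object.entries(node)) {
      if (!key.startsWith("@")) collectText(value, out);
    }
  }
  return out;
}

// Returns the strings from `structured` that are not found in `sourceText`.
function findAlteredText(structured, sourceText) {
  const haystack = normalize(sourceText);
  return collectText(structured).filter(
    (text) => !haystack.includes(normalize(text))
  );
}

const source = "3. Definitions.\nIn this Act, unless the context\notherwise requires,-";
const intact = {
  "@eId": "sec_3",
  num: "3.",
  heading: "Definitions.",
  content: [{ p: "In this Act, unless the context otherwise requires,-" }],
};
const altered = {
  "@eId": "sec_3",
  num: "3.",
  heading: "Definitions.",
  content: [{ p: "In this Act, unless context requires otherwise,-" }],
};

console.log(findAlteredText(intact, source));
console.log(findAlteredText(altered, source));
```

A substring check like this is deliberately loose: it catches rephrasing, but it does not detect text the model dropped entirely.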
Code Snippet (extractor_gemini.js):
JavaScript
const systemPrompt = `
You are an expert legal document parser. Your task is to convert the provided raw text from a legislative act into a structured JSON format that follows Akoma Ntoso principles...
**Desired JSON Structure Example (with nesting):**
{
  "akomaNtoso": {
    "act": {
      "meta": { "... meta info ..." },
      "preamble": { "... preamble info ..." },
      "body": {
        "chapter": {
          "@eId": "ch_I",
          "num": "CHAPTER I",
          "heading": "PRELIMINARY",
          "section": [
            {
              "@eId": "sec_3",
              "num": "3.",
              "heading": "Definitions.",
              "content": [
                {
                  "p": "In this Act, unless the context otherwise requires,-"
                },
                {
                  "subsection": {
                    "num": "(7)",
                    "content": {
                      "p": "'building' includes,-",
                      "clauses": [
                        { "num": "(a)", "content": "a house, out-house, stable..." },
                        { "num": "(b)", "content": "a structure on wheels..." }
                      ]
                    }
                  }
                }
              ]
            }
          ]
        }
      }
    }
  }
}
`;
Step 3: Making the API Call and Receiving the JSON
With the raw text consolidated and the instructions defined, the script then makes an API call to the LLM.
Logic: The convertToAkomaNtoso function sends the system prompt and the raw text to the AI model. It specifies that the response should be a JSON object, which simplifies the parsing on our end.
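Asking for a JSON object does not guarantee a complete one: if the model hits its output token limit, the content can be cut off and JSON.parse will throw. The sketch below is a hypothetical helper (parseModelJson is not part of the project's scripts) that takes one entry of response.choices, checks the finish_reason field, parses the content, and confirms the top-level akomaNtoso key described in the system prompt.

```javascript
// Hypothetical helper: validate and parse one entry of `response.choices`.
function parseModelJson(choice) {
  // "length" means the model stopped at its output token limit,
  // so the JSON is likely truncated.
  if (choice.finish_reason === "length") {
    throw new Error("Model output was truncated; consider sending a smaller chunk");
  }
  const content = choice.message && choice.message.content;
  if (typeof content !== "string" || content.trim() === "") {
    throw new Error("Model returned no content");
  }
  let parsed;
  try {
    parsed = JSON.parse(content);
  } catch (err) {
    throw new Error(`Model output is not valid JSON: ${err.message}`);
  }
  // The system prompt asks for a top-level "akomaNtoso" object.
  if (!parsed || typeof parsed.akomaNtoso !== "object" || parsed.akomaNtoso === null) {
    throw new Error('Parsed JSON has no top-level "akomaNtoso" object');
  }
  return parsed;
}

// Made-up choice object shaped like a chat completion choice.
const goodChoice = {
  finish_reason: "stop",
  message: { content: '{"akomaNtoso":{"act":{"body":{}}}}' },
};
console.log(parseModelJson(goodChoice));
```

Failing early with a specific message makes it easier to tell a truncated response from a malformed one when a chapter fails to convert.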
Code Snippet (extractor.js):
JavaScript
// Assumes `openai` is an initialized client from the official `openai` Node package (v4 or later).
async function convertToAkomaNtoso(textContent) {
  // ... systemPrompt is defined here ...
  try {
    console.log("Sending content to OpenAI for conversion...");
    const response = await openai.chat.completions.create({
      model: "gpt-4-turbo",
      messages: [
        { role: "system", content: systemPrompt },
        { role: "user", content: `Here is the text to convert:\n\n${textContent}` }
      ],
      // JSON mode: ask the model to return a single JSON object.
      response_format: { type: "json_object" },
    });
    const jsonOutput = JSON.parse(response.choices[0].message.content);
    console.log("Successfully received and parsed structured JSON from OpenAI.");
    return jsonOutput;
  } catch (error) {
    console.error("Error during OpenAI API call:", error);
    throw error;
  }
}
Step 4: Storing the Structured Output
Finally, the structured JSON returned by the AI is saved to a file.
Logic: The main function orchestrates the entire process: it calls getChapterContent to get the text, passes it to convertToAkomaNtoso, and then writes the resulting structured JSON to a new file in the bbmp_data_extractor/akomo-ntoso/ directory.
Code Snippet (extractor.js):
JavaScript
// `fs` here is the promise-based module (fs/promises); CommonJS import of `path` assumed.
const path = require('path');

async function main() {
  try {
    const chapterFilePath = path.resolve('bbmp_data_extractor/chapter1.json');
    const rawText = await getChapterContent(chapterFilePath);
    const structuredJson = await convertToAkomaNtoso(rawText);
    // Note: path.resolve resolves this name against the current working directory.
    const outputFilePath = path.resolve('chapter1_akoma_ntoso.json');
    await fs.writeFile(outputFilePath, JSON.stringify(structuredJson, null, 2));
    console.log(`Conversion successful. Structured JSON saved to: ${outputFilePath}`);
  } catch (error) {
    console.error("The process failed:", error);
  }
}
The Result: A Deeply Structured Legal Document
The output of this process, as seen in files like chapter1_akoma_ntoso.json, is a rich, hierarchical representation of the BBMP Act.
Here is an example from your data, showing how a section with nested clauses is structured:
JSON
{
  "akomaNtoso": {
    "act": {
      "body": {
        "chapter": [
          {
            "@eId": "ch_I",
            "section": [
              {
                "@eId": "sec_3",
                "num": "3.",
                "heading": "Definitions.",
                "content": [
                  {
                    "p": "In this Act, unless the context otherwise requires,-"
                  },
                  {
                    "subsection": {
                      "num": "(7)",
                      "content": {
                        "p": "'building' includes,-",
                        "clauses": [
                          {
                            "num": "(a)",
                            "content": "a house, out-house, stable, privy, shed, hut, wall, verandah, fixed platform, plinth, door step and any other structure..."
                          },
                          {
                            "num": "(b)",
                            "content": "a structure on wheels simply resting in the ground without foundations;"
                          }
                        ]
                      }
                    }
                  }
                ]
              }
            ]
          }
        ]
      }
    }
  }
}
This structured format is what enables the powerful versioning and diffing features of your application, turning a static, hard-to-track document into a dynamic, transparent, and accessible legal text.
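As a rough illustration of how downstream code might address individual clauses in this structure, the sketch below walks the nesting shown above and produces one row per clause with a readable path. The listClauses helper is hypothetical. It accepts chapter as either a single object or an array, since both shapes appear in this document, and the sample data is abbreviated from the examples above.

```javascript
// Normalize "one object or an array of objects" into an array.
const asArray = (value) => (value == null ? [] : [].concat(value));

// Hypothetical helper: list every clause with a path such as
// "ch_I > sec_3 > (7) > (a)".
function listClauses(doc) {
  const rows = [];
  const chapters = asArray(doc?.akomaNtoso?.act?.body?.chapter);
  for (const chapter of chapters) {
    for (const section of asArray(chapter.section)) {
      for (const item of asArray(section.content)) {
        const sub = item.subsection;
        if (!sub) continue; // plain paragraphs carry no clauses
        for (const clause of asArray(sub.content?.clauses)) {
          rows.push({
            path: [chapter["@eId"], section["@eId"], sub.num, clause.num].join(" > "),
            text: clause.content,
          });
        }
      }
    }
  }
  return rows;
}

// Abbreviated sample in the shape shown above.
const sampleDoc = {
  akomaNtoso: {
    act: {
      body: {
        chapter: [
          {
            "@eId": "ch_I",
            section: [
              {
                "@eId": "sec_3",
                content: [
                  { p: "In this Act, unless the context otherwise requires,-" },
                  {
                    subsection: {
                      num: "(7)",
                      content: {
                        p: "'building' includes,-",
                        clauses: [
                          { num: "(a)", content: "a house, out-house, stable..." },
                          { num: "(b)", content: "a structure on wheels..." },
                        ],
                      },
                    },
                  },
                ],
              },
            ],
          },
        ],
      },
    },
  },
};

console.log(listClauses(sampleDoc));
```

Rows keyed by a stable path are a convenient unit for comparing two versions of the same clause.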