PEFS Digitization and DEXPI Conversion for Process Engineering Archives

How do you make 35,000 engineering drawings speak the same language as your engineering systems?

A large Middle Eastern oil and gas operator held 35,000 PEFS files across AutoCAD, PDF, TIFF, and JPEG formats, with no way to search them by equipment or process element. Azati built an AI-powered pipeline that extracts and classifies data from these drawings, converts them to the DEXPI industry standard via Proteus XML, validates the output, and generates SVG visualizations, turning a static archive into an intelligent, queryable engineering data layer.

35,000 PEFS files in the archive, previously unsearchable
60-80% estimated reduction in manual work based on pilot results
DEXPI industry standard adopted for all converted output, enabling downstream system integration

Technologies used

Python
PyTorch
OpenCV
FastAPI
MongoDB
Oracle Cloud Infrastructure

Motivation

Process Engineering Flow Schemes define how an oil and gas facility works: equipment connections, process flows, safety systems. When 35,000 of these diagrams exist only as AutoCAD drawings, PDFs, TIFF scans, and JPEG images, that knowledge is locked. You cannot search a TIFF. Every question about plant topology becomes a manual review across a 35,000-file archive, and decisions on operations, safety, and facility upgrades get made without the engineering data that should inform them.

Azati built the pipeline to unlock it. The pipeline extracts and classifies data from PEFS drawings regardless of format, reconstructs process topology as a graph model, converts the result to DEXPI (the industry standard for machine-readable P&ID and PEFS data), validates the output, and generates SVG visualizations. The engagement started as a pilot and moved to full scale after the approach proved out.

Business challenges

Challenge 01

An archive that engineering systems could not read

PEFS are the logical backbone of the facility, but stored as static files they are invisible to any software system. There was no tag search, no equipment lookup, and no connection tracing, only 35,000 files that had to be opened and reviewed one by one:

  • Files distributed across AutoCAD, PDF, TIFF, and JPEG formats
  • No machine-readable structure or searchable metadata
  • Engineers manually reviewing files to locate specific equipment
  • Decisions on operations, safety, and upgrades slowed by documentation access

Challenge 02

Drawing formats that break any generic extraction approach

PEFS drawings carry meaning in symbols, topology, and geometric relationships, not just text. Symbol detection, geometric relationship analysis, and topology reconstruction are required before any data can be meaningfully extracted:

  • AutoCAD DXF/DWG files requiring specialized vector parsing via ezdxf
  • Raster formats requiring computer vision for symbol detection and classification
  • Hundreds of symbol types across different drawing conventions
  • Process topology encoded in geometry, not in named fields

Challenge 03

DEXPI conversion as a precision engineering problem

Producing valid DEXPI output is not a mapping exercise. DEXPI, built on the Proteus XML schema, is an information model that describes how process elements relate to each other within a defined hierarchy. Getting the conversion right required building expertise in the standard from scratch alongside the extraction work:

  • DEXPI information model and Proteus XML schema expertise required
  • Process topology must be represented as correctly typed graph structures
  • Pydantic validation enforcing schema compliance at every conversion step
  • NetworkX graph modeling for connection and hierarchy reconstruction

Challenge 04

Requirements that evolved as the drawings revealed their complexity

A 35,000-file archive accumulated over decades contains drawing conventions, symbol variants, and edge cases that no specification document captures in advance. The real requirements emerged from the drawings themselves:

  • New symbol types and drawing conventions surfacing with each batch
  • Models retrained iteratively as annotated examples accumulated
  • Client's downstream integration requirements clarified as conversion output became concrete
  • Agile Scrum structure sustaining continuous delivery through a moving target

Why oil and gas operators choose Azati for PEFS digitization

DEXPI and Proteus XML as a first-class capability, not an afterthought

Most AI document tools extract data into proprietary formats that require custom integration work for every downstream system. Azati built DEXPI conversion directly into the pipeline from the start, using pyDEXPI and NetworkX with Pydantic validation. The output is therefore not just structured data but data that engineering platforms, CMMS, and ERP systems can consume immediately without additional transformation.

Process topology reconstruction, not just symbol extraction

Extracting symbols from a PEFS drawing is the easy part. Understanding how those symbols connect, what the process relationships are between them, and representing that topology correctly in the DEXPI information model is the hard part. Azati built the geometric relationship analysis and graph modeling layer using Shapely and NetworkX to do exactly that, because a list of symbols without topology is not useful engineering data.

Iterative model development built into the engagement from day one

PEFS archives contain decades of drawing conventions. No initial model covers all of them. Azati's approach uses Gradio-based review interfaces to close the loop between extraction output and domain expert feedback quickly, with systematic retraining cycles that improve coverage with every processed batch rather than plateauing early.

A returning client, extended for a reason

This client worked with Azati on other projects before this engagement and chose Azati again specifically for the combination of AI and ML depth, experience handling large unstructured datasets, and an iterative approach that fit the genuinely exploratory nature of the problem.

Engineering drawings that your systems cannot read?

Whether your drawings are in AutoCAD, PDF, TIFF, or JPEG, Azati can show you what an AI-powered PEFS digitization and DEXPI conversion pipeline looks like for your archive and your downstream systems.


How AI-powered PEFS digitization and DEXPI conversion works end to end

Azati built a multi-stage pipeline that ingests PEFS files in any format, extracts and classifies the engineering data inside them, reconstructs the process topology as a graph model, converts the result to DEXPI, validates it against the schema, and renders a visual output, all in a loop that improves as the volume of processed and annotated drawings grows.

01

Multi-format drawing ingestion

The pipeline handles PEFS files regardless of format. AutoCAD DXF and DWG files are parsed as vector geometry using ezdxf, which preserves the structural precision of the original drawing. PDF, TIFF, and JPEG files are rasterized or processed directly through the computer vision pipeline. A single entry point handles all formats.

Key capabilities:
  • AutoCAD DXF/DWG vector parsing via ezdxf
  • PDF ingestion and rasterization via pypdfium2
  • TIFF and JPEG raster image processing
  • Format-agnostic pipeline entry point
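
The single entry point described above can be pictured as a dispatch on file type. The following is a minimal, dependency-free sketch rather than the actual implementation: the `ROUTES` table and the `route_drawing` function are hypothetical names, and the real parsers named in the text (ezdxf, pypdfium2) appear only in comments. ezdxf reads DXF, so DWG input is assumed here to pass through a conversion step that the source does not describe.

```python
from pathlib import Path

# Hypothetical routing table: file extension -> processing route.
ROUTES = {
    ".dxf": "vector",      # parsed as vector geometry (ezdxf in the source)
    ".dwg": "vector",      # assumption: converted to DXF before parsing
    ".pdf": "rasterize",   # rendered to images first (pypdfium2 in the source)
    ".tif": "raster",
    ".tiff": "raster",
    ".jpg": "raster",
    ".jpeg": "raster",
}


def route_drawing(path: str) -> str:
    """Return the processing route for a drawing, based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in ROUTES:
        raise ValueError(f"Unsupported drawing format: {suffix or '(none)'}")
    return ROUTES[suffix]
```

Rejecting unknown extensions up front, instead of guessing, keeps unsupported files out of the later stages where failures are harder to trace.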
02

Symbol detection and classification

Object detection models trained on PEFS symbol libraries identify and classify equipment, instruments, valves, and process elements across each drawing. Datumaro manages the annotation and training data pipeline, PyTorch Lightning powers model training, and models are retrained iteratively as annotated examples from new drawing batches accumulate.

Key capabilities:
  • Object detection for PEFS equipment and instrument symbols
  • Symbol classification across drawing conventions
  • Datumaro-managed annotation and training data pipeline
  • Iterative model retraining as new drawing types appear
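
One way to decide which symbol classes need more annotated examples before the next retraining cycle is to compare detections with reviewed annotations. The sketch below is an assumption about how such a check could look, not a description of the team's tooling: it uses intersection-over-union (IoU) with greedy first-match pairing, and the function names and the 0.5 threshold are illustrative.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def recall_per_class(annotations, detections, iou_threshold=0.5):
    """Fraction of annotated symbols per class that a detection matched.

    Both inputs are lists of (symbol_class, box). Each detection can be
    matched to at most one annotation (greedy, first match wins).
    """
    totals, hits, used = {}, {}, set()
    for cls, box in annotations:
        totals[cls] = totals.get(cls, 0) + 1
        for i, (det_cls, det_box) in enumerate(detections):
            if i in used or det_cls != cls:
                continue
            if iou(box, det_box) >= iou_threshold:
                used.add(i)
                hits[cls] = hits.get(cls, 0) + 1
                break
    return {cls: hits.get(cls, 0) / total for cls, total in totals.items()}
```

Classes with low recall on a new batch are natural candidates for additional annotation before retraining.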
03

OCR and text-to-symbol binding

Alongside visual symbol detection, the pipeline extracts text from drawings: equipment tags, instrument codes, line designations, and annotations. Each extracted text element is bound to the graphical symbol it labels, creating the searchable identifiers engineers need to find specific equipment by tag rather than by visual scan.

Key capabilities:
  • OCR for equipment tags and instrument codes
  • Text-to-symbol binding for searchable metadata
  • Extraction across multiple text styles and drawing conventions
  • Foundation for intelligent archive search
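
Text-to-symbol binding can be illustrated with a nearest-neighbour rule. This is a simplified sketch under stated assumptions: labels are bound to the closest symbol by centre distance, the `max_distance` cutoff of 50 drawing units is arbitrary, and all class and function names are hypothetical. Real drawings are likely to need leader-line handling and convention-specific rules that this sketch ignores.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in drawing coordinates."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self) -> tuple:
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


@dataclass(frozen=True)
class Symbol:
    symbol_id: str
    symbol_class: str
    box: Box


@dataclass(frozen=True)
class TextItem:
    text: str
    box: Box


def bind_text_to_symbols(texts, symbols, max_distance=50.0):
    """Map symbol_id -> label text, using the nearest symbol within max_distance."""
    best = {}  # symbol_id -> (distance, text)
    for item in texts:
        tx, ty = item.box.center
        nearest, nearest_dist = None, max_distance
        for sym in symbols:
            sx, sy = sym.box.center
            dist = hypot(tx - sx, ty - sy)
            if dist <= nearest_dist:
                nearest, nearest_dist = sym, dist
        if nearest is None:
            continue  # no symbol close enough: leave the text unbound
        current = best.get(nearest.symbol_id)
        if current is None or nearest_dist < current[0]:
            best[nearest.symbol_id] = (nearest_dist, item.text)
    return {symbol_id: text for symbol_id, (_, text) in best.items()}
```

The resulting mapping is what makes tag-based lookup possible: a query for a tag resolves to a symbol, and the symbol to its position and connections.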
04

Process topology reconstruction

After extracting individual elements, the pipeline reconstructs the network of connections between them. Shapely handles the geometric relationship analysis, identifying which elements are connected by which lines and in what configuration. NetworkX models the result as a directed graph, producing a machine-readable representation of the process topology that is the core of the DEXPI output.

Key capabilities:
  • Geometric relationship analysis via Shapely
  • Connection and topology graph modeling via NetworkX
  • Equipment-to-piping relationship reconstruction
  • Graph structure as the basis for DEXPI conversion
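
The idea of turning geometry into a connection graph can be sketched without any libraries. The pipeline described above uses Shapely for the geometric tests and NetworkX for the graph; the version below swaps in plain Python tuples and a dictionary of adjacency lists so it stays self-contained. It assumes each line runs from its source element to its target element, which is a simplification, and the function names and the 2-unit tolerance are hypothetical.

```python
from collections import defaultdict


def point_in_box(point, box, tol=2.0):
    """True if the point lies inside the box, expanded by tol on every side."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return (x_min - tol) <= x <= (x_max + tol) and (y_min - tol) <= y <= (y_max + tol)


def symbol_at(point, symbols, tol=2.0):
    """Return the id of the first symbol whose box contains the point, else None."""
    for symbol_id, box in symbols.items():
        if point_in_box(point, box, tol):
            return symbol_id
    return None


def build_topology(symbols, lines, tol=2.0):
    """Build a directed adjacency mapping from symbols and line segments.

    symbols: dict of symbol_id -> (x_min, y_min, x_max, y_max)
    lines:   list of ((x0, y0), (x1, y1)), assumed to run source -> target
    """
    graph = defaultdict(set)
    for symbol_id in symbols:
        graph[symbol_id]  # make sure unconnected symbols still appear as nodes
    for start, end in lines:
        src = symbol_at(start, symbols, tol)
        dst = symbol_at(end, symbols, tol)
        if src is not None and dst is not None and src != dst:
            graph[src].add(dst)
    return {node: sorted(targets) for node, targets in graph.items()}
```

For example, a tank, a pump, and a valve joined by two line segments:

```python
symbols = {"T-100": (0, 0, 20, 20), "P-101": (50, 5, 60, 15), "V-205": (90, 5, 100, 15)}
lines = [((20, 10), (50, 10)), ((60, 10), (90, 10))]
topology = build_topology(symbols, lines)
# topology == {"T-100": ["P-101"], "P-101": ["V-205"], "V-205": []}
```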
05

DEXPI conversion via Proteus XML

The process topology graph and extracted element data are converted to DEXPI using pyDEXPI and the Proteus XML schema. Pydantic validation enforces schema compliance at every conversion step, ensuring the output meets the standard's requirements rather than just resembling its structure. The result is a DEXPI file that downstream engineering systems can ingest directly.

Key capabilities:
  • pyDEXPI-based DEXPI format generation
  • Proteus XML schema compliance
  • Pydantic validation of converted output
  • Downstream-ready files for engineering platform integration
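
The principle of validating at each step before writing XML can be shown in a few lines. This sketch does not use pyDEXPI or Pydantic; it swaps in standard-library dataclasses and `xml.etree.ElementTree` so it runs anywhere. The element and attribute names are loosely modelled on Proteus XML naming but heavily simplified, so the output is not schema-valid DEXPI, and the tag pattern is a hypothetical convention.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical tag convention, e.g. "P-101" or "HX-2040".
TAG_PATTERN = re.compile(r"^[A-Z]{1,3}-\d{3,5}$")


@dataclass(frozen=True)
class Element:
    element_id: str
    component_class: str
    tag_name: str

    def __post_init__(self):
        # Validate on construction so bad data never reaches the XML writer.
        if not self.element_id:
            raise ValueError("element_id must not be empty")
        if not TAG_PATTERN.match(self.tag_name):
            raise ValueError(f"tag_name {self.tag_name!r} does not match the expected pattern")


def to_xml(elements, connections):
    """Serialize elements and (from_id, to_id) connections to a simplified XML string."""
    known = {element.element_id for element in elements}
    root = ET.Element("PlantModel")
    for element in elements:
        ET.SubElement(
            root,
            "Equipment",
            ID=element.element_id,
            ComponentClass=element.component_class,
            TagName=element.tag_name,
        )
    for src, dst in connections:
        # Reject dangling references instead of emitting an inconsistent file.
        if src not in known or dst not in known:
            raise ValueError(f"connection {src}->{dst} references an unknown element")
        ET.SubElement(root, "Connection", FromID=src, ToID=dst)
    return ET.tostring(root, encoding="unicode")
```

Failing early on a malformed tag or a dangling connection is what separates output that meets a standard from output that only resembles it.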
06

SVG visualization and human review layer

For each processed drawing, the pipeline generates an SVG visualization that preserves the original geometry in a web-viewable, scalable format. Engineers review extraction results against the original drawing, with extracted elements overlaid or linked for verification. This review layer feeds back into model retraining, closing the loop between human expertise and automated extraction.

Key capabilities:
  • SVG rendering of processed engineering drawings
  • Extraction result overlay for human review
  • Scalable vector output for web-based platforms
  • Review feedback loop into model retraining
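
An extraction overlay for review can be as simple as one rectangle and one label per detected element. The sketch below builds such an SVG with the standard library only; the `render_overlay` function, the detection dictionary layout, and the styling are hypothetical, and a real overlay would sit on top of the rendered drawing geometry rather than an empty canvas.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def render_overlay(width, height, detections):
    """Return an SVG string with a labelled rectangle for each detection.

    detections: list of dicts with "box" (x_min, y_min, x_max, y_max),
    "symbol_class", and an optional "tag".
    """
    ET.register_namespace("", SVG_NS)  # serialize with a default xmlns
    svg = ET.Element(
        f"{{{SVG_NS}}}svg",
        {"width": str(width), "height": str(height), "viewBox": f"0 0 {width} {height}"},
    )
    for det in detections:
        x_min, y_min, x_max, y_max = det["box"]
        group = ET.SubElement(svg, f"{{{SVG_NS}}}g", {"class": det["symbol_class"]})
        ET.SubElement(
            group,
            f"{{{SVG_NS}}}rect",
            {
                "x": str(x_min),
                "y": str(y_min),
                "width": str(x_max - x_min),
                "height": str(y_max - y_min),
                "fill": "none",
                "stroke": "red",
            },
        )
        label = ET.SubElement(group, f"{{{SVG_NS}}}text", {"x": str(x_min), "y": str(y_min - 2)})
        label.text = det.get("tag", "")
    return ET.tostring(svg, encoding="unicode")
```

Grouping each rectangle and label under a `g` element with the symbol class makes it straightforward for a web viewer to toggle or style one class at a time.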

What Azati did

Area | Azati contribution
Multi-format ingestion | Built parsing pipelines for AutoCAD DXF, PDF, TIFF, and JPEG drawings
Symbol detection | Trained and deployed object detection models for PEFS symbol classification
OCR and tagging | Extracted equipment tags and instrument codes and bound them to graphical symbols
Topology reconstruction | Built graph models of process element connections using Shapely and NetworkX
DEXPI conversion | Implemented pyDEXPI-based conversion with Proteus XML schema and Pydantic validation
SVG generation | Generated scalable vector visualizations with extraction overlays for human review
Iterative delivery | Retrained models and refined extraction as new drawing types emerged
Requirements engineering | Translated evolving client needs into technical specifications through close collaboration

Security

The engagement operates under the client's internal data security policies. Data transfer uses TLS encryption, access is role-based with full activity logging, and all team members work under NDA with authorized access to client systems. The pipeline runs on Oracle Cloud Infrastructure (OCI) in alignment with the client's infrastructure and data residency requirements.

Engagement & delivery

Pilot phase confirmed the approach, full-scale engagement followed

The engagement began as a focused pilot from October 2025 to April 2026, processing a representative subset of the archive and establishing the extraction, topology reconstruction, DEXPI conversion, and validation pipeline. Pilot results showed an estimated 60-80% reduction in manual work and led directly to a full-scale engagement now underway.

Agile Scrum with iterative model improvement built in

The team works in Agile Scrum, with each sprint balancing new drawing type coverage, model retraining, and requirements refinement as the client's downstream integration needs become clearer:

  • Agile Scrum delivery across pilot and full-scale phases
  • Regular sprint reviews with client to validate extraction accuracy and DEXPI output
  • Model retraining cycles tied to accumulating annotated drawing data
  • Requirements evolved as new PEFS variants and edge cases surfaced

Returning client, contract extended for at least another year

This client had worked with Azati on earlier engagements. After a successful pilot, the contract was extended for at least another year, reflecting sustained confidence in both the technical approach and the team.

Results & business impact

35,000 previously unsearchable drawings becoming queryable

Engineers and operators who previously had to manually review files to find specific equipment or trace a process connection can now search the archive by tag, element, or process relationship. The documentation that was locked in static files becomes part of how the facility is actually understood and managed.

60-80% estimated reduction in manual processing work

Pilot results demonstrated that automated extraction, classification, topology reconstruction, and DEXPI conversion can shift the majority of manual processing out of human hands. A specialist reviews and confirms AI output rather than performing the extraction from scratch on every drawing.

DEXPI-compliant output that plugs into existing engineering systems

Converted files follow the DEXPI standard built on Proteus XML. The output is therefore not structured data waiting for an integration project; it is data that engineering platforms, CMMS, ERP systems, and digital twin initiatives can consume directly, compounding the value of every processed drawing.

The data foundation for digital twin and predictive maintenance readiness

Every digital twin initiative and predictive maintenance program eventually runs into the same problem: plant topology lives in engineering drawings that no system can read. Making the PEFS archive machine-readable in DEXPI format provides the data layer that makes those downstream initiatives possible. It is not a nice-to-have preparation step but the prerequisite.

Strategic wins

What this engagement demonstrates beyond the feature list:

This is where digital twin initiatives actually start

Most digital twin programs stall not because the AI is not ready, but because the plant topology data is not ready. PEFS drawings contain the logical DNA of a facility: how every equipment item, valve, instrument, and process line relates to everything else. Converting that to machine-readable DEXPI is the prerequisite step that makes predictive maintenance, safety analysis, and operational AI actually work against real plant structure rather than incomplete manual records.

DEXPI compatibility is what separates useful data from another structured silo

Many digitization projects produce structured output that sits in a proprietary database and requires custom work to connect to anything downstream. By converting directly to DEXPI via Proteus XML, Azati ensures the extracted data speaks the same language as the engineering systems the client already runs, removing the integration burden from every connection and compounding the value of every processed drawing.

Iterative model improvement is the only viable approach at this scale and age

A 35,000-file archive built across decades contains drawing conventions, symbol variants, and edge cases that no specification document captures in advance, and no initial model covers completely. Building systematic retraining into the engagement from the start, tied to human review feedback, is what lets the pipeline keep improving its coverage rather than plateauing at a rate that still leaves too much residual manual work.

The described expertise is relevant for

  • AI-powered P&ID and PEFS digitization for oil and gas and process industry operators
  • DEXPI conversion and Proteus XML schema implementation
  • Computer vision and OCR for engineering drawings in AutoCAD, PDF, TIFF, and JPEG formats
  • Symbol detection and classification for process engineering schematics
  • Process topology reconstruction for intelligent search and downstream integration
  • Engineering drawing data preparation for digital twin and predictive maintenance initiatives
  • OCI-hosted AI pipelines for regulated industrial environments

Related case studies

Explore our successful projects and see how Azati delivers measurable results for our clients.

AI-Powered Technical Document Processing and Defect Detection for Oil & Gas
Energy, Oil & Gas

~100,000 technical documents processed by the AI pipeline
50–70% estimated reduction in manual review effort
40–60% estimated reduction in reporting data preparation time
  • Python
  • FastAPI
  • PyTorch
  • OpenCV
  • OCI

⚡ Pain Points We Tackled

A large Middle Eastern oil and gas operator was manually extracting data from technical documents and engineering drawings submitted by contractors, cross-checking it against internal systems, and compiling reports by hand. At scale, this process was slow, resource-intensive, and increasingly error-prone, with no systematic way to prioritize which discrepancies mattered most.

Our Approach

Azati built an AI-powered pipeline combining computer vision and OCR to extract and validate data from engineering documents and AutoCAD drawings, flagging discrepancies against internal system records and delivering verified results directly into the client's existing Knowledge Hub platform, extending it rather than replacing it.

Applied Methods and Practices

  • Document and drawing data extraction: Computer vision and geometry-aware OCR pipeline handling AutoCAD, PDF, and raster formats.
  • Cross-validation against internal systems: Automated comparison against internal records with traceable discrepancy flagging.
  • LLM-assisted interpretation: Large language models providing context-aware interpretation of technical terminology across contractor document formats.
  • Knowledge Hub integration: Results delivered directly into the client's existing platform without a separate tool.
  • DD Projects and Reporting Hub: Tag hierarchies, bulk updates via Excel, and exportable reports with kh-link references back to source documents.

Solution Features

  • ~100,000 documents processed in production: Pipeline moved from pilot to full scale, validating the approach at real engineering document volumes.
  • 50–70% estimated reduction in manual review effort: Automated extraction and discrepancy detection shifted most work from fully manual to semi-automated.
  • Faster reporting data preparation: Time spent preparing audit-ready data dropped substantially compared to the fully manual process.
  • Scalable, traceable process: Every extracted data point and flagged discrepancy traceable back to its source document.

Enterprise Data Platform Modernization and Analytical Pipeline Development
Energy, Oil & Gas

Dozens of analytical data marts and pipelines built and enhanced
Hundreds of flows within the enterprise data ecosystem
24+ months of continuous embedded delivery
  • Python
  • SQL
  • dbt
  • Apache Airflow
  • ClickHouse

⚡ Pain Points We Tackled

A large chemical enterprise required scalable mechanisms to collect, transform, and deliver trusted analytical data across dozens of interconnected business systems. Growing analytical requirements, a complex data ecosystem with hundreds of flows and thousands of tables, and the need for reliable reporting data created ongoing engineering challenges.

Our Approach

Azati embedded specialists into the client's delivery organization, developing analytical pipelines, data marts, and orchestration workflows using SQL, Python, dbt, Apache Airflow, Apache NiFi, Greenplum, PostgreSQL, and ClickHouse. The engagement focused on continuous platform evolution over 24+ months rather than a one-time implementation.

Applied Methods and Practices

  • Data integration and ingestion workflows: Mechanisms for bringing data into the analytical environment from multiple enterprise systems.
  • Analytical data marts: Business-oriented datasets curated for reporting and self-service analytics.
  • Workflow orchestration: Apache Airflow and Apache NiFi-based pipelines for continuous data processing and delivery.
  • Performance optimization: SQL and transformation refinement as the analytical asset base expanded.
  • Scalable foundations: Greenplum, ClickHouse, dbt, and PostgreSQL supporting growing analytical requirements.

Solution Features

  • Improved access to trusted business data: Curated analytical datasets simplified information consumption for BI teams.
  • Better operational visibility: Reliable analytical structures supporting reporting and operational analysis across multiple domains.
  • Sustainable platform evolution: Continuous enhancements accommodating changing reporting requirements and new systems over 24+ months.
  • Additional engineering capacity: Azati integrated into the client's established delivery organization without disrupting existing processes.

AI-Powered Petrochemical Inventory Management Platform
Energy, Oil & Gas

7 years of continuous dedicated team development and operation
10–30% typical inventory reduction through AI nomenclature normalization
$10M+ typical savings per asset for this kind of AI-driven inventory optimization
  • Python
  • FastAPI
  • Flask
  • Kubernetes
  • PostgreSQL

⚡ Pain Points We Tackled

A large petrochemical operator faced high carrying, maintenance, and disposal costs from unclaimed and duplicate inventory across multiple ERP systems using different coding conventions. The same material was often tracked under different codes, making it impossible to see true inventory levels without manual reconciliation.

Our Approach

Azati ran a seven-year dedicated team engagement building and operating an on-premise inventory management platform with AI-assisted nomenclature normalization across disconnected ERP systems, covering backend development, DevOps, and test automation alongside the core product functionality.

Applied Methods and Practices

  • AI-assisted nomenclature normalization: ML components matching and normalizing materials coded differently across legacy ERP systems.
  • Inventory search and scenarios: Search across warehouse and MRO stock, custom scenario creation, and material scope definition.
  • Dashboard reporting: Stock level and valuation visibility without raw data access.
  • On-premise DevOps: Kubernetes, Helm, and Keycloak deployment on the client's infrastructure.
  • Test automation: Playwright, pytest, and httpx-based framework covering UI, API, and database validation.

Solution Features

  • Lower inventory carrying costs: AI-matched nomenclature gave the business a unified view of true inventory across previously fragmented ERP records.
  • Seven years without a vendor switch: This was the client's first and only outsourcing relationship for this system, reflecting sustained trust in the engagement.
  • Industrial inventory expertise inside Azati: Deep expertise in MRO inventory optimization, ERP normalization, and supply chain integration for the petrochemical sector.

Oilfield Reservoir Analytics Platform
Energy, Oil & Gas

1000x reduction in floating-point accumulation error
4 analytical modules delivered in a 4-month window
10 days to redesign and pass UAT on the computational core
  • Python
  • Petroleum engineering
  • Reservoir analytics
  • QA engineering
  • Custom Software

⚡ Pain Points We Tackled

Fragmented calculation workflows for production forecasting and reservoir engineering led to compounding numerical precision errors and slow, unreliable analytical outputs. The existing computational core could not support the accuracy requirements of petroleum engineering analysis at production scale.

Our Approach

Azati rebuilt the modular Python analytics platform, replacing fragmented calculation workflows with a precision-engineered computational core. The team delivered four analytical modules within a four-month window and completed a full computational core redesign that passed UAT on the first attempt after ten days.

Applied Methods and Practices

  • Modular Python analytics: Production forecasting and reservoir engineering calculation workflows rebuilt as independent, testable modules.
  • Computational core redesign: 10-day engineering sprint eliminating the source of floating-point accumulation error.
  • QA-led delivery: UAT passed on first attempt following QA-led redesign of the core numerical computation layer.
  • Production forecasting modules: Four analytical modules covering reservoir engineering workflows, delivered on the production timeline.

Solution Features

  • 1000x reduction in numerical precision loss: Floating-point accumulation error eliminated at the computational core level.
  • 4 modules in 4 months: Full analytical module set delivered within the production engineering timeline.
  • UAT passed on first attempt: QA-led redesign produced a computational core that met all acceptance criteria without iteration.

Enterprise Construction Cost Estimation Platform for Oil and Gas
Energy, Oil & Gas

2 releases delivered early with future functionality demos
4 core calculation and pricing modules built and stabilized
Complex cross-module calculation logic validated for enterprise tender preparation
  • Custom Software
  • Oil & Gas
  • Cost estimation
  • Enterprise platform

⚡ Pain Points We Tackled

An oil and gas enterprise needed a construction cost estimation platform capable of handling complex cross-module calculation dependencies across large-scale capital projects. Existing tools could not reliably manage the interdependencies between estimation modules required for accurate project cost modeling.

Our Approach

Azati built an enterprise-grade construction cost estimation platform designed specifically for oil and gas project complexity, with architecture accounting for cross-module calculation dependencies and ensuring that changes in one estimation module propagate correctly across dependent calculations.

Applied Methods and Practices

  • Cross-module calculation dependencies: Architecture designed to handle complex interdependencies between cost estimation modules in large capital projects.
  • Enterprise-grade platform: Built for oil and gas project scale and complexity, not generic construction estimation.

Solution Features

  • Reliable cross-module cost estimation: Platform correctly propagates calculation changes across dependent modules, supporting accurate project cost modeling.
  • Oil and gas domain fit: Built for the specific complexity of capital project estimation in the energy sector.
