How to extract text from DOCX or ODT files using PHP

Shashwat Srivastava

Feb 8, 2011 — 2 min read

Are you looking for a way to extract text from DOCX or ODT files using PHP? In this article I will show you how. The technique can be used to build a web crawler that indexes document files by their content, for example to create a searchable document repository. It doesn't involve any third-party plugins or software. It works in PHP 5.2+, and the only requirement is the ZIP extension: php_zip.dll on Windows, or PHP built with the --enable-zip option on Linux. DOCX and ODT files are actually ZIP archives that use the extension .docx or .odt instead of .zip, so we need a ZIP library for PHP to read the data inside them.
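Before going further, it can help to confirm that ZIP support is actually available in your PHP installation. This is only a minimal sketch; the function name zipSupportAvailable is made up for illustration and is not part of PHP.

```php
<?php
// Hypothetical helper: reports whether the ZIP extension and its
// ZipArchive class are available in the running PHP installation.
function zipSupportAvailable() {
    return extension_loaded('zip') && class_exists('ZipArchive');
}

echo zipSupportAvailable()
    ? "ZIP extension is loaded\n"
    : "ZIP extension is missing: enable php_zip.dll (Windows) or build PHP with --enable-zip (Linux)\n";
```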

You can verify this yourself. Just open any DOCX or ODT file with a ZIP utility. Check out the screenshot below:

The text data is stored in word/document.xml for a DOCX file and in content.xml for an ODT file. To extract the text, all we need to do is read the contents of word/document.xml (for a DOCX file) or content.xml (for an ODT file) and display them after filtering out the XML tags.
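If you would rather check the archive layout from PHP than from a ZIP utility, something like the following sketch should list every entry in the file. The function name listArchiveEntries is an assumption made for this example, and the file name is just the sample document used later in this article.

```php
<?php
// Hypothetical helper: returns the names of all entries inside a
// ZIP-based document (DOCX or ODT), or an empty array if the file
// cannot be opened as a ZIP archive.
function listArchiveEntries($filename) {
    $zip = new ZipArchive;
    if ($zip->open($filename) !== true) {
        return array();
    }

    $names = array();
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $names[] = $zip->getNameIndex($i);
    }

    $zip->close();
    return $names;
}

// Point this at one of your own documents.
$file = 'attractive_prices.docx';
if (is_file($file)) {
    foreach (listArchiveEntries($file) as $name) {
        echo $name, "\n";
    }
}
```

For a DOCX file the list should include word/document.xml, and for an ODT file it should include content.xml.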

Create a new PHP file, name it extract.php, and add the following code to it:

<?php
/* Name of the document file */
$document = 'attractive_prices.docx';

/**
 * Function to extract text from a DOCX or ODT file
 */
function extracttext($filename) {
    // Check for extension (pathinfo avoids passing explode() to end(),
    // which raises a notice because end() expects a variable)
    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));

    // If it's a docx file
    if ($ext == 'docx')
        $dataFile = "word/document.xml";
    // else it must be an odt file
    else
        $dataFile = "content.xml";

    // Create a new ZIP archive object
    $zip = new ZipArchive;

    // Open the archive file
    if (true === $zip->open($filename)) {
        // If successful, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // Index found! Now read it to a string
            $text = $zip->getFromIndex($index);

            // Close the archive before returning
            $zip->close();

            // Load XML from a string, ignoring errors and warnings.
            // loadXML() is called on an instance because static calls fail in PHP 8.
            // LIBXML_NOENT is left out so entities in untrusted files are not expanded.
            $xml = new DOMDocument();
            $xml->loadXML($text, LIBXML_NOERROR | LIBXML_NOWARNING);

            // Remove XML formatting tags and return the text
            return strip_tags($xml->saveXML());
        }

        // Close the archive file
        $zip->close();
    }

    // In case of failure return a message
    return "File not found";
}

echo extracttext($document);
?>

The comments in the code snippet should help you understand it.
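One thing to be aware of: because strip_tags() simply drops every tag, the text of adjacent paragraphs runs together with no separator. The sketch below is a variation on extracttext(), not the original method. It reads the paragraph elements directly and joins them with newlines. The function name extractParagraphs is made up for this example. It assumes the usual namespace URIs for DOCX (w:p) and ODT (text:p) paragraphs, so ODT headings (text:h) are not included, and text boxes nested inside a paragraph may appear twice.

```php
<?php
// Hypothetical variation on extracttext(): returns one line per paragraph,
// or false if the file cannot be read or parsed.
function extractParagraphs($filename) {
    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));

    // DOCX paragraphs are <w:p> elements; ODT paragraphs are <text:p>.
    if ($ext == 'docx') {
        $dataFile = 'word/document.xml';
        $ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';
    } else {
        $dataFile = 'content.xml';
        $ns = 'urn:oasis:names:tc:opendocument:xmlns:text:1.0';
    }

    // Read the data file out of the archive.
    $zip = new ZipArchive;
    if ($zip->open($filename) !== true) {
        return false;
    }
    $xmlString = $zip->getFromName($dataFile);
    $zip->close();
    if ($xmlString === false || $xmlString === '') {
        return false;
    }

    // Parse the XML, suppressing libxml errors and warnings.
    $dom = new DOMDocument();
    if (!$dom->loadXML($xmlString, LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return false;
    }

    // Collect the text content of every paragraph element.
    $lines = array();
    foreach ($dom->getElementsByTagNameNS($ns, 'p') as $paragraph) {
        $lines[] = $paragraph->textContent;
    }

    return implode("\n", $lines);
}
```

You would call it the same way as the original function, for example echo extractParagraphs('attractive_prices.docx'); and check for a false return value before printing.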
