How to extract images from DOCX files using PHP

In this post I will show you how to extract and display images from Microsoft Word document files (.docx extension) using PHP. This can be useful for building a document repository and indexing its images. We won't use any third-party software or module for this. The code works in PHP 5.2+, and the only requirement is a ZIP library for PHP: php_zip.dll on Windows, or PHP built with the --enable-zip option on Linux. The reason behind this requirement is that Word document files are actually ZIP archives with their extension changed from .zip to .docx.
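Before going further, it can help to confirm that ZIP support is actually enabled in your PHP installation. The snippet below is a minimal sketch of such a check; the function name zipSupportAvailable is my own and is not part of PHP.

```php
<?php
/* Hypothetical helper: returns true when the zip extension and its ZipArchive class are available */
function zipSupportAvailable() {
    return extension_loaded('zip') && class_exists('ZipArchive');
}

if (!zipSupportAvailable()) {
    echo "The PHP zip extension is not enabled; enable it before running the scripts in this post.\n";
}
```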

Try opening a docx file with a ZIP utility. The screenshot below shows a docx file named attractive_prices.docx extracted with WinRAR.

As you can see, there are various folders and XML files inside this archive. We can even extract the text from the document file; my next post will cover this. Word stores all the media files in the word/media directory.

So if we want to extract and display images from a docx file, all we need to do is look inside the word/media directory of the archive and display all the images stored there.
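To make the idea concrete, here is a small sketch that only lists the entries stored under word/media. The helper name listDocxMedia is hypothetical, and the demo archive is a stand-in built on the fly (it is not a valid Word document) so that the snippet runs without a real file.

```php
<?php
/* Hypothetical helper: returns the names of all entries stored under word/media */
function listDocxMedia($filename) {
    $names = array();
    $zip = new ZipArchive;
    if (true !== $zip->open($filename)) {
        return $names; // not a readable ZIP archive
    }
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $name = $zip->getNameIndex($i);
        // Keep entries inside word/media, skipping the directory entry itself
        if (strpos($name, 'word/media/') === 0 && substr($name, -1) !== '/') {
            $names[] = $name;
        }
    }
    $zip->close();
    return $names;
}

/* Build a tiny stand-in archive (assumption: the temp directory is writable) */
$demo = sys_get_temp_dir() . '/demo_' . uniqid() . '.docx';
$zip = new ZipArchive;
$zip->open($demo, ZipArchive::CREATE);
$zip->addFromString('word/document.xml', '<w:document/>');
$zip->addFromString('word/media/image1.png', 'fake image bytes');
$zip->close();

$media = listDocxMedia($demo);
print_r($media);
unlink($demo);
```

With a real document you would pass its path, for example listDocxMedia('attractive_prices.docx'), instead of the stand-in archive.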

Create a new PHP file, name it extract.php, and add the following code to it.

<?php

/* Name of the document file */
$document = 'attractive_prices.docx';

/* Function to extract images */
function readZippedImages($filename) {

    /* Create a new ZIP archive object */
    $zip = new ZipArchive;

    /* Open the received archive file */
    if (true === $zip->open($filename)) {

        /* Loop through all the files to check for image files */
        for ($i = 0; $i < $zip->numFiles; $i++) {
            $zip_element = $zip->statIndex($i);

            /* Check for images by file extension (case-insensitive) */
            if (preg_match('/\.(jpg|jpeg|png|gif|bmp)$/i', $zip_element['name'])) {

                /* Display images if present by using display.php */
                echo "<img src='display.php?filename=" . urlencode($filename) . "&amp;index=" . $i . "' /><hr />";
            }
        }

        /* Release the archive once all entries have been checked */
        $zip->close();
    }
}
readZippedImages($document);
?>

Now create another PHP file, name it display.php, and add the following code to it.

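<?php

/* Only accept a plain file name so the script cannot be pointed at other directories */
$filename = isset($_GET['filename']) ? basename($_GET['filename']) : '';
$index = isset($_GET['index']) ? (int) $_GET['index'] : -1;

/* Content types for the image extensions that extract.php looks for */
$types = array('jpg' => 'image/jpeg', 'jpeg' => 'image/jpeg', 'png' => 'image/png', 'gif' => 'image/gif', 'bmp' => 'image/bmp');

/* Create a new ZIP archive object */
$zip = new ZipArchive;

/* Open the received archive file */
if ($filename !== '' && $index >= 0 && true === $zip->open($filename)) {

    /* Look up the name of the entry at the specified index */
    $name = $zip->getNameIndex($index);
    $extension = strtolower(pathinfo((string) $name, PATHINFO_EXTENSION));

    if (isset($types[$extension])) {

        /* Tell the browser which kind of image we are sending */
        header('Content-Type: ' . $types[$extension]);

        /* Get the content of the specified index of ZIP archive */
        echo $zip->getFromIndex($index);
    }

    /* Close the archive only after it was opened successfully */
    $zip->close();
}
?>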

Comments have been included in both scripts for easy understanding. Check out the working demo.
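If you would rather save the images to disk (for the document repository idea mentioned at the start) than stream them to the browser, a variation along the following lines might work. This is only a sketch: the function name saveDocxImages and the extracted_images folder are assumptions of mine, and files that share a name would overwrite each other.

```php
<?php
/* Hypothetical helper: copies every image under word/media into $targetDir and returns the saved paths */
function saveDocxImages($filename, $targetDir) {
    $saved = array();
    $zip = new ZipArchive;
    if (true !== $zip->open($filename)) {
        return $saved; // archive missing or unreadable
    }
    if (!is_dir($targetDir)) {
        mkdir($targetDir, 0755, true);
    }
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $name = $zip->getNameIndex($i);
        if (!preg_match('#^word/media/[^/]+\.(jpe?g|png|gif|bmp)$#i', $name)) {
            continue;
        }
        // basename() drops the word/media/ prefix so files land directly in $targetDir
        $path = rtrim($targetDir, '/') . '/' . basename($name);
        if (false !== file_put_contents($path, $zip->getFromIndex($i))) {
            $saved[] = $path;
        }
    }
    $zip->close();
    return $saved;
}

/* Returns an empty array if the document is not found next to the script */
$saved = saveDocxImages('attractive_prices.docx', 'extracted_images');
```

The returned array holds the paths of the files that were written, which you could then store in an index of your own.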
