Open-source News

A look inside an EPUB file

opensource.com - Tue, 08/16/2022 - 15:00
Jim Hall - Tue, 08/16/2022 - 03:00

eBooks provide a great way to read books, magazines, and other content on the go. Readers can enjoy eBooks to pass the time during long flights and train rides. The most popular eBook file format is the EPUB file, short for "electronic publication." EPUB files are supported across a variety of eReaders and are effectively the standard for eBook publication today.

The EPUB file format is an open standard based on XHTML for content and XML for metadata, contained in a zip file archive. Because everything is based on open standards, we can use common tools to create or examine EPUB files. Let's explore an EPUB file to learn more about it. The example here is a guide to tips and tricks for C programming, published earlier this year on Opensource.com and available in PDF or EPUB format.

Because EPUB files are XHTML content and XML metadata in a zip file, you can start with the unzip command to examine the EPUB from the command line:

$ unzip -l osdc_Jim-Hall_C-Programming-Tips.epub
Archive:  osdc_Jim-Hall_C-Programming-Tips.epub
  Length      Date    Time    Name
---------  ---------- -----   ----
       20  06-23-2022 00:20   mimetype
     8259  06-23-2022 00:20   OEBPS/styles/stylesheet.css
     1659  06-23-2022 00:20   OEBPS/toc.xhtml
     4460  06-23-2022 00:20   OEBPS/content.opf
    44157  06-23-2022 00:20   OEBPS/sections/section0018.xhtml
     1242  06-23-2022 00:20   OEBPS/sections/section0002.xhtml
    22429  06-23-2022 00:20   OEBPS/sections/section0008.xhtml
[...]
     9628  06-23-2022 00:20   OEBPS/sections/section0016.xhtml
      748  06-23-2022 00:20   OEBPS/sections/section0001.xhtml
     3370  06-23-2022 00:20   OEBPS/toc.ncx
     8308  06-23-2022 00:21   OEBPS/images/image0011.png
     6598  06-23-2022 00:21   OEBPS/images/image0009.png
[...]
    14492  06-23-2022 00:21   OEBPS/images/image0005.png
      239  06-23-2022 00:20   META-INF/container.xml
---------                     -------
   959201                     41 files
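
If unzip isn't handy, the same listing can be produced with Python's standard zipfile module. This is only a minimal sketch: the function names list_epub_entries and print_listing are my own, not part of any EPUB tool, and the output is a simplified version of what unzip -l prints (sizes and names only).

```python
import zipfile


def list_epub_entries(epub):
    """Return (size, name) pairs for each archive entry, in archive order.

    `epub` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(epub) as archive:
        return [(info.file_size, info.filename) for info in archive.infolist()]


def print_listing(epub):
    """Print a simplified `unzip -l` style listing: size and name only."""
    entries = list_epub_entries(epub)
    for size, name in entries:
        print(f"{size:>9}  {name}")
    total = sum(size for size, _ in entries)
    print(f"{total:>9}  {len(entries)} files")
```

For example, calling print_listing("osdc_Jim-Hall_C-Programming-Tips.epub") should list the same entries shown above, assuming the file is in the current directory.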

This EPUB contains a lot of files, but most of them are content. To understand how an EPUB file is put together, follow the process flow of an eBook reader:

  1. eBook readers need to verify that the EPUB file is really an EPUB file. They verify the file by examining the mimetype file at the root of the EPUB archive. This file contains just one line that describes the MIME type of the EPUB file:

    application/epub+zip
  2. To locate the content, eBook readers start with the META-INF/container.xml file. This is a brief XML document that indicates where to find the content. For this EPUB file, the container.xml file looks like this:

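    <?xml version="1.0" encoding="UTF-8"?>
    <container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
      <rootfiles>
        <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
      </rootfiles>
    </container>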

    To make the container.xml file easier to read, I split the single line into multiple lines and added some spacing to indent each line. XML files don't really care about extra white space like new lines and spaces, so this extra spacing doesn't affect the XML file.

  3. The container.xml file says the root of the EPUB starts with the content.opf file in the OEBPS directory. The file has an .opf extension because EPUB is based on the Open Packaging Format, but the content.opf file is really just another XML file.

  4. The content.opf file contains a complete manifest of the EPUB contents, plus an ordered table of contents, with references to find each chapter or section. The content.opf file for this EPUB is quite long, so I'll show just a bit of it here as an example.

    The XML data is contained within a <package> block, which itself has a <metadata> block, the <manifest> data, and a <spine> block that contains the eBook's table of contents:

    <?xml version="1.0" encoding="UTF-8"?>
    <package unique-identifier="unique-identifier" version="3.0" xmlns="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:opf="http://www.idpf.org/2007/opf">
      <metadata>
        <dc:identifier id="unique-identifier">osdc002</dc:identifier>
        <dc:title>Tips and Tricks for C Programming</dc:title>
        <dc:creator>Jim Hall</dc:creator>
        <dc:language>English</dc:language>
        <meta property="dcterms:modified">2022-06-23T12:09:13Z</meta>
        <meta content="LibreOffice/7.3.0.3$Linux_X86_64 LibreOffice_project/0f246aa12d0eee4a0f7adcefbf7c878fc2238db3 (libepubgen/0.1.1)" name="generator"/>
      </metadata>
      <manifest>
        ...
        <item href="sections/section0001.xhtml" id="section0001" media-type="application/xhtml+xml"/>
        <item href="images/image0003.png" id="image0003" media-type="image/png"/>
        <item href="styles/stylesheet.css" id="stylesheet.css" media-type="text/css"/>
        <item href="toc.ncx" id="toc.ncx" media-type="application/x-dtbncx+xml"/>
        ...
      </manifest>
      <spine toc="toc.ncx">
        <itemref idref="section0001"/>
        <itemref idref="section0002"/>
        <itemref idref="section0003"/>
        ...
      </spine>
    </package>

    You can match up the data to see where to find each section. That’s how EPUB readers do it. For example, the first item in the table of contents references section0001 which is defined in the manifest as located in the sections/section0001.xhtml file. The file doesn’t need to be named the same as the idref entry, but that’s how LibreOffice Writer’s automated process created the file. (You can see in the metadata that this EPUB was created with LibreOffice version 7.3.0.3 on Linux, which can export content as EPUB files.)
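
The four steps above can be sketched in a few lines of Python using only the standard library. Treat this as an illustration of the lookup chain rather than how any particular eBook reader is implemented: the function name reading_order is hypothetical, and the sketch assumes the href values in the manifest are plain relative paths with no URL escaping.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by container.xml and by the OPF package document.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def reading_order(epub):
    """Return the archive paths of the content files, in spine order.

    `epub` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(epub) as archive:
        # Step 1: the mimetype file identifies the archive as an EPUB.
        mimetype = archive.read("mimetype").decode("ascii").strip()
        if mimetype != "application/epub+zip":
            raise ValueError(f"not an EPUB: mimetype is {mimetype!r}")

        # Step 2: container.xml says where the package document lives.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        if rootfile is None:
            raise ValueError("container.xml has no rootfile entry")
        opf_path = rootfile.get("full-path")

        # Step 3: the package document is just another XML file.
        package = ET.fromstring(archive.read(opf_path))

        # Step 4: map each manifest id to its href, then follow the spine.
        hrefs = {
            item.get("id"): item.get("href")
            for item in package.findall("opf:manifest/opf:item", OPF_NS)
        }
        # Manifest hrefs are relative to the directory holding the OPF file.
        base = posixpath.dirname(opf_path)
        return [
            posixpath.join(base, hrefs[ref.get("idref")])
            for ref in package.findall("opf:spine/opf:itemref", OPF_NS)
        ]
```

For the EPUB examined here, the first path returned should be OEBPS/sections/section0001.xhtml, because the first itemref in the spine points to the section0001 entry in the manifest.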

The EPUB format

EPUB files are a great way to publish content using an open format. The EPUB file format is XML metadata with XHTML content, inside a zip container. While most technical writers use tools to create EPUB files, the fact that EPUB is based on open standards means you can also create your own EPUB files some other way.
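
As one example of "some other way," here is a rough Python sketch that writes a tiny one-chapter EPUB with the standard zipfile module. Everything in it is illustrative: the function name write_minimal_epub, the identifier example-0001, the title, and the chapter text are all made up, and I haven't run the result through an EPUB validator. The one detail that comes from the EPUB container rules is that the mimetype file goes first in the archive and is stored uncompressed.

```python
import zipfile

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

# Placeholder metadata: the identifier, title, and date are invented.
CONTENT_OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package unique-identifier="book-id" version="3.0" xmlns="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <metadata>
    <dc:identifier id="book-id">example-0001</dc:identifier>
    <dc:title>Example Book</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">2022-06-23T12:00:00Z</meta>
  </metadata>
  <manifest>
    <item href="toc.xhtml" id="toc" media-type="application/xhtml+xml" properties="nav"/>
    <item href="sections/section0001.xhtml" id="section0001" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="section0001"/>
  </spine>
</package>
"""

TOC_XHTML = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
  <head><title>Contents</title></head>
  <body>
    <nav epub:type="toc">
      <ol><li><a href="sections/section0001.xhtml">Chapter 1</a></li></ol>
    </nav>
  </body>
</html>
"""

SECTION_XHTML = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Chapter 1</title></head>
  <body><h1>Chapter 1</h1><p>Hello, EPUB.</p></body>
</html>
"""


def write_minimal_epub(target):
    """Write a one-chapter EPUB to `target` (a path or binary file object)."""
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # The mimetype entry goes first and is stored without compression.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", CONTAINER_XML)
        archive.writestr("OEBPS/content.opf", CONTENT_OPF)
        archive.writestr("OEBPS/toc.xhtml", TOC_XHTML)
        archive.writestr("OEBPS/sections/section0001.xhtml", SECTION_XHTML)
```

Calling write_minimal_epub("example.epub") writes the archive to disk; running unzip -l on the result should show the same mimetype, META-INF, and OEBPS layout as the EPUB examined above.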


Image by: Lewis Cowles, CC BY-SA 4.0

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.

LZ4 v1.9.4 Achieves 20~70% Speedups For Some CPUs & Configurations

Phoronix - Tue, 08/16/2022 - 07:22
LZ4 v1.9.4 is out today as the first point release in nearly two years for this BSD-licensed, speedy, lossless compression algorithm...

Qualcomm Posts "QAIC" DRM Accelerator Driver For Linux

Phoronix - Tue, 08/16/2022 - 03:00
After Qualcomm announced their Cloud AI 100 Accelerator back in 2019, in 2020 during the early days of the pandemic they posted a Linux driver for this accelerator. That driver didn't get picked up for the mainline Linux kernel and two years later there still is little fanfare around the Qualcomm AI Cloud Accelerator hardware. However, now they have posted a new Linux driver that goes the DRM driver route...

Android 13 Sources Released To AOSP

Phoronix - Tue, 08/16/2022 - 01:36
Google announced today that the Android 13 sources have been published to the Android Open-Source Project as part of officially releasing this newest version of Android...

GNOME 43 Beta Released With More GTK 4 Porting, Other Desktop Improvements

Phoronix - Tue, 08/16/2022 - 00:31
The beta of GNOME 43 is now available for testing ahead of the stable release next month...

Linux 6.0 Supporting New Intel/AMD Hardware, Performance Improvements & Much More

Phoronix - Mon, 08/15/2022 - 21:45
Yesterday marked the release of Linux 6.0-rc1 and as such the merge window is now over and no more feature work is set to land in this kernel version. Here is my write-up of all the interesting new features and changes/improvements coming for Linux 6.0.
