Detecting all pages which contain color

Question

In a larger LaTeX document there are often only a few pages with color content (mainly figures), while the remaining ones are purely black and white.
Because printing costs for color pages are much higher than for black and white, it would be good to be able to extract all pages with color and print them separately. The first step is to detect whether a page contains color or not. The result could take the form of a text list of page numbers suitable to be read by a PDF page extraction script (using e.g. pdftk).

A simple solution sufficient for many people would be to detect all pages which contain a figure and assume that only these have color. However, a general solution would be nice. Only color elements which are actually printed should be taken into account, while e.g. the colored frames that hyperref draws around links should not. It is OK if the solution disables these frames for the detection.

Martin Scharrer · Answer

For the general case it indeed seems better to use an external tool to test which pages contain color. This is the topic of the SO question
"How do I know if PDF pages are color or black-and-white?".
I have now written an answer to it which includes a small script for this.

However, it is much easier to get a list of all pages containing figures.
Here I use the zref-abspage package to get an absolute page counter.
A normal (non-immediate) \write can be used: it expands its content only when the surrounding material is actually placed on a page, so the page counters have the correct values.
The end macro of figure (\endfigure) can then simply be patched to contain this code.

\documentclass{book}
\usepackage{mwe}

\usepackage{zref-abspage}% absolute page counter
\newwrite\figpages
\openout\figpages=\jobname.fpg
\makeatletter
\g@addto@macro\endfigure{%
    % Write absolute page number and page label to file
    % Do not use \immediate!
    \write\figpages{\number\value{abspage}: \thepage}%
}
\makeatother

\newcount\mycount% for example loop
\begin{document}
\frontmatter
\Blindtext

\begin{figure}
    \centering
    \includegraphics[width=.8\textwidth,height=5cm]{example-image}
    \caption{Some caption}
\end{figure}

\mainmatter
\Blindtext

\loop% keep MWE small by using a loop

\begin{figure}
    \centering
    \includegraphics[width=.8\textwidth,height=5cm]{example-image}
    \caption{Some caption}
\end{figure}

{\Blindtext}

\advance\mycount by 1
    \ifnum\mycount<20\relax
\repeat

\backmatter
\appendix
\Blindtext

\begin{figure}
    \centering
    \includegraphics[width=.8\textwidth,height=5cm]{example-image}
    \caption{Some caption}
\end{figure}

\end{document}

This generates a .fpg file (for figure pages) which looks like:

2: ii
4: 2
5: 3
7: 5
8: 6
10: 8
11: 9
13: 11
14: 12
16: 14
18: 16
19: 17
21: 19
22: 20
24: 22
25: 23
27: 25
28: 26
30: 28
31: 29
33: 31
38: 36

The format can be changed if required.
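
As one possible way to use this file, the following Python sketch (not part of the original answer) reads lines in the "absolute page: page label" format shown above and collapses the absolute page numbers into a range string of the kind pdftk's cat operation accepts. The function names read_figure_pages and to_ranges are made up for illustration, and the sketch assumes the default output format has not been changed.

```python
def read_figure_pages(lines):
    """Return sorted, de-duplicated absolute page numbers from .fpg lines."""
    pages = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Each line looks like "4: 2": absolute page, colon, page label.
        absolute, _, _label = line.partition(":")
        pages.add(int(absolute))
    return sorted(pages)


def to_ranges(pages):
    """Collapse sorted page numbers into a range string such as '2 4-5 7'."""
    ranges = []
    start = prev = None
    for page in pages:
        if start is None:
            start = prev = page
        elif page == prev + 1:
            prev = page  # extend the current run
        else:
            ranges.append((start, prev))
            start = prev = page
    if start is not None:
        ranges.append((start, prev))
    return " ".join(str(a) if a == b else "%d-%d" % (a, b) for a, b in ranges)


if __name__ == "__main__":
    sample = ["2: ii", "4: 2", "5: 3", "7: 5", "5: 3"]
    print(to_ranges(read_figure_pages(sample)))
```

For the sample lines this prints "2 4-5 7", which could then be used as in pdftk input.pdf cat 2 4-5 7 output figures.pdf (the file names are placeholders). A set is used because two figures on the same page produce two identical lines.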

Kurt Pfeifle · Answer

Newer versions of Ghostscript (version 9.05 and later) include a "device" called inkcov. It calculates the ink coverage of each page (not for each image) in Cyan (C), Magenta (M), Yellow (Y) and Black (K) values, where 0.00000 means 0%, and 1.00000 means 100%.
Example commandline:
gs -o - -sDEVICE=inkcov /path/to/your.pdf

Example output:
Page 1
0.00000  0.00000  0.00000  0.02230 CMYK OK
Page 2
0.02360  0.02360  0.02360  0.02360 CMYK OK
Page 3
0.02525  0.02525  0.02525  0.00000 CMYK OK
Page 4
0.00000  0.00000  0.00000  0.01982 CMYK OK

You can see here that the pages 1+4 are using no color, while pages 2+3 do. This case is particularly 'nasty' for people who want to save on color ink: because all the respective C, M, Y (and K) values are exactly the same for each of the pages 2+3, they possibly could appear to the human eye not as color pages, but as ("rich") grayscale anyway (if each single pixel is mixed with these color values).
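
The reading of this output can be automated. The Python sketch below is only an illustration under assumptions: it takes the text printed by the inkcov device and labels each page as black and white, possibly rich gray (equal non-zero C, M and Y), or color. The name classify_pages and the three labels are hypothetical, and equal page totals do not prove that every pixel is neutral, so the "rich-gray" label is a hint for manual checking only.

```python
import re

# One inkcov result line: four fractions followed by "CMYK OK".
INKCOV_LINE = re.compile(
    r"^\s*([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+CMYK OK\s*$"
)


def classify_pages(inkcov_output):
    """Return a list of (page, kind), kind being 'bw', 'rich-gray' or 'color'.

    Pages are counted by result lines, so "Page N" header lines are optional.
    """
    result = []
    page = 0
    for line in inkcov_output.splitlines():
        match = INKCOV_LINE.match(line)
        if not match:
            continue  # skip "Page N" headers and any other output
        page += 1
        c, m, y, _k = (float(v) for v in match.groups())
        if c == 0 and m == 0 and y == 0:
            kind = "bw"
        elif c == m == y:
            kind = "rich-gray"  # heuristic: equal C, M, Y coverage
        else:
            kind = "color"
        result.append((page, kind))
    return result


if __name__ == "__main__":
    sample = (
        "Page 1\n"
        "0.00000  0.00000  0.00000  0.02230 CMYK OK\n"
        "Page 2\n"
        "0.02360  0.02360  0.02360  0.02360 CMYK OK\n"
    )
    print(classify_pages(sample))
```
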
Ghostscript can also convert color into grayscale. Example commandline:
gs \
  -o grayscale.pdf \
  -sDEVICE=pdfwrite \
  -sColorConversionStrategy=Gray \
  -sProcessColorModel=/DeviceGray \
   /path/to/your.pdf

Checking for the ink coverage distribution again (note how the addition of -q to the parameters slightly changes the output format):
gs -q  -o - -sDEVICE=inkcov grayscale.pdf
 0.00000  0.00000  0.00000  0.02230 CMYK OK
 0.00000  0.00000  0.00000  0.02360 CMYK OK
 0.00000  0.00000  0.00000  0.02525 CMYK OK
 0.00000  0.00000  0.00000  0.01982 CMYK OK

Chris H · Answer

There's a rather useful python script at http://homepages.inf.ed.ac.uk/imurray2/code/hacks/pdfcolorsplit which uses pdftk to split into colour and b&w files, though it doesn't deal with the boxes around hyperrefs. If you have access to the LaTeX source, why not turn off the colour in hyperref anyway - I do it like this:

\usepackage[colorlinks=true,
            linkcolor=black,
            citecolor=black,
            filecolor=black,
            urlcolor=black]{hyperref}

IIRC if you just set [colorlinks=false] they're not clickable.

nikos · Answer

Here is a MATLAB script which uses Kurt Pfeifle's answer to split a PDF into two files, one colour and one grayscale. The original file is preserved. It does not handle double-sided printing.

It is not bullet-proof and might need some debugging, but hopefully it will work out of the box.

You will need:

Ghostscript (version 9.05 and later)
MATLAB functions ghostscript() and user_input() from here
pdftk

Here is the script (you will need to change pathToFile and fName, and possibly pdftkPath):

clear all; close all; clc;

%Change these:
pathToFile = '/Users/nikos/Desktop/';
fName = 'thesis.pdf';
%you might need to change the path to pdftk (if in windows for example)
pdftkPath = '/usr/local/bin/pdftk';

disp('Reminder: you might want to set \hypersetup{colorlinks=false} in latex');
disp('Do you want to manually set as grayscale any pages that have (C == M == Y)?');
a = input('Otherwise they will be treated as colour! (y/n) ','s');

if (a~= 'y' && a~='Y')
    manualMode = false;
else
    manualMode = true;
end

[status, ret] = ghostscript(['-o - -sDEVICE=inkcov ',pathToFile,fName]);

inds = strfind(ret,'0.');
pages = length(inds)/4;

if (round(pages) ~= pages)
    disp('Something went wrong');
    disp('Check the variable ret');
    disp('I am looking for the string ''0.'' which should only occur when listing CMYK values');
end

a = input(['Is your pdf ', num2str(pages), ' pages long (y/n) ?'],'s');

if (a ~= 'y' && a ~= 'Y')
    return; % stop the script; break is only valid inside a loop
end

disp([num2str(pages), ' pages processed.']);
c = 1:4:length(inds);
m = 2:4:length(inds);
y = 3:4:length(inds);
k = 4:4:length(inds);

colorPages = '';
bwPages = '';
cpCounter = 0;
bwCounter = 0;
for i = 1:pages
    C = str2num(ret(inds(c(i)):inds(c(i))+6));
    M = str2num(ret(inds(m(i)):inds(m(i))+6));
    Y = str2num(ret(inds(y(i)):inds(y(i))+6));
    K = str2num(ret(inds(k(i)):inds(k(i))+6));

    if (C == 0 && M == 0 && Y == 0)
        bwPages = [bwPages, ' ',num2str(i)];
        bwCounter = bwCounter+1;
    elseif (C == M && C == Y && manualMode)
        a = input(['Is page ', num2str(i), ' colour (C == M == Y) (y/n) ?'],'s');
        if (a ~= 'y' && a ~= 'Y')
            bwPages = [bwPages, ' ',num2str(i)];
            bwCounter = bwCounter+1;            
        else
            colorPages = [colorPages, ' ', num2str(i)];
            cpCounter = cpCounter+1;
        end
    else
        colorPages = [colorPages, ' ', num2str(i)];
        cpCounter = cpCounter+1;
    end
end

cName = [pathToFile, 'color_',fName];
bName = [pathToFile, 'bw_',fName];
disp([cName, ' (',num2str(cpCounter), ' pages)']);
disp([bName, ' (',num2str(bwCounter), ' pages)']);

system([pdftkPath, ' ', pathToFile, fName, ' cat ', colorPages,' output ', cName]);
system([pdftkPath, ' ', pathToFile, fName, ' cat ', bwPages,' output ', bName]);
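
The script above treats every page on its own. For double-sided printing, a black-and-white page that shares a sheet with a colour page has to go to the colour printer as well. A minimal sketch of that adjustment follows, written in Python rather than MATLAB; the function names are hypothetical, and it assumes that pages 1-2, 3-4, and so on share a sheet.

```python
def expand_to_sheets(color_pages, num_pages):
    """Return the colour pages plus their sheet partners for duplex printing.

    Assumes sheet n carries pages 2n-1 (front) and 2n (back); a trailing
    odd page has no partner.
    """
    expanded = set()
    for page in color_pages:
        front = page if page % 2 == 1 else page - 1
        expanded.add(front)
        if front + 1 <= num_pages:
            expanded.add(front + 1)
    return sorted(expanded)


def split_pages(color_pages, num_pages):
    """Split pages 1..num_pages into (colour, bw) lists after sheet expansion."""
    color = expand_to_sheets(color_pages, num_pages)
    color_set = set(color)
    bw = [p for p in range(1, num_pages + 1) if p not in color_set]
    return color, bw


if __name__ == "__main__":
    color, bw = split_pages([2, 5], 7)
    # Space-separated lists, as the MATLAB script passes them to pdftk cat.
    print(" ".join(map(str, color)))
    print(" ".join(map(str, bw)))
```

With colour detected on pages 2 and 5 of a 7-page file, the colour list becomes 1 2 5 6 and the black-and-white list 3 4 7.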

Gabriel · Answer

I extend Chris H's answer:

I extended the pdfcolorsplit.py script with an option -r that reassembles all split parts into a final PDF, converting all b/w parts to grayscale before reassembling.

Use it like this (the -p option worked best):

./pdfcolorsplit.py -p -v -s -r Report.pdf

The code is here:

#!/usr/bin/env python
# Python 2 and 3 compatible.

# Python program to take a pdf file, and split it into color and black
# and white part(s). Requires pdftk and one of gs and pdftoppm.
#
# Iain Murray, February 2010.
#
# Inspired by dvicoloursplit.py, Jeremy Sanders 2001, although written
# from scratch.
#
# 2011-09-19 fixed bug with odd numbers of pages reported by Richard Shaw
# 2012-06-11 tweaked to run in Python 3 as well as 2.

##  This program is free software; you can redistribute it and/or modify
##  it under the terms of the GNU General Public License as published by
##  the Free Software Foundation; either version 2 of the License, or
##  (at your option) any later version.

##  This program is distributed in the hope that it will be useful,
##  but WITHOUT ANY WARRANTY; without even the implied warranty of
##  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
##  GNU General Public License for more details.

import os, os.path, sys, string, re, tempfile, shutil, getopt
import heapq

def a2b(x):
    """Turn ascii into bytes for Python 3, in way that works with Python 2"""
    try:
        return bytes(x)
    except:
        return bytes(x, 'ascii')

def iscolorppm(filename):
    """Does the PPM file contain any non-grayscale colors?"""
    file = open(filename, 'rb')
    # Ugly: I read the whole file into RAM, and copy it needlessly a lot
    data = file.read()
    file.close()

    # PPM is a *very* liberal file format. It allows comments anywhere in the
    # header, even in the middle of tokens.
    comments_re = re.compile(a2b('^([^ \t\n]*)#[^\n]*\n'))
    split_re = re.compile(a2b('^([ \t\n]|#[^\n]*\n)+([^ \t\n#])'))
    tok_re = re.compile(a2b('^([^ \t\n]*)([ \t\n].*)'), re.DOTALL)
    toks = []
    while len(toks) < 4:
        while split_re.match(data):
            # Replacement must be bytes as well, since the pattern is bytes
            data = split_re.sub(a2b(r'\2'), data)
        while comments_re.match(data):
            data = comments_re.sub(a2b(r'\1'), data)
        (tok, data) = tok_re.match(data).groups()
        toks.append(tok)
    magic = toks[0]
    (width, height, max_color) = map(int, toks[1:])
    data = data[1:]

    if magic == b'P3':
        binary = False
    elif magic == b'P6':
        binary = True
    else:
        print("%s is not a valid PPM file" % filename)
        sys.exit(1)

    # Massage data so adjacent triples should have the same value in b/w images
    data_len = width*height*3
    if binary:
        if int(max_color) > 255:
            # Untested. Each intensity is in two bytes.
            data_len *= 2
            data = data[1:data_len:2] + data[:data_len:2]
    else:
        data = [int(x) for x in data.split()]

    if len(data) < data_len:
        print('PPM file is truncated?')
        sys.exit(1)

    triples = zip(data[0:data_len:3], data[1:data_len:3], data[2:data_len:3])
    black_and_white = all((a==b and a==c for (a,b,c) in triples))
    return not black_and_white


def pdfcolorsplit(file, doublesided, merge, use_pdftoppm, reassemble, verbose):
    # Work out which pages are color
    if verbose:
        print('Analyzing %s...' % file)
    tmpdir = tempfile.mkdtemp(prefix = 'pdfcs_')
    if use_pdftoppm:
        root = os.path.join(tmpdir, 'page')
        os.system('pdftoppm -r 20 "%s" "%s"' % (file, root))
    else:
        gs_opts = '-sDEVICE=ppmraw -dBATCH -dNOPAUSE -dSAFE -r20'
        if not verbose:
            gs_opts += ' -q'
        os.system('gs ' + gs_opts + ' -sOutputFile="%s" "%s"' 
                % (os.path.join(tmpdir, 'tmp%06d.ppm'), file))
    PPMs = os.listdir(tmpdir)
    PPMs.sort()
    iscolor = [iscolorppm(os.path.join(tmpdir, x)) for x in PPMs]
    num_pages = len(iscolor)
    shutil.rmtree(tmpdir)
    if doublesided:
        # Treat as color those b/w pages that share a sheet with a color page
        iscolorpair = [x or y for (x,y) in zip(iscolor[::2], iscolor[1::2])]
        iscolor[:2*len(iscolorpair):2] = iscolorpair
        iscolor[1::2] = iscolorpair

    # Construct page range strings
    flips = [x for x in range(2,num_pages+1) if iscolor[x-1] != iscolor[x-2]]
    if not flips:
        if verbose:
            print('No splitting needs to be done, skipping %s' % file)
        return
    edges = [1] + flips + [num_pages+1]
    ranges = ['%d-%d' % (x,y-1) for (x,y) in zip(edges[:-1], edges[1:])]

    print(iscolor, ranges)

    # Finally output split files
    if verbose:
        print('Outputting splits as new pdf files...')
    base_name = file
    if base_name.lower().endswith('.pdf'):
        base_name = base_name[:-4]
    suffixes = ['_bwsplit', '_colorsplit']
    # jobs is a seq of (range, filename) pairs, e.g. ('1-3', 'colorbits.pdf')
    # convert jobs
    if merge:
        jobs = ((' '.join(ranges[0::2]), base_name + suffixes[iscolor[0]]),
                (' '.join(ranges[1::2]), base_name + suffixes[not iscolor[0]]))
    else:
        jobs = [(r, '%s_%03d%s' % (base_name,n+1,suffixes[(n+iscolor[0])%2])) 
                for (n,r) in enumerate(ranges)]



    for (pages, name) in jobs:
        if verbose:
            print('pdftk "%s" cat %s output "%s.pdf"' % (file, pages, name))
        os.system('pdftk "%s" cat %s output "%s.pdf"' % (file, pages, name))

    # reassemble all continuous files into final output by converting b/w parts to grayscale
    if reassemble:
      graySuffix = "_gray"
      jobsconvert = [ j for j in jobs[ int(iscolor[0])::2] ]
      #print(jobsconvert)
      # convert all b/w to gray
      for (pages,name) in jobsconvert:
        cmd="gs  -sOutputFile=%s%s.pdf  -sDEVICE=pdfwrite  -dAutoRotatePages=/None -sColorConversionStrategy=Gray  -dProcessColorModel=/DeviceGray  -dCompatibilityLevel=1.4  -dNOPAUSE  -dBATCH %s.pdf" % (name,graySuffix,name)
        if verbose:
           print(cmd) 
        os.system(cmd)

      ## interleave converted b/w and colors and make pdftk cat command
      cJobs = jobs[0::2] if iscolor[0] else jobs[1::2]
      #print(cJobs)
      bwJobs = [ (pages,name+graySuffix) for pages,name in jobsconvert]

      def interleave(l1, l2):
        iter1 = iter(l1)
        iter2 = iter(l2)
        while True:
            try:
                if iter1 != None:
                    yield next(iter1)
            except StopIteration:
                iter1 = None
            try:
                if iter2 != None:
                    yield next(iter2)
            except StopIteration:
                iter2 = None
            if iter1 is None and iter2 is None:
                # End the generator; raising StopIteration here becomes a
                # RuntimeError in Python 3.7+ (PEP 479)
                return


      jobsCatAll = interleave(cJobs,bwJobs) if iscolor[0] else interleave(bwJobs,cJobs)
      #print(list(jobsCatAll))

      cmd = "pdftk " + " ".join([j[1]+".pdf" for j in jobsCatAll]) + " cat output %s%s.pdf " % (base_name,"_all")
      if verbose:
        print(cmd)  
      os.system(cmd)

def usage():
    progname = os.path.basename(sys.argv[0])
    print('Usage: %s [OPTIONS] <PDF-file(s)>' % progname)
    print('')
    print('Splits PDF files into color and black and white sections.')
    print('')
    print('Options:')
    print('   -m option merges color and b/w parts to give two files.')
    print('      The default is to output numbered contiguous pieces')
    print('      that could easily be reassembled.')
    print('   -s option chooses simplex rather than duplex output')
    print('   -v verbose.')
    print('   -p Use pdftoppm rather than gs to detect color. Faster,')
    print('      but can get confused by hyperlinks that do not print.')
    print('   -r Reassemble all continuous files by converting all b/w ')
    print('      parts to grayscale (requires gs).')

def main():
    try:
        opt_pairs, filenames = getopt.gnu_getopt(sys.argv[1:], "hvpmsr", ["help"])
    except getopt.GetoptError as err:
        print(str(err))
        usage()
        sys.exit(1)
    if opt_pairs:
        opts = list(zip(*opt_pairs))[0]
    else:
        opts = []
    if ('-h' in opts) or ('--help' in opts) or (not filenames):
        usage()
        sys.exit()
    verbose = '-v' in opts
    use_pdftoppm = '-p' in opts
    merge = '-m' in opts
    doublesided = not ('-s' in opts)

    reassemble = '-r' in opts

    if merge and reassemble:
      raise ValueError("Merge and reassemble options not compatible!")

    for file in filenames:
        pdfcolorsplit(file, doublesided, merge, use_pdftoppm, reassemble, verbose)

if __name__ == "__main__":
    main()
