TeX - LaTeX Asked by Martin Scharrer on January 29, 2021
In a larger LaTeX document there are often only some pages with color content (mainly figures), while the remaining ones are black and white.
Because printing costs for color pages are much higher than for black and white, it would be good to be able to extract all pages with color and print them separately. The first step is to detect whether a page contains color or not. The result could take the form of a text list of page numbers suitable to be read by a PDF page extraction script (using e.g. pdftk).
A simple solution sufficient for many people would be to detect all pages which contain a figure and assume that only these have color. However, a general solution would be nice. Only color elements which are actually printed should be taken into account, while e.g. the colored frames that hyperref draws around links should not. It is OK if the solution disables these for the detection.
For the general case it does indeed seem better to use an external tool to test all pages for color. This is the topic of the mentioned SO question "How do I know if PDF pages are color or black-and-white?". I have now written an answer to it which includes a small script for this.
However, it is much easier to get a list of all pages containing figures.
Here I use the zref-abspage package to get an absolute page counter.
The normal \write command can be used, which expands its content only when the surrounding material is actually placed on a page. Therefore the page counters will have the correct value.
The end macro of figure (\endfigure) can then simply be patched to hold this code.
\documentclass{book}
\usepackage{mwe}
\usepackage{zref-abspage}% absolute page counter
\newwrite\figpages
\openout\figpages=\jobname.fpg
\makeatletter
\g@addto@macro\endfigure{%
    % Write absolute page number and page label to file
    % Do not use \immediate!
    \write\figpages{\number\value{abspage}: \thepage}%
}
\makeatother
\newcount\mycount% for example loop
\begin{document}
\frontmatter
\Blindtext
\begin{figure}
    \centering
    \includegraphics[width=.8\textwidth,height=5cm]{example-image}
    \caption{Some caption}
\end{figure}
\mainmatter
\Blindtext
\loop% keep MWE small by using a loop
\begin{figure}
    \centering
    \includegraphics[width=.8\textwidth,height=5cm]{example-image}
    \caption{Some caption}
\end{figure}
{\Blindtext}
\advance\mycount by 1
\ifnum\mycount<20\relax
\repeat
\backmatter
\appendix
\Blindtext
\begin{figure}
    \centering
    \includegraphics[width=.8\textwidth,height=5cm]{example-image}
    \caption{Some caption}
\end{figure}
\end{document}
This generates a .fpg file (for figure pages) which looks like:
2: ii
4: 2
5: 3
7: 5
8: 6
10: 8
11: 9
13: 11
14: 12
16: 14
18: 16
19: 17
21: 19
22: 20
24: 22
25: 23
27: 25
28: 26
30: 28
31: 29
33: 31
38: 36
The format can be changed if required.
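To turn such a list into something pdftk can use, a small helper script is enough. The following is a minimal sketch in Python, not part of the answer above: the function names parse_fpg and pdftk_page_lists, the file names in the printed commands and the total page count are all assumptions made for illustration. It relies only on the "absolute page: page label" line format shown in the listing.

```python
def parse_fpg(text):
    """Return sorted, de-duplicated absolute page numbers from .fpg content."""
    pages = set()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line looks like "<absolute page>: <page label>"
        absolute, _sep, _label = line.partition(":")
        pages.add(int(absolute))
    return sorted(pages)


def pdftk_page_lists(figure_pages, total_pages):
    """Return (figure pages, remaining pages) as space-separated strings."""
    figure_set = set(figure_pages)
    others = [p for p in range(1, total_pages + 1) if p not in figure_set]
    return (" ".join(str(p) for p in figure_pages),
            " ".join(str(p) for p in others))


if __name__ == "__main__":
    # First three lines of the listing; total_pages=6 is a made-up value.
    sample = "2: ii\n4: 2\n5: 3\n"
    figures, rest = pdftk_page_lists(parse_fpg(sample), total_pages=6)
    print("pdftk in.pdf cat %s output figures.pdf" % figures)
    print("pdftk in.pdf cat %s output text.pdf" % rest)
```

Using a set means that two figures on the same page, which write the same line twice, yield the page only once.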
Answered by Martin Scharrer on January 29, 2021
Newer versions of Ghostscript (version 9.05 and later) include a "device" called inkcov. It calculates the ink coverage of each page (not of each image) as Cyan (C), Magenta (M), Yellow (Y) and Black (K) values, where 0.00000 means 0% and 1.00000 means 100%.
Example commandline:
gs -o - -sDEVICE=inkcov /path/to/your.pdf
Example output:
Page 1
0.00000 0.00000 0.00000 0.02230 CMYK OK
Page 2
0.02360 0.02360 0.02360 0.02360 CMYK OK
Page 3
0.02525 0.02525 0.02525 0.00000 CMYK OK
Page 4
0.00000 0.00000 0.00000 0.01982 CMYK OK
You can see here that pages 1 and 4 use no color, while pages 2 and 3 do. This case is particularly 'nasty' for people who want to save on color ink: because the C, M and Y values (and, on page 2, also K) are exactly the same on each of pages 2 and 3, these pages may appear to the human eye not as color pages but as ("rich") grayscale anyway (if each single pixel is mixed from these color values).
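The classification described above can be automated by parsing the inkcov output. The following Python sketch is an illustration only: the function name classify_inkcov and the three labels are assumptions, and it depends solely on the "C M Y K CMYK OK" line format shown in the example output. Pages with C = M = Y but not zero get their own label, since they may or may not look gray.

```python
import re

# One coverage line: four decimal numbers followed by "CMYK OK".
INKCOV_LINE = re.compile(
    r"^\s*(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+CMYK OK\s*$")


def classify_inkcov(output):
    """Return one label per page: 'gray', 'maybe-gray' or 'color'."""
    labels = []
    for line in output.splitlines():
        match = INKCOV_LINE.match(line)
        if not match:
            continue  # skips "Page N" headers and any other text
        c, m, y, _k = (float(value) for value in match.groups())
        if c == 0 and m == 0 and y == 0:
            labels.append("gray")        # only black ink is used
        elif c == m == y:
            labels.append("maybe-gray")  # equal C, M, Y: possibly rich gray
        else:
            labels.append("color")
    return labels


if __name__ == "__main__":
    sample = """Page 1
 0.00000 0.00000 0.00000 0.02230 CMYK OK
Page 2
 0.02360 0.02360 0.02360 0.02360 CMYK OK
"""
    for number, label in enumerate(classify_inkcov(sample), start=1):
        print(number, label)
```

Because lines that do not match the pattern are skipped, the same function works on output produced with or without -q.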
Ghostscript can also convert color into grayscale. Example commandline:
gs \
  -o grayscale.pdf \
  -sDEVICE=pdfwrite \
  -sColorConversionStrategy=Gray \
  -dProcessColorModel=/DeviceGray \
  /path/to/your.pdf
Checking the ink coverage distribution again (note how adding -q to the parameters slightly changes the output format):
gs -q -o - -sDEVICE=inkcov grayscale.pdf
0.00000 0.00000 0.00000 0.02230 CMYK OK
0.00000 0.00000 0.00000 0.02360 CMYK OK
0.00000 0.00000 0.00000 0.02525 CMYK OK
0.00000 0.00000 0.00000 0.01982 CMYK OK
Answered by Kurt Pfeifle on January 29, 2021
There's a rather useful Python script at http://homepages.inf.ed.ac.uk/imurray2/code/hacks/pdfcolorsplit which uses pdftk to split a PDF into colour and b&w files, though it doesn't deal with the boxes around hyperref links. If you have access to the LaTeX source, why not turn off the colour in hyperref anyway? I do it like this:
\usepackage[colorlinks=true,
            linkcolor=black,
            citecolor=black,
            filecolor=black,
            urlcolor=black]{hyperref}
IIRC, if you just set [colorlinks=false] they're not clickable.
Answered by Chris H on January 29, 2021
Here is a MATLAB script which uses Kurt Pfeifle's answer to split a PDF into two files, one colour and one grayscale. The original file is preserved. It does not handle double-sided printing.
It is not bullet-proof and might need some debugging, but hopefully it will work out of the box.
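You will need Ghostscript 9.05 or later (for the inkcov device), callable from MATLAB through the ghostscript function that the script uses, and pdftk.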
Here is the script (you will need to change pathToFile and fName, and possibly pdftkPath):
clear all; close all; clc;
% Change these:
pathToFile = '/Users/nikos/Desktop/';
fName = 'thesis.pdf';
% You might need to change the path to pdftk (on Windows, for example)
pdftkPath = '/usr/local/bin/pdftk';
disp('Reminder: you might want to set \hypersetup{colorlinks=false} in latex');
disp('Do you want to manually set as grayscale any pages that have (C == M == Y)?');
a = input('Otherwise they will be treated as colour! (y/n) ','s');
% strcmpi also copes with empty or multi-character answers
manualMode = strcmpi(a, 'y');
% Ask Ghostscript for the ink coverage of every page
[status, ret] = ghostscript(['-o - -sDEVICE=inkcov ',pathToFile,fName]);
inds = strfind(ret,'0.');
pages = length(inds)/4;
if (round(pages) ~= pages)
    disp('Something went wrong');
    disp('Check the variable ret');
    disp('I am looking for the string ''0.'' which should only occur when listing CMYK values');
end
a = input(['Is your pdf ', num2str(pages), ' pages long (y/n) ?'],'s');
if ~strcmpi(a, 'y')
    return; % stop the script (break is only valid inside a loop)
end
disp([num2str(pages), ' pages processed.']);
% Positions of the C, M, Y and K values of each page within inds
c = 1:4:length(inds);
m = 2:4:length(inds);
y = 3:4:length(inds);
k = 4:4:length(inds);
colorPages = '';
bwPages = '';
cpCounter = 0;
bwCounter = 0;
for i = 1:pages
    % Each value is printed as 0.xxxxx, i.e. 7 characters
    C = str2double(ret(inds(c(i)):inds(c(i))+6));
    M = str2double(ret(inds(m(i)):inds(m(i))+6));
    Y = str2double(ret(inds(y(i)):inds(y(i))+6));
    K = str2double(ret(inds(k(i)):inds(k(i))+6));
    if (C == 0 && M == 0 && Y == 0)
        bwPages = [bwPages, ' ',num2str(i)];
        bwCounter = bwCounter+1;
    elseif (C == M && C == Y && manualMode)
        a = input(['Is page ', num2str(i), ' colour (C == M == Y) (y/n) ?'],'s');
        if ~strcmpi(a, 'y')
            bwPages = [bwPages, ' ',num2str(i)];
            bwCounter = bwCounter+1;
        else
            colorPages = [colorPages, ' ', num2str(i)];
            cpCounter = cpCounter+1;
        end
    else
        colorPages = [colorPages, ' ', num2str(i)];
        cpCounter = cpCounter+1;
    end
end
cName = [pathToFile, 'color_',fName];
bName = [pathToFile, 'bw_',fName];
disp([cName, ' (',num2str(cpCounter), ' pages)']);
disp([bName, ' (',num2str(bwCounter), ' pages)']);
% Let pdftk write the colour pages and the b/w pages to separate files
system([pdftkPath, ' ', pathToFile, fName, ' cat ', colorPages,' output ', cName]);
system([pdftkPath, ' ', pathToFile, fName, ' cat ', bwPages,' output ', bName]);
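The script treats every page on its own, which is why it does not handle double-sided printing. One possible extension, sketched here in Python rather than MATLAB and not part of the script above, is to mark both pages of a sheet as colour whenever either of them is. The function name expand_to_sheets is an assumption, and the sketch assumes that pages 1 and 2 share the first sheet, pages 3 and 4 the second, and so on.

```python
def expand_to_sheets(is_color):
    """Mark both pages of a duplex sheet as colour if either page is colour.

    is_color holds one boolean per page, in page order. A trailing
    single page (odd page count) keeps its own classification.
    """
    result = list(is_color)
    for first in range(0, len(result) - 1, 2):
        if result[first] or result[first + 1]:
            result[first] = result[first + 1] = True
    return result


if __name__ == "__main__":
    # Page 2 is colour, so page 1 on the same sheet must be printed
    # on the colour printer as well.
    print(expand_to_sheets([False, True, False, False, True]))
```

The expanded list can then be used in place of the per-page classification when building the page lists for pdftk.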
Answered by nikos on January 29, 2021
I extend Chris H's answer:
I extended the pdfcolorsplit.py script with an option -r that reassembles all split parts into a final PDF, converting all b/w parts to grayscale before reassembling.
Use it like this (the -p option worked best):
./pdfcolorsplit.py -p -v -s -r Report.pdf
The code is here:
#!/usr/bin/env python
# Python 2 and 3 compatible.
# Python program to take a pdf file, and split it into color and black
# and white part(s). Requires pdftk and one of gs and pdftoppm.
#
# Iain Murray, February 2010.
#
# Inspired by dvicoloursplit.py, Jeremy Sanders 2001, although written
# from scratch.
#
# 2011-09-19 fixed bug with odd numbers of pages reported by Richard Shaw
# 2012-06-11 tweaked to run in Python 3 as well as 2.
## This program is free software; you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 2 of the License, or
## (at your option) any later version.
## This program is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
## GNU General Public License for more details.
import os, os.path, sys, string, re, tempfile, shutil, getopt
import heapq
def a2b(x):
    """Turn ascii into bytes for Python 3, in a way that works with Python 2"""
    try:
        return bytes(x)
    except TypeError:
        # Python 3: bytes() needs an explicit encoding for str input
        return bytes(x, 'ascii')
def iscolorppm(filename):
    """Does the PPM file contain any non-grayscale colors?"""
    file = open(filename, 'rb')
    # Ugly: I read the whole file into RAM, and copy it needlessly a lot
    data = file.read()
    file.close()
    # PPM is a *very* liberal file format. It allows comments anywhere in the
    # header, even in the middle of tokens.
    comments_re = re.compile(a2b('^([^ \t\n]*)#[^\n]*\n'))
    split_re = re.compile(a2b('^([ \t\n]|#[^\n]*\n)+([^ \t\n#])'))
    tok_re = re.compile(a2b('^([^ \t\n]*)([ \t\n].*)'), re.DOTALL)
    toks = []
    while len(toks) < 4:
        # Replacement templates are passed through a2b so that they are
        # bytes, like the patterns and the data, under Python 3.
        while split_re.match(data):
            data = split_re.sub(a2b(r'\2'), data)
        while comments_re.match(data):
            data = comments_re.sub(a2b(r'\1'), data)
        (tok, data) = tok_re.match(data).groups()
        toks.append(tok)
    magic = toks[0]
    (width, height, max_color) = map(int, toks[1:])
    data = data[1:]
    if magic == b'P3':
        binary = False
    elif magic == b'P6':
        binary = True
    else:
        print("%s is not a valid PPM file" % filename)
        sys.exit(1)
    # Massage data so adjacent triples should have the same value in b/w images
    data_len = width*height*3
    if binary:
        if int(max_color) > 255:
            # Untested. Each intensity is in two bytes.
            data_len *= 2
            data = data[1:data_len:2] + data[:data_len:2]
    else:
        data = [int(x) for x in data.split()]
    if len(data) < data_len:
        print('PPM file is truncated?')
        sys.exit(1)
    triples = zip(data[0:data_len:3], data[1:data_len:3], data[2:data_len:3])
    black_and_white = all((a==b and a==c for (a,b,c) in triples))
    return not black_and_white
def pdfcolorsplit(file, doublesided, merge, use_pdftoppm, reassemble, verbose):
    # Work out which pages are color
    if verbose:
        print('Analyzing %s...' % file)
    tmpdir = tempfile.mkdtemp(prefix = 'pdfcs_')
    if use_pdftoppm:
        root = os.path.join(tmpdir, 'page')
        os.system('pdftoppm -r 20 "%s" "%s"' % (file, root))
    else:
        gs_opts = '-sDEVICE=ppmraw -dBATCH -dNOPAUSE -dSAFE -r20'
        if not verbose:
            gs_opts += ' -q'
        os.system('gs ' + gs_opts + ' -sOutputFile="%s" "%s"'
                  % (os.path.join(tmpdir, 'tmp%06d.ppm'), file))
    PPMs = os.listdir(tmpdir)
    PPMs.sort()
    iscolor = [iscolorppm(os.path.join(tmpdir, x)) for x in PPMs]
    num_pages = len(iscolor)
    shutil.rmtree(tmpdir)
    if doublesided:
        # Treat as color those b/w pages that share a sheet with a color page
        iscolorpair = [x or y for (x,y) in zip(iscolor[::2], iscolor[1::2])]
        iscolor[:2*len(iscolorpair):2] = iscolorpair
        iscolor[1::2] = iscolorpair
    # Construct page range strings
    flips = [x for x in range(2,num_pages+1) if iscolor[x-1] != iscolor[x-2]]
    if not flips:
        if verbose:
            print('No splitting needs to be done, skipping %s' % file)
        return
    edges = [1] + flips + [num_pages+1]
    ranges = ['%d-%d' % (x,y-1) for (x,y) in zip(edges[:-1], edges[1:])]
    print(iscolor, ranges)
    # Finally output split files
    if verbose:
        print('Outputting splits as new pdf files...')
    base_name = file
    if base_name.lower().endswith('.pdf'):
        base_name = base_name[:-4]
    suffixes = ['_bwsplit', '_colorsplit']
    # jobs is a seq of (range, filename) pairs, e.g. ('1-3', 'colorbits.pdf')
    # convert jobs
    if merge:
        jobs = ((' '.join(ranges[0::2]), base_name + suffixes[iscolor[0]]),
                (' '.join(ranges[1::2]), base_name + suffixes[not iscolor[0]]))
    else:
        jobs = [(r, '%s_%03d%s' % (base_name,n+1,suffixes[(n+iscolor[0])%2]))
                for (n,r) in enumerate(ranges)]
    for (pages, name) in jobs:
        if verbose:
            print('pdftk "%s" cat %s output "%s.pdf"' % (file, pages, name))
        os.system('pdftk "%s" cat %s output "%s.pdf"' % (file, pages, name))
    # reassemble all contiguous files into final output by converting b/w parts to grayscale
    if reassemble:
        graySuffix = "_gray"
        # The b/w pieces are every second job, starting at 1 if the first piece is color
        jobsconvert = [ j for j in jobs[ int(iscolor[0])::2] ]
        #print(jobsconvert)
        # convert all b/w to gray
        for (pages,name) in jobsconvert:
            cmd="gs -sOutputFile=%s%s.pdf -sDEVICE=pdfwrite -dAutoRotatePages=/None -sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH %s.pdf" % (name,graySuffix,name)
            if verbose:
                print(cmd)
            os.system(cmd)
        ## interleave converted b/w and colors and make pdftk cat command
        cJobs = jobs[0::2] if iscolor[0] else jobs[1::2]
        #print(cJobs)
        bwJobs = [ (pages,name+graySuffix) for pages,name in jobsconvert]
        def interleave(l1, l2):
            iter1 = iter(l1)
            iter2 = iter(l2)
            while True:
                try:
                    if iter1 is not None:
                        yield next(iter1)
                except StopIteration:
                    iter1 = None
                try:
                    if iter2 is not None:
                        yield next(iter2)
                except StopIteration:
                    iter2 = None
                if iter1 is None and iter2 is None:
                    # Raising StopIteration inside a generator is an
                    # error since Python 3.7 (PEP 479), so just return.
                    return
        jobsCatAll = interleave(cJobs,bwJobs) if iscolor[0] else interleave(bwJobs,cJobs)
        #print(list(jobsCatAll))
        cmd = "pdftk " + " ".join([j[1]+".pdf" for j in jobsCatAll]) + " cat output %s%s.pdf " % (base_name,"_all")
        if verbose:
            print(cmd)
        os.system(cmd)
def usage():
    progname = os.path.basename(sys.argv[0])
    print('Usage: %s [OPTIONS] <PDF-file(s)>' % progname)
    print('')
    print('Splits PDF files into color and black and white sections.')
    print('')
    print('Options:')
    print(' -m option merges color and b/w parts to give two files.')
    print(' The default is to output numbered contiguous pieces')
    print(' that could easily be reassembled.')
    print(' -s option chooses simplex rather than duplex output')
    print(' -v verbose.')
    print(' -p Use pdftoppm rather than gs to detect color. Faster,')
    print(' but can get confused by hyperlinks that do not print.')
    print(' -r Reassemble all continuous files by converting all b/w ')
    print(' parts to grayscale (requires gs).')
def main():
    try:
        opt_pairs, filenames = getopt.gnu_getopt(sys.argv[1:], "hvpmsr", ["help"])
    except getopt.GetoptError as err:
        print(str(err))
        usage()
        sys.exit(1)
    if opt_pairs:
        opts = list(zip(*opt_pairs))[0]
    else:
        opts = []
    if ('-h' in opts) or ('--help' in opts) or (not filenames):
        usage()
        sys.exit()
    verbose = '-v' in opts
    use_pdftoppm = '-p' in opts
    merge = '-m' in opts
    doublesided = not ('-s' in opts)
    reassemble = '-r' in opts
    if merge and reassemble:
        raise ValueError("Merge and reassemble options not compatible!")
    for file in filenames:
        pdfcolorsplit(file, doublesided, merge, use_pdftoppm, reassemble, verbose)
if __name__ == "__main__":
    main()
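The step that turns the per-page colour flags into contiguous page ranges is easy to miss inside the script. As an isolated illustration, here is a minimal sketch of the same idea using itertools.groupby instead of the flips/edges lists; the function name contiguous_runs is an assumption and does not appear in pdfcolorsplit.py.

```python
from itertools import groupby


def contiguous_runs(is_color):
    """Group 1-based pages into (is_color, 'first-last') runs.

    is_color holds one boolean per page, in page order. Each run is a
    maximal stretch of consecutive pages with the same flag, written
    in the 'first-last' form that pdftk's cat operation accepts.
    """
    runs = []
    page = 1
    for flag, group in groupby(is_color):
        length = len(list(group))
        runs.append((flag, "%d-%d" % (page, page + length - 1)))
        page += length
    return runs


if __name__ == "__main__":
    # Two b/w pages, one colour page, one b/w page
    print(contiguous_runs([False, False, True, False]))
```

Runs alternate between b/w and colour by construction, which is what lets the script pick every second piece for grayscale conversion and then interleave the pieces again.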
Answered by Gabriel on January 29, 2021