tesseract  4.00.00dev
pdfrenderer.cpp
1 // File: pdfrenderer.cpp
3 // Description: PDF rendering interface to inject into TessBaseAPI
4 //
5 // (C) Copyright 2011, Google Inc.
6 // Licensed under the Apache License, Version 2.0 (the "License");
7 // you may not use this file except in compliance with the License.
8 // You may obtain a copy of the License at
9 // http://www.apache.org/licenses/LICENSE-2.0
10 // Unless required by applicable law or agreed to in writing, software
11 // distributed under the License is distributed on an "AS IS" BASIS,
12 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 // See the License for the specific language governing permissions and
14 // limitations under the License.
15 //
17 
18 // Include automatically generated configuration file if running autoconf.
19 #ifdef HAVE_CONFIG_H
20 #include "config_auto.h"
21 #endif
22 
23 #include <memory> // std::unique_ptr
24 #include "allheaders.h"
25 #include "baseapi.h"
26 #include "math.h"
27 #include "renderer.h"
28 #include "strngs.h"
29 #include "tprintf.h"
30 
31 #ifdef _MSC_VER
32 #include "mathfix.h"
33 #endif
34 
35 /*
36 
37 Design notes from Ken Sharp, with light editing.
38 
39 We think one solution is a font with a single glyph (.notdef) and a
40 CIDToGIDMap which maps all the CIDs to 0. That map would then be
41 stored as a stream in the PDF file, and when flate compressed should
42 be pretty small. The font, of course, will be approximately the same
43 size as the one you currently use.
44 
45 I'm working on such a font now, the CIDToGIDMap is trivial, you just
46 create a stream object which contains 128k bytes (2 bytes per possible
47 CID and your CIDs range from 0 to 65535) and where you currently have
48 "/CIDToGIDMap /Identity" you would have "/CIDToGIDMap <object> 0 R".
49 
50 Note that if, in future, you were to use a different (ie not 2 byte)
51 CMap for character codes you could trivially extend the CIDToGIDMap.
52 
53 The following is an explanation of how some of the font stuff works,
54 this may be too simple for you in which case please accept my
55 apologies, it's hard to know how much knowledge someone has. You can
56 skip all this anyway, it's just for information.
57 
58 The font embedded in a PDF file is usually intended just to be
59 rendered, but extensions allow for at least some ability to locate (or
60 copy) text from a document. This isn't something which was an original
61 goal of the PDF format, but it's been retro-fitted, presumably due to
62 popular demand.
63 
64 To do this reliably the PDF file must contain a ToUnicode CMap, a
65 device for mapping character codes to Unicode code points. If one of
66 these is present, then this will be used to convert the character
67 codes into Unicode values. If it's not present then the reader will
68 fall back through a series of heuristics to try and guess the
69 result. This is, as you would expect, prone to failure.
70 
71 This doesn't concern you of course, since you always write a ToUnicode
72 CMap. And because you are writing the text in text rendering mode 3, it
73 would seem that you don't really need to worry about the font either.
74 But in the PDF spec you cannot have an isolated ToUnicode CMap, it has
75 to be attached to a font, so in order to get even copy/paste to work
76 you need to define a font.
77 
78 This is what leads to problems, tools like pdfwrite assume that they
79 are going to be able to (or even have to) modify the font entries, so
80 they require that the font being embedded be valid, and to be honest
81 the font Tesseract embeds isn't valid (for this purpose).
82 
83 
84 To see why, let's look at how text is specified in a PDF file:
85 
86 (Test) Tj
87 
88 Now that looks like text but actually it isn't. Each of those bytes is
89 a 'character code'. When it comes to rendering the text a complex
90 sequence of events takes place, which converts the character code into
91 'something' which the font understands. It's entirely possible via
92 character mappings to have that text render as 'Sftu'.
93 
94 For simple fonts (PostScript type 1), we use the character code as the
95 index into an Encoding array (256 elements), each element of which is
96 a glyph name, so this gives us a glyph name. We then consult the
97 CharStrings dictionary in the font, that's a complex object which
98 contains pairs of keys and values, you can use the key to retrieve a
99 given value. So we have a glyph name, we then use that as the key to
100 the dictionary and retrieve the associated value. For a type 1 font,
101 the value is a glyph program that describes how to draw the glyph.
102 
103 For CIDFonts, it's a little more complicated. Because CIDFonts can be
104 large, using a glyph name as the key is unreasonable (it would also
105 lead to unfeasibly large Encoding arrays), so instead we use a 'CID'
106 as the key. CIDs are just numbers.
107 
108 But.... We don't use the character code as the CID. What we do is use
109 a CMap to convert the character code into a CID. We then use the CID
110 to key the CharStrings dictionary and proceed as before. So the 'CMap'
111 is the equivalent of the Encoding array, but it's a more compact and
112 flexible representation.
113 
114 Note that you have to use the CMap just to find out how many bytes
115 constitute a character code, and it can be variable. For example you
116 can say if the first byte is 0x00->0x7f then it's just one byte, if it's
117 0x80->0xf0 then it's 2 bytes and if it's 0xf0->0xff then it's 3 bytes. I
118 have seen CMaps defining character codes up to 5 bytes wide.
119 
120 Now that's fine for 'PostScript' CIDFonts, but it's not sufficient for
121 TrueType CIDFonts. The thing is that TrueType fonts are accessed using
122 a Glyph ID (GID) (and the LOCA table) which may well not be anything
123 like the CID. So for this case PDF includes a CIDToGIDMap. That maps
124 the CIDs to GIDs, and we can then use the GID to get the glyph
125 description from the GLYF table of the font.
126 
127 So for a TrueType CIDFont, character-code->CID->GID->glyf-program.
128 
129 Looking at the PDF file I was supplied with we see that it contains
130 text like :
131 
132 <0x0075> Tj
133 
134 So we start by taking the character code (117) and look it up in the
135 CMap. Well you don't supply a CMap, you just use the Identity-H one
136 which is predefined. So character code 117 maps to CID 117. Then we
137 use the CIDToGIDMap, again you don't supply one, you just use the
138 predefined 'Identity' map. So CID 117 maps to GID 117. But the font we
139 were supplied with only contains 116 glyphs.
140 
141 Now for Latin that's not a huge problem, you can just supply a bigger
142 font. But for more complex languages that *is* going to be more of a
143 problem. Either you need to supply a font which contains glyphs for
144 all the possible CID->GID mappings, or we need to think laterally.
145 
146 Our solution using a TrueType CIDFont is to intervene at the
147 CIDToGIDMap stage and convert all the CIDs to GID 0. Then we have a
148 font with just one glyph, the .notdef glyph at GID 0. This is what I'm
149 looking into now.
150 
151 It would also be possible to have a 'PostScript' (ie type 1 outlines)
152 CIDFont which contained 1 glyph, and a CMap which mapped all character
153 codes to CID 0. The effect would be the same.
154 
155 It's possible (I haven't checked) that the PostScript CIDFont and
156 associated CMap would be smaller than the TrueType font and associated
157 CIDToGIDMap.
158 
159 --- in a followup ---
160 
161 OK there is a small problem there, if I use GID 0 then Acrobat gets
162 upset about it and complains it cannot extract the font. If I set the
163 CIDToGIDMap so that all the entries are 1 instead, it's happy. Totally
164 mad......
165 
166 */
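// Illustration of the design notes above (not part of the renderer): a
// minimal sketch of the chain character-code -> CID -> GID when the CMap is
// Identity-H and the CIDToGIDMap sends every CID to one glyph. The helper
// names BuildConstantCidToGidMap and CharCodeToGid are hypothetical, and the
// map is assumed to hold one big-endian 16-bit GID per CID, uncompressed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build a CIDToGIDMap stream body: 2 bytes per CID for CIDs 0..65535,
// every entry holding the same GID (high byte first).
std::vector<unsigned char> BuildConstantCidToGidMap(uint16_t gid) {
  const std::size_t kNumCids = 1u << 16;
  std::vector<unsigned char> map(2 * kNumCids);
  for (std::size_t cid = 0; cid < kNumCids; ++cid) {
    map[2 * cid] = static_cast<unsigned char>(gid >> 8);        // high byte
    map[2 * cid + 1] = static_cast<unsigned char>(gid & 0xFF);  // low byte
  }
  return map;
}

// Resolve a 2-byte character code to a GID. With Identity-H the CID is
// simply the character code; the map then supplies the GID.
uint16_t CharCodeToGid(const std::vector<unsigned char>& map, uint16_t code) {
  const std::size_t cid = code;  // Identity-H: CID == character code
  assert(2 * cid + 1 < map.size());
  return static_cast<uint16_t>((map[2 * cid] << 8) | map[2 * cid + 1]);
}
```

// With BuildConstantCidToGidMap(1), the code <0075> from the notes resolves
// to GID 1, as does every other code, so a one-glyph font is sufficient.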
167 
168 namespace tesseract {
169 
170 // Use for PDF object fragments. Must be large enough
171 // to hold a colormap with 256 colors in the verbose
172 // PDF representation.
173 static const int kBasicBufSize = 2048;
174 
175 // If the font is 10 pts, nominal character width is 5 pts
176 static const int kCharWidth = 2;
177 
178 // Used for memory allocation. A codepoint must take no more than this
179 // many bytes, when written in the PDF way. e.g. "<0063>" for the
180 // letter 'c'
181 static const int kMaxBytesPerCodepoint = 20;
182 
183 /**********************************************************************
184  * PDF Renderer interface implementation
185  **********************************************************************/
186 
187 TessPDFRenderer::TessPDFRenderer(const char *outputbase, const char *datadir,
188  bool textonly)
189  : TessResultRenderer(outputbase, "pdf") {
190  obj_ = 0;
191  datadir_ = datadir;
192  textonly_ = textonly;
193  offsets_.push_back(0);
194 }
195 
196 void TessPDFRenderer::AppendPDFObjectDIY(size_t objectsize) {
197  offsets_.push_back(objectsize + offsets_.back());
198  obj_++;
199 }
200 
201 void TessPDFRenderer::AppendPDFObject(const char *data) {
202  AppendPDFObjectDIY(strlen(data));
203  AppendString((const char *)data);
204 }
205 
206 // Helper function to prevent us from accidentally writing
207 // scientific notation to an HOCR or PDF file. Besides, three
208 // decimal places are all you really need.
209 double prec(double x) {
210  double kPrecision = 1000.0;
211  double a = round(x * kPrecision) / kPrecision;
212  if (a == -0)
213  return 0;
214  return a;
215 }
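
// Illustration only: a small sketch of the same rounding idea combined with
// fixed-point formatting, so that tiny values never appear as "1e-07" in the
// output. RoundToThousandths and FormatPdfNumber are hypothetical names, and
// the "%.3f" format is an assumption rather than what STRING::add_str_double
// does.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <string>

// Round to three decimal places and fold negative zero into plain zero.
double RoundToThousandths(double x) {
  const double kPrecision = 1000.0;
  double a = std::round(x * kPrecision) / kPrecision;
  return (a == 0.0) ? 0.0 : a;  // -0.0 compares equal to 0.0
}

// Format a number in fixed notation with three decimals.
std::string FormatPdfNumber(double x) {
  assert(std::isfinite(x));
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%.3f", RoundToThousandths(x));
  return buf;
}
```

// For example, FormatPdfNumber(-0.0001) yields "0.000" rather than "-0.000"
// or a value in exponent notation.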
216 
217 long dist2(int x1, int y1, int x2, int y2) {
218  return (x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1);
219 }
220 
221 // Viewers like evince can get really confused during copy-paste when
222 // the baseline wanders around. So I've decided to project every word
223 // onto the (straight) line baseline. All numbers are in the native
224 // PDF coordinate system, which has the origin in the bottom left and
225 // the unit is points, which is 1/72 inch. Tesseract reports baselines
226 // left-to-right no matter what the reading order is. We need the
227 // word baseline in reading order, so we do that conversion here. Returns
228 // the word's baseline origin and length.
229 void GetWordBaseline(int writing_direction, int ppi, int height,
230  int word_x1, int word_y1, int word_x2, int word_y2,
231  int line_x1, int line_y1, int line_x2, int line_y2,
232  double *x0, double *y0, double *length) {
233  if (writing_direction == WRITING_DIRECTION_RIGHT_TO_LEFT) {
234  Swap(&word_x1, &word_x2);
235  Swap(&word_y1, &word_y2);
236  }
237  double word_length;
238  double x, y;
239  {
240  int px = word_x1;
241  int py = word_y1;
242  double l2 = dist2(line_x1, line_y1, line_x2, line_y2);
243  if (l2 == 0) {
244  x = line_x1;
245  y = line_y1;
246  } else {
247  double t = ((px - line_x2) * (line_x2 - line_x1) +
248  (py - line_y2) * (line_y2 - line_y1)) / l2;
249  x = line_x2 + t * (line_x2 - line_x1);
250  y = line_y2 + t * (line_y2 - line_y1);
251  }
252  word_length = sqrt(static_cast<double>(dist2(word_x1, word_y1,
253  word_x2, word_y2)));
254  word_length = word_length * 72.0 / ppi;
255  x = x * 72 / ppi;
256  y = height - (y * 72.0 / ppi);
257  }
258  *x0 = x;
259  *y0 = y;
260  *length = word_length;
261 }
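
// Illustration only: the two geometric steps used above, written as separate
// helpers. This is a sketch under the assumption that image coordinates are
// in pixels with the origin at the top left; Point, ProjectOntoLine and
// PixelsToPdfPoints are hypothetical names.

```cpp
#include <cassert>

struct Point {
  double x;
  double y;
};

// Orthogonal projection of p onto the infinite line through a and b.
// A degenerate line (a == b) projects everything onto a.
Point ProjectOntoLine(Point p, Point a, Point b) {
  const double dx = b.x - a.x;
  const double dy = b.y - a.y;
  const double l2 = dx * dx + dy * dy;
  if (l2 == 0) return a;
  const double t = ((p.x - a.x) * dx + (p.y - a.y) * dy) / l2;
  return {a.x + t * dx, a.y + t * dy};
}

// Convert pixels (origin top left, y down) to PDF points (origin bottom
// left, y up, 72 points per inch).
Point PixelsToPdfPoints(Point p, double ppi, double page_height_pts) {
  assert(ppi > 0);
  return {p.x * 72.0 / ppi, page_height_pts - p.y * 72.0 / ppi};
}
```

// A word starting at pixel (5, 3) on a baseline from (0, 0) to (10, 0)
// projects to (5, 0); the projection removes the vertical wobble.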
262 
263 // Compute coefficients for an affine matrix describing the rotation
264 // of the text. If the text is right-to-left such as Arabic or Hebrew,
265 // we reflect over the Y-axis. This matrix will set the coordinate
266 // system for placing text in the PDF file.
267 //
268 // RTL
269 // [ x' ] = [ a b ][ x ] = [-1 0 ] [ cos sin ][ x ]
270 // [ y' ] [ c d ][ y ] [ 0 1 ] [-sin cos ][ y ]
271 void AffineMatrix(int writing_direction,
272  int line_x1, int line_y1, int line_x2, int line_y2,
273  double *a, double *b, double *c, double *d) {
274  double theta = atan2(static_cast<double>(line_y1 - line_y2),
275  static_cast<double>(line_x2 - line_x1));
276  *a = cos(theta);
277  *b = sin(theta);
278  *c = -sin(theta);
279  *d = cos(theta);
280  switch(writing_direction) {
281  case WRITING_DIRECTION_RIGHT_TO_LEFT:
282  *a = -*a;
283  *b = -*b;
284  break;
285  case WRITING_DIRECTION_TOP_TO_BOTTOM:
286  // TODO(jbreiden) Consider using the vertical PDF writing mode.
287  break;
288  default:
289  break;
290  }
291 }
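
// Illustration only: the same coefficients computed for a concrete baseline.
// TextMatrix and BaselineMatrix are hypothetical names. The sketch assumes
// image coordinates with y growing downward, which is why the angle is taken
// from (y1 - y2).

```cpp
#include <cassert>
#include <cmath>

struct TextMatrix {
  double a, b, c, d;
};

// Rotation matching the baseline from (x1, y1) to (x2, y2); for
// right-to-left text the first row is negated (reflection over the Y axis).
TextMatrix BaselineMatrix(bool right_to_left,
                          int x1, int y1, int x2, int y2) {
  assert(x1 != x2 || y1 != y2);  // a baseline needs two distinct points
  const double theta = std::atan2(static_cast<double>(y1 - y2),
                                  static_cast<double>(x2 - x1));
  TextMatrix m = {std::cos(theta), std::sin(theta),
                  -std::sin(theta), std::cos(theta)};
  if (right_to_left) {
    m.a = -m.a;
    m.b = -m.b;
  }
  return m;
}
```

// A flat left-to-right baseline gives the identity rotation (1 0 0 1); the
// same baseline read right-to-left gives (-1 0 0 1).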
292 
293 // There are some really awkward PDF viewers in the wild, such as
294 // 'Preview' which ships with the Mac. They do a better job with text
295 // selection and highlighting when given a perfectly flat baseline
296 // instead of a very slightly tilted one. We clip small tilts to appease
297 // these viewers. I chose this threshold large enough to absorb noise,
298 // but small enough that lines probably won't cross each other if the
299 // whole page is tilted at almost exactly the clipping threshold.
300 void ClipBaseline(int ppi, int x1, int y1, int x2, int y2,
301  int *line_x1, int *line_y1,
302  int *line_x2, int *line_y2) {
303  *line_x1 = x1;
304  *line_y1 = y1;
305  *line_x2 = x2;
306  *line_y2 = y2;
307  double rise = abs(y2 - y1) * 72.0 / ppi;
308  double run = abs(x2 - x1) * 72.0 / ppi;
309  if (rise < 2.0 && 2.0 < run)
310  *line_y1 = *line_y2 = (y1 + y2) / 2;
311 }
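
// Illustration only: the clipping rule applied to concrete numbers. Baseline
// and FlattenSmallTilt are hypothetical names. The sketch converts pixels to
// points in floating point and keeps the 2 pt threshold from the code above.

```cpp
#include <cassert>
#include <cstdlib>

struct Baseline {
  int x1, y1, x2, y2;
};

// Flatten a baseline whose rise is under 2 pt while its run exceeds 2 pt.
Baseline FlattenSmallTilt(int ppi, Baseline in) {
  assert(ppi > 0);
  const double rise = std::abs(in.y2 - in.y1) * 72.0 / ppi;
  const double run = std::abs(in.x2 - in.x1) * 72.0 / ppi;
  if (rise < 2.0 && 2.0 < run) {
    in.y1 = in.y2 = (in.y1 + in.y2) / 2;  // use the mean height
  }
  return in;
}
```

// At 300 ppi, a 4 pixel rise over 1000 pixels is about 0.96 pt and gets
// flattened; a 100 pixel rise (24 pt) is left alone.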
312 
313 bool CodepointToUtf16be(int code, char utf16[kMaxBytesPerCodepoint]) {
314  if ((code > 0xD7FF && code < 0xE000) || code > 0x10FFFF) {
315  tprintf("Dropping invalid codepoint %d\n", code);
316  return false;
317  }
318  if (code < 0x10000) {
319  snprintf(utf16, kMaxBytesPerCodepoint, "%04X", code);
320  } else {
321  int a = code - 0x010000;
322  int high_surrogate = (0x03FF & (a >> 10)) + 0xD800;
323  int low_surrogate = (0x03FF & a) + 0xDC00;
324  snprintf(utf16, kMaxBytesPerCodepoint,
325  "%04X%04X", high_surrogate, low_surrogate);
326  }
327  return true;
328 }
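
// Illustration only: the same encoding written against std::string so the
// surrogate arithmetic is easy to try out in isolation. The function name
// CodepointToUtf16beHex is hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Write the UTF-16BE encoding of a code point as uppercase hex digits.
// Returns false for surrogates and for values outside the Unicode range.
bool CodepointToUtf16beHex(int code, std::string* out) {
  assert(out != nullptr);
  if (code < 0 || (code > 0xD7FF && code < 0xE000) || code > 0x10FFFF) {
    return false;
  }
  char buf[16];
  if (code < 0x10000) {
    std::snprintf(buf, sizeof(buf), "%04X", static_cast<unsigned>(code));
  } else {
    const int a = code - 0x10000;
    const unsigned high = 0xD800 + ((a >> 10) & 0x3FF);  // top 10 bits
    const unsigned low = 0xDC00 + (a & 0x3FF);           // bottom 10 bits
    std::snprintf(buf, sizeof(buf), "%04X%04X", high, low);
  }
  *out = buf;
  return true;
}
```

// The letter 'c' (U+0063) becomes "0063"; U+1F600, which lies outside the
// Basic Multilingual Plane, becomes the surrogate pair "D83DDE00".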
329 
330 char* TessPDFRenderer::GetPDFTextObjects(TessBaseAPI* api,
331  double width, double height) {
332  STRING pdf_str("");
333  double ppi = api->GetSourceYResolution();
334 
335  // These initial conditions are all arbitrary and will be overwritten
336  double old_x = 0.0, old_y = 0.0;
337  int old_fontsize = 0;
338  tesseract::WritingDirection old_writing_direction =
339  WRITING_DIRECTION_LEFT_TO_RIGHT;
340  bool new_block = true;
341  int fontsize = 0;
342  double a = 1;
343  double b = 0;
344  double c = 0;
345  double d = 1;
346 
347  // TODO(jbreiden) This marries the text and image together.
348  // Slightly cleaner from an abstraction standpoint if this were to
349  // live inside a separate text object.
350  pdf_str += "q ";
351  pdf_str.add_str_double("", prec(width));
352  pdf_str += " 0 0 ";
353  pdf_str.add_str_double("", prec(height));
354  pdf_str += " 0 0 cm";
355  if (!textonly_) {
356  pdf_str += " /Im1 Do";
357  }
358  pdf_str += " Q\n";
359 
360  int line_x1 = 0;
361  int line_y1 = 0;
362  int line_x2 = 0;
363  int line_y2 = 0;
364 
365  ResultIterator *res_it = api->GetIterator();
366  while (!res_it->Empty(RIL_BLOCK)) {
367  if (res_it->IsAtBeginningOf(RIL_BLOCK)) {
368  pdf_str += "BT\n3 Tr"; // Begin text object, use invisible ink
369  old_fontsize = 0; // Every block will declare its fontsize
370  new_block = true; // Every block will declare its affine matrix
371  }
372 
373  if (res_it->IsAtBeginningOf(RIL_TEXTLINE)) {
374  int x1, y1, x2, y2;
375  res_it->Baseline(RIL_TEXTLINE, &x1, &y1, &x2, &y2);
376  ClipBaseline(ppi, x1, y1, x2, y2, &line_x1, &line_y1, &line_x2, &line_y2);
377  }
378 
379  if (res_it->Empty(RIL_WORD)) {
380  res_it->Next(RIL_WORD);
381  continue;
382  }
383 
384  // Writing direction changes at a per-word granularity
385  tesseract::WritingDirection writing_direction;
386  {
387  tesseract::Orientation orientation;
388  tesseract::TextlineOrder textline_order;
389  float deskew_angle;
390  res_it->Orientation(&orientation, &writing_direction,
391  &textline_order, &deskew_angle);
392  if (writing_direction != WRITING_DIRECTION_TOP_TO_BOTTOM) {
393  switch (res_it->WordDirection()) {
394  case DIR_LEFT_TO_RIGHT:
395  writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
396  break;
397  case DIR_RIGHT_TO_LEFT:
398  writing_direction = WRITING_DIRECTION_RIGHT_TO_LEFT;
399  break;
400  default:
401  writing_direction = old_writing_direction;
402  }
403  }
404  }
405 
406  // Where is word origin and how long is it?
407  double x, y, word_length;
408  {
409  int word_x1, word_y1, word_x2, word_y2;
410  res_it->Baseline(RIL_WORD, &word_x1, &word_y1, &word_x2, &word_y2);
411  GetWordBaseline(writing_direction, ppi, height,
412  word_x1, word_y1, word_x2, word_y2,
413  line_x1, line_y1, line_x2, line_y2,
414  &x, &y, &word_length);
415  }
416 
417  if (writing_direction != old_writing_direction || new_block) {
418  AffineMatrix(writing_direction,
419  line_x1, line_y1, line_x2, line_y2, &a, &b, &c, &d);
420  pdf_str.add_str_double(" ", prec(a)); // . This affine matrix
421  pdf_str.add_str_double(" ", prec(b)); // . sets the coordinate
422  pdf_str.add_str_double(" ", prec(c)); // . system for all
423  pdf_str.add_str_double(" ", prec(d)); // . text that follows.
424  pdf_str.add_str_double(" ", prec(x)); // .
425  pdf_str.add_str_double(" ", prec(y)); // .
426  pdf_str += (" Tm "); // Place cursor absolutely
427  new_block = false;
428  } else {
429  double dx = x - old_x;
430  double dy = y - old_y;
431  pdf_str.add_str_double(" ", prec(dx * a + dy * b));
432  pdf_str.add_str_double(" ", prec(dx * c + dy * d));
433  pdf_str += (" Td "); // Relative moveto
434  }
435  old_x = x;
436  old_y = y;
437  old_writing_direction = writing_direction;
438 
439  // Adjust font size on a per word granularity. Pay attention to
440  // fontsize, old_fontsize, and pdf_str. We've found that for
441  // Arabic, Tesseract will happily return a fontsize of zero,
442  // so we make up a default number to protect ourselves.
443  {
444  bool bold, italic, underlined, monospace, serif, smallcaps;
445  int font_id;
446  res_it->WordFontAttributes(&bold, &italic, &underlined, &monospace,
447  &serif, &smallcaps, &fontsize, &font_id);
448  const int kDefaultFontsize = 8;
449  if (fontsize <= 0)
450  fontsize = kDefaultFontsize;
451  if (fontsize != old_fontsize) {
452  char textfont[20];
453  snprintf(textfont, sizeof(textfont), "/f-0-0 %d Tf ", fontsize);
454  pdf_str += textfont;
455  old_fontsize = fontsize;
456  }
457  }
458 
459  bool last_word_in_line = res_it->IsAtFinalElement(RIL_TEXTLINE, RIL_WORD);
460  bool last_word_in_block = res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD);
461  STRING pdf_word("");
462  int pdf_word_len = 0;
463  do {
464  const std::unique_ptr<const char[]> grapheme(res_it->GetUTF8Text(RIL_SYMBOL));
465  if (grapheme && grapheme[0] != '\0') {
466  GenericVector<int> unicodes;
467  UNICHAR::UTF8ToUnicode(grapheme.get(), &unicodes);
468  char utf16[kMaxBytesPerCodepoint];
469  for (int i = 0; i < unicodes.length(); i++) {
470  int code = unicodes[i];
471  if (CodepointToUtf16be(code, utf16)) {
472  pdf_word += utf16;
473  pdf_word_len++;
474  }
475  }
476  }
477  res_it->Next(RIL_SYMBOL);
478  } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
479  if (word_length > 0 && pdf_word_len > 0 && fontsize > 0) {
480  double h_stretch =
481  kCharWidth * prec(100.0 * word_length / (fontsize * pdf_word_len));
482  pdf_str.add_str_double("", h_stretch);
483  pdf_str += " Tz"; // horizontal stretch
484  pdf_str += " [ <";
485  pdf_str += pdf_word; // UTF-16BE representation
486  pdf_str += "> ] TJ"; // show the text
487  }
488  if (last_word_in_line) {
489  pdf_str += " \n";
490  }
491  if (last_word_in_block) {
492  pdf_str += "ET\n"; // end the text object
493  }
494  }
495  char *ret = new char[pdf_str.length() + 1];
496  strcpy(ret, pdf_str.string());
497  delete res_it;
498  return ret;
499 }
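
// Illustration only: how the Tz (horizontal stretch) value above relates
// word length, font size and glyph count. HorizontalStretchPercent is a
// hypothetical name; the sketch assumes every glyph advances by
// fontsize / kCharWidth points and leaves out the rounding done by prec().

```cpp
#include <cassert>

// Percentage by which the fixed-width invisible glyphs must be stretched so
// that num_codepoints glyphs span exactly word_length_pts.
double HorizontalStretchPercent(double word_length_pts, int fontsize,
                                int num_codepoints) {
  assert(word_length_pts > 0 && fontsize > 0 && num_codepoints > 0);
  const int kCharWidth = 2;  // nominal glyph width is fontsize / 2
  return kCharWidth * 100.0 * word_length_pts / (fontsize * num_codepoints);
}
```

// Ten glyphs at 10 pt are nominally 5 pt each, 50 pt in total, so a 50 pt
// word needs 100 percent and a 25 pt word needs 50 percent.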
500 
501 bool TessPDFRenderer::BeginDocumentHandler() {
502  char buf[kBasicBufSize];
503  size_t n;
504 
505  n = snprintf(buf, sizeof(buf),
506  "%%PDF-1.5\n"
507  "%%%c%c%c%c\n",
508  0xDE, 0xAD, 0xBE, 0xEB);
509  if (n >= sizeof(buf)) return false;
510  AppendPDFObject(buf);
511 
512  // CATALOG
513  n = snprintf(buf, sizeof(buf),
514  "1 0 obj\n"
515  "<<\n"
516  " /Type /Catalog\n"
517  " /Pages %ld 0 R\n"
518  ">>\n"
519  "endobj\n",
520  2L);
521  if (n >= sizeof(buf)) return false;
522  AppendPDFObject(buf);
523 
524  // We are reserving object #2 for the /Pages
525  // object, which I am going to create and write
526  // at the end of the PDF file.
527  AppendPDFObject("");
528 
529  // TYPE0 FONT
530  n = snprintf(buf, sizeof(buf),
531  "3 0 obj\n"
532  "<<\n"
533  " /BaseFont /GlyphLessFont\n"
534  " /DescendantFonts [ %ld 0 R ]\n"
535  " /Encoding /Identity-H\n"
536  " /Subtype /Type0\n"
537  " /ToUnicode %ld 0 R\n"
538  " /Type /Font\n"
539  ">>\n"
540  "endobj\n",
541  4L, // CIDFontType2 font
542  6L // ToUnicode
543  );
544  if (n >= sizeof(buf)) return false;
545  AppendPDFObject(buf);
546 
547  // CIDFONTTYPE2
548  n = snprintf(buf, sizeof(buf),
549  "4 0 obj\n"
550  "<<\n"
551  " /BaseFont /GlyphLessFont\n"
552  " /CIDToGIDMap %ld 0 R\n"
553  " /CIDSystemInfo\n"
554  " <<\n"
555  " /Ordering (Identity)\n"
556  " /Registry (Adobe)\n"
557  " /Supplement 0\n"
558  " >>\n"
559  " /FontDescriptor %ld 0 R\n"
560  " /Subtype /CIDFontType2\n"
561  " /Type /Font\n"
562  " /DW %d\n"
563  ">>\n"
564  "endobj\n",
565  5L, // CIDToGIDMap
566  7L, // Font descriptor
567  1000 / kCharWidth);
568  if (n >= sizeof(buf)) return false;
569  AppendPDFObject(buf);
570 
571  // CIDTOGIDMAP
572  const int kCIDToGIDMapSize = 2 * (1 << 16);
573  const std::unique_ptr</*non-const*/ unsigned char[]> cidtogidmap(new unsigned char[kCIDToGIDMapSize]);
574  for (int i = 0; i < kCIDToGIDMapSize; i++) {
575  cidtogidmap[i] = (i % 2) ? 1 : 0;
576  }
577  size_t len;
578  unsigned char *comp =
579  zlibCompress(cidtogidmap.get(), kCIDToGIDMapSize, &len);
580  n = snprintf(buf, sizeof(buf),
581  "5 0 obj\n"
582  "<<\n"
583  " /Length %lu /Filter /FlateDecode\n"
584  ">>\n"
585  "stream\n",
586  (unsigned long)len);
587  if (n >= sizeof(buf)) {
588  lept_free(comp);
589  return false;
590  }
591  AppendString(buf);
592  long objsize = strlen(buf);
593  AppendData(reinterpret_cast<char *>(comp), len);
594  objsize += len;
595  lept_free(comp);
596  const char *endstream_endobj =
597  "endstream\n"
598  "endobj\n";
599  AppendString(endstream_endobj);
600  objsize += strlen(endstream_endobj);
601  AppendPDFObjectDIY(objsize);
602 
603  const char *stream =
604  "/CIDInit /ProcSet findresource begin\n"
605  "12 dict begin\n"
606  "begincmap\n"
607  "/CIDSystemInfo\n"
608  "<<\n"
609  " /Registry (Adobe)\n"
610  " /Ordering (UCS)\n"
611  " /Supplement 0\n"
612  ">> def\n"
613  "/CMapName /Adobe-Identify-UCS def\n"
614  "/CMapType 2 def\n"
615  "1 begincodespacerange\n"
616  "<0000> <FFFF>\n"
617  "endcodespacerange\n"
618  "1 beginbfrange\n"
619  "<0000> <FFFF> <0000>\n"
620  "endbfrange\n"
621  "endcmap\n"
622  "CMapName currentdict /CMap defineresource pop\n"
623  "end\n"
624  "end\n";
625 
626  // TOUNICODE
627  n = snprintf(buf, sizeof(buf),
628  "6 0 obj\n"
629  "<< /Length %lu >>\n"
630  "stream\n"
631  "%s"
632  "endstream\n"
633  "endobj\n", (unsigned long) strlen(stream), stream);
634  if (n >= sizeof(buf)) return false;
635  AppendPDFObject(buf);
636 
637  // FONT DESCRIPTOR
638  n = snprintf(buf, sizeof(buf),
639  "7 0 obj\n"
640  "<<\n"
641  " /Ascent %d\n"
642  " /CapHeight %d\n"
643  " /Descent -1\n" // Spec says must be negative
644  " /Flags 5\n" // FixedPitch + Symbolic
645  " /FontBBox [ 0 0 %d %d ]\n"
646  " /FontFile2 %ld 0 R\n"
647  " /FontName /GlyphLessFont\n"
648  " /ItalicAngle 0\n"
649  " /StemV 80\n"
650  " /Type /FontDescriptor\n"
651  ">>\n"
652  "endobj\n",
653  1000,
654  1000,
655  1000 / kCharWidth,
656  1000,
657  8L // Font data
658  );
659  if (n >= sizeof(buf)) return false;
660  AppendPDFObject(buf);
661 
662  n = snprintf(buf, sizeof(buf), "%s/pdf.ttf", datadir_);
663  if (n >= sizeof(buf)) return false;
664  FILE *fp = fopen(buf, "rb");
665  if (!fp) {
666  tprintf("Can not open file \"%s\"!\n", buf);
667  return false;
668  }
669  fseek(fp, 0, SEEK_END);
670  long int size = ftell(fp);
671  fseek(fp, 0, SEEK_SET);
672  const std::unique_ptr</*non-const*/ char[]> buffer(new char[size]);
673  if (fread(buffer.get(), 1, size, fp) != static_cast<unsigned long>(size)) {
674  fclose(fp);
675  return false;
676  }
677  fclose(fp);
678  // FONTFILE2
679  n = snprintf(buf, sizeof(buf),
680  "8 0 obj\n"
681  "<<\n"
682  " /Length %ld\n"
683  " /Length1 %ld\n"
684  ">>\n"
685  "stream\n", size, size);
686  if (n >= sizeof(buf)) {
687  return false;
688  }
689  AppendString(buf);
690  objsize = strlen(buf);
691  AppendData(buffer.get(), size);
692  objsize += size;
693  AppendString(endstream_endobj);
694  objsize += strlen(endstream_endobj);
695  AppendPDFObjectDIY(objsize);
696  return true;
697 }
698 
699 bool TessPDFRenderer::imageToPDFObj(Pix *pix,
700  char *filename,
701  long int objnum,
702  char **pdf_object,
703  long int *pdf_object_size) {
704  size_t n;
705  char b0[kBasicBufSize];
706  char b1[kBasicBufSize];
707  char b2[kBasicBufSize];
708  if (!pdf_object_size || !pdf_object)
709  return false;
710  *pdf_object = NULL;
711  *pdf_object_size = 0;
712  if (!filename)
713  return false;
714 
715  L_Compressed_Data *cid = NULL;
716  const int kJpegQuality = 85;
717 
718  int format, sad;
719  findFileFormat(filename, &format);
720  if (pixGetSpp(pix) == 4 && format == IFF_PNG) {
721  Pix *p1 = pixAlphaBlendUniform(pix, 0xffffff00);
722  sad = pixGenerateCIData(p1, L_FLATE_ENCODE, 0, 0, &cid);
723  pixDestroy(&p1);
724  } else {
725  sad = l_generateCIDataForPdf(filename, pix, kJpegQuality, &cid);
726  }
727 
728  if (sad || !cid) {
729  l_CIDataDestroy(&cid);
730  return false;
731  }
732 
733  const char *group4 = "";
734  const char *filter;
735  switch(cid->type) {
736  case L_FLATE_ENCODE:
737  filter = "/FlateDecode";
738  break;
739  case L_JPEG_ENCODE:
740  filter = "/DCTDecode";
741  break;
742  case L_G4_ENCODE:
743  filter = "/CCITTFaxDecode";
744  group4 = " /K -1\n";
745  break;
746  case L_JP2K_ENCODE:
747  filter = "/JPXDecode";
748  break;
749  default:
750  l_CIDataDestroy(&cid);
751  return false;
752  }
753 
754  // Maybe someday we will accept RGBA but today is not that day.
755  // It requires creating an /SMask for the alpha channel.
756  // http://stackoverflow.com/questions/14220221
757  const char *colorspace;
758  if (cid->ncolors > 0) {
759  n = snprintf(b0, sizeof(b0),
760  " /ColorSpace [ /Indexed /DeviceRGB %d %s ]\n",
761  cid->ncolors - 1, cid->cmapdatahex);
762  if (n >= sizeof(b0)) {
763  l_CIDataDestroy(&cid);
764  return false;
765  }
766  colorspace = b0;
767  } else {
768  switch (cid->spp) {
769  case 1:
770  colorspace = " /ColorSpace /DeviceGray\n";
771  break;
772  case 3:
773  colorspace = " /ColorSpace /DeviceRGB\n";
774  break;
775  default:
776  l_CIDataDestroy(&cid);
777  return false;
778  }
779  }
780 
781  int predictor = (cid->predictor) ? 14 : 1;
782 
783  // IMAGE
784  n = snprintf(b1, sizeof(b1),
785  "%ld 0 obj\n"
786  "<<\n"
787  " /Length %lu\n"
788  " /Subtype /Image\n",
789  objnum, (unsigned long) cid->nbytescomp);
790  if (n >= sizeof(b1)) {
791  l_CIDataDestroy(&cid);
792  return false;
793  }
794 
795  n = snprintf(b2, sizeof(b2),
796  " /Width %d\n"
797  " /Height %d\n"
798  " /BitsPerComponent %d\n"
799  " /Filter %s\n"
800  " /DecodeParms\n"
801  " <<\n"
802  " /Predictor %d\n"
803  " /Colors %d\n"
804  "%s"
805  " /Columns %d\n"
806  " /BitsPerComponent %d\n"
807  " >>\n"
808  ">>\n"
809  "stream\n",
810  cid->w, cid->h, cid->bps, filter, predictor, cid->spp,
811  group4, cid->w, cid->bps);
812  if (n >= sizeof(b2)) {
813  l_CIDataDestroy(&cid);
814  return false;
815  }
816 
817  const char *b3 =
818  "endstream\n"
819  "endobj\n";
820 
821  size_t b1_len = strlen(b1);
822  size_t b2_len = strlen(b2);
823  size_t b3_len = strlen(b3);
824  size_t colorspace_len = strlen(colorspace);
825 
826  *pdf_object_size =
827  b1_len + colorspace_len + b2_len + cid->nbytescomp + b3_len;
828  *pdf_object = new char[*pdf_object_size];
829 
830  char *p = *pdf_object;
831  memcpy(p, b1, b1_len);
832  p += b1_len;
833  memcpy(p, colorspace, colorspace_len);
834  p += colorspace_len;
835  memcpy(p, b2, b2_len);
836  p += b2_len;
837  memcpy(p, cid->datacomp, cid->nbytescomp);
838  p += cid->nbytescomp;
839  memcpy(p, b3, b3_len);
840  l_CIDataDestroy(&cid);
841  return true;
842 }
843 
844 bool TessPDFRenderer::AddImageHandler(TessBaseAPI* api) {
845  size_t n;
846  char buf[kBasicBufSize];
847  char buf2[kBasicBufSize];
848  Pix *pix = api->GetInputImage();
849  char *filename = (char *)api->GetInputName();
850  int ppi = api->GetSourceYResolution();
851  if (!pix || ppi <= 0)
852  return false;
853  double width = pixGetWidth(pix) * 72.0 / ppi;
854  double height = pixGetHeight(pix) * 72.0 / ppi;
855 
856  snprintf(buf2, sizeof(buf2), "/XObject << /Im1 %ld 0 R >>\n", obj_ + 2);
857  const char *xobject = (textonly_) ? "" : buf2;
858 
859  // PAGE
860  n = snprintf(buf, sizeof(buf),
861  "%ld 0 obj\n"
862  "<<\n"
863  " /Type /Page\n"
864  " /Parent %ld 0 R\n"
865  " /MediaBox [0 0 %.2f %.2f]\n"
866  " /Contents %ld 0 R\n"
867  " /Resources\n"
868  " <<\n"
869  " %s"
870  " /ProcSet [ /PDF /Text /ImageB /ImageI /ImageC ]\n"
871  " /Font << /f-0-0 %ld 0 R >>\n"
872  " >>\n"
873  ">>\n"
874  "endobj\n",
875  obj_,
876  2L, // Pages object
877  width, height,
878  obj_ + 1, // Contents object
879  xobject, // Image object
880  3L); // Type0 Font
881  if (n >= sizeof(buf)) return false;
882  pages_.push_back(obj_);
883  AppendPDFObject(buf);
884 
885  // CONTENTS
886  const std::unique_ptr</*non-const*/ char[]> pdftext(GetPDFTextObjects(api, width, height));
887  const long pdftext_len = strlen(pdftext.get());
888  size_t len;
889  unsigned char *comp_pdftext =
890  zlibCompress(reinterpret_cast<unsigned char *>(pdftext.get()), pdftext_len, &len);
891  long comp_pdftext_len = len;
892  n = snprintf(buf, sizeof(buf),
893  "%ld 0 obj\n"
894  "<<\n"
895  " /Length %ld /Filter /FlateDecode\n"
896  ">>\n"
897  "stream\n", obj_, comp_pdftext_len);
898  if (n >= sizeof(buf)) {
899  lept_free(comp_pdftext);
900  return false;
901  }
902  AppendString(buf);
903  long objsize = strlen(buf);
904  AppendData(reinterpret_cast<char *>(comp_pdftext), comp_pdftext_len);
905  objsize += comp_pdftext_len;
906  lept_free(comp_pdftext);
907  const char *b2 =
908  "endstream\n"
909  "endobj\n";
910  AppendString(b2);
911  objsize += strlen(b2);
912  AppendPDFObjectDIY(objsize);
913 
914  if (!textonly_) {
915  char *pdf_object = nullptr;
916  if (!imageToPDFObj(pix, filename, obj_, &pdf_object, &objsize)) {
917  return false;
918  }
919  AppendData(pdf_object, objsize);
920  AppendPDFObjectDIY(objsize);
921  delete[] pdf_object;
922  }
923  return true;
924 }
925 
926 
927 bool TessPDFRenderer::EndDocumentHandler() {
928  size_t n;
929  char buf[kBasicBufSize];
930 
931  // We reserved the /Pages object number early, so that the /Page
932  // objects could refer to their parent. We finally have enough
933  // information to go fill it in. Using lower level calls to manipulate
934  // the offset record in two spots, because we are placing objects
935  // out of order in the file.
936 
937  // PAGES
938  const long int kPagesObjectNumber = 2;
939  offsets_[kPagesObjectNumber] = offsets_.back(); // manipulation #1
940  n = snprintf(buf, sizeof(buf),
941  "%ld 0 obj\n"
942  "<<\n"
943  " /Type /Pages\n"
944  " /Kids [ ", kPagesObjectNumber);
945  if (n >= sizeof(buf)) return false;
946  AppendString(buf);
947  size_t pages_objsize = strlen(buf);
948  for (size_t i = 0; i < pages_.unsigned_size(); i++) {
949  n = snprintf(buf, sizeof(buf),
950  "%ld 0 R ", pages_[i]);
951  if (n >= sizeof(buf)) return false;
952  AppendString(buf);
953  pages_objsize += strlen(buf);
954  }
955  n = snprintf(buf, sizeof(buf),
956  "]\n"
957  " /Count %d\n"
958  ">>\n"
959  "endobj\n", pages_.size());
960  if (n >= sizeof(buf)) return false;
961  AppendString(buf);
962  pages_objsize += strlen(buf);
963  offsets_.back() += pages_objsize; // manipulation #2
964 
965  // INFO
966  STRING utf16_title = "FEFF"; // byte_order_marker
967  GenericVector<int> unicodes;
968  UNICHAR::UTF8ToUnicode(title(), &unicodes);
969  char utf16[kMaxBytesPerCodepoint];
970  for (int i = 0; i < unicodes.length(); i++) {
971  int code = unicodes[i];
972  if (CodepointToUtf16be(code, utf16)) {
973  utf16_title += utf16;
974  }
975  }
976 
977  char* datestr = l_getFormattedDate();
978  n = snprintf(buf, sizeof(buf),
979  "%ld 0 obj\n"
980  "<<\n"
981  " /Producer (Tesseract %s)\n"
982  " /CreationDate (D:%s)\n"
983  " /Title <%s>\n"
984  ">>\n"
985  "endobj\n",
986  obj_, TESSERACT_VERSION_STR, datestr, utf16_title.c_str());
987  lept_free(datestr);
988  if (n >= sizeof(buf)) return false;
989  AppendPDFObject(buf);
990  n = snprintf(buf, sizeof(buf),
991  "xref\n"
992  "0 %ld\n"
993  "0000000000 65535 f \n", obj_);
994  if (n >= sizeof(buf)) return false;
995  AppendString(buf);
996  for (int i = 1; i < obj_; i++) {
997  n = snprintf(buf, sizeof(buf), "%010ld 00000 n \n", offsets_[i]);
998  if (n >= sizeof(buf)) return false;
999  AppendString(buf);
1000  }
1001  n = snprintf(buf, sizeof(buf),
1002  "trailer\n"
1003  "<<\n"
1004  " /Size %ld\n"
1005  " /Root %ld 0 R\n"
1006  " /Info %ld 0 R\n"
1007  ">>\n"
1008  "startxref\n"
1009  "%ld\n"
1010  "%%%%EOF\n",
1011  obj_,
1012  1L, // catalog
1013  obj_ - 1, // info
1014  offsets_.back());
1015  if (n >= sizeof(buf)) return false;
1016  AppendString(buf);
1017  return true;
1018 }
1019 } // namespace tesseract
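
// Illustration only, not part of pdfrenderer.cpp. The end-of-document code
// above emits a cross-reference table and trailer from a vector of byte
// offsets. The sketch below reproduces that formatting in a standalone,
// hypothetical helper named BuildXrefAndTrailer. It assumes offsets[i] is the
// start of object i, offsets.back() is the position of the "xref" keyword,
// the catalog is object 1, and the info dictionary is the last object, as in
// the listing.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: format the xref section and trailer as text.
std::string BuildXrefAndTrailer(const std::vector<long> &offsets) {
  assert(offsets.size() >= 2);
  // Plays the role of obj_: one more than the highest object number in use.
  const long obj_count = static_cast<long>(offsets.size()) - 1;
  char buf[256];
  std::string out;

  // Subsection header plus the mandatory free entry for object 0.
  snprintf(buf, sizeof(buf), "xref\n0 %ld\n0000000000 65535 f \n", obj_count);
  out += buf;

  // One fixed-width, 20-byte entry per in-use object.
  for (long i = 1; i < obj_count; ++i) {
    snprintf(buf, sizeof(buf), "%010ld 00000 n \n", offsets[i]);
    out += buf;
  }

  // "%%%%" survives one round of printf formatting as "%%", giving "%%EOF".
  snprintf(buf, sizeof(buf),
           "trailer\n<<\n /Size %ld\n /Root %ld 0 R\n /Info %ld 0 R\n>>\n"
           "startxref\n%ld\n%%%%EOF\n",
           obj_count, 1L, obj_count - 1, offsets.back());
  out += buf;
  return out;
}
```

// The fixed-width "%010ld 00000 n \n" format matters: a PDF reader can jump to
// entry i by arithmetic alone because every entry is exactly 20 bytes long.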