hocr fix for tesseract (anti)

If you use tesseract to create “searchable” PDFs and find that parts of the page content are missing from the PDF, here is the necessary fix for ExactCODE’s hocr2pdf:

Index: lib/hocr.cc
===============================================================
--- lib/hocr.cc
+++ lib/hocr.cc
@@ -327,6 +327,12 @@
   //std::cerr << "elementStart: '" << name << "', attr: '" << attr << "'" << std::endl;
   
   BBox b = parseBBox(attr);
+
+  // explicitly flush line of text on manual break or end of paragraph
+  if (attr.find("class='ocr_line'") != std::string::npos ||
+      attr.find("class='ocr_par'") != std::string::npos)
+    textline.flush();
+
   if (b.x2 > 0 && b.y2 > 0)
     lastBBox = b;
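
To see what the added check does outside of hocr2pdf, here is a minimal standalone sketch. It assumes a hypothetical `TextLine` buffer and two hypothetical helpers, `startsLineOrParagraph` and `demo`; none of these are the real hocr2pdf classes or functions. Only the substring test on `attr` is taken from the patch. Note that the test matches single-quoted attributes only, so input that writes `class="ocr_line"` with double quotes would presumably not trigger the flush.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for hocr2pdf's text line buffer (not the real class).
struct TextLine {
  std::string pending;               // words collected since the last flush
  std::vector<std::string> flushed;  // completed lines, in order

  void append(const std::string& word) {
    if (!pending.empty()) pending += ' ';
    pending += word;
  }

  // Emit the pending words as one line; does nothing if the buffer is empty.
  void flush() {
    if (!pending.empty()) {
      flushed.push_back(pending);
      pending.clear();
    }
  }
};

// Same substring test as in the patch: true if the element's attribute
// string marks the start of an hOCR line or paragraph.
bool startsLineOrParagraph(const std::string& attr) {
  return attr.find("class='ocr_line'") != std::string::npos ||
         attr.find("class='ocr_par'") != std::string::npos;
}

// Hypothetical walk-through: words arrive, then a new ocr_line element starts.
std::vector<std::string> demo() {
  TextLine textline;
  textline.append("first");
  textline.append("line");

  const std::string attr = "class='ocr_line' title='bbox 10 40 200 60'";
  if (startsLineOrParagraph(attr))
    textline.flush();  // without this, "second" would join the first line

  textline.append("second");
  textline.flush();
  return textline.flushed;
}
```

In this sketch the flush happens before the words of the new element are appended, which mirrors where the patch places `textline.flush()` in the element start handler.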