Web Dump hocr fix for tesseract
If you use tesseract to make “searchable” PDFs and find that parts of the page content are missing from the resulting PDF, here is the necessary fix for ExactCODE’s hocr2pdf:
Index: lib/hocr.cc
===============================================================
--- lib/hocr.cc
+++ lib/hocr.cc
@@ -327,6 +327,12 @@
//std::cerr << "elementStart: '" << name << "', attr: '" << attr << "'" << std::endl;
BBox b = parseBBox(attr);
+
+ // explicitly flush line of text on manual break or end of paragraph
+ if (attr.find("class='ocr_line'") != std::string::npos ||
+ attr.find("class='ocr_par'") != std::string::npos)
+ textline.flush();
+
if (b.x2 > 0 && b.y2 > 0)
lastBBox = b;
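
To illustrate what the patch changes, here is a minimal, self-contained sketch of the same flush-on-line-or-paragraph idea. It is not the actual hocr2pdf code: `TextLine`, `startsNewLineOrParagraph` and this simplified `elementStart` are hypothetical stand-ins, and only the two `attr.find(...)` checks are taken from the patch.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for hocr2pdf's text line buffer (assumption:
// the real class collects words and writes them out on flush()).
struct TextLine {
    std::string pending;               // words collected since the last flush
    std::vector<std::string> emitted;  // lines that have been written out

    void add(const std::string& word) {
        if (!pending.empty()) pending += ' ';
        pending += word;
    }

    void flush() {
        if (pending.empty()) return;   // nothing buffered, nothing to emit
        emitted.push_back(pending);
        pending.clear();
    }
};

// Same test as in the patch: does the element open a new hOCR line or paragraph?
// Note that it matches single-quoted attributes only, exactly as the patch does.
bool startsNewLineOrParagraph(const std::string& attr) {
    return attr.find("class='ocr_line'") != std::string::npos ||
           attr.find("class='ocr_par'") != std::string::npos;
}

// Simplified element-start handler: flush buffered text before a new
// line or paragraph begins, so earlier words are not lost or merged.
void elementStart(TextLine& textline, const std::string& attr) {
    if (startsNewLineOrParagraph(attr))
        textline.flush();
}
```

Because the check is a plain substring match on `class='ocr_line'` and `class='ocr_par'`, hOCR input that writes these attributes with double quotes would not trigger the flush in this sketch.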