Submitted by WayneK on 2024/05/07 09:52

When I paste text from an OCR'd PDF, I end up with a lot of "&shy;"s (soft hyphens in the original code).

e.g. "Supreme Court" becomes "Su&shy;preme Court".

I'm trying to find a way to remove these without having to re-type each one.

InfoQube Find and Replace doesn't recognize them.

The only way I've found so far is to paste the text into WordPad to remove the formatting, but this replaces each soft hyphen with a dash, which still has to be removed. [update: doesn't work - the symbols re-appear later]

I've looked at this before but have never been able to find a solution. 

Wayne

Comments

Hi Wayne,

I wasn't aware of &shy;. I'm now removing it from copy/paste operations. It will also be removed from all fields (except the ItemHTML field, of course) when doing a Repair. This will be available in the next version.

HTH!

Pierre_Admin
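
For anyone who needs a workaround before that version is available, the text can be cleaned outside InfoQube before pasting. The following is only a minimal sketch in Python, not InfoQube's own implementation; the function name strip_soft_hyphens is made up for illustration. It assumes the soft hyphen arrives either as the raw Unicode character (U+00AD) or as one of its HTML entity spellings.

```python
import re

# Matches the raw soft hyphen character (U+00AD) as well as the
# named (&shy;), decimal (&#173;) and hexadecimal (&#xAD;) entity forms.
_SOFT_HYPHEN = re.compile(r"\u00ad|&shy;|&#0*173;|&#x0*ad;", re.IGNORECASE)


def strip_soft_hyphens(text: str) -> str:
    """Return text with every soft hyphen removed (hypothetical helper)."""
    return _SOFT_HYPHEN.sub("", text)


if __name__ == "__main__":
    sample = "Su\u00adpreme Court"
    print(strip_soft_hyphens(sample))  # prints: Supreme Court
```

Ordinary hyphens (as in "well-known") are left alone, since only the soft hyphen and its entity forms are matched.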

I can't think of another tag that causes consistent problems like this one does.

I do have a general problem with getting correct OCR'd text into InfoQube but I don't know of any other specific instances. It's just general problems interpreting imperfect text from old books and magazines, which causes a lot of manual correcting in InfoQube.

Wayne

 
