OCRExtractText (Function)

In French: OCRExtraitTexte

Reads the text contained in an image.

Example

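// Extract all the text found in the image
MyImage is Image
let MyString = OCRExtractText(MyImage)

// Extract the text found in a rectangular area of the image
MyImage is Image
r is Rectangle
r.X = 346
r.Y = 2258
r.Width = (2158-346)
r.Height = (2323-2258)
let sString = OCRExtractText(MyImage, r)
Trace(sString)

// Extract the text found in an area defined by a polygon
// (the area read is the rectangle that contains the polygon)
p is Polygon
p.Point[1].X = 346
p.Point[1].Y = 2258
p.Point[2].X = 2158
p.Point[2].Y = 2258
p.Point[3].X = 2158
p.Point[3].Y = 2323
p.Point[4].X = 346
p.Point[4].Y = 2323
let sString2 = OCRExtractText(MyImage, p)
Trace(sString2)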

Syntax

<Result> = OCRExtractText(<Image to use> [, <Area to read>])

<Result>: Character string

Text extracted from the image.

<Image to use>: Control name, Image variable, character string

Image in which the text areas must be detected. The image can correspond to:
- an Image control,
- an Image variable,
- an Image Memo item,
- the path of an image file,
- the path of a PDF file.

<Area to read>: Optional Rectangle or Polygon variable

Name of the Rectangle variable that represents the area containing the text to be extracted.
Name of the Polygon variable that represents the area containing the text to be extracted. In this case, the area read is the bounding rectangle of the polygon, that is, the rectangle that contains the polygon.
By default, if this parameter is not specified, all the text in the image is extracted.

Remarks

The Legacy engine is used. Custom models (.traineddata files) must be compatible with this engine.
Legacy and LSTM engines can be used in WINDEV applications (Windows and Linux). LSTM models are provided by default.
For PDF files:
- if the <Area to read> parameter is not specified, OCRExtractText will extract the text from all pages of the specified PDF file.
- if the <Area to read> parameter is specified, the desired page must be extracted as an image using PDFExtractPage (even if the PDF file has only one page). This image can then be used with OCRExtractText.
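
The two PDF cases above can be sketched as follows. This is a minimal illustration, not a reference implementation: the file path and the area coordinates are hypothetical, and the call to PDFExtractPage assumes that it takes the path of the PDF file and a page number and returns an Image variable.

// Sketch only: the file path and the area coordinates are hypothetical
sPDFFile is string = "C:\Docs\Invoice.pdf"

// No area specified: the text of all the pages is extracted
let sAllText = OCRExtractText(sPDFFile)
Trace(sAllText)

// Area specified: the page is first extracted as an image (page 1 here)
// Assumption: PDFExtractPage(<PDF file>, <Page number>) returns an Image variable
MyPage is Image = PDFExtractPage(sPDFFile, 1)
rArea is Rectangle
rArea.X = 100
rArea.Y = 200
rArea.Width = 800
rArea.Height = 60
let sAreaText = OCRExtractText(MyPage, rArea)
Trace(sAreaText)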

To get the best results possible, it is recommended to:
- Use a high-resolution image.
- Crop the image around the text if possible (avoid unnecessary areas).
- Limit text skew. Slightly skewed images can still be read, but the quality of the recognition will be affected.
- Limit the number of models/languages used.
Note that, if the image used corresponds to an Image control, OCR works directly on the source image. Therefore, the changes made in the Image control (image size, for example) are not taken into account. To apply these changes, the image must be saved first.
Note that, if the image used (via an Image control or not) is a PDF file, its resolution will be set to 300 DPI.
OCR can only detect printed text. It cannot recognize handwritten text.
"White" text is not recognized.

If the image used corresponds to an Image control and the source image is smaller than the control, the <Area to read> parameter must be specified with the coordinates of the source image and not with the coordinates of the Image control. CoordinateImageControlToImage can be used to convert these coordinates.

Related Examples:

Unit examples (WINDEV): OCR functions

This example shows how to use OCR functions in WINDEV.

Business / UI classification: Business Logic

Component: wd300ocr.dll