PC SOFT

WINDEV, WEBDEV AND WINDEV MOBILE
ONLINE HELP

WINDEV: Windows, Linux, Universal Windows 10 App, Java, Reports and Queries, User code (UMC)
WEBDEV: Windows, Linux, PHP, WEBDEV - Browser code
WINDEV Mobile: Android, Android Widget, iPhone/iPad, Apple Watch, Universal Windows 10 App, Windows Mobile
Others: Stored procedures
HTMLToText converts an HTML string or an HTML buffer into a text string. The following operations are performed during the conversion:
  • The HTML tags are deleted,
  • The special HTML characters are converted,
  • The CR characters (Carriage Return) are converted into space characters,
  • Runs of multiple spaces are reduced to a single space.
Example
MyHTMLText is string = "<!--test--><b><i>&quot;Hello!&quot;</i></b>"
Text is string = HTMLToText(MyHTMLText)
// Text is set to: "Hello!"
// If the HTML document is set to:
//<HTML>
// <HEAD>
//  <TITLE>This is a test for a Web page</TITLE>
//  <META http-equiv="content-type" content="text/html; charset=UTF-8">
// </HEAD>
//<BODY>
// <P>This is &nbsp;&nbsp;&nbsp;&nbsp; an HTML page in English</P>
// It contains 1 paragraph<BR /><DD>a tabulation<BR />and 3 line skips
//  <BR /><A href="http://www.pcsoft.fr">This is a link</A>
// </BODY>
//</HTML>

Text = HTMLToText(MyHTMLText)
// Text will contain:
// This is        an HTML page in English
//
// It contains 1 paragraph
//   a tabulation
// and 3 line skips
// This is a link
Syntax
<Result> = HTMLToText(<Text in HTML Format> [, <Charset Used>])
<Result>: Character string
Text resulting from the HTML conversion. The result is encoded in the current character set of WINDEV or WEBDEV.
<Text in HTML Format>: Character string or buffer (with quotes)
Text to convert.
<Charset Used>: Optional Integer constant
Constant identifying the character set used to write the <Text in HTML Format>.
The current character set of WINDEV or WEBDEV is used by default (charsetCurrent constant).
If information about the character set used is found in the <Text in HTML Format>, this information has priority over this parameter.
See Correspondence between languages, sub-languages, character sets and nations for more details.
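The syntax above accepts either a character string or a buffer. As a minimal sketch, the lines below load a page as a buffer and convert it while passing the documented default charset explicitly. The file path is hypothetical, and the use of fLoadBuffer to read the file is an assumption that is not taken from this page.

```
// Hypothetical file path, for illustration only
MyHTMLBuffer is Buffer = fLoadBuffer("C:\Temp\page.htm")

// Passing charsetCurrent explicitly is equivalent to omitting the second parameter
Text is string = HTMLToText(MyHTMLBuffer, charsetCurrent)
Trace(Text)
```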
Remarks

Conversion rules

  • The HTML tags are analyzed so that the output text keeps as much of the layout as possible (CR characters, space characters, tabulations). Character formatting is not kept: bold, italic, colors, ...
  • The following elements do not appear in the output text:
    • the HTML tags
    • the content of the "header" (information found in the <HEAD> tag)
    • the comments
    • the control texts
    • the scripts
    • the SSL definitions
    • the CSS styles (except color)
    • the form elements
  • Management of CR characters
    • 2 CR characters are inserted to replace the following tags: <P>, <H1> to <H6>, <TABLE>, <UL> or <OL>
    • 1 CR character is inserted to replace the following tags: <BR>, <TR>, <LI>, <DD> or <DIV>
    • 1 single CR character is inserted if several identical tags (<TR>, <LI>, <DD> or <DIV>) are found one after another (except for <BR> tags)
  • Management of tables
    • A CR character is inserted for each table row (<TR> tag).
    • A tabulation is inserted for each table column (<TD> tag).
  • Management of special characters
    A special character is a character defined in the HTML standard. For example, a non-breaking space can be written as "&nbsp;". These characters are converted automatically.
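The table rules above can be checked with a short sketch such as the following. The HTML content is made up for illustration. The code tests for a tabulation between two cells instead of assuming the exact layout of the result.

```
// Hypothetical two-row, two-column table
MyHTMLText is string = "<TABLE><TR><TD>Name</TD><TD>City</TD></TR><TR><TD>Paul</TD><TD>Paris</TD></TR></TABLE>"
Text is string = HTMLToText(MyHTMLText)

// According to the rules above, the cells of a row should be separated by a tabulation
IF Position(Text, "Name" + TAB + "City") > 0 THEN
	Trace("The cells of the first row are separated by a tabulation")
END
// Display the full result to inspect the CR characters inserted for each row
Trace(Text)
```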

Supported tags

The unsupported tags are ignored: their content is taken into account as text.
The supported tags are as follows:
  • <PRE>
  • <UL>: Line break + tabulation
  • <OL>: Line break + tabulation
  • <LI>: Tabulation
  • <H1>: Line break before and line break after
  • <H2>: Line break before and line break after
  • <H3>: Line break before and line break after
  • <H4>: Line break before and line break after
  • <H5>: Line break before and line break after
  • <H6>: Line break before and line break after
  • <P>: Line break before and line break after
  • <BR>: Line break
  • <DL>: Line break
  • <DT>: Line break
  • <DD>: Tabulation and line break
  • <TABLE>: Line break
  • <TR>: Line break
  • <TD>: Elements separated by a tabulation
  • <HEAD>: Content ignored, except for the parameters of the character set
  • <STYLE>: Content ignored
  • <SCRIPT>: Content ignored
  • <!-- -->: Comments ignored
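As an illustration of the list above, the following sketch mixes ignored elements (style, script, comment) with an unsupported tag, <SPAN>, whose content should be kept as text. The HTML content is hypothetical. No exact output is given here, because the spacing of the result depends on the conversion rules.

```
// <STYLE>, <SCRIPT> and the comment should be dropped.
// <SPAN> is not in the list of supported tags: the tag is ignored, its content is kept.
MyHTMLText is string = "<STYLE>p {color: red}</STYLE><SCRIPT>alert('x')</SCRIPT><!-- note --><SPAN>Visible</SPAN> text"
Text is string = HTMLToText(MyHTMLText)
Trace(Text)
```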

Managing the character set

To find out the character set used in the HTML text, HTMLToText uses the information found in the CONTENT attribute of a <META> tag.
If this tag is not found, the character set used to write the HTML text must be specified in <Charset Used>.
For example, if the HTML content uses an Arabic character set while WINDEV/WEBDEV uses a French character set by default, the output text will contain invalid characters.
Notes:
  • If the output text contains several "?" characters, some characters of the character set used in the HTML document cannot be represented in the character set of the current language.
  • The UTF-8 character set is commonly used to encode Web pages.
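The two cases described above (character set declared in a <META> tag, or not declared) can be sketched as follows. The file path is hypothetical. The charsetArabic constant and the fLoadBuffer function are assumed to be available in the version in use: neither is listed on this page.

```
// Hypothetical file path, for illustration only
MyHTMLBuffer is Buffer = fLoadBuffer("C:\Temp\arabic_page.htm")
Text is string

// Case 1: the document declares its character set in a <META> tag.
// The declaration has priority, so the second parameter can be omitted.
Text = HTMLToText(MyHTMLBuffer)

// Case 2: no <META> declaration is found.
// The character set used to write the HTML must be specified (charsetArabic is an assumption).
Text = HTMLToText(MyHTMLBuffer, charsetArabic)
Trace(Text)
```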
Related Examples:
  • The HTMLTo functions (Unit example, WEBDEV): explains how to use the HTMLToRTF and HTMLToText functions of WLanguage.
  • Switching from the RTF format to the HTML format (Unit example, WINDEV): uses RTFToHTML and RTFToText.
Component: wd240rtf.dll
Minimum version required
  • Version 12