HTMLToText (Function)


In French: HTMLVersTexte

Converts an HTML string or buffer to text. The following operations are performed during the conversion:

- Deletion of HTML tags.
- Conversion of HTML special characters.
- Conversion of CR characters (Carriage Return) to spaces.
- Conversion of multiple spaces to single spaces.

Example

MonTexteHtml is string
MonTexteHtml = "<!--test--><b><i>&quot;Bonjour !&quot;</i></b>"
Texte is string = HTMLToText(MonTexteHtml)
// Texte is: "Bonjour !"

// If the HTML document is:
//<HTML>
// <HEAD>
//  <TITLE>Ceci est un essai de page Web</TITLE>
//  <META http-equiv="content-type" content="text/html; charset=UTF-8">
// </HEAD>
//<BODY>
// <P>Ceci est &nbsp;&nbsp;&nbsp;&nbsp; une page HTML en Français</P>
// Elle contient 1 paragraphe<BR /><DD>une tabulation<BR />et 3 sauts de lignes
//  <BR /><A href="http://www.pcsoft.fr">Ceci est un lien</A>
// </BODY>
//</HTML>

Texte = HTMLToText(MonTexteHtml)
// Texte will contain:
// Ceci est        une page HTML   en Français.
//
// Elle contient 1 paragraphe 
//   une tabulation
// et 3 sauts de lignes
// Ceci est un lien

Syntax

<Result> = HTMLToText(<Text in HTML format> [, <Charset used>])

<Result>: Character string

Text resulting from the HTML conversion. The result is encoded in the current character set of WINDEV or WEBDEV.

<Text in HTML format>: String or buffer

Text to convert.

<Charset used>: Optional Integer constant

Constant identifying the character set used to write the <Text in HTML format>. For more details on these constants, see Correspondence between languages, sub-languages, character sets and nations.
If the <Text in HTML format> contains any information about the character set, this information takes precedence over the specified constant.
If this parameter is not specified and <Text in HTML format> does not contain any information about the character set, the current character set of WINDEV/WEBDEV is used (charsetCurrent constant).
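As an illustration of the optional parameter, here is a minimal sketch. The variable names (sFragment, sPlain) are hypothetical; only HTMLToText and charsetCurrent come from this page. Since the fragment carries no character set information, the constant passed is the one that applies.

```wlanguage
// Hypothetical HTML fragment, with no <META> character set information
sFragment is string = "<P>First paragraph</P><P>Second paragraph</P>"

// Passing charsetCurrent explicitly should behave like omitting the parameter
sPlain is string = HTMLToText(sFragment, charsetCurrent)

// Display the converted text in the debug trace window
Trace(sPlain)
```

Another character set constant can be substituted for charsetCurrent when the HTML text was written in a different character set.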

Remarks

Conversion rules

The HTML tags are analyzed to preserve the layout as closely as possible in the output text (CR characters, spaces, tabs, etc.). Character formatting is not preserved: bold, italics, colors, etc.
The following elements do not appear in the text output:
- HTML tags
- content of the "header" (information in the <HEAD> tag)
- comments
- control texts
- scripts
- SSL definitions
- CSS styles (except "color" attributes)
- form elements
Management of CR characters:
- 2 Carriage Returns are inserted to replace the following tags: <P>, <H1> to <H6>, <TABLE>, <UL> or <OL>.
- 1 Carriage Return is inserted to replace the following tags: <BR>, <TR>, <LI>, <DD> or <DIV>.
- A single Carriage Return is inserted if several identical tags (<TR>, <LI>, <DD> or <DIV>) follow one another (except for <BR> tags).
Management of tables:
- A CR character is inserted for each table row (<TR> tag).
- A tab is inserted for each table column (<TD> tag).
Management of special characters:
A special character is a character defined in the HTML standard. For example, a non-breaking space can be written as "&nbsp;" and the "é" character as "&eacute;". These characters are converted automatically.
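The table rules can be tried with a short sketch such as the one below. The variable names and the sample table are hypothetical, and the exact placement of tabs and CR characters in the result is not asserted here; the code only makes them visible.

```wlanguage
// Hypothetical 2x2 HTML table containing a special character (&eacute;)
sTable is string = "<TABLE><TR><TD>Caf&eacute;</TD><TD>2</TD></TR><TR><TD>Tea</TD><TD>5</TD></TR></TABLE>"

// Convert: rows are expected to be separated by CR, cells by tabs
sPlain is string = HTMLToText(sTable)

// Make the tab separators visible before displaying the result
Trace(Replace(sPlain, TAB, " | "))
```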

Supported tags

Unsupported tags are ignored: their content is treated as text.

The supported tags are as follows:

<PRE>
<UL>: Line break + Tab
<OL>: Line break + Tab
<LI>: Tab
<H1>: Line break before and line break after
<H2>: Line break before and line break after
<H3>: Line break before and line break after
<H4>: Line break before and line break after
<H5>: Line break before and line break after
<H6>: Line break before and line break after
<P>: Line break before and line break after
<BR>: Line break
<DL>: Line break
<DT>: Line break
<DD>: Tab and line breaks
<TABLE>: Line break
<TR>: Line break
<TD>: Tab-separated elements
<HEAD>: Content ignored, except for character set parameters
<STYLE>: Content ignored
<SCRIPT>: Content ignored
Comments: Ignored
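The following sketch exercises a few of the tags listed above. The variable names and the sample markup (including the made-up <UNKNOWN> tag) are hypothetical. The check relies only on the rule stated on this page that <SCRIPT> content is ignored.

```wlanguage
// Hypothetical markup: a heading, a list, an unsupported tag and a script
sSample is string = "<H1>Shopping</H1><UL><LI>Bread</LI><LI>Milk</LI></UL><UNKNOWN>kept as text</UNKNOWN><SCRIPT>alert('x')</SCRIPT>"

sPlain is string = HTMLToText(sSample)
Trace(sPlain)

// According to the list above, the script content should not appear in the result
IF Position(sPlain, "alert") = 0 THEN
	Trace("The <SCRIPT> content was dropped")
END
```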

Managing the character set

To identify the character set used in the HTML text, HTMLToText uses the information in the CONTENT attribute of a <META> tag.

If this tag is not found, the character set used to write the HTML text must be specified in <Charset used>.

For example, if the HTML content uses an Arabic character set and WINDEV/WEBDEV uses a French character set by default, the output text will contain invalid characters.

Remarks:

If the output text contains several question marks ("?"), the characters of the character set used in the HTML document cannot be represented in the current character set.
The UTF-8 character set is commonly used to encode Web pages.
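A minimal sketch of these rules follows. The variable names and the sample page are hypothetical. The page declares its character set in a <META> tag, which, according to this page, takes precedence over the constant passed to the function. The question-mark test is only a rough heuristic based on the remark above.

```wlanguage
// Hypothetical page declaring its character set in a <META> tag
sPage is string = [
<HTML><HEAD>
<META http-equiv="content-type" content="text/html; charset=UTF-8">
</HEAD><BODY><P>Fran&ccedil;ais</P></BODY></HTML>
]

// The <META> information should take precedence over the constant passed here
sPlain is string = HTMLToText(sPage, charsetCurrent)

// Rough heuristic: question marks may reveal characters that could not be converted
IF Position(sPlain, "?") > 0 THEN
	Trace("Some characters may not exist in the current character set")
ELSE
	Trace(sPlain)
END
```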

Related examples:

- Unit examples (WEBDEV): The HTMLTo functions. This example explains how to use the HTMLToRTF and HTMLToText functions of WLanguage.
- Unit examples (WINDEV): Switching from the RTF format to the HTML format. Using RTFToHTML and RTFToText.
- Complete examples (WINDEV): WD Mail. This application is an email client developed in WINDEV. It is based on the Email objects. It retrieves and sends emails using the POP, IMAP and SMTP protocols, can apply filters to incoming emails, and can manage several email accounts. Emails are written in an HTML edit control.
- Unit examples (WINDEV): HTML types (HTMLDocument, HTMLNode, HTMLAttribute). This example shows how to use the HTMLXxx WLanguage types (HTMLDocument, HTMLNode, HTMLAttribute).

Component: wd300rtf.dll