Remove HTML tags using Regular Expression in C#

Posted by Raja under C# category
Here is a C# function that removes HTML tags from a string. It returns the content with tags and character entities stripped out, which is good enough for simple, well-formed markup.

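using System.Text.RegularExpressions;

public static string RemoveHtml(string source)
{
    // "<.*?>" matches a tag on a single line; "&.*?;" matches from an ampersand to the next semicolon.
    return Regex.Replace(source, "<.*?>|&.*?;", string.Empty);
}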


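The second alternative in the pattern also removes character entities such as &nbsp; (non-breaking space) from the content. Be aware that it removes everything from an ampersand up to the next semicolon on the same line, so a literal & in the text can take the following words with it.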
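If the entities should survive as readable characters instead of being dropped, one option is to strip only the tags and then decode what is left. The following is a minimal sketch, not part of the original snippet. It assumes System.Net.WebUtility is available on the target framework, and the names HtmlText and StripTagsAndDecode are hypothetical.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper: removes tags first, then converts entities
    // such as &amp; or &lt; into the characters they stand for.
    public static string StripTagsAndDecode(string source)
    {
        if (string.IsNullOrEmpty(source))
        {
            return string.Empty;
        }

        // Remove anything that looks like a tag, including tags that span lines.
        string withoutTags = Regex.Replace(source, "<[^>]*>", string.Empty);

        // Decode entities instead of deleting them.
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```

Because decoding runs after the tags are removed, the result can contain < and > characters again, so it should be treated as plain text and encoded before being written back into a page.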

Thanks

Comments or Responses

Posted by: Ishan7 on: 12/15/2020
As has often been pointed out, regular expressions are a poor tool for processing XML or HTML documents. A regular expression has no general way to describe arbitrarily nested structures, so it cannot reliably tell where an element, comment, or attribute value really ends.

If you still want a regular expression, you could use the following.

String result = Regex.Replace(htmlDocument, @"<[^>]*>", String.Empty);

This works in most cases, but some inputs will not come out as expected, for example a CDATA section that contains angle brackets.
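To make the CDATA case concrete: given <![CDATA[ a > b ]]>, the pattern stops at the first > and leaves " b ]]>" behind. One way to reduce such surprises is to remove the problematic blocks as a whole before stripping tags. This is only a sketch under assumptions. The class name TagStripper is hypothetical, and discarding the contents of comments, CDATA sections, and script/style elements is a design choice made here, not something the original post prescribes. It is still not a parser, so malformed markup can defeat it.

```csharp
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Comments, CDATA sections, and whole script/style elements.
    // Singleline lets "." match newlines so multi-line blocks are covered.
    private static readonly Regex Blocks = new Regex(
        @"<!--.*?-->|<!\[CDATA\[.*?\]\]>|<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // The tag pattern from the comment above.
    private static readonly Regex Tags = new Regex(@"<[^>]*>");

    public static string Strip(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // Pass 1: drop the blocks whose contents may contain angle brackets.
        string withoutBlocks = Blocks.Replace(html, string.Empty);

        // Pass 2: drop the remaining tags.
        return Tags.Replace(withoutBlocks, string.Empty);
    }
}
```

For example, TagStripper.Strip("x<![CDATA[ a > b ]]>y") gives "xy", where the single-pattern version would give "x b ]]>y".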

Reference: https://stackoverflow.com/a/787951/11954917
