In this article we will learn the use of PATINDEX and PARSENAME to extract relevant information from unstructured data.
Introduction
Let's say we have some records as under
Data
------
988455XXXX Niladri <html><body>This is Niladri's mobile number</body></html>
988785XXXX Arina <html><body>This is Arina's mobile number</body></html>
45/78,KRRoad,Kanoji-34**** Babai <html><body>This is Babai's house</body></html>
The objective is to bring the record into a relational format such as
MobileNumber OR Address Owner Name Details
----------------------- ---------- -------
988455XXXX Niladri <html><body>This is Niladri's mobile number</body></html>
988785XXXX Arina <html><body>This is Arina's mobile number</body></html>
45/78,KRRoad,Kanoji-34**** Babai <html><body>This is Babai's house</body></html>
Straight to program
As a first activity, let us separate the HTML portion from the rest. For doing so, we will use PATINDEX to get the starting index of the character(<) as under
SELECT
Data,[StartIndex of <]=PATINDEX('%<%',Data)
FROM @T
/*
Data StartIndex of <
---- ------------------
988455XXXX Niladri <html><body>This is Niladri's mobile number</body></html> 20
988785XXXX Arina <html><body>This is Arina's mobile number</body></html> 18
45/78,KRRoad,Kanoji-34**** Babai <html><body>This is Babai's house</body></html> 34
*/
We can now use substring to extract the relevant portions as under
SELECT
BeforeHtmlContents = SUBSTRING(Data,1,PATINDEX('%<%',Data)-1)
,HtmlContents = SUBSTRING(Data,PATINDEX('%<%',Data),LEN(Data))
FROM @T
/*
BeforeHtmlContents HtmlContents
------------------ ------------
988455XXXX Niladri <html><body>This is Niladri's mobile number</body></html>
988785XXXX Arina <html><body>This is Arina's mobile number</body></html>
45/78,KRRoad,Kanoji-34**** Babai <html><body>This is Babai's house</body></html>
*/
Now let us replace the white spaces of the BeforeHtmlContents column's data with dot(.) as under
SELECT
BeforeHtmlContents = REPLACE(RTRIM(SUBSTRING(Data,1,PATINDEX('%<%',Data)-1)),' ','.')
--,HtmlContents = SUBSTRING(Data,PATINDEX('%<%',Data),LEN(Data))
FROM @T
/*
BeforeHtmlContents
--------------------
988455XXXX.Niladri
988785XXXX.Arina
45/78,KRRoad,Kanoji-34****.Babai
*/
This will be needed at the time of using PARSENAME function. Finally we need to separate the MobileNumber OR Address field from the Owner Name which we are doing as under
SELECT
[MobileNumber OR Address]=PARSENAME(BeforeHtmlContents, 2)
,[Owner Name]=PARSENAME(BeforeHtmlContents, 1)
,Details = HtmlContents
FROM
(SELECT
BeforeHtmlContents = REPLACE(RTRIM(SUBSTRING(Data,1,PATINDEX('%<%',Data)-1)),' ','.')
,HtmlContents = SUBSTRING(Data,PATINDEX('%<%',Data),LEN(Data))
FROM @T)X
/*
MobileNumber OR Address Owner Name Details
----------------------- ---------- -------
988455XXXX Niladri <html><body>This is Niladri's mobile number</body></html>
988785XXXX Arina <html><body>This is Arina's mobile number</body></html>
45/78,KRRoad,Kanoji-34**** Babai <html><body>This is Babai's house</body></html>
*/
The complete TSQL Script is given below
DECLARE @T TABLE(Data VARCHAR(100))
INSERT INTO @T VALUES('988455XXXX Niladri <html><body>This is Niladri''s mobile number</body></html>')
INSERT INTO @T VALUES('988785XXXX Arina <html><body>This is Arina''s mobile number</body></html>')
INSERT INTO @T VALUES('45/78,KRRoad,Kanoji-34**** Babai <html><body>This is Babai''s house</body></html>')
SELECT
[MobileNumber OR Address]=PARSENAME(BeforeHtmlContents, 2)
,[Owner Name]=PARSENAME(BeforeHtmlContents, 1)
,Details = HtmlContents
FROM
(SELECT
BeforeHtmlContents = REPLACE(RTRIM(SUBSTRING(Data,1,PATINDEX('%<%',Data)-1)),' ','.')
,HtmlContents = SUBSTRING(Data,PATINDEX('%<%',Data),LEN(Data))
FROM @T)X
Conclusion
This article taught us how to use PARSENAME and PATINDEX together to bring an unstructured record to a structured one. Hope this will be helpful. Thanks for reading.