Database Administrators Asked by user1664043 on November 14, 2021
I’m getting these XML files from a vendor. Each one wraps a NITF (news) document and a block of news metadata in the http://www.xmlnews.org/namespaces/meta# schema (from Space 1999!).
Unfortunately, they don’t declare any namespaces at all on the outer document. This is what they give us:
<?xml version="1.0"?>
<document>
<nitf>
<head>...</head>
<body>...</body>
etc
</nitf>
<xn:Resource xmlns:xn="http://www.xmlnews.org/namespaces/meta#">...</xn:Resource>
</document>
I was trying to see if I could improve throughput by creating an xml schema collection and parsing it typed, but the lack of any namespace declaration in the xml text is tripping me up.
I’ve tried putting
;WITH XMLNAMESPACES (default 'http://iptc.org/std/NITF/2006-10-18/')
SELECT CAST(rawXml as XML(NitfSchemaCollection))
but it doesn’t like it (XML Validation: Declaration not found for element ‘document’ exception).
I even tried using ;WITH XMLNAMESPACES to get the raw xml parsed into an XML type and then casting it to XML(NitfSchemaCollection), but same problem.
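For context: WITH XMLNAMESPACES supplies namespace bindings to the xml data type methods and to FOR XML; it does not change how the XML text itself is parsed by CAST. A minimal sketch of that behaviour, using a cut-down stand-in for the vendor document (the variable names and the reduced document are illustrative assumptions):

```sql
-- Cut-down stand-in for the vendor file: no namespace on <document> or <nitf>
DECLARE @raw nvarchar(max) = N'<document><nitf><head/></nitf></document>';
DECLARE @x xml = CAST(@raw AS xml);   -- untyped parse

-- The root element is in no namespace, so this returns an empty string
SELECT @x.value('namespace-uri((/*)[1])', 'nvarchar(200)') AS RootNamespace;

-- The default namespace applies only to the XQuery path below, so the path
-- looks for {NITF namespace}document and finds nothing: exist() returns 0
WITH XMLNAMESPACES (DEFAULT 'http://iptc.org/std/NITF/2006-10-18/')
SELECT @x.exist('/document') AS FoundInNitfNamespace;
```

If the real files behave the same way, the elements stay in no namespace after parsing, which would be consistent with the validation error above.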
So my questions are:
and
We’re currently on Sql Server 2008 sp4 but I could try it on a newer instance if that might change something.
EDIT: Here’s a sample document. Both the nitf and xn:Resource nodes conform to two very old newswire service serialization standards. For my schema collection I added both, and tweaked the nitf one to add the document node, which is non-standard. The schema are lengthy for a post but I can add them if anyone is interested.
<?xml version="1.0"?>
<document>
<nitf>
<head>
<title>First World Problems: 'Should I cancel my Easter holiday and charter a superyacht to escape coronavirus?'</title>
</head>
<body>
<body.head>
<hedline>
<hl1>First World Problems: 'Should I cancel my Easter holiday and charter a superyacht to escape coronavirus?'</hl1>
</hedline>
<byline>
<bytag>By Caroline White</bytag>
</byline>
<distributor>Telegraph Group</distributor>
</body.head>
<body.content>
<p><em>'I am thinking of cancelling my Easter holiday and chartering a yacht to whisk my immediate family off to sea. The idea is that we can still enjoy the trip of a lifetime without risking contracting the coronavirus. How would you recommend proceeding?'</em></p>
<p>If you’ve got the wallet for it, a superyacht charter offers the most luxurious seclusion on the planet – and like the hand sanitiser aisle in Boots, you’re not the first to think of it. Some brokers anticipate an uptick in superyacht sales, as UHNWI look to create safe havens, and wealthy holidaymakers are likely to follow suit. So get moving.</p>
<p>The first step is to recruit a charter broker – try Fraser, Burgess, YPI or <org value="ACORN:3601037911" idsrc="xmltag.org" >Camper &amp; Nicholsons</org>. They will gauge your budget, preferences and read your personality (are you too formal for that laid-back Aussie captain; are you too wild for that silver-service English crew) then come back to you with a bespoke selection of options. The next step is a rather blissful journey through yacht brochures. Then there are the itineraries to flick through: beach barbeques, diving days and suppers under the stars…</p>
...blah blah blah...
<p><em><em>If you have a question for any of our Telegraph Luxury experts, on any topic, please email <a href="http://mailto:[email protected]/">[email protected]</a></em></em></p>
<p><em>Last week on First World Problems</em></p>
<p><a href="https://www.telegraph.co.uk/luxury/womens-style/first-world-problems-expensive-blonde-highlights-mayfair-salon/">First World Problems: 'Are expensive highlights at a Mayfair salon worth the price-and the journey?'</a></p>
<p><em><em>Sign up for the <a href="https://www.telegraph.co.uk/newsletters/Luxury/">Telegraph Luxury newsletter</a> for your weekly dose of exquisite taste and expert opinion.</em></em></p>
</body.content>
</body>
</nitf>
<xn:Resource xmlns:xn="http://www.xmlnews.org/namespaces/meta#">
<xn:providerName>Telegraph Group</xn:providerName>
<xn:providerCode>127</xn:providerCode>
<xn:serviceName>Telegraph Online</xn:serviceName>
<xn:serviceCode>2</xn:serviceCode>
<xn:resourceID>202003100715TELEGR__ONLINE___60979152</xn:resourceID>
<xn:publicationTime>2020-03-10T07:15:00-04:00</xn:publicationTime>
<xn:receivedTime>2020-03-10T07:50:43-04:00</xn:receivedTime>
<xn:title>First World Problems: 'Should I cancel my Easter holiday and charter a superyacht to escape coronavirus?'</xn:title>
<xn:rendition>202003100715TELEGR__ONLINE___60979152.xml</xn:rendition>
<xn:vendorData>WAVO:Publish Reason=CORRECTED</xn:vendorData>
<xn:vendorData>WAVO:alert=FALSE</xn:vendorData>
<xn:vendorData>WAVO:headline_only=FALSE</xn:vendorData>
<xn:vendorData>WAVO:temporary=FALSE</xn:vendorData>
<xn:vendorData>AMX:Publish Reason=CORRECTED</xn:vendorData>
<xn:vendorData>AMX:Alert=FALSE</xn:vendorData>
<xn:vendorData>AMX:Headline Only=FALSE</xn:vendorData>
<xn:vendorData>AMX:Temporary=FALSE</xn:vendorData>
<xn:vendorData>AMX:Special Code=PS/p.TELEGR__</xn:vendorData>
<xn:vendorData>AMX:Special Code=PS/s.ONLINE__</xn:vendorData>
<xn:copyright>Copyright © 2020 Telegraph.co.ukk. All rights reserved</xn:copyright>
<!-- Entity Extractor -->
<xn:companyCode>ACORN:A.3601037911#6#60#60</xn:companyCode>
<xn:companyCode>ACORN:A.2295203068#6#60#60</xn:companyCode>
<xn:industryCode>IC/fini#6#50#60</xn:industryCode>
<xn:industryCode>IC/fini.bank#6#60#60</xn:industryCode>
<xn:industryCode>IC/fini.invs#6#60#60</xn:industryCode>
<xn:industryCode>IC/fini.secr#6#60#60</xn:industryCode>
<xn:industryCode>IC/svcs#6#50#60</xn:industryCode>
<xn:industryCode>IC/svcs.prof#6#60#60</xn:industryCode>
<xn:locationCode>LB/car#7#70#49</xn:locationCode>
<xn:locationCode>LR/car#9#70#90</xn:locationCode>
<xn:locationCode>LU/car#9#70#90</xn:locationCode>
<xn:locationCode>LU/car.any#7#49#70</xn:locationCode>
<xn:subjectCode>NZ/COID#6#50#60</xn:subjectCode>
<xn:subjectCode>NZ/COID.1475554280#6#60#60</xn:subjectCode>
<xn:subjectCode>NZ/COID.27088#6#60#60</xn:subjectCode>
<xn:subjectCode>NZ/COID.5838940#6#60#60</xn:subjectCode>
<!-- Classifier -->
<xn:subjectCode>IS/lifesoc.privair#5#50#50</xn:subjectCode>
<xn:subjectCode>MC/HOT#6</xn:subjectCode>
<xn:subjectCode>NC/67115358#9#98#50</xn:subjectCode>
<xn:subjectCode>NC/67115586#5#55#50</xn:subjectCode>
<xn:subjectCode>NC/67119129#5#58#50</xn:subjectCode>
<xn:subjectCode>NC/67119169#5#50#50</xn:subjectCode>
<xn:vendorData>AMX:Special Code=PT/updated</xn:vendorData>
<xn:subjectCode>XC/any#6#50#60</xn:subjectCode>
<xn:subjectCode>XC/any.company#6#60#50</xn:subjectCode>
<xn:subjectCode>XC/Private#6#60#50</xn:subjectCode>
<!-- Rules -->
<xn:subjectCode>MC/BIZREL#1</xn:subjectCode>
<xn:subjectCode>NE/BAYERINS#5#58#50</xn:subjectCode>
<xn:subjectCode>NE/GEOAMER#9#70#90</xn:subjectCode>
<xn:subjectCode>NE/GEOCARIB#9#70#90</xn:subjectCode>
<xn:industryCode>NI/Banks#6#60#60</xn:industryCode>
<xn:industryCode>NI/Finance#6#60#60</xn:industryCode>
<xn:industryCode>NI/Securities#6#60#60</xn:industryCode>
<xn:industryCode>NI/Services#6#60#60</xn:industryCode>
<xn:vendorData>AMX:Special Code=TL/americas#7#70#50</xn:vendorData>
<xn:vendorData>AMX:Special Code=TL/LOC#7#50#70</xn:vendorData>
<xn:vendorData>AMX:Special Code=TT/TOPIC#5#50#50</xn:vendorData>
<xn:vendorData>AMX:Special Code=TT/transport#5#50#50</xn:vendorData>
<xn:language>en</xn:language>
</xn:Resource>
</document>
Our processing has to parse these documents, and then we normalize a number of the metadata attributes out into various tables and columns.
Just parsing unknown xml, I presume Sql Server has to start with a blank name table for every document parsed; I figured a typed xml column starts with a known vocabulary and should be faster. Plus the hope was the xquery would be faster as well.
Here’s an example of the queries we do in processing:
;WITH XMLNAMESPACES ('http://www.xmlnews.org/namespaces/meta#' AS xn)
Insert Into dbo.NewsStory
Select NewsID,provider,service,
CASE When provider='AMSPIDER' and Service='ACBJ' and PublicationAbbrev='web.site' Then dbo.fnGetSpiderPubAbbrev(PublicationAbbrev_Spider) Else PublicationAbbrev End As PublicationAbbrev,
Title, PublishDate, AMXReceivedTime, AllowedReleaseTime,ParsedDate,DateLine, Description, [Language], PublishReason, IsAlert, IsHeadLine, IsTemporary, Copyright
From (
Select X.NewsID,
replace(RIGHT(RS.c.value('(./xn:vendorData[substring((./text())[1],1,22)="AMX:Special Code=PS/p."]/text())[1]', 'VARCHAR(50)'),8) , '_', '') as provider,
replace(RIGHT(RS.c.value('(./xn:vendorData[substring((./text())[1],1,22)="AMX:Special Code=PS/s."]/text())[1]', 'VARCHAR(50)'),8) , '_', '') as service,
CONVERT(NVARCHAR(max),RS.c.query('xn:vendorData')) as PublicationAbbrev,
replace(RS.c.value('(./xn:vendorData[substring((./text())[1],1,11)="AMX:Credit="]/text())[1]', 'VARCHAR(200)'),'AMX:Credit=', '') as PublicationAbbrev_Spider,
RS.c.value('(./xn:title/text())[1]', 'VARCHAR(200)') AS Title,
CONVERT(DATETIME,REPLACE(LEFT(RS.c.value('(./xn:publicationTime/text())[1]', 'VARCHAR(50)'),19),'T',' ')) AS PublishDate,
CONVERT(DATETIME,REPLACE(LEFT(RS.c.value('(./xn:receivedTime/text())[1]', 'VARCHAR(50)'),19),'T',' ')) AS AMXReceivedTime,
CONVERT(DATETIME,REPLACE(LEFT(RS.c.value('(./xn:releaseTime/text())[1]', 'VARCHAR(50)'),19),'T',' ')) AS AllowedReleaseTime, getdate() as ParsedDate,
RS.c.value('(./xn:dateline/text())[1]', 'VARCHAR(200)') AS DateLine,
RS.c.value('(./xn:description/text())[1]', 'VARCHAR(2000)') AS Description,
RS.c.value('(./xn:language/text())[1]', 'VARCHAR(10)') AS [Language],
LTRIM(SUBSTRING(RS.c.value('(./xn:vendorData[substring((.)[1],1,19)="AMX:Publish Reason="])[1]','VARCHAR(45)'),20,25)) AS PublishReason,
CASE LTRIM(SUBSTRING(RS.c.value('(./xn:vendorData[substring((./text())[1],1,10)="AMX:Alert="]/text())[1]','VARCHAR(45)'),11,10)) WHEN 'FALSE' THEN 0 ELSE 1 END AS IsAlert,
CASE LTRIM(SUBSTRING(RS.c.value('(./xn:vendorData[substring((./text())[1],1,18)="AMX:Headline Only="]/text())[1]','VARCHAR(45)'),19,10)) WHEN 'FALSE' THEN 0 ELSE 1 END AS IsHeadLine,
CASE LTRIM(SUBSTRING(RS.c.value('(./xn:vendorData[substring((./text())[1],1,14)="AMX:Temporary="]/text())[1]','VARCHAR(45)'),15,10)) WHEN 'FALSE' THEN 0 ELSE 1 END AS IsTemporary,
RS.c.value('(./xn:copyright/text())[1]', 'VARCHAR(1000)')AS Copyright
From @XmlFileTable X CROSS APPLY AMXFile.nodes('/document/xn:Resource') RS(c)
) A
The schema collection comes from the NITF source (https://www.iptc.org/std/NITF/3.6/specification/nitf-3-6.xsd) and the xmlnews dtd (http://www.xmlnews.org/dtds/xmlnews-meta-dtd.zip).
I used Visual Studio to convert the xmlnews dtd to a schema and used that to seed NitfSchemaCollection.
Then I tweaked the NITF schema:
removed the include (apparently a small subset for Ruby that I didn’t need)
added this to the header:
... xmlns:xn="http://www.xmlnews.org/namespaces/meta#">
<import namespace="http://www.xmlnews.org/namespaces/meta#" />
added a document element just above the nitf element declaration, to match what the vendor is shipping to us, e.g.
<element name="document">
<complexType>
<sequence>
<element ref="nitf:nitf" minOccurs="1" maxOccurs="1" />
<element ref="xn:Resource" minOccurs="1" maxOccurs="1" />
</sequence>
</complexType>
</element>
Each document has only 1 nitf node and 1 xn:Resource node, but there can be many instances of the child nodes under xn:Resource.
The part of the XML you are parsing is constrained by a DTD, not by a schema, so you can't use a schema collection to make SQL Server parse it any differently. That said, I have yet to see a case where a schema helps when you are shredding XML documents into tables, and typed XML adds the overhead of validating the XML against the schema.
There are some things you can do in the query to make it more efficient.
In the query below I changed the handling of dates, moved text() out of the predicates so that it comes just before them and used . inside the predicates, and used exist() where you are checking for boolean values.
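To make those three changes easier to compare side by side, here is a minimal sketch against a tiny hand-made xn:Resource fragment. The fragment and the column aliases are assumptions for illustration; the XQuery expressions are the ones used in the question and in the rewrite.

```sql
-- Tiny stand-in for one xn:Resource node (values taken from the sample document)
DECLARE @x xml = N'<xn:Resource xmlns:xn="http://www.xmlnews.org/namespaces/meta#">
  <xn:publicationTime>2020-03-10T07:15:00-04:00</xn:publicationTime>
  <xn:vendorData>AMX:Alert=FALSE</xn:vendorData>
  <xn:vendorData>AMX:Special Code=PS/p.TELEGR__</xn:vendorData>
</xn:Resource>';

WITH XMLNAMESPACES ('http://www.xmlnews.org/namespaces/meta#' AS xn)
SELECT
    -- Question's form: predicate on the element, text() read inside and after it
    R.c.value('(xn:vendorData[substring((./text())[1],1,22)="AMX:Special Code=PS/p."]/text())[1]', 'varchar(50)') AS ElementPredicate,
    -- Rewritten form: step to text() first, then test that text node with "."
    R.c.value('(xn:vendorData/text()[substring((.)[1],1,22)="AMX:Special Code=PS/p."])[1]', 'varchar(50)') AS TextPredicate,
    -- varchar(19) cuts the ISO 8601 value off before the UTC offset
    CONVERT(datetime, R.c.value('(xn:publicationTime/text())[1]', 'varchar(19)')) AS PublishDate,
    -- exist() returns a bit, so no SUBSTRING/CASE is needed
    R.c.exist('xn:vendorData/text()[. = "AMX:Alert=TRUE"]') AS IsAlert
FROM @x.nodes('/xn:Resource') AS R(c);
```

One difference worth checking on real data: the CASE expressions in the question return 1 when the vendorData entry is missing (NULL never matches 'FALSE'), whereas exist() returns 0 in that case.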
Note that in my tests the rewrite did not go parallel, so keep that in mind when comparing performance. You might like that it uses only one thread on a busy server, or you might want to use everything you have. If you want the query to go parallel you can use a trace flag, OPTION (QUERYTRACEON 8649), or if you prefer a serial plan use OPTION (MAXDOP 1).
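As a rough sketch of where the hint goes, the OPTION clause is appended to the end of the statement. The table variable definition and the single-row test data below are assumptions for illustration; only the column names NewsID and AMXFile come from the question.

```sql
-- Hypothetical shape of @XmlFileTable; the real definition is not shown in the question
DECLARE @XmlFileTable table (NewsID int PRIMARY KEY, AMXFile xml NOT NULL);

INSERT INTO @XmlFileTable (NewsID, AMXFile)
VALUES (1, N'<document><xn:Resource xmlns:xn="http://www.xmlnews.org/namespaces/meta#"><xn:language>en</xn:language></xn:Resource></document>');

WITH XMLNAMESPACES ('http://www.xmlnews.org/namespaces/meta#' AS xn)
SELECT X.NewsID,
       RS.c.value('(xn:language/text())[1]', 'varchar(10)') AS [Language]
FROM @XmlFileTable AS X
CROSS APPLY X.AMXFile.nodes('/document/xn:Resource') AS RS(c)
OPTION (MAXDOP 1);              -- force a serial plan
-- OPTION (QUERYTRACEON 8649);  -- alternative: ask the optimizer for a parallel plan
```

Trace flag 8649 is undocumented and QUERYTRACEON normally requires sysadmin membership, so it is probably best treated as a testing aid rather than something to leave in production code.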
In my tests on SQL Server 2008 the rewrite is about twice as fast.
Look at what I did here, use it if you like it and test on your data.
with xmlnamespaces ('http://www.xmlnews.org/namespaces/meta#' AS xn)
select replace(right(RS.c.value('(xn:vendorData/text()[substring((.)[1],1,22)="AMX:Special Code=PS/p."])[1]', 'varchar(50)'), 8), '_', '') as provider,
replace(right(RS.c.value('(xn:vendorData/text()[substring((.)[1],1,22)="AMX:Special Code=PS/s."])[1]', 'varchar(50)'), 8), '_', '') as service,
convert(nvarchar(max), RS.c.query('xn:vendorData')) as PublicationAbbrev,
replace(RS.c.value('(xn:vendorData/text()[substring((.)[1],1,11)="AMX:Credit="])[1]', 'VARCHAR(200)'), 'AMX:Credit=', '') as PublicationAbbrev_Spider,
RS.c.value('(xn:title/text())[1]', 'varchar(200)') as Title,
convert(datetime, RS.c.value('(xn:publicationTime/text())[1]', 'varchar(19)')) as PublishDate,
convert(datetime, RS.c.value('(xn:receivedTime/text())[1]', 'varchar(19)')) as AMXReceivedTime,
convert(datetime, RS.c.value('(xn:releaseTime/text())[1]', 'varchar(19)')) as AllowedReleaseTime,
getdate() as ParsedDate,
RS.c.value('(xn:dateline/text())[1]', 'varchar(200)') as DateLine,
RS.c.value('(xn:description/text())[1]', 'varchar(2000)') as Description,
RS.c.value('(xn:language/text())[1]', 'varchar(10)') as [Language],
ltrim(substring(RS.c.value('(./xn:vendorData/text()[substring((.)[1],1,19)="AMX:Publish Reason="])[1]', 'VARCHAR(45)'), 20, 25)) as PublishReason,
RS.c.exist('xn:vendorData/text()[. = "AMX:Alert=TRUE"]') as IsAlert,
RS.c.exist('xn:vendorData/text()[. = "AMX:Headline Only=TRUE"]') as IsHeadLine,
RS.c.exist('xn:vendorData/text()[. = "AMX:Temporary=TRUE"]') as IsTemporary,
RS.c.value('(xn:copyright/text())[1]', 'varchar(1000)') as Copyright
from @XmlFileTable X
cross apply AMXFile.nodes('/document/xn:Resource') RS(c);
Answered by Mikael Eriksson on November 14, 2021