TransWikia.com

Correcting Import Errors presumably caused by unwanted whitespace characters

Mathematica Asked on August 18, 2021

I am scraping data from a website that lists references searchable by the first letter
of the author’s name, and have attempted to Import the list for each letter as a single
html file. Oddly, this works fine for every letter except "P", i.e. the citations whose
authors’ names begin with "P".

Data for all letters except "P" were Imported for subsequent processing as follows:

 sldbA = Import["FBReferencesA.htm", "Data", CharacterEncoding -> "UTF8"];

 sldbZ = Import["FBReferencesZ.htm", "Data", CharacterEncoding -> "UTF8"];
 

These all work, but

 sldbP = Import["FBReferencesP.htm", "Data", CharacterEncoding -> "UTF8"];
 

fails.

Although the formatted input (html) seems identical for all such files (save for the PHP
search strings and the data for the authors’ names and dates), the Import mysteriously
doesn’t work for the file of citations whose authors’ names start with "P".

In attempting to figure out why the "P" citations fail to import, I ran the file through an html
validator and noted that various anchor elements fail validation because the href attribute contains
whitespace characters (though oddly the same is true of all the non-"P" *.htm files).

Since the "FBReferencesP.htm" file is 27,000+ lines long and the offending lines are all of the same
kind, I figured I could import the "P" file as text using "Plaintext", edit out the offending
whitespace in the appropriate lines using a Cases statement, and Export the result as valid html.
Presumably, fixing this would permit Mathematica to successfully import the file for further
processing (i.e. putting the data in a Table, Grid, or Dataset).
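The round trip described above might be sketched as follows. This is only an illustration under stated assumptions: rewriteHtml, inFile, outFile and fix are hypothetical names, and the output filename in the usage comment is invented. As far as I know, the "Plaintext" element of an HTML import returns the rendered text with the tags stripped, so the sketch reads the raw source with "Text" instead. The source then arrives as one string, so StringReplace can act on it directly and a Cases step is probably unnecessary.

```mathematica
(* Hypothetical helper: pass an HTML file through a string-level fix.
   inFile, outFile and fix are illustrative names, not from the original post. *)
rewriteHtml[inFile_String, outFile_String, fix_] := Module[{raw},
  (* "Text" returns the raw source, markup included, as a single string *)
  raw = Import[inFile, "Text", CharacterEncoding -> "UTF8"];
  (* apply the clean-up and write the result back out as plain text *)
  Export[outFile, fix[raw], "Text", CharacterEncoding -> "UTF8"]
]

(* Usage sketch; Identity stands in for the real clean-up function:
   rewriteHtml["FBReferencesP.htm", "FBReferencesPFixed.htm", Identity];
   sldbP = Import["FBReferencesPFixed.htm", "Data", CharacterEncoding -> "UTF8"]; *)
```

Writing to a new file rather than overwriting the download keeps the original available for comparison if the re-import still fails.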

The offending lines all look more or less like this:

 <a href="Author=Paalvast, P.&Year=2014&FishBase=Yes">Paalvast, P.</a>
 

except that the author names and dates vary, and there may of course be multiple
whitespace characters in the attribute.

How do I construct the Cases and StringReplace statements to remove all the whitespace between
<a href="Author= and the closing "> while leaving the rest of the line(s) alone?

NB: The whitespace in the author’s name between the > character and the anchor termination
should not be changed; whitespace should be deleted only inside the attribute, and only on lines
containing <a href="Author=.
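One possible way to express the replacement described above is a single StringReplace with a named string pattern. This is a sketch, not a tested fix for the actual file: fixAuthorHref and sample are hypothetical names, and the pattern assumes the attribute value is enclosed in double quotes and contains no embedded double quote.

```mathematica
(* Hypothetical helper: delete whitespace only inside href="Author=..." values *)
fixAuthorHref[html_String] := StringReplace[
  html,
  (* attr captures everything up to the closing quote of the attribute *)
  "<a href=\"Author=" ~~ (attr : Except["\""] ...) ~~ "\"" :>
    "<a href=\"Author=" <> StringReplace[attr, Whitespace -> ""] <> "\""
]

sample = "<a href=\"Author=Paalvast, P.&Year=2014&FishBase=Yes\">Paalvast, P.</a>";
fixAuthorHref[sample]
(* "<a href=\"Author=Paalvast,P.&Year=2014&FishBase=Yes\">Paalvast, P.</a>" *)
```

Because the pattern stops at the closing quote of the attribute, the link text after the > is left untouched, as are anchors whose href does not start with Author=. Deleting the spaces changes the query string itself; if the links need to keep working, replacing Whitespace with "%20" rather than "" may be the safer choice.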

Unfortunately, despite effort on my part, more complex regular expressions and their Mathematica
equivalents remain a mystery to me. Help would be appreciated. I hope this will be a learning experience for me and others.

The original file may be found at https://www.sealifebase.ca/ListByLetter/FBReferencesP.htm
