TransWikia.com

How to retrieve multiple words that are between <span> and </span> using the match function or other means?

Stack Overflow Asked by webdesignnoob on February 2, 2021

Say my string is this:

var testexample = `<p nameIt="Title">Title_Test</p><figure class="t15"><table><thead><tr>
<th><span>Column1</span></th><th><span>Column2</span></th></tr></thead><tbody><tr><td><span>Entry1</span></td><td><span>Entry2</span></td><td><span>ready</span></td></tr></tbody></table></figure><p ex="ready">!aaa`;

It’s quite a long string, but it’s a table written out in string form. How would I get the words from in between <span> and </span>? For example, I would like it to return Column1, Column2, Entry1, Entry2 (maybe in an array?)

Here is what I tried so far:

storing = testexample.match(/<span>(.*)</span>/);

But it only returned "Column1". I also tried matchAll, exec, and /<span>(.*)</span>/g. These gave me the whole string, nothing, things like <th><span>Column1</span></th>, or just "Column1" again.
I’m quite new at JavaScript, so I’m unsure what I’m doing wrong, even though I have read the documentation for this. Any help would be appreciated. Thank you.

3 Answers

As pointed out, you can't reliably parse random HTML with Regex. HOWEVER, assuming you only want to parse an HTML table of the kind you have in the question, this is your regex:

<span>(.*?)<\/span>

I changed a couple things:

  1. You hadn't escaped the / in </span>, so your regex literal actually ended earlier than you intended.
  2. I added a ? after the .* in the match-anything section. This makes the quantifier lazy, so the regex matches the shortest possible sequence and each span gets its own match.
  3. Calling match with the g flag returns all occurrences of this regex. Each result is the whole match, so it still includes the <span> / </span> parts.
  4. Strip the leading <span> and trailing </span> from each match.
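To make the difference in point 2 concrete, here is a minimal sketch comparing the greedy and lazy forms. The sampleRow string is a shortened, hypothetical input based on the table in the question, and the variable names are only for illustration.

```javascript
// Hypothetical sample input, shortened from the question's table markup.
const sampleRow = "<th><span>Column1</span></th><th><span>Column2</span></th>";

// Greedy: .* runs on to the LAST </span>, so both cells land in one match.
const greedyMatches = sampleRow.match(/<span>(.*)<\/span>/g);

// Lazy: .*? stops at the FIRST </span> it can, giving one match per span.
const lazyMatches = sampleRow.match(/<span>(.*?)<\/span>/g);

console.log(greedyMatches.length); // 1
console.log(lazyMatches.length);   // 2
```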

Here's the complete example:

var testexample = `<p nameIt="Title">Title_Test</p><figure class="t15"><table><thead><tr>
<th><span>Column1</span></th><th><span>Column2</span></th></tr></thead><tbody><tr><td><span>Entry1</span></td><td><span>Entry2</span></td><td><span>ready</span></td></tr></tbody></table></figure><p ex="ready">!aaa`;

var regex = /<span>(.*?)<\/span>/g;

var match = testexample.match(regex);
var columnContent = match.map(m => m.replace("<span>", "").replace("</span>", ""));
console.log(columnContent[0]); // Column1
console.log(columnContent[1]); // Column2
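
A possible variation, assuming an environment that supports String.prototype.matchAll (ES2020): reading the capture group directly avoids the separate step of stripping the tags. The names tableHtml, spanPattern and spanTexts are made up for this sketch. Note that the question's string also has a span containing "ready", so it shows up in the result too.

```javascript
// Same markup as in the question, kept in a template literal.
const tableHtml = `<p nameIt="Title">Title_Test</p><figure class="t15"><table><thead><tr>
<th><span>Column1</span></th><th><span>Column2</span></th></tr></thead><tbody><tr><td><span>Entry1</span></td><td><span>Entry2</span></td><td><span>ready</span></td></tr></tbody></table></figure><p ex="ready">!aaa`;

// matchAll requires the g flag; each result exposes its capture groups.
const spanPattern = /<span>(.*?)<\/span>/g;

// m[0] is the full match including the tags, m[1] is only the text inside.
const spanTexts = Array.from(tableHtml.matchAll(spanPattern), m => m[1]);

console.log(spanTexts); // [ 'Column1', 'Column2', 'Entry1', 'Entry2', 'ready' ]
```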

Answered by Nikola Dimitroff on February 2, 2021

Your regex should use the global (g) flag. The multiline (m) flag only changes how ^ and $ behave, so it makes no difference for a pattern without anchors. Other than that, you need to check for more than one instance. Something like this:

<\s*span[^>]*>(.*?)<\s*\/\s*span\s*>

You can see it at work here:

Regex 101

Also, because (as stated) you can't reliably parse HTML with regex, I did my best to make sure you could still use styles or other attributes inside the <span> tag. For example, <span style="color:#FF0000;"> will still work with the example I provided.

With another example here:

Regex 101
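
For readers without access to the linked demos, here is a small sketch of how the attribute-tolerant pattern might be used in JavaScript. The styledHtml input is a hypothetical example, not taken from the question, and the variable names are assumptions for illustration.

```javascript
// Hypothetical input: one span with a style attribute, one with a class
// attribute and a space before the closing bracket of its end tag.
const styledHtml =
  '<td><span style="color:#FF0000;">Entry1</span></td>' +
  '<td><span class="x">Entry2</span ></td>';

// [^>]* skips over any attributes in the opening tag;
// \s* tolerates optional whitespace around the tag name and the slash.
const tolerantPattern = /<\s*span[^>]*>(.*?)<\s*\/\s*span\s*>/g;

// Collect only the captured text between the tags.
const styledTexts = Array.from(styledHtml.matchAll(tolerantPattern), m => m[1]);

console.log(styledTexts); // [ 'Entry1', 'Entry2' ]
```

This is still pattern matching on text, so nested spans or a > character inside an attribute value would not be handled correctly.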

Answered by Zak on February 2, 2021

There is a very good answer by @bobince about why you should not even try to use regular expressions for parsing HTML.

To get a more specific answer, you should say which environment you want to use for this job.

Is it a browser or Node.js, and do you have the HTML as text or in a page?

I would propose another solution to your problem: create DOM elements from the string and query them to extract the desired data.

/**
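 * Helper function that parses an HTML string into a detached DOM element
 * @param {string} html - markup to parse
 * @param {string} elementType - tag name of the wrapper element
 * @returns {HTMLElement} wrapper element containing the parsed markup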
 */
function htmlToElement(html, elementType = 'div') {
  const template = document.createElement(elementType);

  template.innerHTML = html.trim(); // trim so the wrapper does not start or end with a whitespace-only text node

  return template;
}

const htmlString = `<p nameIt="Title">Title_Test</p><figure class="t15"><table><thead><tr>
<th><span>Column1</span></th><th><span>Column2</span></th></tr></thead><tbody><tr><td><span>Entry1</span></td><td><span>Entry2</span></td><td><span>ready</span></td></tr></tbody></table></figure><p ex="ready">`; 
const element = htmlToElement(htmlString);

// extract inner text from spans as array of strings
const arrayOfWords = [...element.querySelectorAll('span')].map(span => span.innerText);
// convert array of strings to space separated string
const wordsJoinedWithSpace = arrayOfWords.join(' ');
// log a result in a console
console.log({arrayOfWords, wordsJoinedWithSpace});

Answered by Hakier on February 2, 2021
