Data Formats

HTML

HTML stands for Hypertext Markup Language. It is the standard language of the web intended to be displayed using web browsers like Firefox, Chrome, and Edge.

The following is a good example of an HTML document with examples of many basic HTML elements:

<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>HTML5 Example</title>
  </head>
  <body>
    <div id="top" class="page" role="document">
      <header role="banner">
        <h1>HTML5 Example</h1>
        <p>This is an example page full of HTML.</p>
      </header>
      <nav role="nav">
        <ul>
          <li>
            <a href="#tech" class="big title" custom-attr="ok">Tech</a>
            <ul>
              <li><a href="https://google.com">Google</a></li>
              <li><a href="https://cloudflare.com">Cloudflare</a></li>
              <li><a href="https://microsoft.com">Microsoft</a></li>
            </ul>
          </li>
          <li>
            <a href="#news" data-news="true">News</a>
            <ul>
              <li><a href="https://nytimes.com">NYTimes</a></li>
              <li><a href="https://washingtonpost.com">WAPO</a></li>
            </ul>
          </li>
        </ul>
      </nav>
      <main>
        <section id="headings">
            <div>
              <h1>Heading 1</h1>
              <h2>Heading 2</h2>
              <h3>Heading 3</h3>
              <h4>Heading 4</h4>
              <h5>Heading 5</h5>
              <h6>Heading 6</h6>
            </div>
        </section>
        <section id="other">
            <p>This is a paragraph.</p>
            <p>This is another paragraph.</p>
            <ol>
                <li>Ordered list 1</li>
                <li>Ordered list 2</li>
                <li>Ordered list 3</li>
            </ol>
            <ul>
                <li>Unordered list 1</li>
                <li>Unordered list 2</li>
                <li>Unordered list 3</li>
            </ul>
        </section>

        <h3>Horizontal rule</h3>
        <hr>

        <h3>Line break</h3>
        <br>
          
        <section id="table">
            <table>
                <caption>Caption</caption>
                <thead>
                    <tr>
                        <th>Heading 1</th>
                        <th>Heading 2</th>
                    </tr>
                </thead>
                <tfoot>
                    <tr>
                        <th>Footer 1</th>
                        <th>Footer 2</th>
                    </tr>
                </tfoot>
                <tbody>
                    <tr>
                        <td>Cell 1</td>
                        <td>Cell 2</td>
                    </tr>
                    <tr>
                        <td>Cell 1</td>
                        <td>Cell 2</td>
                    </tr>
                </tbody>
            </table>
        </section>
        <section>
            <div>
                <p>This is a paragraph with text that has been modified to appear 
                    <strong>strong</strong>, <em>emphasized</em>, <b>bold</b>, <i>italicized</i>, 
                    <u>underlined</u>, <s>struckthrough</s>, etc.
                </p>
            </div>
        </section>
      </main>
    </div>
  </body>
</html>

An HTML document is comprised of elements. An element is the combination of a start tag, content, and end tag. For example, the following is an element:

<title>HTML5 Example</title>

In that example, "<title>" is the start tag, "HTML5 Example" is the content, and "</title>" is the end tag.

HTML elements can be nested. For example:

<section>
  <div>
      <p>This is a paragraph with text that has been modified to appear 
          <strong>strong</strong>, <em>emphasized</em>, <b>bold</b>, <i>italicized</i>, 
          <u>underlined</u>, <s>struckthrough</s>, etc.
      </p>
  </div>
</section>

This is an element where "<section>" is the start tag, "</section>" is the end tag, and the rest:

<div>
    <p>This is a paragraph with text that has been modified to appear 
        <strong>strong</strong>, <em>emphasized</em>, <b>bold</b>, <i>italicized</i>, 
        <u>underlined</u>, <s>struckthrough</s>, etc.
    </p>
</div>

is the content. The content in this example is a nested element, where the start tag is "<div>", the end tag is "</div>" and the content is:

<p>This is a paragraph with text that has been modified to appear 
    <strong>strong</strong>, <em>emphasized</em>, <b>bold</b>, <i>italicized</i>, 
    <u>underlined</u>, <s>struckthrough</s>, etc.
</p>

... yet another nested element.

HTML has a variety of different tags, each with a distinct purpose. Whether or not a website uses a specific tag for it's intended purpose is another story. Certain HTML tags are oft "misused" in modern web development.

HTML tags can have attributes. Attributes are always shown in the start tag, and can come in both name="value" pairs, like class="h1-strong":

<div class="h1-strong">

, or alone, like async:

<script src="my_script.js" async>

There are a variety of valid html attributes. It is important to note that if an attribute in an HTML tag is not one of the official attributes, the browser will ignore it. With that being said, the "unofficial" attributes will still be a part of the Document Object Model (DOM), and therefore accessible to javascript. For this reason, it is not uncommon to come across technically "invalid" HTML attributes in website's source code.

In HTML5, there now exists an HTML compliant version of these custom attributes called data-*. The * in data-* is a wildcard that can represent anything as long as it is at least 1 character long and contains no uppercase letters. You can see an example of a data-* attribute in our example, where we use data-news to store whether or not the link is news data:

<a href="#news" data-news="true">News</a>

Unless you have a very good reason to use a custom attribute that is not a data-* attribute, you would be well-advised to just use data-* when possible.

XML

XML stands for Extensible Markup Language. XML and HTML appear and sound very similar but have different goals. Whereas HTML was designed to display data, XML was designed to store data.

Like HTML, XML still has start tags, end tags, and content, however, rather than using a finite set of legal tags, XML tags are defined and created by the user, so XML like the following is perfectly legal.

<exam>
  <question>
    <text>What is red?</text>
    <solution>A color.</solution>
    <points>2</points>
  </question>
  <question>
    <text>What is a square?</text>
    <solution>A shape.</solution>
    <points>1</points>
  </question>
</exam>

Rather than opening that content in a web browser and expecting it to do something like we would for HTML, in order for that, valid, XML to do anything, someone must write software to send, receive, store, display, or do something with the data.

XPath Expressions

Xpath Expressions are expressions created to select nodes or node-sets from an XML document.

The following table (from w3schools) shows some of the most useful expressions that can be used to select nodes programmatically.

Expression Description
nodename Selects all nodes with the name "nodename"
/ Selects from the root node
// Selects nodes in the document from the current node that match the selection no matter where they are
.// Selects nodes in the document from the current node that match the selection only if they are in the current element
. Selects the current node
.. Selects the parent of the current node
@ Selects attributes

Predicates

A predicate is a way to filter a node-set by evaluating the expression contained within a set of square brackets []. For example, let's say you wanted to get all of the questions that have points > 1. Doing this using predicates is easy:

<exam>
  <question>
    <text>What is red?</text>
    <solution>A color.</solution>
    <points>2</points>
  </question>
  <question>
    <text>What is a square?</text>
    <solution>A shape.</solution>
    <points>1</points>
  </question>
</exam>
//question[points > 1]

Operators

In order to make comparisons, we need operators. The following is a list of xpath operators from here.

Operator Description Example
| Computes two node-sets. //book | //cd
+ Addition 6 + 4
- Subtraction 6 - 4
* Multiplication 6 * 4
div Division 8 div 4
= Equal to (relation, not assignment) price = 9.80
!= Not equal price != 9.80
< Less than price < 9.80
<= Less than or equal to price <= 9.80
> Greater than price > 9.80
>= Greater than or equal to price >= 9.80
or Logical or price = 9.80 or price = 9.70
and Logical and price > 9.00 and price < 10.00
mod Modulus (division remainder) 5 mod 2

Function

xpath functions are an easy way to filter data with more control. You can find a list of functions here. Of particular use are the contains and translate functions. contains allows you to see if the first argument string contains the second argument string. It is particularly useful to see if a class or attribute contains some substring. translate allows you to, among other things, change the case of some text prior to using the contains function, for example.

Examples

Given the following XML, answer the questions.

<html>
    <head>
        <title>My Title</title>
    </head>
    <body>
        <div>
            <div class="abc123 sktoe-sldjkt dkjfg3-dlgsk">
                <div class="glkjr-slkd dkgj-0 dklfgj-00">
                    <a class="slkdg43lk dlks" href="https://example.com/123456">
                    </a>
                </div>
            </div>
            <div>
                <div class="ldskfg4">
                    <span class="slktjoe" aria-label="123 comments, 43 Retweets, 4000 likes">Love it.</span>
                </div>
            </div>
            <div data-amount="12">13</div>
        </div>
        <div>
            <div class="abc123 sktoe-sls dkjfg-dlgsk">
                <div class="glkj-slkd dkgj-0 dklfj-00">
                    <a class="slkd3lk dls" href="https://example.com/123456">
                    </a>
                </div>
            </div>
            <div>
                <div class="ldg4">
                    <span class="sktjoe" aria-label="1000 comments, 455 Retweets, 40000 likes">Love it.</span>
                </div>
            </div>
            <div data-amount="122">133</div>
        </div>
    </body>
</html>
Write an xpath expression to get the "title" element.

Click here for solution //title

Write an xpath expression to get the content of the "title" element.

Click here for solution //title/text

Write an xpath expression to get every "div" element.

Click here for solution //div

Write an xpath expression to get every "div" element with class="ldskfg4".

Click here for solution //div[@class='ldskfg4']

Write an xpath expression to get every "div" element where "abc123" is in the class attribute's value.

Click here for solution //div[contains(@class, 'abc123')]

Write an xpath expression to get every "div" element with an aria-label attribute.

Click here for solution //div[@aria-label]

Resources

XPath Tutorial

A w3schools tutorial on XPath expressions. A nice, quick 1 page summary.

XPath Expression Cheatsheet

A simple, but very thorough cheatsheet to recall xpath expressions.