How to Manipulate HTML and XML Files from the Command Line

The World Wide Web Consortium (W3C) has a number of free tools available to help with the correct generation and processing of HTML and XML files. The HTML-XML package is a set of simple utilities for manipulating HTML and XML files from the command line. It is available for many of the different Linux distributions and can be useful for those who have to process HTML or XML files on a regular basis.

To install the package on Ubuntu, use:

sudo apt-get install html-xml-utils
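Once the install finishes, it can be worth confirming that the tools are actually on your PATH. The following is a minimal sketch, not something shipped with the package: it loops over four of the commands used in this article (the selection is only an illustration) and reports which ones the shell can find.

```shell
# Count how many of the listed tools are missing from PATH.
missing=0
for tool in hxnormalize hxwls hxtabletrans hxselect; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
        missing=$((missing + 1))
    fi
done
echo "$missing tool(s) missing"
```

If any tool is reported missing, re-run the install command above or check your distribution's package name.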

There are 31 tools in this package. Here is a summary of what they do:

  • cexport – create a header file of exported declarations from a C file
  • hxaddid – add IDs to selected elements
  • hxcite – replace bibliographic references by hyperlinks
  • hxcite-mkbib – expand references and create bibliography
  • hxcopy – copy an HTML file while preserving relative links
  • hxcount – count elements and attributes in HTML or XML files
  • hxextract – extract selected elements
  • hxclean – apply heuristics to correct an HTML file
  • hxprune – remove marked elements from an HTML file
  • hxincl – expand included HTML or XML files
  • hxindex – create an alphabetically sorted index
  • hxmkbib – create bibliography from a template
  • hxmultitoc – create a table of contents for a set of HTML files
  • hxname2id – move some ID= or NAME= from A elements to their parents
  • hxnormalize – pretty-print an HTML file
  • hxnum – number section headings in an HTML file
  • hxpipe – convert XML to a format easier to parse with Perl or AWK
  • hxprintlinks – number links and add a table of URLs at the end of an HTML file
  • hxremove – remove selected elements from an XML file
  • hxtabletrans – transpose an HTML or XHTML table
  • hxtoc – insert a table of contents in an HTML file
  • hxuncdata – replace CDATA sections by character entities
  • hxunent – replace HTML predefined character entities by UTF-8
  • hxunpipe – convert output of hxpipe back to XML format
  • hxunxmlns – replace “global names” by XML Namespace prefixes
  • hxwls – list links in an HTML file
  • hxxmlns – replace XML Namespace prefixes by “global names”
  • asc2xml, xml2asc – convert between UTF-8 and entities
  • hxref – generate cross-references
  • hxselect – extract elements that match a (CSS) selector

To introduce you to the power of this tool set, here are some examples on how you would use a few of the commands.

The “hxnormalize” command reformats an HTML file so that it is consistently indented and easy to read. To test this command, we will create an ugly HTML file. Select and copy the following lines and paste them directly into a terminal window.

cat > test.html << __EOF__
<html><body><p>hello
__EOF__

This will create a file called test.html. The HTML is missing some of its closing tags and is written on a single line. The hxnormalize command will reformat the file and write the pretty version to the standard output (stdout). Here is how you run the command:

hxnormalize -e test.html

The “-e” flag tells hxnormalize to insert any missing closing tags.
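Because the result goes to stdout, you will usually want to redirect it into a file. The sketch below recreates a small one-line test.html so that it stands alone, then writes the normalized version to test-pretty.html (a file name chosen here purely for illustration). The -x flag is an optional extra; the hxnormalize man page describes it as switching the output to XML conventions. The command is skipped if the tool is not installed.

```shell
# Recreate a one-line sample file with missing closing tags.
printf '%s\n' '<html><body><p>hello' > test.html

if command -v hxnormalize >/dev/null 2>&1; then
    # -e inserts missing end tags; -x (per the man page) uses XML conventions.
    hxnormalize -e -x test.html > test-pretty.html
    echo "wrote test-pretty.html"
else
    echo "hxnormalize not found; install html-xml-utils first" >&2
fi
```

Writing to a second file rather than back onto test.html avoids truncating the input before it has been read.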

You can also run the command against a web page by replacing “test.html” with a URL, for example:

hxnormalize http://www.example.com

The hxwls command will parse a local HTML file or a website, and list the links within the HTML. For example:

hxwls http://www.example.com
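The list that hxwls prints is plain text with one link per line, so it combines well with standard tools. As a rough sketch, the following builds a small local file (links.html and its contents are made up for this example) and pipes the link list through sort -u to drop duplicates. It only runs hxwls if the command is available.

```shell
# A made-up sample page with three links, one of them repeated.
cat > links.html << '__EOF__'
<html><body>
<a href="http://www.example.com/">Example</a>
<a href="about.html">About</a>
<a href="http://www.example.com/">Example again</a>
</body></html>
__EOF__

if command -v hxwls >/dev/null 2>&1; then
    # List every link, then sort and remove duplicate entries.
    hxwls links.html | sort -u
else
    echo "hxwls not found; install html-xml-utils first" >&2
fi
```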

The hxtabletrans command changes a table so that rows become columns and columns become rows.

Let’s create an HTML file with a simple table. Select and copy the following lines, and then paste them directly into a terminal window.

cat > table.html << __EOF__
<table>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>
__EOF__

The result is a file called table.html. In a web browser the table would look something like this:

Jill Smith 50
Eve Jackson 94

If you run the hxtabletrans command, then it will write the transposed table to the standard output. The results can be redirected to another file like this:

hxtabletrans table.html > table2.html

The new file, table2.html, will show Jill Smith and Eve Jackson in columns, rather than in rows as in the original. The resulting table will be something like this:

Jill Eve
Smith Jackson
50 94

Most of the commands are used in the same way as the examples above: you specify a file or URL to process, and the output is written to stdout. Try experimenting with the other commands, as you might find them useful.
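Since every tool reads a file and writes to stdout, the commands can also be chained with pipes. The sketch below is one possible combination, not a recipe from the package documentation: it normalizes a small table file to well-formed XML with hxnormalize -x and passes the result to hxselect, which extracts the elements matching the CSS selector td. The -c (print content only) and -s (separator string) options are used as the hxselect man page describes them, and cells.html is a hypothetical file name.

```shell
# A small two-row sample table (hypothetical file name).
cat > cells.html << '__EOF__'
<table>
<tr><td>Jill</td><td>Smith</td><td>50</td></tr>
<tr><td>Eve</td><td>Jackson</td><td>94</td></tr>
</table>
__EOF__

if command -v hxnormalize >/dev/null 2>&1 && command -v hxselect >/dev/null 2>&1; then
    # Normalize to XML first, then print the content of each table cell,
    # separated by a newline.
    hxnormalize -x cells.html | hxselect -c -s '\n' 'td'
else
    echo "hxnormalize or hxselect not found; install html-xml-utils first" >&2
fi
```

Running hxnormalize -x first is a common precaution, because hxselect expects well-formed input.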

If you have any questions about the HTML-XML utilities then please feel free to ask them in the comments below and we will see if we can help.


Gary Sims

Gary has been a technical writer, author and blogger since 2003. He is an expert in open source systems (including Linux), system administration, system security and networking protocols. He also knows several programming languages, as he was previously a software engineer for 10 years. He has a Bachelor of Science in business information systems from a UK University.
