# latest Blogs

Cleve Moler is the author of the first MATLAB, one of the founders of MathWorks, and is currently Chief Mathematician at the company. He writes here about MATLAB, scientific computing and interesting mathematics.
Updated: 22 hours 30 min ago

### Levenshtein Edit Distance Between Strings

Mon, 08/14/2017 - 22:30

How can you measure the distance between two words? How can you find the closest match to a word in a list of words? The Levenshtein distance between two strings is the number of single character deletions, insertions, or substitutions required to transform one string into the other. This is also known as the edit distance.

Vladimir Levenshtein is a Russian mathematician who published this notion in 1966. I am using his distance measure in a project that I will describe in a future post. Other applications include optical character recognition (OCR), spell checking, and DNA sequencing.

Contents"Hi ho, hi ho"

For my examples, I will use these seven famous names.

T = {"Doc","Grumpy","Happy","Sleepy","Bashful","Sneezy","Dopey"} T = 1×7 cell array Columns 1 through 5 ["Doc"] ["Grumpy"] ["Happy"] ["Sleepy"] ["Bashful"] Columns 6 through 7 ["Sneezy"] ["Dopey"]

The strings "Sleepy" and "Sneezy" are close to each other because they are the same length, and we can transform "Sleepy" to "Sneezy" by two edit operations, change 'l' to 'n' and 'p' to 'z'. The Levenshtein distance between these two words is 2.

On the other hand, "Bashful" is not close to his friends. His name is longer and the only letter he shares with another is an 'a' with "Happy". So his distance to "Happy" is 6, while the distance to any of the others is 7, the length of his name.

Here is the matrix of pair-wise distances

for i = 1:7 for j = 1:7 D(i,j) = lev(T{i},T{j}); end end D D = 0 6 5 6 7 6 3 6 0 4 4 7 5 5 5 4 0 4 6 5 3 6 4 4 0 7 2 4 7 7 6 7 0 7 7 6 5 5 2 7 0 4 3 5 3 4 7 4 0

Bashful's column is all 7's, except for a 6 in Happy's row and a 0 on the diagonal.

A bar graph of the total distances shows Bashful's name is the furthest from all the others, and that Dopey's is the closest, but just barely.

bar(sum(D)) title('Total Levenshtein Distance') set(gca,'xticklabels',T) Recursive

Here is a recursive program that provides one way to compute lev(s,t). It compares each character in a string with all of the characters in the other string. For each comparison, a cost c is added to a total that is accumulated by the recursion. The cost of one comparison is 0 if the pair is a match and 1 if it isn't.

type levr function d = levr(s,t,m,n) % Levenshtein distance between strings, recursive implementation. % levr(s,t) is the number of deletions, insertions, % or substitutions required to transform s to t. % m = length(s), n = length(t), levr(s,t,m,n) initiates recursion. % https://en.wikipedia.org/wiki/Levenshtein_distance if nargin == 2 s = char(s); t = char(t); d = levr(s,t,length(s),length(t)); elseif m == 0 d = n; elseif n == 0 d = m; else c = s(m) ~= t(n); % c = 0 if chars match, 1 if not. d = min([levr(s,t,m-1,n) + 1 levr(s,t,m,n-1) + 1 levr(s,t,m-1,n-1) + c]); end end

Like all recursive programs, this code is impractical to use in practice with long strings or long lists of words because it repeatedly compares the same pairs of characters. The complexity is exponential in the lengths of the strings.

Matrix Memory

A time-memory tradeoff can be made by allocating storage to save the results of individual comparisons. The memory involved is a matrix of size (m+1)-by-(n+1) where m and n are the lengths of the two strings, so it's O(m*n). The time complexity is also O(m*n).

type levm function d = levm(s,t) % Levenshtein distance between strings, matrix implementation. % levr(s,t) is the number of deletions, insertions, % or substitutions required to transform s to t. % https://en.wikipedia.org/wiki/Levenshtein_distance s = char(s); t = char(t); m = length(s); n = length(t); D = zeros(m+1,n+1); D(:,1) = (0:m)'; D(1,:) = (0:n); for i = 1:m for j = 1:n c = s(i) ~= t(j); % c = 0 if chars match, 1 if not. D(i+1,j+1) = min([D(i,j+1) + 1 D(i+1,j) + 1 D(i,j) + c]); end end levm_print(s,t,D) d = D(m+1,n+1); end Details

I've included a print routine so we can see some detail. Let's begin by finding the distance between a single letter and a word that doesn't contain that letter. The distance is the length n of the word because one substitution and n-1 deletions are required.

d = levm("S","Dopey") D o p e y S 1 2 3 4 5 d = 5

Now let's have one character match. In this case the character is 'e'.

d = levm("Sle","Dopey") D o p e y S 1 2 3 4 5 l 2 2 3 4 5 e 3 3 3 3 4 d = 4

Finally, two full words. The distance is the last entry in the last row or column of the matrix.

d = levm("Sleepy","Dopey") D o p e y S 1 2 3 4 5 l 2 2 3 4 5 e 3 3 3 3 4 e 4 4 4 3 4 p 5 5 4 4 4 y 6 6 5 5 4 d = 4 Best Program

We don't need storage for the whole matrix, just two rows. The storage cost is now linear in the lengths of the strings. This is the most efficient of my three functions.

type lev function d = lev(s,t) % Levenshtein distance between strings or char arrays. % lev(s,t) is the number of deletions, insertions, % or substitutions required to transform s to t. % https://en.wikipedia.org/wiki/Levenshtein_distance s = char(s); t = char(t); m = length(s); n = length(t); x = 0:n; y = zeros(1,n+1); for i = 1:m y(1) = i; for j = 1:n c = (s(i) ~= t(j)); % c = 0 if chars match, 1 if not. y(j+1) = min([y(j) + 1 x(j+1) + 1 x(j) + c]); end % swap [x,y] = deal(y,x); end d = x(n+1); end Metric Space

The function lev makes the set of strings a metric space. That's because of four properties. The distance from an element to itself is zero.

d(x,x) = 0

Otherwise the distance is positive.

d(x,y) > 0 if x ~= y

Distance is symmetric.

d(x,y) = d(y,x)

The triangle inequality, for any three elements,

d(x,y) <= d(x,w) + d(w,y) 44 Programming Languages

I usually avoid programming language debates, but here are implementations in 44 different programming languages, including one in MATLAB. I prefer my own code.

\n'); d.write(code_string); // Add copyright line at the bottom if specified. if (copyright.length > 0) { d.writeln(''); d.writeln('%%'); if (copyright.length > 0) { d.writeln('% _' + copyright + '_'); } } d.write('\n'); d.title = title + ' (MATLAB code)'; d.close(); } -->

Get the MATLAB code (requires JavaScript)

Published with MATLAB® R2017a

. %% "Hi ho, hi ho" % For my examples, I will use these seven famous names. T = {"Doc","Grumpy","Happy","Sleepy","Bashful","Sneezy","Dopey"} %% % The strings "Sleepy" and "Sneezy" are close to each other because they % are the same length, and we can transform "Sleepy" to "Sneezy" by two % edit operations, change 'l' to 'n' and 'p' to 'z'. The Levenshtein % distance between these two words is 2. %% % On the other hand, "Bashful" is not close to his friends. His name % is longer and the only letter he shares with another is an 'a' with % "Happy". So his distance to "Happy" is 6, while the distance to % any of the others is 7, the length of his name. %% % Here is the matrix of pair-wise distances for i = 1:7 for j = 1:7 D(i,j) = lev(T{i},T{j}); end end D %% % Bashful's column is all 7's, except for a 6 in Happy's row and a 0 % on the diagonal. %% % A bar graph of the total distances shows Bashful's name is the % furthest from all the others, and that Dopey's is the closest, % but just barely. bar(sum(D)) title('Total Levenshtein Distance') set(gca,'xticklabels',T) %% Recursive % Here is a recursive program that provides one way to compute |lev(s,t)|. % It compares each character in a string with all of the characters in % the other string. For each comparison, a cost |c| is added to a % total that is accumulated by the recursion. The cost of one comparison % is 0 if the pair is a match and 1 if it isn't. type levr %% % Like all recursive programs, this code is impractical to use in % practice with long strings or long lists of words because it % repeatedly compares the same pairs of characters. The complexity % is exponential in the lengths of the strings. %% Matrix Memory % A time-memory tradeoff can be made by allocating storage to save the % results of individual comparisons. The memory involved is a matrix of % size (m+1)-by-(n+1) where m and n are the lengths of the two strings, % so it's O(m*n). The time complexity is also O(m*n). type levm %% Details % I've included a print routine so we can see some detail. % Let's begin by finding the distance between a single letter and a word % that doesn't contain that letter. The distance is the length n of the % word because one substitution and n-1 deletions are required. d = levm("S","Dopey") %% % Now let's have one character match. In this case the character is 'e'. d = levm("Sle","Dopey") %% % Finally, two full words. The distance is the last entry in the last % row or column of the matrix. d = levm("Sleepy","Dopey") %% Best Program % We don't need storage for the whole matrix, just two rows. % The storage cost is now linear in the lengths of the strings. % This is the most efficient of my three functions. type lev %% Metric Space % The function |lev| makes the set of strings a _metric space_. % That's because of four properties. The distance from an element to % itself is zero. % % d(x,x) = 0 % % Otherwise the distance is positive. % % d(x,y) > 0 if x ~= y % % Distance is _symmetric_. % % d(x,y) = d(y,x) % % The triangle inequality, for any three elements, % % d(x,y) <= d(x,w) + d(w,y) %% 44 Programming Languages % I usually avoid programming language debates, but % are implementations in 44 different programming languages, % including one in MATLAB. I prefer my own code. ##### SOURCE END ##### 6edce509ea604c6fbef7aee4ee314bbd -->

### Latent Semantic Indexing, SVD, and Zipf’s Law

Mon, 07/31/2017 - 22:30

Latent Semantic Indexing, LSI, uses the Singular Value Decomposition of a term-by-document matrix to represent the information in the documents in a manner that facilitates responding to queries and other information retrieval tasks. I set out to learn for myself how LSI is implemented. I am finding that it is harder than I thought.

ContentsBackground

Latent Semantic Indexing was first described in 1990 by Susan Dumais and several colleagues at Bellcore, the descendant of the famous A.T.&T. Bell Labs. (Dumais is now at Microsoft Research.) I first heard about LSI a couple of years later from Mike Berry at the University of Tennessee. Berry, Dumais and Gavin O'Brien have a paper, "Using Linear Algebra for Intelligent Information Retrieval" in SIAM Review in 1995. Here is a link to a preprint of that paper.

An important problem in information retrieval involves synonyms. Different words can refer to the same, or very similar, topics. How can you retrieve a relevant document that does not actually use a key word in a query? For example, consider the terms

• FFT (Finite Fast Fourier Transform)
• SVD (Singular Value Decomposition)
• ODE (Ordinary Differential Equation

Someone looking for information about PCA (Principal Component Analysis) would be more interested in documents about SVD than those about the other two topics. It is possible to discover this connection because documents about PCA and SVD are more likely to share terms like "rank" and "subspace" that are not present in the others. For information retrieval purposes, PCA and SVD are synonyms. Latent Semantic Indexing can reveal such connections.

Strings

I will make use of the new string object, introduced in recent versions of MATLAB. The double quote has been an illegal character in MATLAB. But now it delineate strings.

s = "% Let $A$ be the 4-by-4 magic square from Durer's _Melancholia_." s = "% Let $A$ be the 4-by-4 magic square from Durer's _Melancholia_."

To erase the leading percent sign and the LaTeX between the dollar signs.

s = erase(s,'%') s = eraseBetween(s,'$','$','Boundaries','inclusive') s = " Let $A$ be the 4-by-4 magic square from Durer's _Melancholia_." s = " Let be the 4-by-4 magic square from Durer's _Melancholia_." Cleve's Corner

The documents for this project are the MATLAB files that constitute the source text for this blog. I started writing the blog edition of Cleve's Corner a little over five years ago, in June 2012.

Each post comes from a MATLAB file processed by the publish command. Most of the lines in the file are comments beginning with a single %. These become the prose in the post. Comments beginning with a double %% generate the title and section headings. Text within '|' characters is rendered with a fixed width font to represent variables and code. Text within '$' characters is LaTeX that is eventually typeset by MathJax. I have collected five years' worth of source files in one directory, Blogs, on my computer. The statement D = doc_list('Blogs'); produces a string array of the file names in Blogs. The first few lines of D are D(1:5) ans = 5×1 string array "apologies_blog.m" "arrowhead2.m" "backslash.m" "balancing.m" "biorhythms.m" The statements n = length(D) n/5 n = 135 ans = 27 tell me how many posts I've made, and how many posts per year. That's an average of about one every two weeks. The file that generates today's post is included in the library. for j = 1:n if contains(D{j},"lsi") j D{j} end end j = 75 ans = 'lsi_blog.m' Read posts The following code prepends each file name with the directory name, reads each post into a string s, counts the total number of lines that I've written, and computes the average number of lines per post. lines = 0; for j = 1:n s = read_blog("Blogs/"+D{j}); lines = lines + length(s); end lines avg = lines/n lines = 15578 avg = 115.3926 Stop words The most common word in Cleve's Corner blog is "the". By itself, "the" accounts for over 8 percent of the total number of words. This is roughly the same percentage seen in larger samples of English prose. Stop words are short, frequently used words, like "the", that can be excluded from frequency counts because they are of little value when comparing documents. Here is a list of 25 commonly used stop words. I am going to use a brute force, easy to implement, stategy for stop words. I specify an integer parameter, stop, and simply ignore all words with stop or fewer characters. A value of zero for stop means do not ignore any values. After some experiments which I am about to show, I will decide to take stop = 5. Find terms Here is the core code of this project. The input is the list of documents D and the stop parameter stop. The code scans all the documents, finding and counting the individual words. In the process it • skips all noncomment lines • skips all embedded LaTeX • skips all code fragments • skips all URLs • ignores all words with stop or fewer characters The code prints some intermediate results and then returns three • T, string array of unique words. These are the terms. • C, the frequency counts of T. • A, the (sparse) term/document matrix. The term/document matrix is m -by- n, where • m = length of T is the number of terms, • n = length of D is the number of documents. type find_terms function [T,C,A] = find_terms(D,stop) % [T,C,A] = find_terms(D,stop) % Input % D = document list % stop = length of stop words % Output % T = terms, sorted alphabetically % C = counts of terms % A = term/document matrix [W,Cw,wt] = words_and_counts(D,stop); T = terms(W); m = length(T); fprintf('\n\n stop = %d\n words = %d\n m = %d\n\n',stop,wt,m) A = term_doc_matrix(W,Cw,T); C = full(sum(A,2)); [Cp,p] = sort(C,'descend'); Tp = T(p); % Terms sorted by frequency freq = Cp/wt; ten = 10; fprintf(' index term count fraction\n') for k = 1:ten fprintf('%8d %12s %8d %9.4f\n',k,Tp(k),Cp(k),freq(k)) end total = sum(freq(1:ten)); fprintf('%s total = %7.4f\n',blanks(24),total) end No stop words Let's run that code with stop set to zero so there are no stop words. The ten most frequently used words account for over one-quarter of all the words and are useless for characterizing documents. Note that, by itself, "the" accounts for over 8 percent of all the words. stop = 0; find_terms(D,stop); stop = 0 words = 114512 m = 8896 index term count fraction 1 the 9404 0.0821 2 of 3941 0.0344 3 a 2885 0.0252 4 and 2685 0.0234 5 is 2561 0.0224 6 to 2358 0.0206 7 in 2272 0.0198 8 that 1308 0.0114 9 for 1128 0.0099 10 this 969 0.0085 total = 0.2577 Three characters or less Skipping all words with three or less characters cuts the total number of words almost in half and eliminates "the", but still leaves mostly uninteresting words in the top ten. stop = 3; find_terms(D,stop); stop = 3 words = 67310 m = 8333 index term count fraction 1 that 1308 0.0194 2 this 969 0.0144 3 with 907 0.0135 4 matrix 509 0.0076 5 from 481 0.0071 6 matlab 472 0.0070 7 have 438 0.0065 8 about 363 0.0054 9 first 350 0.0052 10 point 303 0.0045 total = 0.0906 Five characters or less Setting stop to 4 is a reasonable choice, but let's be even more aggressive. With stop equal to 5, the ten most frequent words are now much more characteristic of this blog. All further calculations use these results. stop = 5; [T,C,A] = find_terms(D,stop); m = length(T); stop = 5 words = 38856 m = 6459 index term count fraction 1 matrix 509 0.0131 2 matlab 472 0.0121 3 function 302 0.0078 4 number 224 0.0058 5 floating 199 0.0051 6 precision 184 0.0047 7 computer 167 0.0043 8 algorithm 164 0.0042 9 values 160 0.0041 10 numerical 158 0.0041 total = 0.0653 Zipf's Law This is just an aside. Zipf's Law is named after George Kingsley Zipf, although he did not claim originality. The law states that in a sample of natural language, the frequency of any word is inversely proportional to its index in the frequency table. In other words, if the frequency counts C are sorted by frequency, then a plot of log(C(1:k)) versus log(1:k) should be a straight line. Our frequency counts only vaguely follow this law, but a log-log plot makes many quantities look proportional. (We get a little better conformance to the Law with smaller values of stop.) Zipf(C) SVD Now we're ready to compute the SVD. We might as well make A full; it has only n = 135 columns. But A is very tall and skinny, so it's important to use the 'econ' flag so that U also has only 135 columns. Without this flag, we would asking for a square m -by- m matrix U with m over 6000. [U,S,V] = svd(full(A),'econ'); Query Here is a preliminary stab at a function to make queries. type query function posts = query(queries,k,cutoff,T,U,S,V,D) % posts = query(queries,k,cutoff,T,U,S,V,D) % Rank k approximation to term/document matrix. Uk = U(:,1:k); Sk = S(1:k,1:k); Vk = V(:,1:k); % Construct the query vector from the relevant terms. m = size(U,1); q = zeros(m,1); for i = 1:length(queries) % Find the index of the query key word in the term list. wi = word_index(queries{i},T); q(wi) = 1; end % Project the query onto the document space. qhat = Sk\Uk'*q; v = Vk*qhat; v = v/norm(qhat); % Pick off the documents that are close to the query. posts = D(v > cutoff); query_plot(v,cutoff,queries) end Try it Set the rank k and the cutoff level. rank = 40; cutoff = .2; Try a few one-word queries. queries = {"singular","roundoff","wilkinson"}; for j = 1:length(queries) queryj = queries{j} posts = query(queryj,rank,cutoff,T,U,S,V,D) snapnow end queryj = "singular" posts = 4×1 string array "eigshowp_w3.m" "four_spaces_blog.m" "parter_blog.m" "svdshow_blog.m" queryj = "roundoff" posts = 5×1 string array "condition_blog.m" "floats.m" "partial_pivot_blog.m" "refine_blog.m" "sanderson_blog.m" queryj = "wilkinson" posts = 2×1 string array "dubrulle.m" "jhw_1.m" Note that the query about Wilkinson found the post about him and also the post about Dubrulle, who improved a Wilkinson algorithm. For the finale, merge all the queries into one. queries posts = query(queries,rank,cutoff,T,U,S,V,D) queries = 1×3 cell array ["singular"] ["roundoff"] ["wilkinson"] posts = 5×1 string array "four_spaces_blog.m" "jhw_1.m" "parter_blog.m" "refine_blog.m" "svdshow_blog.m" ------------------------------------------------------------------ Code doc_list type doc_list function D = doc_list(dir) % D = doc_list(dir) is a string array of the file names in dirtory dir. D = ""; [status,L] = system(['ls ' dir]); if status ~= 0 error(['Cannot find ' dir]) end k1 = 1; i = 1; for k2 = find(L == newline) D(i,:) = L(k1:k2-1); i = i+1; k1 = k2+1; end end erase_stuff type erase_stuff function sout = erase_stuff(s) % s = erase_stuff(s) % erases noncomments, !sysltem, LaTeX, URLs, Courier, and punctuation. j = 1; k = 1; while k <= length(s) sk = lower(char(s(k))); if length(sk) >= 4 if sk(1) ~= '%' % Skip non comment break elseif any(sk=='!') && any(sk=='|') % Skip !system | ... break elseif all(sk(3:4) == '$') sk(3:4) = ' '; % Skip display latex while all(sk~='$') k = k+1; sk = char(s(k)); end else % Embedded latex sk = eraseBetween(sk,'$','$','Boundaries','inclusive'); % URL if any(sk == '<') if ~any(sk == '>') k = k+1; sk = [sk lower(char(s(k)))]; end sk = eraseBetween(sk,'<','>','Boundaries','inclusive'); sk = erase(sk,'>'); end if contains(string(sk),"http") break end % Courier f = find(sk == '|'); assert(length(f)==1 | mod(length(f),2)==0); for i = 1:2:length(f)-1 w = sk(f(i)+1:f(i+1)-1); if length(w)>2 && all(w>='a') && all(w<='z') sk(f(i)) = ' '; sk(f(i+1)) = ' '; else sk(f(i):f(i+1)) = ' '; end end % Puncuation sk((sk<'a' | sk>'z') & (sk~=' ')) = []; sout(j,:) = string(sk); j = j+1; end end k = k+1; end skip = k-j; end find_words type find_words function w = find_words(s,stop) % words(s,stop) Find the words with length > stop in the text string. w = ""; i = 1; for k = 1:length(s) sk = [' ' char(s{k}) ' ']; f = strfind(sk,' '); for j = 1:length(f)-1 t = strtrim(sk(f(j):f(j+1))); % Skip stop words if length(t) <= stop continue end if ~isempty(t) w{i,1} = t; i = i+1; end end end query_plot type query_plot function query_plot(v,cutoff,queries) clf shg plot(v,'.','markersize',12) ax = axis; yax = 1.1*max(abs(ax(3:4))); axis([ax(1:2) -yax yax]) line(ax(1:2),[cutoff cutoff],'color','k') title(sprintf('%s, ',queries{:})) end read_blog type read_blog function s = read_blog(filename) % read_blog(filename) % skip over lines that do not begin with '%'. fid = fopen(filename); line = fgetl(fid); k = 1; while ~isequal(line,-1) % -1 signals eof if length(line) > 2 && line(1) == '%' s(k,:) = string(line); k = k+1; end line = fgetl(fid); end fclose(fid); end term_doc_matrix type term_doc_matrix function A = term_doc_matrix(W,C,T) % A = term_doc_matrix(W,C,T) m = length(T); n = length(W); A = sparse(m,n); for j = 1:n nzj = length(W{j}); i = zeros(nzj,1); s = zeros(nzj,1); for k = 1:nzj i(k) = word_index(W{j}(k),T); s(k) = C{j}(k); end A(i,j) = s; end end word_count type word_count function [wout,count] = word_count(w) % [w,count] = word_count(w) Unique words with counts. w = sort(w); wout = ""; count = []; i = 1; k = 1; while k <= length(w) c = 1; while k < length(w) && isequal(w{k},w{k+1}) c = c+1; k = k+1; end wout{i,1} = w{k}; count(i,1) = c; i = i+1; k = k+1; end [count,p] = sort(count,'descend'); wout = wout(p); end word_index type word_index function p = word_index(w,list) % Index of w in list. % Returns empty if w is not in list. m = length(list); p = fix(m/2); q = ceil(p/2); t = 0; tmax = ceil(log2(m)); % Binary search while list(p) ~= w if list(p) > w p = max(p-q,1); else p = min(p+q,m); end q = ceil(q/2); t = t+1; if t == tmax p = []; break end end end words_and_counts type words_and_counts function [W,Cw,wt] = words_and_counts(D,stop) % [W,Cw,wt] = words_and_counts(D,stop) W = []; % Words Cw = []; % Counts wt = 0; n = length(D); for j = 1:n s = read_blog("Blogs/"+D{j}); s = erase_stuff(s); w = find_words(s,stop); [W{j,1},Cw{j,1}] = word_count(w); wt = wt + length(w); end end Zipf type Zipf function Zipf(C) % Zipf(C) plots log2(C(1:k)) vs. log2(1:k) C = sort(C,'descend'); % Sort counts by frequency figure(1) clf shg k = 256; plot(log2(1:k),log2(C(1:k)),'.','markersize',12) axis([-1 8 4 10]) xlabel('log2(index)') ylabel('log2(count)') title('Zipf''s Law') drawnow end \n'); d.write(code_string); // Add copyright line at the bottom if specified. if (copyright.length > 0) { d.writeln(''); d.writeln('%%'); if (copyright.length > 0) { d.writeln('% _' + copyright + '_'); } } d.write('\n'); d.title = title + ' (MATLAB code)'; d.close(); } --> Get the MATLAB code (requires JavaScript) Published with MATLAB® R2017a in 1995. % % to a preprint of that paper. %% % An important problem in information retrieval involves synonyms. % Different words can refer to the same, or very similar, topics. % How can you retrieve a relevant document that does not actually % use a key word in a query? For example, consider the terms % % * FFT (Finite Fast Fourier Transform) % * SVD (Singular Value Decomposition) % * ODE (Ordinary Differential Equation % % Someone looking for information about PCA (Principal Component % Analysis) would be more interested in documents about SVD than those % about the other two topics. It is possible to discover this connection % because documents about PCA and SVD are more likely to share terms % like "rank" and "subspace" that are not present in the others. % For information retrieval purposes, PCA and SVD are synonyms. % Latent Semantic Indexing can reveal such connections. %% Strings % I will make use of the new |string| object, introduced in recent versions % of MATLAB. The double quote has been an illegal character in MATLAB. % But now it delineate strings. s = "% Let$A$be the 4-by-4 magic square from Durer's _Melancholia_." %% % To erase the leading percent sign and the LaTeX between the dollar signs. s = erase(s,'%') s = eraseBetween(s,'$','$','Boundaries','inclusive') %% Cleve's Corner % The documents for this project are the MATLAB files that constitute the % source text for this blog. I started writing the blog edition of % Cleve's Corner a little over five years ago, in June 2012. %% % Each post comes from a MATLAB file processed by the |publish| command. % Most of the lines in the file are comments beginning with a single |%|. % These become the prose in the post. Comments beginning with a double % |%%| generate the title and section headings. % Text within '|' characters % is rendered with a fixed width font to represent variables and code. % Text within '$' characters is LaTeX that is eventually typeset by % MathJax. %% % I have collected five years' worth of source files in one directory, % |Blogs|, on my computer. The statement D = doc_list('Blogs'); %% % produces a string array of the file names in |Blogs|. % The first few lines of |D| are D(1:5) %% % The statements n = length(D) n/5 %% % tell me how many posts I've made, and how many posts per year. % That's an average of about one every two weeks. %% % The file that generates today's post is included in the library. for j = 1:n if contains(D{j},"lsi") j D{j} end end %% Read posts % The following code prepends each file name with the directory name, % reads each post into a string |s|, counts the total number of lines % that I've written, and computes the average number of lines per post. lines = 0; for j = 1:n s = read_blog("Blogs/"+D{j}); lines = lines + length(s); end lines avg = lines/n %% Stop words % The most common word in Cleve's Corner blog is "the". % By itself, "the" accounts for over 8 percent of the total number of % words. This is roughly the same percentage seen in larger samples % of English prose. %% % _Stop words_ are short, frequently used words, like "the", that can be % excluded from frequency counts because they are of little value when % comparing documents. % of 25 commonly used stop words. %% % I am going to use a brute force, easy to implement, stategy for stop % words. I specify an integer parameter, |stop|, and simply ignore % all words with |stop| or fewer characters. A value of zero for |stop| % means do not ignore any values. After some experiments which I am % about to show, I will decide to take |stop = 5|. %% Find terms % Here is the core code of this project. The input is the list of % documents |D| and the stop parameter |stop|. The code scans all % the documents, finding and counting the individual words. % In the process it % % * skips all noncomment lines % * skips all embedded LaTeX % * skips all code fragments % * skips all URLs % * ignores all words with |stop| or fewer characters %% % The code prints some intermediate results and then returns three % % * T, string array of unique words. These are the terms. % * C, the frequency counts of T. % * A, the (sparse) term/document matrix. %% % The term/document matrix is |m| -by- |n|, where % % * |m| = length of T is the number of terms, % * |n| = length of D is the number of documents. type find_terms %% No stop words % Let's run that code with |stop| set to zero so there are no stop words. % The ten most frequently used words account for over one-quarter of % all the words and are useless for characterizing documents. % Note that, by itself, "the" accounts for over 8 percent of all the words. stop = 0; find_terms(D,stop); %% Three characters or less % Skipping all words with three or less characters cuts the total number % of words almost in half and eliminates "the", % but still leaves mostly uninteresting words in the top ten. stop = 3; find_terms(D,stop); %% Five characters or less % Setting |stop| to 4 is a reasonable choice, but let's be even more % aggressive. With |stop| equal to 5, the ten most frequent words are % now much more characteristic of this blog. All further calculations % use these results. stop = 5; [T,C,A] = find_terms(D,stop); m = length(T); %% Zipf's Law % This is just an aside. is named after George Kingsley Zipf, although he did % not claim originality. The law states that in a sample of natural % language, the frequency of any word is inversely proportional to % its index in the frequency table. %% % In other words, if the frequency counts |C| are sorted by % frequency, then a plot of |log(C(1:k))| versus |log(1:k)| % should be a straight line. Our frequency counts % only vaguely follow this law, but a log-log plot makes many % quantities look proportional. (We get a little better conformance % to the Law with smaller values of |stop|.) Zipf(C) %% SVD % Now we're ready to compute the SVD. We might as well make |A| full; % it has only |n| = 135 columns. But |A| is very tall and skinny, % so it's important to use the |'econ'| flag so that |U| also has only % 135 columns. Without this flag, we would asking for a square % |m| -by- |m| matrix |U| with |m| over 6000. [U,S,V] = svd(full(A),'econ'); %% Query % Here is a preliminary stab at a function to make queries. type query %% Try it %% % Set the rank k and the cutoff level. rank = 40; cutoff = .2; %% % Try a few one-word queries. queries = {"singular","roundoff","wilkinson"}; for j = 1:length(queries) queryj = queries{j} posts = query(queryj,rank,cutoff,T,U,S,V,D) snapnow end %% % Note that the query about Wilkinson found the post about him % and also the post about Dubrulle, who improved a Wilkinson algorithm. %% % For the finale, merge all the queries into one. queries posts = query(queries,rank,cutoff,T,U,S,V,D) %% % REPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASHREPLACE_WITH_DASH_DASH %% Code % *doc_list* type doc_list %% % *erase_stuff* type erase_stuff %% % *find_words* type find_words %% % *query_plot* type query_plot %% % *read_blog* type read_blog %% % *term_doc_matrix* type term_doc_matrix %% % *word_count* type word_count %% % *word_index* type word_index %% % *words_and_counts* type words_and_counts %% % *Zipf* type Zipf ##### SOURCE END ##### 24969ac7161f427885d977875c2db9e3 -->

### What is the Condition Number of a Matrix?

Tue, 07/18/2017 - 07:28

A couple of questions in comments on recent blog posts have prompted me to discuss matrix condition numbers.

• Can you comment on when the condition number gives a tight estimate of the error in a computed inverse and whether there is a better estimator?

• Do you have any idea of the slowdown factor ... for large linear equation solves with extremely ill-conditioned matrices?

My short answers are that the error estimate is rarely tight, but that it is not possible to find a better one, and that it takes the same amount of time to solve ill-conditioned linear equations as it does to solve well-conditioned systems.

ContentsCondition Number for Inversion

I should first point out that there are many different condition numbers and that, although the questioners may not have realized it, they were asking about just one of them -- the condition number for inversion. In general, a condition number applies not only to a particular matrix, but also to the problem being solved. Are we inverting the matrix, finding its eigenvalues, or computing the exponential? The list goes on. A matrix can be poorly conditioned for inversion while the eigenvalue problem is well conditioned. Or, vice versa.

A condition number for a matrix and computational task measures how sensitive the answer is to perturbations in the input data and to roundoff errors made during the solution process.

When we simply say a matrix is "ill-conditioned", we are usually just thinking of the sensitivity of its inverse and not of all the other condition numbers.

Norms

In order to make these notions more precise, let's start with a vector norm. Specifically, the Euclidean norm or 2- norm.

$$\|x\| \ = \ (\sum_i x_i^2)^{1/2}$$

This is the "as the crow flies" distance in n-dimensional space.

The corresponding norm of a matrix $A$ measures how much the mapping induced by that matrix can stretch vectors.

$$M \ = \ \|A\| \ = \ {\max {{\|Ax\|} \over {\|x\|}}}$$

It is sometimes also important to consider how much a matrix can shrink vectors.

$$m \ = \ {\min {{\|Ax\|} \over {\|x\|}}}$$

The reciprocal of the minimum stretching is the norm of the inverse, because

$$m \ = \ {\min {{\|Ax\|} \over {\|x\|}}} \ = \ {\min {{\|y\|} \over {\|A^{-1} y\|}}} \ = \ 1/{\max {{\|A^{-1} y\|} \over {\|y\|}}} \ = \ 1/\|A^{-1}\|$$

A singular matrix is one that can map nonzero vectors into the zero vector. For a singular matrix

$$m \ = \ 0$$

and the inverse does not exist.

The ratio of the maximum to minimum stretching is the condition number for inversion.

$$\kappa(A) \ = \ {{M} \over {m}}$$

An equivalent definition is

$$\kappa(A) \ = \ \|A\| \|A^{-1}\|$$

If a matrix is singular, then its condition number is infinite.

Linear equations

The condition number $\kappa(A)$ is involved in the answer to the question: How much can a change in the right hand side of a system of simultaneous linear equations affect the solution? Consider a system of equations

$$A x \ = \ b$$

and a second system obtained by altering the right-hand side

$$A(x + \delta x) = b + \delta b$$

Think of $\delta b$ as being the error in $b$ and $\delta x$ as being the resulting error in $x$, although we need not make any assumptions that the errors are small. Because $A (\delta x) = \delta b$, the definitions of $M$ and $m$ immediately lead to

$$\|b\| \leq M \|x\|$$

and

$$\|\delta b\| \geq m \|\delta x\|$$

Consequently, if $m \neq 0$,

$${\|\delta x\| \over \|x\|} \leq \kappa(A) {\|\delta b\| \over \|b\|}$$

The quantity $\|\delta b\|/\|b\|$ is the relative change in the right-hand side, and the quantity $\|\delta x\|/\|x\|$ is the resulting relative change in the solution. The advantage of using relative changes is that they are dimensionless --- they are not affected by overall scale factors.

This inequality shows that the condition number is a relative error magnification factor. Changes in the right-hand side can cause changes $\kappa(A)$ times as large in the solution.

Example

Let's investigate a system of linear equations involving

$$A=\left(\begin{array}{cc} 4.1 & 2.8 \\ 9.7 & 6.6 \end{array}\right)$$

Take $b$ to be the first column of $A$, so the solution to $Ax = b$ is simply

$$x = \left(\begin{array}{c} 1 \\ 0 \end{array}\right)$$

Switch to MATLAB

A = [4.1 2.8; 9.7 6.6] b = A(:,1) x = A\b A = 4.1000 2.8000 9.7000 6.6000 b = 4.1000 9.7000 x = 1.0000 -0.0000

Now add 0.01 to the first component of b.

b2 = [4.11; 9.7] b2 = 4.1100 9.7000

The solution changes dramatically.

x2 = A\b2 x2 = 0.3400 0.9700

This sensitivity of the solution x to changes in the right hand side b is a reflection of the condition number.

kappa = cond(A) kappa = 1.6230e+03

The upper bound on the possible change in x indicates changes in all of the significant digits.

kappa*norm(b-b2)/norm(b) ans = 1.5412

The actual change in x resulting from this perturbation is

norm(x-x2)/norm(x) ans = 1.1732

So this particular change in the right hand side generated almost the largest possible change in the solution.

Close to singular

A large condition number means that the matrix is close to being singular. Let's make a small change in the second row of A.

A A2 = [4.1 2.8; 9.676 6.608] A = 4.1000 2.8000 9.7000 6.6000 A2 = 4.1000 2.8000 9.6760 6.6080

The resulting matrix is effectively singular. If we try to compute its inverse, we get a warning message.

A2inv = inv(A2) Warning: Matrix is close to singular or badly scaled. Results may be inaccurate. RCOND = 1.988677e-17. A2inv = 1.0e+15 * -1.4812 0.6276 2.1689 -0.9190

The quantity RCOND in the warning is an estimate of the reciprocal of the condition number. Using the reciprocal is left over from the days before we had IEEE floating point arithmetic with Inf to represent overflow and infinity. For this example RCOND is on the order of eps(1) and the scale factor for A2inv implies that its elements are useless. It's impossible to compute something that doesn't exist.

A fairly recent addition to MATLAB is the function condest that estimates $\kappa(A)$ . Now condest(A) is preferable to 1/rcond(A).

Inverse

The condition number $\kappa(A)$ also appears in the bound for how much a change $E$ in a matrix $A$ can affect its inverse.

$${{\|(A+E)^{-1} - A^{-1}\|} \over {\|A^{-1}\|}} \ \le \ \kappa(A) {{\|E\|} \over {\|A\|}}$$

Jim Wilkinson's work about roundoff error in Gaussian elimination showed that each column of the computed inverse is a column of the exact inverse of a matrix within roundoff error of the given matrix. Let's fudge this a bit and say that inv(A) computes the exact inverse of $A+E$ where $\|E\|$ is on the order of roundoff error compared to $\|A\|$ .

We don't know $E$ exactly, but for an n -by- n matrix we have the estimate

norm(E) $\approx \space$ n*eps(norm(A))

So we have a simple estimate for the error in the computed inverse, relative to the unknown exact inverse.

X = $\space$ exact inverse of A

Z = inv(A)

norm(Z - X)/norm(X) $\approx$ n*eps*cond(A)

For our 2-by-2 example the estimate of the relative error in the computed inverse is

2*eps*condest(A) ans = 9.9893e-13

This says we can expect 12 or 13 (out of 16) significant digits.

Wilkinson had to assume that every individual floating point arithmetic operation incurs the maximum roundoff error. But only a fraction of the operations have any roundoff error at all and even for those the errors are smaller than the maximum possible. So this estimate can be expected to an overestimate. But no tighter estimate is possible.

For our example, the computed inverse is

format long Z = inv(A) Z = -66.000000000000057 28.000000000000025 97.000000000000085 -41.000000000000036

It turns out that the exact inverse has the integer entries produced by

X = round(Z) X = -66 28 97 -41

We can compare the actual relative error with the estimate.

format short norm(Z - X)/norm(X) ans = 8.7342e-16

So we actually have more than 15 significant digits of accuracy.

Speed

In simplified outline, the algorithm for computing the inverse of an $n$ -by- $n$ matrix, or for solving a system of $n$ linear equations, involves loops of length $n$ nested three deep. Each of the $n^2$ elements is accessed roughly $n$ times. So the computational complexity is proportional to $n^3$. Lots of things can be done that affect the coefficient of proportionality, but it's still order $n^3$.

The actual numbers in the matrix (generally) don't affect the execution time. A nearly singular matrix can be inverted just as fast as a well-conditioned one. The answer might not be very accurate if the condition number is large, but $\kappa(A)$ does not play a role in the speed.

\n'); d.write(code_string); // Add copyright line at the bottom if specified. if (copyright.length > 0) { d.writeln(''); d.writeln('%%'); if (copyright.length > 0) { d.writeln('% _' + copyright + '_'); } } d.write('\n'); d.title = title + ' (MATLAB code)'; d.close(); } -->

Get the MATLAB code (requires JavaScript)

Published with MATLAB® R2017a

### Householder Seminar HHXX on Numerical Linear Algebra

Sat, 07/01/2017 - 14:46

The Householder meetings on Numerical Linear Algebra have been held roughly every three years since 1961. The twentieth, with the logo HHXX, was held June 18th through 23rd at Virginia Tech in Blacksburg, Virginia. I've been to 17 of the meetings. They have been an important part of my professional life and I've written Cleve's Corner articles and blog posts about them, MathWorks News & Notes, and Householder XIX.

ContentsWeb Site

Here's a link to the website, where you can see the daily program, <http://www.math.vt.edu/HHXX>.

Conference Center

The meeting was held in a conference center on the edge of the Virginia Tech campus. There were about 150 attendees. We ate and stayed in the center, which was perfect for this kind of meeting.

Group Photo

Here's the group photo, with the distinctive Virginia Tech architecture in the background.

Photo credit: Alex Grim, Virginia Tech.

Old Timers

Eric de Sturler took this photo of me with some of my long time friends, Paul van Dooren, Bo Kågström, Pete Stewart, Chris Page, Michael Saunders, and Bob Plemmons.

Photo credit: Eric de Sturler, Virginia Tech.

My Talk

I gave the after dinner talk on Wednesday evening. I talked about some of my memories of the past meetings. Here's a video where I've recreated the talk.

@media (min-width: 992px) { .containing-block { width: 100%; } } 2017 Householder Prize

The Householder Prize is awarded for the best PhD thesis written in the previous three years. This year there were two winners.

There was also an honorable mention.

• Leonardo Robol, Istituto de Sienza e Tecnologie dell'Informazione "A. Faedo".
2017 Householder Prize Committee
• Michele Benzi, Emory University, USA
• Inderjit Dhillon, UT Austin, USA
• Howard Elman (chair), University of Maryland, USA
• Andreas Frommer, University of Wuppertal, Germany
• Francoise Tisseur, University of Manchester, UK
• Stephen Vavasis, University of Waterloo, Canada
2020 Householder Prize Committee
• Inderjit Dhillon, UT Austin, USA
• Alan Edelman, MIT, USA
• Mark Embree, VA Tech, USA
• Melina Freitag, University of Bath, UK
• Andreas Frommer, University of Wuppertal, Germany
• Francoise Tisseur (chair), University of Manchester, UK
• Stephen Vavasis, University of Waterloo, Canada
2017 Householder Seminar Committee
• David Bindel, Cornell University, USA
• Jim Demmel, University of California at Berkeley, USA
• Zlatko Drmac, University of Zagreb, Croatia
• Alan Edelman, Massachusetts Institute of Technology, USA
• Heike Fassbender , TU Braunschweig, Germany
• Ilse Ipsen, North Carolina State University, USA
• Volker Mehrmann, Technical University of Berlin, Germany
• Jim Nagy (chair), Emory University, USA
• Valeria Simoncini, University of Bologna, Italy
• Andy Wathen, Oxford University, UK
2020 Householder Seminar Committee
• Zhaojun Bai, University of California, Davis, USA
• David Bindel, Cornell University, USA
• Jim Demmel, University of California at Berkeley, USA
• Zlatko Drmac, University of Zagreb, Croatia
• Heike Fassbender (chair), TU Braunschweig, Germany
• Sherry Li, Lawrence Berkeley National Laboratory, USA
• Volker Mehrmann, Technical University of Berlin, Germany
• Jim Nagy, Emory University, USA
• Valeria Simoncini, University of Bologna, Italy
• Andy Wathen, Oxford University, UK
2017 Householder Symposium Local Organizing Committee

The local organizing committee did a terrific job. All of them are from Virginia Tech.

• Christopher Beattie
• Julianne Chung
• Matthias Chung
• Eric de Sturler
• Mark Embree (chair)
• Serkan Gugercin
2020 Householder Symposium Location

The next Householder Symposium will be held on the eastern coast of Italy at the Hotel Sierra Silvana, Selva di Fasano, halfway between Bari and Brindisi. The exact dates are not yet set, but it will probably be in June again.

2020 Householder Symposium Local Organizing Committee
• Nicola Mastronardi (chair), Consiglio Nazionale delle Ricerche, Bari, Italy
• Dario A. Bini, Università di Pisa, Italy
• Fasma Diele, Consiglio Nazionale delle Ricerche, Bari, Italy
• Carmela Marangi, Consiglio Nazionale delle Ricerche, Bari, Italy
• Beatrice Meini, Università di Pisa, Italy
• Stefano Serra Capizzano, University of Insubria, Como, Italy
• Valeria Simoncini, Dipartimento di Matematica, Università di Bologna, Italy
Thanks

Thanks to Stuart McGarrity for help with the video, to Jim Nagy who reminded me of all the committee members, and to Eric de Struler for the photos.

\n'); d.write(code_string); // Add copyright line at the bottom if specified. if (copyright.length > 0) { d.writeln(''); d.writeln('%%'); if (copyright.length > 0) { d.writeln('% _' + copyright + '_'); } } d.write('\n'); d.title = title + ' (MATLAB code)'; d.close(); } -->

Get the MATLAB code (requires JavaScript)

Published with MATLAB® R2017a

, and % . %% Web Site % Here's a link to the website, where you can see the daily program, % . %% Conference Center % The meeting was held in a conference center on the edge of the % Virginia Tech campus. There were about 150 attendees. We ate and % stayed in the center, which was perfect for this kind of meeting. %% Group Photo % Here's the group photo, with the distinctive Virginia Tech architecture % in the background. % % <> % % Photo credit: Alex Grim, Virginia Tech. %% Old Timers % Eric de Sturler took this photo of me with some of my long time friends, % Paul van Dooren, Bo KÃ¥gstrÃ¶m, Pete Stewart, Chris Page, Michael Saunders, % and Bob Plemmons. % % <> % % Photo credit: Eric de Sturler, Virginia Tech. %% My Talk % I gave the after dinner talk on Wednesday evening. I talked about some % of my memories of the past meetings. % Here's a video where I've recreated the talk. % % [bcvid id="5483625177001"] %% 2017 Householder Prize % The Householder Prize is awarded for the best PhD thesis written in % the previous three years. This year there were two winners. % % * Marcel Schweitzer, Ecole Polytechnique Federale de Lausanne, % Switzerland. % . % * Edgar Solomonik, University of Illinois, UIUC, USA, % . %% % There was also an honorable mention. % % * Leonardo Robol, Istituto de Sienza e Tecnologie dell'Informazione % "A. Faedo". %% 2017 Householder Prize Committee % * Michele Benzi, Emory University, USA % * Inderjit Dhillon, UT Austin, USA % * Howard Elman (chair), University of Maryland, USA % * Andreas Frommer, University of Wuppertal, Germany % * Francoise Tisseur, University of Manchester, UK % * Stephen Vavasis, University of Waterloo, Canada %% 2020 Householder Prize Committee % * Inderjit Dhillon, UT Austin, USA % * Alan Edelman, MIT, USA % * Mark Embree, VA Tech, USA % * Melina Freitag, University of Bath, UK % * Andreas Frommer, University of Wuppertal, Germany % * Francoise Tisseur (chair), University of Manchester, UK % * Stephen Vavasis, University of Waterloo, Canada %% 2017 Householder Seminar Committee % * David Bindel, Cornell University, USA % * Jim Demmel, University of California at Berkeley, USA % * Zlatko Drmac, University of Zagreb, Croatia % * Alan Edelman, Massachusetts Institute of Technology, USA % * Heike Fassbender , TU Braunschweig, Germany % * Ilse Ipsen, North Carolina State University, USA % * Volker Mehrmann, Technical University of Berlin, Germany % * Jim Nagy (chair), Emory University, USA % * Valeria Simoncini, University of Bologna, Italy % * Andy Wathen, Oxford University, UK %% 2020 Householder Seminar Committee % * Zhaojun Bai, University of California, Davis, USA % * David Bindel, Cornell University, USA % * Jim Demmel, University of California at Berkeley, USA % * Zlatko Drmac, University of Zagreb, Croatia % * Heike Fassbender (chair), TU Braunschweig, Germany % * Sherry Li, Lawrence Berkeley National Laboratory, USA % * Volker Mehrmann, Technical University of Berlin, Germany % * Jim Nagy, Emory University, USA % * Valeria Simoncini, University of Bologna, Italy % * Andy Wathen, Oxford University, UK %% 2017 Householder Symposium Local Organizing Committee % The local organizing committee did a terrific job. % All of them are from Virginia Tech. % % * Christopher Beattie % * Julianne Chung % * Matthias Chung % * Eric de Sturler % * Mark Embree (chair) % * Serkan Gugercin %% 2020 Householder Symposium Location % The next Householder Symposium will be held on the eastern coast of % Italy at the Hotel Sierra Silvana, Selva di Fasano, halfway between % Bari and Brindisi. The exact dates are not yet set, but it will % probably be in June again. %% 2020 Householder Symposium Local Organizing Committee % * Nicola Mastronardi (chair), Consiglio Nazionale delle Ricerche, Bari, % Italy % * Dario A. Bini, UniversitÃ  di Pisa, Italy % * Fasma Diele, Consiglio Nazionale delle Ricerche, Bari, Italy % * Carmela Marangi, Consiglio Nazionale delle Ricerche, Bari, Italy % * Beatrice Meini, UniversitÃ  di Pisa, Italy % * Stefano Serra Capizzano, University of Insubria, Como, Italy % * Valeria Simoncini, Dipartimento di Matematica, UniversitÃ  di Bologna, % Italy %% Thanks % Thanks to Stuart McGarrity for help with the video, % to Jim Nagy who reminded me of all the committee members, % and to Eric de Struler for the photos. ##### SOURCE END ##### 431eb354223f4728878ad84b1f84c0e4 -->

### Hilbert Matrices

Thu, 06/08/2017 - 02:25

I first encountered the Hilbert matrix when I was doing individual studies under Professor John Todd at Caltech in 1960. It has been part of my professional life ever since.

ContentsDavid Hilbert

Around the turn of the 20th century, David Hilbert was the world's most famous mathematician. He introduced the matrix that now bears his name in a paper in 1895. The elements of the matrix, which are reciprocals of consecutive positive integers, are constant along the antidiagonals.

$$h_{i,j} = \frac{1}{i+j-1}, \ \ i,j = 1:n$$

format rat H5 = hilb(5) H5 = 1 1/2 1/3 1/4 1/5 1/2 1/3 1/4 1/5 1/6 1/3 1/4 1/5 1/6 1/7 1/4 1/5 1/6 1/7 1/8 1/5 1/6 1/7 1/8 1/9 latex(sym(H5));

$$H_5 = \left(\begin{array}{ccccc} 1 & \frac{1}{2} & \frac{1}{3} & \frac{1}{4} & \frac{1}{5}\\ \frac{1}{2} & \frac{1}{3} & \frac{1}{4} & \frac{1}{5} & \frac{1}{6}\\ \frac{1}{3} & \frac{1}{4} & \frac{1}{5} & \frac{1}{6} & \frac{1}{7}\\ \frac{1}{4} & \frac{1}{5} & \frac{1}{6} & \frac{1}{7} & \frac{1}{8}\\ \frac{1}{5} & \frac{1}{6} & \frac{1}{7} & \frac{1}{8} & \frac{1}{9} \end{array}\right)$$

Here's a picture. The continuous surface generated is very smooth.

H12 = hilb(12); surf(log(H12)) view(60,60) Least squares

Hilbert was interested in this matrix because it comes up in the least squares approximation of a continuous function on the unit interval by polynomials, expressed in the conventional basis as linear combinations of monomials.

$$p(x) = \sum_{j=1}^n c_j x^{j-1}$$

The coefficient matrix for the normal equations has elements

$$\int_0^1 x^{i+j-2} dx \ = \ \frac{1}{i+j-1}$$

Properties

A Hilbert matrix has many useful properties.

• Symmetric.
• Positive definite.
• Hankel, $a_{i,j}$ is a function of $i+j$.
• Cauchy, $a_{i,j} = 1/(x_i - y_j)$.
• Nearly singular.

Each column of a Hilbert matrix is nearly a multiple of the other columns. So the columns are nearly linearly dependent and the matrix is close to, but not exactly, singular.

hilb

MATLAB has always had functions hilb and invhilb that compute the Hilbert matrix and its inverse. The body of hilb is now only two lines.

J = 1:n; H = 1./(J'+J-1);

We often cite this as a good example of singleton expansion. A column vector is added to a row vector to produce a matrix, then a scalar is subtracted from that matrix, and finally the reciprocals of the elements produce the result.

Inverse

It is possible to express the elements of the inverse of the Hilbert matrix in terms of binomial coefficients. For reasons that I've now forgotten, I always use $T$ for $H^{-1}$.

$$t_{i,j} = (-1)^{i+j} (i+j-1) {{n+i-1} \choose {n-j}} {{n+j-1} \choose {n-i}} {{i+j-2} \choose {i-1}}^2$$

The elements of the inverse Hilbert matrix are integers, but they are large integers. The largest ones are on the diagonal. For order 13, the largest element is

$$(T_{13})_{9,9} \ = \ 100863567447142500$$

This is over $10^{17}$ and is larger than double precision flintmax.

format longe flintmax = flintmax flintmax = 9.007199254740992e+15

So, it is possible to represent the largest elements exactly only if $n \le 12$.

The HELP entry for invhilb includes a sentence inherited from my original Fortran MATLAB.

The result is exact for N less than about 15.

Now that's misleading. It should say

The result is exact for N <= 12.

(I'm filing a bug report.)

invhilv

The body of invhilb begins by setting p to the order n. The doubly nested for loops then evaluate the binomial coefficient formula recursively, avoiding unnecessary integer overflow.

dbtype invhilb 18:28 18 p = n; 19 for i = 1:n 20 r = p*p; 21 H(i,i) = r/(2*i-1); 22 for j = i+1:n 23 r = -((n-j+1)*r*(n+j-1))/(j-1)^2; 24 H(i,j) = r/(i+j-1); 25 H(j,i) = r/(i+j-1); 26 end 27 p = ((n-i)*p*(n+i))/(i^2); 28 end

I first programmed this algorithm in machine language for the Burroughs 205 Datatron at Caltech almost 60 years ago. I was barely out of my teens.

Here's the result for n = 6.

format short T6 = invhilb(6) T6 = 36 -630 3360 -7560 7560 -2772 -630 14700 -88200 211680 -220500 83160 3360 -88200 564480 -1411200 1512000 -582120 -7560 211680 -1411200 3628800 -3969000 1552320 7560 -220500 1512000 -3969000 4410000 -1746360 -2772 83160 -582120 1552320 -1746360 698544

A checkerboard sign pattern with large integers in the inverse cancels the smooth surface of the Hilbert matrix itself.

T12 = invhilb(12); spy(T12 > 0)

A log scale mitigates the jaggedness.

surf(sign(T12).*log(abs(T12))) view(60,60) Rookie experiment

Now using MATLAB, I am going to repeat the experiment that I did on the Burroughs 205 when I was still a rookie. I had just written my first program that used Gaussian elimination to invert matrices. I proceeded to test it by inverting Hilbert matrices and comparing the results with the exact inverses. (Today's code uses this utility function that picks out the largest element in a matrix.)

maxabs = @(X) max(double(abs(X(:))));

Here is n = 10.

n = 10 H = hilb(n); X = inv(H); % Computed inverse T = invhilb(n); % Theoretical inverse E = X - T; % The error err = maxabs(E) n = 10 err = 5.0259e+08

At first I might have said:

Wow! The error is $10^8$. That's a pretty big error. Can I trust my matrix inversion code? What went wrong?

But I knew the elements of the inverse are huge. We should be looking at relative error.

relerr = maxabs(E)/maxabs(T) relerr = 1.4439e-04

OK. The relative error is $10^{-4}$. That still seems like a lot.

I knew that the Hilbert matrix is nearly singular. That's why I was using it. John Todd was one of the first people to write about condition numbers. An error estimate that involves nearness to singularity and the floating point accuracy would be expressed today by

esterr = cond(H)*eps esterr = 0.0036

That was about all I understood at the time. The roundoff error in the inversion process is magnified by the condition number of the matrix. And, the error I observe is less than the estimate that this simple analysis provides. So my inversion code passes this test.

Deeper explanation

I met Jim Wilkinson a few years later and came to realize that there is more to the story. I'm not actually inverting the Hilbert matrix. There are roundoff errors involved in computing H even before it is passed to the inversion routine.

Today, the Symbolic Math Toolbox helps provide a deeper explanation. The 'f' flag on the sym constructor says to convert double precision floating point arguments exactly to their rational representation. Here's how it works in this situation. To save space, I'll print just the first column.

H = hilb(n); F = sym(H,'f'); F(:,1) ans = 1 1/2 6004799503160661/18014398509481984 1/4 3602879701896397/18014398509481984 6004799503160661/36028797018963968 2573485501354569/18014398509481984 1/8 2001599834386887/18014398509481984 3602879701896397/36028797018963968

The elements of H that are not exact binary fractions become ratios of large integers. The denominators are powers of two; in this case $2^{54}$ and $2^{55}$. The numerators are these denominators divided by $3$, $5$, etc. and then rounded to the nearest integer. The elements of F are as close to the exact elements of a Hilbert matrix as we can get in binary floating point.

Let's invert F, using the exact rational arithmetic provided by the Symbolic Toolbox. (I couldn't do this in 1960.)

S = inv(F);

We now have three inverse Hilbert matrices, X, S, and T.

• X is the approximate inverse computed with floating point arithmetic by the routine I was testing years ago, or by MATLAB inv function today.
• S is the exact inverse of the floating point matrix that was actually passed to the inversion routine.
• T is the exact Hilbert inverse, obtained from the binomial coefficient formula.

Let's print the first columns alongside each other.

fprintf('%12s %16s %15s\n','X','S','T') fprintf('%16.4f %16.4f %12.0f\n',[X(:,1) S(:,1) T(:,1)]') X S T 99.9961 99.9976 100 -4949.6667 -4949.7926 -4950 79192.8929 79195.5727 79200 -600535.3362 -600559.6914 -600600 2522211.3665 2522327.5182 2522520 -6305451.1288 -6305770.4041 -6306300 9608206.6797 9608730.4926 9609600 -8750253.0592 -8750759.2546 -8751600 4375092.6697 4375358.4162 4375800 -923624.4113 -923682.8529 -923780

It looks like X is closer to S than S is to T. Let's confirm by computing two relative errors, the difference between X and S, and the difference between S and T.

format shorte relerr(1) = maxabs(X - S)/maxabs(T); relerr(2) = maxabs(S - T)/maxabs(T) relerrtotal = sum(relerr) relerr = 5.4143e-05 9.0252e-05 relerrtotal = 1.4439e-04

The error in the computed inverse comes from two sources -- generating the matrix in the first place and then computing the inverse. The first of these is actually larger than the second, although the two are comparable.

\n'); d.write(code_string); // Add copyright line at the bottom if specified. if (copyright.length > 0) { d.writeln(''); d.writeln('%%'); if (copyright.length > 0) { d.writeln('% _' + copyright + '_'); } } d.write('\n'); d.title = title + ' (MATLAB code)'; d.close(); } -->

Get the MATLAB code (requires JavaScript)

Published with MATLAB® R2017a

at Caltech in 1960. % It has been part of my professional life ever since. %% David Hilbert % Around the turn of the 20th century, David Hilbert was the world's % most famous mathematician. He introduced the matrix that now bears % his name in a paper in 1895. The elements of the matrix, which are % reciprocals of consecutive positive integers, are constant along the % antidiagonals. % % $$h_{i,j} = \frac{1}{i+j-1}, \ \ i,j = 1:n$$ %% format rat H5 = hilb(5) %% latex(sym(H5)); %% % $$H_5 = \left(\begin{array}{ccccc} % 1 & \frac{1}{2} & \frac{1}{3} & \frac{1}{4} & \frac{1}{5}\\ % \frac{1}{2} & \frac{1}{3} & \frac{1}{4} & \frac{1}{5} & \frac{1}{6}\\ % \frac{1}{3} & \frac{1}{4} & \frac{1}{5} & \frac{1}{6} & \frac{1}{7}\\ % \frac{1}{4} & \frac{1}{5} & \frac{1}{6} & \frac{1}{7} & \frac{1}{8}\\ % \frac{1}{5} & \frac{1}{6} & \frac{1}{7} & \frac{1}{8} & \frac{1}{9} % \end{array}\right)$$ %% % Here's a picture. The continuous surface generated is very smooth. H12 = hilb(12); surf(log(H12)) view(60,60) %% Least squares % Hilbert was interested in this matrix because it comes up in the % least squares approximation of a continuous function on the unit % interval by polynomials, expressed in the conventional basis % as linear combinations of monomials. % % $$p(x) = \sum_{j=1}^n c_j x^{j-1}$$ % % The coefficient matrix for the normal equations has elements % % $$\int_0^1 x^{i+j-2} dx \ = \ \frac{1}{i+j-1}$$ %% Properties % A Hilbert matrix has many useful properties. % % * Symmetric. % * Positive definite. % * Hankel, $a_{i,j}$ is a function of $i+j$. % * Cauchy, $a_{i,j} = 1/(x_i - y_j)$. % * Nearly singular. %% % Each column of a Hilbert matrix is nearly a multiple of % the other columns. So the columns are nearly linearly % dependent and the matrix is close to, but not exactly, singular. %% hilb % MATLAB has always had functions |hilb| and |invhilb| that compute % the Hilbert matrix and its inverse. The body of |hilb| is now only % two lines. % % J = 1:n; % H = 1./(J'+J-1); % % We often cite this as a good example of _singleton expansion_. % A column vector is added to a row vector to produce a matrix, % then a scalar is subtracted from that matrix, and finally the % reciprocals of the elements produce the result. %% Inverse % It is possible to express the elements of the inverse of the Hilbert % matrix in terms of binomial coefficients. For reasons that I've now % forgotten, I always use $T$ for $H^{-1}$. % % $$t_{i,j} = (-1)^{i+j} (i+j-1) {{n+i-1} \choose {n-j}} % {{n+j-1} \choose {n-i}} {{i+j-2} \choose {i-1}}^2$$ %% % The elements of the inverse Hilbert matrix are integers, but they % are _large_ integers. The largest ones are on the diagonal. For % order 13, the largest element is % % $$(T_{13})_{9,9} \ = \ 100863567447142500$$ % % This is over $10^{17}$ and is larger than double precision |flintmax|. format longe flintmax = flintmax %% % So, it is possible to represent the largest % elements exactly only if $n \le 12$. %% % The HELP entry for |invhilb| includes a sentence inherited from % my original Fortran MATLAB. % % The result is exact for N less than about 15. % % Now that's misleading. It should say % % The result is exact for N <= 12. % % (I'm filing a bug report.) %% invhilv % The body of |invhilb| begins by setting |p| to the order |n|. % The doubly nested |for| loops then evaluate the binomial coefficient % formula recursively, avoiding unnecessary integer overflow. dbtype invhilb 18:28 %% % I first programmed this algorithm in machine language for the Burroughs % 205 Datatron at Caltech almost 60 years ago. I was barely out of my % teens. %% % Here's the result for |n = 6|. format short T6 = invhilb(6) %% % A checkerboard sign pattern with large integers in the inverse cancels % the smooth surface of the Hilbert matrix itself. T12 = invhilb(12); spy(T12 > 0) %% % A log scale mitigates the jaggedness. surf(sign(T12).*log(abs(T12))) view(60,60) %% Rookie experiment % Now using MATLAB, I am going to repeat the experiment that I did on % the Burroughs 205 when I was still a rookie. I had just written my % first program that used Gaussian elimination to invert matrices. % I proceeded to test it by inverting Hilbert matrices and comparing % the results with the exact inverses. (Today's code uses % this utility function that picks out the largest element in % a matrix.) maxabs = @(X) max(double(abs(X(:)))); %% % Here is |n = 10|. n = 10 H = hilb(n); X = inv(H); % Computed inverse T = invhilb(n); % Theoretical inverse E = X - T; % The error err = maxabs(E) %% % At first I might have said: % % _Wow! The error is $10^8$. That's a pretty big error. % Can I trust my matrix inversion code? What went wrong?_ %% % But I knew the elements of the inverse are huge. % We should be looking at _relative_ error. relerr = maxabs(E)/maxabs(T) %% % _OK. The relative error is $10^{-4}$. That still seems like a lot._ %% % I knew that the Hilbert matrix is nearly singular. That's why I % was using it. John Todd was one of the first people to write about % condition numbers. An error estimate that involves nearness to % singularity and the floating point accuracy would be expressed today % by esterr = cond(H)*eps %% % That was about all I understood at the time. The roundoff error % in the inversion process is magnified by the condition number of % the matrix. And, the error I observe is less than the estimate that % this simple analysis provides. So my inversion code passes this test. %% Deeper explanation % I met % a few years later and came to realize that there is more to the story. % I'm not actually inverting the Hilbert matrix. There are roundoff % errors involved in computing |H| even before it is passed to the % inversion routine. %% % Today, the Symbolic Math Toolbox helps provide a deeper explanation. % The |'f'| flag on the |sym| constructor says to convert double precision % floating point arguments exactly to their rational representation. % Here's how it works in this situation. To save space, I'll print just % the first column. H = hilb(n); F = sym(H,'f'); F(:,1) %% % The elements of |H| that are not exact binary fractions become ratios % of large integers. The denominators are powers of two; in this case % $2^{54}$ and $2^{55}$. The numerators are these denominators divided by % $3$, $5$, etc. and then rounded to the nearest integer. % The elements of |F| are as close to the exact elements of a Hilbert % matrix as we can get in binary floating point. %% % Let's invert |F|, using the exact rational arithmetic provided % by the Symbolic Toolbox. (I couldn't do this in 1960.) S = inv(F); %% % We now have three inverse Hilbert matrices, |X|, |S|, and |T|. % % * |X| is the approximate inverse computed with floating point arithmetic % by the routine I was testing years ago, or by MATLAB |inv| function % today. % * |S| is the exact inverse of the floating point matrix % that was actually passed to the inversion routine. % % * |T| is the exact Hilbert inverse, obtained from the binomial % coefficient formula. %% % Let's print the first columns alongside each other. fprintf('%12s %16s %15s\n','X','S','T') fprintf('%16.4f %16.4f %12.0f\n',[X(:,1) S(:,1) T(:,1)]') %% % It looks like |X| is closer to |S| than |S| is to |T|. Let's confirm % by computing two relative errors, the difference between % |X| and |S|, and the difference between |S| and |T|. format shorte relerr(1) = maxabs(X - S)/maxabs(T); relerr(2) = maxabs(S - T)/maxabs(T) relerrtotal = sum(relerr) %% % The error in the computed inverse comes from two sources REPLACE_WITH_DASH_DASH generating % the matrix in the first place and then computing the inverse. % The first of these is actually larger than the second, although % the two are comparable. ##### SOURCE END ##### b5980438b46c4cb5acd20123a1a034fd -->

### Quadruple Precision, 128-bit Floating Point Arithmetic

Tue, 05/23/2017 - 02:18

The floating point arithmetic format that occupies 128 bits of storage is known as binary128 or quadruple precision. This blog post describes an implementation of quadruple precision programmed entirely in the MATLAB language.

ContentsBackground

The IEEE 754 standard, published in 1985, defines formats for floating point numbers that occupy 32 or 64 bits of storage. These formats are known as binary32 and binary64, or more frequently as single and double precision. For many years MATLAB used only double precision and it remains our default format. Single precision has been added gradually over the last several years and is now also fully supported.

A revision of IEEE 754, published in 2008, defines two more floating point formats. One, binary16 or half precision, occupies only 16 bits and was the subject of my previous blog post. It is primarily intended to reduce storage and memory bandwidth requirements. Since it provides only "half" precision, its use for actual computation is problematic.

The other new format introduced in IEEE 754-2008 is binary128 or quadruple precision. It is intended for situations where the accuracy or range of double precision is inadequate.

I see two descriptions of quadruple precision software implementations on the Web.

I have not used either package, but judging by their Web pages, they both appear to be complete and well supported.

The MATLAB Symbolic Math Toolbox provides vpa, arbitrary precision decimal floating point arithmetic, and sym, exact rational arithmetic. Both provide accuracy and range well beyond quadruple precision, but do not specifically support the 128-bit IEEE format.

My goal here is to describe a prototype of a MATLAB object, fp128, that implements quadruple precision with code written entirely in the MATLAB language. It is not very efficient, but is does allow experimentation with the 128-bit format.

Beyond double

There are other floating point formats beyond double precision. Long double usually refers to the 80-bit extended precision floating point registers available with the Intel x86 architecture and described as double extended in IEEE 754. This provides the same exponent range as quadruple precision, but much less accuracy.

Double double refers to the use of a pair of double precision values. The exponent field and sign bit of the second double are ignored, so this is effectively a 116-bit format. Both the exponent range and the precision are more than double but less than quadruple.

Floating point anatomy

The format of a floating point number is characterized by two parameters, p, the number of bits in the fraction and q, the number of bits in the exponent. I will compare four precisions, half, single, double, and quadruple. The four pairs of characterizing parameters are

p = [10, 23, 52 112]; q = [5, 8, 11, 15];

With these values of p and q, and with one more bit for the sign, the total number of bits in the word, w, is a power of two.

format shortg w = p + q + 1 w = 16 32 64 128

Normalized numbers

Most floating point numbers are normalized, and are expressed as

$$x = \pm (1+f)2^e$$

The fraction $f$ is in the half open interval

$$0 \leq f < 1$$

The binary representation of $f$ requires at most p bits. In other words $2^p f$ is an integer in the range

$$0 \leq 2^p f < 2^p$$

The exponent $e$ is an integer in the range

$$-b+1 \leq e \leq b$$

The quantity $b$ is both the largest exponent and the bias.

$$b = 2^{q-1} - 1$$

b = 2.^(q-1)-1 b = 15 127 1023 16383

The fractional part of a normalized number is $1+f$, but only $f$ needs to be stored. That leading $1$ is known as the hidden bit.

Subnormal

There are two values of the exponent $e$ for which the biased exponent, $e+b$, reaches the smallest and largest values possible to represent in q bits. The smallest is

$$e + b = 0$$

The corresponding floating point numbers do not have a hidden leading bit. These are the subnormal or denormal numbers.

$$x = \pm f 2^{-b}$$

Infinity and Not-A-Number

The largest possible biased exponent is

$$e + b = 2^q-1$$.

Quantities with this exponent field represent infinities and NaN, or Not-A-Number.

The percentage of floating point numbers that are exceptional because they are subnormal, infinity or NaN increases as the precision decreases. Exceptional exponents are only $2$ values out of $2^q$. For quadruple precision this is $2/2^{15}$, which is less than a one one-thousandth of one percent.

Encode the sign bit with s = 0 for nonnegative and s = 1 for negative. And encode the exponent with an offsetting bias, b. Then a floating point number can be packed in w bits with

x = [s e+b 2^p*f]Precision and range

epsilon

If a real number cannot be expressed with a binary expansion requiring at most p bits, it must be approximated by a floating point number that does have such a binary representation. This is roundoff error. The important quantity characterizing precision is machine epsilon, or eps. In MATLAB, eps(x) is the distance from x to the next larger (in absolute value) floating point number (of that class). With no argument, eps is simply the difference between 1 and the next larger floating point number.

format shortg eps = 2.^(-p) eps = 0.00097656 1.1921e-07 2.2204e-16 1.9259e-34

This tells us that quadruple precision is good for about 34 decimal digits of accuracy, double for about 16 decimal digits, single for about 7, and half for about 3.

realmax

If a real number, or the result of an arithmetic operation, is too large to be represented, it overflows and is replaced by infinity. The largest floating point number that does not overflow is realmax. When I try to compute quadruple realmax with double precision, it overflows. I will fix this up in the table to follow.

realmax = 2.^b.*(2-eps) realmax = 65504 3.4028e+38 1.7977e+308 Inf

realmin

Underflow and representation of very small numbers is more complicated. The smallest normalized floating point number is realmin. When I try to compute quadruple realmin it underflows to zero. Again, I will fix this up in the table.

realmin = 2.^(-b+1) realmin = 6.1035e-05 1.1755e-38 2.2251e-308 0

tiny

But there are numbers smaller than realmin. IEEE 754 introduced the notion of gradual underflow and denormal numbers. In the 2008 revised standard their name was changed to subnormal.

Think of roundoff in numbers near underflow. Before 754, floating point numbers had the disconcerting property that x and y could be unequal, but their difference could underflow, so x-y becomes 0. With 754 the gap between 0 and realmin is filled with numbers whose spacing is the same as the spacing between realmin and 2*realmin. I like to call this spacing, and the smallest subnormal number, tiny.

tiny = realmin.*eps tiny = 5.9605e-08 1.4013e-45 4.9407e-324 0 Floating point integers

flintmax

It is possible to do integer arithmetic with floating point numbers. I like to call such numbers flints. When we write the numbers $3$ and $3.0$, they are different descriptions of the same integer, but we think of one as fixed point and the other as floating point. The largest flint is flintmax.

flintmax = 2./eps flintmax = 2048 1.6777e+07 9.0072e+15 1.0385e+34

Technically all the floating point numbers larger than flintmax are integers, but the spacing between them is larger than one, so it is not safe to use them for integer arithmetic. Only integer-valued floating point numbers between 0 and flintmax are allowed to be called flints.

Table

Let's collect all these anatomical characteristics together in a new MATLAB table. I have now edited the output and inserted the correct quadruple precision values.

T = [w; p; q; b; eps; realmax; realmin; tiny; flintmax]; T = table(T(:,1), T(:,2), T(:,3), T(:,4), ... 'variablenames',{'half','single','double','quadruple'}, ... 'rownames',{'w','p','q','b','eps','realmax','realmin', ... 'tiny','flintmax'}); type Table.txt half single double quadruple __________ __________ ___________ __________ w 16 32 64 128 p 10 23 52 112 q 5 8 11 15 b 15 127 1023 16383 eps 0.00097656 1.1921e-07 2.2204e-16 1.9259e-34 realmax 65504 3.4028e+38 1.7977e+308 1.190e+4932 realmin 6.1035e-05 1.1755e-38 2.2251e-308 3.362e-4932 tiny 5.9605e-08 1.4013e-45 4.9407e-324 6.475e-4966 flintmax 2048 1.6777e+07 9.0072e+15 1.0385e+34 fp128

I am currently working on code for an object, @fp128, that could provide a full implementation of quadruple-precision arithmetic. The methods available so far are

methods(fp128) Methods for class fp128: abs eq le mtimes realmax subsref cond fp128 lt ne realmin svd diag frac lu norm shifter sym disp ge max normalize sig times display gt minus normalize2 sign tril double hex mldivide plus sqrt triu ebias hypot mpower power subsasgn uminus eps ldivide mrdivide rdivide subsindex

The code that I have for quadrule precision is much more complex than the code that I have for half precision. There I am able to "cheat" by converting half precision numbers to doubles and relying on traditional MATLAB arithmetic. I can't do that for quads.

The storage scheme for quads is described in the help entry for the constructor.

help @fp128/fp128 fp128 Quad precision constructor. z = fp128(x) has three fields. x = s*(1+f)*2^e, where z.s, one uint8, s = (-1)^sg = 1-2*sg, sg = (1-s)/2. z.e, 15 bits, biased exponent, one uint16. b = 2^14-1 = 16383, eb = e + b, 1 <= eb <= 2*b for normalized quads, eb = 0 for subnormal quads, eb = 2*b+1 = 32767 for infinity and NaN. z.f, 112 bits, nonnegative fraction, 4-vector of uint64s, each with 1/4-th of the bits, 0 <= f(k) < 2^28, 4*28 = 112. z.f represents sum(f .* pow2), pow2 = 2.^(-28*(1:4)) Reference page in Doc Center doc fp128

Breaking the 112-bit fraction into four 28-bit pieces makes it possible to do arithmetic operations on the pieces without worrying about integer overflow. The core of the times code, which implements x.*y, is the convolution of the two fractional parts.

dbtype 45:53 @fp128/times 45 % The multiplication. 46 % z.f = conv(x.f,y.f); 47 % Restore hidden 1's. 48 xf = [1 x.f]; 49 yf = [1 y.f]; 50 zf = zeros(1,9,'uint64'); 51 for k = 1:5 52 zf(k:k+4) = zf(k:k+4) + yf(k)*xf; 53 end

The result of the convolution, zf, is a uint64 vector of length nine with 52-bit elements. It must be renormalized to the fit the fp128 storage scheme.

Addition and subtraction involve addition and subtraction of the fractional parts after they have been shifted so that the corresponding exponents are equal. Again, this produces temporary vectors that must be renormalized.

Scalar division, y/x, is done by first computing the reciprocal of the denominator, 1/x, and then doing one final multiplication, 1/x * y. The reciprocal is computed by a few steps of Newton iteration, starting with a scaled reciprocal, 1/double(x).

Examples

The output for each example shows the three fields in hexadecimal -- one sign field, one biased exponent field, and one fraction field that is a vector with four entries displayed with seven hex digits. This is followed by a 36 significant digit decimal representation.

One

clear format long one = fp128(1) one = 0 3fff 0000000 0000000 0000000 0000000 1.0

eps

eps = eps(one) eps = 0 3f8f 0000000 0000000 0000000 0000000 0.000000000000000000000000000000000192592994438723585305597794258492732

1 + eps

one_plus_eps = one + eps one_plus_eps = 0 3fff 0000000 0000000 0000000 0000001 1.00000000000000000000000000000000019

2 - eps

two_minus_eps = 2 - eps two_minus_eps = 0 3fff fffffff fffffff fffffff fffffff 1.99999999999999999999999999999999981

realmin

rmin = realmin(one) rmin = 0 0001 0000000 0000000 0000000 0000000 3.3621031431120935062626778173217526e-4932

realmax

rmax = realmax(one) rmax = 0 7ffe fffffff fffffff fffffff fffffff 1.18973149535723176508575932662800702e4932

Compute 1/10 with double, then convert to quadruple.

dble_tenth = fp128(1/10) dble_tenth = 0 3ffb 9999999 99999a0 0000000 0000000 0.100000000000000005551115123125782702

quad_tenth = 1/fp128(10) quad_tenth = 0 3ffb 9999999 9999999 9999999 9999999 0.0999999999999999999999999999999999928

Double precision pi converted to quadruple.

dble_pi = fp128(pi) dble_pi = 0 4000 921fb54 442d180 0000000 0000000 3.14159265358979311599796346854418516

quad_pi = fp128(sym('pi')) quad_pi = 0 4000 921fb54 442d184 69898cc 51701b8 3.1415926535897932384626433832795028 Matrix operations

The 4-by-4 magic square from Durer's Melancholia II provides my first matrix example.

clear M = fp128(magic(4));

Let's see how the 128-bit elements look in hex.

format hex M M = 0 4003 0000000 0000000 0000000 0000000 0 4001 4000000 0000000 0000000 0000000 0 4002 2000000 0000000 0000000 0000000 0 4001 0000000 0000000 0000000 0000000 0 4000 0000000 0000000 0000000 0000000 0 4002 6000000 0000000 0000000 0000000 0 4001 c000000 0000000 0000000 0000000 0 4002 c000000 0000000 0000000 0000000 0 4000 8000000 0000000 0000000 0000000 0 4002 4000000 0000000 0000000 0000000 0 4001 8000000 0000000 0000000 0000000 0 4002 e000000 0000000 0000000 0000000 0 4002 a000000 0000000 0000000 0000000 0 4002 0000000 0000000 0000000 0000000 0 4002 8000000 0000000 0000000 0000000 0 3fff 0000000 0000000 0000000 0000000

Check that the row sums are all equal. This matrix-vector multiply can be done exactly with the flints in the magic square.

e = fp128(ones(4,1)) Me = M*e e = 0 3fff 0000000 0000000 0000000 0000000 0 3fff 0000000 0000000 0000000 0000000 0 3fff 0000000 0000000 0000000 0000000 0 3fff 0000000 0000000 0000000 0000000 Me = 0 4004 1000000 0000000 0000000 0000000 0 4004 1000000 0000000 0000000 0000000 0 4004 1000000 0000000 0000000 0000000 0 4004 1000000 0000000 0000000 0000000 Quadruple precision backslash

I've overloaded mldivide, so I can solve linear systems and compute inverses. The actual computation is done by lutx, a "textbook" function that I wrote years ago, long before this quadruple-precision project, followed by the requisite solution of triangular systems. But now the MATLAB object system insures that every individual arithmetic operation is done with IEEE 754 quadruple precision.

Let's generate a 3-by-3 matrix with random two-digit integer entries.

A = fp128(randi(100,3,3)) A = 0 4002 0000000 0000000 0000000 0000000 0 4001 8000000 0000000 0000000 0000000 0 4004 b000000 0000000 0000000 0000000 0 4005 3800000 0000000 0000000 0000000 0 4005 7800000 0000000 0000000 0000000 0 4002 a000000 0000000 0000000 0000000 0 4004 c800000 0000000 0000000 0000000 0 4004 7800000 0000000 0000000 0000000 0 4000 0000000 0000000 0000000 0000000

I am going to use fp128 backslash to invert A. So I need the identity matrix in quadruple precision.

I = fp128(eye(size(A)));

Now the overloaded backslash calls lutx, and then solves two triangular systems to produce the inverse.

X = A\I X = 0 3ff7 2fd38ea bcfb815 69cdccc a36d8a5 1 3ff9 c595b53 8c842ee f26189c a0770d4 0 3ffa c0bc8b7 4adcc40 4ea66ca 61f1380 1 3ff7 a42f790 e4ad874 c358882 7ff988e 0 3ffa 12ea8c2 3ef8c17 01c7616 5e03a5a 1 3ffa 70d4565 958740b 78452d8 f32d866 0 3ff9 2fd38ea bcfb815 69cdccc a36d8a7 0 3ff3 86bc8e5 42ed82a 103d526 a56452f 1 3ff6 97f9949 ba961b3 72d69d9 4ace666

Compute the residual.

AX = A*X R = I - AX; format short RD = double(R) AX = 0 3fff 0000000 0000000 0000000 0000000 0 3f90 0000000 0000000 0000000 0000000 1 3f8d 0000000 0000000 0000000 0000000 0 0000 0000000 0000000 0000000 0000000 0 3fff 0000000 0000000 0000000 0000000 0 3f8d 8000000 0000000 0000000 0000000 1 3f8c 0000000 0000000 0000000 0000000 1 3f8d 0000000 0000000 0000000 0000000 0 3ffe fffffff fffffff fffffff ffffffb RD = 1.0e-33 * 0 0 0.0241 -0.3852 0 0.0481 0.0481 -0.0722 0.4815

Both AX and R are what I expect from arithmetic that is accurate to about 34 decimal digits.

Although I get a different random A every time I publish this blog post, I expect that it has a modest condition number.

kappa = cond(A) kappa = 0 4002 7e97c18 91278cd 8375371 7915346 11.9560249020758193065358323606886569

Since A is not badly conditioned, I can invert the computed inverse and expect to get close to the original integer matrix. The elements of the resulting Z are integers, possibly bruised with quadruple precision fuzz.

format hex Z = X\I Z = 0 4002 0000000 0000000 0000000 0000000 0 4001 8000000 0000000 0000000 0000000 0 4004 b000000 0000000 0000000 0000004 0 4005 37fffff fffffff fffffff ffffffc 0 4005 77fffff fffffff fffffff ffffffe 0 4002 a000000 0000000 0000000 0000001 0 4004 c7fffff fffffff fffffff ffffffc 0 4004 77fffff fffffff fffffff ffffffc 0 3fff fffffff fffffff fffffff ffffffc Quadruple precision SVD

I have just nonchalantly computed cond(A). Here is the code for the overloaded cond.

type @fp128/cond.m function kappa = cond(A) sigma = svd(A); kappa = sigma(1)/sigma(end); end

So it is correctly using the singular value decomposition. I also have svd overloaded. The SVD computation is handled by a 433 line M-file, svdtx, that, like lutx, was written before fp128 existed. It was necessary to modify five lines in svdtx. The line

u = zeros(n,ncu);

u = fp128(zeros(n,ncu));

Similarly for v, s, e and work. I should point out that the preallocation of the arrays is inherited from the LINPACK Fortran subroutine DSVDC. Without it, svdtx would not have required any modification to work correctly in quadruple precision.

Let's compute the full SVD.

[U,S,V] = svd(A) U = 1 3ffe 57d9492 76f3ea4 dc14bb3 15d42c1 1 3ffe 75a77c4 8c7b469 2cac695 59be7fe 1 3ffc 0621737 9b04c78 1c2109d 8736b46 1 3ffb 38214c0 d75c84c 4bcf5ff f3cffd7 1 3ffb a9281e3 e12dd3a d632d61 c8f6e60 0 3ffe fbbccdc a571fa1 f5a588b fb0d806 1 3ffe 79587db 4889548 f09ae4b cd0150c 0 3ffe 59fae16 17bcabb 6408ba4 7b2a573 0 3ff8 cde38fc e952ad5 8b526c2 780c2e5 S = 0 4006 1f3ad79 d0b9b08 18b1444 030e4ef 0 0000 0000000 0000000 0000000 0000000 0 0000 0000000 0000000 0000000 0000000 0 0000 0000000 0000000 0000000 0000000 0 4004 a720ef6 28c6ec0 87f4c54 82dda2a 0 0000 0000000 0000000 0000000 0000000 0 0000 0000000 0000000 0000000 0000000 0 0000 0000000 0000000 0000000 0000000 0 4002 8061b9a 0e96d8c c2ef745 9ea4c9a V = 1 3ffb db3df03 9b5e1b3 5bf4478 0e42b0d 1 3ffe b540007 4d4bc9e dc9461a 0de0481 1 3ffe 03aaff4 d9cea2c e8ee2bc 2eba908 0 3ffe fa73e09 9ef8810 a03d2eb 46ade00 1 3ffa b316e2f fe9d3ae dfa9988 fbca927 1 3ffc 184af51 f25fece 97bc0da 5ff13a2 1 3ffb 706955f a877cbb b63f6dd 4e2150e 0 3ffe 08fc1eb 7b86ef7 4af3c6c 732aae9 1 3ffe b3aaead ef356e2 7cd2937 94b22a7

Reconstruct A from its quadruple precision SVD. It's not too shabby.

USVT = U*S*V' USVT = 0 4001 fffffff fffffff fffffff fffffce 0 4001 7ffffff fffffff fffffff fffffc7 0 4004 b000000 0000000 0000000 000000a 0 4005 37fffff fffffff fffffff ffffff1 0 4005 77fffff fffffff fffffff ffffff6 0 4002 9ffffff fffffff fffffff fffffd2 0 4004 c7fffff fffffff fffffff ffffff1 0 4004 77fffff fffffff fffffff ffffff4 0 3fff fffffff fffffff fffffff ffffe83 Rosser matrix

An interesting example is provided by a classic test matrix, the 8-by-8 Rosser matrix. Let's compare quadruple precision computation with the exact rational computation provided by the Symbolic Math Toolbox.

First, generate quad and sym versions of rosser.

R = fp128(rosser); S = sym(rosser) S = [ 611, 196, -192, 407, -8, -52, -49, 29] [ 196, 899, 113, -192, -71, -43, -8, -44] [ -192, 113, 899, 196, 61, 49, 8, 52] [ 407, -192, 196, 611, 8, 44, 59, -23] [ -8, -71, 61, 8, 411, -599, 208, 208] [ -52, -43, 49, 44, -599, 411, 208, 208] [ -49, -8, 8, 59, 208, 208, 99, -911] [ 29, -44, 52, -23, 208, 208, -911, 99]

R is symmetric, but not positive definite, so its LU factorization requires pivoting.

[L,U,p] = lutx(R); format short p p = 1 2 3 7 6 8 4 5

R is singular, so with exact computation U(n,n) would be zero. With quadruple precision, the diagonal of U is

format long e diag(U) ans = 611.0 836.126022913256955810147299509001582 802.209942588471107300640276546225738 99.0115741407236314604636000423592687 -710.481057851148425133280246646085002 579.272484693223512196223933017062413 -1.2455924519190846395771824210276321 0.000000000000000000000000000000215716190833522835766351129431653015

The relative size of the last diagonal element is zero to almost 34 digits.

double(U(8,8)/U(1,1)) ans = 3.530543221497919e-34

Compare this with symbolic computation, which, in this case, can compute an LU decomposition with exact rational arithmetic and no pivoting.

[L,U] = lu(S); diag(U) ans = 611 510873/611 409827400/510873 50479800/2049137 3120997/10302 -1702299620/3120997 255000/40901 0

As expected, with symbolic computation U(8,8) is exactly zero.

r = svd(R) r = 1020.04901842999682384631379130551006 1020.04901842999682384631379130550858 1019.99999999999999999999999999999941 1019.90195135927848300282241090227735 999.999999999999999999999999999999014 999.999999999999999999999999999998817 0.0980486407215169971775890977220345302 0.0000000000000000000000000000000832757192990287779822645036097560521

The Rosser matrix is atypical because its characteristic polynomial factors over the rationals. So, even though it is of degree 8, the singular values are the roots of quadratic factors.

s = svd(S) s = 10*10405^(1/2) 10*10405^(1/2) 1020 10*(1020*26^(1/2) + 5201)^(1/2) 1000 1000 10*(5201 - 1020*26^(1/2))^(1/2) 0

The relative error of the quadruple precision calculation.

double(norm(r - s)/norm(s)) ans = 9.293610246879066e-34

Postscript

Finally, verify that we've been working all this time with fp128 and sym objects.

whos Name Size Bytes Class Attributes A 3x3 3531 fp128 AX 3x3 3531 fp128 I 3x3 3531 fp128 L 8x8 8 sym M 4x4 6128 fp128 Me 4x1 1676 fp128 R 8x8 23936 fp128 RD 3x3 72 double S 8x8 8 sym U 8x8 8 sym USVT 3x3 3531 fp128 V 3x3 3531 fp128 X 3x3 3531 fp128 Z 3x3 3531 fp128 ans 1x1 8 double e 4x1 1676 fp128 kappa 1x1 563 fp128 p 8x1 64 double r 8x1 3160 fp128 s 8x1 8 sym \n'); d.write(code_string); // Add copyright line at the bottom if specified. if (copyright.length > 0) { d.writeln(''); d.writeln('%%'); if (copyright.length > 0) { d.writeln('% _' + copyright + '_'); } } d.write('\n'); d.title = title + ' (MATLAB code)'; d.close(); } -->

Get the MATLAB code (requires JavaScript)

Published with MATLAB® R2017a

> %% Background % The IEEE 754 standard, published in 1985, defines formats for % floating point numbers that occupy 32 or 64 bits of storage. % These formats are known as _binary32_ and _binary64_, or more % frequently as _single_ and _double precision_. For many years % MATLAB used only double precision and it remains our default % format. Single precision has been added gradually % over the last several years and is now also fully supported. %% % A revision of IEEE 754, published in 2008, defines two more floating % point formats. One, _binary16_ or _half precision_, occupies only % 16 bits and was the subject of my % . It is primarily intended to reduce storage and % memory bandwidth requirements. Since it provides only "half" precision, % its use for actual computation is problematic. %% % The other new format introduced in IEEE 754-2008 is _binary128_ or % _quadruple precision_. It is intended for situations where the accuracy % or range of double precision is inadequate. %% % I see two descriptions of quadruple precision % software implementations on the Web. % % * % % * %% % I have not used either package, but judging by their Web pages, they % both appear to be complete and well supported. %% % The MATLAB Symbolic Math Toolbox provides |vpa|, arbitrary precision % decimal floating point arithmetic, and |sym|, exact rational arithmetic. % Both provide accuracy and range well beyond quadruple precision, but do % not specifically support the 128-bit IEEE format. %% % My goal here is to describe a prototype of a MATLAB object, |fp128|, % that implements quadruple precision with code written entirely in the % MATLAB language. It is not very efficient, but is does allow % experimentation with the 128-bit format. %% Beyond double % There are other floating point formats beyond double precision. % _Long double_ usually refers to the 80-bit extended precision % floating point registers available with the Intel x86 architecture % and described as _double extended_ in IEEE 754. This provides % the same exponent range as quadruple precision, but much less % accuracy. %% % _Double double_ refers to the use of a pair of double precision % values. The exponent field and sign bit of the second double are % ignored, so this is effectively a 116-bit format. Both the exponent % range and the precision are more than double but less than quadruple. %% Floating point anatomy % The format of a floating point number is characterized by two % parameters, |p|, the number of bits in the fraction and |q|, the number % of bits in the exponent. I will compare four precisions, % _half_, _single_, _double_, and _quadruple_. % The four pairs of characterizing parameters are p = [10, 23, 52 112]; %% q = [5, 8, 11, 15]; %% % With these values of |p| and |q|, and with one more bit for the sign, % the total number of bits in the word, |w|, is a power of two. format shortg w = p + q + 1 %% % *Normalized numbers* % % Most floating point numbers are _normalized_, and are expressed as % % $$x = \pm (1+f)2^e$$ %% % The _fraction_ $f$ is in the half open interval % % $$0 \leq f < 1$$ % %% % The binary representation of $f$ requires at most |p| bits. % In other words $2^p f$ is an integer in the range % % $$0 \leq 2^p f < 2^p$$ %% % % The _exponent_ $e$ is an integer in the range % % $$-b+1 \leq e \leq b$$ % %% % The quantity $b$ is both the largest exponent and the |bias|. % % $$b = 2^{q-1} - 1$$ b = 2.^(q-1)-1 %% % The fractional part of a normalized number is $1+f$, but only $f$ % needs to be stored. That leading $1$ is known as the _hidden bit_. %% % *Subnormal* % % There are two values of the exponent $e$ for which the biased exponent, % $e+b$, reaches the smallest and largest values possible to represent % in |q| bits. The smallest is % % $$e + b = 0$$ % % The corresponding floating point numbers do not have a hidden leading % bit. These are the _subnormal_ or _denormal_ numbers. % % $$x = \pm f 2^{-b}$$ % %% % *Infinity and Not-A-Number* % % The largest possible biased exponent is % % $$e + b = 2^q-1$$. % % Quantities with this exponent field represent _infinities_ and % _NaN_, or _Not-A-Number_. %% % The percentage of floating point numbers that are exceptional % because they are subnormal, infinity or NaN increases as the % precision decreases. Exceptional exponents are only $2$ values % out of $2^q$. For quadruple precision this is $2/2^{15}$, which is % less than a one one-thousandth of one percent. %% % Encode the sign bit with |s = 0| for nonnegative and |s = 1| for % negative. And encode the exponent with an offsetting bias, |b|. % Then a floating point number can be packed in |w| bits with % % x = [s e+b 2^p*f] %% Precision and range % *epsilon* % % If a real number cannot be expressed with a binary expansion requiring % at most |p| bits, it must be approximated by a floating point number % that does have such a binary representation. This is _roundoff error_. % The important quantity characterizing precision is _machine epsilon_, % or |eps|. In MATLAB, |eps(x)| is the distance from |x| to the % next larger (in absolute value) floating point number (of that class). % With no argument, |eps| is simply the difference between |1| and the % next larger floating point number. format shortg eps = 2.^(-p) %% % This tells us that quadruple precision is good for about 34 decimal % digits of accuracy, double for about 16 decimal digits, single for % about 7, and half for about 3. %% % *realmax* % % If a real number, or the result of an arithmetic operation, is too % large to be represented, it _overflows_ and is replaced by _infinity_. % The largest floating point number that does not overflow is |realmax|. % When I try to compute quadruple |realmax| with double precision, % it overflows. I will fix this up in the table to follow. realmax = 2.^b.*(2-eps) %% % *realmin* % % _Underflow_ and representation of very small numbers is more complicated. % The smallest normalized floating point number is |realmin|. When I % try to compute quadruple |realmin| it underflows to zero. Again, % I will fix this up in the table. realmin = 2.^(-b+1) %% % *tiny* % % But there are numbers smaller than |realmin|. IEEE 754 introduced the % notion of _gradual underflow_ and _denormal_ numbers. In the 2008 % revised standard their name was changed to _subnormal_. %% % Think of roundoff in numbers near underflow. Before 754, floating % point numbers had the disconcerting property that |x| and |y| could % be unequal, but their difference could underflow, so |x-y| becomes |0|. % With 754 the gap between |0| and |realmin| is filled with numbers whose % spacing is the same as the spacing between |realmin| and |2*realmin|. % I like to call this spacing, and the smallest subnormal number, |tiny|. tiny = realmin.*eps %% Floating point integers % *flintmax* % % It is possible to do integer arithmetic with floating point numbers. % I like to call such numbers _flints_. When we write the numbers $3$ % and $3.0$, they are different descriptions of the same integer, but % we think of one as fixed point and the other as floating point. % The largest flint is |flintmax|. flintmax = 2./eps %% % Technically all the floating point numbers larger than |flintmax| % are integers, but the spacing between them is larger than one, so it % is not safe to use them for integer arithmetic. Only integer-valued % floating point numbers between |0| and |flintmax| are allowed to be % called flints. %% Table % Let's collect all these anatomical characteristics together in a % new MATLAB |table|. I have now edited the output and inserted % the correct quadruple precision values. T = [w; p; q; b; eps; realmax; realmin; tiny; flintmax]; T = table(T(:,1), T(:,2), T(:,3), T(:,4), ... 'variablenames',{'half','single','double','quadruple'}, ... 'rownames',{'w','p','q','b','eps','realmax','realmin', ... 'tiny','flintmax'}); type Table.txt %% fp128 % I am currently working on code for an object, [email protected]|, % that could provide a full implementation of quadruple-precision % arithmetic. The methods available so far are methods(fp128) %% % The code that I have for quadrule precision is much more complex than % the code that I have for % . There I am able to "cheat" by converting half % precision numbers to doubles and relying on traditional MATLAB % arithmetic. I can't do that for quads. %% % The storage scheme for quads is described in the help entry for % the constructor. help @fp128/fp128 %% % Breaking the 112-bit fraction into four 28-bit pieces makes it possible % to do arithmetic operations on the pieces without worrying about integer % overflow. The core of the |times| code, which implements |x.*y|, % is the convolution of the two fractional parts. dbtype 45:53 @fp128/times %% % The result of the convolution, |zf|, is a |uint64| vector of length % nine with 52-bit elements. It must be renormalized to the fit the % |fp128| storage scheme. %% % Addition and subtraction involve addition and subtraction of the % fractional parts after they have been shifted so that the % corresponding exponents are equal. Again, this produces temporary % vectors that must be renormalized. %% % Scalar division, |y/x|, is done by first computing the reciprocal % of the denominator, |1/x|, and then doing one final multiplication, % |1/x * y|. The reciprocal is computed by a few steps of Newton % iteration, starting with a scaled reciprocal, |1/double(x)|. %% Examples % The output for each example shows the three fields in hexadecimal REPLACE_WITH_DASH_DASH % one sign field, one biased exponent field, and one fraction field % that is a vector with four entries displayed with seven hex digits. % This is followed by a 36 significant digit decimal representation. %% % *One* clear format long one = fp128(1) %% % *eps* eps = eps(one) %% % *1 + eps* one_plus_eps = one + eps %% % *2 - eps* two_minus_eps = 2 - eps %% % *realmin* rmin = realmin(one) %% % *realmax* rmax = realmax(one) %% % *Compute 1/10 with double, then convert to quadruple.* dble_tenth = fp128(1/10) %% % *Compute 1/10 with quadruple.* quad_tenth = 1/fp128(10) %% % *Double precision |pi| converted to quadruple.* dble_pi = fp128(pi) %% % *|pi| accurate to quadruple precision.* quad_pi = fp128(sym('pi')) %% Matrix operations % The 4-by-4 magic square from Durer's Melancholia II provides my first % matrix example. clear M = fp128(magic(4)); %% % Let's see how the 128-bit elements look in hex. format hex M %% % Check that the row sums are all equal. This matrix-vector multiply % can be done exactly with the flints in the magic square. e = fp128(ones(4,1)) Me = M*e %% Quadruple precision backslash % I've overloaded |mldivide|, so I can solve linear systems and % compute inverses. The actual computation is done by |lutx|, % a "textbook" function that I wrote years ago, long before this % quadruple-precision project, followed by the requisite solution % of triangular systems. But now the MATLAB object system insures % that every individual arithmetic operation is done with IEEE 754 % quadruple precision. %% % Let's generate a 3-by-3 matrix with random two-digit integer entries. A = fp128(randi(100,3,3)) %% % I am going to use |fp128| backslash to invert |A|. % So I need the identity matrix in quadruple precision. I = fp128(eye(size(A))); %% % Now the overloaded backslash calls |lutx|, and then solves two % triangular systems to produce the inverse. X = A\I %% % Compute the residual. AX = A*X R = I - AX; format short RD = double(R) %% % Both |AX| and |R| are what I expect from arithmetic that is % accurate to about 34 decimal digits. %% % Although I get a different random |A| every time I publish this blog % post, I expect that it has a modest condition number. kappa = cond(A) %% % Since |A| is not badly conditioned, I can invert the computed inverse % and expect to get close to the original integer matrix. % The elements of the resulting |Z| are integers, possibly bruised with % quadruple precision fuzz. format hex Z = X\I %% Quadruple precision SVD % I have just nonchalantly computed |cond(A)|. Here is the code for % the overloaded |cond|. type @fp128/cond.m %% % So it is correctly using the singular value decomposition. I also % have |svd| overloaded. The SVD computation is handled by a 433 line % M-file, |svdtx|, that, like |lutx|, was written before |fp128| existed. % It was necessary to modify five lines in |svdtx|. The line % % u = zeros(n,ncu); % % had to be changed to % % u = fp128(zeros(n,ncu)); % % Similarly for |v|, |s|, |e| and |work|. I should point out that the % preallocation of the arrays is inherited from the LINPACK Fortran % subroutine DSVDC. Without it, |svdtx| would not have required any % modification to work correctly in quadruple precision. %% % Let's compute the full SVD. [U,S,V] = svd(A) %% % Reconstruct |A| from its quadruple precision SVD. It's not too shabby. USVT = U*S*V' %% Rosser matrix % An interesting example is provided by a classic test matrix, the 8-by-8 % . Let's compare quadruple precision computation with % the exact rational computation provided by the Symbolic Math Toolbox. %% % First, generate quad and sym versions of |rosser|. R = fp128(rosser); S = sym(rosser) %% % |R| is symmetric, but not positive definite, so its LU factorization % requires pivoting. [L,U,p] = lutx(R); format short p %% % |R| is singular, so with exact computation |U(n,n)| would be zero. % With quadruple precision, the diagonal of |U| is format long e diag(U) %% % The relative size of the last diagonal element is zero to almost % 34 digits. double(U(8,8)/U(1,1)) %% % Compare this with symbolic computation, which, in this case, can % compute an LU decomposition with exact rational arithmetic and no % pivoting. [L,U] = lu(S); diag(U) %% % As expected, with symbolic computation |U(8,8)| is exactly zero. %% % How about SVD? r = svd(R) %% % The Rosser matrix is atypical because its % characteristic polynomial factors over the rationals. So, even though % it is of degree 8, the singular values are the roots of % quadratic factors. s = svd(S) %% % The relative error of the quadruple precision calculation. double(norm(r - s)/norm(s)) %% % About 33 digits. %% Postscript % Finally, verify that we've been working all this time % with |fp128| and |sym| objects. whos ##### SOURCE END ##### d47b6ba7b4b7413ea76f76763559fea7 -->

### “Half Precision” 16-bit Floating Point Arithmetic

Tue, 05/09/2017 - 02:30

The floating point arithmetic format that requires only 16 bits of storage is becoming increasingly popular. Also known as half precision or binary16, the format is useful when memory is a scarce resource.

ContentsBackground

The IEEE 754 standard, published in 1985, defines formats for floating point numbers that occupy 32 or 64 bits of storage. These formats are known as binary32 and binary64, or more frequently as single and double precision. For many years MATLAB used only double precision and it remains our default format. Single precision has been added gradually over the last several years and is now also fully supported.

A revision of IEEE 754, published in 2008, defines a floating point format that occupies only 16 bits. Known as binary16, it is primarily intended to reduce storage and memory bandwidth requirements. Since it provides only "half" precision, its use for actual computation is problematic. An interesting discussion of its utility as an image processing format with increased dynamic range is provided by Industrial Light and Magic. Hardware support for half precision is now available on many processors, including the GPU in the Apple iPhone 7. Here is a link to an extensive article about half precision on the NVIDIA GeForce GPU.

Floating point anatomy

The format of a floating point number is characterized by two parameters, p, the number of bits in the fraction and q, the number of bits in the exponent. I will consider four precisions, quarter, half, single, and double. The quarter-precision format is something that I just invented for this blog post; it is not standard and actually not very useful.

The four pairs of characterizing parameters are

p = [4, 10, 23, 52]; q = [3, 5, 8, 11];

With these values of p and q, and with one more bit for the sign, the total number of bits in the word, w, is a power of two.

w = p + q + 1 w = 8 16 32 64

Normalized numbers

Most floating point numbers are normalized, and are expressed as

$$x = \pm (1+f)2^e$$

The fraction $f$ is in the half open interval

$$0 \leq f < 1$$

The binary representation of $f$ requires at most p bits. In other words $2^p f$ is an integer in the range

$$0 \leq 2^p f < 2^p$$

The exponent $e$ is an integer in the range

$$-b+1 \leq e \leq b$$

The quantity $b$ is both the largest exponent and the bias.

$$b = 2^{q-1} - 1$$

b = 2.^(q-1)-1 b = 3 15 127 1023

The fractional part of a normalized number is $1+f$, but only $f$ needs to be stored. That leading $1$ is known as the hidden bit.

Subnormal

There are two values of the exponent $e$ for which the biased exponent, $e+b$, reaches the smallest and largest values possible to represent in q bits. The smallest is

$$e + b = 0$$

The corresponding floating point numbers do not have a hidden leading bit. These are the subnormal or denormal numbers.

$$x = \pm f 2^{-b}$$

Infinity and Not-A-Number

The largest possible biased exponent is

$$e + b = 2^q-1$$.

Quantities with this exponent field represent infinities and NaN, or Not-A-Number.

The percentage of floating point numbers that are exceptional because they are subnormal, infinity or NaN increases as the precision decreases. Exceptional exponents are only $2$ values out of $2^q$. For double precision this is $2/2^{11}$, which is less than a tenth of a percent, but for half precision it is $2/2^5$, which is more than 6 percent. And fully one-fourth of all my toy quarter precision floating point numbers are exceptional.

Encode the sign bit with s = 0 for nonnegative and s = 1 for negative. And encode the exponent with an offsetting bias, b. Then a floating point number can be packed in w bits with

x = [s e+b 2^p*f]Precision and range

epsilon

If a real number cannot be expressed with a binary expansion requiring at most p bits, it must be approximated by a floating point number that does have such a binary representation. This is roundoff error. The important quantity characterizing precision is machine epsilon, or eps. In MATLAB, eps(x) is the distance from x to the next larger (in absolute value) floating point number. With no argument, eps is simply the difference between 1 and the next larger floating point number.

format shortg eps = 2.^(-p) eps = 0.0625 0.00097656 1.1921e-07 2.2204e-16

This tells us that double precision is good for about 16 decimal digits of accuracy, single for about 7 decimal digits, half for about 3, and quarter for barely more than one.

realmax

If a real number, or the result of an arithmetic operation, is too large to be represented, it overflows and is replaced by the largest floating point number. This is

realmax = 2.^b.*(2-eps) realmax = 15.5 65504 3.4028e+38 1.7977e+308

realmin

Underflow and representation of very small numbers is more complicated. The smallest normalized floating point number is

realmin = 2.^(-b+1) realmin = 0.25 6.1035e-05 1.1755e-38 2.2251e-308

tiny

But there are numbers smaller than realmin. IEEE 754 introduced the notion of gradual underflow and denormal numbers. In the 2008 revised standard their name was changed to subnormal.

Think of roundoff in numbers near underflow. Before 754 floating point numbers had the disconcerting property that x and y could be unequal, but their difference could underflow so x-y becomes 0. With 754 the gap between 0 and realmin is filled with numbers whose spacing is the same as the spacing between realmin and 2*realmin. I like to call this spacing, and the smallest subnormal number, tiny.

tiny = realmin.*eps tiny = 0.015625 5.9605e-08 1.4013e-45 4.9407e-324 Floating point integers

flintmax

It is possible to do integer arithmetic with floating point numbers. I like to call such numbers flints. When we write the numbers $3$ and $3.0$, they are different descriptions of the same integer, but we think of one as fixed point and the other as floating point. The largest flint is flintmax.

flintmax = 2./eps flintmax = 32 2048 1.6777e+07 9.0072e+15

Technically all the floating point numbers larger than flintmax are integers, but the spacing between them is larger than one, so it is not safe to use them for integer arithmetic. Only integer-valued floating point numbers between 0 and flintmax are allowed to be called flints.

Table

Let's collect all these anatomical characteristics together in a new MATLAB table.

T = [w; p; q; b; eps; realmax; realmin; tiny; flintmax]; T = table(T(:,1), T(:,2), T(:,3), T(:,4), ... 'variablenames',{'quarter','half','single','double'}, ... 'rownames',{'w','p','q','b','eps','realmax','realmin', ... 'tiny','flintmax'}); disp(T) quarter half single double ________ __________ __________ ___________ w 8 16 32 64 p 4 10 23 52 q 3 5 8 11 b 3 15 127 1023 eps 0.0625 0.00097656 1.1921e-07 2.2204e-16 realmax 15.5 65504 3.4028e+38 1.7977e+308 realmin 0.25 6.1035e-05 1.1755e-38 2.2251e-308 tiny 0.015625 5.9605e-08 1.4013e-45 4.9407e-324 flintmax 32 2048 1.6777e+07 9.0072e+15 fp8 and fp16

Version 3.1 of Cleve's Laboratory includes code for objects @fp8 and @fp16 that begin to provide full implementations of quarter-precision and half-precision arithmetic.

The methods currently provided are

methods(fp16) Methods for class fp16: abs eps isfinite mrdivide rem subsref binary eq le mtimes round svd ctranspose fix lt ne sign tril diag fp16 lu norm single triu disp ge max plus size uminus display gt minus realmax subsasgn double hex mldivide realmin subsindex

These provide only partial implementations because the arithmetic is not done on the compact forms. We cheat. For each individual scalar operation, the operands are unpacked from their short storage into old fashioned doubles. The operation is then carried out by existing double precision code and the results returned to the shorter formats. This simulates the reduced precision and restricted range, but requires relatively little new code.

All of the work is done in the constructors @fp8/fp8.m and @fp16/fp16.m and what we might call the "deconstructors" @fp8/double.m and @fp16/double.m. The constructors convert ordinary floating point numbers to reduced precision representations by packing as many of the 32 or 64 bits as will fit into 8 or 16 bit words. The deconstructors do the reverse by unpacking things.

Once these methods are available, almost everything else is trivial. The code for most operations is like this one for the overloaded addition.

type @fp16/plus.m function z = plus(x,y) z = fp16(double(x) + double(y)); end Wikipedia test suite

0 01111 0000000000 = 1 0 00101 0000000000 = 2^-10 = eps 0 01111 0000000001 = 1+eps = 1.0009765625 (next smallest float after 1) 1 10000 0000000000 = -2 0 11110 1111111111 = 65504 (max half precision) = 2^15*(2-eps) 0 00001 0000000000 = 2^-14 = r ~ 6.10352e-5 (minimum positive normal) 0 00000 1111111111 = r*(1-eps) ~ 6.09756e-5 (maximum subnormal) 0 00000 0000000001 = r*eps ~ 5.96046e-8 (minimum positive subnormal) 0 00000 0000000000 ~ r*eps/2 (underflow to zero) 0 00000 0000000000 = 0 1 00000 0000000000 = -0 0 11111 0000000000 = infinity 1 11111 0000000000 = -infinity 0 11111 1111111111 = NaN 0 01101 0101010101 = 0.333251953125 ~ 1/3

This provides my test suite for checking fp16 operations on scalars.

clear zero = fp16(0); one = fp16(1); eps = eps(one); r = realmin(one); tests = {'1','eps','1+eps','-2','2/r*(2-eps)', ... 'r','r*(1-eps)','r*eps','r*eps/2', ... 'zero','-zero','1/zero','-1/zero','zero/zero','1/3'};

Let's run the tests.

for t = tests(:)' x = eval(t{:}); y = fp16(x); z = binary(y); w = double(y); fprintf(' %18s %04s %19.10g %19.10g %s\n', ... z,hex(y),double(x),w,t{:}) end 0 01111 0000000000 3C00 1 1 1 0 00101 0000000000 1400 0.0009765625 0.0009765625 eps 0 01111 0000000001 3C01 1.000976563 1.000976563 1+eps 1 10000 0000000000 C000 -2 -2 -2 0 11110 1111111111 7BFF 65504 65504 2/r*(2-eps) 0 00001 0000000000 0400 6.103515625e-05 6.103515625e-05 r 0 00000 1111111111 03FF 6.097555161e-05 6.097555161e-05 r*(1-eps) 0 00000 0000000001 0001 5.960464478e-08 5.960464478e-08 r*eps 0 00000 0000000001 0001 5.960464478e-08 5.960464478e-08 r*eps/2 0 00000 0000000000 0000 0 0 zero 0 00000 0000000000 0000 0 0 -zero 0 11111 0000000000 7C00 Inf Inf 1/zero 1 11111 0000000000 FC00 -Inf -Inf -1/zero 1 11111 1111111111 FFFF NaN NaN zero/zero 0 01101 0101010101 3555 0.3333333333 0.3332519531 1/3 Matrix operations

Most of the methods in @fp8 and @fp16 handle matrices. The 4-by-4 magic square from Durer's Melancholia II provides my first example.

clear format short M = fp16(magic(4)) M = 16 2 3 13 5 11 10 8 9 7 6 12 4 14 15 1

Let's see how the packed 16-bit elements look in binary.

B = binary(M) B = 4×4 string array Columns 1 through 3 "0 10011 0000000000" "0 10000 0000000000" "0 10000 1000000000" "0 10001 0100000000" "0 10010 0110000000" "0 10010 0100000000" "0 10010 0010000000" "0 10001 1100000000" "0 10001 1000000000" "0 10001 0000000000" "0 10010 1100000000" "0 10010 1110000000" Column 4 "0 10010 1010000000" "0 10010 0000000000" "0 10010 1000000000" "0 01111 0000000000"

Check that the row sums are all equal. This matrix-vector multiply can be done exactly with the flints in the magic square.

e = fp16(ones(4,1)) Me = M*e e = 1 1 1 1 Me = 34 34 34 34 fp16 backslash

I've overloaded mldivide, so I can solve linear systems and compute inverses. The actual computation is done by lutx, a "textbook" function that I wrote years ago, long before this half-precision project. But now the MATLAB object system insures that every individual arithmetic operation is done on unpacked fp16 numbers.

Let's generate a 5-by-5 matrix with random two-digit integer entries.

A = fp16(randi(100,5,5)) A = 82 10 16 15 66 91 28 98 43 4 13 55 96 92 85 92 96 49 80 94 64 97 81 96 68

I am going to use fp16 backslash to invert A. So I need the identity matrix in half precision.

I = fp16(eye(5)) I = 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1

Now the overloaded backslash calls lutx to compute the inverse.

X = A\I X = 0.0374 -0.0113 -0.0213 -0.0521 0.0631 -0.1044 0.0450 0.0364 0.1674 -0.1782 -0.0791 0.0458 0.0428 0.1270 -0.1550 0.1725 -0.0880 -0.0789 -0.2952 0.3445 -0.0355 0.0161 0.0285 0.0757 -0.0921

Compute the residual.

AX = A*X R = I - AX AX = 1.0000 -0.0001 0.0004 -0.0017 0.0009 0.0008 0.9995 0.0005 -0.0052 0.0043 0.0004 -0.0017 1.0010 -0.0076 -0.0009 0.0005 -0.0023 0.0007 0.9927 0.0013 0.0007 -0.0024 0.0009 -0.0084 1.0020 R = 0 0.0001 -0.0004 0.0017 -0.0009 -0.0008 0.0005 -0.0005 0.0052 -0.0043 -0.0004 0.0017 -0.0010 0.0076 0.0009 -0.0005 0.0023 -0.0007 0.0073 -0.0013 -0.0007 0.0024 -0.0009 0.0084 -0.0020

Both AX and R are what I expect from arithmetic that is accurate to only about three decimal digits.

Although I get a different random A every time I publish this blog post, I expect that it has a modest condition number.

kappa = cond(A) kappa = 211.1246

Since A is not badly conditioned, I can invert the computed inverse and expect to get close to the original integer matrix.

Z = X\I Z = 82.0625 10.0625 15.9844 15.0234 66.0625 91.1875 28.0625 97.8750 42.9688 4.1094 13.8750 55.8125 96.5000 92.6875 85.6875 92.7500 96.5625 49.4063 80.5000 94.5000 64.8125 97.6875 81.4375 96.5625 68.5625 fp16 SVD

I have just nonchalantly computed cond(A). But cond isn't on the list of overload methods for fp16. I was pleasantly surprised to find that matlab\matfun\cond.m quietly worked on this new datatype. Here is the core of that code.

dbtype cond 34:43, dbtype cond 47 34 if p == 2 35 s = svd(A); 36 if any(s == 0) % Handle singular matrix 37 c = Inf(class(A)); 38 else 39 c = max(s)./min(s); 40 if isempty(c) 41 c = zeros(class(A)); 42 end 43 end 47 end

So it is correctly using the singular value decomposition, and I have svd overloaded. The SVD computation is handled by a 433 line M-file, svdtx, that, like lutx, was written before fp16 existed.

Let's compute the SVD again.

[U,S,V] = svd(A) U = -0.2491 -0.5625 0.4087 0.5747 0.3521 -0.3560 -0.5132 -0.7603 -0.0077 -0.1742 -0.4626 0.6050 -0.1619 0.6060 -0.1621 -0.5474 -0.1196 0.4761 -0.3303 -0.5918 -0.5449 0.1981 -0.0316 -0.4402 0.6846 S = 333.5000 0 0 0 0 0 94.1875 0 0 0 0 0 84.0000 0 0 0 0 0 48.2500 0 0 0 0 0 1.5801 V = -0.4319 -0.8838 0.0479 -0.0894 0.1471 -0.4299 0.2233 0.1981 -0.7349 -0.4314 -0.4626 0.0955 -0.7461 0.3069 -0.3550 -0.4729 0.3684 -0.0754 -0.0961 0.7910 -0.4370 0.1544 0.6294 0.5903 -0.2014

Reconstruct A from its half precision SVD. It's not too shabby.

USVT = U*S*V' USVT = 81.9375 10.0703 16.0625 14.9453 66.0000 90.9375 27.9688 97.9375 42.9688 4.0391 12.9688 54.9688 96.0000 91.9375 84.9375 92.0000 96.0000 48.9688 79.9375 94.0000 63.9375 96.9375 80.9375 95.9375 67.8750

Finally, verify that we've been working all this time with fp16 objects.

whos Name Size Bytes Class Attributes A 5x5 226 fp16 AX 5x5 226 fp16 B 4x4 1576 string I 5x5 226 fp16 M 4x4 208 fp16 Me 4x1 184 fp16 R 5x5 226 fp16 S 5x5 226 fp16 U 5x5 226 fp16 USVT 5x5 226 fp16 V 5x5 226 fp16 X 5x5 226 fp16 Z 5x5 226 fp16 e 4x1 184 fp16 kappa 1x1 8 double Calculator

I introduced a calculator in my blog post about Roman numerals. Version 3.1 of Cleve's Laboratory also includes a fancier version of the calculator that computes in four different precisions -- quarter, half, single, and double -- and displays the results in four different formats -- decimal, hexadecimal, binary, and Roman.

I like to demonstrate the calculator by clicking on the keys

1 0 0 0 / 8 1 =

because the decimal expansion is a repeating .123456790.

Thanks

Thanks to MathWorkers Ben Tordoff, Steve Eddins, and Kiran Kintali who provided background and pointers to work on half precision.

\n'); d.write(code_string); // Add copyright line at the bottom if specified. if (copyright.length > 0) { d.writeln(''); d.writeln('%%'); if (copyright.length > 0) { d.writeln('% _' + copyright + '_'); } } d.write('\n'); d.title = title + ' (MATLAB code)'; d.close(); } -->

Get the MATLAB code (requires JavaScript)

Published with MATLAB® R2017a

. % Hardware support for half precision is now available on many processors, % including the GPU in the % . % Here is a link to an extensive article about half precision on the % . %% Floating point anatomy % The format of a floating point number is characterized by two % parameters, |p|, the number of bits in the fraction and |q|, the number % of bits in the exponent. I will consider four precisions, % _quarter_, _half_, _single_, and _double_. The quarter-precision % format is something that I just invented for this blog post; % it is not standard and actually not very useful. %% % The four pairs of characterizing parameters are p = [4, 10, 23, 52]; %% q = [3, 5, 8, 11]; %% % With these values of |p| and |q|, and with one more bit for the sign, % the total number of bits in the word, |w|, is a power of two. w = p + q + 1 %% % *Normalized numbers* % % Most floating point numbers are _normalized_, and are expressed as % % $$x = \pm (1+f)2^e$$ %% % The _fraction_ $f$ is in the half open interval % % $$0 \leq f < 1$$ % %% % The binary representation of $f$ requires at most |p| bits. % In other words $2^p f$ is an integer in the range % % $$0 \leq 2^p f < 2^p$$ %% % % The _exponent_ $e$ is an integer in the range % % $$-b+1 \leq e \leq b$$ % %% % The quantity $b$ is both the largest exponent and the |bias|. % % $$b = 2^{q-1} - 1$$ b = 2.^(q-1)-1 %% % The fractional part of a normalized number is $1+f$, but only $f$ % needs to be stored. That leading $1$ is known as the _hidden bit_. %% % *Subnormal* % % There are two values of the exponent $e$ for which the biased exponent, % $e+b$, reaches the smallest and largest values possible to represent % in |q| bits. The smallest is % % $$e + b = 0$$ % % The corresponding floating point numbers do not have a hidden leading % bit. These are the _subnormal_ or _denormal_ numbers. % % $$x = \pm f 2^{-b}$$ % %% % *Infinity and Not-A-Number* % % The largest possible biased exponent is % % $$e + b = 2^q-1$$. % % Quantities with this exponent field represent _infinities_ and % _NaN_, or _Not-A-Number_. %% % The percentage of floating point numbers that are exceptional % because they are subnormal, infinity or NaN increases as the % precision decreases. Exceptional exponents are only $2$ values % out of $2^q$. For double precision this is $2/2^{11}$, which is % less than a tenth of a percent, but for half precision it is $2/2^5$, % which is more than 6 percent. And fully one-fourth of all my toy % quarter precision floating point numbers are exceptional. %% % Encode the sign bit with |s = 0| for nonnegative and |s = 1| for % negative. And encode the exponent with an offsetting bias, |b|. % Then a floating point number can be packed in |w| bits with % % x = [s e+b 2^p*f] %% Precision and range % *epsilon* % % If a real number cannot be expressed with a binary expansion requiring % at most |p| bits, it must be approximated by a floating point number % that does have such a binary representation. This is _roundoff error_. % The important quantity characterizing precision is _machine epsilon_, % or |eps|. In MATLAB, |eps(x)| is the distance from |x| to the % next larger (in absolute value) floating point number. With no % argument, |eps| is simply the difference between |1| and the next larger % floating point number. format shortg eps = 2.^(-p) %% % This tells us that double precision is good for about 16 decimal digits % of accuracy, single for about 7 decimal digits, half for about 3, and % quarter for barely more than one. %% % *realmax* % % If a real number, or the result of an arithmetic operation, is too % large to be represented, it _overflows_ and is replaced by the % largest floating point number. This is realmax = 2.^b.*(2-eps) %% % *realmin* % % _Underflow_ and representation of very small numbers is more complicated. % The smallest normalized floating point number is realmin = 2.^(-b+1) %% % *tiny* % % But there are numbers smaller than |realmin|. IEEE 754 introduced the % notion of _gradual underflow_ and _denormal_ numbers. In the 2008 % revised standard their name was changed to _subnormal_. %% % Think of roundoff in numbers near underflow. Before 754 floating % point numbers had the disconcerting property that |x| and |y| could % be unequal, but their difference could underflow so |x-y| becomes |0|. % With 754 the gap between |0| and |realmin| is filled with numbers whose % spacing is the same as the spacing between |realmin| and |2*realmin|. % I like to call this spacing, and the smallest subnormal number, |tiny|. tiny = realmin.*eps %% Floating point integers % *flintmax* % % It is possible to do integer arithmetic with floating point numbers. % I like to call such numbers _flints_. When we write the numbers $3$ % and $3.0$, they are different descriptions of the same integer, but % we think of one as fixed point and the other as floating point. % The largest flint is |flintmax|. flintmax = 2./eps %% % Technically all the floating point numbers larger than |flintmax| % are integers, but the spacing between them is larger than one, so it % is not safe to use them for integer arithmetic. Only integer-valued % floating point numbers between |0| and |flintmax| are allowed to be % called flints. %% Table % Let's collect all these anatomical characteristics together in a % new MATLAB |table|. T = [w; p; q; b; eps; realmax; realmin; tiny; flintmax]; T = table(T(:,1), T(:,2), T(:,3), T(:,4), ... 'variablenames',{'quarter','half','single','double'}, ... 'rownames',{'w','p','q','b','eps','realmax','realmin', ... 'tiny','flintmax'}); disp(T) %% fp8 and fp16 % Version 3.1 of % includes code for objects [email protected]| and [email protected]| % that begin to provide full implementations of quarter-precision and % half-precision arithmetic. %% % The methods currently provided are methods(fp16) %% % These provide only partial implementations because the arithmetic is not % done on the compact forms. We cheat. For each individual scalar % operation, the operands are unpacked from their short storage into % old fashioned doubles. The operation is then carried out by existing % double precision code and the results returned to the shorter formats. % This simulates the reduced precision and restricted range, but requires % relatively little new code. %% % All of the work is done in the constructors [email protected]/fp8.m| and % [email protected]/fp16.m| and what we might call the "deconstructors" % [email protected]/double.m| and [email protected]/double.m|. % The constructors convert ordinary floating point % numbers to reduced precision representations by packing as many of % the 32 or 64 bits as will fit into 8 or 16 bit words. The deconstructors % do the reverse by unpacking things. %% % Once these methods are available, almost everything else is trivial. % The code for most operations is like this one for the overloaded % addition. type @fp16/plus.m %% Wikipedia test suite % The about half-precision includes several 16-bit examples % with the sign, exponent, and fraction fields separated. % I've added a couple more. % % 0 01111 0000000000 = 1 % 0 00101 0000000000 = 2^-10 = eps % 0 01111 0000000001 = 1+eps = 1.0009765625 (next smallest float after 1) % 1 10000 0000000000 = -2 % 0 11110 1111111111 = 65504 (max half precision) = 2^15*(2-eps) % 0 00001 0000000000 = 2^-14 = r ~ 6.10352e-5 (minimum positive normal) % 0 00000 1111111111 = r*(1-eps) ~ 6.09756e-5 (maximum subnormal) % 0 00000 0000000001 = r*eps ~ 5.96046e-8 (minimum positive subnormal) % 0 00000 0000000000 ~ r*eps/2 (underflow to zero) % 0 00000 0000000000 = 0 % 1 00000 0000000000 = -0 % 0 11111 0000000000 = infinity % 1 11111 0000000000 = -infinity % 0 11111 1111111111 = NaN % 0 01101 0101010101 = 0.333251953125 ~ 1/3 %% % This provides my test suite for checking |fp16| operations on scalars. clear zero = fp16(0); one = fp16(1); eps = eps(one); r = realmin(one); tests = {'1','eps','1+eps','-2','2/r*(2-eps)', ... 'r','r*(1-eps)','r*eps','r*eps/2', ... 'zero','-zero','1/zero','-1/zero','zero/zero','1/3'}; %% % Let's run the tests. for t = tests(:)' x = eval(t{:}); y = fp16(x); z = binary(y); w = double(y); fprintf(' %18s %04s %19.10g %19.10g %s\n', ... z,hex(y),double(x),w,t{:}) end %% Matrix operations % Most of the methods in [email protected]| and [email protected]| handle matrices. % The 4-by-4 magic square from Durer's Melancholia II provides my first % example. clear format short M = fp16(magic(4)) %% % Let's see how the packed 16-bit elements look in binary. B = binary(M) %% % Check that the row sums are all equal. This matrix-vector multiply % can be done exactly with the flints in the magic square. e = fp16(ones(4,1)) Me = M*e %% fp16 backslash % I've overloaded |mldivide|, so I can solve linear systems and % compute inverses. The actual computation is done by |lutx|, % a "textbook" function that I wrote years ago, long before this % half-precision project. But now the MATLAB object system insures % that every individual arithmetic operation is done on unpacked % |fp16| numbers. %% % Let's generate a 5-by-5 matrix with random two-digit integer entries. A = fp16(randi(100,5,5)) %% % I am going to use |fp16| backslash to invert |A|. % So I need the identity matrix in half precision. I = fp16(eye(5)) %% % Now the overloaded backslash calls |lutx| to compute the inverse. X = A\I %% % Compute the residual. AX = A*X R = I - AX %% % Both |AX| and |R| are what I expect from arithmetic that is % accurate to only about three decimal digits. %% % Although I get a different random |A| every time I publish this blog % post, I expect that it has a modest condition number. kappa = cond(A) %% % Since |A| is not badly conditioned, I can invert the computed inverse % and expect to get close to the original integer matrix. Z = X\I %% fp16 SVD % I have just nonchalantly computed |cond(A)|. But |cond| isn't on the % list of overload methods for |fp16|. I was pleasantly surprised to % find that |matlab\matfun\cond.m| quietly worked on this new datatype. % Here is the core of that code. dbtype cond 34:43, dbtype cond 47 %% % So it is correctly using the singular value decomposition, and I have % |svd| overloaded. The SVD computation is handled by a 433 line M-file, % |svdtx|, that, like |lutx|, was written before |fp16| existed. %% % Let's compute the SVD again. [U,S,V] = svd(A) %% % Reconstruct |A| from its half precision SVD. It's not too shabby. USVT = U*S*V' %% % Finally, verify that we've been working all this time % with |fp16| objects. whos %% Calculator % I introduced a |calculator| in my blog post about % . % Version 3.1 of % also includes a fancier version of the calculator % that computes in four different precisions REPLACE_WITH_DASH_DASH quarter, half, single, and % double REPLACE_WITH_DASH_DASH and displays the results in four different formats REPLACE_WITH_DASH_DASH decimal, % hexadecimal, binary, and Roman. %% % I like to demonstrate the calculator by clicking on the keys % % 1 0 0 0 / 8 1 = % % because the decimal expansion is a repeating |.123456790|. % % <> %% Thanks % Thanks to MathWorkers Ben Tordoff, Steve Eddins, and Kiran Kintali % who provided background and pointers to work on half precision. ##### SOURCE END ##### ff9f165871db4e949268c53e16278a8d -->