Thursday, April 28, 2016

New options for thresholding images in MATLAB

I've written several posts recently about thresholding images. I was thus pretty excited to see Steve Eddins announce a new series of posts on his blog.

http://blogs.mathworks.com/steve/2016/04/28/image-binarization-new-functional-designs/

My impression from a very quick scan of the documentation is that MATLAB R2016a includes a new function imbinarize which can be called with a locally adaptive threshold option.

I don't think this will lead to a breakthrough in image processing - adaptive thresholding is not a new technique - but it will simplify coding. The new function will allow users to replace multiple lines of 'old' MATLAB code with a single command.
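
For example, here's a minimal sketch of the new design (assuming R2016a with the Image Processing Toolbox, and using the coins.png demo image that ships with it):

% Compare global and locally adaptive thresholding with imbinarize
I = imread('coins.png');                    % demo grayscale image
BW_global = imbinarize(I);                  % global (Otsu) threshold
BW_adaptive = imbinarize(I,'adaptive');     % locally adaptive threshold
figure;
imshowpair(BW_global,BW_adaptive,'montage');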

And in my opinion, that's why MATLAB is so useful for scientific computing. It's not that the software gives you completely new options; you can code anything in C if you want to. It's just that MATLAB makes it easy to try things without investing huge amounts of time. That's why it accelerates science.

Sunday, April 24, 2016

Standard errors

This post is going to be about the standard error of the mean, which is often just referred to as the standard error.

As a reminder, we're basing our discussion on a distribution of 5 km race times. The mean of the race times is 25.001 minutes. The standard deviation s is 2.973 minutes. The mean tells you where the distribution is centered. The standard deviation tells you about its width. Look back at the last post if you need a reminder.


The standard error is subtly different. It doesn't really describe your distribution. Rather, it helps you to understand how accurately you know the mean value of your data.

But the mean is the mean, you're thinking. How can it change?

Well, you're right. You calculated the mean of 500 race times. That's a fixed number. But the assumption in this type of statistics is that you're pulling these race times from an infinite series of potential times. When you picked this particular set of 500 values, the mean happened to be 25.001. That's now your estimate of the true mean of the infinite series of potential times.

But you could have picked a different set of 500 race times from the infinite list of potential numbers, in which case, you'd probably have calculated a different estimate of the mean. A third set of 500 times could have given you a still different estimate. And so on ...

How much would these estimates of the true mean vary? That's what the standard error helps you to decide.

Take a look at the histograms below. They show what happens when you split the 500 original race times into groups of different sizes and then calculate the mean of each group.

So the middle panel (group size = 5) shows the distribution of means calculated from 100 groups, each containing 5 race times. The first value represented in the histogram is the mean of the first 5 race times, the second value is the mean of the next 5 race times, and so on.
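
Here's a sketch of how you might generate one of these panels (assuming the 500 race times are in a vector called times):

% Split the 500 times into groups of 5 and calculate each group's mean
group_size = 5;
group_means = mean(reshape(times(:),group_size,[]));
figure;
hist(group_means,20);
xlabel('Mean 5 km time (minutes)');
ylabel('Number of groups');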


Do you see how the distributions get narrower as the group size increases? That is because as you include more and more points in the group, it becomes increasingly representative of the original population. By the time you're averaging 20 points, there's not much difference between the individual subsets.

You can calculate the width of the histograms above just like we did when we calculated the standard deviation in the last post. The width of these distributions is the standard error of the mean. It goes down as you increase the group size. You can also calculate it as

\text{SE}_{\bar{x}} = \frac{s}{\sqrt{n}}

(copied from Wikipedia), where s is the standard deviation calculated for your sample and n is the number of data points. As you increase n, the standard error goes down because your points become more representative of the original population. However, there are diminishing returns: once you already have quite a few data points, you have to add a lot more to get a much better estimate of the mean. That behavior comes from the square root of n on the bottom.
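
In MATLAB, the calculation is a one-liner (again assuming the times are in a vector called times):

% Standard error of the mean
se = std(times)/sqrt(numel(times));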

So in summary,

  • the mean tells you where your data are centered
  • the standard deviation quantifies the spread in your original data
  • the standard error describes the uncertainty associated with the mean value
Next up, one-sample t-tests.

Thursday, April 21, 2016

Distributions and standard deviations

This post is going to be about distributions and statistics that summarize them.

I run 5 km races fairly often and so do many of my friends. Here are the latest times (in minutes) for a few of us.

23.053, 28.543, 22.725, 21.671, 22.463, 23.282

[Okay, I'm cheating a bit here. You probably realized that we don't time our runs to 3 decimal places. I made these numbers up in MATLAB because I want a concrete example, but bear with me.]

The fastest time in this small group was ~21.7 minutes. The slowest was just over 28.5 minutes. It looks like our average time was ~23.6 minutes.

The next image shows a histogram for 500 5 km times. (Download the data here if you want to follow along.) The x axis shows the time in minutes. The y axis shows the number of people who ran that time.

If you look at this distribution, you can see that our fastest runner completed his race in 17 minutes (there's a bar of height 1 at 17). Similarly, two of our group took 35 minutes.

This histogram is centered around 25 minutes. Some of us took less time, some of us took more, but our 'middle time' was ~25 minutes. You can calculate that by adding up all the 5 km times and dividing by the number of races. Here's the equation, copied from Wikipedia.

 \bar{x} = \frac{x_1+x_2+\cdots +x_n}{n}

For our data, the mean happens to be 25.0 minutes.

Knowing the mean value of a distribution is often helpful, but it can be useful to know how wide the distribution is too. For example, you might want to know whether (a) we all run very similar times, or (b) our times span a big range. These correspond to narrow and wide distributions respectively.




The key parameter here is the standard deviation, written as s. The standard deviation for the narrow distribution above is 1. The standard deviation for the wide distribution is 4.

You can calculate the standard deviation using another formula, again copied from the Wikipedia page (which writes the standard deviation as σ rather than s)

\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2}, \text{ where } \mu = \frac{1}{N} \sum_{i=1}^N x_i.
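
In MATLAB, both statistics are built in. A quick sketch, assuming the times are in a vector called times (note that std normalizes by N-1 by default, while the Wikipedia formula above divides by N):

m = mean(times)        % ~25.0 minutes for this data set
s = std(times)         % sample standard deviation (divides by N-1)
s_pop = std(times,1)   % population form (divides by N, as in the formula above)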

However, I think it might be more useful to think about what the standard deviation actually represents.

Take a look at this formula

f(x) = \frac{1}{s\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2s^2}}

You can think of this as a constant (the bit before the e) multiplied by e^(something).

The e^(something) bit is the interesting part. (x-µ)² is greater than or equal to zero because squared numbers cannot be negative. Similarly, 2s² is always positive. That means the exponent -(x-µ)²/(2s²) is always zero or negative.

So the term

e^{-\frac{(x-\mu)^2}{2s^2}}

is e to the power of 0 (when x is µ) or e to the power of a negative number.

The maximum value of the term is 1 when x is µ. The term heads to zero as x becomes bigger or smaller than µ. If s is big, the term heads to zero slowly. Conversely if s is small, the term drops to zero very quickly as x moves away from µ.

f(x) is the equation for the normal distribution. It's a very important example of a bell-shaped or Gaussian curve. (All Gaussian curves have the functional form e^(-something·x²).) The bit at the beginning just makes sure that the area under the curve always equals 1.
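
You can see this behavior by plotting f(x) for the narrow (s = 1) and wide (s = 4) distributions shown earlier (a quick sketch):

% Plot two normal curves with the same mean but different widths
f = @(x,mu,s) exp(-(x-mu).^2./(2*s^2))./(s*sqrt(2*pi));
x = 10:0.1:40;
figure;
hold on;
plot(x,f(x,25,1),'b');   % narrow distribution, s = 1
plot(x,f(x,25,4),'r');   % wide distribution, s = 4
xlabel('5 km time (minutes)');
ylabel('f(x)');
legend('s = 1','s = 4');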

You can fit f(x) to our histogram by adjusting the values of µ and s until you get the best possible match between the curve and the data. (We'll talk more in later posts about how to do the fitting properly, but there's a quick preview just below the list.) When you do this

  • µ is the mean of the data
  • s is the standard deviation
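
For the quick preview, the Statistics Toolbox can do the fit in a couple of lines (a sketch, assuming the times are in a vector called times):

% Fit a normal distribution and overlay it on the histogram
[mu_hat,s_hat] = normfit(times);   % estimates of µ and s
histfit(times);                    % histogram with the fitted normal curve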

Here's what the fit looks like for our data.




In summary, the mean of a distribution tells you where it is centered. The standard deviation s tells you how wide it is.

Next up, standard errors.

Monday, April 18, 2016

Statistics - introduction

This is the introduction to a series of posts about statistics.

I intend to start slowly and build up to cover one- and two-sample t-tests. If things go well, I hope to extend the discussion to linear and non-linear regression. I hope that after reading this series you will be able to explain to your friends what the p value of a regression line means.

My approach is heavily influenced by Harvey Motulsky's book "Intuitive Biostatistics". I highly recommend buying this book, or at least borrowing it from your library and devouring it voraciously. It completely changed the way I think about statistics.

The first post in this series will cover means, standard deviations, and standard errors.

Saturday, April 16, 2016

Image segmentation - part 3

This is the third post in our discussion of image segmentation. The first post talked about splitting an image by RGB color using ImageJ / Fiji. Then I talked about using k-means segmentation and the L*a*b color space. You might remember that I was pretty excited about the technique. Is it a general solution to all segmentation problems?

Let's try it on this image.


When I look at this picture, I see 3 colors: white, black/blue and lots of red. Let's convert from RGB to L*a*b and then split into 3 regions using k-means segmentation. The MATLAB code to do this is

function k_means_clustering

image_file_string = '3mokotri.tif';

% Load image
im_rgb = imread(image_file_string);

% Transform to lab color space
cform = makecform('srgb2lab');
im_lab = applycform(im_rgb,cform);

% Create a n x 2 matrix of [a b] values
a = im_lab(:,:,2);
a = double(a(:));
b = im_lab(:,:,3);
b = double(b(:));
ab = [a b];

% Construct the 2D histogram with appropriate limits
c{1}=[min(ab(:)) : max(ab(:))];
c{2}=c{1};
n = hist3(ab,c);
n1 = n';
n1(size(n,1) + 1, size(n,2) + 1) = 0;

% Display
figure(1);
clf;
imagesc(log10(n1));
hold on;
title('2D histogram of pixel values, colored based on log10 of pixel count')
xlabel('a value');
ylabel('b value');

% Split the [a b] values into 3 clusters using k-means
[id,cl]=kmeans(ab,3);

figure(2);
clf;
hold on;
cm = paruly(3);   % paruly is a parula-style colormap (not built in to MATLAB)
for i=1:3
    vi = find(id==i);
    plot(ab(vi,1),ab(vi,2),'+','Color',cm(i,:));
    plot(cl(i,1),cl(i,2),'md','MarkerFaceColor','m');
end
xlabel('a value');
ylabel('b value');
set(gca,'YDir','reverse');

% Map each pixel back to its cluster id
[r,c] = size(im_rgb(:,:,1));
im_cluster = reshape(id,r,c);

figure(3);
imagesc(im_cluster);

The end result is
Hmm, that was not what I was hoping for.

Compare with the original again.

It looks like the algorithm did a good job of finding the white lines but it didn't find the black dots I was looking for. Instead, it separated the red areas in the original picture into dark red and light red (maybe pink?) areas.

I have some ideas about how to improve the algorithm for this particular image (which I intend to address in future posts) but I think I've just demonstrated a general result of image processing. It's hard to find a single approach that works for every type of task. It's also hard to develop techniques that don't have a few fiddle factors (parameters that you adjust until 'something works').

Image processing is never easy but as you gain experience and learn more about the techniques that are available, you begin to make faster progress.

Next up, some thoughts on t-tests.

Sunday, April 10, 2016

Image segmentation - part 2

This continues the previous post where we talked about segmenting a test image into 3 parts. As a reminder, here's the test image again.



k-means clustering takes your data points and tries to group them into k clusters. It's a general technique that can be used with many sorts of data, and it's particularly useful for segmenting images.

If you want to segment by color, it's often easier to switch to the L*a*b color space. When you do this, the second and third planes (a and b) contain all of the color data. The first plane, L, encodes brightness.

Here's some MATLAB code that
  1. loads our test image
  2. transforms it into the L*a*b color space
  3. calculates the 2D histogram of pixel intensities
  4. displays the heat map of the histogram with a log color scale

function k_means_clustering

image_file_string = 'hubert_8_3_small.png';

% Load image
im_rgb = imread(image_file_string);

% Transform to lab color space
cform = makecform('srgb2lab');
im_lab = applycform(im_rgb,cform);

% Create a n x 2 matrix of [a b] values
a = im_lab(:,:,2);
a = double(a(:));
b = im_lab(:,:,3);
b = double(b(:));
ab = [a b];

% Construct the 2D histogram with appropriate limits
c{1}=[min(ab(:)) : max(ab(:))];
c{2}=c{1};
n = hist3(ab,c);
n1 = n';
n1(size(n,1) + 1, size(n,2) + 1) = 0;

% Display
figure(1);
clf;
imagesc(log10(n1));
hold on;
title('2D histogram of pixel values, colored based on log10 of pixel count')
xlabel('a value');
ylabel('b value');


The output is

When I look at the test image, I see three colors: a gray/white that forms the background, an orangey-yellow that is muscle tissue, and a darker red that indicates collagen. So let's try to split the pixel values into 3 separate groups. In MATLAB, the command is

[id,cl]=kmeans(ab,3);

id is a vector with values 1, 2, or 3; it links each pixel to a given cluster.
cl is a 3x2 array containing the centroids of the clusters.
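
As a quick sanity check, you can count how many pixels ended up in each cluster (a one-line sketch):

% Number of pixels assigned to clusters 1, 2 and 3
counts = accumarray(id,1)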

This snippet of code shows which [a,b] values fall into each cluster. The magenta markers show the centroids (cl) of each cluster.

figure(2);
clf;
hold on;
cm = paruly(3);   % paruly is a parula-style colormap (not built in to MATLAB)
for i=1:3
    vi = find(id==i);
    plot(ab(vi,1),ab(vi,2),'+','Color',cm(i,:));
    plot(cl(i,1),cl(i,2),'md','MarkerFaceColor','m');
end

set(gca,'YDir','reverse');

giving
It's now easy to set each pixel in the original image based on its cluster identity. You just have to reshape the id vector.

[r,c] = size(im_rgb(:,:,1));
im_cluster = reshape(id,r,c);

figure(3);
imagesc(im_cluster);

The output is
Compare this result to the original.




The segmentation is not perfect but we didn't set any parameters other than 'we want 3 colors'. It's also super-fast and objective. I think it's pretty cool.

Is k-means clustering a perfect solution for image segmentation? We'll look at that in the next post.

Saturday, April 9, 2016

Image segmentation, part 1

I often want to calculate numbers that quantify features in images. For example, the picture below is a cross-section of a piece of heart muscle. The edges are mostly white/gray and are the background of the slide. The light orange parts are muscle. The red parts are collagen.




I want to calculate the relative content of collagen: that is, the number of pixels that are collagen divided by the number of pixels that aren't background.
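
Whichever segmentation method we end up using, the final calculation is straightforward. Here's a hypothetical sketch: it assumes a label image im_cluster (like the ones built in the next posts) where, say, cluster 1 is the background and cluster 3 is collagen. The actual cluster ids vary from run to run, so you'd need to check them first.

% Relative collagen content from a label image
% (cluster ids 1 = background, 3 = collagen are assumptions for this sketch)
n_collagen = sum(im_cluster(:) == 3);
n_background = sum(im_cluster(:) == 1);
collagen_fraction = n_collagen / (numel(im_cluster) - n_background)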

When my lab has done this in the past, we've segmented the image by thresholding in the RGB plane. The screenshot below shows an initial attempt using the Color Thresholding tool in Fiji.



I think this approach works, eventually, but it requires a lot of fiddling to get the thresholds right, and it's definitely subjective. When I came across this page in the MATLAB documentation, I wondered if k-means clustering might be a better option.

We'll describe that approach in the next post.

Saturday, April 2, 2016

Loading XML data with C++

I've been writing code that reads/writes data to/from text files since ~1993. When I am coding in C++, I still use the techniques I learned by trial and error 20+ years ago. As a result, I use a lot of fprintf and fscanf statements, and my text files end up with custom formats that are hard for others to use.

Recently, I've been thinking of moving towards XML. I think this will be fairly easy for code that I write in MATLAB but I wasn't sure how difficult it would be for C++. Today, I've learned that I have a chance!
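
(As an aside, the MATLAB half really does look easy: the built-in xmlread function returns a Java DOM document that you can query directly. A minimal sketch, using the element names from the test file at the end of this post:)

% Read a single value from an XML file via MATLAB's Java DOM interface
dom = xmlread('test_xml.xml');
node = dom.getElementsByTagName('eta_o').item(0);
eta_o = str2double(char(node.getTextContent()))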

Some quick googling pushed me towards TinyXML-2. It's well-documented and it seems to be actively supported. Another big advantage is that I only need to add 2 files (tinyxml2.h, tinyxml2.cpp) to my existing code.

As usual, I had to play around for a few minutes to get anything to work, but it was easier than I expected given my limited experience. For example, this C++ code

#include <cstdio>
#include "tinyxml2.h"

int main()
{
    // Load the XML file
    tinyxml2::XMLDocument doc;
    doc.LoadFile("C:\\temp\\test_xml.xml");

    // Walk down the element tree to <eta_o>
    tinyxml2::XMLElement* a = doc.FirstChildElement("MotilSim_model")->
        FirstChildElement("thin_data")->FirstChildElement("eta_o");

    // GetText() returns the element's text as a C string
    const char* b = a->GetText();
    printf("eta_o: %s\n", b);

    return 0;
}


produces

eta_o: 6e-11

from the XML file that looks like this

<?xml version="1.0" encoding="windows-1252"?>
<MotilSim_model>
<code_data>
<ver>1.1</ver>
</code_data>
<field_data>
<random_seed>3</random_seed>
<no_of_time_steps>20</no_of_time_steps>
</field_data>

<thin_data>
<rest_length>5.5e-9</rest_length>
<k_a>0.02</k_a>
<eta_o>6e-11</eta_o>
<k_torque>1e-14</k_torque>
<m>0.7e-19</m>
<alignment_factor>20</alignment_factor>
</thin_data>

<cb_data>
<k_cb>0.001</k_cb>
<x_ps>10e-9</x_ps>
<f>0</f>
<g>10 1</g>
</cb_data>
</MotilSim_model>