Friday, October 14, 2016

How to read things without opening them

Brent Seales is a professor of computer science at the University of Kentucky. He leads several very cool research programs in areas ranging from robotic surgery to advanced image processing.

Recently, the Economist (one of my favorite weekly reads) featured some of his work. Here's a link.

How to read an old scroll without opening it



Tuesday, October 11, 2016

iCite and W2P ratios - new ways of quantifying scientific productivity

What's the best way of ranking scientists based on their productivity?

There are lots of options including:
  • number of publications
  • number of citations
  • impact factor of the journals the publications are in
All have strengths and weaknesses.

Recently, leaders at my academic institution have started to talk publicly about h-indices. An author's h-index is the largest number x such that x of their publications have each been cited at least x times. That's an interesting idea because it rewards impact; people who publish lots of papers that nobody cites have lower h-indices than people who publish a few manuscripts that are very influential. However, the h-index has the drawback that it grows with time (people publish more papers, and each paper has more time to accumulate citations). This means that it favors seasoned scientists who have been productive for a long time. It's not a great way of identifying a rising star.
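The h-index is easy to compute from a list of citation counts. Here's a quick sketch in Python (the citation counts are invented for illustration):

```python
def h_index(citations):
    """Largest x such that x papers have at least x citations each."""
    ranked = sorted(citations, reverse=True)
    # with counts sorted in descending order, cites >= rank holds for
    # exactly the first h ranks, so counting the hits gives the h-index
    return sum(1 for rank, cites in enumerate(ranked, start=1) if cites >= rank)

# a hypothetical author with five papers
print(h_index([10, 8, 5, 4, 3]))  # → 4
```

Here the fourth-most-cited paper has 4 citations, but the fifth has only 3, so the h-index is 4.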

I don't think that there will ever be a single perfect metric that scientists can use to quantify productivity but I was excited to see that NIH is supporting the Relative Citation Ratio with their new iCite tool.

The Relative Citation Ratio (RCR) is an article level metric that quantifies scientific influence. To quote a help box from iCite, "It is calculated as the cites/year of each paper, normalized to the citations per year received by NIH-funded papers in the same field."

The Weighted RCR is the sum of the individual RCRs for a group of papers.

This creates an interesting opportunity. If you calculate the ratio of the Weighted RCR to Total Publications (I'll call it W2P) you get a single value that defines the influence of a collection of papers.

If your W2P is greater than 1, your papers are, on average, more influential than comparable NIH-funded papers in your field. If it's less than 1, your papers are being cited less often than average.

W2P doesn't scale with the number of papers you publish, so it shouldn't depend on the length of your career.
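Since the arithmetic is trivial, here's the whole calculation as a Python sketch (the RCR values are invented for illustration):

```python
# Hypothetical RCR values for one author's papers (not real numbers)
rcrs = [2.1, 0.4, 1.7, 0.9, 3.2]

weighted_rcr = sum(rcrs)        # the Weighted RCR is the sum of the RCRs
w2p = weighted_rcr / len(rcrs)  # Weighted RCR divided by total publications

print(round(w2p, 2))  # → 1.66
```

Notice that W2P is simply the mean RCR of the collection, which is why it shouldn't drift upward over a long career the way an h-index does.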

Only time will tell how scientists use these new metrics but I am going to make a bold (and probably rash) prediction. Since NIH is supporting RCRs, I think that they will take over from h-indices as the most commonly used measure of productivity in US biomedical science.

For the record, here are my current stats
  • 46 publications
  • h-index = 22
  • Weighted RCR = 59.98
  • W2P = 1.34
and for the truly nerdy

Monday, September 5, 2016

The Pentium Bug

I remember the Pentium Bug in 1994 (I had just started my PhD) but didn't know enough about scientific computing to understand the significance.

Here's a fascinating blog post from Cleve Moler describing what happened.

Saturday, September 3, 2016

POVRay sarcomeres - final touches

This post tidies up some loose ends about making movies of muscle sarcomeres in POV-Ray. I blogged earlier about

Putting all these techniques together allowed me to render an image of a chain of sarcomeres.

The most difficult part here was defining the geometry and making sure that I was drawing each object in the correct x,y,z location. I automated this process using MATLAB. The whole scene is made up of spheres, cylinders, and boxes. A few objects are merged together as blobs.

Once I had that working for a single image, I just moved the camera around the scene until I had a sequence that I thought looked interesting. Then I rendered the images (it took a few hours on my PC) and stitched the sequence together to make a movie.

Here's the final result.



I am happy to share the code but it may take me some time to get it posted online. If you want to get it sooner, just send me a note via a comment below.

Thursday, August 25, 2016

How to learn to code?

Coding (perhaps like most things) is all about perspective.

My academic appointment is in the Department of Physiology in the Medical School at the University of Kentucky. Only a few people in the department know how to program and, as a result, most of my colleagues think I'm 'pretty good with computers'.

My collaborators in Engineering and Chemistry across the street are always polite but think I code like a Geico Caveman.

Why do so few physiologists code? I used to think it was because biomedical scientists didn't think it was useful, but I've changed my mind. I now think that our students don't code because they've never been taught. They've typically taken lots of courses in biology, chemistry, and genetics, but nobody has shown them how to write code, or how to break a task down into little steps so that they can create an algorithm.

What's the best way of helping these students get started?

I've blogged before about how I learned to program, but my advice won't work for everybody and I'm always looking for new resources that I can tell students about.

Today, I came across the MATLAB OnRamp. This resource has actually been around for several years but I hadn't seen it before and it looks interesting. If you have any experience with it, or can suggest better ways of getting started in scientific computing, please share your thoughts in the comments below. 


Monday, August 22, 2016

Filtering binary images by object size in MATLAB

As usual, I learned something new when I read Steve Eddins's latest blog post.

This time around, I discovered that MATLAB has some new features that make it easy to filter binary objects based on their size. There were, of course, old ways to do this but the new bwareafilt command (new, at least, as of MATLAB R2014b) simplifies some workflows.
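bwareafilt itself is MATLAB-only, but the underlying idea (label the connected objects, measure their areas, keep the ones you want) is easy to sketch. Here's a rough Python/SciPy equivalent, my own reimplementation rather than MathWorks' code:

```python
import numpy as np
from scipy import ndimage

def area_filter(binary, min_area):
    """Keep only connected objects containing at least min_area pixels
    (roughly what bwareafilt does when you ask for a minimum size)."""
    labels, n = ndimage.label(binary)                     # label connected objects
    areas = ndimage.sum(binary, labels, range(1, n + 1))  # pixel count per object
    keep = np.flatnonzero(areas >= min_area) + 1          # object labels to keep
    return np.isin(labels, keep)

mask = np.array([[1, 1, 0, 0, 1],
                 [1, 1, 0, 0, 0],
                 [0, 0, 0, 0, 1]], dtype=bool)

filtered = area_filter(mask, min_area=3)  # drops the two single-pixel specks
```

Only the 2-by-2 block survives; the two isolated pixels are removed.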


Sunday, August 21, 2016

POV-Ray animation

This post continues my introduction to generating images and movies using POV-Ray.

In the last post, I showed how to render a simple scene. Now I am going to show how to animate it.

POV-Ray includes a lot of options for animation and this page provides some great tutorials and examples. However, I have always found it easier to generate animations by automatically generating different *.pov files and then rendering them as part of my workflow.

The last part is easy because you can call POV-Ray directly from the command line. (See here for many options.)

For example, once you are in the directory that contains the main POV-Ray executable, you can produce a PNG file that is 340 by 280 pixels from the scene described in c:\temp\out.pov using this command.

pvengine +W340 +H280 c:\temp\out.pov /exit

Here's some MATLAB code that will sweep the camera around the previous scene.

function povray_animation

r = -25;
no_of_points = 50;
% camera x coordinate sweeps from -25 to 25 and back as the angle goes 0 to 360
x = r*cosd(linspace(0,360,no_of_points));

for i=1:no_of_points
    
    pov_file = fopen(fullfile(cd,sprintf('out%.0f.pov',i)),'w');
    fprintf(pov_file,'#include "colors.inc"\n');
    fprintf(pov_file,'background {color White}\n');
    fprintf(pov_file,'camera {location <%f,25,25>\n',x(i));
    fprintf(pov_file,'\tsky <0,0,1>\n');
    fprintf(pov_file,'\tlook_at <10,0,0>}\n');
    fprintf(pov_file,'light_source {<%f,25,25> color White}\n',x(i));
    fprintf(pov_file, ...
        'cylinder {<0,0,0>,<10,0,0>,1 texture {pigment {color Red}}}\n');
    fprintf(pov_file, ...
        'cylinder {<0,0,0>,<0,10,0>,1 texture {pigment {color Green}}}\n');
    fprintf(pov_file, ...
        'cylinder {<0,0,0>,<0,0,10>,1 texture {pigment {color Blue}}}\n');
    fprintf(pov_file, ...
        'sphere {<10,10,10>,3 texture {pigment {color OrangeRed}}}\n');
    fclose(pov_file);
    
    cd_string = 'cd c:\program files\pov-ray\v3.7\bin';
    command_string = sprintf('%s\npvengine +W340 +H280 %s +A /exit', ...
        cd_string, ...
        fullfile(cd,sprintf('out%.0f.pov',i)));
    batch_file = fopen('pov.bat','w');
    fprintf(batch_file,'%s\n',command_string);
    fclose(batch_file);
    system('pov.bat');
end

It's now easy to stitch the *.png files together to make a movie.




Saturday, August 13, 2016

POV-Ray basics

POV-Ray is a ray-tracing program. My simplistic view of what this means is that POV-Ray can turn a description of a scene into an image.

Here's an example.


1) Install POV-Ray (downloads here)

2) Start POV-Ray, and create a new file

3) Copy and paste the following text into the new file.


#include "colors.inc"

background {color White}
camera {location <-25,25,25>
sky <0,0,1>
look_at <10,0,0>}
light_source { <-25,25,25> color White }
                     
cylinder{<0,0,0>,<10,0,0>,1 texture {pigment {color Red}}}                    
cylinder{<0,0,0>,<0,10,0>,1 texture {pigment {color Green}}}
cylinder{<0,0,0>,<0,0,10>,1 texture {pigment {color Blue}}}

sphere {<10,10,10>,3 texture {pigment {color OrangeRed}}}


4) Run

5) You should see the following image



6) Here's what the *.pov file did

  • It put a camera at x=-25, y=25, z=25.
  • The sky command aligned the camera along the z-axis. (If there was a pole sticking out of the top of the camera, the pole would be parallel to the z-axis.)
  • The camera is pointing at x=10, y=0, z=0.
  • There is a light at the same location as the camera.
  • Three cylinders start at the origin <0,0,0> and run along the x, y, and z axes respectively.
  • Finally, there's a sphere at x=10, y=10, z=10.

7) When you pressed Run, POV-Ray worked out what the camera would have seen and created a *.png file of the view.

That's the basics. Everything else I've done in POV-Ray is just an extension of these sorts of techniques. I'll describe a few of my tricks in upcoming posts.

If you want to see what experts can do, take a look at the POV-Ray Hall of Fame.

Wednesday, August 10, 2016

Word clouds

A quick aside - I will get back to POV-Ray soon.

I wanted to make a Word Cloud for something at work. I've never done this before so I adopted my usual strategy - look in the MATLAB FileExchange to see if somebody has already created a tool.

As is nearly always the case, I found something useful, WordData Visualization. It was even a Pick of the Week back in 2015.

Five minutes later, and I had what I wanted. Then I went on to generate a figure based on the text from my PhD thesis. It doesn't get much more exciting :-)





Sunday, August 7, 2016

POV-Ray - old but gold

I wanted to render some 3D images recently. I downloaded SketchUp and Blender and did some quick googling.

Both packages look really interesting but they look as if they are primarily GUI-based. I realize that's what most people probably want but I needed to render a scene that contains thousands of objects. I didn't want to have to add these into my scene one at a time.

I'm sure somebody will explain how to create complex scenes in SketchUp and Blender (please write a comment if you do) but I decided to go back to POV-Ray. It's old (the last release was in 2013) but it's very functional. I used MATLAB to create my scene in the POV-Ray format (*.pov) and then rendered the image using the POV-Ray engine.

I'll explain how I automated the process in the next few posts and created 'fly-through movies' with a single click but, for now, here's a quick example of the sort of image I was able to generate.

I think it's pretty cool.


Tuesday, June 7, 2016

Machine learning and your car insurance

Here's another super-interesting post from the blogs on MATLABCentral.org.

http://blogs.mathworks.com/headlines/2016/06/07/your-car-doesnt-have-to-be-smart-to-be-a-privacy-concern/

A group at the University of California San Diego has analyzed data collected from standard sensors that are already embedded (and active!) in modern cars. Machine learning detected 'signatures' in the driving patterns of the participants and these could be used to identify the driver with near 100% certainty in a separate series of tests.

It will be interesting to see how this technology develops. Will insurance companies start to use these techniques to determine whether it was you, or your teenage child, who backed into the garage door? Will society decide that this is an invasion of privacy?

I don't know, but then again, I bike to work :-)

Ken

Monday, May 23, 2016

Numerical integration

I love reading everything Cleve Moler writes about scientific computing. He makes things sound so simple and I always learn something new. In fact, I nearly always learn lots of things.

Here's his latest blog post on numerical integration.

http://blogs.mathworks.com/cleve/2016/05/23/modernization-of-numerical-integration-from-quad-to-integral/

Wednesday, May 18, 2016

Update on WTFIT

A few weeks ago I blogged about an amazing new bot that was developed by the incomparable Ming Cheuk.

Since then, I've been wondering exactly how it works. Ming's interesting post on Medium provides some answers.

Finally, just for the record, and to keep things in perspective, I should point out that WTFIT doesn't always work exactly like you'd hope.

Here's a picture I tested it with. The main character in the image is Ming.


Monday, May 9, 2016

New resources for learning about differential equations

Students often ask me how they can start to learn about scientific computing.

I normally tell them to do what I did - start with a problem that they are interested in (maybe one that they are working on?) and begin to tinker. Here's how that worked out ...



But I realize that there are probably better approaches. For example, many people now like to learn by watching videos and tutorials.

This blog post caught my eye this morning and the videos, lectures, and online material look super interesting. I hope to give these courses a try myself.

http://blogs.mathworks.com/cleve/2016/05/09/strang-and-moler-video-course-on-differential-equations/

Friday, May 6, 2016

Image recognition and a new bot

Ming Cheuk, an exceptionally talented PhD student at the University of Auckland, has created a bot to identify objects in pictures. He calls it WTFIT.

I thought I'd give it a try. You just log in with Facebook and send it a picture. Here's what I got.


Amazing. Try it yourself at http://www.wtfitbot.com/


Thursday, May 5, 2016

Jason and the IT Crowd

I came across this on Facebook today and thought it was very funny.


If you haven't come across Maurice Moss before, he was one of the lead characters on The IT Crowd, a tremendously funny British sitcom.

Incidentally, I wrote about JSON in one of my first blog posts.

Monday, May 2, 2016

Neural networks for beginners

I've been thinking a little about machine learning recently so this blog post caught my attention.

http://blogs.mathworks.com/loren/2015/08/04/artificial-neural-networks-for-beginners/?s_eid=PSM_da

It's a beautiful post that starts with a concrete example focused on using artificial neural networks to recognize digits and ends with a program that solves Sudoku puzzles.

Thursday, April 28, 2016

New options for thresholding images in MATLAB

I've written several posts recently about thresholding images. I was thus pretty excited to see Steve Eddins announce a new series of posts on his blog.

http://blogs.mathworks.com/steve/2016/04/28/image-binarization-new-functional-designs/

My impression from a very quick scan of the documentation is that MATLAB 2016a includes a new function imbinarize which can be called with a locally adaptive threshold option.

I don't think this will lead to a breakthrough in image processing - adaptive thresholding is not a new technique - but it will simplify coding. The new function will allow users to replace multiple lines of 'old' MATLAB code with a single command.
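For anyone curious what 'locally adaptive thresholding' actually means, here is the gist in a few lines of Python with SciPy. This is a sketch of the general idea (threshold each pixel against its local mean), not MATLAB's exact algorithm:

```python
import numpy as np
from scipy import ndimage

def adaptive_binarize(img, window=15, offset=0.0):
    """A pixel is foreground when it is brighter than the mean of its
    local neighborhood (minus an offset). Unlike a single global
    threshold, this copes with uneven illumination."""
    local_mean = ndimage.uniform_filter(img.astype(float), size=window)
    return img > local_mean - offset

# a dark image with one bright spot
img = np.zeros((20, 20))
img[10, 10] = 10.0
mask = adaptive_binarize(img, window=5)
```

The bright pixel stands out against its own neighborhood, so it is the only one that survives, no matter what the rest of the image looks like.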

And in my opinion, that's why MATLAB is so useful for scientific computing. It's not that the software gives you completely new options; you can code anything in C if you want to. It's just that MATLAB makes it easy to try things without investing huge amounts of time. That's why it accelerates science.

Sunday, April 24, 2016

Standard errors

This post is going to be about the standard error of the mean, which is often just referred to as the standard error.

As a reminder, we're basing our discussion on a distribution of 5 km race times. The mean of the race times is 25.001 minutes. The standard deviation s is 2.973 minutes. The mean tells you where the distribution is centered. The standard deviation tells you about its width. Look back at the last post if you need a reminder.


The standard error is subtly different. It doesn't really describe your distribution. Rather, it helps you to understand how accurately you know the mean value of your data.

But the mean is the mean, you're thinking. How can it change?

Well, you're right. You calculated the mean of 500 race times. That's a fixed number. But the assumption in this type of statistics is that you're pulling these race times from an infinite series of potential times. When you picked this particular set of 500 values, the mean happened to be 25.001. That's now your estimate of the true mean of the infinite series of potential times.

But you could have picked a different set of 500 race times from the infinite list of potential numbers, in which case, you'd probably have calculated a different estimate of the mean. A third set of 500 times could have given you a still different estimate. And so on ...

How much would these estimates of the true mean vary? That's what the standard error helps you to decide.

Take a look at the histograms below. They show what happens when you split the 500 original race times into groups of different sizes and then calculate the mean of each group.

So the middle panel (group size = 5) shows the distribution of means calculated using 100 groups which each contain 5 race times. The first value represented in the histogram is the mean of the first 5 race times, the second value is the mean of the second 5 race times, and so on.


Do you see how the distributions get narrower as the group size increases? That is because as you include more and more points in the group, it becomes increasingly representative of the original population. By the time you're averaging 20 points, there's not much difference between the individual subsets.

You can calculate the width of the histograms above just like we did when we calculated the standard deviation in the last post. The width of these distributions is the standard error of the mean. It goes down as you increase the group size. You can also calculate it as

\text{SE}_{\bar{x}} = \frac{s}{\sqrt{n}}

from Wikipedia where s is the standard deviation calculated for your samples and n is the number of data points you have. As you increase n, the standard error goes down, because your points are becoming more representative of the original population. However, there are diminishing returns. Once you already have quite a few data points, you have to add a lot more to get a better estimate of the mean. That bit comes from the square root of n on the bottom.
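You can check the s/√n behavior yourself with a quick simulation. Here's a Python sketch (the blog's examples use MATLAB; the race-time parameters, mean 25 and standard deviation 3, are the ones from these posts):

```python
import random
import statistics

random.seed(42)

# a big population of simulated 5 km times: mean 25 min, sd 3 min
population = [random.gauss(25, 3) for _ in range(50000)]

se = {}
for n in (5, 20, 80):
    # the spread of the group means is the standard error of the mean
    means = [statistics.mean(random.sample(population, n)) for _ in range(2000)]
    se[n] = statistics.stdev(means)
    print(n, round(se[n], 2), round(3 / n ** 0.5, 2))
```

The simulated spread of the group means tracks s/√n closely, and you can see the diminishing returns: going from 5 to 20 points helps a lot more than going from 20 to 80.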

So in summary,

  • the mean tells you where your data are centered
  • the standard deviation quantifies the spread in your original data
  • the standard error describes the uncertainty associated with the mean value
Next up, one-sample t-tests.

Thursday, April 21, 2016

Distributions and standard deviations

This post is going to be about distributions and statistics that summarize them.

I run 5 km races fairly often and so do many of my friends. Here are the latest times (in minutes) for a few of us.

23.053, 28.543, 22.725, 21.671, 22.463, 23.282

[Okay, I'm cheating a bit here. You probably realized that we don't time our runs to 3 decimal places. I made these numbers up in MATLAB because I want a concrete example, but bear with me.]

The fastest time in this small group was ~21.7 minutes. The slowest was just over 28.5 minutes. It looks like our average time was ~23.6 minutes.

The next image shows a histogram for 500 5 km times. (Download the data here if you want to follow along.) The x axis shows the time in minutes. The y axis shows the number of people who ran that time.

If you look at this distribution, you can see that our fastest runner completed his race in 17 minutes (there's a bar of height 1 at 17). Similarly, two of our group took 35 minutes.

This histogram is centered around 25 minutes. Some of us took less time, some of us took more, but our 'middle time' was ~25 minutes. You can calculate that by adding up all the 5 km times and dividing by the number of races. Here's the equation copied from Wikipedia.

 \bar{x} = \frac{x_1+x_2+\cdots +x_n}{n}

For our data, the mean happens to be 25.0 minutes.

Knowing the mean value of a distribution is often helpful but it can be useful to know how wide the distribution is too. For example, you might want to know whether (a) we all ran very similar times, or (b) our times spanned a big range. These would correspond to narrow and wide distributions respectively.




The key parameter here is the standard deviation, written as s. The standard deviation for the narrow distribution above is 1. The standard deviation for the wide distribution is 4.

You can calculate the standard deviation using another formula, again copied from the Wikipedia page

\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2}, {\rm \ \ where\ \ } \mu = \frac{1}{N} \sum_{i=1}^N x_i.
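Applying these formulas to the six race times at the top of this post, in Python for anyone following along without MATLAB:

```python
import statistics

times = [23.053, 28.543, 22.725, 21.671, 22.463, 23.282]

mean = statistics.mean(times)   # add them up, divide by n
sd = statistics.pstdev(times)   # population formula (divides by N)

print(round(mean, 2), round(sd, 2))  # → 23.62 2.26
```

One small gotcha: the Wikipedia formula above divides by N, but for a sample you'll often see the variance divided by N−1 instead (statistics.stdev does this, giving 2.47 here). For reasonably large data sets the difference is tiny.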

However, I think it might be more useful to think about what the standard deviation actually represents.

Take a look at this formula

f(x) = \frac{1}{s\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2s^2}}
You can think of this as a constant (the bit before the e) multiplied by e raised to some power.

The exponent is the interesting part. (x-µ)² is greater than or equal to zero because squared numbers cannot be negative. Similarly, 2s² is always positive.

So the term

e^{-\frac{(x-\mu)^2}{2s^2}}

is e to the power of 0 (when x is µ) or e to the power of a negative number.

The maximum value of the term is 1 when x is µ. The term heads to zero as x becomes bigger or smaller than µ. If s is big, the term heads to zero slowly. Conversely if s is small, the term drops to zero very quickly as x moves away from µ.

f(x) is the equation for the normal distribution. It's a very important example of a bell-shaped or Gaussian curve. (All Gaussian curves have the functional form e^(-something·x²).) The constant at the beginning just makes sure that the area under the curve always equals 1.
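Two quick sanity checks on f(x), sketched in Python: the curve peaks at x = µ, and the area under it is 1 (numerically, at least):

```python
import math

def normal_pdf(x, mu, s):
    # e^(-(x - mu)^2 / (2 s^2)), scaled by the constant so the area is 1
    return math.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (s * math.sqrt(2 * math.pi))

mu, s = 25.0, 3.0
peak = normal_pdf(mu, mu, s)  # exponent is 0 here, so e^0 = 1

# crude numerical integration from 10 to 40 (5 standard deviations each way)
area = sum(normal_pdf(10 + 0.01 * i, mu, s) * 0.01 for i in range(3000))
```

The values of µ and s are the race-time numbers used throughout these posts.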

You can fit f(x) to our histogram by adjusting the values of µ and s until you get the best possible match between the curve and the data. (We'll talk more in later posts about how to do this fitting.) When you do this

  • µ is the mean of the data
  • s is the standard deviation

Here's what the fit looks like for our data.




In summary, the mean of a distribution tells you where it is centered. The standard deviation s tells you how wide it is.

Next up, standard errors.

Monday, April 18, 2016

Statistics - introduction

This is the introduction to a series of posts about statistics.

I intend to start slowly and build up to cover one and two-sample t-tests. If things go well, I hope to extend the discussion to linear and non-linear regression. I hope that after reading this series you will be able to explain to your friends what the p value of a regression line means.

My approach is heavily influenced by Harvey Motulsky's book "Intuitive Biostatistics". I highly recommend buying this book, or at least borrowing it from your library and devouring it voraciously. It completely changed the way I think about statistics.

The first post in this series will cover means, standard deviations, and standard errors.

Saturday, April 16, 2016

Image segmentation - part 3

This is the third post in our discussion of image segmentation. The first post talked about splitting an image by RGB color using ImageJ / Fiji. Then I talked about using k-means segmentation and the L*a*b color space. You might remember that I was pretty excited about the technique. Is it a general solution to all segmentation problems?

Let's try it on this image.


When I look at this picture, I see 3 colors: white, black/blue and lots of red. Let's convert from RGB to L*a*b and then split into 3 regions using k-means segmentation. The MATLAB code to do this is

function k_means_clustering

image_file_string = '3mokotri.tif';

% Load image
im_rgb = imread(image_file_string);

% Transform to lab color space
cform = makecform('srgb2lab');
im_lab = applycform(im_rgb,cform);

% Create a n x 2 matrix of [a b] values
a = im_lab(:,:,2);
a = double(a(:));
b = im_lab(:,:,3);
b = double(b(:));
ab = [a b];

% Construct the 2D histogram with appropriate limits
c{1}=[min(ab(:)) : max(ab(:))];
c{2}=c{1};
n = hist3(ab,c);
n1 = n';
n1(size(n,1) + 1, size(n,2) + 1) = 0;

% Display
figure(1);
clf;
imagesc(log10(n1));
hold on;
title('2D histogram of pixel values, colored based on log10 of pixel count')
xlabel('a value');
ylabel('b value');

[id,cl]=kmeans(ab,3);

figure(2);
clf;
hold on;
cm = paruly(3);
for i=1:3
    vi = find(id==i);
    plot(ab(vi,1),ab(vi,2),'+','Color',cm(i,:));
    plot(cl(i,1),cl(i,2),'md','MarkerFaceColor','m');
end
xlabel('a value');
ylabel('b value');
set(gca,'YDir','reverse');

[r,c] = size(im_rgb(:,:,1));
im_cluster = reshape(id,r,c);

figure(3);
imagesc(im_cluster);

The end result is
Hmm, that was not what I was hoping for.

Compare with the original again.

It looks like the algorithm did a good job of finding the white lines but it didn't find the black dots I was looking for. Instead, it separated the red areas in the original picture into dark red and light red (maybe pink?) areas.

I have some ideas about how to improve the algorithm for this particular image (which I intend to address in future posts) but I think I've just demonstrated a general result of image processing. It's hard to find a single approach that works for every type of task. It's also hard to develop techniques that don't have a few fiddle factors (parameters that you adjust until 'something works').

Image processing is never easy but as you gain experience and learn more about the techniques that are available, you begin to make faster progress.

Next up, some thoughts on t-tests.

Sunday, April 10, 2016

Image segmentation - part 2

This continues the previous post where we talked about segmenting a test image into 3 parts. As a reminder, here's the test image again.



k-means clustering takes your data points and tries to group them into k clusters. It's a general technique that can be used with many sorts of data but it can be useful for segmenting images.
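To make the idea concrete, here's a bare-bones k-means in Python/NumPy. MATLAB's kmeans is more sophisticated (smarter initialization, random restarts), but the core loop is just this:

```python
import numpy as np

def kmeans(points, k, n_iter=20):
    """Alternate between (1) assigning each point to its nearest
    centroid and (2) moving each centroid to the mean of its points."""
    centroids = points[:k].astype(float)  # naive init: the first k points
    for _ in range(n_iter):
        # distance from every point to every centroid
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = points[labels == i].mean(axis=0)
    return labels, centroids

# two well-separated blobs of 2D points
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels, centroids = kmeans(pts, 2)
```

With well-separated blobs like these, the loop converges in a couple of iterations and each blob gets its own label.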

If you want to segment by color, it's often easier to switch to the L*a*b color space. When you do this, the second and third planes (a and b) contain all of the color information. The first plane, L, relates to brightness.

Here's some MATLAB code that
  1. loads our test image
  2. transforms it into the lab color space
  3. calculates the 2D histogram of pixel intensities
  4. displays the heat map of the histogram with a log color scale

function k_means_clustering

image_file_string = 'hubert_8_3_small.png';

% Load image
im_rgb = imread(image_file_string);

% Transform to lab color space
cform = makecform('srgb2lab');
im_lab = applycform(im_rgb,cform);

% Create a n x 2 matrix of [a b] values
a = im_lab(:,:,2);
a = double(a(:));
b = im_lab(:,:,3);
b = double(b(:));
ab = [a b];

% Construct the 2D histogram with appropriate limits
c{1}=[min(ab(:)) : max(ab(:))];
c{2}=c{1};
n = hist3(ab,c);
n1 = n';
n1(size(n,1) + 1, size(n,2) + 1) = 0;

% Display
figure(1);
clf;
imagesc(log10(n1));
hold on;
title('2D histogram of pixel values, colored based on log10 of pixel count')
xlabel('a value');
ylabel('b value');


The output is

When I look at the test image, I see three colors: a gray/white that forms the background, an orangey-yellow that is muscle tissue, and a darker red that indicates collagen. So let's try to split the pixel values into 3 separate groups. In MATLAB, the command is

[id,cl]=kmeans(ab,3);

id is a vector with values of 1,2, or 3. These link each pixel to a given cluster.
cl is a 3x2 array which shows the centroids of the clusters.

This snippet of code shows which [a,b] values fall into each cluster. The magenta markers show the centroids (cl) of each cluster.

figure(2);
clf;
hold on;
cm = paruly(3);
for i=1:3
    vi = find(id==i);
    plot(ab(vi,1),ab(vi,2),'+','Color',cm(i,:));
    plot(cl(i,1),cl(i,2),'md','MarkerFaceColor','m');
end

set(gca,'YDir','reverse');

giving
It's now easy to set each pixel in the original image based on its cluster identity. You just have to reshape the id vector.

[r,c] = size(im_rgb(:,:,1));
im_cluster = reshape(id,r,c);

figure(3);
imagesc(im_cluster);

The output is
Compare this result to the original.




The segmentation is not perfect but we didn't set any parameters other than 'we want 3 colors'. It's also super-fast and objective. I think it's pretty cool.

Is k means clustering a perfect solution for image segmentation? We will look at that in the next post.

Saturday, April 9, 2016

Image segmentation, part 1

I often want to calculate numbers that quantify features in images. For example, the picture below is a cross-section of a piece of heart muscle. The edges are mostly white/gray and are the background of the slide. The light orange parts are muscle. The red parts are collagen.




I want to calculate the relative content of collagen, that is the number of pixels that are collagen divided by the number of pixels that aren't background.
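Once an image has been segmented into classes, that fraction is a two-line calculation. Here's a Python sketch using a made-up 3x4 label image (0 = background, 1 = muscle, 2 = collagen; the real segmentation happens in the posts that follow):

```python
import numpy as np

# hypothetical segmentation result: 0 = background, 1 = muscle, 2 = collagen
labels = np.array([[0, 1, 1, 2],
                   [0, 1, 2, 2],
                   [0, 0, 1, 1]])

tissue_pixels = (labels > 0).sum()  # everything that isn't background
collagen_fraction = (labels == 2).sum() / tissue_pixels

print(collagen_fraction)  # → 0.375
```

The hard part, of course, is getting a trustworthy label image in the first place, which is what the segmentation posts are about.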

When my lab has done this in the past, we've segmented the image by thresholding in the RGB plane. The screenshot below shows an initial attempt using the Color Thresholding tool in Fiji.



I think this approach works, eventually, but it requires a lot of fiddling to get the thresholds right, and it's definitely subjective. When I came across this page in the MATLAB documentation, I wondered if k-means clustering might be a better option.

We'll describe that approach in the next post.