Making training data sets

This IPython notebook provides an interactive overview of the make_training_data.py script, which converts annotated images into training data sets. Our starting point is an image (phase.png and nuclear.png) and its annotation (feature_0.png - edges and feature_1.png - interior) located in the same folder. Our ending point will be a list of all the pixels from the images that we will sample to create our training data.

First, let's load all of the Python packages that we will need.

In [1]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
from cnn_functions import format_coord as cf
from skimage import morphology as morph
import skimage as sk
import scipy as sp
from scipy import ndimage
from skimage import feature
from cnn_functions import get_image
import glob, os, fnmatch
Using Theano backend.

We first have to define some parameters: the maximum number of training examples we want to sample from the images and the size of the patches sampled from them. The window_size variables specify the number of pixels sampled in each direction around a central pixel to create one training image. For example, if the window size is 30, the sampled image will be 2*30+1 = 61 pixels on each side.

In [2]:
max_training_examples = 10000000
window_size_x = 30
window_size_y = 30
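
To make the patch geometry concrete, here is a minimal sketch (not part of the script) of how one patch would be cut out around a pixel at (row, col) of a 2D image img; all three names are hypothetical stand-ins:

# Sketch: extract one training patch centered on (row, col).
patch = img[row - window_size_x : row + window_size_x + 1,
            col - window_size_y : col + window_size_y + 1]
# patch.shape == (2*window_size_x + 1, 2*window_size_y + 1) == (61, 61)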

Next we need to define the directories where the training data is located. In this example, we will train on images of 3T3 cells. We've annotated 3 images and would like to use all 3. Note that the channel_names variable contains a list of all the channels that will be loaded. The channel names must be present in the names of the image files for the program to load them.

In [3]:
direc_name = '/home/vanvalen/DeepCell2/training_data/3T3'
file_name_save = os.path.join('/home/vanvalen/DeepCell2/training_data_npz/', '3T3_all_61x61.npz')
training_direcs = ["set1", "set2", "set3"] 
channel_names = ["phase", "nuclear"]
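
The matching rule applied by the loader below is a simple wildcard match on the file name. A quick illustration with hypothetical file names:

# Sketch: a file is loaded for a channel if its name contains the channel name.
print(fnmatch.fnmatch("img_000_phase.png", "*phase*"))    # True - loaded as phase
print(fnmatch.fnmatch("img_000_nuclear.png", "*phase*"))  # False - skipped for phase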

We also need to specify the number of features (cell edges and cell interior) that we have annotated. Furthermore, we need to identify which features are edges and whether we want to augment those edges with a dilation operation (https://en.wikipedia.org/wiki/Dilation_%28morphology%29).

In [4]:
num_of_features = 2
is_edge_feature = [1,0]
dil_radius = 1
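
To see what the dilation does, here is a small self-contained sketch (not part of the script): a one-pixel-wide edge dilated with a disk of radius 1 becomes three pixels wide, giving us more edge pixels to sample from.

# Sketch: dilate a thin synthetic edge with a disk structuring element.
edge = np.zeros((5, 5), dtype=bool)
edge[2, :] = True                       # a one-pixel-wide horizontal edge
strel = sk.morphology.disk(dil_radius)  # radius-1 disk (a 3x3 cross)
thick_edge = sk.morphology.binary_dilation(edge, selem=strel)
print("%d -> %d edge pixels" % (np.sum(edge), np.sum(thick_edge)))  # 5 -> 15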

Next we need to get a list of all the files in the training directories and initialize some variables.

In [5]:
num_direcs = len(training_direcs)
num_channels = len(channel_names)

imglist = []
for direc in training_direcs:
	imglist += os.listdir(os.path.join(direc_name, direc))

# Load one file to get image sizes
img_temp = get_image(os.path.join(direc_name,training_direcs[0],imglist[0]))
image_size_x, image_size_y = img_temp.shape

# Initialize arrays for the training images and the feature masks
channels = np.zeros((num_direcs, num_channels, image_size_x, image_size_y), dtype='float32')
feature_mask = np.zeros((num_direcs, num_of_features + 1, image_size_x, image_size_y))
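
Note the extra slot in feature_mask: it holds the annotated features plus a background class that is computed below. A quick shape check (a sketch, not in the script):

print(channels.shape)      # (3, 2, image_size_x, image_size_y)
print(feature_mask.shape)  # (3, 3, image_size_x, image_size_y): 2 features + background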

This next block of code is responsible for loading all of the images, normalizing the training data, augmenting the edges by morphological dilation (if chosen) and constructing a feature map for the background.

In [6]:
# Load training images
direc_counter = 0
for direc in training_direcs:
	imglist = os.listdir(os.path.join(direc_name, direc))
	channel_counter = 0

	# Load channels
	for channel in channel_names:
		for img in imglist: 
			if fnmatch.fnmatch(img, r'*' + channel + r'*'):
				channel_file = img
				channel_file = os.path.join(direc_name, direc, channel_file)
				channel_img = get_image(channel_file)
			
				# Normalize the images
				p50 = np.percentile(channel_img, 50)
				channel_img /= p50

				avg_kernel = np.ones((2*window_size_x + 1, 2*window_size_y + 1))
				channel_img -= ndimage.convolve(channel_img, avg_kernel)/avg_kernel.size

				channels[direc_counter,channel_counter,:,:] = channel_img
				channel_counter += 1

	# Load feature mask
	for j in xrange(num_of_features):
		feature_name = "feature_" + str(j) + r".*"
		for img in imglist:
			if fnmatch.fnmatch(img, feature_name):
				feature_file = os.path.join(direc_name, direc, img)
				feature_img = get_image(feature_file)

				if np.sum(feature_img) > 0:
					feature_img /= np.amax(feature_img)

				if is_edge_feature[j] == 1:
					strel = sk.morphology.disk(dil_radius)
					feature_img = sk.morphology.binary_dilation(feature_img, selem = strel)

				feature_mask[direc_counter,j,:,:] = feature_img

	# Thin the augmented edges by subtracting the interior features.
	for j in xrange(num_of_features):
		if is_edge_feature[j] == 1:
			for k in xrange(num_of_features):
				if is_edge_feature[k] == 0:
					feature_mask[direc_counter,j,:,:] -= feature_mask[direc_counter,k,:,:]
			feature_mask[direc_counter,j,:,:] = feature_mask[direc_counter,j,:,:] > 0

	# Compute the mask for the background
	feature_mask_sum = np.sum(feature_mask[direc_counter,:,:,:], axis = 0)
	feature_mask[direc_counter,num_of_features,:,:] = 1 - feature_mask_sum

	direc_counter += 1
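
At this point every pixel should belong to exactly one class - edge, interior, or background. A quick sanity check you could add (a sketch, assuming the annotation images are strictly binary):

# Sketch: the per-pixel class masks should sum to 1 everywhere.
assert np.all(np.sum(feature_mask, axis=1) == 1)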

Plot the training data to double-check that it has loaded correctly. If you're running the script on a headless server, this part may need to be commented out in the make_training_data.py script.

In [7]:
fig,ax = plt.subplots(len(training_direcs),num_of_features+1, squeeze = False)
for j in xrange(len(training_direcs)):
	ax[j,0].imshow(channels[j,0,:,:],cmap=plt.cm.gray,interpolation='nearest')
	# Bind the current image via a default argument so each subplot's
	# format_coord reports its own pixel values (avoids late binding of j).
	def form_coord(x, y, img=channels[j,0,:,:]):
		return cf(x, y, img)
	ax[j,0].format_coord = form_coord
	ax[j,0].axes.get_xaxis().set_visible(False)
	ax[j,0].axes.get_yaxis().set_visible(False)

	for k in xrange(1,num_of_features+1):
		ax[j,k].imshow(feature_mask[j,k-1,:,:],cmap=plt.cm.gray,interpolation='nearest')
		def form_coord(x, y, img=feature_mask[j,k-1,:,:]):
			return cf(x, y, img)
		ax[j,k].format_coord = form_coord
		ax[j,k].axes.get_xaxis().set_visible(False)
		ax[j,k].axes.get_yaxis().set_visible(False)
plt.tight_layout()
plt.show()

We've written this code to sample equally from each training image. We first identify the feature with the fewest examples (typically edges), then find the training data set with the smallest number of pixels in that feature category. We then sample that number of pixels per class from each of the training data sets (if possible).

In [8]:
feature_mask_trimmed = feature_mask[:,:,window_size_x+1:-window_size_x-1,window_size_y+1:-window_size_y-1] 
feature_rows = []
feature_cols = []
feature_batch = []
feature_label = []

edge_num = np.Inf
for j in xrange(feature_mask_trimmed.shape[0]):
	num_of_edge_pixels = 0
	for k in xrange(len(is_edge_feature)):
		if is_edge_feature[k] == 1:
			num_of_edge_pixels += np.sum(feature_mask_trimmed[j,k,:,:])

	if num_of_edge_pixels < edge_num:
		edge_num = num_of_edge_pixels

min_pixel_counter = edge_num
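
If you want to see which training set is the limiting one, a short diagnostic (not part of the script) could be:

# Sketch: report each training set's edge pixel count and the resulting cap.
for j in xrange(feature_mask_trimmed.shape[0]):
	n_edge = 0
	for k in xrange(len(is_edge_feature)):
		if is_edge_feature[k] == 1:
			n_edge += np.sum(feature_mask_trimmed[j, k, :, :])
	print("%s: %d edge pixels" % (training_direcs[j], n_edge))
print("Sampling at most %d pixels per class per set" % min_pixel_counter)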

Once we know how many pixels to sample, we can identify which pixels to sample from the training data. Each pixel is identified by 3 values: its row (feature_rows), its column (feature_cols) and which training image it came from (feature_batch). All 3 are saved as arrays with the corresponding names, and the class of each pixel is stored in feature_label.

In [9]:
for direc in xrange(channels.shape[0]):

	for k in xrange(num_of_features + 1):
		feature_counter = 0
		feature_rows_temp, feature_cols_temp = np.where(feature_mask[direc,k,:,:] == 1)

		# Check to make sure the features are actually present
		if len(feature_rows_temp) > 0:

			# Randomly permute index vector
			non_rand_ind = np.arange(len(feature_rows_temp))
			rand_ind = np.random.choice(non_rand_ind, size = len(feature_rows_temp), replace = False)

			for i in rand_ind:
				if feature_counter < min_pixel_counter:
					if (feature_rows_temp[i] - window_size_x > 0) and (feature_rows_temp[i] + window_size_x < image_size_x): 
						if (feature_cols_temp[i] - window_size_y > 0) and (feature_cols_temp[i] + window_size_y < image_size_y):
							feature_rows += [feature_rows_temp[i]]
							feature_cols += [feature_cols_temp[i]]
							feature_batch += [direc]
							feature_label += [k]
							feature_counter += 1

feature_rows = np.array(feature_rows,dtype = 'int32')
feature_cols = np.array(feature_cols,dtype = 'int32')
feature_batch = np.array(feature_batch, dtype = 'int32')
feature_label = np.array(feature_label, dtype = 'int32')
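
As a quick check that the sampling came out balanced (a sketch, not part of the script), count how many examples each class received; class indices 0 to num_of_features - 1 are the annotated features and index num_of_features is the background:

# Sketch: number of sampled examples per class.
print(np.bincount(feature_label))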

Next, we cap the number of training examples so that it does not exceed the maximum number allowed. We also randomly shuffle the order of the training examples.

In [10]:
# Randomly select training points 
if len(feature_rows) > max_training_examples:
	non_rand_ind = np.arange(len(feature_rows))
	rand_ind = np.random.choice(non_rand_ind, size = max_training_examples, replace = False)

	feature_rows = feature_rows[rand_ind]
	feature_cols = feature_cols[rand_ind]
	feature_batch = feature_batch[rand_ind]
	feature_label = feature_label[rand_ind]

# Randomize
non_rand_ind = np.arange(len(feature_rows))
rand_ind = np.random.choice(non_rand_ind, size = len(feature_rows), replace = False)

feature_rows = feature_rows[rand_ind]
feature_cols = feature_cols[rand_ind]
feature_batch = feature_batch[rand_ind]
feature_label = feature_label[rand_ind]
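
As a side note, drawing len(feature_rows) indices without replacement is just a random permutation, so the shuffle above could equivalently be written as:

rand_ind = np.random.permutation(len(feature_rows))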

Finally, let's save the training data as an npz file.

In [11]:
np.savez(file_name_save, channels = channels, y = feature_label, batch = feature_batch, pixels_x = feature_rows, pixels_y = feature_cols, win_x = window_size_x, win_y = window_size_y)
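
As a final sanity check (assumed usage, not part of this script), you can load the .npz file back and cut out one patch to confirm that the saved coordinates line up with the image stack:

# Sketch: reconstruct one training patch from the saved arrays.
training_data = np.load(file_name_save)
b = training_data['batch'][0]
px = training_data['pixels_x'][0]
py = training_data['pixels_y'][0]
patch = training_data['channels'][b, :,
	px - window_size_x : px + window_size_x + 1,
	py - window_size_y : py + window_size_y + 1]
print(patch.shape)            # (num_channels, 61, 61)
print(training_data['y'][0])  # class label of this patch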