First, we need a good understanding of the serial implementation. What happens in the channel solver is very simple. Here is the equation (I've skipped the derivation of this equation, but you can look it up elsewhere):

P0 is the current state of the channel and P1 is its new state. h is the small time increment over which the state changes from P0 to P1. Alpha and beta are the rates of opening and closing of the channels.

Solving this equation for each of the channels in each of the compartments is all that happens in the channel calculations. Alpha and beta change based on the current voltage of the cell. MOOSE takes care of this by maintaining a lookup table in which different values of voltage V map to unique values of alpha and beta. So before this calculation is performed, the lookup table is consulted to get the values at the current voltage of the cell. There are multiple types of channels, each of which has its own unique values of alpha and beta for each voltage. So there are multiple lookup tables - one for each type of channel.

This is how the serial algorithm works:

So for each channel of each compartment, the row to be looked up is found using the voltage, while the column is decided based on the type of channel. This is done during initialization and stored in an array, so it only needs to be read sequentially here.
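The serial loop described above can be sketched as follows. All names here are hypothetical, and since the update equation itself was skipped, a common implicit-Euler form is assumed for the arithmetic - treat this as an illustration of the flow, not MOOSE's actual code:

```cpp
#include <vector>
#include <cassert>
#include <cmath>

// Hypothetical structures; one lookup table per channel type.
struct LookupTable {
    double vMin, dV;                 // table rows cover vMin, vMin+dV, ...
    std::vector<double> alpha, beta; // one entry per voltage row
};

// One update step. The formula is an assumed implicit-Euler form:
// P1 = (P0 + h*alpha) / (1 + h*(alpha + beta)).
double advanceChannel(double P0, double h, double a, double b) {
    return (P0 + h * a) / (1.0 + h * (a + b));
}

void serialChannelStep(std::vector<double>& state,
                       const std::vector<double>& voltage,      // per compartment
                       const std::vector<int>& channelType,     // per channel
                       const std::vector<int>& compartment,     // per channel
                       const std::vector<LookupTable>& tables,  // one per type
                       double h) {
    for (std::size_t c = 0; c < state.size(); ++c) {
        const LookupTable& t = tables[channelType[c]];
        // Row chosen by the compartment's current voltage,
        // table chosen by the channel's type.
        int row = (int)((voltage[compartment[c]] - t.vMin) / t.dV);
        state[c] = advanceChannel(state[c], h, t.alpha[row], t.beta[row]);
    }
}
```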

The key point to be made here is that all of the calculations for each channel happen completely independently of the other channels, both in the same cell and in other cells. Each of these calculations can be done in parallel.

It might seem pretty straightforward to parallelize this calculation - just calculate one channel on one node of the GPU. And theoretically, that's all that needs to be done. But there are a few complications that need to be addressed. Firstly, in calculations like this, the actual arithmetic operations usually take only a small portion of the time. The majority of the time is spent just transferring the data onto the GPU memory before the calculations and back afterwards. So this memory transfer has to be optimized as much as possible for the parallelization to have any effect on processing time. The way I implemented this is to generate arrays of all the input voltages and channel types. The arrays are filled up sequentially with the current voltage values and then transferred to GPU memory. The GPU then performs the lookup and arithmetic operations in parallel on each of the inputs and replaces the old voltage values with new ones, which are then transferred back to the CPU. Here's a diagram to make things more clear:
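The pack-transfer-compute-unpack pipeline can be sketched like this. Names are hypothetical, the table lookup is stubbed out with a made-up rate, and the kernel is simulated serially on the host; the comments mark where the cudaMemcpy calls and the kernel launch would sit on the real GPU path:

```cpp
#include <vector>
#include <cassert>
#include <cmath>

// Flat arrays packed sequentially on the CPU, then shipped to the GPU.
struct Batch {
    std::vector<double> state;   // one slot per channel
    std::vector<int>    type;    // channel type (selects a lookup table)
    std::vector<double> voltage; // voltage of the channel's compartment
};

// What thread i would execute on the device. The lookup is stubbed:
// alpha = beta = 1 + type (purely illustrative).
void perChannel(Batch& batch, std::size_t i, double h) {
    double a = 1.0 + batch.type[i];
    double b = 1.0 + batch.type[i];
    batch.state[i] = (batch.state[i] + h * a) / (1.0 + h * (a + b));
}

void runBatch(Batch& batch, double h) {
    // cudaMemcpy(host -> device) would go here
    for (std::size_t i = 0; i < batch.state.size(); ++i)
        perChannel(batch, i, h);   // these iterations run in parallel on the GPU
    // cudaMemcpy(device -> host) would go here
}
```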

There are of course a lot of optimizations that can be done on my implementation that I didn't have the time to get around to. The first, and most obvious, is to make use of the architectural flexibility provided by CUDA more effectively. CUDA allows users to spawn a bunch of blocks, each of which has a set of threads running within it. The threads within a block share a common 'shared' memory which is faster than global memory, so its use is highly recommended wherever possible. In the current implementation, only 1 block is being spawned with all the threads. This is a rather inefficient implementation. A much better approach would be to assign one block to each type of channel. This would allow the lookup table of just that one channel type to be stored in the shared memory of that block, allowing for much faster lookup times.
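Here is a host-side sketch of that proposed layout. The outer loop stands in for one block per channel type, the local copy stands in for that block's shared memory, and the inner loop for the block's threads; names and the rate formula are made up for illustration:

```cpp
#include <vector>
#include <cassert>
#include <cmath>

void runGrouped(std::vector<double>& state,
                const std::vector<std::vector<std::size_t>>& byType, // channel ids per type
                const std::vector<std::vector<double>>& tables,      // one table per type
                const std::vector<std::size_t>& row,                 // per-channel table row
                double h) {
    for (std::size_t t = 0; t < byType.size(); ++t) {  // one block per channel type...
        std::vector<double> shared = tables[t];        // ...stages its table in __shared__ memory
        for (std::size_t c : byType[t]) {              // the block's threads, one per channel
            double rate = shared[row[c]];              // fast shared-memory lookup
            // assumed update with alpha == beta == rate
            state[c] = (state[c] + h * rate) / (1.0 + 2.0 * h * rate);
        }
    }
}
```

Grouping channels of the same type into the same block also means every thread in a block reads the same table, which is exactly the access pattern shared memory rewards.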

There is still a lot of work to be done in order to make the parallel algorithm provide the sort of speedup that is expected from GPUs. This is only a first attempt at parallelization.

As a starting point, I used the code that was shown in the previous post - The summation from 1 to n. This code was put into a class called GpuInterface in GpuSolver.cu and it also had a GpuSolver.h header file. The files are shown below:

GpuSolver.h

```cpp
#ifndef EXAMPLE6_H
#define EXAMPLE6_H

class GpuInterface
{
public:
    int n[20];
    int y;
    int asize;

    GpuInterface();
    int calculateSum();
    void setY(int);
};

#endif
```

GpuSolver.cu

```cpp
#include <iostream>
#include <cuda.h>
#include "GpuSolver.h"

__global__
void findSumToN(int *n, int limit)
{
    int tId = threadIdx.x;
    for (int i = 0; i <= (int)log2((double)limit); i++)
    {
        // Guard the out-of-range case instead of break-ing out of the
        // loop, so that every thread still reaches __syncthreads().
        if (tId % (int)pow(2.0, (double)(i + 1)) == 0
            && tId + (int)pow(2.0, (double)i) < limit)
        {
            n[tId] += n[tId + (int)pow(2.0, (double)i)];
        }
        __syncthreads();
    }
}

GpuInterface::GpuInterface()
{
    y = 20;
    asize = y * sizeof(int);
    for (int i = 0; i < y; i++)
        n[i] = i;
}

int GpuInterface::calculateSum()
{
    int *n_d;
    cudaMalloc( (void**)&n_d, asize );
    cudaMemcpy( n_d, n, asize, cudaMemcpyHostToDevice );

    dim3 dimBlock( y, 1 );
    dim3 dimGrid( 1, 1 );
    findSumToN<<<dimGrid, dimBlock>>>(n_d, y);

    cudaMemcpy( n, n_d, asize, cudaMemcpyDeviceToHost );
    cudaFree( n_d );
    return n[0];
}

void GpuInterface::setY(int newVal)
{
    // n has room for only 20 ints, so newVal should be <= 20 here.
    y = newVal;
    asize = y * sizeof(int);
    for (int i = 0; i < y; i++)
        n[i] = i;
}
```

And finally, I have a C++ file called main.cpp which just has a main function that creates a GpuSolver object and calls its functions, like so:

```cpp
#include <iostream>
#include "GpuSolver.h"

int main()
{
    GpuInterface obj;
    obj.setY(16);
    std::cout << obj.calculateSum();
    return 0;
}
```

So the task now is to compile all of this into one executable. The key to understanding how we can do this is to understand how exactly the C++ compiler works. Professional computer science courses have a whole subject dedicated to compilers, so I won't go into much detail. I'll just tell you that the compiler first converts the code (which is in a human-readable format) into an intermediate format called assembly, which is then assembled into an object file. The object file contains essentially the same information as the source file, but in a machine-readable format. A separate object file is created for each source file in the project. Then they are all linked together to form one single executable.

Typically, compilers automatically perform the assembly followed by the linking process. However, you can force it to stop after just the assembly, and then do the linking process later on. This is what we will have to do.

We run g++ on main.cpp with the -c flag that instructs g++ to stop compilation after the object files are generated. We also use the -I. flag to ask it to look for header files within the current folder. The -o flag asks the compiler to name the output as whatever string follows the flag (in this case main.cpp.o). The full command looks like:

```shell
g++ -c -I. main.cpp -o main.cpp.o
```

We do a similar compilation on GpuSolver.cu with the following command

```shell
nvcc -c -I. -I/usr/local/cuda/include GpuSolver.cu -o GpuSolver.cu.o
```

Apart from the fact that g++ is replaced with nvcc (the Nvidia CUDA Compiler) here, the only addition is the "-I/usr/local/cuda/include" flag. The path you see there contains the CUDA-specific headers that are required while compiling CUDA programs. So nvcc will need that directory to compile the .cu file.

So now we have a bunch of files in our project directory:

- main.cpp
- GpuSolver.cu
- GpuSolver.h
- main.cpp.o
- GpuSolver.cu.o

We now need to link the two .o files into one executable. We do this with the following command:

```shell
g++ -o exec GpuSolver.cu.o main.cpp.o -L/usr/local/cuda/lib -lcudart
```

Firstly, notice how we are just using the normal g++ call to perform the linking. g++ is clever enough to know that only the linking step is required and skips the assembly automatically. -o, like before, ensures that the output is named 'exec'. The files that need to be linked are specified next, followed by a -L and -l flag. The -L flag asks the compiler to look in the directory specified in the flag for additional libraries that may need to be linked. -l specifies the exact library that needs to be linked - here cudart, the CUDA runtime. Again, this is specific to CUDA.

When all of this has been done, you get a neat little executable that will calculate the sum from 1 to n!

I packed all of these commands into a makefile, which I've put down here:

```makefile
CUDA_INSTALL_PATH := /usr/local/cuda

CXX := g++
CC := gcc
LINK := g++ -fPIC
NVCC := nvcc

# Includes
INCLUDES = -I. -I$(CUDA_INSTALL_PATH)/include

# Common flags
COMMONFLAGS += $(INCLUDES)
NVCCFLAGS += $(COMMONFLAGS)
CXXFLAGS += $(COMMONFLAGS)
CFLAGS += $(COMMONFLAGS)

LIB_CUDA := -L$(CUDA_INSTALL_PATH)/lib -lcudart

OBJS = GpuSolver.cu.o main.cpp.o
TARGET = exec
LINKLINE = $(LINK) -o $(TARGET) $(OBJS) $(LIB_CUDA)

.SUFFIXES: .c .cpp .cu .o

%.c.o: %.c
	$(CC) $(CFLAGS) -c $< -o $@

%.cu.o: %.cu
	$(NVCC) $(NVCCFLAGS) -c $< -o $@

%.cpp.o: %.cpp
	$(CXX) $(CXXFLAGS) -c $< -o $@

$(TARGET): $(OBJS) Makefile
	$(LINKLINE)
```

For those of you wondering what just happened, welcome to the world of Linux makefiles :)

Makefiles are a way to script out the entire compilation and installation process for programs in Linux. They have very weird syntax and there is no way I can explain all of its details here. However, there are excellent tutorials on makefiles elsewhere on the internet, so I'd suggest doing some research.

The main point is that this makefile does almost exactly what I described above, with a bit of extra functionality for things like making sure the resulting executable can be further used in other programs, as opposed to needing to be run manually.

That's it for now! I'll leave the integration into MOOSE for next time.

The task I was given was simple - **get a MOOSE class to calculate the sum of numbers from 1 up to some n, where n is defined by a member of that class**. **Then transfer the computation to the GPU while still maintaining an interface to MOOSE.**

The example class I created is basically the same as the class I wrote about a few posts back. Do refer back to that post if you need a refresher on MOOSE classes.

The only change I made is in the process function. The class now calculates the sum of numbers from 1 up to y_ (one of its data members) and stores the result in x_ (another of its data members). Here's the process function:

```cpp
void Example::process( const Eref& e, ProcPtr p )
{
    int count = 0;
    for (int i = 1; i < y_; i++)
    {
        count += i;
    }
    x_ = count;
}
```

The rest of the class is the same as before.

Now the aim is to shift this computation to the GPU. Of course, since this is a GPU, we also need to develop a parallel algorithm to calculate this sum, instead of just iterating through the array and summing the numbers. Let's create a new project where all the GPU coding can be done and later work on integrating that code into this class.

This section will focus on programming the GPU to calculate the sum of numbers from 1 to n.

**NOTE: You need to have a working installation of CUDA for this part. If you don't have one, a number of tutorials exist to guide you through the installation process. Complete that before you continue.**

Every CUDA program has two parts - the main section and the kernel. The main section runs on the **host** (most often the CPU). Code written here is almost identical to ordinary C++ code. The kernel is where all the magic happens. This code is loaded onto the **device** (the GPU) and will be run once by each computational unit within the GPU.

So the idea is that all the control is in the hands of the CPU, where you write ordinary C++ code to control the flow of instructions, while the actual data processing is done on the GPU, which will typically run small snippets of C++ code with a small amount of local memory, but will run the same code many times in parallel. Together, they make a formidable number-crunching machine!

Let's first take a look at the CPU code:

```cpp
#include <iostream>
#include <cstdlib>   // for EXIT_SUCCESS
#include <cuda.h>

__global__ void findSumToN(int *n, int limit);  // the kernel, shown later

int main()
{
    int n[20] = {0};
    int *n_d;
    int y = 20;

    dim3 dimBlock( y, 1 );
    dim3 dimGrid( 1, 1 );
    const int asize = y * sizeof(int);

    //1) Fill up the array with the numbers 0 to y-1
    for (int i = 0; i < y; i++)
        n[i] = i;

    //2) Allocate memory for the array on the GPU
    cudaMalloc( (void**)&n_d, asize );

    //3) Copy over the array from CPU to GPU
    cudaMemcpy( n_d, n, asize, cudaMemcpyHostToDevice );

    //4) Call the kernel
    findSumToN<<<dimGrid, dimBlock>>>(n_d, y);

    //5) Copy back the array from GPU to CPU
    cudaMemcpy( n, n_d, asize, cudaMemcpyDeviceToHost );

    //6) Free memory on the GPU
    cudaFree( n_d );

    std::cout << "\nSum: " << n[0] << '\n';
    return EXIT_SUCCESS;
}
```

The code is about as straightforward as you can imagine. y is the limit until which we need to sum. In the unlikely case that the code isn't self-explanatory, I have put down the steps here:

**NOTE: GPUs have a number of different levels of memory, each of which have specific tradeoffs between storage space and access speed. I haven't gone into those details here, but they are worth a read.**

- Generate the array holding the integers 0 to y-1.
- Allocate memory on the GPU to hold the array.
- Copy the array from CPU to GPU.
- Launch the kernel with y threads.
- Copy the array from GPU back to CPU.
- Free memory on the GPU.

Let's take a look at some of the interesting aspects of this code.

- The dimBlock and dimGrid definitions you see on top are a consequence of how computational units (CUs) are arranged in CUDA. The GPU contains a large number of computational units which are grouped together (in sets of 32, 64 etc.) to form a block. Blocks are further grouped into a grid. CUDA also provides the helpful feature of identifying CUs within a block and blocks within a grid using 2-dimensional or even 3-dimensional coordinates. The actual arrangement in the GPU will be linear of course, but this interface can be very helpful when dealing with applications in 2D space (like image processing) or 3D space (like point cloud analysis).
- cudaMalloc is the GPU equivalent of malloc. It allocates the specified space on the GPU and points the specified pointer to the start address. The name of the pointer is n_d. The CUDA convention is to append _d to all pointers that point to memory on the device.
- cudaMemcpy is again similar to memcpy, but copies between the CPU and GPU. Note the last parameter, which determines the direction in which the memory is copied.
- findSumToN is a call to a function that we haven't yet defined. This is the kernel, and I will come to it in a moment. Before that, take a look at the triple less-than and greater-than signs. Between them are the dimGrid and dimBlock that we defined earlier. These determine the number of threads that are launched. In our case, dimBlock is (y, 1), so y CUs will be launched per block. Since dimGrid is (1, 1), only one block will be launched. So that is y CUs in total, all launched linearly, within the same block. This is of course not the best way to parallelize, since a block can only hold a limited number of threads (based on the GPU hardware) and there's a good chance y will be more than that. Nevertheless, this is enough for this project.
- cudaFree, as you may imagine, just releases the memory pointed to by the pointer in its argument.

Before we look at the kernel, a brief introduction to the Parallel Reduce Algorithm will be helpful.

What you see above is the reduce algorithm used with addition in its entirety!

The naive way of adding the numbers in an array is of course to take one element (usually the first) and keep adding each of the other elements to it until all the elements have been added. The resulting array will have the sum of the numbers in its first cell.

Here, the end result is the same, but the method is slightly different. We first take every second element (those at even indices, starting from the first) and add to each the element immediately after it. We then take every fourth element and add the element two places away. Then every eighth, and so on, until all the elements have been added.

The great thing about this approach is that it is far more parallelizable than the naive approach. To understand why, take a look at the first set of additions. Each addition takes 2 elements from the first array and puts the result into one element of the second array. None of those additions depend on values computed by any other operation, so all of them can be computed in parallel. The summation will have to pause after that first step though, to make sure all parallel units have finished computing the first level of summation before the next level of summation can be performed. This is called a synchronisation of threads.

So what will this look like in parallel code?

```cpp
int tId = threadIdx.x;

if (tId % 2 == 0)
    n[tId] += n[tId + 1];
__syncthreads();

if (tId % 4 == 0)
    n[tId] += n[tId + 2];
__syncthreads();

if (tId % 8 == 0)
    n[tId] += n[tId + 4];
__syncthreads();

if (tId % 16 == 0)
    n[tId] += n[tId + 8];
```

Here, tId is the ID number of the thread being run. Remember how we started y computational units in that call to findSumToN? This is (a simplified version of) the code that they will run. Each of the y CUs is given a unique thread number which is used to identify it in the code.

So what exactly is happening here? All threads with an odd tId will actually do nothing! This is a waste of CUs and should be avoided in production code. All threads whose tId is a multiple of 2 will enter the first if branch. Here, they will compute the sum of the array element n[tId] and the element immediately after it. The __syncthreads() command instructs the GPU to force all CUs to wait until all other CUs have reached the same point. This ensures that the entire first level of calculations has been done, as mentioned before.

Then, all threads whose tIds are multiples of 4 enter the second if branch, where they add their element to the element two places away. This continues onward.

I have converted all of this into a generic function shown below:

```cpp
__global__
void findSumToN(int *n, int limit)
{
    int tId = threadIdx.x;
    for (int i = 0; i <= (int)log2((double)limit); i++)
    {
        // Guard the out-of-range case instead of break-ing out of the
        // loop, so that every thread still reaches __syncthreads().
        if (tId % (int)pow(2.0, (double)(i + 1)) == 0
            && tId + (int)pow(2.0, (double)i) < limit)
        {
            n[tId] += n[tId + (int)pow(2.0, (double)i)];
        }
        __syncthreads();
    }
}
```

Note that this is far from optimised code! There are way more calls to math functions than are required and many threads will be severely underused. It is, however, a fairly simple example to understand.
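For instance, the pow and log2 calls can be replaced with cheap bit shifts, and "sequential addressing" keeps the active threads contiguous instead of spreading them across even tIds. Here is a host-side simulation of that improved indexing - the inner loop is what the active GPU threads would do in parallel at each step, and the array length is assumed to be a power of two:

```cpp
#include <vector>
#include <cassert>

int reduceSum(std::vector<int> n)
{
    int limit = (int)n.size();  // assumed to be a power of two
    for (int stride = limit / 2; stride > 0; stride >>= 1) {
        // Threads 0..stride-1 are active and contiguous at this step.
        for (int tId = 0; tId < stride; ++tId)
            n[tId] += n[tId + stride];
        // __syncthreads() would go here on the GPU
    }
    return n[0];
}
```

No modulo, no floating-point math, and no idle odd-numbered threads sitting inside an active warp.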

I originally planned to put down the entire process of making the GpuSolver class and integrating it into MOOSE over here, but this is becoming a very long post. So I'll stop here and finish it in the next post.

Now, out of these variables, some of them vary with time, while others stay constant. Let's take some time t=1. We will denote the time-varying variables with a suffix 1. Then the equation for the ith compartment becomes:

Let us keep that equation aside for a moment and look at another bit of the derivation - The backward Euler method of approximation.

Understanding how this approximation is done is very simple with a diagram.

This is a (very) rough diagram to show how the approximation works. The blue line is a plot of some function F. Let's take two points on the X-axis, X0 and X1. We mark their locations on the plot as points C and A. Now, we take the slope of the function at A (given by the solid green line) and extend it back to the vertical line from X0. We then draw a line parallel to the just-drawn line, starting from C and hitting the vertical line from X1. This is an approximation of the function at point X1 given its value at X0 and its slope at X1. This would predict the value of the function at X1 to be D, whereas the actual value is A.
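In symbols, the construction in the diagram corresponds to the textbook backward (implicit) Euler update, stated here for reference:

```latex
y_{n+1} = y_n + h \, f(t_{n+1},\, y_{n+1})
```

Because the slope f is evaluated at the new point (t_{n+1}, y_{n+1}), the unknown appears on both sides of the equation, which is what makes the method implicit.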

This is how future values of the Voltages and Admittances are predicted in MOOSE.

So the equation used when performing Backward Euler Approximations is:

Now, we just replace dV/dt with the equation we got above. Remember that the C term which was on the Left Hand Side is brought down to the denominator of the RHS before substitution. That will give:

Taking all the V1 terms together, we get

We shift that last term to the LHS and take V1 out common to get

This can be simplified as follows by making the following substitutions

Where

So these coefficients you see are what actually goes into the Hines matrix. Aii are the diagonal terms, Aij are the off diagonal terms and Bi are the terms in the B matrix.

Hopefully, this will make the math behind HSolve a little more clear. It certainly helped me understand what was really going on.

The code for the Hines Solver is split across a few classes - HSolve, HSolveActive and HSolvePassive.

The starting point to understanding this code is at hsolve.cpp. Here, a class called HSolve is defined. Objects of this class take over control when the Hines Solver is being used. As can be seen, this is a MOOSE class, with all the required elements - initCinfo, process and reinit functions etc.

Note how the process, reinit and setup functions of the HSolve class call the respective functions in the HSolveActive class, which in turn call the respective functions in the HSolvePassive class.

The HSolveActive class mainly deals with channel current calculations - the changes in current that happen between two compartments. The HSolvePassive class deals with the actual Gauss-Jordan elimination procedure that provides the new voltages at each segment of the cell.

The HSolveActive class also performs calculations for calcium concentrations, which happen simultaneously with the current calculations.

When the target of the Hines Solver is set, the zombify function is called, which disconnects the actual elements of the neuronal model from their clocks (so that they won't be doing any calculations anymore) and generates zombie versions of each element (which just pass processor control to the Hines solver).

My main area of interest is the HSolvePassive Class where the forward elimination and backward substitution take place. So this is what we will look at here.

Since the Hines matrix is a sparse matrix (there are a lot more 0s in the matrix than other numbers), it makes sense to represent it in some other, more compact form while doing calculations. This is the HS_ matrix. It reduces the large Hines matrix to an nx4 matrix, with each compartment having just 4 values. Here is a small example to show how this HS_ matrix is generated.

Consider the following compartmental model. The Hines indices of each compartment have been indicated:

By the Hines method, this will result in the matrix shown below

Where the Ys are the admittances of that particular compartment and Zs are admittances of neighbouring compartments.

As you can see, this is a sparse matrix. To obtain a HS_ matrix from this, we pull out the relevant values from it and put it into another matrix. HS_ looks like:

Where Y1x is the admittance at compartment 1 including external current influences, and X1 is the external current provided at compartment 1.
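As a purely illustrative sketch of the compact storage idea - which quantity sits in which of the 4 columns follows the HS_ layout above, so treat the accessor below as hypothetical:

```cpp
#include <vector>
#include <cassert>

// A flattened n x 4 array: 4 values per compartment instead of a full
// n x n sparse matrix. Column meanings are left abstract here.
struct CompactHines {
    int nCompt;
    std::vector<double> HS_;  // row-major, 4 entries per compartment
    explicit CompactHines(int n) : nCompt(n), HS_(4 * n, 0.0) {}
    double& entry(int compt, int col) { return HS_[4 * compt + col]; }
};
```

The storage cost drops from O(n^2) to O(n), and the elimination passes only ever touch a compartment's own 4-entry row plus the HJ_ junction terms.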

Admittances produced at junctions between two compartments are stored in a separate vector called HJ_. This is quite straightforward: each junction admittance is specified one after the other in a long list of values. The way these junction admittances are calculated is worth taking a look at.

When just two compartments meet at a junction, the admittance between the two is straightforward to calculate. However, when more than two compartments meet at a junction, the junction needs to be broken down into a set of junctions each having just two compartments. For a junction with three compartments, this can be done as shown (The black lines are the actual compartments):

In such a case, the admittance at the junction is calculated using the equation Gij = Gi x Gj / Gsum, where Gsum is the sum over all Gi.
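The formula is simple enough to state as code (G holds the admittances of all compartments meeting at one junction):

```cpp
#include <vector>
#include <cassert>
#include <cmath>

// Pairwise junction admittance: Gij = Gi * Gj / Gsum.
double junctionAdmittance(const std::vector<double>& G, int i, int j)
{
    double Gsum = 0.0;
    for (double g : G) Gsum += g;  // Gsum is the sum over all Gi
    return G[i] * G[j] / Gsum;
}
```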

Remember that the HJ_ vector stores only the values of admittances at all the junctions. There is no information on which junctions have which values! This mapping is done in two other vectors called junction_ and operandBase_.

This is a rough overview of the data structures used in the Hines Solver. Note that this is true only for the current serial implementation. The next post will be on the actual forward elimination and back substitution algorithms that have been implemented.

We saw the general equation of the neuron in the first post. Here's a refresher:

Don't hesitate to scroll down and check out that post if you don't remember this equation.

Let's simplify this equation a bit. We'll remove the resistance considerations and expand that summation to allow for only Sodium and Potassium ion channels, as these account for the vast majority of current changes in the cell. We will also assume that there are no injection currents for now. The new equation then becomes:

Now this equation, as you can see, involves only a derivative in time. So it assumes that voltages are constant over space. However, this is not true. Even within a single compartment, voltages can vary considerably from one end to the other. So the accuracy of models can be increased substantially by integrating over space along with time. This turns the above equation - an Ordinary Differential Equation - into a Partial Differential Equation. Let's see what that looks like:

So this is an equation in which the voltage varies continuously with both space and time! That's a pretty hard equation to solve. A clever trick called the method of lines reduces this PDE into a set of 'coupled' ODEs, which basically means that the ODEs all depend on each other. Although this mutual dependence means that they can't each be integrated independently, they are far easier to solve than PDEs. This gives us the equation below:
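The image is missing here too. A plausible reconstruction of the coupled ODE for compartment i, with all the channel terms folded into the single combined term g(E - V_i), is:

```latex
C_m \frac{dV_i}{dt} = \frac{V_{i-1} - V_i}{R_a} + \frac{V_{i+1} - V_i}{R_a} + g \left( E - V_i \right)
```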

Note that we've also reduced all those gV terms into one for brevity.

This is an interesting equation. It removes the dependence of V on space (x), but not entirely. Instead of a spatial derivative, the voltage of any compartment (i) now depends on the voltage values of the compartments before and after it!

In order to remove the time differential, we apply the Crank-Nicolson method, which takes the equations at t = n and t = n+1, combines them, and brings common terms together to obtain:
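The resulting scheme (reconstructed here, since the original image is gone; f_i stands for the whole right-hand side of the ODE above, divided by C_m) averages the old and new time levels:

```latex
\frac{V_i^{n+1} - V_i^n}{\Delta t} = \frac{1}{2} \left[ f_i\left( V^{n+1} \right) + f_i\left( V^n \right) \right]
```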

Where the superscript of V indicates time, the subscript indicates position, and the coefficients are combinations of the constants appearing in the previous equation.

Now take a look at the equation above. All those coefficients can be folded into constants to give a general equation as follows:
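In that general form (A_i, B_i, C_i and R_i are my labels for the folded constants; the original image is missing), each compartment contributes one relation:

```latex
A_i V_{i-1}^{n+1} + B_i V_i^{n+1} + C_i V_{i+1}^{n+1} = R_i
```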

One point to be noted here: how did the three Vs on the RHS of the previous expression get reduced to a single R term in this one? Well, the entire RHS of the previous equation consists of values at time t = n. We know exactly what the system looks like at that point, so the expression reduces to a single number. This is what R in the above equation denotes.

When represented in matrix form, this equation will form what is called a tri-diagonal matrix. Let's take a look at how that happens.
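Sketched in LaTeX (the original image is missing; blank entries are zeros, and the symbols follow the general per-compartment equation above):

```latex
\begin{bmatrix}
B_1 & C_1 &        &        \\
A_2 & B_2 & C_2    &        \\
    & A_3 & B_3    & \ddots \\
    &     & \ddots & \ddots
\end{bmatrix}
\begin{bmatrix} V_1^{n+1} \\ V_2^{n+1} \\ V_3^{n+1} \\ \vdots \end{bmatrix}
=
\begin{bmatrix} R_1 \\ R_2 \\ R_3 \\ \vdots \end{bmatrix}
```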

This is the matrix that we need to solve in order to get new voltage values (V') from old voltage values (V).

Now this matrix is in the form AX = B. We have A and B and need to solve for X. The way we do this is by inverting A and multiplying it to the left of both sides of the equation to get:

Getting the inverse of A is a rather time-consuming task, and this is the main focus area while parallelizing this algorithm. Fast algorithms exist for general-purpose inversion of matrices, but for this particular application it is wasteful to use them directly: because the matrix is sparse, such algorithms would spend a lot of processor time working on elements that are 0. Instead, we create a new, compact representation of this sparse matrix and work on that instead.

Note: The method above assumes that a compartment has only two neighbouring compartments. What about branches? Compartments can have any number of neighbours at the nodes of branches! It turns out this doesn't affect our calculations all that much. The only change is that there will be some off-tridiagonal elements (elements that are not on the diagonal, lower diagonal or upper diagonal), and Hines' method of node numbering ensures that these stay few in number. Once this is done, we can still solve the matrix, keeping the off-diagonal elements in mind while performing Gaussian elimination.

So that was a pretty intensive post on the computational principles underlying the simulations in MOOSE. However, it puts us in a far better position to understand the actual code that carries out these functions in MOOSE. That is what the next post will be about.

This new example class is going to take two inputs, sum their values, and send the output over to another object. Apart from the header and source file, I will also explain the python script file that makes all of this happen.

First, let's get a sense of the class by taking a look at the header file:

class Example {
    private:
        double x_;
        double y_;
        double output_;
    public:
        Example();
        double getX() const;
        void setX( double x );
        double getY() const;
        void setY( double y );
        void process( const Eref& e, ProcPtr p );
        void reinit( const Eref& e, ProcPtr p );
        void handleX( double arg );
        void handleY( double arg );
        static const Cinfo* initCinfo();
};

Yes, it has gotten considerably bigger since before. However, a large part of this has already been explained before. There are 3 data members now - x_, y_ and output_. x_ and y_ store the values coming in from the external objects (in this case, the object is called a PulseGen). output_ stores the sum of x_ and y_.

The first function is the constructor of the class. This is standard C++, nothing new here. The next four functions are used to get and set the variables x_ and y_. We saw this last time. Then come declarations for process and reinit. These are new functions, and they play a very important role in the framework.

- Process: Every time-step during execution, the MOOSE framework calls the process function once. Thus, this function is responsible for advancing the state of the object at each time-step. In the case of the example, this is where the summation of x_ and y_ will happen.
- Reinit: This is used to reinitialize the object. All its variables are reset to the boundary conditions. Note that every object with a process function must necessarily also have a reinit function.

Finally, the initCinfo function, which as we saw before, is essential to convert this class into a MOOSE class.

Let us take a look at the initCinfo function now:

const Cinfo* Example::initCinfo(){
    //Value Field Definitions
    static ValueFinfo< Example, double > x(
        "x",
        "An example field of an example class",
        &Example::setX,
        &Example::getX
    );
    static ValueFinfo< Example, double > y(
        "y",
        "Another example field of an example class",
        &Example::setY,
        &Example::getY
    );

    //Destination Field Definitions
    static DestFinfo handleX( "handleX",
        "Saves arg value to x_",
        new OpFunc1< Example, double >( &Example::handleX )
    );
    static DestFinfo handleY( "handleY",
        "Saves arg value to y_",
        new OpFunc1< Example, double >( &Example::handleY )
    );
    static DestFinfo process( "process",
        "Handles process call",
        new ProcOpFunc< Example >( &Example::process )
    );
    static DestFinfo reinit( "reinit",
        "Handles reinit call",
        new ProcOpFunc< Example >( &Example::reinit )
    );

    //////////////////////////////////////////////////////////////
    // SharedFinfo Definitions
    //////////////////////////////////////////////////////////////
    static Finfo* procShared[] = {
        &process, &reinit
    };
    static SharedFinfo proc( "proc",
        "Shared message for process and reinit",
        procShared, sizeof( procShared ) / sizeof( const Finfo* )
    );

    static Finfo *exampleFinfos[] =
    {
        &x,        //ValueFinfo
        &y,        //ValueFinfo
        &handleX,  //DestFinfo
        &handleY,  //DestFinfo
        output(),  //SrcFinfo
        &proc,     //SharedFinfo
    };

    static Cinfo exampleCinfo(
        "Example",              // The name of the class in python
        Neutral::initCinfo(),   // Cinfo of the parent class (Neutral)
        exampleFinfos,          // The array of Finfos created above
        sizeof( exampleFinfos ) / sizeof ( Finfo* ), // The number of Finfos
        new Dinfo< Example >()  // A Dinfo that manages the data of class Example
    );

    return &exampleCinfo;
}

Again, much bigger than the old function. But it is not as cryptic as it may seem. Let's dive in!

The first two definitions must be somewhat familiar. They are ValueFinfo definitions, which we saw earlier.

The next four definitions define DestFinfos. handleX and handleY, as mentioned before, handle the callbacks when external objects pass messages into the example class. Take a look at their definition. The first two fields are the name and DocString of the DestFinfo. The third parameter is the function that must be called upon activation. You can see that Example::handleX and Example::handleY are passed in to handleX and handleY respectively. The reason for the fairly complicated syntax is that MOOSE needs to know the type of function that is going to be defined. Here, it is OpFunc1, which means it is a generic function and takes in one parameter.

The next two definitions are process and reinit. Notice that they take ProcOpFuncs, which means they will be executed on every clock tick. Apart from this difference, they are declared very similarly to handleX and handleY.

There is also a declaration of a SrcFinfo in this class which doesn't show up in the initCinfo definition. It is instead declared outside the function as shown:

static SrcFinfo1< double > *output() {
    static SrcFinfo1< double > output(
        "output",
        "Sends out the computed value"
    );
    return &output;
}

I am not yet fully sure of the use of SharedFinfos. I will be updating this space as and when that gets clear.

After that is the exampleFinfos list that we saw earlier. It now holds all of the Finfos initialized above. Note that the SrcFinfo output() is also present in this list.

The declaration of exampleCinfo follows. It looks largely the same as before, except that the number of Finfos in the list has changed. I have used a simple formula to calculate the number of Finfos from the size of the list, utilizing the fact that all Finfo pointers have the same size.

Let us now take a look at each of the function definitions

Example::Example()
    :
    x_( 0.0 ), y_( 0.0 ),
    output_( 0.0 )
{
    ;
}

The first function is the constructor of the class. It simply initialises its 3 variables to 0.

void Example::process( const Eref& e, ProcPtr p )
{
    output_ = x_ + y_;
    printf( "%f\n", output_ );
    output()->send( e, output_ );
}

void Example::reinit( const Eref& e, ProcPtr p )
{
}

The next two functions are standard MOOSE functions. Remember that these are of type ProcOpFunc (as declared in initCinfo).

Process is called at each tick of the clock. Thus, how often it runs depends upon the period of the clock that it has been connected to. This is where the code for changing the state of the object goes. In the case of this example, the process function calculates the sum of x_ and y_ and stores it in output_. It then prints this value and also sends it out through the send function. Since output() is a SrcFinfo, the value sent out will go to the DestFinfo of any objects connected to the output() of this object via messages.

Reinit is called when the model needs to reinitialize itself. In this case, we do not need to do anything upon reinitialization, so reinit is blank.

(Briefly: Eref is a reference to the element on which the call was made, and ProcPtr points to a ProcInfo structure carrying scheduling information such as the current simulation time and the time-step.)

Finally, we take a look at handleX and handleY

void Example::handleX( double arg )
{
    x_ = arg;
}

void Example::handleY( double arg )
{
    y_ = arg;
}

As you can see, these are very simple functions: they simply store the values coming in through the messages into the variables x_ and y_, so that they can be summed and sent out when the process function is called.

And that is about it for this example class! This one definitely had a lot more components than the previous class, but it can actually perform all the basic functions of MOOSE classes: receive data, process it and send it out to other objects.

Here, I will detail some of my example classes, to give some idea of what needs to be done.

The first one was a class that has one data member, x_. It doesn't do anything with this member though; pretty much as much of a dummy class as you can get! Nevertheless, it shows the way a normal C++ class can be elevated to the status of a MOOSE class with initCinfo.

Here is the header file:

class Example {
    private:
        int x_;
    public:
        int getX() const;
        void setX( int x );
        static const Cinfo *initCinfo();
};

It looks fairly similar to a plain C++ class, with a few important additions. Firstly, the private variable x_ is always accompanied by get and set functions (getX() and setX()). Secondly, there is a function called initCinfo(), which provides the functionality required for MOOSE to recognize Example as a class.

This is the code for initCinfo, which forms the bulk of example.cpp:

const Cinfo* Example::initCinfo(){
    static ValueFinfo< Example, int > x(
        "x",
        "An example field of an example class",
        &Example::setX,
        &Example::getX
    );

    static Finfo *exampleFinfos[] = { &x };

    static Cinfo exampleCinfo(
        "Example",              // The name of the class in python
        Neutral::initCinfo(),   // Cinfo of the parent class (Neutral)
        exampleFinfos,          // The array of Finfos created above
        1,                      // The number of Finfos
        new Dinfo< Example >()  // A Dinfo that manages the data of class Example
    );

    return &exampleCinfo;
}

Before we go into the above code, we need to understand a little more about MOOSE classes. Variables and functions in MOOSE classes are stored in special fields called Finfos. They are of 3 main types (ValueFinfo, SrcFinfo and DestFinfo), although many others exist.

- ValueFinfos store a single value. MOOSE implicitly creates a number of helper functions for each ValueFinfo. These functions help with getting and setting values.
- SrcFinfos are used as a connection point for the object to communicate with other objects. As you may have guessed, SrcFinfos can be used to send out data to other objects via messages.
- DestFinfos are the points of reception of messages. They take a function of the class as a parameter and work as a callback. Thus, when a DestFinfo receives a message, it simply calls the associated function and provides it access to the contents of the message.

Let's see how the example class above uses these Finfos. In this case, only one ValueFinfo has been used, and no SrcFinfos or DestFinfos. The declaration consists of the name of the variable, a brief summary (the DocString), the set method and the get method.

After that, there is a definition of exampleFinfos, which is a list of all the addresses of every Finfo. This will go into exampleCinfo, defined immediately after.

exampleCinfo makes the class available to the rest of MOOSE. The first parameter is the name of the class. This is followed by the initializer for the parent class (the default parent class for a new class is Neutral). This is followed by the pointer to the list of Finfos and the number of Finfos present in the list.

The rest of example.cpp is just the definitions for getX and setX:

int Example::getX() const
{
    return x_;
}

void Example::setX( int x )
{
    x_ = x;
}

Also make sure you include header.h and example.h in the file.

That's everything there is to know about this very basic example class. In the next post, I will go into some more details about the Moose API by explaining another example class which has a few more features.

This is the definition of MOOSE given by its creators. It basically provides a framework using which biological systems can be easily modelled and simulated, with comparatively less work from the programmer (than starting the programming from scratch).

The project is hosted at sourceforge here: http://moose.sourceforge.net/

A lot of the information on this page (most of it in fact) is derived from the work of James M Bower and David Beeman in

Neurons work on the basis of exchanging currents between them through synaptic connections. This current exchange happens by virtue of fluctuating potentials within each neuron, caused by a number of chemical and electrical processes within it. Thus, in order to accurately simulate all the processes taking place inside a collection of neurons, their electrical properties must be accurately understood and simulated.

The electrical properties of a neuron are not constant throughout its structure; they vary with space and time. Thus, in order to simulate them, one of two methods can be used:

- A complex function can be formulated that defines its electrical properties at every point, and also specifies the way this varies over time.

- The neuron can be broken up into smaller compartments, each of which will have constant properties, but will interact with each other in order to represent the whole neuron together.

The electrical properties of the compartments can be approximated to the following electrical circuit:

The area inside the dotted lines is one compartment. This will be connected to other compartments on one or both sides. Here is a brief description of the different components of the circuit:

- Ra is the axial resistance. That is, the resistance that will be encountered as current enters the compartment.
- Cm is the capacitance offered by the cell membrane. This happens because the membrane itself is an insulator of current, and there is a potential difference between the inside and outside of the neuron.
- Rm and Em are the membrane resistance and membrane potential respectively, and are caused by the passive ion channels in the neuron.
- Gk is the conductance caused by the active ion channels in the neuron. This is expected to change dynamically during a simulation, hence the variable resistor in its depiction. Ek is the associated potential, called the Equilibrium Potential.
- Iinject is the sudden current flow caused by the firing of action potentials, or provided by inserting an electrode into the compartment.
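The compartment equation itself is missing from this page (it was an image). Reconstructed in LaTeX from the component list above, with Vm' and Vm'' the potentials of the two neighbouring compartments and Ra', Ra'' the corresponding axial resistances, it is:

```latex
C_m \frac{dV_m}{dt} = \frac{E_m - V_m}{R_m}
    + \sum_k G_k \left( E_k - V_m \right)
    + \frac{V_m' - V_m}{R_a'} + \frac{V_m'' - V_m}{R_a''}
    + I_{inject}
```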

This equation depends on Vm’ and Vm’’ which are the potentials of the two neighboring compartments. These compartments have their own equations for their potentials which will depend on their neighbors and so on. Thus, all these coupled equations must be solved in parallel.

Solving for Vm using this equation at every compartment in the neuron, and further for every neuron, will give the required prediction of the state of the model at the next time step. This is the ultimate goal of MOOSE.

This is done by creating a difference equation and solving it using one of multiple methods. The default is the Exponential Euler (EE) method. Other options are the Backward Euler and Crank-Nicolson methods, which are faster to compute but require more work to set up correctly.
