How do I get user input in matrix format to perform matrix addition in PHP? Does anybody have suggestions?
By : Hashem Haghbayan
Date : March 29 2020, 07:55 AM
I hope this helps. Give the user two input boxes: the first for rows and the second for columns. Then generate a table with JavaScript that holds a textbox in each cell, naming each input as an array indexed by its row while you generate it. For example, if the user enters rows: 3, columns: 3: code :
<table>
  <tr>
    <td><input type="text" name="matrix[0][]" value=""/></td>
    <td><input type="text" name="matrix[0][]" value=""/></td>
    <td><input type="text" name="matrix[0][]" value=""/></td>
  </tr>
  <tr>
    <td><input type="text" name="matrix[1][]" value=""/></td>
    <td><input type="text" name="matrix[1][]" value=""/></td>
    <td><input type="text" name="matrix[1][]" value=""/></td>
  </tr>
  <tr>
    <td><input type="text" name="matrix[2][]" value=""/></td>
    <td><input type="text" name="matrix[2][]" value=""/></td>
    <td><input type="text" name="matrix[2][]" value=""/></td>
  </tr>
</table>
<?php
$matrixArr = $_POST['matrix']; // a two-dimensional array holding the matrix values
?>

Which is best among hybrid CPU-GPU, GPU-only, and CPU-only for implementing large matrix addition or matrix multiplication?
By : Ivo Ivanov
Date : March 29 2020, 07:55 AM
The problem with hybrid CPU-GPU computations where you need the result back on the CPU is the latency between the two. If you launch a computation on the GPU and wait for the result on the CPU, there can easily be several milliseconds of delay from starting the computation to getting the results back, so the amount of work done on the GPU should be significant, or you need a significant amount of CPU work to do between launching the GPU computation and reading back its results. Performing a 1000-element matrix addition is a tiny amount of work, so you would be better off performing the entire computation on the CPU instead.

You also have the overhead of transferring the data back and forth between the CPU and GPU across the PCI bus, which adds to the cost, so computations that transfer only a small amount of data between the two lean more towards a hybrid solution. If you never need to read the result back from the GPU to the CPU, the latency issue disappears: for example, you could run an N-body simulation on the GPU and perform the visualization on the GPU as well, never needing the result on the CPU. But the moment you need the simulation result back on the CPU, you have to deal with the latency.

Program crashes at: (1) matrix multiplication; and (2) failed matrix addition/subtraction
By : Colin Telfer
Date : March 29 2020, 07:55 AM
This should help. You posted a copy constructor and an assignment operator. Your assignment operator has four major issues: (1) you should pass the parameter by const reference, not by value; (2) you should return a reference to the current object, not a brand-new object; (3) if new throws an exception during assignment, you have corrupted your object by deleting its memory beforehand; and (4) it is redundant, since the same code already appears in your copy constructor. The copy-and-swap idiom addresses all four: code :
#include <algorithm>
//...
Matrix& Matrix::operator=(const Matrix& aMatrix)
{
    Matrix temp(aMatrix);
    swap(*this, temp);
    return *this;
}

void Matrix::swap(Matrix& left, Matrix& right)
{
    std::swap(left.rows, right.rows);
    std::swap(left.cols, right.cols);
    std::swap(left.element, right.element);
}
Matrix& Matrix::operator+=(const Matrix& aMatrix)
{
    if (rows != aMatrix.rows || cols != aMatrix.cols)
        throw SomeException();
    for (int i = 0; i < rows; i++)
    {
        for (int x = 0; x < cols; x++)
            element[i][x] += aMatrix.element[i][x];
    }
    return *this;
}
Matrix Matrix::operator+(const Matrix& aMatrix)
{
    Matrix temp(*this);
    temp += aMatrix;
    return temp;
}

CUDA MAGMA matrix-matrix addition kernel
By : Muhammad Yusran
Date : March 29 2020, 07:55 AM
Hope this helps. You tried using a similar format to the magmablas_sgeadd_q kernel, but you are not getting proper output, and every run produces a different result. There were two coding errors in the code you posted: code :
if ( ind < m ) {
    dA += ind + iby*ldda;
    dB += ind + iby*lddb;
    dC += ind + iby*lddb; // add this line

for (int i = 0; i < m; ++i)
{
    for (int j = 0; j < n ; j ++)
        h_A[i*m+j] = rand()/(float)RAND_MAX;
$ cat t1213.cu
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#define BLK_X 2
#define BLK_Y 1
__global__ void matrixAdd2( const float *dA, const float *dB, float *dC, int m, int n)
{
    int ldda = m;
    int lddb = m;
    int ind = blockIdx.x*BLK_X + threadIdx.x;
    int iby = blockIdx.y*BLK_Y;
    /* check if full block-column */
    bool full = (iby + BLK_Y <= n);
    /* do only rows inside matrix */
    if ( ind < m ) {
        dA += ind + iby*ldda;
        dB += ind + iby*lddb;
        dC += ind + iby*lddb;
        if ( full )
        {
            // full block-column
            #pragma unroll
            for( int j=0; j < BLK_Y; ++j )
            {
                dC[j*lddb] = dA[j*ldda] + dB[j*lddb];
                printf("A is %f, B is %f, C is %f \n",dA[j*ldda],dB[j*lddb],dC[j*lddb]);
            }
        }
        else
        {
            // partial block-column
            for( int j=0; j < BLK_Y && iby+j < n; ++j )
            {
                dC[j*lddb] = dA[j*ldda] + dB[j*lddb];
                printf("partial: A is %f, B is %f, C is %f \n",dA[j*ldda],dB[j*lddb],dC[j*lddb]);
            }
        }
    }
}
int main ( void )
{
    int m = 4; // rows of the m x n matrices
    int n = 2; // columns of the m x n matrices
    size_t size = m * n * sizeof(float);
    printf("Matrix addition of %d rows and %d columns \n", m, n);
    // allocate matrices on the host
    float *h_A = (float *)malloc(size); // m x n matrix A on the host
    float *h_B = (float *)malloc(size); // m x n matrix B on the host
    float *h_C = (float *)malloc(size); // m x n result matrix C on the host
    // Initialize the host input matrices
    for (int i = 0; i < n; ++i)
    {
        for (int j = 0; j < m ; j ++)
        {
            h_A[i*m+j] = rand()/(float)RAND_MAX;
            h_B[i*m+j] = rand()/(float)RAND_MAX;
        }
    }
    // Allocate the device input matrix A
    float *d_A = NULL;
    cudaError_t err = cudaMalloc((void **)&d_A, size); // d_A: m x n matrix A on the device
    // Allocate the device input matrix B
    float *d_B = NULL;
    err = cudaMalloc((void **)&d_B, size);
    // Allocate the device output matrix C
    float *d_C = NULL;
    err = cudaMalloc((void **)&d_C, size);
    // Copy the host input matrices A and B to the device
    printf("Copy input data from the host memory to the CUDA device\n");
    err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    // defining number of threads and blocks
    dim3 threads( BLK_X, BLK_Y );
    dim3 grid((int)ceil(m/BLK_X),(int)ceil(n/BLK_Y) );
    // Launching kernel
    matrixAdd2<<<grid, threads, 0>>>(d_A, d_B, d_C, m, n);
    // Copy the device result matrix back to the host result matrix
    printf("Copy output data from the CUDA device to the host memory\n");
    err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    // print A matrix
    printf("Matrix A");
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < m; j++)
        {
            printf(" %f", h_A[i*m+j]);
        }
        printf("\n");
    }
    // print B matrix if required
    printf("Matrix B");
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < m; j++)
        {
            printf(" %f", h_B[i*m+j]);
        }
        printf("\n");
    }
    int flag = 0;
    // Error checking
    printf("Matrix C ");
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < m; j++)
        {
            printf("%f", h_C[i*m+j]);
            if(h_C[i*m+j] == h_A[i*m+j] + h_B[i*m+j] )
            {
                flag = flag + 1;
            }
        }
        printf("\n");
    }
    if(flag==m*n)
    {
        printf("Test PASSED\n");
    }
    // Free device global memory
    err = cudaFree(d_A);
    err = cudaFree(d_B);
    err = cudaFree(d_C);
    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);
    err = cudaDeviceReset();
    printf("Done\n");
    return 0;
}
$ nvcc -o t1213 t1213.cu
$ cuda-memcheck ./t1213
========= CUDA-MEMCHECK
Matrix addition of 4 rows and 2 columns
Copy input data from the host memory to the CUDA device
Copy output data from the CUDA device to the host memory
A is 0.277775, B is 0.553970, C is 0.831745
A is 0.477397, B is 0.628871, C is 1.106268
A is 0.364784, B is 0.513401, C is 0.878185
A is 0.952230, B is 0.916195, C is 1.868425
A is 0.911647, B is 0.197551, C is 1.109199
A is 0.335223, B is 0.768230, C is 1.103452
A is 0.840188, B is 0.394383, C is 1.234571
A is 0.783099, B is 0.798440, C is 1.581539
Matrix A 0.840188 0.783099 0.911647 0.335223
0.277775 0.477397 0.364784 0.952230
Matrix B 0.394383 0.798440 0.197551 0.768230
0.553970 0.628871 0.513401 0.916195
Matrix C 1.2345711.5815391.1091991.103452
0.8317451.1062680.8781851.868425
Test PASSED
Done
========= ERROR SUMMARY: 0 errors
$

Why is matrix addition slower than matrix-vector multiplication in Eigen?
By : Borna Morovic
Date : March 29 2020, 07:55 AM
Short answer: you counted arithmetic operations but neglected to count memory accesses, and the addition case performs nearly 2x as many costly loads. Details below.

First of all, the practical number of arithmetic operations is the same for both, because modern CPUs are able to perform one independent addition and one multiplication at the same time. Two sequential operations like x*y+z can even be fused into a single instruction with the same cost as one addition or one multiplication. If your CPU supports FMA, this is what happens with -march=native, but I doubt FMA plays any role here.

