Reasoning behind shifting over the text when a mismatch occurs in KMP algorithm?

I am not sure that your difficulty is limited to this one point, so, if
you don't mind, I'll just describe (with as much explanation as
possible) the whole algorithm. The answer to your question is probably
in the last paragraph, but you'd better read it all to understand my
terminology better.
During the KMP algorithm, you are actually computing nearly the
same values as in the table (this is usually called the prefix function).
So when you get to position i in the text, you need to compute the
maximum length of a substring of the text ending at position i which
equals some prefix of the pattern. It is quite clear that you have
found the pattern in the text if and only if the length of this
substring equals the length of the pattern.
So, how do you compute this prefix-function value fast? (I suppose that
you currently compute it using some O(n^2) algorithm, which is
not fast enough.)
Let's suppose that we've already done everything for the first
i-1 symbols of the text and we are now working with
position i. We will also need the prefix-function value
for the previous symbol of the text: p[i-1].
Let's compare text[i] and pattern[p[i-1]] (indexing from 0, if
you don't mind). We already know that pattern[0 : p[i-1]-1] ==
text[i-p[i-1] : i-1]: that's the definition of
p[i-1]. So, if text[i] == pattern[p[i-1]],
we now know that pattern[0 : p[i-1]] == text[i-p[i-1] : i],
and that's why p[i] = p[i-1] + 1. But the interesting part starts
when text[i] != pattern[p[i-1]].
When these symbols are different, we start jumping. The reason for
that is that we want to find the next possible prefix as fast as we
can. So, how do we do it? Just look at the picture and
follow the explanation (the yellow parts are the matching substrings
found for text[i-1]). We're trying to find some string
s: s = s1 + text[i], where s1 is a prefix of the
pattern. Because of the prefix-function definition, s1 must also be a
suffix of the part matched so far, i.e. s1 = s2. But we
already know (from finding the value for text[i-1]) that the
two yellow parts in the picture are the same, so s3
actually equals s2, and so s3 = s1. So we can
find the length of s1: it is table[p[i-1]].
So now if c1 == text[i] (where c1 is the pattern character right
after s1), we should stop: we've found
p[i], it is s1.length + 1. And if
c1 != text[i], we can just repeat the same jumping, looking
now at the first table[table[p[i-1]]] symbols of the
pattern, and so we go on until we find the answer, or until we get to the
first 0 symbols; in that case p[i] := 0.
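The whole procedure described above can be sketched in code. This is a hypothetical minimal Python version (mine, not from the question): table is the prefix table discussed above, and the inner while loop is exactly the "jumping" on mismatch.

```python
def prefix_table(pattern):
    # table[k] = length of the longest proper prefix of pattern[:k]
    # that is also a suffix of pattern[:k]; table[0] = -1 by convention
    table = [0] * (len(pattern) + 1)
    table[0] = -1
    k = 0
    for i in range(1, len(pattern)):
        while k >= 0 and pattern[i] != pattern[k]:
            k = table[k]  # jump to the next shorter border
        k += 1
        table[i + 1] = k
    return table

def kmp_search(pattern, text):
    table = prefix_table(pattern)
    matches = []
    k = 0  # current prefix-function value, i.e. p[i-1]
    for i, c in enumerate(text):
        while k >= 0 and c != pattern[k]:
            k = table[k]  # the "jumping" from the explanation above
        k += 1
        if k == len(pattern):
            # the matched prefix is the whole pattern: report a match
            matches.append(i - len(pattern) + 1)
            k = table[k]
    return matches
```

Each text symbol only ever increases k by one, and every jump decreases k, so the total work is linear in the lengths of the text and pattern.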



KMP algorithm with one mismatch allowed in Java 
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.commons.lang3.StringUtils;

public class KMPStringSearch {

    /**
     * Searches for all occurrences of the word in the sentence. Runs in O(n + k)
     * where n is the word length and k is the sentence length.
     *
     * @param word The word that is being searched.
     * @param sentence The sentence over which the search happens.
     * @return The list of starting indices of the matched word in the sentence.
     *         Empty list in case of no match.
     */
    public List<Integer> searchString(final String word, final String sentence) {
        final List<Integer> matchedIndices = new ArrayList<>();
        final int sentenceLength = sentence.length();
        final int wordLength = word.length();
        int beginMatch = 0; // the starting position in the sentence from which the match started
        int idxWord = 0;    // the index of the character of the word being compared to a character in the sentence
        final List<Integer> partialTable = createPartialMatchTable(word);
        while (beginMatch + idxWord < sentenceLength) {
            if (word.charAt(idxWord) == sentence.charAt(beginMatch + idxWord)) {
                // the characters have matched
                if (idxWord == wordLength - 1) {
                    // the word is complete. we have a match.
                    matchedIndices.add(beginMatch);
                    // restart the search
                    beginMatch = beginMatch + idxWord - partialTable.get(idxWord);
                    if (partialTable.get(idxWord) > -1) {
                        idxWord = partialTable.get(idxWord);
                    } else {
                        idxWord = 0;
                    }
                } else {
                    idxWord++;
                }
            } else {
                // mismatch. shift the word along and restart the search.
                beginMatch = beginMatch + idxWord - partialTable.get(idxWord);
                if (partialTable.get(idxWord) > -1) {
                    idxWord = partialTable.get(idxWord);
                } else {
                    idxWord = 0;
                }
            }
        }
        return Collections.unmodifiableList(matchedIndices);
    }

    /**
     * Creates the Partial Match Table for the word. Runs in O(n) where n is the
     * length of the word.
     *
     * @param word The word whose Partial Match Table is required.
     * @return The table as a list of integers.
     */
    public List<Integer> createPartialMatchTable(final String word) {
        if (StringUtils.isBlank(word)) {
            return Collections.emptyList();
        }
        final int length = word.length();
        final List<Integer> partialTable = new ArrayList<>(length + 1);
        partialTable.add(-1);
        partialTable.add(0);
        for (int idx = 1; idx < length; idx++) {
            // fall back through successively shorter borders until the next
            // character extends one of them (or no border is left)
            int cnd = partialTable.get(idx);
            while (cnd > 0 && word.charAt(idx) != word.charAt(cnd)) {
                cnd = partialTable.get(cnd);
            }
            if (word.charAt(idx) == word.charAt(cnd)) {
                partialTable.add(cnd + 1);
            } else {
                partialTable.add(0);
            }
        }
        return Collections.unmodifiableList(partialTable);
    }
}

What character-shifting / pseudo-encryption algorithm is used here? 
is anybody familiar with this sort of obfuscation scheme in the
Windows world?
Once you understand it correctly, it's just a trivial rotation
cipher like ROT13.
Why would anyone use this?
Well, in general, this is very common. Let's say you have some data
that you need to obfuscate. But the decryption algorithm and key have
to be embedded in software that the viewers have. There's no point
using something fancy like AES, because someone can always just dig
the algorithm and key out of your code instead of cracking AES. An
encryption scheme that's even marginally harder to crack than finding
the hidden key is just as good as a perfect encryption scheme—that
is, good enough to deter casual viewers, and useless against serious
attackers. (Often you aren't even really worried about
stopping attacks, but about proving after the fact that your
attacker must have acted in bad faith for contractual/legal reasons.)
So, you use either a simple rotation cipher, or a simple xor
cipher—it's fast, it's hard to get wrong and easy to debug, and if
worst comes to worst you can even decrypt it manually to recover
corrupted data.
As for the particulars:
If you want to handle non-ASCII characters, you pretty much have to
use Unicode. If you used some fixed 8-bit charset, or the local
system's OEM charset, you wouldn't be able to handle passwords from
other machines.
A Python script would almost certainly handle Unicode characters,
because in Python you either deal in bytes in a str, or
Unicode characters in a unicode. But a Windows C or .NET
app would be much more likely to use UTF-16, because Windows native
APIs deal in UTF-16-LE code units in a WCHAR * (aka a
string of 16-bit words).
So, why 4142? Well, it really doesn't matter what the key is. I'm
guessing some programmer suggested 42. His manager then said "That doesn't sound very
secure." He sighed and said, "I already explained why no key is going
to be any more secure than… you know what, forget it, what about
4142?" The manager said, "Ooh, that sounds like a really secure
number!" So that's why 4142.
If it's not a library function, can you think of a better method
to de-obfuscate these values without resorting to the magic 4142
number?
You do need to resort to the magic 4142, but you can make this a
lot simpler:
def decrypt(block):
    return struct.pack('>H', (4142 - int(block, 10)) % 65536)
So, each block of 5 characters is the decimal representation of a
UTF-16 code unit, subtracted from 4142, using C unsigned-short
wraparound rules.
This would be trivial to implement in native Windows C, but it's
slightly harder in Python. The best transformation function I can come
up with is:
def decrypt_block(block):
    return struct.pack('>H', (4142 - int(block, 10)) % 65536)

def decrypt(pwd):
    blocks = [pwd[i:i+5] for i in range(0, len(pwd), 5)]
    return ''.join(map(decrypt_block, blocks)).decode('utf-16-be')
This would be a lot more trivial in C or C#, which is probably what
they implemented things in, so let me explain what I'm doing.
You already know how to transform the string into a sequence of
5-character blocks.
My int(block, 10) is doing the same thing as your
int(block.lstrip('0')), making sure that a
'0' prefix doesn't make Python treat it as an octal
numeral instead of decimal, but more explicitly. I don't think this is
actually necessary in Jython 2.2 (it definitely isn't in more modern
Python/Jython), but I left it in just in case.
Next, in C, you'd just do unsigned short x = 4142U - y;,
which would automatically underflow appropriately. Python
doesn't have unsigned short values, just signed
int, so we have to do the underflow manually. (Because
Python uses floored division and remainder, the sign is always the
same as the divisor's; this wouldn't be true in C, at least not in C99 and
most platforms' C89.)
Then, in C, we'd just cast the unsigned short to a 16-bit "wide
character"; Python doesn't have any way to do that, so we have to use
struct.pack. (Note that I'm converting
it to big-endian, because I think that makes this easier to debug; in
C you'd convert to native-endian, and since this is Windows, that
would be little-endian.)
So, now we've got a sequence of 2-character UTF-16-BE code units.
I just join them into one big string, then
decode it as UTF-16-BE.
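To sanity-check the scheme end to end, here is a hypothetical round-trip sketch in Python 3 (the snippets above are Python 2/Jython, so the bytes handling differs slightly). The encrypt function is my guess at the obfuscation direction, inverted from the decryption rule described above:

```python
import struct

def encrypt(pwd):
    # Assumed inverse of the decryption rule: each UTF-16-BE code unit c
    # becomes the 5-digit decimal string of (4142 - c) mod 65536.
    data = pwd.encode('utf-16-be')
    units = struct.unpack('>%dH' % (len(data) // 2), data)
    return ''.join('%05d' % ((4142 - c) % 65536) for c in units)

def decrypt_block(block):
    # undo one 5-character block: back to one UTF-16-BE code unit
    return struct.pack('>H', (4142 - int(block, 10)) % 65536)

def decrypt(pwd):
    blocks = [pwd[i:i+5] for i in range(0, len(pwd), 5)]
    return b''.join(map(decrypt_block, blocks)).decode('utf-16-be')
```

As a check, U+7000 does indeed come out as '41006', matching the worked example below.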
If you really want to test that I've got this right, you'll need to
find characters that aren't just non-ASCII, but non-Western. In
particular, you need:
- A character whose code unit is greater than 4142 but less than
0x10000. Most CJK ideographs, like U+7000 (瀀), fit the bill. This
should appear as '41006', because that's 4142 - 0x7000 rolled over as an
unsigned short.
- A character that's >= U+10000. This includes uncommon CJK
characters, specialized mathematical characters, characters from
ancient scripts, etc. For example, the Old Italic character U+10300 (

While concatenating two columns an output mismatch occurs 
Concatenating strings, where one of your strings is
NULL, will always yield NULL.
For example: SELECT NULL + 'Moo' will result in
NULL, always.
ISNULL() accepts 2 arguments and will return the second
when the first resolves to NULL (see BOL). Note the second
ISNULL below: if d.dtName is NULL, the whole
' (' + d.dtName + ')' expression becomes NULL, so the
parentheses disappear along with it.
Do you mean to do this?
SELECT [Badge Id] = v.PassNo
      ,VisitorName = VName
      ,[Name(Department)] = ISNULL(e.eName, '') + ISNULL(' (' + d.dtName + ')', '')
      ,v.EntryTime
FROM Visitorlogo_tbl V
JOIN DepartmentMaster_tbl D ON v.Deptid = d.dtId
LEFT JOIN EmployeeMaster_tbl E ON v.empid = e.eId
WHERE v.EntryTime >= '20140629'

Parallel algorithm that does a small insertion/shifting 
One possible approach.
Due to the ambiguity of your shifting (0, 1, 0, 1 and
0, 1, 1, 1 and 0, 1, 0, 0 all produce the
same data offset pattern, for example) it's not possible to just
create a prefix sum of the shift pattern to produce the relative
offset at each position. An observation we can make, however, is that
a valid offset pattern will be created if each zero in the shift
pattern gets replaced by the first non-zero shift value to its
left:
0, 1, 0, 0 (shift pattern)
0, 1, 1, 1 (offset pattern)
or
0, 2, 0, 2 (shift pattern)
0, 2, 2, 2 (offset pattern)
So how to do this? Let's assume we have the second test case shift
pattern:
0, 1, 0, 0, 2, 0, 0, 0
Our desired offset pattern would be:
0, 1, 1, 1, 2, 2, 2, 2
1. For a given shift pattern, create a binary value, where each
bit is one if the value at the corresponding index into the shift
pattern is zero, and zero otherwise. We can use a warp vote instruction, called
__ballot(), for this. Each lane will get the same value
from the ballot:
1 0 1 1 0 1 1 1 (this is a single binary 8-bit
value in this case)
2. Each warp lane will now take this value, and add a value to it
which has a 1 bit at the warp lane position. Using lane 1 for the
remainder of the example:
  1 0 1 1 0 1 1 1
+ 0 0 0 0 0 0 1 0 (the only 1 bit in this value
will be at the lane index)
= 1 0 1 1 1 0 0 1
3. We now take the result of step 2, and bitwise exclusive-OR it with
the result from step 1:
= 0 0 0 0 1 1 1 0
4. We now count the number of 1 bits in this value (there is a
__popc() intrinsic for this), and subtract one from the
result. So for the lane 1 example above, the result of this step
would be 2, since there are 3 bits set. This gives us
the distance to the first value to our left that is non-zero in the
original shift pattern. So for the lane 1 example, the first non-zero
value to the left of lane 1 is 2 lanes higher, i.e. lane 3.
5. For each lane, we use the result of step 4 to grab the
appropriate offset value for that lane. We can process all lanes at
once using a __shfl_down() warp shuffle instruction:
0, 1, 1, 1, 2, 2, 2, 2
Thus producing our desired "offset pattern".
Once we have the desired offset pattern, the process of having each
warp lane use its offset value to appropriately shift its data item is
straightforward.
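Steps 1-5 can be simulated on the host with plain integer arithmetic. This hypothetical Python sketch (mine, not part of the CUDA answer) emulates __ballot(), __popc() and the shuffle, including the lane reversal the kernel performs:

```python
def popc(x):
    # emulate the __popc() intrinsic: count set bits
    return bin(x).count('1')

def offset_pattern(shift):
    """Host-side simulation of the warp trick; assumes len(shift) <= 32."""
    n = len(shift)
    rev = shift[::-1]  # the kernel reads the pattern reversed, so
                       # "first non-zero to the left" becomes "at a higher lane"
    # step 1: __ballot() -- bit i is set where lane i sees a zero shift
    ballot = sum(1 << i for i, s in enumerate(rev) if s == 0)
    out = []
    for lane in range(n):
        lanebit = 1 << lane
        # steps 2-3: adding the lane bit carries through the run of set
        # bits (zero shifts); XOR isolates exactly the flipped bits
        temp = (ballot + lanebit) ^ ballot
        # step 4: popcount minus one = distance to the first non-zero shift
        delta = popc(temp) - 1
        # step 5: __shfl_down() -- fetch the shift value `delta` lanes up
        src = lane + delta
        out.append(rev[src] if src < n else 0)
    return out[::-1]  # undo the reversal
```

For lane 1 of the worked example the intermediate values match the step-by-step binary arithmetic shown above (temp has 3 set bits, giving delta = 2).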
Here is a fully worked example, using your 3 test cases. Steps 1-4
above are contained in the __device__ function
mydelta. The remainder of the kernel is performing the
step 5 shuffle, appropriately indexing into the data, and copying the
data. Due to the usage of the warp shuffle instructions, we must
compile this for a cc 3.0 or higher GPU. (However, it would not be
difficult to replace the warp shuffle instructions with other indexing
code that would allow operation on cc 2.0 or greater devices.) Also,
due to the various intrinsics used, this function cannot work for more
than 32 data items, but that was a prerequisite condition stated in
your question.
$ cat t475.cu
#include <stdio.h>

#define DSIZE 8

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

__device__ int mydelta(const int shift){
  // steps 1-4: distance to the first lane above us with a non-zero shift
  unsigned nz = __ballot(shift == 0);
  unsigned mylane = (threadIdx.x & 31);
  unsigned lanebit = 1<<mylane;
  unsigned temp = nz + lanebit;
  temp = nz ^ temp;
  unsigned delta = __popc(temp);
  return delta - 1;
}

__global__ void mykernel(const int *data, const unsigned *shift, int *result, const int limit){ // limit <= 32
  if (threadIdx.x < limit){
    unsigned lshift = shift[(limit - 1) - threadIdx.x];
    unsigned delta = mydelta(lshift);
    unsigned myshift = __shfl_down(lshift, delta);
    myshift = __shfl(myshift, ((limit - 1) - threadIdx.x)); // reverse offset pattern
    result[threadIdx.x] = 0;
    if ((myshift + threadIdx.x) < limit)
      result[threadIdx.x + myshift] = data[threadIdx.x];
  }
}

int main(){

  int A[DSIZE]         = {3, 6, 7, 8, 1, 2, 3, 5};
  unsigned tc1B[DSIZE] = {0, 1, 0, 0, 0, 0, 0, 0};
  unsigned tc2B[DSIZE] = {0, 1, 0, 0, 2, 0, 0, 0};
  unsigned tc3B[DSIZE] = {0, 1, 0, 0, 1, 0, 0, 0};
  int *d_data, *d_result, *h_result;
  unsigned *d_shift;
  h_result = (int *)malloc(DSIZE*sizeof(int));
  if (h_result == NULL) { printf("malloc fail\n"); return 1;}
  cudaMalloc(&d_data, DSIZE*sizeof(int));
  cudaMalloc(&d_shift, DSIZE*sizeof(unsigned));
  cudaMalloc(&d_result, DSIZE*sizeof(int));
  cudaCheckErrors("cudaMalloc fail");
  cudaMemcpy(d_data, A, DSIZE*sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(d_shift, tc1B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
  cudaCheckErrors("cudaMemcpyH2D fail");
  mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel fail");
  cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
  cudaCheckErrors("cudaMemcpyD2H fail");
  printf("index: ");
  for (int i = 0; i < DSIZE; i++) printf("%d, ", i);
  printf("\n    A: ");
  for (int i = 0; i < DSIZE; i++) printf("%d, ", A[i]);
  printf("\ntc1 B: ");
  for (int i = 0; i < DSIZE; i++) printf("%d, ", tc1B[i]);
  printf("\ntc1 C: ");
  for (int i = 0; i < DSIZE; i++) printf("%d, ", h_result[i]);
  cudaMemcpy(d_shift, tc2B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
  cudaCheckErrors("cudaMemcpyH2D fail");
  mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel fail");
  cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
  cudaCheckErrors("cudaMemcpyD2H fail");
  printf("\ntc2 B: ");
  for (int i = 0; i < DSIZE; i++) printf("%d, ", tc2B[i]);
  printf("\ntc2 C: ");
  for (int i = 0; i < DSIZE; i++) printf("%d, ", h_result[i]);
  cudaMemcpy(d_shift, tc3B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
  cudaCheckErrors("cudaMemcpyH2D fail");
  mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel fail");
  cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
  cudaCheckErrors("cudaMemcpyD2H fail");
  printf("\ntc3 B: ");
  for (int i = 0; i < DSIZE; i++) printf("%d, ", tc3B[i]);
  printf("\ntc3 C: ");
  for (int i = 0; i < DSIZE; i++) printf("%d, ", h_result[i]);
  printf("\n");
  return 0;
}
$ nvcc -arch=sm_35 -o t475 t475.cu
$ ./t475
index: 0, 1, 2, 3, 4, 5, 6, 7,
    A: 3, 6, 7, 8, 1, 2, 3, 5,
tc1 B: 0, 1, 0, 0, 0, 0, 0, 0,
tc1 C: 3, 0, 6, 7, 8, 1, 2, 3,
tc2 B: 0, 1, 0, 0, 2, 0, 0, 0,
tc2 C: 3, 0, 6, 7, 8, 0, 1, 2,
tc3 B: 0, 1, 0, 0, 1, 0, 0, 0,
tc3 C: 3, 0, 6, 7, 8, 1, 2, 3,
$

Finding "decent" numbers algorithm reasoning? 
OK, the thinking goes something like this.
- Once you've worked out how many 5s and how many 3s you want, you
should front-load the 5s. The ordering makes no odds to whether the
number is decent; but it'll be bigger if the 5s are at the front.
- The number of 3s you want should be the smallest number that
satisfies the constraints, because then you'll have more 5s, making
for a bigger number.
- The number of 3s must be divisible by 5. That means that there's
never any point having more than ten 3s: you should only consider 0,
5 or 10 3s. This is because you want the smallest number of 3s that
leaves the number of remaining digits divisible by 3, to satisfy the
other constraint. If having 15 3s works, then so does having 0 3s;
if 20 works, then so does 5; if 25 works, then so does 10. In general,
subtracting 15 from the number of 3s will leave both constraints
still satisfied if they were before.
- The number of 5s must therefore be n-0 or n-5 or n-10, and we want
the one that gives a number of 5s that's divisible by 3. The
calculation c = 5*(2*n%3) will give you 0 3s (and therefore n 5s) if
n was already divisible by 3; and 10 3s (and therefore n-10 5s) if n
was one greater than a multiple of 3, in which case n-10 is still
divisible by 3; and so on.
- The only thing to test is whether the calculation of c 3s and n-c
5s satisfies the implicit constraint that n-c should be non-negative.
If it's negative then there's no solution; if it's non-negative then
this is a valid solution, and front-loading the 5s will give you the
largest such solution.
This is one of quite a wide class of "programming" problems where
the test isn't really to see whether you can bash out some code that
does the job, but to see whether you can reduce the problem logically
to the point where it's trivial and can be solved very
efficiently.
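The reduction above fits in a few lines of code. A hypothetical Python sketch (the function name and the None-for-no-solution convention are mine, not from the original problem statement):

```python
def decent_number(n):
    """Largest n-digit number using only digits 5 and 3, where the
    count of 3s is divisible by 5 and the count of 5s by 3."""
    c = 5 * (2 * n % 3)   # number of 3s, per the derivation above
    if n - c < 0:
        return None       # no decent number with n digits
    # front-load the 5s to make the number as large as possible
    return '5' * (n - c) + '3' * c
```

For n = 11 this yields six 5s followed by five 3s, satisfying both divisibility constraints.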



