[ Python: Code string to shorter meaningless string in reversible way ]
I'm sorry for the generic question (I don't have any past knowledge about compression and I don't know if it has a possible solution).
I have some codes that are always 19 characters long.
The only characters allowed are: A-Z, a-z, 0-9, ".", ":" and "-".
An example can be something like 1995AbC...123..456Z
What I want to do is find a way to reversibly convert that string to a shorter one that contains only ASCII characters: something like gfSDd2H.
- Is it possible?
- Is there a way to do it in Python?
Thanks!
Answer 1
You can try to compress the string and then encode the result to, for example, base64. This of course assumes that your original strings are compressible. For strings of 19 characters this seems unlikely.
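As a quick check of that last point, here is a minimal sketch using the standard-library zlib and base64 modules on the example code from the question. The choice of zlib is an assumption for illustration; other general-purpose compressors carry similar fixed overhead on inputs this short.

```python
import base64
import zlib

code = "1995AbC...123..456Z"  # example code from the question

# Compress, then make the binary result printable with base64.
compressed = zlib.compress(code.encode("ascii"), 9)
encoded = base64.b64encode(compressed).decode("ascii")

# Reverse the two steps to get the original back.
decoded = zlib.decompress(base64.b64decode(encoded)).decode("ascii")

print(len(code), len(encoded))  # the encoded form comes out longer than the input
print(decoded == code)
```

The round trip works, but the header and checksum added by the compressor, plus the base64 expansion, outweigh any savings on a 19-character input.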
If you are allowed to persist some data, you can map the first string to 1, the second to 2, and so on. You will need to store the mapping you made, for example in a database, so that you can reverse it. You can then encode the number as a base-64 (or some other base) string.
This is similar to how URL shortening services work.
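A rough sketch of that idea follows, with two plain dictionaries standing in for the database. The names `Shortener`, `encode_id` and `ALPHABET` are hypothetical, and base 62 is used here only as one possible "other base".

```python
import string

# Hypothetical 62-symbol alphabet for the short keys.
ALPHABET = string.digits + string.ascii_letters


def encode_id(number):
    """Write a non-negative int using the digits in ALPHABET."""
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, digit = divmod(number, len(ALPHABET))
        digits.append(ALPHABET[digit])
    return "".join(reversed(digits))


class Shortener:
    """In-memory stand-in for the database that stores the mapping."""

    def __init__(self):
        self._by_code = {}  # long code -> short key
        self._by_key = {}   # short key -> long code

    def shorten(self, code):
        if code not in self._by_code:
            key = encode_id(len(self._by_code) + 1)  # 1, 2, 3, ...
            self._by_code[code] = key
            self._by_key[key] = code
        return self._by_code[code]

    def expand(self, key):
        return self._by_key[key]


shortener = Shortener()
key = shortener.shorten("1995AbC...123..456Z")
print(key, shortener.expand(key))
```

The keys stay short only because they count how many codes have been seen so far; without the stored mapping they cannot be reversed.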
Answer 2
You allow 65 different characters, so a 19-character code carries 19*log2(65) ≈ 114.4 bits. Assuming all inputs have the same probability, no encoding into 7-bit ASCII (128 symbols) can use fewer than 114.4/log2(128) ≈ 16.3, i.e. 17, characters. However, since you probably want to avoid unprintable characters, only 95 symbols remain, and the bound rises to 114.4/log2(95) ≈ 17.4, i.e. 18, characters even with a perfect mapping. Therefore, any such mapping will not lead to a significant reduction in space.
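One way to check this kind of bound is to count combinations directly: the output needs at least as many possible strings as the input has. The helper name `min_output_length` below is hypothetical; it uses exact integer arithmetic rather than logarithms.

```python
def min_output_length(input_symbols, input_length, output_symbols):
    """Smallest n such that output_symbols**n >= input_symbols**input_length."""
    combinations = input_symbols ** input_length
    n = 0
    capacity = 1
    while capacity < combinations:
        capacity *= output_symbols
        n += 1
    return n


print(min_output_length(65, 19, 128))  # all 7-bit ASCII values -> 17
print(min_output_length(65, 19, 95))   # printable ASCII only   -> 18
```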
Answer 3
Of course (?) it's possible in Python. All you would be doing is converting a base-65 number into a base-95 or base-94 number, and back again. It's just that it would be a bit slow and, as pointed out in another answer, you wouldn't be saving much space.
Here (untested) are the basic building blocks:
def ttoi(text, base, letter_values):
    """converts a base-"base" string to an int"""
    n = 0
    for c in text:
        n = n * base + letter_values[c]
    return n

def itot(number, base, alphabet, padsize):
    """converts an int into a base-"base" string
    The result is left-padded to "padsize" using the zero-value character"""
    temp = []
    assert number >= 0
    while number:
        number, digit = divmod(number, base)
        temp.append(alphabet[digit])
    return max(0, padsize - len(temp)) * alphabet[0] + "".join(reversed(temp))
Definitions for e.g. your existing base-65 code:
b65_letter_values = {
    'A': 0, 'Z': 25, 'a': 26, 'z': 51, '0': 52, '9': 61,
    # etc
}
b65_alphabet = "ABCetcXYZabcetcxyz0123456789.:-"
b65_padsize = 19
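To show how these pieces could fit together, here is a self-contained sketch of a full round trip. It repeats the two functions so it runs on its own. The symbol order after the digits (".", ":", "-"), the 94-character target alphabet (printable ASCII without the space), and the names `b94_*`, `shorten` and `restore` are assumptions made for illustration, not part of the answer above.

```python
import string


def ttoi(text, base, letter_values):
    """Convert a base-"base" string to an int."""
    n = 0
    for c in text:
        n = n * base + letter_values[c]
    return n


def itot(number, base, alphabet, padsize):
    """Convert an int to a base-"base" string, left-padded to padsize."""
    temp = []
    assert number >= 0
    while number:
        number, digit = divmod(number, base)
        temp.append(alphabet[digit])
    return max(0, padsize - len(temp)) * alphabet[0] + "".join(reversed(temp))


# Assumed source alphabet: A-Z, a-z, 0-9, then ".", ":", "-" (65 symbols).
b65_alphabet = string.ascii_uppercase + string.ascii_lowercase + string.digits + ".:-"
b65_letter_values = {c: i for i, c in enumerate(b65_alphabet)}
b65_padsize = 19

# Assumed target alphabet: the 94 printable ASCII characters except space.
b94_alphabet = "".join(chr(i) for i in range(33, 127))
b94_letter_values = {c: i for i, c in enumerate(b94_alphabet)}
b94_padsize = 18  # 94**18 > 65**19, so 18 characters always suffice


def shorten(code):
    """Re-express a 19-character base-65 code as 18 base-94 characters."""
    return itot(ttoi(code, 65, b65_letter_values), 94, b94_alphabet, b94_padsize)


def restore(short):
    """Invert shorten()."""
    return itot(ttoi(short, 94, b94_letter_values), 65, b65_alphabet, b65_padsize)


code = "1995AbC...123..456Z"
short = shorten(code)
print(len(code), len(short))
print(restore(short) == code)
```

Under these assumptions the result is always exactly one character shorter than the input, which matches the point that the saving is small. The output may also contain quotes, backslashes and other punctuation that need escaping in some contexts.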