
Excuse me. I’m going to need to see your ID.

When I started developing LinkApp, my link aggregator, I knew I would need a way to identify each link. I had also chosen Redis as my database, which meant each link would need a key for identification purposes.

With my first pass of building this application I decided to make an ID using the URL. My logic was that I would only allow one link per URL address. That would ensure each ID was unique.

A hash function maps data of arbitrary size to data of a fixed size. When the URL is hashed, the result becomes an ID of a fixed size. If a user attempts to make a post with an existing URL, it gets caught by a check in the form of the url_exists() function, which raises an Exception when a duplicate URL is entered.

The way I implemented this was to use the hashlib module from the Python Standard Library. In the code below I created a key() function using hashlib, which gave me access to the MD5 algorithm.

def prefix_key(self, raw_id):
    """Adds a prefix to the raw_id."""
    return "link:%s" % (raw_id,)
          
def key(self, url_address):
    """Generate a database key and hashed id based on the URL"""
    hashed = hashlib.md5(url_address.encode('utf-8')).hexdigest()
    
    raw_id = hashed
    redis_key = self.prefix_key(hashed)

    return raw_id, redis_key

Okay, that’s all good and fine but what does that actually mean?

The hashlib module is built into Python as part of the Standard Library. It implements a common interface that accommodates multiple hash algorithms. You can read more about the hashlib module in the Python documentation.

This means the code you write for one hash algorithm can be reused for any other supported algorithm simply by swapping out the algorithm name.

# Using SHA-256
m = hashlib.sha256("blah".encode("utf-8")).digest()

# Using MD5
m = hashlib.md5("blah".encode("utf-8")).digest()

The above code illustrates what a common interface means.
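
Taking the common interface one step further, hashlib.new() lets you pick the algorithm by name at runtime. This is only a small illustration; the digest_with() helper is something I've made up for the example, not part of LinkApp.

import hashlib

def digest_with(algorithm_name, text):
    """Hash text with whichever supported algorithm is named."""
    h = hashlib.new(algorithm_name)      # e.g. "md5", "sha256", "sha512"
    h.update(text.encode("utf-8"))       # every hash object takes bytes
    return h.hexdigest()

print(digest_with("md5", "blah"))
print(digest_with("sha256", "blah"))
print(sorted(hashlib.algorithms_guaranteed))   # algorithms available on every platform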

Well, what is MD5?

The MD5 algorithm is a hash function that produces a 128-bit hash value.

This means the MD5 algorithm can take URLs of varying lengths and map each one to a key of the same size, no matter how long or short the URL is. How this works internally is quite interesting, but I don't want to bog down this post with the specifics.
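
A quick demonstration of that fixed-size property (just an illustration with made-up URLs, not LinkApp code): URLs of very different lengths all hash to a 32-character hex digest.

import hashlib

short_url = "https://a.io"
long_url = "https://example.com/some/very/long/path?with=lots&of=query&parameters=here"

for url in (short_url, long_url):
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    print(len(digest), digest)   # both lines start with 32, followed by different digests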

Now, MD5 isn't your only choice, and before you use the hashlib module I would look at all your options and choose the one that works best for your project. I chose MD5 because it has the shortest output: 32 hexadecimal characters. This matters because the Redis key ends up being the raw_id prefixed by prefix_key(), and a shorter ID also keeps the application's URLs short.

Let's walk through each part of the code. The fragment under discussion is repeated before each explanation.

hashed = hashlib.md5(url_address.encode('utf-8')).hexdigest()

The url_address is a string. Calling its encode method with the utf-8 encoding gives us a bytes object, which is what the hash functions in hashlib expect. URL addresses can contain special characters, and encoding them as UTF-8 represents everything as a sequence of 8-bit bytes.
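
To make that concrete (a throwaway example, not LinkApp code), encode() turns the str into bytes:

url_address = "https://example.com/café"     # a str containing a non-ASCII character
as_bytes = url_address.encode("utf-8")        # b'https://example.com/caf\xc3\xa9'

print(type(url_address))   # <class 'str'>
print(type(as_bytes))      # <class 'bytes'>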

hashed = hashlib.md5(url_address.encode('utf-8')).hexdigest()

The hashlib module's md5() function implements the MD5 algorithm, which takes as input a binary message of arbitrary length (in this case our encoded url_address) and produces as output a 128-bit message digest.

hashed = hashlib.md5(url_address.encode('utf-8')).hexdigest()

Calling the hexdigest method returns the digest as a string containing only hexadecimal digits, twice the length of the raw digest. Hexadecimal is base 16. Each byte (8 bits) of the 128-bit message digest becomes two hex characters, so 128 / 8 = 16 bytes turn into 32 characters.
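
That double-length relationship is easy to check (again, just an illustration): digest() returns 16 raw bytes, and hexdigest() spells each byte as two hex characters, giving 32.

import hashlib

m = hashlib.md5("https://example.com".encode("utf-8"))

raw = m.digest()        # 16 bytes (128 bits / 8)
hexed = m.hexdigest()   # 32 hexadecimal characters (2 per byte)

print(len(raw), len(hexed))   # 16 32
print(hexed == raw.hex())     # True -- hexdigest() is the hex spelling of digest()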

hashed = hashlib.md5(url_address.encode('utf-8')).hexdigest()

I assign this value to the variable hashed.

 

def prefix_key(self, raw_id):
    """Adds a prefix to the raw_id."""
    return "link:%s" % (raw_id,)
          
def key(self, url_address):
    """Generate a database key and hashed id based on the URL"""
    hashed = hashlib.md5(url_address.encode('utf-8')).hexdigest()
    
    raw_id = hashed
    redis_key = self.prefix_key(hashed)

    return raw_id, redis_key

The hashed value is assigned to raw_id, which serves as the ID for each link. The hashed value is also passed through the prefix_key() function to make the redis_key.

This solution worked fine, right up until I needed to modify a URL, as in the case where one was entered incorrectly. Change the URL, and the raw_id and redis_key no longer match the hash of the url_address. This problem required its own solution.

I could not simply swap the old url_address for the new one: once a url_address changes, the stored raw_id (the hash of the old address) no longer matches the hash of the new url_address.
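
To see the problem in miniature (hypothetical values, not real LinkApp data): fixing a one-character typo produces a completely different hash, so the stored key no longer lines up.

import hashlib

typo_url = "https://example.com/artcle/42"      # the URL as originally (mis)entered
fixed_url = "https://example.com/article/42"    # the corrected URL

old_id = hashlib.md5(typo_url.encode("utf-8")).hexdigest()
new_id = hashlib.md5(fixed_url.encode("utf-8")).hexdigest()

print(old_id == new_id)   # False -- the record stored under link:<old_id> is now orphaned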

Within my modify() function I create an empty dictionary called fields. In the event the url_address is changed, I had to create a whole other function, which I called rename().

(Below are just the parts of the modify() function being discussed; the entire block of code can be seen here.)

def modify(self, raw_id, page_title=None, desc_text=None, url_address=None, author=None, created=None, tags=None):
    """Modify an existing link in the database."""
    fields = {}

    if url_address is not None:
        fields['url_address'] = url_address

    if fields:
        if fields.get("url_address", None):
            comp_id, junk = self.key(url_address)
            if comp_id != raw_id:
                raw_id = self.rename(raw_id, url_address)

The modify() function checks whether the url_address is different from what is already in Redis; if it is, rename() gets called. The purpose of the rename() function is to change the raw_id and redis_key.

def rename(self, raw_id, new_url):
    """
    Delete and re-add a link when the url has changed.
    """
    existing = self.list_one(raw_id)[0]

    self.delete(raw_id)
         
    created = datetime.strptime(existing['created'], CREATED_TIME_FORMAT)
    tags = existing['tags'].split("|")

    return self.add(page_title=existing['page_title'], 
                    desc_text=existing['desc_text'], 
                    url_address=new_url, 
                    author=existing['author'], 
                    tags=tags, 
                    created=created)

This solved the problem of using the URL to make an ID.

It created its own problem, though. If the URL needed to be changed because it was entered incorrectly, the link had to be deleted and re-added to the database along with all of its other values.

Time to refactor and learn a new way to implement hashed IDs!

I spent some time researching different methods of generating my database key and hashed ID. The conclusion I came to was to use Hashids. A few quick reasons why I settled on it:

  • It’s open-source.
  • It is user-friendly.
  • It tries to avoid generating naughty words.
  • The IDs are short and unique.
  • The documentation is easy to understand.

The new code for the key() function is now this:

import random

from hashids import Hashids

...

def prefix_key(self, raw_id):
    """Adds a prefix to the raw_id"""
    return "link:%s" % (raw_id,)

def key(self):
    """Generate a database key and hashed ID using Hashids."""
    hashed = Hashids()
    
    raw_id = hashed.encode(random.randint(0, 256), random.randint(0, 256), random.randint(0, 256))
    redis_key = self.prefix_key(raw_id)

    return raw_id, redis_key

 

I start by creating an instance of Hashids.

hashed = Hashids()

 

I generate three random numbers with random.randint(), which takes the place of using the URL. Three random numbers give me enough variance that I am unlikely to run into collisions, and if I expected the database to grow past roughly 16 million links I could raise the upper bound of 256. Calling random.randint(0, 256) three times gives roughly 256*256*256 = 16,777,216 combinations (slightly more, in fact, since randint includes both endpoints, so each call has 257 possible values).

raw_id = hashed.encode(random.randint(0, 256), random.randint(0, 256), random.randint(0, 256))

 

I call the encode method on the Hashids instance, passing it the three randomly generated integers. Hashids maps those integers to a single short string drawn from its alphabet, which is larger than hexadecimal's sixteen characters, so there are more characters to work with and the IDs stay short. A salt can also be supplied when the Hashids instance is created to shuffle that alphabet consistently.

raw_id = hashed.encode(random.randint(0, 256), random.randint(0, 256), random.randint(0, 256))
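
Here is roughly what that looks like in isolation (the salt value below is made up for the example):

import random

from hashids import Hashids

hashed = Hashids(salt="linkapp-example-salt")   # a salt shuffles the alphabet consistently

numbers = [random.randint(0, 256) for _ in range(3)]
raw_id = hashed.encode(*numbers)

print(numbers, "->", raw_id)   # e.g. [17, 203, 88] -> a short alphanumeric string
print(hashed.decode(raw_id))   # the original integers come back out, e.g. (17, 203, 88)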

 

I no longer rely on the URL for the ID, so I don't have to worry about someone needing to modify a link's URL.

The change wasn't as simple as swapping out the key() function; a few other areas of the code needed updating as well.

In the add() function, I altered the code to first check whether a URL already exists. If it does, I raise an Exception, which ends the add() function so the user cannot add a URL that is already in the database.

I needed a place to store a collection of all of the URLs in the database. This is done by adding each URL to a Redis set called url_hold. Making this a set means duplicates are never stored twice, but it also means that unless a URL is checked first, adding a duplicate would silently go through, since adding an existing member to a set simply does nothing.

This is why the add() function first calls url_exists(), which checks the url_hold for the URL. If it isn't there, the url_address gets added to the url_hold.
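
I won't walk through url_exists() itself here, but given the url_hold set it only needs to ask Redis whether the URL is already a member of the set. A minimal sketch (the version in LinkApp may differ slightly):

def url_exists(self, url_address):
    """Return True if url_address is already in the url_hold set."""
    # SISMEMBER is a single O(1) membership check against the Redis set.
    return bool(self.connection.sismember("url_hold", url_address))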

def add(self, page_title, desc_text, url_address, author, tags, created=None):
    
    if self.url_exists(url_address):
        raise Exception("URL '%s' exists" % (url_address,)) 
      
    raw_id, redis_key = self.key()

    with self.connection.pipeline() as pipe:

        pipe.sadd("url_hold", url_address)

 

In the delete() function, I need to remove the URL from the url_hold. If a URL were to remain in the url_hold after its link was deleted from the database, an Exception would be raised whenever anyone attempted to add the same URL again or re-add a corrected one.

def delete(self, raw_id):
    url_to_be_deleted = self.connection.hmget(self.prefix_key(raw_id), "url_address")[0]

    with self.connection.pipeline() as pipe:
        pipe.srem("url_hold", url_to_be_deleted)

I use the Redis command HMGET, passing it the key and the field (url_address) whose value I need to remove from the url_hold, and store that value in the variable url_to_be_deleted. I then proceed with the code as it was before, using the pipeline.
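
For context, HMGET always returns a list of values, one per field requested, which is why the code indexes with [0]. A small illustration with made-up data:

# Suppose a link was stored with something like:
#   HMSET link:abc123 url_address "https://example.com" page_title "Example"
# connection here stands for a redis.Redis client, like self.connection above.
values = connection.hmget("link:abc123", "url_address")

print(values)      # ['https://example.com'] -- a list, even for a single field
print(values[0])   # 'https://example.com'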

I am now able to remove the rename() function entirely.

In the modify() function I performed the same HMGET call to Redis as in the delete() function, but stored the result in a variable called url_to_be_modified.

def modify(self, raw_id, page_title=None, desc_text=None, url_address=None, author=None, created=None, tags=None):

    url_to_be_modified = self.connection.hmget(self.prefix_key(raw_id), "url_address")[0]

    fields = {}

    if url_address is not None:
        fields['url_address'] = url_address

    if fields:
        if fields.get("url_address", None):
            if self.url_exists(url_address):
                raise Exception("URL '%s' exists" % (url_address,))

        with self.connection.pipeline() as pipe:
            pipe.hmset(self.prefix_key(raw_id), fields)
            
            if fields.get("url_address", None):
                if url_to_be_modified != fields["url_address"]:
                    pipe.srem("url_hold", url_to_be_modified)
                    pipe.sadd("url_hold", fields["url_address"])

    

As with the previous version of modify(), I'm still creating an empty dictionary called fields and starting with an if statement that checks for the url_address. In the new code, I simply call the url_exists() function, and if it returns True an Exception is raised. If no Exception is raised, I check whether url_address is in fields; if it is, I check whether the url_to_be_modified value differs from the url_address value in fields. If they differ, I perform an SREM to remove the old address (url_to_be_modified) and then an SADD to add the new url_address to the url_hold.

Previously, this is where I would have had to hash the new url_address, compare it against the raw_id, and, if they didn't match, reset the raw_id using the rename() function.

The last bit that needed to be altered was the regex in wsgilinkapp.py.

The regex looks for a particular pattern that the URL has to match; this is the URL-mapping part of my code. In the old code, the URL required exactly 32 characters after the last slash. In the new code, the {32} was changed to + because the new ID can vary in length.

Old regex:

match = re.search("/([^/]{32})$", environ['PATH_INFO'])

New regex:

match = re.search("/([^/]+)$", environ['PATH_INFO'])
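
A quick way to see the difference (the example paths are made up): the old pattern only matches a 32-character MD5-style ID, while the new one accepts the shorter, variable-length Hashids ID.

import re

old_pattern = "/([^/]{32})$"
new_pattern = "/([^/]+)$"

md5_path = "/links/9e107d9d372bb6826bd81d3542a419d6"   # 32 hex characters
hashids_path = "/links/X68fkp"                         # short Hashids-style ID

print(bool(re.search(old_pattern, hashids_path)))   # False -- fewer than 32 characters
print(bool(re.search(new_pattern, hashids_path)))   # True
print(bool(re.search(new_pattern, md5_path)))       # True -- still matches old-style IDs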

There is one last problem. The database is already populated from the old implementation of the code. If you deployed the new code as-is, new entries would add their URL to the url_hold, but the set would be missing every URL that was added before the change. I'll tackle how to deal with this in my next blog post.
