This past week, I made significant technical changes to Pokésummary, my command-line Python program. In this blog post, I’ll recount the process of redesigning the data model.
Some background
Pokésummary version 1 consisted of three modules:
__main__.py
, which contained the logic for reading Pokémon data into memory and handled the command-line interface;
displaying.py
, which handled displaying Pokémon summaries;
and parsing.py
, which contained a utility function for reading csv
files into dictionaries.
I was unsatisfied with this structure–I didn’t like how __main__.py
violated the single responsibility principle.
Additionally, with the way I stored Pokémon data into memory,
it was hard to reason about my program.
In __main__.py
, I had the following code to get the data of every Pokémon:
data_dictionary = parsing.csv_to_nested_dict(
"pokesummary.data",
"pokemon_modified.csv",
"pokemon_name"
)
This gave me a 2D dictionary: the outer dictionary mapped Pokémon names to stats,
and the inner dictionaries (the stats) mapped specific attributes (e.g. “attack_stat”) to values (e.g. 49).
The problem was that my code didn’t explicitly define which attributes the stats should have.
The attributes were given by my dataset,
so my code in displaying.py
had to know the column names of my dataset to access attributes.
For example, here is how I accessed a Pokémon’s classification:
pokemon_stats['classification']
Since the column name for Pokémon classifications in the dataset is “classification”, I had to use that string as my dictionary key. Surely there was a better way.
The model-view-controller design pattern
I remembered someone had once told me about the model-view-controller design pattern. The idea is to separate your program into three components: the model manages the program’s data, the view displays the data, and the controller responds to user input.
It seemed like this could work well for Pokésummary.
I had the view already, displaying.py
.
I also had the controller, __main__.py
.
I just needed a model for Pokémon stats.
Creating this model would explicitly define which attributes were available,
and separating out the logic that read data into it from the controller
would satisfy the single responsibility principle.
Data Classes
Since my Pokémon stats model would store data, I implemented it using Data Classes.
With normal classes, you need to write an __init__()
method (the constructor).
Data Classes make things simpler by automatically generating it;
all you need to do is define the class’s attributes and their types1.
I made a Pokemon
class, which contains a PokemonBaseStats
–here is the code:
@dataclass(frozen=True)
class BaseStats:
hp: int
attack: int
defense: int
special_attack: int
special_defense: int
speed: int
@dataclass(frozen=True)
class Pokemon:
name: str
classification: str
height: float
weight: float
primary_type: str
secondary_type: str
base_stats: BaseStats
Now, with Data Classes, the available attributes are explicitly defined. I can also now use dot notation, which is much cleaner than using key lookup:
pokemon.classification
Inheriting from UserDict
Next, I needed a way to read my dataset into a collection of Pokemon
objects.
I thought it’d be nice to create my own class that mapped strings to Pokemon
;
the class could encapsulate reading from my dataset.
Maybe inherit from dict
?
But, reading an article on this,
I saw that inheriting from dict
had some unexpected behavior.
So, I followed the article’s advice and inherited from UserDict
instead.
UserDict
is a wrapper around a dictionary object,
but it is implemented such that values can be accessed just like a dictionary.
The UserDict
constructor allowed me to give the internal dictionary some initial data.
So, to implement PokemonDict
,
I created a static method to read my dataset into a dictionary,
and I passed the output of this method into the constructor.
class PokemonDict(UserDict):
def __init__(self):
pokemon_dictionary = self.read_dataset_to_dictionary()
UserDict.__init__(self, pokemon_dictionary)
@staticmethod
def read_dataset_to_dictionary():
with resources.open_text(data, "pokemon_modified.csv") as f:
csv_iterator = csv.DictReader(f)
dataset_dict = {
csv_row["pokemon_name"]: Pokemon(
name=csv_row["pokemon_name"],
classification=csv_row["classification"],
height=float(csv_row["pokemon_height"]),
weight=float(csv_row["pokemon_weight"]),
primary_type=csv_row["primary_type"],
secondary_type=csv_row["secondary_type"],
base_stats=BaseStats(
hp=int(csv_row["health_stat"]),
attack=int(csv_row["attack_stat"]),
defense=int(csv_row["defense_stat"]),
special_attack=int(csv_row["special_attack_stat"]),
special_defense=int(csv_row["special_defense_stat"]),
speed=int(csv_row["speed_stat"]),
)
)
for csv_row in csv_iterator
}
return dataset_dict
Although this dictionary comprehension isn’t too pretty,
it’s much more explicit than before.
Each attribute of the Pokemon
class is mapped to a value from the current row.
And with this, I was able to replace the complicated csv_to_nested_dict
call with one line:
pokemon_dict = PokemonDict()
The logic of reading Pokémon data is no longer in the controller part of the program, thus satisfying the single responsibility principle.
With both of my problems solved, I could finally rest easy–wait no, who am I kidding? As always, I wanted to do more.
Enums
There was one thing in particular that was bothering me: I had stored Pokémon types as strings, but it would be better to represent them using enum members since there’s a small set of Pokémon types. Enums are great in cases like these because they clearly document the possible values and prevent errors caused by using invalid ones2.
I created an enum, PokemonType
:
@unique
class PokemonType(Enum):
NORMAL = "Normal"
FIRE = "Fire"
# ... [rest of the types omitted for brevity]
Now, in the Pokemon
class, I could write:
primary_type: PokemonType
and in PokemonDict
:
primary_type=PokemonType(csv_row["primary_type"]),
But, what about secondary type?
A Pokémon might not have two types; in that case, csv_row["secondary_type"]
would be an empty string.
With enums, I initially thought I’d have a NO_TYPE = ""
member.
However, it didn’t make sense for NO_TYPE
to show up when you iterated through the enum members.
Looking this up, it seemed like the consensus
was that it’s better to use a nullable: a variable set to None
in the absence of a value.
My initial code for this was probably the worst Python I’ve ever written.
secondary_type=PokemonType(s) if (s := csv_row["secondary_type"]) else None
Since this was in the dictionary comprehension,
I needed to use an assignment expression (:=
), which isn’t supported in Python 3.7.
A clear no-go.
I ended up learning about class methods3, which are commonly used to provide multiple ways to instantiate a class. I wrote a class method with the same functionality as the above code block:
@classmethod
def optional_pokemon_type(cls, s: str):
if s == "":
return None
return cls(s)
With this, I could write:
secondary_type=PokemonType.optional_pokemon_type(csv_row["secondary_type"])
Still pretty long, but much easier to understand.
The grid of type defenses
With PokemonType
as an enum, I could use its members as the keys of a dictionary.
So, I rewrote the code for the grid of type defenses to use PokemonType
.
First, I defined TypeDefenses
as an alias for Dict[PokemonType, float]
.
Then, I wrote the TypeDefensesDict
class; although similar to PokemonDict
, it caused me a fair amount of pain to write.
Here is its _read_dataset_to_dict()
method:
@staticmethod
def _read_dataset_to_dict() -> Dict[PokemonType, TypeDefenses]:
with resources.open_text(data, "type_defenses_modified.csv") as f:
data_iterator = csv.reader(f, quoting=csv.QUOTE_NONNUMERIC)
# Gets the column names as a list of PokemonType members.
attacking_types = [
PokemonType(s)
for s in data_iterator.__next__()[1:]
]
all_type_defenses = {
PokemonType(row[0]): dict(
zip(attacking_types, cast(List[float], row[1:]))
)
for row in data_iterator
}
return all_type_defenses
I found that you could read numbers directly as floats using quoting=csv.QUOTE_NONNUMERIC
,
but the problem was, Python couldn’t know that each row
contained all floats besides the first element.
Mypy kept giving me errors, thinking that the elements of each row
were all supposed to be strings.
I came up with some ideas to resolve this,
but they didn’t work,
so I ended up using the cast
function from typing
to make mypy happy.
Addendum: Why not use namedtuples?
I experimented with using namedtuples to store Pokémon data instead of Data Classes.
From a purely practical standpoint, namedtuples are better–they are slightly faster,
reducing the time to print two summaries by around 10ms.
They also use less memory; when I left pokesummary -i
open,
the version with Data Classes used 11MB of memory,
while the version with namedtuples used 10MB of memory.
However, using namedtuples makes less sense than using Data Classes from a design standpoint. From my understanding, namedtuples should only be used as a replacement for tuples4; they keep all the functionality of tuples. So, namedtuples are iterable, and they can be unpacked. This kind of functionality doesn’t make sense for Pokémon objects.