I485: Biologically Inspired Computing

Lab 4: Evolutionary Algorithms


Evolved swimmers, from Karl Sim's "Evolving Virtual Creatures", SIGGRAPH '94 (source)

Summary

  1. Section 3.5 in de Castro's Fundamentals of Natural Computing provides a good introduction to evolutionary computing. Section 3.5.1 covers 'standard genetic algorithms'.
  2. Complete the following problems for a total of 10 points, assigned to each question for I485 and H400.
  3. Please submit your well-documented code as separate python files, one for each answer. For the Eureqa question, attach the '.fxp' file containing your work. Summarize your answers in a PDF document including figures, if any. Organize this document well, clearly indicating the question being answered and the respective Python file name(s). Upload all files into the 'Assignment 4' folder in Oncourse.
  4. Assignment is due before class on April 22nd.

    Questions or problems? See email and office hours information.


The problems:


  1. Genetic Algorithms.

    a) [1 point] [0.5 points] The basics of a genetic algorithm. Algorithm 3.6 in your textbook describes a pseudocode for a typical genetic algorithm (GA). A template of the implementation is available in python, containing the below functions. Describe in your own words, what each of the individual functions in a typical GA does, and how they all work in concert. For example, if you are describing the function 'evaluate' explain exactly what the function evaluates (inputs), how it evaluates (algorithm) and the range and format of the final evaluation scores (outputs). The object of this question is for you to get a picture of what's under the hood, and convey it.
    • init()
    • evaluate(population)
    • select(population, fitnesses)
    • recombine(father_genotype, mother_genotype, recombination_rate)
    • mutate(genotype, mutation_rate)

    b) [1.5 points] [1 point] A simple genetic algorithm. Using a population size, genome size, and number of iterations of your preference, implement a GA to evolve a 'textual' genotype in the following way. Your init() method should initialize each genotype in the population to consist of a string of random letters. The objective is to evolve genotypes that contain as many (lowercase) a's as possible. That is, you can use the number of lowercase a's present in a genotype as its fitness score in your evaluate() method. At the end of each generation, print out the best performing genotype. For solving this simple problem (by GA standards), you may choose to not use the recombine() method, but you are welcome to try it. NOTE: Your mutation method should not explicitly replace a genotype letter with 'a'. Although it will lead to solutions more quickly, it's not natural from an evolution point of view that is blind in a sense.

    c) [2.5 points] [2 points] Evolving L-systems. Implement a GA with the genotypes encoding L-system production rules similar to those that you used to generate trees in Lab 2. A good encoding scheme to use though you are welcome to use another one if you'd like is to create genotypes using only the characters in alphabet: {'F','[',']','+', '-'} that represent 'step forward', 'open branch', 'close branch', 'turn right' and 'turn left' respectively. Each genotype would represent a single production rule to replace 'F'; the default axiom would be 'F' (see L-System 1 specification in Q3 of Lab2, for example). Be sure to have the brackets balanced in every genotype, particularly after recombination (a paper for reference on tips for ensuring balanced brackets). You may like to have fixed-length genotypes to keep things simple, but you are welcome to play with variable-length genotypes as well. Implement the GA with roulette wheel selection with or without elite carryover. Define your own values for the recombination and mutation rates (try small values first), and most importantly formulate a good fitness function that evaluates your L-system generated at the end of a fixed number of iterations. Hints: you may want your L-system to look like a tree that is maximally banched, with branches that are maximally spread out, with as much symmetry as possible, or you may want your L-system to maximally possess the characteristics of a shape that you like, etc. By the way, it is very useful to draw your 'best-so-far' L-system at the end of every GA generation.

    d) [Optional, extra 1.5 points] [1.5 points] Increase the mutation rate and run your GA again. Does your GA find good solutions faster or slower than it did earlier? Choose at least 3 different mutation rates sufficiently spread in the range (0, 1), and discuss how mutation rate affects the search.

  2. Discovering mathematical formulae using evolutionary software.

    a) [5 points] [3.5 points] Eureqa is a symbolic regression tool based on evolutionary algorithms that is used to find mathematical relationships among variables in a dataset. In this problem, you will use Eureqa to play with a simple dataset on a binary classification task. The dataset contains n = 16 data items representing different animal species. Each species is a data vector of 13 binary (0 or 1) variables representing features such as size, hairiness, etc. A '1' indicates the presence of that feature, and '0' otherwise. Each of the 16 species can be classified into one of two categories: bird or terrestrial animal (indicated in the dataset). Eureqa can be trained to learn the association between the species' features and their categories, and thus have it automatically compute the category using only the features. Eureqa learns associations in the form of a mathematical function, using genetic programming. The object of this problem is for you to learn to use Eureqa for machine learning, and also get a picture of how its learning evolves by following its visualizations. Below is a series of steps that guides in setting up the data, specifying the target functions, and finally let predictive expressions evolve. A sample project that trains Eureqa to learn a simple logical OR function is available (remove the '.txt' extension and open the '.fxp' file in Eureqa). Save your work below in a '.fxp' file from Eureqa and attach it to your submission.
    • First enter the data into Eureqa. Remember to add a category column containing values of either 0 or 1 indicating whether the corresponding row contains features of a bird or a terrestrial animal. Then, set up your target function by stating that category is a function of all the 13 variables. In the tab 'Define Search', under 'Primary Options', you can select various 'Formula building-blocks' that you want in your to-be-discovered formula. Since we are dealing with binary variables, you may opt for the 'Logical' operators (AND, OR, NOT, XOR). Alternatively, you may use the 'basic' arithmetic operators (+, -, *, /), in which case, you also need to include a squashing function in order to a binary classification. As the squash function makes sense only with numerical values, do not use it if you include the logical operators. Don't forget to check 'Input Variable', because your formula should contain input variables in it. Start the search. Go to the 'Results' tab and report the best solution. Does the discovered function fit the data well (see the 'Solution Fit Plot' in the top right of the same tab)?
    • If you set up the search as above, almost certainly Eureqa will discover a function of just one variable that are each best fitting functions with nil error. Let's call this variable a simple predictor since it is wholly sufficient to predict the category. There are four such simple predictors in the dataset. Find all of them. Hints: Modify the target function appropriately to find one simple predictor at a time; you can also verify your findings by simple visual inspection of the dataset.
    • Set up a new target function that does not include any of the four simple predictors that you found above. Now what kind of formulae does Eureqa discover? How well do these formula fit the data? Since you have eliminated all of the simple predictors, this time at least one of your discovered formulae should comprise of more than one independent variable. Translate such formulae into words. Do the descriptions match your pre-existing knowledge about birds and animals? Elaborate on it.
    • Set up a few other target functions by excluding one ore more variables from the formulae you found above. For each formula you find, report on how well it fits the data.

    b) [Optional, extra 1.5 points] [1.5 points] Summarize all your findings, by describing the length of the best formulae you found, their respective errors in fitting the data etc. You may have found different kinds of formulae, some containing simple predictors, and others with compund predictors. As a budding scientist, comment on the existence of such multiple relationships in a dataset. To make predictions about new data items that the present dataset does not contain (say, a lizard), what formula would you prefer? Would you use a longer formula that fits the present data very well? Or would you rather prefer a shorter formula that contains some error? See the 'Error vs. Complexity' plot in the bottom right of the 'Results' tab to visualize this trade-off. Do you think that this trade-off might affect how well the corresponding formulae generalize to species not available in the training dataset?

More

Acknowledgements

Links

All Class Labs

I485/I585 Biologically-inspired Computing

Life Inspired


For more information contact Santosh Manicka or Luis Rocha.
Last Modified: April 7, 2015