Mathematical Programming Solver with SQLFlow

Mathematical programming (aka. mathematical optimization) is the selection of a best element (with regard to some criterion) from some set of available alternatives. Solving optimization problems is widely used in fields like economics and finance, computer science, engineering, and researching. In this design, we try to make SQLFlow be able to solve below branches of programming problems using SQL statements by leveraging systems like pyomo or our internal programming project at Ant Group, they both support these types of programming categories:

Background

To understand what’s a mathematical programming problem, let’s take a look at this example (example originally published at http://faculty.kutztown.edu/vasko/MAT121/MAT121web/Example\_2.html):

Giapetto’s Woodcarving, Inc., manufactures two types of wooden toys: soldiers and trains.

A soldier sells for $27 and uses $10 worth of raw materials. Each soldier that is manufactured increases Giapetto’s variable labor and overhead costs by $14. A train sells for $21 and uses $9 worth of raw materials. Each train built increases Giapetto’s variable labor and overhead costs by $10. The manufacture of wooden soldiers and trains requires two types of skilled labor: carpentry and finishing. A soldier requires 2 hours of finishing labor and 1 hour of carpentry labor. A train requires 1 hour of finishing labor and 1 hour of carpentry labor. Each week, Giapetto can obtain all the needed raw material but only 100 finishing hours and 80 carpentry hours. Demand for trains is unlimited, but at most 40 soldiers are bought each week. Giapetto wants to maximize weekly profit (revenues-costs).

Let

  • x be the number of soldiers produced each week
  • y be the number of trains produced each week

Then the objective is:

Maximise Z = (27 - 10 - 14)x + (21 - 9 - 10)y = 3x + 2y

The constraints are:

  • 2x + y <= 100 (Finishing Constraint)
  • x + y <= 80 (Carpentry Constraint)
  • x <=40 (Constraint on demand for soldiers)
  • x,y >= 0 (sign restriction)

To solve this problem, we can use tools like AMPL or open-source tools like pyomo, or Matlab, or R.

You can learn from the examples in the above links that:

  • Using Matlab and R are quite the same, they require users to define the constraints as matrixes and call a function to get the result. They both have their own grammar, and you have to write code according to their language specifications.
  • Using “AMPL like” high-level language to describe the problem, and call the corresponding solvers. Pyomo and AMPL have similar components in their grammar: sets, parameters, variables, objectives, constraints (https://pyomo.readthedocs.io/en/stable/pyomo\_modeling\_components/index.html).

So we can see that using AMPL is a modern and general way of describing mathematical programming problems. We can simply write the AMPL snippet to describe the above problem:

  1. set X;
  2. var ProductAmt{x in X} >= 0;
  3. param Price{x in X};
  4. param MaterialCost{x in X};
  5. param OtherCost{x in X};
  6. param Finishing{x in X};
  7. param Carpentry{x in X};
  8. param Demand{x in X};
  9. maximize profit:
  10. sum{x in X} (Price[x] - MaterialCost[x] - OtherCost[x]) * ProductAmt[x];
  11. s.t. finishing: sum{x in X} Finishing[x] * ProductAmt[x] <= 100;
  12. s.t. carpentry: sum{x in X} Carpentry[x] * ProductAmt[x] <= 80;
  13. s.t. demand{x in X}: ProductAmt[x] <= 40;

Grammar Design

In order to extend SQL to have completely same ability of AMPL, the extended syntax should be able to describe objective and constraints while the input data table can store the params for each variable, and the rows in the table is naturally become the set we defined in AMPL.

Linear Programming Syntax

Then we have the below table woodcarving:

productpricematerials_costother_costfinishingcarpentrymax_num
soldier2710142140
train219101110000

In the woodcarving:

  • The set X is row one and row two.
  • We have one variable, and the variable name strings is stored in column product. In cases that have cross variables (like the example described at https://en.wikipedia.org/wiki/AMPL), the table should have multiple string columns to store the variable names.
  • Other columns like price, materials_cost are all params for the corresponding variable.

Then we can use below extended SQL syntax to describe above example:

  1. SELECT * FROM woodcarving
  2. TO MAXIMIZE SUM((price - materials_cost - other_cost) * amount)
  3. CONSTRAINT SUM(finishing * amount) <= 100,
  4. SUM(carpentry * amount) <= 80,
  5. amount <= max_num
  6. WITH variable="amount(product)",
  7. var_type="Integers"
  8. [USING glpk]
  9. INTO result_table;

In the SQL statement:

  • TO MAXIMIZE|MINIMIZE ... defines an expression string that describes the objective.
    • The syntax MAXIMIZE|MINIMIZE is used to specify the objective sense.
    • In the expression, SUM means sum the value across all rows like normal SQL statements.
  • CONSTRAINT ... expression strings that describe the constraints, can have multiple CONSTRAINT expressions separated by comma.
  • WITH attributes:
    • variable="amount(product)": required, specify the variable definition, product is the column that stores the variable name. Using comma to separate if there are multiple variables, e.g. shipment(plants,markets).
    • var_type="Integers": optional, specify the variable type, there should only be one variable in current cases. The format is like var_type="Type", the type can be Integers, NonNegativeIntegers, Reals etc. The default variable type is Integers.
  • USING: optional, solver tool to use, default: glpk.
  • INTO result_table: set the result table name.

After the SQL statement finishes execution, the result table result_table should look like:

productamount
soldier20
train60

Combinatorial Optimization Syntax

Combinatorial Optimization Problem (https://en.wikipedia.org/wiki/Combinatorial\_optimization) is a subset of mathematical optimization that is widely used in real life. Here we demostrate how to use a SQL statement to solve a typicall combinational optimization problem.

For example, there are several plants that manufactures products and several markets that sells them (see the example described in https://en.wikipedia.org/wiki/AMPL for details), we want to minimize the cost of transportation between plants and markets, we have three tables looks like below:

  1. Plants capacity table:

    plantscapacity
    plantA100
    plantB90
  2. Markets demand table:

    marketsdemand
    marketA130
    marketB60
  3. Plants to markets distance table:

    plantsmarketsdistance
    plantAmarketA140
    plantAmarketB210
    plantBmarketA300
    plantBmarketB90
  4. When we start to solve the problem, we’d like to join the tables beforehand:

    1. SELECT trans.plants, trans.markets, trans.distance, plants.capacity, markets.demand FROM transportation AS trans
    2. LEFT JOIN plants ON src.plants = plants.plants
    3. LEFT JOIN markets ON src.markets = markets.markets;

    Then we have a “joined” table like below to start the solving process:

    plantsmarketsdistancecapacitydemand
    plantAmarketA140100130
    plantAmarketB21010060
    plantBmarketA30090130
    plantBmarketB909060

Then we can use below extended SQL syntax to describe above example:

  1. SELECT src.plants, src.markets, src.distance, plants.capacity, markets.demand FROM transportation AS src
  2. LEFT JOIN plants ON src.plants = plants.plants
  3. LEFT JOIN markets ON src.markets = markets.markets
  4. TO MINIMIZE SUM(shipment * distance * 90 / 1000)
  5. CONSTRAINT SUM(shipment) <= capacity GROUP BY plants,
  6. SUM(shipment) >= demand GROUP BY markets
  7. WITH variable="shipment(plants,markets)",
  8. var_type="Integers"
  9. [USING glpk]
  10. INTO result_table;
  • In the above SQL statement, the syntax is quite the same as the single variable example, yet
  • The CONSTRAINT including a GROUP BY clause is a “partial aggregation constraint”, take CONSTRAINT SUM(markets) <= capacity GROUP BY plants as an example, it means:
    1. for each plant,
    2. the sum of “shipment amount to each market from current plant” should be less than the current plant’s capacity.

Then after the solving job has completed, we should have below contents in the result_table (the result column is a fake result for demonstration):

plantsmarketsshipment
plantAmarketA123
plantAmarketB123
plantBmarketA123
plantBmarketB123

Aggregation Functions

Support any aggregation functions accross rows that the programming solvers support. We may need to add support more syntax than SUM in the future.

Implementation

  1. Update our extended syntax parser to support TO MAXIMIZE|MINIMIZE clauses.
  2. Add an IR struct to represent TO MAXIMIZE|MINIMIZE clause.
  3. Create a table to store the result.
  4. Add code generator to generate code like below example to run, for different mathematical programming software, we may need to add different code generators. Since we extend SQL to have the same ability that AMPL has, we can almost connect to any software we like.
  5. The generated code should be able to output the result to the result table.

Intermediate Representation

The extended TO MAXIMIZE|MINIMIZE syntax can be represented by below Go structure after parsing:

  1. type SolveExpr struct {
  2. // Expression parsed from SQL statement of objective and constraints, used for code generation.
  3. // e.g. sum((@TABLE.price[i] - @TABLE.materials_cost[i] - @TABLE.other_cost[i]) * @X[i])
  4. Expression string
  5. // Iterate variables like {i in X} in the above example.
  6. IterVars []string{}
  7. }
  8. type MathProgrammingStmt struct {
  9. // Select is the select statement before TO MAXIMIZE|MINIMIZE clause.
  10. Select string
  11. // Attributes is a map of parsed attribute in the WITH clause.
  12. Attributes map[string]interface{}
  13. // Objective
  14. Objective SolveExpr
  15. // ObjectiveSense, 0: maximize, 1: minimize
  16. ObjectiveSense int
  17. // Constraints
  18. Constraints []*SolveExpr{}
  19. // ResultTable is the table name to store results.
  20. ResultTable string
  21. // When SQLFLOW_submitter == "pai", tmp tables will be created for solving tasks
  22. TmpTrainTable string
  23. }

Example of Generated Code

  1. from pyomo.opt import SolverFactory
  2. from pyomo.environ import (ConcreteModel, Var, Objective, maximize, Constraint, NonNegativeIntegers)
  3. model = ConcreteModel()
  4. # input dataframe
  5. data_df = ... # Construct dataframe from DBMS driver here.
  6. # variable
  7. size = len(data_df.type)
  8. variables = range(size)
  9. model.x = Var(variables, within=NonNegativeIntegers)
  10. # object
  11. def objRule(model):
  12. return sum([(data_df.price[i] - data_df.materials_cost[i] - data_df.other_cost[i]) * model.x[i] for i in model.x])
  13. model.Obj = Objective(rule=objRule, sense=maximize)
  14. # constrains
  15. def rule1(model):
  16. return sum([data_df.finishing[i] * model.x[i] for i in model.x]) <= 100
  17. model.c1 = Constraint(rule=rule1)
  18. def rule2(model):
  19. return sum([data_df.carpentry[i] * model.x[i] for i in model.x]) <= 80
  20. model.c2 = Constraint(rule=rule2)
  21. def rule3(model):
  22. return model.x[i] <= data_df.max_num[i] for i in model.x
  23. model.c3 = Constraint(rule=rule3)
  24. if __name__ == '__main__':
  25. with SolverFactory('glpk', executable='glpsol') as solver:
  26. results = solver.solve(model)
  27. print(results)
  28. model.display()
  29. result_data = pd.DataFrame(columns=['var_id', 'var', 'x'])
  30. result_data['var_id'] = [i for i in model.x]
  31. result_data['var'] = data_df.type.values
  32. result_data['x'] = [model.x[i]() for i in model.x]