Add Benchmark

Use the questions collected in https://docs.google.com/spreadsheets/u/2/d/102qjVI496DkJ2PiMOk8s0QEXCv0ekZSlP8WuLwew1jk to write evaluation code that benchmarks the workflow against a full-context baseline (Gemini) and other RAG alternatives (e.g. agentic RAG).
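
A possible starting point for the evaluation code is sketched below. It is only a minimal harness under stated assumptions: the spreadsheet is exported to CSV with `question` and `expected` columns (the real column names are not known here), each system under test is wrapped as a plain function from question to answer, and `contains_match` is a placeholder metric rather than a chosen scoring method. All function names are hypothetical.

```python
import csv
import io
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a system under test maps a question to an answer.
System = Callable[[str], str]


def load_questions(csv_text: str) -> List[Tuple[str, str]]:
    """Parse CSV text. The 'question' and 'expected' columns are assumptions."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["question"], row["expected"]) for row in reader]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons are less brittle."""
    return " ".join(text.lower().split())


def contains_match(answer: str, expected: str) -> bool:
    """Placeholder metric: the expected answer appears in the system answer."""
    return normalize(expected) in normalize(answer)


def run_benchmark(
    questions: List[Tuple[str, str]],
    systems: Dict[str, System],
    scorer: Callable[[str, str], bool] = contains_match,
) -> Dict[str, float]:
    """Return the fraction of questions each system answers correctly."""
    results: Dict[str, float] = {}
    for name, system in systems.items():
        correct = sum(scorer(system(q), expected) for q, expected in questions)
        results[name] = correct / len(questions) if questions else 0.0
    return results


if __name__ == "__main__":
    # Stub systems stand in for the workflow, the full-context baseline,
    # and agentic RAG; replace them with real calls.
    sample = "question,expected\nWhat is 2+2?,4\n"
    systems = {
        "workflow": lambda q: "The answer is 4.",
        "full_context": lambda q: "4",
        "agentic_rag": lambda q: "I do not know.",
    }
    for name, accuracy in run_benchmark(load_questions(sample), systems).items():
        print(f"{name}: {accuracy:.2f}")
```

Keeping each system behind the same question-to-answer function means the workflow, the full-context baseline, and any RAG variant are scored on identical questions with an identical metric, and the scorer can later be swapped for something stricter without touching the loop.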