Automatic ancient Roman coin analysis has only recently emerged as a topic of computer science research. Nevertheless, owing to its ever-increasing popularity, the field is already reaching a certain degree of maturity, as witnessed by a substantial publication output in the last decade. At the same time, it is becoming evident that research progress is being limited by a somewhat unfocused direction of effort and by the lack of a coherent framework that facilitates the acquisition and dissemination of robust, repeatable, and rigorous evidence. Thus, in the present article, we seek to address several associated challenges. Firstly, (i) we provide a first overview and discussion of different challenges in the field, some of which have been scarcely investigated to date, and others which have hitherto been unrecognized and unaddressed. Secondly, (ii) we introduce the first data set carefully collected and curated to facilitate the methodological evaluation of algorithms and, specifically, the study of the effects of coin preservation grades on the performance of automatic methods. Indeed, until now, only one published work has recognized the need for this kind of analysis, a need that would be obvious to any numismatist. We also discuss the wide range of considerations that had to be taken into account in collecting this corpus, explain our decisions, and describe its content in detail. Briefly, the data set comprises 100 different coin types, all with multiple examples in Fine, Very Fine, and Extremely Fine conditions, giving a total of over 650 different specimens. These correspond to 44 issuing authorities and span a period of approximately 300 years (from 27 BC until 244 AD). In summary, the present article should be an invaluable resource to researchers in the field, and we encourage the community to adopt the collected corpus, freely available for research purposes, as a standard evaluation benchmark.