How to Turn 175 Years of Words in Scientific American into an Image
By Moritz Stefaner. Credit: Nick Higgins
I set up a data-preprocessing pipeline early on to extract the text from the .pdf files and run the first analyses. A central question in any data-science project is how wide a net one casts on the data set. It soon became apparent that any texts from the predigital era of Scientific American (before 1993) are to some degree affected by optical character-recognition (OCR) errors. Doing data science means having to live with imperfections. Each new discovery instigates curiosity, and I encourage others to view this data set not as an objective and final measurement but as inspiration for new questions.
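To make the idea of a first analysis concrete, here is a minimal sketch of one possible step, assuming the text has already been extracted from the .pdf files into one string per year. It is not the pipeline used for the project. The function names, the tokenization rule and the `min_years` threshold are all hypothetical choices made for illustration. The sketch computes each word's share of a year's text, then drops words that appear in too few years, which is one crude way to reduce the influence of one-off OCR errors.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters, optionally followed by an
# apostrophe and more letters (e.g. "don't"). Digits and symbols are ignored.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text):
    """Lowercase the text and return its word tokens in order."""
    return TOKEN_RE.findall(text.lower())


def relative_frequencies(texts_by_year):
    """Map each year to {word: share of that year's tokens}."""
    result = {}
    for year, text in texts_by_year.items():
        counts = Counter(tokenize(text))
        total = sum(counts.values())
        # A year with no tokens gets an empty mapping rather than a division by zero.
        result[year] = {w: c / total for w, c in counts.items()} if total else {}
    return result


def words_in_enough_years(freqs_by_year, min_years=2):
    """Return the set of words that occur in at least `min_years` years.

    Hypothetical heuristic: a token seen in only one year is more likely
    to be an OCR artifact than a token that recurs across years.
    """
    year_counts = Counter()
    for freqs in freqs_by_year.values():
        year_counts.update(freqs.keys())  # each word counted once per year
    return {w for w, n in year_counts.items() if n >= min_years}


if __name__ == "__main__":
    # Toy stand-in for extracted text; not real data from the archive.
    sample = {
        1845: "The telegraph and the engine",
        1846: "The engine",
    }
    freqs = relative_frequencies(sample)
    keep = words_in_enough_years(freqs, min_years=2)
    for year in sorted(freqs):
        print(year, {w: round(f, 2) for w, f in freqs[year].items() if w in keep})
```

Using shares rather than raw counts keeps years with different amounts of text comparable. The recurrence filter is deliberately blunt: it would also discard genuine words that appear only once, so the threshold is a trade-off and not a correction of the underlying OCR errors.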