We have very large scrapers (~2,000,000 rows) and are facing performance issues. I've read the performance optimization chapter in the manual and adjusted my projects accordingly. However, I still see very heavy disk usage while a project is running (tested with both MySQL and SQLite).
The problem seems to be the many small chunks that get written to disk during a run. Would it be possible to, e.g., cache ~10-50 MB of results in RAM before persisting them to disk/database (or better: make the cache size configurable)? In my opinion this would reduce disk usage considerably.
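To make the idea concrete, here is a rough sketch of the kind of buffering I have in mind, written in Python just for illustration. The `BufferedWriter` class, the `results` table, its `url`/`title` columns, and the batch size are all placeholders, not anything from your API:

```python
import sqlite3

class BufferedWriter:
    """Buffer scraped rows in RAM and flush them to the DB in one batch."""

    def __init__(self, conn, flush_every=10_000):
        self.conn = conn
        self.flush_every = flush_every  # configurable batch size (placeholder)
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.flush_every:
            self.flush()

    def flush(self):
        if not self.rows:
            return
        # one transaction per batch instead of one write per row
        with self.conn:
            self.conn.executemany(
                "INSERT INTO results (url, title) VALUES (?, ?)",
                self.rows,
            )
        self.rows.clear()

conn = sqlite3.connect("scraper.db")
conn.execute("CREATE TABLE IF NOT EXISTS results (url TEXT, title TEXT)")
writer = BufferedWriter(conn, flush_every=10_000)
for i in range(100_000):
    writer.add((f"https://example.com/{i}", f"page {i}"))
writer.flush()  # persist whatever is left in the buffer
```

Batching like this turns millions of tiny writes into a few thousand larger transactions, which is where I'd expect most of the disk pressure to disappear.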
On the SQL side there is "LOAD DATA INFILE" to insert data efficiently; maybe the buffered rows could be loaded into the database that way. Another possibility would be writing a binary stream of data to a proprietary file format.
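A sketch of how the LOAD DATA route could look, again only as an assumption about a possible implementation: dump the RAM buffer to a temporary CSV and load it in a single statement. I'm using `pymysql` here with hypothetical table/column names; the server must have LOCAL INFILE enabled:

```python
import csv
import os
import tempfile
import pymysql  # assumed driver; any connector with LOCAL INFILE support works

def bulk_load(conn, rows):
    """Dump buffered rows to a temp CSV and load them in one statement."""
    with tempfile.NamedTemporaryFile("w", newline="", suffix=".csv",
                                     delete=False) as f:
        csv.writer(f).writerows(rows)
        path = f.name
    try:
        with conn.cursor() as cur:
            # one bulk load instead of millions of INSERTs
            cur.execute(
                f"LOAD DATA LOCAL INFILE '{path}' INTO TABLE results "
                "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' "
                "LINES TERMINATED BY '\\n' (url, title)"
            )
        conn.commit()
    finally:
        os.unlink(path)  # clean up the temp file

conn = pymysql.connect(host="localhost", user="scraper", password="...",
                       database="scraperdb", local_infile=True)
bulk_load(conn, [("https://example.com/1", "page 1"),
                 ("https://example.com/2", "page 2")])
```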
It would be great if you could offer a solution for better performance! If such an option already exists, please point me in the right direction.