[00:05] *** jeekl has joined #archiveteam-ot [00:08] *** ayanami_ has quit IRC (Quit: Leaving) [00:13] 0|0|0|USE TEMP B-TREE FOR ORDER BY [00:13] Yeah, that does not sound good. [00:23] ivan: Hmm, any ideas how to avoid that? Even an index over (priority, status, id) (i.e. the columns I ORDER BY) doesn't help. [00:24] The query SQLAlchemy generates is: SELECT ... FROM queued_urls WHERE queued_urls.status IN ('todo', 'error') ORDER BY queued_urls.priority DESC, queued_urls.status DESC, queued_urls.id LIMIT 1 OFFSET 0; [00:25] And this is what SQLite makes of it: [00:25] 0|0|0|SEARCH TABLE queued_urls USING INDEX ix_queued_urls_status (status=?) [00:25] 0|0|0|EXECUTE LIST SUBQUERY 1 [00:25] 0|0|0|USE TEMP B-TREE FOR ORDER BY [00:53] JAA: I assume you need an index that matches most of the ORDER BY, either forward or reverse [00:54] ivan: CREATE INDEX ix_queued_urls_prio_status_id ON queued_urls (priority DESC, status DESC, id); should do that, no? [00:54] you confirmed this index exists? [00:55] hmm [00:55] https://www.sqlite.org/lang_indexedby.html [00:55] sqlite> EXPLAIN QUERY PLAN SELECT * FROM queued_urls WHERE status = "todo" AND priority = 0 LIMIT 1; [00:56] 0|0|0|SEARCH TABLE queued_urls USING INDEX ix_queued_urls_prio_status_id (priority=? AND status=?) [00:56] Index seems to be fine. [00:56] maybe the sqlalchemy is a little too complicated for the query planner [00:56] maybe the IN or the OFFSET [00:57] try minimized versions [00:58] Dropping OFFSET and rewriting the IN to status = "todo" OR status = "error" makes no difference. [00:59] Using only 'status = "todo"' instead of the OR gets rid of the EXECUTE LIST SUBQUERY, but that's probably not the issue. [00:59] Yeah, doesn't seem to make a difference for the performance. [01:00] The query SQLAlchemy generates seems to be pretty basic really. Haven't seen OFFSET before, but that should simply get optimised out probably. [01:00] Otherwise, it's exactly how I'd write that query by hand as well. [01:01] Hmmm [01:01] sqlite> EXPLAIN QUERY PLAN SELECT * FROM queued_urls INDEXED BY ix_queued_urls_prio_status_id WHERE queued_urls.status IN ('todo', 'error') ORDER BY queued_urls.priority DESC, queued_urls.status DESC, queued_urls.id LIMIT 1; [01:01] 0|0|0|SCAN TABLE queued_urls USING INDEX ix_queued_urls_prio_status_id [01:01] Interesting. [01:02] heh [01:02] Yup, that INDEXED BY makes all the difference. [01:04] I wonder if anyone ever uses wpull with a non-SQLite DB. [01:05] Except me with that PostgreSQL setup, but I replaced the entire URLTable there anyway since SQLAlchemy had such terrible performance. [01:05] Maybe we can just get rid of it entirely. I mean, the code looks nice and everything, but it's just *soo* slow compared to writing optimised SQL. [01:05] Or even just writing standard SQL. [01:06] For example, the queued_urls table has parent_id and root_id columns to refer to the parent and root URL, respectively. [01:06] But if you make a SELECT, SQLAlchemy doesn't do a JOIN to get that data as well. Nope, instead, when you first access the values, it goes to fetch it from the DB. [01:07] You can make a SELECT query and get thousands of rows from the table at once, and then it will retrieve the parent and root URL for each entry individually. [01:07] This was my main issue on the distributed setup. I was checking out URL blocks instead of individual URLs to reduce the latency impact. Well, yeah... [01:08] Couldn't get SQLAlchemy to do the right thing there either. [01:08] Anyway... [01:09] "The INDEXED BY clause is not intended for use in tuning the performance of a query." HAHA, GREAT! [01:11] It's also "premature optimization" according to https://www.sqlite.org/queryplanner-ng.html#howtofix [01:20] It probably doesn't help that the 'status' column is a VARCHAR(11) which is then enforced to have one of five values. :-| [01:22] ivan: Thoughts on using INDEXED BY on SQLite for now since it works and finding a proper solution later? [01:22] https://www.fanatical.com/en/pick-and-mix/aspyr-pick-and-mix for any Linux Gamers or Mac Gamers [01:23] Where the proper solution in my opinion would be something involving getting rid of SQLAlchemy's ORM if not all of it, i.e. where we'd have the queries directly in the code instead of these .filter_by etc. constructs. [01:23] The nice thing about SQLAlchemy is support for all kinds of different databases, so it might still be worth keeping that part. [01:28] *** deathy has quit IRC (Read error: Connection reset by peer) [01:28] *** mgrytbak has quit IRC (Read error: Connection reset by peer) [01:29] *** mgrytbak has joined #archiveteam-ot [01:30] *** DrasticAc has quit IRC (Ping timeout: 615 seconds) [01:31] *** pnJay has quit IRC (Ping timeout: 615 seconds) [01:34] *** diggan has quit IRC (Ping timeout: 615 seconds) [01:35] Just need to figure out how to add that clause. SQLAlchemy doesn't make this stuff easy. [01:37] *** horkermon has quit IRC (Ping timeout: 615 seconds) [01:37] *** chferfa has quit IRC (Read error: Connection reset by peer) [01:38] RIP IRCCloud [01:38] looks like it [01:39] *** mgrytbak has quit IRC (Read error: Connection reset by peer) [01:43] *** simon987 has joined #archiveteam-ot [01:47] *** diggan has joined #archiveteam-ot [01:47] *** mgrytbak has joined #archiveteam-ot [01:47] *** horkermon has joined #archiveteam-ot [01:47] *** Fusl_ has quit IRC (Ping timeout: 615 seconds) [01:48] *** Fusl_ has joined #archiveteam-ot [01:48] *** Fusl sets mode: +o Fusl_ [01:48] *** deathy has joined #archiveteam-ot [01:48] ivan: Oh, regarding the "does anyone use non-SQLite with wpull", I just remembered that I had to remove the "OR IGNORE" clause on INSERT queries. Those are SQLite-specific but used in the BaseSQLURLTable, not just the SQLite one. So I guess the answer is "noone" because it basically can't work. I believe "OR IGNORE" is unique to SQLite; other DBs use "ON CONFLICT IGNORE" and similar constructs instead. [01:49] Cf https://github.com/JustAnotherArchivist/wpull/commit/5dee4313d580590d4b6e77115c20db53eb666fc5 [01:50] *** pnJay has joined #archiveteam-ot [01:53] *** chferfa has joined #archiveteam-ot [01:57] *** DrasticAc has joined #archiveteam-ot [03:00] *** cfarquhar has quit IRC (Read error: Operation timed out) [03:00] *** cfarquhar has joined #archiveteam-ot [03:03] *** apt-get has quit IRC (Read error: Operation timed out) [03:17] "sqlite3.OperationalError: no query solution" :-| [03:23] *** odemg has quit IRC (Ping timeout: 615 seconds) [03:24] *** deathy_ has joined #archiveteam-ot [03:24] *** pnJay_ has joined #archiveteam-ot [03:24] *** JAA has quit IRC (Read error: Operation timed out) [03:24] *** asdf0101 has quit IRC (Read error: Operation timed out) [03:25] *** lunik1 has quit IRC (Read error: Operation timed out) [03:26] *** simon816 has quit IRC (Ping timeout: 246 seconds) [03:27] *** cfarquhar has quit IRC (Ping timeout: 246 seconds) [03:28] *** eythian has quit IRC (Ping timeout: 246 seconds) [03:29] *** odemg has joined #archiveteam-ot [03:29] *** deathy has quit IRC (Ping timeout: 615 seconds) [03:29] *** deathy_ is now known as deathy [03:31] *** pnJay has quit IRC (Ping timeout: 615 seconds) [03:31] *** pnJay_ is now known as pnJay [03:32] ivan: I got the INDEXED BY solution working. There is a significant slowdown, but it's not as bad as before (duh). Based on 10 runs of a small test script, I'm seeing a ~40 % slowdown. 1000 checkouts/checkins take roughly 8.5 s with the old code and 11.9 s with the new one on my test setup with that huge DB I mentioned. [03:32] *** eythian has joined #archiveteam-ot [03:40] bakJAA: how badly do you need this priority thing anyway? [03:41] if it's any faster, maybe there could be a two-priority todo/todofirst? [03:41] also wow 8.5 s [03:41] does this go any faster without any fsync? [03:42] I hope this isn't _just_ for Sketch [03:43] *** ayanami_ has joined #archiveteam-ot [03:43] It would be extremely useful in ArchiveBot to defer loading less important resources later. For example, on a forum, we might want to delay member profiles or per-post URLs so that we can cover as much content as possible first (e.g. before the shutdown/deletion/whatever). Same for off-site links. [03:44] Now, we often run multiple jobs, one with very restrictive ignores to make sure it finishes in time, and another which runs wild. That's far from optimal though. [03:44] I assume it would be much faster without fsync, yeah. Let me test that. [03:45] *** dhyan_nat has joined #archiveteam-ot [03:47] marked: Not at all. It probably won't even be used for Sketch. This has been on my plate for 1.5 years, I implemented everything beginning of last year but never really polished it. [03:49] *nod* sounds reeeaallllyy useful [03:53] ivan: Hmm, interesting, the PRAGMA query isn't logged by SQLAlchemy. Maybe because it uses the connection.execute method directly? [03:53] But I don't see any performance difference whatsoever between having the PRAGMA applied or not. [03:54] The _apply_pragma_callback function does get called though. [04:01] *** paul2520 has quit IRC (Read error: Operation timed out) [04:02] *** paul2520 has joined #archiveteam-ot [04:04] *** bakJAA is now known as AlsoJAA [04:20] I've used https://github.com/uber/pyflame to figure out why Python programs are slow but the new thing is https://github.com/benfred/py-spy [04:25] *** simon816 has joined #archiveteam-ot [04:25] *** asdf0101 has joined #archiveteam-ot [04:26] *** cfarquhar has joined #archiveteam-ot [04:26] *** lunik1 has joined #archiveteam-ot [04:29] *** JAA has joined #archiveteam-ot [04:29] *** Fusl sets mode: +o JAA [04:29] *** AlsoJAA sets mode: +o JAA [04:30] *** ayanami_ has quit IRC (Quit: Leaving) [05:55] *** Zerote has joined #archiveteam-ot [06:42] *** Specular has joined #archiveteam-ot [06:42] any idea if this has been submitted to archivebot, or considered as a project? https://www.reddit.com/r/Archiveteam/comments/b7avru/facepunch_studios_garrys_mod_rust_forums_are_very/ [06:42] filtered the archivebot dashboard crawls but didn't see anything (kind of expected given the post was a month ago) [06:56] picked up a second batch of flavorless loose leaf tea, gaawd. [06:57] so inconsistent. It's either fantastic or completely bland. Quality control is all over the place. [07:01] *** deathy has quit IRC () [07:01] *** deathy has joined #archiveteam-ot [07:17] *** diggan has quit IRC () [07:18] *** diggan has joined #archiveteam-ot [07:18] *** Zerote has quit IRC (Read error: Operation timed out) [07:21] *** Zerote has joined #archiveteam-ot [07:34] *** Oddly has joined #archiveteam-ot [07:50] Specular: There's a channel #faceplant for that - I'm not sure if a project got off the ground though [07:51] ah, k [10:04] *** chferfa has quit IRC (Quit: Connection closed for inactivity) [10:04] *** Verified_ has quit IRC (Ping timeout: 252 seconds) [10:10] *** killsushi has quit IRC (Quit: Leaving) [10:37] *** deevious has joined #archiveteam-ot [10:40] *** Specular has quit IRC (Read error: Connection reset by peer) [10:43] *** BlueMaxim has joined #archiveteam-ot [10:45] *** Hani111 has joined #archiveteam-ot [10:46] *** RKenshin has joined #archiveteam-ot [10:46] *** eientei9- has joined #archiveteam-ot [10:47] *** atbk has quit IRC (Read error: Connection reset by peer) [10:47] *** eientei95 has quit IRC (Read error: Connection reset by peer) [10:47] *** Kenshin has quit IRC (Read error: Connection reset by peer) [10:47] *** RKenshin is now known as Kenshin [10:47] *** eientei9- is now known as eientei95 [10:47] *** Hani111_ has joined #archiveteam-ot [10:47] *** AlsoJAA has quit IRC (Remote host closed the connection) [10:47] *** AlsoJAA has joined #archiveteam-ot [10:47] *** BlueMax has quit IRC (Read error: Connection reset by peer) [10:48] *** svchfoo3 sets mode: +o AlsoJAA [10:51] *** phuz has joined #archiveteam-ot [10:52] *** gandalf_ has joined #archiveteam-ot [10:54] *** Vito` has quit IRC (Read error: Connection reset by peer) [10:54] *** mgrytbak has quit IRC (Read error: Connection reset by peer) [10:54] *** revi has quit IRC (Read error: Connection reset by peer) [10:54] *** t3 has quit IRC (Write error: Connection reset by peer) [10:54] *** hook54321 has quit IRC (Write error: Connection reset by peer) [10:54] *** diggan has quit IRC (Read error: Connection reset by peer) [10:54] *** Dj-Wawa has quit IRC (Write error: Connection reset by peer) [10:54] *** martinlig has quit IRC (Write error: Connection reset by peer) [10:54] *** pnJay has quit IRC (Read error: Connection reset by peer) [10:54] *** horkermon has quit IRC (Read error: Connection reset by peer) [10:54] *** DrasticAc has quit IRC (Read error: Connection reset by peer) [10:54] *** chr1sm has quit IRC (Write error: Connection reset by peer) [10:54] *** MarcoS has quit IRC (Write error: Connection reset by peer) [10:54] *** Ctrl-S_ has quit IRC (Write error: Connection reset by peer) [10:54] *** HCross has quit IRC (Write error: Connection reset by peer) [10:54] *** deathy has quit IRC (Write error: Connection reset by peer) [10:54] *** Dimtree has quit IRC () [10:54] *** rellem has quit IRC (Ping timeout: 615 seconds) [10:54] *** apache2 has quit IRC (Ping timeout: 615 seconds) [10:54] *** kode54 has quit IRC (Ping timeout: 615 seconds) [10:54] *** Hani has quit IRC (Ping timeout: 615 seconds) [10:54] *** xit_ has quit IRC (Ping timeout: 615 seconds) [10:54] *** Hani111_ is now known as Hani [10:54] *** phuzion has quit IRC (Write error: Connection reset by peer) [10:54] *** kode54 has joined #archiveteam-ot [10:54] *** Jens has quit IRC (Ping timeout: 615 seconds) [10:54] *** deathy has joined #archiveteam-ot [10:54] *** gandalf has quit IRC (Quit: A wizard always leaves on time.) [10:54] *** Meroje has quit IRC (Remote host closed the connection) [10:54] *** abstract has quit IRC (Quit: *.banana *.split) [10:54] *** N4Y has quit IRC (Read error: Connection reset by peer) [10:54] *** mgrytbak has joined #archiveteam-ot [10:54] *** martinlig has joined #archiveteam-ot [10:54] *** DrasticAc has joined #archiveteam-ot [10:54] *** HCross has joined #archiveteam-ot [10:54] *** MarcoS has joined #archiveteam-ot [10:54] *** Ctrl-S_ has joined #archiveteam-ot [10:54] *** apache2 has joined #archiveteam-ot [10:54] *** Meroje has joined #archiveteam-ot [10:54] *** Jens has joined #archiveteam-ot [10:54] *** chr1sm has joined #archiveteam-ot [10:54] *** Dj-Wawa has joined #archiveteam-ot [10:54] *** revi has joined #archiveteam-ot [10:54] *** horkermon has joined #archiveteam-ot [10:54] *** rellem has joined #archiveteam-ot [10:54] *** Vito` has joined #archiveteam-ot [10:54] *** t3 has joined #archiveteam-ot [10:54] *** hook54321 has joined #archiveteam-ot [10:54] *** diggan has joined #archiveteam-ot [10:54] *** xit_ has joined #archiveteam-ot [10:54] *** abstract has joined #archiveteam-ot [10:54] *** gandalf_ is now known as gandalf [10:54] *** Fusl sets mode: +o HCross [10:54] *** pnJay has joined #archiveteam-ot [10:54] *** atbk has joined #archiveteam-ot [10:54] *** Fusl sets mode: +o hook54321 [10:56] *** Hani111 has quit IRC (Ping timeout: 615 seconds) [10:56] *** Shen has quit IRC (Ping timeout: 240 seconds) [10:57] *** N4Y has joined #archiveteam-ot [11:05] *** Dimtree has joined #archiveteam-ot [11:54] *** Shen has joined #archiveteam-ot [12:12] *** apt-get has joined #archiveteam-ot [12:19] Specular (offline): I believe Fusl ran a grab-site crawl [12:41] *** Specular has joined #archiveteam-ot [12:47] *** Verified_ has joined #archiveteam-ot [12:55] *** deevious has quit IRC (Ping timeout: 252 seconds) [12:57] *** deevious has joined #archiveteam-ot [13:15] *** Specular has quit IRC (Read error: Connection reset by peer) [13:16] *** Specular has joined #archiveteam-ot [13:22] *** Despatche has quit IRC (Quit: Read error: Connection reset by deer) [13:22] *** Despatche has joined #archiveteam-ot [13:23] *** Specular has quit IRC (Read error: Connection reset by peer) [13:25] *** Specular has joined #archiveteam-ot [13:37] *** BlueMaxim has quit IRC (Read error: Connection reset by peer) [14:05] *** robogoat_ has quit IRC (Read error: Operation timed out) [14:21] *** Specular has quit IRC (Ping timeout: 360 seconds) [14:24] i ran it through archivebot, not grab-site [14:25] *** robogoat has joined #archiveteam-ot [14:41] *** deevious has quit IRC (Ping timeout: 252 seconds) [14:48] *** deevious has joined #archiveteam-ot [14:49] *** AlsoJAA has quit IRC (Quit: leaving) [14:49] *** AlsoJAA has joined #archiveteam-ot [14:50] *** JAA sets mode: +o AlsoJAA [14:53] *** Zerote has quit IRC (Read error: Operation timed out) [15:02] ivan: Sure, I can do performance analysis, but the PRAGMA not taking effect probably has nothing to do with the performance on the Python side at least. [15:16] *** Zerote has joined #archiveteam-ot [16:08] *** ealgase has joined #archiveteam-ot [17:05] *** m007a83 has quit IRC (Ping timeout: 252 seconds) [17:08] *** apt-get has quit IRC (Read error: Operation timed out) [17:36] *** katocala has quit IRC () [17:41] *** Verified_ has quit IRC (Ping timeout: 252 seconds) [17:47] *** m007a83 has joined #archiveteam-ot [18:25] *** martinlig has quit IRC (Quit: Connection closed for inactivity) [18:31] *** kiska1 has quit IRC (Read error: Operation timed out) [18:32] *** kiska1 has joined #archiveteam-ot [18:32] *** Fusl sets mode: +o kiska1 [19:20] *** mal has quit IRC (mal) [19:42] *** dhyan_nat has quit IRC (Read error: Operation timed out) [20:00] *** Verified_ has joined #archiveteam-ot [20:13] *** icedice has joined #archiveteam-ot [20:17] *** mal has joined #archiveteam-ot [20:20] *** mal has quit IRC (Client Quit) [20:21] *** mal has joined #archiveteam-ot [21:46] *** BlueMax has joined #archiveteam-ot [22:32] *** BlueMax has quit IRC (Read error: Connection reset by peer) [22:57] *** Ravenloft has joined #archiveteam-ot