#archiveteam-ot 2019-05-02,Thu

↑back Search

Time Nickname Message
00:05 🔗 jeekl has joined #archiveteam-ot
00:08 🔗 ayanami_ has quit IRC (Quit: Leaving)
00:13 🔗 JAA 0|0|0|USE TEMP B-TREE FOR ORDER BY
00:13 🔗 JAA Yeah, that does not sound good.
00:23 🔗 JAA ivan: Hmm, any ideas how to avoid that? Even an index over (priority, status, id) (i.e. the columns I ORDER BY) doesn't help.
00:24 🔗 JAA The query SQLAlchemy generates is: SELECT ... FROM queued_urls WHERE queued_urls.status IN ('todo', 'error') ORDER BY queued_urls.priority DESC, queued_urls.status DESC, queued_urls.id LIMIT 1 OFFSET 0;
00:25 🔗 JAA And this is what SQLite makes of it:
00:25 🔗 JAA 0|0|0|SEARCH TABLE queued_urls USING INDEX ix_queued_urls_status (status=?)
00:25 🔗 JAA 0|0|0|EXECUTE LIST SUBQUERY 1
00:25 🔗 JAA 0|0|0|USE TEMP B-TREE FOR ORDER BY
00:53 🔗 ivan JAA: I assume you need an index that matches most of the ORDER BY, either forward or reverse
00:54 🔗 JAA ivan: CREATE INDEX ix_queued_urls_prio_status_id ON queued_urls (priority DESC, status DESC, id); should do that, no?
00:54 🔗 ivan you confirmed this index exists?
00:55 🔗 ivan hmm
00:55 🔗 ivan https://www.sqlite.org/lang_indexedby.html
00:55 🔗 JAA sqlite> EXPLAIN QUERY PLAN SELECT * FROM queued_urls WHERE status = "todo" AND priority = 0 LIMIT 1;
00:56 🔗 JAA 0|0|0|SEARCH TABLE queued_urls USING INDEX ix_queued_urls_prio_status_id (priority=? AND status=?)
00:56 🔗 JAA Index seems to be fine.
00:56 🔗 ivan maybe the sqlalchemy is a little too complicated for the query planner
00:56 🔗 ivan maybe the IN or the OFFSET
00:57 🔗 ivan try minimized versions
00:58 🔗 JAA Dropping OFFSET and rewriting the IN to status = "todo" OR status = "error" makes no difference.
00:59 🔗 JAA Using only 'status = "todo"' instead of the OR gets rid of the EXECUTE LIST SUBQUERY, but that's probably not the issue.
00:59 🔗 JAA Yeah, doesn't seem to make a difference for the performance.
01:00 🔗 JAA The query SQLAlchemy generates seems to be pretty basic really. Haven't seen OFFSET before, but that should simply get optimised out probably.
01:00 🔗 JAA Otherwise, it's exactly how I'd write that query by hand as well.
01:01 🔗 JAA Hmmm
01:01 🔗 JAA sqlite> EXPLAIN QUERY PLAN SELECT * FROM queued_urls INDEXED BY ix_queued_urls_prio_status_id WHERE queued_urls.status IN ('todo', 'error') ORDER BY queued_urls.priority DESC, queued_urls.status DESC, queued_urls.id LIMIT 1;
01:01 🔗 JAA 0|0|0|SCAN TABLE queued_urls USING INDEX ix_queued_urls_prio_status_id
01:01 🔗 JAA Interesting.
01:02 🔗 ivan heh
01:02 🔗 JAA Yup, that INDEXED BY makes all the difference.
01:04 🔗 JAA I wonder if anyone ever uses wpull with a non-SQLite DB.
01:05 🔗 JAA Except me with that PostgreSQL setup, but I replaced the entire URLTable there anyway since SQLAlchemy had such terrible performance.
01:05 🔗 JAA Maybe we can just get rid of it entirely. I mean, the code looks nice and everything, but it's just *soo* slow compared to writing optimised SQL.
01:05 🔗 JAA Or even just writing standard SQL.
01:06 🔗 JAA For example, the queued_urls table has parent_id and root_id columns to refer to the parent and root URL, respectively.
01:06 🔗 JAA But if you make a SELECT, SQLAlchemy doesn't do a JOIN to get that data as well. Nope, instead, when you first access the values, it goes to fetch it from the DB.
01:07 🔗 JAA You can make a SELECT query and get thousands of rows from the table at once, and then it will retrieve the parent and root URL for each entry individually.
01:07 🔗 JAA This was my main issue on the distributed setup. I was checking out URL blocks instead of individual URLs to reduce the latency impact. Well, yeah...
01:08 🔗 JAA Couldn't get SQLAlchemy to do the right thing there either.
01:08 🔗 JAA Anyway...
01:09 🔗 JAA "The INDEXED BY clause is not intended for use in tuning the performance of a query." HAHA, GREAT!
01:11 🔗 JAA It's also "premature optimization" according to https://www.sqlite.org/queryplanner-ng.html#howtofix
01:20 🔗 JAA It probably doesn't help that the 'status' column is a VARCHAR(11) which is then enforced to have one of five values. :-|
01:22 🔗 JAA ivan: Thoughts on using INDEXED BY on SQLite for now since it works and finding a proper solution later?
01:22 🔗 Flashfire https://www.fanatical.com/en/pick-and-mix/aspyr-pick-and-mix for any Linux Gamers or Mac Gamers
01:23 🔗 JAA Where the proper solution in my opinion would be something involving getting rid of SQLAlchemy's ORM if not all of it, i.e. where we'd have the queries directly in the code instead of these .filter_by etc. constructs.
01:23 🔗 JAA The nice thing about SQLAlchemy is support for all kinds of different databases, so it might still be worth keeping that part.
01:28 🔗 deathy has quit IRC (Read error: Connection reset by peer)
01:28 🔗 mgrytbak has quit IRC (Read error: Connection reset by peer)
01:29 🔗 mgrytbak has joined #archiveteam-ot
01:30 🔗 DrasticAc has quit IRC (Ping timeout: 615 seconds)
01:31 🔗 pnJay has quit IRC (Ping timeout: 615 seconds)
01:34 🔗 diggan has quit IRC (Ping timeout: 615 seconds)
01:35 🔗 JAA Just need to figure out how to add that clause. SQLAlchemy doesn't make this stuff easy.
01:37 🔗 horkermon has quit IRC (Ping timeout: 615 seconds)
01:37 🔗 chferfa has quit IRC (Read error: Connection reset by peer)
01:38 🔗 JAA RIP IRCCloud
01:38 🔗 Flashfire looks like it
01:39 🔗 mgrytbak has quit IRC (Read error: Connection reset by peer)
01:43 🔗 simon987 has joined #archiveteam-ot
01:47 🔗 diggan has joined #archiveteam-ot
01:47 🔗 mgrytbak has joined #archiveteam-ot
01:47 🔗 horkermon has joined #archiveteam-ot
01:47 🔗 Fusl_ has quit IRC (Ping timeout: 615 seconds)
01:48 🔗 Fusl_ has joined #archiveteam-ot
01:48 🔗 Fusl sets mode: +o Fusl_
01:48 🔗 deathy has joined #archiveteam-ot
01:48 🔗 JAA ivan: Oh, regarding the "does anyone use non-SQLite with wpull", I just remembered that I had to remove the "OR IGNORE" clause on INSERT queries. Those are SQLite-specific but used in the BaseSQLURLTable, not just the SQLite one. So I guess the answer is "noone" because it basically can't work. I believe "OR IGNORE" is unique to SQLite; other DBs use "ON CONFLICT IGNORE" and similar constructs instead.
01:49 🔗 JAA Cf https://github.com/JustAnotherArchivist/wpull/commit/5dee4313d580590d4b6e77115c20db53eb666fc5
01:50 🔗 pnJay has joined #archiveteam-ot
01:53 🔗 chferfa has joined #archiveteam-ot
01:57 🔗 DrasticAc has joined #archiveteam-ot
03:00 🔗 cfarquhar has quit IRC (Read error: Operation timed out)
03:00 🔗 cfarquhar has joined #archiveteam-ot
03:03 🔗 apt-get has quit IRC (Read error: Operation timed out)
03:17 🔗 JAA "sqlite3.OperationalError: no query solution" :-|
03:23 🔗 odemg has quit IRC (Ping timeout: 615 seconds)
03:24 🔗 deathy_ has joined #archiveteam-ot
03:24 🔗 pnJay_ has joined #archiveteam-ot
03:24 🔗 JAA has quit IRC (Read error: Operation timed out)
03:24 🔗 asdf0101 has quit IRC (Read error: Operation timed out)
03:25 🔗 lunik1 has quit IRC (Read error: Operation timed out)
03:26 🔗 simon816 has quit IRC (Ping timeout: 246 seconds)
03:27 🔗 cfarquhar has quit IRC (Ping timeout: 246 seconds)
03:28 🔗 eythian has quit IRC (Ping timeout: 246 seconds)
03:29 🔗 odemg has joined #archiveteam-ot
03:29 🔗 deathy has quit IRC (Ping timeout: 615 seconds)
03:29 🔗 deathy_ is now known as deathy
03:31 🔗 pnJay has quit IRC (Ping timeout: 615 seconds)
03:31 🔗 pnJay_ is now known as pnJay
03:32 🔗 bakJAA ivan: I got the INDEXED BY solution working. There is a significant slowdown, but it's not as bad as before (duh). Based on 10 runs of a small test script, I'm seeing a ~40 % slowdown. 1000 checkouts/checkins take roughly 8.5 s with the old code and 11.9 s with the new one on my test setup with that huge DB I mentioned.
03:32 🔗 eythian has joined #archiveteam-ot
03:40 🔗 ivan bakJAA: how badly do you need this priority thing anyway?
03:41 🔗 ivan if it's any faster, maybe there could be a two-priority todo/todofirst?
03:41 🔗 ivan also wow 8.5 s
03:41 🔗 ivan does this go any faster without any fsync?
03:42 🔗 marked I hope this isn't _just_ for Sketch
03:43 🔗 ayanami_ has joined #archiveteam-ot
03:43 🔗 bakJAA It would be extremely useful in ArchiveBot to defer loading less important resources later. For example, on a forum, we might want to delay member profiles or per-post URLs so that we can cover as much content as possible first (e.g. before the shutdown/deletion/whatever). Same for off-site links.
03:44 🔗 bakJAA Now, we often run multiple jobs, one with very restrictive ignores to make sure it finishes in time, and another which runs wild. That's far from optimal though.
03:44 🔗 bakJAA I assume it would be much faster without fsync, yeah. Let me test that.
03:45 🔗 dhyan_nat has joined #archiveteam-ot
03:47 🔗 bakJAA marked: Not at all. It probably won't even be used for Sketch. This has been on my plate for 1.5 years, I implemented everything beginning of last year but never really polished it.
03:49 🔗 marked *nod* sounds reeeaallllyy useful
03:53 🔗 bakJAA ivan: Hmm, interesting, the PRAGMA query isn't logged by SQLAlchemy. Maybe because it uses the connection.execute method directly?
03:53 🔗 bakJAA But I don't see any performance difference whatsoever between having the PRAGMA applied or not.
03:54 🔗 bakJAA The _apply_pragma_callback function does get called though.
04:01 🔗 paul2520 has quit IRC (Read error: Operation timed out)
04:02 🔗 paul2520 has joined #archiveteam-ot
04:04 🔗 bakJAA is now known as AlsoJAA
04:20 🔗 ivan I've used https://github.com/uber/pyflame to figure out why Python programs are slow but the new thing is https://github.com/benfred/py-spy
04:25 🔗 simon816 has joined #archiveteam-ot
04:25 🔗 asdf0101 has joined #archiveteam-ot
04:26 🔗 cfarquhar has joined #archiveteam-ot
04:26 🔗 lunik1 has joined #archiveteam-ot
04:29 🔗 JAA has joined #archiveteam-ot
04:29 🔗 Fusl sets mode: +o JAA
04:29 🔗 AlsoJAA sets mode: +o JAA
04:30 🔗 ayanami_ has quit IRC (Quit: Leaving)
05:55 🔗 Zerote has joined #archiveteam-ot
06:42 🔗 Specular has joined #archiveteam-ot
06:42 🔗 Specular any idea if this has been submitted to archivebot, or considered as a project? https://www.reddit.com/r/Archiveteam/comments/b7avru/facepunch_studios_garrys_mod_rust_forums_are_very/
06:42 🔗 Specular filtered the archivebot dashboard crawls but didn't see anything (kind of expected given the post was a month ago)
06:56 🔗 Specular picked up a second batch of flavorless loose leaf tea, gaawd.
06:57 🔗 Specular so inconsistent. It's either fantastic or completely bland. Quality control is all over the place.
07:01 🔗 deathy has quit IRC ()
07:01 🔗 deathy has joined #archiveteam-ot
07:17 🔗 diggan has quit IRC ()
07:18 🔗 diggan has joined #archiveteam-ot
07:18 🔗 Zerote has quit IRC (Read error: Operation timed out)
07:21 🔗 Zerote has joined #archiveteam-ot
07:34 🔗 Oddly has joined #archiveteam-ot
07:50 🔗 dxrt Specular: There's a channel #faceplant for that - I'm not sure if a project got off the ground though
07:51 🔗 Specular ah, k
10:04 🔗 chferfa has quit IRC (Quit: Connection closed for inactivity)
10:04 🔗 Verified_ has quit IRC (Ping timeout: 252 seconds)
10:10 🔗 killsushi has quit IRC (Quit: Leaving)
10:37 🔗 deevious has joined #archiveteam-ot
10:40 🔗 Specular has quit IRC (Read error: Connection reset by peer)
10:43 🔗 BlueMaxim has joined #archiveteam-ot
10:45 🔗 Hani111 has joined #archiveteam-ot
10:46 🔗 RKenshin has joined #archiveteam-ot
10:46 🔗 eientei9- has joined #archiveteam-ot
10:47 🔗 atbk has quit IRC (Read error: Connection reset by peer)
10:47 🔗 eientei95 has quit IRC (Read error: Connection reset by peer)
10:47 🔗 Kenshin has quit IRC (Read error: Connection reset by peer)
10:47 🔗 RKenshin is now known as Kenshin
10:47 🔗 eientei9- is now known as eientei95
10:47 🔗 Hani111_ has joined #archiveteam-ot
10:47 🔗 AlsoJAA has quit IRC (Remote host closed the connection)
10:47 🔗 AlsoJAA has joined #archiveteam-ot
10:47 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
10:48 🔗 svchfoo3 sets mode: +o AlsoJAA
10:51 🔗 phuz has joined #archiveteam-ot
10:52 🔗 gandalf_ has joined #archiveteam-ot
10:54 🔗 Vito` has quit IRC (Read error: Connection reset by peer)
10:54 🔗 mgrytbak has quit IRC (Read error: Connection reset by peer)
10:54 🔗 revi has quit IRC (Read error: Connection reset by peer)
10:54 🔗 t3 has quit IRC (Write error: Connection reset by peer)
10:54 🔗 hook54321 has quit IRC (Write error: Connection reset by peer)
10:54 🔗 diggan has quit IRC (Read error: Connection reset by peer)
10:54 🔗 Dj-Wawa has quit IRC (Write error: Connection reset by peer)
10:54 🔗 martinlig has quit IRC (Write error: Connection reset by peer)
10:54 🔗 pnJay has quit IRC (Read error: Connection reset by peer)
10:54 🔗 horkermon has quit IRC (Read error: Connection reset by peer)
10:54 🔗 DrasticAc has quit IRC (Read error: Connection reset by peer)
10:54 🔗 chr1sm has quit IRC (Write error: Connection reset by peer)
10:54 🔗 MarcoS has quit IRC (Write error: Connection reset by peer)
10:54 🔗 Ctrl-S_ has quit IRC (Write error: Connection reset by peer)
10:54 🔗 HCross has quit IRC (Write error: Connection reset by peer)
10:54 🔗 deathy has quit IRC (Write error: Connection reset by peer)
10:54 🔗 Dimtree has quit IRC ()
10:54 🔗 rellem has quit IRC (Ping timeout: 615 seconds)
10:54 🔗 apache2 has quit IRC (Ping timeout: 615 seconds)
10:54 🔗 kode54 has quit IRC (Ping timeout: 615 seconds)
10:54 🔗 Hani has quit IRC (Ping timeout: 615 seconds)
10:54 🔗 xit_ has quit IRC (Ping timeout: 615 seconds)
10:54 🔗 Hani111_ is now known as Hani
10:54 🔗 phuzion has quit IRC (Write error: Connection reset by peer)
10:54 🔗 kode54 has joined #archiveteam-ot
10:54 🔗 Jens has quit IRC (Ping timeout: 615 seconds)
10:54 🔗 deathy has joined #archiveteam-ot
10:54 🔗 gandalf has quit IRC (Quit: A wizard always leaves on time.)
10:54 🔗 Meroje has quit IRC (Remote host closed the connection)
10:54 🔗 abstract has quit IRC (Quit: *.banana *.split)
10:54 🔗 N4Y has quit IRC (Read error: Connection reset by peer)
10:54 🔗 mgrytbak has joined #archiveteam-ot
10:54 🔗 martinlig has joined #archiveteam-ot
10:54 🔗 DrasticAc has joined #archiveteam-ot
10:54 🔗 HCross has joined #archiveteam-ot
10:54 🔗 MarcoS has joined #archiveteam-ot
10:54 🔗 Ctrl-S_ has joined #archiveteam-ot
10:54 🔗 apache2 has joined #archiveteam-ot
10:54 🔗 Meroje has joined #archiveteam-ot
10:54 🔗 Jens has joined #archiveteam-ot
10:54 🔗 chr1sm has joined #archiveteam-ot
10:54 🔗 Dj-Wawa has joined #archiveteam-ot
10:54 🔗 revi has joined #archiveteam-ot
10:54 🔗 horkermon has joined #archiveteam-ot
10:54 🔗 rellem has joined #archiveteam-ot
10:54 🔗 Vito` has joined #archiveteam-ot
10:54 🔗 t3 has joined #archiveteam-ot
10:54 🔗 hook54321 has joined #archiveteam-ot
10:54 🔗 diggan has joined #archiveteam-ot
10:54 🔗 xit_ has joined #archiveteam-ot
10:54 🔗 abstract has joined #archiveteam-ot
10:54 🔗 gandalf_ is now known as gandalf
10:54 🔗 Fusl sets mode: +o HCross
10:54 🔗 pnJay has joined #archiveteam-ot
10:54 🔗 atbk has joined #archiveteam-ot
10:54 🔗 Fusl sets mode: +o hook54321
10:56 🔗 Hani111 has quit IRC (Ping timeout: 615 seconds)
10:56 🔗 Shen has quit IRC (Ping timeout: 240 seconds)
10:57 🔗 N4Y has joined #archiveteam-ot
11:05 🔗 Dimtree has joined #archiveteam-ot
11:54 🔗 Shen has joined #archiveteam-ot
12:12 🔗 apt-get has joined #archiveteam-ot
12:19 🔗 marked Specular (offline): I believe Fusl ran a grab-site crawl
12:41 🔗 Specular has joined #archiveteam-ot
12:47 🔗 Verified_ has joined #archiveteam-ot
12:55 🔗 deevious has quit IRC (Ping timeout: 252 seconds)
12:57 🔗 deevious has joined #archiveteam-ot
13:15 🔗 Specular has quit IRC (Read error: Connection reset by peer)
13:16 🔗 Specular has joined #archiveteam-ot
13:22 🔗 Despatche has quit IRC (Quit: Read error: Connection reset by deer)
13:22 🔗 Despatche has joined #archiveteam-ot
13:23 🔗 Specular has quit IRC (Read error: Connection reset by peer)
13:25 🔗 Specular has joined #archiveteam-ot
13:37 🔗 BlueMaxim has quit IRC (Read error: Connection reset by peer)
14:05 🔗 robogoat_ has quit IRC (Read error: Operation timed out)
14:21 🔗 Specular has quit IRC (Ping timeout: 360 seconds)
14:24 🔗 Fusl i ran it through archivebot, not grab-site
14:25 🔗 robogoat has joined #archiveteam-ot
14:41 🔗 deevious has quit IRC (Ping timeout: 252 seconds)
14:48 🔗 deevious has joined #archiveteam-ot
14:49 🔗 AlsoJAA has quit IRC (Quit: leaving)
14:49 🔗 AlsoJAA has joined #archiveteam-ot
14:50 🔗 JAA sets mode: +o AlsoJAA
14:53 🔗 Zerote has quit IRC (Read error: Operation timed out)
15:02 🔗 JAA ivan: Sure, I can do performance analysis, but the PRAGMA not taking effect probably has nothing to do with the performance on the Python side at least.
15:16 🔗 Zerote has joined #archiveteam-ot
16:08 🔗 ealgase has joined #archiveteam-ot
17:05 🔗 m007a83 has quit IRC (Ping timeout: 252 seconds)
17:08 🔗 apt-get has quit IRC (Read error: Operation timed out)
17:36 🔗 katocala has quit IRC ()
17:41 🔗 Verified_ has quit IRC (Ping timeout: 252 seconds)
17:47 🔗 m007a83 has joined #archiveteam-ot
18:25 🔗 martinlig has quit IRC (Quit: Connection closed for inactivity)
18:31 🔗 kiska1 has quit IRC (Read error: Operation timed out)
18:32 🔗 kiska1 has joined #archiveteam-ot
18:32 🔗 Fusl sets mode: +o kiska1
19:20 🔗 mal has quit IRC (mal)
19:42 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
20:00 🔗 Verified_ has joined #archiveteam-ot
20:13 🔗 icedice has joined #archiveteam-ot
20:17 🔗 mal has joined #archiveteam-ot
20:20 🔗 mal has quit IRC (Client Quit)
20:21 🔗 mal has joined #archiveteam-ot
21:46 🔗 BlueMax has joined #archiveteam-ot
22:32 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
22:57 🔗 Ravenloft has joined #archiveteam-ot

irclogger-viewer