[02:30] *** ndiddy has quit IRC ()
[02:41] *** SpaffGarg has quit IRC (Read error: Operation timed out)
[02:43] *** SpaffGarg has joined #archiveteam-bs
[02:54] *** zeryl has joined #archiveteam-bs
[03:12] *** pizzaiolo has quit IRC (pizzaiolo)
[03:23] *** zeryl has quit IRC (Quit: Page closed)
[03:23] *** Zeryl has joined #archiveteam-bs
[04:11] Let's try here:
[04:11] WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[04:15] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[04:21] *** Sk1d has joined #archiveteam-bs
[04:29] *** signius has quit IRC (Quit: Leaving)
[05:04] is there anything that can be archived from an XMPP server?
[05:08] *** fie has joined #archiveteam-bs
[05:28] possibly contact info, and MUC logs depending on how long back they allow reviewing, if at all
[05:45] i'm uploading more Charlie Rose from 1992-01
[07:07] *** chazchaz has quit IRC (Read error: Operation timed out)
[07:07] *** Kenshin has quit IRC (Read error: Operation timed out)
[07:08] *** chazchaz has joined #archiveteam-bs
[07:09] *** Kenshin has joined #archiveteam-bs
[07:12] *** schbirid has joined #archiveteam-bs
[07:25] *** SpaffGarg has quit IRC (Read error: Operation timed out)
[07:28] *** SpaffGarg has joined #archiveteam-bs
[08:08] *** mls has quit IRC (Ping timeout: 250 seconds)
[08:15] *** mls has joined #archiveteam-bs
[08:32] *** GE has joined #archiveteam-bs
[08:48] *** nyany has quit IRC (Ping timeout: 506 seconds)
[08:52] *** antonizoo has quit IRC ()
[08:52] *** antonizoo has joined #archiveteam-bs
[09:05] *** Jonison has joined #archiveteam-bs
[09:13] *** GE has quit IRC (Remote host closed the connection)
[10:50] *** godane has quit IRC (Ping timeout: 268 seconds)
[11:01] *** Honno has joined #archiveteam-bs
[11:03] *** godane has joined #archiveteam-bs
[11:09] *** godane has quit IRC (Quit: Leaving.)
[11:29] *** GE has joined #archiveteam-bs [12:10] *** Ravenloft has quit IRC (Read error: Operation timed out) [13:06] *** BlueMaxim has quit IRC (Quit: Leaving) [13:23] *** GE has quit IRC (Remote host closed the connection) [13:54] *** GE has joined #archiveteam-bs [14:01] *** Jonison has quit IRC (ny.us.hub hub.se) [14:01] *** SpaffGarg has quit IRC (ny.us.hub hub.se) [14:01] *** Kenshin has quit IRC (ny.us.hub hub.se) [14:01] *** K4k has quit IRC (ny.us.hub hub.se) [14:01] *** SketchCow has quit IRC (ny.us.hub hub.se) [14:01] *** Kaz has quit IRC (ny.us.hub hub.se) [14:01] *** Ctrl-S___ has quit IRC (ny.us.hub hub.se) [14:01] *** alembic has quit IRC (ny.us.hub hub.se) [14:01] *** floogulin has quit IRC (ny.us.hub hub.se) [14:01] *** HCross2 has quit IRC (ny.us.hub hub.se) [14:01] *** deathy has quit IRC (ny.us.hub hub.se) [14:01] *** alfie has quit IRC (ny.us.hub hub.se) [14:01] *** BartoCH has quit IRC (ny.us.hub hub.se) [14:01] *** ThisAsYou has quit IRC (ny.us.hub hub.se) [14:01] *** tklk has quit IRC (ny.us.hub hub.se) [14:01] *** Sue_ has quit IRC (ny.us.hub hub.se) [14:01] *** Muad-Dib has quit IRC (ny.us.hub hub.se) [14:01] *** Sanqui has quit IRC (ny.us.hub hub.se) [14:01] *** Meroje has quit IRC (ny.us.hub hub.se) [14:01] *** raphidae has quit IRC (ny.us.hub hub.se) [14:01] *** Boppen has quit IRC (ny.us.hub hub.se) [14:01] *** mls has quit IRC (ny.us.hub hub.se) [14:01] *** Sk1d has quit IRC (ny.us.hub hub.se) [14:01] *** andai has quit IRC (ny.us.hub hub.se) [14:01] *** Aoede has quit IRC (ny.us.hub hub.se) [14:01] *** nightpool has quit IRC (ny.us.hub hub.se) [14:01] *** hook54321 has quit IRC (ny.us.hub hub.se) [14:01] *** VeganMars has quit IRC (ny.us.hub hub.se) [14:01] *** Riviera has quit IRC (ny.us.hub hub.se) [14:01] *** SN4T14 has quit IRC (ny.us.hub hub.se) [14:01] *** tuluu_ has quit IRC (ny.us.hub hub.se) [14:01] *** JensRex has quit IRC (ny.us.hub hub.se) [14:01] *** tammy_ has quit IRC (ny.us.hub hub.se) [14:01] *** i0npulse has quit 
IRC (ny.us.hub hub.se) [14:01] *** Hecatz has quit IRC (ny.us.hub hub.se) [14:01] *** Rai-chan has quit IRC (ny.us.hub hub.se) [14:01] *** medowar has quit IRC (ny.us.hub hub.se) [14:01] *** purplebot has quit IRC (ny.us.hub hub.se) [14:01] *** Madchen has quit IRC (ny.us.hub hub.se) [14:01] *** PurpleSym has quit IRC (ny.us.hub hub.se) [14:01] *** altlabel has quit IRC (ny.us.hub hub.se) [14:01] *** Zeryl has quit IRC (ny.us.hub hub.se) [14:01] *** Jon- has quit IRC (ny.us.hub hub.se) [14:01] *** Stilett0 has quit IRC (ny.us.hub hub.se) [14:01] *** dashcloud has quit IRC (ny.us.hub hub.se) [14:01] *** espes__ has quit IRC (ny.us.hub hub.se) [14:01] *** kvieta has quit IRC (ny.us.hub hub.se) [14:01] *** Darkstar has quit IRC (ny.us.hub hub.se) [14:01] *** Lord_Nigh has quit IRC (ny.us.hub hub.se) [14:01] *** brayden_ has quit IRC (ny.us.hub hub.se) [14:01] *** t2t2 has quit IRC (ny.us.hub hub.se) [14:01] *** RichardG has quit IRC (ny.us.hub hub.se) [14:01] *** kurt has quit IRC (ny.us.hub hub.se) [14:01] *** Odd0002 has quit IRC (ny.us.hub hub.se) [14:01] *** ploop has quit IRC (ny.us.hub hub.se) [14:01] *** DFJustin has quit IRC (ny.us.hub hub.se) [14:01] *** SilSte has quit IRC (ny.us.hub hub.se) [14:01] *** Fletcher has quit IRC (ny.us.hub hub.se) [14:01] *** antonizoo has quit IRC (ny.us.hub hub.se) [14:01] *** fie has quit IRC (ny.us.hub hub.se) [14:01] *** tsr has quit IRC (ny.us.hub hub.se) [14:01] *** yuitimoth has quit IRC (ny.us.hub hub.se) [14:01] *** luckcolor has quit IRC (ny.us.hub hub.se) [14:01] *** tephra has quit IRC (ny.us.hub hub.se) [14:01] *** antomatic has quit IRC (ny.us.hub hub.se) [14:01] *** SmileyG has quit IRC (ny.us.hub hub.se) [14:01] *** kevinr has quit IRC (ny.us.hub hub.se) [14:01] *** Frogging has quit IRC (ny.us.hub hub.se) [14:01] *** johnny4 has quit IRC (ny.us.hub hub.se) [14:01] *** bsmith093 has quit IRC (ny.us.hub hub.se) [14:01] *** kisspunch has quit IRC (ny.us.hub hub.se) [14:01] *** tapedrive has quit IRC (ny.us.hub 
hub.se) [14:01] *** wolfpld has quit IRC (ny.us.hub hub.se) [14:02] *** antonizoo has joined #archiveteam-bs [14:02] *** fie has joined #archiveteam-bs [14:02] *** tsr has joined #archiveteam-bs [14:02] *** yuitimoth has joined #archiveteam-bs [14:02] *** luckcolor has joined #archiveteam-bs [14:02] *** tephra has joined #archiveteam-bs [14:02] *** SmileyG has joined #archiveteam-bs [14:02] *** antomatic has joined #archiveteam-bs [14:02] *** kevinr has joined #archiveteam-bs [14:02] *** Frogging has joined #archiveteam-bs [14:02] *** irc.efnet.nl sets mode: +oooo luckcolor SmileyG antomatic Frogging [14:02] *** johnny4 has joined #archiveteam-bs [14:02] *** bsmith093 has joined #archiveteam-bs [14:02] *** kisspunch has joined #archiveteam-bs [14:02] *** tapedrive has joined #archiveteam-bs [14:02] *** wolfpld has joined #archiveteam-bs [14:02] *** irc.efnet.nl sets mode: +o bsmith093 [14:02] *** swebb sets mode: +o antomatic [14:02] *** Frogging sets mode: +o yipdw [14:02] *** SmileyG has quit IRC (Write error: Broken pipe) [14:02] *** Smiley has joined #archiveteam-bs [14:15] *** Zeryl has joined #archiveteam-bs [14:15] *** Stilett0 has joined #archiveteam-bs [14:15] *** Riviera has joined #archiveteam-bs [14:15] *** dashcloud has joined #archiveteam-bs [14:15] *** SN4T14 has joined #archiveteam-bs [14:15] *** espes__ has joined #archiveteam-bs [14:15] *** tuluu_ has joined #archiveteam-bs [14:15] *** kvieta has joined #archiveteam-bs [14:15] *** Darkstar has joined #archiveteam-bs [14:15] *** JensRex has joined #archiveteam-bs [14:15] *** tammy_ has joined #archiveteam-bs [14:15] *** i0npulse has joined #archiveteam-bs [14:15] *** Hecatz has joined #archiveteam-bs [14:15] *** medowar has joined #archiveteam-bs [14:15] *** Rai-chan has joined #archiveteam-bs [14:15] *** purplebot has joined #archiveteam-bs [14:15] *** Lord_Nigh has joined #archiveteam-bs [14:15] *** ploop has joined #archiveteam-bs [14:15] *** brayden_ has joined #archiveteam-bs [14:15] *** t2t2 
has joined #archiveteam-bs [14:15] *** kurt has joined #archiveteam-bs [14:15] *** Odd0002 has joined #archiveteam-bs [14:15] *** DFJustin has joined #archiveteam-bs [14:15] *** hub.dk sets mode: +oooo medowar Lord_Nigh brayden_ DFJustin [14:15] *** SilSte has joined #archiveteam-bs [14:15] *** Fletcher has joined #archiveteam-bs [14:15] *** Madchen has joined #archiveteam-bs [14:15] *** altlabel has joined #archiveteam-bs [14:15] *** PurpleSym has joined #archiveteam-bs [14:15] *** hub.dk sets mode: +oo Fletcher PurpleSym [14:15] *** swebb sets mode: +o brayden_ [14:15] *** swebb sets mode: +o DFJustin [14:15] *** jmtd has joined #archiveteam-bs [14:24] *** Boppen has joined #archiveteam-bs [14:32] *** Jonison has joined #archiveteam-bs [14:32] *** Kenshin has joined #archiveteam-bs [14:32] *** K4k has joined #archiveteam-bs [14:32] *** SketchCow has joined #archiveteam-bs [14:32] *** Kaz has joined #archiveteam-bs [14:32] *** Ctrl-S___ has joined #archiveteam-bs [14:32] *** alembic has joined #archiveteam-bs [14:32] *** floogulin has joined #archiveteam-bs [14:32] *** HCross2 has joined #archiveteam-bs [14:32] *** deathy has joined #archiveteam-bs [14:32] *** alfie has joined #archiveteam-bs [14:32] *** BartoCH has joined #archiveteam-bs [14:32] *** tklk has joined #archiveteam-bs [14:32] *** raphidae has joined #archiveteam-bs [14:32] *** ThisAsYou has joined #archiveteam-bs [14:32] *** Muad-Dib has joined #archiveteam-bs [14:32] *** Meroje has joined #archiveteam-bs [14:32] *** Sue_ has joined #archiveteam-bs [14:32] *** Sanqui has joined #archiveteam-bs [14:32] *** efnet.port80.se sets mode: +oooo SketchCow Kaz HCross2 Sanqui [14:32] *** swebb sets mode: +o SketchCow [14:34] *** Jonison has quit IRC (Read error: Connection reset by peer) [14:52] *** nyany has joined #archiveteam-bs [15:30] *** Aranje has joined #archiveteam-bs [15:42] *** SpaffGarg has joined #archiveteam-bs [15:42] *** RichardG_ has joined #archiveteam-bs [15:42] *** mls has joined 
#archiveteam-bs
[15:42] *** Sk1d has joined #archiveteam-bs
[15:42] *** andai has joined #archiveteam-bs
[15:42] *** Aoede has joined #archiveteam-bs
[15:42] *** nightpool has joined #archiveteam-bs
[15:42] *** hook54321 has joined #archiveteam-bs
[15:42] *** VeganMars has joined #archiveteam-bs
[15:44] *** RichardG_ is now known as RichardG
[16:01] *** pizzaiolo has joined #archiveteam-bs
[16:24] *** phuz has joined #archiveteam-bs
[16:24] *** phuzion has quit IRC (Read error: Connection reset by peer)
[16:35] *** antonizoo has quit IRC (Remote host closed the connection)
[16:44] *** ZexaronS has joined #archiveteam-bs
[16:50] *** antonizoo has joined #archiveteam-bs
[17:25] *** sun_rise has joined #archiveteam-bs
[17:27] If anyone is around I'm interested in pointing archivebot at something in the other channel
[17:28] *** GE has quit IRC (Remote host closed the connection)
[18:00] The job finished but I can't find it in the viewer (or anywhere else?) Says status completed. I'm a little confused.
[18:07] *** pizzaiolo has quit IRC (Read error: Connection reset by peer)
[18:08] *** pizzaiolo has joined #archiveteam-bs
[18:15] sun_rise: iirc jobs are uploaded/ingested about daily
[18:15] *** SpaffGarg has quit IRC (Ping timeout: 250 seconds)
[18:21] *** SpaffGarg has joined #archiveteam-bs
[19:10] *** GE has joined #archiveteam-bs
[19:23] *** pizzaiolo has quit IRC (Quit: pizzaiolo)
[19:23] *** JAA has joined #archiveteam-bs
[19:25] *** pizzaiolo has joined #archiveteam-bs
[19:38] *** Aranje has quit IRC (Ping timeout: 245 seconds)
[20:05] *** ZexaronS- has joined #archiveteam-bs
[20:06] *** sep332 has quit IRC (Read error: Operation timed out)
[20:06] *** ZexaronS has quit IRC (Read error: Operation timed out)
[20:23] *** sep332 has joined #archiveteam-bs
[21:08] *** speculaas has joined #archiveteam-bs
[21:08] speculaas: okay, so, it's *possible* to extract data from the existing archives, but it currently still requires some manual work
[21:09] speculaas: specifically, you can download the indexes of all the Hyves items on archive.org, which contain a list of every URL that is contained in a given item along with its 'offset' (position in the WARC file)
[21:09] okay
[21:09] speculaas: you can then use those positions to do an HTTP range request and retrieve just those bits of the WARC file, obtaining the pages
[21:09] Here are some archives https://archive.org/details/hyves?&sort=-downloads&page=2
[21:10] speculaas: there's - to my knowledge - not yet a nice one-stop way to extract an account
[21:10] speculaas: if you just want to *look* at the account, it's faster to look it up in the wayback machine
[21:10] all the Hyves archives should have been imported into that
[21:17] The url for that is: www.hyves.nl/username ?
[21:19] I already know the url but I see my account is not public
[21:32] *** schbirid has quit IRC (Quit: Leaving)
[21:32] speculaas: ah yeah, we only got the public profiles...
so if it was a private profile, I'm afraid it can't be recovered :/
[21:32] speculaas: unless a friend kept around a copy...
[21:35] Okay, then I know enough. Thanks for your time ;)
[21:36] speculaas: good luck in your search :)
[21:40] *** speculaas has quit IRC (Ping timeout: 268 seconds)
[21:43] *** sun_rise has quit IRC (ny.us.hub irc.efnet.nl)
[21:43] *** fie has quit IRC (ny.us.hub irc.efnet.nl)
[21:43] *** tsr has quit IRC (ny.us.hub irc.efnet.nl)
[21:43] *** yuitimoth has quit IRC (ny.us.hub irc.efnet.nl)
[21:43] *** luckcolor has quit IRC (ny.us.hub irc.efnet.nl)
[21:43] *** tephra has quit IRC (ny.us.hub irc.efnet.nl)
[21:43] *** antomatic has quit IRC (ny.us.hub irc.efnet.nl)
[21:43] *** kevinr has quit IRC (ny.us.hub irc.efnet.nl)
[21:43] *** Frogging has quit IRC (ny.us.hub irc.efnet.nl)
[21:43] *** johnny4 has quit IRC (ny.us.hub irc.efnet.nl)
[21:43] *** bsmith093 has quit IRC (ny.us.hub irc.efnet.nl)
[21:43] *** kisspunch has quit IRC (ny.us.hub irc.efnet.nl)
[21:43] *** tapedrive has quit IRC (ny.us.hub irc.efnet.nl)
[21:43] *** wolfpld has quit IRC (ny.us.hub irc.efnet.nl)
[21:44] *** sun_rise has joined #archiveteam-bs
[21:44] *** fie has joined #archiveteam-bs
[21:44] *** tsr has joined #archiveteam-bs
[21:44] *** yuitimoth has joined #archiveteam-bs
[21:44] *** luckcolor has joined #archiveteam-bs
[21:44] *** tephra has joined #archiveteam-bs
[21:44] *** antomatic has joined #archiveteam-bs
[21:44] *** kevinr has joined #archiveteam-bs
[21:44] *** Frogging has joined #archiveteam-bs
[21:44] *** johnny4 has joined #archiveteam-bs
[21:44] *** bsmith093 has joined #archiveteam-bs
[21:44] *** irc.efnet.nl sets mode: +oooo luckcolor antomatic Frogging bsmith093
[21:44] *** kisspunch has joined #archiveteam-bs
[21:44] *** tapedrive has joined #archiveteam-bs
[21:44] *** wolfpld has joined #archiveteam-bs
[21:44] *** swebb sets mode: +o antomatic
[21:44] *** Frogging sets mode: +o yipdw
[22:05] is it using lxml?
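The extraction flow described above for the Hyves items (look an URL up in an item's index to get its byte offset, then do an HTTP range request for just that slice of the WARC file) can be sketched in Python. This is a minimal illustration, not an existing tool: `byte_range` and `fetch_warc_record` are hypothetical helper names, and the item URL and offsets in the usage comment are placeholders.

```python
import gzip
import io
import urllib.request

def byte_range(offset, length):
    # HTTP Range header value for `length` bytes starting at `offset`;
    # note the end index is inclusive (RFC 7233).
    return "bytes=%d-%d" % (offset, offset + length - 1)

def fetch_warc_record(warc_url, offset, length):
    # In a .warc.gz, each record is its own gzip member, so a byte range
    # starting exactly at a record's offset is itself a valid gzip stream.
    req = urllib.request.Request(
        warc_url, headers={"Range": byte_range(offset, length)}
    )
    with urllib.request.urlopen(req) as resp:  # expect 206 Partial Content
        return gzip.GzipFile(fileobj=io.BytesIO(resp.read())).read()

# Usage (placeholder item name and offsets; real values come from the
# item's CDX/index file on archive.org):
# record = fetch_warc_record(
#     "https://archive.org/download/hyves-XXXX/hyves-XXXX.warc.gz",
#     offset=123456, length=7890)
```

archive.org supports range requests on item files, which is what makes this per-record retrieval possible without downloading the whole (often multi-GB) WARC.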
[22:06] *** FalconK has joined #archiveteam-bs
[22:06] hah!
[22:06] yo Sanqui
[22:06] oh hey
[22:07] so a bunch of the archivebot pipelines are dual-core atoms clocking at 2.4GHz in virtualized environments
[22:07] Sanqui: it depends on the configuration. libxml on some pipelines, html5lib on others
[22:07] we started with libxml but it kept crashing for some reason
[22:07] html5lib gets more stuff and seems more stable but is more expensive re CPU
[22:07] I think all of them are html5lib now?
[22:07] for reference, i wrote <@Sanqui> is it using lxml?
[22:07] probably, but I can't be sure of that since people can change the pip manifest
[22:08] ty
[22:08] yeah html5lib is gonna be cpu expensive
[22:08] anyway, I don't think there's a way around parsing the documents to get links and stuff
[22:08] ha, we have a manageability issue too writ large
[22:08] ideally we'd use libxml and allow switching with a parameter
[22:08] if a certain website had issues
[22:08] a suggestion from some local crew I know was to forego the XML parsing entirely and just use a best-effort regex, and accept that it will find some bullshit
[22:08] recent Chrome release has an official headless mode and that seems interesting
[22:09] the biggest reason to not use a regex is that it will fall down making relative URLs out of any / in anything
[22:09] or else miss tons of stuff
[22:09] yeah, that's a lot of webpages :P
[22:09] so I rejected that solution
[22:09] fucking relative URLs
[22:10] hm
[22:10] I still think using a browser is probably the way to go
[22:10] so ... yeah
[22:10] you could make a more...
"outzoomed" regex looking for href=
[22:10] ugh
[22:10] more and more websites are using client rendering
[22:10] computers suck
[22:10] one could do that yes
[22:10] for client rendering we have to use phantomjs anyway
[22:10] phantomjs is dead
[22:10] or Headless Chrome Because Yes
[22:10] and if you're looking for an optimized way to parse documents, you might as well look at a Web browser
[22:10] tbh it'd be very nice if we could just spin up chromes
[22:11] that will cause us to need more CPU, not less
[22:11] ^
[22:11] maybe
[22:12] in place of phantomjs anyway
[22:12] it would be a lot nicer to use headless chrome than phantomjs for the things we do need client-side rendering for
[22:12] but we still need client-side rendering for a small minority of sites
[22:12] the only major use of it I've noticed, actually, is twitter.
[22:12] I think that may be because that's the only place it seems to reliably work
[22:12] "reliably"
[22:12] phantomjs is also crashy af
[22:12] that may also be the case
[22:13] but there's also a lot of blog sites that use client-side rendering and have no fallback
[22:13] I'm not at all opposed to using headless chrome in place of phantomjs and seeing how it performs
[22:13] honestly, for archiving websites like twitter, youtube, facebook etc., the bot should have specific modes that are curated
[22:13] usually it's software developers because software developers are idiots
[22:13] Sanqui: yes, that's also on the long todo list
[22:13] can confirm, I write software
[22:13] we want a !twitter at least, and possibly a !youtube and !reddit
[22:14] !facebook would require a lot of work
[22:15] separately, there's this CPU usage issue :P
[22:15] did anyone manage to get a useful CPU profile?
I tried once but I just got a bunch of "your program is spending most of its time in Python's evaluator"
[22:15] which is like saying "your program is spending most of its time running"
[22:15] there's *another* issue, which is that wpull.db grows to tens of GB when crawling large sites, but I'm willing to live with that for the moment since the high CPU usage is actually the pain point right now
[22:15] anyway, to drive the point home: phantomjs is over, the lead developer has stepped out in anticipation of headless chrome https://groups.google.com/forum/#!topic/phantomjs/9aI5d-LDuNE
[22:15] yipdw: I did!
[22:15] oh
[22:15] so we need to do something eventually
[22:15] do you still have the profile data?
[22:15] ananiel-s6 is currently dedicated to profiling
[22:16] ah good
[22:16] let me see if I do still have it; if not, I can get it again later
[22:16] yeah, I'd like to see that. I got as far as perf and then I got annoyed and had to switch gears
[22:16] but I recall html5lib stuff featured very heavily
[22:16] perf is fucking awful to deal with
[22:16] I hate optimizing
[22:16] html5lib is parsing in python
[22:16] I keep seeing good testimonials for Telemetry
[22:16] we really want lxml
[22:16] we've tried libxml before
[22:16] it kept blowing up
[22:17] then we should figure out why and report it upstream
[22:17] let's see - how does one read cprofile things again
[22:17] actually yipdw do you just want the cprofile?
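For anyone landing here with the same "how does one read cprofile things again" question: a minimal cProfile round-trip looks like the following. `slow` is just a stand-in workload; the `cprof.dat` filename in the comment echoes the file shared later in this log.

```python
import cProfile
import io
import pstats

def slow():
    # Stand-in workload so the profiler has something to measure.
    return sum(i * i for i in range(200000))

prof = cProfile.Profile()
prof.enable()
slow()
prof.disable()

# prof.dump_stats("cprof.dat") writes the binary stats file;
# pstats.Stats("cprof.dat") reads one back. Here we print from memory:
buf = io.StringIO()
pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()
```

Running `python -m cProfile -s cumtime script.py` produces the same table without any code changes, which is the invocation joked about further down.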
[22:17] (sorry for the 'we', i'm not trying to sound smart here)
[22:17] I'll put it up somewhere
[22:17] Sanqui: I mean, yes, but in the meantime it was easier to just switch to html5lib and deliver something working
[22:17] (i fully recognize i have done zero archivebot development)
[22:18] we have a very bad test process right now for archivebot
[22:18] yes I noticed tests are failing
[22:18] which is: make it do real work, and then wait until it falls over, and see if you got enough information to figure out the failure case
[22:18] *** REiN^ has quit IRC (Read error: Operation timed out)
[22:19] chfoo wrote a smoke test harness, but there's a lot of moving parts and I haven't looked at what it takes to put them back together in the Travis environment
[22:19] on my end, this is mostly because I have $infinity things to do that aren't archivebot, so... :P
[22:19] no offense to chfoo but his code has a LOT of moving parts
[22:19] i mean really I don't think "test it in production" is really a bad idea here
[22:19] it's not; it just takes forever
[22:19] if you have good telemetry, it's awesome
[22:19] I'm not familiar with what libxml and html5lib really are internally, but probably the best option would be to use the XML parser library from a browser (i.e. Chromium or Firefox), right?
[22:20] html5lib is basically that
[22:20] I don't know of anyone who has extracted those for consumption in something else
[22:20] it's intended afaik to be a W3C compliant HTML parser not unlike, say, sax for XML
[22:20] I guess that's a good point too
[22:20] we really can't use "an XML library" to be pedantic
[22:20] Yeah. Problem is, many websites aren't W3C compliant.
[22:21] HTML isn't XML and archivebot has to be able to deal with that
[22:21] maybe it's html5lib that needs perf
[22:21] We still want to be able to handle those.
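The trade-off under discussion (the "best-effort regex" floated earlier versus a tolerant HTML parser that handles non-compliant markup) can be seen even with Python's stdlib `html.parser`, which is lenient in the browser-ish sense. This is a standalone illustration, not wpull's scraper; the regex and `LinkExtractor` class are invented for the example.

```python
import re
from html.parser import HTMLParser

# Malformed-but-typical markup: an unclosed <b> and an unquoted attribute.
html = "<a href='/x'>broken <b><a href=//y.example/z>no close"

# Best-effort regex: only sees quoted href values, so it misses the
# unquoted one.
HREF_RE = re.compile(r"""href\s*=\s*["']([^"'>\s]+)""", re.IGNORECASE)
regex_links = HREF_RE.findall(html)

class LinkExtractor(HTMLParser):
    # html.parser keeps going past unclosed tags and accepts unquoted
    # attribute values instead of aborting on non-compliant input.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

parser = LinkExtractor()
parser.feed(html)
parser_links = parser.links
```

Here `regex_links` ends up as `['/x']` while `parser_links` also recovers `'//y.example/z'`, which is the "it will miss tons of stuff" failure mode in miniature.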
[22:21] we don't currently have a significant problem with that
[22:21] JAA: it's inside out here
[22:21] indeed archivebot tends to get pointed at a lot of small, old sites
[22:21] W3C defines how to deal with websites that aren't W3C compliant
[22:21] and browsers follow that
[22:21] other than operator error I haven't had many complaints of archivebot missing things
[22:21] I don't have notes, but I think that's another reason why the html5lib switch happened
[22:21] I see
[22:21] if it's noticed, I'd love to hear about it
[22:22] it just got better results
[22:22] no point in performing faster if you miss page requisites etc
[22:22] *** REiN^ has joined #archiveteam-bs
[22:22] could always try pypy :)
[22:22] If html5lib works so well, how about rewriting it as a C extension? /s
[22:23] *** ZexaronS- has quit IRC (Leaving)
[22:23] debugging the intersection of python and C is prohibited by the Geneva Conventions
[22:23] I mean you can inflict it on yourself but
[22:24] *** GE has quit IRC (Remote host closed the connection)
[22:24] tangentially related, I'm working on a project and part of it is an app that calls into a Go library
[22:24] from C
[22:25] http://ananiels6.falconkirtaran.net/cprof.dat
[22:25] the app is trivially stack-smashable if you send a URL that's longer than 2048 bytes
[22:25] that link work?
[22:25] I thought that was really funny
[22:25] because it's like "Go will save me"
[22:25] yeah no
[22:25] Connecting to ananiels6.falconkirtaran.net (ananiels6.falconkirtaran.net)|51.15.47.106|:80... failed: Connection refused.
[22:26] brb
[22:27] I've done it before, and it's actually not too bad as long as you can keep all the real work in C and just have a thin transition layer converting the stuff from/to Python variables.
[22:27] But for obvious reasons, I wouldn't want to implement an XML parser, ever. Most certainly not in C.
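The "thin transition layer" point about Python/C interop can be seen in miniature with stdlib `ctypes`: all of the risk lives at the boundary where Python values are converted to C types and back. A tiny sketch (POSIX-only, since `CDLL(None)` loads the current process's symbols; on Windows this would need a different loader):

```python
import ctypes

# Load symbols visible in the current process (libc is linked in on
# POSIX systems), then describe the C signature explicitly. Getting
# argtypes/restype wrong is exactly the class of bug the conversation
# warns about.
libc = ctypes.CDLL(None)
libc.abs.restype = ctypes.c_int
libc.abs.argtypes = [ctypes.c_int]

result = libc.abs(-42)
```

Keeping the conversion layer this thin and declarative is what makes such bindings debuggable at all; the moment buffers and ownership cross the boundary, the Geneva Conventions joke above starts to feel earned.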
[22:29] this is not work we should be doing
[22:31] +1
[22:31] ffs
[22:31] http://ananiels6.falconkirtaran.net:8000/cprof.dat
[22:31] strings are awful
[22:32] hooray for SimpleHTTPServer
[22:32] Indeed. wpull really needs fixing. Version 2.0.1 has so many bugs that it's not even funny; e.g. concurrency is broken entirely and aborting doesn't work. And version 1.2.3 throws up when used with the current html5lib version since the API changed and the requirements.txt doesn't force the specific, compatible version.
[22:32] yipdw and I did the work to transition archivebot to wpull2 like 6 months back
[22:33] I suggest that we roll with chfoo's changes and deprecate 1.x
[22:33] but we will need to fix concurrency for sure
[22:33] aborting is working fine for archivebot by the way
[22:33] er... as fine as it ever has worked
[22:34] ok
[22:35] Interesting. I always had to hard-abort (twice ^C) it when I tried. After a few attempts, I went to 1.2.3
[22:36] anyway the thing that really jumps out at me in the 650 second profile there:
[22:36] 926 23.751 0.026 324.046 0.350 /home/archivebot/.local/lib/python3.5/site-packages/wpull/scraper/html.py:127(_process_elements)
[22:37] it's spending literally 50% of its time in html._process_elements
[22:38] well, in some way that's kinda cool
[22:38] it means all of our add-on stuff isn't the slow bit
[22:38] yeah...
[22:39] by comparison, by the way, it spends about 7.5% of its time working with sqlite
[22:39] *** Stilett0 has quit IRC (Read error: Operation timed out)
[22:39] which to me is counterintuitive. I thought running hundreds of regular expressions on each document would be a problem
[22:39] turns out, it isn't the dominating factor
[22:39] profiles are awesome
[22:39] yeah
[22:39] our regexp running is efficient, I think, right? it compiles them into one state machine?
[22:39] no
[22:39] Hmm, doesn't the HTML parsing happen outside of _process_elements?
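For readers of the profile rows quoted here: the pstats columns are ncalls, tottime, percall (tottime/ncalls), cumtime, and percall (cumtime/ncalls). The "50%" claim comes from dividing cumtime (324 s) by the 650-second profile duration mentioned above, which the following arithmetic confirms:

```python
# The _process_elements row from the log:
ncalls, tottime, percall_tot, cumtime, percall_cum = (
    926, 23.751, 0.026, 324.046, 0.350
)

# The percall columns are just per-call averages of the totals:
assert round(tottime / ncalls, 3) == percall_tot
assert round(cumtime / ncalls, 3) == percall_cum

# cumtime over the 650 s profile duration gives the "literally 50%":
assert abs(cumtime / 650 - 0.50) < 0.01
```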
[22:39] no idea
[22:40] but we do compile the regexes / make use of the Python regexp cache
[22:40] * FalconK nods
[22:40] so it's probably fast enough
[22:40] the regex thing doesn't even seem to appear in the profiling
[22:40] er
[22:40] not anywhere near the top
[22:40] neat
[22:41] it's good to know also that sqlite is fast
[22:41] 78084 0.907 0.000 28.838 0.000 /home/archivebot/.local/lib/python3.5/site-packages/wpull/application/hook.py:132(notify)
[22:41] 5% of time in hooks of any kind
[22:41] i had a suspicion it was more than sufficient for this but it's cool to see that it's at the bottom
[22:41] * FalconK nods
[22:41] so, hmm
[22:41] what is process_elements doing
[22:41] I don't even remember what this job was (probably !a http://cnn.com/ or something)
[22:42] but yes, one wonders
[22:42] It seems that the parsing happens in wpull.document.html.HTMLReader.iter_elements .
[22:42] FalconK: can you put that profile data back up?
[22:42] I get a connection refused talking to that site
[22:43] that or if you can drill down into process_elements that'd be fab
[22:43] oh sure
[22:43] it's a pretty big method
[22:43] sorry, it was python http.server and I took it down to read the data
[22:43] ah ok
[22:43] Yeah, line profiling for _process_elements would be helpful.
[22:43] up again
[22:43] go for it
[22:44] hmm
[22:44] Connecting to ananiels6.falconkirtaran.net (ananiels6.falconkirtaran.net)|51.15.47.106|:80... failed: Connection refused.
[22:44] :8000
[22:44] oh feck
[22:44] there we go
[22:44] done
[22:44] :)
[22:44] I'll leave it up for a bit while I read _process_elements
[22:45] python -m cProfile -s cumtime will never not be funny to me
[22:45] also hi yes I am 12
[22:46] oddly clean_link_soup is negligible
[22:46] 210118 1.320 0.000 3.423 0.000 /home/archivebot/.local/lib/python3.5/site-packages/wpull/scraper/util.py:38(clean_link_soup)
[22:46] it'd be funny if it ended up being urljoin_safe or something
[22:47] 50% of overall time spent in string concat and reallocation
[22:47] 164679 1.413 0.000 30.091 0.000 /home/archivebot/.local/lib/python3.5/site-packages/wpull/url.py:684(urljoin)
[22:47] wat
[22:47] are you fucking kidding me
[22:47] :P
[22:47] Sidenote: I think there's a bug in _process_elements: "if self._only_relative:" followed by "if link_info.base_link or '://' in link_info.link:" probably doesn't catch protocol-relative links, i.e. 'href="//example.com/"'.
[22:47] JAA: hmm
[22:47] what's that?
[22:48] I don't recall scheme-relative links being a problem, but we can try that out
[22:48] oh huh, https://www.paulirish.com/2010/the-protocol-relative-url/... TIL
[22:48] they're vaguely useful
[22:50] JAA: I think you're right; the best way to address it would be a PR
[22:51] this is kind of a big deal too:
[22:51] 1077 0.230 0.000 51.563 0.048 /home/archivebot/.local/lib/python3.5/site-packages/wpull/database/wrap.py:41(add_many)
[22:52] FalconK: wait, are you sure this is html5lib?
[22:52] I see parse_lxml in the output
[22:52] I went through the same inquiry
[22:52] I don't remember the conclusion I came to
[22:52] FalconK: I guess. Then again, other PRs have been sitting there for months, so motivation is limited. Also, I have no idea how to fix it properly without breaking other stuff. Paths in URLs can contain several consecutive slashes IIRC; that is, href="some//path" is equivalent to href="some/path".
[22:52] either both are in use, or else someone put html5lib in but left all the functions named like libxml.
[22:53] JAA: right, it looks like only r'^//' is protocol-relative
[22:54] no comment on PRs except that archivebot specifies github.com/falconkirtaran/wpull in requirements.txt
[22:54] because before my omnibus PR was accepted wpull2 was too crashy to use
[22:55] well, maybe parse_lxml is the wrong place to look anyway. profile indicates that most of the time in there is spent in the "start" method but that method just invokes callbacks
[22:55] heya yipdw I think that add_many prof item might contain the plugins?
[22:55] and the callbacks aren't showing up in the profile, AFAICT
[22:56] FalconK: not sure
[22:56] oh wait, the callbacks are in the called: section
[22:57] oh, still not a problem
[22:57] 2159 0.036 0.000 6.803 0.003 archive_bot_plugin.py:214(accept_url)
[22:58] huh. highest total time in start is /home/archivebot/.local/lib/python3.5/site-packages/wpull/collections.py:244(__init__)
[22:58] does this just spend most of its time managing lists?
[22:58] what does *that* abstraction do
[22:58] it might, though
[22:58] wpull -r keeps a lot of lists
[22:59] *** Stilett0 has joined #archiveteam-bs
[22:59] line 244 of collections.py is the initializer for FrozenDict
[22:59] which does e.g.
[22:59] def __init__(self, orig_dict):
[22:59] self.orig_dict = orig_dict
[22:59] self.hash_cache = hash(tuple(sorted(self.orig_dict.items())))
[22:59] over 1.68 million calls to that, that seems like it might be a thing
[22:59] wait
[23:00] hash(...sorted(?
[23:00] yeah
[23:00] why
[23:00] I don't know
[23:00] do Python hashes guarantee any sort of iteration order?
[23:00] I know Ruby does
[23:00] I suppose that would depend
[23:01] what stability properties does it require?
[23:01] not sure
[23:01] python is not my primary language
[23:01] (that'd be C++, followed by x86 ASM)
[23:02] FrozenDict is used in lxml.HTMLParserTarget.start
[23:02] well!
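The initializer quoted above sorts every attribute dict on every construction, which answers the iteration-order question raised here: dict iteration order carries no guarantee useful for hashing, so the sort makes equal dicts hash equally. A self-contained reconstruction (the `__hash__`/`__eq__` methods and the `unordered_hash` alternative are guesses for illustration, not wpull's actual code) shows both the reason for the sort and a sort-free option:

```python
class FrozenDict:
    # Reconstruction around the __init__ quoted in the log; only the
    # initializer is known, the rest is assumed.
    def __init__(self, orig_dict):
        self.orig_dict = orig_dict
        # Sorting makes the hash independent of insertion order, at an
        # O(n log n) cost per construction; the profile showed 1.68
        # million constructions, so this adds up.
        self.hash_cache = hash(tuple(sorted(self.orig_dict.items())))

    def __hash__(self):
        return self.hash_cache

    def __eq__(self, other):
        return self.orig_dict == getattr(other, "orig_dict", other)

def unordered_hash(d):
    # frozenset hashing is order-independent by construction, so equal
    # dicts hash equally without any sort.
    return hash(frozenset(d.items()))

a = FrozenDict({"href": "/x", "rel": "nofollow"})
b = FrozenDict({"rel": "nofollow", "href": "/x"})
assert hash(a) == hash(b)
assert unordered_hash({"x": 1, "y": 2}) == unordered_hash({"y": 2, "x": 1})
```

Whether the caller needs the hash at all is the better question, which is where the "just don't wrap attrib in a FrozenDict()" suggestion below ends up.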
[23:02] I'm not really sure if it's needed, though
[23:02] hard to tell
[23:03] it's also not immediately clear to me what it's wrapping -- it's 'attrib'
[23:03] (tag attributes?)
[23:03] murdering it entirely would speed us up by 2%
[23:04] AKA 4-6 page grabs per hundred seconds
[23:04] or more, depending on what effect that would have with fewer allocations
[23:04] oh, true
[23:04] the allocator is still a black box to us
[23:04] I was just poking at it because it showed up pretty high in the profiles
[23:04] though I feel like it's probably spending a lot more time sorting than allocating
[23:05] I don't think __init__ captures time python spends allocating
[23:05] and actually the python heap processing was insignificant anyway wasn't it?
[23:05] it might not, but FrozenDict is making more objects in its initializer
[23:05] i.e. the new hash and the temporary tuple
[23:05] mm
[23:05] I don't know how expensive that is on the allocator (it might be trivial)
[23:06] anyway, I guess one thing to try would be to replace FrozenDict() with, like, dict()
[23:06] I don't think allocator time is captured with the jit time
[23:06] but yeah, we could try that on ananiel-S6
[23:06] you lose the immutability guarantee but it'd be one way to see if FrozenDict() introduces a large penalty
[23:06] or, in the specific case of start(), just don't wrap attrib in a FrozenDict()
[23:07] I doubt it will have a perceptible macro difference but it would be neat to see how it changes the profile
[23:08] now I'm confused about this:
[23:08] speaking of C++, one thing that C++ has made me really paranoid about (probably overly paranoid) is allocations
[23:08] there's both lxml_.py and htmllib5_.py
[23:08] why
[23:08] like every time I've had a performance problem, it wasn't algorithmic. it was because I was fucking mallocing too much
[23:08] or treating cache lines like slacklines
[23:08] that sort of thing
[23:09] FalconK: huh
[23:09] dunno
[23:10] maybe this is using libxml after all?
[23:11] I wonder if it's using libxml for XHTML documents and html5lib for others?
[23:11] I remember there was some complex dispatch logic
[23:12] it's just so ungodly complex
[23:13] maybe using Chrome as the HTML processor would actually be faster :P
[23:13] let wpull handle queue management, retry, etc
[23:13] doubt it but who knows
[23:14] I mean you might still be at high CPU%, but the CPU might be doing more
[23:14] one thing that is good about html5lib/libxml2 is that it doesn't execute needless javascript
[23:14] we may be able to disable doing that in headless chrome
[23:15] it doesn't, but Javascript has been doing things to the DOM for quite a while
[23:15] I don't know if it's needless
[23:15] there was some other browser like this, I forgot what it was
[23:16] it was webkit based
[23:16] it's needful to grab, for sure
[23:16] and it was meant to be used in a UNIX Philosophy way
[23:16] which means it has an impossible name
[23:17] AH
[23:17] uzbl
[23:18] maybe that's an option too in the "use a browser engine to give us what we need to do our thing" arena
[23:18] or i dunno how good is servo these days :P
[23:19] every time I try to run servo nightly it eats up all my cores but doesn't render anything
[23:19] but that could be an environment issue
[23:21] *** Ravenloft has joined #archiveteam-bs
[23:22] *** JAA has quit IRC (Quit: Page closed)
[23:24] ok, new profiling on !a https://www.npr.org/
[23:24] in 10 or 20 I'll kill it and we can look
[23:25] it seems to not be crashing without FrozenDict
[23:25] ... I say, as it crashes
[23:25] this fucking bug:
[23:25] File "/home/archivebot/.local/lib/python3.5/site-packages/chardet/universaldetector.py", line 271, in close
[23:25] for prober in self._charset_probers[0].probers:
[23:25] IndexError: list index out of range
[23:25] CRITICAL Sorry, Wpull unexpectedly crashed.
[23:25] CRITICAL Please report this problem to the authors at Wpull's issue tracker so it may be fixed.
If you know how to program, maybe help us fix it? Thank you for helping us help you help us all.
[23:25] which is not new
[23:27] what the
[23:27] oh
[23:27] right
[23:32] *** superkuh has quit IRC (Remote host closed the connection)
[23:34] *** superkuh has joined #archiveteam-bs
[23:49] yipdw: http://ananiels6.falconkirtaran.net:8000/02_post_rm_FrozenDict
[23:55] it certainly didn't seem to break anything, and now that 2% is gone
[23:55] it's spending a significant amount of time on epoll_wait, which is good since that means it's a little network-bound
[23:57] *** BlueMaxim has joined #archiveteam-bs
[23:59] 20 1.237 0.062 1.995 0.100 /home/archivebot/.local/lib/python3.5/site-packages/chardet/mbcharsetprober.py:61(feed)
[23:59] that's 0.062 seconds per call. what is that even for?