#archiveteam-bs 2017-05-04,Thu

Time Nickname Message
02:30 🔗 ndiddy has quit IRC ()
02:41 🔗 SpaffGarg has quit IRC (Read error: Operation timed out)
02:43 🔗 SpaffGarg has joined #archiveteam-bs
02:54 🔗 zeryl has joined #archiveteam-bs
03:12 🔗 pizzaiolo has quit IRC (pizzaiolo)
03:23 🔗 zeryl has quit IRC (Quit: Page closed)
03:23 🔗 Zeryl has joined #archiveteam-bs
04:11 🔗 Zeryl Let's try here:
04:11 🔗 Zeryl WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
04:15 🔗 Sk1d has quit IRC (Ping timeout: 194 seconds)
04:21 🔗 Sk1d has joined #archiveteam-bs
04:29 🔗 signius has quit IRC (Quit: Leaving)
05:04 🔗 Odd0002 is there anything that can be archived from an XMPP server?
05:08 🔗 fie has joined #archiveteam-bs
05:28 🔗 Zeryl possibly contact info, and MUC logs depending on how far back they allow reviewing, if at all
05:45 🔗 godane i'm uploading more Charlie Rose from 1992-01
07:07 🔗 chazchaz has quit IRC (Read error: Operation timed out)
07:07 🔗 Kenshin has quit IRC (Read error: Operation timed out)
07:08 🔗 chazchaz has joined #archiveteam-bs
07:09 🔗 Kenshin has joined #archiveteam-bs
07:12 🔗 schbirid has joined #archiveteam-bs
07:25 🔗 SpaffGarg has quit IRC (Read error: Operation timed out)
07:28 🔗 SpaffGarg has joined #archiveteam-bs
08:08 🔗 mls has quit IRC (Ping timeout: 250 seconds)
08:15 🔗 mls has joined #archiveteam-bs
08:32 🔗 GE has joined #archiveteam-bs
08:48 🔗 nyany has quit IRC (Ping timeout: 506 seconds)
08:52 🔗 antonizoo has quit IRC ()
08:52 🔗 antonizoo has joined #archiveteam-bs
09:05 🔗 Jonison has joined #archiveteam-bs
09:13 🔗 GE has quit IRC (Remote host closed the connection)
10:50 🔗 godane has quit IRC (Ping timeout: 268 seconds)
11:01 🔗 Honno has joined #archiveteam-bs
11:03 🔗 godane has joined #archiveteam-bs
11:09 🔗 godane has quit IRC (Quit: Leaving.)
11:29 🔗 GE has joined #archiveteam-bs
12:10 🔗 Ravenloft has quit IRC (Read error: Operation timed out)
13:06 🔗 BlueMaxim has quit IRC (Quit: Leaving)
13:23 🔗 GE has quit IRC (Remote host closed the connection)
13:54 🔗 GE has joined #archiveteam-bs
14:01 🔗 Jonison has quit IRC (ny.us.hub hub.se)
14:01 🔗 SpaffGarg has quit IRC (ny.us.hub hub.se)
14:01 🔗 Kenshin has quit IRC (ny.us.hub hub.se)
14:01 🔗 K4k has quit IRC (ny.us.hub hub.se)
14:01 🔗 SketchCow has quit IRC (ny.us.hub hub.se)
14:01 🔗 Kaz has quit IRC (ny.us.hub hub.se)
14:01 🔗 Ctrl-S___ has quit IRC (ny.us.hub hub.se)
14:01 🔗 alembic has quit IRC (ny.us.hub hub.se)
14:01 🔗 floogulin has quit IRC (ny.us.hub hub.se)
14:01 🔗 HCross2 has quit IRC (ny.us.hub hub.se)
14:01 🔗 deathy has quit IRC (ny.us.hub hub.se)
14:01 🔗 alfie has quit IRC (ny.us.hub hub.se)
14:01 🔗 BartoCH has quit IRC (ny.us.hub hub.se)
14:01 🔗 ThisAsYou has quit IRC (ny.us.hub hub.se)
14:01 🔗 tklk has quit IRC (ny.us.hub hub.se)
14:01 🔗 Sue_ has quit IRC (ny.us.hub hub.se)
14:01 🔗 Muad-Dib has quit IRC (ny.us.hub hub.se)
14:01 🔗 Sanqui has quit IRC (ny.us.hub hub.se)
14:01 🔗 Meroje has quit IRC (ny.us.hub hub.se)
14:01 🔗 raphidae has quit IRC (ny.us.hub hub.se)
14:01 🔗 Boppen has quit IRC (ny.us.hub hub.se)
14:01 🔗 mls has quit IRC (ny.us.hub hub.se)
14:01 🔗 Sk1d has quit IRC (ny.us.hub hub.se)
14:01 🔗 andai has quit IRC (ny.us.hub hub.se)
14:01 🔗 Aoede has quit IRC (ny.us.hub hub.se)
14:01 🔗 nightpool has quit IRC (ny.us.hub hub.se)
14:01 🔗 hook54321 has quit IRC (ny.us.hub hub.se)
14:01 🔗 VeganMars has quit IRC (ny.us.hub hub.se)
14:01 🔗 Riviera has quit IRC (ny.us.hub hub.se)
14:01 🔗 SN4T14 has quit IRC (ny.us.hub hub.se)
14:01 🔗 tuluu_ has quit IRC (ny.us.hub hub.se)
14:01 🔗 JensRex has quit IRC (ny.us.hub hub.se)
14:01 🔗 tammy_ has quit IRC (ny.us.hub hub.se)
14:01 🔗 i0npulse has quit IRC (ny.us.hub hub.se)
14:01 🔗 Hecatz has quit IRC (ny.us.hub hub.se)
14:01 🔗 Rai-chan has quit IRC (ny.us.hub hub.se)
14:01 🔗 medowar has quit IRC (ny.us.hub hub.se)
14:01 🔗 purplebot has quit IRC (ny.us.hub hub.se)
14:01 🔗 Madchen has quit IRC (ny.us.hub hub.se)
14:01 🔗 PurpleSym has quit IRC (ny.us.hub hub.se)
14:01 🔗 altlabel has quit IRC (ny.us.hub hub.se)
14:01 🔗 Zeryl has quit IRC (ny.us.hub hub.se)
14:01 🔗 Jon- has quit IRC (ny.us.hub hub.se)
14:01 🔗 Stilett0 has quit IRC (ny.us.hub hub.se)
14:01 🔗 dashcloud has quit IRC (ny.us.hub hub.se)
14:01 🔗 espes__ has quit IRC (ny.us.hub hub.se)
14:01 🔗 kvieta has quit IRC (ny.us.hub hub.se)
14:01 🔗 Darkstar has quit IRC (ny.us.hub hub.se)
14:01 🔗 Lord_Nigh has quit IRC (ny.us.hub hub.se)
14:01 🔗 brayden_ has quit IRC (ny.us.hub hub.se)
14:01 🔗 t2t2 has quit IRC (ny.us.hub hub.se)
14:01 🔗 RichardG has quit IRC (ny.us.hub hub.se)
14:01 🔗 kurt has quit IRC (ny.us.hub hub.se)
14:01 🔗 Odd0002 has quit IRC (ny.us.hub hub.se)
14:01 🔗 ploop has quit IRC (ny.us.hub hub.se)
14:01 🔗 DFJustin has quit IRC (ny.us.hub hub.se)
14:01 🔗 SilSte has quit IRC (ny.us.hub hub.se)
14:01 🔗 Fletcher has quit IRC (ny.us.hub hub.se)
14:01 🔗 antonizoo has quit IRC (ny.us.hub hub.se)
14:01 🔗 fie has quit IRC (ny.us.hub hub.se)
14:01 🔗 tsr has quit IRC (ny.us.hub hub.se)
14:01 🔗 yuitimoth has quit IRC (ny.us.hub hub.se)
14:01 🔗 luckcolor has quit IRC (ny.us.hub hub.se)
14:01 🔗 tephra has quit IRC (ny.us.hub hub.se)
14:01 🔗 antomatic has quit IRC (ny.us.hub hub.se)
14:01 🔗 SmileyG has quit IRC (ny.us.hub hub.se)
14:01 🔗 kevinr has quit IRC (ny.us.hub hub.se)
14:01 🔗 Frogging has quit IRC (ny.us.hub hub.se)
14:01 🔗 johnny4 has quit IRC (ny.us.hub hub.se)
14:01 🔗 bsmith093 has quit IRC (ny.us.hub hub.se)
14:01 🔗 kisspunch has quit IRC (ny.us.hub hub.se)
14:01 🔗 tapedrive has quit IRC (ny.us.hub hub.se)
14:01 🔗 wolfpld has quit IRC (ny.us.hub hub.se)
14:02 🔗 antonizoo has joined #archiveteam-bs
14:02 🔗 fie has joined #archiveteam-bs
14:02 🔗 tsr has joined #archiveteam-bs
14:02 🔗 yuitimoth has joined #archiveteam-bs
14:02 🔗 luckcolor has joined #archiveteam-bs
14:02 🔗 tephra has joined #archiveteam-bs
14:02 🔗 SmileyG has joined #archiveteam-bs
14:02 🔗 antomatic has joined #archiveteam-bs
14:02 🔗 kevinr has joined #archiveteam-bs
14:02 🔗 Frogging has joined #archiveteam-bs
14:02 🔗 irc.efnet.nl sets mode: +oooo luckcolor SmileyG antomatic Frogging
14:02 🔗 johnny4 has joined #archiveteam-bs
14:02 🔗 bsmith093 has joined #archiveteam-bs
14:02 🔗 kisspunch has joined #archiveteam-bs
14:02 🔗 tapedrive has joined #archiveteam-bs
14:02 🔗 wolfpld has joined #archiveteam-bs
14:02 🔗 irc.efnet.nl sets mode: +o bsmith093
14:02 🔗 swebb sets mode: +o antomatic
14:02 🔗 Frogging sets mode: +o yipdw
14:02 🔗 SmileyG has quit IRC (Write error: Broken pipe)
14:02 🔗 Smiley has joined #archiveteam-bs
14:15 🔗 Zeryl has joined #archiveteam-bs
14:15 🔗 Stilett0 has joined #archiveteam-bs
14:15 🔗 Riviera has joined #archiveteam-bs
14:15 🔗 dashcloud has joined #archiveteam-bs
14:15 🔗 SN4T14 has joined #archiveteam-bs
14:15 🔗 espes__ has joined #archiveteam-bs
14:15 🔗 tuluu_ has joined #archiveteam-bs
14:15 🔗 kvieta has joined #archiveteam-bs
14:15 🔗 Darkstar has joined #archiveteam-bs
14:15 🔗 JensRex has joined #archiveteam-bs
14:15 🔗 tammy_ has joined #archiveteam-bs
14:15 🔗 i0npulse has joined #archiveteam-bs
14:15 🔗 Hecatz has joined #archiveteam-bs
14:15 🔗 medowar has joined #archiveteam-bs
14:15 🔗 Rai-chan has joined #archiveteam-bs
14:15 🔗 purplebot has joined #archiveteam-bs
14:15 🔗 Lord_Nigh has joined #archiveteam-bs
14:15 🔗 ploop has joined #archiveteam-bs
14:15 🔗 brayden_ has joined #archiveteam-bs
14:15 🔗 t2t2 has joined #archiveteam-bs
14:15 🔗 kurt has joined #archiveteam-bs
14:15 🔗 Odd0002 has joined #archiveteam-bs
14:15 🔗 DFJustin has joined #archiveteam-bs
14:15 🔗 hub.dk sets mode: +oooo medowar Lord_Nigh brayden_ DFJustin
14:15 🔗 SilSte has joined #archiveteam-bs
14:15 🔗 Fletcher has joined #archiveteam-bs
14:15 🔗 Madchen has joined #archiveteam-bs
14:15 🔗 altlabel has joined #archiveteam-bs
14:15 🔗 PurpleSym has joined #archiveteam-bs
14:15 🔗 hub.dk sets mode: +oo Fletcher PurpleSym
14:15 🔗 swebb sets mode: +o brayden_
14:15 🔗 swebb sets mode: +o DFJustin
14:15 🔗 jmtd has joined #archiveteam-bs
14:24 🔗 Boppen has joined #archiveteam-bs
14:32 🔗 Jonison has joined #archiveteam-bs
14:32 🔗 Kenshin has joined #archiveteam-bs
14:32 🔗 K4k has joined #archiveteam-bs
14:32 🔗 SketchCow has joined #archiveteam-bs
14:32 🔗 Kaz has joined #archiveteam-bs
14:32 🔗 Ctrl-S___ has joined #archiveteam-bs
14:32 🔗 alembic has joined #archiveteam-bs
14:32 🔗 floogulin has joined #archiveteam-bs
14:32 🔗 HCross2 has joined #archiveteam-bs
14:32 🔗 deathy has joined #archiveteam-bs
14:32 🔗 alfie has joined #archiveteam-bs
14:32 🔗 BartoCH has joined #archiveteam-bs
14:32 🔗 tklk has joined #archiveteam-bs
14:32 🔗 raphidae has joined #archiveteam-bs
14:32 🔗 ThisAsYou has joined #archiveteam-bs
14:32 🔗 Muad-Dib has joined #archiveteam-bs
14:32 🔗 Meroje has joined #archiveteam-bs
14:32 🔗 Sue_ has joined #archiveteam-bs
14:32 🔗 Sanqui has joined #archiveteam-bs
14:32 🔗 efnet.port80.se sets mode: +oooo SketchCow Kaz HCross2 Sanqui
14:32 🔗 swebb sets mode: +o SketchCow
14:34 🔗 Jonison has quit IRC (Read error: Connection reset by peer)
14:52 🔗 nyany has joined #archiveteam-bs
15:30 🔗 Aranje has joined #archiveteam-bs
15:42 🔗 SpaffGarg has joined #archiveteam-bs
15:42 🔗 RichardG_ has joined #archiveteam-bs
15:42 🔗 mls has joined #archiveteam-bs
15:42 🔗 Sk1d has joined #archiveteam-bs
15:42 🔗 andai has joined #archiveteam-bs
15:42 🔗 Aoede has joined #archiveteam-bs
15:42 🔗 nightpool has joined #archiveteam-bs
15:42 🔗 hook54321 has joined #archiveteam-bs
15:42 🔗 VeganMars has joined #archiveteam-bs
15:44 🔗 RichardG_ is now known as RichardG
16:01 🔗 pizzaiolo has joined #archiveteam-bs
16:24 🔗 phuz has joined #archiveteam-bs
16:24 🔗 phuzion has quit IRC (Read error: Connection reset by peer)
16:35 🔗 antonizoo has quit IRC (Remote host closed the connection)
16:44 🔗 ZexaronS has joined #archiveteam-bs
16:50 🔗 antonizoo has joined #archiveteam-bs
17:25 🔗 sun_rise has joined #archiveteam-bs
17:27 🔗 sun_rise If anyone is around I'm interested in pointing archivebot at something in the other channel
17:28 🔗 GE has quit IRC (Remote host closed the connection)
18:00 🔗 sun_rise The job finished but I can't find it in the viewer (or anywhere else?) Says status completed. I'm a little confused.
18:07 🔗 pizzaiolo has quit IRC (Read error: Connection reset by peer)
18:08 🔗 pizzaiolo has joined #archiveteam-bs
18:15 🔗 joepie91 sun_rise: iirc jobs are uploaded/ingested about daily
18:15 🔗 SpaffGarg has quit IRC (Ping timeout: 250 seconds)
18:21 🔗 SpaffGarg has joined #archiveteam-bs
19:10 🔗 GE has joined #archiveteam-bs
19:23 🔗 pizzaiolo has quit IRC (Quit: pizzaiolo)
19:23 🔗 JAA has joined #archiveteam-bs
19:25 🔗 pizzaiolo has joined #archiveteam-bs
19:38 🔗 Aranje has quit IRC (Ping timeout: 245 seconds)
20:05 🔗 ZexaronS- has joined #archiveteam-bs
20:06 🔗 sep332 has quit IRC (Read error: Operation timed out)
20:06 🔗 ZexaronS has quit IRC (Read error: Operation timed out)
20:23 🔗 sep332 has joined #archiveteam-bs
21:08 🔗 speculaas has joined #archiveteam-bs
21:08 🔗 joepie91 speculaas: okay, so, it's *possible* to extract data from the existing archives, but it currently still requires some manual work
21:09 🔗 joepie91 speculaas: specifically, you can download the indexes of all the Hyves items on archive.org, which contain a list of every URL that is contained in a given item along with its 'offset' (position in the WARC file)
21:09 🔗 speculaas oke
21:09 🔗 joepie91 speculaas: you can then use those positions to do a HTTP range request and retrieve just those bits of the WARC file, obtaining the pages
21:09 🔗 speculaas Here are some archives https://archive.org/details/hyves?&sort=-downloads&page=2
21:10 🔗 joepie91 speculaas: there's - to my knowledge - not yet a nice one-stop way to extract an account
21:10 🔗 joepie91 speculaas: if you just want to *look* at the account, it's faster to look it up in the wayback machine
21:10 🔗 joepie91 all the Hyves archives should have been imported into that
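The offset-plus-range-request approach joepie91 describes could look roughly like the following. This is a minimal sketch, assuming the index gives a byte offset and a compressed record length for each URL, and that each record in the `.warc.gz` is its own gzip member. `range_header` and `fetch_record` are hypothetical helpers, not an existing tool, and the values in the usage comment are placeholders.

```python
import gzip
import urllib.request


def range_header(offset: int, length: int) -> str:
    """Build an HTTP Range header value covering one WARC record."""
    # Range end positions are inclusive, hence the -1.
    return "bytes={}-{}".format(offset, offset + length - 1)


def fetch_record(warc_url: str, offset: int, length: int) -> bytes:
    """Download one compressed record from a .warc.gz and decompress it."""
    request = urllib.request.Request(
        warc_url, headers={"Range": range_header(offset, length)}
    )
    with urllib.request.urlopen(request) as response:
        compressed = response.read()
    # Assumption: the requested byte range is exactly one gzip member.
    return gzip.decompress(compressed)


# Usage (placeholder values, not real index entries):
# record = fetch_record("https://archive.org/download/<item>/<file>.warc.gz", 1000, 500)
```

The decompressed bytes would hold the WARC headers followed by the HTTP response for that one URL.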
21:17 🔗 speculaas The url for that is: www.hyves.nl/username ?
21:19 🔗 speculaas I already know the url but I see my account is not public
21:32 🔗 schbirid has quit IRC (Quit: Leaving)
21:32 🔗 joepie91 speculaas: ah yeah, we only got the public profiles... so if it was a private profile, I'm afraid it can't be recovered :/
21:32 🔗 joepie91 speculaas: unless a friend kept around a copy...
21:35 🔗 speculaas Okay, then I know enough. Thanks for your time ;)
21:36 🔗 joepie91 speculaas: good luck in your search :)
21:40 🔗 speculaas has quit IRC (Ping timeout: 268 seconds)
21:43 🔗 sun_rise has quit IRC (ny.us.hub irc.efnet.nl)
21:43 🔗 fie has quit IRC (ny.us.hub irc.efnet.nl)
21:43 🔗 tsr has quit IRC (ny.us.hub irc.efnet.nl)
21:43 🔗 yuitimoth has quit IRC (ny.us.hub irc.efnet.nl)
21:43 🔗 luckcolor has quit IRC (ny.us.hub irc.efnet.nl)
21:43 🔗 tephra has quit IRC (ny.us.hub irc.efnet.nl)
21:43 🔗 antomatic has quit IRC (ny.us.hub irc.efnet.nl)
21:43 🔗 kevinr has quit IRC (ny.us.hub irc.efnet.nl)
21:43 🔗 Frogging has quit IRC (ny.us.hub irc.efnet.nl)
21:43 🔗 johnny4 has quit IRC (ny.us.hub irc.efnet.nl)
21:43 🔗 bsmith093 has quit IRC (ny.us.hub irc.efnet.nl)
21:43 🔗 kisspunch has quit IRC (ny.us.hub irc.efnet.nl)
21:43 🔗 tapedrive has quit IRC (ny.us.hub irc.efnet.nl)
21:43 🔗 wolfpld has quit IRC (ny.us.hub irc.efnet.nl)
21:44 🔗 sun_rise has joined #archiveteam-bs
21:44 🔗 fie has joined #archiveteam-bs
21:44 🔗 tsr has joined #archiveteam-bs
21:44 🔗 yuitimoth has joined #archiveteam-bs
21:44 🔗 luckcolor has joined #archiveteam-bs
21:44 🔗 tephra has joined #archiveteam-bs
21:44 🔗 antomatic has joined #archiveteam-bs
21:44 🔗 kevinr has joined #archiveteam-bs
21:44 🔗 Frogging has joined #archiveteam-bs
21:44 🔗 johnny4 has joined #archiveteam-bs
21:44 🔗 bsmith093 has joined #archiveteam-bs
21:44 🔗 irc.efnet.nl sets mode: +oooo luckcolor antomatic Frogging bsmith093
21:44 🔗 kisspunch has joined #archiveteam-bs
21:44 🔗 tapedrive has joined #archiveteam-bs
21:44 🔗 wolfpld has joined #archiveteam-bs
21:44 🔗 swebb sets mode: +o antomatic
21:44 🔗 Frogging sets mode: +o yipdw
22:05 🔗 Sanqui is it using lxml?
22:06 🔗 FalconK has joined #archiveteam-bs
22:06 🔗 FalconK hah!
22:06 🔗 FalconK yo Sanqui
22:06 🔗 Sanqui oh hey
22:07 🔗 FalconK so a bunch of the archivebot pipelines are dual-core atoms clocking at 2.4GHz in virtualized environments
22:07 🔗 yipdw Sanqui: it depends on the configuration. libxml on some pipelines, html5lib on others
22:07 🔗 yipdw we started with libxml but it kept crashing for some reason
22:07 🔗 yipdw html5lib gets more stuff and seems more stable but is more expensive re CPU
22:07 🔗 FalconK I think all of them are html5lib now?
22:07 🔗 Sanqui for reference, i wrote <@Sanqui> is it using lxml?
22:07 🔗 yipdw probably, but I can't be sure of that since people can change the pip manifest
22:08 🔗 FalconK ty
22:08 🔗 Sanqui yeah html5lib is gonna be cpu expensive
22:08 🔗 yipdw anyway, I don't think there's a way around parsing the documents to get links and stuff
22:08 🔗 FalconK ha, we have a manageability issue too writ large
22:08 🔗 Sanqui ideally we'd use libxml and allow to change with a parameter
22:08 🔗 Sanqui if a certain website had issues
22:08 🔗 FalconK a suggestion from some local crew I know was to forgo the XML parsing entirely and just use a best-effort regex, and accept that it will find some bullshit
22:08 🔗 yipdw recent Chrome release has an official headless mode and that seems interesting
22:09 🔗 FalconK the biggest reason to not use a regex is that it will fall down making relative URLs out of any / in anything
22:09 🔗 FalconK or else miss tons of stuff
22:09 🔗 yipdw yeah, that's a lot of webpages :P
22:09 🔗 FalconK so I rejected that solution
22:09 🔗 FalconK fucking relative URLs
22:10 🔗 xmc hm
22:10 🔗 yipdw I still think using a browser is probably the way to go
22:10 🔗 xmc so ... yeah
22:10 🔗 Sanqui you could make a more... "outzoomed" regex looking for href=
22:10 🔗 xmc ugh
22:10 🔗 yipdw more and more websites are using client rendering
22:10 🔗 xmc computers suck
22:10 🔗 FalconK one could do that yes
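The narrower regex idea (match only `href=`/`src=` attribute values rather than anything containing a slash) might be sketched like this. It is a best-effort illustration, not what archivebot or wpull does: `ATTR_LINK` and `extract_links` are made-up names, unquoted attribute values are missed, and attributes such as `data-href` would also match.

```python
import re
from urllib.parse import urljoin

# Only quoted href/src attribute values are considered, so a stray "/" in
# body text (e.g. "1/2 cup") is never turned into a relative URL.
ATTR_LINK = re.compile(
    r"""(?:href|src)\s*=\s*(["'])(.*?)\1""", re.IGNORECASE | re.DOTALL
)


def extract_links(html: str, base_url: str) -> list:
    """Return absolute URLs for every quoted href/src value in the markup."""
    return [
        urljoin(base_url, match.group(2).strip())
        for match in ATTR_LINK.finditer(html)
    ]


sample = '<a href="/about">x</a> <img SRC=\'pic.png\'> <p>1/2 cup</p>'
print(extract_links(sample, "http://example.com/dir/page.html"))
```

The trade-off raised in the channel still applies: links built by scripts or placed in other attributes are not found.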
22:10 🔗 FalconK for client rendering we have to use phantomjs anyway
22:10 🔗 Sanqui phantomjs is dead
22:10 🔗 xmc or Headless Chrome Because Yes
22:10 🔗 yipdw and if you're looking for an optimized way to parse documents, you might as well look at a Web browser
22:10 🔗 Sanqui tbh it'd be very nice if we could just spin up chromes
22:11 🔗 FalconK that will cause us to need more CPU, not less
22:11 🔗 xmc ^
22:11 🔗 yipdw maybe
22:12 🔗 Sanqui in place of phantomjs anyway
22:12 🔗 FalconK it would be a lot nicer to use headless chrome than phantomjs for the things we do need client-side rendering for
22:12 🔗 FalconK but we still need client-side rendering for a small minority of sites
22:12 🔗 FalconK the only major use of it I've noticed, actually, is twitter.
22:12 🔗 yipdw I think that may be because that's the only place it seems to reliably work
22:12 🔗 yipdw "reliably"
22:12 🔗 FalconK phantomjs is also crashy af
22:12 🔗 FalconK that may also be the case
22:13 🔗 yipdw but there's also a lot of blog sites that use client-side rendering and have no fallback
22:13 🔗 FalconK I'm not at all opposed to using headless chrome in place of phantomjs and seeing how it performs
22:13 🔗 Sanqui honestly, for archiving websites like twitter, youtube, facebook etc., the bot should have specific modes that are curated
22:13 🔗 yipdw usually it's software developers because software developers are idiots
22:13 🔗 FalconK Sanqui: yes, that's also on the long todo list
22:13 🔗 yipdw can confirm, I write software
22:13 🔗 FalconK we want a !twitter at least, and possibly an !youtube and !reddit
22:14 🔗 FalconK !facebook would require a lot of work
22:15 🔗 FalconK separately, there's this CPU usage issue :P
22:15 🔗 yipdw did anyone manage to get a useful CPU profile? I tried once but I just got a bunch of "your program is spending most of its time in Python's evaluator"
22:15 🔗 yipdw which is like saying "your program is spending most of its time running"
22:15 🔗 FalconK there's *another* issue, which is that wpull.db grows to tens of GB when crawling large sites, but I'm willing to live with that for the moment since the high CPU usage is actually the pain point right now
22:15 🔗 Sanqui anyway, to drive the point home: phantomjs is over, the lead developer has stepped out in anticipation of headless chrome https://groups.google.com/forum/#!topic/phantomjs/9aI5d-LDuNE
22:15 🔗 FalconK yipdw: I did!
22:15 🔗 yipdw oh
22:15 🔗 Sanqui so we need to do something eventually
22:15 🔗 yipdw do you still have the profile data?
22:15 🔗 FalconK ananiel-s6 is currently dedicated to profiling
22:16 🔗 yipdw ah good
22:16 🔗 FalconK let me see if I do still have it; if not, I can get it again later
22:16 🔗 yipdw yeah, I'd like to see that. I got as far as perf and then I got annoyed and had to switch gears
22:16 🔗 FalconK but I recall html5lib stuff featured very heavily
22:16 🔗 FalconK perf is fucking awful to deal with
22:16 🔗 FalconK I hate optimizing
22:16 🔗 Sanqui html5lib is parsing in python
22:16 🔗 yipdw I keep seeing good testimonials for Telemetry
22:16 🔗 Sanqui we really want lxml
22:16 🔗 yipdw we've tried libxml before
22:16 🔗 yipdw it kept blowing up
22:17 🔗 Sanqui then we should figure out why and report it upstream
22:17 🔗 FalconK let's see - how does one read cprofile things again
22:17 🔗 FalconK actually yipdw do you just want the cprofile?
22:17 🔗 Sanqui (sorry for the 'we', i'm not trying to sound smart here)
22:17 🔗 FalconK I'll put it up somewhere
22:17 🔗 yipdw Sanqui: I mean, yes, but in the meantime it was easier to just switch to html5lib and deliver something working
22:17 🔗 Sanqui (i fully recognize i have done zero archivebot development)
22:18 🔗 FalconK we have a very bad test process right now for archivebot
22:18 🔗 Sanqui yes I noticed tests are failing
22:18 🔗 FalconK which is make it do real work, and then wait until it falls over, and see if you got enough information to figure out the failure case
22:18 🔗 REiN^ has quit IRC (Read error: Operation timed out)
22:19 🔗 yipdw chfoo wrote a smoke test harness, but there's a lot of moving parts and I haven't looked at what it takes to put them back together in the Travis environment
22:19 🔗 FalconK on my end, this is mostly because I have $infinity things to do that aren't archivebot, so... :P
22:19 🔗 FalconK no offense to chfoo but his code has a LOT of moving parts
22:19 🔗 yipdw i mean really I don't think "test it in production" is really a bad idea here
22:19 🔗 FalconK it's not; it just takes forever
22:19 🔗 yipdw if you have good telemetry, it's awesome
22:19 🔗 JAA I'm not familiar with what libxml and html5lib really are internally, but probably the best option would be to use the XML parser library from a browser (i.e. Chromium or Firefox), right?
22:20 🔗 FalconK html5lib is basically that
22:20 🔗 yipdw I don't know of anyone who has extracted those for consumption in something else
22:20 🔗 FalconK it's intended afaik to be a W3C compliant HTML parser not unlike, say, sax for XML
22:20 🔗 yipdw I guess that's a good point too
22:20 🔗 yipdw we really can't use "an XML library" to be pedantic
22:20 🔗 JAA Yeah. Problem is, many websites aren't W3C compliant.
22:21 🔗 yipdw HTML isn't XML and archivebot has to be able to deal with that
22:21 🔗 FalconK maybe it's html5lib that needs perf
22:21 🔗 JAA We still want to be able to handle those.
22:21 🔗 FalconK we don't currently have a significant problem with that
22:21 🔗 Sanqui JAA: it's inside out here
22:21 🔗 yipdw indeed archivebot tends to get pointed at a lot of small, old sites
22:21 🔗 Sanqui W3C defines how to deal with websites that aren't W3C compliant
22:21 🔗 Sanqui and browsers follow that
22:21 🔗 FalconK other than operator error I haven't had many complaints of archivebot missing things
22:21 🔗 yipdw I don't have notes, but I think that's another reason why html5lib switch happened
22:21 🔗 JAA I see
22:21 🔗 FalconK if it's noticed, I'd love to hear about it
22:22 🔗 yipdw it just got better results
22:22 🔗 yipdw no point in performing faster if you miss page requisites etc
22:22 🔗 REiN^ has joined #archiveteam-bs
22:22 🔗 Sanqui could always try pypy :)
22:22 🔗 JAA If html5lib works so well, how about rewriting it as a C extension? /s
22:23 🔗 ZexaronS- has quit IRC (Leaving)
22:23 🔗 yipdw debugging the intersection of python and C is prohibited by the Geneva Conventions
22:23 🔗 yipdw I mean you can inflict it on yourself but
22:24 🔗 GE has quit IRC (Remote host closed the connection)
22:24 🔗 yipdw tangentially related, I'm working on a project and part of it is an app that calls into a Go library
22:24 🔗 yipdw from C
22:25 🔗 FalconK http://ananiels6.falconkirtaran.net/cprof.dat
22:25 🔗 yipdw the app is trivially stack-smashable if you send a URL that's longer than 2048 bytes
22:25 🔗 FalconK that link work?
22:25 🔗 yipdw I thought that was really funny
22:25 🔗 yipdw because it's like "Go will save me"
22:25 🔗 yipdw yeah no
22:25 🔗 yipdw Connecting to ananiels6.falconkirtaran.net (ananiels6.falconkirtaran.net)|51.15.47.106|:80... failed: Connection refused.
22:26 🔗 yipdw brb
22:27 🔗 JAA I've done it before, and it's actually not too bad as long as you can keep all the real work in C and just have a thin transition layer converting the stuff from/to Python variables.
22:27 🔗 JAA But for obvious reasons, I wouldn't want to implement an XML parser, ever. Most certainly not in C.
22:29 🔗 Sanqui this is not work we should be doing
22:31 🔗 FalconK +1
22:31 🔗 FalconK ffs
22:31 🔗 FalconK http://ananiels6.falconkirtaran.net:8000/cprof.dat
22:31 🔗 FalconK strings are awful
22:32 🔗 Sanqui hooray for SimpleHTTPServer
22:32 🔗 JAA Indeed. wpull really needs fixing. Version 2.0.1 has so many bugs that it's not even funny; e.g. concurrency is broken entirely and aborting doesn't work. And version 1.2.3 throws up when used with the current html5lib version since the API changed and the requirements.txt doesn't force the specific, compatible version.
22:32 🔗 FalconK yipdw and I did the work to transition archivebot to wpull2 like 6 months back
22:33 🔗 FalconK I suggest that we roll with chfoo's changes and deprecate 1.x
22:33 🔗 FalconK but we will need to fix concurrency for sure
22:33 🔗 FalconK aborting is working fine for archivebot by the way
22:33 🔗 FalconK er... as fine as it ever has worked
22:34 🔗 yipdw ok
22:35 🔗 JAA Interesting. I always had to hard-abort (twice ^C) it when I tried. After a few attempts, I went to 1.2.3
22:36 🔗 FalconK anyway the thing that really jumps out at me in the 650 second profile there:
22:36 🔗 FalconK 926 23.751 0.026 324.046 0.350 /home/archivebot/.local/lib/python3.5/site-packages/wpull/scraper/html.py:127(_process_elements)
22:37 🔗 FalconK it's spending literally 50% of its time in html._process_elements
22:38 🔗 yipdw well, in some way that's kinda cool
22:38 🔗 yipdw it means all of our add-on stuff isn't the slow bit
22:38 🔗 FalconK yeah...
22:39 🔗 FalconK by comparison, by the way, it spends about 7.5% of its time working with sqlite
22:39 🔗 Stilett0 has quit IRC (Read error: Operation timed out)
22:39 🔗 yipdw which to me is counterintuitive. I thought running hundreds of regular expressions on each document would be a problem
22:39 🔗 yipdw turns out, it isn't the dominating factor
22:39 🔗 yipdw profiles are awesome
22:39 🔗 FalconK yeah
22:39 🔗 FalconK our regexp running is efficient, I think, right? it compiles them into one state machine?
22:39 🔗 yipdw no
22:39 🔗 JAA Hmm, doesn't the HTML parsing happen outside of _process_elements?
22:39 🔗 FalconK no idea
22:40 🔗 yipdw but we do compile the regexes / make use of the Python regexp cache
22:40 🔗 * FalconK nods
22:40 🔗 yipdw so it's probably fast enough
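For the "one state machine" question, here is a small sketch of the two options: precompiled patterns checked one by one, versus a single alternation. The patterns are invented examples, not archivebot's ignore sets. Note that Python's `re` is a backtracking engine, so even the combined pattern is not a true merged automaton, and combining breaks if individual patterns use numbered backreferences.

```python
import re

# Made-up ignore patterns for illustration only.
IGNORE_PATTERNS = [r"/calendar/\d{4}/", r"[?&]sessionid=", r"\.ics$"]

# Option 1: compile each pattern once and test them in turn.
COMPILED = [re.compile(pattern) for pattern in IGNORE_PATTERNS]


def matches_any(url: str) -> bool:
    return any(pattern.search(url) for pattern in COMPILED)


# Option 2: wrap each pattern in a non-capturing group and join with "|".
COMBINED = re.compile("|".join("(?:{})".format(p) for p in IGNORE_PATTERNS))


def matches_combined(url: str) -> bool:
    return COMBINED.search(url) is not None
```

Which option is faster depends on the patterns and inputs, so it would need measuring rather than assuming.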
22:40 🔗 FalconK the regex thing doesn't even seem to appear in the profiling
22:40 🔗 FalconK er
22:40 🔗 FalconK not anywhere near the top
22:40 🔗 yipdw neat
22:41 🔗 yipdw it's good to know also that sqlite is fast
22:41 🔗 FalconK 78084 0.907 0.000 28.838 0.000 /home/archivebot/.local/lib/python3.5/site-packages/wpull/application/hook.py:132(notify)
22:41 🔗 FalconK 5% of time in hooks of any kind
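The percentages quoted here follow from dividing each row's cumulative time by the length of the profile. A quick check, assuming the run was about 650 seconds as stated and taking the cumtime column from the two pasted rows (`share_of_run` is just an illustrative helper):

```python
PROFILE_SECONDS = 650.0  # approximate length of the profiled run, per the log

# cumtime values copied from the pasted cProfile rows.
CUMTIME = {"_process_elements": 324.046, "hook.notify": 28.838}


def share_of_run(cumtime_seconds, total_seconds=PROFILE_SECONDS):
    """Percentage of the run spent in a function, including its callees."""
    return 100.0 * cumtime_seconds / total_seconds


for name, seconds in CUMTIME.items():
    print("{}: {:.1f}%".format(name, share_of_run(seconds)))
# _process_elements: 49.9%
# hook.notify: 4.4%
```

Cumulative time includes everything a function calls, so these shares overlap and do not add up to 100%.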
22:41 🔗 yipdw i had a suspicion it was more than sufficient for this but it's cool to see that it's at the bottom
22:41 🔗 * FalconK nods
22:41 🔗 yipdw so, hmm
22:41 🔗 yipdw what is process_elements doing
22:41 🔗 FalconK I don't even remember what this job was (probably !a http://cnn.com/ or something)
22:42 🔗 FalconK but yes, one wonders
22:42 🔗 JAA It seems that the parsing happens in wpull.document.html.HTMLReader.iter_elements .
22:42 🔗 yipdw FalconK: can you put that profile data back up?
22:42 🔗 yipdw I get a connection refused talking to that site
22:43 🔗 yipdw that or if you can drill down into process_elements that'd be fab
22:43 🔗 FalconK oh sure
22:43 🔗 yipdw it's a pretty big method
22:43 🔗 FalconK sorry, it was python http.server and I took it down to read the data
22:43 🔗 yipdw ah ok
22:43 🔗 JAA Yeah, line profiling for _process_elements would be helpful.
22:43 🔗 FalconK up again
22:43 🔗 FalconK go for it
22:44 🔗 yipdw hmm
22:44 🔗 yipdw Connecting to ananiels6.falconkirtaran.net (ananiels6.falconkirtaran.net)|51.15.47.106|:80... failed: Connection refused.
22:44 🔗 FalconK :8000
22:44 🔗 yipdw oh feck
22:44 🔗 yipdw there we go
22:44 🔗 yipdw done
22:44 🔗 FalconK :)
22:44 🔗 FalconK I'll leave it up for a bit while I read _process_elements
22:45 🔗 yipdw python -m cProfile -s cumtime will never not be funny to me
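On the earlier "how does one read cprofile things" question: the standard library's `pstats` module can sort and print profile data. A self-contained sketch, with a toy `busy` function standing in for the real workload; whether `cprof.dat` is a plain `pstats` dump is an assumption.

```python
import cProfile
import io
import pstats


def busy():
    """Toy workload standing in for a real crawl."""
    return sum(i * i for i in range(10000))


profiler = cProfile.Profile()
profiler.enable()
busy()
profiler.disable()

# Sort by cumulative time (same ordering as "-s cumtime") and show the top 5.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)

# For a saved dump, pass the file name instead of a profiler object:
#   stats = pstats.Stats("cprof.dat")
# and drill into one function's callees with:
#   stats.print_callees("_process_elements")
```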
22:45 🔗 yipdw also hi yes I am 12
22:46 🔗 FalconK oddly clean_link_soup is negligible
22:46 🔗 FalconK 210118 1.320 0.000 3.423 0.000 /home/archivebot/.local/lib/python3.5/site-packages/wpull/scraper/util.py:38(clean_link_soup)
22:46 🔗 yipdw it'd be funny if it ended up being urljoin_safe or something
22:47 🔗 yipdw 50% of overall time spent in string concat and reallocation
22:47 🔗 FalconK 164679 1.413 0.000 30.091 0.000 /home/archivebot/.local/lib/python3.5/site-packages/wpull/url.py:684(urljoin)
22:47 🔗 yipdw wat
22:47 🔗 yipdw are you fucking kidding me
22:47 🔗 FalconK :P
22:47 🔗 JAA Sidenote: I think there's a bug in _process_elements: "if self._only_relative:" followed by "if link_info.base_link or '://' in link_info.link:" probably doesn't catch protocol-relative links, i.e. 'href="//example.com/"'.
22:47 🔗 yipdw JAA: hmm
22:47 🔗 FalconK what's that?
22:48 🔗 yipdw I don't recall scheme-relative links being a problem, but we can try that out
22:48 🔗 FalconK oh huh, https://www.paulirish.com/2010/the-protocol-relative-url/... TIL
22:48 🔗 xmc they're vaguely useful
22:50 🔗 FalconK JAA: I think you're right; the best way to address it would be a PR
22:51 🔗 FalconK this is kind of a big deal too:
22:51 🔗 FalconK 1077 0.230 0.000 51.563 0.048 /home/archivebot/.local/lib/python3.5/site-packages/wpull/database/wrap.py:41(add_many)
22:52 🔗 yipdw FalconK: wait, are you sure this is html5lib?
22:52 🔗 yipdw I see parse_lxml in the output
22:52 🔗 FalconK I went through the same inquiry
22:52 🔗 FalconK I don't remember the conclusion I came to
22:52 🔗 JAA FalconK: I guess. Then again, other PRs have been sitting there for months, so motivation is limited. Also, I have no idea how to fix it properly without breaking other stuff. Paths in URLs can contain several consecutive slashes IIRC; that is, href="some//path" is equivalent to href="some/path".
22:52 🔗 FalconK either both are in use, or else someone put html5lib in but left all the functions named like libxml.
22:53 🔗 FalconK JAA: right, it looks like only r'^//' is protocol-relative
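One way to tell the cases apart without a `'://'` substring test is to let `urllib.parse.urlsplit` decide. In the sketch below, `is_relative_link` is a hypothetical helper, not wpull's code; it treats a link as relative only when it has neither a scheme nor a host, so `//example.com/` is rejected while `some//path` is kept.

```python
from urllib.parse import urljoin, urlsplit


def is_relative_link(link: str) -> bool:
    """True only for links with no scheme and no network location."""
    parts = urlsplit(link.strip())
    # "//host/path" parses with an empty scheme but a non-empty netloc.
    return not parts.scheme and not parts.netloc


for candidate in ["/a/b", "some//path", "//example.com/", "http://example.com/"]:
    print(candidate, is_relative_link(candidate))

# A protocol-relative link inherits the scheme of the page it appears on.
print(urljoin("https://site.test/x", "//example.com/"))  # https://example.com/
```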
22:54 🔗 FalconK no comment on PRs except that archivebot specifies github.com/falconkirtaran/wpull in requirements.txt
22:54 🔗 FalconK because before my omnibus PR was accepted wpull2 was too crashy to use
22:55 🔗 yipdw well, maybe parse_lxml is the wrong place to look anyway. profile indicates that most of the time in there is spent in the "start" method but that method just invokes callbacks
22:55 🔗 FalconK heya yipdw I think that add_many prof item might contain the plugins?
22:55 🔗 yipdw and the callbacks aren't showing up in the profile, AFAICT
22:56 🔗 yipdw FalconK: not sure
22:56 🔗 yipdw oh wait, the callbacks are in the called: section
22:57 🔗 FalconK oh, still not a problem
22:57 🔗 FalconK 2159 0.036 0.000 6.803 0.003 archive_bot_plugin.py:214(accept_url)
22:58 🔗 yipdw huh. highest total time in start is /home/archivebot/.local/lib/python3.5/site-packages/wpull/collections.py:244(__init__)
22:58 🔗 yipdw does this just spend most of its time managing lists?
22:58 🔗 FalconK what does *that* abstraction do
22:58 🔗 FalconK it might, though
22:58 🔗 FalconK wpull -r keeps a lot of lists
22:59 🔗 Stilett0 has joined #archiveteam-bs
22:59 🔗 yipdw line 244 of collections.py is the initializer for FrozenDict
22:59 🔗 yipdw which does e.g.
22:59 🔗 yipdw def __init__(self, orig_dict):
22:59 🔗 yipdw     self.orig_dict = orig_dict
22:59 🔗 yipdw     self.hash_cache = hash(tuple(sorted(self.orig_dict.items())))
22:59 🔗 yipdw over 1.68 million calls to that; that seems like it might be a thing
22:59 🔗 FalconK wait
23:00 🔗 FalconK hash(...sorted(?
23:00 🔗 yipdw yeah
23:00 🔗 FalconK why
23:00 🔗 yipdw I don't know
23:00 🔗 yipdw do Python hashes guarantee any sort of iteration order?
23:00 🔗 yipdw I know Ruby does
23:00 🔗 FalconK I suppose that would depend
23:01 🔗 FalconK what stability properties does it require?
23:01 🔗 yipdw not sure
23:01 🔗 FalconK python is not my primary language
23:01 🔗 FalconK (that'd be C++, followed by x86 ASM)
23:02 🔗 yipdw FrozenDict is used in lxml.HTMLParserTarget.start
23:02 🔗 FalconK well!
23:02 🔗 yipdw I'm not really sure if it's needed, though
23:02 🔗 yipdw hard to tell
23:03 🔗 yipdw it's also not immediately clear to me what it's wrapping -- it's 'attrib'
23:03 🔗 yipdw (tag attributes?)
23:03 🔗 FalconK murdering it entirely would speed us up by 2%
23:04 🔗 FalconK AKA 4-6 page grabs per hundred seconds
23:04 🔗 yipdw or more, depending on what effect that would have with fewer allocations
23:04 🔗 FalconK oh, true
23:04 🔗 FalconK the allocator is still a black box to us
23:04 🔗 yipdw I was just poking at it because it showed up pretty high in the profiles
23:04 🔗 FalconK though I feel like it's probably spending a lot more time sorting than allocating
23:05 🔗 FalconK I don't think __init__ captures time python spends allocating
23:05 🔗 FalconK and actually the python heap processing was insignificant anyway, wasn't it?
23:05 🔗 yipdw it might not, but FrozenDict is making more objects in its initializer
23:05 🔗 yipdw i.e. the new hash and the temporary tuple
23:05 🔗 FalconK mm
23:05 🔗 yipdw I don't know how expensive that is on the allocator (it might be trivial)
23:06 🔗 yipdw anyway, I guess one thing to try would be to replace FrozenDict() with, like, dict()
23:06 🔗 FalconK I don't think allocator time is captured with the jit time
23:06 🔗 FalconK but yeah, we could try that on ananiel-S6
23:06 🔗 yipdw you lose the immutability guarantee but it'd be one way to see if FrozenDict() introduces a large penalty
23:06 🔗 yipdw or, in the specific case of start(), just don't wrap attrib in a FrozenDict()
23:07 🔗 yipdw I doubt it will have a perceptible macro difference but it would be neat to see how it changes the profile
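A rough way to see the per-construction cost in isolation, before touching the crawler, is a micro-benchmark. This is a minimal sketch: FrozenDictSketch is a stand-in built only from the three lines quoted earlier (the real class has more to it), and the attrib contents are made up.

```python
import timeit


class FrozenDictSketch:
    """Stand-in built from the three quoted lines; not the real wpull class."""

    def __init__(self, orig_dict):
        self.orig_dict = orig_dict
        # Builds a sorted list, a temporary tuple, and the hash on every call.
        self.hash_cache = hash(tuple(sorted(self.orig_dict.items())))


# Made-up attributes resembling what an HTML parser might hand to start().
attrib = {"href": "/story/1", "class": "title", "id": "s1", "rel": "nofollow"}

n = 100000
wrapped = timeit.timeit(lambda: FrozenDictSketch(attrib), number=n)
plain = timeit.timeit(lambda: dict(attrib), number=n)

print("FrozenDictSketch: %.3f s for %d constructions" % (wrapped, n))
print("dict copy:        %.3f s for %d constructions" % (plain, n))
```

A micro-benchmark like this says nothing about allocator pressure or cache effects inside a long crawl, so the full profile remains the better measure.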
23:08 🔗 FalconK now I'm confused about this:
23:08 🔗 yipdw speaking of C++, one thing that C++ has made me really paranoid about (probably overly paranoid) is allocations
23:08 🔗 FalconK there's both lxml_.py and html5lib_.py
23:08 🔗 FalconK why
23:08 🔗 yipdw like every time I've had a performance problem, it wasn't algorithmic. it was because I was fucking mallocing too much
23:08 🔗 yipdw or treating cache lines like slacklines
23:08 🔗 yipdw that sort of thing
23:09 🔗 yipdw FalconK: huh
23:09 🔗 yipdw dunno
23:10 🔗 yipdw maybe this is using libxml after all?
23:11 🔗 FalconK I wonder if it's using libxml for XHTML documents and html5lib for others?
23:11 🔗 FalconK I remember there was some complex dispatch logic
23:12 🔗 FalconK it's just so ungodly complex
23:13 🔗 yipdw maybe using Chrome as the HTML processor would actually be faster :P
23:13 🔗 yipdw let wpull handle queue management, retry, etc
23:13 🔗 FalconK doubt it but who knows
23:14 🔗 yipdw I mean you might still be at high CPU%, but the CPU might be doing more
23:14 🔗 FalconK one thing that is good about html5lib/libxml2 is that it doesn't execute needless javascript
23:14 🔗 FalconK we may be able to disable doing that in headless chrome
23:15 🔗 yipdw it doesn't, but Javascript has been doing things to the DOM for quite a while
23:15 🔗 yipdw I don't know if it's needless
23:15 🔗 yipdw there was some other browser like this, I forgot what it was
23:16 🔗 yipdw it was webkit based
23:16 🔗 FalconK it's needful to grab, for sure
23:16 🔗 yipdw and it was meant to be used in a UNIX Philosophy way
23:16 🔗 yipdw which means it has an impossible name
23:17 🔗 yipdw AH
23:17 🔗 yipdw uzbl
23:18 🔗 yipdw maybe that's an option too in the "use a browser engine to give us what we need to do our thing" arena
23:18 🔗 yipdw or i dunno how good is servo these days :P
23:19 🔗 yipdw every time I try to run servo nightly it eats up all my cores but doesn't render anything
23:19 🔗 yipdw but that could be an environment issue
23:21 🔗 Ravenloft has joined #archiveteam-bs
23:22 🔗 JAA has quit IRC (Quit: Page closed)
23:24 🔗 FalconK ok, new profiling on !a https://www.npr.org/
23:24 🔗 FalconK in 10 or 20 I'll kill it and we can look
23:25 🔗 FalconK it seems to not be crashing without FrozenDict
23:25 🔗 FalconK ... I say, as it crashes
23:25 🔗 FalconK this fucking bug:
23:25 🔗 FalconK File "/home/archivebot/.local/lib/python3.5/site-packages/chardet/universaldetector.py", line 271, in close
23:25 🔗 FalconK for prober in self._charset_probers[0].probers:
23:25 🔗 FalconK IndexError: list index out of range
23:25 🔗 FalconK CRITICAL Sorry, Wpull unexpectedly crashed.
23:25 🔗 FalconK CRITICAL Please report this problem to the authors at Wpull's issue tracker so it may be fixed. If you know how to program, maybe help us fix it? Thank you for helping us help you help us all.
23:25 🔗 FalconK which is not new
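One possible stopgap for a crash like the one in the traceback above is to guard the detection call and fall back to a default encoding when the detector raises IndexError. This is only a sketch: detect_or_fallback and broken_detector are hypothetical names, the detector is passed in as a plain callable rather than tied to any chardet API, and it hides the underlying bug instead of fixing it.

```python
def detect_or_fallback(detect, data, fallback="utf-8"):
    """Call a charset detector; return `fallback` if it raises IndexError."""
    try:
        result = detect(data)
    except IndexError:
        # Same exception type as in the traceback above.
        return fallback
    # Also fall back when the detector returns nothing useful.
    return result or fallback


def broken_detector(data):
    # Hypothetical detector that fails the way the traceback does:
    # indexing into an empty list of probers.
    probers = []
    return probers[0]


print(detect_or_fallback(broken_detector, b"\xff\xfe"))
print(detect_or_fallback(lambda data: "latin-1", b"abc"))
```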
23:27 🔗 yipdw what the
23:27 🔗 yipdw oh
23:27 🔗 yipdw right
23:32 🔗 superkuh has quit IRC (Remote host closed the connection)
23:34 🔗 superkuh has joined #archiveteam-bs
23:49 🔗 FalconK yipdw: http://ananiels6.falconkirtaran.net:8000/02_post_rm_FrozenDict
23:55 🔗 FalconK it certainly didn't seem to break anything, and now that 2% is gone
23:55 🔗 FalconK it's spending a significant amount of time on epoll_wait, which is good since that means it's a little network-bound
23:57 🔗 BlueMaxim has joined #archiveteam-bs
23:59 🔗 FalconK 20 1.237 0.062 1.995 0.100 /home/archivebot/.local/lib/python3.5/site-packages/chardet/mbcharsetprober.py:61(feed)
23:59 🔗 FalconK that's 0.062 seconds per call. what is that even for?
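For reading rows like the one above: pstats prints ncalls, tottime, percall, cumtime, percall, then the location. tottime excludes time spent in sub-calls and cumtime includes it, so the 0.062 is tottime divided by ncalls and the 0.100 is cumtime per call. A small sketch that recomputes those figures from a pasted row (parse_pstats_row is a hypothetical helper, not part of pstats):

```python
def parse_pstats_row(row):
    # Columns: ncalls, tottime, percall, cumtime, percall, filename:line(func)
    ncalls, tottime, _, cumtime, _, location = row.split(None, 5)
    # Recursive functions show ncalls as "total/primitive".
    total, _, primitive = ncalls.partition("/")
    total = int(total)
    primitive = int(primitive) if primitive else total
    return {
        "ncalls": total,
        "tottime": float(tottime),
        "cumtime": float(cumtime),
        # pstats divides tottime by total calls and cumtime by primitive calls.
        "tottime_percall": float(tottime) / total,
        "cumtime_percall": float(cumtime) / primitive,
        "location": location,
    }


row = ("20 1.237 0.062 1.995 0.100 /home/archivebot/.local/lib/python3.5/"
       "site-packages/chardet/mbcharsetprober.py:61(feed)")
info = parse_pstats_row(row)
print("%d calls, %.4f s own time per call, %.4f s cumulative per call"
      % (info["ncalls"], info["tottime_percall"], info["cumtime_percall"]))
```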
