#archiveteam-bs 2017-05-04,Thu


***ndiddy has quit IRC () [02:30]
SpaffGarg has quit IRC (Read error: Operation timed out)
SpaffGarg has joined #archiveteam-bs
[02:41]
zeryl has joined #archiveteam-bs [02:54]
.... (idle for 18mn)
pizzaiolo has quit IRC (pizzaiolo) [03:12]
zeryl has quit IRC (Quit: Page closed)
Zeryl has joined #archiveteam-bs
[03:23]
.......... (idle for 48mn)
Zeryl: Let's try here:
WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[04:11]
***Sk1d has quit IRC (Ping timeout: 194 seconds) [04:15]
Sk1d has joined #archiveteam-bs [04:21]
signius has quit IRC (Quit: Leaving) [04:29]
........ (idle for 35mn)
Odd0002: is there anything that can be archived from an XMPP server? [05:04]
***fie has joined #archiveteam-bs [05:08]
..... (idle for 20mn)
Zeryl: possibly contact info, and MUC logs, depending how far back they allow reviewing, if at all [05:28]
.... (idle for 17mn)
godane: i'm uploading more Charlie Rose from 1992-01 [05:45]
................. (idle for 1h22mn)
***chazchaz has quit IRC (Read error: Operation timed out)
Kenshin has quit IRC (Read error: Operation timed out)
chazchaz has joined #archiveteam-bs
Kenshin has joined #archiveteam-bs
schbirid has joined #archiveteam-bs
[07:07]
SpaffGarg has quit IRC (Read error: Operation timed out)
SpaffGarg has joined #archiveteam-bs
[07:25]
......... (idle for 40mn)
mls has quit IRC (Ping timeout: 250 seconds) [08:08]
mls has joined #archiveteam-bs [08:15]
.... (idle for 17mn)
GE has joined #archiveteam-bs [08:32]
.... (idle for 16mn)
nyany has quit IRC (Ping timeout: 506 seconds)
antonizoo has quit IRC ()
antonizoo has joined #archiveteam-bs
[08:48]
Jonison has joined #archiveteam-bs [09:05]
GE has quit IRC (Remote host closed the connection) [09:13]
.................... (idle for 1h37mn)
godane has quit IRC (Ping timeout: 268 seconds) [10:50]
Honno has joined #archiveteam-bs
godane has joined #archiveteam-bs
[11:01]
godane has quit IRC (Quit: Leaving.) [11:09]
..... (idle for 20mn)
GE has joined #archiveteam-bs [11:29]
......... (idle for 41mn)
Ravenloft has quit IRC (Read error: Operation timed out) [12:10]
............ (idle for 56mn)
BlueMaxim has quit IRC (Quit: Leaving) [13:06]
.... (idle for 17mn)
GE has quit IRC (Remote host closed the connection) [13:23]
....... (idle for 31mn)
GE has joined #archiveteam-bs [13:54]
Jonison has quit IRC (ny.us.hub hub.se)
SpaffGarg has quit IRC (ny.us.hub hub.se)
Kenshin has quit IRC (ny.us.hub hub.se)
K4k has quit IRC (ny.us.hub hub.se)
SketchCow has quit IRC (ny.us.hub hub.se)
Kaz has quit IRC (ny.us.hub hub.se)
Ctrl-S___ has quit IRC (ny.us.hub hub.se)
alembic has quit IRC (ny.us.hub hub.se)
floogulin has quit IRC (ny.us.hub hub.se)
HCross2 has quit IRC (ny.us.hub hub.se)
deathy has quit IRC (ny.us.hub hub.se)
alfie has quit IRC (ny.us.hub hub.se)
BartoCH has quit IRC (ny.us.hub hub.se)
ThisAsYou has quit IRC (ny.us.hub hub.se)
tklk has quit IRC (ny.us.hub hub.se)
Sue_ has quit IRC (ny.us.hub hub.se)
Muad-Dib has quit IRC (ny.us.hub hub.se)
Sanqui has quit IRC (ny.us.hub hub.se)
Meroje has quit IRC (ny.us.hub hub.se)
raphidae has quit IRC (ny.us.hub hub.se)
Boppen has quit IRC (ny.us.hub hub.se)
mls has quit IRC (ny.us.hub hub.se)
Sk1d has quit IRC (ny.us.hub hub.se)
andai has quit IRC (ny.us.hub hub.se)
Aoede has quit IRC (ny.us.hub hub.se)
nightpool has quit IRC (ny.us.hub hub.se)
hook54321 has quit IRC (ny.us.hub hub.se)
VeganMars has quit IRC (ny.us.hub hub.se)
Riviera has quit IRC (ny.us.hub hub.se)
SN4T14 has quit IRC (ny.us.hub hub.se)
tuluu_ has quit IRC (ny.us.hub hub.se)
JensRex has quit IRC (ny.us.hub hub.se)
tammy_ has quit IRC (ny.us.hub hub.se)
i0npulse has quit IRC (ny.us.hub hub.se)
Hecatz has quit IRC (ny.us.hub hub.se)
Rai-chan has quit IRC (ny.us.hub hub.se)
medowar has quit IRC (ny.us.hub hub.se)
purplebot has quit IRC (ny.us.hub hub.se)
Madchen has quit IRC (ny.us.hub hub.se)
PurpleSym has quit IRC (ny.us.hub hub.se)
altlabel has quit IRC (ny.us.hub hub.se)
Zeryl has quit IRC (ny.us.hub hub.se)
Jon- has quit IRC (ny.us.hub hub.se)
Stilett0 has quit IRC (ny.us.hub hub.se)
dashcloud has quit IRC (ny.us.hub hub.se)
espes__ has quit IRC (ny.us.hub hub.se)
kvieta has quit IRC (ny.us.hub hub.se)
Darkstar has quit IRC (ny.us.hub hub.se)
Lord_Nigh has quit IRC (ny.us.hub hub.se)
brayden_ has quit IRC (ny.us.hub hub.se)
t2t2 has quit IRC (ny.us.hub hub.se)
RichardG has quit IRC (ny.us.hub hub.se)
kurt has quit IRC (ny.us.hub hub.se)
Odd0002 has quit IRC (ny.us.hub hub.se)
ploop has quit IRC (ny.us.hub hub.se)
DFJustin has quit IRC (ny.us.hub hub.se)
SilSte has quit IRC (ny.us.hub hub.se)
Fletcher has quit IRC (ny.us.hub hub.se)
antonizoo has quit IRC (ny.us.hub hub.se)
fie has quit IRC (ny.us.hub hub.se)
tsr has quit IRC (ny.us.hub hub.se)
yuitimoth has quit IRC (ny.us.hub hub.se)
luckcolor has quit IRC (ny.us.hub hub.se)
tephra has quit IRC (ny.us.hub hub.se)
antomatic has quit IRC (ny.us.hub hub.se)
SmileyG has quit IRC (ny.us.hub hub.se)
kevinr has quit IRC (ny.us.hub hub.se)
Frogging has quit IRC (ny.us.hub hub.se)
johnny4 has quit IRC (ny.us.hub hub.se)
bsmith093 has quit IRC (ny.us.hub hub.se)
kisspunch has quit IRC (ny.us.hub hub.se)
tapedrive has quit IRC (ny.us.hub hub.se)
wolfpld has quit IRC (ny.us.hub hub.se)
antonizoo has joined #archiveteam-bs
fie has joined #archiveteam-bs
tsr has joined #archiveteam-bs
yuitimoth has joined #archiveteam-bs
luckcolor has joined #archiveteam-bs
tephra has joined #archiveteam-bs
SmileyG has joined #archiveteam-bs
antomatic has joined #archiveteam-bs
kevinr has joined #archiveteam-bs
Frogging has joined #archiveteam-bs
irc.efnet.nl sets mode: +oooo luckcolor SmileyG antomatic Frogging
johnny4 has joined #archiveteam-bs
bsmith093 has joined #archiveteam-bs
kisspunch has joined #archiveteam-bs
tapedrive has joined #archiveteam-bs
wolfpld has joined #archiveteam-bs
irc.efnet.nl sets mode: +o bsmith093
swebb sets mode: +o antomatic
Frogging sets mode: +o yipdw
SmileyG has quit IRC (Write error: Broken pipe)
Smiley has joined #archiveteam-bs
[14:01]
Zeryl has joined #archiveteam-bs
Stilett0 has joined #archiveteam-bs
Riviera has joined #archiveteam-bs
dashcloud has joined #archiveteam-bs
SN4T14 has joined #archiveteam-bs
espes__ has joined #archiveteam-bs
tuluu_ has joined #archiveteam-bs
kvieta has joined #archiveteam-bs
Darkstar has joined #archiveteam-bs
JensRex has joined #archiveteam-bs
tammy_ has joined #archiveteam-bs
i0npulse has joined #archiveteam-bs
Hecatz has joined #archiveteam-bs
medowar has joined #archiveteam-bs
Rai-chan has joined #archiveteam-bs
purplebot has joined #archiveteam-bs
Lord_Nigh has joined #archiveteam-bs
ploop has joined #archiveteam-bs
brayden_ has joined #archiveteam-bs
t2t2 has joined #archiveteam-bs
kurt has joined #archiveteam-bs
Odd0002 has joined #archiveteam-bs
DFJustin has joined #archiveteam-bs
hub.dk sets mode: +oooo medowar Lord_Nigh brayden_ DFJustin
SilSte has joined #archiveteam-bs
Fletcher has joined #archiveteam-bs
Madchen has joined #archiveteam-bs
altlabel has joined #archiveteam-bs
PurpleSym has joined #archiveteam-bs
hub.dk sets mode: +oo Fletcher PurpleSym
swebb sets mode: +o brayden_
swebb sets mode: +o DFJustin
jmtd has joined #archiveteam-bs
[14:15]
Boppen has joined #archiveteam-bs [14:24]
Jonison has joined #archiveteam-bs
Kenshin has joined #archiveteam-bs
K4k has joined #archiveteam-bs
SketchCow has joined #archiveteam-bs
Kaz has joined #archiveteam-bs
Ctrl-S___ has joined #archiveteam-bs
alembic has joined #archiveteam-bs
floogulin has joined #archiveteam-bs
HCross2 has joined #archiveteam-bs
deathy has joined #archiveteam-bs
alfie has joined #archiveteam-bs
BartoCH has joined #archiveteam-bs
tklk has joined #archiveteam-bs
raphidae has joined #archiveteam-bs
ThisAsYou has joined #archiveteam-bs
Muad-Dib has joined #archiveteam-bs
Meroje has joined #archiveteam-bs
Sue_ has joined #archiveteam-bs
Sanqui has joined #archiveteam-bs
efnet.port80.se sets mode: +oooo SketchCow Kaz HCross2 Sanqui
swebb sets mode: +o SketchCow
Jonison has quit IRC (Read error: Connection reset by peer)
[14:32]
.... (idle for 18mn)
nyany has joined #archiveteam-bs [14:52]
........ (idle for 38mn)
Aranje has joined #archiveteam-bs [15:30]
SpaffGarg has joined #archiveteam-bs
RichardG_ has joined #archiveteam-bs
mls has joined #archiveteam-bs
Sk1d has joined #archiveteam-bs
andai has joined #archiveteam-bs
Aoede has joined #archiveteam-bs
nightpool has joined #archiveteam-bs
hook54321 has joined #archiveteam-bs
VeganMars has joined #archiveteam-bs
RichardG_ is now known as RichardG
[15:42]
.... (idle for 17mn)
pizzaiolo has joined #archiveteam-bs [16:01]
..... (idle for 23mn)
phuz has joined #archiveteam-bs
phuzion has quit IRC (Read error: Connection reset by peer)
[16:24]
antonizoo has quit IRC (Remote host closed the connection) [16:35]
ZexaronS has joined #archiveteam-bs [16:44]
antonizoo has joined #archiveteam-bs [16:50]
........ (idle for 35mn)
sun_rise has joined #archiveteam-bs [17:25]
sun_rise: If anyone is around, I'm interested in pointing archivebot at something in the other channel [17:27]
***GE has quit IRC (Remote host closed the connection) [17:28]
....... (idle for 32mn)
sun_rise: The job finished but I can't find it in the viewer (or anywhere else?). It says status completed. I'm a little confused. [18:00]
***pizzaiolo has quit IRC (Read error: Connection reset by peer)
pizzaiolo has joined #archiveteam-bs
[18:07]
joepie91: sun_rise: iirc jobs are uploaded/ingested about daily [18:15]
***SpaffGarg has quit IRC (Ping timeout: 250 seconds) [18:15]
SpaffGarg has joined #archiveteam-bs [18:21]
.......... (idle for 49mn)
GE has joined #archiveteam-bs [19:10]
pizzaiolo has quit IRC (Quit: pizzaiolo)
JAA has joined #archiveteam-bs
pizzaiolo has joined #archiveteam-bs
[19:23]
Aranje has quit IRC (Ping timeout: 245 seconds) [19:38]
...... (idle for 27mn)
ZexaronS- has joined #archiveteam-bs
sep332 has quit IRC (Read error: Operation timed out)
ZexaronS has quit IRC (Read error: Operation timed out)
[20:05]
.... (idle for 17mn)
sep332 has joined #archiveteam-bs [20:23]
.......... (idle for 45mn)
speculaas has joined #archiveteam-bs [21:08]
joepie91: speculaas: okay, so, it's *possible* to extract data from the existing archives, but it currently still requires some manual work
speculaas: specifically, you can download the indexes of all the Hyves items on archive.org, which contain a list of every URL that is contained in a given item along with its 'offset' (position in the WARC file)
[21:08]
speculaas: okay [21:09]
joepie91: speculaas: you can then use those positions to do an HTTP range request and retrieve just those bits of the WARC file, obtaining the pages [21:09]
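The manual extraction joepie91 describes can be sketched as follows. The 11-field CDX index layout, field positions, and filenames here are illustrative assumptions, not the actual format of the Hyves items:

```python
# Given one line from an item's index, compute the HTTP Range header that
# fetches just that record from the WARC file on archive.org.
def range_header_for(cdx_line):
    """Return (warc_filename, Range header value) for one index line."""
    fields = cdx_line.split()
    length = int(fields[8])    # compressed record length in bytes
    offset = int(fields[9])    # byte offset of the record in the WARC
    filename = fields[10]
    # HTTP byte ranges are inclusive on both ends, hence the -1.
    return filename, "bytes={}-{}".format(offset, offset + length - 1)

# Hypothetical index line, for illustration only:
line = ("nl,hyves)/someuser 20131202000000 http://www.hyves.nl/someuser "
        "text/html 200 SHA1DIGEST - - 5120 1048576 hyves-00000.warc.gz")
name, rng = range_header_for(line)
print(name, rng)  # hyves-00000.warc.gz bytes=1048576-1053695
```

The returned header value would go into a `Range:` request header against the item's WARC file URL.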
speculaas: Here are some archives https://archive.org/details/hyves?&sort=-downloads&page=2 [21:09]
joepie91: speculaas: there's - to my knowledge - not yet a nice one-stop way to extract an account
speculaas: if you just want to *look* at the account, it's faster to look it up in the wayback machine
all the Hyves archives should have been imported into that
[21:10]
speculaas: The URL for that is: www.hyves.nl/username ?
I already know the url but I see my account is not public
[21:17]
***schbirid has quit IRC (Quit: Leaving) [21:32]
joepie91: speculaas: ah yeah, we only got the public profiles... so if it was a private profile, I'm afraid it can't be recovered :/
speculaas: unless a friend kept around a copy...
[21:32]
speculaas: Okay, then I know enough. Thanks for your time ;) [21:35]
joepie91: speculaas: good luck in your search :) [21:36]
***speculaas has quit IRC (Ping timeout: 268 seconds)
sun_rise has quit IRC (ny.us.hub irc.efnet.nl)
fie has quit IRC (ny.us.hub irc.efnet.nl)
tsr has quit IRC (ny.us.hub irc.efnet.nl)
yuitimoth has quit IRC (ny.us.hub irc.efnet.nl)
luckcolor has quit IRC (ny.us.hub irc.efnet.nl)
tephra has quit IRC (ny.us.hub irc.efnet.nl)
antomatic has quit IRC (ny.us.hub irc.efnet.nl)
kevinr has quit IRC (ny.us.hub irc.efnet.nl)
Frogging has quit IRC (ny.us.hub irc.efnet.nl)
johnny4 has quit IRC (ny.us.hub irc.efnet.nl)
bsmith093 has quit IRC (ny.us.hub irc.efnet.nl)
kisspunch has quit IRC (ny.us.hub irc.efnet.nl)
tapedrive has quit IRC (ny.us.hub irc.efnet.nl)
wolfpld has quit IRC (ny.us.hub irc.efnet.nl)
sun_rise has joined #archiveteam-bs
fie has joined #archiveteam-bs
tsr has joined #archiveteam-bs
yuitimoth has joined #archiveteam-bs
luckcolor has joined #archiveteam-bs
tephra has joined #archiveteam-bs
antomatic has joined #archiveteam-bs
kevinr has joined #archiveteam-bs
Frogging has joined #archiveteam-bs
johnny4 has joined #archiveteam-bs
bsmith093 has joined #archiveteam-bs
irc.efnet.nl sets mode: +oooo luckcolor antomatic Frogging bsmith093
kisspunch has joined #archiveteam-bs
tapedrive has joined #archiveteam-bs
wolfpld has joined #archiveteam-bs
swebb sets mode: +o antomatic
Frogging sets mode: +o yipdw
[21:40]
..... (idle for 21mn)
Sanqui: is it using lxml? [22:05]
***FalconK has joined #archiveteam-bs [22:06]
FalconK: hah!
yo Sanqui
[22:06]
Sanqui: oh hey [22:06]
FalconK: so a bunch of the archivebot pipelines are dual-core atoms clocking at 2.4GHz in virtualized environments [22:07]
yipdw: Sanqui: it depends on the configuration. libxml on some pipelines, html5lib on others
we started with libxml but it kept crashing for some reason
html5lib gets more stuff and seems more stable but is more expensive re CPU
[22:07]
FalconK: I think all of them are html5lib now? [22:07]
Sanqui: for reference, i wrote <@Sanqui> is it using lxml? [22:07]
yipdw: probably, but I can't be sure of that since people can change the pip manifest [22:07]
FalconK: ty [22:08]
Sanqui: yeah html5lib is gonna be cpu expensive [22:08]
yipdw: anyway, I don't think there's a way around parsing the documents to get links and stuff [22:08]
FalconK: ha, we have a manageability issue too, writ large [22:08]
Sanqui: ideally we'd use libxml and allow changing it with a parameter
if a certain website had issues
[22:08]
FalconK: a suggestion from some local crew I know was to forego the XML parsing entirely and just use a best-effort regex, and accept that it will find some bullshit [22:08]
yipdw: recent Chrome release has an official headless mode and that seems interesting [22:08]
FalconK: the biggest reason to not use a regex is that it will fall down making relative URLs out of any / in anything
or else miss tons of stuff
[22:09]
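A minimal version of that best-effort regex approach, to make the relative-URL failure mode concrete. The regex and helper are illustrative, not archivebot code:

```python
import re
from urllib.parse import urljoin

# Only quoted href= and src= attributes are matched; real pages also have
# unquoted attributes, <base> tags, srcset, CSS url(...) and more -- which
# is exactly the "find some bullshit / miss tons of stuff" tradeoff.
LINK_RE = re.compile(r'''(?:href|src)\s*=\s*["']([^"'<>]+)["']''', re.I)

def extract_links(page_url, html):
    # Every hit must be resolved against the page URL; this is where a
    # sloppier regex turns stray slashes into bogus relative URLs.
    return [urljoin(page_url, m) for m in LINK_RE.findall(html)]

html = '<a href="/a">x</a> <img src="img/b.png"> <a href="//cdn.example.org/c">'
print(extract_links("http://example.com/dir/page.html", html))
# ['http://example.com/a', 'http://example.com/dir/img/b.png',
#  'http://cdn.example.org/c']
```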
yipdw: yeah, that's a lot of webpages :P [22:09]
FalconK: so I rejected that solution
fucking relative URLs
[22:09]
xmc: hm [22:10]
yipdw: I still think using a browser is probably the way to go [22:10]
xmc: so ... yeah [22:10]
Sanqui: you could make a more... "outzoomed" regex looking for href= [22:10]
xmc: ugh [22:10]
yipdw: more and more websites are using client rendering [22:10]
xmc: computers suck [22:10]
FalconK: one could do that, yes
for client rendering we have to use phantomjs anyway
[22:10]
Sanqui: phantomjs is dead [22:10]
xmc: or Headless Chrome Because Yes [22:10]
yipdw: and if you're looking for an optimized way to parse documents, you might as well look at a Web browser [22:10]
Sanqui: tbh it'd be very nice if we could just spin up chromes [22:10]
FalconK: that will cause us to need more CPU, not less [22:11]
xmc: ^ [22:11]
yipdw: maybe [22:11]
Sanqui: in place of phantomjs anyway [22:12]
FalconK: it would be a lot nicer to use headless chrome than phantomjs for the things we do need client-side rendering for
but we still need client-side rendering for a small minority of sites
the only major use of it I've noticed, actually, is twitter.
[22:12]
yipdw: I think that may be because that's the only place it seems to reliably work
"reliably"
[22:12]
FalconK: phantomjs is also crashy af
that may also be the case
[22:12]
yipdw: but there's also a lot of blog sites that use client-side rendering and have no fallback [22:13]
FalconK: I'm not at all opposed to using headless chrome in place of phantomjs and seeing how it performs [22:13]
Sanqui: honestly, for archiving websites like twitter, youtube, facebook etc., the bot should have specific modes that are curated [22:13]
yipdw: usually it's software developers, because software developers are idiots [22:13]
FalconK: Sanqui: yes, that's also on the long todo list [22:13]
yipdw: can confirm, I write software [22:13]
FalconK: we want a !twitter at least, and possibly a !youtube and !reddit
!facebook would require a lot of work
separately, there's this CPU usage issue :P
[22:13]
yipdw: did anyone manage to get a useful CPU profile? I tried once but I just got a bunch of "your program is spending most of its time in Python's evaluator"
which is like saying "your program is spending most of its time running"
[22:15]
FalconK: there's *another* issue, which is that wpull.db grows to tens of GB when crawling large sites, but I'm willing to live with that for the moment since the high CPU usage is actually the pain point right now [22:15]
Sanqui: anyway, to drive the point home: phantomjs is over, the lead developer has stepped out in anticipation of headless chrome https://groups.google.com/forum/#!topic/phantomjs/9aI5d-LDuNE [22:15]
FalconK: yipdw: I did! [22:15]
yipdw: oh [22:15]
Sanqui: so we need to do something eventually [22:15]
yipdw: do you still have the profile data? [22:15]
FalconK: ananiel-s6 is currently dedicated to profiling [22:15]
yipdw: ah good [22:16]
FalconK: let me see if I do still have it; if not, I can get it again later [22:16]
yipdw: yeah, I'd like to see that. I got as far as perf and then I got annoyed and had to switch gears [22:16]
FalconK: but I recall html5lib stuff featured very heavily
perf is fucking awful to deal with
I hate optimizing
[22:16]
Sanqui: html5lib is parsing in python [22:16]
yipdw: I keep seeing good testimonials for Telemetry [22:16]
Sanqui: we really want lxml [22:16]
yipdw: we've tried libxml before
it kept blowing up
[22:16]
Sanqui: then we should figure out why and report it upstream [22:17]
FalconK: let's see - how does one read cprofile things again
actually yipdw do you just want the cprofile?
[22:17]
Sanqui: (sorry for the 'we', i'm not trying to sound smart here) [22:17]
FalconK: I'll put it up somewhere [22:17]
yipdw: Sanqui: I mean, yes, but in the meantime it was easier to just switch to html5lib and deliver something working [22:17]
Sanqui: (i fully recognize i have done zero archivebot development) [22:17]
FalconK: we have a very bad test process right now for archivebot [22:18]
Sanqui: yes, I noticed tests are failing [22:18]
FalconK: which is: make it do real work, then wait until it falls over, and see if you got enough information to figure out the failure case [22:18]
***REiN^ has quit IRC (Read error: Operation timed out) [22:18]
yipdw: chfoo wrote a smoke test harness, but there's a lot of moving parts and I haven't looked at what it takes to put them back together in the Travis environment [22:19]
FalconK: on my end, this is mostly because I have $infinity things to do that aren't archivebot, so... :P
no offense to chfoo but his code has a LOT of moving parts
[22:19]
yipdw: i mean really I don't think "test it in production" is a bad idea here [22:19]
FalconK: it's not; it just takes forever [22:19]
yipdw: if you have good telemetry, it's awesome [22:19]
JAA: I'm not familiar with what libxml and html5lib really are internally, but probably the best option would be to use the XML parser library from a browser (i.e. Chromium or Firefox), right? [22:19]
FalconK: html5lib is basically that [22:20]
yipdw: I don't know of anyone who has extracted those for consumption in something else [22:20]
FalconK: it's intended afaik to be a W3C compliant HTML parser, not unlike, say, SAX for XML [22:20]
yipdw: I guess that's a good point too
we really can't use "an XML library", to be pedantic
[22:20]
JAA: Yeah. Problem is, many websites aren't W3C compliant. [22:20]
yipdw: HTML isn't XML and archivebot has to be able to deal with that [22:21]
FalconK: maybe it's html5lib that needs perf [22:21]
JAA: We still want to be able to handle those. [22:21]
FalconK: we don't currently have a significant problem with that [22:21]
Sanqui: JAA: it's inside out here [22:21]
yipdw: indeed, archivebot tends to get pointed at a lot of small, old sites [22:21]
Sanqui: W3C defines how to deal with websites that aren't W3C compliant
and browsers follow that
[22:21]
FalconK: other than operator error I haven't had many complaints of archivebot missing things [22:21]
yipdw: I don't have notes, but I think that's another reason why the html5lib switch happened [22:21]
JAA: I see [22:21]
FalconK: if it's noticed, I'd love to hear about it [22:21]
yipdw: it just got better results
no point in performing faster if you miss page requisites etc
[22:22]
***REiN^ has joined #archiveteam-bs [22:22]
Sanqui: could always try pypy :) [22:22]
JAA: If html5lib works so well, how about rewriting it as a C extension? /s [22:22]
***ZexaronS- has quit IRC (Leaving) [22:23]
yipdw: debugging the intersection of Python and C is prohibited by the Geneva Conventions
I mean you can inflict it on yourself but
[22:23]
***GE has quit IRC (Remote host closed the connection) [22:24]
yipdw: tangentially related, I'm working on a project and part of it is an app that calls into a Go library
from C
[22:24]
FalconK: http://ananiels6.falconkirtaran.net/cprof.dat [22:25]
yipdw: the app is trivially stack-smashable if you send a URL that's longer than 2048 bytes [22:25]
FalconK: that link work? [22:25]
yipdw: I thought that was really funny
because it's like "Go will save me"
yeah no
Connecting to ananiels6.falconkirtaran.net (ananiels6.falconkirtaran.net)|51.15.47.106|:80... failed: Connection refused.
brb
[22:25]
JAA: I've done it before, and it's actually not too bad as long as you can keep all the real work in C and just have a thin transition layer converting the stuff from/to Python variables.
But for obvious reasons, I wouldn't want to implement an XML parser, ever. Most certainly not in C.
[22:27]
Sanqui: this is not work we should be doing [22:29]
FalconK: +1
ffs
http://ananiels6.falconkirtaran.net:8000/cprof.dat
strings are awful
[22:31]
Sanqui: hooray for SimpleHTTPServer [22:32]
JAA: Indeed. wpull really needs fixing. Version 2.0.1 has so many bugs that it's not even funny; e.g. concurrency is broken entirely and aborting doesn't work. And version 1.2.3 throws up when used with the current html5lib version, since the API changed and the requirements.txt doesn't force the specific, compatible version. [22:32]
FalconK: yipdw and I did the work to transition archivebot to wpull2 like 6 months back
I suggest that we roll with chfoo's changes and deprecate 1.x
but we will need to fix concurrency for sure
aborting is working fine for archivebot, by the way
er... as fine as it ever has worked
[22:32]
yipdw: ok [22:34]
JAA: Interesting. I always had to hard-abort (twice ^C) it when I tried. After a few attempts, I went to 1.2.3 [22:35]
FalconK: anyway, the thing that really jumps out at me in the 650 second profile there:
926 23.751 0.026 324.046 0.350 /home/archivebot/.local/lib/python3.5/site-packages/wpull/scraper/html.py:127(_process_elements)
it's spending literally 50% of its time in html._process_elements
[22:36]
yipdw: well, in some way that's kinda cool
it means all of our add-on stuff isn't the slow bit
[22:38]
FalconK: yeah...
by comparison, by the way, it spends about 7.5% of its time working with sqlite
[22:38]
***Stilett0 has quit IRC (Read error: Operation timed out) [22:39]
yipdw: which to me is counterintuitive. I thought running hundreds of regular expressions on each document would be a problem
turns out, it isn't the dominating factor
profiles are awesome
[22:39]
FalconK: yeah
our regexp running is efficient, I think, right? it compiles them into one state machine?
[22:39]
yipdw: no [22:39]
JAA: Hmm, doesn't the HTML parsing happen outside of _process_elements? [22:39]
FalconK: no idea [22:39]
yipdw: but we do compile the regexes / make use of the Python regexp cache [22:40]
* FalconK nods [22:40]
yipdw: so it's probably fast enough [22:40]
FalconK: the regex thing doesn't even seem to appear in the profiling
er
not anywhere near the top
[22:40]
yipdw: neat
it's good to know also that sqlite is fast
[22:40]
FalconK: 78084 0.907 0.000 28.838 0.000 /home/archivebot/.local/lib/python3.5/site-packages/wpull/application/hook.py:132(notify)
5% of time in hooks of any kind
[22:41]
yipdw: i had a suspicion it was more than sufficient for this but it's cool to see that it's at the bottom [22:41]
* FalconK nods [22:41]
yipdw: so, hmm
what is process_elements doing
[22:41]
FalconK: I don't even remember what this job was (probably !a http://cnn.com/ or something)
but yes, one wonders
[22:41]
JAA: It seems that the parsing happens in wpull.document.html.HTMLReader.iter_elements. [22:42]
yipdw: FalconK: can you put that profile data back up?
I get a connection refused talking to that site
that or if you can drill down into process_elements that'd be fab
[22:42]
FalconK: oh sure [22:43]
yipdw: it's a pretty big method [22:43]
FalconK: sorry, it was python http.server and I took it down to read the data [22:43]
yipdw: ah ok [22:43]
JAA: Yeah, line profiling for _process_elements would be helpful. [22:43]
FalconK: up again
go for it
[22:43]
yipdw: hmm
Connecting to ananiels6.falconkirtaran.net (ananiels6.falconkirtaran.net)|51.15.47.106|:80... failed: Connection refused.
[22:44]
FalconK: :8000 [22:44]
yipdw: oh feck
there we go
done
[22:44]
FalconK: :)
I'll leave it up for a bit while I read _process_elements
[22:44]
yipdw: python -m cProfile -s cumtime will never not be funny to me
also hi yes I am 12
[22:45]
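For reference, cumulative-time numbers like the ones being quoted in this discussion can be read offline from a cProfile dump with the stdlib `pstats` module. The profiled statement below is a self-contained stand-in, not an archivebot job:

```python
import cProfile
import io
import pstats

# Profile a small stand-in workload and write the dump to a file, the way
# a dump like cprof.dat would be produced on the pipeline box.
cProfile.run("sorted(range(100000), key=str)", "demo.prof")

# Load the dump and print the top entries by cumulative time, which is
# what -s cumtime sorts by on the command line.
out = io.StringIO()
stats = pstats.Stats("demo.prof", stream=out)
stats.sort_stats("cumulative").print_stats(5)  # top 5 entries by cumtime
print(out.getvalue())
```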
FalconK: oddly, clean_link_soup is negligible
210118 1.320 0.000 3.423 0.000 /home/archivebot/.local/lib/python3.5/site-packages/wpull/scraper/util.py:38(clean_link_soup)
[22:46]
yipdw: it'd be funny if it ended up being urljoin_safe or something
50% of overall time spent in string concat and reallocation
[22:46]
FalconK: 164679 1.413 0.000 30.091 0.000 /home/archivebot/.local/lib/python3.5/site-packages/wpull/url.py:684(urljoin) [22:47]
yipdw: wat
are you fucking kidding me
[22:47]
FalconK: :P [22:47]
JAA: Sidenote: I think there's a bug in _process_elements: "if self._only_relative:" followed by "if link_info.base_link or '://' in link_info.link:" probably doesn't catch protocol-relative links, i.e. 'href="//example.com/"'. [22:47]
yipdw: JAA: hmm [22:47]
FalconK: what's that? [22:47]
yipdw: I don't recall scheme-relative links being a problem, but we can try that out [22:48]
FalconK: oh huh, https://www.paulirish.com/2010/the-protocol-relative-url/... TIL [22:48]
xmc: they're vaguely useful [22:48]
FalconK: JAA: I think you're right; the best way to address it would be a PR
this is kind of a big deal too:
1077 0.230 0.000 51.563 0.048 /home/archivebot/.local/lib/python3.5/site-packages/wpull/database/wrap.py:41(add_many)
[22:50]
yipdw: FalconK: wait, are you sure this is html5lib?
I see parse_lxml in the output
[22:52]
FalconK: I went through the same inquiry
I don't remember the conclusion I came to
[22:52]
JAA: FalconK: I guess. Then again, other PRs have been sitting there for months, so motivation is limited. Also, I have no idea how to fix it properly without breaking other stuff. Paths in URLs can contain several consecutive slashes IIRC; that is, href="some//path" is equivalent to href="some/path". [22:52]
FalconK: either both are in use, or else someone put html5lib in but left all the functions named like libxml.
JAA: right, it looks like only r'^//' is protocol-relative
no comment on PRs except that archivebot specifies github.com/falconkirtaran/wpull in requirements.txt
because before my omnibus PR was accepted, wpull2 was too crashy to use
[22:52]
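The scheme-relative case JAA flagged can be checked standalone. This is a sketch of the corrected predicate per FalconK's note that only a leading `//` is protocol-relative, not the actual wpull code:

```python
import re

def is_relative(link):
    """True only for links that stay on the current host."""
    if '://' in link:
        return False   # absolute URL with an explicit scheme
    if re.match(r'^//', link):
        return False   # scheme-relative: resolves to another host
    return True

assert not is_relative('http://example.com/')
assert not is_relative('//example.com/')   # the case the old check misses
assert is_relative('some//path')           # '//' mid-path is still relative
assert is_relative('/a/b')
```

The old check, `'://' in link`, treats `//example.com/` as relative because it contains no scheme, which is exactly the bug described.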
yipdw: well, maybe parse_lxml is the wrong place to look anyway. profile indicates that most of the time in there is spent in the "start" method, but that method just invokes callbacks [22:55]
FalconK: heya yipdw, I think that add_many prof item might contain the plugins? [22:55]
yipdw: and the callbacks aren't showing up in the profile, AFAICT
FalconK: not sure
oh wait, the callbacks are in the called: section
[22:55]
FalconK: oh, still not a problem
2159 0.036 0.000 6.803 0.003 archive_bot_plugin.py:214(accept_url)
[22:57]
yipdw: huh. highest total time in start is /home/archivebot/.local/lib/python3.5/site-packages/wpull/collections.py:244(__init__)
does this just spend most of its time managing lists?
[22:58]
FalconK: what does *that* abstraction do
it might, though
wpull -r keeps a lot of lists
[22:58]
***Stilett0 has joined #archiveteam-bs [22:59]
yipdw: line 244 of collections.py is the initializer for FrozenDict
which does e.g.
def __init__(self, orig_dict):
    self.orig_dict = orig_dict
    self.hash_cache = hash(tuple(sorted(self.orig_dict.items())))
over 1.68 million calls to that; that seems like it might be a thing
[22:59]
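A sketch of why that initializer shows up in the profile: every construction sorts the items and builds a temporary tuple just to precompute a hash, an O(n log n) pass plus extra allocations per HTML tag. The `FrozenDict` here is as quoted above; the timing harness and attribute dict are illustrative:

```python
import timeit

class FrozenDict:
    def __init__(self, orig_dict):
        self.orig_dict = orig_dict
        # Sort + tuple + hash on every construction.
        self.hash_cache = hash(tuple(sorted(self.orig_dict.items())))

# Hypothetical tag attributes, a stand-in for lxml's 'attrib'.
attrs = {'href': '/x', 'class': 'nav', 'id': 'top', 'rel': 'nofollow'}

# Compare against a plain dict copy, the substitution yipdw suggests trying.
frozen_t = timeit.timeit(lambda: FrozenDict(attrs), number=100000)
plain_t = timeit.timeit(lambda: dict(attrs), number=100000)
print("FrozenDict: %.3fs  plain dict: %.3fs" % (frozen_t, plain_t))
```

Multiplied by the 1.68 million calls above, even a small per-call difference adds up.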
FalconK: wait
hash(...sorted(?
[22:59]
yipdw: yeah [23:00]
FalconK: why [23:00]
yipdw: I don't know
do Python hashes guarantee any sort of iteration order?
I know Ruby does
[23:00]
FalconK: I suppose that would depend
what stability properties does it require?
[23:00]
yipdw: not sure [23:01]
FalconK: python is not my primary language
(that'd be C++, followed by x86 ASM)
[23:01]
yipdw: FrozenDict is used in lxml.HTMLParserTarget.start [23:02]
FalconK: well! [23:02]
yipdw: I'm not really sure if it's needed, though
hard to tell
it's also not immediately clear to me what it's wrapping -- it's 'attrib'
(tag attributes?)
[23:02]
FalconK: murdering it entirely would speed us up by 2%
AKA 4-6 page grabs per hundred seconds
[23:03]
yipdw: or more, depending on what effect that would have with fewer allocations [23:04]
FalconK: oh, true
the allocator is still a black box to us
[23:04]
yipdw: I was just poking at it because it showed up pretty high in the profiles [23:04]
FalconK: though I feel like it's probably spending a lot more time sorting than allocating
I don't think __init__ captures time python spends allocating
and actually the python heap processing was insignificant anyway, wasn't it?
[23:04]
yipdw: it might not, but FrozenDict is making more objects in its initializer
i.e. the new hash and the temporary tuple
[23:05]
FalconK: mm [23:05]
yipdw: I don't know how expensive that is on the allocator (it might be trivial)
anyway, I guess one thing to try would be to replace FrozenDict() with, like, dict()
[23:05]
FalconK: I don't think allocator time is captured with the jit time
but yeah, we could try that on ananiel-S6
[23:06]
yipdw: you lose the immutability guarantee but it'd be one way to see if FrozenDict() introduces a large penalty
or, in the specific case of start(), just don't wrap attrib in a FrozenDict()
I doubt it will have a perceptible macro difference but it would be neat to see how it changes the profile
[23:06]
FalconK: now I'm confused about this: [23:08]
yipdw: speaking of C++, one thing that C++ has made me really paranoid about (probably overly paranoid) is allocations [23:08]
FalconK: there's both lxml_.py and htmllib5_.py
why
[23:08]
yipdw: like every time I've had a performance problem, it wasn't algorithmic. it was because I was fucking mallocing too much
or treating cache lines like slacklines
that sort of thing
FalconK: huh
dunno
maybe this is using libxml after all?
[23:08]
FalconK: I wonder if it's using libxml for XHTML documents and html5lib for others?
I remember there was some complex dispatch logic
it's just so ungodly complex
[23:11]
yipdw: maybe using Chrome as the HTML processor would actually be faster :P
let wpull handle queue management, retry, etc
[23:13]
FalconK: doubt it but who knows [23:13]
yipdw: I mean you might still be at high CPU%, but the CPU might be doing more [23:14]
FalconK: one thing that is good about html5lib/libxml2 is that it doesn't execute needless javascript
we may be able to disable doing that in headless chrome
[23:14]
yipdw: it doesn't, but Javascript has been doing things to the DOM for quite a while
I don't know if it's needless
there was some other browser like this, I forgot what it was
it was webkit based
[23:15]
FalconK: it's needful to grab, for sure [23:16]
yipdw: and it was meant to be used in a UNIX Philosophy way
which means it has an impossible name
AH
uzbl
maybe that's an option too in the "use a browser engine to give us what we need to do our thing" arena
or i dunno, how good is servo these days :P
every time I try to run servo nightly it eats up all my cores but doesn't render anything
but that could be an environment issue
[23:16]
***Ravenloft has joined #archiveteam-bs
JAA has quit IRC (Quit: Page closed)
[23:21]
FalconK: ok, new profiling on !a https://www.npr.org/
in 10 or 20 I'll kill it and we can look
it seems to not be crashing without FrozenDict
... I say, as it crashes
this fucking bug:
File "/home/archivebot/.local/lib/python3.5/site-packages/chardet/universaldetector.py", line 271, in close
    for prober in self._charset_probers[0].probers:
IndexError: list index out of range
CRITICAL Sorry, Wpull unexpectedly crashed.
CRITICAL Please report this problem to the authors at Wpull's issue tracker so it may be fixed. If you know how to program, maybe help us fix it? Thank you for helping us help you help us all.
which is not new
[23:24]
yipdw: what the
oh
right
[23:27]
***superkuh has quit IRC (Remote host closed the connection)
superkuh has joined #archiveteam-bs
[23:32]
.... (idle for 15mn)
FalconK: yipdw: http://ananiels6.falconkirtaran.net:8000/02_post_rm_FrozenDict [23:49]
it certainly didn't seem to break anything, and now that 2% is gone
it's spending a significant amount of time on epoll_wait, which is good since that means it's a little network-bound
[23:55]
***BlueMaxim has joined #archiveteam-bs [23:57]
FalconK: 20 1.237 0.062 1.995 0.100 /home/archivebot/.local/lib/python3.5/site-packages/chardet/mbcharsetprober.py:61(feed)
that's 0.062 seconds per call. what is that even for?
[23:59]
