#archiveteam-bs 2017-05-04,Thu


***ndiddy has quit IRC () [02:30]
SpaffGarg has quit IRC (Read error: Operation timed out)
SpaffGarg has joined #archiveteam-bs
[02:41]
zeryl has joined #archiveteam-bs [02:54]
.... (idle for 18mn)
pizzaiolo has quit IRC (pizzaiolo) [03:12]
zeryl has quit IRC (Quit: Page closed)
Zeryl has joined #archiveteam-bs
[03:23]
.......... (idle for 48mn)
Zeryl: Let's try here:
WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[04:11]
***Sk1d has quit IRC (Ping timeout: 194 seconds) [04:15]
Sk1d has joined #archiveteam-bs [04:21]
signius has quit IRC (Quit: Leaving) [04:29]
........ (idle for 35mn)
Odd0002: is there anything that can be archived from an XMPP server? [05:04]
***fie has joined #archiveteam-bs [05:08]
..... (idle for 20mn)
Zeryl: possibly contact info, and MUC logs, depending how far back they allow reviewing, if at all [05:28]
.... (idle for 17mn)
godane: i'm uploading more Charlie Rose from 1992-01 [05:45]
................. (idle for 1h22mn)
***chazchaz has quit IRC (Read error: Operation timed out)
Kenshin has quit IRC (Read error: Operation timed out)
chazchaz has joined #archiveteam-bs
Kenshin has joined #archiveteam-bs
schbirid has joined #archiveteam-bs
[07:07]
SpaffGarg has quit IRC (Read error: Operation timed out)
SpaffGarg has joined #archiveteam-bs
[07:25]
......... (idle for 40mn)
mls has quit IRC (Ping timeout: 250 seconds) [08:08]
mls has joined #archiveteam-bs [08:15]
.... (idle for 17mn)
GE has joined #archiveteam-bs [08:32]
.... (idle for 16mn)
nyany has quit IRC (Ping timeout: 506 seconds)
antonizoo has quit IRC ()
antonizoo has joined #archiveteam-bs
[08:48]
Jonison has joined #archiveteam-bs [09:05]
GE has quit IRC (Remote host closed the connection) [09:13]
.................... (idle for 1h37mn)
godane has quit IRC (Ping timeout: 268 seconds) [10:50]
Honno has joined #archiveteam-bs
godane has joined #archiveteam-bs
[11:01]
godane has quit IRC (Quit: Leaving.) [11:09]
..... (idle for 20mn)
GE has joined #archiveteam-bs [11:29]
......... (idle for 41mn)
Ravenloft has quit IRC (Read error: Operation timed out) [12:10]
............ (idle for 56mn)
BlueMaxim has quit IRC (Quit: Leaving) [13:06]
.... (idle for 17mn)
GE has quit IRC (Remote host closed the connection) [13:23]
....... (idle for 31mn)
GE has joined #archiveteam-bs [13:54]
Jonison has quit IRC (ny.us.hub hub.se)
SpaffGarg has quit IRC (ny.us.hub hub.se)
Kenshin has quit IRC (ny.us.hub hub.se)
K4k has quit IRC (ny.us.hub hub.se)
SketchCow has quit IRC (ny.us.hub hub.se)
Kaz has quit IRC (ny.us.hub hub.se)
Ctrl-S___ has quit IRC (ny.us.hub hub.se)
alembic has quit IRC (ny.us.hub hub.se)
floogulin has quit IRC (ny.us.hub hub.se)
HCross2 has quit IRC (ny.us.hub hub.se)
deathy has quit IRC (ny.us.hub hub.se)
alfie has quit IRC (ny.us.hub hub.se)
BartoCH has quit IRC (ny.us.hub hub.se)
ThisAsYou has quit IRC (ny.us.hub hub.se)
tklk has quit IRC (ny.us.hub hub.se)
Sue_ has quit IRC (ny.us.hub hub.se)
Muad-Dib has quit IRC (ny.us.hub hub.se)
Sanqui has quit IRC (ny.us.hub hub.se)
Meroje has quit IRC (ny.us.hub hub.se)
raphidae has quit IRC (ny.us.hub hub.se)
Boppen has quit IRC (ny.us.hub hub.se)
mls has quit IRC (ny.us.hub hub.se)
Sk1d has quit IRC (ny.us.hub hub.se)
andai has quit IRC (ny.us.hub hub.se)
Aoede has quit IRC (ny.us.hub hub.se)
nightpool has quit IRC (ny.us.hub hub.se)
hook54321 has quit IRC (ny.us.hub hub.se)
VeganMars has quit IRC (ny.us.hub hub.se)
Riviera has quit IRC (ny.us.hub hub.se)
SN4T14 has quit IRC (ny.us.hub hub.se)
tuluu_ has quit IRC (ny.us.hub hub.se)
JensRex has quit IRC (ny.us.hub hub.se)
tammy_ has quit IRC (ny.us.hub hub.se)
i0npulse has quit IRC (ny.us.hub hub.se)
Hecatz has quit IRC (ny.us.hub hub.se)
Rai-chan has quit IRC (ny.us.hub hub.se)
medowar has quit IRC (ny.us.hub hub.se)
purplebot has quit IRC (ny.us.hub hub.se)
Madchen has quit IRC (ny.us.hub hub.se)
PurpleSym has quit IRC (ny.us.hub hub.se)
altlabel has quit IRC (ny.us.hub hub.se)
Zeryl has quit IRC (ny.us.hub hub.se)
Jon- has quit IRC (ny.us.hub hub.se)
Stilett0 has quit IRC (ny.us.hub hub.se)
dashcloud has quit IRC (ny.us.hub hub.se)
espes__ has quit IRC (ny.us.hub hub.se)
kvieta has quit IRC (ny.us.hub hub.se)
Darkstar has quit IRC (ny.us.hub hub.se)
Lord_Nigh has quit IRC (ny.us.hub hub.se)
brayden_ has quit IRC (ny.us.hub hub.se)
t2t2 has quit IRC (ny.us.hub hub.se)
RichardG has quit IRC (ny.us.hub hub.se)
kurt has quit IRC (ny.us.hub hub.se)
Odd0002 has quit IRC (ny.us.hub hub.se)
ploop has quit IRC (ny.us.hub hub.se)
DFJustin has quit IRC (ny.us.hub hub.se)
SilSte has quit IRC (ny.us.hub hub.se)
Fletcher has quit IRC (ny.us.hub hub.se)
antonizoo has quit IRC (ny.us.hub hub.se)
fie has quit IRC (ny.us.hub hub.se)
tsr has quit IRC (ny.us.hub hub.se)
yuitimoth has quit IRC (ny.us.hub hub.se)
luckcolor has quit IRC (ny.us.hub hub.se)
tephra has quit IRC (ny.us.hub hub.se)
antomatic has quit IRC (ny.us.hub hub.se)
SmileyG has quit IRC (ny.us.hub hub.se)
kevinr has quit IRC (ny.us.hub hub.se)
Frogging has quit IRC (ny.us.hub hub.se)
johnny4 has quit IRC (ny.us.hub hub.se)
bsmith093 has quit IRC (ny.us.hub hub.se)
kisspunch has quit IRC (ny.us.hub hub.se)
tapedrive has quit IRC (ny.us.hub hub.se)
wolfpld has quit IRC (ny.us.hub hub.se)
antonizoo has joined #archiveteam-bs
fie has joined #archiveteam-bs
tsr has joined #archiveteam-bs
yuitimoth has joined #archiveteam-bs
luckcolor has joined #archiveteam-bs
tephra has joined #archiveteam-bs
SmileyG has joined #archiveteam-bs
antomatic has joined #archiveteam-bs
kevinr has joined #archiveteam-bs
Frogging has joined #archiveteam-bs
irc.efnet.nl sets mode: +oooo luckcolor SmileyG antomatic Frogging
johnny4 has joined #archiveteam-bs
bsmith093 has joined #archiveteam-bs
kisspunch has joined #archiveteam-bs
tapedrive has joined #archiveteam-bs
wolfpld has joined #archiveteam-bs
irc.efnet.nl sets mode: +o bsmith093
swebb sets mode: +o antomatic
Frogging sets mode: +o yipdw
SmileyG has quit IRC (Write error: Broken pipe)
Smiley has joined #archiveteam-bs
[14:01]
Zeryl has joined #archiveteam-bs
Stilett0 has joined #archiveteam-bs
Riviera has joined #archiveteam-bs
dashcloud has joined #archiveteam-bs
SN4T14 has joined #archiveteam-bs
espes__ has joined #archiveteam-bs
tuluu_ has joined #archiveteam-bs
kvieta has joined #archiveteam-bs
Darkstar has joined #archiveteam-bs
JensRex has joined #archiveteam-bs
tammy_ has joined #archiveteam-bs
i0npulse has joined #archiveteam-bs
Hecatz has joined #archiveteam-bs
medowar has joined #archiveteam-bs
Rai-chan has joined #archiveteam-bs
purplebot has joined #archiveteam-bs
Lord_Nigh has joined #archiveteam-bs
ploop has joined #archiveteam-bs
brayden_ has joined #archiveteam-bs
t2t2 has joined #archiveteam-bs
kurt has joined #archiveteam-bs
Odd0002 has joined #archiveteam-bs
DFJustin has joined #archiveteam-bs
hub.dk sets mode: +oooo medowar Lord_Nigh brayden_ DFJustin
SilSte has joined #archiveteam-bs
Fletcher has joined #archiveteam-bs
Madchen has joined #archiveteam-bs
altlabel has joined #archiveteam-bs
PurpleSym has joined #archiveteam-bs
hub.dk sets mode: +oo Fletcher PurpleSym
swebb sets mode: +o brayden_
swebb sets mode: +o DFJustin
jmtd has joined #archiveteam-bs
[14:15]
Boppen has joined #archiveteam-bs [14:24]
Jonison has joined #archiveteam-bs
Kenshin has joined #archiveteam-bs
K4k has joined #archiveteam-bs
SketchCow has joined #archiveteam-bs
Kaz has joined #archiveteam-bs
Ctrl-S___ has joined #archiveteam-bs
alembic has joined #archiveteam-bs
floogulin has joined #archiveteam-bs
HCross2 has joined #archiveteam-bs
deathy has joined #archiveteam-bs
alfie has joined #archiveteam-bs
BartoCH has joined #archiveteam-bs
tklk has joined #archiveteam-bs
raphidae has joined #archiveteam-bs
ThisAsYou has joined #archiveteam-bs
Muad-Dib has joined #archiveteam-bs
Meroje has joined #archiveteam-bs
Sue_ has joined #archiveteam-bs
Sanqui has joined #archiveteam-bs
efnet.port80.se sets mode: +oooo SketchCow Kaz HCross2 Sanqui
swebb sets mode: +o SketchCow
Jonison has quit IRC (Read error: Connection reset by peer)
[14:32]
.... (idle for 18mn)
nyany has joined #archiveteam-bs [14:52]
........ (idle for 38mn)
Aranje has joined #archiveteam-bs [15:30]
SpaffGarg has joined #archiveteam-bs
RichardG_ has joined #archiveteam-bs
mls has joined #archiveteam-bs
Sk1d has joined #archiveteam-bs
andai has joined #archiveteam-bs
Aoede has joined #archiveteam-bs
nightpool has joined #archiveteam-bs
hook54321 has joined #archiveteam-bs
VeganMars has joined #archiveteam-bs
RichardG_ is now known as RichardG
[15:42]
.... (idle for 17mn)
pizzaiolo has joined #archiveteam-bs [16:01]
..... (idle for 23mn)
phuz has joined #archiveteam-bs
phuzion has quit IRC (Read error: Connection reset by peer)
[16:24]
antonizoo has quit IRC (Remote host closed the connection) [16:35]
ZexaronS has joined #archiveteam-bs [16:44]
antonizoo has joined #archiveteam-bs [16:50]
........ (idle for 35mn)
sun_rise has joined #archiveteam-bs [17:25]
sun_rise: If anyone is around, I'm interested in pointing archivebot at something in the other channel [17:27]
***GE has quit IRC (Remote host closed the connection) [17:28]
....... (idle for 32mn)
sun_rise: The job finished but I can't find it in the viewer (or anywhere else?). It says status completed. I'm a little confused. [18:00]
***pizzaiolo has quit IRC (Read error: Connection reset by peer)
pizzaiolo has joined #archiveteam-bs
[18:07]
joepie91: sun_rise: iirc jobs are uploaded/ingested about daily [18:15]
***SpaffGarg has quit IRC (Ping timeout: 250 seconds) [18:15]
SpaffGarg has joined #archiveteam-bs [18:21]
.......... (idle for 49mn)
GE has joined #archiveteam-bs [19:10]
pizzaiolo has quit IRC (Quit: pizzaiolo)
JAA has joined #archiveteam-bs
pizzaiolo has joined #archiveteam-bs
[19:23]
Aranje has quit IRC (Ping timeout: 245 seconds) [19:38]
...... (idle for 27mn)
ZexaronS- has joined #archiveteam-bs
sep332 has quit IRC (Read error: Operation timed out)
ZexaronS has quit IRC (Read error: Operation timed out)
[20:05]
.... (idle for 17mn)
sep332 has joined #archiveteam-bs [20:23]
.......... (idle for 45mn)
speculaas has joined #archiveteam-bs [21:08]
joepie91: speculaas: okay, so, it's *possible* to extract data from the existing archives, but it currently still requires some manual work
speculaas: specifically, you can download the indexes of all the Hyves items on archive.org, which contain a list of every URL that is contained in a given item along with its 'offset' (position in the WARC file)
[21:08]
speculaas: okay [21:09]
joepie91: speculaas: you can then use those positions to do an HTTP range request and retrieve just those bits of the WARC file, obtaining the pages [21:09]
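The manual extraction joepie91 describes can be sketched as follows. The 11-field CDX index layout, field positions, and filenames here are illustrative assumptions, not the actual format of the Hyves items:

```python
# Given one line from an item's index, compute the HTTP Range header that
# fetches just that record from the WARC file on archive.org.
def range_header_for(cdx_line):
    """Return (warc_filename, Range header value) for one index line."""
    fields = cdx_line.split()
    length = int(fields[8])    # compressed record length in bytes
    offset = int(fields[9])    # byte offset of the record in the WARC
    filename = fields[10]
    # HTTP byte ranges are inclusive on both ends, hence the -1.
    return filename, "bytes={}-{}".format(offset, offset + length - 1)

# Hypothetical index line, for illustration only:
line = ("nl,hyves)/someuser 20131202000000 http://www.hyves.nl/someuser "
        "text/html 200 SHA1DIGEST - - 5120 1048576 hyves-00000.warc.gz")
name, rng = range_header_for(line)
print(name, rng)  # hyves-00000.warc.gz bytes=1048576-1053695
```

The returned header value would go into a `Range:` request header against the item's WARC file URL.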
speculaas: Here are some archives https://archive.org/details/hyves?&sort=-downloads&page=2 [21:09]
joepie91: speculaas: there's - to my knowledge - not yet a nice one-stop way to extract an account
speculaas: if you just want to *look* at the account, it's faster to look it up in the wayback machine
all the Hyves archives should have been imported into that
[21:10]
speculaas: The URL for that is: www.hyves.nl/username ?
I already know the url but I see my account is not public
[21:17]
***schbirid has quit IRC (Quit: Leaving) [21:32]
joepie91: speculaas: ah yeah, we only got the public profiles... so if it was a private profile, I'm afraid it can't be recovered :/
speculaas: unless a friend kept around a copy...
[21:32]
speculaas: Okay, then I know enough. Thanks for your time ;) [21:35]
joepie91: speculaas: good luck in your search :) [21:36]
***speculaas has quit IRC (Ping timeout: 268 seconds)
sun_rise has quit IRC (ny.us.hub irc.efnet.nl)
fie has quit IRC (ny.us.hub irc.efnet.nl)
tsr has quit IRC (ny.us.hub irc.efnet.nl)
yuitimoth has quit IRC (ny.us.hub irc.efnet.nl)
luckcolor has quit IRC (ny.us.hub irc.efnet.nl)
tephra has quit IRC (ny.us.hub irc.efnet.nl)
antomatic has quit IRC (ny.us.hub irc.efnet.nl)
kevinr has quit IRC (ny.us.hub irc.efnet.nl)
Frogging has quit IRC (ny.us.hub irc.efnet.nl)
johnny4 has quit IRC (ny.us.hub irc.efnet.nl)
bsmith093 has quit IRC (ny.us.hub irc.efnet.nl)
kisspunch has quit IRC (ny.us.hub irc.efnet.nl)
tapedrive has quit IRC (ny.us.hub irc.efnet.nl)
wolfpld has quit IRC (ny.us.hub irc.efnet.nl)
sun_rise has joined #archiveteam-bs
fie has joined #archiveteam-bs
tsr has joined #archiveteam-bs
yuitimoth has joined #archiveteam-bs
luckcolor has joined #archiveteam-bs
tephra has joined #archiveteam-bs
antomatic has joined #archiveteam-bs
kevinr has joined #archiveteam-bs
Frogging has joined #archiveteam-bs
johnny4 has joined #archiveteam-bs
bsmith093 has joined #archiveteam-bs
irc.efnet.nl sets mode: +oooo luckcolor antomatic Frogging bsmith093
kisspunch has joined #archiveteam-bs
tapedrive has joined #archiveteam-bs
wolfpld has joined #archiveteam-bs
swebb sets mode: +o antomatic
Frogging sets mode: +o yipdw
[21:40]
..... (idle for 21mn)
Sanqui: is it using lxml? [22:05]
***FalconK has joined #archiveteam-bs [22:06]
FalconK: hah!
yo Sanqui
[22:06]
Sanqui: oh hey [22:06]
FalconK: so a bunch of the archivebot pipelines are dual-core atoms clocking at 2.4GHz in virtualized environments [22:07]
yipdw: Sanqui: it depends on the configuration. libxml on some pipelines, html5lib on others
we started with libxml but it kept crashing for some reason
html5lib gets more stuff and seems more stable but is more expensive re CPU
[22:07]
FalconK: I think all of them are html5lib now? [22:07]
Sanqui: for reference, i wrote <@Sanqui> is it using lxml? [22:07]
yipdw: probably, but I can't be sure of that since people can change the pip manifest [22:07]
FalconK: ty [22:08]
Sanqui: yeah html5lib is gonna be cpu expensive [22:08]
yipdw: anyway, I don't think there's a way around parsing the documents to get links and stuff [22:08]
FalconK: ha, we have a manageability issue too, writ large [22:08]
Sanqui: ideally we'd use libxml and allow changing it with a parameter
if a certain website had issues
[22:08]
FalconK: a suggestion from some local crew I know was to forego the XML parsing entirely and just use a best-effort regex, and accept that it will find some bullshit [22:08]
yipdw: recent Chrome release has an official headless mode and that seems interesting [22:08]
FalconK: the biggest reason to not use a regex is that it will fall down making relative URLs out of any / in anything
or else miss tons of stuff
[22:09]
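A minimal version of that best-effort regex approach, to make the relative-URL failure mode concrete. The regex and helper are illustrative, not archivebot code:

```python
import re
from urllib.parse import urljoin

# Only quoted href= and src= attributes are matched; real pages also have
# unquoted attributes, <base> tags, srcset, CSS url(...) and more -- which
# is exactly the "find some bullshit / miss tons of stuff" tradeoff.
LINK_RE = re.compile(r'''(?:href|src)\s*=\s*["']([^"'<>]+)["']''', re.I)

def extract_links(page_url, html):
    # Every hit must be resolved against the page URL; this is where a
    # sloppier regex turns stray slashes into bogus relative URLs.
    return [urljoin(page_url, m) for m in LINK_RE.findall(html)]

html = '<a href="/a">x</a> <img src="img/b.png"> <a href="//cdn.example.org/c">'
print(extract_links("http://example.com/dir/page.html", html))
# ['http://example.com/a', 'http://example.com/dir/img/b.png',
#  'http://cdn.example.org/c']
```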
yipdw: yeah, that's a lot of webpages :P [22:09]
FalconK: so I rejected that solution
fucking relative URLs
[22:09]
xmc: hm [22:10]
yipdw: I still think using a browser is probably the way to go [22:10]
xmc: so ... yeah [22:10]
Sanqui: you could make a more... "outzoomed" regex looking for href= [22:10]
xmc: ugh [22:10]
yipdw: more and more websites are using client rendering [22:10]
xmc: computers suck [22:10]
FalconK: one could do that, yes
for client rendering we have to use phantomjs anyway
[22:10]
Sanqui: phantomjs is dead [22:10]
xmc: or Headless Chrome Because Yes [22:10]
yipdw: and if you're looking for an optimized way to parse documents, you might as well look at a Web browser [22:10]
Sanqui: tbh it'd be very nice if we could just spin up chromes [22:10]
FalconK: that will cause us to need more CPU, not less [22:11]
xmc: ^ [22:11]
yipdw: maybe [22:11]
Sanqui: in place of phantomjs anyway [22:12]
FalconK: it would be a lot nicer to use headless chrome than phantomjs for the things we do need client-side rendering for
but we still need client-side rendering for a small minority of sites
the only major use of it I've noticed, actually, is twitter.
[22:12]
yipdw: I think that may be because that's the only place it seems to reliably work
"reliably"
[22:12]
FalconK: phantomjs is also crashy af
that may also be the case
[22:12]
yipdw: but there's also a lot of blog sites that use client-side rendering and have no fallback [22:13]
FalconK: I'm not at all opposed to using headless chrome in place of phantomjs and seeing how it performs [22:13]
Sanqui: honestly, for archiving websites like twitter, youtube, facebook etc., the bot should have specific modes that are curated [22:13]
yipdw: usually it's software developers, because software developers are idiots [22:13]
FalconK: Sanqui: yes, that's also on the long todo list [22:13]
yipdw: can confirm, I write software [22:13]
FalconK: we want a !twitter at least, and possibly a !youtube and !reddit
!facebook would require a lot of work
separately, there's this CPU usage issue :P
[22:13]
yipdw: did anyone manage to get a useful CPU profile? I tried once but I just got a bunch of "your program is spending most of its time in Python's evaluator"
which is like saying "your program is spending most of its time running"
[22:15]
FalconK: there's *another* issue, which is that wpull.db grows to tens of GB when crawling large sites, but I'm willing to live with that for the moment since the high CPU usage is actually the pain point right now [22:15]
Sanqui: anyway, to drive the point home: phantomjs is over, the lead developer has stepped out in anticipation of headless chrome https://groups.google.com/forum/#!topic/phantomjs/9aI5d-LDuNE [22:15]
FalconK: yipdw: I did! [22:15]
yipdw: oh [22:15]
Sanqui: so we need to do something eventually [22:15]
yipdw: do you still have the profile data? [22:15]
FalconK: ananiel-s6 is currently dedicated to profiling [22:15]
yipdw: ah good [22:16]
FalconK: let me see if I do still have it; if not, I can get it again later [22:16]
yipdw: yeah, I'd like to see that. I got as far as perf and then I got annoyed and had to switch gears [22:16]
FalconK: but I recall html5lib stuff featured very heavily
perf is fucking awful to deal with
I hate optimizing
[22:16]
Sanqui: html5lib is parsing in python [22:16]
yipdw: I keep seeing good testimonials for Telemetry [22:16]
Sanqui: we really want lxml [22:16]
yipdw: we've tried libxml before
it kept blowing up
[22:16]
Sanqui: then we should figure out why and report it upstream [22:17]
FalconK: let's see - how does one read cprofile things again
actually yipdw do you just want the cprofile?
[22:17]
Sanqui: (sorry for the 'we', i'm not trying to sound smart here) [22:17]
FalconK: I'll put it up somewhere [22:17]
yipdw: Sanqui: I mean, yes, but in the meantime it was easier to just switch to html5lib and deliver something working [22:17]
Sanqui: (i fully recognize i have done zero archivebot development) [22:17]
FalconK: we have a very bad test process right now for archivebot [22:18]
Sanqui: yes, I noticed tests are failing [22:18]
FalconK: which is: make it do real work, then wait until it falls over, and see if you got enough information to figure out the failure case [22:18]
***REiN^ has quit IRC (Read error: Operation timed out) [22:18]
yipdw: chfoo wrote a smoke test harness, but there's a lot of moving parts and I haven't looked at what it takes to put them back together in the Travis environment [22:19]
FalconK: on my end, this is mostly because I have $infinity things to do that aren't archivebot, so... :P
no offense to chfoo but his code has a LOT of moving parts
[22:19]
yipdw: i mean really I don't think "test it in production" is a bad idea here [22:19]
FalconK: it's not; it just takes forever [22:19]
yipdw: if you have good telemetry, it's awesome [22:19]
JAA: I'm not familiar with what libxml and html5lib really are internally, but probably the best option would be to use the XML parser library from a browser (i.e. Chromium or Firefox), right? [22:19]
FalconK: html5lib is basically that [22:20]
yipdw: I don't know of anyone who has extracted those for consumption in something else [22:20]
FalconK: it's intended afaik to be a W3C compliant HTML parser, not unlike, say, SAX for XML [22:20]
yipdw: I guess that's a good point too
we really can't use "an XML library", to be pedantic
[22:20]
JAA: Yeah. Problem is, many websites aren't W3C compliant. [22:20]
yipdw: HTML isn't XML and archivebot has to be able to deal with that [22:21]
FalconK: maybe it's html5lib that needs perf [22:21]
JAA: We still want to be able to handle those. [22:21]
FalconK: we don't currently have a significant problem with that [22:21]
Sanqui: JAA: it's inside out here [22:21]
yipdw: indeed, archivebot tends to get pointed at a lot of small, old sites [22:21]
Sanqui: W3C defines how to deal with websites that aren't W3C compliant
and browsers follow that
[22:21]
FalconK: other than operator error I haven't had many complaints of archivebot missing things [22:21]
yipdw: I don't have notes, but I think that's another reason why the html5lib switch happened [22:21]
JAA: I see [22:21]
FalconK: if it's noticed, I'd love to hear about it [22:21]
yipdw: it just got better results
no point in performing faster if you miss page requisites etc
[22:22]
***REiN^ has joined #archiveteam-bs [22:22]
Sanqui: could always try pypy :) [22:22]
JAA: If html5lib works so well, how about rewriting it as a C extension? /s [22:22]
***ZexaronS- has quit IRC (Leaving) [22:23]
yipdw: debugging the intersection of Python and C is prohibited by the Geneva Conventions
I mean you can inflict it on yourself but
[22:23]
***GE has quit IRC (Remote host closed the connection) [22:24]
yipdw: tangentially related, I'm working on a project and part of it is an app that calls into a Go library
from C
[22:24]
FalconK: http://ananiels6.falconkirtaran.net/cprof.dat [22:25]
yipdw: the app is trivially stack-smashable if you send a URL that's longer than 2048 bytes [22:25]
FalconK: that link work? [22:25]
yipdw: I thought that was really funny
because it's like "Go will save me"
yeah no
Connecting to ananiels6.falconkirtaran.net (ananiels6.falconkirtaran.net)|51.15.47.106|:80... failed: Connection refused.
brb
[22:25]
JAA: I've done it before, and it's actually not too bad as long as you can keep all the real work in C and just have a thin transition layer converting the stuff from/to Python variables.
But for obvious reasons, I wouldn't want to implement an XML parser, ever. Most certainly not in C.
[22:27]
Sanqui: this is not work we should be doing [22:29]
FalconK: +1
ffs
http://ananiels6.falconkirtaran.net:8000/cprof.dat
strings are awful
[22:31]
Sanqui: hooray for SimpleHTTPServer [22:32]
JAA: Indeed. wpull really needs fixing. Version 2.0.1 has so many bugs that it's not even funny; e.g. concurrency is broken entirely and aborting doesn't work. And version 1.2.3 throws up when used with the current html5lib version, since the API changed and the requirements.txt doesn't force the specific, compatible version. [22:32]
FalconK: yipdw and I did the work to transition archivebot to wpull2 like 6 months back
I suggest that we roll with chfoo's changes and deprecate 1.x
but we will need to fix concurrency for sure
aborting is working fine for archivebot, by the way
er... as fine as it ever has worked
[22:32]
yipdw: ok [22:34]
JAA: Interesting. I always had to hard-abort (twice ^C) it when I tried. After a few attempts, I went to 1.2.3 [22:35]
FalconK: anyway, the thing that really jumps out at me in the 650 second profile there:
926 23.751 0.026 324.046 0.350 /home/archivebot/.local/lib/python3.5/site-packages/wpull/scraper/html.py:127(_process_elements)
it's spending literally 50% of its time in html._process_elements
[22:36]
yipdw: well, in some way that's kinda cool
it means all of our add-on stuff isn't the slow bit
[22:38]
FalconK: yeah...
by comparison, by the way, it spends about 7.5% of its time working with sqlite
[22:38]
***Stilett0 has quit IRC (Read error: Operation timed out) [22:39]
yipdw: which to me is counterintuitive. I thought running hundreds of regular expressions on each document would be a problem
turns out, it isn't the dominating factor
profiles are awesome
[22:39]
FalconK: yeah
our regexp running is efficient, I think, right? it compiles them into one state machine?
[22:39]
yipdw: no [22:39]
JAA: Hmm, doesn't the HTML parsing happen outside of _process_elements? [22:39]
FalconK: no idea [22:39]
yipdw: but we do compile the regexes / make use of the Python regexp cache [22:40]
* FalconK nods [22:40]
yipdw: so it's probably fast enough [22:40]
FalconK: the regex thing doesn't even seem to appear in the profiling
er
not anywhere near the top
[22:40]
yipdw: neat
it's good to know also that sqlite is fast
[22:40]
FalconK: 78084 0.907 0.000 28.838 0.000 /home/archivebot/.local/lib/python3.5/site-packages/wpull/application/hook.py:132(notify)
5% of time in hooks of any kind
[22:41]
yipdw: i had a suspicion it was more than sufficient for this but it's cool to see that it's at the bottom [22:41]
* FalconK nods [22:41]
yipdw: so, hmm
what is process_elements doing
[22:41]
FalconK: I don't even remember what this job was (probably !a http://cnn.com/ or something)
but yes, one wonders
[22:41]
JAA: It seems that the parsing happens in wpull.document.html.HTMLReader.iter_elements. [22:42]
yipdw: FalconK: can you put that profile data back up?
I get a connection refused talking to that site
that or if you can drill down into process_elements that'd be fab
[22:42]
FalconK: oh sure [22:43]
yipdw: it's a pretty big method [22:43]
FalconK: sorry, it was python http.server and I took it down to read the data [22:43]
yipdw: ah ok [22:43]
JAA: Yeah, line profiling for _process_elements would be helpful. [22:43]
FalconK: up again
go for it
[22:43]
yipdw: hmm
Connecting to ananiels6.falconkirtaran.net (ananiels6.falconkirtaran.net)|51.15.47.106|:80... failed: Connection refused.
[22:44]
FalconK: :8000 [22:44]
yipdw: oh feck
there we go
done
[22:44]
FalconK: :)
I'll leave it up for a bit while I read _process_elements
[22:44]
yipdw: python -m cProfile -s cumtime will never not be funny to me
also hi yes I am 12
[22:45]
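For reference, cumulative-time numbers like the ones being quoted in this discussion can be read offline from a cProfile dump with the stdlib `pstats` module. The profiled statement below is a self-contained stand-in, not an archivebot job:

```python
import cProfile
import io
import pstats

# Profile a small stand-in workload and write the dump to a file, the way
# a dump like cprof.dat would be produced on the pipeline box.
cProfile.run("sorted(range(100000), key=str)", "demo.prof")

# Load the dump and print the top entries by cumulative time, which is
# what -s cumtime sorts by on the command line.
out = io.StringIO()
stats = pstats.Stats("demo.prof", stream=out)
stats.sort_stats("cumulative").print_stats(5)  # top 5 entries by cumtime
print(out.getvalue())
```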
FalconK: oddly, clean_link_soup is negligible
210118 1.320 0.000 3.423 0.000 /home/archivebot/.local/lib/python3.5/site-packages/wpull/scraper/util.py:38(clean_link_soup)
[22:46]
yipdw: it'd be funny if it ended up being urljoin_safe or something
50% of overall time spent in string concat and reallocation
[22:46]
FalconK: 164679 1.413 0.000 30.091 0.000 /home/archivebot/.local/lib/python3.5/site-packages/wpull/url.py:684(urljoin) [22:47]
yipdw: wat
are you fucking kidding me
[22:47]
FalconK: :P [22:47]
JAA: Sidenote: I think there's a bug in _process_elements: "if self._only_relative:" followed by "if link_info.base_link or '://' in link_info.link:" probably doesn't catch protocol-relative links, i.e. 'href="//example.com/"'. [22:47]
yipdw: JAA: hmm [22:47]
FalconK: what's that? [22:47]
yipdw: I don't recall scheme-relative links being a problem, but we can try that out [22:48]
FalconK: oh huh, https://www.paulirish.com/2010/the-protocol-relative-url/... TIL [22:48]
xmc: they're vaguely useful [22:48]
FalconK: JAA: I think you're right; the best way to address it would be a PR
this is kind of a big deal too:
1077 0.230 0.000 51.563 0.048 /home/archivebot/.local/lib/python3.5/site-packages/wpull/database/wrap.py:41(add_many)
[22:50]
yipdw: FalconK: wait, are you sure this is html5lib?
I see parse_lxml in the output
[22:52]
FalconK: I went through the same inquiry
I don't remember the conclusion I came to
[22:52]
JAA: FalconK: I guess. Then again, other PRs have been sitting there for months, so motivation is limited. Also, I have no idea how to fix it properly without breaking other stuff. Paths in URLs can contain several consecutive slashes IIRC; that is, href="some//path" is equivalent to href="some/path". [22:52]
FalconK: either both are in use, or else someone put html5lib in but left all the functions named like libxml.
JAA: right, it looks like only r'^//' is protocol-relative
no comment on PRs except that archivebot specifies github.com/falconkirtaran/wpull in requirements.txt
because before my omnibus PR was accepted, wpull2 was too crashy to use
[22:52]
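The scheme-relative case JAA flagged can be checked standalone. This is a sketch of the corrected predicate per FalconK's note that only a leading `//` is protocol-relative, not the actual wpull code:

```python
import re

def is_relative(link):
    """True only for links that stay on the current host."""
    if '://' in link:
        return False   # absolute URL with an explicit scheme
    if re.match(r'^//', link):
        return False   # scheme-relative: resolves to another host
    return True

assert not is_relative('http://example.com/')
assert not is_relative('//example.com/')   # the case the old check misses
assert is_relative('some//path')           # '//' mid-path is still relative
assert is_relative('/a/b')
```

The old check, `'://' in link`, treats `//example.com/` as relative because it contains no scheme, which is exactly the bug described.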
yipdw: well, maybe parse_lxml is the wrong place to look anyway. profile indicates that most of the time in there is spent in the "start" method, but that method just invokes callbacks [22:55]
FalconK: heya yipdw, I think that add_many prof item might contain the plugins? [22:55]
yipdw: and the callbacks aren't showing up in the profile, AFAICT
FalconK: not sure
oh wait, the callbacks are in the called: section
[22:55]
FalconK: oh, still not a problem
2159 0.036 0.000 6.803 0.003 archive_bot_plugin.py:214(accept_url)
[22:57]
yipdw: huh. highest total time in start is /home/archivebot/.local/lib/python3.5/site-packages/wpull/collections.py:244(__init__)
does this just spend most of its time managing lists?
[22:58]
FalconK: what does *that* abstraction do
it might, though
wpull -r keeps a lot of lists
[22:58]
***Stilett0 has joined #archiveteam-bs [22:59]
yipdw: line 244 of collections.py is the initializer for FrozenDict
which does e.g.
def __init__(self, orig_dict):
    self.orig_dict = orig_dict
    self.hash_cache = hash(tuple(sorted(self.orig_dict.items())))
over 1.68 million calls to that; that seems like it might be a thing
[22:59]
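A sketch of why that initializer shows up in the profile: every construction sorts the items and builds a temporary tuple just to precompute a hash, an O(n log n) pass plus extra allocations per HTML tag. The `FrozenDict` here is as quoted above; the timing harness and attribute dict are illustrative:

```python
import timeit

class FrozenDict:
    def __init__(self, orig_dict):
        self.orig_dict = orig_dict
        # Sort + tuple + hash on every construction.
        self.hash_cache = hash(tuple(sorted(self.orig_dict.items())))

# Hypothetical tag attributes, a stand-in for lxml's 'attrib'.
attrs = {'href': '/x', 'class': 'nav', 'id': 'top', 'rel': 'nofollow'}

# Compare against a plain dict copy, the substitution yipdw suggests trying.
frozen_t = timeit.timeit(lambda: FrozenDict(attrs), number=100000)
plain_t = timeit.timeit(lambda: dict(attrs), number=100000)
print("FrozenDict: %.3fs  plain dict: %.3fs" % (frozen_t, plain_t))
```

Multiplied by the 1.68 million calls above, even a small per-call difference adds up.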
FalconK: wait
hash(...sorted(?
[22:59]
yipdw: yeah [23:00]
FalconK: why [23:00]
yipdw: I don't know
do Python hashes guarantee any sort of iteration order?
I know Ruby does
[23:00]
FalconK: I suppose that would depend
what stability properties does it require?
[23:00]
yipdw: not sure [23:01]
FalconK: python is not my primary language
(that'd be C++, followed by x86 ASM)
[23:01]
yipdw: FrozenDict is used in lxml.HTMLParserTarget.start [23:02]
FalconK: well! [23:02]
yipdw: I'm not really sure if it's needed, though
hard to tell
it's also not immediately clear to me what it's wrapping -- it's 'attrib'
(tag attributes?)
[23:02]
FalconK: murdering it entirely would speed us up by 2%
AKA 4-6 page grabs per hundred seconds
[23:03]
yipdw: or more, depending on what effect that would have with fewer allocations [23:04]
FalconK: oh, true
the allocator is still a black box to us
[23:04]
yipdw: I was just poking at it because it showed up pretty high in the profiles [23:04]
FalconK: though I feel like it's probably spending a lot more time sorting than allocating
I don't think __init__ captures time python spends allocating
and actually the python heap processing was insignificant anyway, wasn't it?
[23:04]
yipdw: it might not, but FrozenDict is making more objects in its initializer
i.e. the new hash and the temporary tuple
[23:05]
FalconK: mm [23:05]
yipdw: I don't know how expensive that is on the allocator (it might be trivial)
anyway, I guess one thing to try would be to replace FrozenDict() with, like, dict()
[23:05]
FalconK: I don't think allocator time is captured with the jit time
but yeah, we could try that on ananiel-S6
[23:06]
yipdw: you lose the immutability guarantee but it'd be one way to see if FrozenDict() introduces a large penalty
or, in the specific case of start(), just don't wrap attrib in a FrozenDict()
I doubt it will have a perceptible macro difference but it would be neat to see how it changes the profile
[23:06]
FalconK: now I'm confused about this: [23:08]
yipdw: speaking of C++, one thing that C++ has made me really paranoid about (probably overly paranoid) is allocations [23:08]
FalconK: there's both lxml_.py and htmllib5_.py
why
[23:08]
yipdw: like every time I've had a performance problem, it wasn't algorithmic. it was because I was fucking mallocing too much
or treating cache lines like slacklines
that sort of thing
FalconK: huh
dunno
maybe this is using libxml after all?
[23:08]
FalconK: I wonder if it's using libxml for XHTML documents and html5lib for others?
I remember there was some complex dispatch logic
it's just so ungodly complex
[23:11]
yipdw: maybe using Chrome as the HTML processor would actually be faster :P
let wpull handle queue management, retry, etc
[23:13]
FalconK: doubt it but who knows [23:13]
yipdw: I mean you might still be at high CPU%, but the CPU might be doing more [23:14]
FalconK: one thing that is good about html5lib/libxml2 is that it doesn't execute needless javascript
we may be able to disable doing that in headless chrome
[23:14]
yipdw: it doesn't, but Javascript has been doing things to the DOM for quite a while
I don't know if it's needless
there was some other browser like this, I forgot what it was
it was webkit based
[23:15]
FalconK: it's needful to grab, for sure [23:16]
yipdw: and it was meant to be used in a UNIX Philosophy way
which means it has an impossible name
AH
uzbl
maybe that's an option too in the "use a browser engine to give us what we need to do our thing" arena
or i dunno, how good is servo these days :P
every time I try to run servo nightly it eats up all my cores but doesn't render anything
but that could be an environment issue
[23:16]
***Ravenloft has joined #archiveteam-bs
JAA has quit IRC (Quit: Page closed)
[23:21]
FalconK: ok, new profiling on !a https://www.npr.org/
in 10 or 20 I'll kill it and we can look
it seems to not be crashing without FrozenDict
... I say, as it crashes
this fucking bug:
File "/home/archivebot/.local/lib/python3.5/site-packages/chardet/universaldetector.py", line 271, in close
    for prober in self._charset_probers[0].probers:
IndexError: list index out of range
CRITICAL Sorry, Wpull unexpectedly crashed.
CRITICAL Please report this problem to the authors at Wpull's issue tracker so it may be fixed. If you know how to program, maybe help us fix it? Thank you for helping us help you help us all.
which is not new
[23:24]
yipdw: what the
oh
right
[23:27]
***superkuh has quit IRC (Remote host closed the connection)
superkuh has joined #archiveteam-bs
[23:32]
.... (idle for 15mn)
FalconK: yipdw: http://ananiels6.falconkirtaran.net:8000/02_post_rm_FrozenDict [23:49]
it certainly didn't seem to break anything, and now that 2% is gone
it's spending a significant amount of time on epoll_wait, which is good since that means it's a little network-bound
[23:55]
***BlueMaxim has joined #archiveteam-bs [23:57]
FalconK: 20 1.237 0.062 1.995 0.100 /home/archivebot/.local/lib/python3.5/site-packages/chardet/mbcharsetprober.py:61(feed)
that's 0.062 seconds per call. what is that even for?
[23:59]
