[00:02] *** Martle has joined #archiveteam-bs
[01:12] *** Stilett0 has joined #archiveteam-bs
[01:14] *** Sk1d has quit IRC (Read error: Operation timed out)
[01:14] *** Stiletto has quit IRC (Read error: Operation timed out)
[01:14] *** Stiletto has joined #archiveteam-bs
[01:16] *** Stilett0 has quit IRC (Ping timeout: 260 seconds)
[01:17] *** Sk1d has joined #archiveteam-bs
[01:21] so looks like i have 475k items this year
[01:22] this is a bigger upload year than 2016 which was my biggest upload year by item count
[01:24] *** Stilett0 has joined #archiveteam-bs
[01:25] *** Stiletto has quit IRC (Ping timeout: 264 seconds)
[01:29] *** Sk1d has quit IRC (Read error: Operation timed out)
[01:32] *** Sk1d has joined #archiveteam-bs
[01:45] *** Sk1d has quit IRC (Read error: Operation timed out)
[01:46] *** Pixi` has joined #archiveteam-bs
[01:47] *** Sk1d has joined #archiveteam-bs
[01:49] *** Pixi has quit IRC (Read error: Operation timed out)
[01:58] SketchCow: i'm going to see about updating the SBS 8 NEWS collection
[01:58] i'm just going to be grabbing the mp4 files cause i just want to get it done
[02:05] *** BlueMax has joined #archiveteam-bs
[03:32] *** zerkalo has quit IRC (Ping timeout: 264 seconds)
[03:32] *** zerkalo has joined #archiveteam-bs
[03:35] *** RichardG_ has joined #archiveteam-bs
[03:36] *** Pixi has joined #archiveteam-bs
[03:38] *** RichardG has quit IRC (Ping timeout: 360 seconds)
[03:38] *** VerifiedJ has quit IRC (Read error: Operation timed out)
[03:39] *** dxrt has quit IRC (Ping timeout: 360 seconds)
[03:39] *** Polylith has quit IRC (Ping timeout: 360 seconds)
[03:39] *** dxrt has joined #archiveteam-bs
[03:41] *** superkuh has quit IRC (Excess Flood)
[03:41] *** SketchCo1 has joined #archiveteam-bs
[03:42] *** swebb sets mode: +o SketchCo1
[03:42] *** Polylith has joined #archiveteam-bs
[03:42] *** unlobito has quit IRC (Ping timeout: 360 seconds)
[03:43] *** unlobito has joined #archiveteam-bs
[03:43] *** Pixi` has quit IRC (Read error: Operation timed out)
[03:43] *** me_ has quit IRC (Read error: Operation timed out)
[03:46] *** Sk1d has quit IRC (Read error: Operation timed out)
[03:47] *** phirephly has quit IRC (Ping timeout: 360 seconds)
[03:47] *** twigfoot has quit IRC (Ping timeout: 360 seconds)
[03:47] *** arkiver has quit IRC (Ping timeout: 360 seconds)
[03:47] *** Zebranky has quit IRC (Remote host closed the connection)
[03:47] *** Zebranky has joined #archiveteam-bs
[03:47] *** SketchCow has quit IRC (Read error: Connection reset by peer)
[03:48] *** phirephly has joined #archiveteam-bs
[03:48] *** Darkstar has quit IRC (Read error: Connection reset by peer)
[03:48] *** eprillios has quit IRC (Ping timeout: 360 seconds)
[03:49] *** twigfoot has joined #archiveteam-bs
[03:49] *** closure has quit IRC (Read error: Operation timed out)
[03:50] *** me_ has joined #archiveteam-bs
[03:50] *** closure has joined #archiveteam-bs
[03:50] *** anarchat has joined #archiveteam-bs
[03:50] *** anarcat has quit IRC (Ping timeout: 360 seconds)
[03:51] *** Sk1d has joined #archiveteam-bs
[03:52] *** eprillios has joined #archiveteam-bs
[03:52] *** SketchCo1 is now known as SketchCow
[03:52] *** arkiver has joined #archiveteam-bs
[03:53] *** Darkstar has joined #archiveteam-bs
[03:56] *** Pixi has quit IRC (Quit: Pixi)
[03:56] *** anarchat is now known as anarcat
[03:56] *** Stiletto has joined #archiveteam-bs
[03:57] *** Pixi has joined #archiveteam-bs
[03:58] *** Stilett0 has quit IRC (Read error: Operation timed out)
[03:59] *** superkuh has joined #archiveteam-bs
[03:59] *** superkuh has quit IRC (Excess Flood)
[04:03] *** Sk1d has quit IRC (Read error: Operation timed out)
[04:05] *** Cameron_D has quit IRC (Read error: Operation timed out)
[04:07] *** Sk1d has joined #archiveteam-bs
[04:08] *** JTL has joined #archiveteam-bs
[04:08] *** Cameron_D has joined #archiveteam-bs
[04:08] *** superkuh has joined #archiveteam-bs
[04:12] *** Stilett0 has joined #archiveteam-bs
[04:17] *** Stiletto has quit IRC (Read error: Operation timed out)
[04:23] *** Dimtree has quit IRC (Read error: Connection reset by peer)
[04:32] *** Dimtree has joined #archiveteam-bs
[04:51] *** qw3rty117 has joined #archiveteam-bs
[04:57] *** qw3rty116 has quit IRC (Read error: Operation timed out)
[05:32] *** Sk1d has quit IRC (Read error: Operation timed out)
[05:35] *** Sk1d has joined #archiveteam-bs
[05:48] *** Sk1d has quit IRC (Read error: Operation timed out)
[05:52] *** Sk1d has joined #archiveteam-bs
[06:05] *** Sk1d has quit IRC (Read error: Operation timed out)
[06:08] *** Sk1d has joined #archiveteam-bs
[06:18] *** logchfoo0 starts logging #archiveteam-bs at Fri Nov 02 06:18:53 2018
[06:18] *** logchfoo0 has joined #archiveteam-bs
[06:19] *** Petri152 has joined #archiveteam-bs
[06:21] *** Sk1d has quit IRC (Read error: Operation timed out)
[06:25] *** Sk1d has joined #archiveteam-bs
[07:13] *** Martle has quit IRC (Leaving)
[07:31] *** SmileyG has quit IRC (Read error: Operation timed out)
[07:31] *** Sk1d has quit IRC (Read error: Operation timed out)
[07:31] *** Smiley has joined #archiveteam-bs
[07:34] *** brayden has quit IRC (Ping timeout: 260 seconds)
[07:35] *** Sk1d has joined #archiveteam-bs
[07:36] *** alex__ has joined #archiveteam-bs
[07:47] *** brayden has joined #archiveteam-bs
[07:47] *** swebb sets mode: +o brayden
[07:48] *** Sk1d has quit IRC (Read error: Operation timed out)
[07:50] *** Sk1d has joined #archiveteam-bs
[08:03] https://m.dw.com/en/armenias-parliament-dissolved-forcing-december-election/a-46126035
[08:03] Should we do something about this?
[08:04] *** Sk1d has quit IRC (Read error: Operation timed out)
[08:07] *** Sk1d has joined #archiveteam-bs
[08:22] *** Petri152 has quit IRC (hub.efnet.us irc.colosolutions.net)
[08:22] *** Mayonaise has quit IRC (hub.efnet.us irc.colosolutions.net)
[08:22] *** sknebel_ has quit IRC (hub.efnet.us irc.colosolutions.net)
[08:22] *** jspiros has quit IRC (hub.efnet.us irc.colosolutions.net)
[08:22] *** S1mpbrain has quit IRC (hub.efnet.us irc.colosolutions.net)
[08:22] *** ivan has quit IRC (hub.efnet.us irc.colosolutions.net)
[08:22] *** JAA has quit IRC (hub.efnet.us irc.colosolutions.net)
[08:22] *** K4k has quit IRC (hub.efnet.us irc.colosolutions.net)
[08:22] *** c4rc4s has quit IRC (hub.efnet.us irc.colosolutions.net)
[08:22] *** zyphlar has quit IRC (hub.efnet.us irc.colosolutions.net)
[08:32] *** Petri152 has joined #archiveteam-bs
[08:32] *** Mayonaise has joined #archiveteam-bs
[08:32] *** sknebel_ has joined #archiveteam-bs
[08:32] *** jspiros has joined #archiveteam-bs
[08:32] *** S1mpbrain has joined #archiveteam-bs
[08:32] *** ivan has joined #archiveteam-bs
[08:32] *** JAA has joined #archiveteam-bs
[08:32] *** K4k has joined #archiveteam-bs
[08:32] *** c4rc4s has joined #archiveteam-bs
[08:32] *** zyphlar has joined #archiveteam-bs
[08:32] *** irc.colosolutions.net sets mode: +o JAA
[08:32] *** swebb sets mode: +o JAA
[08:32] *** bakJAA sets mode: +o JAA
[08:34] *** JAA sets mode: +o bakJAA
[08:46] *** Sk1d has quit IRC (Read error: Operation timed out)
[08:49] *** Sk1d has joined #archiveteam-bs
[09:02] *** Sk1d has quit IRC (Read error: Operation timed out)
[09:06] *** Sk1d has joined #archiveteam-bs
[09:17] *** Sk1d has quit IRC (Read error: Operation timed out)
[09:22] *** Sk1d has joined #archiveteam-bs
[09:34] *** Sk1d has quit IRC (Read error: Operation timed out)
[09:36] *** alex__ has quit IRC (Quit: alex__)
[09:38] *** alex__ has joined #archiveteam-bs
[09:40] *** Sk1d has joined #archiveteam-bs
[09:50] *** alex__ has quit IRC (Quit: alex__)
[10:23] *** VerifiedJ has joined #archiveteam-bs
[10:31] *** Sk1d has quit IRC (Read error: Operation timed out)
[10:34] *** Sk1d has joined #archiveteam-bs
[10:48] *** Sk1d has quit IRC (Read error: Operation timed out)
[10:50] *** Sk1d has joined #archiveteam-bs
[10:52] *** alex__ has joined #archiveteam-bs
[11:02] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[11:10] SketchCow: https://archive.org/details/chromebot-2018-11-02-b514ee Item preview of new uploads now contains five randomly selected screenshots.
[12:02] So, back on the topic of the US midterms, my first scrape has found:
[12:02] ~2800 twitter accounts (currently running snscrape to get list of URLs to stick in archivebot)
[12:03] ~5000 facebook accounts (will do the same as with the twitter accounts)
[12:03] ~6500 campaign websites
[12:03] I would recommend splitting them into batches
[12:04] these were scraped from Vote411.org, which indicates ~26,000 candidates
[12:04] betamax mentioned 220k tweets from 250 accounts in -ot, so yeah, definitely do batches.
[12:04] 220k tweets is 10% of the accounts so far... :)
[12:05] Yeah, ~2.5M tweets total if the number of tweets per account is representative.
[12:05] betamax: I’m offering to run a private chromebot grab for Twitter and Facebook on all of these accounts.
[12:06] PurpleSym: that is probably a good idea
[12:06] I shall stick the lists of URLs somewhere once the grab completes
[12:06] the real issue though is the campaign sites
[12:07] Yeah, just ping me when you’re ready.
[12:07] Are private grab-site/wpull grabs an option for these?
[12:07] *** Sk1d has quit IRC (Read error: Operation timed out)
[12:07] ideally they'd end up in the wayback, would a private grab allow this?
[12:08] I don't think so
[12:08] Unless it gets uploaded by an approved account
[12:08] Which the AB one is, or PurppleSym's one
[12:08] PurpleSym's*
[12:09] kiskaAus AB pipeline IBM stats http://prntscr.com/ldixvt
[12:09] Yep, I guess I could upload them. But we’ll have to split up the 6500 sites to get some parallelism.
[12:09] yes
[12:09] So I would recommend just putting them on to AB and see what happens
[12:10] It's also a pain that out of the ~26,000 candidates listed on Vote411, only a few thousand have external web links
[12:10] *** Sk1d has joined #archiveteam-bs
[12:11] I fail to believe that 90% of the candidates don't have a twitter account
[12:11] anyone think of other websites to scrape to get a more representative list of social media / websites of candidates?
[12:11] betamax: Could it be worth trying to match names or IDs against Wikidata?
[12:12] You could try and see if they use hash tags to represent the election. ie Australia uses #auspol
[12:12] I don't know how complete Wikidata is, but VoynichCr's bot seems to be doing pretty well.
[12:13] *** svchfoo1 has joined #archiveteam-bs
[12:13] *** svchfoo3 has joined #archiveteam-bs
[12:13] *** PurpleSym sets mode: +oo svchfoo1 svchfoo3
[12:14] hmm, just tried searching wikidata for a random candidate that *has* a twitter/facebook/website listed on Vote411, (Dana Nessel) and nothing came up
[12:15] Isn't VoynichCr a wizard in wikidata?
[12:17] I welcome any and all wizards!
[12:18] unfortunately the way vote411 was scraped means that I only have lists of URLs, and don't know the names of candidates that I got, or the names of candidates where I'm missing data on
[12:21] wikidata is a good source, but it isnt complete
[12:23] if the person isnt notable for wikipedia, or he has an article but it is not famous, wikidata entries can suck
[12:52] *** Mateon1 has quit IRC (Read error: Operation timed out)
[12:58] *** Mateon1 has joined #archiveteam-bs
[13:15] *** Sk1d has quit IRC (Read error: Operation timed out)
[13:19] *** Sk1d has joined #archiveteam-bs
[13:33] *** Sk1d has quit IRC (Read error: Operation timed out)
[13:36] *** Sk1d has joined #archiveteam-bs
[13:44] Hello, so my grab-site stopped due to low disk space. I was stupid enough not to run the script which checks for low disk space... anyhow, can I resume it?
[13:45] ColdIce: I don't think grab-site supports resuming grabs. https://github.com/ludios/grab-site/issues/58 But maybe ivan can tell you more.
[13:48] *** Sk1d has quit IRC (Read error: Operation timed out)
[13:52] *** Sk1d has joined #archiveteam-bs
[15:39] sometimes I resume a crawl poorly by feeding the $(gs-dump-urls wpull.db todo) list into a new --1 crawl
[16:16] ivan, tell me more. Total noob here
[16:17] cd DIR (with the crawl)
[16:17] gs-dump-urls wpull.db todo > ../.website-continued
[16:17] grab-site --1 -i .website-continued
[16:17] a non-recursive crawl of the URLs already discovered
[16:17] if that's not good enough just crawl it again, sorry :(
[16:17] how about ignores and offsite links? Pass args to grab-site again?
[16:18] you can copy some of the control files from the old dir
[16:18] offsite links are not followed with --1 anyway
[16:56] *** Sk1d has quit IRC (Read error: Operation timed out)
[16:59] *** Sk1d has joined #archiveteam-bs
[17:00] so
[17:01] ivan / JAA : i've at least created https://github.com/ArchiveTeam/wpull/pull/402 so we can get tests back on track
[17:01] err wtf
[17:01] that was deleted??
[17:02] anarcat: So yeah, the plan is to get my PR merged, then sort out the remaining failures caused by FalconK's changes between 2.0.1 and 2.0.3.
[17:02] what the f
[17:02] i created PR #402, but now it's gone
[17:02] and i can't recreate it
[17:02] And then we need to take a hard look at how wpull uses asyncio and why it's broken.
[17:02] wtf is wrong with github
[17:02] Just GitHub being GitHub: https://status.github.com/messages
[17:03] I've said before that I'd prefer a self-hosted GitLab/Gitea/whatever instance, but oh well.
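(Note on the grab-site resume workaround from 16:17.) The three commands can be wrapped in a small script. This is only a rough sketch, not part of grab-site: the helper names `resume_commands` and `resume` are made up for illustration, and it assumes the new crawl is started from the parent directory, since that is where `../.website-continued` ends up. Only the `gs-dump-urls wpull.db todo` and `grab-site --1 -i` invocations come from the log.

```python
import shlex
import subprocess
from pathlib import Path


def resume_commands(crawl_dir, url_list_name=".website-continued"):
    """Describe the two steps from the log as dicts (argv, cwd, stdout file)."""
    crawl_dir = Path(crawl_dir)
    parent = crawl_dir.parent
    url_list = parent / url_list_name  # file name as used in the log
    dump = {
        # Dump the URLs still marked "todo" in the old crawl's database.
        "argv": ["gs-dump-urls", "wpull.db", "todo"],
        "cwd": crawl_dir,
        "stdout": url_list,
    }
    crawl = {
        # Non-recursive (--1) crawl of the URLs that were already discovered.
        # Assumption: run from the parent dir, where the URL list was written.
        "argv": ["grab-site", "--1", "-i", url_list.name],
        "cwd": parent,
        "stdout": None,
    }
    return [dump, crawl]


def resume(crawl_dir, dry_run=True):
    """Print the steps, or actually run them when dry_run is False."""
    for step in resume_commands(crawl_dir):
        printable = " ".join(shlex.quote(arg) for arg in step["argv"])
        if dry_run:
            print(f"(cd {step['cwd']}) {printable}")
            continue
        if step["stdout"] is not None:
            with open(step["stdout"], "w") as out:
                subprocess.run(step["argv"], cwd=step["cwd"], stdout=out, check=True)
        else:
            subprocess.run(step["argv"], cwd=step["cwd"], check=True)
```

Calling `resume("path/to/old-crawl-dir")` only prints what would be run; ignores and other control files from the old directory still have to be copied by hand, as mentioned at 16:18.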
[17:03] wow, and now it's back https://github.com/ArchiveTeam/wpull/pull/402
[17:03] and it's gone
[17:03] sigh
[17:03] anyways
[17:04] that should normally kick into https://travis-ci.org/ArchiveTeam/wpull/pull_requests eventually
[17:04] JAA: i'd love to see your PR merged, but there's already a bunch of work to get it up to pep8 and all that stuff
[17:04] ivan made a bunch of comments on it
[17:05] Yeah, I know. Hope to get to that soon.
[17:05] okay
[17:05] well for sure it would be nice to fix tests first, so maybe we can concentrate on that
[17:07] i'll do that in my branch, worst case you just need to remerge with master - but because i cherry-picked your commits, that should be a noop
[17:07] i mean assuming my stuff gets merged in the first place :p
[17:07] Assuming we ever get to see your PR. :-P
[17:08] hehe
[17:08] Oh, there it is. I fixed the html5lib thing on my branch/PR already.
[17:08] well it's there now
[17:08] yeah i know, i just cherry-picked that *one* commit
[17:08] Ah right
[17:08] then we could add that *one* magic commit that fixes the asyncio stuff
[17:08] (or just mark the youtubedl tests as xfail so we can get the test suite green again)
[17:09] because it's better to have a test suite fake green than red
[17:09] [citation needed]
[17:09] * anarcat shrugs
[17:09] there are other PRs in there, which fix other unrelated things
[17:10] if they are marked as red, how do we know they don't introduce new regressions?
[17:10] if the test suite would be green with known failures, then we still know about those failures but at least don't introduce *other* regressions
[17:10] anyways
[17:10] i need to step out, we'll see how travis works that out
[17:12] Well, we'd need to look at the test results for the failures until we fix it properly. Anyway, I intend to revert the removal of CONNECT support (which is what broke youtube-dl), which is one of the test failures. And the other failures are related to the removal of FrozenDict (a performance hog), so they should also be easy enough to fix.
[17:13] But yeah, I'll try to get my PR ready for merging this weekend hopefully, and that should bring us a good step forward.
[17:14] *** wp494 has quit IRC (Read error: Operation timed out)
[17:14] *** Sk1d has quit IRC (Read error: Operation timed out)
[17:14] *** wp494 has joined #archiveteam-bs
[17:18] *** Sk1d has joined #archiveteam-bs
[17:25] *** Pixi has quit IRC (Quit: Pixi)
[18:42] *** Stiletto has joined #archiveteam-bs
[18:46] *** Stilett0 has quit IRC (Read error: Operation timed out)
[19:05] *** alex____ has joined #archiveteam-bs
[19:06] *** alex__ has quit IRC (Ping timeout: 252 seconds)
[19:08] *** Pixi has joined #archiveteam-bs
[19:21] *** Martle has joined #archiveteam-bs
[19:24] JAA: if you can point to those commits, i can add that to the branch and we can start by making travis green
[19:24] or i can dig for those myself as well
[19:29] anarcat: git log v2.0.1..master should be useful.
[19:29] I don't have the commit IDs handy right now.
[19:30] * anarcat nods
[19:30] i think it's simply 5613807
[19:31] https://travis-ci.org/ArchiveTeam/wpull/jobs/449990630
[19:33] one failure already :/
[19:34] sigh
[19:34] FAILED (SKIP=11, errors=1, failures=4)
[19:43] i tried to revert two more commits, no go either
[19:44] i need to move on, unfortunately
[19:45] assert host == 'localhost'
[19:49] anarcat: The CONNECT removal is part of a commit also having some other changes, so that needs to be reverted manually/partially.
[19:54] figures
[19:54] i just blindly reverted it
[19:56] i dunno, actually... 5613807 is pretty much self-contained: it's just CONNECT and FrozenDict removal, AFAICT
[20:12] sigh.. tests fail even on 2.0.1 here
[20:16] i'm trying to bisect when the test suite was ever working, and i'm down to 1.2.3 now
[20:18] anarcat: Well yes, but I don't think we want to revert the FrozenDict removal since it did provide a sizeable speedup.
[20:18] ah, i misunderstood you there
[20:19] well even if we keep that, the tests fail
[20:19] the test suite seems a little brittle
[20:19] i wonder if it's because i'm on python 3.6
[20:20] anarcat: 2.0.1 passed: https://travis-ci.org/ArchiveTeam/wpull/builds/139339474
[20:20] not here
[20:21] i'll try again in 3.5
[20:48] nope
[20:48] kinda hard to make this thing work at all... crashing on phantomjs because it can't open DISPLAY
[20:51] i can reproduce the same test failures as the ones i get in the current travis pipeline on debian stretch with python 3.5
[20:51] in 2.0.1
[20:52] https://ptpb.pw/E-OP
[20:53] so i would guess this is not a regression introduced by falconkirtaran - we might have something else here, maybe regarding some dependencies
[21:10] *** Sk1d has quit IRC (Read error: Operation timed out)
[21:13] *** Sk1d has joined #archiveteam-bs
[21:47] *** tuluu has quit IRC (Remote host closed the connection)
[21:49] *** tuluu has joined #archiveteam-bs
[22:04] *** BlueMax has joined #archiveteam-bs
[23:28] anarcat: That does seem like something related to your PhantomJS installation. I remember that we've had issues with the package from the Debian and Ubuntu repos before. That's why the installation instructions for ArchiveBot pipelines say to download the binary from Bitbucket instead.
[23:29] See here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=817277
[23:30] It has something to do with how Qt and WebKit are built.
[23:31] In the long term, we should get rid of PhantomJS entirely since it's no longer maintained. I don't know if we should simply drop it entirely or somehow find an alternative though.
[23:50] ggarh
[23:51] Yeah. Avoid PhantomJS like the plague.
[23:51] dropping it seems like a good first step if (a) it doesn't work and (b) it breaks the test suite
[23:52] The PhantomJS binary from the maintainer's Bitbucket repository works fine. That's what's used on Travis as well, I believe.
[23:53] Ah no, Travis has PhantomJS installed otherwise, and I guess that's a version which works headlessly then.
[23:56] how? i don't see where it's installed in travis-ci.yml
[23:57] i guess it's part of the base image...
[23:57] Yeah, it's installed on the system level at Travis: https://docs.travis-ci.com/user/gui-and-headless-browsers/#using-phantomjs
[23:58] And they explicitly mention that it's actually headless and that xvfb is not needed.
[23:59] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=817277
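(Note on the PhantomJS discussion.) A middle ground between dropping PhantomJS and letting a broken install turn the suite red would be to skip the PhantomJS-dependent tests when the binary cannot start. A minimal sketch follows, assuming the tests are plain `unittest` cases and that an unusable install (such as one that cannot open DISPLAY) also fails on `phantomjs --version`. The helper `phantomjs_works` and the test class are hypothetical and not taken from the wpull test suite.

```python
import shutil
import subprocess
import unittest


def phantomjs_works(which=shutil.which, run=subprocess.run):
    """Return True if a phantomjs binary is on PATH and starts successfully."""
    path = which("phantomjs")
    if path is None:
        return False
    try:
        # Assumption: a build that cannot start at all also fails here.
        result = run(
            [path, "--version"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


@unittest.skipUnless(phantomjs_works(), "PhantomJS is missing or cannot start")
class PhantomJSSmokeTest(unittest.TestCase):
    # Hypothetical test case, shown only to illustrate the skip decorator.
    def test_binary_starts(self):
        self.assertTrue(phantomjs_works())
```

Skipped tests still show up in the test report, so the known gap stays visible while unrelated regressions are not masked by it.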