[00:17] *** Mayeau is now known as Mayonaise
[00:36] *** kristian_ has quit IRC (Quit: Leaving)
[01:23] *** dashcloud has quit IRC (Read error: Operation timed out)
[01:26] *** dashcloud has joined #archiveteam-bs
[01:31] *** krazedkat has quit IRC (Quit: Leaving)
[02:19] *** VADemon has quit IRC (Quit: left4dead)
[03:42] *** jrwr has quit IRC (Remote host closed the connection)
[04:30] *** ravetcofx has quit IRC (Ping timeout: 1208 seconds)
[04:40] *** ravetcofx has joined #archiveteam-bs
[04:55] is bayimg all backed up?
[05:28] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[05:34] *** Sk1d has joined #archiveteam-bs
[06:09] *** ravetcofx has quit IRC (Read error: Operation timed out)
[06:50] *** Somebody1 has quit IRC (Read error: Operation timed out)
[07:22] *** Somebody1 has joined #archiveteam-bs
[08:06] alembic: yes
[08:11] Short braindump: The URL discovery problem is going to be one of our main concerns when archiving things (cf. Dropbox), so can we build our own URL database/site explorer?
[08:11] We already got a lot of data from URLTeam, and we have a lot of WARCs we could extract URLs from that were considered “off site” at the time of the crawl.
[08:14] *** Start has quit IRC (Quit: Disconnected.)
[08:15] *** BlueMaxim has quit IRC (Quit: Leaving)
[08:21] *** schbirid has joined #archiveteam-bs
[08:33] PurpleSym: I'd be delighted if someone took the initiative to convert (and even better, keep up to date) an indexed database of all the URLs in the URLTeam and ArchiveBot collections. It'd be non-trivial in size, but not *too* big, I think.
[08:34] What size are we talking about here, Somebody1? GB? TB?
[08:37] you can get an idea by downloading all the ArchiveBot CDXes
[08:38] (incidentally, that's also a great start to an indexed database)
[08:42] *** Somebody1 has quit IRC (Ping timeout: 370 seconds)
[08:43] That’s a good idea. I’ll try to get a size estimate from the CDXes.
[08:46] *** wm_ has quit IRC (Ping timeout: 260 seconds)
[08:46] *** midas1 has quit IRC (Ping timeout: 260 seconds)
[08:47] *** Kenshin has quit IRC (Read error: Connection reset by peer)
[08:47] *** Kenshin has joined #archiveteam-bs
[08:48] *** midas1 has joined #archiveteam-bs
[08:51] *** Somebody1 has joined #archiveteam-bs
[08:52] PurpleSym: IDK -- probably in the couple-of-GB range, I think...?
[08:52] That sounds manageable.
[08:54] *** wm_ has joined #archiveteam-bs
[09:05] so the (compressed) URLTeam data is a total of 200G
[09:05] but the format is silly, so storing it in a more sensible way might bring the size down
[09:06] IDK how big the ArchiveBot collection is
[09:25] ps. Well done for coming into something that's been running for years and going "HERE'S A BETTER WAY.... IT'S BETTER..." There is method in the madness
[09:31] it's the same data we've always had
[09:34] *** DiscantX has joined #archiveteam-bs
[09:34] URLTeam, yes, but we do not extract *all* URLs from WARCs, do we? The CDXes only contain URLs ArchiveBot actually retrieved.
[09:35] a CDX for a WARC is a full index of that WARC, so uh
[09:35] I'm not sure what the question is
[09:36] The HTML inside the WARC may contain links to sites that were not crawled due to exceptions or recursion limits.
[09:37] These are not in the CDX.
[09:37] right
[09:38] These are the URLs I was talking about.
[09:38] but that's fine, because that URL doesn't exist in the WARC
[09:39] oh ok
[09:40] I mean, there are ways of varying accuracy to get those URLs
[09:47] HCross: which topic was your comment in reference to?
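[editor's note] A minimal sketch of the CDX-to-URL-index idea discussed above, assuming the common space-separated CDX format whose header line reads " CDX N b a m s k r M S V g" (the 'a' field is the original URL). For the full 200G of URLTeam/ArchiveBot data an on-disk store (sqlite, or piping the output through sort -u) would replace the in-memory set; this is just for illustration.

#!/usr/bin/env python3
# Sketch: dedupe original URLs out of one or more ArchiveBot CDX files.
import sys

def urls_from_cdx(path):
    with open(path, encoding="utf-8", errors="replace") as fh:
        header = fh.readline().split()
        # A CDX header looks like: CDX N b a m s k r M S V g
        # 'a' names the original-URL column; fall back to the usual slot
        # if the first line is not a header.
        try:
            url_col = header.index("a") - 1  # data rows lack the CDX token
        except ValueError:
            url_col = 2
        for row in fh:
            fields = row.split()
            if len(fields) > url_col:
                yield fields[url_col]

if __name__ == "__main__":
    seen = set()  # fine for a sample; swap for sqlite / sort -u at scale
    for path in sys.argv[1:]:
        for url in urls_from_cdx(path):
            if url not in seen:
                seen.add(url)
                print(url)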
[09:48] *** GE has joined #archiveteam-bs
[09:49] yipdw: I was thinking of something as simple and stupid as looking for a regex match: https?://[^"]+
[09:49] you'll want scheme-relative URLs and relative URLs
[09:49] lots of webapps generate those
[09:50] ah, very good point -- and those are somewhat more of a PITA (although still not that bad)
[09:51] that's more like covering a case where you might have a partial failure on a single site, but I guess if you don't trust the CDXes you can't assume that either
[09:51] *** krazedkat has joined #archiveteam-bs
[09:52] come to think of it, that's not uncommon
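[editor's note] A sketch of the regex approach above, extended to the scheme-relative and relative links yipdw mentions: pull candidate link attributes out of HTML and resolve them against the page's own URL. The absolute-URL pattern is a slightly tightened version of the one quoted at [09:49]; the attribute scan is deliberately crude (href/src only), in the same "simple and stupid" spirit — a real pass would also want srcset, CSS url(...) references, and so on.

#!/usr/bin/env python3
# Sketch: harvest absolute, scheme-relative, and relative URLs from HTML.
import re
import sys
from urllib.parse import urljoin

# Absolute URLs anywhere in the text, plus href/src attribute values,
# which catch the relative and scheme-relative links that the absolute
# pattern misses.
ABSOLUTE = re.compile(r'https?://[^"\s<>]+')
ATTR = re.compile(r'(?:href|src)\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def extract_urls(html, base_url):
    urls = set(ABSOLUTE.findall(html))
    for ref in ATTR.findall(html):
        # urljoin resolves relative paths and scheme-relative //host/ links
        urls.add(urljoin(base_url, ref))
    return urls

if __name__ == "__main__":
    base = sys.argv[1]        # the page's own URL, for resolving links
    html = sys.stdin.read()
    for url in sorted(extract_urls(html, base)):
        print(url)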
[09:53] Hm, it doesn't look like the ArchiveBot JSON records whether or not offsite links were grabbed: https://ia801200.us.archive.org/14/items/archiveteam_archivebot_go_20160802210001/0cch.com-inf-20160801-163213-5t2w6.json
[09:53] those don't, the warcinfo record does
[09:54] because all we get is ShortURL = LongURL (at least I think)
[09:54] HCross: ?
[09:56] URLTeam only stores mappings from short to long URLs, yes -- but I'm still confused about how this relates...?
[09:58] specifically, if you see --span-hosts-allow page-requisites,linked-pages in the Wpull-Argv field of the warcinfo record, --no-offsite-links was off
[09:58] *** Chii has joined #archiveteam-bs
[09:58] anybody have a fave SC story (such as https://youtu.be/6r65WwO1-uA?t=462 )?
[09:58] So You Want To Murder a Software Patent Jason Scott - [51m15s] 2014-09-27 - Adrian Crenshaw - 33,605 views
[09:58] the JSON file is kind of useless for now
[09:58] ;part
[09:58] *** Chii has left
[09:59] car crash / hotel anecdote
[10:00] * ranma ducks
[10:02] yipdw: good to know about the Wpull-Argv field, though
[10:02] in a lot of cases the warcinfo record is the best place to go first
[10:03] it's the first record of each wpull WARC, so you can grab it by asking for e.g. the first 100 kB of the WARC
[10:03] there's probably a more refined way to get just the first record
[10:03] I think the only thing you *can't* get from that is the pipeline and job initiator nicks
[10:03] but really, who cares about those
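[editor's note] A sketch of yipdw's "first 100 kB" trick: gunzip the head of a wpull WARC, which starts with the warcinfo record, and scan it for the Wpull-Argv field to tell whether offsite links were grabbed. The field name and the --span-hosts-allow test follow the discussion above; the file path comes from the command line.

#!/usr/bin/env python3
# Sketch: read the warcinfo record from the head of a gzipped wpull WARC
# and check Wpull-Argv for --span-hosts-allow (i.e. offsite links on).
import gzip
import sys

def warc_head(path, limit=100 * 1024):
    # wpull WARCs put the warcinfo record first; a gzip reader decodes
    # from the start of the file, so ~100 kB decompressed is plenty.
    with gzip.open(path, "rb") as fh:
        return fh.read(limit).decode("utf-8", errors="replace")

if __name__ == "__main__":
    head = warc_head(sys.argv[1])
    # Collect every Wpull-Argv line, whether the args are on one line or
    # spread over several (an assumption about the serialization).
    argv_lines = [ln.split(":", 1)[1].strip()
                  for ln in head.splitlines()
                  if ln.lower().startswith("wpull-argv")]
    grabbed_offsite = any("--span-hosts-allow" in ln for ln in argv_lines)
    print("offsite links grabbed:", grabbed_offsite)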
[10:23] *** Somebody1 has quit IRC (Ping timeout: 370 seconds)
[10:37] *** GE has quit IRC (Remote host closed the connection)
[11:12] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[11:15] *** BartoCH has joined #archiveteam-bs
[11:48] *** Igloo has quit IRC (Ping timeout: 250 seconds)
[11:49] *** Igloo has joined #archiveteam-bs
[12:15] *** GE has joined #archiveteam-bs
[12:19] *** GE has quit IRC (Remote host closed the connection)
[12:39] *** DiscantX has quit IRC (Ping timeout: 244 seconds)
[13:04] *** DiscantX has joined #archiveteam-bs
[13:10] *** VADemon has joined #archiveteam-bs
[13:29] *** Start has joined #archiveteam-bs
[13:38] *** Start has quit IRC (Ping timeout: 506 seconds)
[14:14] *** Kaz| has joined #archiveteam-bs
[14:14] *** Kaz| has quit IRC (Client Quit)
[14:33] *** Kaz has quit IRC (Quit: boop)
[14:35] *** Kaz has joined #archiveteam-bs
[14:44] *** Kaz has quit IRC (Read error: Connection reset by peer)
[14:46] *** Kaz has joined #archiveteam-bs
[15:22] *** DiscantX has quit IRC (Ping timeout: 244 seconds)
[15:35] *** Start has joined #archiveteam-bs
[15:38] *** Start has quit IRC (Client Quit)
[15:40] *** Start has joined #archiveteam-bs
[15:40] *** Start has quit IRC (Client Quit)
[16:08] *** jspiros has quit IRC (Read error: Operation timed out)
[16:12] *** jspiros has joined #archiveteam-bs
[16:31] *** GE has joined #archiveteam-bs
[16:57] *** Somebody1 has joined #archiveteam-bs
[17:23] *** Somebody1 has quit IRC (Ping timeout: 370 seconds)
[17:38] *** vitzli has joined #archiveteam-bs
[17:49] *** Jonimus has joined #archiveteam-bs
[17:49] *** swebb sets mode: +o Jonimus
[17:50] So my workplace's hosting provider apparently screwed something up a few months back such that our website has been down for quite a while, so we said screw them and moved to a simple Squarespace site
[17:50] but it looks like the Wayback Machine does not have a complete crawl of our old website from any recent time.
[17:52] If I fudge my /etc/hosts so that I can get the old site to work and do a WARC crawl, is there any way to backdate it and get it into the Wayback Machine as if it was a few days ago?
[17:52] or is there a way to have ArchiveBot do this?
[17:54] I don't think that's allowed
[17:55] no
[17:55] you don't get to edit history
[17:56] *** VADemon has quit IRC (Read error: Connection reset by peer)
[17:57] but please do that and upload a WARC anyway. it just won't go into Wayback, but it's still good to have!
[17:58] yeah, by all means do a crawl. It's just Wayback that requires authenticity
[18:02] I figured as much, just thought I'd ask.
[18:02] oh, hi Jonimus. didn't recognize your name for some reason.
[18:05] *** vitzli has quit IRC (Quit: Leaving)
[18:06] oh, hi, yeah I lurk in #archiveteam but I figured this was more of a -bs question
[18:06] side note, what are the wget parameters I'd want for good WARC creation?
[18:08] 1 sec
[18:09] we have some suggestions here http://archiveteam.org/index.php?title=Wget
[18:09] specifically the "creating warc with wget" section
[18:10] if you're not opposed to installing new software, wpull is somewhat better than wget
[18:10] https://github.com/chfoo/wpull
[18:10] it allows you to stop and resume gracefully
[18:10] (among other things)
[18:13] the wpull args can be a lot to swallow, I use ArchiveBot as a reference
[18:13] https://github.com/ArchiveTeam/ArchiveBot/blob/master/pipeline/archivebot/seesaw/wpull.py#L22
[18:19] yeah, well, it looks like they still have broken things, because when doing wget http://ip --header "Host: actualdomain.com" I get a 500 error
[18:21] so I'm not sure I'll be able to do an actual crawl correctly, because it be broke.
[18:21] maybe it has two Host headers in the requests
[18:29] I checked and that does not appear to be the case as of wget 1.10, and this system has 1.17.1 with WARC support.
[18:29] hm ok
[18:33] yeah, it seems our old GoDaddy reseller really broke things well; there is a reason we were gonna switch..
[18:43] Well, it looks like it will be basically impossible to do this without fighting the host to fix their shit, so there goes that idea.
[18:44] I was able to make a backup of the site via FTP, but that includes private info, so I can't really upload it to archive.org without spending a bunch of time removing the private stuff.
[19:36] *** Start has joined #archiveteam-bs
[20:07] *** Aranje has joined #archiveteam-bs
[20:58] *** Start has quit IRC (Quit: Disconnected.)
[21:02] *** Start has joined #archiveteam-bs
[21:03] *** Start has quit IRC (Client Quit)
[21:16] *** Start has joined #archiveteam-bs
[21:33] *** Start has quit IRC (Quit: Disconnected.)
[21:47] *** schbirid has quit IRC (Quit: Leaving)
[22:07] *** BlueMaxim has joined #archiveteam-bs
[22:10] *** FalconK has joined #archiveteam-bs
[22:21] *** Aranje has quit IRC (Ping timeout: 260 seconds)
[22:21] *** Start has joined #archiveteam-bs
[22:22] *** Start has quit IRC (Client Quit)
[22:22] *** Madthias has joined #archiveteam-bs
[22:39] so gawker sitemap grabs are up to 2016-11-30 now
[23:24] *** Ravenloft has joined #archiveteam-bs
[23:29] *** Ravenloft has quit IRC (Ping timeout: 272 seconds)
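[editor's note] For the wget question at [18:06]: a minimal sketch of a WARC-producing wget invocation, assembled in Python for clarity. The flags are real wget options, but the selection only loosely follows the wiki page linked at [18:09]; job name, operator, and target URL are placeholders. For the dead-DNS situation at [17:52], point the hostname at the old server in /etc/hosts first, so the crawl and the WARC keep the real domain.

#!/usr/bin/env python3
# Sketch: one way to drive wget (>= 1.14, which added WARC support) for
# a WARC-producing crawl of a single site.
import subprocess

job = "example-site"          # hypothetical job name; wget appends .warc.gz
url = "http://example.com/"   # placeholder target

cmd = [
    "wget",
    "--mirror",                        # recursive, with timestamping
    "--page-requisites",               # also grab images/CSS/JS per page
    "--warc-file=" + job,              # writes example-site.warc.gz
    "--warc-cdx",                      # emit a CDX index alongside it
    "--warc-header=operator: Your Name",
    "-e", "robots=off",                # ArchiveTeam convention
    "--wait=1", "--random-wait",       # be gentle-ish with the server
    url,
]
subprocess.run(cmd, check=True)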