#archiveteam-bs 2016-12-19,Mon


Time Nickname Message
00:17 πŸ”— Mayeau is now known as Mayonaise
00:36 πŸ”— kristian_ has quit IRC (Quit: Leaving)
01:23 πŸ”— dashcloud has quit IRC (Read error: Operation timed out)
01:26 πŸ”— dashcloud has joined #archiveteam-bs
01:31 πŸ”— krazedkat has quit IRC (Quit: Leaving)
02:19 πŸ”— VADemon has quit IRC (Quit: left4dead)
03:42 πŸ”— jrwr has quit IRC (Remote host closed the connection)
04:30 πŸ”— ravetcofx has quit IRC (Ping timeout: 1208 seconds)
04:40 πŸ”— ravetcofx has joined #archiveteam-bs
04:55 πŸ”— alembic is bayimg all backed up?
05:28 πŸ”— Sk1d has quit IRC (Ping timeout: 194 seconds)
05:34 πŸ”— Sk1d has joined #archiveteam-bs
06:09 πŸ”— ravetcofx has quit IRC (Read error: Operation timed out)
06:50 πŸ”— Somebody1 has quit IRC (Read error: Operation timed out)
07:22 πŸ”— Somebody1 has joined #archiveteam-bs
08:06 πŸ”— Medowar alembic: yes
08:11 πŸ”— PurpleSym Short braindump: The URL discovery problem is going to be one of our main concerns when archiving things (cf. Dropbox), so can we build our own URL database/site explorer?
08:11 πŸ”— PurpleSym We already have a lot of data from URLTeam. And we have a lot of WARCs we could extract URLs from that were considered “off site” at the time of the crawl.
08:14 πŸ”— Start has quit IRC (Quit: Disconnected.)
08:15 πŸ”— BlueMaxim has quit IRC (Quit: Leaving)
08:21 πŸ”— schbirid has joined #archiveteam-bs
08:33 πŸ”— Somebody1 PurpleSym: I'd be delighted if someone took the initiative to build (and, even better, keep up to date) an indexed database of all the URLs in the URLTeam and ArchiveBot collections. It'd be non-trivial in size, but not *too* big, I think.
08:34 πŸ”— PurpleSym What size are we talking about here, Somebody1? GB? TB?
08:37 πŸ”— yipdw you can get an idea by downloading all the archivebot CDXes
08:38 πŸ”— yipdw (incidentally that's also a great start to an indexed database)
08:42 πŸ”— Somebody1 has quit IRC (Ping timeout: 370 seconds)
08:43 πŸ”— PurpleSym That's a good idea. I'll try to get a size estimate from the CDXes.
08:46 πŸ”— wm_ has quit IRC (Ping timeout: 260 seconds)
08:46 πŸ”— midas1 has quit IRC (Ping timeout: 260 seconds)
08:47 πŸ”— Kenshin has quit IRC (Read error: Connection reset by peer)
08:47 πŸ”— Kenshin has joined #archiveteam-bs
08:48 πŸ”— midas1 has joined #archiveteam-bs
08:51 πŸ”— Somebody1 has joined #archiveteam-bs
08:52 πŸ”— Somebody1 PurpleSym: IDK -- probably in the couple of GB range, I think...?
08:52 πŸ”— PurpleSym That sounds manageable.
08:54 πŸ”— wm_ has joined #archiveteam-bs
09:05 πŸ”— Somebody1 so the (compressed) URLTeam data is a total of 200G
09:05 πŸ”— Somebody1 but the format is silly, so storing it in a more sensible way might bring the size down
09:06 πŸ”— Somebody1 IDK how big the ArchiveBot collection is
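
A minimal sketch of the CDX approach discussed above: pull the original URLs out of one ArchiveBot CDX file. It assumes the file begins with the usual " CDX ..." header line naming the space-separated fields, with 'a' as the original-URL field; running it across all the CDXes and deduplicating would give both a first URL list and a size estimate.

    import sys

    # List the original URLs recorded in an ArchiveBot CDX file.
    # Assumes a leading " CDX ..." header line naming the fields,
    # where 'a' is the original-URL field.
    def cdx_urls(path):
        with open(path, encoding='utf-8', errors='replace') as fh:
            header = fh.readline().split()
            if not header or header[0] != 'CDX':
                raise ValueError('%s: no CDX header line' % path)
            url_col = header.index('a') - 1  # data columns follow the CDX token
            for line in fh:
                fields = line.rstrip('\n').split(' ')
                if len(fields) > url_col:
                    yield fields[url_col]

    if __name__ == '__main__':
        for path in sys.argv[1:]:
            for url in cdx_urls(path):
                print(url)
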
09:25 πŸ”— HCross ps. Well done for coming into something that's been running for years, and going "HERES A BETTER WAY.... ITS BETTER..." There is method in the madness
09:31 πŸ”— yipdw it's the same data we've always had
09:34 πŸ”— DiscantX has joined #archiveteam-bs
09:34 πŸ”— PurpleSym URLteam, yes, but we do not extract *all* URLs from WARCs, do we? The CDXes only contain URLs archivebot actually retrieved.
09:35 πŸ”— yipdw a cdx for a WARC is a full index of that WARC so uh
09:35 πŸ”— yipdw I'm not sure what the question is
09:36 πŸ”— PurpleSym The HTML inside the WARC may contain links to sites that were not crawled due to exceptions or recursion limits.
09:37 πŸ”— PurpleSym These are not in the CDX.
09:37 πŸ”— yipdw right
09:38 πŸ”— PurpleSym These are the URLs I was talking about.
09:38 πŸ”— yipdw but that's fine because that URL doesn't exist in the WARC
09:39 πŸ”— yipdw oh ok
09:40 πŸ”— yipdw I mean, there's ways of varying accuracy to get those URLs
09:47 πŸ”— Somebody1 HCross: which topic was your comment in reference to?
09:48 πŸ”— GE has joined #archiveteam-bs
09:49 πŸ”— Somebody1 yipdw: I was thinking something as simple and stupid as grepping with a regex: https?://[^"]+
09:49 πŸ”— yipdw you'll want scheme-relative URLs and relative URLs
09:49 πŸ”— yipdw lots of webapps generate those
09:50 πŸ”— Somebody1 ah, very good point -- and those are somewhat more of a PITA (although still not that bad)
09:51 πŸ”— yipdw that's more like covering a case where you might have a partial failure on a single site but I guess if you don't trust the CDXes you can't assume that either
09:51 πŸ”— krazedkat has joined #archiveteam-bs
09:52 πŸ”— yipdw come to think of it, that's not uncommon
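
For the uncrawled outlinks PurpleSym describes, the HTML payloads inside the WARC have to be parsed. A rough sketch using the warcio library (an assumption; any WARC reader would do) that applies the attribute-regex idea from the discussion, with urljoin resolving the scheme-relative and relative URLs yipdw mentions:

    import re
    import sys
    from urllib.parse import urljoin

    from warcio.archiveiterator import ArchiveIterator  # pip install warcio

    # Crude href/src grabber: misses script-generated links, but catches
    # absolute, scheme-relative (//host/...) and relative forms.
    LINK_RE = re.compile(rb'(?:href|src)\s*=\s*["\']([^"\'<>]+)', re.I)

    def warc_outlinks(path):
        with open(path, 'rb') as fh:
            for record in ArchiveIterator(fh):
                if record.rec_type != 'response' or record.http_headers is None:
                    continue
                ctype = record.http_headers.get_header('Content-Type') or ''
                if 'html' not in ctype.lower():
                    continue
                base = record.rec_headers.get_header('WARC-Target-URI') or ''
                body = record.content_stream().read()
                for m in LINK_RE.finditer(body):
                    raw = m.group(1).decode('utf-8', 'replace')
                    yield urljoin(base, raw)  # resolves // and relative links

    if __name__ == '__main__':
        for url in sorted(set(warc_outlinks(sys.argv[1]))):
            print(url)
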
09:53 πŸ”— Somebody1 Hm, it doesn't look like the ArchiveBot JSON records whether or not offsite-links were grabbed: https://ia801200.us.archive.org/14/items/archiveteam_archivebot_go_20160802210001/0cch.com-inf-20160801-163213-5t2w6.json
09:53 πŸ”— yipdw those don't, the warcinfo record does
09:54 πŸ”— HCross because all we get is ShortURL = LongURL (at least I think)
09:54 πŸ”— Somebody1 HCross: ?
09:56 πŸ”— Somebody1 URLteam does only store mappings from short to long URLs, yes -- but I'm still confused about how this relates...?
09:58 πŸ”— yipdw specifically, if you see --span-hosts-allow page-requisites,linked-pages in the Wpull-Argv field of the warcinfo record, --no-offsite-links was off
09:58 πŸ”— Chii has joined #archiveteam-bs
09:58 πŸ”— ranma anybody have a fave SC story (such as https://youtu.be/6r65WwO1-uA?t=462 )?
09:58 πŸ”— Chii So You Want To Murder a Software Patent Jason Scott - [51m15s] 2014-09-27 - Adrian Crenshaw - 33,605 views
09:58 πŸ”— yipdw the JSON file is kind of useless for now
09:58 πŸ”— ranma ;part
09:58 πŸ”— Chii has left
09:59 πŸ”— ranma car crash / hotel anecdote
10:00 πŸ”— * ranma ducks
10:02 πŸ”— Somebody1 yipdw: good to know about the Wpull-Argv field though
10:02 πŸ”— yipdw in a lot of cases the warcinfo record is the best place to go first
10:03 πŸ”— yipdw it's the first record of each wpull warc, so you can grab it by asking for e.g. the first 100 kB of the WARC
10:03 πŸ”— yipdw there's probably a more refined way to get just the first record
10:03 πŸ”— yipdw I think the only thing you *can't* get from that is the pipeline and job initiator nicks
10:03 πŸ”— yipdw but really who cares about those
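
yipdw's ranged-request trick, sketched under the same assumptions (warcio plus requests; neither is ArchiveBot tooling): fetch roughly the first 100 kB of a WARC, parse only the leading warcinfo record, and inspect the wpull arguments recorded there.

    import io
    import sys

    import requests  # pip install requests
    from warcio.archiveiterator import ArchiveIterator

    # The warcinfo record is the first record of a wpull WARC, so a
    # ranged GET for the first ~100 kB is enough to read it.
    def warcinfo_text(warc_url, nbytes=100 * 1024):
        resp = requests.get(warc_url,
                            headers={'Range': 'bytes=0-%d' % (nbytes - 1)})
        resp.raise_for_status()
        for record in ArchiveIterator(io.BytesIO(resp.content)):
            if record.rec_type == 'warcinfo':
                return record.content_stream().read().decode('utf-8', 'replace')
            break  # first record was not warcinfo; give up
        return None

    if __name__ == '__main__':
        info = warcinfo_text(sys.argv[1])
        print(info)
        # Per the discussion above: linked-pages in the wpull argv means
        # --no-offsite-links was off, i.e. offsite links were grabbed.
        if info and 'linked-pages' in info:
            print('offsite links were probably grabbed')
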
10:23 πŸ”— Somebody1 has quit IRC (Ping timeout: 370 seconds)
10:37 πŸ”— GE has quit IRC (Remote host closed the connection)
11:12 πŸ”— BartoCH has quit IRC (Ping timeout: 260 seconds)
11:15 πŸ”— BartoCH has joined #archiveteam-bs
11:48 πŸ”— Igloo has quit IRC (Ping timeout: 250 seconds)
11:49 πŸ”— Igloo has joined #archiveteam-bs
12:15 πŸ”— GE has joined #archiveteam-bs
12:19 πŸ”— GE has quit IRC (Remote host closed the connection)
12:39 πŸ”— DiscantX has quit IRC (Ping timeout: 244 seconds)
13:04 πŸ”— DiscantX has joined #archiveteam-bs
13:10 πŸ”— VADemon has joined #archiveteam-bs
13:29 πŸ”— Start has joined #archiveteam-bs
13:38 πŸ”— Start has quit IRC (Ping timeout: 506 seconds)
14:14 πŸ”— Kaz| has joined #archiveteam-bs
14:14 πŸ”— Kaz| has quit IRC (Client Quit)
14:33 πŸ”— Kaz has quit IRC (Quit: boop)
14:35 πŸ”— Kaz has joined #archiveteam-bs
14:44 πŸ”— Kaz has quit IRC (Read error: Connection reset by peer)
14:46 πŸ”— Kaz has joined #archiveteam-bs
15:22 πŸ”— DiscantX has quit IRC (Ping timeout: 244 seconds)
15:35 πŸ”— Start has joined #archiveteam-bs
15:38 πŸ”— Start has quit IRC (Client Quit)
15:40 πŸ”— Start has joined #archiveteam-bs
15:40 πŸ”— Start has quit IRC (Client Quit)
16:08 πŸ”— jspiros has quit IRC (Read error: Operation timed out)
16:12 πŸ”— jspiros has joined #archiveteam-bs
16:31 πŸ”— GE has joined #archiveteam-bs
16:57 πŸ”— Somebody1 has joined #archiveteam-bs
17:23 πŸ”— Somebody1 has quit IRC (Ping timeout: 370 seconds)
17:38 πŸ”— vitzli has joined #archiveteam-bs
17:49 πŸ”— Jonimus has joined #archiveteam-bs
17:49 πŸ”— swebb sets mode: +o Jonimus
17:50 πŸ”— Jonimus So my workplace's hosting provider apparently screwed something up a few months back, such that our website has been down for quite a while, so we said screw them and moved to a simple Squarespace site
17:50 πŸ”— Jonimus but it looks like the wayback machine does not have a complete crawl of our old website from any recent time.
17:52 πŸ”— Jonimus If I fudge my /etc/hosts so that I can get the old site to work and do a WARC crawl, is there any way to backdate it and get it into the wayback machine as if it was crawled a few days ago?
17:52 πŸ”— Jonimus or is there a way to have archivebot do this?
17:54 πŸ”— Frogging I don't think that's allowed
17:55 πŸ”— Smiley no
17:55 πŸ”— Smiley you don't get to edit history
17:56 πŸ”— VADemon has quit IRC (Read error: Connection reset by peer)
17:57 πŸ”— xmc but please do that and upload a warc anyway. it just won't go into wayback, but it's still good to have!
17:58 πŸ”— Frogging yeah, by all means do a crawl. It's just Wayback that requires authenticity
18:02 πŸ”— Jonimus I figured as much, just thought I'd ask.
18:02 πŸ”— xmc oh, hi Jonimus. didn't recognize your name for some reason.
18:05 πŸ”— vitzli has quit IRC (Quit: Leaving)
18:06 πŸ”— Jonimus oh, hi, yeah I lurk in #archiveteam but I figured this was more of a -bs question
18:06 πŸ”— Jonimus side note, what are the wget parameters I'd want for good WARC creation?
18:08 πŸ”— xmc 1 sec
18:09 πŸ”— xmc we have some suggestions here http://archiveteam.org/index.php?title=Wget
18:09 πŸ”— xmc specifically the "creating warc with wget" section
18:10 πŸ”— xmc if you're not opposed to installing new software, wpull is somewhat better than wget
18:10 πŸ”— xmc https://github.com/chfoo/wpull
18:10 πŸ”— xmc it allows you to stop and resume gracefully
18:10 πŸ”— xmc (among other things)
18:13 πŸ”— Frogging the wpull args can be a lot to swallow, I use ArchiveBot as a reference
18:13 πŸ”— Frogging https://github.com/ArchiveTeam/ArchiveBot/blob/master/pipeline/archivebot/seesaw/wpull.py#L22
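
For reference, one plausible invocation of the kind the wiki page describes; actualdomain.com stands in for the real site, and the exact flags should be checked against the "creating warc with wget" section mentioned above:

    wget --recursive --level=inf --page-requisites --no-parent \
         --warc-file=actualdomain.com --warc-cdx \
         --wait=1 --random-wait -e robots=off \
         "http://actualdomain.com/"
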
18:19 πŸ”— Jonimus yeah well it looks like they still have broken things because when doing wget http://ip --header "Host: actualdomain.com" I get a 500 error
18:21 πŸ”— Jonimus so I'm not sure I'll be able to do an actual crawl correctly, because it's broken.
18:21 πŸ”— xmc maybe it has two Host headers in the requests
18:29 πŸ”— Jonimus I checked and that does not appear to be the case as of wget 1.10, and this system has 1.17.1 with WARC support.
18:29 πŸ”— xmc hm ok
18:33 πŸ”— Jonimus yeah, it seems our old GoDaddy reseller really broke things well; there is a reason we were gonna switch..
18:43 πŸ”— Jonimus Well it looks like it will be basically impossible to do this without fighting the host to fix their shit, so there goes that idea.
18:44 πŸ”— Jonimus I was able to make a backup of the site via FTP, but that includes private info, so I can't really upload it to archive.org without spending a bunch of time removing the private stuff.
19:36 πŸ”— Start has joined #archiveteam-bs
20:07 πŸ”— Aranje has joined #archiveteam-bs
20:58 πŸ”— Start has quit IRC (Quit: Disconnected.)
21:02 πŸ”— Start has joined #archiveteam-bs
21:03 πŸ”— Start has quit IRC (Client Quit)
21:16 πŸ”— Start has joined #archiveteam-bs
21:33 πŸ”— Start has quit IRC (Quit: Disconnected.)
21:47 πŸ”— schbirid has quit IRC (Quit: Leaving)
22:07 πŸ”— BlueMaxim has joined #archiveteam-bs
22:10 πŸ”— FalconK has joined #archiveteam-bs
22:21 πŸ”— Aranje has quit IRC (Ping timeout: 260 seconds)
22:21 πŸ”— Start has joined #archiveteam-bs
22:22 πŸ”— Start has quit IRC (Client Quit)
22:22 πŸ”— Madthias has joined #archiveteam-bs
22:39 πŸ”— godane so Gawker sitemap grabs are up to 2016-11-30 now
23:24 πŸ”— Ravenloft has joined #archiveteam-bs
23:29 πŸ”— Ravenloft has quit IRC (Ping timeout: 272 seconds)
