#archiveteam-bs 2018-09-15,Sat


Time	Nickname	Message
00:05 ^🔗		BlueMax has joined #archiveteam-bs
00:13 ^🔗		Flashfire has joined #archiveteam-bs
00:17 ^🔗		Mayeau is now known as Mayonaise
00:25 ^🔗		coldice has joined #archiveteam-bs
00:32 ^🔗	JAA	coldice: Have a look at our wiki. It contains a wealth of information on archival.
00:32 ^🔗	JAA	Generally speaking, you'll want to archive websites in the WARC format, which preserves request and response entirely (including HTTP headers) and also contains relevant metadata.
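As a rough illustration of what "preserves request and response entirely" means, here is a minimal Python sketch that wraps raw HTTP response bytes in a single WARC/1.0 `response` record. The function name `build_warc_response_record` is hypothetical and used only for illustration. Real tools such as wpull also write request, warcinfo, and metadata records, plus digests and compression, none of which is shown here.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri: str, http_response: bytes) -> bytes:
    """Wrap raw HTTP response bytes (status line, headers, body) in one WARC/1.0 record.

    Hypothetical helper for illustration; not how wpull or wget implement it.
    """
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {warc_date}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        # Content-Length counts the bytes of the record block, i.e. the full HTTP response.
        f"Content-Length: {len(http_response)}",
    ]
    # The header lines end with a blank line; the block is followed by two CRLFs.
    header_block = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    return header_block + http_response + b"\r\n\r\n"


raw = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhi"
record = build_warc_response_record("http://example.com/", raw)
```

The HTTP headers sit inside the record block untouched, which is why a WARC can later be replayed as if the original server had answered.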
00:34 ^🔗	JAA	There are several tools and approaches to do this. The one we use most of the time (including through ArchiveBot and the warrior project) is a crawler like wpull or wget. This works pretty well for most sites. The major exception is websites that make heavy use of JavaScript.
00:35 ^🔗	coldice	So old websites from before 2010 are safe to wpull?
00:36 ^🔗	coldice	Anything else through PhantomJS or something?
00:36 ^🔗	JAA	Even modern sites might work fine with wpull. It really just depends on how the site is built.
00:37 ^🔗	JAA	If the site's browsable with JS disabled in the browser, then it will usually work fine with those crawlers.
00:37 ^🔗	JAA	PhantomJS doesn't work very well.
00:38 ^🔗	JAA	We don't really have a proper solution for JS-heavy websites yet. It's quite a tricky problem, especially when links aren't even real links, clicks get hijacked, etc.
00:39 ^🔗	JAA	You can always archive that stuff through a browser using a proxy that writes everything to WARC, e.g. warcprox. But that doesn't necessarily mean that it can also be played back later.
00:40 ^🔗	JAA	And it's not well automatable in the general case. So you typically need to write custom code for each such site you want to grab.
00:43 ^🔗	coldice	Alright, to get started I need https://github.com/ludios/grab-site right?
00:44 ^🔗	coldice	Unless I want to join the pool
00:44 ^🔗	JAA	Yeah, that's one way. grab-site is a wrapper around wpull to make it easier to use.
00:45 ^🔗	coldice	Btw, is there a list of archived websites? I can't seem to find it on the wiki
00:45 ^🔗	JAA	That would be a long list.
00:46 ^🔗	kiska	A very long list
00:46 ^🔗	kiska	From #archivebot Major: Job status: 95273 completed
00:48 ^🔗	coldice	So the data is archived, but not available? Am I missing something?
00:48 ^🔗	JAA	All our data is uploaded to the Internet Archive and included in the Wayback Machine.
00:49 ^🔗	JAA	https://archive.org/details/archiveteam
01:02 ^🔗		sknebel has quit IRC (Quit: No Ping reply in 180 seconds.)
01:05 ^🔗		sknebel has joined #archiveteam-bs
01:44 ^🔗		BlueMax has quit IRC (Quit: Leaving)
01:46 ^🔗		BlueMax has joined #archiveteam-bs
02:08 ^🔗	coldice	Thanks for your help JAA, my grabber is working fine. https://i.imgur.com/EmB3bQY.png - I got a few TB of storage, which should get me pretty far....
02:09 ^🔗	JAA	Happy to help. :-)
02:10 ^🔗		Odd0002 has quit IRC (Read error: Operation timed out)
02:17 ^🔗		Odd0002 has joined #archiveteam-bs
02:55 ^🔗	ivan	coldice: you can set up grab-site and an uploader to upload and remove WARCs before the crawls finish
02:56 ^🔗	ivan	the grab-site component is --finished-warc-dir= and the uploader can be something like https://gist.github.com/ivan/079530350ac94851d581b55b1d372440 for IA
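The upload-and-remove loop ivan describes could look roughly like the Python sketch below. It assumes finished WARCs land in the directory passed to `--finished-warc-dir=` with a `.warc.gz` suffix. `drain_finished_warcs` and the `upload` callback are hypothetical names; the actual upload to IA (for example the script in the linked gist) is deliberately left as a placeholder.

```python
from pathlib import Path
from typing import Callable, List


def drain_finished_warcs(finished_dir: Path, upload: Callable[[Path], bool]) -> List[str]:
    """Upload every finished WARC and delete it locally only if the upload succeeded.

    `upload` is a placeholder for whatever uploader is used; it must return True on success.
    """
    uploaded = []
    for warc in sorted(finished_dir.glob("*.warc.gz")):
        if upload(warc):
            warc.unlink()  # free disk space only after a confirmed upload
            uploaded.append(warc.name)
    return uploaded
```

Run periodically (for example from a loop with a sleep, or from cron), something like this keeps disk usage bounded while a long crawl is still in progress.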
03:02 ^🔗		bitBaron has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…)
03:06 ^🔗		odemg has quit IRC (Ping timeout: 260 seconds)
03:27 ^🔗		bitBaron has joined #archiveteam-bs
03:45 ^🔗	coldice	Anyone.. think my grab-site is running off... seeing a lot of requests to https://static.xx.fbcdn.net/rsrc.php/* - should that be in the ignore pattern?
03:45 ^🔗	FlashBack	All good
03:45 ^🔗	Flashfire	coldice: it's Facebook JavaScript crap
03:45 ^🔗	coldice	Whelp, a lot of it too
03:45 ^🔗	Flashfire	grabbing it has no harm at all but feel free to ignore it as well
03:46 ^🔗	coldice	Is it possible for me to interact with the script too like the IRC bot? Just command-line wise
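For anyone wanting to check what an ignore pattern would catch before adding it, here is a small Python sketch. It assumes ignore patterns are regular expressions searched against the full URL; the exact matching rules in grab-site may differ, and the pattern below is only an example built from the URL mentioned above.

```python
import re
from typing import Iterable


def should_ignore(url: str, ignore_patterns: Iterable[str]) -> bool:
    """Return True if any pattern matches somewhere in the URL (assumed semantics)."""
    return any(re.search(pattern, url) for pattern in ignore_patterns)


# Example pattern for the Facebook static resources; dots are escaped so they match literally.
patterns = [r"^https?://static\.xx\.fbcdn\.net/rsrc\.php/"]

print(should_ignore("https://static.xx.fbcdn.net/rsrc.php/v3/abc.js", patterns))
print(should_ignore("https://example.com/forum/index.php", patterns))
```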
03:47 ^🔗	Flashfire	No clue with grab site
03:58 ^🔗	coldice	JAA, may I know what you use in custom scripts to scrape websites for archive? Scrapy?
04:23 ^🔗		ndiddy has quit IRC (Read error: Operation timed out)
04:27 ^🔗		bitBaron has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…)
04:30 ^🔗	Raccoon	I just saw the bot's link to a wiki of ISP Hosts. Maybe somebody would similarly find this list interesting. https://gist.github.com/a-raccoon/15c55e8d4048bb120b56
04:38 ^🔗		faoling__ has joined #archiveteam-bs
04:42 ^🔗		Pixi` has joined #archiveteam-bs
04:44 ^🔗		faolingf_ has quit IRC (Ping timeout: 360 seconds)
04:47 ^🔗		dxrt has quit IRC (Read error: Operation timed out)
04:47 ^🔗		dxrt has joined #archiveteam-bs
04:47 ^🔗		Atom-- has joined #archiveteam-bs
04:48 ^🔗		Frogging has quit IRC (Read error: Operation timed out)
04:48 ^🔗		Frogging has joined #archiveteam-bs
04:48 ^🔗		twigfoot has quit IRC (Ping timeout: 360 seconds)
04:48 ^🔗		Pixi has quit IRC (Read error: Operation timed out)
04:49 ^🔗		underscor has quit IRC (Ping timeout: 360 seconds)
04:49 ^🔗		underscor has joined #archiveteam-bs
04:49 ^🔗		svchfoo1 sets mode: +o underscor
04:50 ^🔗		arkiver has quit IRC (Read error: Operation timed out)
04:50 ^🔗		superkuh has quit IRC (Excess Flood)
04:51 ^🔗		twigfoot has joined #archiveteam-bs
04:51 ^🔗		betamax_ has joined #archiveteam-bs
04:52 ^🔗		swebb has quit IRC (Ping timeout: 360 seconds)
04:52 ^🔗		Somebody2 has quit IRC (Ping timeout: 360 seconds)
04:52 ^🔗		unlobito has quit IRC (Ping timeout: 360 seconds)
04:52 ^🔗		unlobito has joined #archiveteam-bs
04:52 ^🔗		swebb has joined #archiveteam-bs
04:52 ^🔗		svchfoo1 sets mode: +o swebb
04:53 ^🔗		Cameron_D has quit IRC (Read error: Operation timed out)
04:53 ^🔗		sknebel_ has joined #archiveteam-bs
04:54 ^🔗		arkiver has joined #archiveteam-bs
04:54 ^🔗		Darkstar has quit IRC (Read error: Connection reset by peer)
04:54 ^🔗		Cameron_D has joined #archiveteam-bs
04:55 ^🔗		Somebody2 has joined #archiveteam-bs
04:55 ^🔗		godane has quit IRC (Read error: Operation timed out)
04:56 ^🔗		twigfoot has quit IRC (Read error: Operation timed out)
04:56 ^🔗		betamax has quit IRC (Read error: Operation timed out)
04:57 ^🔗		Atom has quit IRC (Read error: Operation timed out)
04:57 ^🔗		godane has joined #archiveteam-bs
04:58 ^🔗		svchfoo1 sets mode: +o godane
04:58 ^🔗		Yurume has joined #archiveteam-bs
04:59 ^🔗		astrid has quit IRC (Read error: Operation timed out)
05:00 ^🔗		twigfoot has joined #archiveteam-bs
05:00 ^🔗		Cameron_D has quit IRC (Ping timeout: 360 seconds)
05:01 ^🔗		Cameron_D has joined #archiveteam-bs
05:02 ^🔗		Somebody2 has quit IRC (Ping timeout: 360 seconds)
05:02 ^🔗		phirephl- has quit IRC (Ping timeout: 360 seconds)
05:02 ^🔗	godane	SketchCow: any news?
05:04 ^🔗		astrid has joined #archiveteam-bs
05:04 ^🔗		swebb sets mode: +o astrid
05:04 ^🔗		MrRadar has quit IRC (Read error: Operation timed out)
05:05 ^🔗		Darkstar has joined #archiveteam-bs
05:06 ^🔗		sknebel has quit IRC (Read error: Operation timed out)
05:07 ^🔗		twigfoot has quit IRC (Read error: Operation timed out)
05:07 ^🔗		twigfoot has joined #archiveteam-bs
05:07 ^🔗		Yurume_ has quit IRC (Read error: Operation timed out)
05:08 ^🔗		zino_ has quit IRC (Excess Flood)
05:11 ^🔗		MrRadar has joined #archiveteam-bs
05:11 ^🔗		superkuh has joined #archiveteam-bs
05:12 ^🔗		phirephly has joined #archiveteam-bs
05:12 ^🔗		Darkstar has quit IRC (Read error: Connection reset by peer)
05:13 ^🔗		Somebody2 has joined #archiveteam-bs
05:15 ^🔗		zino has joined #archiveteam-bs
05:15 ^🔗		Darkstar has joined #archiveteam-bs
05:25 ^🔗	hook54321	JAA: Have we started grabbing XUL addons from addons.mozilla.org? The deadline is "early October, 2018"
05:36 ^🔗		m007a83 has quit IRC (Fuck you Comcast)
05:55 ^🔗		HCross has quit IRC (Ping timeout: 268 seconds)
05:56 ^🔗		HCross has joined #archiveteam-bs
05:56 ^🔗		HCross has quit IRC (Excess Flood)
05:57 ^🔗		Yurume has quit IRC (Ping timeout: 268 seconds)
05:57 ^🔗		TC04 has quit IRC (Ping timeout: 268 seconds)
05:57 ^🔗		svchfoo1 has quit IRC (Ping timeout: 268 seconds)
05:57 ^🔗		TC01 has joined #archiveteam-bs
05:57 ^🔗		Yurume has joined #archiveteam-bs
05:57 ^🔗		kiskabak2 has quit IRC (Ping timeout: 268 seconds)
05:58 ^🔗		Kaz has quit IRC (Ping timeout: 268 seconds)
06:02 ^🔗		betamax_ has quit IRC (Ping timeout: 268 seconds)
06:02 ^🔗		betamax has joined #archiveteam-bs
06:14 ^🔗		BlueMax has quit IRC (Quit: Leaving)
06:26 ^🔗		dxrt_ has joined #archiveteam-bs
06:28 ^🔗		sec^nd has quit IRC (Quit: ZNC 1.6.5 - http://znc.in)
06:36 ^🔗		second has joined #archiveteam-bs
06:43 ^🔗		BlueMax has joined #archiveteam-bs
06:55 ^🔗		HCross has joined #archiveteam-bs
07:01 ^🔗		erin has joined #archiveteam-bs
07:55 ^🔗		svchfoo1 has joined #archiveteam-bs
07:55 ^🔗		svchfoo3 sets mode: +o svchfoo1
08:14 ^🔗		kiskabak2 has joined #archiveteam-bs
08:30 ^🔗		coldice_ has joined #archiveteam-bs
08:35 ^🔗		coldice has quit IRC (Read error: Operation timed out)
09:31 ^🔗		BartoCH has quit IRC (Quit: WeeChat 2.2)
09:31 ^🔗		BartoCH has joined #archiveteam-bs
09:39 ^🔗		faoling__ is now known as faolingfa
09:53 ^🔗		faolingfa has quit IRC (Leaving)
10:20 ^🔗	coldice_	Flashfire, Yea, the part about uploading the WARC file I'm not quite sure about yet
10:21 ^🔗	Flashfire	Yeah, I am not so great with that. You are going to need to ask someone else for help, I am sorry
10:50 ^🔗		Kaz has joined #archiveteam-bs
11:02 ^🔗	JAA	coldice_: I'm using custom code on top of a modified version of aiohttp when it has to be fast and can easily be split up into individual work items. If I just want to do a recursive crawl, I use wpull.
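The "individual work items" approach can be sketched in Python with plain asyncio. This is not JAA's code: `process_items` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real HTTP request (which in practice might use aiohttp) so that the sketch runs without network access.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable


async def process_items(
    items: Iterable[str],
    handler: Callable[[str], Awaitable[int]],
    concurrency: int = 4,
) -> Dict[str, int]:
    """Run `handler` over all items with at most `concurrency` in flight at once."""
    queue: asyncio.Queue = asyncio.Queue()
    for item in items:
        queue.put_nowait(item)
    results: Dict[str, int] = {}

    async def worker() -> None:
        # Each worker pulls items until the queue is empty, then exits.
        while True:
            try:
                item = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            results[item] = await handler(item)

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return results


async def fake_fetch(item: str) -> int:
    """Placeholder for a real request; yields to the event loop and returns a dummy value."""
    await asyncio.sleep(0)
    return len(item)


results = asyncio.run(process_items(["thread/1", "thread/2", "user/42"], fake_fetch, concurrency=2))
```

Because each item is independent, the concurrency level can be raised or the item list split across machines without changing the handler.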
11:02 ^🔗		SimpBrain has quit IRC (Read error: Operation timed out)
11:03 ^🔗	JAA	hook54321: The warrior project isn't started yet, but arkiver said it should be ready soon. I grabbed all Firefox addons yesterday, and I'll grab the Thunderbird and Seamonkey ones today. But I'm only grabbing the actual .xpi (and occasionally .zip) files, not the web page; the latter is also very important since it contains description, screenshots, metadata, changelogs, license information, etc.
11:03 ^🔗	JAA	-> #outofammo
12:14 ^🔗		Mateon1 has quit IRC (Read error: Operation timed out)
12:16 ^🔗		Mateon1 has joined #archiveteam-bs
12:19 ^🔗		TC01 has quit IRC (Read error: Operation timed out)
12:23 ^🔗		TC01 has joined #archiveteam-bs
12:38 ^🔗		chferfa has quit IRC ()
13:02 ^🔗		coldice_ is now known as coldice
13:04 ^🔗	coldice	Oops, turns out the site I was crawling requires login to access the forum part.... anyone know how to crawl a site behind a login? :|
13:25 ^🔗		BlueMax has quit IRC (Read error: Connection reset by peer)
13:47 ^🔗		m007a83 has joined #archiveteam-bs
13:50 ^🔗	coldice	Anyone, grab-site has hit a nationalgeographic url and doesn't proceed... think it's stuck
13:51 ^🔗	coldice	can I stop and continue where it left off or something?
14:08 ^🔗		wp494 has quit IRC (Read error: Operation timed out)
14:09 ^🔗		wp494 has joined #archiveteam-bs
14:11 ^🔗		Atom__ has joined #archiveteam-bs
14:14 ^🔗		Atom-- has quit IRC (Read error: Operation timed out)
14:20 ^🔗	mr_archiv	@coldice, manually log in using a web browser, note the cookie(s) it sets and their values, and send them as part of each request with the web scraper you are using.
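In a custom Python scraper, the cookie approach might look like the sketch below, using only the standard library. The cookie names and values are made up; the real ones have to be copied from the browser after logging in, and `build_authenticated_request` is a hypothetical helper name.

```python
from typing import Dict
from urllib.request import Request


def build_authenticated_request(url: str, cookies: Dict[str, str]) -> Request:
    """Attach browser-copied session cookies to a request via the Cookie header."""
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return Request(url, headers={"Cookie": cookie_header})


# Made-up example values; substitute the cookies your browser received after login.
session_cookies = {"sessionid": "abc123", "csrftoken": "xyz"}
request = build_authenticated_request("https://example.com/forum/", session_cookies)
print(request.get_header("Cookie"))
# The request could then be sent with urllib.request.urlopen(request).
```

Session cookies usually expire, so a long crawl may need the values refreshed partway through.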
15:32 ^🔗		odemg has joined #archiveteam-bs
15:45 ^🔗		zhongfu has quit IRC (Ping timeout: 260 seconds)
16:14 ^🔗		zhongfu has joined #archiveteam-bs
16:42 ^🔗		zhongfu has quit IRC (Ping timeout: 260 seconds)
18:17 ^🔗		Mateon1 has quit IRC (Quit: Mateon1)
18:18 ^🔗		Mateon1 has joined #archiveteam-bs
19:03 ^🔗	godane	so i got a beta player at Savers for $8
19:03 ^🔗	godane	i will have to see what works but i did test in store and it does power one
19:03 ^🔗	godane	*on
19:42 ^🔗		RichardG_ has quit IRC (Read error: Operation timed out)
19:43 ^🔗		ndiddy has joined #archiveteam-bs
19:59 ^🔗	godane	so tape will not load
19:59 ^🔗	godane	figures
20:05 ^🔗	godane	i'm digitizing a tape called 'The Valley of Miracles'
20:19 ^🔗	godane	this is a vhs tape i bought from savers
20:19 ^🔗	godane	the only thing that i would think needs to be digitized maybe
20:34 ^🔗	Raccoon	what sort of tapes do you like to digitize.
20:36 ^🔗	Raccoon	I have a bunch of VHS from our wildlife refuge I was about to toss, because my bitch cat peed in the box (destroying them with smell even if I cleaned them well). But the cassettes themselves were undamaged.
20:36 ^🔗	Raccoon	they're either visitor education films or wildlife management and heavy machinery crew instructional videos.
20:37 ^🔗	Raccoon	Bobcat and John Deere brand training
20:45 ^🔗		atluxity has quit IRC (Be the person your dog think you are.)
20:56 ^🔗	ivan	coldice: add nationalgeographic to ignores and raise the concurrency
20:56 ^🔗	ivan	coldice: it'll resume soon enough
20:58 ^🔗		RichardG has joined #archiveteam-bs
21:04 ^🔗	godane	Raccoon: i'm not taking cat peed on tapes if possible
21:04 ^🔗	godane	at least i would like to see pictures of the tapes first
21:05 ^🔗	Raccoon	thought not :) I mean, the tapes are clean, but the pretty case cover art can't be salvaged except for maybe a photograph
21:06 ^🔗	Raccoon	boring stuff about birds and sandhill cranes anyway
21:06 ^🔗	Raccoon	riogrande
21:11 ^🔗		RichardG has quit IRC (Ping timeout: 246 seconds)
21:29 ^🔗	godane	i'm doing 'from boxing to ballet' tape
21:29 ^🔗	godane	its pushing 10Mbits
22:22 ^🔗		coldice has quit IRC (Read error: Operation timed out)
23:13 ^🔗		BlueMax has joined #archiveteam-bs
