[09:59] <Sum1> Is anyone active currently?
[10:10] <Tomcat_> define "active" ;)
[10:11] <Sum1> Well, helllo :)
[10:11] <Tomcat_> I'm awake and working, but not doing anything archive-related.
[10:11] <Sum1> I see, I was just wondering what would be the best way to make a superficial archive of an entire forum.
[10:12] <Tomcat_> I'm not into the technical details of the whole archiving operation, but I'm pretty sure there are people here who know how to do that.
[10:13] <Tomcat_> http://archiveteam.org/index.php?title=Software
[10:15] <Sum1> Do you know if these scrapers pick up where they leave off? E.g. if I close the app for the day and come back, will it continue?
[10:17] <Sum1> Ahh, read HTTrack's Wikipedia page and it seems I can, brilliant. Will read up more, thanks :)
[10:17] <Tomcat_> Most do. ;)
[10:26] <ersi> wget won't.
[10:26] <ersi> I wouldn't say "most do", I'd say "test it"
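Whether a crawl can resume varies by tool, which is why "test it" is the right advice. The mechanism behind tools that do resume is simple, though: checkpoint the pending queue and the set of already-fetched URLs, and reload both on startup. A minimal sketch (the state-file name, seed URL, and `fetch` callback are all hypothetical, not any real scraper's API):

```python
import json
import os

STATE_FILE = "crawl_state.json"  # hypothetical checkpoint file

def load_state():
    """Reload the pending queue and the set of already-fetched URLs."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            state = json.load(f)
        return state["pending"], set(state["done"])
    return ["http://example.com/"], set()  # fresh crawl: seed URL only

def save_state(pending, done):
    """Persist progress so a later run picks up where this one left off."""
    with open(STATE_FILE, "w") as f:
        json.dump({"pending": pending, "done": sorted(done)}, f)

def crawl(fetch, max_pages=10):
    """Fetch pages until the queue empties; safe to interrupt and rerun.

    fetch(url) is a caller-supplied callback returning discovered links.
    """
    pending, done = load_state()
    while pending and max_pages > 0:
        url = pending.pop(0)
        if url in done:
            continue
        links = fetch(url)
        done.add(url)
        pending.extend(l for l in links if l not in done)
        save_state(pending, done)  # checkpoint after every page
        max_pages -= 1
    return done
```

Killing the process between pages loses at most the page in flight; the next run reloads the checkpoint and continues.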
[10:27] <Sum1> Has anyone used SiteSucker for OS X?
[10:27] <ersi> People here mostly use wget and/or HTTrack
[10:28] <Sum1> I've used it once or twice for small sites, but was wondering if it would be feasible for larger sites. Primarily using OS X for this.
[10:30] <Sum1> Mmm, maybe I'll get a Windows user to help with the scraping then.
[10:31] <ersi> Sounds horrible. OS X and Windows for archiving. :)
[10:31] <ersi> Unless you archive straight into WARC output without touching the filthy filesystems of OS X or Windows
[10:32] <ersi> I'd wholeheartedly recommend using something that can produce WARC output (i.e. save to the 'WARC format'), since then it'll be of use to the Internet Archive (and plenty of other archival organisations)
[10:37] <Sum1> I'll look into it; I'm guessing it's some kind of container format for archives?
[10:39] <Sum1> btw, might be afk for a bit soonish
[10:53] <ersi> I think HTTrack has support. I know wget *has* support
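For wget, the WARC support mentioned here is the `--warc-file` option (wget appends `.warc.gz` to the name itself). A small helper that assembles such a command line, as a sketch to adapt rather than a complete recipe; the prefix `forum` and the example URL are illustrative:

```python
def wget_warc_command(url, warc_name):
    """Build a wget invocation that mirrors a site while writing a WARC.

    wget appends .warc.gz to warc_name on its own, so pass a bare prefix.
    """
    return [
        "wget",
        "--mirror",                     # recursive download with timestamping
        "--page-requisites",            # also grab images/CSS pages need
        "--warc-file=" + warc_name,     # record traffic into warc_name.warc.gz
        url,
    ]

# Example (requires wget installed):
#   import subprocess
#   subprocess.run(wget_warc_command("http://example.com/", "forum"))
```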
[11:02] <Cameron_D> You could always use WARCproxy and run HTTrack through it. It's a bit roundabout, but it should work
[11:14] <joepie93> Sum1: WARC is a format specifically for web archiving
[11:14] <joepie93> it retains headers, error pages, and all the other metadata that you'd throw away when saving to disk
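Concretely, a WARC file is a sequence of records, each wrapping a raw HTTP exchange plus archival metadata. A hand-rolled and heavily simplified sketch of a "response" record (real tools also add record IDs, payload digests, and gzip compression; the example URL and body are made up):

```python
from datetime import datetime, timezone

def warc_response_record(url, http_status, http_headers, body):
    """Serialize one simplified WARC 'response' record as bytes.

    The record block is the raw HTTP response, so the status line and
    headers survive -- exactly the metadata lost when saving to disk.
    """
    http = "HTTP/1.1 %s\r\n" % http_status
    http += "".join("%s: %s\r\n" % kv for kv in http_headers)
    block = (http + "\r\n").encode() + body
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    warc_headers = [
        ("WARC-Type", "response"),
        ("WARC-Target-URI", url),
        ("WARC-Date", warc_date),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(block))),  # length of the HTTP block
    ]
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % kv for kv in warc_headers)
    return head.encode() + b"\r\n" + block + b"\r\n\r\n"

# Even a 404 is preserved verbatim, error page and all:
record = warc_response_record(
    "http://example.com/", "404 Not Found",
    [("Content-Type", "text/html")], b"<h1>gone</h1>")
```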
[11:15] <joepie93> alard: awake?
[11:57] <ersi> joepie93: he's been missing for a while
[12:00] <joepie93> :/
[12:00] <joepie93> ersi: really need someone to write a pipeline for Hyves
[12:01] <ersi> I haven't seen him around for *months*, though
[12:01] <joepie93> I wrote a user discovery script.. tbh I don't really have time for this project, but I get the idea that if I don't, it won't get done at all
[12:01] <joepie93> but I really can't afford, time-wise, to also write a pipeline
[12:01] <ersi> That's the general gist of all projects
[12:01] <ersi> "Do or it won't happen"
[12:01] <joepie93> problem is that I'm too busy to run an entire project
[12:02] <ersi> some is better than none
[12:02] <joepie93> yes, but user discovery alone is not going to help
[12:02] <ersi> sure it is, that's one less thing to do
[12:02] <joepie93> keyword: "alone"
[13:58] <odie5533> Cameron_D: So you know, I wrote a MITM alternative to WarcProxy that supports SSL. Two, actually.
[13:58] <Cameron_D> oh neat, link?
[13:59] <odie5533> https://github.com/odie5533/WarcMITMProxy and https://github.com/odie5533/WarcTwistedMITMProxy
[13:59] <Cameron_D> I do lots of scraping with Majestic-12, so I've been considering throwing that into the middle and then uploading the WARCs to IA
[13:59] <odie5533> The former is probably more stable and complete at this point, but I'm now just working on the latter.
[14:01] <odie5533> because I definitely think, going forward, having a stable, scalable WARC proxy is very important, since it removes the need to keep rewriting WARC handling in various programs.
[14:03] <odie5533> Cameron_D: If you're looking for alternative scrapers, give Scrapy a try. I've already got some groundwork completed for it. https://github.com/odie5533/WarcMiddleware
[14:04] <Cameron_D> great, I'll keep a watch on those
[15:07] <Sum1> I vaguely recall reading that you can submit WARC files to the Internet Archive's Wayback Machine; does anyone know if this is the case?
[15:08] <Lord_Nigh> ask SketchCow or undersco2
[15:38] <SketchCow> What
[15:38] <SketchCow> You can upload stuff to the Internet Archive and alert us to it.
[15:38] <SketchCow> We have to look at it.
[15:40] <DFJustin> huh, I hadn't thought of using HTTrack together with a WARC proxy
[15:41] <DFJustin> that could be extremely useful
[15:47] <balrog> Sum1: http://archive.org/upload/
[16:54] <SketchCow> godane: We already have an HPR collection someone is maintaining.
[16:54] <SketchCow> Other than that, I've been putting your items into collections.
[17:07] <godane> SketchCow: it looks like not all of them were there
[17:08] <godane> that was my only reason for doing it
[17:10] <godane> SketchCow: the collection mostly has 10 mp3s in an item
[17:10] <godane> and stopped at 620 for some reason
[17:11] <godane> then someone uploaded hpr1282 and put hpr1284 into that item too
[17:11] <godane> it's just crazy the way it's being done
[17:15] <godane> SketchCow: also, note that geekbeattvreviews goes into the computerandtechvideos collection
[17:16] <godane> the way the collection is now, it looks like it's going to be under texts
[17:21] <godane> SketchCow: there are also geekbeat.tv episodes in community videos
[17:21] <godane> I'm up to about episode 702 now on that one
[17:27] <edsu_> kind of a dumb question here: are WARCs that are harvested uploaded to the Internet Archive to become part of the general web collection ... available through Wayback?
[17:28] <SketchCow> Yes.
[17:28] <edsu_> nice
[17:28] <edsu_> do the WARCs separately go up there as files?
[17:28] <edsu_> where they can be viewed as other uploaded files?
[17:30] <edsu_> I'm giving a talk about web preservation in New Zealand and want to really highlight the awesome work Archive Team does http://www.ndf.org.nz/
[17:30] <edsu_> so I need to get my facts straight :-D
[17:31] <SketchCow> Ah.
[17:31] <SketchCow> OK, so.
[17:31] <SketchCow> The way the Wayback Machine works on the Internet Archive is that it looks at the web collection, in which there are WARC files.
[17:32] <SketchCow> There are indexers that figure out what URLs are backed up, and what item has that information, in what file.
[17:32] <SketchCow> Since the Internet Archive has done its own crawling (not just taken from Alexa Internet), it has done it this way.
[17:32] <SketchCow> What Archive Team did was have outsiders/"just folks" provide things to this collection.
[17:33] <SketchCow> So the upshot is that we can add items to the web collection, but it has to be done by an admin.
[17:33] <SketchCow> It can't just happen, and this is why my life is filled with so many requests from darling AT members to make something web.
[17:33] <SketchCow> We have ways to take the item, go "where is this from", and find the object.
[17:34] <SketchCow> Each time something is read, that item gets a read.
[17:38] <SketchCow> This is why these web objects will say "downloaded XXX times" - XXX is the number of times people used Wayback.
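The indexing step described here is essentially what Wayback's CDX-style indexes capture: for each captured URL, which WARC file holds the record and where, so playback can seek straight to it. A toy version of that lookup (the row layout, file names, and timestamps below are illustrative, not the Internet Archive's actual schema):

```python
def build_index(records):
    """Map url -> sorted capture history of (timestamp, warc_file, offset).

    records: iterable of (url, timestamp, warc_filename, byte_offset) rows,
    e.g. produced by scanning every WARC in a web collection.
    """
    index = {}
    for url, ts, warc, offset in records:
        index.setdefault(url, []).append((ts, warc, offset))
    for captures in index.values():
        captures.sort()  # chronological order, oldest first
    return index

def lookup(index, url):
    """Return the most recent capture of url, or None if never archived."""
    captures = index.get(url)
    return captures[-1] if captures else None

# Two captures of the same page, in two different uploaded WARCs:
index = build_index([
    ("http://example.com/", "20130101", "AT-001.warc.gz", 0),
    ("http://example.com/", "20131001", "AT-007.warc.gz", 4096),
])
```

Given the `(warc_file, offset)` pair, a replay service only has to open that one file at that one position instead of scanning the whole collection.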
[17:38] <edsu> got it, thanks SketchCow
[18:06] <SketchCow> No problem.
[21:19] <SketchCow> https://archive.org/details/archiveteam_archivebot_go_002 still kicking ass
[21:28] <ersi> Well, in one way - it would be kind of stupid if we, as outsiders, could add material to the Wayback. Who knows what we put into our WARCs. :)
[21:41] <xmc> right, the custody issue
[21:46] <odie5533> If I create a list of URLs, can the warrior grab them?
[21:48] <ersi> Not unless you write a project for the warrior.
[22:09] <odie5533> Does the Warrior work as a good dev environment for creating warrior projects?
[22:34] <odie5533> I just booted it up for the first time, and it doesn't seem to be a good dev environment, heh
[22:54] <dashcloud> SketchCow: I never knew that the number of times a web grab is downloaded is the number of times it's been viewed in the Wayback
[22:54] <SketchCow> Yes