#archiveteam 2013-11-04,Mon

Time Nickname Message
09:59 🔗 Sum1 Is anyone active currently?
10:10 🔗 Tomcat_ define "active" ;)
10:11 🔗 Sum1 Well, hello :)
10:11 🔗 Tomcat_ I'm awake and working, but not doing anything archive-related.
10:11 🔗 Sum1 I see, was just wondering what would be the best way to make a superficial archive of an entire forum.
10:11 🔗 Tomcat_ I'm not into the technical details of the whole archiving operation, but I'm pretty sure there are people here who know how to do that.
10:13 🔗 Tomcat_ http://archiveteam.org/index.php?title=Software
10:15 🔗 Sum1 Do you know if these scrapers pick up where they leave off? E.g. if I close the app for the day and come back, will it continue?
10:17 🔗 Sum1 Ahh, read HTTrack's Wikipedia page and it seems I can, brilliant. Will read up more, thanks :)
10:17 🔗 Tomcat_ Most do. ;)
10:26 🔗 ersi wget won't.
10:26 🔗 ersi I wouldn't say "most do", I'd say "test it"
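(A minimal sketch of the resume behaviour being discussed, with placeholder paths: HTTrack can pick an interrupted mirror back up with --continue, because it keeps its crawl state in the project directory; wget's -c only resumes partial files, not a recursive crawl queue. Python is used here purely as a wrapper for illustration.)

    import subprocess

    # Resume an interrupted HTTrack mirror stored in /tmp/mirror
    # (the path is a placeholder, not from the log).
    subprocess.run(["httrack", "--continue", "-O", "/tmp/mirror"], check=True)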
10:27 🔗 Sum1 Has anyone used SiteSucker for OSX?
10:27 🔗 ersi People here mostly use wget and/or HTTrack
10:28 🔗 Sum1 I've used it once or twice for small sites, but was wondering if it would be feasible for larger sites. Primarily using OSX for this.
10:30 🔗 Sum1 Mmm, maybe I'll get a Windows user to help with the scraping then.
10:31 🔗 ersi Sounds horrible. OS X and Windows for archiving. :)
10:31 🔗 ersi Unless you archive straight into WARC output without touching the filthy filesystems of OS X or Windows
10:32 🔗 ersi I'd wholeheartedly recommend using something that can produce WARC output (i.e. save to the 'WARC format'), since then it'll be of use for the Internet Archive (and plenty of other archival organisations)
10:37 🔗 Sum1 I'll look into it, I'm guessing it's some kind of container format for archives?
10:39 🔗 Sum1 btw might be afk for a bit soonish
10:53 🔗 ersi I think HTTrack has support. I know wget *has* support
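(For reference, wget's native WARC support looks roughly like this; the flags exist in wget 1.14 and later, and the forum URL is a placeholder.)

    import subprocess

    # Mirror a site while writing everything it fetches into forum.warc.gz.
    subprocess.run([
        "wget",
        "--mirror",              # recursive, timestamped mirror
        "--page-requisites",     # include the CSS/JS/images pages need
        "--warc-file=forum",     # write forum.warc.gz as the crawl runs
        "--warc-cdx",            # also emit a CDX index of the WARC
        "--wait=1",              # be polite between requests
        "http://forum.example.com/",
    ], check=True)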
11:02 🔗 Cameron_D You could always use WARCproxy and run HTTrack through it; it's a bit roundabout, but it should work
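(A sketch of Cameron_D's roundabout setup: start a WARC-writing proxy locally, then point HTTrack at it so every fetch gets recorded. The proxy host:port is an assumption; check your proxy's docs for the real value.)

    import subprocess

    # Route an HTTrack mirror through a local WARC-writing proxy.
    subprocess.run([
        "httrack", "http://forum.example.com/",
        "-P", "localhost:8000",   # hypothetical proxy host:port
        "-O", "/tmp/mirror",      # HTTrack output directory (placeholder)
    ], check=True)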
11:14 🔗 joepie93 Sum1: WARC is a format specifically for web archiving
11:14 🔗 joepie93 it retains headers, error pages, and all the other metadata that you'd throw away when saving to disk
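(To make joepie93's point concrete, here is roughly what writing a single WARC response record looks like using the later warcio Python library; warcio postdates this log and stands in for whatever tool you actually use, but it shows that the status line, headers, and body are all kept together.)

    from io import BytesIO
    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    with open("example.warc.gz", "wb") as out:
        writer = WARCWriter(out, gzip=True)
        # The HTTP status line and headers are stored, not thrown away.
        http_headers = StatusAndHeaders(
            "200 OK", [("Content-Type", "text/html")], protocol="HTTP/1.0"
        )
        record = writer.create_warc_record(
            "http://example.com/", "response",
            payload=BytesIO(b"<html>hello</html>"),
            http_headers=http_headers,
        )
        writer.write_record(record)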
11:15 🔗 joepie93 alard: awake?
11:57 🔗 ersi joepie93: he's been missing for a while
12:00 🔗 joepie93 :/
12:00 🔗 joepie93 ersi: really need someone to write a pipeline for hyves
12:01 🔗 ersi I haven't seen him around for *months* though
12:01 🔗 joepie93 I wrote a user discovery script.. tbh I don't really have time for this project but I get the idea that if I don't, it won't get done at all
12:01 🔗 joepie93 but I really can't afford time-wise to also write a pipeline
12:01 🔗 ersi That's the general gist of all projects
12:01 🔗 ersi "Do or it won't happen"
12:01 🔗 joepie93 problem is that I'm too busy to run an entire project
12:02 🔗 ersi some is better than none
12:02 🔗 joepie93 yes, but user discovery alone is not going to help
12:02 🔗 ersi sure it is, that's one less thing to do
12:02 🔗 joepie93 keyword: "alone"
13:58 🔗 odie5533 Cameron_D: So you know, I wrote a MITM alternative to WarcProxy that supports SSL. Two actually.
13:58 🔗 Cameron_D oh neat, link?
13:59 🔗 odie5533 https://github.com/odie5533/WarcMITMProxy and https://github.com/odie5533/WarcTwistedMITMProxy
13:59 🔗 Cameron_D I do lots of scraping with Majestic 12, so I've been considering throwing that into the middle and then uploading the WARCs to IA
13:59 🔗 odie5533 The former is probably more stable and complete at this point, but I'm now just working on the latter one.
14:01 🔗 odie5533 because I definitely think going forward, having a stable, scalable WARC proxy is very important since it removes the need to keep rewriting WARC handling in various programs.
14:03 🔗 odie5533 Cameron_D: If you're looking for alternative scrapers, give Scrapy a try. I've already got some groundwork completed in it. https://github.com/odie5533/WarcMiddleware
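(For anyone wiring that in: Scrapy hooks of this kind live in the project's settings.py. The dotted path and priority below are assumptions for illustration; the WarcMiddleware README has the real values.)

    # settings.py of a Scrapy project (illustrative only)
    DOWNLOADER_MIDDLEWARES = {
        # Hypothetical dotted path to odie5533's middleware.
        "warcmiddleware.WarcMiddleware": 543,
    }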
14:04 🔗 Cameron_D great, I'll keep a watch on those
15:07 🔗 Sum1 I vaguely recall reading that you can submit WARC files to the Internet Archive's Wayback Machine, does anyone know if this is the case?
15:08 🔗 Lord_Nigh ask SketchCow or undersco2
15:38 🔗 SketchCow What
15:38 🔗 SketchCow You can upload stuff to Internet Archive and alert us to it.
15:38 🔗 SketchCow We have to look at it.
15:40 🔗 DFJustin huh I hadn't thought of using httrack together with a warc proxy
15:41 🔗 DFJustin that could be extremely useful
15:47 🔗 balrog Sum1: http://archive.org/upload/
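(Besides the upload page, this can be scripted with the internetarchive Python client; the item identifier and metadata here are made up for illustration, and as SketchCow notes above, an admin still has to move the item into the web collection.)

    from internetarchive import upload

    # Upload a WARC into a new item on archive.org.
    upload(
        "example-forum-warc-20131104",   # hypothetical item identifier
        files=["forum.warc.gz"],
        metadata={"title": "Example forum grab", "mediatype": "web"},
    )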
16:54 🔗 SketchCow godane: We already have an HPR collection someone is maintaining.
16:54 🔗 SketchCow Other than that, I've been putting your items into collections.
17:07 🔗 godane SketchCow: it looks like not all of them were there
17:08 🔗 godane that was my only reason for doing it
17:10 🔗 godane SketchCow: the collection mostly has 10 mp3s in an item
17:10 🔗 godane and stopped at 620 for some reason
17:11 🔗 godane then someone uploaded hpr1282 and put hpr1284 into that item too
17:11 🔗 godane it's just crazy the way it's being done
17:15 🔗 godane SketchCow: also, note that geekbeattvreviews goes into the computerandtechvideos collection
17:16 🔗 godane the way the collection is now, it looks like it's going to be under texts
17:21 🔗 godane SketchCow: there are also geekbeat.tv episodes in community videos
17:21 🔗 godane i'm up to about episode 702 now on that one
17:27 🔗 edsu_ kind of a dumb question here: are warcs that are harvested uploaded to internet archive to become part of the general web collection ... available through wayback?
17:28 🔗 SketchCow Yes.
17:28 🔗 edsu_ nice
17:28 🔗 edsu_ do the warcs separately go up there as files?
17:28 🔗 edsu_ where they can be viewed as other uploaded files?
17:30 🔗 edsu_ i'm giving a talk about web preservation in new zealand and want to really highlight the awesome work archiveteam does http://www.ndf.org.nz/
17:30 🔗 edsu_ so i need to get my facts straight :-D
17:31 🔗 SketchCow Ah.
17:31 🔗 SketchCow OK, so.
17:31 🔗 SketchCow The way the Wayback machine works on Internet Archive is it looks at the web collection, in which there are WARC files.
17:31 🔗 SketchCow There are indexers that figure out what URLs are backed up, and what item has that information, in what file.
17:32 🔗 SketchCow Ever since the Internet Archive started doing its own crawling (not just taking crawls from Alexa Internet), it has done it this way.
17:32 🔗 SketchCow What Archive Team did was make it possible for outsiders/"just folks" to provide things to this collection.
17:32 🔗 SketchCow So, the upshot is that we can add items to the web collection, but it has to be done by an admin.
17:33 🔗 SketchCow It can't just happen, and this is why my life is filled with so many requests from darling AT members to make something web.
17:33 🔗 SketchCow We have ways to return the item and go 'where is this from' and find the object.
17:33 🔗 SketchCow Each time something is read, that item gets a read.
17:34 🔗 SketchCow This is why these web objects will say "downloaded XXX times" - XXX is the amount of times people used wayback.
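(The index SketchCow describes is publicly queryable: the Wayback CDX API returns one line per capture, describing the data behind it. The example URL and limit below are arbitrary.)

    from urllib.request import urlopen

    # Ask the Wayback CDX index which captures exist for a URL.
    api = ("http://web.archive.org/cdx/search/cdx"
           "?url=archiveteam.org&limit=5")
    with urlopen(api) as resp:
        print(resp.read().decode())
    # Default fields per line:
    # urlkey timestamp original mimetype statuscode digest length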
17:38 🔗 edsu got it, thanks SketchCow
17:38 🔗 SketchCow No problem.
18:06 🔗 SketchCow https://archive.org/details/archiveteam_archivebot_go_002 still kicking ass
21:19 🔗 ersi Well, in one way - it would be kind of stupid if we as outsiders could add material to the Wayback. Who knows what we put into our WARCs. :)
21:28 🔗 xmc right, the custody issue
21:41 🔗 odie5533 If I create a list of urls, can the warrior grab them?
21:46 🔗 ersi Not unless you write a project for the warrior.
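(ersi is right that the warrior needs a seesaw pipeline written per project. As a non-warrior stand-in for "grab this list of urls", the same effect can be approximated with wget's WARC mode; urls.txt is a hypothetical input file.)

    import subprocess

    # Fetch every URL in urls.txt, keeping only the WARC output.
    subprocess.run([
        "wget",
        "--input-file=urls.txt",   # one URL per line
        "--warc-file=urls",        # everything lands in urls.warc.gz
        "--delete-after",          # discard the plain on-disk copies
    ], check=True)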
21:48 🔗 odie5533 Does the Warrior work as a good dev environment for creating warrior projects?
22:09 🔗 odie5533 I just booted it up for the first time, and it doesn't seem to be a good dev environment heh
22:34 🔗 dashcloud SketchCow: I never knew that the number of times a web grab is downloaded is the number of times it's been viewed in the wayback
22:54 🔗 SketchCow Yes
