[09:59] <Sum1> Is anyone active currently?
[10:10] <Tomcat_> define "active" ;)
[10:11] <Sum1> Well, helllo :)
[10:11] <Tomcat_> I'm awake and working, but not doing anything archive-related.
[10:11] <Sum1> I see, I was just wondering what would be the best way to make a superficial archive of an entire forum.
[10:12] <Tomcat_> I'm not into the technical details of the whole archiving operation, but I'm pretty sure there are people here who know how to do that.
[10:13] <Tomcat_> http://archiveteam.org/index.php?title=Software
[10:15] <Sum1> Do you know if these scrapers pick up where they leave off? E.g. if I close the app for the day and come back, will it continue?
[10:17] <Sum1> Ahh, read HTTrack's Wikipedia page and it seems I can, brilliant. Will read up more, thanks :)
[10:17] <Tomcat_> Most do. ;)
[10:26] <ersi> wget won't.
[10:26] <ersi> I wouldn't say "most do", I'd say "test it"
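Whether a crawl can resume varies by tool, which is why "test it" is the right advice. The mechanism behind tools that do resume is simple, though: checkpoint the pending queue and the set of already-fetched URLs, and reload both on startup. A minimal sketch (the state-file name, seed URL, and `fetch` callback are all hypothetical, not any real scraper's API):

```python
import json
import os

STATE_FILE = "crawl_state.json"  # hypothetical checkpoint file

def load_state():
    """Reload the pending queue and the set of already-fetched URLs."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            state = json.load(f)
        return state["pending"], set(state["done"])
    return ["http://example.com/"], set()  # fresh crawl: seed URL only

def save_state(pending, done):
    """Persist progress so a later run picks up where this one left off."""
    with open(STATE_FILE, "w") as f:
        json.dump({"pending": pending, "done": sorted(done)}, f)

def crawl(fetch, max_pages=10):
    """Fetch pages until the queue empties; safe to interrupt and rerun.

    fetch(url) is a caller-supplied callback returning discovered links.
    """
    pending, done = load_state()
    while pending and max_pages > 0:
        url = pending.pop(0)
        if url in done:
            continue
        links = fetch(url)
        done.add(url)
        pending.extend(l for l in links if l not in done)
        save_state(pending, done)  # checkpoint after every page
        max_pages -= 1
    return done
```

Killing the process between pages loses at most the page in flight; the next run reloads the checkpoint and continues.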
[10:27] <Sum1> Has anyone used SiteSucker for OS X?
[10:27] <ersi> People here mostly use wget and/or HTTrack
[10:28] <Sum1> I've used it once or twice for small sites, but was wondering if it would be feasible for larger sites. Primarily using OS X for this.
[10:30] <Sum1> Mmm, maybe I'll get a Windows user to help with the scraping then.
[10:31] <ersi> Sounds horrible. OS X and Windows for archiving. :)
[10:31] <ersi> Unless you archive straight into WARC output without touching the filthy filesystems of OS X or Windows
[10:32] <ersi> I'd wholeheartedly recommend using something that can produce WARC output (i.e. save to the 'WARC format'), since then it'll be of use to the Internet Archive (and plenty of other archival organisations)
[10:37] <Sum1> I'll look into it; I'm guessing it's some kind of container format for archives?
[10:39] <Sum1> btw, might be afk for a bit soonish
[10:53] <ersi> I think HTTrack has support. I know wget *has* support
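For wget, the WARC support mentioned here is the `--warc-file` option (wget appends `.warc.gz` to the name itself). A small helper that assembles such a command line, as a sketch to adapt rather than a complete recipe; the prefix `forum` and the example URL are illustrative:

```python
def wget_warc_command(url, warc_name):
    """Build a wget invocation that mirrors a site while writing a WARC.

    wget appends .warc.gz to warc_name on its own, so pass a bare prefix.
    """
    return [
        "wget",
        "--mirror",                     # recursive download with timestamping
        "--page-requisites",            # also grab images/CSS pages need
        "--warc-file=" + warc_name,     # record traffic into warc_name.warc.gz
        url,
    ]

# Example (requires wget installed):
#   import subprocess
#   subprocess.run(wget_warc_command("http://example.com/", "forum"))
```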
[11:02] <Cameron_D> You could always use WARCproxy and run HTTrack through it. It's a bit roundabout, but it should work
[11:14] <joepie93> Sum1: WARC is a format specifically for web archiving
[11:14] <joepie93> it retains headers, error pages, and all the other metadata that you'd throw away when saving to disk
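Concretely, a WARC file is a sequence of records, each wrapping a raw HTTP exchange plus archival metadata. A hand-rolled and heavily simplified sketch of a "response" record (real tools also add record IDs, payload digests, and gzip compression; the example URL and body are made up):

```python
from datetime import datetime, timezone

def warc_response_record(url, http_status, http_headers, body):
    """Serialize one simplified WARC 'response' record as bytes.

    The record block is the raw HTTP response, so the status line and
    headers survive -- exactly the metadata lost when saving to disk.
    """
    http = "HTTP/1.1 %s\r\n" % http_status
    http += "".join("%s: %s\r\n" % kv for kv in http_headers)
    block = (http + "\r\n").encode() + body
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    warc_headers = [
        ("WARC-Type", "response"),
        ("WARC-Target-URI", url),
        ("WARC-Date", warc_date),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(block))),  # length of the HTTP block
    ]
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % kv for kv in warc_headers)
    return head.encode() + b"\r\n" + block + b"\r\n\r\n"

# Even a 404 is preserved verbatim, error page and all:
record = warc_response_record(
    "http://example.com/", "404 Not Found",
    [("Content-Type", "text/html")], b"<h1>gone</h1>")
```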
[11:15] <joepie93> alard: awake?
[11:57] <ersi> joepie93: he's been missing for a while
[12:00] <joepie93> :/
[12:00] <joepie93> ersi: really need someone to write a pipeline for Hyves
[12:01] <ersi> I haven't seen him around for *months*, though
[12:01] <joepie93> I wrote a user discovery script.. tbh I don't really have time for this project, but I get the idea that if I don't, it won't get done at all
[12:01] <joepie93> but I really can't afford, time-wise, to also write a pipeline
[12:01] <ersi> That's the general gist of all projects
[12:01] <ersi> "Do or it won't happen"
[12:01] <joepie93> problem is that I'm too busy to run an entire project
[12:02] <ersi> some is better than none
[12:02] <joepie93> yes, but user discovery alone is not going to help
[12:02] <ersi> sure it is, that's one less thing to do
[12:02] <joepie93> keyword: "alone"
[13:58] <odie5533> Cameron_D: So you know, I wrote a MITM alternative to WarcProxy that supports SSL. Two, actually.
[13:58] <Cameron_D> oh neat, link?
[13:59] <odie5533> https://github.com/odie5533/WarcMITMProxy and https://github.com/odie5533/WarcTwistedMITMProxy
[13:59] <Cameron_D> I do lots of scraping with Majestic-12, so I've been considering throwing that into the middle and then uploading the WARCs to IA
[13:59] <odie5533> The former is probably more stable and complete at this point, but I'm now just working on the latter.
[14:01] <odie5533> because I definitely think, going forward, having a stable, scalable WARC proxy is very important, since it removes the need to keep rewriting WARC handling in various programs.
[14:03] <odie5533> Cameron_D: If you're looking for alternative scrapers, give Scrapy a try. I've already got some groundwork completed for it. https://github.com/odie5533/WarcMiddleware
[14:04] <Cameron_D> great, I'll keep a watch on those
[15:07] <Sum1> I vaguely recall reading that you can submit WARC files to the Internet Archive's Wayback Machine; does anyone know if this is the case?
[15:08] <Lord_Nigh> ask SketchCow or undersco2
[15:38] <SketchCow> What
[15:38] <SketchCow> You can upload stuff to the Internet Archive and alert us to it.
[15:38] <SketchCow> We have to look at it.
[15:40] <DFJustin> huh, I hadn't thought of using HTTrack together with a WARC proxy
[15:41] <DFJustin> that could be extremely useful
[15:47] <balrog> Sum1: http://archive.org/upload/
[16:54] <SketchCow> godane: We already have an HPR collection someone is maintaining.
[16:54] <SketchCow> Other than that, I've been putting your items into collections.
[17:07] <godane> SketchCow: it looks like not all of them were there
[17:08] <godane> that was my only reason for doing it
[17:10] <godane> SketchCow: the collection mostly has 10 mp3s in an item
[17:10] <godane> and stopped at 620 for some reason
[17:11] <godane> then someone uploaded hpr1282 and put hpr1284 into that item too
[17:11] <godane> it's just crazy the way it's being done
[17:15] <godane> SketchCow: also, note that geekbeattvreviews goes into the computerandtechvideos collection
[17:16] <godane> the way the collection is now, it looks like it's going to be under texts
[17:21] <godane> SketchCow: there are also geekbeat.tv episodes in community videos
[17:21] <godane> I'm up to about episode 702 now on that one
[17:27] <edsu_> kind of a dumb question here: are WARCs that are harvested uploaded to the Internet Archive to become part of the general web collection ... available through Wayback?
[17:28] <SketchCow> Yes.
[17:28] <edsu_> nice
[17:28] <edsu_> do the WARCs separately go up there as files?
[17:28] <edsu_> where they can be viewed as other uploaded files?
[17:30] <edsu_> I'm giving a talk about web preservation in New Zealand and want to really highlight the awesome work Archive Team does http://www.ndf.org.nz/
[17:30] <edsu_> so I need to get my facts straight :-D
[17:31] <SketchCow> Ah.
[17:31] <SketchCow> OK, so.
[17:31] <SketchCow> The way the Wayback Machine works on the Internet Archive is that it looks at the web collection, in which there are WARC files.
[17:32] <SketchCow> There are indexers that figure out what URLs are backed up, and what item has that information, in what file.
[17:32] <SketchCow> Since the Internet Archive has done its own crawling (not just taken from Alexa Internet), it has done it this way.
[17:32] <SketchCow> What Archive Team did was have outsiders/"just folks" provide things to this collection.
[17:33] <SketchCow> So the upshot is that we can add items to the web collection, but it has to be done by an admin.
[17:33] <SketchCow> It can't just happen, and this is why my life is filled with so many requests from darling AT members to make something web.
[17:33] <SketchCow> We have ways to take the item, go "where is this from", and find the object.
[17:34] <SketchCow> Each time something is read, that item gets a read.
[17:38] <SketchCow> This is why these web objects will say "downloaded XXX times" - XXX is the number of times people used Wayback.
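The indexing step described here is essentially what Wayback's CDX-style indexes capture: for each captured URL, which WARC file holds the record and where, so playback can seek straight to it. A toy version of that lookup (the row layout, file names, and timestamps below are illustrative, not the Internet Archive's actual schema):

```python
def build_index(records):
    """Map url -> sorted capture history of (timestamp, warc_file, offset).

    records: iterable of (url, timestamp, warc_filename, byte_offset) rows,
    e.g. produced by scanning every WARC in a web collection.
    """
    index = {}
    for url, ts, warc, offset in records:
        index.setdefault(url, []).append((ts, warc, offset))
    for captures in index.values():
        captures.sort()  # chronological order, oldest first
    return index

def lookup(index, url):
    """Return the most recent capture of url, or None if never archived."""
    captures = index.get(url)
    return captures[-1] if captures else None

# Two captures of the same page, in two different uploaded WARCs:
index = build_index([
    ("http://example.com/", "20130101", "AT-001.warc.gz", 0),
    ("http://example.com/", "20131001", "AT-007.warc.gz", 4096),
])
```

Given the `(warc_file, offset)` pair, a replay service only has to open that one file at that one position instead of scanning the whole collection.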
[17:38] <edsu> got it, thanks SketchCow
[18:06] <SketchCow> No problem.
[21:19] <SketchCow> https://archive.org/details/archiveteam_archivebot_go_002 still kicking ass
[21:28] <ersi> Well, in one way - it would be kind of stupid if we, as outsiders, could add material to the Wayback. Who knows what we put into our WARCs. :)
[21:41] <xmc> right, the custody issue
[21:46] <odie5533> If I create a list of URLs, can the warrior grab them?
[21:48] <ersi> Not unless you write a project for the warrior.
[22:09] <odie5533> Does the Warrior work as a good dev environment for creating warrior projects?
[22:34] <odie5533> I just booted it up for the first time, and it doesn't seem to be a good dev environment, heh
[22:54] <dashcloud> SketchCow: I never knew that the number of times a web grab is downloaded is the number of times it's been viewed in the Wayback
[22:54] <SketchCow> Yes