| Time |
Nickname |
Message |
|
09:59
|
Sum1 |
Is anyone active currently? |
|
10:10
|
Tomcat_ |
define "active" ;) |
|
10:11
|
Sum1 |
Well, hello :) |
|
10:11
|
Tomcat_ |
I'm awake and working, but not doing anything archive-related. |
|
10:11
|
Sum1 |
I see, was just wondering what would be the best way to make a superficial archive of an entire forum. |
|
10:12
|
Tomcat_ |
I'm not into the technical details of the whole archiving operations, but I'm pretty sure there are people here who know how to do that. |
|
10:13
|
Tomcat_ |
http://archiveteam.org/index.php?title=Software |
|
10:15
|
Sum1 |
Do you know if these scrapers pick up where they leave off? Eg: if I close the app for the day and come back will it continue? |
|
10:17
|
Sum1 |
Ahh, read HTTrack's wikipedia and it seems I can, brilliant. Will read up more, thanks :) |
|
10:17
|
Tomcat_ |
Most do. ;) |
|
10:26
|
ersi |
wget won't. |
|
10:26
|
ersi |
I wouldn't say "most do", I'd say "test it" |
|
10:27
|
Sum1 |
Has anyone used SiteSucker for OSX? |
|
10:27
|
ersi |
People here mostly use wget and/or HTTrack |
|
10:28
|
Sum1 |
I've used it once or twice for small sites, but was wondering if it would be feasible for larger sites. Primarily using OSX for this. |
|
10:30
|
Sum1 |
Mmm, maybe I'll get a Windows user to help with the scraping then. |
|
10:31
|
ersi |
Sounds horrible. OS X and Windows for archiving. :) |
|
10:31
|
ersi |
Unless you archive straight into WARC output without touching the filthy filesystems of OS X or Windows |
|
10:32
|
ersi |
I'd wholeheartedly recommend using something that can produce WARC output (i.e. save to the 'WARC format'), since then it'll be of use to the Internet Archive (and plenty of other archival organisations) |
|
10:37
|
Sum1 |
I'll look into it, I'm guessing it's some kind of container format for archives? |
|
10:39
|
Sum1 |
btw might be afk for a bit soonish |
|
10:53
|
ersi |
I think HTTrack has support. I know wget *has* support |
|
11:02
|
Cameron_D |
You could always use WARCproxy and run HTTrack through it. It's a bit roundabout, but it should work |
|
11:14
|
joepie93 |
Sum1: WARC is a format specifically for web archiving |
|
11:14
|
joepie93 |
it retains headers, error pages, and all the other metadata that you'd throw away when saving to disk |
|
11:15
|
joepie93 |
alard; awake? |
|
11:57
|
ersi |
joepie93: he's been missing for a while |
|
12:00
|
joepie93 |
:/ |
|
12:00
|
joepie93 |
ersi: really need someone to write a pipeline for hyves |
|
12:01
|
ersi |
I haven't seen him around for *months* though |
|
12:01
|
joepie93 |
I wrote a user discovery script.. tbh I don't really have time for this project but I get the idea that if I don't, it won't get done at all |
|
12:01
|
joepie93 |
but I really can't afford time-wise to also write a pipeline |
|
12:01
|
ersi |
That's the general gist of all projects |
|
12:01
|
ersi |
"Do or it won't happen" |
|
12:01
|
joepie93 |
problem is that I'm too busy to run an entire project |
|
12:02
|
ersi |
some is better than none |
|
12:02
|
joepie93 |
yes, but user discovery alone is not going to help |
|
12:02
|
ersi |
sure it is, that's one less thing to do |
|
12:02
|
joepie93 |
keyword: "alone" |
|
13:58
|
odie5533 |
Cameron_D: So you know, I wrote a MITM alternative to WarcProxy that supports SSL. Two actually. |
|
13:58
|
Cameron_D |
oh neat, link? |
|
13:59
|
odie5533 |
https://github.com/odie5533/WarcMITMProxy and https://github.com/odie5533/WarcTwistedMITMProxy |
|
13:59
|
Cameron_D |
I do lots of scraping with Majestic 12, so I've been considering throwing that into the middle and then uploading the WARCs to IA |
|
13:59
|
odie5533 |
The former is probably more stable and complete at this point, but I'm now just working on the latter one. |
|
14:01
|
odie5533 |
because I definitely think going forward, having a stable, scalable WARC proxy is very important since it removes the need to keep rewriting WARC handling in various programs. |
|
14:03
|
odie5533 |
Cameron_D: If you're looking for alternative scrapers, give Scrapy a try. I've already got some groundwork completed in it. https://github.com/odie5533/WarcMiddleware |
|
14:04
|
Cameron_D |
great, I'll keep a watch on those |
|
15:07
|
Sum1 |
I vaguely recall reading that you can submit WARC files to the Internet Archive's Wayback Machine, does anyone know if this is the case? |
|
15:08
|
Lord_Nigh |
ask SketchCow or undersco2 |
|
15:38
|
SketchCow |
What |
|
15:38
|
SketchCow |
You can upload stuff to Internet Archive and alert us to it. |
|
15:38
|
SketchCow |
We have to look at it. |
|
15:40
|
DFJustin |
huh I hadn't thought of using httrack together with a warc proxy |
|
15:41
|
DFJustin |
that could be extremely useful |
|
15:47
|
balrog |
Sum1: http://archive.org/upload/ |
|
16:54
|
SketchCow |
godane: We already have an HPR collection someone is maintaining. |
|
16:54
|
SketchCow |
Other than that, I've been putting your items into collections. |
|
17:07
|
godane |
SketchCow: it looks like not all of them were there |
|
17:08
|
godane |
that was my only reason for doing it |
|
17:10
|
godane |
SketchCow: the collection mostly has 10 mp3s in an item |
|
17:10
|
godane |
and stopped at 620 for some reason |
|
17:11
|
godane |
then someone uploaded hpr1282 and put hpr1284 into that item too |
|
17:11
|
godane |
it's just crazy the way it's being done |
|
17:15
|
godane |
SketchCow: also know that the geekbeattvreviews goes into computerandtechvideos collection |
|
17:16
|
godane |
the way the collection is now, it looks like it's going to be under texts |
|
17:21
|
godane |
SketchCow: there is also geekbeat.tv episodes in community videos |
|
17:21
|
godane |
i'm up to about episode 702 now on that one |
|
17:27
|
edsu_ |
kind of a dumb question here: are warcs that are harvested uploaded to internet archive to become part of the general web collection ... available through wayback? |
|
17:28
|
SketchCow |
Yes. |
|
17:28
|
edsu_ |
nice |
|
17:28
|
edsu_ |
do the warcs separately go up there as files? |
|
17:28
|
edsu_ |
where they can be viewed as other uploaded files? |
|
17:30
|
edsu_ |
i'm giving a talk about web preservation in new zealand and want to really highlight the awesome work archiveteam does http://www.ndf.org.nz/ |
|
17:30
|
edsu_ |
so i need to get my facts straight :-D |
|
17:31
|
SketchCow |
Ah. |
|
17:31
|
SketchCow |
OK, so. |
|
17:31
|
SketchCow |
The way the Wayback machine works on Internet Archive is it looks at the web collection, in which there are WARC files. |
|
17:31
|
SketchCow |
There are indexers that figure out what URLs are backed up, and what item has that information, in what file. |
|
17:32
|
SketchCow |
Since the Internet Archive has done its own crawling (not just taken from Alexa Internet), it has done it this way. |
|
17:32
|
SketchCow |
What Archive Team did was make outsiders/"just folks" provide things to this collection. |
|
17:32
|
SketchCow |
So, the upshot is that we can add items to the web collection, but it has to be done by an admin. |
|
17:33
|
SketchCow |
It can't just happen, and this is why my life is filled with so many requests from darling AT members to make something web. |
|
17:33
|
SketchCow |
We have ways to return the item and go 'where is this from' and find the object. |
|
17:33
|
SketchCow |
Each time something is read, that item gets a read. |
|
17:34
|
SketchCow |
This is why these web objects will say "downloaded XXX times" - XXX is the number of times people used wayback. |
|
17:38
|
edsu |
got it, thanks SketchCow |
|
17:38
|
SketchCow |
No problem. |
|
18:06
|
SketchCow |
https://archive.org/details/archiveteam_archivebot_go_002 still kicking ass |
|
21:19
|
ersi |
Well, in one way - it would be kind of stupid if we as outsiders could add material to the Wayback. Who knows what we put into our WARCs. :) |
|
21:28
|
xmc |
right, the custody issue |
|
21:41
|
odie5533 |
If I create a list of urls, can the warrior grab them? |
|
21:46
|
ersi |
Not unless you write a project for the warrior. |
|
21:48
|
odie5533 |
Does the Warrior work as a good dev environment for creating warrior projects? |
|
22:09
|
odie5533 |
I just booted it up for the first time, and it doesn't seem to be a good dev environment heh |
|
22:34
|
dashcloud |
SketchCow: I never knew that the number of times a web grab is downloaded is the number of times it's been viewed in the wayback |
|
22:54
|
SketchCow |
Yes |
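
Editor's sketch, not part of the log: joepie93 describes WARC as keeping the headers, error pages and other metadata that a plain save-to-disk throws away, and SketchCow describes indexers that work out which URL lives in which file. The Python below is a rough illustration of both ideas using only the standard library. It assumes uncompressed WARC/1.0 response records; the function names `make_warc_response_record` and `build_index` and the example.org URLs are made up for illustration. Real crawls would normally use an existing tool (wget's WARC output, for instance) and per-record gzip, and the Internet Archive's actual indexers are certainly more involved than this.

```python
import uuid
from datetime import datetime, timezone


def make_warc_response_record(url: str, http_response: bytes) -> bytes:
    """Wrap a raw HTTP response (status line, headers, body) in one WARC/1.0 record.

    The HTTP response is stored byte for byte, so status codes, headers and
    error pages all survive.
    """
    warc_headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {url}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response)}",
    ]
    head = "\r\n".join(warc_headers).encode("utf-8") + b"\r\n\r\n"
    # Each record ends with two CRLFs after the content block.
    return head + http_response + b"\r\n\r\n"


def build_index(warc_bytes: bytes) -> dict:
    """Map each WARC-Target-URI to the byte offsets of its records.

    A toy stand-in for a real index: it answers "which URLs are backed up,
    and where in the file is each one".
    """
    index = {}
    offset = 0
    while offset < len(warc_bytes):
        # The WARC header block ends at the first blank line after `offset`.
        head_end = warc_bytes.index(b"\r\n\r\n", offset)
        fields = {}
        header_lines = warc_bytes[offset:head_end].decode("utf-8").split("\r\n")
        for line in header_lines[1:]:  # skip the "WARC/1.0" version line
            name, _, value = line.partition(":")
            fields[name.strip()] = value.strip()
        index.setdefault(fields["WARC-Target-URI"], []).append(offset)
        # Skip header terminator (4 bytes), content block, record trailer (4 bytes).
        offset = head_end + 4 + int(fields["Content-Length"]) + 4
    return index


# Hypothetical captured responses, including an error page.
ok_page = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<p>hi</p>"
error_page = b"HTTP/1.1 404 Not Found\r\nContent-Type: text/plain\r\n\r\ngone"

first = make_warc_response_record("http://example.org/", ok_page)
second = make_warc_response_record("http://example.org/missing", error_page)
warc = first + second

index = build_index(warc)
print(index)
```

The 404 response is kept as a record in its own right, which is the point joepie93 makes about error pages. Looking a URL up in `index` gives the offset to seek to in the file, which is roughly the "what item has that information, in what file" lookup, minus the item layer.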