00:50 -- Honno has joined #archiveteam-bs
01:17 -- Mateon1 has quit IRC (Read error: Operation timed out)
01:30 -- drumstick has quit IRC (Ping timeout: 360 seconds)
01:30 -- drumstick has joined #archiveteam-bs
02:00 -- Honno has quit IRC (Read error: Operation timed out)
02:16 <hook54321> I've been recording a camera feed from a bar in Catalonia for hours... but I'm still on the fence about whether I should upload it to archive.org or not.
02:24 -- schbirid has quit IRC (Ping timeout: 255 seconds)
02:36 -- schbirid has joined #archiveteam-bs
02:37 -- pizzaiolo has joined #archiveteam-bs
03:30 <godane> so this guy has an interesting set of VHS tapes: https://www.ebay.com/sch/VHS-Video-Tapes/149960/m.html?_nkw=&_armrs=1&_ipg=&_from=&_ssn=froggyholler
03:30 <godane> i'm only interested in one tape that said Dog Day Afternoon, with a time of 1h 57m
03:30 <godane> on TB
03:31 <godane> *TBS
03:35 <hook54321> I wonder what would happen if we set up something like a GoFundMe so we could obtain stuff like that
03:43 <godane> i do have a patreon page: https://www.patreon.com/godane
03:43 <godane> hook54321: it's how i got the last set of tapes from ebay
03:55 <godane> SketchCow: i'm uploading the original broadcast of Dinotopia, so don't upload my vhs stuff for the next day
03:56 <godane> it's about 17,272,156,160 bytes in size
04:19 -- qw3rty113 has joined #archiveteam-bs
04:22 -- Stilett0 has joined #archiveteam-bs
04:25 -- qw3rty112 has quit IRC (Read error: Operation timed out)
04:32 -- Mateon1 has joined #archiveteam-bs
04:33 <phillipsj> I saw a few episodes of that... it was OK, but not great.
04:33 * phillipsj is getting old.
04:58 -- drumstick has quit IRC (Ping timeout: 255 seconds)
04:58 -- drumstick has joined #archiveteam-bs
06:20 -- pizzaiolo has quit IRC (Ping timeout: 246 seconds)
06:41 <SketchCow> Great
08:51 <hook54321> I think I've found some Catalonia radio stations that can be listened to online, however I've been blocked from them (I'm pretty sure for trying different ports), so I can't record them. :/
09:39 -- mls has joined #archiveteam-bs
09:42 -- mls has left
09:45 <hook54321> https://www.youtube.com/watch?v=cfmiNyneO88
09:53 -- BlueMaxim has quit IRC (Quit: Leaving)
10:15 <JAA> SketchCow: Is there anything that can be done to speed up transfers to FOS? My ArchiveBot pipeline has trouble keeping up as the uploads average only 2-3 MB/s. Or should I look into uploading to IA directly instead?
10:16 -- Honno has joined #archiveteam-bs
10:44 <hook54321> If anyone is interested in joining a Discord server where there are people from Catalonia, DM me and, if I'm awake, I'll probably send you the link. I might ask you a few questions about why you want to join it.
10:50 -- Aerochrom has joined #archiveteam-bs
12:25 -- nepeat has quit IRC (ZNC 1.6.5 - http://znc.in)
12:50 -- will has joined #archiveteam-bs
13:45 -- pizzaiolo has joined #archiveteam-bs
14:19 -- Aoede has quit IRC (Ping timeout: 255 seconds)
14:20 -- Aoede has joined #archiveteam-bs
14:20 -- Aoede has quit IRC (Connection closed)
14:20 -- Aoede has joined #archiveteam-bs
14:25 -- fie has quit IRC (Quit: Leaving)
14:40 -- drumstick has quit IRC (Read error: Operation timed out)
15:13 -- pizzaiolo has quit IRC (Remote host closed the connection)
15:48 <SketchCow> JAA: Noo
15:48 <SketchCow> JAA: Also Noo
15:49 <SketchCow> FOS is slow; it's about to be replaced by another FOS
15:54 <JAA> Ok, sweet.
15:54 <JAA> Any idea when that'll happen?
16:04 <SketchCow> Soon?
16:20 -- julius_ is now known as jschwart
16:21 <jschwart> I've made a collection of discs now that I can probably send to the Internet Archive, are there any instructions online for that?
16:22 <jschwart> another thing I have is old hardware with manuals, discs, etc. Does anybody have any suggestions on what would be nice to do with that?
16:22 <jschwart> and what about software discs which came with relatively big books as manuals?
16:24 <SketchCow> Describe discs in this context.
16:26 <jschwart> I have for instance Klik & Play, an old tool to make games; it came with quite a big book
16:26 <jschwart> also Superlogo, which came with a (Dutch) book and discs
17:06 -- pizzaiolo has joined #archiveteam-bs
17:31 -- Specular has joined #archiveteam-bs
17:32 <Specular> asking again in case someone here knows: is it possible to save a full list of results for a generic 'site:' query from Google?
17:33 <dashcloud> Specular: not completely automated - Google quickly hits you with captchas and bans if you try to scrape them
17:33 <Specular> god damn it
17:33 <dashcloud> but there's a number of tools you can use in browser
17:33 <Specular> a site recently hid an entire section of its site (according to one user, around 36 million posts) and I'm not sure for how long Google will retain the results of its previous crawls
17:34 <dashcloud> apparently the thing you want to do is very popular in the SEO community, so there's a number of extensions for this, and guides on doing it
17:35 <Specular> dashcloud, what should I be entering when searching? Couldn't find much when I tried last
17:35 <Specular> (that is, to find out more about how to do this)
17:35 <dashcloud> you probably want "download google search results to csv"
17:36 <Specular> for an enormous scrape, wouldn't the size become unmanageably large to open?
17:38 <Specular> I'm not actually sure if this would cause problems, but thought I'd ask
17:38 <dashcloud> possibly - then you'd use a different tool to open the file
17:38 <dashcloud> Notepad++, a database, or possibly LibreOffice/Excel
17:39 <dashcloud> there's this Chrome extension which may or may not work: https://chrome.google.com/webstore/detail/linkclump/lfpjkncokllnfokkgpkobnkbkmelfefj?hl=en
17:40 <dashcloud> here's a promising Firefox extension: https://jurnsearch.wordpress.com/2012/01/27/how-to-extract-google-search-results-with-url-title-and-snippet-in-a-csv-file/
17:41 <dashcloud> jschwart: if you want to upload stuff yourself to the Internet Archive, I can give you some basic guidelines
17:45 <dashcloud> if you'd prefer to send stuff to IA, ask SketchCow for the mailing address
17:49 <Specular> Google seems like it might offer a way to grab this officially via a service called 'Search Console', but to get more than a certain number of results one needs the pro version. Perhaps someone on the aforementioned site has a subscription.
18:00 -- benuski has joined #archiveteam-bs
18:25 -- Specular has quit IRC (be back later)
18:32 <jschwart> dashcloud: I'd like to free up the physical space as well
18:33 <jschwart> so if I could send things somewhere where they would be useful, I'd rather do that
19:03 -- schbirid2 has joined #archiveteam-bs
19:04 -- schbirid has quit IRC (Read error: Connection reset by peer)
19:59 -- zhongfu has quit IRC (Ping timeout: 260 seconds)
20:05 -- zhongfu has joined #archiveteam-bs
20:51 -- benuski has quit IRC (Read error: Operation timed out)
20:54 -- BlueMaxim has joined #archiveteam-bs
21:07 -- benuski has joined #archiveteam-bs
21:47 -- BartoCH has quit IRC (Quit: WeeChat 1.9.1)
21:52 -- BartoCH has joined #archiveteam-bs
22:12 -- c4rc4s has quit IRC (Quit: words)
22:25 -- c4rc4s has joined #archiveteam-bs
22:32 -- Ceryn has joined #archiveteam-bs
22:34 <Ceryn> Hey. Do you guys archive websites? If yes, what tools do you use? Custom ones? wget/WARC?
22:35 <Ceryn> I'm looking at some available options. It seems I might have to write my own to get the features I want (such as keeping the archive up to date, handling changes in pages, ...).
22:35 <JAA> That's what we do. We mostly use wpull (e.g. ArchiveBot and many people using it manually) or wget-lua (in the warrior) and write WARCs.
22:36 <Ceryn> Hm. Warrior?
22:36 <JAA> Our tool to launch Distributed Preservation of Service attacks
22:36 <JAA> http://archiveteam.org/index.php?title=Warrior
22:36 <Ceryn> Haha.
22:38 <JAA> wpull's quite nice if you need to tweak its behaviour. It's written in Python, i.e. you can monkeypatch everything.
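A minimal sketch of the monkeypatching idea JAA mentions here. `Fetcher` is a hypothetical stand-in class, not wpull's real API (which varies between versions); only the technique itself is illustrated.

```python
# Sketch of monkeypatching. "Fetcher" is a hypothetical stand-in for a
# wpull component; the pattern is the same for any Python class.

class Fetcher:
    def fetch(self, url):
        return f"fetched {url}"

_original_fetch = Fetcher.fetch

def fetch_with_logging(self, url):
    # Inject custom behaviour at runtime, without forking the library.
    print(f"about to fetch: {url}")
    return _original_fetch(self, url)

Fetcher.fetch = fetch_with_logging  # the monkeypatch

print(Fetcher().fetch("http://example.com/"))
```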
22:38 <Ceryn> I'll look into Warrior. Thanks.
22:38 <Ceryn> wpull's a bot command?
22:38 <JAA> wpull's a replacement for wget.
22:39 <JAA> It's mostly a reimplementation of wget actually, but in Python rather than nasty unmaintainable C code.
22:39 <Ceryn> I want to defend C, but I suppose Python *is* the better language in this case.
22:40 <JAA> Nothing wrong with C in general, but from what I've heard, the wget code is really ugly.
22:40 <Ceryn> This one? https://github.com/chfoo/wpull
22:40 <JAA> Yep
22:40 <Ceryn> Cool.
22:40
π
|
DrasticAc |
There are C# solutions that exist https://github.com/antiufo/Shaman.Scraping |
22:41
π
|
JAA |
You'll want to use either version 1.2.3 though or the fork by FalconK. 2.0.1 is really buggy. |
22:41
π
|
Ceryn |
And why do you want WARC files? Solely for Time Machine compatibility? |
22:41
π
|
DrasticAc |
Although it needs work, I had to hack it to get it compiled |
22:41
π
|
JAA |
Wayback Machine* |
22:41
π
|
Ceryn |
Right. |
22:41
π
|
JAA |
Well, WARC's a nice format that saves all the relevant metadata (request and response headers etc.). |
22:42
π
|
DrasticAc |
It's easy enough to playback the warc if you want to scrap the HTML. |
22:42
π
|
DrasticAc |
Or whatever it is your getting |
22:43 <Ceryn> So you want the WARC metadata to be able to re-do scrapes?
22:43 <JAA> Yep. You can run pywb or a similar tool to play back a WARC and browse the site just like it was the live site.
22:43 <JAA> No, because metadata is just as important as the content itself.
22:43 <Ceryn> Hm. Interesting. I was thinking of just storing HTML pages for ease of use, but I'll look into pywb or similar too.
22:44 <JAA> Also, you need at least two pieces of metadata (URL and date) to be able to do anything meaningful with the archive at all.
22:44 <JAA> Or to even call it an archive, in my opinion.
22:45 <JAA> Otherwise it's just a copy (and possibly a modified one to get images etc. to work).
22:45 <Ceryn> Oh? ELI5 why full metadata, to the extent WARC provides, is as important as the data? I can see uses and nice-to-haves, but I don't know about must-haves.
22:45 <DrasticAc> Oh yeah, I'm just saying if you _did_ want to scrape it after the fact to mine it, it's much easier to use the WARC than the live site (which will probably be down)
22:46 <DrasticAc> So getting the WARC first makes total sense in any case
22:46 <DrasticAc> btw, is archive.org down? I can't connect to the S3 and the main website seems down.
22:46 <Ceryn> So WARCs preserve everything in mint condition. Saves me from doing a lot of patchwork myself.
22:46
π
|
|
drumstick has joined #archiveteam-bs |
22:47
π
|
Ceryn |
DrasticAc: archive.org looks like it's timing out here. |
22:48
π
|
JAA |
Yep, down at the moment. THere was some talk about it earlier in #archivebot. |
22:50 <JAA> WARC doesn't actually contain that much metadata. It contains the URL and timestamp, a record type (to distinguish between requests, responses, and other things), the IP address to which a host resolved at that point in time, and hash digests for verifying that the contents aren't corrupted.
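For illustration, a made-up WARC/1.0 response record header carrying the fields JAA lists; the field names are standard, but every value below is invented.

```
WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:3e1f0a52-7c0b-4e2e-9d3a-1f2b8c6a4d5e>
WARC-Date: 2017-10-21T22:50:00Z
WARC-Target-URI: http://example.com/
WARC-IP-Address: 93.184.216.34
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-Block-Digest: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2
Content-Type: application/http; msgtype=response
Content-Length: 1046
```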
22:52 <JAA> You'll want to preserve HTTP requests and response headers to be able to reconstruct what you actually archived. For example, if a site modifies its pages based on user agent detection and you don't store the HTTP request, you'll never know what triggered that response.
22:52 <Ceryn> DrasticAc: I hesitate to use a C# solution though. Maybe if it does exactly what I want. But otherwise I risk having to write C#.
22:52 <DrasticAc> Yeah, I agree. Wish it was F#.
22:53 <JAA> Response headers contain additional information, e.g. about the server software used or when a resource was last modified.
22:54 <JAA> I would argue that all of this is very important information.
22:54 <Ceryn> I suppose I can't say it's unimportant.
22:55 <Ceryn> My main use case, I think, would be having copies of sites to go back to for nostalgia or something.
22:56 <Ceryn> But maybe WARC's a better way to go about it anyway.
22:56
π
|
JAA |
Yeah, I think they're superior in general. Having everything in a single file (or a few of them) is much easier to handle than thousands of files spread across directories (e.g. if you use wget --mirror). |
22:58
π
|
Ceryn |
Are you supposed to dump all logs for a scrape in a single WARC file? Or size-capped WARC files? |
22:58
π
|
JAA |
They're also compressed, which can be a *massive* space saver. |
23:00
π
|
JAA |
Usually size-capped files of a few GB (I use 2 GiB for my own grabs, ArchiveBot uses 5 GiB by default), though I'm not entirely sure why. It shouldn't really be a problem to have bigger files. |
23:00
π
|
JAA |
You can always split or merge them, too. |
23:00
π
|
Ceryn |
Right. |
23:02
π
|
Ceryn |
If you want to update the archive, do you just WARC from scratch? Or re-visiting links from the WARC data? |
23:02
π
|
Ceryn |
Maybe we're into custom code here. |
23:03
π
|
JAA |
Depends on what you want to do exactly. |
23:03
π
|
odemg |
Doesn't seem like just the site is down, my uploads stopped over torrent and python |
23:03
π
|
Ceryn |
Well, I want to keep my archive up to date while still saving older copies of pages. |
23:04 <JAA> So WARC supports deduplication by referencing old records. You could make use of that.
23:05 <JAA> You can of course also skip any URLs that you know (or assume) haven't changed in the meantime.
23:05 <JAA> On playback, you'd get the old version in that case.
23:05 <JAA> But yes, this would probably require custom code.
23:05 <JAA> The deduplication part, I mean.
23:06 <JAA> Heritrix might support it though.
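A sketch of the deduplication JAA describes, using the warcio library (my choice, not mentioned in the chat) and assuming its revisit-record helper fits; the file name, URL, digest, and date below are all made up.

```python
from warcio.warcwriter import WARCWriter

# Sketch only: append a revisit record instead of storing an unchanged
# payload a second time. All values below are invented for illustration.
with open('dedup.warc.gz', 'ab') as out:
    writer = WARCWriter(out, gzip=True)
    record = writer.create_revisit_record(
        'http://example.com/logo.png',                   # URL of the new capture
        digest='sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ',  # payload digest seen again
        refers_to_uri='http://example.com/logo.png',     # earlier capture's URL
        refers_to_date='2017-10-01T00:00:00Z',           # earlier capture's date
    )
    writer.write_record(record)
```

On playback, a revisit record tells the replay tool to serve the payload of the earlier capture it points to, which is exactly the "old version" behaviour described above.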
23:06 <Ceryn> Hm. WARC does sound better and better.
23:07 -- jschwart has quit IRC (Quit: Konversation terminated!)
23:09 <Ceryn> Sweet.
23:10 <Ceryn> Perhaps I might want to use WARC anyway, haha. I think I might still end up writing relevant code, but I can probably base a whole lot on the WARC library and maybe wpull.
23:11 -- kvieta has quit IRC (Quit: greedo shot first)
23:12 <Ceryn> So thanks for that.
23:12 <Ceryn> When you guys archive stuff, is it for your own use/hoard?
23:15 -- kvieta has joined #archiveteam-bs
23:16 <JAA> I'm sure people in here archive stuff for their own use, but most of it goes to the Internet Archive for public consumption.
23:17 <Ceryn> And that works by people just archiving and uploading whatever?
23:17 <JAA> I'm not entirely sure how it works. I still haven't uploaded my grabs to the IA because I'm too lazy to figure out how to do it.
23:17 <Ceryn> Haha okay. That'll be me once it's all up and running.
23:17
π
|
JAA |
I think you need to set an attribute when uploading the files. Not sure if it happens automatically afterwards or not. |
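For reference, the internetarchive Python library handles such uploads; the attribute JAA is presumably thinking of is item metadata such as the mediatype field, though that reading is an assumption, and the identifier, file name, and title below are made up.

```python
from internetarchive import upload

# Sketch of an IA upload; identifier and file name are invented, and
# whether these metadata values are the right ones for WARCs is exactly
# the detail left uncertain in the conversation above.
responses = upload(
    'example-site-grab-2017',
    files=['example-site-2017.warc.gz'],
    metadata={'mediatype': 'web', 'title': 'Example site grab (2017)'},
)
print(responses[0].status_code)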
23:19 <Ceryn> Do you archive everything on the domain for a given domain name? All resources, images, video, objects?
23:19 <JAA> Depends strongly on the site.
23:20 <JAA> Usually, there is some stuff you need to skip.
23:20 <JAA> Also, you'll generally want to retrieve images etc. even if they're not on the same domain.
23:21 <Ceryn> Yes, I thought about that. And maybe the page you're archiving links to other pages you ought to have too?
23:21 <JAA> Yeah, that's what ArchiveBot does by default, retrieving one extra layer of pages "around" the actual target for context.
23:21 <Ceryn> (Maybe just the specific pages linked to, and not the entire thing. So a follow-links number of 2 or something.)
23:22 <Ceryn> Cool.
23:24 <Ceryn> What stuff would you need to skip?
23:24 <JAA> Misparsed stuff (e.g. from wpull's JavaScript "parser", which just extracts anything that looks like a path), infinite loops (e.g. calendars), share links, other useless stuff (e.g. links that require an account), etc.
23:26 <JAA> Sometimes, you get session IDs in the URL, which you also need to handle carefully to not retrieve everything hundreds of times.
23:26 <JAA> Then there's specific stuff like archiving forums, where you might want to skip the links for individual posts depending on how many posts there are.
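A sketch of what ignore patterns for the trouble spots JAA lists might look like as regular expressions; the patterns are illustrative, not a vetted set.

```python
import re

# Illustrative ignore patterns for the cases mentioned above (session
# IDs, calendar traps, share links); invented, not a vetted set.
IGNORE_PATTERNS = [
    re.compile(r'[?&](?:PHPSESSID|jsessionid|sid)=', re.I),
    re.compile(r'/calendar/\d{4}'),
    re.compile(r'[?&]share=(?:facebook|twitter)'),
]

def should_skip(url: str) -> bool:
    """Return True if the URL matches any ignore pattern."""
    return any(p.search(url) for p in IGNORE_PATTERNS)

print(should_skip('http://example.com/?PHPSESSID=abc123'))  # True
```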
23:26
π
|
godane |
so i'm now down to $35 on my patreon: https://www.patreon.com/godane |
23:35
π
|
Ceryn |
Hm. Is manual work usually required for scraping a site, then? |
23:35
π
|
Ceryn |
Apart from starting the scraper. |
23:36
π
|
JAA |
Mostly just adding ignore patterns and fiddling with the concurrency and delay settings. |
23:37
π
|
JAA |
With plain wpull, that's a bit of a pain though. |
23:37
π
|
JAA |
So you might want to look into grab-site. |
23:37
π
|
JAA |
Not sure how easily that is customisable though. |
23:39 -- pizzaiolo has quit IRC (Remote host closed the connection)
23:41 -- pizzaiolo has joined #archiveteam-bs
23:41 <Ceryn> Huh. That's a lot of ignore patterns grab-site has.
23:42 <JAA> Yup
23:43 <Ceryn> Ignore patterns seem to be a pain in the ass.
23:44 <Ceryn> By the time you realise you need an ignore pattern, surely you'll have accumulated a ton of crap?
23:47 <JAA> That's usually how it goes, yes.
23:48 <Ceryn> And then... You start over? Because you don't want all that crap?
23:48 <Ceryn> Or maybe you can just prune it?
23:48 <JAA> Depends on your goal. We usually just leave it.
23:49 <JAA> But yeah, you could filter the WARC.
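A sketch of that kind of after-the-fact filtering with the warcio library (my choice of tool, not named in the chat), dropping records whose target URI matches an unwanted pattern; the file names and the pattern are invented.

```python
import re

from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

JUNK = re.compile(r'[?&]share=')  # invented example of crap to prune

# Copy every record except those whose target URI matches the pattern.
with open('original.warc.gz', 'rb') as src, open('pruned.warc.gz', 'wb') as dst:
    writer = WARCWriter(dst, gzip=True)
    for record in ArchiveIterator(src):
        uri = record.rec_headers.get_header('WARC-Target-URI') or ''
        if JUNK.search(uri):
            continue
        writer.write_record(record)
```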
23:50 <Ceryn> Shit. That's going to be a nightmare.
23:50 <godane> so archive.org is not working for me
23:50 <Ceryn> How do these bad patterns work? Infinite-length URLs? Or circular links?
23:50 <JAA> godane: Yep, they've been down for at least two hours.
23:50 <JAA> No word yet on what's going on.
23:52 <JAA> Ceryn: Basically, the stuff I listed above. Circular links are already handled by wpull. Infinite loops need to be handled manually in most cases.
23:53 <Ceryn> Oh.
23:53 <godane> ok then
23:54 <Ceryn> Maybe a recursion depth limit could help too.
23:54 <JAA> Yeah, depends on the site really.
23:54 <Ceryn> Tough to figure out how far it should go though.
23:54 <Ceryn> Yeah. I had hoped I could automate this pretty much fully.
23:54
π
|
dashcloud |
as you've probably noticed, site-grabbing is more of an art than a science, but the more you do it, the better you can get at it |
23:55
π
|
JAA |
Yep, this. |
23:55
π
|
Ceryn |
Heh, yeah. |
23:55
π
|
JAA |
And if you just want to regrab the same site(s) periodically, you can probably automate most of it. |
23:55
π
|
Ceryn |
I suppose it ought to only be painful the first time. |
23:55
π
|
Ceryn |
There's a joke in here somewhere. |
23:55
π
|
dashcloud |
you're facing much the same problem browser vendors do- they have to support tons of crazy ideas, and things that should never have seen the light of day |
23:56
π
|
JAA |
<marquee>I'm not sure what you're talking about.</marquee> |
23:57 <dashcloud> you didn't like the marquee + MIDI music aesthetic of the 90s?
23:57 <Frogging> it's better than a lot of what we have now tbh
23:57 <JAA> Yeah, that's probably true.
23:58 <DrasticAc> I use marquee whenever possible
23:58 <Ceryn> Haha. I had forgotten (repressed?) that effect.
23:58 <DrasticAc> It can still be used in all major browsers, even though I think they'll complain in the console if you do