#archiveteam-bs 2017-10-29,Sun


***Honno has joined #archiveteam-bs [00:50]
...... (idle for 27mn)
Mateon1 has quit IRC (Read error: Operation timed out) [01:17]
drumstick has quit IRC (Ping timeout: 360 seconds)
drumstick has joined #archiveteam-bs
[01:30]
....... (idle for 30mn)
Honno has quit IRC (Read error: Operation timed out) [02:00]
.... (idle for 16mn)
hook54321I've been recording a camera feed from a bar in Catalonia for hours... But I'm still on the fence about whether I should upload it to archive.org or not. [02:16]
***schbirid has quit IRC (Ping timeout: 255 seconds) [02:24]
schbirid has joined #archiveteam-bs
pizzaiolo has joined #archiveteam-bs
[02:36]
........... (idle for 53mn)
godaneso this guy has an interesting set of vhs : https://www.ebay.com/sch/VHS-Video-Tapes/149960/m.html?_nkw=&_armrs=1&_ipg=&_from=&_ssn=froggyholler
i'm only interested in one tape labeled Dog Day Afternoon with a time of 1h 57m
on TB
*TBS
[03:30]
hook54321I wonder what would happen if we set up something like a GoFundMe so we could obtain stuff like that [03:35]
godanei do have a patreon page: https://www.patreon.com/godane
hook54321: its how i got the last set of tapes from ebay
[03:43]
SketchCow: i'm uploading the original broadcast of Dinotopia so don't upload my vhs stuff for the next day
it's about 17,272,156,160 bytes in size
[03:55]
..... (idle for 23mn)
***qw3rty113 has joined #archiveteam-bs
Stilett0 has joined #archiveteam-bs
qw3rty112 has quit IRC (Read error: Operation timed out)
[04:19]
Mateon1 has joined #archiveteam-bs [04:32]
phillipsjI saw a few episodes of that.. it was OK, but not great.
phillipsj is getting old.
[04:33]
...... (idle for 25mn)
***drumstick has quit IRC (Ping timeout: 255 seconds)
drumstick has joined #archiveteam-bs
[04:58]
................. (idle for 1h22mn)
pizzaiolo has quit IRC (Ping timeout: 246 seconds) [06:20]
..... (idle for 21mn)
SketchCowGreat [06:41]
........................... (idle for 2h10mn)
hook54321I think I've found some Catalonia radio stations that can be listened to online, however I've been blocked from them (I'm pretty sure for trying different ports), so I can't record them. :/ [08:51]
.......... (idle for 48mn)
***mls has joined #archiveteam-bs
mls has left
[09:39]
hook54321https://www.youtube.com/watch?v=cfmiNyneO88 [09:45]
***BlueMaxim has quit IRC (Quit: Leaving) [09:53]
..... (idle for 22mn)
JAASketchCow: Is there anything that can be done to speed up transfers to FOS? My ArchiveBot pipeline has trouble keeping up as the uploads average only 2-3 MB/s. Or should I look into uploading to IA directly instead? [10:15]
***Honno has joined #archiveteam-bs [10:16]
...... (idle for 28mn)
hook54321If anyone is interested in joining a discord server where there are people from Catalonia, dm me and if I'm awake I'll probably send you the link, I might ask you a few questions about why you want to join it. [10:44]
***Aerochrom has joined #archiveteam-bs [10:50]
.................... (idle for 1h35mn)
nepeat has quit IRC (ZNC 1.6.5 - http://znc.in) [12:25]
...... (idle for 25mn)
will has joined #archiveteam-bs [12:50]
............ (idle for 55mn)
pizzaiolo has joined #archiveteam-bs [13:45]
....... (idle for 34mn)
Aoede has quit IRC (Ping timeout: 255 seconds)
Aoede has joined #archiveteam-bs
Aoede has quit IRC (Connection closed)
Aoede has joined #archiveteam-bs
[14:19]
fie has quit IRC (Quit: Leaving) [14:25]
.... (idle for 15mn)
drumstick has quit IRC (Read error: Operation timed out) [14:40]
....... (idle for 33mn)
pizzaiolo has quit IRC (Remote host closed the connection) [15:13]
........ (idle for 35mn)
SketchCowJAA: Noo
JAA: Also Noo
FOS is slow, is about to be replaced by another FOS
[15:48]
JAAOk, sweet.
Any idea when that'll happen?
[15:54]
SketchCowSoon? [16:04]
.... (idle for 16mn)
***julius_ is now known as jschwart [16:20]
jschwartI've made a collection of discs now that I can probably send to the Internet Archive, are there any instructions online for that?
another thing I have is old hardware with manuals, discs, etc. does anybody have any suggestions on what would be nice to do with that?
and what about software discs which came with relatively big books as manuals?
[16:21]
SketchCowDescribe discs in this context. [16:24]
jschwartI have for instance Klik & Play, an old tool to make games, it came with quite a big book
also Superlogo which came with a (Dutch) book and discs
[16:26]
......... (idle for 40mn)
***pizzaiolo has joined #archiveteam-bs [17:06]
...... (idle for 25mn)
Specular has joined #archiveteam-bs [17:31]
Specularasking again in case someone here knows: is it possible to save a full list of results for a generic 'site:' query from Google? [17:32]
dashcloudSpecular: not completely automated- Google quickly hits you with captchas and bans if you try to scrape them [17:33]
Speculargod damn it [17:33]
dashcloudbut there's a number of tools you can use in browser [17:33]
Speculara site recently hid an entire section of its site—according to one user around 36 million posts—and I'm not sure for how long Google will retain the results of its previous crawls [17:33]
dashcloudapparently the thing you want to do is very popular in the SEO community, so there's a number of extensions for this, and guides on doing it [17:34]
Speculardashcloud, what should I be entering when searching? Couldn't find much when I tried last
(that is to find more about how to do this)
[17:35]
dashcloudyou probably want download google search results to csv [17:35]
Specularfor an enormous scrape wouldn't the size become unmanageably large to open?
I'm not actually sure if this would cause problems but thought I'd ask
[17:36]
dashcloudpossibly- then you'd use a different tool to open the file
notepad++ , a database, or possibly LibreOffice/Excel
there's this Chrome extension which may or may not work: https://chrome.google.com/webstore/detail/linkclump/lfpjkncokllnfokkgpkobnkbkmelfefj?hl=en
here's a promising firefox extension: https://jurnsearch.wordpress.com/2012/01/27/how-to-extract-google-search-results-with-url-title-and-snippet-in-a-csv-file/
jschwart: if you want to upload stuff yourself to the Internet Archive, I can give you some basic guidelines
if you'd prefer to send stuff to IA, ask SketchCow for the mailing address
[17:38]
SpecularGoogle seems like it might offer a way to grab this officially via a service called the 'Search Console', but to get over a certain number of results one needs the pro version. Perhaps someone on the aforementioned site has a subscription. [17:49]
***benuski has joined #archiveteam-bs [18:00]
...... (idle for 25mn)
Specular has quit IRC (be back later) [18:25]
jschwartdashcloud: I'd like to free up the physical space as well
so if I could send things somewhere where they would be useful, I'd rather do that
[18:32]
....... (idle for 30mn)
***schbirid2 has joined #archiveteam-bs
schbirid has quit IRC (Read error: Connection reset by peer)
[19:03]
............ (idle for 55mn)
zhongfu has quit IRC (Ping timeout: 260 seconds) [19:59]
zhongfu has joined #archiveteam-bs [20:05]
.......... (idle for 46mn)
benuski has quit IRC (Read error: Operation timed out)
BlueMaxim has joined #archiveteam-bs
[20:51]
benuski has joined #archiveteam-bs [21:07]
......... (idle for 40mn)
BartoCH has quit IRC (Quit: WeeChat 1.9.1) [21:47]
BartoCH has joined #archiveteam-bs [21:52]
..... (idle for 20mn)
c4rc4s has quit IRC (Quit: words) [22:12]
c4rc4s has joined #archiveteam-bs [22:25]
Ceryn has joined #archiveteam-bs [22:32]
CerynHey. Do you guys archive websites? If yes, what tools do you use? Custom ones? wget/warc?
I'm looking at some available options. It seems I might have to write my own to get the features I want (such as keeping the archive up to date, handling changes in pages, ...).
[22:34]
JAAThat's what we do. We mostly use wpull (e.g. ArchiveBot and many people using it manually) or wget-lua (in the warrior) and write WARCs. [22:35]
CerynHm. warrior? [22:36]
JAAOur tool to launch Distributed Preservation of Service attacks
http://archiveteam.org/index.php?title=Warrior
[22:36]
CerynHaha. [22:36]
JAAwpull's quite nice if you need to tweak its behaviour. It's written in Python, i.e. you can monkeypatch everything. [22:38]
CerynI'll look into Warrior. Thanks.
wpull's a bot command?
[22:38]
JAAwpull's a replacement for wget.
It's mostly a reimplementation of wget actually, but in Python rather than nasty unmaintainable C code.
[22:38]
CerynI want to defend C, but I suppose Python *is* the better language in this case. [22:39]
JAANothing wrong with C in general, but from what I've heard, the wget code is really ugly. [22:39]
CerynThis one? https://github.com/chfoo/wpull [22:40]
JAAYep [22:40]
CerynCool. [22:40]
DrasticAcThere are C# solutions that exist https://github.com/antiufo/Shaman.Scraping [22:40]
JAAYou'll want to use either version 1.2.3 though or the fork by FalconK. 2.0.1 is really buggy. [22:41]
CerynAnd why do you want WARC files? Solely for Time Machine compatibility? [22:41]
DrasticAcAlthough it needs work, I had to hack it to get it compiled [22:41]
JAAWayback Machine* [22:41]
CerynRight. [22:41]
JAAWell, WARC's a nice format that saves all the relevant metadata (request and response headers etc.). [22:41]
DrasticAcIt's easy enough to play back the warc if you want to scrape the HTML.
Or whatever it is you're getting
[22:42]
CerynSo you want the WARC metadata to be able to re-do scrapes? [22:43]
JAAYep. You can run pywb or a similar tool to play back a WARC and browse the site just like it was the live site.
No, because metadata is just as important as the content itself.
[22:43]
CerynHm. Interesting. I was thinking of just storing HTML pages for ease of use, but I'll look into pywb or similar too. [22:43]
JAAAlso, you need at least two pieces of metadata (URL and date) to be able to do anything meaningful with the archive at all.
Or to even call it an archive, in my opinion.
Otherwise it's just a copy (and possibly a modified one to get images etc. to work).
[22:43]
CerynOh? ELI5 why full meta data, to the extent WARC provides, is as important as the data? I can see uses and nice-to-haves, but I don't know about must-haves. [22:45]
DrasticAcOh yeah, I'm just saying if you _did_ want to scrape it after the fact to mine it, it's much easier to use the WARC than the live site (which will probably be down)
So getting the warc first makes total sense in any case
btw, is archive.org down? I can't connect to the S3 and the main website seems down.
[22:45]
CerynSo WARCs preserve everything in mint condition. Saves me from doing a lot of patchwork myself. [22:46]
***drumstick has joined #archiveteam-bs [22:46]
CerynDrasticAc: archive.org looks like it's timing out here. [22:47]
JAAYep, down at the moment. There was some talk about it earlier in #archivebot.
WARC doesn't actually contain that much metadata. It contains the URL and timestamp, a record type (to distinguish between requests, responses, and other things), the IP address to which a host resolved at that point in time, and hash digests for verifying that the contents aren't corrupted.
You'll want to preserve HTTP requests and response headers to be able to reconstruct what you actually archived. For example, if a site modifies its pages based on user agent detection and you don't store the HTTP request, you'll never know what triggered that response.
[22:48]
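The named fields JAA lists (URL, timestamp, record type, IP address, payload digest) sit in plain text at the top of every WARC record. A minimal stdlib-only sketch that parses one; the record bytes below are a hypothetical example, not taken from a real crawl:

```python
# Parse the named-field header of a single (uncompressed) WARC record.
# The record text is a fabricated example for illustration only.
RECORD = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"WARC-Date: 2017-10-29T22:48:00Z\r\n"
    b"WARC-IP-Address: 93.184.216.34\r\n"
    b"WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
)

def parse_warc_header(raw: bytes) -> dict:
    """Return the WARC named fields as a dict (version line excluded)."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("utf-8")
    lines = head.split("\r\n")
    assert lines[0].startswith("WARC/")  # version line, e.g. WARC/1.0
    fields = {}
    for line in lines[1:]:
        name, _, value = line.partition(": ")
        fields[name] = value
    return fields

fields = parse_warc_header(RECORD)
print(fields["WARC-Target-URI"], fields["WARC-Type"])
```

The HTTP request and response headers JAA mentions are stored separately, as the record payload after the blank line.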
CerynDrasticAc: I hesitate to use a C# solution though. Maybe if it does exactly what I want. But otherwise I risk having to write C#. [22:52]
DrasticAcYeah, I agree. Wish it was F#. [22:52]
JAAResponse headers contain additional information, e.g. about the server software used or when a resource was last modified.
I would argue that all of this is very important information.
[22:52]
CerynI suppose I can't say it's non-important.
My main use case, I think, would be having copies of sites to go back to for nostalgia or something.
But maybe WARCs' a better way to go about it anyway.
[22:54]
JAAYeah, I think they're superior in general. Having everything in a single file (or a few of them) is much easier to handle than thousands of files spread across directories (e.g. if you use wget --mirror). [22:56]
CerynAre you supposed to dump all logs for a scrape in a single WARC file? Or size-capped WARC files? [22:58]
JAAThey're also compressed, which can be a *massive* space saver.
Usually size-capped files of a few GB (I use 2 GiB for my own grabs, ArchiveBot uses 5 GiB by default), though I'm not entirely sure why. It shouldn't really be a problem to have bigger files.
You can always split or merge them, too.
[22:58]
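The size-capping JAA describes is just file rotation at record boundaries. A toy sketch of the principle (hypothetical class, not wpull's actual implementation; GNU Wget exposes the real feature as `--warc-max-size`):

```python
import os
import tempfile

class RotatingWarcWriter:
    """Toy sketch of size-capped WARC output: start a new numbered file
    once the current one reaches `cap` bytes. Rotation happens only at
    record boundaries, so each file stays independently readable."""

    def __init__(self, prefix, cap):
        self.prefix, self.cap, self.seq = prefix, cap, 0
        self.fh = open(self._name(), "wb")

    def _name(self):
        return f"{self.prefix}-{self.seq:05d}.warc"

    def write_record(self, record: bytes):
        # Rotate *before* writing if the current file is already at cap,
        # so a record is never split across two files.
        if self.fh.tell() >= self.cap:
            self.fh.close()
            self.seq += 1
            self.fh = open(self._name(), "wb")
        self.fh.write(record)

    def close(self):
        self.fh.close()

tmp = tempfile.mkdtemp()
w = RotatingWarcWriter(os.path.join(tmp, "grab"), cap=100)
for _ in range(5):
    w.write_record(b"X" * 60)  # 60-byte dummy "records"
w.close()
files = sorted(os.listdir(tmp))
print(files)  # three files: two of 120 bytes, one of 60
```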
CerynRight.
If you want to update the archive, do you just WARC from scratch? Or re-visiting links from the WARC data?
Maybe we're into custom code here.
[23:00]
JAADepends on what you want to do exactly. [23:03]
odemgDoesn't seem like just the site is down, my uploads stopped over torrent and python [23:03]
CerynWell, I want to keep my archive up to date while still saving older copies of pages. [23:03]
JAASo WARC supports deduplication by referencing old records. You could make use of that.
You can of course also skip any URLs that you know (or assume) haven't changed in the meantime.
On playback, you'd get the old version in that case.
But yes, this would probably require custom code.
The deduplication part, I mean.
Heritrix might support it though.
[23:04]
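The deduplication JAA sketches works by indexing payload digests: a repeat capture is written as a "revisit" record pointing at the original instead of storing the body again. A simplified stdlib-only illustration (the dict-based record layout is hypothetical, though `WARC-Refers-To-Target-URI` and `WARC-Refers-To-Date` are real WARC fields):

```python
import hashlib

# digest of the payload -> (url, date) of the original capture
seen = {}

def record_for(url, date, body):
    """Return a response record for new content, or a revisit record
    referencing the earlier capture when the payload digest matches."""
    digest = "sha1:" + hashlib.sha1(body).hexdigest()
    if digest in seen:
        orig_url, orig_date = seen[digest]
        return {"WARC-Type": "revisit",
                "WARC-Target-URI": url,
                "WARC-Refers-To-Target-URI": orig_url,
                "WARC-Refers-To-Date": orig_date}
    seen[digest] = (url, date)
    return {"WARC-Type": "response", "WARC-Target-URI": url, "body": body}

first = record_for("http://example.com/a", "2017-10-29", b"<html>same</html>")
second = record_for("http://example.com/b", "2017-10-30", b"<html>same</html>")
print(first["WARC-Type"], second["WARC-Type"])  # response revisit
```

On playback, a tool like pywb resolves the revisit record back to the original body, which is how you "get the old version" JAA mentions.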
CerynHm. WARC does sound better and better. [23:06]
***jschwart has quit IRC (Quit: Konversation terminated!) [23:06]
CerynSweet.
Perhaps I might want to use WARC anyway, haha. I think I might still end up writing relevant code, but I can probably base a whole lot on the WARC library and maybe wpull.
[23:07]
***kvieta has quit IRC (Quit: greedo shot first) [23:10]
CerynSo thanks for that.
When you guys archive stuff, is it for your own use/hoard?
[23:11]
***kvieta has joined #archiveteam-bs [23:12]
JAAI'm sure people in here archive stuff for their own use, but most of it goes to the Internet Archive for public consumption. [23:15]
CerynAnd that works by people just archiving and uploading whatever? [23:16]
JAAI'm not entirely sure how it works. I still haven't uploaded my grabs to the IA because I'm too lazy to figure out how to do it. [23:17]
CerynHaha okay. That'll be me once it's all up and running. [23:17]
JAAI think you need to set an attribute when uploading the files. Not sure if it happens automatically afterwards or not. [23:17]
CerynDo you archive everything on the domain for a given domain name? All resources, images, video, objects? [23:19]
JAADepends strongly on the site.
Usually, there is some stuff you need to skip.
Also, you'll generally want to retrieve images etc. also if they're not on the same domain.
[23:19]
CerynYes, I thought about that. And maybe the page you're archiving links to other pages you ought to have too? [23:20]
JAAYeah, that's what ArchiveBot does by default, retrieving one extra layer of pages "around" the actual target for context. [23:21]
Ceryn(Maybe just the specific pages linked to, and not the entire thing. So a follow-links number of 2 or something.)
Cool.
What stuff would you need to skip?
[23:21]
JAAMisparsed stuff (e.g. from wpull's JavaScript "parser", which just extracts anything that looks like a path), infinite loops (e.g. calendars), share links, other useless stuff (e.g. links that require an account), etc.
Sometimes, you get session IDs in the URL, which you also need to handle carefully to not retrieve everything hundreds of times.
Then there's specific stuff like archiving forums where you might want to skip the links for individual posts depending on how many posts there are.
[23:24]
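Ignore patterns of the kind JAA describes are typically just regexes tested against each discovered URL, as in grab-site's ignore sets. A small sketch; the patterns are illustrative examples of the traps mentioned above (calendars, share links, session IDs), not a real production list:

```python
import re

# Illustrative ignore patterns, not a real grab-site ignore set.
IGNORES = [
    r"/calendar/\d{4}/\d{2}",           # infinite calendar pagination
    r"[?&]share=",                      # share links
    r"[?&](sid|PHPSESSID)=[0-9a-f]+",   # session IDs in the query string
]
COMPILED = [re.compile(p) for p in IGNORES]

def should_fetch(url: str) -> bool:
    """A URL is fetched only if no ignore pattern matches it."""
    return not any(rx.search(url) for rx in COMPILED)

urls = [
    "http://forum.example/viewtopic.php?t=42",
    "http://forum.example/viewtopic.php?t=42&sid=deadbeef",
    "http://site.example/calendar/2038/01",
]
kept = [u for u in urls if should_fetch(u)]
print(kept)  # only the first URL survives
```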
godaneso i'm now down to $35 on my patreon: https://www.patreon.com/godane [23:26]
CerynHm. Is manual work usually required for scraping a site, then?
Apart from starting the scraper.
[23:35]
JAAMostly just adding ignore patterns and fiddling with the concurrency and delay settings.
With plain wpull, that's a bit of a pain though.
So you might want to look into grab-site.
Not sure how easily that is customisable though.
[23:36]
***pizzaiolo has quit IRC (Remote host closed the connection)
pizzaiolo has joined #archiveteam-bs
[23:39]
CerynHuh. That's a lot of ignore patterns grab-site has. [23:41]
JAAYup [23:42]
CerynIgnore patterns seem to be a pain in the ass.
By the time you realise you need an ignore pattern, surely you'll have accumulated a ton of crap?
[23:43]
JAAThat's usually how it goes, yes. [23:47]
CerynAnd then... You start over? Because you don't want all that crap?
Or maybe you can just prune it?
[23:48]
JAADepends on your goal. We usually just leave it.
But yeah, you could filter the WARC.
[23:48]
CerynShit. That's going to be a nightmare. [23:50]
godaneso archive.org is not working for me [23:50]
CerynHow do these bad patterns work? Infinite length urls? Or circular links? [23:50]
JAAgodane: Yep, they've been down for at least two hours.
No word yet on what's going on.
Ceryn: Basically, the stuff I listed above. Circular links are already handled by wpull. Infinite loops need to be handled manually in most cases.
[23:50]
CerynOh. [23:53]
godaneok then [23:53]
CerynMaybe a recursion depth could help too. [23:54]
JAAYeah, depends on the site really. [23:54]
CerynTough to figure out how far it should go though.
Yeah. I had hoped I could automate this pretty much fully.
[23:54]
dashcloudas you've probably noticed, site-grabbing is more of an art than a science, but the more you do it, the better you can get at it [23:54]
JAAYep, this. [23:55]
CerynHeh, yeah. [23:55]
JAAAnd if you just want to regrab the same site(s) periodically, you can probably automate most of it. [23:55]
CerynI suppose it ought to only be painful the first time.
There's a joke in here somewhere.
[23:55]
dashcloudyou're facing much the same problem browser vendors do- they have to support tons of crazy ideas, and things that should never have seen the light of day [23:55]
JAA<marquee>I'm not sure what you're talking about.</marquee> [23:56]
dashcloudyou didn't like the marquee + midi music aesthetic of the 90s? [23:57]
Froggingit's better than a lot of what we have now tbh [23:57]
JAAYeah, that's probably true. [23:57]
DrasticAcI use marquee whenever possible [23:57]
CerynHaha. I had forgotten (repressed?) that effect. [23:58]
DrasticAcIt can still be used by all major browsers, even though I think they'll complain in the console if you do [23:58]
