#archiveteam-bs 2020-08-14,Fri

Time Nickname Message
00:01 Raccoon is now known as A-real-ni
00:01 A-real-ni is now known as Raccoon
00:01 Ajay19 has quit IRC (Client Quit)
00:01 Ajay14 has joined #archiveteam-bs
00:03 Ajay1 has quit IRC (Ping timeout: 265 seconds)
00:03 Ajay14 is now known as Ajay1
00:04 Ajay15 has joined #archiveteam-bs
00:08 Ajay1 has quit IRC (Ping timeout: 265 seconds)
00:08 Ajay15 is now known as Ajay1
00:08 Ajay10 has joined #archiveteam-bs
00:10 Ajay10 has quit IRC (Client Quit)
00:11 Ajay10 has joined #archiveteam-bs
00:13 Ajay1 has quit IRC (Ping timeout: 265 seconds)
00:13 Ajay10 is now known as Ajay1
00:14 Ajay14 has joined #archiveteam-bs
00:15 Ajay14 is now known as Ajay
00:16 Ajay has quit IRC (Client Quit)
00:18 Ajay1 has quit IRC (Ping timeout: 265 seconds)
00:24 jshoard has quit IRC (Quit: Leaving)
01:18 RichardG_ has joined #archiveteam-bs
01:24 RichardG has quit IRC (Read error: Operation timed out)
01:28 lennier2 has joined #archiveteam-bs
01:32 lennier1 has quit IRC (Ping timeout: 272 seconds)
01:32 lennier2 is now known as lennier1
01:52 Ryz ...So, I was going to archive each of those file downloads from https://d-indiegames.blogspot.com/ via AB - some had dead links, like this one in particular: https://d-indiegames.blogspot.com/2014/03/misao.html - its link, http://www.mediafire.com/download/kba90hx4zrfkwa9/Misao.zip, got DMCA'd by Nintendo...even though, judging from the screenshots, there isn't anything Nintendo-related at all
01:54 Arcorann_ https://vgperson.com/games/misao.htm
01:55 Arcorann_ That is weird, though
02:00 nico_32 https://web.archive.org/web/*/https://vgperson.com/games/Misao303.zip
02:00 nico_32 Tue, 07 Apr 2020 12:42:17 GMT (why: wikicollections, wikipedia-eventstream, wikipediaoutlinks)
02:00 nico_32 interesting way for IA to grab this file
02:13 HP_Archiv has joined #archiveteam-bs
02:23 Pixi has quit IRC (Read error: Connection reset by peer)
02:24 Ryz There's this page https://d-indiegames.blogspot.com/2014/08/misao-version-3.html - and that download link is still up fortunately
02:24 Ryz ...It's just really bizarre that somehow this was taken down by Nintendo :/
02:29 Pixi has joined #archiveteam-bs
03:17 HP_Archiv has quit IRC (Quit: Leaving)
03:25 Ryz Interesting, this Tumblr account is set up so that https://terriball-tl.tumblr.com/ redirects to https://terriball-tl.tumblr.com/tagged/release
03:26 qw3rty_ has joined #archiveteam-bs
03:27 OrIdow6 voltagex on the HN page: "A 2019 paper says there's 47TB of Docker images on the Hub. Get scraping."
03:27 OrIdow6 I've been meaning to contact voltagex about the Samsung VR thing anyway...
03:28 OrIdow6 That quote wasn't cited, obviously
03:28 wyatt8740 has quit IRC (Read error: Operation timed out)
03:30 OrIdow6 Wait, he's here
03:33 qw3rty has quit IRC (Read error: Operation timed out)
04:20 OrIdow6 And apparently put the link in -ot
05:18 phuzion has joined #archiveteam-bs
05:19 phuzion_ has quit IRC (Read error: Connection reset by peer)
05:21 Ctrl has quit IRC (Read error: Operation timed out)
05:21 wessel152 has quit IRC (Read error: Operation timed out)
05:22 Ctrl has joined #archiveteam-bs
05:31 Ctrl has quit IRC (Read error: Operation timed out)
05:32 legoktm has quit IRC (Read error: Connection reset by peer)
05:32 BnAboyZ has quit IRC (Read error: Connection reset by peer)
05:32 underscor has quit IRC (Read error: Connection reset by peer)
05:32 underscor has joined #archiveteam-bs
05:33 BnAboyZ has joined #archiveteam-bs
05:34 legoktm has joined #archiveteam-bs
05:43 Ctrl has joined #archiveteam-bs
05:44 britmob_ has joined #archiveteam-bs
05:45 Jake has quit IRC (Read error: Operation timed out)
05:49 britmob has quit IRC (Read error: Operation timed out)
05:54 Jake has joined #archiveteam-bs
06:36 HP_Archiv has joined #archiveteam-bs
07:16 HP_Archiv has quit IRC (Read error: Connection reset by peer)
07:26 HP_Archiv has joined #archiveteam-bs
07:55 user_ has joined #archiveteam-bs
08:09 user_ Hi, I have a couple of questions. I've partially archived a newspaper website (zviazda.by) with Wget and would like to upload the resulting .warc.gz data to the Internet Archive. Could you please advise:
08:09 user_ - How exactly can I tag my items with the subject keyword "archiveteam"? Are there any working examples for the "ia" command line tool?
08:09 user_ - A few .warc.gz blocks were lost because of low disk space on my machine. Is it still OK to upload the remaining blocks, or should I fully recrawl the site?
08:09 user_ Thanks in advance for any comments.
08:13 kiska1825 has quit IRC (Read error: Operation timed out)
08:14 Ryz has quit IRC (Quit: Ping timeout (120 seconds))
08:16 jshoard has joined #archiveteam-bs
08:17 Ryz has joined #archiveteam-bs
08:17 kiska1825 has joined #archiveteam-bs
08:18 svchfoo3 sets mode: +o Ryz
08:59 mgrandi Can you recrawl the parts that were not saved?
08:59 Jonimoose has quit IRC (Read error: Operation timed out)
09:00 Jonimoose has joined #archiveteam-bs
09:00 HP_Archiv has quit IRC (Quit: Leaving)
09:02 mgrandi And I think you modify metadata either by having like the warc headers or using `ia metadata` https://archive.org/services/docs/api/internetarchive/cli.html#metadata
09:04 OrIdow6 user_: IIRC the proper set of arguments would be '-m "subject:Archiveteam"'; but as (unless you have a more familiar nick) no one here knows you, your warcs would not be added to the Wayback Machine, nor (I think) moved to the Archiveteam collection
09:05 OrIdow6 (You may be able to do without the quotes around the second argument depending on your shell, obviously)
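The tagging advice above can be sketched as follows. This is only an illustration in Python that assembles an `ia upload` argument list; it assumes the `--metadata=<key:value>` option form described in the `ia` documentation linked above, and the item identifier and file name are made up for the example.

```python
import shlex


def build_ia_upload_command(identifier, files, subjects):
    """Return the argv for an `ia upload` call that sets subject keywords.

    Only the argument list is built here; nothing is uploaded.
    """
    argv = ["ia", "upload", identifier, *files]
    for subject in subjects:
        # One --metadata option per subject keyword.
        argv.append(f"--metadata=subject:{subject}")
    return argv


if __name__ == "__main__":
    # Hypothetical identifier and file name, for illustration only.
    argv = build_ia_upload_command(
        "zviazda-by-example-item",
        ["zviazda-2020.08.01-00000.warc.gz"],
        ["archiveteam"],
    )
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Building the command as a list sidesteps the shell-quoting question raised above, since each argument is passed through as a single element.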
09:07 mgrandi has left
09:07 mgrandi has joined #archiveteam-bs
09:09 VerifiedJ has joined #archiveteam-bs
09:10 user_ <mgrandi>: Yes, I'm going to recrawl unsaved parts, it will take several days. Basically it will be another set of .warc.gz blocks.
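One way to find the unsaved parts is to compare the URL list against the CDX index of the surviving blocks. The sketch below assumes a space-separated CDX file whose header line (for example ` CDX a b a m s k r M V g u`) names the columns, with `a` marking the original URL; the function names are hypothetical, and URLs are compared as exact strings.

```python
def urls_in_cdx(cdx_lines):
    """Collect the original URLs listed in CDX lines."""
    seen = set()
    url_column = None
    for line in cdx_lines:
        line = line.rstrip("\n")
        if line.startswith(" CDX"):
            # The header names the columns; 'a' is the original URL.
            fields = line.split()[1:]
            url_column = fields.index("a")
            continue
        if url_column is None or not line.strip():
            continue
        parts = line.split(" ")
        if len(parts) > url_column:
            seen.add(parts[url_column])
    return seen


def missing_urls(wanted, cdx_lines):
    """Return the wanted URLs that do not appear in the CDX, in order."""
    captured = urls_in_cdx(cdx_lines)
    return [url for url in wanted if url not in captured]


if __name__ == "__main__":
    sample_cdx = [
        " CDX a b a m s k r M V g u\n",
        "https://example.org/a 20200801000000 https://example.org/a "
        "text/html 200 ABC - - 10 20 f.warc.gz <urn:uuid:1>\n",
    ]
    print(missing_urls(["https://example.org/a", "https://example.org/b"], sample_cdx))
```

The resulting list could be written to a new input file for the recrawl, so only the lost URLs are fetched again.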
09:10 user_ <mgrandi>, <OrIdow6>: Thanks for these instructions. I don't have a more familiar nick, as I'm writing to this IRC channel for the first time. Could you please explain what details I should provide about myself to be able to add warcs to the ArchiveTeam collection?
09:16 BnAboyZ has quit IRC (Read error: Connection reset by peer)
09:18 BnAboyZ has joined #archiveteam-bs
09:22 Ctrl has quit IRC (Read error: Operation timed out)
09:22 BnAboyZ has quit IRC (Read error: Connection reset by peer)
09:23 BnAboyZ has joined #archiveteam-bs
09:23 kyledrake has quit IRC (Read error: Operation timed out)
09:24 kyledrake has joined #archiveteam-bs
09:25 Ctrl has joined #archiveteam-bs
09:33 mgrandi @JAA: did anyone ever look at that hltv thing? Do we need something to make a url crawler script?
09:33 OrIdow6 user_: You can stay on until the right people get on who can really answer that, but in any case I don't think your chances are that good, especially with the compression thing
09:35 OrIdow6 And the wget issue with the brackets in the warc (where wget outputs a warc format slightly different from the de facto (but technically wrong) standard)
09:35 OrIdow6 Though maybe the WBM can get around that, I haven't been keeping track of it
09:36 Kaz I don't have a working example for the ia upload tool, someone else might do though. As for the missing blocks, obviously a full recrawl would be ideal but if the warcs are still usable without it then that's fine (just obviously.. the data will be missing)
09:36 betamax has quit IRC (Read error: Operation timed out)
09:36 Kaz I will note that your warcs will *not* end up available in the wayback machine though
09:43 pikami has quit IRC (Ping timeout: 615 seconds)
09:43 wessel152 has joined #archiveteam-bs
09:43 phuzion has quit IRC (Ping timeout: 615 seconds)
09:43 phuzion has joined #archiveteam-bs
09:46 user_ <OrIdow6>, <Kaz>: Got it, thanks! As the website isn't dying as of yet, I'll do a recrawl. I'm aware of the brackets issue – would you please recommend a command line tool other than wget that I can use for recrawling? Basically I want to back up a list of URLs (news stories published by the website) without doing a full traversal, so I thought a feature-rich crawler like Heritrix may be overkill.
09:46 pikami has joined #archiveteam-bs
09:47 Kaz brackets issue?
09:47 betamax has joined #archiveteam-bs
09:50 user_ Discussed here: http://fileformats.archiveteam.org/wiki/WARC (Ctrl+F "wget")
09:54 Arcorann_ That's something I'm interested in looking into as well (saving the full contents of various blogs into the Wayback Machine, in my case)
10:02 Kaz user_: we use https://github.com/ArchiveTeam/wget-lua for most/all of our projects, I'd say that's safe
10:02 Kaz Arcorann_: archivebot is likely your best bet. You won't be able to upload to IA and have warcs ingested into the IA on your own
10:10 user_ <Kaz>: Thank you for sharing this! Sorry for another stupid question, but does wget-lua accept the same command line parameters as GNU wget? I'm running the crawl like this:
10:10 user_ wget -e robots=off --page-requisites --waitretry 5 --timeout 60 --tries 5 --wait 2 --random-wait --warc-header "operator: <redacted>" --warc-cdx --warc-file="zviazda-2020.08.01" -U "Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0" --input-file="links.txt" --reject-regex 'zviazda_theme/img/(blank|fancybox)' --warc-max-size=100M
10:10 user_ Will it also work with wget-lua? And should I change any of the arguments for compatibility with the existing practice?
10:21 mgrandi I think wget Lua just adds additional arguments
10:24 mgrandi It seems fine, you can see what the arguments are for projects: https://github.com/ArchiveTeam/github-grab/blob/master/pipeline.py#L192
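For reference, the crawl command quoted above can be expressed as an argument list, in the spirit of the linked pipeline script. This is a sketch only: the function name is hypothetical, the `wget-lua` binary name is an assumption about how a local build would be installed, and whether wget-lua accepts every option unchanged is only believed to be the case in this conversation, not confirmed.

```python
def build_wget_command(binary, url_list, warc_name, operator, user_agent):
    """Return an argv list mirroring the crawl options quoted above."""
    return [
        binary,
        "-e", "robots=off",
        "--page-requisites",
        "--waitretry", "5",
        "--timeout", "60",
        "--tries", "5",
        "--wait", "2",
        "--random-wait",
        "--warc-header", f"operator: {operator}",
        "--warc-cdx",
        f"--warc-file={warc_name}",
        "-U", user_agent,
        f"--input-file={url_list}",
        # Passed as one list element, so the parentheses need no shell quoting.
        "--reject-regex", "zviazda_theme/img/(blank|fancybox)",
        "--warc-max-size=100M",
    ]


if __name__ == "__main__":
    command = build_wget_command(
        "wget-lua",  # assumed binary name
        "links.txt",
        "zviazda-2020.08.01",
        "<redacted>",
        "Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0",
    )
    print(command)
    # To actually run it: subprocess.run(command, check=True)
```

Swapping the first element between `wget` and a wget-lua build would make it easy to check that both accept the same options.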
10:24 recruit_m has joined #archiveteam-bs
10:25 recruit_m @kaz are you here
10:25 Kaz hello
10:25 Kaz user_: I believe so yes
10:25 recruit_m ok, thanks for the link
10:26 mgrandi Also: does anyone know what the "new style hash" is for a CDX file? Is it base64 sha1? The documentation literally just says "new style hash"
10:27 mgrandi (The other style of metadata file format seems much more documented and easy to parse but I digress)
10:31 Kaz cc JAA
10:32 user_ Great, thank you guys!
11:02 godane has quit IRC (Ping timeout: 260 seconds)
11:12 JAA mgrandi: I haven't looked at HLTV yet.
11:12 JAA mgrandi: In my experience, the 'new style hash' in the CDX at least as written by IA is the same as recorded in the WARC, i.e. SHA-1 in base32.
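A quick way to check that answer against a real record is to compute the digest yourself. The sketch below produces a SHA-1 digest in the Base32 form that WARC digest headers such as `WARC-Payload-Digest` typically carry (after a `sha1:` prefix); that the CDX field uses exactly the same form is taken from the answer above, not verified here, and the function name is hypothetical.

```python
import base64
import hashlib


def warc_style_sha1(payload: bytes) -> str:
    """Return the SHA-1 of payload as a 32-character Base32 string."""
    digest = hashlib.sha1(payload).digest()  # 20 raw bytes
    # 20 bytes encode to exactly 32 Base32 characters, so no '=' padding appears.
    return base64.b32encode(digest).decode("ascii")


if __name__ == "__main__":
    # Digest of an empty payload.
    print(warc_style_sha1(b""))
```

A Base64 SHA-1 would be 28 characters ending in `=`, so the two encodings are easy to tell apart by length alone.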
11:21 recruit_m so Jason Scott (from the twitter link at the top of this channel) does a paid podcast? how does that add up? does someone have links for free or a comment on how you can be an archivist and put content behind paywalls?
11:25 godane has joined #archiveteam-bs
11:33 omglolba- has joined #archiveteam-bs
11:33 omglolbah has quit IRC (Read error: Connection reset by peer)
11:40 Arcorann_ Kaz: thanks for the response. Are there any specific requirements for using archivebot?
11:41 Kaz recruit_m: we're archivists, not communists
11:41 Kaz I mean I'm sure there's a few, but as a whole nobody's got an issue with paywalls existing
11:42 recruit_m ok
11:42 recruit_m when will the podcast be archived :^
11:43 Kaz Arcorann_: 'read the docs' basically. I don't use it much so there's definitely people that know more about the day-to-day
11:43 Arcorann_ I see
11:43 Kaz drop into #archivebot if you're not already
11:43 recruit_m done
11:45 Kaz recruit_m: i would imagine it's already archived somewhere
11:45 Kaz that doesn't mean it's *public*
11:45 recruit_m interesting
11:46 recruit_m so how do open source/culture and archiv(ism?) relate?
11:47 Kaz well, it's certainly easier to archive things that are already public / open source
11:50 recruit_m ok, thanks
11:51 recruit_m I am new and exploring, at work, are there network or capacity overload risks I may bump into?
11:56 trc has quit IRC (Quit: Goodbye)
13:14 Arcorann_ is now known as Arcorann
14:08 fredgido has quit IRC (Leaving)
14:54 recruit_m has quit IRC (Ping timeout: 252 seconds)
15:05 Ctrl has quit IRC (Read error: Operation timed out)
15:16 Nikchemny has joined #archiveteam-bs
15:22 Nikchemny JAA: As I know, AT saves flash games, right? https://vk.com/dev/no_flash and https://vk.com/dev/no_flash_2.0 may be interesting. The games are only for users, so AT must have account(s) on Vk
15:23 Nikchemny Kaz VoynichCr maybe SketchCow
15:25 Ctrl has joined #archiveteam-bs
15:34 Nikchemny I mean, all the games were before 2018
15:36 Nikchemny has quit IRC (Quit: Page closed)
15:36 Arcorann has quit IRC (Read error: Connection reset by peer)
16:18 Doran has joined #archiveteam-bs
16:25 Doranwen has quit IRC (Read error: Operation timed out)
16:29 Doran has quit IRC (Ping timeout: 272 seconds)
16:29 Doran has joined #archiveteam-bs
16:41 Raccoon has quit IRC (Ping timeout: 272 seconds)
17:13 britmob_ has quit IRC (Read error: Connection reset by peer)
17:15 britmob has joined #archiveteam-bs
17:38 JAA Nikchemny: Flashpoint is the project you're looking for I think. We haven't really done much Flash stuff here as far as I know.
17:42 OrIdow6^2 has joined #archiveteam-bs
17:44 OrIdow6 has quit IRC (Ping timeout: 265 seconds)
17:44 SketchCow OK, don't do that.
17:45 SketchCow Who... the fuck.... is recruit_m
17:46 SketchCow Like, did he just wander into the channel to go "So.... Jason has a paid podcast... explain THAT"
17:46 SketchCow But, just for the future, although whatever:
17:46 SketchCow - Podcast episodes are released on Patreon
17:46 SketchCow - Anywhere from 2 weeks to a month after, they are released on youtube, apple podcasts, google podcasts, libsyn and a couple others I don't know. For free.
17:47 SketchCow - Episodes are all archived at archive.org in a collection
17:47 SketchCow - Fuck you
17:48 JAA SketchCow: But don't you know that all content from archivists is required to be in the public domain immediately‽
17:49 SketchCow Like, what even the fuck WAS that
17:51 SketchCow Does remind me to release an episode though
17:52 SketchCow Also I see the archiveteam inbox on IA needs me to write more sorting routines
18:03 lennier1 I've heard rumors people here have jobs writing closed source software. :)
18:10 JAA Whaaaaa
18:10 JAA Mind blown.
18:35 SketchCow The entire POINT of archiveteam is all the compromises that have been made and trying to work with them
18:35 SketchCow The deriver queue is finally going down on IA
18:35 SketchCow https://archive.org/~tracey/mrtg/derivesg.html
18:35 SketchCow It is merely the worst it has been in a year instead of the worst in a decade
20:20 jshoard_ has joined #archiveteam-bs
20:21 chfoo has quit IRC (Read error: Operation timed out)
20:21 chfoo has joined #archiveteam-bs
20:21 pikami has quit IRC (Read error: Operation timed out)
20:22 sHATNER has quit IRC (Read error: Operation timed out)
20:22 sHATNER has joined #archiveteam-bs
20:22 thejsa has quit IRC (Read error: Operation timed out)
20:22 svchfoo3 sets mode: +o chfoo
20:22 Jon has quit IRC (Write error: Broken pipe)
20:22 jshoard has quit IRC (Write error: Broken pipe)
20:22 nico_32 has quit IRC (Read error: Operation timed out)
20:22 Jon has joined #archiveteam-bs
20:22 nico_32 has joined #archiveteam-bs
20:22 wessel152 has quit IRC (Write error: Broken pipe)
20:23 pikami has joined #archiveteam-bs
20:23 kisspunch has quit IRC (Read error: Operation timed out)
20:23 kisspunch has joined #archiveteam-bs
20:24 VerifiedJ has quit IRC (Read error: Operation timed out)
20:25 MillerBOS has quit IRC (Read error: Operation timed out)
20:25 kiskaWee has quit IRC (Read error: Operation timed out)
20:26 Yurume has quit IRC (Read error: Operation timed out)
20:26 Ctrl has quit IRC (Read error: Operation timed out)
20:27 Yurume has joined #archiveteam-bs
20:30 VerifiedJ has joined #archiveteam-bs
20:30 MillerBOS has joined #archiveteam-bs
20:31 kiskaWee has joined #archiveteam-bs
20:32 thejsa has joined #archiveteam-bs
20:32 BnAboyZ has quit IRC (Read error: Connection reset by peer)
20:32 VerifiedJ has quit IRC (Client Quit)
20:33 Ctrl has joined #archiveteam-bs
20:35 BnAboyZ has joined #archiveteam-bs
20:51 JAA What was the peak queue size at the worst in a decade?
20:52 JAA Also, thanks for the URL. I had been looking for those graphs before and couldn't find them anymore.
21:02 Ryz Is that a good thing? Like the backlog of stuff to process is getting smaller and more manageable? o:
21:04 JAA Yep
21:39 Ryz To clarify, the content being processed is more than just AB's stuff?
21:39 JAA Lots more I'm sure.
21:41 Ryz What are the effects of this being lowered down over time? Like content would appear in IA or WBM faster?
21:41 JAA Yup
21:41 JAA Derives have been delayed for a week or more from what I've heard.
21:42 Ryz Ooo, instead of 2 days, it could be 1 day or maybe less~
21:42 Ryz Oh, it was much longer, welp
21:42 JAA That should go back down to reasonable times soon.
22:22 wyatt8740 has joined #archiveteam-bs
22:24 wyatt8740 has quit IRC (Read error: Operation timed out)
22:32 trc has joined #archiveteam-bs
22:39 wyatt8740 has joined #archiveteam-bs
22:41 wyatt8740 has quit IRC (Read error: Operation timed out)
22:42 wyatt8740 has joined #archiveteam-bs
22:50 wyatt8740 has quit IRC (Read error: Operation timed out)
22:51 wyatt8740 has joined #archiveteam-bs
22:55 BlueMax has joined #archiveteam-bs
23:03 Ryz So, I was trying to find archives of Prevention magazine ( https://en.wikipedia.org/wiki/Prevention_(magazine) ) and Google Books has a small selection of stuff. I browsed one of 'em because I got the small magazines from my parents, thinking this could be a sufficient replacement so I don't have to manually mine for website links;
23:03 Ryz ...And then this came up: https://books.google.com/books?id=9MYDAAAAMBAJ&lpg=PP1&pg=PA56#v=onepage&q&f=false
23:04 Ryz Not only is this a crappy scan, but it doesn't match the ad from the magazine that I have at all
23:06 wessel152 has joined #archiveteam-bs
23:07 Raccoon has joined #archiveteam-bs
23:08 Ryz Has there been an initiative for bots or people to scan through documents like magazines and mine out links that may or may not exist anymore? I managed to mine at least 1 website that I threw into AB, which is https://www.videoeye.com/ - which came from an advertisement
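As a rough illustration of what such link mining could look like once a scan has been turned into text, here is a minimal sketch. It assumes the OCR text is already available as a string; the pattern and function name are hypothetical and deliberately simple, so URLs broken across lines or garbled by OCR would be missed.

```python
import re

# Matches http(s) URLs and bare "www." hosts; stops at whitespace and brackets.
URL_PATTERN = re.compile(r"\b(?:https?://|www\.)[^\s<>()\[\]]+", re.IGNORECASE)


def extract_urls(text):
    """Return unique URLs found in text, in order of first appearance."""
    found = []
    for match in URL_PATTERN.finditer(text):
        # Drop sentence punctuation that often trails a printed URL.
        url = match.group(0).rstrip(".,;:!?")
        if url.lower().startswith("www."):
            # Printed ads often omit the scheme; assume plain http.
            url = "http://" + url
        if url not in found:
            found.append(url)
    return found


if __name__ == "__main__":
    ad_text = "Order today at www.videoeye.com! More details: https://example.org/page."
    for url in extract_urls(ad_text):
        print(url)
```

The output list could then be checked for liveness or fed to a crawler as a URL list.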
23:12 Raccoon` has joined #archiveteam-bs
23:14 Raccoon has quit IRC (Ping timeout: 376 seconds)
23:14 Raccoon` is now known as Raccoon
23:27 Raccoon` has joined #archiveteam-bs
23:28 wyatt8740 has quit IRC (Read error: Operation timed out)
23:31 Raccoon has quit IRC (Ping timeout: 272 seconds)
23:32 Raccoon` has quit IRC (Ping timeout: 265 seconds)
23:32 wyatt8740 has joined #archiveteam-bs
23:32 Raccoon has joined #archiveteam-bs
23:46 jshoard_ has quit IRC (Leaving)
