#archiveteam-bs 2017-09-19,Tue

Time Nickname Message
00:07 🔗 fie has quit IRC (Read error: Operation timed out)
00:22 🔗 fie has joined #archiveteam-bs
00:30 🔗 BlueMaxim has quit IRC (Read error: Connection reset by peer)
00:41 🔗 drumstick has joined #archiveteam-bs
01:06 🔗 odemg godane, https://i.imgur.com/eV1bXZD.png << wtf?
01:06 🔗 odemg SketchCow, so IA didn't get tape guys tapes?
01:10 🔗 drumstick has quit IRC (Ping timeout: 255 seconds)
01:10 🔗 odemg godane, https://i.imgur.com/EZOGmx4.png
01:12 🔗 godane i'm reading the forum post now
01:14 🔗 second Has this been archived? https://www.manualslib.com
01:14 🔗 second JAA: ^
01:15 🔗 RichardG has quit IRC (Ping timeout: 370 seconds)
01:20 🔗 odemg godane, I'm saddened that this ended up at MySpleen and not with archive.org, there's 18 years of video and a small, obscure torrent tracker got it over the internet archive? That makes no sense whatsoever. It's safe to say this won't end up being archived properly and the physical tapes won't end up being preserved. I honestly cannot believe after speaking to Jason he decided to go with some random guy from
01:20 🔗 odemg a torrent site.. it makes me cringe to think that this collection will be parted out and go to waste.
01:22 🔗 godane i fully get that
01:23 🔗 godane i prefer IA getting it but i just hope this guy has some sort of 8 to 12 vcr tape grabber
01:24 🔗 godane i also hope he puts it at 60 or 120 minutes on dvd5 size
01:24 🔗 godane my tapes are about 4.7gb per 120 minutes
01:24 🔗 godane which is 5192kbps
01:25 🔗 godane 192kbps is for the mp3 audio
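The bitrate figures above can be sanity-checked with a little arithmetic. This is a minimal sketch, assuming decimal units (1 kbps = 1000 bit/s, 1 GB = 10^9 bytes) and assuming the 5192 kbps total splits into 5000 kbps of video plus the 192 kbps MP3 audio; the 5000 kbps video figure is inferred and not stated in the log.

```python
VIDEO_KBPS = 5000  # assumption: total minus audio, not stated in the log
AUDIO_KBPS = 192   # MP3 audio bitrate mentioned in the log


def size_gb(total_kbps: float, minutes: float) -> float:
    """Return the file size in decimal gigabytes for a constant total bitrate."""
    bits = total_kbps * 1000 * minutes * 60
    return bits / 8 / 1e9


total = VIDEO_KBPS + AUDIO_KBPS
print(f"{total} kbps for 120 min is about {size_gb(total, 120):.2f} GB")
```

At 5192 kbps a 120-minute tape comes to roughly 4.67 GB, which is consistent with the "about 4.7gb" figure and fits a single-layer DVD.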
01:34 🔗 odemg that's something I didn't even think about, he's not even going to have the storage or bandwidth to get these out... fuck me!! Come on SketchCow don't let this happen :/
01:36 🔗 godane odemg: plus side is you got 100+ tapes from me over the last few months
01:37 🔗 godane i upload them to myspleen then upload them to FOS
01:38 🔗 RichardG has joined #archiveteam-bs
01:38 🔗 godane odemg: https://archive.org/details/@jason_scott?&sort=-publicdate&and[]=mediatype%3A%22movies%22&and[]=collection%3A%22vhsvault%22
01:39 🔗 odemg I'm out of local storage at the moment, only 9TB free of 1.6PB slowly moving over to x265 to get a little space back (re-encoding from source on my 720p tv series stuff) not planning any significant storage upgrades until around February (56x10TB iron wolf drives)
01:57 🔗 Selavi I doubt the seller knows this guy is doing it for a tracker, but to choose some guy over the IA...
02:08 🔗 odemg Selavi, we should make him aware... it's fucking insanity to give it to MySpleen!
02:08 🔗 Selavi agreed
02:09 🔗 hook54321 wtf? why?
02:10 🔗 hook54321 Why on earth
02:10 🔗 hook54321 Live stream from IA starts soon: https://www.youtube.com/watch?v=HzVyIy4baFg
02:16 🔗 hook54321 I could drive over to the VHS guy's house and demand to know why. :P
02:18 🔗 felti has quit IRC ()
02:30 🔗 drumstick has joined #archiveteam-bs
03:20 🔗 hook54321 okay, so I emailed the VHS guy...
03:21 🔗 hook54321 I got an name of the future owner of the collection
03:21 🔗 hook54321 *a name
03:21 🔗 hook54321 https://www.irccloud.com/pastebin/MzCiCNHt/
03:22 🔗 hook54321 Here's the response I got:
03:22 🔗 hook54321 https://www.irccloud.com/pastebin/Hk4j8Hk1/
03:25 🔗 hook54321 And now he's forwarding me a bunch of threads
03:26 🔗 pizzaiolo has quit IRC (Quit: pizzaiolo)
03:26 🔗 Selavi hmm
03:34 🔗 hook54321 From looking at the threads, it looks like he was giving it to whoever could get there first.
03:35 🔗 hook54321 https://www.irccloud.com/pastebin/aCh4yK2O/
03:36 🔗 hook54321 https://www.irccloud.com/pastebin/5x9XlNnw/
03:36 🔗 hook54321 https://www.irccloud.com/pastebin/KnMPWw73/
03:37 🔗 hook54321 https://www.irccloud.com/pastebin/E9Yvwxge/
03:38 🔗 hook54321 https://www.irccloud.com/pastebin/BXnPtFrc/
03:41 🔗 hook54321 https://www.irccloud.com/pastebin/e76O3sS4/
03:43 🔗 hook54321 That was from yesterday ^
04:04 🔗 godane hook54321: i need to get some of those vhs tapes
04:04 🔗 godane but i may only do about 50 to 100 at a time
04:06 🔗 hook54321 At this point if IA isn't going to get them, then I think we should be mostly concerned about where they're going to be stored and how quickly they will be digitized, since the longer they are stored improperly the lower the quality the end result will be.
04:06 🔗 hook54321 Do you live in/around Oregon?
04:06 🔗 godane i'm in NH
04:07 🔗 godane so i'm no where near there
04:07 🔗 hook54321 ah. ok.
04:08 🔗 godane we may have to go after toysrus : https://www.toysrusinc.com/press/toysrus-inc-commences-court-supervised-processes-to-implement-financial-restructuring
04:09 🔗 hook54321 This might be a long shot, but I was thinking about having some sort of event at a high school in oregon or something where we would try to get tons of volunteers to come and help digitize them. However, I'm not very familiar with the equipment involved, so I don't know how portable everything needed to digitize them is.
04:11 🔗 hook54321 lol, should we start saving toysrus ads that we get in the mail? :P
04:12 🔗 godane i was hoping there was a archive of them somewhere
04:16 🔗 BlueMaxim has joined #archiveteam-bs
04:31 🔗 BlueMaxim has quit IRC (Quit: Leaving)
04:33 🔗 SketchCow hook54321: Please calm the fuck down
04:36 🔗 godane SketchCow: so what exactly happened with the tape collection?
04:36 🔗 godane i thought you said you guys had it
04:37 🔗 SketchCow No, I never said that.
04:38 🔗 SketchCow I said that IA was willing to work with whoever COULD take it and do the work of digitizing it, and we'd be there if things got to the point he couldn't find a home.
04:38 🔗 SketchCow A safety net, not a first line of defense.
04:38 🔗 godane oh ok
04:38 🔗 SketchCow We'll work with the myspleen guys
04:38 🔗 SketchCow Who are in Portland, and right there
04:39 🔗 godane that makes sense
04:39 🔗 hook54321 ah ok
04:40 🔗 godane a part of me hopes the guy has like 8 to 12 vcrs to digitize it
04:41 🔗 godane cause otherwise it will take forever
04:45 🔗 Sk1d has quit IRC (Ping timeout: 194 seconds)
04:47 🔗 hook54321 I was under the impression that private torrent trackers generally like to keep their content exclusive
04:48 🔗 SketchCow Private torrent trackers work with us
04:51 🔗 Sk1d has joined #archiveteam-bs
04:53 🔗 godane i upload a lot of myspleen vhs stuff to FOS
04:54 🔗 hook54321 I haven't really heard of MySpleen before. Sounds interesting.
04:57 🔗 hook54321 When was their registration last open?
04:58 🔗 godane i want to say over a year ago maybe
04:58 🔗 godane no invites for the moment
05:24 🔗 ranma odemg: how are you retaining grain with x265?
05:25 🔗 ranma last i tried --deblock -3:-3 (or whatever is a good x265 equiv) was in no way coming close to x264
05:25 🔗 ranma i really want to know, too
05:26 🔗 BlueMaxim has joined #archiveteam-bs
06:19 🔗 Asparagir http://rhizome.org/editorial/2017/sep/18/rhizome-to-host-national-forum-on-ethics-and-archiving-the-web-march-22-24-2018/
06:20 🔗 Asparagir okay, they're getting $100,000 in government money to talk about ethics in game journalism...er, web archiving
06:21 🔗 Frogging i know of someone who headed a research team who got $60 000 from the Canadian government to research furries
06:21 🔗 Frogging tru fax
06:21 🔗 Asparagir One of the people putting that conference together once lectured me on Twitter, back when my account was unlocked, for daring to archive some Black Lives Matter related videos that people were publicly posting to Twitter and Facebook. In other words, he was ANGRY that we were archiving it.
06:22 🔗 Frogging oh so maybe it's going to be one of *those* talks about how we shouldn't archive things
06:22 🔗 Asparagir RIGHT.
06:22 🔗 Asparagir His belief was that it is up to marginalized communities to decide what should and shouldn't be archived, and how to best present that material -- even though they purposely just posted it all to social media!
06:23 🔗 Asparagir He wants archives -- who must be headed by people with MLIS degrees, I guess? -- to ask permission to archive public materials on the web. And he calls that ethics.
06:25 🔗 Asparagir Still wasn't as bad as the time that one of his friends, a very well known archivist/librarian at an Ivy League University, also lectured me over Twitter about how he wanted to delete the Library of Congress' copy of the Twitter archive, that he felt like that data should be destroyed.
06:26 🔗 Frogging why though
06:27 🔗 Asparagir It didn't entirely make sense to me. But his point seemed to be that it's wrong to turn the panopticon on the materials that people choose to publish, because people and institutions often react differently and sometimes unfairly to who is doing the publishing.
06:27 🔗 Asparagir But my gut feeling is that these people are information gatekeepers who cosplay as archivists.
06:28 🔗 Asparagir They don't want certain information to be available unless they get to be the one who decides it should be available.
06:29 🔗 Asparagir Anyway, that's a $100,000 grant from the IMLS for this upcoming "ethics" conference. Good thing no museums or libraries could have used that money, eh?
06:30 🔗 Frogging yeah :/
06:31 🔗 Asparagir ...by the way, did that Canadian furries research ever lead to a publication? That would be fun to read. :-)
06:31 🔗 Frogging I am not sure
06:31 🔗 hook54321 I don't see how people could be honestly angry about stuff like that
06:31 🔗 Frogging :p
06:32 🔗 zhongfu has quit IRC (Ping timeout: 260 seconds)
06:33 🔗 zhongfu has joined #archiveteam-bs
06:40 🔗 hook54321 If I see something that I think will be deleted, I try to not take into account my personal views on the subject, I just try to make sure that it's saved.
06:41 🔗 Mateon1 has quit IRC (Ping timeout: 260 seconds)
06:42 🔗 Mateon1 has joined #archiveteam-bs
06:42 🔗 Asparagir Good.
06:43 🔗 Asparagir has quit IRC (Asparagir)
07:17 🔗 Atom has quit IRC (Read error: Operation timed out)
08:44 🔗 JAA second: Doesn't look like it has. It looks massive though.
08:51 🔗 JAA 2.7M manuals, that'll easily be over 100M pages.
08:51 🔗 JAA Which means it's definitely way too big for ArchiveBot.
08:53 🔗 JAA Also, the downloads are behind reCAPTCHA... :-/
08:55 🔗 jtn2 has quit IRC (Read error: Operation timed out)
08:57 🔗 godane JAA: i looked into manualslib.com
08:57 🔗 godane we can grab it
08:57 🔗 godane https://www.manualslib.com/manual/121874/A.html
08:58 🔗 godane it auto redirect to the page
08:58 🔗 godane we can at least web grab the thing
08:59 🔗 JAA Yes, we can grab it as HTML, but we won't be able to get the PDFs.
09:00 🔗 JAA You can actually just recursively grab the entire thing through wpull et al., that should find all manuals and all pages.
09:01 🔗 godane to grab image of manual : https://data2.manualslib.com/big_image/big_image.php?id_manual=121874&page=1
09:01 🔗 godane wget will work with that
09:02 🔗 jtn2 has joined #archiveteam-bs
09:02 🔗 JAA I know. I never said it can't be archived.
09:02 🔗 godane at worst i will figure a way to use the auto redirect to make cbz files
09:02 🔗 JAA Have you found a way around the captcha on downloads?
09:03 🔗 godane that way we have ${number}_${name}.cbz
09:03 🔗 JAA Just to be clear, I mean this: https://www.manualslib.com/download/121874/Peripheral-Electronics-Pghhd1.html
09:03 🔗 godane i know
09:14 🔗 JAA Looks like they know what they're doing.
09:15 🔗 JAA The download URL contains a token specific to each file, and I don't see a way to get it without solving the captcha.
09:16 🔗 godane so i think the cbz way maybe the best way then
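The "cbz way" discussed above could look roughly like the following. This is only a sketch under stated assumptions: the URL pattern is the one quoted in the log, a CBZ is treated as a plain ZIP of page images, the `.jpg` extension for the pages is a guess, and the helper names are hypothetical. The actual fetching is left out because the log does not settle what the image endpoint requires (it intermittently returned 404).

```python
import io
import zipfile
from urllib.parse import urlencode

# URL pattern quoted in the log; whether it keeps working is not guaranteed.
BASE = "https://data2.manualslib.com/big_image/big_image.php"


def page_urls(manual_id: int, page_count: int) -> list[str]:
    """Build one image URL per page, numbering pages from 1."""
    return [
        f"{BASE}?{urlencode({'id_manual': manual_id, 'page': page})}"
        for page in range(1, page_count + 1)
    ]


def cbz_name(manual_id: int, name: str) -> str:
    """Follow the ${number}_${name}.cbz naming idea from the log."""
    return f"{manual_id}_{name}.cbz"


def build_cbz(pages: list[bytes]) -> bytes:
    """Pack already-downloaded page images into an in-memory CBZ (ZIP) archive."""
    buf = io.BytesIO()
    # Images are already compressed, so store them without recompression.
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as archive:
        for number, data in enumerate(pages, start=1):
            # Zero-padded names keep pages in order; ".jpg" is an assumption.
            archive.writestr(f"{number:04d}.jpg", data)
    return buf.getvalue()
```

Zero-padded file names matter because comic readers usually sort archive entries by name to determine page order.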
09:20 🔗 RichardG has quit IRC (west.us.hub irc.Prison.NET)
09:20 🔗 brayden has quit IRC (west.us.hub irc.Prison.NET)
09:20 🔗 second has quit IRC (west.us.hub irc.Prison.NET)
09:20 🔗 Igloo has quit IRC (west.us.hub irc.Prison.NET)
09:20 🔗 zerkalo has quit IRC (west.us.hub irc.Prison.NET)
09:21 🔗 JAA Hmm, that big_image.php page doesn't seem to work for everything though.
09:21 🔗 JAA For example, the last page of that manual 404s: https://data2.manualslib.com/big_image/big_image.php?id_manual=121874&page=16
09:22 🔗 JAA As does your link now, wat?
09:23 🔗 JAA After refreshing the HTML page, it works again...
09:23 🔗 zerkalo_ has joined #archiveteam-bs
09:23 🔗 Igloo_ has joined #archiveteam-bs
09:23 🔗 godane i think i know why
09:24 🔗 hook54321 https://arstechnica.com/information-technology/2012/05/google-recaptcha-brought-to-its-knees/
09:24 🔗 godane its cause it needs --referer https://www.manualslib.com/
09:24 🔗 JAA Oh
09:24 🔗 godane nevermind
09:25 🔗 JAA Wait, no, that can't be the reason. My browser doesn't send a referrer.
09:27 🔗 hook54321 I got it to not show the captcha thing, once, i think at least.
09:28 🔗 hook54321 If you try to access it through TOR it has a different captcha system it makes you do.
09:35 🔗 second1 has joined #archiveteam-bs
09:37 🔗 drumstick has quit IRC (Read error: Operation timed out)
09:38 🔗 drumstick has joined #archiveteam-bs
10:07 🔗 godane JAA: i found this: https://data2.manualslib.com/big_img_log/no300_ref_history.txt
10:13 🔗 godane anyways i may look at manualsonline.com
10:17 🔗 JAA manuallib.com is also a good candidate. 704k manuals, downloadable directly.
10:25 🔗 godane thanks for manuallib.com link
10:26 🔗 godane it looks way better with filenames
13:06 🔗 BlueMaxim has quit IRC (Quit: Leaving)
13:57 🔗 dd0a13f37 has joined #archiveteam-bs
13:57 🔗 dd0a13f37 hook54321: what do you mean? I get regular recaptcha 2 with tor
13:59 🔗 dd0a13f37 I see what you mean now, like on this page https://www.manualslib.com/manual/121874/Peripheral-Electronics-Pghhd1.html?page=2#manual ?
13:59 🔗 dd0a13f37 >https://www.manualslib.com/divine_code.php?id=46.165.223.217&rnd=42779718
14:00 🔗 dd0a13f37 easy to crack, you can get it from a different IP and still keep the ?id= parameter the same
14:00 🔗 dd0a13f37 It's always the same, refresh will just cause it to render again
14:00 🔗 dd0a13f37 and it doesn't have lines noise or background
14:00 🔗 dd0a13f37 So regular OCR will work just fine
14:03 🔗 dd0a13f37 "the cbz approach" is not as simple as you think it is, it also has rendered text on top
14:04 🔗 godane i figured that out
14:05 🔗 godane there is text at the bottom of pages saying its from manualslib.com
14:06 🔗 dd0a13f37 Could you post a pdf url? It doesn't work at all with tor
14:06 🔗 dd0a13f37 the recap v2
14:07 🔗 dd0a13f37 https://www.manualslib.com/contacts.html?message=Error+code+1402%2C+%28104.223.123.98%29&topic=Download+error
14:07 🔗 dd0a13f37 "error 1402"
14:11 🔗 dd0a13f37 If you can't download your manual for any reasons - for example, you have problems with the captcha - please, fill this form. Please, tick the box below to get your link:
14:17 🔗 pizzaiolo has joined #archiveteam-bs
14:18 🔗 qwebirc82 has joined #archiveteam-bs
14:20 🔗 zhongfu has quit IRC (Ping timeout: 260 seconds)
14:20 🔗 zhongfu has joined #archiveteam-bs
14:20 🔗 dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
14:21 🔗 qwebirc82 is now known as dd0a13f37
14:24 🔗 godane in case anyone here want some wwf videos: http://smtp.bob-consulting.net/
14:27 🔗 dd0a13f37 Is it a dropzone or something?
14:27 🔗 dd0a13f37 Or did they just name the domain bob-consulting
14:28 🔗 qwebirc12 has joined #archiveteam-bs
14:28 🔗 godane no idea
14:28 🔗 godane i just found it
14:29 🔗 godane i was looking up vhsrip on filepursuit.com
14:29 🔗 godane and that came up
14:29 🔗 godane anyways i'm off to bed
14:29 🔗 godane bbl
14:29 🔗 qwebirc12 I found a video about a north korean IT exhibition where they appear to talk about ubuntu the other day
14:31 🔗 qwebirc12 I have a question
14:31 🔗 qwebirc12 What about sites with stateful CMSes, like KCNA?
14:31 🔗 qwebirc12 If you pull the WARC files into wayback, you're just going to get random language pages
14:31 🔗 qwebirc12 It uses server-side state instead of a ?lang=XX parameter
14:32 🔗 dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
14:32 🔗 qwebirc12 is now known as dd0a13f37
14:41 🔗 noirscape has joined #archiveteam-bs
14:45 🔗 drumstick has quit IRC (Read error: Operation timed out)
15:05 🔗 icedice has joined #archiveteam-bs
15:11 🔗 hook54321 dd0a13f37: This is interesting https://en.wikipedia.org/wiki/Red_Star_OS
15:13 🔗 dd0a13f37 The pyongyang books website actually runs on it
15:13 🔗 dd0a13f37 you can download a .iso too
15:13 🔗 dd0a13f37 but they mainly use cracked chinese winxp/win7
15:13 🔗 dd0a13f37 for example in the bg of pyongyang IT expo you can see win7 backgrounds, and korea books site uses \r\n, not \n
15:15 🔗 hook54321 pyongyyang?
15:16 🔗 dd0a13f37 Yes, it's a video on kcna.kp
15:16 🔗 hook54321 Where'd you find an ISO?
15:16 🔗 dd0a13f37 google red star os iso
15:17 🔗 dd0a13f37 https://archive.org/details/RedStarOS
15:17 🔗 dd0a13f37 https://my.mixtape.moe/dysoqm.mp4 here is the video
15:27 🔗 kristian_ has joined #archiveteam-bs
15:29 🔗 dd0a13f37 Manualslib is quite shoddily constructed
15:30 🔗 dd0a13f37 external smtp provider, frontend is a patchwork, sends your password hash in every cookie, "ts.php not found"
15:45 🔗 schbirid has joined #archiveteam-bs
15:59 🔗 icedice How do I remove administrator privileges from my main local user account in Windows 10? I have a second local user account that has administrator privileges that I'd like to use authorize software that requires administrator privileges while logged into an account without administrator privileges.
16:00 🔗 dd0a13f37 Log in to other account -> manage user accounts?
16:00 🔗 dd0a13f37 Why are you using windows 10?
16:08 🔗 icedice It's a gaming computer
16:09 🔗 icedice I neutered it though: https://github.com/10se1ucgo/DisableWinTracking
16:09 🔗 dd0a13f37 "neutered"
16:09 🔗 dd0a13f37 Install windows 7, it works just as well
16:10 🔗 icedice I use whatever I want
16:10 🔗 icedice I dislike Windows 10, but that's where things are going
16:10 🔗 icedice Windows 7 and 8 got spy updates anyway
16:11 🔗 icedice They're better, but still the same toxic pile of shit
16:12 🔗 JAA Regarding how shitty manualslib.com is: They also serve a PDF as application/pdf by default, but if you add a "take=binary" parameter to the URL, they serve it as binary/octet-stream to force browsers to download it.
16:28 🔗 dd0a13f37 It shows up as tiff in my browser
16:31 🔗 dd0a13f37 Easiest way if you want to bulk download is probably: get a bunch of clean IPs, see if you can force it to legacy captcha,
16:31 🔗 dd0a13f37 use some of the applications that already exist for them (shouldn't be too hard- you can segment them easily), then you use archivebot for the actual download (most commercial proxy providers are ~50kbit or charge per bandwidth and the PDF dl is unauthed (how it usually is- CDNs are dumb))
16:38 🔗 JAA An alternative would be one of those captcha solving services. But honestly, before going down either of these routes, I'd rather just grab the HTML version and focus on other, more easily archiveable sites. It's not like manualslib.com is the only repository of user manuals.
16:39 🔗 dd0a13f37 Captcha solving costs $2/1000
16:40 🔗 dd0a13f37 That's extremely expensive for a million or two manuals
16:40 🔗 dd0a13f37 It would likely be cheaper to just hire someone with extensive web security knowledge and so on
16:41 🔗 dd0a13f37 For $2000 you could buy a shitton of proxies and scrape more interesting services than some manual dump
16:41 🔗 JAA Yeah, exactly. Not really worth it at that point.
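The cost estimate above is simple to reproduce. A minimal sketch, assuming exactly one solved captcha per manual download and the quoted rate of $2 per 1000 solves; the function name is hypothetical.

```python
def captcha_cost_usd(downloads: int, usd_per_thousand: float = 2.0) -> float:
    """Estimate the solving cost, assuming one captcha per download."""
    return downloads / 1000 * usd_per_thousand


for manuals in (1_000_000, 2_000_000, 2_700_000):
    print(f"{manuals:>9} manuals: ${captcha_cost_usd(manuals):,.0f}")
```

One million manuals comes to $2000, matching the figure mentioned in the discussion, and the full 2.7M manuals would be $5400 at that rate.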
16:46 🔗 dd0a13f37 Until recently manualslib only had 700k manuals
16:46 🔗 dd0a13f37 and manualsonline right now has 300k
16:46 🔗 dd0a13f37 so they likely just scraped another site
16:47 🔗 joepie91_ try https://mesnotices.20minutes.fr/ instead
16:47 🔗 joepie91_ they have a pile of manuals and last I checked they were freely downloadable
16:47 🔗 Stiletto has quit IRC (Read error: Operation timed out)
16:49 🔗 JAA Any idea about the numbers?
16:49 🔗 dd0a13f37 The numbers? On mesnotices?
16:49 🔗 JAA Yeah, how many manuals they have.
16:50 🔗 dd0a13f37 560k+ judging from id parameter
16:51 🔗 JAA I've seen an ID of almost 6M, that doesn't seem right.
16:51 🔗 dd0a13f37 https://mesnotices.20minutes.fr/014159.php?ID=5955995&k=18bbfd9b88e5d1b8eaa36702300ee324&q=LENOVO%20MOTO%20G4%20PLAY
16:51 🔗 dd0a13f37 oh never mind I can't count
16:51 🔗 dd0a13f37 But most seemed to be in 5-6m range
16:51 🔗 dd0a13f37 however
16:51 🔗 dd0a13f37 >Vous n'avez pas correctement saisi le code de verification.
16:51 🔗 JAA Also, the allegedly newest manual has an ID of around 5.4M: https://mesnotices.20minutes.fr/last-add/1
16:52 🔗 dd0a13f37 I don't speak french but I can make out "correct" "code" "verification"
16:52 🔗 JAA "You haven't entered the verification code correctly." or something like that.
16:52 🔗 JAA My French's a bit rusty.
16:52 🔗 hook54321 It must be annoying for people that speak other languages to oftentimes have their URLs based on English.
16:53 🔗 joepie91_ dd0a13f37: hm? I get straight PDF embeds
16:53 🔗 joepie91_ when I hit the telecharger button
16:53 🔗 JAA Same
16:53 🔗 JAA It's handled with JavaScript though. Not sure if wpull would be able to pick it up.
16:53 🔗 dd0a13f37 No, why?
16:54 🔗 dd0a13f37 You could always script it and then get the pdf links
16:54 🔗 dd0a13f37 and feed it to !ao < .txt
16:55 🔗 JAA Yeah, I'd probably write a wpull plugin to handle that.
16:55 🔗 JAA Just wanted to mention it.
16:55 🔗 JAA An !a will probably not be sufficient.
16:55 🔗 dd0a13f37 Do you really _need_ six million download pages in french? Also, can't wpull use PhantomJS?
16:56 🔗 JAA I still doubt it's anywhere close to 6M pages, honestly.
16:56 🔗 JAA And yes, wpull can use PhantomJS, but it's pretty horrible.
16:56 🔗 dd0a13f37 They don't seem to be autoincrementing
16:56 🔗 dd0a13f37 There are large gaps
16:56 🔗 JAA It causes massive duplication.
16:56 🔗 JAA Because each URL is retrieved in an individual PhantomJS process and thus re-downloads all page resources every time...
16:57 🔗 hook54321 Ignore some of the page resources?
16:57 🔗 dd0a13f37 Can't you use a caching proxy?
16:57 🔗 JAA I guess so, but then you're back to scripting. It's probably easier to just extract the PDF's URL with a wpull plugin (i.e. some string processing in Python).
16:58 🔗 icedice but thanks for the advice anyway, I'll try it
16:58 🔗 dd0a13f37 and then dump it into !ao
16:58 🔗 dd0a13f37 joepie91_: Is this a big site? Are they well-known in France or anything like that?
16:59 🔗 JAA Nah, add it to the wpull list directly.
16:59 🔗 JAA 20min is a "newspaper" (more like a tabloid really).
16:59 🔗 joepie91_ dd0a13f37: no idea, I found them when looking for a manual for my washing machine
16:59 🔗 joepie91_ they were the only site carrying it
16:59 🔗 joepie91_ and I'm not exaggerating
16:59 🔗 joepie91_ :p
17:00 🔗 JAA At least some of the stuff is in French. Well, that'd be a mess to upload cleanly to IA.
17:01 🔗 JAA To upload the PDFs as individual items, I mean.
17:01 🔗 dd0a13f37 Don't you just dump them all in a folder named "20minutes pdf manuals", then that's 1 item?
17:02 🔗 JAA You could do that, but then finding the manual for a particular device isn't exactly easy.
17:03 🔗 JAA I have to go. I'll look at that again later. I have an idea for how to get an estimate of how many manuals they have.
17:04 🔗 DFJustin one item per manual is better
17:05 🔗 dd0a13f37 Won't that be painful? What's the best practice for creating some thousands of items?
17:05 🔗 astrid there is a tool!
17:05 🔗 astrid https://pypi.python.org/pypi/internetarchive
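Creating one item per manual with the linked `internetarchive` package might look like the sketch below. It assumes the package is installed and credentials have been set up beforehand (for example with `ia configure`); the identifier prefix, the helper names, and the choice of metadata fields are illustrative assumptions rather than an agreed convention.

```python
import re


def make_identifier(prefix: str, manual_name: str) -> str:
    """Build an identifier using only ASCII letters, digits, '.', '_' and '-'."""
    slug = re.sub(r"[^A-Za-z0-9._-]+", "-", manual_name).strip("-")
    return f"{prefix}_{slug}"


def upload_manual(pdf_path: str, manual_name: str, prefix: str = "example-manuals"):
    """Upload a single PDF as its own item; "example-manuals" is a placeholder prefix."""
    # Imported lazily so the helper above works without the third-party package.
    from internetarchive import upload

    identifier = make_identifier(prefix, manual_name)
    metadata = {"title": manual_name, "mediatype": "texts"}
    return upload(identifier, files=[pdf_path], metadata=metadata)
```

Looping `upload_manual` over a directory of PDFs gives one item per manual, which keeps each manual individually findable.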
17:06 🔗 hook54321 !ao https://twitter.com/LeonieWatson/status/910027469394243584 --phantomjs
17:06 🔗 joepie91_ dd0a13f37: big items are a much bigger problem for IA than many items
17:06 🔗 hook54321 oops
17:06 🔗 joepie91_ in terms of infra
17:07 🔗 dd0a13f37 in terms of me
17:07 🔗 dd0a13f37 That seems neat, could use that to archive all the NK pdfs
17:08 🔗 dd0a13f37 But I can't use that to add from URL, so I need to download them all as WARC, right?
17:09 🔗 dd0a13f37 The page info is absolutely uninteresting, it's just GET something.php?id=XXX that gives you a .pdf file with Content-disposition
17:09 🔗 hook54321 I'm surprised that no one has leaked the W3C's press conference yet.
17:11 🔗 dd0a13f37 Aren't they under NDAs and stuff?
17:12 🔗 hook54321 Probably
17:15 🔗 dd0a13f37 I don't understand, hasn't EME been in HTML for a long time?
17:17 🔗 dd0a13f37 https://www.theguardian.com/technology/2013/jun/06/html5-drm-w3c-open-web
17:18 🔗 BartoCH has joined #archiveteam-bs
17:20 🔗 icedice Apparently I only had to reboot for the account settings changes to take effect
17:20 🔗 icedice So that's why it didn't work for me initially
17:22 🔗 hook54321 dd0a13f37: It has been in web browsers, but it just became an official standard by the W3C
17:24 🔗 dd0a13f37 if it's in chrome it's a standard, if not it's completely irrelevant, so why is this a big deal?
17:28 🔗 hook54321 Can't explain right now, gotta be somewhere soon. I'll be back in about 3 to 5 hours.
17:31 🔗 kristian_ has quit IRC (Quit: Leaving)
18:17 🔗 sep332 has quit IRC (Read error: Operation timed out)
18:25 🔗 dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
18:37 🔗 JensRex has quit IRC (Remote host closed the connection)
18:38 🔗 JensRex has joined #archiveteam-bs
18:51 🔗 icedice has quit IRC (Read error: Operation timed out)
19:07 🔗 sep332 has joined #archiveteam-bs
19:32 🔗 schbirid2 has joined #archiveteam-bs
19:36 🔗 schbirid has quit IRC (Read error: Operation timed out)
19:39 🔗 icedice has joined #archiveteam-bs
19:47 🔗 icedice has quit IRC (Quit: Leaving)
20:04 🔗 stringsho has joined #archiveteam-bs
20:19 🔗 icedice has joined #archiveteam-bs
20:29 🔗 schbirid2 has quit IRC (Quit: Leaving)
20:33 🔗 JAA So I'm running a discovery on the 20min manuals archive now. Assuming they don't ban me, I'll have a good estimate for the number of manuals tomorrow morning.
20:43 🔗 tuluu_ has quit IRC (Read error: Operation timed out)
20:45 🔗 Stilett0- has joined #archiveteam-bs
20:45 🔗 tuluu has joined #archiveteam-bs
20:47 🔗 JAA Actually, the discovery is already done.
20:48 🔗 JAA joepie91_, dd0a13f37: There are approximately 292742 manuals on that 20min site.
20:50 🔗 JAA This is from grabbing the sitemap and running zgrep -Po '(?<=href=")/manuel-[^/]+/[^/]+/[^/]+(?=")' 20min-manuals-sitemaps-00000.warc.gz | awk '!seen[$0]++ { count += 1 } END { printf "%d\n", count }'
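For readers less used to `zgrep` and `awk`, the pipeline above extracts every manual link from the compressed sitemap crawl and counts the distinct ones. A rough Python equivalent, as a sketch: the regular expression is copied from the command, the function names are hypothetical, and matching is done line by line as `grep` does.

```python
import gzip
import re

# Same pattern as in the zgrep command: a manual path inside an href attribute.
MANUAL_LINK = re.compile(r'(?<=href=")/manuel-[^/]+/[^/]+/[^/]+(?=")')


def count_unique_links(lines) -> int:
    """Count distinct manual paths across an iterable of text lines."""
    seen = set()
    for line in lines:
        seen.update(MANUAL_LINK.findall(line))
    return len(seen)


def count_in_gzip(path: str) -> int:
    """Run the count over a gzip-compressed file such as a .warc.gz."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_unique_links(handle)
```

The set plays the role of awk's `!seen[$0]++` idiom: each path is counted only the first time it appears.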
21:11 🔗 icedice has quit IRC (Ping timeout: 260 seconds)
21:17 🔗 zino has joined #archiveteam-bs
21:35 🔗 icedice has joined #archiveteam-bs
22:06 🔗 BartoCH has quit IRC (Quit: WeeChat 1.9)
22:10 🔗 drumstick has joined #archiveteam-bs
22:12 🔗 etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
22:15 🔗 Stilett0- has quit IRC (Ping timeout: 255 seconds)
22:19 🔗 icedice2 has joined #archiveteam-bs
22:22 🔗 icedice has quit IRC (Ping timeout: 255 seconds)
22:31 🔗 dashcloud has quit IRC (Remote host closed the connection)
22:33 🔗 godane i'm grabbing Eli the Computer Guy videos
22:34 🔗 godane i'm only doing the -f webm version cause thats how i started it
22:39 🔗 icedice2 has quit IRC (Quit: Leaving)
22:39 🔗 dashcloud has joined #archiveteam-bs
22:39 🔗 icedice has joined #archiveteam-bs
22:40 🔗 odemg true
22:40 🔗 odemg godane, see what I already uploaded though too
22:41 🔗 odemg godane, https://archive.org/details/@ohhdemgirls?and[]=subject%3A%22Youtube%22&and[]=creator%3A%22eli+the+computer+guy%22
22:42 🔗 godane ok then
22:43 🔗 godane i will finish the 2013-01 upload then stop
22:44 🔗 godane i stopped uploading it
22:44 🔗 godane that saved 8gb
22:45 🔗 godane turned out i had 4.6gb of webm for just 2013-01 so i say screw uploading that
22:46 🔗 odemg :3
22:47 🔗 godane just know there are at least 500+ videos
22:47 🔗 godane you only have 306
22:48 🔗 odemg true
23:02 🔗 dashcloud has quit IRC (Remote host closed the connection)
23:04 🔗 dashcloud has joined #archiveteam-bs
23:23 🔗 BlueMaxim has joined #archiveteam-bs
