#archiveteam-bs 2017-08-24,Thu


Time Nickname Message
00:26 🔗 Geekonoci has joined #archiveteam-bs
00:34 🔗 Geekonoci has quit IRC (Quit: Page closed)
00:46 🔗 hook54321 How should I organize that? A table? or just a plain list?
00:48 🔗 JAA What information do you have?
00:49 🔗 JAA If it's really just a list of account names plus maybe some comments/notes, then I'd say plain list (alphabetically sorted, I guess). But if you have additional data for many of the accounts, a table might be better.
00:58 🔗 BlueMaxim has joined #archiveteam-bs
01:02 🔗 hook54321 A lot of the time I discover it because people ask for it to be excluded, publicly on twitter...
01:11 🔗 hook54321 balrog, JAA
01:14 🔗 qw3rty116 has joined #archiveteam-bs
01:17 🔗 qw3rty115 has quit IRC (Read error: Operation timed out)
01:30 🔗 godane i'm at 1083k items
01:31 🔗 arkiver hook54321: nice
01:33 🔗 hook54321 arkiver: I'm supposed to work with you on OpenNIC stuff
01:33 🔗 arkiver I just read that yeah :P
01:34 🔗 Fusl o/
01:34 🔗 Aranje has quit IRC (Remote host closed the connection)
01:34 🔗 Fusl hook54321: thanks for taking care of and getting AT involved in this
01:34 🔗 hook54321 np
01:37 🔗 hook54321 I'm not sure what we're going to do about the .free tld. At this point it makes the most sense to just grab the .libre sites, since the .free sites were moved over there. However, I'm not sure if we should be worried about other webpages still using .free URLs.
01:38 🔗 hook54321 I also have no idea what we'll do if ICANN decides to create a TLD that's already an OpenNIC tld.
01:38 🔗 hook54321 again
01:39 🔗 Frogging There's nothing to do about it. The DNS has no provisions for conflicting roots
01:39 🔗 Frogging (does it?)
01:40 🔗 hook54321 In the case of .free, OpenNIC is basically just moving all of the .free domains over to .libre
01:42 🔗 hook54321 Personally, I'm hoping that's what they're planning to do if this happens again.
01:42 🔗 Frogging what else can be done?
01:44 🔗 hook54321 This is their position on it:
01:44 🔗 Somebody2 a subdomain? .free.opennic?
01:44 🔗 hook54321 "What is OpenNIC's relationship with the other alternative roots and ICANN?
01:44 🔗 hook54321 OpenNIC currently recognizes and peers all of the existing ICANN TLDs (.com, .uk, etc.). Therefore, if you configure your computer to resolve OpenNIC domains, you'll also be able to resolve all of the ICANN TLDs automatically.
01:44 🔗 hook54321 OpenNIC has not yet evaluated nor does it hold a formal position on the current/future ICANN TLDs."
01:45 🔗 hook54321 Somebody2: I don't think they would like that
01:45 🔗 Somebody2 hook54321: who?
01:45 🔗 Somebody2 ICANN or OpenNIC, or someone else?
01:45 🔗 hook54321 OpenNIC
01:46 🔗 Somebody2 why not? It would make it clear the source of the domains...?
01:46 🔗 joepie91 today in "wtf Google": https://twitter.com/joepie91/status/900534232296161284
01:46 🔗 Somebody2 they could even do the same with ICANN domains: .com.icann
01:46 🔗 hook54321 That would break ssl certificates though
01:47 🔗 Frogging joepie91: http://i.imgur.com/By95Lva.jpg
01:47 🔗 hook54321 and it kinda defeats the purpose of what they're trying to do
01:47 🔗 joepie91 lol
01:56 🔗 Somebody2 hook54321: between converting domains previously at .free into domains at .libre vs converting them into domains at .free.opennic -- I'm not sure why .libre is better... can you clarify?
01:57 🔗 Somebody2 and how does adding .opennic or .icann at the end of domains defeat what they are trying to do?
01:59 🔗 hook54321 because then it's a subdomain
02:00 🔗 Frogging example.com is a subdomain of .com
02:00 🔗 hook54321 .com is the TLD though
02:00 🔗 Frogging DNS doesn't distinguish
02:01 🔗 Frogging http://com/ is valid
02:01 🔗 Somebody2 DNS doesn't, but various protocols on top do in various subtle ways
02:01 🔗 Frogging and as a side note, I am now immensely confused at the result of visiting that URL
02:01 🔗 Somebody2 like various browsers do different things depending on whether a DNS segment is a toplevel or not
02:02 🔗 Frogging oh I see. someone has a sense of humour. ".XYZ is the next .COM. .XYZ is the #1 new domain in the world"
02:02 🔗 Somebody2 ok, so rather than .free.opennic, you could do horrible things by using a different delimiter, e.g. .free-opennic, or even .free;opennic
02:05 🔗 Somebody2 Frogging: oddly, neither dig nor curl are able to follow that redirect.
02:08 🔗 Frogging indeed, I think that one does not actually work, it was my browser adding .com to it automatically
02:08 🔗 Somebody2 Ah, this is because the browser automatically converts "http://com/" into a request for http://www.com.com/
02:09 🔗 Frogging try this one http://dk/
02:10 🔗 Somebody2 yes, that one works in dig and curl
02:10 🔗 Somebody2 301 redirecting to https://www.dk-hostmaster.dk/
02:10 🔗 Frogging yup
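
A minimal Python sketch of the point Frogging makes above: DNS itself only sees a sequence of labels, so `example.com` is a child of `com` in the same way that `foo.free.opennic` would be a child of `free.opennic`. The helper names (`labels`, `parent_zone`) are made up for illustration, and the sketch does no network lookups; it only shows the naming structure.

```python
def labels(hostname: str) -> list[str]:
    """Split a hostname into its DNS labels, ignoring one trailing root dot."""
    name = hostname[:-1] if hostname.endswith(".") else hostname
    return name.split(".") if name else []


def parent_zone(hostname: str) -> str:
    """Return the name one level up; a single-label name sits directly under the root."""
    parts = labels(hostname)
    return ".".join(parts[1:]) if len(parts) > 1 else "."


# A TLD is just a one-label name, which is why http://dk/ can resolve at all.
print(labels("dk"))                     # ['dk']
print(parent_zone("dk"))                # .
print(parent_zone("example.com"))       # com
print(parent_zone("foo.free.opennic"))  # free.opennic
```

As Somebody2 notes, the distinction between a TLD and any other label comes from software layered on top of DNS (browsers, certificate handling), not from this structure.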
02:34 🔗 Frogging I was thinking about that line ".XYZ is the next .COM" and it occurred to me that I've never in my life seen a legitimate .xyz website
02:34 🔗 Frogging so I went to one of the ones linked on the page
02:34 🔗 Frogging https://www.goinnovate.xyz/
02:35 🔗 Frogging this is... frightful.
02:35 🔗 Frogging "fog computing"
02:36 🔗 Frogging http://www.exponentials.xyz/posts/the-roles-of-cloud-computing-and-fog-computing-in-the-internet-of-things-revolut-6205388
02:53 🔗 drumstick has quit IRC (Ping timeout: 268 seconds)
03:00 🔗 drumstick has joined #archiveteam-bs
03:14 🔗 qw3rty117 has joined #archiveteam-bs
03:18 🔗 hook54321 I'm surprised that IA actually complied with this guy's request: https://twitter.com/darthodius/status/658731881626783745
03:20 🔗 qw3rty116 has quit IRC (Read error: Operation timed out)
03:41 🔗 Fletcher has quit IRC (Remote host closed the connection)
03:56 🔗 Fletcher has joined #archiveteam-bs
04:15 🔗 Sk1d has quit IRC (Ping timeout: 250 seconds)
04:16 🔗 hook54321 public.resource.org refers to the Internet Archive building as "the Church of the Internet Archive"
04:20 🔗 Asparagir Well, the bulding really was an old church once.
04:20 🔗 Asparagir *building
04:20 🔗 Asparagir It still has pews and all that.
04:22 🔗 Sk1d has joined #archiveteam-bs
04:38 🔗 hook54321 Asparagir: yeah, i know.
04:39 🔗 hook54321 I wonder if anyone has ever gone there in person and angrily demanded their site get removed from the wayback machine
04:58 🔗 Asparagir has quit IRC (Asparagir)
05:38 🔗 jrwr this guy is my history hero https://www.youtube.com/watch?v=ZqUm1YXTxNc
05:38 🔗 jrwr Steve1989MREInfo
06:12 🔗 Honno has joined #archiveteam-bs
06:15 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
06:17 🔗 RichardG has joined #archiveteam-bs
06:31 🔗 soja92 has joined #archiveteam-bs
07:08 🔗 kristian_ has joined #archiveteam-bs
07:44 🔗 drumstick has quit IRC (Read error: Operation timed out)
07:44 🔗 drumstick has joined #archiveteam-bs
08:07 🔗 tuluu has quit IRC (Ping timeout: 245 seconds)
08:17 🔗 tuluu has joined #archiveteam-bs
08:52 🔗 Boppen has quit IRC (Ping timeout: 194 seconds)
08:53 🔗 drumstick has quit IRC (Ping timeout: 268 seconds)
08:54 🔗 kristian_ has quit IRC (Read error: Operation timed out)
09:00 🔗 etudier has joined #archiveteam-bs
09:05 🔗 drumstick has joined #archiveteam-bs
09:15 🔗 Boppen has joined #archiveteam-bs
09:40 🔗 dashcloud has quit IRC (Read error: Connection reset by peer)
09:40 🔗 dashcloud has joined #archiveteam-bs
11:11 🔗 BlueMaxim has quit IRC (Read error: Connection reset by peer)
11:26 🔗 drumstick has quit IRC (Read error: Operation timed out)
12:15 🔗 brayden has quit IRC (Read error: Connection reset by peer)
12:17 🔗 brayden has joined #archiveteam-bs
12:17 🔗 swebb sets mode: +o brayden
12:49 🔗 kurt has quit IRC (Read error: Operation timed out)
13:44 🔗 SketchCow The Archive has had to deal with a lot of crazy walking right in and demanding things, yes.
13:44 🔗 SketchCow Two levels: The one you think of, someone coming in and demanding something related to content.
13:44 🔗 SketchCow The other: Since it looks like a church, church-related things, like homeless or people asking for sanctuary, etc.
13:44 🔗 SketchCow There are between 3-5 people who sleep at the Archive at night, since it looks like a church
13:45 🔗 SketchCow There was one who was across the street for years, Thomas. He died last October, and even the night he died, we'd brought over extra food from the celebration going on, which he had.
13:47 🔗 SketchCow http://richmondsfblog.com/2016/10/27/thomas-resident-homeless-man-at-funston-clement-passed-away-wednesday-night/
13:54 🔗 BartoCH has quit IRC (Quit: WeeChat 1.9)
14:08 🔗 vitzli has joined #archiveteam-bs
14:21 🔗 TheLovina has quit IRC (Quit: Leaving)
14:30 🔗 closure_ til
14:31 🔗 Mateon1 has quit IRC (Read error: Operation timed out)
14:31 🔗 Mateon1 has joined #archiveteam-bs
14:36 🔗 REiN^ has quit IRC (Max SendQ exceeded)
14:36 🔗 REiN^ has joined #archiveteam-bs
14:36 🔗 JAA hook54321: That's for pages which are explicitly blocked due to a direct request to IA, I assume?
14:37 🔗 hook54321 I'm not exactly sure yet.
14:38 🔗 hook54321 Whatever we make it out to be I guess
14:38 🔗 hook54321 Maybe both, idk
14:39 🔗 JAA I just think that a list of pages blocked by robots.txt is probably *massive*.
14:40 🔗 hook54321 yeah. on the other hand, it could be used for us to have some sort of database of what IA might not be crawling.
14:40 🔗 hook54321 *could be useful
16:48 🔗 vitzli has quit IRC (Quit: Leaving)
17:25 🔗 godane another page is missing for 01-01-2014
17:26 🔗 godane *2014-01-01
17:26 🔗 godane anyways it's page 24
17:35 🔗 Asparagir has joined #archiveteam-bs
18:05 🔗 namespace has joined #archiveteam-bs
18:06 🔗 namespace So I'm trying to OCR/type up/something this PDF so it can be put on a public website as an actually readable/searchable/etc historical archive: https://archive.org/details/8BBSArchiveP1V1
18:06 🔗 namespace (Also to generate more interest in it through sex appeal, because it's actually a fairly important bit of phreaker history.)
18:07 🔗 namespace And it is just the nastiest PDF out there (blurry monospace font, holepunched so that sometimes entire bits of text are missing, scuff marks, etc).
18:08 🔗 namespace The automatic archive.org OCR is actually better than what I could get using OCRFeeder: https://archive.org/stream/8BBSArchiveP1V1/8BBS_Archive_P1V1_djvu.txt
18:08 🔗 namespace Is there any way to do better, or?
18:08 🔗 schbirid has joined #archiveteam-bs
18:29 🔗 astrid wow that is nasty yeah
18:29 🔗 astrid i have a very flakey ocr program that works well on single-font fixed-width material
18:29 🔗 astrid https://github.com/chronomex/ess-ocr if you want to poke at it
18:30 🔗 astrid all the magic bits are hardcoded, so you'd have to hardcode new constants to describe the pages you want to ocr
18:30 🔗 astrid and that sort of thing
18:34 🔗 namespace Yeah here's the strategies I've thought up so far:
18:35 🔗 namespace - Retype the whole thing (would prefer not to, takes TON of time, solitary)
18:35 🔗 astrid i mean, this is the sort of thing that i wrote my ocr program for
18:35 🔗 astrid this will ocr quite well
18:35 🔗 namespace - Highlight all the actual posts and then extract them as images, put up on a wiki and let other people help.
18:35 🔗 namespace - OCR it, which yeah I'll try that thing you just posted thanks.
18:35 🔗 astrid it's extremely fragile but this is fixed width enough to work super well
18:36 🔗 astrid i am planning to do a total rewrite of the tool to make it useful, but for now you will be able to get it to work
18:36 🔗 astrid watch out, tool is designed to work with pages that have a black frame around them, so you might need to remove some code about that
18:36 🔗 astrid (it used the black frame for fiducial registration)
18:37 🔗 namespace Eheheh.
18:37 🔗 astrid crop_to_rect is the function to comment out calls to
18:37 🔗 atrocity where's the PDF you're trying to OCR?
18:37 🔗 astrid https://archive.org/stream/8BBSArchiveP1V1/8BBS_Archive_P1V1#page/n3/mode/1up
18:39 🔗 atrocity i have pdf ocr software at work, i'll try running it on it, lol
18:41 🔗 atrocity downloading at like 300KiB/s, ugh
18:41 🔗 namespace wow lol
18:44 🔗 namespace astrid: So stupid question, how do I compile this?
18:44 🔗 namespace There's two files, no makefile, do I just compile them both separately and then execute one?
18:46 🔗 astrid they both have a comment at the beginning saying the magic compiler spells to cast
18:46 🔗 astrid deskew comes from the leptonica library
18:46 🔗 namespace Ah, k.
18:46 🔗 namespace Thanks.
18:46 🔗 astrid so you might need to set that up
18:46 🔗 astrid also you need to feed it 'pgm' images
18:47 🔗 namespace Any dependencies? ^^;;
18:47 🔗 astrid ess-ocr needs you to create a directory 'training' or it'll core
18:47 🔗 astrid yeah libnetpnm among others
18:47 🔗 astrid as i said
18:47 🔗 astrid very rough :)
18:49 🔗 namespace Yeah I'm smelling a goose chase, thanks anyway. :p
18:49 🔗 astrid once you get it running, call it over one of your images
18:49 🔗 namespace Yes see that first bit.
18:49 🔗 namespace Is the bit that is very unlikely to happen.
18:50 🔗 astrid it'll write out 'crop.pgm' which should be cropped to include the content and include the grid it's using for segmentation
18:50 🔗 astrid oh :(
18:50 🔗 astrid ok
18:50 🔗 namespace This needs like, a readme.txt
18:50 🔗 astrid well my plan for it involves a rewrite, because proper monospace ocr is a thing the world needs
18:50 🔗 astrid desperately
18:50 🔗 namespace Otherwise I'll just be asking you questions all day.
18:50 🔗 astrid as we all know
18:50 🔗 namespace Yes, yes it does.
18:50 🔗 astrid yeah :|
18:50 🔗 namespace Here's my advice.
18:50 🔗 namespace If you want to make this usable to others.
18:50 🔗 namespace Make a debian/ubuntu/etc vm.
18:51 🔗 namespace And set the thing up from scratch, and write down each step as you do.
18:51 🔗 namespace That's your readme.txt
18:51 🔗 astrid it's not currently intended to be useful, it's a quick hack
18:51 🔗 astrid yeah
18:51 🔗 namespace It's how I do all my setup readme files and they always come out excellent, whereas people who just write them from memory always end up skipping steps and stuff.
18:51 🔗 astrid "if you want to do monospace ocr, here's a very fiddly thing that gets good results"
18:52 🔗 astrid i usually wind up with 2 or 3 misrecognized characters per page at most
18:52 🔗 namespace Yeah that would be incredible.
18:52 🔗 astrid it doesn't assume it knows what letters look like
18:52 🔗 astrid instead it's like "hm, idk what this is, hey user, what is it?" and you say "that's a B, and you can use it as an example of other Bs"
18:53 🔗 astrid so it builds training data as you go
18:53 🔗 astrid but i'm planning a rewrite with maintainability and also smartness
18:53 🔗 astrid next version will lay down a grid over the page that can be skewed and bent, so that photographs of monospace text can be recognized f.ex
18:54 🔗 astrid and it'll be much much less manual fiddly process
18:54 🔗 namespace Yes well, the next version is in the indeterminate future and I'd like this to be on the net now. :P
18:54 🔗 astrid because i have about 500 pages of photographs of typewritten text that i'd like to get into readable form
18:54 🔗 astrid yes i know :
18:54 🔗 astrid :)
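
A rough Python sketch of the approach astrid describes above: cut a fixed-width page into a grid of character cells, then label each cell by comparing it to examples the user has already named, asking the user only when nothing matches. This is not the code from ess-ocr; every name here (`split_grid`, `distance`, `recognize`, `max_distance`) is hypothetical, and pages are assumed to be already cropped, deskewed, and binarized into rows of 0/1 pixels.

```python
from typing import Callable

Cell = tuple[tuple[int, ...], ...]  # one character cell: rows of 0/1 pixels


def split_grid(page: list[list[int]], cell_w: int, cell_h: int) -> list[list[Cell]]:
    """Cut a 2D pixel array into fixed-size cells, row by row (monospace assumption)."""
    rows = len(page) // cell_h
    cols = len(page[0]) // cell_w
    return [
        [
            tuple(
                tuple(page[r * cell_h + y][c * cell_w:(c + 1) * cell_w])
                for y in range(cell_h)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def distance(a: Cell, b: Cell) -> int:
    """Count the pixels that differ between two equally sized cells."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(
    cell: Cell,
    training: list[tuple[Cell, str]],
    ask: Callable[[Cell], str],
    max_distance: int = 0,
) -> str:
    """Return the label of the closest known example, or ask the user and remember the answer."""
    best = min(training, key=lambda t: distance(cell, t[0]), default=None)
    if best is not None and distance(cell, best[0]) <= max_distance:
        return best[1]
    label = ask(cell)               # e.g. show the cell and read a character from the user
    training.append((cell, label))  # the training set grows as pages are processed
    return label
```

A real tool would pass something interactive as `ask` and tune `max_distance` to tolerate scan noise; the pixel-difference metric is only the simplest possible stand-in for whatever matching a real implementation uses.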
18:58 🔗 atrocity i have my work OCR running, it's going VERY slowly, lol
18:59 🔗 namespace Unsurprising. XD
19:01 🔗 atrocity i feel like i should've just run it on the first 10 pages or something instead of all 700, lol
19:02 🔗 godane so now page 36 and 37 of 2014-04-30 of east bay express is giving me 404
19:02 🔗 HarryCros has quit IRC (Read error: Connection reset by peer)
19:03 🔗 HarryCros has joined #archiveteam-bs
19:06 🔗 godane ok looks like there was a middle booklet for 'Bike to work day'
19:06 🔗 godane on may 8 that year
19:06 🔗 godane after that it continues from page 28
19:07 🔗 godane so it's really page 8 and 9 of the 'bike to work day' booklet
19:09 🔗 HarryCros has quit IRC (Remote host closed the connection)
19:09 🔗 HarryCros has joined #archiveteam-bs
19:10 🔗 godane so there are pages 31 to 33 missing from 2014-06-04
19:11 🔗 atrocity http://meddl.com/temp/ocrtest.txt
19:11 🔗 atrocity that's what my work OCR app did, lol
19:11 🔗 atrocity the first 10 pages at least
19:11 🔗 HarryCros has quit IRC (Remote host closed the connection)
19:12 🔗 namespace Not that bad, tbh. But only of about comparable quality to the archive.org OCR scan.
19:13 🔗 atrocity yeah, lol
19:13 🔗 atrocity doesn't help that there's holes all through it
19:13 🔗 namespace It really *really* doesn't.
19:13 🔗 namespace Like I said, nasty.
19:15 🔗 atrocity lol
19:26 🔗 C4K3_ has quit IRC (leaving)
19:30 🔗 C4K3 has joined #archiveteam-bs
19:36 🔗 kristian_ has joined #archiveteam-bs
19:40 🔗 odemg has quit IRC (Read error: Operation timed out)
19:55 🔗 odemg has joined #archiveteam-bs
20:41 🔗 bitBaron has joined #archiveteam-bs
20:41 🔗 bitBaron has quit IRC (Client Quit)
20:48 🔗 schbirid yeah, why not move the "WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD" bs to a specific channel?
20:48 🔗 schbirid if the main channel is too holy to even discuss itself
20:51 🔗 Sanqui I would not give password to this person >> <coltaine_> and I give you sucky sucky
20:52 🔗 schbirid Sanqui Sanqui?
21:02 🔗 atrocity $5
21:07 🔗 astrid i'm with Sanqui here
21:09 🔗 Honno has quit IRC (Read error: Operation timed out)
21:12 🔗 HarryCros has joined #archiveteam-bs
21:20 🔗 Pudsey has joined #archiveteam-bs
21:21 🔗 HarryCros has quit IRC (Remote host closed the connection)
21:21 🔗 kristian_ has quit IRC (Ping timeout: 370 seconds)
21:24 🔗 HarryCros has joined #archiveteam-bs
21:25 🔗 HarryCros has quit IRC (Read error: Connection reset by peer)
21:25 🔗 HarryCros has joined #archiveteam-bs
21:38 🔗 Pudsey has quit IRC (Remote host closed the connection)
21:40 🔗 ZexaronS has joined #archiveteam-bs
21:42 🔗 kristian_ has joined #archiveteam-bs
21:50 🔗 REiN^ has quit IRC (Max SendQ exceeded)
21:51 🔗 REiN^ has joined #archiveteam-bs
21:51 🔗 REiN^ has quit IRC (Max SendQ exceeded)
21:52 🔗 REiN^ has joined #archiveteam-bs
21:55 🔗 drumstick has joined #archiveteam-bs
21:56 🔗 joepie91 I'm not sure that this is what was meant with "steal from work": https://twitter.com/SeamusHughes/status/900790149017219073
21:56 🔗 joepie91 :p
21:59 🔗 godane a guy put this up on r/opendirectories : http://95.211.186.214/Incoming/
22:00 🔗 joepie91 ohh, lots of oldish stuff there
22:00 🔗 joepie91 ha
22:00 🔗 joepie91 yeah I'm pretty sure this guy went to... 33C3?
22:00 🔗 godane i can't grab it but someone here would want it
22:00 🔗 joepie91 fairly certain that the Leaks directory came off one of the FTPs there
22:01 🔗 joepie91 godane: definitely, thanks :P
22:01 🔗 godane i got it from here: https://www.reddit.com/r/opendirectories/comments/6vri47/large_dj_sets_directory/
22:01 🔗 godane i know that SketchCow is looking for older dj sets
22:02 🔗 joepie91 godane: oh, is he? any particular type?
22:02 🔗 joepie91 I may have a pile of my own
22:02 🔗 joepie91 that is, sets from a specific internet radio channel and some of their live events
22:02 🔗 joepie91 (afterhoursdjs.org, but it's stuff that isn't yet in the collection on IA)
22:02 🔗 godane he is doing the hip hop mixtapes collection
22:03 🔗 joepie91 ah yeah, this is def not hip hop :p
22:03 🔗 godane he will take it
22:05 🔗 joepie91 this is what I have laying around: https://gist.githubusercontent.com/joepie91/1afb987f86a2b417c33a61a35a6c0f29/raw/84a200d991cf846e79133b78e046558b8104a822/gistfile1.txt (cc SketchCow - let me know if you want me to ship them to FOS or such)
22:05 🔗 joepie91 haven't gotten around to sorting out the metadata yet
22:06 🔗 joepie91 goes back to 2002 in some places :P
22:28 🔗 godane uploaded : https://archive.org/details/forum.kingsnake.com-1997-to-2003-archives-20161203:
22:28 🔗 godane *uploaded : https://archive.org/details/forum.kingsnake.com-1997-to-2003-archives-20161203
22:52 🔗 Stiletto has quit IRC (Read error: Operation timed out)
23:28 🔗 kristian_ has quit IRC (Quit: Leaving)
23:38 🔗 SketchCow That DJ directory looks like crap.
23:40 🔗 BlueMaxim has joined #archiveteam-bs
23:52 🔗 Stilett0 has joined #archiveteam-bs
23:59 🔗 TC01 has quit IRC (Remote host closed the connection)
